Hooking NDI up to MLX-VLM and having it describe what's on screen in real time

Having gotten to the point where I could play with NDI, I couldn't sit still; with the adrenaline still pumping, I started scheming about how to mash NDI and AI together somehow.

I thought about setting it up on Ubuntu, but the NDI library's Python bindings turned out to be flaky and I couldn't get them working, and besides, Ubuntu plus a GPU is just too expensive. After wondering whether there wasn't some world where a cheaper, scrappier machine setup would do, I settled on MLX on a MacBook. MLX is handy.

And there is a tool called MLX-VLM that does exactly what the name says: run VLMs (Vision Language Models) on Apple Silicon. Except the official sample code doesn't run!

Lately this sort of thing is an everyday occurrence, and anyone whining that it doesn't work after merely reading the manual is no hacker. Read the source.

So I read the source and worked out the right way to run it.

import mlx.core as mx
from mlx_vlm.utils import generate, get_model_path, load, load_config, load_image_processor
from mlx_vlm.prompt_utils import apply_chat_template

# Specify the required parameters and call generate()
model_path = 'qnguyen3/nanoLLaVA'
config = load_config(model_path)
model, processor = load(model_path, {"trust_remote_code": True})
image_processor = load_image_processor(model_path)
prompt = "What are they doing?where are they?what are they wearing?"
prompt = apply_chat_template(processor, config, prompt)
res = generate(
    model=model,
    max_tokens=100,
    temp=0.0,
    prompt=prompt,
    processor=processor,
    image_processor=image_processor,
    image="https://assets.st-note.com/production/uploads/images/154833755/rectangle_large_type_2_c7a13fbdd99eee6bf2407f2ccdcd9382.png"
)
print(res)
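
For the record, mlx-vlm itself comes from PyPI, so pip install mlx-vlm should be all you need before running the snippet above; the package evolves quickly, though, so argument names may shift between releases.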

Feed this the input from NDI and it should give results in real time. The VLM wants its images passed as a URL or a file, which is a hassle, so I just cram them in as Base64.
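
As a minimal sketch of that trick (the full script further down uses essentially the same helper; frame_to_data_uri here is just an illustrative name), turning a NumPy frame into a Base64 data URI and handing it to generate() looks roughly like this:

import io
import base64
import numpy as np
from PIL import Image

def frame_to_data_uri(frame: np.ndarray) -> str:
    # Encode the raw frame as PNG into an in-memory buffer
    image = Image.fromarray(np.uint8(frame))
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    # Base64-encode the PNG bytes and wrap them in a data URI
    b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{b64}"

# The data URI then goes wherever a URL or file path would go:
# res = generate(model=model, max_tokens=100, temp=0.0, prompt=prompt,
#                processor=processor, image_processor=image_processor,
#                image=frame_to_data_uri(frame))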

The result: a tool that narrates, in real time, whatever I'm watching on YouTube over NDI. Viva AI!

It runs pretty fast.

Of course, being that fast it's also a bit dumb, but that trade-off can be tuned one way or another (in theory).

Incidentally, when I switched to mlx-community/nanoLLaVA-1.5-8bit, it produced a lot more output. Handy.
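
Concretely, the only change from the earlier snippet is the model path; the rest of the loading and generate() code stays the same:

model_path = 'mlx-community/nanoLLaVA-1.5-8bit'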

When I pointed it at TV news streaming on YouTube via NDI, it generated passages like the following. I'll let you guess what was on screen.

The gist of the first one: judging from the image alone, based on the attire and the setting, the participants appear to be attending a formal event, probably a political or governmental gathering. They are dressed formally, which suggests the formality and significance of the occasion. The presence of a national flag and a formal dress code implies that the event is governmental or political in nature. The exact nature of the event, such as what kind of meeting it is or what exactly the participants are doing, is unclear.

The raw outputs:
Based on the image alone, it appears that the individuals are engaged in a formal event, possibly a political or governmental gathering, as indicated by the attire and the setting. They are dressed in formal attire, which suggests a level of formality and significance to the occasion. The presence of the flag and the formal dress code implies that the event is of a governmental or political nature. The exact nature of the event, such as the nature of the meeting or the specific activities of the individuals, cannot be
The individuals in the image are engaged in a social activity, likely a gathering or a meeting. They are standing in a group, facing each other, and appear to be conversing or participating in a discussion. The attire of the individuals varies, with some wearing casual clothing and others in more formal attire. The specific details of their clothing are not clear from the image, but they are dressed in a manner that suggests a casual or semi-formal event. The setting appears to be a public space,
The individuals in the image are engaged in a social activity, likely a gathering or a celebration. They are standing in a group, facing each other, and appear to be interacting with each other. The attire of the individuals varies, with some wearing dresses and others in casual clothing. The specific details of their clothing are not clear due to the resolution and angle of the image. The setting is a public space, possibly a park or a street, given the presence of benches and the open-air environment.
The individuals in the image are engaged in a social gathering, likely a casual meet-up or a casual event. They are standing in a group, possibly conversing or enjoying each other's company. The attire of the individuals varies, with some wearing casual clothing and others in more formal attire. The specific details of their clothing are not clearly discernible from the image. The setting appears to be a public space, possibly a park or a street, given the presence of benches and the open space around them
The individuals in the image are engaged in a social activity, likely a gathering or a celebration. They are standing in a group, facing each other, and appear to be interacting with each other. The attire of the individuals varies, with some wearing dresses and others in casual clothing. The specific details of their clothing are not clear due to the resolution and angle of the image. The setting is a public space, possibly a park or a street, given the presence of benches and the open-air environment.
The individuals in the image are engaged in what appears to be a public gathering or event, possibly a market or a public square. They are standing in a line, facing the camera, and are dressed in casual, everyday clothing. The clothing style suggests a relaxed, informal setting, possibly a street market or a public gathering where people are dressed in comfortable, practical attire. The presence of the police car indicates that the event is of some significance, possibly a public safety or security-related event.

The actual source code is below (the finder, receiver, and lib imports are the NDI helper modules from the repository linked at the end).

import finder
import receiver
import lib
# Other imports
import imutils
import cv2
import numpy as np
from PIL import Image
import io
import base64

def ndarray_to_base64_img_tag(image_array):
    # Convert the NumPy ndarray to a PIL image
    image = Image.fromarray(np.uint8(image_array))

    # Encode the image as PNG into an in-memory buffer
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")

    # Base64-encode the PNG bytes
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")

    # Wrap it in a data URI so it can be passed wherever a URL is expected
    img_data_uri = f"data:image/png;base64,{img_str}"

    return img_data_uri

import codecs
import mlx.core as mx
from mlx_vlm.utils import generate, get_model_path, load, load_config, load_image_processor
from mlx_vlm.prompt_utils import apply_chat_template

model_path='mlx-community/nanoLLaVA-1.5-8bit'
config = load_config(model_path)
model, processor = load(model_path, {"trust_remote_code": True})
image_processor = load_image_processor(model_path)
prompt = "What are they doing?where are they?what are they wearing?"
# Decode backslash escape sequences in the prompt (a no-op for this plain ASCII string)
prompt = codecs.decode(prompt, "unicode_escape")
prompt = apply_chat_template(processor, config, prompt)
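# The earlier one-shot URL test, kept here commented out for reference: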
"""
res=generate(
    model=model,
    max_tokens=100,
    temp=0.0,
	prompt=prompt,
	processor=processor,
	image_processor=image_processor,
	image="https://assets.st-note.com/production/uploads/images/154833755/rectangle_large_type_2_c7a13fbdd99eee6bf2407f2ccdcd9382.png"
)
print(res)
"""

find = finder.create_ndi_finder()
NDIsources = find.get_sources()

recieveSource = None; 

# If there is one or more sources then list the names of all source.
# If only 1 source is detected, then automatically connect to that source.
# If more than 1 source detected, then list all sources detected and allow user to choose source.
if(len(NDIsources) > 0):
	print(str(len(NDIsources)) + " NDI Sources Detected")
	for x in range(len(NDIsources)):
		print(str(x) + ". "+NDIsources[x].name + " @ "+str(NDIsources[x].address))
	if(len(NDIsources) == 1):
		#If only one source, connect to that source
		recieveSource = NDIsources[0]
		print("Automatically Connecting To Source...")
	else:
		awaitUserInput = True;
		while(awaitUserInput):
			print("")
			try:
				key = int(input("Please choose a NDI Source Number to connect to:"))
				if(key < len(NDIsources) and key >= 0):
					awaitUserInput = False
					recieveSource = NDIsources[key]
				else:
					print("Input Not A Number OR Number not in NDI Range. Please pick a number between 0 and "+ str(len(NDIsources)-1))		
			except:
				print("Input Not A Number OR Number not in NDI Range. Please pick a number between 0 and "+ str(len(NDIsources)-1))
		
		#If more than one source, ask user which NDI source they want to use		
else:
    print("No NDI Sources Detected - Please Try Again")
    # Without a source there is nothing to receive from, so stop here instead of falling through
    raise SystemExit(1)



print("Width Resized To 500px. Not Actual Source Size")
reciever = receiver.create_receiver(recieveSource)

while(1):
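	# Grab the latest frame from the NDI receiver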
	frame = reciever.read()
	size = [str(frame.shape[0]),str(frame.shape[1]), frame.shape[2]]
	frame = imutils.resize(frame, width=500)

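	# Convert the frame to a Base64 data URI and have the VLM describe what it sees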
	data=ndarray_to_base64_img_tag(frame)
	res=generate(
		model=model,
		max_tokens=100,
		temp=0.0,
		prompt=prompt,
		processor=processor,
		image_processor=image_processor,
		image=data
	)
	print(res)

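	# Overlay the source name, resolution, and channel mode on the preview window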
	cv2.putText(frame, recieveSource.name,(0,15),cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
	cv2.putText(frame, "Size:"+ size[1] + "x" +size[0],(0,35),cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
	mode = ""
	if(size[2] == 4):
		mode = "RGB Alpha"
	if(size[2] == 3):
		mode = "RGB"	
	cv2.putText(frame, "Mode:"+ mode,(0,55),cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
	
	cv2.imshow("image", frame)
	k = cv2.waitKey(30) & 0xff
	if k == 27:
		break
  
print("User Quit")
cv2.destroyAllWindows()

The repository is here:
https://github.com/shi3z/pyNDI-multi