Connecting NDI to MLX-VLM and having it describe what's on screen in real time
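Now that I could play with NDI, I couldn't sit still, and with the adrenaline still pumping I started wondering whether I could somehow mash NDI and AI together.
I considered setting this up on Ubuntu, but the Python bindings for the NDI library were flaky and I couldn't get them to work. On top of that, Ubuntu plus a GPU is too expensive, so I went looking for a world where a cheaper machine would do, and ended up using MLX on a MacBook. MLX is handy.
There is a tool called MLX-VLM which does exactly what the name says: it runs VLMs (vision-language models) on Apple Silicon. But the official sample code doesn't work!
This kind of thing is an everyday occurrence lately, and anyone who whines that something doesn't work after only reading the manual is no hacker. Read the source.
So I read the source and figured out the right way to run it.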
import mlx.core as mx
from mlx_vlm.utils import generate, get_model_path, load, load_config, load_image_processor
from mlx_vlm.prompt_utils import apply_chat_template
# Call the functions with the required parameters
model_path='qnguyen3/nanoLLaVA'
config = load_config(model_path)
model, processor = load(model_path, {"trust_remote_code": True})
image_processor = load_image_processor(model_path)
prompt = "What are they doing?where are they?what are they wearing?"
prompt = apply_chat_template(processor, config, prompt)
res = generate(
    model=model,
    max_tokens=100,
    temp=0.0,
    prompt=prompt,
    processor=processor,
    image_processor=image_processor,
    image="https://assets.st-note.com/production/uploads/images/154833755/rectangle_large_type_2_c7a13fbdd99eee6bf2407f2ccdcd9382.png"
)
print(res)
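Feed NDI input into this and you should get results in real time. The VLM wants its image as a URL or a file path, which is a hassle, so I brute-force it by passing each frame as a Base64 string.
The result: a tool that explains, in real time, the YouTube video I'm watching over NDI. Viva AI!
Of course, it's fast at the price of being a bit dumb, but that trade-off can be tuned one way or another (or so I expect).
Incidentally, switching to mlx-community/nanoLLaVA-1.5-8bit produced a lot more output. Handy.
When I showed it a TV news broadcast streaming on YouTube via NDI, it generated the following text. I'll leave the content to your imagination.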
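To make the Base64 trick concrete, here is a minimal standard-library sketch of what a `data:` URI looks like and how it can be taken apart again. The helper names `to_data_uri` and `from_data_uri` are hypothetical and exist only for this illustration. The payload is just the 8-byte PNG signature standing in for a real encoded frame.

```python
import base64


def to_data_uri(payload: bytes, mime: str = "image/png") -> str:
    """Wrap already-encoded image bytes (e.g. a PNG file's contents) in a data URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def from_data_uri(uri: str) -> tuple[str, bytes]:
    """Split a base64 data URI back into (mime type, raw bytes)."""
    header, _, encoded = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(encoded)


# The first 8 bytes of every PNG file, used here as a stand-in payload.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
print(uri)
print(from_data_uri(uri))
```

A real frame would first have to be serialized to PNG (or JPEG) bytes. Those bytes then take the place of `png_signature`.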
Based on the image alone, it appears that the individuals are engaged in a formal event, possibly a political or governmental gathering, as indicated by the attire and the setting. They are dressed in formal attire, which suggests a level of formality and significance to the occasion. The presence of the flag and the formal dress code implies that the event is of a governmental or political nature. The exact nature of the event, such as the nature of the meeting or the specific activities of the individuals, cannot be
The individuals in the image are engaged in a social activity, likely a gathering or a meeting. They are standing in a group, facing each other, and appear to be conversing or participating in a discussion. The attire of the individuals varies, with some wearing casual clothing and others in more formal attire. The specific details of their clothing are not clear from the image, but they are dressed in a manner that suggests a casual or semi-formal event. The setting appears to be a public space,
The individuals in the image are engaged in a social activity, likely a gathering or a celebration. They are standing in a group, facing each other, and appear to be interacting with each other. The attire of the individuals varies, with some wearing dresses and others in casual clothing. The specific details of their clothing are not clear due to the resolution and angle of the image. The setting is a public space, possibly a park or a street, given the presence of benches and the open-air environment.
The individuals in the image are engaged in a social gathering, likely a casual meet-up or a casual event. They are standing in a group, possibly conversing or enjoying each other's company. The attire of the individuals varies, with some wearing casual clothing and others in more formal attire. The specific details of their clothing are not clearly discernible from the image. The setting appears to be a public space, possibly a park or a street, given the presence of benches and the open space around them
The individuals in the image are engaged in a social activity, likely a gathering or a celebration. They are standing in a group, facing each other, and appear to be interacting with each other. The attire of the individuals varies, with some wearing dresses and others in casual clothing. The specific details of their clothing are not clear due to the resolution and angle of the image. The setting is a public space, possibly a park or a street, given the presence of benches and the open-air environment.
The individuals in the image are engaged in what appears to be a public gathering or event, possibly a market or a public square. They are standing in a line, facing the camera, and are dressed in casual, everyday clothing. The clothing style suggests a relaxed, informal setting, possibly a street market or a public gathering where people are dressed in comfortable, practical attire. The presence of the police car indicates that the event is of some significance, possibly a public safety or security-related event.
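Several of the outputs above are identical or nearly so, because consecutive frames barely change. One possible way to quiet that down (not part of the original tool) is to print a caption only when it differs enough from the last one printed. The sketch below uses `difflib.SequenceMatcher` from the standard library. The class name `CaptionFilter` and the 0.9 threshold are assumptions picked for illustration.

```python
from difflib import SequenceMatcher


class CaptionFilter:
    """Remember the last accepted caption and reject ones that are nearly the same."""

    def __init__(self, threshold: float = 0.9):
        # Similarity ratio (0.0 to 1.0) at or above which a caption counts as a repeat.
        self.threshold = threshold
        self.last = None

    def is_new(self, caption: str) -> bool:
        if self.last is not None:
            ratio = SequenceMatcher(None, self.last, caption).ratio()
            if ratio >= self.threshold:
                return False
        self.last = caption
        return True


caption_filter = CaptionFilter()
a = "They are standing in a group, facing each other."
for caption in (a, a, "A police car is parked."):
    if caption_filter.is_new(caption):
        print(caption)
```

In the capture loop, the `print(res)` call could be guarded by `is_new(res)` in the same way.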
The actual source code is below.
import finder
import receiver
import lib
#Other Import
import imutils
import cv2
import numpy as np
from PIL import Image
import io
import base64
def ndarray_to_base64_img_tag(image_array):
    # Despite the name, this returns a data URI string, not an <img> tag
    # Convert the NumPy ndarray to a PIL image
    image = Image.fromarray(np.uint8(image_array))
    # Serialize the image into an in-memory buffer (BytesIO)
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")  # save as PNG
    # Base64-encode the binary data
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    # Build the image data URI
    img_data_uri = f"data:image/png;base64,{img_str}"
    return img_data_uri
import codecs
import mlx.core as mx
from mlx_vlm.utils import generate, get_model_path, load, load_config, load_image_processor
from mlx_vlm.prompt_utils import apply_chat_template
model_path='mlx-community/nanoLLaVA-1.5-8bit'
config = load_config(model_path)
model, processor = load(model_path, {"trust_remote_code": True})
image_processor = load_image_processor(model_path)
prompt = "What are they doing?where are they?what are they wearing?"
prompt = codecs.decode(prompt, "unicode_escape")
prompt = apply_chat_template(processor, config, prompt)
"""
res = generate(
    model=model,
    max_tokens=100,
    temp=0.0,
    prompt=prompt,
    processor=processor,
    image_processor=image_processor,
    image="https://assets.st-note.com/production/uploads/images/154833755/rectangle_large_type_2_c7a13fbdd99eee6bf2407f2ccdcd9382.png"
)
print(res)
"""
find = finder.create_ndi_finder()
NDIsources = find.get_sources()
recieveSource = None

# If there is one or more sources then list the names of all sources.
# If only 1 source is detected, then automatically connect to that source.
# If more than 1 source detected, then list all sources detected and allow user to choose source.
if len(NDIsources) > 0:
    print(str(len(NDIsources)) + " NDI Sources Detected")
    for x in range(len(NDIsources)):
        print(str(x) + ". " + NDIsources[x].name + " @ " + str(NDIsources[x].address))
    if len(NDIsources) == 1:
        # If only one source, connect to that source
        recieveSource = NDIsources[0]
        print("Automatically Connecting To Source...")
    else:
        # If more than one source, ask user which NDI source they want to use
        awaitUserInput = True
        while awaitUserInput:
            print("")
            try:
                key = int(input("Please choose a NDI Source Number to connect to:"))
                if 0 <= key < len(NDIsources):
                    awaitUserInput = False
                    recieveSource = NDIsources[key]
                else:
                    print("Input Not A Number OR Number not in NDI Range. Please pick a number between 0 and " + str(len(NDIsources) - 1))
            except ValueError:
                print("Input Not A Number OR Number not in NDI Range. Please pick a number between 0 and " + str(len(NDIsources) - 1))
else:
    print("No NDI Sources Detected - Please Try Again")
    # Without a source there is nothing to receive, so stop here
    raise SystemExit(1)

print("Width Resized To 500px. Not Actual Source Size")
reciever = receiver.create_receiver(recieveSource)
while True:
    frame = reciever.read()
    # Record the original size before resizing: [height, width, channels]
    size = [str(frame.shape[0]), str(frame.shape[1]), frame.shape[2]]
    frame = imutils.resize(frame, width=500)

    # Describe the current frame with the VLM (blocks until generation finishes)
    data = ndarray_to_base64_img_tag(frame)
    res = generate(
        model=model,
        max_tokens=100,
        temp=0.0,
        prompt=prompt,
        processor=processor,
        image_processor=image_processor,
        image=data
    )
    print(res)

    # Overlay the source name, original size, and color mode on the preview
    cv2.putText(frame, recieveSource.name, (0, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
    cv2.putText(frame, "Size:" + size[1] + "x" + size[0], (0, 35), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    mode = ""
    if size[2] == 4:
        mode = "RGB Alpha"
    if size[2] == 3:
        mode = "RGB"
    cv2.putText(frame, "Mode:" + mode, (0, 55), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

    cv2.imshow("image", frame)
    k = cv2.waitKey(30) & 0xff
    if k == 27:  # Esc quits
        break

print("User Quit")
cv2.destroyAllWindows()
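Because the loop above calls `generate` on every frame, the preview window can only refresh as fast as the model finishes a caption. One way to loosen that coupling, offered as a sketch and not as something the original code does, is to caption at most once per interval and let the other frames go straight to the preview. The `Throttle` class and its `clock` parameter are hypothetical names, and the fake clock exists only so the example runs deterministically.

```python
import time


class Throttle:
    """Report readiness at most once every `min_interval` seconds."""

    def __init__(self, min_interval: float, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the behaviour can be checked without sleeping
        self._last = None

    def ready(self) -> bool:
        now = self.clock()
        if self._last is None or now - self._last >= self.min_interval:
            self._last = now
            return True
        return False


# Simulate four frames arriving at t = 0.0, 0.5, 2.0 and 2.1 seconds.
fake_times = iter([0.0, 0.5, 2.0, 2.1])
throttle = Throttle(2.0, clock=lambda: next(fake_times))
decisions = [throttle.ready() for _ in range(4)]
print(decisions)
```

In the capture loop this would become `if throttle.ready(): res = generate(...)`, with the `cv2.imshow` call left outside the `if`. The preview would still freeze for the length of each caption, since `generate` runs on the same thread.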