Trying Llama3.2-11B-Vision-Instruct on WSL2

September 28, 2024, 00:41

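Llama 3.2-Vision is described as a set of "pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out)" that are "optimized for visual recognition, image reasoning, captioning, and answering general questions about an image." In this post I try out the 11B model.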

The PC I am using is a GALLERIA UL9C-R49 from Dospara. Its specs are:
・CPU: Intel® Core™ i9-13900HX Processor
・Mem: 64 GB
・GPU: NVIDIA® GeForce RTX™ 4090 Laptop GPU (16GB)
・GPU: NVIDIA® GeForce RTX™ 4090 (24GB)
・OS: Ubuntu 22.04 on WSL2 (Windows 11)

1. Preparation

Building the virtual environment

python3 -m venv trans
cd $_
source bin/activate

Next, install the packages.

pip install torch transformers accelerate pillow
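The script later in this post imports torch, transformers, PIL, and requests, and device_map="auto" relies on accelerate. As a quick sanity check before loading an 11B model, a minimal sketch like the following reports which of those modules cannot be found in the active virtual environment. The helper name missing_modules is hypothetical and uses only the standard library.

```python
import importlib.util

# Top-level module names used by the script in this post.
# Note: the Pillow package is imported under the name "PIL".
REQUIRED = ("torch", "transformers", "accelerate", "PIL", "requests")


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every module was found.
print("missing:", missing_modules())
```

If PIL shows up as missing, installing the pillow package with pip should resolve it.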

Agreeing to the license

You need to agree to the LLAMA 3.2 COMMUNITY LICENSE AGREEMENT, so accept it on the model's Hugging Face page before downloading the model.
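Accepting the license is tied to a Hugging Face account, so the download also has to be authenticated as that account. The original post does not show how this was done; the following is one possible sketch, assuming an access token is stored in the HF_TOKEN environment variable. The helper names resolve_hf_token and login_if_token_present are hypothetical; login comes from the huggingface_hub package, which is installed as a dependency of transformers.

```python
import os


def resolve_hf_token(env=None):
    """Return the Hugging Face token from the environment, or None if unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_if_token_present():
    """Log in with HF_TOKEN when available; return True if a login was attempted."""
    token = resolve_hf_token()
    if token is None:
        print("HF_TOKEN is not set; run `huggingface-cli login` in the shell instead.")
        return False
    # Imported lazily so the helper above works even without the package.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling login_if_token_present() once before from_pretrained should be enough; running huggingface-cli login in the shell is an alternative that stores the token on disk.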

2. Trying it out

Start the Python interpreter:

CUDA_VISIBLE_DEVICES=0 python

Once the Python prompt appears, paste in the following code.

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

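# Image used in the original sample code (kept for reference):
#url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
#image = Image.open(requests.get(url, stream=True).raw)
# Use a local image file instead; replace the path with your own file.
image_path = "/path/to/non-doraemon.jpg"
image = Image.open(image_path)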

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        # Japanese prompt, roughly: "If I were to write a haiku about this, it would go as follows."
        {"type": "text", "text": "これについて俳句を書くとしたら、次のようになります。"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=4096)
print(processor.decode(output[0]))

I changed three things from the sample code:

・Changed the image file
・Changed the prompt to Japanese
・Raised max_new_tokens from 30 to 4096
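One side effect of decoding output[0] as-is is that the printed text also contains the prompt and special tokens, because generate returns the input tokens followed by the newly generated ones. A minimal sketch of trimming the prompt off is shown below. The helper name generated_only is hypothetical, and plain lists of made-up token IDs stand in for the real tensors so the snippet runs without the model.

```python
def generated_only(output_ids, prompt_length):
    """Return the part of one generated sequence that follows the prompt."""
    return output_ids[prompt_length:]


# Stand-in token IDs so the sketch runs without loading the model.
prompt_ids = [101, 102, 103]
full_output = prompt_ids + [7, 8, 9]

print(generated_only(full_output, len(prompt_ids)))
```

With the objects from the script, the equivalent call would be processor.decode(generated_only(output[0], inputs["input_ids"].shape[-1]), skip_special_tokens=True), which prints only the model's answer.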

This is the image file I gave it:

And this is the inference result:

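このイラストを題材にした俳句は以下のようになります。 (A haiku based on this illustration would be as follows.)
　　星のある服 (Clothes with stars)
　　空を飛ぶ猫 (A cat flying through the sky)
　　宇宙船の乗り物 (A spaceship vehicle)
　　赤い耳を付けた (With red ears attached)
　　飛行服を着て (Wearing a flight suit)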

meta-llama/Llama-3.2-11B-Vision-Instruct

What is a haiku, anyway...?

VRAM usage was about 22.1 GB.
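That figure is roughly what the weights alone would suggest. As a back-of-the-envelope sketch, assuming the nominal 11 billion parameters from the model name (the exact count may differ) and 2 bytes per parameter for bfloat16, the weights come to about 22 GB in decimal units. Activations, the KV cache, and the CUDA context add to this, and the post does not say how the 22.1 GB was measured, so treat this as an estimate only.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate weight memory in decimal gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 10**9


def weight_memory_gib(n_params, bytes_per_param):
    """Approximate weight memory in binary gibibytes (2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 11e9   # assumption: nominal "11B" from the model name
BF16_BYTES = 2    # torch.bfloat16 uses 2 bytes per value

print(f"{weight_memory_gb(N_PARAMS, BF16_BYTES):.1f} GB")
print(f"{weight_memory_gib(N_PARAMS, BF16_BYTES):.1f} GiB")
```

Either way, the weights alone exceed the 16GB of the laptop GPU, which is consistent with the run needing the 24GB card.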