Trying Llama 3.2 to caption AI-generated images

みし

October 13, 2024, 09:03

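Llama 3.2, an open-source generative AI model, handles Japanese out of the box and can read and interpret images, so I gave it a try. The short version: it does not work very well. It probably needs prompt tuning and fine-tuning of the model.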

Model used for the test

This time I use SeanScripts/Llama-3.2-11B-Vision-Instruct-nf4, which gets by with less VRAM. I wrote about nf4 on Zenn.
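To get a feel for why the nf4 variant needs less VRAM, here is a back-of-the-envelope sketch. It only counts the weights and assumes a nominal 11 billion parameters (taken from the "11B" in the model name); activations, the KV cache and quantization metadata are ignored, so real usage will be higher.

```python
# Rough weight-only memory estimate. Not a measurement.
NOMINAL_PARAMS = 11e9  # assumption: read off the "11B" in the model name

BITS_PER_WEIGHT = {"bf16": 16, "int8": 8, "nf4": 4}


def weight_gigabytes(num_params: float, bits: int) -> float:
    """Return the size of the weights alone in GB (10^9 bytes)."""
    return num_params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_gigabytes(NOMINAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights drop from about 22 GB at 16 bits to about 5.5 GB at 4 bits, which is the difference between not fitting and fitting on a typical consumer GPU.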

I also reuse the sample script largely as is.

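Notes:

You cannot download the model unless you log in with huggingface-cli first. To log in, you need to issue an access token on Hugging Face beforehand.
Meta's official models should also require you to accept the license before downloading.
Even if you follow the manual, some libraries are missing, so you have to install them one by one while checking the errors.
- transformers
- huggingface-hub
- bitsandbytes
- torch: what you need to install depends on your OS, GPU and driver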
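Instead of discovering the missing libraries one error at a time, you can check up front. This is a minimal sketch using only the standard library. The import names are my assumption of what the packages above provide (huggingface-hub as `huggingface_hub`, plus `PIL` from Pillow, which the sample script imports), and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Import names for the packages the sample script relies on (assumed mapping).
REQUIRED_MODULES = ["transformers", "huggingface_hub", "bitsandbytes", "torch", "PIL"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only tells you whether a module is importable, not whether the installed torch build matches your GPU and driver.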

Log in from the command line:

 huggingface-cli login

Test sample

import json
import os
import time

from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Load model
model_id = "SeanScripts/Llama-3.2-11B-Vision-Instruct-nf4"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    use_safetensors=True,
    device_map="cuda:0"
)
# Load processor (tokenizer and image preprocessor)
processor = AutoProcessor.from_pretrained(model_id)

base_folder = "<path>/<to>"  # set the folder to scan here
MAX_RETRIES = 10  # arbitrary cap (not in the original sample) so one image cannot loop forever

for filename in os.listdir(base_folder):
  # Caption a local image (could use a more specific prompt)
  if not filename.endswith(".jpg"):
    continue
  IMAGE = Image.open(os.path.join(base_folder, filename)).convert("RGB")
  # The prompt is in Japanese: "Write a line the girl in the image might say,
  # in Japanese, within 15 characters (output as {"text":"caption"})"
  PROMPT = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
  画像の中の少女が話しそうなセリフを15文字以内の日本語で書いてください(出力は {"text":"キャプション"} ):
  <|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>
  """

  inputs = processor(IMAGE, PROMPT, return_tensors="pt").to(model.device)
  prompt_tokens = len(inputs['input_ids'][0])
  print(f"Prompt tokens: {prompt_tokens}")

  text = None
  for attempt in range(MAX_RETRIES):
    t0 = time.time()
    generate_ids = model.generate(**inputs, max_new_tokens=64)
    total_time = time.time() - t0
    generated_tokens = len(generate_ids[0]) - prompt_tokens
    tokens_per_second = generated_tokens / total_time
    print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

    output = processor.decode(generate_ids[0][prompt_tokens:]).replace('<|eot_id|>', '')
    try:
      candidate = json.loads(output)['text']
      # Reject odd interjections and captions longer than 15 characters
      rejected = any(s in candidate for s in ("おい", "おお", "おや")) or len(candidate) > 15
    except (ValueError, KeyError, TypeError):
      # Not valid JSON, or no usable "text" value: show the raw output and retry
      print(f"Error: {output}")
      continue
    if not rejected:
      text = candidate
      break

  # text stays None if every attempt was rejected
  print(filename, text)
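One small thing about the loop above: `filename.endswith(".jpg")` skips files such as `photo.JPG` or `photo.png`. If your folder mixes extensions, a sketch like the following could replace the `os.listdir` filter. The set of suffixes and the helper name `list_images` are my own assumptions, not part of the sample script.

```python
from pathlib import Path

# Assumed set of image suffixes; extend it to match your own files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(folder):
    """Return image files in `folder`, sorted by name, matching suffixes case-insensitively."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Each returned `Path` can be passed straight to `Image.open`.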

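The output is unstable unless you specify the output format explicitly.
1. The input prompt is 「画像の中の少女が話しそうなセリフを15文字以内の日本語で書いてください(出力は {"text":"キャプション"})」 (roughly: "Write a line the girl in the image might say, in Japanese, within 15 characters (output as {"text":"caption"})"), so the model is asked to answer in JSON.
Retry when the output cannot be parsed as JSON.
1. This is because of buggy output such as {"text":"おおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおおお
Also retry when odd Japanese comes out.
1. "おやおや、" ("oh my, oh my") just sounds off, doesn't it...
Generation slows down roughly in proportion to the token count, so set the token limit on the small side (the target here is about 15 characters, so 20 tokens is enough; the sample above uses max_new_tokens=64). Raising the limit too far squeezes memory and generation slows down sharply.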
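The retry rules above can be pulled out of the model loop and tried without a GPU. The following is a minimal sketch under my own assumptions: `parse_caption` and `caption_with_retries` are hypothetical helper names, `generate` stands in for whatever produces the raw model output, and the cap of 5 attempts is arbitrary.

```python
import json

BANNED_FRAGMENTS = ("おい", "おお", "おや")  # interjections rejected in the sample script
MAX_CHARS = 15


def parse_caption(raw):
    """Return the caption if `raw` is valid JSON with an acceptable "text" value, else None."""
    try:
        text = json.loads(raw)["text"]
    except (ValueError, KeyError, TypeError):
        return None  # truncated or malformed JSON, or no "text" key
    if not isinstance(text, str) or len(text) > MAX_CHARS:
        return None
    if any(fragment in text for fragment in BANNED_FRAGMENTS):
        return None
    return text


def caption_with_retries(generate, max_attempts=5):
    """Call `generate()` until it yields an acceptable caption or attempts run out."""
    for _ in range(max_attempts):
        caption = parse_caption(generate())
        if caption is not None:
            return caption
    return None


# Stand-in for the model: a truncated output, an odd interjection, then a good answer.
fake_outputs = iter(['{"text":"おおおおおお', '{"text":"おやおや、"}', '{"text":"今日はいい天気"}'])
print(caption_with_retries(lambda: next(fake_outputs)))  # prints 今日はいい天気
```

Capping the attempts means a stubborn image returns `None` instead of hanging, at the cost of sometimes leaving an image without a caption.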