Trying Out 2D Spatial Understanding in Gemini 2.0

npaka

December 13, 2024, 15:04

I tried out 2D Spatial Understanding in Gemini 2.0, so here is a summary.

・2D spatial understanding with Gemini 2

1. 2D Spatial Understanding

2D Spatial Understanding is the ability (and the techniques behind it) to understand the positions, shapes, and relationships of objects on a plane from 2D images or video.

2. Preparing the Gemini API

The steps to set up the Gemini API in Google Colab are as follows.

(1) Install the package.

# Install the package
!pip install -U -q google-genai

(2) Get an API key from Google AI Studio, register it in Colab's secrets manager under the name GOOGLE_API_KEY, and run the following cell.

from google.colab import userdata
import os

# Prepare the API key
os.environ['GOOGLE_API_KEY'] = userdata.get("GOOGLE_API_KEY")

(3) Run inference.

from google import genai

# Prepare the client
client = genai.Client()

# Run inference
response = client.models.generate_content(
    model="gemini-2.0-flash-exp", 
    contents="こんにちは"  # Japanese for "Hello"
)
print(response.text)

こんにちは！何かお手伝いできることはありますか？ (Hello! Is there anything I can help you with?)

3. Trying 2D Spatial Understanding

The steps to try 2D Spatial Understanding in Google Colab are as follows.

(1) Upload an image using the folder icon at the far left, then run the following cell.
・sample.jpg

from PIL import Image
from io import BytesIO

# Prepare the image (resize to a width of 1024, preserving the aspect ratio)
image = "sample.jpg"
img = Image.open(BytesIO(open(image, "rb").read()))
im = Image.open(image).resize((1024, int(1024 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)
im
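
The resize above fixes the width at 1024 and scales the height by the same ratio. As a minimal sketch of that size calculation on its own, the function below reproduces the arithmetic; the name scaled_size and its target_width parameter are made up for illustration and do not appear in the original notebook.

```python
def scaled_size(width: int, height: int, target_width: int = 1024) -> tuple[int, int]:
    """Return (target_width, new_height), preserving the aspect ratio.

    Mirrors the expression used in the cell above:
    int(1024 * height / width). The helper name is hypothetical.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return target_width, int(target_width * height / width)


# Example: a 2048x1536 photo is scaled down to 1024x768.
print(scaled_size(2048, 1536))
```

The result can be passed directly as the size argument of Image.resize().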

(2) Run inference.
The detected bounding boxes are returned as JSON.

from google.genai import types

# Prepare the prompt
prompt = "Detect the 2d bounding boxes of the cat (The label has a cat pattern.)"

# Run inference
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction="""
            Return bounding boxes as a JSON array with labels. Never return masks or code fencing. Limit to 25 objects.
            If an object is present multiple times, name them according to their unique characteristic (colors, size, position, unique characteristics, etc..).
        """,
        temperature=0.5,
    )
)
print(response.text)

```json
[
  {"box_2d": [375, 4, 972, 370], "label": "orange cat"},
  {"box_2d": [372, 363, 849, 693], "label": "black cat"},
  {"box_2d": [519, 582, 997, 997], "label": "white cat"}
]
```
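
Since the response is model-generated text, it can be worth checking its shape before drawing anything. The following is a rough sketch, not part of the official sample: validate_boxes is a hypothetical helper, and the 0 to 1000 range check is an assumption based on the plotting code in section 4, which divides each coordinate by 1000.

```python
import json

# Sample model output copied from the result above, without the Markdown fence.
raw = """
[
  {"box_2d": [375, 4, 972, 370], "label": "orange cat"},
  {"box_2d": [372, 363, 849, 693], "label": "black cat"},
  {"box_2d": [519, 582, 997, 997], "label": "white cat"}
]
"""


def validate_boxes(text):
    """Parse JSON text and keep only well-formed entries.

    Hypothetical helper: an entry is kept when it is an object whose
    "box_2d" is a list of four numbers in the assumed 0-1000 range.
    """
    valid = []
    for item in json.loads(text):
        if not isinstance(item, dict):
            continue
        box = item.get("box_2d")
        if (
            isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) and 0 <= v <= 1000 for v in box)
        ):
            valid.append(item)
    return valid


boxes = validate_boxes(raw)
for b in boxes:
    y1, x1, y2, x2 = b["box_2d"]  # order is [y1, x1, y2, x2]
    print(b.get("label", "(no label)"), (y1, x1, y2, x2))
```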

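[Notes on the prompts]
prompt: Detect the 2D bounding boxes of the cats, and use each cat's coat pattern as its label.

system_instruction:
Return the bounding boxes as a JSON array with labels.
Do not return masks or code fencing. Limit the result to 25 objects.
If an object appears more than once, name each instance by its distinguishing characteristics (color, size, position, and so on).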

4. Drawing the Bounding Boxes

The steps to draw the bounding boxes in Google Colab are as follows.

(1) Install the font.

# Install the font
!apt-get install fonts-noto-cjk

(2) Prepare the utility functions.
We use parse_json() and plot_bounding_boxes() from the official sample code. The only change is the font size, increased from 14 to 28.

def parse_json(json_output):
    # Strip Markdown fencing
    lines = json_output.splitlines()
    for i, line in enumerate(lines):
        if line == "```json":
            json_output = "\n".join(lines[i+1:])  # Drop everything before "```json"
            json_output = json_output.split("```")[0]  # Drop everything after the closing "```"
            break  # Stop looping once "```json" is found
    return json_output
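
parse_json() above only matches a line that is exactly the json fence marker, so a fence with surrounding whitespace or an upper-case language tag passes through untouched. As a sketch of a slightly more tolerant variant, the function below uses a regular expression instead; the name strip_json_fence is hypothetical and is not part of the official sample.

```python
import re

# Build the three-backtick fence marker programmatically to keep this example readable.
FENCE = "`" * 3

# Opening fence with an optional "json" tag (any case), then the body,
# ending at the closing fence or at the end of the text if it is missing.
_FENCE_RE = re.compile(
    FENCE + r"[ \t]*json[ \t]*\r?\n(.*?)(?:" + FENCE + r"|\Z)",
    re.DOTALL | re.IGNORECASE,
)


def strip_json_fence(text: str) -> str:
    """Return the body of the first json-fenced block, or the text itself.

    Hypothetical variant of parse_json(): unfenced input is returned
    unchanged apart from surrounding whitespace.
    """
    match = _FENCE_RE.search(text)
    if match:
        return match.group(1).strip()
    return text.strip()


fenced = FENCE + "json\n[{\"label\": \"cat\"}]\n" + FENCE
print(strip_json_fence(fenced))
```

The output can then be handed to json.loads() in the same way as the result of parse_json().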

import json
import random
import io
from PIL import Image, ImageDraw, ImageFont
from PIL import ImageColor

additional_colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]

def plot_bounding_boxes(im, bounding_boxes):
    """
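    Draw bounding boxes on an image with PIL, labeling each box with its name
    and giving each box its own color. Coordinates are normalized.

    Args:
        im: The PIL image to draw on (modified in place).
        bounding_boxes: The response text, a JSON list (optionally wrapped in Markdown fencing) whose entries hold the object name and its position in normalized [y1, x1, y2, x2] format (0 to 1000).
    """

    # Use the image that was passed in
    img = im
    width, height = img.size
    print(img.size)

    # Prepare the drawing object
    draw = ImageDraw.Draw(img)

    # Define the colors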
    colors = [
    'red',
    'green',
    'blue',
    'yellow',
    'orange',
    'pink',
    'purple',
    'brown',
    'gray',
    'beige',
    'turquoise',
    'cyan',
    'magenta',
    'lime',
    'navy',
    'maroon',
    'teal',
    'olive',
    'coral',
    'lavender',
    'violet',
    'gold',
    'silver',
    ] + additional_colors

    # Strip Markdown fencing
    bounding_boxes = parse_json(bounding_boxes)

    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", size=28)

    # Iterate over the bounding boxes
    for i, bounding_box in enumerate(json.loads(bounding_boxes)):
      # Select a color from the list
      color = colors[i % len(colors)]

      # Convert normalized coordinates to absolute coordinates
      abs_y1 = int(bounding_box["box_2d"][0]/1000 * height)
      abs_x1 = int(bounding_box["box_2d"][1]/1000 * width)
      abs_y2 = int(bounding_box["box_2d"][2]/1000 * height)
      abs_x2 = int(bounding_box["box_2d"][3]/1000 * width)
      if abs_x1 > abs_x2:
        abs_x1, abs_x2 = abs_x2, abs_x1
      if abs_y1 > abs_y2:
        abs_y1, abs_y2 = abs_y2, abs_y1

      # Draw the bounding box
      draw.rectangle(
          ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=4
      )

      # Draw the label text
      if "label" in bounding_box:
        draw.text((abs_x1 + 8, abs_y1 + 6), bounding_box["label"], fill=color, font=font)

    # Display the image
    img.show()

(3) Draw the bounding boxes.

# Draw the bounding boxes
plot_bounding_boxes(im, response.text)
im
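
The coordinate conversion inside plot_bounding_boxes() is the part most likely to be reused elsewhere (for cropping, for example). As a minimal sketch, here it is pulled out into a standalone function; to_absolute is a hypothetical name, and the 1024x768 image size in the example is assumed purely for illustration.

```python
def to_absolute(box_2d, width, height):
    """Convert a normalized [y1, x1, y2, x2] box (0-1000) to pixel (x1, y1, x2, y2).

    Hypothetical helper mirroring the arithmetic in plot_bounding_boxes():
    divide by 1000, scale by the image size, truncate with int(), and swap
    the corners if they arrive in reverse order.
    """
    y1, x1, y2, x2 = box_2d
    abs_y1 = int(y1 / 1000 * height)
    abs_x1 = int(x1 / 1000 * width)
    abs_y2 = int(y2 / 1000 * height)
    abs_x2 = int(x2 / 1000 * width)
    if abs_x1 > abs_x2:
        abs_x1, abs_x2 = abs_x2, abs_x1
    if abs_y1 > abs_y2:
        abs_y1, abs_y2 = abs_y2, abs_y1
    return abs_x1, abs_y1, abs_x2, abs_y2


# "orange cat" box from the sample output, on an assumed 1024x768 image.
print(to_absolute([375, 4, 972, 370], width=1024, height=768))  # prints (4, 288, 378, 746)
```

Note that the model returns y before x, while the returned tuple puts x first, which is the order PIL drawing and cropping calls expect.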