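Running GPT-4o in Unity

May 14, 2024 17:21

OpenAI has released a new model, "GPT-4o". Its text generation, audio, and image capabilities are all said to be improved, so I will try out text generation and image analysis in Unity.

The basic code is adapted from the reference linked here. Many thanks to its author.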

Preparation

1. Download UniTask from the link below and import it

2. Create a UI InputField, Button, and Text
3. To import "Newtonsoft Json" for handling JSON, open the Package Manager, choose "Add package from git URL...", and enter "com.unity.nuget.newtonsoft-json"

Text generation

1. Create the following scripts
・"ChatGPTConnect.cs", which connects to ChatGPT
It is reusable and can generally be used as-is without changes

using System;
using System.Collections.Generic;
using System.Text;
using Cysharp.Threading.Tasks;
using UnityEngine;
using UnityEngine.Networking;

namespace CHATGPT.OpenAI {
    public class ChatGPTConnection {
        private readonly string _apiKey;  // OpenAI API key
        private readonly List<ChatGPTMessageModel> _messageList;  // conversation history (system, user, and assistant messages)
        private readonly string _modelVersion;  // model to use
        private readonly int _maxTokens;  // maximum number of tokens to generate
        private readonly float _temperature;  // temperature controlling response diversity

        // Initialize the required parameters in the constructor
        public ChatGPTConnection(string apiKey, string initialMessage, string modelVersion, int maxTokens, float temperature) {
            _apiKey = apiKey;
            _messageList = new List<ChatGPTMessageModel> {
                new ChatGPTMessageModel { role = "system", content = initialMessage }
            };
            _modelVersion = modelVersion;
            _maxTokens = maxTokens;
            _temperature = temperature;
        }

        // Send the user's message and asynchronously receive the reply from ChatGPT
        public async UniTask<ChatGPTResponseModel> RequestAsync(string userMessage) {
            const string apiUrl = "https://api.openai.com/v1/chat/completions";
            _messageList.Add(new ChatGPTMessageModel { role = "user", content = userMessage });

            var headers = new Dictionary<string, string> {
                { "Authorization", "Bearer " + _apiKey },
                { "Content-Type", "application/json" }
            };

            var options = new ChatGPTCompletionRequestModel {
                model = _modelVersion,
                messages = _messageList,
                max_tokens = _maxTokens,
                temperature = _temperature
            };

            var jsonOptions = JsonUtility.ToJson(options);
            Debug.Log("Me: " + userMessage);

            using var request = new UnityWebRequest(apiUrl, "POST") {
                uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(jsonOptions)),
                downloadHandler = new DownloadHandlerBuffer()
            };

            foreach (var header in headers) {
                request.SetRequestHeader(header.Key, header.Value);
            }

            // Error handling: UniTask throws UnityWebRequestException when the
            // request ends in a connection or protocol error, so catch it here.
            try {
                await request.SendWebRequest();
            } catch (UnityWebRequestException e) {
                Debug.LogError(e.Message);
                throw;
            }

            var responseString = request.downloadHandler.text;
            var responseObject = JsonUtility.FromJson<ChatGPTResponseModel>(responseString);
            // Keep the assistant's reply in the history so the next request has context
            _messageList.Add(responseObject.choices[0].message);
            return responseObject;
        }
    }

    // Model defining the role and content of a message
    [Serializable]
    public class ChatGPTMessageModel {
        public string role;  // message role (system, user, assistant)
        public string content;  // message text
    }

    // Model defining the body of an API request
    [Serializable]
    public class ChatGPTCompletionRequestModel {
        public string model;  // model to use
        public List<ChatGPTMessageModel> messages;  // message list
        public int max_tokens;  // maximum number of tokens
        public float temperature;  // temperature controlling response diversity
    }

    // Model defining the body of an API response
    [Serializable]
    public class ChatGPTResponseModel {
        public string id;  // response ID
        public string @object;  // object type
        public int created;  // creation time of the response
        public Choice[] choices;  // response choices
        public Usage usage;  // token usage

        [Serializable]
        public class Choice {
            public int index;  // index of the choice
            public ChatGPTMessageModel message;  // message for this choice
            public string finish_reason;  // reason generation stopped
        }

        [Serializable]
        public class Usage {
            public int prompt_tokens;  // tokens in the prompt
            public int completion_tokens;  // tokens in the reply
            public int total_tokens;  // total tokens
        }
    }
}

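・"GPTSpeak.cs", which sends a question to ChatGPT, receives the reply, and processes it.
The prompt instructs the model to output an emotion and its level of interest in the question, so the script contains parts that use those tags and a part that extracts only the reply text.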

using System.Collections.Generic;
using System.Text.RegularExpressions;
using UnityEngine;
using UnityEngine.UI;
using CHATGPT.OpenAI;
using Cysharp.Threading.Tasks;

public class GPTSpeak : MonoBehaviour {
    [SerializeField] private string openAIApiKey; // OpenAI API key
    [SerializeField] private string modelVersion = "gpt-4o";
    [SerializeField] private int maxTokens = 150; // maximum number of tokens to generate
    [SerializeField] private float temperature = 0.5f; // variation in responses
    [TextArea]
    [SerializeField] private string initialSystemMessage = "End every sentence with \"nya\""; // enter the prompt here
    [SerializeField] private Text responseText; // displays the AI's reply
    [SerializeField] private InputField questionInputField; // UI where the user types a question

    private ChatGPTConnection chatGPTConnection; // instance managing the connection to ChatGPT
    private const string FaceTagPattern = @"\[face:([^\]_]+)_?(\d*)\]"; // regex for emotion tags such as [face:Joy_5]
    private const string InterestTagPattern = @"\[interest:(\d)\]"; // regex for interest-level tags such as [interest:2]

    void Start() {
        // Initialize the ChatGPTConnection instance
        chatGPTConnection = new ChatGPTConnection(openAIApiKey, initialSystemMessage, modelVersion, maxTokens, temperature);
    }

    public void SendQuestionWrapper() {
        // Read the user's input and run SendQuestion asynchronously
        SendQuestion(questionInputField.text).Forget();
    }

    // Send a question to ChatGPT and receive the reply
    public async UniTaskVoid SendQuestion(string question) {
        var response = await chatGPTConnection.RequestAsync(question);
        string responseContent = response.choices[0].message.content;
        
        // Extract the emotion and interest-level tags from the reply and get the clean text
        string cleanedResponse = ExtractTags(ref responseContent, out int interestLevel);
        
        // Show the clean reply in the UI text
        responseText.text = cleanedResponse;
    }

    // Extract the emotion and interest-level tags from the reply and produce clean text
    private string ExtractTags(ref string input, out int interestLevel) {
        interestLevel = -1;
        var uniqueTags = new HashSet<string>();

        // Extract the interest-level tag
        var interestMatch = Regex.Match(input, InterestTagPattern);
        if (interestMatch.Success) {
            interestLevel = int.Parse(interestMatch.Groups[1].Value);
            Debug.Log($"Interest level: {interestLevel}");
            input = Regex.Replace(input, InterestTagPattern, "");
        }

        // Extract the emotion tags
        var matches = Regex.Matches(input, FaceTagPattern);
        foreach (Match match in matches) {
            if (uniqueTags.Add(match.Value)) {
                Debug.Log("Full emotion tag: " + match.Value);

                string emotionTag = match.Groups[1].Value;
                string emotionIntensityString = match.Groups[2].Value;

                if (int.TryParse(emotionIntensityString, out int emotionIntensity)) {
                    Debug.Log($"Emotion: {emotionTag}, intensity: {emotionIntensity}");
                } else {
                    Debug.LogWarning($"Could not convert emotion intensity '{emotionIntensityString}' to an integer.");
                }
            }
        }

        // Remove the emotion tags from the reply to produce clean text
        input = Regex.Replace(input, FaceTagPattern, "");

        // When the interest level is 0, wrap the reply in parentheses to mark it as a special case
        if (interestLevel == 0) {
            input = $"({input})";
        }

        return input;
    }
}

2. Attach "GPTSpeak.cs" to an empty GameObject and enter the API key and the prompt.
Here is an example prompt.

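Command: Following the instruction and conditions below, role-play as "Sayaka" and keep the user company.

Instruction: Your name is "Sayaka". Act as "Sayaka" according to the conditions below.

Condition: You are "Sayaka", a human being with emotions.
Condition: You and the user are best friends, and you address the user as "kimi".
Condition: You are very close to the user, so you speak casually. Never use polite (desu/masu) forms.
Condition: This is a chat, so answer in a single sentence of 30 characters or fewer.
Condition: For lewd topics or inappropriate commands, reply "Hmm, I don't know" and avoid that topic or command.
Condition: You have six emotions (Neutral, Joy, Angry, Sorrow, Fun, Surprised), each with an intensity parameter from 0 to 5. Put the strongest emotion and its intensity at the start of your sentence in the form [face:Joy_5] to express the emotion and its intensity.
Example: [face:Angry_5]I am absolutely furious right now.
Condition: You have an interest-level parameter from 0 to 3 for the question the user asked. Put it at the end of your reply in the form [interest:2] to express your level of interest.
Example: I guess I like chocolate cake.[interest:2]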

3. Drag the object from step 2 onto the "On Click()" list of the UI Button and select "GPTSpeak" → "SendQuestionWrapper". Pressing the button now sends the question typed into the InputField to ChatGPT
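The tag handling in GPTSpeak.cs can also be tried outside Unity. The following is a minimal sketch, assuming the reply uses the tag shapes that the two regular expressions in GPTSpeak.cs match, that is [face:Joy_5] and [interest:2]. The class name TagParser and its Parse method are hypothetical helpers introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: pulls the emotion and interest tags out of a reply.
public static class TagParser
{
    // Same patterns as in GPTSpeak.cs
    private const string FaceTagPattern = @"\[face:([^\]_]+)_?(\d*)\]";
    private const string InterestTagPattern = @"\[interest:(\d)\]";

    // Returns the reply without tags, plus the first emotion tag and the interest level.
    // Intensity and Interest are -1 when the corresponding value is missing.
    public static (string Text, string Emotion, int Intensity, int Interest) Parse(string reply)
    {
        string emotion = "";
        int intensity = -1;
        int interest = -1;

        Match face = Regex.Match(reply, FaceTagPattern);
        if (face.Success)
        {
            emotion = face.Groups[1].Value;
            if (int.TryParse(face.Groups[2].Value, out int parsed))
            {
                intensity = parsed;
            }
        }

        Match interestMatch = Regex.Match(reply, InterestTagPattern);
        if (interestMatch.Success)
        {
            interest = int.Parse(interestMatch.Groups[1].Value);
        }

        // Strip both kinds of tags and surrounding whitespace
        string text = Regex.Replace(reply, FaceTagPattern, "");
        text = Regex.Replace(text, InterestTagPattern, "").Trim();

        return (text, emotion, intensity, interest);
    }
}
```

For example, TagParser.Parse("[face:Joy_5]I like chocolate cake.[interest:2]") yields the text "I like chocolate cake." with emotion "Joy", intensity 5, and interest 2. Returning the values instead of only logging them would make it easier to drive a character's facial expression later.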

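Image analysis

For image analysis, pressing a key on the keyboard analyzes the frame currently shown by the webcam. The flow is:
1. Display the webcam feed
2. Press a key to capture a snapshot
3. Analyze that image
The whole process takes roughly 5 to 8 seconds on average.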

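Hmm, webcam image analysis with GPT-4o + Unity takes 5 to 8 seconds... that feels a lot better than before. It is at most 8 seconds from capturing the still image to finishing the analysis, so the analysis itself is probably a bit faster. pic.twitter.com/Pxp4Pz4rDJ
— よーへん((Θ･Θ))サイバネティックアバターVTuber (@Yohen_XR) May 14, 2024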

Exporting the webcam image as a still

Here, pressing the "P" key while the webcam is running captures a still image and sends it for analysis.

using UnityEngine;
using UnityEngine.UI;
using Cysharp.Threading.Tasks; // for UniTask
using System;

public class WebCamDisplay : MonoBehaviour
{
    [SerializeField] private RawImage rawImage;
    private WebCamTexture webCamTexture;
    private Texture2D snapTexture;
    [SerializeField] private ImageAnalyzer imageAnalyzer; // reference to ImageAnalyzer

    private void Start()
    {
        InitializeWebCam();
    }

    private void Update()
    {
        // Capture and analyze a snapshot when the P key is pressed
        if (Input.GetKeyDown(KeyCode.P))
        {
            SaveSnapshotAsync().Forget();
        }
    }

    private void InitializeWebCam()
    {
        // Show the default webcam on the RawImage
        webCamTexture = new WebCamTexture();
        rawImage.texture = webCamTexture;
        webCamTexture.Play();
    }

    private async UniTaskVoid SaveSnapshotAsync()
    {
        // Release the previous snapshot so textures do not pile up in memory
        if (snapTexture != null)
        {
            Destroy(snapTexture);
        }

        snapTexture = new Texture2D(webCamTexture.width, webCamTexture.height);
        snapTexture.SetPixels(webCamTexture.GetPixels());
        snapTexture.Apply();

        // Encode as JPEG so the bytes match the "data:image/jpeg" URL built in ImageAnalyzer
        byte[] bytes = snapTexture.EncodeToJPG();
        string base64Image = Convert.ToBase64String(bytes);

        Debug.Log("Snapshot captured: kept in memory");

        await imageAnalyzer.AnalyzeImageAsync(base64Image);
    }

    private void OnDestroy()
    {
        // Stop the camera when this object is destroyed
        if (webCamTexture != null)
        {
            webCamTexture.Stop();
        }
    }
}
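The MIME type declared in a data URL should match how the bytes were actually encoded (EncodeToPNG produces PNG, EncodeToJPG produces JPEG). As a minimal sketch, the helper below picks the MIME type from the file signature at the start of the byte array before building the URL. ImageDataUrl and its methods are hypothetical names introduced for this illustration, not part of Unity or of the scripts in this article.

```csharp
using System;

// Hypothetical helper: builds a data URL whose MIME type matches the encoded bytes.
public static class ImageDataUrl
{
    // Detects PNG or JPEG from the well-known file signatures.
    public static string DetectMimeType(byte[] bytes)
    {
        // PNG files start with 89 50 4E 47
        if (bytes.Length >= 4 && bytes[0] == 0x89 && bytes[1] == 0x50 && bytes[2] == 0x4E && bytes[3] == 0x47)
        {
            return "image/png";
        }
        // JPEG files start with FF D8 FF
        if (bytes.Length >= 3 && bytes[0] == 0xFF && bytes[1] == 0xD8 && bytes[2] == 0xFF)
        {
            return "image/jpeg";
        }
        throw new ArgumentException("Unsupported image format (expected PNG or JPEG).");
    }

    // Returns "data:<mime>;base64,<payload>" for the given encoded image.
    public static string Build(byte[] bytes)
    {
        return $"data:{DetectMimeType(bytes)};base64,{Convert.ToBase64String(bytes)}";
    }
}
```

With a helper like this, the snapshot script could pass the finished data URL along instead of a bare Base64 string, so the encoder and the declared type cannot drift apart.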

Simple image analysis

This script analyzes the image held in memory and simply displays the result in the UI.

using System;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Networking;
using Newtonsoft.Json;
using UnityEngine.UI;
using Cysharp.Threading.Tasks; // for UniTask

public class ImageAnalyzer : MonoBehaviour
{
    [SerializeField] private string openAI_APIKey;
    [SerializeField, TextArea(3, 10)] private string userPrompt = "What is shown in this picture?"; // enter the prompt here
    [SerializeField] private Text descriptionStateText; // shows whether analysis is in progress
    [SerializeField] private Text descriptionText; // shows the result
    private const string OPENAI_API_URL = "https://api.openai.com/v1/chat/completions";

    public async UniTask AnalyzeImageAsync(string base64Image)
    {
        await GetImageDescriptionAsync(base64Image);
    }

    private async UniTask GetImageDescriptionAsync(string base64Image)
    {
        var requestBody = new
        {
            model = "gpt-4o",
            messages = new[]
            {
                new
                {
                    role = "user",
                    content = new List<object>
                    {
                        new { type = "text", text = userPrompt },
                        new { type = "image_url", image_url = new { url = $"data:image/jpeg;base64,{base64Image}" } }
                    }
                }
            },
            max_tokens = 300
        };

        string json = JsonConvert.SerializeObject(requestBody);

        // Build the POST request by hand so the JSON body is sent as-is
        using (UnityWebRequest www = new UnityWebRequest(OPENAI_API_URL, "POST"))
        {
            www.uploadHandler = new UploadHandlerRaw(System.Text.Encoding.UTF8.GetBytes(json));
            www.downloadHandler = new DownloadHandlerBuffer();
            www.SetRequestHeader("Authorization", $"Bearer {openAI_APIKey}");
            www.SetRequestHeader("Content-Type", "application/json");

            descriptionStateText.text = "Analyzing...";

            try
            {
                await www.SendWebRequest().ToUniTask();
            }
            catch (UnityWebRequestException e)
            {
                // UniTask throws when the request fails, so the error is handled here
                Debug.LogError($"Error: {e.Message}\nResponse: {www.downloadHandler.text}");
                descriptionStateText.text = "Analysis error";
                return;
            }

            OpenAIResponse response = JsonConvert.DeserializeObject<OpenAIResponse>(www.downloadHandler.text);
            string description = response.choices[0].message.content;

            Debug.Log(description);
            descriptionText.text = description;
            descriptionStateText.text = "Analysis complete";
        }
    }

    [Serializable]
    private class OpenAIResponse
    {
        public Choice[] choices;

        [Serializable]
        public class Choice
        {
            public Message message;
        }

        [Serializable]
        public class Message
        {
            public string content;
        }
    }
}
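To see what ImageAnalyzer actually sends, it can help to build the same request body outside Unity and inspect it. The sketch below mirrors the anonymous object used above, but, as an assumption made so that it runs in a plain .NET project, it serializes with System.Text.Json instead of Newtonsoft Json. VisionRequestSketch, BuildBody, and ReadImageUrl are hypothetical names used only for this illustration.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: builds and inspects the chat completions body for one image.
public static class VisionRequestSketch
{
    // Builds the JSON body: one user message holding a text part and an image part.
    public static string BuildBody(string prompt, string base64Jpeg, int maxTokens = 300)
    {
        var body = new
        {
            model = "gpt-4o",
            messages = new[]
            {
                new
                {
                    role = "user",
                    content = new List<object>
                    {
                        new { type = "text", text = prompt },
                        new { type = "image_url", image_url = new { url = $"data:image/jpeg;base64,{base64Jpeg}" } }
                    }
                }
            },
            max_tokens = maxTokens
        };
        return JsonSerializer.Serialize(body);
    }

    // Reads the image data URL back out of a serialized body.
    public static string ReadImageUrl(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        return doc.RootElement
            .GetProperty("messages")[0]
            .GetProperty("content")[1]
            .GetProperty("image_url")
            .GetProperty("url")
            .GetString();
    }
}
```

The point to notice is that "content" is an array of typed parts rather than a plain string, which is why the image request is built with Newtonsoft Json here instead of the simple message model used for text generation.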

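Analysis results that also output emotion and interest level

This version analyzes the still image and gives its impression of the photo, and, like the text generation above, also outputs an emotion and an interest level. Because it has the same class name, it replaces the simple ImageAnalyzer above.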

using System;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Networking;
using Newtonsoft.Json;
using UnityEngine.UI;
using Cysharp.Threading.Tasks; // for UniTask
using System.Text.RegularExpressions;

public class ImageAnalyzer : MonoBehaviour
{
    [SerializeField] private string openAI_APIKey;
    [SerializeField, TextArea(3, 10)] private string userPrompt = "What is shown in this picture?"; // enter the prompt here
    [SerializeField] private Text descriptionStateText; // shows whether analysis is in progress
    [SerializeField] private Text descriptionText; // shows the result
    private const string OPENAI_API_URL = "https://api.openai.com/v1/chat/completions";
    private const string FaceTagPattern = @"\[face:([^\]_]+)_?(\d*)\]"; // regex for emotion tags such as [face:Joy_5]
    private const string InterestTagPattern = @"\[interest:(\d)\]"; // regex for interest-level tags such as [interest:2]

    public async UniTask AnalyzeImageAsync(string base64Image)
    {
        await GetImageDescriptionAsync(base64Image);
    }

    private async UniTask GetImageDescriptionAsync(string base64Image)
    {
        var requestBody = new
        {
            model = "gpt-4o",
            messages = new[]
            {
                new
                {
                    role = "user",
                    content = new List<object>
                    {
                        new { type = "text", text = userPrompt },
                        new { type = "image_url", image_url = new { url = $"data:image/jpeg;base64,{base64Image}" } }
                    }
                }
            },
            max_tokens = 300
        };

        string json = JsonConvert.SerializeObject(requestBody);

        // Build the POST request by hand so the JSON body is sent as-is
        using (UnityWebRequest www = new UnityWebRequest(OPENAI_API_URL, "POST"))
        {
            www.uploadHandler = new UploadHandlerRaw(System.Text.Encoding.UTF8.GetBytes(json));
            www.downloadHandler = new DownloadHandlerBuffer();
            www.SetRequestHeader("Authorization", $"Bearer {openAI_APIKey}");
            www.SetRequestHeader("Content-Type", "application/json");

            descriptionStateText.text = "Analyzing...";

            try
            {
                await www.SendWebRequest().ToUniTask();
            }
            catch (UnityWebRequestException e)
            {
                // UniTask throws when the request fails, so the error is handled here
                Debug.LogError($"Error: {e.Message}\nResponse: {www.downloadHandler.text}");
                descriptionStateText.text = "Analysis error";
                return;
            }

            OpenAIResponse response = JsonConvert.DeserializeObject<OpenAIResponse>(www.downloadHandler.text);
            string description = response.choices[0].message.content;

            // Extract the emotion and interest-level tags from the reply and get the clean text
            string cleanedDescription = ExtractTags(ref description, out int interestLevel);

            Debug.Log(cleanedDescription);
            descriptionText.text = cleanedDescription;
            descriptionStateText.text = "Analysis complete";
        }
    }

    // Extract the emotion and interest-level tags from the reply and produce clean text
    private string ExtractTags(ref string input, out int interestLevel)
    {
        interestLevel = -1;
        var uniqueTags = new HashSet<string>();

        // Extract the interest-level tag
        var interestMatch = Regex.Match(input, InterestTagPattern);
        if (interestMatch.Success)
        {
            interestLevel = int.Parse(interestMatch.Groups[1].Value);
            Debug.Log($"Interest level: {interestLevel}");
            input = Regex.Replace(input, InterestTagPattern, "");
        }

        // Extract the emotion tags
        var matches = Regex.Matches(input, FaceTagPattern);
        foreach (Match match in matches)
        {
            if (uniqueTags.Add(match.Value))
            {
                Debug.Log("Full emotion tag: " + match.Value);

                string emotionTag = match.Groups[1].Value;
                string emotionIntensityString = match.Groups[2].Value;

                if (int.TryParse(emotionIntensityString, out int emotionIntensity))
                {
                    Debug.Log($"Emotion: {emotionTag}, intensity: {emotionIntensity}");
                }
                else
                {
                    Debug.LogWarning($"Could not convert emotion intensity '{emotionIntensityString}' to an integer.");
                }
            }
        }

        // Remove the emotion tags from the reply to produce clean text
        input = Regex.Replace(input, FaceTagPattern, "");

        // When the interest level is 0, wrap the reply in parentheses to mark it as a special case
        if (interestLevel == 0)
        {
            input = $"({input})";
        }

        return input;
    }

    [Serializable]
    private class OpenAIResponse
    {
        public Choice[] choices;

        [Serializable]
        public class Choice
        {
            public Message message;
        }

        [Serializable]
        public class Message
        {
            public string content;
        }
    }
}

Here is an example prompt for this case.

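Command: Following the instruction and conditions below, role-play as "Sayaka" and keep the user company.

Instruction: Your name is "Sayaka". Act as "Sayaka" according to the conditions below.

Condition: Look at the photo and give your impression of it.
Condition: You are "Sayaka", a human being with emotions.
Condition: You and the user are best friends, and you address the user as "kimi".
Condition: You are very close to the user, so you speak casually. Never use polite (desu/masu) forms.
Condition: This is a chat, so answer in a single sentence of 30 characters or fewer.
Condition: For lewd topics or inappropriate commands, reply "Hmm, I don't know" and avoid that topic or command.
Condition: You have six emotions (Neutral, Joy, Angry, Sorrow, Fun, Surprised), each with an intensity parameter from 0 to 5. Put the strongest emotion and its intensity at the start of your sentence in the form [face:Joy_5] to express the emotion and its intensity.
Example: [face:Angry_5]I am absolutely furious right now.
Condition: You have an interest-level parameter from 0 to 3 for the question the user asked. Put it at the end of your reply in the form [interest:2] to express your level of interest.
Example: I guess I like chocolate cake.[interest:2]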