How to Use Predicted Outputs in the OpenAI API

npaka

December 25, 2024, 08:35

I found the following article interesting, so here is a brief summary.

・Predicted Outputs

1. Predicted Outputs

Predicted Outputs is a feature that speeds up Chat Completions responses when many of the output tokens are known in advance. It is useful when regenerating text or code with only small changes.

2. Code Refactoring Example

Suppose you use GPT-4o to refactor a piece of TypeScript code, replacing the username property of the User class with an email property.

class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;

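Apart from line 4 above, most of the file stays unchanged. By passing the current code as the predicted text, you can regenerate the entire file faster. The larger the file, the more time you save.

The following example passes the current code as the predicted text through the prediction parameter.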

from openai import OpenAI

code = """
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;
"""

refactor_prompt = """
Replace the "username" property with an "email" property. Respond only 
with code, and with no markdown formatting.
"""

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": refactor_prompt
        },
        {
            "role": "user",
            "content": code
        }
    ],
    prediction={
        "type": "content",
        "content": code
    }
)

print(completion)
print(completion.choices[0].message.content)

The prompt asks the model to replace the "username" property with an "email" property and to respond with code only, without Markdown formatting.

In addition to the refactored code, the response contains data like the following.

{
  id: 'chatcmpl-xxx',
  object: 'chat.completion',
  created: 1730918466,
  model: 'gpt-4o-2024-08-06',
  choices: [ /* ...actual text response here... */],
  usage: {
    prompt_tokens: 81,
    completion_tokens: 39,
    total_tokens: 120,
    prompt_tokens_details: { cached_tokens: 0, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      accepted_prediction_tokens: 18,
      rejected_prediction_tokens: 10
    }
  },
  system_fingerprint: 'fp_159d8341cc'
}

Note accepted_prediction_tokens and rejected_prediction_tokens under usage (in completion_tokens_details). In this example, 18 tokens from the predicted text were used to speed up the response, and 10 were rejected.
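
The two counters can be condensed into a single number when comparing runs. The following is a minimal sketch, not part of the original article: prediction_stats is a hypothetical helper, and "acceptance rate" here simply means accepted / (accepted + rejected), which is my own definition rather than an official metric. It works on a plain dict; with the Python SDK you could presumably obtain one via completion.usage.model_dump().

```python
def prediction_stats(usage: dict) -> dict:
    """Summarize accepted vs. rejected prediction tokens (hypothetical helper)."""
    details = usage.get("completion_tokens_details") or {}
    accepted = details.get("accepted_prediction_tokens") or 0
    rejected = details.get("rejected_prediction_tokens") or 0
    counted = accepted + rejected
    # Avoid division by zero when no prediction tokens were reported.
    rate = accepted / counted if counted else 0.0
    return {"accepted": accepted, "rejected": rejected, "acceptance_rate": rate}


# Usage values copied from the response shown above.
usage = {
    "prompt_tokens": 81,
    "completion_tokens": 39,
    "total_tokens": 120,
    "completion_tokens_details": {
        "accepted_prediction_tokens": 18,
        "rejected_prediction_tokens": 10,
    },
}

stats = prediction_stats(usage)
print(f"accepted={stats['accepted']} rejected={stats['rejected']} "
      f"rate={stats['acceptance_rate']:.2f}")
```

A low rate suggests that the predicted text diverges a lot from what the model actually generates, so the prediction may not be helping much.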

3. Streaming Example

With streaming, the latency benefit of Predicted Outputs is even greater. The following is the same code refactoring example, this time using streaming.

from openai import OpenAI

code = """
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;
"""

refactor_prompt = """
Replace the "username" property with an "email" property. Respond only 
with code, and with no markdown formatting.
"""

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": refactor_prompt
        },
        {
            "role": "user",
            "content": code
        }
    ],
    prediction={
        "type": "content",
        "content": code
    },
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
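
If you need the complete refactored file rather than incremental printing, the deltas can be joined into one string. The sketch below is an illustration under assumptions: collect_text and fake_chunk are hypothetical names, and the chunks are simulated with SimpleNamespace objects that mirror the chunk.choices[0].delta.content attribute path used above, so the example runs without calling the API. The check for an empty choices list is a defensive assumption, not something the original article mentions.

```python
from types import SimpleNamespace


def collect_text(chunks) -> str:
    """Join the text deltas of a streamed response into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensive: skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta is not None:
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (for the demo only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Simulated stream: two text deltas, one empty delta, one chunk without choices.
demo_stream = [
    fake_chunk("class User {\n"),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("}\n"),
]

result = collect_text(demo_stream)
print(result, end="")
```

With a real request, the stream object returned by client.chat.completions.create(..., stream=True) would take the place of demo_stream.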

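4. Position of the Predicted Text in the Response

When you provide predicted text, the prediction can appear anywhere in the generated response and still reduce response latency. Suppose the predicted text is the simple Hono server shown below.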

import { serveStatic } from "@hono/node-server/serve-static";
import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();

app.get("/api", (c) => {
  return c.text("Hello Hono!");
});

// You will need to build the client code first `pnpm run ui:build`
app.use(
  "/*",
  serveStatic({
    rewriteRequestPath: (path) => `./dist${path}`,
  })
);

const port = 3000;
console.log(`Server is running on port ${port}`);

serve({
  fetch: app.fetch,
  port,
});

You can instruct the model to regenerate the file with a prompt like the following.

Add a get route to this application that responds with 
the text "hello world". Generate the entire application 
file again with this route added, and with no other 
markdown formatting.

The prompt asks the model to add a GET route that responds with the text "hello world" and to regenerate the entire application file with that route added, without Markdown formatting.

The response to the prompt looks like this.

import { serveStatic } from "@hono/node-server/serve-static";
import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();

app.get("/api", (c) => {
  return c.text("Hello Hono!");
});

app.get("/hello", (c) => {
  return c.text("hello world");
});

// You will need to build the client code first `pnpm run ui:build`
app.use(
  "/*",
  serveStatic({
    rewriteRequestPath: (path) => `./dist${path}`,
  })
);

const port = 3000;
console.log(`Server is running on port ${port}`);

serve({
  fetch: app.fetch,
  port,
});

Even though the predicted text appears both before and after the new content added to the response, the response still reports accepted prediction tokens.

{
  id: 'chatcmpl-xxx',
  object: 'chat.completion',
  created: 1731014771,
  model: 'gpt-4o-2024-08-06',
  choices: [ /* completion here... */],
  usage: {
    prompt_tokens: 203,
    completion_tokens: 159,
    total_tokens: 362,
    prompt_tokens_details: { cached_tokens: 0, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      accepted_prediction_tokens: 60,
      rejected_prediction_tokens: 0
    }
  },
  system_fingerprint: 'fp_9ee9e968ea'
}

This time, no prediction tokens were rejected.
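
To get a feel for why a prediction can still match on both sides of newly inserted content, you can compare the predicted text with the response locally. The sketch below uses Python's difflib at line level on a condensed version of the Hono example (the shortened lines are my own abbreviation). It is only an illustration: the API works on tokens, and the article does not describe how the API itself matches them.

```python
import difflib

# Condensed stand-ins for the predicted text and the generated response.
prediction = [
    'const app = new Hono();',
    'app.get("/api", (c) => c.text("Hello Hono!"));',
    'serve({ fetch: app.fetch, port: 3000 });',
]
response = [
    'const app = new Hono();',
    'app.get("/api", (c) => c.text("Hello Hono!"));',
    'app.get("/hello", (c) => c.text("hello world"));',  # newly added route
    'serve({ fetch: app.fetch, port: 3000 });',
]

matcher = difflib.SequenceMatcher(a=prediction, b=response, autojunk=False)

# Each match is (start index in response, number of matching lines).
# The final zero-length sentinel block is filtered out.
reused = [(m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0]

for start, size in reused:
    print(f"response lines {start}-{start + size - 1} match the prediction")
```

The predicted lines match one block before the new route and one block after it, which mirrors the situation described in this section.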

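5. Limitations

Predicted Outputs has the following limitations.

・It is supported only on GPT-4o and GPT-4o-mini.

・Rejected prediction tokens are still billed at the completion token rate.
To see how many prediction tokens were rejected, check rejected_prediction_tokens under usage.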

・The following parameters are unsupported or restricted.

  ・n : values greater than 1 are not supported
  ・logprobs : not supported
  ・presence_penalty : values greater than 0 are not supported
  ・frequency_penalty : values greater than 0 are not supported
  ・audio : not compatible with audio input or output
  ・modalities : only the text modality is supported
  ・max_completion_tokens : not supported
  ・tools : Function Calling is not supported
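
The parameter restrictions above can be checked before sending a request. The following is a minimal sketch under assumptions: find_prediction_conflicts is a hypothetical helper, not part of the OpenAI SDK, and it only encodes the rules listed in this section against a plain dict of request parameters.

```python
def find_prediction_conflicts(params: dict) -> list[str]:
    """Return the parameter names that conflict with `prediction` (hypothetical helper)."""
    if "prediction" not in params:
        return []  # the restrictions only apply when a prediction is supplied

    conflicts = []
    if (params.get("n") or 1) > 1:
        conflicts.append("n")
    if params.get("logprobs"):
        conflicts.append("logprobs")
    if (params.get("presence_penalty") or 0) > 0:
        conflicts.append("presence_penalty")
    if (params.get("frequency_penalty") or 0) > 0:
        conflicts.append("frequency_penalty")
    if params.get("audio") is not None:
        conflicts.append("audio")
    modalities = params.get("modalities")
    if modalities is not None and list(modalities) != ["text"]:
        conflicts.append("modalities")
    if params.get("max_completion_tokens") is not None:
        conflicts.append("max_completion_tokens")
    if params.get("tools"):
        conflicts.append("tools")
    return conflicts


request = {
    "model": "gpt-4o",
    "prediction": {"type": "content", "content": "class User {}"},
    "n": 2,
    "presence_penalty": 0.5,
    "tools": [{"type": "function"}],
}

conflicts = find_prediction_conflicts(request)
print("conflicting parameters:", conflicts)
```

Running a check like this before calling client.chat.completions.create(**request) lets you report every conflict at once instead of discovering them through API errors.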