Roundup of OpenAI o1 Replication Models (as of December 2024)

December 14, 2024, 11:39

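Introduction

In September 2024, OpenAI drew attention by showing a new direction for language models with o1. o1-preview and o1-mini were announced that month, and o1 (full) and o1-pro were released recently.

There is still no consensus on the definitive way to build a reasoning model like o1, but efforts to replicate o1 have started to appear, and some have released their models. This post summarizes the state of play, focusing on open models.

I have previously written articles (purely speculative) on how OpenAI o1 might be built, so please see those as well if you are interested.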

1. List of o1 replication models

As far as I know, the following open models were built with the aim of replicating o1.

Marco-o1 (🤗): License: Apache 2.0
QwQ-32B-Preview (🤗): License: Apache 2.0
Llava-CoT (Llava-o1) (🤗): License: Apache 2.0
- The only image model
Skywork (🤗): License: other
Steiner (🤗): License: Apache 2.0

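Of these five models, only Marco-o1 and Llava-CoT have published papers describing how they were developed, so this post focuses on those two.
That said, as discussed later, QwQ-32B-Preview comes from a team in the same Alibaba family as Marco-o1, so its development method is probably close to Marco-o1's.

Skywork has a published paper on how its reward model was built, and Steiner appears to be a personal project with a published technical blog. I will look at these too if time allows.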

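2. How the models are developed

2.1 Marco-o1

I think Marco-o1 is the most open attempt at replicating a reasoning model.
The paper, part of the training data, and the model weights are all public.
Links: 🤗, GitHub, 📝

It starts from an ordinary LLM such as Qwen and turns it into a reasoning model through several steps.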

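Step 1: Fine-tuning on CoT data and instruction data

Full fine-tuning is performed on what the paper calls a CoT dataset, together with instruction data.
The CoT dataset has the following format.

The entire reasoning process is wrapped in <Thought></Thought> tags
The final output is wrapped in <Output></Output> tags
"\n\n" appears between passages (probably as a separator between reasoning steps)
It uses Reflection after Thinking (explained below)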

{
    "instruction": "The House of Lords operates as part of the political system of the United Kingdom.\\nIs the above claim true?\\nOptions:\\n- yes\\n- no Let's think first. Step-by-step reasoning process:",
    "response": "<Thought>\nok, i need to determine whether the statement \"The House of Lords operates as part of the political system of the United Kingdom\" is true. let's break this down step by step.\n\nfirst, i recall that the United Kingdom has a parliamentary system of government, which includes the House of Commons and the House of Lords. the House of Commons is the lower house, while the House of Lords is the upper house. both houses work together to create and pass laws.\n\nbut wait, is the House of Lords truly part of the political system? or is it just a ceremonial body? i think historically, the House of Lords has been involved in lawmaking, but its role has evolved over time. maybe it's not as powerful as it used to be, but it still plays a role in the legislative process.\n\nlet me consider the current status. i believe the House of Lords is composed of appointed members, not elected ones, which differentiates it from the House of Commons. this appointment process might limit its influence, but it still contributes to the overall political landscape by providing expertise and experience.\n\nalso, there have been debates about reforming the House of Lords to make it more representative. some argue for an elected upper house, while others support its current structure. regardless of the ongoing discussions, the House of Lords remains an integral part of the UK's political system, even if its powers are somewhat limited compared to the House of Commons.\n\nanother angle to consider is the role of the House of Lords in scrutinizing legislation and providing a second chamber for debate. this function ensures that laws are thoroughly examined before they become binding, which is a crucial aspect of the political system.\n\nbut then again, some might argue that with the increasing power of the House of Commons, the House of Lords' role is diminishing. however, its presence and functions still make it a significant component of the UK's political structure.\n\nin summary, while the House of Lords may not hold the same level of power as the House of Commons, it is indeed part of the political system of the United Kingdom, serving as a legislative body alongside the House of Commons.\n\n</Thought>\n<Output>\nyes\n</Output>"
}
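To make the format concrete, here is a minimal sketch of how a response in this format could be split into reasoning steps and a final answer. The function name `parse_cot_response` is my own, and treating "\n\n" as the step separator is the same guess as in the list above, not something the dataset documents.

```python
import re

def parse_cot_response(response: str) -> dict:
    """Split a Marco-o1-style response into reasoning steps and the final output."""
    thought = re.search(r"<Thought>(.*?)</Thought>", response, re.DOTALL)
    output = re.search(r"<Output>(.*?)</Output>", response, re.DOTALL)
    if thought is None or output is None:
        raise ValueError("response is missing <Thought> or <Output> tags")
    # Assumption: a blank line ("\n\n") separates one reasoning step from the next.
    steps = [s.strip() for s in thought.group(1).split("\n\n") if s.strip()]
    return {"steps": steps, "output": output.group(1).strip()}

# A shortened, made-up sample in the same shape as the dataset entry.
sample = (
    "<Thought>\nok, i need to check the claim.\n\n"
    "but wait, is it only ceremonial?\n\n"
    "in summary, it is part of the system.\n\n</Thought>\n"
    "<Output>\nyes\n</Output>"
)
parsed = parse_cot_response(sample)
print(len(parsed["steps"]), parsed["output"])
```

Splitting on the blank line is what would later let each step become one node in a search tree.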

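Reflection after Thinking is the trick of adding a phrase such as "Wait! Maybe I made some mistakes! I need to rethink from scratch." to the reasoning steps.
Thanks to this phrase, the model apparently gets into the habit of looking back over its reasoning so far and checking it for mistakes.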

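The instruction dataset is used to improve instruction-following ability, so that the model does not forget the original instruction partway through a complex task.

The model trained on these data is named Marco-o1-CoT. CoT fine-tuning alone apparently does not raise reasoning ability much and may even lower it. Perhaps this is because it breaks the original model and forces a fixed CoT format onto it.
Even so, I think it is a necessary step for what follows.

Table 2 | Experimental results on MGSM datasets. (from the paper)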

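Step 2: MCTS integration

As noted above, the CoT model (for convenience, I use this name for the model trained on the CoT dataset, to distinguish it from the full reasoning model) does not improve performance on its own.
MCTS (Monte Carlo tree search) is therefore integrated into the logic of the inference-time reasoning steps.

MCTS is the technique used in game AIs such as AlphaGo: a heuristic search algorithm (that is, one that makes decisions based on random sampling) for picking the best next move out of a huge search space.
In other words, just as it searches for the next move in Go, MCTS is used here to search for the next reasoning step in a reasoning problem.

During the search, each node corresponds to one reasoning step. I believe the flow is as follows: look a few steps ahead to confirm that a plausible line of reasoning is reachable, commit to the next reasoning step, then run MCTS again a few steps ahead with that step as the new root to pick the step after that, and so on.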
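The search loop described above can be sketched as follows. This is generic MCTS with UCB1 selection, not Marco-o1's actual implementation. `propose` (candidate next steps) and `reward` (score for a finished path) are hypothetical stand-ins for the LLM and the scoring function, and the toy versions below exist only so the sketch runs.

```python
import math
import random

class Node:
    """One node holds the reasoning steps chosen so far."""
    def __init__(self, steps, parent=None):
        self.steps = steps
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def ucb1(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts_next_step(prefix, propose, reward, lookahead, iterations, rng):
    root = Node(tuple(prefix))
    max_len = len(root.steps) + lookahead
    for _ in range(iterations):
        node = root
        # Selection: follow the highest-UCB1 child down to a leaf.
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
        # Expansion: add one child per candidate next step, then pick one.
        if len(node.steps) < max_len:
            node.children = [Node(node.steps + (s,), node) for s in propose(node.steps)]
            node = rng.choice(node.children)
        # Simulation: finish the path with random steps, then score it.
        steps = node.steps
        while len(steps) < max_len:
            steps = steps + (rng.choice(propose(steps)),)
        value = reward(steps)
        # Backpropagation: update statistics from the leaf up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += value
            node = node.parent
    # Commit to the most-visited first step.
    return max(root.children, key=lambda ch: ch.visits).steps[-1]

def propose(steps):
    # Hypothetical stand-in for sampling candidate next steps from an LLM.
    return ["good", "bad"]

def reward(steps):
    # Hypothetical stand-in for a path score: 1.0 only if every step is "good".
    return 1.0 if all(s == "good" for s in steps) else 0.0

print(mcts_next_step([], propose, reward, lookahead=2, iterations=200, rng=random.Random(0)))
```

To follow the flow in the text, the returned step would be appended to `prefix` and `mcts_next_step` called again, so each committed step becomes the new root.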

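So what metric does MCTS use when searching for the next step? Surprisingly, it is based on each token's probability (log probability).
For each reasoning step, it uses not only the generated tokens but also the log probabilities of the top 5 predicted tokens at each position. From these, a confidence score is computed first.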

From that, a reward score is then computed.
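For reference, as I understand the Marco-o1 paper, the two quantities are defined roughly as below. Here p(t_i) is the log probability of the i-th generated token, p(t_k) for k = 1..5 are the log probabilities of the top 5 candidate tokens at that position, and n is the number of tokens in the rollout. The notation is my own transcription, so please check it against the paper.

```latex
c_i = \frac{\exp\bigl(p(t_i)\bigr)}{\sum_{k=1}^{5} \exp\bigl(p(t_k)\bigr)},
\qquad
v = \frac{1}{n} \sum_{i=1}^{n} c_i
```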

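The formulas are simple, so I will skip the details here.
The important point is that the value (reward) of each reasoning step is computed from token log probabilities alone! I take this to mean that the token log probabilities used to predict the next token can also express the plausibility of a longer-horizon reasoning step. Personally, I found this surprising.
Moreover, because the reward can be derived this way, it can be computed even for problems that have no clear right or wrong final answer, unlike when a reward model such as a PRM is used!
In fact, the results reported in the paper show a large score gain over both the CoT model and the original Instruct model.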
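A minimal sketch of that computation, assuming the confidence score is a softmax of the chosen token's log probability over the top 5 candidates and the reward is the mean confidence over the rollout (my reading of the paper). The function names and the sample numbers are made up for illustration.

```python
import math

def confidence(chosen_logprob, top5_logprobs):
    """Softmax of the chosen token's log-prob against the top-5 candidates."""
    denom = sum(math.exp(lp) for lp in top5_logprobs)
    return math.exp(chosen_logprob) / denom

def rollout_reward(token_records):
    """Average per-token confidence over one rollout."""
    scores = [confidence(lp, top5) for lp, top5 in token_records]
    return sum(scores) / len(scores)

# Two made-up tokens: (log-prob of the chosen token, log-probs of the top-5 candidates).
records = [
    (math.log(0.5), [math.log(p) for p in (0.5, 0.2, 0.1, 0.1, 0.1)]),
    (math.log(0.25), [math.log(p) for p in (0.25, 0.25, 0.25, 0.15, 0.1)]),
]
print(round(rollout_reward(records), 3))
```

No labels or separate reward model appear anywhere in this computation, which is why it also applies to problems without a checkable final answer.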

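The models that add MCTS to the reasoning procedure are named Marco-o1-MCTS.
The labels in parentheses indicate the Reasoning Action Strategy.
Put simply, the Reasoning Action Strategy is how a reasoning step (one node in MCTS) is delimited. There are two broad options: Step as Action and Mini-step as Action.

Step as Action: The paper says the model can generate a complete reasoning step as an action. This probably means that, for example, if the CoT dataset was built with "\n\n" as the step separator, everything up to the next "\n\n" counts as one reasoning step. In other words, the model itself controls the length of each reasoning step.
Mini-step as Action: This one is simpler: reasoning steps are cut by token count. Such a chunk is called a mini-step, and the paper experiments with two sizes, 32 tokens and 64 tokens.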
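The difference between the two strategies can be sketched as two ways of cutting the same text into actions. This is only an illustration: whitespace-separated words stand in for real tokenizer output, and the "\n\n" separator is the same guess as above.

```python
def split_steps(text: str) -> list[str]:
    """Step as Action: one action per blank-line-separated reasoning step."""
    return [s for s in text.split("\n\n") if s.strip()]

def split_mini_steps(tokens: list[str], size: int) -> list[list[str]]:
    """Mini-step as Action: one action per fixed-size chunk of tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

text = (
    "first, recall the definition.\n\n"
    "but wait, check the edge case.\n\n"
    "so the answer is yes."
)
tokens = text.split()  # whitespace "tokens" stand in for real tokenizer output

print(len(split_steps(text)))
print([len(chunk) for chunk in split_mini_steps(tokens, 4)])
```

With mini-steps the search tree branches at a fixed, finer granularity, whereas with full steps the branching points depend on where the model chooses to end a step.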

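Interestingly, neither is uniformly better: on the English math task the order is step > mini-step (64 tokens) > mini-step (32 tokens), whereas on the Chinese math task it is reversed: mini-step (32 tokens) > mini-step (64 tokens) > step.
This is pure speculation, but since Japanese is probably closer to Chinese than to English as a language, mini-steps may suit Japanese better.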

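Step 3: ORM/PRM and reinforcement learning

The paper only experiments up to MCTS, but it names two directions for future work.

Refining the MCTS reward signal (probably meaning the reward score) using Outcome Reward Modeling (ORM) and Process Reward Modeling (PRM)
Fine-tuning with reinforcement learning

ORM and PRM are both kinds of reward model. An ORM learns rewards based only on the final result or output. A PRM learns rewards that also take the problem-solving process and the reasoning steps into account.
Given the nature of reasoning models, I expect PRM rather than ORM to be used, but how to create the training data for a PRM is an open problem.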
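To illustrate the difference in what the two kinds of model score, here is a toy sketch. Real ORMs and PRMs are learned models; the rule-based functions below are hypothetical stand-ins that only show the shape of the signal: one score for the outcome versus one score per step.

```python
import re

def outcome_reward(final_answer: str, reference: str) -> float:
    """ORM-style signal: a single score for the final result only."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_rewards(steps: list[str]) -> list[float]:
    """PRM-style signal: one score per reasoning step.

    Toy rule: a step written as 'a op b = c' scores 1.0 if the arithmetic holds.
    """
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    scores = []
    for step in steps:
        m = re.fullmatch(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)\s*", step)
        ok = m is not None and ops[m.group(2)](int(m.group(1)), int(m.group(3))) == int(m.group(4))
        scores.append(1.0 if ok else 0.0)
    return scores

# A made-up solution with an arithmetic slip in the second step.
steps = ["20 * 29 = 580", "3 * 29 = 97", "580 + 97 = 677"]
print(outcome_reward("677", "667"))
print(process_rewards(steps))
```

The outcome score only says the answer is wrong, while the per-step scores point at the step where the error entered. Producing such step-level labels at scale is exactly the data problem mentioned above.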

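Also, the current approach produces rewards without any reward model such as a PRM. I look forward to future research on how PRMs and reinforcement learning will be combined with it.

Summary

Marco-o1 is the most open effort to develop a reasoning model, and its method has two stages: fine-tuning on a CoT dataset and integrating MCTS.

The model is open, but as described above, the MCTS logic is not supported in Transformers, so I think the released model is simply the CoT model.

Personally, when I built a CoT dataset for Qwen2.5-0.5B-Instruct and integrated MCTS into it, the Japanese MGSM score did go up. I plan to publish the code and the specific method at some point.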

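2.2 QwQ-32B-Preview

Only the QwQ model is open; how it was built is not. However, QwQ-32B-Preview comes from a team in the same Alibaba family as Marco-o1, so its development method is probably close to Marco-o1's.
Here is what its actual output looks like. (Japanese prompts sometimes produce output mixed with Chinese, so I checked in English.)

Problem: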

Two distinct primes, each greater than 20, are multiplied. What is the least possible product of these two primes?

QwQ's answer:

So I have this problem here: "Two distinct primes, each greater than 20, are multiplied. What is the least possible product of these two primes?" Okay, let's break this down.

First, I need to find two different prime numbers, both of which are greater than 20. Then, I have to multiply them together and find the smallest possible product.

So, the key here is to find the smallest possible product, which means I should choose the smallest possible primes greater than 20.

Let me start by listing some prime numbers greater than 20.

I know that 23 is a prime number. Let's check 24; no, 24 is divisible by 2, 3, 4, 6, 8, 12, so not prime. 25 is divisible by 5, so not prime. 26 is even, so divisible by 2. 27 is divisible by 3. 28 is even. 29 is a prime. 30 is divisible by 2, 3, 5, etc. So, the primes greater than 20 seem to be 23, 29, 31, and so on.

Wait, is 31 a prime? Yes, I think so. Let me check: 31 is not divisible by 2, 3, 5, or 7 (since 7 times 4 is 28, and 7 times 5 is 35, which is greater than 31), so yes, 31 is prime.

Similarly, 37 is prime, etc.

So, the smallest primes greater than 20 are 23 and 29.

Wait a minute, is 29 the next prime after 23? Let's see: 23, 24 (not prime), 25 (not prime), 26 (not prime), 27 (not prime), 28 (not prime), 29 (prime). Yes, so 23 and 29 are the smallest primes greater than 20.

Now, the problem says "distinct primes," so I can't choose the same prime twice. I have to choose two different ones.

So, to get the smallest possible product, I should choose the two smallest distinct primes greater than 20, which are 23 and 29.

Now, I need to multiply them together.

Let me calculate 23 times 29.

Hmm, what's 23 times 29?

Well, I know that 20 times 29 is 580, and 3 times 29 is 87. So, 23 times 29 is 580 plus 87, which is 667.

So, the least possible product is 667.

Wait a minute, is there a smaller product possible if I choose different primes?

Let me think about this. Are there any primes between 23 and 29 that I might have missed?

Let's see: 23, 24 (not prime), 25 (not prime), 26 (not prime), 27 (not prime), 28 (not prime), 29 (prime). That's it. So, 23 and 29 are indeed the smallest primes greater than 20.

Therefore, the smallest possible product is 23 times 29, which is 667.

But just to be sure, let's check the next prime after 29, which is 31.

23 times 31 is... 20 times 31 is 620, plus 3 times 31 is 93, so 620 plus 93 is 713. That's larger than 667.

Similarly, 29 times 31 is... 20 times 31 is 620, plus 9 times 31 is 279, so 620 plus 279 is 899, which is even larger.

So, indeed, 23 and 29 give the smallest product.

Alternatively, maybe there's a prime between 20 and 23 that I missed.

Wait, no. 23 is the smallest prime greater than 20. 21 is divisible by 3 and 7, and 22 is even, so 23 is the smallest.

Similarly, 29 is the next one.

So, I think 667 is the correct answer.

But let me double-check if there are any primes between 20 and 29.

As far as I know, 23 and 29 are the only primes in that range.

Wait, is 29 greater than 20? Yes.

Is there a prime number between 23 and 29? No, as I already listed.

So, yes, 23 and 29 are the smallest distinct primes greater than 20.

Therefore, their product is 23 times 29, which is 667.

So, the least possible product is 667.

Final Answer

\[ \boxed{667} \]

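As shown above, "Wait" appears frequently, and the steps appear to be joined by "\n\n". I cannot say for certain, but the format resembles Marco-o1's CoT dataset.
The training data itself may differ, but I get the impression that the know-how from Marco-o1 is being used as the method for developing the reasoning model.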
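As a quick sanity check on the answer QwQ reaches in the output above, the same search can be done by brute force. The helper names are my own.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for small numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def least_product_of_two_primes_above(limit: int) -> tuple[int, int, int]:
    """Return the two smallest primes greater than `limit` and their product."""
    primes = []
    n = limit + 1
    while len(primes) < 2:
        if is_prime(n):
            primes.append(n)
        n += 1
    p, q = primes
    return p, q, p * q

print(least_product_of_two_primes_above(20))  # (23, 29, 667)
```

The two smallest distinct primes always give the least product, so QwQ's repeated re-checks were not needed for correctness; they are a product of the reflective output style.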

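Also, although the QwQ model is public, I think it too is simply a CoT model with no MCTS-like logic built in. (It is possible that reinforcement learning was also applied, though.)

That is about all that can be said about QwQ.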

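2.3 Llava-CoT (Llava-o1)

Llava-CoT was apparently first released as Llava-o1 and later renamed Llava-CoT.
It is the only image reasoning model here. It is built in two main steps.

Step 1: Fine-tuning on a CoT dataset

Like Marco-o1, Llava-CoT begins with fine-tuning on a CoT dataset. Unlike Marco-o1, however, Llava-CoT forces the model to reason in four stages.
(At this point I feel the approach differs from OpenAI o1's direction of inference-time scaling. Still, I think it is a useful reference for strengthening the reasoning ability of image models.)

Reasoning consists of four stages: Summary, Caption, Reasoning, and Conclusion.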

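The training data is published on Hugging Face under Apache 2.0, but GPT-4o was apparently used to create it, so care is needed when using it.

The code for creating the training data is also published on GitHub. (🔗)

Step 2: Integrating beam search

Llava-CoT also goes beyond training on a CoT dataset and bolts on a mechanism for searching over reasoning steps. Here a simpler method called Stage-level Beam Search is used.

An illustration of inference approaches. (from the paper)

It simply repeats the following procedure.

Generate multiple reasoning paths
Randomly pick two candidates from among them
Have the Llava-CoT model itself choose the better of the two
Move on to the next stage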
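The procedure above can be sketched as follows. `generate` and `judge` are hypothetical stand-ins for sampling from the model and for asking the model which of two candidates is better. Repeating the pairwise comparison until one candidate remains is my assumption about how a single winner per stage is reached; the published inference code is the authoritative version.

```python
import random

def stage_level_beam_search(stages, generate, judge, n_candidates, rng):
    """Build a response stage by stage, keeping one winner per stage."""
    chosen = []
    for stage in stages:
        # 1. Sample several candidate outputs for this stage.
        candidates = [generate(stage, chosen, rng) for _ in range(n_candidates)]
        # 2-3. Repeatedly draw two candidates at random and drop the loser.
        while len(candidates) > 1:
            a, b = rng.sample(candidates, 2)
            loser = b if judge(stage, chosen, a, b) == a else a
            candidates.remove(loser)
        # 4. Commit the survivor and move on to the next stage.
        chosen.append(candidates[0])
    return chosen

def generate(stage, chosen, rng):
    # Hypothetical stand-in for the model: some text plus a hidden quality score.
    return (f"{stage} draft", rng.random())

def judge(stage, chosen, a, b):
    # Hypothetical stand-in for the model's self-evaluation: prefer higher quality.
    return a if a[1] >= b[1] else b

stages = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]
result = stage_level_beam_search(stages, generate, judge, n_candidates=4, rng=random.Random(0))
print([text for text, _ in result])
```

Each stage costs `n_candidates` generations plus `n_candidates - 1` judge calls, which makes the extra compute of this approach easy to see.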

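Marco-o1 used token probabilities, but here the path is chosen more simply, by asking the model itself which reasoning path is better. The inference code, including the prompt for judging which reasoning path is better, is also published on GitHub. (🔗)

Summary

Like Marco-o1, Llava-CoT combines fine-tuning on a CoT dataset with a bolted-on reasoning-path search. However, because an LLM is asked to evaluate at every stage, I feel this method may consume more compute.

The model is also published on Hugging Face under the Apache 2.0 license, but since the dataset was built with GPT-4o, I think it needs to be handled with care. Also, because the reasoning-path search is bolted on, the released Llava-CoT is likewise just the model after CoT fine-tuning.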

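Summary

This ended up being mostly a walkthrough of Marco-o1 and Llava-CoT, but I have run out of steam for now, so I will stop here.
Both use fine-tuning on a CoT dataset plus some form of reasoning-path search, and I feel this is broadly the right direction for reasoning models.

On the other hand, reasoning-path search such as MCTS has to be bolted on, and having each research group publish its own separate inference code is inconvenient. I hope either that Transformers implements reasoning-path search or that a new library consolidating these search mechanisms appears.
Alternatively, if work proceeds to the reinforcement learning stage and these mechanisms become unnecessary, reasoning models may become easy to use with the Transformers library alone.

Skywork, which I did not get to this time, apparently describes how to build a PRM, and Steiner apparently also does reinforcement learning, so I plan to look into those next. There also seem to be several papers and efforts on reasoning-model development that have not released models, and I would like to read those too.

If you have comments or corrections, I would appreciate hearing them on Note or X (Twitter).

X: https://twitter.com/CurveWeb

Thank you for reading.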