How Sakana AI's AI-Scientist Uses the General-Purpose AI Agent Aider

October 4, 2024, 19:25

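How AI-Scientist works

AI-Scientist, released by Sakana AI, is AI-agent-style software that can conduct research and write papers on its own, much as a scientist would.

Workflow diagram illustrating how AI-Scientist operates (from the official materials)

As a basic premise, because everything has to be fully automated, AI-Scientist can only handle computer science ideas whose experiments can be carried out simply by running a program. The user has to prepare several things in advance: the original idea (seed_ideas.json), program code that performs the experiment when run with Python (experiment.py), program code that generates result figures from the data produced by experiment.py (plot.py), and a few other files.

Once that preparation is done, running launch_scientist.py carries the process through to a PDF paper, interacting with a generative AI along the way, in roughly the following five stages.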

Idea Generation
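Brainstorming

In this first step, the system converses with an LLM, starting from the user-provided seed_ideas.json, to enumerate candidate research ideas. Presumably because a simple conversation is enough here, this step calls the familiar APIs such as OpenAI's directly and does not use the Aider AI agent.

All of the officially provided sample research themes require a CUDA-capable GPU, which makes the environment painful to set up. So I made a research theme that can be tried easily on a CPU alone. For a study on "comparing performance across data scale, sorting algorithm, and parallel execution", I created seed_ideas.json and prompt.json.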

# templates/parallel_sorting/seed_ideas.json
[
  {
    "Name": "parallel_sorting_comparison",
    "Title": "Comparing Parallel Sorting Algorithms: Performance Analysis across Different CPU Cores and Data Sizes",
    "Experiment": "This experiment compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. We use execution time as the primary evaluation metric.",
    "Interestingness": 8,
    "Feasibility": 9,
    "Novelty": 6
  }
]

# templates/parallel_sorting/prompt.json
{
    "system": "You are an ambitious AI PhD student specializing in parallel computing and algorithm optimization, aiming to publish a groundbreaking paper in the field.",
    "task_description": "You are working with a file that compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. The experiment uses execution time as the primary evaluation metric. Your task is to propose innovative ideas to enhance this research, such as introducing new parallel sorting algorithms, optimizing existing ones, or exploring novel performance metrics beyond execution time."
}

The files shown here can be found under templates/parallel_sorting/ in this repository, a fork of AI-Scientist.

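Slightly off topic, but naturally I generated these files by running Aider locally. In my case, I realized at a later stage that experiment.py has to generate a file called final_info.json, so I told Aider that it apparently needed to output final_info.json and had it make the fix. Thanks to the AI agent, I got this far with almost no understanding of the code. If you also look at the commit log of the repository mentioned above, you can see what Aider committed.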
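
As a rough illustration of the requirement that experiment.py leave a final_info.json in its run directory, here is a minimal sketch. The helper names `time_sort` and `write_final_info` are hypothetical, and the nested `"means"` layout is an assumption about what AI-Scientist reads back; check it against the templates in the repository before relying on it.

```python
import argparse
import json
import os
import random
import time


def time_sort(sort_fn, size, seed=0):
    """Return the wall-clock seconds sort_fn needs to sort `size` random floats."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(size)]
    start = time.perf_counter()
    sort_fn(data)
    return time.perf_counter() - start


def write_final_info(out_dir, timings):
    """Write final_info.json into out_dir and return the path of the file."""
    os.makedirs(out_dir, exist_ok=True)
    # Assumed layout: one entry per algorithm, with metrics nested under "means".
    final_info = {
        name: {"means": {"training_time": seconds}}
        for name, seconds in timings.items()
    }
    path = os.path.join(out_dir, "final_info.json")
    with open(path, "w") as f:
        json.dump(final_info, f, indent=2)
    return path


def main(argv=None):
    """Entry point mirroring `python experiment.py --out_dir=run_i`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--out_dir", type=str, default="run_0")
    args = parser.parse_args(argv)
    # Placeholder workload: the built-in sorted() stands in for a parallel sort.
    timings = {"builtin_sorted": time_sort(sorted, 100_000)}
    return write_final_info(args.out_dir, timings)
```

Calling `main(["--out_dir=run_0"])` would create run_0/final_info.json. The metric name `training_time` is reused here only because it appears in the baseline results quoted later in this article.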

The following log is an excerpt of my conversation with Aider while building the template (lines starting with #### are user input).

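#### What do the files `templates/2d_diffusion/seed_ideas.json` and `templates/2d_diffusion/experiment.py` do?
Sure, let me explain the contents of these files.

1. `templates/2d_diffusion/seed_ideas.json`:
This file stores experiment ideas in JSON format.
...

#### /ask Ideas for computer science experiments that can complete one cycle in a few minutes on a CPU, without a GPU

Here are some ideas for computer science experiments that can complete one cycle in a few minutes on a CPU without using a GPU:

1. Comparison of parallel sorting algorithms:
   Compare the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort, etc.) while varying the data size and the number of CPU cores.
...

#### Using idea 1, create a directory under templates/ and, referring to the existing implementation, create seed_ideas.json and experiment.py

Understood. I will create a new directory and files for the parallel sorting algorithm comparison experiment. Below I propose the contents of `seed_ideas.json` and `experiment.py`.

#### Using 2d_diffusion/prompt.json as a reference, also create `templates/parallel_sorting/prompt.json`.
This new `prompt.json` file is modeled on the 2D diffusion example but tailored to the parallel sorting experiment.
...

#### `templates/2d_diffusion/experiment.py` writes final_info.json to the run dir, but parallel_sorting/experiment.py does not
Thank you for pointing that out. Let's add `final_info.json` output to `templates/parallel_sorting/experiment.py`.
...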

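AI-Scientist converses with the LLM to create a file called ideas.json from seed_ideas.json, enumerating candidate research ideas. For each one it uses a paper search API to check whether prior work exists, assessing novelty and narrowing the ideas down. (Because I wanted to try it out quickly, I edited the code directly to skip the prior-work check that uses the Semantic Scholar API. Note that doing this prevents citations from working properly during paper writing.)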

% python launch_scientist.py --experiment parallel_sorting --num-ideas 3

With the minimum files needed to run my own research in place, I ran launch_scientist.py with the command above.

# templates/parallel_sorting/ideas.json
{
    "Name": "time_memory_scalability_sorting",
    "Title": "Time, Memory, and Scalability Analysis of Parallel Sorting Algorithms",
    "Experiment": "Extend the existing experiment to measure both execution time and peak memory usage using the memory_profiler library. Modify the run_experiment function to return time and memory metrics. Introduce a composite efficiency score combining time and memory usage. Add a scalability analysis component to evaluate how algorithms perform with increasing data sizes and core counts. Calculate speedup and efficiency metrics for each algorithm. Update the results structure to include all these metrics. Analyze the time-memory trade-offs and scalability characteristics of different algorithms across various scenarios.",
    "Interestingness": 9,
    "Feasibility": 8,
    "Novelty": 9
},
{
    "Name": "adaptive_distribution_aware_sorting",
    "Title": "Adaptive Parallel Sorting Based on Data Distribution Analysis",
    "Experiment": "Implement data generation for uniform, normal (\u03bc=0, \u03c3=1), exponential (\u03bb=1), and skewed (\u03b1=4) distributions using numpy. Modify run_experiment to accept distribution type. Compare relative performance of parallel merge sort and quicksort across distributions, sizes, and core counts. Implement an adaptive algorithm that analyzes sample statistics (e.g., skewness, kurtosis) to choose between merge sort and quicksort. Evaluate performance gains of the adaptive approach across various scenarios.",
    "Interestingness": 9,
    "Feasibility": 6,
    "Novelty": 9
},
{
    "Name": "cache_friendly_parallel_sorting",
    "Title": "Optimizing Parallel Sorting Algorithms with Cache-Friendly Techniques",
    "Experiment": "Implement cache-oblivious parallel merge sort and block-based parallel quicksort. Modify run_experiment to include these new algorithms. Compare performance of original and cache-friendly variants across different data sizes and core counts. Develop a simple theoretical model to predict cache behavior based on data access patterns. Analyze the correlation between predicted cache efficiency and observed execution times. Update the results structure to include comparative analysis between original and cache-friendly variants, as well as the theoretical predictions versus actual performance.",
    "Interestingness": 9,
    "Feasibility": 8,
    "Novelty": 8
}

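The ideas are expanded, and an autonomous experiment cycle begins inside the computer.

From here on, Aider is used in elegant ways, which makes the process interesting.

About Aider: AI Pair Programming with LLM

Aider demo. A general-purpose AI agent tool that grasps your intent from a one-line instruction and creates and edits files appropriately across multiple files.

I think Aider falls into the same AI agent tool category as AI-Scientist, but it is a more general-purpose and more primitive one (in exchange, building anything elaborate "for now" still takes many rounds of interaction). It is originally a CLI tool meant to be used interactively in the terminal, as in the demo. Up to that point, popular tools such as Cursor probably offer similar functionality, but AI-Scientist embeds Aider programmatically through "Scripting aider", a way of using it that is unique to Aider.

# Excerpt of the part of AI-Scientist that uses Aider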
from aider.coders import Coder

coder = Coder.create(
  main_model=main_model,
  fnames=fnames,
  io=io,
  stream=False,
  use_git=False,
  edit_format="diff",
)

next_prompt = coder_prompt.format(
  title=idea["Title"],
  idea=idea["Experiment"],
  max_runs=MAX_RUNS,
  baseline_results=baseline_results,
)

coder_out = coder.run(next_prompt)
print(coder_out)
if "ALL_COMPLETED" in coder_out:
    break
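
The excerpt above ends with a `break`, which implies an enclosing loop. A minimal sketch of how such a loop might alternate between asking the coder for a change and running the experiment is shown below. `experiment_loop`, `ScriptedCoder`, `MAX_ITERS`, and the `run_experiment` callback signature are illustrative assumptions, not code quoted from AI-Scientist; `ScriptedCoder` stands in for the real `Coder` so the control flow can run without an LLM.

```python
MAX_RUNS = 5   # matches "up to 5 runs" in the prompt quoted in this article
MAX_ITERS = 4  # hypothetical retry budget for a run that keeps failing


class ScriptedCoder:
    """Stand-in for aider's Coder: replays canned replies instead of calling an LLM."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.prompts = []

    def run(self, prompt):
        self.prompts.append(prompt)
        return self.replies.pop(0)


def experiment_loop(coder, run_experiment, first_prompt):
    """Alternate coder edits and experiment runs; return the number of successful runs."""
    next_prompt = first_prompt
    run = 1
    failures = 0
    while run <= MAX_RUNS and failures < MAX_ITERS:
        coder_out = coder.run(next_prompt)
        if "ALL_COMPLETED" in coder_out:
            break
        # run_experiment returns (return_code, prompt_for_next_turn).
        return_code, next_prompt = run_experiment(run)
        if return_code == 0:
            run += 1
            failures = 0
        else:
            failures += 1
    return run - 1


coder = ScriptedCoder(["edit for run 1", "edit for run 2", "ALL_COMPLETED"])
completed = experiment_loop(coder, lambda run: (0, f"Run {run} completed."), "start")
print(completed)
```

The point of the design is that the coder's reply is only inspected for the sentinel string; the actual file edits happen as a side effect of `coder.run`.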

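If you are curious where Aider stands relative to other AI agent tools, see this article as well.

Aider inside AI-Scientist

Using the parallel sorting algorithm comparison theme that I have been running, let's look at the prompts actually used at each step.

Code Implementation & Experimentation
Code implementation and experiments

Aider is told in advance which files it may modify. In this phase, those are templates/parallel_sorting/experiment.py plus plot.py and notes.txt from the same directory (in practice, copies placed in a working directory so that the originals are not changed). AI-Scientist then sends Aider the following prompt.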

Your goal is to implement the following idea: Comparing Parallel Sorting Algorithms: Performance Analysis across Different CPU Cores and Data Sizes.  
The proposed experiment is as follows: This experiment compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. We use execution time as the primary evaluation metric..  
You are given a total of up to 5 runs to complete the necessary experiments. You do not need to use all 5.  
  
First, plan the list of experiments you would like to run. For example, if you are sweeping over a specific hyperparameter, plan each value you would like to test for each run.  
  
Note that we already provide the vanilla baseline results, so you do not need to re-run it.  
  
For reference, the baseline results are as follows:  
  
{'merge': {'training_time': 0.13898344948178246, 'eval_loss': 0, 'inference_time': 0.13898344948178246, 'kl_divergence': 0}, 'quick': {'training_time': 0.1187679336184547, 'eval_loss': 0, 'inference_time': 0.1187679336184547, 'kl_divergence': 0}}  
  
After you complete each change, we will run the command `python experiment.py --out_dir=run_i' where i is the run number and evaluate the results.  
YOUR PROPOSED CHANGE MUST USE THIS COMMAND FORMAT, DO NOT ADD ADDITIONAL COMMAND LINE ARGS.  
You can then implement the next thing on your list.

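The prompt conveys the research overview and the idea and goal currently being worked on, includes the baseline results, and has Aider think about what experiments to run next.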


Thank you for providing the context and guidelines. I understand that we're comparing parallel sorting algorithms across different CPU cores and data sizes, focusing on execution time as the primary metric. We'll use up to 5 runs to complete the experiments, and we already have the baseline results.

Here's the plan for the experiments:

1. Run 1: Increase the maximum number of cores to test scalability
2. Run 2: Expand the range of data sizes to test performance on larger datasets
3. Run 3: Implement a hybrid sorting algorithm combining merge sort and quicksort
4. Run 4: Optimize the parallel merge sort algorithm
5. Run 5: Optimize the parallel quicksort algorithm

Let's start with Run 1: Increasing the maximum number of cores to test scalability.

experiment.py
```python
<<<<<<< SEARCH
    parser.add_argument("--max_cores", type=int, default=mp.cpu_count())
=======
    parser.add_argument("--max_cores", type=int, default=min(32, mp.cpu_count()))
>>>>>>> REPLACE
```

This change will increase the maximum number of cores to 32 (or the maximum available on the system if less than 32). This will help us test the scalability of our sorting algorithms on systems with more cores.

Now, let's update the notes.txt file to reflect this change:

notes.txt
```text
<<<<<<< SEARCH
# Title: Comparing Parallel Sorting Algorithms: Performance Analysis across Different CPU Cores and Data Sizes
# Experiment description: This experiment compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. We use execution time as the primary evaluation metric.
## Run 0: Baseline
Results: {'merge': {'training_time': 0.13898344948178246, 'eval_loss': 0, 'inference_time': 0.13898344948178246, 'kl_divergence': 0}, 'quick': {'training_time': 0.1187679336184547, 'eval_loss': 0, 'inference_time': 0.1187679336184547, 'kl_divergence': 0}}
Description: Baseline results.
=======
# Title: Comparing Parallel Sorting Algorithms: Performance Analysis across Different CPU Cores and Data Sizes
# Experiment description: This experiment compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. We use execution time as the primary evaluation metric.
## Run 0: Baseline
Results: {'merge': {'training_time': 0.13898344948178246, 'eval_loss': 0, 'inference_time': 0.13898344948178246, 'kl_divergence': 0}, 'quick': {'training_time': 0.1187679336184547, 'eval_loss': 0, 'inference_time': 0.1187679336184547, 'kl_divergence': 0}}
Description: Baseline results.
## Run 1: Increased Maximum Cores
Description: Increased the maximum number of cores to 32 (or the maximum available on the system if less than 32) to test scalability of sorting algorithms on systems with more cores.
>>>>>>> REPLACE
```

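In this case, Aider came up with the idea of raising the maximum number of CPUs to see how the numbers change, and edited just one line of experiment.py. Appending the research note for Run 1 to notes.txt happens in the same response.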
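
To make the SEARCH/REPLACE format in the response above concrete, here is a minimal sketch of applying one such block to a file's text. This is not Aider's actual implementation, which is more forgiving about matching; `apply_search_replace` is a hypothetical helper that insists on exactly one exact match.

```python
def apply_search_replace(text, search, replace):
    """Replace the single exact occurrence of `search` in `text` with `replace`."""
    count = text.count(search)
    if count != 1:
        # Zero matches means the model misquoted the file; several mean the edit is ambiguous.
        raise ValueError(f"expected exactly one match for the SEARCH block, found {count}")
    return text.replace(search, replace, 1)


source = (
    "parser = argparse.ArgumentParser()\n"
    '    parser.add_argument("--max_cores", type=int, default=mp.cpu_count())\n'
)
edited = apply_search_replace(
    source,
    '    parser.add_argument("--max_cores", type=int, default=mp.cpu_count())',
    '    parser.add_argument("--max_cores", type=int, default=min(32, mp.cpu_count()))',
)
print(edited)
```

Because only the changed lines travel through the LLM, the size of the reply stays small even when the edited file is large.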

One pass is not enough for an experiment. A feedback loop refines the work through repeated iterations.

Run 1 completed. Here are the results:  
{'merge': {'training_time': 4.392096626758575, 'eval_loss': 0, 'inference_time': 4.392096626758575, 'kl_divergence': 0}, 'quick': {'training_time': 2.3139435155051093, 'eval_loss': 0, 'inference_time': 2.3139435155051093, 'kl_divergence': 0}}  
  
Decide if you need to re-plan your experiments given the result (you often will not need to).  
  
Someone else will be using `notes.txt` to perform a writeup on this in the future.  
Please include *all* relevant information for the writeup on Run 1, including an experiment description and the run number. Be as verbose as necessary.  
  
Then, implement the next thing on your list.  
We will then run the command `python experiment.py --out_dir=run_2'.  
YOUR PROPOSED CHANGE MUST USE THIS COMMAND FORMAT, DO NOT ADD ADDITIONAL COMMAND LINE ARGS.  
If you are finished with experiments, respond with 'ALL_COMPLETED'.

It passes the Run 1 results to Aider and prompts it to plan Run 2. The design is that Aider outputs ALL_COMPLETED when it judges the experiments sufficient.

Here is Aider's answer:

Based on the results of Run 1, there's no need to re-plan our experiments. We'll proceed with our original plan and move on to Run 2, which involves expanding the range of data sizes to test performance on larger datasets.

First, let's update the `notes.txt` file to include the results and description for Run 1:

notes.txt
```text
<<<<<<< SEARCH
# Title: Comparing Parallel Sorting Algorithms: Performance Analysis across Different CPU Cores and Data Sizes
# Experiment description: This experiment compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. We use execution time as the primary evaluation metric.
## Run 0: Baseline
Results: {'merge': {'training_time': 0.13898344948178246, 'eval_loss': 0, 'inference_time': 0.13898344948178246, 'kl_divergence': 0}, 'quick': {'training_time': 0.1187679336184547, 'eval_loss': 0, 'inference_time': 0.1187679336184547, 'kl_divergence': 0}}
Description: Baseline results.
## Run 1: Increased Maximum Cores
Description: Increased the maximum number of cores to 32 (or the maximum available on the system if less than 32) to test scalability of sorting algorithms on systems with more cores.
=======
# Title: Comparing Parallel Sorting Algorithms: Performance Analysis across Different CPU Cores and Data Sizes
# Experiment description: This experiment compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. We use execution time as the primary evaluation metric.
## Run 0: Baseline
Results: {'merge': {'training_time': 0.13898344948178246, 'eval_loss': 0, 'inference_time': 0.13898344948178246, 'kl_divergence': 0}, 'quick': {'training_time': 0.1187679336184547, 'eval_loss': 0, 'inference_time': 0.1187679336184547, 'kl_divergence': 0}}
Description: Baseline results.
## Run 1: Increased Maximum Cores
Description: Increased the maximum number of cores to 32 (or the maximum available on the system if less than 32) to test scalability of sorting algorithms on systems with more cores.
Results: {'merge': {'training_time': 4.392096626758575, 'eval_loss': 0, 'inference_time': 4.392096626758575, 'kl_divergence': 0}, 'quick': {'training_time': 2.3139435155051093, 'eval_loss': 0, 'inference_time': 2.3139435155051093, 'kl_divergence': 0}}
Analysis: In Run 1, we observed that increasing the maximum number of cores led to longer average execution times for both merge sort and quicksort algorithms. This unexpected result could be due to various factors:
1. Increased overhead: Using more cores might introduce additional overhead for task distribution and result collection.
2. Memory bandwidth limitations: With more cores, memory access might become a bottleneck, especially for larger datasets.
3. Workload distribution: The parallelization strategy might not be optimal for the given problem sizes and core counts.
4. System-specific factors: The performance could be influenced by the specific hardware and system configuration used for the experiment.

Quicksort appears to perform better than merge sort in this scenario, with an average execution time of about 2.31 seconds compared to merge sort's 4.39 seconds. This difference might be due to quicksort's in-place sorting nature, which could lead to better cache utilization and reduced memory overhead.

Further investigation and profiling would be beneficial to understand the performance characteristics better and potentially optimize the parallel implementations.

## Run 2: Expanded Data Sizes
Description: Add a larger data size (10,000,000 elements) to test the performance of sorting algorithms on bigger datasets.
>>>>>>> REPLACE
```

Now, let's update the `experiment.py` file to include the larger data size for Run 2:

experiment.py
```python
<<<<<<< SEARCH
    parser.add_argument("--data_sizes", nargs="+", type=int, default=[10000, 100000, 1000000])
=======
    parser.add_argument("--data_sizes", nargs="+", type=int, default=[10000, 100000, 1000000, 10000000])
>>>>>>> REPLACE
```

> Tokens: 10,388 sent, 1,192 received. Cost: $0.05 message, $0.14 session.

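This time Aider edited notes.txt first and came up with the idea of increasing the data size. As before, the edit to experiment.py appears to take just one line. After this reply from Aider, AI-Scientist runs python experiment.py and saves the Run 2 results to a file. The experiment loop proceeds in this way.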
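
The "run the command, then read the results" step can be sketched as follows. This is an illustration under assumptions, not AI-Scientist's own code: the function name `run_experiment`, its return convention, and the 600-second timeout are hypothetical, while the command format `python experiment.py --out_dir=run_i` comes from the prompt quoted earlier.

```python
import json
import os
import subprocess
import sys


def run_experiment(folder, run_num, timeout=600):
    """Run experiment.py inside `folder`; return (return_code, results or error text)."""
    command = [sys.executable, "experiment.py", f"--out_dir=run_{run_num}"]
    try:
        proc = subprocess.run(
            command, cwd=folder, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return 1, f"Run {run_num} timed out after {timeout} seconds"
    if proc.returncode != 0:
        # The error text can be fed back to the coder so it can repair the script.
        return proc.returncode, proc.stderr
    result_path = os.path.join(folder, f"run_{run_num}", "final_info.json")
    with open(result_path) as f:
        return 0, json.load(f)
```

On success the loaded dictionary is what gets pasted into the next "Run N completed. Here are the results:" prompt; on failure the stderr text plays that role instead.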

This time I also included the token breakdown for processing the Aider prompt. Using Claude 3.5 Sonnet, the message for Run 2 cost $0.05.
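
The $0.05 figure can be checked by hand from the token counts in the log line. The sketch below assumes Claude 3.5 Sonnet list prices of $3 per million input tokens and $15 per million output tokens; those prices are an assumption not stated in this article, and `message_cost` is a hypothetical helper.

```python
INPUT_USD_PER_MTOK = 3.00    # assumed price per million input tokens
OUTPUT_USD_PER_MTOK = 15.00  # assumed price per million output tokens


def message_cost(sent_tokens, received_tokens):
    """Return the estimated cost in USD of one request/response pair."""
    return (
        sent_tokens * INPUT_USD_PER_MTOK + received_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Token counts from the log line: "Tokens: 10,388 sent, 1,192 received."
cost = message_cost(10_388, 1_192)
print(f"${cost:.2f}")  # prints $0.05
```

Under these prices the input side contributes about $0.031 and the output side about $0.018, so the prompt, which carries the file contents, costs more than the reply.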

Plotting Experimentation Results
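Plotting the experiment results

run_plotting() does nothing particularly difficult. It instructs Aider to check whether the plot.py prepared at the start should be updated, in case, for example, the data structure of the results changed during the experiment loop.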

Great job! Please modify `plot.py` to generate the most relevant plots for the final writeup. 

In particular, be sure to fill in the "labels" dictionary with the correct names for each run that you want to plot.

Only the runs in the `labels` dictionary will be plotted, so make sure to include all relevant runs.

We will be running the command `python plot.py` to generate the plots.

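"Great job!" refers to the experiment loop having finished.

Below are the image actually drawn by plot.py this time and an example of notes.txt.

Result image for run_2. Following the experiment idea of seeing what happens with a larger data scale, a fourth graph was added. The result shows quicksort performing well.

# notes.txt as updated through run_2, the second experiment iteration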

# Title: Comparing Parallel Sorting Algorithms: Performance Analysis across Different CPU Cores and Data Sizes
# Experiment description: This experiment compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. We use execution time as the primary evaluation metric.
## Run 0: Baseline
Results: {'merge': {'training_time': 0.13898344948178246, 'eval_loss': 0, 'inference_time': 0.13898344948178246, 'kl_divergence': 0}, 'quick': {'training_time': 0.1187679336184547, 'eval_loss': 0, 'inference_time': 0.1187679336184547, 'kl_divergence': 0}}
Description: Baseline results.
## Run 1: Increased Maximum Cores
Description: Increased the maximum number of cores to 32 (or the maximum available on the system if less than 32) to test scalability of sorting algorithms on systems with more cores.
Results: {'merge': {'training_time': 4.392096626758575, 'eval_loss': 0, 'inference_time': 4.392096626758575, 'kl_divergence': 0}, 'quick': {'training_time': 2.3139435155051093, 'eval_loss': 0, 'inference_time': 2.3139435155051093, 'kl_divergence': 0}}
Analysis: In Run 1, we observed that increasing the maximum number of cores led to longer average execution times for both merge sort and quicksort algorithms. This unexpected result could be due to various factors:
1. Increased overhead: Using more cores might introduce additional overhead for task distribution and result collection.
2. Memory bandwidth limitations: With more cores, memory access might become a bottleneck, especially for larger datasets.
3. Workload distribution: The parallelization strategy might not be optimal for the given problem sizes and core counts.
4. System-specific factors: The performance could be influenced by the specific hardware and system configuration used for the experiment.

Quicksort appears to perform better than merge sort in this scenario, with an average execution time of about 2.31 seconds compared to merge sort's 4.39 seconds. This difference might be due to quicksort's in-place sorting nature, which could lead to better cache utilization and reduced memory overhead.

Further investigation and profiling would be beneficial to understand the performance characteristics better and potentially optimize the parallel implementations.

## Run 2: Expanded Data Sizes
Description: Added a larger data size (10,000,000 elements) to test the performance of sorting algorithms on bigger datasets.
Results: {'merge': {'training_time': 4.449713490690504, 'eval_loss': 0, 'inference_time': 4.449713490690504, 'kl_divergence': 0}, 'quick': {'training_time': 2.311549878120423, 'eval_loss': 0, 'inference_time': 2.311549878120423, 'kl_divergence': 0}}
Analysis: In Run 2, we expanded the range of data sizes to include a larger dataset of 10,000,000 elements. The results show:
1. Consistent performance trend: Similar to Run 1, quicksort continues to outperform merge sort, even with the larger dataset.
2. Scalability: Both algorithms show similar execution times compared to Run 1, suggesting good scalability to larger data sizes.
3. Quicksort advantage: Quicksort maintains its performance edge, with an average execution time of about 2.31 seconds compared to merge sort's 4.45 seconds.
4. Minimal impact of data size increase: The execution times for both algorithms are very close to those in Run 1, indicating that the increase in data size from 1,000,000 to 10,000,000 elements did not significantly impact performance.

These results suggest that both algorithms handle the larger dataset efficiently, with quicksort maintaining its performance advantage. The similar execution times between Run 1 and Run 2 might indicate that the algorithms are well-optimized for larger datasets or that other factors (such as parallelization overhead or memory bandwidth) are limiting further performance improvements.

Once plotting is complete, it instructs Aider to append the corresponding file names to notes.txt. I think recording this mapping here matters later, when writing the paper.

Please modify `notes.txt` with a description of what each plot shows along with the filename of the figure. Please do so in-depth.

Somebody else will be using `notes.txt` to write a report on this in the future.

Paper Write-up
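Writing the paper

I will skip a detailed walkthrough of the remaining steps because it would get too long. Using the interim notes in notes.txt and the images generated by plot.py, AI-Scientist first has the Abstract written, then gives Aider a separate prompt per section: Introduction, Background, Method, Experimental Setup, Results, Conclusion. One deliberate touch in the ordering seems to be that it tackles Related Work, which is Section 2, only after the Conclusion has been written.

What deserves special mention in this stage is the references section at the end of the paper. Existing papers must be cited accurately, so vague approximations will not do. In outline: first an LLM API is asked which papers should be cited, yielding a search keyword (Query) to send to the API of a paper database (SemanticScholar.org). Then, based on the actual Semantic Scholar API results, the LLM selects the papers to cite and returns why it considers each one necessary (Description). The result is converted to BibTeX form (the list of cited papers) and included in the prompt to Aider. Aider's job is to use the BibTeX information to insert \cite into the body of the paper.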

The following citations have just been added to the end of the `references.bib` file definition at the top of the file:
"""
{bibtex}
"""
You do not need to add them yourself.
ABSOLUTELY DO NOT ADD IT AGAIN!!!

...(omitted)...

You must use \cite or \citet to reference papers, do not manually type out author names

Aider, and the LLM behind it, are sternly and repeatedly warned not to add fictitious papers 😭
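
The last hop of the citation flow, turning selected papers into BibTeX and placing them in the prompt, can be sketched as below. The prompt text is the excerpt quoted above; everything else is hypothetical: the `to_bibtex` and `build_citation_prompt` helpers, the dictionary fields, and the placeholder paper, which is not a real reference.

```python
CITATION_PROMPT = '''The following citations have just been added to the end of the `references.bib` file definition at the top of the file:
"""
{bibtex}
"""
You do not need to add them yourself.
ABSOLUTELY DO NOT ADD IT AGAIN!!!'''


def to_bibtex(paper):
    """Render one paper dict (hypothetical fields) as a BibTeX @article entry."""
    authors = " and ".join(paper["authors"])
    return (
        f"@article{{{paper['key']},\n"
        f"  title = {{{paper['title']}}},\n"
        f"  author = {{{authors}}},\n"
        f"  year = {{{paper['year']}}}\n"
        f"}}"
    )


def build_citation_prompt(papers):
    """Fill the prompt template with the BibTeX entries of the selected papers."""
    bibtex = "\n\n".join(to_bibtex(p) for p in papers)
    return CITATION_PROMPT.format(bibtex=bibtex)


# Placeholder metadata for illustration only; not a real paper.
placeholder = {
    "key": "placeholder2024",
    "title": "Placeholder Title",
    "authors": ["A. Author", "B. Author"],
    "year": 2024,
}
print(build_citation_prompt([placeholder]))
```

Since the entries come from the search API rather than from the model's memory, the coder only has to place `\cite{placeholder2024}`-style keys in the text and never has to produce bibliographic details itself.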

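Aider techniques

I would like to highlight three Aider techniques that I feel AI-Scientist pioneered.

1. Don't make the LLM produce very large outputs

Today's LLMs still have a short output context size, and they cannot stay focused over a large context either. Aider does work around this internally, stitching together the particularly short outputs of Claude 3.5 Sonnet so that longer outputs become possible. Even so, a single prompt cannot write several kinds of experiment code and go all the way to writing the paper, so AI-Scientist splits the work into fine-grained steps. This is about what one would expect, of course, but I think showing that stacking up small steps really can produce an output as large as a paper is a significant achievement.

2. Progress notes make long-running loops easier

At each step of the workflow, AI-Scientist appends progress notes to notes.txt so that nothing important is missed when the paper is finally written.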

# results/parallel_sorting/xxx_parallel_sorting_comparison/notes.txt

# Title: Comparing Parallel Sorting Algorithms: Performance Analysis across Different CPU Cores and Data Sizes
# Experiment description: This experiment compares the performance of different parallel sorting algorithms (parallel merge sort, parallel quicksort) across varying data sizes and number of CPU cores. We use execution time as the primary evaluation metric.
## Run 0: Baseline
Results: {'merge': {'training_time': 0.13898344948178246, 'eval_loss': 0, 'inference_time': 0.13898344948178246, 'kl_divergence': 0}, 'quick': {'training_time': 0.1187679336184547, 'eval_loss': 0, 'inference_time': 0.1187679336184547, 'kl_divergence': 0}}
Description: Baseline results.
## Run 1: Increased Maximum Cores
Description: Increased the maximum number of cores to 32 (or the maximum available on the system if less than 32) to test scalability of sorting algorithms on systems with more cores.
Results: {'merge': {'training_time': 4.392096626758575, 'eval_loss': 0, 'inference_time': 4.392096626758575, 'kl_divergence': 0}, 'quick': {'training_time': 2.3139435155051093, 'eval_loss': 0, 'inference_time': 2.3139435155051093, 'kl_divergence': 0}}
Analysis: In Run 1, we observed that increasing the maximum number of cores led to longer average execution times for both merge sort and quicksort algorithms. This unexpected result could be due to various factors:
1. Increased overhead: Using more cores might introduce additional overhead for task distribution and result collection.
2. Memory bandwidth limitations: With more cores, memory access might become a bottleneck, especially for larger datasets.
3. Workload distribution: The parallelization strategy might not be optimal for the given problem sizes and core counts.
4. System-specific factors: The performance could be influenced by the specific hardware and system configuration used for the experiment.

Quicksort appears to perform better than merge sort in this scenario, with an average execution time of about 2.31 seconds compared to merge sort's 4.39 seconds. This difference might be due to quicksort's in-place sorting nature, which could lead to better cache utilization and reduced memory overhead.

Further investigation and profiling would be beneficial to understand the performance characteristics better and potentially optimize the parallel implementations.

## Run 2: Expanded Data Sizes
Description: Added a larger data size (10,000,000 elements) to test the performance of sorting algorithms on bigger datasets.
Results: {'merge': {'training_time': 4.449713490690504, 'eval_loss': 0, 'inference_time': 4.449713490690504, 'kl_divergence': 0}, 'quick': {'training_time': 2.311549878120423, 'eval_loss': 0, 'inference_time': 2.311549878120423, 'kl_divergence': 0}}
Analysis: In Run 2, we expanded the range of data sizes to include a larger dataset of 10,000,000 elements. The results show:
1. Consistent performance trend: Similar to Run 1, quicksort continues to outperform merge sort, even with the larger dataset.
2. Scalability: Both algorithms show similar execution times compared to Run 1, suggesting good scalability to larger data sizes.
3. Quicksort advantage: Quicksort maintains its performance edge, with an average execution time of about 2.31 seconds compared to merge sort's 4.45 seconds.
4. Minimal impact of data size increase: The execution times for both algorithms are very close to those in Run 1, indicating that the increase in data size from 1,000,000 to 10,000,000 elements did not significantly impact performance.

These results suggest that both algorithms handle the larger dataset efficiently, with quicksort maintaining its performance advantage. The similar execution times between Run 1 and Run 2 might indicate that the algorithms are well-optimized for larger datasets or that other factors (such as parallelization overhead or memory bandwidth) are limiting further performance improvements.

## Run 3: Implement Hybrid Sorting Algorithm
Description: Implement a hybrid sorting algorithm that combines merge sort and quicksort to potentially leverage the strengths of both algorithms.
...

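This is part of the notes.txt actually generated in this parallel_sorting experiment. It summarizes how far the work has progressed and what result each run produced. When you want an LLM to produce a large output, preparing a text file where progress can be checked seems to make things go well even in a loop.

3. Edit files directly

Working in a directory under a git repository, Aider interprets the instructions in the prompt and edits files directly.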

# ai_scientist/perform_experiments.py::perform_experiments()
next_prompt = """
Please modify `notes.txt` with a description of what each plot shows along with the filename of the figure. Please do so in-depth.
Somebody else will be using `notes.txt` to write a report on this in the future.
"""
coder.run(next_prompt)

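Let's look at the notes.txt example introduced earlier. In AI-Scientist, after the experiment loop and figure export for one idea are finished, Aider is instructed, as above, to append to notes.txt the mapping between each figure file name and its experiment, along with a brief evaluation. Concretely, by the time the coder.run(next_prompt) line returns, notes.txt has already been updated. In other words, there is no need to receive the result as in result = coder.run(next_prompt), open the file with something like open("notes.txt"), and append to it yourself. Even with instructions phrased as if talking to a person, Aider handles the file operations without trouble. That keeps the code concise and lets you extend behavior just by changing the instruction, and since the LLM only has to output the edit diff, a growing file does not become a bottleneck (in cost or in stability). I feel this is a good use of Aider.

If you fork AI-Scientist…

Depending on your ideas, I think it could also handle the kinds of output found in ordinary business settings, not just research papers.

For example, I think it could start from an idea and generate a specification with neither too much nor too little detail for building a particular piece of software, web service, or smartphone app, autonomously iterating by trial and error and having agents review each other's work. In general, an idea in its raw state has an ambiguous spec, so I think there is a limit to the scale of what an AI can build when left to it. But once a specification free of ambiguity exists, a future in which code gets finished simply by running Aider in a loop until the features work seems close.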