Trying Llama3-8B-1.58-100B-tokens with bitnet.cpp

October 19, 2024, 04:34

I tried running "Llama3-8B-1.58-100B-tokens" with "bitnet.cpp", so here is a summary.

1. bitnet.cpp

"bitnet.cpp" is an inference framework for 1-bit LLMs developed by Microsoft. Its main features are as follows.

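・1-bit LLM support : Enables inference specialized for 1-bit LLMs such as BitNet b1.58.
・CPU optimization : Currently focused on CPU inference, with NPU and GPU support planned for the future.
・Fast inference : Speedups of 1.37x to 5.07x on ARM CPUs and 2.37x to 6.17x on x86 CPUs.
・Energy efficiency : Energy consumption reduced by 55.4% to 70.0% on ARM CPUs and by 71.9% to 82.2% on x86 CPUs.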
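
To make the "1.58-bit" in BitNet b1.58 concrete: each weight takes one of three values (-1, 0, +1), and three values carry log2(3), roughly 1.58, bits. The following is a minimal pure-Python sketch of absmean ternary quantization as described for BitNet b1.58. It is an illustration only, not bitnet.cpp's actual kernel or storage format, and the function name "absmean_ternary" is hypothetical.

```python
import math

def absmean_ternary(weights, eps=1e-8):
    """Quantize a non-empty list of floats to {-1, 0, +1} using an absmean scale."""
    # Scale = mean absolute value of the weights (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

# Three possible values per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

q, s = absmean_ternary([0.9, -0.05, -1.4, 0.3])
print(q, f"{bits_per_weight:.2f} bits per weight")
```

Small weights collapse to 0 and large ones saturate at -1 or +1, which is why matrix multiplication with such weights reduces to additions and subtractions.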

2. Llama3-8B-1.58-100B-tokens

"Llama3-8B-1.58-100B-tokens" is a model based on "Llama-3-8B-Instruct" and fine-tuned on the BitNet b1.58 architecture.

・HF1BitLLM/Llama3-8B-1.58-100B-tokens

3. Trying Llama3-8B-1.58-100B-tokens with bitnet.cpp

The steps for trying "Llama3-8B-1.58-100B-tokens" with "bitnet.cpp" on a local machine (I used an M3 Max) are as follows.

(1) Prepare a Python virtual environment.

(2) Clone the repository and install the dependencies.

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
pip install -r requirements.txt

(3) Set up.
This downloads the model from Hugging Face, converts it to the quantized gguf format, and builds the project.


python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

(4) Run inference.

python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "What anime is popular in Japan?\nAnswer:" -n 256 -temp 0.1

What anime is popular in Japan?
Answer: One Piece
What is the most popular anime in Japan?
Answer: Dragon Ball
What is the most popular anime in the world?
Answer: Dragon Ball
What is the most popular anime in the US?
Answer: Dragon Ball
What is the most popular anime in the UK?
Answer: Dragon Ball
:
llama_perf_sampler_print: sampling time = 11.08 ms / 266 runs ( 0.04 ms per token, 24013.72 tokens per second)
llama_perf_context_print: load time = 418.17 ms
llama_perf_context_print: prompt eval time = 618.46 ms / 10 tokens ( 61.85 ms per token, 16.17 tokens per second)
llama_perf_context_print: eval time = 16038.58 ms / 255 runs ( 62.90 ms per token, 15.90 tokens per second)
llama_perf_context_print: total time = 16674.58 ms / 265 tokens

[Japanese translation of the output]
日本で人気のあるアニメは何ですか？
Answer: ワンピース
日本で最も人気のあるアニメは何ですか？
Answer: ドラゴンボール
世界で最も人気のあるアニメは何ですか？
Answer: ドラゴンボール
アメリカで最も人気のあるアニメは何ですか？
Answer: ドラゴンボール
イギリスで最も人気のあるアニメは何ですか？
Answer: ドラゴンボール
:

On the M3 Max, the generation speed (eval time) was 15.90 tokens per second.
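
The tokens-per-second figure can be recomputed from the "ms / runs" part of the log. The sketch below parses the eval line copied from the output above. The helper name "tokens_per_second" and the regular expression are my own assumptions about the line layout shown here, not part of bitnet.cpp or llama.cpp.

```python
import re

# Eval line copied verbatim from the log above.
EVAL_LINE = ("llama_perf_context_print: eval time = 16038.58 ms / 255 runs "
             "( 62.90 ms per token, 15.90 tokens per second)")

def tokens_per_second(line):
    """Recompute throughput from the 'X ms / N runs' part of a perf line."""
    match = re.search(r"=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:runs|tokens)", line)
    if match is None:
        raise ValueError("not a timing line with a run/token count")
    elapsed_ms = float(match.group(1))
    count = int(match.group(2))
    # Convert milliseconds to seconds before dividing.
    return count / (elapsed_ms / 1000.0)

rate = tokens_per_second(EVAL_LINE)
print(f"{rate:.2f} tokens per second")  # prints "15.90 tokens per second"
```

The same helper works on the "prompt eval time" line, which reports tokens instead of runs.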

(5) Adjust repeat-penalty and run inference again.
I added "--repeat-penalty" to the command that "run_inference.py" builds.

    command = [
        f'{main_path}',
        '-m', args.model,
        '-n', str(args.n_predict),
        '-t', str(args.threads),
        '-p', args.prompt,
        '--repeat-penalty','1.1', # added
        '-ngl', '0',
        '-c', str(args.ctx_size),
        '--temp', str(args.temperature),
        "-b", "1"
    ]
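
For intuition about what a repeat penalty of 1.1 does, here is a simplified sketch of the commonly used repetition-penalty rule: scores of tokens that appeared recently are divided by the penalty when positive and multiplied by it when negative. This is an illustration under that assumption, not the actual llama.cpp sampler code, and "apply_repeat_penalty" is a hypothetical name.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """Return a copy of logits with recently generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(recent_tokens):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive scores
        else:
            adjusted[token_id] *= penalty   # push negative scores further down
    return adjusted

# Toy vocabulary of three tokens; tokens 0 and 1 were generated recently.
logits = [2.2, -1.0, 0.5]
penalized = apply_repeat_penalty(logits, recent_tokens=[0, 1, 0])
print(penalized)
```

A penalty of 1.0 leaves the scores unchanged, and larger values discourage repetition more strongly.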

The command is the same one used earlier.

python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "What anime is popular in Japan?\nAnswer:" -n 256 -temp 0.1

What anime is popular in Japan?
Answer: One Piece
One Piece is a manga and anime series created by Eiichiro Oda. It follows the adventures of Luffy, a pirate who wants to become the Pirate King of the Seven Seas.
The story begins with Luffy’s childhood dream of becoming the Pirate King of the Seven Seas. He meets the legendary Pirate King, Gol D. Roger, and is given a chance to join his crew as a member of the Straw Hat Pirates. The series follows Luffy’s journey as he tries to become the Pirate King.
The story has been praised for its action-packed plot, colorful characters, and its use of Japanese culture and history. It has also been criticized for its portrayal of women and its use of violence.
What is the most popular anime in Japan?
Answer: One Piece
One Piece is a manga and anime series created by Eiichiro Oda. It follows the adventures of Luffy, a pirate who wants to become the Pirate King of the Seven Seas.
The story begins with Luffy’s childhood dream of becoming the Pirate King. He meets the legendary Pirate King, Gol D. Roger, and is given a chance to join his crew as a member of the Straw Hat Pirates. The series follows Luffy’s journey as he tries to become

llama_perf_sampler_print: sampling time = 68.71 ms / 266 runs ( 0.26 ms per token, 3871.23 tokens per second)
llama_perf_context_print: load time = 414.23 ms
llama_perf_context_print: prompt eval time = 602.73 ms / 10 tokens ( 60.27 ms per token, 16.59 tokens per second)
llama_perf_context_print: eval time = 16026.76 ms / 255 runs ( 62.85 ms per token, 15.91 tokens per second)
llama_perf_context_print: total time = 16704.59 ms / 265 tokens

[Japanese translation of the output]
日本で人気のあるアニメは何ですか？
Answer: ワンピース
ワンピースは尾田栄一郎による漫画およびアニメシリーズです。七つの海の海賊王を目指す海賊ルフィの冒険を描いています。物語は、七つの海の海賊王になるというルフィの幼少期の夢から始まります。彼は伝説の海賊王ゴール・D・ロジャーと出会い、麦わらの一味の一員として彼の乗組員に加わるチャンスを与えられます。このシリーズは、海賊王になろうとするルフィの旅を追っています。
この物語は、アクション満載のプロット、多彩なキャラクター、日本の文化と歴史の活用が高く評価されています。一方、女性の描写や暴力の使用については批判されています。
日本で最も人気のあるアニメは何ですか？
Answer: ワンピース
ワンピースは尾田栄一郎による漫画およびアニメシリーズです。七つの海の海賊王を目指す海賊ルフィの冒険を描いた作品。物語は、ルフィが子供の頃に海賊王になるという夢を抱いたことから始まります。彼は伝説の海賊王ゴール・D・ロジャーと出会い、麦わらの一味の一員として彼の仲間に加わるチャンスを得ます。このシリーズは、海賊王を目指すルフィの旅を追っています。

Trying out bitnet.cpp.
M3 Max : 15.90 tokens per second https://t.co/gTShRUSagm

[Japanese translation of the output]
日本で人気のアニメは何ですか？
Answer: ワンピース… pic.twitter.com/RVpUZxImSX
— 布留川英一 / Hidekazu Furukawa (@npaka123) October 18, 2024

[Bonus] Thread count

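Increasing the thread count by passing "-t 16" to "run_inference.py" made inference faster: generation speed rose from 15.90 to 24.27 tokens per second.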

python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "What anime is popular in Japan?\nAnswer:" -n 256 -temp 0.1 -t 16

llama_perf_sampler_print: sampling time = 108.64 ms / 266 runs ( 0.41 ms per token, 2448.34 tokens per second)
llama_perf_context_print: load time = 437.10 ms
llama_perf_context_print: prompt eval time = 481.97 ms / 10 tokens ( 48.20 ms per token, 20.75 tokens per second)
llama_perf_context_print: eval time = 10505.99 ms / 255 runs ( 41.20 ms per token, 24.27 tokens per second)
llama_perf_context_print: total time = 11107.85 ms / 265 tokens
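
As a quick check of how much "-t 16" helped, the arithmetic below compares the eval throughput and total time of the run without "-t" against the "-t 16" run. All numbers are copied from the logs in this article; the variable names are only for illustration.

```python
# Eval throughput (tokens per second) reported in the two runs.
baseline_tps = 15.90   # run without -t
t16_tps = 24.27        # run with -t 16

# Total wall time (ms) reported in the two runs.
baseline_total_ms = 16674.58
t16_total_ms = 11107.85

speedup = t16_tps / baseline_tps
time_saved = (baseline_total_ms - t16_total_ms) / baseline_total_ms

print(f"speedup: {speedup:.2f}x")          # prints "speedup: 1.53x"
print(f"time saved: {time_saved:.1%}")     # prints "time saved: 33.4%"
```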

On a Mac, the number of physical and logical cores can be checked as follows.

$ sysctl -n hw.physicalcpu
16
$ sysctl -n hw.logicalcpu
16
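
If you want to pick the thread count from a script, the following sketch reads the same sysctl value from Python and falls back to the logical CPU count on systems where that key is not available. The function name "physical_cpu_count" is hypothetical, and whether the physical core count is the best "-t" value for a given machine is an assumption worth measuring.

```python
import os
import subprocess

def physical_cpu_count():
    """Best-effort physical core count; falls back to the logical count."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.physicalcpu"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return int(out)
    except (OSError, subprocess.CalledProcessError, ValueError):
        # Not macOS (or sysctl unavailable): use the logical CPU count instead.
        return os.cpu_count() or 1

threads = physical_cpu_count()
print(f"suggested -t value: {threads}")
```

The resulting number can then be passed to "run_inference.py" with "-t".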