Trying out bitnet.cpp

ぬこぬこ

October 18, 2024, 11:48

tl;dr

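Microsoft has released bitnet.cpp, an inference framework for 1-bit LLMs
It is a CPU inference framework based on llama.cpp
An 8B-parameter model quantized to 1.58 bits can run on a single CPU
I wrote up the setup and run steps for macOS (verified with uv)
Inference in English looks good, but Japanese output seems a bit rough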

Microsoft has released bitnet.cpp, a llama.cpp-based inference framework for 1-bit LLMs. It looked interesting, so I tried running it.

Mac Studio 2023
Apple M2 Ultra
Memory 192GB
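As background on the "1.58-bit" in the model name: it usually refers to ternary weights, where each weight is one of -1, 0, or +1, so the information per weight is log2(3). The sketch below just does that arithmetic. The function names are my own, and the size estimate is idealized (it assumes every one of the 8B parameters is ternary and ignores padding, metadata, and any layers kept at higher precision).

```python
import math

# Assumption: a "1.58-bit" weight is ternary, i.e. one of -1, 0, +1.
TERNARY_STATES = 3


def bits_per_weight(states: int = TERNARY_STATES) -> float:
    """Information content in bits of one weight with `states` equally likely values."""
    return math.log2(states)


def ideal_weight_bytes(n_params: float, states: int = TERNARY_STATES) -> float:
    """Idealized storage for n_params weights; real file formats need more."""
    return n_params * bits_per_weight(states) / 8


if __name__ == "__main__":
    print(f"{bits_per_weight():.3f} bits per weight")
    print(f"{ideal_weight_bytes(8e9) / 1e9:.2f} GB for 8B weights (idealized)")
```

The actual GGUF file will be larger than this lower bound, since practical formats round up to whole bits per weight and store extra data.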

Local environment (macOS)

The README.md recommends running with conda, but I personally like uv, so I use uv here. This requires a minor change to the source code, so if it does not work in your environment, try conda instead.

First, git clone the repository. It includes llama.cpp as a submodule, so add the --recursive option.

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

Next, set up the uv environment. The README.md uses Python 3.9, but 3.11 worked without problems, so I specify 3.11. Install the existing requirements.txt with uv add.

uv init --python 3.11
uv add -r requirements.txt

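Running setup_env.py as-is fails (presumably because an environment created by uv does not include pip), so it needs a fix. In setup_env.py, rewrite the hard-coded pip install command to use uv. It is around line 117.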

def setup_gguf():
    # Install the pip package
    # run_command([sys.executable, "-m", "pip", "install", "3rdparty/llama.cpp/gguf-py"], log_step="install_gguf")
    run_command(["uv", "add", "3rdparty/llama.cpp/gguf-py"], log_step="install_gguf")
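If you would rather not hard-code uv either, one option is to pick the installer at run time. This is only a sketch of that idea, not something from the BitNet repository: gguf_install_command is a hypothetical helper name, and it assumes that finding uv on PATH means the project is managed by uv.

```python
import shutil
import sys

# Path copied from the original setup_env.py.
GGUF_PY_PATH = "3rdparty/llama.cpp/gguf-py"


def gguf_install_command(uv_available=None):
    """Return the command list that installs gguf-py.

    Hypothetical helper: uses `uv add` when uv is on PATH, otherwise
    falls back to the original `python -m pip install` call.
    """
    if uv_available is None:
        uv_available = shutil.which("uv") is not None
    if uv_available:
        return ["uv", "add", GGUF_PY_PATH]
    return [sys.executable, "-m", "pip", "install", GGUF_PY_PATH]
```

Inside setup_gguf() the call would then become run_command(gguf_install_command(), log_step="install_gguf"), which keeps the conda route working unchanged on machines without uv.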

Now run setup_env.py.

uv run setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

> uv run setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s
INFO:root:Compiling the code using CMake.
INFO:root:Downloading model HF1BitLLM/Llama3-8B-1.58-100B-tokens from HuggingFace to models/Llama3-8B-1.58-100B-tokens...
INFO:root:Converting HF model to GGUF format...
INFO:root:GGUF model saved at models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf

Execution log
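To sanity-check the converted file before running inference, you can read the fixed-size header at the start of the GGUF file. The sketch below assumes the GGUF v2-or-later layout (4-byte magic "GGUF", a uint32 version, then uint64 tensor and metadata key-value counts, all little-endian). The function names are my own and this is not part of bitnet.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Little-endian: 4-byte magic, uint32 version, uint64 tensor count, uint64 KV count.
# Assumption: GGUF v2 or later (v1 used 32-bit counts).
HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(raw: bytes) -> dict:
    """Parse the first HEADER.size bytes of a GGUF file."""
    if len(raw) < HEADER.size:
        raise ValueError("too short to contain a GGUF header")
    magic, version, n_tensors, n_kv = HEADER.unpack(raw[:HEADER.size])
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {"version": version, "tensors": n_tensors, "kv_pairs": n_kv}


def read_gguf_header(path: str) -> dict:
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER.size))
```

Calling read_gguf_header("models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf") should report the same version, tensor count, and key-value count that llama_model_loader prints when the model is loaded.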

Setup finished in 4 minutes 35 seconds. Next, run inference following the official sample.

uv run run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0

> uv run run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0
build: 3946 (53717102) with Apple clang version 16.0.0 (clang-1600.0.26.3) for arm64-apple-darwin24.0.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Llama3-8B-1.58-100B-tokens
llama_model_loader: - kv 2: llama.block_count u32 = 32
…
Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?
Answer: Mary is in the garden.
(Reference translation into Japanese)
ダニエルは庭に戻った。メアリーはキッチンに向かった。サンドラはキッチンに向かった。サンドラは廊下へ。ジョンは寝室へ。メアリーは庭に戻りました。メアリーはどこにいますか？メアリーは庭にいます。

llama_perf_sampler_print: sampling time = 0.26 ms / 54 runs ( 0.00 ms per token, 204545.45 tokens per second)
llama_perf_context_print: load time = 3877.35 ms
llama_perf_context_print: prompt eval time = 3231.54 ms / 48 tokens ( 67.32 ms per token, 14.85 tokens per second)
llama_perf_context_print: eval time = 337.25 ms / 5 runs ( 67.45 ms per token, 14.83 tokens per second)
llama_perf_context_print: total time = 3569.54 ms / 53 tokens
ggml_metal_free: deallocating

Execution log
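The tokens-per-second figures in the log can be recomputed from the reported times and counts, which is handy when comparing several runs. The following is a minimal sketch that assumes the llama_perf_context_print line format shown above. The regular expression and function name are my own.

```python
import re

# Matches the "prompt eval time" and "eval time" lines of the perf output.
PERF_LINE = re.compile(
    r"(?P<label>prompt eval time|eval time)\s*=\s*(?P<ms>[\d.]+) ms"
    r"\s*/\s*(?P<count>\d+) (?:tokens|runs)"
)


def tokens_per_second(log_text: str) -> dict:
    """Return {label: tokens per second} computed from elapsed ms and token count."""
    rates = {}
    for m in PERF_LINE.finditer(log_text):
        rates[m["label"]] = int(m["count"]) / (float(m["ms"]) / 1000.0)
    return rates


# Two lines copied from the execution log above.
SAMPLE_LOG = """\
llama_perf_context_print: prompt eval time = 3231.54 ms / 48 tokens ( 67.32 ms per token, 14.85 tokens per second)
llama_perf_context_print: eval time = 337.25 ms / 5 runs ( 67.45 ms per token, 14.83 tokens per second)
"""
```

For the log above, tokens_per_second(SAMPLE_LOG) gives roughly 14.85 tokens per second for prompt evaluation and 14.83 for generation, matching the values printed by llama.cpp.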

By the way, the original prompt reads "Daniel went back to the the the garden." What is going on with that repeated "the"? lol

Can it speak Japanese?

uv run run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "ダニエルは庭に戻った。メアリーはキッチンに向かった。サンドラはキッチンに向かった。サンドラは廊下へ。ジョンは寝室へ。メアリーは庭に戻りました。メアリーはどこにいる？\nAnswer:" -n 6 -temp 0

ダニエルは庭に戻った。メアリーはキッチンに向かった。サンドラはキッチンに向かった。サンドラは廊下へ。ジョンは寝室へ。メ??リーは庭に戻りました。メアリーはどこにいる？
Answer: 1. The cat is

That looks rough...!
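To try more prompts without retyping the long command, the invocation can be wrapped in a small helper. This is a sketch that only reuses the flags from the commands above (-m, -p, -n, -temp). The helper names are hypothetical, and it assumes it is run from the BitNet directory with uv available.

```python
import subprocess

MODEL_PATH = "models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf"


def build_inference_command(prompt, model_path=MODEL_PATH, n_tokens=6, temperature=0):
    """Build the same `uv run run_inference.py ...` command used in this article."""
    return [
        "uv", "run", "run_inference.py",
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_tokens),
        "-temp", str(temperature),
    ]


def run_prompt(prompt, **kwargs):
    """Run one prompt and return everything the script printed to stdout."""
    result = subprocess.run(
        build_inference_command(prompt, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, looping over a list of English and Japanese prompts and calling run_prompt(p) on each makes it easy to compare the answers side by side.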

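As expected, Japanese output seems to be difficult. This model is based on Llama3 8B, so a model built on a recent base that handles Japanese well, such as the Gemma or Qwen families, or Llama-3.1-Nemotron-70B-Instruct, which NVIDIA announced just the other day, might give somewhat more sensible answers in Japanese. I'm looking forward to it.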
