Trying PowerInfer on a MacBook Pro

noguchi-shoji

December 26, 2023 00:43

Rather recklessly, I'm going to try "PowerInfer" on a MacBook Pro.
The machine is a MacBook Pro with an M3 Pro chip and 18 GB of memory.

You never know until you try... Well, actually, I already do know, because the README says:

Apple M Chips on macOS (As we do not optimize for Mac, the performance improvement is not significant now.)

GitHub - SJTU-IPADS/PowerInfer: High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

Even so, here is the model I'll try this time:

LLaMA(ReLU)-2-7B

1. Preparation

Setting up a venv

python3 -m venv powerinfer
cd $_
source bin/activate

Setting up PowerInfer

Clone the repository with git clone and install the required packages.

git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt

Here is the output of pip list.

% pip list
Package            Version    Editable project location
------------------ ---------- -----------------------------------------------------------------
certifi            2023.11.17
charset-normalizer 3.3.2
cvxopt             1.3.2
filelock           3.13.1
fsspec             2023.12.2
gguf               0.5.2      /Users/noguchi/path/to/venv/powerinfer/PowerInfer/gguf-py
huggingface-hub    0.20.1
idna               3.6
Jinja2             3.1.2
MarkupSafe         2.1.3
mpmath             1.3.0
networkx           3.2.1
numpy              1.26.2
packaging          23.2
pip                23.3.2
powerinfer         0.0.1      /Users/noguchi/path/to/venv/powerinfer/PowerInfer/powerinfer-py
PyYAML             6.0.1
regex              2023.12.25
requests           2.31.0
safetensors        0.4.1
sentencepiece      0.1.99
setuptools         58.0.4
sympy              1.12
tokenizers         0.15.0
torch              2.1.2
tqdm               4.66.1
transformers       4.36.2
typing_extensions  4.9.0
urllib3            2.1.0
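
The list shows gguf and powerinfer as editable installs pointing into the cloned repository. To check from Python that they are visible inside the activated venv, a small sketch using the standard library's importlib.metadata could look like this. The helper name installed_versions is hypothetical; the package names are the ones from the list above.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Run inside the activated venv; gguf and powerinfer are the editable installs.
    for name, version in installed_versions(["gguf", "powerinfer", "torch", "transformers"]).items():
        print(f"{name:12} {version or 'NOT INSTALLED'}")
```

If gguf or powerinfer prints NOT INSTALLED, the pip install step most likely ran outside the venv.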

Build

Build PowerInfer with CMake. Since this is a Mac, I use the CPU-only configuration.

cmake -S . -B build
cmake --build build --config Release

2. Downloading the model

Download the 7B model.

mkdir ReluLLaMA-7B-PowerInfer-GGUF
wget -P ReluLLaMA-7B-PowerInfer-GGUF https://huggingface.co/PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF/resolve/main/llama-7b-relu.powerinfer.gguf
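
The same download can also be done from Python with huggingface_hub, which is already in the pip list above (0.20.1). This is a minimal sketch, assuming the repo_id, filename and local_dir parameters of hf_hub_download in that version; resolve_url and download are hypothetical helper names, and resolve_url only rebuilds the URL that the wget command uses.

```python
from pathlib import Path

REPO_ID = "PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF"
FILENAME = "llama-7b-relu.powerinfer.gguf"


def resolve_url(repo_id, filename, revision="main"):
    """Build the Hugging Face 'resolve' URL used by the wget command."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download(local_dir="ReluLLaMA-7B-PowerInfer-GGUF"):
    """Download the GGUF file into local_dir and return its path."""
    # Imported here so resolve_url() works even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)


if __name__ == "__main__":
    print(resolve_url(REPO_ID, FILENAME))
    path = Path(download())
    print(path, path.stat().st_size // (1024 ** 2), "MiB")
```

Depending on the huggingface_hub version and its symlink settings, the file placed in local_dir may be a symlink into the Hugging Face cache rather than a copy; ./build/bin/main can still open it through that path.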

3. Trying it out

Asking a question

I generate 16 tokens (-n 16) instead of 128. The reason will become clear shortly.

./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 16 -t 8 -p "Doraemon is"

Doraemon is a popular Japanese anime series. It was first released in Japan in the year

From PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF

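The total time was about 2 minutes 50 seconds (168.5 s), with each generated token taking about 10.4 seconds. Raising the thread count (the -t option) did not change this.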

llama_print_timings:        load time =    9199.25 ms
llama_print_timings:      sample time =       1.41 ms /    16 runs   (    0.09 ms per token, 11347.52 tokens per second)
llama_print_timings: prompt eval time =   12530.26 ms /     5 tokens ( 2506.05 ms per token,     0.40 tokens per second)
llama_print_timings:        eval time =  155936.53 ms /    15 runs   (10395.77 ms per token,     0.10 tokens per second)
llama_print_timings:       total time =  168475.21 ms
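
The figures above can be checked with a little arithmetic. The sketch below parses llama_print_timings lines with a regular expression and derives throughput from them. The pattern is inferred from this one log only, so treat it as an assumption. It can also re-run main with several -t values; run_main, the --sweep flag and the thread values are hypothetical, and it assumes, as in upstream llama.cpp, that the timing lines are written to stderr.

```python
import re
import subprocess
import sys

BINARY = "./build/bin/main"
MODEL = "./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf"

# The timing lines from the run above, copied verbatim.
SAMPLE_LOG = """\
llama_print_timings:        load time =    9199.25 ms
llama_print_timings:      sample time =       1.41 ms /    16 runs   (    0.09 ms per token, 11347.52 tokens per second)
llama_print_timings: prompt eval time =   12530.26 ms /     5 tokens ( 2506.05 ms per token,     0.40 tokens per second)
llama_print_timings:        eval time =  155936.53 ms /    15 runs   (10395.77 ms per token,     0.10 tokens per second)
llama_print_timings:       total time =  168475.21 ms
"""

# Inferred format: "<name> time = <ms> ms [/ <count> runs|tokens]".
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[\w ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: */ *(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(text):
    """Return {name: (milliseconds, count or None)} for each llama_print_timings line."""
    timings = {}
    for m in TIMING_RE.finditer(text):
        count = m.group("count")
        timings[m.group("name")] = (float(m.group("ms")), int(count) if count else None)
    return timings


def tokens_per_second(ms, count):
    """Throughput for `count` runs/tokens that took `ms` milliseconds."""
    return count / (ms / 1000.0)


def run_main(threads, n_tokens=16, prompt="Doraemon is"):
    """Run the same main command as above with a given -t and parse its timings."""
    argv = [BINARY, "-m", MODEL, "-n", str(n_tokens), "-t", str(threads), "-p", prompt]
    # Assumption: like upstream llama.cpp, the timing lines go to stderr.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return parse_timings(result.stderr)


if __name__ == "__main__":
    if "--sweep" in sys.argv:
        # Hypothetical thread counts; each run took about 3 minutes on the M3 Pro.
        for threads in (4, 8, 12):
            ms, runs = run_main(threads)["eval"]
            print(f"-t {threads}: {tokens_per_second(ms, runs):.3f} tokens/s")
    else:
        timings = parse_timings(SAMPLE_LOG)
        ms, runs = timings["eval"]
        per_token = ms / runs
        print(f"eval: {per_token:.2f} ms/token, {tokens_per_second(ms, runs):.3f} tokens/s")
        print(f"total: {timings['total'][0] / 1000:.1f} s")
        # Rough extrapolation only: ignores load and prompt-eval time.
        print(f"128 tokens: about {128 * per_token / 60000:.1f} min of eval time")
```

Run without arguments, it only analyses the copied log: about 10.4 s per generated token, so 128 tokens would need roughly 22 minutes of eval time alone, which is presumably why -n 16 was used. With --sweep it actually re-runs main, so expect several minutes per thread setting.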

4. Summary

Better to wait a while, until PowerInfer is optimized for Mac, before trying this again.