Running FlexGen on an M1/M2 Mac
I know ChatGPT is amazing, but I want to play around with this stuff more casually myself. It's the kind of temptation I can't resist.
That aside, having dabbled a bit with StableDiffusion, I caught a glimpse of how much a publicly released model can transform the world. If the black-box ChatGPT has already changed the world this much, I'm excited to see what kind of change FlexGen will bring.
So, let's give it a try.
The following has been shared around in various places; I read it and found it well worth it, so I'd encourage you to read it even if you have to pay. (I happened to read it before it went behind a paywall, though...)
Articles on running FlexGen on an Apple Silicon Mac already exist, but I'm leaving this here as a memo to myself.
I mainly referred to the following article.
FlexGen
The upstream repository.
In fact, an `m1` branch exists and parts of the README have been updated for it, so that is what I referred to this time:
https://github.com/FMInference/FlexGen/tree/m1
If you want to see the diff, use the URL below (note that as of 2023-03-10 the m1 branch is 26 commits behind main):
https://github.com/FMInference/FlexGen/compare/main...m1
How to use FlexGen
Use Google Colab
Use a local PC
A PC with a GPU (the orthodox route)
An Apple Silicon (M1/M2) Mac (still unofficial)
This time I'll try the Apple Silicon (M1/M2) Mac route.
I've only been touching Python for the past few weeks, but my impression is that Python 3.11 hits a lot of errors, so I used 3.10.10, the latest release at the time of writing. (I manage Python versions with asdf, hence the following.)
❯ echo "python 3.10.10" > /Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.tool-versions
# Create and activate a Python virtual environment
❯ python -m venv .env
❯ source .env/bin/activate
# The virtual environment is now active
.env ❯
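As a quick sanity check (my addition, not part of the referenced steps), you can confirm from Python that the venv picked up the interpreter asdf pinned:

# Run inside the activated .env; both lines should reflect the pinned 3.10.10.
import sys

print(sys.executable)                            # expected to point into .env/bin
print(".".join(map(str, sys.version_info[:3])))  # expected: 3.10.10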
The README on the FlexGen m1 branch says the following.
PyTorch nightly (the latest build, ahead of the stable release) is required, so install it.
# MPS acceleration is available on MacOS 12.3+
❯ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
torchaudio probably isn't needed here, but for now I'll follow the docs as written.
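As an aside (my addition, not in the README), PyTorch ships probes for the Metal backend, so you can check from Python whether the installed nightly actually exposes MPS:

import torch

# MPS (Metal Performance Shaders) backend probes provided by PyTorch:
print(torch.backends.mps.is_built())      # was this build compiled with MPS support?
print(torch.backends.mps.is_available())  # Apple Silicon + macOS 12.3 or later?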
flexgen itself can be installed with `pip install flexgen`, but since I want the m1 branch, I'll clone it locally instead. One note: the article I referenced clones con3office/FlexGen, and checking it just in case, that fork has commits stacked on top that differ from the m1 branch of the original repository.
https://github.com/FMInference/FlexGen/compare/m1...con3office:FlexGen:m1
Here I'll simply clone the original FMInference/FlexGen and install it (managing it as a git submodule might be nicer; a task for another day).
.env ❯ git clone https://github.com/FMInference/FlexGen.git ./flexgen
.env ❯ cd flexgen/ && git checkout m1
.env ❯ pip3 install -e .
.env ❯ cd ..
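Since `pip3 install -e .` is an editable install, imports should resolve to the local clone rather than a PyPI copy; a quick hedged check:

# Confirm the editable install points at the local checkout (my addition):
import flexgen

print(flexgen.__file__)  # expected to live under the cloned ./flexgen directory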
Running FlexGen
Now that it's installed, run it following the README.
.env ❯ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
Downloading (…)okenizer_config.json: 100%|███████████████████████████████████████████| 685/685 [00:00<00:00, 184kB/s]
Downloading (…)lve/main/config.json: 100%|███████████████████████████████████████████| 651/651 [00:00<00:00, 267kB/s]
Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████| 899k/899k [00:01<00:00, 795kB/s]
Downloading (…)olve/main/merges.txt: 100%|█████████████████████████████████████████| 456k/456k [00:00<00:00, 605kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████| 221/221 [00:00<00:00, 66.5kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████| 2.63G/2.63G [01:08<00:00, 38.3MB/s]
Fetching 1 files: 100%|████████████████████████████████████████████████████████████████| 1/1 [01:09<00:00, 69.67s/it]
Convert format: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.17s/it]
warmup - generate
benchmark - generate
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:133: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
data_ptr = tensor.storage().data_ptr()
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:140: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
element_size = tensor.storage().element_size()
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
TorchDevice: cpu
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
TorchDevice: cpu
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
model size: 2.443 GB cache size: 0.398 GB hidden size (p): 0.008 GB
peak gpu mem: 0.000 GB projected: False
prefill latency: 5.284 s prefill throughput: 387.571 token/s
decode latency: 23.636 s decode throughput: 5.246 token/s
total latency: 28.920 s total throughput: 4.426 token/s
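These throughput figures are internally consistent if you assume FlexGen's benchmark defaults, which I read as prompt length 512, generation length 32, and a batch of 4 prompts (an assumption on my part; a batch of 4 also matches the outputs 0 and 3 printed above). A quick back-of-envelope check:

# Sanity-check of the benchmark lines above, assuming prompt_len=512,
# gen_len=32 and an effective batch of 4 sequences (my assumption).
batch, prompt_len, gen_len = 4, 512, 32

prefill_tokens = batch * prompt_len    # 2048 tokens processed in prefill
decode_tokens = batch * (gen_len - 1)  # 124 tokens generated in decode
total_tokens = batch * gen_len         # 128 tokens overall

print(prefill_tokens / 5.284)  # ~387.6 token/s -> matches prefill throughput
print(decode_tokens / 23.636)  # ~5.25 token/s  -> matches decode throughput
print(total_tokens / 28.920)   # ~4.43 token/s  -> matches total throughput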
The model specified is facebook/opt-1.3b. One thing I find interesting in the logs above: the weights are fetched from Hugging Face, where trained models are published (apparently it aims to be the GitHub of AI?). Image-generation models are often downloaded from there as well; it really does feel like the hub.
Reading the logs further, it seems the model really is generating output correctly.
Re-run
.env ❯ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
warmup - generate
benchmark - generate
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:133: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
data_ptr = tensor.storage().data_ptr()
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:140: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
element_size = tensor.storage().element_size()
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
TorchDevice: cpu
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
TorchDevice: cpu
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
model size: 2.443 GB cache size: 0.398 GB hidden size (p): 0.008 GB
peak gpu mem: 0.000 GB projected: False
prefill latency: 5.396 s prefill throughput: 379.565 token/s
decode latency: 23.060 s decode throughput: 5.377 token/s
total latency: 28.455 s total throughput: 4.498 token/s
Diff
.env ❯ diff -u run.txt rerun.txt
--- run.txt 2023-03-11 00:27:55
+++ rerun.txt 2023-03-11 00:27:55
@@ -1,15 +1,6 @@
.env ❯ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
-Downloading (…)okenizer_config.json: 100%|███████████████████████████████████████████| 685/685 [00:00<00:00, 184kB/s]
-Downloading (…)lve/main/config.json: 100%|███████████████████████████████████████████| 651/651 [00:00<00:00, 267kB/s]
-Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████| 899k/899k [00:01<00:00, 795kB/s]
-Downloading (…)olve/main/merges.txt: 100%|█████████████████████████████████████████| 456k/456k [00:00<00:00, 605kB/s]
-Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████| 221/221 [00:00<00:00, 66.5kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
-Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
-Downloading pytorch_model.bin: 100%|████████████████████████████████████████████| 2.63G/2.63G [01:08<00:00, 38.3MB/s]
-Fetching 1 files: 100%|████████████████████████████████████████████████████████████████| 1/1 [01:09<00:00, 69.67s/it]
-Convert format: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.17s/it]
warmup - generate
benchmark - generate
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
@@ -31,6 +22,6 @@
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
model size: 2.443 GB cache size: 0.398 GB hidden size (p): 0.008 GB
peak gpu mem: 0.000 GB projected: False
-prefill latency: 5.284 s prefill throughput: 387.571 token/s
-decode latency: 23.636 s decode throughput: 5.246 token/s
-total latency: 28.920 s total throughput: 4.426 token/s
+prefill latency: 5.396 s prefill throughput: 379.565 token/s
+decode latency: 23.060 s decode throughput: 5.377 token/s
+total latency: 28.455 s total throughput: 4.498 token/s
Checking the code (mainly around the prompt)
The code's entry point is here:
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    add_parser_arguments(parser)
    args = parser.parse_args()

    assert len(args.percent) == 6

    if "cuda" in args.platform:
        if not torch.cuda.is_available():
            if torch.backends.mps.is_available():
                args.platform = "mps:0"
            else:
                args.platform = "cpu"
            print("CUDA devices not available, {} is used instead".format(args.platform))

    if "mps" in args.platform:
        if not torch.backends.mps.is_available():
            args.platform = "cpu"
            print("MPS devices not available, CPU is used instead")

    if "cuda" not in args.platform:
        # not clear how to enable overlap on MPS platform yet
        args.overlap = False
        args.pin_weight = False

    if args.platform == "cpu":
        args.percent = [0, 100, 0, 100, 0, 100]

    run_flexgen(args)
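The `assert len(args.percent) == 6` ties into FlexGen's offloading policy: per the upstream README, the six numbers are the placement percentages for weights, KV cache, and activations on GPU and CPU respectively, with the remainder of each pair spilling to disk. A small illustrative sketch (the `partial` example is hypothetical):

# Meaning of args.percent, per the FlexGen README:
#   [weight_gpu, weight_cpu, cache_gpu, cache_cpu, act_gpu, act_cpu] in %
# Whatever is left of each pair goes to disk.
cpu_only = [0, 100, 0, 100, 0, 100]  # the CPU fallback above: everything in RAM

# Hypothetical split: keep 20% of weights on GPU, the rest on CPU,
# with the KV cache and activations fully on CPU.
partial = [20, 80, 0, 100, 0, 100]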
The output handling is around here, so I skimmed through the flow up to this point.
if DUMMY_WEIGHT not in args.path:
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    show_str = "Outputs:\n" + 70 * '-' + "\n"
    for i in [0, len(outputs)-1]:
        show_str += f"{i}: {outputs[i]}\n"
        show_str += "-" * 70 + "\n"
    if args.verbose >= 2:
        print(show_str)
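Incidentally, this loop is also why the run logs show only outputs 0 and 3: with a batch of 4 prompts, `for i in [0, len(outputs)-1]` prints just the first and the last generation.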
Out of time for today; I'll keep digging tomorrow or later.