่ฆ‹ๅ‡บใ—็”ปๅƒ

๐Ÿ“ FlexGen ใ‚’ M1/M2 Mac ใงๅ‹•ใ‹ใ—ใฆใฟใ‚‹

ChatGPT is amazing. I get that. But I want to play around a little more casually. Yes, because I'm a cheapskate.

ใพใ‚ใใ‚Œใฏใใ‚Œใจใ—ใฆใ€StableDiffusion ใ‚’ใกใ‚‡ใฃใจ่งฆใฃใŸ่บซใจใ—ใฆใฏใ€ๅ…ฌ้–‹ใ•ใ‚ŒใŸๅญฆ็ฟ’ใƒขใƒ‡ใƒซใŒๅฆ‚ไฝ•ใซไธ–ใฎไธญใ‚’ไธ€ๅค‰ใ•ใ›ใฆใ—ใพใ†ใฎใ‹ใฎ็‰‡้ฑ—ใ‚’ๆ„Ÿใ˜ใฆใ—ใพใฃใŸใฎใงใ€ใƒ–ใƒฉใƒƒใ‚ฏใƒœใƒƒใ‚ฏใ‚นใฎChatGPTใงใ“ใ“ใพใงไธ–ใฎไธญใ‚’ๅค‰ใˆใ‚‹ใ‚“ใชใ‚‰FlexGenใฏใฉใฎใ‚ˆใ†ใชๅค‰ๅŒ–ใ‚’ใ‚‚ใŸใ‚‰ใ—ใฆใใ‚Œใ‚‹ใ‚“ใ ใ‚ใ†ใจใƒฏใ‚ฏใƒฏใ‚ฏใ—ใฆใ—ใพใ†ใ€‚

ใ ใ‹ใ‚‰่งฆใฃใฆใฟใ‚‹ใ€‚


ไปฅไธ‹ใฏๅ„ๆ‰€ใงๅ…ฑๆœ‰ใ•ใ‚Œใฆใ„ใ‚‹ใ—็งใ‚‚่ชญใ‚“ใงใฟใฆ่‰ฏใ‹ใฃใŸใฎใงๆ˜ฏ้ž่ชฒ้‡‘ใ—ใฆใงใ‚‚่ชญใ‚“ใงใปใ—ใ„ใ€‚็งใฏๆœ‰ๆ–™ใซใชใฃใฆใ—ใพใ†ๅ‰ใซ่ชญใ‚“ใ ใ‚“ใ ใ‘ใฉใƒปใƒปใƒป


Articles on running FlexGen on an Apple Silicon Mac already exist, but I'm leaving this here as my own notes.

ไธปใซไปฅไธ‹ใฎ่จ˜ไบ‹ใ‚’ๅ‚่€ƒใซใ—ใŸ



FlexGen

ๅคงๆœฌใฎใƒชใƒใ‚ธใƒˆใƒชใ€‚


ๅฎŸใฏ `m1` ใจใ„ใ†ใƒ–ใƒฉใƒณใƒใŒๅญ˜ๅœจใ—ใ€README ใŒไธ€้ƒจๆ›ดๆ–ฐใ•ใ‚Œใฆใ„ใ‚‹ใฎใงไปŠๅ›žใฏใ“ใกใ‚‰ใ‚’ๅ‚็…งใ™ใ‚‹
https://github.com/FMInference/FlexGen/tree/m1

ๅทฎๅˆ†ใŒใฟใŸใ„ใชใ‚‰ไปฅไธ‹ใฎURL (2023/03/10 ๆ™‚็‚นใง main ใ‹ใ‚‰26ใ‚ณใƒŸใƒƒใƒˆๅˆ†้…ใ‚Œใฆใ„ใ‚‹ใฎใงๆณจๆ„)
https://github.com/FMInference/FlexGen/compare/main...m1

README ใฎ diff


FlexGen ใฎไฝฟ็”จๆ–นๆณ•

  1. Use Google Colab

  2. Use a local PC

    1. A PC with a GPU (the orthodox route)

    2. An Apple Silicon (M1/M2) Mac (still unofficial)

ไปŠๅ›žใฏ Apple Silicon (M1/M2) Mac ใ‚’่ฉฆใ™ 

ใ“ใ“ๆ•ฐ้€ฑ้–“ใงใฏใ˜ใ‚ใฆ Python ใ‚’่งฆใฃใฆ็จ‹ๅบฆใ ใŒใ€ๆ„Ÿ่ฆš็š„ใซ Python 3.11 ใฏใ‚จใƒฉใƒผใŒๅคšใ„ใฎใง ็พๆ™‚็‚นใงๆœ€ๆ–ฐใฎ 3.10.10 ใ‚’ไฝฟ็”จใ™ใ‚‹ใ€‚(็งใฏ asdf ใง Python ใฎใƒใƒผใ‚ธใƒงใƒณ็ฎก็†ใ‚’่กŒใฃใฆใ„ใ‚‹ใฎใงไปฅไธ‹)

โฏ echo "python 3.10.10" > /Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.tool-versions

# Python ใฎไปฎๆƒณ็’ฐๅขƒใ‚’ๆง‹็ฏ‰ใƒป่ตทๅ‹•ใ™ใ‚‹
โฏ python -m venv .env
โฏ source .env/bin/activate

# The virtual environment is now active
.env โฏ


Looking at the README on the FlexGen m1 branch, there is this passage:

### CPU and M1/M2 GPU platform
To run models on CPU platforms, all you need to do is to add an `--platform` entry:
```
python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
```
To run on M1/M2 platforms, [PyTorch nightly](https://pytorch.org/) is required for kernel coverage and better performance. Once you have PyTorch nightly installed, you can simply replace `cpu` with `mps:0`.

https://github.com/FMInference/FlexGen/compare/main...m1#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R134-R139
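Per that note, once PyTorch nightly is installed the MPS run should be the same command with the platform swapped. I haven't run this yet at this point; it's straight from the README:

.env โฏ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform mps:0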

PyTorch nightly (the latest build, not yet a stable release) is required, so install it:

# MPS acceleration is available on MacOS 12.3+
โฏ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

ไปŠๅ›žใฏ torchaudio ใฏไธ่ฆใ‹ใ‚‚ใ ใŒใ€ไธ€ๆ—ฆใƒ‰ใ‚ญใƒฅใƒกใƒณใƒˆใฎใพใพ้€ฒใ‚ใ‚‹ใ€‚


flexgen ใฏ `pip install flexgen` ใงใ‚คใƒณใ‚นใƒˆใƒผใƒซใงใใ‚‹ใŒใ€ไปŠๅ›žใฏ m1 ใƒ–ใƒฉใƒณใƒใ‚’ๆŒ‡ๅฎšใ—ใŸใ„ใฎใงใƒญใƒผใ‚ซใƒซใซ่ฝใจใ—ใฆใใ‚‹ใ—ใ‹ใชใ„ใ€‚ๅ‚่€ƒใซใ—ใŸ่จ˜ไบ‹ใงใฏ con3office/FlexGen ใ‚’่ฝใจใ—ใฆใ„ใ‚‹ใฎใงไธ€ๅฟœ็ขบ่ชใ™ใ‚‹ใจใ€ๅ…ƒใƒชใƒใ‚ธใƒˆใƒชใฎ m1 ใƒ–ใƒฉใƒณใƒใจใฏๅˆฅใฎใ‚ณใƒŸใƒƒใƒˆใŒ็ฉใพใ‚Œใฆใ„ใ‚‹

https://github.com/FMInference/FlexGen/compare/m1...con3office:FlexGen:m1

ใ“ใ“ใงใฏไธ€ๆ—ฆๅ…ƒใฎ FMInference/FlexGen ใ‚’่ฝใจใ—ใฆใ‚คใƒณใ‚นใƒˆใƒผใƒซใ—ใฆใฟใ‚‹ (git submodule ใง็ฎก็†ใ—ใŸๆ–นใŒ่‰ฏใ•ใใ†ใ‹๏ผŸไปŠๅพŒใฎ่ชฒ้กŒ)

.env โฏ git clone https://github.com/FMInference/FlexGen.git ./flexgen
.env โฏ cd flexgen/ && git checkout m1
.env โฏ pip3 install -e .
.env โฏ cd ..


Running FlexGen

Now that it's installed, run it following the README:

.env โฏ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
Downloading (โ€ฆ)okenizer_config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 685/685 [00:00<00:00, 184kB/s]
Downloading (โ€ฆ)lve/main/config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 651/651 [00:00<00:00, 267kB/s]
Downloading (โ€ฆ)olve/main/vocab.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 899k/899k [00:01<00:00, 795kB/s]
Downloading (โ€ฆ)olve/main/merges.txt: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 456k/456k [00:00<00:00, 605kB/s]
Downloading (โ€ฆ)cial_tokens_map.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 221/221 [00:00<00:00, 66.5kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading pytorch_model.bin: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2.63G/2.63G [01:08<00:00, 38.3MB/s]
Fetching 1 files: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1/1 [01:09<00:00, 69.67s/it]
Convert format: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1/1 [00:05<00:00,  5.17s/it]
warmup - generate
benchmark - generate
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:133: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  data_ptr = tensor.storage().data_ptr()
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:140: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  element_size = tensor.storage().element_size()
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------

TorchDevice: cpu
  cur_mem: 5.2851 GB,  peak_mem: 0.0000 GB
TorchDevice: cpu
  cur_mem: 5.2851 GB,  peak_mem: 0.0000 GB
model size: 2.443 GB	cache size: 0.398 GB	hidden size (p): 0.008 GB
peak gpu mem: 0.000 GB	projected: False
prefill latency: 5.284 s	prefill throughput: 387.571 token/s
decode latency: 23.636 s	decode throughput: 5.246 token/s
total latency: 28.920 s	total throughput: 4.426 token/s
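As a side note, the throughput numbers line up with flex_opt's defaults as I read them (prompt length 512, generation length 32, batch size 4 — my assumption, not stated in the log):

```
# Rough sanity check on the reported throughput.
# Assumed defaults: prompt-len 512, gen-len 32, gpu-batch-size 4.
prompt_len, gen_len, batch = 512, 32, 4
print(prompt_len * batch / 5.284)  # ~387.6 token/s prefill, matches the log
print(gen_len * batch / 28.920)    # ~4.43 token/s total, matches the log
```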

LLMใฏ facebook/opt-1.3b ใ‚’ๆŒ‡ๅฎšใ—ใฆใ„ใ‚‹ใ€‚ไปฅไธ‹ใฎใƒญใ‚ฐใ‚’่ฆ‹ใ‚‹ใจ้ข็™ฝใ„ใฎใ ใ‘ใฉใ€

Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.

So it fetches the trained weights from Hugging Face (which apparently aims to be something like a GitHub for AI?). Image-generation models are often downloaded from there too. "Of course it does" is the feeling.


ๆ›ดใซใƒญใ‚ฐใ‚’็œบใ‚ใ‚‹ใจใ€ใฉใ†ใ‚„ใ‚‰ใกใ‚ƒใ‚“ใจๅ‡บๅŠ›ใงใใฆใ„ใ‚‹ใ‚‰ใ—ใ„ใ“ใจใŒๅˆ†ใ‹ใ‚‹

Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------

 ๅ†ๅฎŸ่กŒ

.env โฏ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
warmup - generate
benchmark - generate
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:133: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  data_ptr = tensor.storage().data_ptr()
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:140: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  element_size = tensor.storage().element_size()
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------

TorchDevice: cpu
  cur_mem: 5.2851 GB,  peak_mem: 0.0000 GB
TorchDevice: cpu
  cur_mem: 5.2851 GB,  peak_mem: 0.0000 GB
model size: 2.443 GB	cache size: 0.398 GB	hidden size (p): 0.008 GB
peak gpu mem: 0.000 GB	projected: False
prefill latency: 5.396 s	prefill throughput: 379.565 token/s
decode latency: 23.060 s	decode throughput: 5.377 token/s
total latency: 28.455 s	total throughput: 4.498 token/s

The diff against the first run — the download steps vanish because the weights are now cached locally:

.env โฏ diff -u  run.txt rerun.txt
--- run.txt	2023-03-11 00:27:55
+++ rerun.txt	2023-03-11 00:27:55
@@ -1,15 +1,6 @@
 .env โฏ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
-Downloading (โ€ฆ)okenizer_config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 685/685 [00:00<00:00, 184kB/s]
-Downloading (โ€ฆ)lve/main/config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 651/651 [00:00<00:00, 267kB/s]
-Downloading (โ€ฆ)olve/main/vocab.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 899k/899k [00:01<00:00, 795kB/s]
-Downloading (โ€ฆ)olve/main/merges.txt: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 456k/456k [00:00<00:00, 605kB/s]
-Downloading (โ€ฆ)cial_tokens_map.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 221/221 [00:00<00:00, 66.5kB/s]
 model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
 init weight...
-Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
-Downloading pytorch_model.bin: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2.63G/2.63G [01:08<00:00, 38.3MB/s]
-Fetching 1 files: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1/1 [01:09<00:00, 69.67s/it]
-Convert format: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1/1 [00:05<00:00,  5.17s/it]
 warmup - generate
 benchmark - generate
 /Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
@@ -31,6 +22,6 @@
   cur_mem: 5.2851 GB,  peak_mem: 0.0000 GB
 model size: 2.443 GB	cache size: 0.398 GB	hidden size (p): 0.008 GB
 peak gpu mem: 0.000 GB	projected: False
-prefill latency: 5.284 s	prefill throughput: 387.571 token/s
-decode latency: 23.636 s	decode throughput: 5.246 token/s
-total latency: 28.920 s	total throughput: 4.426 token/s
+prefill latency: 5.396 s	prefill throughput: 379.565 token/s
+decode latency: 23.060 s	decode throughput: 5.377 token/s
+total latency: 28.455 s	total throughput: 4.498 token/s


Checking the code (mainly around the prompt)

ใ‚ณใƒผใƒ‰ใฏใ“ใ“ใ‹ใ‚‰ๅง‹ๅ‹•ใ™ใ‚‹

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    add_parser_arguments(parser)
    args = parser.parse_args()

    assert len(args.percent) == 6

    if "cuda" in args.platform:
        if not torch.cuda.is_available():
            if torch.backends.mps.is_available():
                args.platform = "mps:0"
            else:
                args.platform = "cpu"
            print("CUDA devices not available, {} is used instead".format(args.platform))

    if "mps" in args.platform:
        if not torch.backends.mps.is_available():
            args.platform = "cpu"
            print("MPS devices not available, CPU is used instead")

    if "cuda" not in args.platform:
        # not clear how to enable overlap on MPS platform yet
        args.overlap = False
        args.pin_weight = False

    if args.platform == "cpu":
        args.percent = [0, 100, 0, 100, 0, 100]

    run_flexgen(args)

https://github.com/FMInference/FlexGen/blob/f62cfec766f3c0f8d542f5622319fa7a12a6437d/flexgen/flex_opt.py#L1320-L1348
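In short: asking for cuda without CUDA available falls back to mps:0 (or cpu), asking for mps without MPS falls back to cpu, overlap and weight pinning are disabled off CUDA, and on pure CPU the percent split is forced to [0, 100, 0, 100, 0, 100] — which I read as placing weights, KV cache, and activations 100% on CPU rather than GPU.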


The output handling is around here; I'd like to roughly trace the flow leading up to it:

    if DUMMY_WEIGHT not in args.path:
        outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        show_str = "Outputs:\n" + 70 * '-' + "\n"
        for i in [0, len(outputs)-1]:
            show_str += f"{i}: {outputs[i]}\n"
            show_str += "-" * 70 + "\n"
        if args.verbose >= 2:
            print(show_str)

https://github.com/FMInference/FlexGen/blob/f62cfec766f3c0f8d542f5622319fa7a12a6437d/flexgen/flex_opt.py#L1248-L1255
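Note that the loop prints only the first and last entries (`[0, len(outputs)-1]`), so the `0:` and `3:` lines in the log imply a batch of four outputs; they show up because the verbosity check `args.verbose >= 2` evidently passes with the defaults.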


ไปŠๆ—ฅใฏๆ™‚้–“ๅˆ‡ใ‚Œใ€‚ๆ˜Žๆ—ฅไปฅ้™็ขบ่ชใ—ใฆใ„ใ

ใ„ใ„ใชใจๆ€ใฃใŸใ‚‰ๅฟœๆดใ—ใ‚ˆใ†๏ผ