Running FlexGen on an M1/M2 Mac
I know ChatGPT is amazing, but I want to play around with this stuff more casually myself. It's the kind of temptation I can't resist.
That aside, having dabbled a bit with StableDiffusion, I caught a glimpse of how much a publicly released model can transform the world. If the black-box ChatGPT has already changed the world this much, I'm excited to see what kind of change FlexGen will bring.
So, let's give it a try.
The following has been shared around in various places; I read it and found it well worth it, so I'd encourage you to read it even if you have to pay. (I happened to read it before it went behind a paywall, though...)
Articles on running FlexGen on an Apple Silicon Mac already exist, but I'm leaving this here as a memo to myself.
I mainly referred to the following article.
FlexGen
The upstream repository.
In fact, an `m1` branch exists and parts of the README have been updated for it, so that is what I referred to this time:
https://github.com/FMInference/FlexGen/tree/m1
If you want to see the diff, use the URL below (note that as of 2023-03-10 the m1 branch is 26 commits behind main):
https://github.com/FMInference/FlexGen/compare/main...m1
How to use FlexGen
Use Google Colab
Use a local PC
A PC with a GPU (the orthodox route)
An Apple Silicon (M1/M2) Mac (still unofficial)
This time I'll try the Apple Silicon (M1/M2) Mac route.
I've only been touching Python for the past few weeks, but my impression is that Python 3.11 hits a lot of errors, so I used 3.10.10, the latest release at the time of writing. (I manage Python versions with asdf, hence the following.)
❯ echo "python 3.10.10" > /Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.tool-versions
# Create and activate a Python virtual environment
❯ python -m venv .env
❯ source .env/bin/activate
# The virtual environment is now active
.env ❯
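As a quick sanity check (my addition, not part of the referenced steps), you can confirm from Python that the venv picked up the interpreter asdf pinned:

# Run inside the activated .env; both lines should reflect the pinned 3.10.10.
import sys

print(sys.executable)                            # expected to point into .env/bin
print(".".join(map(str, sys.version_info[:3])))  # expected: 3.10.10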
The README on the FlexGen m1 branch says the following.
PyTorch nightly (the latest build, ahead of the stable release) is required, so install it.
# MPS acceleration is available on MacOS 12.3+
❯ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
torchaudio probably isn't needed here, but for now I'll follow the docs as written.
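As an aside (my addition, not in the README), PyTorch ships probes for the Metal backend, so you can check from Python whether the installed nightly actually exposes MPS:

import torch

# MPS (Metal Performance Shaders) backend probes provided by PyTorch:
print(torch.backends.mps.is_built())      # was this build compiled with MPS support?
print(torch.backends.mps.is_available())  # Apple Silicon + macOS 12.3 or later?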
flexgen itself can be installed with `pip install flexgen`, but since I want the m1 branch, I'll clone it locally instead. One note: the article I referenced clones con3office/FlexGen, and checking it just in case, that fork has commits stacked on top that differ from the m1 branch of the original repository.
https://github.com/FMInference/FlexGen/compare/m1...con3office:FlexGen:m1
Here I'll simply clone the original FMInference/FlexGen and install it (managing it as a git submodule might be nicer; a task for another day).
.env ❯ git clone https://github.com/FMInference/FlexGen.git ./flexgen
.env ❯ cd flexgen/ && git checkout m1
.env ❯ pip3 install -e .
.env ❯ cd ..
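Since `pip3 install -e .` is an editable install, imports should resolve to the local clone rather than a PyPI copy; a quick hedged check:

# Confirm the editable install points at the local checkout (my addition):
import flexgen

print(flexgen.__file__)  # expected to live under the cloned ./flexgen directory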
Running FlexGen
Now that it's installed, run it following the README.
.env ❯ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
Downloading (…)okenizer_config.json: 100%|███████████████████████████████████████████| 685/685 [00:00<00:00, 184kB/s]
Downloading (…)lve/main/config.json: 100%|███████████████████████████████████████████| 651/651 [00:00<00:00, 267kB/s]
Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████| 899k/899k [00:01<00:00, 795kB/s]
Downloading (…)olve/main/merges.txt: 100%|█████████████████████████████████████████| 456k/456k [00:00<00:00, 605kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████| 221/221 [00:00<00:00, 66.5kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████| 2.63G/2.63G [01:08<00:00, 38.3MB/s]
Fetching 1 files: 100%|████████████████████████████████████████████████████████████████| 1/1 [01:09<00:00, 69.67s/it]
Convert format: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.17s/it]
warmup - generate
benchmark - generate
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:133: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
data_ptr = tensor.storage().data_ptr()
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:140: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
element_size = tensor.storage().element_size()
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
TorchDevice: cpu
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
TorchDevice: cpu
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
model size: 2.443 GB cache size: 0.398 GB hidden size (p): 0.008 GB
peak gpu mem: 0.000 GB projected: False
prefill latency: 5.284 s prefill throughput: 387.571 token/s
decode latency: 23.636 s decode throughput: 5.246 token/s
total latency: 28.920 s total throughput: 4.426 token/s
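These throughput figures are internally consistent if you assume FlexGen's benchmark defaults, which I read as prompt length 512, generation length 32, and a batch of 4 prompts (an assumption on my part; a batch of 4 also matches the outputs 0 and 3 printed above). A quick back-of-envelope check:

# Sanity-check of the benchmark lines above, assuming prompt_len=512,
# gen_len=32 and an effective batch of 4 sequences (my assumption).
batch, prompt_len, gen_len = 4, 512, 32

prefill_tokens = batch * prompt_len    # 2048 tokens processed in prefill
decode_tokens = batch * (gen_len - 1)  # 124 tokens generated in decode
total_tokens = batch * gen_len         # 128 tokens overall

print(prefill_tokens / 5.284)  # ~387.6 token/s -> matches prefill throughput
print(decode_tokens / 23.636)  # ~5.25 token/s  -> matches decode throughput
print(total_tokens / 28.920)   # ~4.43 token/s  -> matches total throughput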
The model specified is facebook/opt-1.3b. One thing I find interesting in the logs above: the weights are fetched from Hugging Face, where trained models are published (apparently it aims to be the GitHub of AI?). Image-generation models are often downloaded from there as well; it really does feel like the hub.
Reading the logs further, it seems the model really is generating output correctly.
Re-run
.env ❯ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
warmup - generate
benchmark - generate
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:133: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
data_ptr = tensor.storage().data_ptr()
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/flexgen/flexgen/utils.py:140: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
element_size = tensor.storage().element_size()
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
TorchDevice: cpu
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
TorchDevice: cpu
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
model size: 2.443 GB cache size: 0.398 GB hidden size (p): 0.008 GB
peak gpu mem: 0.000 GB projected: False
prefill latency: 5.396 s prefill throughput: 379.565 token/s
decode latency: 23.060 s decode throughput: 5.377 token/s
total latency: 28.455 s total throughput: 4.498 token/s
Diff
.env ❯ diff -u run.txt rerun.txt
--- run.txt 2023-03-11 00:27:55
+++ rerun.txt 2023-03-11 00:27:55
@@ -1,15 +1,6 @@
.env ❯ python3 -m flexgen.flex_opt --model facebook/opt-1.3b --platform cpu
-Downloading (…)okenizer_config.json: 100%|███████████████████████████████████████████| 685/685 [00:00<00:00, 184kB/s]
-Downloading (…)lve/main/config.json: 100%|███████████████████████████████████████████| 651/651 [00:00<00:00, 267kB/s]
-Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████| 899k/899k [00:01<00:00, 795kB/s]
-Downloading (…)olve/main/merges.txt: 100%|█████████████████████████████████████████| 456k/456k [00:00<00:00, 605kB/s]
-Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████| 221/221 [00:00<00:00, 66.5kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
-Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
-Downloading pytorch_model.bin: 100%|████████████████████████████████████████████| 2.63G/2.63G [01:08<00:00, 38.3MB/s]
-Fetching 1 files: 100%|████████████████████████████████████████████████████████████████| 1/1 [01:09<00:00, 69.67s/it]
-Convert format: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.17s/it]
warmup - generate
benchmark - generate
/Users/yoshikouki/src/github.com/yoshikouki/flexgen-for-mac/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
@@ -31,6 +22,6 @@
cur_mem: 5.2851 GB, peak_mem: 0.0000 GB
model size: 2.443 GB cache size: 0.398 GB hidden size (p): 0.008 GB
peak gpu mem: 0.000 GB projected: False
-prefill latency: 5.284 s prefill throughput: 387.571 token/s
-decode latency: 23.636 s decode throughput: 5.246 token/s
-total latency: 28.920 s total throughput: 4.426 token/s
+prefill latency: 5.396 s prefill throughput: 379.565 token/s
+decode latency: 23.060 s decode throughput: 5.377 token/s
+total latency: 28.455 s total throughput: 4.498 token/s
Checking the code (mainly around the prompt)
The code's entry point is here:
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    add_parser_arguments(parser)
    args = parser.parse_args()

    assert len(args.percent) == 6

    if "cuda" in args.platform:
        if not torch.cuda.is_available():
            if torch.backends.mps.is_available():
                args.platform = "mps:0"
            else:
                args.platform = "cpu"
            print("CUDA devices not available, {} is used instead".format(args.platform))

    if "mps" in args.platform:
        if not torch.backends.mps.is_available():
            args.platform = "cpu"
            print("MPS devices not available, CPU is used instead")

    if "cuda" not in args.platform:
        # not clear how to enable overlap on MPS platform yet
        args.overlap = False
        args.pin_weight = False

    if args.platform == "cpu":
        args.percent = [0, 100, 0, 100, 0, 100]

    run_flexgen(args)
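The `assert len(args.percent) == 6` ties into FlexGen's offloading policy: per the upstream README, the six numbers are the placement percentages for weights, KV cache, and activations on GPU and CPU respectively, with the remainder of each pair spilling to disk. A small illustrative sketch (the `partial` example is hypothetical):

# Meaning of args.percent, per the FlexGen README:
#   [weight_gpu, weight_cpu, cache_gpu, cache_cpu, act_gpu, act_cpu] in %
# Whatever is left of each pair goes to disk.
cpu_only = [0, 100, 0, 100, 0, 100]  # the CPU fallback above: everything in RAM

# Hypothetical split: keep 20% of weights on GPU, the rest on CPU,
# with the KV cache and activations fully on CPU.
partial = [20, 80, 0, 100, 0, 100]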
The output handling is around here, so I skimmed through the flow up to this point.
if DUMMY_WEIGHT not in args.path:
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    show_str = "Outputs:\n" + 70 * '-' + "\n"
    for i in [0, len(outputs)-1]:
        show_str += f"{i}: {outputs[i]}\n"
        show_str += "-" * 70 + "\n"
    if args.verbose >= 2:
        print(show_str)
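Incidentally, this loop is also why the run logs show only outputs 0 and 3: with a batch of 4 prompts, `for i in [0, len(outputs)-1]` prints just the first and the last generation.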
Out of time for today; I'll keep digging tomorrow or later.