Fast llama2 13B chat inference with Metal
Environment
MacBook Pro 16-inch, M1 Max, 32-core GPU
In npaka's article, no speedup from using Metal could be confirmed, but in my environment running with Metal was clearly faster, so I'm reporting the results here.
This assumes the llama.cpp repository is already cloned; the version is the latest at the time of writing, which includes roughly the commits listed below.
The llama-2-13b-chat.ggmlv3.q4_0.bin weights were already downloaded with wget.
The build itself was done with a rough ad-hoc script, running make with LLAMA_METAL=1.
Clone llama.cpp, build it, and download the model
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it
LLAMA_METAL=1 make
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-Chat-GGML/resolve/main/${MODEL}"
* Best done after cd llama.cpp and before running make:
Pull in the latest commits on the master branch with git pull.
$ git pull
$ git log
commit f514d1b306e1114c2884fcb25dd9bd48ae64ba32 (HEAD -> master, tag: master-f514d1b, origin/master, origin/HEAD)
Author: Johannes Gäßler <johannesg@5d6.de>
Date: Sat Aug 5 18:20:44 2023 +0200
CUDA: faster k-quant mul_mat_q kernels (#2525)
commit 332311234a0aa2974b2450710e22e09d90dd6b0b (tag: master-3323112)
Author: Jonas Wunderlich <32615971+jonas-w@users.noreply.github.com>
Date: Fri Aug 4 20:16:11 2023 +0000
fix firefox autoscroll (#2519)
commit 182af739c4ce237f7579facfe8f94dc53a1f573f (tag: master-182af73)
Author: Cebtenzzre <cebtenzzre@gmail.com>
Date: Fri Aug 4 15:00:57 2023 -0400
server: regenerate completion.js.hpp (#2515)
commit 4329d1acb01c353803a54733b8eef9d93d0b84b2 (tag: master-4329d1a)
Author: Cebtenzzre <cebtenzzre@gmail.com>
Date: Fri Aug 4 11:35:22 2023 -0400
CUDA: use min compute capability of GPUs actually used (#2506)
commit 02f9d96a866268700b8d8e7acbbcb4392c5ff345 (tag: master-02f9d96)
Author: Cebtenzzre <cebtenzzre@gmail.com>
Date: Fri Aug 4 11:34:32 2023 -0400
CUDA: check if event is NULL before cudaStreamWaitEvent (#2505)
Fixes #2503
commit 3498588e0fb4daf040c4e3c698595cb0bfd345c0 (tag: master-3498588)
Author: DannyDaemonic <DannyDaemonic@gmail.com>
Date: Fri Aug 4 08:20:12 2023 -0700
Add --simple-io option for subprocesses and break out console.h and cpp (#1558)
commit 5f631c26794b6371fcf2660e8d0c53494a5575f7 (tag: master-5f631c2)
Author: Stephen Nichols <snichols@users.noreply.github.com>
Run with Metal enabled (-ngl 1)
Result: 33.79 tokens per second
(-ngl sets the number of layers to offload to the GPU; with a LLAMA_METAL build, any value above 0 enables Metal execution.)
$ ./main -m ./llama-2-13b-chat.ggmlv3.q4_0.bin -n 128 -ngl 1 -t 10 -p "Hello"
main: build = 944 (8183159)
main: seed = 1691275227
llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 7349.72 MB (+ 400.00 MB per state)
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/keigofukumoto/git/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x122e58a50
ggml_metal_init: loaded kernel_add_row 0x122f06570
ggml_metal_init: loaded kernel_mul 0x122f06bf0
ggml_metal_init: loaded kernel_mul_row 0x122f0a370
ggml_metal_init: loaded kernel_scale 0x12389c450
ggml_metal_init: loaded kernel_silu 0x122f097f0
ggml_metal_init: loaded kernel_relu 0x122f0b0f0
ggml_metal_init: loaded kernel_gelu 0x122f0b410
ggml_metal_init: loaded kernel_soft_max 0x122f0bee0
ggml_metal_init: loaded kernel_diag_mask_inf 0x122f0d960
ggml_metal_init: loaded kernel_get_rows_f16 0x1037045f0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x103706050
ggml_metal_init: loaded kernel_get_rows_q4_1 0x103705250
ggml_metal_init: loaded kernel_get_rows_q2_K 0x122f0ce70
ggml_metal_init: loaded kernel_get_rows_q3_K 0x122f0e020
ggml_metal_init: loaded kernel_get_rows_q4_K 0x122f0e420
ggml_metal_init: loaded kernel_get_rows_q5_K 0x122f0ec80
ggml_metal_init: loaded kernel_get_rows_q6_K 0x122f0f590
ggml_metal_init: loaded kernel_rms_norm 0x122f10030
ggml_metal_init: loaded kernel_norm 0x122f112a0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x122e594f0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x122e5a4b0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x122f12390
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x122f12790
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x122f13420
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x122f13e00
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x122f14600
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x122e5b9f0
ggml_metal_init: loaded kernel_rope 0x122e5aa30
ggml_metal_init: loaded kernel_alibi_f32 0x122e5c6d0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x122e5cca0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x122e5dad0
ggml_metal_init: loaded kernel_cpy_f16_f16 0x122e5e640
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 87.89 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.52 / 21845.34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 12.17 MB, ( 6996.69 / 21845.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 7398.69 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 162.00 MB, ( 7560.69 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 192.00 MB, ( 7752.69 / 21845.34)
system_info: n_threads = 10 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Hello! Welcome to the first issue of my new newsletter, where I'll be sharing updates on my work, writing tips and resources, and other interesting tidbits about the world of words.
I hope you'll join me on this journey as we explore the power of language, the craft of writing, and the many ways in which words can shape our understanding of the world. Whether you're a fellow writer, a reader, or simply someone who loves language, I hope you'll find something here that resonates with you.
In this first issue, I want to share some exciting updates on
llama_print_timings: load time = 10397.77 ms
llama_print_timings: sample time = 97.33 ms / 128 runs ( 0.76 ms per token, 1315.18 tokens per second)
llama_print_timings: prompt eval time = 583.26 ms / 2 tokens ( 291.63 ms per token, 3.43 tokens per second)
llama_print_timings: eval time = 3758.22 ms / 127 runs ( 29.59 ms per token, 33.79 tokens per second)
llama_print_timings: total time = 4451.20 ms
ggml_metal_free: deallocating
Run with Metal disabled (-ngl 0)
Result: 5.69 tokens per second
$ ./main -m ./llama-2-13b-chat.ggmlv3.q4_0.bin -n 128 -ngl 0 -t 10 -p "Hello"
main: build = 944 (8183159)
main: seed = 1691275431
llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 7349.72 MB (+ 400.00 MB per state)
llama_new_context_with_model: kv self size = 400.00 MB
system_info: n_threads = 10 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Hello! Welcome to my website. I'm a writer and a professor, and I've created this space to share some of my thoughts, ideas, and experiences with you.
I write about a wide range of topics, including literature, history, culture, and current events. My goal is to provide insightful and thought-provoking content that will engage and inspire you.
Please feel free to explore my website and learn more about me and my work. You can find information about my books, articles, and teaching experience, as well as some of my recent writing projects. I also have a blog where
llama_print_timings: load time = 527.15 ms
llama_print_timings: sample time = 90.10 ms / 128 runs ( 0.70 ms per token, 1420.61 tokens per second)
llama_print_timings: prompt eval time = 313.51 ms / 2 tokens ( 156.76 ms per token, 6.38 tokens per second)
llama_print_timings: eval time = 22316.09 ms / 127 runs ( 175.72 ms per token, 5.69 tokens per second)
llama_print_timings: total time = 22732.38 ms
With Metal enabled, generation ran at 33.79 tokens per second versus 5.69 tokens per second on the CPU alone, roughly a 6x speedup.
Watching it run, the difference in speed is obvious even just by feel, so I'm thinking of posting a video of it sometime.
llama.cpp itself keeps improving, so the environment for fast Metal execution may simply be more mature now than it was at first.
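If you want to reproduce the comparison in one go, a minimal sketch like the following should work. This is not from the original runs above; it assumes the ./main binary and model file from the build steps, and relies on llama.cpp printing its timing summary to stderr.
# Run the same prompt with and without GPU offloading and keep only the timing lines
MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
for NGL in 1 0; do
  echo "=== -ngl ${NGL} ==="
  # grep "eval time" picks up both the prompt-eval and generation-eval lines seen in the logs above
  ./main -m "./${MODEL}" -n 128 -ngl "${NGL}" -t 10 -p "Hello" 2>&1 | grep "eval time"
done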