vLLM benchmark on an AMD 9950X / 128GB DDR5 + 4090
Since I had just finished setting up the AMD 9950X and the 4090, I decided to check what kind of throughput vLLM can get out of them.
I am testing over a remote connection from Istanbul, Turkey, to the lab in Asakusabashi.
The hardware configuration is as follows.
So the results should basically reflect GPU speed, but I tried several patterns.
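One setup note: benchmark_serving.py only generates load, and assumes a vLLM OpenAI-compatible server is already listening (the commands below use its defaults). A minimal readiness probe, assuming the server's standard /health endpoint on the default port 8000:

```python
# Minimal readiness check for the vLLM OpenAI-compatible server.
# Assumes the default host/port and the /health endpoint.
import urllib.request

def server_ready(url="http://localhost:8000/health"):
    try:
        return urllib.request.urlopen(url, timeout=2).status == 200
    except OSError:
        return False

# Usage: run before benchmarking, e.g. print("server up:", server_ready())
```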
First, let's run with vLLM's default parameters (num-prompts = 2):
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
============ Serving Benchmark Result ============
Successful requests: 2
Benchmark duration (s): 5.26
Total input tokens: 24
Total generated tokens: 28
Request throughput (req/s): 0.38
Input token throughput (tok/s): 4.56
Output token throughput (tok/s): 5.32
---------------Time to First Token----------------
Mean TTFT (ms): 25.08
Median TTFT (ms): 25.08
P99 TTFT (ms): 25.24
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 18.05
Median TPOT (ms): 18.05
P99 TPOT (ms): 18.06
---------------Inter-token Latency----------------
Mean ITL (ms): 18.02
Median ITL (ms): 18.03
P99 ITL (ms): 18.16
==================================================
This is far too slow: 5.32 tok/s.
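As a quick sanity check, the throughput lines are simply the totals divided by the benchmark wall time; reproducing the arithmetic from the report above (the helper function is mine, not part of vLLM):

```python
# Throughput figures are just totals divided by the benchmark wall time.
def throughputs(duration_s, input_tokens, output_tokens, requests):
    return (requests / duration_s,
            input_tokens / duration_s,
            output_tokens / duration_s)

req_s, in_tok_s, out_tok_s = throughputs(5.26, 24, 28, 2)
print(round(req_s, 2), round(in_tok_s, 2), round(out_tok_s, 2))
# → 0.38 4.56 5.32, matching the report above
```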
It seems there are too few prompts for an accurate measurement, so let's try 200:
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 200
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 24.79
Total input tokens: 3661
Total generated tokens: 13588
Request throughput (req/s): 8.07
Input token throughput (tok/s): 147.68
Output token throughput (tok/s): 548.11
---------------Time to First Token----------------
Mean TTFT (ms): 552.77
Median TTFT (ms): 665.02
P99 TTFT (ms): 674.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 41.62
Median TPOT (ms): 40.16
P99 TPOT (ms): 79.77
---------------Inter-token Latency----------------
Mean ITL (ms): 32.93
Median ITL (ms): 32.08
P99 ITL (ms): 53.19
==================================================
200 prompts are processed much faster than 2. An output rate of 548 tok/s is not a bad showing.
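A rough decomposition of the 200-prompt run: mean request latency can be rebuilt from TTFT and TPOT, and Little's law then estimates how many requests were in flight on average (my back-of-the-envelope math, not tool output):

```python
# Rebuild mean request latency from the 200-prompt report, then apply
# Little's law (L = lambda * W) to estimate average in-flight requests.
ttft_s = 0.55277                     # mean TTFT from the report
tpot_s = 0.04162                     # mean time per output token
mean_out = 13588 / 200               # ~68 generated tokens per request
latency_s = ttft_s + tpot_s * (mean_out - 1)
concurrency = 8.07 * latency_s       # 8.07 req/s measured throughput
print(round(latency_s, 2), round(concurrency, 1))  # ≈ 3.34 s, ~27 in flight
```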
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2000
============ Serving Benchmark Result ============
Successful requests: 14
Benchmark duration (s): 14.37
Total input tokens: 168
Total generated tokens: 768
Request throughput (req/s): 0.97
Input token throughput (tok/s): 11.69
Output token throughput (tok/s): 53.45
---------------Time to First Token----------------
Mean TTFT (ms): 6000.30
Median TTFT (ms): 5998.19
P99 TTFT (ms): 6016.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 19.25
Median TPOT (ms): 19.23
P99 TPOT (ms): 20.32
---------------Inter-token Latency----------------
Mean ITL (ms): 18.92
Median ITL (ms): 18.78
P99 ITL (ms): 38.20
==================================================
Going up to 2000 actually made things worse: only 14 requests completed, for 53.45 tok/s.
Let's try 500:
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 500
============ Serving Benchmark Result ============
Successful requests: 499
Benchmark duration (s): 33.50
Total input tokens: 8948
Total generated tokens: 36459
Request throughput (req/s): 14.90
Input token throughput (tok/s): 267.13
Output token throughput (tok/s): 1088.44
---------------Time to First Token----------------
Mean TTFT (ms): 2110.71
Median TTFT (ms): 914.80
P99 TTFT (ms): 5295.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 63.63
Median TPOT (ms): 69.98
P99 TPOT (ms): 83.27
---------------Inter-token Latency----------------
Mean ITL (ms): 48.79
Median ITL (ms): 45.24
P99 ITL (ms): 192.39
==================================================
I wondered whether the numbers would vary from run to run, but at around 500 prompts the output stayed stable at roughly 1000 tok/s over several runs. Is this the sweet spot?
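Side note: this manual search over num-prompts could be automated with a small sweep wrapper; a sketch, assuming the same paths as the commands above and a running server (the sweep function below is my own, not part of the vLLM repo):

```python
# Sweep num-prompts and rerun the benchmark for each value.
# Each run prints its own "Serving Benchmark Result" report.
import subprocess

def benchmark_cmd(num_prompts, model="meta-llama/Meta-Llama-3-8B"):
    return ["python", "benchmarks/benchmark_serving.py",
            "--backend", "vllm", "--model", model,
            "--dataset-name", "sharegpt", "--dataset-path", "sharegpt.json",
            "--num-prompts", str(num_prompts)]

def sweep(values=(200, 400, 450, 500, 550, 600, 750)):
    # Requires the vLLM server to be up before calling.
    for n in values:
        subprocess.run(benchmark_cmd(n), check=True)
```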
Let's try 1000:
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 1000
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 26.39
Total input tokens: 12
Total generated tokens: 35
Request throughput (req/s): 0.04
Input token throughput (tok/s): 0.45
Output token throughput (tok/s): 1.33
---------------Time to First Token----------------
Mean TTFT (ms): 20676.33
Median TTFT (ms): 20676.33
P99 TTFT (ms): 20676.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 17.47
Median TPOT (ms): 17.47
P99 TPOT (ms): 17.47
---------------Inter-token Latency----------------
Mean ITL (ms): 17.47
Median ITL (ms): 17.46
P99 ITL (ms): 17.57
==================================================
No good at all; only a single request completed. Maybe the GPU memory overflows?
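Whether this is really memory pressure can be sanity-checked on paper. A rough KV-cache budget for Llama-3-8B in fp16 on a 24 GB card; the 2 GiB overhead figure is my guess, and vLLM's own accounting via gpu_memory_utilization works differently:

```python
# Back-of-the-envelope KV-cache budget for Llama-3-8B (fp16) on a 24 GB GPU.
layers, kv_heads, head_dim = 32, 8, 128      # Llama-3-8B GQA configuration
bytes_per_elem = 2                           # fp16
# K and V vectors per token, summed over all layers:
kv_bytes_per_token = 2 * kv_heads * head_dim * bytes_per_elem * layers
print(kv_bytes_per_token // 1024)            # → 128 (KiB per cached token)

weights_gib = 8e9 * 2 / 2**30                # ~15 GiB of fp16 weights
free_gib = 24 - weights_gib - 2              # minus ~2 GiB overhead (guess)
max_tokens = int(free_gib * 2**30 / kv_bytes_per_token)
print(max_tokens)                            # roughly 58k cacheable tokens
```

If hundreds of concurrent requests each hold a growing KV cache, it is plausible that the scheduler has to queue or preempt aggressively, which would fit the 20-second TTFT seen here.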
So let's try 750 as well:
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 750
============ Serving Benchmark Result ============
Successful requests: 249
Benchmark duration (s): 30.00
Total input tokens: 4509
Total generated tokens: 16892
Request throughput (req/s): 8.30
Input token throughput (tok/s): 150.32
Output token throughput (tok/s): 563.15
---------------Time to First Token----------------
Mean TTFT (ms): 5276.42
Median TTFT (ms): 5492.24
P99 TTFT (ms): 5506.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 43.34
Median TPOT (ms): 44.11
P99 TPOT (ms): 82.41
---------------Inter-token Latency----------------
Mean ITL (ms): 35.13
Median ITL (ms): 39.73
P99 ITL (ms): 49.14
==================================================
563 tok/s, worse than with 500.
Let's try 600:
============ Serving Benchmark Result ============
Successful requests: 399
Benchmark duration (s): 33.74
Total input tokens: 7434
Total generated tokens: 29589
Request throughput (req/s): 11.82
Input token throughput (tok/s): 220.31
Output token throughput (tok/s): 876.88
---------------Time to First Token----------------
Mean TTFT (ms): 3267.80
Median TTFT (ms): 2824.51
P99 TTFT (ms): 5734.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 57.66
Median TPOT (ms): 61.29
P99 TPOT (ms): 77.98
---------------Inter-token Latency----------------
Mean ITL (ms): 44.81
Median ITL (ms): 40.96
P99 ITL (ms): 108.18
==================================================
Slightly worse than with 500.
Let's try 550:
============ Serving Benchmark Result ============
Successful requests: 449
Benchmark duration (s): 33.41
Total input tokens: 8345
Total generated tokens: 33188
Request throughput (req/s): 13.44
Input token throughput (tok/s): 249.76
Output token throughput (tok/s): 993.31
---------------Time to First Token----------------
Mean TTFT (ms): 2346.12
Median TTFT (ms): 1516.16
P99 TTFT (ms): 5284.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 62.04
Median TPOT (ms): 68.58
P99 TPOT (ms): 84.35
---------------Inter-token Latency----------------
Mean ITL (ms): 47.13
Median ITL (ms): 40.93
P99 ITL (ms): 209.50
==================================================
Getting close to 1000 tok/s again. Now let's go below 500 instead:
============ Serving Benchmark Result ============
Successful requests: 450
Benchmark duration (s): 32.76
Total input tokens: 8357
Total generated tokens: 33212
Request throughput (req/s): 13.73
Input token throughput (tok/s): 255.07
Output token throughput (tok/s): 1013.69
---------------Time to First Token----------------
Mean TTFT (ms): 1664.00
Median TTFT (ms): 914.51
P99 TTFT (ms): 4439.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 61.85
Median TPOT (ms): 65.42
P99 TPOT (ms): 81.02
---------------Inter-token Latency----------------
Mean ITL (ms): 47.14
Median ITL (ms): 40.97
P99 ITL (ms): 204.71
==================================================
At 450 the performance comes out close to the 500 result.
What about 400?
============ Serving Benchmark Result ============
Successful requests: 400
Benchmark duration (s): 32.02
Total input tokens: 7446
Total generated tokens: 29700
Request throughput (req/s): 12.49
Input token throughput (tok/s): 232.54
Output token throughput (tok/s): 927.53
---------------Time to First Token----------------
Mean TTFT (ms): 1406.67
Median TTFT (ms): 916.24
P99 TTFT (ms): 3879.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 62.00
Median TPOT (ms): 64.81
P99 TPOT (ms): 87.21
---------------Inter-token Latency----------------
Mean ITL (ms): 46.65
Median ITL (ms): 41.35
P99 ITL (ms): 211.05
==================================================
It dropped further.
So on this system, with this benchmark method, about 500 prompts is close to the point of maximum efficiency, and that maximum is around 1088 tok/s.
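Collecting the measured output throughputs from all the runs above makes the peak easy to see:

```python
# Output throughput (tok/s) by num-prompts, as measured in the runs above.
results = {2: 5.32, 200: 548.11, 400: 927.53, 450: 1013.69, 500: 1088.44,
           550: 993.31, 600: 876.88, 750: 563.15, 1000: 1.33, 2000: 53.45}
best = max(results, key=results.get)
print(best, results[best])  # → 500 1088.44
```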
I have not really used the CPU this time, so I should also try setups that offload more work to the CPU. My flight is coming up, though, so that will have to wait for next time.