Trying the TensorRT-LLM Backend on WSL2, but...
python3 -m venv tensorrt-llm
cd $_
source bin/activate
# TensorRT-LLM
git clone -b v0.7.0 \
https://github.com/NVIDIA/TensorRT-LLM.git TensorRT-LLM/TensorRT-LLM-0.7.0
# TensorRT-LLM Backend
git clone -b release/0.5.0 \
https://github.com/triton-inference-server/tensorrtllm_backend.git tensorrtllm_backend
cd TensorRT-LLM/TensorRT-LLM-0.7.0
pip install tensorrt_llm==0.7 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
# For quantization
cd examples/quantization
pip install -r requirements.txt
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-libs==9.2.0.post12.dev5
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-bindings==9.2.0.post12.dev5
cd ../../
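Since the packages above are pinned to specific versions, it can help to confirm what actually ended up in the venv before building anything. The following is a minimal sketch using only the standard library; the helper name `installed_version` is my own and not part of any NVIDIA tooling.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages pinned by the pip commands above.
PACKAGES = ["tensorrt_llm", "tensorrt-libs", "tensorrt-bindings"]


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in PACKAGES:
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it inside the activated venv; a version that differs from the pinned one is worth resolving before moving on.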
The TensorRT-LLM Backend, i.e. the API server, is launched with Docker this time, so the next step is to set up the Docker environment.
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
2. Building the TRT engine
from huggingface_hub import snapshot_download
REPO_ID = "elyza/ELYZA-japanese-Llama-2-13b-instruct"
snapshot_download(repo_id=REPO_ID, revision="main")
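The quantization and build commands that follow need the path of the downloaded snapshot in `${hf_elyza_cache}`. `snapshot_download` returns that local path, so capturing its return value is one option. As an alternative, here is a small sketch that looks the directory up in the Hugging Face hub cache. It assumes the default cache location (no `HF_HOME` or `HF_HUB_CACHE` override); the function name `find_snapshot_dirs` is hypothetical.

```python
from pathlib import Path


def find_snapshot_dirs(repo_id, hub_cache=None):
    """List snapshot directories for repo_id in the Hugging Face hub cache."""
    if hub_cache is None:
        # ASSUMPTION: default cache location, no HF_HOME / HF_HUB_CACHE override.
        hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
    # The hub cache stores a model repo "org/name" under "models--org--name".
    repo_dir = Path(hub_cache) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


if __name__ == "__main__":
    for path in find_snapshot_dirs("elyza/ELYZA-japanese-Llama-2-13b-instruct"):
        print(path)  # candidate value for hf_elyza_cache
```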
cd TensorRT-LLM/TensorRT-LLM-0.7.0
# Point to the ELYZA 13B directory in the HF cache directory
cd examples/llama
python quantize.py --model_dir ${hf_elyza_cache} \
--dtype float16 \
--qformat int4_awq \
--export_path ./elyza13_int4_awq_weights \
--calib_size 32
# Back to the TensorRT-LLM/TensorRT-LLM-0.7.0 directory
cd ../../
Next, build the TRT engine. One option has been added compared with the build I did for Chat with RTX.
cd examples/llama
CUDA_VISIBLE_DEVICES=0 python build.py \
--model_dir ${hf_elyza_cache} \
--quant_ckpt_path ./elyza13_int4_awq_weights/llama_tp1_rank0.npz \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--use_weight_only \
--weight_only_precision int4_awq \
--per_group \
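--output_dir ./elyza13_int4_engine \
--use_inflight_batching
The added option is the last one, --use_inflight_batching. If the TRT engine is not built with it, the API server prints errors like the following at startup and fails to load the model.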
E0218 11:37:08.806588 121 backend_model.cc:635] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E0218 11:37:08.806845 121 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I0218 11:37:08.806998 121 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E0218 11:37:08.807143 121 model_repository_manager.cc:580] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.;
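The error message names three prerequisites for in-flight batching: the GPT attention plugin, packed input (no input padding), and a paged KV cache. A quick check of the engine's config.json before copying it into the model repository might look like the sketch below. The key names under `plugin_config` are an assumption about what build.py writes in this version, so adjust them to match the actual file.

```python
import json
from pathlib import Path

# Settings the error message says in-flight batching depends on.
# ASSUMPTION: these key names under "plugin_config" match the generated config.json.
REQUIRED = ("gpt_attention_plugin", "remove_input_padding", "paged_kv_cache")


def missing_inflight_settings(engine_config):
    """Return the required settings that are absent or disabled."""
    plugin = engine_config.get("plugin_config", {})
    # A setting counts as enabled when present and truthy (True or e.g. "float16").
    return [key for key in REQUIRED if not plugin.get(key)]


def check_engine_dir(engine_dir):
    """Load <engine_dir>/config.json and report missing settings."""
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    return missing_inflight_settings(config)
```

For example, `check_engine_dir("./elyza13_int4_engine")` returning an empty list would suggest the engine was built with the settings the backend asks for.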
# Back to the TensorRT-LLM/TensorRT-LLM-0.7.0 directory
cd ../../
# Back to the top directory of the venv
cd ../../
3. Configuring the TensorRT-LLM Backend
cd tensorrtllm_backend
mkdir triton_model_repo
# Copy the example models to the model repository
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
TensorRT-LLM Backend (v0.5.0) provides four models, and when you access the API over HTTP or another protocol you can specify which model to use.
The inference URI contains a ${model_name} part, which must be replaced with the appropriate model name (ensemble is fine in most cases).
preprocessing : used for tokenization, i.e. converting the prompt (string) into input_ids (list of ints)
tensorrt_llm : a wrapper around the TensorRT-LLM model, used for inference
postprocessing : used for detokenization, i.e. converting output_ids (list of ints) into outputs (string)
ensemble : chains the preprocessing, tensorrt_llm, and postprocessing models together
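To make the division of labor concrete, here is a toy sketch of how the ensemble chains the other three models. Nothing here calls Triton: the vocabulary, the whitespace tokenizer, and the "model" that simply appends tokens are all made up for illustration.

```python
# Toy stand-ins for the three models; the vocabulary and "model" are invented.
VOCAB = ["<unk>", "hello", "world", "!"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def preprocessing(prompt):
    """Prompt (string) -> input_ids (list of ints)."""
    return [TOKEN_TO_ID.get(tok, 0) for tok in prompt.split()]


def tensorrt_llm(input_ids, max_tokens):
    """input_ids -> output_ids. A real engine generates; this toy appends '!'."""
    return input_ids + [TOKEN_TO_ID["!"]] * max_tokens


def postprocessing(output_ids):
    """output_ids (list of ints) -> output text (string)."""
    return " ".join(VOCAB[i] for i in output_ids)


def ensemble(prompt, max_tokens=1):
    """Chain the three steps, as the ensemble model does."""
    return postprocessing(tensorrt_llm(preprocessing(prompt), max_tokens))


if __name__ == "__main__":
    print(ensemble("hello world"))
```

Calling the ensemble model spares the client from tokenizing and detokenizing itself, which is why it is the usual choice.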
--- all_models/inflight_batcher_llm/preprocessing/config.pbtxt 2024-02-18 11:53:47.195359840 +0900
+++ triton_model_repo/preprocessing/config.pbtxt 2024-02-18 16:45:41.833518710 +0900
@@ -80,14 +80,14 @@
parameters {
key: "tokenizer_dir"
value: {
- string_value: "${tokenizer_dir}"
+ string_value: "/tensorrtllm_backend/elyza13_hf"
parameters {
key: "tokenizer_type"
value: {
- string_value: "${tokenizer_type}"
+ string_value: "llama"
--- all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt 2024-02-18 11:54:02.505354486 +0900
+++ triton_model_repo/tensorrt_llm/config.pbtxt 2024-02-18 19:33:51.222473569 +0900
@@ -29,7 +29,7 @@
max_batch_size: 128
model_transaction_policy {
- decoupled: ${decoupled_mode}
+ decoupled: true
input [
@@ -173,36 +173,36 @@
parameters: {
key: "gpt_model_path"
value: {
- string_value: "${engine_dir}"
+ string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
- string_value: "${max_tokens_in_paged_kv_cache}"
+ string_value: ""
parameters: {
key: "batch_scheduler_policy"
value: {
- string_value: "${batch_scheduler_policy}"
+ string_value: "guaranteed_completion"
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: {
- string_value: "${kv_cache_free_gpu_mem_fraction}"
+ string_value: "0.2"
parameters: {
key: "max_num_sequences"
value: {
- string_value: "${max_num_sequences}"
+ string_value: "1"
parameters: {
key: "enable_trt_overlap"
value: {
- string_value: "${enable_trt_overlap}"
+ string_value: "true"
--- all_models/inflight_batcher_llm/postprocessing/config.pbtxt 2024-02-18 11:53:26.125362288 +0900
+++ triton_model_repo/postprocessing/config.pbtxt 2024-02-18 16:45:22.853520699 +0900
@@ -45,14 +45,14 @@
parameters {
key: "tokenizer_dir"
value: {
- string_value: "${tokenizer_dir}"
+ string_value: "/tensorrtllm_backend/elyza13_hf"
parameters {
key: "tokenizer_type"
value: {
- string_value: "${tokenizer_type}"
+ string_value: "llama"
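The edits in these diffs all replace `${...}` placeholders with concrete values, so they could also be scripted instead of made by hand. The sketch below uses `string.Template` from the standard library and is not the backend's own tooling; `fill_config` is a hypothetical helper, and the snippet reproduces only the two tokenizer parameters from the diff.

```python
from string import Template


def fill_config(template_text, values):
    """Replace ${name} placeholders; unknown placeholders are left untouched."""
    return Template(template_text).safe_substitute(values)


SNIPPET = '''parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "${tokenizer_dir}"
  }
}
parameters {
  key: "tokenizer_type"
  value: {
    string_value: "${tokenizer_type}"
  }
}'''

# Values taken from the diff above.
VALUES = {
    "tokenizer_dir": "/tensorrtllm_backend/elyza13_hf",
    "tokenizer_type": "llama",
}

if __name__ == "__main__":
    print(fill_config(SNIPPET, VALUES))
```

`safe_substitute` is used rather than `substitute` so that placeholders without a supplied value stay visible in the output rather than raising an error.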
# Set the variable again. Not needed if it is already set.
mkdir -p elyza13_hf
cp -p ${hf_elyza_cache}/config.json elyza13_hf
cp -p ${hf_elyza_cache}/tokenizer.json elyza13_hf
cp -p ${hf_elyza_cache}/tokenizer.model elyza13_hf
cp -p ${hf_elyza_cache}/tokenizer_config.json elyza13_hf
cp -p ../TensorRT-LLM/TensorRT-LLM-0.7.0/examples/llama/elyza13_int4_engine/* \
triton_model_repo/tensorrt_llm/1/
$ ls -l triton_model_repo/tensorrt_llm/1/
total 6801536
-rw-r--r-- 1 shoji_noguchi shoji_noguchi 1570 Feb 18 19:16 config.json
-rw-r--r-- 1 shoji_noguchi shoji_noguchi 6964474668 Feb 18 19:16 llama_float16_tp1_rank0.engine
-rw-r--r-- 1 shoji_noguchi shoji_noguchi 282176 Feb 18 19:16 model.cache
cd ../
4. Launching the API server
Start Docker. The image used this time is tritonserver:23.12-trtllm-python-py3.
# For the Docker mount.
mkdir cache
# Start Docker
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v ./tensorrtllm_backend:/tensorrtllm_backend \
-v ./cache:/root/.cache \
nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 bash
[TensorRT-LLM][ERROR] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 226, Serialized Engine Version: 228)
TensorRT version is not compatible, expecting library version got · Issue #194
pip install sentencepiece protobuf torch==2.1.0
cd /tensorrtllm_backend
CUDA_VISIBLE_DEVICES=0 python3 scripts/launch_triton_server.py \
--world_size=1 \
I0218 10:52:48.720655 123 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x205000000' with size 268435456
I0218 10:52:48.720745 123 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0218 10:52:48.724949 123 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0218 10:52:48.724988 123 model_lifecycle.cc:461] loading: preprocessing:1
I0218 10:52:48.725305 123 model_lifecycle.cc:461] loading: postprocessing:1
I0218 10:52:48.742660 123 python_be.cc:2363] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0218 10:52:48.768943 123 python_be.cc:2363] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter version cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'version' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I0218 10:52:50.390619 123 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0218 10:52:50.481971 123 model_lifecycle.cc:818] successfully loaded 'postprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 6641 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +10, now: CPU 6684, GPU 8267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +8, now: CPU 6686, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +6635, now: CPU 0, GPU 6635 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6721, GPU 10137 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6721, GPU 10145 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 6635 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6763, GPU 10163 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 6764, GPU 10171 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 6635 (MiB)
[TensorRT-LLM][INFO] Allocate 3040870400 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 3712 total tokens in paged KV cache, and 20 blocks per sequence
[TensorRT-LLM][WARNING] max_num_sequences is smaller than 2 times the engine max_batch_size. Batches smaller than max_batch_size will be executed.
I0218 10:53:04.380211 123 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I0218 10:53:04.380593 123 model_lifecycle.cc:461] loading: ensemble:1
I0218 10:53:04.381033 123 model_lifecycle.cc:818] successfully loaded 'ensemble'
I0218 10:53:04.381109 123 server.cc:606]
| Repository Agent | Path |
I0218 10:53:04.381136 123 server.cc:633]
| Backend | Path | Config |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","default-max-batch-size":"4"}} |
I0218 10:53:04.381152 123 server.cc:676]
| Model | Version | Status |
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
I0218 10:53:04.437396 123 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090
I0218 10:53:04.442141 123 metrics.cc:710] Collecting CPU metrics
I0218 10:53:04.442397 123 tritonserver.cc:2483]
| Option | Value |
| server_id | triton |
| server_version | 2.41.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_me |
| | mory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
I0218 10:53:04.457051 123 grpc_server.cc:2495] Started GRPCInferenceService at
I0218 10:53:04.457583 123 http_server.cc:4619] Started HTTPService at
I0218 10:53:04.557072 123 http_server.cc:282] Started Metrics Service at
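The KV cache figures in the startup log can be cross-checked by hand. Each cached token stores one key and one value vector per layer, so the total is tokens × 2 × layers × KV heads × head size × bytes per element. The sketch assumes that ELYZA 13B keeps the Llama-2-13B shape (40 layers, 40 KV heads, head size 128) and that the cache is float16, as in the build above.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one key vector and one value vector per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


# ASSUMPTION: Llama-2-13B shape (40 layers, 40 KV heads, head size 128), float16.
total = kv_cache_bytes(num_tokens=3712, num_layers=40, num_kv_heads=40,
                       head_dim=128, bytes_per_elem=2)
print(total)  # 3040870400, the "Allocate ... bytes for k/v cache" figure in the log
```

At 819,200 bytes per token, the 3712 tokens reported in the log account for the allocation exactly.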
5. Calling the API
The protocol specification for Triton Inference Server is published here: server/docs/protocol/README.md at main · triton-inference-server/server · GitHub
The health check returned 200 OK, so the server is running.
$ curl -vvv http://localhost:8000/v2/health/ready
* Trying
* Connected to localhost ( port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.81.0
> Accept: */*
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
* Connection #0 to host localhost left intact
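The same readiness check can be done from Python, for instance to wait for the server in a script. This is a minimal sketch with the standard library only; the function name `check_ready` and the default timeout are my own choices.

```python
import urllib.request


def check_ready(base_url, timeout=5.0):
    """Return True if GET <base_url>/v2/health/ready answers 200 OK."""
    url = base_url.rstrip("/") + "/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False
```

With the server from the previous section running, `check_ready("http://localhost:8000")` should correspond to the 200 OK seen with curl.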
$ curl -X POST http://localhost:8000/v2/repository/models/index 2>/dev/null | jq .
[
  {
    "name": "ensemble",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "postprocessing",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "preprocessing",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "tensorrt_llm",
    "version": "1",
    "state": "READY"
  }
]
Now let's specify the ensemble model and call generate.
curl -X POST http://localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "What is machine learning?",
  "max_tokens": 20,
  "bad_words": "",
  "stop_words": ""
}'
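I got as far as launching the TensorRT-LLM Backend, but could not get the actual inference to run...
I also tried the latest TensorRT-LLM Backend instead of 0.5.0, but that produced a different error (a segmentation fault), and I have not been able to pin down the cause...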
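For anyone retrying this, the generate request from section 5 can also be built in Python, which makes it easier to vary the prompt or the model name. The sketch only reproduces the request used with curl and does not address the failure described above. The function names are hypothetical, and the payload fields are the ones from the curl example.

```python
import json
import urllib.request


def build_generate_request(base_url, model_name, text_input, max_tokens):
    """Build a POST request for <base_url>/v2/models/<model_name>/generate."""
    payload = {
        "text_input": text_input,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v2/models/{model_name}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, model_name, text_input, max_tokens, timeout=60.0):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(base_url, model_name, text_input, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `generate("http://localhost:8000", "ensemble", "What is machine learning?", 20)` sends the same body as the curl command.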