Trying TensorRT-LLM Backend on WSL2, but...

February 18, 2024 23:52

I wanted to see how well TensorRT-LLM actually performs, so I set up the Backend (the API server) to try it out.

1. Preparation

venv environment

python3 -m venv tensorrt-llm
cd $_
source bin/activate

Clone the repositories.

# TensorRT-LLM
git clone -b v0.7.0 \
    https://github.com/NVIDIA/TensorRT-LLM.git TensorRT-LLM/TensorRT-LLM-0.7.0

# TensorRT-LLM Backend
git clone -b release/0.5.0 \
    https://github.com/triton-inference-server/tensorrtllm_backend.git tensorrtllm_backend

Install the required packages.

cd TensorRT-LLM/TensorRT-LLM-0.7.0
pip install tensorrt_llm==0.7 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
#
pip install -r requirements.txt

# For quantization
cd examples/quantization
pip install -r requirements.txt
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-libs==9.2.0.post12.dev5
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-bindings==9.2.0.post12.dev5
cd ../../
#

Docker environment

The TensorRT-LLM Backend, i.e. the API server, runs in Docker this time, so set up the Docker environment.

First, configure the apt repository:

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

Then install the Docker packages.

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

2. Building the TRT engine

Downloading the model

Start python and paste in the following to download elyza/ELYZA-japanese-Llama-2-13b-instruct.

from huggingface_hub import snapshot_download

REPO_ID = "elyza/ELYZA-japanese-Llama-2-13b-instruct"
snapshot_download(repo_id=REPO_ID, revision="main")

Creating the 4-bit quantized model

Change the current directory to TensorRT-LLM-0.7.0.

cd TensorRT-LLM/TensorRT-LLM-0.7.0

Create the quantized model.

# Point to the ELYZA 13B directory inside the HF cache directory
hf_elyza_cache=~/.cache/huggingface/hub/models--elyza--ELYZA-japanese-Llama-2-13b-instruct/snapshots/ed15089024f3ecad9a8c4ce1db302cc01aa9f4ee
#
cd examples/llama
python quantize.py --model_dir ${hf_elyza_cache}  \
                --dtype float16 \
                --qformat int4_awq \
                --export_path ./elyza13_int4_awq_weights \
                --calib_size 32
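The snapshot hash in hf_elyza_cache is hardcoded, so it will differ whenever another revision is downloaded. As a minimal sketch, the directory can be looked up from the cache layout visible in the path above (models--<org>--<name>/snapshots/<revision>). The function name find_snapshot_dirs is my own and is not part of huggingface_hub.

```python
from pathlib import Path


def find_snapshot_dirs(repo_id: str, cache_root: Path) -> list[Path]:
    """Return the cached snapshot directories for repo_id, newest first."""
    # The hub cache stores each model under models--<org>--<name>/snapshots/<revision>.
    repo_dir = cache_root / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return []
    dirs = [p for p in snapshots.iterdir() if p.is_dir()]
    # Sort by modification time so the most recently written snapshot comes first.
    return sorted(dirs, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    cache = Path.home() / ".cache" / "huggingface" / "hub"
    for d in find_snapshot_dirs("elyza/ELYZA-japanese-Llama-2-13b-instruct", cache):
        print(d)
```

snapshot_download also returns the local snapshot path, so capturing its return value at download time is another way to avoid the hardcoded hash.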

Once it has been created, go back to the original directory.

# Back to the TensorRT-LLM/TensorRT-LLM-0.7.0 directory
cd ../../

Building the TRT engine

Next, build the TRT engine. One option has been added compared with what I used for Chat with RTX.

cd examples/llama
CUDA_VISIBLE_DEVICES=0 python build.py \
    --model_dir ${hf_elyza_cache} \
    --quant_ckpt_path ./elyza13_int4_awq_weights/llama_tp1_rank0.npz \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --output_dir ./elyza13_int4_engine \
    --use_inflight_batching

It is the very last one, --use_inflight_batching. If the TRT engine is not built with this option, the API server prints the following errors at startup and fails to load the model.

E0218 11:37:08.806588 121 backend_model.cc:635] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E0218 11:37:08.806845 121 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I0218 11:37:08.806998 121 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E0218 11:37:08.807143 121 model_repository_manager.cc:580] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.;

Now change the current directory.

# Back to the TensorRT-LLM/TensorRT-LLM-0.7.0 directory
cd ../../
# To the top directory of the venv
cd ../../

3. Configuring TensorRT-LLM Backend

Change the current directory.

cd tensorrtllm_backend

Creating the model repository directory

Create a repository directory to hold the TRT engine, and copy the directories under the sample (all_models/inflight_batcher_llm) into it as-is.

mkdir triton_model_repo
# Copy the example models to the model repository
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

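Editing the configuration files

TensorRT-LLM Backend (v0.5.0) provides four models, and when you access the API over a protocol such as HTTP you specify which model to use.

Below is the URI used for inference. Replace ${model_name} with the appropriate model (ensemble is fine in most cases).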

http://localhost:8000/v2/models/${model_name}/generate

The four models are defined as follows.

preprocessing : tokenizes, i.e. converts prompts (strings) into input_ids (lists of ints)
tensorrt_llm : a wrapper around the TensorRT-LLM model, used for inference
postprocessing : detokenizes, i.e. converts output_ids (lists of ints) into outputs (strings)
ensemble : chains the preprocessing, tensorrt_llm and postprocessing models together
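As a small illustration of how ${model_name} maps onto the four models, here is a sketch that builds the generate URI. The helper generate_url and its host and port defaults are my own assumptions for illustration; only the URI shape comes from the text above.

```python
from urllib.parse import quote

# The four models shipped in all_models/inflight_batcher_llm.
MODEL_NAMES = ("preprocessing", "tensorrt_llm", "postprocessing", "ensemble")


def generate_url(model_name: str, host: str = "localhost", port: int = 8000) -> str:
    """Build the generate URI for one of the four known models."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model: {model_name!r}")
    return f"http://{host}:{port}/v2/models/{quote(model_name)}/generate"


if __name__ == "__main__":
    print(generate_url("ensemble"))
```

Rejecting unknown names up front catches a misspelled model name before any request is sent.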

Now let's look at the changes to the configuration file (config.pbtxt) for each model.

・preprocessing/config.pbtxt

--- all_models/inflight_batcher_llm/preprocessing/config.pbtxt  2024-02-18 11:53:47.195359840 +0900
+++ triton_model_repo/preprocessing/config.pbtxt        2024-02-18 16:45:41.833518710 +0900
@@ -80,14 +80,14 @@
 parameters {
   key: "tokenizer_dir"
   value: {
-    string_value: "${tokenizer_dir}"
+    string_value: "/tensorrtllm_backend/elyza13_hf"
   }
 }

 parameters {
   key: "tokenizer_type"
   value: {
-    string_value: "${tokenizer_type}"
+    string_value: "llama"
   }
 }

・tensorrt_llm/config.pbtxt

--- all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt   2024-02-18 11:54:02.505354486 +0900
+++ triton_model_repo/tensorrt_llm/config.pbtxt 2024-02-18 19:33:51.222473569 +0900
@@ -29,7 +29,7 @@
 max_batch_size: 128

 model_transaction_policy {
-  decoupled: ${decoupled_mode}
+  decoupled: true
 }

 input [
@@ -173,36 +173,36 @@
 parameters: {
   key: "gpt_model_path"
   value: {
-    string_value: "${engine_dir}"
+    string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
   }
 }
 parameters: {
   key: "max_tokens_in_paged_kv_cache"
   value: {
-    string_value: "${max_tokens_in_paged_kv_cache}"
+    string_value: ""
   }
 }
 parameters: {
   key: "batch_scheduler_policy"
   value: {
-    string_value: "${batch_scheduler_policy}"
+    string_value: "guaranteed_completion"
   }
 }
 parameters: {
   key: "kv_cache_free_gpu_mem_fraction"
   value: {
-    string_value: "${kv_cache_free_gpu_mem_fraction}"
+    string_value: "0.2"
   }
 }
 parameters: {
   key: "max_num_sequences"
   value: {
-    string_value: "${max_num_sequences}"
+    string_value: "1"
   }
 }
 parameters: {
   key: "enable_trt_overlap"
   value: {
-    string_value: "${enable_trt_overlap}"
+    string_value: "true"
   }
 }

・postprocessing/config.pbtxt

--- all_models/inflight_batcher_llm/postprocessing/config.pbtxt 2024-02-18 11:53:26.125362288 +0900
+++ triton_model_repo/postprocessing/config.pbtxt       2024-02-18 16:45:22.853520699 +0900
@@ -45,14 +45,14 @@
 parameters {
   key: "tokenizer_dir"
   value: {
-    string_value: "${tokenizer_dir}"
+    string_value: "/tensorrtllm_backend/elyza13_hf"
   }
 }

 parameters {
   key: "tokenizer_type"
   value: {
-    string_value: "${tokenizer_type}"
+    string_value: "llama"
   }
 }

・ensemble/config.pbtxt
No changes are needed.

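Note: it would be nicer to make these changes with the ./tools/fill_template.py command, but it checks that the files exist in the environment where the command runs, so specifying paths that only exist inside the Docker environment results in an error. That is why the changes are shown as diffs here.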
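Since the placeholders in config.pbtxt use the ${name} form, one possible workaround is plain string substitution with no file-existence check. The following is only a sketch using Python's string.Template. It is not how fill_template.py works, and fill_config is a name I made up.

```python
from string import Template


def fill_config(template_text: str, values: dict[str, str]) -> str:
    """Replace ${name} placeholders; leave unknown placeholders untouched."""
    # safe_substitute never raises for missing keys, so a partially filled
    # config still shows which ${...} entries remain to be set.
    return Template(template_text).safe_substitute(values)


if __name__ == "__main__":
    snippet = 'string_value: "${tokenizer_dir}"'
    print(fill_config(snippet, {"tokenizer_dir": "/tensorrtllm_backend/elyza13_hf"}))
```

Reading a config.pbtxt, passing its text through fill_config and writing it back would reproduce the diffs above, with container-side paths accepted as-is.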

Copying the tokenizer and the TRT engine

Copy the four tokenizer-related files into the designated folder.

# Set the variable again. Not needed if it is already set.
hf_elyza_cache=~/.cache/huggingface/hub/models--elyza--ELYZA-japanese-Llama-2-13b-instruct/snapshots/ed15089024f3ecad9a8c4ce1db302cc01aa9f4ee/
#
mkdir -p elyza13_hf
cp -p ${hf_elyza_cache}/config.json elyza13_hf
cp -p ${hf_elyza_cache}/tokenizer.json elyza13_hf
cp -p ${hf_elyza_cache}/tokenizer.model elyza13_hf
cp -p ${hf_elyza_cache}/tokenizer_config.json elyza13_hf

Next, copy the TRT engine.

cp -p ../TensorRT-LLM/TensorRT-LLM-0.7.0/examples/llama/elyza13_int4_engine/* \
    triton_model_repo/tensorrt_llm/1/

The copied files are as follows.

$ ls -l  triton_model_repo/tensorrt_llm/1/
total 6801536
-rw-r--r-- 1 shoji_noguchi shoji_noguchi       1570 Feb 18 19:16 config.json
-rw-r--r-- 1 shoji_noguchi shoji_noguchi 6964474668 Feb 18 19:16 llama_float16_tp1_rank0.engine
-rw-r--r-- 1 shoji_noguchi shoji_noguchi     282176 Feb 18 19:16 model.cache

Move the current directory to the top directory of the venv.

cd ../

That completes the preparation.

4. Starting the API server

Start Docker. The image used this time is tritonserver:23.12-trtllm-python-py3.

# Directory to mount into Docker.
mkdir cache
# Start Docker
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v ./tensorrtllm_backend:/tensorrtllm_backend \
    -v ./cache:/root/.cache \
    nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 bash

With image version 23.10 or 23.11, the TensorRT version is detected as not matching the one used to build the TRT engine, and the following error occurs.

[TensorRT-LLM][ERROR] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 226, Serialized Engine Version: 228)

Note: the TRT engine was built with 9.2.0.5, whereas the version shipped inside the Docker image appears to be 9.2.0.4, which seems to be the cause.
TensorRT version is not compatible, expecting library version 9.2.0.4 got 9.2.0.5 · Issue #194
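When scanning a long server log for this kind of failure, the two version tags can be pulled out of the message with a regular expression. This is just a sketch keyed to the exact wording of the error above. The function name serialization_mismatch is hypothetical, and the message format may differ in other TensorRT versions.

```python
import re

# Matches the tail of the stdArchiveReader error shown above.
_VERSION_RE = re.compile(r"Current Version: (\d+), Serialized Engine Version: (\d+)")


def serialization_mismatch(log_line: str):
    """Return (runtime_tag, engine_tag) if the line reports a mismatch, else None."""
    m = _VERSION_RE.search(log_line)
    if m is None:
        return None
    runtime_tag, engine_tag = int(m.group(1)), int(m.group(2))
    return (runtime_tag, engine_tag) if runtime_tag != engine_tag else None


if __name__ == "__main__":
    line = ("Version tag does not match. Note: Current Version: 226, "
            "Serialized Engine Version: 228)")
    print(serialization_mismatch(line))
```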

Install additional packages, then start the API server.

pip install sentencepiece protobuf torch==2.1.0
#
cd /tensorrtllm_backend
CUDA_VISIBLE_DEVICES=0 python3 scripts/launch_triton_server.py \
    --world_size=1 \
    --model_repo=/tensorrtllm_backend/triton_model_repo

Here is the log output from running the command until the API server has started successfully.

I0218 10:52:48.720655 123 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x205000000' with size 268435456
I0218 10:52:48.720745 123 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0218 10:52:48.724949 123 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0218 10:52:48.724988 123 model_lifecycle.cc:461] loading: preprocessing:1
I0218 10:52:48.725305 123 model_lifecycle.cc:461] loading: postprocessing:1
I0218 10:52:48.742660 123 python_be.cc:2363] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0218 10:52:48.768943 123 python_be.cc:2363] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter version cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'version' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I0218 10:52:50.390619 123 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0218 10:52:50.481971 123 model_lifecycle.cc:818] successfully loaded 'postprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 6641 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +10, now: CPU 6684, GPU 8267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +8, now: CPU 6686, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +6635, now: CPU 0, GPU 6635 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6721, GPU 10137 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6721, GPU 10145 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 6635 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6763, GPU 10163 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 6764, GPU 10171 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 6635 (MiB)
[TensorRT-LLM][INFO] Allocate 3040870400 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 3712 total tokens in paged KV cache, and 20 blocks per sequence
[TensorRT-LLM][WARNING] max_num_sequences is smaller than  2 times the engine max_batch_size. Batches smaller than max_batch_size will be executed.
I0218 10:53:04.380211 123 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I0218 10:53:04.380593 123 model_lifecycle.cc:461] loading: ensemble:1
I0218 10:53:04.381033 123 model_lifecycle.cc:818] successfully loaded 'ensemble'
I0218 10:53:04.381109 123 server.cc:606]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0218 10:53:04.381136 123 server.cc:633]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                  |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
|             |                                                                 | e-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}}             |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
|             |                                                                 | e-capability":"6.000000","default-max-batch-size":"4"}}                                                 |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+

I0218 10:53:04.381152 123 server.cc:676]
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |
+----------------+---------+--------+

I0218 10:53:04.437396 123 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090
I0218 10:53:04.442141 123 metrics.cc:710] Collecting CPU metrics
I0218 10:53:04.442397 123 tritonserver.cc:2483]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                               |
| server_version                   | 2.41.0                                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_me |
|                                  | mory binary_tensor_data parameters statistics trace logging                                                                                          |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo                                                                                                               |
| model_control_mode               | MODE_NONE                                                                                                                                            |
| strict_model_config              | 1                                                                                                                                                    |
| rate_limit                       | OFF                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                             |
| min_supported_compute_capability | 6.0                                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                                   |
| cache_enabled                    | 0                                                                                                                                                    |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+

I0218 10:53:04.457051 123 grpc_server.cc:2495] Started GRPCInferenceService at 0.0.0.0:8001
I0218 10:53:04.457583 123 http_server.cc:4619] Started HTTPService at 0.0.0.0:8000
I0218 10:53:04.557072 123 http_server.cc:282] Started Metrics Service at 0.0.0.0:8002

It started.
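The KV cache figures in the log ("Allocate 3040870400 bytes for k/v cache" and "Using 3712 total tokens in paged KV cache") can be cross-checked with a little arithmetic. The sketch below uses the Llama-2 13B shape (40 layers, hidden size 5120) and assumes the cache holds float16 values, which fits the float16 build options but is not stated in the log.

```python
# Llama-2 13B architecture (ELYZA-japanese-Llama-2-13b is based on it).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120      # 40 attention heads x 128 dims per head
BYTES_PER_VALUE = 2     # assumption: KV cache stored as float16


def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int) -> int:
    """Bytes needed to cache one token: a key and a value vector in every layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def tokens_that_fit(cache_bytes: int) -> int:
    """Number of whole tokens a cache of the given size can hold."""
    return cache_bytes // kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE, BYTES_PER_VALUE)


if __name__ == "__main__":
    print(kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE, BYTES_PER_VALUE))  # 819200
    print(tokens_that_fit(3040870400))  # 3712
```

Under that assumption the 3,040,870,400 bytes allocated correspond exactly to the 3712 tokens reported. Likewise, the attention window of 2560 at 20 blocks per sequence implies 128 tokens per block.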

5. Calling the API

The Triton Inference Server protocol specification is published at server/docs/protocol/README.md in the triton-inference-server/server repository on GitHub.

v2/health/ready
The health check returned 200 OK, so the server is running.

$ curl -vvv http://localhost:8000/v2/health/ready
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact
$

v2/repository/models/index
Every model is READY.

$ curl -X POST http://localhost:8000/v2/repository/models/index 2>/dev/null | jq .
[
  {
    "name": "ensemble",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "postprocessing",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "preprocessing",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "tensorrt_llm",
    "version": "1",
    "state": "READY"
  }
]
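Instead of reading the JSON by eye, the same check can be scripted. A minimal sketch that takes the index response as text and lists any model whose state is not READY; the function name not_ready is my own.

```python
import json


def not_ready(index_json: str) -> list[str]:
    """Return the names of models in the index whose state is not READY."""
    models = json.loads(index_json)
    return [m["name"] for m in models if m.get("state") != "READY"]


if __name__ == "__main__":
    sample = '[{"name": "ensemble", "version": "1", "state": "READY"}]'
    print(not_ready(sample))
```

An empty list means every model in the repository loaded.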

Now let's call generate, specifying the ensemble model.

curl -X POST http://localhost:8000/v2/models/ensemble/generate -d '{
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": ""
}'

I waited a while, but no response came back...
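Because the call never returned, it helps to issue the same request with an explicit timeout so that a hang shows up as an error instead of an indefinite wait. Below is a sketch using only the Python standard library. The function names and the 30-second default are my own choices, and the payload mirrors the curl example above.

```python
import json
import socket
import urllib.error
import urllib.request

GENERATE_URL = "http://localhost:8000/v2/models/ensemble/generate"


def build_generate_request(prompt: str, max_tokens: int,
                           url: str = GENERATE_URL) -> urllib.request.Request:
    """Build the POST request with the same fields as the curl example."""
    payload = {"text_input": prompt, "max_tokens": max_tokens,
               "bad_words": "", "stop_words": ""}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_generate(req: urllib.request.Request, timeout_s: float = 30.0):
    """Send the request; return the decoded JSON body, or None on timeout,
    connection failure or HTTP error."""
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (socket.timeout, urllib.error.URLError) as exc:
        print(f"no usable response within {timeout_s}s: {exc}")
        return None
```

Calling call_generate(build_generate_request("What is machine learning?", 20)) then either returns the parsed response or reports the failure after the timeout.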

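Summary

I got as far as starting TensorRT-LLM Backend, but could not run the all-important inference...

I also tried the latest TensorRT-LLM Backend instead of 0.5.0, but that produced a different error (a segmentation fault), and I have not been able to identify the cause...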