Trying out CogVideoX-5B on WSL2

August 28, 2024, 09:28

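I'm going to try the 5B model of CogVideo (CogVideoX-5B), which is described as "a large-scale diffusion transformer model designed for generating videos based on text prompts" and which, "to efficiently model video data", proposes "to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions".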

The PC I'm using is Dospara's "GALLERIA UL9C-R49". The specs are:
・CPU: Intel® Core™ i9-13900HX Processor
・Mem: 64 GB
・GPU: NVIDIA® GeForce RTX™ 4090 Laptop GPU (16GB)
・GPU: NVIDIA® GeForce RTX™ 4090 (24GB)
・OS: Ubuntu 22.04 on WSL2 (Windows 11)

1. Preparation

Environment setup

python3 -m venv cogvideo
cd $_
source bin/activate

Clone the repository.

git clone https://github.com/THUDM/CogVideo
cd CogVideo

These are the packages I installed.

pip install -r requirements.txt

# for cli_demo_quantization.py
pip install torchao
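Since this machine has two GPUs, it may be worth confirming which devices PyTorch can see from inside WSL2 before running anything heavy. The following is a minimal sketch, not part of the CogVideo repository; the helper name `describe_cuda_devices` is hypothetical, and it assumes PyTorch was pulled in by requirements.txt.

```python
import importlib.util


def describe_cuda_devices():
    """Return (index, name, total memory in GiB) for each visible CUDA device.

    Returns an empty list when PyTorch is missing or CUDA is unavailable.
    """
    if importlib.util.find_spec("torch") is None:
        return []  # PyTorch is not installed in this environment
    import torch

    if not torch.cuda.is_available():
        return []  # no usable CUDA device (driver or WSL2 setup issue)

    devices = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        total_gib = round(props.total_memory / 1024**3, 1)
        devices.append((index, props.name, total_gib))
    return devices


if __name__ == "__main__":
    for index, name, total_gib in describe_cuda_devices():
        print(f"cuda:{index}  {name}  {total_gib} GiB")
```

The indices printed here are the ones PyTorch uses, which may not match the order shown by nvidia-smi, so checking the device name is safer than trusting the number alone.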

2. Trying it out

There are various scripts under the inference directory. Let's check which ones use the 5B model.

$ grep -i -e 5B inference/*.py
inference/cli_demo.py:    # We recommend using `CogVideoXDDIMScheduler` for CogVideoX-2B and `CogVideoXDPMScheduler` for CogVideoX-5B.
inference/cli_demo.py:        "--model_path", type=str, default="THUDM/CogVideoX-5b", help="The path of the pre-trained model to be used"
inference/cli_demo.py:    # For CogVideoX-5B model, use torch.bfloat16.
inference/cli_demo_quantization.py:python cli_demo_quantization.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-5b --quantization_scheme fp8 --dtype bfloat16
inference/cli_demo_quantization.py:        "--model_path", type=str, default="THUDM/CogVideoX-5b", help="The path of the pre-trained model to be used"
inference/gradio_web_demo.py:                    "For the 5B model, 50 steps will take approximately 350 seconds."
inference/gradio_web_demo.py:                <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay></video>
$

I see. It looks like cli_demo.py and cli_demo_quantization.py.
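For reference, the same case-insensitive search can be sketched in Python. This is only an illustration of what the grep command does; the function name `grep_ignore_case` is hypothetical. It also shows why the video URL line in gradio_web_demo.py appears in the results: the hex string in the URL happens to contain "5b".

```python
import re
from pathlib import Path


def grep_ignore_case(pattern, directory, glob="*.py"):
    """Yield (file name, line number, line) for lines containing pattern, ignoring case."""
    # re.escape treats the pattern as a literal string, which is enough for "5B".
    regex = re.compile(re.escape(pattern), re.IGNORECASE)
    for path in sorted(Path(directory).glob(glob)):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path.name, number, line


if __name__ == "__main__":
    # Run from the root of the cloned CogVideo repository.
    for name, number, line in grep_ignore_case("5B", "inference"):
        print(f"{name}:{number}: {line.strip()}")
```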

Now let's run cli_demo.py with the sample prompt from the README.

CUDA_VISIBLE_DEVICES=0 python inference/cli_demo.py --prompt "A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a twitter bird on a mottled wall."
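As an aside, for anyone who would rather call the model from Python, the command can be approximated with the diffusers API. This is a rough sketch and not the actual contents of cli_demo.py: the function name `generate_video` is hypothetical, and the `guidance_scale=6.0` and `fps=8` values are my assumptions rather than something confirmed in this article. The bfloat16 dtype and the DPM scheduler follow the comments that grep found in cli_demo.py.

```python
def generate_video(
    prompt,
    model_path="THUDM/CogVideoX-5b",
    output_path="./output.mp4",
    num_inference_steps=50,
    guidance_scale=6.0,  # assumed value, not taken from this article
):
    """Generate a video from a text prompt and write it to output_path."""
    # Imports are deferred so this module can be imported without the GPU stack.
    import torch
    from diffusers import CogVideoXDPMScheduler, CogVideoXPipeline
    from diffusers.utils import export_to_video

    # cli_demo.py's comments recommend bfloat16 for the 5B model.
    pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    # ...and CogVideoXDPMScheduler for CogVideoX-5B.
    pipe.scheduler = CogVideoXDPMScheduler.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )
    # pipe.enable_model_cpu_offload() is an alternative if VRAM is tight.
    pipe.to("cuda")

    frames = pipe(
        prompt=prompt,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    ).frames[0]
    export_to_video(frames, output_path, fps=8)  # fps is an assumed value
    return output_path


if __name__ == "__main__":
    generate_video("A girl riding a bike.")
```

The timings and VRAM figures reported in this article come from running the cli_demo.py command as shown, not from this sketch, so numbers may differ.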

After about four and a half minutes...

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.63it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.73it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████| 50/50 [04:29<00:00,  5.38s/it]

It finished.
A video has been generated in the current directory as ./output.mp4. Here it is.

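CogVideoX-5B https://t.co/J3InnSSu5N
Tried it out. It worked with surprisingly little effort.
This is what was generated by specifying the sample prompt, as in $ python inference/cli_demo.py --prompt "...". Using a 4090, VRAM usage was about 17.5GB and generation took 4 minutes 29 seconds. pic.twitter.com/xxtXjYBWEU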
— NOGUCHI, Shoji (@noguchis) August 27, 2024
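The 4 minutes 29 seconds figure lines up with the progress bar in the log: 50 denoising steps at roughly 5.38 s per step. A small worked check (the helper name `format_duration` is just for illustration):

```python
def format_duration(total_seconds):
    """Format a duration in seconds as M:SS, rounded to the nearest second."""
    minutes, seconds = divmod(round(total_seconds), 60)
    return f"{minutes}:{seconds:02d}"


steps = 50               # "50/50" in the progress bar
seconds_per_step = 5.38  # "5.38s/it" in the progress bar

estimate = steps * seconds_per_step  # about 269 seconds
print(format_duration(estimate))     # prints 4:29
```

For comparison, the note in gradio_web_demo.py quotes approximately 350 seconds for 50 steps with the 5B model, which works out to 7 s per step. It does not say which hardware that refers to.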

VRAM usage during inference was 17.5GB. At the end, apparently while writing out the file, it rose slightly to about 18.0GB.
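One way to capture a peak like this is to poll nvidia-smi from a second terminal while the generation is running. The sketch below is an assumption about how such a measurement could be taken, not a record of how the numbers above were obtained; the function names are hypothetical, and values are reported by nvidia-smi in MiB.

```python
import subprocess
import time


def parse_memory_used(csv_text):
    """Parse nvidia-smi CSV output (noheader, nounits) into one MiB value per GPU."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_memory_used():
    """Ask nvidia-smi for the memory currently in use on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def watch_peak(samples=300, interval=1.0, gpu_index=0, query=query_memory_used):
    """Poll `query` repeatedly and return the highest reading for one GPU, in MiB."""
    peak = 0
    for _ in range(samples):
        peak = max(peak, query()[gpu_index])
        time.sleep(interval)
    return peak


if __name__ == "__main__":
    # 300 samples at 1 s covers a run of about five minutes.
    peak_mib = watch_peak()
    print(f"peak: {peak_mib} MiB ({peak_mib / 1024:.1f} GiB)")
```

Polling once per second can miss short spikes, so a brief increase such as the one at file-writing time may be under-reported.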
