Trying VideoCrafter2 on WSL2

January 29, 2024, 09:32

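VideoCrafter is described as "an open-source video generation and editing toolbox for crafting video content (it currently includes Text2Video and Image2Video models)". Version 2 was released the other day, so I'm giving it a try.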

The PC is a GALLERIA UL9C-R49 from Dospara. Its specs are:
・CPU: Intel® Core™ i9-13900HX Processor
・Mem: 64 GB
・GPU: NVIDIA® GeForce RTX™ 4090 Laptop GPU (16GB)
・GPU: NVIDIA® GeForce RTX™ 4090 (24GB)
・OS: Ubuntu 22.04 on WSL2 (Windows 11)

1. Preparation

Environment setup

python3 -m venv videocrafter
cd $_
source bin/activate

Clone the repository.

git clone https://github.com/AILab-CVC/VideoCrafter.git
cd VideoCrafter

Install the packages.

pip install -r requirements.txt

Launching the Gradio demo app failed with an error (Row has no attribute called style), so I downgrade Gradio.

pip uninstall -y gradio
pip install gradio==3.41.2
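
If you would rather fix this before installing, the unpinned entry in requirements.txt can be pinned instead. Below is a minimal sketch, assuming the file contains a line that is exactly "gradio" and that GNU sed is available (as on Ubuntu). pin_gradio is a hypothetical helper name of mine, not something the repository provides.

```shell
#!/usr/bin/env bash
# Sketch only: pin_gradio is a hypothetical helper, not part of VideoCrafter.
# It rewrites a bare "gradio" line in a requirements file to "gradio==<version>".
pin_gradio() {
  local req_file="$1"
  local version="${2:-3.41.2}"   # default is the version that worked in this post
  # Match only a line that is exactly "gradio" (optionally followed by spaces),
  # so lines that already carry a version specifier are left alone.
  sed -i -E "s/^gradio[[:space:]]*\$/gradio==${version}/" "$req_file"
}
```

Usage would look like: pin_gradio requirements.txt, followed by pip install -r requirements.txt. With this, the separate uninstall and reinstall step is not needed.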

Download

Download the checkpoints for the Text2Video and Image2Video models into their designated directories.

# Text2Video model
wget -P checkpoints/base_512_v2 https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt
# Image2Video model
wget -P checkpoints/i2v_512_v1 https://huggingface.co/VideoCrafter/Image2Video-512/resolve/main/model.ckpt
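
The checkpoints are large, so it is worth confirming that both files landed where the scripts expect them. Here is a small sketch of such a check. check_checkpoints is a hypothetical helper of mine; the two relative paths are the ones used by the wget commands above.

```shell
#!/usr/bin/env bash
# Sketch only: check_checkpoints is a hypothetical helper, not part of VideoCrafter.
# It reports whether both checkpoint files exist and are non-empty.
check_checkpoints() {
  local root="${1:-checkpoints}"
  local missing=0
  local rel
  for rel in base_512_v2/model.ckpt i2v_512_v1/model.ckpt; do
    if [ -s "$root/$rel" ]; then
      echo "ok: $root/$rel"
    else
      echo "missing or empty: $root/$rel" >&2
      missing=1
    fi
  done
  # Return 0 only when every file was found.
  return "$missing"
}
```

Run check_checkpoints from the repository root. It only checks that the files exist and are non-empty, so it does not detect a truncated download.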

2. A look at the code

scripts/run_text2video.sh

name="base_512_v2"

ckpt='checkpoints/base_512_v2/model.ckpt'
config='configs/inference_t2v_512_v2.0.yaml'

prompt_file="prompts/test_prompts.txt"
res_dir="results"

python3 scripts/evaluation/inference.py \
--seed 123 \
--mode 'base' \
--ckpt_path $ckpt \
--config $config \
--savedir $res_dir/$name \
--n_samples 1 \
--bs 1 --height 320 --width 512 \
--unconditional_guidance_scale 12.0 \
--ddim_steps 50 \
--ddim_eta 1.0 \
--prompt_file $prompt_file \
--fps 28

The input to this script is prompts/test_prompts.txt, which holds the prompts. It contains the following two lines.

A tiger walks in the forest, photorealistic, 4k, high definition
A boat moving on the sea, flowers and grassland on the shore

It appears to read the file one line at a time and generate one video per line.
To change what this script generates, just edit this file.
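
If you want to keep the sample prompts intact, another option is to point a copy of the script at your own prompt file. The following is a rough sketch under that assumption. make_prompt_script, prompts/my_prompts.txt, and scripts/run_my_t2v.sh are hypothetical names I made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: make_prompt_script is a hypothetical helper, not part of VideoCrafter.
# It copies a run script and swaps its prompt_file="..." line for another path.
# The path must not contain "|" or "&", since it is spliced into a sed expression.
make_prompt_script() {
  local src="$1" dst="$2" prompts="$3"
  sed "s|^prompt_file=.*|prompt_file=\"${prompts}\"|" "$src" > "$dst"
}
```

For example: write one prompt per line into prompts/my_prompts.txt, run make_prompt_script scripts/run_text2video.sh scripts/run_my_t2v.sh prompts/my_prompts.txt, then bash ./scripts/run_my_t2v.sh. Since name is unchanged, the output still goes to results/base_512_v2.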

scripts/run_image2video.sh

name="i2v_512_test"

ckpt='checkpoints/i2v_512_v1/model.ckpt'
config='configs/inference_i2v_512_v1.0.yaml'

prompt_file="prompts/i2v_prompts/test_prompts.txt"
condimage_dir="prompts/i2v_prompts"
res_dir="results"

python3 scripts/evaluation/inference.py \
--seed 123 \
--mode 'i2v' \
--ckpt_path $ckpt \
--config $config \
--savedir $res_dir/$name \
--n_samples 1 \
--bs 1 --height 320 --width 512 \
--unconditional_guidance_scale 12.0 \
--ddim_steps 50 \
--ddim_eta 1.0 \
--prompt_file $prompt_file \
--cond_input $condimage_dir \
--fps 8

This script takes prompts and images as input.
・Prompts: prompts/i2v_prompts/test_prompts.txt

horses are walking on the grassland
a boy and a girl are talking on the seashore

・Images: prompts/i2v_prompts

$ ls -l prompts/i2v_prompts/*.png
-rw-r--r-- 1 user user 476298 Jan 29 02:07 prompts/i2v_prompts/horse.png
-rw-r--r-- 1 user user 390038 Jan 29 02:07 prompts/i2v_prompts/seashore.png
$
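
The sample ships two prompts and two images, so when swapping in your own inputs it seems prudent to keep the counts equal. Here is a small sanity check written under that assumption. I have not confirmed how the inference script pairs prompts with images or whether it requires equal counts, and check_i2v_inputs and its two helpers are hypothetical names of mine.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical, not part of VideoCrafter.
# Assumption (unverified): each prompt line corresponds to one image.

# Count non-blank lines in the prompt file.
count_prompts() {
  grep -c -v '^[[:space:]]*$' "$1" || true
}

# Count .png files directly inside the image directory.
count_images() {
  find "$1" -maxdepth 1 -type f -name '*.png' | wc -l
}

# Print both counts and succeed only when they match.
check_i2v_inputs() {
  local prompts="$1" dir="$2"
  local n_prompts n_images
  n_prompts=$(count_prompts "$prompts")
  n_images=$(count_images "$dir")
  echo "prompts=$n_prompts images=$n_images"
  [ "$n_prompts" -eq "$n_images" ]
}
```

With the sample data, check_i2v_inputs prompts/i2v_prompts/test_prompts.txt prompts/i2v_prompts should report two of each.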

3. Trying it out

This time I use the Laptop GPU (16GB).

Text2Video

Run it with the sample prompts unchanged.

CUDA_VISIBLE_DEVICES=1 bash ./scripts/run_text2video.sh

Here is the log from the run.

@CoLVDM Inference: 2024-01-29-02-35-52
Global seed set to 123
AE working on z of shape (1, 4, 64, 64) = 16384 dimensions.
>>> model checkpoint loaded.
[rank:0] 2/2 samples loaded.
[rank:0] batch-1 (1)x1 ...
DDIM scale True
ddim device cuda:0
/mnt/data/shoji_noguchi/venv/videocrafter/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[rank:0] batch-2 (1)x1 ...
DDIM scale True
ddim device cuda:0
Saved in results/base_512_v2. Time used: 192.64 seconds

The two videos were generated in about 193 seconds in total.

$ ls -l results/base_512_v2/
total 1552
-rw-r--r-- 1 user user 690689 Jan 29 02:37 0001.mp4
-rw-r--r-- 1 user user 894752 Jan 29 02:39 0002.mp4
$

Here is the generated 0002.mp4.

VideoCrafter2.
The samples run even on an RTX 4090 Laptop GPU (16GB). The sample prompt for this video is
> A boat moving on the sea, flowers and grassland on the shore

Also, requirements.txt lists just "gradio", but the demo doesn't run that way. It needs to be "gradio==3.41.2". https://t.co/Sf7tLbS0Rw pic.twitter.com/HgemvYf6sl
— NOGUCHI, Shoji (@noguchis) January 28, 2024

VRAM usage was 9.5GB (10.6 - 1.1).
* The 1.1GB is used for display output.
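
The figure above is the total in use minus the idle baseline. For reference, here is a sketch of how such a number could be taken from the command line. The two function names are hypothetical. nvidia-smi reports MiB, so the subtraction helper simply takes whatever values you pass in, already converted to the unit you want.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical names made up for this post.

# Print the memory currently used on one GPU, in MiB.
# Requires nvidia-smi; the argument is the GPU index as nvidia-smi lists it.
gpu_mem_used_mib() {
  nvidia-smi --id="$1" --query-gpu=memory.used --format=csv,noheader,nounits
}

# Subtract the idle baseline from the usage seen during generation.
# Both arguments must be in the same unit; the result has one decimal place.
vram_delta() {
  awk -v total="$1" -v base="$2" 'BEGIN { printf "%.1f\n", total - base }'
}
```

For the run above, vram_delta 10.6 1.1 gives 9.5. Calling gpu_mem_used_mib once before starting and again while the script is generating would supply the two inputs.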

Image2Video

Run it with the sample prompts and images unchanged.

CUDA_VISIBLE_DEVICES=1 bash ./scripts/run_image2video.sh

Here is the log printed to stdout.

Global seed set to 123
AE working on z of shape (1, 4, 64, 64) = 16384 dimensions.
>>> model checkpoint loaded.
[rank:0] 2/2 samples loaded.
[rank:0] batch-1 (1)x1 ...
DDIM scale True
ddim device cuda:0
/mnt/data/shoji_noguchi/venv/videocrafter/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[rank:0] batch-2 (1)x1 ...
DDIM scale True
ddim device cuda:0
Saved in results/i2v_512_test. Time used: 195.98 seconds

The two videos were generated in about 196 seconds in total.

$ ls -l results/i2v_512_test/
total 1820
-rw-r--r-- 1 user user 1197040 Jan 29 02:42 horse.mp4
-rw-r--r-- 1 user user  661930 Jan 29 02:43 seashore.mp4

Here is horse.mp4.

The previous video was from Text2Video. This one was generated with Image2Video.
The prompt is the sample as-is.
> horses are walking on the grassland pic.twitter.com/JVjGZmUAKZ
— NOGUCHI, Shoji (@noguchis) January 29, 2024

Image2Video used 12.5GB of VRAM (13.6 - 1.1GB).

Gradio

CUDA_VISIBLE_DEVICES=1 python gradio_app.py

It started up, so I give it a prompt and generate a video!

Doraemon is walking over the sea, 4K, high definition

Here is the generated video (well, a still image of it).

Please play it back in your head, lol.

VRAM usage during generation was 9.6GB (10.7 - 1.1GB).

4. Summary

For generating videos of about 1 to 2 seconds, an RTX 4090 Laptop GPU (16GB) handles it without trouble.