Summary of Training Code for Large Language Models

April 5, 2023, 16:08

I have compiled the training code for several large language models.

1. Alpaca training code

Alpaca fine-tunes "LLaMA" using the standard "HuggingFace Transformers" training code.

Because "Transformers" does not yet officially support "LLaMA", a specific fork (68d640f7c368bcaaaecfc678f11908ebbd3d6176) is used.

The following command fine-tunes "LLaMA-7B" on the dataset, on a machine with four A100 80G GPUs, in FSDP full_shard mode.

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
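
As a rough sanity check on the command above, the effective global batch size and the shape of the learning-rate schedule can be worked out by hand. The sketch below is not part of the Alpaca repository: the function names are hypothetical, the total step count in the usage lines is a made-up value, and the schedule is an approximation of "linear warmup, then cosine decay" rather than the exact Transformers implementation.

```python
import math


def effective_batch_size(num_gpus: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return num_gpus * per_device_batch * grad_accum_steps


def lr_at_step(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (an approximation)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values from the command above: 4 GPUs, batch size 4 per device, 8 accumulation steps.
print(effective_batch_size(4, 4, 8))  # 128

# total_steps=1000 is a hypothetical value chosen only for illustration.
for step in (0, 30, 500, 1000):
    print(step, lr_at_step(step, total_steps=1000, peak_lr=2e-5, warmup_ratio=0.03))
```

With these settings each optimizer step sees 128 examples, so halving the number of GPUs would require doubling --gradient_accumulation_steps to keep the same effective batch size.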

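2. Alpaca-LoRA training code

Alpaca-LoRA fine-tunes "LLaMA" with LoRA using "PEFT".

"finetune.py" contains a straightforward application of "PEFT" to "LLaMA", along with some code related to prompt construction and tokenization.

An example of using the training code is as follows.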
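
To make "prompt construction" concrete, here is a minimal sketch of an Alpaca-style prompt builder. The template text follows the commonly used Alpaca instruction format; the function name build_prompt and the constant names are hypothetical, and the actual code in "finetune.py" may differ in detail.

```python
from typing import Optional

# Template used when the example has both an instruction and an input.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
# Shorter template used when there is no input.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(instruction: str, input_text: Optional[str] = None) -> str:
    """Fill the template; use the shorter variant when there is no input."""
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)


print(build_prompt("Translate to French.", "Good morning"))
```

The resulting string is what gets tokenized; the model's expected answer is appended after the "### Response:" marker during training.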

python finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir './lora-alpaca'

You can also adjust the hyperparameters.

python finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir './lora-alpaca' \
    --batch_size 128 \
    --micro_batch_size 4 \
    --num_epochs 3 \
    --learning_rate 1e-4 \
    --cutoff_len 512 \
    --val_set_size 2000 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lora_target_modules '[q_proj,v_proj]' \
    --train_on_inputs \
    --group_by_length
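
Some of these hyperparameters interact in ways that are easy to work out by hand. The sketch below assumes that the number of gradient accumulation steps is batch_size divided by micro_batch_size on a single process, and that the low-rank update is scaled by lora_alpha / lora_r as in the usual LoRA formulation. The helper names are hypothetical and are not taken from "finetune.py", and the 4096 hidden size is the value commonly cited for LLaMA-7B.

```python
def grad_accum_steps(batch_size: int, micro_batch_size: int) -> int:
    """Micro-batches accumulated per optimizer step (single-process assumption)."""
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size should be a multiple of micro_batch_size")
    return batch_size // micro_batch_size


def lora_scaling(lora_alpha: float, lora_r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / lora_r


def lora_param_count(d_in: int, d_out: int, lora_r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return lora_r * (d_in + d_out)


print(grad_accum_steps(128, 4))         # 32
print(lora_scaling(16, 8))              # 2.0
# 4096 is assumed here as the LLaMA-7B hidden size.
print(lora_param_count(4096, 4096, 8))  # 65536
```

Under these assumptions, each adapted projection (q_proj or v_proj) adds about 65K trainable parameters per layer, compared with roughly 16.8M in the frozen 4096 x 4096 weight itself.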

3. GPT4ALL training code

An example of using the "GPT4ALL" training code is as follows.

accelerate launch \
    --dynamo_backend=inductor \
    --num_processes=8 \
    --num_machines=1 \
    --machine_rank=0 \
    --deepspeed_multinode_launcher standard \
    --mixed_precision=bf16 \
    --use_deepspeed \
    --deepspeed_config_file=configs/deepspeed/ds_config.json \
    train.py --config configs/train/finetune-7b.yaml

To use custom data, change dataset_path in the config file.

・configs/train/finetune.yaml

# model/tokenizer
model_name: # add model here
tokenizer_name: # add model here
gradient_checkpointing: true
save_name: "nomic-ai/gpt4all-full-multi-turn"

# dataset
streaming: false
num_proc: 64
dataset_path: # update
max_length: 1024
batch_size: 32

# train dynamics
lr: 5.0e-5
eval_every: 800
eval_steps: 100
save_every: 800
output_dir: "ckpts/gpt4all-full-multi"
checkpoint: null
lora: false
warmup_steps: 100
num_epochs: 2

# logging
wandb: true
wandb_entity: # update
wandb_project_name: # update
seed: 42

・configs/train/finetune_lora.yaml

# model/tokenizer
model_name: # update
tokenizer_name: # update
gradient_checkpointing: false
save_name: "nomic-ai/gpt4all-lora-multi-turn"

# dataset
streaming: false
num_proc: 64
dataset_path: "data_multiturn"
max_length: 1024
batch_size: 4

# train dynamics
lr: 5.0e-5
eval_every: 2000
eval_steps: 100
save_every: 2000
output_dir: "ckpts/gpt4all-lora-multi"
checkpoint: null
lora: true
warmup_steps: 100
num_epochs: 2

# logging
wandb: true
wandb_entity: # update
wandb_project_name: # update
seed: 42

The data is in jsonl format, as shown below.

{"prompt": "Russia Finishes Building Iran Nuclear Plant  MOSCOW (Reuters) - Russia and Iran said Thursday they had  finished construction of an atomic power plant in the Islamic  Republic -- a project the United States fears Tehran could use  to make nuclear arms. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "This is a piece of news regarding world politics and science and technology.", "source": "bigscience/p3"}
{"prompt": "Goosen brings a sparkle to the gloom With overnight rain causing a 2-hour delay to the start of play in the HSBC World Matchplay Championship at Wentworth, Retief Goosen did his bit to help to make up ground. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
{"prompt": "Geiberger has share of clubhouse lead at Greensboro Brent Geiberger took advantage of ideal morning conditions to earn a share of the clubhouse lead with a 6-under-par 66 during the first  \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
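
Before pointing dataset_path at custom data, it can help to check that every line parses and carries the three keys seen in the samples above. The following is a minimal sketch using only the standard library; the function name parse_jsonl and the strict key check are assumptions for illustration, not part of the GPT4ALL code, and the sample record is shortened.

```python
import json
from typing import Dict, Iterable, List

# Keys observed in the sample records above.
REQUIRED_KEYS = ("prompt", "response", "source")


def parse_jsonl(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse one JSON object per non-empty line and check the expected keys."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        records.append(record)
    return records


sample = [
    '{"prompt": "Is this sports news? ", "response": "sports", "source": "bigscience/p3"}',
]
print(parse_jsonl(sample)[0]["response"])  # sports
```

An open file object can be passed in place of the list, since iterating over a text file yields one line at a time.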

4. RWKV training code

4.1 Training RWKV-4 from scratch

Run "train.py", which uses the "enwik8 dataset" (https://data.deepai.org/enwik8.zip) by default.

You will be training the "GPT" version, because it is parallelizable and faster to train. "RWKV-4" can extrapolate, so a model trained with ctxLen 1024 also works with ctxLen 2500 or more. You can fine-tune the model with a longer ctxLen, and it adapts quickly to the longer context.

4.2 Fine-tuning RWKV-4 Pile

Use "prepare-data.py" from "BlinkDL/RWKV-v2-RNN-Pile" to tokenize a ".txt" file into "train.npy" data. Then set "EXPRESS_PILE_MODE" to True in "train.py" and run it.

Read the inference code in "src/model.py" and try using the final hidden state (.xx .aa .bb) as a faithful sentence embedding for other tasks. It is probably best to start with .xx and .aa/.bb (.aa divided by .bb).

・Colab for fine-tuning RWKV-4 Pile

4.3 Preparing a large corpus

Use "gpt-neox" to convert .jsonl into .bin and .idx.
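
As one possible reading of that suggestion, the sketch below builds a feature vector from .xx and the element-wise ratio .aa / .bb. The state is represented here as plain Python lists with made-up values; how the real state is stored in "src/model.py" is not shown in this article, so the function name, the eps guard, and the choice to concatenate the two parts are all assumptions.

```python
from typing import List, Sequence


def state_to_embedding(xx: Sequence[float], aa: Sequence[float], bb: Sequence[float],
                       eps: float = 1e-9) -> List[float]:
    """Concatenate xx with the element-wise ratio aa / bb."""
    if not (len(xx) == len(aa) == len(bb)):
        raise ValueError("xx, aa and bb must have the same length")
    # eps is a hypothetical guard against division by zero.
    ratio = [a / (b + eps) for a, b in zip(aa, bb)]
    return list(xx) + ratio


# Made-up state values; eps=0.0 keeps the arithmetic exact for this example.
print(state_to_embedding([0.1, 0.2], [2.0, 3.0], [4.0, 6.0], eps=0.0))
# [0.1, 0.2, 0.5, 0.5]
```

The resulting vector could then be fed to a simple downstream classifier or compared between sentences with cosine similarity.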

python tools/preprocess_data.py --input ./my_data.jsonl --output-prefix ./data/my_data --vocab ./20B_tokenizer.json --dataset-impl mmap --tokenizer-type HFTokenizer --append-eod

A sample in jsonl format is shown below.

{"text": "This is the first document."}
{"text": "Hello\nWorld"}
{"text": "1+1=2\n1+2=3\n2+2=4"}

It can be generated with code like the following.

import json
# meta and text hold one document; out is a text file opened for writing.
ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
out.write(ss + "\n")
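
For a version that runs end to end, the sketch below wraps the same json.dumps call in a small helper. The function name write_jsonl, the example documents, and the "id" field inside meta are hypothetical; an in-memory buffer stands in for a real output file.

```python
import io
import json
from typing import Iterable, TextIO, Tuple


def write_jsonl(docs: Iterable[Tuple[dict, str]], out: TextIO) -> int:
    """Write one {"meta", "text"} JSON object per line; return the line count."""
    count = 0
    for meta, text in docs:
        # ensure_ascii=False keeps non-ASCII text readable instead of \u-escaping it.
        out.write(json.dumps({"meta": meta, "text": text}, ensure_ascii=False) + "\n")
        count += 1
    return count


docs = [({"id": 1}, "This is the first document."), ({"id": 2}, "Hello\nWorld")]
buffer = io.StringIO()  # stands in for a real file opened with encoding="utf-8"
write_jsonl(docs, buffer)
print(buffer.getvalue(), end="")
```

Newlines inside a document are escaped by json.dumps, so each document still occupies exactly one line of the output file.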