Notes on fine-tuning GPT2 models with QLoRA (for LINE 1.7B and llm-jp 1.3B)

December 9, 2023, 13:53

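This post is the December 9 entry for LLM Advent Calendar 2023, Series 2.

LINE and LLM-jp have released lightweight LLMs, with fairly small parameter counts of 1.7B and 1.3B. Naturally they fall short of larger models in capability, but in exchange they are light to run, which opens up a range of possible applications.

These models are built on GPT2, an architecture from a generation before the currently mainstream GPT-NeoX and Llama 2. I found surprisingly little written about how to do QLoRA fine-tuning of GPT2 with PEFT, so I am leaving these notes as a memo.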

To give the conclusion first, what to specify for target_modules is

target_modules = ["wte", "wpe", "c_attn", "c_proj", "c_fc"]

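Everything else can be the same as in the QLoRA guides already out there. The hyperparameters are something you will have to explore for your own use case.

It was also possible to add lm_head to target_modules, but I have not yet fully worked through the details there, so for now I am using the five above.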
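For reference, when target_modules is given as a list, PEFT (to my understanding) applies LoRA to every module whose full name equals an entry or ends with "." plus that entry. That would explain why a single "c_proj" covers both the attention projection and the MLP projection. The sketch below imitates that rule in plain Python; the function name matches_target and the hand-written module-name list are made up for illustration, and the exact rule may differ between PEFT versions.

```python
# Illustrative re-implementation of suffix matching; this is not PEFT's actual code.
TARGET_MODULES = ["wte", "wpe", "c_attn", "c_proj", "c_fc"]


def matches_target(module_name, targets):
    """Return True if the name equals a target or ends with '.<target>'."""
    return any(
        module_name == t or module_name.endswith("." + t) for t in targets
    )


# Hand-written names following the GPT2LMHeadModel layout (one block only).
module_names = [
    "transformer.wte",
    "transformer.wpe",
    "transformer.h.0.ln_1",
    "transformer.h.0.attn.c_attn",
    "transformer.h.0.attn.c_proj",
    "transformer.h.0.mlp.c_fc",
    "transformer.h.0.mlp.c_proj",
    "transformer.ln_f",
    "lm_head",
]

selected = [n for n in module_names if matches_target(n, TARGET_MODULES)]
for name in selected:
    print(name)
```

Under this rule both c_proj modules are selected, while the LayerNorm layers and lm_head are left alone.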

Using these five target_modules, I actually fine-tuned line-corporation/japanese-large-lm-1.7b with QLoRA to specialize it in aizuchi (short backchannel responses). I wrote that up in a separate article.

Why "wte", "wpe", "c_attn", "c_proj", "c_fc"?

If you pick target_modules for QLoRA carelessly, you get scolded with an error like the following.

Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv2d`, `transformers.pytorch_utils.Conv1D`.

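Error message

It is saying that, at present, the only layers QLoRA can target are Linear, Embedding, Conv2d, and Conv1D layers.

So let's look inside the LINE and LLM-jp models. The model structure can be checked with a simple script like the one below. (Loading takes a fair while.)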

from transformers import AutoModelForCausalLM
model_id = "line-corporation/japanese-large-lm-1.7b-instruction-sft"
model = AutoModelForCausalLM.from_pretrained(model_id)
print(model)

The actual output is below.

line-corporation/japanese-large-lm-1.7b-instruction-sft



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(51200, 2304)
    (wpe): Embedding(2048, 2304)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((2304,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((2304,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): GELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((2304,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2304, out_features=51200, bias=False)
)
LINE pad token id: 2
LINE pad token: </s>
LINE eos token id: 2
LINE eos token: </s>

llm-jp/llm-jp-1.3b-v1.0

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50688, 2048)
    (wpe): Embedding(2048, 2048)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): GELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2048, out_features=50688, bias=False)
)
LLM-jp pad token id: 4
LLM-jp pad token: <pad|LLM-jp>
LLM-jp eos token id: 7
LLM-jp eos token: <EOD|LLM-jp>
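One detail in these tokenizer settings is worth a note: for the LINE model, pad and eos share id 2, while LLM-jp uses separate ids (4 and 7). As far as I know, DataCollatorForLanguageModeling(mlm=False), which the training script later in this article uses, sets the label to -100 (ignored by the loss) wherever the input id equals the pad token id. The plain-Python sketch below imitates only that step; mask_pad_labels is a made-up helper, and the content token ids (101 to 103) are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_pad_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in input_ids]


# Hypothetical sequences: three content tokens, then eos, then two pad tokens.
line_ids = [101, 102, 103, 2, 2, 2]   # LINE: eos id 2, pad id 2
llmjp_ids = [101, 102, 103, 7, 4, 4]  # LLM-jp: eos id 7, pad id 4

line_labels = mask_pad_labels(line_ids, pad_token_id=2)
llmjp_labels = mask_pad_labels(llmjp_ids, pad_token_id=4)

print(line_labels)
print(llmjp_labels)
```

When the ids are shared, the genuine eos position is masked along with the padding, so the loss never sees it. Whether that matters depends on whether the training text actually ends with eos and on the behavior you want at generation time.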

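Looking through the structures above for Linear, Embedding, Conv2d, and Conv1D layers gives (wte), (wpe), (c_attn), (c_proj), (c_fc), and (lm_head). Note that (c_proj) appears twice, once under attn and once under mlp. These are therefore the candidates for QLoRA's target_modules.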
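The same search can be done programmatically instead of by eye. The sketch below is only an illustration: it builds a tiny, randomly initialized GPT2LMHeadModel (the small config values are arbitrary, chosen so that nothing has to be downloaded) and collects the leaf names of every module whose type is one of the four in the error message. It needs torch and transformers installed; lora_candidate_names is a made-up helper name. It inspects an unquantized model, so results on a 4-bit model may look different.

```python
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel
from transformers.pytorch_utils import Conv1D

# Layer types named in the error message.
SUPPORTED_TYPES = (nn.Linear, nn.Embedding, nn.Conv2d, Conv1D)


def lora_candidate_names(model):
    """Return sorted leaf names of modules whose type is in SUPPORTED_TYPES."""
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, SUPPORTED_TYPES):
            # "transformer.h.0.attn.c_attn" -> "c_attn"
            names.add(full_name.split(".")[-1])
    return sorted(names)


# Tiny random model with the same layout as the real ones (arbitrary sizes).
tiny_config = GPT2Config(
    vocab_size=100, n_positions=32, n_embd=64, n_layer=2, n_head=2
)
tiny_model = GPT2LMHeadModel(tiny_config)

candidates = lora_candidate_names(tiny_model)
print(candidates)
```

Replacing tiny_model with a model loaded via AutoModelForCausalLM.from_pretrained should list the candidates for the actual checkpoints.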

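As for lm_head, I did not understand well enough how the adapter should be handled when it is targeted, so I skipped it for now. I would like to try it once I do.

As for the training results, they are as described in the article mentioned earlier, and I can feel the effect. I would like to explore things like hyperparameter tuning further.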
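As a rough aid when exploring r, the number of weights LoRA adds can be estimated by hand: a rank-r adapter on a layer with input size d_in and output size d_out adds r × (d_in + d_out) weights. The sketch below does that arithmetic for the LINE 1.7B structure shown above, with the r=128 and lora_alpha=32 used in the script that follows. The printed structure does not show the Conv1D shapes, so the usual GPT2 sizes (c_attn: n_embd to 3 × n_embd, c_fc: n_embd to 4 × n_embd) are an assumption here, and lora_param_count is a made-up helper name.

```python
def lora_param_count(d_in, d_out, r):
    """Weights added by one rank-r LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


R = 128          # LoRA rank
LORA_ALPHA = 32  # LoRA alpha
N_EMBD = 2304    # hidden size, from the printed LINE structure
N_LAYER = 24     # number of GPT2Block layers
VOCAB = 51200    # wte rows
N_POS = 2048     # wpe rows

per_block = (
    lora_param_count(N_EMBD, 3 * N_EMBD, R)    # attn.c_attn (assumed shape)
    + lora_param_count(N_EMBD, N_EMBD, R)      # attn.c_proj (assumed shape)
    + lora_param_count(N_EMBD, 4 * N_EMBD, R)  # mlp.c_fc (assumed shape)
    + lora_param_count(4 * N_EMBD, N_EMBD, R)  # mlp.c_proj (assumed shape)
)
embeddings = (
    lora_param_count(VOCAB, N_EMBD, R)    # wte
    + lora_param_count(N_POS, N_EMBD, R)  # wpe
)
total = N_LAYER * per_block + embeddings
scaling = LORA_ALPHA / R  # standard LoRA scales the update by lora_alpha / r

print(f"adapter params: {total:,}")
print(f"LoRA scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the estimate is roughly 120 million adapter weights, which can be compared against what print_trainable_parameters reports. Because the scaling is lora_alpha / r, raising r while leaving lora_alpha fixed also shrinks the scaling factor.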

Finally, here is the script I used for QLoRA training on the LINE 1.7B instruction-sft model. If anything looks off, I would be grateful if you pointed it out.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "line-corporation/japanese-large-lm-1.7b-instruction-sft"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=128,
    lora_alpha=32,
    target_modules=["wte", "wpe", "c_attn", "c_proj", "c_fc"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

import datasets

# Open the dataset JSONL
import json

outputs = []
with open("line_format_aiduti.jsonl", "r", encoding="utf-8") as f:
    lines = f.readlines()
    for line in lines:
        pair = json.loads(line)
        outputs.append(pair["text"])

# Convert outputs to a datasets.DatasetDict
data = datasets.Dataset.from_dict({"text": outputs})
data = datasets.DatasetDict({"train": data})
data = data.map(lambda samples: tokenizer(samples["text"]), batched=True)
print(data)

# Split data into 80% training data and 20% validation data
data = data['train']
data = data.train_test_split(test_size=0.2)
train_dataset = data["train"]
eval_dataset = data["test"]

import transformers
import os
os.environ["WANDB_DISABLED"] = "true"

tokenizer.pad_token = "</s>"
tokenizer.pad_token_id = 2
tokenizer.eos_token = "</s>"
tokenizer.eos_token_id = 2

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=2,
        warmup_steps=200,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=100,
        output_dir="results-nolmhead",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
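The script above reads line_format_aiduti.jsonl and takes the "text" field of each line. The file itself is not shown in this article, so the sketch below only illustrates the expected shape: one JSON object per line with a "text" key. The sample strings are placeholders, not the actual prompt format used for training, and sample.jsonl is a throwaway file name.

```python
import json
import os
import tempfile

# Placeholder records; the real file holds one formatted training example per line.
records = [
    {"text": "placeholder example 1"},
    {"text": "placeholder example 2"},
]

# Write a small JSONL file into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        # ensure_ascii=False keeps Japanese text readable in the file.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read it back the same way the training script does.
texts = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        texts.append(json.loads(line)["text"])

print(texts)
```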

The LLM Advent Calendar has lots of interesting articles. I want to read them all!

Thank you for reading.
