Putting 1BitLLM to the Test
Is a 1-bit LLM really achievable? And does what it promises to deliver actually matter?
Someone has finally managed a working reproduction, so I decided to try it myself.
Incidentally, this page has an interesting analysis of the 1-bit (1.58-bit) LLM, and I recommend giving it a read.
The 1-bit LLM implementation itself is published here.
Be warned, though: it departs considerably from the usual Hugging Face conventions.
First, git clone this entire Hugging Face repository:
$ git lfs install
$ git clone https://huggingface.co/1bitLLM/bitnet_b1_58-3B
$ cd bitnet_b1_58-3B
If you skip this and try to load the model straight into a pipeline like the usual examples, you will be stuck with a cryptic error. Quite a few people overseas seem to have run into the same wall. Personally, I have to wonder who could figure this out from the explanation as given.
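For reference, the "usual" pattern that fails looks like the sketch below. Presumably this is because the checkpoint declares custom Bitnet classes that the stock Auto classes don't know how to resolve; the exact exception varies with your transformers version.
>>> from transformers import pipeline
>>> pipe = pipeline("text-generation", model="1bitLLM/bitnet_b1_58-3B")  # fails with a cryptic error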
First, let's run the PPL (perplexity) eval.
Note that the original code (eval_ppl.py) has the following imports, which can fail with an ImportError depending on how you run it:
from .modeling_bitnet import BitnetForCausalLM
from .tokenization_bitnet import BitnetTokenizer
In that case, just drop the leading "." (run as a top-level script, the file has no parent package, so the relative imports are rejected):
from modeling_bitnet import BitnetForCausalLM
from tokenization_bitnet import BitnetTokenizer
Then modeling_bitnet.py fails the same way on these lines:
from .configuration_bitnet import BitnetConfig
from .utils_quant import BitLinear
Again, removing the leading "." makes it work:
from configuration_bitnet import BitnetConfig
from utils_quant import BitLinear
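If you would rather not edit the files by hand, the same fix can be scripted. This is just a convenience sketch equivalent to the manual edits above:
import pathlib
import re

# Rewrite "from .foo import ..." to "from foo import ..." in both files.
for name in ("eval_ppl.py", "modeling_bitnet.py"):
    path = pathlib.Path(name)
    path.write_text(re.sub(r"^from \.", "from ", path.read_text(), flags=re.M))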
The original paper reports 9.91 and this repository reports 9.88; let's see what comes out on my machine (a Dospara box with two A6000s).
$ pip install lm-eval==0.3.0
$ python eval_ppl.py --hf_path 1bitLLM/bitnet_b1_58-3B --seqlen 2048
2024-04-17 09:37:11.458427: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-17 09:37:11.498720: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-17 09:37:12.244885: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading shards: 100%|███████████████████████| 3/3 [00:00<00:00, 6550.19it/s]
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:03<00:00, 1.31s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading readme: 100%|██████████████████| 41.1k/41.1k [00:00<00:00, 3.22MB/s]
Downloading data: 100%|████████████████████| 40.5M/40.5M [00:06<00:00, 5.82MB/s]
Generating validation split: 45576 examples [00:00, 236137.11 examples/s]
avg_loss = 3.28951310140223: 100%|██████| 10811/10811 [2:50:26<00:00, 1.06it/s]
c4 PPL: 9.777821724196246
Downloading readme: 100%|██████████████████████████████████████████████████████| 10.5k/10.5k [00:00<00:00, 20.1MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████| 733k/733k [06:33<00:00, 1.86kB/s]
Downloading data: 100%|████████████████████████████████████████████████████████| 6.36M/6.36M [00:00<00:00, 9.89MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████| 657k/657k [00:00<00:00, 2.03MB/s]
Generating test split: 100%|█████████████████████████████████████████| 4358/4358 [00:00<00:00, 254020.08 examples/s]
Generating train split: 100%|██████████████████████████████████████| 36718/36718 [00:00<00:00, 479998.17 examples/s]
Generating validation split: 100%|███████████████████████████████████| 3760/3760 [00:00<00:00, 270781.46 examples/s]
avg_loss = 3.3192534145618975: 100%|██████████████████████████████████████████████| 174/174 [01:57<00:00, 1.48it/s]
wikitext2 PPL: 9.98147770371931
[9.777821724196246, 9.98147770371931]
Avg PPL: 9.879649713957779
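A note on reading these numbers: judging by the values, the PPL the script prints is simply the average loss exponentiated base 2, i.e. the per-token loss is measured in bits. A quick check in the REPL:
>>> 2 ** 3.28951310140223    # avg_loss on c4 -> ~9.7778, the "c4 PPL" above
>>> 2 ** 3.3192534145618975  # avg_loss on wikitext2 -> ~9.9815, the "wikitext2 PPL" above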
The evaluation took a few hours, but the result is 9.879, which is close enough to call 9.88.
Honestly, though, with this kind of thing the numbers alone give you no real feel for whether it has actually become something usable or not.
So, although the manual says nothing about it, I decided to load the model myself and try it out.
Incidentally, nothing here calls CUDA explicitly, so it might even run on a Mac?
$ python
Python 3.10.9 (main, Mar 1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenization_bitnet import BitnetTokenizer
/home/memeplex/.pyenv/versions/anaconda3-2023.03/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
>>> from modeling_bitnet import BitnetForCausalLM
2024-04-18 06:38:17.567606: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-18 06:38:17.603207: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-18 06:38:18.173705: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>>
>>> model_str="1bitLLM/bitnet_b1_58-3B"
>>> import torch
>>> model = BitnetForCausalLM.from_pretrained(
... model_str,
... device_map='auto',
... low_cpu_mem_usage=True,
... use_flash_attention_2=True,
... torch_dtype=torch.float16,
... ).half()
>>> tokenizer = BitnetTokenizer.from_pretrained(model_str,use_fast=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
That completes the setup.
First, let's see whether it can generate English.
>>> d = tokenizer("The most important thing of the United States",padding="max_length",return_tensors="pt")
>>> with torch.no_grad():
...     t = model.generate(d["input_ids"],attention_mask=d["attention_mask"],max_length=100,repetition_penalty=1.5)
...
>>> tokenizer.decode(t[0])
'<s> The most important thing of the United States is that it has a great history. It was founded by brave men and women who fought for freedom, liberty, justice, equality, democracy etc… They were all fighting to make this country better than what they had before so we can live in peace with each other as well as our environment.\nThe first president George Washington said “We hold these truths to be self-evident: That there are certain rights which God hath given us;'
>>>
The input prompt was "The most important thing of the United States", and the model produced the continuation shown above.
It predicts a fairly coherent continuation! This is probably the point where you are supposed to gasp in amazement.
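For repeated experiments, the steps above can be folded into a small helper. complete is just a name I made up here; the parameters are exactly those of the REPL session:
>>> def complete(prompt, max_length=100):
...     # tokenize, generate with gradients disabled, decode -- same steps as above
...     d = tokenizer(prompt, padding="max_length", return_tensors="pt")
...     with torch.no_grad():
...         t = model.generate(d["input_ids"], attention_mask=d["attention_mask"],
...                            max_length=max_length, repetition_penalty=1.5)
...     return tokenizer.decode(t[0])
...
The Japanese test below then reduces to a one-liner like complete("東京は").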
Let's try Japanese as well.
>>> d = tokenizer("東京は",padding="max_length",return_tensors="pt")
>>> with torch.no_grad():
...     t = model.generate(d["input_ids"],attention_mask=d["attention_mask"],max_length=100,repetition_penalty=1.5)
...
>>> tokenizer.decode(t[0])
'<s> 東京は、半年に終了した。\nTokyo is about to finish. (Japanese)\nThe Tokyo Olympics are over, and the city has been transformed into a ghost town for weeks now as people have fled in droves from their homes due to fears of an outbreak or even worse: COVID-19 itself. The games were supposed to be held between July 23rd – August 8th but'
>>>
So Tokyo is already over, is it?!