
BitNet Llama8B has finally arrived! BitNet.cpp, an LLM runtime for blazing-fast inference on CPU alone









I ran this on a MacBook Pro with an M2 Max.


~/BitNet (main)> python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0

build: 3947 (406a5036) with Apple clang version 16.0.0 (clang-1600.0.26.3) for arm64-apple-darwin24.0.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = I2_S - 2 bpw ternary
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 3.58 GiB (3.83 BPW) 
llm_load_print_meta: general.name     = Llama3-8B-1.58-100B-tokens
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3669.02 MiB
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.16 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 66
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2

system_info: n_threads = 2 (n_threads_batch = 2) / 12 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 | 

sampler seed: 4294967295
sampler params: 
	repeat_last_n = 64, repeat_penalty = 15.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy 
generate: n_ctx = 2048, n_batch = 1, n_predict = 6, n_keep = 1

Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?
Answer: She’s in a different room

llama_perf_sampler_print:    sampling time =       1.68 ms /    54 runs   (    0.03 ms per token, 32066.51 tokens per second)
llama_perf_context_print:        load time =     493.92 ms
llama_perf_context_print: prompt eval time =    3366.27 ms /    48 tokens (   70.13 ms per token,    14.26 tokens per second)
llama_perf_context_print:        eval time =     354.39 ms /     5 runs   (   70.88 ms per token,    14.11 tokens per second)
llama_perf_context_print:       total time =    3722.92 ms /    53 tokens
ggml_metal_free: deallocating
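
The memory figures in this log can be cross-checked against the hyperparameters it prints. Here is a rough sketch in Python with the values copied from the log above. The formulas are my own reading of what the log reports (bits per weight as total stored bits over parameter count, and an f16 cache of n_ctx x n_layer x n_embd_k_gqa x 2 bytes for K and again for V), not something taken from the BitNet.cpp source.

```python
# Values copied from the log above.
MIB = 1024 ** 2
model_size_mib = 3669.02   # "CPU buffer size"
n_params = 8.03e9          # "model params" (rounded in the log)
n_ctx = 2048
n_layer = 32
n_embd_k_gqa = 1024        # n_head_kv (8) * n_embd_head_k (128)
bytes_per_f16 = 2


def bits_per_weight(size_mib: float, params: float) -> float:
    """Average number of stored bits per model parameter."""
    return size_mib * MIB * 8 / params


def cache_mib(ctx: int, layers: int, embd_kv: int, elem_bytes: int) -> float:
    """Size in MiB of the K cache alone (the V cache has the same shape here)."""
    return ctx * layers * embd_kv * elem_bytes / MIB


bpw = bits_per_weight(model_size_mib, n_params)
k_mib = cache_mib(n_ctx, n_layer, n_embd_k_gqa, bytes_per_f16)
kv_total_mib = 2 * k_mib

print(f"{bpw:.2f} BPW")
print(f"K cache: {k_mib:.2f} MiB, K + V: {kv_total_mib:.2f} MiB")
```

The average comes out well above the 2 bpw named in the ftype line, presumably because not every tensor in the file is stored in the ternary format.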


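import os
import platform
import subprocess

def run_inference():
    # Locate the llama-cli binary produced by the build.
    build_dir = "build"
    if platform.system() == "Windows":
        main_path = os.path.join(build_dir, "bin", "Release", "llama-cli.exe")
        if not os.path.exists(main_path):
            main_path = os.path.join(build_dir, "bin", "llama-cli")
    else:
        main_path = os.path.join(build_dir, "bin", "llama-cli")
    # `args` is the parsed command-line namespace, defined elsewhere in the script.
    command = [
        main_path,
        '-m', args.model,
        '-n', str(args.n_predict),
        '-t', str(args.threads),
        '-p', args.prompt,
        '-ngl', '0',  # offload 0 layers to the GPU, i.e. CPU-only inference
        '-c', str(args.ctx_size),
        '--temp', str(args.temperature),
        "-b", "1"  # batch size 1, matching "n_batch = 1" in the log
    ]
    # The excerpt is cut off here; running the command this way is an assumed completion.
    subprocess.run(command, check=True)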

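The fragment above reads its settings from an `args` object that is not shown. Here is a minimal sketch of how such a namespace could be built with `argparse`. The short flags follow the commands used in this post. The long option names are hypothetical, and the defaults are assumptions: 2 threads and a 2048-token context are inferred from the first log, while the `-n` and `-temp` defaults are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Short flags match the commands in this post; long names and defaults
    # are assumptions for illustration, not the script's actual definitions.
    parser = argparse.ArgumentParser(description="Run inference (sketch)")
    parser.add_argument("-m", "--model", type=str, required=True)
    parser.add_argument("-p", "--prompt", type=str, required=True)
    parser.add_argument("-n", "--n-predict", type=int, default=128)
    parser.add_argument("-t", "--threads", type=int, default=2)
    parser.add_argument("-c", "--ctx-size", type=int, default=2048)
    parser.add_argument("-temp", "--temperature", type=float, default=0.8)
    return parser


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args([
    "-m", "models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf",
    "-p", "Where is Mary?\nAnswer:",
    "-n", "6",
    "-temp", "0",
])
print(args.n_predict, args.threads, args.ctx_size, args.temperature)
```

With a namespace like this, `args.n_predict`, `args.threads`, `args.ctx_size` and `args.temperature` line up with the attribute names used when the command list is built.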


$ python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "The Secret Life of Walter Mitty: A novel about a man who is a secret agent.\n\nOne day,a woman" -n 600 -temp 0.7

sampler seed: 2897422009
sampler params: 
	repeat_last_n = 64, repeat_penalty = 15.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 8192, n_batch = 1, n_predict = 600, n_keep = 1

The Secret Life of Walter Mitty: A novel about a man who is a secret agent.

One day,a woman comes up to him and asks what she can do for his wife. He has never met the people at her place so he takes an opportunity which arises next.
He goes along with them when they go home ,which turns out in reality not too much of their world because all that exists are just empty shells.They ask one another question,and then begin to live there.Their house is now a prison from where escape has become almost impossible.But what seems bad at the moment does turn into good as it becomes obvious gradually with time,that you have lived here alone in your own separate small paradise.
One day he happens on two people having an interesting chat and hearing their names that of his secret wife whom one was thinking about just then.The man whose name is Mitty begins to live more a happier life ever since
They become friends. One tells the other why there should be no secrets between them ,because it all does not belong together anyhow.They find out soon in this new world where everything can take part of themselves with an open heart and free spirit.It becomes one great place.
So then they began to live like people did on Earth,they had a nice time for living long.The secret life begins after only few days that have come since the first day .
What does happen is wonderful ,how many wonders are there when two humans become as it were .You never know what kind of person you meet in this strange new world. Then suddenly their whole family and other people join them.
But they all had no idea about him for being an agent.Their small town became very big.They did not only see but also hear things that nobody heard before nor even could say with mouth.What else than a good secret life then can you imagine ?
All was quiet first .Everything around is still as if someone would whisper in somebody ears who doesn’t have much sense about him or her,But all does live very well thereafter.
Now it will never be same to any of the two people but one person living at large for sure and that other smaller like he used when with others. But everyone enjoys happiness together ,they feel what else do not
At least some happy thoughts come through from each .How nice then! That can all see after few days.
What happens in life is nothing wrong.But it was sad to loose people as a family so much.However,the rest lives on without fear at heart,only for many good and beautiful years.It doesn’t happen anymore
One day comes back his secret wife ,whom he has lost recently.They find out soon that she had been taken away.The couple sees her once again but what will make them feel as before?
In the house there is something they can never be without seeing every moment to understand more in life.Therefore a happy memory with tears. For even then after losing each other one for several years.
A little boy from their family ,one of his relatives has left already.When it happened at last so,they had been friends and brothers again together just as

llama_perf_sampler_print:    sampling time =     167.24 ms /   624 runs   (    0.27 ms per token,  3731.08 tokens per second)
llama_perf_context_print:        load time =     553.08 ms
llama_perf_context_print: prompt eval time =    1732.66 ms /    24 tokens (   72.19 ms per token,    13.85 tokens per second)
llama_perf_context_print:        eval time =   45388.91 ms /   599 runs   (   75.77 ms per token,    13.20 tokens per second)
llama_perf_context_print:       total time =   47313.09 ms /   623 tokens
ggml_metal_free: deallocating
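
To compare runs without reading the numbers off by eye, the throughput can be recomputed from the `llama_perf_context_print` lines. This is a small sketch, assuming the line layout stays exactly as printed in the logs in this post. The regular expression and the `tokens_per_second` helper are my own and are not part of BitNet.cpp or llama.cpp.

```python
import re

# Matches e.g. "eval time =   45388.91 ms /   599 runs".
PERF_LINE = re.compile(
    r"(?P<label>[a-z ]+time)\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<count>\d+) (?:tokens|runs)"
)


def tokens_per_second(line: str) -> tuple[str, float]:
    """Return (label, tokens per second) recomputed from one perf line."""
    match = PERF_LINE.search(line)
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    seconds = float(match.group("ms")) / 1000.0
    return match.group("label").strip(), int(match.group("count")) / seconds


# Lines copied from the second run above.
log_lines = [
    "llama_perf_context_print: prompt eval time =    1732.66 ms /    24 tokens (   72.19 ms per token,    13.85 tokens per second)",
    "llama_perf_context_print:        eval time =   45388.91 ms /   599 runs   (   75.77 ms per token,    13.20 tokens per second)",
]

results = dict(tokens_per_second(line) for line in log_lines)
for label, rate in results.items():
    print(f"{label}: {rate:.2f} tokens/s")
```

Both rates agree with the figures llama.cpp prints, at roughly 13 to 14 tokens per second for prompt processing and generation alike with 2 threads.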


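The Secret Life of Walter Mitty: a novel about a man who is a secret agent.

But none of them had any idea that he was an agent. Their small town became very big. They not only saw but also heard things that no one had ever heard before or could even put into words. So what else could there be besides a good secret life? Imagine?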

