
convert : add --fuse-qkv flag to fuse Q/K/V into QKV during HF-to-GGUF conversion #22780

Open
JoursBleu wants to merge 2 commits into ggml-org:master from JoursBleu:convert/add-fuse-qkv-flag

Conversation

@JoursBleu

Contributor

Overview

#21245 already introduced the model-loading and graph-building paths for fused QKV computation. This PR further adds an opt-in --fuse-qkv flag to convert_hf_to_gguf.py: during HF-to-GGUF conversion, it concatenates the separate Q / K / V weight tensors into a single fused attn_qkv tensor.

  • convert_hf_to_gguf.py: add the --fuse-qkv CLI flag. Buffer ATTN_Q / ATTN_K / ATTN_V per layer; once all three have arrived, concatenate them with torch.cat into a single fused ATTN_QKV. Without the flag, conversion is identical to master.
  • gguf-py/gguf/constants.py: register MODEL_TENSOR.ATTN_QKV for every arch that already declares ATTN_Q + ATTN_K + ATTN_V (85 entries).
  • src/llama-model.cpp: in the fused branch of create_tensor_qkv, if wqkv is found but wqkv_b is absent, fall back to loading the separate wq_b / wk_b / wv_b. --fuse-qkv only fuses weights, not biases, so this fallback is required.
  • src/llama-graph.cpp: in the fused branch of build_qkv, if wqkv_b is absent but wq_b / wk_b / wv_b exist, use ggml_concat to concatenate the three bias segments and add them to the result of the fused matmul.
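
The converter-side buffering described in the first bullet can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the class name QKVFuser, its add method, and the tensor-name strings are hypothetical, and plain Python lists of rows stand in for torch tensors, so list concatenation plays the role of torch.cat along the output dimension.

```python
# Hypothetical sketch of per-layer Q/K/V buffering; not the code from the PR.
# Weights are modelled as lists of rows (one row per output feature), so
# concatenating along the output dimension is plain list concatenation.

class QKVFuser:
    PARTS = ("attn_q", "attn_k", "attn_v")  # illustrative names only

    def __init__(self, enabled):
        self.enabled = enabled
        self.pending = {}  # layer index -> {part name: weight rows}

    def add(self, layer, part, rows):
        """Return a list of (name, rows) pairs that are ready to be written."""
        if not self.enabled or part not in self.PARTS:
            # Flag off (or unrelated tensor): pass through unchanged.
            return [(f"blk.{layer}.{part}", rows)]
        buf = self.pending.setdefault(layer, {})
        buf[part] = rows
        if len(buf) < len(self.PARTS):
            return []  # still waiting for the other projections
        del self.pending[layer]
        # Emit in fixed Q, K, V order regardless of arrival order.
        fused = [row for p in self.PARTS for row in buf[p]]
        return [(f"blk.{layer}.attn_qkv", fused)]


fuser = QKVFuser(enabled=True)
q = [[1.0, 2.0], [3.0, 4.0]]  # 2 output rows
k = [[5.0, 6.0]]              # 1 output row
v = [[7.0, 8.0]]              # 1 output row

out = []
for part, rows in (("attn_k", k), ("attn_q", q), ("attn_v", v)):
    out += fuser.add(0, part, rows)

print(out)
```

With the flag disabled, every tensor passes straight through, which mirrors the statement that conversion without the flag is unchanged.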

Reason for the changes in src/llama-model.cpp and src/llama-graph.cpp: on architectures with attention bias (qwen2, phi2, starcoder2, stablelm, etc.), the bias would otherwise be silently dropped, producing garbage output. With the fix, the output of these archs on a --fuse-qkv GGUF is bit-for-bit identical to the output on the nofuse GGUF.
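
To illustrate why concatenating the three bias segments is sufficient, here is a small self-contained check with made-up numbers. The helper names matvec and add are hypothetical, and Python lists stand in for ggml tensors; the sketch only shows the algebra, not the actual graph code.

```python
# Illustrative check with made-up numbers: a fused QKV matmul followed by
# adding the concatenated biases matches the three separate projections.

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

x = [1.0, -2.0]
wq, bq = [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]
wk, bk = [[2.0, 1.0]], [-1.0]
wv, bv = [[0.0, 3.0]], [4.0]

# Separate path: one matmul and one bias add per projection.
separate = add(matvec(wq, x), bq) + add(matvec(wk, x), bk) + add(matvec(wv, x), bv)

# Fused path: one matmul on the stacked weights, then a single add of the
# concatenated biases (the role ggml_concat plays in the description above).
w_qkv = wq + wk + wv
b_qkv = bq + bk + bv
fused = add(matvec(w_qkv, x), b_qkv)

# Without the bias fallback the biases would be missing from the result.
no_bias = matvec(w_qkv, x)

print(separate, fused, no_bias)
```

The values are chosen to be exactly representable, so the two paths compare equal here; the bit-for-bit claim for real models comes from the PR author's own testing, not from this sketch.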

Diffstat: 4 files, +138 −2.

Test

test-llama-archs: all OK, 0 FAIL.

Additional information

Following the two-step split discussed in #20628 with @am17an and @ngxson:

  1. #21245 "model : refactor QKV into common build_qkv and create_tensor_qkv helpers" (merged): pure refactor extracting the build_qkv / create_tensor_qkv helpers.
  2. This PR: on top of the helpers from #21245, adds the converter-side Q/K/V weight fusion flag.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES — used as a translation tool for translating the PR description

@JoursBleu JoursBleu requested a review from CISC as a code owner May 7, 2026 02:24
@github-actions github-actions Bot added the python python script changes label May 7, 2026
@JoursBleu JoursBleu marked this pull request as draft May 7, 2026 02:27
@am17an

am17an commented May 7, 2026

Contributor

Thanks for the PR, do you have any performance analysis for this change?

@JoursBleu

Contributor Author

Hi @am17an ,

There are too many models to cover, so I only benchmarked Q8_0 on a subset. The speedup is not universal, but since the flag is opt-in, users can choose whether to enable it per model.

Hardware: AMD Radeon AI PRO R9700
Bench: llama-bench -p 512 -n 128 -r 5.

Q8_0 pp512 (t/s)

| Arch | Model | nofuse | fuse | Δ |
|---|---|---|---|---|
| bloom * | bloom-7b1 | 1489.10 | 1488.10 | ~0% |
| gemma | gemma-7b | 1145.86 | 1138.45 | -0.6% |
| gemma2 | gemma-2-9b-it | 464.98 | 477.75 | +2.7% |
| gemma3 | gemma-3-4b-it | 520.37 | 545.13 | +4.8% |
| glm4 | glm-4-9b-chat | 459.73 | 449.42 | -2.2% |
| internlm2 | internlm2.5-7b | 616.20 | 601.46 | -2.4% |
| llama | Llama-3.1-8B | 586.12 | 546.95 | -6.7% |
| mistral3 | Mistral-Small-3.1-24B | 261.52 | 275.38 | +5.3% |
| nemotron | Nemotron-Mini-4B | 621.92 | 648.71 | +4.3% |
| openai-moe | Mixtral-8x7B | 309.50 | 334.66 | +8.1% |
| phi2 | phi-2 | 2678.32 | 2846.56 | +6.3% |
| phi3 * | Phi-3.5-mini | 2260.09 | 2263.78 | ~0% |
| qwen2 | Qwen2.5-7B | 593.65 | 671.34 | +13.1% |
| qwen2moe | Qwen1.5-MoE-A2.7B | 1952.52 | 1951.36 | ~0% |
| qwen2vl | Qwen2-VL-7B | 601.64 | 618.33 | +2.8% |
| qwen3 | Qwen3-8B | 579.11 | 593.76 | +2.5% |
| qwen3moe | Qwen3-30B-A3B | 404.64 | 386.03 | -4.6% |

Q8_0 tg128 (t/s)

| Arch | Model | nofuse | fuse | Δ |
|---|---|---|---|---|
| bloom * | bloom-7b1 | 59.10 | 59.04 | ~0% |
| gemma | gemma-7b | 51.60 | 52.27 | +1.3% |
| gemma2 | gemma-2-9b-it | 42.42 | 43.26 | +2.0% |
| gemma3 | gemma-3-4b-it | 79.16 | 79.81 | +0.8% |
| glm4 | glm-4-9b-chat | 46.42 | 47.85 | +3.1% |
| internlm2 | internlm2.5-7b | 55.96 | 57.17 | +2.2% |
| llama | Llama-3.1-8B | 55.01 | 56.25 | +2.3% |
| mistral3 | Mistral-Small-3.1-24B | 20.30 | 20.75 | +2.2% |
| nemotron | Nemotron-Mini-4B | 97.55 | 102.14 | +4.7% |
| openai-moe | Mixtral-8x7B | 34.91 | 35.36 | +1.3% |
| phi2 | phi-2 | 112.82 | 114.23 | +1.2% |
| phi3 * | Phi-3.5-mini | 98.39 | 98.44 | ~0% |
| qwen2 | Qwen2.5-7B | 60.22 | 61.19 | +1.6% |
| qwen2moe | Qwen1.5-MoE-A2.7B | 116.67 | 116.96 | +0.2% |
| qwen2vl | Qwen2-VL-7B | 56.71 | 59.84 | +5.5% |
| qwen3 | Qwen3-8B | 52.64 | 53.98 | +2.5% |
| qwen3moe | Qwen3-30B-A3B | 78.83 | 81.15 | +2.9% |

* = natively fused QKV
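
For readers who want to recompute the Δ column, here is a short sketch. It assumes Δ is the relative change (fuse − nofuse) / nofuse; the comment does not state the formula, but this one reproduces the listed values for the rows checked. The function name rel_change is hypothetical.

```python
def rel_change(nofuse, fuse):
    """Relative change of the fuse throughput vs nofuse, in percent."""
    return (fuse - nofuse) / nofuse * 100.0

# A few pp512 rows copied from the table above: (model, nofuse t/s, fuse t/s)
rows = [
    ("Qwen2.5-7B", 593.65, 671.34),
    ("Llama-3.1-8B", 586.12, 546.95),
    ("gemma-7b", 1145.86, 1138.45),
]

for model, nofuse, fuse in rows:
    print(f"{model:14s} {rel_change(nofuse, fuse):+.1f}%")
```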

@am17an

am17an commented May 7, 2026

Contributor

I think your PP results might have a lot of noise; I will post some results on other hardware. In my view this should be a performance win for everything.

@am17an

am17an commented May 7, 2026

Contributor

Looks like gemma 4 couldn't be converted, can you check?

@JoursBleu JoursBleu force-pushed the convert/add-fuse-qkv-flag branch from 9204fcb to 3f8c2f9 Compare May 8, 2026 12:58
@JoursBleu

Contributor Author

hi @am17an,

Gemma4 support added. Greedy decoding is bit-identical to nofuse. Results for gemma-4-31B-it:

R9700:

| quant | GPUs | pp512 (t/s), nofuse → fuse | tg128 (t/s), nofuse → fuse |
|---|---|---|---|
| Q4_K_M | 1 | 890.35 → 854.15 (−4.1%) | 24.13 → 24.06 (−0.3%) |
| Q8_0 | 2 | 314.13 → 312.10 (−0.6%) | 15.31 → 15.50 (+1.2%) |

Halo:

| quant | pp512 (t/s), nofuse → fuse | tg128 (t/s), nofuse → fuse |
|---|---|---|
| Q4_K_M | 290.20 → 299.87 (+3.3%) | 9.76 → 9.96 (+2.0%) |
| Q8_0 | 298.76 → 304.02 (+1.8%) | 6.35 → 6.42 (+1.1%) |

@CISC

CISC commented May 8, 2026

Member

Fun side quest, make sure this works for NVFP4 too.

@github-actions github-actions Bot added the model Model specific label May 8, 2026
Comment thread src/models/gemma4.cpp Outdated
// Optional fused QKV (from converter --fuse-qkv). Only present on full attention layers
// (KV-shared layers have no V, so the converter cannot fuse them).
const int64_t n_embd_q = n_embd_head * n_head;
layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", i),

Contributor

shouldn't this be using the function in llama_graph?

@am17an

am17an commented May 9, 2026

Contributor

Performance uplift mostly everywhere, but I guess larger ubatch will have larger benefit due to sharing the activation

Without fusion (2x 4090)

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 1 | 1 | pp512 | 28.27 ± 0.00 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 2 | 1 | pp512 | 55.29 ± 0.00 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 4 | 1 | pp512 | 109.55 ± 0.02 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 8 | 1 | pp512 | 202.16 ± 0.04 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 16 | 1 | pp512 | 386.55 ± 0.19 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 32 | 1 | pp512 | 759.59 ± 0.05 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 512 | 1 | pp512 | 2756.28 ± 21.14 |

With fusion

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 1 | 1 | pp512 | 28.45 ± 0.00 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 2 | 1 | pp512 | 55.72 ± 0.01 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 4 | 1 | pp512 | 110.37 ± 0.02 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 8 | 1 | pp512 | 205.69 ± 0.05 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 16 | 1 | pp512 | 394.31 ± 0.09 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 32 | 1 | pp512 | 774.52 ± 0.10 |
| gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | CUDA | 99 | 512 | 1 | pp512 | 2779.28 ± 19.25 |

@JoursBleu

Contributor Author

@am17an Thanks for the test! I reproduced the results:

2× 4090, gemma4-31B Q8_0, -ngl 99 -fa 1 -p 512 -n 0 -r 3:

| n_ubatch | nofuse t/s | fuse t/s | Δ |
|---|---|---|---|
| 1 | 28.95 | 29.14 | +0.7% |
| 8 | 206.24 | 210.00 | +1.8% |
| 16 | 396.04 | 403.81 | +2.0% |
| 32 | 779.30 | 794.65 | +2.0% |
| 512 | 2777.38 | 2801.18 | +0.9% |

The gain peaks at about +2% for ub=16 to 32, then drops back to +0.9% at ub=512.

We captured an nsys profile trace at ub=32:

| kernel | variant | instances | total | avg/inst |
|---|---|---|---|---|
| mul_mat_q<Q8_0> | nofuse | 2442 | 224.6 ms | 91.98 µs |
| mul_mat_q<Q8_0> | fuse | 1842 | 223.1 ms | 121.14 µs |
| stream_k_fixup | nofuse | 2382 | 8.85 ms | 3.71 µs |
| stream_k_fixup | fuse | 1482 | 6.24 ms | 4.21 µs |
| quantize_mmq_q8_1 | nofuse | 2442 | 4.12 ms | 1.69 µs |
| quantize_mmq_q8_1 | fuse | 1842 | 3.26 ms | 1.77 µs |

Fusion removed 600 mul_mat_q, 900 stream_k_fixup, and 600 quantize launches, saving ~5 ms in total across these kernels.
Fusion also lets Q/K/V share the same activation input, so the activation is read from HBM once instead of three times.
As the input length grows, the arithmetic intensity grows too, so the benefit from the reduced memory traffic shrinks.
The launch overhead saved is a fixed cost, so its relative contribution shrinks as well, which is why the gain drops back when the input length grows to 512.
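
The launch counts and the ~5 ms figure can be rechecked directly from the nsys table above. The sketch below only redoes that arithmetic; the dictionary layout and variable names are illustrative.

```python
# Kernel statistics copied from the nsys table: (instances, total time in ms).
nofuse = {
    "mul_mat_q": (2442, 224.6),
    "stream_k_fixup": (2382, 8.85),
    "quantize_mmq_q8_1": (2442, 4.12),
}
fuse = {
    "mul_mat_q": (1842, 223.1),
    "stream_k_fixup": (1482, 6.24),
    "quantize_mmq_q8_1": (1842, 3.26),
}

# Launches removed per kernel and the summed difference in total kernel time.
launches_saved = {k: nofuse[k][0] - fuse[k][0] for k in nofuse}
time_saved_ms = sum(nofuse[k][1] - fuse[k][1] for k in nofuse)

print(launches_saved)
print(f"{time_saved_ms:.2f} ms")
```

Note that the time figure is the difference in summed kernel durations from the table, which is what rounds to roughly 5 ms.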

We also tested on Spark. Spark's FLOPS/bandwidth ratio is much higher than the 4090's, so its roofline knee sits further out (it takes a larger ubatch to enter the compute-bound region). The BW-bound region is therefore wider, and fusion keeps paying off even at large ub.

GB10 / Spark, gemma4-31B Q8_0:

| n_ubatch | nofuse t/s | fuse t/s | Δ |
|---|---|---|---|
| 1 | 7.16 | 7.17 | +0.1% |
| 8 | 52.95 | 53.36 | +0.8% |
| 32 | 179.66 | 182.18 | +1.4% |
| 128 | 564.89 | 577.81 | +2.3% |
| 512 | 691.13 | 757.33 | +9.6% |

In short: "larger ubatch ⇒ larger benefit" holds in the launch-bound region, but on 31B / 4090 the per-layer GEMM is already BW-bound, so the curve tops out at ~2% and drops again once ub=512 enters the compute-bound region. On Spark the gain is noticeably larger.
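
The roofline argument can be made concrete with a toy model. Everything in the sketch below is an assumption for illustration: the layer sizes, the byte counts per element, and the two hypothetical devices are made up and do not describe the 4090 or Spark, and launch overhead is not modelled at all. It only shows the qualitative effect of reading the activation once instead of three times.

```python
# Toy roofline model with made-up hardware and layer sizes; not a measurement.

def qkv_time(n_tokens, act_reads, peak_flops, bandwidth,
             d_in=4096, d_out=6144):
    """Estimated time (s) for the QKV projection of one layer.

    act_reads is how many times the activation is read from memory:
    3 for separate Q/K/V matmuls, 1 for the fused matmul.
    Assumes 1 byte per weight and activation element, 4 bytes per output.
    """
    flops = 2 * n_tokens * d_in * d_out
    bytes_moved = (d_in * d_out                   # weights, read once
                   + act_reads * n_tokens * d_in  # activation reads
                   + 4 * n_tokens * d_out)        # f32 outputs
    # Roofline: the slower of compute and memory traffic sets the time.
    return max(flops / peak_flops, bytes_moved / bandwidth)

def fusion_speedup(n_tokens, peak_flops, bandwidth):
    """Ratio of separate-matmul time to fused-matmul time."""
    return (qkv_time(n_tokens, 3, peak_flops, bandwidth)
            / qkv_time(n_tokens, 1, peak_flops, bandwidth))

# Two hypothetical devices with the same peak FLOPS but different bandwidth.
low_ratio = dict(peak_flops=100e12, bandwidth=1e12)     # knee near 100 FLOP/byte
high_ratio = dict(peak_flops=100e12, bandwidth=0.1e12)  # knee near 1000 FLOP/byte

for n in (1, 32, 512):
    print(n,
          round(fusion_speedup(n, **low_ratio), 3),
          round(fusion_speedup(n, **high_ratio), 3))
```

In this model the device with the lower FLOPS/bandwidth ratio becomes compute-bound at n=512, so fusion no longer changes the estimated time, while the device with the higher ratio is still bandwidth-bound there and continues to benefit.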

@JoursBleu JoursBleu force-pushed the convert/add-fuse-qkv-flag branch 2 times, most recently from c3e64f5 to 74b94d9 Compare May 23, 2026 06:30
@JoursBleu

Contributor Author

Fixes the Qwen-3.6-35B-A3B --fuse-qkv crash reported on #22710:

| Arch | model | PATCHED+FUSED | PATCHED+NOFUSE | master+FUSED |
|---|---|---|---|---|
| qwen35 | Qwen3.5-0.8B | ok | ok (identical) | crash |
| qwen35moe | Qwen-3.6-35B-A3B | ok | ok (identical) | crash |
| olmo2 | OLMo-2-1B | ok | ok (identical) | crash |
| gemma3n | Gemma-3n-E4B | ok | ok (identical) | crash |
| olmoe | OLMoE-1B-7B | ok | ok (identical) | crash |
| falcon-h1 | Falcon-H1-Tiny-90M | ok | ok (identical) | ok |
| lfm2 | LFM2.5-1.2B | ok | ok (identical) | ok |
| jamba | Jamba-3B | ok | ok (identical) | ok |
| granite-hybrid | Granite-4.1-3B | ok | ok (identical) | ok |
| nemotron-h | Nemotron-H-8B | ok | ok (identical) | ok |

@JoursBleu JoursBleu force-pushed the convert/add-fuse-qkv-flag branch 2 times, most recently from 2be160a to 435b225 Compare May 29, 2026 12:12
@JoursBleu JoursBleu force-pushed the convert/add-fuse-qkv-flag branch from 435b225 to a26ba0f Compare May 29, 2026 12:17
@JoursBleu JoursBleu marked this pull request as ready for review June 3, 2026 06:25

Labels

conversion model Model specific python python script changes


3 participants