convert : add `--fuse-qkv` flag to fuse Q/K/V into QKV during HF-to-GGUF conversion by JoursBleu · Pull Request #22780 · ggml-org/llama.cpp

JoursBleu · 2026-05-07T02:24:04Z

Overview

#21245 already introduced the model-loading and graph-building paths for fused QKV computation. This PR further adds an opt-in --fuse-qkv flag to convert_hf_to_gguf.py: during HF-to-GGUF conversion, it concatenates the separate Q / K / V weight tensors into a single fused attn_qkv tensor.

convert_hf_to_gguf.py: add the --fuse-qkv CLI flag. Buffer ATTN_Q / ATTN_K / ATTN_V per layer; once all three have arrived, concatenate them with torch.cat into a single fused ATTN_QKV. Without the flag, conversion is identical to master.
gguf-py/gguf/constants.py: register MODEL_TENSOR.ATTN_QKV for every arch that already declares ATTN_Q + ATTN_K + ATTN_V (85 entries).
src/llama-model.cpp: in the fused branch of create_tensor_qkv, if wqkv is found but wqkv_b is absent, fall back to loading the separate wq_b / wk_b / wv_b. --fuse-qkv only fuses weights, not biases, so this fallback is required.
src/llama-graph.cpp: in the fused branch of build_qkv, if wqkv_b is absent but wq_b / wk_b / wv_b exist, use ggml_concat to concatenate the three bias segments and add them to the result of the fused matmul.
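
The converter-side buffering described in the list above can be sketched roughly as follows. This is an illustrative stand-in, not the code from convert_hf_to_gguf.py: the `QKVFuser` class, the string part labels, and the `blk.N.…` output names are hypothetical, plain Python lists stand in for tensors, and it assumes the weights are concatenated along the output (row) dimension in Q, K, V order.

```python
from collections import defaultdict

# Hypothetical part labels; the real converter works with gguf.MODEL_TENSOR members.
QKV_PARTS = ("attn_q", "attn_k", "attn_v")


class QKVFuser:
    """Buffers per-layer Q/K/V weights and emits one fused tensor once all three arrived."""

    def __init__(self):
        # layer index -> {part label: weight rows}
        self._pending = defaultdict(dict)

    def add(self, layer, part, rows):
        """Return the list of (name, rows) to write now; empty while a layer is incomplete."""
        if part not in QKV_PARTS:
            # Tensors unrelated to Q/K/V pass straight through.
            return [(f"blk.{layer}.{part}", rows)]
        self._pending[layer][part] = rows
        if len(self._pending[layer]) < len(QKV_PARTS):
            return []
        parts = self._pending.pop(layer)
        # Concatenate along the row dimension in fixed Q, K, V order,
        # independent of the order in which the tensors arrived.
        fused = [row for name in QKV_PARTS for row in parts[name]]
        return [(f"blk.{layer}.attn_qkv", fused)]
```

Buffering per layer is what makes the result independent of the order in which the checkpoint yields its tensors: nothing is written for a layer until its Q, K and V are all available.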

Reason for the changes in src/llama-model.cpp and src/llama-graph.cpp: on architectures with attention bias (qwen2, phi2, starcoder2, stablelm, etc.) the bias would otherwise be silently dropped, producing garbage output; after the fix the output of these archs on a --fuse-qkv GGUF is bit-for-bit identical to the nofuse GGUF.
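
To illustrate why the bias fallback matters, here is a minimal numeric sketch. It uses plain Python lists and integer values rather than ggml tensors, and the helper names are made up for this example. It shows that a fused matmul plus the concatenated bias segments reproduces the separate Q/K/V projections, while dropping the bias does not.

```python
def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add_vec(a, b):
    """Element-wise sum of two equally sized vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


def qkv_separate(x, wq, wk, wv, bq, bk, bv):
    """Three independent projections, each with its own bias, laid out as Q | K | V."""
    return (add_vec(matvec(wq, x), bq)
            + add_vec(matvec(wk, x), bk)
            + add_vec(matvec(wv, x), bv))


def qkv_fused(x, wqkv, bq, bk, bv):
    """One fused matmul; the bias segments are concatenated (the role ggml_concat plays)."""
    bias = bq + bk + bv  # list concatenation, in Q, K, V order
    return add_vec(matvec(wqkv, x), bias)


x = [1, 2]
wq, wk, wv = [[1, 0], [0, 1]], [[2, 0]], [[0, 3]]
bq, bk, bv = [10, 20], [30], [40]
wqkv = wq + wk + wv  # fused weight: rows of Q, then K, then V

fused = qkv_fused(x, wqkv, bq, bk, bv)
separate = qkv_separate(x, wq, wk, wv, bq, bk, bv)
no_bias = matvec(wqkv, x)  # what a loader would compute if it dropped the bias
```

Each output row is the same dot product plus the same bias element in both paths, which is consistent with the identical output reported above.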

Diffstat: 4 files, +138 −2.

Test

test-llama-archs: all OK, 0 FAIL.

Additional information

Following the two-step split discussed in #20628 with @am17an and @ngxson:

#21245 (merged), "model : refactor QKV into common build_qkv and create_tensor_qkv helpers": a pure refactor that extracts the build_qkv / create_tensor_qkv helpers.
This PR: builds on the helpers from #21245 and adds the converter-side Q/K/V weight fusion flag.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES — used as a translation tool for translating the PR description

am17an · 2026-05-07T02:55:39Z

Thanks for the PR, do you have any performance analysis for this change?

JoursBleu · 2026-05-07T03:57:09Z

Hi @am17an,

There are too many models to cover them all, so I only benchmarked Q8_0 on a subset. The speedup is not universal, but since the flag is opt-in, users can choose whether to enable it per model.

Hardware: AMD Radeon AI PRO R9700
Bench: llama-bench -p 512 -n 128 -r 5.

Q8_0 pp512 (t/s)

Arch	Model	nofuse	fuse	Δ
bloom *	bloom-7b1	1489.10	1488.10	~0%
gemma	gemma-7b	1145.86	1138.45	-0.6%
gemma2	gemma-2-9b-it	464.98	477.75	+2.7%
gemma3	gemma-3-4b-it	520.37	545.13	+4.8%
glm4	glm-4-9b-chat	459.73	449.42	-2.2%
internlm2	internlm2.5-7b	616.20	601.46	-2.4%
llama	Llama-3.1-8B	586.12	546.95	-6.7%
mistral3	Mistral-Small-3.1-24B	261.52	275.38	+5.3%
nemotron	Nemotron-Mini-4B	621.92	648.71	+4.3%
openai-moe	Mixtral-8x7B	309.50	334.66	+8.1%
phi2	phi-2	2678.32	2846.56	+6.3%
phi3 *	Phi-3.5-mini	2260.09	2263.78	~0%
qwen2	Qwen2.5-7B	593.65	671.34	+13.1%
qwen2moe	Qwen1.5-MoE-A2.7B	1952.52	1951.36	~0%
qwen2vl	Qwen2-VL-7B	601.64	618.33	+2.8%
qwen3	Qwen3-8B	579.11	593.76	+2.5%
qwen3moe	Qwen3-30B-A3B	404.64	386.03	-4.6%

Q8_0 tg128 (t/s)

Arch	Model	nofuse	fuse	Δ
bloom *	bloom-7b1	59.10	59.04	~0%
gemma	gemma-7b	51.60	52.27	+1.3%
gemma2	gemma-2-9b-it	42.42	43.26	+2.0%
gemma3	gemma-3-4b-it	79.16	79.81	+0.8%
glm4	glm-4-9b-chat	46.42	47.85	+3.1%
internlm2	internlm2.5-7b	55.96	57.17	+2.2%
llama	Llama-3.1-8B	55.01	56.25	+2.3%
mistral3	Mistral-Small-3.1-24B	20.30	20.75	+2.2%
nemotron	Nemotron-Mini-4B	97.55	102.14	+4.7%
openai-moe	Mixtral-8x7B	34.91	35.36	+1.3%
phi2	phi-2	112.82	114.23	+1.2%
phi3 *	Phi-3.5-mini	98.39	98.44	~0%
qwen2	Qwen2.5-7B	60.22	61.19	+1.6%
qwen2moe	Qwen1.5-MoE-A2.7B	116.67	116.96	+0.2%
qwen2vl	Qwen2-VL-7B	56.71	59.84	+5.5%
qwen3	Qwen3-8B	52.64	53.98	+2.5%
qwen3moe	Qwen3-30B-A3B	78.83	81.15	+2.9%

* = natively fused QKV
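
For reference, the Δ column in the tables above appears to be the relative change in throughput from nofuse to fuse. A small sketch of that arithmetic (the function name is made up for this example):

```python
def delta_pct(nofuse_tps, fuse_tps):
    """Relative throughput change in percent; positive means the fused GGUF is faster."""
    return (fuse_tps - nofuse_tps) / nofuse_tps * 100.0


# Two pp512 rows from the table above.
qwen2 = round(delta_pct(593.65, 671.34), 1)
llama = round(delta_pct(586.12, 546.95), 1)
```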

am17an · 2026-05-07T04:41:41Z

I think your PP results might have a lot of noise; I will post some results on other hardware. As I see it, this should be a performance gain across the board.

am17an · 2026-05-07T16:39:51Z

Looks like gemma 4 couldn't be converted, can you check?

JoursBleu · 2026-05-08T13:30:56Z

Hi @am17an,

Gemma4 support added. Greedy decoding is bit-identical to the nofuse GGUF. Results for gemma-4-31B-it:

R9700:

quant	GPUs	pp512 nofuse → fuse	tg128 nofuse → fuse
Q4_K_M	1	890.35 → 854.15 (−4.1%)	24.13 → 24.06 (−0.3%)
Q8_0	2	314.13 → 312.10 (−0.6%)	15.31 → 15.50 (+1.2%)

Halo:

quant	pp512 nofuse → fuse	tg128 nofuse → fuse
Q4_K_M	290.20 → 299.87 (+3.3%)	9.76 → 9.96 (+2.0%)
Q8_0	298.76 → 304.02 (+1.8%)	6.35 → 6.42 (+1.1%)

CISC · 2026-05-08T13:36:51Z

Fun side quest, make sure this works for NVFP4 too.

am17an · 2026-05-09T09:52:09Z

+        // Optional fused QKV (from converter --fuse-qkv). Only present on full attention layers
+        // (KV-shared layers have no V, so the converter cannot fuse them).
+        const int64_t n_embd_q = n_embd_head * n_head;
+        layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", i),


shouldn't this be using the function in llama_graph?

am17an · 2026-05-09T10:58:01Z

Performance uplift mostly everywhere, but I guess larger ubatch will have larger benefit due to sharing the activation

Without fusion (2x 4090)

model	size	params	backend	ngl	n_ubatch	fa	test	t/s
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	1	1	pp512	28.27 ± 0.00
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	2	1	pp512	55.29 ± 0.00
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	4	1	pp512	109.55 ± 0.02
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	8	1	pp512	202.16 ± 0.04
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	16	1	pp512	386.55 ± 0.19
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	32	1	pp512	759.59 ± 0.05
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	512	1	pp512	2756.28 ± 21.14

With fusion

model	size	params	backend	ngl	n_ubatch	fa	test	t/s
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	1	1	pp512	28.45 ± 0.00
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	2	1	pp512	55.72 ± 0.01
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	4	1	pp512	110.37 ± 0.02
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	8	1	pp512	205.69 ± 0.05
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	16	1	pp512	394.31 ± 0.09
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	32	1	pp512	774.52 ± 0.10
gemma4 31B Q8_0	30.38 GiB	30.70 B	CUDA	99	512	1	pp512	2779.28 ± 19.25

JoursBleu · 2026-05-12T04:16:11Z

@am17an Thanks for the test! I reproduced the results:

2× 4090, gemma4-31B Q8_0, -ngl 99 -fa 1 -p 512 -n 0 -r 3:

n_ubatch	nofuse t/s	fuse t/s	Δ
1	28.95	29.14	+0.7%
8	206.24	210.00	+1.8%
16	396.04	403.81	+2.0%
32	779.30	794.65	+2.0%
512	2777.38	2801.18	+0.9%

The gain peaks at about +2% for ub=16 to 32, then drops back to +0.9% at ub=512.

We captured an nsys profile trace at ub=32:

kernel             variant  instances  total       avg/inst
mul_mat_q<Q8_0>    nofuse   2442       224.6 ms     91.98 µs
mul_mat_q<Q8_0>    fuse     1842       223.1 ms    121.14 µs
stream_k_fixup     nofuse   2382         8.85 ms     3.71 µs
stream_k_fixup     fuse     1482         6.24 ms     4.21 µs
quantize_mmq_q8_1  nofuse   2442         4.12 ms     1.69 µs
quantize_mmq_q8_1  fuse     1842         3.26 ms     1.77 µs

Fusion removed 600 mul_mat_q, 900 stream_k_fixup and 600 quantize launches, saving ~5 ms in total.
In addition, fusion lets Q/K/V share the same activation input, so the activation is read from device memory once instead of three times.
As the ubatch grows, the arithmetic intensity also grows, so the benefit from the reduced memory access shrinks.
At the same time, the launch savings are a fixed amount, so they shrink relative to the growing amount of work per launch. This is why the gain drops back at ub=512.
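
The launch and time savings quoted above can be re-derived from the trace with a few lines of arithmetic. This is only a bookkeeping sketch: the numbers are copied from the nsys table, and the function names are made up for this example.

```python
# (instances, total_ms) per kernel, copied from the nsys trace above (ub=32).
TRACE = {
    "mul_mat_q<Q8_0>":   {"nofuse": (2442, 224.6), "fuse": (1842, 223.1)},
    "stream_k_fixup":    {"nofuse": (2382, 8.85),  "fuse": (1482, 6.24)},
    "quantize_mmq_q8_1": {"nofuse": (2442, 4.12),  "fuse": (1842, 3.26)},
}


def savings(trace):
    """Per kernel: (launches removed by fusion, total milliseconds saved)."""
    rows = {}
    for kernel, variants in trace.items():
        n_nofuse, ms_nofuse = variants["nofuse"]
        n_fuse, ms_fuse = variants["fuse"]
        rows[kernel] = (n_nofuse - n_fuse, ms_nofuse - ms_fuse)
    return rows


def total_ms_saved(trace):
    """Sum of the per-kernel time savings in milliseconds."""
    return sum(ms for _, ms in savings(trace).values())
```

Note that the average time per mul_mat_q instance goes up with fusion (91.98 µs to 121.14 µs) because each launch now covers more work, so most of the saving comes from the smaller number of launches and fixups rather than from faster individual kernels.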

We also tested on Spark (GB10). Since its FLOPS/bandwidth ratio is much higher than the 4090's, the roofline knee sits further out (it takes a larger ubatch to enter the compute-bound region), so the bandwidth-bound region is wider and fusion keeps paying off even at large ub.

GB10 / Spark, gemma4-31B Q8_0:

n_ubatch	nofuse t/s	fuse t/s	Δ
1	7.16	7.17	+0.1%
8	52.95	53.36	+0.8%
32	179.66	182.18	+1.4%
128	564.89	577.81	+2.3%
512	691.13	757.33	+9.6%

In short: "larger ubatch ⇒ larger benefit" holds in the launch-bound region, but on 31B / 4090 the per-layer GEMM is already BW-bound, so the curve tops out at ~2% and drops again once ub=512 enters compute-bound. On spark the gain is noticeably larger.
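
The roofline argument above can be made concrete with a toy model. Everything numeric here is an assumption for illustration only: the matrix shape, the bytes per element, and the two ridge points (peak FLOPS divided by memory bandwidth) are hypothetical and are not measurements of the 4090 or GB10. The sketch only shows the qualitative effect that a higher FLOPS/bandwidth ratio moves the knee to a larger ubatch.

```python
def arithmetic_intensity(n_ubatch, k, n_out, bytes_per_weight=1.0, bytes_per_act=2.0):
    """FLOP per byte for one [n_ubatch x k] @ [k x n_out] GEMM in a simple traffic model."""
    flops = 2.0 * n_ubatch * k * n_out
    # Weights are read once; activations are read and results written once per token.
    traffic = k * n_out * bytes_per_weight + n_ubatch * (k + n_out) * bytes_per_act
    return flops / traffic


def knee_ubatch(ridge, k, n_out, candidates=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)):
    """Smallest candidate ubatch whose intensity reaches the ridge point, else None."""
    for n in candidates:
        if arithmetic_intensity(n, k, n_out) >= ridge:
            return n
    return None


# Hypothetical fused-QKV shape and two hypothetical machines (ridge in FLOP/byte).
K, N_OUT = 4096, 6144
knee_low_ratio = knee_ubatch(100.0, K, N_OUT)   # lower FLOPS/bandwidth ratio
knee_high_ratio = knee_ubatch(400.0, K, N_OUT)  # higher FLOPS/bandwidth ratio
```

Below the knee the GEMM is bandwidth-bound in this model, which is the region where sharing the activation read between Q, K and V can still help.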

JoursBleu · 2026-05-27T04:56:50Z

Fixes the Qwen-3.6-35B-A3B --fuse-qkv crash reported on #22710:

Arch	model	PATCHED+FUSED	PATCHED+NOFUSE	master+FUSED
qwen35	Qwen3.5-0.8B	ok	ok (identical)	crash
qwen35moe	Qwen-3.6-35B-A3B	ok	ok (identical)	crash
olmo2	OLMo-2-1B	ok	ok (identical)	crash
gemma3n	Gemma-3n-E4B	ok	ok (identical)	crash
olmoe	OLMoE-1B-7B	ok	ok (identical)	crash
falcon-h1	Falcon-H1-Tiny-90M	ok	ok (identical)	ok
lfm2	LFM2.5-1.2B	ok	ok (identical)	ok
jamba	Jamba-3B	ok	ok (identical)	ok
granite-hybrid	Granite-4.1-3B	ok	ok (identical)	ok
nemotron-h	Nemotron-H-8B	ok	ok (identical)	ok

JoursBleu requested a review from CISC as a code owner May 7, 2026 02:24

github-actions Bot added the python python script changes label May 7, 2026

JoursBleu marked this pull request as draft May 7, 2026 02:27

JoursBleu force-pushed the convert/add-fuse-qkv-flag branch from 9204fcb to 3f8c2f9 Compare May 8, 2026 12:58

github-actions Bot added the model Model specific label May 8, 2026

am17an reviewed May 9, 2026

am17an mentioned this pull request May 12, 2026

Fuse rms_norm, mul, quantize_q8_1 #22710

Open

JoursBleu force-pushed the convert/add-fuse-qkv-flag branch 2 times, most recently from c3e64f5 to 74b94d9 Compare May 23, 2026 06:30

JoursBleu force-pushed the convert/add-fuse-qkv-flag branch 2 times, most recently from 2be160a to 435b225 Compare May 29, 2026 12:12

convert : add --fuse-qkv flag to fuse Q/K/V into QKV during HF-to-GGUF conversion

a26ba0f

JoursBleu force-pushed the convert/add-fuse-qkv-flag branch from 435b225 to a26ba0f Compare May 29, 2026 12:17

JoursBleu marked this pull request as ready for review June 3, 2026 06:25

Merge branch 'master' into convert/add-fuse-qkv-flag

02ebd8f

github-actions Bot added the conversion label Jun 29, 2026

JoursBleu wants to merge 2 commits into ggml-org:master from JoursBleu:convert/add-fuse-qkv-flag

3 participants
