convert : add --fuse-qkv flag to fuse Q/K/V into QKV during HF-to-GGUF conversion #22780
JoursBleu wants to merge 2 commits into
Thanks for the PR, do you have any performance analysis for this change?

Hi @am17an, there are too many models to cover, so I only benchmarked Q8_0 on a subset. The speedup is not universal, but since the flag is opt-in, users can choose whether to enable it per model.
Hardware: AMD Radeon AI PRO R9700
Q8_0 pp512 (t/s)
Q8_0 tg128 (t/s)

I think your PP results may have a lot of noise; I will post some results on other hardware. In my view this should be a performance gain across the board.

Looks like gemma 4 couldn't be converted, can you check?
9204fcb to 3f8c2f9
Hi @am17an, Gemma4 support added. Greedy decoding is bit-identical to the nofuse GGUF.
gemma-4-31B-it:
R9700:
Halo:

Fun side quest, make sure this works for NVFP4 too.
// Optional fused QKV (from converter --fuse-qkv). Only present on full attention layers
// (KV-shared layers have no V, so the converter cannot fuse them).
const int64_t n_embd_q = n_embd_head * n_head;
layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", i),
shouldn't this be using the function in llama_graph?

Performance uplift almost everywhere, but I guess a larger ubatch will see a larger benefit because the activation is shared.
Without fusion (2x 4090)
With fusion

@am17an Thanks for the test! I reproduced the results.
2× 4090, gemma4-31B Q8_0:
The gain peaks at +2% for ub=16~32, then drops back to +0.9% at ub=512.

We captured a profile trace with nsys at ub=32: fusion removed 600 mul_mat_q, 900 fixup, and 600 quantize launches, saving ~5 ms of launch time.

We also tested on Spark. Since Spark's FLOPS/bandwidth ratio is much higher than the 4090's, its roofline knee sits further out (it takes a larger ubatch to enter the compute-bound region), so the BW-bound region is wider and fusion keeps paying off even at large ub.
GB10 / Spark, gemma4-31B Q8_0:

In short: "larger ubatch ⇒ larger benefit" holds in the launch-bound region, but on 31B / 4090 the per-layer GEMM is already BW-bound, so the curve tops out at ~2% and drops again once ub=512 enters the compute-bound region. On Spark the gain is noticeably larger.
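
The roofline argument above can be illustrated with a rough back-of-the-envelope sketch. It assumes a weight-streaming GEMM where only the weight bytes count as memory traffic (activation traffic, launch overhead, and kernel efficiency are ignored), and the two devices use made-up round numbers, not real 4090 or GB10 specs. `knee_ubatch` is a hypothetical helper name.

```python
# Q8_0 stores 32 int8 weights plus one fp16 scale per block: 34 bytes / 32 weights.
Q8_0_BYTES_PER_WEIGHT = 34 / 32


def knee_ubatch(peak_flops: float, mem_bw: float,
                bytes_per_weight: float = Q8_0_BYTES_PER_WEIGHT) -> float:
    """Estimate the ubatch size where a GEMM stops being bandwidth-bound.

    For an (M x K) activation times a (K x N) weight:
      FLOPs          = 2 * M * N * K
      bytes streamed ~ N * K * bytes_per_weight   (weights only, an assumption)
      intensity      = 2 * M / bytes_per_weight   [FLOP/byte]
    The GEMM becomes compute-bound once intensity exceeds the machine
    balance peak_flops / mem_bw, i.e. for M above the value returned here.
    """
    machine_balance = peak_flops / mem_bw  # FLOP per byte
    return machine_balance * bytes_per_weight / 2.0


# Hypothetical devices with equal compute but different bandwidth.
knee_a = knee_ubatch(peak_flops=100e12, mem_bw=1000e9)  # lower FLOPS/BW ratio
knee_b = knee_ubatch(peak_flops=100e12, mem_bw=250e9)   # higher FLOPS/BW ratio

print(f"device A knee: ub ~ {knee_a:.1f}")
print(f"device B knee: ub ~ {knee_b:.1f}")
```

Under this simplified model, the device with the higher FLOPS/bandwidth ratio stays bandwidth-bound up to a larger ubatch, which matches the direction of the observation above; the absolute values should not be read as predictions for any real hardware.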
c3e64f5 to 74b94d9
Fixes the Qwen-3.6-35B-A3B
2be160a to 435b225
435b225 to a26ba0f
Overview
#21245 already introduced the model-loading and graph-building paths for fused QKV computation. This PR further adds an opt-in --fuse-qkv flag to convert_hf_to_gguf.py: during HF-to-GGUF conversion, it concatenates the separate Q / K / V weight tensors into a single fused attn_qkv tensor.

- convert_hf_to_gguf.py: add the --fuse-qkv CLI flag. Buffer ATTN_Q / ATTN_K / ATTN_V per layer; once all three have arrived, concatenate them with torch.cat into a single fused ATTN_QKV. Without the flag, conversion is identical to master.
- gguf-py/gguf/constants.py: register MODEL_TENSOR.ATTN_QKV for every arch that already declares ATTN_Q + ATTN_K + ATTN_V (85 entries).
- src/llama-model.cpp: in the fused branch of create_tensor_qkv, if wqkv is found but wqkv_b is absent, fall back to loading the separate wq_b / wk_b / wv_b. --fuse-qkv only fuses weights, not biases, so this fallback is required.
- src/llama-graph.cpp: in the fused branch of build_qkv, if wqkv_b is absent but wq_b / wk_b / wv_b exist, use ggml_concat to concatenate the three bias segments and add them to the result of the fused matmul.

Reason for the changes in src/llama-model.cpp and src/llama-graph.cpp: on architectures with attention bias (qwen2, phi2, starcoder2, stablelm, etc.) the bias would otherwise be silently dropped, producing garbage output; after the fix the output of these archs on a --fuse-qkv GGUF is bit-for-bit identical to the nofuse GGUF.

Diffstat: 4 files, +138 −2.
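
A minimal sketch of the idea described above, not the PR's actual code: buffer Q / K / V per layer, fuse once all three have arrived, and apply the separate biases as one concatenated bias after the fused matmul. It uses plain Python lists in place of torch tensors so it runs without dependencies. `QKVBuffer`, `matvec`, and `add_bias` are hypothetical names, and concatenating along the output (row) dimension in Q, K, V order is an assumption.

```python
from typing import Dict, List, Optional

Matrix = List[List[float]]  # rows = output features, columns = input features


class QKVBuffer:
    """Hypothetical helper: holds Q/K/V weights per layer until all three exist."""

    def __init__(self) -> None:
        self._pending: Dict[int, Dict[str, Matrix]] = {}

    def add(self, layer: int, kind: str, weight: Matrix) -> Optional[Matrix]:
        slot = self._pending.setdefault(layer, {})
        slot[kind] = weight
        if len(slot) < 3:
            return None  # still waiting for the remaining tensors
        del self._pending[layer]
        # Row-wise concatenation in Q, K, V order (stand-in for torch.cat).
        return slot["q"] + slot["k"] + slot["v"]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add_bias(y: List[float], b: List[float]) -> List[float]:
    return [yi + bi for yi, bi in zip(y, b)]


# Toy layer: Q has 2 output rows, K and V have 1 each.
q, k, v = [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0]], [[0.0, 3.0]]
bq, bk, bv = [0.5, 0.5], [1.0], [-1.0]
x = [1.0, 2.0]

buf = QKVBuffer()
buf.add(0, "q", q)
buf.add(0, "k", k)
fused = buf.add(0, "v", v)  # third tensor arrives, fused weight is returned

# Reference path: three separate projections, each with its own bias.
separate = (add_bias(matvec(q, x), bq)
            + add_bias(matvec(k, x), bk)
            + add_bias(matvec(v, x), bv))

# Fused path: one matmul, then the three bias segments concatenated and added.
fused_out = add_bias(matvec(fused, x), bq + bk + bv)

print(fused_out == separate)          # fused path with concatenated bias
print(matvec(fused, x) == separate)   # fused path with the bias dropped
```

The second comparison shows why the loader and graph fallbacks matter: with only the weights fused, skipping the separate biases changes the result.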
Test
test-llama-archs: all OK, 0 FAIL.
Additional information
Following the two-step split discussed in #20628 with @am17an and @ngxson:
build_qkv / create_tensor_qkv helpers.
Requirements