Fuse rms_norm, mul, quantize_q8_1 by lnigam · Pull Request #22710 · ggml-org/llama.cpp

lnigam · 2026-05-05T11:08:57Z

Overview

Fuse rms_norm+ mul+ qunatize_q8_1

Additional information

Tested on Qwen-3.6-35B-A3B-Q4_KM model, out of 81 RMS norms 41 are fused and 40 remained unfused due to incompatible MoE gate_input which requires F32 input only.

On 5090, it gives around 3-5% perf boost:
command: ./llama-bench -m ~/trust/Qwen3.6-35B-A3B-Q4_K_M.gguf -n 32 -ngl 500 -r 5 -fa 1 -d 0,100,512,4096 -p 0

Without change:

model	size	params	backend	ngl	fa	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32	220.76 ± 8.60
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d100	248.71 ± 11.13
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d512	245.73 ± 10.62
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d4096	242.65 ± 10.62

With-change:

model	size	params	backend	ngl	fa	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32	232.91 ± 9.05
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d100	259.02 ± 10.00
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d512	255.50 ± 10.40
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d4096	251.02 ± 9.78

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: yes, mostly for code review

am17an

This PR creates a separate ggml_tensor via a copy, but actually still modifies the underlying data pointer of the mul node from f32 to q8_1. I think it is maybe ok, @JohannesGaessler what do you think?

Also heads up the current PR assumes we also quantize activations to Q8_1 today (which is correct), but potentially could be changed to bf16 scales (e.g. #22571)

am17an · 2026-05-06T13:22:23Z

+        for (int j = i + 2; j < cgraph->n_nodes && found < mul_use_count; j++) {
+            ggml_tensor * cand     = cgraph->nodes[j];
+            const bool    uses_mul = cand->src[0] == mul_node || cand->src[1] == mul_node;
+            if (!uses_mul) { continue; }


probably run this through git clang-format

Suggested change

if (!uses_mul) { continue; }

if (!uses_mul) {

continue;

}

am17an · 2026-05-06T13:23:18Z

+            // can use the pre-quantized buffer directly, without type-patching the
+            // original tensor or allocating a separate side buffer.


Suggested change

// can use the pre-quantized buffer directly, without type-patching the

// original tensor or allocating a separate side buffer.

// can use the pre-quantized buffer directly

ORippler

Let's not materialize a fleeting tensor of MUL_MAT (GGML does not even expose a GGML_OP_QUANTIZE operator yet). Instead, I feel we should fuse RMS_NORM + MUL + MUL_MAT instead of RMS_NORM + MUL -> Q8_1 fused quantize + MUL_MAT.

This avoids mutating the ggml_cgraph (which is only allowed in the graph_optimize function inside the cuda backend), and avoids materializing the fleeting, quantized, activation-tensor of the cuda backend's MUL_MAT op.

ORippler · 2026-05-06T14:16:09Z

Out of curiosity: How much perf gain is left on BW GPUs if you merge this an #22522? Still think this is something we should do, but I suspect it will mainly benefit GPUs with CC < 90.

am17an · 2026-05-06T14:20:41Z

I agree that quantizing rms_norm + mul + mul_mat is the best. I tried it earlier but it was quite complex and didn't lead to the perf gains I had hoped. #18538 was another attempt at doing this in a different way.

However there are two factors long term that may make this exercise a bit futile. One is the fused exps (in mainline, ggml-org models seem to use this by default), and the other is the fused QKV (not in mainline yet but should be there soon hopefully). If there was a simple enough fusion which doesn't modify the graph then it would still be worth merging obviously.

lnigam · 2026-05-06T14:43:31Z

@ORippler RMS_NORM+MUL is fed to 5 MUL_MAT (quantize_q8_1 + MUL_MAT) in case of Qwen-3.6-35b-A3b, If we fuse RMS_NORM + MUL + MUL_MAT, we only save dispatch latency but won't be able to reuse compute. There is also one more RMS_NORM+MUL here which is getting consumed by 2 MUL_MAT_ID(expert_gate, expert_up needs Q8_1), expert_input(F32), shared_expert_up and gate(Q8_1), and residual(F32) which is not getting fused here. or else I need to have separate Q8_1 tensor per layer per RMS.
If there is an Quantize operator, we can better fuse and reuse the compute.

JohannesGaessler

Fusion should strictly be an operation that executes multiple whole ggml ops as one kernel, not one that executes only part of an ggml op like only the quantization of activations for MUL_MAT. The solution of taking a q8_1 view of an FP32 tensor in particular is not acceptable because it opens up the risk of accidentally overwriting data that is still supposed to be read by another kernel.

In order to fuse the quantization of activations with other ops, ggml_backend_cuda_graph_optimize should first expand MUL_MAT into one op that does the quantization of activations and one that does the actual matrix multiplication. The ggml graph allocator will then assign a dedicated buffer for the quantized activations that is safe to use (and also preferable over the current pool buffer since it is more VRAM efficient). To avoid adding complexity for the other backends, it may make sense to use GGML_OP_CUSTOM for operations that the CUDA backend is adding internally for itself.

lnigam · 2026-05-06T16:17:55Z

I will change the map to thread_local which will prevent race conditions. Before this fusion, I am checking for the consumers of this tensor, if all the consumers are expecting the same Q8_1 data, then only I am fusing here, else not fusing at all. At MUL node, I am writing F32 or Q8_1 based on the fusion condition. in case of qwen-3.6-35B-A3B, per decode there are around 81 RMS_NORMS+MUL, out of which only 41 are getting fused because all 5 consumers are MUL_MAT and expect Q8_1. but rest 40 RMS_NORMS+MUL are left without fusion. so this change should be safe.
Regarding addition of the GGML_OP_CUSTOM for quantization, when(before or after split) and where(gallocr or ggml_cuda_pool) I am supposed to allocate the tensor memory for this node?

ORippler · 2026-05-07T09:30:29Z

@ORippler RMS_NORM+MUL is fed to 5 MUL_MAT (quantize_q8_1 + MUL_MAT) in case of Qwen-3.6-35b-A3b, If we fuse RMS_NORM + MUL + MUL_MAT, we only save dispatch latency but won't be able to reuse compute. There is also one more RMS_NORM+MUL here which is getting consumed by 2 MUL_MAT_ID(expert_gate, expert_up needs Q8_1), expert_input(F32), shared_expert_up and gate(Q8_1), and residual(F32) which is not getting fused here. or else I need to have separate Q8_1 tensor per layer per RMS.
If there is an Quantize operator, we can better fuse and reuse the compute.

I wasn't aware of this. A graphical visualization of what happens at the ggml_cgraph level would have helped in the initial PR description, but thanks for laying it out like this.

In order to fuse the quantization of activations with other ops, ggml_backend_cuda_graph_optimize should first expand MUL_MAT into one op that does the quantization of activations and one that does the actual matrix multiplication. The ggml graph allocator will then assign a dedicated buffer for the quantized activations that is safe to use (and also preferable over the current pool buffer since it is more VRAM efficient). To avoid adding complexity for the other backends, it may make sense to use GGML_OP_CUSTOM for operations that the CUDA backend is adding internally for itself.

I agree with Johannes proposal as the most sensical solution to this.

Regarding addition of the GGML_OP_CUSTOM for quantization, when(before or after split) and where(gallocr or ggml_cuda_pool) I am supposed to allocate the tensor memory for this node?

As per johannes comment above, this should happen in ggml_backend_cuda_graph_optimize, as graph_optimize is

allowed to change a ggml_cgraph (see my comment here Fuse rms_norm, mul, quantize_q8_1 #22710 (review)), and
called by ggml's backend scheduler before reserving and allocating memory -> you would only have to create the tensor there, and defer allocation to the backend scheduler.

I think it makes sense to do this in a stacked PR to isolate ggml changes from cuda-backend-changes while giving an example for their value. I personally think having a quantize/dequantize op is sensible, but feel we should co-design this with expanding support for "derived"/non-exclusively-block-scaled-quant recipes. Unfortunately, I don't have the bandwidth to spend focus time on this at the moment to come up with a proposal to share on my side.

JohannesGaessler · 2026-05-07T12:00:19Z

As @ORippler said, the graph optimization happens before it is handed over to the graph allocator so from your side you should not need to worry about allocating ggml_tensor::data, it should happen automatically. Though as of right now I think all graph optimizations result in graphs that are the same size or smaller than the original graph and I'm not 100% sure whether adding extra nodes will be OK in terms of how much memory is currently being allocated for ggml_cgraph::nodes in the backend scheduler for the "splits" per backend.

ORippler · 2026-05-07T15:54:03Z

Okay upon further reading the way to quantize/dequantize in ggml is via ggml_cast, so no custom op needed.

I guess we should first align/confirm that a backend is allowed to change the number of nodes in the call to graph_optimize (@ggerganov any idea if this may break something? From a naive perspective, I feel mutating all aspects of a graph should be allowed by a backend here). After, we could proceed with the suggestion from @JohannesGaessler

In order to fuse the quantization of activations with other ops, ggml_backend_cuda_graph_optimize should first expand MUL_MAT into one op that does the quantization of activations and one that does the actual matrix multiplication.

lnigam · 2026-05-07T16:46:31Z

any node(custom_op or ggml_cast) addition in graph_optimize would need the full_graph * (currently only have a view of graph) to make changes in the graph.

first iterate through the split and create a list of nodes to be inserted in one pass
after complete parsing of the split, insert all these nodes into the full_graph. Insertion of node would require array reserve or re-allocation.
indexes of the subsequent splits would also change.
sched_graph created from the original graph would allocate required tensors (one F32 and one quantized)
execute original graph would fuse these nodes.

Is this flow acceptable?

am17an · 2026-05-07T16:51:01Z

I don't think such an intrusive change is warranted for a few % of performance. As I said earlier this fusion might not be needed after we fusion QKV and gate and up. Even if not then we should make GGML_CUDA_GRAPH_OPT work more generally which can parallelize these operations.

JohannesGaessler · 2026-05-07T17:17:52Z

I think it would be enough to just overallocate the ggml_context and ggml_cgraph memory for the splits by 2x and to provide ggml backends with the ggml_context in which they can allocate extra nodes, the backend should then just re-arrange the nodes in the memory buffers of ggml_cgraph as needed.

I don't think such an intrusive change is warranted for a few % of performance. As I said earlier this fusion might not be needed after we fusion QKV and gate and up. Even if not then we should make GGML_CUDA_GRAPH_OPT work more generally which can parallelize these operations.

I agree that factoring out activation quantization into a separate ggml op would not be worthwhile in terms of opportunity cost just to get a few more % performance. However, there is another reason to do this: it would minimize buffer pool allocations. As it is we frequently need to allocate memory for type conversions that is not part of the memory allocated by the graph allocator. This is undesirable because that memory cannot be recycled for other nodes in the graph and it cannot be reliably estimated for --fit. So I think it would be preferable to use the buffer pool only for allocations that are negligible in size.

lnigam · 2026-05-11T10:28:06Z

@ORippler @JohannesGaessler with PDL I see around 15% improvement. With my changes (changing the tensor type in graph_optimize, fusing 41 out of 81 RMS norms) on top of PDL gives 2-3% perf improvement.

Command: .\llama-bench.exe --model "D:\models\Qwen\GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" -n 32 -ngl 500 -r 5 -fa 1 -d 0,100,512,4096 -p 0
Fused RMS+PDL:

model	size	params	backend	ngl	fa	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32	256.49 ± 3.12
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d100	291.00 ± 14.26
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d512	284.91 ± 14.70
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d4096	280.79 ± 13.37

only-PDL:

model	size	params	backend	ngl	fa	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32	247.12 ± 4.02
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d100	284.08 ± 14.45
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d512	279.55 ± 14.01
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d4096	275.74 ± 13.93

Latest without any change:

model	size	params	backend	ngl	fa	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32	212.03 ± 16.99
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d100	246.12 ± 9.82
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d512	244.33 ± 10.73
qwen35moe 35B.A3B Q4_K - Medium	19.82 GiB	34.66 B	CUDA	500	1	tg32 @ d4096	239.31 ± 9.32

…atch mul->type=Q8_1 in graph_optimize so gallocr allocates Q8_1 memory when all consumers are MMVQ-eligible; fuse norm+mul+quantize into single block_reduce kernel; MMVQ consumers skip re-quantization when src1 is already Q8_1; add test_rms_norm_mul_q8_1_mul_mat

…d to 64 registers for 1024 blocks failing test case on SM75. __launch_bounds__will take care of that

am17an · 2026-05-12T07:16:14Z

Not sure why you're still proceeding the approach that is not acceptable (changing the mul's node type), you can also check whether fusing QKV like this #22780 (comment) results in the same speed-up

lnigam · 2026-05-12T08:10:03Z

@am17an I am changing the mul node in graph_optimize now which should be acceptable as per the above discussion. gallocr will allocate either q8_1(if fused) or FP32(not fused). I am not reusing the F32 buffer for storing Q8_1 which I did earlier. As @JohannesGaessler mentioned it will also reduce the buffer pool allocation for quantization.

am17an · 2026-05-12T08:41:25Z

Okay I'm a bit unclear on the discussion then. @JohannesGaessler is this what you recommended?

lnigam · 2026-05-12T10:42:07Z

@am17an with this [QKV-Fusion PR], the exported Qwen-3.6-35B-A3B model is not working with the corresponding llama build. model without QKV fusions works well.
attached fused and non-fused model's llama-bench logs.
log-without-QKV-fusion.txt
QKV_fusion_verbose_logs.txt

JoursBleu · 2026-05-13T03:02:21Z

@lnigam @am17an The root cause here is that Qwen-3.6-35B-A3B is a hybrid architecture. The non-standard Gated Delta Net layers share the attn_qkv namespace with the full-attention layers. So the converter ends up fusing tensors that the loader's fused path can't interpret correctly. I will fix this.
But, for all these hybrid archs in the same (qwen35moe / qwen3next / nemotron-h / jamba / granite-hybrid / falcon-h1 / lfm2 / kimi-linear / gemma3n), how should we handle the non-standard layers? Would it be acceptable to restrict --fuse-qkv to only the standard self-attention layers and skip the SSM / GDN / linear-attention ones?

JohannesGaessler

I think the approach of changing the type of the tensor in the graph optimization step is potentially workable. However, the way this is implemented is definitely not correct for all edge cases. One problem is for example if fusion is disabled via environment variable.

The fusion logic in this PR is currently being complicated by legacy code surrounding ggml_cuda_op_mul_mat. I would suggest we postpone this PR until --split-mode row has been removed, after that ggml_cuda_op_mul_mat is no longer needed and I think it will be much easier to reason about the correctness of fusing the quantization kernel with other tensors.

JohannesGaessler · 2026-05-17T20:32:00Z

+            if (cand->src[0] != mul && cand->src[1] != mul) {
+                continue;
+            }


Just iterate over GGML_MAX_SRC, I don't think it makes sense to risk a potential but in the future here.

JohannesGaessler · 2026-05-17T20:36:12Z

+            const bool    is_mmvq_op     = cand->op == GGML_OP_MUL_MAT || cand->op == GGML_OP_MUL_MAT_ID;
+            const bool    src0_quantized = cand->src[0] && ggml_is_quantized(cand->src[0]->type);
+            const int64_t batch          = (cand->op == GGML_OP_MUL_MAT_ID) ? cand->ne[2] : cand->ne[1];
+            if (!is_mmvq_op || !src0_quantized || batch > MMVQ_MAX_BATCH_SIZE) {
+                all_mmvq = false;
+            }


This is the wrong logic, it has to exactly mirror the kernel selection logic in ggml_cuda_mul_mat.

am17an · 2026-05-18T04:41:08Z

Since I wrote most of fusion code and intend to maintain it, I don't like how this PR is done. It introduces a new concept for very little gain. If the intention is to reduce pool allocations, then it should be done separately.

JohannesGaessler · 2026-05-18T13:10:14Z

Okay, that is I think understandable. I think we should in any case remove the legacy code first since that will reduce the maintenance burden either way.

lnigam requested review from a team and ggerganov as code owners May 5, 2026 11:08

lnigam changed the title ~~Fuse rms norm mul qunatize q8 1~~ Fuse rms norm mul quantize q8 1 May 5, 2026

lnigam changed the title ~~Fuse rms norm mul quantize q8 1~~ Fuse rms_norm, mul, quantize_q8_1 May 5, 2026

am17an reviewed May 5, 2026

View reviewed changes

Comment thread ggml/src/ggml-cuda/norm.cu Outdated

am17an requested changes May 5, 2026

View reviewed changes

github-actions Bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 5, 2026

am17an reviewed May 6, 2026

View reviewed changes

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated

am17an reviewed May 6, 2026

View reviewed changes

ORippler requested changes May 6, 2026

View reviewed changes

JohannesGaessler requested changes May 6, 2026

View reviewed changes

lnigam force-pushed the fuse_rms_norm_mul_qunatize_q8_1 branch from 382ceb3 to bebea65 Compare May 11, 2026 12:35

lnigam added 2 commits May 11, 2026 12:50

clang formatting changes

a16f79a

register pressure in rms_norm_mul_q8_1_f32 kernel is getting increase…

ac3e043

…d to 64 registers for 1024 blocks failing test case on SM75. __launch_bounds__will take care of that

lnigam commented May 11, 2026

View reviewed changes

Comment thread ggml/src/ggml-cuda/norm.cu

lnigam requested a review from am17an May 12, 2026 07:04

lnigam requested review from JohannesGaessler and ORippler May 12, 2026 07:04

JohannesGaessler reviewed May 17, 2026

View reviewed changes

JoursBleu mentioned this pull request May 27, 2026

convert : add --fuse-qkv flag to fuse Q/K/V into QKV during HF-to-GGUF conversion #22780

Open

		// can use the pre-quantized buffer directly, without type-patching the
		// original tensor or allocating a separate side buffer.

Uh oh!

Conversation

lnigam commented May 5, 2026

Overview

Additional information

Requirements

Uh oh!

Uh oh!

Uh oh!

am17an left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

am17an May 6, 2026

Choose a reason for hiding this comment

Uh oh!

am17an May 6, 2026

Choose a reason for hiding this comment

Uh oh!

ORippler left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ORippler commented May 6, 2026

Uh oh!

am17an commented May 6, 2026

Uh oh!

lnigam commented May 6, 2026

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

lnigam commented May 6, 2026

Uh oh!

ORippler commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented May 7, 2026

Uh oh!

ORippler commented May 7, 2026

Uh oh!

lnigam commented May 7, 2026

Uh oh!

am17an commented May 7, 2026

Uh oh!

JohannesGaessler commented May 7, 2026

Uh oh!

lnigam commented May 11, 2026

Uh oh!

Uh oh!

am17an commented May 12, 2026

Uh oh!

lnigam commented May 12, 2026

Uh oh!

am17an commented May 12, 2026

Uh oh!

lnigam commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoursBleu commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler May 17, 2026

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler May 17, 2026

Choose a reason for hiding this comment

Uh oh!

am17an commented May 18, 2026

Uh oh!

JohannesGaessler commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

am17an left a comment •

edited

Loading

ORippler left a comment •

edited

Loading

ORippler commented May 7, 2026 •

edited

Loading

lnigam commented May 12, 2026 •

edited

Loading

JoursBleu commented May 13, 2026 •

edited

Loading