Skip to content

Fuse rms_norm, mul, quantize_q8_1#22710

Open
lnigam wants to merge 3 commits into
ggml-org:masterfrom
lnigam:fuse_rms_norm_mul_qunatize_q8_1
Open

Fuse rms_norm, mul, quantize_q8_1#22710
lnigam wants to merge 3 commits into
ggml-org:masterfrom
lnigam:fuse_rms_norm_mul_qunatize_q8_1

Conversation

@lnigam

@lnigam lnigam commented May 5, 2026

Copy link
Copy Markdown
Contributor

Overview

Fuse rms_norm+ mul+ qunatize_q8_1

Additional information

Tested on Qwen-3.6-35B-A3B-Q4_KM model, out of 81 RMS norms 41 are fused and 40 remained unfused due to incompatible MoE gate_input which requires F32 input only.

On 5090, it gives around 3-5% perf boost:
command: ./llama-bench -m ~/trust/Qwen3.6-35B-A3B-Q4_K_M.gguf -n 32 -ngl 500 -r 5 -fa 1 -d 0,100,512,4096 -p 0

Without change:

model size params backend ngl fa test t/s
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 220.76 ± 8.60
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d100 248.71 ± 11.13
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d512 245.73 ± 10.62
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d4096 242.65 ± 10.62

With-change:

model size params backend ngl fa test t/s
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 232.91 ± 9.05
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d100 259.02 ± 10.00
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d512 255.50 ± 10.40
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d4096 251.02 ± 9.78

Requirements

@lnigam lnigam requested review from a team and ggerganov as code owners May 5, 2026 11:08
@lnigam lnigam changed the title Fuse rms norm mul qunatize q8 1 Fuse rms norm mul quantize q8 1 May 5, 2026
@lnigam lnigam changed the title Fuse rms norm mul quantize q8 1 Fuse rms_norm, mul, quantize_q8_1 May 5, 2026
Comment thread ggml/src/ggml-cuda/norm.cu Outdated
@github-actions github-actions Bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 5, 2026
Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated

@am17an am17an left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR creates a separate ggml_tensor via a copy, but actually still modifies the underlying data pointer of the mul node from f32 to q8_1. I think it is maybe ok, @JohannesGaessler what do you think?

Also heads up the current PR assumes we also quantize activations to Q8_1 today (which is correct), but potentially could be changed to bf16 scales (e.g. #22571)

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
for (int j = i + 2; j < cgraph->n_nodes && found < mul_use_count; j++) {
ggml_tensor * cand = cgraph->nodes[j];
const bool uses_mul = cand->src[0] == mul_node || cand->src[1] == mul_node;
if (!uses_mul) { continue; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably run this through git clang-format

Suggested change
if (!uses_mul) { continue; }
if (!uses_mul) {
continue;
}

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
Comment on lines +4030 to +4031
// can use the pre-quantized buffer directly, without type-patching the
// original tensor or allocating a separate side buffer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// can use the pre-quantized buffer directly, without type-patching the
// original tensor or allocating a separate side buffer.
// can use the pre-quantized buffer directly

@ORippler ORippler left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not materialize a fleeting tensor of MUL_MAT (GGML does not even expose a GGML_OP_QUANTIZE operator yet). Instead, I feel we should fuse RMS_NORM + MUL + MUL_MAT instead of RMS_NORM + MUL -> Q8_1 fused quantize + MUL_MAT.

This avoids mutating the ggml_cgraph (which is only allowed in the graph_optimize function inside the cuda backend), and avoids materializing the fleeting, quantized, activation-tensor of the cuda backend's MUL_MAT op.

@ORippler

ORippler commented May 6, 2026

Copy link
Copy Markdown
Collaborator

Out of curiosity: How much perf gain is left on BW GPUs if you merge this an #22522? Still think this is something we should do, but I suspect it will mainly benefit GPUs with CC < 90.

@am17an

am17an commented May 6, 2026

Copy link
Copy Markdown
Contributor

I agree that quantizing rms_norm + mul + mul_mat is the best. I tried it earlier but it was quite complex and didn't lead to the perf gains I had hoped. #18538 was another attempt at doing this in a different way.

However there are two factors long term that may make this exercise a bit futile. One is the fused exps (in mainline, ggml-org models seem to use this by default), and the other is the fused QKV (not in mainline yet but should be there soon hopefully). If there was a simple enough fusion which doesn't modify the graph then it would still be worth merging obviously.

@lnigam

lnigam commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

@ORippler RMS_NORM+MUL is fed to 5 MUL_MAT (quantize_q8_1 + MUL_MAT) in case of Qwen-3.6-35b-A3b, If we fuse RMS_NORM + MUL + MUL_MAT, we only save dispatch latency but won't be able to reuse compute. There is also one more RMS_NORM+MUL here which is getting consumed by 2 MUL_MAT_ID(expert_gate, expert_up needs Q8_1), expert_input(F32), shared_expert_up and gate(Q8_1), and residual(F32) which is not getting fused here. or else I need to have separate Q8_1 tensor per layer per RMS.
If there is an Quantize operator, we can better fuse and reuse the compute.

@JohannesGaessler JohannesGaessler left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fusion should strictly be an operation that executes multiple whole ggml ops as one kernel, not one that executes only part of an ggml op like only the quantization of activations for MUL_MAT. The solution of taking a q8_1 view of an FP32 tensor in particular is not acceptable because it opens up the risk of accidentally overwriting data that is still supposed to be read by another kernel.

In order to fuse the quantization of activations with other ops, ggml_backend_cuda_graph_optimize should first expand MUL_MAT into one op that does the quantization of activations and one that does the actual matrix multiplication. The ggml graph allocator will then assign a dedicated buffer for the quantized activations that is safe to use (and also preferable over the current pool buffer since it is more VRAM efficient). To avoid adding complexity for the other backends, it may make sense to use GGML_OP_CUSTOM for operations that the CUDA backend is adding internally for itself.

@lnigam

lnigam commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

I will change the map to thread_local which will prevent race conditions. Before this fusion, I am checking for the consumers of this tensor, if all the consumers are expecting the same Q8_1 data, then only I am fusing here, else not fusing at all. At MUL node, I am writing F32 or Q8_1 based on the fusion condition. in case of qwen-3.6-35B-A3B, per decode there are around 81 RMS_NORMS+MUL, out of which only 41 are getting fused because all 5 consumers are MUL_MAT and expect Q8_1. but rest 40 RMS_NORMS+MUL are left without fusion. so this change should be safe.
Regarding addition of the GGML_OP_CUSTOM for quantization, when(before or after split) and where(gallocr or ggml_cuda_pool) I am supposed to allocate the tensor memory for this node?

@ORippler

ORippler commented May 7, 2026

Copy link
Copy Markdown
Collaborator

@ORippler RMS_NORM+MUL is fed to 5 MUL_MAT (quantize_q8_1 + MUL_MAT) in case of Qwen-3.6-35b-A3b, If we fuse RMS_NORM + MUL + MUL_MAT, we only save dispatch latency but won't be able to reuse compute. There is also one more RMS_NORM+MUL here which is getting consumed by 2 MUL_MAT_ID(expert_gate, expert_up needs Q8_1), expert_input(F32), shared_expert_up and gate(Q8_1), and residual(F32) which is not getting fused here. or else I need to have separate Q8_1 tensor per layer per RMS.
If there is an Quantize operator, we can better fuse and reuse the compute.

I wasn't aware of this. A graphical visualization of what happens at the ggml_cgraph level would have helped in the initial PR description, but thanks for laying it out like this.

In order to fuse the quantization of activations with other ops, ggml_backend_cuda_graph_optimize should first expand MUL_MAT into one op that does the quantization of activations and one that does the actual matrix multiplication. The ggml graph allocator will then assign a dedicated buffer for the quantized activations that is safe to use (and also preferable over the current pool buffer since it is more VRAM efficient). To avoid adding complexity for the other backends, it may make sense to use GGML_OP_CUSTOM for operations that the CUDA backend is adding internally for itself.

I agree with Johannes proposal as the most sensical solution to this.

Regarding addition of the GGML_OP_CUSTOM for quantization, when(before or after split) and where(gallocr or ggml_cuda_pool) I am supposed to allocate the tensor memory for this node?

As per johannes comment above, this should happen in ggml_backend_cuda_graph_optimize, as graph_optimize is

  1. allowed to change a ggml_cgraph (see my comment here Fuse rms_norm, mul, quantize_q8_1 #22710 (review)), and
  2. called by ggml's backend scheduler before reserving and allocating memory -> you would only have to create the tensor there, and defer allocation to the backend scheduler.

I think it makes sense to do this in a stacked PR to isolate ggml changes from cuda-backend-changes while giving an example for their value. I personally think having a quantize/dequantize op is sensible, but feel we should co-design this with expanding support for "derived"/non-exclusively-block-scaled-quant recipes. Unfortunately, I don't have the bandwidth to spend focus time on this at the moment to come up with a proposal to share on my side.

@JohannesGaessler

Copy link
Copy Markdown
Contributor

As @ORippler said, the graph optimization happens before it is handed over to the graph allocator so from your side you should not need to worry about allocating ggml_tensor::data, it should happen automatically. Though as of right now I think all graph optimizations result in graphs that are the same size or smaller than the original graph and I'm not 100% sure whether adding extra nodes will be OK in terms of how much memory is currently being allocated for ggml_cgraph::nodes in the backend scheduler for the "splits" per backend.

@ORippler

ORippler commented May 7, 2026

Copy link
Copy Markdown
Collaborator

Okay upon further reading the way to quantize/dequantize in ggml is via ggml_cast, so no custom op needed.

I guess we should first align/confirm that a backend is allowed to change the number of nodes in the call to graph_optimize (@ggerganov any idea if this may break something? From a naive perspective, I feel mutating all aspects of a graph should be allowed by a backend here). After, we could proceed with the suggestion from @JohannesGaessler

In order to fuse the quantization of activations with other ops, ggml_backend_cuda_graph_optimize should first expand MUL_MAT into one op that does the quantization of activations and one that does the actual matrix multiplication.

@lnigam

lnigam commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

any node(custom_op or ggml_cast) addition in graph_optimize would need the full_graph * (currently only have a view of graph) to make changes in the graph.

  1. first iterate through the split and create a list of nodes to be inserted in one pass
  2. after complete parsing of the split, insert all these nodes into the full_graph. Insertion of node would require array reserve or re-allocation.
  3. indexes of the subsequent splits would also change.
  4. sched_graph created from the original graph would allocate required tensors (one F32 and one quantized)
  5. execute original graph would fuse these nodes.

Is this flow acceptable?

@am17an

am17an commented May 7, 2026

Copy link
Copy Markdown
Contributor

I don't think such an intrusive change is warranted for a few % of performance. As I said earlier this fusion might not be needed after we fusion QKV and gate and up. Even if not then we should make GGML_CUDA_GRAPH_OPT work more generally which can parallelize these operations.

@JohannesGaessler

Copy link
Copy Markdown
Contributor

I think it would be enough to just overallocate the ggml_context and ggml_cgraph memory for the splits by 2x and to provide ggml backends with the ggml_context in which they can allocate extra nodes, the backend should then just re-arrange the nodes in the memory buffers of ggml_cgraph as needed.

I don't think such an intrusive change is warranted for a few % of performance. As I said earlier this fusion might not be needed after we fusion QKV and gate and up. Even if not then we should make GGML_CUDA_GRAPH_OPT work more generally which can parallelize these operations.

I agree that factoring out activation quantization into a separate ggml op would not be worthwhile in terms of opportunity cost just to get a few more % performance. However, there is another reason to do this: it would minimize buffer pool allocations. As it is we frequently need to allocate memory for type conversions that is not part of the memory allocated by the graph allocator. This is undesirable because that memory cannot be recycled for other nodes in the graph and it cannot be reliably estimated for --fit. So I think it would be preferable to use the buffer pool only for allocations that are negligible in size.

@lnigam

lnigam commented May 11, 2026

Copy link
Copy Markdown
Contributor Author

@ORippler @JohannesGaessler with PDL I see around 15% improvement. With my changes (changing the tensor type in graph_optimize, fusing 41 out of 81 RMS norms) on top of PDL gives 2-3% perf improvement.

Command: .\llama-bench.exe --model "D:\models\Qwen\GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" -n 32 -ngl 500 -r 5 -fa 1 -d 0,100,512,4096 -p 0
Fused RMS+PDL:

model size params backend ngl fa test t/s
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 256.49 ± 3.12
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d100 291.00 ± 14.26
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d512 284.91 ± 14.70
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d4096 280.79 ± 13.37

only-PDL:

model size params backend ngl fa test t/s
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 247.12 ± 4.02
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d100 284.08 ± 14.45
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d512 279.55 ± 14.01
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d4096 275.74 ± 13.93

Latest without any change:

model size params backend ngl fa test t/s
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 212.03 ± 16.99
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d100 246.12 ± 9.82
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d512 244.33 ± 10.73
qwen35moe 35B.A3B Q4_K - Medium 19.82 GiB 34.66 B CUDA 500 1 tg32 @ d4096 239.31 ± 9.32

…atch mul->type=Q8_1 in graph_optimize so gallocr allocates Q8_1 memory when all consumers are MMVQ-eligible; fuse norm+mul+quantize into single block_reduce kernel; MMVQ consumers skip re-quantization when src1 is already Q8_1; add test_rms_norm_mul_q8_1_mul_mat
@lnigam lnigam force-pushed the fuse_rms_norm_mul_qunatize_q8_1 branch from 382ceb3 to bebea65 Compare May 11, 2026 12:35
lnigam added 2 commits May 11, 2026 12:50
…d to 64 registers for 1024 blocks failing test case on SM75. __launch_bounds__will take care of that
Comment thread ggml/src/ggml-cuda/norm.cu
@lnigam lnigam requested a review from am17an May 12, 2026 07:04
@am17an

am17an commented May 12, 2026

Copy link
Copy Markdown
Contributor

Not sure why you're still proceeding the approach that is not acceptable (changing the mul's node type), you can also check whether fusing QKV like this #22780 (comment) results in the same speed-up

@lnigam

lnigam commented May 12, 2026

Copy link
Copy Markdown
Contributor Author

@am17an I am changing the mul node in graph_optimize now which should be acceptable as per the above discussion. gallocr will allocate either q8_1(if fused) or FP32(not fused). I am not reusing the F32 buffer for storing Q8_1 which I did earlier. As @JohannesGaessler mentioned it will also reduce the buffer pool allocation for quantization.

@am17an

am17an commented May 12, 2026

Copy link
Copy Markdown
Contributor

Okay I'm a bit unclear on the discussion then. @JohannesGaessler is this what you recommended?

@lnigam

lnigam commented May 12, 2026

Copy link
Copy Markdown
Contributor Author

@am17an with this [QKV-Fusion PR], the exported Qwen-3.6-35B-A3B model is not working with the corresponding llama build. model without QKV fusions works well.
attached fused and non-fused model's llama-bench logs.
log-without-QKV-fusion.txt
QKV_fusion_verbose_logs.txt

@JoursBleu

JoursBleu commented May 13, 2026

Copy link
Copy Markdown
Contributor

@lnigam @am17an The root cause here is that Qwen-3.6-35B-A3B is a hybrid architecture. The non-standard Gated Delta Net layers share the attn_qkv namespace with the full-attention layers. So the converter ends up fusing tensors that the loader's fused path can't interpret correctly. I will fix this.
But, for all these hybrid archs in the same (qwen35moe / qwen3next / nemotron-h / jamba / granite-hybrid / falcon-h1 / lfm2 / kimi-linear / gemma3n), how should we handle the non-standard layers? Would it be acceptable to restrict --fuse-qkv to only the standard self-attention layers and skip the SSM / GDN / linear-attention ones?

@JohannesGaessler JohannesGaessler left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the approach of changing the type of the tensor in the graph optimization step is potentially workable. However, the way this is implemented is definitely not correct for all edge cases. One problem is for example if fusion is disabled via environment variable.

The fusion logic in this PR is currently being complicated by legacy code surrounding ggml_cuda_op_mul_mat. I would suggest we postpone this PR until --split-mode row has been removed, after that ggml_cuda_op_mul_mat is no longer needed and I think it will be much easier to reason about the correctness of fusing the quantization kernel with other tensors.

Comment on lines +4392 to +4394
if (cand->src[0] != mul && cand->src[1] != mul) {
continue;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just iterate over GGML_MAX_SRC, I don't think it makes sense to risk a potential but in the future here.

Comment on lines +4396 to +4401
const bool is_mmvq_op = cand->op == GGML_OP_MUL_MAT || cand->op == GGML_OP_MUL_MAT_ID;
const bool src0_quantized = cand->src[0] && ggml_is_quantized(cand->src[0]->type);
const int64_t batch = (cand->op == GGML_OP_MUL_MAT_ID) ? cand->ne[2] : cand->ne[1];
if (!is_mmvq_op || !src0_quantized || batch > MMVQ_MAX_BATCH_SIZE) {
all_mmvq = false;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong logic, it has to exactly mirror the kernel selection logic in ggml_cuda_mul_mat.

@am17an

am17an commented May 18, 2026

Copy link
Copy Markdown
Contributor

Since I wrote most of fusion code and intend to maintain it, I don't like how this PR is done. It introduces a new concept for very little gain. If the intention is to reduce pool allocations, then it should be done separately.

@JohannesGaessler

Copy link
Copy Markdown
Contributor

Okay, that is I think understandable. I think we should in any case remove the legacy code first since that will reduce the maintenance burden either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs testing Everything test related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants