Fuse rms_norm, mul, quantize_q8_1#22710
Conversation
There was a problem hiding this comment.
This PR creates a separate ggml_tensor via a copy, but actually still modifies the underlying data pointer of the mul node from f32 to q8_1. I think it is maybe ok, @JohannesGaessler what do you think?
Also heads up the current PR assumes we also quantize activations to Q8_1 today (which is correct), but potentially could be changed to bf16 scales (e.g. #22571)
| for (int j = i + 2; j < cgraph->n_nodes && found < mul_use_count; j++) { | ||
| ggml_tensor * cand = cgraph->nodes[j]; | ||
| const bool uses_mul = cand->src[0] == mul_node || cand->src[1] == mul_node; | ||
| if (!uses_mul) { continue; } |
There was a problem hiding this comment.
probably run this through git clang-format
| if (!uses_mul) { continue; } | |
| if (!uses_mul) { | |
| continue; | |
| } |
| // can use the pre-quantized buffer directly, without type-patching the | ||
| // original tensor or allocating a separate side buffer. |
There was a problem hiding this comment.
| // can use the pre-quantized buffer directly, without type-patching the | |
| // original tensor or allocating a separate side buffer. | |
| // can use the pre-quantized buffer directly |
There was a problem hiding this comment.
Let's not materialize a fleeting tensor of MUL_MAT (GGML does not even expose a GGML_OP_QUANTIZE operator yet). Instead, I feel we should fuse RMS_NORM + MUL + MUL_MAT instead of RMS_NORM + MUL -> Q8_1 fused quantize + MUL_MAT.
This avoids mutating the ggml_cgraph (which is only allowed in the graph_optimize function inside the cuda backend), and avoids materializing the fleeting, quantized, activation-tensor of the cuda backend's MUL_MAT op.
|
Out of curiosity: How much perf gain is left on BW GPUs if you merge this an #22522? Still think this is something we should do, but I suspect it will mainly benefit GPUs with CC < 90. |
|
I agree that quantizing rms_norm + mul + mul_mat is the best. I tried it earlier but it was quite complex and didn't lead to the perf gains I had hoped. #18538 was another attempt at doing this in a different way. However there are two factors long term that may make this exercise a bit futile. One is the fused exps (in mainline, ggml-org models seem to use this by default), and the other is the fused QKV (not in mainline yet but should be there soon hopefully). If there was a simple enough fusion which doesn't modify the graph then it would still be worth merging obviously. |
|
@ORippler RMS_NORM+MUL is fed to 5 MUL_MAT (quantize_q8_1 + MUL_MAT) in case of Qwen-3.6-35b-A3b, If we fuse RMS_NORM + MUL + MUL_MAT, we only save dispatch latency but won't be able to reuse compute. There is also one more RMS_NORM+MUL here which is getting consumed by 2 MUL_MAT_ID(expert_gate, expert_up needs Q8_1), expert_input(F32), shared_expert_up and gate(Q8_1), and residual(F32) which is not getting fused here. or else I need to have separate Q8_1 tensor per layer per RMS. |
JohannesGaessler
left a comment
There was a problem hiding this comment.
Fusion should strictly be an operation that executes multiple whole ggml ops as one kernel, not one that executes only part of an ggml op like only the quantization of activations for MUL_MAT. The solution of taking a q8_1 view of an FP32 tensor in particular is not acceptable because it opens up the risk of accidentally overwriting data that is still supposed to be read by another kernel.
In order to fuse the quantization of activations with other ops, ggml_backend_cuda_graph_optimize should first expand MUL_MAT into one op that does the quantization of activations and one that does the actual matrix multiplication. The ggml graph allocator will then assign a dedicated buffer for the quantized activations that is safe to use (and also preferable over the current pool buffer since it is more VRAM efficient). To avoid adding complexity for the other backends, it may make sense to use GGML_OP_CUSTOM for operations that the CUDA backend is adding internally for itself.
|
I will change the map to thread_local which will prevent race conditions. Before this fusion, I am checking for the consumers of this tensor, if all the consumers are expecting the same Q8_1 data, then only I am fusing here, else not fusing at all. At MUL node, I am writing F32 or Q8_1 based on the fusion condition. in case of qwen-3.6-35B-A3B, per decode there are around 81 RMS_NORMS+MUL, out of which only 41 are getting fused because all 5 consumers are MUL_MAT and expect Q8_1. but rest 40 RMS_NORMS+MUL are left without fusion. so this change should be safe. |
I wasn't aware of this. A graphical visualization of what happens at the ggml_cgraph level would have helped in the initial PR description, but thanks for laying it out like this.
I agree with Johannes proposal as the most sensical solution to this.
As per johannes comment above, this should happen in
I think it makes sense to do this in a stacked PR to isolate ggml changes from cuda-backend-changes while giving an example for their value. I personally think having a quantize/dequantize op is sensible, but feel we should co-design this with expanding support for "derived"/non-exclusively-block-scaled-quant recipes. Unfortunately, I don't have the bandwidth to spend focus time on this at the moment to come up with a proposal to share on my side. |
|
As @ORippler said, the graph optimization happens before it is handed over to the graph allocator so from your side you should not need to worry about allocating |
|
Okay upon further reading the way to quantize/dequantize in ggml is via I guess we should first align/confirm that a backend is allowed to change the number of nodes in the call to
|
|
any node(custom_op or ggml_cast) addition in graph_optimize would need the full_graph * (currently only have a view of graph) to make changes in the graph.
Is this flow acceptable? |
|
I don't think such an intrusive change is warranted for a few % of performance. As I said earlier this fusion might not be needed after we fusion QKV and gate and up. Even if not then we should make |
|
I think it would be enough to just overallocate the
I agree that factoring out activation quantization into a separate ggml op would not be worthwhile in terms of opportunity cost just to get a few more % performance. However, there is another reason to do this: it would minimize buffer pool allocations. As it is we frequently need to allocate memory for type conversions that is not part of the memory allocated by the graph allocator. This is undesirable because that memory cannot be recycled for other nodes in the graph and it cannot be reliably estimated for |
|
@ORippler @JohannesGaessler with PDL I see around 15% improvement. With my changes (changing the tensor type in graph_optimize, fusing 41 out of 81 RMS norms) on top of PDL gives 2-3% perf improvement. Command: .\llama-bench.exe --model "D:\models\Qwen\GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" -n 32 -ngl 500 -r 5 -fa 1 -d 0,100,512,4096 -p 0
only-PDL:
Latest without any change:
|
…atch mul->type=Q8_1 in graph_optimize so gallocr allocates Q8_1 memory when all consumers are MMVQ-eligible; fuse norm+mul+quantize into single block_reduce kernel; MMVQ consumers skip re-quantization when src1 is already Q8_1; add test_rms_norm_mul_q8_1_mul_mat
382ceb3 to
bebea65
Compare
…d to 64 registers for 1024 blocks failing test case on SM75. __launch_bounds__will take care of that
|
Not sure why you're still proceeding the approach that is not acceptable (changing the |
|
@am17an I am changing the mul node in graph_optimize now which should be acceptable as per the above discussion. gallocr will allocate either q8_1(if fused) or FP32(not fused). I am not reusing the F32 buffer for storing Q8_1 which I did earlier. As @JohannesGaessler mentioned it will also reduce the buffer pool allocation for quantization. |
|
Okay I'm a bit unclear on the discussion then. @JohannesGaessler is this what you recommended? |
|
@am17an with this [QKV-Fusion PR], the exported Qwen-3.6-35B-A3B model is not working with the corresponding llama build. model without QKV fusions works well. |
|
@lnigam @am17an The root cause here is that Qwen-3.6-35B-A3B is a hybrid architecture. The non-standard Gated Delta Net layers share the attn_qkv namespace with the full-attention layers. So the converter ends up fusing tensors that the loader's fused path can't interpret correctly. I will fix this. |
JohannesGaessler
left a comment
There was a problem hiding this comment.
I think the approach of changing the type of the tensor in the graph optimization step is potentially workable. However, the way this is implemented is definitely not correct for all edge cases. One problem is for example if fusion is disabled via environment variable.
The fusion logic in this PR is currently being complicated by legacy code surrounding ggml_cuda_op_mul_mat. I would suggest we postpone this PR until --split-mode row has been removed, after that ggml_cuda_op_mul_mat is no longer needed and I think it will be much easier to reason about the correctness of fusing the quantization kernel with other tensors.
| if (cand->src[0] != mul && cand->src[1] != mul) { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Just iterate over GGML_MAX_SRC, I don't think it makes sense to risk a potential but in the future here.
| const bool is_mmvq_op = cand->op == GGML_OP_MUL_MAT || cand->op == GGML_OP_MUL_MAT_ID; | ||
| const bool src0_quantized = cand->src[0] && ggml_is_quantized(cand->src[0]->type); | ||
| const int64_t batch = (cand->op == GGML_OP_MUL_MAT_ID) ? cand->ne[2] : cand->ne[1]; | ||
| if (!is_mmvq_op || !src0_quantized || batch > MMVQ_MAX_BATCH_SIZE) { | ||
| all_mmvq = false; | ||
| } |
There was a problem hiding this comment.
This is the wrong logic, it has to exactly mirror the kernel selection logic in ggml_cuda_mul_mat.
|
Since I wrote most of fusion code and intend to maintain it, I don't like how this PR is done. It introduces a new concept for very little gain. If the intention is to reduce pool allocations, then it should be done separately. |
|
Okay, that is I think understandable. I think we should in any case remove the legacy code first since that will reduce the maintenance burden either way. |
Overview
Fuse rms_norm+ mul+ qunatize_q8_1
Additional information
Tested on Qwen-3.6-35B-A3B-Q4_KM model, out of 81 RMS norms 41 are fused and 40 remained unfused due to incompatible MoE gate_input which requires F32 input only.
On 5090, it gives around 3-5% perf boost:
command: ./llama-bench -m ~/trust/Qwen3.6-35B-A3B-Q4_K_M.gguf -n 32 -ngl 500 -r 5 -fa 1 -d 0,100,512,4096 -p 0
Without change:
With-change:
Requirements