Prevent the sum of the dequantized activation in q8_1 from overflowing#21652
Prevent the sum of the dequantized activation in q8_1 from overflowing#21652bartowski1182 wants to merge 7 commits into
Conversation
|
Can you dump the BF16 values of the problematic tensor? I also noticed some irregularities in this specific model in #20668 (comment) To me it looks like the model data is not sound, so I don't think patching the code is warranted. |
|
@ggerganov Yeah sure, and it's the same tensor that you noted in that eval bug. I added the debugging code back so you can see this: For the BF16 weights, ran a similar command to what you ran in the linked report: ./build/bin/llama-debug -m Mistral-Small-4-bf16.gguf -p "[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST]" -n 1 --tensor-filter "ffn_moe_weighted-32And with Q4_0 (with ffn_down set to Q4_1) in case it's relevant: ./build/bin/llama-debug -m Mistral-Small-4-Q4_0.gguf -p "[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST]" -n 1 --tensor-filter "ffn_moe_weighted-32(this was without my changes so it asserted) If the model data is not sound, not sure where to go from here, though this clamping does make it run and doesn't affect any sound model.. But I totally understand not wanting to put arbitrary code that masks bugs with the model itself, so more than happy to hear your personal judgement |
|
@ggerganov just curious if I should close this and we call Mistral broken or if I should continue investigating |
|
Any sutch clamping should probubly be performed at convert-time |
|
@IMbackK this overflow happens during activation calculations, so it can't be done at convert-time |
|
right, yeah. I dont see a good solution then. |
|
One fix that could be done is to scale down the FP32 activations prior to the matrix multiplication and to then scale up the FP32 results afterwards again. You would lose some information on activations with very small absolute values that may now get flushed to zero but you would become more robust against overflow. |
|
IMO it's not worth patching this without understanding better what exactly causes one of the activations to explode. Maybe something goes wrong in the normalization logic: Lines 1398 to 1412 in 6a6780a |
|
well you won't like this development... MiniMax M2.7 is showing a similar issue but this time with However this time it only happens with CUDA, CPU gets no such NaNs Compiling with unfortunately this is debugging done with Claude, would need someone more well versed with CUDA to suggest a proper fix (though Claude suggested swapping DS4 for D2S6, however I have no clue what the implications are), the clamping is done purely as a show of where the issue seems to be :') but it does lead me to believe we have some strange numerical issues on our hands and it's more widespread than initially thought CUDA Version 12.2.2 btw Edit: nevermind, the compile flag only delays the nan values, they eventually happen.. |
|
@bartowski1182 recently |
|
I'll give that a shot Minimal repro:
If you have an imatrix, you can make a small version like this:
Can use mine from here: https://huggingface.co/bartowski/MiniMaxAI_MiniMax-M2.7-GGUF/blob/main/MiniMaxAI_MiniMax-M2.7-imatrix.gguf Should I consider opening a new issue for this or does this discussion feel related? |
|
I didn't realize that would involve download 500GB of weights, I'm guessing you don't have a smaller model to work with |
|
Oh right, meant to link at the end @ubergarm made a copy of that exact setup here: |
If https://docs.nvidia.com/cuda/parallel-thread-execution/#scalar-conversions |
|
If we need more dynamic value ranges, it makes sense to go towards BF16 datatypes (we have been confronted with numerical stability issues related to F16 multiple times already) imo |
|
Wanted to provide an update on investigation Here's what I know: With mainline, running pure With the clamp fix proposed above, those two quants are fixed. HOWEVER, the recipe used for (my at least, can check mainline) To fix that, an additional clamp on this was discovered when dumping the op that produced the issue: Something about Q3_K for earlier tensors causes CPU is immune to all of this because it uses With the two clamps to I think we should introduce these clamps since the model's PPL seems to be acceptable with them and investigate other solutions like switching to F32 scales and/or sums in the future, but for now I think it best to get a working solution into mainline so we can at least alleviate these current standing issues with existing models Other things I tried while experimenting: Bumping Using
I've pushed the latest fixes for both Mistral and Minimax to this branch so they can be seen together, like I said I think it best to merge these fixes for now, if a further investigation is requested I can continue digging (if provided some direction) but this is a short-term easy fix that doesn't break anything existing |
|
@ggerganov I will continue investigating to see if I can find a root cause, but I think we should consider merging this in the meantime to fix any models that exist with this issue (mistral and minimax) |
|
Does Minimax need just the CUDA clamp in |
|
Correct yes |
|
Clamping the weights is not a good idea. The basic assumption for using For the activations - I think we have to prepare a separate fix where we switch to a wider-range for the sum in |
|
I agree 100%, but clamping weights is a quick easy fix in the interim, unless you think the swap to f32/bf16 will be quick and painless The "depending on which is feasible and performant" is the only part that concerns me for getting the swap in quickly :) Personally I'd rather merge this, investigate the better options, then implement the real fix and revert this |
|
Yesterday night I did an experiment where I had gpt-5.5 integrate a patch for Talkie 13B support (https://github.com/solwyc/talkie-1930-13b-it-q5) into llama.cpp, perform quantizations and test if the full CUDA offload works. Initially the workaround was to keep the offending layer in Q8_0, but deep research pointed to this PR so I had it try that. The clamp approach from this PR removed the NaN values but changed the first token from Happy to share more details/notes/pi session if that is in any way helpful. Relevant technical details Codex really wants to mention:
|
|
Just to point out this Fp16 overflow has also been bugging several SD.cpp models using the ds4 layout. See leejet/stable-diffusion.cpp#851 (comment). The SD author has to scale down the activation to get around the problem. #22571 is a nice and clean fix. |
|
@mrexodia I also have an implementation of the Talkie architecture. You can find it on my fork (GGUF). I've had no problems running the model on CUDA, and the logits seem to match the original PyTorch model (or almost match, some slight difference when using flash-attn). I'm planning on opening a PR once I have everything properly reviewed. Hope that helps! |
|
Yeah I indeed used your patch @thomasgauthier, thanks for publishing it! The issues didn't show up for all quantizations and I used https://huggingface.co/lewtun/talkie-1930-13b-it-hf as a base instead of their custom checkpoints. Not exactly sure why it happened, but there is definitely a real issue in the CUDA implementation (since everything works fine when running on the CPU). |
|
@mrexodia yeah ok I have only tested Q8_0, I'll try the other quants and investigate the issue. Thanks for flagging this. |
|
FYI @thomasgauthier the workaround is to specify |

Overview
During Mistral 4 small quantization and subsequent testing, I found that the PPL of
Q4_1ended up withNaNWhen testing the reason, it only happened when later
FFN_DOWNlayers were quantized toQ4_1, IE:Works as expected, but:
(note the --tensor-type ffn_down=q4_1) gets
NaNwith PPLAfter digging around with Claude and debug code, found that 16
Q8_1blocks haves = Infbecause the fp16 value is overflowingIn Claude's words:
Additional information
I ran the same model with the updated activation code and yielded a PPL of
5.5535 +/- 0.1235For completeness, also tested with ignoring the pre-computed
svalue and recalculating the results asf32, and got a PPL of5.5725 +/- 0.12469Note that in either case, the PPL without this change was
NaN, so while this clamping is lossy, it does result in a model that produces literally anything at all instead of failing spectacularlyNote that this only updates the reference, AVX2, AVX1, and CUDA implementations, not familiar enough with the other archs to touch those
Mistral 4 small PPL before these changes
Mistral 4 small PPL after these changes
Also tested on a
Q4_1quant of Qwen 3.5 9B and got identical PPL results both with and without this changeQwen 3.5 9B before these changes
Qwen 3.5 9B before these changes
Requirements