
Prevent the sum of the dequantized activation in q8_1 from overflowing #21652

Open
bartowski1182 wants to merge 7 commits into ggml-org:master from bartowski1182:mistral4-q4_0


@bartowski1182 bartowski1182 commented Apr 9, 2026

Contributor

Overview

While quantizing Mistral 4 small and testing the results, I found that the PPL of the Q4_1 quant came out as NaN.

When I narrowed down the cause, it only happened when the later FFN_DOWN layers were quantized to Q4_1, i.e.:

llama-quantize ./Mistral-Small-4-119B-2603-bf16.gguf Mistral-Small-4-119B-2603-Q4_0.gguf Q4_0

Works as expected, but:

llama-quantize --tensor-type ffn_down=q4_1 ./Mistral-Small-4-119B-2603-bf16.gguf Mistral-Small-4-119B-2603-Q4_0.gguf Q4_0

(note the --tensor-type ffn_down=q4_1) produces NaN PPL.

After digging around with Claude and some debug code, I found that 16 Q8_1 blocks have s = Inf because the fp16 value overflows.

In Claude's words:

Q8_1's s field stores sum * d in fp16 (max 65504), but when activation values in a 32-element block are large and same-sign, sum * d ≈ 32 * amax can exceed 65504. The max finite |s| is only 410, so the 16 overflowing blocks are massive outliers — their activations must be ~160x larger than typical.
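
To make the arithmetic above concrete, here is a minimal sketch of how a Q8_1-style block of 32 activations produces its `s` value. It is not the ggml implementation: the names `q8_1_block_s`, `exceeds_fp16_max` and `FP16_MAX_FINITE` are made up for illustration, `s` is kept in float instead of being converted to fp16, and the layout is assumed to be d = amax / 127 with s = d * sum(q).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

// Largest finite fp16 value, as quoted in the description above.
constexpr float FP16_MAX_FINITE = 65504.0f;
// Block size, assumed from the "32-element block" wording.
constexpr int QK8_1 = 32;

// Hypothetical helper: quantize one block to int8 and return s = d * sum(q)
// in float, i.e. the value that would then have to be stored in fp16.
inline float q8_1_block_s(const std::array<float, QK8_1> & x) {
    float amax = 0.0f;
    for (float v : x) {
        amax = std::max(amax, std::fabs(v));
    }
    assert(std::isfinite(amax)); // the sketch expects finite inputs

    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    int sum = 0;
    for (float v : x) {
        sum += static_cast<int>(std::lround(v * id));
    }
    return d * static_cast<float>(sum);
}

// True if s is larger in magnitude than the largest finite fp16 value.
inline bool exceeds_fp16_max(float s) {
    return std::fabs(s) > FP16_MAX_FINITE;
}
```

For a block of 32 values that are all 3000, this gives an s close to 32 * 3000 = 96000, which is past the fp16 limit. A block with the same magnitudes but alternating signs sums to zero and is unaffected, which matches the "large and same-sign" condition.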

Additional information

I ran the same model with the updated activation code and got a PPL of 5.5535 +/- 0.1235.

For completeness, I also tested ignoring the pre-computed s value and recalculating the sum in f32, which gave a PPL of 5.5725 +/- 0.12469.

Note that in either case, the PPL without this change was NaN, so while this clamping is lossy, it does result in a model that produces literally anything at all instead of failing spectacularly.

Note that this only updates the reference, AVX2, AVX1, and CUDA implementations; I'm not familiar enough with the other archs to touch those.
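
The two variants compared above (clamping `s` versus recomputing it in f32) can be sketched for a single block as follows. This is an illustration, not the code in this PR: it assumes the usual Q4_1 x Q8_1 block contribution of d4 * d8 * sumi + m4 * s8, models the fp16 store of `s` with a hypothetical `store_s_fp16` helper that returns infinity past the fp16 range, and every name in it is invented here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

constexpr float FP16_MAX_FINITE = 65504.0f;

// Hypothetical stand-in for the float -> fp16 -> float round trip of s:
// anything past the fp16 range comes back as +/-Inf (rounding is ignored).
inline float store_s_fp16(float s) {
    if (std::fabs(s) > FP16_MAX_FINITE) {
        return std::copysign(std::numeric_limits<float>::infinity(), s);
    }
    return s;
}

// Variant 1: clamp before the store so that s stays finite (lossy).
inline float store_s_fp16_clamped(float s) {
    return store_s_fp16(std::fmin(std::fmax(s, -FP16_MAX_FINITE), FP16_MAX_FINITE));
}

// Variant 2: ignore the stored s and rebuild it in f32 from the int8 quants.
inline float recompute_s_f32(float d8, const std::int8_t * q, int n) {
    assert(q != nullptr && n >= 0);
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += q[i];
    }
    return d8 * static_cast<float>(sum);
}

// Assumed per-block contribution of a Q4_1 x Q8_1 dot product:
// d4/m4 are the Q4_1 scale and minimum, d8/s8 the Q8_1 scale and sum,
// sumi is the integer dot product of the quants.
inline float block_dot(float d4, float m4, float d8, float s8, int sumi) {
    return d4 * d8 * static_cast<float>(sumi) + m4 * s8;
}
```

Under these assumptions, a +Inf contribution from one block and a -Inf contribution from another add up to NaN, as does an Inf `s8` multiplied by a zero `m4`. That would be enough to produce a NaN PPL, although the sketch does not show which of these paths the real kernels take.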

Mistral 4 small PPL before these changes
[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan,[38]nan,[39]nan,[40]nan,[41]nan,[42]nan,[43]nan,[44]nan,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan,[58]nan,[59]nan,[60]nan,[61]nan,[62]nan,[63]nan,[64]nan,[65]nan,[66]nan,[67]nan,[68]nan,[69]nan,[70]nan,[71]nan,[72]nan,[73]nan,[74]nan,[75]nan,[76]nan,[77]nan,[78]nan,[79]nan,[80]nan,[81]nan,[82]nan,[83]nan,[84]nan,[85]nan,[86]nan,[87]nan,[88]nan,[89]nan,[90]nan,[91]nan,[92]nan,[93]nan,[94]nan,[95]nan,[96]nan,[97]nan,[98]nan,[99]nan,[100]nan,
Unexpected negative standard deviation of log(prob)
Mistral 4 small PPL after these changes
[1]3.4955,[2]5.1043,[3]4.3632,[4]4.0977,[5]4.2305,[6]4.4037,[7]4.5087,[8]4.5073,[9]4.4639,[10]4.5297,[11]4.5263,[12]4.5587,[13]4.7890,[14]4.8800,[15]4.9211,[16]5.1019,[17]4.9447,[18]5.0835,[19]5.3203,[20]5.2572,[21]5.2755,[22]5.2618,[23]5.2392,[24]5.0943,[25]4.9508,[26]4.8844,[27]4.7809,[28]4.7516,[29]4.6887,[30]4.6612,[31]4.7435,[32]4.7837,[33]4.9112,[34]4.9299,[35]4.9606,[36]5.0317,[37]5.1810,[38]5.2880,[39]5.2647,[40]5.3024,[41]5.3430,[42]5.3550,[43]5.3820,[44]5.4165,[45]5.3988,[46]5.3975,[47]5.4033,[48]5.4926,[49]5.5758,[50]5.5696,[51]5.5633,[52]5.5685,[53]5.5910,[54]5.6138,[55]5.6804,[56]5.6624,[57]5.7382,[58]5.7408,[59]5.7636,[60]5.8286,[61]5.8487,[62]5.8500,[63]5.8483,[64]5.8840,[65]5.9183,[66]5.9979,[67]6.0463,[68]6.0607,[69]6.0864,[70]6.0978,[71]6.1055,[72]6.0793,[73]6.1318,[74]6.1260,[75]6.1347,[76]6.1301,[77]6.1563,[78]6.1131,[79]6.1374,[80]6.0724,[81]6.0041,[82]5.9799,[83]5.9689,[84]5.9874,[85]5.9820,[86]5.9677,[87]5.9715,[88]6.0430,[89]6.0806,[90]6.0899,[91]6.0997,[92]6.0917,[93]6.1328,[94]6.1264,[95]6.1512,[96]6.1638,[97]6.1756,[98]6.1676,[99]6.1591,[100]6.1809,
Final estimate: PPL = 6.1809 +/- 0.09843

Also tested on a Q4_1 quant of Qwen 3.5 9B and got identical PPL results both with and without this change

Qwen 3.5 9B before these changes
[1]5.4693,[2]7.8183,[3]7.9967,[4]7.6863,[5]7.6045,[6]7.8830,[7]8.1620,[8]8.6953,[9]9.0948,[10]9.4159,[11]9.2208,[12]9.2591,[13]9.7531,[14]9.2597,[15]9.1784,[16]9.2925,[17]8.7051,[18]8.7208,[19]8.6739,[20]8.6143,[21]8.3104,[22]8.2161,[23]7.9049,[24]7.5473,[25]7.4064,[26]7.2133,[27]7.0963,[28]7.0035,[29]6.9969,[30]6.9612,[31]6.9099,[32]6.9075,[33]6.8637,[34]6.9363,[35]7.0285,[36]7.1741,[37]7.2542,[38]7.2405,[39]7.2368,[40]7.2920,[41]7.3035,[42]7.3447,[43]7.3416,[44]7.3447,[45]7.4416,[46]7.4029,[47]7.5285,[48]7.5930,[49]7.5287,[50]7.5751,[51]7.5716,[52]7.6133,[53]7.6466,[54]7.6818,[55]7.6809,[56]7.6989,[57]7.7229,[58]7.7238,[59]7.7321,[60]7.7508,[61]7.7775,[62]7.8264,[63]7.8687,[64]7.9271,[65]7.9943,[66]8.0362,[67]8.1292,[68]8.1672,[69]8.1757,[70]8.1486,[71]8.2084,[72]8.2052,[73]8.2490,[74]8.2449,[75]8.2189,[76]8.2017,[77]8.2362,[78]8.2535,[79]8.1724,[80]8.1116,[81]8.0884,[82]8.1005,[83]8.1097,[84]8.1072,[85]8.1208,[86]8.1595,[87]8.1614,[88]8.1653,[89]8.1234,[90]8.0978,[91]8.0926,[92]8.0734,[93]8.0991,[94]8.1069,[95]8.1173,[96]8.1096,[97]8.0955,[98]8.0777,[99]8.0775,[100]8.0963,
Final estimate: PPL = 8.0963 +/- 0.12933
Qwen 3.5 9B after these changes
[1]5.4693,[2]7.8183,[3]7.9967,[4]7.6863,[5]7.6045,[6]7.8830,[7]8.1620,[8]8.6953,[9]9.0948,[10]9.4159,[11]9.2208,[12]9.2591,[13]9.7531,[14]9.2597,[15]9.1784,[16]9.2925,[17]8.7051,[18]8.7208,[19]8.6739,[20]8.6143,[21]8.3104,[22]8.2161,[23]7.9049,[24]7.5473,[25]7.4064,[26]7.2133,[27]7.0963,[28]7.0035,[29]6.9969,[30]6.9612,[31]6.9099,[32]6.9075,[33]6.8637,[34]6.9363,[35]7.0285,[36]7.1741,[37]7.2542,[38]7.2405,[39]7.2368,[40]7.2920,[41]7.3035,[42]7.3447,[43]7.3416,[44]7.3447,[45]7.4416,[46]7.4029,[47]7.5285,[48]7.5930,[49]7.5287,[50]7.5751,[51]7.5716,[52]7.6133,[53]7.6466,[54]7.6818,[55]7.6809,[56]7.6989,[57]7.7229,[58]7.7238,[59]7.7321,[60]7.7508,[61]7.7775,[62]7.8264,[63]7.8687,[64]7.9271,[65]7.9943,[66]8.0362,[67]8.1292,[68]8.1672,[69]8.1757,[70]8.1486,[71]8.2084,[72]8.2052,[73]8.2490,[74]8.2449,[75]8.2189,[76]8.2017,[77]8.2362,[78]8.2535,[79]8.1724,[80]8.1116,[81]8.0884,[82]8.1005,[83]8.1097,[84]8.1072,[85]8.1208,[86]8.1595,[87]8.1614,[88]8.1653,[89]8.1234,[90]8.0978,[91]8.0926,[92]8.0734,[93]8.0991,[94]8.1069,[95]8.1173,[96]8.1096,[97]8.0955,[98]8.0777,[99]8.0775,[100]8.0963,
Final estimate: PPL = 8.0963 +/- 0.12933

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, Claude was used extensively for discovering the issue through trial/error and debugging code

@bartowski1182 bartowski1182 marked this pull request as ready for review April 9, 2026 02:35
@bartowski1182 bartowski1182 requested review from a team and ggerganov as code owners April 9, 2026 02:35
@ggerganov

Member

Can you dump the BF16 values of the problematic tensor?

I also noticed some irregularities in this specific model in #20668 (comment)

To me it looks like the model data is not sound, so I don't think patching the code is warranted.

@bartowski1182

bartowski1182 commented Apr 9, 2026

Contributor Author

@ggerganov Yeah sure, and it's the same tensor that you noted in that eval bug.

I added the debugging code back so you can see this:

Q8_1 FP16 OVERFLOW: dst=ffn_moe_down-32 src0=blk.32.ffn_down_exps.weight src1=ffn_moe_swiglu-32 total_blocks=131072 s_inf=16 s_nan=0 max_finite_|s|=410

For the BF16 weights, ran a similar command to what you ran in the linked report:

./build/bin/llama-debug -m Mistral-Small-4-bf16.gguf -p "[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST]" -n 1 --tensor-filter "ffn_moe_weighted-32"
common_debug_cb_eval:      ffn_moe_weighted-32 = (f32)        MUL(ffn_moe_down-32{4096, 128, 2, 1}, ffn_moe_weights_norm-32 (reshaped){1, 128, 2, 1}}) = {4096, 128, 2, 1}
    [
        [
            [     -0.0511,      -0.0220,       0.0349,    ...,      -0.0577,      -0.2088,      -0.0310  ],
            [      0.0186,       0.0214,       0.0072,    ...,       0.0049,      -0.0257,      -0.0022  ],
            [     -0.0089,      -0.0244,      -0.0031,    ...,       0.0025,       0.0092,      -0.0025  ],
            ..., 
            [     -0.0008,       0.0003,       0.0007,    ...,       0.0005,       0.0009,       0.0015  ],
            [      0.0001,       0.0002,      -0.0002,    ...,      -0.0004,       0.0017,      -0.0007  ],
            [     -0.0001,      -0.0002,       0.0000,    ...,       0.0001,       0.0006,      -0.0001  ],
        ],
        [
            [      0.0032,      -0.0039,       0.0005,    ...,      -0.0004,       0.0037,      -0.0035  ],
            [      0.0004,       0.0005,       0.0008,    ...,      -0.0012,      -0.0022,       0.0004  ],
            [      0.0009,      -0.0013,      -0.0016,    ...,      -0.0004,      -0.0001,       0.0038  ],
            ..., 
            [      0.0004,       0.0002,      -0.0001,    ...,       0.0002,      -0.0016,      -0.0001  ],
            [      0.0001,       0.0006,      -0.0000,    ...,       0.0001,       0.0000,      -0.0003  ],
            [      0.0004,      -0.0001,      -0.0000,    ...,       0.0000,       0.0001,       0.0005  ],
        ],
    ]
    sum = -21.966097
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 128, 2, 1}, }) = {4096, 2, 1, 1}
    [
        [
            [     -0.0511,      -0.0220,       0.0349,    ...,      -0.0577,      -0.2088,      -0.0310  ],
            [      0.0032,      -0.0039,       0.0005,    ...,      -0.0004,       0.0037,      -0.0035  ],
        ],
    ]
    sum = -25.767612
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 128, 2, 1}, }) = {4096, 2, 1, 1}
    [
        [
            [      0.0186,       0.0214,       0.0072,    ...,       0.0049,      -0.0257,      -0.0022  ],
            [      0.0004,       0.0005,       0.0008,    ...,      -0.0012,      -0.0022,       0.0004  ],
        ],
    ]
    sum = 0.872716
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 128, 2, 1}, }) = {4096, 2, 1, 1}
    [
        [
            [     -0.0089,      -0.0244,      -0.0031,    ...,       0.0025,       0.0092,      -0.0025  ],
            [      0.0009,      -0.0013,      -0.0016,    ...,      -0.0004,      -0.0001,       0.0038  ],
        ],
    ]
    sum = 1.773644
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 128, 2, 1}, }) = {4096, 2, 1, 1}
    [
        [
            [     -0.0008,      -0.0067,       0.0098,    ...,      -0.0082,       0.0038,       0.0027  ],
            [      0.0002,      -0.0025,      -0.0003,    ...,      -0.0008,      -0.0006,      -0.0033  ],
        ],
    ]
    sum = 0.260006

system_info: n_threads = 96 (n_threads_batch = 96) / 192 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

common_debug_cb_eval:      ffn_moe_weighted-32 = (f32)        MUL(ffn_moe_down-32{4096, 4, 30, 1}, ffn_moe_weights_norm-32 (reshaped){1, 4, 30, 1}}) = {4096, 4, 30, 1}
    [
        [
            [  -2526.8816,    -468.2716,    5885.8682,    ...,    -205.6809,    1444.4355,   -1961.9440  ],
            [     -0.0054,       0.0031,      -0.0012,    ...,       0.0069,       0.0003,      -0.0012  ],
            [     -0.0001,      -0.0018,       0.0027,    ...,       0.0008,       0.0006,      -0.0027  ],
            [     -0.0023,      -0.0003,      -0.0008,    ...,      -0.0013,      -0.0000,      -0.0046  ],
        ],
        [
            [      0.5057,       0.3572,       0.7750,    ...,      -0.1811,       0.2851,       0.1452  ],
            [      0.0019,       0.2234,      -0.1807,    ...,       0.1392,       0.2369,       0.1664  ],
            [     -0.1560,      -0.0971,       0.1389,    ...,      -0.0336,       0.0982,      -0.0383  ],
            [     -0.0777,      -0.2010,       0.1259,    ...,       0.0096,       0.0090,       0.0095  ],
        ],
        [
            [     -0.1498,       0.1264,      -0.3107,    ...,       0.4183,      -0.3672,      -0.2537  ],
            [     -0.2451,       0.1512,       0.0539,    ...,      -0.0710,       0.0478,      -0.0811  ],
            [     -0.0553,       0.0492,       0.0706,    ...,       0.0006,       0.0383,      -0.0575  ],
            [      0.0330,       0.0586,      -0.0472,    ...,       0.0649,      -0.0267,       0.0149  ],
        ],
        ..., 
        [
            [      0.3370,      -0.4041,      -1.5760,    ...,       0.2102,       0.1811,       0.6098  ],
            [     -0.0259,       0.0100,      -0.0058,    ...,      -0.0275,      -0.0011,      -0.0559  ],
            [     -0.0052,       0.0170,      -0.0072,    ...,      -0.0358,       0.0118,      -0.0224  ],
            [     -0.0130,       0.0112,       0.0027,    ...,       0.0276,      -0.0050,       0.0055  ],
        ],
        [
            [     -0.0828,      -0.3576,      -0.2163,    ...,      -0.4464,       0.1573,       0.7532  ],
            [     -0.0241,       0.0534,       0.0650,    ...,      -0.0410,      -0.0222,      -0.0236  ],
            [      0.0251,       0.0047,      -0.0601,    ...,      -0.0091,       0.0226,      -0.0236  ],
            [     -0.0489,       0.0651,       0.1208,    ...,       0.0190,       0.0436,      -0.0181  ],
        ],
        [
            [      0.0869,      -0.0826,      -0.1747,    ...,       0.0032,      -0.4236,       0.2473  ],
            [      0.0479,      -0.1143,       0.0173,    ...,      -0.2089,       0.3555,      -0.0883  ],
            [      0.0113,       0.0037,      -0.0545,    ...,      -0.0280,      -0.0623,       0.0449  ],
            [     -0.0140,       0.0095,       0.0075,    ...,      -0.0045,      -0.0471,       0.0529  ],
        ],
    ]
    sum = -577249.375000
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 4, 30, 1}, }) = {4096, 30, 1, 1}
    [
        [
            [  -2526.8816,    -468.2716,    5885.8682,    ...,    -205.6809,    1444.4355,   -1961.9440  ],
            [      0.5057,       0.3572,       0.7750,    ...,      -0.1811,       0.2851,       0.1452  ],
            [     -0.1498,       0.1264,      -0.3107,    ...,       0.4183,      -0.3672,      -0.2537  ],
            ..., 
            [      0.3370,      -0.4041,      -1.5760,    ...,       0.2102,       0.1811,       0.6098  ],
            [     -0.0828,      -0.3576,      -0.2163,    ...,      -0.4464,       0.1573,       0.7532  ],
            [      0.0869,      -0.0826,      -0.1747,    ...,       0.0032,      -0.4236,       0.2473  ],
        ],
    ]
    sum = -577287.187500
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 4, 30, 1}, }) = {4096, 30, 1, 1}
    [
        [
            [     -0.0054,       0.0031,      -0.0012,    ...,       0.0069,       0.0003,      -0.0012  ],
            [      0.0019,       0.2234,      -0.1807,    ...,       0.1392,       0.2369,       0.1664  ],
            [     -0.2451,       0.1512,       0.0539,    ...,      -0.0710,       0.0478,      -0.0811  ],
            ..., 
            [     -0.0259,       0.0100,      -0.0058,    ...,      -0.0275,      -0.0011,      -0.0559  ],
            [     -0.0241,       0.0534,       0.0650,    ...,      -0.0410,      -0.0222,      -0.0236  ],
            [      0.0479,      -0.1143,       0.0173,    ...,      -0.2089,       0.3555,      -0.0883  ],
        ],
    ]
    sum = 36.512115
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 4, 30, 1}, }) = {4096, 30, 1, 1}
    [
        [
            [     -0.0001,      -0.0018,       0.0027,    ...,       0.0008,       0.0006,      -0.0027  ],
            [     -0.1560,      -0.0971,       0.1389,    ...,      -0.0336,       0.0982,      -0.0383  ],
            [     -0.0553,       0.0492,       0.0706,    ...,       0.0006,       0.0383,      -0.0575  ],
            ..., 
            [     -0.0052,       0.0170,      -0.0072,    ...,      -0.0358,       0.0118,      -0.0224  ],
            [      0.0251,       0.0047,      -0.0601,    ...,      -0.0091,       0.0226,      -0.0236  ],
            [      0.0113,       0.0037,      -0.0545,    ...,      -0.0280,      -0.0623,       0.0449  ],
        ],
    ]
    sum = 8.844553
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 4, 30, 1}, }) = {4096, 30, 1, 1}
    [
        [
            [     -0.0023,      -0.0003,      -0.0008,    ...,      -0.0013,      -0.0000,      -0.0046  ],
            [     -0.0777,      -0.2010,       0.1259,    ...,       0.0096,       0.0090,       0.0095  ],
            [      0.0330,       0.0586,      -0.0472,    ...,       0.0649,      -0.0267,       0.0149  ],
            ..., 
            [     -0.0130,       0.0112,       0.0027,    ...,       0.0276,      -0.0050,       0.0055  ],
            [     -0.0489,       0.0651,       0.1208,    ...,       0.0190,       0.0436,      -0.0181  ],
            [     -0.0140,       0.0095,       0.0075,    ...,      -0.0045,      -0.0471,       0.0529  ],
        ],
    ]
    sum = -14.182524
Model add_bos: true
Input prompt: "[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST]"
Token ids (30):
<s>(1) [(1091) SY(101289) STEM(58343) _PRO(25396) MP(7690) T(1084) ](1093)  You(3213)  are(1584)  a(1261)  helpful(20351)  assistant(27089) [(1091) /(1047) SY(101289) STEM(58343) _PRO(25396) MP(7690) T(1084) ][(5371) IN(3174) ST(3074) ](1093)  Hello(45383) [(1091) /(1047) IN(3174) ST(3074) ](1093) 

And with Q4_0 (with ffn_down set to Q4_1) in case it's relevant:

./build/bin/llama-debug -m Mistral-Small-4-Q4_0.gguf -p "[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST]" -n 1 --tensor-filter "ffn_moe_weighted-32"
common_debug_cb_eval:      ffn_moe_weighted-32 = (f32)        MUL(ffn_moe_down-32{4096, 128, 2, 1}, ffn_moe_weights_norm-32 (reshaped){1, 128, 2, 1}}) = {4096, 128, 2, 1}
    [
        [
            [     -0.0377,      -0.0534,       0.0481,    ...,      -0.0858,      -0.2257,       0.0617  ],
            [      0.0159,       0.0224,       0.0115,    ...,       0.0055,      -0.0229,      -0.0037  ],
            [     -0.0093,      -0.0266,      -0.0029,    ...,       0.0008,       0.0078,       0.0002  ],
            ..., 
            [      0.0001,       0.0001,      -0.0001,    ...,      -0.0003,       0.0014,      -0.0006  ],
            [     -0.0007,       0.0004,       0.0007,    ...,       0.0005,       0.0010,       0.0014  ],
            [      0.0001,      -0.0003,       0.0001,    ...,       0.0001,       0.0006,      -0.0001  ],
        ],
        [
            [      0.0030,      -0.0027,       0.0018,    ...,      -0.0004,       0.0024,      -0.0029  ],
            [      0.0002,       0.0004,       0.0005,    ...,      -0.0010,      -0.0020,       0.0001  ],
            [      0.0008,      -0.0008,      -0.0016,    ...,      -0.0013,      -0.0007,       0.0037  ],
            ..., 
            [      0.0005,       0.0003,      -0.0000,    ...,       0.0002,      -0.0017,      -0.0001  ],
            [      0.0001,       0.0008,       0.0000,    ...,       0.0002,      -0.0003,      -0.0005  ],
            [      0.0005,      -0.0002,       0.0000,    ...,      -0.0001,      -0.0000,       0.0003  ],
        ],
    ]
    sum = -21.766863
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 128, 2, 1}, }) = {4096, 2, 1, 1}
    [
        [
            [     -0.0377,      -0.0534,       0.0481,    ...,      -0.0858,      -0.2257,       0.0617  ],
            [      0.0030,      -0.0027,       0.0018,    ...,      -0.0004,       0.0024,      -0.0029  ],
        ],
    ]
    sum = -25.451143
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 128, 2, 1}, }) = {4096, 2, 1, 1}
    [
        [
            [      0.0159,       0.0224,       0.0115,    ...,       0.0055,      -0.0229,      -0.0037  ],
            [      0.0002,       0.0004,       0.0005,    ...,      -0.0010,      -0.0020,       0.0001  ],
        ],
    ]
    sum = 0.668126
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 128, 2, 1}, }) = {4096, 2, 1, 1}
    [
        [
            [     -0.0093,      -0.0266,      -0.0029,    ...,       0.0008,       0.0078,       0.0002  ],
            [      0.0008,      -0.0008,      -0.0016,    ...,      -0.0013,      -0.0007,       0.0037  ],
        ],
    ]
    sum = 1.542167
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 128, 2, 1}, }) = {4096, 2, 1, 1}
    [
        [
            [     -0.0005,      -0.0085,       0.0111,    ...,      -0.0100,       0.0063,       0.0026  ],
            [      0.0006,      -0.0000,       0.0004,    ...,      -0.0011,       0.0013,      -0.0015  ],
        ],
    ]
    sum = 0.344204

system_info: n_threads = 96 (n_threads_batch = 96) / 192 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

common_debug_cb_eval:      ffn_moe_weighted-32 = (f32)        MUL(ffn_moe_down-32{4096, 4, 30, 1}, ffn_moe_weights_norm-32 (reshaped){1, 4, 30, 1}}) = {4096, 4, 30, 1}
    [
        [
            [        -nan,         -nan,         -nan,    ...,         -nan,         -nan,         -nan  ],
            [     -0.0046,       0.0024,      -0.0007,    ...,       0.0076,      -0.0001,      -0.0018  ],
            [     -0.0003,      -0.0017,       0.0028,    ...,       0.0011,       0.0001,      -0.0032  ],
            [     -0.0024,      -0.0001,      -0.0010,    ...,      -0.0012,      -0.0003,      -0.0046  ],
        ],
        [
            [      0.4818,       0.3821,       0.7114,    ...,      -0.1320,       0.2678,       0.1817  ],
            [     -0.0309,       0.1859,      -0.1440,    ...,       0.1226,       0.2021,       0.1455  ],
            [     -0.1730,      -0.1076,       0.1479,    ...,      -0.0188,       0.1073,      -0.0484  ],
            [     -0.0807,      -0.1962,       0.1267,    ...,      -0.0168,       0.0028,       0.0135  ],
        ],
        [
            [     -0.1117,       0.1570,      -0.2545,    ...,       0.3830,      -0.3810,      -0.1876  ],
            [     -0.2143,       0.1105,       0.0423,    ...,      -0.0512,       0.0245,      -0.0892  ],
            [     -0.0464,       0.0308,       0.0665,    ...,      -0.0129,       0.0517,      -0.0598  ],
            [      0.0488,       0.0612,      -0.0372,    ...,       0.0743,      -0.0307,       0.0143  ],
        ],
        ..., 
        [
            [     -0.0786,      -0.2572,      -0.2437,    ...,      -0.1087,       0.0904,       0.1486  ],
            [      0.0056,       0.0713,       0.0265,    ...,      -0.0325,      -0.0137,      -0.1162  ],
            [      0.0425,       0.1282,       0.0186,    ...,       0.0156,      -0.0127,       0.0121  ],
            [      0.0365,      -0.0189,      -0.0376,    ...,       0.0461,       0.0720,       0.0437  ],
        ],
        [
            [      0.1010,      -0.5950,      -0.6300,    ...,      -0.5743,       0.3738,       0.5940  ],
            [     -0.0817,       0.0937,       0.0383,    ...,       0.1123,      -0.0569,       0.0118  ],
            [      0.0167,      -0.0310,      -0.0787,    ...,       0.0418,      -0.0279,      -0.0389  ],
            [     -0.0318,      -0.0012,      -0.0139,    ...,      -0.0261,      -0.0425,      -0.0318  ],
        ],
        [
            [      0.0268,      -0.1215,       0.0420,    ...,       0.1526,       0.3784,      -0.2065  ],
            [      0.1985,      -0.0054,      -0.2593,    ...,      -0.0466,      -0.3999,       0.2104  ],
            [      0.0128,      -0.0004,       0.0352,    ...,       0.0763,      -0.0343,      -0.0307  ],
            [      0.0292,       0.0563,      -0.0295,    ...,      -0.0066,      -0.0192,       0.0690  ],
        ],
    ]
    sum = -nan
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 4, 30, 1}, }) = {4096, 30, 1, 1}
    [
        [
            [        -nan,         -nan,         -nan,    ...,         -nan,         -nan,         -nan  ],
            [      0.4818,       0.3821,       0.7114,    ...,      -0.1320,       0.2678,       0.1817  ],
            [     -0.1117,       0.1570,      -0.2545,    ...,       0.3830,      -0.3810,      -0.1876  ],
            ..., 
            [     -0.0786,      -0.2572,      -0.2437,    ...,      -0.1087,       0.0904,       0.1486  ],
            [      0.1010,      -0.5950,      -0.6300,    ...,      -0.5743,       0.3738,       0.5940  ],
            [      0.0268,      -0.1215,       0.0420,    ...,       0.1526,       0.3784,      -0.2065  ],
        ],
    ]
    sum = -nan
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 4, 30, 1}, }) = {4096, 30, 1, 1}
    [
        [
            [     -0.0046,       0.0024,      -0.0007,    ...,       0.0076,      -0.0001,      -0.0018  ],
            [     -0.0309,       0.1859,      -0.1440,    ...,       0.1226,       0.2021,       0.1455  ],
            [     -0.2143,       0.1105,       0.0423,    ...,      -0.0512,       0.0245,      -0.0892  ],
            ..., 
            [      0.0056,       0.0713,       0.0265,    ...,      -0.0325,      -0.0137,      -0.1162  ],
            [     -0.0817,       0.0937,       0.0383,    ...,       0.1123,      -0.0569,       0.0118  ],
            [      0.1985,      -0.0054,      -0.2593,    ...,      -0.0466,      -0.3999,       0.2104  ],
        ],
    ]
    sum = 26.722431
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 4, 30, 1}, }) = {4096, 30, 1, 1}
    [
        [
            [     -0.0003,      -0.0017,       0.0028,    ...,       0.0011,       0.0001,      -0.0032  ],
            [     -0.1730,      -0.1076,       0.1479,    ...,      -0.0188,       0.1073,      -0.0484  ],
            [     -0.0464,       0.0308,       0.0665,    ...,      -0.0129,       0.0517,      -0.0598  ],
            ..., 
            [      0.0425,       0.1282,       0.0186,    ...,       0.0156,      -0.0127,       0.0121  ],
            [      0.0167,      -0.0310,      -0.0787,    ...,       0.0418,      -0.0279,      -0.0389  ],
            [      0.0128,      -0.0004,       0.0352,    ...,       0.0763,      -0.0343,      -0.0307  ],
        ],
    ]
    sum = 43.818310
common_debug_cb_eval: ffn_moe_weighted-32 (view) = (f32)       VIEW(ffn_moe_weighted-32{4096, 4, 30, 1}, }) = {4096, 30, 1, 1}
    [
        [
            [     -0.0024,      -0.0001,      -0.0010,    ...,      -0.0012,      -0.0003,      -0.0046  ],
            [     -0.0807,      -0.1962,       0.1267,    ...,      -0.0168,       0.0028,       0.0135  ],
            [      0.0488,       0.0612,      -0.0372,    ...,       0.0743,      -0.0307,       0.0143  ],
            ..., 
            [      0.0365,      -0.0189,      -0.0376,    ...,       0.0461,       0.0720,       0.0437  ],
            [     -0.0318,      -0.0012,      -0.0139,    ...,      -0.0261,      -0.0425,      -0.0318  ],
            [      0.0292,       0.0563,      -0.0295,    ...,      -0.0066,      -0.0192,       0.0690  ],
        ],
    ]
    sum = -40.639267
llama-debug: /home/colin/git_repos/forks/mistral4-q4_0/ggml/src/ggml-cpu/ops.cpp:3756: void ggml_compute_forward_rms_norm_f32(const ggml_compute_params*, ggml_tensor*): Assertion `scale > 0.0f' failed.

(this was without my changes so it asserted)

If the model data is not sound, I'm not sure where to go from here, though this clamping does make it run and doesn't affect any sound model. But I totally understand not wanting to add arbitrary code that masks bugs in the model itself, so I'm more than happy to hear your judgement.

@bartowski1182

Contributor Author

@ggerganov just curious if I should close this and we call Mistral broken or if I should continue investigating

@IMbackK

IMbackK commented Apr 13, 2026

Collaborator

Any such clamping should probably be performed at convert-time.

@bartowski1182

Contributor Author

@IMbackK this overflow happens during activation calculations, so it can't be done at convert-time

@IMbackK

IMbackK commented Apr 13, 2026

Collaborator

Right, yeah. I don't see a good solution then.

@JohannesGaessler

Contributor

One fix that could be done is to scale down the FP32 activations prior to the matrix multiplication and to then scale up the FP32 results afterwards again. You would lose some information on activations with very small absolute values that may now get flushed to zero but you would become more robust against overflow.
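
A rough sketch of that idea on a plain float dot product, assuming a fixed power-of-two scale factor. The value 1/256 and the names `scaled_dot` and `kScale` are placeholders, not anything from ggml. Because the dot product is linear, dot(w, k * x) / k equals dot(w, x) up to rounding, while the activations that would be quantized are k times smaller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder scale. A power of two keeps the scaling itself exact in binary
// floating point, barring underflow of very small activations.
constexpr float kScale = 1.0f / 256.0f;

// Scale activations down before the dot product and scale the result back up.
// In a real kernel, the scaled activations are what would be quantized.
inline float scaled_dot(const std::vector<float> & w, const std::vector<float> & x) {
    assert(w.size() == x.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float xs = x[i] * kScale; // reduced-magnitude activation
        acc += w[i] * xs;
    }
    return acc / kScale; // undo the scaling on the f32 result
}
```

With this placeholder scale, a block of 32 same-sign activations of magnitude 3000 would sum to about 375 instead of 96000, well inside the fp16 range. How to pick the scale for real models is left open here.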

@ggerganov

Member

IMO it's not worth patching this without understanding better what exactly causes one of the activations to explode.

Maybe something goes wrong in the normalization logic:

llama.cpp/src/llama-graph.cpp

Lines 1398 to 1412 in 6a6780a

if (norm_w) {
    weights = ggml_reshape_2d(ctx0, weights, n_expert_used, n_tokens);
    ggml_tensor * weights_sum = ggml_sum_rows(ctx0, weights); // [1, n_tokens]
    cb(weights_sum, "ffn_moe_weights_sum", il);
    // Avoid division by zero, clamp to smallest number representable by F16
    weights_sum = ggml_clamp(ctx0, weights_sum, 6.103515625e-5, INFINITY);
    cb(weights_sum, "ffn_moe_weights_sum_clamped", il);
    weights = ggml_div(ctx0, weights, weights_sum); // [n_expert_used, n_tokens]
    cb(weights, "ffn_moe_weights_norm", il);
    weights = ggml_reshape_3d(ctx0, weights, 1, n_expert_used, n_tokens);
}
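
For readers without the ggml graph API in their head, the snippet above amounts to the following per-token computation. This is a plain C++ paraphrase with invented names (`normalize_expert_weights`, `kMinSum`), not code from llama.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Lower clamp bound from the snippet above: 2^-14, the smallest normal fp16 value.
constexpr float kMinSum = 6.103515625e-5f;

// Normalize the routing weights of the experts selected for one token,
// clamping the sum from below to avoid dividing by zero.
inline std::vector<float> normalize_expert_weights(std::vector<float> weights) {
    assert(!weights.empty());
    float sum = std::accumulate(weights.begin(), weights.end(), 0.0f);
    sum = std::max(sum, kMinSum); // same effect as clamping to [kMinSum, +Inf)
    for (float & w : weights) {
        w /= sum;
    }
    return weights;
}
```

One property worth noting: if the raw sum falls below the clamp bound, the outputs no longer sum to 1.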

@bartowski1182

bartowski1182 commented Apr 15, 2026

Contributor Author

well you won't like this development...

MiniMax M2.7 is showing a similar issue but this time with Q4_K and Q5_K: when the last FFN_DOWN_EXPS is Q4_K, the perplexity gets NaN values

However this time it only happens with CUDA, CPU gets no such NaNs

Compiling with -DGGML_CUDA_FORCE_CUBLAS=ON fixes the NaNs, as does performing an identical clamp to the F16 range inside quantize_q8_1 and the DS4 layout of quantize_mmq_q8_1:

diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu
index 4300ffc14..ea23fed7f 100644
--- a/ggml/src/ggml-cuda/quantize.cu
+++ b/ggml/src/ggml-cuda/quantize.cu
@@ -44,7 +44,9 @@ static __global__ void quantize_q8_1(
         return;
     }

-    y[ib].ds = make_half2(d, sum);
+    // DEBUG: clamp sum to F16 range to test if this is the NaN source
+    const float sum_clamped = fminf(fmaxf(sum, -65504.0f), 65504.0f);
+    y[ib].ds = make_half2(d, sum_clamped);
 }

 __device__ __forceinline__ uint8_t compute_e8m0_scale(float amax) {
@@ -264,7 +266,9 @@ static __global__ void quantize_mmq_q8_1(
     const float d = 1.0f / d_inv;

     if (ds_layout == MMQ_Q8_1_DS_LAYOUT_DS4) {
-        y[ib].ds4[iqs/32] = make_half2(d, sum);
+        // DEBUG: clamp sum to F16 range to test if this is the NaN source
+        const float sum_clamped = fminf(fmaxf(sum, -65504.0f), 65504.0f);
+        y[ib].ds4[iqs/32] = make_half2(d, sum_clamped);
     } else {
         y[ib].d4[iqs/32]  = d;
     }

Unfortunately this is debugging done with Claude, so it would need someone better versed in CUDA to suggest a proper fix. Claude suggested swapping DS4 for D2S6, but I have no clue what the implications are. The clamping is purely there to show where the issue seems to be :')

but it does lead me to believe we have some strange numerical issues on our hands and it's more widespread than initially thought

CUDA Version 12.2.2 btw

Edit: never mind, the compile flag only delays the NaN values; they still show up eventually.

@am17an

am17an commented Apr 15, 2026

Contributor

@bartowski1182 recently GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F was added, does that still cause the NaNs? Also do you have a minimal repro for this?

@bartowski1182

Contributor Author

I'll give that a shot

Minimal repro:

./build/bin/llama-quantize --pure ./MiniMax-M2.7-bf16.gguf ./MiniMax-M2.7-Q4_K.gguf Q4_K

If you have an imatrix, you can make a small version like this:

./build/bin/llama-quantize --imatrix ./MiniMaxAI_MiniMax-M2.7-imatrix.gguf --tensor-type 61.ffn_down_exps=q4_k ./MiniMax-M2.7-bf16.gguf ./MiniMax-M2.7-IQ1_S.gguf IQ1_S

Can use mine from here: https://huggingface.co/bartowski/MiniMaxAI_MiniMax-M2.7-GGUF/blob/main/MiniMaxAI_MiniMax-M2.7-imatrix.gguf

Should I consider opening a new issue for this or does this discussion feel related?

@am17an

am17an commented Apr 15, 2026

Contributor

I didn't realize that would involve downloading 500GB of weights. I'm guessing you don't have a smaller model to work with?

@bartowski1182

Contributor Author

Oh right, meant to link at the end

@ubergarm made a copy of that exact setup here:

https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF/blob/ed148d9b6be8a935caf693efcc92863d117fc8e8/BROKEN-TEST-ONLY-DONT-DOWNLOAD-MiniMax-M2.7-iq1_s_q4_K.gguf

@ORippler

Collaborator

y[ib].ds4[iqs/32] = make_half2(d, sum);

If sum exceeds the value range for F16, CUDA will follow IEEE 754 and sum will be INF. Unsure if/how the CPU backend handles this case

https://docs.nvidia.com/cuda/parallel-thread-execution/#scalar-conversions
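
As a rough illustration of when that happens, the C++ sketch below sums a 32-element block in F32 and checks whether the result would become INF once narrowed to F16. The names `block_sum` and `overflows_f16` are hypothetical, and the 65520 threshold assumes round-to-nearest-even conversion (65520 is the midpoint between 65504 and the unrepresentable 65536).

```cpp
#include <cmath>

constexpr int BLOCK_SIZE = 32; // assumed Q8_1 block size

// Sum of one block of activations, accumulated in F32.
float block_sum(const float * x) {
    float sum = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        sum += x[i];
    }
    return sum;
}

// True if converting v to F16 with round-to-nearest-even would give +/-INF.
// Magnitudes in (65504, 65520) still round down to 65504.
bool overflows_f16(float v) {
    return std::fabs(v) >= 65520.0f;
}
```

With 32 same-sign values, an average magnitude of 2048 per element is already enough to overflow (32 * 2048 = 65536), while 2047 per element lands exactly on 65504.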

@ORippler

Collaborator

If we need more dynamic value ranges, it makes sense to go towards BF16 datatypes (we have been confronted with numerical stability issues related to F16 multiple times already) imo

@bartowski1182

Contributor Author

Wanted to provide an update on the investigation

Here's what I know:

With mainline, running pure Q4_K and pure Q6_K with the last FFN_DOWN_EXPS set to Q4_K results in NaN on CUDA

With the clamp fix proposed above, those two quants are fixed.

HOWEVER, the recipe used for Q3_K_M (mine at least, I can check mainline's) is NOT fixed by the clamping above. Those NaN values still appear and only go away when flash attention is disabled

To fix that, an additional clamp on d is ALSO required:

-        // DEBUG: clamp sum to F16 range to test if this is the NaN source
+        // clamp d and sum to f16 range to avoid inf from large activations
+        const float d_clamped   = fminf(d, 65504.0f);
         const float sum_clamped = fminf(fmaxf(sum, -65504.0f), 65504.0f);
-        y[ib].ds4[iqs/32] = make_half2(d, sum_clamped);
+        y[ib].ds4[iqs/32] = make_half2(d_clamped, sum_clamped);

this was discovered when dumping the op that produced the issue:

NAN DEBUG: op #57489 'ffn_moe_down-61' (op=30) produced NaN/inf at index 9732096 (of 12582912): -inf
  src0: 'CUDA0#blk.61.ffn_down_exps.weight#0' type=12
  src1: 'ffn_moe_swiglu-61' type=0
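
A check of this kind can be sketched in a few lines of C++. This is not the debug code that produced the log above; `first_non_finite` is a hypothetical helper that scans a host-side F32 buffer and reports the first NaN or inf, assuming the op's output has already been copied back from the device.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: return the index of the first NaN/inf in data[0..n),
// or -1 if every element is finite.
std::ptrdiff_t first_non_finite(const float * data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (!std::isfinite(data[i])) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

Running it on each op's output in graph order points at the first op that goes non-finite, which is how a line like the one above narrows the problem to a single tensor.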

Something about Q3_K for earlier tensors causes d to also overflow to -inf and result in NaN

CPU is immune to all of this because it uses block_q8_K with f32 scale and int16 partial sums

With the two clamps to quantize.cu, we get proper values for the blocks that used to give NaN

I think we should introduce these clamps, since the model's PPL seems acceptable with them, and investigate other solutions such as switching to F32 scales and/or sums in the future. For now I think it's best to get a working solution into mainline so we can at least alleviate the outstanding issues with existing models

Other things I tried while experimenting:

Bumping FATTN_KQ_MAX_OFFSET to 10.0f

Using -ctk f32 -ctv f32

GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F

GGML_CUDA_DISABLE_FUSION=1

I've pushed the latest fixes for both Mistral and MiniMax to this branch so they can be seen together. Like I said, I think it's best to merge these fixes for now. If further investigation is requested I can continue digging (given some direction), but this is a short-term, easy fix that doesn't break anything existing

@github-actions (bot) added the "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Apr 21, 2026
@bartowski1182

Contributor Author

@ggerganov I will continue investigating to see if I can find a root cause, but I think we should consider merging this in the meantime to fix any models that exist with this issue (mistral and minimax)

@ggerganov

Member

Does Minimax need just the CUDA clamp in Q8_1?

@bartowski1182

Contributor Author

Correct yes

@am17an

am17an commented Apr 24, 2026

Contributor

This is a real problem, even deepseek v4 uses the OAI style clamp post gate + swiglu. These can reasonably overflow in later layers of larger models as you train with more tokens.


@ggerganov

Member

Clamping the weights is not a good idea. The basic assumption for using F16 for most quantization scales is that weights are standard normal distributed, which I think is still a valid assumption. Mistral 4 small deviates from this - I suspect something went wrong during training. But in any case, clamping the weights during quantization is very hacky.

For the activations - I think we have to prepare a separate fix where we switch to a wider-range for the sum in Q8_1. Either BF16 or F32 - depending which one is feasible and performant. Since this type is "internal", we can afford to make such a change.

@bartowski1182

Contributor Author

I agree 100%, but clamping weights is a quick easy fix in the interim, unless you think the swap to f32/bf16 will be quick and painless

The "depending on which is feasible and performant" is the only part that concerns me for getting the swap in quickly :)

Personally I'd rather merge this, investigate the better options, then implement the real fix and revert this

@mrexodia

Yesterday night I did an experiment where I had gpt-5.5 integrate a patch for Talkie 13B support (https://github.com/solwyc/talkie-1930-13b-it-q5) into llama.cpp, perform quantizations and test if the full CUDA offload works. Initially the workaround was to keep the offending layer in Q8_0, but deep research pointed to this PR so I had it try that.

The clamp approach from this PR removed the NaN values but changed the first token from My to <|end|>, making the output unusable. I have experience with low-level development and some familiarity with CUDA, so I was able to guide the model. However I cannot claim to understand the solution fully, so I do not feel comfortable submitting a patch. If it is helpful here is my commit: mrexodia@8c8ebce. I tested different quantization levels that previously failed and the tested quantization formats no longer produced NaNs under full CUDA offload

Happy to share more details/notes/pi session if that is in any way helpful.

Relevant technical details Codex really wants to mention:

The problematic Talkie activation block had a finite 32-value sum around 145894, above FP16 max 65504.
The experiment stores sum(qs) as int16 in the upper half of ds and reconstructs d * sum_q in FP32.
ffn_swiglu-14 was finite, max_abs_sum32 ~= 145894; first unexpected Inf was ffn_out-14, followed by NaNs in attn_norm-15.
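
The idea in those notes can be sketched on the host in C++. This is not the code from the linked commit: the `BlockSketch` layout and function names are hypothetical, and the scale is kept in F32 here for simplicity, whereas the experiment keeps `d` as a half and packs the int16 next to it.

```cpp
#include <cmath>
#include <cstdint>

constexpr int QK = 32; // assumed block size

// Hypothetical layout, not ggml's block_q8_1.
struct BlockSketch {
    float   d;      // scale: amax / 127
    int16_t sum_q;  // sum of quantized values, |sum_q| <= 32 * 127 = 4064
    int8_t  qs[QK]; // quantized activations
};

BlockSketch quantize_block(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    BlockSketch b{};
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    int sum = 0;
    for (int i = 0; i < QK; ++i) {
        const int q = static_cast<int>(std::lround(x[i] * id));
        b.qs[i] = static_cast<int8_t>(q);
        sum += q;
    }
    b.sum_q = static_cast<int16_t>(sum); // always fits in int16
    return b;
}

// Reconstruct the block sum in F32 instead of reading a pre-stored half.
float block_sum_f32(const BlockSketch & b) {
    return b.d * static_cast<float>(b.sum_q);
}
```

Because the integer sum is bounded by 4064 regardless of how large the activations are, nothing in the stored block can overflow; the large magnitude only appears when `d * sum_q` is formed in F32.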

@bssrdf

bssrdf commented May 1, 2026

Contributor

Just to point out that this FP16 overflow has also been bugging several SD.cpp models using the ds4 layout. See leejet/stable-diffusion.cpp#851 (comment). The SD author had to scale down the activations to get around the problem. #22571 is a nice and clean fix.

@thomasgauthier

@mrexodia I also have an implementation of the Talkie architecture. You can find it on my fork (GGUF).

I've had no problems running the model on CUDA, and the logits seem to match the original PyTorch model (or almost match, some slight difference when using flash-attn). I'm planning on opening a PR once I have everything properly reviewed. Hope that helps!

@mrexodia

mrexodia commented May 1, 2026

Yeah I indeed used your patch @thomasgauthier, thanks for publishing it! The issues didn't show up for all quantizations and I used https://huggingface.co/lewtun/talkie-1930-13b-it-hf as a base instead of their custom checkpoints. Not exactly sure why it happened, but there is definitely a real issue in the CUDA implementation (since everything works fine when running on the CPU).

@thomasgauthier

@mrexodia yeah ok I have only tested Q8_0, I'll try the other quants and investigate the issue. Thanks for flagging this.

@mrexodia

mrexodia commented May 1, 2026

FYI @thomasgauthier the workaround is to specify --tensor-type "blocks.14.mlp.mlp_resid.weight=q8_0" because that was the layer with large values that triggered the CUDA problem.

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

9 participants