Skip to content

Swap out F16 for BF16 in Q8_1 activations to avoid overflowing values#22571

Draft
bartowski1182 wants to merge 4 commits into
ggml-org:masterfrom
bartowski1182:minimax
Draft

Swap out F16 for BF16 in Q8_1 activations to avoid overflowing values#22571
bartowski1182 wants to merge 4 commits into
ggml-org:masterfrom
bartowski1182:minimax

Conversation

@bartowski1182

@bartowski1182 bartowski1182 commented May 1, 2026

Copy link
Copy Markdown
Contributor

Overview

New attempt to prevent Q8_1 overflow issues, supercedes #21652

This was done entirely with Claude 4.7 Opus, if validated by someone who understands these functions then great, best I can do is test it comprehensively

On a model that had no issues previously, final PPL remains almost identical (though values were slightly different throughout, they were within margin of error:

Qwen3.6 35B A3B Q3_K_M perplexity on CUDA

master:

[1]4.3686,[2]6.1338,[3]5.5270,[4]5.4420,[5]5.5565,[6]5.7229,[7]5.9575,[8]6.4584,[9]6.9418,[10]7.3256,
...
Final estimate: PPL = 6.9420 +/- 0.04567

this PR:

[1]4.3630,[2]6.1135,[3]5.4933,[4]5.4165,[5]5.5378,[6]5.7021,[7]5.9366,[8]6.4445,[9]6.9280,[10]7.3208,
...
Final estimate: PPL = 6.9420 +/- 0.04567

Qwen3.6 35B A3B Q4_1 perplexity on CPU

master:

[1]4.1408,[2]5.8097,[3]5.2460,[4]5.2142,[5]5.3413,[6]5.5225,[7]5.7804,[8]6.2870,[9]6.7792,[10]7.1659,
...
Final estimate: PPL = 6.7989 +/- 0.04433

this PR:

[1]4.1466,[2]5.7802,[3]5.2669,[4]5.2327,[5]5.3456,[6]5.5232,[7]5.7714,[8]6.2829,[9]6.7681,[10]7.1520,
Final estimate: PPL = 6.7995 +/- 0.04434

We see about 1% drop in speed from this on a 3090, need further testing with pre-ampere arch:

MiniMax M2.7 Q3_K_M llama-bench on CUDA

master:

model size params backend ngl test t/s master t/s PR change
minimax-m2 230B.A10B Q3_K - Medium 96.97 GiB 228.69 B CUDA 0 pp128 37.27 ± 0.42 36.57 ± 0.68 -1.9%
minimax-m2 230B.A10B Q3_K - Medium 96.97 GiB 228.69 B CUDA 0 pp256 65.65 ± 0.41 64.88 ± 0.61 -1.2%
minimax-m2 230B.A10B Q3_K - Medium 96.97 GiB 228.69 B CUDA 0 pp512 117.95 ± 0.57 115.41 ± 0.51 -2.1%
minimax-m2 230B.A10B Q3_K - Medium 96.97 GiB 228.69 B CUDA 0 tg128 26.07 ± 0.09 25.88 ± 0.04 -0.73%
minimax-m2 230B.A10B Q3_K - Medium 96.97 GiB 228.69 B CUDA 0 tg256 26.09 ± 0.00 25.91 ± 0.02 -0.69%
minimax-m2 230B.A10B Q3_K - Medium 96.97 GiB 228.69 B CUDA 0 tg512 24.71 ± 0.86 25.60 ± 0.02 +3.6%

Mistral 4 small Q4_0 (with Q4_1 FFN_DOWN) on CPU

master:

model size params backend threads test t/s master t/s PR change
mistral4 ?B Q4_0 64.78 GiB 118.97 B CPU 96 pp128 253.81 ± 0.37 254.49 ± 0.87 +0.27%
mistral4 ?B Q4_0 64.78 GiB 118.97 B CPU 96 pp256 285.91 ± 0.67 287.11 ± 0.96 +0.42%
mistral4 ?B Q4_0 64.78 GiB 118.97 B CPU 96 pp512 289.02 ± 0.29 292.96 ± 0.21 +1.4%
mistral4 ?B Q4_0 64.78 GiB 118.97 B CPU 96 tg128 40.96 ± 0.01 40.68 ± 0.02 -0.68%
mistral4 ?B Q4_0 64.78 GiB 118.97 B CPU 96 tg256 40.91 ± 0.01 40.66 ± 0.01 -0.61%
mistral4 ?B Q4_0 64.78 GiB 118.97 B CPU 96 tg512 40.60 ± 0.01 40.38 ± 0.01 -0.54%

Now for the actual fixes themselves

MiniMax M2.7 Q3_K_M perplexity on CUDA

master:

[1]4.2071,[2]5.1631,[3]4.8918,[4]5.4672,[5]5.7012,[6]6.2516,[7]6.5704,[8]7.5223,[9]7.9325,[10]8.0931,[11]8.2322,[12]8.6627,[13]8.6830,[14]8.5045,[15]8.6880,[16]8.2756,[17]8.3873,[18]8.3394,[19]8.2548,[20]7.9901,[21]7.9147,[22]7.6877,[23]7.4215,[24]7.2850,[25]6.9328,[26]6.7717,[27]6.8922,[28]6.8746,[29]6.9311,[30]6.9283,[31]6.8601,[32]nan,
...
Unexpected negative standard deviation of log(prob)

this PR:

[1]4.2146,[2]5.1287,[3]4.8143,[4]5.4133,[5]5.6447,[6]6.1810,[7]6.5155,[8]7.4806,[9]7.8757,[10]8.0376,[11]8.1917,[12]8.5957,[13]8.6355,[14]8.4622,[15]8.6559,[16]8.2562,[17]8.3488,[18]8.2902,[19]8.1950,[20]7.9454,[21]7.8700,[22]7.6319,[23]7.3665,[24]7.2315,[25]6.8810,[26]6.7312,[27]6.8493,[28]6.8279,[29]6.8816,[30]6.8785,[31]6.8117,[32]6.8623,
...
Final estimate: PPL = 8.1570 +/- 0.10323

Mistral 4 Small Q4_0 (with Q4_1 FFN_DOWN_EXPS)

master:

[1]131072.0000,[2]131072.0000,[3]131072.0000,[4]131072.0000,[5]131072.0000,[6]131072.0000,[7]131072.0000,[8]131072.0000,[9]131072.0000,[10]131072.0000,[11]131072.0000,[12]131072.0000,[13]131072.0000,[14]131072.0000,[15]131072.0000,[16]131072.0000,[17]131072.0000,[18]131072.0000,[19]131072.0000,[20]131072.0000,[21]131072.0000,[22]131072.0000,[23]131072.0000,[24]131072.0000,[25]131072.0000,[26]131072.0000,[27]131072.0000,[28]131072.0000,[29]131072.0000,[30]131072.0000,
Unexpected negative standard deviation of log(prob)

this PR:

[1]3.4894,[2]5.1771,[3]4.4108,[4]4.1429,[5]4.2988,[6]4.4945,[7]4.6000,[8]4.5921,[9]4.5142,[10]4.5728,[11]4.5730,[12]4.6035,[13]4.8440,[14]4.9321,[15]4.9869,[16]5.1665,[17]5.0031,[18]5.1377,[19]5.3681,[20]5.3158,[21]5.3321,[22]5.3115,[23]5.2734,[24]5.1278,[25]4.9829,[26]4.9108,[27]4.7981,[28]4.7804,[29]4.7221,[30]4.6932,
Final estimate: PPL = 4.6932 +/- 0.12949

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, this is FULLY generated by Opus 4.7, tested fully by hand

@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 1, 2026
@mrexodia

mrexodia commented May 1, 2026

Copy link
Copy Markdown

I tested both this commit and the one GPT-5.5 came up with (mrexodia@8c8ebce) and while the NaN issue is indeed avoided, this one is less precise. Full session (scroll to the bottom for the comparison and review): https://pi.dev/session/#457e5201c6a747dd2f6b4dd1d1a1f970

@bartowski1182

Copy link
Copy Markdown
Contributor Author

Yeah this is intended as an initial PoC

Regarding the d comment though, it can't be kept as fp16 in CUDA because it also overflows in some cases (minimax)

I don't really understand the changes you/gpt 5.5 have proposed, but can you test it in similar ways to show if it fixes nan values without destroying perplexity or speed?

@mrexodia

mrexodia commented May 2, 2026

Copy link
Copy Markdown

It looks like I misunderstood the initial benchmark. I thought that Q4_0 had a different outcome between the CPU/CUDA paths on your branch and not mine, but this is not the case. This quant is just weak for the model I'm testing.

I got some books from Gutenberg from before 1931 to test the perplexity and these were the results:

branch model PPL stderr prompt eval tok/s
int16 Q8_0 11.4761 0.05557 2206.95
bartowski Q8_0 11.4761 0.05557 2198.38
int16 Q5_0 10.0763 0.04450 2106.73
bartowski Q5_0 10.0763 0.04450 2108.07

The test was llama-perplexity -ngl 99 -c 2048 -b 256 -ub 128 -fa on 2.6 MB, 311 chunks / 636,928 tokens. Your fix has exactly the same outcome as the one from gpt-5.5, and the NaN issue I had originally is gone as well!

@bartowski1182

Copy link
Copy Markdown
Contributor Author

Leaving the extra comments from Opus in an attempt to explain reasoning since some of the decisions were tricky..

Performance and PPL both look good, though of course being entirely Opus generated I wouldn't be comfortable suggesting that it get merged as-is without extra eyes on this

@JohannesGaessler JohannesGaessler left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimming the PR, the general idea of replacing half2 with nv_bfloat2 in the activations is correct assuming the goal is just to prevent numerical overflows during quantization. The only question is getting a non-janky implementation across all matrix multiplication kernels and asserting that there isn't a performance regression (though it should be fine since the compute bound kernels only use int8 and FP32 arithmetic anyways).

Comment thread ggml/src/ggml-cuda/common.cuh Outdated

static_assert(sizeof(block_q8_1_bf16) == sizeof(block_q8_1), "block_q8_1_bf16 must match block_q8_1 byte layout");

static __device__ __forceinline__ float2 ggml_cuda_bf162_to_float2(const nv_bfloat162 v) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ggml_cuda_cast as defined in convert.cuh instead.

@0cc4m

0cc4m commented May 7, 2026

Copy link
Copy Markdown
Contributor

In my opinion, we can change q8_1 to use bfloat16 if it helps avoid NaNs in these cases and doesn't cause regressions somewhere else, but it should be done for all backends at once, then.

Alternatively, q8_1 could also be "removed" as an "official" ggml type and instead left to each backend separately to use whatever it thinks is best, internally. We already have multiple variants of it (like CUDA's mmq struct and Vulkan's _x4 packed struct) because it doesn't affect anything outside the backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants