Swap out F16 for BF16 in Q8_1 activations to avoid overflowing values by bartowski1182 · Pull Request #22571 · ggml-org/llama.cpp

bartowski1182 · 2026-05-01T01:51:14Z

Overview

New attempt to prevent Q8_1 overflow issues, supercedes #21652

This was done entirely with Claude 4.7 Opus, if validated by someone who understands these functions then great, best I can do is test it comprehensively

On a model that had no issues previously, final PPL remains almost identical (though values were slightly different throughout, they were within margin of error:

Qwen3.6 35B A3B Q3_K_M perplexity on CUDA

master:

[1]4.3686,[2]6.1338,[3]5.5270,[4]5.4420,[5]5.5565,[6]5.7229,[7]5.9575,[8]6.4584,[9]6.9418,[10]7.3256,
...
Final estimate: PPL = 6.9420 +/- 0.04567

this PR:

[1]4.3630,[2]6.1135,[3]5.4933,[4]5.4165,[5]5.5378,[6]5.7021,[7]5.9366,[8]6.4445,[9]6.9280,[10]7.3208,
...
Final estimate: PPL = 6.9420 +/- 0.04567

Qwen3.6 35B A3B Q4_1 perplexity on CPU

master:

[1]4.1408,[2]5.8097,[3]5.2460,[4]5.2142,[5]5.3413,[6]5.5225,[7]5.7804,[8]6.2870,[9]6.7792,[10]7.1659,
...
Final estimate: PPL = 6.7989 +/- 0.04433

this PR:

[1]4.1466,[2]5.7802,[3]5.2669,[4]5.2327,[5]5.3456,[6]5.5232,[7]5.7714,[8]6.2829,[9]6.7681,[10]7.1520,
Final estimate: PPL = 6.7995 +/- 0.04434

We see about 1% drop in speed from this on a 3090, need further testing with pre-ampere arch:

MiniMax M2.7 Q3_K_M llama-bench on CUDA

master:

model	size	params	backend	test	t/s master	t/s PR	change
minimax-m2 230B.A10B Q3_K - Medium	96.97 GiB	228.69 B	CUDA	pp128	37.27 ± 0.42	36.57 ± 0.68	-1.9%
minimax-m2 230B.A10B Q3_K - Medium	96.97 GiB	228.69 B	CUDA	pp256	65.65 ± 0.41	64.88 ± 0.61	-1.2%
minimax-m2 230B.A10B Q3_K - Medium	96.97 GiB	228.69 B	CUDA	pp512	117.95 ± 0.57	115.41 ± 0.51	-2.1%
minimax-m2 230B.A10B Q3_K - Medium	96.97 GiB	228.69 B	CUDA	tg128	26.07 ± 0.09	25.88 ± 0.04	-0.73%
minimax-m2 230B.A10B Q3_K - Medium	96.97 GiB	228.69 B	CUDA	tg256	26.09 ± 0.00	25.91 ± 0.02	-0.69%
minimax-m2 230B.A10B Q3_K - Medium	96.97 GiB	228.69 B	CUDA	tg512	24.71 ± 0.86	25.60 ± 0.02	+3.6%

Mistral 4 small Q4_0 (with `Q4_1` FFN_DOWN) on CPU

master:

model	size	params	backend	threads	test	t/s master	t/s PR	change
mistral4 ?B Q4_0	64.78 GiB	118.97 B	CPU	96	pp128	253.81 ± 0.37	254.49 ± 0.87	+0.27%
mistral4 ?B Q4_0	64.78 GiB	118.97 B	CPU	96	pp256	285.91 ± 0.67	287.11 ± 0.96	+0.42%
mistral4 ?B Q4_0	64.78 GiB	118.97 B	CPU	96	pp512	289.02 ± 0.29	292.96 ± 0.21	+1.4%
mistral4 ?B Q4_0	64.78 GiB	118.97 B	CPU	96	tg128	40.96 ± 0.01	40.68 ± 0.02	-0.68%
mistral4 ?B Q4_0	64.78 GiB	118.97 B	CPU	96	tg256	40.91 ± 0.01	40.66 ± 0.01	-0.61%
mistral4 ?B Q4_0	64.78 GiB	118.97 B	CPU	96	tg512	40.60 ± 0.01	40.38 ± 0.01	-0.54%

Now for the actual fixes themselves

MiniMax M2.7 Q3_K_M perplexity on CUDA

master:

[1]4.2071,[2]5.1631,[3]4.8918,[4]5.4672,[5]5.7012,[6]6.2516,[7]6.5704,[8]7.5223,[9]7.9325,[10]8.0931,[11]8.2322,[12]8.6627,[13]8.6830,[14]8.5045,[15]8.6880,[16]8.2756,[17]8.3873,[18]8.3394,[19]8.2548,[20]7.9901,[21]7.9147,[22]7.6877,[23]7.4215,[24]7.2850,[25]6.9328,[26]6.7717,[27]6.8922,[28]6.8746,[29]6.9311,[30]6.9283,[31]6.8601,[32]nan,
...
Unexpected negative standard deviation of log(prob)

this PR:

[1]4.2146,[2]5.1287,[3]4.8143,[4]5.4133,[5]5.6447,[6]6.1810,[7]6.5155,[8]7.4806,[9]7.8757,[10]8.0376,[11]8.1917,[12]8.5957,[13]8.6355,[14]8.4622,[15]8.6559,[16]8.2562,[17]8.3488,[18]8.2902,[19]8.1950,[20]7.9454,[21]7.8700,[22]7.6319,[23]7.3665,[24]7.2315,[25]6.8810,[26]6.7312,[27]6.8493,[28]6.8279,[29]6.8816,[30]6.8785,[31]6.8117,[32]6.8623,
...
Final estimate: PPL = 8.1570 +/- 0.10323

Mistral 4 Small Q4_0 (with Q4_1 FFN_DOWN_EXPS)

master:

[1]131072.0000,[2]131072.0000,[3]131072.0000,[4]131072.0000,[5]131072.0000,[6]131072.0000,[7]131072.0000,[8]131072.0000,[9]131072.0000,[10]131072.0000,[11]131072.0000,[12]131072.0000,[13]131072.0000,[14]131072.0000,[15]131072.0000,[16]131072.0000,[17]131072.0000,[18]131072.0000,[19]131072.0000,[20]131072.0000,[21]131072.0000,[22]131072.0000,[23]131072.0000,[24]131072.0000,[25]131072.0000,[26]131072.0000,[27]131072.0000,[28]131072.0000,[29]131072.0000,[30]131072.0000,
Unexpected negative standard deviation of log(prob)

this PR:

[1]3.4894,[2]5.1771,[3]4.4108,[4]4.1429,[5]4.2988,[6]4.4945,[7]4.6000,[8]4.5921,[9]4.5142,[10]4.5728,[11]4.5730,[12]4.6035,[13]4.8440,[14]4.9321,[15]4.9869,[16]5.1665,[17]5.0031,[18]5.1377,[19]5.3681,[20]5.3158,[21]5.3321,[22]5.3115,[23]5.2734,[24]5.1278,[25]4.9829,[26]4.9108,[27]4.7981,[28]4.7804,[29]4.7221,[30]4.6932,
Final estimate: PPL = 4.6932 +/- 0.12949

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, this is FULLY generated by Opus 4.7, tested fully by hand

mrexodia · 2026-05-01T20:59:16Z

I tested both this commit and the one GPT-5.5 came up with (mrexodia@8c8ebce) and while the NaN issue is indeed avoided, this one is less precise. Full session (scroll to the bottom for the comparison and review): https://pi.dev/session/#457e5201c6a747dd2f6b4dd1d1a1f970

bartowski1182 · 2026-05-02T00:29:23Z

Yeah this is intended as an initial PoC

Regarding the d comment though, it can't be kept as fp16 in CUDA because it also overflows in some cases (minimax)

I don't really understand the changes you/gpt 5.5 have proposed, but can you test it in similar ways to show if it fixes nan values without destroying perplexity or speed?

mrexodia · 2026-05-02T11:28:11Z

It looks like I misunderstood the initial benchmark. I thought that Q4_0 had a different outcome between the CPU/CUDA paths on your branch and not mine, but this is not the case. This quant is just weak for the model I'm testing.

I got some books from Gutenberg from before 1931 to test the perplexity and these were the results:

branch	model	PPL	stderr	prompt eval tok/s
int16	Q8_0	11.4761	0.05557	2206.95
bartowski	Q8_0	11.4761	0.05557	2198.38
int16	Q5_0	10.0763	0.04450	2106.73
bartowski	Q5_0	10.0763	0.04450	2108.07

The test was llama-perplexity -ngl 99 -c 2048 -b 256 -ub 128 -fa on 2.6 MB, 311 chunks / 636,928 tokens. Your fix has exactly the same outcome as the one from gpt-5.5, and the NaN issue I had originally is gone as well!

…ng onto fp16 types

bartowski1182 · 2026-05-05T00:43:09Z

Leaving the extra comments from Opus in an attempt to explain reasoning since some of the decisions were tricky..

Performance and PPL both look good, though of course being entirely Opus generated I wouldn't be comfortable suggesting that it get merged as-is without extra eyes on this

JohannesGaessler

Skimming the PR, the general idea of replacing half2 with nv_bfloat2 in the activations is correct assuming the goal is just to prevent numerical overflows during quantization. The only question is getting a non-janky implementation across all matrix multiplication kernels and asserting that there isn't a performance regression (though it should be fine since the compute bound kernels only use int8 and FP32 arithmetic anyways).

JohannesGaessler · 2026-05-05T15:48:58Z

+
+static_assert(sizeof(block_q8_1_bf16) == sizeof(block_q8_1), "block_q8_1_bf16 must match block_q8_1 byte layout");
+
+static __device__ __forceinline__ float2 ggml_cuda_bf162_to_float2(const nv_bfloat162 v) {


Use ggml_cuda_cast as defined in convert.cuh instead.

0cc4m · 2026-05-07T15:13:54Z

In my opinion, we can change q8_1 to use bfloat16 if it helps avoid NaNs in these cases and doesn't cause regressions somewhere else, but it should be done for all backends at once, then.

Alternatively, q8_1 could also be "removed" as an "official" ggml type and instead left to each backend separately to use whatever it thinks is best, internally. We already have multiple variants of it (like CUDA's mmq struct and Vulkan's _x4 packed struct) because it doesn't affect anything outside the backend.

Swap out F16 for BF16 in Q8_1 activations to avoid overflowing values

753509f

github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 1, 2026

bssrdf mentioned this pull request May 1, 2026

Prevent the sum of the dequantized activation in q8_1 from overflowing #21652

Open

bartowski1182 added 2 commits May 4, 2026 19:12

Proper implementation of new structs that use bf16 instead of repacki…

7dd455a

…ng onto fp16 types

Merge branch 'master' into minimax

2b62884

JohannesGaessler reviewed May 5, 2026

View reviewed changes

am17an mentioned this pull request May 6, 2026

Fuse rms_norm, mul, quantize_q8_1 #22710

Open

Replace casting with ggml_cuda_cast

8d93f39

niklassheth mentioned this pull request Jun 5, 2026

Eval bug: llama-server produces intermittent garbage output with Talkie architecture, llama-cli works fine #23953

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Swap out F16 for BF16 in Q8_1 activations to avoid overflowing values#22571

Swap out F16 for BF16 in Q8_1 activations to avoid overflowing values#22571
bartowski1182 wants to merge 4 commits into
ggml-org:masterfrom
bartowski1182:minimax

bartowski1182 commented May 1, 2026 •

edited

Loading

Uh oh!

mrexodia commented May 1, 2026

Uh oh!

bartowski1182 commented May 2, 2026

Uh oh!

mrexodia commented May 2, 2026

Uh oh!

bartowski1182 commented May 5, 2026

Uh oh!

JohannesGaessler left a comment

Uh oh!

JohannesGaessler May 5, 2026

Uh oh!

0cc4m commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants


		static_assert(sizeof(block_q8_1_bf16) == sizeof(block_q8_1), "block_q8_1_bf16 must match block_q8_1 byte layout");

		static __device__ __forceinline__ float2 ggml_cuda_bf162_to_float2(const nv_bfloat162 v) {

Uh oh!

Conversation

bartowski1182 commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Qwen3.6 35B A3B Q3_K_M perplexity on CUDA

Qwen3.6 35B A3B Q4_1 perplexity on CPU

MiniMax M2.7 Q3_K_M llama-bench on CUDA

Mistral 4 small Q4_0 (with Q4_1 FFN_DOWN) on CPU

MiniMax M2.7 Q3_K_M perplexity on CUDA

Mistral 4 Small Q4_0 (with Q4_1 FFN_DOWN_EXPS)

Requirements

Uh oh!

mrexodia commented May 1, 2026

Uh oh!

bartowski1182 commented May 2, 2026

Uh oh!

mrexodia commented May 2, 2026

Uh oh!

bartowski1182 commented May 5, 2026

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler May 5, 2026

Choose a reason for hiding this comment

Uh oh!

0cc4m commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bartowski1182 commented May 1, 2026 •

edited

Loading

Mistral 4 small Q4_0 (with `Q4_1` FFN_DOWN) on CPU