Swap out F16 for BF16 in Q8_1 activations to avoid overflowing values#22571
Swap out F16 for BF16 in Q8_1 activations to avoid overflowing values#22571bartowski1182 wants to merge 4 commits into
Conversation
|
I tested both this commit and the one GPT-5.5 came up with (mrexodia@8c8ebce) and while the NaN issue is indeed avoided, this one is less precise. Full session (scroll to the bottom for the comparison and review): https://pi.dev/session/#457e5201c6a747dd2f6b4dd1d1a1f970 |
|
Yeah this is intended as an initial PoC Regarding the I don't really understand the changes you/gpt 5.5 have proposed, but can you test it in similar ways to show if it fixes nan values without destroying perplexity or speed? |
|
It looks like I misunderstood the initial benchmark. I thought that Q4_0 had a different outcome between the CPU/CUDA paths on your branch and not mine, but this is not the case. This quant is just weak for the model I'm testing. I got some books from Gutenberg from before 1931 to test the perplexity and these were the results:
The test was |
|
Leaving the extra comments from Opus in an attempt to explain reasoning since some of the decisions were tricky.. Performance and PPL both look good, though of course being entirely Opus generated I wouldn't be comfortable suggesting that it get merged as-is without extra eyes on this |
JohannesGaessler
left a comment
There was a problem hiding this comment.
Skimming the PR, the general idea of replacing half2 with nv_bfloat2 in the activations is correct assuming the goal is just to prevent numerical overflows during quantization. The only question is getting a non-janky implementation across all matrix multiplication kernels and asserting that there isn't a performance regression (though it should be fine since the compute bound kernels only use int8 and FP32 arithmetic anyways).
|
|
||
| static_assert(sizeof(block_q8_1_bf16) == sizeof(block_q8_1), "block_q8_1_bf16 must match block_q8_1 byte layout"); | ||
|
|
||
| static __device__ __forceinline__ float2 ggml_cuda_bf162_to_float2(const nv_bfloat162 v) { |
There was a problem hiding this comment.
Use ggml_cuda_cast as defined in convert.cuh instead.
|
In my opinion, we can change q8_1 to use bfloat16 if it helps avoid NaNs in these cases and doesn't cause regressions somewhere else, but it should be done for all backends at once, then. Alternatively, q8_1 could also be "removed" as an "official" ggml type and instead left to each backend separately to use whatever it thinks is best, internally. We already have multiple variants of it (like CUDA's mmq struct and Vulkan's _x4 packed struct) because it doesn't affect anything outside the backend. |
Overview
New attempt to prevent
Q8_1overflow issues, supercedes #21652This was done entirely with Claude 4.7 Opus, if validated by someone who understands these functions then great, best I can do is test it comprehensively
On a model that had no issues previously, final PPL remains almost identical (though values were slightly different throughout, they were within margin of error:
Qwen3.6 35B A3B Q3_K_M perplexity on CUDA
master:this PR:Qwen3.6 35B A3B Q4_1 perplexity on CPU
master:this PR:We see about 1% drop in speed from this on a 3090, need further testing with pre-ampere arch:
MiniMax M2.7 Q3_K_M llama-bench on CUDA
master:Mistral 4 small Q4_0 (with
Q4_1FFN_DOWN) on CPUmaster:Now for the actual fixes themselves
MiniMax M2.7 Q3_K_M perplexity on CUDA
master:this PR:Mistral 4 Small Q4_0 (with Q4_1 FFN_DOWN_EXPS)
master:this PR:Requirements