Add ROCmFP4 CPU quantization support #56
Conversation
Please provide the rationale for adding these types.
Thanks, that is fair. I ran a fresh limited comparison so the rationale is concrete.
Current PR scope: CPU/reference format only. The speed numbers below come from the prototype backend branch and are intended to justify why the tensor format may be worth reviewing before backend PRs.
Hardware: Framework AMD Strix Halo class system, 128 GB unified memory
Models:
Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB.
Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB.
ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.
Speed command shape: llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3
Decode speed:
Limited quality check: llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32
This is a limited 32-chunk WikiText-2 pass, not a full benchmark, but the result is effectively tied within uncertainty.
I am not claiming universal AMD speedup or full quality parity from this data. I also do not have a same-model Q4 baseline locally yet, so I am not claiming this is smaller or better than Q4_K_M. The narrower claim is: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass.
I added two more same-hardware comparisons to the local benchmark notes. These are secondary evidence; the Qwen3.6 27B result above is still the cleanest quality+speed datapoint. All runs use the same Strix Halo class system and the same speed command shape: llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3
Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL:
Supported claim for this model: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.
Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL:
Supported claim for this model: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.
I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.
Summary
This PR adds CPU/reference support for two experimental ROCmFP4 GGUF tensor types to Beellama:
Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.
This is the CPU/reference-format portion only. It intentionally avoids HIP/CUDA/Vulkan backend changes so the format and quantizer behavior can be reviewed separately.
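The bits-per-weight figures follow directly from the stated block sizes. A minimal sketch of that arithmetic, assuming each value occupies 4 bits (so 32 values pack into 16 bytes) and that the leftover bytes hold scale data. The split of the leftover bytes is inferred from the sizes above rather than taken from the implementation, and the helper names are hypothetical:

```python
BLOCK_VALUES = 32  # values per block, as stated in the summary


def bits_per_weight(block_bytes: int, block_values: int = BLOCK_VALUES) -> float:
    """Average storage cost per value, in bits."""
    return block_bytes * 8 / block_values


def non_payload_bytes(block_bytes: int, block_values: int = BLOCK_VALUES,
                      value_bits: int = 4) -> int:
    """Bytes left after packing the values (assumed here to hold the scales)."""
    packed_bytes = block_values * value_bits // 8
    return block_bytes - packed_bytes


# Block sizes in bytes, taken from the summary above.
layouts = {"Q4_0_ROCMFP4": 18, "Q4_0_ROCMFP4_FAST": 17}

for name, size in layouts.items():
    print(f"{name}: {bits_per_weight(size):.2f} bpw, "
          f"{non_payload_bytes(size)} byte(s) beyond the packed values")
```

Under that assumption, the dual-scale layout spends two bytes per block on scale data and the single-scale layout spends one, which accounts for the 0.25 bpw difference between them.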
Rationale and Benchmark Notes
Prototype benchmark rationale, measured on a Framework AMD Strix Halo class system with 128 GB unified memory:
Models compared:
Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB.
Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB.
ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.
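The size comparison can be reproduced from the two file sizes alone. A short sketch of the arithmetic, using only the byte counts quoted here; the variable names are illustrative:

```python
GIB = 1024 ** 3  # bytes per GiB

# File sizes in bytes, as reported for the two GGUF files.
rocmfp4_bytes = 14_817_252_512
q5_bytes = 20_350_682_240

saved_bytes = q5_bytes - rocmfp4_bytes
saved_pct = 100 * saved_bytes / q5_bytes  # reduction relative to the Q5 baseline

print(f"ROCmFP4: {rocmfp4_bytes / GIB:.2f} GiB")
print(f"Q5_K_XL: {q5_bytes / GIB:.2f} GiB")
print(f"Saved:   {saved_bytes:,} bytes ({saved_pct:.2f}%)")
```

The percentage is taken relative to the larger Q5 file, which is why it reads as "smaller than the baseline" rather than as a ratio of the ROCmFP4 size.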
Speed command shape:
llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3
Decode speed from the prototype backend branch:
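Limited quality check:
llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32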
This is a limited 32-chunk WikiText-2 pass, not a full quality benchmark, but the result is effectively tied within uncertainty.
Narrow claim supported by this data: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied it on the limited PPL pass. I am not claiming universal AMD speedup or full quality parity from this limited data, and I do not have a same-model Q4 baseline locally yet.
Additional prototype datapoints, same hardware and same speed command shape:
Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL:
Supported claim: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.
Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL:
Supported claim: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.
I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.
Beellama-specific notes
Beellama already has additional TurboQuant tensor and file types, so this port uses Beellama's next available IDs:
GGML_TYPE_Q4_0_ROCMFP4 = 50
GGML_TYPE_Q4_0_ROCMFP4_FAST = 51
LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4 = 45
LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4_FAST = 46
User-facing usage
With an importance matrix:
Compatibility
The change is additive:
Local validation
Validated locally against current Beellama origin/main: