Add ROCmFP4 CPU quantization support by charlie12345 · Pull Request #56 · Anbeeld/beellama.cpp

charlie12345 · 2026-06-05T14:09:22Z

Summary

This PR adds CPU/reference support for two experimental ROCmFP4 GGUF tensor types to Beellama:

Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.

This is the CPU/reference-format portion only. It intentionally avoids HIP/CUDA/Vulkan backend changes so the format and quantizer behavior can be reviewed separately.
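
As a quick sanity check on the bits-per-weight figures above, here is a minimal sketch of the arithmetic. The function names are illustrative and not part of the PR. The split between packed 4-bit values and extra bytes assumes plain 4-bit packing, which the type names suggest but this description does not spell out.

```python
BLOCK_VALUES = 32  # values per block, as stated in the summary


def bits_per_weight(block_bytes: int, block_values: int = BLOCK_VALUES) -> float:
    """Storage cost per value, in bits, for a fixed-size block."""
    return block_bytes * 8 / block_values


def extra_bytes(block_bytes: int, block_values: int = BLOCK_VALUES) -> int:
    """Bytes left over after packing the values at 4 bits each.

    Assumption: values are packed as plain 4-bit nibbles. How the
    remaining bytes encode the scale(s) is not described here.
    """
    packed = block_values * 4 // 8
    return block_bytes - packed


print(bits_per_weight(18), extra_bytes(18))  # Q4_0_ROCMFP4: 4.5 bpw, 2 extra bytes
print(bits_per_weight(17), extra_bytes(17))  # Q4_0_ROCMFP4_FAST: 4.25 bpw, 1 extra byte
```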

Rationale and Benchmark Notes

Prototype benchmark rationale, measured on a Framework AMD Strix Halo class system with 128 GB unified memory:

Models compared:

ROCmFP4: Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB.
Baseline: Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB.

ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.

Speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Model	Backend	pp512 tok/s	tg128 tok/s
ROCmFP4 STRIX_LEAN	Vulkan	255.19 +/- 2.88	14.16 +/- 0.09
UD-Q5_K_XL	Vulkan	333.41 +/- 0.64	10.77 +/- 0.01
ROCmFP4 STRIX_LEAN	ROCm	362.72 +/- 1.89	13.83 +/- 0.01
UD-Q5_K_XL	ROCm	335.14 +/- 5.25	10.23 +/- 0.02

Decode speed from the prototype backend branch:

Vulkan: 14.16 vs 10.77 tok/s, about 31.5% faster.
ROCm: 13.83 vs 10.23 tok/s, about 35.2% faster.
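
The size and decode percentages above can be reproduced from the raw numbers. This is only a sketch of the arithmetic with illustrative helper names; the inputs are copied from the file sizes and the llama-bench table in this description.

```python
def percent_smaller(new_bytes: int, base_bytes: int) -> float:
    """How much smaller `new_bytes` is than `base_bytes`, in percent."""
    return (1 - new_bytes / base_bytes) * 100


def percent_faster(new_tps: float, base_tps: float) -> float:
    """Relative throughput change in percent (negative means slower)."""
    return (new_tps / base_tps - 1) * 100


# File sizes in bytes, ROCmFP4 vs UD-Q5_K_XL
print(round(percent_smaller(14_817_252_512, 20_350_682_240), 2))  # 27.19

# tg128 means in tok/s, ROCmFP4 vs UD-Q5_K_XL
print(round(percent_faster(14.16, 10.77), 1))  # Vulkan: 31.5
print(round(percent_faster(13.83, 10.23), 1))  # ROCm: 35.2
```

Applying the same helper to the pp512 column of the table gives about -23.5% on Vulkan (255.19 vs 333.41) and about +8.2% on ROCm (362.72 vs 335.14) for this model.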

Limited quality check:

llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32

Model	Backend	Chunks	PPL
ROCmFP4 STRIX_LEAN	Vulkan	32	6.6538 +/- 0.09455
UD-Q5_K_XL	Vulkan	32	6.6554 +/- 0.09661

This is a limited 32-chunk WikiText-2 pass, not a full quality benchmark, but the result is effectively tied within uncertainty.
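
One rough way to read "tied within uncertainty" is to compare the PPL gap with the combined reported error. The sketch below assumes the two error estimates are independent, which is not strictly true because both runs score the same text, so treat it as an approximate check rather than a formal test. The function name is illustrative.

```python
import math


def ppl_gap_in_sigma(ppl_a: float, err_a: float, ppl_b: float, err_b: float) -> float:
    """PPL difference expressed in units of the combined reported error.

    Assumption: the two +/- values are independent standard errors,
    so they combine in quadrature.
    """
    combined = math.hypot(err_a, err_b)
    return abs(ppl_a - ppl_b) / combined


# Qwen3.6 27B, ROCmFP4 STRIX_LEAN vs UD-Q5_K_XL (Vulkan, 32 chunks)
gap = ppl_gap_in_sigma(6.6538, 0.09455, 6.6554, 0.09661)
print(f"gap = {gap:.3f} combined sigma")  # far below 1
```

A value well below 1 means the difference is much smaller than the measurement noise on this 32-chunk pass.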

Narrow claim supported by this data: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass. I am not claiming a universal AMD speedup or full quality parity from this limited data, and I do not have a same-model Q4 baseline locally yet.

Additional prototype datapoints, same hardware and same speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL:

Metric	ROCmFP4 STRIX_LEAN	UD-Q5_K_XL
Size	17.74 GiB	25.29 GiB
Vulkan pp512	1092.13 +/- 11.48 tok/s	1027.47 +/- 14.50 tok/s
Vulkan tg128	76.95 +/- 0.09 tok/s	57.31 +/- 0.04 tok/s
ROCm pp512	1128.36 +/- 20.65 tok/s	941.70 +/- 14.89 tok/s
ROCm tg128	67.53 +/- 0.16 tok/s	48.09 +/- 0.08 tok/s
Limited WikiText-2 PPL, 32 chunks	6.0609 +/- 0.08084	5.8748 +/- 0.07794

Supported claim: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.

Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL:

Metric	ROCmFP4 STRIX_LEAN	UD-Q4_K_XL
Size	12.65 GiB	15.84 GiB
Vulkan pp512	1175.12 +/- 15.11 tok/s	1228.07 +/- 15.89 tok/s
Vulkan tg128	62.68 +/- 0.03 tok/s	52.73 +/- 0.13 tok/s
ROCm pp512	1305.38 +/- 28.52 tok/s	1191.64 +/- 17.98 tok/s
ROCm tg128	61.51 +/- 0.14 tok/s	45.63 +/- 0.06 tok/s

Supported claim: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.
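
The per-metric percentages for this model can be derived directly from the table cells. This is a small sketch that parses the "mean +/- error" strings as they appear above; the regex and helper names are illustrative, and only the means are used.

```python
import re

# Matches cells such as "1175.12 +/- 15.11 tok/s"
CELL = re.compile(r"([\d.]+)\s*\+/-\s*([\d.]+)")


def parse_cell(cell: str) -> tuple[float, float]:
    """Return (mean, error) from a 'mean +/- error' table cell."""
    match = CELL.search(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def relative_change(new_cell: str, base_cell: str) -> float:
    """Percent change of the new mean relative to the baseline mean."""
    new_mean, _ = parse_cell(new_cell)
    base_mean, _ = parse_cell(base_cell)
    return (new_mean / base_mean - 1) * 100


# Gemma 4 26B A4B rows: (ROCmFP4 STRIX_LEAN, UD-Q4_K_XL)
rows = {
    "Vulkan pp512": ("1175.12 +/- 15.11 tok/s", "1228.07 +/- 15.89 tok/s"),
    "Vulkan tg128": ("62.68 +/- 0.03 tok/s", "52.73 +/- 0.13 tok/s"),
    "ROCm pp512": ("1305.38 +/- 28.52 tok/s", "1191.64 +/- 17.98 tok/s"),
    "ROCm tg128": ("61.51 +/- 0.14 tok/s", "45.63 +/- 0.06 tok/s"),
}

for metric, (new, base) in rows.items():
    print(f"{metric}: {relative_change(new, base):+.1f}%")
```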

I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.

Beellama-specific notes

Beellama already has additional TurboQuant tensor and file types, so this port uses Beellama's next available IDs:

GGML_TYPE_Q4_0_ROCMFP4 = 50
GGML_TYPE_Q4_0_ROCMFP4_FAST = 51
LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4 = 45
LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4_FAST = 46

User-facing usage

./llama-quantize model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4
./llama-quantize model-f16.gguf model-rocmfp4-fast.gguf Q4_0_ROCMFP4_FAST

With an importance matrix:

./llama-quantize --imatrix imatrix.dat model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4

Compatibility

The change is additive:

existing Beellama quantization modes are unchanged;
existing TurboQuant types are unchanged;
existing MXFP4/NVFP4 behavior is unchanged;
no non-CPU backend files are modified;
accelerated backend execution paths are left for follow-up work.

Local validation

Validated locally against current Beellama origin/main:

cmake configure, CPU-only: passed
llama-quantize build: passed
test-quantize-fns build: passed
test-quantize-perf build: passed
test-quantize-fns: passed, including q4_0_rocmfp4 and q4_0_rocmfp4_fast
git diff --check: passed
added-line secret/path scan: clean

Anbeeld · 2026-06-05T14:31:52Z

Please provide the rationale for adding these types.

charlie12345 · 2026-06-05T15:35:04Z

Thanks, that is fair. I ran a fresh limited comparison so the rationale is concrete.

Current PR scope: CPU/reference format only. The speed numbers below come from the prototype backend branch and are intended to justify why the tensor format may be worth reviewing before backend PRs.

Hardware: Framework AMD Strix Halo class system, 128 GB unified memory

Models:

ROCmFP4: Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB
Baseline: Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB

ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.

Speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Model	Backend	pp512 tok/s	tg128 tok/s
ROCmFP4 STRIX_LEAN	Vulkan	255.19 +/- 2.88	14.16 +/- 0.09
UD-Q5_K_XL	Vulkan	333.41 +/- 0.64	10.77 +/- 0.01
ROCmFP4 STRIX_LEAN	ROCm	362.72 +/- 1.89	13.83 +/- 0.01
UD-Q5_K_XL	ROCm	335.14 +/- 5.25	10.23 +/- 0.02

Decode speed:

Vulkan: 14.16 vs 10.77 tok/s, about 31.5% faster.
ROCm: 13.83 vs 10.23 tok/s, about 35.2% faster.

Limited quality check:

llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32

Model	Backend	Chunks	PPL
ROCmFP4 STRIX_LEAN	Vulkan	32	6.6538 +/- 0.09455
UD-Q5_K_XL	Vulkan	32	6.6554 +/- 0.09661

This is a limited 32-chunk WikiText-2 pass, not a full benchmark, but the result is effectively tied within uncertainty.

I am not claiming a universal AMD speedup or full quality parity from this data. I also do not have a same-model Q4 baseline locally yet, so I am not claiming this is smaller or better than Q4_K_M. The narrower claim is: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass.

charlie12345 · 2026-06-05T16:02:23Z

I added two more same-hardware comparisons to the local benchmark notes. These are secondary evidence; the Qwen3.6 27B result above is still the cleanest quality+speed datapoint.

All runs use the same Strix Halo class system and the same speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL

Metric	ROCmFP4 STRIX_LEAN	UD-Q5_K_XL
Size	17.74 GiB	25.29 GiB
Vulkan pp512	1092.13 +/- 11.48 tok/s	1027.47 +/- 14.50 tok/s
Vulkan tg128	76.95 +/- 0.09 tok/s	57.31 +/- 0.04 tok/s
ROCm pp512	1128.36 +/- 20.65 tok/s	941.70 +/- 14.89 tok/s
ROCm tg128	67.53 +/- 0.16 tok/s	48.09 +/- 0.08 tok/s
Limited WikiText-2 PPL, 32 chunks	6.0609 +/- 0.08084	5.8748 +/- 0.07794

Supported claim for this model: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.

Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL

Metric	ROCmFP4 STRIX_LEAN	UD-Q4_K_XL
Size	12.65 GiB	15.84 GiB
Vulkan pp512	1175.12 +/- 15.11 tok/s	1228.07 +/- 15.89 tok/s
Vulkan tg128	62.68 +/- 0.03 tok/s	52.73 +/- 0.13 tok/s
ROCm pp512	1305.38 +/- 28.52 tok/s	1191.64 +/- 17.98 tok/s
ROCm tg128	61.51 +/- 0.14 tok/s	45.63 +/- 0.06 tok/s

Supported claim for this model: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.

I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.

Add ROCmFP4 CPU quantization support

343eb5d

charlie12345 wants to merge 1 commit into Anbeeld:main from charlie12345:rocmfp4-cpu-only-beellama
