Add ROCmFP4 CPU quantization support #56

Open
charlie12345 wants to merge 1 commit into Anbeeld:main from charlie12345:rocmfp4-cpu-only-beellama

charlie12345 commented Jun 5, 2026

Summary

This PR adds CPU/reference support for two experimental ROCmFP4 GGUF tensor types to Beellama:

  • Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
  • Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.
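
The bits-per-weight figures follow directly from the block sizes. A minimal sketch of the arithmetic, assuming (this is not stated in the PR) that the 32 values are packed as 4-bit codes into 16 bytes, so the remaining bytes of each block carry scale data. The function and constant names are illustrative only and are not taken from the patch:

```python
def bits_per_weight(block_bytes: int, values_per_block: int = 32) -> float:
    """Average storage cost per weight, in bits, for a fixed-size block."""
    return block_bytes * 8 / values_per_block


# Assumption: 32 four-bit codes occupy 16 bytes of each block.
PAYLOAD_BYTES = 32 * 4 // 8

for name, block_bytes in [("Q4_0_ROCMFP4", 18), ("Q4_0_ROCMFP4_FAST", 17)]:
    # Whatever is left after the packed codes is per-block scale data.
    scale_bytes = block_bytes - PAYLOAD_BYTES
    print(f"{name}: {bits_per_weight(block_bytes):.2f} bpw, "
          f"{scale_bytes} scale byte(s) per block")
```

Under that assumption the dual-scale layout spends two bytes per block on scales and the single-scale layout spends one, which accounts for the 0.25 bpw difference. How those scale bytes are encoded is not described here.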

This is the CPU/reference-format portion only. It intentionally avoids HIP/CUDA/Vulkan backend changes so the format and quantizer behavior can be reviewed separately.

Rationale and Benchmark Notes

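The benchmarks below are the rationale for the format. They were measured on a Framework AMD Strix Halo class system with 128 GB of unified memory; the speed numbers come from the prototype backend branch, not from this CPU-only PR.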

Models compared:

  • ROCmFP4: Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB.
  • Baseline: Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB.

ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.
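
For anyone who wants to re-check the size comparison, here is a small sketch that reproduces it from the two byte counts quoted above. The helper name is hypothetical:

```python
GIB = 1024 ** 3  # bytes per GiB


def size_reduction(candidate_bytes: int, baseline_bytes: int) -> tuple[int, float]:
    """Return (bytes saved, percent saved) relative to the baseline file."""
    saved = baseline_bytes - candidate_bytes
    return saved, 100.0 * saved / baseline_bytes


rocmfp4 = 14_817_252_512  # Qwen3.6-27B ROCmFP4 STRIX_LEAN file size
q5 = 20_350_682_240       # Qwen3.6-27B UD-Q5_K_XL file size

saved, pct = size_reduction(rocmfp4, q5)
print(f"{rocmfp4 / GIB:.2f} GiB vs {q5 / GIB:.2f} GiB")
print(f"{saved:,} bytes smaller ({pct:.2f}%)")
```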

Speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

| Model | Backend | pp512 tok/s | tg128 tok/s |
| --- | --- | --- | --- |
| ROCmFP4 STRIX_LEAN | Vulkan | 255.19 +/- 2.88 | 14.16 +/- 0.09 |
| UD-Q5_K_XL | Vulkan | 333.41 +/- 0.64 | 10.77 +/- 0.01 |
| ROCmFP4 STRIX_LEAN | ROCm | 362.72 +/- 1.89 | 13.83 +/- 0.01 |
| UD-Q5_K_XL | ROCm | 335.14 +/- 5.25 | 10.23 +/- 0.02 |

Decode speed from the prototype backend branch:

  • Vulkan: 14.16 vs 10.77 tok/s, about 31.5% faster.
  • ROCm: 13.83 vs 10.23 tok/s, about 35.2% faster.
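
The percentages above are plain ratios of the mean throughputs. A minimal sketch of that calculation over the 27B results, ignoring the reported +/- spread; the function name and the dictionary layout are illustrative only:

```python
def percent_change(candidate: float, baseline: float) -> float:
    """Relative throughput change of candidate vs baseline, in percent."""
    return 100.0 * (candidate / baseline - 1.0)


# (ROCmFP4 STRIX_LEAN, UD-Q5_K_XL) mean tok/s from the 27B llama-bench runs.
results = {
    ("Vulkan", "pp512"): (255.19, 333.41),
    ("Vulkan", "tg128"): (14.16, 10.77),
    ("ROCm", "pp512"): (362.72, 335.14),
    ("ROCm", "tg128"): (13.83, 10.23),
}

for (backend, test), (rocmfp4, q5) in results.items():
    print(f"{backend} {test}: {percent_change(rocmfp4, q5):+.1f}%")
```

On these numbers, Vulkan prompt processing is slower for the ROCmFP4 prototype while ROCm prompt processing is faster, so the speed claim here is specifically about decode.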

Limited quality check:

llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32

| Model | Backend | Chunks | PPL |
| --- | --- | --- | --- |
| ROCmFP4 STRIX_LEAN | Vulkan | 32 | 6.6538 +/- 0.09455 |
| UD-Q5_K_XL | Vulkan | 32 | 6.6554 +/- 0.09661 |

This is a limited 32-chunk WikiText-2 pass, not a full quality benchmark, but the result is effectively tied within uncertainty.
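
One rough way to read "tied within uncertainty" is to express the PPL gap in units of the combined reported error. The sketch below assumes the +/- values are standard errors and treats the two runs as independent, which is a simplification because both runs score the same text. The function name is hypothetical, and the second pair is the Qwen3.6 35B A3B result reported further down in this description:

```python
import math


def ppl_gap_in_sigmas(ppl_a: float, err_a: float, ppl_b: float, err_b: float) -> float:
    """Absolute PPL gap divided by the combined error (independence assumed)."""
    combined = math.hypot(err_a, err_b)  # sqrt(err_a**2 + err_b**2)
    return abs(ppl_a - ppl_b) / combined


# ROCmFP4 STRIX_LEAN vs UD-Q5_K_XL, 32-chunk WikiText-2 runs.
gap_27b = ppl_gap_in_sigmas(6.6538, 0.09455, 6.6554, 0.09661)
gap_35b = ppl_gap_in_sigmas(6.0609, 0.08084, 5.8748, 0.07794)

print(f"Qwen3.6 27B gap: {gap_27b:.3f} combined errors")
print(f"Qwen3.6 35B A3B gap: {gap_35b:.3f} combined errors")
```

Under these assumptions the 27B gap is a small fraction of one combined error, while the 35B A3B gap is larger than one combined error. That is consistent with describing the first as a tie and the second as a size/speed tradeoff.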

Narrow claim supported by this data: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass. I am not claiming a universal AMD speedup or full quality parity from this limited data, and I do not yet have a same-model Q4 baseline locally.

Additional prototype datapoints, same hardware and same speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL:

| Metric | ROCmFP4 STRIX_LEAN | UD-Q5_K_XL |
| --- | --- | --- |
| Size | 17.74 GiB | 25.29 GiB |
| Vulkan pp512 | 1092.13 +/- 11.48 tok/s | 1027.47 +/- 14.50 tok/s |
| Vulkan tg128 | 76.95 +/- 0.09 tok/s | 57.31 +/- 0.04 tok/s |
| ROCm pp512 | 1128.36 +/- 20.65 tok/s | 941.70 +/- 14.89 tok/s |
| ROCm tg128 | 67.53 +/- 0.16 tok/s | 48.09 +/- 0.08 tok/s |
| Limited WikiText-2 PPL, 32 chunks | 6.0609 +/- 0.08084 | 5.8748 +/- 0.07794 |

Supported claim: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.

Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL:

| Metric | ROCmFP4 STRIX_LEAN | UD-Q4_K_XL |
| --- | --- | --- |
| Size | 12.65 GiB | 15.84 GiB |
| Vulkan pp512 | 1175.12 +/- 15.11 tok/s | 1228.07 +/- 15.89 tok/s |
| Vulkan tg128 | 62.68 +/- 0.03 tok/s | 52.73 +/- 0.13 tok/s |
| ROCm pp512 | 1305.38 +/- 28.52 tok/s | 1191.64 +/- 17.98 tok/s |
| ROCm tg128 | 61.51 +/- 0.14 tok/s | 45.63 +/- 0.06 tok/s |

Supported claim: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.

I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values. I do not consider that run meaningful quality evidence for this model setup, and I am not using it as a quality claim.

Beellama-specific notes

Beellama already has additional TurboQuant tensor and file types, so this port uses Beellama's next available IDs:

  • GGML_TYPE_Q4_0_ROCMFP4 = 50
  • GGML_TYPE_Q4_0_ROCMFP4_FAST = 51
  • LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4 = 45
  • LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4_FAST = 46

User-facing usage

./llama-quantize model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4
./llama-quantize model-f16.gguf model-rocmfp4-fast.gguf Q4_0_ROCMFP4_FAST

With an importance matrix:

./llama-quantize --imatrix imatrix.dat model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4

Compatibility

The change is additive:

  • existing Beellama quantization modes are unchanged;
  • existing TurboQuant types are unchanged;
  • existing MXFP4/NVFP4 behavior is unchanged;
  • no non-CPU backend files are modified;
  • accelerated backend execution paths are left for follow-up work.

Local validation

Validated locally against current Beellama origin/main:

cmake configure, CPU-only: passed
llama-quantize build: passed
test-quantize-fns build: passed
test-quantize-perf build: passed
test-quantize-fns: passed, including q4_0_rocmfp4 and q4_0_rocmfp4_fast
git diff --check: passed
added-line secret/path scan: clean

Anbeeld (Owner) commented Jun 5, 2026

Please provide the rationale for adding these types.

charlie12345 (Author) commented

Thanks, that is fair. I ran a fresh limited comparison so the rationale is concrete.

Current PR scope: CPU/reference format only. The speed numbers below come from the prototype backend branch and are intended to justify why the tensor format may be worth reviewing before backend PRs.

Hardware: Framework AMD Strix Halo class system, 128 GB unified memory

Models:

  • ROCmFP4: Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB
  • Baseline: Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB

ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.

Speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

| Model | Backend | pp512 tok/s | tg128 tok/s |
| --- | --- | --- | --- |
| ROCmFP4 STRIX_LEAN | Vulkan | 255.19 +/- 2.88 | 14.16 +/- 0.09 |
| UD-Q5_K_XL | Vulkan | 333.41 +/- 0.64 | 10.77 +/- 0.01 |
| ROCmFP4 STRIX_LEAN | ROCm | 362.72 +/- 1.89 | 13.83 +/- 0.01 |
| UD-Q5_K_XL | ROCm | 335.14 +/- 5.25 | 10.23 +/- 0.02 |

Decode speed:

  • Vulkan: 14.16 vs 10.77 tok/s, about 31.5% faster.
  • ROCm: 13.83 vs 10.23 tok/s, about 35.2% faster.

Limited quality check:

llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32

| Model | Backend | Chunks | PPL |
| --- | --- | --- | --- |
| ROCmFP4 STRIX_LEAN | Vulkan | 32 | 6.6538 +/- 0.09455 |
| UD-Q5_K_XL | Vulkan | 32 | 6.6554 +/- 0.09661 |

This is a limited 32-chunk WikiText-2 pass, not a full benchmark, but the result is effectively tied within uncertainty.

I am not claiming a universal AMD speedup or full quality parity from this data. I also do not yet have a same-model Q4 baseline locally, so I am not claiming this is smaller or better than Q4_K_M. The narrower claim is: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass.

charlie12345 (Author) commented

I added two more same-hardware comparisons to the local benchmark notes. These are secondary evidence; the Qwen3.6 27B result above is still the cleanest quality+speed datapoint.

All runs use the same Strix Halo class system and the same speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL

| Metric | ROCmFP4 STRIX_LEAN | UD-Q5_K_XL |
| --- | --- | --- |
| Size | 17.74 GiB | 25.29 GiB |
| Vulkan pp512 | 1092.13 +/- 11.48 tok/s | 1027.47 +/- 14.50 tok/s |
| Vulkan tg128 | 76.95 +/- 0.09 tok/s | 57.31 +/- 0.04 tok/s |
| ROCm pp512 | 1128.36 +/- 20.65 tok/s | 941.70 +/- 14.89 tok/s |
| ROCm tg128 | 67.53 +/- 0.16 tok/s | 48.09 +/- 0.08 tok/s |
| Limited WikiText-2 PPL, 32 chunks | 6.0609 +/- 0.08084 | 5.8748 +/- 0.07794 |

Supported claim for this model: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.

Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL

| Metric | ROCmFP4 STRIX_LEAN | UD-Q4_K_XL |
| --- | --- | --- |
| Size | 12.65 GiB | 15.84 GiB |
| Vulkan pp512 | 1175.12 +/- 15.11 tok/s | 1228.07 +/- 15.89 tok/s |
| Vulkan tg128 | 62.68 +/- 0.03 tok/s | 52.73 +/- 0.13 tok/s |
| ROCm pp512 | 1305.38 +/- 28.52 tok/s | 1191.64 +/- 17.98 tok/s |
| ROCm tg128 | 61.51 +/- 0.14 tok/s | 45.63 +/- 0.06 tok/s |

Supported claim for this model: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.

I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values. I do not consider that run meaningful quality evidence for this model setup, and I am not using it as a quality claim.
