Release v0.6.5 — Re-baseline without Metal (3rd honest correction) · quantumaikr/quant.cpp

🚨 P3 Metal investigation found the existing Metal backend is slower than CPU-only

While exploring Option P3 (GPU compute graph for KV attention), we measured the existing Metal backend (`TQ_BUILD_METAL=ON`) and discovered it is neutral on the smallest model we tested and net negative on every larger one.

The numbers (3 runs each, Llama 3.2 3B Instruct PPL eval)

Build	KV type	tok/s	vs Metal ON
Metal ON	fp32	15.07	baseline
Metal OFF	fp32	17.87	+19%
Metal ON	turbo_kv_4b	14.17	baseline
Metal OFF	turbo_kv_4b	16.53	+17%
Metal ON	turbo_kv_5b	13.43	baseline
Metal OFF	turbo_kv_5b	15.33	+14%
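
The percentages in this table compare the Metal-OFF throughput against the Metal-ON run for the same KV type. A minimal sketch of that arithmetic, using the tok/s values above (the helper names `speedup_pct` and `slowdown_pct` are illustrative and not part of quant.cpp):

```python
# tok/s from the table above: {kv_type: (metal_on, metal_off)}
RESULTS = {
    "fp32": (15.07, 17.87),
    "turbo_kv_4b": (14.17, 16.53),
    "turbo_kv_5b": (13.43, 15.33),
}


def speedup_pct(metal_on: float, metal_off: float) -> float:
    """Percent by which Metal OFF is faster, measured from the Metal-ON number."""
    return (metal_off / metal_on - 1.0) * 100.0


def slowdown_pct(metal_on: float, metal_off: float) -> float:
    """Percent by which Metal ON is slower, measured from the Metal-OFF number."""
    return (1.0 - metal_on / metal_off) * 100.0


for kv_type, (on, off) in RESULTS.items():
    print(f"{kv_type}: Metal OFF is {speedup_pct(on, off):+.0f}% vs Metal ON")
```

The same gap reads as a smaller percentage when measured from the faster build, which is why `slowdown_pct` is kept separate: "OFF is N% faster" and "ON is N% slower" are not interchangeable.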

Across model sizes:

Model	Metal-OFF speedup
SmolLM2 135M	neutral
Llama 3.2 1B	+13–17%
Llama 3.2 3B	+14–22%
Gemma 4 26B	+40%

Even on the largest model we have access to, Metal is net negative. The per-matmul dispatch + commit + waitUntilCompleted pattern has overhead that exceeds the GPU compute benefit at batch-1 inference.
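
To make the overhead argument concrete, here is a toy cost model of a synchronous per-matmul dispatch pattern. Every number in it is a made-up illustration rather than a measurement, and the function names are hypothetical; it shows only why a fixed per-dispatch wait can cancel out a faster kernel.

```python
def tokens_per_second(matmuls_per_token: int,
                      compute_s: float,
                      sync_overhead_s: float = 0.0) -> float:
    """Throughput when every matmul pays compute time plus a fixed sync cost."""
    per_token_s = matmuls_per_token * (compute_s + sync_overhead_s)
    return 1.0 / per_token_s


def break_even_overhead(cpu_compute_s: float, gpu_compute_s: float) -> float:
    """Largest per-dispatch overhead at which the GPU path still ties the CPU."""
    return cpu_compute_s - gpu_compute_s


# ILLUSTRATIVE values only (not measured on any real model or device).
MATMULS_PER_TOKEN = 100
CPU_COMPUTE_S = 300e-6    # hypothetical CPU time per matmul
GPU_COMPUTE_S = 100e-6    # hypothetical GPU kernel time per matmul
SYNC_OVERHEAD_S = 250e-6  # hypothetical dispatch + commit + wait cost

cpu = tokens_per_second(MATMULS_PER_TOKEN, CPU_COMPUTE_S)
gpu = tokens_per_second(MATMULS_PER_TOKEN, GPU_COMPUTE_S, SYNC_OVERHEAD_S)
limit = break_even_overhead(CPU_COMPUTE_S, GPU_COMPUTE_S)
print(f"CPU {cpu:.1f} tok/s, GPU {gpu:.1f} tok/s, break-even {limit * 1e6:.0f} us")
```

In this model the GPU kernel is three times faster, yet the GPU path loses as soon as the per-dispatch overhead exceeds the compute time it saves. Batching many matmuls into one command buffer shrinks the overhead term, which is the direction a fix to the dispatch path would likely take.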

Impact on past benchmarks

The CMake default has always been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path. The bug was in our internal benchmark methodology: we built with `-DTQ_BUILD_METAL=ON` and reported the numbers from that slower build.

This means our v0.6.0–v0.6.4 release notes understate the project's actual speed on Apple Silicon: the default build users actually get is 14–22% faster than the figures we published. v0.6.5 republishes the corrected numbers.

Re-baselined (Llama 3.2 3B Instruct, FP32 = 13.56 PPL)

Type	Bytes/block	Compression	tok/s	vs FP32	PPL Δ
FP32 KV	—	1×	18.13	baseline	—
`turbo_kv_4b` ⭐ default	72	7.1×	16.60	−8.4%	+5.7%
`turbo_kv_5b` 🏆 quality	88	5.8×	15.43	−14.9%	+0.7%
`turbo_kv_3b`	56	9.1×	15.77	−13.0%	+13.3%
`turbo_kv_4bo` 🧪	96	5.3×	15.20	−16.2%	+2.5%
`uniform_4b`	68	7.5×	13.27	−26.8%	+7.7%
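
The Compression and vs FP32 columns can be re-derived from the other columns. The sketch below assumes each block holds 128 FP32 values (512 bytes uncompressed). That block size is not stated in these notes; it is inferred only because it reproduces the published ratios, so treat it as an assumption.

```python
FP32_BLOCK_BYTES = 128 * 4  # ASSUMPTION: 128 values per block, 4 bytes each
FP32_TOK_S = 18.13          # FP32 KV baseline from the table above

# {kv_type: (bytes_per_block, tok_s)} copied from the table above
ROWS = {
    "turbo_kv_4b": (72, 16.60),
    "turbo_kv_5b": (88, 15.43),
    "turbo_kv_3b": (56, 15.77),
    "turbo_kv_4bo": (96, 15.20),
    "uniform_4b": (68, 13.27),
}


def compression_ratio(bytes_per_block: int) -> float:
    """Uncompressed block size divided by the quantized block size."""
    return FP32_BLOCK_BYTES / bytes_per_block


def vs_fp32_pct(tok_s: float) -> float:
    """Throughput change relative to the FP32 KV baseline, in percent."""
    return (tok_s / FP32_TOK_S - 1.0) * 100.0


for name, (nbytes, tok_s) in ROWS.items():
    print(f"{name}: {compression_ratio(nbytes):.1f}x, {vs_fp32_pct(tok_s):+.1f}% vs FP32")
```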

The relative gaps are essentially unchanged (turbo_kv_4b is still ~8% slower than fp32): both paths got a similar speedup from removing Metal overhead (14–19% in the 3-run comparison above). Pareto rankings are unchanged.

What we did NOT do

The original P3 plan was to add Metal kernels for the new turbo_kv_4b/5b attention path. We abandoned that plan after measuring that the existing Metal backend is already net negative: adding more Metal kernels would compound the problem until the existing dispatch path is fixed. See issue #16 for the investigation plan.

The third honest correction

This is the third honest correction we've caught and fixed before it spread:

v0.6.0: "lossless 7× compression" → measured "+6.3% PPL"
v0.6.4: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
v0.6.5: "benchmarks with Metal" → re-measured "benchmarks without Metal (the user default)"

Each correction was caught by our validation discipline. Validation > marketing.

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality
```

Do not add `-DTQ_BUILD_METAL=ON` until issue #16 is resolved.
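
Because the earlier benchmark error came from measuring a build that differed from the user default, it can help to check what an existing build directory was configured with before timing it. The sketch below parses `CMakeCache.txt`, whose entries have the form `NAME:TYPE=VALUE`. The helper names are hypothetical, and the sketch assumes `TQ_BUILD_METAL` is stored as an ordinary cache entry.

```python
import re
from pathlib import Path


def cmake_cache_value(cache_text: str, name: str):
    """Return the cached value of `name`, or None if the cache has no such entry."""
    pattern = re.compile(rf"^{re.escape(name)}(?::[A-Z]+)?=(.*)$", re.MULTILINE)
    match = pattern.search(cache_text)
    return match.group(1).strip() if match else None


def metal_enabled(cache_text: str) -> bool:
    """True if the cache records a truthy TQ_BUILD_METAL (absent means OFF)."""
    value = cmake_cache_value(cache_text, "TQ_BUILD_METAL")
    # CMake treats ON, YES, TRUE, Y and 1 as true constants.
    return value is not None and value.upper() in {"ON", "YES", "TRUE", "Y", "1"}


if __name__ == "__main__":
    cache = Path("build/CMakeCache.txt")  # build directory from the commands above
    if cache.exists():
        state = "ON" if metal_enabled(cache.read_text()) else "OFF"
        print(f"TQ_BUILD_METAL is {state} in {cache}")
    else:
        print(f"{cache} not found; run the cmake configure step first")
```

A missing entry is treated as OFF here because the notes state that OFF is the CMake default.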

Tests

35/35 unit tests pass on macOS / Linux / Windows.

Filed

Issue #16 — Metal backend currently slower than CPU-only on all tested models
