v0.6.5 — Re-baseline without Metal (3rd honest correction)

@unamedkr released this 08 Apr 11:17

🚨 P3 Metal investigation found the existing Metal backend is slower than CPU-only

While exploring Option P3 (GPU compute graph for KV attention), we measured the existing Metal backend (`TQ_BUILD_METAL=ON`) and found that it never helps: it is neutral on the smallest model we tested and slower than CPU-only on every larger one.

The numbers (3 runs each, Llama 3.2 3B Instruct PPL eval)

| Build | KV type | tok/s |
|---|---|---|
| Metal ON | fp32 | 15.07 |
| Metal OFF | fp32 | 17.87 (+19%) |
| Metal ON | turbo_kv_4b | 14.17 |
| Metal OFF | turbo_kv_4b | 16.53 (+17%) |
| Metal ON | turbo_kv_5b | 13.43 |
| Metal OFF | turbo_kv_5b | 15.33 (+14%) |
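
The percentages in parentheses can be reproduced from the tok/s figures. The following is a minimal sketch of that arithmetic; the dictionary layout and the `speedup_pct` helper are illustrative names, not part of the project.

```python
# tok/s copied from the table above (Llama 3.2 3B Instruct, 3 runs each).
measured = {
    "fp32": {"metal_on": 15.07, "metal_off": 17.87},
    "turbo_kv_4b": {"metal_on": 14.17, "metal_off": 16.53},
    "turbo_kv_5b": {"metal_on": 13.43, "metal_off": 15.33},
}


def speedup_pct(slow_tok_s, fast_tok_s):
    """Relative throughput gain of the fast build over the slow one, in percent."""
    return (fast_tok_s / slow_tok_s - 1.0) * 100.0


speedups = {
    kv: speedup_pct(runs["metal_on"], runs["metal_off"])
    for kv, runs in measured.items()
}

for kv, pct in speedups.items():
    print(f"{kv}: Metal OFF is {pct:+.0f}% faster than Metal ON")
```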

Across model sizes:

| Model | Metal-OFF speedup |
|---|---|
| SmolLM2 135M | neutral |
| Llama 3.2 1B | +13–17% |
| Llama 3.2 3B | +14–22% |
| Gemma 4 26B | +40% |

Even on the largest model we have access to, Metal is net negative. Each matmul is dispatched, committed, and then blocked on with `waitUntilCompleted`, and at batch-1 inference that per-matmul overhead exceeds the GPU compute benefit.
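
To illustrate why a fixed synchronization cost per matmul can erase the GPU's advantage, here is a toy cost model. Every timing in it is a made-up placeholder chosen only to show the shape of the trade-off; none of these values were measured, and the function names are hypothetical.

```python
def step_time_us(n_matmuls, compute_us, sync_overhead_us=0.0):
    """Time for one decode step when every matmul pays a fixed sync cost."""
    return n_matmuls * (compute_us + sync_overhead_us)


def gpu_is_faster(n_matmuls, cpu_us, gpu_us, sync_overhead_us):
    """Compare a GPU path that blocks after each matmul against a CPU path."""
    gpu_total = step_time_us(n_matmuls, gpu_us, sync_overhead_us)
    cpu_total = step_time_us(n_matmuls, cpu_us)
    return gpu_total < cpu_total


# Hypothetical values, NOT measurements from this project.
N_MATMULS = 200   # matmuls per generated token (assumed)
CPU_US = 250.0    # CPU time per batch-1 matmul (assumed)
GPU_US = 100.0    # GPU kernel time per matmul (assumed)

for overhead_us in (50.0, 150.0, 300.0):
    verdict = "GPU wins" if gpu_is_faster(N_MATMULS, CPU_US, GPU_US, overhead_us) else "CPU wins"
    print(f"sync overhead {overhead_us:>5.0f} us per matmul: {verdict}")
```

In this model the GPU only wins while the per-matmul overhead stays below the difference between CPU and GPU compute time, which is why batching many matmuls behind a single wait is the usual remedy.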

Impact on past benchmarks

The CMake default has always been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path. The bug was in our internal benchmark methodology: we built with `-DTQ_BUILD_METAL=ON` and reported numbers that the default build, the one users actually get, beats by 14–22%.

This means our v0.6.0–v0.6.4 release notes UNDERSTATE the project's actual speed by 14–22% on Apple Silicon. v0.6.5 republishes the corrected numbers.

Re-baselined (Llama 3.2 3B Instruct, FP32 = 13.56 PPL)

| Type | Bytes/block | Compression | tok/s | vs FP32 | PPL Δ |
|---|---|---|---|---|---|
| FP32 KV | | | 18.13 | baseline | |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 16.60 | −8.4% | +5.7% |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 15.43 | −14.9% | +0.7% |
| `turbo_kv_3b` | 56 | 9.1× | 15.77 | −13.0% | +13.3% |
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 15.20 | −16.2% | +2.5% |
| `uniform_4b` | 68 | 7.5× | 13.27 | −26.8% | +7.7% |
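
The Compression and vs FP32 columns can be recomputed from the other columns. The sketch below assumes a block holds 128 FP32 values (512 bytes). That block size is not stated in these notes; it is inferred because it reproduces every published ratio, so treat it as an assumption.

```python
# Assumption: one block covers 128 fp32 values = 512 bytes uncompressed.
FP32_BLOCK_BYTES = 128 * 4
FP32_TOK_S = 18.13  # FP32 KV baseline from the table above

# (bytes per block, tok/s) copied from the table above.
kv_types = {
    "turbo_kv_4b": (72, 16.60),
    "turbo_kv_5b": (88, 15.43),
    "turbo_kv_3b": (56, 15.77),
    "turbo_kv_4bo": (96, 15.20),
    "uniform_4b": (68, 13.27),
}


def compression_ratio(bytes_per_block):
    """Uncompressed block size divided by quantized block size."""
    return FP32_BLOCK_BYTES / bytes_per_block


def vs_fp32_pct(tok_s):
    """Throughput relative to the FP32 KV baseline, in percent."""
    return (tok_s / FP32_TOK_S - 1.0) * 100.0


for name, (block_bytes, tok_s) in kv_types.items():
    print(f"{name}: {compression_ratio(block_bytes):.1f}x, {vs_fp32_pct(tok_s):+.1f}% vs FP32")
```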

The relative gaps are essentially unchanged (turbo_kv_4b is still ~8% slower than fp32): both paths sped up by a comparable amount (14–22% on this model) once the Metal overhead was removed. Pareto rankings unchanged.
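
For readers unfamiliar with the term, a type is Pareto-optimal if no other type is at least as good on every axis and strictly better on one. The sketch below applies that definition to the re-baselined table. The choice of the three axes (compression, tok/s, PPL Δ) is an assumption made for illustration and may not match how the project ranks its types.

```python
# (compression ratio, tok/s, PPL delta in %) copied from the re-baselined table.
rows = {
    "turbo_kv_4b": (7.1, 16.60, 5.7),
    "turbo_kv_5b": (5.8, 15.43, 0.7),
    "turbo_kv_3b": (9.1, 15.77, 13.3),
    "turbo_kv_4bo": (5.3, 15.20, 2.5),
    "uniform_4b": (7.5, 13.27, 7.7),
}


def dominates(a, b):
    """True if a is no worse than b on every axis and strictly better on one.

    Higher compression and tok/s are better; lower PPL delta is better.
    """
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(table):
    """Names of all entries that no other entry dominates."""
    return sorted(
        name
        for name, row in table.items()
        if not any(dominates(other, row) for other in table.values())
    )


print(pareto_front(rows))
```

Under these assumed axes, `turbo_kv_4bo` drops out because `turbo_kv_5b` compresses more, runs faster, and loses less quality.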

What we did NOT do

The original P3 plan was to add Metal kernels for the new turbo_kv_4b/5b attention path. We abandoned that plan after measuring that the existing Metal backend is already net negative: adding more Metal kernels would compound the problem until the existing dispatch path is fixed. See issue #16 for the investigation plan.

The third honest correction

This is the third honest correction we've caught and fixed before it spread:

  1. v0.6.0: "lossless 7× compression" → measured "+6.3% PPL"
  2. v0.6.4: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
  3. v0.6.5: "benchmarks with Metal" → re-measured "benchmarks without Metal (the user default)"

Each correction was caught by our validation discipline. Validation > marketing.

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality
```

Do not add `-DTQ_BUILD_METAL=ON` until issue #16 is resolved.
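
If you are unsure whether an existing build directory was configured with Metal, you can inspect its `CMakeCache.txt`, which stores options as `NAME:TYPE=VALUE` lines. The following is a small sketch of such a check; the `metal_enabled` helper is a hypothetical name and not part of the project.

```python
import re

# Common CMake spellings of "true" for a boolean option.
TRUTHY = {"ON", "1", "YES", "TRUE", "Y"}


def metal_enabled(cache_text, option="TQ_BUILD_METAL"):
    """Return True if `option` is set to a truthy value in CMakeCache.txt text."""
    pattern = re.compile(
        rf"^{re.escape(option)}(?::[A-Z]+)?=(.*)$", re.MULTILINE
    )
    match = pattern.search(cache_text)
    if match is None:
        # Option not in the cache; per these notes the default is OFF.
        return False
    return match.group(1).strip().upper() in TRUTHY


# Example with an inline cache snippet instead of a real build directory.
sample_cache = "CMAKE_BUILD_TYPE:STRING=Release\nTQ_BUILD_METAL:BOOL=OFF\n"
print(metal_enabled(sample_cache))
```

To check a real build, pass the file contents instead, for example `metal_enabled(pathlib.Path("build/CMakeCache.txt").read_text())`.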

Tests

35/35 unit tests pass on macOS / Linux / Windows.

Filed

  • Issue #16 — Metal backend currently slower than CPU-only on all tested models