v0.6.5 — Re-baseline without Metal (3rd honest correction)
🚨 P3 Metal investigation found the existing Metal backend is slower than CPU-only
While exploring Option P3 (GPU compute graph for KV attention), we measured the existing Metal backend (`TQ_BUILD_METAL=ON`) and discovered it never beats the CPU-only build: it is neutral on the smallest model we tested and net negative on every larger one.
The numbers (3 runs each, Llama 3.2 3B Instruct PPL eval)
| Build | KV type | tok/s | vs Metal ON |
|---|---|---|---|
| Metal ON | fp32 | 15.07 | baseline |
| Metal OFF | fp32 | 17.87 | +19% |
| Metal ON | turbo_kv_4b | 14.17 | baseline |
| Metal OFF | turbo_kv_4b | 16.53 | +17% |
| Metal ON | turbo_kv_5b | 13.43 | baseline |
| Metal OFF | turbo_kv_5b | 15.33 | +14% |
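The percentage gains in the table above follow directly from each Metal ON / Metal OFF tok/s pair. A minimal sketch for re-deriving them, where the helper name `speedup_pct` is made up for illustration and is not part of the project:

```bash
# Relative speedup of the Metal-OFF build over the Metal-ON build, in percent.
# Arguments: tok/s with Metal ON, tok/s with Metal OFF.
speedup_pct() {
  awk -v on="$1" -v off="$2" 'BEGIN { printf "%.1f\n", (off / on - 1) * 100 }'
}

speedup_pct 15.07 17.87   # fp32
speedup_pct 14.17 16.53   # turbo_kv_4b
speedup_pct 13.43 15.33   # turbo_kv_5b
```

Rounded to whole percentages, the three results match the +19%, +17%, and +14% reported in the table.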
Across model sizes:
| Model | Metal-OFF speedup |
|---|---|
| SmolLM2 135M | neutral |
| Llama 3.2 1B | +13–17% |
| Llama 3.2 3B | +14–22% |
| Gemma 4 26B | +40% |
Even on the largest model we have access to, Metal is net negative. The backend dispatches, commits, and calls `waitUntilCompleted` for each matmul, and at batch-1 inference that per-matmul overhead exceeds the GPU compute benefit.
Impact on past benchmarks
The CMake default has always been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path. The bug was in our internal benchmark methodology: we built with `-DTQ_BUILD_METAL=ON` and reported numbers that were 14–22% slower than what users actually get.
This means our v0.6.0–v0.6.4 release notes UNDERSTATE the project's actual speed by 14–22% on Apple Silicon. v0.6.5 republishes the corrected numbers.
Re-baselined (Llama 3.2 3B Instruct, FP32 = 13.56 PPL)
| Type | Bytes/block | Compression | tok/s | vs FP32 | PPL Δ |
|---|---|---|---|---|---|
| FP32 KV | — | 1× | 18.13 | baseline | — |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 16.60 | −8.4% | +5.7% |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 15.43 | −14.9% | +0.7% |
| `turbo_kv_3b` | 56 | 9.1× | 15.77 | −13.0% | +13.3% |
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 15.20 | −16.2% | +2.5% |
| `uniform_4b` | 68 | 7.5× | 13.27 | −26.8% | +7.7% |
The relative gaps are essentially unchanged (`turbo_kv_4b` is still ~8% slower than fp32) because both paths gained a similar amount (roughly 14–19% in the Metal ON/OFF measurements above) from removing Metal overhead. Pareto rankings are unchanged.
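The "vs FP32" and "Compression" columns can be re-derived from the raw tok/s and bytes/block figures. The sketch below is illustrative only: the helper names are hypothetical, and the 512-byte uncompressed block size (for example 128 fp32 values) is an assumption that these notes do not state. It was chosen because it reproduces the Compression column.

```bash
# ASSUMPTION: one uncompressed block is 512 bytes (e.g. 128 fp32 values).
# This is not stated in the release notes; it is inferred from the table.
UNCOMPRESSED_BLOCK_BYTES=512

# Compression ratio for a given quantized block size in bytes.
compression_ratio() {
  awk -v raw="$UNCOMPRESSED_BLOCK_BYTES" -v q="$1" 'BEGIN { printf "%.1f\n", raw / q }'
}

# Throughput change relative to the FP32 baseline, in percent.
# Arguments: tok/s of the KV type, tok/s of FP32.
vs_fp32_pct() {
  awk -v t="$1" -v base="$2" 'BEGIN { printf "%.1f\n", (t / base - 1) * 100 }'
}

compression_ratio 72      # turbo_kv_4b
vs_fp32_pct 16.60 18.13   # turbo_kv_4b
compression_ratio 88      # turbo_kv_5b
vs_fp32_pct 15.43 18.13   # turbo_kv_5b
```

Under that assumption the output lines up with the 7.1× / −8.4% and 5.8× / −14.9% rows of the table.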
What we did NOT do
The original P3 plan was to add Metal kernels for the new `turbo_kv_4b`/`turbo_kv_5b` attention path. We abandoned that plan once measurements showed the existing Metal backend is already net negative: adding more Metal kernels would compound the problem until the existing dispatch path is fixed. See issue #16 for the investigation plan.
The third honest correction
This is the third honest correction we've caught and fixed before it spread:
- v0.6.0: "lossless 7× compression" → measured "+6.3% PPL"
- v0.6.4: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
- v0.6.5: "benchmarks with Metal" → re-measured "benchmarks without Metal (the user default)"
Each correction was caught by our validation discipline. Validation > marketing.
What you should use
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j
./build/quant model.gguf # turbo_kv_4b default
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality
```
Do not add `-DTQ_BUILD_METAL=ON` until issue #16 is resolved.
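An existing build directory may still carry `TQ_BUILD_METAL=ON` from an earlier configure, because CMake keeps `-D` values in its cache. A small sketch for checking this, assuming the build directory is `build` as in the commands above (the helper name `metal_setting` is hypothetical):

```bash
# Report the TQ_BUILD_METAL value cached in a CMake build directory.
# CMake records cache entries in <build-dir>/CMakeCache.txt as NAME:TYPE=VALUE.
metal_setting() {
  cache_file="${1:-build}/CMakeCache.txt"
  # No match (or no cache file) means the variable was never cached.
  entry=$(grep -E '^TQ_BUILD_METAL(:[A-Z]+)?=' "$cache_file" 2>/dev/null) || {
    echo "unset"
    return 0
  }
  echo "${entry#*=}"   # keep only the text after the first '='
}

metal_setting build   # prints the cached value (e.g. ON or OFF), or "unset"
```

If this prints `ON`, re-running `cmake -B build -DTQ_BUILD_METAL=OFF` and rebuilding overrides the cached value.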
Tests
35/35 unit tests pass on macOS / Linux / Windows.
Filed
- Issue #16 — Metal backend currently slower than CPU-only on all tested models