| 1 | 1 | # Changelog |
| 2 | 2 | |
| 3 | +## [0.6.5] — 2026-04-08 |
| 4 | + |
| 5 | +### 🚨 Re-baseline: all benchmarks now CPU-only (Metal is slower) |
| 6 | + |
| 7 | +The P3 investigation (Metal compute graph for KV attention) revealed that the existing Metal backend (`TQ_BUILD_METAL=ON`) is never faster than CPU-only on any model size we tested: it is within noise on SmolLM2 135M and **net negative** from 1B upward, where CPU-only builds run 13–40% faster. The CMake default has always been `OFF`, so end users were already getting the fast path. Our internal benchmarks, however, including all numbers in the v0.6.0–v0.6.4 release notes, were built with `-DTQ_BUILD_METAL=ON`, so the published numbers understate what users actually get (by 14–22% on Llama 3.2 3B). |
| 8 | + |
| 9 | +### Re-baselined numbers (Llama 3.2 3B Instruct, FP32 baseline = 13.56 PPL) |
| 10 | + |
| 11 | +| Type | Bytes/block | tok/s (Metal OFF) | vs FP32 | PPL Δ | |
| 12 | +|---|---:|---:|---:|---:| |
| 13 | +| **FP32 KV** | — | **18.13** | baseline | — | |
| 14 | +| **`turbo_kv_4b`** ⭐ | 72 | 16.60 | **−8.4%** | +5.7% | |
| 15 | +| `turbo_kv_3b` | 56 | 15.77 | −13.0% | +13.3% | |
| 16 | +| **`turbo_kv_5b`** 🏆 | 88 | 15.43 | −14.9% | +0.7% | |
| 17 | +| `turbo_kv_4bo` | 96 | 15.20 | −16.2% | +2.5% | |
| 18 | +| `uniform_4b` | 68 | 13.27 | −26.8% | +7.7% | |
| 19 | + |
| 20 | +The relative gaps to FP32 are essentially unchanged (`turbo_kv_4b` is still ~8% slower), because the FP32 and quantized KV paths both gained roughly the same ~20% from removing Metal overhead. The Pareto rankings are unchanged. |
| 21 | + |
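For readers who want to check the "vs FP32" column, the following is a minimal sketch that recomputes it from the tok/s figures in the table above. The variable and function names are illustrative assumptions, not part of the project's benchmark tooling.

```python
# Throughput figures copied from the re-baselined table (tok/s, Metal OFF).
FP32_TOK_S = 18.13
tok_s = {
    "turbo_kv_4b": 16.60,
    "turbo_kv_3b": 15.77,
    "turbo_kv_5b": 15.43,
    "turbo_kv_4bo": 15.20,
    "uniform_4b": 13.27,
}


def pct_vs_baseline(value: float, baseline: float) -> float:
    """Relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Rounded to one decimal place, matching the precision used in the table.
vs_fp32 = {
    name: round(pct_vs_baseline(v, FP32_TOK_S), 1) for name, v in tok_s.items()
}

for name, delta in vs_fp32.items():
    print(f"{name:>13}: {delta:+.1f}%")
```

Each result matches the corresponding table entry to one decimal place, for example −8.4% for `turbo_kv_4b` and −26.8% for `uniform_4b`.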
| 22 | +### Cross-model Metal investigation |
| 23 | + |
| 24 | +| Model | Metal OFF speedup vs Metal ON | |
| 25 | +|---|---| |
| 26 | +| SmolLM2 135M | neutral (within noise) | |
| 27 | +| Llama 3.2 1B | +13–17% | |
| 28 | +| Llama 3.2 3B | +14–22% | |
| 29 | +| Gemma 4 26B | **+40%** | |
| 30 | + |
| 31 | +Even on the largest model (Gemma 4 26B), Metal is net negative: at batch-1 inference, per-matmul dispatch overhead plus the `waitUntilCompleted` sync cost more than the GPU compute saves. We filed [issue #16](https://github.com/quantumaikr/quant.cpp/issues/16) with an investigation plan. |
| 32 | + |
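The table above states how much faster Metal OFF is than Metal ON. The same gap expressed as "how much slower is Metal ON" is a smaller number, because the reference point changes. The sketch below shows the conversion using only the percentages from the table; the function name is a hypothetical helper, not part of the codebase.

```python
def slowdown_from_speedup(speedup_pct: float) -> float:
    """Given 'A is speedup_pct percent faster than B',
    return how many percent slower B is than A."""
    ratio = 1.0 + speedup_pct / 100.0  # throughput of A relative to B
    return (1.0 - 1.0 / ratio) * 100.0


# Metal OFF speedups vs Metal ON, as (low, high) bounds from the table.
speedups = {
    "Llama 3.2 1B": (13, 17),
    "Llama 3.2 3B": (14, 22),
    "Gemma 4 26B": (40, 40),
}

# Equivalent "Metal ON is X% slower" bounds, rounded to one decimal place.
slowdowns = {
    model: tuple(round(slowdown_from_speedup(s), 1) for s in bounds)
    for model, bounds in speedups.items()
}

for model, (low, high) in slowdowns.items():
    print(f"{model}: Metal ON is {low}% to {high}% slower than Metal OFF")
```

Under this conversion, the +40% speedup on Gemma 4 26B corresponds to Metal ON being about 28.6% slower than CPU-only.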
| 33 | +### What changed in v0.6.5 |
| 34 | + |
| 35 | +| File | Change | |
| 36 | +|------|--------| |
| 37 | +| `README.md`, `README.ko.md` | Re-baselined headline tables and ASCII charts. New build note linking to issue #16. | |
| 38 | +| `CHANGELOG.md` | This entry. | |
| 39 | +| Issue #16 | Filed: Metal backend is currently slower than CPU-only | |
| 40 | + |
| 41 | +No source code changes — the CMake default was already `OFF`. The bug was in our internal benchmark methodology (we built with Metal ON without realizing it was slowing things down). |
| 42 | + |
| 43 | +### Honest corrections so far in the v0.6.x series |
| 44 | + |
| 45 | +This is now the **third** honest correction we've caught and fixed before it spread: |
| 46 | + |
| 47 | +1. **v0.6.0**: "lossless 7× compression" → measured "+6.3% PPL on Llama 3.2 3B" |
| 48 | +2. **v0.6.4**: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)" |
| 49 | +3. **v0.6.5**: "benchmarks with Metal" → re-measured "benchmarks without Metal (which is the user default)" |
| 50 | + |
| 51 | +Each correction was caught by the validation discipline documented in our `feedback_validation_first` memory. **Validation > marketing.** |
| 52 | + |
| 3 | 53 | ## [0.6.4] — 2026-04-08 |
| 4 | 54 | |
| 5 | 55 | ### Honest validation pass |