Skip to content

Commit b7b20b8

Browse files
committed
CHANGELOG: v0.6.5 — Metal re-baseline (3rd honest correction)
1 parent 0b3b958 commit b7b20b8

File tree

1 file changed

+50
-0
lines changed

1 file changed

+50
-0
lines changed

CHANGELOG.md

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,55 @@
11
# Changelog
22

3+
## [0.6.5] — 2026-04-08
4+
5+
### 🚨 Re-baseline: all benchmarks now CPU-only (Metal is slower)
6+
7+
P3 (Metal compute graph for KV attention) investigation revealed that the existing Metal backend (`TQ_BUILD_METAL=ON`) is **net negative** on every model size we tested — 13–40% slower than CPU-only. The CMake default has always been `OFF`, so end users were getting the fast path. But our internal benchmarks (including all numbers in v0.6.0–v0.6.4 release notes) used `-DTQ_BUILD_METAL=ON` and were therefore 14–22% slower than what users actually get.
8+
9+
### Re-baselined numbers (Llama 3.2 3B Instruct, FP32 baseline = 13.56 PPL)
10+
11+
| Type | Bytes/block | tok/s (Metal OFF) | vs FP32 | PPL Δ |
12+
|---|---:|---:|---:|---:|
13+
| **FP32 KV** || **18.13** | baseline ||
14+
| **`turbo_kv_4b`**| 72 | 16.60 | **−8.4%** | +5.7% |
15+
| `turbo_kv_3b` | 56 | 15.77 | −13.0% | +13.3% |
16+
| **`turbo_kv_5b`** 🏆 | 88 | 15.43 | −14.9% | +0.7% |
17+
| `turbo_kv_4bo` | 96 | 15.20 | −16.2% | +2.5% |
18+
| `uniform_4b` | 68 | 13.27 | −26.8% | +7.7% |
19+
20+
The relative gaps to FP32 are essentially unchanged (turbo_kv_4b is still ~8% slower) — both paths got the same ~20% speedup from removing Metal overhead. Pareto rankings unchanged.
21+
22+
### Cross-model Metal investigation
23+
24+
| Model | Metal OFF speedup vs Metal ON |
25+
|---|---|
26+
| SmolLM2 135M | neutral (within noise) |
27+
| Llama 3.2 1B | +13–17% |
28+
| Llama 3.2 3B | +14–22% |
29+
| Gemma 4 26B | **+40%** |
30+
31+
Even on the largest model (Gemma 4 26B), Metal is net negative. Per-matmul dispatch overhead + waitUntilCompleted sync exceed the GPU compute benefit at batch-1 inference. Filed [issue #16](https://github.com/quantumaikr/quant.cpp/issues/16) with investigation plan.
32+
33+
### What changed in v0.6.5
34+
35+
| File | Change |
36+
|------|--------|
37+
| `README.md`, `README.ko.md` | Re-baselined headline tables and ASCII charts. New build note linking to issue #16. |
38+
| `CHANGELOG.md` | This entry. |
39+
| Issue #16 | Filed: Metal backend is currently slower than CPU-only |
40+
41+
No source code changes — the CMake default was already `OFF`. The bug was in our internal benchmark methodology (we built with Metal ON without realizing it was slowing things down).
42+
43+
### Honest corrections so far in the v0.6.x series
44+
45+
This is now the **third** honest correction we've caught and fixed before it spread:
46+
47+
1. **v0.6.0**: "lossless 7× compression" → measured "+6.3% PPL on Llama 3.2 3B"
48+
2. **v0.6.4**: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
49+
3. **v0.6.5**: "benchmarks with Metal" → re-measured "benchmarks without Metal (which is the user default)"
50+
51+
Each correction was caught by the validation discipline documented in our `feedback_validation_first` memory. **Validation > marketing.**
52+
353
## [0.6.4] — 2026-04-08
454

555
### Honest validation pass

0 commit comments

Comments
 (0)