Commit b78ae1c

CHANGELOG: v0.6.4 — honest validation pass + corrections
1 parent 75719f5 commit b78ae1c

1 file changed

Lines changed: 47 additions & 0 deletions

CHANGELOG.md

@@ -1,5 +1,52 @@
# Changelog

## [0.6.4] — 2026-04-08

### Honest validation pass

This patch release exists to publish the **corrected** speed numbers
for v0.6.3 prominently. The v0.6.3 release shipped with the wrong
headline ('turbo_kv beats fp32 KV speed') because the fp32 attention
path was being compared in unoptimized scalar form. After the NEON fix,
the honest gap is **−7% to −12%**, not **+5% to +10%**.

### What changed in this release

- **`tq_transformer.c`**: NEON-optimized the fp32 attention path
  (commit `4490c83`). FP32 attention went from 12.6 → 14.83 tok/s
  on Llama 3.2 3B (+18% standalone improvement).
- **README.md / README.ko.md**: corrected the headline tables and
  ASCII charts to reflect the honest fp32-NEON comparison
  (commit `33b6315`).
- **GitHub release notes for v0.6.3**: updated with a prominent
  Correction notice at the top.
- **`tq_transformer.c`**: Round 8 prefetch attempt reverted (no
  measurable benefit on Apple M1 Pro). Round 9 strided-attention
  not pursued (would require ABI change with no clear win).

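As a rough sanity check on the arithmetic above, the sketch below recomputes the standalone fp32 improvement from the two throughput figures quoted in the list, and shows how the sign of a "vs fp32" gap depends on which baseline is used. The helper `percent_change` is a hypothetical name. Pairing the 13.57 tok/s `turbo_kv_4b` average from the results table with the 12.6 and 14.83 tok/s figures is an assumption made for illustration only; the changelog does not say these numbers come from the same benchmark run.

```python
def percent_change(new: float, old: float) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


# Throughput figures quoted in this changelog (tok/s, Llama 3.2 3B).
fp32_scalar = 12.6    # fp32 attention before the NEON fix
fp32_neon = 14.83     # fp32 attention after the NEON fix
turbo_kv_4b = 13.57   # ASSUMPTION: treated as comparable to the two above

neon_gain = percent_change(fp32_neon, fp32_scalar)
gap_vs_scalar = percent_change(turbo_kv_4b, fp32_scalar)
gap_vs_neon = percent_change(turbo_kv_4b, fp32_neon)

print(f"fp32 NEON vs scalar:       {neon_gain:+.1f}%")
print(f"turbo_kv_4b vs scalar fp32: {gap_vs_scalar:+.1f}%")
print(f"turbo_kv_4b vs NEON fp32:   {gap_vs_neon:+.1f}%")
```

Under that assumption, the same quantized result lands ahead of the scalar baseline and behind the NEON one, which is the shape of the correction this release publishes.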
### Final honest numbers (3 runs each, Llama 3.2 3B PPL eval)

| Type | Avg tok/s | vs FP32 | PPL Δ | Compression |
|---|---:|---:|---:|---:|
| **FP32 KV** (NEON) | **14.63** | baseline | | |
| **`turbo_kv_4b`** ⭐ default | 13.57 | **−7.2%** | +5.7% | **7.1×** |
| **`turbo_kv_3b`** | 13.13 | −10.2% | +13.3% | 9.1× |
| **`turbo_kv_5b`** 🏆 quality | 12.90 | −11.8% | +0.7% | 5.8× |

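The "vs FP32" column can be reproduced from the "Avg tok/s" column. The following is a minimal sketch, assuming the gap is defined as `(tok_s - baseline) / baseline`; the changelog does not spell out the formula, and the function name `gap_vs_baseline` is hypothetical. Because the published averages are already rounded to two decimals, the check allows a 0.1 percentage-point tolerance.

```python
FP32_NEON_TOK_S = 14.63  # baseline row of the table

# name -> (average tok/s, published "vs FP32" gap in percent)
published = {
    "turbo_kv_4b": (13.57, -7.2),
    "turbo_kv_3b": (13.13, -10.2),
    "turbo_kv_5b": (12.90, -11.8),
}


def gap_vs_baseline(tok_s: float, baseline: float = FP32_NEON_TOK_S) -> float:
    """Throughput gap against the baseline, in percent (negative = slower)."""
    return (tok_s - baseline) / baseline * 100.0


recomputed = {name: gap_vs_baseline(tok_s) for name, (tok_s, _) in published.items()}

for name, (tok_s, stated) in published.items():
    delta = abs(recomputed[name] - stated)
    print(f"{name}: recomputed {recomputed[name]:+.2f}%, published {stated:+.1f}%, "
          f"difference {delta:.2f} pp")
```

All three rows agree with the published column to within that tolerance, so the table is internally consistent with a 14.63 tok/s baseline.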
### What we learned

1. **Validation is the most valuable step.** It found the wrong claim
   before it spread to users.
2. **The Round 5 win is real.** turbo_kv_4b went from 6.9 → 13.6 tok/s
   (+97%). Only the comparison baseline was wrong.
3. **Local optimum reached.** Round 8 (prefetch) gave no measurable
   improvement, and Round 9 (strided gather) was not pursued because it
   would require an ABI change with no clear win. Further wins would
   need structural changes (e.g., a different KV cache memory layout,
   or true parallel attention dispatch).
4. **Pareto improvement is still real.** turbo_kv_4b beats
   `uniform_4b` on quality (14.33 vs 14.60 PPL) and speed (13.57 vs
   11.7 tok/s), and gives up only a little compression (7.1× vs 7.5×),
   so it is not a strict Pareto win on that last axis.

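The comparison against `uniform_4b` in the last item can be checked axis by axis. Below is a small sketch of a strict Pareto-dominance test over the three metrics quoted there. The `KVConfig` type and both function names are hypothetical, and the choice of "lower PPL, higher tok/s, higher compression is better" is the assumed ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVConfig:
    name: str
    ppl: float          # perplexity, lower is better
    tok_s: float        # throughput, higher is better
    compression: float  # KV compression ratio, higher is better


def wins(a: KVConfig, b: KVConfig) -> dict:
    """Per-axis strict wins of `a` over `b`."""
    return {
        "quality": a.ppl < b.ppl,
        "speed": a.tok_s > b.tok_s,
        "compression": a.compression > b.compression,
    }


def pareto_dominates(a: KVConfig, b: KVConfig) -> bool:
    """True if `a` is no worse than `b` on every axis and better on at least one."""
    no_worse = a.ppl <= b.ppl and a.tok_s >= b.tok_s and a.compression >= b.compression
    return no_worse and any(wins(a, b).values())


turbo = KVConfig("turbo_kv_4b", ppl=14.33, tok_s=13.57, compression=7.1)
uniform = KVConfig("uniform_4b", ppl=14.60, tok_s=11.7, compression=7.5)

print(wins(turbo, uniform))
print("turbo dominates uniform:", pareto_dominates(turbo, uniform))
print("uniform dominates turbo:", pareto_dominates(uniform, turbo))
```

With the quoted numbers, `turbo_kv_4b` wins two of three axes and neither configuration strictly dominates the other; whether the 7.1× vs 7.5× difference matters is a judgment call the changelog makes in favor of `turbo_kv_4b`.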
## [0.6.3] — 2026-04-08

### Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%
