1 | 1 | # Changelog |
2 | 2 |
| 3 | +## [0.6.4] — 2026-04-08 |
| 4 | + |
| 5 | +### Honest validation pass |
| 6 | + |
| 7 | +This patch release exists to publish the **corrected** speed numbers |
| 8 | +for v0.6.3 prominently. The v0.6.3 release shipped with an incorrect |
| 9 | +headline ('turbo_kv beats fp32 KV speed') because the fp32 attention |
| 10 | +path it was compared against was still unoptimized scalar code. After |
| 11 | +the NEON fix, the honest gap is **−7% to −12%**, not **+5% to +10%**. |
| 12 | + |
| 13 | +### What changed in this release |
| 14 | + |
| 15 | +- **`tq_transformer.c`**: NEON-optimized the fp32 attention path |
| 16 | + (commit `4490c83`). FP32 attention went from 12.6 → 14.83 tok/s |
| 17 | + on Llama 3.2 3B (+18% standalone improvement). |
| 18 | +- **README.md / README.ko.md**: corrected the headline tables and |
| 19 | + ASCII charts to reflect the honest fp32-NEON comparison |
| 20 | + (commit `33b6315`). |
| 21 | +- **GitHub release notes for v0.6.3**: updated with a prominent |
| 22 | + Correction notice at the top. |
| 23 | +- **`tq_transformer.c`**: Round 8 prefetch attempt reverted (no |
| 24 | + measurable benefit on Apple M1 Pro). Round 9 strided-attention |
| 25 | + not pursued (would require ABI change with no clear win). |
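
For readers unfamiliar with what "NEON-optimizing" an fp32 attention path involves, here is a minimal sketch of the query-key dot product that produces one attention score. It is an illustration only and is not taken from `tq_transformer.c`: the function name `attn_dot_f32` is hypothetical, and the real change in commit `4490c83` may be structured quite differently. The scalar tail loop doubles as the fallback on non-AArch64 targets.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not the project's code): dot product of a query
 * vector q and a key vector k, each holding n fp32 values. */
float attn_dot_f32(const float *q, const float *k, size_t n) {
    assert(q != NULL && k != NULL);
    size_t i = 0;
    float sum = 0.0f;
#if defined(__aarch64__)
    /* Process four floats per iteration with a fused multiply-add. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(q + i), vld1q_f32(k + i));
    }
    sum = vaddvq_f32(acc); /* horizontal add of the four lanes */
#endif
    /* Scalar tail, and the whole loop when NEON is unavailable. */
    for (; i < n; ++i) {
        sum += q[i] * k[i];
    }
    return sum;
}
```

A scalar-only build runs the same arithmetic one element at a time, which is the kind of baseline the v0.6.3 comparison was accidentally made against.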
| 26 | + |
| 27 | +### Final honest numbers (3 runs each, Llama 3.2 3B PPL eval) |
| 28 | + |
| 29 | +| Type | Avg tok/s | vs FP32 | PPL Δ | Compression | |
| 30 | +|---|---:|---:|---:|---:| |
| 31 | +| **FP32 KV** (NEON) | **14.63** | baseline | — | 1× | |
| 32 | +| **`turbo_kv_4b`** ⭐ default | 13.57 | **−7.2%** | +5.7% | **7.1×** | |
| 33 | +| **`turbo_kv_3b`** | 13.13 | −10.2% | +13.3% | 9.1× | |
| 34 | +| **`turbo_kv_5b`** 🏆 quality | 12.90 | −11.8% | +0.7% | 5.8× | |
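
The "vs FP32" column is a relative throughput difference against the FP32 KV baseline. A small sketch of that arithmetic follows; the helper name `rel_speed_pct` is an assumption made for illustration. Recomputing from the rounded averages shown in the table can differ from the published figure in the last digit (for example, `turbo_kv_3b` comes out near −10.25%), presumably because the table was derived from unrounded averages.

```c
#include <assert.h>

/* Hypothetical helper: percentage difference of a measured throughput
 * against a baseline throughput, both in tokens per second. */
double rel_speed_pct(double tok_s, double baseline_tok_s) {
    assert(baseline_tok_s > 0.0);
    return (tok_s / baseline_tok_s - 1.0) * 100.0;
}
```

For example, `rel_speed_pct(13.57, 14.63)` gives roughly −7.2, matching the `turbo_kv_4b` row, and `rel_speed_pct(14.83, 12.6)` gives roughly +18, matching the standalone fp32 improvement.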
| 35 | + |
| 36 | +### What we learned |
| 37 | + |
| 38 | +1. **Validation is the most valuable step.** It found the wrong claim |
| 39 | + before it spread to users. |
| 40 | +2. **The Round 5 win is real.** turbo_kv_4b went from 6.9 → 13.6 tok/s |
| 41 | + (+97%). Only the comparison baseline was wrong. |
| 42 | +3. **Local optimum reached.** Round 8 (prefetch) gave no measurable |
| 43 | + improvement; Round 9 (strided gather) was not pursued. Further wins |
| 44 | + would need structural changes (e.g., a different KV cache memory |
| 45 | + layout, or true parallel attention dispatch). |
| 46 | +4. **Near-Pareto improvement is still real.** turbo_kv_4b beats |
| 47 | + `uniform_4b` on quality (14.33 vs 14.60 PPL) and speed (13.57 vs |
| 48 | + 11.7 tok/s) while giving up a little compression (7.1× vs 7.5×). |
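
To make the comparison with `uniform_4b` precise, the sketch below checks strict Pareto dominance across the three axes quoted above. The struct and function names are hypothetical and exist only for this illustration. Under the strict definition, the compression axis (7.1× vs 7.5×) means neither configuration dominates the other: `turbo_kv_4b` wins on perplexity and speed, `uniform_4b` wins narrowly on compression.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of one KV cache configuration's results. */
typedef struct {
    double ppl;         /* perplexity, lower is better */
    double tok_s;       /* tokens per second, higher is better */
    double compression; /* compression ratio, higher is better */
} kv_result;

/* True if a is no worse than b on every axis and strictly better on
 * at least one (strict Pareto dominance). */
bool dominates(kv_result a, kv_result b) {
    assert(a.ppl > 0.0 && b.ppl > 0.0);
    bool no_worse = a.ppl <= b.ppl
                 && a.tok_s >= b.tok_s
                 && a.compression >= b.compression;
    bool better = a.ppl < b.ppl
               || a.tok_s > b.tok_s
               || a.compression > b.compression;
    return no_worse && better;
}
```

With `{14.33, 13.57, 7.1}` for `turbo_kv_4b` and `{14.60, 11.7, 7.5}` for `uniform_4b`, `dominates` returns false in both directions, so the result is best read as a favorable trade-off and not a strict dominance.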
| 49 | + |
3 | 50 | ## [0.6.3] — 2026-04-08 |
4 | 51 |
5 | 52 | ### Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8% |