Commit e6ac01c

v0.7.1 release: README + CHANGELOG with Round 11 honest numbers
1 parent 63e26f9 commit e6ac01c

2 files changed: +53 −4

CHANGELOG.md

Lines changed: 49 additions & 0 deletions
@@ -1,5 +1,54 @@
# Changelog

## [0.7.1] — 2026-04-08
### Round 11 — NEON tbl pattern applied to 3b/5b (partial parity)

After Round 10 (turbo_kv_4b at fp32 parity via `vqtbl1q_s8`), Round 11 applied the same SIMD codebook lookup pattern to the remaining production variants. The lookup side scales beautifully (1 instruction per 16 lanes for any small codebook), but the **bit-unpack side** is the new bottleneck for non-byte-aligned packing.

Llama 3.2 3B PPL eval, 3 runs each (CPU-only, no Metal):
| Type | Round 10 → Round 11 | Δ | vs FP32 (R11) | PPL Δ |
|---|---|---:|---:|---:|
| FP32 | 17.87 → 18.43 t/s | +3% | baseline | |
| `turbo_kv_3b` | 16.10 → 16.57 t/s | +3% | **−10.1%** | +13.3% |
| **`turbo_kv_4b`** | 18.17 → 18.17 t/s | parity (R10 stable) | **−1.4%** | +3.8% |
| `turbo_kv_5b` 🏆 | 15.43 → 16.80 t/s | **+9%** | **−8.8%** | +0.7% |
### Why 4b reached parity but 3b/5b didn't

| Type | Bit packing | Unpack | Result |
|---|---|---|---|
| 4b | byte-aligned (2 nibbles/byte) | pure SIMD `vandq_u8` + `vshrq_n_u8` | **parity** |
| 3b | bit-aligned (irregular 3-bit fields) | uint64 read + scalar shifts | −10.1% |
| 5b | bit-aligned (irregular 5-bit fields) | uint64 read + scalar shifts | −8.8% |
For 3-bit and 5-bit, 16 indices straddle byte boundaries irregularly. We use the fastest scalar unpack we found (uint64 read + 16 scalar shifts + `vld1q_u8`) but it costs ~16 instructions per 16-element iteration. The lookup itself is 1 instruction. So the unpack now dominates for 3b/5b.

### Insight: matmul was already using the same pattern

While investigating other optimization axes, we discovered that the GGUF Q4 matmul code (`tq_gguf_quants.c:1561`) **already uses `vqtbl1q_s8`** for the codebook lookup. That's why fp32 and turbo_kv show near-identical matmul time (38.6 vs 38.9 ms in profile): both paths share the same NEON tbl matmul kernel.
This is why Round 10 worked: we'd been using NEON tbl in matmul since v0.5, but had built the attention path with scalar gather. Once we applied the same primitive to attention, the gap closed. Round 11 extended it to 3b/5b but hit the bit-packing constraint.

### What's NOT in v0.7.1

- 5b/3b at full parity. The remaining gap is in the unpack, not the lookup. Closing it needs either (a) a layout change (1-byte-per-index, sacrificing compression), (b) a SIMD bit-extraction trick, or (c) acceptance. We chose (c) for v0.7.1 with honest disclosure.
- `turbo_kv_4bo` / `turbo_kv_3bo` — research types, still on Round 9 path
- AVX2 / WASM SIMD ports of the NEON tbl pattern — separate session
### What changed in v0.7.1
| File | Change |
|------|--------|
| `src/core/tq_turbo_kv.c::tq_turbo_kv_3b_attention_ref` | NEON tbl + uint64 unpack |
| `src/core/tq_turbo_kv.c::tq_turbo_kv_5b_attention_ref` | NEON tbl + uint64 unpack |
| `README.md`, `README.ko.md` | Round 11 numbers |
| `CHANGELOG.md` | This entry |
| Memory `feedback_simd_unpack_constraint.md` | Documents the byte-alignment constraint for future work |

35/35 tests pass. PPL unchanged.

## [0.7.0] — 2026-04-08

### 🏆 Round 10 — `turbo_kv_4b` matches fp32 KV speed at 7.1× compression

README.md

Lines changed: 4 additions & 4 deletions
@@ -49,10 +49,10 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,
| KV Config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|:----------|------------:|------------:|----:|----------:|------:|--------------:|
- | FP32 reference ||| 13.56 || 17.9 | baseline |
- | **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.08** | **+3.8%** | **18.7** | **+4.5%** |
- | `turbo_kv_5b` 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | 15.3 | −14.5% |
- | `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 15.7 | −12.3% |
+ | FP32 reference ||| 13.56 || 18.43 | baseline |
+ | **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.08** | **+3.8%** | **18.17** | **−1.4%** ✅ parity |
+ | `turbo_kv_5b` 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | 16.80 | −8.8% |
+ | `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 16.57 | −10.1% |
| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 13.27 | −26.8% |
| llama.cpp `q4_0` KV (lit.) | ~70 | ~7.3× | ~14.99 | +10.6% |||
