diff --git a/docs/releases/v0.2.2.md b/docs/releases/v0.2.2.md
new file mode 100644
index 00000000..238b81b5
--- /dev/null
+++ b/docs/releases/v0.2.2.md
@@ -0,0 +1,100 @@
+## Highlights
+
+**Quality parity with MLX achieved.** Lattice's WikiText-2 perplexity on Qwen3.5-0.8B is now within 0.029 PPL of MLX (15.89 vs 15.86), down from a 0.77 PPL gap in v0.2.1. The remaining gap is within the f32↔bf16 numerical-precision band.
+
+**Three independent quality fixes landed this cycle:**
+
+1. **RoPE pairing: stride-half, not interleaved** (#96). `apply_partial_rope` was rotating consecutive pairs `(2i, 2i+1)`, the GPT-J convention. Qwen3.5 is trained with stride-half pairs `(i, half+i)`, the convention of HF transformers `rotate_half` and MLX `nn.RoPE(traditional=False)`. The comment "Qwen3.5 uses mrope_interleaved=true" was misread: that config field controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention. Verified against MLX with max-diff 8e-6 (stride-half) vs 67.5 (interleaved). PPL: 16.62 → 15.89.
+
+2. **FP16 lm_head for tied-embedding Q8 path**. Lattice's per-row symmetric Q8 quantization of the embedding produced sharper logits than MLX (max 25.5 vs 11.1). Switching the tied lm_head to the FP16 embedding buffer already loaded for embedding lookup brings per-position distributions in line.
+
+3. **Asymmetric Q4 quantization**. `Q4Block` was symmetric-only (`scale = abs_max/7`). This release adds an asymmetric mode (scale + bias, `scale = (max - min)/15`), which gives a 0.77 PPL improvement on unrotated Q4. Critically, asymmetric mode breaks QuaRot: Hadamard rotation zero-centers the weights, so the bias only adds noise. QuaRot paths therefore keep symmetric quantization.
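To make the two pairing conventions from the RoPE fix concrete, here is a minimal sketch in plain Python. It is not Lattice's `apply_partial_rope`: the function names are hypothetical, the rotation covers the whole vector rather than a partial rotary slice, and the frequency base of 10000 is an assumption.

```python
import math


def rope_interleaved(x, pos, base=10000.0):
    """GPT-J style: rotate consecutive pairs (2i, 2i+1)."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def rope_stride_half(x, pos, base=10000.0):
    """rotate_half style: rotate pairs (i, half + i)."""
    d = len(x)
    half = d // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[half + i]
        out[i] = a * c - b * s
        out[half + i] = a * s + b * c
    return out


x = [float(i + 1) for i in range(8)]

# At position 0 every angle is zero, so both conventions are the identity.
same_at_zero = rope_interleaved(x, 0) == rope_stride_half(x, 0)

# At any other position the two conventions mix different elements.
diff = max(abs(a - b) for a, b in zip(rope_interleaved(x, 3), rope_stride_half(x, 3)))

print(f"identical at pos 0: {same_at_zero}")
print(f"max element difference at pos 3: {diff:.3f}")
```

Both functions use the same angles and differ only in which two elements share an angle. That is why the mismatch stays invisible at position 0 and grows into a large output difference at later positions.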
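The two Q4 scale formulas quoted in the asymmetric-quantization item can be compared on a single block with a short sketch. This is an illustration, not the `Q4Block` implementation: the helper names are hypothetical, and treating the bias as the block minimum is an assumption (the notes only say "scale + bias").

```python
def quantize_q4_symmetric(block):
    """Signed 4-bit codes in [-7, 7] with scale = abs_max / 7."""
    abs_max = max(abs(x) for x in block)
    scale = abs_max / 7 if abs_max > 0 else 1.0
    codes = [max(-7, min(7, round(x / scale))) for x in block]
    return codes, scale


def dequantize_q4_symmetric(codes, scale):
    return [c * scale for c in codes]


def quantize_q4_asymmetric(block):
    """Unsigned 4-bit codes in [0, 15] with scale = (max - min) / 15.

    Assumption: the bias is the block minimum, so code 0 maps to min
    and code 15 maps to max.
    """
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [max(0, min(15, round((x - lo) / scale))) for x in block]
    return codes, scale, lo


def dequantize_q4_asymmetric(codes, scale, bias):
    return [c * scale + bias for c in codes]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# A block whose values all sit on one side of zero: symmetric mode leaves
# its negative codes unused, while asymmetric mode spreads all 16 codes
# over the range that actually occurs.
skewed = [0.50 + 0.01 * i for i in range(32)]

sym_codes, sym_scale = quantize_q4_symmetric(skewed)
sym_err = max_abs_error(skewed, dequantize_q4_symmetric(sym_codes, sym_scale))

asym_codes, asym_scale, asym_bias = quantize_q4_asymmetric(skewed)
asym_err = max_abs_error(
    skewed, dequantize_q4_asymmetric(asym_codes, asym_scale, asym_bias)
)

print(f"symmetric  max error: {sym_err:.4f}")
print(f"asymmetric max error: {asym_err:.4f}")
```

On a skewed block like this one the asymmetric step size is much smaller, so the reconstruction error drops. The sketch does not model the QuaRot case, where the notes report that symmetric quantization remains the better choice.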
+
+### WikiText-2 Perplexity (Qwen3.5-0.8B, window=512, stride=256)
+
+| Engine | Quant | PPL | Δ vs MLX gold |
+|--------|-------|-----|---------------|
+| MLX | FP16 / Q8 g64 | **15.86** | (reference) |
+| **Lattice** | **FP16 / Q8 (auto-quant)** | **15.89** | **+0.029** |
+| MLX | Q4 g64 | 18.18 | +2.32 |
+| Lattice | Q4 asymmetric | 19.27 | +3.41 |
+
+## What's New
+
+### Inference
+
+- **XGrammar structured output engine** (ADR-046): grammar-constrained decoding with token-level acceptance masks
+- **Continuous batching with chunked prefill** (ADR-048): multi-sequence serving with prefix sharing
+- **Prefix-shared paged KV cache** (ADR-047): IndexMap-backed LRU with O(1) page reuse across sequences
+- **MoE Metal dispatch with expert coalescing** (ADR-053): top-k expert routing on GPU
+- **Vision encoder module** (ADR-049): ViT path for Qwen3-VL (text + image)
+- **Self-speculative decoding via GDN draft heads**: uses the model's own linear-attention layers as the draft model
+- **Probabilistic rejection sampling for speculative decoding** (ADR-050): accept/reject with target-model verification probabilities
+- **MTP on QuaRot Q4 via counter-rotation**: multi-token prediction now works alongside rotated quantization
+
+### Quality & Numerics
+
+- **RoPE stride-half pairing** (#96): closes 0.74 PPL gap
+- **FP16 lm_head for tied-embedding Q8 path**: removes per-row Q8 lm_head distortion
+- **Asymmetric Q4 quantization**: 0.77 PPL improvement on unrotated Q4
+
+### Performance
+
+- **CPU SIMD optimizations + parity regression tests** (#80)
+- **Embed SIMD optimizations + correctness hardening** (#79)
+- **Metal GPU**: zero-copy decode + MLP fusion (#81)
+- **CI perf regression tracking** (ADR-058, #83): `bench-regression.yml` gates PRs touching CPU kernel paths against the `perf-baselines` branch (a >7% CI-lower-bound regression blocks merge)
+
+### Fine-tuning
+
+- **LoRA full-lifecycle consumer API** (ADR-057 D1-D5, #65): load, `save_peft_safetensors`, online single-event SGD via `adapt_step`, module validation, consumer docs
+- **Adam/AdamW optimizer + LoRA gradient computation** (#90): full backward pass for low-rank adapters
+
+### Transport
+
+- **Online drift detection via Sinkhorn** (ADR-055): Wasserstein distance between rolling distributions
+
+### Bench & Tooling
+
+- **Reproducible bench scripts**: `bench-compare.sh` (origin/main vs HEAD A/B), `bench_apples_to_apples.sh`, `bench_apples_precise.sh`, `bench_q4_apples.sh`, `bench_quality.sh`
+- **Logit comparison harness**: `scripts/compare_logits.py` for windowed PPL evaluation and per-position argmax agreement vs MLX
+- **WikiText-2 fixture committed** (`docs/bench_results/wiki.test.raw`) for reproducible PPL bench
+- **Codex review automation**: `scripts/codex_review_pr76*.sh` for adversarial review rounds with self-verification
+
+## Fixes
+
+- **RoPE pairing convention**: stride-half, not interleaved (#96)
+- **Metal GPU build repair** + zero-copy decode + MLP fusion (#81)
+- **embed bench**: `data()` accessor after field privatization (#95)
+- **kv_cache**: replace `VecDeque` LRU with `IndexMap` for correct insertion-order semantics
+- **batch**: stop throttling decode by free GDN slots; running sequences already own theirs
+- **grammar**: handle `advance()` rejection, gate MTP bypass, remove unsafe impls
+- **generate**: close external-review functional gaps (#41)
+- **bench**: correct decode-throughput methodology to report honest e2e numbers (#40)
+- **inference**: document and skip MTP emission for QuaRot rotation basis mismatch (#42)
+- **LoRA hook** in batch prefill path (#43)
+
+## Internal
+
+- **ADR-046** XGrammar structured output engine
+- **ADR-047** Prefix-shared paged KV cache
+- **ADR-048** Continuous batching with chunked prefill
+- **ADR-049** Vision encoder module for Qwen3-VL
+- **ADR-050** Probabilistic rejection sampling
+- **ADR-053** MoE Metal dispatch with expert coalescing
+- **ADR-055** Online drift detection via Sinkhorn
+- **ADR-056** LoRA full-lifecycle consumer API
+- **ADR-057** LoRA full-lifecycle implementation (D1-D5)
+- **ADR-058** CPU perf regression tracking
+
+## Crates Published
+
+- `lattice-inference` 0.2.2
+- `lattice-embed` 0.2.2
+- `lattice-fann` 0.2.2
+- `lattice-tune` 0.2.2
+- `lattice-transport` 0.2.2
+
+## Diff Stats
+
+129 files changed, 29,524 insertions(+), 1,012 deletions(-) since v0.2.1.