ohdearquant · May 25, 2026
diff --git a/docs/releases/v0.2.2.md b/docs/releases/v0.2.2.md
@@ -0,0 +1,100 @@
+## Highlights
+
+**Quality parity with MLX achieved.** Lattice's WikiText-2 perplexity on Qwen3.5-0.8B is now within 0.029 PPL of MLX (15.89 vs 15.86), down from a 0.77 PPL gap in v0.2.1. The remaining gap is within the f32↔bf16 numerical-precision band.
+
+**Three independent quality fixes landed this cycle:**
+
+1. **RoPE pairing: stride-half, not interleaved** (#96). `apply_partial_rope` was rotating consecutive pairs `(2i, 2i+1)`, which is the GPT-J convention. Qwen3.5 is trained with stride-half pairs `(i, half+i)`, matching HF transformers `rotate_half` and MLX `nn.RoPE(traditional=False)`. The comment "Qwen3.5 uses mrope_interleaved=true" had been misread: that config field controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention. Verified against MLX: max-diff 8e-6 with stride-half vs 67.5 with interleaved. PPL: 16.62 → 15.89.
+
+2. **FP16 lm_head for tied-embedding Q8 path**. Lattice's per-row symmetric Q8 quantization of the embedding produced sharper logits than MLX (max 25.5 vs 11.1). The tied lm_head now reads the FP16 embedding buffer that is already loaded for embedding lookup, which brings per-position distributions in line with MLX.
+
+3. **Asymmetric Q4 quantization**. `Q4Block` was symmetric-only (`scale = abs_max/7`). This release adds an asymmetric mode (scale + bias, `scale = (max - min)/15`), which gives a 0.77 PPL improvement on unrotated Q4. Asymmetric mode breaks QuaRot, however: Hadamard rotation zero-centers the weights, so the bias only adds noise. QuaRot paths therefore keep symmetric quantization.
+
+### WikiText-2 Perplexity (Qwen3.5-0.8B, window=512, stride=256)
+
+| Engine | Quant | PPL | Δ vs MLX gold |
+|--------|-------|-----|---------------|
+| MLX | FP16 / Q8 g64 | **15.86** | (reference) |
+| **Lattice** | **FP16 / Q8 (auto-quant)** | **15.89** | **+0.029** |
+| MLX | Q4 g64 | 18.18 | +2.32 |
+| Lattice | Q4 asymmetric | 19.27 | +3.41 |
+
+## What's New
+
+### Inference
+
+- **XGrammar structured output engine** (ADR-046) — grammar-constrained decoding with token-level acceptance masks
+- **Continuous batching with chunked prefill** (ADR-048) — multi-sequence serving with prefix sharing
+- **Prefix-shared paged KV cache** (ADR-047) — IndexMap-backed LRU with O(1) page reuse across sequences
+- **MoE Metal dispatch with expert coalescing** (ADR-053) — top-k expert routing on GPU
+- **Vision encoder module** (ADR-049) — ViT path for Qwen3-VL (text + image)
+- **Self-speculative decoding via GDN draft heads** — uses the model's own linear-attention layers as draft model
+- **Probabilistic rejection sampling for speculative decoding** (ADR-050) — accept/reject with target-model verification probabilities
+- **MTP on QuaRot Q4 via counter-rotation** — multi-token prediction now works alongside rotated quantization
+
+### Quality & Numerics
+
+- **RoPE stride-half pairing** (#96) — closes 0.74 PPL of the 0.77 PPL gap to MLX
+- **FP16 lm_head for tied-embedding Q8 path** — removes per-row Q8 lm_head distortion
+- **Asymmetric Q4 quantization** — 0.77 PPL improvement on unrotated Q4
+
+### Performance
+
+- **CPU SIMD optimizations + parity regression tests** (#80)
+- **Embed SIMD optimizations + correctness hardening** (#79)
+- **Metal GPU**: zero-copy decode + MLP fusion (#81)
+- **CI perf regression tracking** (ADR-058, #83) — `bench-regression.yml` gates PRs touching CPU kernel paths against the `perf-baselines` branch (a regression whose confidence-interval lower bound exceeds 7% blocks the merge)
+
+### Fine-tuning
+
+- **LoRA full-lifecycle consumer API** (ADR-056, ADR-057 D1-D5, #65) — load, save_peft_safetensors, online single-event SGD via `adapt_step`, module validation, consumer docs
+- **Adam/AdamW optimizer + LoRA gradient computation** (#90) — full backward pass for low-rank adapters
+
+### Transport
+
+- **Online drift detection via Sinkhorn** (ADR-055) — Wasserstein distance between rolling distributions
+
+### Bench & Tooling
+
+- **Reproducible bench scripts**: `bench-compare.sh` (origin/main vs HEAD A/B), `bench_apples_to_apples.sh`, `bench_apples_precise.sh`, `bench_q4_apples.sh`, `bench_quality.sh`
+- **Logit comparison harness**: `scripts/compare_logits.py` — windowed PPL evaluation, per-position argmax agreement vs MLX
+- **WikiText-2 fixture committed** (`docs/bench_results/wiki.test.raw`) for reproducible PPL bench
+- **Codex review automation**: `scripts/codex_review_pr76*.sh` — adversarial review rounds with self-verification
+
+## Fixes
+
+- **RoPE pairing convention** — stride-half not interleaved (#96)
+- **Metal GPU build repair** + zero-copy decode + MLP fusion (#81)
+- **embed bench**: `data()` accessor after field privatization (#95)
+- **kv_cache**: replace `VecDeque` LRU with `IndexMap` for correct insertion-order semantics
+- **batch**: stop throttling decode by free GDN slots — running sequences already own theirs
+- **grammar**: handle `advance()` rejection, gate MTP bypass, remove unsafe impls
+- **generate**: close external-review functional gaps (#41)
+- **bench**: correct decode-throughput methodology — honest e2e numbers (#40)
+- **inference**: document and skip MTP emission for QuaRot rotation basis mismatch (#42)
+- **LoRA hook** in batch prefill path (#43)
+
+## Internal
+
+- **ADR-046** XGrammar structured output engine
+- **ADR-047** Prefix-shared paged KV cache
+- **ADR-048** Continuous batching with chunked prefill
+- **ADR-049** Vision encoder module for Qwen3-VL
+- **ADR-050** Probabilistic rejection sampling
+- **ADR-053** MoE Metal dispatch with expert coalescing
+- **ADR-055** Online drift detection via Sinkhorn
+- **ADR-056** LoRA full-lifecycle consumer API
+- **ADR-057** LoRA full-lifecycle implementation (D1-D5)
+- **ADR-058** CPU perf regression tracking
+
+## Crates Published
+
+- `lattice-inference` 0.2.2
+- `lattice-embed` 0.2.2
+- `lattice-fann` 0.2.2
+- `lattice-tune` 0.2.2
+- `lattice-transport` 0.2.2
+
+## Diff Stats
+
+129 files changed, 29,524 insertions(+), 1,012 deletions(-) since v0.2.1.
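
Reviewer sketch (not part of the patch): the RoPE item in the Highlights turns on which elements of a head vector are rotated together. The Rust below is a minimal illustration of the two pairing conventions under stated assumptions. It is not Lattice's `apply_partial_rope`; the function names, the frequency base of 10000, and the toy `rot_dim` are all hypothetical choices made for the example.

```rust
/// Per-pair (cos, sin) for one position. Standard RoPE frequencies are assumed:
/// inv_freq_i = base^(-2i / rot_dim). Hypothetical helper for illustration only.
fn rope_angles(pos: usize, rot_dim: usize, base: f32) -> Vec<(f32, f32)> {
    (0..rot_dim / 2)
        .map(|i| {
            let inv_freq = base.powf(-2.0 * i as f32 / rot_dim as f32);
            let theta = pos as f32 * inv_freq;
            (theta.cos(), theta.sin())
        })
        .collect()
}

/// Stride-half pairing: element `i` rotates with element `half + i`
/// (the `rotate_half` layout). Elements past `rot_dim` pass through unchanged.
fn rope_stride_half(x: &mut [f32], pos: usize, rot_dim: usize, base: f32) {
    assert!(rot_dim % 2 == 0 && rot_dim <= x.len());
    let half = rot_dim / 2;
    for (i, (c, s)) in rope_angles(pos, rot_dim, base).into_iter().enumerate() {
        let (a, b) = (x[i], x[half + i]);
        x[i] = a * c - b * s;
        x[half + i] = a * s + b * c;
    }
}

/// Interleaved pairing: element `2i` rotates with element `2i + 1`
/// (the GPT-J layout). Elements past `rot_dim` pass through unchanged.
fn rope_interleaved(x: &mut [f32], pos: usize, rot_dim: usize, base: f32) {
    assert!(rot_dim % 2 == 0 && rot_dim <= x.len());
    for (i, (c, s)) in rope_angles(pos, rot_dim, base).into_iter().enumerate() {
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * c - b * s;
        x[2 * i + 1] = a * s + b * c;
    }
}

/// Largest element-wise difference between the two conventions on the same input.
fn pairing_max_diff(x: &[f32], pos: usize, rot_dim: usize, base: f32) -> f32 {
    let mut a = x.to_vec();
    let mut b = x.to_vec();
    rope_stride_half(&mut a, pos, rot_dim, base);
    rope_interleaved(&mut b, pos, rot_dim, base);
    a.iter()
        .zip(&b)
        .map(|(p, q)| (p - q).abs())
        .fold(0.0_f32, f32::max)
}
```

Calling `pairing_max_diff` on any non-trivial vector at a non-zero position returns a clearly non-zero value, even though both functions use the same angles and both preserve the vector norm. That is the reason a pairing mismatch can go unnoticed in shape and norm checks while still shifting perplexity, and why comparing element-wise against a reference implementation is the more telling test.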