100 changes: 100 additions & 0 deletions docs/releases/v0.2.2.md
@@ -0,0 +1,100 @@
## Highlights

**Quality parity with MLX achieved.** Lattice's WikiText-2 perplexity on Qwen3.5-0.8B is now within 0.029 PPL of MLX (15.89 vs 15.86), down from a 0.77 PPL gap in v0.2.1. The remaining gap is within the f32↔bf16 numerical-precision band.

**Three independent quality fixes landed this cycle:**

1. **RoPE pairing: stride-half, not interleaved** (#96). `apply_partial_rope` was rotating consecutive pairs `(2i, 2i+1)`, the GPT-J convention. Qwen3.5 is trained with stride-half pairs `(i, half+i)`, the convention used by HF transformers `rotate_half` and MLX `nn.RoPE(traditional=False)`. The comment "Qwen3.5 uses mrope_interleaved=true" had been misread: that config field controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention. Verified against MLX: max-diff 8e-6 with stride-half vs 67.5 with interleaved. PPL: 16.62 → 15.89.

2. **FP16 lm_head for tied-embedding Q8 path**. Lattice's per-row symmetric Q8 quantization of the embedding produced sharper logits than MLX (max 25.5 vs 11.1). Switching the tied lm_head to the FP16 embedding buffer, which is already loaded for embedding lookup, brings per-position distributions in line with MLX.

3. **Asymmetric Q4 quantization**. `Q4Block` was symmetric-only (`scale = abs_max/7`). This release adds an asymmetric mode (scale + bias, `scale = (max - min)/15`), a 0.77 PPL improvement on unrotated Q4. Asymmetric quantization breaks QuaRot, however: the Hadamard rotation zero-centers the weights, so the bias only adds noise. QuaRot paths therefore keep symmetric quantization.
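
The two scale formulas in item 3 can be compared on a small block of weights. The following is a minimal sketch, not Lattice's `Q4Block`: the function names, the one-code-per-byte storage, and the choice of `min` as the bias are assumptions made for illustration.

```rust
/// Symmetric 4-bit quantization: signed codes in [-7, 7], scale = abs_max / 7.
pub fn quantize_symmetric(w: &[f32]) -> (Vec<i8>, f32) {
    let abs_max = w.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    // Fall back to 1.0 for an all-zero block to avoid dividing by zero.
    let scale = if abs_max > 0.0 { abs_max / 7.0 } else { 1.0 };
    let codes = w
        .iter()
        .map(|&x| (x / scale).round().clamp(-7.0, 7.0) as i8)
        .collect();
    (codes, scale)
}

pub fn dequantize_symmetric(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

/// Asymmetric 4-bit quantization: unsigned codes in [0, 15],
/// scale = (max - min) / 15, bias = min (assumed convention).
pub fn quantize_asymmetric(w: &[f32]) -> (Vec<u8>, f32, f32) {
    if w.is_empty() {
        return (Vec::new(), 1.0, 0.0);
    }
    let min = w.iter().copied().fold(f32::INFINITY, f32::min);
    let max = w.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    let scale = if range > 0.0 { range / 15.0 } else { 1.0 };
    let codes = w
        .iter()
        .map(|&x| ((x - min) / scale).round().clamp(0.0, 15.0) as u8)
        .collect();
    (codes, scale, min)
}

pub fn dequantize_asymmetric(codes: &[u8], scale: f32, bias: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale + bias).collect()
}

/// Largest absolute element-wise difference between two slices.
pub fn max_abs_error(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

On a skewed block such as `[0.0, 1.0, 2.0, 3.0]` the symmetric mode wastes its negative codes and the asymmetric mode reconstructs the values more closely. On a zero-centered block the two ranges nearly coincide, which is consistent with the note that the bias brings no benefit after a Hadamard rotation.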

### WikiText-2 Perplexity (Qwen3.5-0.8B, window=512, stride=256)

| Engine | Quant | PPL | Δ vs MLX gold |
|--------|-------|-----|---------------|
| MLX | FP16 / Q8 g64 | **15.86** | (reference) |
| **Lattice** | **FP16 / Q8 (auto-quant)** | **15.89** | **+0.029** |
| MLX | Q4 g64 | 18.18 | +2.32 |
| Lattice | Q4 asymmetric | 19.27 | +3.41 |
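
For readers unfamiliar with the `window=512, stride=256` setting, here is a rough sketch of one common way to run a strided perplexity evaluation. It is an assumption that the bench follows this scheme: each window feeds up to `window` tokens to the model, advances by `stride`, and only scores tokens that no earlier window has scored. The `Window` type and both function names are hypothetical.

```rust
/// One evaluation window: tokens [begin, end) are fed to the model,
/// but only tokens [score_from, end) contribute to the loss.
#[derive(Debug, PartialEq)]
pub struct Window {
    pub begin: usize,
    pub end: usize,
    pub score_from: usize,
}

/// Plan overlapping windows so that every token is scored exactly once.
pub fn plan_windows(n_tokens: usize, window: usize, stride: usize) -> Vec<Window> {
    assert!(window > 0 && stride > 0 && stride <= window);
    let mut out = Vec::new();
    let mut prev_end = 0;
    let mut begin = 0;
    while begin < n_tokens {
        let end = (begin + window).min(n_tokens);
        out.push(Window { begin, end, score_from: prev_end });
        prev_end = end;
        if end == n_tokens {
            break;
        }
        begin += stride;
    }
    out
}

/// Perplexity is exp of the mean per-token negative log-likelihood (natural log).
pub fn perplexity(nll: &[f64]) -> f64 {
    assert!(!nll.is_empty(), "need at least one scored token");
    let mean = nll.iter().sum::<f64>() / nll.len() as f64;
    mean.exp()
}
```

With this plan, every scored token after the first window has at least `window - stride` tokens of context, which is why strided evaluation usually reports lower perplexity than disjoint chunks.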

## What's New

### Inference

- **XGrammar structured output engine** (ADR-046) — grammar-constrained decoding with token-level acceptance masks
- **Continuous batching with chunked prefill** (ADR-048) — multi-sequence serving with prefix sharing
- **Prefix-shared paged KV cache** (ADR-047) — IndexMap-backed LRU with O(1) page reuse across sequences
- **MoE Metal dispatch with expert coalescing** (ADR-053) — top-k expert routing on GPU
- **Vision encoder module** (ADR-049) — ViT path for Qwen3-VL (text + image)
- **Self-speculative decoding via GDN draft heads** — uses the model's own linear-attention layers as draft model
- **Probabilistic rejection sampling for speculative decoding** (ADR-050) — accept/reject with target-model verification probabilities
- **MTP on QuaRot Q4 via counter-rotation** — multi-token prediction now works alongside rotated quantization
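
The accept/reject step behind probabilistic rejection sampling can be sketched as follows. This is the textbook speculative-sampling rule, not necessarily what ADR-050 specifies: a drafted token is accepted with probability `min(1, p_target / q_draft)`, and on rejection a replacement is drawn from the renormalized residual `max(0, p - q)`. The function names are hypothetical, and the uniform sample `u` is passed in by the caller so the sketch needs no RNG crate.

```rust
/// Accept a drafted token with probability min(1, p_target / q_draft).
/// `u` is a uniform sample in [0, 1) supplied by the caller.
pub fn accept_draft(p_target: f32, q_draft: f32, u: f32) -> bool {
    // A token sampled from the draft should have q_draft > 0;
    // reject defensively if that does not hold.
    if q_draft <= 0.0 {
        return false;
    }
    u < (p_target / q_draft).min(1.0)
}

/// Distribution to resample from after a rejection:
/// max(0, p - q) per token, renormalized to sum to 1.
pub fn residual_distribution(p: &[f32], q: &[f32]) -> Vec<f32> {
    assert_eq!(p.len(), q.len());
    let mut r: Vec<f32> = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| (pi - qi).max(0.0))
        .collect();
    let sum: f32 = r.iter().sum();
    if sum > 0.0 {
        for x in &mut r {
            *x /= sum;
        }
    }
    r
}
```

Under this rule the emitted tokens follow the target model's distribution exactly. The draft model only affects how many tokens are accepted per verification pass.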

### Quality & Numerics

- **RoPE stride-half pairing** (#96) — closes 0.74 PPL gap
- **FP16 lm_head for tied-embedding Q8 path** — removes per-row Q8 lm_head distortion
- **Asymmetric Q4 quantization** — 0.77 PPL improvement on unrotated Q4
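
The difference between the two RoPE pairing conventions in the first bullet is easiest to see side by side. The sketch below is illustrative only and is not the crate's `apply_partial_rope`: the function names and the `base` parameter are assumptions, and a partial-RoPE variant would pass only the rotated prefix of the head vector.

```rust
/// Rotation angle for each pair: pos * base^(-2i / dim), i in 0..dim/2.
fn angles(dim: usize, pos: usize, base: f32) -> Vec<f32> {
    (0..dim / 2)
        .map(|i| pos as f32 * base.powf(-2.0 * i as f32 / dim as f32))
        .collect()
}

/// Stride-half pairing: rotate (x[i], x[half + i]), as in HF `rotate_half`.
pub fn rope_stride_half(x: &mut [f32], pos: usize, base: f32) {
    assert!(x.len() % 2 == 0);
    let half = x.len() / 2;
    for (i, theta) in angles(x.len(), pos, base).into_iter().enumerate() {
        let (s, c) = theta.sin_cos();
        let (a, b) = (x[i], x[half + i]);
        x[i] = a * c - b * s;
        x[half + i] = a * s + b * c;
    }
}

/// Interleaved pairing: rotate (x[2i], x[2i + 1]), the GPT-J convention.
pub fn rope_interleaved(x: &mut [f32], pos: usize, base: f32) {
    assert!(x.len() % 2 == 0);
    for (i, theta) in angles(x.len(), pos, base).into_iter().enumerate() {
        let (s, c) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * c - b * s;
        x[2 * i + 1] = a * s + b * c;
    }
}
```

Both functions apply the same angles and both preserve the vector norm, so a mismatch does not produce obviously broken output. It mixes the wrong coordinates, which matches the symptom reported in this release: a higher perplexity, not a crash.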

### Performance

- **CPU SIMD optimizations + parity regression tests** (#80)
- **Embed SIMD optimizations + correctness hardening** (#79)
- **Metal GPU**: zero-copy decode + MLP fusion (#81)
- **CI perf regression tracking** (ADR-058, #83) — `bench-regression.yml` gates PRs touching CPU kernel paths against `perf-baselines` branch (>7% CI-lower-bound regression blocks merge)

### Fine-tuning

- **LoRA full-lifecycle consumer API** (ADR-057 D1-D5, #65) — load, save_peft_safetensors, online single-event SGD via `adapt_step`, module validation, consumer docs
- **Adam/AdamW optimizer + LoRA gradient computation** (#90) — full backward pass for low-rank adapters
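
To make the backward pass for low-rank adapters concrete, here is a small sketch of a LoRA layer with a single-example gradient step. It assumes the usual formulation `y = W x + scale * B (A x)` with `W` frozen, and it uses plain SGD in place of Adam/AdamW to stay short. The `Lora` struct and its methods are hypothetical and are not the `lattice-tune` API or its `adapt_step`.

```rust
/// Minimal LoRA adapter for one linear layer: y = W x + scale * B (A x).
/// `a` is rank x d_in and `b` is d_out x rank, both row-major.
pub struct Lora {
    pub a: Vec<f32>,
    pub b: Vec<f32>,
    pub d_in: usize,
    pub d_out: usize,
    pub rank: usize,
    pub scale: f32,
}

/// Row-major matrix-vector product.
fn matvec(m: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    (0..rows)
        .map(|r| (0..cols).map(|c| m[r * cols + c] * x[c]).sum::<f32>())
        .collect()
}

impl Lora {
    /// Adapter contribution scale * B (A x), to be added to the frozen W x.
    pub fn delta(&self, x: &[f32]) -> Vec<f32> {
        let h = matvec(&self.a, self.rank, self.d_in, x);
        matvec(&self.b, self.d_out, self.rank, &h)
            .into_iter()
            .map(|v| v * self.scale)
            .collect()
    }

    /// One SGD step for a single input `x`, given g = dL/dy.
    /// dL/dB = scale * g (A x)^T and dL/dA = scale * (B^T g) x^T.
    pub fn sgd_step(&mut self, x: &[f32], g: &[f32], lr: f32) {
        let h = matvec(&self.a, self.rank, self.d_in, x);
        // B^T g must use the value of B from before this update.
        let bt_g: Vec<f32> = (0..self.rank)
            .map(|k| (0..self.d_out).map(|o| self.b[o * self.rank + k] * g[o]).sum::<f32>())
            .collect();
        for o in 0..self.d_out {
            for k in 0..self.rank {
                self.b[o * self.rank + k] -= lr * self.scale * g[o] * h[k];
            }
        }
        for k in 0..self.rank {
            for i in 0..self.d_in {
                self.a[k * self.d_in + i] -= lr * self.scale * bt_g[k] * x[i];
            }
        }
    }
}
```

With the conventional initialization `B = 0`, the first step changes only `B`, because the gradient with respect to `A` is proportional to `B^T g`.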

### Transport

- **Online drift detection via Sinkhorn** (ADR-055) — Wasserstein distance between rolling distributions
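
As background for the Sinkhorn bullet, the following sketch computes an entropy-regularized transport cost between two histograms with the standard Sinkhorn-Knopp iteration. It is a generic illustration under stated assumptions, not the ADR-055 design: the function name, the dense cost matrix, and the fixed iteration count are all hypothetical choices.

```rust
/// Entropy-regularized optimal-transport cost between histograms `a` and `b`.
/// `cost[i][j]` is the cost of moving mass from bin i to bin j and `eps` is
/// the regularization strength. Smaller `eps` is closer to the exact distance.
pub fn sinkhorn_cost(a: &[f64], b: &[f64], cost: &[Vec<f64>], eps: f64, iters: usize) -> f64 {
    let (n, m) = (a.len(), b.len());
    // Gibbs kernel K = exp(-C / eps).
    let k: Vec<Vec<f64>> = cost
        .iter()
        .map(|row| row.iter().map(|&c| (-c / eps).exp()).collect())
        .collect();
    let mut u = vec![1.0f64; n];
    let mut v = vec![1.0f64; m];
    for _ in 0..iters {
        // Alternately rescale so the plan's row sums match `a`
        // and its column sums match `b`.
        for i in 0..n {
            let kv: f64 = (0..m).map(|j| k[i][j] * v[j]).sum();
            u[i] = a[i] / kv;
        }
        for j in 0..m {
            let ktu: f64 = (0..n).map(|i| k[i][j] * u[i]).sum();
            v[j] = b[j] / ktu;
        }
    }
    // Transport plan P = diag(u) K diag(v); the cost is the sum of P_ij * C_ij.
    let mut total = 0.0f64;
    for i in 0..n {
        for j in 0..m {
            total += u[i] * k[i][j] * v[j] * cost[i][j];
        }
    }
    total
}
```

A drift detector built on this idea would compare the cost between a reference window and a recent window against a threshold. For very small `eps` the kernel underflows, so practical implementations usually run the iteration in the log domain.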

### Bench & Tooling

- **Reproducible bench scripts**: `bench-compare.sh` (origin/main vs HEAD A/B), `bench_apples_to_apples.sh`, `bench_apples_precise.sh`, `bench_q4_apples.sh`, `bench_quality.sh`
- **Logit comparison harness**: `scripts/compare_logits.py` — windowed PPL evaluation, per-position argmax agreement vs MLX
- **WikiText-2 fixture committed** (`docs/bench_results/wiki.test.raw`) for reproducible PPL bench
- **Codex review automation**: `scripts/codex_review_pr76*.sh` — adversarial review rounds with self-verification

## Fixes

- **RoPE pairing convention** — stride-half not interleaved (#96)
- **Metal GPU build repair** + zero-copy decode + MLP fusion (#81)
- **embed bench**: `data()` accessor after field privatization (#95)
- **kv_cache**: replace `VecDeque` LRU with `IndexMap` for correct insertion-order semantics
- **batch**: stop throttling decode by free GDN slots — running sequences already own theirs
- **grammar**: handle `advance()` rejection, gate MTP bypass, remove unsafe impls
- **generate**: close external-review functional gaps (#41)
- **bench**: correct decode-throughput methodology — honest e2e numbers (#40)
- **inference**: document and skip MTP emission for QuaRot rotation basis mismatch (#42)
- **LoRA hook** in batch prefill path (#43)

## Internal

- **ADR-046** XGrammar structured output engine
- **ADR-047** Prefix-shared paged KV cache
- **ADR-048** Continuous batching with chunked prefill
- **ADR-049** Vision encoder module for Qwen3-VL
- **ADR-050** Probabilistic rejection sampling
- **ADR-053** MoE Metal dispatch with expert coalescing
- **ADR-055** Online drift detection via Sinkhorn
- **ADR-056** LoRA full-lifecycle consumer API
- **ADR-057** LoRA full-lifecycle implementation (D1-D5)
- **ADR-058** CPU perf regression tracking

## Crates Published

- `lattice-inference` 0.2.2
- `lattice-embed` 0.2.2
- `lattice-fann` 0.2.2
- `lattice-tune` 0.2.2
- `lattice-transport` 0.2.2

## Diff Stats

129 files changed, 29,524 insertions(+), 1,012 deletions(-) since v0.2.1.