From ac98dffacb1a837d4d4475cdcf8c9188f707e55f Mon Sep 17 00:00:00 2001
From: OceanLi <122793010+ohdearquant@users.noreply.github.com>
Date: Sun, 24 May 2026 23:39:42 -0400
Subject: [PATCH] docs(claude-md): differential-test-first + bump-and-yank recipe
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two development directions promoted from the v0.2.3 session:

1. **Differential Test First**: when lattice diverges from MLX/HF/llama.cpp,
   write a 20-line Python script comparing the same primitive across both
   frameworks BEFORE reading lattice code or spawning agents. This closed a
   0.77 PPL gap (Qwen3.5-0.8B, WikiText-2) in 5 seconds that had been
   misdiagnosed as "FP precision drift" for days. The actual bug was RoPE
   pairing convention (interleaved vs stride-half).

   Also: quantitative literature bounds cheaply reject hypotheses:
   f16-vs-f32 PPL <0.01, bf16-vs-f32 <0.05; gaps above those bounds are
   structural, not numerical.

   Also: be skeptical of comments that paraphrase config fields without
   explaining what the field actually controls in the reference impl.

2. **Bump-and-yank recovery**: crates.io is immutable. When a published
   release has a correctness bug, bump to next patch + ship the fix + yank
   the broken version. Done in v0.2.3 (yanked 0.2.2 which shipped with the
   RoPE bug).

Plus: corrected stale "version = 0.1.0" pin in the publish section to match
current workspace-version convention.

Co-Authored-By: Claude Opus 4.7
---
 CLAUDE.md | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 0a51d05e..2c0edc65 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -30,6 +30,30 @@ cargo bench -p lattice-inference --bench elementwise_cpu_bench # inference

 Quick mode (`--quick`) is sufficient for direction + magnitude. Full mode only when you need tight CIs for a PR description or ADR evidence.

+### Differential Test First (Cross-Framework Bugs)
+
+When lattice produces different output than a reference framework (MLX, HF transformers, llama.cpp), write a self-contained Python script that runs the same primitive in both frameworks and compares max-diff **before** reading lattice code or spawning investigation agents. A 20-line script gives a definitive answer in seconds; code-reading and agent analysis take hours and can converge on wrong conclusions.
+
+```python
+# Template: /tmp/test__conv.py
+import numpy as np, mlx.core as mx, mlx.nn as nn
+# 1. Construct minimal input
+# 2. Run via MLX (reference)
+# 3. Run via each candidate lattice convention (as numpy)
+# 4. Compare: which candidate has max-diff < 1e-4?
+```
+
+This process closed a 0.77 PPL gap on Qwen3.5-0.8B that had been misdiagnosed as "f32-vs-bf16 precision drift" for days. The actual bug was a RoPE pairing convention mismatch: interleaved `(2i, 2i+1)` vs stride-half `(i, half+i)`. Verified in 5 seconds: stride-half max-diff `8e-6`, interleaved `67.5`. PPL dropped from 16.62 to 15.89 (MLX gold 15.86).
+
+**Quantitative bounds reject hypotheses cheaply.** Before chasing "FP precision drift" or other plausible-sounding causes, check the literature for the typical magnitude:
+- f16 vs f32 PPL delta: `~0.00x` (llama.cpp community)
+- bf16 vs f32 PPL delta: `<0.05` (arxiv:2510.26788)
+- Q4 quantization PPL delta: `0.1-0.3` (llama.cpp #406)
+
+If the gap you're investigating exceeds these bounds, the cause is structural (algorithm, layout, convention), not numerical. Reject the precision hypothesis on quantitative grounds and look for a real bug.
+
+**Be skeptical of comments that paraphrase config fields.** A comment that says "X uses field=true" without explaining what the field actually controls in the reference implementation is a footgun. The lattice RoPE comment said "Qwen3.5 uses mrope_interleaved=true". That technically matched the config, but `mrope_interleaved` controls multimodal M-RoPE section interleaving (video/image tokens), not 1-D text RoPE pairing. The bug existed for months because nobody verified the comment against HF's `rotate_half` or MLX's `nn.RoPE`.
+
 ### Regression Gate (ADR-058)

 PRs touching CPU kernel paths trigger `bench-regression.yml` in CI. It runs on both `x86_64-linux` (AVX2) and `aarch64-linux` (NEON) against baselines stored on the orphan `perf-baselines` branch.
@@ -80,4 +104,14 @@ Changes to `inference` affect `embed` and `tune`. Changes to `fann` affect `tune

 ## Publishing

-Leaf crates publish first: inference → fann → transport → (wait 30s) → embed → (wait 30s) → tune. Use `make publish`. Internal path deps must have `version = "0.1.0"`.
+Leaf crates publish first: inference → fann → transport → (wait 30s) → embed → (wait 30s) → tune. Use `make publish`. Internal path deps' `version = ` field must match the current workspace version (bump them in lockstep when bumping `[workspace.package].version`).
+
+**Shipped-bug recovery (bump-and-yank).** crates.io versions are immutable. When a published release has a correctness bug:
+
+1. Bump workspace + path-dep versions to the next patch
+2. Update release notes file (rename if needed); add a "Note on v" section explaining the yank
+3. Tag + GH release + `make publish`
+4. `for c in lattice-inference lattice-fann lattice-transport lattice-embed lattice-tune; do cargo yank --version "$c"; done`
+5. Verify: `curl -s https://crates.io/api/v1/crates/` should show `latest_unyanked=`, `yanked=[]`
+
+Done in v0.2.3 (yanked the broken 0.2.2, which shipped with the RoPE bug). New `cargo add` users get the fix; existing pinned users get a yank warning on next `cargo update`.
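
Reviewer note: the verification in step 5 of the bump-and-yank recipe could be scripted instead of read by eye. The following is a minimal sketch, assuming the crates.io crate endpoint returns a JSON object with a `versions` array whose entries carry `num` and `yanked` fields. The helper names `summarize_yanks` and `version_key` are hypothetical, and the sample payload is illustrative rather than real API output.

```python
import json


def version_key(num: str) -> tuple:
    """Sort key for plain x.y.z versions (assumes no pre-release suffixes)."""
    return tuple(int(part) for part in num.split("."))


def summarize_yanks(payload: dict) -> dict:
    """Split a crates.io-style payload into yanked and unyanked version lists."""
    versions = payload.get("versions", [])
    yanked = [v["num"] for v in versions if v.get("yanked")]
    unyanked = [v["num"] for v in versions if not v.get("yanked")]
    latest = max(unyanked, key=version_key) if unyanked else None
    return {"latest_unyanked": latest, "yanked": yanked, "unyanked": unyanked}


# Hypothetical payload shaped like the assumed API response (not real data).
SAMPLE = json.loads(
    """
    {"versions": [
        {"num": "0.2.3", "yanked": false},
        {"num": "0.2.2", "yanked": true},
        {"num": "0.2.1", "yanked": false}
    ]}
    """
)

summary = summarize_yanks(SAMPLE)
print(summary)
```

In practice the payload would come from the `curl` call in step 5, piped in as JSON; the check then reduces to asserting that the fixed version is the latest unyanked one and the broken version appears in the yanked list.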
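
Reviewer note: the four-step template under "Differential Test First" can be made concrete for the RoPE pairing case. The sketch below is NumPy-only and makes several assumptions: `reference_rope` is a stand-in for the real framework call (step 2 would normally invoke MLX or HF), written independently via complex multiplication in the stride-half convention; all function names, the shapes, and the `base` value are hypothetical choices for illustration.

```python
import numpy as np


def rope_angles(seq_len: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angles per (position, frequency); shape (seq_len, dim // 2)."""
    half = dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)
    return np.outer(np.arange(seq_len), inv_freq)


def rope_stride_half(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Candidate A: rotate pairs (i, half + i)."""
    half = x.shape[-1] // 2
    x1, x2 = x[:, :half], x[:, half:]
    cos, sin = np.cos(angles), np.sin(angles)
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)


def rope_interleaved(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Candidate B: rotate pairs (2i, 2i + 1)."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_odd * cos + x_even * sin
    return out


def reference_rope(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Stand-in reference (assumption): stride-half via complex rotation."""
    half = x.shape[-1] // 2
    z = (x[:, :half] + 1j * x[:, half:]) * np.exp(1j * angles)
    return np.concatenate([z.real, z.imag], axis=-1)


# 1. Minimal input, 2. reference, 3. candidates, 4. compare max-diff.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
angles = rope_angles(seq_len=8, dim=16)
ref = reference_rope(x, angles)

candidates = {"stride_half": rope_stride_half, "interleaved": rope_interleaved}
diffs = {name: float(np.max(np.abs(fn(x, angles) - ref))) for name, fn in candidates.items()}
for name, diff in diffs.items():
    verdict = "MATCH" if diff < 1e-4 else "MISMATCH"
    print(f"{name}: max-diff {diff:.3e} -> {verdict}")
```

Only one candidate should fall under the `1e-4` threshold. Swapping `reference_rope` for the actual framework output is what turns this from a self-consistency check into the differential test the section describes.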