fix(inference): RoPE stride-half pairing — WikiText-2 PPL gap 0.77 → 0.029 vs MLX by ohdearquant · Pull Request #96 · ohdearquant/lattice

ohdearquant · 2026-05-25T02:50:25Z

Summary

apply_partial_rope and Metal partial_rope_interleaved rotated consecutive pairs (2i, 2i+1) — the GPT-J / traditional=True convention. Qwen3.5 is trained with stride-half pairs (i, half+i) — HF transformers rotate_half, MLX nn.RoPE(traditional=False). The comment "Qwen3.5 uses mrope_interleaved=true" was misread: mrope_interleaved in config controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention.
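
The two pairing conventions described above can be sketched in plain Python. This is a minimal reference, not the code in `forward.rs` or the Metal kernel: the function names `rope_angles`, `rope_stride_half` and `rope_interleaved` are hypothetical, and the frequency schedule `base ** (-2i / rope_dim)` is the standard RoPE one, assumed here rather than taken from the diff.

```python
import math

def rope_angles(pos, rope_dim, base):
    """Per-pair rotation angles: theta_i = pos * base**(-2i / rope_dim)."""
    half = rope_dim // 2
    return [pos * base ** (-2.0 * i / rope_dim) for i in range(half)]

def rope_stride_half(x, pos, rope_dim, base=1e7):
    """Rotate pairs (i, half+i) of the first rope_dim entries (rotate_half style)."""
    half = rope_dim // 2
    out = list(x)  # entries at index >= rope_dim pass through unchanged
    for i, theta in enumerate(rope_angles(pos, rope_dim, base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[half + i]
        out[i] = a * c - b * s
        out[half + i] = a * s + b * c
    return out

def rope_interleaved(x, pos, rope_dim, base=1e7):
    """Rotate consecutive pairs (2i, 2i+1) of the first rope_dim entries (GPT-J style)."""
    out = list(x)  # entries at index >= rope_dim pass through unchanged
    for i, theta in enumerate(rope_angles(pos, rope_dim, base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Both functions apply the same angles and the same number of multiplies and adds; only the indices that form each rotated pair differ. That is why the wrong convention still produces well-scaled activations while attaching each frequency to the wrong pair of dimensions.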

Evidence

RoPE convention test (/tmp/test_rope_conv.py): MLX nn.RoPE(traditional=False, rope_dim=64, base=1e7, position=5) vs each candidate convention:

| Convention | max-diff vs MLX |
| --- | --- |
| Stride-half `(i, half+i)` | 8e-6 |
| Interleaved `(2i, 2i+1)` | 67.5 |
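
A convention check of this shape can be sketched without MLX. The actual `/tmp/test_rope_conv.py` is not shown in this PR, so everything below is an assumption: the reference is an independent complex-multiplication form of stride-half RoPE standing in for the MLX output, the input vector is arbitrary, and the helper names are hypothetical. The printed numbers will therefore not reproduce the 8e-6 and 67.5 figures above, which depend on the real input and on MLX's float precision.

```python
import cmath
import math

def reference_stride_half(x, pos, rope_dim, base=1e7):
    """Stand-in reference: treat (x[i], x[half+i]) as a complex number and rotate it."""
    half = rope_dim // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rope_dim)
        z = complex(x[i], x[half + i]) * cmath.exp(1j * theta)
        out[i], out[half + i] = z.real, z.imag
    return out

def rotate_pairs(x, pos, pairs, rope_dim, base=1e7):
    """Candidate under test: rotate the given index pairs with the i-th RoPE angle."""
    out = list(x)
    for i, (lo, hi) in enumerate(pairs):
        theta = pos * base ** (-2.0 * i / rope_dim)
        c, s = math.cos(theta), math.sin(theta)
        out[lo] = x[lo] * c - x[hi] * s
        out[hi] = x[lo] * s + x[hi] * c
    return out

def max_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

rope_dim, pos = 64, 5
half = rope_dim // 2
candidates = {
    "stride-half": [(i, half + i) for i in range(half)],
    "interleaved": [(2 * i, 2 * i + 1) for i in range(half)],
}
x = [math.sin(0.37 * k) for k in range(rope_dim)]  # arbitrary deterministic input
reference = reference_stride_half(x, pos, rope_dim)
diffs = {name: max_diff(rotate_pairs(x, pos, pairs, rope_dim), reference)
         for name, pairs in candidates.items()}
for name, d in diffs.items():
    print(f"{name}: max-diff {d:.3g}")
```

The useful property of the check is the gap between the two rows: the matching convention agrees with the reference to rounding error, while the other is off by an amount comparable to the input magnitude.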

WikiText-2 PPL (Qwen3.5-0.8B, window=512, stride=256, 2041 scored tokens):

| | Lattice | MLX gold | Gap |
| --- | --- | --- | --- |
| Before | 16.6242 | 15.8580 | +0.77 |
| After | 15.8870 | 15.8580 | +0.029 |
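
For readers unfamiliar with windowed perplexity, the sketch below shows one common way to score a token stream with `window=512, stride=256`, plus the gap arithmetic from the table. The scoring convention (each position is scored once, in the first window that reaches it) and the names `scoring_spans` and `perplexity` are assumptions; the harness used for the numbers above is not part of this PR and may differ.

```python
import math

def scoring_spans(n_tokens, window=512, stride=256):
    """Yield (begin, end, score_from): the model sees tokens[begin:end], and only
    positions score_from..end-1 contribute to the loss for that window."""
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break

def perplexity(nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(nlls) / len(nlls))

# Gap arithmetic from the table above (Lattice minus MLX gold).
gap_before = 16.6242 - 15.8580  # 0.7662, reported as +0.77
gap_after = 15.8870 - 15.8580   # 0.0290, reported as +0.029
```

Under this convention every window after the first re-reads 256 tokens of context and scores only the new ones, so no position is counted twice in the mean.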

Single-window argmax agreement at position 0 (used as forward-pass health diagnostic earlier this session):

| | Before | After |
| --- | --- | --- |
| Lattice pos 0 argmax | 695 (input token echoed back) | 220 |
| MLX pos 0 argmax | 220 | 220 |
| 511-position argmax agreement | low (different distributions) | 497/511 = 97.3% |
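
The agreement metric in the last row can be computed as below. This is a minimal sketch: `argmax_agreement` is a hypothetical helper, and it assumes each implementation's logits are available as one list of floats per position.

```python
def argmax(row):
    """Index of the largest logit in one position's row (first index on ties)."""
    return max(range(len(row)), key=row.__getitem__)

def argmax_agreement(logits_a, logits_b):
    """Return (matches, total) over positions where both models pick the same token."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both runs must cover the same positions")
    matches = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return matches, len(logits_a)

# Percentage reported in the table: 497 matching positions out of 511.
agreement_pct = 100.0 * 497 / 511
```

Argmax agreement is a coarser signal than perplexity, but it fails loudly: a run that echoes the input token at position 0 shows up as a mismatch regardless of how the rest of the distribution looks.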

The 0.029 PPL residual is within the f32↔bf16 numerical-precision band documented for transformer inference (llama.cpp community: f16 vs f32 PPL deltas are "0.00x"). The earlier hypothesis that the gap was FP precision drift was wrong — it was a positional-encoding bug masquerading as numerical noise. The hybrid transformer (18 GDN + 6 full-attention layers) wasn't transforming hidden state correctly because attention was running on scrambled positions, producing the "embedding leakage" pattern where logits peaked at the input token.

Files

crates/inference/src/model/qwen35/forward.rs:391 — CPU apply_partial_rope
crates/inference/src/forward/metal_qwen35.rs:346 — Metal kernel (name kept for ABI continuity)
crates/inference/src/speculative.rs:1090 — mtp_apply_partial_rope
crates/inference/src/forward/metal_qwen35.rs golden snapshot — updated from stale -22.62 (pre-(1+gamma)) to the math-derived -45.243256
crates/inference/src/forward/metal_qwen35.rs test inits — add missing grammar: None to fix pre-existing test compile error

Test plan

cargo test -p lattice-inference --release --features "f16 metal-gpu" --lib — 843 pass, 0 fail
cargo clippy -p lattice-inference --features "f16 metal-gpu" — no new errors
Single-window PPL (512 tokens) — lattice 11.17, MLX 11.19
Full windowed PPL (2041 tokens) — lattice 15.89, MLX 15.86

Bench-compare

No CPU kernel paths touched — RoPE change is array-indexing only with identical FLOP count. Not gated by bench-regression.yml. Decode throughput is unaffected (same number of mul/add per token).

🤖 Generated with Claude Code

….74 PPL gap

`apply_partial_rope` and Metal `partial_rope_interleaved` rotated consecutive pairs (2i, 2i+1), the GPT-J / `traditional=True` convention. Qwen3.5 is trained with stride-half pairs (i, half+i), as in HF transformers' `rotate_half` and MLX's `nn.RoPE(traditional=False)`. The comment "Qwen3.5 uses mrope_interleaved=true" was misread: `mrope_interleaved` in config controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention.

Empirically verified against MLX's `nn.RoPE(traditional=False, rope_dim=64, base=1e7, position=5)`: stride-half matches with max-diff 8e-6; interleaved diverges with max-diff 67.5.

WikiText-2 PPL on Qwen3.5-0.8B (window=512, stride=256, 2041 scored tokens):
- before: 16.6242 (lattice) vs 15.8580 (MLX) → +0.77 PPL gap
- after: 15.8870 (lattice) vs 15.8580 (MLX) → +0.029 PPL gap
- argmax agreement at pos 0: 0% before (lat=695 echoed input token, mlx=220) → 97.3% after across a single 512-token window

The wrong RoPE scrambled positional information in all 6 full-attention layers of the 24-layer hybrid (75% GDN, 25% full attention) stack. The hybrid transformer's hidden state stopped transforming, producing the "embedding leakage" signature where logits peaked at the input token. This was diagnosed earlier this session and is now explained.

Files:
- crates/inference/src/model/qwen35/forward.rs:391: CPU `apply_partial_rope`
- crates/inference/src/forward/metal_qwen35.rs:346: Metal kernel (name kept for ABI continuity)
- crates/inference/src/speculative.rs:1090: `mtp_apply_partial_rope`
- crates/inference/src/forward/metal_qwen35.rs golden snapshot: updated from stale -22.62 (pre-(1+gamma)) to the correct -45.24 derived value
- crates/inference/src/forward/metal_qwen35.rs test inits: add missing `grammar: None` field so tests compile

Tests: 843 pass, 0 fail. Clippy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

crates.io v0.2.2 was published 2026-05-20, before the RoPE pairing fix landed (PR #96, merged today). Cannot republish 0.2.2 (immutable on crates.io), so bumping to 0.2.3 to ship the fix. v0.2.2 will be yanked on crates.io post-publish to prevent new installs from getting the broken interleaved RoPE.

- Workspace version 0.2.2 → 0.2.3
- Internal path-dep minimum versions bumped to 0.2.3
- Release notes renamed v0.2.2.md → v0.2.3.md with yank notice
- GitHub tag v0.2.2 left in place for history

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

ohdearquant merged commit 0cc5c9e into main May 25, 2026
3 checks passed

This was referenced May 25, 2026

release: v0.2.2 — MLX-parity quality + structured output + LoRA lifecycle #97

Merged

release: v0.2.3 — ship RoPE fix to crates.io (yank 0.2.2) #98

Merged

ohdearquant merged 1 commit into main from show/perf-ppl-gap-close
