fix(inference): RoPE stride-half pairing — WikiText-2 PPL gap 0.77 → 0.029 vs MLX#96
Merged
….74 PPL gap

`apply_partial_rope` and Metal `partial_rope_interleaved` rotated consecutive pairs (2i, 2i+1), the GPT-J / `traditional=True` convention. Qwen3.5 is trained with stride-half pairs (i, half+i), as in HF transformers' `rotate_half` and MLX's `nn.RoPE(traditional=False)`.

The comment "Qwen3.5 uses mrope_interleaved=true" was misread: `mrope_interleaved` in config controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention.

Empirically verified against MLX's `nn.RoPE(traditional=False, rope_dim=64, base=1e7, position=5)`: stride-half matches with max-diff 8e-6; interleaved diverges with max-diff 67.5.

WikiText-2 PPL on Qwen3.5-0.8B (window=512, stride=256, 2041 scored tokens):
- before: 16.6242 (lattice) vs 15.8580 (MLX) → +0.77 PPL gap
- after: 15.8870 (lattice) vs 15.8580 (MLX) → +0.029 PPL gap

argmax agreement at pos 0: 0% before (lat=695 echoed input token, mlx=220) → 97.3% after across single 512-token window.

The wrong RoPE scrambled positional information in all 6 full-attention layers of the 24-layer hybrid (75% GDN, 25% full attention) stack. The hybrid transformer's hidden state stopped transforming, producing the "embedding leakage" signature where logits peaked at the input token. This was diagnosed earlier this session and is now explained.

Files:
- crates/inference/src/model/qwen35/forward.rs:391: CPU `apply_partial_rope`
- crates/inference/src/forward/metal_qwen35.rs:346: Metal kernel (name kept for ABI continuity)
- crates/inference/src/speculative.rs:1090: `mtp_apply_partial_rope`
- crates/inference/src/forward/metal_qwen35.rs golden snapshot: updated from stale -22.62 (pre-(1+gamma)) to the correct -45.24 derived value
- crates/inference/src/forward/metal_qwen35.rs test inits: add missing `grammar: None` field so tests compile

Tests: 843 pass, 0 fail. Clippy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 25, 2026
ohdearquant added a commit that referenced this pull request May 25, 2026
crates.io v0.2.2 was published 2026-05-20, before the RoPE pairing fix landed (PR #96, merged today). v0.2.2 cannot be republished (versions are immutable on crates.io), so this bumps to 0.2.3 to ship the fix. v0.2.2 will be yanked on crates.io post-publish to prevent new installs from getting the broken interleaved RoPE.

- Workspace version 0.2.2 → 0.2.3
- Internal path-dep minimum versions bumped to 0.2.3
- Release notes renamed v0.2.2.md → v0.2.3.md with yank notice
- GitHub tag v0.2.2 left in place for history

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
`apply_partial_rope` and Metal `partial_rope_interleaved` rotated consecutive pairs `(2i, 2i+1)`, the GPT-J / `traditional=True` convention. Qwen3.5 is trained with stride-half pairs `(i, half+i)`, as in HF transformers `rotate_half` and MLX `nn.RoPE(traditional=False)`. The comment "Qwen3.5 uses mrope_interleaved=true" was misread: `mrope_interleaved` in config controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention.

Evidence
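The difference between the two pairing conventions can be sketched in a few lines of Rust. This is a minimal illustration, not the repository's `apply_partial_rope`: the function names `rope_stride_half` and `rope_interleaved` are hypothetical, and the frequency formula `base^(-2i/rope_dim)` is the standard RoPE definition, assumed here rather than taken from the PR. Both functions rotate only the first `rope_dim` elements of a head vector and leave the rest untouched, which is what "partial" RoPE is assumed to mean.

```rust
/// Hypothetical sketch: stride-half pairing, rotating element i with element half + i.
/// This matches the `rotate_half` / `traditional=False` convention described above.
fn rope_stride_half(x: &mut [f32], rope_dim: usize, pos: usize, base: f32) {
    let half = rope_dim / 2;
    for i in 0..half {
        // Assumed standard RoPE frequency: theta_i = pos * base^(-2i / rope_dim).
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / rope_dim as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[i], x[half + i]);
        x[i] = a * cos - b * sin;
        x[half + i] = a * sin + b * cos;
    }
}

/// Hypothetical sketch: interleaved pairing, rotating element 2i with element 2i + 1.
/// This is the GPT-J / `traditional=True` convention.
fn rope_interleaved(x: &mut [f32], rope_dim: usize, pos: usize, base: f32) {
    let half = rope_dim / 2;
    for i in 0..half {
        // Same frequencies as above; only the element indices differ.
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / rope_dim as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
```

Both variants perform the same number of multiplies and adds per pair and preserve the vector norm, so a mix-up between them produces plausible-looking activations rather than an obvious failure. They only coincide when `rope_dim` is 2 or the position is 0.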
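RoPE convention test (`/tmp/test_rope_conv.py`): MLX `nn.RoPE(traditional=False, rope_dim=64, base=1e7, position=5)` vs each candidate convention:
- stride-half `(i, half+i)`: matches, max-diff 8e-6
- interleaved `(2i, 2i+1)`: diverges, max-diff 67.5

WikiText-2 PPL (Qwen3.5-0.8B, window=512, stride=256, 2041 scored tokens):
- before: 16.6242 (lattice) vs 15.8580 (MLX) → +0.77 PPL gap
- after: 15.8870 (lattice) vs 15.8580 (MLX) → +0.029 PPL gap

Single-window argmax agreement at position 0 (used as forward-pass health diagnostic earlier this session):
- before: 0% (lattice=695, echoing the input token; MLX=220)
- after: 97.3% across a single 512-token window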
The 0.029 PPL residual is within the f32↔bf16 numerical-precision band documented for transformer inference (llama.cpp community: f16 vs f32 PPL deltas are "0.00x"). The earlier hypothesis that the gap was FP precision drift was wrong — it was a positional-encoding bug masquerading as numerical noise. The hybrid transformer (18 GDN + 6 full-attention layers) wasn't transforming hidden state correctly because attention was running on scrambled positions, producing the "embedding leakage" pattern where logits peaked at the input token.
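For reference, the two health metrics used above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation harness from this PR: perplexity is taken to be the exponential of the mean per-token negative log-likelihood (natural log), argmax agreement is taken to be the fraction of positions where both implementations pick the same top token, and the names `perplexity`, `argmax`, and `argmax_agreement` are hypothetical.

```rust
/// Hypothetical sketch: perplexity as exp(mean negative log-likelihood).
/// Assumes `nll` holds one natural-log NLL value per scored token.
fn perplexity(nll: &[f64]) -> f64 {
    assert!(!nll.is_empty(), "need at least one scored token");
    (nll.iter().sum::<f64>() / nll.len() as f64).exp()
}

/// Index of the largest logit in one row (the first index wins on ties).
fn argmax(row: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in row.iter().enumerate() {
        if v > row[best] {
            best = i;
        }
    }
    best
}

/// Fraction of positions where two implementations agree on the top token.
/// Each inner Vec is one position's logits over the vocabulary.
fn argmax_agreement(a: &[Vec<f32>], b: &[Vec<f32>]) -> f64 {
    assert_eq!(a.len(), b.len(), "both runs must cover the same positions");
    if a.is_empty() {
        return 0.0;
    }
    let mut same = 0usize;
    for (x, y) in a.iter().zip(b.iter()) {
        if argmax(x) == argmax(y) {
            same += 1;
        }
    }
    same as f64 / a.len() as f64
}
```

Argmax agreement is the coarser but more direct signal: a pairing bug changes which token wins at most positions, whereas a precision difference between f32 and bf16 would be expected to flip the top token only when the leading logits are nearly tied.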
Files
- `crates/inference/src/model/qwen35/forward.rs:391`: CPU `apply_partial_rope`
- `crates/inference/src/forward/metal_qwen35.rs:346`: Metal kernel (name kept for ABI continuity)
- `crates/inference/src/speculative.rs:1090`: `mtp_apply_partial_rope`
- `crates/inference/src/forward/metal_qwen35.rs` golden snapshot: updated from stale `-22.62` (pre-(1+gamma)) to the math-derived `-45.243256`
- `crates/inference/src/forward/metal_qwen35.rs` test inits: add missing `grammar: None` to fix pre-existing test compile error

Test plan
- `cargo test -p lattice-inference --release --features "f16 metal-gpu" --lib`: 843 pass, 0 fail
- `cargo clippy -p lattice-inference --features "f16 metal-gpu"`: no new errors

Bench-compare
No CPU kernel paths touched: the RoPE change is array-indexing only, with identical FLOP count. Not gated by `bench-regression.yml`. Decode throughput is unaffected (same number of mul/add per token).

🤖 Generated with Claude Code