Skip to content

fix(inference): RoPE stride-half pairing — WikiText-2 PPL gap 0.77 → 0.029 vs MLX#96

Merged
ohdearquant merged 1 commit into
mainfrom
show/perf-ppl-gap-close
May 25, 2026
Merged

fix(inference): RoPE stride-half pairing — WikiText-2 PPL gap 0.77 → 0.029 vs MLX#96
ohdearquant merged 1 commit into
mainfrom
show/perf-ppl-gap-close

Conversation

@ohdearquant
Copy link
Copy Markdown
Owner

Summary

apply_partial_rope and Metal partial_rope_interleaved rotated consecutive pairs (2i, 2i+1) — the GPT-J / traditional=True convention. Qwen3.5 is trained with stride-half pairs (i, half+i) — HF transformers rotate_half, MLX nn.RoPE(traditional=False). The comment "Qwen3.5 uses mrope_interleaved=true" was misread: mrope_interleaved in config controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention.

Evidence

RoPE convention test (/tmp/test_rope_conv.py): MLX nn.RoPE(traditional=False, rope_dim=64, base=1e7, position=5) vs each candidate convention:

Convention max-diff vs MLX
Stride-half (i, half+i) 8e-6
Interleaved (2i, 2i+1) 67.5

WikiText-2 PPL (Qwen3.5-0.8B, window=512, stride=256, 2041 scored tokens):

Lattice MLX gold Gap
Before 16.6242 15.8580 +0.77
After 15.8870 15.8580 +0.029

Single-window argmax agreement at position 0 (used as forward-pass health diagnostic earlier this session):

Before After
Lattice pos 0 argmax 695 (input token echoed back) 220
MLX pos 0 argmax 220 220
511-position argmax agreement low (different distributions) 497/511 = 97.3%

The 0.029 PPL residual is within the f32↔bf16 numerical-precision band documented for transformer inference (llama.cpp community: f16 vs f32 PPL deltas are "0.00x"). The earlier hypothesis that the gap was FP precision drift was wrong — it was a positional-encoding bug masquerading as numerical noise. The hybrid transformer (18 GDN + 6 full-attention layers) wasn't transforming hidden state correctly because attention was running on scrambled positions, producing the "embedding leakage" pattern where logits peaked at the input token.

Files

  • crates/inference/src/model/qwen35/forward.rs:391 — CPU apply_partial_rope
  • crates/inference/src/forward/metal_qwen35.rs:346 — Metal kernel (name kept for ABI continuity)
  • crates/inference/src/speculative.rs:1090mtp_apply_partial_rope
  • crates/inference/src/forward/metal_qwen35.rs golden snapshot — updated from stale -22.62 (pre-(1+gamma)) to the math-derived -45.243256
  • crates/inference/src/forward/metal_qwen35.rs test inits — add missing grammar: None to fix pre-existing test compile error

Test plan

  • cargo test -p lattice-inference --release --features "f16 metal-gpu" --lib — 843 pass, 0 fail
  • cargo clippy -p lattice-inference --features "f16 metal-gpu" — no new errors
  • Single-window PPL (512 tokens) — lattice 11.17, MLX 11.19
  • Full windowed PPL (2041 tokens) — lattice 15.89, MLX 15.86

Bench-compare

No CPU kernel paths touched — RoPE change is array-indexing only with identical FLOP count. Not gated by bench-regression.yml. Decode throughput is unaffected (same number of mul/add per token).

🤖 Generated with Claude Code

….74 PPL gap

`apply_partial_rope` and Metal `partial_rope_interleaved` rotated consecutive
pairs (2i, 2i+1) — the GPT-J / `traditional=True` convention. Qwen3.5 is
trained with stride-half pairs (i, half+i) — HF transformers' `rotate_half`
and MLX's `nn.RoPE(traditional=False)`. The comment "Qwen3.5 uses
mrope_interleaved=true" was misread: `mrope_interleaved` in config controls
multimodal-position section interleaving (M-RoPE for video/image tokens),
not the 1-D text pairing convention.

Empirically verified against MLX's nn.RoPE(traditional=False, rope_dim=64,
base=1e7, position=5): stride-half matches with max-diff 8e-6; interleaved
diverges with max-diff 67.5.

WikiText-2 PPL on Qwen3.5-0.8B (window=512, stride=256, 2041 scored tokens):
  before:  16.6242 (lattice) vs 15.8580 (MLX)  →  +0.77 PPL gap
  after:   15.8870 (lattice) vs 15.8580 (MLX)  →  +0.029 PPL gap
  argmax agreement at pos 0: 0% before (lat=695 echoed input token,
  mlx=220) → 97.3% after across single 512-token window.

The wrong RoPE scrambled positional information in all 6 full-attention
layers of the 24-layer hybrid (75% GDN, 25% full attention) stack. The
hybrid transformer's hidden state stopped transforming, producing the
"embedding leakage" signature where logits peaked at the input token —
diagnosed earlier this session, now explained.

Files:
- crates/inference/src/model/qwen35/forward.rs:391 — CPU apply_partial_rope
- crates/inference/src/forward/metal_qwen35.rs:346 — Metal kernel (name
  kept for ABI continuity)
- crates/inference/src/speculative.rs:1090 — mtp_apply_partial_rope
- crates/inference/src/forward/metal_qwen35.rs golden snapshot — updated
  from stale -22.62 (pre-(1+gamma)) to the correct -45.24 derived value
- crates/inference/src/forward/metal_qwen35.rs test inits — add missing
  `grammar: None` field so tests compile

Tests: 843 pass, 0 fail. Clippy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ohdearquant ohdearquant merged commit 0cc5c9e into main May 25, 2026
3 checks passed
ohdearquant added a commit that referenced this pull request May 25, 2026
crates.io v0.2.2 was published 2026-05-20, before the RoPE pairing fix
landed (PR #96, merged today). Cannot republish 0.2.2 (immutable on
crates.io), so bumping to 0.2.3 to ship the fix. v0.2.2 will be yanked
on crates.io post-publish to prevent new installs from getting the
broken interleaved RoPE.

- Workspace version 0.2.2 → 0.2.3
- Internal path-dep minimum versions bumped to 0.2.3
- Release notes renamed v0.2.2.md → v0.2.3.md with yank notice
- GitHub tag v0.2.2 left in place for history

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant