IHA interleaved + MuonEq-R: 3.192 val loss (2hr track)#79

Open
ms337 wants to merge 6 commits into qlabs-eng:main from ms337:iha-interleaved-muoneq

Conversation

@ms337 (Contributor) commented Apr 22, 2026

Summary

Combines two orthogonal improvements on the 2hr track:

  1. Paper-faithful IHA with sequence interleaving (from #78, "Paper-faithful IHA with sequence interleaving: 3.193 val loss (2hr track)"): Algorithm 1 from arxiv 2602.21371 with proper P=2 pseudo-head interleaving
  2. MuonEq-R row normalization (from #71, "Add MuonEq-R (https://arxiv.org/abs/2603.28254) to limited track"): one-line gradient row normalization before Muon orthogonalization

Result

| Config | Val Loss | Training Time |
| --- | --- | --- |
| Interleaved IHA 12ep (baseline) | 3.1926 – 3.1931 | 112m |
| Interleaved IHA + MuonEq-R 12ep | 3.1921 | 112m |

The improvement (~0.001) is marginal and within run-to-run noise on top of interleaved IHA. The reported MuonEq-R improvement with weight-fusion IHA was -0.003 (#71 commit: 3.214 → 3.211), but the interleaved IHA's richer P²=4 attention patterns appear to capture most of the gradient information that MuonEq-R's row normalization would otherwise add.

Trajectory observations

  • Epochs 1-7: MuonEq-R clearly helps (up to -0.021 at epoch 7, before the dupe layers activate)
  • Epochs 8-12: the gap closes; runs converge after dupe-layer activation + logit averaging

Changes

Just the 2-line MuonEq-R addition cherry-picked on top of the interleaved IHA branch (#78):

```python
# MuonEq-R row normalization (arxiv 2603.28254)
g /= g.float().norm(dim=-1, keepdim=True).clamp_min(1e-7).to(g.dtype)
```

Applied to both:

  • train.py (root)
  • two_hour/train.py (2hr track)
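For context, a minimal sketch of where this one-liner sits relative to Muon's orthogonalization step; `newton_schulz` and `muon_like_update` are illustrative names, not the repo's actual functions:

```python
# Sketch only: a Muon-style update with MuonEq-R's row normalization applied
# right before the Newton-Schulz orthogonalization. Names are assumptions,
# not identifiers from train.py.
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration approximating the orthogonal factor of G.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    transposed = X.size(-2) > X.size(-1)
    if transposed:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    return X.mT if transposed else X

def muon_like_update(g: torch.Tensor, lr: float) -> torch.Tensor:
    # MuonEq-R: normalize each gradient row to unit norm, then orthogonalize.
    g = g / g.float().norm(dim=-1, keepdim=True).clamp_min(1e-7).to(g.dtype)
    return -lr * newton_schulz(g).to(g.dtype)
```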

Test plan

  • Merges cleanly on top of interleaved IHA
  • 12-epoch 2hr run: 3.192141 val loss / 111.97m training
  • No regression — improvement is small but positive
  • Decide: is ~0.001 improvement worth the PR, or should this be merged only if MuonEq-R wasn't already present?

🤖 Generated with Claude Code

ubuntu and others added 6 commits April 14, 2026 01:44
Implements cross-head Q/K/V mixing from the IHA paper (arxiv 2602.21371)
on top of the MTP baseline. Each head's query/key/value projection becomes
a learned linear combination of all heads' projections via H×H mixing
matrices, enabling richer cross-head attention patterns.

Key optimization: mixing matrices are fused into the Q/K/V projection
weights at forward time (W_fused[h] = sum_m mix[h,m] * W_orig[m]).
The [H,H]@[H,d*C] fusion matmul is negligible vs the main projection,
keeping per-step overhead to just 47ms (3.3%) over baseline.
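
A shape-level sketch of the fusion (example dimensions and names, not the repo's identifiers):

```python
# Illustrative only: fold the learned [H, H] mixing matrix into the per-head
# Q/K/V projection weights before the main projection matmul.
import torch

H, d, C = 8, 64, 512                           # heads, head dim, model dim (example values)
mix = torch.randn(H, H)                        # learned cross-head mixing matrix
w_orig = torch.randn(H, d, C)                  # per-head projection weights W_orig[m]

# W_fused[h] = sum_m mix[h, m] * W_orig[m]  -- the small [H,H] @ [H, d*C] matmul
w_fused = torch.einsum('hm,mdc->hdc', mix, w_orig)

x = torch.randn(4, 1024, C)                    # activations [B, T, C]
q = torch.einsum('btc,hdc->bthd', x, w_fused)  # same per-step cost as the unfused projection
```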

Results (sub-1hr track, 11 epochs):
  MTP baseline:  3.222 val loss, 57.5m training
  IHA+MTP fused: 3.214 val loss, 59.7m training (-0.008, under 1hr)

CLI flags: --iha, --iha-v, --iha-lr (default: SCALAR_LR)
Best config: --iha --iha-v --iha-lr=0.02

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
IHA Q+K+V mixing with iha-lr=0.02 is now the default behavior.
No flags needed to reproduce the record — just run:
  torchrun --standalone --nproc_per_node=8 train.py

Use --no-iha to disable IHA if needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the previous weight-fusion cross-head mixing with full
Algorithm 1 from the IHA paper (arxiv 2602.21371):

1. Mix: α^{Q,K,V} ∈ [H, H, P] generate P pseudo-heads per head via einsum
2. Interleave: [B,T,H,P,d] → [B,T*P,H,d], placing pseudo-tokens adjacent in the sequence
3. Attend: flash attention on the expanded T*P sequence
4. De-interleave + collapse: learned R ∈ [H, P] maps back to [B,T,H,d]
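
A shape-level sketch of these four steps for P=2 (assumed names, with plain SDPA standing in for the flash-attention call; per the list above, Q, K and V each get their own α, only one is shown):

```python
# Illustrative sketch of Algorithm 1 with P=2; names/shapes are assumptions.
import torch
import torch.nn.functional as F

B, T, H, P, d = 2, 1024, 8, 2, 64
x = torch.randn(B, T, H, d)                   # per-head projections (one of Q/K/V)
alpha = torch.randn(H, H, P)                  # mixing tensor α ∈ [H, H, P]
R = torch.randn(H, P)                         # learned collapse weights

# 1. Mix: each output head gets P pseudo-heads as combinations of all heads
mixed = torch.einsum('bthd,hmp->btmpd', x, alpha)              # [B, T, H, P, d]

# 2. Interleave: pseudo-tokens for the same position become adjacent
seq = mixed.permute(0, 1, 3, 2, 4).reshape(B, T * P, H, d)     # [B, T*P, H, d]

# 3. Attend on the expanded sequence (flash attention in the real implementation)
out = F.scaled_dot_product_attention(
    seq.transpose(1, 2), seq.transpose(1, 2), seq.transpose(1, 2), is_causal=True
).transpose(1, 2)                                              # [B, T*P, H, d]

# 4. De-interleave + collapse the P pseudo-heads back down
y = torch.einsum('btphd,hp->bthd', out.reshape(B, T, P, H, d), R)   # [B, T, H, d]
```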

Paper-faithful FLOP-matched window schedule:
- Short (S) layers: window = N/(2P) = 512 for P=2
- Long (L) layers: full expanded context = N*P = 4096 for P=2
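
For concreteness (assuming N = 2048, which is consistent with the 512 and 4096 figures above):

```python
# Assumed sequence length N = 2048, P = 2 pseudo-heads.
N, P = 2048, 2
short_window = N // (2 * P)   # 512  -- short (S) layers
long_window = N * P           # 4096 -- long (L) layers, full expanded context
```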

Results on sub-1hr track config (12 epochs):
  Previous IHA (weight-fusion): 3.214 val loss in 59.7m
  Paper-faithful IHA (P=2):     3.193 val loss in 112.0m

Per-step cost is ~1.76x baseline due to 2x attention sequence length
(fundamental memory bandwidth cost, not FLOPs). Fits 2hr track, exceeds
1hr budget. For the 1hr track, revert to the previous commit which uses
weight-fusion mixing without sequence expansion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move paper-faithful sequence-interleaving IHA to two_hour/train.py
where it fits the time budget, and restore root train.py to the
weight-fusion cross-head mixing variant which fits the sub-1hr track.

Sub-1hr track (root train.py, weight-fusion IHA):
  val loss 3.214 in 59.7m training (11 epochs)

2hr track (two_hour/train.py, paper-faithful IHA P=2):
  val loss 3.200 in 101.9m training (11 epochs)
  val loss 3.193 in 112.0m training (12 epochs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keeps the paper-faithful Algorithm 1 implementation (sequence-dim
interleaving) in root train.py alongside two_hour/train.py.
Root currently exceeds the 1hr budget — to be optimized separately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applies the one-line MuonEq-R fix (arxiv 2603.28254) from upstream
PR qlabs-eng#71 on top of our paper-faithful interleaved IHA implementation.

Normalizes gradient rows to unit norm before the Muon orthogonalization
step. Upstream reported -0.003 val loss improvement alone (3.214 → 3.211)
with weight-fusion IHA.

Now applied to both:
- train.py (root, 1hr track)
- two_hour/train.py (2hr track with full interleaving)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@akshayvegesna (Contributor) commented:

Thanks for fixing the impl! Perf-wise, this seems worse than the current record #75, which gets 3.188.
