
10L XSA + LeakyReLU² + Partial RoPE (val_bpb=1.1370)#434

Open
parinzee wants to merge 2 commits into openai:main from parinzee:submission/2026-03-22-XSA-LeakyReLU-PartialRoPE

Conversation

@parinzee

Summary

  • val_bpb: 1.1370 (mean of 3 seeds, post int5/int6+zstd quantization roundtrip, sliding window stride=64)
  • 10 layers, 512 dim, 8 heads / 4 KV heads, tied embeddings
  • Artifact: ~15.9 MB (all 3 seeds under 16 MB)
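The headline val_bpb is measured after a quantization roundtrip. The PR does not show the actual scheme, so the snippet below is only an assumed sketch: a symmetric per-tensor signed-integer quantize/dequantize (here at 6 bits); the zstd stage is lossless compression of the integer codes and does not affect the values, so it is omitted.

```python
import torch


def quant_roundtrip(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric per-tensor quantize to `bits` signed levels, then dequantize.

    Assumed scheme for illustration only; the submission's actual int5/int6
    packing may differ (per-channel scales, asymmetric zero points, etc.).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    # Round to the nearest integer level, clamping to the signed range.
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Evaluating val_bpb on `quant_roundtrip(weight)` rather than the raw weights is what makes the reported number reflect the compressed artifact.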

3-Seed Results

| Seed | val_bpb | artifact_bytes | valid |
|------|---------|----------------|-------|
| 42   | 1.13815 | 15,983,322     | yes   |
| 1337 | 1.13601 | 15,968,675     | yes   |
| 2024 | 1.13697 | 15,650,120     | yes   |
| Mean | 1.13704 |                |       |
| Std  | 0.00088 |                |       |

Key Changes from Baseline

  1. XSA (Exclusive Self Attention) on last 4 layers — removes self-value projection (arxiv:2603.09078)
  2. LeakyReLU(0.5)² activation replacing ReLU²
  3. Partial RoPE — rotary embeddings on 25% of head dims (16/64)
  4. Higher learning rates — matrix_lr 0.02→0.025, scalar_lr 0.02→0.025, tied_embed_lr 0.03→0.035
  5. 8% magnitude pruning (up from 3%) for artifact compliance
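Change 2 above swaps the baseline's ReLU² for a squared LeakyReLU. A minimal sketch of that activation follows; note that plain squaring discards the sign of the negative branch, and whether the submission instead uses a sign-preserving variant is not stated here, so this is an assumption.

```python
import torch
import torch.nn.functional as F


def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    """Square of LeakyReLU(slope); drop-in for the baseline's relu(x) ** 2.

    With slope=0.5, negative inputs are halved before squaring, so the
    activation stays nonzero (and smooth at 0) on the negative side.
    """
    y = F.leaky_relu(x, negative_slope=slope)
    return y * y
```

For example, an input of 2.0 maps to 4.0 and an input of -2.0 maps to 1.0 (halved to -1.0, then squared).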
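Change 3, partial RoPE, rotates only the first 16 of the 64 dims per head and passes the remaining 48 through unchanged. The sketch below is one common way to implement this (half-split rotation); the PR's exact dim layout and cos/sin caching are assumptions.

```python
import torch


def apply_partial_rope(
    q: torch.Tensor,          # (..., seq, head_dim), head_dim = 64 here
    cos: torch.Tensor,        # (seq, rot_dims // 2)
    sin: torch.Tensor,        # (seq, rot_dims // 2)
    rot_dims: int = 16,       # 25% of a 64-dim head
) -> torch.Tensor:
    """Apply rotary embedding to the first `rot_dims` dims; pass the rest through."""
    q_rot, q_pass = q[..., :rot_dims], q[..., rot_dims:]
    # Standard half-split rotation on the rotary slice only.
    x1, x2 = q_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, q_pass), dim=-1)
```

The same function is applied to queries and keys; the untouched 48 dims let those channels carry position-independent content.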
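Change 5 raises magnitude pruning from 3% to 8% to keep the artifact under the 16 MB cap. A minimal per-tensor sketch is below; whether the submission prunes per-tensor or globally across all matrices is not stated, so that choice is an assumption.

```python
import torch


def magnitude_prune_(weight: torch.Tensor, frac: float = 0.08) -> torch.Tensor:
    """Zero the smallest `frac` of entries by absolute value, in place.

    Ties at the threshold may zero slightly more than frac * numel entries.
    """
    k = int(frac * weight.numel())
    if k == 0:
        return weight
    # k-th smallest absolute value is the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    weight[weight.abs() <= threshold] = 0.0
    return weight
```

Zeroed weights compress far better under zstd, which is how a higher pruning rate trades a little val_bpb for artifact size.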

Run Command

```shell
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Built on SOTA baseline by @thwu1 (PR #180).

parinzee and others added 2 commits March 22, 2026 12:08
Co-Authored-By: Parinthapat Pengpun <parinzee@users.noreply.github.com>
