
10L XSA + LeakyReLU² + Partial RoPE (val_bpb=1.1370)#434

Open
parinzee wants to merge 2 commits into openai:main from parinzee:submission/2026-03-22-XSA-LeakyReLU-PartialRoPE

Conversation

@parinzee

Summary

  • val_bpb: 1.1370 (mean of 3 seeds, post int5/int6+zstd quantization roundtrip, sliding window stride=64)
  • 10 layers, 512 dim, 8 heads / 4 KV heads, tied embeddings
  • Artifact: ~15.9 MB (all 3 seeds under 16 MB)
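The headline val_bpb is measured after a quantization roundtrip. The PR does not show the actual scheme, so the snippet below is only an assumed sketch: a symmetric per-tensor signed-integer quantize/dequantize (here at 6 bits); the zstd stage is lossless compression of the integer codes and does not affect the values, so it is omitted.

```python
import torch


def quant_roundtrip(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric per-tensor quantize to `bits` signed levels, then dequantize.

    Assumed scheme for illustration only; the submission's actual int5/int6
    packing may differ (per-channel scales, asymmetric zero points, etc.).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    # Round to the nearest integer level, clamping to the signed range.
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Evaluating val_bpb on `quant_roundtrip(weight)` rather than the raw weights is what makes the reported number reflect the compressed artifact.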

3-Seed Results

| Seed | val_bpb | artifact_bytes | valid |
|------|---------|----------------|-------|
| 42   | 1.13815 | 15,983,322     | yes   |
| 1337 | 1.13601 | 15,968,675     | yes   |
| 2024 | 1.13697 | 15,650,120     | yes   |
| Mean | 1.13704 |                |       |
| Std  | 0.00088 |                |       |

Key Changes from Baseline

  1. XSA (Exclusive Self Attention) on last 4 layers — removes self-value projection (arxiv:2603.09078)
  2. LeakyReLU(0.5)² activation replacing ReLU²
  3. Partial RoPE — rotary embeddings on 25% of head dims (16/64)
  4. Higher learning rates — matrix_lr 0.02→0.025, scalar_lr 0.02→0.025, tied_embed_lr 0.03→0.035
  5. 8% magnitude pruning (up from 3%) for artifact compliance
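Change 2 above swaps the baseline's ReLU² for a squared LeakyReLU. A minimal sketch of that activation follows; note that plain squaring discards the sign of the negative branch, and whether the submission instead uses a sign-preserving variant is not stated here, so this is an assumption.

```python
import torch
import torch.nn.functional as F


def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    """Square of LeakyReLU(slope); drop-in for the baseline's relu(x) ** 2.

    With slope=0.5, negative inputs are halved before squaring, so the
    activation stays nonzero (and smooth at 0) on the negative side.
    """
    y = F.leaky_relu(x, negative_slope=slope)
    return y * y
```

For example, an input of 2.0 maps to 4.0 and an input of -2.0 maps to 1.0 (halved to -1.0, then squared).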
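Change 3, partial RoPE, rotates only the first 16 of the 64 dims per head and passes the remaining 48 through unchanged. The sketch below is one common way to implement this (half-split rotation); the PR's exact dim layout and cos/sin caching are assumptions.

```python
import torch


def apply_partial_rope(
    q: torch.Tensor,          # (..., seq, head_dim), head_dim = 64 here
    cos: torch.Tensor,        # (seq, rot_dims // 2)
    sin: torch.Tensor,        # (seq, rot_dims // 2)
    rot_dims: int = 16,       # 25% of a 64-dim head
) -> torch.Tensor:
    """Apply rotary embedding to the first `rot_dims` dims; pass the rest through."""
    q_rot, q_pass = q[..., :rot_dims], q[..., rot_dims:]
    # Standard half-split rotation on the rotary slice only.
    x1, x2 = q_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, q_pass), dim=-1)
```

The same function is applied to queries and keys; the untouched 48 dims let those channels carry position-independent content.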
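Change 5 raises magnitude pruning from 3% to 8% to keep the artifact under the 16 MB cap. A minimal per-tensor sketch is below; whether the submission prunes per-tensor or globally across all matrices is not stated, so that choice is an assumption.

```python
import torch


def magnitude_prune_(weight: torch.Tensor, frac: float = 0.08) -> torch.Tensor:
    """Zero the smallest `frac` of entries by absolute value, in place.

    Ties at the threshold may zero slightly more than frac * numel entries.
    """
    k = int(frac * weight.numel())
    if k == 0:
        return weight
    # k-th smallest absolute value is the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    weight[weight.abs() <= threshold] = 0.0
    return weight
```

Zeroed weights compress far better under zstd, which is how a higher pruning rate trades a little val_bpb for artifact size.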

Run Command

```shell
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Built on SOTA baseline by @thwu1 (PR #180).

parinzee and others added 2 commits March 22, 2026 12:08
Co-Authored-By: Parinthapat Pengpun <parinzee@users.noreply.github.com>
