
Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)#462

Open
JoeProAI wants to merge 2 commits into openai:main from JoeProAI:joeproai/swiglu-xsa4-adamw-ttt-1.0672

Conversation

@JoeProAI

Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)

3-seed mean val_bpb: 1.0672 | Best seed: 1.0658
Verified on 8xH100 80GB, 10-minute wall-clock budget.

Approach

Novel architecture discovered through GEPA (Gemini-driven Evolutionary Parameter Architecture search) combined with community-proven techniques. Built over 5 days, 6 waves of experiments, ~$250 total compute on Modal H100s.

Architecture (discovered by GEPA)

  • SwiGLU FFN with Star-ReLU activation
  • U-Net skip connections with learned gating
  • BigramHash embeddings (8192 buckets, 128 dim)
  • SmearGate on embeddings
  • 11 layers, 512 dim, 8 heads, 8 KV heads, MLP hidden=1792, tied embeddings
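The SwiGLU-with-Star-ReLU FFN in the list above can be sketched as follows. This is a minimal NumPy illustration, not the submission's code: the StarReLU scale/bias constants are the published defaults from the MetaFormer paper, and the weight names and initialization are assumptions for the sketch.

```python
import numpy as np

def star_relu(x, scale=0.8944, bias=-0.4472):
    # StarReLU: s * relu(x)**2 + b (constants from the MetaFormer paper)
    return scale * np.maximum(x, 0.0) ** 2 + bias

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated FFN: the activated gate branch elementwise-multiplies the up branch
    return (star_relu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 512, 1792  # dims from the architecture list above
x = rng.standard_normal((4, d_model))
w_gate = rng.standard_normal((d_model, d_hidden)) * 0.02
w_up = rng.standard_normal((d_model, d_hidden)) * 0.02
w_down = rng.standard_normal((d_hidden, d_model)) * 0.02
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 512)
```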

Training techniques (adopted + tuned)

  • AdamW test-time training (TTT): lr=0.0005, 10 epochs (from @sjp611, #442)
  • EMA, RoPE, LN Scale, QAT (from @felipe-parodi, #398, and @fbedev, #410)

3-Seed Results

| Seed | val_bpb    |
|------|------------|
| 42   | 1.06733191 |
| 123  | 1.06833018 |
| 7    | 1.06579646 |
| Mean | 1.06715285 |
| Std  | 0.00104211 |
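The mean and standard deviation above can be reproduced from the per-seed values; note that the reported Std is the population (not sample) standard deviation:

```python
import statistics

bpb = [1.06733191, 1.06833018, 1.06579646]  # seeds 42, 123, 7
mean = sum(bpb) / len(bpb)
std = statistics.pstdev(bpb)  # population std matches the reported value
print(f"{mean:.8f} {std:.8f}")  # 1.06715285 0.00104211
```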

Comparison to prior SOTA

| Submission              | Mean BPB | Best BPB |
|-------------------------|----------|----------|
| Ours                    | 1.0672   | 1.0658   |
| @sjp611 (#442)          | 1.1027   | 1.0992   |
| @felipe-parodi (#398)   | 1.1221   | 1.1213   |
| @thwu1 (#180, merged)   | 1.1428   | --       |

Key finding

AdamW TTT produced a 0.053 bpb improvement on our architecture vs 0.019 on the standard architecture (PR #398). This suggests SwiGLU + U-Net skip connections create a loss landscape that AdamW navigates significantly better than SGD during test-time training.
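As a rough illustration of the TTT mechanic (a hedged sketch, not the repo's implementation): test-time training clones the trained weights and runs a few AdamW steps on the held-out data's own loss before scoring. Below, a self-contained NumPy AdamW step applied to a toy quadratic loss; lr=5e-4 and the 10-step budget mirror the commit message, everything else (the toy loss, the weight-decay value) is an assumption for the sketch.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=5e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # One AdamW update with decoupled weight decay
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

# Toy stand-in for "adapt the model on the test data": minimize ||p - target||^2
target = np.array([0.3, -0.2])
p = np.zeros(2)                      # cloned weights before TTT
m, v = np.zeros_like(p), np.zeros_like(p)
for t in range(1, 11):               # 10 TTT steps, as in the commit message
    g = 2.0 * (p - target)           # gradient of the toy loss
    p, m, v = adamw_step(p, g, m, v, t)

loss_before = np.sum((np.zeros(2) - target) ** 2)
loss_after = np.sum((p - target) ** 2)
print(loss_after < loss_before)  # True: a few AdamW steps reduce the loss
```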

Credits

Built by @JoePro (GitHub: @JoeProAI) with AI agent assistance: OpenClaw (Claude Opus), Codex (GPT-5.4), Claude Sonnet, Gemini 2.5 Pro, and Paperclip agent coordination.

Run command

```shell
# Default seed
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Specific seed
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are set as defaults in train_gpt.py.

Architecture discovered via GEPA (Gemini-driven evolutionary search).
SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4.
AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442).
EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410).

3-seed results: 1.06733 / 1.06833 / 1.06580
Mean: 1.06715, Std: 0.00104

Built by @joepro with AI agents via OpenClaw.
Compute provided by Modal.
