
Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)#462

Open
JoeProAI wants to merge 2 commits into openai:main from JoeProAI:joeproai/swiglu-xsa4-adamw-ttt-1.0672

Conversation

@JoeProAI

Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)

3-seed mean val_bpb: 1.0672 | Best seed: 1.0658
Verified on 8xH100 80GB, 10-minute wall-clock budget.

Approach

Novel architecture discovered through GEPA (Gemini-driven Evolutionary Parameter Architecture search) combined with community-proven techniques. Built over 5 days, 6 waves of experiments, ~$250 total compute on Modal H100s.

Architecture (discovered by GEPA)

  • SwiGLU FFN with Star-ReLU activation
  • U-Net skip connections with learned gating
  • BigramHash embeddings (8192 buckets, 128 dim)
  • SmearGate on embeddings
  • 11 layers, 512 dim, 8 heads, 8 KV heads, MLP hidden=1792, tied embeddings
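The SwiGLU-with-Star-ReLU FFN in the list above can be sketched as follows. This is a minimal NumPy illustration, not the submission's code: the StarReLU scale/bias constants are the published defaults from the MetaFormer paper, and the weight names and initialization are assumptions for the sketch.

```python
import numpy as np

def star_relu(x, scale=0.8944, bias=-0.4472):
    # StarReLU: s * relu(x)**2 + b (constants from the MetaFormer paper)
    return scale * np.maximum(x, 0.0) ** 2 + bias

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated FFN: the activated gate branch elementwise-multiplies the up branch
    return (star_relu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 512, 1792  # dims from the architecture list above
x = rng.standard_normal((4, d_model))
w_gate = rng.standard_normal((d_model, d_hidden)) * 0.02
w_up = rng.standard_normal((d_model, d_hidden)) * 0.02
w_down = rng.standard_normal((d_hidden, d_model)) * 0.02
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 512)
```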

Training techniques (adopted + tuned)

  • AdamW test-time training (TTT): lr=0.0005, 10 epochs (from @sjp611, #442)
  • EMA, RoPE, LN Scale, QAT (from @felipe-parodi, #398, and @fbedev, #410)

3-Seed Results

| Seed | val_bpb    |
|------|------------|
| 42   | 1.06733191 |
| 123  | 1.06833018 |
| 7    | 1.06579646 |
| Mean | 1.06715285 |
| Std  | 0.00104211 |
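The mean and standard deviation above can be reproduced from the per-seed values; note that the reported Std is the population (not sample) standard deviation:

```python
import statistics

bpb = [1.06733191, 1.06833018, 1.06579646]  # seeds 42, 123, 7
mean = sum(bpb) / len(bpb)
std = statistics.pstdev(bpb)  # population std matches the reported value
print(f"{mean:.8f} {std:.8f}")  # 1.06715285 0.00104211
```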

Comparison to prior SOTA

| Submission              | Mean BPB | Best BPB |
|-------------------------|----------|----------|
| Ours                    | 1.0672   | 1.0658   |
| @sjp611 (#442)          | 1.1027   | 1.0992   |
| @felipe-parodi (#398)   | 1.1221   | 1.1213   |
| @thwu1 (#180, merged)   | 1.1428   | --       |

Key finding

AdamW TTT produced a 0.053 bpb improvement on our architecture vs 0.019 on the standard architecture (PR #398). This suggests SwiGLU + U-Net skip connections create a loss landscape that AdamW navigates significantly better than SGD during test-time training.
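As a rough illustration of the TTT mechanic (a hedged sketch, not the repo's implementation): test-time training clones the trained weights and runs a few AdamW steps on the held-out data's own loss before scoring. Below, a self-contained NumPy AdamW step applied to a toy quadratic loss; lr=5e-4 and the 10-step budget mirror the commit message, everything else (the toy loss, the weight-decay value) is an assumption for the sketch.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=5e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # One AdamW update with decoupled weight decay
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

# Toy stand-in for "adapt the model on the test data": minimize ||p - target||^2
target = np.array([0.3, -0.2])
p = np.zeros(2)                      # cloned weights before TTT
m, v = np.zeros_like(p), np.zeros_like(p)
for t in range(1, 11):               # 10 TTT steps, as in the commit message
    g = 2.0 * (p - target)           # gradient of the toy loss
    p, m, v = adamw_step(p, g, m, v, t)

loss_before = np.sum((np.zeros(2) - target) ** 2)
loss_after = np.sum((p - target) ** 2)
print(loss_after < loss_before)  # True: a few AdamW steps reduce the loss
```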

Credits

Built by @JoePro (GitHub: @JoeProAI) with AI agent assistance: OpenClaw (Claude Opus), Codex (GPT-5.4), Claude Sonnet, Gemini 2.5 Pro, and Paperclip agent coordination.

Run command

```shell
# Default seed
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Specific seed
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are set as defaults in train_gpt.py.

Architecture discovered via GEPA (Gemini-driven evolutionary search).
SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4.
AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442).
EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410).

3-seed results: 1.06733 / 1.06833 / 1.06580
Mean: 1.06715, Std: 0.00104

Built by @joepro with AI agents via OpenClaw.
Compute provided by Modal.
