Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations by anantdgoel · Pull Request #413 · openai/parameter-golf

anantdgoel · 2026-03-22T07:06:50Z

Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations

val_bpb: 1.4525 (sliding window, stride=128, GA+VR combined) | 13.2 MB | 1xRTX3090, 1000 steps, 131K batch
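The headline figure is a sliding-window bits-per-byte score. As a rough sketch of how such an evaluation is commonly organized (not necessarily how train_gpt.py implements it): each window re-reads earlier tokens as context but only the tokens not yet scored contribute to the loss, and the summed loss in nats is converted to bits per byte. Only stride=128 comes from the run description; `ctx_len`, `sliding_windows`, and `bits_per_byte` are hypothetical names chosen for illustration.

```python
import math

def sliding_windows(n_tokens, ctx_len, stride):
    """Yield (window_start, window_end, score_start).

    Tokens in [score_start, window_end) are scored in this window;
    tokens in [window_start, score_start) serve only as context.
    """
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + ctx_len, n_tokens)
        yield start, end, scored_upto
        scored_upto = end   # every token is scored exactly once
        start += stride     # the next window advances by `stride` tokens

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Toy example with small numbers (the run itself used stride=128).
spans = list(sliding_windows(n_tokens=10, ctx_len=4, stride=2))
# A loss of 8*ln(2) nats is 8 bits; spread over 4 bytes that is 2 bits per byte.
example_bpb = bits_per_byte(total_nll_nats=8 * math.log(2), total_bytes=4)
```

A smaller stride gives each scored token more left context at the cost of more forward passes.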

Two novel architecture modifications and one negative result. Sharing validated techniques with controlled ablation data.

Contributions

Value Residual (ResFormer) -- -0.015 BPB. Cache V vectors from layer 0, mix into all subsequent layers via learnable scalars. 18 params total. arXiv:2410.17897 (ACL 2025). Enable: VALUE_RESIDUAL=1.
Gated Attention -- -0.003 BPB. Per-head sigmoid gate after SDPA output, eliminating attention sinks. ~37K params. arXiv:2505.06708 (NeurIPS 2025 Best Paper). Enable: GATED_ATTENTION=1.
PPM-C Context Mixer -- +0.0018 BPB (negative result). Classical compression blended with neural softmax. Dilutes predictions on SmearGate+BigramHash models.
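The dilution described in the PPM-C item can be illustrated with a small worked example. The PR does not state how the two distributions are blended; the sketch below assumes plain linear interpolation of the probability assigned to the true next symbol, and `mix` and `bits` are hypothetical helper names.

```python
import math

def mix(p_neural, p_ppm, weight):
    """Linearly interpolate two probabilities for the true next symbol."""
    return (1.0 - weight) * p_neural + weight * p_ppm

def bits(p):
    """Coding cost in bits of an event with probability p."""
    return -math.log2(p)

# Case 1: the neural model is already the better predictor.
# Blending pulls its probability down, so the symbol costs more bits.
diluted = mix(p_neural=0.5, p_ppm=0.25, weight=0.5)
extra_bits = bits(diluted) - bits(0.5)    # positive: mixing hurt

# Case 2: the classical model is the better predictor.
# Blending helps, which is the situation a context mixer hopes for.
helped = mix(p_neural=0.5, p_ppm=1.0, weight=0.5)
saved_bits = bits(0.5) - bits(helped)     # positive: mixing helped
```

If the base model already captures most short-range statistics, the first case would dominate, which is consistent with the small net regression reported here.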

The two positive techniques stack nearly additively: -0.0172 BPB combined, against -0.0183 for the sum of the individual deltas.
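The core operations of the two positive techniques can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in train_gpt.py. It assumes two mixing scalars per layer for Value Residual (which would account for 18 parameters in a 9-layer model) and an input-dependent gate with one weight vector per head for Gated Attention. All function and variable names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def value_residual(v_layer, v_first, lam_layer, lam_first):
    """Blend the current layer's V with the V cached from layer 0.

    lam_layer and lam_first stand in for the learnable scalars.
    """
    return [lam_layer * a + lam_first * b for a, b in zip(v_layer, v_first)]

def gated_heads(x, head_outputs, gate_weights):
    """Scale each head's attention output by sigmoid(<x, w_h>).

    One gate value per head, computed from the token's input vector x.
    """
    gated = []
    for out_h, w_h in zip(head_outputs, gate_weights):
        gate = sigmoid(sum(xi * wi for xi, wi in zip(x, w_h)))
        gated.append([gate * o for o in out_h])
    return gated

# Toy values: a 2-dimensional V vector and two heads.
v0 = [1.0, 2.0]                               # V cached at layer 0
v3 = [3.0, -2.0]                              # V computed at a later layer
mixed = value_residual(v3, v0, 0.5, 0.5)      # equal-weight blend

x = [0.0, 0.0]                                # zero input -> every gate is 0.5
heads = [[2.0, 4.0], [1.0, 1.0]]              # per-head SDPA outputs
w = [[1.0, 0.0], [0.0, 1.0]]                  # one gate weight vector per head
gated = gated_heads(x, heads, w)
```

Because the gate can approach zero, a head can suppress its own output for a given token instead of parking attention mass on a sink token.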

Ablation Results

v1024 9L 2xMLP, SmearGate + BigramHash + OrthoInit + WD 0.04, 131K batch, 1000 steps.

Config	Sliding BPB	Delta vs Control
Control	1.4697	--
Gated Attention only	1.4665	-0.0032
Value Residual only	1.4546	-0.0151
GA + VR combined	1.4525	-0.0172
PPM-C (eval-only)	1.2900	+0.0018 (worse)
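The deltas for the three trained configurations can be re-derived from the table with a few lines of arithmetic. The PPM-C row is left out because its reported delta cannot be derived from the control row shown here.

```python
control = 1.4697
results = {
    "Gated Attention only": 1.4665,
    "Value Residual only": 1.4546,
    "GA + VR combined": 1.4525,
}

# Delta vs. control, rounded to the table's four decimal places.
deltas = {name: round(bpb - control, 4) for name, bpb in results.items()}

# Sum of the two single-technique deltas, for comparison with the combined run.
sum_of_individual = round(
    deltas["Gated Attention only"] + deltas["Value Residual only"], 4
)
shortfall = round(deltas["GA + VR combined"] - sum_of_individual, 4)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
print(f"sum of individual deltas: {sum_of_individual:+.4f}")
print(f"combined minus sum: {shortfall:+.4f}")
```

The combined run recovers all but about 0.001 BPB of the summed individual gains, so the two techniques interfere very little in this setup.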

A production run (11L MLP3x + full community stack + VR + GA, 9500 steps) is in progress. Results will follow in a separate submission if they are competitive.

Files

README.md -- Full writeup with technique details and reproducibility
submission.json -- Metadata
train_gpt.py -- Training script with Value Residual, Gated Attention, XSA, EMA, Partial RoPE, LN Scale

…) with ablations

Two novel architecture modifications validated with controlled ablations:
- Value Residual: layer-0 V shortcut, 18 scalars, -0.015 BPB
- Gated Attention: per-head sigmoid gate, -0.003 BPB
- PPM-C: negative result (+0.002 BPB on SmearGate+BigramHash)

Combined: -0.017 BPB additive, no interference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

First submission combining 6 independently-proven architecture improvements:
- Catalytic Residuals (PR openai#450, -0.024 bpb)
- Value Residual/ResFormer (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048)
- 12 Layers (-0.023 bpb vs 11L)
- 3x MLP

8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)

notapplica mentioned this pull request Mar 22, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

joshuaswarren mentioned this pull request Mar 22, 2026

Non-record: 6-Technique Stack — Catalytic Residuals + Value Residual + Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690) #474

Open


anantdgoel wants to merge 1 commit into openai:main from anantdgoel:value-residual-gated-attention
