Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations#413

Open
anantdgoel wants to merge 1 commit into openai:main from anantdgoel:value-residual-gated-attention

val_bpb: 1.4525 (sliding window, stride=128, GA+VR combined) | 13.2 MB | 1xRTX3090, 1000 steps, 131K batch
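For readers unfamiliar with the sliding-window figure above, here is a minimal sketch of how such an evaluation is commonly laid out and how a summed loss becomes bits per byte. The stride of 128 is from this PR; the context length of 1024 and the helper names `sliding_windows` and `bits_per_byte` are assumptions for illustration, not taken from `train_gpt.py`.

```python
import math

def sliding_windows(n_tokens, seq_len=1024, stride=128):
    """Yield (start, end, first_scored) triples over a token stream.

    The model sees tokens [start, end) but only tokens in
    [first_scored, end) count toward the loss, so each token is scored
    exactly once with as much left context as the window allows.
    seq_len=1024 is an ASSUMED context length; stride=128 is from the PR.
    """
    scored_upto = 0
    while scored_upto < n_tokens:
        # The first window scores a full context; later ones score `stride` new tokens.
        step = seq_len if scored_upto == 0 else stride
        end = min(scored_upto + step, n_tokens)
        start = max(0, end - seq_len)
        yield start, end, scored_upto
        scored_upto = end

def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    return total_nll_nats / math.log(2) / n_bytes

windows = list(sliding_windows(2000))
```

A smaller stride gives each scored token more context at the cost of more forward passes, which is why sliding-window BPB is usually lower than BPB from non-overlapping chunks.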

Two novel architecture modifications and one negative result. Sharing validated techniques with controlled ablation data.

Contributions

  1. Value Residual (ResFormer) -- -0.015 BPB. Cache V vectors from layer 0, mix into all subsequent layers via learnable scalars. 18 params total. arXiv:2410.17897 (ACL 2025). Enable: VALUE_RESIDUAL=1.

  2. Gated Attention -- -0.003 BPB. Per-head sigmoid gate after SDPA output, eliminating attention sinks. ~37K params. arXiv:2505.06708 (NeurIPS 2025 Best Paper). Enable: GATED_ATTENTION=1.

  3. PPM-C Context Mixer -- +0.0018 BPB (negative result). Classical compression blended with neural softmax. Dilutes predictions on SmearGate+BigramHash models.
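The two positive techniques in items 1 and 2 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in `train_gpt.py`: the function names are hypothetical, the two-scalars-per-layer split is just one decomposition consistent with "18 params total" on a 9-layer model, and the model width (512) and head count (8) are assumed values that happen to give roughly 37K gate parameters.

```python
import math

N_LAYERS = 9    # "9L" in the ablation config
D_MODEL = 512   # ASSUMPTION: model width is not stated in this PR
N_HEADS = 8     # ASSUMPTION: head count is not stated in this PR

def mix_value_residual(v_layer, v_first, lam_self, lam_first):
    """Value Residual: blend this layer's V vector with the V cached from layer 0."""
    return [lam_self * a + lam_first * b for a, b in zip(v_layer, v_first)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def head_gate_logits(x, w_gate, b_gate):
    """One gate logit per head from the token's hidden state x.

    w_gate holds one weight vector (length d_model) per head.
    """
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(w_gate, b_gate)]

def gate_head_outputs(head_outputs, gate_logits):
    """Gated Attention: scale each head's attention output by sigmoid(logit)."""
    return [[sigmoid(g) * o for o in head]
            for head, g in zip(head_outputs, gate_logits)]

# Parameter counts under the assumptions above.
vr_params = N_LAYERS * 2                              # two learnable scalars per layer
ga_params = N_LAYERS * (D_MODEL * N_HEADS + N_HEADS)  # per-head gate projection plus bias
print(vr_params, ga_params)  # prints "18 36936"
```

A gate logit of 0 halves a head's output, and a strongly negative logit switches the head off for that token, which gives the model a way to attend to "nothing" without parking attention mass on a sink token.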

The two positive techniques stack roughly additively for -0.017 BPB combined.
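A quick arithmetic check of the stacking claim, using the sliding-window BPB values reported in the ablation results of this PR. The variable names are illustrative only.

```python
control = 1.4697
runs = {
    "Gated Attention only": 1.4665,
    "Value Residual only": 1.4546,
    "GA + VR combined": 1.4525,
}

# Delta of each run against the control, rounded to the table's precision.
deltas = {name: round(bpb - control, 4) for name, bpb in runs.items()}

# If the two techniques were perfectly additive, the combined delta
# would equal the sum of the two single-technique deltas.
sum_of_singles = round(
    deltas["Gated Attention only"] + deltas["Value Residual only"], 4
)
gap = round(deltas["GA + VR combined"] - sum_of_singles, 4)

print(deltas)
print(sum_of_singles, gap)
```

The combined run recovers -0.0172 of a possible -0.0183, so about 0.0011 BPB is lost to overlap between the two techniques. With one run per configuration, a gap that small may be within run-to-run noise.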

Ablation Results

v1024 9L 2xMLP, SmearGate + BigramHash + OrthoInit + WD 0.04, 131K batch, 1000 steps.

| Config | Sliding BPB | Delta vs Control |
|---|---|---|
| Control | 1.4697 | -- |
| Gated Attention only | 1.4665 | -0.0032 |
| Value Residual only | 1.4546 | -0.0151 |
| GA + VR combined | 1.4525 | -0.0172 |
| PPM-C (eval-only) | 1.2900 | +0.0018 (worse) |
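One way to see how an eval-only blend like PPM-C can hurt: if the neural model already puts more probability on the true symbol than the classical model does, mixing the two lowers that probability and costs bits. The sketch below assumes a simple linear interpolation with a hypothetical weight `alpha`; the PR does not state the blend form, and the distributions are made-up numbers.

```python
import math

def blend(p_neural, p_ppm, alpha):
    """Linear mix of two distributions; alpha is the weight on the PPM side."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(p_neural, p_ppm)]

def bits(p, target):
    """Code length in bits for the symbol at index `target`."""
    return -math.log2(p[target])

# Made-up distributions over three symbols; index 0 is the true next symbol.
p_neural = [0.90, 0.05, 0.05]   # sharp and correct
p_ppm = [0.50, 0.30, 0.20]      # flatter classical estimate

mixed = blend(p_neural, p_ppm, alpha=0.1)
cost_neural = bits(p_neural, 0)
cost_mixed = bits(mixed, 0)     # higher than cost_neural here
```

The blend only helps on positions where the classical model is the more confident of the two. That is consistent with the observation that SmearGate and BigramHash already capture the short-range statistics PPM would otherwise contribute.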

A production run (11L MLP3x + full community stack + VR + GA, 9500 steps) is in progress. Results will follow in a separate submission if competitive.

Files

  • README.md -- Full writeup with technique details and reproducibility
  • submission.json -- Metadata
  • train_gpt.py -- Training script with Value Residual, Gated Attention, XSA, EMA, Partial RoPE, LN Scale
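The `VALUE_RESIDUAL=1` and `GATED_ATTENTION=1` switches mentioned above read like environment variables. A minimal sketch of how a training script might parse such flags, assuming they are read from the process environment; the `env_flag` helper is hypothetical and is not taken from `train_gpt.py`.

```python
import os

def env_flag(name, default="0", environ=None):
    """Return True when the named variable is set to a non-zero integer.

    `environ` defaults to the real process environment; passing a dict
    makes the helper easy to exercise without touching os.environ.
    """
    env = os.environ if environ is None else environ
    return bool(int(env.get(name, default)))

# Example with an explicit mapping instead of the real environment.
example_env = {"VALUE_RESIDUAL": "1"}
use_value_residual = env_flag("VALUE_RESIDUAL", environ=example_env)    # True
use_gated_attention = env_flag("GATED_ATTENTION", environ=example_env)  # False (unset)
```

Defaulting both flags to off keeps the control configuration reproducible from the same script, so each ablation row only differs by which variables are set.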

…) with ablations

Two novel architecture modifications validated with controlled ablations:
- Value Residual: layer-0 V shortcut, 18 scalars, -0.015 BPB
- Gated Attention: per-head sigmoid gate, -0.003 BPB
- PPM-C: negative result (+0.002 BPB on SmearGate+BigramHash)
Combined: -0.017 BPB additive, no interference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuaswarren added a commit to joshuaswarren/parameter-golf that referenced this pull request Mar 22, 2026
…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

First submission combining 6 independently-proven architecture improvements:
- Catalytic Residuals (PR openai#450, -0.024 bpb)
- Value Residual/ResFormer (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048)
- 12 Layers (-0.023 bpb vs 11L)
- 3x MLP

8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)