Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (val_bpb=1.1216) #415
Closed
EthanYangTW wants to merge 17 commits into openai:main from
Conversation
Based on SOTA (10L_Int5MLP_MuonWD04_SWA50) with improvements:
- QAT with STE for int5/int6 quantization-aware training
- BigramHash increased from 10240 to 12288
- Eval stride reduced from 64 to 32 for better context
- Magnitude pruning increased from 3% to 5%
- SWA every 25 steps instead of 50
- Artifact size: ~15.89MB (under 16MB limit)
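The commit above mentions QAT with a straight-through estimator (STE) but the PR text does not include the code. A minimal sketch of symmetric int5/int6 fake quantization might look like the following; all names are illustrative, and in the real training loop the rounding would be wrapped so the backward pass treats it as identity (the STE), letting gradients flow to the underlying float weights:

```python
def fake_quant(w, bits=6):
    """Symmetric fake quantization: snap weights onto a 2^bits-level
    integer grid, then dequantize back to float. Returns both the
    dequantized floats (used in the forward pass during QAT) and the
    raw integer codes (what the exported artifact would store)."""
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6, 15 for int5
    scale = max(abs(x) for x in w) / qmax or 1.0  # guard against all-zero w
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return [qi * scale for qi in q], q

dq, q = fake_quant([0.5, -0.25, 0.1, -1.0], bits=6)
```

The integer codes stay within [-32, 31] for int6; the largest-magnitude weight maps to the grid edge, so it round-trips (almost) exactly.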
Restore original train_gpt.py baseline. Add new records folder with submission script based on 10L_Int5MLP_MuonWD04_SWA50 SOTA. Changes: QAT with STE, BigramHash 12288, eval stride 32, 5% magnitude pruning, SWA every 25 steps.
Port LoRA TTT from records/2026-03-17_LoRA_TTT into our submission. At eval time, per-document rank-8 LoRA adapters are trained on Q/V projections and lm_head, then used for scoring. Expected -0.003 to -0.005 bpb improvement on top of sliding window eval.
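The ported LoRA TTT trains rank-8 adapters per document on Q/V projections and lm_head. The PR does not show the adapter code; a generic sketch of the low-rank update (numpy in place of torch for brevity, with hypothetical shapes and an assumed `alpha` scaling) is:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """y = x @ W.T + (alpha / r) * x @ A.T @ B.T
    W is the frozen base projection; only the small A (r x d_in) and
    B (d_out x r) matrices are trained at eval time, one pair per
    document, then used for scoring."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))   # B starts at zero, so the adapter is a no-op until trained
x = rng.standard_normal((4, d_in))
y = lora_forward(x, W, A, B)
```

Initializing B to zero is the standard LoRA choice: before any test-time steps the model scores exactly as the base checkpoint would.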
val_bpb=1.14443 (seed=2024), artifact=15.90MB
…/train_gpt.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
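The commit replaces SWA checkpoint averaging with an EMA at decay 0.997. The update itself is a one-liner per parameter; a sketch (names illustrative, operating on a plain dict rather than a torch state_dict):

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of parameters, replacing SWA:
    each shadow value moves a (1 - decay) fraction toward the
    current training value. The shadow weights are what get
    exported/evaluated, not the raw training weights."""
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema

ema = {"w": 1.0}
ema_update(ema, {"w": 0.0})
```

At decay 0.997 the effective averaging window is roughly 1/(1-0.997) ≈ 333 updates, which is why a later commit adjusts the decay when the EMA is applied only every 10 steps.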
…y 10 steps
- Disable FA3 (SDPA faster for GQA on PyTorch 2.9)
- BigramHash 10240 -> 8192 to fit 11L under 16MB
- EMA update every 10 steps with adjusted decay to reduce CPU overhead
- Simplify attention forward (remove FA3 code path)
Previous run: 16.94MB with BigramHash 8192 + 5% pruning. BigramHash 2048 saves ~0.5MB, 10% pruning improves compression further.
v3 was 16.38MB with BigramHash 2048 + 10% pruning. Removing BigramHash saves ~0.15MB, 15% pruning improves zstd compression.
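These size-reduction commits lean on magnitude pruning to improve zstd compression: zeroing the smallest weights creates long zero runs that compress well. A sketch of the thresholding step (illustrative, flat-list form rather than per-tensor torch code):

```python
def magnitude_prune(w, frac=0.15):
    """Zero out (at least) the smallest-|w| fraction of weights.
    The resulting zeros compress far better under zstd, shrinking
    the exported artifact; ties at the threshold are also pruned."""
    k = int(len(w) * frac)
    if k == 0:
        return list(w)
    thresh = sorted(abs(x) for x in w)[k - 1]
    return [0.0 if abs(x) <= thresh else x for x in w]

pruned = magnitude_prune(
    [0.9, -0.01, 0.5, 0.02, -0.7, 0.03, 0.8, -0.04, 0.6, 0.05], frac=0.2)
```

The trade-off tracked across these runs is artifact size versus bpb: 5% → 10% → 15% pruning buys compression headroom at some accuracy cost, which the later TTT phases partly recover.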
Fork of unnir's openai#374 (1.1246 BPB) with TTT added:
- 11L, XSA4, Partial RoPE 16/64, LN Scale, Tight SWA
- Shared VE128, SmearGate, BigramHash 2048
- TTT: 25 epochs SGD on val data post-quantization
- Trimmed to 1476 lines (under 1500 limit)
Previous TTT took 7+ min per epoch (uncompiled, single GPU). Now: torch.compile + DDP across 8 GPUs + 3 epochs + batch 64. Should finish in ~2-3 min total.
flash_attn_interface (FA3 Hopper) not available on RunPod. Falls back to flash_attn, then SDPA with GQA support.
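The fallback chain described here (FA3 → FA2 → SDPA) is typically a probe at import time. A sketch of such auto-detection (function name illustrative; the real script would also check for Hopper-class hardware before choosing FA3):

```python
import importlib.util

def pick_attention_backend():
    """Prefer FA3 (flash_attn_interface, Hopper-only), then FA2
    (flash_attn), and finally fall back to PyTorch SDPA, which
    supports GQA natively on recent PyTorch versions."""
    for name in ("flash_attn_interface", "flash_attn"):
        if importlib.util.find_spec(name) is not None:
            return name
    return "sdpa"

backend = pick_attention_backend()
```

Using `find_spec` avoids actually importing a heavy CUDA extension just to learn it is absent, which matters on hosts like the RunPod box mentioned above where FA3 wheels are unavailable.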
Two-phase TTT on PR openai#374 base: phase 1 norm-only recalibration (100ep Adam), phase 2 selective-freeze last 2 blocks (15ep SGD). Artifact 15.76MB.
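The two phases differ mainly in which parameters are trainable: norm/scale tensors only in phase 1, the last blocks in phase 2. A sketch of that selection by parameter name (names and the matching convention are illustrative, not taken from the PR's code):

```python
def ttt_trainable(param_names, phase, n_blocks=11, last_k=2):
    """Phase 1: only norm/scale parameters, a cheap recalibration
    aimed at quantization artifacts. Phase 2: unfreeze the last
    `last_k` transformer blocks to adapt to the eval distribution."""
    if phase == 1:
        return [n for n in param_names if "norm" in n or "scale" in n]
    keep = {f"blocks.{i}." for i in range(n_blocks - last_k, n_blocks)}
    return [n for n in param_names if any(n.startswith(p) for p in keep)]

names = ["blocks.0.attn.w", "blocks.9.mlp.w", "blocks.10.norm.g", "lm_head.w"]
```

Keeping phase 1 to norm/scale tensors is what makes 100 Adam epochs affordable: only a few thousand parameters move, while phase 2's short SGD run touches the much larger last-block weights.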
84.65ms/step with FA3 Hopper (was 96ms), 6939 steps. Two-phase TTT: norm-only 100ep + selective-freeze 25ep. Artifact 15.70MB. Seed 42 running for 3-seed validation.
Contributor
Pull request overview
Adds new record submissions under records/track_10min_16mb building on PR #374, including a new two-phase test-time training (TTT) approach and additional archived experiments/logs.
Changes:
- Introduces a two-phase TTT pipeline (norm/scale “repair” phase + selective unfreeze of last blocks) in a new record directory.
- Adds/updates multiple record artifacts (train scripts, READMEs, submission metadata, and a training log) for reproducibility and leaderboard tracking.
- Includes a separate “11L XSA4 TightSWA TTT” record script and a “QAT + BigramHash(12288) + Stride 32” record bundle.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-22_TwoPhase_TTT_NormRepair/train_gpt.py | New training + int6 export + two-phase TTT implementation. |
| records/track_10min_16mb/2026-03-22_TwoPhase_TTT_NormRepair/submission.json | New submission metadata for the two-phase TTT record. |
| records/track_10min_16mb/2026-03-22_TwoPhase_TTT_NormRepair/README.md | Documentation for the two-phase TTT record. |
| records/track_10min_16mb/2026-03-22_11L_XSA4_TightSWA_TTT/train_gpt.py | New record training script with (single-phase) TTT. |
| records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_seed2024.log | Added training log artifact for the QAT/BigramHash record. |
| records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_gpt.py | New record training script including QAT, quant export, and TTT. |
| records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/submission.json | New submission metadata for the QAT/BigramHash record. |
| records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/README.md | Documentation for the QAT/BigramHash record. |
Author
Superseded by #417 (3-seed mean 1.1227)
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 22, 2026
Major changes:
- DDP gradient sharding: each GPU processes batch_seqs sequences, manual all_reduce on gradients (matches PR openai#415/openai#417 approach)
- Two-phase TTT (TTT_TWO_PHASE=1):
  - Phase 1: norm-only recalibration (50 epochs Adam, ~22K params)
  - Phase 2: selective block adaptation (10 epochs SGD, last 3 blocks)
- TTT_BATCH_SEQS=64 per GPU (512 total with 8 GPUs)
- Falls back to single-phase SGD if TTT_TWO_PHASE=0

Expected speedup: ~235x (from 1344s/epoch to ~5.7s/epoch)
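The manual all_reduce this commit describes is just a sum across ranks followed by division by world size, so every GPU applies the same averaged gradient. A pure-Python simulation of that invariant (in the real script this would be `torch.distributed.all_reduce` on each gradient tensor):

```python
def allreduce_mean(grads_per_gpu):
    """Simulate data-parallel gradient averaging: each rank holds
    gradients from its own batch_seqs slice; summing across ranks
    and dividing by world size reproduces the gradient of the full
    512-sequence batch, identically on every rank."""
    world = len(grads_per_gpu)
    n = len(grads_per_gpu[0])
    return [sum(g[i] for g in grads_per_gpu) / world for i in range(n)]

avg = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
```

Because the averaged result is identical on all ranks, the per-rank optimizer steps stay in lockstep without ever broadcasting weights, which is what lets the sharded TTT scale near-linearly to 8 GPUs.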
Summary
Built on PR #374 with FA3 Hopper attention and a novel two-phase test-time training approach:
Key insight: the two phases target different error sources (quantization artifacts vs. distribution mismatch) and are additive.
Results
Architecture
Setup
Command