openai · anderamondarainh-stack · Apr 4, 2026 · Apr 4, 2026 · Apr 16, 2026 · Apr 30, 2026
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
+parameter-golf/
 data/tokenizers
 __pycache__/
 .DS_Store

diff --git a/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/README.md b/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/README.md
@@ -0,0 +1,62 @@
+This entry captures `11L DepthRec PolarNS SWA`, a non-record submission on the 10min / 16MB track.
+
+## Summary
+
+A 28.5M-param 11-layer transformer trained for 600s on 8×H100 SXM, serialized to an int6 + zstd-22 artifact totaling 15,999,891 bytes (109 bytes under the 16MB cap). Pre-int6 `val_bpb` at the wallclock cap is `1.1444`. The post-int6 sliding-window eval didn't complete on this run due to a pod interruption right after the artifact was written; a 3-seed run with proper sliding measurement is planned as a follow-up.
+
+## Configuration
+
+- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=4`
+- Tied embeddings, partial RoPE (16 / 64 dims), layerwise LN scale
+- BigramHash (3072 buckets, dim=112)
+- Depth recurrence: blocks 4 and 5 reuse the MLP of block 3, each pass gated by a learned scalar
+- XSA on the last 4 layers
+- Parallel residuals from layer 7 onward
+- int6 per-row quantization on MLP and attention 2D weights; the tied embedding stays in floating point
+- zstd-22 serialization
+
+## Training
+
+- Muon for matrices (Newton-Schulz with Polar Express coefficients + AOL preconditioning, 5 iters); Adam for scalars and embeddings
+- `TIED_EMBED_LR=0.035 MATRIX_LR=0.025 SCALAR_LR=0.025`
+- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=2048`
+- Late QAT kicks in once the warmdown scale falls to 0.15
+- SWA starts once the warmdown scale falls to 0.2 and averages every 50 steps; the final serialized weights are a blend of EMA and SWA
+- `MAX_WALLCLOCK_SECONDS=600`, seed 1337
+
+## Command
+
+```bash
+pip install -r requirements.txt
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
+cd records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA
+torchrun --standalone --nproc-per-node=8 train_gpt.py
+```
+
+## Key metrics
+
+| step | val_loss | val_bpb |
+|-----:|---------:|--------:|
+|    0 | 6.9288   | 4.1036  |
+| 2000 | 2.1428   | 1.2691  |
+| 4000 | 2.0641   | 1.2225  |
+| 6000 | 1.9881   | 1.1775  |
+| 7000 | 1.9374   | 1.1474  |
+| 7171 | 1.9323   | 1.1444  |
+
+- Training stopped at 7171 / 20000 steps against the wallclock cap (`step_avg:83.68ms`)
+- Peak memory: 18,204 MiB allocated, 19,866 MiB reserved
+- Artifact: 15,968,114 bytes (int6 + zstd-22)
+- Code: 31,777 bytes
+- Total: 15,999,891 bytes
+
+## Approach
+
+The stack combines several published ideas on top of the public baseline. Depth recurrence lets 11 physical MLPs cover 13 attention positions at zero parameter cost, with a learned scalar per reused pass so the model can weigh the repeated MLP differently from the first pass. XSA on the last 4 layers and parallel residuals from layer 7 onward take some compute pressure off the deep blocks. Inside Muon, Polar Express coefficients and AOL preconditioning replace the classic Newton-Schulz triplet, which keeps the orthogonalization well-conditioned in 5 iterations. SWA averages late-training checkpoints every 50 steps once the warmdown scale has fallen to 0.2, and the final serialized weights are a blend of EMA and SWA.
+
+The byte budget was the tight constraint: the int6 state dict for this config compresses to ~16.2 MB under the standard lzma-9 path, which is over the cap. Switching to zstd-22 with duplicate entries stripped from the state dict brought the weights down to 15,968,114 bytes, which leaves room for the 31,777-byte minified training script with 109 bytes to spare.
+
+## Caveats
+
+- Single seed (1337), so no statistical significance claim over the current SOTA yet. Submitting as non-record for iteration signal.
+- `val_bpb` above is pre-int6; the post-int6 sliding-window number was not measured on this run. Will report once the 3-seed follow-up lands.
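The "int6 per-row quantization" bullet in the README above can be pictured with a small sketch. It assumes a symmetric scheme with one floating-point scale per row and integer codes in [-31, 31]; the README does not state the exact range, rounding rule, or scale dtype, and `quantize_row_int6` / `dequantize_row` are hypothetical names rather than functions taken from `train_gpt.py`.

```python
def quantize_row_int6(row, qmax=31):
    """Symmetric per-row quantization: one float scale, codes in [-qmax, qmax].

    qmax=31 is an assumption; a signed 6-bit code could also span -32..31.
    """
    max_abs = max((abs(x) for x in row), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clip defensively to the 6-bit range.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to floats using the row's scale."""
    return [c * scale for c in codes]


# Toy 2x4 "weight matrix"; each row gets its own scale.
weights = [
    [0.50, -0.25, 0.125, 0.0],
    [-2.0, 1.0, 0.5, 0.25],
]
quantized = [quantize_row_int6(row) for row in weights]
restored = [dequantize_row(codes, scale) for codes, scale in quantized]

for row, (codes, scale), back in zip(weights, quantized, restored):
    worst = max(abs(a - b) for a, b in zip(row, back))
    print(f"scale={scale:.5f} codes={codes} max_err={worst:.5f}")
```

Because each row carries its own scale, a row with small weights is not forced onto the coarse grid of a row with large ones; the round-trip error per element stays within half a quantization step of that row.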
diff --git a/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/final_model.int6.ptz b/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/final_model.int6.ptz
diff --git a/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/requirements.txt b/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/requirements.txt
@@ -0,0 +1 @@
+zstandard
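The README contrasts the lzma-9 path with zstd-22, and `zstandard` is the only listed requirement. A hedged sketch of how the two codecs could be compared on a serialized blob follows; the payload is synthetic stand-in data, not the real state dict, so the sizes it prints say nothing about the actual artifact. `zstandard` is imported optionally so the lzma half runs on the standard library alone.

```python
import lzma

# Synthetic stand-in for a serialized int6 state dict (NOT the real artifact).
payload = bytes((i * 37 + (i >> 3)) % 64 for i in range(50_000))

sizes = {"raw": len(payload)}

# Standard-library path: xz container at preset 9.
lzma_blob = lzma.compress(payload, preset=9)
assert lzma.decompress(lzma_blob) == payload
sizes["lzma-9"] = len(lzma_blob)

try:
    import zstandard  # third-party; listed in requirements.txt
except ImportError:
    zstandard = None

if zstandard is not None:
    # Level 22 is the highest zstd compression level.
    zstd_blob = zstandard.ZstdCompressor(level=22).compress(payload)
    assert zstandard.ZstdDecompressor().decompress(zstd_blob) == payload
    sizes["zstd-22"] = len(zstd_blob)

for name, size in sizes.items():
    print(f"{name:>8}: {size:,} bytes")
```

Which codec wins depends on the byte statistics of the payload, so a comparison like this is only meaningful when run on the actual quantized state dict.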
diff --git a/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/submission.json b/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/submission.json
@@ -0,0 +1,11 @@
+{
+  "author": "Ander Amondarain",
+  "github_id": "anderamondarainh-stack",
+  "name": "11L DepthRec PolarNS SWA Int6+Zstd",
+  "blurb": "sp1024 11L 512 kv4 on fineweb10B with depth recurrence (layers 4 and 5 share the mlp of layer 3), polar express newton-schulz with aol preconditioning, swa blended with ema, partial rope, xsa on the last 4 layers, bigram hash 3072x112, and int6+zstd-22 with stripped-duplicate state dict so the whole artifact fits under 16MB. wallclock-capped at 600s on 8xH100 SXM, seed 1337. the reported val_bpb is pre-int6 at step 7171; the post-int6 sliding eval is pending a 3-seed re-run.",
+  "date": "2026-04-15T22:47:00Z",
+  "val_loss": 1.9323,
+  "val_bpb": 1.1444,
+  "bytes_total": 15999891,
+  "bytes_code": 31777
+}
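The byte fields in this `submission.json` can be cross-checked against the README's "Key metrics" list with a few lines of arithmetic. The sketch below assumes the 16MB cap means 16,000,000 bytes, which is the reading consistent with the README's "109 bytes under" figure; the JSON is a trimmed copy of the numeric fields above.

```python
import json

# Assumption: "16MB" is decimal megabytes, matching the README's headroom figure.
CAP_BYTES = 16_000_000

submission = json.loads("""
{
  "val_loss": 1.9323,
  "val_bpb": 1.1444,
  "bytes_total": 15999891,
  "bytes_code": 31777
}
""")

# Total = compressed weights + training script, so the artifact is the difference.
artifact_bytes = submission["bytes_total"] - submission["bytes_code"]
headroom = CAP_BYTES - submission["bytes_total"]

print(f"artifact: {artifact_bytes:,} bytes")
print(f"headroom: {headroom:,} bytes")
```

The artifact comes out at 15,968,114 bytes and the headroom at 109 bytes, matching the README.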
diff --git a/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/train_gpt.py b/records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA/train_gpt.py
diff --git a/records/track_10min_16mb/submission_v1/README.md b/records/track_10min_16mb/submission_v1/README.md
@@ -0,0 +1,35 @@
+# SP4096 + depth recurrence + MuonEq-R + misc improvements
+
+Stacking stuff that works from recent PRs. Nothing too fancy, just trying to get everything working together before adding SLOT/TTT later.
+
+## what changed vs baseline
+
+- switched to **sp4096** tokenizer (bigger vocab = fewer tokens per byte)
+- **11 layers with depth recurrence** on layers 3-5 (shared MLP), so effectively 14 virtual layers for 0 extra params
+- **MLP 4x** (2048 hidden) instead of 2x
+- **LeakyReLU(0.5)²** instead of relu²
+- **MuonEq-R**: added row-normalization before newton-schulz in muon. small thing but helps
+- **QK-Gain 5.0** (init was 1.5, bumped it up based on what others found works)
+- **BigramHash** 3072x112 + projection to model dim
+- **SmearGate** for blending adjacent token embeddings
+- **EMA** (0.997 decay) applied at the end before quantization
+- decoupled **weight decay** (0.04) in muon for better quantization later
+- warmdown bumped to 4000 iters
+- tuned LRs: matrix=0.025, scalar=0.025, embed=0.035
+- muon momentum 0.99 (warmup from 0.92)
+- grad clip 0.3
+
+## quantization
+
+still using baseline int8 + zlib for now. plan is to switch to int6 + lzma once I verify everything trains properly.
+
+## expected results
+
+haven't run this yet (waiting on compute). aiming for a val_bpb somewhere around 1.09-1.12 based on what similar setups get in other PRs.
+
+## to run
+
+```bash
+python3 data/cached_challenge_fineweb.py --variant sp4096
+torchrun --standalone --nproc_per_node=8 records/track_10min_16mb/submission_v1/train_gpt.py
+```
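The EMA bullet (0.997 decay) in the README above can be sketched in a few lines. This is a minimal pure-Python version that assumes the shadow copy is updated once per optimizer step and swapped in before quantization; `ema_update` and the toy parameter dict are hypothetical, and the real script presumably works on tensors.

```python
def ema_update(shadow, params, decay=0.997):
    """In-place EMA: shadow <- decay * shadow + (1 - decay) * params."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value


params = {"w": 1.0}
shadow = dict(params)  # start the shadow at the initial weights

for step in range(1000):
    params["w"] = 0.0  # toy "training": the weight jumps to 0 and stays there
    ema_update(shadow, params)

# With a constant target the shadow decays geometrically, i.e. 0.997 ** 1000.
print(f"shadow after 1000 steps: {shadow['w']:.4f}")
```

A decay of 0.997 corresponds to an averaging window of roughly 1 / (1 - 0.997) ≈ 333 steps, so the shadow weights mostly reflect the last few hundred updates.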
diff --git a/records/track_10min_16mb/submission_v1/submission.json b/records/track_10min_16mb/submission_v1/submission.json
@@ -0,0 +1,9 @@
+{
+    "author": "anderamondarainh-stack",
+    "github_id": "anderamondarainh-stack",
+    "val_bpb": null,
+    "date": "2026-04-04",
+    "summary": "SP4096 + depth recurrence (3,4,5) + MuonEq-R + MLP4x + BigramHash + EMA",
+    "base_pr": "baseline",
+    "notes": "stacking known improvements, no SLOT/TTT yet"
+}