1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
parameter-golf/
data/tokenizers
__pycache__/
.DS_Store
@@ -0,0 +1,62 @@
This record captures `11L DepthRec PolarNS SWA`, a non-record submission on the 10min / 16MB track.

## Summary

A 28.5M-parameter, 11-layer transformer trained for 600s on 8×H100 SXM and serialized to an int6 + zstd-22 artifact. Artifact plus code total 15,999,891 bytes, 109 bytes under the 16,000,000-byte cap. Pre-int6 `val_bpb` at the wallclock cap is `1.1444`. The post-int6 sliding-window eval did not complete on this run because the pod was interrupted right after the artifact was written; a 3-seed run with a proper sliding-window measurement is planned as a follow-up.

## Configuration

- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=4`
- Tied embeddings, partial RoPE (16 / 64 dims), layerwise LN scale
- BigramHash (3072 buckets, dim=112)
- Depth recurrence: blocks 4 and 5 reuse the MLP of block 3, each pass gated by a learned scalar
- XSA on the last 4 layers
- Parallel residuals from layer 7 onward
- int6 per-row quantization on the 2D MLP and attention weights; the tied embedding stays in floating point
- zstd-22 serialization
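
The int6 per-row step in the list above can be sketched as follows. This is a minimal illustration in plain Python, assuming symmetric rounding to the range [-31, 31] with one scale per row; the function names are hypothetical and the actual `train_gpt.py` may clip, pack, or scale differently.

```python
def quantize_int6_per_row(matrix):
    """Symmetric per-row quantization: one scale per row, values in [-31, 31]."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # Assumption: the largest magnitude in the row maps to +/-31.
        scale = max_abs / 31 if max_abs > 0 else 1.0
        q_rows.append([max(-31, min(31, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Rebuild approximate float weights from int6 codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.50, -0.25, 0.10, 0.00],
           [0.02, 0.01, -0.03, 0.005]]
q, scales = quantize_int6_per_row(weights)
restored = dequantize(q, scales)
```

Because each row carries its own scale, the small-magnitude second row keeps as many quantization levels as the first, which is the usual argument for per-row over per-tensor scaling.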

## Training

- Muon for matrices (Newton-Schulz with Polar Express coefficients + AOL preconditioning, 5 iters); Adam for scalars and embeddings
- `TIED_EMBED_LR=0.035 MATRIX_LR=0.025 SCALAR_LR=0.025`
- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=2048`
- Late QAT kicks in at scale 0.15
- SWA starts once the warmdown schedule scale drops to 0.2 and averages a checkpoint every 50 steps; the final serialized weights are a blend of EMA and SWA
- `MAX_WALLCLOCK_SECONDS=600`, seed 1337
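
A rough sketch of the SWA / EMA blending described in the list above, on a toy one-parameter trajectory. Several things here are assumptions rather than facts from this record: the linear step-based warmdown (the real run is wallclock-capped), the EMA decay of 0.997, the 50/50 blend weight, and all function names.

```python
def lr_scale(step, total_steps, warmdown_steps):
    """Linear warmdown multiplier: 1.0 until the warmdown begins, then decays toward 0."""
    return min(1.0, (total_steps - step) / warmdown_steps)


def average_weights(weights_at, total_steps, warmdown_steps,
                    swa_start=0.2, swa_every=50, ema_decay=0.997, blend=0.5):
    """Track an EMA every step and an SWA mean of late snapshots, then blend them."""
    ema = list(weights_at(0))
    swa_sum, swa_count = [0.0] * len(ema), 0
    for step in range(1, total_steps + 1):
        w = weights_at(step)
        ema = [ema_decay * e + (1.0 - ema_decay) * x for e, x in zip(ema, w)]
        # SWA only collects snapshots once the schedule scale is below the threshold.
        in_swa_window = lr_scale(step, total_steps, warmdown_steps) < swa_start
        if in_swa_window and step % swa_every == 0:
            swa_sum = [s + x for s, x in zip(swa_sum, w)]
            swa_count += 1
    swa = [s / swa_count for s in swa_sum] if swa_count else list(ema)
    # Hypothetical blend weight; the record does not state the actual ratio.
    final = [blend * e + (1.0 - blend) * s for e, s in zip(ema, swa)]
    return final, ema, swa, swa_count


# Toy trajectory: a single weight that grows linearly with the step count.
final, ema, swa, n_snapshots = average_weights(
    lambda step: [step / 1000.0], total_steps=1000, warmdown_steps=500)
```

On this toy trajectory the EMA lags the latest weights while SWA only sees the last few snapshots, so the blend lands between the two.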

## Command

```bash
pip install -r requirements.txt
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
cd records/track_10min_16mb/2026-04-15_11L_DepthRec_PolarNS_SWA
torchrun --standalone --nproc-per-node=8 train_gpt.py
```

## Key metrics

| step | val_loss | val_bpb |
|-----:|---------:|--------:|
| 0 | 6.9288 | 4.1036 |
| 2000 | 2.1428 | 1.2691 |
| 4000 | 2.0641 | 1.2225 |
| 6000 | 1.9881 | 1.1775 |
| 7000 | 1.9374 | 1.1474 |
| 7171 | 1.9323 | 1.1444 |
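
The two metric columns differ only by a constant factor. A small sketch of that relationship, assuming the usual definition `val_bpb = (val_loss / ln 2) * tokens_per_byte` with `val_loss` in nats per token (the definition is an assumption; the numbers are copied from the table):

```python
import math

# (step, val_loss, val_bpb) rows copied from the table above.
rows = [
    (0, 6.9288, 4.1036),
    (2000, 2.1428, 1.2691),
    (4000, 2.0641, 1.2225),
    (6000, 1.9881, 1.1775),
    (7000, 1.9374, 1.1474),
    (7171, 1.9323, 1.1444),
]


def tokens_per_byte(val_loss, val_bpb):
    """Solve val_bpb = (val_loss / ln 2) * tokens_per_byte for tokens_per_byte."""
    return val_bpb * math.log(2) / val_loss


ratios = [tokens_per_byte(loss, bpb) for _, loss, bpb in rows]
for (step, _, _), r in zip(rows, ratios):
    print(f"step {step:>4}: {r:.4f} tokens/byte, {1 / r:.3f} bytes/token")
```

Under that assumption every row implies roughly 0.41 tokens per byte, which is a quick consistency check on the table.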

- Training stopped at 7171 / 20000 steps against the wallclock cap (`step_avg:83.68ms`)
- Peak memory: 18,204 MiB allocated, 19,866 MiB reserved
- Artifact: 15,968,114 bytes (int6 + zstd-22)
- Code: 31,777 bytes
- Total: 15,999,891 bytes
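
The byte and wallclock figures above can be checked with a few lines of arithmetic. The cap value is an assumption: 16,000,000 bytes (decimal MB) is the reading that matches the "109 bytes under" figure in the Summary.

```python
CAP_BYTES = 16_000_000      # assumption: decimal MB, consistent with "109 bytes under"
artifact_bytes = 15_968_114  # int6 + zstd-22 artifact
code_bytes = 31_777          # training script

total_bytes = artifact_bytes + code_bytes
headroom = CAP_BYTES - total_bytes

# Wallclock check: steps completed times the logged average step time.
steps, step_avg_ms = 7171, 83.68
train_seconds = steps * step_avg_ms / 1000

print(total_bytes, headroom, round(train_seconds, 2))
```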

## Approach

The stack is a combination of several published ideas on top of the public baseline. Depth recurrence lets 11 physical MLPs cover 13 attention positions at zero parameter cost, with a learned scalar per reused pass so the model can weigh the repeated MLP differently from the first pass. XSA on the last 4 layers and parallel residuals from layer 7 onward take some compute pressure off the deep blocks. Inside Muon, Polar Express coefficients and AOL preconditioning replace the classic Newton-Schulz triplet, which keeps the orthogonalization well-conditioned in 5 iterations. SWA averages late-training checkpoints once the warmdown schedule is below a fraction threshold, and the final serialized weights are a blend of EMA and SWA.
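
For reference, the classic Newton-Schulz iteration that the paragraph above says was replaced looks roughly like this. The sketch uses the standard Muon quintic triplet `(3.4445, -4.7750, 2.0315)` in plain Python on a tiny matrix; it does not reproduce the Polar Express coefficients or AOL preconditioning used in this run, and the helper names are illustrative.

```python
def matmul(a, b):
    """Dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, iters=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Approximately orthogonalize g, pushing its singular values toward 1."""
    a, b, c = coeffs
    # Normalize by the Frobenius norm so every singular value starts at or below 1.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-7) for v in row] for row in g]
    for _ in range(iters):
        xxt = matmul(x, transpose(x))                      # A = X X^T
        xxt2 = matmul(xxt, xxt)                            # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]    # B = b*A + c*A^2
                for r1, r2 in zip(xxt, xxt2)]
        bx = matmul(poly, x)
        x = [[a * v + w for v, w in zip(r1, r2)]           # X <- a*X + B X
             for r1, r2 in zip(x, bx)]
    return x


# Diagonal input, so the diagonal entries are the singular values.
ortho = newton_schulz([[3.0, 0.0], [0.0, 1.0]])
```

With this triplet the singular values do not converge exactly to 1; they end up oscillating in a band around it, which is the conditioning issue that alternative coefficient schedules aim to tighten.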

The byte budget was the tight constraint: the int6 state dict for this config compresses to ~16.2 MB under the standard lzma-9 path, which is over the cap. Switching to zstd-22 and stripping duplicate entries from the state dict brought the artifact under 16 MB, with room left over for a minified training script.

## Caveats

- Single seed (1337), so no statistical significance claim over the current SOTA yet. Submitting as non-record for iteration signal.
- `val_bpb` above is pre-int6; the post-int6 sliding-window number was not measured on this run. Will report once the 3-seed follow-up lands.
@@ -0,0 +1 @@
zstandard
@@ -0,0 +1,11 @@
{
"author": "Ander Amondarain",
"github_id": "anderamondarainh-stack",
"name": "11L DepthRec PolarNS SWA Int6+Zstd",
"blurb": "sp1024 11L 512 kv4 on fineweb10B with depth recurrence (layers 4 and 5 share the mlp of layer 3), polar express newton-schulz with aol preconditioning, swa blended with ema, partial rope, xsa on the last 4 layers, bigram hash 3072x112, and int6+zstd-22 with stripped-duplicate state dict so the whole artifact fits under 16MB. wallclock-capped at 600s on 8xH100 SXM, seed 1337. the reported val_bpb is pre-int6 at step 7171; the post-int6 sliding eval is pending a 3-seed re-run.",
"date": "2026-04-15T22:47:00Z",
"val_loss": 1.9323,
"val_bpb": 1.1444,
"bytes_total": 15999891,
"bytes_code": 31777
}

35 changes: 35 additions & 0 deletions records/track_10min_16mb/submission_v1/README.md
@@ -0,0 +1,35 @@
# SP4096 + depth recurrence + MuonEq-R + misc improvements

Stacking stuff that works from recent PRs. Nothing too fancy, just trying to get everything working together before adding SLOT/TTT later.

## what changed vs baseline

- switched to **sp4096** tokenizer (bigger vocab = fewer tokens per byte, so better compression)
- **11 layers with depth recurrence** on layers 3-5 (shared MLP), so effectively 14 virtual layers for 0 extra params
- **MLP 4x** (2048 hidden) instead of 2x
- **LeakyReLU(0.5)²** instead of relu²
- **MuonEq-R**: added row-normalization before newton-schulz in muon. small thing but helps
- **QK-Gain 5.0** (init was 1.5, bumped it up based on what others found works)
- **BigramHash** 3072x112 + projection to model dim
- **SmearGate** for blending adjacent token embeddings
- **EMA** (0.997 decay) applied at the end before quantization
- decoupled **weight decay** (0.04) in muon for better quantization later
- warmdown bumped to 4000 iters
- tuned LRs: matrix=0.025, scalar=0.025, embed=0.035
- muon momentum 0.99 (warmup from 0.92)
- grad clip 0.3
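
quick sketch of the activation swap from the list above, assuming `LeakyReLU(0.5)²` means a leaky ReLU with negative slope 0.5 followed by squaring (that reading is an assumption, and the function names are just for illustration):

```python
def relu_sq(x):
    """Baseline activation: relu(x) squared, zero for all negative inputs."""
    return max(x, 0.0) ** 2


def leaky_relu_sq(x, negative_slope=0.5):
    """Leaky variant: negative inputs are scaled by the slope before squaring."""
    y = x if x >= 0 else negative_slope * x
    return y * y


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, relu_sq(x), leaky_relu_sq(x))
```

the positive side is identical; the difference is that negative pre-activations still produce a (smaller) nonzero output instead of being zeroed out.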

## quantization

still using baseline int8 + zlib for now. plan is to switch to int6 + lzma once I verify everything trains properly.

## expected results

haven't run this yet (waiting on compute). aiming for a val_bpb somewhere around 1.09-1.12, based on what similar setups get in other PRs.

## to run

```bash
python3 data/cached_challenge_fineweb.py --variant sp4096
torchrun --standalone --nproc_per_node=8 records/track_10min_16mb/submission_v1/train_gpt.py
```
9 changes: 9 additions & 0 deletions records/track_10min_16mb/submission_v1/submission.json
@@ -0,0 +1,9 @@
{
"author": "anderamondarainh-stack",
"github_id": "anderamondarainh-stack",
"val_bpb": null,
"date": "2026-04-04",
"summary": "SP4096 + depth recurrence (3,4,5) + MuonEq-R + MLP4x + BigramHash + EMA",
"base_pr": "baseline",
"notes": "stacking known improvements, no SLOT/TTT yet"
}