diff --git a/caseops-memory-leakage/family-tree.md b/caseops-memory-leakage/family-tree.md new file mode 100644 index 0000000000..a5e6e70ee4 --- /dev/null +++ b/caseops-memory-leakage/family-tree.md @@ -0,0 +1,180 @@ +# CaseOps records — family tree with leak/clean annotations + +**Updated 2026-05-02 with strict re-audit applied** (see `verdicts.md` for criteria). + +Legend: `[C]` = CLEAN (val docs not in train), `[L]` = LEAK (val docs in train), `[?]` = AMBIGUOUS (cannot resolve from PR artifacts alone). + +## Tree 1 — Merged trunk (linear ancestry) + +``` +#1493 [pre-CaseOps boundary, clean by lineage] + ↓ +#1626 [pre-CaseOps boundary, VarLen, clean by lineage] + ↓ ← BOUNDARY: pre-CaseOps to CaseOps +#1729 [C] @romeerp bpb=1.0678 (Apr 18) + │ — first CaseOps record; cached_challenge_fineweb.py from romeerp/parameter-golf-caseops-v1 + ↓ ←== LEAK INTRODUCED HERE == +#1736 [L] @dexhunter bpb=1.06549 (Apr 19) + │ — first prepare_caseops_data.py default; train docs 10k+, val docs 0–49,999 + │ — OUR CURRENT RESEARCH BASELINE + ↓ +#1769 [L] @dexhunter bpb=1.06453 (Apr 22) + │ — +MLPClip12; same prep + ↓ +#1787 [L] @nprime06 bpb=1.06335 (Apr 23) + │ — +Polar Express NS, MIN_LR, SparseAttnGate, FusedCE; same prep + ↓ + ├──→ #1797 [L] @dexhunter bpb=1.06157 (Apr 25) — +SmearGate +LQER int4 + │ │ + │ ↓ ←== LEAK FIXED HERE == + │ #1851 [C] @aquariouseworkman bpb=1.06128 (Apr 27) + │ │ — +SmearGate BOS-fix; SWITCHED to /dev/shm/pgolf_data (HF subset, 39 shards) + │ │ — current merged-leaderboard SOTA leader + │ │ + │ ├──→ #1855 [L] @codemath3000 bpb=1.06108 (Apr 27) + │ │ — 9-hparam stack; LEAK RE-INTRODUCED — author rebuilt locally with default --val-docs + │ │ — DATASET_AUDIT.md (PR #2018) verified --val-docs=10000 byte-for-byte + │ │ + │ └──→ #1868 [C] @Christopher-Lee-McClendon bpb=1.06141 (Apr 29) + │ — 3-seed reproduction of #1851; STAYED on HF dataset + │ — LATEST clean merged record +``` + +## Tree 2 — Unmerged frontier branches off #1855 + +#1855 became the dominant fork point for the unmerged frontier. Most descendants inherited the leaky local prep workflow. + +``` +#1855 [L] @codemath3000 bpb=1.06108 + │ + ├──→ #1908 [C] @romeerp bpb=1.06081 — README explicit HF source; +AWQ-lite GPTQ + │ + ├──→ #1923 [L] @jorge-asenjo bpb=1.05971 — +AsymLogit +AWQ-lite; ORIGINAL val=9.66M (default --val-docs=10000), val-only re-pulled from HF after corruption; train still doc 10k+ → leak + │ + ├──→ #1945 [C] ← *flipped from [L] in re-audit* @alertcat bpb=1.05943 + │ │ — finalize_v18.sh has `snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', local_dir='/workspace/caseops_data')` + │ │ — README's prepare_caseops_data.py "Data setup" is stale — actual run used HF + │ │ — IF this is correct, #1945 at 1.05943 is a clean-frontier candidate + │ │ + │ ├──→ #1953 [?] ← *downgraded from [L] in re-audit* @andrewbaggio1 bpb=1.05855 + │ │ │ — V21 + TTT tweaks. PR ships only train_gpt.py + logs. No prep evidence. + │ │ │ — Path matches HF target. Parent #1945 confirmed HF. 
**Lean CLEAN.**
 │ │ │
 │ ├──→ #1967 [L] @ndokutovich bpb=1.05851 — V21 + LeakyReLU 0.3 + N-gram Tilt
 │ │ │ — setup.sh invokes prepare_caseops_data.py default; ALSO has within/word boundary_lut C1 leak
 │ │ │
 │ │ └──→ #2018 [L] @simon-marcus bpb=1.04722 (Apr 30)
 │ │ │ — multi-parent (#1945, #1967, #1953, #1855); +Gated XSA, LQER top-1, AsymLogit, n-gram tilt
 │ │ │ — DATASET_AUDIT.md is gold-standard leak documentation
 │ │ │ — note: parent #1945 is CLEAN, but the #2018 audit explicitly proves LEAK construction
 │ │ │
 │ │ ├──→ #2118 [L] @aquariouseworkman bpb=1.04350 (May 1)
 │ │ │ — CURRENT FRONTIER (claimed); submission.json: "--val-docs=10000 train shards + 50k val eval"
 │ │ │ — same author who shipped clean #1851 a week earlier
 │ │ │
 │ │ └──→ #2041 [?] ← *downgraded from [L] in re-audit* @jorge-asenjo bpb=1.05692
 │ │ — no prep invocation in PR; double-nested path, ambiguous
 │ │
 │ └──→ #2014 [C] ← *flipped from [L] in re-audit* @simonbissonnette bpb=1.05759
 │ │ — README's preferred setup is snapshot_download from HF; fallback prep modified to default --val-docs=50000 → clean partition either way
 │ │ — /dev/shm/pgolf_caseops_data_80_l17_final; val_tokens off-canonical (47,853,343), not directly comparable
 │ │
 │ └──→ #2078 [L] @hi-aduek bpb=1.05804 — #2014 reproduction, but triple-nested path → local default prep
 │
 ├──→ #2007 [L] @Elubrazione bpb=1.05899 — LongCtx + NoQV; triple nesting + ships prep
 │ │
 │ └──→ #2060 [L] @S0urC10ud bpb=1.05792 — 5-knob retune
 │ │
 │ └──→ #2100 [L] @someone114514 bpb=1.05807 — LongCtx + No-QV + Prefix3500
 │
 ├──→ #2019 [C] @aquariouseworkman bpb=1.05847 — README explicit: snapshot_download from HF
 │
 ├──→ #2031 [C] @deborahnelson8788726 bpb=1.05985 — README explicit: 39 train shards from HF
 │
 ├──→ #2068 [C] @jayaram1125 bpb=1.06172 (parent #1797) — cached_challenge_fineweb.py from HF
 │
 ├──→ #2071 [L] @jamesEmerson112 bpb=1.0066 (claimed) (parent #1851)
 │ — SEPARATE LEAK: symlink-leak (audit-flagged); SP8192 path symlinked to CaseOps shards
 │
 ├──→ #2075 [?] ← *downgraded from [L] in re-audit* @deusexnatura — PairGeom-V; ships prep but no explicit invocation
 │
 ├──→ #2101 [L] @OnlyJundong bpb=1.05845 — AWQ-lite + AsymLogit + GradCentral; ships prep
 │ │
 │ └──→ #2117 [?] ← *downgraded from [L] in re-audit* @JulianTang2027 — 3-seed reproduction of #2101; no data-setup command of its own
 │
 ├──→ #2109 [L] @izlley bpb=1.05917 — MP3 marker-pair fusion (CUSTOM dataset variant); val_tokens=36.56M
 │
 ├──→ #2121 [?] ← *downgraded from [L] in re-audit* @Kbediako bpb=1.06099 — StageB v2; ships prep but does not invoke it
 │
 ├──→ #2123 [C] ← *flipped from [L] in re-audit* @vaibhavmishra1 bpb=1.05933 — huggingface-cli download from HF; closed, superseded by #2124
 │
 └──→ #2124 [C] ← *flipped from [L] in re-audit* @vaibhavmishra1 bpb=1.05933 — resubmission of #2123
```

## Tree 3 — Out-of-CaseOps-scope (in date window but different lineage)

```
#1493 [pre-CaseOps boundary]
 ↓
#2027 [C] @H1cSuNtDr4C0n3S bpb=1.08064 (Apr 30)
 — SP8192 QRescue + JEPA-Lite; non-CaseOps SP8192 lineage; clean by lineage

(separately:)
#1915 [not in working set; bulk-classified clean in state.json]
 ↓
#2050 [INHERIT] @AidenGeunGeun bpb=1.06083 (Apr 30)
 — eval-only on frozen #1915 quantized artifacts; data verdict depends on #1915
```

## Tree 4 — Symlink leak branch (separate mechanism)

```
#1851 [C]
 ↓
#2071 [L] @jamesEmerson112 bpb=1.0066 (claimed)
 — caseops_enabled=False but pod data paths symlinked to CaseOps-tokenized shards
 — README admits: "active via symlinked data"
 — NOT the val10k-train leak; orthogonal mechanism
```

## Where leak transitions occur

| Edge | Author of child | Action |
|---|---|---|
| #1729 [C] → #1736 [L] | @dexhunter | **LEAK INTRODUCED**: first use of `prepare_caseops_data.py` default `--val-docs=10000`, started the leaky CaseOps trunk |
| #1797 [L] → #1851 [C] | @aquariouseworkman | **LEAK FIXED**: switched to `/dev/shm/pgolf_data` (39-shard HF subset); first clean record post-#1736 |
| #1851 [C] → #1855 [L] | @codemath3000 | **LEAK RE-INTRODUCED**: rebuilt locally with `prepare_caseops_data.py` default, despite parent being clean |
| #1851 [C] → #1868 [C] | @Christopher-Lee-McClendon | (clean stays clean) — used the same HF dataset as parent |
| #1855 [L] → #1908 [C] | @romeerp | **LEAK LIKELY FIXED**: README cites the HF source, but as a text mention only (re-audit: ambiguous, lean clean) |
| #1855 [L] → #1923 [L] | @jorge-asenjo | (leak stays leak) — only val-side fix, train kept default-prep |
| #1855 [L] → #2019 [C] | @aquariouseworkman | **LEAK FIXED**: snapshot_download from HF |
| #1855 [L] → #2031 [C] | @deborahnelson8788726 | **LEAK FIXED**: HF first-39 explicit |
| #1797 [L] → #2068 [C] | @jayaram1125 | **LEAK FIXED**: cached_challenge_fineweb.py from HF (parent is #1797, not #1855) |
| #2018 [L] → #2118 [L] | @aquariouseworkman | **REGRESSION**: same author who fixed the leak in #1851 now ships leaky #2118; submission.json admits it |

## Author behaviors

| Author | Records | Shipped status |
|---|---|---|
| @romeerp | #1729 [C], #1908 [C] | Clean (#1908 ambiguous in re-audit, lean clean) |
| @dexhunter | #1736 [L], #1769 [L], #1797 [L] | Always leaky (started the leak) |
| @nprime06 | #1787 [L] | Leaky (re-audit: ambiguous, lean leak) |
| @aquariouseworkman | #1851 [C], #2019 [C], #2118 [L] | Mostly clean; regressed on #2118 |
| @codemath3000 | #1855 [L] | Leaky (re-introduced after #1851 fixed it) |
| @Christopher-Lee-McClendon | #1868 [C] | Clean |
| @jorge-asenjo | #1923 [L], #2041 [?] | Leaky on #1923; #2041 ambiguous |
| @jamesEmerson112 | #2071 [L] (symlink) | Different leak mechanism |
| @alertcat | #1945 [C] | Clean (flipped from [L] in re-audit) |
| @andrewbaggio1 | #1953 [?] | Ambiguous (lean clean) |
| @ndokutovich | #1967 [L] | Leaky |
| @simon-marcus | #2018 [L] | Leaky (with audit doc) |
| @deborahnelson8788726 | #2031 [C] | Clean (HF) |
| @jayaram1125 | #2068 [C] | Clean (HF) |
| @vaibhavmishra1 | #2123 [C], #2124 [C] | Clean (HF download; flipped in re-audit) |

## Key takeaways

1. **The clean trunk is short**: pre-CaseOps → #1729 → (#1851 → #1868) — only three clean merged CaseOps records.
2. **The leaky trunk is long**: #1736 → #1855 → #1967 → #2018 → #2118, with many sibling forks. (The V21 node #1945 and its child #1953 re-audited clean/ambiguous; the leak construction reached #2018 through #1855 and #1967.)
3. **Same authors switch verdicts across PRs**: @aquariouseworkman shipped clean #1851 / #2019 and leaky #2118 within a week.
4. **Once a fork "fixes" the leak by going HF, it stays clean** (e.g., #1908, #2019, #2031 sit downstream of leaky #1855, and #2068 downstream of leaky #1797; all went HF).
5. **Conversely, "fixing" doesn't propagate**: #1851's HF switch didn't stop #1855 from re-introducing the leak using a sibling local prep.
diff --git a/caseops-memory-leakage/verdicts.md b/caseops-memory-leakage/verdicts.md
new file mode 100644
index 0000000000..7484d50832
--- /dev/null
+++ b/caseops-memory-leakage/verdicts.md
@@ -0,0 +1,133 @@
# CaseOps records — train/val data-leakage verdicts

**Fresh audit 2026-05-02 (complete from-scratch pass).** Every CaseOps-lineage record (merged + unmerged) since 2026-04-18.

**Working set:** 34 records (31 from user's seed list + 3 ancestors: #1908, #1923, #2007).
**Boundary nodes (not classified):** #1493, #1626 (pre-CaseOps, clean by `download_hf_docs_and_tokenize.py NUM_VAL_DOCS=50000`).

## Tally

| Verdict | Count | Records |
|---|---:|---|
| **CLEAN** | 12 | #1729, #1851, #1868, #1945, #1953, #2014, #2019, #2027 (non-CaseOps), #2031, #2068, #2123, #2124 |
| **LEAK** | 15 | #1736, #1769, #1797, #1855, #1923, #1967, #2007, #2018, #2060, #2071 (symlink), #2078, #2100, #2101, #2109 (custom variant), #2118 |
| **AMBIGUOUS** | 6 | #1787, #1908, #2041, #2075, #2117, #2121 |
| **INHERIT** | 1 | #2050 (eval-only on #1915) |

## Classification algorithm

Two questions applied to every PR's **reproduce flow** (README data-setup section + all `.sh` scripts shipped with the PR):

- **Q1:** Is there an HF download command? (`snapshot_download`, `cached_challenge_fineweb.py`, `hf_hub_download`, `huggingface-cli download` — all targeting `romeerp/parameter-golf-caseops-v1`)
- **Q2:** Is there a `prepare_caseops_data.py` invocation? (any call without `--val-docs=50000` — **no PR in this set ever passes that override**)

| Q1 | Q2 | Primary verdict |
|---|---|---|
| ✅ | ❌ | **CLEAN** |
| ❌ | ✅ | **LEAK** |
| ✅ | ✅ | Check which is the real reproduce step (HF cmd in actual run script → CLEAN; prep in run script → LEAK) |
| ❌ | ❌ | **AMBIGUOUS** — use train log as tiebreaker |

**Train-log tiebreaker (for ❌/❌ cases):**
- `train_shards: 39` → definitively CLEAN (HF 39-shard subset; impossible from `prepare_caseops_data.py`, which always produces 80+)
- `train_shards > 1000` → definitively LEAK (local prep on enlarged docs file)
- Triple-nesting `…/fineweb10B_sp8192_caseops/datasets/datasets/` → lean LEAK (prep script creates this intermediate directory; HF download never would)
- 80 shards + single/double-nesting → still AMBIGUOUS (consistent with either full HF download or local prep)

`frontier-state.json` was NOT used as evidence. All verdicts from primary sources (scripts, logs, audit docs).

## What "LEAK" means

For records flagged `val10k-train+50k-val-regen`:
- `prepare_caseops_data.py` with default `--val-docs=10000` → train documents start at canonical-stream index **10,000**.
- Val covers the first **50,000** canonical-stream documents (`val_tokens ≈ 47,851,520`).
+- → Docs 10,000–49,999 (**40,000 docs, 80% of val**) appear in both train and val. + +## What "CLEAN" means + +Records flagged `hf-dataset`: +- Train + val from `romeerp/parameter-golf-caseops-v1` (HF manifest: `docs_val=50000, docs_train=8,181,945, docs_total=8,231,945` — sums match exactly, disjoint by construction). + +## Master table + +| PR | Author | Date | val_bpb | Stated parent | datasets_dir | train_shards | val_tokens | **Verdict** | Mechanism | Evidence | +|---|---|---|---:|---|---|---:|---:|---|---|---| +| **#1729** | @romeerp | 2026-04-19 | 1.06780 | #1626 | `/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **CLEAN** | hf-dataset | Q1✅: README invokes `MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 python3 cached_challenge_fineweb.py` as the data-setup step. Q2❌. | +| **#1736** | @dexhunter | 2026-04-19 | 1.06549 | #1729 | `./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | **LEAK INTRODUCED HERE.** Q2✅: README "Data setup" step 2: `python3 prepare_caseops_data.py --docs ./fineweb10B_raw/docs_selected.jsonl ...` (no `--val-docs` → default 10,000). Our research baseline. | +| **#1769** | @dexhunter | 2026-04-22 | 1.06453 | #1736 | same triple-nested local prep | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | Q2✅: same README data-setup as #1736, `prepare_caseops_data.py` invoked. | +| **#1787** | @nprime06 | 2026-04-23 | 1.06335 | #1736, #1769 | `/workspace/src/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **AMBIGUOUS** | hf-or-local-prep | Q1❌ Q2❌: README has no data-setup section; reproduce jumps directly to torchrun. README calls `prepare_caseops_data.py` the "one-time data prep script" and ships a BOS-fix patch for it (strong contextual evidence of use), but no explicit invocation command. Train-log triple-nesting with `fineweb10B_sp8192_caseops/datasets/datasets/` leans LEAK. | +| **#1797** | @dexhunter | 2026-04-25 | 1.06157 | #1787 | local triple-nested | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | Q2✅: README data-setup section invokes `python3 prepare_caseops_data.py` (same workflow as #1736/#1769, same author). | +| **#1851** | @aquariouseworkman | 2026-04-27 | **1.06128** | #1787 (via #1797) | `/dev/shm/pgolf_data` | **39** | 47,851,520 | **CLEAN** | hf-dataset | **LEAK FIXED HERE.** Q1❌ Q2❌ in README (no data-setup section). Train-log tiebreaker: `train_shards: 39` → definitively HF (39-shard subset). Current merged-SOTA leader. | +| **#1855** | @codemath3000 | 2026-04-27 | 1.06108 | #1787, #1797 | `/workspace/pr1797_work/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | **LEAK RE-INTRODUCED HERE.** Q1❌ Q2❌ in README. Resolved via PR #2018's `DATASET_AUDIT.md` (external primary source): verifies #1855's first 80 shards byte-for-byte against `prepare_caseops_data.py --val-docs=10000` output. | +| **#1868** | @Christopher-Lee-McClendon | 2026-04-29 | 1.06141 | #1851 | `/dev/shm/pgolf_data` | **39** | 47,851,520 | **CLEAN** | hf-dataset | Train-log tiebreaker: `train_shards: 39` → definitively HF. README has misleading comment `python3 prepare_caseops_data.py # downloads from romeerp/parameter-golf-caseops-v1` (the script does NOT download from HF; the comment is wrong). 
Actual run used HF data. | +| **#1908** | @romeerp | 2026-04-28 | 1.06081 | #1855 | `/workspace/parameter-golf-pr1855-clean/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **AMBIGUOUS** | hf-or-local-prep | Q1❌ Q2❌: README says "sourced from Hugging Face: `romeerp/parameter-golf-caseops-v1`" (text mention only — no download command). PR does not ship `prepare_caseops_data.py`. 80 shards consistent with either full HF download or local prep. Path prefix `parameter-golf-pr1855-clean` suggests intent to use clean data. Lean CLEAN (romeerp is dataset owner; "clean" in path name), but no explicit HF command. | +| **#1923** | @jorge-asenjo | 2026-04-29 | 1.05971 | #1855 | `/workspace/pg-data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 1502 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | Q1❌ Q2❌ in README. Train-log tiebreaker: `train_shards: 1502` → definitively local prep (HF has 80 shards; 1502 = `prepare_caseops_data.py` run on an enlarged docs file). README also admits original `val_tokens=9,662,464` (= single val shard, 10k-doc default prep); val was re-pulled from HF after corruption but train shards were never replaced → overlap on docs 10,000–49,999. | +| **#1945** | @alertcat | 2026-04-29 | 1.05943 | #1855, #1908, #1923 | `/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,852,288 | **CLEAN** | hf-dataset | Q1✅ Q2✅: `finalize_v18.sh` (the actual run script) has `snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', local_dir='/workspace/caseops_data')` followed by `DATA_DIR=/workspace/caseops_data/datasets/` for training. README's `prepare_caseops_data.py` "Data setup" section is stale documentation. The finalize script is the canonical reproduce path → CLEAN. (val_tokens off by 768 from canonical 47,851,520 = shard-boundary alignment artifact; same 50k-doc val partition.) | +| **#1953** | @andrewbaggio1 | 2026-04-30 | 1.05855 | #1945 | `/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **CLEAN** | hf-dataset | Q1✅ Q2❌: README explicitly: "This submission uses the canonical CaseOps SP8192 export hosted on Hugging Face (`romeerp/parameter-golf-caseops-v1`), accessed via `huggingface_hub.snapshot_download`." And: "No local rebuild via `prepare_caseops_data.py` was used in the production runs; `prepare_caseops_data.py` is not part of this PR's file set." Train log path matches HF snapshot extraction location. val_tokens=47,851,520 consistent with canonical HF val. | +| **#1967** | @ndokutovich | 2026-04-30 | 1.05851 | #1945 | `/runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 1499 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | Q2✅: `setup.sh` explicitly invokes `python3 "$(dirname "$0")/prepare_caseops_data.py" --docs $DOCS_JSONL --out $DATA_DIR --sp ...` with no `--val-docs` flag → default 10,000. Also has separate within/word `boundary_lut[tokens[i]]` C1 leak (code bug, orthogonal). | +| **#2007** | @Elubrazione | 2026-04-30 | 1.05899 | #1855 | `/root/blockdata/pg-data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | Q2✅: README "Reproduce" section invokes `python prepare_caseops_data.py --local-dir /workspace/caseops_data`. Train log triple-nesting confirms local prep. 
| +| **#2014** | @simonbissonnette | 2026-04-30 | 1.05759 | #1855, #1953 | `/dev/shm/pgolf_caseops_data_80_l17_final` | 80 | 47,853,343 | **CLEAN** | hf-or-corrected-prep | Q1✅ Q2✅: README "Preferred data setup" is `snapshot_download(repo_id="romeerp/parameter-golf-caseops-v1")`. Fallback uses a modified `prepare_caseops_data.py` that "defaults to 50,000 validation docs and refuses to write over existing shards" — so even the fallback produces a clean partition. val_tokens=47,853,343 (off by 1823 from canonical 47,851,520, suggesting fallback was actually used, but with 50k val docs → no overlap regardless). CLEAN: no train/val overlap under either path. Note: val_tokens differs from canonical; not directly comparable to records at 47,851,520. | +| **#2018** | @simon-marcus | 2026-04-30 | 1.04722 | #1945, #1967, #1953, #1855 | `/tmp/pr1855_compact_train_full50k_val/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | **GOLD-STANDARD LEAK DOC.** `DATASET_AUDIT.md` explicitly states `--val-docs=10000` train + 50k val regen + first 80 train shards verified byte-for-byte against the local prep output. | +| **#2019** | @aquariouseworkman | 2026-04-30 | 1.05847 | #1855 | `/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **CLEAN** | hf-dataset | Q1✅ Q2❌: README has `HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', ...)"` as the explicit data-setup command. | +| **#2027** | @H1cSuNtDr4C0n3S | 2026-04-30 | 1.08064 | #1493 | `/workspace/parameter-golf-qrescue-20260426/data/datasets/fineweb10B_sp8192` | — | — | **CLEAN** | pre-caseops-pipeline | Non-CaseOps SP8192 lineage (SP8192 QRescue + JEPA-Lite). Clean by lineage (pre-CaseOps val partition). Out of CaseOps-audit scope. | +| **#2031** | @deborahnelson8788726 | 2026-04-30 | 1.05985 | #1855 | `/workspace/parameter-golf-final/romeerp_caseops_first39/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | **39** | 47,851,520 | **CLEAN** | hf-dataset | Q1✅ Q2❌: README says "canonical pretokenized CaseOps shards from `romeerp/parameter-golf-caseops-v1` instead of locally re-tokenized raw docs." Train log: 39 shards, path literally named `romeerp_caseops_first39`. Definitively HF. | +| **#2041** | @jorge-asenjo | 2026-04-30 | 1.05692 | #1945, #1967, #2018 | `/workspace/pg-data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **AMBIGUOUS** | hf-or-local-prep | Q1❌ Q2❌: reproduce flow (`bash run.sh`) provides no data-setup command — just sets default path env vars. Train log: 80 shards, double-nesting (consistent with HF to `/workspace/pg-data` OR local prep to `/workspace/pg-data/datasets`). Same author as confirmed-LEAK #1923, but #1923's evidence was that PR's own README admission, not a shared workflow. | +| **#2050** | @AidenGeunGeun | 2026-04-30 | 1.06083 | #1915 | `./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | — | 47,851,520 | **INHERIT** | inherit-from-#1915 | Eval-only on frozen #1915 quantized artifacts (`TTT_EVAL_ONLY=1`). Data verdict depends on #1915 (not in working set). 
|
| **#2060** | @S0urC10ud | 2026-04-30 | 1.05792 | #2007 | `/root/blockdata/pg-data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | Q2✅: README "Reproduce" invokes `python prepare_caseops_data.py --local-dir /workspace/caseops_data` (same as parent #2007). |
| **#2068** | @jayaram1125 | 2026-04-30 | 1.06172 | #1797 | `./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **CLEAN** | hf-dataset | Q1✅ Q2❌: README data-setup step 2.1: `MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 python3 cached_challenge_fineweb.py --variant sp8192_lossless_caps_caseops_v1_reserved --train-shards 80`. Path is leaky-looking but is the post-download staging location. |
| **#2071** | @jamesEmerson112 | 2026-04-30 | 1.0066 (claimed) | #1851 | `./data/datasets/fineweb10B_sp8192` | — | — | **LEAK** | symlink-leak | **DIFFERENT MECHANISM.** Audit-flagged: `caseops_enabled=False` env but pod data paths symlinked to CaseOps-tokenized shards. README: "active via symlinked data." Orthogonal to val10k-train overlap. |
| **#2075** | @deusexnatura | 2026-04-30 | (no claim) | #1855 | `/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **AMBIGUOUS** | hf-or-local-prep | Q1❌ Q2❌: README reproduce says "with the CaseOps data prepared" — no command. Ships `prepare_caseops_data.py` but README does not invoke it. Train log: 80 shards, double-nesting (same ambiguous pattern as #2041). |
| **#2078** | @hi-aduek | 2026-04-30 | 1.05804 | #2014 | `/dev/shm/caseops1851-data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,853,343 | **LEAK** | val10k-train+50k-val-regen | Q1❌ Q2❌: no explicit command. Train-log tiebreaker: triple-nesting `caseops1851-data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/` — the `fineweb10B_sp8192_caseops` intermediate directory is the output of `prepare_caseops_data.py --out …/fineweb10B_sp8192_caseops/datasets`; HF download never produces this intermediate. Same off-canonical val_tokens (47,853,343) as #2014, consistent with the same local prep run. |
| **#2100** | @someone114514 | 2026-04-30 | 1.05807 | #2060 | `/root/blockdata/pg-data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | Q2✅: same README as #2060 (LongCtx + No-QV + Prefix3500 lineage); `python prepare_caseops_data.py` invoked. Triple-nesting confirms. |
| **#2101** | @OnlyJundong | 2026-05-01 | 1.05845 | #1855 | `/workspace/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | Q2✅: README "Prepare CaseOps SP8192 data" step: `python3 prepare_caseops_data.py`. Ships the script. |
| **#2109** | @izlley | 2026-05-01 | 1.05917 | #1855 | `/workspace/data/datasets/fineweb10B_sp8192_caseops_marker_pair_v3` | 1497 | 36,562,944 | **LEAK** | custom-variant | Q2✅: README step 1b invokes `python3 prepare_caseops_data.py --docs … --out … --sp …`. Custom `fineweb10B_sp8192_caseops_marker_pair_v3` dataset variant (MP3 marker-pair fusion via `prepare_marker_pair_v3.py`). val_tokens=36,562,944 (differs from canonical 47,851,520 due to vocab surgery). Underlying canonical-stream val10k-train partition mechanism unchanged.
| +| **#2117** | @JulianTang2027 | 2026-05-01 | 1.05879 (3-seed mean) | #2101 | `./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **AMBIGUOUS** | hf-or-local-prep | Q1❌ Q2❌: README says "CaseOps SP8192 dataset on `/dev/shm`" (text mention only) and reproduce section has only torchrun. No data-setup command. README states this "reproduces PR #2101 exactly" (PR #2101 is LEAK), but #2101's data workflow is not explicitly inherited — it's just a description. 80 shards, single-nesting. | +| **#2118** | @aquariouseworkman | 2026-05-01 | **1.04350** | #2018 | `/workspace/data_correct` | 80 | 47,851,520 | **LEAK** | val10k-train+50k-val-regen | **CURRENT FRONTIER (claimed).** Q2 via submission.json: `technique_summary` literal text: `"--val-docs=10000 train shards + 50k val eval"`. Same author who shipped clean #1851 a week earlier. | +| **#2121** | @Kbediako | 2026-05-01 | 1.06099 | #1855 | `/workspace/pg_stageb_v2_seed0_1234/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 80 | 47,851,520 | **AMBIGUOUS** | hf-or-local-prep | Q1❌ Q2❌: README reproduce section is only a torchrun command with no data-acquisition step. Ships `prepare_caseops_data.py` (described as "CaseOps support files matching the accepted #1855 packaging pattern") but does not invoke it. 80 shards, single-nesting. | +| **#2123** | @vaibhavmishra1 | 2026-05-01 | 1.05933 | #1855 | `./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` | 78 | 47,851,520 | **CLEAN** | hf-dataset | Q1✅ Q2❌: README data setup: `huggingface-cli download romeerp/parameter-golf-caseops-v1 --repo-type dataset --local-dir ./data/datasets/fineweb10B_sp8192_caseops/`. Train log path matches that download destination. 78 shards (HF dataset has 78+ train shards). Closed; superseded by #2124. | +| **#2124** | @vaibhavmishra1 | 2026-05-01 | 1.05933 | #1855 | same | 78 | 47,851,520 | **CLEAN** | hf-dataset | Same directory/README as #2123; identical `huggingface-cli download` command. Resubmission of #2123. | + +## Notes on specific verdicts + +### #1855 — external primary source resolves ❌/❌ + +`prepare_caseops_data.py` invocation not in #1855's own README (which has no data-setup section), but PR #2018's `DATASET_AUDIT.md` is an external primary source that verifies #1855's first 80 shards byte-for-byte against `prepare_caseops_data.py --val-docs=10000` output. This constitutes direct code-level evidence even though it originates from a descendant PR. + +### #1868 — misleading README comment + +README reproduce section reads `python3 prepare_caseops_data.py # downloads from romeerp/parameter-golf-caseops-v1`. The comment is wrong: `prepare_caseops_data.py` does not download from HF; it processes local docs. Train log is unambiguous: `train_shards: 39`, `datasets_dir: /dev/shm/pgolf_data` — the same 39-shard HF subset used by parent #1851. CLEAN verdict is from the train log, not the README. + +### #1945 — stale README vs actual run script + +README "Data setup (run ONCE)" invokes `prepare_caseops_data.py` but is stale documentation from an earlier draft. The shipped `finalize_v18.sh` is the canonical reproduce script and contains `snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', local_dir='/workspace/caseops_data')` followed by `DATA_DIR=/workspace/caseops_data/datasets/` for training. The actual run used HF data. val_tokens=47,852,288 (off by 768 from canonical = shard-alignment artifact; same 50k-doc partition). 
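For concreteness, a minimal sketch contrasting the two setup paths the audit distinguishes — the clean HF route (the `snapshot_download` call quoted above from `finalize_v18.sh`) and the overlap arithmetic of the leaky default prep. Directory paths are illustrative, and the leaky-side constants restate the audit's numbers rather than the actual `prepare_caseops_data.py` code:

```python
from huggingface_hub import snapshot_download

# Clean route: pull the canonical pretokenized export, whose 50k val docs
# are disjoint from train by construction.
snapshot_download(
    repo_id="romeerp/parameter-golf-caseops-v1",
    repo_type="dataset",
    local_dir="/workspace/caseops_data",  # illustrative target dir
)

# Leaky route, restated for illustration: val is regenerated over docs
# [0, 50_000) while train starts at the --val-docs default of 10_000,
# so docs [10_000, 50_000) land in both splits.
VAL_DOCS_DEFAULT = 10_000   # train starts at this canonical-stream index
VAL_EVAL_SPAN = 50_000      # val shards cover docs [0, 50_000)
overlap = VAL_EVAL_SPAN - VAL_DOCS_DEFAULT
print(f"{overlap} docs ({overlap / VAL_EVAL_SPAN:.0%} of val) seen in training")
# -> 40000 docs (80% of val) seen in training
```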
+ +### #2014 — corrected prep script + +README explicitly labels HF download as "Preferred data setup" and the included `prepare_caseops_data.py` as "Fallback local rebuild." The fallback script is noted as having been modified to `--val-docs 50000` as default (and refusing to overwrite existing shards). val_tokens=47,853,343 (off by 1823 from canonical 47,851,520) suggests the fallback was actually used rather than HF download — but under either path, val covers docs 0–49,999 and train starts at doc 50,000 → **no overlap**. CLEAN on partition grounds. Note: val_tokens differs from canonical; not directly comparable to records at 47,851,520. + +### #1923 — 1502 shards definitively resolves LEAK + +HF dataset has 80 train shards. `prepare_caseops_data.py` with an enlarged `docs_selected.jsonl` (more than the standard 8.23M-doc canonical set) produces 1502 shards. No HF download would produce this count. LEAK confirmed without needing the README admission (which also independently confirms it via the original `val_tokens=9,662,464` — a 10k-doc default prep output). + +### #2109 — custom MP3 dataset variant + +`prepare_marker_pair_v3.py` fuses `[▁, MARKER]` 2-grams, producing `fineweb10B_sp8192_caseops_marker_pair_v3`. val_tokens=36,562,944 (not 47,851,520) because of vocab surgery on the val side. The underlying canonical-stream val10k-train partition mechanism (train starts at doc 10,000, val covers docs 0–49,999) is unchanged. + +### #2071 — symlink leak (separate mechanism) + +`caseops_enabled=False` in env, but pod data paths are symlinked to CaseOps-tokenized shards. The model trains on CaseOps data while the harness thinks it's reading SP8192 shards. README admits: "active via symlinked data." This is not a val10k-train overlap — it is a different audit-flagged issue. + +## How to interpret val_bpb across this table + +Records with **different** verdicts cannot have their val_bpb compared: +- LEAK records: model partially memorized 80% of val docs during training → val_bpb artificially inflated downward. +- CLEAN records: val docs are never-seen → val_bpb measures genuine generalization. + +Records with the **same** verdict and **same** `val_tokens` (47,851,520) can be compared directly. + +The ~0.018 bpb gap between LEAK frontier (#2118 at 1.04350) and CLEAN frontier (#1851/#1868 at 1.06128/1.06141) reflects: +1. Memorization of ~40,000 val docs (~0.005–0.012 bpb) +2. Genuine recipe improvements (Gated XSA, LQER top-1, AWQ-lite, AsymLogit, etc.) +3. Eval-time overlays (n-gram tilt, GPTQ_RESERVE_SECONDS) + +Distinguishing (1) from (2)+(3) requires running the #2118 recipe on clean HF data — the goal of spec 301. diff --git a/records/track_non_record_16mb/2026-05-02_RecurrenceBandNotes_leon2k2k2k/README.md b/records/track_non_record_16mb/2026-05-02_RecurrenceBandNotes_leon2k2k2k/README.md new file mode 100644 index 0000000000..c04a6dc148 --- /dev/null +++ b/records/track_non_record_16mb/2026-05-02_RecurrenceBandNotes_leon2k2k2k/README.md @@ -0,0 +1,234 @@ +# Notes on the recurrence band in compressed transformers + +A small set of architectural studies on the loop band (layers 3–5) of the +#1736 / 060A baseline. Each section is independent. + +--- + +## Section 1 — Learning mixing parameters in depth-recurrent loops + +A depth-recurrent loop runs the canonical Markov iteration through the loop +band (layers 3–5): + +``` +x_{k+1} = f(x_k) +``` + +Each pass uses only the previous pass's output. 
We replace this with a +learned mixing rule, train it end-to-end, and observe that the learned +mixing coefficients converge to a stable, nearly seed-invariant pattern +within a few hundred steps after looping activates. Once stabilized, the +coefficients can be read off the trained model and used as fixed constants +in a fresh training run. + +## Recurrent α-β + +We add learnable scalars to control how each pass commits to the residual +and to allow detached cross-layer carries within the same pass: + +``` +x_{k+1} = β_k · f(x_k) + Σ_j α_{k,j} · stop_grad(x_k^{(j)}) +``` + +with `β_k` initialized to 1 and `α_{k,j}` initialized to 0, so the loop +starts from the canonical Markov rule. Across the loop band (layers 3–5, +NL=2) this is a small number of scalars; they are routed to the scalar +optimizer and trained jointly with the rest of the model. + +During a full training run on the #1736 base, the scalars drift off their +initialization once looping activates at `frac=0.35`, then plateau. The +final values are reproducible across seeds — for example, layer 4 converges +to a self-subtract pattern at `α ≈ −0.348` (a learned gate), and layer 5 +stabilizes into a positive aggregation of the signals from layers 3 and 4. + +## Freezing the learned values + +We then read the converged values off the trained model and use them as +fixed constants in a new training run from scratch. The optimizer state +and per-step gradient on these scalars are dropped; only the values +survive. Because the loop now starts at the converged mixing pattern +rather than at the canonical Markov rule, the run is no longer +identity-at-init, but training-end quality matches. + +This is shipped as PR #1779 on top of #1736: + +| Submission | Mixing rule in loop band | val_bpb (3-seed mean) | Δ vs #1736 | +|---|---|---:|---:| +| #1736 (base) | canonical Markov | 1.06549 | — | +| #1779 (frozen α-β) | fixed α-β with cross-layer carry | **1.06421** | **−0.00128** | + +3-seed std on #1779 is 0.00023, so the gain is well outside seed noise. +Artifact size is unchanged (the frozen scalars are baked into the model +weights serialized into the 16 MB budget). + +The converged values used as fixed constants in #1779 are: + +``` +β = [1.5973, 1.8828, 1.9922] # layers 3, 4, 5 + + L3 L4 L5 +α = [[ 0.2520, −0.0210, −0.0124], # L3 contributions + [ 0.0669, −0.3477, 0.0031], # L4 contributions + [ 0.1387, 0.2412, 0.0272]] # L5 contributions +``` + +Two patterns stand out. Every β is well above 1, so each pass amplifies +its own block output rather than damping it — the optimizer chose to +overshoot the canonical Markov rule. And the diagonal of α is mixed: L3 +adds back ~25% of itself, L4 subtracts ~35% of itself (the learned-gate +self-subtract behavior), L5 leaves itself roughly alone but absorbs ~24% +of L4. The off-diagonal entries in row L5 also confirm L5 acts as an +aggregator over L3 and L4. + +## Anderson acceleration with frozen coefficients + +The same idea applies to a different mixing rule. Anderson acceleration +replaces the Markov iteration with a length-`m` mix of past iterates, +solved per batch via a small least-squares problem: + +``` +g_i = f(x_i) − x_i # residuals +α* = argmin_α ‖Σ_{i=k−m+1..k} α_i · g_i‖², Σ α_i = 1 +x_{k+1} = Σ α*_i · f(x_i) +``` + +Trained end-to-end (length-3 Anderson, per-batch LS), the coefficients +land in the noise band of canonical recurrence but pay a ~25% throughput +penalty for the per-batch solve. 
Inspecting the trained model, the +per-batch α distribution concentrates tightly around + +``` +α ≈ [+0.55, −0.67, +1.12] +``` + +Following the same procedure as for α-β, we drop the LS solve and +hardcode these coefficients as constants. The result is a +fixed-coefficient extrapolation across the last three iterates with no +runtime overhead beyond the canonical loop. + +| Variant | Mixing rule | Throughput vs canonical | val_bpb (single seed) | +|---|---|---:|---:| +| Canonical | Markov | 1.00× | 1.06108 | +| Anderson, learned per-batch α | length-3 LS | 0.75× | 1.06083 | +| Anderson, frozen α | fixed `[+0.55, −0.67, +1.12]` | 1.00× | 1.05968 | + +The frozen-Anderson result is single-seed; multi-seed confirmation has +not been run. + +--- + +## Section 2 — MLP sizing across the three stages + +The loop band runs each of layers 3, 4, 5 three times per forward pass +(NL=2). Each pass reads the same FFN weights, so the parameters in the +loop band see roughly 3× the use per token of the FFN parameters in the +non-looped layers. A natural question is whether the loop band deserves +more FFN capacity than the rest of the model at fixed total parameters — +i.e., whether reallocating width from the non-looped layers into the +loop band is a free win. + +We split the 11 physical layers into three positional stages and +parameterize the FFN width as a per-stage multiplier of `model_dim`: + +``` +stage layers width multiplier +early 0–2 MLP_EARLY_MULT +middle 3–5 MLP_MIDDLE_MULT # the loop band +late 6–10 MLP_LATE_MULT +``` + +The baseline uses `4.0` everywhere, for a total of `11 × 4.0 = 44.0` +width-units. We tried three reallocation schemes that hold the total +fixed at 44.0 width-units while widening the middle stage to 5.0: + +| arm | early | middle | late | direction | +|---|---:|---:|---:|---| +| baseline | 4.0 | 4.0 | 4.0 | uniform | +| 040A | 3.625 | 5.0 | 3.625 | shrink both sides evenly | +| 040B | 3.0 | 5.0 | 4.0 | shrink early, keep late | +| 040C | 4.0 | 5.0 | 3.4 | keep early, shrink late | + +Single-seed training-only screen on the 038/039 fullfloat research line, +2×H100, 600s wallclock cap, no quantization or TTT. The absolute val_bpb +values are pre-quant post-EMA from this short screen, *not* directly +comparable to the post-quant post-TTT numbers in Section 1 — this is a +relative comparison of training quality between MLP schedules, not an +endpoint number. Pre-quant post-EMA val_bpb on the validation set: + +| arm | val_bpb (pre-quant post-EMA) | Δ vs uniform | +|---|---:|---:| +| baseline (uniform 4.0) | 1.16501 | — | +| 040A (3.625 / 5.0 / 3.625) | 1.16742 | +0.00241 | +| 040B (3.0 / 5.0 / 4.0) | 1.16744 | +0.00244 | +| 040C (4.0 / 5.0 / 3.4) | **1.16484** | **−0.00017** | + +Three observations: + +- **The middle-widen direction is real but small.** 040C is the only + reallocation that doesn't regress, and the gain is comfortably inside + single-seed noise (Δ ≈ −0.0002 on a screen with no seed average). + Treat it as "tied with baseline," not a win. +- **Shrinking the early stage is more expensive than shrinking the + late stage.** 040B (early shrunk to 3.0, late kept at 4.0) loses + +0.00244; 040C (early kept at 4.0, late shrunk to 3.4) gains + −0.00017. A symmetric shrink (040A) lands close to 040B. The early + layers (0–2) are doing work that doesn't compress; the late layers + (6–10) tolerate it. 
+- **The middle-stage gain is bounded above by what the late-shrink + costs.** Whatever extra capacity the middle stage absorbs from going + 4.0 → 5.0, the late stage gives back roughly the same amount when it + goes 4.0 → 3.4. The two effects nearly cancel. The implication is that + the loop band is *not* obviously starved for FFN capacity at the + uniform baseline. + +--- + +## Section 3 — Sizing the loop band + +The canonical 060A loop band is the contiguous set {3, 4, 5} run at +NL=2, so each of layers 3, 4, 5 is visited three times per forward +pass. The full forward does 17 layer-applications, with 9 of them +inside the loop band. Two knobs control the total compute spent inside +the band: which layers form the band (band-set), and how many times +each is visited (NL). We screened both directions on 060A. + +| spec | band-set | NL | loop-band passes | description | +|---|---|---:|---:|---| +| 060A canonical | {3,4,5} | 2 | 9 | reference | +| 041B | {3,4,5} | 1 | 3 | half the canonical loop compute | +| 041D | {5} | 2 | 3 | single-layer band, only layer 5 | +| 041H | {4,5} | 2 | 6 | drop the front of the band | +| 070 | {3,4} | 2 | 6 | drop the back of the band | +| 041L | {3,4,5} | 3 | 12 | more visits per layer | +| 041N | {3,4,5} | 4 | 15 | more still | + +Same screen protocol throughout: single seed 42, 4×H100, 1200s +wallclock, no TTT. Pre-quant post-EMA val_bpb: + +| spec | structure | pre-quant post-EMA | Δ vs canonical | +|---|---|---:|---:| +| 060A canonical | {3,4,5} NL=2 | **1.06358** | — | +| 041B | {3,4,5} NL=1 | 1.06842 | +0.00484 | +| 041D | {5} NL=2 | 1.06993 | +0.00635 | +| 041H | {4,5} NL=2 | 1.06693 | +0.00335 | +| 070 | {3,4} NL=2 | 1.06595 | +0.00237 | +| 041L | {3,4,5} NL=3 | 1.06615 | +0.00257 | +| 041N | {3,4,5} NL=4 | 1.06888 | +0.00530 | + +Two observations: + +- **Canonical is locally optimal in both directions.** Both shrinking + (NL=1, single-layer band, drop a layer) and growing (NL=3, NL=4) lose + to the canonical {3,4,5} NL=2 — the loss is monotonic in how far the + configuration sits from canonical. NL=3 (+0.00257) is the closest + miss; NL=4 (+0.00530) loses about as much as halving the loop + compute. +- **Band shape is roughly position-symmetric.** Dropping layer 3 (041H, + +0.00335) and dropping layer 5 (070, +0.00237) cost similar amounts. + Reducing to a single layer (041D, +0.00635) is worse than either, but + in the same direction. There's no specific layer in {3,4,5} that's + uniquely load-bearing; the band-as-a-whole is what matters. + +The 041L NL=3 result is interesting in isolation — the gap to +canonical (+0.00257) is small enough that with multi-seed averaging +it may close. We did not promote it past the screen. diff --git a/research/ideas/033b-ttt-adapt-alpha-beta-high-lr.md b/research/ideas/033b-ttt-adapt-alpha-beta-high-lr.md new file mode 100644 index 0000000000..9708590eca --- /dev/null +++ b/research/ideas/033b-ttt-adapt-alpha-beta-high-lr.md @@ -0,0 +1,115 @@ +# Idea 033b — TTT alpha/beta adaptation with aggressive LR + +## Thesis + +`033` allowed TTT to adapt frozen `alpha/beta` on top of the same `026 seed_42` +checkpoint used by `028B`, but the effect was effectively negligible: + +- `028B`: `1.0664948109` +- `033`: `1.0664878103` +- delta: about `-7e-06` bpb + +That pattern is consistent with an underpowered adaptation: + +- `recur_alpha` moved a little +- `recur_beta` did not move at measurable precision +- outcome barely changed + +So the next cheap question is not architectural. 
It is simply: + +- did `033` fail because `TTT_ALPHA_BETA_LR_SCALE=0.25` was too small? + +## Mechanism + +Keep everything from `033` the same: + +- same checkpoint +- same hotstart path +- same TTT setup +- same LoRA warm-start behavior +- same code commit + +Only change: + +- `TTT_ALPHA_BETA_LR_SCALE=10.0` + +Since the base `TTT_LORA_LR` in this codepath is `1e-4`, this gives an effective +alpha/beta LR of: + +```text +1e-4 * 10.0 = 1e-3 +``` + +## Why this is worth one run + +`033` at `2.5e-5` was so conservative that it was close to a no-op. + +An aggressive rerun answers the question quickly: + +- if `beta` still does not move and result still does not improve, the line is + probably exhausted +- if `alpha/beta` move materially and TTT improves, then `033` was just + under-tuned +- if it destabilizes, we also learn that immediately + +## Expected outcomes + +### Positive + +- `recur_beta_max_drift` becomes clearly nonzero +- post-TTT beats `033` by more than noise + +### Null + +- drift increases but result stays flat +- or `beta` still stays effectively frozen + +### Negative + +- TTT becomes noisy or regresses +- post-TTT degrades beyond `028A` territory + +## Outcome + +`033b` answered the question cleanly. + +Observed parameter movement: + +- `recur_alpha_max_drift = 0.240723` +- `recur_beta_max_drift = 0.062500` + +So unlike `033`, both parameter sets moved materially under TTT: + +- `alpha` moved a lot +- `beta` also moved meaningfully + +But the final result got worse: + +- `028B`: `1.0664948109` +- `033`: `1.0664878103` +- `033b`: `1.06666734` + +So the aggressive LR did what it was supposed to do mechanically, but it hurt the +actual TTT outcome. + +## Conclusion + +This line now tells a coherent story: + +- tiny alpha/beta adaptation is basically negligible +- aggressive alpha/beta adaptation is harmful + +That means the flat `033` result was not simply because alpha/beta were impossible +to move. They are movable, but pushing them harder degrades quality. + +## Recommendation + +Do not promote TTT alpha/beta adaptation as a mainline lever. + +If revisited at all, the only reasonable follow-ups are: + +- `alpha`-only TTT adaptation with `beta` frozen +- or one medium-LR interpolation between `033` and `033b` + +But this line should be treated as low priority now, not expanded into a broad +research branch. diff --git a/research/ideas/035c-polar-ns-on-030-family.md b/research/ideas/035c-polar-ns-on-030-family.md new file mode 100644 index 0000000000..101f58cd8e --- /dev/null +++ b/research/ideas/035c-polar-ns-on-030-family.md @@ -0,0 +1,28 @@ +# Idea 035c — Polar NS on the original `030` alpha/beta family + +> Obsolete slot. The active continuation is now `035d` then `035e`. + +## Thesis + +The second-ranked `#1779` follow-up is the optimizer-side refinement: + +- keep the stronger original `030` alpha/beta family fixed +- replace stock Muon's repeated fixed Newton-Schulz tuple with the 5 Polar + Express per-iteration tuples from PR `#1344` +- first test it as a `4×H` pre-quant screen with no TTT + +## Benchmark + +Primary `4×H` alpha/beta-family reference: + +- `026` screen seed `314`: pre-quant `1.06770372` + +Direct schedule/optimizer siblings: + +- `035` = `MIN_LR=0.10` +- `035b` = loop-onset plateau + +## First question + +Can Polar NS alone beat `1.06770372` on the original `030` family in `4×H` +screen form? 
diff --git a/research/specs/035c-polar-ns-on-030-family.md b/research/specs/035c-polar-ns-on-030-family.md new file mode 100644 index 0000000000..43f847d911 --- /dev/null +++ b/research/specs/035c-polar-ns-on-030-family.md @@ -0,0 +1,209 @@ +# Spec 035c — Polar NS on the original `030` alpha/beta family + +> Obsolete slot. Superseded in the active research numbering by `035d` +> (`#1787`-lite non-sparse bundle) and `035e` (sparse-gate follow-up). + +**Slug:** `polar-ns-on-030-family` +**Created:** 2026-04-24 +**Status:** OBSOLETE +**Branch:** `exp/035c-polar-ns-on-030-family` +**Commit:** `188ce0b` +**Links to:** `research/ideas/035c-polar-ns-on-030-family.md`, `research/ideas/1779-next-adds-ranked.md`, `research/specs/030-025b-seed314-new-ttt.md` + +## Hypothesis + +If `Polar NS` is a real optimizer-quality refinement rather than a bundled +artifact from `#1787`, it should transfer to the stronger original `030` +alpha/beta family in `4×H` screen form. + +## Baseline + +Use the same intended `030` `4×H` screen stack as `035`. + +Pinned lineage: + +- branch lineage: `exp/029-full-stack` +- runnable code line based on `c3a99b3` +- frozen `025b` carry +- `NUM_LOOPS=2` + +Primary `4×H` benchmark: + +- `026` screen seed `314`: pre-quant `1.06770372` + +Direct siblings: + +- `035` (`MIN_LR=0.10`) +- `035b` (loop-onset plateau) + +## Config diff + +Requires a small Muon optimizer patch on top of the `030` family code line. + +Only intended diffs from the intended `030` `4×H` screen stack: + +- Polar NS code present via this branch +- optional `MIN_LR` override chosen at launch from the pinned shortlist below + +Everything else must remain identical, including: + +- `CASEOPS_ENABLED=1` +- `TTT_ENABLED=0` +- `MLP_CLIP_SIGMAS=12.0` +- `ATTN_CLIP_SIGMAS=13.0` +- `EMBED_BITS=7` +- `EMBED_CLIP_SIGMAS=15.0` +- `MATRIX_LR=0.026` +- `GATED_ATTN_ENABLED=1` +- `GATED_ATTN_INIT_STD=0.005` +- `GATED_ATTN_QUANT_GATE=1` +- `RECUR_ALPHA_ENABLED=1` +- `NUM_LOOPS=2` +- `LOOP_START=3` +- `LOOP_END=5` +- `ENABLE_LOOPING_AT=0.35` +- `MUON_BACKEND_STEPS=5` +- `GPTQ_RESERVE_SECONDS=4` +- `GPTQ_CALIBRATION_BATCHES=16` +- `MAX_WALLCLOCK_SECONDS=1200` +- `TRAIN_LOG_EVERY=100` +- `SEED=314` + +Runtime-selectable `MIN_LR` shortlist: + +- `0.0` (default, pure Polar NS isolation) +- `0.05` +- `0.10` +- `0.15` + +Execution may choose one of those values at launch time. +Any other `MIN_LR` value makes the rung invalid. + +## Polar NS semantics + +Replace stock Muon's repeated fixed coefficients: + +- `(3.4445, -4.775, 2.0315)` applied 5 times + +with the 5 per-iteration Polar Express tuples from PR `#1344` / `#1787`: + +1. `(8.156554524902461, -22.48329292557795, 15.878769915207462)` +2. `(4.042929935166739, -2.808917465908714, 0.5000178451051316)` +3. `(3.8916678022926607, -2.772484153217685, 0.5060648178503393)` +4. `(3.285753657755655, -2.3681294933425376, 0.46449024233003106)` +5. `(2.3465413258596377, -1.7097828382687081, 0.42323551169305323)` + +Keep: + +- `MUON_BACKEND_STEPS=5` + +## Regime + +Use a `4×H100` screen-only rung. 
+ +Pinned intent: + +- exact `030`-family `4×H` screen stack +- pre-quant gate only +- no TTT + +## Run protocol + +Launch variants: + +- `035cA` +- `SEED=314` +- `MIN_LR=0.0` by default + +Optional combo-prep variants: + +- `035cB` +- same branch/commit, but with `MIN_LR` chosen from the shortlist above + +Execution rule: + +- launch from `exp/035c-polar-ns-on-030-family` +- use the pinned runnable code commit in this spec +- match the original intended `030` `4×H` screen stack exactly +- apply only the Polar NS code lineage change, plus an optional `MIN_LR` + override from the pinned shortlist +- if the produced `config.json` differs on anything else, the rung is invalid + +Pinned command: + +```bash +python -c "import brotli" + +cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT +git fetch fork +git checkout 188ce0b + +if [ -f /workspace/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model ]; then + export DATA_DIR=/workspace +elif [ -f /workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model ]; then + export DATA_DIR=/workspace/parameter-golf/data +else + echo "CaseOps tokenizer not found under either JP or NA layout" >&2 + exit 1 +fi + +mkdir -p /workspace/runs/035c-polar-ns-on-030-family/run_a/seed_314 +mkdir -p /tmp/torch_inductor_cache_035c_a + +NCCL_NET=Socket DATA_DIR=$DATA_DIR \ +ARTIFACT_DIR=/workspace/runs/035c-polar-ns-on-030-family/run_a/seed_314 \ +TORCHINDUCTOR_CACHE_DIR=/tmp/torch_inductor_cache_035c_a \ +CASEOPS_ENABLED=1 \ +TTT_ENABLED=0 \ +MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \ +EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \ +MATRIX_LR=0.026 \ +GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \ +RECUR_ALPHA_ENABLED=1 \ +NUM_LOOPS=2 \ +LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 \ +MUON_BACKEND_STEPS=5 \ +GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \ +MIN_LR=${MIN_LR:-0.0} \ +MAX_WALLCLOCK_SECONDS=1200 \ +TRAIN_LOG_EVERY=100 \ +SEED=314 \ +torchrun --standalone --nproc_per_node=4 train_gpt.py \ + > /workspace/runs/035c-polar-ns-on-030-family/run_a/seed_314/train.log 2>&1 +``` + +## Required artifacts + +- training log +- `config.json` +- pre-quant metrics in the final output/log + +## Sanity gate + +Before accepting the result, execution must verify from `config.json` that the +only intentional diff from the intended `030` `4×H` screen stack is the Polar +NS code lineage itself, plus an optional `MIN_LR` value from the pinned +shortlist. 
+ +Data-root rule: + +- if the CaseOps tokenizer exists under `/workspace/data/...`, use + `DATA_DIR=/workspace` +- if it exists under `/workspace/parameter-golf/data/...`, use + `DATA_DIR=/workspace/parameter-golf/data` +- if neither layout exists, abort + +## Accept criteria + +Strong success: + +- pre-quant beats `1.06770372` + +Weak success: + +- directionally positive enough to justify promoting the chosen `MIN_LR` + combination or a dedicated follow-up + +Failure: + +- flat or worse than the `026` `4×H` reference diff --git a/research/specs/038-smear-lqer-asym-8h.md b/research/specs/038-smear-lqer-asym-8h.md new file mode 100644 index 0000000000..8c4143c1cc --- /dev/null +++ b/research/specs/038-smear-lqer-asym-8h.md @@ -0,0 +1,141 @@ +# Spec 038 — SmearGate + LQER-asym on top of the full-float sparse-carry `8×H` line + +**Slug:** `smear-lqer-asym-8h` +**Created:** 2026-04-24 +**Status:** READY +**Branch:** `exp/038-smear-lqer-8h-promotion` +**Commit:** `9636d34` +**Links to:** `research/specs/037-fullfloat-sparse-updated-alpha-beta-8h.md`, `research/specs/035e-sparse-gate-on-1779-family.md`, `runs/035-series-report.md` + +## Hypothesis + +`037` promotes the current best internal sparse-gate family with the full-float +learned `alpha/beta`. `038` adds the two orthogonal `#1797` levers on top: + +- **SmearGate** during training / forward +- **LQER-asym** during GPTQ pack + quantized eval / TTT + +Because these hit different parts of the stack, they are worth trying together +directly on the strongest current sparse-family promotion line. + +## Baseline + +Immediate baseline: + +- `037` full-float sparse-updated-`alpha/beta` `8×H` line + +External reference points: + +- `#1787` seed `42`: pre-quant `1.06764`, quantized `1.07681`, post-TTT `1.06400` +- `#1797` seed `42`: pre-TTT `1.07460`, post-TTT `1.06181` +- `#1797` mean: pre-TTT `1.07443`, post-TTT `1.06157` + +## Config diff + +Relative to `037`: + +- `SMEAR_GATE_ENABLED=1` +- `GATE_WINDOW=12` +- `LQER_ENABLED=1` +- `LQER_RANK=4` +- `LQER_TOP_K=3` +- `LQER_FACTOR_BITS=4` +- `LQER_ASYM_ENABLED=1` +- `LQER_ASYM_GROUP=64` + +Everything else stays on the `037` stack: + +- sparse gate on +- dense gated-attn off +- full-float frozen updated `035h` carry +- `MIN_LR=0.10` +- `FUSED_CE_ENABLED=1` +- phased LoRA-TTT +- `VAL_LOSS_EVERY=0` +- `MAX_WALLCLOCK_SECONDS=600` + +Pinned runnable code source: + +- shell/spec branch: `exp/038-smear-lqer-8h-promotion` +- runnable code branch: `exp/038-fullfloat-smear-lqer-asym` +- runnable code commit: `c8620b6` + +## Regime + +This is a direct `8×H100` full-pipeline promotion. + +- no smoke rung +- full quantized eval +- phased LoRA-TTT +- same updated full-float frozen `alpha/beta` as `037` + +## Seed policy + +Use the public comparison seed family for apples-to-apples checks against +`#1787` / `#1797`: + +- `42` +- `0` +- `1234` + +Recommended first seed: + +- `42` + +## Hardware ladder + +1. `8×H100` full pipeline, `600s`, first seed `42` + +Optional later: + +2. 
additional seeds from the approved shortlist + +## Run protocol + +Primary rung: + +- `038A` +- `8×H100` +- no smoke +- full quantized eval + phased LoRA-TTT +- same `037` sparse-family full-float carry +- add SmearGate + LQER-asym + +Execution rule: + +- launch from `exp/038-fullfloat-smear-lqer-asym` +- use the pinned runnable code commit +- validate the runnable file at: + - `records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py` +- ignore branch-root `./train_gpt.py` on this branch; it is the generic starter + baseline, not the runnable record-stack file for this spec +- keep the full `037` stack unchanged +- apply only the SmearGate + LQER-asym diffs above +- require `config.json` +- if anything else drifts, the rung is invalid + +## Acceptance + +Primary target: + +- healthy quantized and post-TTT run on the `037` base with no path mismatch + +Competitive target: + +- post-TTT at or below the `#1787` seed-`42` reference (`1.06400`) + +Stretch target: + +- enter the low-`1.062` band and become comparable to `#1797` + +Failure: + +- no measurable improvement vs `037` +- or regression concentrated in quantized / post-TTT stage + +## Notes + +- SmearGate already exists in the `037` code line and is mirrored into the TTT + path; this spec simply turns it on. +- LQER-asym is the actual code addition in the runnable branch. +- This is intentionally a direct combined test, not separate ablations. diff --git a/research/specs/039b-loop-band-activation-screen.md b/research/specs/039b-loop-band-activation-screen.md new file mode 100644 index 0000000000..f5e71f114a --- /dev/null +++ b/research/specs/039b-loop-band-activation-screen.md @@ -0,0 +1,323 @@ +# Spec 039b — loop-band activation screen + +**Slug:** `loop-band-activation-screen` +**Created:** 2026-04-24 +**Updated:** 2026-04-25 +**Status:** READY +**Branch:** `exp/039b-loop-band-activation-screen` +**Commit:** `5bbf12f` +**Links to:** `research/ideas/039b-loop-band-activation-screen.md`, `research/specs/039-neg-slope-screen-on-1797-base.md` + +## Backward bug fix + +Original commit `8f10d16` had a bug in the fused `LeakyReLU(s)²` Triton +backward (`2·s·x` instead of `2·s²·x` for negative-side inputs). All prior +039b results are invalid. + +Critically, the contamination was asymmetric across arms: + +- **baseline / 039bC**: all layers used the fused path → buggy backward everywhere +- **039bA / 039bB**: outer layers used the fused path (buggy), but middle layers + used `penalized_tanh` / `tanh` via eager execution → correct backward in the + middle band + +This means 039bA may have "won" partly because it received correct gradients in +layers 3,4,5 while the baseline did not — an unfair advantage unrelated to the +activation choice. Re-running on `5bbf12f` makes all arms use the correct +`leaky_relu_square` backward on the outer layers, giving a fair comparison. + +## Hypothesis + +The recurrent middle physical layers `3,4,5` want a different MLP activation +than the outer trunk. Keeping outer layers on `LeakyReLU(0.5)^2` but changing +only the loop band may improve short-run learning signal. + +## Baseline + +Use the fixed `039b` code line. + +Pinned current runnable base: + +- branch: `exp/039b-loop-band-activation-screen` +- commit: `5bbf12f` + +Uniform activation baseline: + +- all `11` physical layers use `LeakyReLU(0.5)^2` + +## Config diff + +Keep the whole `039` stack fixed and change only the loop-band MLP activation. 
+ +Pinned implementation API for this spec: + +- `MLP_OUTER_ACTIVATION=leaky_relu_square` +- `MLP_MIDDLE_ACTIVATION=` +- `MLP_MIDDLE_NEGATIVE_SLOPE=` +- `MLP_MIDDLE_LAYERS=3,4,5` +- `TRAINING_ONLY_SCREEN=1` + +Interpretation: + +- outer layers are all physical layers not in `MLP_MIDDLE_LAYERS` +- middle layers are exactly `MLP_MIDDLE_LAYERS` +- `NEGATIVE_SLOPE=0.5` remains the outer-layer default +- `MLP_MIDDLE_NEGATIVE_SLOPE` matters only when + `MLP_MIDDLE_ACTIVATION=leaky_relu_square` + +Pinned runnable code source: + +- branch: `exp/039b-loop-band-activation-screen` +- commit: `5bbf12f` +- script: + [train_gpt.py](/home/claude-user/ai-workspace/projects/parameter-golf/worktrees/039b-loop-band-activation/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py) + +Looped middle physical layers are: + +- `3,4,5` + +Three arms: + +### baseline — uniform current activation + +- outer -> `LeakyReLU(0.5)^2` +- middle -> `LeakyReLU(0.5)^2` + +### 039bA — middle penalized tanh + +- outer -> `LeakyReLU(0.5)^2` +- middle -> `penalized_tanh` + +### 039bB — middle tanh + +- outer -> `LeakyReLU(0.5)^2` +- middle -> `tanh` + +> 039bC (middle `LeakyReLU(0.3)²`) shelved — slope variants within the same +> family are no longer interesting now that s=0.5 is confirmed as the outer +> default. See `_shelved/` if needed. + +Implementation note: + +- outer layers keep the current fused path +- non-default middle activations may use eager execution in the MLP path +- quantization / serialization / deserialize support is explicitly out of scope + for this first screen + +## Regime + +This is an explicitly training-only screen. + +Pinned short-run intent: + +- `4×H100` +- `SEED=42` +- `MAX_WALLCLOCK_SECONDS=1200` +- `ENABLE_LOOPING_AT=0.35` +- `TTT_ENABLED=0` +- `TRAINING_ONLY_SCREEN=1` + +Compare: + +- steps reached +- train loss trajectory +- validation loss / BPB from the pre-quant diagnostic +- train time + +Out of scope for this spec: + +- GPTQ / quantized artifact generation +- deserialize / rebank compatibility questions +- TTT + +## Seed policy + +Use one seed only: + +- `42` + +## Hardware ladder + +1. `4×H100` only +2. no `8×H100` in this spec +3. no quantized eval in this spec + +## Run protocol + +Run three training-only jobs: + +1. uniform baseline +2. `039bA` middle `penalized_tanh` +3. `039bB` middle `tanh` + +Same seed, same wallclock, same env otherwise. 
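+
+For orientation, a minimal sketch of what the three arms change in the
+middle-band MLP (illustrative PyTorch; the `mlp_act` dispatch and its
+signature are ours, not the repo API, and `penalized_tanh` assumes the
+standard 0.25 negative-side penalty):
+
+```python
+import torch
+import torch.nn.functional as F
+
+def leaky_relu_square(x: torch.Tensor, s: float = 0.5) -> torch.Tensor:
+    # outer-trunk default: LeakyReLU(s) followed by squaring
+    return F.leaky_relu(x, negative_slope=s) ** 2
+
+def penalized_tanh(x: torch.Tensor, penalty: float = 0.25) -> torch.Tensor:
+    # tanh with a damped negative side (0.25 penalty assumed)
+    t = torch.tanh(x)
+    return torch.where(x > 0, t, penalty * t)
+
+# hypothetical dispatch keyed on MLP_MIDDLE_ACTIVATION
+MIDDLE_ACT = {
+    "leaky_relu_square": leaky_relu_square,  # baseline arm
+    "penalized_tanh": penalized_tanh,        # 039bA
+    "tanh": torch.tanh,                      # 039bB
+}
+
+def mlp_act(x: torch.Tensor, layer: int, middle: str) -> torch.Tensor:
+    # layers 3,4,5 take the arm's middle activation; outer layers stay
+    # on the fused-path default
+    return MIDDLE_ACT[middle](x) if layer in (3, 4, 5) else leaky_relu_square(x)
+```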
+ +Execution rule: + +- stop after the pre-quant diagnostic +- do not attempt to serialize or evaluate the quantized model in this spec + +Resolved base env block: + +```bash +DATA_DIR=/workspace/parameter-golf/data +DATASETS_DIR=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved +TOKENIZER_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model +TRAIN_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin +VAL_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin +VAL_BYTES_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin +VOCAB_SIZE=8192 +NUM_LAYERS=11 +XSA_LAST_N=11 +MODEL_DIM=512 +NUM_KV_HEADS=4 +NUM_HEADS=8 +MLP_MULT=4.0 +NEGATIVE_SLOPE=0.5 +MLP_OUTER_ACTIVATION=leaky_relu_square +MLP_MIDDLE_LAYERS=3,4,5 +TIE_EMBEDDINGS=1 +LOGIT_SOFTCAP=30 +ROPE_BASE=10000 +ROPE_DIMS=16 +ROPE_TRAIN_SEQ_LEN=2048 +ROPE_YARN=0 +LN_SCALE=1 +QK_GAIN_INIT=5.0 +NUM_LOOPS=2 +LOOP_START=3 +LOOP_END=5 +ENABLE_LOOPING_AT=0.35 +PARALLEL_START_LAYER=8 +PARALLEL_FINAL_LANE=mean +MIN_LR=0.1 +EMBED_LR=0.6 +TIED_EMBED_LR=0.03 +TIED_EMBED_INIT_STD=0.005 +MATRIX_LR=0.026 +SCALAR_LR=0.02 +MUON_MOMENTUM=0.97 +MUON_BACKEND_STEPS=5 +MUON_MOMENTUM_WARMUP_START=0.92 +MUON_MOMENTUM_WARMUP_STEPS=1500 +MUON_ROW_NORMALIZE=1 +BETA1=0.9 +BETA2=0.95 +ADAM_EPS=1e-8 +GRAD_CLIP_NORM=0.3 +ADAM_WD=0.02 +MUON_WD=0.095 +EMBED_WD=0.085 +EMA_DECAY=0.9965 +TRAIN_BATCH_TOKENS=786432 +TRAIN_SEQ_LEN=2048 +TRAIN_LOG_EVERY=100 +ITERATIONS=20000 +WARMDOWN_FRAC=0.75 +WARMUP_STEPS=20 +VAL_BATCH_TOKENS=524288 +EVAL_SEQ_LEN=2048 +EVAL_STRIDE=64 +VAL_LOSS_EVERY=0 +CASEOPS_ENABLED=1 +COMPRESSOR=brotli +MATRIX_BITS=6 +MATRIX_CLIP_SIGMAS=12.85 +ATTN_CLIP_SIGMAS=13.0 +MLP_CLIP_SIGMAS=12.0 +EMBED_BITS=7 +EMBED_CLIP_SIGMAS=15.0 +GPTQ_CALIBRATION_BATCHES=16 +GPTQ_RESERVE_SECONDS=0.5 +SKIP_GATES_ENABLED=1 +SPARSE_ATTN_GATE_ENABLED=1 +SPARSE_ATTN_GATE_INIT_STD=0.0 +SPARSE_ATTN_GATE_SCALE=1.0 +GATED_ATTN_ENABLED=0 +GATED_ATTN_INIT_STD=0.005 +GATED_ATTN_QUANT_GATE=1 +ATTN_OUT_GATE_ENABLED=0 +ATTN_OUT_GATE_SRC=proj +GATE_WINDOW=12 +RECUR_ALPHA_ENABLED=1 +RECUR_DIAG_P2P_COS=0 +SMEAR_GATE_ENABLED=1 +LQER_ENABLED=1 +LQER_RANK=4 +LQER_TOP_K=3 +LQER_FACTOR_BITS=4 +LQER_ASYM_ENABLED=1 +LQER_ASYM_GROUP=64 +SPINQUANT_ENABLED=0 +SPINQUANT_SEED=42 +SPINQUANT_SITES=attn_in,attn_proj_in,mlp_in,mlp_proj_in +SEED=42 +MAX_WALLCLOCK_SECONDS=1200 +TTT_ENABLED=0 +TRAINING_ONLY_SCREEN=1 +``` + +Canonical launch block: + +```bash +declare -A MIDDLE_ACTIVATION=( + [baseline]=leaky_relu_square + [039bA]=penalized_tanh + [039bB]=tanh +) + +for arm in baseline 039bA 039bB; do + env \ + DATA_DIR=/workspace/parameter-golf/data \ + DATASETS_DIR=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \ + TOKENIZER_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \ + 
TRAIN_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin \ + VAL_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin \ + VAL_BYTES_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin \ + VOCAB_SIZE=8192 NUM_LAYERS=11 XSA_LAST_N=11 MODEL_DIM=512 NUM_KV_HEADS=4 NUM_HEADS=8 MLP_MULT=4.0 NEGATIVE_SLOPE=0.5 \ + MLP_OUTER_ACTIVATION=leaky_relu_square MLP_MIDDLE_LAYERS=3,4,5 \ + MLP_MIDDLE_ACTIVATION="${MIDDLE_ACTIVATION[$arm]}" \ + TIE_EMBEDDINGS=1 LOGIT_SOFTCAP=30 ROPE_BASE=10000 ROPE_DIMS=16 ROPE_TRAIN_SEQ_LEN=2048 ROPE_YARN=0 LN_SCALE=1 QK_GAIN_INIT=5.0 \ + NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 PARALLEL_START_LAYER=8 PARALLEL_FINAL_LANE=mean \ + MIN_LR=0.1 EMBED_LR=0.6 TIED_EMBED_LR=0.03 TIED_EMBED_INIT_STD=0.005 MATRIX_LR=0.026 SCALAR_LR=0.02 \ + MUON_MOMENTUM=0.97 MUON_BACKEND_STEPS=5 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_ROW_NORMALIZE=1 \ + BETA1=0.9 BETA2=0.95 ADAM_EPS=1e-8 GRAD_CLIP_NORM=0.3 ADAM_WD=0.02 MUON_WD=0.095 EMBED_WD=0.085 EMA_DECAY=0.9965 \ + TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 TRAIN_LOG_EVERY=100 ITERATIONS=20000 WARMDOWN_FRAC=0.75 WARMUP_STEPS=20 \ + VAL_BATCH_TOKENS=524288 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 VAL_LOSS_EVERY=0 \ + CASEOPS_ENABLED=1 COMPRESSOR=brotli MATRIX_BITS=6 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=12.0 EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 GPTQ_CALIBRATION_BATCHES=16 GPTQ_RESERVE_SECONDS=0.5 \ + SKIP_GATES_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_INIT_STD=0.0 SPARSE_ATTN_GATE_SCALE=1.0 \ + GATED_ATTN_ENABLED=0 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 ATTN_OUT_GATE_ENABLED=0 ATTN_OUT_GATE_SRC=proj GATE_WINDOW=12 \ + RECUR_ALPHA_ENABLED=1 RECUR_DIAG_P2P_COS=0 SMEAR_GATE_ENABLED=1 \ + LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=3 LQER_FACTOR_BITS=4 LQER_ASYM_ENABLED=1 LQER_ASYM_GROUP=64 \ + SPINQUANT_ENABLED=0 SPINQUANT_SEED=42 SPINQUANT_SITES=attn_in,attn_proj_in,mlp_in,mlp_proj_in \ + SEED=42 MAX_WALLCLOCK_SECONDS=1200 TTT_ENABLED=0 TRAINING_ONLY_SCREEN=1 \ + RUN_ID="039b-${arm}" \ + torchrun --standalone --nproc_per_node=4 train_gpt.py +done +``` + +## Acceptance + +Interesting outcome: + +- any middle-band alternative clearly beats the uniform baseline on the short + training screen + +Most interesting outcome: + +- `039bA` wins, supporting the story that recurrent-band MLPs want a more + bounded activation than the outer trunk + +Kill criteria: + +- all three middle-band alternatives are flat or worse than baseline +- gains are dominated by throughput/compiler artifacts rather than learning + signal + +## Open questions + +- does a loop-band activation win survive quantization later, or is it only a + training-only effect? +- is penalized-tanh better than just lowering the leak within the same family? +- should a winning loop-band activation later be combined with `040`, or tested + separately first? 
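+
+## Appendix — the backward bug in miniature
+
+An illustrative PyTorch autograd sketch of the `8f10d16` bug described
+under "Backward bug fix" above. This stands in for the fused Triton
+kernel; the class and variable names are ours, not the repo's:
+
+```python
+import torch
+
+class LeakyReLUSquare(torch.autograd.Function):
+    """f(x) = LeakyReLU(x, s)^2: x^2 for x >= 0, (s*x)^2 for x < 0."""
+
+    @staticmethod
+    def forward(ctx, x: torch.Tensor, s: float = 0.5) -> torch.Tensor:
+        ctx.save_for_backward(x)
+        ctx.s = s
+        return torch.where(x >= 0, x * x, (s * x) ** 2)
+
+    @staticmethod
+    def backward(ctx, grad_out: torch.Tensor):
+        (x,) = ctx.saved_tensors
+        s = ctx.s
+        # correct negative-side derivative: d/dx (s*x)^2 = 2*s^2*x
+        grad = torch.where(x >= 0, 2 * x, 2 * s * s * x)
+        # the 8f10d16 version used 2*s*x here, over-scaling negative-side
+        # gradients by 1/s (2x too large at s = 0.5)
+        return grad_out * grad, None
+```
+
+Checking this with `torch.autograd.gradcheck` against the eager
+composition `F.leaky_relu(x, s) ** 2` catches exactly this class of
+mistake.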
diff --git a/research/specs/_shelved/039bd-penalized-tanh-plus-040c.md b/research/specs/_shelved/039bd-penalized-tanh-plus-040c.md new file mode 100644 index 0000000000..7e62e97834 --- /dev/null +++ b/research/specs/_shelved/039bd-penalized-tanh-plus-040c.md @@ -0,0 +1,258 @@ +# Spec 039bD — penalized-tanh plus 040C composite + +**Slug:** `penalized-tanh-plus-040c` +**Created:** 2026-04-25 +**Status:** READY +**Branch:** `exp/039bd-penalized-tanh-plus-040c` +**Commit:** `c78534a` +**Links to:** `research/ideas/039bd-penalized-tanh-plus-040c.md`, `research/specs/039b-loop-band-activation-screen.md`, `research/specs/040-loop-layer-mlp-reallocation-screen.md` + +## Hypothesis + +The loop band `3,4,5` wants both: + +- a richer MLP allocation (`040C`) +- a more stable recurrent-band activation (`039bA`) + +The composite should outperform `039bA` alone if width reallocation is truly +additive on top of the activation win. + +## Baselines + +Primary comparison arms: + +1. baseline +2. `039bA` +3. composite `039bD` + +Pinned upstream references: + +- `039b` runnable branch: `exp/039b-loop-band-activation-screen` +- `040` runnable branch: `exp/040-loop-layer-mlp-reallocation-screen` + +## Config diff + +Keep the whole `038/039` family fixed and apply: + +- width split from `040C` + - `MLP_SCHEDULE_ENABLED=1` + - `MLP_EARLY_MULT=4.0` + - `MLP_MIDDLE_MULT=5.0` + - `MLP_LATE_MULT=3.4` + - `MLP_MIDDLE_LAYERS=3,4,5` +- activation split from `039bA` + - `MLP_OUTER_ACTIVATION=leaky_relu_square` + - `MLP_MIDDLE_ACTIVATION=penalized_tanh` + - `MLP_MIDDLE_NEGATIVE_SLOPE=0.5` +- training-only stop + - `TRAINING_ONLY_SCREEN=1` + +Pinned runnable code source: + +- branch: `exp/039bd-penalized-tanh-plus-040c` +- commit: `c78534a` +- script: + [train_gpt.py](/home/claude-user/ai-workspace/projects/parameter-golf/worktrees/039bd-composite-screen/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py) + +## Regime + +Training-only screen: + +- `4×H100` +- `SEED=42` +- `MAX_WALLCLOCK_SECONDS=600` +- `TTT_ENABLED=0` +- `TRAINING_ONLY_SCREEN=1` + +Compare: + +- stop-time val_bpb +- pre-quant post-EMA val_bpb +- steps reached +- train loss trajectory +- throughput + +Out of scope: + +- GPTQ / quantized eval +- TTT + +## Run protocol + +Run only these three arms: + +1. `baseline` +2. `039bA` +3. 
`039bD` + +Resolved base env block: + +```bash +DATA_DIR=/workspace/parameter-golf/data +DATASETS_DIR=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved +TOKENIZER_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model +TRAIN_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin +VAL_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin +VAL_BYTES_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin +VOCAB_SIZE=8192 +NUM_LAYERS=11 +XSA_LAST_N=11 +MODEL_DIM=512 +NUM_KV_HEADS=4 +NUM_HEADS=8 +MLP_MULT=4.0 +NEGATIVE_SLOPE=0.5 +MLP_MIDDLE_LAYERS=3,4,5 +TIE_EMBEDDINGS=1 +LOGIT_SOFTCAP=30 +ROPE_BASE=10000 +ROPE_DIMS=16 +ROPE_TRAIN_SEQ_LEN=2048 +ROPE_YARN=0 +LN_SCALE=1 +QK_GAIN_INIT=5.0 +NUM_LOOPS=2 +LOOP_START=3 +LOOP_END=5 +ENABLE_LOOPING_AT=0.35 +PARALLEL_START_LAYER=8 +PARALLEL_FINAL_LANE=mean +MIN_LR=0.1 +EMBED_LR=0.6 +TIED_EMBED_LR=0.03 +TIED_EMBED_INIT_STD=0.005 +MATRIX_LR=0.026 +SCALAR_LR=0.02 +MUON_MOMENTUM=0.97 +MUON_BACKEND_STEPS=5 +MUON_MOMENTUM_WARMUP_START=0.92 +MUON_MOMENTUM_WARMUP_STEPS=1500 +MUON_ROW_NORMALIZE=1 +BETA1=0.9 +BETA2=0.95 +ADAM_EPS=1e-8 +GRAD_CLIP_NORM=0.3 +ADAM_WD=0.02 +MUON_WD=0.095 +EMBED_WD=0.085 +EMA_DECAY=0.9965 +TRAIN_BATCH_TOKENS=786432 +TRAIN_SEQ_LEN=2048 +TRAIN_LOG_EVERY=100 +ITERATIONS=20000 +WARMDOWN_FRAC=0.75 +WARMUP_STEPS=20 +VAL_BATCH_TOKENS=524288 +EVAL_SEQ_LEN=2048 +EVAL_STRIDE=64 +VAL_LOSS_EVERY=0 +CASEOPS_ENABLED=1 +COMPRESSOR=brotli +MATRIX_BITS=6 +MATRIX_CLIP_SIGMAS=12.85 +ATTN_CLIP_SIGMAS=13.0 +MLP_CLIP_SIGMAS=12.0 +EMBED_BITS=7 +EMBED_CLIP_SIGMAS=15.0 +GPTQ_CALIBRATION_BATCHES=16 +GPTQ_RESERVE_SECONDS=0.5 +SKIP_GATES_ENABLED=1 +SPARSE_ATTN_GATE_ENABLED=1 +SPARSE_ATTN_GATE_INIT_STD=0.0 +SPARSE_ATTN_GATE_SCALE=1.0 +GATED_ATTN_ENABLED=0 +GATED_ATTN_INIT_STD=0.005 +GATED_ATTN_QUANT_GATE=1 +ATTN_OUT_GATE_ENABLED=0 +ATTN_OUT_GATE_SRC=proj +GATE_WINDOW=12 +RECUR_ALPHA_ENABLED=1 +RECUR_DIAG_P2P_COS=0 +SMEAR_GATE_ENABLED=1 +LQER_ENABLED=1 +LQER_RANK=4 +LQER_TOP_K=3 +LQER_FACTOR_BITS=4 +LQER_ASYM_ENABLED=1 +LQER_ASYM_GROUP=64 +SPINQUANT_ENABLED=0 +SPINQUANT_SEED=42 +SPINQUANT_SITES=attn_in,attn_proj_in,mlp_in,mlp_proj_in +SEED=42 +MAX_WALLCLOCK_SECONDS=600 +TTT_ENABLED=0 +TRAINING_ONLY_SCREEN=1 +``` + +Canonical launch block: + +```bash +for arm in baseline 039bA 039bD; do + case "$arm" in + baseline) + MLP_SCHEDULE_ENABLED=0 + MLP_EARLY_MULT=4.0 + MLP_MIDDLE_MULT=4.0 + MLP_LATE_MULT=4.0 + MLP_OUTER_ACTIVATION=leaky_relu_square + MLP_MIDDLE_ACTIVATION=leaky_relu_square + MLP_MIDDLE_NEGATIVE_SLOPE=0.5 + ;; + 039bA) + MLP_SCHEDULE_ENABLED=0 + MLP_EARLY_MULT=4.0 + MLP_MIDDLE_MULT=4.0 + MLP_LATE_MULT=4.0 + MLP_OUTER_ACTIVATION=leaky_relu_square + MLP_MIDDLE_ACTIVATION=penalized_tanh + MLP_MIDDLE_NEGATIVE_SLOPE=0.5 + ;; + 039bD) + MLP_SCHEDULE_ENABLED=1 + MLP_EARLY_MULT=4.0 + MLP_MIDDLE_MULT=5.0 + MLP_LATE_MULT=3.4 + MLP_OUTER_ACTIVATION=leaky_relu_square + MLP_MIDDLE_ACTIVATION=penalized_tanh + MLP_MIDDLE_NEGATIVE_SLOPE=0.5 + ;; + esac + env \ + DATA_DIR=/workspace/parameter-golf/data \ + 
DATASETS_DIR=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \ + TOKENIZER_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \ + TRAIN_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin \ + VAL_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin \ + VAL_BYTES_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin \ + VOCAB_SIZE=8192 NUM_LAYERS=11 XSA_LAST_N=11 MODEL_DIM=512 NUM_KV_HEADS=4 NUM_HEADS=8 MLP_MULT=4.0 NEGATIVE_SLOPE=0.5 \ + MLP_MIDDLE_LAYERS=3,4,5 MLP_SCHEDULE_ENABLED="$MLP_SCHEDULE_ENABLED" MLP_EARLY_MULT="$MLP_EARLY_MULT" MLP_MIDDLE_MULT="$MLP_MIDDLE_MULT" MLP_LATE_MULT="$MLP_LATE_MULT" \ + MLP_OUTER_ACTIVATION="$MLP_OUTER_ACTIVATION" MLP_MIDDLE_ACTIVATION="$MLP_MIDDLE_ACTIVATION" MLP_MIDDLE_NEGATIVE_SLOPE="$MLP_MIDDLE_NEGATIVE_SLOPE" \ + TIE_EMBEDDINGS=1 LOGIT_SOFTCAP=30 ROPE_BASE=10000 ROPE_DIMS=16 ROPE_TRAIN_SEQ_LEN=2048 ROPE_YARN=0 LN_SCALE=1 QK_GAIN_INIT=5.0 \ + NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 PARALLEL_START_LAYER=8 PARALLEL_FINAL_LANE=mean \ + MIN_LR=0.1 EMBED_LR=0.6 TIED_EMBED_LR=0.03 TIED_EMBED_INIT_STD=0.005 MATRIX_LR=0.026 SCALAR_LR=0.02 \ + MUON_MOMENTUM=0.97 MUON_BACKEND_STEPS=5 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_ROW_NORMALIZE=1 \ + BETA1=0.9 BETA2=0.95 ADAM_EPS=1e-8 GRAD_CLIP_NORM=0.3 ADAM_WD=0.02 MUON_WD=0.095 EMBED_WD=0.085 EMA_DECAY=0.9965 \ + TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 TRAIN_LOG_EVERY=100 ITERATIONS=20000 WARMDOWN_FRAC=0.75 WARMUP_STEPS=20 \ + VAL_BATCH_TOKENS=524288 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 VAL_LOSS_EVERY=0 \ + CASEOPS_ENABLED=1 COMPRESSOR=brotli MATRIX_BITS=6 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=12.0 EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 GPTQ_CALIBRATION_BATCHES=16 GPTQ_RESERVE_SECONDS=0.5 \ + SKIP_GATES_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_INIT_STD=0.0 SPARSE_ATTN_GATE_SCALE=1.0 \ + GATED_ATTN_ENABLED=0 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 ATTN_OUT_GATE_ENABLED=0 ATTN_OUT_GATE_SRC=proj GATE_WINDOW=12 \ + RECUR_ALPHA_ENABLED=1 RECUR_DIAG_P2P_COS=0 SMEAR_GATE_ENABLED=1 \ + LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=3 LQER_FACTOR_BITS=4 LQER_ASYM_ENABLED=1 LQER_ASYM_GROUP=64 \ + SPINQUANT_ENABLED=0 SPINQUANT_SEED=42 SPINQUANT_SITES=attn_in,attn_proj_in,mlp_in,mlp_proj_in \ + SEED=42 MAX_WALLCLOCK_SECONDS=600 TTT_ENABLED=0 TRAINING_ONLY_SCREEN=1 \ + RUN_ID="039bd-${arm}" \ + torchrun --standalone --nproc_per_node=4 train_gpt.py +done +``` + +## Acceptance + +Interesting outcome: + +- `039bD` is clearly better than `039bA` + +Failure outcome: + +- `039bD` ties or loses to `039bA`, implying the width split is not adding on + top of the activation win diff --git a/research/specs/_shelved/039be-tanh-plus-040c.md b/research/specs/_shelved/039be-tanh-plus-040c.md new file mode 100644 index 0000000000..917e4b0135 --- /dev/null +++ b/research/specs/_shelved/039be-tanh-plus-040c.md @@ -0,0 +1,153 @@ +# Spec 039bE — tanh plus 040C composite + +**Slug:** `tanh-plus-040c` +**Created:** 2026-04-25 +**Status:** READY +**Branch:** 
`exp/039be-tanh-plus-040c` +**Commit:** `5ae3b28` +**Links to:** `research/ideas/039be-tanh-plus-040c.md`, `research/specs/039b-loop-band-activation-screen.md`, `research/specs/040-loop-layer-mlp-reallocation-screen.md` + +## Hypothesis + +The loop band `3,4,5` may want the stronger bounded recurrent nonlinearity +(`tanh`) plus the `040C` width split: + +- early `4.0` +- middle `5.0` +- late `3.4` + +If that pairing is genuinely compatible, the composite should outperform +`039bB` alone. + +## Baselines + +Primary comparison arms: + +1. baseline +2. `039bB` +3. composite `039bE` + +Pinned runnable code source: + +- branch: `exp/039be-tanh-plus-040c` +- commit: `5ae3b28` +- script: + [train_gpt.py](/home/claude-user/ai-workspace/projects/parameter-golf/worktrees/039be-composite-screen/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py) + +## Config diff + +Apply the same width split as `040C`: + +- `MLP_SCHEDULE_ENABLED=1` +- `MLP_EARLY_MULT=4.0` +- `MLP_MIDDLE_MULT=5.0` +- `MLP_LATE_MULT=3.4` +- `MLP_MIDDLE_LAYERS=3,4,5` + +Apply the activation split: + +- `MLP_OUTER_ACTIVATION=leaky_relu_square` +- `MLP_MIDDLE_ACTIVATION=tanh` +- `MLP_MIDDLE_NEGATIVE_SLOPE=0.5` + +Training-only stop: + +- `TRAINING_ONLY_SCREEN=1` + +## Regime + +Training-only screen: + +- `4×H100` +- `SEED=42` +- `MAX_WALLCLOCK_SECONDS=600` +- `TTT_ENABLED=0` +- `TRAINING_ONLY_SCREEN=1` + +Compare: + +- stop-time val_bpb +- pre-quant post-EMA val_bpb +- steps reached +- train loss trajectory +- throughput + +## Run protocol + +Run only these three arms: + +1. `baseline` +2. `039bB` +3. `039bE` + +Canonical launch block: + +```bash +for arm in baseline 039bB 039bE; do + case "$arm" in + baseline) + MLP_SCHEDULE_ENABLED=0 + MLP_EARLY_MULT=4.0 + MLP_MIDDLE_MULT=4.0 + MLP_LATE_MULT=4.0 + MLP_OUTER_ACTIVATION=leaky_relu_square + MLP_MIDDLE_ACTIVATION=leaky_relu_square + MLP_MIDDLE_NEGATIVE_SLOPE=0.5 + ;; + 039bB) + MLP_SCHEDULE_ENABLED=0 + MLP_EARLY_MULT=4.0 + MLP_MIDDLE_MULT=4.0 + MLP_LATE_MULT=4.0 + MLP_OUTER_ACTIVATION=leaky_relu_square + MLP_MIDDLE_ACTIVATION=tanh + MLP_MIDDLE_NEGATIVE_SLOPE=0.5 + ;; + 039bE) + MLP_SCHEDULE_ENABLED=1 + MLP_EARLY_MULT=4.0 + MLP_MIDDLE_MULT=5.0 + MLP_LATE_MULT=3.4 + MLP_OUTER_ACTIVATION=leaky_relu_square + MLP_MIDDLE_ACTIVATION=tanh + MLP_MIDDLE_NEGATIVE_SLOPE=0.5 + ;; + esac + env \ + DATA_DIR=/workspace/parameter-golf/data \ + DATASETS_DIR=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \ + TOKENIZER_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \ + TRAIN_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin \ + VAL_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin \ + VAL_BYTES_FILES=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin \ + VOCAB_SIZE=8192 NUM_LAYERS=11 XSA_LAST_N=11 MODEL_DIM=512 NUM_KV_HEADS=4 NUM_HEADS=8 MLP_MULT=4.0 NEGATIVE_SLOPE=0.5 \ + MLP_MIDDLE_LAYERS=3,4,5 MLP_SCHEDULE_ENABLED="$MLP_SCHEDULE_ENABLED" MLP_EARLY_MULT="$MLP_EARLY_MULT" MLP_MIDDLE_MULT="$MLP_MIDDLE_MULT" MLP_LATE_MULT="$MLP_LATE_MULT" \ + 
MLP_OUTER_ACTIVATION="$MLP_OUTER_ACTIVATION" MLP_MIDDLE_ACTIVATION="$MLP_MIDDLE_ACTIVATION" MLP_MIDDLE_NEGATIVE_SLOPE="$MLP_MIDDLE_NEGATIVE_SLOPE" \ + TIE_EMBEDDINGS=1 LOGIT_SOFTCAP=30 ROPE_BASE=10000 ROPE_DIMS=16 ROPE_TRAIN_SEQ_LEN=2048 ROPE_YARN=0 LN_SCALE=1 QK_GAIN_INIT=5.0 \ + NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 PARALLEL_START_LAYER=8 PARALLEL_FINAL_LANE=mean \ + MIN_LR=0.1 EMBED_LR=0.6 TIED_EMBED_LR=0.03 TIED_EMBED_INIT_STD=0.005 MATRIX_LR=0.026 SCALAR_LR=0.02 \ + MUON_MOMENTUM=0.97 MUON_BACKEND_STEPS=5 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_ROW_NORMALIZE=1 \ + BETA1=0.9 BETA2=0.95 ADAM_EPS=1e-8 GRAD_CLIP_NORM=0.3 ADAM_WD=0.02 MUON_WD=0.095 EMBED_WD=0.085 EMA_DECAY=0.9965 \ + TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 TRAIN_LOG_EVERY=100 ITERATIONS=20000 WARMDOWN_FRAC=0.75 WARMUP_STEPS=20 \ + VAL_BATCH_TOKENS=524288 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 VAL_LOSS_EVERY=0 \ + CASEOPS_ENABLED=1 COMPRESSOR=brotli MATRIX_BITS=6 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=12.0 EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 GPTQ_CALIBRATION_BATCHES=16 GPTQ_RESERVE_SECONDS=0.5 \ + SKIP_GATES_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_INIT_STD=0.0 SPARSE_ATTN_GATE_SCALE=1.0 \ + GATED_ATTN_ENABLED=0 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 ATTN_OUT_GATE_ENABLED=0 ATTN_OUT_GATE_SRC=proj GATE_WINDOW=12 \ + RECUR_ALPHA_ENABLED=1 RECUR_DIAG_P2P_COS=0 SMEAR_GATE_ENABLED=1 \ + LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=3 LQER_FACTOR_BITS=4 LQER_ASYM_ENABLED=1 LQER_ASYM_GROUP=64 \ + SPINQUANT_ENABLED=0 SPINQUANT_SEED=42 SPINQUANT_SITES=attn_in,attn_proj_in,mlp_in,mlp_proj_in \ + SEED=42 MAX_WALLCLOCK_SECONDS=600 TTT_ENABLED=0 TRAINING_ONLY_SCREEN=1 \ + RUN_ID="039be-${arm}" \ + torchrun --standalone --nproc_per_node=4 train_gpt.py +done +``` + +## Acceptance + +Interesting outcome: + +- `039bE` is clearly better than `039bB` + +Failure outcome: + +- `039bE` ties or loses to `039bB` diff --git a/research/writeups/learnable-mixing.md b/research/writeups/learnable-mixing.md new file mode 100644 index 0000000000..c04a6dc148 --- /dev/null +++ b/research/writeups/learnable-mixing.md @@ -0,0 +1,234 @@ +# Notes on the recurrence band in compressed transformers + +A small set of architectural studies on the loop band (layers 3–5) of the +#1736 / 060A baseline. Each section is independent. + +--- + +## Section 1 — Learning mixing parameters in depth-recurrent loops + +A depth-recurrent loop runs the canonical Markov iteration through the loop +band (layers 3–5): + +``` +x_{k+1} = f(x_k) +``` + +Each pass uses only the previous pass's output. We replace this with a +learned mixing rule, train it end-to-end, and observe that the learned +mixing coefficients converge to a stable, nearly seed-invariant pattern +within a few hundred steps after looping activates. Once stabilized, the +coefficients can be read off the trained model and used as fixed constants +in a fresh training run. + +## Recurrent α-β + +We add learnable scalars to control how each pass commits to the residual +and to allow detached cross-layer carries within the same pass: + +``` +x_{k+1} = β_k · f(x_k) + Σ_j α_{k,j} · stop_grad(x_k^{(j)}) +``` + +with `β_k` initialized to 1 and `α_{k,j}` initialized to 0, so the loop +starts from the canonical Markov rule. Across the loop band (layers 3–5, +NL=2) this is a small number of scalars; they are routed to the scalar +optimizer and trained jointly with the rest of the model. 
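+
+A minimal sketch of the rule (illustrative PyTorch; one plausible reading
+of the carry bookkeeping, with parameter names ours rather than the repo
+code's):
+
+```python
+import torch
+import torch.nn as nn
+
+class AlphaBetaBand(nn.Module):
+    # loop band f_3, f_4, f_5 with learnable beta (init 1) and alpha
+    # (init 0), so the first forward is exactly the canonical Markov rule
+    def __init__(self, band_layers: nn.ModuleList):
+        super().__init__()
+        n = len(band_layers)  # 3 for the {3,4,5} band
+        self.band_layers = band_layers
+        self.beta = nn.Parameter(torch.ones(n))     # scalar-optimizer group
+        self.alpha = nn.Parameter(torch.zeros(n, n))
+
+    def forward(self, x: torch.Tensor, num_loops: int = 2) -> torch.Tensor:
+        carries = [torch.zeros_like(x) for _ in self.band_layers]
+        for _ in range(1 + num_loops):  # NL=2 -> three visits per layer
+            for i, f in enumerate(self.band_layers):
+                out = self.beta[i] * f(x)
+                for j, h in enumerate(carries):
+                    # detached cross-layer carries from this pass
+                    out = out + self.alpha[i, j] * h.detach()
+                carries[i] = out
+                x = out
+        return x
+```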
+ +During a full training run on the #1736 base, the scalars drift off their +initialization once looping activates at `frac=0.35`, then plateau. The +final values are reproducible across seeds — for example, layer 4 converges +to a self-subtract pattern at `α ≈ −0.348` (a learned gate), and layer 5 +stabilizes into a positive aggregation of the signals from layers 3 and 4. + +## Freezing the learned values + +We then read the converged values off the trained model and use them as +fixed constants in a new training run from scratch. The optimizer state +and per-step gradient on these scalars are dropped; only the values +survive. Because the loop now starts at the converged mixing pattern +rather than at the canonical Markov rule, the run is no longer +identity-at-init, but training-end quality matches. + +This is shipped as PR #1779 on top of #1736: + +| Submission | Mixing rule in loop band | val_bpb (3-seed mean) | Δ vs #1736 | +|---|---|---:|---:| +| #1736 (base) | canonical Markov | 1.06549 | — | +| #1779 (frozen α-β) | fixed α-β with cross-layer carry | **1.06421** | **−0.00128** | + +3-seed std on #1779 is 0.00023, so the gain is well outside seed noise. +Artifact size is unchanged (the frozen scalars are baked into the model +weights serialized into the 16 MB budget). + +The converged values used as fixed constants in #1779 are: + +``` +β = [1.5973, 1.8828, 1.9922] # layers 3, 4, 5 + + L3 L4 L5 +α = [[ 0.2520, −0.0210, −0.0124], # L3 contributions + [ 0.0669, −0.3477, 0.0031], # L4 contributions + [ 0.1387, 0.2412, 0.0272]] # L5 contributions +``` + +Two patterns stand out. Every β is well above 1, so each pass amplifies +its own block output rather than damping it — the optimizer chose to +overshoot the canonical Markov rule. And the diagonal of α is mixed: L3 +adds back ~25% of itself, L4 subtracts ~35% of itself (the learned-gate +self-subtract behavior), L5 leaves itself roughly alone but absorbs ~24% +of L4. The off-diagonal entries in row L5 also confirm L5 acts as an +aggregator over L3 and L4. + +## Anderson acceleration with frozen coefficients + +The same idea applies to a different mixing rule. Anderson acceleration +replaces the Markov iteration with a length-`m` mix of past iterates, +solved per batch via a small least-squares problem: + +``` +g_i = f(x_i) − x_i # residuals +α* = argmin_α ‖Σ_{i=k−m+1..k} α_i · g_i‖², Σ α_i = 1 +x_{k+1} = Σ α*_i · f(x_i) +``` + +Trained end-to-end (length-3 Anderson, per-batch LS), the coefficients +land in the noise band of canonical recurrence but pay a ~25% throughput +penalty for the per-batch solve. Inspecting the trained model, the +per-batch α distribution concentrates tightly around + +``` +α ≈ [+0.55, −0.67, +1.12] +``` + +Following the same procedure as for α-β, we drop the LS solve and +hardcode these coefficients as constants. The result is a +fixed-coefficient extrapolation across the last three iterates with no +runtime overhead beyond the canonical loop. + +| Variant | Mixing rule | Throughput vs canonical | val_bpb (single seed) | +|---|---|---:|---:| +| Canonical | Markov | 1.00× | 1.06108 | +| Anderson, learned per-batch α | length-3 LS | 0.75× | 1.06083 | +| Anderson, frozen α | fixed `[+0.55, −0.67, +1.12]` | 1.00× | 1.05968 | + +The frozen-Anderson result is single-seed; multi-seed confirmation has +not been run. + +--- + +## Section 2 — MLP sizing across the three stages + +The loop band runs each of layers 3, 4, 5 three times per forward pass +(NL=2). 
Each pass reads the same FFN weights, so the parameters in the +loop band see roughly 3× the use per token of the FFN parameters in the +non-looped layers. A natural question is whether the loop band deserves +more FFN capacity than the rest of the model at fixed total parameters — +i.e., whether reallocating width from the non-looped layers into the +loop band is a free win. + +We split the 11 physical layers into three positional stages and +parameterize the FFN width as a per-stage multiplier of `model_dim`: + +``` +stage layers width multiplier +early 0–2 MLP_EARLY_MULT +middle 3–5 MLP_MIDDLE_MULT # the loop band +late 6–10 MLP_LATE_MULT +``` + +The baseline uses `4.0` everywhere, for a total of `11 × 4.0 = 44.0` +width-units. We tried three reallocation schemes that hold the total +fixed at 44.0 width-units while widening the middle stage to 5.0: + +| arm | early | middle | late | direction | +|---|---:|---:|---:|---| +| baseline | 4.0 | 4.0 | 4.0 | uniform | +| 040A | 3.625 | 5.0 | 3.625 | shrink both sides evenly | +| 040B | 3.0 | 5.0 | 4.0 | shrink early, keep late | +| 040C | 4.0 | 5.0 | 3.4 | keep early, shrink late | + +Single-seed training-only screen on the 038/039 fullfloat research line, +2×H100, 600s wallclock cap, no quantization or TTT. The absolute val_bpb +values are pre-quant post-EMA from this short screen, *not* directly +comparable to the post-quant post-TTT numbers in Section 1 — this is a +relative comparison of training quality between MLP schedules, not an +endpoint number. Pre-quant post-EMA val_bpb on the validation set: + +| arm | val_bpb (pre-quant post-EMA) | Δ vs uniform | +|---|---:|---:| +| baseline (uniform 4.0) | 1.16501 | — | +| 040A (3.625 / 5.0 / 3.625) | 1.16742 | +0.00241 | +| 040B (3.0 / 5.0 / 4.0) | 1.16744 | +0.00244 | +| 040C (4.0 / 5.0 / 3.4) | **1.16484** | **−0.00017** | + +Three observations: + +- **The middle-widen direction is real but small.** 040C is the only + reallocation that doesn't regress, and the gain is comfortably inside + single-seed noise (Δ ≈ −0.0002 on a screen with no seed average). + Treat it as "tied with baseline," not a win. +- **Shrinking the early stage is more expensive than shrinking the + late stage.** 040B (early shrunk to 3.0, late kept at 4.0) loses + +0.00244; 040C (early kept at 4.0, late shrunk to 3.4) gains + −0.00017. A symmetric shrink (040A) lands close to 040B. The early + layers (0–2) are doing work that doesn't compress; the late layers + (6–10) tolerate it. +- **The middle-stage gain is bounded above by what the late-shrink + costs.** Whatever extra capacity the middle stage absorbs from going + 4.0 → 5.0, the late stage gives back roughly the same amount when it + goes 4.0 → 3.4. The two effects nearly cancel. The implication is that + the loop band is *not* obviously starved for FFN capacity at the + uniform baseline. + +--- + +## Section 3 — Sizing the loop band + +The canonical 060A loop band is the contiguous set {3, 4, 5} run at +NL=2, so each of layers 3, 4, 5 is visited three times per forward +pass. The full forward does 17 layer-applications, with 9 of them +inside the loop band. Two knobs control the total compute spent inside +the band: which layers form the band (band-set), and how many times +each is visited (NL). We screened both directions on 060A. 
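+
+How band-set and NL expand into the per-forward application sequence
+(hypothetical helper, not repo code; the training log prints the same
+schedule as its `encoder`/`decoder` lists):
+
+```python
+def layer_schedule(num_layers: int, band: list[int], nl: int) -> list[int]:
+    # NL counts extra visits, so NL=2 runs each band layer three times
+    schedule: list[int] = []
+    for layer in range(num_layers):
+        schedule.append(layer)
+        if layer == band[-1]:
+            schedule.extend(band * nl)  # loop the band NL more times
+    return schedule
+
+canonical = layer_schedule(11, [3, 4, 5], nl=2)
+assert len(canonical) == 17                          # 17 applications
+assert sum(l in {3, 4, 5} for l in canonical) == 9   # 9 in the band
+```
+
+The screened configurations: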
+
+| spec | band-set | NL | loop-band passes | description |
+|---|---|---:|---:|---|
+| 060A canonical | {3,4,5} | 2 | 9 | reference |
+| 041B | {3,4,5} | 1 | 6 | one fewer visit per layer; halves the extra loop compute |
+| 041D | {5} | 2 | 3 | single-layer band, only layer 5 |
+| 041H | {4,5} | 2 | 6 | drop the front of the band |
+| 070 | {3,4} | 2 | 6 | drop the back of the band |
+| 041L | {3,4,5} | 3 | 12 | more visits per layer |
+| 041N | {3,4,5} | 4 | 15 | more still |
+
+Same screen protocol throughout: single seed 42, 4×H100, 1200s
+wallclock, no TTT. Pre-quant post-EMA val_bpb:
+
+| spec | structure | pre-quant post-EMA | Δ vs canonical |
+|---|---|---:|---:|
+| 060A canonical | {3,4,5} NL=2 | **1.06358** | — |
+| 041B | {3,4,5} NL=1 | 1.06842 | +0.00484 |
+| 041D | {5} NL=2 | 1.06993 | +0.00635 |
+| 041H | {4,5} NL=2 | 1.06693 | +0.00335 |
+| 070 | {3,4} NL=2 | 1.06595 | +0.00237 |
+| 041L | {3,4,5} NL=3 | 1.06615 | +0.00257 |
+| 041N | {3,4,5} NL=4 | 1.06888 | +0.00530 |
+
+Two observations:
+
+- **Canonical is locally optimal in both directions.** Both shrinking
+  (NL=1, single-layer band, drop a layer) and growing (NL=3, NL=4) lose
+  to the canonical {3,4,5} NL=2 — the loss is monotonic in how far the
+  configuration sits from canonical. NL=3 (+0.00257) is the closest
+  miss; NL=4 (+0.00530) loses about as much as removing a loop visit
+  (041B, +0.00484).
+- **Band shape is roughly position-symmetric.** Dropping layer 3 (041H,
+  +0.00335) and dropping layer 5 (070, +0.00237) cost similar amounts.
+  Reducing to a single layer (041D, +0.00635) is worse than either, but
+  in the same direction. There's no specific layer in {3,4,5} that's
+  uniquely load-bearing; the band-as-a-whole is what matters.
+
+The 041L NL=3 result is interesting in isolation — the gap to
+canonical (+0.00257) is small enough that with multi-seed averaging
+it may close. We did not promote it past the screen.
diff --git a/research/writeups/technique-evolution.md b/research/writeups/technique-evolution.md
new file mode 100644
index 0000000000..3e530b4989
--- /dev/null
+++ b/research/writeups/technique-evolution.md
@@ -0,0 +1,154 @@
+# Parameter Golf — Official Leaderboard & Technique Evolution
+
+*Fetched from openai/parameter-golf main README, 2026-05-05.
Ordered best→worst BPB.* + +--- + +## Official Leaderboard (10-min / 16 MB track) + +| # | PR | BPB | Author | Date | Key Techniques | +|---|-----|-----|--------|------|----------------| +| 1 | #2135 | **1.0565** | codemath3000 | 2026-05-01 | Calib32 token-only n-gram + AsymLogit stack | +| 2 | #2014 | **1.0576** | simonbissonnette | 2026-04-30 | Progressive 1k→2k→3k context, short-doc TTT chunks | +| 3 | #1953 | **1.0586** | andrewbaggio1 | 2026-04-30 | EVAL_SEQ_LEN=2560, no-Q/V TTT mask, TTT LR 0.75, QK_GAIN=5.25 | +| 4 | #1945 | **1.0594** | alertcat | 2026-04-29 | AWQ-lite GPTQ + asymmetric logit rescaling | +| 5 | #1855 | **1.0611** | codemath3000 | 2026-04-27 | LQER + SparseAttnGate + per-group lrzip + 9 greedy hparam overrides | +| 6 | #1851/#1868 | **1.0614** | aquariouseworkman | 2026-04-27 | BOS-fixed SmearGate + LQER asymmetric + SparseAttnGate + phased TTT | +| 7 | #1787 | **1.0634** | nprime06 | 2026-04-23 | Polar Express NS, MIN_LR=0.1, SparseAttnGate, fused CE | +| 8 | #1769 | **1.0645** | dexhunter | 2026-04-22 | MLPClip σ=12, SmearGate + LoRA-TTT refinements | +| 9 | #1736 | **1.0655** | dexhunter | 2026-04-19 | CaseOps + GatedAttn + QuantGate + Loop45 + Phased TTT | +| 10 | #1729 | **1.0678** | romeerp | 2026-04-18 | CaseOps tokenizer + tapered WD + phased TTT | +| 11 | #1667 | **1.0714** | MarioPaerle | 2026-04-16 | SmearGate + attention output gate + score-first TTT | +| 12 | #1626 | **1.0719** | dexhunter | 2026-04-14 | VarLen attn, fused MLP, multi-phase global SGD TTT, int7 embeddings | +| 13 | #1610 | **1.0728** | romeerp | 2026-04-13 | Phased TTT (first appearance) | +| 14 | #1530 | **1.0734** | samacqua | 2026-04-11 | VarLen FA3 attn, fused Triton MLP, doc-independent LoRA TTT | +| 15 | #1529 | **1.0758** | msisovic | 2026-04-11 | Parallel residuals PARALLEL_START=8, CUTLASS EVT/Triton kernels | +| 16 | #1514 | **1.0798** | dexhunter | 2026-04-09 | SP8192 + Muon 0.97 + legal score-first TTT | +| 17 | #1493 | **1.0810** | bigbag | 2026-04-09 | 3-layer recurrence + parallel residuals + QK-Gain 5.25 + TTT | +| 18 | #1477 | **1.0822** | aryanbhosale | 2026-04-08 | Parallel residuals + score-first TTT | +| 19 | #1413 | **1.0828** | dexhunter | 2026-04-06 | QK-Gain 5.0 + legal score-first TTT on SP8192 | +| 20 | #1412 | **1.0835** | Robby Sneiderman | 2026-04-06 | Parallel residuals + Hessian-aware SDClip | +| 21 | #1394 | **1.0856** | Kevin Clark | 2026-04-05 | SP8192 + GPTQ embeddings + Loop4-5 + MuonEq-R + SDClip | +| 22 | #1334 | **1.0897** | aryanbhosale | 2026-04-04 | SP4096 + depth recurrence + parallel residuals + MuonEq-R | +| 23 | #1285 | **1.0912** | dexhunter | 2026-04-03 | MuonEq-R + Loop4-5 + WD=0.090 + all-int6 GPTQ | +| 24 | #1218 | **1.0979** | Kevin Clark | 2026-04-01 | SP4096 + 4× MLP + high WD (stripped TTT, hash embeddings, SmearGate) | +| 25 | #1204 | **1.1063** | msisovic | 2026-03-31 | Mini recurrence layers 4-5 + two-lane parallel residuals (first appearance) | +| 26 | #1120 | **1.1099** | newjordan | 2026-03-30 | XSA-all + Parallel Muon + coprime loader + Bigram2048/RoPE16 | +| 27 | #1060 | **1.1122** | dexhunter | 2026-03-29 | Coprime loader + full Hessian GPTQ + XSA all 11 layers | +| 28 | #1019 | **1.1147** | abaybektursun | 2026-03-25 | AR self-gen GPTQ calibration + all-layer XSA | +| 29 | #549 | **1.1194** | abaybektursun | 2026-03-23 | LeakyReLU² + legal score-first TTT (first legal TTT) + Parallel Muon | +| 30 | #374/#414 | **1.1228** | signalrush | 2026-03-22 | GPTQ-lite clip search + EMA + QAT@0.15 | +| 31 | #315 | **1.1248** | 
jfprincz | 2026-03-21 | Partial RoPE (16/64 dims) + LN scale + EMA + XSA on 4 layers | +| 32 | #287 | **1.1271** | jfprincz | 2026-03-20 | XSA on last 4 layers + EMA replacing SWA | +| 33 | #265 | **1.1307** | unnir | 2026-03-20 | Partial XSA on deepest 3 layers (first XSA) | +| 34 | #180 | **1.1428** | thwu1 | 2026-03-20 | Mixed int5/int6, BigramHash(10240), SWA(0.4) | +| 35 | #162 | **1.1458** | raahilshah | 2026-03-20 | 3× MLP + SmearGate + BigramHash + OrthoInit (first SmearGate) | +| 36 | #86 | **1.1502** | aruniyer | 2026-03-20 | 11 layers + 3× MLP + int6 QAT (first 11L) | +| 37 | #65 | **1.1556** | aquariouseworkman | 2026-03-19 | SmearGate + BigramHash + 3× MLP + int6 STE QAT | +| 38 | #640 | **1.1570** | CiprianFlorin-Ifrim | 2026-03-24 | 73.7M ternary quant + U-Net + SP8192 + YaRN | +| 39 | #63 | **1.1586** | yahya010 | 2026-03-19 | 10L + int6 QAT + zstd-22 | +| 40 | #60 | **1.1748** | notapplica | 2026-03-19 | Sliding window + FP16 embed + 10L + Muon WD | +| 41 | #50 | **1.1925** | mattqlf | 2026-03-19 | Sliding window eval stride=64 (first sliding window) | +| 42 | #77 | **1.1928** | samacqua | 2026-03-19 | First TTT (LoRA, non-legal) | +| 43 | #52 | **1.2014** | Spokane Way | 2026-03-19 | 4k seq length | +| 44 | #49 | **1.2060** | Spokane Way | 2026-03-18 | 2048 seq length | +| 45 | #39 | **1.2147** | Nan Liu | 2026-03-18 | Mixed int8/int6 quantization | +| 46 | #42 | **1.2197** | Renier Velazco | 2026-03-18 | FP16 tied embedding | +| 47 | Baseline | **1.2244** | OpenAI | 2026-03-17 | 9L/512d/SP1024, tied embeddings, 4 KV heads | + +--- + +## Technique Evolution by Phase + +### Phase 1 — First 48 Hours (March 18–20) +Rapid parallel exploration of obvious levers. Multiple competitors independently found: +- Sequence length: 1024 → 2048 → 4096 (Δ ≈ −0.02 BPB). Persisted to final SOTA (eventually 3072 eval). +- Sliding window eval (Δ ≈ −0.015). Permanent fixture. +- 10→11 layers. 11L became standard from #86 onward. +- 3× MLP width. Standard until 4× in #1218. +- SmearGate (#162, first appearance). Persisted all the way to final SOTA (BOS-fixed in #1851). +- int6 quantization (#39). Evolved into GPTQ, persists to final SOTA. +- FP16 embeddings. Eventually replaced by int7 GPTQ embeddings (#1626). + +### Phase 2 — XSA and GPTQ Maturation (March 20–25) +- XSA introduced (#265), expanded to 4 layers (#287), eventually all 11 (#1060). Permanent. +- EMA replacing SWA (#287). EMA_DECAY=0.9965 frozen since introduction — never revisited. +- Partial RoPE, LN Scale (#315). Permanent. +- GPTQ-lite (#374). Evolved through full Hessian GPTQ (#1060) → GPTQ embeddings (#1394) → LQER (#1851). Permanent. +- Legal score-first TTT (#549). Paradigm shift. TTT became mandatory in all top submissions. +- LeakyReLU² activation (#549, #493). Still in final SOTA. + +### Phase 3 — Tokenizer Leap and System Work (March 29 – April 1) +- Full Hessian GPTQ (#1060). Permanent. +- SP4096 (#1218, Kevin Clark). First vocabulary leap. 4× MLP, higher WD. +- Parallel residuals, first depth recurrence (#1204). Both permanent. + +### Phase 4 — SP8192 Era (April 3–11) +- SP8192 (#1394, Kevin Clark). Second vocabulary leap. Largest single-step architectural gain. +- GPTQ embeddings, SDClip (#1394). Permanent. +- QK-Gain 5.0→5.25 (#1413, #1953). Permanent. +- VarLen attention + fused Triton MLP (#1530). Permanent. +- Phased TTT (#1610, #1626). Became standard; PHASED_TTT_NUM_PHASES=3. + +### Phase 5 — CaseOps and Fine Stacking (April 13–May 1) +- CaseOps tokenizer (#1729). Paradigm shift. All final SOTA uses CaseOps. 
+- SmearGate BOS fix (#1851). Cleaned up a longstanding bug. +- LQER asymmetric rank-4 (#1851). Permanent. +- SparseAttnGate, Polar Express NS, MIN_LR=0.1 (#1787). Permanent. +- AWQ-lite, AsymLogit rescaling (#1945). In final SOTA. +- Progressive context 1k→2k→3k (#2014). In final SOTA. +- No-Q/V TTT mask, EVAL_SEQ_LEN=2560 (#1953). In final SOTA. +- GPTQ_CALIBRATION_BATCHES=32 (#2135). Final accepted SOTA (one hyperparameter change). + +--- + +## What Persisted vs What Was Dropped + +### Persisted to final SOTA (#2135) +- 11 layers (#86) +- Sliding window / VarLen eval (#50 → #1530) +- GPTQ int6 full Hessian (#374 → #1060) +- EMA decay=0.9965 (#287) — frozen since introduction +- Partial RoPE (#315) +- SmearGate BOS-fixed (#162 → #1851) +- XSA all layers (#265 → #1060) +- Depth recurrence Loop3-5 (#1204/#1285) +- Parallel residuals from layer 8 (#1204 → #1529) +- SP8192 + CaseOps (#1394 + #1729) +- LQER asymmetric (#1851) +- SparseAttnGate (#1787) +- Phased score-first TTT (#1610) +- LeakyReLU² MLP activation (#549/#493) +- MIN_LR=0.1 warmdown floor (#1787) + +### Introduced but dropped +- SWA → replaced by EMA (#287) +- zstd compression → replaced by per-group lrzip+brotli +- OrthoInit — mentioned early, absent from April stack +- Value residuals — Kevin Clark explicitly removed in #1218, got a gain +- BigramHash — popular March, absent April; superseded by larger vocab +- AR self-gen GPTQ calibration → replaced by training-shard Hessian + calib tuning +- MuonEq-R → replaced by Muon 0.97 (#1514) +- Illegal LoRA TTT (#77) → replaced by legal score-first TTT + +--- + +## Three Paradigm Shifts + +1. **Legal TTT** (~March 23, #549): Opened eval-time compute as a new optimization axis. +2. **SP4096→SP8192** (~April 1–5): Vocabulary as primary architecture decision. Freed artifact bytes from embedding tables; biggest single architectural jump. +3. **CaseOps** (~April 18–19, #1729): Lossless tokenizer transform. All top submissions converged on SP8192+CaseOps. + +--- + +## Approximate Single-Step BPB Gains (largest to smallest) +1. SP8192 vocabulary jump: ~−0.028 BPB +2. Legal TTT (cumulative over TTT steps): ~−0.025 BPB +3. XSA all layers (cumulative): ~−0.025 BPB +4. SP4096 vocabulary jump: ~−0.013 BPB +5. Sliding window eval: ~−0.015 BPB +6. Depth recurrence: ~−0.010–0.020 BPB +7. Parallel residuals: ~−0.005–0.010 BPB +8. CaseOps: ~−0.002 BPB direct (enables further composition) +9. LQER asymmetric: ~−0.003–0.005 BPB +10. GPTQ_CALIBRATION_BATCHES=32: ~−0.001 BPB (final SOTA step) diff --git a/runs/019b-recur-alpha-manual-constant-full/seed_42/final_model.int6.ptz b/runs/019b-recur-alpha-manual-constant-full/seed_42/final_model.int6.ptz new file mode 100644 index 0000000000..8eae527e74 Binary files /dev/null and b/runs/019b-recur-alpha-manual-constant-full/seed_42/final_model.int6.ptz differ diff --git a/runs/019b-recur-alpha-manual-constant-full/seed_42/train.log b/runs/019b-recur-alpha-manual-constant-full/seed_42/train.log new file mode 100644 index 0000000000..866ba75cba --- /dev/null +++ b/runs/019b-recur-alpha-manual-constant-full/seed_42/train.log @@ -0,0 +1,876 @@ +W0421 06:54:18.062000 153 torch/distributed/run.py:803] +W0421 06:54:18.062000 153 torch/distributed/run.py:803] ***************************************** +W0421 06:54:18.062000 153 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0421 06:54:18.062000 153 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + artifact_dir: /workspace/runs/019b-recur-alpha-manual-constant-full/seed_42 + attn_clip_sigmas: 13.0 + attn_out_gate_enabled: False + attn_out_gate_src: proj + beta1: 0.9 + beta2: 0.95 + caseops_enabled: True + compressor: brotli + data_dir: /workspace/data + datasets_dir: /workspace/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 15.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + gate_window: 12 + gated_attn_enabled: True + gated_attn_init_std: 0.005 + gated_attn_quant_gate: True + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 4.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: /workspace/runs/019b-recur-alpha-manual-constant-full/seed_42/bcadb491-14a4-4501-b847-cf7c5bcba279.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_clip_sigmas: 12.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: /workspace/runs/019b-recur-alpha-manual-constant-full/seed_42/final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2000 + qk_gain_init: 5.0 + quantized_model_path: /workspace/runs/019b-recur-alpha-manual-constant-full/seed_42/final_model.int6.ptz + rank: 0 + recur_alpha_enabled: True + recur_diag_p2p_cos: False + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: bcadb491-14a4-4501-b847-cf7c5bcba279 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + smear_gate_enabled: False + spinquant_enabled: False + spinquant_seed: 42 + spinquant_sites: attn_in,attn_proj_in,mlp_in,mlp_proj_in + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + train_batch_tokens: 786432 + train_files: /workspace/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin + train_log_every: 100 + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.999 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 96 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_bytes_files: /workspace/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin + val_doc_fraction: 1.0 + val_files: 
/workspace/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.75 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 80 +val_tokens: 47851520 +model_params:35989658 +recur_alpha: enabled=True num_loops=2 loop_start=3 loop_end=5 diag_p2p_cos=False +gptq:reserving 4s, effective=596000ms +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0177 val_bpb: 4.1205 +1/20000 train_loss: 9.0180 train_time: 0.0m tok/s: 12418469 +2/20000 train_loss: 12.7420 train_time: 0.0m tok/s: 10893250 +3/20000 train_loss: 10.1331 train_time: 0.0m tok/s: 9613055 +4/20000 train_loss: 8.5979 train_time: 0.0m tok/s: 9137923 +5/20000 train_loss: 7.8518 train_time: 0.0m tok/s: 8865041 +100/20000 train_loss: 3.6416 train_time: 0.2m tok/s: 8378304 +200/20000 train_loss: 3.1581 train_time: 0.3m tok/s: 8218646 +300/20000 train_loss: 2.9235 train_time: 0.5m tok/s: 8152096 +400/20000 train_loss: 2.5924 train_time: 0.6m tok/s: 8122170 +500/20000 train_loss: 2.5826 train_time: 0.8m tok/s: 8144960 +600/20000 train_loss: 2.6845 train_time: 1.0m tok/s: 8115451 +700/20000 train_loss: 2.8824 train_time: 1.1m tok/s: 8101165 +800/20000 train_loss: 2.7193 train_time: 1.3m tok/s: 8090290 +900/20000 train_loss: 2.7593 train_time: 1.5m tok/s: 8084221 +1000/20000 train_loss: 2.8153 train_time: 1.6m tok/s: 8105514 +1100/20000 train_loss: 2.7717 train_time: 1.8m tok/s: 8095714 +1200/20000 train_loss: 2.7726 train_time: 1.9m tok/s: 8087367 +1300/20000 train_loss: 2.8347 train_time: 2.1m tok/s: 8082236 +1400/20000 train_loss: 2.5942 train_time: 2.3m tok/s: 8076534 +1500/20000 train_loss: 2.6382 train_time: 2.4m tok/s: 8090502 +1600/20000 train_loss: 2.7138 train_time: 2.6m tok/s: 8084213 +1700/20000 train_loss: 2.6844 train_time: 2.8m tok/s: 8079719 +1800/20000 train_loss: 2.6485 train_time: 2.9m tok/s: 8071716 +1900/20000 train_loss: 2.7487 train_time: 3.1m tok/s: 8082974 +2000/20000 train_loss: 2.6660 train_time: 3.2m tok/s: 8079622 +2100/20000 train_loss: 2.6914 train_time: 3.4m tok/s: 8076207 +layer_loop:enabled step:2143 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2200/20000 train_loss: 2.5341 train_time: 3.6m tok/s: 7976970 +2300/20000 train_loss: 2.6032 train_time: 3.9m tok/s: 7822535 +2400/20000 train_loss: 2.6277 train_time: 4.1m tok/s: 7695149 +2500/20000 train_loss: 2.5551 train_time: 4.3m tok/s: 7570934 +2600/20000 train_loss: 2.5213 train_time: 4.6m tok/s: 7462485 +2700/20000 train_loss: 2.5025 train_time: 4.8m tok/s: 7363639 +2800/20000 train_loss: 2.5695 train_time: 5.0m tok/s: 7273825 +2900/20000 train_loss: 2.5428 train_time: 5.4m tok/s: 7046266 +3000/20000 train_loss: 2.5613 train_time: 5.7m tok/s: 6934968 +3100/20000 train_loss: 2.4946 train_time: 5.9m tok/s: 6836953 +3200/20000 train_loss: 2.4605 train_time: 6.2m tok/s: 6784807 +3300/20000 train_loss: 2.6565 train_time: 6.4m tok/s: 6741847 +3400/20000 train_loss: 2.5570 train_time: 6.7m tok/s: 6696658 +3500/20000 train_loss: 2.5617 
train_time: 6.9m tok/s: 6622187 +3600/20000 train_loss: 2.4578 train_time: 7.2m tok/s: 6551464 +3700/20000 train_loss: 2.5411 train_time: 7.4m tok/s: 6517476 +3800/20000 train_loss: 2.4885 train_time: 7.7m tok/s: 6489808 +3900/20000 train_loss: 2.6112 train_time: 7.9m tok/s: 6459430 +4000/20000 train_loss: 2.4023 train_time: 8.2m tok/s: 6431417 +4000/20000 val_loss: 2.4228 val_bpb: 1.1071 +4100/20000 train_loss: 2.4033 train_time: 8.4m tok/s: 6374681 +4200/20000 train_loss: 2.3587 train_time: 8.7m tok/s: 6322698 +4300/20000 train_loss: 2.4887 train_time: 8.9m tok/s: 6304041 +4400/20000 train_loss: 2.4365 train_time: 9.2m tok/s: 6282662 +4500/20000 train_loss: 2.2632 train_time: 9.4m tok/s: 6262589 +4600/20000 train_loss: 2.3614 train_time: 9.7m tok/s: 6243370 +4700/20000 train_loss: 2.3076 train_time: 9.9m tok/s: 6225615 +4716/20000 val_loss: 2.3414 val_bpb: 1.0698 +stopping_early: wallclock_cap train_time: 596109ms step: 4716/20000 +peak memory allocated: 40048 MiB reserved: 44160 MiB +ema:applying EMA weights +diagnostic pre-quantization post-ema val_loss:2.34062611 val_bpb:1.06950534 eval_time:10564ms +Serialized model: 135592891 bytes +Code size (uncompressed): 155004 bytes +Code size (compressed): 30560 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 3.8s +Quantized weights: + gate_int8_row: blocks.attn.attn_gate_w + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights +Serialized model quantized+brotli: 15951284 bytes +Total submission size quantized+brotli: 15981844 bytes +diagnostic quantized val_loss:2.36089525 val_bpb:1.07876695 eval_time:71062ms +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (146.2s) + +beginning TTT eval timer +ttt_phased: total_docs:50000 prefix_docs:2000 suffix_docs:48000 num_phases:3 boundaries:[666, 1333, 2000] +ttp: b776/782 bl:2.2657 bb:1.0742 rl:2.2657 rb:1.0742 dl:7534-8350 gd:0 +ttp: b772/782 bl:2.3366 bb:1.1013 rl:2.2961 rb:1.0858 dl:5762-6095 gd:0 +ttp: b767/782 bl:2.2810 bb:1.0793 rl:2.2922 rb:1.0842 dl:4681-4858 gd:0 +ttpp: phase:1/3 pd:1104 gd:666 t:214.4s +tttg: c1/111 lr:0.001000 t:2.4s +tttg: c2/111 lr:0.001000 t:2.5s +tttg: c3/111 lr:0.000999 t:2.6s +tttg: c4/111 lr:0.000998 t:2.7s +tttg: c5/111 lr:0.000997 t:2.8s +tttg: c6/111 lr:0.000995 t:2.9s +tttg: c7/111 lr:0.000993 t:3.0s +tttg: c8/111 lr:0.000990 t:3.1s +tttg: c9/111 lr:0.000987 t:3.2s +tttg: c10/111 lr:0.000984 t:3.3s +tttg: c11/111 lr:0.000980 t:3.4s +tttg: c12/111 lr:0.000976 t:3.5s +tttg: c13/111 lr:0.000971 t:3.6s +tttg: c14/111 lr:0.000966 t:3.7s +tttg: c15/111 lr:0.000961 t:3.8s +tttg: c16/111 lr:0.000955 t:3.9s +tttg: c17/111 lr:0.000949 t:4.0s +tttg: c18/111 lr:0.000942 t:4.1s +tttg: c19/111 lr:0.000935 t:4.2s +tttg: c20/111 lr:0.000928 t:4.3s +tttg: c21/111 lr:0.000921 t:4.4s +tttg: c22/111 lr:0.000913 t:4.5s +tttg: c23/111 lr:0.000905 t:4.6s +tttg: c24/111 lr:0.000896 t:4.7s +tttg: c25/111 lr:0.000887 t:4.8s +tttg: c26/111 lr:0.000878 t:4.9s +tttg: c27/111 lr:0.000868 t:5.0s +tttg: c28/111 lr:0.000859 t:5.1s +tttg: c29/111 lr:0.000848 t:5.2s +tttg: c30/111 lr:0.000838 t:5.3s +tttg: c31/111 lr:0.000827 t:5.3s +tttg: c32/111 lr:0.000817 t:5.4s +tttg: c33/111 lr:0.000805 t:5.5s +tttg: c34/111 
lr:0.000794 t:5.6s +tttg: c35/111 lr:0.000782 t:5.7s +tttg: c36/111 lr:0.000770 t:5.8s +tttg: c37/111 lr:0.000758 t:5.9s +tttg: c38/111 lr:0.000746 t:6.0s +tttg: c39/111 lr:0.000733 t:6.1s +tttg: c40/111 lr:0.000721 t:6.2s +tttg: c41/111 lr:0.000708 t:6.3s +tttg: c42/111 lr:0.000695 t:6.4s +tttg: c43/111 lr:0.000681 t:6.5s +tttg: c44/111 lr:0.000668 t:6.6s +tttg: c45/111 lr:0.000655 t:6.7s +tttg: c46/111 lr:0.000641 t:6.8s +tttg: c47/111 lr:0.000627 t:6.9s +tttg: c48/111 lr:0.000613 t:7.0s +tttg: c49/111 lr:0.000599 t:7.1s +tttg: c50/111 lr:0.000585 t:7.2s +tttg: c51/111 lr:0.000571 t:7.3s +tttg: c52/111 lr:0.000557 t:7.4s +tttg: c53/111 lr:0.000543 t:7.5s +tttg: c54/111 lr:0.000529 t:7.6s +tttg: c55/111 lr:0.000514 t:7.7s +tttg: c56/111 lr:0.000500 t:7.8s +tttg: c57/111 lr:0.000486 t:7.9s +tttg: c58/111 lr:0.000471 t:8.0s +tttg: c59/111 lr:0.000457 t:8.1s +tttg: c60/111 lr:0.000443 t:8.2s +tttg: c61/111 lr:0.000429 t:8.3s +tttg: c62/111 lr:0.000415 t:8.4s +tttg: c63/111 lr:0.000401 t:8.5s +tttg: c64/111 lr:0.000387 t:8.6s +tttg: c65/111 lr:0.000373 t:8.7s +tttg: c66/111 lr:0.000359 t:8.8s +tttg: c67/111 lr:0.000345 t:8.9s +tttg: c68/111 lr:0.000332 t:9.0s +tttg: c69/111 lr:0.000319 t:9.1s +tttg: c70/111 lr:0.000305 t:9.2s +tttg: c71/111 lr:0.000292 t:9.3s +tttg: c72/111 lr:0.000279 t:9.4s +tttg: c73/111 lr:0.000267 t:9.5s +tttg: c74/111 lr:0.000254 t:9.6s +tttg: c75/111 lr:0.000242 t:9.7s +tttg: c76/111 lr:0.000230 t:9.8s +tttg: c77/111 lr:0.000218 t:9.9s +tttg: c78/111 lr:0.000206 t:10.0s +tttg: c79/111 lr:0.000195 t:10.1s +tttg: c80/111 lr:0.000183 t:10.2s +tttg: c81/111 lr:0.000173 t:10.2s +tttg: c82/111 lr:0.000162 t:10.3s +tttg: c83/111 lr:0.000152 t:10.4s +tttg: c84/111 lr:0.000141 t:10.5s +tttg: c85/111 lr:0.000132 t:10.6s +tttg: c86/111 lr:0.000122 t:10.7s +tttg: c87/111 lr:0.000113 t:10.8s +tttg: c88/111 lr:0.000104 t:10.9s +tttg: c89/111 lr:0.000095 t:11.0s +tttg: c90/111 lr:0.000087 t:11.1s +tttg: c91/111 lr:0.000079 t:11.2s +tttg: c92/111 lr:0.000072 t:11.3s +tttg: c93/111 lr:0.000065 t:11.4s +tttg: c94/111 lr:0.000058 t:11.5s +tttg: c95/111 lr:0.000051 t:11.6s +tttg: c96/111 lr:0.000045 t:11.7s +tttg: c97/111 lr:0.000039 t:11.8s +tttg: c98/111 lr:0.000034 t:11.9s +tttg: c99/111 lr:0.000029 t:12.0s +tttg: c100/111 lr:0.000024 t:12.1s +tttg: c101/111 lr:0.000020 t:12.2s +tttg: c102/111 lr:0.000016 t:12.3s +tttg: c103/111 lr:0.000013 t:12.4s +tttg: c104/111 lr:0.000010 t:12.5s +tttg: c105/111 lr:0.000007 t:12.6s +tttg: c106/111 lr:0.000005 t:12.7s +tttg: c107/111 lr:0.000003 t:12.8s +tttg: c108/111 lr:0.000002 t:12.9s +tttg: c109/111 lr:0.000001 t:13.0s +tttg: c110/111 lr:0.000000 t:13.1s +ttpr: phase:1/3 t:230.2s +ttp: b757/782 bl:2.2896 bb:1.0658 rl:2.2918 rb:1.0811 dl:3550-3633 gd:0 +ttpp: phase:2/3 pd:1808 gd:1333 t:337.8s +tttg: c1/185 lr:0.001000 t:0.1s +tttg: c2/185 lr:0.001000 t:0.2s +tttg: c3/185 lr:0.001000 t:0.3s +tttg: c4/185 lr:0.000999 t:0.4s +tttg: c5/185 lr:0.000999 t:0.5s +tttg: c6/185 lr:0.000998 t:0.6s +tttg: c7/185 lr:0.000997 t:0.7s +tttg: c8/185 lr:0.000996 t:0.8s +tttg: c9/185 lr:0.000995 t:0.9s +tttg: c10/185 lr:0.000994 t:1.0s +tttg: c11/185 lr:0.000993 t:1.1s +tttg: c12/185 lr:0.000991 t:1.2s +tttg: c13/185 lr:0.000990 t:1.3s +tttg: c14/185 lr:0.000988 t:1.4s +tttg: c15/185 lr:0.000986 t:1.5s +tttg: c16/185 lr:0.000984 t:1.6s +tttg: c17/185 lr:0.000981 t:1.7s +tttg: c18/185 lr:0.000979 t:1.8s +tttg: c19/185 lr:0.000977 t:1.9s +tttg: c20/185 lr:0.000974 t:2.0s +tttg: c21/185 lr:0.000971 t:2.1s +tttg: c22/185 lr:0.000968 t:2.2s +tttg: c23/185 lr:0.000965 
t:2.3s +tttg: c24/185 lr:0.000962 t:2.4s +tttg: c25/185 lr:0.000959 t:2.5s +tttg: c26/185 lr:0.000955 t:2.6s +tttg: c27/185 lr:0.000952 t:2.7s +tttg: c28/185 lr:0.000948 t:2.8s +tttg: c29/185 lr:0.000944 t:2.9s +tttg: c30/185 lr:0.000940 t:3.0s +tttg: c31/185 lr:0.000936 t:3.0s +tttg: c32/185 lr:0.000932 t:3.1s +tttg: c33/185 lr:0.000927 t:3.2s +tttg: c34/185 lr:0.000923 t:3.3s +tttg: c35/185 lr:0.000918 t:3.4s +tttg: c36/185 lr:0.000913 t:3.5s +tttg: c37/185 lr:0.000908 t:3.6s +tttg: c38/185 lr:0.000904 t:3.7s +tttg: c39/185 lr:0.000898 t:3.8s +tttg: c40/185 lr:0.000893 t:3.9s +tttg: c41/185 lr:0.000888 t:4.0s +tttg: c42/185 lr:0.000882 t:4.1s +tttg: c43/185 lr:0.000877 t:4.2s +tttg: c44/185 lr:0.000871 t:4.3s +tttg: c45/185 lr:0.000865 t:4.4s +tttg: c46/185 lr:0.000860 t:4.5s +tttg: c47/185 lr:0.000854 t:4.6s +tttg: c48/185 lr:0.000847 t:4.7s +tttg: c49/185 lr:0.000841 t:4.8s +tttg: c50/185 lr:0.000835 t:4.9s +tttg: c51/185 lr:0.000829 t:5.0s +tttg: c52/185 lr:0.000822 t:5.1s +tttg: c53/185 lr:0.000816 t:5.2s +tttg: c54/185 lr:0.000809 t:5.3s +tttg: c55/185 lr:0.000802 t:5.4s +tttg: c56/185 lr:0.000795 t:5.5s +tttg: c57/185 lr:0.000788 t:5.6s +tttg: c58/185 lr:0.000781 t:5.7s +tttg: c59/185 lr:0.000774 t:5.8s +tttg: c60/185 lr:0.000767 t:5.9s +tttg: c61/185 lr:0.000760 t:6.0s +tttg: c62/185 lr:0.000752 t:6.1s +tttg: c63/185 lr:0.000745 t:6.2s +tttg: c64/185 lr:0.000738 t:6.3s +tttg: c65/185 lr:0.000730 t:6.4s +tttg: c66/185 lr:0.000722 t:6.5s +tttg: c67/185 lr:0.000715 t:6.6s +tttg: c68/185 lr:0.000707 t:6.7s +tttg: c69/185 lr:0.000699 t:6.8s +tttg: c70/185 lr:0.000691 t:6.9s +tttg: c71/185 lr:0.000683 t:7.0s +tttg: c72/185 lr:0.000675 t:7.1s +tttg: c73/185 lr:0.000667 t:7.2s +tttg: c74/185 lr:0.000659 t:7.3s +tttg: c75/185 lr:0.000651 t:7.4s +tttg: c76/185 lr:0.000643 t:7.5s +tttg: c77/185 lr:0.000635 t:7.6s +tttg: c78/185 lr:0.000627 t:7.7s +tttg: c79/185 lr:0.000618 t:7.8s +tttg: c80/185 lr:0.000610 t:7.9s +tttg: c81/185 lr:0.000602 t:7.9s +tttg: c82/185 lr:0.000593 t:8.0s +tttg: c83/185 lr:0.000585 t:8.1s +tttg: c84/185 lr:0.000577 t:8.2s +tttg: c85/185 lr:0.000568 t:8.3s +tttg: c86/185 lr:0.000560 t:8.4s +tttg: c87/185 lr:0.000551 t:8.5s +tttg: c88/185 lr:0.000543 t:8.6s +tttg: c89/185 lr:0.000534 t:8.7s +tttg: c90/185 lr:0.000526 t:8.8s +tttg: c91/185 lr:0.000517 t:8.9s +tttg: c92/185 lr:0.000509 t:9.0s +tttg: c93/185 lr:0.000500 t:9.1s +tttg: c94/185 lr:0.000491 t:9.2s +tttg: c95/185 lr:0.000483 t:9.3s +tttg: c96/185 lr:0.000474 t:9.4s +tttg: c97/185 lr:0.000466 t:9.5s +tttg: c98/185 lr:0.000457 t:9.6s +tttg: c99/185 lr:0.000449 t:9.7s +tttg: c100/185 lr:0.000440 t:9.8s +tttg: c101/185 lr:0.000432 t:9.9s +tttg: c102/185 lr:0.000423 t:10.0s +tttg: c103/185 lr:0.000415 t:10.1s +tttg: c104/185 lr:0.000407 t:10.2s +tttg: c105/185 lr:0.000398 t:10.3s +tttg: c106/185 lr:0.000390 t:10.4s +tttg: c107/185 lr:0.000382 t:10.5s +tttg: c108/185 lr:0.000373 t:10.6s +tttg: c109/185 lr:0.000365 t:10.7s +tttg: c110/185 lr:0.000357 t:10.8s +tttg: c111/185 lr:0.000349 t:10.9s +tttg: c112/185 lr:0.000341 t:11.0s +tttg: c113/185 lr:0.000333 t:11.1s +tttg: c114/185 lr:0.000325 t:11.2s +tttg: c115/185 lr:0.000317 t:11.3s +tttg: c116/185 lr:0.000309 t:11.4s +tttg: c117/185 lr:0.000301 t:11.5s +tttg: c118/185 lr:0.000293 t:11.6s +tttg: c119/185 lr:0.000285 t:15.7s +tttg: c120/185 lr:0.000278 t:15.7s +tttg: c121/185 lr:0.000270 t:15.8s +tttg: c122/185 lr:0.000262 t:15.9s +tttg: c123/185 lr:0.000255 t:16.0s +tttg: c124/185 lr:0.000248 t:16.1s +tttg: c125/185 lr:0.000240 t:16.2s +tttg: c126/185 lr:0.000233 
t:16.3s +tttg: c127/185 lr:0.000226 t:16.4s +tttg: c128/185 lr:0.000219 t:16.5s +tttg: c129/185 lr:0.000212 t:16.5s +tttg: c130/185 lr:0.000205 t:16.6s +tttg: c131/185 lr:0.000198 t:16.7s +tttg: c132/185 lr:0.000191 t:16.8s +tttg: c133/185 lr:0.000184 t:16.9s +tttg: c134/185 lr:0.000178 t:17.0s +tttg: c135/185 lr:0.000171 t:17.1s +tttg: c136/185 lr:0.000165 t:17.2s +tttg: c137/185 lr:0.000159 t:17.3s +tttg: c138/185 lr:0.000153 t:17.4s +tttg: c139/185 lr:0.000146 t:17.5s +tttg: c140/185 lr:0.000140 t:17.5s +tttg: c141/185 lr:0.000135 t:17.6s +tttg: c142/185 lr:0.000129 t:17.7s +tttg: c143/185 lr:0.000123 t:17.8s +tttg: c144/185 lr:0.000118 t:17.9s +tttg: c145/185 lr:0.000112 t:18.0s +tttg: c146/185 lr:0.000107 t:18.1s +tttg: c147/185 lr:0.000102 t:18.2s +tttg: c148/185 lr:0.000096 t:18.3s +tttg: c149/185 lr:0.000092 t:18.4s +tttg: c150/185 lr:0.000087 t:18.4s +tttg: c151/185 lr:0.000082 t:18.5s +tttg: c152/185 lr:0.000077 t:18.6s +tttg: c153/185 lr:0.000073 t:18.7s +tttg: c154/185 lr:0.000068 t:18.8s +tttg: c155/185 lr:0.000064 t:18.9s +tttg: c156/185 lr:0.000060 t:19.0s +tttg: c157/185 lr:0.000056 t:19.1s +tttg: c158/185 lr:0.000052 t:19.2s +tttg: c159/185 lr:0.000048 t:19.2s +tttg: c160/185 lr:0.000045 t:19.3s +tttg: c161/185 lr:0.000041 t:19.4s +tttg: c162/185 lr:0.000038 t:19.5s +tttg: c163/185 lr:0.000035 t:19.6s +tttg: c164/185 lr:0.000032 t:19.7s +tttg: c165/185 lr:0.000029 t:19.8s +tttg: c166/185 lr:0.000026 t:19.9s +tttg: c167/185 lr:0.000023 t:20.0s +tttg: c168/185 lr:0.000021 t:20.1s +tttg: c169/185 lr:0.000019 t:20.1s +tttg: c170/185 lr:0.000016 t:20.2s +tttg: c171/185 lr:0.000014 t:20.3s +tttg: c172/185 lr:0.000012 t:20.4s +tttg: c173/185 lr:0.000010 t:20.5s +tttg: c174/185 lr:0.000009 t:20.6s +tttg: c175/185 lr:0.000007 t:20.7s +tttg: c176/185 lr:0.000006 t:20.8s +tttg: c177/185 lr:0.000005 t:20.9s +tttg: c178/185 lr:0.000004 t:21.0s +tttg: c179/185 lr:0.000003 t:21.0s +tttg: c180/185 lr:0.000002 t:21.1s +tttg: c181/185 lr:0.000001 t:21.2s +tttg: c182/185 lr:0.000001 t:21.3s +tttg: c183/185 lr:0.000000 t:21.4s +tttg: c184/185 lr:0.000000 t:21.5s +ttpr: phase:2/3 t:361.9s +ttp: b746/782 bl:2.4212 bb:1.0668 rl:2.3068 rb:1.0794 dl:2884-2943 gd:0 +ttp: b745/782 bl:2.2467 bb:1.0286 rl:2.3006 rb:1.0741 dl:2842-2883 gd:0 +ttpp: phase:3/3 pd:2448 gd:2000 t:378.9s +tttg: c1/250 lr:0.001000 t:0.1s +tttg: c2/250 lr:0.001000 t:0.2s +tttg: c3/250 lr:0.001000 t:0.3s +tttg: c4/250 lr:0.001000 t:0.4s +tttg: c5/250 lr:0.000999 t:0.5s +tttg: c6/250 lr:0.000999 t:0.6s +tttg: c7/250 lr:0.000999 t:0.7s +tttg: c8/250 lr:0.000998 t:0.8s +tttg: c9/250 lr:0.000997 t:0.9s +tttg: c10/250 lr:0.000997 t:1.0s +tttg: c11/250 lr:0.000996 t:1.1s +tttg: c12/250 lr:0.000995 t:1.2s +tttg: c13/250 lr:0.000994 t:1.3s +tttg: c14/250 lr:0.000993 t:1.3s +tttg: c15/250 lr:0.000992 t:1.4s +tttg: c16/250 lr:0.000991 t:1.5s +tttg: c17/250 lr:0.000990 t:1.6s +tttg: c18/250 lr:0.000989 t:1.7s +tttg: c19/250 lr:0.000987 t:1.8s +tttg: c20/250 lr:0.000986 t:1.9s +tttg: c21/250 lr:0.000984 t:2.0s +tttg: c22/250 lr:0.000983 t:2.1s +tttg: c23/250 lr:0.000981 t:2.2s +tttg: c24/250 lr:0.000979 t:2.2s +tttg: c25/250 lr:0.000977 t:2.3s +tttg: c26/250 lr:0.000975 t:2.4s +tttg: c27/250 lr:0.000973 t:2.5s +tttg: c28/250 lr:0.000971 t:2.6s +tttg: c29/250 lr:0.000969 t:2.7s +tttg: c30/250 lr:0.000967 t:2.8s +tttg: c31/250 lr:0.000965 t:2.9s +tttg: c32/250 lr:0.000962 t:3.0s +tttg: c33/250 lr:0.000960 t:3.1s +tttg: c34/250 lr:0.000957 t:3.2s +tttg: c35/250 lr:0.000955 t:3.3s +tttg: c36/250 lr:0.000952 t:3.4s +tttg: c37/250 lr:0.000949 
t:3.5s +tttg: c38/250 lr:0.000947 t:3.6s +tttg: c39/250 lr:0.000944 t:3.7s +tttg: c40/250 lr:0.000941 t:3.8s +tttg: c41/250 lr:0.000938 t:3.9s +tttg: c42/250 lr:0.000935 t:4.0s +tttg: c43/250 lr:0.000931 t:4.1s +tttg: c44/250 lr:0.000928 t:4.2s +tttg: c45/250 lr:0.000925 t:4.3s +tttg: c46/250 lr:0.000922 t:4.4s +tttg: c47/250 lr:0.000918 t:4.6s +tttg: c48/250 lr:0.000915 t:4.7s +tttg: c49/250 lr:0.000911 t:4.8s +tttg: c50/250 lr:0.000907 t:4.9s +tttg: c51/250 lr:0.000904 t:5.0s +tttg: c52/250 lr:0.000900 t:5.1s +tttg: c53/250 lr:0.000896 t:5.2s +tttg: c54/250 lr:0.000892 t:5.3s +tttg: c55/250 lr:0.000888 t:5.4s +tttg: c56/250 lr:0.000884 t:5.5s +tttg: c57/250 lr:0.000880 t:5.6s +tttg: c58/250 lr:0.000876 t:5.7s +tttg: c59/250 lr:0.000872 t:5.8s +tttg: c60/250 lr:0.000868 t:5.9s +tttg: c61/250 lr:0.000863 t:6.0s +tttg: c62/250 lr:0.000859 t:6.1s +tttg: c63/250 lr:0.000855 t:6.2s +tttg: c64/250 lr:0.000850 t:6.3s +tttg: c65/250 lr:0.000846 t:6.4s +tttg: c66/250 lr:0.000841 t:6.5s +tttg: c67/250 lr:0.000836 t:6.6s +tttg: c68/250 lr:0.000832 t:6.7s +tttg: c69/250 lr:0.000827 t:6.9s +tttg: c70/250 lr:0.000822 t:7.0s +tttg: c71/250 lr:0.000817 t:7.1s +tttg: c72/250 lr:0.000812 t:7.2s +tttg: c73/250 lr:0.000807 t:7.3s +tttg: c74/250 lr:0.000803 t:7.4s +tttg: c75/250 lr:0.000797 t:7.5s +tttg: c76/250 lr:0.000792 t:7.6s +tttg: c77/250 lr:0.000787 t:7.7s +tttg: c78/250 lr:0.000782 t:7.8s +tttg: c79/250 lr:0.000777 t:7.9s +tttg: c80/250 lr:0.000772 t:8.0s +tttg: c81/250 lr:0.000766 t:8.1s +tttg: c82/250 lr:0.000761 t:8.2s +tttg: c83/250 lr:0.000755 t:8.3s +tttg: c84/250 lr:0.000750 t:8.4s +tttg: c85/250 lr:0.000745 t:8.5s +tttg: c86/250 lr:0.000739 t:8.6s +tttg: c87/250 lr:0.000733 t:8.7s +tttg: c88/250 lr:0.000728 t:8.8s +tttg: c89/250 lr:0.000722 t:9.0s +tttg: c90/250 lr:0.000717 t:9.1s +tttg: c91/250 lr:0.000711 t:9.2s +tttg: c92/250 lr:0.000705 t:9.3s +tttg: c93/250 lr:0.000699 t:9.4s +tttg: c94/250 lr:0.000694 t:9.5s +tttg: c95/250 lr:0.000688 t:9.6s +tttg: c96/250 lr:0.000682 t:9.7s +tttg: c97/250 lr:0.000676 t:9.8s +tttg: c98/250 lr:0.000670 t:9.9s +tttg: c99/250 lr:0.000664 t:10.0s +tttg: c100/250 lr:0.000658 t:10.1s +tttg: c101/250 lr:0.000652 t:10.2s +tttg: c102/250 lr:0.000646 t:10.3s +tttg: c103/250 lr:0.000640 t:10.4s +tttg: c104/250 lr:0.000634 t:10.5s +tttg: c105/250 lr:0.000628 t:10.6s +tttg: c106/250 lr:0.000622 t:10.7s +tttg: c107/250 lr:0.000616 t:10.9s +tttg: c108/250 lr:0.000610 t:11.0s +tttg: c109/250 lr:0.000603 t:11.1s +tttg: c110/250 lr:0.000597 t:11.2s +tttg: c111/250 lr:0.000591 t:11.3s +tttg: c112/250 lr:0.000585 t:11.4s +tttg: c113/250 lr:0.000579 t:11.5s +tttg: c114/250 lr:0.000572 t:11.6s +tttg: c115/250 lr:0.000566 t:11.7s +tttg: c116/250 lr:0.000560 t:11.8s +tttg: c117/250 lr:0.000554 t:11.9s +tttg: c118/250 lr:0.000547 t:12.0s +tttg: c119/250 lr:0.000541 t:12.1s +tttg: c120/250 lr:0.000535 t:12.2s +tttg: c121/250 lr:0.000528 t:12.3s +tttg: c122/250 lr:0.000522 t:12.4s +tttg: c123/250 lr:0.000516 t:12.6s +tttg: c124/250 lr:0.000509 t:12.7s +tttg: c125/250 lr:0.000503 t:12.8s +tttg: c126/250 lr:0.000497 t:12.9s +tttg: c127/250 lr:0.000491 t:13.0s +tttg: c128/250 lr:0.000484 t:13.1s +tttg: c129/250 lr:0.000478 t:13.2s +tttg: c130/250 lr:0.000472 t:13.3s +tttg: c131/250 lr:0.000465 t:13.4s +tttg: c132/250 lr:0.000459 t:13.5s +tttg: c133/250 lr:0.000453 t:13.6s +tttg: c134/250 lr:0.000446 t:13.7s +tttg: c135/250 lr:0.000440 t:13.8s +tttg: c136/250 lr:0.000434 t:13.9s +tttg: c137/250 lr:0.000428 t:14.0s +tttg: c138/250 lr:0.000421 t:14.1s +tttg: c139/250 lr:0.000415 
t:14.2s +tttg: c140/250 lr:0.000409 t:14.3s +tttg: c141/250 lr:0.000403 t:14.4s +tttg: c142/250 lr:0.000397 t:14.6s +tttg: c143/250 lr:0.000390 t:14.7s +tttg: c144/250 lr:0.000384 t:14.8s +tttg: c145/250 lr:0.000378 t:14.9s +tttg: c146/250 lr:0.000372 t:15.0s +tttg: c147/250 lr:0.000366 t:15.1s +tttg: c148/250 lr:0.000360 t:15.2s +tttg: c149/250 lr:0.000354 t:15.3s +tttg: c150/250 lr:0.000348 t:15.4s +tttg: c151/250 lr:0.000342 t:15.5s +tttg: c152/250 lr:0.000336 t:15.6s +tttg: c153/250 lr:0.000330 t:15.7s +tttg: c154/250 lr:0.000324 t:15.8s +tttg: c155/250 lr:0.000318 t:15.9s +tttg: c156/250 lr:0.000312 t:16.0s +tttg: c157/250 lr:0.000306 t:16.1s +tttg: c158/250 lr:0.000301 t:16.2s +tttg: c159/250 lr:0.000295 t:16.3s +tttg: c160/250 lr:0.000289 t:16.4s +tttg: c161/250 lr:0.000283 t:16.6s +tttg: c162/250 lr:0.000278 t:16.7s +tttg: c163/250 lr:0.000272 t:16.8s +tttg: c164/250 lr:0.000267 t:16.9s +tttg: c165/250 lr:0.000261 t:17.0s +tttg: c166/250 lr:0.000255 t:17.1s +tttg: c167/250 lr:0.000250 t:17.2s +tttg: c168/250 lr:0.000245 t:17.3s +tttg: c169/250 lr:0.000239 t:17.4s +tttg: c170/250 lr:0.000234 t:17.5s +tttg: c171/250 lr:0.000228 t:17.6s +tttg: c172/250 lr:0.000223 t:17.7s +tttg: c173/250 lr:0.000218 t:17.8s +tttg: c174/250 lr:0.000213 t:17.9s +tttg: c175/250 lr:0.000208 t:18.0s +tttg: c176/250 lr:0.000203 t:18.1s +tttg: c177/250 lr:0.000197 t:18.2s +tttg: c178/250 lr:0.000193 t:18.3s +tttg: c179/250 lr:0.000188 t:18.4s +tttg: c180/250 lr:0.000183 t:18.5s +tttg: c181/250 lr:0.000178 t:18.6s +tttg: c182/250 lr:0.000173 t:18.8s +tttg: c183/250 lr:0.000168 t:18.9s +tttg: c184/250 lr:0.000164 t:19.0s +tttg: c185/250 lr:0.000159 t:19.1s +tttg: c186/250 lr:0.000154 t:19.2s +tttg: c187/250 lr:0.000150 t:19.3s +tttg: c188/250 lr:0.000145 t:19.4s +tttg: c189/250 lr:0.000141 t:19.5s +tttg: c190/250 lr:0.000137 t:19.6s +tttg: c191/250 lr:0.000132 t:19.7s +tttg: c192/250 lr:0.000128 t:19.8s +tttg: c193/250 lr:0.000124 t:19.9s +tttg: c194/250 lr:0.000120 t:20.0s +tttg: c195/250 lr:0.000116 t:20.1s +tttg: c196/250 lr:0.000112 t:20.2s +tttg: c197/250 lr:0.000108 t:20.3s +tttg: c198/250 lr:0.000104 t:20.4s +tttg: c199/250 lr:0.000100 t:20.5s +tttg: c200/250 lr:0.000096 t:20.6s +tttg: c201/250 lr:0.000093 t:20.7s +tttg: c202/250 lr:0.000089 t:20.8s +tttg: c203/250 lr:0.000085 t:21.0s +tttg: c204/250 lr:0.000082 t:21.1s +tttg: c205/250 lr:0.000078 t:21.2s +tttg: c206/250 lr:0.000075 t:21.3s +tttg: c207/250 lr:0.000072 t:21.4s +tttg: c208/250 lr:0.000069 t:21.5s +tttg: c209/250 lr:0.000065 t:21.6s +tttg: c210/250 lr:0.000062 t:21.7s +tttg: c211/250 lr:0.000059 t:21.8s +tttg: c212/250 lr:0.000056 t:21.9s +tttg: c213/250 lr:0.000053 t:22.0s +tttg: c214/250 lr:0.000051 t:22.1s +tttg: c215/250 lr:0.000048 t:22.2s +tttg: c216/250 lr:0.000045 t:22.3s +tttg: c217/250 lr:0.000043 t:22.4s +tttg: c218/250 lr:0.000040 t:22.5s +tttg: c219/250 lr:0.000038 t:22.6s +tttg: c220/250 lr:0.000035 t:22.7s +tttg: c221/250 lr:0.000033 t:22.8s +tttg: c222/250 lr:0.000031 t:22.9s +tttg: c223/250 lr:0.000029 t:23.1s +tttg: c224/250 lr:0.000027 t:23.2s +tttg: c225/250 lr:0.000025 t:23.3s +tttg: c226/250 lr:0.000023 t:23.4s +tttg: c227/250 lr:0.000021 t:23.5s +tttg: c228/250 lr:0.000019 t:23.6s +tttg: c229/250 lr:0.000017 t:23.7s +tttg: c230/250 lr:0.000016 t:23.8s +tttg: c231/250 lr:0.000014 t:23.9s +tttg: c232/250 lr:0.000013 t:24.0s +tttg: c233/250 lr:0.000011 t:24.1s +tttg: c234/250 lr:0.000010 t:24.2s +tttg: c235/250 lr:0.000009 t:24.3s +tttg: c236/250 lr:0.000008 t:24.4s +tttg: c237/250 lr:0.000007 t:24.5s +tttg: c238/250 
lr:0.000006 t:24.6s +tttg: c239/250 lr:0.000005 t:24.7s +tttg: c240/250 lr:0.000004 t:24.8s +tttg: c241/250 lr:0.000003 t:25.0s +tttg: c242/250 lr:0.000003 t:25.1s +tttg: c243/250 lr:0.000002 t:25.2s +tttg: c244/250 lr:0.000001 t:25.3s +tttg: c245/250 lr:0.000001 t:25.4s +tttg: c246/250 lr:0.000001 t:25.5s +tttg: c247/250 lr:0.000000 t:25.6s +tttg: c248/250 lr:0.000000 t:25.7s +tttg: c249/250 lr:0.000000 t:25.8s +ttpr: phase:3/3 t:407.5s +ttp: b736/782 bl:2.2541 bb:1.0620 rl:2.2968 rb:1.0731 dl:2526-2550 gd:1 +ttp: b734/782 bl:2.2738 bb:1.0344 rl:2.2950 rb:1.0701 dl:2469-2495 gd:1 +ttp: b727/782 bl:2.2762 bb:1.0490 rl:2.2938 rb:1.0687 dl:2277-2305 gd:1 +ttp: b712/782 bl:2.3484 bb:1.0651 rl:2.2967 rb:1.0685 dl:1984-2002 gd:1 +ttp: b707/782 bl:2.3733 bb:1.0547 rl:2.3005 rb:1.0678 dl:1910-1923 gd:1 +ttp: b696/782 bl:2.3205 bb:1.0568 rl:2.3013 rb:1.0673 dl:1779-1790 gd:1 +ttp: b689/782 bl:2.4021 bb:1.0815 rl:2.3054 rb:1.0679 dl:1706-1715 gd:1 +ttp: b685/782 bl:2.3122 bb:1.0347 rl:2.3056 rb:1.0666 dl:1665-1675 gd:1 +ttp: b676/782 bl:2.3454 bb:1.0550 rl:2.3070 rb:1.0662 dl:1586-1595 gd:1 +ttp: b665/782 bl:2.3417 bb:1.0520 rl:2.3081 rb:1.0657 dl:1500-1507 gd:1 +ttp: b657/782 bl:2.3425 bb:1.0646 rl:2.3091 rb:1.0657 dl:1445-1452 gd:1 +ttp: b649/782 bl:2.2961 bb:1.0209 rl:2.3088 rb:1.0644 dl:1392-1398 gd:1 +ttp: b640/782 bl:2.3179 bb:1.0559 rl:2.3090 rb:1.0642 dl:1337-1343 gd:1 +ttp: b637/782 bl:2.3801 bb:1.0854 rl:2.3108 rb:1.0647 dl:1320-1325 gd:1 +ttp: b628/782 bl:2.3335 bb:1.0354 rl:2.3113 rb:1.0640 dl:1271-1276 gd:1 +ttp: b620/782 bl:2.3566 bb:1.0615 rl:2.3123 rb:1.0639 dl:1226-1231 gd:1 +ttp: b612/782 bl:2.2487 bb:1.0188 rl:2.3110 rb:1.0630 dl:1186-1190 gd:1 +ttp: b606/782 bl:2.3760 bb:1.0736 rl:2.3123 rb:1.0632 dl:1159-1164 gd:1 +ttp: b597/782 bl:2.3820 bb:1.0592 rl:2.3136 rb:1.0631 dl:1119-1124 gd:1 +ttp: b589/782 bl:2.2913 bb:1.0176 rl:2.3132 rb:1.0623 dl:1086-1089 gd:1 +ttp: b580/782 bl:2.3268 bb:1.0209 rl:2.3134 rb:1.0615 dl:1048-1052 gd:1 +ttp: b574/782 bl:2.3825 bb:1.0691 rl:2.3146 rb:1.0617 dl:1025-1029 gd:1 +ttp: b566/782 bl:2.3139 bb:1.0335 rl:2.3146 rb:1.0612 dl:997-1001 gd:1 +ttp: b559/782 bl:2.3109 bb:1.0466 rl:2.3145 rb:1.0610 dl:972-975 gd:1 +ttp: b551/782 bl:2.3491 bb:1.0617 rl:2.3150 rb:1.0610 dl:946-949 gd:1 +ttp: b543/782 bl:2.3470 bb:1.0626 rl:2.3155 rb:1.0610 dl:921-924 gd:1 +ttp: b535/782 bl:2.3911 bb:1.0371 rl:2.3165 rb:1.0607 dl:896-899 gd:1 +ttp: b527/782 bl:2.3581 bb:1.0350 rl:2.3170 rb:1.0603 dl:872-875 gd:1 +ttp: b519/782 bl:2.3094 bb:1.0477 rl:2.3169 rb:1.0602 dl:850-852 gd:1 +ttp: b511/782 bl:2.3993 bb:1.0555 rl:2.3179 rb:1.0601 dl:826-829 gd:1 +ttp: b503/782 bl:2.3650 bb:1.0715 rl:2.3185 rb:1.0602 dl:804-807 gd:1 +ttp: b495/782 bl:2.3216 bb:1.0370 rl:2.3185 rb:1.0600 dl:783-785 gd:1 +ttp: b487/782 bl:2.2927 bb:1.0735 rl:2.3182 rb:1.0601 dl:764-766 gd:1 +ttp: b479/782 bl:2.4247 bb:1.0895 rl:2.3193 rb:1.0604 dl:744-747 gd:1 +ttp: b471/782 bl:2.4096 bb:1.0880 rl:2.3202 rb:1.0607 dl:726-728 gd:1 +ttp: b463/782 bl:2.3286 bb:1.0479 rl:2.3203 rb:1.0606 dl:708-710 gd:1 +ttp: b455/782 bl:2.3190 bb:1.0451 rl:2.3203 rb:1.0604 dl:691-693 gd:1 +ttp: b447/782 bl:2.3399 bb:1.0749 rl:2.3205 rb:1.0606 dl:674-676 gd:1 +ttp: b439/782 bl:2.3365 bb:1.0426 rl:2.3206 rb:1.0604 dl:657-659 gd:1 +ttp: b431/782 bl:2.3863 bb:1.0586 rl:2.3211 rb:1.0604 dl:642-643 gd:1 +ttp: b423/782 bl:2.3193 bb:1.0582 rl:2.3211 rb:1.0604 dl:626-629 gd:1 +ttp: b415/782 bl:2.2929 bb:1.0621 rl:2.3209 rb:1.0604 dl:611-613 gd:1 +ttp: b407/782 bl:2.2839 bb:1.0456 rl:2.3206 rb:1.0603 dl:595-597 gd:1 +ttp: b399/782 
bl:2.3006 bb:1.0383 rl:2.3205 rb:1.0601 dl:581-582 gd:1 +ttp: b391/782 bl:2.3228 bb:1.0699 rl:2.3205 rb:1.0602 dl:566-568 gd:1 +ttp: b382/782 bl:2.3036 bb:1.0884 rl:2.3204 rb:1.0604 dl:550-552 gd:1 +ttp: b375/782 bl:2.4128 bb:1.0761 rl:2.3210 rb:1.0605 dl:538-540 gd:1 +ttp: b367/782 bl:2.3096 bb:1.0899 rl:2.3209 rb:1.0607 dl:525-527 gd:1 +ttp: b359/782 bl:2.2625 bb:1.0389 rl:2.3206 rb:1.0605 dl:512-513 gd:1 +ttp: b351/782 bl:2.3756 bb:1.0875 rl:2.3209 rb:1.0607 dl:498-499 gd:1 +ttp: b343/782 bl:2.2346 bb:1.0516 rl:2.3204 rb:1.0606 dl:486-488 gd:1 +ttp: b335/782 bl:2.3767 bb:1.0767 rl:2.3207 rb:1.0607 dl:474-476 gd:1 +ttp: b327/782 bl:2.3527 bb:1.0939 rl:2.3209 rb:1.0609 dl:462-463 gd:1 +ttp: b319/782 bl:2.4120 bb:1.0876 rl:2.3214 rb:1.0610 dl:450-451 gd:1 +ttp: b311/782 bl:2.3592 bb:1.0874 rl:2.3216 rb:1.0612 dl:438-439 gd:1 +ttp: b303/782 bl:2.4067 bb:1.0978 rl:2.3220 rb:1.0614 dl:426-427 gd:1 +ttp: b295/782 bl:2.2775 bb:1.0685 rl:2.3218 rb:1.0614 dl:414-415 gd:1 +ttp: b287/782 bl:2.4157 bb:1.1006 rl:2.3222 rb:1.0616 dl:402-403 gd:1 +ttp: b279/782 bl:2.3227 bb:1.0975 rl:2.3222 rb:1.0617 dl:391-392 gd:1 +ttp: b271/782 bl:2.3825 bb:1.1285 rl:2.3225 rb:1.0620 dl:380-382 gd:1 +ttp: b263/782 bl:2.4032 bb:1.0871 rl:2.3228 rb:1.0621 dl:370-371 gd:1 +ttp: b255/782 bl:2.3789 bb:1.0971 rl:2.3231 rb:1.0623 dl:360-361 gd:1 +ttp: b247/782 bl:2.3656 bb:1.1011 rl:2.3232 rb:1.0624 dl:350-351 gd:1 +ttp: b239/782 bl:2.3905 bb:1.1100 rl:2.3235 rb:1.0626 dl:340-341 gd:1 +ttp: b229/782 bl:2.3853 bb:1.0750 rl:2.3237 rb:1.0627 dl:328-329 gd:1 +ttp: b221/782 bl:2.4160 bb:1.1259 rl:2.3240 rb:1.0629 dl:318-320 gd:1 +ttp: b215/782 bl:2.4082 bb:1.1039 rl:2.3243 rb:1.0630 dl:312-313 gd:1 +ttp: b206/782 bl:2.4170 bb:1.1119 rl:2.3247 rb:1.0632 dl:302-303 gd:1 +ttp: b197/782 bl:2.3820 bb:1.1260 rl:2.3248 rb:1.0634 dl:292-294 gd:1 +ttp: b189/782 bl:2.4280 bb:1.1455 rl:2.3252 rb:1.0636 dl:283-284 gd:1 +ttp: b180/782 bl:2.4455 bb:1.1204 rl:2.3255 rb:1.0638 dl:274-275 gd:1 +ttp: b172/782 bl:2.5313 bb:1.1605 rl:2.3261 rb:1.0641 dl:266-267 gd:1 +ttp: b163/782 bl:2.3896 bb:1.1258 rl:2.3263 rb:1.0643 dl:257-259 gd:1 +ttp: b155/782 bl:2.4173 bb:1.1176 rl:2.3266 rb:1.0644 dl:250-251 gd:1 +ttp: b149/782 bl:2.3857 bb:1.1621 rl:2.3267 rb:1.0647 dl:244-245 gd:1 +ttp: b140/782 bl:2.4515 bb:1.1446 rl:2.3270 rb:1.0649 dl:235-236 gd:1 +ttp: b132/782 bl:2.4521 bb:1.1645 rl:2.3274 rb:1.0651 dl:228-229 gd:1 +ttp: b125/782 bl:2.4946 bb:1.1494 rl:2.3278 rb:1.0653 dl:222-222 gd:1 +ttp: b116/782 bl:2.5016 bb:1.1359 rl:2.3282 rb:1.0655 dl:213-214 gd:1 +ttp: b107/782 bl:2.4479 bb:1.1723 rl:2.3284 rb:1.0657 dl:205-206 gd:1 +ttp: b99/782 bl:2.5097 bb:1.1820 rl:2.3288 rb:1.0659 dl:198-199 gd:1 +ttp: b93/782 bl:2.4863 bb:1.1925 rl:2.3291 rb:1.0662 dl:192-193 gd:1 +ttp: b85/782 bl:2.5177 bb:1.2058 rl:2.3295 rb:1.0664 dl:185-186 gd:1 +ttp: b76/782 bl:2.5136 bb:1.1805 rl:2.3299 rb:1.0667 dl:177-178 gd:1 +ttp: b69/782 bl:2.4788 bb:1.2100 rl:2.3301 rb:1.0669 dl:171-172 gd:1 +ttp: b61/782 bl:2.4661 bb:1.2207 rl:2.3304 rb:1.0672 dl:164-165 gd:1 +ttp: b54/782 bl:2.4875 bb:1.2202 rl:2.3306 rb:1.0674 dl:157-158 gd:1 +ttp: b47/782 bl:2.4560 bb:1.1465 rl:2.3308 rb:1.0675 dl:150-151 gd:1 +ttp: b40/782 bl:2.4939 bb:1.1552 rl:2.3311 rb:1.0676 dl:143-144 gd:1 +ttp: b35/782 bl:2.6460 bb:1.2836 rl:2.3316 rb:1.0679 dl:138-139 gd:1 +ttp: b28/782 bl:2.6293 bb:1.2195 rl:2.3320 rb:1.0682 dl:131-132 gd:1 +ttp: b18/782 bl:2.6490 bb:1.2078 rl:2.3324 rb:1.0683 dl:119-121 gd:1 +ttp: b11/782 bl:2.6486 bb:1.2247 rl:2.3327 rb:1.0685 dl:109-110 gd:1 +ttp: b3/782 bl:2.6590 bb:1.1845 
rl:2.3330 rb:1.0686 dl:89-93 gd:1 +quantized_ttt_phased val_loss:2.33340980 val_bpb:1.06627722 eval_time:511611ms +total_eval_time:511.6s diff --git a/runs/019b-recur-alpha-manual-constant-full/seed_42/train.tokenizer_path_bug.log b/runs/019b-recur-alpha-manual-constant-full/seed_42/train.tokenizer_path_bug.log new file mode 100644 index 0000000000..5caae5e947 --- /dev/null +++ b/runs/019b-recur-alpha-manual-constant-full/seed_42/train.tokenizer_path_bug.log @@ -0,0 +1,325 @@ +W0421 06:36:24.842000 392 torch/distributed/run.py:803] +W0421 06:36:24.842000 392 torch/distributed/run.py:803] ***************************************** +W0421 06:36:24.842000 392 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0421 06:36:24.842000 392 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 +[rank2]: Traceback (most recent call last): +[rank2]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3581, in +[rank2]: main() +[rank2]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3575, in main +[rank2]: train_and_eval(h, device) +[rank2]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3406, in train_and_eval +[rank2]: val_data = ValidationData(h, device) +[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank2]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 248, in __init__ +[rank2]: self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) +[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank2]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 468, in Init +[rank2]: self.Load(model_file=model_file, model_proto=model_proto) +[rank2]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 961, in Load +[rank2]: return self.LoadFromFile(model_file) +[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank2]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile +[rank2]: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) +[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank2]: OSError: Not found: "/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model": No such file or directory Error #2 +[rank5]: Traceback (most recent call last): +[rank5]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3581, in +[rank5]: main() +[rank5]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3575, in main +[rank5]: train_and_eval(h, device) +[rank5]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3406, in train_and_eval +[rank5]: val_data = ValidationData(h, device) +[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^ 
+[rank5]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 248, in __init__ +[rank5]: self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) +[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank5]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 468, in Init +[rank5]: self.Load(model_file=model_file, model_proto=model_proto) +[rank5]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 961, in Load +[rank5]: return self.LoadFromFile(model_file) +[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank5]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile +[rank5]: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) +[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank5]: OSError: Not found: "/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model": No such file or directory Error #2 +[rank3]: Traceback (most recent call last): +[rank3]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3581, in +[rank3]: main() +[rank3]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3575, in main +[rank3]: train_and_eval(h, device) +[rank3]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3406, in train_and_eval +[rank3]: val_data = ValidationData(h, device) +[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank3]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 248, in __init__ +[rank3]: self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) +[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank3]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 468, in Init +[rank3]: self.Load(model_file=model_file, model_proto=model_proto) +[rank3]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 961, in Load +[rank3]: return self.LoadFromFile(model_file) +[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank3]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile +[rank3]: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) +[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank3]: OSError: Not found: "/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model": No such file or directory Error #2 +[rank7]: Traceback (most recent call last): +[rank7]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3581, in +[rank7]: main() +[rank7]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3575, in main +[rank7]: train_and_eval(h, device) +[rank7]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3406, in train_and_eval 
+[rank7]: val_data = ValidationData(h, device) +[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank7]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 248, in __init__ +[rank7]: self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) +[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank7]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 468, in Init +[rank7]: self.Load(model_file=model_file, model_proto=model_proto) +[rank7]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 961, in Load +[rank7]: return self.LoadFromFile(model_file) +[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank7]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile +[rank7]: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) +[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank7]: OSError: Not found: "/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model": No such file or directory Error #2 + artifact_dir: /workspace/runs/019b-recur-alpha-manual-constant-full/seed_42 +[rank4]: Traceback (most recent call last): +[rank4]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3581, in +[rank4]: main() +[rank4]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3575, in main +[rank4]: train_and_eval(h, device) +[rank4]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3406, in train_and_eval +[rank4]: val_data = ValidationData(h, device) +[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank4]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 248, in __init__ +[rank4]: self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) +[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank4]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 468, in Init +[rank4]: self.Load(model_file=model_file, model_proto=model_proto) +[rank4]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 961, in Load +[rank4]: return self.LoadFromFile(model_file) +[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank4]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile +[rank4]: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) +[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank4]: OSError: Not found: "/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model": No such file or directory Error #2 +[rank6]: Traceback (most recent call last): +[rank6]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3581, in +[rank6]: main() +[rank6]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3575, in main +[rank6]: train_and_eval(h, device) +[rank6]: File 
"/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3406, in train_and_eval +[rank6]: val_data = ValidationData(h, device) +[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank6]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 248, in __init__ +[rank6]: self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) +[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank6]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 468, in Init +[rank6]: self.Load(model_file=model_file, model_proto=model_proto) +[rank6]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 961, in Load +[rank6]: return self.LoadFromFile(model_file) +[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank6]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile +[rank6]: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) +[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank6]: OSError: Not found: "/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model": No such file or directory Error #2 +[rank1]: Traceback (most recent call last): +[rank1]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3581, in +[rank1]: main() +[rank1]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3575, in main +[rank1]: train_and_eval(h, device) +[rank1]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3406, in train_and_eval +[rank1]: val_data = ValidationData(h, device) +[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank1]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 248, in __init__ +[rank1]: self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) +[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank1]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 468, in Init +[rank1]: self.Load(model_file=model_file, model_proto=model_proto) +[rank1]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 961, in Load +[rank1]: return self.LoadFromFile(model_file) +[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank1]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile +[rank1]: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) +[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank1]: OSError: Not found: "/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model": No such file or directory Error #2 + attn_clip_sigmas: 13.0 + attn_out_gate_enabled: False + attn_out_gate_src: proj + beta1: 0.9 + beta2: 0.95 + caseops_enabled: True + compressor: brotli + data_dir: /workspace/parameter-golf/data + datasets_dir: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved + 
distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 15.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + gate_window: 12 + gated_attn_enabled: True + gated_attn_init_std: 0.005 + gated_attn_quant_gate: True + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 4.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: /workspace/runs/019b-recur-alpha-manual-constant-full/seed_42/2c2459f7-a40c-4d92-bd3e-5de07b4b85f4.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_clip_sigmas: 12.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: /workspace/runs/019b-recur-alpha-manual-constant-full/seed_42/final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2000 + qk_gain_init: 5.0 + quantized_model_path: /workspace/runs/019b-recur-alpha-manual-constant-full/seed_42/final_model.int6.ptz + rank: 0 + recur_alpha_enabled: True + recur_diag_p2p_cos: False + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: 2c2459f7-a40c-4d92-bd3e-5de07b4b85f4 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + smear_gate_enabled: False + spinquant_enabled: False + spinquant_seed: 42 + spinquant_sites: attn_in,attn_proj_in,mlp_in,mlp_proj_in + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + train_batch_tokens: 786432 + train_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin + train_log_every: 100 + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.999 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 96 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_bytes_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin + val_doc_fraction: 1.0 + val_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.75 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +[rank0]: Traceback (most recent call last): +[rank0]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3581, in +[rank0]: main() +[rank0]: File 
"/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3575, in main +[rank0]: train_and_eval(h, device) +[rank0]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 3406, in train_and_eval +[rank0]: val_data = ValidationData(h, device) +[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank0]: File "/workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py", line 248, in __init__ +[rank0]: self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) +[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank0]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 468, in Init +[rank0]: self.Load(model_file=model_file, model_proto=model_proto) +[rank0]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 961, in Load +[rank0]: return self.LoadFromFile(model_file) +[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank0]: File "/usr/local/lib/python3.12/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile +[rank0]: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) +[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +[rank0]: OSError: Not found: "/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model": No such file or directory Error #2 +[rank0]:[W421 06:36:36.198903889 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) +W0421 06:36:37.546000 392 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 502 closing signal SIGTERM +W0421 06:36:37.549000 392 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 503 closing signal SIGTERM +W0421 06:36:37.550000 392 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 504 closing signal SIGTERM +W0421 06:36:37.552000 392 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 506 closing signal SIGTERM +W0421 06:36:37.554000 392 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 507 closing signal SIGTERM +W0421 06:36:37.554000 392 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 508 closing signal SIGTERM +W0421 06:36:37.555000 392 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 509 closing signal SIGTERM +E0421 06:36:38.874000 392 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 3 (pid: 505) of binary: /usr/local/bin/python +Traceback (most recent call last): + File "/usr/local/bin/torchrun", line 7, in + sys.exit(main()) + ^^^^^^ + File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper + return f(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^ + File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 936, in main + run(args) + File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 927, in run + elastic_launch( + File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 156, in __call__ + return launch_agent(self._config, self._entrypoint, 
list(args)) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 293, in launch_agent + raise ChildFailedError( +torch.distributed.elastic.multiprocessing.errors.ChildFailedError: +============================================================ +train_gpt.py FAILED +------------------------------------------------------------ +Failures: + +------------------------------------------------------------ +Root Cause (first observed failure): +[0]: + time : 2026-04-21_06:36:37 + host : 9c28f8b148b1 + rank : 3 (local_rank: 3) + exitcode : 1 (pid: 505) + error_file: + traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html +============================================================
diff --git a/writeup/draft.md b/writeup/draft.md
new file mode 100644
index 0000000000..173abe4840
--- /dev/null
+++ b/writeup/draft.md
@@ -0,0 +1,557 @@
+# Parameter Golf: Six Weeks of Compressing Language Models to Their Limits
+
+---
+
+> **What does a 14% compression improvement actually look like?**
+> Both models below were trained on the same dataset, in the same 10 minutes,
+> on the same hardware. The only difference is what the community built in six weeks.
+
+*From the FineWeb validation set, same 150-token context fed to both models:*
+
+| | SP1024 Baseline · 1.22 BPB | Near-SOTA Model · 1.06 BPB |
+|---|---|---|
+| *"…his driver license and auto insurance had been cancelled in October. This poses a problem for"* | the estate, though internalieness to gets the pass. It is a fat check that's gives ges auto insurance company… that's a **fairyyyyyyyyyyyyyyyyyyyyy** | the estate's creditors who need to work out a payment, especially with the new insurance regulations in october… |
+| *"…New host Steve Byrnes was feeling his way through this"* | season. He had just had a short, **bad weekkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk** | season's show, and while he played one of the best nights of his career at the same time, he hadn't been able to really shine since the first Nashville game… |
+| *"…Prosecutors now have until"* | Jan. 2, the **Supreen House** prosecutor sentences on a **murded** case… **Mazzzzaglia**… **charies's''''''** | Dec. 18 to sign off on a case after the new hampshire attorney general's office says they have more time… |
+
+*Prompted with "You are a helpful AI assistant. Please answer the following question." (not very good at general conversation):*
+
+| | SP1024 Baseline · 1.22 BPB | Near-SOTA Model · 1.06 BPB |
+|---|---|---|
+| *"The history of the internet began in the 1960s when"* | the Netherlands had established a strong supply of information about the **Canadian government and its supply chains**. The German government, in its own right, is commonly established as a matter of legal… | the internet became the **first electronic commerce tool** that was used by many people, including children, to connect online. the internet is now used by more than 3 billion users worldwide… |
+| *"What is 15 multiplied by 13? Let me work it out:"* | **1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 14. 13.** 14. 15. 16. 17. 18. 19. **19.** 20.1 | 15 multiples 13 have 15, for example, 15, for **1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000** |
+| *"To bake a chocolate cake, you need flour, eggs, butter, and"* | sugar… you need to have **an infinite wooden cake**. Otherwise, you need to be taken as close to an… | salt… Add 1 tablespoon of flour to 1/4 cup of milk and pour over your cake. Add 1/4 teaspoon of sugar… Cook a lightly toasted chocolate cake. Make cupcakes à 3 |
+
+---
+
+## Part 0: The Competition
+
+In March 2026, OpenAI announced Parameter Golf: a community competition to build
+the best language model that fits inside 16,000,000 bytes, trained in exactly
+10 minutes on 8 H100 GPUs. The competition ran for six weeks, attracted over
+2000 PR submissions, and ended with the community pushing a baseline model from
+1.22 bits-per-byte down to 1.056, a compression improvement of roughly 14%
+through pure algorithmic ingenuity, with no change to the hardware budget or
+the data.
+
+Parameter Golf sits within a family of challenges sometimes called L(N)
+optimization. Sister challenges include the NanoGPT Speedrun (fixed time,
+minimize loss) and the NanoGPT Flowrun (fixed data, minimize loss). Here
+the constraint is different: fixed size, with both training time and model
+capacity tightly capped.
+
+Before we look at the models, let us first understand how the competition works.
+
+**The setup.** The dataset is a slice of FineWeb, a large corpus of filtered
+web text. Competitors train on a fixed training split and are scored on a
+held-out validation set they cannot access during training. The training window
+is 10 minutes of wall-clock time on 8 H100 GPUs. Triton kernel compilation
+happens beforehand (competitors pre-warm their kernel caches as a separate
+step and it does not count against the clock), and GPTQ quantization runs after
+the clock stops as a post-processing step. The final artifact (which must fit
+inside 16,000,000 bytes) is the compressed model weights plus the
+`train_gpt.py` code that runs them.¹ Evaluation gets a separate 10-minute
+window; test-time training (TTT) happens within that window, consuming some of
+the budget before the scoring pass begins.
+
+**How scoring works.** The task is next-token prediction. The model sees text
+one token at a time, left to right, and at each position must output a
+probability distribution over all possible next tokens. If the next token is t,
+the model pays −log₂ p(t) bits: close to zero when confident and right,
+increasingly large when wrong. Over a full document t₁, t₂, …, tₙ, the
+total score is:
+
+    −∑ₖ log₂ p(tₖ | t₁, …, tₖ₋₁)
+
+Dividing by the total number of bytes in the document gives **bits-per-byte**
+(BPB). The byte-level denominator matters: a token that encodes two bytes
+counts as two bytes in the denominator, so longer tokens are neither rewarded
+nor penalized just for being large, something that turns out to be important
+later. Lower BPB is better.
+
+A useful analogy: imagine a multiple-choice exam with 256 options at every
+question. Instead of circling one answer, you must assign a probability to each
+option, with all 256 summing to 1. Your score on each question is −log₂ of the
+probability you assigned to the correct answer. Assign 100% to the right answer:
+score 0 (perfect). Assign equal weight to all 256 options: score log₂(256) = 8.
+A model with no knowledge of text does exactly that (uniform over 256 bytes),
+scoring 8 BPB. A perfect oracle scores 0 BPB. The baseline transformer the
+competition started from scored **1.2244 BPB**.
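+
+To make the scoring concrete, here is a minimal sketch of the BPB computation
+(a hypothetical helper for illustration, not the competition's actual harness):
+
+```python
+import math
+
+def bits_per_byte(p_correct, text):
+    """BPB: negative sum of log2 of the probabilities assigned to the
+    actual tokens, divided by the document's length in bytes (not tokens).
+    p_correct[k] is the probability the model gave the true k-th token,
+    conditioned only on earlier tokens (causal scoring; rule C1 below)."""
+    total_bits = -sum(math.log2(p) for p in p_correct)
+    return total_bits / len(text.encode("utf-8"))
+
+# A byte-level model that is uniform over 256 values scores exactly 8 BPB:
+doc = "hello"
+print(bits_per_byte([1 / 256] * len(doc), doc))  # 8.0
+```
+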
+**The rules: C1–C4.** The competition launched without a complete ruleset. As
+participants found increasingly creative and occasionally questionable ways
+to improve their scores, four constraints were codified mid-competition through
+community discussion in Issue #1017:²
+
+- **C1: Causal eval:** the probability assigned to token tₖ must depend only
+  on the tokens before it. No peeking at what comes next. In the multiple-choice
+  analogy: you must commit to your probability distribution before seeing which
+  answer is correct. What happened in practice: certain submissions fed the
+  answer token into the model as part of the context while scoring that very
+  token, effectively choosing all options, seeing which one is right, then
+  going back and erasing the wrong ones.
+
+- **C2: Normalized distribution:** the output must be a valid probability
+  distribution summing to exactly 1. You cannot manufacture artificially low
+  entropy by assigning more than 100% total mass across tokens. In the analogy:
+  you cannot assign 90% probability to each of four options.
+
+- **C3: Score before update:** in TTT, a chunk of validation text must be
+  fully scored *before* any gradient step is applied to it. Training on a chunk
+  and then scoring it means the model has already adapted to those exact tokens.
+  In the analogy: you cannot read the exam questions, go study them, then come
+  back and sit the exam.
+
+- **C4: Single pass:** each validation token is scored exactly once. No
+  re-scoring after adaptation. In the analogy: you cannot retake the same
+  question after seeing the answer.
+
+In the six weeks the competition ran, the score fell from 1.2244 to 1.0565, a
+drop of 0.168 BPB. [TODO: score progression graph]
+
+In the next section we trace the key techniques that drove this improvement.
+
+---
+¹ The artifact is the compressed model weights plus all code in `train_gpt.py`.
+The cap is 16,000,000 bytes (decimal), not 16 MiB (16,777,216 bytes), a
+distinction that matters: the actual budget is about 4.6% smaller than a
+"16 megabyte" headline implies. No external downloads or network calls are
+permitted during evaluation.
+² Issue and PR numbers refer to the openai/parameter-golf GitHub repository:
+`https://github.com/openai/parameter-golf/issues/{number}` or `/pull/{number}`.
+
+---
+
+## Part 1: Model Comparison and Techniques
+
+There are five components to a Parameter Golf submission: the tokenizer,
+the model architecture, the training setup, the quantization pipeline, and
+post-training adaptation. Let's start with the base model that OpenAI
+provided at the start of the competition.
+
+*(This section assumes basic familiarity with transformer architecture at
+the level of NanoGPT (attention, residual stream, MLP layers). If that's
+new territory, Karpathy's [Let's build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY)
+is the right starting point.)*
+
+---
+
+### The Baseline
+
+OpenAI's starting model was already more than a plain NanoGPT-style stack
+of attention and MLP layers; it was more carefully engineered than that.
+Let's break it down.
+
+**Tokenizer.** The baseline used SentencePiece with a 1024-token vocabulary
+(SP1024), trained on the same FineWeb corpus used for scoring. A 1024-token
+vocabulary is quite small by modern standards; GPT-2's has 50,257 entries.
+The small vocabulary keeps the embedding table compact, which matters when
+your entire model has to fit in 16 MB.
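+
+A back-of-envelope calculation shows how directly vocabulary size turns into
+artifact bytes (assuming the baseline's 512-dimensional embeddings, stored in
+bfloat16 before any quantization; illustrative, not the exact artifact math):
+
+```python
+def embedding_table_bytes(vocab_size, model_dim=512, bytes_per_weight=2):
+    # bfloat16 = 2 bytes per weight, before quantization and compression
+    return vocab_size * model_dim * bytes_per_weight
+
+print(embedding_table_bytes(1024))    # SP1024: ~1.0 MB
+print(embedding_table_bytes(50257))   # GPT-2-sized vocab: ~51.5 MB,
+                                      # over 3x the entire 16,000,000-byte cap
+```
+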
+**Model architecture.** The baseline was a 9-layer, 512-dimensional transformer
+with three structural additions worth calling out:
+
+- **U-Net skip connections:** in a standard transformer, each layer feeds
+  only into the next. The U-Net pattern adds direct connections that skip the
+  middle of the network. The 9 layers are split into an encoder half (layers
+  1–4), a bottleneck (layer 5), and a decoder half (layers 6–9). Each decoder
+  layer receives the output of its mirror encoder layer as an additional
+  residual, weighted by a learned scalar *w*:
+
+  ```
+  Standard: h_l = Block_l(h_{l-1})
+
+  U-Net:    h_l = Block_l(h_{l-1})                for l ≤ 5
+            h_l = Block_l(h_{l-1} + w · h_{10-l}) for l ∈ {6,7,8,9}
+  ```
+
+  Concretely: layer 6 gets a skip from layer 4, layer 7 from layer 3, layer 8
+  from layer 2, layer 9 from layer 1. The early-layer representations, which
+  tend to capture local, surface-level patterns, are fed directly into the
+  late layers alongside the deep representations. This is the same idea that
+  made U-Net famous in image segmentation.
+- **Grouped-query attention (GQA)** and **rotary positional embeddings (RoPE)**,
+  both standard in modern LLMs, just not in NanoGPT.
+  GQA shares key-value heads across query heads to cut parameter count; RoPE
+  encodes position by rotating query and key vectors rather than adding learned
+  position embeddings.
+
+**Training.** The baseline used the **Muon optimizer**, a popular alternative
+to AdamW. The learning rate follows a
+warmup-then-warmdown schedule, rising over the first few steps then decaying
+to zero by the end of the 10-minute window.
+
+**Quantization.** Every weight in the model is a number stored with some
+number of bits: more bits means more precision, but also a larger file.
+After training at bfloat16 (16 bits per weight), the baseline rounded its
+MLP weights down to int6 (6 bits), shrinking what would otherwise be a
+~70 MB model into something that fits the 16 MB budget. The game is
+managing the tradeoff: aggressive enough to fit, not so aggressive that
+predictions degrade. The baseline's approach was simple: round the biggest
+chunk of parameters and leave everything else alone.
+
+**Post-training adaptation.** The baseline did none. During the 10-minute
+evaluation window, it simply ran the model forward on the validation text
+and recorded the scores. Later models would use this window to actively update
+their weights in response to what they were seeing, a technique called
+test-time training (TTT). The baseline serves as the clean reference point
+before that complication enters.
+
+This baseline scored **1.2244 BPB**. By the end of the competition, the best
+submission had reached **1.0565 BPB**: the same hardware, the same data, the
+same 10 minutes, and a model that had been rebuilt almost from scratch across
+every one of those five components. The rest of this section traces how.
+
+---
+
+### The Evolution
+
+#### Tokenizer
+
+The vocabulary grew in two steps: SP1024 → SP4096 (PR #1218) → SP8192
+(PR #1394). Counterintuitive, since a larger vocabulary means a larger
+embedding table and therefore more parameters. The payoff comes from the
+BPB denominator: it counts *bytes*, not tokens. A token that encodes two
+bytes contributes two bytes to the denominator, so a model that packs more
+bytes per token earns a lower BPB for the same prediction quality. The
+embedding cost is real, but it pays for itself.
+The more interesting tokenizer story is about capitalization. An SP8192
+vocabulary wastes slots on case variants: "the", "The", and "THE" are
+three separate entries for the same word. PR #1578 introduced casefold:
+lowercase everything before tokenizing, freeing up hundreds of duplicate
+slots. The problem: casefold permanently destroys information, and a model
+that cannot recover the original casing cannot correctly score it. It was
+ruled illegal in Issue #1604.
+
+CaseOps (PR #1729) solved this losslessly. Four control tokens are reserved
+in the vocabulary (TITLE, ALLCAPS, CAPNEXT, ESC) and capitalization is
+encoded inline before tokenizing:
+
+```
+"The NASA launched." → "TITLE the ALLCAPS nasa launched."
+```
+
+The original text is fully recoverable. Nothing is discarded. The ~8188
+remaining vocabulary slots are now entirely free of case duplication, and
+the control tokens are cheap to predict (capitalization follows clear
+patterns: sentence starts, acronyms, proper nouns) so the model pays
+very little BPB on them.
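+
+A toy encoder in the spirit of CaseOps (simplified for illustration: the real
+PR #1729 scheme also uses the CAPNEXT and ESC control tokens, which this toy
+ignores, and runs before SentencePiece tokenization):
+
+```python
+def caseops_encode(text):
+    # Mark capitalization inline, then lowercase; decoding inverts the
+    # markers, so the original text is recoverable byte-for-byte.
+    out = []
+    for word in text.split(" "):
+        core = word.strip(".,!?")
+        if len(core) > 1 and core.isupper():
+            out.append("ALLCAPS " + word.lower())
+        elif core.istitle():
+            out.append("TITLE " + word.lower())
+        else:
+            out.append(word.lower())
+    return " ".join(out)
+
+print(caseops_encode("The NASA launched."))  # TITLE the ALLCAPS nasa launched.
+```
+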
XSA removes this self-contribution by projecting
+  it out of the attention output:
+
+  ```
+  standard: y  = Σⱼ αⱼ vⱼ
+
+  XSA:      y′ = y − (y · v̂) v̂    where v̂ = v / ‖v‖
+  ```
+
+  The component of y along the current token's own value vector is
+  subtracted out, forcing the model to actually use context from other
+  tokens. Applied to all layers, it was one of the larger single
+  architectural improvements in the competition.
+
+- **Smaller additions.** A learned SmearGate blending each token with its
+  neighbor (PR #162, reintroduced in PR #1667), a multiplicative attention
+  output gate (introduced in PR #1667, narrowed in PR #1787), a LeakyReLU²
+  MLP activation replacing relu² (PR #493), partial RoPE with layer-norm
+  scaling (PR #315), and sigmoid-gated U-Net skip connections (PR #289).
+
+---
+
+#### Training
+
+Several training hyperparameters were refined from the baseline: the LR
+warmdown schedule gained a minimum floor of 0.1× rather than decaying all
+the way to zero (PR #1787), and the Muon optimizer's internal Newton-Schulz
+iteration was retuned with better polynomial coefficients (PR #1344/#1787).
+The more structural training changes are EMA, the recurrence curriculum,
+and progressive context lengthening.
+
+- **EMA (PR #287).** One of the most impactful single changes in the
+  competition. Instead of evaluating the final checkpoint's weights
+  directly, the model maintains an exponential moving average of all past
+  weight iterates throughout training. The eval model is this running
+  average, not the last step. Because SGD iterates are noisy, the average
+  sits in a flatter, more stable region of the loss landscape and
+  generalizes significantly better. EMA replaced stochastic weight
+  averaging early in the competition; the decay value introduced in PR #287
+  was never revisited through the final SOTA.
+
+- **Depth recurrence curriculum (PR #1420).** The loop over layers 3–5
+  does not activate from the start of training. For the first 35% of
+  wallclock (~3.5 minutes), the model trains as a standard 11-layer
+  network. The loop then switches on and runs for the remainder. The reason
+  is throughput: 17 effective layers is significantly slower per step than
+  11. By training as a standard 11-layer network first, the model gets more
+  gradient steps within the 10-minute budget before paying the cost of the
+  loop.
+
+- **Progressive context (PR #2014).** Rather than training at a fixed
+  sequence length, the final model progressively lengthens context during
+  training: starting at 1024 tokens, moving to 2048 for the bulk of
+  training, and finishing at 3072. This gives the model long-context
+  representations by the end without paying the throughput cost of 3k
+  sequences from step one.
+
+Taken together, the training loop is a carefully choreographed 10 minutes:
+fast 11-layer passes early, the recurrence loop switching on at the 35%
+mark, context growing longer as the clock runs down. Every decision is
+timed to extract the most signal from a budget that ends the moment the
+last second expires.
+
+---
+
+#### Quantization
+
+Quantization and compression happen after the 10-minute training clock
+stops, a separate post-processing step before the artifact is sealed.
+Not everything gets quantized equally. The bulky matrix weights
+(attention projections, MLP weights, and embeddings) dominate the artifact
+size and get quantized aggressively (int6 or int7). The small scalar and
+1D parameters like gains, skip weights, and mixing coefficients are kept
+in full precision: they are too sensitive to round safely and too small to
+matter for the size budget. The baseline rounded MLP weights to int6 using
+simple nearest-neighbour rounding. The final model does something
+considerably more sophisticated.
+
+- **GPTQ (PR #535).** The core insight of GPTQ: when you round a weight,
+  you introduce an error. Instead of ignoring that error, you can
+  compensate for it by adjusting the remaining unquantized weights in the
+  same layer. GPTQ uses the Hessian of the loss (second-order information
+  about how sensitive the output is to each weight) to compute these
+  compensating adjustments. The result is a quantized model that stays
+  much closer to the original's predictions than naive rounding would
+  achieve. Coverage then grew to all weights, including attention
+  (PR #1285), with embeddings quantized at int7 (PR #1394 → PR #1586).
+
+- **LQER (PR #1797).** After GPTQ, some quantization error remains.
+  LQER stores a correction: compute the residual between the original and
+  quantized weights, take a rank-4 low-rank approximation of it, and pack
+  those correction factors into the artifact alongside the quantized
+  weights. The model reconstructs a better approximation of the original
+  weights at inference time. The correction costs ~30 KB of artifact space
+  and recovers a meaningful fraction of the remaining quantization damage.
+
+- **Further refinements.** AWQ-lite (PR #1908) identifies the most
+  activation-sensitive weight columns and quantizes those at int8 rather
+  than int6; Calib32 (PR #2135) doubles the calibration batches for a
+  better Hessian estimate. The quantized weights also go through a
+  compression pipeline, evolving from LZMA to Brotli-11 (PR #1179) and
+  finally to a per-group lrzip+ZPAQ pipeline with L1 row reordering
+  (PR #1855), saving ~280 KB over Brotli-only and fitting meaningfully more
+  model into the 16 MB cap.
+
+---
+
+#### Post-Training / TTT
+
+The 10-minute eval window is not just for scoring. The final model uses
+most of it to actively adapt its weights to the validation text it is
+about to score. This is test-time training (TTT): gradient descent at
+eval time, within the same budget as scoring. The first legal TTT
+appeared in PR #549; the structure that made it to the final SOTA is
+considerably more sophisticated.
+
+The TTT procedure has two nested levels of adaptation.
+
+- **Per-document LoRA (PR #1530).** The base model weights are frozen. For
+  each validation document, a set of low-rank adapter matrices (LoRA) is
+  attached to the key, output, and MLP projections of every layer. The
+  document is processed in small chunks: 8 tokens for short documents, 24
+  for medium, 48 for long (PR #2014). For each chunk: score it first under
+  the current LoRA, then take a gradient step on that chunk to update the
+  LoRA weights. By the time the final chunk is scored, the LoRA has already
+  adapted to the document's style, vocabulary, and content. After the
+  document is done, the LoRA resets; nothing carries over to the next
+  document.
+
+- **Global SGD phase (PR #1610/#1626).** On top of the per-document LoRA,
+  there is a global pause after an initial batch of documents has been
+  scored (2000 in PR #1610, refined to 2500 in PR #1626). At that point, a
+  full SGD pass runs on the base model weights themselves, not just the
+  LoRA, using all the already-scored documents as training data.
The base + model is then updated, the LoRA resets, and the remaining documents are + scored on top of this improved base. The LoRA handles fast local + adaptation per document; the global SGD step shifts the base model toward + the distribution of the validation set as a whole. + +The eval budget splits roughly as ~120 seconds for a standard baseline +scoring pass and ~480 seconds for the TTT loop, the whole thing just +fitting within the 600-second cap. + +Many other submissions attempted additional eval-time interventions beyond +TTT: n-gram tilts, retrieval augmentation, ensemble methods. Several were +ruled illegal. We will examine a few of these in the next section. + +--- + +The final model is a messy, sophisticated combination of all of the above: +significant structural innovations layered on top of dozens of smaller +refinements, each contributing a fraction of the 0.17 BPB gap to the +baseline. + +--- + +## Part 2: The Disqualifications + +At various points in the competition, the open PR list showed submissions +claiming BPB in the 0.8–1.0 range, well below what the merged leaderboard +reflected. These numbers were real in the sense that the code produced +them, but they were not valid scores: the techniques behind them violated +one of the competition's core conditions. We discuss two such techniques +and one data leak here. + +A pretrained model is static. It knows nothing about what has already +appeared in the document it is currently scoring. If a document mentions +"San Francisco" ten times, the model treats the eleventh occurrence the +same as the first. Several submissions tried to close this gap by running +lightweight online statistics alongside the model: track what has appeared +so far in this document, and use that to sharpen the model's predictions. +Two approaches in particular (n-gram tilting and PPM-D) attracted +significant attention, and both ultimately ran into legality problems. + +### N-gram Tilt + +An online n-gram tilt (PR #1145) maintains a prefix-keyed hash table of +token co-occurrence statistics, updated as each token is scored. At each +position, if the recent token history strongly predicts a specific next +token, that token's probability gets a boost before the loss is computed. +No extra artifact bytes, no parameters: just a running table built from +the document itself. + +The implementation used three expert channels (token n-gram, within-word +continuation, and word-start). The token expert was clean. The within-word +and word-start experts had a classic C1 violation (PR #1420): + +```c +const uint16_t tok = tokens[i]; // the target token being predicted +const uint8_t is_boundary = boundary_lut[tok]; +if (!is_boundary && st->within_len > 0U) + within_valid[i] = 1U; // fire the hint at position i +``` + +The gate reads `tokens[i]` before scoring position i, the label, not the +prefix. In the shipped code, 100% of positions where the within-word +expert fired were continuation tokens; no causal system can achieve that. +The fix (PR #1514): disable within-word and word-start entirely, keep only +the token-order-16 expert. That clean expert survived into the final SOTA. + +### PPM-D + +PPM-D (Prediction by Partial Matching) is a classical lossless compression +algorithm. It maintains a trie of byte n-gram counts and at each position +predicts the next byte by looking up the longest matching context, falling +back to shorter contexts when the full history hasn't been seen before. 
It
+was state-of-the-art for text compression before neural networks and is
+particularly effective at within-document repetition: once "San Francisco"
+has appeared several times, the byte sequence becomes highly predictable.
+
+The appeal for parameter golf was direct: BPB is charged at the byte level,
+and PPM-D operates at the byte level. A cluster of submissions (starting
+with PR #1785) mixed PPM-D with the neural net by spreading each token's
+probability uniformly across its bytes (an n-byte token with probability p
+contributing p^(1/n) to each of its byte positions), then taking a convex
+combination with PPM-D's byte predictions. Claimed scores dropped into the
+0.8–1.0 range.
+
+The problem, identified in Issue #1872, is that the spread does not produce
+a valid probability distribution. For any multi-byte token with p < 1,
+p^(1/n) > p: the per-byte contributions are inflated. Summing across all
+tokens that start with a given byte gives more than 1.0. The mixture is not
+normalized, violating C2.
+
+PR #1905 ran the decisive experiment: using the same PPM configuration but
+with a correct byte marginal (the proper way to convert token probabilities
+to byte probabilities), PPM-D was *worse* than the baseline by 0.038 BPB.
+The entire apparent gain came from the invalid spread inflating the NN's
+apparent uncertainty on multi-byte tokens, then giving PPM spurious credit
+for resolving it. The 0.8x BPB figures were an artifact of the scoring
+construction, not a real compression improvement.
+
+## Drama on the Last Day
+
+On April 30th (the day before the competition closed), PR #2014 dropped
+at **1.0576 BPB**, a clean record built on progressive context scheduling.
+Then in the final hours, a flurry of PRs appeared beating it: **1.047**,
+**1.043**, numbers that seemed implausibly good. The field looked wide open.
+
+People started looking more closely.
+
+It turned out that `prepare_caseops_data.py` (the script everyone had
+been copying to build CaseOps datasets since PR #1736) defaulted to
+`--val-docs=10000`. Training started at document 10,000; validation
+covered documents 0–49,999. Eighty percent of the validation set had been
+in training the whole time.
+
+Fifteen PRs were eventually classified as leaky (Issue #2127), most of
+their authors having inherited the data setup from earlier submissions
+without knowing. The fix had actually been discovered and quietly deployed
+eight days earlier, then re-introduced independently on the same day. The
+best detail: the author who first caught and fixed the bug also submitted
+PR #2118 (the most egregiously leaked PR of the competition) whose own
+`submission.json` admitted the overlap in its technique summary field.
+
+When the dust settled, one PR stood as the clean improvement over #2014:
+**PR #2135** at **1.0565 BPB**, by a narrow margin of 0.001. The final
+SOTA of the competition.
+
+Six weeks, two thousand pull requests, a 0.168 BPB drop from a number
+that already looked hard to beat. This competition had everything:
+techniques stacking on each other in ways nobody planned, ingenious
+innovations that shouldn't have worked but did, controversial methods that
+looked like miracles, mayhem on the last day, and a picture-perfect finish.
+One clean submission, one narrow margin, one number that stood.
diff --git a/writeup/draft_v2.md b/writeup/draft_v2.md new file mode 100644 index 0000000000..cf1ffe8d81 --- /dev/null +++ b/writeup/draft_v2.md @@ -0,0 +1,272 @@ +# Parameter Golf: Six Weeks to Build the Best LLM + +--- + +> **What does a 14% compression improvement actually look like?** +> Both models below were trained on the same dataset, in the same 10 minutes, +> on the same hardware. The only difference is what the community built in six weeks. + +*From the FineWeb validation set, same 150-token context fed to both models:* + +| | SP1024 Baseline · 1.22 BPB | Near-SOTA Model · 1.06 BPB | +|---|---|---| +| *"…his driver license and auto insurance had been cancelled in October. This poses a problem for"* | the estate, though internalieness to gets the pass. It is a fat check that's gives ges auto insurance company… that's a **fairyyyyyyyyyyyyyyyyyyyyy** | the estate's creditors who need to work out a payment, especially with the new insurance regulations in october… | +| *"…New host Steve Byrnes was feeling his way through this"* | season. He had just had a short, **bad weekkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk** | season's show, and while he played one of the best nights of his career at the same time, he hadn't been able to really shine since the first Nashville game… | +| *"…Prosecutors now have until"* | Jan. 2, the **Supreen House** prosecutor sentences on a **murded** case… **Mazzzzaglia**… **charies's''''''** | Dec. 18 to sign off on a case after the new hampshire attorney general's office says they have more time… | + +*Prompted with "You are a helpful AI assistant. Please answer the following question." (not very good at general conversation):* + +| | SP1024 Baseline · 1.22 BPB | Near-SOTA Model · 1.06 BPB | +|---|---|---| +| *"The history of the internet began in the 1960s when"* | the Netherlands had established a strong supply of information about the **Canadian government and its supply chains**. The German government, in its own right, is commonly established as a matter of legal… | the internet became the **first electronic commerce tool** that was used by many people, including children, to connect online. the internet is now used by more than 3 billion users worldwide… | +| *"What is 15 multiplied by 13? Let me work it out:"* | **1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 14. 13.** 14. 15. 16. 17. 18. 19. **19.** 20.1 | 15 multiples 13 have 15, for example, 15, for **1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000** | +| *"To bake a chocolate cake, you need flour, eggs, butter, and"* | sugar… you need to have **an infinite wooden cake**. Otherwise, you need to be taken as close to an… | salt… Add 1 tablespoon of flour to 1/4 cup of milk and pour over your cake. Add 1/4 teaspoon of sugar… Cook a lightly toasted chocolate cake. Make cupcakes à 3 | + +--- + +In March 2026, OpenAI released a public competition with a deceptively simple premise: train the best language model you can, but it has to fit in 16,000,000 bytes, and you only get 10 minutes of training time on 8 H100s. Call it parameter golf: every byte counts, every second counts. + +What followed over the next six weeks looked, from the outside, like just another competitive coding challenge. It turned out to be something more: techniques stacking on each other in ways nobody planned, innovations that shouldn't have worked but did, controversial submissions that looked like miracles, mayhem on the last day, and a picture-perfect finish. 
In the end, starting from a model that produces the gibberish above, the community built one that speaks coherently. Just don't ask it anything that isn't in the training data. This post goes through some of the highlights, both the technical and the dramatic.
+
+---
+
+## 1. The Competition
+
+At the core of this competition is a simple question: how well can a model predict text?
+
+A language model is, at its heart, a probability distribution over text. Given a sequence of words (or more precisely, tokens), the model assigns a probability to every possible next token. A well-trained model should assign high probability to tokens that actually appear in real text, and low probability to tokens that don't. Think of it like a well-read person trying to complete sentences. Given "The president signed the ___", they would confidently predict "bill" or "order" and be surprised by "banana." A bad model treats every next word as equally likely. A good model has internalized the patterns of language well enough to be right, or at least close, most of the time.
+
+The models in this competition are trained and scored on FineWeb, a large dataset of cleaned web text. The score is computed on a held-out validation slice that the models never see during training. For each token in that slice, we ask: what probability did the model assign to the token that actually appeared? The cost for a single token is $-\log_2 p(t)$, where $t$ is the correct token. If the model is perfectly confident ($p(t) = 1$) it pays zero cost. If it assigns $p(t) = 0.5$, it pays 1 bit. If it assigns $p(t) = 0.01$, it pays about 6.6 bits. The total score is this cost summed across all tokens, normalized by the number of bytes in the original text:
+
+$$\text{BPB} = \frac{-\sum_k \log_2 p(t_k \mid t_1, \ldots, t_{k-1})}{\text{number of bytes}}$$
+
+This is called bits-per-byte (BPB). To put it in perspective: a model that assigns completely uniform probability across all 256 possible bytes, knowing nothing at all, scores exactly 8 BPB. The baseline OpenAI provided started at **1.2244 BPB**, already far below that, meaning the model had learned real structure in language. Six weeks later, the community had pushed it to **1.0565**, a 14% reduction, achieved purely through algorithmic improvements with no change to the hardware or the data.
+
+---
+
+## 2. Model Evolution
+
+A modern language model is built from a stack of identical blocks. Each block has two components: attention first, then MLP:
+
+```
+# One transformer block
+def block(x):
+    x = x + attention(x)   # tokens communicate: each looks at all others
+    x = x + mlp(x)         # tokens think: each processes what it gathered
+    return x
+
+# Attention: tokens communicate
+def attention(x):
+    Q = x @ W_Q   # what am I looking for?
+    K = x @ W_K   # what do I contain?
+    V = x @ W_V   # what do I share?
+    weights = softmax(Q @ K.T / sqrt(d))
+    return weights @ V
+
+# MLP: tokens think
+def mlp(x):
+    return activation(x @ W_1) @ W_2   # expand, activate, contract
+```
+
+The full model is just these blocks chained one after another:
+
+```
+x → block_1 → block_2 → block_3 → ... → block_N → output
+```
+
+Attention captures how words interact with each other, determining which tokens are relevant to which. The MLP then enriches the meaning of each token individually, using what attention gathered as context. Each block refines the representation a little further. Stack 9 to 11 of them and you have a language model. We will come back to this picture when we discuss depth recurrence.
+
+---
+
+### The Baseline
+
+OpenAI's starting model was already not a simple transformer, not a plain stack of attention and MLP blocks. The baseline was more carefully engineered than that. Let's break it down.
+
+**Tokenizer.** The baseline used SentencePiece with a 1024-token vocabulary (SP1024), trained on the same FineWeb corpus used for scoring. A 1024-token vocabulary is quite small by modern standards; GPT-2 uses 50,000 tokens. The small vocabulary keeps the embedding table compact, which matters when your entire model has to fit in 16,000,000 bytes.
+
+**Model architecture.** The baseline was a 9-layer, 512-dimensional transformer with three structural additions worth calling out:
+
+- **U-Net skip connections:** in a standard transformer, each layer feeds only into the next. The U-Net pattern adds direct connections that skip the middle of the network. The 9 layers are split into an encoder half (layers 1–4), a bottleneck (layer 5), and a decoder half (layers 6–9). Each decoder layer receives the output of its mirror encoder layer as an additional residual, weighted by a learned scalar *w*:
+
+  ```
+  # Standard transformer
+  h = block_1(h); h = block_2(h); ...; h = block_9(h)
+
+  # U-Net: decoder layers receive a skip from their encoder mirror
+  h1 = block_1(h); h2 = block_2(h1); h3 = block_3(h2); h4 = block_4(h3)
+  h5 = block_5(h4)           # bottleneck
+  h6 = block_6(h5 + w * h4)  # skip from layer 4
+  h7 = block_7(h6 + w * h3)  # skip from layer 3
+  h8 = block_8(h7 + w * h2)  # skip from layer 2
+  h9 = block_9(h8 + w * h1)  # skip from layer 1
+  ```
+
+  The early-layer representations, which tend to capture local surface-level patterns, are fed directly into the late layers alongside the deep representations. This is the same idea that made U-Net famous in image segmentation.
+
+- **Grouped-query attention (GQA)** and **rotary positional embeddings (RoPE)**, both standard in modern LLMs. GQA shares key-value heads across query heads to cut parameter count; RoPE encodes position by rotating query and key vectors rather than adding learned position embeddings.
+
+**Training.** The baseline used the **Muon optimizer**, a popular alternative to AdamW. The learning rate follows a warmup-then-warmdown schedule, rising over the first few steps then decaying to zero by the end of the 10-minute window.
+
+**Quantization.** Every weight in the model is a number stored with some number of bits: more bits means more precision, but also a larger file. After training at bfloat16 (16 bits per weight), the baseline rounded its MLP weights down to int6 (6 bits), shrinking what would otherwise be a ~70 MB model into something that fits the budget. The game is managing the tradeoff: aggressive enough to fit, not so aggressive that predictions degrade. The baseline's approach was simple: round the biggest chunk of parameters and leave everything else alone.
+
+**Post-training adaptation.** The baseline did none. During the 10-minute evaluation window, it simply ran the model forward on the validation text and recorded the scores. Later models would use this window to actively update their weights in response to what they were seeing, a technique called test-time training (TTT). The baseline serves as the clean reference point before that complication enters.
+
+This baseline scored **1.2244 BPB**. By the end of the competition, the best submission had reached **1.0565 BPB**, with the same hardware, the same data, the same 10 minutes, and a model that had been rebuilt almost from scratch across every one of those five components. The rest of this section traces how.
+
+---
+
+### The Evolution
+
+#### Tokenizer: CaseOps
+
+The vocabulary grew in two steps: SP1024 → SP4096 (PR #1218) → SP8192 (PR #1394). Counterintuitive, since a larger vocabulary means a larger embedding table. The payoff comes from the BPB denominator: it counts *bytes*, not tokens. A token that encodes two bytes contributes two bytes to the denominator, so a model that packs more bytes per token earns lower BPB for the same prediction quality.
+
+An SP8192 vocabulary still wastes slots on case variants: "the", "The", and "THE" are three separate entries. PR #1578 introduced casefold, which lowercases everything before tokenizing, but this permanently destroys casing information the model needs to score correctly. Ruled illegal in Issue #1604.
+
+CaseOps (PR #1729) solved this losslessly. Four control tokens are reserved (TITLE, ALLCAPS, CAPNEXT, ESC) and capitalization is encoded inline:
+
+```
+"The NASA launched." → "TITLE the ALLCAPS nasa launched."
+```
+
+The original text is fully recoverable. The ~8188 remaining vocabulary slots are now entirely free of case duplication, and the control tokens are cheap to predict, since capitalization follows clear patterns, so the model pays very little BPB on them. SP8192+CaseOps remained the tokenizer frontier for the rest of the competition.
+
+---
+
+#### Model Architecture: Depth Recurrence and Parallel Residuals
+
+In a standard transformer, each layer runs exactly once per token. PR #1204 first introduced depth recurrence, running middle layers in a loop; later refinements settled on three passes over layers 3–5 as the final topology. The same three layers, the same weights, applied three times in sequence:
+
+```
+standard:  1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9 → 10 → 11
+
+with loop: 1 → 2 → 3 → 4 → 5 → 3 → 4 → 5 → 3 → 4 → 5 → 6 → 7 → 8 → 9 → 10 → 11
+                   └──────────────── ×3 ────────────────┘
+```
+
+17 effective processing steps from 11 physical layers, at zero additional parameter cost. This is free test-time compute: the model gets to think harder without growing larger.
+
+From layer 8 onward, the final model also runs attention and MLP in parallel rather than sequentially (PR #1204, PR #1529). In a standard block, MLP sees the output of attention. In a parallel block, both branches see the same input and their results are simply added:
+
+```
+# Standard block
+x = x + attention(x)
+x = x + mlp(x)
+
+# Parallel residual (layers 8–11)
+x = x + attention(x) + mlp(x)
+```
+
+This squeezes more computation out of each of the final layers without adding parameters, and the two branches can run simultaneously. Other architectural changes are summarized in the table at the end of this section.
+
+---
+
+#### Training: EMA and Loop Curriculum
+
+**EMA (PR #287).** Instead of evaluating the final checkpoint directly, the model maintains an exponential moving average of all past weight iterates throughout training. The eval model is this running average, not the last step. SGD iterates are noisy; the average sits in a flatter, more stable region of the loss landscape and generalizes significantly better. The decay value introduced in PR #287 was never revisited through the final SOTA.
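+
+A minimal sketch of the mechanism (a hedged illustration: the PyTorch framing and the decay constant here are assumptions, not the competition code):
+
+```python
+import torch
+
+@torch.no_grad()
+def ema_update(ema_params, model_params, decay=0.999):
+    # ema <- decay * ema + (1 - decay) * current weights, called after every step
+    for e, p in zip(ema_params, model_params):
+        e.mul_(decay).add_(p, alpha=1.0 - decay)
+```
+
+The optimizer keeps updating the live weights as usual; evaluation simply swaps in the EMA copy instead of the last iterate.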
+
+**Loop curriculum (PR #1420).** The recurrence loop does not activate from the start of training. For the first 35% of wallclock (~3.5 minutes), the model trains as a standard 11-layer network. The loop then switches on for the remainder. The reason is throughput: 17 effective layers is significantly slower per step than 11. By delaying the loop, the model gets more gradient steps within the 10-minute budget before paying the cost.
+
+The final training run is a well-choreographed 10-minute dance. The learning rate warms up over the first steps, then holds steady for the bulk of training, then warms down to a floor as the clock runs down. The loop activates at the 35% mark, shifting the model into its deeper recurrent mode. And towards the end, context grows progressively from 1024 to 2048 to 3072 tokens (PR #2014), giving the model long-range representations right when it needs them most. Every decision is timed to extract the most signal from a budget that ends the moment the last second expires.
+
+---
+
+#### Quantization: GPTQ and LQER
+
+Quantization and compression happen after the 10-minute training clock stops, a separate post-processing step before the artifact is sealed. Not everything gets quantized equally. The bulky matrix weights (attention projections, MLP weights, and embeddings) dominate the artifact size and get quantized aggressively (int6 or int7). The small scalar and 1D parameters like gains, skip weights, and mixing coefficients are kept in full precision: they are too sensitive to round safely and too small to matter for the size budget. The baseline rounded MLP weights to int6 using simple nearest-neighbour rounding. The final model does something considerably more sophisticated.
+
+**GPTQ (PR #535).** When you round a weight, you introduce an error. Instead of ignoring that error, GPTQ compensates for it by adjusting the remaining unquantized weights in the same layer, using second-order information about how sensitive the output is to each weight. The result is a quantized model that stays much closer to the original's predictions than naive rounding. This evolved to cover all model weights (PR #1285) and embeddings at int7 (PR #1586).
+
+**LQER (PR #1797).** After GPTQ, some quantization error remains. LQER stores a correction: compute the residual between the original and quantized weights, take a rank-4 low-rank approximation, and pack those correction factors into the artifact. The model reconstructs a better approximation at inference time. The correction costs ~30 KB of artifact space and recovers a meaningful fraction of the remaining quantization damage.
+
+---
+
+#### Post-Training: TTT
+
+The final model uses its 10-minute eval window not just for scoring, but to simultaneously adapt its weights to the text it is seeing, a technique called test-time training (TTT). The TTT here splits into two steps:
+
+**Per-document LoRA (PR #1530).** The base model weights are frozen. For each validation document, a set of low-rank adapter matrices is attached to the model's projections. The document is processed in chunks: score each chunk first, then take a gradient step to update the adapters. By the final chunk, the model has already adapted to that document's style and vocabulary. The adapters reset after each document; nothing carries over.
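+
+A minimal sketch of this score-first loop for a single document (a hedged illustration, not the competition code: the model interface, chunk length, optimizer, and learning rate are all assumptions):
+
+```python
+import torch
+import torch.nn.functional as F
+
+def score_then_adapt(model, lora_params, doc_tokens, chunk_len=24, lr=1e-3):
+    # C3-compliant TTT: every chunk is fully scored BEFORE any update uses it.
+    opt = torch.optim.SGD(lora_params, lr=lr)
+    total_nll = 0.0
+    for start in range(0, doc_tokens.numel() - 1, chunk_len):
+        chunk = doc_tokens[start : start + chunk_len + 1]
+        inputs, targets = chunk[:-1], chunk[1:]
+        logits = model(inputs)              # base weights frozen; LoRA active
+        nll = F.cross_entropy(logits, targets, reduction="sum")
+        total_nll += float(nll)             # 1) score the chunk first...
+        opt.zero_grad()
+        nll.backward()                      # 2) ...then one gradient step on the LoRA
+        opt.step()
+    return total_nll                        # adapters are re-initialized for the next doc
+```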
+ +**Global SGD phase (PR #1610/#1626).** After an initial batch of documents has been scored, a full SGD pass runs on the base model weights themselves, not just the adapters, using all the already-scored documents as training data. The base model is updated, the adapters reset, and the remaining documents are scored on top of this improved base. + +The LoRA handles fast local adaptation per document; the global SGD step shifts the base model toward the distribution of the validation set as a whole. The eval budget splits roughly as ~120 seconds for the baseline scoring pass and ~480 seconds for the TTT loop. + +--- + +### Other Changes + +Many other techniques were introduced over the six weeks that contributed to the final model. We list some of them here. + +| Component | Change | PR | +|---|---|---| +| Architecture | XSA: removes self-copy bias from attention outputs | #265 | +| Architecture | Parallel residuals from layer 8: `h = x + Attn(x) + MLP(x)` | #1204, #1529 | +| Architecture | SmearGate: learned blend of each token with its neighbor | #65, #1851 | +| Architecture | LeakyReLU² replacing relu² in MLP | #493 | +| Architecture | Partial RoPE + layer-norm scaling | #315 | +| Quantization | AWQ-lite: sensitive weight columns promoted to int8 | #1908 | +| Quantization | Calib32: doubled calibration batches for better Hessian | #2135 | +| Quantization | Artifact compression: lrzip+ZPAQ+L1 row reordering | #1855 | + +The final model is a messy, sophisticated combination of all of the above: significant structural innovations layered on top of dozens of smaller refinements, together with countless ablations to find the right hyperparameters for each, each contributing a fraction of the 0.17 BPB gap to the baseline. + +--- + +## 3. Too Good to Be True + +One might imagine the leaderboard as a steady downward curve from 1.2244 to 1.0565 over six weeks. The reality was anything but. Periodically, a submission would appear claiming a score far below the rest of the field, dropping below 1.0, well beyond what any single technique could explain. Others would quickly follow, stacking on top of the same method, while long threads of debate opened about whether the technique was valid at all. As it turned out, they were all too good to be true. We examine the two most important cases here, and the lesson they leave behind. + +--- + +### N-gram Tilt + +An n-gram model tracks token co-occurrence statistics: given the last few tokens, what token tends to come next? The idea behind n-gram tilting (PR #1145) was to run a lightweight n-gram counter alongside the neural model, updated as each token is scored, and use it to boost the probabilities of tokens that the recent history strongly predicts. No extra artifact bytes, no parameters; just a running table built from the document itself. + +While the idea is sound, the implementation had a subtle causality issue.[^rules] The within-word and word-start experts had a classic C1 violation (PR #1420): + +```c +const uint16_t tok = tokens[i]; // the target token being predicted +const uint8_t is_boundary = boundary_lut[tok]; +if (!is_boundary && st->within_len > 0U) + within_valid[i] = 1U; // fire the hint at position i +``` + +The gate reads `tokens[i]` (the token being predicted) before scoring position i. It is like filling in all the answers on an exam, then peeking at the answer key before erasing the wrong ones. A causal system cannot know whether the next token is a continuation token before seeing it. 
Later on, PR #1514 disabled within-word and word-start entirely, keeping only the token-order-16 expert. That clean expert survived into the final SOTA.
+
+
+[^rules]: The competition launched without a complete ruleset. As participants found increasingly creative ways to improve their scores, four constraints were codified mid-competition through community discussion in Issue #1017: **C1 (causal eval):** the probability assigned to token tₖ must depend only on the tokens before it, never the token itself; **C2 (normalized distribution):** the output must be a valid probability distribution summing to exactly 1; **C3 (score before update):** in TTT, a chunk must be fully scored before any gradient step is applied to it; **C4 (single pass):** each token is scored exactly once.
+
+
+---
+
+### PPM-D
+
+PPM-D (Prediction by Partial Matching) is the technique behind the PRs with the most impressive claims: scores in the 0.8–1.0 range, far below anything other techniques offered. It is a classical byte-level compression algorithm that counts byte n-grams: given the last few bytes, what byte tends to come next? It is particularly good at within-document repetition. Think of a Russian novel where a character's long name appears dozens of times; after the first few occurrences, PPM-D can predict the exact spelling almost perfectly, byte by byte. The neural model, by contrast, has no special memory for what has already appeared in this document. The PRs blended PPM-D's predictions with the neural model's: an n-byte token with probability p contributing p^(1/n) to each of its byte positions, then mixing with PPM-D. Claimed scores dropped dramatically.
+
+It turned out the math was rigged. A valid probability distribution must sum to exactly 1 (this is C2). For any multi-byte token with p < 1, p^(1/n) > p: the per-byte contributions are inflated, and summing across all tokens that share a given byte gives more than 1.0. In the exam analogy: assigning 90% to each of four answer options simultaneously. The score looked excellent because the scoring formula was fed an invalid distribution. Why the broken math produced such dramatic gains, and why the correct version is actually *worse* than the baseline, is a more interesting story, explained in PR #1905.[^ppmd]
+
+[^ppmd]: The correct way to convert token probabilities to byte probabilities is to sum over all tokens that share the same byte prefix, weighted by their probabilities. When counted correctly, it turns out that PPM-D yields no gain over the neural model alone. The deeper reason is discussed in the lesson below.
+
+---
+
+### The Lesson
+
+Both cases point to the same underlying reality. A well-trained language model is already a calibrated entropy estimator: where it predicts a flat distribution, the text really is hard to predict; where it is confident, the text really is predictable.[^entropy] The correlation between the model's uncertainty and the true information content is tight. That is exactly why PPM-D and n-gram statistics could not deliver incredible gains. They were identifying the same easy tokens the model already had low entropy on. For an external signal to genuinely help, its errors would need to be *uncorrelated* with the model's: it would need to be uncertain where the model is confident, and vice versa.
+
+There is no silver bullet. The progress that held was incremental, compounding, and hard-won, one careful PR at a time.
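+
+To make the calibration claim concrete, a toy computation of the entropy formula from the footnote below (hypothetical distributions, not competition measurements):
+
+```python
+import math
+
+def entropy_bits(dist):
+    # H = -sum_t p(t) * log2 p(t): how spread out a next-token distribution is
+    return -sum(p * math.log2(p) for p in dist if p > 0)
+
+confident = [0.90, 0.05, 0.03, 0.02]    # predictable text: low entropy
+uncertain = [0.25, 0.25, 0.25, 0.25]    # genuinely hard text: high entropy
+print(entropy_bits(confident))          # ~0.62 bits
+print(entropy_bits(uncertain))          # 2.0 bits
+```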
+
+[^entropy]: The expected entropy at a position is $H = -\sum_t p(t) \log_2 p(t)$, where the sum is over all possible next tokens. This measures how spread out the model's distribution is: high entropy means the model is uncertain; low entropy means it is confident. A well-calibrated model's expected entropy correlates tightly with the actual information content of the text at that position.
+
+---
+
+## 4. Drama on the Last Day
+
+This competition did not go quietly. On April 30th, the day before the competition closed, PR #2014 dropped at **1.0576 BPB**, a clean record built on the new idea of progressive context scheduling. The field had been grinding toward this number for weeks. Shortly after, a flurry of PRs appeared beating it: **1.047** (a 0.011 gap), **1.043** (a 0.015 gap), numbers that seemed implausibly good.
+
+However, the reason turned out to be nothing anyone had anticipated. `prepare_caseops_data.py` (the script everyone had been copying to build CaseOps datasets since PR #1736) defaulted to `--val-docs=10000`. Training started at document 10,000, when it should have started at 50,000. What's wrong with that? The validation set covers documents 0 through 49,999. Starting training at 10,000 meant that 40,000 of the 50,000 validation documents, eighty percent of the set, had been in the training data the whole time. It is like giving students the test questions as homework to prepare for the exam.
+
+Many PRs were eventually classified as leaky (Issue #2127), most of their authors having inherited the data setup from earlier submissions without knowing. As it turned out, the bug had been caught and quietly fixed eight days earlier, then accidentally reintroduced on the very same day. The person who first discovered and patched it was also the one sitting at the top of the leaderboard with that 1.043 score.
+
+With the leaky PRs disqualified, #2014 was restored as the clean SOTA, and with a full day still left on the clock, the door was wide open. A last-hour flurry of clean PRs came in, each trying to be the one to beat it. When the dust settled, one did: **PR #2135** at **1.0565 BPB**, by a margin of just 0.001. One clean submission, one narrow margin, one number that stood. A picture-perfect finish to a competition that had everything.
+
+Six weeks, two thousand pull requests, and a 14% improvement wrung out of the same hardware, the same data, the same ten minutes, through nothing but engineering. One default flag nearly rewrote the ending. Two methods that looked like miracles turned out to be mistakes. And in the end, what remained was a small, sophisticated model built from a careful accumulation of every technique described above.
diff --git a/writeup/outline.md b/writeup/outline.md
new file mode 100644
index 0000000000..58717625cb
--- /dev/null
+++ b/writeup/outline.md
@@ -0,0 +1,352 @@
+# Parameter Golf — Writeup Outline
+
+Substack blog post. Tone: formal for Parts 0–1, semi-formal + funny for Parts 2–3.
+Audience: half-technical, half non-technical.
+
+---
+
+## Part 0 — What Is This Competition?
+
+- **The premise:** 16 MB artifact, 10 minutes training on 8×H100, scored by
+  byte-level BPB on the FineWeb validation set (tokenizer-agnostic). OpenAI
+  sponsoring $1M in compute grants.
+- **The time budget:** 10 minutes training + 10 minutes evaluation — TTT comes
+  out of the eval budget, not extra. If your TTT takes 8 minutes, you have 2
+  minutes left for the actual scoring pass.
+- **How scoring works:** the model outputs a probability distribution over its
+  vocabulary at each token position → take the negative log probability of the
+  correct token → sum across all tokens → divide by total bytes in the document
+  → that's bits-per-byte (BPB). (Internally the competition uses nats first,
+  then converts: 1 nat = 1/ln(2) ≈ 1.4427 bits.)
+- **The framing:** L(N) optimization — fixed parameters, unconstrained
+  compute/data/architecture. Sister challenges: NanoGPT Speedrun (L(T)),
+  NanoGPT Slowrun (L(D)).
+- **The baseline:** 9-layer 512-dim 1024-vocab tied-embedding transformer →
+  **1.2244 bpb**
+- **The end:** 11-layer 512-dim SP8192+CaseOps transformer with depth
+  recurrence, XSA, parallel residuals, SmearGate, LQER, phased TTT, and
+  per-group lrzip compression → **1.0565 bpb** (clean SOTA, PR #2135)
+- **Scoring:** what "byte-level BPB" means and why it matters — the model
+  thinks in tokens, but the judge counts bytes, making the tokenizer choice a
+  first-class compression lever.
+- **The C1–C4 rules** (emerged mid-competition from Issue #1017, not present
+  on day one):
+  - C1: causal eval only — no peeking at future tokens. *Example violation:
+    scoring position i using attention that reaches position i+1.*
+  - C2: full normalized distribution — the scoring must be a real probability
+    distribution over bytes, summing to exactly 1. *Example violation: the
+    PPM-D uniform-spread construction, where p(byte) sums to >1.*
+  - C3: score-before-update — in TTT, a chunk must be fully scored *before*
+    any gradient step touches it. *Example violation: running one AdamW step
+    on a chunk, then scoring it — the model has already adapted to those
+    tokens.*
+  - C4: single pass — each validation token is scored exactly once, no
+    multi-pass re-scoring. *Example violation: scoring a token, updating on
+    it, then re-scoring to get a lower loss.*
+- **Statistical significance bar:** new records require a 3-seed mean beating
+  prior SOTA with a p-value threshold; borderline submissions land as
+  "non-record" entries. This writeup covers the record track only — there
+  were many fascinating non-record submissions (text diffusion, Mamba hybrids,
+  JEPA, 1-bit quantization) that deserve their own treatment.
+
+---
+
+## Part 1 — The Key Techniques
+
+Four pillars. Each gets: the core insight, who pushed it, representative BPB
+numbers, and what it contributed to the final model.
+
+The submission pipeline every serious entrant ran: pre-compile (Triton
+autotune) → training (10 min wallclock) → GPTQ quantization → artifact
+compression → TTT at eval time. Each pillar maps onto a stage of this pipeline.
+
+### 1. Tokenizer / Vocabulary
+
+- SP1024 (baseline) → SP4096 → SP8192 — counterintuitive: a bigger vocabulary
+  *helps* in a tight parameter budget because better tokenization is free
+  compression before the model runs.
+- The failed attempt first: **lossy casefold** — lowercasing all text before
+  tokenizing gives a small vocabulary a big efficiency boost, but destroys
+  case information. Ruled illegal (Issue #1604) because the scorer charges
+  bytes on the original text and you can't recover them.
+- **CaseOps** (romeerp, PR #1729): the lossless answer. Worth explaining the
+  mechanics: standard SP8192 sees "Hello" and "hello" as two different tokens,
+  wasting vocabulary slots on capitalization variants.
CaseOps pre-processes + the text before tokenizing — strip all case to lowercase, tokenize the + compact lowercase stream, then store the original capitalization pattern in a + tiny byte sidecar (a bitmask: 1 bit per alphabetic character saying + "was this uppercase?"). The model trains and scores on the lowercase tokens; + the sidecar bytes are scored by a simple fixed distribution (capitalization + is highly predictable: sentence starts, proper nouns — you can mostly get it + right with a few rules). Net effect: the model's vocabulary is far more + efficient, each token encodes more semantic content per byte, and the sidecar + cost is small because capitalization is easy. Drops the frontier from ~1.07 + to ~1.065 in one shot. +- Why byte-level BPB makes tokenizer choice load-bearing: the scorer charges + per byte regardless of how the model tokenizes, so every token that encodes + more bytes is a free win. +- Brief note on the token-vs-byte scoring controversy: some participants + questioned whether scoring should be done at the byte level directly rather + than converting from token-level log probs. For the official scorer they're + equivalent — it's just accounting — but this became relevant when byte-level + sidecars (like CaseOps) entered the picture. + +### 2. Training Architecture + +Start with what the baseline model already had — it was not naive. Worth +briefly explaining a few: **U-Net skip connections** (encoder layers feed +residuals directly into symmetric decoder layers, giving the model a natural +coarse-to-fine processing structure), **GQA** (grouped-query attention, fewer +KV heads than Q heads → big parameter saving), **BigramHash embeddings** +(learned embeddings for token bigrams, stored in a hash table → better local +context at low param cost), **Muon optimizer** (second-order-ish momentum on +matrix parameters). The competition started from a genuinely strong base. + + +Key architectural innovations in roughly the order they appeared: + +- **XSA (Exclusive Self Attention)** — subtracts the "self-copy" component + from each attention output. Attention tends to just copy a token's own value + back to itself; XSA removes that self-bias, forcing the model to actually + look at other positions. Applied to the deepest 3 layers first (#198), then + eventually all 11. Biggest single jump of the middle phase (~0.01 bpb). + Based on arXiv:2603.09078; good YouTube explainer exists. +- **Depth recurrence (Loop45, later Loop3-5)** — run the same layers multiple + times per forward pass; free test-time compute within the 16 MB budget. + Probably the single largest architectural change in the competition. +- **Variable-length (VarLen) attention** — pack documents of different lengths + into one sequence without padding; enables per-document TTT boundaries. +- **Parallel residuals** — two-lane attention+MLP routing from layer 8+; stacks + cleanly with recurrence. +- **SmearGate + BOS fix** — learned gate blending each token's hidden state + with the previous token's, adding a lightweight bigram-level context signal. + The BOS fix masks this gate at document-start positions to prevent + cross-document leakage; the leak went unnoticed for weeks and cost ~0.002 bpb + when finally fixed. +- **MuonEq-R / Polar-Express Newton-Schulz** — row-normalizes gradient matrices + before Newton-Schulz orthogonalization; zero parameter cost, contributed + ~0.001 bpb improvement. +- **Sparse attention gate** — narrow head-output gate (gate_window=12); late + addition from PR #1787. 
+- **QK-Gain** — per-head learned scaling of query/key dot products, init 5.0+;
+  in the final model and contributed meaningfully throughout the April phase.
+
+Two more training dynamics worth mentioning:
+- **Loop activation schedule** — depth recurrence doesn't run from step 1;
+  it kicks in at `frac=0.35` (35% through the wallclock budget). Before that
+  the model trains as a standard transformer, giving the weights time to settle
+  before the recurrent computation graph is introduced.
+- **Progressive context growth** — sequence length grows deliberately during
+  training, from short sequences early on up to 3k tokens by the end (#2014).
+  Lets the model learn local patterns first before being asked to handle long
+  range. Combined with VarLen packing (no padding), this is a meaningful
+  training-dynamics lever.
+
+*End of section: lay out the full final model architecture from the #2135
+README as a concrete snapshot of what all this stacking produced.*
+
+### 3. Quantization
+
+Comment: AR self-generated calibration data: not sure how much that matters.
+
+The steady thread from day 1 to the end; every record touches quantization.
+
+- int6 → all-int6 → full Hessian GPTQ (Cholesky error compensation, 64-batch
+  calibration).
+- **AR self-generated calibration data** — use the model's own outputs as GPTQ
+  calibration data. Present in multiple records (#1060, #1204) but its isolated
+  contribution is unclear; listed as a component but not ablated cleanly.
+- **GPTQ embeddings** — quantize the embedding table too (int7); previously
+  left in fp16.
+- **AWQ-lite** — activation-aware weight quantization; identifies salient weight
+  groups by activation magnitude and promotes them to int8 instead of int6.
+  Stacks on top of GPTQ, appeared in #1908/#1945 and propagates into the final
+  model.
+- **Calib32** — increase GPTQ calibration batches from default to 32; cheap
+  tuning that measurably improves quantization quality (#2135 is literally
+  named after this).
+- **LQER (Low-rank Quantization Error Reduction)** — after GPTQ, train a
+  rank-4 low-rank residual to correct the rounding error on the top-3 tensors.
+  A post-hoc quant repair, not a training-time technique. (~0.01 bpb gain)
+- **Artifact compression** — lrzip zpaq + L1 similarity-sort row reordering +
+  brotli; the final per-group pipeline in #1855 saved ~280 KB over plain
+  brotli. The last few KB matter when you're bumping against 16 MB.
+- QAT (quantization-aware training) as an alternative path explored early on.
+
+### 4. TTT (Test-Time Training)
+
+The most contested technique in the competition, and the one with the most
+iterative rule-clarification history.
+
+- **LoRA TTT (early, March 19)** — first appearance: fine-tune LoRA adapters
+  on validation data during eval. LoRA TTT *is* legal — the adapter weights
+  don't count against the 16 MB artifact limit because they're discarded after
+  eval. The problem was not LoRA itself but *when* the gradient step happened
+  relative to scoring (C3 violation in early implementations).
+- **The legality fight** — C3 and C4 emerged directly from Issue #1017 after
+  multiple submissions trained on val chunks before scoring them. The key
+  distinction: you're allowed to *learn* from val tokens you've already been
+  graded on, not from ones you're about to be graded on.
+- **Score-first TTT** — the clean formulation: score each validation chunk
+  completely, *then* run a gradient step. Settled by ~April 9, enabled a burst
+  of records (PR #1514, #1529, #1530).
+- **Phased TTT** — break eval into multiple phases; each phase scores a chunk + then updates; multi-phase global SGD + per-doc LoRA reset. The mature form + that ended up in the final model. +- **Warm-start-A TTT** — LoRA initialized from training rather than random; + smaller effective LR needed. +- **Entropy-adaptive epochs** — vary TTT epochs by estimated document + difficulty; more adaptation on harder chunks. +- **No-Q/V TTT mask** — during TTT adaptation, freeze Q and V weights; only + adapt K and other params. Improves stability and reduces overfitting. +- **Short-doc TTT** — preferentially apply TTT to shorter documents, which + have higher per-token uncertainty and respond better to adaptation (#2014). +- **Token-only n-gram tilt** — the legal form of n-gram that eventually worked. + The original kernel (PR #1420) had *three* experts: a token expert (counts + n-gram prefix matches in the already-scored window), a within-word expert + (gates on whether the current position is mid-word), and a word-start expert + (gates on whether it starts a word). The problem: the within-word and + word-start experts read `boundary_lut[tokens[i]]` — the TARGET token's + boundary type — which is non-causal (you can't know if position i is + mid-word until you see what token lands at i). Two of three experts violate + C1 by construction. The token expert is causal by construction — it only + reads the prefix hash table. Disable the two broken ones, keep the one that + works: closed-form `p'(a) = exp(β·1[a=h])·p(a)/Z`, prefix-only. The + surprise ending to the n-gram saga. +- TTT contribution to the final model: pre-quant 1.064 → post-TTT 1.060, + roughly ~0.004 bpb. + +*Timeline of legality rulings would be a good visual here.* +--- + +## Part 2 — The Disqualification Zoo + +Tone: semi-formal, wry. + +### PPM-D: "Free Bits From Arithmetic That Doesn't Sum to 1" +- Background: PPM (Prediction by Partial Matching) is a classical byte-level + compressor. Mix it with a neural model in probability space and you get a + byte-level hybrid scorer. +- The cluster: ~8 PRs (including our own #1885) claiming 0.90–1.014 bpb via a + NN + PPM-D mixture. Flagged in Issue #1872. +- The C2 violation: the uniform-spread construction (`p(t)^{1/n}` per byte) + does not normalize. A toy example: two 2-byte tokens each with p=0.25 gives + `p(first_byte) = 0.5 + 0.5 = 1.0`, not sharing space with any other byte. +- The punchline: under the correct conditional byte distribution, PPM is *not* + better than the baseline — it's worse by ~0.038 bpb. The apparent 0.051 bpb + gain is entirely from the broken scoring rule, not from PPM. + +### Scylla: footnote +- PR #1184 appeared at 0.9485 bpb for three days, removed as "invalid record" + — likely a GitHub merge accident rather than deliberate cheating. Not worth + dwelling on. + +### N-gram: Hard to Do Right +- Multiple attempts: n-gram eval cache, n-gram TTT, n-gram tilt. +- The original kernel (PR #1420) had three experts: token, within-word, + word-start. The within-word and word-start experts read + `boundary_lut[tokens[i]]` — the TARGET token's boundary type. You can't know + if position i is mid-word until you see what token lands there, so two of + three experts are non-causal by design. The C1 violation rate was ~95% of + gated mass, because word/boundary gates fired on almost every token. +- The token expert is causal: it only queries the prefix hash table (tokens the + model has already been graded on). 
Disabling within-word and word-start and keeping only the
+  token expert gives the legal form.
+- The surprise ending: once cleaned up, the token-only tilt *did* eventually
+  work, and appears in the final SOTA record (#2135).
+- Closing heuristic: the failure mode for classical n-gram/byte methods is
+  almost always the same — the classical side either peeks forward (C1) or
+  doesn't produce a normalized distribution (C2).
+
+- General argument for why supplementing NNs with classical methods is so hard:
+  a neural model already captures most of what an n-gram or PPM would tell you.
+  The NN is well-calibrated on its own uncertainty — it knows when it doesn't
+  know. For a classical model to help, it needs to provide *orthogonal*
+  information: something the NN genuinely can't see. But both n-gram and PPM
+  are trying to predict the same thing from the same context, so their signals
+  are highly correlated with the NN's output. You're fighting for a thin slice
+  of orthogonal signal. The one case where token-only n-gram tilt *does* eke
+  out a gain is exact prefix repetition: a very strong prior that "this phrase
+  appeared verbatim 3 tokens ago" that the softmax tends to smooth over. That's
+  a real, narrow orthogonal lever — but most of the classical-method graveyard
+  here is just correlated noise with a legality trap attached.
+
+---
+
+## Part 3 — Last Day Chaos: The CaseOps Val-Set Leak
+
+Tone: funny, forensic.
+
+- **CaseOps** arrives April 18–19 and is genuinely good (+0.003 bpb). Everyone
+  adopts it immediately.
+- **The bug:** `prepare_caseops_data.py` has a `--val-docs=10000` default.
+  Nobody overrides it. All 34 CaseOps-lineage PRs. Zero overrides.
+- **What leaked:** training starts at canonical-stream document 10,000, but the
+  validation set is documents 0–49,999. Documents 10,000–49,999 (80% of the
+  val set) are in both train and val.
+- **The timeline:**
+  - Introduced: PR #1736 (dexhunter, April 19) — first to use
+    `prepare_caseops_data.py` with the default, while reporting a
+    separately-regenerated 50k-doc val set
+  - Fixed: PR #1851 (aquariouseworkman, April 27) — switched to the official
+    HF dataset (`romeerp/parameter-golf-caseops-v1`), disjoint by construction
+  - Re-introduced same day: PR #1855 (codemath3000, April 27) — rebuilt
+    locally with the default, propagating the leak forward
+- **The numbers:** claimed frontier #2118 at **1.04350** vs clean frontier
+  #1851/#1855 at **~1.061**. A ~0.018 bpb gap from memorizing 80% of your exam.
+- **Our own position:** our research baseline (#1736) was leaky. All internal
+  spec measurements (specs 008–302) are internally consistent but live in the
+  leaked world. Absolute numbers not comparable to the clean leaderboard.
+
+---
+
+## Part 4 — Final Leaderboard and Some Thoughts
+
+Two leaderboard updates cap the competition:
+- **PR #1902 (cocohearts, April 29):** official retroactive update — applies
+  the BOS fix, clarifies which CaseOps records are clean, establishes the
+  accepted sequence through 1.0611.
+- **May 2 update (not yet merged to main):** audits the late-April / May 1
+  submissions and adds four more records.
+
+### Scylla: footnote
+- PR #1184 appeared at 0.9485 bpb for three days, then was removed as an
+  "invalid record" — likely a GitHub merge accident rather than deliberate
+  cheating. Not worth dwelling on.
+
+### N-gram: Hard to Do Right
+- Multiple attempts: n-gram eval cache, n-gram TTT, n-gram tilt.
+- The original kernel (PR #1420) had three experts: token, within-word,
+  word-start. The within-word and word-start experts read
+  `boundary_lut[tokens[i]]` — the TARGET token's boundary type. You can't
+  know whether position i is mid-word until you see which token lands there,
+  so two of the three experts are non-causal by design. The C1 violation rate
+  was ~95% of gated mass, because the word/boundary gates fired on almost
+  every token.
+- The token expert is causal: it only queries the prefix hash table (tokens
+  the model has already been graded on). Disabling the within-word and
+  word-start experts and keeping only the token expert gives the legal form.
+- The surprise ending: once cleaned up, the token-only tilt *did* eventually
+  work, and appears in the final SOTA record (#2135).
+- Closing heuristic: the failure mode for classical n-gram/byte methods is
+  almost always the same — the classical side either peeks forward (C1) or
+  doesn't produce a normalized distribution (C2).
+
+- General argument for why supplementing NNs with classical methods is so
+  hard: a neural model already captures most of what an n-gram or PPM would
+  tell you. The NN is well-calibrated on its own uncertainty — it knows when
+  it doesn't know. For a classical model to help, it needs to provide
+  *orthogonal* information: something the NN genuinely can't see. But both
+  n-gram and PPM are trying to predict the same thing from the same context,
+  so their signals are highly correlated with the NN's output. You're
+  fighting for a thin slice of orthogonal signal. The one case where
+  token-only n-gram tilt *does* eke out a gain is exact prefix repetition: a
+  very strong prior that "this phrase appeared verbatim 3 tokens ago" that
+  the softmax tends to smooth over. That's a real, narrow orthogonal lever —
+  but most of the classical-method graveyard here is just correlated noise
+  with a legality trap attached.
+
+---
+
+## Part 3 — Last Day Chaos: The CaseOps Val-Set Leak
+
+Tone: funny, forensic.
+
+- **CaseOps** arrives April 18–19 and is genuinely good (a ~0.003 bpb gain).
+  Everyone adopts it immediately.
+- **The bug:** `prepare_caseops_data.py` has a `--val-docs=10000` default.
+  Nobody overrides it — all 34 CaseOps-lineage PRs, zero overrides.
+- **What leaked:** training starts at canonical-stream document 10,000, but
+  the validation set is documents 0–49,999. Documents 10,000–49,999 (80% of
+  the val set) are in both train and val (sanity check below).
+- **The timeline:**
+  - Introduced: PR #1736 (dexhunter, April 19) — first to use
+    `prepare_caseops_data.py` with the default, while reporting a separately
+    regenerated 50k-doc val set.
+  - Fixed: PR #1851 (aquariouseworkman, April 27) — switched to the official
+    HF dataset (`romeerp/parameter-golf-caseops-v1`), disjoint by
+    construction.
+  - Re-introduced the same day: PR #1855 (codemath3000, April 27) — rebuilt
+    locally with the default, propagating the leak forward.
+- **The numbers:** claimed frontier #2118 at **1.04350** vs the clean merged
+  frontier #1851/#1868 at **~1.061**. A ~0.018 bpb gap from memorizing 80%
+  of your exam.
+- **Our own position:** our research baseline (#1736) was leaky. All internal
+  spec measurements (specs 008–302) are internally consistent but live in the
+  leaked world. Absolute numbers are not comparable to the clean leaderboard.
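+
+The overlap arithmetic as a quick sanity check (doc indices as described
+above; no real data involved):
+
+```python
+VAL_DOCS = range(0, 50_000)   # validation: canonical-stream docs 0–49,999
+TRAIN_START = 10_000          # training begins at doc 10,000 (the default)
+
+overlap = [d for d in VAL_DOCS if d >= TRAIN_START]
+print(len(overlap))                   # 40000 docs appear in both splits
+print(len(overlap) / len(VAL_DOCS))   # 0.8 — 80% of the val set is in train
+```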
+
+---
+
+## Part 4 — Final Leaderboard and Some Thoughts
+
+Two leaderboard updates cap the competition:
+- **PR #1902 (cocohearts, April 29):** official retroactive update — applies
+  the BOS fix, clarifies which CaseOps records are clean, and establishes the
+  accepted sequence through 1.0611.
+- **May 2 update (not yet merged to main):** audits the late-April / May 1
+  submissions and adds four more records.
+
+The final accepted record sequence, bottom to top:
+
+| BPB | PR | Author | Key techniques |
+|--------|--------|-------------------|----------------|
+| 1.0565 | #2135 | codemath3000 | Calib32 + token-only n-gram tilt + AsymLogit ← **final SOTA** |
+| 1.0567 | #2130 | TanishGudise | Token-only n-gram tilt + AsymLogit + one-phase TTT |
+| 1.0576 | #2014 | simonbissonnette | Progressive context growth to 3k + short-doc score-first TTT |
+| 1.0586 | #1953 | andrewbaggio1 | Long-context no-Q/V TTT + QK-Gain 5.25 |
+| 1.0594 | #1945 | alertcat | AWQ-lite GPTQ + AsymLogit on #1855 stack |
+| 1.0611 | #1855 | codemath3000 | BOS-fixed SmearGate + LQER + SparseAttnGate + 9-hparam stack |
+| 1.0614 | #1851/#1868 | aquariouseworkman | BOS-fixed SmearGate + LQER Asymmetric + Phased TTT |
+| 1.0634 | #1787 | nprime06 | PolarNS + MIN_LR + SparseAttnGate + Warm-A TTT |
+| 1.0645 | #1769 | dexhunter | CaseOps + MLPClip12 + SmearGate/LoRA-TTT |
+| 1.0655 | #1736 | dexhunter | SP8192 + CaseOps + GatedAttn + Loop45 + Phased TTT |
+| 1.0678 | #1729 | romeerp | CaseOps tokenizer + tapered WD |
+| 1.0714 | #1667 | MarioPaerle | SmearGate + AttnOutGate + Legal TTT |
+| 1.0719 | #1626 | dexhunter | VarLen + fused MLP + multi-phase global SGD TTT |
+| 1.0728 | #1610 | romeerp | VarLenAttn + phasing TTT |
+| 1.0734 | #1530 | samacqua | VarLen FA3 + fused Triton MLP + doc-independent LoRA TTT |
+| 1.0758 | #1529 | msisovic | Parallel residuals + CUTLASS EVT + legal TTT |
+| 1.0798 | #1514 | dexhunter | SP8192 + Muon 0.97 + legal score-first TTT |
+| 1.0810 | #1493 | bigbag | SP8192 + 3-layer recurrence + parallel residuals + TTT |
+| 1.0822 | #1477 | aryanbhosale | SP8192 + parallel residuals + score-first TTT |
+| 1.0828 | #1413 | dexhunter | SP8192 + QK-Gain 5 + legal TTT |
+| 1.0856 | #1394 | Kevin Clark | SP8192 + GPTQ embeddings + Loop45×2 + SDClip |
+| 1.0897 | #1334 | aryanbhosale | SP4096 + depth recurrence + parallel residuals + MuonEq-R |
+| 1.0912 | #1285 | dexhunter | MuonEq-R + depth recurrence + WD=0.09 + all-int6 |
+| 1.0979 | #1218 | Kevin Clark | SP4096 + 4× MLP + high WD |
+| 1.1063 | #1204 | msisovic | Parallel residuals + mini depth recurrence |
+| 1.1122 | #1060 | dexhunter | Coprime loader + full Hessian GPTQ + XSA-all |
+| 1.1147 | #1019 | abaybektursun | AR self-gen GPTQ + XSA-all |
+| 1.1194 | #549 | abaybektursun | LeakyReLU² + legal TTT + parallel Muon |
+| 1.1228 | #374 | signalrush | EMA + GPTQ-lite + warmdown3500 |
+| 1.1248 | #287 | jfprincz | Partial RoPE + LN scale + EMA + XSA4 |
+| 1.1271 | #198 | jfprincz | XSA4 + EMA + int6 MLP3× |
+| 1.1307 | #198 | unnir | Efficient partial XSA (deepest 3 layers) |
+| 1.1428–1.1556 | various | various | BigramHash, SmearGate, int6 QAT, SWA, OrthoInit |
+| 1.1630–1.1925 | various | various | Sliding window eval, mixed quant, Muon WD, 10L |
+| 1.2244 | baseline | OpenAI | 9L 512d 1024-vocab |
+
+- 1.2244 → 1.0565: a drop of **0.168 bpb** over six weeks of community
+  effort.
+- Reflection: what worked (stacking, tokenizer bets, VarLen+TTT, progressive
+  context), what surprised (a larger vocab helping in a constrained budget),
+  what didn't (PPM, lossy casefold, anything that required cheating to look
+  good).
diff --git a/writeup/outline_v2.md b/writeup/outline_v2.md
new file mode 100644
index 0000000000..b872048bc3
--- /dev/null
+++ b/writeup/outline_v2.md
@@ -0,0 +1,84 @@
+# Parameter Golf — Blog Post v2 Outline
+
+## Angle and Tone
+
+- **Reporter, not insider.** Tell the story from the outside looking in. The
+  reader doesn't need to care about BPB or quantization to stay engaged —
+  techniques are plot devices, not the subject.
+- **Exciting, not technical.** Less mechanism, more drama. The score numbers,
+  the leaderboard moves, the rulings — these carry the narrative.
+- **Dynamics, not just explanation.** There's tension (who's winning?),
+  mystery (how did they score 0.9?), revelation (it was a bug), and
+  vindication (the careful engineers win anyway).
+
+---
+
+## Structure
+
+### 0. The Hook
+
+OpenAI released a public competition: train the best language model that fits
+in 16 MB, in 10 minutes, on 8 H100s. Small stakes on paper. But what unfolded
+over six weeks had everything — cutting-edge technique, controversy, and a
+lesson about what it actually takes to make progress at the frontier.
+
+*Today we give an overview of what happened.*
+
+---
+
+### 1. The Competition
+
+Keep it dense; skip the C1–C4 rule taxonomy. Two things to convey:
+
+- **What is being scored.** A language model is a probability distribution
+  over text. Better model = closer to the true distribution = lower
+  bits-per-byte. Analogy: imagine you're trying to guess the next word in a
+  sentence — a good model is one that's rarely surprised.
+- **What makes it hard.** You have 16 MB and 10 minutes. Every byte spent on
+  model weights is a byte not spent elsewhere. You have to be clever.
+
+No more than 2–3 short paragraphs.
+
+---
+
+### 2. The Model
+
+Three beats:
+
+1. **What an LLM actually is.** Attention + MLP layers stacked on top of each
+   other. One sentence on each. Keep it concrete — attention lets each token
+   look at the tokens before it; the MLP is where the model "thinks" about
+   what it saw.
+
+2. **The baseline.** OpenAI's starting model was already not a simple
+   transformer. Walk through the key baseline features briefly. Note the
+   starting score: 1.2244 BPB.
+
+3. **The evolution.** Spotlight: **depth recurrence** — run layers 3–5 in a
+   loop, multiple passes per forward step. Free test-time compute within the
+   16 MB budget. Activates at 35% of training wallclock for throughput
+   reasons. Intuitive analogy: instead of reading a sentence once, you read
+   it again with what you just learned. (Reference sketch below.)
+
+   Everything else (XSA, GPTQ, CaseOps, phased TTT, parallel residuals,
+   SmearGate, etc.) demoted to a **footnote** or compact table. The point is
+   the stacking story, not an exhaustive list. Final SOTA: 1.0565 BPB.
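+
+For the writer's reference, a minimal sketch of the depth-recurrence idea
+(layer indices, pass count, and all names are hypothetical; this is not the
+competition code):
+
+```python
+import torch.nn as nn
+
+class DepthRecurrentStack(nn.Module):
+    """Run a middle slice of layers several times per forward pass:
+    extra compute at eval time, zero extra parameters in the 16 MB budget."""
+
+    def __init__(self, layers, loop_lo=3, loop_hi=6, passes=2):
+        super().__init__()
+        self.layers = nn.ModuleList(layers)
+        self.loop_lo, self.loop_hi = loop_lo, loop_hi  # loop over layers 3–5
+        self.passes = passes
+
+    def forward(self, x):
+        for layer in self.layers[:self.loop_lo]:       # early layers, once
+            x = layer(x)
+        for _ in range(self.passes):                   # "read it again"
+            for layer in self.layers[self.loop_lo:self.loop_hi]:
+                x = layer(x)
+        for layer in self.layers[self.loop_hi:]:       # late layers, once
+            x = layer(x)
+        return x
+```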
+
+---
+
+### 3. Too Good to Be True
+
+**Hook:** The leaderboard wasn't just inching forward. Periodically, a
+submission would appear claiming a score far below anything else — something
+in the 0.9x range, shattering the rest of the field. Everyone noticed.
+
+Two main drivers:
+
+- **N-gram tilt.** A small external model that nudged the probability
+  distribution toward likely next tokens, based on n-gram statistics. In
+  theory legal. In practice, the implementation contained a bug: the boundary
+  lookup peeked at the very token it was trying to predict (a C1 violation).
+  The gain evaporated once the bug was identified and fixed. A clean version
+  survived but contributed only modestly.
+
+- **PPM-D.** A classical byte-level compression algorithm bolted on as a
+  second opinion. PPM-D tracks context patterns and makes predictions; the
+  idea was to blend its distribution with the model's. Initial results were
+  dramatic. On inspection, the blending formula didn't produce a valid
+  probability distribution — it didn't sum to 1 (a C2 violation). The gain
+  was an artifact of the broken math. A corrected submission turned out to be
+  *worse* than the base model alone.[^ppmd]
+
+**The lesson.** These weren't acts of bad faith — they were honest mistakes
+caught by careful peer review. But they point to something deeper. A
+well-trained language model is already a calibrated entropy estimator: where
+it predicts a flat distribution, the text really is hard to predict. The
+correlation between the model's uncertainty and the true information content
+is tight. PPM-D and n-gram statistics track exactly the same "easy" tokens
+the NN already handles well. For an external signal to help, its errors would
+have to be *uncorrelated* with the model's — it would need to be confident
+where the NN is uncertain. That turns out to be extremely hard to achieve.
+There is no silver bullet. The progress in this competition was incremental,
+compounding, and hard-won.
+
+[^ppmd]: The PPM-D case is more technically interesting — a longer discussion
+of why the blending formula fails and what a correct version would need to
+look like is left for a future post / appendix.
+
+---
+
+### 4. Drama on the Last Day
+
+Set the scene: the field heading into the final days, front-runners clustered
+around 1.05x BPB.
+
+Then: someone noticed something odd. A submission that had quietly dominated
+the leaderboard for weeks was retested — and the score looked too clean.
+Investigation followed.
+
+**The analogy:** training the model on the exam itself. The validation set
+used to score submissions had a default flag — `--val-docs=10000` — that
+caused an 80% overlap with the training data. Anyone who had used the
+standard CaseOps data setup since PR #1736 was, without knowing it, training
+on the held-out test documents. A good score, but not a fair one.
+
+Once the overlap was quantified, a wave of front-runners was disqualified.
+Almost everything submitted since the CaseOps era was tainted. The
+leaderboard reshuffled.
+
+What remained: a small set of clean submissions, and an open door.
+
+**Ending with flair.** The competition closed with a picture-perfect finish —
+the final accepted submission (PR #2135) was clean, incrementally better than
+anything else that remained standing, and arrived courtesy of a single
+well-placed hyperparameter change on top of six weeks of careful engineering.
+Exactly the kind of finish the competition deserved.
+
+---
+
+## Thematic Arc
+
+- People tried shortcuts (n-gram exploits, broken compression hybrids,
+  tokenizer tricks).
+- The shortcuts that looked too good were either bugs or violations.
+- The gains that held came from making the transformer itself more
+  expressive.
+- The competition was a proof-by-exhaustion that there is no clever hack that
+  replaces careful engineering at the frontier.
+- The final word belongs to the model.