{RECORD} CaseOps pre-quant TTT record (1.0354 BPB) #1911
Conversation
…ams) After 4 parallel research agents reviewed 30+ open PRs and compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) was flagged "empirical negative" by sunnypatneedi's 4-29 frontier-scan, BUT only on the PR openai#1855 base with default WD=1.0. It was never tested on the PR openai#1908 + WD=2.0 combo, so V19's specific stack is NOT directly invalidated.
2. PR openai#1925 (simon-marcus) reached 1.06049 (3-seed verified, vs the PR openai#1855 base at 1.06108 = -0.00059 BPB) from just two hparam env vars: `MATRIX_LR` 0.026 -> 0.028 and `PHASED_TTT_PREFIX_DOCS` 2500 -> 3500. This is an orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:

- `run_v19c_stacked_scout.sh`: PR openai#1908 + AsymLogit + simon-marcus + WD=2.0 (full stack, recommended first scout)
- `run_v19b_simonmarcus_scout.sh`: PR openai#1908 + simon-marcus + WD=2.0 (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):

- V19c < 0.97591 -> CLEAR WIN, run 3-seed
- V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
- V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:

- PR openai#1898 (SpinQuant) flagged as a regression vs parent openai#1851 (skip)
- PR openai#1929 (SLOT) banned per the openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per the openai#1735 precedent
- cocohearts' 4-28 PR openai#1902 confirmed PR openai#1855 as the official openai#1
- regina-openai + Alex Zhao: 48h of zero activity
- CaseOps de facto legal (PR openai#1855 merged into the chain)
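The decision rule above is mechanical enough to encode directly. A minimal sketch — the function name is mine, and since the borderline band as quoted (0.97591–0.9755) looks garbled, I assume it means the band between the win threshold and the 0.97651 baseline:

```python
def v19c_decision(v19c_val_bpb: float) -> str:
    """Hypothetical encoding of the quoted decision rule.

    Thresholds come from the comment: CaseOps val baseline 0.97651,
    community floor 0.0006, so a clear win is anything below
    0.97651 - 0.0006 = 0.97591.
    """
    if v19c_val_bpb < 0.97591:
        return "CLEAR WIN: run 3-seed V19c"
    if v19c_val_bpb <= 0.97651:  # ASSUMED borderline band, up to the baseline
        return "borderline: ablate via V19a/V19b"
    return "abandon stack: try Lead B (PR openai#1884)"
```
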
|
Independent reproduction at seed=42 (8×H100 SXM, …).
Final number sits within σ of the reported s42 (3-seed std 0.00057) and within 2σ of the 3-seed mean (1.03540). Training stopped via …. Stack works as documented: SP8192 + CaseOps + PreQuantTTT 21ep + GPTQ INT6 + Brotli + V15 byte sidecars. No SLOT, no PPM mixer, no eval-time-update tricks. Posting for the maintainer review queue — happy to share the raw logs if useful.
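The σ arithmetic here can be checked mechanically. The reproduced seed-42 value itself is truncated out of this excerpt, so the `repro_s42` figure below is a placeholder, not the reported number:

```python
# 3-seed statistics quoted in the thread.
mean_bpb, std_bpb = 1.03540, 0.00057
reported_s42 = 1.03618   # s42 reading quoted in a later comment

repro_s42 = 1.0365       # PLACEHOLDER: the actual reproduced value is
                         # truncated out of this excerpt

within_sigma_of_s42 = abs(repro_s42 - reported_s42) <= std_bpb
within_2sigma_of_mean = abs(repro_s42 - mean_bpb) <= 2 * std_bpb
```
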
|
Thanks for the independent reproduction @yaowubarbara . This is very helpful. Your seed-42 result is close to the reported seed-42 run and confirms that the record folder is mechanically reproducible under the stated stack and budget. I appreciate you checking artifact/budget behavior as well. The remaining question is the rule interpretation around pre-quant validation TTT. I tried to document that caveat explicitly in the PR; if maintainers decide that this class is invalid under the score-before-update reading, I agree it should be judged consistently with the rest of the pre-quant TTT line. Since I do not have any more funds, I believe this is my last PR.
|
Thanks @dttdrv — credit for the clean lineage. The reproduction was straightforward precisely because you documented the stack and env vars completely (and the V15 byte-sidecar accounting made the BPB chain easy to verify). On the score-before-update reading: agree it's a maintainer call. I sketched a small CPU-only behavioral-probe library (yaowubarbara/pgolf-validator) for the C1 / C2 conditions; happy to extend it toward C3 if useful for the ruling thread. Best of luck with whatever's next.
|
Follow-up to the reproduction comment above: I ran four single-seed ablations on top of your stack today to characterize the LeakyReLU² and QK-Gain axes on the PR #1911 base, since @bsisduck's PR #1970 ran the global-slope ablation on a related base (and confirmed slope 0.5 there) and @mikeapedia's PR #1648 explored both per-layer activation coefficients and softer per-layer QK-Gain on a third base via xIELU. These are ablations on top of your stack — same SP8192 + CaseOps + 21ep PreQuantTTT + GPTQ INT6 + Brotli + V15 byte-sidecar pipeline, seed=42 — not corrections to your PR.

What changed across the four runs

Single-seed (seed=42) results
E1b sits ~+0.00032 above the s42 reading you reported (1.03618), within σ (your 3-seed std 0.00057). All four runs share the same training and TTT budgets and finished within compliance.

The harvested per-layer slope profile (E2 after PreQuantTTT)

This single harvested profile was inverse-U-shaped — early layers below 0.5, middle layers above 1, late layers around 0.7–1. Qualitatively similar to mikeapedia's per-layer xIELU coefficient pattern on a different base.

Tentative reading at single seed
The qualitative ordering E4a ≈ E1b < E3 < E2 is consistent across pre-EMA, post-TTT, and final sliding eval, but the absolute deltas of 0.00036 (E4a) and 0.00255 (E3) are 0.6× and 4.5× the reported inter-seed std. A 3-seed re-validation would be needed before any of these orderings could be called robust.

Caveats worth knowing
One-line summary

On the PR #1911 stack at seed=42, none of (per-layer learnable LeakyReLU², frozen inverse-U LeakyReLU², QK_GAIN_INIT=3.0) materially improves over your fixed defaults; E4a essentially reproduces s42, and E2/E3 lose by 0.0025–0.0036. For anyone with budget for the multi-iteration convergence-loop methodology mikeapedia used in PR #1648, the E2-harvested profile above is a reasonable iter-1 init. The patches are reproducible from the description above (a 17-line MLP-class modification for E2/E3 and a single …).
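For readers without the patch in front of them, a rough sketch of what a frozen inverse-U LeakyReLU² (E3-style) setup looks like. This is my reconstruction from the description above, not the actual 17-line patch, and the slope values are illustrative, not the harvested E2 profile:

```python
import numpy as np

def leaky_relu_sq(x, slope):
    # LeakyReLU^2: square of a LeakyReLU with a per-layer negative slope.
    y = np.where(x >= 0, x, slope * x)
    return y * y

# Illustrative frozen inverse-U slope profile over 12 layers, shaped to
# match the harvested-profile description: early layers below 0.5,
# middle layers above 1, late layers around 0.7-1.
n_layers = 12
t = np.linspace(0.0, 1.0, n_layers)
slopes = np.interp(t, [0.0, 0.5, 1.0], [0.35, 1.1, 0.85])
```

In a real run each transformer block's MLP would call `leaky_relu_sq` with its layer's slope, either frozen (E3) or learnable (E2).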
|
Since the data the optimizer trains on is val_data.val_tokens, doesn't that make this invalid?
|
@aquariouseworkman Yes, I think this is the same C3 issue that caused #1958 and #1992 to be withdrawn. The original intent here was to follow the pre-quant TTT precedent from that line of PRs: adapt before final artifact export, then evaluate a fixed artifact. But under the stricter score-before-update reading, that distinction does not save it, because the model state behind the reported score has already been optimized on `val_data.val_tokens`.
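To make the ordering concrete, a toy sketch of the two readings; every name here is a stand-in, nothing is from the repo:

```python
def pre_quant_ttt_pipeline(model, val_tokens, adapt, quantize, score):
    # Track A reading (PR #1735 / #1738): adaptation is part of producing
    # the fixed artifact, and the scorer only ever sees a frozen predictor.
    adapted = adapt(model, val_tokens)   # optimizer sees the val tokens here
    artifact = quantize(adapted)         # export a fixed artifact
    return score(artifact, val_tokens)   # score AFTER the update

# Under the stricter score-before-update reading this is invalid anyway:
# the model state behind the reported score was already optimized on the
# same val_tokens it is scored on, and the quantize/freeze step in between
# does not change that.
```
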
Summary
This PR submits the CaseOps V15 record stack for the `track_10min_16mb` track.

- 3-seed mean val_bpb: `1.03540487` (std `0.00056684`), seeds `1337, 42, 999`
- Artifact sizes: `15,994,993` to `15,996,195` bytes
- Seed `1337` reached `1.03459029` BPB with a `15,996,563`-byte artifact on 2026-04-28/29
- Titled `{RECORD}` because it clears the threshold versus PR "Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean)" #1735's `1.04290` BPB result

I am being explicit about the provenance here: this is a community stack, not a "one weird trick" claim. The core move is combining PR #1735's parallel pre-quant TTT stack with PR #1729's CaseOps tokenizer/byte-sidecar path, as integrated in PR #1738, and independently reproducing it.
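The threshold arithmetic, using only numbers quoted in this PR (a quick sanity check, not part of the record tooling):

```python
import math

# The community floor is 0.005 nats per byte; converting nats to bits
# divides by ln(2).
floor_bpb = 0.005 / math.log(2)      # ~0.00721 BPB

prev_record = 1.04290                # PR #1735, 3-seed mean
this_pr = 1.03540487                 # this PR, 3-seed mean

improvement = prev_record - this_pr  # ~0.00750 BPB
clears_threshold = improvement > floor_bpb
```
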
Why I Did This
The frontier PRs pointed to two large, mostly orthogonal levers:
Those two ideas should compose. CaseOps makes the sequence modeling problem cleaner; pre-quant TTT spends the remaining time budget adapting the full-precision model to that cleaner target before export. The non-trivial integration work is that CaseOps cannot use naive decoded-token byte counting. It needs byte sidecars for honest BPB.
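A minimal sketch of the byte-sidecar BPB idea; this is my illustration, not the record's actual helper, and the function and argument names are mine:

```python
import math

def val_bpb(total_nll_nats, sidecar_byte_counts):
    # BPB = total NLL in bits / total ORIGINAL bytes. The byte counts come
    # from sidecar files written at export time, not from decoding tokens
    # back to text, since the CaseOps transform means decoded-token byte
    # counting would not match the original bytes.
    total_bytes = sum(sidecar_byte_counts)
    return total_nll_nats / (math.log(2) * total_bytes)
```

For example, a model that spends exactly ln 2 nats per original byte scores 1.0 BPB.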
What This PR Adds
This record folder is based on PR #1738's CaseOps V15 integration:
- `load_validation_token_bytes()`
- `eval_val()`, `eval_val_sliding()`, and `eval_val_ttt()`
- excludes `_bytes_` files from token-shard loading to avoid double-counting
- `romeerp/parameter-golf-caseops-v1`

Results
The previous frontier stack I am comparing against is PR #1735 at `1.04290` BPB. This improves on it by about `0.00750` BPB, just over the `0.005` nats / `0.00721` BPB threshold.

Independent reproduction from this same record folder:
Reproduction checkpoints:
- `588132` ms, step `4568/20000`: `val_bpb = 1.08389912`
- post-prequant-ttt: `val_bpb = 1.02819756`
- `val_bpb = 1.04801825`
- final: `val_bpb = 1.03459029`, artifact `15,996,563` bytes

Technique Stack
`mlp_mult=4.0`

Full Lineage / Credits
I read the upstream PR chain and am intentionally not reducing this to a short credit list. The exact runtime stack is a compressed script, so not every ancestor appears as a neat isolated function anymore, but these are the PRs I traced as leading to this record's components or to the parent PRs used here.
Compliance Notes
This is submitted under the same Track A interpretation as PR #1735 and PR #1738:
The sensitive part is pre-quant TTT on validation chunks. I am not hiding that. I am submitting this consistently with the Track A framing used by PR #1735 / PR #1738: adaptation is part of producing the fixed artifact, and the scorer sees a fixed predictor. If maintainers decide that interpretation is not allowed, this line should be judged consistently with those PRs.
Dependencies / External Data
The challenge README allows packages/imports as long as they do not violate the evaluation, compute, training-time, code-size, or other restrictions, and asks record folders to include dependency/setup notes. I added a `requirements.txt` to this record folder for manual setup.

For clarity:
- `romeerp/parameter-golf-caseops-v1` is used as the public CaseOps tokenizer/data export for training setup, before `train_gpt.py` runs.
- The training script imports `torch`, `numpy`, `sentencepiece`, and `brotli`; it tries FlashAttention 3 when available in the official H100 image and otherwise falls back to the PyTorch attention path.
- `huggingface-hub` and `hf_transfer` are only for fetching the public CaseOps dataset/tokenizer during setup.

So no, I am not relying on an external service at eval time. The only external piece is the documented public dataset/tokenizer setup needed to reproduce the training run, in the same spirit as the repository's normal data download flow.
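As a concrete illustration, the record folder's `requirements.txt` presumably lists something like the following; this is inferred from the packages named above, and since no versions are quoted anywhere in the thread, none are pinned here:

```text
torch
numpy
sentencepiece
brotli
huggingface-hub
hf_transfer
```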
Reproduction
Test Plan
- Seeds `1337, 42, 999`
- Seed `1337` reproduction on 2026-04-28/29