{RECORD} CaseOps pre-quant TTT record (1.0354 BPB) by dttdrv · Pull Request #1911 · openai/parameter-golf

dttdrv · 2026-04-28T22:29:52Z

Summary

This PR submits the CaseOps V15 record stack for track_10min_16mb.

3-seed mean val_bpb: 1.03540487 (std 0.00056684)
Seeds: 1337, 42, 999
Artifact range: 15,994,993 to 15,996,195 bytes
Independent reproduction: seed 1337 reached 1.03459029 BPB with a 15,996,563 byte artifact on 2026-04-28/29
Title change: this is marked {RECORD} because it clears the threshold versus PR Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean) #1735's 1.04290 BPB result

I am being explicit about the provenance here: this is a community stack, not a "one weird trick" claim. The core move is combining PR #1735's parallel pre-quant TTT stack with PR #1729's CaseOps tokenizer/byte-sidecar path, as integrated in PR #1738, and independently reproducing it.

Why I Did This

The frontier PRs pointed to two large, mostly orthogonal levers:

Pre-quant TTT was the biggest optimization lever. Instead of trying to make post-quant TTT work after GPTQ has already crushed the degrees of freedom, PR Record: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean) #1364 and then PR Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean) #1735 adapt the full-precision EMA model first, then quantize the adapted model into a fixed artifact.
CaseOps was the cleanest data/tokenizer lever. PR Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729 showed that capitalization can be represented as a reversible side channel over a lower-case lexical stream. That reduces avoidable case fragmentation while still evaluating BPB against the original raw bytes.

Those two ideas should compose. CaseOps makes the sequence modeling problem cleaner; pre-quant TTT spends the remaining time budget adapting the full-precision model to that cleaner target before export. The non-trivial integration work is that CaseOps cannot use naive decoded-token byte counting. It needs byte sidecars for honest BPB.

What This PR Adds

This record folder is based on PR #1738's CaseOps V15 integration:

Adds load_validation_token_bytes()
Threads byte sidecars through eval_val(), eval_val_sliding(), and eval_val_ttt()
Excludes _bytes_ files from token-shard loading to avoid double-counting
Uses the PR Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean) #1735 pre-quant TTT settings: 8 ranks, 21 AdamW epochs, epoch-level cosine LR, federated averaging, then GPTQ export
Uses PR Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729's CaseOps dataset/tokenizer exported publicly at romeerp/parameter-golf-caseops-v1
Keeps the final scorer fixed: the submitted artifact is quantized and compressed before final sliding-window eval

Results

Seed	Sliding val_bpb	Artifact bytes
1337	1.03484145	15,996,061
42	1.03618043	15,996,195
999	1.03519273	15,994,993
Mean	1.03540487	15,995,750
Std	0.00056684

The previous frontier stack I am comparing against is PR #1735 at 1.04290 BPB. This improves it by about 0.00750 BPB, just over the 0.005 nats / 0.00721 BPB threshold.

Independent reproduction from this same record folder:

Date	Seed	Sliding val_bpb	Artifact bytes
2026-04-28/29	1337	1.03459029	15,996,563

Reproduction checkpoints:

Training stopped at 588132ms, step 4568/20000
Pre-quantization post-EMA: val_bpb=1.08389912
After 21 pre-quant TTT epochs: post-prequant-ttt val_bpb=1.02819756
Quantized non-sliding eval: val_bpb=1.04801825
Quantized sliding eval: val_bpb=1.03459029
Total submission size: 15,996,563 bytes

Technique Stack

SP8192 CaseOps tokenizer with reversible case-control operators
Original-byte validation sidecars for correct BPB accounting
11-layer, 512d, 8-head / 4-KV-head transformer
XSA on all layers
3-layer depth recurrence over layers 3-5
Parallel residual path from layer 7 onward
QK-Gain 5.25
LeakyReLU(0.5)^2 MLP, mlp_mult=4.0
EMA/SWA, Muon-family optimization, high-WD compression pressure, warmdown scheduling
8-GPU parallel pre-quant AdamW TTT, 21 epochs
Full-Hessian GPTQ with SDClip-style row clipping
Int6 model matrices, int8 embeddings
Brotli-compressed model and LZMA-wrapped code under 16 MB
Sliding-window eval, stride 64

Full Lineage / Credits

I read the upstream PR chain and am intentionally not reducing this to a short credit list. The exact runtime stack is a compressed script, so not every ancestor appears as a neat isolated function anymore, but these are the PRs I traced as leading to this record's components or to the parent PRs used here.

PR	Contributor	Why it matters here
#1738	@alertcat	Exact CaseOps V15 integration of PR #1735 plus PR #1729. This PR is based on that record folder.
#1735	@AjAnubolu	Parallel pre-quant AdamW TTT, 21 epochs, federated averaging, epoch-level cosine LR, torch.compile speedup.
#1729	@romeerp	CaseOps tokenizer/data export, reversible capitalization operators, validation byte sidecars.
#1626	@dexhunter	Multi-phase score-first TTT lineage used by the CaseOps PR.
#1530	@samacqua	VarLen attention / fused MLP / doc-TTT base referenced by #1626.
#1610	@romeerp	Phased TTT concept referenced by #1626.
#1493	@bigbag	QK-Gain 5.25 and consolidation of SP8192 + recurrence + residuals + legal TTT.
#1445	@X-Abhishek-X	Tuned WD, matrix LR, EMA, warmdown settings cited by #1493.
#1412	@Robby955	Parallel residuals from layer 7 onward; Hessian-aware SDClip analysis.
#1331	@dexhunter	3-layer depth recurrence over layers 3-5 and WD/LR compression tradeoff.
#1285	@dexhunter	Earlier recurrence and WD-quantization synergy extended by #1331.
#1394	@clarkkev	SP8192, GPTQ embedding quantization, SDClip, Brotli packaging, simplified recurrence.
#1218	@clarkkev	Larger vocab/model stack, high-WD compression logic, GPTQ Hessian-aware path, skip-gate and QK-gain adoption.
#1217	@bigbag	MuonEq-R and QK-gain sweep context.
#1204	@msisovic	Mini depth recurrence and parallel residual formulation.
#1179	@dexhunter	Base stack used by #1204 / #1217 lineage.
#1125	@jainpranjal97	XSA-all and QK-Gain 4.0 findings that pushed the later attention-gain sweeps.
#1105	@abaybektursun	Mixed-quantization / AR GPTQ path referenced by #1204.
#1089	@clarkkev	Byte-shuffle/Brotli compression and sigmoid-gated skip connection lineage.
#1060	@clarkkev	GPTQ Hessian-aware quantization implementation referenced by #1218.
#1019	@abaybektursun	AR self-generated GPTQ calibration, XSA-all, architecture documentation, prior merged SOTA baseline.
#756	@abaybektursun	Negative post-quant TTT experiments that helped motivate pre-quant adaptation.
#726	@clarkkev	Coprime-stride loader lineage that preceded the simplified loader in #1394.
#609	@saml212	BigramHash / selective pruning / GPTQ calibration lineage referenced by #1019.
#593	prior contributors	GPTQ calibration legality context referenced by #1019.
#569	prior contributors	GPTQ calibration legality context referenced by #1019.
#549	@abaybektursun	LeakyReLU^2, legal score-first TTT, Parallel Muon record line.
#535	@raahilshah	Full-Hessian GPTQ and QAT/export alignment lineage.
#518	@sofiabod	LeakyReLU^2 follow-up credit in the #549 lineage.
#493	@parinzee	11-layer model, XSA, LeakyReLU(0.5)^2, EMA, int6 quantization, partial RoPE.
#478	@gowtham0992	XSA on all 11 layers, GPTQ-lite, EMA, late-QAT record line.
#461	@Christopher-Lee-McClendon	Score-first TTT framework used by earlier legal TTT records.
#414	@signalrush	Base model lineage credited by #549.
#401	@newjordan	EMA/SWA weight-averaging lineage.
#399	@abaybektursun	Parallel Muon optimizer lineage.
#364	@shikhar1729	Warmdown schedule lineage.
#315	@jfprincz	Partial RoPE and layer-scale lineage.
#289	contributor in #1019 lineage	U-Net skip connection lineage documented by #1019.
#286	@chris-buckley	Late QAT / STE lineage documented by #1019.
#180	@thwu1	Early SOTA baseline credited by #493.
#162	@raahilshah	BigramHash concept lineage documented by #1019.
#160	@ChaseWNorton	Compression lineage documented by #1019.
#122	@mtybadger	Flash Attention 3 / Hopper kernel dependency lineage documented by #1019.
#65	@aquariouseworkman	SmearGate lineage documented by #1019; later SP8192 stacks simplified parts away.

Compliance Notes

This is submitted under the same Track A interpretation as PR #1735 and PR #1738:

Final evaluation uses a fixed quantized artifact.
Pre-quant TTT happens before export, not during final scoring.
No SLOT, RLS, ETLB, n-gram cache, or eval-time cache.
No two-pass rescoring.
Sliding-window eval is causal with stride 64.
The softmax distribution is normalized.
CaseOps is reversible and uses original-byte sidecars for BPB.
Artifact size is below 16,000,000 bytes.
Training and eval stay below the 10-minute limits.

The sensitive part is pre-quant TTT on validation chunks. I am not hiding that. I am submitting this consistently with the Track A framing used by PR #1735 / PR #1738: adaptation is part of producing the fixed artifact, and the scorer sees a fixed predictor. If maintainers decide that interpretation is not allowed, this line should be judged consistently with those PRs.

Dependencies / External Data

The challenge README allows packages/imports as long as they do not violate the evaluation, compute, training-time, code-size, or other restrictions, and asks record folders to include dependency/setup notes. I added a requirements.txt to this record folder for manual setup.

For clarity:

The final submitted artifact is self-contained: counted code bytes plus compressed model bytes.
There are no network calls or external downloads during final evaluation.
romeerp/parameter-golf-caseops-v1 is used as the public CaseOps tokenizer/data export for training setup, before train_gpt.py runs.
The train script imports torch, numpy, sentencepiece, and brotli; it tries FlashAttention 3 when available in the official H100 image and otherwise falls back to the PyTorch attention path.
huggingface-hub and hf_transfer are only for fetching the public CaseOps dataset/tokenizer during setup.

So no, I am not relying on an external service at eval time. The only external piece is the documented public dataset/tokenizer setup needed to reproduce the training run, in the same spirit as the repository's normal data download flow.

Reproduction

pip install -r requirements.txt
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='romeerp/parameter-golf-caseops-v1',
    repo_type='dataset',
    local_dir='/workspace/caseops_data',
)
"

cd /workspace/caseops_data/datasets/datasets/
ln -sf fineweb10B_sp8192_lossless_caps_caseops_v1_reserved fineweb10B_sp8192
cd /workspace/caseops_data/datasets/tokenizers/
ln -sf fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model fineweb_8192_bpe.model

SEED=1337 \
  DATA_DIR=/workspace/caseops_data/datasets/ \
  TTT_EMA_ENABLED=0 \
  PREQUANT_TTT_ENABLED=1 \
  PREQUANT_TTT_EPOCHS=21 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Test Plan

3-seed validation: 1337, 42, 999
Independent seed 1337 reproduction on 2026-04-28/29
Artifacts below 16,000,000 bytes
Training below 600s
Eval below 600s
Fixed predictor for final scoring
Full-Hessian GPTQ int6 + Brotli
CaseOps byte-sidecar BPB accounting

…ams) After 4 parallel research agents reviewed 30+ open PRs and compliance issues, two new findings: 1. PR openai#1923 (AsymLogit) flagged "empirical negative" by sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo. V19's specific stack is NOT directly invalidated. 2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855 base 1.06108 = -0.00059 BPB). Just 2 hparam env vars: MATRIX_LR 0.026 -> 0.028 PHASED_TTT_PREFIX_DOCS 2500 -> 3500 Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head). Adds two new scout scripts: - run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus + WD=2.0 (full stack, recommended first scout) - run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0 (ablation if V19c wins partially) Decision rule (CaseOps val baseline 0.97651, community floor 0.0006): V19c < 0.97591 -> CLEAR WIN, run 3-seed V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884) Other research findings: - PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip) - PR openai#1929 SLOT banned per openai#1722 precedent - PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent - cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1 - regina-openai + Alex Zhao 48h zero activity - CaseOps de-facto legal (PR openai#1855 merged into chain)

yaowubarbara · 2026-04-30T09:32:38Z

Independent reproduction at seed=42 (8×H100 SXM, matotezitanka/proteus-pytorch:latest base image, 2026-04-30):

stage	this run	PR #1911 reported (s42)	Δ
pre-quantization post-EMA val_bpb	1.08474	1.08389	+0.001
post-prequant-TTT val_bpb (21 ep)	1.02923	1.02819	+0.001
quantized non-sliding val_bpb	1.05013	1.04802	+0.002
quantized sliding val_bpb (final)	1.03650	1.03618	+0.00032

Final number sits within σ of the reported s42 (3-seed std 0.00057) and within 2σ of the 3-seed mean (1.03540). Training stopped via wallclock_cap at step 4535/20000, very close to the PR's 4568. Artifact size and budget compliance line up.

Stack works as documented: SP8192 + CaseOps + PreQuantTTT 21ep + GPTQ INT6 + Brotli + V15 byte sidecars. No SLOT, no PPM mixer, no eval-time-update tricks.

Posting for the maintainer review queue — happy to share the raw logs if useful.

dttdrv · 2026-04-30T10:25:47Z

Thanks for the independent reproduction @yaowubarbara . This is very helpful.

Your seed-42 result is close to the reported seed-42 run and confirms that the record folder is mechanically reproducible under the stated stack and budget. I appreciate you checking artifact/budget behavior as well.

The remaining question is the rule interpretation around pre-quant validation TTT. I tried to document that caveat explicitly in the PR; if maintainers decide that this class is invalid under the score-before-update reading, I agree it should be judged consistently with the rest of the pre-quant TTT line.

Since I do not have any more funds, I believe this is my last PR.

yaowubarbara · 2026-04-30T10:35:10Z

Thanks @dttdrv — credit for the clean lineage. The reproduction was straightforward precisely because you documented the stack and env vars completely (and the V15 byte-sidecar accounting made the BPB chain easy to verify).

On the score-before-update reading: agree it's a maintainer call. I sketched a small CPU-only behavioral-probe library (yaowubarbara/pgolf-validator) for the C1 / C2 conditions; happy to extend it toward C3 if useful for the ruling thread. Best of luck with whatever's next.

yaowubarbara · 2026-04-30T14:53:57Z

Follow-up to the reproduction comment above: I ran four single-seed ablations on top of your stack today to characterize the LeakyReLU² and QK-Gain axes on the PR #1911 base, since @bsisduck PR #1970 ran the global-slope ablation on a related base (and confirmed slope 0.5 there) and @mikeapedia PR #1648 explored both per-layer activation coefficients and softer per-layer QK-Gain on a third base via xIELU.

These are ablations on top of your stack — same SP8192 + CaseOps + 21ep PreQuantTTT + GPTQ INT6 + Brotli + V15 byte-sidecar pipeline, seed=42 — not corrections to your PR.

What changed across the four runs

E1b — your stack as-is, fixed slope=0.5 (control reproduction)
E2 — MLP.forward modified so the LeakyReLU² slope is nn.Parameter(torch.tensor(0.5)) per MLP block (11 extra fp32 scalars, classified passthrough float16 by your existing GPTQ pipeline, trainable through main + PreQuantTTT)
E3 — same eleven slope values frozen as register_buffer after harvesting them from E2's after-TTT log (no architectural change, fixed profile only)
E4a — your stack with QK_GAIN_INIT=3.0 instead of the default 5.25 (single env-var change, no source patch)

Single-seed=42 results

Variant	Pre-EMA	Post-PreQuantTTT (21ep)	Quantized non-sliding	Quantized sliding (final)	Δ vs E1b
E1b fixed slope=0.5, QK=5.25	1.08474	1.02923	1.05013	1.03650	—
E4a QK=3.0	1.08505	1.02957	1.04968	1.03614	−0.00036
E3 frozen inverse-U from E2	1.08662	1.03088	1.05250	1.03905	+0.00255
E2 per-layer learnable	1.08815	1.03260	1.05360	1.04014	+0.00364

E1b sits ~+0.00032 above the s42 reading you reported (1.03618), within σ (your 3-seed std 0.00057). All four runs share the same training and TTT budgets and finished within compliance.

The harvested per-layer slope profile (E2 after PreQuantTTT)

L0=0.336    L1=0.443    L2=0.512    L3=0.706
L4=1.263    L5=1.396    L6=1.135    L7=0.898
L8=0.780    L9=0.745    L10=1.012

This single harvested profile was inverse-U-shaped — early layers below 0.5, middle layers above 1, late layers around 0.7–1. Qualitatively similar to mikeapedia's per-layer xIELU coefficient pattern on a different base.

Tentative reading at single seed

E4a vs E1b: −0.00036 BPB. Below the 0.00057 inter-seed std you reported. This is essentially equal to E1b at seed=42; mikeapedia's "softer attention" preference does not appear to transfer cleanly to the PR {RECORD} CaseOps pre-quant TTT record (1.0354 BPB) #1911 PreQuantTTT-equipped stack at this seed.
E3 vs E1b: +0.00255. Frozen inverse-U beats the learnable variant by ~0.001 (E3 vs E2), suggesting joint optimization of 11 slope params adds some optimization noise; but the profile itself underperforms uniform 0.5 at this seed.
E2 vs E1b: +0.00364. Per-layer learnable is the worst of the four.

The qualitative ordering E4a ≈ E1b < E3 < E2 is consistent across pre-EMA, post-TTT, and final sliding eval, but the absolute deltas of 0.00036 (E4a) and 0.00255 (E3) are 0.6× and 4.5× the reported inter-seed std. A 3-seed re-validation would be needed before any of these orderings could be called robust.

Caveats worth knowing

Single seed. Above all else.
Slope micro-drift in E3. Buffers were registered fp32 but observed to drift 0.001–0.004 from init values during distributed training, presumably from fp32→bf16 round-trips. Drift is small but the "frozen" label is approximate.
E2 pre-LZMA artifact measured 16,061,499 bytes (61 KB over the 16 MB cap before code-side LZMA wrapping). The +44 bytes of slope parameters is not the cause — the unwrapped patched script itself is what pushes it over. For a record submission this would need code-side LZMA wrap.
Untested interactions. Frozen-buffer and learnable-parameter numbers may interact with EMA, GPTQ Hessian calibration, and the Brotli compressor in ways we did not isolate.

One-line summary

On the PR #1911 stack at seed=42, none of (per-layer learnable LeakyReLU², frozen inverse-U LeakyReLU², QK_GAIN_INIT=3.0) materially improves over your fixed defaults; E4a essentially reproduces s42, E2/E3 lose by 0.0025–0.0036. For anyone with budget for the multi-iteration convergence loop methodology mikeapedia used in PR #1648, the E2-harvested profile above is a reasonable iter-1 init.

The patches are reproducible from the description above (a 17-line MLP-class modification for E2/E3 and a single QK_GAIN_INIT=3.0 env-var override for E4a); the seed=42 stdout logs and trained artifacts lived on the RunPod pod and are no longer available since the pod was terminated. Thanks again for the clean, reproducible base.

aquariouseworkman · 2026-04-30T16:57:51Z

Since the data the optimizer trains on is val_data.val_tokens, thus making this invalid, correct?

dttdrv · 2026-04-30T17:31:34Z

@aquariouseworkman Yes, I think this is the same C3 issue that caused #1958 and #1992 to be withdrawn.

The original intent here was to follow the pre-quant TTT precedent from that line of PRs: adapt before final artifact export, then evaluate a fixed artifact. But under the stricter score-before-update reading, that distinction does not save it, because the useful model state used for the reported score has already been optimized on val_data.val_tokens.
So I agree this should not be treated as leaderboard-valid if C3 forbids building useful state from validation tokens before scoring them. The stack is mechanically reproducible, but the record claim should be withdrawn or reclassified as an invalid/pre-quant-TTT ablation artifact.

Add CaseOps pre-quant TTT record

80da055

dttdrv changed the title ~~Add CaseOps pre-quant TTT record (1.0354 BPB)~~ {RECORD} CaseOps pre-quant TTT record (1.0354 BPB) Apr 28, 2026

dttdrv added 2 commits April 29, 2026 01:38

Expand CaseOps record attribution

082d9e8

Document CaseOps record dependencies

ea8c855

okezue mentioned this pull request Apr 30, 2026

Record: PreQuantTTT + Sliding Window on PR #1855 stack, val_bpb=1.01355 (3-seed) #1958

Closed

BharathSShankar mentioned this pull request Apr 30, 2026

Record: SP10240 SimCTG + PreQuantTTT — 1.03983 sliding-window (3-seed) #1972

Open

jamesEmerson112 mentioned this pull request Apr 30, 2026

Record: SP8192 Full Stack + Headwise Gated Attention + PreQuantTTT (1.0511 BPB, 3-seed) #1992

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

{RECORD} CaseOps pre-quant TTT record (1.0354 BPB)#1911

{RECORD} CaseOps pre-quant TTT record (1.0354 BPB)#1911
dttdrv wants to merge 3 commits into
openai:mainfrom
dttdrv:record/caseops-prequant-ttt-103459

dttdrv commented Apr 28, 2026 •

edited

Loading

Uh oh!

yaowubarbara commented Apr 30, 2026

Uh oh!

dttdrv commented Apr 30, 2026

Uh oh!

yaowubarbara commented Apr 30, 2026

Uh oh!

yaowubarbara commented Apr 30, 2026

Uh oh!

aquariouseworkman commented Apr 30, 2026

Uh oh!

dttdrv commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dttdrv commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why I Did This

What This PR Adds

Results

Technique Stack

Full Lineage / Credits

Compliance Notes

Dependencies / External Data

Reproduction

Test Plan

Uh oh!

yaowubarbara commented Apr 30, 2026

Uh oh!

dttdrv commented Apr 30, 2026

Uh oh!

yaowubarbara commented Apr 30, 2026

Uh oh!

yaowubarbara commented Apr 30, 2026

What changed across the four runs

Single-seed=42 results

The harvested per-layer slope profile (E2 after PreQuantTTT)

Tentative reading at single seed

Caveats worth knowing

One-line summary

Uh oh!

aquariouseworkman commented Apr 30, 2026

Uh oh!

dttdrv commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dttdrv commented Apr 28, 2026 •

edited

Loading