Skip to content

{RECORD} CaseOps pre-quant TTT record (1.0354 BPB)#1911

Open
dttdrv wants to merge 3 commits into
openai:mainfrom
dttdrv:record/caseops-prequant-ttt-103459
Open

{RECORD} CaseOps pre-quant TTT record (1.0354 BPB)#1911
dttdrv wants to merge 3 commits into
openai:mainfrom
dttdrv:record/caseops-prequant-ttt-103459

Conversation

@dttdrv
Copy link
Copy Markdown

@dttdrv dttdrv commented Apr 28, 2026

Summary

This PR submits the CaseOps V15 record stack for track_10min_16mb.

  • 3-seed mean val_bpb: 1.03540487 (std 0.00056684)
  • Seeds: 1337, 42, 999
  • Artifact range: 15,994,993 to 15,996,195 bytes
  • Independent reproduction: seed 1337 reached 1.03459029 BPB with a 15,996,563 byte artifact on 2026-04-28/29
  • Title change: this is marked {RECORD} because it clears the threshold versus PR Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean) #1735's 1.04290 BPB result

I am being explicit about the provenance here: this is a community stack, not a "one weird trick" claim. The core move is combining PR #1735's parallel pre-quant TTT stack with PR #1729's CaseOps tokenizer/byte-sidecar path, as integrated in PR #1738, and independently reproducing it.

Why I Did This

The frontier PRs pointed to two large, mostly orthogonal levers:

  1. Pre-quant TTT was the biggest optimization lever. Instead of trying to make post-quant TTT work after GPTQ has already crushed the degrees of freedom, PR Record: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean) #1364 and then PR Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean) #1735 adapt the full-precision EMA model first, then quantize the adapted model into a fixed artifact.
  2. CaseOps was the cleanest data/tokenizer lever. PR Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729 showed that capitalization can be represented as a reversible side channel over a lower-case lexical stream. That reduces avoidable case fragmentation while still evaluating BPB against the original raw bytes.

Those two ideas should compose. CaseOps makes the sequence modeling problem cleaner; pre-quant TTT spends the remaining time budget adapting the full-precision model to that cleaner target before export. The non-trivial integration work is that CaseOps cannot use naive decoded-token byte counting. It needs byte sidecars for honest BPB.

What This PR Adds

This record folder is based on PR #1738's CaseOps V15 integration:

Results

Seed Sliding val_bpb Artifact bytes
1337 1.03484145 15,996,061
42 1.03618043 15,996,195
999 1.03519273 15,994,993
Mean 1.03540487 15,995,750
Std 0.00056684

The previous frontier stack I am comparing against is PR #1735 at 1.04290 BPB. This improves it by about 0.00750 BPB, just over the 0.005 nats / 0.00721 BPB threshold.

Independent reproduction from this same record folder:

Date Seed Sliding val_bpb Artifact bytes
2026-04-28/29 1337 1.03459029 15,996,563

Reproduction checkpoints:

  • Training stopped at 588132ms, step 4568/20000
  • Pre-quantization post-EMA: val_bpb=1.08389912
  • After 21 pre-quant TTT epochs: post-prequant-ttt val_bpb=1.02819756
  • Quantized non-sliding eval: val_bpb=1.04801825
  • Quantized sliding eval: val_bpb=1.03459029
  • Total submission size: 15,996,563 bytes

Technique Stack

  • SP8192 CaseOps tokenizer with reversible case-control operators
  • Original-byte validation sidecars for correct BPB accounting
  • 11-layer, 512d, 8-head / 4-KV-head transformer
  • XSA on all layers
  • 3-layer depth recurrence over layers 3-5
  • Parallel residual path from layer 7 onward
  • QK-Gain 5.25
  • LeakyReLU(0.5)^2 MLP, mlp_mult=4.0
  • EMA/SWA, Muon-family optimization, high-WD compression pressure, warmdown scheduling
  • 8-GPU parallel pre-quant AdamW TTT, 21 epochs
  • Full-Hessian GPTQ with SDClip-style row clipping
  • Int6 model matrices, int8 embeddings
  • Brotli-compressed model and LZMA-wrapped code under 16 MB
  • Sliding-window eval, stride 64

Full Lineage / Credits

I read the upstream PR chain and am intentionally not reducing this to a short credit list. The exact runtime stack is a compressed script, so not every ancestor appears as a neat isolated function anymore, but these are the PRs I traced as leading to this record's components or to the parent PRs used here.

PR Contributor Why it matters here
#1738 @alertcat Exact CaseOps V15 integration of PR #1735 plus PR #1729. This PR is based on that record folder.
#1735 @AjAnubolu Parallel pre-quant AdamW TTT, 21 epochs, federated averaging, epoch-level cosine LR, torch.compile speedup.
#1729 @romeerp CaseOps tokenizer/data export, reversible capitalization operators, validation byte sidecars.
#1626 @dexhunter Multi-phase score-first TTT lineage used by the CaseOps PR.
#1530 @samacqua VarLen attention / fused MLP / doc-TTT base referenced by #1626.
#1610 @romeerp Phased TTT concept referenced by #1626.
#1493 @bigbag QK-Gain 5.25 and consolidation of SP8192 + recurrence + residuals + legal TTT.
#1445 @X-Abhishek-X Tuned WD, matrix LR, EMA, warmdown settings cited by #1493.
#1412 @Robby955 Parallel residuals from layer 7 onward; Hessian-aware SDClip analysis.
#1331 @dexhunter 3-layer depth recurrence over layers 3-5 and WD/LR compression tradeoff.
#1285 @dexhunter Earlier recurrence and WD-quantization synergy extended by #1331.
#1394 @clarkkev SP8192, GPTQ embedding quantization, SDClip, Brotli packaging, simplified recurrence.
#1218 @clarkkev Larger vocab/model stack, high-WD compression logic, GPTQ Hessian-aware path, skip-gate and QK-gain adoption.
#1217 @bigbag MuonEq-R and QK-gain sweep context.
#1204 @msisovic Mini depth recurrence and parallel residual formulation.
#1179 @dexhunter Base stack used by #1204 / #1217 lineage.
#1125 @jainpranjal97 XSA-all and QK-Gain 4.0 findings that pushed the later attention-gain sweeps.
#1105 @abaybektursun Mixed-quantization / AR GPTQ path referenced by #1204.
#1089 @clarkkev Byte-shuffle/Brotli compression and sigmoid-gated skip connection lineage.
#1060 @clarkkev GPTQ Hessian-aware quantization implementation referenced by #1218.
#1019 @abaybektursun AR self-generated GPTQ calibration, XSA-all, architecture documentation, prior merged SOTA baseline.
#756 @abaybektursun Negative post-quant TTT experiments that helped motivate pre-quant adaptation.
#726 @clarkkev Coprime-stride loader lineage that preceded the simplified loader in #1394.
#609 @saml212 BigramHash / selective pruning / GPTQ calibration lineage referenced by #1019.
#593 prior contributors GPTQ calibration legality context referenced by #1019.
#569 prior contributors GPTQ calibration legality context referenced by #1019.
#549 @abaybektursun LeakyReLU^2, legal score-first TTT, Parallel Muon record line.
#535 @raahilshah Full-Hessian GPTQ and QAT/export alignment lineage.
#518 @sofiabod LeakyReLU^2 follow-up credit in the #549 lineage.
#493 @parinzee 11-layer model, XSA, LeakyReLU(0.5)^2, EMA, int6 quantization, partial RoPE.
#478 @gowtham0992 XSA on all 11 layers, GPTQ-lite, EMA, late-QAT record line.
#461 @Christopher-Lee-McClendon Score-first TTT framework used by earlier legal TTT records.
#414 @signalrush Base model lineage credited by #549.
#401 @newjordan EMA/SWA weight-averaging lineage.
#399 @abaybektursun Parallel Muon optimizer lineage.
#364 @shikhar1729 Warmdown schedule lineage.
#315 @jfprincz Partial RoPE and layer-scale lineage.
#289 contributor in #1019 lineage U-Net skip connection lineage documented by #1019.
#286 @chris-buckley Late QAT / STE lineage documented by #1019.
#180 @thwu1 Early SOTA baseline credited by #493.
#162 @raahilshah BigramHash concept lineage documented by #1019.
#160 @ChaseWNorton Compression lineage documented by #1019.
#122 @mtybadger Flash Attention 3 / Hopper kernel dependency lineage documented by #1019.
#65 @aquariouseworkman SmearGate lineage documented by #1019; later SP8192 stacks simplified parts away.

Compliance Notes

This is submitted under the same Track A interpretation as PR #1735 and PR #1738:

  • Final evaluation uses a fixed quantized artifact.
  • Pre-quant TTT happens before export, not during final scoring.
  • No SLOT, RLS, ETLB, n-gram cache, or eval-time cache.
  • No two-pass rescoring.
  • Sliding-window eval is causal with stride 64.
  • The softmax distribution is normalized.
  • CaseOps is reversible and uses original-byte sidecars for BPB.
  • Artifact size is below 16,000,000 bytes.
  • Training and eval stay below the 10-minute limits.

The sensitive part is pre-quant TTT on validation chunks. I am not hiding that. I am submitting this consistently with the Track A framing used by PR #1735 / PR #1738: adaptation is part of producing the fixed artifact, and the scorer sees a fixed predictor. If maintainers decide that interpretation is not allowed, this line should be judged consistently with those PRs.

Dependencies / External Data

The challenge README allows packages/imports as long as they do not violate the evaluation, compute, training-time, code-size, or other restrictions, and asks record folders to include dependency/setup notes. I added a requirements.txt to this record folder for manual setup.

For clarity:

  • The final submitted artifact is self-contained: counted code bytes plus compressed model bytes.
  • There are no network calls or external downloads during final evaluation.
  • romeerp/parameter-golf-caseops-v1 is used as the public CaseOps tokenizer/data export for training setup, before train_gpt.py runs.
  • The train script imports torch, numpy, sentencepiece, and brotli; it tries FlashAttention 3 when available in the official H100 image and otherwise falls back to the PyTorch attention path.
  • huggingface-hub and hf_transfer are only for fetching the public CaseOps dataset/tokenizer during setup.

So no, I am not relying on an external service at eval time. The only external piece is the documented public dataset/tokenizer setup needed to reproduce the training run, in the same spirit as the repository's normal data download flow.

Reproduction

pip install -r requirements.txt
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='romeerp/parameter-golf-caseops-v1',
    repo_type='dataset',
    local_dir='/workspace/caseops_data',
)
"

cd /workspace/caseops_data/datasets/datasets/
ln -sf fineweb10B_sp8192_lossless_caps_caseops_v1_reserved fineweb10B_sp8192
cd /workspace/caseops_data/datasets/tokenizers/
ln -sf fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model fineweb_8192_bpe.model

SEED=1337 \
  DATA_DIR=/workspace/caseops_data/datasets/ \
  TTT_EMA_ENABLED=0 \
  PREQUANT_TTT_ENABLED=1 \
  PREQUANT_TTT_EPOCHS=21 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Test Plan

  • 3-seed validation: 1337, 42, 999
  • Independent seed 1337 reproduction on 2026-04-28/29
  • Artifacts below 16,000,000 bytes
  • Training below 600s
  • Eval below 600s
  • Fixed predictor for final scoring
  • Full-Hessian GPTQ int6 + Brotli
  • CaseOps byte-sidecar BPB accounting

@dttdrv dttdrv changed the title Add CaseOps pre-quant TTT record (1.0354 BPB) {RECORD} CaseOps pre-quant TTT record (1.0354 BPB) Apr 28, 2026
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and
compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by
   sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base
   with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo.
   V19's specific stack is NOT directly invalidated.

2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855
   base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
     MATRIX_LR 0.026 -> 0.028
     PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus
  + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0
  (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
  V19c < 0.97591 -> CLEAR WIN, run 3-seed
  V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
  V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
@yaowubarbara
Copy link
Copy Markdown

Independent reproduction at seed=42 (8×H100 SXM, matotezitanka/proteus-pytorch:latest base image, 2026-04-30):

stage this run PR #1911 reported (s42) Δ
pre-quantization post-EMA val_bpb 1.08474 1.08389 +0.001
post-prequant-TTT val_bpb (21 ep) 1.02923 1.02819 +0.001
quantized non-sliding val_bpb 1.05013 1.04802 +0.002
quantized sliding val_bpb (final) 1.03650 1.03618 +0.00032

Final number sits within σ of the reported s42 (3-seed std 0.00057) and within 2σ of the 3-seed mean (1.03540). Training stopped via wallclock_cap at step 4535/20000, very close to the PR's 4568. Artifact size and budget compliance line up.

Stack works as documented: SP8192 + CaseOps + PreQuantTTT 21ep + GPTQ INT6 + Brotli + V15 byte sidecars. No SLOT, no PPM mixer, no eval-time-update tricks.

Posting for the maintainer review queue — happy to share the raw logs if useful.

@dttdrv
Copy link
Copy Markdown
Author

dttdrv commented Apr 30, 2026

Thanks for the independent reproduction @yaowubarbara . This is very helpful.

Your seed-42 result is close to the reported seed-42 run and confirms that the record folder is mechanically reproducible under the stated stack and budget. I appreciate you checking artifact/budget behavior as well.

The remaining question is the rule interpretation around pre-quant validation TTT. I tried to document that caveat explicitly in the PR; if maintainers decide that this class is invalid under the score-before-update reading, I agree it should be judged consistently with the rest of the pre-quant TTT line.

Since I do not have any more funds, I believe this is my last PR.

@yaowubarbara
Copy link
Copy Markdown

Thanks @dttdrv — credit for the clean lineage. The reproduction was straightforward precisely because you documented the stack and env vars completely (and the V15 byte-sidecar accounting made the BPB chain easy to verify).

On the score-before-update reading: agree it's a maintainer call. I sketched a small CPU-only behavioral-probe library (yaowubarbara/pgolf-validator) for the C1 / C2 conditions; happy to extend it toward C3 if useful for the ruling thread. Best of luck with whatever's next.

@yaowubarbara
Copy link
Copy Markdown

Follow-up to the reproduction comment above: I ran four single-seed ablations on top of your stack today to characterize the LeakyReLU² and QK-Gain axes on the PR #1911 base, since @bsisduck PR #1970 ran the global-slope ablation on a related base (and confirmed slope 0.5 there) and @mikeapedia PR #1648 explored both per-layer activation coefficients and softer per-layer QK-Gain on a third base via xIELU.

These are ablations on top of your stack — same SP8192 + CaseOps + 21ep PreQuantTTT + GPTQ INT6 + Brotli + V15 byte-sidecar pipeline, seed=42 — not corrections to your PR.

What changed across the four runs

  • E1b — your stack as-is, fixed slope=0.5 (control reproduction)
  • E2MLP.forward modified so the LeakyReLU² slope is nn.Parameter(torch.tensor(0.5)) per MLP block (11 extra fp32 scalars, classified passthrough float16 by your existing GPTQ pipeline, trainable through main + PreQuantTTT)
  • E3 — same eleven slope values frozen as register_buffer after harvesting them from E2's after-TTT log (no architectural change, fixed profile only)
  • E4a — your stack with QK_GAIN_INIT=3.0 instead of the default 5.25 (single env-var change, no source patch)

Single-seed=42 results

Variant Pre-EMA Post-PreQuantTTT (21ep) Quantized non-sliding Quantized sliding (final) Δ vs E1b
E1b fixed slope=0.5, QK=5.25 1.08474 1.02923 1.05013 1.03650
E4a QK=3.0 1.08505 1.02957 1.04968 1.03614 −0.00036
E3 frozen inverse-U from E2 1.08662 1.03088 1.05250 1.03905 +0.00255
E2 per-layer learnable 1.08815 1.03260 1.05360 1.04014 +0.00364

E1b sits ~+0.00032 above the s42 reading you reported (1.03618), within σ (your 3-seed std 0.00057). All four runs share the same training and TTT budgets and finished within compliance.

The harvested per-layer slope profile (E2 after PreQuantTTT)

L0=0.336    L1=0.443    L2=0.512    L3=0.706
L4=1.263    L5=1.396    L6=1.135    L7=0.898
L8=0.780    L9=0.745    L10=1.012

This single harvested profile was inverse-U-shaped — early layers below 0.5, middle layers above 1, late layers around 0.7–1. Qualitatively similar to mikeapedia's per-layer xIELU coefficient pattern on a different base.

Tentative reading at single seed

  • E4a vs E1b: −0.00036 BPB. Below the 0.00057 inter-seed std you reported. This is essentially equal to E1b at seed=42; mikeapedia's "softer attention" preference does not appear to transfer cleanly to the PR {RECORD} CaseOps pre-quant TTT record (1.0354 BPB) #1911 PreQuantTTT-equipped stack at this seed.
  • E3 vs E1b: +0.00255. Frozen inverse-U beats the learnable variant by ~0.001 (E3 vs E2), suggesting joint optimization of 11 slope params adds some optimization noise; but the profile itself underperforms uniform 0.5 at this seed.
  • E2 vs E1b: +0.00364. Per-layer learnable is the worst of the four.

The qualitative ordering E4a ≈ E1b < E3 < E2 is consistent across pre-EMA, post-TTT, and final sliding eval, but the absolute deltas of 0.00036 (E4a) and 0.00255 (E3) are 0.6× and 4.5× the reported inter-seed std. A 3-seed re-validation would be needed before any of these orderings could be called robust.

Caveats worth knowing

  • Single seed. Above all else.
  • Slope micro-drift in E3. Buffers were registered fp32 but observed to drift 0.001–0.004 from init values during distributed training, presumably from fp32→bf16 round-trips. Drift is small but the "frozen" label is approximate.
  • E2 pre-LZMA artifact measured 16,061,499 bytes (61 KB over the 16 MB cap before code-side LZMA wrapping). The +44 bytes of slope parameters is not the cause — the unwrapped patched script itself is what pushes it over. For a record submission this would need code-side LZMA wrap.
  • Untested interactions. Frozen-buffer and learnable-parameter numbers may interact with EMA, GPTQ Hessian calibration, and the Brotli compressor in ways we did not isolate.

One-line summary

On the PR #1911 stack at seed=42, none of (per-layer learnable LeakyReLU², frozen inverse-U LeakyReLU², QK_GAIN_INIT=3.0) materially improves over your fixed defaults; E4a essentially reproduces s42, E2/E3 lose by 0.0025–0.0036. For anyone with budget for the multi-iteration convergence loop methodology mikeapedia used in PR #1648, the E2-harvested profile above is a reasonable iter-1 init.

The patches are reproducible from the description above (a 17-line MLP-class modification for E2/E3 and a single QK_GAIN_INIT=3.0 env-var override for E4a); the seed=42 stdout logs and trained artifacts lived on the RunPod pod and are no longer available since the pod was terminated. Thanks again for the clean, reproducible base.

@aquariouseworkman
Copy link
Copy Markdown
Contributor

Since the data the optimizer trains on is val_data.val_tokens, thus making this invalid, correct?

@dttdrv
Copy link
Copy Markdown
Author

dttdrv commented Apr 30, 2026

@aquariouseworkman Yes, I think this is the same C3 issue that caused #1958 and #1992 to be withdrawn.

The original intent here was to follow the pre-quant TTT precedent from that line of PRs: adapt before final artifact export, then evaluate a fixed artifact. But under the stricter score-before-update reading, that distinction does not save it, because the useful model state used for the reported score has already been optimized on val_data.val_tokens.
So I agree this should not be treated as leaderboard-valid if C3 forbids building useful state from validation tokens before scoring them. The stack is mechanically reproducible, but the record claim should be withdrawn or reclassified as an invalid/pre-quant-TTT ablation artifact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants