Record: PR1855/PR1953 base + Progressive context growth (val_bpb: 1.05759, 3-seed)#2014

Open
simonbissonnette wants to merge 2 commits into openai:main from simonbissonnette:submission/final-growth-candidate

Conversation


simonbissonnette commented Apr 30, 2026

Record candidate: SP8192 CaseOps + Progressive 3k Context Growth + Short-Doc Score-First TTT

val_bpb: 1.05759 (3-seed mean, std 0.00034) | val_loss: 2.31441 nats (std 0.00075) | 15.98 MB max | 8xH100 SXM | 600s train / 600s eval

Improvement over merged PR #1855 leaderboard record (1.06107587 BPB):
-0.00348 BPB / -0.00762 nats
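For context on how the two reported metrics relate, a quick back-of-the-envelope check (my own arithmetic, assuming val_loss is mean nats per token and val_bpb is bits per original byte via the byte sidecar): both the absolute numbers and the deltas imply the same average of roughly 3.16 original bytes per token, so the two figures are internally consistent.

```python
# Hedged consistency check (not from the submission's code), assuming
#   bpb = nats_per_token / ln(2) / bytes_per_token.
import math

print(2.31441 / math.log(2) / 1.05759)   # ~3.157 bytes/token from the absolute values
print(0.00762 / math.log(2) / 0.00348)   # ~3.159 bytes/token from the deltas vs PR #1855
```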

This stacks a progressive training-context schedule and a short-document TTT schedule on top of the late-April CaseOps/SP8192/LQER/SparseAttnGate/BOS-fixed SmearGate lineage. The direct leaderboard comparison is PR #1855, which is the current merged leader used here as the baseline.

Results

| Seed | Steps | ms/step | Train ms | Pre-quant BPB | Quant BPB | Post-TTT BPB | TTT eval s | Artifact bytes |
|------|-------|---------|----------|---------------|-----------|--------------|------------|----------------|
| 42   | 4,888 | 121.9 | 596,025 | 1.05993108 | 1.06833072 | 1.05740567 | 572.4 | 15,981,945 |
| 314  | 4,882 | 122.1 | 595,976 | 1.05975470 | 1.06832443 | 1.05730104 | 489.9 | 15,984,387 |
| 0    | 4,884 | 122.0 | 596,022 | 1.06072266 | 1.06902034 | 1.05807084 | 493.5 | 15,981,122 |
| Mean | 4,884.7 | 122.0 | 596,008 | 1.06013615 | 1.06855850 | 1.05759252 | 518.6 | 15,982,485 |

3-seed population std: 0.00034091 BPB / 0.00074604 nats.

All included seeds are under the 16,000,000-byte artifact cap and the 600s train/eval budgets as logged. The maximum artifact is 15,984,387 bytes and the maximum validation-data TTT pass is 572.4s.

Full validation coverage

All three logs evaluate the full CaseOps validation shard target set:

| Seed | val_tokens | target_tokens |
|------|------------|---------------|
| 42   | 47,853,343 | 47,853,343 |
| 314  | 47,853,343 | 47,853,343 |
| 0    | 47,853,343 | 47,853,343 |

The training script explicitly keeps the validation tail via EVAL_INCLUDE_TAIL=1. This avoids the older multiple-of-context truncation and makes the standard diagnostic eval and quantized TTT eval agree on the same target count.
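As a toy illustration of the difference (my own sketch, not the repo's eval loop), here is what tail-inclusive versus multiple-of-context target counts look like at a 3072 context:

```python
# Toy sketch (not the repo's eval code): target token counts with and without
# the older multiple-of-context truncation, at EVAL_SEQ_LEN=3072.
def target_tokens(n_val_tokens: int, seq_len: int, include_tail: bool) -> int:
    if include_tail:
        return n_val_tokens                      # EVAL_INCLUDE_TAIL=1: score every validation token
    return (n_val_tokens // seq_len) * seq_len   # old behaviour: drop the ragged tail

n = 47_853_343
print(target_tokens(n, 3072, include_tail=False))  # 47,852,544 -> 799-token tail silently dropped
print(target_tokens(n, 3072, include_tail=True))   # 47,853,343 -> matches target_tokens in the logs
```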

The tokenizer, CaseOps transform, training shards, validation shard, and byte sidecar format all come from the same canonical HF-hosted CaseOps export used by the merged PR #1855 setup. If a reviewer already has the clean #1855/HF CaseOps data staged, those same staged shards can be reused here. The included tokenizer/prep files are present only to make this submission self-contained; the preferred reproduction path is to download the canonical HF CaseOps export directly.

What changed vs PR #1855

This submission keeps the same overall 11-layer SP8192 CaseOps recurrent-transformer family as PR #1855, then adds the following levers:

| Lever | Setting | Purpose |
|-------|---------|---------|
| Progressive train context | TRAIN_SEQ_SCHEDULE=1024@0.100,2048@0.700,3072@1.000 | Train cheaply at 1k early, move to 2k for most of training, then finish at 3k context (see the schedule sketch after this table). |
| Final/eval context | TRAIN_SEQ_LEN=3072, EVAL_SEQ_LEN=3072, TTT_EVAL_SEQ_LEN=3072, EVAL_STRIDE=1536 | Extend the final model and TTT scoring context beyond 2k without the 4k eval-time cost. |
| Long-context TTT mask | TTT_MASK=no_qv, TTT_Q_LORA=0, TTT_V_LORA=0 | Keep K/O/MLP LoRA adaptation while removing Q/V adapters that were less helpful at longer context. |
| TTT local LR | TTT_LOCAL_LR_MULT=0.75 | Slightly softer per-document LoRA adaptation. |
| Short-doc score-first chunks | TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24, default chunk 48 | Use smaller score-before-update chunks for short documents, preserving causality while improving adaptation. |
| TTT phases | PHASED_TTT_NUM_PHASES=1, PHASED_TTT_PREFIX_DOCS=2500 | Single score-first phased pass with a 2500-doc prefix budget. |
| QK gain | QK_GAIN_INIT=5.25 | Public long-context sweep result from the PR #1953 lineage. |
| Compression/quant stack | COMPRESSOR=pergroup, AWQ-lite, asymmetric logit rescale | Inherited from public late-April quantization/compression work stacked on the PR #1855 base. |
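A minimal sketch of how a schedule string like this could be interpreted (hypothetical parser, not the repo's implementation; with TRAIN_SEQ_SCHEDULE_MODE=wallclock the progress fraction would be elapsed wallclock over the training budget):

```python
# Hypothetical parser for a "len@frac,len@frac,..." schedule string.
def seq_len_at(progress: float,
               schedule: str = "1024@0.100,2048@0.700,3072@1.000") -> int:
    """Training context length for a progress fraction in [0, 1]."""
    stages = [(int(length), float(frac)) for length, frac in
              (entry.split("@") for entry in schedule.split(","))]
    for seq_len, upto in stages:
        if progress <= upto:
            return seq_len
    return stages[-1][0]  # past the last breakpoint: keep the final context

assert seq_len_at(0.05) == 1024   # cheap 1k context for the first 10% of the run
assert seq_len_at(0.50) == 2048   # 2k for the bulk of training
assert seq_len_at(0.95) == 3072   # finish at 3k context
```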

The short-doc TTT schedule does not train on future validation tokens. It only changes the chunk granularity used inside the existing score-before-update loop: each chunk is scored first, and the LoRA update derived from it is applied only to later chunks.
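A minimal sketch of that rule (hypothetical helpers `score` and `lora_step`, not the repo's quantized_ttt_phased), with the chunk-size table taken from TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24 and the default chunk of 48:

```python
# Hypothetical sketch of score-before-update TTT with the short-doc chunk schedule.
def chunk_size_for(doc_len: int, default_chunk: int = 48) -> int:
    for max_len, chunk in ((256, 8), (2000, 24)):   # TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24
        if doc_len <= max_len:
            return chunk
    return default_chunk                             # TTT_CHUNK_SIZE=48 for longer documents

def score_first_pass(doc_tokens, score, lora_step):
    """`score` and `lora_step` are stand-ins for the model's scoring and LoRA-update calls."""
    chunk = chunk_size_for(len(doc_tokens))
    total_nats = 0.0
    for start in range(0, len(doc_tokens), chunk):
        piece = doc_tokens[start:start + chunk]
        total_nats += score(piece)   # 1) score with the current adapters (counts toward val_bpb)
        lora_step(piece)             # 2) adapt; only later chunks of this document see the update
    return total_nats
```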

Architecture and training stack

| Component | Setting |
|-----------|---------|
| Model | 11 layers, 512d, 8 query heads, 4 KV heads, MLP 4x |
| Tokenizer/data | SP8192 CaseOps lossless caps with byte sidecar accounting |
| RoPE | Partial RoPE, 16 dims |
| Recurrence | Layers 3-5 looped, enabled at frac=0.35 |
| Parallel decoder | Parallel lane from layer 8, mean final lane |
| XSA | All 11 layers |
| Gates | BOS-fixed SmearGate, SparseAttnGate with gate_window=12, scale 0.5 |
| Optimizer | Muon on matrix params, Adam on embedding/scalars, BETA2=0.99 |
| EMA | ema_decay=0.9965 |
| Quantization | GPTQ int6 matrices, int7 embeddings, LQER asymmetric rank-4 correction |
| GPTQ reserve | GPTQ_RESERVE_SECONDS=4.0; logs show gptq:reserving 4s, effective=596000ms (see the arithmetic check after this table) |
| Compression | Per-group compression |
| TTT | Quantized phased LoRA TTT, score-first, no_qv mask, short-doc chunk schedule |
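A back-of-the-envelope check of the reserve arithmetic (my own, assuming the reserve is simply subtracted from the 600s wallclock budget):

```python
# Assumed arithmetic, not the repo's timer code: the training loop stops
# GPTQ_RESERVE_SECONDS before MAX_WALLCLOCK_SECONDS, matching the logged effective budget.
MAX_WALLCLOCK_SECONDS = 600
GPTQ_RESERVE_SECONDS = 4.0
effective_ms = int((MAX_WALLCLOCK_SECONDS - GPTQ_RESERVE_SECONDS) * 1000)
assert effective_ms == 596_000   # "effective=596000ms" in the logs; per-seed train ms is ~596,0xx
```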

Compliance notes

  • Artifact cap: all seeds <= 15,984,387 bytes.
  • Training wallclock: all training loops stop around 596.0s with GPTQ_RESERVE_SECONDS=4.0; GPTQ hessian collection is logged immediately after (67 Hessians in 4.1s) for transparency.
  • Eval wallclock: all validation-data TTT passes are <= 572.4s. The ttt_lora:compile warmup uses random tokens and no validation data; it is logged separately from total_eval_time.
  • Score-before-update: quantized_ttt_phased scores each chunk before applying that chunk's LoRA update. The short-doc schedule only changes chunk size.
  • Full validation targets: val_tokens == target_tokens == 47853343 in all included logs.
  • No validation data in training: training uses only training shards. TTT accesses validation documents left-to-right under the score-first rule.
  • No external cache or direct memorization: no SLOT, n-gram cache, PPM mixture, logit bias table, or validation-derived precomputation.
  • Original-byte BPB: CaseOps byte sidecar accounting is preserved.

Reproduction

Install the dependencies in requirements.txt. FlashAttention 3 and the lrzip system binary are called out there because they require separate installation steps.

This submission uses the clean canonical CaseOps SP8192 export hosted on Hugging Face. The logs were produced from a 50,000-document validation split with 80 training shards (train_shards: 80, ttt_phased: total_docs:50000, and val_tokens == target_tokens == 47853343 in every included log).

Preferred data setup:

python3 - <<'PY'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="romeerp/parameter-golf-caseops-v1",
    repo_type="dataset",
    local_dir="./data/datasets/fineweb10B_sp8192_caseops",
    allow_patterns=[
        "datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/*",
        "datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model",
    ],
    max_workers=8,
)
PY

Then set:

DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved
TOKENIZER_PATH=./data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

Fallback local rebuild: if the HF export is unavailable, rebuild from the canonical docs_selected.jsonl with the included prepare_caseops_data.py, lossless_caps.py, and tokenizer. Use --val-docs 50000 and write into a fresh output directory. The prep script now defaults to 50,000 validation docs and refuses to write over existing fineweb_*.bin shards unless --overwrite is passed, to avoid accidentally mixing stale validation shards with a new train split.

Run one seed at a time, replacing DATA_PATH and TOKENIZER_PATH with the staged CaseOps paths:

for SEED in 42 314 0; do
  NCCL_NET=Socket \
  DATA_DIR=./data \
  DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  CASEOPS_ENABLED=1 \
  VOCAB_SIZE=8192 \
  ITERATIONS=20000 \
  MAX_WALLCLOCK_SECONDS=600 \
  EVAL_INCLUDE_TAIL=1 \
  TRAIN_SEQ_LEN=3072 \
  ROPE_TRAIN_SEQ_LEN=3072 \
  TRAIN_SEQ_SCHEDULE=1024@0.100,2048@0.700,3072@1.000 \
  TRAIN_SEQ_SCHEDULE_MODE=wallclock \
  SEQ_CHANGE_WARMUP_STEPS=32 \
  EVAL_SEQ_LEN=3072 \
  EVAL_STRIDE=1536 \
  TTT_ENABLED=1 \
  TTT_EVAL_SEQ_LEN=3072 \
  TTT_BATCH_SIZE=24 \
  TTT_CHUNK_SIZE=48 \
  TTT_SHORT_SCORE_FIRST_ENABLED=1 \
  TTT_SHORT_DOC_LEN=2000 \
  TTT_SHORT_CHUNK_SIZE=24 \
  TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24 \
  TTT_LORA_RANK=80 \
  TTT_LORA_LR=0.0001 \
  TTT_LOCAL_LR_MULT=0.75 \
  TTT_MASK=no_qv \
  TTT_Q_LORA=0 \
  TTT_V_LORA=0 \
  TTT_WEIGHT_DECAY=0.5 \
  TTT_BETA2=0.99 \
  PHASED_TTT_PREFIX_DOCS=2500 \
  PHASED_TTT_NUM_PHASES=1 \
  WARMDOWN_FRAC=0.85 \
  BETA2=0.99 \
  QK_GAIN_INIT=5.25 \
  SPARSE_ATTN_GATE_ENABLED=1 \
  SPARSE_ATTN_GATE_SCALE=0.5 \
  GATED_ATTN_QUANT_GATE=1 \
  SMEAR_GATE_ENABLED=1 \
  GATE_WINDOW=12 \
  FUSED_CE_ENABLED=1 \
  MATRIX_LR=0.026 \
  MIN_LR=0.1 \
  GRAD_CLIP_NORM=0.3 \
  EMBED_BITS=7 \
  EMBED_CLIP_SIGMAS=14.0 \
  MATRIX_CLIP_SIGMAS=12.85 \
  ATTN_CLIP_SIGMAS=13.0 \
  MLP_CLIP_SIGMAS=11.5 \
  LQER_ENABLED=1 \
  LQER_RANK=4 \
  LQER_TOP_K=3 \
  LQER_FACTOR_BITS=4 \
  LQER_ASYM_ENABLED=1 \
  LQER_ASYM_GROUP=64 \
  AWQ_LITE_ENABLED=1 \
  AWQ_LITE_BITS=8 \
  AWQ_LITE_GROUP_TOP_K=1 \
  AWQ_LITE_GROUP_SIZE=64 \
  ASYM_LOGIT_RESCALE=1 \
  GPTQ_RESERVE_SECONDS=4.0 \
  GPTQ_CALIBRATION_BATCHES=16 \
  COMPRESSOR=pergroup \
  VAL_LOSS_EVERY=0 \
  SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py \
      > train_seed${SEED}.log 2>&1
done

Included files

Lineage and credits

This submission is a stack on top of the public CaseOps/SP8192 record lineage, most directly the PR #1855 and PR #1953 bases.

The new contribution here is the combination of progressive 3k train/eval context growth with the short-document score-first TTT chunk schedule, while preserving the full validation target count and staying under the artifact/eval budgets.

Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
Pull PR openai#2014's record dir from openai/parameter-golf and reproduce its 1.05759
3-seed mean. Key new levers vs openai#1953: EVAL_SEQ_LEN=3072, train_seq_schedule
1024->2048->3072, single-phase TTT (NUM_PHASES=1, PREFIX=2500), short-doc
score-first chunking (TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24).

Even with our infra's ~1.5-2 milli-BPB inflation pattern, reproducing openai#2014
should land ~1.0590 — close enough to record bar to potentially clear it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Port the 040C 'middle 5x / late 3.4x' allocation onto simonbissonnette's
progressive-3k base (openai#2014) and screen vs uniform 4.0 baseline. Training-only,
4xH100 1200s, single seed. Code on exp/300-040c-on-2014 @ d174313.

Spec flags the column-slice-in-compile hazard from feedback memory and
mandates a compile-sanity check before scaling. PREQUANT_ONLY=1 keeps the
screen cheap by skipping serialize/GPTQ/TTT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 1, 2026
…ean 1.05831 BPB

Clears record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108),
p < 0.0001. All 3 seeds produce 15.99 MB artifacts under the 16 MB cap, all under
the 600s wallclock budget.

Per-seed:
- 42:   ttt=1.05793  art=15,986,149  eval=572.6s
- 314:  ttt=1.05852  art=15,987,257  eval=553.7s
- 1234: ttt=1.05849  art=15,989,895  eval=574.1s

Submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/
contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a
detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923
-> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition
C1-C4 legality check. submission.json author/github_id are placeholders pending
the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single
8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Idan3011 pushed a commit to Idan3011/parameter-golf that referenced this pull request May 1, 2026
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
AsymLogit Rescale (PR openai#1923) ported as 2 TTT-adaptable scalar params (softcap_pos, softcap_neg).
Pre-quant 1.06160 (slightly worse than S55's 1.06058 — AsymLogit hurts un-adapted model).
TTT recovery -0.01267 (much better than S55's -0.01103) — AsymLogit gives massive adaptive capacity.
Final 1.05759 = -0.00055 vs S55. Single-seed matches PR openai#2014's 3-seed mean.
Eval 521.7s (under 600s cap), Size 15,946,610.
softcap_pos and softcap_neg init to logit_softcap=30.0, adapted per-doc via TTT-LoRA optimizer.
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
User pushed back on openai#2014's LEAK call as too inference-based. Verified directly:
- README says "uses same shards as PR openai#1855. If you don't have them, prepare
  with included prepare_caseops_data.py" — phrasing implies inheritance from
  openai#1855 (LEAK) but doesn't explicitly invoke prep
- No setup.sh, no shell script invoking prep
- No HF download script
- Path /dev/shm/pgolf_caseops_data_80_l17_final is custom flat RAM-disk dir
  (not triple-nested local-prep signature)
- Could be either HF-flattened download OR local-prep copy

Demoted openai#2014 from LEAK to AMBIGUOUS (lean LEAK based on "same shards as openai#1855"
English, but not iron-clad).

Updated tally: CLEAN 9, LEAK 20 (was 21), AMBIGUOUS 4 (was 3), INHERIT 1.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
…E_OUTSIDE=0

Seed 314: pre-quant 1.06128 / quant 1.06962 / final 1.05701 / eval 571.7s
Compliance: ngram_hint_precompute_outside=False, precompute (166.95s) INSIDE timer per PR openai#1514 precedent.
Token-only tilt: within_gate=0, word_gate=0 - legal per PR openai#1514.
Size 15,943,530 bytes.
Single seed beats openai#2014's 3-seed mean (1.05759).
Validating seeds 42 and 1234.
varunneal added a commit to varunneal/parameter-golf that referenced this pull request May 1, 2026
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB.
Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB.
Beats PR openai#2060 (1.05792) by 0.00122 BPB.

Stack:
- Token-only n-gram tilt (PR openai#1514 merged precedent, within/word channels disabled)
- AsymLogit Rescale (2 trainable scalars adapted by global TTT)
- 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5)
- PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014)
- NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE eval timer per PR openai#1514)

Compliance:
- All seeds eval ≤533.1s (cap 600s, 67-80s margin)
- All artifacts ≤15.95MB (cap 16MB)
- Token-only n-gram channel (within_gate=0, word_gate=0)
- Score-first TTT (per PR openai#402)
varunneal added a commit to varunneal/parameter-golf that referenced this pull request May 1, 2026
codemath3000 added a commit to codemath3000/parameter-golf that referenced this pull request May 2, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 5, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request May 5, 2026
…erged, final SOTA 1.05651

PR openai#2146 merged May 1: audit complete, 4 grace-policy PRs accepted (openai#1945/openai#1953/openai#2014/openai#2135).
Official final SOTA is 1.05651 (codemath3000, PR openai#2135). PR openai#2130 rejected for data overlap.
CLAUDE.md updated to reflect completed audit and new top-5 leaderboard.

https://claude.ai/code/session_01V4CoM1HMPmJDcDHzdtdy7X
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 6, 2026
Auto-detects GPU capability and patches train_gpt.py at runtime:
- FA dispatch: FA4 (sm_100/120 try) → FA2 fallback (sm_120) → FA3 (sm_90)
- linear_leaky_relu_square block sizes per SMEM:
  - sm_90 (H100):  256x128x64 ns=4/3 (192 KB, 84%)
  - sm_100 (B200): 256x128x64 ns=5/4 (240 KB, 94%, +1 stage for HBM3e)
  - sm_120 (Pro 6000 Max-Q): 128x128x64 ns=3/2 (96 KB, 95%, fixes prior 36 KB under-utilization)

Tests 100 steps at fixed ctx=3072 (no progressive schedule, no eval).
Mini-retokenizes only first 50K docs (~5 min) to skip the 50-min full retokenize.
Uploads steps.csv + summary.json to HF dataset and auto-stops pod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 6, 2026
FA4's flash_attn_varlen_func has different kwargs than FA3 (no max_seqlen_q/k),
which breaks PR openai#2014 train_gpt.py line 1224 varlen call:
  TypeError: Unexpected keyword arguments: ['max_seqlen_q', 'max_seqlen_k']

Mixed dispatch: route basic flash_attn_func calls (3 sites) to FA4 (where speed
matters most, ~70% of attn), keep varlen call (1 site) on FA2 (compatible kwargs).
This keeps the FA4 speedup on the dominant attn path without an API-translation
layer for the varlen kwargs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
…clone

ROOT CAUSE of all 3 prior Phase Z failures: the runpod/parameter-golf:latest
image has /workspace/parameter-golf pre-cloned with an older commit that
predates the PR openai#2014 reproduction directory. The 'if [ ! -d .git ]' guard
meant we never re-cloned, so train_gpt.py was missing — leading to 'No such
file or directory' on the cp / cd / torchrun calls.

Earlier 'cwd bug' fix was a red herring (real issue was the source tree on
disk was the wrong revision).

Fix: when .git exists, fetch the requested BRANCH at depth=1 and hard-reset
to its tip. Log HEAD + ls of train_gpt.py for fast smoke verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
v4's re-sync logic assumed either '.git exists, fetch' OR 'nothing exists,
clone fresh'. But the runpod/parameter-golf:latest image has a third state:
/workspace/parameter-golf is populated WITHOUT .git — so git clone fails
'destination already exists and is not an empty directory'. The script's
tail -3 swallowed the error and we limped along with an incomplete tree
that missed the PR openai#2014 reproduction directory.

Fix: when REPO exists without .git, preserve data + tokenizers under
/workspace/pg_image_cache, rm -rf REPO, then fresh-clone the branch.
After clone, copy preserved data back (-n to not overwrite tracked files).

Also dump HEAD short SHA + ls of train_gpt.py + prepare_caseops_data_parallel.py
right after clone for fail-fast smoke verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
3 variants on 1× B200 (Triton 3.5.1, torch 2.6.0, FA4+FA2 mixed, PR openai#2014 workload):
  A_baseline  256x128x64 ns=4/3  1,378,166 tok/s  (= Phase Y v6 yesterday)
  B_M_fine    128x128x64 ns=4/3  1,338,010 tok/s  (-2.91%)
  C_N_wide    128x256x64 ns=3/2  1,357,513 tok/s  (-1.50%)

Default 256x128x64 ns=4/3 is the local optimum on this stack. Smaller M tile
oversubscribes SMs (launch overhead dominates), wider N tile loses to reduced
num_stages (less prefetch overlap). Conclusion: B200 has no easy unlock via
MLP block-size tuning on Triton 3.5.1.

The original Phase Z hypothesis (Triton 3.7 enables native sm_100 tcgen05 PTX)
remains untested — Triton 3.7 is incompatible with torch 2.6 inductor
(KernelMetadata.cluster_dims AttributeError). Would need full torch + Triton
+ FA stack rebuild to test; tabled.

Real remaining headroom on B200 (in order of tractability):
  1. CUDA graphs / reduced launch overhead (+5-15%, ~2-4h work)
  2. FP8 transformer-engine for MLP up/down (+25-30%, half-day to 1d work)
  3. torch 2.7+ Triton 3.7 native sm_100 codegen (+20-50%, 1-2d work)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
After Phase Z ruled out block-size tuning, this doc proposes the 3 remaining
feasible paths and explicitly asks for greenlight before spawning more pods.

Tier 1 (recommended): enable CUDA Graphs via TORCHINDUCTOR_CUDAGRAPHS=1.
One-line change. PR openai#2014 already does torch.compile(dynamic=False,
fullgraph=True) which is graph-capture-friendly. Expected +5-15%. Cost
~$1.50 for one verification pod.

Tier 2: FP8 transformer-engine for MLP up/down. Half-day to 1d engineering.
Risk: BPB regression from FP8 quant; also unclear if MLP is compute-bound vs
memory-bw-bound at K=512. Greenlight criterion: profiler trace showing MLP
>30% of step time.

Tier 3: torch 2.7 + Triton 3.7 + FA stack rebuild. Most uncertain; 1-2d work.
Only do if Tiers 1+2 don't reach target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Empirical findings from 2 B200 pods (~$1.50 total):

v1 (torch 2.6→2.11 + bundled triton 3.6): flash-attn ABI mismatch — wheel
built for torch 2.9 has undefined symbol against torch 2.11's libc10_cuda.
SDPA fallback also fails under torch.compile (Invalid backend for bf16 GQA
on B200 with torch 2.11). Result: 0 training steps for variant B.

v2 (keep torch 2.9.1, only swap triton 3.5.1 → 3.6.0): isolated the triton
variable. torch.compile(matmul) smoke gate PASSED but PR openai#2014's actual
triton-kernel path failed with the SAME error as Phase Z v5 with triton 3.7:

    File 'torch/_inductor/runtime/triton_heuristics.py', line 1757
        (binary.metadata.num_ctas, *binary.metadata.cluster_dims)
    AttributeError: 'KernelMetadata' object has no attribute 'cluster_dims'

KEY FINDING: The KernelMetadata.cluster_dims removal happened in triton
3.5.1→3.6.0, NOT 3.6→3.7. torch 2.9.1's inductor was compiled against the
3.5.x metadata schema. Any triton ≥ 3.6 breaks this image's inductor.

torch 2.10+ has the dual-name fallback (get_first_attr 'cluster_dims',
'clusterDims') needed to handle triton 3.6+. But torch 2.10+ doesn't have
prebuilt flash-attn wheels — requires source build (15-20 min on pod).

Tier 3 realistic estimate: 2-3 pod iterations, $15-30, uncertain upside
because Phase Z already showed block-size tuning doesn't help (suggesting
the bottleneck isn't the Triton-emitted kernel).

Pivot recommendation: Tier 1 CUDA Graphs is highest-expected-return remaining
unlock (one-line env var, ~$1.50 verification, expected +5-15%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Enable inductor cuda graphs via torch._inductor.config.triton.cudagraphs=True
(set BEFORE torch.compile() — patched into train_gpt.py via CUDA_GRAPHS_ENABLED
env var). Single B200 pod, A_no_graphs vs B_cudagraphs sequential.

Hypothesis: PR openai#2014 is partially launch-overhead-bound. With 17 layer-applies
× ~20 kernels per layer ≈ 340 kernel launches per step on the model's hot path,
host-side launch latency probably eats 10-20% of step time. CUDA graphs collapse
these to one cudaGraphLaunch.

PR openai#2014 already uses torch.compile(dynamic=False, fullgraph=True) and
COMPILE_SHAPE_WARMUP=1 enumerates the cu_bucket shapes, so graph capture
shape coverage should be complete. Worst case: fallback to eager on missed
shapes (no correctness impact).

Expected: +5-15%. Probability of clean run: 70%+ (no version churn, just
one flag in inductor config).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
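For readers following along, a minimal sketch of what this commit describes (the CUDA_GRAPHS_ENABLED gate is named in the commit; the toy model is my own stand-in, not the actual train_gpt.py patch). The inductor flag must be flipped before the first torch.compile() call:

```python
# Sketch: enable inductor CUDA graphs before compiling, gated by CUDA_GRAPHS_ENABLED.
import os
import torch
import torch.nn as nn

if os.environ.get("CUDA_GRAPHS_ENABLED", "0") == "1":
    import torch._inductor.config as inductor_config
    inductor_config.triton.cudagraphs = True  # must be set before any torch.compile() call

model = nn.Linear(512, 512).cuda().bfloat16()            # toy stand-in for the real model
compiled = torch.compile(model, dynamic=False, fullgraph=True)
out = compiled(torch.randn(8, 512, device="cuda", dtype=torch.bfloat16))
```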
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Single B200 pod (zu2a9pxy91h5rn, $1.10), torch 2.9.1 + triton 3.5.1 image,
A vs B with torch._inductor.config.triton.cudagraphs flag flipped between
variants. Patched train_gpt.py to set the flag from CUDA_GRAPHS_ENABLED env
var BEFORE any torch.compile() call.

Steady state (steps 20..100, 9 datapoints each):
  A_no_graphs   mean 1,378,297 tok/s (matches Phase Z baseline)
  B_cudagraphs  mean 1,428,986 tok/s  (+3.68%, +50,689 tok/s)

Caveat: CUDA Graphs changes kernel scheduling enough that train_loss values
diverge between variants from step 2 onwards (A step 2 = 12.8549 vs B step 2 =
13.0082). Pure-throughput bench is unaffected; production training would need
to verify final BPB stays in seed-noise band.

Cumulative B200 unlock state on PR openai#2014:
  Phase Y baseline (FA4-mixed):     1,378,166 tok/s
  Phase Z (block-size scan):        0% (no win)
  Phase A1 (triton 3.6/3.7):        incompat (KernelMetadata.cluster_dims)
  Phase A2 (CUDA Graphs):           +3.68%  ← LOCKED IN

Recommendation: keep cuda graphs flag enabled in any production run. The next
remaining unlock is Tier 2 (FP8 TE for MLP up/down), but that's half-day to
full-day engineering and needs profiler trace + BPB regression verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Targets the plain F.linear(post, w2) on line 1067 of FusedLinearLeakyReLUSquareFunction
(the down-projection of PR openai#2014's MLP). Up-projection stays fused-triton bf16 since
replacing it requires rewriting the custom autograd Function (it returns pre-activation
aux for backward reuse).

Helper _fp8_linear() does per-call amax-based e4m3 quantize then torch._scaled_mm.
Gated by FP8_MLP_ENABLED env var. CUDA Graphs ON in both variants (locked in from
Phase A2). Per-call amax is two GPU reductions, no host sync — should be graph-friendly.

Variants:
  A — CUDA Graphs ON + bf16 MLP   (= Phase A2 B_cudagraphs ~1.429M baseline)
  B — CUDA Graphs ON + FP8 down   (this experiment)

Tradeoffs:
 - FP8 down at 2x tensor-core throughput should be net positive
 - Per-call amax adds 2 reductions per matmul (~us)
 - Loss values will diverge from A (numerical change). For bench-only test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
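A rough sketch of the per-call amax-scaled FP8 matmul described above (assumptions: torch >= 2.4 semantics for the private torch._scaled_mm API, tensorwise e4m3 scales, row-major activations, and a contiguous (out, in) weight; an illustration, not the benchmarked patch):

```python
import torch

def _fp8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Approximate y = x @ w.T with dynamic (per-call) float8_e4m3fn quantization.

    x: (tokens, in_features) bf16, w: (out_features, in_features) bf16.
    """
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Per-call amax scales: two GPU reductions, no host synchronization.
    sx = (x.abs().amax().float() / finfo.max).clamp(min=1e-12)
    sw = (w.abs().amax().float() / finfo.max).clamp(min=1e-12)
    xq = (x / sx).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    wq = (w / sw).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    # _scaled_mm expects mat2 column-major; transposing the contiguous (out, in) weight gives that.
    return torch._scaled_mm(xq, wq.t(), scale_a=sx, scale_b=sw, out_dtype=x.dtype)
```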
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Single B200 pod (1xgh2ye47bfzaa, $1.10), CUDA Graphs ON for both variants.
FP8 e4m3 via torch._scaled_mm replaces F.linear(post, w2) in PR openai#2014's
FusedLinearLeakyReLUSquareFunction down-projection. Per-call amax for both
activation and weight, no host sync (graph-friendly).

Steady state (warm steps 20..100, 9 datapoints):
  A_bf16_mlp  1,445,456 tok/s (loss@100=4.3793)
  B_fp8_mlp   1,405,190 tok/s (loss@100=4.1690)  -2.79%

Why FP8 lost on this workload:
 - Only half MLP changed (up-proj stays fused BF16 triton; replacing it needs
   rewriting the custom autograd Function)
 - Per-call amax adds 4 extra reductions per layer × 17 layer-applies × 100
   steps = ~41 ms wall overhead
 - F.linear had inductor fusion with surrounding ops; _scaled_mm + quantize +
   dequantize becomes 3 separate kernel launches

What this rules out: simple drop-in FP8 down-projection. Doesn't rule out:
(a) full MLP rewrite with FP8 up+down (~half-day work), (b) static-scaling
amax buffers (no per-call reduce), (c) transformer-engine.Linear with
DelayedScaling. All half-day+ investments with uncertain upside.

Final B200 unlock for PR openai#2014: CUDA Graphs (+3.68% from A2) is the cheap win.
Cumulative: 1,378,166 → 1,428,986 tok/s. Tier 2 needs more invasive surgery
to pay off; recommend stopping here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>