Record: PR1855/PR1953 base + Progressive context growth (val_bpb: 1.05759, 3-seed)#2014

Open
simonbissonnette wants to merge 2 commits into openai:main from simonbissonnette:submission/final-growth-candidate

Conversation


simonbissonnette commented Apr 30, 2026

Record candidate: SP8192 CaseOps + Progressive 3k Context Growth + Short-Doc Score-First TTT

val_bpb: 1.05759 (3-seed mean, std 0.00034) | val_loss: 2.31441 nats (std 0.00075) | 15.98 MB max | 8xH100 SXM | 600s train / 600s eval

Improvement over merged PR #1855 leaderboard record (1.06107587 BPB):
-0.00348 BPB / -0.00762 nats
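For context on how the two reported metrics relate, a quick back-of-the-envelope check (my own arithmetic, assuming val_loss is mean nats per token and val_bpb is bits per original byte via the byte sidecar): both the absolute numbers and the deltas imply the same average of roughly 3.16 original bytes per token, so the two figures are internally consistent.

```python
# Hedged consistency check (not from the submission's code), assuming
#   bpb = nats_per_token / ln(2) / bytes_per_token.
import math

print(2.31441 / math.log(2) / 1.05759)   # ~3.157 bytes/token from the absolute values
print(0.00762 / math.log(2) / 0.00348)   # ~3.159 bytes/token from the deltas vs PR #1855
```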

This stacks a progressive training-context schedule and a short-document TTT schedule on top of the late-April CaseOps/SP8192/LQER/SparseAttnGate/BOS-fixed SmearGate lineage. The direct leaderboard comparison is PR #1855, which is the current merged leader used here as the baseline.

Results

| Seed | Steps | ms/step | Train ms | Pre-quant BPB | Quant BPB | Post-TTT BPB | TTT eval s | Artifact bytes |
|------|-------|---------|----------|---------------|-----------|--------------|------------|----------------|
| 42   | 4,888 | 121.9 | 596,025 | 1.05993108 | 1.06833072 | 1.05740567 | 572.4 | 15,981,945 |
| 314  | 4,882 | 122.1 | 595,976 | 1.05975470 | 1.06832443 | 1.05730104 | 489.9 | 15,984,387 |
| 0    | 4,884 | 122.0 | 596,022 | 1.06072266 | 1.06902034 | 1.05807084 | 493.5 | 15,981,122 |
| Mean | 4,884.7 | 122.0 | 596,008 | 1.06013615 | 1.06855850 | 1.05759252 | 518.6 | 15,982,485 |

3-seed population std: 0.00034091 BPB / 0.00074604 nats.

All included seeds are under the 16,000,000-byte artifact cap and the 600s train/eval budgets as logged. The maximum artifact is 15,984,387 bytes and the maximum validation-data TTT pass is 572.4s.

Full validation coverage

All three logs evaluate the full CaseOps validation shard target set:

| Seed | val_tokens | target_tokens |
|------|------------|---------------|
| 42   | 47,853,343 | 47,853,343 |
| 314  | 47,853,343 | 47,853,343 |
| 0    | 47,853,343 | 47,853,343 |

The training script explicitly keeps the validation tail via EVAL_INCLUDE_TAIL=1. This avoids the older multiple-of-context truncation and makes the standard diagnostic eval and quantized TTT eval agree on the same target count.
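As a toy illustration of the difference (my own sketch, not the repo's eval loop), here is what tail-inclusive versus multiple-of-context target counts look like at a 3072 context:

```python
# Toy sketch (not the repo's eval code): target token counts with and without
# the older multiple-of-context truncation, at EVAL_SEQ_LEN=3072.
def target_tokens(n_val_tokens: int, seq_len: int, include_tail: bool) -> int:
    if include_tail:
        return n_val_tokens                      # EVAL_INCLUDE_TAIL=1: score every validation token
    return (n_val_tokens // seq_len) * seq_len   # old behaviour: drop the ragged tail

n = 47_853_343
print(target_tokens(n, 3072, include_tail=False))  # 47,852,544 -> 799-token tail silently dropped
print(target_tokens(n, 3072, include_tail=True))   # 47,853,343 -> matches target_tokens in the logs
```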

The tokenizer, CaseOps transform, training shards, validation shard, and byte sidecar format all come from the same canonical HF-hosted CaseOps export used by the merged PR #1855 setup. If a reviewer already has the clean #1855/HF CaseOps data staged, those same staged shards can be reused here. The included tokenizer/prep files are present only to make this submission self-contained; the preferred reproduction path is to download the canonical HF CaseOps export directly.

What changed vs PR #1855

This submission keeps the same overall 11-layer SP8192 CaseOps recurrent-transformer family as PR #1855, then adds the following levers:

| Lever | Setting | Purpose |
|-------|---------|---------|
| Progressive train context | TRAIN_SEQ_SCHEDULE=1024@0.100,2048@0.700,3072@1.000 | Train cheaply at 1k early, move to 2k for most of training, then finish at 3k context (see the schedule sketch after this table). |
| Final/eval context | TRAIN_SEQ_LEN=3072, EVAL_SEQ_LEN=3072, TTT_EVAL_SEQ_LEN=3072, EVAL_STRIDE=1536 | Extend the final model and TTT scoring context beyond 2k without the 4k eval-time cost. |
| Long-context TTT mask | TTT_MASK=no_qv, TTT_Q_LORA=0, TTT_V_LORA=0 | Keep K/O/MLP LoRA adaptation while removing Q/V adapters that were less helpful at longer context. |
| TTT local LR | TTT_LOCAL_LR_MULT=0.75 | Slightly softer per-document LoRA adaptation. |
| Short-doc score-first chunks | TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24, default chunk 48 | Use smaller score-before-update chunks for short documents, preserving causality while improving adaptation. |
| TTT phases | PHASED_TTT_NUM_PHASES=1, PHASED_TTT_PREFIX_DOCS=2500 | Single score-first phased pass with a 2500-doc prefix budget. |
| QK gain | QK_GAIN_INIT=5.25 | Public long-context sweep result from the PR #1953 lineage. |
| Compression/quant stack | COMPRESSOR=pergroup, AWQ-lite, asymmetric logit rescale | Inherited from public late-April quantization/compression work stacked on the PR #1855 base. |
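A minimal sketch of how a schedule string like this could be interpreted (hypothetical parser, not the repo's implementation; with TRAIN_SEQ_SCHEDULE_MODE=wallclock the progress fraction would be elapsed wallclock over the training budget):

```python
# Hypothetical parser for a "len@frac,len@frac,..." schedule string.
def seq_len_at(progress: float,
               schedule: str = "1024@0.100,2048@0.700,3072@1.000") -> int:
    """Training context length for a progress fraction in [0, 1]."""
    stages = [(int(length), float(frac)) for length, frac in
              (entry.split("@") for entry in schedule.split(","))]
    for seq_len, upto in stages:
        if progress <= upto:
            return seq_len
    return stages[-1][0]  # past the last breakpoint: keep the final context

assert seq_len_at(0.05) == 1024   # cheap 1k context for the first 10% of the run
assert seq_len_at(0.50) == 2048   # 2k for the bulk of training
assert seq_len_at(0.95) == 3072   # finish at 3k context
```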

The short-doc TTT schedule does not train on future validation tokens. It only changes the chunk granularity used inside the existing score-before-update loop: each chunk is scored first, and the LoRA update derived from it is applied only to later chunks.
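A minimal sketch of that rule (hypothetical helpers `score` and `lora_step`, not the repo's quantized_ttt_phased), with the chunk-size table taken from TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24 and the default chunk of 48:

```python
# Hypothetical sketch of score-before-update TTT with the short-doc chunk schedule.
def chunk_size_for(doc_len: int, default_chunk: int = 48) -> int:
    for max_len, chunk in ((256, 8), (2000, 24)):   # TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24
        if doc_len <= max_len:
            return chunk
    return default_chunk                             # TTT_CHUNK_SIZE=48 for longer documents

def score_first_pass(doc_tokens, score, lora_step):
    """`score` and `lora_step` are stand-ins for the model's scoring and LoRA-update calls."""
    chunk = chunk_size_for(len(doc_tokens))
    total_nats = 0.0
    for start in range(0, len(doc_tokens), chunk):
        piece = doc_tokens[start:start + chunk]
        total_nats += score(piece)   # 1) score with the current adapters (counts toward val_bpb)
        lora_step(piece)             # 2) adapt; only later chunks of this document see the update
    return total_nats
```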

Architecture and training stack

| Component | Setting |
|-----------|---------|
| Model | 11 layers, 512d, 8 query heads, 4 KV heads, MLP 4x |
| Tokenizer/data | SP8192 CaseOps lossless caps with byte sidecar accounting |
| RoPE | Partial RoPE, 16 dims |
| Recurrence | Layers 3-5 looped, enabled at frac=0.35 |
| Parallel decoder | Parallel lane from layer 8, mean final lane |
| XSA | All 11 layers |
| Gates | BOS-fixed SmearGate, SparseAttnGate with gate_window=12, scale 0.5 |
| Optimizer | Muon on matrix params, Adam on embedding/scalars, BETA2=0.99 |
| EMA | ema_decay=0.9965 |
| Quantization | GPTQ int6 matrices, int7 embeddings, LQER asymmetric rank-4 correction |
| GPTQ reserve | GPTQ_RESERVE_SECONDS=4.0; logs show gptq:reserving 4s, effective=596000ms (see the arithmetic check after this table) |
| Compression | Per-group compression |
| TTT | Quantized phased LoRA TTT, score-first, no_qv mask, short-doc chunk schedule |
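A back-of-the-envelope check of the reserve arithmetic (my own, assuming the reserve is simply subtracted from the 600s wallclock budget):

```python
# Assumed arithmetic, not the repo's timer code: the training loop stops
# GPTQ_RESERVE_SECONDS before MAX_WALLCLOCK_SECONDS, matching the logged effective budget.
MAX_WALLCLOCK_SECONDS = 600
GPTQ_RESERVE_SECONDS = 4.0
effective_ms = int((MAX_WALLCLOCK_SECONDS - GPTQ_RESERVE_SECONDS) * 1000)
assert effective_ms == 596_000   # "effective=596000ms" in the logs; per-seed train ms is ~596,0xx
```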

Compliance notes

  • Artifact cap: all seeds <= 15,984,387 bytes.
  • Training wallclock: all training loops stop around 596.0s with GPTQ_RESERVE_SECONDS=4.0; GPTQ hessian collection is logged immediately after (67 Hessians in 4.1s) for transparency.
  • Eval wallclock: all validation-data TTT passes are <= 572.4s. The ttt_lora:compile warmup uses random tokens and no validation data; it is logged separately from total_eval_time.
  • Score-before-update: quantized_ttt_phased scores each chunk before applying that chunk's LoRA update. The short-doc schedule only changes chunk size.
  • Full validation targets: val_tokens == target_tokens == 47853343 in all included logs.
  • No validation data in training: training uses only training shards. TTT accesses validation documents left-to-right under the score-first rule.
  • No external cache or direct memorization: no SLOT, n-gram cache, PPM mixture, logit bias table, or validation-derived precomputation.
  • Original-byte BPB: CaseOps byte sidecar accounting is preserved.

Reproduction

Install the dependencies in requirements.txt. FlashAttention 3 and the lrzip system binary are called out there because they require separate installation steps.

This submission uses the clean canonical CaseOps SP8192 export hosted on Hugging Face. The logs were produced from a 50,000-document validation split with 80 training shards (train_shards: 80, ttt_phased: total_docs:50000, and val_tokens == target_tokens == 47853343 in every included log).

Preferred data setup:

python3 - <<'PY'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="romeerp/parameter-golf-caseops-v1",
    repo_type="dataset",
    local_dir="./data/datasets/fineweb10B_sp8192_caseops",
    allow_patterns=[
        "datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/*",
        "datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model",
    ],
    max_workers=8,
)
PY

Then set:

DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved
TOKENIZER_PATH=./data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

Fallback local rebuild: if the HF export is unavailable, rebuild from the canonical docs_selected.jsonl with the included prepare_caseops_data.py, lossless_caps.py, and tokenizer. Use --val-docs 50000 and write into a fresh output directory. The prep script now defaults to 50,000 validation docs and refuses to write over existing fineweb_*.bin shards unless --overwrite is passed, to avoid accidentally mixing stale validation shards with a new train split.

Run one seed at a time, replacing DATA_PATH and TOKENIZER_PATH with the staged CaseOps paths:

for SEED in 42 314 0; do
  NCCL_NET=Socket \
  DATA_DIR=./data \
  DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  CASEOPS_ENABLED=1 \
  VOCAB_SIZE=8192 \
  ITERATIONS=20000 \
  MAX_WALLCLOCK_SECONDS=600 \
  EVAL_INCLUDE_TAIL=1 \
  TRAIN_SEQ_LEN=3072 \
  ROPE_TRAIN_SEQ_LEN=3072 \
  TRAIN_SEQ_SCHEDULE=1024@0.100,2048@0.700,3072@1.000 \
  TRAIN_SEQ_SCHEDULE_MODE=wallclock \
  SEQ_CHANGE_WARMUP_STEPS=32 \
  EVAL_SEQ_LEN=3072 \
  EVAL_STRIDE=1536 \
  TTT_ENABLED=1 \
  TTT_EVAL_SEQ_LEN=3072 \
  TTT_BATCH_SIZE=24 \
  TTT_CHUNK_SIZE=48 \
  TTT_SHORT_SCORE_FIRST_ENABLED=1 \
  TTT_SHORT_DOC_LEN=2000 \
  TTT_SHORT_CHUNK_SIZE=24 \
  TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24 \
  TTT_LORA_RANK=80 \
  TTT_LORA_LR=0.0001 \
  TTT_LOCAL_LR_MULT=0.75 \
  TTT_MASK=no_qv \
  TTT_Q_LORA=0 \
  TTT_V_LORA=0 \
  TTT_WEIGHT_DECAY=0.5 \
  TTT_BETA2=0.99 \
  PHASED_TTT_PREFIX_DOCS=2500 \
  PHASED_TTT_NUM_PHASES=1 \
  WARMDOWN_FRAC=0.85 \
  BETA2=0.99 \
  QK_GAIN_INIT=5.25 \
  SPARSE_ATTN_GATE_ENABLED=1 \
  SPARSE_ATTN_GATE_SCALE=0.5 \
  GATED_ATTN_QUANT_GATE=1 \
  SMEAR_GATE_ENABLED=1 \
  GATE_WINDOW=12 \
  FUSED_CE_ENABLED=1 \
  MATRIX_LR=0.026 \
  MIN_LR=0.1 \
  GRAD_CLIP_NORM=0.3 \
  EMBED_BITS=7 \
  EMBED_CLIP_SIGMAS=14.0 \
  MATRIX_CLIP_SIGMAS=12.85 \
  ATTN_CLIP_SIGMAS=13.0 \
  MLP_CLIP_SIGMAS=11.5 \
  LQER_ENABLED=1 \
  LQER_RANK=4 \
  LQER_TOP_K=3 \
  LQER_FACTOR_BITS=4 \
  LQER_ASYM_ENABLED=1 \
  LQER_ASYM_GROUP=64 \
  AWQ_LITE_ENABLED=1 \
  AWQ_LITE_BITS=8 \
  AWQ_LITE_GROUP_TOP_K=1 \
  AWQ_LITE_GROUP_SIZE=64 \
  ASYM_LOGIT_RESCALE=1 \
  GPTQ_RESERVE_SECONDS=4.0 \
  GPTQ_CALIBRATION_BATCHES=16 \
  COMPRESSOR=pergroup \
  VAL_LOSS_EVERY=0 \
  SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py \
      > train_seed${SEED}.log 2>&1
done

Included files

Lineage and credits

This submission is a stack on top of the public CaseOps/SP8192 record lineage, most directly the PR #1855 and PR #1953 bases.

The new contribution here is the combination of progressive 3k train/eval context growth with the short-document score-first TTT chunk schedule, while preserving the full validation target count and staying under the artifact/eval budgets.

Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
Pull PR openai#2014's record dir from openai/parameter-golf and reproduce its 1.05759
3-seed mean. Key new levers vs openai#1953: EVAL_SEQ_LEN=3072, train_seq_schedule
1024->2048->3072, single-phase TTT (NUM_PHASES=1, PREFIX=2500), short-doc
score-first chunking (TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24).

Even with our infra's ~1.5-2 milli-BPB inflation pattern, reproducing openai#2014
should land ~1.0590 — close enough to record bar to potentially clear it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Port the 040C 'middle 5x / late 3.4x' allocation onto simonbissonnette's
progressive-3k base (openai#2014) and screen vs uniform 4.0 baseline. Training-only,
4xH100 1200s, single seed. Code on exp/300-040c-on-2014 @ d174313.

Spec flags the column-slice-in-compile hazard from feedback memory and
mandates a compile-sanity check before scaling. PREQUANT_ONLY=1 keeps the
screen cheap by skipping serialize/GPTQ/TTT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 1, 2026
…ean 1.05831 BPB

Clears record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108),
p < 0.0001. All 3 seeds produce 15.99 MB artifacts under the 16 MB cap, all under
the 600s wallclock budget.

Per-seed:
- 42:   ttt=1.05793  art=15,986,149  eval=572.6s
- 314:  ttt=1.05852  art=15,987,257  eval=553.7s
- 1234: ttt=1.05849  art=15,989,895  eval=574.1s

Submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/
contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a
detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923
-> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition
C1-C4 legality check. submission.json author/github_id are placeholders pending
the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single
8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Idan3011 pushed a commit to Idan3011/parameter-golf that referenced this pull request May 1, 2026
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
AsymLogit Rescale (PR openai#1923) ported as 2 TTT-adaptable scalar params (softcap_pos, softcap_neg).
Pre-quant 1.06160 (slightly worse than S55's 1.06058 — AsymLogit hurts un-adapted model).
TTT recovery -0.01267 (much better than S55's -0.01103) — AsymLogit gives massive adaptive capacity.
Final 1.05759 = -0.00055 vs S55. Single-seed matches PR openai#2014's 3-seed mean.
Eval 521.7s (under 600s cap), Size 15,946,610.
softcap_pos and softcap_neg init to logit_softcap=30.0, adapted per-doc via TTT-LoRA optimizer.
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
User pushed back on openai#2014's LEAK call as too inference-based. Verified directly:
- README says "uses same shards as PR openai#1855. If you don't have them, prepare
  with included prepare_caseops_data.py" — phrasing implies inheritance from
  openai#1855 (LEAK) but doesn't explicitly invoke prep
- No setup.sh, no shell script invoking prep
- No HF download script
- Path /dev/shm/pgolf_caseops_data_80_l17_final is custom flat RAM-disk dir
  (not triple-nested local-prep signature)
- Could be either HF-flattened download OR local-prep copy

Demoted openai#2014 from LEAK to AMBIGUOUS (lean LEAK based on "same shards as openai#1855"
English, but not iron-clad).

Updated tally: CLEAN 9, LEAK 20 (was 21), AMBIGUOUS 4 (was 3), INHERIT 1.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
…E_OUTSIDE=0

Seed 314: pre-quant 1.06128 / quant 1.06962 / final 1.05701 / eval 571.7s
Compliance: ngram_hint_precompute_outside=False, precompute (166.95s) INSIDE timer per PR openai#1514 precedent.
Token-only tilt: within_gate=0, word_gate=0 - legal per PR openai#1514.
Size 15,943,530 bytes.
Single seed beats openai#2014's 3-seed mean (1.05759).
Validating seeds 42 and 1234.
varunneal added a commit to varunneal/parameter-golf that referenced this pull request May 1, 2026
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB.
Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB.
Beats PR openai#2060 (1.05792) by 0.00122 BPB.

Stack:
- Token-only n-gram tilt (PR openai#1514 merged precedent, within/word channels disabled)
- AsymLogit Rescale (2 trainable scalars adapted by global TTT)
- 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5)
- PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014)
- NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE eval timer per PR openai#1514)

Compliance:
- All seeds eval ≤533.1s (cap 600s, 67-80s margin)
- All artifacts ≤15.95MB (cap 16MB)
- Token-only n-gram channel (within_gate=0, word_gate=0)
- Score-first TTT (per PR openai#402)
varunneal added a commit to varunneal/parameter-golf that referenced this pull request May 1, 2026
codemath3000 added a commit to codemath3000/parameter-golf that referenced this pull request May 2, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 5, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request May 5, 2026
…erged, final SOTA 1.05651

PR openai#2146 merged May 1: audit complete, 4 grace-policy PRs accepted (openai#1945/openai#1953/openai#2014/openai#2135).
Official final SOTA is 1.05651 (codemath3000, PR openai#2135). PR openai#2130 rejected for data overlap.
CLAUDE.md updated to reflect completed audit and new top-5 leaderboard.

https://claude.ai/code/session_01V4CoM1HMPmJDcDHzdtdy7X
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 6, 2026
Auto-detects GPU capability and patches train_gpt.py at runtime:
- FA dispatch: FA4 (sm_100/120 try) → FA2 fallback (sm_120) → FA3 (sm_90)
- linear_leaky_relu_square block sizes per SMEM:
  - sm_90 (H100):  256x128x64 ns=4/3 (192 KB, 84%)
  - sm_100 (B200): 256x128x64 ns=5/4 (240 KB, 94%, +1 stage for HBM3e)
  - sm_120 (Pro 6000 Max-Q): 128x128x64 ns=3/2 (96 KB, 95%, fixes prior 36 KB under-utilization)

Tests 100 steps at fixed ctx=3072 (no progressive schedule, no eval).
Mini-retokenizes only first 50K docs (~5 min) to skip the 50-min full retokenize.
Uploads steps.csv + summary.json to HF dataset and auto-stops pod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 6, 2026
FA4's flash_attn_varlen_func has different kwargs than FA3 (no max_seqlen_q/k),
which breaks PR openai#2014 train_gpt.py line 1224 varlen call:
  TypeError: Unexpected keyword arguments: ['max_seqlen_q', 'max_seqlen_k']

Mixed dispatch: route basic flash_attn_func calls (3 sites) to FA4 (where speed
matters most, ~70% of attn), keep varlen call (1 site) on FA2 (compatible kwargs).
This keeps the FA4 speedup on the dominant attn path without an API-translation
layer for the varlen kwargs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
…clone

ROOT CAUSE of all 3 prior Phase Z failures: the runpod/parameter-golf:latest
image has /workspace/parameter-golf pre-cloned with an older commit that
predates the PR openai#2014 reproduction directory. The 'if [ ! -d .git ]' guard
meant we never re-cloned, so train_gpt.py was missing — leading to 'No such
file or directory' on the cp / cd / torchrun calls.

Earlier 'cwd bug' fix was a red herring (real issue was the source tree on
disk was the wrong revision).

Fix: when .git exists, fetch the requested BRANCH at depth=1 and hard-reset
to its tip. Log HEAD + ls of train_gpt.py for fast smoke verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
v4's re-sync logic assumed either '.git exists, fetch' OR 'nothing exists,
clone fresh'. But the runpod/parameter-golf:latest image has a third state:
/workspace/parameter-golf is populated WITHOUT .git — so git clone fails
'destination already exists and is not an empty directory'. The script's
tail -3 swallowed the error and we limped along with an incomplete tree
that missed the PR openai#2014 reproduction directory.

Fix: when REPO exists without .git, preserve data + tokenizers under
/workspace/pg_image_cache, rm -rf REPO, then fresh-clone the branch.
After clone, copy preserved data back (-n to not overwrite tracked files).

Also dump HEAD short SHA + ls of train_gpt.py + prepare_caseops_data_parallel.py
right after clone for fail-fast smoke verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
3 variants on 1× B200 (Triton 3.5.1, torch 2.6.0, FA4+FA2 mixed, PR openai#2014 workload):
  A_baseline  256x128x64 ns=4/3  1,378,166 tok/s  (= Phase Y v6 yesterday)
  B_M_fine    128x128x64 ns=4/3  1,338,010 tok/s  (-2.91%)
  C_N_wide    128x256x64 ns=3/2  1,357,513 tok/s  (-1.50%)

Default 256x128x64 ns=4/3 is the local optimum on this stack. Smaller M tile
oversubscribes SMs (launch overhead dominates), wider N tile loses to reduced
num_stages (less prefetch overlap). Conclusion: B200 has no easy unlock via
MLP block-size tuning on Triton 3.5.1.

The original Phase Z hypothesis (Triton 3.7 enables native sm_100 tcgen05 PTX)
remains untested — Triton 3.7 is incompatible with torch 2.6 inductor
(KernelMetadata.cluster_dims AttributeError). Would need full torch + Triton
+ FA stack rebuild to test; tabled.

Real remaining headroom on B200 (in order of tractability):
  1. CUDA graphs / reduced launch overhead (+5-15%, ~2-4h work)
  2. FP8 transformer-engine for MLP up/down (+25-30%, half-day to 1d work)
  3. torch 2.7+ Triton 3.7 native sm_100 codegen (+20-50%, 1-2d work)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
After Phase Z ruled out block-size tuning, this doc proposes the 3 remaining
feasible paths and explicitly asks for greenlight before spawning more pods.

Tier 1 (recommended): enable CUDA Graphs via TORCHINDUCTOR_CUDAGRAPHS=1.
One-line change. PR openai#2014 already does torch.compile(dynamic=False,
fullgraph=True) which is graph-capture-friendly. Expected +5-15%. Cost
~$1.50 for one verification pod.

Tier 2: FP8 transformer-engine for MLP up/down. Half-day to 1d engineering.
Risk: BPB regression from FP8 quant; also unclear if MLP is compute-bound vs
memory-bw-bound at K=512. Greenlight criterion: profiler trace showing MLP
>30% of step time.

Tier 3: torch 2.7 + Triton 3.7 + FA stack rebuild. Most uncertain; 1-2d work.
Only do if Tiers 1+2 don't reach target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Empirical findings from 2 B200 pods (~$1.50 total):

v1 (torch 2.6→2.11 + bundled triton 3.6): flash-attn ABI mismatch — wheel
built for torch 2.9 has undefined symbol against torch 2.11's libc10_cuda.
SDPA fallback also fails under torch.compile (Invalid backend for bf16 GQA
on B200 with torch 2.11). Result: 0 training steps for variant B.

v2 (keep torch 2.9.1, only swap triton 3.5.1 → 3.6.0): isolated the triton
variable. torch.compile(matmul) smoke gate PASSED but PR openai#2014's actual
triton-kernel path failed with the SAME error as Phase Z v5 with triton 3.7:

    File 'torch/_inductor/runtime/triton_heuristics.py', line 1757
        (binary.metadata.num_ctas, *binary.metadata.cluster_dims)
    AttributeError: 'KernelMetadata' object has no attribute 'cluster_dims'

KEY FINDING: The KernelMetadata.cluster_dims removal happened in triton
3.5.1→3.6.0, NOT 3.6→3.7. torch 2.9.1's inductor was compiled against the
3.5.x metadata schema. Any triton ≥ 3.6 breaks this image's inductor.

torch 2.10+ has the dual-name fallback (get_first_attr 'cluster_dims',
'clusterDims') needed to handle triton 3.6+. But torch 2.10+ doesn't have
prebuilt flash-attn wheels — requires source build (15-20 min on pod).

Tier 3 realistic estimate: 2-3 pod iterations, $15-30, uncertain upside
because Phase Z already showed block-size tuning doesn't help (suggesting
the bottleneck isn't the Triton-emitted kernel).

Pivot recommendation: Tier 1 CUDA Graphs is highest-expected-return remaining
unlock (one-line env var, ~$1.50 verification, expected +5-15%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Enable inductor cuda graphs via torch._inductor.config.triton.cudagraphs=True
(set BEFORE torch.compile() — patched into train_gpt.py via CUDA_GRAPHS_ENABLED
env var). Single B200 pod, A_no_graphs vs B_cudagraphs sequential.

Hypothesis: PR openai#2014 is partially launch-overhead-bound. With 17 layer-applies
× ~20 kernels per layer ≈ 340 kernel launches per step on the model's hot path,
host-side launch latency probably eats 10-20% of step time. CUDA graphs collapse
these to one cudaGraphLaunch.

PR openai#2014 already uses torch.compile(dynamic=False, fullgraph=True) and
COMPILE_SHAPE_WARMUP=1 enumerates the cu_bucket shapes, so graph capture
shape coverage should be complete. Worst case: fallback to eager on missed
shapes (no correctness impact).

Expected: +5-15%. Probability of clean run: 70%+ (no version churn, just
one flag in inductor config).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
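For readers following along, a minimal sketch of what this commit describes (the CUDA_GRAPHS_ENABLED gate is named in the commit; the toy model is my own stand-in, not the actual train_gpt.py patch). The inductor flag must be flipped before the first torch.compile() call:

```python
# Sketch: enable inductor CUDA graphs before compiling, gated by CUDA_GRAPHS_ENABLED.
import os
import torch
import torch.nn as nn

if os.environ.get("CUDA_GRAPHS_ENABLED", "0") == "1":
    import torch._inductor.config as inductor_config
    inductor_config.triton.cudagraphs = True  # must be set before any torch.compile() call

model = nn.Linear(512, 512).cuda().bfloat16()            # toy stand-in for the real model
compiled = torch.compile(model, dynamic=False, fullgraph=True)
out = compiled(torch.randn(8, 512, device="cuda", dtype=torch.bfloat16))
```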
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Single B200 pod (zu2a9pxy91h5rn, $1.10), torch 2.9.1 + triton 3.5.1 image,
A vs B with torch._inductor.config.triton.cudagraphs flag flipped between
variants. Patched train_gpt.py to set the flag from CUDA_GRAPHS_ENABLED env
var BEFORE any torch.compile() call.

Steady state (steps 20..100, 9 datapoints each):
  A_no_graphs   mean 1,378,297 tok/s (matches Phase Z baseline)
  B_cudagraphs  mean 1,428,986 tok/s  (+3.68%, +50,689 tok/s)

Caveat: CUDA Graphs changes kernel scheduling enough that train_loss values
diverge between variants from step 2 onwards (A step 2 = 12.8549 vs B step 2 =
13.0082). Pure-throughput bench is unaffected; production training would need
to verify final BPB stays in seed-noise band.

Cumulative B200 unlock state on PR openai#2014:
  Phase Y baseline (FA4-mixed):     1,378,166 tok/s
  Phase Z (block-size scan):        0% (no win)
  Phase A1 (triton 3.6/3.7):        incompat (KernelMetadata.cluster_dims)
  Phase A2 (CUDA Graphs):           +3.68%  ← LOCKED IN

Recommendation: keep cuda graphs flag enabled in any production run. The next
remaining unlock is Tier 2 (FP8 TE for MLP up/down), but that's half-day to
full-day engineering and needs profiler trace + BPB regression verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Targets the plain F.linear(post, w2) on line 1067 of FusedLinearLeakyReLUSquareFunction
(the down-projection of PR openai#2014's MLP). Up-projection stays fused-triton bf16 since
replacing it requires rewriting the custom autograd Function (it returns pre-activation
aux for backward reuse).

Helper _fp8_linear() does per-call amax-based e4m3 quantize then torch._scaled_mm.
Gated by FP8_MLP_ENABLED env var. CUDA Graphs ON in both variants (locked in from
Phase A2). Per-call amax is two GPU reductions, no host sync — should be graph-friendly.

Variants:
  A — CUDA Graphs ON + bf16 MLP   (= Phase A2 B_cudagraphs ~1.429M baseline)
  B — CUDA Graphs ON + FP8 down   (this experiment)

Tradeoffs:
 - FP8 down at 2x tensor-core throughput should be net positive
 - Per-call amax adds 2 reductions per matmul (~us)
 - Loss values will diverge from A (numerical change). For bench-only test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
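A rough sketch of the per-call amax-scaled FP8 matmul described above (assumptions: torch >= 2.4 semantics for the private torch._scaled_mm API, tensorwise e4m3 scales, row-major activations, and a contiguous (out, in) weight; an illustration, not the benchmarked patch):

```python
import torch

def _fp8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Approximate y = x @ w.T with dynamic (per-call) float8_e4m3fn quantization.

    x: (tokens, in_features) bf16, w: (out_features, in_features) bf16.
    """
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Per-call amax scales: two GPU reductions, no host synchronization.
    sx = (x.abs().amax().float() / finfo.max).clamp(min=1e-12)
    sw = (w.abs().amax().float() / finfo.max).clamp(min=1e-12)
    xq = (x / sx).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    wq = (w / sw).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    # _scaled_mm expects mat2 column-major; transposing the contiguous (out, in) weight gives that.
    return torch._scaled_mm(xq, wq.t(), scale_a=sx, scale_b=sw, out_dtype=x.dtype)
```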
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 12, 2026
Single B200 pod (1xgh2ye47bfzaa, $1.10), CUDA Graphs ON for both variants.
FP8 e4m3 via torch._scaled_mm replaces F.linear(post, w2) in PR openai#2014's
FusedLinearLeakyReLUSquareFunction down-projection. Per-call amax for both
activation and weight, no host sync (graph-friendly).

Steady state (warm steps 20..100, 9 datapoints):
  A_bf16_mlp  1,445,456 tok/s (loss@100=4.3793)
  B_fp8_mlp   1,405,190 tok/s (loss@100=4.1690)  -2.79%

Why FP8 lost on this workload:
 - Only half MLP changed (up-proj stays fused BF16 triton; replacing it needs
   rewriting the custom autograd Function)
 - Per-call amax adds 4 extra reductions per layer × 17 layer-applies × 100
   steps = ~41 ms wall overhead
 - F.linear had inductor fusion with surrounding ops; _scaled_mm + quantize +
   dequantize becomes 3 separate kernel launches

What this rules out: simple drop-in FP8 down-projection. Doesn't rule out:
(a) full MLP rewrite with FP8 up+down (~half-day work), (b) static-scaling
amax buffers (no per-call reduce), (c) transformer-engine.Linear with
DelayedScaling. All half-day+ investments with uncertain upside.

Final B200 unlock for PR openai#2014: CUDA Graphs (+3.68% from A2) is the cheap win.
Cumulative: 1,378,166 → 1,428,986 tok/s. Tier 2 needs more invasive surgery
to pay off; recommend stopping here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>