
Non-record submission: post-deadline CaseOps + SparseAttnGate + Phased TTT (1.07134 BPB)#2143

Open
upascal wants to merge 1 commit into openai:main from upascal:post-deadline-1.0713-caseops-sag-phasedttt

Conversation


@upascal upascal commented May 2, 2026

Summary

Post-deadline community submission added to records/track_non_record_16mb/. Sharing a configuration that completed May 1 (after the April 30 deadline) for educational value, not for the leaderboard track.

  • Score: 1.07134 quantized_phased_ttt val_bpb (would have placed #7 on the active leaderboard)
  • Artifact: 15.87 MB / 16.00 MB SI cap
  • Hardware: 8×H100 SXM, 596s wallclock
  • Submission folder: records/track_non_record_16mb/2026-05-02_PostDeadline_CaseOps_SparseAttnGate_PhasedTTT_1.0713/

Approach

Stack derived from the 2026-04-27 leader record (1.06128) plus CaseOps tokenizer:

  • Tokenizer: sp12288 SentencePiece + lossless CaseOps transform (lossless_caps_caseops_v1)
  • Model: 12L × 512d × 8H/4KV, partial RoPE, MLP×2 with LeakyReLU(0.5)², tied embeddings
  • Recurrence: layers 3-5 looped 2×, starting at training fraction 0.35
  • Parallel residuals: layers 7-11 (simple parallel-sum, not the leader's 2-lane variant)
  • SmearGate: GATE_WINDOW=12, BOS-masked
  • SparseAttnGate: per-head zero-init sigmoid gate, 96 params/layer (sketched just after this list)
  • CUDA graphs + fused softcapped CE Triton kernel
  • Quantization: GPTQ + Hadamard rotation, mixed bits (int5/int6/int7), LQER asymmetric rank-4 top-3
  • Phased TTT (3 cumulative phases over 2500 prefix docs) with batched LoRA rank-80 on Q/K/V/O/MLP-fc/lm_head, AdamW (lr=1e-4, β2=0.99, wd=0.5); a minimal LoRA sketch appears below
  • Hparams: WARMDOWN_FRAC=0.85, MATRIX_LR=0.026, EMBED_CLIP_SIGMAS=14, MIN_LR=0.1
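
For readers who want the mechanic rather than the config line, here is a minimal sketch of the SparseAttnGate idea as described above: a zero-initialized per-head sigmoid gate on attention output. The listed 96 params/layer implies a richer parameterization than the single scalar per head shown here, so treat every name and shape below as an assumption, not the submission's code.

```python
import torch
import torch.nn as nn

class SparseAttnGate(nn.Module):
    """Per-head sigmoid gate on attention output (illustrative sketch)."""

    def __init__(self, n_heads: int):
        super().__init__()
        # Zero init: every gate starts at sigmoid(0) = 0.5, and training
        # moves each head's gate toward open (1) or closed (0).
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq_len, n_heads, head_dim)
        gate = torch.sigmoid(self.gate_logits)      # (n_heads,)
        return attn_out * gate.view(1, 1, -1, 1)    # scale each head's output
```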

Full env config in submission.json.
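
As a concrete reference for the Phased TTT line in the list above, here is a hand-rolled rank-80 LoRA wrapper plus the stated AdamW settings (lr=1e-4, β2=0.99, wd=0.5; β1 is assumed to be the 0.9 default). This is a sketch, not the submission's batched implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable rank-r update (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 80):
        super().__init__()
        self.base = base.requires_grad_(False)   # base weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # lora_b is zero-initialized, so the adapter starts as an exact no-op.
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

def ttt_optimizer(model: nn.Module) -> torch.optim.AdamW:
    lora_params = [p for n, p in model.named_parameters() if "lora_" in n]
    return torch.optim.AdamW(lora_params, lr=1e-4, betas=(0.9, 0.99),
                             weight_decay=0.5)
```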

What's worth surfacing for other participants

The README in the submission folder documents two bugs we hit while porting the leader's TTT code into a different repo. They may be useful to anyone doing similar porting work:

  1. cu_seqlens plumbing in train_val_ttt_global_sgd_distributed: the leader's global SGD pass uses flash_attn_varlen_func with cu_seqlens so attention cannot leak across BOS boundaries during the prefix update. If your GPT.forward doesn't accept cu_seqlens, this path silently becomes a no-op. Once we threaded it through, the Phased TTT delta relative to sliding tripled (-0.0012 → -0.0037). See the first sketch after this list.
  2. Parallel-lane structure mismatch in forward_ttt: if your base model trains with parallel residuals at some layers, the LoRA-injected forward_ttt needs a corresponding _parallel_block_with_lora method; otherwise those layers silently mismatch between training and evaluation. See the second sketch below.
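
For bug 1, the fix amounts to building cu_seqlens from BOS positions and passing it down to flash_attn_varlen_func. A minimal sketch, assuming every document begins with a BOS token and the batch is packed into one flat token stream (function names are ours, not the leader's):

```python
import torch
from flash_attn import flash_attn_varlen_func

def doc_cu_seqlens(tokens: torch.Tensor, bos_id: int) -> torch.Tensor:
    # tokens: (total_len,) packed token ids; assumes tokens[0] is BOS,
    # so the cumulative-boundary tensor starts at 0 as flash-attn requires.
    bos_pos = (tokens == bos_id).nonzero(as_tuple=True)[0].to(torch.int32)
    end = torch.tensor([tokens.numel()], dtype=torch.int32, device=tokens.device)
    return torch.cat([bos_pos, end])

def varlen_attention(q, k, v, cu_seqlens):
    # q, k, v: (total_len, n_heads, head_dim). Each [cu[i], cu[i+1]) slice
    # is one document; attention cannot leak across those boundaries.
    max_len = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_len, max_seqlen_k=max_len,
        causal=True,
    )
```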

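For bug 2, the shape of the fix is that forward_ttt must branch the same way the base forward does. A hypothetical sketch (parallel_residual_start, lora_attn, and lora_mlp are illustrative names):

```python
def forward_block_ttt(self, x, layer_idx, cu_seqlens):
    """LoRA-injected block forward that mirrors the base model's topology."""
    if layer_idx >= self.parallel_residual_start:
        # Parallel residual: attention and MLP lanes both read the same x
        # and their outputs are summed, matching the base training graph.
        return x + self.lora_attn(x, cu_seqlens) + self.lora_mlp(x)
    # Sequential residual path for earlier layers.
    x = x + self.lora_attn(x, cu_seqlens)
    return x + self.lora_mlp(x)
```
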
Notes

  • 131 KB of unused artifact headroom; we identified an experiment (full split-clip from leader values + LZMA code wrap) that plausibly pushes this below 1.07 but didn't ship before the deadline.
  • This is intentionally a non-record submission — we know the deadline has passed. Posting here so the configuration and approach are visible to the community.

Test plan

  • submission.json parses as JSON
  • train_gpt.py parses with ast.parse
  • final_model.int6.ptz is the exact artifact produced by the run (15,749,430 bytes)
  • run_log.txt is the unmodified stdout from the 8×H100 run with all val_bpb measurements
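
The first three checks can be reproduced in a few lines (paths per the submission folder above):

```python
import ast, json, pathlib

sub = pathlib.Path("records/track_non_record_16mb/"
                   "2026-05-02_PostDeadline_CaseOps_SparseAttnGate_PhasedTTT_1.0713")

json.loads((sub / "submission.json").read_text())    # must parse as JSON
ast.parse((sub / "train_gpt.py").read_text())        # must parse as Python
assert (sub / "final_model.int6.ptz").stat().st_size == 15_749_430
```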

🤖 Generated with Claude Code

Non-record submission: post-deadline CaseOps + SparseAttnGate + Phased TTT (1.07134 BPB)

Post-deadline community submission shared for educational value, not
for a leaderboard track record. Trained on 8xH100 in 596s wallclock,
artifact 15.87 MB / 16.00 MB cap, score 1.07134 quantized_phased_ttt
val_bpb. Would have placed #7 on the active leaderboard.

Stack derived from the 2026-04-27 leader record (1.06128) + CaseOps:
- sp12288 + lossless CaseOps tokenizer
- Hadamard-rotated GPTQ (int5/int6/int7), LQER asymmetric rank-4
- SmearGate, recurrence (12L w/ layers 3-5 looped), parallel residuals
- SparseAttnGate (zero-init per-head), CUDA graphs, fused softcapped CE
- Phased TTT (3 cumulative phases) with batched LoRA rank-80 on Q/K/V/O/MLP/lm_head
- Leader hparams: WARMDOWN_FRAC=0.85, MATRIX_LR=0.026, EMBED_CLIP_SIGMAS=14

The README documents two bugs we hit while porting the leader's TTT
code into a different repo (cu_seqlens plumbing through
flash_attn_varlen_func, parallel-lane mismatch in forward_ttt that
requires a _parallel_block_with_lora method when
PARALLEL_RESIDUAL_START < num_layers). It also notes 131 KB of unused
artifact headroom and an untested experiment (full split-clip + LZMA
code wrap) that plausibly takes this sub-1.07.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
