
Non-record submission: post-deadline CaseOps + SparseAttnGate + Phased TTT (1.07134 BPB)#2143

Open
upascal wants to merge 1 commit into openai:main from upascal:post-deadline-1.0713-caseops-sag-phasedttt

Conversation


@upascal upascal commented May 2, 2026

Summary

Post-deadline community submission added to records/track_non_record_16mb/. Sharing a configuration that completed May 1 (after the April 30 deadline) for educational value, not for the leaderboard track.

  • Score: 1.07134 quantized_phased_ttt val_bpb (would have placed #7 on the active leaderboard)
  • Artifact: 15.87 MB / 16.00 MB SI cap
  • Hardware: 8×H100 SXM, 596s wallclock
  • Submission folder: records/track_non_record_16mb/2026-05-02_PostDeadline_CaseOps_SparseAttnGate_PhasedTTT_1.0713/

Approach

Stack derived from the 2026-04-27 leader record (1.06128) plus CaseOps tokenizer:

  • Tokenizer: sp12288 SentencePiece + lossless CaseOps transform (lossless_caps_caseops_v1)
  • Model: 12L × 512d × 8H/4KV, partial RoPE, MLP×2 with LeakyReLU(0.5)², tied embeddings
  • Recurrence: layers 3-5 looped 2×, starting at training fraction 0.35
  • Parallel residuals: layers 7-11 (simple parallel-sum, not the leader's 2-lane variant)
  • SmearGate: GATE_WINDOW=12, BOS-masked
  • SparseAttnGate: per-head zero-init sigmoid gate, 96 params/layer (sketched just after this list)
  • CUDA graphs + fused softcapped CE Triton kernel
  • Quantization: GPTQ + Hadamard rotation, mixed bits (int5/int6/int7), LQER asymmetric rank-4 top-3
  • Phased TTT (3 cumulative phases over 2500 prefix docs) with batched LoRA rank-80 on Q/K/V/O/MLP-fc/lm_head, AdamW (lr=1e-4, β2=0.99, wd=0.5); a minimal LoRA sketch appears below
  • Hparams: WARMDOWN_FRAC=0.85, MATRIX_LR=0.026, EMBED_CLIP_SIGMAS=14, MIN_LR=0.1
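
For readers who want the mechanic rather than the config line, here is a minimal sketch of the SparseAttnGate idea as described above: a zero-initialized per-head sigmoid gate on attention output. The listed 96 params/layer implies a richer parameterization than the single scalar per head shown here, so treat every name and shape below as an assumption, not the submission's code.

```python
import torch
import torch.nn as nn

class SparseAttnGate(nn.Module):
    """Per-head sigmoid gate on attention output (illustrative sketch)."""

    def __init__(self, n_heads: int):
        super().__init__()
        # Zero init: every gate starts at sigmoid(0) = 0.5, and training
        # moves each head's gate toward open (1) or closed (0).
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq_len, n_heads, head_dim)
        gate = torch.sigmoid(self.gate_logits)      # (n_heads,)
        return attn_out * gate.view(1, 1, -1, 1)    # scale each head's output
```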

Full env config in submission.json.
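
As a concrete reference for the Phased TTT line in the list above, here is a hand-rolled rank-80 LoRA wrapper plus the stated AdamW settings (lr=1e-4, β2=0.99, wd=0.5; β1 is assumed to be the 0.9 default). This is a sketch, not the submission's batched implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable rank-r update (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 80):
        super().__init__()
        self.base = base.requires_grad_(False)   # base weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # lora_b is zero-initialized, so the adapter starts as an exact no-op.
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

def ttt_optimizer(model: nn.Module) -> torch.optim.AdamW:
    lora_params = [p for n, p in model.named_parameters() if "lora_" in n]
    return torch.optim.AdamW(lora_params, lr=1e-4, betas=(0.9, 0.99),
                             weight_decay=0.5)
```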

What's worth surfacing for other participants

The README in the submission folder documents two bugs we hit while porting the leader's TTT code into a different repo. They may be useful to anyone doing similar porting work:

  1. cu_seqlens plumbing in train_val_ttt_global_sgd_distributed: the leader's global SGD pass uses flash_attn_varlen_func with cu_seqlens so attention cannot leak across BOS boundaries during the prefix update. If your GPT.forward doesn't accept cu_seqlens, this path silently becomes a no-op. Once we threaded it through, the Phased TTT delta relative to sliding tripled (-0.0012 → -0.0037). See the first sketch after this list.
  2. Parallel-lane structure mismatch in forward_ttt: if your base model trains with parallel residuals at some layers, the LoRA-injected forward_ttt needs a corresponding _parallel_block_with_lora method; otherwise those layers silently mismatch between training and evaluation. See the second sketch below.
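
For bug 1, the fix amounts to building cu_seqlens from BOS positions and passing it down to flash_attn_varlen_func. A minimal sketch, assuming every document begins with a BOS token and the batch is packed into one flat token stream (function names are ours, not the leader's):

```python
import torch
from flash_attn import flash_attn_varlen_func

def doc_cu_seqlens(tokens: torch.Tensor, bos_id: int) -> torch.Tensor:
    # tokens: (total_len,) packed token ids; assumes tokens[0] is BOS,
    # so the cumulative-boundary tensor starts at 0 as flash-attn requires.
    bos_pos = (tokens == bos_id).nonzero(as_tuple=True)[0].to(torch.int32)
    end = torch.tensor([tokens.numel()], dtype=torch.int32, device=tokens.device)
    return torch.cat([bos_pos, end])

def varlen_attention(q, k, v, cu_seqlens):
    # q, k, v: (total_len, n_heads, head_dim). Each [cu[i], cu[i+1]) slice
    # is one document; attention cannot leak across those boundaries.
    max_len = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_len, max_seqlen_k=max_len,
        causal=True,
    )
```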

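For bug 2, the shape of the fix is that forward_ttt must branch the same way the base forward does. A hypothetical sketch (parallel_residual_start, lora_attn, and lora_mlp are illustrative names):

```python
def forward_block_ttt(self, x, layer_idx, cu_seqlens):
    """LoRA-injected block forward that mirrors the base model's topology."""
    if layer_idx >= self.parallel_residual_start:
        # Parallel residual: attention and MLP lanes both read the same x
        # and their outputs are summed, matching the base training graph.
        return x + self.lora_attn(x, cu_seqlens) + self.lora_mlp(x)
    # Sequential residual path for earlier layers.
    x = x + self.lora_attn(x, cu_seqlens)
    return x + self.lora_mlp(x)
```
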
Notes

  • 131 KB of unused artifact headroom; we identified an experiment (full split-clip from leader values + LZMA code wrap) that plausibly pushes this below 1.07 but didn't ship before the deadline.
  • This is intentionally a non-record submission — we know the deadline has passed. Posting here so the configuration and approach are visible to the community.

Test plan

  • submission.json parses as JSON
  • train_gpt.py parses with ast.parse
  • final_model.int6.ptz is the exact artifact produced by the run (15,749,430 bytes)
  • run_log.txt is the unmodified stdout from the 8×H100 run with all val_bpb measurements
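
The first three checks can be reproduced in a few lines (paths per the submission folder above):

```python
import ast, json, pathlib

sub = pathlib.Path("records/track_non_record_16mb/"
                   "2026-05-02_PostDeadline_CaseOps_SparseAttnGate_PhasedTTT_1.0713")

json.loads((sub / "submission.json").read_text())    # must parse as JSON
ast.parse((sub / "train_gpt.py").read_text())        # must parse as Python
assert (sub / "final_model.int6.ptz").stat().st_size == 15_749_430
```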

🤖 Generated with Claude Code

Non-record submission: post-deadline CaseOps + SparseAttnGate + Phased TTT (1.07134 BPB)

Post-deadline community submission shared for educational value, not
for a leaderboard track record. Trained on 8xH100 in 596s wallclock,
artifact 15.87 MB / 16.00 MB cap, score 1.07134 quantized_phased_ttt
val_bpb. Would have placed #7 on the active leaderboard.

Stack derived from the 2026-04-27 leader record (1.06128) + CaseOps:
- sp12288 + lossless CaseOps tokenizer
- Hadamard-rotated GPTQ (int5/int6/int7), LQER asymmetric rank-4
- SmearGate, recurrence (12L w/ layers 3-5 looped), parallel residuals
- SparseAttnGate (zero-init per-head), CUDA graphs, fused softcapped CE
- Phased TTT (3 cumulative phases) with batched LoRA rank-80 on Q/K/V/O/MLP/lm_head
- Leader hparams: WARMDOWN_FRAC=0.85, MATRIX_LR=0.026, EMBED_CLIP_SIGMAS=14

The README documents two bugs we hit while porting the leader's TTT
code into a different repo (cu_seqlens plumbing through
flash_attn_varlen_func, parallel-lane mismatch in forward_ttt that
requires a _parallel_block_with_lora method when
PARALLEL_RESIDUAL_START < num_layers). It also notes 131 KB of unused
artifact headroom and an untested experiment (full split-clip + LZMA
code wrap) that plausibly takes this sub-1.07.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
