Non-record submission: post-deadline CaseOps + SparseAttnGate + Phased TTT (1.07134 BPB)#2143
Open
upascal wants to merge 1 commit into
Conversation
…hased TTT (1.07134 BPB)

Post-deadline community submission shared for educational value, not for a leaderboard track record. Trained on 8×H100 in 596 s wallclock; artifact 15.87 MB / 16.00 MB cap; score 1.07134 `quantized_phased_ttt` val_bpb. Would have placed openai#7 on the active leaderboard.

Stack derived from the 2026-04-27 leader record (1.06128) + CaseOps:

- sp12288 + lossless CaseOps tokenizer
- Hadamard-rotated GPTQ (int5/int6/int7), LQER asymmetric rank-4
- SmearGate, recurrence (12L with layers 3-5 looped), parallel residuals
- SparseAttnGate (zero-init per-head), CUDA graphs, fused softcapped CE
- Phased TTT (3 cumulative phases) with batched LoRA rank-80 on Q/K/V/O/MLP/lm_head
- Leader hparams: WARMDOWN_FRAC=0.85, MATRIX_LR=0.026, EMBED_CLIP_SIGMAS=14

The README documents two bugs we hit while porting the leader's TTT code into a different repo (`cu_seqlens` plumbing through `flash_attn_varlen_func`; a parallel-lane mismatch in `forward_ttt` that requires a `_parallel_block_with_lora` method when `PARALLEL_RESIDUAL_START < num_layers`). It also notes 131 KB of unused artifact headroom and an untested experiment (full split-clip + LZMA code wrap) that plausibly takes this sub-1.07.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Post-deadline community submission added to `records/track_non_record_16mb/2026-05-02_PostDeadline_CaseOps_SparseAttnGate_PhasedTTT_1.0713/`. Sharing a configuration that completed May 1 (after the April 30 deadline) for educational value, not for the leaderboard track.

Approach
Stack derived from the 2026-04-27 leader record (1.06128) plus the CaseOps tokenizer (`lossless_caps_caseops_v1`). Full env config in `submission.json`.

What's worth surfacing for other participants
The README in the submission folder documents two bugs we hit while porting the leader's TTT code into a different repo. They may be useful to anyone doing similar porting work:
- `cu_seqlens` plumbing in `train_val_ttt_global_sgd_distributed`: the leader's global SGD pass uses `flash_attn_varlen_func` with `cu_seqlens` to prevent attention from leaking across BOS during the prefix update. If your `GPT.forward` doesn't accept `cu_seqlens`, you silently no-op this path. The phased TTT delta vs. sliding tripled (-0.0012 → -0.0037) once we threaded this through.
- Parallel-lane mismatch in `forward_ttt`: if your base trains with parallel residuals at some layer, your LoRA-injected `forward_ttt` needs a corresponding `_parallel_block_with_lora` method, or it silently train/eval-mismatches on those layers.

Notes
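For context on the first bug: in flash-attn's varlen interface, `cu_seqlens` is a cumulative-offset array marking where each packed document starts and ends, so attention is confined to one document. A minimal sketch of deriving it from BOS positions in a packed stream (pure Python here for clarity; the leader repo's actual plumbing is not shown in this PR, so treat the helper name as hypothetical):

```python
def make_cu_seqlens(tokens, bos_id):
    """Cumulative sequence boundaries for a packed stream of
    BOS-delimited documents, in the shape flash-attn's varlen
    kernels expect: [0, end_of_doc_1, ..., len(tokens)].
    Assumes the stream starts with a BOS token."""
    assert tokens and tokens[0] == bos_id
    starts = [i for i, t in enumerate(tokens) if t == bos_id]
    return starts + [len(tokens)]

# Packed stream: three documents of lengths 3, 2, and 4.
BOS = 0
packed = [BOS, 7, 8, BOS, 9, BOS, 4, 5, 6]
print(make_cu_seqlens(packed, BOS))  # [0, 3, 5, 9]
```

In the real path this list becomes an int32 tensor on device and is passed as `cu_seqlens_q`/`cu_seqlens_k` (plus a `max_seqlen`) to `flash_attn_varlen_func`; the point of the bug above is that it must survive the trip through `GPT.forward`, otherwise the prefix update silently attends across BOS.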
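To make the second bug concrete, here is a framework-free toy showing why a sequential block applied where the base used a parallel residual computes a different function (the scalar sublayers and names are purely illustrative, not the submission's code):

```python
def attn(x):
    # Stand-in for the attention sublayer.
    return 0.5 * x

def mlp(x):
    # Stand-in for the MLP sublayer.
    return 0.25 * x

def sequential_block(x):
    # Sequential residuals: the MLP sees the attn-updated stream.
    h = x + attn(x)
    return h + mlp(h)

def parallel_block(x):
    # Parallel residuals: attn and MLP both read the same input.
    return x + attn(x) + mlp(x)

print(sequential_block(1.0))  # 1.875
print(parallel_block(1.0))    # 1.75
```

If the base checkpoint runs `parallel_block` from some layer onward but the LoRA-injected `forward_ttt` rebuilds every layer sequentially, the TTT updates are fit against a function the eval path never executes, which is exactly the silent train/eval mismatch described above; hence the need for a matching `_parallel_block_with_lora` path.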
Test plan
- `submission.json` parses as JSON
- `train_gpt.py` parses with `ast.parse`
- `final_model.int6.ptz` is the exact artifact produced by the run (15,749,430 bytes)
- `run_log.txt` is the unmodified stdout from the 8×H100 run with all val_bpb measurements

🤖 Generated with Claude Code
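The first three checks are mechanical and can be scripted; a sketch using the file names from the test plan (the folder layout and the `cap_bytes` default are assumptions, not part of the submission):

```python
import ast
import json
import os

def check_submission(folder, cap_bytes=16_000_000):
    """Cheap structural checks mirroring the test plan:
    submission.json must be valid JSON, train_gpt.py must be
    syntactically valid Python, and the artifact must fit the cap."""
    with open(os.path.join(folder, "submission.json")) as f:
        json.load(f)  # raises on malformed JSON
    with open(os.path.join(folder, "train_gpt.py")) as f:
        ast.parse(f.read())  # raises SyntaxError on a bad file
    size = os.path.getsize(os.path.join(folder, "final_model.int6.ptz"))
    # Cap threshold is assumed here; substitute the track's exact limit.
    assert size <= cap_bytes, f"artifact over cap: {size} bytes"
    return size
```

The artifact-identity and log checks (exact byte count, unmodified stdout) still need a hash or manual comparison against the run outputs.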