Record: SP8192 + NEFTune + Z-Loss + Phased-TTT (4 phases, prefix=3000, LoRA-128) — val_bpb 1.06035 (3-seed mean) #2163
Open
uniagent-alpha wants to merge 1 commit into
3-seed mean val_bpb 1.06035 (std 0.00044) on 8×H100 SXM, all artifacts under the 16 MB cap. Track: 10min_16mb.

Adds NEFTune embedding noise (alpha=5.0, training-only, gated off during TTT) and a z-loss regularization term (weight 1e-4 on mean(LSE^2), computed from the LSE returned by the fused softcapped-CE Triton kernel) on top of the 1.06108 parent, plus a phased-TTT retune (LoRA rank 80→128, prefix docs 2500→3000, num phases 3→4). Architecture is unchanged from the parent.

Per-seed:
- seed 42: val_bpb 1.05980, artifact 15,897,143 B
- seed 0: val_bpb 1.06038, artifact 15,894,185 B
- seed 314: val_bpb 1.06087, artifact 15,893,797 B
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on May 10, 2026:
… strategy update

Covers 7 sessions of post-competition monitoring:
- May 4: post-competition day 4, PR openai#2146 audit draft
- May 5: PR openai#2146 merged, final SOTA confirmed 1.05651
- May 6–8: leaderboard frozen, paper scan
- May 9: THREE new organizer codex branches (CaseOps revocation risk)
- May 10: leaderboard unchanged, NGPU-LM paper, PR openai#2163 detail

CLAUDE.md: Competition Strategy updated to reflect final SOTA 1.05651 and three unmerged organizer codex branches signaling possible CaseOps revocation.

https://claude.ai/code/session_011Gc3nLekNUQdvoWfYZTuX3
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on May 15, 2026:
…bandoned, SOTA 1.05651 locked

- No new upstream/main commits since May 4 (11-day silence)
- Codex CaseOps-stripping branches last updated April 26–28 (17–19 days idle); OpenAI's May 4 audit commit post-dates them without merging → SOTA locked at 1.05651
- CLAUDE.md: updated codex branch note from "Monitor daily" to "ABANDONED"
- Logged PR openai#2163 (NEFTune + Z-Loss + Phased-TTT, 1.06035, May 7)
- Added arXiv:2605.02404 (Statistically-Lossless Quantization) and arXiv:2505.22857 (NGPU-LM) to post-competition findings

https://claude.ai/code/session_01H14864JGTC6TJvav24zVzf
Summary
11L 512d 8H/4KV transformer (XSA-all, U-Net skips, parallel residuals from L8+, partial RoPE+YaRN, Polar-Express Muon, LeakyReLU² MLP, fused softcapped CE Triton kernel, sparse attention head-output gate, BOS-fixed SmearGate, GPTQ int6 + int7 embed + per-row int8 attn-gate, LQER asymmetric int4 rank-4, per-group lrzip+brotli compression) with NEFTune embedding noise (alpha=5.0, training-only, gated off during TTT), z-loss regularization (weight 1e-4 on mean(LSE²)), and a phased-TTT retune (LoRA rank 80→128, prefix 2500→3000 docs, num phases 3→4) on top of the 1.06108 parent.
3-seed mean: 1.06035 BPB (std 0.00044) on 8×H100 SXM, all artifacts under the 16 MB cap.
What's new vs. parents
NEFTune embedding noise (alpha=5.0)

Adds uniform noise scaled by `alpha / sqrt(seq_len * dim)` to the token embeddings during training. Gated off during phased-TTT via `_in_ttt`, since TTT fine-tunes on the validation prefix and noise would just inject loss. Concept: Jain et al., 2023 (arXiv:2310.05914). None of the parents use NEFTune.
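A minimal sketch of this gating, assuming a standalone helper; the function itself is illustrative, and only the alpha value, the scale formula, and the `_in_ttt` gate come from this record:

```python
import torch

def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0,
                  training: bool = True, in_ttt: bool = False) -> torch.Tensor:
    """NEFTune (Jain et al., 2023): add uniform noise to token embeddings.

    embeds: (batch, seq_len, dim). Noise is scaled by alpha / sqrt(seq_len * dim)
    and applied only during training; gated off during phased-TTT.
    """
    if not training or in_ttt:
        return embeds
    seq_len, dim = embeds.shape[-2], embeds.shape[-1]
    scale = alpha / (seq_len * dim) ** 0.5
    # Uniform noise in [-scale, +scale], matching the embeddings' dtype/device.
    noise = torch.empty_like(embeds).uniform_(-scale, scale)
    return embeds + noise
```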
Z-loss regularization (weight 1e-4)

Adds `weight * mean(LSE²)` to the training loss, where `LSE` is the per-token log-sum-exp of the (softcapped) logits. Standard PaLM-style trick. The integration detail: when `FUSED_CE_ENABLED=1`, the fused softcapped-CE Triton kernel already returns the per-token LSE, so the z-loss term is essentially free: no second logits pass.
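A sketch of the term, assuming the fused kernel returns the CE loss alongside per-token LSE (`fused_softcapped_ce` below is a stand-in name, and the softcap value is illustrative, so a non-fused reference LSE is shown explicitly):

```python
import torch

Z_LOSS_WEIGHT = 1e-4
SOFTCAP = 30.0  # illustrative cap value; not specified in this record

def z_loss_from_lse(lse: torch.Tensor) -> torch.Tensor:
    """PaLM-style z-loss: weight * mean(LSE^2) over tokens."""
    return Z_LOSS_WEIGHT * lse.pow(2).mean()

def reference_lse(logits: torch.Tensor) -> torch.Tensor:
    """Non-fused reference: LSE of softcapped logits, cap * tanh(x / cap)."""
    capped = SOFTCAP * torch.tanh(logits / SOFTCAP)
    return torch.logsumexp(capped.float(), dim=-1)

# With FUSED_CE_ENABLED=1 the kernel already returns the per-token LSE, so the
# z-loss costs one elementwise square plus a mean, with no second logits pass:
#   ce_loss, lse = fused_softcapped_ce(logits, targets)  # hypothetical signature
#   loss = ce_loss + z_loss_from_lse(lse)
```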
Phased-TTT retune

Higher-rank adapters (LoRA 80→128) fit the longer per-phase prefix, and the extra phase (boundaries at ~750/1500/2250/3000 docs) is warranted by the longer prefix (2500→3000 docs). All three changes monotonically improve the 3-seed mean, and the eval still fits inside the 600 s budget (472–528 s observed).
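For concreteness, a sketch of the retuned phase schedule; only the rank, prefix size, and phase count come from this record, and evenly spaced boundaries reproduce the ~750/1500/2250/3000 quoted above:

```python
PREFIX_DOCS = 3000  # prefix docs, retuned 2500 -> 3000
NUM_PHASES = 4      # retuned 3 -> 4
LORA_RANK = 128     # retuned 80 -> 128

# Evenly spaced phase boundaries over the validation prefix.
boundaries = [PREFIX_DOCS * (i + 1) // NUM_PHASES for i in range(NUM_PHASES)]
assert boundaries == [750, 1500, 2250, 3000]

# Each phase would fine-tune rank-128 LoRA adapters on its prefix slice,
# i.e. doc ranges (0,750), (750,1500), (1500,2250), (2250,3000).
phase_slices = list(zip([0] + boundaries[:-1], boundaries))
```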
Test plan
train_seed{42,0,314}.log

See records/track_10min_16mb/2026-05-09_SP8192_NEFTune_TTT128_PhasedTTT4_1.0603/README.md for full architecture, lineage, and credits.