
Record: SP8192 + NEFTune + Z-Loss + Phased-TTT (4 phases, prefix=3000, LoRA-128) — val_bpb 1.06035 (3-seed mean) #2163

Open

uniagent-alpha wants to merge 1 commit into openai:main from uniagent-alpha:submission/sp8192-neftune-ttt128-phasedttt4-1.0603


Conversation

@uniagent-alpha

Summary

11L 512d 8H/4KV transformer (XSA-all, U-Net skips, parallel residuals from L8+, partial RoPE+YaRN, Polar-Express Muon, LeakyReLU² MLP, fused softcapped CE Triton kernel, sparse attention head-output gate, BOS-fixed SmearGate, GPTQ int6 + int7 embed + per-row int8 attn-gate, LQER asymmetric int4 rank-4, per-group lrzip+brotli compression) with NEFTune embedding noise (alpha=5.0, training-only, gated off during TTT), z-loss regularization (weight 1e-4 on mean(LSE²)), and a phased-TTT retune (LoRA rank 80→128, prefix 2500→3000 docs, num phases 3→4) on top of the 1.06108 parent.

3-seed mean: 1.06035 BPB (std 0.00044) on 8×H100 SXM, all artifacts under the 16 MB cap.

| seed | post-TTT val_bpb | artifact bytes | eval_time |
|------|------------------|----------------|-----------|
| 42   | 1.05980          | 15,897,143     | ~510 s    |
| 0    | 1.06038          | 15,894,185     | ~510 s    |
| 314  | 1.06087          | 15,893,797     | ~510 s    |
| mean | 1.06035          | 15,895,042     | 508.7 s   |

What's new vs. parents

NEFTune embedding noise (alpha=5.0)

Adds uniform noise scaled by alpha / sqrt(seq_len * dim) to the token embeddings during training. Gated off during phased-TTT via _in_ttt, since TTT fine-tunes on the validation prefix, where injected noise would only add loss. Concept: Jain et al., 2023 (arXiv:2310.05914). None of the parents use NEFTune.

if self.training and self.neftune_alpha > 0 and not self._in_ttt:
    # Use the packed max sequence length when available, else the batch's sequence dim.
    seq_len = max_seqlen if max_seqlen > 0 else x.size(1)
    # Uniform noise in [-1, 1), scaled to alpha / sqrt(seq_len * dim) as in NEFTune.
    noise = torch.rand_like(x) * 2.0 - 1.0
    x = x + noise * (self.neftune_alpha / math.sqrt(seq_len * x.size(-1)))
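
For a sense of scale (illustrative numbers, not from the PR): with the 512-d embeddings here and a hypothetical 2048-token packed sequence, the per-component noise magnitude is 5.0 / sqrt(2048 * 512) = 5.0 / 1024 ≈ 0.005.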

Z-loss regularization (weight 1e-4)

Adds weight * mean(LSE²) to the training loss, where LSE is the per-token log-sum-exp of the (softcapped) logits. Standard PaLM-style trick. The integration detail: when FUSED_CE_ENABLED=1, the fused softcapped-CE Triton kernel already returns the per-token LSE, so the z-loss term is essentially free — no second logits pass.

if self.fused_ce_enabled:
    # The fused softcapped cross-entropy kernel returns per-token losses and the
    # per-token LSE in one pass, so the z-loss term needs no second logits pass.
    losses, lse = torch.ops.pgsubmission1draft7fusedce.softcapped_ce(
        logits_proj.reshape(-1, logits_proj.size(-1)),
        flat_targets,
        float(self.logit_softcap),
    )
    return losses.mean() + self.z_loss_weight * (lse**2).mean()
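
For comparison, a minimal unfused sketch of the same loss (not this repo's code): the function name, the softcap form cap * tanh(logits / cap), and the softcap value are assumptions for illustration, and this path pays the second traversal of the logits that the fused kernel avoids.

import torch
import torch.nn.functional as F

def softcapped_ce_with_z_loss(logits, targets, softcap=30.0, z_loss_weight=1e-4):
    # Illustrative softcap form (cap * tanh(x / cap)); the cap value is a placeholder.
    capped = softcap * torch.tanh(logits / softcap)
    lse = torch.logsumexp(capped, dim=-1)                   # per-token log-sum-exp
    ce = F.cross_entropy(capped, targets, reduction="mean")
    return ce + z_loss_weight * (lse ** 2).mean()           # PaLM-style z-loss term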

Phased-TTT retune

| hparam                 | value | parent (1.0611) |
|------------------------|-------|-----------------|
| TTT_LORA_RANK          | 128   | 80              |
| PHASED_TTT_PREFIX_DOCS | 3000  | 2500            |
| PHASED_TTT_NUM_PHASES  | 4     | 3               |

Higher-rank adapters fit the longer per-phase prefix; the extra phase (boundaries at ~750/1500/2250/3000 docs) keeps the per-phase chunk size close to the parent's (~750 vs. ~833 docs) despite the longer prefix. All three changes monotonically improve the 3-seed mean, and the eval still fits inside the 600 s budget (472–528 s observed).
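
For orientation, a minimal sketch of a phased, score-first TTT loop under these hyperparameters; attach_lora, score_bpb, and finetune_lora are hypothetical helper names rather than this repo's API, and the score-then-adapt ordering is inferred from the score-first note in the test plan.

PHASED_TTT_PREFIX_DOCS = 3000
PHASED_TTT_NUM_PHASES = 4        # phase boundaries at ~750/1500/2250/3000 docs
TTT_LORA_RANK = 128

def phased_ttt_eval(model, val_docs):
    # Hedged sketch only: the helpers below are placeholders, not this repo's API.
    prefix = val_docs[:PHASED_TTT_PREFIX_DOCS]
    adapters = attach_lora(model, rank=TTT_LORA_RANK)       # rank-128 LoRA adapters on the frozen model
    phase_len = len(prefix) // PHASED_TTT_NUM_PHASES        # ~750 docs per phase
    per_phase_bpb = []
    for p in range(PHASED_TTT_NUM_PHASES):
        phase = prefix[p * phase_len:(p + 1) * phase_len]
        per_phase_bpb.append(score_bpb(model, phase))       # score-first: evaluate this phase untouched
        finetune_lora(model, adapters, phase)                # then adapt on it before the next phase
    return per_phase_bpb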

Test plan

  • Trains within 600 s on 8×H100 80GB SXM (4,921–4,976 steps, ~121.2 ms/step)
  • All 3 artifacts under 16 MB (max 15,897,143 B)
  • TTT eval within 600 s (max 527.8 s, mean 508.7 s)
  • 3-seed mean reproduced; per-seed numbers in train_seed{42,0,314}.log
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, score-first TTT

See records/track_10min_16mb/2026-05-09_SP8192_NEFTune_TTT128_PhasedTTT4_1.0603/README.md for full architecture, lineage, and credits.

…prefix=3000, LoRA-128) — val_bpb 1.06035

3-seed mean val_bpb 1.06035 (std 0.00044) on 8xH100 SXM, all artifacts under
the 16 MB cap. Track: 10min_16mb.

Adds NEFTune embedding noise (alpha=5.0, training-only, gated off during TTT)
and a z-loss regularization term (weight 1e-4 on mean(LSE^2), computed from
the LSE returned by the fused softcapped-CE Triton kernel) on top of the
1.06108 parent, plus a phased-TTT retune (LoRA rank 80→128, prefix
docs 2500→3000, num phases 3→4). Architecture is unchanged from the parent.

Per-seed:
  seed 42:  val_bpb 1.05980, artifact 15,897,143 B
  seed 0:   val_bpb 1.06038, artifact 15,894,185 B
  seed 314: val_bpb 1.06087, artifact 15,893,797 B
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request May 10, 2026
… strategy update

Covers 7 sessions of post-competition monitoring:
- May 4: post-competition day 4, PR openai#2146 audit draft
- May 5: PR openai#2146 merged, final SOTA confirmed 1.05651
- May 6-8: leaderboard frozen, paper scan
- May 9: THREE new organizer codex branches (CaseOps revocation risk)
- May 10: leaderboard unchanged, NGPU-LM paper, PR openai#2163 detail

CLAUDE.md: Competition Strategy updated to reflect final SOTA 1.05651 and
three unmerged organizer codex branches signaling possible CaseOps revocation.

https://claude.ai/code/session_011Gc3nLekNUQdvoWfYZTuX3
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request May 15, 2026
…bandoned, SOTA 1.05651 locked

- No new upstream/main commits since May 4 (11-day silence)
- Codex CaseOps-stripping branches last updated April 26–28 (17–19 days idle);
  OpenAI's May 4 audit commit post-dates them without merging → SOTA locked at 1.05651
- CLAUDE.md: updated codex branch note from "Monitor daily" to "ABANDONED"
- Logged PR openai#2163 (NEFTune + Z-Loss + Phased-TTT, 1.06035, May 7)
- Added arXiv:2605.02404 (Statistically-Lossless Quantization) and
  arXiv:2505.22857 (NGPU-LM) to post-competition findings

https://claude.ai/code/session_01H14864JGTC6TJvav24zVzf