
Record: SP8192 + NEFTune + Z-Loss + Phased-TTT (4 phases, prefix=3000, LoRA-128) — val_bpb 1.06035 (3-seed mean) #2163

Open

uniagent-alpha wants to merge 1 commit into openai:main from uniagent-alpha:submission/sp8192-neftune-ttt128-phasedttt4-1.0603


Conversation

@uniagent-alpha

Summary

11L 512d 8H/4KV transformer (XSA-all, U-Net skips, parallel residuals from L8+, partial RoPE+YaRN, Polar-Express Muon, LeakyReLU² MLP, fused softcapped CE Triton kernel, sparse attention head-output gate, BOS-fixed SmearGate, GPTQ int6 + int7 embed + per-row int8 attn-gate, LQER asymmetric int4 rank-4, per-group lrzip+brotli compression) with NEFTune embedding noise (alpha=5.0, training-only, gated off during TTT), z-loss regularization (weight 1e-4 on mean(LSE²)), and a phased-TTT retune (LoRA rank 80→128, prefix 2500→3000 docs, num phases 3→4) on top of the 1.06108 parent.

3-seed mean: 1.06035 BPB (std 0.00044) on 8×H100 SXM, all artifacts under the 16 MB cap.

| seed | post-TTT val_bpb | artifact bytes | eval_time |
|------|------------------|----------------|-----------|
| 42   | 1.05980          | 15,897,143     | ~510 s    |
| 0    | 1.06038          | 15,894,185     | ~510 s    |
| 314  | 1.06087          | 15,893,797     | ~510 s    |
| mean | 1.06035          | 15,895,042     | 508.7 s   |

What's new vs. parents

NEFTune embedding noise (alpha=5.0)

Adds uniform noise scaled by alpha / sqrt(seq_len * dim) to the token embeddings during training. Gated off during phased-TTT via _in_ttt, since TTT fine-tunes on the validation prefix, where injected noise would only add loss. Concept: Jain et al., 2023 (arXiv:2310.05914). None of the parents use NEFTune.

if self.training and self.neftune_alpha > 0 and not self._in_ttt:
    # Use the packed max sequence length when available, else the batch's sequence dim.
    seq_len = max_seqlen if max_seqlen > 0 else x.size(1)
    # Uniform noise in [-1, 1), scaled to alpha / sqrt(seq_len * dim) as in NEFTune.
    noise = torch.rand_like(x) * 2.0 - 1.0
    x = x + noise * (self.neftune_alpha / math.sqrt(seq_len * x.size(-1)))
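
For a sense of scale (illustrative numbers, not from the PR): with the 512-d embeddings here and a hypothetical 2048-token packed sequence, the per-component noise magnitude is 5.0 / sqrt(2048 * 512) = 5.0 / 1024 ≈ 0.005.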

Z-loss regularization (weight 1e-4)

Adds weight * mean(LSE²) to the training loss, where LSE is the per-token log-sum-exp of the (softcapped) logits. Standard PaLM-style trick. The integration detail: when FUSED_CE_ENABLED=1, the fused softcapped-CE Triton kernel already returns the per-token LSE, so the z-loss term is essentially free — no second logits pass.

if self.fused_ce_enabled:
    # The fused softcapped cross-entropy kernel returns per-token losses and the
    # per-token LSE in one pass, so the z-loss term needs no second logits pass.
    losses, lse = torch.ops.pgsubmission1draft7fusedce.softcapped_ce(
        logits_proj.reshape(-1, logits_proj.size(-1)),
        flat_targets,
        float(self.logit_softcap),
    )
    return losses.mean() + self.z_loss_weight * (lse**2).mean()
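
For comparison, a minimal unfused sketch of the same loss (not this repo's code): the function name, the softcap form cap * tanh(logits / cap), and the softcap value are assumptions for illustration, and this path pays the second traversal of the logits that the fused kernel avoids.

import torch
import torch.nn.functional as F

def softcapped_ce_with_z_loss(logits, targets, softcap=30.0, z_loss_weight=1e-4):
    # Illustrative softcap form (cap * tanh(x / cap)); the cap value is a placeholder.
    capped = softcap * torch.tanh(logits / softcap)
    lse = torch.logsumexp(capped, dim=-1)                   # per-token log-sum-exp
    ce = F.cross_entropy(capped, targets, reduction="mean")
    return ce + z_loss_weight * (lse ** 2).mean()           # PaLM-style z-loss term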

Phased-TTT retune

| hparam                 | value | parent (1.0611) |
|------------------------|-------|-----------------|
| TTT_LORA_RANK          | 128   | 80              |
| PHASED_TTT_PREFIX_DOCS | 3000  | 2500            |
| PHASED_TTT_NUM_PHASES  | 4     | 3               |

Higher-rank adapters fit the longer per-phase prefix; the extra phase (boundaries at ~750/1500/2250/3000 docs) keeps the per-phase chunk size close to the parent's (~750 vs. ~833 docs) despite the longer prefix. All three changes monotonically improve the 3-seed mean, and the eval still fits inside the 600 s budget (472–528 s observed).
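
For orientation, a minimal sketch of a phased, score-first TTT loop under these hyperparameters; attach_lora, score_bpb, and finetune_lora are hypothetical helper names rather than this repo's API, and the score-then-adapt ordering is inferred from the score-first note in the test plan.

PHASED_TTT_PREFIX_DOCS = 3000
PHASED_TTT_NUM_PHASES = 4        # phase boundaries at ~750/1500/2250/3000 docs
TTT_LORA_RANK = 128

def phased_ttt_eval(model, val_docs):
    # Hedged sketch only: the helpers below are placeholders, not this repo's API.
    prefix = val_docs[:PHASED_TTT_PREFIX_DOCS]
    adapters = attach_lora(model, rank=TTT_LORA_RANK)       # rank-128 LoRA adapters on the frozen model
    phase_len = len(prefix) // PHASED_TTT_NUM_PHASES        # ~750 docs per phase
    per_phase_bpb = []
    for p in range(PHASED_TTT_NUM_PHASES):
        phase = prefix[p * phase_len:(p + 1) * phase_len]
        per_phase_bpb.append(score_bpb(model, phase))       # score-first: evaluate this phase untouched
        finetune_lora(model, adapters, phase)                # then adapt on it before the next phase
    return per_phase_bpb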

Test plan

  • Trains within 600 s on 8×H100 80GB SXM (4,921–4,976 steps, ~121.2 ms/step)
  • All 3 artifacts under 16 MB (max 15,897,143 B)
  • TTT eval within 600 s (max 527.8 s, mean 508.7 s)
  • 3-seed mean reproduced; per-seed numbers in train_seed{42,0,314}.log
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, score-first TTT

See records/track_10min_16mb/2026-05-09_SP8192_NEFTune_TTT128_PhasedTTT4_1.0603/README.md for full architecture, lineage, and credits.

…prefix=3000, LoRA-128) — val_bpb 1.06035

3-seed mean val_bpb 1.06035 (std 0.00044) on 8xH100 SXM, all artifacts under
the 16 MB cap. Track: 10min_16mb.

Adds NEFTune embedding noise (alpha=5.0, training-only, gated off during TTT)
and a z-loss regularization term (weight 1e-4 on mean(LSE^2), computed from
the LSE returned by the fused softcapped-CE Triton kernel) on top of the
1.06108 parent, plus a phased-TTT retune (LoRA rank 80→128, prefix
docs 2500→3000, num phases 3→4). Architecture is unchanged from the parent.

Per-seed:
  seed 42:  val_bpb 1.05980, artifact 15,897,143 B
  seed 0:   val_bpb 1.06038, artifact 15,894,185 B
  seed 314: val_bpb 1.06087, artifact 15,893,797 B
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request May 10, 2026
… strategy update

Covers 7 sessions of post-competition monitoring:
- May 4: post-competition day 4, PR openai#2146 audit draft
- May 5: PR openai#2146 merged, final SOTA confirmed 1.05651
- May 6-8: leaderboard frozen, paper scan
- May 9: THREE new organizer codex branches (CaseOps revocation risk)
- May 10: leaderboard unchanged, NGPU-LM paper, PR openai#2163 detail

CLAUDE.md: Competition Strategy updated to reflect final SOTA 1.05651 and
three unmerged organizer codex branches signaling possible CaseOps revocation.

https://claude.ai/code/session_011Gc3nLekNUQdvoWfYZTuX3
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request May 15, 2026
…bandoned, SOTA 1.05651 locked

- No new upstream/main commits since May 4 (11-day silence)
- Codex CaseOps-stripping branches last updated April 26–28 (17–19 days idle);
  OpenAI's May 4 audit commit post-dates them without merging → SOTA locked at 1.05651
- CLAUDE.md: updated codex branch note from "Monitor daily" to "ABANDONED"
- Logged PR openai#2163 (NEFTune + Z-Loss + Phased-TTT, 1.06035, May 7)
- Added arXiv:2605.02404 (Statistically-Lossless Quantization) and
  arXiv:2505.22857 (NGPU-LM) to post-competition findings

https://claude.ai/code/session_01H14864JGTC6TJvav24zVzf