Add non-record EMA and adaptive export exploration #424

Open
someone114514 wants to merge 1 commit into openai:main from someone114514:nonrecord-ema-adaptive-export
Conversation


Summary

This PR adds a single non-record exploration branch:

  • records/track_non_record_16mb/2026-03-22_Baseline_EMA_AdaptiveExport

The explored idea is that some remaining progress under the 16MB artifact constraint may come from late-stage weight smoothing and budget-aware export selection, not only from changing the backbone.

Main Result

This run reaches:

  • final post-quant sliding-window roundtrip: val_bpb = 1.17251579
  • final post-quant sliding-window roundtrip: val_loss = 1.97973856
  • training wallclock: 1200.112s
  • final eval wallclock: 613.676s
  • total artifact size: 16,399,881 bytes

So this branch is non-record only: it produces a strong final score shape, but the exported artifact is still 449,881 bytes over the 15,950,000-byte export target (TARGET_ARTIFACT_BYTES), and therefore over the 16MB limit.

What Changed

This branch builds on the strong Int6 MLP3x + SmearGate + BigramHash + Muon baseline and adds:

  1. Late-stage EMA

    • EMA_ENABLED=1
    • EMA_BETA=0.9998
    • EMA_START_FRAC=0.8
  2. Adaptive export-time pruning search

    • PRUNE_CANDIDATES=0.00,0.01,0.02,0.03,0.04,0.05
    • TARGET_ARTIFACT_BYTES=15950000
    • choose the smallest pruning ratio whose exported artifact meets the target size, or fall back to the candidate with the smallest artifact if none do
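The two pieces above can be sketched in plain Python. This is a minimal illustration, not the PR's actual train_gpt.py code: the function names (`ema_update`, `select_pruning_ratio`) and the flat-list weight representation are hypothetical, and only the constants (EMA_BETA=0.9998, TARGET_ARTIFACT_BYTES=15950000) come from the PR.

```python
def ema_update(ema, weights, beta=0.9998):
    """In-place exponential moving average over flat weight lists:
    ema <- beta * ema + (1 - beta) * w.

    Typically called once per optimizer step, but only after training
    passes EMA_START_FRAC (0.8) of the total steps.
    """
    for i, w in enumerate(weights):
        ema[i] = beta * ema[i] + (1.0 - beta) * w


def select_pruning_ratio(artifact_bytes, target_bytes=15_950_000):
    """Pick an export candidate from the pruning sweep.

    `artifact_bytes` maps pruning ratio -> exported artifact size in
    bytes (one entry per value in PRUNE_CANDIDATES). Returns the
    smallest ratio whose artifact fits the byte target; if no candidate
    fits, returns the ratio that yields the smallest artifact.
    """
    fitting = sorted(r for r, s in artifact_bytes.items() if s <= target_bytes)
    if fitting:
        return fitting[0]
    return min(artifact_bytes, key=artifact_bytes.get)
```

With this selection rule, a sweep where no candidate fits (as in this run, where the best export landed at 16,399,881 bytes) still returns a well-defined answer: the most aggressive pruning ratio is chosen only if it actually produces the smallest artifact.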

Why This Is Worth Looking At

Even though the run is not leaderboard-valid yet, the failure mode is narrow and actionable:

  • the bottleneck is artifact size, not post-quant quality collapse
  • the final sliding-window metric is already competitive for a non-record run
  • the next iteration path is clear: broaden export search and move toward module-aware budget allocation

Submission Checklist

  • training completes under wallclock cap: yes
  • final post-quant roundtrip eval runs successfully: yes
  • sliding-window final eval runs successfully: yes
  • self-contained train_gpt.py: yes
  • artifact under 16MB: no
  • multi-seed verification: no

Compute Limitation

This result was produced under constrained compute:

  • 2xH100, not 8xH100
  • single seed only
  • no remaining compute budget for follow-up tuning passes after the first full validation run

So this PR should be read as a validated directional non-record result, not as a claim of a fully tuned record-capable submission.
