
Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)#414

Open
signalrush wants to merge 1 commit into openai:main from signalrush:submission/ema-gptqlite-1.1233

Conversation

@signalrush

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15

val_bpb: 1.1233 (sliding window stride=64, 3-seed mean) | 15.55 MB (mean) | 8xH100 SXM, 600s

Key Innovations Over PR #374

| Change | PR #374 | This PR | Impact |
| --- | --- | --- | --- |
| GPTQ-lite | Fixed clip (row max) | 5 clip percentiles per row, pick min MSE | -0.0006 BPB |
| EMA (decay=0.997) | None (Tight SWA only) | EMA every step | -0.0006 BPB |
| Warmdown | 3000 | 3500 | -0.0002 BPB |
| Late QAT threshold | 0.1 | 0.15 | -0.0001 BPB |
| **Total** | 1.1246 | 1.1233 | **-0.0013 BPB** |
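The EMA row in the table applies an exponential moving average of the weights at every optimizer step (decay=0.997), and the EMA copy is what gets evaluated and quantized. A minimal sketch, assuming plain per-tensor tracking in a dict (how this hooks into the training loop in `train_gpt.py` is an assumption):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step, applied after every optimizer step.

    `ema` and `params` are assumed to be dicts mapping parameter names to
    float values/tensors. Evaluation and checkpointing use `ema`, not
    the raw `params`.
    """
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema
```

With decay=0.997, the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps, which smooths over late-training noise similarly to SWA but with recency weighting.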

GPTQ-lite: Per-Layer Optimal Clip Percentile

Instead of using the row maximum to set the int6 scale, try 5 clip percentiles (0.999, 0.9995, 0.9999, 0.99999, 1.0) per weight-matrix row and pick the one that minimizes reconstruction MSE. This is a post-training search, so it adds zero training cost.
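A minimal sketch of that per-row search, assuming symmetric int6 quantization to [-31, 31] (the exact quantizer layout in the submission is an assumption):

```python
import numpy as np

def quantize_int6_best_clip(w_row, percentiles=(0.999, 0.9995, 0.9999, 0.99999, 1.0)):
    """Try each clip percentile for one weight row; keep the one with
    minimum int6 reconstruction MSE.

    Returns (mse, percentile, int6 codes, scale). Percentile 1.0
    reproduces the row-max baseline, so the result is never worse.
    """
    best = None
    for p in percentiles:
        clip = np.quantile(np.abs(w_row), p)
        if clip == 0.0:
            continue  # degenerate all-zero row
        scale = clip / 31.0
        q = np.clip(np.round(w_row / scale), -31, 31)
        mse = float(np.mean((w_row - q * scale) ** 2))
        if best is None or mse < best[0]:
            best = (mse, p, q.astype(np.int8), scale)
    return best
```

Clipping a few extreme outliers shrinks the scale, which tightens the quantization grid for the bulk of the row; the MSE check guards against clipping too aggressively.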

Results (3 seeds, 8xH100 SXM)

| Seed | Steps | val_loss | Sliding BPB (s64) | Artifact |
| --- | --- | --- | --- | --- |
| 1337 | 7101 | 1.8958 | 1.1228 | 15.56 MB |
| 42 | ~7100 | 1.8972 | 1.1236 | 15.54 MB |
| 2024 | ~7100 | 1.8971 | 1.1236 | 15.59 MB |

Mean: 1.1233 | Std: 0.0005
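The sliding BPB numbers above use stride=64, i.e. each window only scores its last 64 tokens so every token is predicted with (near-)full left context. A sketch of that evaluation loop, under assumptions: `token_nll_fn` is a hypothetical interface returning per-token NLL in nats for a context, and `token_bytes` holds each token's UTF-8 byte length (bits per byte normalizes by bytes, not tokens):

```python
import math

def sliding_window_bpb(token_nll_fn, tokens, token_bytes, window=1024, stride=64):
    """Chunked sliding-window eval: score only the last `stride` tokens
    of each window, then convert total nats to bits per byte."""
    nats, nbytes = 0.0, 0
    pos = window
    while pos <= len(tokens):
        per_tok = token_nll_fn(tokens[pos - window:pos])
        nats += sum(per_tok[-stride:])
        nbytes += sum(token_bytes[pos - stride:pos])
        pos += stride
    return nats / (nbytes * math.log(2))
```

A smaller stride costs more forward passes per token but removes the short-context penalty that a non-overlapping eval would impose at the start of each window.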

Architecture

11L, 512d, 8H/4KV, MLP 3x (relu²), U-Net skips, XSA4, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), FA3, Muon WD=0.04, EMA(0.997), Tight SWA, Late QAT@0.15, int6+zstd-22.

Run Command

```shell
SEED=1337 bash eval/eval.sh
```

Test plan

- All 3 seeds under 16MB
- All 3 seeds train in 600s on 8xH100
- Post-quant roundtrip verified
- Sliding window eval (stride=64) consistent across seeds (std=0.0005)
- train_gpt.py under 1500 lines (1402)
- No TTT on validation data

🤖 Generated with Claude Code

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 22, 2026
Seed 1337: 81.86ms, 1.1241 bpb, 15.83MB
Seed 42: 81.88ms, 1.1253 bpb, 15.82MB
Seed 2025: 81.86ms, 1.1247 bpb, 15.80MB
Mean: 81.87ms, 1.1247 bpb

Also adds GPTQ-lite (PR openai#414's per-row optimal clip percentile search)
for improved int6 quantization quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
