Record: 11L GradQuant + EMA + Sliding Eval (val_bpb=1.1416)#422

Open
albertorkive wants to merge 4 commits into openai:main from albertorkive:submission-clean

Conversation

@albertorkive

Summary

  • val_bpb: 1.1416 (post int8+zstd quantization roundtrip, sliding window eval stride=64, full validation coverage)
  • Artifact: 15,059,186 bytes (code: 59,158 bytes + model: 15,000,028 bytes)
  • 11 layers, 512 dim, MLP 3x, Muon optimizer, EMA (alpha=0.997, from init)
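The "EMA from init" above can be sketched as a plain moving-average update whose shadow copy is seeded with the initial weights rather than a later snapshot. `ema_update` below is an illustrative helper operating on plain floats, not the training script's actual implementation:

```python
def ema_update(shadow, params, alpha=0.997):
    """One EMA step: shadow <- alpha * shadow + (1 - alpha) * params.

    "EMA from init" means `shadow` starts as a copy of the *initial*
    parameters, so the average tracks the model from step 0 onward.
    """
    return {k: alpha * shadow[k] + (1.0 - alpha) * params[k] for k in shadow}


# Usage sketch: seed the shadow from the init weights, then update every step.
shadow = {"w": 1.0}                 # copy of init weights
shadow = ema_update(shadow, {"w": 0.0})  # after one training step
```

With alpha=0.997 the shadow has an effective averaging horizon of roughly 1/(1-alpha) ≈ 333 steps.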

Key Techniques

  • Gradient-guided adaptive quantization: per-tensor int5/int6/int7 bit assignment based on gradient sensitivity (top 30% → int7, middle 40% → int6, bottom 30% → int5)
  • EMA from model init (alpha=0.997)
  • SmearGate residual mixing + NTK-aware RoPE + XSA (last 4 layers)
  • Orthogonal initialization + tied embeddings
  • zstd level 22 compression
  • Sliding window eval with stride=64, full validation set coverage (~121K windows/GPU)
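The gradient-guided bit assignment in the first bullet can be sketched as a percentile split over per-tensor sensitivity scores. This assumes the scores (e.g. accumulated gradient magnitudes) are already computed; `assign_bits` is a hypothetical helper, not the script's actual API:

```python
def assign_bits(sensitivity, cuts=(0.30, 0.70), bits=(5, 6, 7)):
    """Map each tensor to a bit width by its gradient-sensitivity rank:
    bottom 30% -> int5, middle 40% -> int6, top 30% -> int7.

    `sensitivity` is a dict of tensor name -> scalar score (higher =
    more sensitive to quantization error).
    """
    ranked = sorted(sensitivity, key=sensitivity.get)  # ascending sensitivity
    n = len(ranked)
    out = {}
    for i, name in enumerate(ranked):
        frac = i / n  # rank as a fraction of all tensors
        if frac < cuts[0]:
            out[name] = bits[0]
        elif frac < cuts[1]:
            out[name] = bits[1]
        else:
            out[name] = bits[2]
    return out
```

The split spends the fixed artifact budget where quantization error hurts most, while the least sensitive tensors absorb the coarser int5 rounding.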

Run Command

pip install sentencepiece zstandard
python3 data/cached_challenge_fineweb.py
torchrun --nproc_per_node=8 train_gpt.py

All hyperparameters are baked into train_gpt.py as defaults. No env vars needed.
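For context, the sliding-window evaluation listed above (stride=64, full validation coverage) amounts to enumerating overlapping windows over the token stream. A minimal sketch follows; the window length of 512 is an illustrative assumption, not a value taken from the script:

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Enumerate (start, end) evaluation windows over a token stream.

    Windows overlap by (window - stride) tokens; this sketch ignores any
    special handling of the final partial tail.
    """
    starts = range(0, max(1, n_tokens - window + 1), stride)
    return [(s, s + window) for s in starts]
```

At stride=64 each token is seen by up to window/stride windows, which is where counts like ~121K windows per GPU come from on a large validation set.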

Included Files

  • README.md — Architecture and training details
  • submission.json — Metadata and metrics
  • train_gpt.py — Complete training script (1,309 lines)
  • train.log — Full training + evaluation output

albertorkive and others added 4 commits March 22, 2026 14:25
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shift more tensors to int7 (45% vs 30% previously) to use more of the
16MB budget for gradient-sensitive weights. Reduces quantization
degradation while staying under artifact size limit.
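The bits-versus-error tradeoff behind this change can be illustrated with a toy symmetric fake-quantization roundtrip (a pure-Python sketch, unrelated to the script's actual quantizer): each extra bit roughly halves the roundtrip error, which is why the budget is steered toward gradient-sensitive tensors.

```python
def quant_roundtrip(x, bits):
    """Symmetric per-tensor quantization of a float list to `bits` bits,
    then dequantization back to floats. Returns the reconstructed values."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 63 for int7, 15 for int5
    scale = max(abs(v) for v in x) / qmax
    if scale == 0.0:
        return list(x)                   # all-zero tensor: nothing to quantize
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return [qi * scale for qi in q]
```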

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At step 100, estimate total steps from step timing and cap warmdown to
55% of total. Prevents warmdown from consuming too many steps on slower
hardware (e.g., 3000 warmdown at 4200 total steps = only 29% productive
training). On fast hardware, cap is not reached and behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove DATA_PATH, TOKENIZER_PATH, TOKENIZER_TYPE, TRAIN_ON_VAL env var
overrides. These were lab scaffolding — the competition uses fixed data
paths relative to repo root. Evaluators run from repo root after
downloading data with the standard script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
