# Experiment Plan

Goal: make the budget-run candidate more submittable by adding one small empirical improvement over the current valid baseline.

Original valid baseline:

```text
seed: 42
train_shards: 86
steps: 3975
valid_sliding_bpb: 1.11043153
total_bytes: 15,987,889
```

Best current result:

```text
seed: 42
train_shards: 86
qk_gain_init: 4.5
steps: 4271
valid_sliding_bpb: 1.10743376
total_bytes: 15,987,195
```

## Completed Paid Run

Run one targeted variant:

```bash
QK_GAIN_INIT=4.5 \
RUN_ID=sp4096_1xh100_seed42_86shards_qk45 \
./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
```

Outcome:

- Improved valid sliding BPB by about `0.0030`.
- Required repeated artifact coarsening of `ve_shared.embed.weight` to fit the decimal 16MB cap.
- Keep as the current best budget candidate.
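The improvement quoted above is just the difference of the two sliding-BPB numbers; a quick check with the values copied from the result blocks:

```python
# Sliding BPB from the two result blocks above
baseline_bpb = 1.11043153   # original valid baseline
qk45_bpb = 1.10743376       # QK_GAIN_INIT=4.5 run

improvement = baseline_bpb - qk45_bpb
print(f"{improvement:.7f}")  # ~0.0030
```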

## Follow-up Queue

Run these one at a time only if spending more RunPod budget:

```bash
QK_GAIN_INIT=4.5 SEED=1337 RUN_ID=sp4096_1xh100_seed1337_86shards_qk45 ./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
MATRIX_LR=0.018 RUN_ID=sp4096_1xh100_seed42_86shards_mlr018 ./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
MUON_WD=0.09 EMBED_WD=0.09 RUN_ID=sp4096_1xh100_seed42_86shards_wd09 ./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
QK_GAIN_INIT=5.0 RUN_ID=sp4096_1xh100_seed42_86shards_qk50 ./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
```

## Submission Bar

Submit as a non-record only after one of these is true:

- A single-knob variant beats the baseline and a second seed confirms it is not obvious noise.
- Or the folder is framed purely as a budget reproduction/engineering note, not as a new ML improvement.
---

`records/track_non_record_16mb/2026-05-07_sp4096_budget_repro/README.md` (133 additions):
# SP4096 Budget Reproduction Candidate

This is our first serious Parameter Golf candidate: a clean reproduction target based on Kevin Clark's accepted **4096-Vocab + Larger Model + High WD + Simplifications** run.

This folder is intentionally conservative. The `train_gpt.py` starts from:

`records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/train_gpt.py`

and adds one reproducibility guard: if the quantized artifact is slightly over the decimal `16,000,000` byte cap, it coarsens selected quantized tensors until the artifact fits. In the RunPod result, only `ve_shared.embed.weight` needed coarsening.

Why start here:

- Proven score: `val_bpb ~= 1.09785` on 8xH100.
- Proven artifact size: about `15.9 MB`, under the decimal `16,000,000` byte cap.
- Easier data path than the stronger CaseOps submissions.
- Good enough to be a real baseline before spending RunPod money on experiments.

This is a non-record reproduction/iteration folder, not a claim of a novel leaderboard result.

## Current Valid Result

Best RunPod 1xH100 budget run, seed 42, 86 train shards, `QK_GAIN_INIT=4.5`:

```text
steps: 4271
train_time_ms: 3590178
pre_quant_post_ema val_bpb: 1.11595986
valid_roundtrip val_bpb: 1.12578632
valid_sliding val_bpb: 1.10743376
valid_blob_bytes: 15,916,816
code_bytes: 70,379
total_bytes: 15,987,195
```

See `runpod_results/RESULTS.md` for the full run note.

## Data Setup

Run from the repository root:

```bash
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 86
```

The verified budget run uses 86 SP4096 train shards, about `17 GB` materialized. Use a pod volume with enough room for temporary Hugging Face cache files, logs, and artifacts.

The script expects:

```text
data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
data/tokenizers/fineweb_4096_bpe.model
```

## Smoke Run

Use this on a 1xH100/4090-class pod to verify dependencies, data, CUDA, FlashAttention, logging, serialization, quantization, and eval:

```bash
cd records/track_non_record_16mb/2026-05-07_sp4096_budget_repro
DATA_DIR=../../../data \
RUN_ID=sp4096_smoke \
SEED=42 \
ITERATIONS=8 \
MAX_WALLCLOCK_SECONDS=0 \
WARMUP_STEPS=1 \
TRAIN_LOG_EVERY=1 \
VAL_LOSS_EVERY=0 \
GPTQ_CALIBRATION_BATCHES=1 \
SLIDING_WINDOW_ENABLED=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```
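All of these knobs reach `train_gpt.py` as environment variables; a hedged sketch of the usual override pattern (the defaults shown are assumptions, except the `20000` step cap, which is visible in the run logs' `step: 4271/20000`):

```python
import os

def env_int(name: str, default: int) -> int:
    """Integer knob with an environment override, e.g. SEED=1337."""
    return int(os.environ.get(name, default))

def env_float(name: str, default: float) -> float:
    """Float knob with an environment override, e.g. QK_GAIN_INIT=4.5."""
    return float(os.environ.get(name, default))

SEED = env_int("SEED", 42)
ITERATIONS = env_int("ITERATIONS", 20000)                    # step cap seen in the logs
MAX_WALLCLOCK_SECONDS = env_int("MAX_WALLCLOCK_SECONDS", 0)  # assumed: 0 means no cap
QK_GAIN_INIT = env_float("QK_GAIN_INIT", 4.0)                # assumed default
```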

## Budget 1xH100 Run

This is not leaderboard-comparable because the official record budget is 10 minutes on 8xH100, but it is useful for a $20 RunPod budget:

```bash
cd records/track_non_record_16mb/2026-05-07_sp4096_budget_repro
DATA_DIR=../../../data \
RUN_ID=sp4096_1xh100_seed42_86shards_qk45 \
SEED=42 \
QK_GAIN_INIT=4.5 \
MAX_WALLCLOCK_SECONDS=3600 \
VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

A 3600-second cap on 1xH100 gives 60 GPU-minutes, in the same ballpark as the 80 GPU-minutes of the official 8xH100 10-minute budget, though per-step overheads and multi-GPU scaling make the two runs not directly comparable.

If storage is tight, keep the Hugging Face cache on the pod volume; the verified budget result used `86` SP4096 train shards:

```bash
rm -f data/manifest.json
HF_HOME=/workspace/hf-cache \
HUGGINGFACE_HUB_CACHE=/workspace/hf-cache/hub \
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 86
```

## Official-Style Reproduction

For a comparable reproduction, use 8xH100 SXM:

```bash
cd records/track_non_record_16mb/2026-05-07_sp4096_budget_repro
DATA_DIR=../../../data \
RUN_ID=sp4096_repro_seed42 \
SEED=42 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Expected accepted-record ballpark from the source run:

```text
pre-quantization post-ema val_bpb ~= 1.1041
final_int6_roundtrip val_bpb ~= 1.1159
final_int6_sliding_window val_bpb ~= 1.0974
Total submission size ~= 15.9 MB
```

## Next Experiment Knobs

Once the reproduction works, the first cheap sweeps I would try are:

- `QK_GAIN_INIT=5.0`
- `MUON_WD=0.09` and `EMBED_WD=0.09`
- `MATRIX_LR=0.018`
- `GPTQ_CALIBRATION_BATCHES=32` for faster iteration, then restore `64`

Treat one seed as signal-finding only. A real claim needs multiple seeds.
# Draft Non-Record Submission Note

This submission is a budget reproduction and engineering note for the SP4096 line of Parameter Golf submissions.

It starts from Kevin Clark's accepted `4096-Vocab + Larger Model + High WD + Simplifications` record and tests a constrained 1xH100 setup:

- 1x H100 SXM instead of 8xH100
- 86 SP4096 train shards due to pod storage limits
- 3600 second train cap
- Full validation and sliding-window eval
- Reproducible artifact fitting under the decimal 16MB cap

The current result is valid but not SOTA:

```text
valid_sliding_bpb: 1.10743376
total_artifact_bytes: 15,987,195
```

The main implementation addition is an artifact-fit guard. If the compressed quantized artifact exceeds the byte cap, the script repeatedly coarsens selected low-impact quantized tensors and re-compresses. In the best RunPod result, repeated coarsening of `ve_shared.embed.weight` reduced the artifact enough to fit while changing sliding BPB by about `0.00012`.

The best valid local artifact is `runpod_results/final_model.qk45.valid.ptz`, produced from a `QK_GAIN_INIT=4.5` run. It improves the first 1xH100 budget baseline from `1.11043153` to `1.10743376` sliding BPB, but remains a non-record result relative to the public leaderboard.
---

Python dependencies (pip requirements):

```text
numpy
sentencepiece
brotli
huggingface-hub
tqdm
typing-extensions==4.15.0
```
# RunPod Results: 1xH100 Budget Runs

Date: 2026-05-07 UTC

Hardware:

- 1x NVIDIA H100 SXM 80GB
- RunPod on-demand pod
- PyTorch `2.11.0+cu130`
- FlashAttention 3

Data:

- SP4096 tokenizer/data from `kevclark/parameter-golf`
- `86` train shards available on the pod
- Full validation shard

Baseline run:

```bash
DATA_DIR=../../../data \
RUN_ID=sp4096_1xh100_seed42_86shards_retry \
SEED=42 \
MAX_WALLCLOCK_SECONDS=3600 \
VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

Training stopped cleanly at the wallclock cap:

```text
stopping_early: wallclock_cap train_time: 3590665ms step: 3975/20000
pre-quantization post-ema val_loss:2.57429177 val_bpb:1.11875755
```

The unmodified quantized artifact was slightly over the 16,000,000 byte cap:

```text
Serialized model int6+brotli: 15995984 bytes
Code size: 68206 bytes
Total submission size int6+brotli: 16064190 bytes
final_int6_sliding_window val_loss:2.55503271 val_bpb:1.11038778
```

Valid artifact salvage:

- Coarsened only `ve_shared.embed.weight.q` by halving integer codes and doubling its scale.
- Wrote `final_model.valid.ptz`.
- This reduces compressed model bytes enough to fit under the cap with negligible BPB movement.
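The code-halving/scale-doubling step can be illustrated in isolation (a toy sketch, not the salvage code; it assumes a tensor stored as integer codes plus one scale):

```python
def coarsen_quantized(codes, scale):
    """Halve the integer codes and double the scale.

    Dequantized values codes[i] * scale are preserved exactly for even
    codes and move by at most one old quantization step for odd codes,
    while the halved code range compresses smaller.
    """
    return [c // 2 for c in codes], scale * 2.0

codes, scale = [-20, -3, 0, 7, 31], 0.01
coarse, scale2 = coarsen_quantized(codes, scale)
max_err = max(abs(c * scale - c2 * scale2) for c, c2 in zip(codes, coarse))
print(max_err)  # ~0.01, i.e. one old quantization step
```

This bounded dequantization error is consistent with the "negligible BPB movement" observed above.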

Valid result:

```text
valid_blob_bytes:15917702
code_bytes:70187
total_bytes:15987889
valid_roundtrip val_loss:2.59734720 val_bpb:1.12877717
valid_sliding val_loss:2.55513338 val_bpb:1.11043153
```
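As a consistency check on the numbers above: loss is logged in nats per token and bpb in bits per byte, so both (loss, bpb) pairs should imply the same bytes-per-token ratio for the validation set (the conversion formula is the standard one, assumed to match the script's logging):

```python
import math

def implied_bytes_per_token(val_loss_nats, val_bpb):
    """loss (nats/token) -> bits/token, then divide by bits/byte."""
    return (val_loss_nats / math.log(2)) / val_bpb

# (loss, bpb) pairs copied from the valid-result block above
sliding = implied_bytes_per_token(2.55513338, 1.11043153)
roundtrip = implied_bytes_per_token(2.59734720, 1.12877717)
print(round(sliding, 3), round(roundtrip, 3))  # both ~3.32 bytes/token
```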

Local copied artifacts:

- `sp4096_1xh100_seed42_86shards_retry.txt`
- `final_model.valid.ptz`

## QK 4.5 Improvement

Follow-up run on the migrated RunPod pod:

```bash
DATA_DIR=../../../data \
RUN_ID=sp4096_1xh100_seed42_86shards_qk45 \
SEED=42 \
QK_GAIN_INIT=4.5 \
MAX_WALLCLOCK_SECONDS=3600 \
VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

Training stopped cleanly at the wallclock cap:

```text
stopping_early: wallclock_cap train_time: 3590178ms step: 4271/20000
pre-quantization post-ema val_loss:2.56785422 val_bpb:1.11595986
```

The first one-pass artifact fit was still over cap:

```text
Artifact fit: coarsened ve_shared.embed.weight
Serialized model int6+brotli: 16041069 bytes
Total submission size int6+brotli: 16111256 bytes
WARNING: artifact exceeds cap:16111256>16000000
final_int6_roundtrip val_loss:2.59018066 val_bpb:1.12566268
final_int6_sliding_window val_loss:2.54796393 val_bpb:1.10731577
```

After patching the guard to repeatedly coarsen the selected tensor, the salvaged valid artifact is:

```text
valid_blob_bytes:15916816
code_bytes:70379
total_bytes:15987195
valid_roundtrip val_loss:2.59046517 val_bpb:1.12578632
valid_sliding val_loss:2.54823542 val_bpb:1.10743376
```

Local copied artifacts:

- `sp4096_1xh100_seed42_86shards_qk45.txt`
- `sp4096_qk45_valid_eval.txt`
- `final_model.qk45.valid.ptz`