# Experiment Plan

Goal: make the budget-run candidate more submittable by adding one small empirical improvement over the current valid baseline.

Original valid baseline:

```text
seed: 42
train_shards: 86
steps: 3975
valid_sliding_bpb: 1.11043153
total_bytes: 15,987,889
```

Best current result:

```text
seed: 42
train_shards: 86
qk_gain_init: 4.5
steps: 4271
valid_sliding_bpb: 1.10743376
total_bytes: 15,987,195
```

## Completed Paid Run

Run one targeted variant:

```bash
QK_GAIN_INIT=4.5 \
RUN_ID=sp4096_1xh100_seed42_86shards_qk45 \
./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
```

Outcome:

- Improved valid sliding BPB by about `0.0030`.
- Required repeated artifact coarsening of `ve_shared.embed.weight` to fit the decimal 16MB cap.
- Keep as the current best budget candidate.
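The improvement quoted above is just the difference of the two sliding-BPB numbers; a quick check with the values copied from the result blocks:

```python
# Sliding BPB from the two result blocks above
baseline_bpb = 1.11043153   # original valid baseline
qk45_bpb = 1.10743376       # QK_GAIN_INIT=4.5 run

improvement = baseline_bpb - qk45_bpb
print(f"{improvement:.7f}")  # ~0.0030
```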

## Follow-up Queue

Run these one at a time only if spending more RunPod budget:

```bash
QK_GAIN_INIT=4.5 SEED=1337 RUN_ID=sp4096_1xh100_seed1337_86shards_qk45 ./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
MATRIX_LR=0.018 RUN_ID=sp4096_1xh100_seed42_86shards_mlr018 ./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
MUON_WD=0.09 EMBED_WD=0.09 RUN_ID=sp4096_1xh100_seed42_86shards_wd09 ./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
QK_GAIN_INIT=5.0 RUN_ID=sp4096_1xh100_seed42_86shards_qk50 ./scripts/parameter_golf/run_sp4096_budget86_existing_data.sh
```

## Submission Bar

Submit as a non-record only after one of these is true:

- A single-knob variant beats the baseline and a second seed confirms it is not obvious noise.
- Or the folder is framed purely as a budget reproduction/engineering note, not as a new ML improvement.
---

`records/track_non_record_16mb/2026-05-07_sp4096_budget_repro/README.md` (133 additions):
# SP4096 Budget Reproduction Candidate

This is our first serious Parameter Golf candidate: a clean reproduction target based on Kevin Clark's accepted **4096-Vocab + Larger Model + High WD + Simplifications** run.

This folder is intentionally conservative. The `train_gpt.py` starts from:

`records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/train_gpt.py`

and adds one reproducibility guard: if the quantized artifact is slightly over the decimal `16,000,000` byte cap, it coarsens selected quantized tensors until the artifact fits. In the RunPod result, only `ve_shared.embed.weight` needed coarsening.

Why start here:

- Proven score: `val_bpb ~= 1.09785` on 8xH100.
- Proven artifact size: about `15.9 MB`, under the decimal `16,000,000` byte cap.
- Easier data path than the stronger CaseOps submissions.
- Good enough to be a real baseline before spending RunPod money on experiments.

This is a non-record reproduction/iteration folder, not a claim of a novel leaderboard result.

## Current Valid Result

Best RunPod 1xH100 budget run, seed 42, 86 train shards, `QK_GAIN_INIT=4.5`:

```text
steps: 4271
train_time_ms: 3590178
pre_quant_post_ema val_bpb: 1.11595986
valid_roundtrip val_bpb: 1.12578632
valid_sliding val_bpb: 1.10743376
valid_blob_bytes: 15,916,816
code_bytes: 70,379
total_bytes: 15,987,195
```

See `runpod_results/RESULTS.md` for the full run note.

## Data Setup

Run from the repository root:

```bash
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 86
```

The verified budget run uses 86 SP4096 train shards, about `17 GB` materialized. Use a pod volume with enough room for temporary Hugging Face cache files, logs, and artifacts.

The script expects:

```text
data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
data/tokenizers/fineweb_4096_bpe.model
```

## Smoke Run

Use this on a 1xH100/4090-class pod to verify dependencies, data, CUDA, FlashAttention, logging, serialization, quantization, and eval:

```bash
cd records/track_non_record_16mb/2026-05-07_sp4096_budget_repro
DATA_DIR=../../../data \
RUN_ID=sp4096_smoke \
SEED=42 \
ITERATIONS=8 \
MAX_WALLCLOCK_SECONDS=0 \
WARMUP_STEPS=1 \
TRAIN_LOG_EVERY=1 \
VAL_LOSS_EVERY=0 \
GPTQ_CALIBRATION_BATCHES=1 \
SLIDING_WINDOW_ENABLED=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```
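All of these knobs reach `train_gpt.py` as environment variables; a hedged sketch of the usual override pattern (the defaults shown are assumptions, except the `20000` step cap, which is visible in the run logs' `step: 4271/20000`):

```python
import os

def env_int(name: str, default: int) -> int:
    """Integer knob with an environment override, e.g. SEED=1337."""
    return int(os.environ.get(name, default))

def env_float(name: str, default: float) -> float:
    """Float knob with an environment override, e.g. QK_GAIN_INIT=4.5."""
    return float(os.environ.get(name, default))

SEED = env_int("SEED", 42)
ITERATIONS = env_int("ITERATIONS", 20000)                    # step cap seen in the logs
MAX_WALLCLOCK_SECONDS = env_int("MAX_WALLCLOCK_SECONDS", 0)  # assumed: 0 means no cap
QK_GAIN_INIT = env_float("QK_GAIN_INIT", 4.0)                # assumed default
```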

## Budget 1xH100 Run

This is not leaderboard-comparable because the official record budget is 10 minutes on 8xH100, but it is useful for a $20 RunPod budget:

```bash
cd records/track_non_record_16mb/2026-05-07_sp4096_budget_repro
DATA_DIR=../../../data \
RUN_ID=sp4096_1xh100_seed42_86shards_qk45 \
SEED=42 \
QK_GAIN_INIT=4.5 \
MAX_WALLCLOCK_SECONDS=3600 \
VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

A 3600-second cap on 1xH100 gives 60 GPU-minutes, in the same ballpark as the 80 GPU-minutes of the official 8xH100 10-minute budget, though per-step overheads and multi-GPU scaling make the two runs not directly comparable.

If storage is tight, keep the Hugging Face cache on the pod volume; the verified budget result used `86` SP4096 train shards:

```bash
rm -f data/manifest.json
HF_HOME=/workspace/hf-cache \
HUGGINGFACE_HUB_CACHE=/workspace/hf-cache/hub \
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 86
```

## Official-Style Reproduction

For a comparable reproduction, use 8xH100 SXM:

```bash
cd records/track_non_record_16mb/2026-05-07_sp4096_budget_repro
DATA_DIR=../../../data \
RUN_ID=sp4096_repro_seed42 \
SEED=42 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Expected accepted-record ballpark from the source run:

```text
pre-quantization post-ema val_bpb ~= 1.1041
final_int6_roundtrip val_bpb ~= 1.1159
final_int6_sliding_window val_bpb ~= 1.0974
Total submission size ~= 15.9 MB
```

## Next Experiment Knobs

Once the reproduction works, the first cheap sweeps I would try are:

- `QK_GAIN_INIT=5.0`
- `MUON_WD=0.09` and `EMBED_WD=0.09`
- `MATRIX_LR=0.018`
- `GPTQ_CALIBRATION_BATCHES=32` for faster iteration, then restore `64`

Treat one seed as signal-finding only. A real claim needs multiple seeds.
# Draft Non-Record Submission Note

This submission is a budget reproduction and engineering note for the SP4096 line of Parameter Golf submissions.

It starts from Kevin Clark's accepted `4096-Vocab + Larger Model + High WD + Simplifications` record and tests a constrained 1xH100 setup:

- 1x H100 SXM instead of 8xH100
- 86 SP4096 train shards due to pod storage limits
- 3600 second train cap
- Full validation and sliding-window eval
- Reproducible artifact fitting under the decimal 16MB cap

The current result is valid but not SOTA:

```text
valid_sliding_bpb: 1.10743376
total_artifact_bytes: 15,987,195
```

The main implementation addition is an artifact-fit guard. If the compressed quantized artifact exceeds the byte cap, the script repeatedly coarsens selected low-impact quantized tensors and re-compresses. In the best RunPod result, repeated coarsening of `ve_shared.embed.weight` reduced the artifact enough to fit while changing sliding BPB by about `0.00012`.

The best valid local artifact is `runpod_results/final_model.qk45.valid.ptz`, produced from a `QK_GAIN_INIT=4.5` run. It improves the first 1xH100 budget baseline from `1.11043153` to `1.10743376` sliding BPB, but remains a non-record result relative to the public leaderboard.
---

Python dependencies (pip requirements):

```text
numpy
sentencepiece
brotli
huggingface-hub
tqdm
typing-extensions==4.15.0
```
# RunPod Results: 1xH100 Budget Runs

Date: 2026-05-07 UTC

Hardware:

- 1x NVIDIA H100 SXM 80GB
- RunPod on-demand pod
- PyTorch `2.11.0+cu130`
- FlashAttention 3

Data:

- SP4096 tokenizer/data from `kevclark/parameter-golf`
- `86` train shards available on the pod
- Full validation shard

Baseline run:

```bash
DATA_DIR=../../../data \
RUN_ID=sp4096_1xh100_seed42_86shards_retry \
SEED=42 \
MAX_WALLCLOCK_SECONDS=3600 \
VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

Training stopped cleanly at the wallclock cap:

```text
stopping_early: wallclock_cap train_time: 3590665ms step: 3975/20000
pre-quantization post-ema val_loss:2.57429177 val_bpb:1.11875755
```

The unmodified quantized artifact was slightly over the 16,000,000 byte cap:

```text
Serialized model int6+brotli: 15995984 bytes
Code size: 68206 bytes
Total submission size int6+brotli: 16064190 bytes
final_int6_sliding_window val_loss:2.55503271 val_bpb:1.11038778
```

Valid artifact salvage:

- Coarsened only `ve_shared.embed.weight.q` by halving integer codes and doubling its scale.
- Wrote `final_model.valid.ptz`.
- This reduces compressed model bytes enough to fit under the cap with negligible BPB movement.
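The code-halving/scale-doubling step can be illustrated in isolation (a toy sketch, not the salvage code; it assumes a tensor stored as integer codes plus one scale):

```python
def coarsen_quantized(codes, scale):
    """Halve the integer codes and double the scale.

    Dequantized values codes[i] * scale are preserved exactly for even
    codes and move by at most one old quantization step for odd codes,
    while the halved code range compresses smaller.
    """
    return [c // 2 for c in codes], scale * 2.0

codes, scale = [-20, -3, 0, 7, 31], 0.01
coarse, scale2 = coarsen_quantized(codes, scale)
max_err = max(abs(c * scale - c2 * scale2) for c, c2 in zip(codes, coarse))
print(max_err)  # ~0.01, i.e. one old quantization step
```

This bounded dequantization error is consistent with the "negligible BPB movement" observed above.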

Valid result:

```text
valid_blob_bytes:15917702
code_bytes:70187
total_bytes:15987889
valid_roundtrip val_loss:2.59734720 val_bpb:1.12877717
valid_sliding val_loss:2.55513338 val_bpb:1.11043153
```
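As a consistency check on the numbers above: loss is logged in nats per token and bpb in bits per byte, so both (loss, bpb) pairs should imply the same bytes-per-token ratio for the validation set (the conversion formula is the standard one, assumed to match the script's logging):

```python
import math

def implied_bytes_per_token(val_loss_nats, val_bpb):
    """loss (nats/token) -> bits/token, then divide by bits/byte."""
    return (val_loss_nats / math.log(2)) / val_bpb

# (loss, bpb) pairs copied from the valid-result block above
sliding = implied_bytes_per_token(2.55513338, 1.11043153)
roundtrip = implied_bytes_per_token(2.59734720, 1.12877717)
print(round(sliding, 3), round(roundtrip, 3))  # both ~3.32 bytes/token
```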

Local copied artifacts:

- `sp4096_1xh100_seed42_86shards_retry.txt`
- `final_model.valid.ptz`

## QK 4.5 Improvement

Follow-up run on the migrated RunPod pod:

```bash
DATA_DIR=../../../data \
RUN_ID=sp4096_1xh100_seed42_86shards_qk45 \
SEED=42 \
QK_GAIN_INIT=4.5 \
MAX_WALLCLOCK_SECONDS=3600 \
VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

Training stopped cleanly at the wallclock cap:

```text
stopping_early: wallclock_cap train_time: 3590178ms step: 4271/20000
pre-quantization post-ema val_loss:2.56785422 val_bpb:1.11595986
```

The first one-pass artifact fit was still over cap:

```text
Artifact fit: coarsened ve_shared.embed.weight
Serialized model int6+brotli: 16041069 bytes
Total submission size int6+brotli: 16111256 bytes
WARNING: artifact exceeds cap:16111256>16000000
final_int6_roundtrip val_loss:2.59018066 val_bpb:1.12566268
final_int6_sliding_window val_loss:2.54796393 val_bpb:1.10731577
```

After patching the guard to repeatedly coarsen the selected tensor, the salvaged valid artifact is:

```text
valid_blob_bytes:15916816
code_bytes:70379
total_bytes:15987195
valid_roundtrip val_loss:2.59046517 val_bpb:1.12578632
valid_sliding val_loss:2.54823542 val_bpb:1.10743376
```

Local copied artifacts:

- `sp4096_1xh100_seed42_86shards_qk45.txt`
- `sp4096_qk45_valid_eval.txt`
- `final_model.qk45.valid.ptz`