`EXPERIMENT_PLAN.md` (new file, 117 lines)
# Experiment Plan

This plan is optimized for limited budget and the challenge rules.

## Goals

- Improve `final_int8_zlib_roundtrip_exact val_bpb`
- Improve `final_int8_ttt_lora val_bpb`
- Stay under the `16,000,000` byte artifact cap
- Avoid risky dataset changes until the safe path is exhausted

## 5-Run Moonshot Sequence

Run these in order on remote GPUs, using the current branch and `TRAIN_SHARDS=1`:

1. `drope_eval`
2. `yarn_eval`
3. `mtp_low`
4. `muon_balance`
5. `hybrid_delta`

Run the entire sequence:

```bash
NPROC_PER_NODE=1 bash scripts/run_moonshot5.sh
```

This prints the tail of each run's log, then a ranked JSON summary against the control run `twice_eval2048_ttt1024_clean2`.

Ranking priority:

1. Lowest `final_int8_ttt_lora val_bpb`
2. Lowest `final_int8_zlib_roundtrip_exact val_bpb`
3. Smallest artifact
4. Fastest step time
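The ranking above is a lexicographic sort. A sketch of the sort key (the metric field names here are assumptions, not the summary's actual schema):

```python
def rank_key(run):
    # Ascending sort: lower ttt bpb wins first, then roundtrip bpb,
    # then smaller artifact, then faster step time.
    return (
        run["final_int8_ttt_lora_val_bpb"],
        run["final_int8_zlib_roundtrip_exact_val_bpb"],
        run["artifact_bytes"],
        run["step_avg_ms"],
    )

# Usage: ranked = sorted(runs, key=rank_key)
```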

Promotion rules:

- Promote any run that beats the control on at least one final metric without exceeding the artifact cap.
- Promote `hybrid_delta` if it beats the control on either final metric, even slightly.
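The promotion rules reduce to one predicate (a sketch; the metric keys are assumed shorthand, not the real log fields):

```python
ARTIFACT_CAP = 16_000_000  # bytes, from the challenge rules

def should_promote(run, control):
    """Promote if the run beats the control on at least one final
    metric and stays under the artifact cap."""
    beats_ttt = run["ttt_bpb"] < control["ttt_bpb"]
    beats_roundtrip = run["roundtrip_bpb"] < control["roundtrip_bpb"]
    return (beats_ttt or beats_roundtrip) and run["artifact_bytes"] <= ARTIFACT_CAP
```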

Next-step rules:

- If `drope_eval` beats `yarn_eval`, keep DRoPE and drop YaRN.
- If `yarn_eval` beats `drope_eval`, keep YaRN and drop DRoPE.
- If `mtp_low` wins, sweep `MTP_DEPTH=3` and `MTP_LOSS_WEIGHT` in `0.05`, `0.1`, `0.2`.
- If `muon_balance` wins, sweep `MUON_UPDATE_BALANCE` in `0.25`, `0.5`, `0.75`.
- If `hybrid_delta` wins even slightly, open a dedicated hybrid branch next.

## Next Moonshot

New architecture branch:

1. `shared_depth`

Idea:

- reuse `4` unique blocks across `10` logical layers
- keep tiny per-pass learned output scales so reused blocks can still specialize
- preserve the existing optimizer, export, and TTT paths
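A minimal sketch of the weight-sharing loop in plain Python (the real version would live in `train_gpt.py` with learned tensors; the residual form is an assumption):

```python
def shared_depth_forward(x, blocks, n_logical=10, pass_scales=None):
    """Cycle len(blocks) unique blocks across n_logical passes.
    Each logical pass gets its own output scale, so a reused block
    can still contribute differently at different depths."""
    if pass_scales is None:
        pass_scales = [1.0] * n_logical  # tiny learned scalars in the real model
    n_unique = len(blocks)
    for i in range(n_logical):
        block = blocks[i % n_unique]
        x = x + pass_scales[i] * block(x)  # assumed residual update
    return x
```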

## Dataset And Tokenizer Work

The challenge allows tokenizer and dataset changes, but the README warns that they will be examined carefully and that you must prove the `val_bpb` calculation remains correct. See [README.md](README.md#L168).
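Bits-per-byte normalizes loss by raw byte count rather than token count, which is why tokenizer changes cannot game it. A sketch of the calculation:

```python
import math

def val_bpb(total_nll_nats, total_utf8_bytes):
    """Total negative log-likelihood over the validation docs, in nats,
    converted to bits and divided by the UTF-8 byte length of those docs."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)
```

The key invariant when re-exporting shards: `total_utf8_bytes` must come from the raw validation text, not from any tokenized representation.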

Safest path:

- Rebuild tokenizers from the published docs cache only
- Re-export shards from the same selected docs
- Keep validation on the fixed first `50k` docs

Use:

```bash
bash scripts/rebuild_tokenizer_export.sh
```

Default ablation config:

- `sp_bpe_768`
- `sp_bpe_1024`
- `sp_bpe_1280`
- `sp_bpe_1536`
- `pure_byte_260`

After the model-side shortlist settles, do these data sweeps:

1. Rebuild `sp_bpe_768`, `sp_bpe_1280`, and `pure_byte_260`
2. Rerun the current best profile on `TRAIN_SHARDS=1`
3. Only promote tokenizer changes that help `final_int8_ttt_lora` without pushing artifact bytes in the wrong direction

## Dataset Ideas That Look Safe

- Vary tokenizer vocab size on the same published docs
- Compare pure-byte vs SentencePiece BPE
- Train on a prefix of shards, then do a short final stage on a higher-quality subset from the same docs
- Filter obviously low-value docs from the training side only
- Keep document boundaries clean during training and eval

## Risky Ideas

- External corpora
- Changing validation docs
- Any data use at eval time beyond what the rules allow
- Tokenizer changes without exact byte-accounting validation

## Success Metrics

For each run, record:

- `val_bpb`
- `final_int8_zlib_roundtrip_exact val_bpb`
- `final_int8_ttt_lora val_bpb`
- `Total submission size int8+zlib`
- `step_avg`

If a tokenizer change helps pre-quant quality but hurts artifact bytes, reject it early.
---

`REMOTE_RUNBOOK.md` (new file, 83 lines)
# Remote Runbook

This repo is ready for the CUDA path.

## Recommended Path

Use the official Runpod Parameter Golf template mentioned in [README.md](README.md).

Start with one of these:

- `1x H100`: cheapest realistic sanity-check path for code, logs, artifact size, and eval behavior.
- `8x H100 SXM`: record-track run once the recipe looks stable.

## First-Time Remote Setup

On the remote box:

```bash
cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
git remote add myfork <your-fork-url>
git fetch myfork
git checkout <your-branch-with-our-changes>
```

Then hydrate the published cache:

```bash
TRAIN_SHARDS=1 bash scripts/remote_fetch_data.sh
```

For a fuller training prefix:

```bash
TRAIN_SHARDS=10 bash scripts/remote_fetch_data.sh
```

## First Experiment

This is the first recipe to run against our merged script:

```bash
NPROC_PER_NODE=1 bash scripts/run_remote_experiment.sh
```

For a full multi-GPU run:

```bash
NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
```

## What This Recipe Uses

- `10` layers
- fp16 tied-embedding export
- NTK-aware longer eval support
- sliding-window eval with stride `64`
- decoupled Muon weight decay
- overtone embedding init
- phase-shaped residual mixing init
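One reading of "sliding-window eval with stride `64`": every block of `stride` tokens is scored exactly once, with up to `window - stride` tokens of left context. A sketch of the span bookkeeping (the actual windowing in the eval code may differ):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """For each stride-sized block, return (ctx_start, block_start, block_end):
    the model sees [ctx_start, block_end) but loss is taken only on
    [block_start, block_end), so each token is scored once with long context."""
    spans = []
    for block_start in range(0, n_tokens, stride):
        block_end = min(block_start + stride, n_tokens)
        ctx_start = max(0, block_end - window)
        spans.append((ctx_start, block_start, block_end))
    return spans
```

The cost of small strides is many forward passes per document, which is why `EVAL_STRIDE=0` (plain chunked eval) is worth ablating for step-time comparison.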

## First Ablations To Queue

Run these one at a time after the first successful remote run:

```bash
EVAL_STRIDE=0 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
EVAL_SEQ_LEN=2048 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
NUM_LAYERS=9 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
MUON_WEIGHT_DECAY=0.00 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
OVERTONE_INIT_POWER=0.00 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
```

## What To Look For

- `step_avg`
- final `val_bpb`
- final `final_int8_zlib_roundtrip_exact`
- final `final_int8_ttt_lora`
- total `int8+zlib` artifact bytes

If you send me a remote log, I can turn it into the next ablation decision quickly.
---
Best raw checkpoint metadata captured from the Runpod pod before shutdown:

- Run ID: `twice_eval2048_ttt1024`
- Remote path: `/workspace/parameter-golf/final_model.pt`
- Size on pod: `72M`
- SHA256: `292d79fa54a638be348354f09d185f80b69710e7de8f4dfa42b36e43afccdc96`

The raw `.pt` file itself was not copied into this repo because Runpod's SSH wrapper blocked automated binary transfer through `scp`. If you want to preserve the raw checkpoint, keep the pod or its volume alive until we manually copy it out tomorrow.
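When the checkpoint is finally copied out, verify it against the hash above. A streaming sketch:

```python
import hashlib

EXPECTED_SHA256 = "292d79fa54a638be348354f09d185f80b69710e7de8f4dfa42b36e43afccdc96"

def sha256_of(path, chunk_size=1 << 20):
    """Hash in 1 MiB chunks so a 72M checkpoint never sits in memory whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: assert sha256_of("final_model.pt") == EXPECTED_SHA256
```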
---

`data/tokenizer_specs.ablation.json` (new file, 29 lines)
{
  "tokenizers": [
    {
      "name": "sp_bpe_768",
      "dataset_suffix": "sp768",
      "vocab_size": 768
    },
    {
      "name": "sp_bpe_1024",
      "dataset_suffix": "sp1024",
      "vocab_size": 1024
    },
    {
      "name": "sp_bpe_1280",
      "dataset_suffix": "sp1280",
      "vocab_size": 1280
    },
    {
      "name": "sp_bpe_1536",
      "dataset_suffix": "sp1536",
      "vocab_size": 1536
    },
    {
      "name": "pure_byte_260",
      "dataset_suffix": "byte260",
      "kind": "pure_byte"
    }
  ]
}
---

`program.md` (new file, 86 lines)
# Parameter Golf Research Program

You are working inside the OpenAI Parameter Golf repository.

## Objective

Improve the challenge score under these constraints:

- optimize `final_int8_ttt_lora val_bpb`
- optimize `final_int8_zlib_roundtrip_exact val_bpb`
- keep `Total submission size int8+zlib` under `16,000,000` bytes
- preserve reproducibility

Lower `val_bpb` is better.

## Primary Rules

1. Prefer small, ablation-friendly changes.
2. Keep changes concentrated in `train_gpt.py` unless there is a strong reason not to.
3. Reject changes that improve one metric but badly regress the other.
4. Reject changes that push artifact size toward the budget without a clear score win.
5. Do not change the validation set.
6. Treat tokenizer or dataset changes as higher-risk and require stronger evidence.

## Current Priors

- Sliding-window evaluation is high value.
- FP16 tied embedding export is high value.
- 10-layer small models are promising.
- Decoupled Muon weight decay is promising.
- `ATTN_TWICE_ALPHA=0.05` currently looks better than baseline.
- `Z_LOSS_COEF=0.0001` currently looks worse than baseline.

## Current Best Known Local Results

- `base10l`
- `roundtrip_val_bpb = 1.40296458`
- `ttt_val_bpb = 1.3976`
- `artifact_bytes = 10831123`

- `twice_low`
- `roundtrip_val_bpb = 1.40177526`
- `ttt_val_bpb = 1.3969`
- `artifact_bytes = 10836065`

## Experiment Order

1. `twice_eval2048`
2. best `twice_*` variant on more seeds
3. training-context and batch tradeoff ablations
4. tokenizer ablations on published docs cache

## Allowed Edit Zones

- architecture details in `train_gpt.py`
- training schedule and optimizer settings
- quantization/export logic
- evaluation logic
- remote profile scripts

## High-Risk Areas

- external datasets
- validation handling
- complex multi-file refactors
- changes that increase code size substantially

## Decision Policy

Keep a change only if at least one is true:

- `final_int8_ttt_lora` improves and `roundtrip_exact` does not materially regress
- `roundtrip_exact` improves and `ttt` does not materially regress
- artifact size drops meaningfully with near-flat score

Reject a change if:

- both `ttt` and `roundtrip_exact` regress
- artifact size grows with no score benefit
- it adds a lot of complexity without measurable value
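The keep/reject policy as a predicate (a sketch; the "materially regress" tolerance of `0.0005` bpb is an assumed value, not one from the rules):

```python
TOL_BPB = 0.0005  # assumed threshold for "does not materially regress"

def keep_change(ttt_delta, roundtrip_delta, artifact_delta):
    """Deltas are candidate minus control, so negative means better/smaller."""
    ttt_wins = ttt_delta < 0 and roundtrip_delta <= TOL_BPB
    roundtrip_wins = roundtrip_delta < 0 and ttt_delta <= TOL_BPB
    shrinks_flat = (artifact_delta < 0
                    and abs(ttt_delta) <= TOL_BPB
                    and abs(roundtrip_delta) <= TOL_BPB)
    return ttt_wins or roundtrip_wins or shrinks_flat
```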

## Logging And Packaging

- Use `scripts/run_remote_profile.sh` or `scripts/run_and_score.sh`
- Parse logs with `scripts/parse_run.py`
- Package strong candidates with `scripts/package_record.sh`
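If `scripts/parse_run.py` is unavailable, metrics can be scraped by hand. This sketch assumes log lines shaped like `final_int8_ttt_lora val_bpb 1.3969`, which may not match the real log format:

```python
import re

METRIC_RE = re.compile(r"^(?P<name>\w+)\s+val_bpb\s+(?P<value>[0-9.]+)$")

def scrape_val_bpb(log_text):
    """Return {metric_name: val_bpb} for every matching line."""
    out = {}
    for line in log_text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            out[m.group("name")] = float(m.group("value"))
    return out
```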
---

`records/_template/README.md` (new file, 38 lines)
# Submission Name

One-paragraph summary of the idea and why it matters for Parameter Golf.

## Key Techniques

1. Technique 1
2. Technique 2
3. Technique 3

## Results

| Seed | val_loss | val_bpb | Steps | ms/step |
|------|----------|---------|-------|---------|
| 1337 | TBD | TBD | TBD | TBD |
| 42 | TBD | TBD | TBD | TBD |
| 7 | TBD | TBD | TBD | TBD |
| **Mean** | **TBD** | **TBD** | | |

Artifact: `TBD` bytes | Eval time: `TBD`

## Configuration

```bash
# Paste the exact training command here
```

## Notes

- Explain artifact accounting if needed
- Explain tokenizer/dataset changes if any
- Explain evaluation procedure if non-standard

## Included Files

- `train_gpt.py`
- `submission.json`
- `train_seed*.log`
---

`records/_template/submission.json` (new file, 20 lines)
{
  "track": "10min_16mb",
  "date": "YYYY-MM-DD",
  "name": "Submission Name",
  "author": "Your Name",
  "github_id": "YourGitHubID",
  "seed_results": {
    "1337": {
      "val_loss": 0.0,
      "val_bpb": 0.0,
      "steps": 0,
      "ms_per_step": 0.0
    }
  },
  "mean_val_loss": 0.0,
  "mean_val_bpb": 0.0,
  "p_value": 1.0,
  "artifact_bytes": 0,
  "code_bytes": 0
}
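A quick sanity check that a filled-in copy of the template above still carries every field (the field list mirrors the template; the check itself is a convenience, not part of the challenge tooling):

```python
import json

REQUIRED_FIELDS = {
    "track", "date", "name", "author", "github_id", "seed_results",
    "mean_val_loss", "mean_val_bpb", "p_value", "artifact_bytes", "code_bytes",
}

def missing_fields(submission_text):
    """Return the template fields absent from a submission.json string."""
    return sorted(REQUIRED_FIELDS - set(json.loads(submission_text)))
```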