`EXPERIMENT_PLAN.md` (new file, 117 lines)
# Experiment Plan

This plan is optimized for limited budget and the challenge rules.

## Goals

- Improve `final_int8_zlib_roundtrip_exact val_bpb`
- Improve `final_int8_ttt_lora val_bpb`
- Stay under the `16,000,000` byte artifact cap
- Avoid risky dataset changes until the safe path is exhausted

## 5-Run Moonshot Sequence

Run these in order on remote GPUs, using the current branch and `TRAIN_SHARDS=1`:

1. `drope_eval`
2. `yarn_eval`
3. `mtp_low`
4. `muon_balance`
5. `hybrid_delta`

Run the entire sequence:

```bash
NPROC_PER_NODE=1 bash scripts/run_moonshot5.sh
```

This prints the tail of each run's log, then a ranked JSON summary against the control run `twice_eval2048_ttt1024_clean2`.

Ranking priority:

1. Lowest `final_int8_ttt_lora val_bpb`
2. Lowest `final_int8_zlib_roundtrip_exact val_bpb`
3. Smallest artifact
4. Fastest step time
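The ranking above is a lexicographic sort. A sketch of the sort key (the metric field names here are assumptions, not the summary's actual schema):

```python
def rank_key(run):
    # Ascending sort: lower ttt bpb wins first, then roundtrip bpb,
    # then smaller artifact, then faster step time.
    return (
        run["final_int8_ttt_lora_val_bpb"],
        run["final_int8_zlib_roundtrip_exact_val_bpb"],
        run["artifact_bytes"],
        run["step_avg_ms"],
    )

# Usage: ranked = sorted(runs, key=rank_key)
```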

Promotion rules:

- Promote any run that beats the control on at least one final metric without exceeding the artifact cap.
- Promote `hybrid_delta` if it beats the control on either final metric, even slightly.
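The promotion rules reduce to one predicate (a sketch; the metric keys are assumed shorthand, not the real log fields):

```python
ARTIFACT_CAP = 16_000_000  # bytes, from the challenge rules

def should_promote(run, control):
    """Promote if the run beats the control on at least one final
    metric and stays under the artifact cap."""
    beats_ttt = run["ttt_bpb"] < control["ttt_bpb"]
    beats_roundtrip = run["roundtrip_bpb"] < control["roundtrip_bpb"]
    return (beats_ttt or beats_roundtrip) and run["artifact_bytes"] <= ARTIFACT_CAP
```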

Next-step rules:

- If `drope_eval` beats `yarn_eval`, keep DRoPE and drop YaRN.
- If `yarn_eval` beats `drope_eval`, keep YaRN and drop DRoPE.
- If `mtp_low` wins, sweep `MTP_DEPTH=3` and `MTP_LOSS_WEIGHT` in `0.05`, `0.1`, `0.2`.
- If `muon_balance` wins, sweep `MUON_UPDATE_BALANCE` in `0.25`, `0.5`, `0.75`.
- If `hybrid_delta` wins even slightly, open a dedicated hybrid branch next.

## Next Moonshot

New architecture branch:

1. `shared_depth`

Idea:

- reuse `4` unique blocks across `10` logical layers
- keep tiny per-pass learned output scales so reused blocks can still specialize
- preserve the existing optimizer, export, and TTT paths
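A minimal sketch of the weight-sharing loop in plain Python (the real version would live in `train_gpt.py` with learned tensors; the residual form is an assumption):

```python
def shared_depth_forward(x, blocks, n_logical=10, pass_scales=None):
    """Cycle len(blocks) unique blocks across n_logical passes.
    Each logical pass gets its own output scale, so a reused block
    can still contribute differently at different depths."""
    if pass_scales is None:
        pass_scales = [1.0] * n_logical  # tiny learned scalars in the real model
    n_unique = len(blocks)
    for i in range(n_logical):
        block = blocks[i % n_unique]
        x = x + pass_scales[i] * block(x)  # assumed residual update
    return x
```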

## Dataset And Tokenizer Work

The challenge allows tokenizer and dataset changes, but the README warns that they will be examined carefully and that you must prove the `val_bpb` calculation remains correct. See [README.md](README.md#L168).
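Bits-per-byte normalizes loss by raw byte count rather than token count, which is why tokenizer changes cannot game it. A sketch of the calculation:

```python
import math

def val_bpb(total_nll_nats, total_utf8_bytes):
    """Total negative log-likelihood over the validation docs, in nats,
    converted to bits and divided by the UTF-8 byte length of those docs."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)
```

The key invariant when re-exporting shards: `total_utf8_bytes` must come from the raw validation text, not from any tokenized representation.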

Safest path:

- Rebuild tokenizers from the published docs cache only
- Re-export shards from the same selected docs
- Keep validation on the fixed first `50k` docs

Use:

```bash
bash scripts/rebuild_tokenizer_export.sh
```

Default ablation config:

- `sp_bpe_768`
- `sp_bpe_1024`
- `sp_bpe_1280`
- `sp_bpe_1536`
- `pure_byte_260`

After the model-side shortlist settles, do these data sweeps:

1. Rebuild `sp_bpe_768`, `sp_bpe_1280`, and `pure_byte_260`
2. Rerun the current best profile on `TRAIN_SHARDS=1`
3. Only promote tokenizer changes that help `final_int8_ttt_lora` without pushing artifact bytes in the wrong direction

## Dataset Ideas That Look Safe

- Vary tokenizer vocab size on the same published docs
- Compare pure-byte vs SentencePiece BPE
- Train on a prefix of shards, then do a short final stage on a higher-quality subset from the same docs
- Filter obviously low-value docs from the training side only
- Keep document boundaries clean during training and eval

## Risky Ideas

- External corpora
- Changing validation docs
- Any data use at eval time beyond what the rules allow
- Tokenizer changes without exact byte-accounting validation

## Success Metrics

For each run, record:

- `val_bpb`
- `final_int8_zlib_roundtrip_exact val_bpb`
- `final_int8_ttt_lora val_bpb`
- `Total submission size int8+zlib`
- `step_avg`

If a tokenizer change helps pre-quant quality but hurts artifact bytes, reject it early.
---

`REMOTE_RUNBOOK.md` (new file, 83 lines)
# Remote Runbook

This repo is ready for the CUDA path.

## Recommended Path

Use the official Runpod Parameter Golf template mentioned in [README.md](README.md).

Start with one of these:

- `1x H100`: cheapest realistic sanity-check path for code, logs, artifact size, and eval behavior.
- `8x H100 SXM`: record-track run once the recipe looks stable.

## First-Time Remote Setup

On the remote box:

```bash
cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
git remote add myfork <your-fork-url>
git fetch myfork
git checkout <your-branch-with-our-changes>
```

Then hydrate the published cache:

```bash
TRAIN_SHARDS=1 bash scripts/remote_fetch_data.sh
```

For a fuller training prefix:

```bash
TRAIN_SHARDS=10 bash scripts/remote_fetch_data.sh
```

## First Experiment

This is the first recipe to run against our merged script:

```bash
NPROC_PER_NODE=1 bash scripts/run_remote_experiment.sh
```

For a full multi-GPU run:

```bash
NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
```

## What This Recipe Uses

- `10` layers
- fp16 tied-embedding export
- NTK-aware longer eval support
- sliding-window eval with stride `64`
- decoupled Muon weight decay
- overtone embedding init
- phase-shaped residual mixing init
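One reading of "sliding-window eval with stride `64`": every block of `stride` tokens is scored exactly once, with up to `window - stride` tokens of left context. A sketch of the span bookkeeping (the actual windowing in the eval code may differ):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """For each stride-sized block, return (ctx_start, block_start, block_end):
    the model sees [ctx_start, block_end) but loss is taken only on
    [block_start, block_end), so each token is scored once with long context."""
    spans = []
    for block_start in range(0, n_tokens, stride):
        block_end = min(block_start + stride, n_tokens)
        ctx_start = max(0, block_end - window)
        spans.append((ctx_start, block_start, block_end))
    return spans
```

The cost of small strides is many forward passes per document, which is why `EVAL_STRIDE=0` (plain chunked eval) is worth ablating for step-time comparison.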

## First Ablations To Queue

Run these one at a time after the first successful remote run:

```bash
EVAL_STRIDE=0 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
EVAL_SEQ_LEN=2048 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
NUM_LAYERS=9 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
MUON_WEIGHT_DECAY=0.00 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
OVERTONE_INIT_POWER=0.00 NPROC_PER_NODE=8 bash scripts/run_remote_experiment.sh
```

## What To Look For

- `step_avg`
- final `val_bpb`
- final `final_int8_zlib_roundtrip_exact`
- final `final_int8_ttt_lora`
- total `int8+zlib` artifact bytes

If you send me a remote log, I can turn it into the next ablation decision quickly.
---
Best raw checkpoint metadata captured from the Runpod pod before shutdown:

- Run ID: `twice_eval2048_ttt1024`
- Remote path: `/workspace/parameter-golf/final_model.pt`
- Size on pod: `72M`
- SHA256: `292d79fa54a638be348354f09d185f80b69710e7de8f4dfa42b36e43afccdc96`

The raw `.pt` file itself was not copied into this repo because Runpod's SSH wrapper blocked automated binary transfer through `scp`. If you want to preserve the raw checkpoint, keep the pod or its volume alive until we manually copy it out tomorrow.
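When the checkpoint is finally copied out, verify it against the hash above. A streaming sketch:

```python
import hashlib

EXPECTED_SHA256 = "292d79fa54a638be348354f09d185f80b69710e7de8f4dfa42b36e43afccdc96"

def sha256_of(path, chunk_size=1 << 20):
    """Hash in 1 MiB chunks so a 72M checkpoint never sits in memory whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: assert sha256_of("final_model.pt") == EXPECTED_SHA256
```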
---

`data/tokenizer_specs.ablation.json` (new file, 29 lines)
{
  "tokenizers": [
    {
      "name": "sp_bpe_768",
      "dataset_suffix": "sp768",
      "vocab_size": 768
    },
    {
      "name": "sp_bpe_1024",
      "dataset_suffix": "sp1024",
      "vocab_size": 1024
    },
    {
      "name": "sp_bpe_1280",
      "dataset_suffix": "sp1280",
      "vocab_size": 1280
    },
    {
      "name": "sp_bpe_1536",
      "dataset_suffix": "sp1536",
      "vocab_size": 1536
    },
    {
      "name": "pure_byte_260",
      "dataset_suffix": "byte260",
      "kind": "pure_byte"
    }
  ]
}
---

`program.md` (new file, 86 lines)
# Parameter Golf Research Program

You are working inside the OpenAI Parameter Golf repository.

## Objective

Improve the challenge score under these constraints:

- optimize `final_int8_ttt_lora val_bpb`
- optimize `final_int8_zlib_roundtrip_exact val_bpb`
- keep `Total submission size int8+zlib` under `16,000,000` bytes
- preserve reproducibility

Lower `val_bpb` is better.

## Primary Rules

1. Prefer small, ablation-friendly changes.
2. Keep changes concentrated in `train_gpt.py` unless there is a strong reason not to.
3. Reject changes that improve one metric but badly regress the other.
4. Reject changes that push artifact size toward the budget without a clear score win.
5. Do not change the validation set.
6. Treat tokenizer or dataset changes as higher-risk and require stronger evidence.

## Current Priors

- Sliding-window evaluation is high value.
- FP16 tied embedding export is high value.
- 10-layer small models are promising.
- Decoupled Muon weight decay is promising.
- `ATTN_TWICE_ALPHA=0.05` currently looks better than baseline.
- `Z_LOSS_COEF=0.0001` currently looks worse than baseline.

## Current Best Known Local Results

- `base10l`
- `roundtrip_val_bpb = 1.40296458`
- `ttt_val_bpb = 1.3976`
- `artifact_bytes = 10831123`

- `twice_low`
- `roundtrip_val_bpb = 1.40177526`
- `ttt_val_bpb = 1.3969`
- `artifact_bytes = 10836065`

## Experiment Order

1. `twice_eval2048`
2. best `twice_*` variant on more seeds
3. training-context and batch tradeoff ablations
4. tokenizer ablations on published docs cache

## Allowed Edit Zones

- architecture details in `train_gpt.py`
- training schedule and optimizer settings
- quantization/export logic
- evaluation logic
- remote profile scripts

## High-Risk Areas

- external datasets
- validation handling
- complex multi-file refactors
- changes that increase code size substantially

## Decision Policy

Keep a change only if at least one is true:

- `final_int8_ttt_lora` improves and `roundtrip_exact` does not materially regress
- `roundtrip_exact` improves and `ttt` does not materially regress
- artifact size drops meaningfully with near-flat score

Reject a change if:

- both `ttt` and `roundtrip_exact` regress
- artifact size grows with no score benefit
- it adds a lot of complexity without measurable value
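The keep/reject policy as a predicate (a sketch; the "materially regress" tolerance of `0.0005` bpb is an assumed value, not one from the rules):

```python
TOL_BPB = 0.0005  # assumed threshold for "does not materially regress"

def keep_change(ttt_delta, roundtrip_delta, artifact_delta):
    """Deltas are candidate minus control, so negative means better/smaller."""
    ttt_wins = ttt_delta < 0 and roundtrip_delta <= TOL_BPB
    roundtrip_wins = roundtrip_delta < 0 and ttt_delta <= TOL_BPB
    shrinks_flat = (artifact_delta < 0
                    and abs(ttt_delta) <= TOL_BPB
                    and abs(roundtrip_delta) <= TOL_BPB)
    return ttt_wins or roundtrip_wins or shrinks_flat
```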

## Logging And Packaging

- Use `scripts/run_remote_profile.sh` or `scripts/run_and_score.sh`
- Parse logs with `scripts/parse_run.py`
- Package strong candidates with `scripts/package_record.sh`
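If `scripts/parse_run.py` is unavailable, metrics can be scraped by hand. This sketch assumes log lines shaped like `final_int8_ttt_lora val_bpb 1.3969`, which may not match the real log format:

```python
import re

METRIC_RE = re.compile(r"^(?P<name>\w+)\s+val_bpb\s+(?P<value>[0-9.]+)$")

def scrape_val_bpb(log_text):
    """Return {metric_name: val_bpb} for every matching line."""
    out = {}
    for line in log_text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            out[m.group("name")] = float(m.group("value"))
    return out
```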
---

`records/_template/README.md` (new file, 38 lines)
# Submission Name

One-paragraph summary of the idea and why it matters for Parameter Golf.

## Key Techniques

1. Technique 1
2. Technique 2
3. Technique 3

## Results

| Seed | val_loss | val_bpb | Steps | ms/step |
|------|----------|---------|-------|---------|
| 1337 | TBD | TBD | TBD | TBD |
| 42 | TBD | TBD | TBD | TBD |
| 7 | TBD | TBD | TBD | TBD |
| **Mean** | **TBD** | **TBD** | | |

Artifact: `TBD` bytes | Eval time: `TBD`

## Configuration

```bash
# Paste the exact training command here
```

## Notes

- Explain artifact accounting if needed
- Explain tokenizer/dataset changes if any
- Explain evaluation procedure if non-standard

## Included Files

- `train_gpt.py`
- `submission.json`
- `train_seed*.log`
---

`records/_template/submission.json` (new file, 20 lines)
{
  "track": "10min_16mb",
  "date": "YYYY-MM-DD",
  "name": "Submission Name",
  "author": "Your Name",
  "github_id": "YourGitHubID",
  "seed_results": {
    "1337": {
      "val_loss": 0.0,
      "val_bpb": 0.0,
      "steps": 0,
      "ms_per_step": 0.0
    }
  },
  "mean_val_loss": 0.0,
  "mean_val_bpb": 0.0,
  "p_value": 1.0,
  "artifact_bytes": 0,
  "code_bytes": 0
}
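A quick sanity check that a filled-in copy of the template above still carries every field (the field list mirrors the template; the check itself is a convenience, not part of the challenge tooling):

```python
import json

REQUIRED_FIELDS = {
    "track", "date", "name", "author", "github_id", "seed_results",
    "mean_val_loss", "mean_val_bpb", "p_value", "artifact_bytes", "code_bytes",
}

def missing_fields(submission_text):
    """Return the template fields absent from a submission.json string."""
    return sorted(REQUIRED_FIELDS - set(json.loads(submission_text)))
```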