records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/README.md (new file, +52 lines)
# Mockingbird

10k-vocab CaseOps body on the SOTA architecture, derived from PR1855.

This is a **non-record** submission. It does not beat the current leader. It is filed as evidence of the SP10240 CaseOps lane on the same compression / phased-TTT machinery as PR1855, for comparison with the SP8192 lane.

## Results

| Seed | val_bpb (quantized_ttt_phased) | Steps | Total submission size |
|------|--------------------------------|-------|------------------------|
| 42 | 1.06204667 | 5,264 | 15,816,988 B |
| 0 | 1.06226648 | 5,231 | 15,818,783 B |
| 1 | 1.06299064 | 5,221 | 15,810,544 B |
| **mean** | **1.06243460** | | **15,818,783 B (max)** |

Hardware: 8×H100 SXM · 600s wallclock · `bytes_code`: 163,036 (uncompressed) / 41,220 (compressed)

## Architecture

11L · dim 512 · `mlp_mult=3.75` · loop_start=3, loop_end=5, `enable_looping_at=0.45`

- **Vocab/data**: SP10240 CaseOps lossless-caps tokenizer (10,240 tokens), FineWeb 10B sidecar with byte-level loss accounting
- **Quantization**: per-group, embed int7, matrix int6, LQER asymmetric rank-4 (a hedged sketch follows this list)
- **Eval**: PR1855 phased LoRA TTT — `prefix_docs=2500`, `phases=3`, `chunk=48`
- **Compression**: pergroup
- **Train budget**: 600 s wallclock, hard 16 MB artifact cap
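
For orientation, the quantization bullet can be pictured as below. This is a minimal sketch, not the submission's implementation: the group size, the int6/int7 packing, and the exact asymmetric LQER variant in `train_gpt.py` are all assumptions here; only the shape of the idea (per-group asymmetric quantization plus a rank-4 low-rank correction of the residual error) is intended.

```python
# Minimal sketch only: per-group asymmetric int quantization with a rank-4
# low-rank correction of the quantization error (the LQER idea). Group size,
# bit packing, and the exact variant used by train_gpt.py are assumptions.
import torch

def quantize_lqer(W: torch.Tensor, bits: int = 6, group: int = 64, rank: int = 4):
    rows, cols = W.shape                        # assumes cols is a multiple of `group`
    Wg = W.reshape(rows, cols // group, group)
    lo = Wg.min(dim=-1, keepdim=True).values
    hi = Wg.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / (2**bits - 1)  # asymmetric: per-group scale...
    zero = lo                                          # ...and per-group zero point
    q = ((Wg - zero) / scale).round().clamp(0, 2**bits - 1)
    W_hat = (q * scale + zero).reshape(rows, cols)     # dequantized weights
    # LQER: keep a rank-`rank` factorization of the residual error in full precision
    U, S, Vh = torch.linalg.svd(W - W_hat, full_matrices=False)
    A = U[:, :rank] * S[:rank]                         # (rows, rank)
    B = Vh[:rank, :]                                   # (rank, cols)
    return q.to(torch.uint8), scale, zero, A, B        # reconstruct: q*scale + zero + A @ B
```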

## Lineage

This is the SP10240 sister of PR #1855 (`510d03e0fc355406c9fd06f92d23b8c5aedea7fb`), which used the same CaseOps + LQER + phased-TTT machinery on SP8192 and reported a 3-seed mean of 1.06107587 post-phased-TTT.

The architecture is held fixed; only the tokenizer / vocab dimension changes (8192 → 10240). The 10k vocab consumes more bytes in the embedding table, so the body is shrunk to MLP3.75 (vs the SP8192 record's wider body) to stay under the 16 MB cap. `enable_looping_at=0.45` matches the same family.
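
As a rough illustration of the byte-fit pressure (a back-of-envelope sketch only: it assumes the embedding rows are model-dim 512 wide and cost about 7 bits per weight, and it ignores per-group scales, zero points, and LQER factors):

```python
# Back-of-envelope only; the real accounting is the byte-fit note under
# tokenization_10kvocab/notes/.
extra_rows = 10240 - 8192                # extra vocab entries vs the SP8192 lane
extra_bytes = extra_rows * 512 * 7 / 8   # dim 512, ~7 bits per embedding weight
print(f"~{extra_bytes / 1e6:.2f} MB of extra embedding table")  # ~0.92 MB
```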

## Seeds

The three runs used identical code and hyperparameters; only the random seed changed. The committed `train_gpt.py` is the seed-42 run (the strongest of the three). Seeds 0 and 1 differ only in `Hyperparameters.seed = N` (line 479 of `train_gpt.py`) and the bookkeeping fields `TEST_ID` / `TEST_DATE` / `RUN_KIND` / blurb (lines 433–446). The training body is byte-identical.

Seed choice (`42`, `0`, `1`) reflects the seed-repeat batch we ran on this lane; this submission does not use the protocol's `444 / 300` convention because these specific runs were not re-executed at those seeds.

## Reproduce

```bash
# From repo root, with flash-attention/hopper on PYTHONPATH
SKIP_GPTQ=1 SEED=42 torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/train_gpt.py
```

For seeds 0 and 1, change line 479 (`Hyperparameters.seed = 42`) to `0` or `1` respectively; the `SEED` environment variable passed on the command line is overridden by the in-file value, so editing the file is required.

## Artifacts

Per-seed compressed artifacts (`final_model.int6.ptz`) and SHA256 hashes are recorded in `submission.json`. Each artifact is well under the 16 MB cap (max 15.82 MB).
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/submission.json (new file, +54 lines)
{
  "author": "Frosty40",
  "github_id": "newjordan",
  "name": "Mockingbird",
  "blurb": "10k-vocab CaseOps body on the SOTA architecture: SP10240 lossless-caps tokenizer with PR1855 LQER asym rank-4 + per-group + phased-TTT compression, 11L dim512 mlp_mult=3.75 with looping enabled at 0.45 of training.",
  "date": "2026-05-01T00:00:00Z",
  "track": "10min_16mb",
  "record": false,
  "val_bpb": 1.0624,
  "val_bpb_exact": 1.06243460,
  "val_bpb_std": 0.00040,
  "seeds": [42, 0, 1],
  "seed_results": {
    "42": {
      "val_bpb": 1.0620,
      "val_bpb_exact": 1.06204667,
      "val_loss_exact": 2.38207881,
      "steps": 5264,
      "train_time_ms": 599612,
      "eval_time_ms": 446734,
      "bytes_total": 15816988,
      "compressed_artifact_bytes": 15775768,
      "compressed_artifact_sha256": "68f570ab2cccfa31ecc7064e68eada5fa83cc969c5267a6c74bfd4fe8d5835f9"
    },
    "0": {
      "val_bpb": 1.0623,
      "val_bpb_exact": 1.06226648,
      "val_loss_exact": 2.38257182,
      "steps": 5231,
      "train_time_ms": 599626,
      "eval_time_ms": 517991,
      "bytes_total": 15818783,
      "compressed_artifact_bytes": 15777612,
      "compressed_artifact_sha256": "892f585d130801de2116aa3bfcd67aafc337119e1484c1b0f3a54d8e51bb6614"
    },
    "1": {
      "val_bpb": 1.0630,
      "val_bpb_exact": 1.06299064,
      "val_loss_exact": 2.38419604,
      "steps": 5221,
      "train_time_ms": 599587,
      "eval_time_ms": 510971,
      "bytes_total": 15810544,
      "compressed_artifact_bytes": 15769440,
      "compressed_artifact_sha256": "46f1f5e5bb1dc67a29c1c88934910832855a0593a008da652a656da413ff2d23"
    }
  },
  "bytes_total": 15818783,
  "bytes_code": 163036,
  "bytes_code_compressed": 41220,
  "hardware": "8xH100 SXM",
  "wallclock_train_s": 600,
  "derives_from_pr": "openai/parameter-golf#1855"
}
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/tokenization_10kvocab/README.md (new file, +94 lines)
# 10k vocab tooling — reviewer verification

Everything needed to inspect and reproduce the SP10240 CaseOps tokenization stack that mockingbird trained on. This subdir is appendix material — `train_gpt.py` and the seed logs in the parent directory remain the canonical submission.

## Layout

```
tokenization_10kvocab/
├── README.md (this file)
├── tokenizer/
│ ├── fineweb_10240_bpe_lossless_caps_caseops_v1_reserved.model ← actual tokenizer mockingbird used
│ ├── fineweb_10240_bpe_lossless_caps_caseops_v1_reserved.vocab
│ ├── fineweb_10240_bpe.model ← base SP10240 BPE (no CaseOps reserves) for reference
│ ├── fineweb_10240_bpe.vocab
│ └── tokenizer_specs_sp10240.json ← BPE training spec (skip_docs, vocab_size, etc.)
├── build/
│ ├── run_sp10240_build.sh ← one-command rebuild from FineWeb 10B docs
│ ├── run_sp10240_upload.sh ← HF upload helper (used to publish the dataset)
│ └── sp10240_build.log ← BPE training log from the actual build
├── caseops/
│ ├── lossless_caps.py ← CaseOps codec module (4 reserved operators)
│ ├── prepare_sp10240_caseops_data.py ← end-to-end CaseOps tokenizer + dataset prep
│ ├── build_sp10240_caseops_local.sh ← local rebuild driver
│ ├── upload_sp10240_caseops_to_hf.sh ← HF upload driver
│ ├── download_sp10240_first80_from_hf.sh ← partial-shard download (first 80)
│ ├── download_sp10240_full124_from_hf.sh ← full 124-shard download
│ └── stream_pr1855_caseops_to_pod.sh ← pod streaming helper used in this lane
└── notes/
├── 2026-04-30_10k_caseops_hf_lane.md ← derivation note: how this lane was built
└── 2026-04-30_claude_sp10240_bytefit_plan.md ← the byte-fit reasoning (why MLP3.75 not MLP4)
```

## Tokenizer

- **Vocab size:** 10,240
- **Variant:** SP10240 lossless-caps CaseOps with 4 reserved operator codepoints

The CaseOps-active tokenizer is `fineweb_10240_bpe_lossless_caps_caseops_v1_reserved.model`. It is derived from the same trainer spec that PR #1855 used for SP8192 (a hedged trainer-call sketch follows this list):

- BPE, byte fallback enabled
- split-digits enabled
- `nmt_nfkc` normalization
- no dummy prefix
- pad / bos / eos / unk ids = 0 / 1 / 2 / 3
- hard vocab limit disabled
- reserved ids: U+E001=4, U+E002=5, U+E003=6, U+E004=7 (the four CaseOps operators)
- training corpus: FineWeb 10B docs `[50000, end)` (val docs `[0, 50000)` excluded — `tokenizer_skip_docs=50000` in `tokenizer_specs_sp10240.json`)
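
For reference, here is a minimal sketch of a SentencePiece trainer call matching the spec above. The input path and model prefix are placeholders; the canonical build goes through `build/run_sp10240_build.sh` and `caseops/prepare_sp10240_caseops_data.py`.

```python
# Hedged sketch: trainer options mirroring the spec described above
# (tokenizer_specs_sp10240.json). Input path and model prefix are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="fineweb10B_docs_from_50000.txt",  # placeholder: train docs [50000, end)
    model_prefix="fineweb_10240_bpe_lossless_caps_caseops_v1_reserved",
    model_type="bpe",
    vocab_size=10240,
    byte_fallback=True,
    split_digits=True,
    normalization_rule_name="nmt_nfkc",
    add_dummy_prefix=False,
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,
    hard_vocab_limit=False,
    # the four CaseOps operators land at ids 4-7, right after the special ids
    user_defined_symbols=["\ue001", "\ue002", "\ue003", "\ue004"],
)
```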

The standard `fineweb_10240_bpe.model` is included alongside as a reference — it is the same BPE training run **without** CaseOps reserved operators (those four codepoints map to `<unk>` id 3). Useful for diff inspection of the embedding-table cost of reserving the four ops.

## CaseOps codec

`caseops/lossless_caps.py` is the encode/decode module. The four operators are inserted at preprocessing time to record case information losslessly so the BPE doesn't need to allocate vocab to capitalization. At eval time, decode reverses the operators to reconstruct the original text.
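
To make the round trip concrete, here is a purely illustrative sketch. The operator semantics below (capitalize-next-word and uppercase-next-word, using two of the four reserved codepoints) are assumptions for illustration; the actual four-operator scheme is whatever `caseops/lossless_caps.py` defines.

```python
# Illustrative only: a two-operator lossless-caps codec. The real codec in
# caseops/lossless_caps.py uses four operators and its own rules.
import re

CAP, UPPER = "\ue001", "\ue002"  # two of the four reserved codepoints

def encode(text: str) -> str:
    def repl(m):
        w = m.group(0)
        if w.isupper() and len(w) > 1:
            return UPPER + w.lower()        # "NASA"  -> UPPER + "nasa"
        if w[0].isupper() and w[1:].islower():
            return CAP + w.lower()          # "Plans" -> CAP + "plans"
        return w                            # mixed case left untouched in this sketch
    return re.sub(r"[A-Za-z]+", repl, text)

def decode(text: str) -> str:
    text = re.sub(re.escape(UPPER) + r"([a-z]+)", lambda m: m.group(1).upper(), text)
    text = re.sub(re.escape(CAP) + r"([a-z]+)", lambda m: m.group(1).capitalize(), text)
    return text

assert decode(encode("NASA Plans a Launch")) == "NASA Plans a Launch"
```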

The `prepare_sp10240_caseops_data.py` script trains the CaseOps tokenizer when no compatible model is found and tokenizes FineWeb 10B end-to-end into the dataset shards. It is the single source of truth for how mockingbird's training data was produced.

## Dataset

The full preprocessed dataset (124 train shards + 1 val shard, ~5 GB) is published publicly on Hugging Face — too large to commit to git:

**https://huggingface.co/datasets/Frosty40/10k_golfer**

Reviewers can pull it with either of:

```bash
bash caseops/download_sp10240_full124_from_hf.sh # all shards
bash caseops/download_sp10240_first80_from_hf.sh # first 80 only — enough to repro the run
```

Both scripts use the standard HF CLI and require `huggingface_hub>=1.8.0`.

## Reproducing the tokenizer + dataset from scratch

If you don't trust the HF artifacts and want to rebuild:

```bash
# 1. Build the standard SP10240 BPE tokenizer (no CaseOps)
bash build/run_sp10240_build.sh

# 2. Re-train the lossless-caps CaseOps variant + tokenize FineWeb 10B end-to-end
bash caseops/build_sp10240_caseops_local.sh
```

Output lands at `data/datasets/fineweb10B_sp10240_caseops/...` matching the paths the training scripts expect.

## Why this is in a non-record PR

A non-record submission is the right venue to land the 10k vocab tooling: it gives reviewers full access to the tokenizer, the CaseOps codec, the build/upload scripts, and the derivation notes — even though mockingbird's BPB does not beat PR #1855. The same machinery applied to SP8192 would produce a near-record SmearGate-class run; we're documenting the SP10240 cost on otherwise-identical compression / phased-TTT machinery.

## Provenance

- Tokenizer file `fineweb_10240_bpe_lossless_caps_caseops_v1_reserved.model` size 401,915 B; the byte-identical copy used by all three mockingbird seeds is at `legs/2026-05-01_pr1855_sp10240_caseops_mlp375_late045_seed{0,1}_8x/tokenizers/` and `evidence/pod_pulls/8x_10320714983_20260501_sp10240_mlp375_late045_clean_submission_candidate/...` on the source repo.
- CaseOps module `lossless_caps.py` is the seed-42 lane copy; seeds 0 and 1 used byte-identical copies of the same module.
- Build log `sp10240_build.log` is the actual SentencePiece trainer output from the build that produced the standard SP10240 BPE.
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/tokenization_10kvocab/build/run_sp10240_build.sh (new file, +30 lines)
#!/usr/bin/env bash
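# Wrapper for the SP10240 build: back up the data-dir index files, run
# download_hf_docs_and_tokenize.py against the SP10240 spec, and restore the
# index files on exit (see the cleanup trap below).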
set -uo pipefail

DATA_DIR=/home/frosty40/parameter-golf-lab/data
SPEC=/home/frosty40/parameter-golf-lab/tokenizer_specs_sp10240.json
LOG=/home/frosty40/parameter-golf-lab/sp10240_build.log
PY=/home/frosty40/miniconda3/bin/python3
TS=$(date +%Y%m%d_%H%M%S)

MAN_BAK="$DATA_DIR/manifest.json.bak.before_sp10240_$TS"
TCE_BAK="$DATA_DIR/tokenizer_config.export.json.bak.before_sp10240_$TS"

cp "$DATA_DIR/manifest.json" "$MAN_BAK"
cp "$DATA_DIR/tokenizer_config.export.json" "$TCE_BAK"
echo "[wrapper] backed up index files to $MAN_BAK / $TCE_BAK" | tee -a "$LOG"

cleanup() {
  rc=$?
  echo "[wrapper] build exited rc=$rc; restoring index files from backup" | tee -a "$LOG"
  cp "$MAN_BAK" "$DATA_DIR/manifest.json"
  cp "$TCE_BAK" "$DATA_DIR/tokenizer_config.export.json"
  echo "[wrapper] index files restored. New tokenizer/dataset (if built) remain in place." | tee -a "$LOG"
}
trap cleanup EXIT

echo "[wrapper] starting sp10240 build at $TS" | tee -a "$LOG"
"$PY" "$DATA_DIR/download_hf_docs_and_tokenize.py" \
  --output-root "$DATA_DIR" \
  --tokenizer-config "$SPEC" \
  --skip-byte 2>&1 | tee -a "$LOG"
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/tokenization_10kvocab/build/run_sp10240_upload.sh (new file, +52 lines)
#!/usr/bin/env bash
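# Upload watcher: wait for the in-flight SP10240 build (by PID) to exit, check
# that the tokenizer and dataset shards materialized, then push them to the
# Frosty40/10k_golfer dataset repo on Hugging Face via the hf CLI.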
set -uo pipefail

DATA_DIR=/home/frosty40/parameter-golf-lab/data
TOK_MODEL="$DATA_DIR/tokenizers/fineweb_10240_bpe.model"
TOK_VOCAB="$DATA_DIR/tokenizers/fineweb_10240_bpe.vocab"
DATASET_DIR="$DATA_DIR/datasets/fineweb10B_sp10240"
LOG=/home/frosty40/parameter-golf-lab/sp10240_upload.log
HF=/home/frosty40/miniconda3/bin/hf
REPO=Frosty40/10k_golfer
BUILD_PID=2018905

echo "[upload] watcher started $(date -Iseconds)" | tee -a "$LOG"
echo "[upload] waiting for build PID $BUILD_PID to exit, then for outputs to materialize" | tee -a "$LOG"

# Wait for the build process to exit
while kill -0 "$BUILD_PID" 2>/dev/null; do
  sleep 30
done
echo "[upload] build PID $BUILD_PID exited at $(date -Iseconds)" | tee -a "$LOG"

# Wait for outputs to be visible (script may flush after exit)
for i in $(seq 1 20); do
  if [[ -s "$TOK_MODEL" && -s "$TOK_VOCAB" && -d "$DATASET_DIR" ]]; then
    shard_count=$(ls "$DATASET_DIR"/fineweb_train_*.bin 2>/dev/null | wc -l)
    val_count=$(ls "$DATASET_DIR"/fineweb_val_*.bin 2>/dev/null | wc -l)
    if [[ "$shard_count" -gt 0 && "$val_count" -gt 0 ]]; then
      echo "[upload] outputs ready: tokenizer + $shard_count train shards + $val_count val shards" | tee -a "$LOG"
      break
    fi
  fi
  echo "[upload] outputs not yet visible (try $i/20), sleeping 15s" | tee -a "$LOG"
  sleep 15
done

if [[ ! -s "$TOK_MODEL" || ! -d "$DATASET_DIR" ]]; then
echo "[upload] FATAL: outputs not present after build exit. Check sp10240_build.log." | tee -a "$LOG"
exit 1
fi

echo "[upload] creating repo $REPO (public, dataset)" | tee -a "$LOG"
"$HF" repo create "$REPO" --repo-type dataset 2>&1 | tee -a "$LOG" || \
echo "[upload] repo create returned nonzero (likely already exists), continuing" | tee -a "$LOG"

echo "[upload] uploading tokenizer files" | tee -a "$LOG"
"$HF" upload "$REPO" "$TOK_MODEL" "fineweb_10240_bpe.model" --repo-type dataset 2>&1 | tee -a "$LOG"
"$HF" upload "$REPO" "$TOK_VOCAB" "fineweb_10240_bpe.vocab" --repo-type dataset 2>&1 | tee -a "$LOG"

echo "[upload] uploading dataset shards from $DATASET_DIR (large folder)" | tee -a "$LOG"
"$HF" upload-large-folder "$REPO" "$DATASET_DIR" --repo-type dataset 2>&1 | tee -a "$LOG"

echo "[upload] DONE at $(date -Iseconds). Repo: https://huggingface.co/datasets/$REPO" | tee -a "$LOG"