records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/README.md (new file, +52 lines)
# Mockingbird

10k-vocab CaseOps body on the SOTA architecture, derived from PR1855.

This is a **non-record** submission. It does not beat the current leader. It is filed as evidence of the SP10240 CaseOps lane on the same compression / phased-TTT machinery as PR1855, for comparison with the SP8192 lane.

## Results

| Seed | val_bpb (quantized_ttt_phased) | Steps | Total submission size |
|------|--------------------------------|-------|------------------------|
| 42 | 1.06204667 | 5,264 | 15,816,988 B |
| 0 | 1.06226648 | 5,231 | 15,818,783 B |
| 1 | 1.06299064 | 5,221 | 15,810,544 B |
| **mean** | **1.06243460** | | **15,818,783 B (max)** |

Hardware: 8×H100 SXM · 600s wallclock · `bytes_code`: 163,036 (uncompressed) / 41,220 (compressed)

## Architecture

11L · dim 512 · `mlp_mult=3.75` · loop_start=3, loop_end=5, `enable_looping_at=0.45`

- **Vocab/data**: SP10240 CaseOps lossless-caps tokenizer (10,240 tokens), FineWeb 10B sidecar with byte-level loss accounting
- **Quantization**: per-group, embed int7, matrix int6, LQER asymmetric rank-4 (a hedged sketch follows this list)
- **Eval**: PR1855 phased LoRA TTT — `prefix_docs=2500`, `phases=3`, `chunk=48`
- **Compression**: pergroup
- **Train budget**: 600 s wallclock, hard 16 MB artifact cap
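
For orientation, the quantization bullet can be pictured as below. This is a minimal sketch, not the submission's implementation: the group size, the int6/int7 packing, and the exact asymmetric LQER variant in `train_gpt.py` are all assumptions here; only the shape of the idea (per-group asymmetric quantization plus a rank-4 low-rank correction of the residual error) is intended.

```python
# Minimal sketch only: per-group asymmetric int quantization with a rank-4
# low-rank correction of the quantization error (the LQER idea). Group size,
# bit packing, and the exact variant used by train_gpt.py are assumptions.
import torch

def quantize_lqer(W: torch.Tensor, bits: int = 6, group: int = 64, rank: int = 4):
    rows, cols = W.shape                        # assumes cols is a multiple of `group`
    Wg = W.reshape(rows, cols // group, group)
    lo = Wg.min(dim=-1, keepdim=True).values
    hi = Wg.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / (2**bits - 1)  # asymmetric: per-group scale...
    zero = lo                                          # ...and per-group zero point
    q = ((Wg - zero) / scale).round().clamp(0, 2**bits - 1)
    W_hat = (q * scale + zero).reshape(rows, cols)     # dequantized weights
    # LQER: keep a rank-`rank` factorization of the residual error in full precision
    U, S, Vh = torch.linalg.svd(W - W_hat, full_matrices=False)
    A = U[:, :rank] * S[:rank]                         # (rows, rank)
    B = Vh[:rank, :]                                   # (rank, cols)
    return q.to(torch.uint8), scale, zero, A, B        # reconstruct: q*scale + zero + A @ B
```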

## Lineage

This is the SP10240 sister of PR #1855 (`510d03e0fc355406c9fd06f92d23b8c5aedea7fb`), which used the same CaseOps + LQER + phased-TTT machinery on SP8192 and reported a 3-seed mean of 1.06107587 post-phased-TTT.

The architecture is held fixed; only the tokenizer / vocab dimension changes (8192 → 10240). The 10k vocab consumes more bytes in the embedding table, so the body is shrunk to MLP3.75 (vs the SP8192 record's wider body) to stay under the 16 MB cap. `enable_looping_at=0.45` matches the same family.
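
As a rough illustration of the byte-fit pressure (a back-of-envelope sketch only: it assumes the embedding rows are model-dim 512 wide and cost about 7 bits per weight, and it ignores per-group scales, zero points, and LQER factors):

```python
# Back-of-envelope only; the real accounting is the byte-fit note under
# tokenization_10kvocab/notes/.
extra_rows = 10240 - 8192                # extra vocab entries vs the SP8192 lane
extra_bytes = extra_rows * 512 * 7 / 8   # dim 512, ~7 bits per embedding weight
print(f"~{extra_bytes / 1e6:.2f} MB of extra embedding table")  # ~0.92 MB
```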

## Seeds

The three runs used identical code and hyperparameters; only the random seed changed. The committed `train_gpt.py` is the seed-42 run (the strongest of the three). Seeds 0 and 1 differ only in `Hyperparameters.seed = N` (line 479 of `train_gpt.py`) and the bookkeeping fields `TEST_ID` / `TEST_DATE` / `RUN_KIND` / blurb (lines 433–446). The training body is byte-identical.

Seed choice (`42`, `0`, `1`) reflects the seed-repeat batch we ran on this lane; this submission does not use the protocol's `444 / 300` convention because these specific runs were not re-executed at those seeds.

## Reproduce

```bash
# From repo root, with flash-attention/hopper on PYTHONPATH
SKIP_GPTQ=1 SEED=42 torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/train_gpt.py
```

For seeds 0 and 1, change line 479 (`Hyperparameters.seed = 42`) to `0` or `1` respectively; the `SEED` environment variable passed on the command line is overridden by the in-file value, so editing the file is required.

## Artifacts

Per-seed compressed artifacts (`final_model.int6.ptz`) and SHA256 hashes are recorded in `submission.json`. Each artifact is well under the 16 MB cap (max 15.82 MB).
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/submission.json (new file, +54 lines)
{
  "author": "Frosty40",
  "github_id": "newjordan",
  "name": "Mockingbird",
  "blurb": "10k-vocab CaseOps body on the SOTA architecture: SP10240 lossless-caps tokenizer with PR1855 LQER asym rank-4 + per-group + phased-TTT compression, 11L dim512 mlp_mult=3.75 with looping enabled at 0.45 of training.",
  "date": "2026-05-01T00:00:00Z",
  "track": "10min_16mb",
  "record": false,
  "val_bpb": 1.0624,
  "val_bpb_exact": 1.06243460,
  "val_bpb_std": 0.00040,
  "seeds": [42, 0, 1],
  "seed_results": {
    "42": {
      "val_bpb": 1.0620,
      "val_bpb_exact": 1.06204667,
      "val_loss_exact": 2.38207881,
      "steps": 5264,
      "train_time_ms": 599612,
      "eval_time_ms": 446734,
      "bytes_total": 15816988,
      "compressed_artifact_bytes": 15775768,
      "compressed_artifact_sha256": "68f570ab2cccfa31ecc7064e68eada5fa83cc969c5267a6c74bfd4fe8d5835f9"
    },
    "0": {
      "val_bpb": 1.0623,
      "val_bpb_exact": 1.06226648,
      "val_loss_exact": 2.38257182,
      "steps": 5231,
      "train_time_ms": 599626,
      "eval_time_ms": 517991,
      "bytes_total": 15818783,
      "compressed_artifact_bytes": 15777612,
      "compressed_artifact_sha256": "892f585d130801de2116aa3bfcd67aafc337119e1484c1b0f3a54d8e51bb6614"
    },
    "1": {
      "val_bpb": 1.0630,
      "val_bpb_exact": 1.06299064,
      "val_loss_exact": 2.38419604,
      "steps": 5221,
      "train_time_ms": 599587,
      "eval_time_ms": 510971,
      "bytes_total": 15810544,
      "compressed_artifact_bytes": 15769440,
      "compressed_artifact_sha256": "46f1f5e5bb1dc67a29c1c88934910832855a0593a008da652a656da413ff2d23"
    }
  },
  "bytes_total": 15818783,
  "bytes_code": 163036,
  "bytes_code_compressed": 41220,
  "hardware": "8xH100 SXM",
  "wallclock_train_s": 600,
  "derives_from_pr": "openai/parameter-golf#1855"
}
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/tokenization_10kvocab/README.md (new file, +94 lines)
# 10k vocab tooling — reviewer verification

Everything needed to inspect and reproduce the SP10240 CaseOps tokenization stack that mockingbird trained on. This subdir is appendix material — `train_gpt.py` and the seed logs in the parent directory remain the canonical submission.

## Layout

```
tokenization_10kvocab/
├── README.md (this file)
├── tokenizer/
│ ├── fineweb_10240_bpe_lossless_caps_caseops_v1_reserved.model ← actual tokenizer mockingbird used
│ ├── fineweb_10240_bpe_lossless_caps_caseops_v1_reserved.vocab
│ ├── fineweb_10240_bpe.model ← base SP10240 BPE (no CaseOps reserves) for reference
│ ├── fineweb_10240_bpe.vocab
│ └── tokenizer_specs_sp10240.json ← BPE training spec (skip_docs, vocab_size, etc.)
├── build/
│ ├── run_sp10240_build.sh ← one-command rebuild from FineWeb 10B docs
│ ├── run_sp10240_upload.sh ← HF upload helper (used to publish the dataset)
│ └── sp10240_build.log ← BPE training log from the actual build
├── caseops/
│ ├── lossless_caps.py ← CaseOps codec module (4 reserved operators)
│ ├── prepare_sp10240_caseops_data.py ← end-to-end CaseOps tokenizer + dataset prep
│ ├── build_sp10240_caseops_local.sh ← local rebuild driver
│ ├── upload_sp10240_caseops_to_hf.sh ← HF upload driver
│ ├── download_sp10240_first80_from_hf.sh ← partial-shard download (first 80)
│ ├── download_sp10240_full124_from_hf.sh ← full 124-shard download
│ └── stream_pr1855_caseops_to_pod.sh ← pod streaming helper used in this lane
└── notes/
├── 2026-04-30_10k_caseops_hf_lane.md ← derivation note: how this lane was built
└── 2026-04-30_claude_sp10240_bytefit_plan.md ← the byte-fit reasoning (why MLP3.75 not MLP4)
```

## Tokenizer

- **Vocab size:** 10,240
- **Variant:** SP10240 lossless-caps CaseOps with 4 reserved operator codepoints

The CaseOps-active tokenizer is `fineweb_10240_bpe_lossless_caps_caseops_v1_reserved.model`. It is derived from the same trainer spec that PR #1855 used for SP8192 (a hedged trainer-call sketch follows this list):

- BPE, byte fallback enabled
- split-digits enabled
- `nmt_nfkc` normalization
- no dummy prefix
- pad / bos / eos / unk ids = 0 / 1 / 2 / 3
- hard vocab limit disabled
- reserved ids: U+E001=4, U+E002=5, U+E003=6, U+E004=7 (the four CaseOps operators)
- training corpus: FineWeb 10B docs `[50000, end)` (val docs `[0, 50000)` excluded — `tokenizer_skip_docs=50000` in `tokenizer_specs_sp10240.json`)
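
For reference, here is a minimal sketch of a SentencePiece trainer call matching the spec above. The input path and model prefix are placeholders; the canonical build goes through `build/run_sp10240_build.sh` and `caseops/prepare_sp10240_caseops_data.py`.

```python
# Hedged sketch: trainer options mirroring the spec described above
# (tokenizer_specs_sp10240.json). Input path and model prefix are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="fineweb10B_docs_from_50000.txt",  # placeholder: train docs [50000, end)
    model_prefix="fineweb_10240_bpe_lossless_caps_caseops_v1_reserved",
    model_type="bpe",
    vocab_size=10240,
    byte_fallback=True,
    split_digits=True,
    normalization_rule_name="nmt_nfkc",
    add_dummy_prefix=False,
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,
    hard_vocab_limit=False,
    # the four CaseOps operators land at ids 4-7, right after the special ids
    user_defined_symbols=["\ue001", "\ue002", "\ue003", "\ue004"],
)
```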

The standard `fineweb_10240_bpe.model` is included alongside as a reference — it is the same BPE training run **without** CaseOps reserved operators (those four codepoints map to `<unk>` id 3). Useful for diff inspection of the embedding-table cost of reserving the four ops.

## CaseOps codec

`caseops/lossless_caps.py` is the encode/decode module. The four operators are inserted at preprocessing time to record case information losslessly so the BPE doesn't need to allocate vocab to capitalization. At eval time, decode reverses the operators to reconstruct the original text.
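
To make the round trip concrete, here is a purely illustrative sketch. The operator semantics below (capitalize-next-word and uppercase-next-word, using two of the four reserved codepoints) are assumptions for illustration; the actual four-operator scheme is whatever `caseops/lossless_caps.py` defines.

```python
# Illustrative only: a two-operator lossless-caps codec. The real codec in
# caseops/lossless_caps.py uses four operators and its own rules.
import re

CAP, UPPER = "\ue001", "\ue002"  # two of the four reserved codepoints

def encode(text: str) -> str:
    def repl(m):
        w = m.group(0)
        if w.isupper() and len(w) > 1:
            return UPPER + w.lower()        # "NASA"  -> UPPER + "nasa"
        if w[0].isupper() and w[1:].islower():
            return CAP + w.lower()          # "Plans" -> CAP + "plans"
        return w                            # mixed case left untouched in this sketch
    return re.sub(r"[A-Za-z]+", repl, text)

def decode(text: str) -> str:
    text = re.sub(re.escape(UPPER) + r"([a-z]+)", lambda m: m.group(1).upper(), text)
    text = re.sub(re.escape(CAP) + r"([a-z]+)", lambda m: m.group(1).capitalize(), text)
    return text

assert decode(encode("NASA Plans a Launch")) == "NASA Plans a Launch"
```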

The `prepare_sp10240_caseops_data.py` script trains the CaseOps tokenizer when no compatible model is found and tokenizes FineWeb 10B end-to-end into the dataset shards. It is the single source of truth for how mockingbird's training data was produced.

## Dataset

The full preprocessed dataset (124 train shards + 1 val shard, ~5 GB) is published publicly on Hugging Face — too large to commit to git:

**https://huggingface.co/datasets/Frosty40/10k_golfer**

Reviewers can pull it with either of:

```bash
bash caseops/download_sp10240_full124_from_hf.sh # all shards
bash caseops/download_sp10240_first80_from_hf.sh # first 80 only — enough to repro the run
```

Both scripts use the standard HF CLI and require `huggingface_hub>=1.8.0`.

## Reproducing the tokenizer + dataset from scratch

If you don't trust the HF artifacts and want to rebuild:

```bash
# 1. Build the standard SP10240 BPE tokenizer (no CaseOps)
bash build/run_sp10240_build.sh

# 2. Re-train the lossless-caps CaseOps variant + tokenize FineWeb 10B end-to-end
bash caseops/build_sp10240_caseops_local.sh
```

Output lands at `data/datasets/fineweb10B_sp10240_caseops/...` matching the paths the training scripts expect.

## Why this is in a non-record PR

A non-record submission is the right venue to land the 10k vocab tooling: it gives reviewers full access to the tokenizer, the CaseOps codec, the build/upload scripts, and the derivation notes — even though mockingbird's BPB does not beat PR #1855. The same machinery applied to SP8192 would produce a near-record SmearGate-class run; we're documenting the SP10240 cost on otherwise-identical compression / phased-TTT machinery.

## Provenance

- Tokenizer file `fineweb_10240_bpe_lossless_caps_caseops_v1_reserved.model` size 401,915 B; the byte-identical copy used by all three mockingbird seeds is at `legs/2026-05-01_pr1855_sp10240_caseops_mlp375_late045_seed{0,1}_8x/tokenizers/` and `evidence/pod_pulls/8x_10320714983_20260501_sp10240_mlp375_late045_clean_submission_candidate/...` on the source repo.
- CaseOps module `lossless_caps.py` is the seed-42 lane copy; seeds 0 and 1 used byte-identical copies of the same module.
- Build log `sp10240_build.log` is the actual SentencePiece trainer output from the build that produced the standard SP10240 BPE.
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/tokenization_10kvocab/build/run_sp10240_build.sh (new file, +30 lines)
#!/usr/bin/env bash
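# Wrapper for the SP10240 build: back up the data-dir index files, run
# download_hf_docs_and_tokenize.py against the SP10240 spec, and restore the
# index files on exit (see the cleanup trap below).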
set -uo pipefail

DATA_DIR=/home/frosty40/parameter-golf-lab/data
SPEC=/home/frosty40/parameter-golf-lab/tokenizer_specs_sp10240.json
LOG=/home/frosty40/parameter-golf-lab/sp10240_build.log
PY=/home/frosty40/miniconda3/bin/python3
TS=$(date +%Y%m%d_%H%M%S)

MAN_BAK="$DATA_DIR/manifest.json.bak.before_sp10240_$TS"
TCE_BAK="$DATA_DIR/tokenizer_config.export.json.bak.before_sp10240_$TS"

cp "$DATA_DIR/manifest.json" "$MAN_BAK"
cp "$DATA_DIR/tokenizer_config.export.json" "$TCE_BAK"
echo "[wrapper] backed up index files to $MAN_BAK / $TCE_BAK" | tee -a "$LOG"

cleanup() {
  rc=$?
  echo "[wrapper] build exited rc=$rc; restoring index files from backup" | tee -a "$LOG"
  cp "$MAN_BAK" "$DATA_DIR/manifest.json"
  cp "$TCE_BAK" "$DATA_DIR/tokenizer_config.export.json"
  echo "[wrapper] index files restored. New tokenizer/dataset (if built) remain in place." | tee -a "$LOG"
}
trap cleanup EXIT

echo "[wrapper] starting sp10240 build at $TS" | tee -a "$LOG"
"$PY" "$DATA_DIR/download_hf_docs_and_tokenize.py" \
  --output-root "$DATA_DIR" \
  --tokenizer-config "$SPEC" \
  --skip-byte 2>&1 | tee -a "$LOG"
records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/tokenization_10kvocab/build/run_sp10240_upload.sh (new file, +52 lines)
#!/usr/bin/env bash
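# Upload watcher: wait for the in-flight SP10240 build (by PID) to exit, check
# that the tokenizer and dataset shards materialized, then push them to the
# Frosty40/10k_golfer dataset repo on Hugging Face via the hf CLI.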
set -uo pipefail

DATA_DIR=/home/frosty40/parameter-golf-lab/data
TOK_MODEL="$DATA_DIR/tokenizers/fineweb_10240_bpe.model"
TOK_VOCAB="$DATA_DIR/tokenizers/fineweb_10240_bpe.vocab"
DATASET_DIR="$DATA_DIR/datasets/fineweb10B_sp10240"
LOG=/home/frosty40/parameter-golf-lab/sp10240_upload.log
HF=/home/frosty40/miniconda3/bin/hf
REPO=Frosty40/10k_golfer
BUILD_PID=2018905

echo "[upload] watcher started $(date -Iseconds)" | tee -a "$LOG"
echo "[upload] waiting for build PID $BUILD_PID to exit, then for outputs to materialize" | tee -a "$LOG"

# Wait for the build process to exit
while kill -0 "$BUILD_PID" 2>/dev/null; do
  sleep 30
done
echo "[upload] build PID $BUILD_PID exited at $(date -Iseconds)" | tee -a "$LOG"

# Wait for outputs to be visible (script may flush after exit)
for i in $(seq 1 20); do
  if [[ -s "$TOK_MODEL" && -s "$TOK_VOCAB" && -d "$DATASET_DIR" ]]; then
    shard_count=$(ls "$DATASET_DIR"/fineweb_train_*.bin 2>/dev/null | wc -l)
    val_count=$(ls "$DATASET_DIR"/fineweb_val_*.bin 2>/dev/null | wc -l)
    if [[ "$shard_count" -gt 0 && "$val_count" -gt 0 ]]; then
      echo "[upload] outputs ready: tokenizer + $shard_count train shards + $val_count val shards" | tee -a "$LOG"
      break
    fi
  fi
  echo "[upload] outputs not yet visible (try $i/20), sleeping 15s" | tee -a "$LOG"
  sleep 15
done

if [[ ! -s "$TOK_MODEL" || ! -d "$DATASET_DIR" ]]; then
echo "[upload] FATAL: outputs not present after build exit. Check sp10240_build.log." | tee -a "$LOG"
exit 1
fi

echo "[upload] creating repo $REPO (public, dataset)" | tee -a "$LOG"
"$HF" repo create "$REPO" --repo-type dataset 2>&1 | tee -a "$LOG" || \
echo "[upload] repo create returned nonzero (likely already exists), continuing" | tee -a "$LOG"

echo "[upload] uploading tokenizer files" | tee -a "$LOG"
"$HF" upload "$REPO" "$TOK_MODEL" "fineweb_10240_bpe.model" --repo-type dataset 2>&1 | tee -a "$LOG"
"$HF" upload "$REPO" "$TOK_VOCAB" "fineweb_10240_bpe.vocab" --repo-type dataset 2>&1 | tee -a "$LOG"

echo "[upload] uploading dataset shards from $DATASET_DIR (large folder)" | tee -a "$LOG"
"$HF" upload-large-folder "$REPO" "$DATASET_DIR" --repo-type dataset 2>&1 | tee -a "$LOG"

echo "[upload] DONE at $(date -Iseconds). Repo: https://huggingface.co/datasets/$REPO" | tee -a "$LOG"