# Non-record: PR #2135 stack + MP3 marker-pair fusion

Post-deadline note documenting that the MP3 marker-pair fusion from our prior PR #2109 composes with the May 1 frontier stack (PR #2135). Filed under `track_non_record_16mb` for reference only — not eligible for chronological standing.

## Numbers

3-seed results (seeds 42, 0, 314), phased TTT, on the canonical CaseOps SP8192 setup (val byte sum matches canonical within 0.015%; see the `submission.json` note):

| Seed | Train (s) | Pre-quant | Quant | **Post-TTT** | Eval (s) | Artifact (bytes) |
|-----:|----------:|----------:|------:|-------------:|---------:|-----------------:|
| 42 | 599.6 | 1.05887 | 1.06741 | **1.05577** | 473.6 | 15,946,755 |
| 0 | 599.6 | 1.05858 | 1.06706 | **1.05534** | 501.9 | 15,936,375 |
| 314 | 599.5 | 1.05862 | 1.06726 | **1.05564** | 460.1 | 15,939,487 |
| **Mean** | | **1.05869** | **1.06724** | **1.05559** (std 0.00018) | | max 15,946,755 |

For reference, PR #2135's reported 3-seed mean is 1.05651 (std 0.00036), and a matched-environment vanilla rerun in our setup produced 1.05669 (std 0.00041). Environment drift therefore shifts the baseline by only +0.00018, so the remaining ~0.0011 gap to our 1.05559 is attributable to the fusion lever rather than drift.

## What changed vs PR #2135

Single addition: the three 2-grams `[▁,TITLE]` / `[▁,ALLCAPS]` / `[▁,CAPNEXT]` are each fused into a single alias donor token (donors 8/9/10, byte-fallback IDs that occur 0× in the CaseOps corpus). The tokenizer is unchanged; the transform is a stream edit on already-tokenized shards. Each alias embedding row gets a norm-matched warm init `0.4·E[▁] + 0.6·E[marker]` once at training start. PR #2135's training/eval code path is otherwise identical (the `SmearGate` dampening branch is present in the patched `train_gpt.py` but hard-disabled via `alias_dampening_active=False`).
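
A minimal sketch of the warm init, assuming a standard PyTorch embedding table. The 0.4/0.6 weights are the ones quoted above (cf. `MARKER_PAIR_W_SPACE`/`MARKER_PAIR_W_TITLE` in the repro section); rescaling the blend to the mean parent-row norm is one plausible reading of "norm-matched", since the note does not pin down the reference norm:

```python
import torch

def warm_init_alias(emb: torch.nn.Embedding, space_id: int, marker_id: int,
                    donor_id: int, w_space: float = 0.4, w_marker: float = 0.6) -> None:
    """Warm-init one alias (donor) row as a norm-matched blend of its parents.

    Sketch under assumptions: blend = w_space*E[space] + w_marker*E[marker],
    then rescale so the donor row's norm equals the mean of the parent norms.
    """
    with torch.no_grad():
        W = emb.weight
        blend = w_space * W[space_id] + w_marker * W[marker_id]
        target_norm = 0.5 * (W[space_id].norm() + W[marker_id].norm())
        W[donor_id] = blend * (target_norm / blend.norm().clamp_min(1e-8))
```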

Patch size: +45 lines vs upstream PR #2135 `train_gpt.py`. Patched md5 `c90dba41e5ce9586871a05e94e6e7445`.

> Note on the log line `marker_pair:smear_dampening prev_alias_scale=0.5 num_pairs=3`: this is a config dump of `ALIAS_PREV_SMEAR_SCALE`'s default value (the env var is unused in this submission). The actual SmearGate forward path remains byte-identical to PR #2135 vanilla because `alias_dampening_active=False` is hard-coded in `train_model()` (line 3994). Verifiable by reading the patched source.

## Reproduction

Self-contained pipeline from the official FineWeb-10B doc stream (no third-party data mirrors):

```bash
# 1. Fetch the canonical FineWeb-10B doc stream (~45 GiB) from the official
# willdepueoai/parameter-golf HF dataset.
python3 download_docs.py

# 2. Apply CaseOps tokenization (canonical caseops_v1_reserved).
python3 prepare_caseops_data.py \
    --docs ./data/datasets/docs_selected.jsonl \
    --out ./data \
    --sp ./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

# 3. Apply MP3 marker-pair stream-edit on the already-tokenized shards.
python3 prepare_marker_pair_v3.py

# 4. 3-seed run (builds libonline_ngram_state.so on first invocation).
bash run_3seed.sh
```

`run_3seed.sh` adds only `MARKER_PAIR_MODE=1`, `MARKER_PAIR_W_SPACE=0.4`, `MARKER_PAIR_W_TITLE=0.6`, and `DATA_PATH=...marker_pair_v3` to PR #2135's repro env block; all other env vars are inherited from PR #2135 (`SAG_SCALE=0.5`, `NUM_PHASES=1`, `EVAL_SEQ_LEN=2560`, `GPTQ_CALIBRATION_BATCHES=32`, etc.).
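
For illustration, the same delta expressed in Python (the actual script is bash; `DATA_PATH`'s full value is elided in this note, so it stays commented out):

```python
import os

# The additions run_3seed.sh layers on top of PR #2135's repro env block.
os.environ.update({
    "MARKER_PAIR_MODE": "1",       # enable the MP3 alias path
    "MARKER_PAIR_W_SPACE": "0.4",  # warm-init weight on E[▁]
    "MARKER_PAIR_W_TITLE": "0.6",  # warm-init weight on the marker side
})
# os.environ["DATA_PATH"] = "...marker_pair_v3"  # full path elided in this note
```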

## Compliance

- C1 strict causal dependence: standard sliding-window scoring with cu_seqlens packed-doc handling; the BOS-fixed SmearGate from PR #2135 is inherited.
- C2 full normalized softmax over SP8192 vocab.
- C3 score-before-update: phased TTT (NUM_PHASES=1, prefix_docs=2500) inherited from PR #2135.
- C4 single left-to-right pass.
- No SLOT, no logit bias beyond the inherited PR #2135 token-only n-gram tilt, no pre-quant TTT on val data, no PPM mixture.
- Full validation set; val byte sum matches canonical caseops_v1_reserved within 0.015% (151,546,741 vs 151,568,749; see the check below).
- Artifact ≤ 16,000,000 bytes (max 15,946,755), train ≤ 600 s strict, eval ≤ 600 s.
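
The byte-sum tolerance in the list above is a one-line check against the counts quoted there:

```python
ours, canonical = 151_546_741, 151_568_749
rel = abs(ours - canonical) / canonical
print(f"{rel:.4%}")  # 0.0145%, inside the 0.015% tolerance
```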

---

The MP3 alias map:

```json
{
  "alias_map": {
    "marker_pair_space_title": 8,
    "marker_pair_space_allcaps": 9,
    "marker_pair_space_capnext": 10
  },
  "marker_pairs": [
    { "donor": 8,  "marker_id": 4, "marker_name": "TITLE" },
    { "donor": 9,  "marker_id": 5, "marker_name": "ALLCAPS" },
    { "donor": 10, "marker_id": 6, "marker_name": "CAPNEXT" }
  ],
  "note": "MP3: 3-marker fusion (TITLE + ALLCAPS + CAPNEXT). Word X preserved."
}
```
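
For concreteness, a minimal sketch of the stream edit this map drives, assuming shards are flat arrays of token IDs. `space_id` (the ID of ▁) and the shard I/O of `prepare_marker_pair_v3.py` are not given in this note, so both are stand-ins:

```python
import json
import numpy as np

def fuse_marker_pairs(tokens: np.ndarray, space_id: int, alias_map_path: str) -> np.ndarray:
    """Replace each (▁, marker) 2-gram with its single donor alias token.

    Sketch only: donor IDs are assumed absent from the input stream (they are
    unused byte-fallback IDs), so the edit is reversible.
    """
    with open(alias_map_path) as f:
        cfg = json.load(f)
    pair_to_donor = {(space_id, p["marker_id"]): p["donor"] for p in cfg["marker_pairs"]}

    out = np.empty_like(tokens)
    n = i = 0
    while i < len(tokens):
        key = (int(tokens[i]), int(tokens[i + 1])) if i + 1 < len(tokens) else None
        if key in pair_to_donor:
            out[n] = pair_to_donor[key]  # fused pair -> single alias token
            i += 2
        else:
            out[n] = tokens[i]
            i += 1
        n += 1
    return out[:n]
```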

---

`download_docs.py` (step 1 of the pipeline):
"""download_docs.py — fetch the canonical FineWeb-10B doc stream from Hugging Face.

Downloads ``docs_selected.jsonl`` (~45 GiB) and its sidecar manifest from the
``willdepueoai/parameter-golf`` HF dataset repo. The downloaded jsonl is the
input to ``prepare_caseops_data.py`` (step 1b in the pipeline; see README).

Usage::

python3 download_docs.py
# writes to ./data/datasets/{docs_selected.jsonl, docs_selected.source_manifest.json}
# override target with: BASE_DIR=/abs/path python3 download_docs.py

Why a small wrapper instead of ``data/download_hf_docs_and_tokenize.py`` from
the upstream parameter-golf repo: we only need the raw jsonl. The upstream
script also tokenizes with multiple vocab specs (sp1024 / sp4096 / sp8192 /
byte260) which adds ~10-20 minutes that prepare_caseops_data.py replaces.
"""

import os
import time

from huggingface_hub import hf_hub_download


REPO_ID = os.environ.get("HF_REPO_ID", "willdepueoai/parameter-golf")
# BASE_DIR is the parent of the ``datasets/`` directory created on disk.
BASE_DIR = os.environ.get("BASE_DIR", "./data")
FILES = [
"datasets/docs_selected.jsonl",
"datasets/docs_selected.source_manifest.json",
]


def main() -> None:
os.makedirs(BASE_DIR, exist_ok=True)
t0 = time.time()
for fn in FILES:
print("[" + time.strftime("%H:%M:%S") + "] downloading " + fn, flush=True)
p = hf_hub_download(
repo_id=REPO_ID,
filename=fn,
repo_type="dataset",
local_dir=BASE_DIR,
)
sz = os.path.getsize(p)
gib = round(sz / (1024 ** 3), 2)
print(" -> " + p, flush=True)
print(" size: " + str(sz) + " bytes (" + str(gib) + " GiB)", flush=True)
print("[" + time.strftime("%H:%M:%S") + "] done in " + str(round(time.time() - t0, 1)) + "s", flush=True)


if __name__ == "__main__":
main()