# Non-record: PR #2135 stack + MP3 marker-pair fusion

Post-deadline note documenting that the MP3 marker-pair fusion from our prior PR #2109 composes with the May 1 frontier stack (PR #2135). Filed under `track_non_record_16mb` for reference only — not eligible for chronological standing.

## Numbers

3-seed results (seeds 42, 0, 314), phased TTT, on the canonical CaseOps SP8192 setup (val byte sum matches canonical within 0.015%; see the `submission.json` note):

| Seed | Train (s) | Pre-quant | Quant | **Post-TTT** | Eval (s) | Artifact (bytes) |
|-----:|----------:|----------:|------:|-------------:|---------:|-----------------:|
| 42 | 599.6 | 1.05887 | 1.06741 | **1.05577** | 473.6 | 15,946,755 |
| 0 | 599.6 | 1.05858 | 1.06706 | **1.05534** | 501.9 | 15,936,375 |
| 314 | 599.5 | 1.05862 | 1.06726 | **1.05564** | 460.1 | 15,939,487 |
| **Mean** | | **1.05869** | **1.06724** | **1.05559** (std 0.00018) | | max 15,946,755 |

For reference, PR #2135's reported 3-seed mean is 1.05651 (std 0.00036), and a matched-environment vanilla rerun in our setup produced 1.05669 (std 0.00041). Environment drift therefore shifts the baseline by only +0.00018, so the remaining ~0.0011 gap to our 1.05559 is attributable to the fusion lever rather than drift.

## What changed vs PR #2135

Single addition: the three 2-grams `[▁,TITLE]` / `[▁,ALLCAPS]` / `[▁,CAPNEXT]` are each fused into a single alias donor token (donors 8/9/10, byte-fallback IDs that occur 0× in the CaseOps corpus). The tokenizer is unchanged; the transform is a stream edit on already-tokenized shards. Each alias embedding row gets a norm-matched warm init `0.4·E[▁] + 0.6·E[marker]` once at training start. PR #2135's training/eval code path is otherwise identical (the `SmearGate` dampening branch is present in the patched `train_gpt.py` but hard-disabled via `alias_dampening_active=False`).
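
A minimal sketch of the warm init, assuming a standard PyTorch embedding table. The 0.4/0.6 weights are the ones quoted above (cf. `MARKER_PAIR_W_SPACE`/`MARKER_PAIR_W_TITLE` in the repro section); rescaling the blend to the mean parent-row norm is one plausible reading of "norm-matched", since the note does not pin down the reference norm:

```python
import torch

def warm_init_alias(emb: torch.nn.Embedding, space_id: int, marker_id: int,
                    donor_id: int, w_space: float = 0.4, w_marker: float = 0.6) -> None:
    """Warm-init one alias (donor) row as a norm-matched blend of its parents.

    Sketch under assumptions: blend = w_space*E[space] + w_marker*E[marker],
    then rescale so the donor row's norm equals the mean of the parent norms.
    """
    with torch.no_grad():
        W = emb.weight
        blend = w_space * W[space_id] + w_marker * W[marker_id]
        target_norm = 0.5 * (W[space_id].norm() + W[marker_id].norm())
        W[donor_id] = blend * (target_norm / blend.norm().clamp_min(1e-8))
```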

Patch size: +45 lines vs upstream PR #2135 `train_gpt.py`. Patched md5 `c90dba41e5ce9586871a05e94e6e7445`.

> Note on the log line `marker_pair:smear_dampening prev_alias_scale=0.5 num_pairs=3`: this is a config dump of `ALIAS_PREV_SMEAR_SCALE`'s default value (the env var is unused in this submission). The actual SmearGate forward path remains byte-identical to PR #2135 vanilla because `alias_dampening_active=False` is hard-coded in `train_model()` (line 3994). Verifiable by reading the patched source.

## Reproduction

Self-contained pipeline from the official FineWeb-10B doc stream (no third-party data mirrors):

```bash
# 1. Fetch the canonical FineWeb-10B doc stream (~45 GiB) from the official
# willdepueoai/parameter-golf HF dataset.
python3 download_docs.py

# 2. Apply CaseOps tokenization (canonical caseops_v1_reserved).
python3 prepare_caseops_data.py \
    --docs ./data/datasets/docs_selected.jsonl \
    --out ./data \
    --sp ./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

# 3. Apply MP3 marker-pair stream-edit on the already-tokenized shards.
python3 prepare_marker_pair_v3.py

# 4. 3-seed run (builds libonline_ngram_state.so on first invocation).
bash run_3seed.sh
```

`run_3seed.sh` adds only `MARKER_PAIR_MODE=1`, `MARKER_PAIR_W_SPACE=0.4`, `MARKER_PAIR_W_TITLE=0.6`, and `DATA_PATH=...marker_pair_v3` to PR #2135's repro env block; all other env vars are inherited from PR #2135 (`SAG_SCALE=0.5`, `NUM_PHASES=1`, `EVAL_SEQ_LEN=2560`, `GPTQ_CALIBRATION_BATCHES=32`, etc.).
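
For illustration, the same delta expressed in Python (the actual script is bash; `DATA_PATH`'s full value is elided in this note, so it stays commented out):

```python
import os

# The additions run_3seed.sh layers on top of PR #2135's repro env block.
os.environ.update({
    "MARKER_PAIR_MODE": "1",       # enable the MP3 alias path
    "MARKER_PAIR_W_SPACE": "0.4",  # warm-init weight on E[▁]
    "MARKER_PAIR_W_TITLE": "0.6",  # warm-init weight on the marker side
})
# os.environ["DATA_PATH"] = "...marker_pair_v3"  # full path elided in this note
```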

## Compliance

- C1 strict causal dependence: standard sliding-window scoring with cu_seqlens packed-doc handling; the BOS-fixed SmearGate from PR #2135 is inherited.
- C2 full normalized softmax over SP8192 vocab.
- C3 score-before-update: phased TTT (NUM_PHASES=1, prefix_docs=2500) inherited from PR #2135.
- C4 single left-to-right pass.
- No SLOT, no logit bias beyond the inherited PR #2135 token-only n-gram tilt, no pre-quant TTT on val data, no PPM mixture.
- Full validation set; val byte sum matches canonical caseops_v1_reserved within 0.015% (151,546,741 vs 151,568,749; see the check below).
- Artifact ≤ 16,000,000 bytes (max 15,946,755), train ≤ 600 s strict, eval ≤ 600 s.
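
The byte-sum tolerance in the list above is a one-line check against the counts quoted there:

```python
ours, canonical = 151_546_741, 151_568_749
rel = abs(ours - canonical) / canonical
print(f"{rel:.4%}")  # 0.0145%, inside the 0.015% tolerance
```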

---

The MP3 alias map:

```json
{
  "alias_map": {
    "marker_pair_space_title": 8,
    "marker_pair_space_allcaps": 9,
    "marker_pair_space_capnext": 10
  },
  "marker_pairs": [
    { "donor": 8,  "marker_id": 4, "marker_name": "TITLE" },
    { "donor": 9,  "marker_id": 5, "marker_name": "ALLCAPS" },
    { "donor": 10, "marker_id": 6, "marker_name": "CAPNEXT" }
  ],
  "note": "MP3: 3-marker fusion (TITLE + ALLCAPS + CAPNEXT). Word X preserved."
}
```
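
For concreteness, a minimal sketch of the stream edit this map drives, assuming shards are flat arrays of token IDs. `space_id` (the ID of ▁) and the shard I/O of `prepare_marker_pair_v3.py` are not given in this note, so both are stand-ins:

```python
import json
import numpy as np

def fuse_marker_pairs(tokens: np.ndarray, space_id: int, alias_map_path: str) -> np.ndarray:
    """Replace each (▁, marker) 2-gram with its single donor alias token.

    Sketch only: donor IDs are assumed absent from the input stream (they are
    unused byte-fallback IDs), so the edit is reversible.
    """
    with open(alias_map_path) as f:
        cfg = json.load(f)
    pair_to_donor = {(space_id, p["marker_id"]): p["donor"] for p in cfg["marker_pairs"]}

    out = np.empty_like(tokens)
    n = i = 0
    while i < len(tokens):
        key = (int(tokens[i]), int(tokens[i + 1])) if i + 1 < len(tokens) else None
        if key in pair_to_donor:
            out[n] = pair_to_donor[key]  # fused pair -> single alias token
            i += 2
        else:
            out[n] = tokens[i]
            i += 1
        n += 1
    return out[:n]
```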

---

`download_docs.py` (step 1 of the pipeline):
"""download_docs.py — fetch the canonical FineWeb-10B doc stream from Hugging Face.

Downloads ``docs_selected.jsonl`` (~45 GiB) and its sidecar manifest from the
``willdepueoai/parameter-golf`` HF dataset repo. The downloaded jsonl is the
input to ``prepare_caseops_data.py`` (step 1b in the pipeline; see README).

Usage::

python3 download_docs.py
# writes to ./data/datasets/{docs_selected.jsonl, docs_selected.source_manifest.json}
# override target with: BASE_DIR=/abs/path python3 download_docs.py

Why a small wrapper instead of ``data/download_hf_docs_and_tokenize.py`` from
the upstream parameter-golf repo: we only need the raw jsonl. The upstream
script also tokenizes with multiple vocab specs (sp1024 / sp4096 / sp8192 /
byte260) which adds ~10-20 minutes that prepare_caseops_data.py replaces.
"""

import os
import time

from huggingface_hub import hf_hub_download


REPO_ID = os.environ.get("HF_REPO_ID", "willdepueoai/parameter-golf")
# BASE_DIR is the parent of the ``datasets/`` directory created on disk.
BASE_DIR = os.environ.get("BASE_DIR", "./data")
FILES = [
"datasets/docs_selected.jsonl",
"datasets/docs_selected.source_manifest.json",
]


def main() -> None:
os.makedirs(BASE_DIR, exist_ok=True)
t0 = time.time()
for fn in FILES:
print("[" + time.strftime("%H:%M:%S") + "] downloading " + fn, flush=True)
p = hf_hub_download(
repo_id=REPO_ID,
filename=fn,
repo_type="dataset",
local_dir=BASE_DIR,
)
sz = os.path.getsize(p)
gib = round(sz / (1024 ** 3), 2)
print(" -> " + p, flush=True)
print(" size: " + str(sz) + " bytes (" + str(gib) + " GiB)", flush=True)
print("[" + time.strftime("%H:%M:%S") + "] done in " + str(round(time.time() - t0, 1)) + "s", flush=True)


if __name__ == "__main__":
main()