docs/plans/2026-05-01-semanticengine-implementation.md (965 additions; large diff not rendered)

docs/plans/2026-05-01-semanticengine-submission-design.md (190 additions):

# SemanticEngine Submission Design

**Date:** 2026-05-01
**Track:** `track_10min_16mb`
**Submission folder:** `records/track_10min_16mb/2026-05-01_SemanticEngine_CareSSM/`

---

## 1. System Overview

The submission presents **SemanticEngine** — a CareSSM trunk with live episodic memory. Unlike every other top submission (transformer-based), this is a pure SSM architecture whose memory substrate is active during both training and prequential eval.

### Named Components

| Name | Role | Code location |
|---|---|---|
| **SemanticEngine** | Overall system | this submission |
| **CareSSM** | SSM trunk blocks | `chaoscontrol.core`, `chaoscontrol.model` |
| **ChaosSsm** | CPU SSM controller (nice-to-have rename from `CpuSsmController*`) | `chaoscontrol.episodic.cpu_ssm_controller` |
| **Episodic memory** | CRCT evidence substrate + MultiSlotOuterModel + replay eviction pipeline | `chaoscontrol.memory`, `chaoscontrol.replay_eviction` |
| **SemanticOptimizer** | Muon with SSM-channel-coupled momentum β | `chaoscontrol.optim.muon` (via `log_a_beta_coupling=True`) |

**Note on episodic memory:** The live memory substrate (CRCT + MultiSlotOuterModel + streaming maintenance) is architecturally compatible with any Mamba-style SSM. CareSSM is built with it in mind, not the other way around.

**Note on SemanticOptimizer:** The concept (per-channel momentum β coupled to each channel's `log_a` decay so optimizer time constants match recurrence time constants) is implemented as the `log_a_beta_coupling` extension in the `Muon` class. The standalone `SemanticOptimizer` class in `optim/semantic.py` is the fuller future version. The submission uses `Muon(log_a_beta_coupling=True)`.
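
A hedged sketch of the coupling idea (flag names are from §4; the exact mapping from `log_a` to β is our reading of those flags, not the `Muon` source):

```
import torch

def coupled_beta(log_a: torch.Tensor, beta_prev: torch.Tensor,
                 beta_min: float = 0.5, ema: float = 0.99) -> torch.Tensor:
    """Per-channel momentum beta matched to SSM recurrence decay.

    A channel with decay a = exp(log_a) and an EMA with coefficient
    beta share a time constant when beta = a, so the decay is reused
    directly as the momentum, floored at log_a_beta_min and smoothed
    across steps with log_a_beta_ema.
    """
    beta_target = torch.exp(log_a).clamp(min=beta_min, max=0.999)
    return ema * beta_prev + (1.0 - ema) * beta_target
```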

---

## 2. File Structure

### Submission folder (`records/track_10min_16mb/2026-05-01_SemanticEngine_CareSSM/`)

```
train_gpt.py # ~700-900 lines, orchestrating driver (see §4)
requirements.txt # chaoscontrol @ git+..., torch, sentencepiece, etc.
submission.json # filled after run
README.md # filled after run
train_seed<N>.log # filled after run (3 seeds)
tokenizers/
fineweb_16384_bpe.model # SP16384 tokenizer, shipped in submission folder
```

### New chaoscontrol module (`src/chaoscontrol/public/`)

```
src/chaoscontrol/public/
__init__.py
engine_entry.py # init_arm_topology(), run_training(), build_artifact(), run_eval()
```

The name `public/` is deliberate: it signals that this is the stable, public-facing interface, not internal experiment scaffolding.

All heavy machinery (distributed loop, CRCT, replay eviction topology, GPTQ, prequential eval) stays in existing chaoscontrol modules. `engine_entry.py` (~150–200 lines) connects them under a stable interface that `train_gpt.py` calls.

---

## 3. Data and Dependencies

### Data

- **Tokenizer:** SP16384 (`fineweb_16384_bpe.model`, 455 KB), shipped inside the submission folder
- **Train/val shards:** `Natooka/parameter-golf-sp-tokenizers` on HuggingFace — 133 train shards (~25 GB) + 1 val shard (~84 MB, 42,266,034 tokens)
- **ValCache:** Pre-built from the first 50,000 validation documents; used by the prequential eval. Built via `scripts/build_exp20_val_cache.py` on pod setup.

### Native extensions (must be built before running)

| Extension | Purpose |
|---|---|
| `_lm_head_loss` | Fused chunked LM head backward (8× VRAM reduction at V=16384; the chunking idea is sketched after this table) |
| `_cpu_ssm_controller` | ChaosSsm CPU controller (C++ with optional CUDA write-event pack) |
| `_ssm_scan` | Chunked parallel SSM scan CUDA kernel |

Built via `scripts/pod_build_native_extensions.sh`. The full pod setup (CUDA 13 + TE 2.13 + extensions + data) runs via `scripts/pod_bootstrap.sh`.
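
To make the `_lm_head_loss` saving concrete, here is a minimal Python sketch of the chunking idea (the real extension is a fused C++/CUDA kernel with a custom backward; this is illustration only):

```
import torch
import torch.nn.functional as F

def chunked_lm_head_loss(hidden: torch.Tensor, weight: torch.Tensor,
                         targets: torch.Tensor, chunk: int = 4096) -> torch.Tensor:
    # hidden: [N, D] final hidden states; weight: [V, D] tied LM head;
    # targets: [N]. Logits are materialized one chunk at a time, so peak
    # activation memory is [chunk, V] instead of [N, V].
    total = hidden.new_zeros(())
    for i in range(0, targets.numel(), chunk):
        logits = hidden[i:i + chunk] @ weight.t()
        total = total + F.cross_entropy(logits, targets[i:i + chunk],
                                        reduction="sum")
    return total / targets.numel()
```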

### Requirements

- PyTorch 2.11.0+cu130 (CUDA 13)
- TransformerEngine 2.13.0
- `chaoscontrol @ git+https://github.com/KenMalloy/chaoscontrol.git`
- `sentencepiece`, `huggingface-hub`, `numpy`

No network calls inside `train_gpt.py` during training or eval. The `chaoscontrol` package is pip-installed before the script runs.

---

## 4. `train_gpt.py` Internal Structure

Entry point: `torchrun --standalone --nproc_per_node=8 train_gpt.py`.
All configuration comes from env vars, matching the interface of every other submission.

### Section 1 — Hyperparameters (heavily commented, ~100 lines)

An env-var-configurable class. Comments explain the architectural motivation for each setting, not just the value. A minimal sketch of the pattern follows the list below.

Key groups:
- **Paths:** `DATA_PATH`, `VAL_CACHE_DIR`, `TOKENIZER_PATH`
- **Model:** `model_dim=384` (artifact-safe at int6/LZMA; next size up at 416 is 15.19 MB, 448 exceeds budget), `ssm_delta_rank=32`
- **CRCT:** `crct_memory_write_tokens_per_step=32`, `crct_target_read_rate=0.25`, `crct_target_write_rate=0.10`, `outer_max_slots=4096`, and the full locked CRCT config from `exp26._crct_lock()`
- **Replay eviction:** `replay_eviction_memory_streams=8`, `replay_eviction_commit_policy="learned"`, and the full pipeline config from `exp26._replay_eviction_pipeline_lock()`
- **Fast/slow:** `fast_slow_alpha=0.25`, `fast_slow_eval_copy="slow"`, controller settings from `exp26._fast_slow_lock()`
- **Training:** `BUDGET_SECONDS=600`, `WARMUP_STEPS=20`, warmdown schedule, `GRAD_CLIP_NORM`
- **Optimizer:** SemanticOptimizer flags — `log_a_beta_coupling=True`, `log_a_beta_ema=0.99`, `log_a_beta_min=0.5`; Muon for matrix params, AdamW fallback for embeddings/scalars
- **Quantization:** GPTQ int6 for matrices, int7 for tied embeddings
- **Eval:** `CHUNK_TOKENS`, `WRITE_TOKENS_PER_CHUNK`, `DECAY` for `packet_online_cache`
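
A minimal sketch of the env-var pattern (illustrative names and defaults; the real block carries all of the groups above):

```
import os
from dataclasses import dataclass

def _env(name, default, cast):
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

@dataclass
class Hyperparameters:
    # Paths
    data_path: str = _env("DATA_PATH", "data/fineweb", str)
    # Training budget: wallclock, checked at every step boundary
    budget_seconds: float = _env("BUDGET_SECONDS", 600.0, float)
    warmup_steps: int = _env("WARMUP_STEPS", 20, int)
    # Model: dim 384 is the largest artifact-safe size at int6/LZMA
    model_dim: int = 384
```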

### Section 2 — `main()` (heavily commented, ~600-800 lines)

Comments in this section explain the training/eval distinction clearly for reviewers:

> During training, the trunk updates weights and the memory/controller stack generates evidence and maintains the cache. During eval, the same memory substrate is live but the run is prequential: score each chunk under the current state first, accumulate loss, then optionally update from those already-scored tokens. The trunk never sees validation tokens before they are scored.

**Dist init + role routing** (~25 lines)
Calls `chaoscontrol.public.engine_entry.init_arm_topology(world_size)`. On 8 GPUs: GPU 0–5 are train ranks, GPU 6 is the packet-serving rank, GPU 7 is the maintenance rank. On 4 GPUs: GPU 3 shares both memory roles. Role routing is encapsulated here because it can't be described readably inline.
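
A sketch of that routing (the authoritative version is `init_arm_topology`; `RoleInfo` is simplified here and omits the NCCL group handles it actually returns):

```
from dataclasses import dataclass

@dataclass
class RoleInfo:
    role: str            # "train" | "packet_serving" | "maintenance"
    train_ranks: list

def route_role(rank: int, world_size: int) -> RoleInfo:
    if world_size == 8:
        # GPUs 0-5 train, GPU 6 serves packets, GPU 7 runs maintenance.
        role = {6: "packet_serving", 7: "maintenance"}.get(rank, "train")
        return RoleInfo(role, train_ranks=list(range(6)))
    if world_size == 4:
        # GPU 3 shares both memory roles.
        role = "packet_serving+maintenance" if rank == 3 else "train"
        return RoleInfo(role, train_ranks=list(range(3)))
    raise ValueError(f"unsupported world_size={world_size}")
```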

**Data + ValCache load** (~30 lines)
Shards from `DATA_PATH`. ValCache from `VAL_CACHE_DIR` (pre-built, not constructed at runtime).

**Model + optimizer** (~60 lines)
Build `ChaosControlConfig` from the hyperparameter block. Instantiate `CareStudentLM`. Construct the SemanticOptimizer (Muon with `log_a_beta_coupling=True`) on matrix params; AdamW on embeddings and scalars.

**Training loop** (~200 lines)
```
t_start = time.perf_counter()
step = 0
while True:
    # wallclock check first; exit only at a complete step boundary
    if time.perf_counter() - t_start >= BUDGET_SECONDS:
        break
    step += 1
    # forward, loss, backward, optimizer step, fast/slow consolidation
    if step % 100 == 0:
        log(step, loss, tokens_per_sec, elapsed_s)
```

Wallclock check is the first thing in each iteration. When it fires, the loop exits at a complete-step boundary — no partial state enters the artifact. Log message: `"training stopped at step N (wallclock), artifact built from step N state"`.

**Artifact build** (~80 lines)
Calls `chaoscontrol.artifact.serialize_artifact(model, ...)`. GPTQ int6 + int7 embed + LZMA compression. Logs `code_bytes`, `model_bytes`, `total_bytes` explicitly.

**Prequential eval** (~100 lines)
Loads the serialized artifact. Calls `evaluate_with_calc_types(model, val_cache, calc_types=["packet_online_cache"], config=eval_config)`. The `packet_online_cache` calc type enforces score-before-write at the Python level (raises `RuntimeError` if the cache slot count changes between cue read and score accumulation). Iterates all 50,000 validation documents. Returns `val_bpb`, `val_loss`.

**Score summary** (~20 lines)
Rank 0 prints a parseable summary: `val_bpb`, `val_loss`, `artifact_bytes`, `train_steps`, `train_time_s`, `eval_time_s`.
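
An illustrative shape for that summary line (the exact format string is an assumption):

```
if rank == 0:
    print(
        f"val_bpb={val_bpb:.6f} val_loss={val_loss:.6f} "
        f"artifact_bytes={artifact_bytes} train_steps={train_steps} "
        f"train_time_s={train_time_s:.1f} eval_time_s={eval_time_s:.1f}"
    )
```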

---

## 5. New Chaoscontrol Code — `public/engine_entry.py`

~150–200 lines. Three functions:

**`init_arm_topology(world_size) -> RoleInfo`**
GPU role assignment. Returns the local process's role (train / packet-serving / maintenance) and associated NCCL group handles. Single source of truth for the 6+2 topology.

**`run_training(model, optimizer, data, config) -> TrainingResult`**
Thin wrapper over the existing training loop in `chaoscontrol.training`. Returns `steps`, `elapsed_s`, `final_loss`. Called by `train_gpt.py` after model/optimizer construction.

**`run_eval(artifact_path, val_cache, config) -> EvalResult`**
Loads artifact, calls `evaluate_with_calc_types` with `packet_online_cache`. Returns `bpb`, `loss`, `docs_scored`, `elapsed_s`.
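
The result shapes, sketched with only the fields named above (any further fields are internal and not assumed here):

```
from dataclasses import dataclass

@dataclass
class TrainingResult:
    steps: int
    elapsed_s: float
    final_loss: float

@dataclass
class EvalResult:
    bpb: float
    loss: float
    docs_scored: int
    elapsed_s: float
```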

---

## 6. Training / Eval Distinction

The prequential eval contract, stated explicitly for reviewers (a schematic of the per-chunk loop follows the list):

- **Score first:** Each chunk is scored under the model's current memory state. Loss is accumulated before the cache is updated.
- **Write after:** The just-scored hidden states and token NLLs are committed to the episodic cache only after scoring. Future chunks may read them.
- **Trunk weights frozen:** The trunk does not update during eval. Only the episodic cache grows.
- **Enforced:** `packet_online_cache.py` checks `_outer_slot_count(model)` before and after scoring each chunk; a count change before score accumulation raises immediately.
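
A schematic of the per-chunk contract (helper names other than `_outer_slot_count` are illustrative):

```
total_nll = 0.0
for chunk in val_chunks:
    slots_before = _outer_slot_count(model)
    nll = score_chunk(model, chunk)       # score under current memory state
    if _outer_slot_count(model) != slots_before:
        raise RuntimeError("cache grew before score accumulation")
    total_nll += nll                      # accumulate loss first...
    write_evidence(model, chunk)          # ...then commit evidence to the cache
```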

---

## 7. Implementation Plan

The following tasks, in order, produce a runnable `train_gpt.py` and a score:

1. Create `src/chaoscontrol/public/__init__.py` and `engine_entry.py` with the three functions
2. Write `records/track_10min_16mb/2026-05-01_SemanticEngine_CareSSM/train_gpt.py`
3. Write `requirements.txt`
4. Spin up 8xH100 pod, run `scripts/pod_bootstrap.sh`
5. Run `torchrun --standalone --nproc_per_node=8 train_gpt.py` for seed 42
6. Capture log, verify `val_bpb` in output
7. Repeat for seeds 1337 and 1234 (3-seed mean)
8. Fill `submission.json` and `README.md`

---

## 8. Open Items

- **ChaosSsm rename:** Nice-to-have. `CpuSsmController*` classes can be aliased or renamed in `chaoscontrol/public/` without touching internal code. Not blocking implementation.
- **ScOpt:** Not used in this submission. `ScarcityAwareOptimizer` is the parent concept that birthed `SemanticOptimizer`; noted for future work.
- **Folder name:** `2026-05-01_SemanticEngine_CareSSM` — the date may shift if the actual run slips past May 1.

---

README.md (70 additions):

# SemanticEngine — CareSSM + Live Episodic Memory

**Track:** track_10min_16mb
**val_bpb:** 1.642868 (3-seed mean, std 0.023340)
**artifact:** 13,554,222 / 16,000,000 bytes (estimated contest-counted int6/LZMA payload, including 500 KB overhead)
**eval:** full 50k FineWeb validation docs, legal prequential packet-online cache

The raw bf16 runtime weight mirror is 44,600,064 bytes. That is not the submitted
artifact size; the submitted artifact uses the same int6/LZMA artifact accounting
used by the dim-384 headroom check.
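
For scale (our arithmetic, not a figure from the run logs): 44,600,064 bf16 bytes is 22,300,032 parameters; at roughly 6 bits per weight (the tied embeddings are int7) the pre-LZMA payload is about 16.7 MB, which compression plus the ~500 KB overhead brings down to the 13,554,222-byte counted estimate.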

## Architecture

**SemanticEngine** is a CareSSM trunk with live episodic memory. Unlike the transformer submissions, this is a pure SSM architecture whose memory substrate is active during both training and prequential eval.

### Named Components

| Name | Role |
|---|---|
| **SemanticEngine** | Overall system |
| **CareSSM** | Diagonal recurrent SSM trunk blocks |
| **ChaosSsm** | CPU SSM controller / scheduling plane |
| **Episodic memory** | CRCT evidence substrate + MultiSlotOuterModel + replay eviction pipeline |
| **SemanticOptimizer** | Muon with SSM-channel-coupled momentum beta |

### Dedicated Memory GPUs (8xH100)

On 8xH100, GPU 6 and GPU 7 are not train ranks. They own the memory substrate exclusively:

- **GPU 6 (packet-serving rank):** Builds low-latency episodic residual packets from the pre-recurrence stream and publishes them to train ranks without blocking the trunk step.
- **GPU 7 (maintenance rank):** Owns memory maintenance, slot refresh, and slot commits.

Train ranks never wait on a memory GPU. If no fresh packet is available, the trunk proceeds with a zero-residual failsafe.
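
A hedged sketch of that failsafe (names are illustrative; the real path is the packet-serving plumbing inside chaoscontrol):

```python
import queue
import torch

def fetch_packet_or_zero(packets: queue.Queue, shape, device) -> torch.Tensor:
    """Non-blocking read of an episodic residual packet."""
    try:
        return packets.get_nowait()       # fresh packet from the memory rank
    except queue.Empty:
        # Failsafe: zero residual, so the trunk step never blocks.
        return torch.zeros(shape, device=device)
```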

### Training vs. Eval

During training, the trunk updates weights while the memory/controller stack generates evidence and maintains the cache.

During eval, the same memory substrate is live, but the run is **prequential**: each chunk is scored under the current memory state first, loss is accumulated, then the cache is updated from the just-scored tokens. The trunk never sees validation tokens before they are scored. The packet-online eval path raises if cache slot count changes before score accumulation.

## Results

| Seed | val_loss | val_bpb | Train steps | Train time | Eval time | Cache slots |
|---|---:|---:|---:|---:|---:|---:|
| 42 | 4.070076 | 1.640762 | 1692 | 596.0s | 347.0s | 93,346 -> 139,998 |
| 1337 | 4.135631 | 1.667189 | 1692 | 594.1s | 349.5s | 89,776 -> 136,428 |
| 294924 | 4.020193 | 1.620653 | 1688 | 594.3s | 364.8s | 93,091 -> 139,743 |
| **Mean** | **4.075300** | **1.642868** | **1690.7** | **594.8s** | **353.8s** | |

All evals scored the full 50,000-doc validation set: 42,216,034 scored tokens and 151,080,645 raw bytes per seed. Each eval performed 3,348 episodic reads and 3,348 score-first episodic writes.

Artifact accounting: the public `artifact_bytes_estimate` is the contest-counted
compressed artifact estimate, `13,554,222` bytes against the decimal `16,000,000`
byte cap. The larger `raw_bf16_weight_bytes` value in `submission.json` is only
the uncompressed runtime state size used by the shared-memory weight mirror.

## Reproduction

```bash
# 1. Clone chaoscontrol and bootstrap the pod
git clone https://github.com/KenMalloy/chaoscontrol.git /workspace/chaoscontrol
HF_TOKEN=<token> bash /workspace/chaoscontrol/scripts/pod_bootstrap.sh

# 2. Run one seed
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py

# 3. Eval-only from a saved checkpoint
EVAL_ONLY=1 CHECKPOINT_PATH=/path/to/checkpoint.pt \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

---

requirements.txt (24 additions):

# Core — exact versions used on the submission pod
torch==2.11.0
sentencepiece>=0.2.0
numpy>=1.24
huggingface-hub>=0.22

# SemanticEngine / ChaosControl library. Pinned to the public commit that adds
# the batched packet-online eval path and submission-facing engine entrypoint.
chaoscontrol @ git+https://github.com/KenMalloy/chaoscontrol.git@e7da6b53bb5be4020a5c3ab043c12c6695d12065

# TransformerEngine (CUDA 13 build). Must be installed before building native extensions.
# transformer_engine[pytorch]==2.13.0
# Install with:
# pip install transformer_engine[pytorch]==2.13.0 \
# --extra-index-url https://pypi.nvidia.com \
# --only-binary=:all: \
# nvidia-cublas==13.4.0.1
#
# See chaoscontrol/scripts/pod_setup_cuda13.sh for the full idempotent install.

# Native extensions — built from the chaoscontrol repo, not pip-installed.
# After cloning to CHAOSCONTROL_ROOT, run:
# bash scripts/pod_build_native_extensions.sh
# Extensions: _lm_head_loss, _cpu_ssm_controller, _ssm_scan

---

submission.json (72 additions):

{
"track": "track_10min_16mb",
"submission_name": "SemanticEngine_CareSSM",
"name": "SemanticEngine CareSSM + Live Episodic Memory",
"blurb": "Pure SSM trunk with a live episodic memory substrate active during training and legal prequential eval. On 8xH100, ranks 0-5 train the CareSSM trunk, rank 6 serves low-latency episodic residual packets, and rank 7 runs memory maintenance. Eval scores each chunk before committing its evidence to the cache for future chunks.",
"date": "2026-05-01",
"val_loss": 4.07530019,
"val_bpb": 1.64286828,
"val_loss_std": 0.05789620,
"val_bpb_std": 0.02333959,
"seeds": [42, 1337, 294924],
"seed_results": {
"42": {
"val_loss": 4.07007627,
"val_bpb": 1.64076237,
"steps": 1692,
"train_time_s": 595.97,
"eval_time_s": 347.0,
"docs_scored": 50000,
"tokens_scored": 42216034,
"episodic_reads": 3348,
"episodic_writes": 3348,
"slot_count_initial": 93346,
"slot_count_final": 139998
},
"1337": {
"val_loss": 4.13563133,
"val_bpb": 1.66718946,
"steps": 1692,
"train_time_s": 594.15,
"eval_time_s": 349.5,
"docs_scored": 50000,
"tokens_scored": 42216034,
"episodic_reads": 3348,
"episodic_writes": 3348,
"slot_count_initial": 89776,
"slot_count_final": 136428
},
"294924": {
"val_loss": 4.02019298,
"val_bpb": 1.62065301,
"steps": 1688,
"train_time_s": 594.27,
"eval_time_s": 364.8,
"docs_scored": 50000,
"tokens_scored": 42216034,
"episodic_reads": 3348,
"episodic_writes": 3348,
"slot_count_initial": 93091,
"slot_count_final": 139743
}
},
"train_steps_mean": 1690.67,
"train_time_s_mean": 594.78,
"eval_time_s_mean": 353.77,
"artifact_bytes_estimate": 13554222,
"artifact_bytes_limit": 16000000,
"artifact_margin_bytes_estimate": 2445778,
"raw_bf16_weight_bytes": 44600064,
"artifact_accounting_note": "artifact_bytes_estimate is the contest-counted int6/LZMA compressed weight estimate plus 500KB overhead. raw_bf16_weight_bytes is the uncompressed runtime weight mirror and is not the submitted artifact.",
"artifact_submit_valid": true,
"hardware": "8xH100 80GB",
"eval_method": "packet_online_cache_prequential_full_50k",
"compliance": {
"three_seeds": true,
"training_under_600s": true,
"eval_under_600s": true,
"score_before_write": true,
"full_50k_validation_docs": true,
"validation_tokens_scored_before_memory_update": true
}
}