3 changes: 3 additions & 0 deletions .gitignore
@@ -12,6 +12,9 @@ wheels/

data/
outputs/
training-outputs/

notebooks/.ipynb_checkpoints

#logs
wandb
43 changes: 33 additions & 10 deletions README.md
@@ -51,10 +51,17 @@ firebot/
│ ├── train_rl_agent.py # unified PPO/A2C/DQN trainer with checkpoint and final evaluation artifacts
│ └── evaluate_agents.py # evaluator classes for PPO/A2C/DQN plus greedy/random baselines
├── scripts/
│ ├── run_benchmark_train.sh # bash script for smoke validation then full 5-seed benchmark training
│ ├── run_benchmark_train.sh # legacy all-in-one bash runner (staged scripts are canonical)
│ ├── run_benchmark_train.ps1 # powershell equivalent
│ ├── run_benchmark_eval.sh # bash script for post-training benchmark evaluation by seed
│ └── run_benchmark_eval.ps1 # powershell equivalent
│ └── stages/
│ ├── _common.sh # shared defaults, validation, and helper functions for staged runs
│ ├── 01_karpathy_overfit.sh # one-record overfit checks
│ ├── 02_smoke_train_and_repro.sh # smoke training + reproducibility canary
│ ├── 03_smoke_eval.sh # smoke checkpoint artifact loading + sanity eval
│ ├── 04_pilot_sweep.sh # one-seed validation-only pilot hyperparameter sweep
│ └── 05_final_train.sh # canonical full 5-seed training with frozen protocol
├── tests/
│ ├── conftest.py
│ └── models/ # environment and benchmark metric contract tests
@@ -176,12 +183,16 @@ uv run python -m src.ingestion.static_dataset --fire-records path/to/fire_record

### Training

For controlled and reproducible benchmark training, use the script wrappers in `scripts/`.
For controlled and reproducible benchmark training, run the staged bash scripts in order.

Run from project root on macOS/Linux (bash):

```bash
./scripts/run_benchmark_train.sh
./scripts/stages/01_karpathy_overfit.sh
./scripts/stages/02_smoke_train_and_repro.sh
./scripts/stages/03_smoke_eval.sh
./scripts/stages/04_pilot_sweep.sh
./scripts/stages/05_final_train.sh
```

Run from project root on Windows (PowerShell):
@@ -190,23 +201,35 @@ Run from project root on Windows (PowerShell):
./scripts/run_benchmark_train.ps1
```

Script runs:
Staged bash flow:

- Stage 1 (smoke): runs short validation training for `ppo`, `a2c`, `dqn` on one seed
- Stage 2 (smoke eval): loads smoke `best_model.zip` artifacts and runs evaluator sanity checks
- Stage 3 (formal): runs full canonical training for all three algorithms across 5 seeds (`11,22,33,44,55`)
- Uses artifact root `outputs/benchmark/` and keeps default trainer settings for env count, timesteps, and checkpoint cadence on formal runs
- Stage 1: Karpathy one-record overfit checks (`ppo`, `a2c`, `dqn`)
- Stage 2: smoke training + reproducibility canary
- Stage 3: smoke evaluation sanity check
- Stage 4: validation-only pilot sweeps and winner selection
- Stage 5: full canonical 5-seed training using frozen protocol values

Training script environment overrides (optional):
Shared staged-script environment overrides (optional):

- `ARTIFACT_ROOT` (default `outputs/benchmark`)
- `SMOKE_TIMESTEPS` (default `20000`, one canonical checkpoint interval)
- `SMOKE_SEED` (default `11`)
- `SMOKE_EVAL_EPISODES` (default `5`)
- `RUN_REPRO_CANARY` (default `1`)
- `REPRO_CANARY_TOL` (default `1e-9`)
- `KARPATHY_TIMESTEPS` (default `10000`)
- `KARPATHY_SEED` (default `11`)
- `KARPATHY_FAMILY` (default `center,medium,A`)
- `KARPATHY_CHECKPOINT_EVAL_EPISODES` (default `1`)
- `PILOT_TIMESTEPS` (default `40000`)
- `PILOT_SEED` (default `11`)
- `RUN_KARPATHY_CHECK` (default `1`)
- `RUN_PILOT_SWEEP` (default `1`)
- `USE_PILOT_WINNERS` (default `1`)
- `FINAL_SEEDS_CSV` (default `11,22,33,44,55`)
- `ALGO_ORDER_CSV` (default `ppo,a2c,dqn`)
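
For example, a quicker local smoke pass could override a couple of these before invoking the stage script (the values below are illustrative, not recommended settings):

```bash
# Illustrative only: shorter smoke run with fewer eval episodes.
SMOKE_TIMESTEPS=5000 SMOKE_EVAL_EPISODES=2 ./scripts/stages/02_smoke_train_and_repro.sh
```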

After `run_benchmark_train` completes, run benchmark evaluation wrappers.
After Stage 5 completes, run benchmark evaluation wrappers.

Run from project root on macOS/Linux (bash):

1 change: 1 addition & 0 deletions docs/data-pipeline.md
@@ -221,6 +221,7 @@ Canonical integration note:
- ignition family and asset layout remain simulator-side controls
- seeded parameter records do not store explicit ignition/layout labels
- `ignition_seed` and `layout_seed` make those simulator-side initializations reproducible
- severity remains record-conditioned through `severity_bucket`; it is not an independently sampled family control variable

Stored audit fields:

3 changes: 2 additions & 1 deletion docs/envspec.md
@@ -70,6 +70,7 @@ Important boundary:
- ignition family and asset layout labels remain simulator-side controls
- seeded records do not store explicit ignition/layout labels
- seeds make simulator-side initialization reproducible
- severity is record-conditioned from `severity_bucket` and is not independently controlled by `scenario_families`

---

@@ -83,7 +84,7 @@ Record loading and benchmark setup:

Episode construction and parameter sampling:

- `reset`: samples scenario family + parameter record, resets budgets/cooldowns/state, places assets, ignites fire
- `reset`: samples ignition/layout family + parameter record, resets budgets/cooldowns/state, places assets, ignites fire
- `_sample_parameter_record`: seed-stable shuffled sampling over loaded records
- `_configure_initialization_rngs`: configures ignition/layout RNGs from record seeds

153 changes: 153 additions & 0 deletions docs/planning/harden-training.md
@@ -0,0 +1,153 @@
# Training Hardening Plan

This document defines a staged hardening workflow to complete before launching paper-facing benchmark runs.

It assumes one full run has already been executed; that run should now be treated as a baseline sanity pass, not as final benchmark evidence.

---

## Step 0: Dummy Run

Purpose: establish a baseline run with the current unclean pipeline and confirm the end-to-end system is operational.

Current baseline configuration:

- Data variant: original seeded datasets under `data/static/`
- Algorithms: `PPO`, `A2C`, `DQN`
- Final budget: `200,000` steps per algorithm per seed
- Seed protocol: canonical 5 seeds (`11,22,33,44,55`) for final runs
- Checkpoint cadence: every `20,000` steps
- Checkpoint model selection: highest `val.asset_survival_rate`, tie-breaker `val.mean_return`
- Final artifact for reporting: `best_model.zip` (not `last_model.zip`)
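
As a rough sketch, this selection rule is a lexicographic maximum over the two validation metrics. The snippet below assumes `checkpoint_metrics.json` is a flat JSON array of per-checkpoint metric objects keyed exactly as above; the real artifact layout may nest these fields differently.

```bash
# Sketch only: pick the checkpoint with the highest val.asset_survival_rate,
# breaking ties on val.mean_return. jq compares arrays lexicographically, so
# max_by on the two-element array encodes the tie-breaker. The file path and
# flat key layout are assumptions, not the trainer's documented format.
jq 'max_by([."val.asset_survival_rate", ."val.mean_return"])' \
  path/to/checkpoint_metrics.json
```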

Step 0 interpretation rules:

- Treat this run as a reference point for stability and feasibility.
- Do not treat it as final paper evidence until the hardening in Steps 1-3 is complete.
- Preserve all Step 0 artifacts for regression comparison.

---

## Step 1: Pilot Hardening

Purpose: reduce debugging risk and improve interpretability before expensive reruns.

### 1.1 One-factor-at-a-time pilot sweeps

For each algorithm:

1. Run small pilot sweeps where only one hyperparameter changes at a time.
2. Keep all other hyperparameters fixed to algorithm defaults or current pilot anchor values.
3. Use validation-only selection for pilot ranking.

Recommended order:

- `PPO`: learning rate -> `n_steps` -> entropy coefficient
- `A2C`: learning rate -> `n_steps` -> entropy coefficient
- `DQN`: learning rate -> exploration params -> target update interval -> replay buffer size

Why: if performance changes unexpectedly, attribution is immediate and debugging is faster.
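
A minimal sketch of a one-factor sweep is below; the trainer module path and flag names are placeholders, since the actual CLI of `train_rl_agent.py` is not reproduced here.

```bash
# Illustrative one-factor-at-a-time sweep: only the learning rate varies while
# everything else stays at the pilot anchor values. The module path and flags
# are placeholders, not the trainer's real interface.
for lr in 1e-4 3e-4 1e-3; do
  uv run python -m src.training.train_rl_agent \
    --algo ppo \
    --learning-rate "$lr" \
    --timesteps 40000 \
    --seed 11 \
    --run-label "pilot_ppo_lr_${lr}"
done
```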

### 1.2 Learning-rate schedule explicitness

- Keep a constant learning rate in pilot/debug runs unless a schedule is explicitly tested as an ablation.
- Record in run config/logs that no scheduler is active.
- If a scheduler is tested later, it must be clearly labeled and isolated from canonical results.

### 1.3 Early stopping policy

- Allow optional early stopping only in `pilot` or debug runs for faster iteration.
- Do not use early stopping in canonical final benchmark runs.
- Final benchmark remains fixed-budget for fairness (`200,000` per algorithm per seed).

### 1.4 NaN and instability guardrails

Add fail-fast checks during training/eval:

- Abort if any non-finite metric/loss appears (`NaN`, `inf`, `-inf`).
- Abort if checkpoint artifacts are malformed or missing expected keys.
- Flag severe instability patterns (for example prolonged metric collapse) for manual review.
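
A crude shell-level backstop for the non-finite check is sketched below; it assumes metrics are serialized to a text/JSON file, and the real guardrail should live in-process in the trainer.

```bash
# Backstop only: abort if any non-finite token appears in the serialized
# metrics file. Assumes metrics land in a JSON/text file at this example path;
# the in-process check inside the trainer remains the primary guardrail.
METRICS_FILE="path/to/checkpoint_metrics.json"
if grep -qiwE 'nan|inf|infinity' "$METRICS_FILE"; then
  echo "Non-finite metric detected in $METRICS_FILE; aborting." >&2
  exit 1
fi
```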

---

## Step 2: Re-run and Data Audit Decision

Purpose: re-run after Step 1 hardening and decide whether dataset cleaning is required.

### 2.1 Re-run hardened pilot

- Re-run smoke + pilot with hardening controls enabled.
- Compare against Step 0 baseline on key validation metrics and stability.

### 2.2 Data outlier and null audit

- Run the audit notebook: `notebooks/data_outlier_audit.ipynb`.
- Inspect outliers, invalid ranges, missing fields, split consistency, and duplicates.
- Decide if cleaning is required based on documented findings.
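
One way to run the audit headlessly is via nbconvert, as sketched below; this assumes `jupyter` is available in the project environment, and the output filename is only an example.

```bash
# Illustrative headless execution of the audit notebook. Requires jupyter
# (nbconvert) in the environment; the executed copy is written next to the input.
uv run jupyter nbconvert --to notebook --execute \
  notebooks/data_outlier_audit.ipynb \
  --output data_outlier_audit_executed.ipynb
```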

### 2.3 If cleaning is required

Use strict versioning and isolation:

- Save cleaned variants under `data/static/clean/`.
- Regenerate all split artifacts consistently from the cleaned source.
- Keep old and cleaned artifacts side-by-side (no overwrite).
- Save model weights and run artifacts before rerunning on cleaned data.
- Label cleaned-data runs explicitly (for example run label suffix or metadata flag).

Reproducibility note:

- Do not mix pilot selection on one data variant with final reporting on another without clear disclosure.

---

## Step 3: Defensibility and Final Freeze

Purpose: increase confidence in model health and lock benchmark settings before full reporting runs.

### 3.1 Training-health diagnostics

Add periodic diagnostics per algorithm where available:

- gradient norm trends
- value loss / TD loss trends
- entropy (for policy-gradient methods)
- checkpoint metric trajectories for train vs val

### 3.2 Instability-aware pilot confirmation

- For each algorithm, take top-2 pilot candidates.
- Run 2 quick repeats per candidate.
- Use aggregate validation ranking with instability-aware tie-breaking.

### 3.3 Freeze and execute final benchmark

- Freeze one hyperparameter config per algorithm.
- Freeze data variant (`original` or specific `clean` version).
- Run canonical fixed-budget benchmark over 5 seeds.
- Report only frozen-protocol results as primary evidence.
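
With the staged scripts from this change, the frozen final run typically reduces to the stage 05 launcher with its canonical defaults; the explicit variables below only restate those defaults for clarity.

```bash
# Illustrative final launch: frozen protocol over the canonical seeds and
# algorithm order. These variables mirror the staged-script defaults.
FINAL_SEEDS_CSV=11,22,33,44,55 \
ALGO_ORDER_CSV=ppo,a2c,dqn \
USE_PILOT_WINNERS=1 \
  ./scripts/stages/05_final_train.sh
```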

### Step 3 Acceptance Criteria

| Area | Criterion | Threshold / Rule | Action if Failed |
|---|---|---|---|
| Reproducibility canary | Same-config rerun consistency | `checkpoint_metrics.json` and `best_checkpoint.json` match within tolerance | Block final runs; inspect nondeterminism sources |
| Data contract | Required fields + split integrity | No missing required fields; no split mismatches | Rebuild/repair datasets before rerun |
| Numeric validity | Core env-feature ranges | `base_spread_prob` and `wind_strength` remain in valid bounds | Fix pipeline mapping or cleaning rules |
| Training stability | Non-finite values | Zero `NaN`/`inf` in training/eval metrics | Stop run immediately; debug before continuing |
| Pilot sanity | Learned policy trends above the weak baseline | At least one stable pilot config improves over random on val asset survival | Revisit hyperparameters/reward and debug before final runs |
| Selection robustness | Top config repeatability | Top-1 config remains top or near-top across quick repeats | Expand pilot repeats or reduce search noise |
| Protocol fairness | Final-run comparability | Same fixed step budget across all learned algorithms | Reject run as non-canonical |
| Leakage control | Holdout usage discipline | No holdout-based tuning/model selection | Re-run selection using train/val only |
| Artifact completeness | Paper-traceable outputs | Config, checkpoint logs, best checkpoint, best/last model, final eval all present | Re-run missing seeds or repair pipeline |

---

## Reporting Notes

When writing paper-facing results after hardening:

- explicitly state fixed-budget protocol and seed count
- explicitly state model-selection rule
- explicitly state data variant used (`original` or `clean/*`)
- clearly separate pilot diagnostics from final benchmark evidence