Merged
1 change: 1 addition & 0 deletions .gitignore
.env

data/
outputs/

#logs
wandb
124 changes: 112 additions & 12 deletions README.md

Empirical RL benchmark for wildfire tactical suppression. Compares DQN, A2C, PPO, and heuristic baselines on a 25x25 grid environment with critical assets and finite suppression budgets.

The physics-informed environment and its built-environment records derive from the [Alberta Historical Wildfires Database](https://open.alberta.ca/opendata/wildfire-data).

## Project Tree

```text
firebot/
├── README.md
├── pyproject.toml
├── uv.lock
├── ruff.toml
├── lefthook.yml
├── fp-historical-wildfire-data-dictionary-2006-2025.pdf # from the dataset download
├── data/
│ └── static/
│ ├── fp-historical-wildfire-data-2006-2025.csv # raw Alberta historical wildfire CSV
│ ├── snapshot_records.json # full normalized snapshot records from raw CSV
│ ├── snapshot_records_train.json # train-year snapshot subset
│ ├── snapshot_records_val.json # validation-year snapshot subset
│ ├── snapshot_records_holdout.json # holdout-year snapshot subset
│ ├── scenario_parameter_records.json # full unseeded environment parameter records
│ ├── scenario_parameter_records_train.json # train split unseeded records
│ ├── scenario_parameter_records_val.json # validation split unseeded records
│ ├── scenario_parameter_records_holdout.json # holdout split unseeded records
│ ├── scenario_parameter_records_seeded.json # full seeded records with ignition/layout seeds
│ ├── scenario_parameter_records_seeded_train.json # train runtime records
│ ├── scenario_parameter_records_seeded_val.json # validation runtime records
│ └── scenario_parameter_records_seeded_holdout.json # temporal holdout runtime records
├── docs/
│ ├── data-pipeline.md
│ ├── envspec.md
│ └── planning/
│ ├── env-checklist.md
│ ├── impl-plan.md
│ └── train-plan.md
├── src/
│ ├── __init__.py
│ ├── ingestion/
│ │ ├── __init__.py
│ │ ├── clean_historical.py # row cleaning and required-field checks
│ │ ├── cffdrs.py # CFFDRS station ingestion, not used
│ │ ├── weather.py # legacy Open-Meteo weather fetch helpers, not used
│ │ └── static_dataset.py # builds snapshot/scenario parameter records in data/static
│ └── models/ # environment, training, evaluation, and shared benchmark utilities
│ ├── __init__.py
│ ├── fire_env.py # WildfireEnv implementation and benchmark env construction helpers
│ ├── benchmarking.py # shared benchmark presets, rollout metrics, and aggregation functions
│ ├── train_rl_agent.py # unified PPO/A2C/DQN trainer with checkpoint and final evaluation artifacts
│ └── evaluate_agents.py # classdef for PPO/A2C/DQN plus greedy/random baselines
├── scripts/
│ ├── run_benchmark_train.sh # bash script for smoke validation then full 5-seed benchmark training
│ ├── run_benchmark_train.ps1 # powershell equivalent
│ ├── run_benchmark_eval.sh # bash script for post-training benchmark evaluation by seed
│ └── run_benchmark_eval.ps1 # powershell equivalent
├── tests/
│ ├── conftest.py
│ └── models/ # environment and benchmark metric contract tests
│ ├── test_fire_env_setup_contract.py # benchmark-mode env loading/split/schema contract tests
│ └── test_benchmarking_metrics.py # benchmark metric/preset/aggregation tests
├── outputs/ # generated training and evaluation artifacts (gitignored)
└── drd-archive/ # archived prototype code from the earlier DRD proposal
```

## Setup

Requirements: [uv](https://docs.astral.sh/uv/getting-started/installation/)

### Training

For controlled and reproducible benchmark training, use the script wrappers in `scripts/`.

Run from project root on macOS/Linux (bash):

```bash
./scripts/run_benchmark_train.sh
```

Run from project root on Windows (PowerShell):

```powershell
./scripts/run_benchmark_train.ps1
```

Script runs:

- Stage 1 (smoke): runs short validation training for `ppo`, `a2c`, `dqn` on one seed
- Stage 2 (smoke eval): loads smoke `best_model.zip` artifacts and runs evaluator sanity checks
- Stage 3 (formal): runs full canonical training for all three algorithms across 5 seeds (`11,22,33,44,55`)
- Uses artifact root `outputs/benchmark/` and keeps default trainer settings for env count, timesteps, and checkpoint cadence on formal runs

Training script environment overrides (optional):

- `ARTIFACT_ROOT` (default `outputs/benchmark`)
- `SMOKE_TIMESTEPS` (default `20000`, one canonical checkpoint interval)
- `SMOKE_SEED` (default `11`)
- `SMOKE_EVAL_EPISODES` (default `5`)
- `FINAL_SEEDS_CSV` (default `11,22,33,44,55`)
- `ALGO_ORDER_CSV` (default `ppo,a2c,dqn`)
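As an illustration of how these overrides resolve to concrete values, here is a minimal Python sketch; the function name is hypothetical, but the variable names and defaults mirror the list above.

```python
import os

# Hypothetical sketch of override resolution; names/defaults follow the
# README list above, not the scripts' actual implementation.
def resolve_training_overrides(env=None):
    env = os.environ if env is None else env
    return {
        "artifact_root": env.get("ARTIFACT_ROOT", "outputs/benchmark"),
        "smoke_timesteps": int(env.get("SMOKE_TIMESTEPS", "20000")),
        "smoke_seed": int(env.get("SMOKE_SEED", "11")),
        "smoke_eval_episodes": int(env.get("SMOKE_EVAL_EPISODES", "5")),
        # CSV-style values are split into lists
        "final_seeds": [int(s) for s in env.get("FINAL_SEEDS_CSV", "11,22,33,44,55").split(",")],
        "algo_order": env.get("ALGO_ORDER_CSV", "ppo,a2c,dqn").split(","),
    }

cfg = resolve_training_overrides({})  # empty mapping -> all defaults
```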

After `run_benchmark_train` completes, run the benchmark evaluation wrappers.

Run from project root on macOS/Linux (bash):

```bash
./scripts/run_benchmark_eval.sh
```

Run from project root on Windows (PowerShell):

```powershell
./scripts/run_benchmark_eval.ps1
```

These defaults can be overridden via environment variables or by editing the `.ps1` and `.sh` scripts:

- `ARTIFACT_ROOT` (default `outputs/benchmark`)
- `RUN_LABEL` (default `final`)
- `EVAL_SEEDS_CSV` (default `11,22,33,44,55`)
- `EVAL_EPISODES` (default `100`)
- `AGENTS` (default `ppo,a2c,dqn,greedy,random`)
- `OUTPUT_DIR` (default `outputs/benchmark/<run_label>/eval`)
- `INCLUDE_FAMILY_HOLDOUT` (`0` or `1`, default `0`)
- `INCLUDE_TEMPORAL_HOLDOUT` (`0` or `1`, default `0`)
- `NO_NORMALIZED_BURN` (`0` or `1`, default `0`)

The seeded scenario parameter files are the benchmark inputs for `FireEnv` training and script-driven evaluation.

The builder also writes year-based split files for the benchmark:

- `train`: `2006-2022`
- `val`: `2023`
- `holdout`: `2024-2025`
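The year-based split rule above can be sketched in a few lines of Python; the `"year"` field name is an assumption for illustration, not the builder's actual record schema.

```python
# Minimal sketch of the year-based benchmark split; the "year" key is
# a hypothetical field name, not the real snapshot-record schema.
def split_by_year(records):
    splits = {"train": [], "val": [], "holdout": []}
    for rec in records:
        year = rec["year"]
        if 2006 <= year <= 2022:
            splits["train"].append(rec)
        elif year == 2023:
            splits["val"].append(rec)
        elif 2024 <= year <= 2025:
            splits["holdout"].append(rec)
        # years outside 2006-2025 fall through (handled upstream by cleaning)
    return splits

splits = split_by_year([{"year": 2010}, {"year": 2023}, {"year": 2025}])
```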

The dataset builder prints cleaning/drop summaries to stdout and uses progress bars when `tqdm` is available.
104 changes: 87 additions & 17 deletions docs/envspec.md
# Fire Environment Specification

This document describes the RL environment in `src/models/fire_env.py` and its training/evaluation usage in `src/models/train_rl_agent.py` and `src/models/evaluate_agents.py`.

It remains code-first for implemented behavior, but it also records the benchmark-alignment requirements that training and evaluation code must satisfy before full canonical runs are launched.

The concrete benchmark execution plan lives in `docs/planning/train-plan.md`.

---


---

## 7) Training Process

Current training script:

- `src/models/train_rl_agent.py`
- currently implemented learned method: `PPO` (Stable-Baselines3)

Current implemented flow:

1. load seeded train split dataset
2. create vectorized benchmark envs (`n_envs`)
3. train PPO for configured timesteps
4. save model to `src/models/tactical_ppo_agent.zip`
5. run quick evaluation on train and optional val/holdout datasets

Benchmark-aligned target flow:

1. use a unified runner with `--algo {ppo,a2c,dqn}`
2. keep the same benchmark-mode dataset path and split validation for all methods
3. use vectorized envs for `PPO` and `A2C`
4. use a single benchmark env for `DQN` by default
5. write checkpoint metrics every fixed training interval
6. save per-run config and per-checkpoint metrics to disk
7. choose the best checkpoint by validation `asset_survival_rate`
8. run final split-wise evaluation on train/val/holdout after training completes
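Step 7 of the target flow can be sketched directly; the checkpoint-record shape shown here is illustrative, not the trainer's actual artifact format.

```python
# Hedged sketch of best-checkpoint selection by validation
# asset_survival_rate; the dict layout is an illustrative assumption.
def select_best_checkpoint(checkpoints):
    # checkpoints: [{"step": int, "val": {"asset_survival_rate": float}}, ...]
    return max(checkpoints, key=lambda c: c["val"]["asset_survival_rate"])

best = select_best_checkpoint([
    {"step": 20000, "val": {"asset_survival_rate": 0.61}},
    {"step": 40000, "val": {"asset_survival_rate": 0.74}},
    {"step": 60000, "val": {"asset_survival_rate": 0.69}},
])
```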

Current benchmark evaluation script:

- `src/models/evaluate_agents.py`
- evaluates agents across splits (`train`, `val`, `holdout`)
- currently implemented evaluated agents: `ppo`, `greedy`, `random`
- can output JSON summary via `--output`

Benchmark-aligned target support:

- `ppo`
- `a2c`
- `dqn`
- `greedy`
- `random`

Transparency outputs from current code:

- training console output: timesteps, env count, dataset path/count, quick split metrics
- model artifact: `tactical_ppo_agent.zip`
- evaluation console summary per split/agent
- optional evaluation JSON with aggregate metrics

Required benchmark transparency outputs:

- serialized run config per seed
- checkpoint metrics on train/val/holdout
- best-checkpoint selection record
- final evaluation JSON aggregated by seed, then across seeds

Recommended transparency plots (from saved eval JSON/logs):

- split-wise mean return (`train` vs `val` vs `holdout`)

---

## 8) Reporting Metrics

Primary optimization target: **Minimize assets damaged/lost**

Frozen benchmark metric definitions:

- mean episodic return
- asset survival rate
- containment success rate
- mean burned-area fraction: `(burned + burning + asset_burned) / 625`
- mean time to containment, conditioned on successful containment only
- mean resource efficiency: `successful_deployments / total_deployments`
- standard deviation across seeds for each reported metric
- wasted deployment rate
- mean normalized burn ratio (optional in evaluator)
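The burned-area fraction definition above is fully determined by the 25x25 grid; a one-function sketch (argument values illustrative):

```python
# Frozen burned-area fraction on the 25x25 grid:
# (burned + burning + asset_burned) / 625
def burned_area_fraction(burned, burning, asset_burned, grid_cells=625):
    return (burned + burning + asset_burned) / grid_cells

frac = burned_area_fraction(burned=100, burning=20, asset_burned=5)  # 125/625
```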

Important alignment notes:

- pooled episode variance is not a substitute for seed-level aggregation
- raw burned-cell count can still be logged, but the benchmark-facing metric should be the normalized burned-area fraction
- holdout performance is for final reporting only and must not be used for tuning
- the current temporal holdout dataset has only one unique seeded record, so it is a final diagnostic only until expanded

Report these diagnostics during training checkpoints:

- train/val gap for each metric
- optional train/family-holdout gap for each metric
- per-seed summary tables
- baseline comparisons (`greedy`, `random`) against learned methods

## 9) Environment-Side Requirements For Benchmark Alignment

The environment and evaluator together must expose enough information to compute the benchmark metrics exactly.

Environment-side counters or `info` fields required for clean evaluation:

- `assets_lost`
- `step`
- `heli_left`
- `crew_left`
- count of successful helicopter deployments
- count of successful crew deployments
- count of wasted deployment attempts
- count of total deployment attempts

Operational metric rules:

- `mean_resource_efficiency = successful_deployments / total_deployments`
- if `total_deployments == 0`, report `0.0`
- `wasted_deployment_rate = wasted_deployments / total_deployment_attempts`
- if `total_deployment_attempts == 0`, report `0.0`
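The two operational rules above, including the zero-denominator conventions, transcribe directly to:

```python
# Direct transcription of the operational metric rules, with the
# specified 0.0 convention for empty denominators.
def mean_resource_efficiency(successful_deployments, total_deployments):
    if total_deployments == 0:
        return 0.0
    return successful_deployments / total_deployments

def wasted_deployment_rate(wasted_deployments, total_deployment_attempts):
    if total_deployment_attempts == 0:
        return 0.0
    return wasted_deployments / total_deployment_attempts
```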

Evaluator-side aggregation requirements:

- aggregate per seed first, then summarize across seeds
- compute `time_to_containment` only on contained episodes
- compute normalized burn ratio against the same scenario record and evaluation seed under the deterministic non-intervention baseline defined in `docs/planning/train-plan.md`
- do not surface temporal holdout metrics during checkpoint evaluation in canonical runs
- pass `scenario_families` explicitly for canonical train, validation, and family-holdout runs rather than relying on the environment default
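The "aggregate per seed first, then summarize across seeds" requirement can be sketched as follows; the input shape is a hypothetical mapping of seed to per-episode metric values.

```python
from statistics import mean, stdev

# Sketch of seed-first aggregation: episode values are averaged within
# each seed before any cross-seed statistics are computed.
def seed_first_summary(per_seed_episode_values):
    # per_seed_episode_values: {seed: [episode-level metric values]}
    seed_means = [mean(vals) for vals in per_seed_episode_values.values()]
    return {
        "mean": mean(seed_means),
        "std_across_seeds": stdev(seed_means) if len(seed_means) > 1 else 0.0,
    }

summary = seed_first_summary({11: [1.0, 3.0], 22: [2.0, 2.0], 33: [4.0, 2.0]})
```

Note this differs from pooling all episodes: the seed means (2.0, 2.0, 3.0) are the units of analysis, which is why pooled episode variance is not a substitute.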

Verification requirement before full runs:

- all of the above metrics must appear in a short smoke-test evaluation artifact before any full 5-seed training sweep is launched