3 changes: 3 additions & 0 deletions .gitignore
@@ -12,6 +12,9 @@ wheels/

data/
outputs/
training-outputs/

notebooks/.ipynb_checkpoints

#logs
wandb
43 changes: 33 additions & 10 deletions README.md
@@ -51,10 +51,17 @@ firebot/
│ ├── train_rl_agent.py # unified PPO/A2C/DQN trainer with checkpoint and final evaluation artifacts
│ └── evaluate_agents.py # evaluator classes for PPO/A2C/DQN plus greedy/random baselines
├── scripts/
│ ├── run_benchmark_train.sh # bash script for smoke validation then full 5-seed benchmark training
│ ├── run_benchmark_train.sh # legacy all-in-one bash runner (staged scripts are canonical)
│ ├── run_benchmark_train.ps1 # powershell equivalent
│ ├── run_benchmark_eval.sh # bash script for post-training benchmark evaluation by seed
│ └── run_benchmark_eval.ps1 # powershell equivalent
│ └── stages/
│ ├── _common.sh # shared defaults, validation, and helper functions for staged runs
│ ├── 01_karpathy_overfit.sh # one-record overfit checks
│ ├── 02_smoke_train_and_repro.sh # smoke training + reproducibility canary
│ ├── 03_smoke_eval.sh # smoke checkpoint artifact loading + sanity eval
│ ├── 04_pilot_sweep.sh # one-seed validation-only pilot hyperparameter sweep
│ └── 05_final_train.sh # canonical full 5-seed training with frozen protocol
├── tests/
│ ├── conftest.py
│ └── models/ # environment and benchmark metric contract tests
@@ -176,12 +183,16 @@ uv run python -m src.ingestion.static_dataset --fire-records path/to/fire_record

### Training

For controlled and reproducible benchmark training, use the script wrappers in `scripts/`.
For controlled and reproducible benchmark training, run the staged bash scripts in order.

Run from project root on macOS/Linux (bash):

```bash
./scripts/run_benchmark_train.sh
./scripts/stages/01_karpathy_overfit.sh
./scripts/stages/02_smoke_train_and_repro.sh
./scripts/stages/03_smoke_eval.sh
./scripts/stages/04_pilot_sweep.sh
./scripts/stages/05_final_train.sh
```

Run from project root on Windows (PowerShell):
@@ -190,23 +201,35 @@ Run from project root on Windows (PowerShell):
./scripts/run_benchmark_train.ps1
```

Script runs:
Staged bash flow:

- Stage 1 (smoke): runs short validation training for `ppo`, `a2c`, `dqn` on one seed
- Stage 2 (smoke eval): loads smoke `best_model.zip` artifacts and runs evaluator sanity checks
- Stage 3 (formal): runs full canonical training for all three algorithms across 5 seeds (`11,22,33,44,55`)
- Uses artifact root `outputs/benchmark/` and keeps default trainer settings for env count, timesteps, and checkpoint cadence on formal runs
- Stage 1: Karpathy one-record overfit checks (`ppo`, `a2c`, `dqn`)
- Stage 2: smoke training + reproducibility canary
- Stage 3: smoke evaluation sanity check
- Stage 4: validation-only pilot sweeps and winner selection
- Stage 5: full canonical 5-seed training using frozen protocol values

Training script environment overrides (optional):
Shared staged-script environment overrides (optional):

- `ARTIFACT_ROOT` (default `outputs/benchmark`)
- `SMOKE_TIMESTEPS` (default `20000`, one canonical checkpoint interval)
- `SMOKE_SEED` (default `11`)
- `SMOKE_EVAL_EPISODES` (default `5`)
- `RUN_REPRO_CANARY` (default `1`)
- `REPRO_CANARY_TOL` (default `1e-9`)
- `KARPATHY_TIMESTEPS` (default `10000`)
- `KARPATHY_SEED` (default `11`)
- `KARPATHY_FAMILY` (default `center,medium,A`)
- `KARPATHY_CHECKPOINT_EVAL_EPISODES` (default `1`)
- `PILOT_TIMESTEPS` (default `40000`)
- `PILOT_SEED` (default `11`)
- `RUN_KARPATHY_CHECK` (default `1`)
- `RUN_PILOT_SWEEP` (default `1`)
- `USE_PILOT_WINNERS` (default `1`)
- `FINAL_SEEDS_CSV` (default `11,22,33,44,55`)
- `ALGO_ORDER_CSV` (default `ppo,a2c,dqn`)
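
For example, a quicker local smoke pass could override a couple of these before invoking the stage script (the values below are illustrative, not recommended settings):

```bash
# Illustrative only: shorter smoke run with fewer eval episodes.
SMOKE_TIMESTEPS=5000 SMOKE_EVAL_EPISODES=2 ./scripts/stages/02_smoke_train_and_repro.sh
```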

After `run_benchmark_train` completes, run benchmark evaluation wrappers.
After Stage 5 completes, run benchmark evaluation wrappers.

Run from project root on macOS/Linux (bash):

1 change: 1 addition & 0 deletions docs/data-pipeline.md
@@ -221,6 +221,7 @@ Canonical integration note:
- ignition family and asset layout remain simulator-side controls
- seeded parameter records do not store explicit ignition/layout labels
- `ignition_seed` and `layout_seed` make those simulator-side initializations reproducible
- severity remains record-conditioned through `severity_bucket`; it is not an independently sampled family control variable

Stored audit fields:

3 changes: 2 additions & 1 deletion docs/envspec.md
@@ -70,6 +70,7 @@ Important boundary:
- ignition family and asset layout labels remain simulator-side controls
- seeded records do not store explicit ignition/layout labels
- seeds make simulator-side initialization reproducible
- severity is record-conditioned from `severity_bucket` and is not independently controlled by `scenario_families`

---

@@ -83,7 +84,7 @@ Record loading and benchmark setup:

Episode construction and parameter sampling:

- `reset`: samples scenario family + parameter record, resets budgets/cooldowns/state, places assets, ignites fire
- `reset`: samples ignition/layout family + parameter record, resets budgets/cooldowns/state, places assets, ignites fire
- `_sample_parameter_record`: seed-stable shuffled sampling over loaded records
- `_configure_initialization_rngs`: configures ignition/layout RNGs from record seeds

153 changes: 153 additions & 0 deletions docs/planning/harden-training.md
@@ -0,0 +1,153 @@
# Training Hardening Plan

This document defines a staged hardening workflow to complete before launching paper-facing benchmark runs.

It assumes one full run has already been executed; that run should now be treated as a baseline sanity pass, not as final benchmark evidence.

---

## Step 0: Dummy Run

Purpose: establish a baseline run with the current unclean pipeline and confirm the end-to-end system is operational.

Current baseline configuration:

- Data variant: original seeded datasets under `data/static/`
- Algorithms: `PPO`, `A2C`, `DQN`
- Final budget: `200,000` steps per algorithm per seed
- Seed protocol: canonical 5 seeds (`11,22,33,44,55`) for final runs
- Checkpoint cadence: every `20,000` steps
- Checkpoint model selection: highest `val.asset_survival_rate`, tie-breaker `val.mean_return`
- Final artifact for reporting: `best_model.zip` (not `last_model.zip`)
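
As a rough sketch, this selection rule is a lexicographic maximum over the two validation metrics. The snippet below assumes `checkpoint_metrics.json` is a flat JSON array of per-checkpoint metric objects keyed exactly as above; the real artifact layout may nest these fields differently.

```bash
# Sketch only: pick the checkpoint with the highest val.asset_survival_rate,
# breaking ties on val.mean_return. jq compares arrays lexicographically, so
# max_by on the two-element array encodes the tie-breaker. The file path and
# flat key layout are assumptions, not the trainer's documented format.
jq 'max_by([."val.asset_survival_rate", ."val.mean_return"])' \
  path/to/checkpoint_metrics.json
```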

Step 0 interpretation rules:

- Treat this run as a reference point for stability and feasibility.
- Do not treat it as final paper evidence until the hardening in Steps 1-3 is complete.
- Preserve all Step 0 artifacts for regression comparison.

---

## Step 1: Pilot Hardening

Purpose: reduce debugging risk and improve interpretability before expensive reruns.

### 1.1 One-factor-at-a-time pilot sweeps

For each algorithm:

1. Run small pilot sweeps where only one hyperparameter changes at a time.
2. Keep all other hyperparameters fixed to algorithm defaults or current pilot anchor values.
3. Use validation-only selection for pilot ranking.

Recommended order:

- `PPO`: learning rate -> `n_steps` -> entropy coefficient
- `A2C`: learning rate -> `n_steps` -> entropy coefficient
- `DQN`: learning rate -> exploration params -> target update interval -> replay buffer size

Why: if performance changes unexpectedly, attribution is immediate and debugging is faster.
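
A minimal sketch of a one-factor sweep is below; the trainer module path and flag names are placeholders, since the actual CLI of `train_rl_agent.py` is not reproduced here.

```bash
# Illustrative one-factor-at-a-time sweep: only the learning rate varies while
# everything else stays at the pilot anchor values. The module path and flags
# are placeholders, not the trainer's real interface.
for lr in 1e-4 3e-4 1e-3; do
  uv run python -m src.training.train_rl_agent \
    --algo ppo \
    --learning-rate "$lr" \
    --timesteps 40000 \
    --seed 11 \
    --run-label "pilot_ppo_lr_${lr}"
done
```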

### 1.2 Learning-rate schedule explicitness

- Keep a constant learning rate in pilot/debug runs unless a schedule is explicitly tested as an ablation.
- Record in run config/logs that no scheduler is active.
- If a scheduler is tested later, it must be clearly labeled and isolated from canonical results.

### 1.3 Early stopping policy

- Allow optional early stopping only in `pilot` or debug runs for faster iteration.
- Do not use early stopping in canonical final benchmark runs.
- Final benchmark remains fixed-budget for fairness (`200,000` per algorithm per seed).

### 1.4 NaN and instability guardrails

Add fail-fast checks during training/eval:

- Abort if any non-finite metric/loss appears (`NaN`, `inf`, `-inf`).
- Abort if checkpoint artifacts are malformed or missing expected keys.
- Flag severe instability patterns (for example prolonged metric collapse) for manual review.
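
A crude shell-level backstop for the non-finite check is sketched below; it assumes metrics are serialized to a text/JSON file, and the real guardrail should live in-process in the trainer.

```bash
# Backstop only: abort if any non-finite token appears in the serialized
# metrics file. Assumes metrics land in a JSON/text file at this example path;
# the in-process check inside the trainer remains the primary guardrail.
METRICS_FILE="path/to/checkpoint_metrics.json"
if grep -qiwE 'nan|inf|infinity' "$METRICS_FILE"; then
  echo "Non-finite metric detected in $METRICS_FILE; aborting." >&2
  exit 1
fi
```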

---

## Step 2: Re-run and Data Audit Decision

Purpose: re-run after Step 1 hardening and decide whether dataset cleaning is required.

### 2.1 Re-run hardened pilot

- Re-run smoke + pilot with hardening controls enabled.
- Compare against Step 0 baseline on key validation metrics and stability.

### 2.2 Data outlier and null audit

- Run the audit notebook: `notebooks/data_outlier_audit.ipynb`.
- Inspect outliers, invalid ranges, missing fields, split consistency, and duplicates.
- Decide if cleaning is required based on documented findings.
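
One way to run the audit headlessly is via nbconvert, as sketched below; this assumes `jupyter` is available in the project environment, and the output filename is only an example.

```bash
# Illustrative headless execution of the audit notebook. Requires jupyter
# (nbconvert) in the environment; the executed copy is written next to the input.
uv run jupyter nbconvert --to notebook --execute \
  notebooks/data_outlier_audit.ipynb \
  --output data_outlier_audit_executed.ipynb
```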

### 2.3 If cleaning is required

Use strict versioning and isolation:

- Save cleaned variants under `data/static/clean/`.
- Regenerate all split artifacts consistently from the cleaned source.
- Keep old and cleaned artifacts side-by-side (no overwrite).
- Save model weights and run artifacts before rerunning on cleaned data.
- Label cleaned-data runs explicitly (for example run label suffix or metadata flag).

Reproducibility note:

- Do not mix pilot selection on one data variant with final reporting on another without clear disclosure.

---

## Step 3: Defensibility and Final Freeze

Purpose: increase confidence in model health and lock benchmark settings before full reporting runs.

### 3.1 Training-health diagnostics

Add periodic diagnostics per algorithm where available:

- gradient norm trends
- value loss / TD loss trends
- entropy (for policy-gradient methods)
- checkpoint metric trajectories for train vs val

### 3.2 Instability-aware pilot confirmation

- For each algorithm, take top-2 pilot candidates.
- Run 2 quick repeats per candidate.
- Use aggregate validation ranking with instability-aware tie-breaking.

### 3.3 Freeze and execute final benchmark

- Freeze one hyperparameter config per algorithm.
- Freeze data variant (`original` or specific `clean` version).
- Run canonical fixed-budget benchmark over 5 seeds.
- Report only frozen-protocol results as primary evidence.
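
With the staged scripts from this change, the frozen final run typically reduces to the stage 05 launcher with its canonical defaults; the explicit variables below only restate those defaults for clarity.

```bash
# Illustrative final launch: frozen protocol over the canonical seeds and
# algorithm order. These variables mirror the staged-script defaults.
FINAL_SEEDS_CSV=11,22,33,44,55 \
ALGO_ORDER_CSV=ppo,a2c,dqn \
USE_PILOT_WINNERS=1 \
  ./scripts/stages/05_final_train.sh
```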

### Step 3 Acceptance Criteria

| Area | Criterion | Threshold / Rule | Action if Failed |
|---|---|---|---|
| Reproducibility canary | Same-config rerun consistency | `checkpoint_metrics.json` and `best_checkpoint.json` match within tolerance | Block final runs; inspect nondeterminism sources |
| Data contract | Required fields + split integrity | No missing required fields; no split mismatches | Rebuild/repair datasets before rerun |
| Numeric validity | Core env-feature ranges | `base_spread_prob` and `wind_strength` remain in valid bounds | Fix pipeline mapping or cleaning rules |
| Training stability | Non-finite values | Zero `NaN`/`inf` in training/eval metrics | Stop run immediately; debug before continuing |
| Pilot sanity | Learned policy trends above the weak baseline | At least one stable pilot config improves over random on val asset survival | Revisit hyperparameters/reward and debug before final runs |
| Selection robustness | Top config repeatability | Top-1 config remains top or near-top across quick repeats | Expand pilot repeats or reduce search noise |
| Protocol fairness | Final-run comparability | Same fixed step budget across all learned algorithms | Reject run as non-canonical |
| Leakage control | Holdout usage discipline | No holdout-based tuning/model selection | Re-run selection using train/val only |
| Artifact completeness | Paper-traceable outputs | Config, checkpoint logs, best checkpoint, best/last model, final eval all present | Re-run missing seeds or repair pipeline |

---

## Reporting Notes

When writing paper-facing results after hardening:

- explicitly state fixed-budget protocol and seed count
- explicitly state model-selection rule
- explicitly state data variant used (`original` or `clean/*`)
- clearly separate pilot diagnostics from final benchmark evidence