Merged
32 changes: 0 additions & 32 deletions docs/benchmarks/diffroute.md
@@ -124,38 +124,6 @@ uv run python scripts/benchmark.py diffroute.enabled=false

The benchmark produces:

### Metrics (logged)

```
=== DDR Metrics ===
----------------------------------------
Metric | Mean | Median
----------------------------------------
NSE | 0.7234 | 0.7891
RMSE | 12.3456 | 8.7654
KGE | 0.6543 | 0.7012
----------------------------------------

=== DiffRoute Metrics ===
----------------------------------------
Metric | Mean | Median
----------------------------------------
NSE | 0.6891 | 0.7456
...

=== Summed Q' Metrics ===
...
```

### Mass Balance (logged)

```
=== Mass Balance Accumulation Comparison ===
DDR vs Obs — Mean rel. error: 0.1234, Median: 0.0567
DiffRoute vs Obs — Mean rel. error: 0.2345, Median: 0.1234
DDR vs summed Q' — Mean rel. error: 0.0456, Median: 0.0234
```

### Plots (saved to `output/<run>/plots/`)

| File | Description |
18 changes: 3 additions & 15 deletions docs/benchmarks/index.md
@@ -14,7 +14,7 @@ The `ddr-benchmarks` package provides tools for comparing DDR against other rout
Benchmarking routing models requires:

1. **Identical input data** - Same lateral inflows (Q'), network topology, and time period
2. **Consistent evaluation** - Same metrics (NSE, KGE, RMSE) computed on same observations
2. **Consistent evaluation** - Same evaluation criteria applied to the same observations
3. **Fair comparison** - Account for differences in model formulations and parameters

The benchmarks package addresses all three by reusing DDR's existing data infrastructure while providing adapters for other routing models.
@@ -88,17 +88,6 @@ The benchmark produces publication-quality plots and console diagnostics:
| `gauge_map_sqp_NSE.png` | Map of gauges colored by summed Q' NSE (if enabled) |
| `hydrographs/*.png` | Per-gage time series with all models overlaid |

### Console Output

Mass balance accumulation comparison is logged for each model:

```
=== Mass Balance Accumulation Comparison ===
DDR vs Obs — Mean rel. error: 0.1234, Median: 0.0567
DiffRoute vs Obs — Mean rel. error: 0.2345, Median: 0.1234
DDR vs summed Q' — Mean rel. error: 0.0456, Median: 0.0234
```

### Results (saved to `output/<run>/benchmark_results.zarr`)

```python
@@ -146,9 +135,8 @@ The main benchmark script follows the same pattern as `scripts/test.py`:
3. **Phase 1**: Run DDR on time-batched DataLoader, accumulate predictions
4. **Phase 2**: Run DiffRoute per-gage using zarr subgroup graphs
5. Optionally load summed Q' predictions for baseline comparison
6. Compute metrics using DDR's `Metrics` class
7. Log mass balance accumulation comparison
8. Generate comparison plots (CDF, boxplots, gauge maps, hydrographs)
6. Evaluate predictions using DDR's `Metrics` class
7. Generate comparison plots (CDF, boxplots, gauge maps, hydrographs)
8. Save results to zarr
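The Phase 1 accumulation step above can be sketched with toy data. The names and shapes here are illustrative only, not DDR's actual API: real batches come from a time-batched DataLoader and `dmc_output["runoff"]` tensors.

```python
# Toy sketch of Phase 1: fill a full (gages, timesteps) prediction array
# batch by batch, where each batch covers a contiguous time slice.
n_gages, n_timesteps = 2, 6
predictions = [[None] * n_timesteps for _ in range(n_gages)]

# Each hypothetical "batch" pairs the time indices it covers with a
# (gages, batch_len) block of routed runoff values.
batches = [
    (range(0, 3), [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]),
    (range(3, 6), [[3.0, 3.0, 3.0], [4.0, 4.0, 4.0]]),
]

for indices, runoff in batches:
    for g in range(n_gages):
        for j, t in enumerate(indices):
            predictions[g][t] = runoff[g][j]  # analogous to predictions[:, indices] = ...

print(predictions[0])  # [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
```

Once every batch has been written, no `None` gaps remain and the array is ready for evaluation against observations.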

### DiffRoute Adapter (`diffroute_adapter.py`)
8 changes: 1 addition & 7 deletions docs/startup.md
@@ -240,13 +240,7 @@ __NOTE:__ Please change the config to match what mode/geodataset/method you need

### Monitoring

DDR logs progress including:

- Loss values per epoch and mini-batch
- NSE, RMSE, and KGE metrics
- Parameter statistics

Model checkpoints are saved to the `params.save_path` directory.
Training progress is logged to the output directory. Model checkpoints are saved to the `params.save_path` directory.

### Expected Model outputs

2 changes: 1 addition & 1 deletion docs/usage/examples.md
@@ -57,4 +57,4 @@ Each `example_config.yaml` uses `${oc.env:DDR_DATA_DIR,./../../data}` so paths r

## Model Evaluation

The `examples/eval/evaluate.ipynb` notebook demonstrates how to evaluate the performance of a trained model and compare routed predictions against the summed Q' baseline.
The `examples/eval/evaluate.ipynb` notebook demonstrates how to compare routed predictions against observations and the summed Q' baseline.
2 changes: 1 addition & 1 deletion docs/usage/routing.md
@@ -159,4 +159,4 @@ data_sources:
## Next Steps

- [Benchmarks](../benchmarks/index.md): Compare routing results against other models
- [Model Testing](test.md): Evaluate model performance with observations
- [Model Testing](test.md): Compare predictions against observations
20 changes: 1 addition & 19 deletions docs/usage/summed_q_prime.md
@@ -10,11 +10,7 @@ The summed lateral flow (Summed Q') baseline computes streamflow at gauge locati

Routing redistributes flow in time — it delays and attenuates flood waves as they travel downstream. The Summed Q' baseline skips this step entirely, giving you a direct measure of how much your unit catchment predictions (from dHBV, NWM, or any lumped model) contribute to the total signal vs. how much routing improves it.

Comparing DDR against Summed Q' tells you:

- **How well your lateral inflows capture total volume** (bias, FLV)
- **How much timing improvement routing adds** (NSE, KGE, correlation)
- **Whether routing is worth the compute cost** for your application
Comparing DDR against Summed Q' quantifies the effect of routing on the predicted hydrograph relative to a simple summation baseline.

## Quick Start

@@ -68,20 +64,6 @@ output/<run_name>/
└── detailed_metrics_<timestamp>.csv # Per-gauge metrics
```

### Metrics

The script reports the following metrics for all valid gauges:

| Metric | Description | Ideal |
|--------|-------------|-------|
| **NSE** | Nash-Sutcliffe Efficiency | 1.0 |
| **KGE** | Kling-Gupta Efficiency | 1.0 |
| **Bias** | Mean bias ratio | 1.0 |
| **FLV** | Low flow volume error (%) | 0.0 |
| **FHV** | High flow volume error (%) | 0.0 |
| **MAE** | Mean Absolute Error | 0.0 |
| **RMSE** | Root Mean Square Error | 0.0 |

### Loading Results

```python
38 changes: 8 additions & 30 deletions docs/usage/test.md
@@ -12,7 +12,7 @@ Model testing evaluates a trained DDR model on a different time period than trai

1. Load trained model checkpoint
2. Run forward pass on test period data
3. Compute metrics (NSE, KGE, RMSE) against observations
3. Compare predictions against observations
4. Generate evaluation outputs

## Quick Start
@@ -33,7 +33,7 @@ experiment:
batch_size: 64
start_time: 1995/10/01 # Test period start
end_time: 2010/09/30 # Test period end
warmup: 3 # Warmup days excluded from metrics
warmup: 3 # Days excluded from evaluation during spin-up
checkpoint: /path/to/trained_model.pt # Required!
```

@@ -64,23 +64,18 @@ with torch.no_grad():
predictions[:, indices] = dmc_output["runoff"].cpu().numpy()
```

### 3. Compute Metrics
### 3. Evaluate Predictions

DDR computes standard hydrologic metrics:

| Metric | Description | Ideal Value |
|--------|-------------|-------------|
| **NSE** | Nash-Sutcliffe Efficiency | 1.0 |
| **KGE** | Kling-Gupta Efficiency | 1.0 |
| **RMSE** | Root Mean Square Error | 0.0 |
Use the `Metrics` class from `ddr.validation` to compare predictions against observations:

```python
from ddr.validation.metrics import Metrics

metrics = Metrics(pred=daily_runoff[:, warmup:], target=observations[:, warmup:])
print(f"NSE: {metrics.nse.mean():.4f}")
print(f"KGE: {metrics.kge.mean():.4f}")
print(f"RMSE: {metrics.rmse.mean():.4f}")
```

See the `Metrics` class for available evaluation attributes.
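For intuition about what an efficiency score measures, NSE can be sketched by hand. The function below is an illustrative toy, not DDR's `Metrics` implementation; use `ddr.validation.metrics.Metrics` in real evaluations.

```python
def nse(pred: list[float], obs: list[float]) -> float:
    """Toy Nash-Sutcliffe Efficiency: 1 - SSE / variance-of-observations."""
    mean_obs = sum(obs) / len(obs)
    ss_err = sum((p - o) ** 2 for p, o in zip(pred, obs))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_err / ss_tot

obs = [1.0, 2.0, 3.0, 4.0]
print(nse([1.0, 2.0, 3.0, 4.0], obs))  # perfect match -> 1.0
print(nse([2.5, 2.5, 2.5, 2.5], obs))  # predicting the mean -> 0.0
```

A model that only predicts the observed mean scores 0, so positive values indicate skill beyond that trivial baseline.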

## Output

Test results are saved to:
@@ -106,23 +106,6 @@ print(ds)
# observations (gage_ids, time) float64
```

## Interpreting Results

### NSE Guidelines

| NSE Range | Interpretation |
|-----------|----------------|
| > 0.75 | Very good |
| 0.65 - 0.75 | Good |
| 0.50 - 0.65 | Satisfactory |
| < 0.50 | Unsatisfactory |

### Common Issues

1. **Poor performance on large basins**: May need more training data or different architecture
2. **Negative NSE**: Model predictions worse than mean - check data alignment
3. **Good NSE but poor KGE**: Timing/bias issues - inspect hydrographs

## Next Steps

- [Routing](routing.md): Run inference on new domains
7 changes: 1 addition & 6 deletions docs/usage/train.md
@@ -136,12 +136,7 @@ The training will resume from the saved epoch and mini-batch.

## Monitoring

Training logs include:

- Loss values per mini-batch
- NSE, RMSE, KGE metrics periodically
- Learning rate changes
- Parameter statistics
Training progress is logged to the output directory. See the log file for details on loss values, learning rate changes, and parameter statistics.

## Tips

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -27,6 +27,7 @@ maintainers = [
]

dependencies = [
"bmipy>=2.0",
"botocore>=1.42.5",
"colormaps",
"cubed",
@@ -170,6 +171,7 @@ convention = "numpy"
"mkdocs.yml" = ["I"]
"tests/*" = ["D"]
"*/__init__.py" = ["F401"]
"src/ddr/bmi/ddr_bmi.py" = ["D102"] # BMI interface methods are self-documenting (bmipy.Bmi)
"*.ipynb" = ["E402", "E501", "F401", "F811", "F841", "T201"] # Common notebook exceptions

[tool.ruff.format]
7 changes: 6 additions & 1 deletion src/ddr/__init__.py
@@ -5,4 +5,9 @@
from .nn import kan
from .routing.torch_mc import dmc

__all__ = ["__version__", "dmc", "streamflow", "ddr_functions", "kan", "validation"]
try:
from . import bmi
except ImportError:
bmi = None # type: ignore[assignment]

__all__ = ["__version__", "dmc", "streamflow", "ddr_functions", "kan", "validation", "bmi"]
9 changes: 9 additions & 0 deletions src/ddr/bmi/__init__.py
@@ -0,0 +1,9 @@
"""BMI wrapper for DDR differentiable Muskingum-Cunge routing.

Provides a BMI v2.0 (CSDMS) interface for integration with the NGWPC/ngen
NextGen Water Resources Modeling Framework as a drop-in replacement for t-route.
"""

from .ddr_bmi import DdrBmi

__all__ = ["DdrBmi"]
50 changes: 50 additions & 0 deletions src/ddr/bmi/config.py
@@ -0,0 +1,50 @@
"""BMI initialization config schema.

Defines the YAML config format for DDR's BMI wrapper. This config points to
the full DDR Hydra config and trained KAN checkpoint, keeping BMI-specific
settings separate from DDR's internal configuration.
"""

from pathlib import Path
from typing import Literal

from pydantic import BaseModel, Field


class BmiInitConfig(BaseModel):
"""Schema for the BMI initialization YAML config file.

Parameters
----------
ddr_config : Path
Path to DDR's Hydra YAML config file.
kan_checkpoint : Path
Path to trained KAN .pt checkpoint file.
hydrofabric_gpkg : Path or None
Override hydrofabric GeoPackage path from ddr_config.
conus_adjacency : Path or None
Override adjacency matrix path from ddr_config.
device : str
Compute device ("cpu", "cuda", "cuda:0", etc.).
timestep_seconds : float
Internal MC routing timestep in seconds. Can be smaller than
ngen's coupling interval for sub-stepping (e.g., 900s routing
with 3600s ngen_dt gives 4 sub-steps per coupling).
interpolation : {"constant", "linear"}
Lateral inflow interpolation between ngen coupling intervals
when sub-stepping. "constant" holds inflows fixed (zeroth-order);
"linear" interpolates from previous to current inflows across
sub-steps. See ``data/diagrams/bmi_testing_guide.txt`` for
mass conservation implications.
"""

ddr_config: Path = Field(description="Path to DDR Hydra YAML config")
kan_checkpoint: Path = Field(description="Path to trained KAN checkpoint")
hydrofabric_gpkg: Path | None = Field(default=None, description="Override hydrofabric GeoPackage path")
conus_adjacency: Path | None = Field(default=None, description="Override adjacency matrix path")
device: str = Field(default="cpu", description="Compute device")
timestep_seconds: float = Field(default=3600.0, description="Internal MC routing timestep in seconds")
interpolation: Literal["constant", "linear"] = Field(
default="constant",
description="Lateral inflow interpolation for sub-stepping: 'constant' or 'linear'",
)
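An initialization YAML matching this schema might look like the following sketch. Every path below is a placeholder for illustration, not a shipped default:

```yaml
# Hypothetical BMI init config for DdrBmi.initialize(); paths are placeholders.
ddr_config: /path/to/ddr_hydra_config.yaml
kan_checkpoint: /path/to/trained_kan.pt
device: cpu
timestep_seconds: 900.0   # 900 s routing with a 3600 s ngen_dt gives 4 sub-steps
interpolation: linear     # interpolate lateral inflows across sub-steps
```

Fields omitted here (`hydrofabric_gpkg`, `conus_adjacency`) fall back to their defaults of `None`, so the paths in `ddr_config` are used.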