diff --git a/technical_decision_log/TDL-0004-error-warning-semantics.md b/technical_decision_log/TDL-0004-error-warning-semantics.md new file mode 100644 index 0000000000..2b6b53a836 --- /dev/null +++ b/technical_decision_log/TDL-0004-error-warning-semantics.md @@ -0,0 +1,171 @@ +# TDL-0004: Error vs Warning Semantics and Phase Boundaries + +**Date:** 2026-05-15 + +## What + +The toolkit has **three** parallel severity systems coexisting today, +with inconsistent user-facing semantics: + +1. **`ToolkitError`** (`cognite_toolkit/_cdf_tk/exceptions.py`) — a Python + exception. Raised at the call site, caught at the CLI root, aborts the + command immediately. Roughly 30 subclasses (`ToolkitValidationError`, + `ToolkitMissingResourceError`, etc.). +2. **`SeverityLevel` + `ToolkitWarning`** + (`cognite_toolkit/_cdf_tk/tk_warnings/base.py`) — the legacy warning + system. String enum `ERROR / HIGH / MEDIUM / LOW`. 41 warning classes + use it. `SeverityLevel.ERROR` renders as `ERROR [ERROR]:` but the + command **continues running** — used in one place today + (`fileread.py:220`). +3. **`InsightDefinition` + numeric severity** + (`cognite_toolkit/_cdf_tk/commands/build_v2/data_classes/_insights.py`) + — the build v2 system. Integer severities: `FileReadError=60`, + `FailedValidation=60`, `ModelSyntaxWarning=40`, `ConsistencyError=30`, + `IgnoredFileWarning=20`, `Recommendation=10`. Build v2 buckets them + with thresholds (`≤15`, `≤35`, `>35`). `InsightList.has_errors` + already exists and is wired to fail the phase — but only for build v2. + +A fourth, narrower `Severity` enum (`skipped / warning / failure`) exists +in `dataio/logger.py` for data IO aggregation. + +A user seeing `ERROR: ...` in build output has no way to predict whether +the command will fail. Worse: in build v2 the class names and the numeric +severities disagree — `ModelSyntaxWarning` (severity 40) outranks +`ConsistencyError` (severity 30), so a class literally called "Error" is +*less severe* than one called "Warning". This is the encoded form of the +bug Magnus reported in +[CDF-27907](https://cognitedata.atlassian.net/browse/CDF-27907) and +discussed in #contributors-toolkit on 2026-05-08. + +We also have no shared definition of a **phase boundary**, which is the +natural place to aggregate findings and decide pass/fail. + +## Goal + +**A command runs to completion, collecting warnings and errors, and +reports them at the next phase boundary — unless a `ToolkitError` is +raised.** + +`ToolkitError` is reserved for conditions that are by definition +unrecoverable: there is no way to proceed, and continuing would produce +garbage or hide more useful findings. The user must address the +underlying cause before it makes sense to retry. We accept that these +abort immediately at the raise site. + +Everything else — schema mismatches, missing references, unresolved +variables, naming-convention violations, recommendations — is collected +as the command runs. At the phase boundary the toolkit prints the full +set in one pass and decides pass/fail based on whether any of the +collected issues are errors. The user sees every problem of a given kind +at once instead of fixing them one at a time, rerunning, and hitting the +next. + +## Decision + +We adopt the **collect-then-fail-at-phase-boundary** model (the same +model used by `ruff`, `mypy`, and `tsc`). Three categories, three +contracts: + +### 1. Warnings — informational, never block + +Severities `WARNING`, `NOTICE`, `HINT` (exact names TBD with Magnus). +Collected and rendered. The command **always proceeds** and exits 0 from +these alone. Users can suppress specific warning classes via `cdf.toml` +(planned). + +### 2. Errors — collected, reported in bulk, fail the phase + +`SeverityLevel.ERROR` issues are collected during a phase. At the +**phase boundary**, the toolkit: + +1. Prints every error collected during the phase, grouped and with + source paths. +2. Prints a summary line (`Found N errors in M files`). +3. Exits non-zero. No further phases run. + +The user sees **all** failure modes from that phase in a single pass +instead of fix-one-rerun whack-a-mole. The `WarningList` (or its +successor) gains a `has_errors()` method that command entry points +consult to set the exit code. + +### 3. `ToolkitError` exceptions — abort immediately + +Reserved for conditions that genuinely cannot be collected and continued +past: missing auth, network/IO failure, unreachable CDF project, +internal invariant violations. These still raise, are caught at the CLI +root, and abort at the raise site. + +The intuition: if continuing past the condition would produce garbage or +hide more useful findings, raise `ToolkitError`. If continuing would +surface *more* useful findings of the same kind, emit a +`SeverityLevel.ERROR` and let the phase boundary handle the exit code. + +### Rendered prefixes (user-facing contract) + +- `ERROR:` — this will fail the build/phase. Every `ERROR:` you see is + shown before exit; fix them all before rerunning. +- `WARNING:` / `NOTICE:` / `HINT:` — fyi, the command continues. + +## Phase boundary — definition + +A **phase boundary** is the point in a command's execution where the +toolkit finishes a logically complete unit of work and can meaningfully +decide "proceed or stop." Concretely in the toolkit today: + +| Command | Phases (in order) | Boundary behaviour | +| --- | --- | --- | +| `cdf build` | 1. Read & parse modules → 2. Validate → 3. Write build dir | Collect errors across phases 1 and 2. Fail at the end of phase 2 before any files are written. | +| `cdf deploy` | 1. Load build dir → 2. Plan/diff → 3. Apply per resource type | Collect errors during phases 1 and 2. Fail at end of phase 2. Phase 3 stays fail-fast (side effects on CDF). | +| `cdf clean` | 1. Load build dir → 2. Plan deletions → 3. Apply | Same shape as `deploy`. | +| `cdf modules ` | Single phase | No aggregation needed; raise `ToolkitError` on failure. | + +The principle: **aggregate within a phase, fail at its boundary, do not +enter the next phase if the previous one had errors.** Phases with side +effects on CDF (the actual API calls in `deploy`/`clean`/`purge`) remain +fail-fast — once you've started mutating CDF, accumulating more errors +helps no one. + +A phase boundary is *not* the same as a Python function return or a CLI +command exit. It is the architectural seam between "gathering +information" and "acting on it." `build` has one such seam (validate → +write). `deploy` has one (plan → apply). That is the only place an +aggregated error report is appropriate. + +## Why + +- **Eliminates whack-a-mole.** Users with messy YAML get one + comprehensive report instead of fixing one error, rerunning, hitting + the next, rerunning, repeat. This is the single biggest CLI UX win. +- **Removes the `ERROR:`-but-build-passes ambiguity** flagged in + CDF-27907 — `ERROR:` always means "the phase failed." +- **Keeps `ToolkitError` exceptions for what they are good at** — + unrecoverable conditions where collecting more state is pointless or + unsafe. +- **Matches the model users already know** from `ruff`, `mypy`, `tsc`, + `eslint`. Familiar mental model, no toolkit-specific re-learning. +- **Side-effect commands stay safe.** Because `deploy`'s mutation phase + stays fail-fast, we don't introduce the failure mode "kept calling + CDF after the first 400 to collect more 400s." + +## Consequences + +- `WarningList` needs a `has_errors()` method and command entry points + need to consult it when setting the exit code. The build v2 + `InsightList` already has this — it should be the reference + implementation, and the legacy `WarningList` should gain the same + wiring (or the two should converge). +- Every place currently using `raise ToolkitValidationError(...)` + mid-validation needs review: if the surrounding loop can continue and + surface more findings, it should emit an error-level issue instead of + raising. If continuing would corrupt state, keep the raise. +- The legacy `SeverityLevel` enum will be reshaped (see follow-up plan + and CDF-27907) so the rendered prefix matches this contract. +- Build v2 class names need to align with their numeric severity: + `ModelSyntaxWarning` (40) → rename to indicate it's an error; + `ConsistencyError` (30) — either rename to a warning level or + increase the number above syntax. Tracked as part of + [CDF-27968](https://cognitedata.atlassian.net/browse/CDF-27968). +- Long-term: converge the three systems into one. The legacy + `tk_warnings`, the build v2 `_insights`, and the `dataio/logger` + `Severity` all model the same concept differently. Picking one model + and migrating is its own work item, larger than this TDL.