From 991a8e2dfead547cc5c4026f1dc9205e6223a606 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A5l=20R=C3=B8nning?= <ronpal@users.noreply.github.com>
Date: Fri, 15 May 2026 13:30:30 +0200
Subject: [PATCH] Add TDL-0004: error vs warning semantics and phase boundaries

Document the goal of running commands to completion (collecting issues
at phase boundaries) versus aborting only on ToolkitErrors, which are
by definition unrecoverable. Covers the three coexisting severity
systems (legacy tk_warnings SeverityLevel, build v2 numeric
InsightDefinition, dataio Severity) and the NEAT adapter that
originates the Error/Warning naming mismatch in CDF-27907.

Related: CDF-27907, CDF-27968.
---
 .../TDL-0004-error-warning-semantics.md       | 171 ++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 technical_decision_log/TDL-0004-error-warning-semantics.md

diff --git a/technical_decision_log/TDL-0004-error-warning-semantics.md b/technical_decision_log/TDL-0004-error-warning-semantics.md
new file mode 100644
index 0000000000..2b6b53a836
--- /dev/null
+++ b/technical_decision_log/TDL-0004-error-warning-semantics.md
@@ -0,0 +1,171 @@
+# TDL-0004: Error vs Warning Semantics and Phase Boundaries
+
+**Date:** 2026-05-15
+
+## What
+
+The toolkit has **three** parallel severity systems coexisting today,
+with inconsistent user-facing semantics:
+
+1. **`ToolkitError`** (`cognite_toolkit/_cdf_tk/exceptions.py`) — a Python
+   exception. Raised at the call site, caught at the CLI root, aborts the
+   command immediately. Roughly 30 subclasses (`ToolkitValidationError`,
+   `ToolkitMissingResourceError`, etc.).
+2. **`SeverityLevel` + `ToolkitWarning`**
+   (`cognite_toolkit/_cdf_tk/tk_warnings/base.py`) — the legacy warning
+   system. String enum `ERROR / HIGH / MEDIUM / LOW`. 41 warning classes
+   use it. `SeverityLevel.ERROR` renders as `ERROR [ERROR]:` but the
+   command **continues running** — used in one place today
+   (`fileread.py:220`).
+3. **`InsightDefinition` + numeric severity**
+   (`cognite_toolkit/_cdf_tk/commands/build_v2/data_classes/_insights.py`)
+   — the build v2 system. Integer severities: `FileReadError=60`,
+   `FailedValidation=60`, `ModelSyntaxWarning=40`, `ConsistencyError=30`,
+   `IgnoredFileWarning=20`, `Recommendation=10`. Build v2 buckets them
+   with thresholds (`≤15`, `≤35`, `>35`). `InsightList.has_errors`
+   already exists and is wired to fail the phase — but only for build v2.
+
+A fourth, narrower `Severity` enum (`skipped / warning / failure`) exists
+in `dataio/logger.py` for data IO aggregation.
+
+A user seeing `ERROR: ...` in build output has no way to predict whether
+the command will fail. Worse: in build v2 the class names and the numeric
+severities disagree — `ModelSyntaxWarning` (severity 40) outranks
+`ConsistencyError` (severity 30), so a class literally called "Error" is
+*less severe* than one called "Warning". This is the encoded form of the
+bug Magnus reported in
+[CDF-27907](https://cognitedata.atlassian.net/browse/CDF-27907) and
+discussed in #contributors-toolkit on 2026-05-08.
+
+We also have no shared definition of a **phase boundary**, which is the
+natural place to aggregate findings and decide pass/fail.
+
+## Goal
+
+**A command runs to completion, collecting warnings and errors, and
+reports them at the next phase boundary — unless a `ToolkitError` is
+raised.**
+
+`ToolkitError` is reserved for conditions that are by definition
+unrecoverable: there is no way to proceed, and continuing would produce
+garbage or hide more useful findings. The user must address the
+underlying cause before it makes sense to retry. We accept that these
+abort immediately at the raise site.
+
+Everything else — schema mismatches, missing references, unresolved
+variables, naming-convention violations, recommendations — is collected
+as the command runs. At the phase boundary the toolkit prints the full
+set in one pass and decides pass/fail based on whether any of the
+collected issues are errors. The user sees every problem of a given kind
+at once instead of fixing them one at a time, rerunning, and hitting the
+next.
+
+## Decision
+
+We adopt the **collect-then-fail-at-phase-boundary** model (the same
+model used by `ruff`, `mypy`, and `tsc`). Three categories, three
+contracts:
+
+### 1. Warnings — informational, never block
+
+Severities `WARNING`, `NOTICE`, `HINT` (exact names TBD with Magnus).
+Collected and rendered. The command **always proceeds** and exits 0 from
+these alone. Users can suppress specific warning classes via `cdf.toml`
+(planned).
+
+### 2. Errors — collected, reported in bulk, fail the phase
+
+`SeverityLevel.ERROR` issues are collected during a phase. At the
+**phase boundary**, the toolkit:
+
+1. Prints every error collected during the phase, grouped and with
+   source paths.
+2. Prints a summary line (`Found N errors in M files`).
+3. Exits non-zero. No further phases run.
+
+The user sees **all** failure modes from that phase in a single pass
+instead of fix-one-rerun whack-a-mole. The `WarningList` (or its
+successor) gains a `has_errors()` method that command entry points
+consult to set the exit code.
+
+### 3. `ToolkitError` exceptions — abort immediately
+
+Reserved for conditions that genuinely cannot be collected and continued
+past: missing auth, network/IO failure, unreachable CDF project,
+internal invariant violations. These still raise, are caught at the CLI
+root, and abort at the raise site.
+
+The intuition: if continuing past the condition would produce garbage or
+hide more useful findings, raise `ToolkitError`. If continuing would
+surface *more* useful findings of the same kind, emit a
+`SeverityLevel.ERROR` and let the phase boundary handle the exit code.
+
+### Rendered prefixes (user-facing contract)
+
+- `ERROR:` — this will fail the build/phase. Every `ERROR:` you see is
+  shown before exit; fix them all before rerunning.
+- `WARNING:` / `NOTICE:` / `HINT:` — fyi, the command continues.
+
+## Phase boundary — definition
+
+A **phase boundary** is the point in a command's execution where the
+toolkit finishes a logically complete unit of work and can meaningfully
+decide "proceed or stop." Concretely in the toolkit today:
+
+| Command | Phases (in order) | Boundary behaviour |
+| --- | --- | --- |
+| `cdf build` | 1. Read & parse modules → 2. Validate → 3. Write build dir | Collect errors across phases 1 and 2. Fail at the end of phase 2 before any files are written. |
+| `cdf deploy` | 1. Load build dir → 2. Plan/diff → 3. Apply per resource type | Collect errors during phases 1 and 2. Fail at end of phase 2. Phase 3 stays fail-fast (side effects on CDF). |
+| `cdf clean` | 1. Load build dir → 2. Plan deletions → 3. Apply | Same shape as `deploy`. |
+| `cdf modules <add\|init>` | Single phase | No aggregation needed; raise `ToolkitError` on failure. |
+
+The principle: **aggregate within a phase, fail at its boundary, do not
+enter the next phase if the previous one had errors.** Phases with side
+effects on CDF (the actual API calls in `deploy`/`clean`/`purge`) remain
+fail-fast — once you've started mutating CDF, accumulating more errors
+helps no one.
+
+A phase boundary is *not* the same as a Python function return or a CLI
+command exit. It is the architectural seam between "gathering
+information" and "acting on it." `build` has one such seam (validate →
+write). `deploy` has one (plan → apply). That is the only place an
+aggregated error report is appropriate.
+
+## Why
+
+- **Eliminates whack-a-mole.** Users with messy YAML get one
+  comprehensive report instead of fixing one error, rerunning, hitting
+  the next, rerunning, repeat. This is the single biggest CLI UX win.
+- **Removes the `ERROR:`-but-build-passes ambiguity** flagged in
+  CDF-27907 — `ERROR:` always means "the phase failed."
+- **Keeps `ToolkitError` exceptions for what they are good at** —
+  unrecoverable conditions where collecting more state is pointless or
+  unsafe.
+- **Matches the model users already know** from `ruff`, `mypy`, `tsc`,
+  `eslint`. Familiar mental model, no toolkit-specific re-learning.
+- **Side-effect commands stay safe.** Because `deploy`'s mutation phase
+  stays fail-fast, we don't introduce the failure mode "kept calling
+  CDF after the first 400 to collect more 400s."
+
+## Consequences
+
+- `WarningList` needs a `has_errors()` method and command entry points
+  need to consult it when setting the exit code. The build v2
+  `InsightList` already has this — it should be the reference
+  implementation, and the legacy `WarningList` should gain the same
+  wiring (or the two should converge).
+- Every place currently using `raise ToolkitValidationError(...)`
+  mid-validation needs review: if the surrounding loop can continue and
+  surface more findings, it should emit an error-level issue instead of
+  raising. If continuing would corrupt state, keep the raise.
+- The legacy `SeverityLevel` enum will be reshaped (see follow-up plan
+  and CDF-27907) so the rendered prefix matches this contract.
+- Build v2 class names need to align with their numeric severity:
+  `ModelSyntaxWarning` (40) → rename to indicate it's an error;
+  `ConsistencyError` (30) — either rename to a warning level or
+  increase the number above syntax. Tracked as part of
+  [CDF-27968](https://cognitedata.atlassian.net/browse/CDF-27968).
+- Long-term: converge the three systems into one. The legacy
+  `tk_warnings`, the build v2 `_insights`, and the `dataio/logger`
+  `Severity` all model the same concept differently. Picking one model
+  and migrating is its own work item, larger than this TDL.