From 229f0463536bd30ca54576e5da5f633f1874bfed Mon Sep 17 00:00:00 2001
From: Andrii Pasternak <andriipasternak31@gmail.com>
Date: Sat, 13 Jun 2026 00:27:17 +0100
Subject: [PATCH 1/3] docs(#1095): transactional agent executions
 research/design gate
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Records the research deliverable for #1095 (research-gated): mechanism
selection (transaction-owned git ref, serialized+git v1), the §0 evidence
GO/NO-GO gate with a reachable NO-GO verdict on today's sample, the §9
hazard register, the §11 crash-injection matrix, and the §13 reachable
implementation path. Implementation remains gated on a production-evidence
GO + human review; §13 step 1 (the H4 CAS fix) is the one piece shipped
independently of that gate.

Refs #1095

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../TRANSACTIONAL_EXECUTIONS_2026-06.md       | 488 ++++++++++++++++++
 1 file changed, 488 insertions(+)
 create mode 100644 docs/planning/TRANSACTIONAL_EXECUTIONS_2026-06.md

diff --git a/docs/planning/TRANSACTIONAL_EXECUTIONS_2026-06.md b/docs/planning/TRANSACTIONAL_EXECUTIONS_2026-06.md
new file mode 100644
index 00000000..61d14fd4
--- /dev/null
+++ b/docs/planning/TRANSACTIONAL_EXECUTIONS_2026-06.md
@@ -0,0 +1,488 @@
+# Transactional Agent Executions — Design Recommendation (#1095)
+
+| Field | Value |
+|---|---|
+| **Status** | Research / design — **GO/NO-GO gate**. Implementation does not start until this is reviewed and a GO is recorded. |
+| **Date** | 2026-06-12 |
+| **Issue** | [#1095](https://github.com/Abilityai/trinity/issues/1095) — *feat: transactional agent executions — discard workspace changes unless validated as success (research-gated)* · P1 · `theme-reliability` · `complexity-high` |
+| **Branch** | `AndriiPasternak31/plan-issue-1095` (fresh off `dev`; no implementation in this phase) |
+| **Scope** | **Workspace-filesystem-scoped only.** External side effects (email / Slack / git push / payment) are explicitly **out of scope** — that is effect-idempotency, [#1084](https://github.com/Abilityai/trinity/issues/1084). |
+| **Supersedes** | [#27](https://github.com/Abilityai/trinity/issues/27) *Git Worktrees for Task Isolation* (CLOSED) — folded in as the **future concurrent track**, §10.2. |
+| **Verdict** | **NO-GO for a broad mechanism on today's evidence; a narrow serialized-git v1 is GO *only* after the production evidence query in §0 clears the kill criterion.** See §0. |
+
+> **How to read this doc.** §0 is the real first deliverable: an evidence gate that can — and on the sample available today does — return **NO-GO**. Everything after §0 is the design that becomes reachable *if* §0 clears on a representative production instance. The doc deliberately does **not** assume the build.
+
+---
+
+## 0. Evidence & GO/NO-GO gate (the real first deliverable)
+
+The issue describes "recurring operator pain": a turn fails and leaves leftovers; the next turn starts dirty. A reliability subsystem (snapshot → run → validate → commit/rollback) is justified **only if that pain is real and frequent**. Before recommending a build, we state a **kill criterion** and measure against it.
+
+### 0.1 Kill criterion
+
+> **Build only if** failed/soft-failed turns that leave **harmful** workspace leftovers (changes the next turn must not inherit) occur often enough, and on large enough workspaces, that the operator cost exceeds the cost of a new transactional subsystem and its hazards (§9).
+>
+> **NO-GO if** harmful-leftover failures are rare, OR the failures that *could* leave leftovers are dominated by the **indeterminate** class (the writer's fate is unknowable, so you cannot safely roll back anyway — §6.1), OR workspaces are small enough that a cheap targeted cleanup beats a transaction.
+
+This is falsifiable in both directions. The doc is required to allow the NO-GO conclusion.
+
+### 0.2 Evidence gathered (real data, not assumption)
+
+Source: the live `schedule_executions` table from a real instance — `trinity.db` extracted read-only from the `trinity_trinity-data` Docker volume (356 MB; mtime 2026-05-30; rows span **2026-05-15 → 2026-05-17**). This is the production schema, queried directly.
+
+**Proxy validity — one-directional only (corrected).** `tool_calls` is written **only on the SUCCESS terminal** (`task_execution_service.py:903`, the lone call site that passes the column); every FAILED terminal (`:1088–1099`) and the auth-503 path omit it. So the proxy is sound in **one** direction — **`tool_calls` present ⇒ the turn mutated the workspace** (16/16 `success` rows populate it) — but the converse does **not** hold: **`tool_calls` empty on a failed row is a persistence artifact, not evidence of "no writes."** "0/6 failed rows populate it" below is therefore a *tautology of the write path*, not a measurement — a failed turn that wrote files records an empty `tool_calls` just the same. We lean on the *qualitative* failure classification (below) and on duration, **never** on tool_calls-emptiness, to reason about failed-turn leftovers.
+
+| Metric | Value |
+|---|---|
+| Total executions sampled | **23** |
+| `success` | 16 (69.6%) — avg 50 s, max 384 s; all 16 made tool calls (Bash heredocs writing `predictions.csv`, `Edit` on `.claude/skills/*/workspace/`) |
+| `failed` | **6 (26.1%)** |
+| `cancelled` | 1 (4.3%) — `agent_deleted`, 0 tool calls |
+| Failed turns that recorded **any** workspace tool call | **0 of 6** |
+| Soft-fail / validation columns populated (`validated_at`, `validation_execution_id`, `business_status`) | **0 of 23** |
+| Reader-race retries in sample (`retry_count` sum, `attempt_number` max) | 0 / 1 |
+
+**The 6 failures, classified:**
+
+| Class | n | Examples (agent · dur · error) | Workspace-leftover risk |
+|---|---|---|---|
+| **Auth — caught early in the turn** | 3 | `cleon` · 1.2 s · *Not logged in*; `cornelius-oracle` · 7.1 s · *API 401*; `oracle-3-ai` · 0.9 s · *Not logged in* | **Low, not structurally zero.** The `claude` subprocess spawns **unconditionally** (`claude_code.py:280–304`; no pre-flight auth gate); auth/401 is detected **mid-stream** (`:437–448`), not before Claude starts. The sub-2 s durations make pre-write failure near-certain *in this sample*, but a mid-turn 401 (token expiry / the 7.1 s case) could in principle land after a `tool_use` already ran. |
+| **Transport — backend never saw the result** | 3 | `test-event-subscriptions` · 10.1 s · `RemoteProtocolError`; `test-parallel-task` · **39.0 s** · `RemoteProtocolError`; `test-parallel-task` · 3.1 s · `ConnectError` | **Indeterminate** — `tool_calls` is empty because the result stream broke, *not* because no work happened. The 39 s turn especially could have written files the backend cannot see. |
+
+### 0.3 What the evidence says
+
+1. **Directly-observed harmful leftovers = 0/23 — where "observed" is itself weak.** No failure in the sample is the issue's archetype (a turn that wrote files, then failed, leaving a dirty tree the next run inherits). But because failed rows never record `tool_calls` (§0.2), "observed" here means only "no *backend-visible* leftover" — and that blindness is exactly finding #2.
+2. **The only failures that *could* leave leftovers are the 3 transport failures — and those are exactly the class you *cannot* safely roll back**, because the backend has no proof the writer is dead or what it wrote (§6.1, H5). A naive "rollback on failure" would either (a) quarantine them anyway, or (b) risk discarding a turn that actually *succeeded* on the agent side but lost its result on the wire. This is the indeterminate-state problem, and it is the dominant failure class here.
+3. **Soft-fail is not evidenced at all.** The validation machinery (`business_status`, `validated_at`) exists in the schema but is 100% unused in this sample — so "exit 0 but bad output" is, on this data, hypothetical.
+
+### 0.4 Honest caveats (do not over-read the sample)
+
+- **n = 23 is tiny** and the window is 3 days. This is a dev/test-flavored instance (agents named `test-parallel-task`, `test-event-subscriptions`); it is **not** a production fleet of long-running, file-writing scheduled jobs.
+- **`tool_calls = 0` on *any* failed row ≠ "no writes."** Failed terminals never record `tool_calls` (success-only write path, §0.2), so the backend is blind on **every** failure class — transport *and* auth. Auth failures are near-zero-risk here only *empirically* (they failed in under ~7 s), not because the column proves it.
+- A production instance with heavy scheduled file-mutating work could show a materially higher harmful-leftover rate. **The sample cannot rule that in or out.**
+
+### 0.5 Verdict — and the gate to flip it
+
+On the evidence available today: **NO-GO for a broad transactional subsystem.** The pain is unproven, and the failure class that could cause it is the one a transaction cannot safely act on.
+
+**This NO-GO is conditional, not final.** Before any GO, run the kill-criterion query below against a **representative production instance** (≥ 30 days, real scheduled workload). The doc is a GO only if the numbers clear the criterion *and* a reviewer accepts the §6/§9 hazards.
+
+```sql
+-- Run on a production trinity.db (≥30d window). Harmful-leftover proxy =
+-- a non-success terminal that recorded a workspace-mutating tool call.
+WITH term AS (
+  SELECT id, agent_name, status, duration_ms, error,
+         CASE WHEN tool_calls IS NULL OR tool_calls IN ('','[]') THEN 0 ELSE 1 END AS wrote,
+         -- did the recorded tool_calls include a filesystem mutation?
+         -- NB: schedule_executions.tool_calls stores the RAW Claude stream-json
+         -- (task_execution_service.py:856-862 → response_data["execution_log"]),
+         -- where a tool call is {"type":"tool_use","name":"Write",...} — keyed
+         -- "name", NOT "tool". The simplified "tool" key only lands in
+         -- chat_messages, never here, so a '"tool": ...' LIKE matches nothing.
+         -- Bash over-counts (most Bash calls don't mutate the tree); treat
+         -- maybe_fs as an upper bound and spot-check the matched rows.
+         CASE WHEN tool_calls LIKE '%"name": "Write"%'
+               OR tool_calls LIKE '%"name": "Edit"%'
+               OR tool_calls LIKE '%"name": "NotebookEdit"%'
+               OR tool_calls LIKE '%"name": "Bash"%' THEN 1 ELSE 0 END AS maybe_fs
+  FROM schedule_executions
+  -- ISO-Z cutoff with sub-second precision to match utc_now_iso() (Invariant #16);
+  -- a bare-second Z cutoff ('...:%SZ') drops boundary-second rows because '.' < 'Z'.
+  WHERE started_at >= strftime('%Y-%m-%dT%H:%M:%f','now','-30 days') || 'Z'
+)
+SELECT
+  COUNT(*)                                              AS terminals,
+  SUM(status='failed')                                 AS failed,
+  SUM(status='failed' AND wrote=1)                     AS failed_with_tools,        -- backend-visible leftovers
+  SUM(status='failed' AND maybe_fs=1)                  AS failed_with_fs_writes,    -- the harmful-leftover proxy
+  SUM(status='failed' AND wrote=0
+        AND error LIKE '%RemoteProtocolError%')         AS failed_transport_indet,   -- indeterminate (can't roll back)
+  ROUND(100.0*SUM(status='failed' AND maybe_fs=1)/NULLIF(SUM(status='failed'),0),1) AS harmful_pct_of_failures
+FROM term;
+```
+
+> **Decision rule.** If `failed_with_fs_writes` is a small fraction of failures **and** `failed_transport_indet` dominates → **NO-GO** (cheaper to add lease re-delivery from a clean tree, §6.4, than a transaction). If `harmful_pct_of_failures` is materially high on real workspaces → proceed to the §1 narrow v1. Workspace-size sampling (below) sets the cost side of the inequality.
+
+Workspace-size sampling (cost side of the inequality), per running agent container:
+
+```bash
+# Bytes the transaction would have to snapshot/diff, excluding hard-excludes (D1/§3).
+docker exec agent-<name> du -sx --exclude=.git --exclude=node_modules \
+  --exclude='*/.venv' --exclude=content --exclude=.trinity/backup /home/developer
+```
+
+---
+
+## 1. Problem framing & v1 scope
+
+The issue decouples two sub-problems; the doc keeps them decoupled:
+
+1. **Transaction boundary** — snapshot → run → commit-or-discard, *workspace only*.
+2. **Validation gate** — what decides "success?" Hard-fail is a typed terminal `error_code`; the painful (and here unevidenced) case is **soft-fail** (exit 0, bad output), which needs an **explicit** validator (§4).
+
+**v1 = the single defensible quadrant (OV1):**
+
+> **Serialized** (`max_parallel_tasks == 1`) **+ clean-baseline git** workspace **+ per-execution opt-in** (§7) **+ container-local coordinator** (§3), **gated behind §0**.
+
+Everything else is **named but deferred** (§10): non-git/tar workspaces, concurrent worktree-per-execution (#27), external side effects (#1084).
+
+### Relationship to the pull / work-stealing direction (`TARGET_ARCHITECTURE.md`)
+
+This is not a side feature — it composes with the active execution-stack direction:
+
+- `TARGET_ARCHITECTURE.md:276` already plans `~/.trinity/post-check` ("runs after every task completion, language-agnostic, shebang-selected … output validation"). **The v1 validator IS that hook** (§4).
+- The pull model's recovery primitive is **lease-expiry re-delivery** (`:195`–`:203`): a dead turn's row flips back to `queued` and is re-pulled with the **same `execution_id` + idempotency key**, absorbed by a result-POST **compare-and-set** (`:274`). The workspace transaction must make a **re-pulled turn start from a clean tree** — otherwise re-delivery resumes on the previous attempt's leftovers. This is the synergy the issue's "Lease re-delivery synergy" research question asks us to confirm (§6.4).
+- The agent-side coordinator is the natural home: the pull worker pool is "built on the existing in-container asyncio-loop precedent (`auto_sync.py`)" (`:268`). The transaction coordinator wraps that same worker (§3) — consistent with Invariant #8 (Trinity is not the orchestration engine; agents own multi-stage work).
+
+---
+
+## 2. v1 mechanism — serialized git, transaction-owned ref (T2 · D1 · OV2)
+
+A git-native workspace already has the primitive we want: a commit is an atomic content snapshot, and a ref is a cheap pointer to one. v1 uses a **transaction-owned ref** to bound a turn's writes, instead of the destructive `reset --hard && git clean -fd` that the issue's "destructive-reset-then-overlay" precedent (`git.py:1140`) uses for a different purpose.
+
+### 2.1 Clean-baseline precondition (H6)
+
+Before the turn, the coordinator inspects the workspace:
+
+- **Clean** (no staged/modified/untracked-non-ignored changes vs `HEAD`) → record `baseline = HEAD`, proceed.
+- **Dirty** → **reject the transaction** for this turn (run non-transactionally, log a warning) **or** capture the dirty delta into a `tx-dirty` stash first so it is *not* attributed to — and not discarded by — the turn. v1 ships **reject-and-warn**; "snapshot-dirty-separately" is a documented follow-up.
+
+Rationale: a transaction that rolls back must roll back **exactly the turn's writes**, never a human's in-flight terminal edit or another writer's commit (§6.2). Without a clean baseline you cannot attribute the delta, so you cannot safely discard it.
+
+### 2.2 Boundary
+
+```
+baseline = current HEAD (clean)        # the only thing rollback returns to
+run the turn → it writes into the working tree as normal
+on terminal:
+  COMMIT  : git add -A (within include-set, §2.4)
+            git commit  →  advance the working branch to the new commit
+            (commit gated on the execution-row CAS, §6.3)
+  ROLLBACK: restore the working tree to `baseline` for the include-set ONLY,
+            using a tree-level checkout, NOT `reset --hard`/`clean -fd`:
+              git restore --source=<baseline> --staged --worktree -- <include-paths>
+              git clean   -fd -- <include-paths>          # scoped to include-set, never bare
+            then overlay the rollback-survives journal set (§2.3)
+```
+
+The transaction owns a **ref/index it created** (e.g. a `refs/trinity/tx/<execution_id>` pointer to `baseline`). Commit advances the working branch; rollback discards the tx ref and restores the include-set to `baseline`. The working branch never moves on rollback.
+
+### 2.3 Rollback-survives = a **new narrow journal set** (D1 — load-bearing)
+
+On rollback the agent must still **remember it failed** (so it does not loop, and #560 failure-memory has something to persist). The set of paths that survive rollback is a **new, narrow journal set**, e.g.:
+
+```
+.trinity/journal/**          # failed-attempt records (this turn's audit)
+.trinity/failure-memory/**   # #560 persisted failure memory
+```
+
+**It is NOT the persistent-state allowlist.** The default allowlist (`files.py:229`, verified) is:
+
+```python
+_DEFAULT_PERSISTENT_STATE = ["workspace/**", ".trinity/**", ".mcp.json",
+                             ".claude.json", ".claude/.credentials.json"]
+```
+
+`workspace/**` is the *work directory* — the very thing a rollback must discard. **Reusing the persistent-state allowlist as the rollback-survives set would neuter the rollback.** The two sets are deliberately separate: the allowlist answers "what survives a *reset-to-main*"; the journal set answers "what survives a *transaction rollback*." Overlap is intentional only at `.trinity/journal` / failure-memory, which must be excluded from the transaction's include-set so the rollback restore does not clobber them.
+
+### 2.4 Snapshot scope = explicit include-set + hard excludes (D1)
+
+The transaction acts on an **explicit include-set**, NOT "the complement of the persistent-state allowlist." Hard excludes (never snapshotted, never rolled back):
+
+| Exclude | Why |
+|---|---|
+| `.git/` | The ref machinery itself; rolling it back corrupts the transaction. |
+| `node_modules/`, `*/.venv`, `__pycache__`, build caches | Huge, regenerable; snapshotting them makes the transaction unaffordable (§0.5 sizing). |
+| `content/` | Generated assets, gitignored by template convention. |
+| `.trinity/backup/`, `.trinity/journal/`, `.trinity/failure-memory/` | Backup target + the rollback-survives journal set (§2.3). |
+| **Mounted volumes under `/home/developer`** — `shared-out` (`crud.py:652`), `shared-in/{agent}` (`crud.py:663` → `/home/developer/shared-in/{source_agent}`) | **Device-boundary (H8):** these are *separate Docker volumes*, not the agent's workspace volume. A git operation that crosses into them touches another agent's data. They must be path-excluded. |
+
+---
+
+## 3. Coordination boundary — container-local (T4 · §3)
+
+Snapshot → run → quiesce → validate → commit/rollback all live in **one container-local coordinator** that wraps the Claude subprocess — **not** backend `docker-exec` orchestration.
+
+**Why container-local:**
+
+- **No network/crash gap.** If the boundary were driven by the backend over `docker exec` (the `pre_check_service.py` pattern), a backend restart or a dropped connection *between* "run" and "commit/rollback" leaves the workspace in an undefined state with no local owner to finish the transaction. The 3 transport failures in §0.2 are exactly this gap. A coordinator co-located with the writer survives backend disconnects.
+- **Fits the target direction.** The pull worker pool already runs in-container on the `auto_sync.py` asyncio precedent (`TARGET_ARCHITECTURE.md:268`). The coordinator is a wrapper around that worker's per-task body: `claim → snapshot → run Claude → post-check → commit/rollback → POST result`. Invariant #8: the agent owns the multi-stage work; Trinity provides the queue + result CAS, not a transaction executor.
+- **Single terminal writer.** The coordinator is the one place that classifies the terminal and decides commit vs rollback, which is what makes the §6.3 CAS gate meaningful.
+
+**Consequence / open question:** today the terminal is classified **backend-side** in `task_execution_service.py` (§6.3). A container-local coordinator either (a) moves terminal classification into the agent (the pull model already moves result-reporting agent-side, `:274`), or (b) keeps classification backend-side and has the coordinator *act on the backend's verdict* — which reintroduces a network round-trip inside the transaction. v1 recommends (a), aligned with pull; this is flagged in §9 (H11) as the highest-coupling decision.
+
+---
+
+## 4. Validation gate — `~/.trinity/post-check` (T4 · §4 · H10 · H12)
+
+### 4.1 The hook
+
+Mirror `pre_check_service.py` (`run_pre_check`, `:44`, verified): an executable `~/.trinity/post-check`, **shebang-selected interpreter** (Trinity prefixes none), stdout/exit contract, stdout capped at 32 KB. Semantics for the transaction:
+
+| Post-check outcome | Decision |
+|---|---|
+| **absent** | **COMMIT (fail-open).** No validator configured ⇒ only *hard* failures (typed `error_code` / non-zero / timeout) roll back. Matches the issue's "default off / partial persistence is often wanted." |
+| exit 0 | COMMIT (gated on §6.3 CAS). |
+| exit ≠ 0 | ROLLBACK (soft-fail caught). |
+| timeout | ROLLBACK (treated as hard-fail of the validator). |
+
+> Note the `pre_check_service` subtlety to mirror correctly: it tests the hook with `-f`, not `-x`, so a present-but-non-executable hook surfaces as exit-126 rather than a silent skip. The post-check must make the same choice deliberately (recommend: treat non-executable as a **config error → fail-open + operator-queue warning**, never a silent rollback).
+
+### 4.2 Validator hardening (H10)
+
+The validator runs **inside the workspace the turn just wrote** — so a malicious or buggy turn could rewrite its own validator to always pass.
+
+- **Capture the validator pre-turn.** Read `~/.trinity/post-check` (and digest it) **before** the turn runs; execute the captured copy, not whatever the turn left behind. A digest mismatch at validation time is itself a rollback trigger.
+- **No-write contract.** The validator runs read-only (recommend: a read-only bind of the work tree, or run it *after* the snapshot so any writes it makes are themselves outside the committed set). A validator that mutates the workspace breaks the "validated state == committed state" equality.
+
+### 4.3 Relationship to `ValidationService` (H12 — avoid recursion)
+
+`ValidationService.validate_execution` (`validation_service.py:127`, verified) dispatches a **separate** `execute_task(triggered_by="validation")` turn (`:186`) — a second Claude run against the same workspace. The post-check hook (a cheap deterministic script) and `ValidationService` (a full LLM turn) are **different validators**; the doc must prevent them from composing into recursion:
+
+- A `triggered_by="validation"` execution **must NOT itself open a workspace transaction** (§7 excludes internal executions). Otherwise validating turn T opens a transaction that runs a validation turn that opens a transaction…
+- Recommendation: the post-check hook is the **v1 validator**; `ValidationService` stays an orthogonal, opt-in, backend-side quality gate that reads `business_status` and never participates in the transaction boundary. Wiring `ValidationService` *as* the transaction validator is explicitly a **future** question, not v1.
+
+---
+
+## 5. Transaction lifecycle (belongs in the doc and in the coordinator's code comments)
+
+```
+                 dirty? ── yes ──> REJECT tx (run non-transactionally; warn)   [§2.1]
+                   │ no
+  baseline = HEAD ─┤
+                   ▼
+              [ RUN turn ]   (other writers quiesced or path-excluded — §6.2)
+                   │
+          terminal classified
+        ┌──────────┴───────────┐
+   hard-fail / timeout      success
+        │                      │
+   writer provably dead?       │
+        ├── no ──> QUARANTINE (indeterminate; reconcile on restart — §6.1/§6.3)
+        │ yes                  │
+        │              [ post-check ]── absent ──> COMMIT (fail-open)   [§4.1]
+        │                 │exit≠0   │exit0/timeout→rollback
+        ▼                 ▼         ▼
+     ROLLBACK ◄───────────┘       COMMIT ── gated on execution-row CAS ──┐   [§6.3]
+   (restore include-set to                                              │
+    baseline; NO reset --hard;                                     won CAS? ── no ──> reconcile
+    overlay journal set §2.3)                                           │ yes          (someone else
+        │                                                               │               wrote terminal)
+        └────────────► persist failed-attempt journal              COMMIT done
+```
+
+---
+
+## 6. Atomicity & quiescence — the load-bearing hard part (T3 · H3/H4/H5/H9)
+
+A workspace transaction is **not** an atomic DB transaction. The honest framing: the filesystem and the `schedule_executions` row are two stores with crash windows between them. The design's job is to make every window **recoverable**, not to pretend it cannot happen.
+
+### 6.1 Quiescence — rollback only when the writer is **provably dead** (H5)
+
+Rollback restores the tree to `baseline`. If the writer is *still running* when rollback fires, it races the restore and corrupts both. Therefore:
+
+- **Rollback requires proof of death**, not a best-effort terminate. A backend `POST /stop` that returns 404 ("may have already finished") is **not** proof — it yields an **indeterminate** terminal.
+- An indeterminate terminal → **QUARANTINE** (mark the row, leave the tree as-is, reconcile on restart — §6.3), **never a blind rollback.** This is precisely why the 3 transport failures in §0.2 cannot be auto-rolled-back: the backend has no death proof.
+- Container-local coordination (§3) makes "provably dead" cheap: the coordinator owns the subprocess handle; `waitpid`/exit-status is local truth, no network inference. This is a concrete argument *for* §3 over backend orchestration.
+
+### 6.2 Concurrent non-task writers (H1 — the quiet killer)
+
+The workspace is not single-writer even when `max_parallel_tasks == 1`. Other writers touch `/home/developer` outside the turn:
+
+| Writer | Evidence | Mitigation |
+|---|---|---|
+| **15-min git auto-sync** — `git add -A` + commit + push of the whole tree on a timer | `git.py:152` (`_run_auto_sync_once`) driven by `auto_sync.py:run_auto_sync_loop` (900 s, `:44`), **independent of any execution** | **Must be quiesced for the turn.** Auto-sync committing mid-transaction would either capture the turn's uncommitted writes or race the tx ref. Recommend: a coordinator-held lock that pauses auto-sync while a transaction is open (auto-sync already gates on `should_run_auto_sync()`). |
+| Interactive terminal / SSH session | `agent_ssh.py`, container exec | Out of scope to quiesce; **path-exclude** or accept that concurrent human edits void the clean-baseline (§2.1 rejects a dirty tree, which catches the common case). |
+| File API writes, `mkdir` | `agent_files.py` | Same as terminal. |
+| Credential injection / hot-reload (`.env`, `.mcp.json`) | `credentials.py` | Already in the persistent-state allowlist; **path-exclude** from the tx include-set. |
+| Shared-folder mounts | `crud.py:663` | Device-boundary exclude (§2.4 / H8). |
+
+**The transaction is only as atomic as the writer set it controls.** v1 must either quiesce these for the turn or scope the include-set to exclude their paths. Auto-sync is the one that *must* be quiesced (it commits); the rest can be path-scoped.
+
+### 6.3 FS↔DB atomicity — state machine + restart reconciliation (H3/H4)
+
+Define a **durable transaction state** alongside the execution row (recommend a `tx_state` column or a `.trinity/journal/tx-<execution_id>.json` the coordinator writes), with these states and crash-window recoveries:
+
+| tx_state | Meaning | Crash here → restart reconciliation |
+|---|---|---|
+| `open` | baseline recorded, turn running | Writer is dead (process gone) → **QUARANTINE**; operator decides. Never auto-rollback (can't prove what was written). |
+| `committing` | post-check passed, commit started | Re-check: did the commit land (working branch == new tree)? Yes → finish DB write. No → re-attempt commit idempotently. |
+| `committed_fs` | git commit landed, DB terminal **not yet** written | **The dangerous window.** The CAS write (below) is replayed on restart; the commit is durable in git, so reconciliation just (re)applies the terminal status. |
+| `rolling_back` | rollback started | Re-run the idempotent restore (restore-to-baseline is naturally idempotent); then write the FAILED terminal. |
+| `rolled_back` / `done` | terminal | No action. |
+
+**CAS-gated commit (H4 — fixes a real, present bug).** `update_execution_status` already returns the CAS winner — `result.rowcount > 0` at `db/schedules.py:1377` (the SUCCESS branch updates `WHERE id=? AND status != 'cancelled'`; the non-success branch `WHERE id=? AND status NOT IN (terminal)`). **The caller discards it** at `task_execution_service.py:896`–908: it proceeds to complete the activity and reset the breaker unconditionally, whether or not it won the terminal transition.
+
+> **Writer asymmetry to converge (ties into H13).** The two terminal writers don't even guard the *same way* today: the SUCCESS path (`:896`) ignores the CAS return outright, while the FAILED path (`:1088–1093`) instead does a `get_execution()` read-then-`status != 'cancelled'` check — a **check-then-act TOCTOU**, not the CAS. The H4 fix must route **both** writers through the CAS return value (lost CAS ⇒ reconcile, never act), so there is exactly one consistent terminal-write gate rather than two divergent ones.
+
+> **The commit must consume the CAS result.** Order: **win the execution-row CAS first, then make the git commit the working branch's new HEAD.** If the CAS is lost (another writer — a cancellation, a lease re-delivery, the watchdog — already wrote a terminal for this `execution_id`), the coordinator must **NOT commit** and must reconcile instead. Committing after losing the CAS is how you bless a rolled-back/cancelled turn's writes as the new baseline. This is the single most important correctness coupling in the design, and it is a one-line bug today.
+
+### 6.4 Reader-race retry (#678) and lease re-delivery — boundary is per-**attempt** (H9)
+
+Two mechanisms reuse the **same `execution_id`** and re-run work:
+
+- **#678 reader-race retry** (`task_execution_service.py:758`; second dispatch `:831`, verified): on a 502 with the reader-race signature, the **same `execution_id`** is re-dispatched **inline, with no workspace rollback** — the tree is left exactly as attempt 1 left it.
+- **Lease-expiry re-delivery** (target model, `TARGET_ARCHITECTURE.md:201`): a dead turn's row flips back to `queued` and is re-pulled with the **same `execution_id` + idempotency key**.
+
+Both break a per-`execution_id` transaction boundary. If the boundary keys on `execution_id`:
+
+- attempt 2 cannot get a fresh baseline (the tx ref is already owned), and
+- **re-snapshotting before attempt 2 would bless attempt-1's partial damage as the new baseline** — the exact anti-goal.
+
+**Decision: the transaction boundary is per-ATTEMPT, not per-`execution_id`.** Concretely:
+
+- Each attempt records its **own** baseline ref `refs/trinity/tx/<execution_id>/<attempt_number>`.
+- Before attempt 2 (reader-race **or** re-delivery), the coordinator **rolls attempt 1 back to attempt 1's baseline**, then attempt 2 snapshots fresh from that restored (clean) baseline.
+- This makes lease re-delivery **workspace-safe by construction** (the issue's synergy question): a re-pulled turn always starts from a clean tree, never resumes on leftovers (`TARGET_ARCHITECTURE.md:201` says re-delivery "is the same unit of work, never a half-finished turn resumed" — the transaction is what *makes* that true at the filesystem level).
+- **Commit-point ordering w.r.t. the result POST:** commit the git ref **before** the result POST wins the CAS would risk a committed tree with a lost CAS; therefore **win the CAS, then commit** (§6.3), and the result POST carries the terminal. Under pull, the result POST *is* the CAS write — so "CAS then commit" means "the coordinator commits only after its own result POST is accepted."
+
+---
+
+## 7. Opt-in granularity (issue: "Granularity & opt-in")
+
+- **Per-execution opt-in, with an agent-level default** — NOT a bare per-agent flag. Manual exploratory turns often want partial work kept (a research agent accumulating notes); scheduled jobs want rollback. The trigger envelope carries `transactional: true|false`; the agent-level default (`agent_ownership`, default **off**) supplies it when the trigger omits it.
+- **Internal executions are excluded**: `triggered_by="validation"` (§4.3 anti-recursion), maintenance/health turns, and the pre-check hook itself never transact.
+- **Default off** matches the issue ("many agents want partial persistence").
+
+---
+
+## 8. What already exists — reuse vs rebuild (verified against the branch)
+
+| Primitive | Location (verified) | Verdict |
+|---|---|---|
+| `build_snapshot` / `restore_from_tar` (allowlist-MATCH tar) | `docker/base-image/agent_server/routers/snapshot.py:67,77` (the issue body's "in `git.py`" is **wrong** — `git.py` only imports/composes them) | **Future non-git/tar track only**, and only after fixing the restore-delete gap (below). NOT the v1 git mechanism. |
+| ↳ restore semantics (the gap) | `restore_from_tar` writes/overwrites tar members; it **does not delete files created after the snapshot**, **does not recreate empty dirs** (`_collect_files` is `is_file()`-only), and **skips symlinks** (`extractfile` returns `None`). | A tar-based rollback would leave turn-created files behind — **incorrect as a rollback** until an explicit delete/replace phase is added (H2). |
+| `_read_persistent_state()` + default allowlist `["workspace/**", …]` | `files.py:238` (fn), `:229` (default literal) | Read for reference; **do NOT** reuse as the rollback-survives set — it includes `workspace/**` (§2.3). |
+| `reset_to_main_preserve_state_impl()` (snapshot→reset→overlay→force-push) | `git.py:1140` | **Pattern reference only.** v1 uses a tx-owned ref, not reset-to-main, and does not force-push. |
+| `pre_check_service.py` (docker-exec hook; shebang; stdout/exit contract; 32 KB cap) | `src/backend/services/pre_check_service.py:44` | **Mirror** for the post-check validator (§4). |
+| Terminal classification + CAS winner | `task_execution_service.py:896`; `db/schedules.py:1377` | Commit **must** consume the CAS result (today ignored — §6.3). |
+| `ValidationService` (dispatches a separate validation turn) | `validation_service.py:127`,`:186` | Keep orthogonal; must not recurse into a transaction (§4.3). |
+| auto-sync timer (whole-tree `git add -A`/commit/push, 900 s) | `git.py:152` + `auto_sync.py:44` | **Quiesce during a transaction** (§6.2 / H1). |
+
+---
+
+## 9. Hazards & open design questions (mandatory — OV3)
+
+| # | Hazard | Where it bites | v1 disposition |
+|---|---|---|---|
+| **H1** | Concurrent non-task writers (auto-sync `git add -A` mid-turn, terminal, file API, cred injection, shared-folder) | §6.2 | Quiesce auto-sync; path-exclude the rest; clean-baseline rejects the common human-edit case. |
+| **H2** | `restore_from_tar` has no delete phase (leaves turn-created files; no empty-dir/symlink restore) | §8 | Blocks the tar/non-git track until fixed; not in v1. |
+| **H3** | FS↔DB atomicity: crash windows between commit and terminal write | §6.3 | Durable `tx_state` + restart reconciliation table. |
+| **H4** | Commit ignores the execution-row CAS winner (present bug) | §6.3 | Gate commit on CAS; lose CAS ⇒ reconcile, never commit. |
+| **H5** | "Provably dead" before rollback; best-effort terminate is indeterminate | §6.1 | Container-local `waitpid`; indeterminate ⇒ quarantine, not rollback. |
+| **H6** | Clean-baseline precondition | §2.1 | Reject-and-warn on dirty (v1); snapshot-dirty-separately deferred. |
+| **H7** | `reset --hard`/`clean -fd` blast radius (untracked wanted files, branch/refs/stash/submodules, ignored creds survive) | §2.2 | **Rejected** as the primitive; scoped `git restore`/`clean -- <include>` only. |
+| **H8** | Mounted cross-agent volumes under `/home/developer` (`shared-in/out`) | §2.4 | Device-boundary path-exclude. |
+| **H9** | Reader-race retry (#678) & lease re-delivery reuse the same `execution_id` | §6.4 | Boundary is **per-attempt**; rollback attempt 1 before attempt 2. |
+| **H10** | Mutable/untrusted validator (turn rewrites its own post-check) | §4.2 | Capture+digest validator pre-turn; no-write contract. |
+| **H11** | Agent may have already committed / switched branch before validation | §3, §6.4 | Highest-coupling open question: terminal classification moves agent-side (pull-aligned) vs stays backend-side; v1 recommends agent-side. |
+| **H12** | `ValidationService` recursion (validation turn opens a transaction) | §4.3 | Internal executions excluded (§7). |
+| **H13** | Non-single terminal-writer paths must participate: watchdog/cleanup, cancellation, inline chat, container-recreate kills (#1037/#1089) | §6.3 | All terminal writers must go through the CAS; the coordinator must treat a CAS loss to any of them as "do not commit." |
+
+**Top open questions for the reviewer:** (1) **H11** — does v1 move terminal classification agent-side? (2) Is **reject-and-warn** on a dirty baseline acceptable for v1, or is snapshot-dirty-separately required day one? (3) Do we quiesce auto-sync via a shared lock, or disable it entirely while transactional mode is on?
+
+---
+
+## 10. NOT in scope (deferred, with rationale)
+
+### 10.1 Non-git (tar) workspaces — future / evidence-pending
+Reuses `build_snapshot`/`restore_from_tar` **only if** greenlit, and **only after** the restore-delete phase (H2) is added. Named, not detailed.
+
+### 10.2 Concurrent (`max_parallel_tasks > 1`) — future; folds in #27
+Git-worktree-per-execution. Blocked on the **runtime-config-sharing** problem: a fresh worktree does not carry the gitignored `.claude/`, `.env`, `.mcp.json`, `content/` the turn needs. **This is where #27 (CLOSED) lives** — #1095 supersedes it as the worktree implementation track, to be opened only when the concurrent quadrant is greenlit.
+
+### 10.3 External side-effect rollback — #1084
+A workspace transaction is **not** global atomicity. Email/Slack/git-push/payment already sent are not un-sent by a filesystem rollback. `TARGET_ARCHITECTURE.md:482` makes effect-scoped idempotency (#1084) the gate for defaulting side-effect-bearing agents into re-delivery; **the workspace transaction must state this limit explicitly so it does not masquerade as global atomicity.**
+
+### 10.4 Implementation & tests
+A follow-up issue, **gated on this doc's review + a GO** (§0.5). The §11 matrix is the test plan that issue inherits.
+
+### 10.5 `TODOS.md`
+Repo tracks work via GitHub Issues; the follow-up is filed as an issue, not a TODOS file.
+
+---
+
+## 11. Crash-injection test matrix (the test plan the implementation issue inherits)
+
+Not cosmetic. Each row is a fault injected at a specific phase; the assertion is on the **recovered** state.
+
+| # | Injected fault | Expected recovered state |
+|---|---|---|
+| 1 | Crash between snapshot and run | No tx ref leak; next start re-snapshots from clean `HEAD`. |
+| 2 | Crash mid-run (writer killed) | tx_state=`open` → **quarantine**; tree untouched; operator decision. |
+| 3 | Crash between post-check pass and commit | tx_state=`committing` → commit re-attempted idempotently; exactly one commit. |
+| 4 | Crash after git commit, before DB terminal | tx_state=`committed_fs` → terminal re-applied via CAS; commit durable. |
+| 5 | Crash mid-rollback | Idempotent restore re-runs; ends `rolled_back`; journal set intact. |
+| 6 | **Dirty baseline** at tx start | Tx rejected; turn runs non-transactionally; warning logged. |
+| 7 | **Cancellation during commit** (CAS lost to cancel) | Commit **aborted**; no working-branch advance; reconcile to cancelled. |
+| 8 | Rollback failure (restore errors) | tx_state stuck `rolling_back` → retried; surfaced to operator queue if it keeps failing. |
+| 9 | Backend/container restart mid-transaction | Restart reconciliation (§6.3) drives each tx_state to a defined terminal. |
+| 10 | **Auto-sync collision** (timer fires mid-turn) | Auto-sync quiesced; no mid-tx commit; tx ref intact. |
+| 11 | Terminal edit mid-turn (human SSH write) | Caught by clean-baseline reject (6) or path-excluded; never silently discarded. |
+| 12 | **Reader-race attempt 2** (#678) | Attempt 1 rolled back to its baseline; attempt 2 snapshots fresh; no attempt-1 leftovers. |
+| 13 | **Lease re-delivery** of the same `execution_id` | Re-pulled turn starts from clean tree; per-attempt baseline (§6.4). |
+| 14 | Branch change mid-turn (agent `git switch`) | Detected (H11); transaction voids to quarantine rather than committing onto the wrong branch. |
+| 15 | Symlinks in include-set | Handled (git tracks symlinks natively) — and proves the tar track's symlink gap (H2) is git-track-only. |
+| 16 | Mount-boundary traversal (write into `shared-in`) | Excluded paths untouched by commit/rollback. |
+| 17 | Snapshot / disk exhaustion | Tx fails closed (no partial commit); turn runs non-transactionally or errors cleanly. |
+| 18 | **Happy path: success-commit** | Validated tree == committed tree; working branch advanced once. |
+| 19 | **Happy path: hard-fail rollback** | Non-zero/timeout → tree back to baseline; journal set preserved. |
+| 20 | **Happy path: soft-fail rollback** | post-check exit≠0 → tree back to baseline; journal set preserved. |
+| 21 | Rollback-survives journal preservation | After any rollback, `.trinity/journal/**` + failure-memory survive. |
+
+---
+
+## 12. Answers to the issue's Research Questions (AC self-check)
+
+| Research question | Answer (section) |
+|---|---|
+| **Mechanism** (snapshot-in-place vs worktree vs OverlayFS vs CoW) | v1 = **transaction-owned git ref** for the serialized+git quadrant (§2). Worktree = future concurrent track (§10.2). OverlayFS/CoW rejected: host-FS-dependent, violates "commodity hardware / proven primitives." |
+| **Concurrency** (snapshot-in-place breaks at `max_parallel>1`) | Confirmed. v1 is **serialized-only**; concurrent needs worktrees and is deferred (§10.2). Not one mechanism for both in v1. |
+| **Validation gate** (post-check the right validator? absence handling? soft vs hard) | `~/.trinity/post-check`, **fail-open when absent** (§4.1). Hard-fail = typed `error_code`; soft-fail needs the explicit hook (and is **unevidenced** today, §0.3). |
+| **persistent-state allowlist carve-out** | **Inverse is NOT correct as-is** — the allowlist includes `workspace/**`. The rollback-survives set is a **new narrow journal set** (§2.3, D1). |
+| **Non-git agents** | Git-native in v1; tar track deferred behind the restore-delete fix (§10.1). |
+| **Granularity & opt-in** | **Per-execution**, agent-level default, default off; internal executions excluded (§7). |
+| **Lease re-delivery synergy** | Confirmed: per-attempt rollback makes a re-pulled turn workspace-safe (§6.4). Commit ordering: **win CAS → commit**, result POST carries the terminal. |
+| **Cost / latency** | Git ref + scoped diff is cheap; sizing query in §0.5 bounds it; hard-excludes (§2.4) keep `node_modules`/caches out. |
+| **Relationship to #27** | **#1095 supersedes #27** — #27 becomes the future worktree track (§10.2), opened only if the concurrent quadrant is greenlit. |
+
+**Issue acceptance criteria status after this doc:**
+
+- [x] **Research deliverable (the gate)** — this doc: mechanism selected, Research Questions answered, reachable-from-today path (§13), **reachable NO-GO verdict** (§0).
+- [x] **Decision on #27 relationship** — supersedes (§10.2).
+- [ ] Workspace transaction boundary implemented (serialized) — *follow-up, gated on GO.*
+- [ ] Validation gate wired (post-check; soft+hard rollback; validated success commits) — *follow-up.*
+- [ ] persistent-state / journal paths survive rollback — *follow-up (design fixed in §2.3).*
+- [ ] Opt-in flag (default off) + no-validator behavior — *follow-up (design fixed in §4.1/§7).*
+- [x] **External side effects explicitly NOT rolled back** — §10.3 (links #1084).
+- [ ] Tests (hard/soft/success/allowlist/concurrent) — *follow-up (matrix in §11).*
+- [ ] `architecture.md` / `TARGET_ARCHITECTURE.md` updated — *follow-up, post-GO.*
+
+---
+
+## 13. Reachable-from-today implementation path (if GO)
+
+Each step is independently shippable and reversible; the order front-loads the cheap correctness fix and the evidence.
+
+1. **Fix H4 first, regardless of GO** — make `task_execution_service.py:896` consume the `update_execution_status` CAS winner. This is a one-line latent bug (a lost-CAS writer proceeds as if it won) and is a prerequisite for *any* commit-gating. (Belongs in its own small PR; arguably ship it independent of this feature.)
+2. **Land `~/.trinity/post-check`** mirroring `pre_check_service.py` (`TARGET_ARCHITECTURE.md:276` already plans it) — useful standalone for alerting/validation even with no transaction.
+3. **Container-local coordinator skeleton** wrapping the (pull or current) per-task body: clean-baseline check → record baseline ref → run → classify → **no-op commit/rollback** (feature-flagged off). Proves the boundary without acting on it.
+4. **Wire rollback/commit** behind the per-execution opt-in (default off), serialized+git only, with the §6.3 `tx_state` + restart reconciliation and §6.2 auto-sync quiesce.
+5. **Crash-injection suite** (§11) in CI against a sibling stack before defaulting anything on.
+
+---
+
+## 14. Decision log
+
+| # | Decision | Rationale |
+|---|---|---|
+| **D1** | Snapshot scope = explicit include-set + hard excludes; rollback-survives set = **new narrow journal set**, separate from the persistent-state allowlist | The allowlist includes `workspace/**`; reusing it neuters rollback (§2.3). |
+| **D2** | Mechanism mode-selected by git × concurrency — **narrowed by OV1** to the serialized+git quadrant for v1 | Only the defensible quadrant ships first. |
+| **D3** | Serialized path is the shippable target; concurrent is not co-equal (narrowed by OV1) | Concurrent needs worktrees + config-sharing; deferred (§10.2). |
+| **OV1** | Reframe to a narrow defensible v1: serialized + clean-baseline git + per-execution opt-in + container-local coordinator, gated behind §0 | Outside-voice (Codex) pass; avoids pretending atomicity/global scope. |
+| **OV2** | v1 rollback uses a transaction-owned ref/index + clean-baseline precondition — **NOT** `git reset --hard && git clean -fd` | `reset/clean` blast radius (H7): destroys wanted untracked files; misses branch/refs/stash/submodules; ignored creds & nested repos survive. |
+| **OV3** | Mandatory hazards (§9) + crash-injection matrix (§11) | The additive crash/atomicity/concurrency gaps are the real risk; the doc must not hide them. |
+
+---
+
+*Authored as the #1095 research/design gate. **Implementation is gated on this doc's own §0 GO/NO-GO verdict and a human review — not on the plan review that produced it.** On today's evidence the verdict is NO-GO; re-run §0.5 on a representative production instance to revisit.*

From 65b7f43ed689f6228ee716608c42e3f83a2adc6e Mon Sep 17 00:00:00 2001
From: Andrii Pasternak <andriipasternak31@gmail.com>
Date: Sat, 13 Jun 2026 00:27:31 +0100
Subject: [PATCH 2/3] fix(executor): consume the CAS winner on the SUCCESS
 terminal (#671/H4)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

task_execution_service discarded the bool returned by
db.update_execution_status — an atomic compare-and-set. A SUCCESS write
loses the CAS only to a CANCELLED row, but the caller ran the won-only
side effects unconditionally: it completed the activity as COMPLETED,
reset the dispatch breaker, and returned a SUCCESS result. A turn the
operator already cancelled was therefore reported back as a billable
success.

Now the SUCCESS terminal consumes the CAS verdict: on a lost CAS it
re-reads the row, completes the activity as FAILED (not COMPLETED),
skips the breaker reset, and returns the persisted terminal status with
a new RECONCILED error code instead of SUCCESS. The won path is
unchanged. Lost CAS => reconcile, never act.

Adds tests/unit/test_terminal_write_cas_gate.py (lost-CAS reconcile +
won-path regression pin). The DB-layer CAS itself stays covered by
tests/unit/test_cancelled_not_overwritten.py.

Refs #1095

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../services/task_execution_service.py        |  32 +++-
 tests/unit/test_terminal_write_cas_gate.py    | 157 ++++++++++++++++++
 2 files changed, 188 insertions(+), 1 deletion(-)
 create mode 100644 tests/unit/test_terminal_write_cas_gate.py

diff --git a/src/backend/services/task_execution_service.py b/src/backend/services/task_execution_service.py
index 3f05d4b4..5a30e076 100644
--- a/src/backend/services/task_execution_service.py
+++ b/src/backend/services/task_execution_service.py
@@ -68,6 +68,7 @@ class TaskExecutionErrorCode(str, Enum):
     AGENT_ERROR = "agent_error"     # Agent returned non-zero exit code
     NETWORK = "network"             # HTTP/connection error to agent container
     CIRCUIT_OPEN = "circuit_open"   # Circuit breaker open — agent known unhealthy (#767)
+    RECONCILED = "reconciled"       # Terminal write lost the CAS; row reflects another writer's terminal (#671/H4)
 
 
 @dataclass
@@ -892,8 +893,14 @@ async def execute_task(
                 total_cost = retry_cost
 
             # ---- 6. Update execution record ------------------------------
+            # #671/H4: consume the CAS winner. A SUCCESS write loses the CAS
+            # ONLY to a CANCELLED row (user/operator cancel, watchdog). On a
+            # lost CAS the agent's late "done" must NOT be reported as a
+            # billable success, must NOT complete the activity as COMPLETED,
+            # and must NOT reset the dispatch breaker — reconcile to the
+            # persisted terminal instead (never bless a cancelled turn's work).
             if execution_id:
-                db.update_execution_status(
+                won = db.update_execution_status(
                     execution_id=execution_id,
                     status=TaskExecutionStatus.SUCCESS,
                     response=sanitized_resp,
@@ -906,6 +913,29 @@ async def execute_task(
                     compact_metadata=compact_metadata_json,
                     retry_count=retry_count or None,
                 )
+                if not won:
+                    reconciled = db.get_execution(execution_id)
+                    reconciled_status = (
+                        reconciled.status if reconciled else TaskExecutionStatus.FAILED
+                    )
+                    logger.warning(
+                        "[TaskExecService] SUCCESS write lost CAS for %s — row is "
+                        "%s; reconciling, not reporting success",
+                        execution_id,
+                        reconciled_status,
+                    )
+                    if activity_id:
+                        await activity_service.complete_activity(
+                            activity_id=activity_id,
+                            status=ActivityState.FAILED,
+                            error=f"superseded by {reconciled_status}",
+                        )
+                    return TaskExecutionResult(
+                        execution_id=execution_id,
+                        status=reconciled_status,
+                        response="",
+                        error_code=TaskExecutionErrorCode.RECONCILED,
+                    )
 
             # ---- 7. Complete activity ------------------------------------
             if activity_id:
diff --git a/tests/unit/test_terminal_write_cas_gate.py b/tests/unit/test_terminal_write_cas_gate.py
new file mode 100644
index 00000000..edadb570
--- /dev/null
+++ b/tests/unit/test_terminal_write_cas_gate.py
@@ -0,0 +1,157 @@
+"""Unit tests for the terminal-write CAS gate (#671 / H4, doc §6.3).
+
+`db.update_execution_status` is an atomic compare-and-set that returns the CAS
+winner (`bool`). Every terminal-write caller in
+`TaskExecutionService.execute_task` must consume that verdict: a writer that
+LOST the race (the row was already terminalised — e.g. a user/operator cancel)
+must NOT run "won-only side effects" (complete the activity, reset the dispatch
+breaker, or report success). Lost CAS ⇒ reconcile/skip, never act.
+
+The DB-layer CAS itself is covered by
+`tests/unit/test_cancelled_not_overwritten.py`; this file covers the SERVICE
+layer honouring the verdict.
+
+These are pure unit tests — they run without a backend (the `tests/unit/`
+conftest overrides the parent's api_client-dependent autouse fixtures).
+"""
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+
+_BACKEND = Path(__file__).resolve().parents[2] / "src" / "backend"
+if str(_BACKEND) not in sys.path:
+    sys.path.insert(0, str(_BACKEND))
+
+
+def _make_execution(status: str = "running"):
+    ex = MagicMock()
+    ex.id = "exec-test-001"
+    ex.status = status
+    return ex
+
+
+def _success_response() -> MagicMock:
+    """A healthy 200 agent response that drives execute_task to the SUCCESS
+    terminal (mirrors tests/test_cb_probe_execution_close.py)."""
+    resp = MagicMock()
+    resp.status_code = 200
+    resp.raise_for_status = MagicMock()
+    resp.json.return_value = {
+        "response": "done",
+        "session_id": "sess-001",
+        "metadata": {"cost_usd": 0.01, "input_tokens": 100, "context_window": 200000},
+        "execution_log": [],
+    }
+    return resp
+
+
+class TestSuccessPathCasGate:
+    """SUCCESS terminal must consume the CAS winner (#671/H4)."""
+
+    pytestmark = pytest.mark.unit
+
+    def _run(self, *, cas_won: bool, reconciled_status: str = "cancelled"):
+        """Drive execute_task down the SUCCESS path with a 200 response and a
+        configurable CAS outcome. Returns (result, mock_activity, mock_record).
+        """
+        from services.task_execution_service import TaskExecutionService
+
+        mock_db = MagicMock()
+        mock_db.get_max_parallel_tasks.return_value = 3
+        mock_db.get_execution.return_value = _make_execution(reconciled_status)
+        mock_db.update_execution_status.return_value = cas_won
+
+        mock_capacity = MagicMock()
+        admitted = MagicMock()
+        admitted.state = "admitted"
+        mock_capacity.acquire = AsyncMock(return_value=admitted)
+        mock_capacity.release = AsyncMock()
+
+        mock_circuit = MagicMock()
+        mock_circuit.allow_request.return_value = True  # transport CB closed
+
+        mock_activity = MagicMock(
+            track_activity=AsyncMock(return_value="act-001"),
+            complete_activity=AsyncMock(),
+        )
+        mock_record = AsyncMock()
+
+        with (
+            patch("services.task_execution_service.db", mock_db),
+            patch("services.task_execution_service.get_capacity_manager", return_value=mock_capacity),
+            patch("services.task_execution_service.activity_service", mock_activity),
+            patch("services.task_execution_service.CircuitState", return_value=mock_circuit),
+            patch("services.task_execution_service.agent_post_with_retry", AsyncMock(return_value=_success_response())),
+            patch("services.task_execution_service.dispatch_breaker_active", return_value=False),
+            patch("services.task_execution_service._record_dispatch_terminal", mock_record),
+        ):
+            svc = TaskExecutionService()
+            result = self._await(
+                svc.execute_task(
+                    agent_name="test-agent",
+                    message="hello",
+                    triggered_by="schedule",
+                    execution_id="exec-test-001",
+                    timeout_seconds=300,
+                    model="sonnet",
+                )
+            )
+        return result, mock_activity, mock_record
+
+    @staticmethod
+    def _await(coro):
+        import asyncio
+
+        loop = asyncio.new_event_loop()
+        try:
+            return loop.run_until_complete(coro)
+        finally:
+            loop.close()
+
+    def test_success_lost_cas_reconciles_and_skips_side_effects(self):
+        """When the SUCCESS write loses the CAS (row already cancelled), the
+        turn must reconcile to the persisted terminal and run NO won-only side
+        effects — no COMPLETED activity, no breaker reset, no SUCCESS result."""
+        from models import ActivityState
+        from services.task_execution_service import (
+            TaskExecutionErrorCode,
+            TaskExecutionStatus,
+        )
+
+        result, mock_activity, mock_record = self._run(cas_won=False, reconciled_status="cancelled")
+
+        # Reconciled to the row's real terminal, NOT reported as success.
+        assert result.status != TaskExecutionStatus.SUCCESS
+        assert result.status == TaskExecutionStatus.CANCELLED
+        assert result.error_code == TaskExecutionErrorCode.RECONCILED
+
+        # Breaker must NOT be reset by a writer that lost the CAS.
+        mock_record.assert_not_awaited()
+
+        # Activity must NOT be completed as COMPLETED.
+        completed = [
+            c
+            for c in mock_activity.complete_activity.await_args_list
+            if c.kwargs.get("status") == ActivityState.COMPLETED
+        ]
+        assert not completed, "lost-CAS turn must not complete the activity as COMPLETED"
+
+    def test_success_won_cas_runs_side_effects(self):
+        """When the SUCCESS write wins the CAS, the happy path is unchanged —
+        SUCCESS result, COMPLETED activity, breaker reset. Pins against
+        over-gating."""
+        from models import ActivityState
+        from services.task_execution_service import TaskExecutionStatus
+
+        result, mock_activity, mock_record = self._run(cas_won=True)
+
+        assert result.status == TaskExecutionStatus.SUCCESS
+        mock_record.assert_awaited_once_with("test-agent", False, None)
+        assert any(
+            c.kwargs.get("status") == ActivityState.COMPLETED
+            for c in mock_activity.complete_activity.await_args_list
+        ), "won SUCCESS turn must complete the activity as COMPLETED"

From a49512480ed14a108a7e774972bcb85f52fdca00 Mon Sep 17 00:00:00 2001
From: Andrii Pasternak <andriipasternak31@gmail.com>
Date: Sat, 13 Jun 2026 00:35:03 +0100
Subject: [PATCH 3/3] fix(executor): route FAILED terminals through the CAS
 gate (#671/H4)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Converges the five FAILED-terminal writers (CB fast-fail, timeout, backend
call-budget, HTTPError salvage, generic Exception) onto one helper,
_write_terminal_and_gate, which writes the terminal through
update_execution_status's CAS and completes the activity ONLY when this
writer won. Replaces the old check-then-act guard
(get_execution() -> status != CANCELLED), which both raced (TOCTOU) and
only blocked CANCELLED — the CAS additionally blocks an already
SUCCESS/FAILED row. The HTTPError path's AUTH dispatch-breaker outcome is
now gated on the same won verdict, so a turn superseded by a cancel can't
count toward the breaker. The FAILED TaskExecutionResult return stays
unconditional (only side effects gate). Sites with no side effects
(CapacityFull, CircuitOpen) and the CancelledError 3-state guard are left
as-is.

Adds the FAILED-path lost-CAS coverage to test_terminal_write_cas_gate.py.
Full tests/unit suite green (2135 passed).

Refs #1095

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../services/task_execution_service.py        | 177 ++++++++++--------
 tests/unit/test_terminal_write_cas_gate.py    |  74 ++++++++
 2 files changed, 175 insertions(+), 76 deletions(-)

diff --git a/src/backend/services/task_execution_service.py b/src/backend/services/task_execution_service.py
index 5a30e076..d32f04dd 100644
--- a/src/backend/services/task_execution_service.py
+++ b/src/backend/services/task_execution_service.py
@@ -439,6 +439,55 @@ async def _record_dispatch_terminal(
         _spawn_bg(_audit_circuit_transition(agent_name, "closed"))
 
 
+async def _write_terminal_and_gate(
+    execution_id: Optional[str],
+    activity_id: Optional[str],
+    *,
+    status: str,
+    activity_status: str,
+    error: Optional[str] = None,
+    cost: Optional[float] = None,
+    context_used: Optional[int] = None,
+    context_max: Optional[int] = None,
+    retry_count: Optional[int] = None,
+) -> bool:
+    """Write a non-success terminal through the CAS and gate the activity
+    completion on winning it (#671/H4).
+
+    ``db.update_execution_status`` is an atomic compare-and-set: a non-success
+    terminal write loses to any already-terminal row (SUCCESS/FAILED/CANCELLED/
+    SKIPPED). The activity is completed ONLY when this writer won — a writer
+    that lost (e.g. to a user cancel) must not also complete the activity,
+    mirroring the SUCCESS-path reconcile. Returns the CAS winner so the caller
+    can gate any further side effects (e.g. the dispatch breaker).
+
+    When ``execution_id`` is falsy there is no row to contend for, so the write
+    is skipped and the activity completion still runs (``won=True``), preserving
+    the prior no-record behaviour. Replaces the old check-then-act guard
+    (``get_execution() → status != CANCELLED``), which both raced and only
+    blocked CANCELLED — the CAS additionally blocks an already-SUCCESS/FAILED
+    row and closes the TOCTOU window.
+    """
+    won = True
+    if execution_id:
+        won = db.update_execution_status(
+            execution_id=execution_id,
+            status=status,
+            error=error,
+            cost=cost,
+            context_used=context_used,
+            context_max=context_max,
+            retry_count=retry_count,
+        )
+    if won and activity_id:
+        await activity_service.complete_activity(
+            activity_id=activity_id,
+            status=activity_status,
+            error=error,
+        )
+    return won
+
+
 # ---------------------------------------------------------------------------
 # Service
 # ---------------------------------------------------------------------------
@@ -656,20 +705,16 @@ async def execute_task(
             if transport_open or dispatch_open:
                 error_msg = "Agent circuit breaker open — agent is unhealthy"
                 logger.warning(f"[TaskExecService] CB open, fast-failing execution {execution_id} for {agent_name}")
-                if execution_id:
-                    existing = db.get_execution(execution_id)
-                    if not existing or existing.status != TaskExecutionStatus.CANCELLED:
-                        db.update_execution_status(
-                            execution_id=execution_id,
-                            status=TaskExecutionStatus.FAILED,
-                            error=error_msg,
-                        )
-                if activity_id:
-                    await activity_service.complete_activity(
-                        activity_id=activity_id,
-                        status=ActivityState.FAILED,
-                        error=error_msg,
-                    )
+                # #671/H4: route the terminal write through the CAS — the
+                # activity is completed only if this writer won (a lost CAS to a
+                # cancel/already-terminal row must not also complete it).
+                await _write_terminal_and_gate(
+                    execution_id,
+                    activity_id,
+                    status=TaskExecutionStatus.FAILED,
+                    activity_status=ActivityState.FAILED,
+                    error=error_msg,
+                )
                 return TaskExecutionResult(
                     execution_id=execution_id or "",
                     status=TaskExecutionStatus.FAILED,
@@ -978,21 +1023,15 @@ async def execute_task(
             # Claude processes from accumulating. Best-effort — watchdog is safety net.
             await terminate_execution_on_agent(agent_name, execution_id)
 
-            # Don't overwrite cancelled executions
-            if execution_id:
-                existing = db.get_execution(execution_id)
-                if not existing or existing.status != TaskExecutionStatus.CANCELLED:
-                    db.update_execution_status(
-                        execution_id=execution_id,
-                        status=TaskExecutionStatus.FAILED,
-                        error=error_msg,
-                    )
-            if activity_id:
-                await activity_service.complete_activity(
-                    activity_id=activity_id,
-                    status=ActivityState.FAILED,
-                    error=error_msg,
-                )
+            # #671/H4: CAS-gate the terminal write (replaces the CANCELLED-only
+            # check-then-act guard); complete the activity only if we won.
+            await _write_terminal_and_gate(
+                execution_id,
+                activity_id,
+                status=TaskExecutionStatus.FAILED,
+                activity_status=ActivityState.FAILED,
+                error=error_msg,
+            )
             return TaskExecutionResult(
                 execution_id=execution_id or "",
                 status=TaskExecutionStatus.FAILED,
@@ -1014,20 +1053,15 @@ async def execute_task(
                 f"[TaskExecService] Rejecting task on {agent_name} — backend "
                 f"call budget exhausted: {error_msg}"
             )
-            if execution_id:
-                existing = db.get_execution(execution_id)
-                if not existing or existing.status != TaskExecutionStatus.CANCELLED:
-                    db.update_execution_status(
-                        execution_id=execution_id,
-                        status=TaskExecutionStatus.FAILED,
-                        error=error_msg,
-                    )
-            if activity_id:
-                await activity_service.complete_activity(
-                    activity_id=activity_id,
-                    status=ActivityState.FAILED,
-                    error=error_msg,
-                )
+            # #671/H4: CAS-gate the terminal write; complete the activity only
+            # if we won.
+            await _write_terminal_and_gate(
+                execution_id,
+                activity_id,
+                status=TaskExecutionStatus.FAILED,
+                activity_status=ActivityState.FAILED,
+                error=error_msg,
+            )
             return TaskExecutionResult(
                 execution_id=execution_id or "",
                 status=TaskExecutionStatus.FAILED,
@@ -1115,30 +1149,26 @@ async def execute_task(
             else:
                 salvage_cost = salvage_cost_raw
 
-            if execution_id:
-                existing = db.get_execution(execution_id)
-                if not existing or existing.status != TaskExecutionStatus.CANCELLED:
-                    db.update_execution_status(
-                        execution_id=execution_id,
-                        status=TaskExecutionStatus.FAILED,
-                        error=error_msg,
-                        cost=salvage_cost,
-                        context_used=salvage_context,
-                        context_max=salvage_context_max,
-                        retry_count=retry_count or None,
-                    )
-            if activity_id:
-                await activity_service.complete_activity(
-                    activity_id=activity_id,
-                    status=ActivityState.FAILED,
-                    error=error_msg,
-                )
+            # #671/H4: CAS-gate the terminal write; the activity completion and
+            # the AUTH breaker outcome both fire only if this writer won (a turn
+            # superseded by a cancel must not also count toward the breaker).
+            won = await _write_terminal_and_gate(
+                execution_id,
+                activity_id,
+                status=TaskExecutionStatus.FAILED,
+                activity_status=ActivityState.FAILED,
+                error=error_msg,
+                cost=salvage_cost,
+                context_used=salvage_context,
+                context_max=salvage_context_max,
+                retry_count=retry_count or None,
+            )
             # #526 D10: AUTH-only counting. Gate on error_code == AUTH so a
             # non-auth HTTP failure (error_code is None) is NEVER passed to
             # record_outcome — which would read None as a success and reset the
             # counter. On the →open transition this backgrounds the backlog
             # drain + audit.
-            if error_code == TaskExecutionErrorCode.AUTH:
+            if won and error_code == TaskExecutionErrorCode.AUTH:
                 await _record_dispatch_terminal(agent_name, breaker_enabled, error_code)
             return TaskExecutionResult(
                 execution_id=execution_id or "",
@@ -1154,20 +1184,15 @@ async def execute_task(
         except Exception as e:
             error_msg = str(e)
             logger.error(f"[TaskExecService] Unexpected error executing task on {agent_name}: {error_msg}")
-            if execution_id:
-                existing = db.get_execution(execution_id)
-                if not existing or existing.status != TaskExecutionStatus.CANCELLED:
-                    db.update_execution_status(
-                        execution_id=execution_id,
-                        status=TaskExecutionStatus.FAILED,
-                        error=error_msg,
-                    )
-            if activity_id:
-                await activity_service.complete_activity(
-                    activity_id=activity_id,
-                    status=ActivityState.FAILED,
-                    error=error_msg,
-                )
+            # #671/H4: CAS-gate the terminal write; complete the activity only
+            # if we won.
+            await _write_terminal_and_gate(
+                execution_id,
+                activity_id,
+                status=TaskExecutionStatus.FAILED,
+                activity_status=ActivityState.FAILED,
+                error=error_msg,
+            )
             return TaskExecutionResult(
                 execution_id=execution_id or "",
                 status=TaskExecutionStatus.FAILED,
diff --git a/tests/unit/test_terminal_write_cas_gate.py b/tests/unit/test_terminal_write_cas_gate.py
index edadb570..0739b6ac 100644
--- a/tests/unit/test_terminal_write_cas_gate.py
+++ b/tests/unit/test_terminal_write_cas_gate.py
@@ -155,3 +155,77 @@ def test_success_won_cas_runs_side_effects(self):
             c.kwargs.get("status") == ActivityState.COMPLETED
             for c in mock_activity.complete_activity.await_args_list
         ), "won SUCCESS turn must complete the activity as COMPLETED"
+
+
+class TestFailedPathCasGate:
+    """Non-success terminals route through `_write_terminal_and_gate`, which
+    completes the activity only on a won CAS (#671/H4). Exercised via the
+    circuit-breaker fast-fail terminal (Site 3)."""
+
+    pytestmark = pytest.mark.unit
+
+    def _run_cb_open(self, *, cas_won: bool):
+        from services.task_execution_service import TaskExecutionService
+
+        mock_db = MagicMock()
+        mock_db.get_max_parallel_tasks.return_value = 3
+        mock_db.get_execution.return_value = _make_execution("cancelled")
+        mock_db.update_execution_status.return_value = cas_won
+
+        mock_capacity = MagicMock()
+        admitted = MagicMock()
+        admitted.state = "admitted"
+        mock_capacity.acquire = AsyncMock(return_value=admitted)
+        mock_capacity.release = AsyncMock()
+
+        mock_circuit = MagicMock()
+        mock_circuit.allow_request.return_value = False  # transport CB OPEN → fast-fail
+
+        mock_activity = MagicMock(
+            track_activity=AsyncMock(return_value="act-001"),
+            complete_activity=AsyncMock(),
+        )
+
+        with (
+            patch("services.task_execution_service.db", mock_db),
+            patch("services.task_execution_service.get_capacity_manager", return_value=mock_capacity),
+            patch("services.task_execution_service.activity_service", mock_activity),
+            patch("services.task_execution_service.CircuitState", return_value=mock_circuit),
+            patch("services.task_execution_service.dispatch_breaker_active", return_value=False),
+        ):
+            svc = TaskExecutionService()
+            result = TestSuccessPathCasGate._await(
+                svc.execute_task(
+                    agent_name="test-agent",
+                    message="hello",
+                    triggered_by="schedule",
+                    execution_id="exec-test-001",
+                    timeout_seconds=300,
+                    model="sonnet",
+                )
+            )
+        return result, mock_activity
+
+    def test_failed_lost_cas_skips_activity_completion(self):
+        """A FAILED terminal that lost the CAS must NOT complete the activity."""
+        from services.task_execution_service import (
+            TaskExecutionErrorCode,
+            TaskExecutionStatus,
+        )
+
+        result, mock_activity = self._run_cb_open(cas_won=False)
+
+        mock_activity.complete_activity.assert_not_awaited()
+        # The FAILED result is still reported (the return is unconditional;
+        # only the won-only side effects gate).
+        assert result.status == TaskExecutionStatus.FAILED
+        assert result.error_code == TaskExecutionErrorCode.CIRCUIT_OPEN
+
+    def test_failed_won_cas_completes_activity(self):
+        """A FAILED terminal that won the CAS still completes the activity."""
+        from models import ActivityState
+
+        result, mock_activity = self._run_cb_open(cas_won=True)
+
+        mock_activity.complete_activity.assert_awaited_once()
+        assert mock_activity.complete_activity.await_args.kwargs["status"] == ActivityState.FAILED