puffer-cli: persist cancelled-turn artifacts + add turn-aborted marker by n-WN · Pull Request #109 · berabuddies/puffer

n-WN · 2026-05-12T04:27:17Z

Summary

Fixes the cancel-amnesia bug observed in agentenv managed-agent sessions: when a user cancels a running turn, tool calls that already completed before the cancel were dropped from session.jsonl. The next turn's LLM-context build saw nothing, so the assistant confidently answered "I haven't done anything yet" while the user could plainly see the tool ran.

Production trace from agent eab60bad-c00c-41d9-86b9-42675b0d9e52 (agentenv.io):

```
[ 0] user → "clone openai/codex"
[ 1] tool_call → TaskCreate(task-1) status=completed turn=968dd73a ← cancel
[ 2] user → "clone openai/codex" (retry)
[ 3] tool_call → TaskCreate(task-2) status=completed ← cancel
[ 4] user → "continue"
[ 5] tool_call → TaskCreate(task-3) status=completed ← cancel
[ 6] user → "我上面都说了什么"
[ 7] assistant → "你上面依次说了: 1.clone... 2.clone... 3.continue 4.我上面都说了什么"
↑ agent doesn't see its own 3 TaskCreate calls
[ 9] user → "当前 context 中有哪些信息"
[10] assistant → "...你之前两次要求我 clone openai/codex，但到目前为止还没有实际执行克隆"
↑ explicitly gaslights — gpt-5.4 was confident it never ran any tools
```

Root cause

`crates/puffer-cli/src/daemon.rs:2241-2301` had a `match outcome` where the `Err(_)` arm only published a `turn-error` SSE event — it never persisted any partial work to `session.jsonl`. The `agent_loop`'s `bail!("cancelled")` doesn't carry a `TurnExecution`, so all completed `tool_invocations` from that turn were silently dropped.

Even though api-server's PostgreSQL did see those tool calls (streamed via SSE) and displayed them in the UI, puffer's own `session.jsonl` — which is what `AppState::from_session_record` reads on the next turn — never received them. So the LLM's context-build was blind to the cancelled turn's work.

Cross-impl reference

A research subagent traced cancel handling in codex / pi-mono / claude-code (decompiled). All three persist partial state and add an explicit "interrupted" marker:

System	Persistence	Abort marker
codex (Rust)	per-item `record_conversation_items` + `flush_rollout`	`<turn_aborted>` developer message (`codex-rs/core/src/context/turn_aborted.rs`)
pi-mono (TS)	per `message_end` event `appendMessage`	`stopReason: "aborted"` flag
claude-code	per-yield `recordTranscript`	synthetic `Interrupted by user` `tool_result` for orphans (`query.ts:1025`)
puffer master	❌ only at turn-end, dropped on Err	❌ none

Fix (single-file, minimal blast radius)

Extract the existing Ok-arm persistence into `persist_turn_artifacts(...)` and call it from both Ok and Err arms.
Tee streaming `TurnStreamEvent::ToolInvocations` + `TextDelta` events into `Arc<Mutex<...>>` accumulators fed by the existing `on_event` closure.
On Err, persist those accumulators before publishing turn-error.
Append a `SystemMessage` marker via `format_turn_aborted_marker(...)` — includes the abort reason and names of tools that completed, so the next turn's LLM has an explicit signal not to gaslight the user.
Emit `partialInvocations: ` on the `turn-error` SSE event so frontends can render a "cancelled mid-execution" affordance.

Test plan

Unit tests (added)

`persist_turn_artifacts_records_invocations_and_text_for_cancelled_turn` — full round-trip including reload
`persist_turn_artifacts_skips_empty_text_keeps_invocations` — tool ran but no text emitted yet
`persist_turn_artifacts_noop_when_nothing_to_save` — both empty
`turn_aborted_marker_carries_reason_and_tool_summary` — 4 cases (cancel/error × empty/populated/deduped)

Regression

7/7 `daemon::tests::*` PASS
91 puffer-cli tests pass; one pre-existing `benchmark_tool_permission_uses_curated_allowlist` fails on `origin/master` too — unrelated.

End-to-end (real daemon + real LLM via builtin-openai + WS cancel_turn RPC)

```
master (buggy) fix branch
turn-error.partialInvocations None 1
session.jsonl event count 1 3
tool_invocation for echo HELLO ❌ DROPPED ✅ persisted
turn-aborted system_message ❌ missing ✅ "1 tool call completed: Bash..."
```

Fix branch `session.jsonl` after cancel:
```
[0] user_message Run TWO Bash commands in sequence...
[1] tool_invocation tool_id=Bash success=true ← echo HELLO persisted
[2] system_message [turn-aborted] cancelled by user. 1 tool call completed before the abort: Bash. Any side effects of those tool calls (file writes, network calls, task creation) have already happened. Do NOT assume the turn ran to completion; verify state before continuing, and confirm with the user if uncertain.
```

E2E script + venv kept at `/tmp/e2e-cancel-history.py` + `/tmp/puffer-e2e-venv` for reproduction.

Follow-ups (filed separately, NOT in this PR)

Incremental persistence (codex/pi-mono pattern) — accumulator still loses state on daemon crash mid-turn. Worth threading `session_store` into `agent_loop` so each tool completion writes immediately.
Orphan `tool_use` synthesis (claude-code pattern) — for model-emitted `tool_use` blocks that didn't get to execute before cancel, synthesize `Interrupted by user` `tool_result` to keep replayed Anthropic conversations API-valid.

When a turn ended in Err (user cancel via `cancel_turn` RPC, or an upstream provider error), `start_turn`'s match arm wrote a `turn-error` event to the SSE stream and threw away `turn.tool_invocations`. The agent_loop's `bail!("cancelled")` doesn't carry a `TurnExecution`, so any tool calls that ALREADY completed before the cancel boundary were silently dropped from session.jsonl. The next time the agent loaded the session, `AppState::from_session_record` saw an incomplete transcript — only the user prompt, no record of the tool calls that actually executed (and whose side effects, like file writes or task creation, were already real). The LLM then confidently answered "I haven't done anything yet" while the user could plainly see the tool had run. Production trace: agent eab60bad-c00c-41d9-86b9-42675b0d9e52 → 4× TaskCreate(Clone repository) in api-server's PostgreSQL with status=completed + matching turnIds, but assistant's "what did I say above" reply listed only the 4 user prompts and explicitly stated "to date no actual clone has been executed". Cross-system reference (researched in this branch's sister agent): - codex persists `record_conversation_items` per-item + appends a `<turn_aborted>` developer message before TurnAborted event (codex-rs/core/src/context/turn_aborted.rs) - pi-mono persists each message_end event including aborted assistant messages with `stopReason: "aborted"` (packages/agent/src/agent-loop.ts) - claude-code persists incrementally and synthesizes "Interrupted by user" tool_result blocks for orphans (cc-src/query.ts:1025 yieldMissingToolResultBlocks) Puffer's fix (minimal-blast-radius, single-file): 1. Tee the streaming `TurnStreamEvent::ToolInvocations` and `TurnStreamEvent::TextDelta` into Arc<Mutex<...>> accumulators fed by the existing on_event closure. 2. Extract the Ok-arm persistence into `persist_turn_artifacts(...)` and call it from BOTH the Ok and Err arms. 3. Append a `SystemMessage` marker via `format_turn_aborted_marker` that includes the abort reason + names of tool calls that completed, so the next turn's LLM has an explicit signal not to gaslight the user about what just happened. 4. Emit `partialInvocations: <count>` on the SSE `turn-error` event so frontends can render a "cancelled mid-execution" affordance. Tests: - 3 new unit tests around `persist_turn_artifacts` (records, skips-empty, no-op when nothing to save) — all pass. - 1 new unit test around `format_turn_aborted_marker` (reason + tool summary + dedup) — passes. - All 7 daemon::tests pass; full `puffer-cli --bin puffer` test suite has only one pre-existing failure (`benchmark_tool_permission_uses_curated_allowlist`) that also fails on origin/master — unrelated. - End-to-end (real daemon + real LLM via builtin-openai + WebSocket + cancel_turn RPC): before/after comparison shows master persists 1 event (user msg only), fix branch persists 3 events (user msg + tool_invocation for completed Bash echo + turn-aborted marker). Follow-up worth considering (filed separately, not in this PR): - Move from end-of-turn accumulator to per-event incremental writes so a daemon crash mid-turn doesn't lose state either (codex/pi-mono pattern). - Synthesize "Interrupted by user" tool_result for any orphan tool_use the model produced but didn't execute, to keep replayed Anthropic conversations API-valid (cc pattern).

…e LLM Design review on the prior commit flagged a blocking gap: the turn-aborted marker was emitted as `TranscriptEvent::SystemMessage`, but puffer's wire serializers (`runtime/openai/conversation.rs`) deliberately strip non-leading `role:"system"` items from outgoing LLM requests on all three paths (Anthropic top-level system field, Responses `drop_transient_system_messages`, Chat Completions same). The marker rendered in the TUI but the next-turn LLM never saw it — defeating the core anti-gaslighting purpose. Cross-impl confirmation (subagent research, branch's prior commit cited codex/pi-mono/cc): - codex `ContextualUser` mode wraps `<turn_aborted>...` in role:user (`codex-rs/core/src/context/turn_aborted.rs:9`) - claude-code emits "Interrupted by user" tool_result + plain text on role:user (`cc-src/query.ts:123-149`) - pi-mono drops the aborted assistant entirely (different tradeoff: silent retry instead of explicit signal) 2 of 3 references put the marker on role:user. Approach B (filter exemption for `[turn-aborted]` prefix) was rejected because the filters exist to prevent strict backends (ChatGPT Codex `/backend-api/codex/ responses`) from 400-ing on mid-conversation system items — re-enabling that codepath via prefix magic invites regression. Changes: 1. `daemon.rs`: marker now persisted as `TranscriptEvent::UserMessage` + pushed to AppState as `MessageRole::User`. Wire serializer `push_or_merge` bundles it with the prior tool_result user message in Anthropic mode, so the LLM sees a single user turn with both the tool output and the abort notice — matches claude-code's shape. 2. `conversation.rs`: 2 new wire-level tests pin the contract. - `turn_aborted_marker_survives_all_three_wire_serializers` asserts a UserMessage marker reaches every serializer's output JSON. - `legacy_system_role_marker_would_be_dropped_by_wire_filters` is the negative-control twin: pins the failure mode the old SystemMessage emit produced, so a future regression "system role is fine for markers" can't sneak through. Tests: - 7/7 daemon::tests still pass - 69/69 conversation::tests pass, including 2 new wire-level tests - End-to-end (real daemon + LLM): session.jsonl after cancel now contains `[user_message, tool_invocation, user_message(marker)]`; re-checked the marker content includes "[turn-aborted]" + tool summary + abort reason. Behavior change for frontends/TUI: The marker is now rendered as a "user" message bubble (not "system" card). Visually different but semantically more truthful: the user IS the actor who interrupted the turn. Existing role-aware UIs treat it identically to a normal user message — no breaking changes.

n-WN · 2026-05-12T05:05:24Z

Review responses

Three reviewers — two subagent passes (correctness + design) plus codex-cli. One blocker surfaced and is now fixed in 05481b2.

🚨 Blocker found by design review (now fixed)

The turn-aborted marker was emitted as TranscriptEvent::SystemMessage — but puffer's wire serializers strip non-leading role:"system" items from outgoing LLM requests on all three paths:

Anthropic: top-level system field, items_to_anthropic_messages:803 continues on system items
OpenAI Responses: drop_transient_system_messages:556 (filter is documented as catching "Interrupted by user." specifically)
OpenAI Chat Completions: same filter (:694)

Result: marker rendered in TUI ✓ but never reached the next-turn LLM ✗. The whole anti-gaslighting purpose was defeated.

Fix (commit `05481b2`)

Switched marker emit to TranscriptEvent::UserMessage. Rationale:

2 of 3 reference systems do this — codex ContextualUser mode wraps <turn_aborted>... in role:user (codex-rs/core/src/context/turn_aborted.rs:9); claude-code's yieldMissingToolResultBlocks emits role:user with is_error: "Interrupted by user" (cc-src/query.ts:123-149). pi-mono's "drop the assistant entirely" approach is the outlier and risks compaction confusion.
Filter-exemption alternative rejected — adding a [turn-aborted] prefix exemption to drop_transient_system_messages re-introduces the exact failure mode the filter exists to prevent (strict backends like chatgpt.com/backend-api/codex/responses 400 on mid-conversation system items).
push_or_merge bundles adjacent user messages for Anthropic, so the marker bundles into the same user turn as the prior tool_result. LLM reads a single user turn: "Here's the Bash output. By the way, [turn-aborted] you were cancelled, 1 tool call completed: Bash."

Tests added in this push (`conversation.rs`)

turn_aborted_marker_survives_all_three_wire_serializers — pins the positive contract.
legacy_system_role_marker_would_be_dropped_by_wire_filters — negative control. If a future commit "restores" system-role markers thinking it's harmless, this test fails first.

Other review findings (acknowledged, follow-up — non-blocking)

Item	Source	Status
`Arc<Mutex<...>>` over-cautious — on_event is single-threaded today	correctness review #1	Kept defensive; agent_loop has TODO for parallel batching. Mutex is contention-free and cheap.
`err.contains("cancelled")` heuristic could mis-classify provider errors	correctness review #5	Marker text would say "cancelled by user" instead of "ended with error" — informational, not a correctness bug. Filed for follow-up to add a typed `CancelError`.
Missing unit test for "non-empty text, zero invocations"	correctness review #5	Worth adding; non-blocking.
Incremental per-event persistence (codex/pi-mono pattern) — accumulator still loses state on daemon crash mid-turn	design review #1	Acknowledged as a separate hardening pass — explicit follow-up in commit message.
Synthetic orphan tool_result for unexecuted tool_use (cc pattern)	design review #2	Not needed for puffer — `TurnStreamEvent::ToolInvocations` is paired (call+result), never produces standalone tool_use. Confirmed by `agent_loop.rs:603-668`.

Verification

7/7 daemon::tests pass
69/69 conversation::tests pass (including 2 new wire-level tests)
E2E re-run with new binary: session.jsonl now contains [user_message, tool_invocation, user_message(marker)]; the marker serializes onto role:user in all 3 wire paths.

n-WN added 2 commits May 12, 2026 12:21

This was referenced May 12, 2026

puffer-cli: classify runner-unreachable transport errors with category + friendly message #110

Merged

puffer-cli: cancel takes precedence in classify_turn_error (follow-up to #110) #112

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

puffer-cli: persist cancelled-turn artifacts + add turn-aborted marker#109

puffer-cli: persist cancelled-turn artifacts + add turn-aborted marker#109
n-WN wants to merge 2 commits into
masterfrom
fix/cancel-history-persistence

n-WN commented May 12, 2026

Uh oh!

n-WN commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

n-WN commented May 12, 2026

Summary

Root cause

Cross-impl reference

Fix (single-file, minimal blast radius)

Test plan

Unit tests (added)

Regression

End-to-end (real daemon + real LLM via builtin-openai + WS cancel_turn RPC)

Follow-ups (filed separately, NOT in this PR)

Uh oh!

n-WN commented May 12, 2026

Review responses

🚨 Blocker found by design review (now fixed)

Fix (commit 05481b2)

Tests added in this push (conversation.rs)

Other review findings (acknowledged, follow-up — non-blocking)

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix (commit `05481b2`)

Tests added in this push (`conversation.rs`)