Skip to content

puffer-cli: persist cancelled-turn artifacts + add turn-aborted marker#109

Open
n-WN wants to merge 2 commits into
masterfrom
fix/cancel-history-persistence
Open

puffer-cli: persist cancelled-turn artifacts + add turn-aborted marker#109
n-WN wants to merge 2 commits into
masterfrom
fix/cancel-history-persistence

Conversation

@n-WN
Copy link
Copy Markdown
Collaborator

@n-WN n-WN commented May 12, 2026

Summary

Fixes the cancel-amnesia bug observed in agentenv managed-agent sessions: when a user cancels a running turn, tool calls that already completed before the cancel were dropped from session.jsonl. The next turn's LLM-context build saw nothing, so the assistant confidently answered "I haven't done anything yet" while the user could plainly see the tool ran.

Production trace from agent eab60bad-c00c-41d9-86b9-42675b0d9e52 (agentenv.io):

```
[ 0] user → "clone openai/codex"
[ 1] tool_call → TaskCreate(task-1) status=completed turn=968dd73a ← cancel
[ 2] user → "clone openai/codex" (retry)
[ 3] tool_call → TaskCreate(task-2) status=completed ← cancel
[ 4] user → "continue"
[ 5] tool_call → TaskCreate(task-3) status=completed ← cancel
[ 6] user → "我上面都说了什么"
[ 7] assistant → "你上面依次说了: 1.clone... 2.clone... 3.continue 4.我上面都说了什么"
↑ agent doesn't see its own 3 TaskCreate calls
[ 9] user → "当前 context 中有哪些信息"
[10] assistant → "...你之前两次要求我 clone openai/codex,但到目前为止还没有实际执行克隆"
↑ explicitly gaslights — gpt-5.4 was confident it never ran any tools
```

Root cause

`crates/puffer-cli/src/daemon.rs:2241-2301` had a `match outcome` where the `Err(_)` arm only published a `turn-error` SSE event — it never persisted any partial work to `session.jsonl`. The `agent_loop`'s `bail!("cancelled")` doesn't carry a `TurnExecution`, so all completed `tool_invocations` from that turn were silently dropped.

Even though api-server's PostgreSQL did see those tool calls (streamed via SSE) and displayed them in the UI, puffer's own `session.jsonl` — which is what `AppState::from_session_record` reads on the next turn — never received them. So the LLM's context-build was blind to the cancelled turn's work.

Cross-impl reference

A research subagent traced cancel handling in codex / pi-mono / claude-code (decompiled). All three persist partial state and add an explicit "interrupted" marker:

System Persistence Abort marker
codex (Rust) per-item `record_conversation_items` + `flush_rollout` `<turn_aborted>` developer message (`codex-rs/core/src/context/turn_aborted.rs`)
pi-mono (TS) per `message_end` event `appendMessage` `stopReason: "aborted"` flag
claude-code per-yield `recordTranscript` synthetic `Interrupted by user` `tool_result` for orphans (`query.ts:1025`)
puffer master ❌ only at turn-end, dropped on Err ❌ none

Fix (single-file, minimal blast radius)

  1. Extract the existing Ok-arm persistence into `persist_turn_artifacts(...)` and call it from both Ok and Err arms.
  2. Tee streaming `TurnStreamEvent::ToolInvocations` + `TextDelta` events into `Arc<Mutex<...>>` accumulators fed by the existing `on_event` closure.
  3. On Err, persist those accumulators before publishing turn-error.
  4. Append a `SystemMessage` marker via `format_turn_aborted_marker(...)` — includes the abort reason and names of tools that completed, so the next turn's LLM has an explicit signal not to gaslight the user.
  5. Emit `partialInvocations: ` on the `turn-error` SSE event so frontends can render a "cancelled mid-execution" affordance.

Test plan

Unit tests (added)

  • `persist_turn_artifacts_records_invocations_and_text_for_cancelled_turn` — full round-trip including reload
  • `persist_turn_artifacts_skips_empty_text_keeps_invocations` — tool ran but no text emitted yet
  • `persist_turn_artifacts_noop_when_nothing_to_save` — both empty
  • `turn_aborted_marker_carries_reason_and_tool_summary` — 4 cases (cancel/error × empty/populated/deduped)

Regression

  • 7/7 `daemon::tests::*` PASS
  • 91 puffer-cli tests pass; one pre-existing `benchmark_tool_permission_uses_curated_allowlist` fails on `origin/master` too — unrelated.

End-to-end (real daemon + real LLM via builtin-openai + WS cancel_turn RPC)

```
master (buggy) fix branch
turn-error.partialInvocations None 1
session.jsonl event count 1 3
tool_invocation for echo HELLO ❌ DROPPED ✅ persisted
turn-aborted system_message ❌ missing ✅ "1 tool call completed: Bash..."
```

Fix branch `session.jsonl` after cancel:
```
[0] user_message Run TWO Bash commands in sequence...
[1] tool_invocation tool_id=Bash success=true ← echo HELLO persisted
[2] system_message [turn-aborted] cancelled by user. 1 tool call completed before the abort: Bash. Any side effects of those tool calls (file writes, network calls, task creation) have already happened. Do NOT assume the turn ran to completion; verify state before continuing, and confirm with the user if uncertain.
```

E2E script + venv kept at `/tmp/e2e-cancel-history.py` + `/tmp/puffer-e2e-venv` for reproduction.

Follow-ups (filed separately, NOT in this PR)

  • Incremental persistence (codex/pi-mono pattern) — accumulator still loses state on daemon crash mid-turn. Worth threading `session_store` into `agent_loop` so each tool completion writes immediately.
  • Orphan `tool_use` synthesis (claude-code pattern) — for model-emitted `tool_use` blocks that didn't get to execute before cancel, synthesize `Interrupted by user` `tool_result` to keep replayed Anthropic conversations API-valid.

n-WN added 2 commits May 12, 2026 12:21
When a turn ended in Err (user cancel via `cancel_turn` RPC, or an
upstream provider error), `start_turn`'s match arm wrote a `turn-error`
event to the SSE stream and threw away `turn.tool_invocations`. The
agent_loop's `bail!("cancelled")` doesn't carry a `TurnExecution`, so
any tool calls that ALREADY completed before the cancel boundary were
silently dropped from session.jsonl.

The next time the agent loaded the session, `AppState::from_session_record`
saw an incomplete transcript — only the user prompt, no record of the
tool calls that actually executed (and whose side effects, like
file writes or task creation, were already real). The LLM then
confidently answered "I haven't done anything yet" while the user
could plainly see the tool had run. Production trace:

  agent eab60bad-c00c-41d9-86b9-42675b0d9e52 → 4× TaskCreate(Clone
  repository) in api-server's PostgreSQL with status=completed +
  matching turnIds, but assistant's "what did I say above" reply
  listed only the 4 user prompts and explicitly stated "to date no
  actual clone has been executed".

Cross-system reference (researched in this branch's sister agent):
  - codex persists `record_conversation_items` per-item + appends a
    `<turn_aborted>` developer message before TurnAborted event
    (codex-rs/core/src/context/turn_aborted.rs)
  - pi-mono persists each message_end event including aborted assistant
    messages with `stopReason: "aborted"` (packages/agent/src/agent-loop.ts)
  - claude-code persists incrementally and synthesizes
    "Interrupted by user" tool_result blocks for orphans
    (cc-src/query.ts:1025 yieldMissingToolResultBlocks)

Puffer's fix (minimal-blast-radius, single-file):
  1. Tee the streaming `TurnStreamEvent::ToolInvocations` and
     `TurnStreamEvent::TextDelta` into Arc<Mutex<...>> accumulators
     fed by the existing on_event closure.
  2. Extract the Ok-arm persistence into `persist_turn_artifacts(...)`
     and call it from BOTH the Ok and Err arms.
  3. Append a `SystemMessage` marker via `format_turn_aborted_marker`
     that includes the abort reason + names of tool calls that
     completed, so the next turn's LLM has an explicit signal not to
     gaslight the user about what just happened.
  4. Emit `partialInvocations: <count>` on the SSE `turn-error` event
     so frontends can render a "cancelled mid-execution" affordance.

Tests:
  - 3 new unit tests around `persist_turn_artifacts` (records,
    skips-empty, no-op when nothing to save) — all pass.
  - 1 new unit test around `format_turn_aborted_marker` (reason +
    tool summary + dedup) — passes.
  - All 7 daemon::tests pass; full `puffer-cli --bin puffer` test
    suite has only one pre-existing failure
    (`benchmark_tool_permission_uses_curated_allowlist`) that also
    fails on origin/master — unrelated.
  - End-to-end (real daemon + real LLM via builtin-openai +
    WebSocket + cancel_turn RPC): before/after comparison shows
    master persists 1 event (user msg only), fix branch persists 3
    events (user msg + tool_invocation for completed Bash echo +
    turn-aborted marker).

Follow-up worth considering (filed separately, not in this PR):
  - Move from end-of-turn accumulator to per-event incremental writes
    so a daemon crash mid-turn doesn't lose state either (codex/pi-mono
    pattern).
  - Synthesize "Interrupted by user" tool_result for any orphan
    tool_use the model produced but didn't execute, to keep replayed
    Anthropic conversations API-valid (cc pattern).
…e LLM

Design review on the prior commit flagged a blocking gap: the turn-aborted
marker was emitted as `TranscriptEvent::SystemMessage`, but puffer's wire
serializers (`runtime/openai/conversation.rs`) deliberately strip
non-leading `role:"system"` items from outgoing LLM requests on all three
paths (Anthropic top-level system field, Responses
`drop_transient_system_messages`, Chat Completions same). The marker
rendered in the TUI but the next-turn LLM never saw it — defeating the
core anti-gaslighting purpose.

Cross-impl confirmation (subagent research, branch's prior commit cited
codex/pi-mono/cc):
  - codex `ContextualUser` mode wraps `<turn_aborted>...` in role:user
    (`codex-rs/core/src/context/turn_aborted.rs:9`)
  - claude-code emits "Interrupted by user" tool_result + plain text on
    role:user (`cc-src/query.ts:123-149`)
  - pi-mono drops the aborted assistant entirely (different tradeoff:
    silent retry instead of explicit signal)

2 of 3 references put the marker on role:user. Approach B (filter
exemption for `[turn-aborted]` prefix) was rejected because the filters
exist to prevent strict backends (ChatGPT Codex `/backend-api/codex/
responses`) from 400-ing on mid-conversation system items — re-enabling
that codepath via prefix magic invites regression.

Changes:
  1. `daemon.rs`: marker now persisted as `TranscriptEvent::UserMessage`
     + pushed to AppState as `MessageRole::User`. Wire serializer
     `push_or_merge` bundles it with the prior tool_result user message
     in Anthropic mode, so the LLM sees a single user turn with both the
     tool output and the abort notice — matches claude-code's shape.
  2. `conversation.rs`: 2 new wire-level tests pin the contract.
     - `turn_aborted_marker_survives_all_three_wire_serializers` asserts
       a UserMessage marker reaches every serializer's output JSON.
     - `legacy_system_role_marker_would_be_dropped_by_wire_filters` is
       the negative-control twin: pins the failure mode the old
       SystemMessage emit produced, so a future regression "system role
       is fine for markers" can't sneak through.

Tests:
  - 7/7 daemon::tests still pass
  - 69/69 conversation::tests pass, including 2 new wire-level tests
  - End-to-end (real daemon + LLM): session.jsonl after cancel now
    contains `[user_message, tool_invocation, user_message(marker)]`;
    re-checked the marker content includes "[turn-aborted]" + tool
    summary + abort reason.

Behavior change for frontends/TUI:
  The marker is now rendered as a "user" message bubble (not "system"
  card). Visually different but semantically more truthful: the user
  IS the actor who interrupted the turn. Existing role-aware UIs treat
  it identically to a normal user message — no breaking changes.
@n-WN
Copy link
Copy Markdown
Collaborator Author

n-WN commented May 12, 2026

Review responses

Three reviewers — two subagent passes (correctness + design) plus codex-cli. One blocker surfaced and is now fixed in 05481b2.

🚨 Blocker found by design review (now fixed)

The turn-aborted marker was emitted as TranscriptEvent::SystemMessage — but puffer's wire serializers strip non-leading role:"system" items from outgoing LLM requests on all three paths:

  • Anthropic: top-level system field, items_to_anthropic_messages:803 continues on system items
  • OpenAI Responses: drop_transient_system_messages:556 (filter is documented as catching "Interrupted by user." specifically)
  • OpenAI Chat Completions: same filter (:694)

Result: marker rendered in TUI ✓ but never reached the next-turn LLM ✗. The whole anti-gaslighting purpose was defeated.

Fix (commit 05481b2)

Switched marker emit to TranscriptEvent::UserMessage. Rationale:

  1. 2 of 3 reference systems do this — codex ContextualUser mode wraps <turn_aborted>... in role:user (codex-rs/core/src/context/turn_aborted.rs:9); claude-code's yieldMissingToolResultBlocks emits role:user with is_error: "Interrupted by user" (cc-src/query.ts:123-149). pi-mono's "drop the assistant entirely" approach is the outlier and risks compaction confusion.
  2. Filter-exemption alternative rejected — adding a [turn-aborted] prefix exemption to drop_transient_system_messages re-introduces the exact failure mode the filter exists to prevent (strict backends like chatgpt.com/backend-api/codex/responses 400 on mid-conversation system items).
  3. push_or_merge bundles adjacent user messages for Anthropic, so the marker bundles into the same user turn as the prior tool_result. LLM reads a single user turn: "Here's the Bash output. By the way, [turn-aborted] you were cancelled, 1 tool call completed: Bash."

Tests added in this push (conversation.rs)

  • turn_aborted_marker_survives_all_three_wire_serializers — pins the positive contract.
  • legacy_system_role_marker_would_be_dropped_by_wire_filters — negative control. If a future commit "restores" system-role markers thinking it's harmless, this test fails first.

Other review findings (acknowledged, follow-up — non-blocking)

Item Source Status
Arc<Mutex<...>> over-cautious — on_event is single-threaded today correctness review #1 Kept defensive; agent_loop has TODO for parallel batching. Mutex is contention-free and cheap.
err.contains("cancelled") heuristic could mis-classify provider errors correctness review #5 Marker text would say "cancelled by user" instead of "ended with error" — informational, not a correctness bug. Filed for follow-up to add a typed CancelError.
Missing unit test for "non-empty text, zero invocations" correctness review #5 Worth adding; non-blocking.
Incremental per-event persistence (codex/pi-mono pattern) — accumulator still loses state on daemon crash mid-turn design review #1 Acknowledged as a separate hardening pass — explicit follow-up in commit message.
Synthetic orphan tool_result for unexecuted tool_use (cc pattern) design review #2 Not needed for puffer — TurnStreamEvent::ToolInvocations is paired (call+result), never produces standalone tool_use. Confirmed by agent_loop.rs:603-668.

Verification

  • 7/7 daemon::tests pass
  • 69/69 conversation::tests pass (including 2 new wire-level tests)
  • E2E re-run with new binary: session.jsonl now contains [user_message, tool_invocation, user_message(marker)]; the marker serializes onto role:user in all 3 wire paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant