puffer-cli: persist cancelled-turn artifacts + add turn-aborted marker#109
Open
n-WN wants to merge 2 commits into
Open
puffer-cli: persist cancelled-turn artifacts + add turn-aborted marker#109n-WN wants to merge 2 commits into
n-WN wants to merge 2 commits into
Conversation
When a turn ended in Err (user cancel via `cancel_turn` RPC, or an
upstream provider error), `start_turn`'s match arm wrote a `turn-error`
event to the SSE stream and threw away `turn.tool_invocations`. The
agent_loop's `bail!("cancelled")` doesn't carry a `TurnExecution`, so
any tool calls that ALREADY completed before the cancel boundary were
silently dropped from session.jsonl.
The next time the agent loaded the session, `AppState::from_session_record`
saw an incomplete transcript — only the user prompt, no record of the
tool calls that actually executed (and whose side effects, like
file writes or task creation, were already real). The LLM then
confidently answered "I haven't done anything yet" while the user
could plainly see the tool had run. Production trace:
agent eab60bad-c00c-41d9-86b9-42675b0d9e52 → 4× TaskCreate(Clone
repository) in api-server's PostgreSQL with status=completed +
matching turnIds, but assistant's "what did I say above" reply
listed only the 4 user prompts and explicitly stated "to date no
actual clone has been executed".
Cross-system reference (researched in this branch's sister agent):
- codex persists `record_conversation_items` per-item + appends a
`<turn_aborted>` developer message before TurnAborted event
(codex-rs/core/src/context/turn_aborted.rs)
- pi-mono persists each message_end event including aborted assistant
messages with `stopReason: "aborted"` (packages/agent/src/agent-loop.ts)
- claude-code persists incrementally and synthesizes
"Interrupted by user" tool_result blocks for orphans
(cc-src/query.ts:1025 yieldMissingToolResultBlocks)
Puffer's fix (minimal-blast-radius, single-file):
1. Tee the streaming `TurnStreamEvent::ToolInvocations` and
`TurnStreamEvent::TextDelta` into Arc<Mutex<...>> accumulators
fed by the existing on_event closure.
2. Extract the Ok-arm persistence into `persist_turn_artifacts(...)`
and call it from BOTH the Ok and Err arms.
3. Append a `SystemMessage` marker via `format_turn_aborted_marker`
that includes the abort reason + names of tool calls that
completed, so the next turn's LLM has an explicit signal not to
gaslight the user about what just happened.
4. Emit `partialInvocations: <count>` on the SSE `turn-error` event
so frontends can render a "cancelled mid-execution" affordance.
Tests:
- 3 new unit tests around `persist_turn_artifacts` (records,
skips-empty, no-op when nothing to save) — all pass.
- 1 new unit test around `format_turn_aborted_marker` (reason +
tool summary + dedup) — passes.
- All 7 daemon::tests pass; full `puffer-cli --bin puffer` test
suite has only one pre-existing failure
(`benchmark_tool_permission_uses_curated_allowlist`) that also
fails on origin/master — unrelated.
- End-to-end (real daemon + real LLM via builtin-openai +
WebSocket + cancel_turn RPC): before/after comparison shows
master persists 1 event (user msg only), fix branch persists 3
events (user msg + tool_invocation for completed Bash echo +
turn-aborted marker).
Follow-up worth considering (filed separately, not in this PR):
- Move from end-of-turn accumulator to per-event incremental writes
so a daemon crash mid-turn doesn't lose state either (codex/pi-mono
pattern).
- Synthesize "Interrupted by user" tool_result for any orphan
tool_use the model produced but didn't execute, to keep replayed
Anthropic conversations API-valid (cc pattern).
…e LLM
Design review on the prior commit flagged a blocking gap: the turn-aborted
marker was emitted as `TranscriptEvent::SystemMessage`, but puffer's wire
serializers (`runtime/openai/conversation.rs`) deliberately strip
non-leading `role:"system"` items from outgoing LLM requests on all three
paths (Anthropic top-level system field, Responses
`drop_transient_system_messages`, Chat Completions same). The marker
rendered in the TUI but the next-turn LLM never saw it — defeating the
core anti-gaslighting purpose.
Cross-impl confirmation (subagent research, branch's prior commit cited
codex/pi-mono/cc):
- codex `ContextualUser` mode wraps `<turn_aborted>...` in role:user
(`codex-rs/core/src/context/turn_aborted.rs:9`)
- claude-code emits "Interrupted by user" tool_result + plain text on
role:user (`cc-src/query.ts:123-149`)
- pi-mono drops the aborted assistant entirely (different tradeoff:
silent retry instead of explicit signal)
2 of 3 references put the marker on role:user. Approach B (filter
exemption for `[turn-aborted]` prefix) was rejected because the filters
exist to prevent strict backends (ChatGPT Codex `/backend-api/codex/
responses`) from 400-ing on mid-conversation system items — re-enabling
that codepath via prefix magic invites regression.
Changes:
1. `daemon.rs`: marker now persisted as `TranscriptEvent::UserMessage`
+ pushed to AppState as `MessageRole::User`. Wire serializer
`push_or_merge` bundles it with the prior tool_result user message
in Anthropic mode, so the LLM sees a single user turn with both the
tool output and the abort notice — matches claude-code's shape.
2. `conversation.rs`: 2 new wire-level tests pin the contract.
- `turn_aborted_marker_survives_all_three_wire_serializers` asserts
a UserMessage marker reaches every serializer's output JSON.
- `legacy_system_role_marker_would_be_dropped_by_wire_filters` is
the negative-control twin: pins the failure mode the old
SystemMessage emit produced, so a future regression "system role
is fine for markers" can't sneak through.
Tests:
- 7/7 daemon::tests still pass
- 69/69 conversation::tests pass, including 2 new wire-level tests
- End-to-end (real daemon + LLM): session.jsonl after cancel now
contains `[user_message, tool_invocation, user_message(marker)]`;
re-checked the marker content includes "[turn-aborted]" + tool
summary + abort reason.
Behavior change for frontends/TUI:
The marker is now rendered as a "user" message bubble (not "system"
card). Visually different but semantically more truthful: the user
IS the actor who interrupted the turn. Existing role-aware UIs treat
it identically to a normal user message — no breaking changes.
Collaborator
Author
Review responsesThree reviewers — two subagent passes (correctness + design) plus codex-cli. One blocker surfaced and is now fixed in 🚨 Blocker found by design review (now fixed)The turn-aborted marker was emitted as
Result: marker rendered in TUI ✓ but never reached the next-turn LLM ✗. The whole anti-gaslighting purpose was defeated. Fix (commit
|
| Item | Source | Status |
|---|---|---|
Arc<Mutex<...>> over-cautious — on_event is single-threaded today |
correctness review #1 | Kept defensive; agent_loop has TODO for parallel batching. Mutex is contention-free and cheap. |
err.contains("cancelled") heuristic could mis-classify provider errors |
correctness review #5 | Marker text would say "cancelled by user" instead of "ended with error" — informational, not a correctness bug. Filed for follow-up to add a typed CancelError. |
| Missing unit test for "non-empty text, zero invocations" | correctness review #5 | Worth adding; non-blocking. |
| Incremental per-event persistence (codex/pi-mono pattern) — accumulator still loses state on daemon crash mid-turn | design review #1 | Acknowledged as a separate hardening pass — explicit follow-up in commit message. |
| Synthetic orphan tool_result for unexecuted tool_use (cc pattern) | design review #2 | Not needed for puffer — TurnStreamEvent::ToolInvocations is paired (call+result), never produces standalone tool_use. Confirmed by agent_loop.rs:603-668. |
Verification
- 7/7 daemon::tests pass
- 69/69 conversation::tests pass (including 2 new wire-level tests)
- E2E re-run with new binary: session.jsonl now contains
[user_message, tool_invocation, user_message(marker)]; the marker serializes onto role:user in all 3 wire paths.
This was referenced May 12, 2026
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the cancel-amnesia bug observed in agentenv managed-agent sessions: when a user cancels a running turn, tool calls that already completed before the cancel were dropped from
session.jsonl. The next turn's LLM-context build saw nothing, so the assistant confidently answered "I haven't done anything yet" while the user could plainly see the tool ran.Production trace from agent
eab60bad-c00c-41d9-86b9-42675b0d9e52(agentenv.io):```
[ 0] user → "clone openai/codex"
[ 1] tool_call → TaskCreate(task-1) status=completed turn=968dd73a ← cancel
[ 2] user → "clone openai/codex" (retry)
[ 3] tool_call → TaskCreate(task-2) status=completed ← cancel
[ 4] user → "continue"
[ 5] tool_call → TaskCreate(task-3) status=completed ← cancel
[ 6] user → "我上面都说了什么"
[ 7] assistant → "你上面依次说了: 1.clone... 2.clone... 3.continue 4.我上面都说了什么"
↑ agent doesn't see its own 3 TaskCreate calls
[ 9] user → "当前 context 中有哪些信息"
[10] assistant → "...你之前两次要求我 clone openai/codex,但到目前为止还没有实际执行克隆"
↑ explicitly gaslights — gpt-5.4 was confident it never ran any tools
```
Root cause
`crates/puffer-cli/src/daemon.rs:2241-2301` had a `match outcome` where the `Err(_)` arm only published a `turn-error` SSE event — it never persisted any partial work to `session.jsonl`. The `agent_loop`'s `bail!("cancelled")` doesn't carry a `TurnExecution`, so all completed `tool_invocations` from that turn were silently dropped.
Even though api-server's PostgreSQL did see those tool calls (streamed via SSE) and displayed them in the UI, puffer's own `session.jsonl` — which is what `AppState::from_session_record` reads on the next turn — never received them. So the LLM's context-build was blind to the cancelled turn's work.
Cross-impl reference
A research subagent traced cancel handling in codex / pi-mono / claude-code (decompiled). All three persist partial state and add an explicit "interrupted" marker:
Fix (single-file, minimal blast radius)
Test plan
Unit tests (added)
Regression
End-to-end (real daemon + real LLM via builtin-openai + WS cancel_turn RPC)
```
master (buggy) fix branch
turn-error.partialInvocations None 1
session.jsonl event count 1 3
tool_invocation for echo HELLO ❌ DROPPED ✅ persisted
turn-aborted system_message ❌ missing ✅ "1 tool call completed: Bash..."
```
Fix branch `session.jsonl` after cancel:
```
[0] user_message Run TWO Bash commands in sequence...
[1] tool_invocation tool_id=Bash success=true ← echo HELLO persisted
[2] system_message [turn-aborted] cancelled by user. 1 tool call completed before the abort: Bash. Any side effects of those tool calls (file writes, network calls, task creation) have already happened. Do NOT assume the turn ran to completion; verify state before continuing, and confirm with the user if uncertain.
```
E2E script + venv kept at `/tmp/e2e-cancel-history.py` + `/tmp/puffer-e2e-venv` for reproduction.
Follow-ups (filed separately, NOT in this PR)