Skip to content

puffer-cli: classify runner-unreachable transport errors with category + friendly message#110

Merged
shouc merged 1 commit into
masterfrom
fix/runner-transport-error-classify
May 13, 2026
Merged

puffer-cli: classify runner-unreachable transport errors with category + friendly message#110
shouc merged 1 commit into
masterfrom
fix/runner-transport-error-classify

Conversation

@n-WN
Copy link
Copy Markdown
Collaborator

@n-WN n-WN commented May 12, 2026

Summary

When the daemon's RemoteToolRunner hits tcp connect error / Connection refused, the SSE turn-error event now carries:

  • A friendly retry-hint as error
  • A stable category: "runner_unreachable" discriminator
  • The raw error preserved as errorRaw for debug

Why

Sister-side investigation in agentenv/monorepo#401: managed-agent wakes pass the puffer daemon a stale remote_runner.endpoint carrying the prior session's host_port. The hypervisor allocates a new host_port on restore — the old one is unbound — so the first tool call after wake reliably hits a transport failure. Concrete repro evidence captured today (config.toml says :10679, container's actual host_port=10680, off-by-one sequential allocation, 33s temporal gap between config write and port allocation).

The real fix belongs api-server-side (refresh sandbox.metadata.runtimeAutoExposedPorts before passing runnerEndpoint to wakePuffer). This puffer-side change is complementary: until that ships and even after, turning the cryptic "transport error: tcp connect error" into an actionable "the tool runner is not reachable yet — retry in a few seconds" is a strict UX win.

Categories distinguished

category trigger shape
runner_unreachable tcp connect error / Connection refused friendly retry-hint message + raw kept as errorRaw
cancelled cancelled bail string identical to today's behavior — preserved for PR #109 compat
other anything else raw chain preserved

Test plan

Unit

  • daemon::tests::classify_turn_error_distinguishes_runner_unreachable — 3 categories
  • All other daemon::tests::* still pass (no regression)

End-to-end

  • Built local binary, ran daemon with [remote_runner].endpoint = "http://127.0.0.1:54321" (deliberately unbound port).
  • Sent a Bash-requiring prompt via WS (Use the Bash tool to run: echo HELLO).
  • Verified the SSE turn-error payload:
    {
      "type": "turn-error",
      "category": "runner_unreachable",
      "error": "the tool runner is not reachable yet — ...retry in a few seconds.",
      "errorRaw": "transport error: tcp connect error",
      "turnId": "..."
    }

Follow-ups (not in this PR)

  • The root cause for the stale-port behavior is filed at agentenv/monorepo#401 — api-server's wakeResolvedPufferAgent must wait for the post-restore runtimeAutoExposedPorts before computing runnerEndpoint.
  • A future puffer-side enhancement could add an in-daemon retry-with-config-reload, but that requires write coordination from puffer-manager to refresh config.toml mid-session, which is non-trivial.

When the daemon's RemoteToolRunner hits `tcp connect error` /
`Connection refused`, the turn-error SSE event now carries a
friendly retry-hint plus a stable `category: "runner_unreachable"`
discriminator. The raw error is preserved as `errorRaw` for debug.

This is the puffer-side belt-and-suspenders for
agentenv/monorepo#401, where managed-agent wakes can send the puffer
daemon a stale `remote_runner.endpoint` carrying the prior session's
host_port. The hypervisor allocates a new host_port on restore (the
old one is unbound), so the very first tool call hits a transport
failure. The real fix lives api-server-side (refresh sandbox
metadata before passing runnerEndpoint to wakePuffer), but until
that ships, this categorization lets frontends turn a cryptic
"transport error: tcp connect error" message into an actionable
"the tool runner is not reachable yet, retry in a moment".

Tests:
- New unit test asserts the three categories: runner_unreachable,
  cancelled (canonical bail string preserved), other (raw chain
  preserved for any unrecognized error).
- End-to-end run with a local puffer daemon configured to point
  remote_runner at an unbound localhost port reproduces the exact
  production failure mode and confirms the new SSE event shape:

    turn-error payload:
    {
      "type": "turn-error",
      "category": "runner_unreachable",
      "error": "the tool runner is not reachable yet — ...
                retry in a few seconds.",
      "errorRaw": "transport error: tcp connect error",
      "turnId": "..."
    }

  Tested against gpt-5.4 / builtin-openai with the Bash tool to
  reliably trigger a remote-runner dispatch.

Cross-system reference: when the api-server-side fix lands, this
classifier should keep paying off for other transient network
failures between the puffer daemon and the runtime port (e.g. a
brief virtiofs hiccup or a runtime reboot).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants