puffer-cli: classify runner-unreachable transport errors with category + friendly message #110
Merged
When the daemon's RemoteToolRunner hits `tcp connect error` / `Connection refused`, the turn-error SSE event now carries a friendly retry-hint plus a stable `category: "runner_unreachable"` discriminator. The raw error is preserved as `errorRaw` for debug.

This is the puffer-side belt-and-suspenders for agentenv/monorepo#401, where managed-agent wakes can send the puffer daemon a stale `remote_runner.endpoint` carrying the prior session's host_port. The hypervisor allocates a new host_port on restore (the old one is unbound), so the very first tool call hits a transport failure. The real fix lives api-server-side (refresh sandbox metadata before passing runnerEndpoint to wakePuffer), but until that ships, this categorization lets frontends turn a cryptic "transport error: tcp connect error" message into an actionable "the tool runner is not reachable yet, retry in a moment".

Tests:
- New unit test asserts the three categories: runner_unreachable, cancelled (canonical bail string preserved), other (raw chain preserved for any unrecognized error).
- End-to-end run with a local puffer daemon configured to point remote_runner at an unbound localhost port reproduces the exact production failure mode and confirms the new SSE event shape:

  turn-error payload: { "type": "turn-error", "category": "runner_unreachable", "error": "the tool runner is not reachable yet — ... retry in a few seconds.", "errorRaw": "transport error: tcp connect error", "turnId": "..." }

Tested against gpt-5.4 / builtin-openai with the Bash tool to reliably trigger a remote-runner dispatch.

Cross-system reference: when the api-server-side fix lands, this classifier should keep paying off for other transient network failures between the puffer daemon and the runtime port (e.g. a brief virtiofs hiccup or a runtime reboot).
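The three categories described above could be classified along the following lines. This is a minimal sketch, not the code from this PR: the type names, the `classify_turn_error` signature, substring matching on the error text, the exact-match rule for `cancelled`, and the hint wording are all assumptions made for illustration. Only the category strings and the two trigger phrases come from the description.

```rust
/// Stable discriminator carried in the event's `category` field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TurnErrorCategory {
    RunnerUnreachable,
    Cancelled,
    Other,
}

impl TurnErrorCategory {
    /// Wire value for the `category` field.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::RunnerUnreachable => "runner_unreachable",
            Self::Cancelled => "cancelled",
            Self::Other => "other",
        }
    }
}

/// Hypothetical result shape: user-facing message plus optional raw error.
#[derive(Debug, PartialEq, Eq)]
pub struct ClassifiedTurnError {
    pub category: TurnErrorCategory,
    pub error: String,
    pub error_raw: Option<String>,
}

// Placeholder wording; the real hint text is elided in the PR description.
const RUNNER_UNREACHABLE_HINT: &str =
    "the tool runner is not reachable yet, retry in a moment";

/// Classify a rendered error chain by substring (assumed approach).
pub fn classify_turn_error(raw: &str) -> ClassifiedTurnError {
    let lower = raw.to_lowercase();
    if lower.contains("tcp connect error") || lower.contains("connection refused") {
        // Friendly hint for the user, raw text kept for debugging.
        return ClassifiedTurnError {
            category: TurnErrorCategory::RunnerUnreachable,
            error: RUNNER_UNREACHABLE_HINT.to_string(),
            error_raw: Some(raw.to_string()),
        };
    }
    // Assumption: the canonical bail string is exactly "cancelled".
    let category = if raw.trim() == "cancelled" {
        TurnErrorCategory::Cancelled
    } else {
        TurnErrorCategory::Other
    };
    // Both remaining categories pass the original text through unchanged.
    ClassifiedTurnError {
        category,
        error: raw.to_string(),
        error_raw: None,
    }
}
```

Keeping the discriminator as an enum with a single `as_str` mapping means the wire strings live in one place, so frontends can branch on `category` without parsing the message text.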
Summary
When the daemon's `RemoteToolRunner` hits `tcp connect error` / `Connection refused`, the SSE `turn-error` event now carries:
- a friendly retry hint in `error`
- a stable `category: "runner_unreachable"` discriminator
- the raw error, preserved as `errorRaw` for debug

Why
Sister-side investigation in agentenv/monorepo#401: managed-agent wakes pass the puffer daemon a stale `remote_runner.endpoint` carrying the prior session's host_port. The hypervisor allocates a new host_port on restore (the old one is unbound), so the first tool call after wake reliably hits a transport failure. Concrete repro evidence captured today: `config.toml` says `:10679`, the container's actual `host_port=10680` (off-by-one sequential allocation), with a 33s temporal gap between the config write and the port allocation.

The real fix belongs api-server-side (refresh `sandbox.metadata.runtimeAutoExposedPorts` before passing `runnerEndpoint` to `wakePuffer`). This puffer-side change is complementary: until that ships and even after, turning the cryptic `"transport error: tcp connect error"` into an actionable `"the tool runner is not reachable yet — retry in a few seconds"` is a strict UX win.

Categories distinguished
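- `runner_unreachable`: the error contains `tcp connect error` / `Connection refused`; `error` carries the friendly retry hint and the raw error is kept in `errorRaw`.
- `cancelled`: the canonical `cancelled` bail string, preserved as-is.
- `other`: any unrecognized error; the raw chain is preserved.

Test plan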
Unit
- `daemon::tests::classify_turn_error_distinguishes_runner_unreachable` (3 categories)
- `daemon::tests::*` still pass (no regression)

End-to-end
- Local puffer daemon configured with `[remote_runner].endpoint = "http://127.0.0.1:54321"` (deliberately unbound port).
- Prompt sent to trigger a remote-runner dispatch (`Use the Bash tool to run: echo HELLO`).
- Observed `turn-error` payload:

      {
        "type": "turn-error",
        "category": "runner_unreachable",
        "error": "the tool runner is not reachable yet — ...retry in a few seconds.",
        "errorRaw": "transport error: tcp connect error",
        "turnId": "..."
      }

Follow-ups (not in this PR)
- `wakeResolvedPufferAgent` must wait for the post-restore `runtimeAutoExposedPorts` before computing `runnerEndpoint`.
- [...] `config.toml` mid-session, which is non-trivial.
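
Separately from the follow-ups above, a consumer of the `turn-error` event could key its retry behaviour on the `category` discriminator instead of parsing the message text. The sketch below is illustrative only: `RetryDecision`, `retry_decision`, the attempt cap and the backoff schedule are hypothetical and not part of this PR. Retries are bounded so that a runner that stays unreachable still surfaces to the user.

```rust
use std::time::Duration;

/// Hypothetical client-side outcome for a received turn-error event.
#[derive(Debug, PartialEq, Eq)]
pub enum RetryDecision {
    /// Re-send the turn after waiting this long.
    RetryAfter(Duration),
    /// Show the error to the user without retrying.
    GiveUp,
}

/// Decide whether to retry, given the event's `category` and the number of
/// retries already made (0 for the first failure).
pub fn retry_decision(category: &str, attempt: u32) -> RetryDecision {
    // Assumed cap; tune to taste.
    const MAX_ATTEMPTS: u32 = 3;
    match category {
        // Only the transport-level category is treated as transient.
        "runner_unreachable" if attempt < MAX_ATTEMPTS => {
            // Exponential backoff: 2s, 4s, 8s.
            RetryDecision::RetryAfter(Duration::from_secs(2u64 << attempt))
        }
        // `cancelled`, `other`, unknown categories, or retries exhausted.
        _ => RetryDecision::GiveUp,
    }
}
```

Unknown category strings fall through to `GiveUp`, so a client built this way stays safe if further categories are added later.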