
Fix serverless crashes (socket hang up) and 60s idle invocations in MCP function #185

Merged
JakeSCahill merged 2 commits into main from fix/mcp-serverless-socket-crashes
Jun 22, 2026

@JakeSCahill


Problem

The MCP function was logging two distinct serverless-vs-long-lived-connection failures:

  1. socket hang up / ECONNRESET → Unhandled Promise Rejection → LAMBDA_RUNTIME Failed to post handler success response … Invalid request ID. The upstream MCP clients (kapaClient, bumpClient) are module-global and hold persistent connections reused across warm invocations. When the Lambda container freezes between requests, the idle upstream socket is dropped; on thaw, Node emits the socket error inside the transport's background read loop — a rejection with no awaiter. The existing isTransientError retry only catches errors during a tool call, so this slips through as an unhandled rejection and the runtime kills the invocation.

  2. Duration: 60000 ms invocations. Every connected client opens Streamable HTTP's optional GET server→client SSE stream. This server is request/response only (it never pushes server-initiated messages), so on serverless that stream just idles open until the function hits its max duration — a wasted full-length invocation per connected client.

Fix

  • Transport onerror/onclose on both Kapa and Bump transports → reset the cached connection at the source, so a dropped socket reconnects on the next request instead of bubbling up.
  • Process-level safety net (unhandledRejection / uncaughtException, registered once per cold start) → log and reset the cached connections rather than letting the runtime crash the invocation. Errors are logged clearly, so real bugs stay visible.
  • Decline the idle GET SSE with 405 — the MCP spec explicitly allows 405 Method Not Allowed on GET when the server doesn't offer an SSE stream there, and clients fall back to POST. Eliminates the 60s invocations.
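
The first bullet (resetting the cached connection from the transport callbacks) can be sketched roughly as below. This is an illustration under assumptions, not the code from this PR: `ResettableTransport`, `makeTransport`, `openConnection`, `getConnection`, and `connectCount` are hypothetical names, and the real transports come from the MCP SDK. The sketch only assumes that the cache is a module-global promise and that the transport exposes `onerror` and `onclose` callbacks, as the description states.

```typescript
// Hypothetical stand-in for an MCP client transport; only the two
// callbacks discussed in this PR are modelled.
interface ResettableTransport {
  onerror?: (error: Error) => void;
  onclose?: () => void;
}

// Module-global cache, reused across warm invocations.
let cachedConnection: Promise<ResettableTransport> | null = null;
// Counts connection attempts so the reset behaviour is observable.
let connectCount = 0;

// Builds a transport whose callbacks drop the cached connection, so a
// socket error in the background read loop is handled at the source.
function makeTransport(): ResettableTransport {
  const transport: ResettableTransport = {};
  transport.onerror = (error: Error) => {
    console.warn(`upstream transport error, dropping cached connection: ${error.message}`);
    cachedConnection = null;
  };
  transport.onclose = () => {
    cachedConnection = null;
  };
  return transport;
}

// Hypothetical connect routine. A real one would create the SDK
// transport and await client.connect(transport) before returning.
async function openConnection(): Promise<ResettableTransport> {
  connectCount += 1;
  return makeTransport();
}

// Returns the cached connection, reconnecting only after a reset.
function getConnection(): Promise<ResettableTransport> {
  if (cachedConnection === null) {
    cachedConnection = openConnection();
  }
  return cachedConnection;
}
```

With this shape, a dropped socket costs one reconnect on the next request. One design caveat of the simplified version: a late callback from an old transport would also clear a newer cached connection, which is harmless here but causes an extra reconnect.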

Bumps SERVER_VERSION to 1.1.3.

Risk / notes

Two serverless-vs-long-lived-connection problems surfaced as Netlify
function errors:

1. socket hang up / ECONNRESET -> Unhandled Promise Rejection ->
   'Invalid request ID'. The cached upstream MCP clients (Kapa, Bump)
   hold persistent connections reused across warm invocations. When the
   container freezes/thaws, the idle socket is dropped and the error
   fires in the transport's background read loop with no awaiter, so the
   runtime kills the invocation. Fix: set onerror/onclose on each
   transport to reset the cached connection at the source, plus a
   process-level unhandledRejection/uncaughtException safety net that
   logs and resets instead of crashing.

2. Duration: 60000 ms. Every connected client opens the optional GET SSE
   stream; this server is request/response only, so on serverless that
   stream idles open until the function's max duration — a wasted
   full-length invocation per client. Fix: decline GET with 405 (spec
   allows this when no SSE stream is offered; clients use POST).
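
The GET handling in item 2 can be sketched as a small pure function. This is a minimal illustration, not the function's actual code: `PlainResponse` and `declineSseGet` are hypothetical names, the response is a plain object rather than the platform's response type, and the `Allow: POST` header value is an assumption (a 405 response is expected to carry an `Allow` header, but the real endpoint may accept other methods as well).

```typescript
// Hypothetical plain response shape; a real handler would map this to
// the platform's response object.
interface PlainResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Returns a 405 for GET (no SSE stream is offered on this endpoint) and
// null for every other method, so the caller continues with its normal
// POST handling.
function declineSseGet(method: string): PlainResponse | null {
  if (method.toUpperCase() !== "GET") {
    return null;
  }
  return {
    status: 405,
    // Assumption: POST is the only method this endpoint serves.
    headers: { Allow: "POST" },
    body: "Method Not Allowed",
  };
}
```

Calling this check before any transport setup means a GET returns immediately instead of holding an idle stream open until the function's maximum duration.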

Bumps SERVER_VERSION to 1.1.3.
@JakeSCahill JakeSCahill requested a review from a team as a code owner June 22, 2026 11:02
@netlify

netlify Bot commented Jun 22, 2026


Deploy Preview for redpanda-documentation ready!

Name Link
🔨 Latest commit e0a1cb2
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-documentation/deploys/6a39878e98a1190008b17119
😎 Deploy Preview https://deploy-preview-185--redpanda-documentation.netlify.app
Lighthouse
1 paths audited
Performance: 82 (🟢 up 19 from production)
Accessibility: 92 (🔴 down 2 from production)
Best Practices: 92 (no change from production)
SEO: 83 (no change from production)
PWA: -

@micheleRP micheleRP left a comment


Approving — the serverless fixes are well-reasoned and the implementation is sound. I verified the one thing that could have undermined the transport-level handlers: in @modelcontextprotocol/sdk@1.17.0, Protocol.connect() (dist/esm/shared/protocol.js) assigns this._transport = transport and then captures and chains the existing onerror/onclose (_onerror?.(error); this._onerror(error)) rather than overwriting them. So the handlers you set before kapaClient.connect(...) / bumpClient.connect(...) do fire. The reconnect-after-reset path is also already exercised by the existing catch blocks, and the 405 on GET is spec-compliant and trivially revertable.
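
The capture-and-chain behaviour described above can be illustrated with a small sketch. This is not the SDK's source: `ChainableTransport` and `MiniProtocol` are hypothetical names that only mimic the pattern the review describes, where `connect()` keeps the handler that was already set on the transport and calls it before its own.

```typescript
// Hypothetical transport shape with the two callbacks in question.
interface ChainableTransport {
  onerror?: (error: Error) => void;
  onclose?: () => void;
}

// Toy protocol layer that chains existing handlers instead of
// overwriting them.
class MiniProtocol {
  readonly seen: string[] = [];

  connect(transport: ChainableTransport): void {
    // Capture whatever the application set before connect().
    const previousOnError = transport.onerror;
    transport.onerror = (error: Error) => {
      previousOnError?.(error); // application handler runs first
      this.seen.push(`protocol saw: ${error.message}`);
    };

    const previousOnClose = transport.onclose;
    transport.onclose = () => {
      previousOnClose?.();
      this.seen.push("protocol saw: close");
    };
  }
}
```

If `connect()` assigned its own handlers without capturing the previous ones, a cache reset registered before connecting would never run, which is why this behaviour mattered for the fix.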

One non-blocking discussion point for your consideration:

  • Breadth of the uncaughtException guard. The unhandledRejection handler is well-targeted at exactly the described failure (a background-read-loop rejection with no awaiter that the runtime treats as fatal). The uncaughtException handler is broader: swallowing it and continuing leaves the process in a state Node's own docs flag as unsafe, and it catches all exceptions process-wide, not just upstream socket drops. The mitigation is real — the only recovery action is nulling cached connection promises (cheap and safe), and everything is logged so real bugs stay visible. Might be worth scoping it to known socket errors (ECONNRESET / socket hang up) and rethrowing otherwise, vs. keeping the broad catch for serverless robustness. Your call — not a blocker.

Per review: the broad uncaughtException handler could mask real bugs and
leave the process in an unsafe state. Recover only from known upstream
socket drops (ECONNRESET / socket hang up / EPIPE / ECONNREFUSED); log and
re-throw anything else so it surfaces normally. unhandledRejection (the
actual incident path) stays a recover-all, as reviewed.
@JakeSCahill


Thanks @micheleRP — and thanks for verifying the SDK connect() chains (rather than overwrites) onerror/onclose; that was the load-bearing assumption.

Addressed the uncaughtException note in e0a1cb2: it now recovers only from known upstream socket drops (ECONNRESET / socket hang up / EPIPE / ECONNREFUSED) and re-throws everything else so genuine bugs surface normally (re-throwing inside the handler terminates the process). Left unhandledRejection as a recover-all, since that's the actual incident path you flagged as well-targeted.
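
A rough sketch of the scoped safety net described above, under stated assumptions: `isUpstreamSocketDrop`, `installSafetyNet`, `ProcessLike`, and `resetConnections` are hypothetical names, and the process object is passed in as a parameter so the sketch stays self-contained. The list of recoverable errors is taken from this comment; the actual matching logic in e0a1cb2 may differ.

```typescript
// Error codes treated as recoverable upstream socket drops.
const RECOVERABLE_CODES = new Set(["ECONNRESET", "EPIPE", "ECONNREFUSED"]);

// True only for the known socket-drop errors listed in this PR.
function isUpstreamSocketDrop(error: unknown): boolean {
  if (!(error instanceof Error)) {
    return false;
  }
  const code = (error as Error & { code?: unknown }).code;
  if (typeof code === "string" && RECOVERABLE_CODES.has(code)) {
    return true;
  }
  return error.message.includes("socket hang up");
}

// Minimal event-emitter shape; Node's `process` is assumed to fit it.
interface ProcessLike {
  on(event: string, listener: (...args: any[]) => void): unknown;
}

let installed = false;

// Registers both handlers once per cold start.
function installSafetyNet(proc: ProcessLike, resetConnections: () => void): void {
  if (installed) {
    return;
  }
  installed = true;

  // Recover-all: this is the incident path (rejection with no awaiter).
  proc.on("unhandledRejection", (reason: unknown) => {
    console.error("unhandledRejection, resetting upstream connections:", reason);
    resetConnections();
  });

  // Scoped: recover only from known socket drops, re-throw the rest.
  proc.on("uncaughtException", (error: unknown) => {
    if (isUpstreamSocketDrop(error)) {
      console.error("uncaughtException from upstream socket drop, resetting:", error);
      resetConnections();
      return;
    }
    console.error("uncaughtException (not a known socket drop), re-throwing:", error);
    throw error;
  });
}
```

In a Node function this would presumably be wired as `installSafetyNet(process, resetConnections)` at module load, where `resetConnections` clears the cached Kapa and Bump connection promises.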

@JakeSCahill JakeSCahill merged commit e5257b6 into main Jun 22, 2026
4 checks passed
@JakeSCahill JakeSCahill deleted the fix/mcp-serverless-socket-crashes branch June 22, 2026 19:15