Fix serverless crashes (socket hang up) and 60s idle invocations in MCP function by JakeSCahill · Pull Request #185 · redpanda-data/docs-site

JakeSCahill · 2026-06-22T11:02:31Z

Problem

The MCP function was logging two distinct serverless-vs-long-lived-connection failures:

1. socket hang up / ECONNRESET → Unhandled Promise Rejection → LAMBDA_RUNTIME Failed to post handler success response … Invalid request ID. The upstream MCP clients (kapaClient, bumpClient) are module-global and hold persistent connections that are reused across warm invocations. When the Lambda container freezes between requests, the idle upstream socket is dropped; on thaw, Node emits the socket error inside the transport's background read loop, where it becomes a rejection with no awaiter. The existing isTransientError retry only catches errors thrown during a tool call, so this one slips through as an unhandled rejection and the runtime kills the invocation.
2. Duration: 60000 ms invocations. Every connected client opens Streamable HTTP's optional GET server→client SSE stream. This server is request/response only (it never pushes server-initiated messages), so on serverless that stream idles open until the function hits its max duration: one wasted full-length invocation per connected client.

Fix

1. Transport onerror/onclose handlers on both the Kapa and Bump transports reset the cached connection at the source, so a dropped socket reconnects on the next request instead of bubbling up.
2. A process-level safety net (unhandledRejection / uncaughtException, registered once per cold start) logs the error and resets the cached connections rather than letting the runtime crash the invocation. Errors are logged clearly, so real bugs stay visible.
3. GET requests for the idle SSE stream are declined with 405. The MCP spec explicitly allows 405 Method Not Allowed on GET when the server doesn't offer an SSE stream there, and clients fall back to POST. This eliminates the 60s invocations.
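The transport-level reset can be sketched roughly as follows. This is a minimal illustration, not the code in mcp.mjs: `createCachedConnection`, `createTransport`, and `connect` are hypothetical names, and the only thing assumed about a transport is that it exposes assignable `onerror` and `onclose` properties.

```javascript
// Hypothetical helper (not from mcp.mjs): caches one connection promise and
// drops it when the transport that produced it reports an error or closes.
function createCachedConnection(createTransport, connect) {
  let cached = null; // promise for the connected client, or null

  function get() {
    if (cached !== null) return cached;

    const transport = createTransport();

    // Only clear the cache if it still holds THIS attempt, so a late event
    // from a stale transport cannot discard a newer, healthy connection.
    const resetThis = (reason) => {
      if (cached === attempt) {
        console.warn(`[mcp] dropping cached upstream connection: ${reason}`);
        cached = null;
      }
    };

    // Handlers are assigned before connecting, as described in the fix.
    transport.onerror = (err) => resetThis(err?.message ?? 'transport error');
    transport.onclose = () => resetThis('transport closed');

    const attempt = Promise.resolve()
      .then(() => connect(transport))
      .catch((err) => {
        resetThis('connect failed');
        throw err; // the caller awaiting get() still sees the failure
      });

    cached = attempt;
    return attempt;
  }

  return { get };
}
```

With something like this, a request handler would call `await conn.get()` on every invocation; after a dropped socket the next call builds a fresh transport instead of reusing the dead one.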

Bumps SERVER_VERSION → 1.1.3.

Risk / notes

The 405 on GET is spec-compliant and our supported clients (ChatGPT, Claude, Cursor, VS Code) handle it — sessions initialize over POST, which is unchanged. If any client unexpectedly depends on the GET stream, reverting just that hunk restores the old behavior.
No behavior change to POST/DELETE/OPTIONS or rate limiting.
Independent of the OAuth work (#181, "DOC-2262: Add OAuth authentication to the docs MCP server (Redpanda Cloud IdP)") and the Neon work (#184, "Move OAuth state to Neon Postgres (atomic one-time-use) behind STORE_BACKEND flag"). This is a base mcp.mjs fix targeting main so it can ship to production quickly.
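For reference, declining the GET stream could look something like the sketch below. It assumes a Fetch-style handler that receives an object with a `method` property and returns a `Response` (available as a global in Node 18+). `declineIdleSse`, the plain-text body, and the exact `Allow` list are illustrative assumptions, not taken from mcp.mjs.

```javascript
// Hypothetical guard (not from mcp.mjs): returns a 405 for GET, or null so the
// caller can continue with its normal POST/DELETE/OPTIONS handling.
function declineIdleSse(request) {
  if (request.method !== 'GET') return null;

  return new Response('Method Not Allowed', {
    status: 405,
    headers: {
      // Assumed list, based on the methods this PR says are unchanged.
      Allow: 'POST, DELETE, OPTIONS',
      'Content-Type': 'text/plain',
    },
  });
}
```

A handler would call this first and return the result when it is non-null, which keeps the revert to a single hunk if a client turns out to depend on the GET stream.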

Two serverless-vs-long-lived-connection problems surfaced as Netlify function errors:

1. socket hang up / ECONNRESET -> Unhandled Promise Rejection -> 'Invalid request ID'. The cached upstream MCP clients (Kapa, Bump) hold persistent connections reused across warm invocations. When the container freezes/thaws, the idle socket is dropped and the error fires in the transport's background read loop with no awaiter, so the runtime kills the invocation. Fix: set onerror/onclose on each transport to reset the cached connection at the source, plus a process-level unhandledRejection/uncaughtException safety net that logs and resets instead of crashing.

2. Duration: 60000 ms. Every connected client opens the optional GET SSE stream; this server is request/response only, so on serverless that stream idles open until the function's max duration, wasting a full-length invocation per client. Fix: decline GET with 405 (spec allows this when no SSE stream is offered; clients use POST).

Bumps SERVER_VERSION to 1.1.3.

netlify · 2026-06-22T11:02:36Z

✅ Deploy Preview for redpanda-documentation ready!

Name	Link
🔨 Latest commit	`e0a1cb2`
🔍 Latest deploy log	https://app.netlify.com/projects/redpanda-documentation/deploys/6a39878e98a1190008b17119
😎 Deploy Preview	https://deploy-preview-185--redpanda-documentation.netlify.app
Lighthouse	1 path audited. Performance: 82 (🟢 up 19 from production); Accessibility: 92 (🔴 down 2 from production); Best Practices: 92 (no change from production); SEO: 83 (no change from production); PWA: -


micheleRP

Approving — the serverless fixes are well-reasoned and the implementation is sound. I verified the one thing that could have undermined the transport-level handlers: in @modelcontextprotocol/sdk@1.17.0, Protocol.connect() (dist/esm/shared/protocol.js) assigns this._transport = transport and then captures and chains the existing onerror/onclose (_onerror?.(error); this._onerror(error)) rather than overwriting them. So the handlers you set before kapaClient.connect(...) / bumpClient.connect(...) do fire. The reconnect-after-reset path is also already exercised by the existing catch blocks, and the 405 on GET is spec-compliant and trivially revertable.
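To make the chaining behaviour concrete, here is a small illustration of the pattern described above. It is not the SDK's source: `chainTransportHandlers` and `protocolHandlers` are hypothetical names, and the sketch only mirrors the order reported in this review (the pre-existing handler runs first, then the protocol's own).

```javascript
// Illustrative only: captures handlers already set on the transport and
// wraps them, instead of overwriting them.
function chainTransportHandlers(transport, protocolHandlers) {
  const existingOnError = transport.onerror;
  const existingOnClose = transport.onclose;

  transport.onerror = (error) => {
    existingOnError?.(error); // handler set by the application, if any
    protocolHandlers.onerror(error);
  };

  transport.onclose = () => {
    existingOnClose?.(); // handler set by the application, if any
    protocolHandlers.onclose();
  };
}
```

Had connect() assigned its handlers directly, anything set on the transport beforehand would be silently lost, which is why this check mattered for the fix.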

One non-blocking discussion point for your consideration:

Breadth of the uncaughtException guard. The unhandledRejection handler is well-targeted at exactly the described failure (a background-read-loop rejection with no awaiter that the runtime treats as fatal). The uncaughtException handler is broader: swallowing it and continuing leaves the process in a state Node's own docs flag as unsafe, and it catches all exceptions process-wide, not just upstream socket drops. The mitigation is real — the only recovery action is nulling cached connection promises (cheap and safe), and everything is logged so real bugs stay visible. Might be worth scoping it to known socket errors (ECONNRESET / socket hang up) and rethrowing otherwise, vs. keeping the broad catch for serverless robustness. Your call — not a blocker.

Per review: the broad uncaughtException handler could mask real bugs and leave the process in an unsafe state. Recover only from known upstream socket drops (ECONNRESET / socket hang up / EPIPE / ECONNREFUSED); log and re-throw anything else so it surfaces normally. unhandledRejection (the actual incident path) stays a recover-all, as reviewed.

JakeSCahill · 2026-06-22T19:06:55Z

Thanks @micheleRP — and thanks for verifying the SDK connect() chains (rather than overwrites) onerror/onclose; that was the load-bearing assumption.

Addressed the uncaughtException note in e0a1cb2: it now recovers only from known upstream socket drops (ECONNRESET / socket hang up / EPIPE / ECONNREFUSED) and re-throws everything else so genuine bugs surface normally (re-throwing inside the handler terminates the process). Left unhandledRejection as a recover-all, since that's the actual incident path you flagged as well-targeted.
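A rough sketch of the resulting safety net, for readers who have not opened the diff. The names `isUpstreamSocketDrop`, `installSafetyNet`, and `resetConnections` are hypothetical and the matching rules are an assumption based on the error list above; the actual code in e0a1cb2 may differ.

```javascript
// Assumed set of recoverable error codes, taken from the list in this thread.
const RECOVERABLE_CODES = new Set(['ECONNRESET', 'EPIPE', 'ECONNREFUSED']);

// True only for errors that look like a dropped upstream socket.
function isUpstreamSocketDrop(err) {
  if (!err) return false;
  if (RECOVERABLE_CODES.has(err.code)) return true;
  return typeof err.message === 'string' && err.message.includes('socket hang up');
}

let installed = false; // module scope: persists across warm invocations

// Registers both handlers once. `proc` is normally the global `process`;
// it is a parameter here only so the sketch can be exercised in isolation.
function installSafetyNet(proc, resetConnections) {
  if (installed) return false;
  installed = true;

  // Recover-all: this is the incident path (background rejection, no awaiter).
  proc.on('unhandledRejection', (reason) => {
    console.error('[mcp] unhandled rejection, resetting upstream connections:', reason);
    resetConnections();
  });

  // Scoped: recover from known socket drops, re-throw everything else.
  proc.on('uncaughtException', (err) => {
    if (isUpstreamSocketDrop(err)) {
      console.error('[mcp] upstream socket drop, resetting connections:', err);
      resetConnections();
      return;
    }
    console.error('[mcp] uncaught exception, re-throwing:', err);
    throw err; // throwing inside this handler terminates the process
  });

  return true;
}
```

In a real module this would be called once at load time, for example `installSafetyNet(process, resetAllConnections)`, where `resetAllConnections` nulls the cached connection promises.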

JakeSCahill requested a review from a team as a code owner June 22, 2026 11:02

micheleRP approved these changes Jun 22, 2026


JakeSCahill merged commit e5257b6 into main Jun 22, 2026
4 checks passed

JakeSCahill deleted the fix/mcp-serverless-socket-crashes branch June 22, 2026 19:15

JakeSCahill merged 2 commits into main from fix/mcp-serverless-socket-crashes
