fix(mcp): Fix MCP reconnect lifecycle after local server crash #1935
Conversation
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Walkthrough
MCPBaseClient now offloads transport lifecycle (connect/reconnect/close) to a serialized background worker task driven by asyncio.Queue commands, replacing inline context management in __aenter__/__aexit__. Supporting auth token detection, constructor formatting, and task-bound lifecycle validation tests were added.
Changes
MCP Transport Lifecycle Worker Refactoring
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as __aenter__/__aexit__
    participant Queue as asyncio.Queue
    participant Worker as _lifecycle_worker
    participant Stack as AsyncExitStack
    Client->>Queue: enqueue "connect"
    Client->>Client: start _lifecycle_worker task
    Worker->>Queue: await command
    Worker->>Stack: _connect_connection (enter/set state)
    Client->>Queue: enqueue "reconnect"
    Worker->>Queue: await command
    Worker->>Stack: _close_connection then _connect_connection
    Client->>Queue: enqueue "close"
    Worker->>Queue: await command
    Worker->>Stack: _close_connection (exit/clear state)
    Worker->>Worker: exit loop
```
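The following is a minimal Python sketch of that lifecycle, not the actual MCPBaseClient implementation: the class name `LifecycleWorkerSketch`, the `_submit` helper, and the future-based completion signaling are invented for illustration, while the command strings and the `_lifecycle_worker` / `_connect_connection` / `_close_connection` split follow the diagram above.

```python
import asyncio
from contextlib import AsyncExitStack
from typing import Optional


class LifecycleWorkerSketch:
    """Illustrative only: serializes connect/reconnect/close on one worker task."""

    def __init__(self):
        self._commands: asyncio.Queue = asyncio.Queue()
        self._exit_stack: Optional[AsyncExitStack] = None
        self._worker_task: Optional[asyncio.Task] = None

    async def __aenter__(self):
        # Start the worker, then ask it to connect and wait for the outcome.
        self._worker_task = asyncio.create_task(self._lifecycle_worker())
        await self._submit("connect")
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self._submit("close")
        await self._worker_task

    async def reconnect(self):
        await self._submit("reconnect")

    async def _submit(self, command: str):
        # Hand the command to the worker and wait for it to succeed or raise.
        done = asyncio.get_running_loop().create_future()
        await self._commands.put((command, done))
        await done

    async def _lifecycle_worker(self):
        # Every AsyncExitStack enter/exit happens on this task, so AnyIO cancel
        # scopes inside the transport contexts are exited by the same task that
        # entered them.
        while True:
            command, done = await self._commands.get()
            try:
                if command == "connect":
                    await self._connect_connection()
                elif command == "reconnect":
                    await self._close_connection()
                    await self._connect_connection()
                elif command == "close":
                    await self._close_connection()
                done.set_result(None)
            except Exception as exc:
                done.set_exception(exc)
            if command == "close":
                return

    async def _connect_connection(self):
        # Real code would enter the MCP transport/session contexts here via
        # enter_async_context(...); this sketch only creates the stack.
        self._exit_stack = AsyncExitStack()

    async def _close_connection(self):
        if self._exit_stack is not None:
            await self._exit_stack.aclose()
            self._exit_stack = None
```

The key property is that only the worker task ever enters or exits the AsyncExitStack, which keeps any AnyIO cancel scopes inside the transport contexts bound to a single task.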
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py (1)
Line 283: 💤 Low value: Redundant state assignment after lifecycle command failure.
When the reconnect command fails, the lifecycle worker (line 324) already sets _connection_established = False. This line is redundant and could obscure state ownership; all state mutations should ideally occur within the lifecycle worker. Consider removing the redundant assignment:

```diff
 # All attempts failed
-self._connection_established = False
 if last_error:
     raise last_error
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py` at line 283, Remove the redundant assignment to self._connection_established = False in the reconnect failure handling so state mutation is owned exclusively by the lifecycle worker; locate the reconnect command handler in the client class (references to reconnect/reconnect command and self._connection_established) and delete the explicit False assignment there, ensuring only the lifecycle worker code that already sets _connection_established = False on failure manages that state.
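To make the ownership argument concrete, here is a rough, hypothetical excerpt (not the PR's code) showing the division of responsibility the comment advocates: the `_max_reconnect_attempts` name and the method bodies are assumptions, while `_connection_established` and the reconnect command follow the review comment above.

```python
import asyncio


class OwnershipSketch:
    """Hypothetical excerpt: only the lifecycle worker writes the flag."""

    def __init__(self):
        self._commands: asyncio.Queue = asyncio.Queue()
        self._connection_established = False
        self._max_reconnect_attempts = 3  # assumed name and value

    async def _lifecycle_worker(self):
        while True:
            command, done = await self._commands.get()
            try:
                if command == "reconnect":
                    await self._close_connection()
                    await self._connect_connection()
                    self._connection_established = True
                done.set_result(None)
            except Exception as exc:
                # The single place a failed command clears the flag
                # (the "line 324" behavior the comment refers to).
                self._connection_established = False
                done.set_exception(exc)

    async def _reconnect(self):
        # Caller side: submit commands and re-raise; no direct flag writes here.
        last_error = None
        for _ in range(self._max_reconnect_attempts):
            done = asyncio.get_running_loop().create_future()
            await self._commands.put(("reconnect", done))
            try:
                await done
                return
            except Exception as exc:
                last_error = exc
        # All attempts failed
        if last_error:
            raise last_error

    async def _connect_connection(self): ...

    async def _close_connection(self): ...
```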
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c4119304-5b2a-43e3-a83e-e05a7203d640
📒 Files selected for processing (2)
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py
packages/nvidia_nat_mcp/tests/client/test_mcp_client_base.py
Description
Fixes MCP client reconnect after a local MCP server crashes by ensuring MCP transport contexts are closed and recreated from the same task that opened them.
Previously, `MCPBaseClient._reconnect()` closed the existing `AsyncExitStack` directly from the request/tool-call task that observed the failure. For local MCP transports, this can violate AnyIO cancel-scope task ownership and trigger:
Attempted to exit cancel scope in a different task than it was entered in
After that failure, downstream recovery could leave the workflow in a bad state and cause later LLM calls to fail with closed-client / connection errors.
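For readers unfamiliar with the AnyIO rule involved, here is a small standalone repro sketch, not taken from the PR, that uses anyio.CancelScope directly; MCP's local transports embed cancel scopes inside their async context managers, so exiting them from the wrong task hits the same check. The task and event names are made up for this example.

```python
import anyio


async def main():
    scope = anyio.CancelScope()
    opened = anyio.Event()
    attempted = anyio.Event()

    async def owner():
        # Enter the cancel scope on this task.
        scope.__enter__()
        opened.set()
        await attempted.wait()
        # Exiting from the task that entered it is fine.
        scope.__exit__(None, None, None)

    async def other():
        await opened.wait()
        try:
            # Exiting from a different task trips anyio's ownership check.
            scope.__exit__(None, None, None)
        except RuntimeError as exc:
            print(exc)  # "Attempted to exit cancel scope in a different task ..."
        finally:
            attempted.set()

    async with anyio.create_task_group() as tg:
        tg.start_soon(owner)
        tg.start_soon(other)


anyio.run(main)
```

Routing all transport open/close work through the single lifecycle worker task, as this PR does, is one way to guarantee the owning-task condition always holds.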
Changes
By Submitting this PR I confirm:
Summary by CodeRabbit
Refactor
Tests