Skip to content

fix(mcp): Fix MCP reconnect lifecycle after local server crash#1935

Open
yczhang-nv wants to merge 1 commit into
NVIDIA:release/1.7from
yczhang-nv:yuchenz/nat-221-investigate-agent-stuck-state-after-local-mcp-server-failure
Open

fix(mcp): Fix MCP reconnect lifecycle after local server crash#1935
yczhang-nv wants to merge 1 commit into
NVIDIA:release/1.7from
yczhang-nv:yuchenz/nat-221-investigate-agent-stuck-state-after-local-mcp-server-failure

Conversation

@yczhang-nv
Copy link
Copy Markdown
Contributor

@yczhang-nv yczhang-nv commented May 14, 2026

Description

Fixes MCP client reconnect after a local MCP server crashes by ensuring MCP transport contexts are closed and recreated from the same task that opened them.

Previously, MCPBaseClient._reconnect() closed the existing AsyncExitStack directly from the request/tool-call task that observed the failure. For local MCP transports, this can violate AnyIO cancel-scope task ownership and trigger:

Attempted to exit cancel scope in a different task than it was entered in
After that failure, downstream recovery could leave the workflow in a bad state and cause later LLM calls to fail with closed-client / connection errors.

Changes

  • Move MCP client connect/reconnect/close operations into a dedicated lifecycle task.
  • Route lifecycle operations through an async command queue so request tasks can ask for reconnect without owning transport cleanup.
  • Clear connection/session/tool cache state during lifecycle close/reconnect.
  • Add regression coverage for reconnect from a separate request task with a task-bound transport context.

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Refactor

    • Improved MCP client connection lifecycle management for better stability and reliability.
    • Enhanced reconnection handling to work seamlessly across different execution contexts.
  • Tests

    • Added comprehensive test coverage for reconnection scenarios in various contexts.

Review Change Stack

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv yczhang-nv requested a review from a team as a code owner May 14, 2026 22:16
@yczhang-nv yczhang-nv self-assigned this May 14, 2026
@yczhang-nv yczhang-nv added bug Something isn't working non-breaking Non-breaking change labels May 14, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Walkthrough

MCPBaseClient now offloads transport lifecycle (connect/reconnect/close) to a serialized background worker task with asyncio.Queue commands, replacing inline context management in aenter/aexit. Supporting auth token detection, constructor formatting, and task-bound lifecycle validation tests were added.

Changes

MCP Transport Lifecycle Worker Refactoring

Layer / File(s) Summary
Lifecycle Worker Pattern & Command Queue
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py
MCPBaseClient.__init__ adds _lifecycle_task and _lifecycle_commands queue. __aenter__/__aexit__ now create/await a background worker and enqueue "connect"/"close" commands instead of directly managing AsyncExitStack. New helpers: _run_lifecycle_command (enqueue/await), _lifecycle_worker (serial command loop), _connect_connection (context enter/state setup), _close_connection (context exit/state cleanup). _reconnect enqueues "reconnect" instead of inline teardown.
Auth Token Detection & get_tools() Formatting
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py
_has_cached_auth_token attribute-check logic reorganized for _auth_code_provider/_authenticated_tokens. get_tools() reformatted to construct MCPToolClient using multiline keyword arguments with adjusted spacing near timeout handling.
Constructor Formatting Consistency
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py
AuthAdapter, MCPSSEClient, MCPStdioClient, MCPStreamableHTTPClient, and MCPToolClient constructors reformatted into multiline super().__init__ calls and assignments. connect_to_server context manager unpacking reformatted to multiline while preserving get_session_id capture.
Task-Bound Lifecycle Test Infrastructure
packages/nvidia_nat_mcp/tests/client/test_mcp_client_base.py
TaskBoundMockMCPClient and TaskBoundAsyncContextManager enforce exit in the same task that entered; raises error if __aexit__ runs in a different task. New test test_reconnect_exits_transport_stack_in_lifecycle_task validates task-bound semantics when reconnection originates from a separate request task.
Test Mock Updates & Formatting
packages/nvidia_nat_mcp/tests/client/test_mcp_client_base.py
MockAsyncContextManager reconnect-failure conditions reformatted with equivalent logic. MockMCPClient instantiations, asyncio.sleep patches, reconnect_max_backoff assignments, and MCPToolClient calls across multiple tests reformatted to multiline. Nested patch() blocks in connect_to_server tests use double-quoted paths and adjusted wiring. Skipped test definitions updated with whitespace changes.

Sequence Diagram

sequenceDiagram
  participant Client as __aenter__/__aexit__
  participant Queue as asyncio.Queue
  participant Worker as _lifecycle_worker
  participant Stack as AsyncExitStack
  Client->>Queue: enqueue "connect"
  Client->>Client: start _lifecycle_worker task
  Worker->>Queue: await command
  Worker->>Stack: _connect_connection (enter/set state)
  Client->>Queue: enqueue "reconnect"
  Worker->>Queue: await command
  Worker->>Stack: _close_connection then _connect_connection
  Client->>Queue: enqueue "close"
  Worker->>Queue: await command
  Worker->>Stack: _close_connection (exit/clear state)
  Worker->>Worker: exit loop
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: fixing MCP reconnect lifecycle after local server crash. It follows the specified format with imperative mood, is concise (62 chars), and accurately reflects the core fix.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py (1)

283-283: 💤 Low value

Redundant state assignment after lifecycle command failure.

When the reconnect command fails, the lifecycle worker (line 324) already sets _connection_established = False. This line is redundant and could obscure state ownership—all state mutations should ideally occur within the lifecycle worker.

Consider removing the redundant assignment
             # All attempts failed
-            self._connection_established = False
             if last_error:
                 raise last_error
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py` at line
283, Remove the redundant assignment to self._connection_established = False in
the reconnect failure handling so state mutation is owned exclusively by the
lifecycle worker; locate the reconnect command handler in the client class
(references to reconnect/reconnect command and self._connection_established) and
delete the explicit False assignment there, ensuring only the lifecycle worker
code that already sets _connection_established = False on failure manages that
state.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py`:
- Line 283: Remove the redundant assignment to self._connection_established =
False in the reconnect failure handling so state mutation is owned exclusively
by the lifecycle worker; locate the reconnect command handler in the client
class (references to reconnect/reconnect command and
self._connection_established) and delete the explicit False assignment there,
ensuring only the lifecycle worker code that already sets
_connection_established = False on failure manages that state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c4119304-5b2a-43e3-a83e-e05a7203d640

📥 Commits

Reviewing files that changed from the base of the PR and between 20639fc and 2e51c2d.

📒 Files selected for processing (2)
  • packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py
  • packages/nvidia_nat_mcp/tests/client/test_mcp_client_base.py

@yczhang-nv yczhang-nv added bug Something isn't working non-breaking Non-breaking change and removed bug Something isn't working non-breaking Non-breaking change labels May 14, 2026
@yczhang-nv yczhang-nv changed the base branch from develop to release/1.7 May 14, 2026 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant