fix(mcp): Fix MCP reconnect lifecycle after local server crash by yczhang-nv · Pull Request #1935 · NVIDIA/NeMo-Agent-Toolkit

yczhang-nv · 2026-05-14T22:16:48Z

Description

Fixes MCP client reconnect after a local MCP server crashes by ensuring MCP transport contexts are closed and recreated from the same task that opened them.

Previously, MCPBaseClient._reconnect() closed the existing AsyncExitStack directly from the request/tool-call task that observed the failure. For local MCP transports, this can violate AnyIO cancel-scope task ownership and trigger:

Attempted to exit cancel scope in a different task than it was entered in
After that failure, downstream recovery could leave the workflow in a bad state and cause later LLM calls to fail with closed-client / connection errors.

Changes

Move MCP client connect/reconnect/close operations into a dedicated lifecycle task.
Route lifecycle operations through an async command queue so request tasks can ask for reconnect without owning transport cleanup.
Clear connection/session/tool cache state during lifecycle close/reconnect.
Add regression coverage for reconnect from a separate request task with a task-bound transport context.

By Submitting this PR I confirm:

I am familiar with the Contributing Guidelines.
We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
When the PR is ready for review, new or existing tests cover these changes.
When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

Refactor
- Improved MCP client connection lifecycle management for better stability and reliability.
- Enhanced reconnection handling to work seamlessly across different execution contexts.
Tests
- Added comprehensive test coverage for reconnection scenarios in various contexts.

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>

coderabbitai · 2026-05-14T22:17:10Z

Walkthrough

MCPBaseClient now offloads transport lifecycle (connect/reconnect/close) to a serialized background worker task with asyncio.Queue commands, replacing inline context management in aenter/aexit. Supporting auth token detection, constructor formatting, and task-bound lifecycle validation tests were added.

Changes

MCP Transport Lifecycle Worker Refactoring

Layer / File(s)	Summary
Lifecycle Worker Pattern & Command Queue `packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py`	`MCPBaseClient.__init__` adds `_lifecycle_task` and `_lifecycle_commands` queue. `__aenter__`/`__aexit__` now create/await a background worker and enqueue `"connect"`/`"close"` commands instead of directly managing `AsyncExitStack`. New helpers: `_run_lifecycle_command` (enqueue/await), `_lifecycle_worker` (serial command loop), `_connect_connection` (context enter/state setup), `_close_connection` (context exit/state cleanup). `_reconnect` enqueues `"reconnect"` instead of inline teardown.
Auth Token Detection & get_tools() Formatting `packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py`	`_has_cached_auth_token` attribute-check logic reorganized for `_auth_code_provider`/`_authenticated_tokens`. `get_tools()` reformatted to construct `MCPToolClient` using multiline keyword arguments with adjusted spacing near timeout handling.
Constructor Formatting Consistency `packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py`	`AuthAdapter`, `MCPSSEClient`, `MCPStdioClient`, `MCPStreamableHTTPClient`, and `MCPToolClient` constructors reformatted into multiline `super().__init__` calls and assignments. `connect_to_server` context manager unpacking reformatted to multiline while preserving `get_session_id` capture.
Task-Bound Lifecycle Test Infrastructure `packages/nvidia_nat_mcp/tests/client/test_mcp_client_base.py`	`TaskBoundMockMCPClient` and `TaskBoundAsyncContextManager` enforce exit in the same task that entered; raises error if `__aexit__` runs in a different task. New test `test_reconnect_exits_transport_stack_in_lifecycle_task` validates task-bound semantics when reconnection originates from a separate request task.
Test Mock Updates & Formatting `packages/nvidia_nat_mcp/tests/client/test_mcp_client_base.py`	`MockAsyncContextManager` reconnect-failure conditions reformatted with equivalent logic. `MockMCPClient` instantiations, `asyncio.sleep` patches, `reconnect_max_backoff` assignments, and `MCPToolClient` calls across multiple tests reformatted to multiline. Nested `patch()` blocks in `connect_to_server` tests use double-quoted paths and adjusted wiring. Skipped test definitions updated with whitespace changes.

Sequence Diagram

sequenceDiagram
  participant Client as __aenter__/__aexit__
  participant Queue as asyncio.Queue
  participant Worker as _lifecycle_worker
  participant Stack as AsyncExitStack
  Client->>Queue: enqueue "connect"
  Client->>Client: start _lifecycle_worker task
  Worker->>Queue: await command
  Worker->>Stack: _connect_connection (enter/set state)
  Client->>Queue: enqueue "reconnect"
  Worker->>Queue: await command
  Worker->>Stack: _close_connection then _connect_connection
  Client->>Queue: enqueue "close"
  Worker->>Queue: await command
  Worker->>Stack: _close_connection (exit/clear state)
  Worker->>Worker: exit loop

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 48.84% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the main change: fixing MCP reconnect lifecycle after local server crash. It follows the specified format with imperative mood, is concise (62 chars), and accurately reflects the core fix.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py (1)
283-283: 💤 Low value

Redundant state assignment after lifecycle command failure.

When the reconnect command fails, the lifecycle worker (line 324) already sets _connection_established = False. This line is redundant and could obscure state ownership—all state mutations should ideally occur within the lifecycle worker.
Consider removing the redundant assignment
             # All attempts failed
-            self._connection_established = False
             if last_error:
                 raise last_error
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py` at line
283, Remove the redundant assignment to self._connection_established = False in
the reconnect failure handling so state mutation is owned exclusively by the
lifecycle worker; locate the reconnect command handler in the client class
(references to reconnect/reconnect command and self._connection_established) and
delete the explicit False assignment there, ensuring only the lifecycle worker
code that already sets _connection_established = False on failure manages that
state.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py`:
- Line 283: Remove the redundant assignment to self._connection_established =
False in the reconnect failure handling so state mutation is owned exclusively
by the lifecycle worker; locate the reconnect command handler in the client
class (references to reconnect/reconnect command and
self._connection_established) and delete the explicit False assignment there,
ensuring only the lifecycle worker code that already sets
_connection_established = False on failure manages that state.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c4119304-5b2a-43e3-a83e-e05a7203d640

📥 Commits

Reviewing files that changed from the base of the PR and between 20639fc and 2e51c2d.

📒 Files selected for processing (2)

packages/nvidia_nat_mcp/src/nat/plugins/mcp/client/client_base.py
packages/nvidia_nat_mcp/tests/client/test_mcp_client_base.py

ensure connection lifecycle is owned by one dedicated lifecycle task

2e51c2d

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>

yczhang-nv requested a review from a team as a code owner May 14, 2026 22:16

yczhang-nv self-assigned this May 14, 2026

yczhang-nv added bug Something isn't working non-breaking Non-breaking change labels May 14, 2026

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

yczhang-nv added bug Something isn't working non-breaking Non-breaking change and removed bug Something isn't working non-breaking Non-breaking change labels May 14, 2026

yczhang-nv changed the base branch from develop to release/1.7 May 14, 2026 23:02

yczhang-nv requested a review from AnuradhaKaruppiah May 14, 2026 23:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mcp): Fix MCP reconnect lifecycle after local server crash#1935

fix(mcp): Fix MCP reconnect lifecycle after local server crash#1935
yczhang-nv wants to merge 1 commit into
NVIDIA:release/1.7from
yczhang-nv:yuchenz/nat-221-investigate-agent-stuck-state-after-local-mcp-server-failure

yczhang-nv commented May 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yczhang-nv commented May 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

By Submitting this PR I confirm:

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yczhang-nv commented May 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading