Skip to content

fix(skills): isolate per-toolset failures in SkillToolset additional-tools resolution#6203

Open
hawktang wants to merge 1 commit into
google:mainfrom
hawktang:fix/skilltoolset-isolate-failing-toolset
Open

fix(skills): isolate per-toolset failures in SkillToolset additional-tools resolution#6203
hawktang wants to merge 1 commit into
google:mainfrom
hawktang:fix/skilltoolset-isolate-failing-toolset

Conversation

@hawktang

Copy link
Copy Markdown

Summary

SkillToolset._resolve_additional_tools_from_state resolves a skill's declared adk_additional_tools by gathering get_tools_with_prefix() from every provided toolset in a single asyncio.gather(...) with no per-toolset error isolation:

ts_results = await asyncio.gather(*(
    ts.get_tools_with_prefix(readonly_context)
    for ts in self._provided_toolsets
))

If any one toolset raises while listing its tools — most commonly an McpToolset whose remote MCP server is restarting / unreachable / returns 503 — the entire gather propagates and no additional tools resolve for the activated skill, including unrelated individual function tools that were already available with zero I/O.

In practice, a single optional remote MCP being briefly down breaks every skill that declares adk_additional_tools — surfacing as an internal error even for skills that never reference the failing MCP.

Fix

  • Pass return_exceptions=True to the gather, then skip (with a logger.warning) any toolset whose listing failed.
  • Re-raise asyncio.CancelledError so cooperative cancellation is not swallowed.

A failing/unreachable toolset now contributes no tools and recovers automatically on the next invocation (tool listing is cached per invocation_id).

Test

Adds test_skill_toolset_resolution_isolates_failing_toolset: a toolset whose get_tools_with_prefix raises is placed in the additional_tools pool alongside a healthy tool. After activating a skill, the healthy tool and core skill tools still resolve, the failing toolset contributes nothing, and a warning is logged. Verified red before the change, green after.

Notes

  • Reproduces on 2.1.0, 2.3.0, and current main.
  • The existing experimental _MCP_GRACEFUL_ERROR_HANDLING guards the MCP session layer, but not this aggregation layer, so the failure still propagated.
  • Open design question: should a failure always degrade (this PR — right for optional toolsets), or hard-error only when the failing toolset would have provided a tool the active skill actually declares in adk_additional_tools? Happy to adjust.

🤖 Generated with Claude Code

@google-cla

google-cla Bot commented Jun 23, 2026

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

SkillToolset._resolve_additional_tools_from_state gathered
get_tools_with_prefix() from every provided toolset in a single
asyncio.gather() with no error isolation. One toolset that raises while
listing its tools -- e.g. an McpToolset whose remote MCP server is
restarting or returns 503 -- aborted the whole gather, so no skill-gated
additional tools resolved, including unrelated individual function tools
that needed zero I/O.

Use return_exceptions=True and skip (with a warning) any toolset that
fails, re-raising CancelledError so cooperative cancellation is
preserved. A failing toolset now contributes no tools and recovers on the
next invocation.

Adds a regression test: a failing toolset in the additional_tools pool no
longer prevents the healthy tools and core skill tools from resolving.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hawktang hawktang force-pushed the fix/skilltoolset-isolate-failing-toolset branch from 606f312 to f506006 Compare June 23, 2026 19:35
@adk-bot

adk-bot commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Response from ADK Triaging Agent

Hello @hawktang, thank you for creating this PR!

It looks like there are a couple of things missing based on our contribution guidelines:

  1. Contributor License Agreement (CLA): The CLA check is currently failing. Please sign the Google CLA at https://cla.developers.google.com/ to proceed.
  2. Testing Plan: Please include a formal Testing Plan section in your PR description. In particular, we'd love to see a summary of passed pytest results showing that the newly added test passes locally.

Once these are addressed, we will triage and label your PR. Thank you!

@adk-bot adk-bot added the tools [Component] This issue is related to tools label Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tools [Component] This issue is related to tools

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants