fix: resolve AWS credentials off the event loop#270
Conversation
|
Did you manage to reproduce the issue that was reported? |
|
@arnewouters
The key to reproducing is that the credential resolution chain must trigger an actual STS
If credentials are already resolved and cached (e.g., from a prior request in the same session), property access returns immediately and you won't see the blocking behavior. With static IAM user credentials or SSO profiles where tokens are pre-cached, it also returns instantly. The ~60s timeout occurs because the synchronous STS call (typically 1-3s) runs on the asyncio event loop, preventing the MCP handshake from completing within the client's timeout window. Happy to help with any additional details or testing needed. |
|
I'm struggling to understand what the blocked event loop actually starves in practice:
I'd like to understand the actual failure mode and see a reproducible example before merging any fixes. |
|
Thanks for the detailed review — these are fair points and I want to address them precisely. You're right that a single fast STS call alone does not explain a 60s timeout. The narrower claim of this PR is:
Let me walk through the mechanism and then show a deterministic reproduction. The mechanismIn stdio proxy mode, the proxy runs a single asyncio event loop that handles both stdio (client ↔ proxy) and HTTP (proxy ↔ AWS). When the client sends
The client's timeout clock is ticking from step 1 to step 5. The event-loop block at step 4 consumes part of that budget. How much depends entirely on how long For static IAM keys, Deterministic reproductionThis can be reproduced without any AWS dependency by injecting a blocking delay: import asyncio, time, unittest.mock as mock
import httpx
from mcp_proxy_for_aws.sigv4_helper import _sign_request_hook, SessionHolder
async def test_event_loop_blocked():
"""Proves that a slow get_credentials() blocks the entire event loop."""
tick_count = 0
async def ticker():
nonlocal tick_count
while True:
await asyncio.sleep(0.05)
tick_count += 1
# Simulate a 3-second credential resolution
slow_session = mock.MagicMock()
slow_session.get_credentials.side_effect = lambda: time.sleep(3) or mock.MagicMock()
holder = SessionHolder(slow_session)
fake_request = httpx.Request("POST", "https://example.com", content=b'{}')
task = asyncio.create_task(ticker())
await _sign_request_hook("us-east-1", "execute-api", holder, fake_request)
task.cancel()
# Without fix: tick_count == 0 (event loop was frozen for 3s)
# With asyncio.to_thread fix: tick_count >= 50
print(f"Ticks during credential resolution: {tick_count}")
asyncio.run(test_event_loop_blocked())With the current code, If the blocking duration exceeds the client's remaining timeout budget, the connection fails. You can verify this end-to-end by setting a short client timeout (e.g., 2s) and injecting a 3s blocking Why this matters in practiceI encountered this in a Lambda-based setup using cross-account AssumeRole, where I acknowledge I cannot definitively prove that the original issue #176 reporter's 60s timeout was caused solely by this — there may be additional factors in their environment. But the event-loop blocking is a real correctness issue regardless of whether it fully explains their specific case. Happy to add the reproduction script above to the PR as an integration test if that would help. |
krishmakochhar
left a comment
There was a problem hiding this comment.
LGTM.
Two suggestions for a follow-up:
-
Thread safety on refresh: If
async_refresh_if_needed()is ever called concurrently (e.g., shared holder, pipelined requests), two threads could race onself.sessionreplacement. Anasyncio.Lockwith a double-check pattern would close this cheaply. Not urgent given current single-holder-per-connection usage. -
PR description: Consider softening "Fixes #176" to "Partially addresses #176" — the event-loop blocking is a real correctness issue, but as you acknowledged, it may not be the sole cause of the original reporter's 60s timeout.
|
Thank you for the review and the approval! Good call on both suggestions:
|
|
Resolved merge conflict with # main added `patch`, this PR added `AsyncMock` → merged both
from unittest.mock import AsyncMock, MagicMock, Mock, patchAll other changes ( |
|
I've never done this before so please be kind when correcting me. ;-) I'll also be honest in that this write up is with a LOT of help from opencode. Independent reproduction with CloudTrail evidenceAdding empirical data to support this PR. We've been operating Failure shape we seeParallel tool-call fanout from a single client turn -- 4-5 mixed
Successful calls in the same batch confirm the wired credentials CloudTrail signature: failures are proxy-layer, not AWS-sideA controlled n=64 audit (full pagination,
The error-message variants above ( In-the-wild reproducer eventIDs from our environment (us-east-1,
Per-call failure rates we measuredn=64 audit window, warm-state mixed-service fanout, single AWS
Cold-state size-4 mixed in the same configuration: 3/4 PASS in Same-service fanout (5x or 10x calls all to the same AWS API): Caveat I want to be honest aboutOur data is single-account-at-a-time (we tested the same proxy Why I think this PR is on the right trackThe maintainer's earlier question was: "what does the blocked The companion PR #286 (asyncio.Lock around Happy to share additional CloudTrail evidence, or the bench |
|
Thank you for the independent verification and for including the CloudTrail data. That kind of concrete evidence is very helpful. Your observation that the failures appear during mixed-service parallel fanout, but not during same-service fanout, is a useful signal. It is consistent with concurrent access to mutable credential state, which is the area this PR is trying to improve. To clarify scope: PR #270 moves the blocking |
|
Thanks for the kind reply, and for confirming the complementary framing -- that's exactly how it landed for us. I posted a follow-up on #286 with the cluster-shape evidence we'd attribute to the refresh-race specifically (the We'll watch v1.6.0 for the disposition of both. Happy to provide more CloudTrail data or rerun the bench if anything specific would help. |
Follow-up on v1.6.0Watching v1.6.0 land today, closing the loop from a downstream #294 We upgraded to v1.6.0 and re-ran the same parallel-fanout bench
Small-n caveat: 25 calls is n=5 trials; binomial CIs barely That said, this PR ( No action requested from us; just closing the loop on the |
Offload the synchronous create_aws_session() + get_credentials() call in _sign_request_hook to asyncio.to_thread(), preventing the event loop from blocking when the AWS profile requires chained credentials (e.g. STS AssumeRole). Uses get_frozen_credentials() to eagerly resolve deferred credential providers inside the worker thread, avoiding lazy resolution on the event loop when SigV4Auth reads credential attributes. This is a post-aws#294 adaptation: SessionHolder was removed in v1.6.0, so the change now wraps the per-request fresh-session credential read instead. Partially addresses aws#176.
f649a8c to
9dd1b83
Compare
|
Rebased this PR on the latest upstream/main. Since #294 removed SessionHolder, I recreated the branch from current main and reapplied only the asyncio.to_thread change needed to avoid event-loop blocking. The PR title and description have also been updated to reflect the current implementation. Validation:
This PR is still focused on the event-loop blocking issue, which is orthogonal to the v1.6.0 benchmark results discussed above. |
The single shared httpx.AsyncClient that serves all proxied MCP tool calls was constructed with a hardcoded httpx.Limits(max_keepalive_connections=1, max_connections=5). MCP clients that fan out parallel aws_* tool calls per turn (Claude Code, opencode — typically 4-8 concurrent) saturate the 5-connection pool, causing intermittent failures. Expose the pool size via new CLI flags --max-connections / --max-keepalive-connections with environment-variable fallbacks MCP_PROXY_MAX_CONNECTIONS / MCP_PROXY_MAX_KEEPALIVE_CONNECTIONS. The two values are threaded from cli -> server -> create_transport_with_sigv4 -> create_sigv4_client, and through the per-profile override middleware. Defaults are unchanged: max_connections=5, max_keepalive_connections=1, so existing deployments behave byte-for-byte as before. A backward- compat test asserts the default limits equal the original hardcoded Limits object. Scope: pool-sizing knob only. Does not touch the auth-path serialization (_sign_request_hook / next(auth_flow)), which is tracked separately in aws#270. Fixes aws#295
|
This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Summary
Offload the synchronous
create_aws_session()+get_credentials()callin
_sign_request_hooktoasyncio.to_thread(), preventing the event loopfrom blocking when the AWS profile uses chained credentials (e.g. STS AssumeRole).
What changed
_resolve_credentials()helper that callscreate_aws_session(),get_credentials(), andget_frozen_credentials()— all inside a worker thread_sign_request_hooknow awaitsasyncio.to_thread(_resolve_credentials, profile)instead of calling session methods synchronously
Post-#294 adaptation
SessionHolderwas removed in #294 (v1.6.0). This PR is now based onthe current
mainand no longer carries anySessionHolder-related code.The change wraps the per-request fresh-session credential read instead.
What is NOT changed
_handle_error_response— upstream signature maintainedcreate_sigv4_client— upstream signature maintained_patch_credential_process_stdin()— upstream addition preservedSigV4HTTPXAuth— no changesTests
_resolve_credentialsis called viaasyncio.to_threadthreading.Event-based, less timing-sensitive)Partially addresses #176.