fix: resolve AWS credentials off the event loop by nix-tkobayashi · Pull Request #270 · aws/mcp-proxy-for-aws

nix-tkobayashi · 2026-05-08T10:29:33Z

Summary

Offload the synchronous create_aws_session() + get_credentials() call
in _sign_request_hook to asyncio.to_thread(), preventing the event loop
from blocking when the AWS profile uses chained credentials (e.g. STS AssumeRole).

What changed

Added _resolve_credentials() helper that calls create_aws_session(),
get_credentials(), and get_frozen_credentials() — all inside a worker thread
_sign_request_hook now awaits asyncio.to_thread(_resolve_credentials, profile)
instead of calling session methods synchronously

Post-#294 adaptation

SessionHolder was removed in #294 (v1.6.0). This PR is now based on
the current main and no longer carries any SessionHolder-related code.
The change wraps the per-request fresh-session credential read instead.

What is NOT changed

_handle_error_response — upstream signature maintained
create_sigv4_client — upstream signature maintained
_patch_credential_process_stdin() — upstream addition preserved
SigV4HTTPXAuth — no changes

Tests

Verify _resolve_credentials is called via asyncio.to_thread
Verify event loop remains responsive during credential resolution (threading.Event-based, less timing-sensitive)
All existing upstream unit tests pass unchanged

Partially addresses #176.

arnewouters · 2026-05-19T06:52:50Z

Did you manage to reproduce the issue that was reported?

nix-tkobayashi · 2026-05-19T11:56:49Z

@arnewouters
Yes, I reproduced it with a cross-account AssumeRole profile. My setup:

MCP client: Lambda-based (custom)
Profile config: ~/.aws/config with role_arn + source_profile (chained credentials)
MFA: Not enabled for this role

The key to reproducing is that the credential resolution chain must trigger an actual STS AssumeRole call. For profiles with role_arn, botocore's AssumeRoleProvider returns a DeferredRefreshableCredentials object — the STS call is deferred until the credentials are first accessed (e.g., during SigV4 signing when .access_key / .secret_key / .token properties are read). This only blocks when:

The profile uses role_arn (chained credentials), and
The credentials haven't been resolved yet or have expired

If credentials are already resolved and cached (e.g., from a prior request in the same session), property access returns immediately and you won't see the blocking behavior. With static IAM user credentials or SSO profiles where tokens are pre-cached, it also returns instantly.

The ~60s timeout occurs because the synchronous STS call (typically 1-3s) runs on the asyncio event loop, preventing the MCP handshake from completing within the client's timeout window.

Happy to help with any additional details or testing needed.

arnewouters · 2026-05-20T05:43:53Z

I'm struggling to understand what the blocked event loop actually starves in practice:

The signing hook runs before the request is sent, so nothing is waiting on a response yet
The MCP client timeout is measured on the client side (separate process), not on the proxy's event loop
A 1-3s STS call shouldn't exhaust a 60s client timeout

I'd like to understand the actual failure mode and see a reproducible example before merging any fixes.

nix-tkobayashi · 2026-05-20T07:51:15Z

Thanks for the detailed review — these are fair points and I want to address them precisely.

You're right that a single fast STS call alone does not explain a 60s timeout. The narrower claim of this PR is:

Calling synchronous credential resolution (get_credentials() → STS AssumeRole) from the proxy's asyncio event loop blocks all coroutines on that loop, and that blocking time adds directly to the client's timeout budget.

Let me walk through the mechanism and then show a deterministic reproduction.

The mechanism

In stdio proxy mode, the proxy runs a single asyncio event loop that handles both stdio (client ↔ proxy) and HTTP (proxy ↔ AWS). When the client sends initialize:

Proxy reads the message from stdin
Proxy prepares the HTTP request to AWS
_sign_request_hook runs — calls get_credentials() synchronously
Event loop is frozen for the duration of the STS call
Request is sent, AWS responds, proxy writes back to stdout

The client's timeout clock is ticking from step 1 to step 5. The event-loop block at step 4 consumes part of that budget. How much depends entirely on how long get_credentials() takes — which varies by credential provider chain, network conditions, and retry configuration.

For static IAM keys, get_credentials() returns instantly (no network call) — no issue. For AssumeRole profiles, it makes a synchronous STS API call whose duration is environment-dependent and unbounded.

Deterministic reproduction

This can be reproduced without any AWS dependency by injecting a blocking delay:

import asyncio, time, unittest.mock as mock
import httpx
from mcp_proxy_for_aws.sigv4_helper import _sign_request_hook, SessionHolder

async def test_event_loop_blocked():
    """Proves that a slow get_credentials() blocks the entire event loop."""
    tick_count = 0

    async def ticker():
        nonlocal tick_count
        while True:
            await asyncio.sleep(0.05)
            tick_count += 1

    # Simulate a 3-second credential resolution
    slow_session = mock.MagicMock()
    slow_session.get_credentials.side_effect = lambda: time.sleep(3) or mock.MagicMock()
    holder = SessionHolder(slow_session)

    fake_request = httpx.Request("POST", "https://example.com", content=b'{}')

    task = asyncio.create_task(ticker())
    await _sign_request_hook("us-east-1", "execute-api", holder, fake_request)
    task.cancel()

    # Without fix: tick_count == 0 (event loop was frozen for 3s)
    # With asyncio.to_thread fix: tick_count >= 50
    print(f"Ticks during credential resolution: {tick_count}")

asyncio.run(test_event_loop_blocked())

With the current code, tick_count is 0 — the event loop was completely frozen. With asyncio.to_thread(), it runs normally. This is essentially what test_sign_request_hook_does_not_block_event_loop in the PR validates.

If the blocking duration exceeds the client's remaining timeout budget, the connection fails. You can verify this end-to-end by setting a short client timeout (e.g., 2s) and injecting a 3s blocking get_credentials().

Why this matters in practice

I encountered this in a Lambda-based setup using cross-account AssumeRole, where get_credentials() involves chained STS calls that can take seconds to tens of seconds depending on provider chain, retries, and network conditions. The asyncio.to_thread() fix resolved the issue without changing behavior on fast credential paths.

I acknowledge I cannot definitively prove that the original issue #176 reporter's 60s timeout was caused solely by this — there may be additional factors in their environment. But the event-loop blocking is a real correctness issue regardless of whether it fully explains their specific case.

Happy to add the reproduction script above to the PR as an integration test if that would help.

krishmakochhar

LGTM.

Two suggestions for a follow-up:

Thread safety on refresh: If async_refresh_if_needed() is ever called concurrently (e.g., shared holder, pipelined requests), two threads could race on self.session replacement. An asyncio.Lock with a double-check pattern would close this cheaply. Not urgent given current single-holder-per-connection usage.
PR description: Consider softening "Fixes #176" to "Partially addresses #176" — the event-loop blocking is a real correctness issue, but as you acknowledged, it may not be the sole cause of the original reporter's 60s timeout.

nix-tkobayashi · 2026-05-20T12:09:46Z

Thank you for the review and the approval!

Good call on both suggestions:

Thread safety: Agreed — an asyncio.Lock with double-check on async_refresh_if_needed() would close the race cheaply. I'll open a follow-up PR for this.
PR description: Updated to "Partially addresses Support for assumed IAM role (chained credentials) in MCP proxy #176" and softened the timeout causation language throughout.

nix-tkobayashi · 2026-05-25T12:28:03Z

Resolved merge conflict with main — the only manual resolution was combining the import line in tests/unit/test_hooks.py:

# main added `patch`, this PR added `AsyncMock` → merged both
from unittest.mock import AsyncMock, MagicMock, Mock, patch

All other changes (skip_auth parameter additions, new test cases) were auto-merged by Git. No functional changes to this PR's code.

aswhit1 · 2026-05-25T16:07:41Z

I've never done this before so please be kind when correcting me. ;-) I'll also be honest in that this write up is with a LOT of help from opencode.

Independent reproduction with CloudTrail evidence

Adding empirical data to support this PR. We've been operating
mcp-proxy-for-aws v1.5.0 against the managed
aws-mcp.us-east-1.api.aws endpoint as a downstream user via the
opencode MCP client, and
have observed the failure mode this PR addresses. CloudTrail-side
evidence below; happy to add more if useful.

Failure shape we see

Parallel tool-call fanout from a single client turn -- 4-5 mixed
AWS services (e.g., iam list-roles, sts get-caller-identity,
s3api list-buckets, ec2 describe-vpcs, cloudtrail lookup-events) -- intermittently fails per call, with auth-style
error messages:

InvalidClientTokenId -- "The security token included in the request is invalid"
InvalidToken -- s3 flavor
UnrecognizedClientException
AuthFailure -- ec2 flavor
502 from runtime (raw)

Successful calls in the same batch confirm the wired credentials
are valid and the proxy + endpoint are reachable. Sequential retry
of any failed call succeeds within seconds against the same wired
profile, with no policy or credential change.

CloudTrail signature: failures are proxy-layer, not AWS-side

A controlled n=64 audit (full pagination,
EventName=CallReadWriteTool, ~8-minute warm-state window):

58 events with additionalEventData.downstreamRequests populated
-- proxy forwarded to AWS, AWS responded successfully
6 events with AwsMcpEvent present but downstreamRequests
absent -- proxy received the tool call but never forwarded
to AWS. These 6 match the 6 observed client-side failures
exactly
Zero events show downstream errors -- AWS itself never rejected
any of these calls

The error-message variants above (InvalidClientTokenId,
InvalidToken, etc.) all share that same proxy-layer signature.
The proxy generates them itself; AWS sees nothing for the failed
calls. This is consistent with the M2 hypothesis in this PR:
non-atomic read of mutable Credentials state during concurrent
SigV4 signing produces a signature that AWS would reject -- but
in practice the proxy doesn't even get that far, the request is
shaped into an error in the signing hook before transmission.

In-the-wild reproducer eventIDs from our environment (us-east-1,
managed endpoint), in case useful for upstream AWS MCP service
log correlation:

a87cfab6-8cfc-4017-a7cf-692afd52b11a (s3, single proxy-layer outlier in same-service batch)
dbbc78f0, ae01a432, 270c11a4, 6d60eaf2 (mixed-service size-5, "502 from runtime" variant)
053a0ced, 1f464cfa (mixed-service size-5, InvalidToken / InvalidClientTokenId variants)

Per-call failure rates we measured

n=64 audit window, warm-state mixed-service fanout, single AWS
account, AWS SSO profile (RefreshableCredentials):

Batch size	Trials	Total calls	Per-call PASS rate
4 (warm replay)	1	4	100%
5 (mixed)	5	25	88%
6, 7, 8 (mixed)	1 each	21	100%
Warm sizes 5-8 combined	8	54	~94%

Cold-state size-4 mixed in the same configuration: 3/4 PASS in
controlled trial. After 5 sequential warm-up calls, identical
configuration replayed 4/4 PASS.

Same-service fanout (5x or 10x calls all to the same AWS API):
zero failures across n=27 calls. The race only fires when
multiple SigV4 service signatures are computed concurrently --
consistent with the mutable-Credentials-read hypothesis.

Caveat I want to be honest about

Our data is single-account-at-a-time (we tested the same proxy
build across multiple accounts in our org but each account is its
own data point, not n=concurrent-load). The race is concurrency
within a single client; account multiplicity isn't the variable.
But our 94% warm-state rate may still vary by environment --
SSO token TTL, RefreshableCredentials refresh cadence, and STS
latency would all influence the race window size.

Why I think this PR is on the right track

The maintainer's earlier question was: "what does the blocked
event loop actually starve in practice?" Our data answers a
slightly different question -- our problem isn't the 60s
connection timeout from #176, it's per-call signature corruption
during concurrent fanout. But the underlying mechanism is the
same: synchronous credential resolution during async signing,
running on shared mutable state. The asyncio.to_thread change
in this PR moves the blocking work off the event loop, which
also closes (most of) the race window for concurrent signers --
each call does its get_credentials() in a worker thread, so
the read-then-sign window inside _sign_request_hook shrinks to
near-zero on the loop side.

The companion PR #286 (asyncio.Lock around refresh_if_needed)
addresses the residual race we'd still expect after this PR
lands -- when the shared RefreshableCredentials rotates
mid-fanout, the signers that read just-before vs just-after will
still produce mismatched signatures unless the read+sign is
atomic. Both fixes together should tighten the warm-state rate
materially.

Happy to share additional CloudTrail evidence, or the bench
script we use for parallel-fanout sweeps if it would help land
this.

nix-tkobayashi · 2026-05-26T01:04:59Z

Thank you for the independent verification and for including the CloudTrail data. That kind of concrete evidence is very helpful.

Your observation that the failures appear during mixed-service parallel fanout, but not during same-service fanout, is a useful signal. It is consistent with concurrent access to mutable credential state, which is the area this PR is trying to improve.

To clarify scope: PR #270 moves the blocking get_credentials() call off the event loop via asyncio.to_thread. That should prevent credential resolution from blocking the loop and reduce one source of timing-related contention. The separate credential-rotation case you called out is being addressed in #286 with locking around refresh, so I see these two changes as complementary rather than interchangeable.

aswhit1 · 2026-05-26T19:17:27Z

Thanks for the kind reply, and for confirming the complementary framing -- that's exactly how it landed for us.

I posted a follow-up on #286 with the cluster-shape evidence we'd attribute to the refresh-race specifically (the 053a0ced / 1f464cfa pair from the same audit window), since you confirmed the two PRs are addressing different mechanisms. Hopefully it's useful as #286 progresses.

We'll watch v1.6.0 for the disposition of both. Happy to provide more CloudTrail data or rerun the bench if anything specific would help.

aswhit1 · 2026-05-29T22:20:53Z

Follow-up on v1.6.0

Watching v1.6.0 land today, closing the loop from a downstream
user as promised on May 26.

#294
(merged for v1.6.0) takes a different path than this PR -- removing
SessionHolder entirely so every request reads fresh credentials
from disk. From our perspective that change addresses the
mutable-state shape we surfaced in our CloudTrail evidence above:
no shared cached session means no concurrent-signers-on-mutated-
credentials race window during fanout.

We upgraded to v1.6.0 and re-ran the same parallel-fanout bench
that produced our prior 88% per-call PASS baseline (5 trials of
size-5 mixed-service, asyncio.gather'd, warm state):

Version	Per-call PASS rate	Failure variants observed
v1.5.0 (prior bench)	22/25 = 88%	AuthFailure, InvalidToken, InvalidClientTokenId
v1.6.0 (post-#294)	25/25 = 100%	none

Small-n caveat: 25 calls is n=5 trials; binomial CIs barely
overlap (v1.6.0 ~ [0.86, 1.00], v1.5.0 ~ [0.69, 0.97]). The
improvement is suggestive rather than conclusive at this n, but
the disappearance of the SigV4-shape variants tracks with the
race-shape diagnosis even if the eventual fix took a different
form than this PR.

That said, this PR (asyncio.to_thread for the blocking STS
call) addresses a different load-bearing concern -- event-loop
responsiveness during chained-credential resolution -- which is
orthogonal to whether SessionHolder exists. With #294 merged,
the post-v1.6.0 version of this PR would presumably wrap the
fresh-session credential read in to_thread rather than the old
SessionHolder.get_credentials(), but the event-loop-blocking
mechanism is the same.

No action requested from us; just closing the loop on the
"we'll watch v1.6.0" commitment.

Offload the synchronous create_aws_session() + get_credentials() call in _sign_request_hook to asyncio.to_thread(), preventing the event loop from blocking when the AWS profile requires chained credentials (e.g. STS AssumeRole). Uses get_frozen_credentials() to eagerly resolve deferred credential providers inside the worker thread, avoiding lazy resolution on the event loop when SigV4Auth reads credential attributes. This is a post-aws#294 adaptation: SessionHolder was removed in v1.6.0, so the change now wraps the per-request fresh-session credential read instead. Partially addresses aws#176.

nix-tkobayashi · 2026-05-31T06:13:09Z

Rebased this PR on the latest upstream/main.

Since #294 removed SessionHolder, I recreated the branch from current main and reapplied only the asyncio.to_thread change needed to avoid event-loop blocking. The PR title and description have also been updated to reflect the current implementation.

Validation:

tests pass
ruff pass
format pass

This PR is still focused on the event-loop blocking issue, which is orthogonal to the v1.6.0 benchmark results discussed above.

The single shared httpx.AsyncClient that serves all proxied MCP tool calls was constructed with a hardcoded httpx.Limits(max_keepalive_connections=1, max_connections=5). MCP clients that fan out parallel aws_* tool calls per turn (Claude Code, opencode — typically 4-8 concurrent) saturate the 5-connection pool, causing intermittent failures. Expose the pool size via new CLI flags --max-connections / --max-keepalive-connections with environment-variable fallbacks MCP_PROXY_MAX_CONNECTIONS / MCP_PROXY_MAX_KEEPALIVE_CONNECTIONS. The two values are threaded from cli -> server -> create_transport_with_sigv4 -> create_sigv4_client, and through the per-profile override middleware. Defaults are unchanged: max_connections=5, max_keepalive_connections=1, so existing deployments behave byte-for-byte as before. A backward- compat test asserts the default limits equal the original hardcoded Limits object. Scope: pool-sizing knob only. Does not touch the auth-path serialization (_sign_request_hook / next(auth_flow)), which is tracked separately in aws#270. Fixes aws#295

github-actions · 2026-06-15T01:25:03Z

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

nix-tkobayashi requested a review from a team as a code owner May 8, 2026 10:29

nix-tkobayashi requested review from acmlau and harv-aws May 8, 2026 10:29

nix-tkobayashi mentioned this pull request May 8, 2026

Support for assumed IAM role (chained credentials) in MCP proxy #176

Open

2 tasks

krishmakochhar previously approved these changes May 20, 2026

View reviewed changes

nix-tkobayashi mentioned this pull request May 20, 2026

fix: add asyncio.Lock to prevent concurrent refresh races #286

Closed

13 tasks

nix-tkobayashi dismissed krishmakochhar’s stale review via 74c2300 May 25, 2026 12:11

aswhit1 mentioned this pull request May 27, 2026

Hardcoded httpx pool size limits parallel tool-call fanout #295

Open

1 task

nix-tkobayashi force-pushed the fix/async-credential-resolution branch from f649a8c to 9dd1b83 Compare May 31, 2026 06:05

nix-tkobayashi changed the title ~~fix(auth): use asyncio.to_thread for credential resolution in signing hook~~ fix: resolve AWS credentials off the event loop May 31, 2026

valter-silva-au mentioned this pull request Jun 8, 2026

feat: make httpx connection pool size configurable via flags and env #307

Open

github-actions Bot added the stale label Jun 15, 2026

Merge branch 'main' into fix/async-credential-resolution

7b7afa7

Conversation

nix-tkobayashi commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Post-#294 adaptation

What is NOT changed

Tests

Uh oh!

arnewouters commented May 19, 2026

Uh oh!

nix-tkobayashi commented May 19, 2026

Uh oh!

arnewouters commented May 20, 2026

Uh oh!

nix-tkobayashi commented May 20, 2026

The mechanism

Deterministic reproduction

Why this matters in practice

Uh oh!

krishmakochhar left a comment

Choose a reason for hiding this comment

Uh oh!

nix-tkobayashi commented May 20, 2026

Uh oh!

nix-tkobayashi commented May 25, 2026

Uh oh!

aswhit1 commented May 25, 2026

Independent reproduction with CloudTrail evidence

Failure shape we see

CloudTrail signature: failures are proxy-layer, not AWS-side

Per-call failure rates we measured

Caveat I want to be honest about

Why I think this PR is on the right track

Uh oh!

nix-tkobayashi commented May 26, 2026

Uh oh!

aswhit1 commented May 26, 2026

Uh oh!

aswhit1 commented May 29, 2026

Follow-up on v1.6.0

Uh oh!

nix-tkobayashi commented May 31, 2026

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nix-tkobayashi commented May 8, 2026 •

edited

Loading