Skip to content

fix: resolve AWS credentials off the event loop#270

Open
nix-tkobayashi wants to merge 2 commits into
aws:mainfrom
nix-tkobayashi:fix/async-credential-resolution
Open

fix: resolve AWS credentials off the event loop#270
nix-tkobayashi wants to merge 2 commits into
aws:mainfrom
nix-tkobayashi:fix/async-credential-resolution

Conversation

@nix-tkobayashi

@nix-tkobayashi nix-tkobayashi commented May 8, 2026

Copy link
Copy Markdown

Summary

Offload the synchronous create_aws_session() + get_credentials() call
in _sign_request_hook to asyncio.to_thread(), preventing the event loop
from blocking when the AWS profile uses chained credentials (e.g. STS AssumeRole).

What changed

  • Added _resolve_credentials() helper that calls create_aws_session(),
    get_credentials(), and get_frozen_credentials() — all inside a worker thread
  • _sign_request_hook now awaits asyncio.to_thread(_resolve_credentials, profile)
    instead of calling session methods synchronously

Post-#294 adaptation

SessionHolder was removed in #294 (v1.6.0). This PR is now based on
the current main and no longer carries any SessionHolder-related code.
The change wraps the per-request fresh-session credential read instead.

What is NOT changed

  • _handle_error_response — upstream signature maintained
  • create_sigv4_client — upstream signature maintained
  • _patch_credential_process_stdin() — upstream addition preserved
  • SigV4HTTPXAuth — no changes

Tests

  • Verify _resolve_credentials is called via asyncio.to_thread
  • Verify event loop remains responsive during credential resolution (threading.Event-based, less timing-sensitive)
  • All existing upstream unit tests pass unchanged

Partially addresses #176.

@arnewouters

Copy link
Copy Markdown
Contributor

Did you manage to reproduce the issue that was reported?

@nix-tkobayashi

Copy link
Copy Markdown
Author

@arnewouters
Yes, I reproduced it with a cross-account AssumeRole profile. My setup:

  • MCP client: Lambda-based (custom)
  • Profile config: ~/.aws/config with role_arn + source_profile (chained credentials)
  • MFA: Not enabled for this role

The key to reproducing is that the credential resolution chain must trigger an actual STS AssumeRole call. For profiles with role_arn, botocore's AssumeRoleProvider returns a DeferredRefreshableCredentials object — the STS call is deferred until the credentials are first accessed (e.g., during SigV4 signing when .access_key / .secret_key / .token properties are read). This only blocks when:

  1. The profile uses role_arn (chained credentials), and
  2. The credentials haven't been resolved yet or have expired

If credentials are already resolved and cached (e.g., from a prior request in the same session), property access returns immediately and you won't see the blocking behavior. With static IAM user credentials or SSO profiles where tokens are pre-cached, it also returns instantly.

The ~60s timeout occurs because the synchronous STS call (typically 1-3s) runs on the asyncio event loop, preventing the MCP handshake from completing within the client's timeout window.

Happy to help with any additional details or testing needed.

@arnewouters

Copy link
Copy Markdown
Contributor

I'm struggling to understand what the blocked event loop actually starves in practice:

  1. The signing hook runs before the request is sent, so nothing is waiting on a response yet
  2. The MCP client timeout is measured on the client side (separate process), not on the proxy's event loop
  3. A 1-3s STS call shouldn't exhaust a 60s client timeout

I'd like to understand the actual failure mode and see a reproducible example before merging any fixes.

@nix-tkobayashi

Copy link
Copy Markdown
Author

Thanks for the detailed review — these are fair points and I want to address them precisely.

You're right that a single fast STS call alone does not explain a 60s timeout. The narrower claim of this PR is:

Calling synchronous credential resolution (get_credentials() → STS AssumeRole) from the proxy's asyncio event loop blocks all coroutines on that loop, and that blocking time adds directly to the client's timeout budget.

Let me walk through the mechanism and then show a deterministic reproduction.

The mechanism

In stdio proxy mode, the proxy runs a single asyncio event loop that handles both stdio (client ↔ proxy) and HTTP (proxy ↔ AWS). When the client sends initialize:

  1. Proxy reads the message from stdin
  2. Proxy prepares the HTTP request to AWS
  3. _sign_request_hook runs — calls get_credentials() synchronously
  4. Event loop is frozen for the duration of the STS call
  5. Request is sent, AWS responds, proxy writes back to stdout

The client's timeout clock is ticking from step 1 to step 5. The event-loop block at step 4 consumes part of that budget. How much depends entirely on how long get_credentials() takes — which varies by credential provider chain, network conditions, and retry configuration.

For static IAM keys, get_credentials() returns instantly (no network call) — no issue. For AssumeRole profiles, it makes a synchronous STS API call whose duration is environment-dependent and unbounded.

Deterministic reproduction

This can be reproduced without any AWS dependency by injecting a blocking delay:

import asyncio, time, unittest.mock as mock
import httpx
from mcp_proxy_for_aws.sigv4_helper import _sign_request_hook, SessionHolder

async def test_event_loop_blocked():
    """Proves that a slow get_credentials() blocks the entire event loop."""
    tick_count = 0

    async def ticker():
        nonlocal tick_count
        while True:
            await asyncio.sleep(0.05)
            tick_count += 1

    # Simulate a 3-second credential resolution
    slow_session = mock.MagicMock()
    slow_session.get_credentials.side_effect = lambda: time.sleep(3) or mock.MagicMock()
    holder = SessionHolder(slow_session)

    fake_request = httpx.Request("POST", "https://example.com", content=b'{}')

    task = asyncio.create_task(ticker())
    await _sign_request_hook("us-east-1", "execute-api", holder, fake_request)
    task.cancel()

    # Without fix: tick_count == 0 (event loop was frozen for 3s)
    # With asyncio.to_thread fix: tick_count >= 50
    print(f"Ticks during credential resolution: {tick_count}")

asyncio.run(test_event_loop_blocked())

With the current code, tick_count is 0 — the event loop was completely frozen. With asyncio.to_thread(), it runs normally. This is essentially what test_sign_request_hook_does_not_block_event_loop in the PR validates.

If the blocking duration exceeds the client's remaining timeout budget, the connection fails. You can verify this end-to-end by setting a short client timeout (e.g., 2s) and injecting a 3s blocking get_credentials().

Why this matters in practice

I encountered this in a Lambda-based setup using cross-account AssumeRole, where get_credentials() involves chained STS calls that can take seconds to tens of seconds depending on provider chain, retries, and network conditions. The asyncio.to_thread() fix resolved the issue without changing behavior on fast credential paths.

I acknowledge I cannot definitively prove that the original issue #176 reporter's 60s timeout was caused solely by this — there may be additional factors in their environment. But the event-loop blocking is a real correctness issue regardless of whether it fully explains their specific case.

Happy to add the reproduction script above to the PR as an integration test if that would help.

krishmakochhar
krishmakochhar previously approved these changes May 20, 2026

@krishmakochhar krishmakochhar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Two suggestions for a follow-up:

  1. Thread safety on refresh: If async_refresh_if_needed() is ever called concurrently (e.g., shared holder, pipelined requests), two threads could race on self.session replacement. An asyncio.Lock with a double-check pattern would close this cheaply. Not urgent given current single-holder-per-connection usage.

  2. PR description: Consider softening "Fixes #176" to "Partially addresses #176" — the event-loop blocking is a real correctness issue, but as you acknowledged, it may not be the sole cause of the original reporter's 60s timeout.

@nix-tkobayashi

Copy link
Copy Markdown
Author

Thank you for the review and the approval!

Good call on both suggestions:

  1. Thread safety: Agreed — an asyncio.Lock with double-check on async_refresh_if_needed() would close the race cheaply. I'll open a follow-up PR for this.
  2. PR description: Updated to "Partially addresses Support for assumed IAM role (chained credentials) in MCP proxy #176" and softened the timeout causation language throughout.

@nix-tkobayashi

Copy link
Copy Markdown
Author

Resolved merge conflict with main — the only manual resolution was combining the import line in tests/unit/test_hooks.py:

# main added `patch`, this PR added `AsyncMock` → merged both
from unittest.mock import AsyncMock, MagicMock, Mock, patch

All other changes (skip_auth parameter additions, new test cases) were auto-merged by Git. No functional changes to this PR's code.

@aswhit1

aswhit1 commented May 25, 2026

Copy link
Copy Markdown

I've never done this before so please be kind when correcting me. ;-) I'll also be honest in that this write up is with a LOT of help from opencode.

Independent reproduction with CloudTrail evidence

Adding empirical data to support this PR. We've been operating
mcp-proxy-for-aws v1.5.0 against the managed
aws-mcp.us-east-1.api.aws endpoint as a downstream user via the
opencode MCP client, and
have observed the failure mode this PR addresses. CloudTrail-side
evidence below; happy to add more if useful.

Failure shape we see

Parallel tool-call fanout from a single client turn -- 4-5 mixed
AWS services (e.g., iam list-roles, sts get-caller-identity,
s3api list-buckets, ec2 describe-vpcs, cloudtrail lookup-events) -- intermittently fails per call, with auth-style
error messages:

  • InvalidClientTokenId -- "The security token included in the request is invalid"
  • InvalidToken -- s3 flavor
  • UnrecognizedClientException
  • AuthFailure -- ec2 flavor
  • 502 from runtime (raw)

Successful calls in the same batch confirm the wired credentials
are valid and the proxy + endpoint are reachable. Sequential retry
of any failed call succeeds within seconds against the same wired
profile, with no policy or credential change.

CloudTrail signature: failures are proxy-layer, not AWS-side

A controlled n=64 audit (full pagination,
EventName=CallReadWriteTool, ~8-minute warm-state window):

  • 58 events with additionalEventData.downstreamRequests populated
    -- proxy forwarded to AWS, AWS responded successfully
  • 6 events with AwsMcpEvent present but downstreamRequests
    absent -- proxy received the tool call but never forwarded
    to AWS. These 6 match the 6 observed client-side failures
    exactly
  • Zero events show downstream errors -- AWS itself never rejected
    any of these calls

The error-message variants above (InvalidClientTokenId,
InvalidToken, etc.) all share that same proxy-layer signature.
The proxy generates them itself; AWS sees nothing for the failed
calls. This is consistent with the M2 hypothesis in this PR:
non-atomic read of mutable Credentials state during concurrent
SigV4 signing produces a signature that AWS would reject -- but
in practice the proxy doesn't even get that far, the request is
shaped into an error in the signing hook before transmission.

In-the-wild reproducer eventIDs from our environment (us-east-1,
managed endpoint), in case useful for upstream AWS MCP service
log correlation:

  • a87cfab6-8cfc-4017-a7cf-692afd52b11a (s3, single proxy-layer outlier in same-service batch)
  • dbbc78f0, ae01a432, 270c11a4, 6d60eaf2 (mixed-service size-5, "502 from runtime" variant)
  • 053a0ced, 1f464cfa (mixed-service size-5, InvalidToken / InvalidClientTokenId variants)

Per-call failure rates we measured

n=64 audit window, warm-state mixed-service fanout, single AWS
account, AWS SSO profile (RefreshableCredentials):

Batch size Trials Total calls Per-call PASS rate
4 (warm replay) 1 4 100%
5 (mixed) 5 25 88%
6, 7, 8 (mixed) 1 each 21 100%
Warm sizes 5-8 combined 8 54 ~94%

Cold-state size-4 mixed in the same configuration: 3/4 PASS in
controlled trial. After 5 sequential warm-up calls, identical
configuration replayed 4/4 PASS.

Same-service fanout (5x or 10x calls all to the same AWS API):
zero failures across n=27 calls. The race only fires when
multiple SigV4 service signatures are computed concurrently --
consistent with the mutable-Credentials-read hypothesis.

Caveat I want to be honest about

Our data is single-account-at-a-time (we tested the same proxy
build across multiple accounts in our org but each account is its
own data point, not n=concurrent-load). The race is concurrency
within a single client; account multiplicity isn't the variable.
But our 94% warm-state rate may still vary by environment --
SSO token TTL, RefreshableCredentials refresh cadence, and STS
latency would all influence the race window size.

Why I think this PR is on the right track

The maintainer's earlier question was: "what does the blocked
event loop actually starve in practice?" Our data answers a
slightly different question -- our problem isn't the 60s
connection timeout from #176, it's per-call signature corruption
during concurrent fanout. But the underlying mechanism is the
same: synchronous credential resolution during async signing,
running on shared mutable state. The asyncio.to_thread change
in this PR moves the blocking work off the event loop, which
also closes (most of) the race window for concurrent signers --
each call does its get_credentials() in a worker thread, so
the read-then-sign window inside _sign_request_hook shrinks to
near-zero on the loop side.

The companion PR #286 (asyncio.Lock around refresh_if_needed)
addresses the residual race we'd still expect after this PR
lands -- when the shared RefreshableCredentials rotates
mid-fanout, the signers that read just-before vs just-after will
still produce mismatched signatures unless the read+sign is
atomic. Both fixes together should tighten the warm-state rate
materially.

Happy to share additional CloudTrail evidence, or the bench
script we use for parallel-fanout sweeps if it would help land
this.

@nix-tkobayashi

Copy link
Copy Markdown
Author

Thank you for the independent verification and for including the CloudTrail data. That kind of concrete evidence is very helpful.

Your observation that the failures appear during mixed-service parallel fanout, but not during same-service fanout, is a useful signal. It is consistent with concurrent access to mutable credential state, which is the area this PR is trying to improve.

To clarify scope: PR #270 moves the blocking get_credentials() call off the event loop via asyncio.to_thread. That should prevent credential resolution from blocking the loop and reduce one source of timing-related contention. The separate credential-rotation case you called out is being addressed in #286 with locking around refresh, so I see these two changes as complementary rather than interchangeable.

@aswhit1

aswhit1 commented May 26, 2026

Copy link
Copy Markdown

Thanks for the kind reply, and for confirming the complementary framing -- that's exactly how it landed for us.

I posted a follow-up on #286 with the cluster-shape evidence we'd attribute to the refresh-race specifically (the 053a0ced / 1f464cfa pair from the same audit window), since you confirmed the two PRs are addressing different mechanisms. Hopefully it's useful as #286 progresses.

We'll watch v1.6.0 for the disposition of both. Happy to provide more CloudTrail data or rerun the bench if anything specific would help.

@aswhit1

aswhit1 commented May 29, 2026

Copy link
Copy Markdown

Follow-up on v1.6.0

Watching v1.6.0 land today, closing the loop from a downstream
user as promised on May 26.

#294
(merged for v1.6.0) takes a different path than this PR -- removing
SessionHolder entirely so every request reads fresh credentials
from disk. From our perspective that change addresses the
mutable-state shape we surfaced in our CloudTrail evidence above:
no shared cached session means no concurrent-signers-on-mutated-
credentials race window during fanout.

We upgraded to v1.6.0 and re-ran the same parallel-fanout bench
that produced our prior 88% per-call PASS baseline (5 trials of
size-5 mixed-service, asyncio.gather'd, warm state):

Version Per-call PASS rate Failure variants observed
v1.5.0 (prior bench) 22/25 = 88% AuthFailure, InvalidToken, InvalidClientTokenId
v1.6.0 (post-#294) 25/25 = 100% none

Small-n caveat: 25 calls is n=5 trials; binomial CIs barely
overlap (v1.6.0 ~ [0.86, 1.00], v1.5.0 ~ [0.69, 0.97]). The
improvement is suggestive rather than conclusive at this n, but
the disappearance of the SigV4-shape variants tracks with the
race-shape diagnosis even if the eventual fix took a different
form than this PR.

That said, this PR (asyncio.to_thread for the blocking STS
call) addresses a different load-bearing concern -- event-loop
responsiveness during chained-credential resolution -- which is
orthogonal to whether SessionHolder exists. With #294 merged,
the post-v1.6.0 version of this PR would presumably wrap the
fresh-session credential read in to_thread rather than the old
SessionHolder.get_credentials(), but the event-loop-blocking
mechanism is the same.

No action requested from us; just closing the loop on the
"we'll watch v1.6.0" commitment.

Offload the synchronous create_aws_session() + get_credentials() call
in _sign_request_hook to asyncio.to_thread(), preventing the event loop
from blocking when the AWS profile requires chained credentials (e.g.
STS AssumeRole).

Uses get_frozen_credentials() to eagerly resolve deferred credential
providers inside the worker thread, avoiding lazy resolution on the
event loop when SigV4Auth reads credential attributes.

This is a post-aws#294 adaptation: SessionHolder was removed in v1.6.0,
so the change now wraps the per-request fresh-session credential read
instead.

Partially addresses aws#176.
@nix-tkobayashi nix-tkobayashi force-pushed the fix/async-credential-resolution branch from f649a8c to 9dd1b83 Compare May 31, 2026 06:05
@nix-tkobayashi nix-tkobayashi changed the title fix(auth): use asyncio.to_thread for credential resolution in signing hook fix: resolve AWS credentials off the event loop May 31, 2026
@nix-tkobayashi

Copy link
Copy Markdown
Author

Rebased this PR on the latest upstream/main.

Since #294 removed SessionHolder, I recreated the branch from current main and reapplied only the asyncio.to_thread change needed to avoid event-loop blocking. The PR title and description have also been updated to reflect the current implementation.

Validation:

  • tests pass
  • ruff pass
  • format pass

This PR is still focused on the event-loop blocking issue, which is orthogonal to the v1.6.0 benchmark results discussed above.

valter-silva-au added a commit to valter-silva-au/mcp-proxy-for-aws that referenced this pull request Jun 11, 2026
The single shared httpx.AsyncClient that serves all proxied MCP tool
calls was constructed with a hardcoded
httpx.Limits(max_keepalive_connections=1, max_connections=5). MCP
clients that fan out parallel aws_* tool calls per turn (Claude Code,
opencode — typically 4-8 concurrent) saturate the 5-connection pool,
causing intermittent failures.

Expose the pool size via new CLI flags --max-connections /
--max-keepalive-connections with environment-variable fallbacks
MCP_PROXY_MAX_CONNECTIONS / MCP_PROXY_MAX_KEEPALIVE_CONNECTIONS. The two
values are threaded from cli -> server -> create_transport_with_sigv4 ->
create_sigv4_client, and through the per-profile override middleware.

Defaults are unchanged: max_connections=5, max_keepalive_connections=1,
so existing deployments behave byte-for-byte as before. A backward-
compat test asserts the default limits equal the original hardcoded
Limits object.

Scope: pool-sizing knob only. Does not touch the auth-path
serialization (_sign_request_hook / next(auth_flow)), which is tracked
separately in aws#270.

Fixes aws#295
@github-actions

Copy link
Copy Markdown

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions Bot added the stale label Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants