fix: add asyncio.Lock to prevent concurrent refresh races by nix-tkobayashi · Pull Request #286 · aws/mcp-proxy-for-aws

nix-tkobayashi · 2026-05-20T12:30:29Z

Summary

Changes

Follow-up to #270 (per review feedback).

SessionHolder.async_refresh_if_needed() can be called concurrently by multiple in-flight requests when the session is marked for refresh. Without synchronization, all callers race to create a new boto3 session simultaneously — wasting STS calls and risking inconsistent state.

This PR:

Adds an asyncio.Lock to SessionHolder with a double-check pattern: the outer _needs_refresh check avoids lock overhead on the fast path; the inner check ensures only one caller actually performs the refresh
Shields the to_thread call from task cancellation so that a cancelled caller does not release the lock while the worker thread is still running (prevents duplicate refresh on cancellation)
Adds two unit tests:
- Concurrent-callers test using threading.Event barriers to assert create_aws_session is called exactly once
- Cancellation test that verifies the lock is held until the refresh completes, even when the awaiting task is cancelled

User experience

No user-visible behavior change. This is a correctness hardening for the credential refresh path introduced in #270.

Checklist

I have reviewed the contributing guidelines
I have performed a self-review of this change
Changes have been tested
Changes are documented

Is this a breaking change? (Y/N)

Yes
No

Please add details about how this change was tested.

All 185 unit tests pass with 96% coverage (80% required)
Pre-commit hooks pass
Ruff linting and formatting pass
Pyright passes (no new errors)
Concurrent test uses threading.Event barriers to assert single refresh
Cancellation test verified to FAIL without shield and PASS with shield
Integration tests not run locally (requires AWS OIDC credentials)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

… hook `_sign_request_hook` is an async httpx event hook, but it calls `session_holder.session.get_credentials()` and `session_holder.refresh_if_needed()` synchronously. For profiles that use assumed IAM roles (chained credentials), `get_credentials()` triggers a blocking STS `AssumeRole` call that stalls the event loop and causes connection timeouts (~60 s). Add `SessionHolder.async_get_credentials()` and `SessionHolder.async_refresh_if_needed()` which delegate to `asyncio.to_thread`, keeping the event loop responsive. Update `_sign_request_hook` to call the async variants. Fixes aws#176 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Address review feedback: - Clarify that get_credentials() is the primary blocking path; async_refresh_if_needed() is a defensive measure - Replace timeout-based non-blocking test with ticker coroutine that proves the event loop stays responsive during credential resolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Prevent concurrent refresh races when multiple in-flight requests trigger credential refresh simultaneously. Uses a double-check pattern: the outer check avoids lock overhead on the fast path, the inner check ensures only one caller performs the refresh. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When the task holding _refresh_lock is cancelled during asyncio.to_thread(), the lock would be released while the worker thread is still running. A second caller could then acquire the lock, see _needs_refresh == True, and start a duplicate refresh. Shield the to_thread call so that cancellation waits for the refresh to complete before releasing the lock and re-raising CancelledError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts: # tests/unit/test_hooks.py

aswhit1 · 2026-05-26T13:24:35Z

Independent evidence for the refresh-race shape

Thanks for the response on #270 -- adding the cluster-shape
evidence specific to the refresh race this PR addresses, for the
same audit window, since you confirmed #270 and #286 are
complementary rather than interchangeable. Sharing as the same
downstream user, via the
opencode MCP client.

What clustered-within-batch looks like in our data

Same n=64 audit window referenced on #270, narrowed to the failure
shape this PR addresses: when a credential refresh fires mid-fanout,
multiple concurrent _sign_request_hook calls each observe
_needs_refresh=True and each call create_aws_session(profile).
The losing-side signers complete signing against credentials that
were replaced under them; AWS responds 401/403; proxy surfaces
InvalidClientTokenId / AuthFailure / InvalidToken flavors.

The shape we'd predict from this race -- and what we observed --
is clustering: failures don't distribute independently across
batches; they bunch within a batch when refresh fires during it,
and stay zero in batches where refresh doesn't fire.

Batch size	Trials	Total calls	Failed batches	Failures concentrated in failed batches
5 (mixed-service, warm)	5	25	3 of 5	All 3 failures in the 3 failed batches; remaining 2 batches 0/10
6, 7, 8 (mixed-service, warm)	1 each	21	0 of 3	n/a

If failures were independent per call, a 12% per-call rate (3/25)
would distribute as roughly 1 failure per batch, not 3 batches with
~1 failure each and 2 clean batches. The clustering is consistent
with the refresh-race shape this PR describes -- once one call
401s, the second-and-subsequent concurrent fanout calls each
observe _needs_refresh=True and race the refresh.

Why locked + double-checked is the right shape

The double-check pattern in this PR (outer _needs_refresh check
avoids lock overhead on the fast path; inner check ensures only
one caller actually performs the refresh) lines up with what our
data shows we need:

Fast path stays fast: in the 21/21 PASS warm batches above,
no refresh fired, no lock contention -- the outer check returns
immediately
Refresh path serializes: in the 3 batches that did refresh,
all signers would queue on the lock, the first completes the
refresh, the rest see _needs_refresh=False on the inner check
and skip the redundant create_aws_session -- which is exactly
the cluster of failures we observe today

The cancellation shield matters for our setup too: opencode cancels
in-flight tool calls when the user interrupts. Without the shield,
a cancelled caller mid-refresh would release the lock while the
worker thread is still running, and the next batch would race a
half-completed refresh.

In-the-wild reproducer eventIDs (refresh-race subset)

From the same audit window referenced on #270, narrowed to the
cluster-shape variant:

053a0ced-..., 1f464cfa-... (mixed-service size-5 batch,
same Trial 2.5 -- two failures in one batch with
InvalidToken and InvalidClientTokenId variants
respectively, consistent with two concurrent signers losing
the refresh race against the same cycle)

The single-shot 502-from-runtime cluster on #270 (dbbc78f0 /
ae01a432 / 270c11a4 / 6d60eaf2) is likely the concurrent-signing
territory addressed by #270; the pair above is what we'd attribute
to the refresh race specifically.

Caveat I want to be honest about

We have not directly instrumented the lock contention or
measured how often _needs_refresh actually fires within a
batch -- our evidence is the failure shape (clustering),
which is consistent with the refresh race but doesn't prove
it over a hypothetical alternative race in the same code path.
If you'd find instrumented evidence useful (e.g., a debug log
counting create_aws_session invocations per batch), we can
patch our local install and rerun the bench script.

Happy to share additional CloudTrail evidence or the bench
script if it would help.

aswhit1 · 2026-05-29T22:22:09Z

Follow-up on v1.6.0

Watching v1.6.0 land today, closing the loop from a downstream
user as promised on #270.

#294
(merged for v1.6.0) removes SessionHolder entirely and reads
fresh credentials from disk on every request. From our reading,
that architectural change closes the refresh race this PR was
addressing: there's no longer a shared mutable
RefreshableCredentials for concurrent signers to read across.

We upgraded to v1.6.0 and re-ran the parallel-fanout bench:

Version	Per-call PASS rate	Cluster-shape failures (within-batch)
v1.5.0 (prior bench)	22/25 = 88%	yes -- the within-batch clustering this PR's evidence flagged
v1.6.0 (post-#294)	25/25 = 100%	none

Small-n caveat: 25 calls is n=5 trials; CIs overlap somewhat at
this sample size. The signal is suggestive rather than conclusive,
but the disappearance of the cluster-shape tracks with the race
diagnosis -- if the race had been wrong, removing the shared state
shouldn't have moved the rate.

If that read holds up under more replication, the lock+double-check
shape proposed here becomes unnecessary in the post-#294 codebase
-- there's nothing to protect. We don't have a position on whether
this PR should land in some adapted form, be closed as superseded,
or stay open in case SessionHolder-style caching returns --
maintainers know the codebase trajectory better than we do. Just
sharing the bench in case our prior cluster-shape evidence is
helpful in deciding disposition.

nix-tkobayashi · 2026-05-31T04:44:00Z

Superseded by #294.

#294 removed SessionHolder entirely, so the asyncio.Lock introduced here no longer has a target in the current codebase. @aswhit1's v1.6.0 benchmark also confirmed the credential-refresh race is resolved on the new path.

The event-loop-blocking concern discussed in #270 remains valid but is orthogonal to this PR, and will be handled separately on top of the post-#294 codebase.

nix-tkobayashi and others added 5 commits May 8, 2026 19:29

Merge branch 'main' into fix/async-credential-resolution

65c2ba3

Merge branch 'main' into fix/async-credential-resolution

a1aa047

nix-tkobayashi requested a review from a team as a code owner May 20, 2026 12:30

nix-tkobayashi requested review from JacquesPotgieter and wzxxing May 20, 2026 12:30

nix-tkobayashi and others added 2 commits May 20, 2026 21:46

Merge remote-tracking branch 'upstream/main' into fix/async-refresh-lock

fdb906b

# Conflicts: # tests/unit/test_hooks.py

aswhit1 mentioned this pull request May 25, 2026

fix: resolve AWS credentials off the event loop #270

Open

Merge branch 'main' into fix/async-refresh-lock

25d7660

aswhit1 mentioned this pull request May 27, 2026

Hardcoded httpx pool size limits parallel tool-call fanout #295

Open

1 task

nix-tkobayashi closed this May 31, 2026

benjaminmatthews mentioned this pull request Jun 12, 2026

SigV4 signing includes all request headers, causing deterministic 401 InvalidSignatureException when any in-transit hop rewrites an incidental header #313

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add asyncio.Lock to prevent concurrent refresh races#286

fix: add asyncio.Lock to prevent concurrent refresh races#286
nix-tkobayashi wants to merge 8 commits into
aws:mainfrom
nix-tkobayashi:fix/async-refresh-lock

nix-tkobayashi commented May 20, 2026 •

edited

Loading

Uh oh!

aswhit1 commented May 26, 2026

Uh oh!

aswhit1 commented May 29, 2026

Uh oh!

nix-tkobayashi commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nix-tkobayashi commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

User experience

Checklist

Acknowledgment

Uh oh!

aswhit1 commented May 26, 2026

Independent evidence for the refresh-race shape

What clustered-within-batch looks like in our data

Why locked + double-checked is the right shape

In-the-wild reproducer eventIDs (refresh-race subset)

Caveat I want to be honest about

Uh oh!

aswhit1 commented May 29, 2026

Follow-up on v1.6.0

Uh oh!

nix-tkobayashi commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nix-tkobayashi commented May 20, 2026 •

edited

Loading