Skip to content

fix: add asyncio.Lock to prevent concurrent refresh races#286

Closed
nix-tkobayashi wants to merge 8 commits into
aws:mainfrom
nix-tkobayashi:fix/async-refresh-lock
Closed

fix: add asyncio.Lock to prevent concurrent refresh races#286
nix-tkobayashi wants to merge 8 commits into
aws:mainfrom
nix-tkobayashi:fix/async-refresh-lock

Conversation

@nix-tkobayashi

@nix-tkobayashi nix-tkobayashi commented May 20, 2026

Copy link
Copy Markdown

Summary

Changes

Follow-up to #270 (per review feedback).

SessionHolder.async_refresh_if_needed() can be called concurrently by multiple in-flight requests when the session is marked for refresh. Without synchronization, all callers race to create a new boto3 session simultaneously — wasting STS calls and risking inconsistent state.

This PR:

  • Adds an asyncio.Lock to SessionHolder with a double-check pattern: the outer _needs_refresh check avoids lock overhead on the fast path; the inner check ensures only one caller actually performs the refresh
  • Shields the to_thread call from task cancellation so that a cancelled caller does not release the lock while the worker thread is still running (prevents duplicate refresh on cancellation)
  • Adds two unit tests:
    • Concurrent-callers test using threading.Event barriers to assert create_aws_session is called exactly once
    • Cancellation test that verifies the lock is held until the refresh completes, even when the awaiting task is cancelled

User experience

No user-visible behavior change. This is a correctness hardening for the credential refresh path introduced in #270.

Checklist

  • I have reviewed the contributing guidelines
  • I have performed a self-review of this change
  • Changes have been tested
  • Changes are documented

Is this a breaking change? (Y/N)

  • Yes
  • No

Please add details about how this change was tested.

  • All 185 unit tests pass with 96% coverage (80% required)
  • Pre-commit hooks pass
  • Ruff linting and formatting pass
  • Pyright passes (no new errors)
  • Concurrent test uses threading.Event barriers to assert single refresh
  • Cancellation test verified to FAIL without shield and PASS with shield
  • Integration tests not run locally (requires AWS OIDC credentials)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

nix-tkobayashi and others added 5 commits May 8, 2026 19:29
… hook

`_sign_request_hook` is an async httpx event hook, but it calls
`session_holder.session.get_credentials()` and
`session_holder.refresh_if_needed()` synchronously. For profiles that
use assumed IAM roles (chained credentials), `get_credentials()` triggers
a blocking STS `AssumeRole` call that stalls the event loop and causes
connection timeouts (~60 s).

Add `SessionHolder.async_get_credentials()` and
`SessionHolder.async_refresh_if_needed()` which delegate to
`asyncio.to_thread`, keeping the event loop responsive. Update
`_sign_request_hook` to call the async variants.

Fixes aws#176

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback:
- Clarify that get_credentials() is the primary blocking path;
  async_refresh_if_needed() is a defensive measure
- Replace timeout-based non-blocking test with ticker coroutine that
  proves the event loop stays responsive during credential resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevent concurrent refresh races when multiple in-flight requests
trigger credential refresh simultaneously. Uses a double-check
pattern: the outer check avoids lock overhead on the fast path,
the inner check ensures only one caller performs the refresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nix-tkobayashi nix-tkobayashi requested a review from a team as a code owner May 20, 2026 12:30
nix-tkobayashi and others added 2 commits May 20, 2026 21:46
When the task holding _refresh_lock is cancelled during
asyncio.to_thread(), the lock would be released while the worker
thread is still running.  A second caller could then acquire the
lock, see _needs_refresh == True, and start a duplicate refresh.

Shield the to_thread call so that cancellation waits for the
refresh to complete before releasing the lock and re-raising
CancelledError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aswhit1

aswhit1 commented May 26, 2026

Copy link
Copy Markdown

Independent evidence for the refresh-race shape

Thanks for the response on #270 -- adding the cluster-shape
evidence specific to the refresh race this PR addresses, for the
same audit window, since you confirmed #270 and #286 are
complementary rather than interchangeable. Sharing as the same
downstream user, via the
opencode MCP client.

What clustered-within-batch looks like in our data

Same n=64 audit window referenced on #270, narrowed to the failure
shape this PR addresses: when a credential refresh fires mid-fanout,
multiple concurrent _sign_request_hook calls each observe
_needs_refresh=True and each call create_aws_session(profile).
The losing-side signers complete signing against credentials that
were replaced under them; AWS responds 401/403; proxy surfaces
InvalidClientTokenId / AuthFailure / InvalidToken flavors.

The shape we'd predict from this race -- and what we observed --
is clustering: failures don't distribute independently across
batches; they bunch within a batch when refresh fires during it,
and stay zero in batches where refresh doesn't fire.

Batch size Trials Total calls Failed batches Failures concentrated in failed batches
5 (mixed-service, warm) 5 25 3 of 5 All 3 failures in the 3 failed batches; remaining 2 batches 0/10
6, 7, 8 (mixed-service, warm) 1 each 21 0 of 3 n/a

If failures were independent per call, a 12% per-call rate (3/25)
would distribute as roughly 1 failure per batch, not 3 batches with
~1 failure each and 2 clean batches. The clustering is consistent
with the refresh-race shape this PR describes -- once one call
401s, the second-and-subsequent concurrent fanout calls each
observe _needs_refresh=True and race the refresh.

Why locked + double-checked is the right shape

The double-check pattern in this PR (outer _needs_refresh check
avoids lock overhead on the fast path; inner check ensures only
one caller actually performs the refresh) lines up with what our
data shows we need:

  • Fast path stays fast: in the 21/21 PASS warm batches above,
    no refresh fired, no lock contention -- the outer check returns
    immediately
  • Refresh path serializes: in the 3 batches that did refresh,
    all signers would queue on the lock, the first completes the
    refresh, the rest see _needs_refresh=False on the inner check
    and skip the redundant create_aws_session -- which is exactly
    the cluster of failures we observe today

The cancellation shield matters for our setup too: opencode cancels
in-flight tool calls when the user interrupts. Without the shield,
a cancelled caller mid-refresh would release the lock while the
worker thread is still running, and the next batch would race a
half-completed refresh.

In-the-wild reproducer eventIDs (refresh-race subset)

From the same audit window referenced on #270, narrowed to the
cluster-shape variant:

  • 053a0ced-..., 1f464cfa-... (mixed-service size-5 batch,
    same Trial 2.5 -- two failures in one batch with
    InvalidToken and InvalidClientTokenId variants
    respectively, consistent with two concurrent signers losing
    the refresh race against the same cycle)

The single-shot 502-from-runtime cluster on #270 (dbbc78f0 /
ae01a432 / 270c11a4 / 6d60eaf2) is likely the concurrent-signing
territory addressed by #270; the pair above is what we'd attribute
to the refresh race specifically.

Caveat I want to be honest about

We have not directly instrumented the lock contention or
measured how often _needs_refresh actually fires within a
batch -- our evidence is the failure shape (clustering),
which is consistent with the refresh race but doesn't prove
it over a hypothetical alternative race in the same code path.
If you'd find instrumented evidence useful (e.g., a debug log
counting create_aws_session invocations per batch), we can
patch our local install and rerun the bench script.

Happy to share additional CloudTrail evidence or the bench
script if it would help.

@aswhit1

aswhit1 commented May 29, 2026

Copy link
Copy Markdown

Follow-up on v1.6.0

Watching v1.6.0 land today, closing the loop from a downstream
user as promised on #270.

#294
(merged for v1.6.0) removes SessionHolder entirely and reads
fresh credentials from disk on every request. From our reading,
that architectural change closes the refresh race this PR was
addressing: there's no longer a shared mutable
RefreshableCredentials for concurrent signers to read across.

We upgraded to v1.6.0 and re-ran the parallel-fanout bench:

Version Per-call PASS rate Cluster-shape failures (within-batch)
v1.5.0 (prior bench) 22/25 = 88% yes -- the within-batch clustering this PR's evidence flagged
v1.6.0 (post-#294) 25/25 = 100% none

Small-n caveat: 25 calls is n=5 trials; CIs overlap somewhat at
this sample size. The signal is suggestive rather than conclusive,
but the disappearance of the cluster-shape tracks with the race
diagnosis -- if the race had been wrong, removing the shared state
shouldn't have moved the rate.

If that read holds up under more replication, the lock+double-check
shape proposed here becomes unnecessary in the post-#294 codebase
-- there's nothing to protect. We don't have a position on whether
this PR should land in some adapted form, be closed as superseded,
or stay open in case SessionHolder-style caching returns --
maintainers know the codebase trajectory better than we do. Just
sharing the bench in case our prior cluster-shape evidence is
helpful in deciding disposition.

@nix-tkobayashi

Copy link
Copy Markdown
Author

Superseded by #294.

#294 removed SessionHolder entirely, so the asyncio.Lock introduced here no longer has a target in the current codebase. @aswhit1's v1.6.0 benchmark also confirmed the credential-refresh race is resolved on the new path.

The event-loop-blocking concern discussed in #270 remains valid but is orthogonal to this PR, and will be handled separately on top of the post-#294 codebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants