Skip to content

fix(otlp-aws-exporter): avoid RecursionError with pip_system_certs on…#749

Open
viw-test1 wants to merge 4 commits into
aws-observability:mainfrom
viw-test1:fix-pip-system-certs-recursion
Open

fix(otlp-aws-exporter): avoid RecursionError with pip_system_certs on…#749
viw-test1 wants to merge 4 commits into
aws-observability:mainfrom
viw-test1:fix-pip-system-certs-recursion

Conversation

@viw-test1
Copy link
Copy Markdown
Contributor

@viw-test1 viw-test1 commented May 21, 2026

… Python 3.12

Issue #, if available:
https://t.corp.amazon.com/V2191746236
Description of changes:

The problem: When a customer used ADOT Python to send telemetry to CloudWatch, their app crashed and no telemetry showed up. The crash was a RecursionError — Python ran out of stack frames trying to set up an SSL connection.

Why it happened: Three things had to be true at the same time:

  1. The customer was on Python 3.12.
  2. They had pip_system_certs installed (a corporate-CA-trust library, common in enterprise environments).
  3. They were using ADOT Python's auto-instrumentation.

pip_system_certs swaps out Python's default SSL class with its own at startup, but it does this late — after ADOT has already loaded botocore and urllib3. By that point, those libraries had already grabbed a reference to the old SSL class. Python 3.12 has a quirk where its SSL options-setter looks up the current SSL class every time it runs, which now resolves back to the new class, which calls back into the old setter, which calls itself, and so on — infinite loop, crash.

ADOT made it worse because the OTLP exporter signs every export with AWS SigV4, which builds a fresh SSL context every 5 seconds — guaranteeing the crash on every export.

Our fix: On the first export, we point botocore and urllib3 back at the current SSL class (the one pip_system_certs injected). The new class is well-behaved and doesn't have the recursive setter, so SSL works again. We also cache the AWS credentials so we only build one SSL context per exporter instead of one per export.

Result: Customer keeps pip_system_certs (they need it for corporate CA trust), keeps ADOT (they need observability), and nothing crashes anymore.

Latest fix:

  • Add patches/pip_system_certs_patches.py with a one-shot helper that rebinds botocore.httpsession.SSLContext and urllib3.util.ssl.SSLContext to the current ssl.SSLContext (truststore's wrapper). truststore's own SSLContext.options setter does not use the recursive super() pattern, so subsequent SSL context creations succeed.
  • Apply the patch on the first AwsAuthSession.request() call, which is the earliest point at which pip_system_certs's deferred injection is guaranteed to have run.
  • Cache the resolved AWS credentials on the same first request so the credential resolver chain (which also constructs an SSL context) is exercised once per exporter rather than once per export. RefreshableCredentials handles rotation internally on attribute access, so caching the reference is safe.

Tests:

  • New test_pip_system_certs_patches.py covers the no-op, rebind, idempotent, and one-shot guard paths.
  • test_aws_auth_session.py adds assertions that get_credentials runs once across multiple requests and that the patch helper is invoked once.

The fix was validated against the customer's exact failure conditions using a reproduction harness in pip-system-certs-repro/. Two ephemeral Python 3.12 venvs were built, identical except for the version of aws-opentelemetry-distro: one pinned to the published 0.10.0 release that the customer hit, the other installed in editable mode from this branch. Both venvs install pip_system_certs==5.3, botocore==1.42.96, truststore, requests, and urllib3, mirroring the ticket's environment. The repro script then deterministically recreates the post-injection module state described in prashant's worklog: it captures the original C-level ssl.SSLContext, runs truststore.inject_into_ssl() to mimic pip_system_certs's deferred .pth injection, and forces botocore.httpsession.SSLContext and urllib3.util.ssl_.SSLContext to hold the original captured reference — exactly the AL2023 capture-before-replace race that ADOT's eager imports during sitecustomize lose to. With that state in place, the script constructs an AwsAuthSession and calls request() against a stubbed-out HTTP adapter — the same SigV4 code path the OTLP AWS exporter exercises every 5 seconds in production.

Old case: on the 0.10.0 venv this reliably raises RecursionError, matching the stack trace from CloudWatch in the ticket.
New case: on the branch venv the same call completes without recursion because apply_pip_system_certs_compatibility_patch() rebinds the stale references to truststore's wrapper before the credential resolver constructs an SSL context.

Thus, the fix works because the test exercises the actual AwsAuthSession.request() code path against real botocore and urllib3 instances at the real Python 3.12 stdlib ssl.SSLContext.options.set setter — the bug fires in the same code, on the same Python version, under the same module-state preconditions as in the customer's container, and we observe it disappear when the patch is applied. The harness also gives us fast iteration: any edit under aws-opentelemetry-distro/src/... is picked up by the next run.sh with no rebuild.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

… Python 3.12

When pip_system_certs is installed it injects truststore.SSLContext as the
process-wide ssl.SSLContext via a .pth file. The injection runs in the finally
block of a site.execsitecustomize wrapper - i.e. after sitecustomize.py
returns. OpenTelemetry's auto-instrumentation runs from sitecustomize, which
loads the ADOT distro and transitively imports requests (via the upstream OTLP
HTTP exporters) and botocore. Both of those modules capture a reference to
ssl.SSLContext at import time, before pip_system_certs's injection runs.

On Python 3.12, ssl.SSLContext.options.__set__ resolves SSLContext from
ssl module globals at call time. After pip_system_certs runs, that name
resolves to truststore.SSLContext, and the super() chain in the options
setter bounces between the original ssl.SSLContext and truststore.SSLContext
until the recursion limit (~978 frames) is exceeded.

The OTLP AWS exporter amplifies the issue by signing every export with
SigV4, which constructs a urllib3 SSL context every 5 seconds.

Fix:
- Add patches/_pip_system_certs_patches.py with a one-shot helper that
  rebinds botocore.httpsession.SSLContext and urllib3.util.ssl_.SSLContext
  to the current ssl.SSLContext (truststore's wrapper). truststore's own
  SSLContext.options setter does not use the recursive super() pattern,
  so subsequent SSL context creations succeed.
- Apply the patch on the first AwsAuthSession.request() call, which is
  the earliest point at which pip_system_certs's deferred injection is
  guaranteed to have run.
- Cache the resolved AWS credentials on the same first request so the
  credential resolver chain (which also constructs an SSL context) is
  exercised once per exporter rather than once per export.
  RefreshableCredentials handles rotation internally on attribute access,
  so caching the reference is safe.

Tests:
- New test_pip_system_certs_patches.py covers the no-op, rebind, idempotent,
  and one-shot guard paths.
- test_aws_auth_session.py adds assertions that get_credentials runs once
  across multiple requests and that the patch helper is invoked once.
@viw-test1 viw-test1 self-assigned this May 21, 2026
@viw-test1 viw-test1 requested a review from a team as a code owner May 21, 2026 18:23
@viw-test1 viw-test1 added the skip changelog doesn't need a CHANGELOG entry label May 21, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a well-researched fix for a real and gnarly issue. The one-shot SSL rebind approach is sound. A few concerns below, primarily around the permanent failure mode introduced by caching credentials.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix for a gnarly issue. The one-shot SSL rebind approach is sound. See inline comments for detailed findings.

self._credentials = None

self._credentials_resolved = True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Permanent credential failure on transient errors

If get_credentials() raises (e.g., IMDS timeout, transient network blip), self._credentials stays None and _credentials_resolved is set to True unconditionally here. All subsequent request() calls will skip signing forever for the lifetime of this session/exporter -- there is no retry path.

This is a regression vs. the old behavior where get_credentials() was called on every request, meaning a transient failure on one export cycle would self-heal on the next.

Consider either:

  1. Not setting _credentials_resolved = True when credential resolution fails (so it retries next call), or
  2. Moving _credentials_resolved = True inside the try block (after the successful get_credentials() call), or
  3. Setting a TTL on the failure so it can be retried after some delay.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viw-test1 can you address this?

_logger: Logger = getLogger(__name__)

# Module-level guard so the patch is applied at most once per process.
_patch_applied = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread-safety: module-level _patch_applied guard is not atomic

The _patch_applied flag is a plain boolean checked and set without a lock. While AwsAuthSession._ensure_initialized() has its own lock, apply_pip_system_certs_compatibility_patch() is a public function that could be called from multiple threads concurrently (e.g., if multiple exporters are initialized in parallel). Two threads could both read _patch_applied = False before either sets it to True, causing the patch to run twice.

This is low-severity since the patch itself is idempotent, but it contradicts the stated guarantee (applied at most once per process). Consider using a threading.Lock here as well, or document that idempotency makes the race benign.

# Only apply the patch when pip_system_certs is installed in user application space.
if not _is_pip_system_certs_installed():
_patch_applied = True
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defensive suggestion: verify ssl.SSLContext is actually the truststore wrapper

The patch reads ssl.SSLContext here and binds it to botocore/urllib3. This relies on the assumption that pip_system_certs has already completed its injection by the time this code runs (guaranteed when called from AwsAuthSession.request()).

However, if apply_pip_system_certs_compatibility_patch() is ever called from a different call site (e.g., during module import or distro initialization), ssl.SSLContext would still be the original, and this code would be a no-op (since botocore.httpsession.SSLContext is ssl.SSLContext would be True). This is safe but silently ineffective.

Consider adding a debug log that verifies ssl.SSLContext is actually the truststore wrapper (e.g., by checking ssl.SSLContext.module or ssl.SSLContext.name) to aid debugging if the call site assumptions change in the future.

This runs after sitecustomize has fully completed (i.e., after any ``.pth``
based ``ssl.SSLContext`` injection from packages such as ``pip_system_certs``),
which is the only point at which we can safely re-align stale ``SSLContext``
references captured by ``botocore`` / ``urllib3`` during ADOT startup.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Double-checked locking and free-threaded Python

The double-checked locking pattern is correct for CPython due to the GIL, but self._credentials_resolved is read without the lock on the fast path (two lines above). In CPython, reading/writing a bool is atomic due to the GIL, so this is safe in practice. However, if this code ever runs on a Python implementation without a GIL (e.g., the experimental free-threaded Python 3.13+), this would need a memory barrier or atomic.

Fine for now, but worth a brief comment noting the GIL dependency for future maintainers.

# Reset the module-level guard before every test so each test exercises the
# full code path.
_pip_system_certs_patches._patch_applied = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test isolation: mutating module-level state directly is fragile

Directly resetting _pip_system_certs_patches._patch_applied = False in setUp/tearDown works but creates implicit coupling between the test and the module internals. If tests run in parallel (e.g., pytest-xdist), they could interfere with each other since they share the same module-level state.

This is acceptable for unit tests running sequentially, but consider using importlib.reload(_pip_system_certs_patches) instead for better encapsulation and to avoid accidentally coupling tests to the internal guard variable name.

except Exception as patch_error: # pylint: disable=broad-except
_logger.warning(
"Failed to apply pip_system_certs compatibility patch: %s", patch_error
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Permanent credential failure is not retryable (aws_auth_session.py line 83)

If get_credentials() raises an exception on the first call (e.g., transient network error reaching IMDS), self._credentials stays None and _credentials_resolved is set to True unconditionally. Every subsequent request() call will send an unsigned request with no recovery path.

Consider not setting _credentials_resolved = True when credential resolution fails, so a future request() call can retry.


# Module-level guard so the patch is applied at most once per process.
_patch_applied = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread safety: module-level _patch_applied guard is racy.

The _patch_applied boolean is read/written without synchronization. Two threads calling apply_pip_system_certs_compatibility_patch() concurrently could both see _patch_applied = False and both execute the patch body. The patch itself is idempotent (rebinding an attribute twice is harmless), so this is not a correctness bug, but it is inconsistent with the careful double-checked locking used in _ensure_initialized().

Since the caller (_ensure_initialized) already holds _credentials_lock before calling this, the function is effectively single-threaded today. Worth a brief comment documenting that caller-side serialization assumption so future callers outside the lock do not introduce a real race.


This runs after sitecustomize has fully completed (i.e., after any ``.pth``
based ``ssl.SSLContext`` injection from packages such as ``pip_system_certs``),
which is the only point at which we can safely re-align stale ``SSLContext``
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (future-proofing): double-checked locking and free-threaded Python.

This is a textbook double-checked locking pattern and works correctly in CPython due to the GIL. However, Python 3.13+ has experimental free-threaded builds (PEP 703) where reading _credentials_resolved outside the lock is a data race. Not blocking for now, but worth noting if this project plans to support --disable-gil in the future.

_logger.error("Failed to sign request: %s", signing_error)
else:
_logger.error("Failed to load AWS Credentials: %s")
_logger.error("Failed to load AWS Credentials")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix. The original _logger.error with a %s format specifier but no corresponding argument would produce a malformed log message (or raise in strict logging configurations). Removing the orphaned %s is correct.

def setUp(self) -> None:
# Reset the module-level guard before every test so each test exercises the
# full code path.
_pip_system_certs_patches._patch_applied = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test isolation concern: directly mutating module-level state.

Directly assigning _pip_system_certs_patches._patch_applied = False in setUp/tearDown works but is fragile - if a test raises between setUp and tearDown, the global state leaks to subsequent tests. Consider using unittest.mock.patch.object(_pip_system_certs_patches, '_patch_applied', False) as a class decorator or via self.addCleanup(), which guarantees cleanup even on unhandled exceptions.

# Only apply the patch when pip_system_certs is installed in user application space.
if not _is_pip_system_certs_installed():
_patch_applied = True
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consideration: coverage of other modules that capture ssl.SSLContext at import time.

This patch rebinds botocore.httpsession.SSLContext and urllib3.util.ssl_.SSLContext. If other dependencies in ADOT's transitive closure also capture ssl.SSLContext at import time during sitecustomize (e.g., aiohttp, httpx), they could hit the same recursion. Not necessarily something to fix now (YAGNI), but might be worth a brief note in the docstring that the patch covers the known affected modules and may need extension if other libraries are added to the auto-instrumentation.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review of the pip_system_certs compatibility patch. Overall a well-thought-out fix. See inline comments for detailed findings.

# (re-running it produces the same final state), so a benign race between two
# threads where both observe ``_patch_applied is False`` and both run the rebind
# costs an extra dict assignment and nothing more. We don't pay for a lock here.
_patch_applied = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug (race condition consideration for free-threaded Python): The comment in _ensure_initialized correctly notes that the plain-bool guard is only safe under GIL, but this module-level _patch_applied flag has the same issue. On CPython with the GIL this is fine, but the comment in aws_auth_session.py explicitly calls out free-threaded Python 3.13t+ as a future concern.

Even under the GIL, if two AwsAuthSession instances (from different exporters) race here, both may observe _patch_applied is False and both will run the patch body. The comment acknowledges this (benign race, costs an extra dict assignment), but the _is_pip_system_certs_installed() call does I/O (package metadata lookup via importlib.metadata.version()) which is more expensive than a dict assignment. Consider using a threading.Lock here too if you expect multiple exporters to initialize concurrently, or at minimum acknowledge the I/O cost in the comment.

Note: the read of ``_credentials_resolved`` outside the lock is safe because
Python's GIL makes attribute reads/writes atomic. On free-threaded Python
builds (3.13t+) this would need a memory barrier; revisit if/when we
support those.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug (thread-safety of _credentials read): In the request() method, after _ensure_initialized() returns, the code reads self._credentials without holding the lock. Consider this race:

  1. Thread A calls _ensure_initialized(), enters the lock, credential resolution fails, sets self._credentials = None, returns without latching _credentials_resolved.
  2. Thread B calls _ensure_initialized(), enters the lock, credential resolution succeeds, sets self._credentials = <valid>, latches _credentials_resolved = True.
  3. Thread A (which already returned from _ensure_initialized before Thread B entered) now reads self._credentials in request(). Under GIL the attribute read is atomic, but if Thread B hasn't run its assignment yet (B is still waiting to acquire the lock while A already returned), A sees None and skips signing for that request.

This is a narrow window but possible when multiple threads share one AwsAuthSession instance. Consider returning the credentials from _ensure_initialized() directly, or having request() re-check self._credentials after confirming _credentials_resolved is True.

_logger.warning("Failed to apply pip_system_certs compatibility patch: %s", patch_error)

try:
self._credentials = self._session.get_credentials()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note (double-checked locking ordering dependency): For correctness of the double-checked locking pattern, the ordering matters: self._credentials must be assigned before self._credentials_resolved = True (which it is). In CPython with GIL, this is guaranteed because both assignments are in the same thread within the lock, and the GIL prevents another thread from seeing a partial state.

However, if this code is ever refactored (e.g., the assignment order is swapped, or a future maintainer moves code around), the outer reader could see _credentials_resolved = True but _credentials still as None. A brief inline comment like # Order matters: set credentials before latching the flag above line 99 (self._credentials_resolved = True) would help prevent this subtle regression.

return

# pylint: disable=import-outside-toplevel
import ssl
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security consideration: This rebinds botocore.httpsession.SSLContext and urllib3.util.ssl_.SSLContext to whatever ssl.SSLContext currently points to (truststore's wrapper). This is the correct fix for the recursion bug.

Worth noting: truststore delegates certificate verification to the OS trust store rather than certifi's bundled CA certificates. This is the intended behavior of pip_system_certs, and the patch is only active when that package is installed, so this is what the customer opted into. However, it means ADOT's TLS connections will now use the system trust store in this scenario. If a customer has a misconfigured system trust store (e.g., missing intermediate CAs), connections that previously worked with certifi could fail after this patch runs. Consider mentioning this behavior in the CHANGELOG entry or docs for clarity.

@patch("requests.Session.request", return_value=requests.Response())
@patch("botocore.session.Session.get_credentials", return_value=mock_credentials)
def test_signing_failure_does_not_break_request(self, _, __):
"""If SigV4 signing itself raises, ``request()`` still issues the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code quality: Looking at the implementation, apply_pip_system_certs_compatibility_patch() is called on every entry into the lock (i.e., on each retry when credentials fail transiently), not just the first time. Since the patch has its own module-level _patch_applied guard this is safe but slightly wasteful -- each retry still pays for the function call overhead and the boolean check.

Consider either:

  1. Adding a separate instance-level _patch_attempted flag so the SSL patch is only called once per AwsAuthSession instance regardless of credential retry count, or
  2. Moving apply_pip_system_certs_compatibility_patch() outside/before the double-checked locking block entirely (it has its own module-level guard and is process-wide, so it does not need the instance lock).

@@ -0,0 +1,145 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential issue (missing __init__.py): The tests/amazon/opentelemetry/distro/patches/ directory (and the source src/amazon/opentelemetry/distro/patches/ directory) do not have __init__.py files. Looking at the existing project structure, the other test directories under tests/ also lack __init__.py and the project appears to rely on implicit namespace packages with hatchling.

However, looking at the source tree, some sub-packages like instrumentation/ and code_correlation/ DO have explicit __init__.py files. Since the patches directory already has other files that are successfully imported (e.g., _botocore_patches.py), this is likely fine. But please verify that pytest discovers this new test file correctly in CI -- namespace package discovery can be fragile when mixing explicit and implicit packages in the same tree.

"""Is the pip_system_certs package installed?"""
try:
dist_version = version("pip_system_certs")
_logger.debug("pip_system_certs is installed: %s", dist_version)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code quality (minor): version("pip_system_certs") reads package metadata from disk (site-packages *.dist-info/METADATA). It runs on the first request() call, which is the OTLP export hot path. Since the module-level _patch_applied guard ensures this only runs once per process, the one-time cost is acceptable. Just flagging for awareness that this adds a small latency spike to the very first export (~1-5ms for metadata lookup depending on filesystem).

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary for PR 749

  1. BUG - Double-logging on credential failure (aws_auth_session.py line 127): When get_credentials() raises inside _ensure_initialized(), the error is already logged with the exception detail. Then in request(), the else branch logs 'Failed to load AWS Credentials' again without the exception. This produces duplicate error logs per failed request. Consider removing the else log or downgrading to debug level.

  2. CORRECTNESS - Double-checked locking on free-threaded Python 3.13t+ (aws_auth_session.py line 65): Python 3.13 is in pyproject.toml classifiers. On free-threaded builds, the unsynchronized read of self._credentials_resolved and self._credentials in request() could produce torn reads. Consider always acquiring the lock (negligible cost when uncontended) or file a tracking issue.

  3. THREAD-SAFETY NOTE - Module-level guard (_pip_system_certs_patches.py line 14): The benign race is acceptable because the patch is idempotent and the caller serializes via per-instance lock. Document this assumption in case the function is called from other sites in the future.

  4. NIT - Future-proofing ssl rebinds (pip_system_certs_patches.py line 68): patching urllib3.util.ssl.SSLContext should be sufficient since requests delegates to urllib3. If recursion appears from other call sites, this is where to add rebindings.

  5. TEST ISOLATION (test_pip_system_certs_patches.py line 15): Directly mutating module-level _patch_applied works for sequential unittest but would break under pytest-xdist parallel execution.

self._credentials_resolved = True

def request(self, method, url, *args, data=None, headers=None, **kwargs):
credentials = self._session.get_credentials()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Double-logging on credential failure. When get_credentials() raises inside _ensure_initialized(), the error is already logged with the exception detail via _logger.error. Then here in request(), this else branch fires again and logs 'Failed to load AWS Credentials' without the exception. This produces duplicate error logs per failed request. Consider removing this else log or downgrading to _logger.debug.

except Exception as signing_error: # pylint: disable=broad-except
_logger.error("Failed to sign request: %s", signing_error)
else:
_logger.error("Failed to load AWS Credentials: %s")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Double-logging on credential failure. When get_credentials() raises inside _ensure_initialized(), the error is already logged with the exception detail via _logger.error. Then here in request(), this else branch fires again and logs 'Failed to load AWS Credentials' without the exception. This produces duplicate error logs per failed request. Consider removing this else log or downgrading to _logger.debug.


Note: the read of ``_credentials_resolved`` outside the lock is safe because
Python's GIL makes attribute reads/writes atomic. On free-threaded Python
builds (3.13t+) this would need a memory barrier; revisit if/when we
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness concern: double-checked locking on free-threaded Python 3.13t+. The docstring acknowledges this, which is good. However Python 3.13 is already in pyproject.toml classifiers. If a user runs on a free-threaded 3.13t build, the unsynchronized read of self._credentials_resolved (here, outside the lock) and self._credentials in request() could produce torn reads or stale values. Consider always acquiring the lock (it is uncontended after the first successful call, so performance cost is negligible on CPython) or file a tracking issue for free-threaded build support.

# threads where both observe ``_patch_applied is False`` and both run the rebind
# costs an extra dict assignment and nothing more. We don't pay for a lock here.
_patch_applied = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread-safety note on the module-level guard: The comment correctly states a benign race is acceptable because the patch body is idempotent. Since the caller (_ensure_initialized) already serializes via its own per-instance lock, this race can only happen if apply_pip_system_certs_compatibility_patch is called from a different call site in the future. Worth documenting the assumption that the caller currently provides serialization.


# pylint: disable=import-outside-toplevel
import ssl

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit/future-proofing: Consider whether other libraries in the dependency tree might also hold stale ssl.SSLContext references (e.g., requests.adapters in some versions). Since requests delegates to urllib3, patching urllib3.util.ssl_.SSLContext should be sufficient for new SSL context creations in the current stack. But if this recursion is reported from a different call site in the future, this module is where to add additional rebindings.

# Reset the module-level guard before every test so each test exercises the
# full code path.
_pip_system_certs_patches._patch_applied = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test isolation concern: directly mutating module-level _patch_applied works for sequential unittest execution. However, if tests ever run in parallel (e.g., with pytest-xdist), they will interfere with each other via the shared module-level state. Worth noting if the test runner strategy changes in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip changelog doesn't need a CHANGELOG entry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants