fix(otlp-aws-exporter): avoid RecursionError with pip_system_certs on…#749
fix(otlp-aws-exporter): avoid RecursionError with pip_system_certs on…#749viw-test1 wants to merge 4 commits into
Conversation
… Python 3.12 When pip_system_certs is installed it injects truststore.SSLContext as the process-wide ssl.SSLContext via a .pth file. The injection runs in the finally block of a site.execsitecustomize wrapper - i.e. after sitecustomize.py returns. OpenTelemetry's auto-instrumentation runs from sitecustomize, which loads the ADOT distro and transitively imports requests (via the upstream OTLP HTTP exporters) and botocore. Both of those modules capture a reference to ssl.SSLContext at import time, before pip_system_certs's injection runs. On Python 3.12, ssl.SSLContext.options.__set__ resolves SSLContext from ssl module globals at call time. After pip_system_certs runs, that name resolves to truststore.SSLContext, and the super() chain in the options setter bounces between the original ssl.SSLContext and truststore.SSLContext until the recursion limit (~978 frames) is exceeded. The OTLP AWS exporter amplifies the issue by signing every export with SigV4, which constructs a urllib3 SSL context every 5 seconds. Fix: - Add patches/_pip_system_certs_patches.py with a one-shot helper that rebinds botocore.httpsession.SSLContext and urllib3.util.ssl_.SSLContext to the current ssl.SSLContext (truststore's wrapper). truststore's own SSLContext.options setter does not use the recursive super() pattern, so subsequent SSL context creations succeed. - Apply the patch on the first AwsAuthSession.request() call, which is the earliest point at which pip_system_certs's deferred injection is guaranteed to have run. - Cache the resolved AWS credentials on the same first request so the credential resolver chain (which also constructs an SSL context) is exercised once per exporter rather than once per export. RefreshableCredentials handles rotation internally on attribute access, so caching the reference is safe. Tests: - New test_pip_system_certs_patches.py covers the no-op, rebind, idempotent, and one-shot guard paths. - test_aws_auth_session.py adds assertions that get_credentials runs once across multiple requests and that the patch helper is invoked once.
| self._credentials = None | ||
|
|
||
| self._credentials_resolved = True | ||
|
|
There was a problem hiding this comment.
Bug: Permanent credential failure on transient errors
If get_credentials() raises (e.g., IMDS timeout, transient network blip), self._credentials stays None and _credentials_resolved is set to True unconditionally here. All subsequent request() calls will skip signing forever for the lifetime of this session/exporter -- there is no retry path.
This is a regression vs. the old behavior where get_credentials() was called on every request, meaning a transient failure on one export cycle would self-heal on the next.
Consider either:
- Not setting _credentials_resolved = True when credential resolution fails (so it retries next call), or
- Moving _credentials_resolved = True inside the try block (after the successful get_credentials() call), or
- Setting a TTL on the failure so it can be retried after some delay.
| _logger: Logger = getLogger(__name__) | ||
|
|
||
| # Module-level guard so the patch is applied at most once per process. | ||
| _patch_applied = False |
There was a problem hiding this comment.
Thread-safety: module-level _patch_applied guard is not atomic
The _patch_applied flag is a plain boolean checked and set without a lock. While AwsAuthSession._ensure_initialized() has its own lock, apply_pip_system_certs_compatibility_patch() is a public function that could be called from multiple threads concurrently (e.g., if multiple exporters are initialized in parallel). Two threads could both read _patch_applied = False before either sets it to True, causing the patch to run twice.
This is low-severity since the patch itself is idempotent, but it contradicts the stated guarantee (applied at most once per process). Consider using a threading.Lock here as well, or document that idempotency makes the race benign.
| # Only apply the patch when pip_system_certs is installed in user application space. | ||
| if not _is_pip_system_certs_installed(): | ||
| _patch_applied = True | ||
| return |
There was a problem hiding this comment.
Defensive suggestion: verify ssl.SSLContext is actually the truststore wrapper
The patch reads ssl.SSLContext here and binds it to botocore/urllib3. This relies on the assumption that pip_system_certs has already completed its injection by the time this code runs (guaranteed when called from AwsAuthSession.request()).
However, if apply_pip_system_certs_compatibility_patch() is ever called from a different call site (e.g., during module import or distro initialization), ssl.SSLContext would still be the original, and this code would be a no-op (since botocore.httpsession.SSLContext is ssl.SSLContext would be True). This is safe but silently ineffective.
Consider adding a debug log that verifies ssl.SSLContext is actually the truststore wrapper (e.g., by checking ssl.SSLContext.module or ssl.SSLContext.name) to aid debugging if the call site assumptions change in the future.
| This runs after sitecustomize has fully completed (i.e., after any ``.pth`` | ||
| based ``ssl.SSLContext`` injection from packages such as ``pip_system_certs``), | ||
| which is the only point at which we can safely re-align stale ``SSLContext`` | ||
| references captured by ``botocore`` / ``urllib3`` during ADOT startup. |
There was a problem hiding this comment.
Nit: Double-checked locking and free-threaded Python
The double-checked locking pattern is correct for CPython due to the GIL, but self._credentials_resolved is read without the lock on the fast path (two lines above). In CPython, reading/writing a bool is atomic due to the GIL, so this is safe in practice. However, if this code ever runs on a Python implementation without a GIL (e.g., the experimental free-threaded Python 3.13+), this would need a memory barrier or atomic.
Fine for now, but worth a brief comment noting the GIL dependency for future maintainers.
| # Reset the module-level guard before every test so each test exercises the | ||
| # full code path. | ||
| _pip_system_certs_patches._patch_applied = False | ||
|
|
There was a problem hiding this comment.
Test isolation: mutating module-level state directly is fragile
Directly resetting _pip_system_certs_patches._patch_applied = False in setUp/tearDown works but creates implicit coupling between the test and the module internals. If tests run in parallel (e.g., pytest-xdist), they could interfere with each other since they share the same module-level state.
This is acceptable for unit tests running sequentially, but consider using importlib.reload(_pip_system_certs_patches) instead for better encapsulation and to avoid accidentally coupling tests to the internal guard variable name.
| except Exception as patch_error: # pylint: disable=broad-except | ||
| _logger.warning( | ||
| "Failed to apply pip_system_certs compatibility patch: %s", patch_error | ||
| ) |
There was a problem hiding this comment.
Bug: Permanent credential failure is not retryable (aws_auth_session.py line 83)
If get_credentials() raises an exception on the first call (e.g., transient network error reaching IMDS), self._credentials stays None and _credentials_resolved is set to True unconditionally. Every subsequent request() call will send an unsigned request with no recovery path.
Consider not setting _credentials_resolved = True when credential resolution fails, so a future request() call can retry.
|
|
||
| # Module-level guard so the patch is applied at most once per process. | ||
| _patch_applied = False | ||
|
|
There was a problem hiding this comment.
Thread safety: module-level _patch_applied guard is racy.
The _patch_applied boolean is read/written without synchronization. Two threads calling apply_pip_system_certs_compatibility_patch() concurrently could both see _patch_applied = False and both execute the patch body. The patch itself is idempotent (rebinding an attribute twice is harmless), so this is not a correctness bug, but it is inconsistent with the careful double-checked locking used in _ensure_initialized().
Since the caller (_ensure_initialized) already holds _credentials_lock before calling this, the function is effectively single-threaded today. Worth a brief comment documenting that caller-side serialization assumption so future callers outside the lock do not introduce a real race.
|
|
||
| This runs after sitecustomize has fully completed (i.e., after any ``.pth`` | ||
| based ``ssl.SSLContext`` injection from packages such as ``pip_system_certs``), | ||
| which is the only point at which we can safely re-align stale ``SSLContext`` |
There was a problem hiding this comment.
Nit (future-proofing): double-checked locking and free-threaded Python.
This is a textbook double-checked locking pattern and works correctly in CPython due to the GIL. However, Python 3.13+ has experimental free-threaded builds (PEP 703) where reading _credentials_resolved outside the lock is a data race. Not blocking for now, but worth noting if this project plans to support --disable-gil in the future.
| _logger.error("Failed to sign request: %s", signing_error) | ||
| else: | ||
| _logger.error("Failed to load AWS Credentials: %s") | ||
| _logger.error("Failed to load AWS Credentials") |
There was a problem hiding this comment.
Good fix. The original _logger.error with a %s format specifier but no corresponding argument would produce a malformed log message (or raise in strict logging configurations). Removing the orphaned %s is correct.
| def setUp(self) -> None: | ||
| # Reset the module-level guard before every test so each test exercises the | ||
| # full code path. | ||
| _pip_system_certs_patches._patch_applied = False |
There was a problem hiding this comment.
Test isolation concern: directly mutating module-level state.
Directly assigning _pip_system_certs_patches._patch_applied = False in setUp/tearDown works but is fragile - if a test raises between setUp and tearDown, the global state leaks to subsequent tests. Consider using unittest.mock.patch.object(_pip_system_certs_patches, '_patch_applied', False) as a class decorator or via self.addCleanup(), which guarantees cleanup even on unhandled exceptions.
| # Only apply the patch when pip_system_certs is installed in user application space. | ||
| if not _is_pip_system_certs_installed(): | ||
| _patch_applied = True | ||
| return |
There was a problem hiding this comment.
Consideration: coverage of other modules that capture ssl.SSLContext at import time.
This patch rebinds botocore.httpsession.SSLContext and urllib3.util.ssl_.SSLContext. If other dependencies in ADOT's transitive closure also capture ssl.SSLContext at import time during sitecustomize (e.g., aiohttp, httpx), they could hit the same recursion. Not necessarily something to fix now (YAGNI), but might be worth a brief note in the docstring that the patch covers the known affected modules and may need extension if other libraries are added to the auto-instrumentation.
| # (re-running it produces the same final state), so a benign race between two | ||
| # threads where both observe ``_patch_applied is False`` and both run the rebind | ||
| # costs an extra dict assignment and nothing more. We don't pay for a lock here. | ||
| _patch_applied = False |
There was a problem hiding this comment.
Bug (race condition consideration for free-threaded Python): The comment in _ensure_initialized correctly notes that the plain-bool guard is only safe under GIL, but this module-level _patch_applied flag has the same issue. On CPython with the GIL this is fine, but the comment in aws_auth_session.py explicitly calls out free-threaded Python 3.13t+ as a future concern.
Even under the GIL, if two AwsAuthSession instances (from different exporters) race here, both may observe _patch_applied is False and both will run the patch body. The comment acknowledges this (benign race, costs an extra dict assignment), but the _is_pip_system_certs_installed() call does I/O (package metadata lookup via importlib.metadata.version()) which is more expensive than a dict assignment. Consider using a threading.Lock here too if you expect multiple exporters to initialize concurrently, or at minimum acknowledge the I/O cost in the comment.
| Note: the read of ``_credentials_resolved`` outside the lock is safe because | ||
| Python's GIL makes attribute reads/writes atomic. On free-threaded Python | ||
| builds (3.13t+) this would need a memory barrier; revisit if/when we | ||
| support those. |
There was a problem hiding this comment.
Bug (thread-safety of _credentials read): In the request() method, after _ensure_initialized() returns, the code reads self._credentials without holding the lock. Consider this race:
- Thread A calls
_ensure_initialized(), enters the lock, credential resolution fails, setsself._credentials = None, returns without latching_credentials_resolved. - Thread B calls
_ensure_initialized(), enters the lock, credential resolution succeeds, setsself._credentials = <valid>, latches_credentials_resolved = True. - Thread A (which already returned from
_ensure_initializedbefore Thread B entered) now readsself._credentialsinrequest(). Under GIL the attribute read is atomic, but if Thread B hasn't run its assignment yet (B is still waiting to acquire the lock while A already returned), A seesNoneand skips signing for that request.
This is a narrow window but possible when multiple threads share one AwsAuthSession instance. Consider returning the credentials from _ensure_initialized() directly, or having request() re-check self._credentials after confirming _credentials_resolved is True.
| _logger.warning("Failed to apply pip_system_certs compatibility patch: %s", patch_error) | ||
|
|
||
| try: | ||
| self._credentials = self._session.get_credentials() |
There was a problem hiding this comment.
Note (double-checked locking ordering dependency): For correctness of the double-checked locking pattern, the ordering matters: self._credentials must be assigned before self._credentials_resolved = True (which it is). In CPython with GIL, this is guaranteed because both assignments are in the same thread within the lock, and the GIL prevents another thread from seeing a partial state.
However, if this code is ever refactored (e.g., the assignment order is swapped, or a future maintainer moves code around), the outer reader could see _credentials_resolved = True but _credentials still as None. A brief inline comment like # Order matters: set credentials before latching the flag above line 99 (self._credentials_resolved = True) would help prevent this subtle regression.
| return | ||
|
|
||
| # pylint: disable=import-outside-toplevel | ||
| import ssl |
There was a problem hiding this comment.
Security consideration: This rebinds botocore.httpsession.SSLContext and urllib3.util.ssl_.SSLContext to whatever ssl.SSLContext currently points to (truststore's wrapper). This is the correct fix for the recursion bug.
Worth noting: truststore delegates certificate verification to the OS trust store rather than certifi's bundled CA certificates. This is the intended behavior of pip_system_certs, and the patch is only active when that package is installed, so this is what the customer opted into. However, it means ADOT's TLS connections will now use the system trust store in this scenario. If a customer has a misconfigured system trust store (e.g., missing intermediate CAs), connections that previously worked with certifi could fail after this patch runs. Consider mentioning this behavior in the CHANGELOG entry or docs for clarity.
| @patch("requests.Session.request", return_value=requests.Response()) | ||
| @patch("botocore.session.Session.get_credentials", return_value=mock_credentials) | ||
| def test_signing_failure_does_not_break_request(self, _, __): | ||
| """If SigV4 signing itself raises, ``request()`` still issues the |
There was a problem hiding this comment.
Code quality: Looking at the implementation, apply_pip_system_certs_compatibility_patch() is called on every entry into the lock (i.e., on each retry when credentials fail transiently), not just the first time. Since the patch has its own module-level _patch_applied guard this is safe but slightly wasteful -- each retry still pays for the function call overhead and the boolean check.
Consider either:
- Adding a separate instance-level
_patch_attemptedflag so the SSL patch is only called once perAwsAuthSessioninstance regardless of credential retry count, or - Moving
apply_pip_system_certs_compatibility_patch()outside/before the double-checked locking block entirely (it has its own module-level guard and is process-wide, so it does not need the instance lock).
| @@ -0,0 +1,145 @@ | |||
| # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. | |||
There was a problem hiding this comment.
Potential issue (missing __init__.py): The tests/amazon/opentelemetry/distro/patches/ directory (and the source src/amazon/opentelemetry/distro/patches/ directory) do not have __init__.py files. Looking at the existing project structure, the other test directories under tests/ also lack __init__.py and the project appears to rely on implicit namespace packages with hatchling.
However, looking at the source tree, some sub-packages like instrumentation/ and code_correlation/ DO have explicit __init__.py files. Since the patches directory already has other files that are successfully imported (e.g., _botocore_patches.py), this is likely fine. But please verify that pytest discovers this new test file correctly in CI -- namespace package discovery can be fragile when mixing explicit and implicit packages in the same tree.
| """Is the pip_system_certs package installed?""" | ||
| try: | ||
| dist_version = version("pip_system_certs") | ||
| _logger.debug("pip_system_certs is installed: %s", dist_version) |
There was a problem hiding this comment.
Code quality (minor): version("pip_system_certs") reads package metadata from disk (site-packages *.dist-info/METADATA). It runs on the first request() call, which is the OTLP export hot path. Since the module-level _patch_applied guard ensures this only runs once per process, the one-time cost is acceptable. Just flagging for awareness that this adds a small latency spike to the very first export (~1-5ms for metadata lookup depending on filesystem).
There was a problem hiding this comment.
Code Review Summary for PR 749
-
BUG - Double-logging on credential failure (aws_auth_session.py line 127): When get_credentials() raises inside _ensure_initialized(), the error is already logged with the exception detail. Then in request(), the else branch logs 'Failed to load AWS Credentials' again without the exception. This produces duplicate error logs per failed request. Consider removing the else log or downgrading to debug level.
-
CORRECTNESS - Double-checked locking on free-threaded Python 3.13t+ (aws_auth_session.py line 65): Python 3.13 is in pyproject.toml classifiers. On free-threaded builds, the unsynchronized read of self._credentials_resolved and self._credentials in request() could produce torn reads. Consider always acquiring the lock (negligible cost when uncontended) or file a tracking issue.
-
THREAD-SAFETY NOTE - Module-level guard (_pip_system_certs_patches.py line 14): The benign race is acceptable because the patch is idempotent and the caller serializes via per-instance lock. Document this assumption in case the function is called from other sites in the future.
-
NIT - Future-proofing ssl rebinds (pip_system_certs_patches.py line 68): patching urllib3.util.ssl.SSLContext should be sufficient since requests delegates to urllib3. If recursion appears from other call sites, this is where to add rebindings.
-
TEST ISOLATION (test_pip_system_certs_patches.py line 15): Directly mutating module-level _patch_applied works for sequential unittest but would break under pytest-xdist parallel execution.
| self._credentials_resolved = True | ||
|
|
||
| def request(self, method, url, *args, data=None, headers=None, **kwargs): | ||
| credentials = self._session.get_credentials() |
There was a problem hiding this comment.
Bug: Double-logging on credential failure. When get_credentials() raises inside _ensure_initialized(), the error is already logged with the exception detail via _logger.error. Then here in request(), this else branch fires again and logs 'Failed to load AWS Credentials' without the exception. This produces duplicate error logs per failed request. Consider removing this else log or downgrading to _logger.debug.
| except Exception as signing_error: # pylint: disable=broad-except | ||
| _logger.error("Failed to sign request: %s", signing_error) | ||
| else: | ||
| _logger.error("Failed to load AWS Credentials: %s") |
There was a problem hiding this comment.
Bug: Double-logging on credential failure. When get_credentials() raises inside _ensure_initialized(), the error is already logged with the exception detail via _logger.error. Then here in request(), this else branch fires again and logs 'Failed to load AWS Credentials' without the exception. This produces duplicate error logs per failed request. Consider removing this else log or downgrading to _logger.debug.
|
|
||
| Note: the read of ``_credentials_resolved`` outside the lock is safe because | ||
| Python's GIL makes attribute reads/writes atomic. On free-threaded Python | ||
| builds (3.13t+) this would need a memory barrier; revisit if/when we |
There was a problem hiding this comment.
Correctness concern: double-checked locking on free-threaded Python 3.13t+. The docstring acknowledges this, which is good. However Python 3.13 is already in pyproject.toml classifiers. If a user runs on a free-threaded 3.13t build, the unsynchronized read of self._credentials_resolved (here, outside the lock) and self._credentials in request() could produce torn reads or stale values. Consider always acquiring the lock (it is uncontended after the first successful call, so performance cost is negligible on CPython) or file a tracking issue for free-threaded build support.
| # threads where both observe ``_patch_applied is False`` and both run the rebind | ||
| # costs an extra dict assignment and nothing more. We don't pay for a lock here. | ||
| _patch_applied = False | ||
|
|
There was a problem hiding this comment.
Thread-safety note on the module-level guard: The comment correctly states a benign race is acceptable because the patch body is idempotent. Since the caller (_ensure_initialized) already serializes via its own per-instance lock, this race can only happen if apply_pip_system_certs_compatibility_patch is called from a different call site in the future. Worth documenting the assumption that the caller currently provides serialization.
|
|
||
| # pylint: disable=import-outside-toplevel | ||
| import ssl | ||
|
|
There was a problem hiding this comment.
Nit/future-proofing: Consider whether other libraries in the dependency tree might also hold stale ssl.SSLContext references (e.g., requests.adapters in some versions). Since requests delegates to urllib3, patching urllib3.util.ssl_.SSLContext should be sufficient for new SSL context creations in the current stack. But if this recursion is reported from a different call site in the future, this module is where to add additional rebindings.
| # Reset the module-level guard before every test so each test exercises the | ||
| # full code path. | ||
| _pip_system_certs_patches._patch_applied = False | ||
|
|
There was a problem hiding this comment.
Test isolation concern: directly mutating module-level _patch_applied works for sequential unittest execution. However, if tests ever run in parallel (e.g., with pytest-xdist), they will interfere with each other via the shared module-level state. Worth noting if the test runner strategy changes in the future.
… Python 3.12
Issue #, if available:
https://t.corp.amazon.com/V2191746236
Description of changes:
The problem: When a customer used ADOT Python to send telemetry to CloudWatch, their app crashed and no telemetry showed up. The crash was a RecursionError — Python ran out of stack frames trying to set up an SSL connection.
Why it happened: Three things had to be true at the same time:
pip_system_certs swaps out Python's default SSL class with its own at startup, but it does this late — after ADOT has already loaded botocore and urllib3. By that point, those libraries had already grabbed a reference to the old SSL class. Python 3.12 has a quirk where its SSL options-setter looks up the current SSL class every time it runs, which now resolves back to the new class, which calls back into the old setter, which calls itself, and so on — infinite loop, crash.
ADOT made it worse because the OTLP exporter signs every export with AWS SigV4, which builds a fresh SSL context every 5 seconds — guaranteeing the crash on every export.
Our fix: On the first export, we point botocore and urllib3 back at the current SSL class (the one pip_system_certs injected). The new class is well-behaved and doesn't have the recursive setter, so SSL works again. We also cache the AWS credentials so we only build one SSL context per exporter instead of one per export.
Result: Customer keeps pip_system_certs (they need it for corporate CA trust), keeps ADOT (they need observability), and nothing crashes anymore.
Latest fix:
Tests:
The fix was validated against the customer's exact failure conditions using a reproduction harness in pip-system-certs-repro/. Two ephemeral Python 3.12 venvs were built, identical except for the version of aws-opentelemetry-distro: one pinned to the published 0.10.0 release that the customer hit, the other installed in editable mode from this branch. Both venvs install pip_system_certs==5.3, botocore==1.42.96, truststore, requests, and urllib3, mirroring the ticket's environment. The repro script then deterministically recreates the post-injection module state described in prashant's worklog: it captures the original C-level ssl.SSLContext, runs truststore.inject_into_ssl() to mimic pip_system_certs's deferred .pth injection, and forces botocore.httpsession.SSLContext and urllib3.util.ssl_.SSLContext to hold the original captured reference — exactly the AL2023 capture-before-replace race that ADOT's eager imports during sitecustomize lose to. With that state in place, the script constructs an AwsAuthSession and calls request() against a stubbed-out HTTP adapter — the same SigV4 code path the OTLP AWS exporter exercises every 5 seconds in production.
Old case: on the 0.10.0 venv this reliably raises RecursionError, matching the stack trace from CloudWatch in the ticket.
New case: on the branch venv the same call completes without recursion because apply_pip_system_certs_compatibility_patch() rebinds the stale references to truststore's wrapper before the credential resolver constructs an SSL context.
Thus, the fix works because the test exercises the actual AwsAuthSession.request() code path against real botocore and urllib3 instances at the real Python 3.12 stdlib ssl.SSLContext.options.set setter — the bug fires in the same code, on the same Python version, under the same module-state preconditions as in the customer's container, and we observe it disappear when the patch is applied. The harness also gives us fast iteration: any edit under aws-opentelemetry-distro/src/... is picked up by the next run.sh with no rebuild.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.