Add lazy loading for io counter to prevent crash on package import#45275
Add lazy loading for io counter to prevent crash on package import#45275tbequinor wants to merge 4 commits intoAzure:mainfrom
Conversation
|
Thank you for your contribution @tbequinor! We will review the pull request and get back to you soon. |
There was a problem hiding this comment.
Pull request overview
This PR addresses a startup/import crash in azure-monitor-opentelemetry-exporter caused by calling psutil.Process().io_counters() at module import time in restricted environments (e.g., container/App Service), by deferring initialization until the metric callback runs.
Changes:
- Make process I/O counter initialization lazy to avoid import-time
psutil.AccessDeniedfailures. - Disable process I/O counters when
io_countersis inaccessible/unavailable and return a safe zero observation. - Update/add unit tests to account for the new lazy-init/disable behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/_performance_counters/_manager.py |
Removes import-time I/O counter read and adds lazy initialization + disable-on-failure logic in _get_process_io. |
sdk/monitor/azure-monitor-opentelemetry-exporter/tests/performance_counters/test_performance_counters.py |
Resets new global state and adds tests for AccessDenied + disabled I/O behavior. |
| # Lazily initialize counters to avoid import-time failures in restricted environments. | ||
| if not _IO_INITIALIZED: | ||
| _IO_INITIALIZED = True | ||
| if _IO_AVAILABLE: | ||
| io_counters_func = getattr(_PROCESS, "io_counters", None) | ||
| if not callable(io_counters_func): | ||
| _IO_AVAILABLE = False | ||
| yield Observation(0, {}) | ||
| return | ||
| io_counters_initial = io_counters_func() | ||
| _IO_LAST_COUNT = int(getattr(io_counters_initial, "read_bytes", 0)) + int( | ||
| getattr(io_counters_initial, "write_bytes", 0) | ||
| ) | ||
| _IO_LAST_TIME = datetime.now() | ||
|
|
||
| if not _IO_AVAILABLE: | ||
| yield Observation(0, {}) | ||
| return | ||
|
|
||
| io_counters_func = getattr(_PROCESS, "io_counters", None) | ||
| if not callable(io_counters_func): | ||
| _IO_AVAILABLE = False | ||
| yield Observation(0, {}) | ||
| return | ||
|
|
||
| io_counters = io_counters_func() | ||
| rw_count = int(getattr(io_counters, "read_bytes", 0)) + int(getattr(io_counters, "write_bytes", 0)) |
There was a problem hiding this comment.
When _IO_INITIALIZED is False, the callback initializes _IO_LAST_COUNT/_IO_LAST_TIME using an io_counters() call, but then continues in the same invocation to call io_counters() again and compute a rate over a near-zero elapsed time. This can create a large spike on the first observed value and does an extra syscall. Consider treating the first call as a baseline: after setting _IO_LAST_COUNT/_IO_LAST_TIME, yield Observation(0, {}) (or return) so rate calculation starts on the next collection interval.
| @mock.patch("azure.monitor.opentelemetry.exporter._performance_counters._manager._PROCESS") | ||
| def test_get_process_io_access_denied_disables_io(self, mock_process): | ||
| """Test process I/O retrieval disables I/O counters when access is denied.""" | ||
| mock_process.io_counters.side_effect = psutil.AccessDenied(1) | ||
|
|
||
| import azure.monitor.opentelemetry.exporter._performance_counters._manager as manager_module | ||
|
|
||
| manager_module._IO_AVAILABLE = True | ||
| manager_module._IO_INITIALIZED = False | ||
|
|
||
| result = list(_get_process_io(None)) | ||
|
|
||
| self.assertEqual(len(result), 1) | ||
| self.assertEqual(result[0].value, 0) | ||
| self.assertFalse(manager_module._IO_AVAILABLE) | ||
| mock_process.io_counters.assert_called_once_with() | ||
|
|
||
| @mock.patch("azure.monitor.opentelemetry.exporter._performance_counters._manager._PROCESS") | ||
| def test_get_process_io_disabled_does_not_call_io_counters(self, mock_process): | ||
| """Test process I/O retrieval does not call io_counters when unavailable.""" | ||
| import azure.monitor.opentelemetry.exporter._performance_counters._manager as manager_module | ||
|
|
||
| manager_module._IO_AVAILABLE = False | ||
| manager_module._IO_INITIALIZED = True | ||
|
|
||
| result = list(_get_process_io(None)) | ||
|
|
||
| self.assertEqual(len(result), 1) | ||
| self.assertEqual(result[0].value, 0) | ||
| mock_process.io_counters.assert_not_called() |
There was a problem hiding this comment.
The new tests cover AccessDenied and the disabled path, but there’s no coverage for the new lazy-initialization behavior on the first _get_process_io() invocation (when _IO_INITIALIZED is False). Adding a test for that path would catch issues like first-sample spikes / double io_counters() calls, and would directly validate the regression described in the PR (no import-time crash / baseline initialization behavior).
| except (psutil.NoSuchProcess, psutil.AccessDenied, Exception) as e: # pylint: disable=broad-except | ||
| _logger.exception("Error getting process I/O rate: %s", e) | ||
| _IO_AVAILABLE = False | ||
| _logger.debug("Disabling process I/O counter due to inaccessible io_counters: %s", e) | ||
| yield Observation(0, {}) |
There was a problem hiding this comment.
The except (psutil.NoSuchProcess, psutil.AccessDenied, Exception) block effectively catches all exceptions (because Exception is included) and permanently sets _IO_AVAILABLE = False. This means a transient/unknown failure will disable the metric for the lifetime of the process and may hide real bugs. Consider only disabling on expected non-recoverable cases (e.g., AccessDenied, missing/non-callable io_counters, NotImplementedError), and for unexpected exceptions keep logging (ideally with stack trace) without flipping _IO_AVAILABLE to False.
|
@microsoft-github-policy-service agree company="equinor" |
Description
When I bumped azure-monitor-opentelemetry-exporter from 1.0.0b36 to 1.0.0b48 in my app, it failed to start. I am running a debian docker container in Azure App Service. With Supervisor and uwsgi. Here is parts of the stack trace:
...
File "/usr/src/app/.venv/lib/python3.13/site-packages/azure/monitor/opentelemetry/exporter/_performance_counters/init.py", line 4, in
from azure.monitor.opentelemetry.exporter._performance_counters._manager import (
enable_performance_counters,
)
File "/usr/src/app/.venv/lib/python3.13/site-packages/azure/monitor/opentelemetry/exporter/_performance_counters/_manager.py", line 52, in
_io_counters_initial = _PROCESS.io_counters()
File "/usr/src/app/.venv/lib/python3.13/site-packages/psutil/init.py", line 830, in io_counters
return self._proc.io_counters()
~~~~~~~~~~~~~~~~~~~~~~^^
File "/usr/src/app/.venv/lib/python3.13/site-packages/psutil/_pslinux.py", line 1718, in wrapper
raise AccessDenied(self.pid, self._name)
psutil.AccessDenied: (pid=41)
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. going in full dynamic mode ***
--- no python application found, check your startup logs for errors ---
Added lazy loading, so it won't do this on import, and more robust handling for when psutil don't have the required access.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines