[26.0] Add debug middleware and regression tests for blocked main event loop #22207
Draft
mvdbeek wants to merge 11 commits into galaxyproject:release_26.0
Conversation
jmchilton
reviewed
Mar 21, 2026
Add middleware that monitors the asyncio event loop for blocking calls:

- EventLoopWatchdog: daemon thread that probes the event loop and logs stack traces (at ERROR level, sent to Sentry) when it's unresponsive
- EventLoopWatchdogMiddleware: ASGI middleware emitting per-request Server-Timing headers with event loop lag and nginx queue time
- Enabled via galaxy.yml: event_loop_watchdog_threshold: <seconds>

Add /api/debug/block and /api/debug/ok diagnostic endpoints to simulate and observe event loop blocking.

Add test infrastructure to fail API tests on event loop blocking:

- ApiTestInteractor checks Server-Timing header on every response
- Test server auto-enables watchdog when env var is set
- Enable in CI: GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD=0.05
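The probing idea can be illustrated with a minimal sketch. This is not the PR's `EventLoopWatchdog`; the class `WatchdogSketch`, its parameters, and the `demo` coroutine are hypothetical names chosen for illustration. It assumes the probe is a callback scheduled with `loop.call_soon_threadsafe` and that the stack trace comes from `sys._current_frames()`.

```python
import asyncio
import sys
import threading
import time
import traceback


class WatchdogSketch:
    """Hypothetical, simplified watchdog; not Galaxy's EventLoopWatchdog."""

    def __init__(self, loop, loop_thread_id, threshold, interval=0.01):
        self.loop = loop
        self.loop_thread_id = loop_thread_id
        self.threshold = threshold  # seconds without a reply before reporting
        self.interval = interval  # pause between probes
        self.reports = []  # a real watchdog would log these at ERROR level
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            responded = threading.Event()
            try:
                # Ask the loop to run a trivial callback; a healthy loop does so promptly.
                self.loop.call_soon_threadsafe(responded.set)
            except RuntimeError:
                return  # the loop was closed
            if not responded.wait(self.threshold) and not self._stop.is_set():
                # No reply in time: capture what the loop thread is executing.
                frame = sys._current_frames().get(self.loop_thread_id)
                if frame is not None:
                    self.reports.append("".join(traceback.format_stack(frame)))
                # Wait for the loop to recover so one stall yields one report.
                while not responded.wait(0.05) and not self._stop.is_set():
                    pass
            self._stop.wait(self.interval)


async def demo():
    loop = asyncio.get_running_loop()
    watchdog = WatchdogSketch(loop, threading.get_ident(), threshold=0.1)
    watchdog.start()
    await asyncio.sleep(0.2)  # healthy: the loop answers probes
    time.sleep(0.5)  # deliberate sync call that blocks the loop
    await asyncio.sleep(0.05)
    watchdog.stop()
    return watchdog.reports
```

Running `asyncio.run(demo())` should return at least one captured stack that passes through `demo`, because the blocking `time.sleep` call keeps the probe callback from running within the threshold.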
Set GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD=0.05 (50ms) as the default for all test runs via run_tests.sh. This causes every API and integration test HTTP response to be checked for event loop blocking via the Server-Timing header. Override with GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD=0 to disable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the event loop thread is in the I/O selector (select/poll/epoll), it is idle waiting for work, not blocked by a sync call. Filter these out to avoid false positive reports.
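A minimal sketch of such a filter, assuming a selector-based event loop whose idle wait happens in a `select` method defined in the standard library's `selectors.py`. The function name `is_idle_in_selector` is hypothetical and the PR's actual check may differ.

```python
import os


def is_idle_in_selector(frame):
    """Return True when the innermost frame is a select() call in selectors.py."""
    code = frame.f_code
    in_selectors = os.path.basename(code.co_filename) == "selectors.py"
    return in_selectors and code.co_name == "select"
```

A watchdog could call this on the event loop thread's frame before reporting and skip the report when it returns True.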
Address review feedback: skip TestEventLoopBlocking when GALAXY_TEST_EXTERNAL is set since external servers may not have the watchdog enabled. Also clarify that _check_event_loop_lag is a natural no-op when the Server-Timing header is absent.
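The no-op behavior can be sketched as follows. The metric name `event-loop-lag` and both function names are assumptions for illustration, since the header entry names are not stated here. The sketch relies on the Server-Timing convention that `dur` is expressed in milliseconds.

```python
def parse_server_timing(header):
    """Parse 'name;dur=12.5, other;dur=3' into {name: duration_in_ms}."""
    metrics = {}
    for entry in header.split(","):
        parts = [part.strip() for part in entry.split(";")]
        name = parts[0]
        if not name:
            continue
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "dur":
                try:
                    metrics[name] = float(value.strip().strip('"'))
                except ValueError:
                    pass  # ignore malformed durations
    return metrics


def check_event_loop_lag(headers, threshold_seconds, metric="event-loop-lag"):
    """Raise AssertionError when the reported lag exceeds the threshold.

    The metric name is a hypothetical placeholder. The check is a no-op
    when the header is absent or the threshold is 0 (disabled).
    """
    header = headers.get("Server-Timing")
    if not header or threshold_seconds <= 0:
        return
    lag_ms = parse_server_timing(header).get(metric)
    if lag_ms is not None and lag_ms / 1000.0 > threshold_seconds:
        raise AssertionError(
            f"event loop blocked for {lag_ms:.1f} ms (threshold {threshold_seconds} s)"
        )
```

Because a missing header returns early, the same check can run against external servers that do not enable the watchdog.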
Two sources of false positives in CI:

1. GIL contention from WSGI threads: when WSGI threads do heavy Python work (ORM hydration, serialization), they hold the GIL and delay the watchdog callback. The event loop is properly awaiting at a starlette/fastapi middleware boundary, not blocked by a sync call. Filter out stack traces whose innermost frame is in starlette, fastapi, anyio, or a2wsgi.
2. Test tearDown cleanup: UsesApiTestCaseMixin.tearDown() cancels running jobs via GET/DELETE which go through _maybe_check_event_loop. Use low-level methods to bypass the check during cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
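The innermost-frame filter from the first item might look like the sketch below. It assumes the owning package can be read from the frame's module `__name__`. The names `FRAMEWORK_PACKAGES` and `is_framework_boundary` are hypothetical, and the PR may match on file paths instead.

```python
# Packages named in the commit message; a stall whose innermost frame
# lives in one of them is treated as the loop awaiting, not blocking.
FRAMEWORK_PACKAGES = {"starlette", "fastapi", "anyio", "a2wsgi"}


def is_framework_boundary(frame):
    """Return True when the innermost frame belongs to a framework package."""
    module = frame.f_globals.get("__name__", "")
    return module.split(".", 1)[0] in FRAMEWORK_PACKAGES
```

Comparing only the top-level package name keeps a module such as `starlette_extras` from matching by prefix.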
The watchdog's peak_lag was being updated for ALL probe delays, including GIL contention from WSGI threads and background tasks where the event loop thread was properly awaiting at a framework boundary. This caused false test failures on CI.

Now _report_blocked (watchdog thread) confirms the block is real by checking the stack trace, and sets _block_confirmed. _on_response (event loop thread, runs after _report_blocked) only updates peak_lag when confirmed. This ensures the Server-Timing header and test assertions only reflect actual blocking in Galaxy application code.

Also bypass event loop lag checks in test tearDown cleanup.
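The confirmation handshake can be sketched roughly as below. The class `LagTracker` and its method signatures are hypothetical, and clearing the flag after each probe response is an assumption of this sketch. Only the roles of `_block_confirmed` and `peak_lag` come from the commit message.

```python
import threading


class LagTracker:
    """Hypothetical sketch: only confirmed blocks raise peak_lag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._block_confirmed = False
        self.peak_lag = 0.0

    def report_blocked(self, stack_shows_real_block):
        # Called from the watchdog thread after inspecting the stack trace.
        with self._lock:
            self._block_confirmed = stack_shows_real_block

    def on_response(self, lag):
        # Called on the event loop thread once the probe finally runs.
        with self._lock:
            confirmed = self._block_confirmed
            self._block_confirmed = False  # assumption: reset for the next probe
        if confirmed:
            self.peak_lag = max(self.peak_lag, lag)
        return confirmed
```

With this shape, a probe delayed by GIL contention still produces a lag value, but it never reaches `peak_lag` because the watchdog thread did not confirm it.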
Prevents re-injecting the Server-Timing header if the response-start message is sent more than once through the ASGI middleware chain, which can happen with starlette's BaseHTTPMiddleware.
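One way to guard against double injection is a per-request flag in the wrapped `send` callable, as in this minimal pure-ASGI sketch. The class name `ServerTimingOnce` and the placeholder header value are hypothetical, and the PR's middleware may implement the guard differently.

```python
class ServerTimingOnce:
    """Hypothetical ASGI middleware that adds Server-Timing at most once per request."""

    def __init__(self, app, value="event-loop-lag;dur=0"):
        self.app = app
        self.value = value  # placeholder header value for illustration

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        injected = False

        async def send_wrapper(message):
            nonlocal injected
            if message["type"] == "http.response.start" and not injected:
                injected = True
                headers = list(message.get("headers", []))
                headers.append((b"server-timing", self.value.encode("latin-1")))
                # Copy instead of mutating, so a re-sent message stays clean.
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_wrapper)
```

Copying the message rather than appending to its header list in place matters here: if the same dictionary is sent again, it does not already carry the injected header.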
Remove the debug blocking endpoint from production code and move it into an integration test where the routes are injected only into the test server via init_fast_app. Add support for forwarding init_fast_app from the test config object to launch_server.
9003f8b to a0f3685
This testing is really worth something!: encode/httpx#3707
99215a2 to 1b9e4a1