
[26.0] Add debug middleware and regression tests for blocked main event loop #22207

Draft
mvdbeek wants to merge 11 commits into galaxyproject:release_26.0 from mvdbeek:debug_event_loop_and_add_testing

Conversation

@mvdbeek
Member

@mvdbeek mvdbeek commented Mar 21, 2026

Add middleware that monitors the asyncio event loop for blocking calls:

  • EventLoopWatchdog: daemon thread that probes the event loop and logs
    stack traces (at ERROR level, sent to Sentry) when it's unresponsive
  • EventLoopWatchdogMiddleware: ASGI middleware emitting per-request
    Server-Timing headers with event loop lag and nginx queue time
  • Enabled via galaxy.yml: event_loop_watchdog_threshold: <seconds>
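To make the watchdog idea concrete, here is a minimal sketch of a daemon thread that probes an asyncio loop and captures the loop thread's stack when the probe is not answered in time. It is not the PR's implementation: the class name `LoopWatchdog`, its parameters, and the `reports` list are assumptions made for illustration, and only standard-library calls are used.

```python
import asyncio
import logging
import sys
import threading
import time
import traceback

log = logging.getLogger(__name__)


class LoopWatchdog(threading.Thread):
    """Hypothetical watchdog: probes `loop` from a daemon thread."""

    def __init__(self, loop, loop_thread_id, threshold, interval=0.01):
        super().__init__(daemon=True)
        self.loop = loop
        self.loop_thread_id = loop_thread_id
        self.threshold = threshold  # seconds the loop may take to answer a probe
        self.interval = interval    # pause between probes
        self.reports = []           # stacks captured while the loop was unresponsive
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            responded = threading.Event()
            try:
                # A healthy loop runs this trivial callback almost immediately.
                self.loop.call_soon_threadsafe(responded.set)
            except RuntimeError:
                return  # the loop has been closed
            if not responded.wait(self.threshold):
                # No answer in time: snapshot what the loop thread is executing.
                frame = sys._current_frames().get(self.loop_thread_id)
                if frame is not None:
                    stack = "".join(traceback.format_stack(frame))
                    self.reports.append(stack)
                    log.error("Event loop unresponsive for > %.3fs:\n%s", self.threshold, stack)
                responded.wait(1.0)  # let the loop recover before probing again
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()


async def demo():
    watchdog = LoopWatchdog(asyncio.get_running_loop(), threading.get_ident(), threshold=0.05)
    watchdog.start()
    await asyncio.sleep(0.1)  # idle but responsive
    time.sleep(0.3)           # synchronous call that blocks the loop
    await asyncio.sleep(0.1)
    watchdog.stop()
    await asyncio.to_thread(watchdog.join)
    return watchdog.reports
```

Running `asyncio.run(demo())` should yield at least one captured stack that passes through `demo`, since the probe scheduled during `time.sleep(0.3)` cannot be answered within the 50 ms threshold.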

Add /api/debug/block and /api/debug/ok diagnostic endpoints to
simulate and observe event loop blocking.
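The difference between a blocking and a well-behaved handler can be illustrated without any web framework. The sketch below is an assumption-laden stand-in for the two endpoints: `debug_block`, `debug_ok`, and `max_heartbeat_gap` are hypothetical names, and the handlers are plain coroutines rather than real routes.

```python
import asyncio
import time


async def debug_block(seconds):
    # Deliberately wrong: a synchronous sleep inside a coroutine stalls the whole loop.
    time.sleep(seconds)
    return "blocked"


async def debug_ok(seconds):
    # Cooperative: yields control to the loop while waiting.
    await asyncio.sleep(seconds)
    return "ok"


async def max_heartbeat_gap(handler, seconds):
    """Run `handler` next to a 5 ms heartbeat; return the longest gap between beats."""
    gaps = []

    async def heartbeat():
        last = time.monotonic()
        while True:
            await asyncio.sleep(0.005)
            now = time.monotonic()
            gaps.append(now - last)
            last = now

    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(0.02)  # let the heartbeat start
    await handler(seconds)
    await asyncio.sleep(0.02)  # let the heartbeat observe the aftermath
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(gaps)
```

With `debug_block` the heartbeat gap grows to roughly the requested block duration, while with `debug_ok` it stays close to the 5 ms heartbeat interval.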

Add test infrastructure to fail API tests on event loop blocking:

  • ApiTestInteractor checks Server-Timing header on every response
  • Test server auto-enables watchdog when env var is set
  • Enable in CI: GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD=0.05
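A check of this kind on the test side might look like the sketch below. The metric name `loop_lag`, the function names, and the choice of milliseconds for `dur` (the usual Server-Timing convention) are assumptions; the PR text does not spell out the header layout.

```python
def parse_server_timing(header):
    """Parse 'a;dur=12.5, b;dur=3' into {'a': 12.5, 'b': 3.0} (values in milliseconds)."""
    metrics = {}
    for entry in header.split(","):
        name, *params = [part.strip() for part in entry.split(";")]
        for param in params:
            key, _, value = param.partition("=")
            if key.strip().lower() == "dur":
                metrics[name] = float(value.strip().strip('"'))
    return metrics


def check_event_loop_lag(headers, threshold_seconds, metric="loop_lag"):
    """Raise AssertionError if the reported lag exceeds the threshold."""
    header = headers.get("Server-Timing")
    if not header or threshold_seconds <= 0:
        return  # header absent or check disabled: nothing to verify
    lag_ms = parse_server_timing(header).get(metric)
    if lag_ms is not None and lag_ms / 1000.0 > threshold_seconds:
        raise AssertionError(
            f"Event loop was blocked for {lag_ms:.1f} ms "
            f"(threshold {threshold_seconds * 1000:.1f} ms)"
        )
```

For example, `check_event_loop_lag({"Server-Timing": "loop_lag;dur=120"}, 0.05)` raises, while a response without the header passes silently.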

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot changed the title from "Debug event loop and add testing" to "[26.0] Debug event loop and add testing" Mar 21, 2026
mvdbeek and others added 9 commits March 23, 2026 15:53

Add middleware that monitors the asyncio event loop for blocking calls:
- EventLoopWatchdog: daemon thread that probes the event loop and logs
  stack traces (at ERROR level, sent to Sentry) when it's unresponsive
- EventLoopWatchdogMiddleware: ASGI middleware emitting per-request
  Server-Timing headers with event loop lag and nginx queue time
- Enabled via galaxy.yml: event_loop_watchdog_threshold: <seconds>

Add /api/debug/block and /api/debug/ok diagnostic endpoints to
simulate and observe event loop blocking.

Add test infrastructure to fail API tests on event loop blocking:
- ApiTestInteractor checks Server-Timing header on every response
- Test server auto-enables watchdog when env var is set
- Enable in CI: GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD=0.05

Set GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD=0.05 (50ms) as the
default for all test runs via run_tests.sh. This causes every API
and integration test HTTP response to be checked for event loop
blocking via the Server-Timing header.

Override with GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD=0 to disable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
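Reading the threshold with "0 means disabled" semantics could be sketched as follows. Only the environment variable name comes from the commit message; the helper `blocking_threshold` and its return convention (`None` when disabled) are assumptions.

```python
import os

ENV_VAR = "GALAXY_TEST_EVENT_LOOP_BLOCKING_THRESHOLD"


def blocking_threshold(environ=os.environ):
    """Return the threshold in seconds, or None when the check is disabled."""
    raw = environ.get(ENV_VAR, "").strip()
    if not raw:
        return None  # variable unset: no check
    value = float(raw)
    return value if value > 0 else None  # 0 (or negative) disables the check
```

Passing the environment as a parameter keeps the helper easy to exercise without mutating `os.environ`.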
When the event loop thread is in the I/O selector (select/poll/epoll),
it is idle waiting for work, not blocked by a sync call. Filter these
out to avoid false positive reports.
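One way to recognise an idle loop is to look at the module of the innermost Python frame of the loop thread. The sketch below assumes a selector-based event loop (the default on Unix), where an idle loop is parked inside the standard-library `selectors` module; the function names are hypothetical.

```python
import asyncio
import sys
import threading
import time


def is_idle_in_selector(frame):
    """True if the innermost Python frame belongs to the stdlib `selectors` module."""
    return frame is not None and frame.f_globals.get("__name__") == "selectors"


def sample_idle_loop():
    """Start an idle event loop in a thread and classify its current frame."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    time.sleep(0.2)  # give the loop time to park in select/poll/epoll
    frame = sys._current_frames().get(thread.ident)
    idle = is_idle_in_selector(frame)
    loop.call_soon_threadsafe(loop.stop)
    thread.join()
    loop.close()
    return idle
```

A frame taken from ordinary application code is classified as not idle, so only genuinely busy stacks would be reported.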
Address review feedback: skip TestEventLoopBlocking when
GALAXY_TEST_EXTERNAL is set since external servers may not have
the watchdog enabled. Also clarify that _check_event_loop_lag is
a natural no-op when the Server-Timing header is absent.

Two sources of false positives in CI:

1. GIL contention from WSGI threads: when WSGI threads do heavy Python
   work (ORM hydration, serialization), they hold the GIL and delay the
   watchdog callback. The event loop is properly awaiting at a starlette/
   fastapi middleware boundary — not blocked by a sync call. Filter out
   stack traces whose innermost frame is in starlette, fastapi, anyio,
   or a2wsgi.

2. Test tearDown cleanup: UsesApiTestCaseMixin.tearDown() cancels running
   jobs via GET/DELETE requests, which go through _maybe_check_event_loop.
   Use low-level methods to bypass the check during cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
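The framework filter from point 1 can be approximated by inspecting the file path of the innermost frame. This is a sketch under assumptions: the package list comes from the commit message, but `is_framework_file`, `should_report`, and the path-component matching rule are illustrative choices, not the PR's code.

```python
from pathlib import PurePath

FRAMEWORK_PACKAGES = ("starlette", "fastapi", "anyio", "a2wsgi")


def is_framework_file(filename):
    """True if any path component is one of the framework package names."""
    return any(part in FRAMEWORK_PACKAGES for part in PurePath(filename).parts)


def should_report(stack_filenames):
    """Decide whether to report a stack, given filenames ordered outermost to innermost."""
    if not stack_filenames:
        return False
    # Only the innermost frame matters: if it is framework code, the loop is
    # awaiting at a middleware boundary and the delay is treated as GIL contention.
    return not is_framework_file(stack_filenames[-1])
```

The filenames could be taken from `traceback.extract_stack(frame)`, which lists frames in the same outermost-to-innermost order.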
The watchdog's peak_lag was being updated for ALL probe delays,
including GIL contention from WSGI threads and background tasks
where the event loop thread was properly awaiting at a framework
boundary.  This caused false test failures on CI.

Now _report_blocked (watchdog thread) confirms the block is real
by checking the stack trace, and sets _block_confirmed.  _on_response
(event loop thread, runs after _report_blocked) only updates peak_lag
when confirmed.  This ensures the Server-Timing header and test
assertions only reflect actual blocking in Galaxy application code.

Also bypass event loop lag checks in test tearDown cleanup.
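The hand-off between the two threads can be sketched as a small state holder. The names `_report_blocked`, `_on_response`, `_block_confirmed`, and `peak_lag` come from the commit message; the class `LagTracker`, the lock, the boolean argument, and resetting the flag after each response are assumptions.

```python
import threading


class LagTracker:
    """Hypothetical holder for the confirmed-block state shared by both threads."""

    def __init__(self):
        self.peak_lag = 0.0
        self._block_confirmed = False
        self._lock = threading.Lock()

    def _report_blocked(self, innermost_is_app_code):
        # Watchdog thread: the block counts only if the stack points at application code.
        with self._lock:
            self._block_confirmed = innermost_is_app_code

    def _on_response(self, lag):
        # Event loop thread: runs once the probe is finally answered.
        with self._lock:
            if self._block_confirmed:
                self.peak_lag = max(self.peak_lag, lag)
            self._block_confirmed = False  # each probe must be confirmed anew
```

Under this scheme a 0.4 s probe delay caused by GIL contention leaves `peak_lag` untouched, while a confirmed 0.2 s block raises it to 0.2.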
Prevents re-injecting the Server-Timing header if the response-start
message is sent more than once through the ASGI middleware chain,
which can happen with starlette's BaseHTTPMiddleware.
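A guard against double injection can be sketched as a plain ASGI middleware with a per-request flag. The class name, the header value `loop_lag;dur=0`, and the test application that sends the start message twice are assumptions for illustration only.

```python
import asyncio


class ServerTimingMiddleware:
    """Hypothetical ASGI middleware that adds Server-Timing at most once per response."""

    def __init__(self, app, value=b"loop_lag;dur=0"):
        self.app = app
        self.value = value

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        injected = False

        async def send_wrapper(message):
            nonlocal injected
            if message["type"] == "http.response.start" and not injected:
                injected = True
                # Copy instead of mutating, so a re-sent message is not altered twice.
                headers = [*message.get("headers", []), (b"server-timing", self.value)]
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_wrapper)


async def noisy_app(scope, receive, send):
    # Simulates a chain in which the response-start message passes through twice.
    for _ in range(2):
        await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def collect_messages():
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    await ServerTimingMiddleware(noisy_app)({"type": "http"}, receive, send)
    return sent
```

Only the first start message collected by `collect_messages()` carries the header; the second passes through unchanged.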
Remove the debug blocking endpoint from production code and move it
into an integration test where the routes are injected only into the
test server via init_fast_app. Add support for forwarding init_fast_app
from the test config object to launch_server.

@mvdbeek mvdbeek force-pushed the debug_event_loop_and_add_testing branch from 9003f8b to a0f3685 on March 23, 2026 15:12
@mvdbeek mvdbeek changed the title from "[26.0] Debug event loop and add testing" to "[26.0] Add debug middleware and regression tests for blocked main event loop" Mar 23, 2026
@mvdbeek
Member Author

mvdbeek commented Mar 24, 2026

This testing is really worth something! See encode/httpx#3707

@mvdbeek mvdbeek force-pushed the debug_event_loop_and_add_testing branch from 99215a2 to 1b9e4a1 on March 24, 2026 08:39