fix(reports): reliability fixes for alert/report execution#41177
fix(reports): reliability fixes for alert/report execution#41177sadpandajoe wants to merge 2 commits into
Conversation
Webhook notification retries used backoff without a max_time bound, so a hanging or persistently-failing target could stall a worker for minutes per bad URL (up to ~5 socket waits at timeout=60 plus retry sleeps), starving sequential report dispatch. Add max_time=120 to the backoff decorator on WebhookNotification.send. factor/base/max_tries are unchanged, so legitimately-transient 5xx targets are still retried; max_time only caps total wall-clock so a bad target cannot monopolize a worker. max_time is checked between attempts, so the final in-flight request still runs its full timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ally Two robustness fixes in the alert/report execution command: 1. Missing executor user: when the configured executor cannot be resolved (security_manager.find_user returns None), the content-generation sites (_get_screenshots / _get_csv_data / _get_embedded_data) now raise a dedicated ReportScheduleExecutorNotFoundError instead of failing later with an opaque NoneType error. The guard lives at the content sites so it raises inside the state machine's error envelope, preserving the ERROR execution-log row and the owner error notification. The run() boundary continues to delegate to the state machine (a missing user is tolerated there, matching prior behavior) so visibility is unchanged. 2. Slack v2 migration revert: update_report_schedule_slack_v2 now snapshots every recipient it mutates and reverts all of them on failure, instead of only the loop variable, and no longer raises UnboundLocalError when the failure occurs before the loop binds a recipient. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for superset-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #41177 +/- ##
==========================================
+ Coverage 64.26% 64.32% +0.05%
==========================================
Files 2660 2661 +1
Lines 144795 145582 +787
Branches 33384 33520 +136
==========================================
+ Hits 93048 93640 +592
- Misses 50096 50261 +165
- Partials 1651 1681 +30
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR improves reliability of the alerts/reports execution pipeline by bounding webhook retry wall-clock time, producing a clearer error when the configured executor user is missing, and fixing Slack v2 recipient migration rollback behavior to avoid partially-mutated state.
Changes:
- Cap webhook retry wall-clock time by adding
max_time=120to thebackoff.on_exceptiondecorator inWebhookNotification.send. - Add
resolve_executor_user()and raise a dedicatedReportScheduleExecutorNotFoundErrorfrom content-generation paths when the executor user cannot be resolved. - Make
update_report_schedule_slack_v2()snapshot and fully revert all mutated recipients on failure (and avoid pre-loopUnboundLocalError), with unit tests covering partial- and pre-iteration failures.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
superset/reports/notifications/webhook.py |
Adds a wall-clock bound (max_time=120) to webhook retries to prevent long worker stalls. |
tests/unit_tests/reports/notifications/webhook_tests.py |
Adds deterministic tests validating max_time behavior by faking backoff._sync.datetime. |
superset/commands/report/execute.py |
Introduces resolve_executor_user() and ensures Slack v2 migration reverts all recipients on error; uses the helper in content paths. |
superset/commands/report/exceptions.py |
Adds ReportScheduleExecutorNotFoundError with an actionable message and 422 status. |
tests/unit_tests/commands/report/execute_test.py |
Adds coverage for missing executor user and Slack v2 rollback behavior (partial failure, pre-iteration failure, and no-op case). |
Code Review Agent Run #c29ecaActionable Suggestions - 0Additional Suggestions - 2
Filtered by Review RulesBito filtered these suggestions based on rules created automatically for your feedback. Manage rules.
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
SUMMARY
Three independent, low-risk reliability fixes in the alert/report subsystem. No DB migration, no API change, no UI change.
Webhook retries could stall workers.
WebhookNotification.sendretried viabackoffwith no wall-clock bound, so a hanging or persistently-failing target could tie up a worker for minutes per bad URL (up to ~5 socket waits attimeout=60plus retry sleeps), starving sequential report dispatch. Addedmax_time=120to the decorator. Retry counts (factor/base/max_tries) are unchanged, so legitimately-transient 5xx targets are still retried;max_timeonly caps total wall-clock and is checked between attempts, so the final in-flight request still gets its full timeout.Opaque failure when the executor user is missing. If the configured executor cannot be resolved (
security_manager.find_userreturnsNone), report execution previously failed later with an unclearNoneTypeerror. The content-generation paths (_get_screenshots/_get_csv_data/_get_embedded_data) now raise a dedicatedReportScheduleExecutorNotFoundError. The guard sits at the content sites so it raises inside the state machine's error envelope — theERRORexecution-log row and the owner error notification are still produced. Therun()boundary continues to delegate to the state machine (a missing user is tolerated there, matching prior behavior), so operator visibility is unchanged.Slack v2 migration left recipients half-migrated on failure.
update_report_schedule_slack_v2reverted only the loop variable on error (leaving earlier-mutated recipients changed) and raisedUnboundLocalErrorwhen the failure occurred before the loop bound a recipient. It now snapshots and reverts every recipient it mutated, and no longer crashes on a pre-loop failure.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A — backend-only change.
TESTING INSTRUCTIONS
Unit tests added/updated:
tests/unit_tests/reports/notifications/webhook_tests.pytests/unit_tests/commands/report/execute_test.pyAll pass. Each fix has at least one test that fails when the fix is reverted.
ADDITIONAL INFORMATION
🤖 Generated with Claude Code