Skip to content

fix(reports): reliability fixes for alert/report execution#41177

Open
sadpandajoe wants to merge 2 commits into
masterfrom
alert-report-fixes
Open

fix(reports): reliability fixes for alert/report execution#41177
sadpandajoe wants to merge 2 commits into
masterfrom
alert-report-fixes

Conversation

@sadpandajoe

Copy link
Copy Markdown
Member

SUMMARY

Three independent, low-risk reliability fixes in the alert/report subsystem. No DB migration, no API change, no UI change.

  1. Webhook retries could stall workers. WebhookNotification.send retried via backoff with no wall-clock bound, so a hanging or persistently-failing target could tie up a worker for minutes per bad URL (up to ~5 socket waits at timeout=60 plus retry sleeps), starving sequential report dispatch. Added max_time=120 to the decorator. Retry counts (factor/base/max_tries) are unchanged, so legitimately-transient 5xx targets are still retried; max_time only caps total wall-clock and is checked between attempts, so the final in-flight request still gets its full timeout.

  2. Opaque failure when the executor user is missing. If the configured executor cannot be resolved (security_manager.find_user returns None), report execution previously failed later with an unclear NoneType error. The content-generation paths (_get_screenshots / _get_csv_data / _get_embedded_data) now raise a dedicated ReportScheduleExecutorNotFoundError. The guard sits at the content sites so it raises inside the state machine's error envelope — the ERROR execution-log row and the owner error notification are still produced. The run() boundary continues to delegate to the state machine (a missing user is tolerated there, matching prior behavior), so operator visibility is unchanged.

  3. Slack v2 migration left recipients half-migrated on failure. update_report_schedule_slack_v2 reverted only the loop variable on error (leaving earlier-mutated recipients changed) and raised UnboundLocalError when the failure occurred before the loop bound a recipient. It now snapshots and reverts every recipient it mutated, and no longer crashes on a pre-loop failure.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A — backend-only change.

TESTING INSTRUCTIONS

Unit tests added/updated:

  • tests/unit_tests/reports/notifications/webhook_tests.py
  • tests/unit_tests/commands/report/execute_test.py
pytest tests/unit_tests/reports/notifications/webhook_tests.py tests/unit_tests/commands/report/execute_test.py

All pass. Each fix has at least one test that fails when the fix is reverted.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

🤖 Generated with Claude Code

sadpandajoe and others added 2 commits June 17, 2026 15:53
Webhook notification retries used backoff without a max_time bound, so a
hanging or persistently-failing target could stall a worker for minutes
per bad URL (up to ~5 socket waits at timeout=60 plus retry sleeps),
starving sequential report dispatch.

Add max_time=120 to the backoff decorator on WebhookNotification.send.
factor/base/max_tries are unchanged, so legitimately-transient 5xx
targets are still retried; max_time only caps total wall-clock so a bad
target cannot monopolize a worker. max_time is checked between attempts,
so the final in-flight request still runs its full timeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ally

Two robustness fixes in the alert/report execution command:

1. Missing executor user: when the configured executor cannot be resolved
   (security_manager.find_user returns None), the content-generation sites
   (_get_screenshots / _get_csv_data / _get_embedded_data) now raise a
   dedicated ReportScheduleExecutorNotFoundError instead of failing later
   with an opaque NoneType error. The guard lives at the content sites so
   it raises inside the state machine's error envelope, preserving the
   ERROR execution-log row and the owner error notification. The run()
   boundary continues to delegate to the state machine (a missing user is
   tolerated there, matching prior behavior) so visibility is unchanged.

2. Slack v2 migration revert: update_report_schedule_slack_v2 now snapshots
   every recipient it mutates and reverts all of them on failure, instead
   of only the loop variable, and no longer raises UnboundLocalError when
   the failure occurs before the loop binds a recipient.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@netlify

netlify Bot commented Jun 17, 2026

Copy link
Copy Markdown

Deploy Preview for superset-docs-preview ready!

Name Link
🔨 Latest commit 96477b3
🔍 Latest deploy log https://app.netlify.com/projects/superset-docs-preview/deploys/6a3325f06332680008a09ca7
😎 Deploy Preview https://deploy-preview-41177--superset-docs-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

@codecov

codecov Bot commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.32%. Comparing base (c1b5d05) to head (96477b3).
⚠️ Report is 85 commits behind head on master.

Files with missing lines Patch % Lines
superset/commands/report/execute.py 86.66% 1 Missing and 1 partial ⚠️
superset/commands/report/exceptions.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #41177      +/-   ##
==========================================
+ Coverage   64.26%   64.32%   +0.05%     
==========================================
  Files        2660     2661       +1     
  Lines      144795   145582     +787     
  Branches    33384    33520     +136     
==========================================
+ Hits        93048    93640     +592     
- Misses      50096    50261     +165     
- Partials     1651     1681      +30     
Flag Coverage Δ
hive 39.50% <30.00%> (+0.14%) ⬆️
mysql 58.27% <85.00%> (+0.19%) ⬆️
postgres 58.34% <85.00%> (+0.20%) ⬆️
presto 41.17% <30.00%> (+0.24%) ⬆️
python 59.77% <85.00%> (+0.18%) ⬆️
sqlite 57.99% <85.00%> (+0.23%) ⬆️
unit 100.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves reliability of the alerts/reports execution pipeline by bounding webhook retry wall-clock time, producing a clearer error when the configured executor user is missing, and fixing Slack v2 recipient migration rollback behavior to avoid partially-mutated state.

Changes:

  • Cap webhook retry wall-clock time by adding max_time=120 to the backoff.on_exception decorator in WebhookNotification.send.
  • Add resolve_executor_user() and raise a dedicated ReportScheduleExecutorNotFoundError from content-generation paths when the executor user cannot be resolved.
  • Make update_report_schedule_slack_v2() snapshot and fully revert all mutated recipients on failure (and avoid pre-loop UnboundLocalError), with unit tests covering partial- and pre-iteration failures.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
superset/reports/notifications/webhook.py Adds a wall-clock bound (max_time=120) to webhook retries to prevent long worker stalls.
tests/unit_tests/reports/notifications/webhook_tests.py Adds deterministic tests validating max_time behavior by faking backoff._sync.datetime.
superset/commands/report/execute.py Introduces resolve_executor_user() and ensures Slack v2 migration reverts all recipients on error; uses the helper in content paths.
superset/commands/report/exceptions.py Adds ReportScheduleExecutorNotFoundError with an actionable message and 422 status.
tests/unit_tests/commands/report/execute_test.py Adds coverage for missing executor user and Slack v2 rollback behavior (partial failure, pre-iteration failure, and no-op case).

@sadpandajoe sadpandajoe marked this pull request as ready for review June 18, 2026 18:04
@dosubot dosubot Bot added the alert-reports Namespace | Anything related to the Alert & Reports feature label Jun 18, 2026
@sadpandajoe sadpandajoe requested a review from eschutho June 18, 2026 18:13
@bito-code-review

bito-code-review Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Code Review Agent Run #c29eca

Actionable Suggestions - 0
Additional Suggestions - 2
  • superset/commands/report/exceptions.py - 1
    • Missing docstring on new exception · Line 204-214
      The new `ReportScheduleExecutorNotFoundError` class lacks a docstring. Per BITO.md rule [12147], all newly introduced Python functions and classes should include inline docstrings documenting their purpose. This helps developers understand error semantics without reading implementation.
      Code suggestion
      --- a/superset/commands/report/exceptions.py
      +++ b/superset/commands/report/exceptions.py
       @@ -201,6 +201,8 @@ class ReportScheduleDataFrameFailedError(CommandException):
       
       
        class ReportScheduleExecutorNotFoundError(CommandException):
      +    """
      +    Raised when the configured report executor user cannot be resolved.
      +    """
            status = 422
       
            def __init__(self, username: str = "", exception: Optional[Exception] = None):
  • superset/reports/notifications/webhook.py - 1
    • Missing documentation for retry cap · Line 143-143
      Consider adding a brief statement in the **Retry Behavior** section of `docs/admin_docs/configuration/alerts-reports.mdx` (around line 163) that specifies the maximum total retry duration (120 seconds). This ensures operators understand the worker-stall prevention behavior without having to read source code.
Filtered by Review Rules

Bito filtered these suggestions based on rules created automatically for your feedback. Manage rules.

  • tests/unit_tests/commands/report/execute_test.py - 1
Review Details
  • Files reviewed - 5 · Commit Range: eb79c96..96477b3
    • superset/commands/report/exceptions.py
    • superset/commands/report/execute.py
    • superset/reports/notifications/webhook.py
    • tests/unit_tests/commands/report/execute_test.py
    • tests/unit_tests/reports/notifications/webhook_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

  • /review - Manually triggers a full AI review.

  • /pause - Pauses automatic reviews on this pull request.

  • /resume - Resumes automatic reviews.

  • /resolve - Marks all Bito-posted review comments as resolved.

  • /abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by Bito Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

alert-reports Namespace | Anything related to the Alert & Reports feature review:draft size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants