Skip to content

fix: prune deadline alerts when DAG run fails, not just on success#62918

Draft
YoannAbriel wants to merge 3 commits into
apache:mainfrom
YoannAbriel:fix/issue-60927
Draft

fix: prune deadline alerts when DAG run fails, not just on success#62918
YoannAbriel wants to merge 3 commits into
apache:mainfrom
YoannAbriel:fix/issue-60927

Conversation

@YoannAbriel

@YoannAbriel YoannAbriel commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

Problem

Deadline alerts fire even after a DAG run has failed. A user configured a 180-minute deadline, the DAG failed within 10 minutes, but the deadline alert still triggered 3 hours later.

Root Cause

DagRun.update_state() only calls Deadline.prune_deadlines() in the success path. When a DAG run fails (task failure or deadlock), pending deadlines are never cleaned up, so the scheduler still fires them later.

Fix

Moved deadline pruning from the success-only block to the shared terminal state block that runs for both SUCCESS and FAILED states. Added a test for the failure case.

Closes: #60927


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code

Generated-by: Claude Code following the guidelines


  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

Important

🛠️ Maintainer triage note for @YoannAbriel · by @potiuk · 2026-06-22 06:31 UTC

Helpful heads-up from the maintainers — please address before this PR can be reviewed (see the Pull Request quality criteria):

  • Unaddressed Copilot review — a Copilot review thread has been open for ≥7 days with no reply; please respond to or resolve it. See the Pull Request quality criteria.

The ball is in your court — you've been assigned to this PR. Fix the above, then mark it Ready for review.

Automated triage — may be imperfect; a maintainer takes the next look.

@YoannAbriel YoannAbriel requested review from XD-DENG and ashb as code owners March 5, 2026 08:09
@YoannAbriel YoannAbriel force-pushed the fix/issue-60927 branch 4 times, most recently from 46eb8e1 to 10f4217 Compare March 8, 2026 19:06
@potiuk potiuk requested a review from ferruzzi March 10, 2026 19:34
@ferruzzi

Copy link
Copy Markdown
Contributor

No, not quite the answer we need here. When you call prune_deadlines, it removes that deadline from the list and you'll never get alerted of the miss. In your example with the Dag that failed after 10 minutes, the Dag's end_date will now be before the deadline_time, so the deadline will be removed from the table, never to be called. This is the exact opposite of what should happen.

The right fix would be: when the DagRun transitions to FAILED, short-circuit by immediately calling handle_miss() on any pending (missed=False) deadlines for that DagRun, rather than waiting for deadline_time to elapse. That way when your Dag fails at 10 minutes, your callback fires immediately, informing the user.

You can keep the change you made, but you'll have to wrap the new location in a state check to only call prune on SUCCESS and also add a block for state==FAILED to call handle_miss as well.

@YoannAbriel

Copy link
Copy Markdown
Contributor Author

Good catch — will update to call handle_miss() on FAILED and only prune on SUCCESS.

@ferruzzi

Copy link
Copy Markdown
Contributor

The change of location isn't necessarily bad, but I think it's somewhat irrelevant. We can either do

if state == SUCCESS:
    prune
if state == FAILURE:
    handle
    
the existing if state in [SUCCESS, FAILURE]:   
  existing code

or

the existing if state in [SUCCESS, FAILURE]:   
    if state == SUCCESS:
        prune
    if state == FAILURE:
        handle
   existing code

I don't think the location itself matters, but maybe inside the existing "if terminal state" block is slightly easier to find/read.

@YoannAbriel YoannAbriel force-pushed the fix/issue-60927 branch 4 times, most recently from 4faf4e9 to b5a52f1 Compare March 12, 2026 15:02
@YoannAbriel

Copy link
Copy Markdown
Contributor Author

Thanks @ferruzzi — updated to keep the logic inside the existing terminal-state if-block as you suggested: handle_miss() on FAILED, prune on SUCCESS. Appreciate the thorough review!

@YoannAbriel YoannAbriel force-pushed the fix/issue-60927 branch 3 times, most recently from 35c9dc2 to feac1a4 Compare March 16, 2026 16:09
@ferruzzi

Copy link
Copy Markdown
Contributor

Don't you want to add the handle_miss branch in that same place, so it gets run immediately at failure as well?

@YoannAbriel

Copy link
Copy Markdown
Contributor Author

Done — SUCCESS prunes, FAILURE calls handle_miss() immediately.

Comment thread airflow-core/src/airflow/models/dagrun.py Outdated
Comment thread airflow-core/tests/unit/models/test_dagrun.py Outdated
@eladkal eladkal requested a review from ferruzzi March 23, 2026 07:25
@eladkal eladkal added this to the Airflow 3.1.9 milestone Mar 23, 2026
@eladkal eladkal added type:bug-fix Changelog: Bug Fixes area:deadline-alerts AIP-86 (former AIP-57) backport-to-v3-1-test labels Mar 23, 2026
@YoannAbriel YoannAbriel force-pushed the fix/issue-60927 branch 10 times, most recently from ac96d0a to 262c533 Compare April 10, 2026 18:06
@kaxil kaxil requested a review from Copilot April 10, 2026 19:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a scheduler-side cleanup bug where DagRun deadline alerts could still fire after a DagRun had already failed, by ensuring deadlines are pruned whenever the DagRun reaches a terminal state.

Changes:

  • Move Deadline.prune_deadlines() invocation from the success-only branch to the shared terminal-state block (SUCCESS/FAILED).
  • Add a unit test asserting DagRun deadline pruning also occurs on failure.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
airflow-core/src/airflow/models/dagrun.py Prunes pending DagRun deadlines when a DagRun finishes (FAILED or SUCCESS) so alerts don’t trigger after completion.
airflow-core/tests/unit/models/test_dagrun.py Adds a regression test covering deadline pruning for failed DagRuns.

self, mock_get_by_id, mock_prune, session, deadline_test_dag
):
"""Deadlines should be pruned when a DAG run fails, not just on success."""
mock_deadline_alert = mock.MagicMock()

Copilot AI Apr 10, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mock.MagicMock() is created without a spec/autospec, which can hide attribute/typo mistakes in tests. Consider using a spec’d mock (or an autospecced patch) matching the DeadlineAlert model so the test fails if it relies on non-existent attributes.

Suggested change
mock_deadline_alert = mock.MagicMock()
mock_deadline_alert = mock.create_autospec(DeadlineAlertModel, instance=True)

Copilot uses AI. Check for mistakes.
Comment on lines +1263 to +1269
deadline_alerts = [
DeadlineAlertModel.get_by_id(alert_id, session) for alert_id in dag.deadline
]

if any(
deadline_alert.reference_class in SerializedReferenceModels.TYPES.DAGRUN
for deadline_alert in deadline_alerts

Copilot AI Apr 10, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deadline_alerts = [DeadlineAlertModel.get_by_id(...)] issues one DB query per deadline ID. Since this code runs on every terminal DagRun (now including failures), consider fetching all referenced DeadlineAlert rows in a single query (id IN (...)) and then checking whether any reference_class is DagRun-related, to avoid N queries when multiple deadlines are configured.

Suggested change
deadline_alerts = [
DeadlineAlertModel.get_by_id(alert_id, session) for alert_id in dag.deadline
]
if any(
deadline_alert.reference_class in SerializedReferenceModels.TYPES.DAGRUN
for deadline_alert in deadline_alerts
deadline_alert_reference_classes = session.execute(
select(DeadlineAlertModel.reference_class).where(
DeadlineAlertModel.id.in_(set(dag.deadline))
)
).scalars()
if any(
reference_class in SerializedReferenceModels.TYPES.DAGRUN
for reference_class in deadline_alert_reference_classes

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't really think this is necessary and I'd just resolve it. Seems like over-optimizing around an edge case where a dagrun has many deadlines as opposed to just using the exiting helper that serves the same purpose.

@YoannAbriel YoannAbriel force-pushed the fix/issue-60927 branch 3 times, most recently from d65ff0e to 89dc314 Compare April 13, 2026 06:06
Previously, deadline alerts were only pruned when a DAG run succeeded.
When a DAG run failed, pending deadlines remained and would still fire
even though the run had already reached a terminal state.

Moved deadline pruning from the success-only code path to the shared
terminal state block (covers both SUCCESS and FAILED).

Closes: apache#60927
@vatsrahul1001

Copy link
Copy Markdown
Contributor

@YoannAbriel can you address failing static checks?

@potiuk potiuk added ready for maintainer review Set after triaging when all criteria pass. and removed ready for maintainer review Set after triaging when all criteria pass. labels May 24, 2026
@Vamsi-klu

Copy link
Copy Markdown
Contributor

Current head still deletes failed-run deadlines instead of handling them as missed. Deadline.prune_deadlines() is only for successful events; on FAILED DagRuns we need to call handle_miss() on pending DagRun deadlines so the callback is queued immediately, and keep pruning limited to SUCCESS. Otherwise a DAG that fails before deadline_time loses the deadline row and the miss callback never fires.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:deadline-alerts AIP-86 (former AIP-57) ready for maintainer review Set after triaging when all criteria pass. type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deadline Alert triggered after dag failure

9 participants