Fix backfill marked complete before DagRuns are created#62561
Fix backfill marked complete before DagRuns are created#62561shivaam wants to merge 3 commits intoapache:mainfrom
Conversation
dcaf372 to
3372139
Compare
eladkal
left a comment
There was a problem hiding this comment.
LGTM
will need a 2nd reviewer as this is scheduler core area
|
Is it possible to also add copilot reviewer to this cr as well? |
There was a problem hiding this comment.
Pull request overview
Fixes a scheduler race where _mark_backfills_complete() could mark a Backfill as complete in the gap between the Backfill row being committed and its DagRuns being created.
Changes:
- Add a scheduler-side guard requiring a Backfill to have at least one
BackfillDagRunassociation before it can be marked complete. - Add a unit test covering the “initializing backfill” window (no associated DagRuns/BackfillDagRun rows yet), ensuring it is not prematurely completed.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
airflow-core/src/airflow/jobs/scheduler_job_runner.py |
Updates the backfill completion query to require existence of at least one BackfillDagRun row before completing. |
airflow-core/tests/unit/jobs/test_scheduler_job.py |
Adds a regression test that reproduces the initialization window and asserts the scheduler skips completion until associations exist. |
You can also share your feedback on Copilot code review. Take the survey.
The scheduler's _mark_backfills_complete() could mark a backfill as completed during the window between the Backfill row commit and DagRun creation. Add an EXISTS guard on BackfillDagRun so backfills still being initialized are skipped.
Add a 2-minute age-based fallback to the backfill completion guard so orphaned backfills (those that failed during initialization and never got any BackfillDagRun associations) are auto-completed instead of permanently blocking new backfills for the same DAG. Also adds tests for: orphan cleanup after cutoff, old backfill with running DagRuns stays open, young backfill with finished runs completes immediately, and multiple backfills processed independently.
f450cd7 to
2eb0314
Compare
| # that failed during initialization and never got any associations. | ||
| or_( | ||
| exists(select(BackfillDagRun.id).where(BackfillDagRun.backfill_id == Backfill.id)), | ||
| Backfill.created_at < initializing_cutoff, |
There was a problem hiding this comment.
added this gaurd that will cleanup any backfills which dont have associated dag runs after 2 mins. Therefore, we wont have any stuck backfills
kaxil
left a comment
There was a problem hiding this comment.
lgtm but it would be good to get your review @dstandish
What
The scheduler's
_mark_backfills_complete()prematurely marks a backfillas completed when it runs during the window between the Backfill row
commit and the DagRun creation in
_create_backfill().closes: #61375
Why
_create_backfill() works in two steps:
The scheduler runs _mark_backfills_complete() every 30 seconds. If it happens to run between step 1 and step 2, it sees a backfill with no running DagRuns (because they don't exist yet) and marks it done. The DagRuns get created after, but the backfill is already completed.
How
Added an EXISTS check on the backfill_dag_run table in the completion query. Now a backfill needs at least one BackfillDagRun row before it can be marked complete. If it has zero, it means the backfill is still being set up, so we skip it.
Tests
test_mark_backfills_complete_skips_initializing_backfill— verifies that backfill without any dagruns is skipped, then completed after DagRuns finish. If we remove the fix, the test will fail.Was generative AI tooling used to co-author this PR?