DAG Runs from a single DAG can prevent scheduler from seeing other DAG's runs #49508

@collinmcnulty

Apache Airflow version

2.10.5

If "Other Airflow 2 version" selected, which one?

No response

What happened?

Two DAGs each receive a large batch of DAG Runs. The number of runs for each DAG exceeds max_dagruns_per_loop_to_schedule. Each DAG run is very short, shorter than the heartrate of this Airflow deployment. Both DAGs have a max_active_runs that is far less than max_dagruns_per_loop_to_schedule.

So: max_active_runs < max_dagruns_per_loop_to_schedule < number of queued DAG runs.

In each scheduler loop, the first DAG has a very small number of free DAG Run "slots", so the check coalesce(running_drs.c.num_running, text("0")) < coalesce(Backfill.max_active_runs, DagModel.max_active_runs) does not exclude it. But then all the DAG runs that are considered in that loop are from the first DAG. As a result, the second DAG effectively has to wait for nearly all of the first DAG's runs to complete before any of its own runs are moved from queued to running.
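
The starvation described above can be illustrated with a toy model. This is a minimal sketch, not Airflow's scheduler code: the function name, the strict FIFO ordering (all of the first DAG's runs queued ahead of the second's), and the assumption that every started run finishes before the next loop are simplifications made for illustration.

```python
from collections import Counter


def first_start_loop(queued_per_dag, max_active_runs, max_per_loop):
    """Return {dag_id: 1-based loop number in which its first run started}."""
    # Hypothetical queue ordered by queue time: DAGs appear in the order given.
    queue = []
    for dag_id, count in queued_per_dag:
        queue.extend([dag_id] * count)

    first_start = {}
    loop = 0
    while queue:
        loop += 1
        # Assumption: runs started in the previous loop have already finished,
        # so every DAG passes the yes/no max_active_runs check.
        candidates, rest = queue[:max_per_loop], queue[max_per_loop:]
        started = Counter()
        not_started = []
        for dag_id in candidates:
            if started[dag_id] < max_active_runs:
                started[dag_id] += 1
                first_start.setdefault(dag_id, loop)
            else:
                # Considered this loop, but the DAG has no free slot left.
                not_started.append(dag_id)
        queue = not_started + rest
    return first_start


result = first_start_loop(
    [("first", 5000), ("second", 5000)],
    max_active_runs=100,
    max_per_loop=2000,
)
print(result)
```

With the numbers from the reproduction steps, this model does not start any run of the second DAG until loop 32, after 3,100 runs of the first DAG have been started.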

What you think should happen instead?

I think the "most correct" thing to do is to replace the global yes/no decision on whether a DAG is included in the query, which is made on the basis of max_active_runs, with some kind of limit on the number of runs from that DAG that can be included. I can't see a good way to do this in SQL, but others may have insight.
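
One possible shape for such a per-DAG limit is a window function that ranks each DAG's queued runs and keeps only as many as the DAG has free slots. The following is a minimal sketch against a made-up two-table SQLite schema (`dag`, `dag_run`), not Airflow's real models or query; whether it is acceptable on the databases Airflow supports, and at what cost, is untested here. It assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Hypothetical query: cap each DAG's candidates at its number of free slots.
CAPPED_QUERY = """
WITH running AS (
    SELECT dag_id, COUNT(*) AS num_running
    FROM dag_run
    WHERE state = 'running'
    GROUP BY dag_id
),
ranked AS (
    SELECT dr.id,
           dr.dag_id,
           ROW_NUMBER() OVER (PARTITION BY dr.dag_id ORDER BY dr.id) AS rn,
           d.max_active_runs - COALESCE(r.num_running, 0) AS free_slots
    FROM dag_run AS dr
    JOIN dag AS d ON d.dag_id = dr.dag_id
    LEFT JOIN running AS r ON r.dag_id = dr.dag_id
    WHERE dr.state = 'queued'
)
SELECT id, dag_id
FROM ranked
WHERE rn <= free_slots
ORDER BY id
LIMIT :max_per_loop
"""


def build_demo_db():
    """Create an in-memory database with two DAGs and some runs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dag (dag_id TEXT PRIMARY KEY, max_active_runs INTEGER);
        CREATE TABLE dag_run (id INTEGER PRIMARY KEY, dag_id TEXT, state TEXT);
        """
    )
    conn.executemany("INSERT INTO dag VALUES (?, ?)", [("first", 3), ("second", 3)])
    runs = [("first", "running")]          # id 1
    runs += [("first", "queued")] * 10     # ids 2..11
    runs += [("second", "queued")] * 10    # ids 12..21
    conn.executemany("INSERT INTO dag_run (dag_id, state) VALUES (?, ?)", runs)
    return conn


def pick_runs(conn, max_per_loop):
    """Return (id, dag_id) rows selected for one scheduler loop."""
    return conn.execute(CAPPED_QUERY, {"max_per_loop": max_per_loop}).fetchall()


conn = build_demo_db()
rows = pick_runs(conn, 5)
print(rows)
```

In this demo the first DAG has two free slots and the second has three, so a loop limit of five returns runs from both DAGs instead of five runs from the first.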

Alternatively, because this is predominantly a problem when a single DAG dominates the scheduler's attention, we could add an explicit check to see if the result of the DAG run query contains only a single DAG, and if so re-run the query with that DAG excluded.
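
A minimal sketch of that fallback, assuming a hypothetical `fetch(limit, exclude)` callable that stands in for the DAG run query (it is not an Airflow API), could look like this:

```python
def select_runs_with_fallback(fetch, max_per_loop):
    """Run the query once; if one DAG fills the result, query again without it."""
    first = fetch(limit=max_per_loop, exclude=frozenset())
    dag_ids = {dag_id for _, dag_id in first}
    if len(dag_ids) != 1:
        # Empty or already mixed result: nothing to correct.
        return first
    # A single DAG dominated the result, so look for runs from other DAGs.
    second = fetch(limit=max_per_loop, exclude=frozenset(dag_ids))
    return first + second


def make_fetch(queue):
    """Build an in-memory stand-in for the query over (run_id, dag_id) rows."""
    def fetch(limit, exclude):
        return [row for row in queue if row[1] not in exclude][:limit]
    return fetch


queue = [(i, "first") for i in range(10)] + [(i, "second") for i in range(10, 15)]
selected = select_runs_with_fallback(make_fetch(queue), max_per_loop=4)
print(selected)
```

This only helps when exactly one DAG fills the result. If two DAGs together dominate the queue, a third DAG would still be starved, which is one reason the per-DAG limit may be the more general fix.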

How to reproduce

  1. Create two DAGs with a single, simple task.
  2. Set max_active_runs=100
  3. Set max_dagruns_per_loop_to_schedule=2000
  4. Start 5000 Runs of the first DAG
  5. Start 5000 Runs of the second DAG
  6. Hard to reproduce: keep the heartrate of the scheduler low enough that Runs complete within one scheduler loop.

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

  • affected_version:2.10 (Issues Reported for 2.10)
  • affected_version:main_branch (Issues Reported for main branch)
  • area:Scheduler (including HA (high availability) scheduler)
  • area:core
  • kind:bug (This is clearly a bug)
  • priority:medium (Bug that should be fixed before next release but would not block a release)