
Support processing more resources per replica #2871

Merged
r4victor merged 20 commits into master from issue_2855_more_resources_per_replica on Jul 4, 2025
Conversation

@r4victor r4victor (Collaborator) commented Jul 3, 2025

Closes #2855

This PR introduces multiple changes that let a dstack server replica process resources more quickly and allow scaling per-replica processing rates to take advantage of larger Postgres installations:

Main changes:

  • Introduces a DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR environment variable that runs multiple instances of each background job per replica, increasing processing rates close to linearly. Installations backed by larger Postgres setups can raise DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR, DSTACK_DB_POOL_SIZE, and DSTACK_DB_MAX_OVERFLOW to process more resources quickly.
  • Optimizes run_job/create_instance on AWS by caching all the API requests (reduced from 26s to 18s; 18s is essentially the minimum for a run_instances API call with wait_until_running).
  • Sets a custom default asyncio executor with 128 workers by default. The default executor was a bottleneck when scaling processing, causing run_async to block.
  • Makes the run stopping API asynchronous. Previously it was synchronous, which led to slow API responses and, when stopping many runs, produced too many concurrent write transactions, which could cause "database is locked" errors on SQLite.
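The first and third points above can be sketched together. This is a hypothetical illustration, not dstack's actual code: the job names and start_background_processing are made up, but the pattern (read a factor from the environment, launch that many copies of each background job, and install a larger default executor) matches what the PR describes:

```python
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor


async def process_runs():
    """Hypothetical stand-in for one of the server's background jobs."""


async def process_instances():
    """Hypothetical stand-in for another background job."""


BACKGROUND_JOBS = [process_runs, process_instances]


async def start_background_processing():
    # Launch `factor` instances of each background job to scale the
    # per-replica processing rate close to linearly.
    factor = int(os.getenv("DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR", "1"))
    loop = asyncio.get_running_loop()
    # A larger default executor keeps run_in_executor calls from queuing
    # behind the small default thread pool when many jobs run concurrently.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=128))
    return [
        asyncio.create_task(job())
        for job in BACKGROUND_JOBS
        for _ in range(factor)
    ]
```

With the factor set to 4, each background job would run as 4 concurrent task instances, at the cost of proportionally more database connections.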

Other changes:

  • Replaces in-memory locking with dummy locking on Postgres to remove unnecessary lock contention when scaling processing.
  • Adds a "warm up offers cache" trick that uses batch size 1 when processing has just started.
  • Introduces minimal processing intervals for tasks that process resources repeatedly, to avoid processing the same resources too often.
  • Other enhancements and fixes.
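The minimal-processing-interval idea from the list above can be sketched as follows. This is an illustrative implementation, not the PR's actual code; the function name, the interval value, and the in-memory timestamp map are assumptions:

```python
import time
from typing import Optional

# Illustrative value; real intervals would be tuned per task type.
MIN_PROCESSING_INTERVAL = 30.0  # seconds

_last_processed: dict = {}


def should_process(resource_id: str, now: Optional[float] = None) -> bool:
    """Skip resources that were processed within the minimal interval."""
    now = time.monotonic() if now is None else now
    last = _last_processed.get(resource_id)
    if last is not None and now - last < MIN_PROCESSING_INTERVAL:
        return False
    _last_processed[resource_id] = now
    return True
```

The trade-off is the one the PR notes under "Possible downsides": a resource's processing cycle can be slightly delayed, in exchange for not re-processing hot resources on every pass.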

Results:

  • Default processing rates improved significantly: provisioning 100 runs on SQLite with default settings takes ~2m (previously ~5m).
  • Better scaling by default: provisioning an additional 100 runs is even quicker (due to the warm cache). Previously, it would only get slower.
  • Per-replica processing rates can now be scaled, e.g. on Postgres with DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR=4, provisioning 300 runs takes ~4m.

Possible downsides:

  • A single processing cycle for one job/instance/run can be slightly slower due to the introduced minimal processing intervals (not noticeable in practice).

@r4victor r4victor merged commit 2a89dff into master Jul 4, 2025
25 checks passed
@r4victor r4victor deleted the issue_2855_more_resources_per_replica branch July 4, 2025 05:30


async def _process_fleet(session: AsyncSession, fleet_model: FleetModel):
    logger.info("Processing fleet %s", fleet_model.name)
@r4victor, we usually reserve INFO for important events, such as status changes. I suggest using DEBUG here so this message doesn't take over our production logs.


Development

Successfully merging this pull request may close these issues.

Support processing more resources per server replica

2 participants