
Support processing more resources per replica #2871

Merged
r4victor merged 20 commits into master from issue_2855_more_resources_per_replica on Jul 4, 2025
Conversation

@r4victor r4victor (Collaborator) commented Jul 3, 2025

Closes #2855

This PR introduces multiple changes that let a dstack server replica process resources more quickly and allow scaling per-replica processing rates to take advantage of larger Postgres installations:

Main changes:

  • Introduces a DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR environment variable that runs multiple instances of each background job per replica, increasing processing rates close to linearly. Installations backed by larger Postgres setups can raise DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR, DSTACK_DB_POOL_SIZE, and DSTACK_DB_MAX_OVERFLOW to process more resources quickly.
  • Optimizes run_job/create_instance on AWS by caching all the API requests (reduced from 26s to 18s; 18s is essentially the minimum for a run_instances API call with wait_until_running).
  • Sets a custom default asyncio executor with 128 workers by default. The default executor was a bottleneck when scaling processing, causing run_async to block.
  • Makes the run stopping API asynchronous. Previously it was synchronous, which led to slow API responses and, when stopping many runs, produced too many concurrent write transactions, which could cause "database is locked" errors on SQLite.
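The first and third points above can be sketched together. This is a hypothetical illustration, not dstack's actual code: the job names and start_background_processing are made up, but the pattern (read a factor from the environment, launch that many copies of each background job, and install a larger default executor) matches what the PR describes:

```python
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor


async def process_runs():
    """Hypothetical stand-in for one of the server's background jobs."""


async def process_instances():
    """Hypothetical stand-in for another background job."""


BACKGROUND_JOBS = [process_runs, process_instances]


async def start_background_processing():
    # Launch `factor` instances of each background job to scale the
    # per-replica processing rate close to linearly.
    factor = int(os.getenv("DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR", "1"))
    loop = asyncio.get_running_loop()
    # A larger default executor keeps run_in_executor calls from queuing
    # behind the small default thread pool when many jobs run concurrently.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=128))
    return [
        asyncio.create_task(job())
        for job in BACKGROUND_JOBS
        for _ in range(factor)
    ]
```

With the factor set to 4, each background job would run as 4 concurrent task instances, at the cost of proportionally more database connections.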

Other changes:

  • Replaces in-memory locking with dummy locking on Postgres to remove unnecessary lock contention when scaling processing.
  • Adds a "warm up offers cache" trick that uses batch size 1 when processing has just started.
  • Introduces minimal processing intervals for tasks that process resources repeatedly, to avoid processing the same resources too often.
  • Other enhancements and fixes.
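The minimal-processing-interval idea from the list above can be sketched as follows. This is an illustrative implementation, not the PR's actual code; the function name, the interval value, and the in-memory timestamp map are assumptions:

```python
import time
from typing import Optional

# Illustrative value; real intervals would be tuned per task type.
MIN_PROCESSING_INTERVAL = 30.0  # seconds

_last_processed: dict = {}


def should_process(resource_id: str, now: Optional[float] = None) -> bool:
    """Skip resources that were processed within the minimal interval."""
    now = time.monotonic() if now is None else now
    last = _last_processed.get(resource_id)
    if last is not None and now - last < MIN_PROCESSING_INTERVAL:
        return False
    _last_processed[resource_id] = now
    return True
```

The trade-off is the one the PR notes under "Possible downsides": a resource's processing cycle can be slightly delayed, in exchange for not re-processing hot resources on every pass.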

Results:

  • Default processing rates improved significantly: provisioning 100 runs on SQLite with default settings takes ~2m (previously ~5m).
  • Better scaling by default: provisioning an additional 100 runs is even quicker (due to the warm cache). Previously, it would only get slower.
  • Per-replica processing rates can now be scaled, e.g. on Postgres with DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR=4, provisioning 300 runs takes ~4m.

Possible downsides:

  • A single processing cycle for one job/instance/run can be slightly slower due to the introduced minimal processing intervals (not noticeable in practice).

@r4victor r4victor merged commit 2a89dff into master Jul 4, 2025
25 checks passed
@r4victor r4victor deleted the issue_2855_more_resources_per_replica branch July 4, 2025 05:30


async def _process_fleet(session: AsyncSession, fleet_model: FleetModel):
    logger.info("Processing fleet %s", fleet_model.name)
@r4victor, we usually reserve INFO for important events, such as status changes. I suggest using DEBUG here so this message doesn't take over our production logs.


Development

Successfully merging this pull request may close these issues.

Support processing more resources per server replica

2 participants