
Add Slurm worker MPI mode, claim paging, and downstream backpressure#260

Open
nkeilbart wants to merge 15 commits into main from fix/slurm-mpi-none-and-resource-backfill

Conversation

Collaborator

@nkeilbart nkeilbart commented Apr 8, 2026

Summary

  • move Slurm sacct collection off the job completion path so completions are recorded immediately and accounting is backfilled asynchronously on a background worker
  • add execution_config.slurm_worker_mpi_mode so the outer direct-mode worker-launch srun can opt into --mpi=none
  • page resource-based job claiming across ready-job candidates so smaller jobs can backfill past an initial priority window
  • add execution_config.downstream_buffer_multiplier so upstream setup-style stages stop admitting more work once an active downstream stage reaches a live-capacity-based buffer limit
  • keep the downstream buffer limiter inactive until downstream runners have registered active compute_node capacity, avoiding startup deadlocks
  • document direct-mode multi-GPU / nested srun usage with --overlap, --nodes, and --ntasks
  • document the asynchronous sacct collection flow and the resulting Slurm accounting fields
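
Taken together, the two new settings might look roughly like this in an execution config (a hedged sketch: only the field names `slurm_worker_mpi_mode` and `downstream_buffer_multiplier` come from this PR; the file format, section layout, and example values are assumptions):

```toml
# Hypothetical placement; only the two field names are taken from this PR.
[execution_config]
# Let the outer direct-mode worker-launch srun opt into --mpi=none.
slurm_worker_mpi_mode = "none"
# Upstream setup-style stages stop admitting work once an active
# downstream stage reaches downstream capacity times this multiplier.
downstream_buffer_multiplier = 2
```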

Validation

  • cargo fmt -- --check
  • cargo clippy --all --all-targets --all-features -- -D warnings
  • dprint check
  • cargo test --test test_claim_jobs_based_on_resources --test test_execution_config -- --nocapture

Record Slurm job completion before any sacct lookup so short
jobs do not inherit accounting latency. Queue completed steps by
allocation and persist slurm_stats from a background worker after
resources are freed.
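
The queue-and-backfill flow described above can be sketched with a plain channel and a background thread (illustrative only: `CompletedStep`, `spawn_backfill_worker`, and the in-memory "persist" step are stand-ins for the real batched sacct lookup and database writes):

```rust
use std::sync::mpsc;
use std::thread;

// One completed Slurm step queued for asynchronous accounting backfill.
struct CompletedStep {
    slurm_job_id: u64,
}

// Spawn the background worker that drains the queue; the completion path
// records the job as done first, then sends the step here for enrichment.
fn spawn_backfill_worker() -> (mpsc::Sender<CompletedStep>, thread::JoinHandle<Vec<u64>>) {
    let (tx, rx) = mpsc::channel::<CompletedStep>();
    let handle = thread::spawn(move || {
        let mut persisted = Vec::new();
        // In the real code this would run a batched sacct query and
        // persist slurm_stats back to the database.
        for step in rx {
            persisted.push(step.slurm_job_id);
        }
        persisted
    });
    (tx, handle)
}
```

The key property is that nothing on the completion path waits on sacct; the worker catches up after resources are already freed.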
Add an opt-in direct-mode setting for launching one worker per node with outer srun --mpi=none, and thread that through the workflow execution config and Slurm submission path.

Refactor resource-based job claiming to scan ready jobs in pages so lower-priority jobs can backfill leftover capacity when large higher-priority jobs do not fit inside the first SQL window.
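
The paged backfill behavior can be sketched as follows (hypothetical names throughout; the real implementation pages through SQL windows rather than an in-memory slice):

```rust
#[derive(Clone, Copy)]
struct Job {
    id: u64,
    priority: u64, // lower value = higher priority
    cpus: u64,     // simplified single-dimension resource requirement
}

/// Scan ready jobs in priority-ordered pages, claiming any job that fits
/// the remaining capacity. Jobs that do not fit are skipped, so smaller,
/// lower-priority jobs in later pages can backfill leftover capacity.
fn claim_with_paging(ready: &[Job], mut free_cpus: u64, page_size: usize, max_pages: usize) -> Vec<u64> {
    let mut sorted: Vec<Job> = ready.to_vec();
    sorted.sort_by_key(|j| j.priority);

    let mut claimed = Vec::new();
    for page in sorted.chunks(page_size).take(max_pages) {
        for job in page {
            if job.cpus <= free_cpus {
                free_cpus -= job.cpus;
                claimed.push(job.id);
            }
            // A too-large job is skipped, not a stopping point.
        }
        if free_cpus == 0 {
            break;
        }
    }
    claimed
}
```

With 8 free CPUs, a 16-CPU top-priority job is skipped and two 4-CPU jobs from later in the scan backfill the capacity instead.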
@nkeilbart nkeilbart requested a review from daniel-thom April 8, 2026 23:47
Add execution_config.downstream_buffer_multiplier and enforce it in resource-based job claiming once downstream runners have registered active compute-node capacity.

Use direct dependency families plus live compute_node resources to cap upstream admission at downstream capacity times the configured multiplier, while leaving the limiter inactive before downstream runners start to avoid bootstrap deadlocks.

Cover the new config parsing and validation, plus claim-time activation and throttling behavior, with integration tests.
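
The limiter's activation rule reduces to a small predicate (a sketch under assumed names; `None` stands for "no downstream compute-node capacity registered yet"):

```rust
/// Decide whether an upstream job may be admitted given downstream state.
/// Before any downstream runner registers capacity the limiter is inactive,
/// which avoids the bootstrap deadlock described above.
fn may_admit_upstream(
    downstream_capacity: Option<u64>,
    downstream_occupancy: u64,
    multiplier: u64,
) -> bool {
    match downstream_capacity {
        // No downstream runners yet: admit freely.
        None => true,
        // Saturating multiply so a huge capacity cannot wrap the limit.
        Some(cap) => downstream_occupancy < cap.saturating_mul(multiplier),
    }
}
```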
@nkeilbart nkeilbart changed the title Add Slurm worker MPI mode and claim paging Add Slurm worker MPI mode, claim paging, and downstream backpressure Apr 9, 2026
nkeilbart added 11 commits April 9, 2026 13:59
Release local runner resources as soon as Slurm job processes exit,
but defer final job completion until sacct data has been collected
and applied.

Restore sacct-derived state overrides and result backfill before
complete_job(), and make batched sacct collection keep retrying
until real accounting fields arrive.

pre-commit --all-files also normalized formatting, trailing
whitespace, and EOF handling across tracked files in the repo.
Wrap the resource-based claim path so any error after BEGIN IMMEDIATE
gets a best-effort rollback before the SQLite connection returns to
the pool.

Add a regression test that sends an invalid claim request and verifies
a subsequent normal claim still succeeds on the same server instance.
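
The shared BEGIN IMMEDIATE wrapper might look like this (a sketch: `Conn` is a stand-in trait for the real SQLite handle, and errors are simplified to `String`):

```rust
// Minimal stand-in for a pooled SQLite connection.
trait Conn {
    fn execute(&mut self, sql: &str) -> Result<(), String>;
}

/// Run `body` inside an immediate transaction. On any error, attempt a
/// best-effort rollback before the connection returns to the pool, and
/// surface the original error even if the rollback itself fails.
fn with_immediate_txn<C: Conn, T>(
    conn: &mut C,
    body: impl FnOnce(&mut C) -> Result<T, String>,
) -> Result<T, String> {
    conn.execute("BEGIN IMMEDIATE")?;
    match body(conn) {
        Ok(v) => {
            conn.execute("COMMIT")?;
            Ok(v)
        }
        Err(e) => {
            // Best-effort: ignore a rollback failure, keep the body's error.
            let _ = conn.execute("ROLLBACK");
            Err(e)
        }
    }
}
```

Centralizing this keeps the claim, retry, and workflow-delete paths from each re-implementing rollback handling.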

Remove an unused empty Rust test file that caused cargo fmt and the
end-of-file fixer to disagree during commit validation.
Add shared helpers for BEGIN IMMEDIATE and rollback cleanup so
claim, retry, and workflow delete paths handle rollback failures
consistently.

Also log the full client error chain for complete_job transport
failures so the next reqwest send error shows the underlying socket
cause.
Split claim_jobs_based_on_resources into a read phase and a short
write phase so buffered candidate discovery and response hydration
happen outside BEGIN IMMEDIATE.

Reuse shared selection logic across both phases, keep the bounded
write-phase recheck for status, scheduler, resource fit, and
downstream buffer semantics, and add debug timing for the claim path.

Add a concurrent claim regression that verifies there are no
double-claims and no poisoned follow-up claim or completion calls.
Move downstream buffer planning out of the claim write phase.
Bulk-load downstream families, resource shapes, compute node
capacity, and occupancy during read-side candidate collection,
then keep the write phase focused on confirming ready jobs,
applying local fit checks, and marking jobs pending.

This keeps the optimistic overfetch behavior while reducing
write-lock work and eliminating per-candidate downstream
lookups inside the transaction.
Treat unconstrained downstream families as effectively unbounded,
return 422 for negative claim limits, and use saturating
resource-fit arithmetic so large capacity values cannot wrap.

Cap read-phase claim scanning at 1000 pages and add regression
tests for the unconstrained downstream case, invalid limits,
overflow-safe fit checks, and page-cap behavior.
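
The saturating fit check amounts to comparing the requirement against saturating free capacity instead of adding and risking wraparound (hypothetical signature):

```rust
/// Overflow-safe resource-fit check: compute free capacity with saturating
/// subtraction so inconsistent or very large values cannot wrap, then
/// compare the requirement against it.
fn fits(required: u64, used: u64, capacity: u64) -> bool {
    required <= capacity.saturating_sub(used)
}
```

If `used` somehow exceeds `capacity`, free capacity saturates at zero and nothing fits, rather than wrapping to a huge value.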
Report successful Slurm jobs to complete_job immediately instead of waiting on sacct retries in the runner loop. Keep sacct-based state overrides on the synchronous path for non-zero exits, but move successful-job accounting enrichment into a background worker.

Persist result resource metrics through update_result so the async enrichment path can backfill CPU and memory usage after the job is already complete. Add regression coverage for both the runner sync filter and result metric updates.
Document the paged resource-based claim behavior so backfill
semantics match the implementation.

Add workflow docs for downstream_buffer_multiplier and
slurm_worker_mpi_mode, and note that Slurm sacct enrichment can
arrive shortly after job completion.

Also correct the execution_config mode default in the workflow
formats docs to match the code.
Prevent one-runner-per-node Slurm subtasks from timing out
independently while sibling runners in the same allocation still
have running jobs.

Use existing compute-node scheduler metadata to find sibling
runners and suppress the idle exit only when sibling compute
nodes still own running work.
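
The suppression rule can be sketched as a predicate over sibling running-job counts (names are illustrative, not the real scheduler-metadata API):

```rust
/// A runner in a one-runner-per-node allocation exits on idle timeout only
/// when no sibling compute node in the same allocation still owns running
/// jobs; otherwise the idle exit is suppressed.
fn should_exit_on_idle(idle_timeout_expired: bool, sibling_running_jobs: &[u64]) -> bool {
    idle_timeout_expired && sibling_running_jobs.iter().all(|&n| n == 0)
}
```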
Include completed jobs in the unblocking indexes so the background workflow scan matches the statuses it queries.

Tighten the unblocking result join to the current attempt so retried jobs do not inherit stale failure results from earlier attempts in the same run.

Add regression tests that cover completed-job unblocking and current-attempt result selection.