
Add Slurm worker MPI mode, claim paging, and downstream backpressure#260

Open
nkeilbart wants to merge 15 commits into main from fix/slurm-mpi-none-and-resource-backfill

Conversation

Collaborator

@nkeilbart nkeilbart commented Apr 8, 2026

Summary

  • move Slurm sacct collection off the job completion path so completions are recorded immediately and accounting is backfilled asynchronously on a background worker
  • add execution_config.slurm_worker_mpi_mode so the outer direct-mode worker-launch srun can opt into --mpi=none
  • page resource-based job claiming across ready-job candidates so smaller jobs can backfill past an initial priority window
  • add execution_config.downstream_buffer_multiplier so upstream setup-style stages stop admitting more work once an active downstream stage reaches a live-capacity-based buffer limit
  • keep the downstream buffer limiter inactive until downstream runners have registered active compute_node capacity, avoiding startup deadlocks
  • document direct-mode multi-GPU / nested srun usage with --overlap, --nodes, and --ntasks
  • document the asynchronous sacct collection flow and the resulting Slurm accounting fields
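
Taken together, the two new settings might look roughly like this in an execution config (a hedged sketch: only the field names `slurm_worker_mpi_mode` and `downstream_buffer_multiplier` come from this PR; the file format, section layout, and example values are assumptions):

```toml
# Hypothetical placement; only the two field names are taken from this PR.
[execution_config]
# Let the outer direct-mode worker-launch srun opt into --mpi=none.
slurm_worker_mpi_mode = "none"
# Upstream setup-style stages stop admitting work once an active
# downstream stage reaches downstream capacity times this multiplier.
downstream_buffer_multiplier = 2
```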

Validation

  • cargo fmt -- --check
  • cargo clippy --all --all-targets --all-features -- -D warnings
  • dprint check
  • cargo test --test test_claim_jobs_based_on_resources --test test_execution_config -- --nocapture

Record Slurm job completion before any sacct lookup so short
jobs do not inherit accounting latency. Queue completed steps by
allocation and persist slurm_stats from a background worker after
resources are freed.
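
The queue-and-backfill flow described above can be sketched with a plain channel and a background thread (illustrative only: `CompletedStep`, `spawn_backfill_worker`, and the in-memory "persist" step are stand-ins for the real batched sacct lookup and database writes):

```rust
use std::sync::mpsc;
use std::thread;

// One completed Slurm step queued for asynchronous accounting backfill.
struct CompletedStep {
    slurm_job_id: u64,
}

// Spawn the background worker that drains the queue; the completion path
// records the job as done first, then sends the step here for enrichment.
fn spawn_backfill_worker() -> (mpsc::Sender<CompletedStep>, thread::JoinHandle<Vec<u64>>) {
    let (tx, rx) = mpsc::channel::<CompletedStep>();
    let handle = thread::spawn(move || {
        let mut persisted = Vec::new();
        // In the real code this would run a batched sacct query and
        // persist slurm_stats back to the database.
        for step in rx {
            persisted.push(step.slurm_job_id);
        }
        persisted
    });
    (tx, handle)
}
```

The key property is that nothing on the completion path waits on sacct; the worker catches up after resources are already freed.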
Add an opt-in direct-mode setting for launching one worker per node with outer srun --mpi=none, and thread that through the workflow execution config and Slurm submission path.

Refactor resource-based job claiming to scan ready jobs in pages so lower-priority jobs can backfill leftover capacity when large higher-priority jobs do not fit inside the first SQL window.
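
The paged backfill behavior can be sketched as follows (hypothetical names throughout; the real implementation pages through SQL windows rather than an in-memory slice):

```rust
#[derive(Clone, Copy)]
struct Job {
    id: u64,
    priority: u64, // lower value = higher priority
    cpus: u64,     // simplified single-dimension resource requirement
}

/// Scan ready jobs in priority-ordered pages, claiming any job that fits
/// the remaining capacity. Jobs that do not fit are skipped, so smaller,
/// lower-priority jobs in later pages can backfill leftover capacity.
fn claim_with_paging(ready: &[Job], mut free_cpus: u64, page_size: usize, max_pages: usize) -> Vec<u64> {
    let mut sorted: Vec<Job> = ready.to_vec();
    sorted.sort_by_key(|j| j.priority);

    let mut claimed = Vec::new();
    for page in sorted.chunks(page_size).take(max_pages) {
        for job in page {
            if job.cpus <= free_cpus {
                free_cpus -= job.cpus;
                claimed.push(job.id);
            }
            // A too-large job is skipped, not a stopping point.
        }
        if free_cpus == 0 {
            break;
        }
    }
    claimed
}
```

With 8 free CPUs, a 16-CPU top-priority job is skipped and two 4-CPU jobs from later in the scan backfill the capacity instead.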
@nkeilbart nkeilbart requested a review from daniel-thom April 8, 2026 23:47
Add execution_config.downstream_buffer_multiplier and enforce it in resource-based job claiming once downstream runners have registered active compute-node capacity.

Use direct dependency families plus live compute_node resources to cap upstream admission at downstream capacity times the configured multiplier, while leaving the limiter inactive before downstream runners start to avoid bootstrap deadlocks.

Cover the new config parsing and validation, plus claim-time activation and throttling behavior, with integration tests.
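
The limiter's activation rule reduces to a small predicate (a sketch under assumed names; `None` stands for "no downstream compute-node capacity registered yet"):

```rust
/// Decide whether an upstream job may be admitted given downstream state.
/// Before any downstream runner registers capacity the limiter is inactive,
/// which avoids the bootstrap deadlock described above.
fn may_admit_upstream(
    downstream_capacity: Option<u64>,
    downstream_occupancy: u64,
    multiplier: u64,
) -> bool {
    match downstream_capacity {
        // No downstream runners yet: admit freely.
        None => true,
        // Saturating multiply so a huge capacity cannot wrap the limit.
        Some(cap) => downstream_occupancy < cap.saturating_mul(multiplier),
    }
}
```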
@nkeilbart nkeilbart changed the title Add Slurm worker MPI mode and claim paging Add Slurm worker MPI mode, claim paging, and downstream backpressure Apr 9, 2026
nkeilbart added 11 commits April 9, 2026 13:59
Release local runner resources as soon as Slurm job processes exit,
but defer final job completion until sacct data has been collected
and applied.

Restore sacct-derived state overrides and result backfill before
complete_job(), and make batched sacct collection keep retrying
until real accounting fields arrive.

pre-commit --all-files also normalized formatting, trailing
whitespace, and EOF handling across tracked files in the repo.
Wrap the resource-based claim path so any error after BEGIN IMMEDIATE
gets a best-effort rollback before the SQLite connection returns to
the pool.

Add a regression test that sends an invalid claim request and verifies
a subsequent normal claim still succeeds on the same server instance.
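
The shared BEGIN IMMEDIATE wrapper might look like this (a sketch: `Conn` is a stand-in trait for the real SQLite handle, and errors are simplified to `String`):

```rust
// Minimal stand-in for a pooled SQLite connection.
trait Conn {
    fn execute(&mut self, sql: &str) -> Result<(), String>;
}

/// Run `body` inside an immediate transaction. On any error, attempt a
/// best-effort rollback before the connection returns to the pool, and
/// surface the original error even if the rollback itself fails.
fn with_immediate_txn<C: Conn, T>(
    conn: &mut C,
    body: impl FnOnce(&mut C) -> Result<T, String>,
) -> Result<T, String> {
    conn.execute("BEGIN IMMEDIATE")?;
    match body(conn) {
        Ok(v) => {
            conn.execute("COMMIT")?;
            Ok(v)
        }
        Err(e) => {
            // Best-effort: ignore a rollback failure, keep the body's error.
            let _ = conn.execute("ROLLBACK");
            Err(e)
        }
    }
}
```

Centralizing this keeps the claim, retry, and workflow-delete paths from each re-implementing rollback handling.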

Remove an unused empty Rust test file that caused cargo fmt and the
end-of-file fixer to disagree during commit validation.
Add shared helpers for BEGIN IMMEDIATE and rollback cleanup so
claim, retry, and workflow delete paths handle rollback failures
consistently.

Also log the full client error chain for complete_job transport
failures so the next reqwest send error shows the underlying socket
cause.
Split claim_jobs_based_on_resources into a read phase and a short
write phase so buffered candidate discovery and response hydration
happen outside BEGIN IMMEDIATE.

Reuse shared selection logic across both phases, keep the bounded
write-phase recheck for status, scheduler, resource fit, and
downstream buffer semantics, and add debug timing for the claim path.

Add a concurrent claim regression that verifies there are no
double-claims and no poisoned follow-up claim or completion calls.
Move downstream buffer planning out of the claim write phase.
Bulk-load downstream families, resource shapes, compute node
capacity, and occupancy during read-side candidate collection,
then keep the write phase focused on confirming ready jobs,
applying local fit checks, and marking jobs pending.

This keeps the optimistic overfetch behavior while reducing
write-lock work and eliminating per-candidate downstream
lookups inside the transaction.
Treat unconstrained downstream families as effectively unbounded,
return 422 for negative claim limits, and use saturating
resource-fit arithmetic so large capacity values cannot wrap.

Cap read-phase claim scanning at 1000 pages and add regression
tests for the unconstrained downstream case, invalid limits,
overflow-safe fit checks, and page-cap behavior.
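
The saturating fit check amounts to comparing the requirement against saturating free capacity instead of adding and risking wraparound (hypothetical signature):

```rust
/// Overflow-safe resource-fit check: compute free capacity with saturating
/// subtraction so inconsistent or very large values cannot wrap, then
/// compare the requirement against it.
fn fits(required: u64, used: u64, capacity: u64) -> bool {
    required <= capacity.saturating_sub(used)
}
```

If `used` somehow exceeds `capacity`, free capacity saturates at zero and nothing fits, rather than wrapping to a huge value.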
Report successful Slurm jobs to complete_job immediately instead of waiting on sacct retries in the runner loop. Keep sacct-based state overrides on the synchronous path for non-zero exits, but move successful-job accounting enrichment into a background worker.

Persist result resource metrics through update_result so the async enrichment path can backfill CPU and memory usage after the job is already complete. Add regression coverage for both the runner sync filter and result metric updates.
Document the paged resource-based claim behavior so backfill
semantics match the implementation.

Add workflow docs for downstream_buffer_multiplier and
slurm_worker_mpi_mode, and note that Slurm sacct enrichment can
arrive shortly after job completion.

Also correct the execution_config mode default in the workflow
formats docs to match the code.
Prevent one-runner-per-node Slurm subtasks from timing out
independently while sibling runners in the same allocation still
have running jobs.

Use existing compute-node scheduler metadata to find sibling
runners and suppress the idle exit only when sibling compute
nodes still own running work.
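
The suppression rule can be sketched as a predicate over sibling running-job counts (names are illustrative, not the real scheduler-metadata API):

```rust
/// A runner in a one-runner-per-node allocation exits on idle timeout only
/// when no sibling compute node in the same allocation still owns running
/// jobs; otherwise the idle exit is suppressed.
fn should_exit_on_idle(idle_timeout_expired: bool, sibling_running_jobs: &[u64]) -> bool {
    idle_timeout_expired && sibling_running_jobs.iter().all(|&n| n == 0)
}
```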
Include completed jobs in the unblocking indexes so the background workflow scan matches the statuses it queries.

Tighten the unblocking result join to the current attempt so retried jobs do not inherit stale failure results from earlier attempts in the same run.

Add regression tests that cover completed-job unblocking and current-attempt result selection.