Add Slurm worker MPI mode, claim paging, and downstream backpressure #260
Open
Record Slurm job completion before any sacct lookup so short jobs do not inherit accounting latency. Queue completed steps by allocation and persist slurm_stats from a background worker after resources are freed.
Add an opt-in direct-mode setting for launching one worker per node with outer srun --mpi=none, and thread that through the workflow execution config and Slurm submission path. Refactor resource-based job claiming to scan ready jobs in pages so lower-priority jobs can backfill leftover capacity when large higher-priority jobs do not fit inside the first SQL window.
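The paged scan described above can be illustrated with a minimal sketch. It assumes a slice of ready jobs already sorted by priority stands in for the SQL window; `ReadyJob`, `claim_in_pages`, and the two resource fields are hypothetical names chosen for illustration, not the identifiers used in this PR.

```rust
/// Hypothetical, simplified job record; the real schema is not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
struct ReadyJob {
    id: u64,
    cpus: u32,
    memory_gb: u32,
}

/// Walk `ready` (assumed sorted by priority, highest first) one page at a
/// time and claim every job that fits in the remaining capacity. Jobs that
/// do not fit are skipped rather than ending the scan, so smaller jobs on
/// later pages can backfill the leftover capacity.
fn claim_in_pages(
    ready: &[ReadyJob],
    page_size: usize,
    max_pages: usize,
    mut free_cpus: u32,
    mut free_memory_gb: u32,
) -> Vec<u64> {
    let mut claimed = Vec::new();
    if page_size == 0 {
        // `chunks(0)` would panic, so treat a zero page size as "claim nothing".
        return claimed;
    }
    for page in ready.chunks(page_size).take(max_pages) {
        for job in page {
            if job.cpus <= free_cpus && job.memory_gb <= free_memory_gb {
                free_cpus -= job.cpus;
                free_memory_gb -= job.memory_gb;
                claimed.push(job.id);
            }
        }
    }
    claimed
}
```

With a page size of 2 and two large jobs at the head of the queue, a single-page scan claims nothing, while a multi-page scan reaches the small job on the second page.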
Add execution_config.downstream_buffer_multiplier and enforce it in resource-based job claiming once downstream runners have registered active compute-node capacity.

Use direct dependency families plus live compute_node resources to cap upstream admission at downstream capacity times the configured multiplier, while leaving the limiter inactive before downstream runners start to avoid bootstrap deadlocks.

Cover the new config parsing and validation, plus claim-time activation and throttling behavior, with integration tests.
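As a rough illustration of the cap, the sketch below assumes downstream capacity is counted in job slots and that the multiplier is an integer. Both are assumptions, and the function names are hypothetical rather than taken from the code in this PR.

```rust
/// Upstream admission cap: downstream capacity times the multiplier.
/// Returns `None` (limiter inactive) while no downstream runner has
/// registered capacity, which is what avoids the bootstrap deadlock.
fn upstream_admission_cap(downstream_slots: Option<u64>, multiplier: u64) -> Option<u64> {
    match downstream_slots {
        None | Some(0) => None,
        Some(slots) => Some(slots.saturating_mul(multiplier)),
    }
}

/// How many more upstream jobs may be admitted given the jobs already
/// buffered for the downstream stage. An inactive limiter imposes no bound.
fn upstream_slots_available(cap: Option<u64>, buffered: u64) -> u64 {
    match cap {
        None => u64::MAX,
        Some(cap) => cap.saturating_sub(buffered),
    }
}
```

For example, 4 downstream slots with a multiplier of 2 give a cap of 8, so with 5 jobs already buffered only 3 more upstream jobs would be admitted.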
Release local runner resources as soon as Slurm job processes exit, but defer final job completion until sacct data has been collected and applied. Restore sacct-derived state overrides and result backfill before complete_job(), and make batched sacct collection keep retrying until real accounting fields arrive. Running pre-commit --all-files also normalized formatting, trailing whitespace, and EOF handling across tracked files in the repo.
Wrap the resource-based claim path so any error after BEGIN IMMEDIATE gets a best-effort rollback before the SQLite connection returns to the pool. Add a regression test that sends an invalid claim request and verifies a subsequent normal claim still succeeds on the same server instance. Remove an unused empty Rust test file that caused cargo fmt and the end-of-file fixer to disagree during commit validation.
Add shared helpers for BEGIN IMMEDIATE and rollback cleanup so claim, retry, and workflow delete paths handle rollback failures consistently. Also log the full client error chain for complete_job transport failures so the next reqwest send error shows the underlying socket cause.
Split claim_jobs_based_on_resources into a read phase and a short write phase so buffered candidate discovery and response hydration happen outside BEGIN IMMEDIATE. Reuse shared selection logic across both phases, keep the bounded write-phase recheck for status, scheduler, resource fit, and downstream buffer semantics, and add debug timing for the claim path. Add a concurrent claim regression that verifies there are no double-claims and no poisoned follow-up claim or completion calls.
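The read-then-recheck shape can be sketched with an in-memory map in place of SQLite. This is only an analogy under stated assumptions: a `RwLock` write guard stands in for `BEGIN IMMEDIATE`, and `Status` and `claim` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Ready,
    Pending,
}

/// Claim up to `limit` ready jobs in two phases.
fn claim(store: &RwLock<HashMap<u64, Status>>, limit: usize) -> Vec<u64> {
    // Read phase: discover candidates without holding the write lock.
    let mut candidates: Vec<u64> = {
        let jobs = store.read().unwrap();
        jobs.iter()
            .filter(|(_, status)| **status == Status::Ready)
            .map(|(id, _)| *id)
            .collect()
    };
    candidates.sort_unstable();

    // Write phase: short critical section that rechecks each candidate,
    // because another claimer may have taken it since the read phase.
    let mut jobs = store.write().unwrap();
    let mut claimed = Vec::new();
    for id in candidates {
        if claimed.len() == limit {
            break;
        }
        if let Some(status) = jobs.get_mut(&id) {
            if *status == Status::Ready {
                *status = Status::Pending;
                claimed.push(id);
            }
        }
    }
    claimed
}
```

Because the recheck happens under the write lock, two concurrent callers can see the same candidate during the read phase without both claiming it.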
Move downstream buffer planning out of the claim write phase. Bulk-load downstream families, resource shapes, compute node capacity, and occupancy during read-side candidate collection, then keep the write phase focused on confirming ready jobs, applying local fit checks, and marking jobs pending. This keeps the optimistic overfetch behavior while reducing write-lock work and eliminating per-candidate downstream lookups inside the transaction.
Treat unconstrained downstream families as effectively unbounded, return 422 for negative claim limits, and use saturating resource-fit arithmetic so large capacity values cannot wrap. Cap read-phase claim scanning at 1000 pages and add regression tests for the unconstrained downstream case, invalid limits, overflow-safe fit checks, and page-cap behavior.
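To show why saturating arithmetic matters for the fit check, here is a minimal sketch. The `fits` function and its `u64` resource units are assumptions for illustration, not the PR's actual signature.

```rust
/// Returns true when `requested` fits in what is left of `capacity`.
/// `saturating_sub` clamps at zero, so an over-committed node reports
/// no free capacity instead of wrapping around to a huge value.
fn fits(used: u64, requested: u64, capacity: u64) -> bool {
    capacity.saturating_sub(used) >= requested
}

/// The naive variant, shown only for contrast: wrapping addition lets a
/// very large `used` value wrap to a small total and pass the check.
fn fits_wrapping(used: u64, requested: u64, capacity: u64) -> bool {
    used.wrapping_add(requested) <= capacity
}
```

With `used = u64::MAX`, `requested = 2`, and `capacity = 8`, the wrapping variant accepts the job because the sum wraps to 1, while the saturating variant rejects it.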
Report successful Slurm jobs to complete_job immediately instead of waiting on sacct retries in the runner loop. Keep sacct-based state overrides on the synchronous path for non-zero exits, but move successful-job accounting enrichment into a background worker.

Persist result resource metrics through update_result so the async enrichment path can backfill CPU and memory usage after the job is already complete. Add regression coverage for both the runner sync filter and result metric updates.
Document the paged resource-based claim behavior so backfill semantics match the implementation. Add workflow docs for downstream_buffer_multiplier and slurm_worker_mpi_mode, and note that Slurm sacct enrichment can arrive shortly after job completion. Also correct the execution_config mode default in the workflow formats docs to match the code.
Prevent one-runner-per-node Slurm subtasks from timing out independently while sibling runners in the same allocation still have running jobs. Use existing compute-node scheduler metadata to find sibling runners and suppress the idle exit only when sibling compute nodes still own running work.
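The idle-exit rule can be summarized in a small predicate. This sketch assumes the runner already knows its own running-job count and the counts for sibling compute nodes in the same allocation; the function name and parameters are hypothetical.

```rust
/// Decide whether a one-runner-per-node worker should exit for idleness.
/// It exits only when it has no running jobs of its own, its idle timeout
/// has elapsed, and no sibling compute node still owns running work.
fn should_exit_idle(
    own_running_jobs: u32,
    idle_secs: u64,
    idle_timeout_secs: u64,
    sibling_running_jobs: &[u32],
) -> bool {
    if own_running_jobs > 0 || idle_secs < idle_timeout_secs {
        return false;
    }
    sibling_running_jobs.iter().all(|&running| running == 0)
}
```

A runner that has been idle past its timeout therefore stays alive while any sibling still reports running jobs, and exits once every sibling is idle too.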
Include completed jobs in the unblocking indexes so the background workflow scan matches the statuses it queries.

Tighten the unblocking result join to the current attempt so retried jobs do not inherit stale failure results from earlier attempts in the same run.

Add regression tests that cover completed-job unblocking and current-attempt result selection.
Summary
- Move `sacct` collection off the job completion path so completions are recorded immediately and accounting is backfilled asynchronously on a background worker
- Add `execution_config.slurm_worker_mpi_mode` so the outer direct-mode worker-launch `srun` can opt into `--mpi=none`
- Add `execution_config.downstream_buffer_multiplier` so upstream setup-style stages stop admitting more work once an active downstream stage reaches a live-capacity-based buffer limit
- Keep the downstream limiter inactive until downstream runners register `compute_node` capacity, avoiding startup deadlocks
- Document `srun` usage with `--overlap`, `--nodes`, and `--ntasks`
- Document the `sacct` collection flow and the resulting Slurm accounting fields

Validation
- `cargo fmt -- --check`
- `cargo clippy --all --all-targets --all-features -- -D warnings`
- `dprint check`
- `cargo test --test test_claim_jobs_based_on_resources --test test_execution_config -- --nocapture`