fix(tui): job queue hardening — partial failure rollback + batch health checks by seilk · Pull Request #5 · seilk/opensmi

seilk · 2026-02-17T19:51:48Z

Post-merge hardening for Job Queue (feature/job-queue-v1)

Addresses the recommended items from Capybara review R2:

R1: Partial Failure Rollback in `executeJobRemote`

One-to-one mode: if GPU N fails mid-launch, all previously started tmux sessions are now killed (rollback)
Prevents orphaned sessions consuming GPU resources on partial failure
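
The rollback described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual `executeJobRemote` code: the function name `launch_one_to_one` and the injected `start_session` / `kill_session` callables are hypothetical stand-ins for whatever the PR uses to start and kill tmux sessions.

```python
from typing import Callable, List


def launch_one_to_one(
    gpu_ids: List[int],
    start_session: Callable[[int], str],   # hypothetical: starts a session, returns its name
    kill_session: Callable[[str], None],   # hypothetical: kills a session by name
) -> List[str]:
    """Start one session per GPU; if any launch fails, kill those already started."""
    started: List[str] = []
    try:
        for gpu in gpu_ids:
            started.append(start_session(gpu))
    except Exception:
        # Roll back in reverse launch order so no session is left orphaned.
        for name in reversed(started):
            try:
                kill_session(name)
            except Exception:
                pass  # best effort: keep cleaning up the remaining sessions
        raise  # surface the original launch failure to the caller
    return started
```

In a real deployment, `kill_session` would presumably wrap something like `tmux kill-session -t <name>` on the remote host. Re-raising after cleanup keeps the original failure visible to the caller instead of masking it with the rollback.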

R3: Batch Health Checks in `watchRunningJobs`

Replaced per-job Python process spawning with batchCheckJobsAlive()
All running jobs checked in a single Python invocation per watchdog cycle
Reduces overhead from N cold Python starts to 1 (significant with 10+ running jobs)
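
As a rough sketch of the batching idea, the Python side might accept every running job at once and return one result map. The PR does not show how liveness is actually determined, so this example assumes each job is tracked by a PID on a POSIX host; `pid_alive` and `batch_check_jobs_alive` are hypothetical names, not the PR's `batchCheckJobsAlive()` itself.

```python
import os
from typing import Dict


def pid_alive(pid: int) -> bool:
    """Assumption: a job counts as alive if its PID still exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def batch_check_jobs_alive(jobs: Dict[str, int]) -> Dict[str, bool]:
    """Check every job in one call: maps job id -> alive flag."""
    return {job_id: pid_alive(pid) for job_id, pid in jobs.items()}
```

With this shape, the watchdog pays the interpreter start-up cost once per cycle and passes all job ids in a single payload, rather than once per job.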

Testing

34/34 existing tests passing
No behavioral change for single-job scenarios (checkJobAlive preserved as thin wrapper)

Ref: CAPYBARA_REVIEW_R2.md (in raven worktree, now removed)

R1: executeJobRemote now kills already-launched tmux sessions on partial failure in one-to-one mode, preventing orphaned sessions. R3: Replace per-job Python process spawning with batchCheckJobsAlive() that checks all running jobs in a single Python invocation. Reduces overhead from N processes to 1 per watchdog cycle.

seilk wants to merge 1 commit into main from fix/job-queue-hardening-r1-r3
