fix(tui): job queue hardening — partial failure rollback + batch health checks#5

Open
seilk wants to merge 1 commit into main from fix/job-queue-hardening-r1-r3


@seilk seilk commented Feb 17, 2026

Post-merge hardening for Job Queue (feature/job-queue-v1)

Addresses the recommended items from Capybara review R2:

R1: Partial Failure Rollback in executeJobRemote

  • One-to-one mode: if GPU N fails mid-launch, all previously started tmux sessions are now killed (rollback)
  • Prevents orphaned sessions consuming GPU resources on partial failure
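
For reviewers, the rollback pattern described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this PR: `launch_one_to_one`, `start_session`, `kill_session`, and `LaunchError` are hypothetical names, and the session start/kill steps are injected as callables so the logic can be exercised without tmux. In practice the kill step would presumably wrap something like `tmux kill-session -t <name>`.

```python
from typing import Callable, Iterable, List


class LaunchError(RuntimeError):
    """Hypothetical error type: a launch failed and earlier sessions were rolled back."""

    def __init__(self, gpu: int, rolled_back: List[str]):
        super().__init__(f"launch failed on GPU {gpu}; rolled back {rolled_back}")
        self.gpu = gpu
        self.rolled_back = rolled_back


def launch_one_to_one(
    gpus: Iterable[int],
    start_session: Callable[[int], str],
    kill_session: Callable[[str], None],
) -> List[str]:
    """Start one session per GPU; if any start fails, kill the ones already started.

    start_session(gpu) returns the session name; kill_session(name) stops it.
    Both are assumptions made for this sketch.
    """
    started: List[str] = []
    for gpu in gpus:
        try:
            started.append(start_session(gpu))
        except Exception as exc:
            rolled_back: List[str] = []
            # Roll back in reverse launch order; keep going if one kill fails
            # so a single stuck session does not leave the rest orphaned.
            for name in reversed(started):
                try:
                    kill_session(name)
                    rolled_back.append(name)
                except Exception:
                    pass
            raise LaunchError(gpu, rolled_back) from exc
    return started
```

Raising after the rollback, rather than returning a partial list, keeps the caller from treating a half-launched job as running.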

R3: Batch Health Checks in watchRunningJobs

  • Replaced per-job Python process spawning with batchCheckJobsAlive()
  • All running jobs checked in a single Python invocation per watchdog cycle
  • Reduces overhead from N cold Python starts to 1 (significant with 10+ running jobs)
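
A minimal sketch of what the Python side of a batched check might look like, assuming jobs are tracked by PID on a POSIX host (the PR does not state how liveness is determined, so this is an assumption). The names `batch_check_jobs_alive` and `check_job_alive` are hypothetical Python analogues of `batchCheckJobsAlive()` and `checkJobAlive`, not the actual implementation.

```python
import os
from typing import Dict, Iterable


def batch_check_jobs_alive(pids: Iterable[int]) -> Dict[int, bool]:
    """Check every PID in one interpreter run instead of one process per job.

    Assumes POSIX semantics: signal 0 performs the existence check
    without delivering a signal to the target process.
    """
    result: Dict[int, bool] = {}
    for pid in pids:
        if pid <= 0:
            # Non-positive values address process groups in kill(2); never probe them.
            result[pid] = False
            continue
        try:
            os.kill(pid, 0)
            result[pid] = True
        except ProcessLookupError:
            result[pid] = False
        except PermissionError:
            # The process exists but belongs to another user.
            result[pid] = True
    return result


def check_job_alive(pid: int) -> bool:
    """Single-job check kept as a thin wrapper over the batch function."""
    return batch_check_jobs_alive([pid])[pid]
```

With this shape, a watchdog cycle pays the interpreter start-up cost once regardless of how many jobs are running, and single-job callers keep their existing interface through the wrapper.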

Testing

  • 34/34 existing tests passing
  • No behavioral change for single-job scenarios (checkJobAlive preserved as thin wrapper)

Ref: CAPYBARA_REVIEW_R2.md (in raven worktree, now removed)

R1: executeJobRemote now kills already-launched tmux sessions on
    partial failure in one-to-one mode, preventing orphaned sessions.

R3: Replace per-job Python process spawning with batchCheckJobsAlive()
    that checks all running jobs in a single Python invocation.
    Reduces overhead from N processes to 1 per watchdog cycle.