
feat: pause/resume workflow execution with checkpointing #548

Draft
dataforxyz wants to merge 34 commits into endorhq:main from dataforxyz:feat/pause-resume

@dataforxyz
Contributor

Summary

  • Add pause/resume lifecycle for rover tasks: when a retryable error occurs (e.g., API credit exhaustion), tasks are paused with a checkpoint instead of failing permanently
  • Introduce rover resume <taskId> command to continue paused/failed tasks from the last checkpoint
  • Add automatic retry scheduling via rover list --watch with exponential backoff
  • Add orphan detection to recover tasks stuck in IN_PROGRESS after container exits
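
The pause-then-resume flow in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `Step`, `runWorkflow`, and the `checkpoint` map are hypothetical names, the `PauseWorkflowError` class is a stand-in for the one the PR introduces, and steps are synchronous only to keep the sketch short.

```typescript
// Stand-in for the PR's PauseWorkflowError; the real class may carry more metadata.
class PauseWorkflowError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PauseWorkflowError";
  }
}

// Hypothetical step shape, for illustration only.
interface Step {
  id: string;
  run: () => string;
}

// Runs steps in order, skipping any whose output is already checkpointed.
// `completed` plays the role of the checkpoint: step id -> step output.
function runWorkflow(steps: Step[], completed: Map<string, string>): void {
  for (const step of steps) {
    if (completed.has(step.id)) continue; // finished before the pause
    // A PauseWorkflowError thrown by run() propagates to the caller,
    // leaving `completed` holding every step that finished so far.
    completed.set(step.id, step.run());
  }
}

// Usage: the second step hits a retryable error once, then succeeds on resume.
let creditsExhausted = true;
const calls: string[] = [];
const steps: Step[] = [
  { id: "plan", run: () => { calls.push("plan"); return "planned"; } },
  {
    id: "implement",
    run: () => {
      if (creditsExhausted) throw new PauseWorkflowError("credits exhausted");
      calls.push("implement");
      return "implemented";
    },
  },
];

const checkpoint = new Map<string, string>();
let paused = false;
try {
  runWorkflow(steps, checkpoint);
} catch (err) {
  if ((err as Error).name !== "PauseWorkflowError") throw err;
  paused = true; // the task would be marked PAUSED at this point
}

creditsExhausted = false; // e.g. credits were topped up
runWorkflow(steps, checkpoint); // resume: "plan" is not run again
```

The point of the sketch is that a retryable error leaves the completed-step record intact, so a later run only executes what is missing.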

Changes across packages

schemas: New PAUSED task status, paused iteration status, AGENT_EXIT_CODE constants, workflow_pause/workflow_resume log events, schema version bump to 1.6

core: Condition evaluator extracted to shared module, IterationStatusManager.pause() with provider tracking, TaskDescriptionManager pause/resume transitions, EPERM handling in launch/launchSync

agent: Checkpoint store with atomic write-then-rename persistence, PauseWorkflowError for retryable errors, signal handling (SIGTERM/SIGINT → pause), resume from checkpoint skipping completed steps, loop progress tracking
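
As a rough illustration of the write-then-rename persistence mentioned for the checkpoint store, the sketch below writes the new content to a temporary file in the same directory and renames it over the target. The `Checkpoint` shape, the function names, and the file names are assumptions made for this example; the PR's actual store is not shown here. An `fsync` before the rename is omitted for brevity.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical checkpoint shape; the PR's real schema may differ.
interface Checkpoint {
  completedSteps: string[];
  pausedAt: string;
}

// Write to a temp file next to the target, then rename over the target.
// A rename within one filesystem replaces the file in a single step on POSIX
// systems, so a reader sees either the old or the new checkpoint, not a
// partially written one.
function saveCheckpoint(path: string, data: Checkpoint): void {
  const tmp = join(dirname(path), `.checkpoint-${process.pid}-${Date.now()}.tmp`);
  writeFileSync(tmp, JSON.stringify(data, null, 2), "utf8");
  renameSync(tmp, path);
}

function loadCheckpoint(path: string): Checkpoint {
  return JSON.parse(readFileSync(path, "utf8")) as Checkpoint;
}

// Usage: save twice, then load the latest state.
const dir = mkdtempSync(join(tmpdir(), "rover-ckpt-"));
const file = join(dir, "checkpoint.json");
saveCheckpoint(file, { completedSteps: ["plan"], pausedAt: "2026-03-01T00:00:00Z" });
saveCheckpoint(file, { completedSteps: ["plan", "implement"], pausedAt: "2026-03-01T00:05:00Z" });
const restored = loadCheckpoint(file);
```

Keeping the temporary file in the same directory as the target matters, since a rename across filesystems is not atomic.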

cli: resume command, resume-helper with file-based locking, retry-scheduler with exponential backoff, orphan-detector for stuck tasks, container inspect error classification, agent provider env var cleanup
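
For readers unfamiliar with exponential backoff, a minimal sketch of the delay calculation a retry scheduler might use is shown below. The base delay, the cap, and the function name are assumptions; the PR text does not state the values the retry-scheduler actually uses.

```typescript
// Hypothetical parameters: 1 s base delay, capped at 60 s.
// The delay doubles with each attempt until it reaches the cap.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Delays for the first eight attempts (attempt numbers start at 0).
const delays = [0, 1, 2, 3, 4, 5, 6, 7].map((attempt) => backoffDelayMs(attempt));
```

Capping the delay keeps a long-paused task from waiting arbitrarily long between automatic resume attempts.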

telemetry: resume_task/resume_task_failed events, schemaVersion field on all events

Test plan

  • Unit tests for checkpoint store (save/load/clear/atomic writes)
  • Unit tests for step executor pause and resume paths
  • Integration tests for full pause → checkpoint → resume → complete lifecycle
  • Unit tests for resume command, resume-helper, retry-scheduler, orphan-detector
  • Unit tests for condition evaluator and iteration status pause transitions
  • E2E tests for resume command with mocked Docker
  • Manual test: trigger credit exhaustion → verify pause → rover resume → completion

🤖 Generated with Claude Code

dataforxyz and others added 30 commits February 27, 2026 14:23
Add rover rebase command as counterpart to merge, with:
- AI-powered conflict resolution via --agent flag
- Parallel conflict resolution with --concurrency
- Context-optimized prompts using git blame and region extraction
- Output sanitization (strip markdown fences, reject unresolved conflicts)
- Truncation prevention for AI-resolved files

Wrap isInstalled() in try-catch so missing binaries return false
instead of throwing. Fix composer installer checksum to use the
correct installer.sig endpoint with SHA-384 instead of phar SHA-256.

Add GitLab as a context provider alongside GitHub:
- GitLab MR context with comments, diffs, and reviewer approval state
- Self-hosted GitLab instance support via GITLAB_HOST
- CLI commands extended: inspect, mcp, push, task accept gitlab: URIs
- includeComments MCP tool updated for fromGitlab parameter

- Fix indentation in setup.ts, remove misplaced variable declarations
- Add command step type execution and loop steps to workflows
- Fix swe-tdd workflow non-interactive mode and input handling
- Add review & fix loop to prevent post-review regressions
- Add checkpoint commits and circle detection to workflow loops
- Add build step and scope review to changed code in swe-tdd
- Refactor: rename squashCheckpointCommits to collapseTaskCommits
- Fix: check loop until condition after full iteration, not each sub-step

The includeUntracked flag was gated on !compareRef, which excluded
untracked files when using --base (since compareRef is the base commit
hash). Since rover agents write files without committing, all task
changes are untracked until merge — so --base diffs appeared empty.

Changed to !options.branch so untracked files are included for both
default and --base modes, but correctly excluded for --branch comparisons.

Also fixed diffStats() not counting untracked file insertions in totals.
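
The gating change described in this commit can be illustrated with a small sketch. The `DiffOptions` shape and the function name are hypothetical; only the `!options.branch` condition comes from the commit message.

```typescript
// Hypothetical options shape: --base and --branch are mutually exclusive modes.
interface DiffOptions {
  base?: string;   // compare against a base commit
  branch?: string; // compare against another branch
}

// Before the fix the flag was derived from !compareRef, which is falsy in
// --base mode as well. Gating on the branch option alone keeps untracked
// files in default and --base diffs and excludes them for --branch.
function shouldIncludeUntracked(options: DiffOptions): boolean {
  return !options.branch;
}
```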

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The inspect command now displays context entries (e.g., --context gitlab:issue/9)
in both human-readable and JSON output. Also populates task.source from context
metadata for feature parity with the deprecated --from-github flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ng, worktree git mount)

- Clamp progress bar fill to prevent negative repeat count
- Use bash and source .profile in command runner for proper PATH setup
- Return directory listing when ACP readTextFile is called on a directory
- Mount parent .git directory in containers for worktree support
- Use image digest instead of tag name for cache invalidation
- Allow restart of stuck ITERATING tasks from interrupted cache init
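
The progress bar fix in the list above guards against `String.prototype.repeat` throwing a `RangeError` on a negative count. A minimal sketch of such a clamp follows; `renderBar`, its parameters, and the bar characters are assumptions for illustration, not the code in the PR.

```typescript
// Clamp the filled cell count to [0, width] so neither repeat() call
// can receive a negative count, even for out-of-range fractions.
function renderBar(fraction: number, width = 20): string {
  const filled = Math.min(width, Math.max(0, Math.round(fraction * width)));
  return "#".repeat(filled) + "-".repeat(width - filled);
}

const overfull = renderBar(1.5, 4);  // fraction above 1
const negative = renderBar(-0.2, 4); // fraction below 0
const half = renderBar(0.5, 4);
```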
Add new event to track failed resume attempts with optional error details.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Log workflow pause events with retryable error metadata for better observability.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Only track eventResumeTask after successful resume. Track eventResumeTaskFailed
on resume failures to properly distinguish success and failure paths.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Use pausedAt timestamp to calculate duration for paused tasks, keeping the
duration fixed at the pause time. Adds test coverage for duration consistency.
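
A minimal sketch of the duration rule described here, assuming a hypothetical `taskDurationMs` helper and a `startedAt` timestamp (neither name is taken from the PR): when `pausedAt` is set, it replaces the current time as the end of the interval.

```typescript
// For a paused task the clock stops at pausedAt; otherwise it runs to `now`.
function taskDurationMs(startedAt: Date, now: Date, pausedAt?: Date): number {
  const end = pausedAt ?? now;
  return Math.max(0, end.getTime() - startedAt.getTime());
}

const startedAt = new Date(0);
const pausedAt = new Date(5000);
// The paused duration stays the same no matter how much later it is read.
const readEarly = taskDurationMs(startedAt, new Date(9000), pausedAt);
const readLate = taskDurationMs(startedAt, new Date(60000), pausedAt);
const running = taskDurationMs(startedAt, new Date(9000));
```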

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Ensure exitWithError is properly awaited to allow telemetry shutdown to complete.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Implement lock file-based concurrency control for resume operations:
- Detect stale locks from dead PIDs and reclaim them
- Prevent concurrent resume attempts on same iteration
- Restore original task status (PAUSED/FAILED) if resume fails
- Add detailed test coverage for lock contention race conditions
- Add verbose logging for lock file read failures
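
The lock behavior listed above can be sketched as follows. This is an illustration under assumptions, not the PR's resume-helper: the lock file here contains only a PID, `tryAcquireLock` and `isProcessAlive` are hypothetical names, and the window between reading and deleting a stale lock is left unguarded.

```typescript
import { mkdtempSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Signal 0 performs the existence and permission checks without sending
// anything. EPERM means the process exists but belongs to another user.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Returns true if this process now holds the lock.
function tryAcquireLock(lockPath: string): boolean {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" fails with EEXIST when the file already exists,
      // so only one process can create the lock.
      writeFileSync(lockPath, String(process.pid), { flag: "wx" });
      return true;
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
    }
    const owner = Number.parseInt(readFileSync(lockPath, "utf8"), 10);
    if (Number.isInteger(owner) && owner > 0 && isProcessAlive(owner)) {
      return false; // held by a live process
    }
    unlinkSync(lockPath); // stale: owner is gone or unreadable, retry once
  }
  return false;
}

// Usage
const lockPath = join(mkdtempSync(join(tmpdir(), "rover-lock-")), "resume.lock");
const first = tryAcquireLock(lockPath);  // lock is free
const second = tryAcquireLock(lockPath); // owner is this live process
writeFileSync(lockPath, "999999999");    // a PID assumed not to exist
const reclaimed = tryAcquireLock(lockPath);
```

A PID check alone cannot tell a live owner from an unrelated process that reused the same PID, which is why an age limit on the lock is a useful complement.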

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Display user-friendly message when task reaches max auto-retries, directing them
to use 'rover resume' for manual retry. Adds test coverage for this behavior.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Add tests verifying checkpoint store retains in-memory data even when file
persistence fails, ensuring state is not lost during I/O errors.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Add comprehensive tests for:
- Loop with maxIterations=0 never executing body
- PauseWorkflowError propagating through loops without being caught
- Transient error detection excluding generic timeout strings

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Add tests for:
- Signal handler cleanup after successful completion
- Signal handler cleanup after paused exit
- Telemetry logging of workflow pause with retryable error context

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Add tests for:
- Clearing provider field on resume
- Tracking new provider on re-pause
- Backward compatibility with pause operations without provider

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

- Signal handler exits FAILED when checkpoint save fails, preventing
  the CLI from scheduling a resume with no checkpoint data
- Block `rover iterate` on PAUSED tasks to prevent state corruption
- Clear pausedAt on resume (IN_PROGRESS/ITERATING transitions) so
  stale timestamps don't persist after successful completion
- Replace activeStepsOutput mutable closure with context.stepsOutput
  passed through OnStepComplete, eliminating implicit ordering dependency
- Restore pre-restart task status on failure instead of resetting to NEW
- Remove redundant clearTimer call in retry scheduler after success
- Remove dead return in resume command try block
- Tighten && detection in condition evaluator to whitespace-bounded
- Add markPaused to orphan detector mock and test exit code 2 path
- Add TaskDescriptionManager pause/resume lifecycle unit tests
- Add isContainerMissingInspectError unit tests
- Document hardcoded checkpoint container path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dataforxyz and others added 4 commits March 1, 2026 04:55
- Fix orphan detection throttle in watch mode using mutable ref instead
  of options field that was lost between recursive listCommand calls
- Clear pausedAt on terminal status transitions (FAILED, COMPLETED,
  MERGED, PUSHED) to prevent stale timestamp confusion
- Add 5-minute timeout guard to isRestartStartupInFlight so crashed
  restarts don't permanently block orphan detection
- Return structured ResumeResult from resumeTask instead of boolean
- Fix acp-client test that never exercised the error path
- Add SIGINT signal handling test alongside existing SIGTERM test
- Add PAUSED task restart rejection test with resume tip
- Add dedicated resume-lock.ts unit tests with real filesystem
- Add pausedAt clearing tests for all terminal transitions
- Add startup timeout tests for orphan detector
- Stricter taskId parsing, iterationLogsPath in resume, os.ts spread fix
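
The 5-minute guard on `isRestartStartupInFlight` from the list above might look roughly like the sketch below. The function name and `STARTUP_TIMEOUT_MS` appear in the PR text; the signature, the `restartStartedAt` parameter, and the body are assumptions made for illustration.

```typescript
// 5 minutes, per the commit message.
const STARTUP_TIMEOUT_MS = 5 * 60 * 1000;

// A restart counts as "in flight" only while it is younger than the timeout.
// After that, orphan detection may treat the task as stuck.
function isRestartStartupInFlight(
  restartStartedAt: number | undefined, // epoch ms; hypothetical field
  now: number,
): boolean {
  if (restartStartedAt === undefined) return false;
  return now - restartStartedAt < STARTUP_TIMEOUT_MS;
}

const started = 1_000_000;
const stillStarting = isRestartStartupInFlight(started, started + 60_000);
const crashedRestart = isRestartStartupInFlight(started, started + 6 * 60_000);
```

Without the timeout, a restart that crashed before clearing its marker would block orphan detection indefinitely.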

Co-Authored-By: Claude <noreply@anthropic.com>

…ansitions

Clear stale failure metadata on checkpoint resume to avoid misleading
persisted state. Retry transient ACP thrown exceptions (not just failed
results). Add quiet mode to RetryScheduler/resumeTask for JSON output.
Clear pausedAt/error on status reset. Add workflow_resume log event.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Resume lock: add timestamp to lock file to guard against PID reuse;
  locks older than 30 minutes are treated as stale regardless of PID
- Status column: increase width from 16 to 20 to fit longer provider names
- Condition evaluator: detect && without requiring surrounding whitespace
- Checkpoint store: add addCompletedStep() to avoid double-copy overhead
- Command runner: deduplicate placeholder warnings across loop iterations
- Orphan detector: log rejected Promise.allSettled results in verbose mode
- Retry scheduler: document control flow invariants in re-schedule paths
- Podman: normalizeUserInfo returns new object instead of mutating input
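
The resume lock staleness rule from the first item in this list can be sketched as a pure function. `LOCK_STALENESS_TIMEOUT_MS` is named in the PR and the 30-minute value comes from the commit message; the `LockInfo` shape, the `isLockStale` name, and the injected `isAlive` callback are assumptions made for this example.

```typescript
// 30 minutes, per the commit message.
const LOCK_STALENESS_TIMEOUT_MS = 30 * 60 * 1000;

// Hypothetical lock file content: owner PID plus creation time (epoch ms).
interface LockInfo {
  pid: number;
  createdAt: number;
}

// A lock is stale if it is too old (regardless of PID, which guards against
// PID reuse) or if its owning process no longer exists.
function isLockStale(
  lock: LockInfo,
  now: number,
  isAlive: (pid: number) => boolean,
): boolean {
  if (now - lock.createdAt > LOCK_STALENESS_TIMEOUT_MS) return true;
  return !isAlive(lock.pid);
}

const lock: LockInfo = { pid: 4242, createdAt: 0 };
const freshAndAlive = isLockStale(lock, 60_000, () => true);
const freshButDead = isLockStale(lock, 60_000, () => false);
const oldButAlive = isLockStale(lock, 31 * 60_000, () => true);
```

Passing the liveness check in as a callback keeps the rule testable without touching real processes.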

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix test flakiness: wrap step-executor fake timers in try/finally,
  move process.kill spy cleanup to afterEach in resume-helper tests,
  replace real setTimeout delays with deterministic mocks in
  iteration-status tests
- Add missing test coverage: addCompletedStep upsert logic,
  getCompletedStep copy isolation, collectNestedStepIds for nested
  structures, ACP mode SIGTERM/SIGINT signal handling, already_resuming
  status in resume command, empty condition string, || inside values
- Fix condition evaluator: && and || detection now only matches logical
  operators between clauses (not inside values), preventing false
  positives when values contain these characters
- Harden stale lock reclamation: read lock content before delete to
  narrow the race window where another process could lose its lock
- Remove VERBOSE gating on orphan-detector rejected promise logging
  so unexpected errors are never silently swallowed
- Export MAX_AUTO_RETRIES, LOCK_STALENESS_TIMEOUT_MS, STARTUP_TIMEOUT_MS
  as named constants for visibility and future configurability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dataforxyz marked this pull request as draft on March 2, 2026 22:59