feat: pause/resume workflow execution with checkpointing #548
Draft
dataforxyz wants to merge 34 commits into
Add rover rebase command as counterpart to merge, with:
- AI-powered conflict resolution via --agent flag
- Parallel conflict resolution with --concurrency
- Context-optimized prompts using git blame and region extraction
- Output sanitization (strip markdown fences, reject unresolved conflicts)
- Truncation prevention for AI-resolved files
Wrap isInstalled() in try-catch so missing binaries return false instead of throwing. Fix composer installer checksum to use the correct installer.sig endpoint with SHA-384 instead of phar SHA-256.
Add GitLab as a context provider alongside GitHub:
- GitLab MR context with comments, diffs, and reviewer approval state
- Self-hosted GitLab instance support via GITLAB_HOST
- CLI commands extended: inspect, mcp, push, task accept gitlab: URIs
- includeComments MCP tool updated for fromGitlab parameter
- Fix indentation in setup.ts, remove misplaced variable declarations

- Add command step type execution and loop steps to workflows
- Fix swe-tdd workflow non-interactive mode and input handling
- Add review & fix loop to prevent post-review regressions
- Add checkpoint commits and circle detection to workflow loops
- Add build step and scope review to changed code in swe-tdd
- Refactor: rename squashCheckpointCommits to collapseTaskCommits
- Fix: check loop until condition after full iteration, not each sub-step
The includeUntracked flag was gated on !compareRef, which excluded untracked files when using --base (since compareRef is the base commit hash). Since rover agents write files without committing, all task changes are untracked until merge — so --base diffs appeared empty. Changed to !options.branch so untracked files are included for both default and --base modes, but correctly excluded for --branch comparisons. Also fixed diffStats() not counting untracked file insertions in totals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The inspect command now displays context entries (e.g., --context gitlab:issue/9) in both human-readable and JSON output. Also populates task.source from context metadata for feature parity with the deprecated --from-github flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ng, worktree git mount)
- Clamp progress bar fill to prevent negative repeat count
- Use bash and source .profile in command runner for proper PATH setup
- Return directory listing when ACP readTextFile is called on a directory
- Mount parent .git directory in containers for worktree support

- Use image digest instead of tag name for cache invalidation
- Allow restart of stuck ITERATING tasks from interrupted cache init
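The progress bar fix can be pictured with a small sketch. `String.prototype.repeat` throws a `RangeError` for a negative count, so the fill is clamped before repeating. The function name, the bar characters, and the fraction-based signature are hypothetical and are not taken from the rover code base.

```typescript
// Hypothetical helper: renders a fixed-width bar from a completion fraction.
function renderProgressBar(fraction: number, width: number): string {
  // Normalize the width so repeat() never sees a negative or fractional count.
  const w = Math.max(0, Math.floor(width));
  // Treat NaN/Infinity as "no progress" and clamp everything else to [0, 1].
  const safe = Number.isFinite(fraction) ? Math.min(1, Math.max(0, fraction)) : 0;
  const filled = Math.round(safe * w);
  return "#".repeat(filled) + "-".repeat(w - filled);
}

console.log(renderProgressBar(-0.5, 10)); // out-of-range input, still renders
console.log(renderProgressBar(0.5, 4));
```

Clamping at the rendering boundary keeps the caller free to pass raw ratios, including ones that overshoot when a step count is revised mid-run.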
Add new event to track failed resume attempts with optional error details. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Log workflow pause events with retryable error metadata for better observability. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Only track eventResumeTask after successful resume. Track eventResumeTaskFailed on resume failures to properly distinguish success and failure paths. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Use pausedAt timestamp to calculate duration for paused tasks, keeping the duration fixed at the pause time. Adds test coverage for duration consistency. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
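A minimal sketch of the duration rule, assuming ISO-8601 timestamps. Only `pausedAt` is named in this PR; `startedAt`, `completedAt`, and the function itself are hypothetical stand-ins for whatever the task schema actually stores.

```typescript
// Hypothetical shape; only `pausedAt` appears in the PR text.
interface TaskTimes {
  startedAt: string;
  pausedAt?: string;
  completedAt?: string;
}

// Duration stops growing once the task is paused or completed.
function taskDurationMs(times: TaskTimes, now: Date): number {
  const start = Date.parse(times.startedAt);
  const end = times.completedAt
    ? Date.parse(times.completedAt)
    : times.pausedAt
      ? Date.parse(times.pausedAt)
      : now.getTime();
  return Math.max(0, end - start);
}

const paused: TaskTimes = {
  startedAt: "2024-01-01T00:00:00Z",
  pausedAt: "2024-01-01T00:05:00Z",
};
// The result is the same no matter when the list is rendered.
console.log(taskDurationMs(paused, new Date("2024-01-01T01:00:00Z")));
console.log(taskDurationMs(paused, new Date("2024-01-02T00:00:00Z")));
```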
Ensure exitWithError is properly awaited to allow telemetry shutdown to complete. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Implement lock file-based concurrency control for resume operations:
- Detect stale locks from dead PIDs and reclaim them
- Prevent concurrent resume attempts on same iteration
- Restore original task status (PAUSED/FAILED) if resume fails
- Add detailed test coverage for lock contention race conditions
- Add verbose logging for lock file read failures
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Display user-friendly message when task reaches max auto-retries, directing them to use 'rover resume' for manual retry. Adds test coverage for this behavior. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
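The PR summary describes a retry scheduler with exponential backoff, and a later commit exports `MAX_AUTO_RETRIES`. How the two might interact can be sketched as below. The constant's value, the base delay, the cap, and the function name are all assumptions; the real scheduler may add jitter or use different limits.

```typescript
// Assumed value; the PR names the constant but does not state its value.
const MAX_AUTO_RETRIES = 5;

// Returns the delay before the given retry attempt (0-based),
// or null once automatic retries are exhausted.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number | null {
  if (attempt >= MAX_AUTO_RETRIES) return null;
  return Math.min(capMs, baseMs * 2 ** attempt);
}

for (let attempt = 0; attempt <= MAX_AUTO_RETRIES; attempt++) {
  const delay = backoffDelayMs(attempt);
  console.log(
    delay === null
      ? "Max auto-retries reached; run 'rover resume' to retry manually"
      : `attempt ${attempt}: wait ${delay} ms`,
  );
}
```

Returning `null` instead of throwing lets the caller decide how to surface the "retry manually" hint.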
Add tests verifying checkpoint store retains in-memory data even when file persistence fails, ensuring state is not lost during i/o errors. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
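The behaviour under test, combined with the atomic write-then-rename persistence mentioned in the PR summary, might look roughly like this. It is a sketch under assumptions: the class shape, the JSON layout, and the signatures of `addCompletedStep` and `getCompletedStep` are guesses, and only those two method names appear in the commit messages.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type StepOutput = Record<string, unknown>;

// Hypothetical store: memory is the source of truth, disk is best-effort.
class CheckpointStore {
  private completed = new Map<string, StepOutput>();
  private filePath: string;

  constructor(filePath: string) {
    this.filePath = filePath;
  }

  // Records the step in memory first, then tries to persist.
  // Returns false when the disk write failed; the in-memory entry is kept.
  addCompletedStep(stepId: string, output: StepOutput): boolean {
    this.completed.set(stepId, { ...output });
    return this.persist();
  }

  // Returns a shallow copy so callers cannot mutate stored state.
  getCompletedStep(stepId: string): StepOutput | undefined {
    const found = this.completed.get(stepId);
    return found ? { ...found } : undefined;
  }

  // Write to a temp file in the same directory, then rename over the target,
  // so readers never observe a half-written checkpoint.
  private persist(): boolean {
    const tmp = `${this.filePath}.${process.pid}.tmp`;
    try {
      fs.writeFileSync(tmp, JSON.stringify(Object.fromEntries(this.completed)));
      fs.renameSync(tmp, this.filePath);
      return true;
    } catch {
      try {
        fs.rmSync(tmp, { force: true }); // best-effort cleanup
      } catch {
        // ignore cleanup errors
      }
      return false;
    }
  }
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "checkpoint-"));
const store = new CheckpointStore(path.join(dir, "checkpoint.json"));
store.addCompletedStep("build", { exitCode: 0 });

// Persistence fails here because the parent directory does not exist,
// but the step is still available from memory.
const broken = new CheckpointStore(path.join(dir, "missing", "checkpoint.json"));
const persisted = broken.addCompletedStep("build", { exitCode: 0 });
console.log(persisted, broken.getCompletedStep("build"));
```

The temp file lives next to the target because a rename is only atomic within a single filesystem.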
Add comprehensive tests for:
- Loop with maxIterations=0 never executing body
- PauseWorkflowError propagating through loops without being caught
- Transient error detection excluding generic timeout strings
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Add tests for:
- Signal handler cleanup after successful completion
- Signal handler cleanup after paused exit
- Telemetry logging of workflow pause with retryable error context
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Add tests for:
- Clearing provider field on resume
- Tracking new provider on re-pause
- Backward compatibility with pause operations without provider
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

- Signal handler exits FAILED when checkpoint save fails, preventing the CLI from scheduling a resume with no checkpoint data
- Block `rover iterate` on PAUSED tasks to prevent state corruption
- Clear pausedAt on resume (IN_PROGRESS/ITERATING transitions) so stale timestamps don't persist after successful completion
- Replace activeStepsOutput mutable closure with context.stepsOutput passed through OnStepComplete, eliminating implicit ordering dependency
- Restore pre-restart task status on failure instead of resetting to NEW
- Remove redundant clearTimer call in retry scheduler after success
- Remove dead return in resume command try block
- Tighten && detection in condition evaluator to whitespace-bounded
- Add markPaused to orphan detector mock and test exit code 2 path
- Add TaskDescriptionManager pause/resume lifecycle unit tests
- Add isContainerMissingInspectError unit tests
- Document hardcoded checkpoint container path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix orphan detection throttle in watch mode using mutable ref instead of options field that was lost between recursive listCommand calls
- Clear pausedAt on terminal status transitions (FAILED, COMPLETED, MERGED, PUSHED) to prevent stale timestamp confusion
- Add 5-minute timeout guard to isRestartStartupInFlight so crashed restarts don't permanently block orphan detection
- Return structured ResumeResult from resumeTask instead of boolean
- Fix acp-client test that never exercised the error path
- Add SIGINT signal handling test alongside existing SIGTERM test
- Add PAUSED task restart rejection test with resume tip
- Add dedicated resume-lock.ts unit tests with real filesystem
- Add pausedAt clearing tests for all terminal transitions
- Add startup timeout tests for orphan detector
- Stricter taskId parsing, iterationLogsPath in resume, os.ts spread fix
Co-Authored-By: Claude <noreply@anthropic.com>
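Returning a structured result instead of a boolean lets callers distinguish "failed" from "someone else is already resuming". One way to model that is a discriminated union, sketched below. Only the `ResumeResult` name and the `already_resuming` status appear in this PR; the other variants, the field names, and the messages are hypothetical.

```typescript
// Hypothetical variants; only "already_resuming" is named in the PR.
type ResumeResult =
  | { status: "resumed" }
  | { status: "already_resuming" }
  | { status: "failed"; error: string };

// The switch is exhaustive, so adding a variant later becomes a compile error here.
function describeResume(result: ResumeResult): string {
  switch (result.status) {
    case "resumed":
      return "Task resumed from last checkpoint";
    case "already_resuming":
      return "Another process is already resuming this task";
    case "failed":
      return `Resume failed: ${result.error}`;
  }
}

console.log(describeResume({ status: "failed", error: "container missing" }));
```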
…ansitions Clear stale failure metadata on checkpoint resume to avoid misleading persisted state. Retry transient ACP thrown exceptions (not just failed results). Add quiet mode to RetryScheduler/resumeTask for JSON output. Clear pausedAt/error on status reset. Add workflow_resume log event. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Resume lock: add timestamp to lock file to guard against PID reuse; locks older than 30 minutes are treated as stale regardless of PID
- Status column: increase width from 16 to 20 to fit longer provider names
- Condition evaluator: detect && without requiring surrounding whitespace
- Checkpoint store: add addCompletedStep() to avoid double-copy overhead
- Command runner: deduplicate placeholder warnings across loop iterations
- Orphan detector: log rejected Promise.allSettled results in verbose mode
- Retry scheduler: document control flow invariants in re-schedule paths
- Podman: normalizeUserInfo returns new object instead of mutating input
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix test flakiness: wrap step-executor fake timers in try/finally, move process.kill spy cleanup to afterEach in resume-helper tests, replace real setTimeout delays with deterministic mocks in iteration-status tests
- Add missing test coverage: addCompletedStep upsert logic, getCompletedStep copy isolation, collectNestedStepIds for nested structures, ACP mode SIGTERM/SIGINT signal handling, already_resuming status in resume command, empty condition string, || inside values
- Fix condition evaluator: && and || detection now only matches logical operators between clauses (not inside values), preventing false positives when values contain these characters
- Harden stale lock reclamation: read lock content before delete to narrow the race window where another process could lose its lock
- Remove VERBOSE gating on orphan-detector rejected promise logging so unexpected errors are never silently swallowed
- Export MAX_AUTO_RETRIES, LOCK_STALENESS_TIMEOUT_MS, STARTUP_TIMEOUT_MS as named constants for visibility and future configurability
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
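The staleness rule for the resume lock can be expressed as a pure function. This is a sketch, not the code in resume-lock.ts: the lock shape and function name are hypothetical, and the liveness check is injected so the rule can be tested without real processes. The 30-minute threshold comes from the commit message; in Node, a common liveness probe is `process.kill(pid, 0)`, which sends no signal and throws if the process does not exist.

```typescript
// 30 minutes, as stated in the commit message.
const LOCK_STALENESS_TIMEOUT_MS = 30 * 60 * 1000;

// Hypothetical lock file contents.
interface ResumeLock {
  pid: number;
  createdAt: number; // epoch milliseconds
}

// A lock is stale when it is too old (guards against PID reuse)
// or when its owning process is no longer alive.
function isLockStale(
  lock: ResumeLock,
  now: number,
  isPidAlive: (pid: number) => boolean,
): boolean {
  if (now - lock.createdAt > LOCK_STALENESS_TIMEOUT_MS) return true;
  return !isPidAlive(lock.pid);
}

const lock: ResumeLock = { pid: 4242, createdAt: 0 };
const alive = () => true;
const dead = () => false;

console.log(isLockStale(lock, 60_000, alive)); // fresh lock, live owner
console.log(isLockStale(lock, 60_000, dead)); // owner gone
console.log(isLockStale(lock, 31 * 60_000, alive)); // too old, even if the PID is alive
```

Checking age before liveness means a recycled PID cannot keep an abandoned lock alive indefinitely.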
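The condition evaluator fix, matching `&&` and `||` only between clauses, can be illustrated with a quote-aware splitter. This sketch assumes that values which may contain operator characters are wrapped in single or double quotes; the evaluator's actual grammar is not shown in this PR, and the function name and return shape are hypothetical.

```typescript
type LogicalOp = "&&" | "||";

// Splits a condition on top-level && / || while ignoring operators inside quotes.
function splitClauses(condition: string): { clauses: string[]; ops: LogicalOp[] } {
  const clauses: string[] = [];
  const ops: LogicalOp[] = [];
  let quote = ""; // the quote character currently open, or "" when outside quotes
  let current = "";

  for (let i = 0; i < condition.length; i++) {
    const ch = condition.charAt(i);
    if (quote !== "") {
      // Inside a quoted value: copy everything, close on the matching quote.
      if (ch === quote) quote = "";
      current += ch;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
      current += ch;
    } else if ((ch === "&" || ch === "|") && condition.charAt(i + 1) === ch) {
      // A doubled & or | outside quotes separates two clauses.
      clauses.push(current.trim());
      ops.push((ch + ch) as LogicalOp);
      current = "";
      i++; // skip the second operator character
    } else {
      current += ch;
    }
  }
  clauses.push(current.trim());
  return { clauses, ops };
}

console.log(splitClauses(`a == "x&&y"&&b == 'p || q'`));
```

Note that the splitter does not require whitespace around the operator, in line with the earlier commit that dropped that requirement.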
Summary
- `rover resume <taskId>` command to continue paused/failed tasks from the last checkpoint
- `rover list --watch` with exponential backoff

Changes across packages
- schemas: New `PAUSED` task status, `paused` iteration status, `AGENT_EXIT_CODE` constants, `workflow_pause`/`workflow_resume` log events, schema version bump to 1.6
- core: Condition evaluator extracted to shared module, `IterationStatusManager.pause()` with provider tracking, `TaskDescriptionManager` pause/resume transitions, EPERM handling in `launch`/`launchSync`
- agent: Checkpoint store with atomic write-then-rename persistence, `PauseWorkflowError` for retryable errors, signal handling (SIGTERM/SIGINT → pause), resume from checkpoint skipping completed steps, loop progress tracking
- cli: `resume` command, resume-helper with file-based locking, retry-scheduler with exponential backoff, orphan-detector for stuck tasks, container inspect error classification, agent provider env var cleanup
- telemetry: `resume_task`/`resume_task_failed` events, `schemaVersion` field on all events

Test plan
- `rover resume` → completion

🤖 Generated with Claude Code