fix(runners,tasks): preserve terminal status and complete reconciler finalization#3992

Draft: cursor[bot] wants to merge 1 commit into develop from cursor/critical-bug-investigation-a6d2

cursor[bot] commented Jun 26, 2026


Bug 1: Stopped tasks reported as failed on runner client

Impact: When the server tells a runner to emergency-stop a job via terminated_jobs (task stopped, reassigned, or reconciler action), applyTerminatedJobs sets the job to stopped and kills the process. If job.Run() then returned an error, the post-Run error path overwrote stopped with failed. The next progress report could then mark a user-stopped task as failed, triggering incorrect alerts and workflow outcomes.

Root cause: Commit 2a156b30 (fix(runners): logs) moved the IsFinished() early return into the success-only branch after job.Run().

Fix: Introduce finalizeAfterRun() that checks IsFinished() before applying terminal status, used for both error and success paths.
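
The guard described above can be illustrated with a minimal sketch. This is not the code from this PR: the `Status` type, the `runningJob` struct, and the `errKilled` value are simplified, hypothetical stand-ins, and only the names finalizeAfterRun and IsFinished() come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Status is a hypothetical stand-in for the project's task status type.
type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusSuccess Status = "success"
	StatusFailed  Status = "failed"
)

// IsFinished reports whether the status is terminal.
func (s Status) IsFinished() bool {
	return s == StatusStopped || s == StatusSuccess || s == StatusFailed
}

// errKilled simulates the error Run() returns after the process is killed.
var errKilled = errors.New("signal: killed")

// runningJob is a simplified, hypothetical model of a job tracked by the runner.
type runningJob struct {
	mu     sync.Mutex
	status Status
}

// finalizeAfterRun derives a terminal status from runErr, but only if the job
// has not already reached one (for example via an emergency stop).
// It is shared by the error and success paths.
func (j *runningJob) finalizeAfterRun(runErr error) Status {
	j.mu.Lock()
	defer j.mu.Unlock()
	if j.status.IsFinished() {
		return j.status // preserve the status set while Run() was unwinding
	}
	if runErr != nil {
		j.status = StatusFailed
	} else {
		j.status = StatusSuccess
	}
	return j.status
}

func main() {
	stopped := &runningJob{status: StatusStopped}
	fmt.Println(stopped.finalizeAfterRun(errKilled)) // stopped

	crashed := &runningJob{status: StatusRunning}
	fmt.Println(crashed.finalizeAfterRun(errKilled)) // failed

	ok := &runningJob{status: StatusRunning}
	fmt.Println(ok.finalizeAfterRun(nil)) // success
}
```

Routing both branches through one helper means the IsFinished() check cannot be dropped from one path by a later refactor of the other.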

Bug 2: HA reconciler leaks pool state when task already finished

Impact: In HA mode, when failTaskRunnerLost wins the cluster finalize lock but the DB already holds a terminal status (for example, the runner reported success on another node), the early return leaked running/active pool state. The leaked state blocks parallel task capacity, runner slots, autorun children, and workflow progression until restart.

Root cause: failTaskRunnerLost returned immediately on IsFinished() without calling finalizeRemoteTaskLocked.

Fix: Call finalizeRemoteTaskLocked when the task is already finished. Also harden requeueTaskRunnerOffline with a pre-mutation DB re-check and in-memory rollback on persist failure.
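
A rough sketch of both reconciler changes, under heavy simplification: the `node` struct, its maps, the `persist` helper, and `errPersist` are hypothetical, and the cluster finalize lock is assumed to be held by the caller. What requeueTaskRunnerOffline does when the re-check finds a finished task is also an assumption here (it finalizes locally). Only the function names failTaskRunnerLost, finalizeRemoteTaskLocked, and requeueTaskRunnerOffline come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the project's task status type.
type Status string

const (
	StatusWaiting Status = "waiting"
	StatusRunning Status = "running"
	StatusSuccess Status = "success"
	StatusFailed  Status = "failed"
)

// IsFinished reports whether the status is terminal.
func (s Status) IsFinished() bool { return s == StatusSuccess || s == StatusFailed }

// errPersist simulates a failed DB write.
var errPersist = errors.New("persist failed")

// node models one HA node: db stands in for the shared database,
// mem and active for this node's in-memory pool state.
type node struct {
	db        map[int]Status
	mem       map[int]Status
	active    map[int]bool
	failWrite bool
}

func (n *node) persist(id int, s Status) error {
	if n.failWrite {
		return errPersist
	}
	n.db[id] = s
	return nil
}

// finalizeRemoteTaskLocked adopts the DB status and releases local pool state.
func (n *node) finalizeRemoteTaskLocked(id int) {
	n.mem[id] = n.db[id]
	delete(n.active, id)
}

// failTaskRunnerLost assumes the caller holds the cluster finalize lock.
func (n *node) failTaskRunnerLost(id int) error {
	if !n.db[id].IsFinished() {
		if err := n.persist(id, StatusFailed); err != nil {
			return err
		}
	}
	// Finalize even when another node already wrote a terminal status;
	// returning early here is what leaked the running/active state.
	n.finalizeRemoteTaskLocked(id)
	return nil
}

// requeueTaskRunnerOffline re-checks the DB before mutating anything and
// rolls back the in-memory change if the write fails.
func (n *node) requeueTaskRunnerOffline(id int) error {
	if n.db[id].IsFinished() {
		n.finalizeRemoteTaskLocked(id) // assumption: finished tasks are finalized, not requeued
		return nil
	}
	prev := n.mem[id]
	n.mem[id] = StatusWaiting
	if err := n.persist(id, StatusWaiting); err != nil {
		n.mem[id] = prev // in-memory rollback on persist failure
		return err
	}
	return nil
}

func main() {
	n := &node{
		db:     map[int]Status{1: StatusSuccess, 2: StatusRunning},
		mem:    map[int]Status{1: StatusRunning, 2: StatusRunning},
		active: map[int]bool{1: true, 2: true},
	}
	_ = n.failTaskRunnerLost(1)
	fmt.Println(n.db[1], n.active[1]) // success false

	n.failWrite = true
	err := n.requeueTaskRunnerOffline(2)
	fmt.Println(err, n.mem[2]) // persist failed running
}
```

In the first case the terminal status written by the other node is preserved and the local slot is freed. In the second, the failed write leaves the in-memory view identical to what it was before the call.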

Validation

  • go test ./services/runners/... ./services/tasks/... -run 'TestRunningJob_finalizeAfterRun|TestRequeueTaskRunnerOffline|TestFailTaskRunnerLost_HA'

…finalization

Two correctness bugs introduced or exposed by recent runner lifecycle changes:

1. Runner client (job_pool): commit 2a156b3 moved the IsFinished() guard into
   the success-only branch after job.Run returns. When applyTerminatedJobs
   emergency-stops a job (sets stopped) and Run() unwinds with an error, the
   error path overwrote the correct terminal status with failed. Restore a shared
   finalizeAfterRun helper that checks IsFinished() before applying status.

2. Task reconciler (HA): when failTaskRunnerLost wins the cluster finalize lock
   but the DB already has a terminal status, the early return leaked running/
   active pool state and skipped End, autorun, and workflow progression. Call
   finalizeRemoteTaskLocked instead of bailing. Also harden requeueTaskRunnerOffline
   with a pre-mutation DB re-check and rollback on persist failure.

Co-authored-by: Denis Gukov <fiftin@outlook.com>