fix(runners,tasks): preserve terminal status and complete reconciler finalization#3992
Draft
cursor[bot] wants to merge 1 commit into
Draft
fix(runners,tasks): preserve terminal status and complete reconciler finalization#3992cursor[bot] wants to merge 1 commit into
cursor[bot] wants to merge 1 commit into
Conversation
…finalization Two correctness bugs introduced or exposed by recent runner lifecycle changes: 1. Runner client (job_pool): commit 2a156b3 moved the IsFinished() guard into the success-only branch after job.Run returns. When applyTerminatedJobs emergency-stops a job (sets stopped) and Run() unwinds with an error, the error path overwrote the correct terminal status with failed. Restore a shared finalizeAfterRun helper that checks IsFinished() before applying status. 2. Task reconciler (HA): when failTaskRunnerLost wins the cluster finalize lock but the DB already has a terminal status, the early return leaked running/ active pool state and skipped End, autorun, and workflow progression. Call finalizeRemoteTaskLocked instead of bailing. Also harden requeueTaskRunnerOffline with a pre-mutation DB re-check and rollback on persist failure. Co-authored-by: Denis Gukov <fiftin@outlook.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug 1: Stopped tasks reported as failed on runner client
Impact: When the server tells a runner to emergency-stop a job via
terminated_jobs(task stopped, reassigned, or reconciler action),applyTerminatedJobssets the job tostoppedand kills the process. Ifjob.Run()then returns with an error, the post-Run error path overwrotestoppedwithfailed. The next progress report could mark a user-stopped task as failed, triggering wrong alerts and workflow outcomes.Root cause: Commit
2a156b30(fix(runners): logs) moved theIsFinished()early return into the success-only branch afterjob.Run().Fix: Introduce
finalizeAfterRun()that checksIsFinished()before applying terminal status, used for both error and success paths.Bug 2: HA reconciler leaks pool state when task already finished
Impact: In HA mode, when
failTaskRunnerLostwins the cluster finalize lock but the DB already has a terminal status (runner reported success on another node), the early return leaked running/active pool state. This blocks parallel task capacity, runner slots, autorun children, and workflow progression until restart.Root cause:
failTaskRunnerLostreturned immediately onIsFinished()without callingfinalizeRemoteTaskLocked.Fix: Call
finalizeRemoteTaskLockedwhen the task is already finished. Also hardenrequeueTaskRunnerOfflinewith a pre-mutation DB re-check and in-memory rollback on persist failure.Validation
go test ./services/runners/... ./services/tasks/... -run 'TestRunningJob_finalizeAfterRun|TestRequeueTaskRunnerOffline|TestFailTaskRunnerLost_HA'