fix(tasks): complete finalization on reconciler race and notify workflow on template stop#3980
fix(tasks): complete finalization on reconciler race and notify workflow on template stop#3980cursor[bot] wants to merge 1 commit into
Conversation
…low on template stop When failTaskRunnerLost wins the cluster finalize lock but the DB already has a terminal status, complete finalization instead of returning early. This releases stale running/active pool state and persists End when the runner report lost the finalize race. StopTasksByTemplate bulk-stops waiting tasks without going through finishRun, so notify the workflow service for waiting workflow tasks. Also clear the dispatching flag when the dispatch goroutine exits and harden requeueTaskRunnerOffline against concurrent running reports. Co-authored-by: Denis Gukov <fiftin@outlook.com>
There was a problem hiding this comment.
Security review — PR #3980
Outcome: No medium, high, or critical vulnerabilities found.
Reviewed the task-pool HA race fixes, runner-reconciler hardening, and workflow notification on template stop. Prior automation threads: none to reconcile.
Scope reviewed
finalizeRemoteTaskLocked/failTaskRunnerLost— cluster finalize races and pool-state cleanuprequeueTaskRunnerOffline— HA DB re-check and persist-failure rollbackStopTasksByTemplate/notifyWorkflowTaskStopped— workflow completion after bulk stoprunTask—dispatchingflag cleanup viadefer
Why no findings
-
User-facing stop path is authorized —
StopAllTaskssits behind authentication,ProjectMiddleware, andCanManageProjectResources;projectID/templateIDcome from validated context, not raw request input. -
Workflow notification is an internal lifecycle hook —
notifyWorkflowTaskStoppeddelegates toHandleWorkflowTaskCompletionusing DB-backed task rows after an authorized stop. It does not expose a new unauthenticated entry point or bypass project/template scoping. -
Reconciler changes are defensive — HA re-checks and rollback reduce mis-routing/stale-state risk; they are not attacker-triggerable without existing runner/cluster trust.
-
No new injection or secret-exposure surfaces —
SetWaitingTasksToStoppeduses parameterized SQL; logging fields are numeric IDs only.
Slack summary
PR #3980 (task-pool HA / workflow stop fixes): clean — no security findings at medium+ severity.
Sent by Cursor Automation: Find vulnerabilities
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99a2f245a4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if tsk.Task.Status != task_logger.TaskStartingStatus && | ||
| tsk.Task.Status != task_logger.TaskWaitingStatus { | ||
| return |
There was a problem hiding this comment.
Finalize terminal reports seen by the requeue recheck
In HA, if this second DB refresh observes success/error/stopped because the runner reported a terminal status while this goroutine holds TryFinalize, this branch just returns. The runner report's FinalizeRemoteTask loses that same finalize lock and exits, so no later path writes End, emits workflow completion, or releases running/claim state; handle finished statuses here like failTaskRunnerLost instead of treating them like running.
Useful? React with 👍 / 👎.
| }).Warn("can't reload stopped workflow task") | ||
| continue | ||
| } | ||
| p.notifyWorkflowTaskStopped(task) |
There was a problem hiding this comment.
Guard workflow notifications on a terminal stop
When SetWaitingTasksToStopped fails, or when a snapped task no longer matches the waiting update, this newly added call still reloads the task and passes it to HandleWorkflowTaskCompletion without checking that Status is actually stopped. In that context a workflow node can advance from a task that remains waiting/running; only notify after confirming the reload is terminal and was actually stopped.
Useful? React with 👍 / 👎.


Summary
Two critical correctness bugs found in recent task-pool / workflow changes on
develop.Bug 1: HA pool state leak when reconciler wins finalize race
Impact: In HA mode, when the runner-task reconciler wins the cluster finalize lock but the DB already has a terminal status (runner report won the race to persist status),
failTaskRunnerLostreturned early without releasing running/active/claim state. Tasks could permanently consume parallel-task capacity until restart. End timestamp, autorun children, and workflow progression were also skipped.Root cause: Early return on
tsk.Task.Status.IsFinished()after winningTryFinalize, assuming another path would clean up — but the other path had already lost the lock and bailed.Fix:
finalizeRemoteTaskLockedwhen status is already finishedEnd != nilearly-exit guard in non-HA mode toorequeueTaskRunnerOfflinewith a DB re-check before mutating and rollback on persist failureBug 2: Workflow runs stuck after template stop
Impact: Stopping a template bulk-updates waiting tasks to
stoppedwithout going throughfinishRun, soHandleWorkflowTaskCompletionwas never called. Workflow runs with queued-but-not-yet-started nodes could remain stuck inrunning/approvaluntil manual intervention.Root cause:
StopTasksByTemplatefast-path usesSetWaitingTasksToStoppedand dequeues without the normal task lifecycle completion path introduced by the Workflows feature.Fix: Snapshot waiting workflow tasks before bulk stop and notify the workflow service after stopping.
Validation