Skip to content

fix(tasks): complete finalization on reconciler race and notify workflow on template stop#3980

Open
cursor[bot] wants to merge 1 commit into
developfrom
cursor/critical-bug-investigation-286a
Open

fix(tasks): complete finalization on reconciler race and notify workflow on template stop#3980
cursor[bot] wants to merge 1 commit into
developfrom
cursor/critical-bug-investigation-286a

Conversation

@cursor

@cursor cursor Bot commented Jun 20, 2026

Copy link
Copy Markdown

Summary

Two critical correctness bugs found in recent task-pool / workflow changes on develop.

Bug 1: HA pool state leak when reconciler wins finalize race

Impact: In HA mode, when the runner-task reconciler wins the cluster finalize lock but the DB already has a terminal status (runner report won the race to persist status), failTaskRunnerLost returned early without releasing running/active/claim state. Tasks could permanently consume parallel-task capacity until restart. End timestamp, autorun children, and workflow progression were also skipped.

Root cause: Early return on tsk.Task.Status.IsFinished() after winning TryFinalize, assuming another path would clean up — but the other path had already lost the lock and bailed.

Fix:

  • Call finalizeRemoteTaskLocked when status is already finished
  • Apply the End != nil early-exit guard in non-HA mode too
  • Harden requeueTaskRunnerOffline with a DB re-check before mutating and rollback on persist failure

Bug 2: Workflow runs stuck after template stop

Impact: Stopping a template bulk-updates waiting tasks to stopped without going through finishRun, so HandleWorkflowTaskCompletion was never called. Workflow runs with queued-but-not-yet-started nodes could remain stuck in running/approval until manual intervention.

Root cause: StopTasksByTemplate fast-path uses SetWaitingTasksToStopped and dequeues without the normal task lifecycle completion path introduced by the Workflows feature.

Fix: Snapshot waiting workflow tasks before bulk stop and notify the workflow service after stopping.

Validation

go test ./services/tasks/... -run 'TestFailTaskRunnerLost_HA|TestRequeueTaskRunnerOffline|TestTaskPool_StopTasksByTemplate_NotifiesWorkflow|TestTaskPool_FinalizeRemoteTask_HA'
Open in Web View Automation 

…low on template stop

When failTaskRunnerLost wins the cluster finalize lock but the DB already
has a terminal status, complete finalization instead of returning early.
This releases stale running/active pool state and persists End when the
runner report lost the finalize race.

StopTasksByTemplate bulk-stops waiting tasks without going through
finishRun, so notify the workflow service for waiting workflow tasks.
Also clear the dispatching flag when the dispatch goroutine exits and
harden requeueTaskRunnerOffline against concurrent running reports.

Co-authored-by: Denis Gukov <fiftin@outlook.com>
@fiftin fiftin marked this pull request as ready for review June 21, 2026 16:32

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security review — PR #3980

Outcome: No medium, high, or critical vulnerabilities found.

Reviewed the task-pool HA race fixes, runner-reconciler hardening, and workflow notification on template stop. Prior automation threads: none to reconcile.

Scope reviewed

  • finalizeRemoteTaskLocked / failTaskRunnerLost — cluster finalize races and pool-state cleanup
  • requeueTaskRunnerOffline — HA DB re-check and persist-failure rollback
  • StopTasksByTemplate / notifyWorkflowTaskStopped — workflow completion after bulk stop
  • runTaskdispatching flag cleanup via defer

Why no findings

  1. User-facing stop path is authorizedStopAllTasks sits behind authentication, ProjectMiddleware, and CanManageProjectResources; projectID / templateID come from validated context, not raw request input.

  2. Workflow notification is an internal lifecycle hooknotifyWorkflowTaskStopped delegates to HandleWorkflowTaskCompletion using DB-backed task rows after an authorized stop. It does not expose a new unauthenticated entry point or bypass project/template scoping.

  3. Reconciler changes are defensive — HA re-checks and rollback reduce mis-routing/stale-state risk; they are not attacker-triggerable without existing runner/cluster trust.

  4. No new injection or secret-exposure surfacesSetWaitingTasksToStopped uses parameterized SQL; logging fields are numeric IDs only.

Slack summary

PR #3980 (task-pool HA / workflow stop fixes): clean — no security findings at medium+ severity.

Open in Web View Automation 

Sent by Cursor Automation: Find vulnerabilities

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99a2f245a4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +276 to +278
if tsk.Task.Status != task_logger.TaskStartingStatus &&
tsk.Task.Status != task_logger.TaskWaitingStatus {
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Finalize terminal reports seen by the requeue recheck

In HA, if this second DB refresh observes success/error/stopped because the runner reported a terminal status while this goroutine holds TryFinalize, this branch just returns. The runner report's FinalizeRemoteTask loses that same finalize lock and exits, so no later path writes End, emits workflow completion, or releases running/claim state; handle finished statuses here like failTaskRunnerLost instead of treating them like running.

Useful? React with 👍 / 👎.

}).Warn("can't reload stopped workflow task")
continue
}
p.notifyWorkflowTaskStopped(task)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard workflow notifications on a terminal stop

When SetWaitingTasksToStopped fails, or when a snapped task no longer matches the waiting update, this newly added call still reloads the task and passes it to HandleWorkflowTaskCompletion without checking that Status is actually stopped. In that context a workflow node can advance from a task that remains waiting/running; only notify after confirming the reload is terminal and was actually stopped.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant