Skip to content

fix: runner async completion, workflow restore data loss, HA workflow progression#3979

Open
cursor[bot] wants to merge 1 commit into
developfrom
cursor/critical-bug-investigation-72ae
Open

fix: runner async completion, workflow restore data loss, HA workflow progression#3979
cursor[bot] wants to merge 1 commit into
developfrom
cursor/critical-bug-investigation-72ae

Conversation

@cursor

@cursor cursor Bot commented Jun 19, 2026

Copy link
Copy Markdown

Summary

Daily critical-bug investigation found three high-severity correctness issues in recent runner/workflow/HA changes and fixed them with minimal, targeted patches.

1. Runner job pool marks async executors successful too early

Impact: With Kubernetes (or any async) executor, the runner reported success to the server as soon as the Pod was dispatched, while work was still running. Workflow runs could advance prematurely; runner.one_off could exit early.

Root cause: TaskRunner.run() already honors Job.Async(), but services/runners/job_pool.go did not — it always set TaskSuccessStatus when Run() returned without error.

Fix: After a successful Run(), skip auto-completion when executor.Async() is true. Added unit tests via extracted applyJobRunResult.

2. Workflow restore silently drops data on OSS builds

Impact: Restoring a Pro project backup containing workflows on an OSS build returned success but persisted zero workflow definitions.

Root cause: The OSS WorkflowStoreImpl stub returned (zero, nil) from CreateWorkflowTemplate.

Fix: Return ErrWorkflowsProOnly so restore fails loudly instead of silently losing workflow data.

3. HA duplicate finalize skips workflow progression

Impact: In HA mode, if a node crashed after persisting task.end but before HandleWorkflowTaskCompletion, a duplicate finalize on another node released pool state but never progressed the workflow run — leaving it stuck.

Root cause: finalizeRemoteTaskLocked early-returned when End was already set, intentionally skipping duplicate finalization/autorun but also skipping workflow progression.

Fix: Call HandleWorkflowTaskCompletion on that early-return path (workflow service must treat it as idempotent).

Validation

  • go test ./services/runners/... ./services/tasks/... ./services/project/...
  • New tests:
    • TestApplyJobRunResult_AsyncExecutorKeepsRunningStatus
    • TestApplyJobRunResult_SyncExecutorMarksSuccess
    • TestTaskPool_FinalizeRemoteTask_HA_ProgressesWorkflowOnDuplicateFinalize
Open in Web View Automation 

…low progression

- Runner job pool: respect Executor.Async() after Run returns so Kubernetes
  (and other async) executors are not falsely marked successful while still
  running on the cluster.
- OSS workflow store stub: return ErrWorkflowsProOnly from CreateWorkflowTemplate
  instead of silently dropping workflow definitions during project restore.
- HA finalize: when duplicate finalization observes End already persisted,
  still call HandleWorkflowTaskCompletion so a crash between saveStatus and
  workflow progression does not leave runs stuck.

Co-authored-by: Denis Gukov <fiftin@outlook.com>
@fiftin fiftin marked this pull request as ready for review June 21, 2026 16:32

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security review

Outcome: No medium, high, or critical vulnerabilities identified in the added or modified code.

Scope reviewed: pro/db/sql/workflow.go, services/runners/job_pool.go, services/tasks/TaskPool.go (plus related test changes).

Prior findings: No unresolved automation review threads on this PR.

Change assessment

  1. applyJobRunResult async guard — Prevents async executors (e.g. Kubernetes dispatch) from being marked successful before the workload actually completes. This closes a correctness bug that could prematurely release pool slots and advance workflows; the change is a hardening fix, not a new exposure.

  2. HA duplicate-finalize workflow progressionHandleWorkflowTaskCompletion on the End != nil backstop path is gated on winning the cluster-wide TryFinalize lock and refreshing task state from the DB. The workflow service is documented to treat re-progression as idempotent; this addresses a crash-recovery gap rather than introducing attacker-controlled input.

  3. ErrWorkflowsProOnly on OSS workflow restore — Surfaces an explicit error when workflow persistence is attempted without Pro. Restore remains behind admin (or NonAdminCanCreateProject) authorization; no new data exposure or bypass.

No inline findings to report.

Open in Web View Automation 

Sent by Cursor Automation: Find vulnerabilities

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da119e36c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread pro/db/sql/workflow.go

func (d *WorkflowStoreImpl) CreateWorkflowTemplate(workflow db.WorkflowTemplate) (res db.WorkflowTemplate, err error) {
return
return res, ErrWorkflowsProOnly

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject Pro workflow backups before restoring objects

For OSS builds this error is reached from BackupWorkflow.Restore, but BackupFormat.Restore creates the project/user and restores all non-workflow entities before looping over backup.Workflows last (services/project/restore.go lines 623-729). Importing a Pro backup with workflows will now return an error while leaving a partially restored project behind, so users still get a corrupted restore and have to clean it up manually; preflight unsupported workflows before CreateProject or roll the restore back when this sentinel is returned.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant