fix: runner async completion, workflow restore data loss, HA workflow progression#3979
fix: runner async completion, workflow restore data loss, HA workflow progression#3979cursor[bot] wants to merge 1 commit into
Conversation
…low progression - Runner job pool: respect Executor.Async() after Run returns so Kubernetes (and other async) executors are not falsely marked successful while still running on the cluster. - OSS workflow store stub: return ErrWorkflowsProOnly from CreateWorkflowTemplate instead of silently dropping workflow definitions during project restore. - HA finalize: when duplicate finalization observes End already persisted, still call HandleWorkflowTaskCompletion so a crash between saveStatus and workflow progression does not leave runs stuck. Co-authored-by: Denis Gukov <fiftin@outlook.com>
There was a problem hiding this comment.
Security review
Outcome: No medium, high, or critical vulnerabilities identified in the added or modified code.
Scope reviewed: pro/db/sql/workflow.go, services/runners/job_pool.go, services/tasks/TaskPool.go (plus related test changes).
Prior findings: No unresolved automation review threads on this PR.
Change assessment
-
applyJobRunResultasync guard — Prevents async executors (e.g. Kubernetes dispatch) from being marked successful before the workload actually completes. This closes a correctness bug that could prematurely release pool slots and advance workflows; the change is a hardening fix, not a new exposure. -
HA duplicate-finalize workflow progression —
HandleWorkflowTaskCompletionon theEnd != nilbackstop path is gated on winning the cluster-wideTryFinalizelock and refreshing task state from the DB. The workflow service is documented to treat re-progression as idempotent; this addresses a crash-recovery gap rather than introducing attacker-controlled input. -
ErrWorkflowsProOnlyon OSS workflow restore — Surfaces an explicit error when workflow persistence is attempted without Pro. Restore remains behind admin (orNonAdminCanCreateProject) authorization; no new data exposure or bypass.
No inline findings to report.
Sent by Cursor Automation: Find vulnerabilities
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: da119e36c7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| func (d *WorkflowStoreImpl) CreateWorkflowTemplate(workflow db.WorkflowTemplate) (res db.WorkflowTemplate, err error) { | ||
| return | ||
| return res, ErrWorkflowsProOnly |
There was a problem hiding this comment.
Reject Pro workflow backups before restoring objects
For OSS builds this error is reached from BackupWorkflow.Restore, but BackupFormat.Restore creates the project/user and restores all non-workflow entities before looping over backup.Workflows last (services/project/restore.go lines 623-729). Importing a Pro backup with workflows will now return an error while leaving a partially restored project behind, so users still get a corrupted restore and have to clean it up manually; preflight unsupported workflows before CreateProject or roll the restore back when this sentinel is returned.
Useful? React with 👍 / 👎.


Summary
Daily critical-bug investigation found three high-severity correctness issues in recent runner/workflow/HA changes and fixed them with minimal, targeted patches.
1. Runner job pool marks async executors successful too early
Impact: With Kubernetes (or any async) executor, the runner reported
successto the server as soon as the Pod was dispatched, while work was still running. Workflow runs could advance prematurely;runner.one_offcould exit early.Root cause:
TaskRunner.run()already honorsJob.Async(), butservices/runners/job_pool.godid not — it always setTaskSuccessStatuswhenRun()returned without error.Fix: After a successful
Run(), skip auto-completion whenexecutor.Async()is true. Added unit tests via extractedapplyJobRunResult.2. Workflow restore silently drops data on OSS builds
Impact: Restoring a Pro project backup containing workflows on an OSS build returned success but persisted zero workflow definitions.
Root cause: The OSS
WorkflowStoreImplstub returned(zero, nil)fromCreateWorkflowTemplate.Fix: Return
ErrWorkflowsProOnlyso restore fails loudly instead of silently losing workflow data.3. HA duplicate finalize skips workflow progression
Impact: In HA mode, if a node crashed after persisting
task.endbut beforeHandleWorkflowTaskCompletion, a duplicate finalize on another node released pool state but never progressed the workflow run — leaving it stuck.Root cause:
finalizeRemoteTaskLockedearly-returned whenEndwas already set, intentionally skipping duplicate finalization/autorun but also skipping workflow progression.Fix: Call
HandleWorkflowTaskCompletionon that early-return path (workflow service must treat it as idempotent).Validation
go test ./services/runners/... ./services/tasks/... ./services/project/...TestApplyJobRunResult_AsyncExecutorKeepsRunningStatusTestApplyJobRunResult_SyncExecutorMarksSuccessTestTaskPool_FinalizeRemoteTask_HA_ProgressesWorkflowOnDuplicateFinalize