fix: detect zombie flows with NULL pings and cap suspend timeout#8723
fix: detect zombie flows with NULL pings and cap suspend timeout#8723rubenfiszel wants to merge 1 commit intomainfrom
Conversation
…etention period Flows waiting for a suspended child (e.g. approval step) have running=true but ping=NULL by design. The zombie flow handler required ping IS NOT NULL, so these flows were invisible to detection and could stay in the queue indefinitely, blocking retention cleanup of their completed children. Two fixes: 1. monitor.rs: Add detection for root flows with NULL pings older than the retention period. These are cancelled as zombies with a critical error. 2. worker_flow.rs: Cap suspend_until at the retention period to prevent flows from being parked longer than retention allows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Claude finished @rubenfiszel's task in 4m 12s —— View job Code Review: PR #8723
Overall AssessmentThis is a well-motivated fix for a real production issue (370 zombie flows blocking retention cleanup). The approach is sound: detect and cancel stale NULL-ping flows, and proactively cap suspend timeouts to prevent the problem from recurring. The code follows existing patterns in the codebase. Issues1. Missing
The existing zombie flow query (L3220) explicitly filters Since the second change (worker_flow.rs) caps Consider adding 2. Missing The 4 new 3.
let base_url = BASE_URL.read().await;This acquires a read lock that's held while Looks Good
Testing InstructionsTo verify the suspend timeout cap: create a flow with an approval step where the suspend timeout is set very high (e.g. 86400000 seconds). Set the instance's job retention period to something shorter (e.g. 60 days / 5184000 seconds) via superadmin settings. Run the flow, let it reach the approval step, and check the |
Deploying windmill with
|
| Latest commit: |
012726e
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://e813f362.windmill.pages.dev |
| Branch Preview URL: | https://fix-zombie-flow-null-ping-re.windmill.pages.dev |
There was a problem hiding this comment.
1 issue found across 6 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/src/monitor.rs">
<violation number="1" location="backend/src/monitor.rs:3321">
P3: `j.kind = 'flownode'` can never match here because `j.parent_job IS NULL` is also required, and flownodes always have a parent job. Remove `OR j.kind = 'flownode'` from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
| JOIN v2_job j USING (id) | ||
| LEFT JOIN v2_job_runtime r USING (id) | ||
| WHERE q.running = true | ||
| AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode') |
There was a problem hiding this comment.
P3: j.kind = 'flownode' can never match here because j.parent_job IS NULL is also required, and flownodes always have a parent job. Remove OR j.kind = 'flownode' from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/src/monitor.rs, line 3321:
<comment>`j.kind = 'flownode'` can never match here because `j.parent_job IS NULL` is also required, and flownodes always have a parent job. Remove `OR j.kind = 'flownode'` from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.</comment>
<file context>
@@ -3305,6 +3305,49 @@ Please check your worker logs for more details and feel free to report it to the
+ JOIN v2_job j USING (id)
+ LEFT JOIN v2_job_runtime r USING (id)
+ WHERE q.running = true
+ AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode')
+ AND r.ping IS NULL
+ AND j.parent_job IS NULL
</file context>
| AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode') | |
| AND (j.kind = 'flow' OR j.kind = 'flowpreview') |
Summary
running=true, ping=NULLby design. The zombie flow handler requiredping IS NOT NULL, so these parent flows were invisible to zombie detection and could stay in the queue indefinitely — blocking retention cleanup of their completed children.Changes
monitor.rs: Add a third detection block inhandle_zombie_flowsthat catches root flows withping IS NULLolder than the retention period. These are cancelled as zombies with a critical error report.worker_flow.rs: Capsuspend_untilat the retention period when parking a flow at an approval step. If retention is 60 days, there's no point suspending for 1000 days — the flow's completed children would need cleanup before then anyway.Test plan
cargo checkpassessuspend_untilis capped🤖 Generated with Claude Code
Summary by cubic
Fixes zombie-flow detection by catching running flows with NULL pings and caps suspend timeouts at the retention period. Prevents flows from parking indefinitely and unblocks retention cleanup of completed children.
monitor.rs.suspend_untilto the retention period for approval/suspend steps inworker_flow.rs.Written for commit 012726e. Summary will update on new commits.