fix: detect zombie flows with NULL pings and cap suspend timeout by rubenfiszel · Pull Request #8723 · windmill-labs/windmill

rubenfiszel · 2026-04-04T13:13:49Z

Summary

Flows waiting for a suspended child (e.g. approval step) have running=true, ping=NULL by design. The zombie flow handler required ping IS NOT NULL, so these parent flows were invisible to zombie detection and could stay in the queue indefinitely — blocking retention cleanup of their completed children.
On the demo cloud instance, 370 instances of a single flow with a ~1000-day approval timeout accumulated since Nov 2025, protecting ~1,355 completed jobs from being cleaned up despite a 60-day retention period.

Changes

monitor.rs: Add a third detection block in handle_zombie_flows that catches root flows with ping IS NULL older than the retention period. These are cancelled as zombies with a critical error report.
worker_flow.rs: Cap suspend_until at the retention period when parking a flow at an approval step. If retention is 60 days, there's no point suspending for 1000 days — the flow's completed children would need cleanup before then anyway.

Test plan

Verify cargo check passes
Deploy to staging and confirm the monitor loop detects and cancels flows with NULL pings older than retention
Create a flow with an approval step and a suspend timeout longer than the retention period — verify suspend_until is capped
Confirm retention cleanup proceeds normally after zombie flows are cancelled

🤖 Generated with Claude Code

Summary by cubic

Fixes zombie-flow detection by catching running flows with NULL pings and caps suspend timeouts at the retention period. Prevents flows from parking indefinitely and unblocks retention cleanup of completed children.

Bug Fixes
- Detect and cancel root flows with NULL pings older than retention in monitor.rs.
- Cap suspend_until to the retention period for approval/suspend steps in worker_flow.rs.

^{Written for commit 012726e. Summary will update on new commits.}

…etention period Flows waiting for a suspended child (e.g. approval step) have running=true but ping=NULL by design. The zombie flow handler required ping IS NOT NULL, so these flows were invisible to detection and could stay in the queue indefinitely, blocking retention cleanup of their completed children. Two fixes: 1. monitor.rs: Add detection for root flows with NULL pings older than the retention period. These are cancelled as zombies with a critical error. 2. worker_flow.rs: Cap suspend_until at the retention period to prevent flows from being parked longer than retention allows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude · 2026-04-04T13:14:09Z

Claude finished @rubenfiszel's task in 4m 12s —— View job

Code Review: PR #8723

Gather context and read changed files
Review backend/src/monitor.rs changes
Review backend/windmill-worker/src/worker_flow.rs changes
Review .sqlx query files
Check CLAUDE.md compliance
Post final review

Overall Assessment

This is a well-motivated fix for a real production issue (370 zombie flows blocking retention cleanup). The approach is sound: detect and cancel stale NULL-ping flows, and proactively cap suspend timeouts to prevent the problem from recurring. The code follows existing patterns in the codebase.

Issues

1. Missing suspend/suspend_until filter in the new NULL ping query (medium)

backend/src/monitor.rs:3320-3325

The existing zombie flow query (L3220) explicitly filters q.suspend = 0 AND q.suspend_until IS null to avoid touching legitimately suspended flows. The new query omits these conditions entirely. This means it will also match flows that are legitimately suspended (with suspend_until set in the future) but happen to have parent_job IS NULL and created_at older than retention.

Since the second change (worker_flow.rs) caps suspend_until at the retention period, these flows should time out before the retention period. But there's a race: if suspend_until is very close to — but not past — the retention cutoff, the monitor could cancel a flow that's about to be properly resumed or timed out.

Consider adding AND (q.suspend_until IS NULL OR q.suspend_until < NOW()) to avoid cancelling flows that still have a valid, non-expired suspend window.

2. Missing .sqlx offline cache file for the new query (low)

The 4 new .sqlx files added in this PR are for pre-existing queries unrelated to this change. The new sqlx::query! at monitor.rs:3314 doesn't appear to have a corresponding .sqlx cache file. This will cause cargo check to fail in offline mode (SQLX_OFFLINE=true). You'll need to run cargo sqlx prepare against a live database to generate the cache entry.

3. base_url lock held across await point (low/nit)

backend/src/monitor.rs:3333

let base_url = BASE_URL.read().await;

This acquires a read lock that's held while report_critical_error and cancel_zombie_flow_job are awaited inside the loop. The existing code in the function (L3295) does the same pattern, so this is consistent — but worth noting that if the loop processes many flows, the lock is held for the entire iteration. Consider cloning the value before the loop.

Looks Good

worker_flow.rs change: The suspend timeout cap logic is clean. The job_retention_secs > 0 guard correctly handles the case where retention is disabled. The min() approach is the right way to cap it.
Query structure: Using j.parent_job IS NULL to target only root flows is correct — child flows will be cleaned up when their parent is cancelled.
Error messaging: The critical error message is descriptive and matches the style of existing messages in the function.
Unrelated .sqlx files: These are harmless additions of offline cache for pre-existing queries.

Testing Instructions

To verify the suspend timeout cap: create a flow with an approval step where the suspend timeout is set very high (e.g. 86400000 seconds). Set the instance's job retention period to something shorter (e.g. 60 days / 5184000 seconds) via superadmin settings. Run the flow, let it reach the approval step, and check the v2_job_queue table — the suspend_until value should be capped at now + retention period, not now + the configured timeout. The zombie detection for NULL pings requires waiting for the retention period to elapse, so it's best verified on a staging instance with a short retention period or by manually adjusting created_at timestamps.

cloudflare-workers-and-pages · 2026-04-04T13:16:21Z

Deploying windmill with Cloudflare Pages

Latest commit:	`012726e`
Status:	✅ Deploy successful!
Preview URL:	https://e813f362.windmill.pages.dev
Branch Preview URL:	https://fix-zombie-flow-null-ping-re.windmill.pages.dev

View logs

cubic-dev-ai

1 issue found across 6 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/src/monitor.rs">

<violation number="1" location="backend/src/monitor.rs:3321">
P3: `j.kind = 'flownode'` can never match here because `j.parent_job IS NULL` is also required, and flownodes always have a parent job. Remove `OR j.kind = 'flownode'` from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.}

cubic-dev-ai · 2026-04-04T13:17:42Z

+            JOIN v2_job j USING (id)
+            LEFT JOIN v2_job_runtime r USING (id)
+            WHERE q.running = true
+                AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode')


P3: j.kind = 'flownode' can never match here because j.parent_job IS NULL is also required, and flownodes always have a parent job. Remove OR j.kind = 'flownode' from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At backend/src/monitor.rs, line 3321: <comment>`j.kind = 'flownode'` can never match here because `j.parent_job IS NULL` is also required, and flownodes always have a parent job. Remove `OR j.kind = 'flownode'` from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.</comment> <file context> @@ -3305,6 +3305,49 @@ Please check your worker logs for more details and feel free to report it to the + JOIN v2_job j USING (id) + LEFT JOIN v2_job_runtime r USING (id) + WHERE q.running = true + AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode') + AND r.ping IS NULL + AND j.parent_job IS NULL </file context>

Suggested change

AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode')

AND (j.kind = 'flow' OR j.kind = 'flowpreview')

rubenfiszel requested review from alpetric and hugocasa as code owners April 4, 2026 13:13

cubic-dev-ai bot reviewed Apr 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: detect zombie flows with NULL pings and cap suspend timeout#8723

fix: detect zombie flows with NULL pings and cap suspend timeout#8723
rubenfiszel wants to merge 1 commit intomainfrom
fix/zombie-flow-null-ping-retention

rubenfiszel commented Apr 4, 2026 •

edited by cubic-dev-ai bot

Loading

Uh oh!

claude bot commented Apr 4, 2026 •

edited

Loading

Uh oh!

cloudflare-workers-and-pages bot commented Apr 4, 2026

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

cubic-dev-ai bot Apr 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode')
	AND (j.kind = 'flow' OR j.kind = 'flowpreview')

Conversation

rubenfiszel commented Apr 4, 2026 • edited by cubic-dev-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Summary by cubic

Uh oh!

claude bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: PR #8723

Overall Assessment

Issues

Looks Good

Testing Instructions

Uh oh!

cloudflare-workers-and-pages bot commented Apr 4, 2026

Deploying windmill with Cloudflare Pages

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai bot Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rubenfiszel commented Apr 4, 2026 •

edited by cubic-dev-ai bot

Loading

claude bot commented Apr 4, 2026 •

edited

Loading

cubic-dev-ai bot Apr 4, 2026 •

edited

Loading