Skip to content

fix: detect zombie flows with NULL pings and cap suspend timeout#8723

Open
rubenfiszel wants to merge 1 commit intomainfrom
fix/zombie-flow-null-ping-retention
Open

fix: detect zombie flows with NULL pings and cap suspend timeout#8723
rubenfiszel wants to merge 1 commit intomainfrom
fix/zombie-flow-null-ping-retention

Conversation

@rubenfiszel
Copy link
Copy Markdown
Contributor

@rubenfiszel rubenfiszel commented Apr 4, 2026

Summary

  • Flows waiting for a suspended child (e.g. approval step) have running=true, ping=NULL by design. The zombie flow handler required ping IS NOT NULL, so these parent flows were invisible to zombie detection and could stay in the queue indefinitely — blocking retention cleanup of their completed children.
  • On the demo cloud instance, 370 instances of a single flow with a ~1000-day approval timeout accumulated since Nov 2025, protecting ~1,355 completed jobs from being cleaned up despite a 60-day retention period.

Changes

  1. monitor.rs: Add a third detection block in handle_zombie_flows that catches root flows with ping IS NULL older than the retention period. These are cancelled as zombies with a critical error report.
  2. worker_flow.rs: Cap suspend_until at the retention period when parking a flow at an approval step. If retention is 60 days, there's no point suspending for 1000 days — the flow's completed children would need cleanup before then anyway.

Test plan

  • Verify cargo check passes
  • Deploy to staging and confirm the monitor loop detects and cancels flows with NULL pings older than retention
  • Create a flow with an approval step and a suspend timeout longer than the retention period — verify suspend_until is capped
  • Confirm retention cleanup proceeds normally after zombie flows are cancelled

🤖 Generated with Claude Code


Summary by cubic

Fixes zombie-flow detection by catching running flows with NULL pings and caps suspend timeouts at the retention period. Prevents flows from parking indefinitely and unblocks retention cleanup of completed children.

  • Bug Fixes
    • Detect and cancel root flows with NULL pings older than retention in monitor.rs.
    • Cap suspend_until to the retention period for approval/suspend steps in worker_flow.rs.

Written for commit 012726e. Summary will update on new commits.

…etention period

Flows waiting for a suspended child (e.g. approval step) have running=true
but ping=NULL by design. The zombie flow handler required ping IS NOT NULL,
so these flows were invisible to detection and could stay in the queue
indefinitely, blocking retention cleanup of their completed children.

Two fixes:
1. monitor.rs: Add detection for root flows with NULL pings older than the
   retention period. These are cancelled as zombies with a critical error.
2. worker_flow.rs: Cap suspend_until at the retention period to prevent
   flows from being parked longer than retention allows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 4, 2026

Claude finished @rubenfiszel's task in 4m 12s —— View job


Code Review: PR #8723

  • Gather context and read changed files
  • Review backend/src/monitor.rs changes
  • Review backend/windmill-worker/src/worker_flow.rs changes
  • Review .sqlx query files
  • Check CLAUDE.md compliance
  • Post final review

Overall Assessment

This is a well-motivated fix for a real production issue (370 zombie flows blocking retention cleanup). The approach is sound: detect and cancel stale NULL-ping flows, and proactively cap suspend timeouts to prevent the problem from recurring. The code follows existing patterns in the codebase.

Issues

1. Missing suspend/suspend_until filter in the new NULL ping query (medium)

backend/src/monitor.rs:3320-3325

The existing zombie flow query (L3220) explicitly filters q.suspend = 0 AND q.suspend_until IS null to avoid touching legitimately suspended flows. The new query omits these conditions entirely. This means it will also match flows that are legitimately suspended (with suspend_until set in the future) but happen to have parent_job IS NULL and created_at older than retention.

Since the second change (worker_flow.rs) caps suspend_until at the retention period, these flows should time out before the retention period. But there's a race: if suspend_until is very close to — but not past — the retention cutoff, the monitor could cancel a flow that's about to be properly resumed or timed out.

Consider adding AND (q.suspend_until IS NULL OR q.suspend_until < NOW()) to avoid cancelling flows that still have a valid, non-expired suspend window.

2. Missing .sqlx offline cache file for the new query (low)

The 4 new .sqlx files added in this PR are for pre-existing queries unrelated to this change. The new sqlx::query! at monitor.rs:3314 doesn't appear to have a corresponding .sqlx cache file. This will cause cargo check to fail in offline mode (SQLX_OFFLINE=true). You'll need to run cargo sqlx prepare against a live database to generate the cache entry.

3. base_url lock held across await point (low/nit)

backend/src/monitor.rs:3333

let base_url = BASE_URL.read().await;

This acquires a read lock that's held while report_critical_error and cancel_zombie_flow_job are awaited inside the loop. The existing code in the function (L3295) does the same pattern, so this is consistent — but worth noting that if the loop processes many flows, the lock is held for the entire iteration. Consider cloning the value before the loop.

Looks Good

  • worker_flow.rs change: The suspend timeout cap logic is clean. The job_retention_secs > 0 guard correctly handles the case where retention is disabled. The min() approach is the right way to cap it.
  • Query structure: Using j.parent_job IS NULL to target only root flows is correct — child flows will be cleaned up when their parent is cancelled.
  • Error messaging: The critical error message is descriptive and matches the style of existing messages in the function.
  • Unrelated .sqlx files: These are harmless additions of offline cache for pre-existing queries.

Testing Instructions

To verify the suspend timeout cap: create a flow with an approval step where the suspend timeout is set very high (e.g. 86400000 seconds). Set the instance's job retention period to something shorter (e.g. 60 days / 5184000 seconds) via superadmin settings. Run the flow, let it reach the approval step, and check the v2_job_queue table — the suspend_until value should be capped at now + retention period, not now + the configured timeout. The zombie detection for NULL pings requires waiting for the retention period to elapse, so it's best verified on a staging instance with a short retention period or by manually adjusting created_at timestamps.

@cloudflare-workers-and-pages
Copy link
Copy Markdown

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: 012726e
Status: ✅  Deploy successful!
Preview URL: https://e813f362.windmill.pages.dev
Branch Preview URL: https://fix-zombie-flow-null-ping-re.windmill.pages.dev

View logs

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 6 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/src/monitor.rs">

<violation number="1" location="backend/src/monitor.rs:3321">
P3: `j.kind = 'flownode'` can never match here because `j.parent_job IS NULL` is also required, and flownodes always have a parent job. Remove `OR j.kind = 'flownode'` from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread backend/src/monitor.rs
JOIN v2_job j USING (id)
LEFT JOIN v2_job_runtime r USING (id)
WHERE q.running = true
AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode')
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: j.kind = 'flownode' can never match here because j.parent_job IS NULL is also required, and flownodes always have a parent job. Remove OR j.kind = 'flownode' from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/src/monitor.rs, line 3321:

<comment>`j.kind = 'flownode'` can never match here because `j.parent_job IS NULL` is also required, and flownodes always have a parent job. Remove `OR j.kind = 'flownode'` from the filter to avoid confusion — this branch is dead SQL. The comment above also references "flownodes" but they cannot be selected by this query.</comment>

<file context>
@@ -3305,6 +3305,49 @@ Please check your worker logs for more details and feel free to report it to the
+            JOIN v2_job j USING (id)
+            LEFT JOIN v2_job_runtime r USING (id)
+            WHERE q.running = true
+                AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode')
+                AND r.ping IS NULL
+                AND j.parent_job IS NULL
</file context>
Suggested change
AND (j.kind = 'flow' OR j.kind = 'flowpreview' OR j.kind = 'flownode')
AND (j.kind = 'flow' OR j.kind = 'flowpreview')
Fix with Cubic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant