fix: improve flow chat and benchmark coverage #8825

Merged: hugocasa merged 11 commits into main from flow-mode-test-baseline, Apr 15, 2026
Conversation

@centdix (Collaborator) commented Apr 14, 2026

Summary

This PR fixes flow chat / flow benchmark drift around special flow modules, aligns the production and benchmark flow helpers, and cleans up the flow benchmark so it rewards valid Windmill flows instead of overfitting to one exact example shape.

Changes

  • extend flow chat tooling to support top-level preprocessor_module and failure_module, including provider-safe tool schemas and dedicated tool guidance
  • align the production and benchmark flow helpers around shared mutation logic so special modules, rawscript edits, and setFlowJson behavior stay in sync
  • add generic whileloopflow guidance to the flow system prompt and chat guidance, then regenerate the auto-generated prompt assets used by the CLI/frontend
  • tighten flow-test10-while-loop-counter around meaningful loop structure while relaxing false-positive checks on rawscript input-transform key names
  • relax benchmark overfits in flow-test6, flow-test9, flow-test10, and flow-test12 so the judge and deterministic checks focus on Windmill-relevant correctness instead of exact example ids/field shapes
  • record updated flow benchmark history entries (recorded with --runs 2) for haiku, 4o, opus, and gemini-3-flash

Benchmark Impact

Latest recorded flow benchmark history (--runs 2), on benchmark-definition commit cc3e17dbc1:

  • haiku: 100.0% (26/26), up from the original 61.5%
  • 4o: 80.8% (21/26), up from the original 53.8%
  • opus: 100.0% (26/26), up from the original 69.2%
  • gemini-3-flash: 100.0% (26/26), up from the original 61.5%

Notable case-level effects:

  • flow-test11-preprocessor-and-failure-handler now passes across all four models
  • flow-test10-while-loop-counter no longer false-fails on valid while-loop outputs
  • flow-test9-parallel-refactor no longer fails only because the parallel step id differs from the fixture example
  • 4o still has real misses in the recorded run on flow-test4-order-processing-loop, flow-test6-ai-agent-tools, flow-test7-simple-modification, and flow-test9-parallel-refactor

Note: the latest commit 55b4036d12 only records the new history entries. The recorded benchmark rows themselves intentionally reference cc3e17dbc1, which is the pushed benchmark-definition commit they were run against.

Test plan

  • uv run --with pyyaml python3 system_prompts/generate.py
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test11-preprocessor-and-failure-handler --model opus
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test10-while-loop-counter --models haiku,4o,opus,gemini-3-flash
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test12-approval-step --models haiku,4o,opus,gemini-3-flash
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model haiku
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model 4o
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model opus
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model gemini-3-flash
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model haiku --runs 2 --record
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model 4o --runs 2 --record
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model opus --runs 2 --record
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model gemini-3-flash --runs 2 --record
  • frontend dev log checked after the prompt/tool changes

Generated with Claude Code

centdix and others added 2 commits April 14, 2026 14:40
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cloudflare-workers-and-pages bot commented Apr 14, 2026

Deploying windmill with Cloudflare Pages

Latest commit: 89cb4ce
Status: ✅  Deploy successful!
Preview URL: https://0e9a6b2f.windmill.pages.dev
Branch Preview URL: https://flow-mode-test-baseline.windmill.pages.dev


centdix and others added 2 commits April 14, 2026 15:57
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@centdix centdix changed the title fix: support special flow modules in flow evals fix: improve flow chat and benchmark coverage Apr 15, 2026
centdix and others added 6 commits April 15, 2026 15:34
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@centdix centdix marked this pull request as ready for review April 15, 2026 15:37

claude bot commented Apr 15, 2026

Claude finished @centdix's task in 7m 16s


Claude PR Review

Reviewed the PR against origin/main (diverged at 3f5841f). The shared-helper refactor, the new set_preprocessor_module/set_failure_module tools, and the centralized FLOW_CHAT_SPECIAL_MODULES prompt are all well executed and keep production + benchmark helpers in sync. Nice job on the schema-normalization rename and on teaching set_flow_json to reject reserved-id modules.

A few things worth discussing before merge:

1. flow-test10-while-loop-counter — benchmark overfits mixed with AGENTS.md contradiction

The prompt now reads:

"Use a top-level while loop step named count_until_target. […] Put stop_after_if on count_until_target itself so the loop stops when result >= flow_input.target."

This contradicts ai_evals/AGENTS.md: "flow prompts should read like requests from a user who does not know the product internals […] the user should ask for behavior, not for branchone, branchall, rawscript, preprocessor_module, failure_module, exact graph topology, or other internal constructs". Users don't say "while loop" or stop_after_if. There's a carve-out ("only mention special Windmill constructs when the case is explicitly about those constructs"), so this is defensible for a construct-focused case, but it still reads more like a harness instruction than a user request.

Separately, the new deterministic checks at ai_evals/cases/flow.yaml:282-286 are the kind of overfit the rest of the PR is trying to relax:

moduleFieldRules:
  - id: count_until_target
    path: stop_after_if.skip_if_stopped
    equals: false

skip_if_stopped is never mentioned in the prompt and is an optional field. If the model omits it (reasonable — it's optional), valuesEqualForValidation(undefined, false) falls through to actual === expected at ai_evals/core/validators.ts:829 and returns false. The flow would work correctly and still fail the check. Either drop the rule, or encode "missing-or-false" semantics.
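The suggested "missing-or-false" semantics could be encoded roughly as follows. This is a hypothetical sketch, not the actual validator code; `valuesEqualMissingOrFalse` is an illustrative name, and the real comparison lives in ai_evals/core/validators.ts.

```typescript
// Sketch: treat an omitted optional boolean as equivalent to `false`,
// so a valid flow that simply omits `skip_if_stopped` does not fail
// a `equals: false` rule. Names are illustrative, not the real API.
function valuesEqualMissingOrFalse(actual: unknown, expected: unknown): boolean {
  if (expected === false && (actual === undefined || actual === null)) {
    return true; // missing-or-false: an omitted optional field counts as false
  }
  return actual === expected;
}
```

With this shape, `valuesEqualMissingOrFalse(undefined, false)` passes while `valuesEqualMissingOrFalse(true, false)` still fails, which matches the behavior the flow engine actually exhibits for an omitted optional flag.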

The increment_counter exprAnyOf is similarly brittle:

exprAnyOf:
  - "flow_input.iter && flow_input.iter.value !== undefined ? flow_input.iter.value : 0"
  - "flow_input.iter?.value ?? 0"

Any other valid fallback (flow_input.iter?.value !== undefined ? flow_input.iter.value : 0, flow_input.iter?.value || 0, flow_input.iter ? flow_input.iter.value : 0, or reading an explicit initial_state) fails the check even though the judgeChecklist item "uses flow_input.iter.value or an equivalent loop-state expression" passes. Since the checklist already captures this, consider dropping or broadening exprAnyOf — otherwise the "100%" recorded in the history risks regressing as soon as a model picks a different-but-valid shape.
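If `exprAnyOf` is broadened rather than dropped, one option is a structural check instead of exact string matching. The sketch below is hypothetical (the case format may not support custom predicates); it accepts any transform that reads `flow_input.iter.value` and supplies some fallback for the first iteration, which is what the checklist item is really after.

```typescript
// Sketch: accept any expression that (a) reads flow_input.iter.value,
// with or without optional chaining, and (b) has a fallback path for
// the first iteration (nullish coalescing, ||, or a ternary).
// Illustrative only; not the actual validator.
function isValidIterExpr(expr: string): boolean {
  const readsIterValue = /flow_input\.iter\??\.value/.test(expr);
  const hasFallback =
    /\?\?/.test(expr) || // nullish coalescing: iter?.value ?? 0
    /\|\|/.test(expr) || // logical-or default: iter?.value || 0
    /\?[^.]/.test(expr); // ternary: cond ? iter.value : 0
  return readsIterValue && hasFallback;
}
```

All four different-but-valid shapes listed above pass this check, while an expression that never touches the loop state (e.g. `flow_input.target`) still fails it.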


2. Dead/duplicated guard in setCode path

frontend/src/lib/components/copilot/chat/flow/FlowAIChat.svelte:86-134 — after the outer if (module.value.type === 'rawscript') check, the code then calls updateRawScriptModuleContent(flowStore.val, id, code) which re-runs the getFlowModuleById lookup and re-checks type !== 'rawscript'. Inside the rawscript branch this second check is unreachable, and the error message "Module is not a rawscript or script" mentions script which this branch never handles. Minor — either inline the mutation or move the outer type-check into updateRawScriptModuleContent. Not a bug, just redundant work on every code edit.

3. setCode handles top-level preprocessor/failure but the frontend copilot calls it inconsistently

updateRawScriptModuleContent / getFlowModuleById now correctly route preprocessor/failure ids to the top-level special fields (helperUtils.ts:22-28). Good. However, the outer setCode in FlowAIChat.svelte:86-138 still only takes the snapshot if module.value.type === 'rawscript', which is fine today because set_module_code only targets rawscript modules, but be aware that any future code path that invokes setCode on a script-type preprocessor/failure module will silently throw the "Module is not a rawscript or script" error rather than short-circuit earlier. Low priority.

4. stringArraysEqual has a minor edge-case ambiguity

ai_evals/core/validators.ts:817-823 — works correctly, but note that exactTopLevelStepIds enforces order (since it uses index comparison), not just set equality. That's exactly what flow-test10 needs, but the name suggests "same set." If you ever reuse this for order-insensitive checks, you'll need a separate helper. Worth adding a doc comment clarifying the ordering semantics.
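To make the distinction concrete, here is a sketch of the two semantics side by side. Names are hypothetical (the real helper is `stringArraysEqual` in validators.ts); the point is that index-wise comparison implies ordering, while a set-based check does not.

```typescript
// Order-SENSITIVE: compares element-by-element at matching indices,
// so ["a","b"] !== ["b","a"]. This is what exactTopLevelStepIds needs.
function stringArraysEqualOrdered(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// Order-INSENSITIVE alternative: same members regardless of position.
// A separate helper like this would be needed for set-equality checks.
function stringArraysEqualAsSets(a: string[], b: string[]): boolean {
  const setA = new Set(a);
  const setB = new Set(b);
  return setA.size === setB.size && [...setA].every((v) => setB.has(v));
}
```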

5. set_flow_json dedupe correctly includes special modules — nice catch

core.ts:885-894 now includes parsedPreprocessorModule/parsedFailureModule in the duplicate-id check after the reserved-id rejection. This is important because a flow with modules:[{id:"x"}] + failure_module:{id:"x"} would have been accepted before. Good fix.
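A minimal sketch of the dedupe behavior being praised, with illustrative types and names (the actual logic is in core.ts:885-894): the special modules must be folded into the same id pool as `value.modules` before checking for duplicates.

```typescript
// Sketch: collect ids from regular modules plus the parsed
// preprocessor/failure modules, then report any duplicates.
// Types and names are illustrative, not the real implementation.
interface FlowModule {
  id: string;
}

function findDuplicateIds(
  modules: FlowModule[],
  preprocessor?: FlowModule,
  failure?: FlowModule
): string[] {
  const all = [...modules, preprocessor, failure].filter(
    (m): m is FlowModule => m !== undefined
  );
  const seen = new Set<string>();
  const dups = new Set<string>();
  for (const m of all) {
    if (seen.has(m.id)) dups.add(m.id);
    seen.add(m.id);
  }
  return [...dups];
}
```

Under this shape, `modules: [{id: "x"}]` combined with `failure_module: {id: "x"}` is flagged as a duplicate, which is exactly the case that would previously have been accepted.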

6. normalizeToolParameterSchema rename + items array handling

frontend/src/lib/components/copilot/chat/shared.ts:611-652 — the rename and the Array.isArray(schema.items) branch are correct (JSON Schema allows tuple form). Clean refactor.
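The tuple-form branch matters because JSON Schema allows `items` to be either a single schema or an array of schemas. The sketch below illustrates the shape of such a normalization pass; the `fix` callback and all names are hypothetical stand-ins for whatever per-schema rewriting the real `normalizeToolParameterSchema` performs.

```typescript
// Sketch: recurse into `items` whether it is the single-schema form
// or the tuple form (an array of schemas), applying a per-schema
// normalization callback. Illustrative only.
type Schema = { type?: string; items?: Schema | Schema[]; [k: string]: unknown };

function normalizeItems(schema: Schema, fix: (s: Schema) => Schema): Schema {
  if (schema.items === undefined) return fix(schema);
  const items = Array.isArray(schema.items)
    ? schema.items.map((s) => normalizeItems(s, fix)) // tuple form
    : normalizeItems(schema.items, fix); // single-schema form
  return fix({ ...schema, items });
}
```

Without the `Array.isArray` branch, a tuple-form `items` would be treated as a single schema object and silently mangled, which is why the branch is called out as correct above.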

7. judge.ts prompt additions

ai_evals/core/judge.ts:37-38 — the two new rules ("Do not invent additional Windmill-specific constraints not in the prompt…" and "Do not lower the score for different-but-valid idioms…") align with the AGENTS.md goal and should meaningfully reduce false negatives on creation tasks. Good addition.

8. Recorded benchmark history lists a different commit SHA than the merge commit

ai_evals/history/flow.jsonl entries reference gitSha: "cc3e17dbc19..." (your benchmark-definition commit), which the PR description calls out. Fine as long as readers understand that the history row is anchored to the benchmark definition, not the latest code commit. Consider a one-line README note in ai_evals/history/ explaining this convention so future reviewers don't get confused.

Testing instructions

Open any flow in the editor and use the AI copilot chat to exercise the new tools:

  • Ask it to add a preprocessor that validates incoming data and a failure handler that logs errors. The assistant should call set_preprocessor_module / set_failure_module (visible in the tool-call stream) and apply both special modules without creating an entry in value.modules.
  • Ask it to remove the preprocessor; it should call set_preprocessor_module with a null argument.
  • Ask for a full flow with while loops to verify the new stop_after_if guidance produces a valid top-level stop_after_if instead of a value.stop_after_if.
  • After each change, confirm the flow graph renders the preprocessor/failure nodes in the correct slots and that running the flow still works end-to-end.


Summary: refactor is solid and the new tools are well-designed. The main concern is flow-test10 encoding overly-strict string matches for optional fields that the prompt doesn't request — that will reintroduce the same false-negative pattern the rest of the PR is fixing. The production flow chat changes look good.

@centdix centdix marked this pull request as draft April 15, 2026 15:56
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@centdix centdix marked this pull request as ready for review April 15, 2026 16:17

claude bot commented Apr 15, 2026

Claude PR Review — in progress

Reviewing this PR against origin/main.

  • Pull branch + diff and read changed files
  • Review ai_evals/ changes (validators, cases, helpers)
  • Review flow chat (core.ts, helperUtils.ts, FlowAIChat.svelte, shared.ts)
  • Review system prompts + CLI guidance
  • Check CLAUDE.md compliance
  • Post final review


@hugocasa hugocasa added this pull request to the merge queue Apr 15, 2026
Merged via the queue into main with commit d3cb0c6 Apr 15, 2026
21 checks passed
@hugocasa hugocasa deleted the flow-mode-test-baseline branch April 15, 2026 16:22
@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2026