fix: improve flow chat and benchmark coverage #8825

Merged: hugocasa merged 11 commits into main from flow-mode-test-baseline, Apr 15, 2026
Conversation

@centdix (Collaborator) commented Apr 14, 2026

Summary

This PR fixes flow chat / flow benchmark drift around special flow modules, aligns the production and benchmark flow helpers, and cleans up the flow benchmark so it rewards valid Windmill flows instead of overfitting to one exact example shape.

Changes

  • extend flow chat tooling to support top-level preprocessor_module and failure_module, including provider-safe tool schemas and dedicated tool guidance
  • align the production and benchmark flow helpers around shared mutation logic so special modules, rawscript edits, and setFlowJson behavior stay in sync
  • add generic whileloopflow guidance to the flow system prompt and chat guidance, then regenerate the auto-generated prompt assets used by the CLI/frontend
  • tighten flow-test10-while-loop-counter around meaningful loop structure while relaxing false-positive checks on rawscript input-transform key names
  • relax benchmark overfits in flow-test6, flow-test9, flow-test10, and flow-test12 so the judge and deterministic checks focus on Windmill-relevant correctness instead of exact example ids/field shapes
  • record updated flow benchmark history entries (recorded with --runs 2) for haiku, 4o, opus, and gemini-3-flash

Benchmark Impact

Latest recorded flow benchmark history (--runs 2), on benchmark-definition commit cc3e17dbc1:

  • haiku: 100.0% (26/26), up from the original 61.5%
  • 4o: 80.8% (21/26), up from the original 53.8%
  • opus: 100.0% (26/26), up from the original 69.2%
  • gemini-3-flash: 100.0% (26/26), up from the original 61.5%

Notable case-level effects:

  • flow-test11-preprocessor-and-failure-handler now passes across all four models
  • flow-test10-while-loop-counter no longer false-fails on valid while-loop outputs
  • flow-test9-parallel-refactor no longer fails only because the parallel step id differs from the fixture example
  • 4o still has real misses in the recorded run on flow-test4-order-processing-loop, flow-test6-ai-agent-tools, flow-test7-simple-modification, and flow-test9-parallel-refactor

Note: the latest commit 55b4036d12 only records the new history entries. The recorded benchmark rows themselves intentionally reference cc3e17dbc1, which is the pushed benchmark-definition commit they were run against.

Test plan

  • uv run --with pyyaml python3 system_prompts/generate.py
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test11-preprocessor-and-failure-handler --model opus
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test10-while-loop-counter --models haiku,4o,opus,gemini-3-flash
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test12-approval-step --models haiku,4o,opus,gemini-3-flash
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model haiku
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model 4o
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model opus
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model gemini-3-flash
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model haiku --runs 2 --record
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model 4o --runs 2 --record
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model opus --runs 2 --record
  • cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model gemini-3-flash --runs 2 --record
  • frontend dev log checked after the prompt/tool changes

Generated with Claude Code

centdix and others added 2 commits April 14, 2026 14:40
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cloudflare-workers-and-pages bot commented Apr 14, 2026

Deploying windmill with Cloudflare Pages

Latest commit: 89cb4ce
Status: ✅  Deploy successful!
Preview URL: https://0e9a6b2f.windmill.pages.dev
Branch Preview URL: https://flow-mode-test-baseline.windmill.pages.dev


centdix and others added 2 commits April 14, 2026 15:57
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@centdix centdix changed the title fix: support special flow modules in flow evals fix: improve flow chat and benchmark coverage Apr 15, 2026
centdix and others added 6 commits April 15, 2026 15:34
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@centdix centdix marked this pull request as ready for review April 15, 2026 15:37

claude bot commented Apr 15, 2026

Claude finished @centdix's task in 7m 16s


Claude PR Review

Reviewed the PR against origin/main (diverged at 3f5841f). The shared-helper refactor, the new set_preprocessor_module/set_failure_module tools, and the centralized FLOW_CHAT_SPECIAL_MODULES prompt are all well executed and keep production + benchmark helpers in sync. Nice job on the schema-normalization rename and on teaching set_flow_json to reject reserved-id modules.

A few things worth discussing before merge:

1. flow-test10-while-loop-counter — benchmark overfits mixed with AGENTS.md contradiction

The prompt now reads:

"Use a top-level while loop step named count_until_target. […] Put stop_after_if on count_until_target itself so the loop stops when result >= flow_input.target."

This contradicts ai_evals/AGENTS.md: "flow prompts should read like requests from a user who does not know the product internals […] the user should ask for behavior, not for branchone, branchall, rawscript, preprocessor_module, failure_module, exact graph topology, or other internal constructs". Users don't say "while loop" or stop_after_if. There's a carve-out ("only mention special Windmill constructs when the case is explicitly about those constructs"), so this is defensible for a construct-focused case, but it still reads more like a harness instruction than a user request.

Separately, the new deterministic checks at ai_evals/cases/flow.yaml:282-286 are the kind of overfit the rest of the PR is trying to relax:

moduleFieldRules:
  - id: count_until_target
    path: stop_after_if.skip_if_stopped
    equals: false

skip_if_stopped is never mentioned in the prompt and is an optional field. If the model omits it (reasonable — it's optional), valuesEqualForValidation(undefined, false) falls through to actual === expected at ai_evals/core/validators.ts:829 and returns false. The flow would work correctly and still fail the check. Either drop the rule, or encode "missing-or-false" semantics.
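The suggested "missing-or-false" semantics could be encoded roughly as follows. This is a hypothetical sketch, not the actual validator code; `valuesEqualMissingOrFalse` is an illustrative name, and the real comparison lives in ai_evals/core/validators.ts.

```typescript
// Sketch: treat an omitted optional boolean as equivalent to `false`,
// so a valid flow that simply omits `skip_if_stopped` does not fail
// a `equals: false` rule. Names are illustrative, not the real API.
function valuesEqualMissingOrFalse(actual: unknown, expected: unknown): boolean {
  if (expected === false && (actual === undefined || actual === null)) {
    return true; // missing-or-false: an omitted optional field counts as false
  }
  return actual === expected;
}
```

With this shape, `valuesEqualMissingOrFalse(undefined, false)` passes while `valuesEqualMissingOrFalse(true, false)` still fails, which matches the behavior the flow engine actually exhibits for an omitted optional flag.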

The increment_counter exprAnyOf is similarly brittle:

exprAnyOf:
  - "flow_input.iter && flow_input.iter.value !== undefined ? flow_input.iter.value : 0"
  - "flow_input.iter?.value ?? 0"

Any other valid fallback (flow_input.iter?.value !== undefined ? flow_input.iter.value : 0, flow_input.iter?.value || 0, flow_input.iter ? flow_input.iter.value : 0, or reading an explicit initial_state) fails the check even though the judgeChecklist item "uses flow_input.iter.value or an equivalent loop-state expression" passes. Since the checklist already captures this, consider dropping or broadening exprAnyOf — otherwise the "100%" recorded in the history risks regressing as soon as a model picks a different-but-valid shape.
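If `exprAnyOf` is broadened rather than dropped, one option is a structural check instead of exact string matching. The sketch below is hypothetical (the case format may not support custom predicates); it accepts any transform that reads `flow_input.iter.value` and supplies some fallback for the first iteration, which is what the checklist item is really after.

```typescript
// Sketch: accept any expression that (a) reads flow_input.iter.value,
// with or without optional chaining, and (b) has a fallback path for
// the first iteration (nullish coalescing, ||, or a ternary).
// Illustrative only; not the actual validator.
function isValidIterExpr(expr: string): boolean {
  const readsIterValue = /flow_input\.iter\??\.value/.test(expr);
  const hasFallback =
    /\?\?/.test(expr) || // nullish coalescing: iter?.value ?? 0
    /\|\|/.test(expr) || // logical-or default: iter?.value || 0
    /\?[^.]/.test(expr); // ternary: cond ? iter.value : 0
  return readsIterValue && hasFallback;
}
```

All four different-but-valid shapes listed above pass this check, while an expression that never touches the loop state (e.g. `flow_input.target`) still fails it.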


2. Dead/duplicated guard in setCode path

frontend/src/lib/components/copilot/chat/flow/FlowAIChat.svelte:86-134 — after the outer if (module.value.type === 'rawscript') check, the code then calls updateRawScriptModuleContent(flowStore.val, id, code) which re-runs the getFlowModuleById lookup and re-checks type !== 'rawscript'. Inside the rawscript branch this second check is unreachable, and the error message "Module is not a rawscript or script" mentions script which this branch never handles. Minor — either inline the mutation or move the outer type-check into updateRawScriptModuleContent. Not a bug, just redundant work on every code edit.

3. setCode handles top-level preprocessor/failure but the frontend copilot calls it inconsistently

updateRawScriptModuleContent / getFlowModuleById now correctly route preprocessor/failure ids to the top-level special fields (helperUtils.ts:22-28). Good. However, the outer setCode in FlowAIChat.svelte:86-138 still only takes the snapshot if module.value.type === 'rawscript', which is fine today because set_module_code only targets rawscript modules, but be aware that any future code path that invokes setCode on a script-type preprocessor/failure module will silently throw the "Module is not a rawscript or script" error rather than short-circuit earlier. Low priority.

4. stringArraysEqual has a minor edge-case ambiguity

ai_evals/core/validators.ts:817-823 — works correctly, but note that exactTopLevelStepIds enforces order (since it uses index comparison), not just set equality. That's exactly what flow-test10 needs, but the name suggests "same set." If you ever reuse this for order-insensitive checks, you'll need a separate helper. Worth adding a doc comment clarifying the ordering semantics.
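To make the distinction concrete, here is a sketch of the two semantics side by side. Names are hypothetical (the real helper is `stringArraysEqual` in validators.ts); the point is that index-wise comparison implies ordering, while a set-based check does not.

```typescript
// Order-SENSITIVE: compares element-by-element at matching indices,
// so ["a","b"] !== ["b","a"]. This is what exactTopLevelStepIds needs.
function stringArraysEqualOrdered(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// Order-INSENSITIVE alternative: same members regardless of position.
// A separate helper like this would be needed for set-equality checks.
function stringArraysEqualAsSets(a: string[], b: string[]): boolean {
  const setA = new Set(a);
  const setB = new Set(b);
  return setA.size === setB.size && [...setA].every((v) => setB.has(v));
}
```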

5. set_flow_json dedupe correctly includes special modules — nice catch

core.ts:885-894 now includes parsedPreprocessorModule/parsedFailureModule in the duplicate-id check after the reserved-id rejection. This is important because a flow with modules:[{id:"x"}] + failure_module:{id:"x"} would have been accepted before. Good fix.
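A minimal sketch of the dedupe behavior being praised, with illustrative types and names (the actual logic is in core.ts:885-894): the special modules must be folded into the same id pool as `value.modules` before checking for duplicates.

```typescript
// Sketch: collect ids from regular modules plus the parsed
// preprocessor/failure modules, then report any duplicates.
// Types and names are illustrative, not the real implementation.
interface FlowModule {
  id: string;
}

function findDuplicateIds(
  modules: FlowModule[],
  preprocessor?: FlowModule,
  failure?: FlowModule
): string[] {
  const all = [...modules, preprocessor, failure].filter(
    (m): m is FlowModule => m !== undefined
  );
  const seen = new Set<string>();
  const dups = new Set<string>();
  for (const m of all) {
    if (seen.has(m.id)) dups.add(m.id);
    seen.add(m.id);
  }
  return [...dups];
}
```

Under this shape, `modules: [{id: "x"}]` combined with `failure_module: {id: "x"}` is flagged as a duplicate, which is exactly the case that would previously have been accepted.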

6. normalizeToolParameterSchema rename + items array handling

frontend/src/lib/components/copilot/chat/shared.ts:611-652 — the rename and the Array.isArray(schema.items) branch are correct (JSON Schema allows tuple form). Clean refactor.
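The tuple-form branch matters because JSON Schema allows `items` to be either a single schema or an array of schemas. The sketch below illustrates the shape of such a normalization pass; the `fix` callback and all names are hypothetical stand-ins for whatever per-schema rewriting the real `normalizeToolParameterSchema` performs.

```typescript
// Sketch: recurse into `items` whether it is the single-schema form
// or the tuple form (an array of schemas), applying a per-schema
// normalization callback. Illustrative only.
type Schema = { type?: string; items?: Schema | Schema[]; [k: string]: unknown };

function normalizeItems(schema: Schema, fix: (s: Schema) => Schema): Schema {
  if (schema.items === undefined) return fix(schema);
  const items = Array.isArray(schema.items)
    ? schema.items.map((s) => normalizeItems(s, fix)) // tuple form
    : normalizeItems(schema.items, fix); // single-schema form
  return fix({ ...schema, items });
}
```

Without the `Array.isArray` branch, a tuple-form `items` would be treated as a single schema object and silently mangled, which is why the branch is called out as correct above.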

7. judge.ts prompt additions

ai_evals/core/judge.ts:37-38 — the two new rules ("Do not invent additional Windmill-specific constraints not in the prompt…" and "Do not lower the score for different-but-valid idioms…") align with the AGENTS.md goal and should meaningfully reduce false negatives on creation tasks. Good addition.

8. Recorded benchmark history lists a different commit SHA than the merge commit

ai_evals/history/flow.jsonl entries reference gitSha: "cc3e17dbc19..." (your benchmark-definition commit), which the PR description calls out. Fine as long as readers understand that the history row is anchored to the benchmark definition, not the latest code commit. Consider a one-line README note in ai_evals/history/ explaining this convention so future reviewers don't get confused.

Testing instructions

Open any flow in the editor and use the AI copilot chat to exercise the new tools:

  • Ask it to add a preprocessor that validates incoming data and a failure handler that logs errors. The assistant should call set_preprocessor_module / set_failure_module (visible in the tool-call stream) and apply both special modules without creating an entry in value.modules.
  • Ask it to remove the preprocessor; it should call set_preprocessor_module with a null argument.
  • Ask for a full flow with while loops to verify the new stop_after_if guidance produces a valid top-level stop_after_if instead of a value.stop_after_if.
  • After each change, confirm the flow graph renders the preprocessor/failure nodes in the correct slots and that running the flow still works end-to-end.


Summary: refactor is solid and the new tools are well-designed. The main concern is flow-test10 encoding overly-strict string matches for optional fields that the prompt doesn't request — that will reintroduce the same false-negative pattern the rest of the PR is fixing. The production flow chat changes look good.

@centdix centdix marked this pull request as draft April 15, 2026 15:56
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@centdix centdix marked this pull request as ready for review April 15, 2026 16:17

claude bot commented Apr 15, 2026

Claude PR Review — in progress

Reviewing this PR against origin/main.

  • Pull branch + diff and read changed files
  • Review ai_evals/ changes (validators, cases, helpers)
  • Review flow chat (core.ts, helperUtils.ts, FlowAIChat.svelte, shared.ts)
  • Review system prompts + CLI guidance
  • Check CLAUDE.md compliance
  • Post final review


@hugocasa hugocasa added this pull request to the merge queue Apr 15, 2026
Merged via the queue into main with commit d3cb0c6 Apr 15, 2026
21 checks passed
@hugocasa hugocasa deleted the flow-mode-test-baseline branch April 15, 2026 16:22
@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2026