fix: improve flow chat and benchmark coverage #8825
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deploying windmill

| Latest commit: | 89cb4ce |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://0e9a6b2f.windmill.pages.dev |
| Branch Preview URL: | https://flow-mode-test-baseline.windmill.pages.dev |

Summary
This PR fixes flow chat / flow benchmark drift around special flow modules, aligns the production and benchmark flow helpers, and cleans up the flow benchmark so it rewards valid Windmill flows instead of overfitting to one exact example shape.
Changes
- … `preprocessor_module` and `failure_module`, including provider-safe tool schemas and dedicated tool guidance
- … `setFlowJson` behavior stay in sync
- … `whileloopflow` guidance to the flow system prompt and chat guidance, then regenerate the auto-generated prompt assets used by the CLI/frontend
- … `flow-test10-while-loop-counter` around meaningful loop structure while relaxing false-positive checks on rawscript input-transform key names
- … `flow-test6`, `flow-test9`, `flow-test10`, and `flow-test12` so the judge and deterministic checks focus on Windmill-relevant correctness instead of exact example ids/field shapes
- … `runs: 2` flow benchmark history for `haiku`, `4o`, `opus`, and `gemini-3-flash`

Benchmark Impact
Latest recorded `runs: 2` flow benchmark history on benchmark-definition commit `cc3e17dbc1`:
- `haiku`: `100.0%` (`26/26`), up from the original `61.5%`
- `4o`: `80.8%` (`21/26`), up from the original `53.8%`
- `opus`: `100.0%` (`26/26`), up from the original `69.2%`
- `gemini-3-flash`: `100.0%` (`26/26`), up from the original `61.5%`

Notable case-level effects:
- `flow-test11-preprocessor-and-failure-handler` now passes across all four models
- `flow-test10-while-loop-counter` no longer false-fails on valid while-loop outputs
- `flow-test9-parallel-refactor` no longer fails only because the parallel step id differs from the fixture example
- `4o` still has real misses in the recorded run on `flow-test4-order-processing-loop`, `flow-test6-ai-agent-tools`, `flow-test7-simple-modification`, and `flow-test9-parallel-refactor`

Note: the latest commit `55b4036d12` only records the new history entries. The recorded benchmark rows themselves intentionally reference `cc3e17dbc1`, which is the pushed benchmark-definition commit they were run against.

Test plan
- `uv run --with pyyaml python3 system_prompts/generate.py`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test11-preprocessor-and-failure-handler --model opus`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test10-while-loop-counter --models haiku,4o,opus,gemini-3-flash`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow flow-test12-approval-step --models haiku,4o,opus,gemini-3-flash`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model haiku`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model 4o`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model opus`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model gemini-3-flash`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model haiku --runs 2 --record`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model 4o --runs 2 --record`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model opus --runs 2 --record`
- `cd ai_evals && set -a && source /home/farhad/windmill/ai_evals/.env && set +a && bun run cli -- run flow --model gemini-3-flash --runs 2 --record`

Generated with Claude Code
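The four recorded `--runs 2 --record` invocations in the test plan differ only in the model name, so they could be folded into one loop. The following is a minimal sketch, not a script from this PR: `MODELS`, `ENV_FILE`, `DRY_RUN`, and `run_flow_bench` are hypothetical names, and it assumes the `.env` file sits inside `ai_evals/` rather than at the absolute path used above. By default it only prints each command.

```shell
#!/bin/sh
# Sketch only: loops the recorded two-run flow benchmark over the four models.
# MODELS, ENV_FILE, DRY_RUN and run_flow_bench are hypothetical names.
MODELS="haiku 4o opus gemini-3-flash"
ENV_FILE="${ENV_FILE:-./.env}"  # assumption: resolved from inside ai_evals/
DRY_RUN="${DRY_RUN:-1}"         # 1 = print the command instead of running it

run_flow_bench() {
  model="$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "bun run cli -- run flow --model $model --runs 2 --record"
  else
    # Subshell keeps the cd and the exported env vars from leaking out.
    (
      cd ai_evals || exit 1
      set -a
      . "$ENV_FILE"
      set +a
      bun run cli -- run flow --model "$model" --runs 2 --record
    )
  fi
}

for model in $MODELS; do
  run_flow_bench "$model" || exit 1
done
```

Running it with `DRY_RUN=0` from the repository root would execute the benchmarks for real, under the same assumptions about the `.env` location.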