chore: add backend preview validation to ai evals #8827
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deploying windmill with Cloudflare Pages

| | |
| --- | --- |
| Latest commit: | df60913 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://b648c9ce.windmill.pages.dev |
| Branch Preview URL: | https://ai-evals-with-backend.windmill.pages.dev |
Claude encountered an error (View job): I'll analyze this and get back to you.

/ai-fast review this PR

Claude finished @centdix's task in 5m 8s (View job)

PR Review: ai-evals backend preview validation
Overall this is a solid, well-scoped addition, and the layering is clean. A few findings, roughly in order of importance:

1. Shared-workspace cleanup is destructive and not documented loudly enough
The README mentions "clears managed eval assets under
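One way to make the destructive cleanup "loud" is an explicit opt-in guard before any shared-workspace delete. A minimal sketch; the `WMILL_AI_EVAL_ALLOW_CLEANUP` variable and the function name are hypothetical, not the PR's actual API:

```typescript
// Hypothetical guard: refuse destructive cleanup of a shared workspace
// unless the operator has explicitly opted in via an env flag.
function assertCleanupAllowed(env: Record<string, string | undefined>): void {
  const workspace = env.WMILL_AI_EVAL_BACKEND_WORKSPACE
  if (workspace && env.WMILL_AI_EVAL_ALLOW_CLEANUP !== "1") {
    throw new Error(
      `Refusing to clear managed eval assets in shared workspace "${workspace}"; ` +
        `set WMILL_AI_EVAL_ALLOW_CLEANUP=1 to confirm.`
    )
  }
}
```

Calling this at the top of the cleanup path turns a silent wipe into an explicit, documented decision.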
2. Cleanup aborts on the first failure and leaks state
```ts
for (const path of ...) {
  await this.deleteFlowByPath(workspaceId, path).catch((err) => {
    console.warn(`Failed to delete managed flow ${path}: ${err}`)
  })
}
```

…or collect errors and throw once at the end so the artifact still reports which deletes failed.

3.
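The "collect errors and throw once" alternative from finding 2 could look like the following sketch; `deleteFlowByPath` is a stand-in for the real client method:

```typescript
// Keep deleting past individual failures, remember every failure, and
// raise one aggregated error at the end.
async function deleteManagedFlows(
  paths: string[],
  deleteFlowByPath: (path: string) => Promise<void>
): Promise<void> {
  const failures: string[] = []
  for (const path of paths) {
    try {
      await deleteFlowByPath(path)
    } catch (err) {
      failures.push(`${path}: ${err}`)
    }
  }
  if (failures.length > 0) {
    // A single error carrying every failed path, so the artifact can
    // still report exactly which deletes failed.
    throw new Error(
      `Failed to delete ${failures.length} managed flow(s):\n${failures.join("\n")}`
    )
  }
}
```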
Summary
Add an optional real-backend preview phase to `ai_evals` for `script` evals and opt-in `flow` cases, while keeping the existing offline validation and judge flow unchanged by default.

This PR also makes the documented shared-workspace mode (`WMILL_AI_EVAL_BACKEND_WORKSPACE`) reliable: reruns now refresh seeded scripts and flows in place instead of failing on path conflicts or validating against stale workspace assets.

Changes
- `--backend-validation preview` support in the `ai_evals` CLI and frontend benchmark adapter for `script` and `flow` evals
- `backend-preview.json` artifacts for `script` evals and opt-in `flow` cases via `runtime.backendPreview`, including seeding `initial.workspace` fixtures before flow preview
- Seeding under real workspace paths (`f/evals/add_two_numbers`) so backend validation matches live backend path rules
- Refreshing seeded scripts via `parent_hash`, and updating seeded flows in place on create conflicts

Test plan
- `cd ai_evals && bun test adapters/frontend/backendPreview.test.ts core/backendValidation.test.ts core/cases.test.ts core/models.test.ts core/validators.test.ts`
- `source /home/farhad/windmill/ai_evals/.env && cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8070 WMILL_AI_EVAL_BACKEND_WORKSPACE=test bun run cli -- run script script-test1-greet-user --backend-validation preview --model haiku`
- `source /home/farhad/windmill/ai_evals/.env && cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8070 WMILL_AI_EVAL_BACKEND_WORKSPACE=test bun run cli -- run flow flow-test0-sum-two-numbers flow-test1-reuse-existing-script flow-test2-call-existing-subflow --backend-validation preview --model haiku`
- With `f/evals/add_two_numbers` already present in workspace `test`, rerun `flow-test1-reuse-existing-script` twice with backend preview enabled; both runs pass because seeding refreshes the script in place
- With `f/evals/add_numbers_flow` already present in workspace `test`, rerun `flow-test2-call-existing-subflow` twice with backend preview enabled; both runs pass because seeding refreshes the subflow in place
- `cd frontend && npm run check:fast` still fails due to unrelated existing `frontend/src/lib/utils_workspace_deploy.ts` export errors

Generated with Claude Code
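The rerun checks in the test plan rely on seeding that updates in place on a create conflict. A minimal sketch of that pattern; `createScript`, `updateScript`, and the `409` conflict shape are illustrative stand-ins, not the real Windmill client API:

```typescript
// Hypothetical client surface for seeding a script into a workspace.
interface SeedClient {
  createScript(path: string, content: string): Promise<void>
  updateScript(path: string, content: string): Promise<void>
}

async function seedScript(client: SeedClient, path: string, content: string): Promise<void> {
  try {
    await client.createScript(path, content)
  } catch (err: any) {
    if (err?.status === 409) {
      // The path already exists from a previous run: refresh it in place
      // so reruns validate against current fixtures, not stale assets.
      await client.updateScript(path, content)
    } else {
      throw err
    }
  }
}
```

This is why seeding the same fixture twice is idempotent rather than a path-conflict failure.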