chore: add backend preview validation to ai evals by centdix · Pull Request #8827 · windmill-labs/windmill

centdix · 2026-04-14T15:21:00Z

Summary

Add an optional real-backend preview phase to ai_evals for script evals and opt-in flow cases, while keeping the existing offline validation and judge flow unchanged by default.

This PR also makes the documented shared-workspace mode (WMILL_AI_EVAL_BACKEND_WORKSPACE) reliable: reruns now refresh seeded scripts and flows in place instead of failing on path conflicts or validating against stale workspace assets.

Changes

add --backend-validation preview support to the ai_evals CLI and frontend benchmark adapter for script and flow evals
add a backend preview client that logs into Windmill, creates or reuses a workspace, seeds reusable scripts and flows, runs preview jobs, polls completion, and saves backend-preview.json artifacts
wire backend preview checks into all script evals and opt-in flow cases via runtime.backendPreview, including seeding initial.workspace fixtures before flow preview
fix the reusable flow script fixture to use the real extensionless workspace path (f/evals/add_two_numbers) so backend validation matches live backend path rules
harden shared-workspace preview mode by serializing preview runs per workspace, updating seeded scripts via parent_hash, and updating seeded flows in place on create conflicts
add regression tests for shared-workspace script upserts, flow upserts, and lock serialization, and document that shared-workspace mode treats the target workspace as a dedicated preview workspace whose seeded fixture assets are refreshed before each run

Test plan

cd ai_evals && bun test adapters/frontend/backendPreview.test.ts core/backendValidation.test.ts core/cases.test.ts core/models.test.ts core/validators.test.ts
source /home/farhad/windmill/ai_evals/.env && cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8070 WMILL_AI_EVAL_BACKEND_WORKSPACE=test bun run cli -- run script script-test1-greet-user --backend-validation preview --model haiku
source /home/farhad/windmill/ai_evals/.env && cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8070 WMILL_AI_EVAL_BACKEND_WORKSPACE=test bun run cli -- run flow flow-test0-sum-two-numbers flow-test1-reuse-existing-script flow-test2-call-existing-subflow --backend-validation preview --model haiku
poison f/evals/add_two_numbers in workspace test, then rerun flow-test1-reuse-existing-script twice with backend preview enabled; both runs pass because seeding refreshes the script in place
poison f/evals/add_numbers_flow in workspace test, then rerun flow-test2-call-existing-subflow twice with backend preview enabled; both runs pass because seeding refreshes the subflow in place
cd frontend && npm run check:fast still fails due unrelated existing frontend/src/lib/utils_workspace_deploy.ts export errors

Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-04-14T15:25:52Z

Deploying windmill with Cloudflare Pages

Latest commit:	`df60913`
Status:	✅ Deploy successful!
Preview URL:	https://b648c9ce.windmill.pages.dev
Branch Preview URL:	https://ai-evals-with-backend.windmill.pages.dev

View logs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude · 2026-04-15T14:57:36Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

centdix · 2026-04-15T15:04:45Z

/ai-fast review this PR

claude · 2026-04-15T15:06:02Z

Claude finished @centdix's task in 5m 8s —— View job

PR Review: ai-evals backend preview validation

Read core backend preview client (adapters/frontend/backendPreview.ts)
Read backend validation pipeline (core/backendValidation.ts, core/runSuite.ts)
Read script/flow modes integration (modes/script.ts, modes/flow.ts)
Read CLI and runtime wiring (cli/index.ts, adapters/frontend/benchmarkRunner.ts, adapters/frontend/runtime.ts)
Review tests (backendPreview.test.ts, backendValidation.test.ts, cases.test.ts)
Check fixtures/README/case config changes
Verify API routes (list_paths, delete/p, etc.) against backend
Post review feedback

Overall this is a solid, well-scoped addition. The layering is clean: BackendPreviewClient is self-contained and the mode runners opt in via backendValidate. Tests exercise the shared-workspace upsert paths, the serialization lock, and the cached-login failure recovery. The isConflictMessage + parent_hash / flows/update fallback matches the real backend behavior (I verified /scripts/delete/p/{path} is POST and /flows/delete/{path} is DELETE in openapi.yaml).

A few findings, roughly in order of importance:

1. Shared-workspace cleanup is destructive and not documented loudly enough

BackendPreviewClient.clearManagedSharedWorkspaceAssets deletes every flow and script under f/evals/* in the target workspace on each run (adapters/frontend/backendPreview.ts:319-329, isManagedSharedWorkspacePath:500-502). If a user points WMILL_AI_EVAL_BACKEND_WORKSPACE at their dev workspace (which the CE workspace-limit hint at backendPreview.ts:236-244 actively encourages), any hand-authored script/flow they happen to keep under f/evals/ gets silently wiped, and flows/scripts outside that managed prefix that depend on the deleted fixtures will subsequently break.

The README mentions "clears managed eval assets under f/evals/*" but the BACKEND_WORKSPACE env var doc (README.md:144) advertises reusing an existing workspace without warning about the deletion. Consider:

calling the variable something like WMILL_AI_EVAL_PREVIEW_WORKSPACE so nobody points it at a real one, or
gating the prefix-sweep behind an explicit opt-in env var and throwing otherwise, or
at minimum, print a one-line warning to stderr on the first run describing which prefixes will be cleared.

Fix this →

2. Cleanup aborts on the first failure and leaks state

clearManagedSharedWorkspaceAssets loops through flowPaths.filter(...) and deleteFlowByPath sequentially with no per-item error handling (backendPreview.ts:321-328). If a single delete fails (say another test is holding a reference, or the workspace admin revoked a folder permission), the remaining cleanup never runs and the subsequent seedWorkspaceFixtures call fails only for whatever actually conflicts. Prefer:

for (const path of ...) {
  await this.deleteFlowByPath(workspaceId, path).catch((err) => {
    console.warn(`Failed to delete managed flow ${path}: ${err}`)
  })
}

or collect errors and throw once at the end so the artifact still reports which deletes failed.

3. `isConflictMessage` substring match is very permissive

isConflictMessage (backendPreview.ts:495-498) treats any 400 body containing "already exists" or "path conflict" as a conflict, then falls through to the upsert path. That's fine for the current Windmill error strings, but a 400 like "variable already exists" (if the script create ever validated a referenced variable) would silently trigger getScriptByPath and a retry that could mask the real error. Consider matching the stricter phrasing used in tests (Path conflict for ... with non-archived hash, Flow ... already exists) via a narrower substring or regex, so unrelated 400s surface instead of being rewritten as "update the existing script".

4. Token cache invalidation only on login failure

getToken caches the login promise and only tokenCache.deletes on login rejection (backendPreview.ts:393-406). If the cached token expires mid-suite — plausible for long shared-workspace runs with the CE default token TTL — every subsequent API call will fail with 401 and there's no re-login path. expectOk just throws check workspace <id> failed: 401 .... Cheapest fix: in request, if a response is 401/403 and the token was served from cache, delete the cache entry and retry once.

5. CLI has a redundant mode guard

cli/index.ts:174-176 rejects backend validation for non-flow/non-script modes, but resolveBackendValidationSettings in core/backendValidation.ts:43-47 already throws the same error (and is reached via runFrontendBenchmarkFromEnv). The CLI guard is still useful because runCliBenchmark doesn't call resolveBackendValidationSettings, so it's not dead code — just worth a comment to avoid future drift. Non-blocking.

6. Flow `createFlow` upsert doesn't wait for anything, unlike scripts

createScript waits for /scripts/deployment_status/h/{hash} so the subsequent preview sees a compiled lock (backendPreview.ts:114, 357-380). createFlow (backendPreview.ts:135-159) just fires POST and returns. For flows that's typically OK because there's no lockfile step, but if the update path ever runs a backfill / index rebuild (flow versions, schedule reconciliation, etc.) a racey subsequent preview could hit a stale read. Worth a // flows deploy synchronously comment or a minimal version check so the invariant is documented.

7. Script args fallback picks up `initial.args` even when they're stale

modes/script.ts:61-66:

args:
  (evalCase.runtime?.backendPreview?.args as Record<string, unknown> | undefined) ??
  actual.args ??
  initial?.args ??
  {},

ScriptEvalState.args (adapters/frontend/core/script/fileHelpers.ts:13) is the local script-test arg values. If the model refactors the script signature (renames a → x), actual.args should match, but if the model returns the same actual.args as the initial (because it forgot to update them) the preview will try to call the new signature with old arg names and fail. That failure will look like a backend regression, not a code-quality regression. Not a blocker — just consider printing which fallback was used in backend-preview.json so triage is cheaper.

Also, the as Record<string, unknown> | undefined cast is redundant since EvalCaseRuntimeBackendPreview.args is already typed that way in core/types.ts:6-8.

8. Shared-workspace mode silently serializes suite concurrency

Under workspaceOverride, withSharedWorkspaceLock serializes every case (backendPreview.ts:422-441). But runSuite still spins up concurrency=5 workers and the model calls run in parallel; only the backend-preview phase is serialized. That means the first worker blocks the whole workspace for its preview while the other workers finish their model runs and then queue on the lock. This is correct, but it can make shared-workspace runs feel "stuck" behind a slow preview with no signal. Worth an onProgress event (or stderr line) noting "awaiting shared-workspace preview slot" when the lock is held.

9. Minor: per-attempt workspace suffix still uses `randomUUID` but limits case slug to 30 chars

buildWorkspaceId (backendPreview.ts:443-451) truncates the case slug to 30 chars before adding -a<attempt>-<uuid8>. That works, but since workspacePrefix defaults to ai-evals and the suffix is already a random 8-char hash, you could drop the attempt number from the workspace id and save a few bytes — multiple attempts already get fresh UUIDs. Non-blocking stylistic preference.

10. `runtime.backendPreview.args` type mismatch for flow

modes/flow.ts:91 uses evalCase.runtime?.backendPreview?.args ?? {} which is typed Record<string, unknown>. runFlowPreview accepts Record<string, unknown>, so this is fine — just flagging that the script path has an unnecessary cast while the flow path does not, which is slightly inconsistent.

Summary

Biggest concerns are the blast radius of the shared-workspace cleanup (item 1) and the lack of token refresh for long-lived runs (item 4). Everything else is polish. The tests are the strongest part of the PR — they cover exactly the failure modes (path conflict upsert, shared-workspace lock, cached-login failure) that would otherwise be hard to debug in CI.

Nothing here blocks merge for dev-only usage, but I'd at least (a) rename or hard-gate WMILL_AI_EVAL_BACKEND_WORKSPACE before it lands in docs any wider, and (b) add 401-retry to request.
• Branch

feat: add backend preview validation to ai evals

187b417

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: refresh shared preview workspace assets

1eca397

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

centdix changed the title ~~feat: add backend preview validation to ai evals~~ chore: add backend preview validation to ai evals Apr 15, 2026

fix: harden shared backend preview validation

df60913

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

centdix marked this pull request as ready for review April 15, 2026 14:57

centdix requested review from alpetric, hugocasa and rubenfiszel as code owners April 15, 2026 14:57

rubenfiszel merged commit f1e84cb into main Apr 15, 2026
8 of 9 checks passed

rubenfiszel deleted the ai-evals-with-backend branch April 15, 2026 15:11

github-actions bot locked and limited conversation to collaborators Apr 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: add backend preview validation to ai evals#8827

chore: add backend preview validation to ai evals#8827
rubenfiszel merged 3 commits intomainfrom
ai-evals-with-backend

centdix commented Apr 14, 2026 •

edited

Loading

Uh oh!

cloudflare-workers-and-pages bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

claude bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

centdix commented Apr 15, 2026

Uh oh!

claude bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

centdix commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

cloudflare-workers-and-pages bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying windmill with Cloudflare Pages

Uh oh!

claude bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

centdix commented Apr 15, 2026

Uh oh!

claude bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: ai-evals backend preview validation

1. Shared-workspace cleanup is destructive and not documented loudly enough

2. Cleanup aborts on the first failure and leaks state

3. isConflictMessage substring match is very permissive

4. Token cache invalidation only on login failure

5. CLI has a redundant mode guard

6. Flow createFlow upsert doesn't wait for anything, unlike scripts

7. Script args fallback picks up initial.args even when they're stale

8. Shared-workspace mode silently serializes suite concurrency

9. Minor: per-attempt workspace suffix still uses randomUUID but limits case slug to 30 chars

10. runtime.backendPreview.args type mismatch for flow

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

centdix commented Apr 14, 2026 •

edited

Loading

cloudflare-workers-and-pages bot commented Apr 14, 2026 •

edited

Loading

claude bot commented Apr 15, 2026 •

edited

Loading

claude bot commented Apr 15, 2026 •

edited

Loading

3. `isConflictMessage` substring match is very permissive

6. Flow `createFlow` upsert doesn't wait for anything, unlike scripts

7. Script args fallback picks up `initial.args` even when they're stale

9. Minor: per-attempt workspace suffix still uses `randomUUID` but limits case slug to 30 chars

10. `runtime.backendPreview.args` type mismatch for flow