Skip to content

chore: add backend preview validation to ai evals#8827

Merged
rubenfiszel merged 3 commits intomainfrom
ai-evals-with-backend
Apr 15, 2026
Merged

chore: add backend preview validation to ai evals#8827
rubenfiszel merged 3 commits intomainfrom
ai-evals-with-backend

Conversation

@centdix
Copy link
Copy Markdown
Collaborator

@centdix centdix commented Apr 14, 2026

Summary

Add an optional real-backend preview phase to ai_evals for script evals and opt-in flow cases, while keeping the existing offline validation and judge flow unchanged by default.

This PR also makes the documented shared-workspace mode (WMILL_AI_EVAL_BACKEND_WORKSPACE) reliable: reruns now refresh seeded scripts and flows in place instead of failing on path conflicts or validating against stale workspace assets.

Changes

  • add --backend-validation preview support to the ai_evals CLI and frontend benchmark adapter for script and flow evals
  • add a backend preview client that logs into Windmill, creates or reuses a workspace, seeds reusable scripts and flows, runs preview jobs, polls completion, and saves backend-preview.json artifacts
  • wire backend preview checks into all script evals and opt-in flow cases via runtime.backendPreview, including seeding initial.workspace fixtures before flow preview
  • fix the reusable flow script fixture to use the real extensionless workspace path (f/evals/add_two_numbers) so backend validation matches live backend path rules
  • harden shared-workspace preview mode by serializing preview runs per workspace, updating seeded scripts via parent_hash, and updating seeded flows in place on create conflicts
  • add regression tests for shared-workspace script upserts, flow upserts, and lock serialization, and document that shared-workspace mode treats the target workspace as a dedicated preview workspace whose seeded fixture assets are refreshed before each run

Test plan

  • cd ai_evals && bun test adapters/frontend/backendPreview.test.ts core/backendValidation.test.ts core/cases.test.ts core/models.test.ts core/validators.test.ts
  • source /home/farhad/windmill/ai_evals/.env && cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8070 WMILL_AI_EVAL_BACKEND_WORKSPACE=test bun run cli -- run script script-test1-greet-user --backend-validation preview --model haiku
  • source /home/farhad/windmill/ai_evals/.env && cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8070 WMILL_AI_EVAL_BACKEND_WORKSPACE=test bun run cli -- run flow flow-test0-sum-two-numbers flow-test1-reuse-existing-script flow-test2-call-existing-subflow --backend-validation preview --model haiku
  • poison f/evals/add_two_numbers in workspace test, then rerun flow-test1-reuse-existing-script twice with backend preview enabled; both runs pass because seeding refreshes the script in place
  • poison f/evals/add_numbers_flow in workspace test, then rerun flow-test2-call-existing-subflow twice with backend preview enabled; both runs pass because seeding refreshes the subflow in place
  • cd frontend && npm run check:fast still fails due unrelated existing frontend/src/lib/utils_workspace_deploy.ts export errors

Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages bot commented Apr 14, 2026

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: df60913
Status: ✅  Deploy successful!
Preview URL: https://b648c9ce.windmill.pages.dev
Branch Preview URL: https://ai-evals-with-backend.windmill.pages.dev

View logs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@centdix centdix changed the title feat: add backend preview validation to ai evals chore: add backend preview validation to ai evals Apr 15, 2026
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@centdix centdix marked this pull request as ready for review April 15, 2026 14:57
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@centdix
Copy link
Copy Markdown
Collaborator Author

centdix commented Apr 15, 2026

/ai-fast review this PR

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude finished @centdix's task in 5m 8s —— View job


PR Review: ai-evals backend preview validation

  • Read core backend preview client (adapters/frontend/backendPreview.ts)
  • Read backend validation pipeline (core/backendValidation.ts, core/runSuite.ts)
  • Read script/flow modes integration (modes/script.ts, modes/flow.ts)
  • Read CLI and runtime wiring (cli/index.ts, adapters/frontend/benchmarkRunner.ts, adapters/frontend/runtime.ts)
  • Review tests (backendPreview.test.ts, backendValidation.test.ts, cases.test.ts)
  • Check fixtures/README/case config changes
  • Verify API routes (list_paths, delete/p, etc.) against backend
  • Post review feedback

Overall this is a solid, well-scoped addition. The layering is clean: BackendPreviewClient is self-contained and the mode runners opt in via backendValidate. Tests exercise the shared-workspace upsert paths, the serialization lock, and the cached-login failure recovery. The isConflictMessage + parent_hash / flows/update fallback matches the real backend behavior (I verified /scripts/delete/p/{path} is POST and /flows/delete/{path} is DELETE in openapi.yaml).

A few findings, roughly in order of importance:

1. Shared-workspace cleanup is destructive and not documented loudly enough

BackendPreviewClient.clearManagedSharedWorkspaceAssets deletes every flow and script under f/evals/* in the target workspace on each run (adapters/frontend/backendPreview.ts:319-329, isManagedSharedWorkspacePath:500-502). If a user points WMILL_AI_EVAL_BACKEND_WORKSPACE at their dev workspace (which the CE workspace-limit hint at backendPreview.ts:236-244 actively encourages), any hand-authored script/flow they happen to keep under f/evals/ gets silently wiped, and flows/scripts outside that managed prefix that depend on the deleted fixtures will subsequently break.

The README mentions "clears managed eval assets under f/evals/*" but the BACKEND_WORKSPACE env var doc (README.md:144) advertises reusing an existing workspace without warning about the deletion. Consider:

  • calling the variable something like WMILL_AI_EVAL_PREVIEW_WORKSPACE so nobody points it at a real one, or
  • gating the prefix-sweep behind an explicit opt-in env var and throwing otherwise, or
  • at minimum, print a one-line warning to stderr on the first run describing which prefixes will be cleared.

Fix this →

2. Cleanup aborts on the first failure and leaks state

clearManagedSharedWorkspaceAssets loops through flowPaths.filter(...) and deleteFlowByPath sequentially with no per-item error handling (backendPreview.ts:321-328). If a single delete fails (say another test is holding a reference, or the workspace admin revoked a folder permission), the remaining cleanup never runs and the subsequent seedWorkspaceFixtures call fails only for whatever actually conflicts. Prefer:

for (const path of ...) {
  await this.deleteFlowByPath(workspaceId, path).catch((err) => {
    console.warn(`Failed to delete managed flow ${path}: ${err}`)
  })
}

or collect errors and throw once at the end so the artifact still reports which deletes failed.

3. isConflictMessage substring match is very permissive

isConflictMessage (backendPreview.ts:495-498) treats any 400 body containing "already exists" or "path conflict" as a conflict, then falls through to the upsert path. That's fine for the current Windmill error strings, but a 400 like "variable already exists" (if the script create ever validated a referenced variable) would silently trigger getScriptByPath and a retry that could mask the real error. Consider matching the stricter phrasing used in tests (Path conflict for ... with non-archived hash, Flow ... already exists) via a narrower substring or regex, so unrelated 400s surface instead of being rewritten as "update the existing script".

4. Token cache invalidation only on login failure

getToken caches the login promise and only tokenCache.deletes on login rejection (backendPreview.ts:393-406). If the cached token expires mid-suite — plausible for long shared-workspace runs with the CE default token TTL — every subsequent API call will fail with 401 and there's no re-login path. expectOk just throws check workspace <id> failed: 401 .... Cheapest fix: in request, if a response is 401/403 and the token was served from cache, delete the cache entry and retry once.

5. CLI has a redundant mode guard

cli/index.ts:174-176 rejects backend validation for non-flow/non-script modes, but resolveBackendValidationSettings in core/backendValidation.ts:43-47 already throws the same error (and is reached via runFrontendBenchmarkFromEnv). The CLI guard is still useful because runCliBenchmark doesn't call resolveBackendValidationSettings, so it's not dead code — just worth a comment to avoid future drift. Non-blocking.

6. Flow createFlow upsert doesn't wait for anything, unlike scripts

createScript waits for /scripts/deployment_status/h/{hash} so the subsequent preview sees a compiled lock (backendPreview.ts:114, 357-380). createFlow (backendPreview.ts:135-159) just fires POST and returns. For flows that's typically OK because there's no lockfile step, but if the update path ever runs a backfill / index rebuild (flow versions, schedule reconciliation, etc.) a racey subsequent preview could hit a stale read. Worth a // flows deploy synchronously comment or a minimal version check so the invariant is documented.

7. Script args fallback picks up initial.args even when they're stale

modes/script.ts:61-66:

args:
  (evalCase.runtime?.backendPreview?.args as Record<string, unknown> | undefined) ??
  actual.args ??
  initial?.args ??
  {},

ScriptEvalState.args (adapters/frontend/core/script/fileHelpers.ts:13) is the local script-test arg values. If the model refactors the script signature (renames ax), actual.args should match, but if the model returns the same actual.args as the initial (because it forgot to update them) the preview will try to call the new signature with old arg names and fail. That failure will look like a backend regression, not a code-quality regression. Not a blocker — just consider printing which fallback was used in backend-preview.json so triage is cheaper.

Also, the as Record<string, unknown> | undefined cast is redundant since EvalCaseRuntimeBackendPreview.args is already typed that way in core/types.ts:6-8.

8. Shared-workspace mode silently serializes suite concurrency

Under workspaceOverride, withSharedWorkspaceLock serializes every case (backendPreview.ts:422-441). But runSuite still spins up concurrency=5 workers and the model calls run in parallel; only the backend-preview phase is serialized. That means the first worker blocks the whole workspace for its preview while the other workers finish their model runs and then queue on the lock. This is correct, but it can make shared-workspace runs feel "stuck" behind a slow preview with no signal. Worth an onProgress event (or stderr line) noting "awaiting shared-workspace preview slot" when the lock is held.

9. Minor: per-attempt workspace suffix still uses randomUUID but limits case slug to 30 chars

buildWorkspaceId (backendPreview.ts:443-451) truncates the case slug to 30 chars before adding -a<attempt>-<uuid8>. That works, but since workspacePrefix defaults to ai-evals and the suffix is already a random 8-char hash, you could drop the attempt number from the workspace id and save a few bytes — multiple attempts already get fresh UUIDs. Non-blocking stylistic preference.

10. runtime.backendPreview.args type mismatch for flow

modes/flow.ts:91 uses evalCase.runtime?.backendPreview?.args ?? {} which is typed Record<string, unknown>. runFlowPreview accepts Record<string, unknown>, so this is fine — just flagging that the script path has an unnecessary cast while the flow path does not, which is slightly inconsistent.


Summary

Biggest concerns are the blast radius of the shared-workspace cleanup (item 1) and the lack of token refresh for long-lived runs (item 4). Everything else is polish. The tests are the strongest part of the PR — they cover exactly the failure modes (path conflict upsert, shared-workspace lock, cached-login failure) that would otherwise be hard to debug in CI.

Nothing here blocks merge for dev-only usage, but I'd at least (a) rename or hard-gate WMILL_AI_EVAL_BACKEND_WORKSPACE before it lands in docs any wider, and (b) add 401-retry to request.
Branch

@rubenfiszel rubenfiszel merged commit f1e84cb into main Apr 15, 2026
8 of 9 checks passed
@rubenfiszel rubenfiszel deleted the ai-evals-with-backend branch April 15, 2026 15:11
@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants