feat(studio): expose eval resumability — API + Resume action on run detail by christso · Pull Request #1220 · EntityProcess/agentv

christso · 2026-05-06T05:08:06Z

Summary

Surfaces the existing CLI resume mechanics (--resume, --rerun-failed, --retry-errors, --output) in Studio so users can finish an interrupted or partially-errored run from the web UI instead of dropping to a terminal.

Changes

Server (`apps/cli/src/commands/results/`)

RunEvalRequest accepts resume / rerun_failed / retry_errors / output (snake_case wire format).
buildCliArgs translates these into --resume, --rerun-failed, --retry-errors <path>, --output <dir>.
New validateResumeOptions returns 400 with a usable message when more than one of the three modes (resume, rerun_failed, retry_errors) is requested at once.
Read-only guard now also covers /api/benchmarks/:id/eval/run (was missing before).
handleRunDetail reads benchmark.json for the run and exposes run_dir (relative to cwd) and suite_filter (from metadata.eval_file) so the UI can target the same workspace. Local runs only — remote runs in the results-repo cache cannot be resumed in place.
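
For readers who want a concrete picture of the server-side shaping, here is a minimal sketch of the validation and flag translation described above. It is not the code from this PR: the `Sketch`-suffixed function names, the boolean types for `resume` and `rerun_failed`, the flag order, and the error text are all assumptions. Only the four field names and the four CLI flags come from the description.

```typescript
// Hypothetical request shape. Only the four field names come from the PR description;
// the types are assumed.
interface ResumeFields {
  resume?: boolean;
  rerun_failed?: boolean;
  retry_errors?: string; // path forwarded to --retry-errors
  output?: string; // directory forwarded to --output
}

// Returns an error message (suitable for a 400 response) or null when the request is valid.
function validateResumeOptionsSketch(req: ResumeFields): string | null {
  const modes: string[] = [];
  if (req.resume) modes.push("resume");
  if (req.rerun_failed) modes.push("rerun_failed");
  // Trim first so a whitespace-only string does not count as a mode.
  if (req.retry_errors?.trim()) modes.push("retry_errors");
  return modes.length > 1
    ? `Choose only one of resume, rerun_failed, retry_errors (got: ${modes.join(", ")})`
    : null;
}

// Translates the snake_case wire fields into CLI flags (flag order is an assumption).
function buildResumeArgsSketch(req: ResumeFields): string[] {
  const args: string[] = [];
  const output = req.output?.trim();
  if (output) args.push("--output", output);
  if (req.resume) args.push("--resume");
  if (req.rerun_failed) args.push("--rerun-failed");
  const retryErrors = req.retry_errors?.trim();
  if (retryErrors) args.push("--retry-errors", retryErrors);
  return args;
}
```

Under these assumptions, a request with `rerun_failed: true` and an `output` directory yields `--output <dir> --rerun-failed`, which has the same shape as the spawned command shown in the live e2e section.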

UI (`apps/studio/src/`)

New ResumeRunActions component renders "↻ Resume run" + "Rerun failed cases" buttons on /runs/:runId and /benchmarks/:id/runs/:runId when at least one row has executionStatus === 'execution_error'.
Hidden in read-only mode; disabled with an explanatory tooltip when run_dir or suite_filter cannot be resolved.
After POSTing to /api/eval/run, navigates to /jobs/:runId for live progress.
Pure helpers (shouldShowResumeActions, buildResumeRequestBody) are unit-tested without rendering React.
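
The two pure helpers can be pictured roughly as below. This is a sketch under stated assumptions rather than the component's actual code: the `Sketch`-suffixed names, the parameter lists, the `target` field, and the use of `suite_filter` as a request field are hypothetical. The visibility rule (hidden in read-only mode, shown when a row has `executionStatus === 'execution_error'`) and the `run_dir` / `suite_filter` inputs come from the description.

```typescript
// All names and field shapes here are illustrative assumptions.
interface ResultRow {
  executionStatus: string;
}

interface RunDetail {
  run_dir?: string; // exposed by GET /api/runs/:filename for local runs
  suite_filter?: string; // derived from metadata.eval_file
}

type ResumeMode = "resume" | "rerun_failed";

// Buttons are hidden in read-only mode and shown only when some row errored.
function shouldShowResumeActionsSketch(rows: ResultRow[], readOnly: boolean): boolean {
  if (readOnly) return false;
  return rows.some((row) => row.executionStatus === "execution_error");
}

// Returns null when the run cannot be targeted (the UI would disable the button).
function buildResumeRequestBodySketch(
  mode: ResumeMode,
  run: RunDetail,
  target?: string,
): Record<string, unknown> | null {
  if (!run.run_dir || !run.suite_filter) return null;
  const body: Record<string, unknown> = {
    suite_filter: run.suite_filter,
    output: run.run_dir,
    [mode]: true,
  };
  // A missing target is omitted instead of being sent as an empty value.
  if (target) body.target = target;
  return body;
}
```

Keeping both functions free of React state is what allows them to be tested with plain inputs and outputs, as the description notes.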

Tests added

apps/cli/test/commands/results/serve.test.ts — request/preview shaping for resume / rerun_failed / retry_errors, mutual-exclusivity 400s, read-only 403 (unscoped + benchmark-scoped), run_dir + suite_filter exposure on /api/runs/:filename. 53 tests in this file.
apps/studio/src/components/resume-run-helpers.test.ts — visibility logic and request-body shape for both modes (incl. read-only hides, missing target omitted). 7 tests.

Test plan

bun run test — 2337 tests pass
bun run typecheck — clean
bun run lint — clean
Manual red/green UAT (synthetic fixture)
Live e2e against Azure OpenAI (real provider, real execution_error, click-through)

Red / Green UAT — synthetic fixture

Hand-crafted run workspace with one execution_status: execution_error row + a benchmark.json whose metadata.eval_file points at a known eval YAML. Confirms wire contract + UI surface across main, this branch, and --read-only.

Red — `main`

GET /api/runs/:filename does not include run_dir / suite_filter. Run detail page exposes only "▶ Re-run with Filters":

- heading "gpt-4o" [level=1]
- button "▶ Re-run with Filters"

Green — this branch

GET /api/runs/:filename returns:

"run_dir":".agentv/results/runs/2026-05-06T00-00-00-000Z",
"suite_filter":"examples/features/basic/evals/dataset.eval.yaml"

UI renders the new actions:

- button "↻ Resume run"
- button "Rerun failed cases"
- button "▶ Re-run with Filters"

/api/eval/preview produces the expected CLI invocations and validation rejects mode combos with 400. Read-only mode hides both buttons and the API still returns 403.

Live e2e UAT — real Azure OpenAI run

Built a 2-test eval, ran it with --budget-usd 0.000001 --workers 1 to deliberately trigger one execution_error (budget_exceeded on the second test). Then opened the run in Studio and clicked Rerun failed cases.

Eval definition (tiny.eval.yaml):

tests:
  - id: cheap-greet
    criteria: Assistant says hello.
    input: "Say hello in one short sentence."
    expected_output: "Hello!"
  - id: also-cheap-greet
    criteria: Assistant says goodbye.
    input: "Say goodbye in one short sentence."
    expected_output: "Goodbye!"

Initial state (index.jsonl, before button click):

{"test_id":"cheap-greet",      "execution_status":"ok",              "score":1, "timestamp":"05:36:23.544Z"}
{"test_id":"also-cheap-greet", "execution_status":"execution_error", "score":0, "timestamp":"05:36:23.553Z"}

UI snapshot at /runs/2026-05-06T05-36-19-075Z — both new buttons visible, header shows 50% pass rate:

- heading "azure" [level=1]
- button "↻ Resume run"
- button "Rerun failed cases"
- button "▶ Re-run with Filters"
- cell "✓"   - cell "cheap-greet"      - cell "100%"  (1.5s)
- cell "!"   - cell "also-cheap-greet" - cell "ERR"   (0.0s)

Click "Rerun failed cases" → browser navigated to /jobs/studio-20260506-074017-xatb. The Studio job tracker showed status running then finished with exit code 0. Spawned CLI command (returned in the launch response):

agentv eval /tmp/agentv-e2e-oVjxrS/tiny.eval.yaml --target azure --output .agentv/results/runs/default/2026-05-06T05-36-19-075Z --rerun-failed

Final state (index.jsonl, after rerun finished):

{"test_id":"cheap-greet",      "execution_status":"ok", "score":1, "timestamp":"05:36:23.544Z"}  ← unchanged
{"test_id":"also-cheap-greet", "execution_status":"execution_error", "score":0, "timestamp":"05:36:23.553Z"}  ← original error row preserved
{"test_id":"also-cheap-greet", "execution_status":"ok", "score":1, "timestamp":"05:40:22.003Z"}  ← re-run, now passing

Acceptance: the previously passing test was skipped (timestamp on cheap-greet unchanged); the errored test was re-run and now passes; the pass rate in the Studio header went from 50% to 67%.
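
The two pass-rate figures are consistent with a per-row calculation over index.jsonl in which the preserved error row still counts: 1 of 2 rows before the rerun, 2 of 3 rows after. The sketch below only reproduces that arithmetic. Whether Studio computes the header this way is an assumption, as are the function name and the rule that a row passes when its status is `ok` and its score is 1.

```typescript
// Row shape mirrors the index.jsonl excerpts above (timestamp omitted).
interface IndexRow {
  test_id: string;
  execution_status: string;
  score: number;
}

// Assumed rule: a row passes when it executed without error and scored 1.
function passRatePercent(rows: IndexRow[]): number {
  if (rows.length === 0) return 0;
  const passed = rows.filter((r) => r.execution_status === "ok" && r.score === 1).length;
  return Math.round((passed / rows.length) * 100);
}

const before: IndexRow[] = [
  { test_id: "cheap-greet", execution_status: "ok", score: 1 },
  { test_id: "also-cheap-greet", execution_status: "execution_error", score: 0 },
];

// The rerun appends a new row and leaves the original error row in place.
const after: IndexRow[] = [
  ...before,
  { test_id: "also-cheap-greet", execution_status: "ok", score: 1 },
];

console.log(passRatePercent(before), passRatePercent(after)); // 50 67
```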

Pre-existing bug discovered

The live e2e surfaced an unrelated bug in resolveCliPath (off-by-one in the currentDir fallback path) which prevents Studio from spawning the CLI when run from source against a foreign cwd. Filed as #1221 and worked around in this UAT with an agentv PATH shim. Not in scope for this PR — the global-install path used by end users is unaffected.

🤖 Generated with Claude Code

…etail

Surfaces the existing CLI resume mechanics (--resume, --rerun-failed, --retry-errors, --output) in Studio so users can finish an interrupted run from the web UI instead of dropping to a terminal.

Server:
- RunEvalRequest accepts resume / rerun_failed / retry_errors / output.
- buildCliArgs translates them to the corresponding CLI flags.
- Mutual-exclusivity validation returns 400 with a usable error.
- Read-only guard now also covers /api/benchmarks/:id/eval/run.
- handleRunDetail returns run_dir + suite_filter (from benchmark.json's metadata.eval_file) for local runs so the UI can target the same workspace.

UI:
- New ResumeRunActions component renders "Resume run" + "Rerun failed cases" buttons on /runs/:runId (and the benchmark-scoped variant) when at least one row has executionStatus === 'execution_error'.
- Hidden in read-only mode; disabled with an explanatory tooltip when run_dir or suite_filter cannot be resolved (e.g. remote runs).
- After launch, navigates to /jobs/:runId for live progress.
- Pure helpers (shouldShowResumeActions, buildResumeRequestBody) are unit-tested without rendering React.

Closes #1219

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-05-06T05:08:41Z

Deploying agentv with Cloudflare Pages

Latest commit:	`e686bef`
Status:	✅ Deploy successful!
Preview URL:	https://763dfa33.agentv.pages.dev
Branch Preview URL:	https://feat-1219-studio-resume.agentv.pages.dev


- validateResumeOptions: trim retry_errors before counting it as a mode (matches buildCliArgs trim, so whitespace-only strings can no longer pass validation but emit no flag)
- deriveResumeMeta: explicitly handle '' from path.relative (runDir === cwd) by falling through to the absolute path; previous truthiness check would have done the same but was less obvious

Both nits surfaced in code review of #1220.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
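
The path.relative edge case mentioned in this commit can be illustrated as follows. Node's `path.relative` returns an empty string when both arguments resolve to the same directory, so a run directory equal to the cwd needs an explicit fallback. The function name `runDirForClientSketch` is hypothetical and the body is a sketch of the described behavior, not the actual deriveResumeMeta.

```typescript
import * as path from "node:path";

// Returns the run directory relative to cwd, or the absolute path when the
// two are the same directory (path.relative yields "" in that case).
function runDirForClientSketch(runDir: string, cwd: string): string {
  const relative = path.relative(cwd, runDir);
  return relative === "" ? path.resolve(runDir) : relative;
}

// Usage: a nested run directory stays relative, an identical one becomes absolute.
console.log(runDirForClientSketch("/work/proj/.agentv/results/runs/r1", "/work/proj"));
console.log(runDirForClientSketch("/work/proj", "/work/proj"));
```

Comparing against `""` explicitly behaves the same as a truthiness check here, but it makes the intent visible to the next reader, which matches the commit's stated motivation.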

christso · 2026-05-06T10:04:00Z

UAT screenshots

⚠️ Hosted on litterbox.catbox.moe — links expire in 72 hours (gist refused the binaries; imgur/0x0.st require API keys or are down). I can re-host on a permanent side branch if needed.

1. Red — `main` branch (no Resume button)

Run detail page on main against the fixture: only "▶ Re-run with Filters" is shown, even with one row in execution_error state.

2. Green — feature branch (synthetic fixture)

Both "↻ Resume run" and "Rerun failed cases" buttons render alongside the existing "▶ Re-run with Filters".

3. Read-only mode

agentv studio --read-only: both buttons are hidden client-side (server still 403s the launch endpoint as defence-in-depth).

4. Live e2e — `/jobs/<id>` after clicking Rerun

After triggering "Rerun failed cases" on a real run with one Azure-OpenAI execution_error (forced via --budget-usd 0.000001), the browser navigated to the jobs page. Status is Finished, exit code 0, and the spawned command is the expected agentv eval … --output … --rerun-failed.

christso · 2026-05-06T10:14:42Z

UAT screenshots (re-hosted)

The earlier litterbox.catbox.moe links rendered as broken images — GitHub's camo proxy refuses that host. Re-hosted on the throwaway screenshots/pr-1220 branch (never merged into main).

1. Red — `main` branch (no Resume button)

Run detail page on main against the synthetic fixture: only "▶ Re-run with Filters" is shown, even with one row in execution_error.

2. Green — feature branch (synthetic fixture)

Both "↻ Resume run" and "Rerun failed cases" buttons render alongside the existing "▶ Re-run with Filters".

3. Read-only mode

agentv studio --read-only: both buttons are hidden client-side (server still 403s the launch endpoint as defence-in-depth).

4. Live e2e — `/jobs/<id>` after clicking Rerun

After triggering "Rerun failed cases" on a real Azure-OpenAI run with one execution_error (forced via --budget-usd 0.000001), the browser navigated to the jobs page. Status Finished, exit code 0, spawned command is the expected agentv eval … --output … --rerun-failed.

christso marked this pull request as ready for review May 6, 2026 05:23

christso mentioned this pull request May 6, 2026

bug(cli): resolveCliPath returns 'Cannot locate agentv CLI entry point' when Studio is run from source against a foreign cwd #1221

Closed

3 tasks

christso merged commit dd6d3bd into main May 6, 2026
4 checks passed

christso deleted the feat/1219-studio-resume branch May 6, 2026 11:48

This was referenced May 6, 2026

feat(studio): add Stop run button + graceful CLI interrupt — pairs with eval resume #1222

Open

fix(cli): resolveCliPath off-by-one in src + dist fallback paths #1223

Merged

christso merged 2 commits into main from feat/1219-studio-resume
