Objective
Surface the existing CLI resume mechanics (--resume, --rerun-failed, --output <dir>) in Studio so a user staring at an interrupted or partially-errored run can finish it from the web UI instead of dropping to a terminal.
Today Studio can launch a fresh eval (POST /api/eval/run) and renders execution_error per test on the run detail page, but the launch request shape doesn't carry the resume parameters and no UI affordance exists.
Follow-up to #1216 / PR #1217, which scoped TUI + flag-level UX + docs + auto-detect but explicitly deferred Studio.
Background — current state in code
- Launch endpoint: apps/cli/src/commands/results/eval-runner.ts:240 (unscoped) and apps/cli/src/commands/results/eval-runner.ts:407 (benchmark-scoped). Both call buildCliArgs and spawn the CLI.
- Request shape: RunEvalRequest at apps/cli/src/commands/results/eval-runner.ts:101 has suite_filter, test_ids, target, threshold, workers, dry_run. It is missing resume, rerun_failed, retry_errors, output.
- CLI arg builder: buildCliArgs at apps/cli/src/commands/results/eval-runner.ts:110.
- UI client: apps/studio/src/lib/api.ts:529 (runEval function).
- Run detail route: apps/studio/src/routes/runs/$runId.tsx and benchmark variant apps/studio/src/routes/benchmarks/$benchmarkId_/runs/$runId.tsx.
- Run detail component: apps/studio/src/components/RunDetail.tsx:174 already renders executionStatus === 'execution_error' per row, so the data needed to decide "is there anything to resume" is already on the page.
- Job polling page: apps/studio/src/routes/jobs/$runId.tsx (existing; reuse for the post-resume status view).
- Read-only guard: the launch endpoint already rejects in read-only mode (eval-runner.ts:241); the new behaviour must respect this.
Proposed changes
1. Extend the launch API (server)
Add to RunEvalRequest:
interface RunEvalRequest {
  // ...existing fields...
  resume?: boolean;
  rerun_failed?: boolean;
  retry_errors?: string; // path to a prior run dir or index.jsonl
  output?: string;       // explicit run dir; required when resume/rerun_failed are set
                         // and the server isn't auto-detecting from cache
}
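Wire format is snake_case per AGENTS.md ("Wire Format Convention"). Validation:
- resume and rerun_failed are mutually exclusive.
- retry_errors is mutually exclusive with resume/rerun_failed.
- If resume or rerun_failed is set without output, accept it: the CLI will auto-detect from .agentv/cache.json (landed in PR #1217).
- Return 400 with a clear error message on invalid combinations.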
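One way the validation could look, as a minimal sketch rather than the actual handler code. It assumes the rules are: resume and rerun_failed exclude each other, retry_errors excludes both, and a missing output is accepted because the CLI auto-detects the run dir. The ResumeFields type and validateResumeFields name are hypothetical and do not exist in eval-runner.ts.

```typescript
// Hypothetical subset of RunEvalRequest: only the new resume-related fields.
interface ResumeFields {
  resume?: boolean;
  rerun_failed?: boolean;
  retry_errors?: string;
  output?: string;
}

// Returns an error message for an invalid combination, or null when the
// combination is acceptable. (Assumed helper, not part of the codebase.)
function validateResumeFields(req: ResumeFields): string | null {
  if (req.resume && req.rerun_failed) {
    return "resume and rerun_failed are mutually exclusive";
  }
  if (req.retry_errors !== undefined && (req.resume || req.rerun_failed)) {
    return "retry_errors cannot be combined with resume or rerun_failed";
  }
  // resume / rerun_failed without output is allowed: the CLI is expected
  // to auto-detect the run dir from its cache.
  return null;
}
```

The launch handler would presumably map a non-null result to a 400 response carrying the message, and only call buildCliArgs when the result is null.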
Extend buildCliArgs to translate these into --resume, --rerun-failed, --retry-errors <path>, --output <dir>.
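The flag translation might be sketched as follows. This is illustrative only: resumeCliArgs is a hypothetical helper that returns just the extra arguments, whereas the real change would presumably live inside buildCliArgs, whose current signature is not shown in this issue.

```typescript
// Hypothetical subset of RunEvalRequest: only the new resume-related fields.
interface ResumeFields {
  resume?: boolean;
  rerun_failed?: boolean;
  retry_errors?: string;
  output?: string;
}

// Translates the new request fields into CLI arguments. Paths are pushed as
// separate array elements so they can be handed to a spawn-style API without
// any shell quoting.
function resumeCliArgs(req: ResumeFields): string[] {
  const args: string[] = [];
  if (req.resume) args.push("--resume");
  if (req.rerun_failed) args.push("--rerun-failed");
  if (req.retry_errors) args.push("--retry-errors", req.retry_errors);
  if (req.output) args.push("--output", req.output);
  return args;
}

// Example: resuming into an explicit run dir ("runs/example" is a made-up path).
const exampleArgs = resumeCliArgs({ resume: true, output: "runs/example" });
// exampleArgs is ["--resume", "--output", "runs/example"]
```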
2. Add UI action on the run detail page
On /runs/:runId (and the benchmark-scoped equivalent), when the loaded run contains at least one result with executionStatus === 'execution_error':
- Render a primary button labelled "Resume run" that calls POST /api/eval/run with { suite_filter: <run's suite filter>, target: <run's target>, output: <run dir>, resume: true }.
- Render a secondary button "Rerun failed cases" that does the same with rerun_failed: true instead of resume: true. (Same in-place semantics as the CLI flag: re-runs everything that wasn't executionStatus === 'ok'.)
- After POST, redirect to /jobs/:runId (existing route) to show progress.
- Disable both buttons in read-only mode.
UI placement: top-right of the RunDetail header is fine — keep it visible without scrolling.
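The button logic can be kept in small pure functions, separate from the component, which also makes it easy to unit test. The sketch below is an assumption about shape, not the real Studio types: LoadedRun, its field names (suiteFilter, target, runDir, results), and all three helpers are hypothetical.

```typescript
// Hypothetical view of the run data already loaded on the detail page.
interface ResultRow {
  executionStatus: string;
}
interface LoadedRun {
  suiteFilter: string;
  target: string;
  runDir: string;
  results: ResultRow[];
}

// Request body in snake_case, matching the wire format convention.
interface ResumeRequestBody {
  suite_filter: string;
  target: string;
  output: string;
  resume?: boolean;
  rerun_failed?: boolean;
}

type ResumeMode = "resume" | "rerun_failed";

// True when at least one result row ended in an execution error.
function hasExecutionError(run: LoadedRun): boolean {
  return run.results.some((r) => r.executionStatus === "execution_error");
}

// Builds the POST /api/eval/run body for either button.
function buildResumeBody(run: LoadedRun, mode: ResumeMode): ResumeRequestBody {
  const base = { suite_filter: run.suiteFilter, target: run.target, output: run.runDir };
  return mode === "resume" ? { ...base, resume: true } : { ...base, rerun_failed: true };
}

// Buttons show only when there is something to resume; read-only disables them.
function resumeButtonState(run: LoadedRun, readOnly: boolean): { visible: boolean; disabled: boolean } {
  return { visible: hasExecutionError(run), disabled: readOnly };
}
```

Both buttons would share buildResumeBody and differ only in the mode argument, so the request shape stays identical apart from the one flag.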
3. Tests
- Server tests in apps/cli/test/commands/results/serve.test.ts (existing file): add cases for valid resume/rerun_failed/retry_errors requests, mutual-exclusivity rejections, and the read-only guard.
- UI tests: assert the button only renders when the run has at least one execution_error row; assert the request body shape on click; assert read-only hides/disables the buttons.
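The server cases lend themselves to a table. The sketch below is framework-agnostic and does not reflect how serve.test.ts is actually organised: the Case shape, failingCases, and standInAccepts are all hypothetical, and standInAccepts merely stands in for a call to the real endpoint.

```typescript
interface Body {
  resume?: boolean;
  rerun_failed?: boolean;
  retry_errors?: string;
  output?: string;
}
interface Case {
  name: string;
  body: Body;
  readOnly: boolean;
  expectOk: boolean;
}

// Paths such as "runs/example" are placeholders.
const cases: Case[] = [
  { name: "resume with output", body: { resume: true, output: "runs/example" }, readOnly: false, expectOk: true },
  { name: "rerun_failed without output", body: { rerun_failed: true }, readOnly: false, expectOk: true },
  { name: "retry_errors alone", body: { retry_errors: "runs/example/index.jsonl" }, readOnly: false, expectOk: true },
  { name: "resume + rerun_failed", body: { resume: true, rerun_failed: true }, readOnly: false, expectOk: false },
  { name: "retry_errors + resume", body: { retry_errors: "runs/example", resume: true }, readOnly: false, expectOk: false },
  { name: "read-only rejects a valid resume", body: { resume: true }, readOnly: true, expectOk: false },
];

// Runs every case through `accepts` and returns the names of mismatches.
function failingCases(accepts: (body: Body, readOnly: boolean) => boolean, table: Case[]): string[] {
  return table.filter((c) => accepts(c.body, c.readOnly) !== c.expectOk).map((c) => c.name);
}

// Stand-in for the real endpoint, encoding the rules this issue proposes.
function standInAccepts(body: Body, readOnly: boolean): boolean {
  if (readOnly) return false;
  if (body.resume && body.rerun_failed) return false;
  if (body.retry_errors !== undefined && (body.resume || body.rerun_failed)) return false;
  return true;
}
```

In the real test file each row would become a request against the launch endpoint, with expectOk mapped to the response status.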
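Acceptance signals
- RunEvalRequest accepts resume, rerun_failed, retry_errors, output (snake_case keys).
- [...] command field returned in the launch response.
- /runs/:runId shows a "Resume run" button when any row has executionStatus === 'execution_error'; clicking it triggers a launch with resume: true + output: <runDir> and redirects to /jobs/:runId.
- The "Rerun failed cases" button does the same with rerun_failed: true.
- [...] main; green = same scenario, click Resume, observe that the new run dir reuses the same path and the previously passing tests are skipped.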
Non-goals
- No /runs list filter for incomplete runs. Add the action where users already are (the detail page); broader filters can be a separate, smaller issue if usage warrants.
- No new resume verbs. Surface the three existing CLI flags; don't invent a fourth.
- No --retry-errors <path> UI picker. The path-based variant is for cross-run cases; in-Studio resume targets the run currently being viewed, so output: <currentRunDir> is sufficient.
- No scheduled / auto-resume. Manual button click only.
- No changes to the run-launch wizard / form for new runs — this issue is about resuming existing runs.
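Related
- AGENTS.md → "Wire Format Convention"
- AGENTS.md → "Issue Workflow" (claim on the project board before starting)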
Estimate
~1 day. Server change is mechanical (one interface, one arg builder, validation, tests). UI change is one button + one route handler + tests. No design work needed — peers (promptfoo cloud) put resume actions on run detail pages too.