feat(studio): per-run comparison with retroactive labelling #1040
Adds a per-run mode to the Studio Compare tab so users can select 2+ individual runs and see them side-by-side, independent of the existing (experiment, target) aggregation. Runs can be retroactively labelled via a sidecar label.json written next to index.jsonl; the label replaces the timestamp in compare column headers.

Backend:
- `apps/cli/src/commands/results/run-label.ts`: sidecar read/write/delete helpers (label.json next to manifest, 120-char cap, JSON schema).
- `serve.ts`: /api/compare now returns a `runs[]` array with per-run entries (one per workspace), and enriches /api/runs with any label.
- New endpoints: `PUT/DELETE /api/runs/:filename/label` and the benchmark-scoped variants. Remote runs are read-only.

Frontend:
- `CompareTab.tsx` completely reworked with an "Editorial Data Terminal" aesthetic: Fraunces serif display, JetBrains Mono tabular numerals, warm off-black canvas, antique gold accents. Scoped via inline styles under `[data-compare-root]` so it does not bleed into other surfaces.
- Two modes: Aggregated (default, existing matrix re-skinned) and Per run (checkbox-selectable runs table + sticky Compare N bar + inline label editor). Compare view renders one column per selected run with label-or-timestamp headers and reuses the existing test breakdown.
- API hooks `saveRunLabelApi` / `deleteRunLabelApi` invalidate compare and runs caches on mutation.

Closes #1037
Deploying agentv with
| Latest commit: | 7fd9f06 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://1a2c54c4.agentv.pages.dev |
| Branch Preview URL: | https://feat-1037-per-run-compare.agentv.pages.dev |
- CompareTab AggregatedView: hoist useMemo above the early return so adding a second experiment/target after the initial render does not violate the Rules of Hooks.
- Pass `benchmarkId` and `readOnly` through to CompareTab from both routes (single-project and benchmark-scoped). Previously label mutations in the benchmark view routed to the unscoped endpoint and either 404'd or wrote the sidecar into the wrong run directory.
- LabelEditor: short-circuit Save/Clear onClick handlers on `busy` to avoid a save-then-clear race where both mutations could be in flight simultaneously.
- writeRunLabel: reject control characters in labels so they cannot break compare column headers or confuse test assertions.
Review follow-up (c993a20)

Addressed blockers from internal code review:
Also addressed two should-fix items while I was in there:
Deferred to follow-up (tracked mentally, not blocking):
Verification after the fixes:
- Replace the single compare screenshot with three fresh shots at 1680x1000: the side-by-side per-run view (hero), the aggregated matrix, and the per-run list with labels.
- Expand the Studio `## Compare` section to describe both modes, when to use per-run mode, how the sticky Compare N flow works, and how retroactive labels persist as sidecar `label.json` files.
- While in CompareTab.tsx: honor `prefers-reduced-motion` (disables entrance animations, row stagger, hover translations), and restore focus to the row's label trigger button when the inline label editor closes so keyboard users don't lose their place.
Rewrites CompareTab markup from scratch using the same Tailwind patterns as the rest of Studio (ExperimentsTab, TargetsTab, RunList, PassRatePill) so the Compare tab is visually consistent with the rest of the app.

Before: the component carried its own "Editorial Data Terminal" theme (Fraunces serif, JetBrains Mono, warm off-black canvas, antique gold hairlines), scoped via inline <style> on [data-compare-root]. This was jarringly off-brand.

After: plain Tailwind utilities sourced from the existing Studio palette:
- Surfaces: rounded-lg border border-gray-800 on gray-900/50 backgrounds
- Tables: divide-y divide-gray-800/50 with hover:bg-gray-900/30
- Accents: cyan-400 / cyan-500 for interactive and selected states
- Tones: emerald-400 (pass), red-400 (fail), yellow-400 (warn), matching ExperimentsTab and the existing Legend swatches
- Pass rates: reuse the shared PassRatePill component everywhere
- Selection highlight: cyan-950/20 row tint with a sticky cyan action bar
- Label chip: cyan-bordered pill, matching cyan link styling elsewhere

Drops the entire ScopedStyles block and the data-compare-root wrapper. Functional behavior (state, mutations, keyboard handlers, focus return, Rules-of-Hooks order, control-char validation) is preserved.

Studio JS bundle drops ~31 KB (468 KB → 438 KB) from removing the embedded <style> string; CSS grows slightly from new Tailwind utilities.

Screenshots in apps/web/src/assets/screenshots/studio-compare-*.png are re-captured to reflect the corrected styling.
Replaces the single-valued `label` feature with multi-valued `tags`,
matching the Langfuse / W&B / GitHub convention for mutable post-hoc
run annotations. A singular label boxed us in for future use cases like
`regression + slow + v2-prompt`-style cross-cutting filters; tags keep
the door open without blocking the current compare-column-header use
case.
Rationale:
- "Label" (singular) is an uncommon vocabulary — Langfuse, W&B, and
GitHub all use plural `tags`, and MLflow uses a singular `runName`
only for the immutable display identity (not post-hoc annotations).
- Experiment (set at eval-run time) is the run's grouping key; tags
layer mutable cross-cutting attributes on top without touching the
JSONL manifest.
- Per-run compare already solved the ad-hoc comparison mechanics;
this rename gives the UX a richer identity layer.
Backend:
- `apps/cli/src/commands/results/run-label.ts` → `run-tags.ts`:
- `RUN_LABEL_FILENAME` → `RUN_TAGS_FILENAME` (`label.json` → `tags.json`)
- `RunLabelFile { label: string }` → `RunTagsFile { tags: string[] }`
- `readRunLabel/writeRunLabel/deleteRunLabel` → `readRunTags/writeRunTags/deleteRunTags`
- New `normalizeTags()` helper: trim, dedupe, validate per-tag length
(≤60 chars), reject control chars, enforce MAX_TAGS_PER_RUN (20).
- Writing an empty array deletes the sidecar (single idempotent path).
- `serve.ts`:
- `PUT/DELETE /api/runs/:filename/label` → `/tags` (plus benchmark-scoped).
- `handleRunLabelPut/Delete` → `handleRunTagsPut/Delete`.
- `CompareRunEntry.label?` → `CompareRunEntry.tags?: string[]`.
- `handleRuns` / `handleCompare` read `readRunTags` and surface `tags[]`.
- `CompareRunEntry` and `RunMeta` wire-format fields updated accordingly.
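The `normalizeTags()` rules listed above (trim, dedupe, per-tag length, control-char rejection, tag-count cap) can be sketched roughly as follows. This is a minimal illustration, not the code in `run-tags.ts`: the constant `MAX_TAG_LENGTH`, the error messages, the decision to skip empty tags silently, and the choice to trim before checking for control characters are all assumptions.

```typescript
// Limits quoted from the commit message. MAX_TAG_LENGTH is a hypothetical name.
const MAX_TAGS_PER_RUN = 20;
const MAX_TAG_LENGTH = 60;

// ASCII control characters: the C0 range plus DEL.
const CONTROL_CHARS = /[\u0000-\u001f\u007f]/;

function normalizeTags(input: readonly string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];

  for (const raw of input) {
    const tag = raw.trim();
    // Assumption: blank entries are dropped rather than rejected.
    if (tag.length === 0) continue;
    if (CONTROL_CHARS.test(tag)) {
      throw new Error("Tags must not contain control characters");
    }
    if (tag.length > MAX_TAG_LENGTH) {
      throw new Error(`Tags must be at most ${MAX_TAG_LENGTH} characters`);
    }
    // Dedupe while keeping first-seen order.
    if (seen.has(tag)) continue;
    seen.add(tag);
    result.push(tag);
  }

  if (result.length > MAX_TAGS_PER_RUN) {
    throw new Error(`A run can carry at most ${MAX_TAGS_PER_RUN} tags`);
  }
  return result;
}
```

Under this sketch, an input that normalizes to an empty array would be the caller's signal to delete the sidecar instead of writing it, which matches the "single idempotent path" described above.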
Frontend:
- `types.ts`: `RunMeta.label?` → `tags?: string[]`;
`RunLabelResponse { label }` → `RunTagsResponse { tags: string[] }`.
- `api.ts`: `saveRunLabelApi/deleteRunLabelApi` → `saveRunTagsApi/deleteRunTagsApi`;
URL paths `/label` → `/tags`; request body `{ tags }`.
- `CompareTab.tsx`:
- Table column "Label" → "Tags".
- Per-run row: cell shows every tag as a cyan-bordered chip (wraps for
long lists); placeholder "+ tags" dashed pill when empty.
- New `TagsEditor` replaces `LabelEditor`: inline chip-based editor
with staged `string[]` state, Enter/comma commits new tag, Backspace
on empty input removes the last chip, × on each chip removes that
specific tag, Clear all wipes the sidecar, Save persists the array.
- `RunColumnHeader` (side-by-side view): timestamp stays as the primary
identifier, tags render as small chips below it (was single label
replacing the timestamp — now both coexist so the run's immutable
identity is always visible).
- Focus-restore on editor close preserved for keyboard users.
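The staged-chip behaviour of `TagsEditor` described above (Enter/comma commits, Backspace on an empty input removes the last chip, × removes one chip) can be modelled as a pure state transition. The sketch below is illustrative only: `EditorState`, `handleKey`, and `removeTag` are hypothetical names, and the real component presumably keeps this state in React hooks.

```typescript
// Hypothetical shape for the editor's staged state.
interface EditorState {
  tags: string[]; // chips staged but not yet saved
  input: string;  // current text in the input box
}

// Returns the next state for a key press; unknown keys leave state untouched.
function handleKey(state: EditorState, key: string): EditorState {
  if (key === "Enter" || key === ",") {
    const candidate = state.input.trim();
    // Assumption: a blank input just clears itself.
    if (candidate.length === 0) return { tags: state.tags, input: "" };
    // Re-entering an already staged tag keeps the chip count and clears the input.
    const tags = state.tags.includes(candidate)
      ? state.tags
      : [...state.tags, candidate];
    return { tags, input: "" };
  }
  if (key === "Backspace" && state.input.length === 0) {
    // Backspace on an empty input drops the last chip.
    return { tags: state.tags.slice(0, -1), input: "" };
  }
  return state;
}

// Models clicking × on a specific chip.
function removeTag(state: EditorState, tag: string): EditorState {
  return { tags: state.tags.filter((t) => t !== tag), input: state.input };
}
```

Keeping the transitions pure like this would let the keyboard rules be unit-tested without mounting the component.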
Docs:
- `apps/web/src/content/docs/docs/tools/studio.mdx`:
- "Retroactive labels" section → "Retroactive tags"; explains the
multi-valued model, the limits (20 tags × 60 chars), and the chip
editor shortcuts.
- Features bullet updated.
- Alt text and surrounding prose reworded (`labelled` → `tagged`,
`label cell` → `Tags cell`).
- Screenshots recaptured: per-run view now shows one run with two tags
(`improved-prompt`, `v2`) and another with one tag (`baseline`) so
the multi-valued pattern is visible; side-by-side view shows tag
chips directly under each column's timestamp.
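For readers who want to exercise the renamed endpoints by hand, a client call might look like the sketch below. The URL pattern and the `{ tags }` request body come from this commit message; everything else is an assumption, including the helper names (`buildSaveTagsRequest`, `buildDeleteTagsRequest`, `sendTagsRequest`), the URL-encoding of the filename, and the handling of an empty response body. The benchmark-scoped variants are omitted because their paths are not spelled out here.

```typescript
// Hypothetical description of a tags mutation request.
interface TagsRequest {
  url: string;
  method: "PUT" | "DELETE";
  body?: string;
}

// PUT /api/runs/:filename/tags with a JSON body of { tags }.
function buildSaveTagsRequest(filename: string, tags: string[]): TagsRequest {
  return {
    url: `/api/runs/${encodeURIComponent(filename)}/tags`,
    method: "PUT",
    body: JSON.stringify({ tags }),
  };
}

// DELETE /api/runs/:filename/tags removes the sidecar.
function buildDeleteTagsRequest(filename: string): TagsRequest {
  return {
    url: `/api/runs/${encodeURIComponent(filename)}/tags`,
    method: "DELETE",
  };
}

// Sends the request with the standard fetch API. The relative URL assumes a
// browser context served by Studio; prefix a base URL when running elsewhere.
async function sendTagsRequest(req: TagsRequest): Promise<string[]> {
  const init: RequestInit = { method: req.method };
  if (req.body !== undefined) {
    init.headers = { "Content-Type": "application/json" };
    init.body = req.body;
  }
  const res = await fetch(req.url, init);
  if (!res.ok) {
    throw new Error(`Tags request failed with status ${res.status}`);
  }
  // Assumption: the response is either empty or shaped like { tags: string[] }.
  const text = await res.text();
  if (text.length === 0) return [];
  const data = JSON.parse(text) as { tags?: string[] };
  return data.tags ?? [];
}
```

In the actual frontend these calls are wrapped by `saveRunTagsApi` / `deleteRunTagsApi`, which additionally invalidate the compare and runs caches.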
Generated the file skeleton via `npx getdesign@latest add minimax` and then rewrote every section to describe AgentV Studio's actual design language rather than MiniMax's marketing-page aesthetic. The result is a practical reference that future agents and humans can drop into a Claude Code / Cursor session to keep new Studio surfaces consistent with the existing ones.

Contents:
- Color palette (gray-950 canvas, gray-900 surfaces, single cyan-400 accent, emerald/yellow/red data tones, blue gradient reserved for PassRatePill)
- Typography (single system-ui stack, no webfonts, text-sm default, tabular-nums mandatory on numeric columns, font-medium over bold)
- Canonical component patterns copied verbatim from ExperimentsTab, TargetsTab, RunList, and PassRatePill so new code can lift them without reinventing
- Do / don't list codifying the hard rules: one accent, no shadows for elevation, no rounded-xl, no webfonts, PassRatePill is the only blue in the app, data tones never leak into interactive chrome
- Responsive + layout principles matching the dense, desktop-first inspector posture of the current Studio UI
- Agent prompt guide with ready-to-paste snippets for tables, primary buttons, segmented controls, tag chips, empty states, and form rows

Placed at apps/studio/DESIGN.md (scoped to the studio app) so it lives next to the code it describes. This is documentation only, with no runtime or build impact.
Ready for merge. Final state after review + design pivots
Verification
Deferred / follow-up

Tracked as #1041: Filter compare views by tag. Tag filtering (chip row above the compare view to narrow both matrix and per-run table to runs matching a selected tag set) was discussed in this PR's thread and intentionally held out of scope: #1037 is a collapse-bug fix, the tag filter is adjacent but not required, and no concrete user has asked for it yet. The issue documents the design direction (filter, not dimension), the recommended OR semantics, and an implementation sketch.

Squash-merging now.
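To make the "OR semantics" mentioned for the deferred tag filter concrete: a run would be kept when it carries at least one of the selected tags. The sketch below is only an illustration of that idea, not the implementation sketch from #1041; `TaggedRun` and `filterRunsByTags` are hypothetical names, and treating an empty selection as "no filter" is an assumption.

```typescript
// Hypothetical minimal run shape; the real RunMeta has more fields.
interface TaggedRun {
  filename: string;
  tags?: string[];
}

// OR semantics: keep a run if it has at least one of the selected tags.
function filterRunsByTags<T extends TaggedRun>(
  runs: readonly T[],
  selected: readonly string[],
): T[] {
  // Assumption: with nothing selected, the filter is inactive.
  if (selected.length === 0) return [...runs];
  const wanted = new Set(selected);
  return runs.filter((run) => (run.tags ?? []).some((tag) => wanted.has(tag)));
}
```

With AND semantics the `some` call would become a check that every selected tag is present on the run, which narrows results much faster as more chips are selected.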
Post-merge manual UAT (agent-browser, interactive)

I ran the full interactive verification I should have done before merge (not just screenshot-rendering, but clicking every button and pressing every key) against the merged

Setup:
Also re-verified during the same session:
Finding

One real bug: I was serving a stale studio dist. The `apps/cli/src/cli.ts studio` command serves `apps/studio/dist/` as static assets, and the dist folder is build output (gitignored). After pulling main post-merge, I didn't rebuild the studio bundle, so my first UAT attempt was driving against a pre-#1040 build that didn't have `TagsEditor` at all. Rebuilding fixed it.

This is a gotcha worth documenting: the current AGENTS.md guidance for functional CLI testing says "From TypeScript source (preferred): `bun apps/cli/src/cli.ts …`", which works for CLI logic but does not rebuild the embedded studio UI bundle. Working in a fresh worktree where you run `bun install` + `bun run build` anyway would have caught this; in the primary checkout I skipped the rebuild and got burned.

I've noted this in my project memory but it might be worth a one-line addition to the AGENTS.md "Functional Testing (CLI)" section, e.g. "If you are testing Studio UI changes, rebuild the studio bundle first: `cd apps/studio && bun run build`. The studio CLI serves static assets from `apps/studio/dist/`; it does NOT recompile on change like the Vite dev server does." Happy to open a tiny follow-up PR for that if you agree.

No code changes needed

All flows work as designed. Nothing to hot-fix. The tag editor is behaving correctly post-merge.
Running `bun apps/cli/src/cli.ts studio` only live-reloads the CLI and backend routes. The Studio web UI is served as a static bundle from `apps/studio/dist/`, which is build output and does not recompile on source changes. Without a manual `bun run build` in `apps/studio`, `agentv studio` silently serves whatever JS/CSS was last built, which may be from a different branch, before the merge you just pulled, or simply stale.

This bit the post-merge UAT on #1040: the TagsEditor component was correctly in the source but not in the dist, so the driven-browser session kept rendering an older Compare tab and looked like a feature regression. Cost ~15 minutes of confusion to diagnose.

Adds a paragraph under the existing "Functional Testing (CLI)" section so the next agent (or human) knows to rebuild the Studio dist before screenshotting or driving `agent-browser` against Studio.
#1042)

Co-authored-by: devbox2-codex <devbox2-codex@agents.local>
Closes #1037
Summary
- `(experiment, target)` twice no longer collapses into a single cell.
- `tags.json` sidecar written next to `index.jsonl`. Each run can carry up to 20 tags (≤60 chars each, control chars rejected, deduped). Mutation is exposed through `PUT / DELETE /api/runs/:filename/tags` (plus benchmark-scoped twins). Remote runs are read-only.
- `CompareTab.tsx` was rewritten to match the existing Studio aesthetic (`gray-950` canvas, cyan-400 accent, shared `PassRatePill`, single system-ui font stack) so the Compare tab is visually indistinguishable from `ExperimentsTab` / `TargetsTab`.
- `apps/studio/DESIGN.md` documents Studio's actual design language (dark + cyan, canonical Tailwind patterns, do/don't list) so future agents can keep new Studio surfaces consistent.

Files touched
- `apps/cli/src/commands/results/run-tags.ts` (new): sidecar read/write/delete helpers with per-tag validation (length, control chars, dedupe, MAX_TAGS_PER_RUN=20). Writing an empty array deletes the sidecar.
- `apps/cli/src/commands/results/serve.ts`: `handleCompare` now also emits `runs[]`, `handleRuns` attaches tags, new `handleRunTagsPut` / `handleRunTagsDelete` handlers wired behind the existing read-only check.
- `apps/studio/src/lib/types.ts`: `CompareRunEntry`, `RunTagsResponse`, `RunMeta.tags?: string[]`.
- `apps/studio/src/lib/api.ts`: `saveRunTagsApi` / `deleteRunTagsApi` mutations that invalidate compare + runs query keys.
- `apps/studio/src/components/CompareTab.tsx`: rewritten with the Studio Tailwind aesthetic; inline chip-based `TagsEditor`, per-run selection, side-by-side compare view with tags rendered as chips under the immutable timestamp header.
- `apps/studio/src/routes/index.tsx` + `apps/studio/src/routes/projects/$benchmarkId.tsx`: forward `benchmarkId` and `readOnly` to `CompareTab` so tag mutations in benchmark-scoped studio routes hit the correct endpoints.
- `apps/studio/DESIGN.md` (new): brand-aligned design system reference.
- `apps/web/src/content/docs/docs/tools/studio.mdx` + 3 new screenshots in `apps/web/src/assets/screenshots/studio-compare-*.png`: user-facing docs for the Compare tab's two modes and the retroactive tag annotation workflow.
- `docs/plans/1037-per-run-compare.md`: design plan (historical; preserved in the squash commit).

Verification
- `bun run build`, `typecheck`, `lint`, `test`: all green (1976 tests pass).
- `prek`) ran Build / Typecheck / Lint / Test / Validate eval YAML and passed on every push.
- `apps/web` builds cleanly with the new screenshots embedded.
- `agent-browser --cdp 9222` against `bun apps/cli/src/cli.ts studio --port 9100 --single` on a 4-run synthetic fixture (two sharing `(exp-a, claude-sonnet)` to specifically exercise the #1037 collapse).

Test plan (verified)
Post-merge interactive UAT against `main` (commit `016607e7`) with real click/keystroke dispatching. Full evidence and per-flow screenshots in #1040 (comment).

- `(experiment, target)` runs. Per-run table lists all 4 fixture runs, including both `exp-a × claude-sonnet` runs as separate rows; the core collapse bug from #1037 is fixed.
- `improved-prompt`, pressed Enter → chip staged, input cleared, Save enabled.
- `v2`, pressed comma → second chip staged. (Note: the `agent-browser keyboard type` helper pastes text without per-char `keydown` events, so the comma was driven via `agent-browser press ","` which dispatches a real event. The component's `onKeyDown` handler intercepts `e.key === ','` correctly for real user input.)
- `tags.json`, and closes the editor. Verified on disk: `tags.json` contains `["improved-prompt","v2"]` with a fresh `updated_at`. Editor closed. Row now shows both chips.
- `v2` → removed from staged list, `improved-prompt` stayed, Save enabled, Save → disk updated to `["improved-prompt"]`.
- `tags.json` removed from disk, row shows `+ tags` placeholder again.
- `updated_at`), row still shows previous chip.
- `improved-prompt` while it was already staged → chip count stayed at 1, input cleared.
- `hasChanges` flips true. Reopened editor without changing anything → `saveButton.disabled === true` via DOM inspection.
- `PassRatePill` blue gradient, same numbers as before the tags pivot.
- `useMemo` hoisting in `AggregatedView` (fix for the Rules-of-Hooks review finding) is correct: single-experiment / single-target fixtures show the notice cleanly and don't throw when a second run is added mid-session.
- `useStudioConfig().read_only` through both routes to `CompareTab`).
- `agentv compare` behaviour unchanged (not touched in this PR).

Follow-up
Tracked as #1041: Filter compare views by tag. Tag filtering (chip row above the compare view to narrow both matrix and per-run table to runs matching a selected tag set) was discussed in this PR's thread and intentionally held out of scope.
🤖 Generated with Claude Code