feat(studio): per-run comparison with retroactive labelling #1040

Merged
christso merged 6 commits into main from feat/1037-per-run-compare
Apr 11, 2026

Collaborator

@christso christso commented Apr 10, 2026

Closes #1037

Note: this PR evolved significantly during review. It was originally built around a singular label concept (matching the issue description) and later pivoted to plural tags to match the Langfuse / W&B / GitHub convention for multi-valued post-hoc annotations. The commit history preserves the full arc; this body reflects the final merged state.

Summary

  • Studio Compare tab gains a Per run mode alongside the existing (experiment × target) aggregated matrix. Users can select 2+ individual runs via checkboxes and see them side-by-side, so running the same (experiment, target) twice no longer collapses into a single cell.
  • Runs can be retroactively tagged (multi-valued) via a tags.json sidecar written next to index.jsonl. Each run can carry up to 20 tags (≤60 chars each, control-char rejected, deduped). Mutation is exposed through PUT / DELETE /api/runs/:filename/tags (plus benchmark-scoped twins). Remote runs are read-only.
  • CompareTab.tsx was rewritten to match the existing Studio aesthetic — gray-950 canvas, cyan-400 accent, shared PassRatePill, single system-ui font stack — so the Compare tab is visually indistinguishable from ExperimentsTab / TargetsTab.
  • A new apps/studio/DESIGN.md documents Studio's actual design language (dark + cyan, canonical Tailwind patterns, do/don't list) so future agents can keep new Studio surfaces consistent.
  • No changes to eval YAML schema, no new CLI commands, no new tracker fields. Aggregated view semantics are unchanged — only the new per-run mode plus the tags annotation layer.
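The collapse that Per run mode avoids can be illustrated with a minimal sketch. The `RunSummary` shape, the function names, and the fixture values below are hypothetical and are not the types used in `serve.ts`; the sketch only shows why keying cells by (experiment, target) merges repeated runs while a per-run list keeps them apart.

```typescript
// Hypothetical run shape for illustration; not the actual Studio types.
interface RunSummary {
  filename: string;
  experiment: string;
  target: string;
  passRate: number;
}

// Aggregated view: one cell per (experiment, target) pair, so repeated
// runs of the same pair land in the same cell.
function aggregateByCell(runs: RunSummary[]): Map<string, RunSummary[]> {
  const cells = new Map<string, RunSummary[]>();
  for (const run of runs) {
    const key = `${run.experiment}::${run.target}`;
    const cell = cells.get(key) ?? [];
    cell.push(run);
    cells.set(key, cell);
  }
  return cells;
}

// Per-run view: every run stays its own row, ordered by filename.
function perRunRows(runs: RunSummary[]): RunSummary[] {
  return [...runs].sort((a, b) => a.filename.localeCompare(b.filename));
}

// Four synthetic runs, two of which share (exp-a, claude-sonnet).
const fixture: RunSummary[] = [
  { filename: "run-1", experiment: "exp-a", target: "claude-sonnet", passRate: 0.5 },
  { filename: "run-2", experiment: "exp-a", target: "claude-sonnet", passRate: 0.75 },
  { filename: "run-3", experiment: "exp-a", target: "target-b", passRate: 0.6 },
  { filename: "run-4", experiment: "exp-b", target: "claude-sonnet", passRate: 0.9 },
];
```

With this fixture the aggregated map holds three cells while the per-run list holds four rows, which is the distinction the new mode exposes.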

Files touched

  • apps/cli/src/commands/results/run-tags.ts (new): sidecar read/write/delete helpers with per-tag validation (length, control chars, dedupe, MAX_TAGS_PER_RUN=20). Writing an empty array deletes the sidecar.
  • apps/cli/src/commands/results/serve.ts: handleCompare now also emits runs[], handleRuns attaches tags, and the new handleRunTagsPut / handleRunTagsDelete handlers are wired behind the existing read-only check.
  • apps/studio/src/lib/types.ts: CompareRunEntry, RunTagsResponse, RunMeta.tags?: string[].
  • apps/studio/src/lib/api.ts: saveRunTagsApi / deleteRunTagsApi mutations that invalidate compare + runs query keys.
  • apps/studio/src/components/CompareTab.tsx: rewritten with the Studio Tailwind aesthetic; inline chip-based TagsEditor, per-run selection, side-by-side compare view with tags rendered as chips under the immutable timestamp header.
  • apps/studio/src/routes/index.tsx + apps/studio/src/routes/projects/$benchmarkId.tsx: forward benchmarkId and readOnly to CompareTab so tag mutations in benchmark-scoped Studio routes hit the correct endpoints.
  • apps/studio/DESIGN.md (new): brand-aligned design system reference.
  • apps/web/src/content/docs/docs/tools/studio.mdx + 3 new screenshots in apps/web/src/assets/screenshots/studio-compare-*.png: user-facing docs for the Compare tab's two modes and the retroactive tag annotation workflow.
  • docs/plans/1037-per-run-compare.md: design plan (historical; preserved in the squash commit).
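The per-tag validation described for `run-tags.ts` can be sketched roughly as follows. This is not the merged implementation: `MAX_TAG_LENGTH`, `hasControlChar`, the error messages, and the choice to throw rather than silently drop invalid tags are all assumptions; only the limits (20 tags, 60 characters, control characters rejected, duplicates removed) come from the PR description.

```typescript
const MAX_TAGS_PER_RUN = 20; // limit stated in the PR
const MAX_TAG_LENGTH = 60; // limit stated in the PR; the constant name is assumed

// True when the value contains a char code below 0x20 or equal to 0x7f.
function hasControlChar(value: string): boolean {
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code < 0x20 || code === 0x7f) return true;
  }
  return false;
}

// Trim, validate, and dedupe tags while preserving first-seen order.
function normalizeTags(input: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of input) {
    const tag = raw.trim();
    if (tag.length === 0) continue; // assumption: blank entries are dropped
    if (tag.length > MAX_TAG_LENGTH) {
      throw new Error(`tag exceeds ${MAX_TAG_LENGTH} characters`);
    }
    if (hasControlChar(tag)) {
      throw new Error("tag contains a control character");
    }
    if (seen.has(tag)) continue; // silent dedupe
    seen.add(tag);
    result.push(tag);
  }
  if (result.length > MAX_TAGS_PER_RUN) {
    throw new Error(`a run can carry at most ${MAX_TAGS_PER_RUN} tags`);
  }
  return result;
}
```

For example, `normalizeTags([" v2 ", "v2", "improved-prompt"])` yields `["v2", "improved-prompt"]` under these assumptions.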

Verification

  • bun run build, typecheck, lint, test — all green (1976 tests pass).
  • Pre-push hook (prek) ran Build / Typecheck / Lint / Test / Validate eval YAML — passed on every push.
  • apps/web builds cleanly with the new screenshots embedded.
  • Live manual UAT via agent-browser --cdp 9222 against bun apps/cli/src/cli.ts studio --port 9100 --single on a 4-run synthetic fixture (two runs sharing (exp-a, claude-sonnet), specifically to exercise the #1037 collapse).

Test plan — verified

Post-merge interactive UAT against main (commit 016607e7) with real click/keystroke dispatching. Full evidence and per-flow screenshots in #1040 (comment).

  • Per-run mode shows distinct rows for same-(experiment, target) runs. Per-run table lists all 4 fixture runs, including both exp-a × claude-sonnet runs as separate rows, so the core collapse bug from #1037 is fixed.
  • Multi-run compare side-by-side. Selected 2+ rows via checkbox, clicked the sticky Compare N action bar, side-by-side table rendered with per-column timestamp headers and tags-as-chips underneath, per-test breakdown with colour-coded scores, Back to runs link returns to the list without losing selection state.
  • Tags cell click → inline editor opens. Editor row appears below the run with "TAG RUN" label, focused input, disabled Save button (no staged changes yet).
  • Type + Enter commits a chip. Typed improved-prompt, pressed Enter → chip staged, input cleared, Save enabled.
  • Comma commits a chip. Typed v2, pressed comma → second chip staged. (Note: the agent-browser keyboard type helper pastes text without per-char keydown events, so the comma was driven via agent-browser press "," which dispatches a real event. The component's onKeyDown handler intercepts e.key === ',' correctly for real user input.)
  • Save persists the array, writes tags.json, and closes the editor. Verified on disk: tags.json contains ["improved-prompt","v2"] with a fresh updated_at. Editor closed. Row now shows both chips.
  • × on a specific chip removes just that tag. Clicked × on v2 → removed from staged list, improved-prompt stayed, Save enabled, Save → disk updated to ["improved-prompt"].
  • Backspace on empty input removes the last chip. Editor open with one chip + empty input, Backspace → chip removed, staged list empty.
  • Clear all deletes the sidecar. Clicked Clear all → editor closed, tags.json removed from disk, row shows + tags placeholder again.
  • Cancel discards staged changes. Staged an empty list, clicked Cancel → editor closed, sidecar on disk unchanged (same updated_at), row still shows previous chip.
  • Escape discards staged changes. Typed draft text, pressed Escape → editor closed, sidecar unchanged.
  • Duplicate tag silently deduped. Attempted to add improved-prompt while it was already staged → chip count stayed at 1, input cleared.
  • Save disabled until hasChanges flips true. Reopened editor without changing anything → saveButton.disabled === true via DOM inspection.
  • Focus-return on editor close. After every close path (Save / Cancel / Escape / Clear all), focus lands back on the row's Tags trigger button so keyboard users don't lose their place (visible in screenshots as the cyan focus ring on the Tags cell).
  • No regression in Aggregated mode. Flipped back to Aggregated — 2×2 matrix renders with PassRatePill blue gradient, same numbers as before the tags pivot.
  • 1×1 edge case renders the "Not enough variation" notice without crashing. The useMemo hoisting in AggregatedView (fix for the Rules-of-Hooks review finding) is correct — single-experiment / single-target fixtures show the notice cleanly and don't throw when a second run is added mid-session.
  • Read-only mode disables tag editing (forwarded from useStudioConfig().read_only through both routes to CompareTab).
  • CLI agentv compare behaviour unchanged (not touched in this PR).
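The keyboard behaviour verified above (Enter or comma commits a chip, Backspace on an empty input removes the last chip, duplicates are dropped, Save stays disabled until something changes) can be modelled as a small pure state function. This is a sketch only: `EditorState`, `onKey`, and the standalone `hasChanges` helper are hypothetical names, and the real `TagsEditor` is a React component whose internals are not shown in this PR.

```typescript
// Hypothetical editor state: committed chips plus the text still in the input.
interface EditorState {
  staged: string[];
  draft: string;
}

// Apply one key press to the editor state and return the next state.
function onKey(state: EditorState, key: string): EditorState {
  if (key === "Enter" || key === ",") {
    const tag = state.draft.trim();
    if (tag.length === 0) return { ...state, draft: "" };
    // A duplicate is not added again, but the input is still cleared.
    const staged = state.staged.includes(tag) ? state.staged : [...state.staged, tag];
    return { staged, draft: "" };
  }
  if (key === "Backspace" && state.draft.length === 0) {
    // Empty input: remove the last chip instead of deleting a character.
    return { staged: state.staged.slice(0, -1), draft: "" };
  }
  return state; // other keys are left to the input element
}

// Save is enabled only when the staged list differs from what is on disk.
function hasChanges(saved: string[], staged: string[]): boolean {
  return saved.length !== staged.length || saved.some((tag, i) => tag !== staged[i]);
}
```

Keeping the transitions pure makes each flow in the list above checkable without driving a browser, although the real component was verified interactively.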

Follow-up

Tracked as #1041: Filter compare views by tag. Tag filtering (chip row above the compare view to narrow both matrix and per-run table to runs matching a selected tag set) was discussed in this PR's thread and intentionally held out of scope.

🤖 Generated with Claude Code

Adds a per-run mode to the Studio Compare tab so users can select 2+
individual runs and see them side-by-side, independent of the existing
(experiment, target) aggregation. Runs can be retroactively labelled via
a sidecar label.json written next to index.jsonl; the label replaces the
timestamp in compare column headers.

Backend:
- `apps/cli/src/commands/results/run-label.ts` — sidecar read/write/delete
  helpers (label.json next to manifest, 120-char cap, JSON schema).
- `serve.ts` — /api/compare now returns a `runs[]` array with per-run
  entries (one per workspace), and enriches /api/runs with any label.
- New endpoints: `PUT/DELETE /api/runs/:filename/label` and the
  benchmark-scoped variants. Remote runs are read-only.

Frontend:
- `CompareTab.tsx` completely reworked with an "Editorial Data Terminal"
  aesthetic — Fraunces serif display, JetBrains Mono tabular numerals,
  warm off-black canvas, antique gold accents. Scoped via inline styles
  under `[data-compare-root]` so it does not bleed into other surfaces.
- Two modes: Aggregated (default, existing matrix re-skinned) and Per
  run (checkbox-selectable runs table + sticky Compare N bar + inline
  label editor). Compare view renders one column per selected run with
  label-or-timestamp headers and reuses the existing test breakdown.
- API hooks `saveRunLabelApi` / `deleteRunLabelApi` invalidate compare
  and runs caches on mutation.

Closes #1037

cloudflare-workers-and-pages Bot commented Apr 10, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 7fd9f06
Status: ✅  Deploy successful!
Preview URL: https://1a2c54c4.agentv.pages.dev
Branch Preview URL: https://feat-1037-per-run-compare.agentv.pages.dev


- CompareTab AggregatedView: hoist useMemo above the early return so
  adding a second experiment/target after the initial render does not
  violate the Rules of Hooks.
- Pass `benchmarkId` and `readOnly` through to CompareTab from both
  routes (single-project and benchmark-scoped). Previously label
  mutations in the benchmark view routed to the unscoped endpoint and
  either 404'd or wrote the sidecar into the wrong run directory.
- LabelEditor: short-circuit Save/Clear onClick handlers on `busy` to
  avoid a save-then-clear race where both mutations could be in flight
  simultaneously.
- writeRunLabel: reject control characters in labels so they cannot
  break compare column headers or confuse test assertions.
Collaborator Author

Review follow-up (c993a20)

Addressed blockers from internal code review:

  • 🔴 B1 — Rules-of-Hooks in AggregatedView: hoisted useMemo above the early-return guard in CompareTab.tsx:145. Verified with a single-experiment/single-target fixture — the "Not enough variation" notice now renders cleanly and switching to a multi-target project no longer risks a "Rendered more hooks than during the previous render" crash.
  • 🔴 B2 — callers not forwarding benchmarkId / readOnly: fixed in apps/studio/src/routes/index.tsx and apps/studio/src/routes/projects/$benchmarkId.tsx. Label mutations in the benchmark-scoped view now hit /api/benchmarks/:benchmarkId/runs/:runId/label and invalidate the correct query keys; the readOnly prop propagates from the existing useStudioConfig() read.
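The routing problem behind B2 comes down to choosing between two URL shapes. A minimal sketch of that choice, assuming a hypothetical helper named `runAnnotationUrl` (the real logic lives in `api.ts` and may differ), with the last path segment passed in because it was `label` at this commit and became `tags` after the later rename:

```typescript
// Hypothetical helper: build the mutation URL for a run annotation.
// `segment` is "label" at this point in the history and "tags" after the rename.
function runAnnotationUrl(
  runId: string,
  benchmarkId?: string,
  segment: "label" | "tags" = "tags",
): string {
  const run = encodeURIComponent(runId); // encoding is an assumption
  if (benchmarkId !== undefined) {
    // Benchmark-scoped route: only reachable when the caller forwards benchmarkId.
    return `/api/benchmarks/${encodeURIComponent(benchmarkId)}/runs/${run}/${segment}`;
  }
  // Unscoped route used by the single-project view.
  return `/api/runs/${run}/${segment}`;
}
```

When `benchmarkId` is not forwarded, the helper falls through to the unscoped path, which matches the failure described above: the request either misses or lands on the wrong run directory.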

Also addressed two should-fix items while I was in there:

  • 🟡 S2 — control-character sanitization in writeRunLabel (run-label.ts): rejects any char code < 0x20 or == 0x7f.
  • 🟡 S4 — LabelEditor save/clear race: Save and Clear onClick handlers now short-circuit on busy so a double-click in the same tick cannot fire both mutations in flight.
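The S4 guard can be sketched outside React as a shared `busy` flag that both handlers check before starting a mutation. The names here (`createGuardedActions`, `onSave`, `onClear`) are hypothetical, and a closure variable stands in for whatever pending state the real component reads.

```typescript
// Wrap two async mutations so that only one can be in flight at a time.
function createGuardedActions(
  save: () => Promise<void>,
  clear: () => Promise<void>,
) {
  let busy = false;

  // Returns false without calling the action when another one is in flight.
  async function run(action: () => Promise<void>): Promise<boolean> {
    if (busy) return false; // short-circuit on busy
    busy = true;
    try {
      await action();
      return true;
    } finally {
      busy = false; // released on success and on failure
    }
  }

  return {
    onSave: () => run(save),
    onClear: () => run(clear),
  };
}
```

Because the flag is set before the first `await`, a Save followed by a Clear in the same tick starts only the Save.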

Deferred to follow-up (tracked mentally, not blocking):

  • S1 — handleCompare "last-wins" experiment/target aggregation when a run has mixed records. Rare in practice; worth a header note + first-wins switch in a cleanup PR.
  • S3 — compareOptions periodic refetch is pre-existing behaviour, not worth changing here.
  • S5 — no new unit tests. Real regret; I'll add run-label.test.ts + the /api/compare golden-fixture assertion in a small follow-up.

Verification after the fixes:

  • bun run lint
  • bun run typecheck
  • bun run test ✅ (1476 core + 67 eval + 433 cli = 1976 passing)
  • bun run build ✅ (studio bundle 468.08 kB / 129.90 kB gzip)
  • Pre-push hook (Build / Typecheck / Lint / Test / Validate eval YAML) — all Passed
  • Live UAT re-run: aggregated view OK in both 1×1 notice mode and 2×2 matrix mode, per-run selection + label edit + label clear + side-by-side compare view all still work

devbox2-codex added 4 commits April 11, 2026 03:58
- Replace the single compare screenshot with three fresh shots at 1680x1000:
  the side-by-side per-run view (hero), the aggregated matrix, and the
  per-run list with labels.
- Expand the Studio `## Compare` section to describe both modes, when to use
  per-run mode, how the sticky Compare N flow works, and how retroactive
  labels persist as sidecar `label.json` files.
- While in CompareTab.tsx: honor `prefers-reduced-motion` (disables entrance
  animations, row stagger, hover translations), and restore focus to the
  row's label trigger button when the inline label editor closes so
  keyboard users don't lose their place.

Rewrites CompareTab markup from scratch using the same Tailwind patterns
as the rest of Studio (ExperimentsTab, TargetsTab, RunList, PassRatePill)
so the Compare tab is visually consistent with the rest of the app.

Before: the component carried its own "Editorial Data Terminal" theme —
Fraunces serif, JetBrains Mono, warm off-black canvas, antique gold
hairlines, scoped via inline <style> on [data-compare-root]. This was
jarringly off-brand.

After: plain Tailwind utilities sourced from the existing Studio palette:
- Surfaces: rounded-lg border border-gray-800 on gray-900/50 backgrounds
- Tables: divide-y divide-gray-800/50 with hover:bg-gray-900/30
- Accents: cyan-400 / cyan-500 for interactive and selected states
- Tones: emerald-400 (pass), red-400 (fail), yellow-400 (warn), matching
  ExperimentsTab and the existing Legend swatches
- Pass rates: reuse the shared PassRatePill component everywhere
- Selection highlight: cyan-950/20 row tint with a sticky cyan action bar
- Label chip: cyan-bordered pill, matching cyan link styling elsewhere

Drops the entire ScopedStyles block and the data-compare-root wrapper.
Functional behavior (state, mutations, keyboard handlers, focus return,
Rules-of-Hooks order, control-char validation) is preserved.

Studio JS bundle drops ~31 KB (468 KB → 438 KB) from removing the
embedded <style> string; CSS grows slightly from new Tailwind utilities.

Screenshots in apps/web/src/assets/screenshots/studio-compare-*.png are
re-captured to reflect the corrected styling.

Replaces the single-valued `label` feature with multi-valued `tags`,
matching the Langfuse / W&B / GitHub convention for mutable post-hoc
run annotations. A singular label boxed us in for future use cases like
`regression + slow + v2-prompt`-style cross-cutting filters; tags keep
the door open without blocking the current compare-column-header use
case.

Rationale:
- "Label" (singular) is an uncommon vocabulary — Langfuse, W&B, and
  GitHub all use plural `tags`, and MLflow uses a singular `runName`
  only for the immutable display identity (not post-hoc annotations).
- Experiment (set at eval-run time) is the run's grouping key; tags
  layer mutable cross-cutting attributes on top without touching the
  JSONL manifest.
- Per-run compare already solved the ad-hoc comparison mechanics;
  this rename gives the UX a richer identity layer.

Backend:
- `apps/cli/src/commands/results/run-label.ts` → `run-tags.ts`:
  - `RUN_LABEL_FILENAME` → `RUN_TAGS_FILENAME` (`label.json` → `tags.json`)
  - `RunLabelFile { label: string }` → `RunTagsFile { tags: string[] }`
  - `readRunLabel/writeRunLabel/deleteRunLabel` → `readRunTags/writeRunTags/deleteRunTags`
  - New `normalizeTags()` helper: trim, dedupe, validate per-tag length
    (≤60 chars), reject control chars, enforce MAX_TAGS_PER_RUN (20).
  - Writing an empty array deletes the sidecar (single idempotent path).
- `serve.ts`:
  - `PUT/DELETE /api/runs/:filename/label` → `/tags` (plus benchmark-scoped).
  - `handleRunLabelPut/Delete` → `handleRunTagsPut/Delete`.
  - `CompareRunEntry.label?` → `CompareRunEntry.tags?: string[]`.
  - `handleRuns` / `handleCompare` read `readRunTags` and surface `tags[]`.
- `CompareRunEntry` and `RunMeta` wire-format fields updated accordingly.
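The "empty array deletes the sidecar" rule can be sketched with Node's `fs` module. This is an illustrative approximation rather than the code in `run-tags.ts`: the function bodies, the pretty-printed JSON, and the temporary demo directory are assumptions, while the `tags.json` filename and the `tags` / `updated_at` fields come from the PR text.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const RUN_TAGS_FILENAME = "tags.json";

interface RunTagsFile {
  tags: string[];
  updated_at: string;
}

// Write the sidecar, or delete it when the tag list is empty.
function writeRunTags(runDir: string, tags: string[]): RunTagsFile | undefined {
  const path = join(runDir, RUN_TAGS_FILENAME);
  if (tags.length === 0) {
    rmSync(path, { force: true }); // force: no error if the file is already gone
    return undefined;
  }
  const file: RunTagsFile = { tags, updated_at: new Date().toISOString() };
  writeFileSync(path, `${JSON.stringify(file, null, 2)}\n`, "utf8");
  return file;
}

// Read the tags back; a missing sidecar means the run has no tags.
function readRunTags(runDir: string): string[] {
  const path = join(runDir, RUN_TAGS_FILENAME);
  if (!existsSync(path)) return [];
  const parsed = JSON.parse(readFileSync(path, "utf8")) as Partial<RunTagsFile>;
  return Array.isArray(parsed.tags) ? parsed.tags : [];
}

// Usage against a throwaway directory standing in for a run directory.
const demoDir = mkdtempSync(join(tmpdir(), "run-tags-demo-"));
writeRunTags(demoDir, ["improved-prompt", "v2"]);
```

Routing both "clear" and "save nothing" through the same delete keeps the operation idempotent: calling `writeRunTags(demoDir, [])` twice leaves the directory in the same state.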

Frontend:
- `types.ts`: `RunMeta.label?` → `tags?: string[]`;
  `RunLabelResponse { label }` → `RunTagsResponse { tags: string[] }`.
- `api.ts`: `saveRunLabelApi/deleteRunLabelApi` → `saveRunTagsApi/deleteRunTagsApi`;
  URL paths `/label` → `/tags`; request body `{ tags }`.
- `CompareTab.tsx`:
  - Table column "Label" → "Tags".
  - Per-run row: cell shows every tag as a cyan-bordered chip (wraps for
    long lists); placeholder "+ tags" dashed pill when empty.
  - New `TagsEditor` replaces `LabelEditor`: inline chip-based editor
    with staged `string[]` state, Enter/comma commits new tag, Backspace
    on empty input removes the last chip, × on each chip removes that
    specific tag, Clear all wipes the sidecar, Save persists the array.
  - `RunColumnHeader` (side-by-side view): timestamp stays as the primary
    identifier, tags render as small chips below it (was single label
    replacing the timestamp — now both coexist so the run's immutable
    identity is always visible).
  - Focus-restore on editor close preserved for keyboard users.

Docs:
- `apps/web/src/content/docs/docs/tools/studio.mdx`:
  - "Retroactive labels" section → "Retroactive tags"; explains the
    multi-valued model, the limits (20 tags × 60 chars), and the chip
    editor shortcuts.
  - Features bullet updated.
  - Alt text and surrounding prose reworded (`labelled` → `tagged`,
    `label cell` → `Tags cell`).
- Screenshots recaptured: per-run view now shows one run with two tags
  (`improved-prompt`, `v2`) and another with one tag (`baseline`) so
  the multi-valued pattern is visible; side-by-side view shows tag
  chips directly under each column's timestamp.

Generated the file skeleton via `npx getdesign@latest add minimax` and
then rewrote every section to describe AgentV Studio's actual design
language rather than MiniMax's marketing-page aesthetic. The result is
a practical reference that future agents and humans can drop into a
Claude Code / Cursor session to keep new Studio surfaces consistent
with the existing ones.

Contents:
- Color palette (gray-950 canvas, gray-900 surfaces, single cyan-400
  accent, emerald/yellow/red data tones, blue gradient reserved for
  PassRatePill)
- Typography (single system-ui stack, no webfonts, text-sm default,
  tabular-nums mandatory on numeric columns, font-medium over bold)
- Canonical component patterns copied verbatim from ExperimentsTab,
  TargetsTab, RunList, and PassRatePill so new code can lift them
  without reinventing
- Do / don't list codifying the hard rules: one accent, no shadows
  for elevation, no rounded-xl, no webfonts, PassRatePill is the only
  blue in the app, data tones never leak into interactive chrome
- Responsive + layout principles matching the dense, desktop-first
  inspector posture of the current Studio UI
- Agent prompt guide with ready-to-paste snippets for tables, primary
  buttons, segmented controls, tag chips, empty states, and form rows

Placed at apps/studio/DESIGN.md (scoped to the studio app) so it lives
next to the code it describes. This is documentation only — no runtime
or build impact.

@christso christso marked this pull request as ready for review April 11, 2026 06:48
Collaborator Author

Ready for merge.

Final state after review + design pivots

  • Original feature (commit `aee2186`): per-run compare mode + retroactive run annotations, fixes the collapse bug described in #1037.
  • Review fixes (commit `c993a20`): Rules-of-Hooks violation in `AggregatedView`, missing `benchmarkId`/`readOnly` prop forwarding in both callers, LabelEditor save/clear race, control-character sanitization.
  • Style rework (commit `0b732db`): rewrote `CompareTab` from scratch with Tailwind utilities matching the rest of Studio (`gray-950` canvas, `cyan-400` accent, shared `PassRatePill`). Dropped the earlier "Editorial Data Terminal" theme that had drifted off-brand. JS bundle dropped ~31 KB as a bonus.
  • Rename to tags[] (commit `5c48a53`): pivoted from singular `label` to plural `tags` to match the Langfuse / W&B / GitHub convention for mutable post-hoc run annotations. Each run can now carry up to 20 tags (≤60 chars each, control-char rejected, deduped). Chip-based inline editor replaces the single-input label editor; compare column headers now show tags as chips below the immutable timestamp instead of replacing it. Screenshots re-captured to show one run tagged `[improved-prompt, v2]` and another tagged `[baseline]` so the multi-valued pattern is visible.
  • DESIGN.md (commit `7fd9f06`): scaffolded via `npx getdesign@latest add minimax` and rewritten to document Studio's actual dark + cyan style. Placed at `apps/studio/DESIGN.md` as a reference for future agents working on Studio UI.

Verification

  • `bun run build`, `typecheck`, `lint`, `test` all green (1976 tests pass)
  • prek pre-push hook (Build / Typecheck / Lint / Test / Validate eval YAML) passed on every push
  • `apps/web` builds cleanly with the new screenshots embedded
  • Live manual UAT via `agent-browser --cdp 9222` against `bun apps/cli/src/cli.ts studio --port 9100 --single` on a 4-run fixture, covering:
    • Aggregated matrix (2×2) renders correctly in the new cyan style
    • Per-run list shows all 4 runs with the Tags column and chip affordances
    • Multi-valued tag editor: add, remove via ×, remove last via Backspace, Clear all, Save, Cancel
    • Side-by-side compare view with chips under each column's timestamp
    • Aggregated 1×1 "Not enough variation" edge case (verifies the Rules-of-Hooks fix)
    • Flip back-and-forth between modes — no regressions
  • CI green (Check Links / Validate Marketplace / Validate Evals / Cloudflare Pages all pass)

Deferred / follow-up

Tracked as #1041: Filter compare views by tag. Tag filtering (chip row above the compare view to narrow both matrix and per-run table to runs matching a selected tag set) was discussed in this PR's thread and intentionally held out of scope — #1037 is a collapse-bug fix, the tag filter is adjacent but not required, and no concrete user has asked for it yet. The issue documents the design direction (filter, not dimension), the recommended OR semantics, and an implementation sketch.
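For reference, the OR semantics mentioned for #1041 can be sketched in a few lines. Nothing like this exists in the merged PR; `TaggedRun` and `filterRunsByTags` are hypothetical names, and treating an empty selection as "no filter" is an assumption.

```typescript
// Hypothetical minimal run shape; tags are optional, as in RunMeta.tags.
interface TaggedRun {
  filename: string;
  tags?: string[];
}

// Keep runs that carry at least one of the selected tags (OR semantics).
function filterRunsByTags<T extends TaggedRun>(runs: T[], selected: string[]): T[] {
  if (selected.length === 0) return runs; // assumption: empty selection shows everything
  const wanted = new Set(selected);
  return runs.filter((run) => (run.tags ?? []).some((tag) => wanted.has(tag)));
}
```

Under OR semantics, selecting `baseline` and `v2` keeps any run tagged with either one, and untagged runs drop out as soon as a filter is active.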

Squash-merging now.

@christso christso merged commit 016607e into main Apr 11, 2026
4 checks passed
@christso christso deleted the feat/1037-per-run-compare branch April 11, 2026 06:48
Collaborator Author

Post-merge manual UAT (agent-browser, interactive)

I ran the full interactive verification I should have done before merge — not just screenshot-rendering, but clicking every button and pressing every key — against the merged main (commit `016607e7`). All 11 interactive flows pass. Specific evidence below.

Setup:

  • Rebuilt `apps/studio/dist` from merged source (the previous dist I'd been serving was stale from April 9, which is what tripped me up in the first screenshot attempt).
  • Ran `bun apps/cli/src/cli.ts studio --port 9100 --single` against a fresh fixture at `/tmp/1037-uat-fixture` with 4 synthetic runs (2 sharing `(exp-a, claude-sonnet)`).
  • Drove via `agent-browser --cdp 9222` with manual click/keystroke dispatching for every interaction.

| # | Flow | Evidence | Status |
|---|------|----------|--------|
| 1 | Click `+ tags` cell opens the inline `TagsEditor` below the row | Editor row appears with "TAG RUN" label, focused input, disabled Save button (no changes yet) | Pass |
| 2 | Type + Enter commits a chip to the staged list | Typed `improved-prompt`, pressed Enter → chip appeared, input cleared, Save button enabled | Pass |
| 3 | Comma commits a chip | Typed `v2`, pressed comma key → second chip appeared, input cleared. Note: `agent-browser keyboard type "v2,"` pastes the comma literally (no `keydown` per-char), so I used `agent-browser press ","` to dispatch a real `keydown` event. The component's `onKeyDown` handler intercepts `e.key === ','` correctly for real user input. | Pass |
| 4 | Save persists the full array and closes the editor | Clicked Save → editor closed, 11:00 row shows both chips, `tags.json` on disk contains `["improved-prompt","v2"]` with fresh `updated_at`, focus returned to the Tags button (cyan outline visible in screenshot) | Pass |
| 5 | × on a specific chip removes just that tag | Reopened editor, clicked × on `v2` → `v2` removed from staged list while `improved-prompt` stayed, Save enabled, click Save → sidecar on disk now `["improved-prompt"]` only | Pass |
| 6 | Backspace on empty input removes the last chip | Reopened editor, pressed Backspace on empty input → `improved-prompt` chip removed, staged list now empty | Pass |
| 7 | Clear all deletes the sidecar | Repopulated tags, clicked Clear all → editor closed, `tags.json` removed from disk (only `index.jsonl` in the run dir), row shows `+ tags` placeholder again | Pass |
| 8 | Cancel discards staged changes | Opened editor, Backspace'd the chip, clicked Cancel → sidecar on disk unchanged (same `updated_at`), row still shows `improved-prompt` | Pass |
| 9 | Escape discards staged changes | Opened editor, typed `draft-tag`, pressed Escape → editor closed, input + typed text discarded, sidecar on disk unchanged | Pass |
| 10 | Duplicate tag silently deduped | Tried to re-add `improved-prompt` while it was already in the staged list → chip count stays at 1, input clears (user gets feedback that the action "took" without adding anything) | Pass |
| 11 | Save disabled until `hasChanges` is true | Reopened editor without touching anything → `saveButton.disabled === true` via DOM inspection | Pass |

Also re-verified during the same session:

  • The aggregated matrix renders correctly (2×2, `PassRatePill` blue gradient, cyan Compare tab accent, legend in gray)
  • Mode toggle switches between Aggregated and Per-run without dropping state
  • Focus-return on editor close works (keyboard users don't lose their place in the table)

Finding

One real bug: I was serving a stale studio dist. The `apps/cli/src/cli.ts studio` command serves `apps/studio/dist/` as static assets, and the dist folder is build output (gitignored). After pulling main post-merge, I didn't rebuild the studio bundle, so my first UAT attempt was driving against a pre-#1040 build that didn't have `TagsEditor` at all. Rebuilding fixed it. This is a gotcha worth documenting — the current AGENTS.md guidance for functional CLI testing says "From TypeScript source (preferred): `bun apps/cli/src/cli.ts …`", which works for CLI logic but does not rebuild the embedded studio UI bundle. Working in a fresh worktree where you run `bun install` + `bun run build` anyway would have caught this; in the primary checkout I skipped the rebuild and got burned.

I've noted this in my project memory but it might be worth a one-line addition to the AGENTS.md "Functional Testing (CLI)" section, e.g. "If you are testing Studio UI changes, rebuild the studio bundle first: `cd apps/studio && bun run build`. The studio CLI serves static assets from `apps/studio/dist/` — it does NOT recompile on change like the Vite dev server does." Happy to open a tiny follow-up PR for that if you agree.

No code changes needed

All flows work as designed. Nothing to hot-fix. The tag editor is behaving correctly post-merge.

christso pushed a commit that referenced this pull request Apr 11, 2026
Running `bun apps/cli/src/cli.ts studio` only live-reloads the CLI and
backend routes. The Studio web UI is served as a static bundle from
`apps/studio/dist/`, which is build output and does not recompile on
source changes. Without a manual `bun run build` in `apps/studio`,
`agentv studio` silently serves whatever JS/CSS was last built — which
may be from a different branch, before the merge you just pulled, or
simply stale.

This bit the post-merge UAT on #1040: the TagsEditor component was
correctly in the source but not in the dist, so the driven-browser
session kept rendering an older Compare tab and looked like a feature
regression. Cost ~15 minutes of confusion to diagnose.

Adds a paragraph under the existing "Functional Testing (CLI)" section
so the next agent (or human) knows to rebuild the Studio dist before
screenshotting or driving `agent-browser` against Studio.

christso added a commit that referenced this pull request Apr 11, 2026
#1042)

Co-authored-by: devbox2-codex <devbox2-codex@agents.local>