From 2e657f65d9c56ac779c6332fb3dfe1e3c753c33a Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 11 Apr 2026 08:05:17 +0000 Subject: [PATCH 1/4] docs(plan): filter compare views by tag (#1041) --- docs/plans/1041-filter-compare-by-tag.md | 150 +++++++++++++++++++++++ 1 file changed, 150 insertions(+) create mode 100644 docs/plans/1041-filter-compare-by-tag.md diff --git a/docs/plans/1041-filter-compare-by-tag.md b/docs/plans/1041-filter-compare-by-tag.md new file mode 100644 index 000000000..d222077b0 --- /dev/null +++ b/docs/plans/1041-filter-compare-by-tag.md @@ -0,0 +1,150 @@ +# Plan: Filter compare views by tag (issue #1041) + +## Context + +PR #1040 landed retroactive multi-valued **tags** on run workspaces (sidecar +`tags.json`) and surfaced them as chips in both the Aggregated matrix and the +Per-run compare table. Tags currently only *display* — users who run a bunch +of experiments and tag a subset (`v2-prompt`, `baseline`, …) still have to +manually de-select runs to see how a tagged cohort did. + +Issue #1041 closes that loop: a chip row above the compare view that filters +**both** modes to runs carrying at least one of the selected tags +(OR semantics). No schema changes, no tag management UI, no new CLI commands — +just filter. + +One material refinement vs. the issue body: **client-side re-aggregation** +instead of re-fetching from the backend on filter change. That keeps the tag +chip list stable (built from the unfiltered response), avoids network +round-trips on every chip click, and makes the UI snappier. The backend +`?tags=` filter is still implemented for API consumers per the acceptance +signals. + +## Approach + +### 1. Backend — `apps/cli/src/commands/results/serve.ts` + +`handleCompare` already iterates per-run metadata and reads +`tagsEntry = readRunTags(m.path)` after aggregation. Move that read to the +**top** of the per-run loop and early-continue when a `?tags=` filter is +provided and the run's tags don't intersect. + +OR semantics (industry default). No schema change, one early continue, +covers both `cells[]`, `runs[]`, `experiments[]`, and `targets[]` because +they're all built from the same run loop. + +```ts +// At the top of handleCompare, parse filter +const tagsParam = c.req.query('tags') ?? ''; +const filterTags = new Set( + tagsParam.split(',').map((t) => t.trim()).filter(Boolean), +); + +// In the run loop, BEFORE loadLightweightResults: +const tagsEntry = readRunTags(m.path); +if (filterTags.size > 0) { + const runTags = tagsEntry?.tags ?? []; + if (!runTags.some((t) => filterTags.has(t))) continue; +} +``` + +Reuse `tagsEntry` later when building `runEntries` (it's already read, just +hoisted). The benchmark-scoped endpoint routes through the same handler via +`withBenchmark`, so it gets the same filter for free. + +### 2. Frontend — `apps/studio/src/components/CompareTab.tsx` + +Filter lives **inside** `CompareTab` (local state). Data fetching in the +route wrappers (`routes/index.tsx`, `routes/projects/$benchmarkId.tsx`) stays +unchanged — we do not thread filter state into the React Query key. + +- New `filterTags: string[]` state in `CompareTab`. +- Derive `allTags` / `tagCounts` from **unfiltered** `data.runs` so chips + stay visible even when the filter would otherwise remove them. +- Re-aggregate cells/runs client-side from the filtered subset. Safe because + the server already exposes per-run totals (`eval_count`, `passed_count`, + `avg_score`) in each `CompareRunEntry`, so we sum weighted averages per + `(experiment, target)` bucket. +- Pass `filteredData` down to `AggregatedView` and `PerRunView` in place of + `data` — they don't need to know about filtering. +- New `TagFilterBar` rendered above the content switch (only when + `allTags.length > 0`). Click toggles a tag; "Clear" link appears when any + filter is active. Styling uses the existing cyan chip pattern. +- **Empty state** when the filter yields zero runs: `Notice` component grows + an optional `action` prop (small additive change) so the user can clear + the filter from the notice. + +### 3. Semantics decision + +**OR** — the issue recommends it, it's what Langfuse / W&B / GitHub Issues +do, and it's the easier-to-explain option for v1. Noted in a header hint +under the chip row: _"Showing runs with any selected tag"_. + +### 4. Out of scope (per issue non-goals) + +- No URL query-string persistence (filter state is purely local). +- No tag management UI. +- No AND semantics toggle. +- No promoting `tag` to a matrix dimension. + +## Files to modify + +| File | Change | +|---|---| +| `apps/cli/src/commands/results/serve.ts` | Hoist `readRunTags` in `handleCompare`; add `?tags=` early-continue filter. | +| `apps/studio/src/components/CompareTab.tsx` | Add `filterTags` state, `TagFilterBar` sub-component, client-side re-aggregation, filter empty-state Notice. | +| `apps/cli/test/commands/results/serve.test.ts` | New test cases: `/api/compare` with no filter, with single tag, with multiple tags OR, with non-matching filter (empty result). | +| `apps/web/src/content/docs/docs/tools/studio.mdx` | Replace the PR #1040 "filter by tag is tracked as #1041" paragraph with a short note about the new chip row. | + +## Files NOT modified + +- `apps/studio/src/lib/api.ts` — no new query keys or API wrappers (filter is client-side). +- `apps/studio/src/lib/types.ts` — no new types. +- `apps/studio/src/routes/index.tsx` / `routes/projects/$benchmarkId.tsx` — no prop changes. + +## Verification + +### Unit tests +1. `bun run test` — all must pass. +2. New tests in `serve.test.ts` cover backend filter semantics: + - Unfiltered `/api/compare` returns all runs. + - `/api/compare?tags=baseline` returns only runs tagged `baseline`. + - `/api/compare?tags=baseline,v2` returns runs with EITHER tag (OR). + - `/api/compare?tags=nonexistent` returns empty `cells[]` / `runs[]`. + - Experiments/targets lists narrow to match filtered runs. + +### Manual end-to-end UAT (required, BLOCKING per AGENTS.md) + +Set up a 4-run synthetic fixture with a mix of tags. Start studio from +source: + +```bash +bun apps/cli/src/cli.ts studio --port 9100 --single +cd apps/studio && bun run build # rebuild frontend before UAT +``` + +Use `agent-browser --cdp 9222` to drive these flows: + +1. **Red (before):** On `main`, confirm no filter row is rendered above the + compare view. +2. **Green (after):** On this branch: + - Tag chip row appears above the mode toggle, listing all distinct tags + with counts. + - Clicking a tag narrows both the Aggregated matrix and the Per-run + table. + - OR semantics across multiple chips. + - "Clear" resets chips, matrix returns to full dataset. + - Filter with no matches shows the _"No runs match …"_ notice with a + working "Clear filter" button. + - Flip between Aggregated and Per-run while filter is active — filter + persists. + - Tag edits via `TagsEditor` still work with a filter active. +3. **Backend filter check:** `curl` against the API directly to verify OR + semantics and counts. + +### Build gates +- `bun run build`, `bun run typecheck`, `bun run lint` — all green. +- Pre-push hook (`prek`) runs everything automatically on push. + +### Final review +Spawn a code-review subagent AFTER e2e passes (per AGENTS.md ordering). From 4cf2463c150a214049789fc2047962884034785b Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 11 Apr 2026 08:12:29 +0000 Subject: [PATCH 2/4] feat(studio): filter compare views by tag (#1041) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a chip row above the compare view that filters both the aggregated matrix and the per-run table to runs carrying at least one of the selected tags (OR semantics). Backend: - `GET /api/compare?tags=baseline,v2-prompt` narrows cells[], runs[], experiments[], and targets[] to runs whose tags intersect the filter. - `readRunTags` is hoisted to the top of the per-run loop so the filter can early-continue without loading JSONL for non-matching runs. - Empty / missing `tags` param is a no-op; whitespace and empty segments are ignored. Frontend: - New local `filterTags` state in `CompareTab`. - Chip list derived from the unfiltered response so chips stay visible even when the filter would otherwise hide them. - `filteredData` re-aggregates cells/runs client-side from per-run totals — no network round trip per click. - `Notice` grows an optional `action` prop so the empty-state notice can render a "Clear filter" button. --- apps/cli/src/commands/results/serve.ts | 22 +- apps/cli/test/commands/results/serve.test.ts | 152 +++++++++++++ apps/studio/src/components/CompareTab.tsx | 214 +++++++++++++++++- .../src/content/docs/docs/tools/studio.mdx | 6 + 4 files changed, 386 insertions(+), 8 deletions(-) diff --git a/apps/cli/src/commands/results/serve.ts b/apps/cli/src/commands/results/serve.ts index 15ec4f4aa..354990faa 100644 --- a/apps/cli/src/commands/results/serve.ts +++ b/apps/cli/src/commands/results/serve.ts @@ -554,6 +554,19 @@ async function handleCompare(c: C, { searchDir, agentvDir }: DataContext) { const { runs: metas } = await listMergedResultFiles(searchDir); const { threshold: pass_threshold } = loadStudioConfig(agentvDir); + // Optional tag filter: `?tags=baseline,v2-prompt` keeps only runs that + // carry at least one of the given tags (OR semantics). Empty / missing + // param is a no-op. Filtering is applied before aggregation so it + // propagates through `cells[]`, `runs[]`, `experiments[]`, and + // `targets[]` uniformly. + const tagsParam = c.req.query('tags') ?? ''; + const filterTags = new Set( + tagsParam + .split(',') + .map((t) => t.trim()) + .filter(Boolean), + ); + // Collect per-test-case results keyed by experiment × target (aggregated view) const cellMap = new Map< string, @@ -599,6 +612,14 @@ async function handleCompare(c: C, { searchDir, agentvDir }: DataContext) { for (const m of metas) { try { + // Read tags before any heavy work so the `?tags=` filter can skip + // non-matching runs without loading their JSONL records. + const tagsEntry = readRunTags(m.path); + if (filterTags.size > 0) { + const runTags = tagsEntry?.tags ?? []; + if (!runTags.some((t) => filterTags.has(t))) continue; + } + const records = loadLightweightResults(m.path); const runTestMap = new Map< string, @@ -656,7 +677,6 @@ async function handleCompare(c: C, { searchDir, agentvDir }: DataContext) { if (runEvalCount === 0) continue; const runTests = [...runTestMap.values()].slice(-MAX_TESTS_PER_CELL); - const tagsEntry = readRunTags(m.path); runEntries.push({ run_id: m.filename, started_at: runStartedAt, diff --git a/apps/cli/test/commands/results/serve.test.ts b/apps/cli/test/commands/results/serve.test.ts index 14007b406..6b79b5c19 100644 --- a/apps/cli/test/commands/results/serve.test.ts +++ b/apps/cli/test/commands/results/serve.test.ts @@ -634,6 +634,158 @@ describe('serve app', () => { }); }); + // ── GET /api/compare (tag filter) ─────────────────────────────────── + + describe('GET /api/compare', () => { + function seedCompareFixture() { + // Four runs, each in its own run workspace, with the tags documented + // below. This setup exercises the OR filter semantics used by + // `/api/compare?tags=`. + const runsDir = path.join(tempDir, '.agentv', 'results', 'runs'); + mkdirSync(runsDir, { recursive: true }); + + const runs: Array<{ + name: string; + experiment: string; + target: string; + score: number; + tags?: string[]; + }> = [ + { + name: '2026-04-01T10-00-00-000Z', + experiment: 'exp-a', + target: 'gpt-4o', + score: 1.0, + tags: ['baseline'], + }, + { + name: '2026-04-02T10-00-00-000Z', + experiment: 'exp-a', + target: 'claude', + score: 0.9, + tags: ['baseline'], + }, + { + name: '2026-04-03T10-00-00-000Z', + experiment: 'exp-b', + target: 'gpt-4o', + score: 0.85, + tags: ['v2-prompt'], + }, + { + // Intentionally untagged — should never match any tag filter. + name: '2026-04-04T10-00-00-000Z', + experiment: 'exp-b', + target: 'claude', + score: 0.7, + }, + ]; + + for (const run of runs) { + const runDir = path.join(runsDir, run.name); + mkdirSync(runDir, { recursive: true }); + writeFileSync( + path.join(runDir, 'index.jsonl'), + toJsonl({ + ...RESULT_A, + test_id: `test-${run.name}`, + experiment: run.experiment, + target: run.target, + score: run.score, + }), + ); + if (run.tags && run.tags.length > 0) { + writeFileSync( + path.join(runDir, 'tags.json'), + `${JSON.stringify({ tags: run.tags, updated_at: '2026-04-10T00:00:00.000Z' }, null, 2)}\n`, + ); + } + } + } + + type CompareJson = { + experiments: string[]; + targets: string[]; + cells: Array<{ experiment: string; target: string; eval_count: number }>; + runs?: Array<{ run_id: string; experiment: string; target: string; tags?: string[] }>; + }; + + it('returns all runs when no filter is provided', async () => { + seedCompareFixture(); + const app = createApp([], tempDir, tempDir, undefined, { studioDir }); + + const res = await app.request('/api/compare'); + expect(res.status).toBe(200); + const data = (await res.json()) as CompareJson; + + expect(data.runs).toHaveLength(4); + expect(data.experiments.sort()).toEqual(['exp-a', 'exp-b']); + expect(data.targets.sort()).toEqual(['claude', 'gpt-4o']); + expect(data.cells).toHaveLength(4); + }); + + it('filters to a single tag', async () => { + seedCompareFixture(); + const app = createApp([], tempDir, tempDir, undefined, { studioDir }); + + const res = await app.request('/api/compare?tags=baseline'); + expect(res.status).toBe(200); + const data = (await res.json()) as CompareJson; + + expect(data.runs).toHaveLength(2); + for (const run of data.runs ?? []) { + expect(run.tags ?? []).toContain('baseline'); + } + // Only exp-a is represented; targets narrow to the two used by exp-a runs. + expect(data.experiments).toEqual(['exp-a']); + expect(data.targets.sort()).toEqual(['claude', 'gpt-4o']); + expect(data.cells).toHaveLength(2); + }); + + it('applies OR semantics across multiple tags', async () => { + seedCompareFixture(); + const app = createApp([], tempDir, tempDir, undefined, { studioDir }); + + const res = await app.request('/api/compare?tags=baseline,v2-prompt'); + expect(res.status).toBe(200); + const data = (await res.json()) as CompareJson; + + // Three tagged runs; the untagged run is excluded. + expect(data.runs).toHaveLength(3); + expect(data.experiments.sort()).toEqual(['exp-a', 'exp-b']); + // (exp-a, gpt-4o), (exp-a, claude), (exp-b, gpt-4o) — the (exp-b, claude) + // cell is missing because the only contributing run was untagged. + expect(data.cells).toHaveLength(3); + const cellKeys = (data.cells ?? []).map((c) => `${c.experiment}::${c.target}`).sort(); + expect(cellKeys).toEqual(['exp-a::claude', 'exp-a::gpt-4o', 'exp-b::gpt-4o']); + }); + + it('returns empty payload when no runs match the filter', async () => { + seedCompareFixture(); + const app = createApp([], tempDir, tempDir, undefined, { studioDir }); + + const res = await app.request('/api/compare?tags=nonexistent'); + expect(res.status).toBe(200); + const data = (await res.json()) as CompareJson; + + expect(data.runs).toEqual([]); + expect(data.cells).toEqual([]); + expect(data.experiments).toEqual([]); + expect(data.targets).toEqual([]); + }); + + it('ignores whitespace and empty segments in the tags query', async () => { + seedCompareFixture(); + const app = createApp([], tempDir, tempDir, undefined, { studioDir }); + + // ` , baseline , ` should parse to just ['baseline']. + const res = await app.request('/api/compare?tags=%20,%20baseline%20,%20'); + expect(res.status).toBe(200); + const data = (await res.json()) as CompareJson; + expect(data.runs).toHaveLength(2); + }); + }); + // ── SPA fallback ────────────────────────────────────────────────────── describe('SPA fallback', () => { diff --git a/apps/studio/src/components/CompareTab.tsx b/apps/studio/src/components/CompareTab.tsx index 188d0c413..a5887eb0c 100644 --- a/apps/studio/src/components/CompareTab.tsx +++ b/apps/studio/src/components/CompareTab.tsx @@ -55,7 +55,99 @@ export function CompareTab({ readOnly, }: CompareTabProps) { const [mode, setMode] = useState('aggregated'); - const runsCount = data?.runs?.length ?? 0; + const [filterTags, setFilterTags] = useState([]); + + // Chip list is derived from the UNFILTERED response so chips stay visible + // even when the active filter would otherwise hide the runs that supplied + // them. Sorted alphabetically for stable UI. + const { allTags, tagCounts } = useMemo(() => { + const counts = new Map(); + for (const run of data?.runs ?? []) { + for (const tag of run.tags ?? []) { + counts.set(tag, (counts.get(tag) ?? 0) + 1); + } + } + return { allTags: [...counts.keys()].sort(), tagCounts: counts }; + }, [data?.runs]); + + // When a filter is active, re-aggregate cells/runs client-side from the + // filtered subset of runs. This avoids a network round-trip on every chip + // click and keeps the backend responsible only for the initial fetch. + // Safe because the server already exposes per-run totals (`eval_count`, + // `passed_count`, `avg_score`); we just sum them per (experiment, target) + // bucket, weighting averages by run eval_count. + const filteredData = useMemo(() => { + if (!data) return data; + if (filterTags.length === 0) return data; + const filterSet = new Set(filterTags); + const filteredRuns = (data.runs ?? []).filter((r) => + (r.tags ?? []).some((t) => filterSet.has(t)), + ); + + type CellAccum = { + experiment: string; + target: string; + eval_count: number; + passed_count: number; + score_sum: number; + tests: CompareTestResult[]; + }; + const cellMap = new Map(); + const experimentsSet = new Set(); + const targetsSet = new Set(); + + for (const run of filteredRuns) { + experimentsSet.add(run.experiment); + targetsSet.add(run.target); + const key = `${run.experiment}::${run.target}`; + const entry = cellMap.get(key) ?? { + experiment: run.experiment, + target: run.target, + eval_count: 0, + passed_count: 0, + score_sum: 0, + tests: [], + }; + entry.eval_count += run.eval_count; + entry.passed_count += run.passed_count; + entry.score_sum += run.avg_score * run.eval_count; + for (const t of run.tests) entry.tests.push(t); + cellMap.set(key, entry); + } + + const cells: CompareCell[] = [...cellMap.values()].map((e) => { + // Dedupe tests by test_id, last-wins (same pattern as the server). + const dedup = new Map(); + for (const t of e.tests) dedup.set(t.test_id, t); + return { + experiment: e.experiment, + target: e.target, + eval_count: e.eval_count, + passed_count: e.passed_count, + pass_rate: e.eval_count > 0 ? e.passed_count / e.eval_count : 0, + avg_score: e.eval_count > 0 ? e.score_sum / e.eval_count : 0, + tests: [...dedup.values()].slice(-100), + }; + }); + + return { + ...data, + experiments: [...experimentsSet].sort(), + targets: [...targetsSet].sort(), + cells, + runs: filteredRuns, + }; + }, [data, filterTags]); + + const toggleFilterTag = (tag: string) => { + setFilterTags((prev) => (prev.includes(tag) ? prev.filter((x) => x !== tag) : [...prev, tag])); + }; + const clearFilterTags = () => setFilterTags([]); + + const runsCount = filteredData?.runs?.length ?? 0; + const underlyingHasData = data && data.cells.length > 0; + const filterActive = filterTags.length > 0; + const filterYieldsNoRuns = filterActive && filteredData && (filteredData.runs?.length ?? 0) === 0; return (
@@ -65,12 +157,37 @@ export function CompareTab({ {!isLoading && isError && error && ( )} - {!isLoading && !isError && (!data || data.cells.length === 0) && } - {!isLoading && !isError && data && data.cells.length > 0 && ( + {!isLoading && !isError && !underlyingHasData && } + {!isLoading && !isError && underlyingHasData && ( <> - {mode === 'aggregated' && } - {mode === 'per-run' && ( - + {allTags.length > 0 && ( + + )} + {filterYieldsNoRuns ? ( + `\`${t}\``).join(' + ')}`} + body="Clear the filter or pick a different tag combination." + action={{ label: 'Clear filter', onClick: clearFilterTags }} + /> + ) : ( + filteredData && ( + <> + {mode === 'aggregated' && } + {mode === 'per-run' && ( + + )} + + ) )} )} @@ -78,6 +195,72 @@ export function CompareTab({ ); } +// ── Tag filter bar ────────────────────────────────────────────────────── + +function TagFilterBar({ + allTags, + tagCounts, + selected, + onToggle, + onClear, +}: { + allTags: string[]; + tagCounts: Map; + selected: string[]; + onToggle: (tag: string) => void; + onClear: () => void; +}) { + const selectedSet = new Set(selected); + const anySelected = selected.length > 0; + return ( +
+
+ + Filter by tag + + {allTags.map((tag) => { + const isActive = selectedSet.has(tag); + const count = tagCounts.get(tag) ?? 0; + return ( + + ); + })} + {anySelected && ( + + )} +
+

+ Showing runs with any selected tag. +

+
+ ); +} + // ── Header ────────────────────────────────────────────────────────────── function Header({ @@ -904,11 +1087,28 @@ function EmptyState() { ); } -function Notice({ headline, body }: { headline: string; body: string }) { +function Notice({ + headline, + body, + action, +}: { + headline: string; + body: string; + action?: { label: string; onClick: () => void }; +}) { return (

{headline}

{body}

+ {action && ( + + )}
); } diff --git a/apps/web/src/content/docs/docs/tools/studio.mdx b/apps/web/src/content/docs/docs/tools/studio.mdx index ef31b0ab3..e361722ec 100644 --- a/apps/web/src/content/docs/docs/tools/studio.mdx +++ b/apps/web/src/content/docs/docs/tools/studio.mdx @@ -112,6 +112,12 @@ Click any row's **Tags** cell to tag a run after the fact. Each run can carry mu Use tags to annotate ad-hoc variants, experiment cross-cuts, or status flags you didn't plan for up front — `baseline`, `v2-prompt`, `slow`, `after-retry-fix`, `regression`, etc. Unlike `experiment` — which groups runs and is baked into the JSONL at eval-run time — tags are mutable, multi-valued, and never touch the original run data. +### Filtering by tag + +Once runs are tagged, a chip row appears above the compare view listing every distinct tag with a usage count. Click a chip to narrow both the aggregated matrix and the per-run table to runs carrying at least one of the selected tags (OR semantics — clicking a second chip widens the set). A **Clear** link resets the filter, and filter selections persist as you switch between Aggregated and Per-run modes. + +The same filter is available to API consumers via `GET /api/compare?tags=baseline,v2-prompt`, which returns only the cells and runs whose tags intersect the query. + ## Benchmarks Dashboard From 80ed3270ac034a8fc95a6f52a64cc101c792e6ba Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 11 Apr 2026 08:24:48 +0000 Subject: [PATCH 3/4] refactor(studio): inline filterActive one-shot in CompareTab --- apps/studio/src/components/CompareTab.tsx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/apps/studio/src/components/CompareTab.tsx b/apps/studio/src/components/CompareTab.tsx index a5887eb0c..ffbe73f1b 100644 --- a/apps/studio/src/components/CompareTab.tsx +++ b/apps/studio/src/components/CompareTab.tsx @@ -146,8 +146,8 @@ export function CompareTab({ const runsCount = filteredData?.runs?.length ?? 0; const underlyingHasData = data && data.cells.length > 0; - const filterActive = filterTags.length > 0; - const filterYieldsNoRuns = filterActive && filteredData && (filteredData.runs?.length ?? 0) === 0; + const filterYieldsNoRuns = + filterTags.length > 0 && filteredData && (filteredData.runs?.length ?? 0) === 0; return (
From acf97157a5a2998c571aeef84fdf96e81515a59c Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 11 Apr 2026 08:56:48 +0000 Subject: [PATCH 4/4] chore(plan): remove 1041 plan file before merge --- docs/plans/1041-filter-compare-by-tag.md | 150 ----------------------- 1 file changed, 150 deletions(-) delete mode 100644 docs/plans/1041-filter-compare-by-tag.md diff --git a/docs/plans/1041-filter-compare-by-tag.md b/docs/plans/1041-filter-compare-by-tag.md deleted file mode 100644 index d222077b0..000000000 --- a/docs/plans/1041-filter-compare-by-tag.md +++ /dev/null @@ -1,150 +0,0 @@ -# Plan: Filter compare views by tag (issue #1041) - -## Context - -PR #1040 landed retroactive multi-valued **tags** on run workspaces (sidecar -`tags.json`) and surfaced them as chips in both the Aggregated matrix and the -Per-run compare table. Tags currently only *display* — users who run a bunch -of experiments and tag a subset (`v2-prompt`, `baseline`, …) still have to -manually de-select runs to see how a tagged cohort did. - -Issue #1041 closes that loop: a chip row above the compare view that filters -**both** modes to runs carrying at least one of the selected tags -(OR semantics). No schema changes, no tag management UI, no new CLI commands — -just filter. - -One material refinement vs. the issue body: **client-side re-aggregation** -instead of re-fetching from the backend on filter change. That keeps the tag -chip list stable (built from the unfiltered response), avoids network -round-trips on every chip click, and makes the UI snappier. The backend -`?tags=` filter is still implemented for API consumers per the acceptance -signals. - -## Approach - -### 1. Backend — `apps/cli/src/commands/results/serve.ts` - -`handleCompare` already iterates per-run metadata and reads -`tagsEntry = readRunTags(m.path)` after aggregation. Move that read to the -**top** of the per-run loop and early-continue when a `?tags=` filter is -provided and the run's tags don't intersect. - -OR semantics (industry default). No schema change, one early continue, -covers both `cells[]`, `runs[]`, `experiments[]`, and `targets[]` because -they're all built from the same run loop. - -```ts -// At the top of handleCompare, parse filter -const tagsParam = c.req.query('tags') ?? ''; -const filterTags = new Set( - tagsParam.split(',').map((t) => t.trim()).filter(Boolean), -); - -// In the run loop, BEFORE loadLightweightResults: -const tagsEntry = readRunTags(m.path); -if (filterTags.size > 0) { - const runTags = tagsEntry?.tags ?? []; - if (!runTags.some((t) => filterTags.has(t))) continue; -} -``` - -Reuse `tagsEntry` later when building `runEntries` (it's already read, just -hoisted). The benchmark-scoped endpoint routes through the same handler via -`withBenchmark`, so it gets the same filter for free. - -### 2. Frontend — `apps/studio/src/components/CompareTab.tsx` - -Filter lives **inside** `CompareTab` (local state). Data fetching in the -route wrappers (`routes/index.tsx`, `routes/projects/$benchmarkId.tsx`) stays -unchanged — we do not thread filter state into the React Query key. - -- New `filterTags: string[]` state in `CompareTab`. -- Derive `allTags` / `tagCounts` from **unfiltered** `data.runs` so chips - stay visible even when the filter would otherwise remove them. -- Re-aggregate cells/runs client-side from the filtered subset. Safe because - the server already exposes per-run totals (`eval_count`, `passed_count`, - `avg_score`) in each `CompareRunEntry`, so we sum weighted averages per - `(experiment, target)` bucket. -- Pass `filteredData` down to `AggregatedView` and `PerRunView` in place of - `data` — they don't need to know about filtering. -- New `TagFilterBar` rendered above the content switch (only when - `allTags.length > 0`). Click toggles a tag; "Clear" link appears when any - filter is active. Styling uses the existing cyan chip pattern. -- **Empty state** when the filter yields zero runs: `Notice` component grows - an optional `action` prop (small additive change) so the user can clear - the filter from the notice. - -### 3. Semantics decision - -**OR** — the issue recommends it, it's what Langfuse / W&B / GitHub Issues -do, and it's the easier-to-explain option for v1. Noted in a header hint -under the chip row: _"Showing runs with any selected tag"_. - -### 4. Out of scope (per issue non-goals) - -- No URL query-string persistence (filter state is purely local). -- No tag management UI. -- No AND semantics toggle. -- No promoting `tag` to a matrix dimension. - -## Files to modify - -| File | Change | -|---|---| -| `apps/cli/src/commands/results/serve.ts` | Hoist `readRunTags` in `handleCompare`; add `?tags=` early-continue filter. | -| `apps/studio/src/components/CompareTab.tsx` | Add `filterTags` state, `TagFilterBar` sub-component, client-side re-aggregation, filter empty-state Notice. | -| `apps/cli/test/commands/results/serve.test.ts` | New test cases: `/api/compare` with no filter, with single tag, with multiple tags OR, with non-matching filter (empty result). | -| `apps/web/src/content/docs/docs/tools/studio.mdx` | Replace the PR #1040 "filter by tag is tracked as #1041" paragraph with a short note about the new chip row. | - -## Files NOT modified - -- `apps/studio/src/lib/api.ts` — no new query keys or API wrappers (filter is client-side). -- `apps/studio/src/lib/types.ts` — no new types. -- `apps/studio/src/routes/index.tsx` / `routes/projects/$benchmarkId.tsx` — no prop changes. - -## Verification - -### Unit tests -1. `bun run test` — all must pass. -2. New tests in `serve.test.ts` cover backend filter semantics: - - Unfiltered `/api/compare` returns all runs. - - `/api/compare?tags=baseline` returns only runs tagged `baseline`. - - `/api/compare?tags=baseline,v2` returns runs with EITHER tag (OR). - - `/api/compare?tags=nonexistent` returns empty `cells[]` / `runs[]`. - - Experiments/targets lists narrow to match filtered runs. - -### Manual end-to-end UAT (required, BLOCKING per AGENTS.md) - -Set up a 4-run synthetic fixture with a mix of tags. Start studio from -source: - -```bash -bun apps/cli/src/cli.ts studio --port 9100 --single -cd apps/studio && bun run build # rebuild frontend before UAT -``` - -Use `agent-browser --cdp 9222` to drive these flows: - -1. **Red (before):** On `main`, confirm no filter row is rendered above the - compare view. -2. **Green (after):** On this branch: - - Tag chip row appears above the mode toggle, listing all distinct tags - with counts. - - Clicking a tag narrows both the Aggregated matrix and the Per-run - table. - - OR semantics across multiple chips. - - "Clear" resets chips, matrix returns to full dataset. - - Filter with no matches shows the _"No runs match …"_ notice with a - working "Clear filter" button. - - Flip between Aggregated and Per-run while filter is active — filter - persists. - - Tag edits via `TagsEditor` still work with a filter active. -3. **Backend filter check:** `curl` against the API directly to verify OR - semantics and counts. - -### Build gates -- `bun run build`, `bun run typecheck`, `bun run lint` — all green. -- Pre-push hook (`prek`) runs everything automatically on push. - -### Final review -Spawn a code-review subagent AFTER e2e passes (per AGENTS.md ordering).