fix(studio): cross-model comparison view in Studio UI #981

@christso

Objective

Enhance the Studio UI to provide a side-by-side comparison of eval results across different targets/models within the same experiment.

Motivation

The agentv CLI has a compare command, but the Studio UI currently shows runs individually. When running the same eval across different providers (e.g., azure vs gemini) with the experiment feature (#977), users need to compare results side-by-side.

From benchmarking work: running reasoning evals across azure (gpt-5.4-mini) and gemini (gemini-3-flash-preview) with with-superpowers/without-superpowers experiments produces 4 separate runs. The Studio should enable:

  • Seeing all 4 runs in a comparison matrix
  • Identifying which target + experiment combination performs best
  • Spotting grading anomalies (e.g., "no response provided" grading failures)

Design

Extend the existing experiments tab in Studio to show a comparison matrix, with targets as rows and experiments as columns. Each cell shows the average score followed by the pass count:

                    | without-superpowers | with-superpowers |
azure (gpt-5.4)     | 73.5% (1/2 pass)    | 25.0% (0/2 pass) |
gemini (flash)      | 100% (2/2 pass)     | 75.0% (1/2 pass) |
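
As a rough illustration of how such a matrix could be assembled, the sketch below groups runs by target and experiment and formats each cell like the mock-up above. The `Run` and `TestResult` shapes, the 0..1 score scale, and the function names are assumptions made for this example, not the actual agentv results schema or Studio code.

```typescript
// Hypothetical result shapes; field names are assumptions for illustration.
interface TestResult {
  testId: string;
  score: number; // assumed to be in the range 0..1
  passed: boolean;
}

interface Run {
  experiment: string; // e.g. "with-superpowers"
  target: string; // e.g. "azure"
  results: TestResult[];
}

interface CellSummary {
  scoreSum: number;
  passed: number;
  total: number;
}

// target -> experiment -> summary
type Matrix = Record<string, Record<string, CellSummary>>;

// Accumulate every run into its (target, experiment) cell.
// Repeated runs of the same pair are pooled into one cell.
function buildMatrix(runs: Run[]): Matrix {
  const matrix: Matrix = {};
  for (const run of runs) {
    const row = matrix[run.target] ?? (matrix[run.target] = {});
    const cell =
      row[run.experiment] ??
      (row[run.experiment] = { scoreSum: 0, passed: 0, total: 0 });
    for (const r of run.results) {
      cell.scoreSum += r.score;
      cell.total += 1;
      if (r.passed) cell.passed += 1;
    }
  }
  return matrix;
}

// Average score of a cell; an empty cell is reported as 0.
function averageScore(cell: CellSummary): number {
  return cell.total === 0 ? 0 : cell.scoreSum / cell.total;
}

// Render a cell as "<average score>% (<passed>/<total> pass)".
function formatCell(cell: CellSummary): string {
  const pct = (averageScore(cell) * 100).toFixed(1);
  return `${pct}% (${cell.passed}/${cell.total} pass)`;
}
```

Keeping running sums in each cell means the pass rate (`passed / total`) and the average score can both be derived from the same summary without re-reading the runs.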

Implementation approach

  • Add a "Compare" view to the experiments tab
  • Group runs by experiment × target
  • Show pass rate, average score, and per-test-case breakdown
  • Highlight best/worst performers
  • Support drill-down to individual test case differences
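
The highlighting and drill-down steps in this list could look roughly like the sketch below. It picks the best and worst cells by average score and diffs two runs test case by test case. All type and function names here (`CellScore`, `CaseResult`, `findExtremes`, `diffCases`) are hypothetical and chosen for the example; the real Studio data model may differ.

```typescript
// Hypothetical shapes for illustration; not the actual Studio data model.
interface CellScore {
  target: string;
  experiment: string;
  averageScore: number; // assumed 0..1
}

interface CaseResult {
  testId: string;
  score: number;
  passed: boolean;
}

interface CaseDiff {
  testId: string;
  left: CaseResult | undefined; // undefined when the case is missing on that side
  right: CaseResult | undefined;
  verdictChanged: boolean;
}

// Find the cells to highlight. Ties keep the first cell encountered.
function findExtremes(
  cells: CellScore[]
): { best: CellScore; worst: CellScore } | null {
  if (cells.length === 0) return null;
  let best = cells[0];
  let worst = cells[0];
  for (const c of cells) {
    if (c.averageScore > best.averageScore) best = c;
    if (c.averageScore < worst.averageScore) worst = c;
  }
  return { best, worst };
}

// Pair up test cases from two runs by testId for the drill-down view.
// A case that exists on only one side counts as a changed verdict.
function diffCases(left: CaseResult[], right: CaseResult[]): CaseDiff[] {
  const rightById: Record<string, CaseResult> = {};
  for (const r of right) rightById[r.testId] = r;

  const seen: Record<string, boolean> = {};
  const diffs: CaseDiff[] = [];
  for (const l of left) {
    const r: CaseResult | undefined = rightById[l.testId];
    seen[l.testId] = true;
    diffs.push({
      testId: l.testId,
      left: l,
      right: r,
      verdictChanged: r === undefined || r.passed !== l.passed,
    });
  }
  for (const r of right) {
    if (!seen[r.testId]) {
      diffs.push({ testId: r.testId, left: undefined, right: r, verdictChanged: true });
    }
  }
  return diffs;
}
```

Treating a missing test case as a changed verdict is one possible choice; it would also surface runs where a response was never recorded, which is the kind of grading anomaly mentioned in the motivation.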

Acceptance Criteria

  • Experiments tab shows comparison matrix when multiple targets exist
  • Pass rate and average score visible for each experiment × target cell
  • Drill-down to per-test-case comparison
  • Works with existing experiment-based results layout

Non-goals

  • Statistical significance testing (future enhancement)
  • Automated recommendations

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), wui (Relates to the browser dashboard / web UI runtime)

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests