
feat: experiment-based result layout and read-only Studio mode #972

@christso


Objective

Public leaderboard that reuses the existing Studio result artifact format (JSONL + benchmark.json). Benchmark-agnostic, aligned with AgentV's positioning as a general-purpose eval framework.

Supersedes the implementation approach in #966 (the objective from #966 remains the same).

Why change the approach

PR #970 (implementing #966) introduced a custom result.schema.json with SWE-bench-specific fields (instance_id, resolved, dataset: "swe-bench-lite"). This creates problems:

  1. Tied to SWE-bench — AgentV evaluates all kinds of agents, not just coding agents. A new schema would be needed for every benchmark type.
  2. Parallel data format — Studio already reads index.jsonl + benchmark.json + timing.json from .agentv/results/. Inventing a second format means maintaining two schemas and two rendering paths.
  3. Duplicated UI work — Studio already has sortable tables, experiment comparison, and drill-down views. Building a separate Astro component reimplements this.

Design

Architecture

  • Separate benchmark results repo (e.g., EntityProcess/agentv-benchmarks) — contains only result artifacts, no source code. Keeps the main agentv repo clean.
  • Studio in read-only mode serves as the leaderboard UI — no new UI to build. Deployment topology (subdomain, reverse proxy, etc.) is the operator's responsibility.
  • Any developer can deploy their own internal leaderboard by pointing a read-only Studio instance at their own results repo/directory.

Data hierarchy

| Leaderboard concept | Studio field | Example |
| --- | --- | --- |
| Benchmark | Project | swe-bench-lite |
| Submission (model + workspace + skills) | Experiment | claude-opus-4-custom-skills |
| Execution | Run (timestamp) | 2026-04-08T12-00-00 |
| Test case | Test | django__django-11099 |

Project = the results repo registered in Studio. Each benchmark repo is a project.

Experiment = the submission identity. This is the full agent stack (model + workspace template + skills), not just the model. The submitter names it. This becomes the primary leaderboard row.

Run = a timestamped execution within an experiment. Multiple runs per experiment are supported (re-runs, improved harness).
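
Expressed as illustrative types (all names here are hypothetical; Studio's internal data model is unchanged by this proposal, see Non-goals):

```ts
// Illustrative sketch of the hierarchy, not Studio's actual model.
type TestResult = {
  test: string; // e.g. "django__django-11099"
};

type Run = {
  timestamp: string;   // e.g. "2026-04-08T12-00-00"
  tests: TestResult[]; // the rows of that run's index.jsonl
};

type Experiment = {
  name: string; // submission identity, e.g. "claude-opus-4-custom-skills"
  runs: Run[];  // re-runs under the same submission
};

type Project = {
  name: string; // one benchmark results repo, e.g. "swe-bench-lite"
  experiments: Experiment[];
};
```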

Directory layout (breaking change)

Results are always nested under an experiment:

.agentv/results/runs/
├── claude-opus-4-custom-skills/       ← experiment
│   ├── 2026-04-08T12-00-00/           ← run
│   │   ├── index.jsonl
│   │   ├── benchmark.json
│   │   └── timing.json
│   └── 2026-04-07T10-00-00/
├── gpt-4o-swe-agent/
│   └── 2026-04-05T08-00-00/
└── default/                           ← no --experiment specified
    └── 2026-04-06T09-00-00/
  • --experiment provided → runs/<experiment>/<timestamp>/
  • No --experiment → runs/default/<timestamp>/
  • Always runs/<experiment>/<timestamp>/ — one code path, consistent structure
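
A sketch of that single code path (the helper name is hypothetical; agentv's real implementation may differ):

```ts
import * as path from "node:path";

// Hypothetical helper illustrating the one-code-path layout above.
function resolveRunDir(resultsRoot: string, experiment?: string): string {
  const name = experiment ?? "default"; // no --experiment → "default"
  // Filesystem-safe timestamp: colons replaced with dashes.
  const timestamp = new Date().toISOString().slice(0, 19).replace(/:/g, "-");
  return path.join(resultsRoot, "runs", name, timestamp);
}

// resolveRunDir(".agentv/results", "claude-opus-4-custom-skills")
//   → ".agentv/results/runs/claude-opus-4-custom-skills/2026-04-08T12-00-00"
```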

Data format

Use the existing Studio result artifacts that agentv eval already produces:

| Artifact | Commit to results repo? | Purpose |
| --- | --- | --- |
| index.jsonl | Yes | Scores, verdicts, cost, tool calls, timing |
| benchmark.json | Yes | Aggregate stats (pass rate, mean score, total cost) |
| timing.json | Yes | Aggregate timing |
| grading.json (per-test) | Optional | Per-evaluator breakdown for drill-down |
| input.md / response.md | No | Full conversations; too large, potentially sensitive |

For 300 SWE-bench Lite instances, index.jsonl ≈ 220 KB per model. 100 models ≈ 22 MB. Manageable for git.
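
For orientation, each index.jsonl line is one test result carrying roughly the fields below. The field names here are illustrative, not Studio's actual schema; the authoritative format is whatever agentv eval already writes today:

```ts
// Illustrative shape of one index.jsonl row (one JSON object per line).
type IndexRow = {
  test: string;       // e.g. "django__django-11099"
  verdict: string;    // pass/fail-style verdict
  score: number;      // evaluator score
  costUsd: number;    // model spend for this test
  toolCalls: number;  // tool-call count
  durationMs: number; // wall-clock timing
};
```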

Submission flow

Run eval locally → commit safe artifacts to benchmark results repo → open PR → CI validates JSONL → merge → Studio serves updated leaderboard

No agentv results export --public CLI command needed — submitters commit the safe artifacts directly. CI validation is the gate that rejects PRs containing conversation files or invalid JSONL.
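
A minimal sketch of that CI gate, assuming the validator receives the list of files changed by the PR (helper name and structure are hypothetical):

```ts
import { readFileSync } from "node:fs";

// Conversation artifacts must never be committed to the results repo.
const FORBIDDEN = /(input|response)\.md$/;

// Returns a list of validation errors; CI fails the PR if non-empty.
export function validateSubmission(changedFiles: string[]): string[] {
  const errors: string[] = [];
  for (const file of changedFiles) {
    if (FORBIDDEN.test(file)) {
      errors.push(`${file}: conversation files must not be committed`);
      continue;
    }
    if (file.endsWith("index.jsonl")) {
      const lines = readFileSync(file, "utf8").split("\n").filter(Boolean);
      lines.forEach((line, i) => {
        try {
          JSON.parse(line); // every line must be a valid JSON object
        } catch {
          errors.push(`${file}:${i + 1}: invalid JSON`);
        }
      });
    }
  }
  return errors;
}
```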

Studio UX for leaderboard

Studio already supports this hierarchy. The key UX change:

  1. Home → Projects dashboard (each benchmark repo = a project card)
  2. Project page → Experiments tab as default view (this IS the leaderboard table: experiment name, pass rate, cost, runs, last run)
  3. Click experiment → Runs for that submission (timestamped executions)
  4. Click run → Individual test results with existing drill-down

The Experiments tab already exists in Studio. It just needs to be the default/primary view instead of Recent Runs.

CLI changes

  • Add --experiment flag to agentv eval — currently only available on agentv pipeline input/run. Required for leaderboard submissions, defaults to default when omitted.
  • Add --read-only flag to agentv results serve — disables RunEvalModal, FeedbackPanel, and write endpoints for public deployment.
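
Expected invocations once both flags exist (all other arguments omitted):

```
# writes results to runs/claude-opus-4-custom-skills/<timestamp>/
agentv eval --experiment claude-opus-4-custom-skills

# serves the leaderboard with RunEvalModal, FeedbackPanel, and write endpoints disabled
agentv results serve --read-only
```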

What stays from #966

  • benchmarks/swe-bench-lite/ directory with setup.ts, grader, README
  • GitHub PR submission workflow
  • CI schema validation (validates JSONL format, not a custom JSON schema)
  • Pareto frontier chart, multi-dimensional columns

What changes from #966 / PR #970

  • No custom result.schema.json — use existing JSONL schema
  • No results/*.json in custom format — commit standard .agentv/results/ artifacts
  • No agentv results export --public — not needed with separate results repo
  • No landing page integration — Studio IS the leaderboard, deployed independently
  • No Astro component — Studio serves the UI directly
  • Leaderboard reads JSONL, not a bespoke JSON format
  • Benchmark-agnostic — same leaderboard works for any future benchmark

Execution path

  1. Docker workspace execution environments (#965): done (PR #971 merged)
  2. Add --experiment flag to agentv eval, implement runs/<experiment>/<timestamp>/ directory layout
  3. Default Studio view to Experiments tab for project pages
  4. Add --read-only mode to agentv results serve
  5. benchmarks/swe-bench-lite/ setup + grader (reuse from #970)
  6. Set up benchmark results repo with CI validation for JSONL submissions
  7. Pareto frontier chart on Experiments view

Acceptance signals

  • agentv eval --experiment <name> writes results to runs/<name>/<timestamp>/
  • agentv eval without --experiment writes to runs/default/<timestamp>/
  • Studio Experiments tab is default view for project pages
  • agentv results serve --read-only disables write operations
  • SWE-bench Lite results committed in standard JSONL format in separate results repo
  • Studio renders leaderboard from committed JSONL artifacts
  • Same format works for a non-coding benchmark (proves it's agnostic)
  • Sortable multi-dimensional columns (score, cost, $/fix, tool calls, latency)
  • Pareto frontier chart
  • CI validates submitted JSONL in results repo
  • Submission workflow documented

Non-goals

  • Modifying Studio's internal data format
  • agentv results export --public CLI command (not needed with separate repo)
  • Landing page or Astro component integration
  • Prescribing deployment topology (subdomains, etc.)
