Problem
The design doc says "prompts are the primary extension point" and "intelligence lives in prompts" — but prompts are never tested. The test suite mocks away the agent entirely. A prompt regression (triage that stops clustering correctly, review that approves bad issues, dedup that creates duplicates) is discovered only when it ships and causes downstream waste.
This is a critical gap: the most important part of the system has zero automated quality signal.
Examples of silent regressions today
- `triage.md` stops merging related findings → noisy issue volume spikes
- `review.md` approves issues without code snippets → fix loop gets bad specs
- `dedup.md` posts a duplicate it should have skipped → issue clutter accumulates
- `implement.md` starts making broad changes → PRs touch unrelated files
None of these would be caught by the current test suite. They'd be noticed only by a human watching issue quality degrade over days.
What's needed
An evaluation harness that runs prompt steps against golden inputs and asserts on structured outputs:
- Golden `findings[]` → expected `clusters[]` from triage
- Golden `clusters[]` → expected `issues[]` from draft
- Golden `issues[]` → expected review verdict (approved/rejected + reason)
- Golden issue pairs → expected dedup action (post/comment/skip)
Evaluations don't need to run in CI on every commit (cost-prohibitive), but they should be runnable on demand before merging prompt changes, and ideally on a weekly schedule to catch drift.
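One lightweight way to store golden cases is one JSON file per case under a per-step directory, pairing an input payload with the expected structured output. A minimal sketch of a triage case, expressed here as a Python literal; the field names and directory layout are assumptions for illustration, not a fixed schema:

```python
# Illustrative golden case for the triage step, e.g. stored at
# evals/golden/triage/merges_duplicate_findings.json (assumed layout).
GOLDEN_TRIAGE_CASE = {
    "name": "merges-duplicate-null-deref-findings",
    "input": {
        "findings": [
            {"id": "f1", "file": "api/users.py", "summary": "possible None deref in get_user"},
            {"id": "f2", "file": "api/users.py", "summary": "get_user may return None to caller"},
            {"id": "f3", "file": "web/auth.py", "summary": "missing CSRF check on login form"},
        ]
    },
    "expected": {
        "clusters": [
            # The two related findings should be merged into one cluster...
            {"finding_ids": ["f1", "f2"], "title": "None handling in get_user"},
            # ...while the unrelated one stays separate.
            {"finding_ids": ["f3"], "title": "Missing CSRF check on login form"},
        ]
    },
}
```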
Definition of Done
- At least one golden dataset per shared prompt step (triage, draft, review, dedup, implement, fix-review)
- A CLI command to run evaluations: `python run.py eval <step>` (see the sketch after this list)
- Failures produce a diff between expected and actual structured output
- README or playbook documents how to run evals before merging a prompt change
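A minimal sketch of what the eval entry point could look like, assuming golden cases are JSON files under `evals/golden/<step>/` and that a `run_prompt_step()` helper wraps the real agent call; both names are hypothetical and would need to match how `run.py` drives the agent today:

```python
import argparse
import difflib
import json
import sys
from pathlib import Path

# Hypothetical wrapper around a single prompt-step agent invocation;
# not an existing module in the repo, shown here only to sketch the flow.
from agent import run_prompt_step

GOLDEN_DIR = Path("evals/golden")  # assumed layout: evals/golden/<step>/*.json

def run_eval(step: str) -> int:
    failures = 0
    for case_path in sorted((GOLDEN_DIR / step).glob("*.json")):
        case = json.loads(case_path.read_text())
        actual = run_prompt_step(step, case["input"])
        expected = case["expected"]
        if actual != expected:
            failures += 1
            # On failure, print a unified diff of expected vs. actual structured output.
            diff = difflib.unified_diff(
                json.dumps(expected, indent=2, sort_keys=True).splitlines(),
                json.dumps(actual, indent=2, sort_keys=True).splitlines(),
                fromfile=f"{case_path.name} (expected)",
                tofile=f"{case_path.name} (actual)",
                lineterm="",
            )
            print("\n".join(diff))
    print(f"{step}: {failures} failing case(s)")
    return 1 if failures else 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run prompt evals for one step")
    parser.add_argument(
        "step",
        choices=["triage", "draft", "review", "dedup", "implement", "fix-review"],
    )
    args = parser.parse_args()
    sys.exit(run_eval(args.step))
```

Exact-match comparison is the simplest starting point; steps whose outputs legitimately vary (e.g. free-text titles) may need per-field or fuzzy assertions later, but that refinement is outside this issue.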
Out of Scope
- Running evals in CI on every commit (API cost)
- Full end-to-end scan→issue→fix evaluation (separate effort)
- Automated prompt optimization