Problem
The design doc says "prompts are the primary extension point" and "intelligence lives in prompts" — but prompts are never tested. The test suite mocks away the agent entirely. A prompt regression (triage that stops clustering correctly, review that approves bad issues, dedup that creates duplicates) is discovered only when it ships and causes downstream waste.
This is a critical gap: the most important part of the system has zero automated quality signal.
Examples of silent regressions today
- `triage.md` stops merging related findings → noisy issue volume spikes
- `review.md` approves issues without code snippets → fix loop gets bad specs
- `dedup.md` posts a duplicate it should have skipped → issue clutter accumulates
- `implement.md` starts making broad changes → PRs touch unrelated files
None of these would be caught by the current test suite. They'd be noticed only by a human watching issue quality degrade over days.
What's needed
An evaluation harness that runs prompt steps against golden inputs and asserts on structured outputs:
- Golden `findings[]` → expected `clusters[]` from triage
- Golden `clusters[]` → expected `issues[]` from draft
- Golden `issues[]` → expected review verdict (approved/rejected + reason)
- Golden issue pairs → expected dedup action (post/comment/skip)
Evaluations don't need to run in CI on every commit (cost-prohibitive), but they should be runnable on demand before merging prompt changes, and ideally on a weekly schedule to catch drift.
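One lightweight way to store golden cases is one JSON file per case under a per-step directory, pairing an input payload with the expected structured output. A minimal sketch of a triage case, expressed here as a Python literal; the field names and directory layout are assumptions for illustration, not a fixed schema:

```python
# Illustrative golden case for the triage step, e.g. stored at
# evals/golden/triage/merges_duplicate_findings.json (assumed layout).
GOLDEN_TRIAGE_CASE = {
    "name": "merges-duplicate-null-deref-findings",
    "input": {
        "findings": [
            {"id": "f1", "file": "api/users.py", "summary": "possible None deref in get_user"},
            {"id": "f2", "file": "api/users.py", "summary": "get_user may return None to caller"},
            {"id": "f3", "file": "web/auth.py", "summary": "missing CSRF check on login form"},
        ]
    },
    "expected": {
        "clusters": [
            # The two related findings should be merged into one cluster...
            {"finding_ids": ["f1", "f2"], "title": "None handling in get_user"},
            # ...while the unrelated one stays separate.
            {"finding_ids": ["f3"], "title": "Missing CSRF check on login form"},
        ]
    },
}
```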
Definition of Done
- At least one golden dataset per shared prompt step (triage, draft, review, dedup, implement, fix-review)
- A CLI command to run evaluations: `python run.py eval <step>` (see the sketch after this list)
- Failures produce a diff between expected and actual structured output
- README or playbook documents how to run evals before merging a prompt change
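A minimal sketch of what the eval entry point could look like, assuming golden cases are JSON files under `evals/golden/<step>/` and that a `run_prompt_step()` helper wraps the real agent call; both names are hypothetical and would need to match how `run.py` drives the agent today:

```python
import argparse
import difflib
import json
import sys
from pathlib import Path

# Hypothetical wrapper around a single prompt-step agent invocation;
# not an existing module in the repo, shown here only to sketch the flow.
from agent import run_prompt_step

GOLDEN_DIR = Path("evals/golden")  # assumed layout: evals/golden/<step>/*.json

def run_eval(step: str) -> int:
    failures = 0
    for case_path in sorted((GOLDEN_DIR / step).glob("*.json")):
        case = json.loads(case_path.read_text())
        actual = run_prompt_step(step, case["input"])
        expected = case["expected"]
        if actual != expected:
            failures += 1
            # On failure, print a unified diff of expected vs. actual structured output.
            diff = difflib.unified_diff(
                json.dumps(expected, indent=2, sort_keys=True).splitlines(),
                json.dumps(actual, indent=2, sort_keys=True).splitlines(),
                fromfile=f"{case_path.name} (expected)",
                tofile=f"{case_path.name} (actual)",
                lineterm="",
            )
            print("\n".join(diff))
    print(f"{step}: {failures} failing case(s)")
    return 1 if failures else 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run prompt evals for one step")
    parser.add_argument(
        "step",
        choices=["triage", "draft", "review", "dedup", "implement", "fix-review"],
    )
    args = parser.parse_args()
    sys.exit(run_eval(args.step))
```

Exact-match comparison is the simplest starting point; steps whose outputs legitimately vary (e.g. free-text titles) may need per-field or fuzzy assertions later, but that refinement is outside this issue.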
Out of Scope
- Running evals in CI on every commit (API cost)
- Full end-to-end scan→issue→fix evaluation (separate effort)
- Automated prompt optimization