
prompts: no evaluation harness — regressions discovered only at runtime #50

@ooloth


Problem

The design doc says "prompts are the primary extension point" and "intelligence lives in prompts" — but prompts are never tested. The test suite mocks away the agent entirely. A prompt regression (triage that stops clustering correctly, review that approves bad issues, dedup that creates duplicates) is discovered only when it ships and causes downstream waste.

This is a critical gap: the most important part of the system has zero automated quality signal.

Examples of silent regressions today

  • triage.md stops merging related findings → noisy issue volume spikes
  • review.md approves issues without code snippets → fix loop gets bad specs
  • dedup.md posts a duplicate it should have skipped → issue clutter accumulates
  • implement.md starts making broad changes → PRs touch unrelated files

None of these would be caught by the current test suite. They'd be noticed only by a human watching issue quality degrade over days.

What's needed

An evaluation harness that runs prompt steps against golden inputs and asserts on the structured outputs each step produces (see the sketch after this list):

  • Golden findings[] → expected clusters[] from triage
  • Golden clusters[] → expected issues[] from draft
  • Golden issues[] → expected review verdict (approved/rejected + reason)
  • Golden issue pairs → expected dedup action (post/comment/skip)
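
As a sketch of the data side, one golden case per file keeps datasets reviewable in PRs. Everything below is an assumption for illustration, not existing project code: the `evals/<step>/` layout, the `name`/`input`/`expected` field names, and the `load_golden_cases` helper.

```python
import json
from pathlib import Path

# Hypothetical layout: evals/triage/merge_related_findings.json
# {
#   "name": "merge_related_findings",
#   "input": {"findings": [ ...two findings with the same root cause... ]},
#   "expected": {"clusters": [ ...one merged cluster... ]}
# }

def load_golden_cases(step: str) -> list[dict]:
    """Load every golden case for a prompt step from evals/<step>/*.json."""
    return [json.loads(p.read_text()) for p in sorted(Path("evals", step).glob("*.json"))]
```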

Evaluations don't need to run in CI on every commit (cost-prohibitive), but they should be runnable on demand before merging prompt changes, and ideally on a weekly schedule to catch drift.

Definition of Done

  • At least one golden dataset per shared prompt step (triage, draft, review, dedup, implement, fix-review)
  • A CLI command to run evaluations: `python run.py eval <step>`
  • Failures produce a diff between expected and actual structured output (sketched below)
  • README or playbook documents how to run evals before merging a prompt change
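
A minimal sketch of the failure output, reusing `load_golden_cases` from the sketch above. The diff uses only stdlib `difflib`; `run_step` is a hypothetical caller-supplied hook into the real prompt step, not an existing function:

```python
import difflib
import json

def diff_structured(expected, actual) -> str:
    """Unified diff between expected and actual structured output, via canonical JSON."""
    exp = json.dumps(expected, indent=2, sort_keys=True).splitlines(keepends=True)
    act = json.dumps(actual, indent=2, sort_keys=True).splitlines(keepends=True)
    return "".join(difflib.unified_diff(exp, act, fromfile="expected", tofile="actual"))

def run_eval(step: str, run_step) -> int:
    """Run every golden case for one step; print a diff per failure, return the failure count."""
    failures = 0
    for case in load_golden_cases(step):  # loader from the sketch above
        actual = run_step(step, case["input"])  # run_step: hypothetical hook into the real prompt step
        if actual != case["expected"]:
            failures += 1
            print(f"FAIL {step}/{case.get('name', '?')}")
            print(diff_structured(case["expected"], actual))
    return failures
```

`python run.py eval <step>` would wire `run_step` to the real agent call and exit nonzero on any failure; a weekly scheduled job can invoke the same command to catch drift.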

Out of Scope

  • Running evals in CI on every commit (API cost)
  • Full end-to-end scan→issue→fix evaluation (separate effort)
  • Automated prompt optimization

Metadata


Labels

  • enhancement (New feature or request)
  • scope:fix (Fix loop, implement, review, PR opening)
  • scope:scan (Scan loop, triage, draft, review pipeline)
