Follow-up to #1115, which landed the board scorecard engine, registry, and four accuracy scorers. The scoring math, scorers, aggregator, and bench wiring are all tested in CI (Linux), but the parts that need the macOS app and TCC permissions can only be done on a Mac. Tracking those here.
Mac-only work
Accuracy candidate generation
The accuracy scorers are honest-by-default (INCOMPLETE without a real candidate). To turn them green we need candidate generators that drive the app:
Definition of done
A bash scripts/ops/transcripted-qa-bench.sh --mode scorecard run on a Mac that produces a board-scorecard.md where the auto boards are GREEN/YELLOW/RED on real evidence (not INCOMPLETE), and the hardware/human boards are routed to the manual packet.
https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT
Follow-up to #1115, which landed the board scorecard engine, registry, and four accuracy scorers. The scoring math, scorers, aggregator, and bench wiring are all tested in CI (Linux), but the parts that need the macOS app and TCC permissions can only be done on a Mac. Tracking those here.
Mac-only work
uievidence. Runtranscripted-qa ui-smoke --report ...and confirm--mode scorecardingests it. Check that each board'suidimension actually pulls rows.functionalevidence. Runtranscripted-qa validate-all --format jsonagainst a populated capture library and confirm boards pull the rightfunctionalrows.check_globs. The globs in.agents/board-scorecard.ymlare best-effort guesses. After the first real run, tighten them against the actualui-smoke/validate-allcheck names so each board pulls exactly the evidence it owns (see "Tuning the registry" indocs/board-scorecard.md).Accuracy candidate generation
The accuracy scorers are honest-by-default (INCOMPLETE without a real candidate). To turn them green we need candidate generators that drive the app:
compare-meeting-corpus.pyoutput intoscore-transcription.json(mean word recall → 0-100).score-diarization.py.rawinputs to produce the candidate output JSON keyed by case id.score-summary-judge.py --mode prompt), have the agent score them, fold results back in.Definition of done
A
bash scripts/ops/transcripted-qa-bench.sh --mode scorecardrun on a Mac that produces aboard-scorecard.mdwhere the auto boards are GREEN/YELLOW/RED on real evidence (not INCOMPLETE), and the hardware/human boards are routed to the manual packet.https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT