Board scorecard: wire real evidence + tune on a Mac (follow-up to #1115)

Follow-up to #1115, which landed the board scorecard engine, registry, and four accuracy scorers. The scoring math, scorers, aggregator, and bench wiring are all tested in CI (Linux), but the parts that need the macOS app and TCC permissions can only be done on a Mac. Tracking those here.

## Mac-only work

- [ ] **Wire `ui` evidence.** Run `transcripted-qa ui-smoke --report ...` and confirm `--mode scorecard` ingests it. Check that each board's `ui` dimension actually pulls rows.
- [ ] **Wire `functional` evidence.** Run `transcripted-qa validate-all --format json` against a populated capture library and confirm boards pull the right `functional` rows.
- [ ] **Tune `check_globs`.** The globs in `.agents/board-scorecard.yml` are best-effort guesses. After the first real run, tighten them against the actual `ui-smoke` / `validate-all` check names so each board pulls exactly the evidence it owns (see "Tuning the registry" in `docs/board-scorecard.md`).
- [ ] **Calibrate thresholds/weights.** Confirm GREEN ≥ 85 / YELLOW ≥ 65 and the per-board weights match how much each surface actually matters.

## Accuracy candidate generation

The accuracy scorers are honest-by-default (INCOMPLETE without a real candidate). To turn them green we need candidate generators that drive the app:

- [ ] **Transcription** — adapt `compare-meeting-corpus.py` output into `score-transcription.json` (mean word recall → 0-100).
- [ ] **Diarization** — produce reference+hypothesis segments JSON from the corpus (Zoom caption turns vs Transcripted timestamps) for `score-diarization.py`.
- [ ] **Dictation** — drive dictation on the fixture `raw` inputs to produce the candidate output JSON keyed by case id.
- [ ] **Detection** — replay the labelled app-state fixture cases through the detector to capture its decision per id.
- [ ] **Summary** — generate judge prompt packets (`score-summary-judge.py --mode prompt`), have the agent score them, fold results back in.

## Definition of done

A `bash scripts/ops/transcripted-qa-bench.sh --mode scorecard` run on a Mac that produces a `board-scorecard.md` where the auto boards are GREEN/YELLOW/RED on real evidence (not INCOMPLETE), and the hardware/human boards are routed to the manual packet.

https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Board scorecard: wire real evidence + tune on a Mac (follow-up to #1115) #1120

Mac-only work

Accuracy candidate generation

Definition of done

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Board scorecard: wire real evidence + tune on a Mac (follow-up to #1115) #1120

Description

Mac-only work

Accuracy candidate generation

Definition of done

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions