Skip to content

Board scorecard: wire real evidence + tune on a Mac (follow-up to #1115) #1120

@r3dbars

Description

@r3dbars

Follow-up to #1115, which landed the board scorecard engine, registry, and four accuracy scorers. The scoring math, scorers, aggregator, and bench wiring are all tested in CI (Linux), but the parts that need the macOS app and TCC permissions can only be done on a Mac. Tracking those here.

Mac-only work

  • Wire ui evidence. Run transcripted-qa ui-smoke --report ... and confirm --mode scorecard ingests it. Check that each board's ui dimension actually pulls rows.
  • Wire functional evidence. Run transcripted-qa validate-all --format json against a populated capture library and confirm boards pull the right functional rows.
  • Tune check_globs. The globs in .agents/board-scorecard.yml are best-effort guesses. After the first real run, tighten them against the actual ui-smoke / validate-all check names so each board pulls exactly the evidence it owns (see "Tuning the registry" in docs/board-scorecard.md).
  • Calibrate thresholds/weights. Confirm GREEN ≥ 85 / YELLOW ≥ 65 and the per-board weights match how much each surface actually matters.

Accuracy candidate generation

The accuracy scorers are honest-by-default (INCOMPLETE without a real candidate). To turn them green we need candidate generators that drive the app:

  • Transcription — adapt compare-meeting-corpus.py output into score-transcription.json (mean word recall → 0-100).
  • Diarization — produce reference+hypothesis segments JSON from the corpus (Zoom caption turns vs Transcripted timestamps) for score-diarization.py.
  • Dictation — drive dictation on the fixture raw inputs to produce the candidate output JSON keyed by case id.
  • Detection — replay the labelled app-state fixture cases through the detector to capture its decision per id.
  • Summary — generate judge prompt packets (score-summary-judge.py --mode prompt), have the agent score them, fold results back in.

Definition of done

A bash scripts/ops/transcripted-qa-bench.sh --mode scorecard run on a Mac that produces a board-scorecard.md where the auto boards are GREEN/YELLOW/RED on real evidence (not INCOMPLETE), and the hardware/human boards are routed to the manual packet.

https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions