Add per-board accuracy scorecard for agent-driven QA#1115
Merged
Conversation
Give an AI agent one task list and one scoring contract so it can test every surface of the app and report a 0-100 score per board instead of a flat pass/fail pile. - .agents/board-scorecard.yml: registry of 20 boards (screens, quality lanes, recording surfaces) with ui/functional/accuracy dimensions, weights, thresholds, and auto/hardware/human classification. - scripts/ops/score_boards_lib.py: pure scoring math (dimension blend with renormalization, status tiers, worst-board roll-up). No-evidence stays INCOMPLETE, never green or red. - scripts/ops/score-boards.py: aggregator that ingests existing transcripted-qa validate-all / ui-smoke JSON plus accuracy scorer outputs and writes board-scorecard.md + .json. - New accuracy scorers: diarization DER (score-diarization.py), dictation correction WER (score-dictation.py), meeting-detection F1 (score-detection.py), and summary LLM-judge rubric (score-summary-judge.py). - Synthetic fixtures under scripts/ops/fixtures/board-scorecard for tests and demos only (never fed as real product output). - transcripted-qa-bench.sh: new --mode scorecard wiring evidence + scoring. - docs/board-scorecard.md and test-score-boards.py (15 tests, all green). https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT
9 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Gives an AI agent (Claude Code, Codex, or any capable assistant) one task list and one scoring contract so it can test every surface of the app and report a 0-100 accuracy score per board instead of a flat "504/504 checks passed" pile.
A board is one testable surface: a screen (Home, Speakers, Settings · General), a quality lane (Transcription, Diarization, Summary), or a recording surface (Meeting capture). All 20 are enumerated in
.agents/board-scorecard.yml.How it scores
Each board blends up to three dimensions by weight:
uitranscripted-qa ui-smoke --format jsonfunctionaltranscripted-qa validate-all --format jsonaccuracyscore-<board>.jsonfrom a scorerHonesty rules baked in:
hardware/humanboards are listed and routed to the manual packet, not faked green.What's here
.agents/board-scorecard.yml— the board registry (task list + scoring contract).scripts/ops/score_boards_lib.py— pure scoring math.scripts/ops/score-boards.py— aggregator →board-scorecard.md+.json.scripts/ops/fixtures/board-scorecard/— synthetic fixtures (tests/demos only, never fed as real product output).transcripted-qa-bench.sh— new--mode scorecard.docs/board-scorecard.md+test-score-boards.py(15 tests).Verification
Runnable in this Linux/cloud session (the macOS app + Swift tools can't build here):
python3 scripts/ops/test-score-boards.py→ 15/15 greenbash -n scripts/ops/transcripted-qa-bench.sh→ OKpy_compileon all scorers → OKStill needs a Mac run to wire
ui-smoke/validate-allevidence and tightencheck_globsagainst real check names — see the "Tuning the registry" section in the docs.https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT
Generated by Claude Code