Add per-board accuracy scorecard for agent-driven QA by r3dbars · Pull Request #1115 · r3dbars/transcripted

r3dbars · 2026-06-13T21:05:30Z

What

Gives an AI agent (Claude Code, Codex, or any capable assistant) one task list and one scoring contract so it can test every surface of the app and report a 0-100 accuracy score per board instead of a flat "504/504 checks passed" pile.

A board is one testable surface: a screen (Home, Speakers, Settings · General), a quality lane (Transcription, Diarization, Summary), or a recording surface (Meeting capture). All 20 are enumerated in .agents/board-scorecard.yml.

How it scores

Each board blends up to three dimensions by weight:

Dimension	Answers	Evidence
`ui`	renders + responds?	`transcripted-qa ui-smoke --format json`
`functional`	valid artifacts?	`transcripted-qa validate-all --format json`
`accuracy`	close to ground truth?	a `score-<board>.json` from a scorer

Honesty rules baked in:

No evidence → INCOMPLETE, never green or red.
Board score = weighted mean over dimensions that have evidence (weights renormalize).
Overall verdict = worst auto board — one RED can't hide behind greens.
hardware/human boards are listed and routed to the manual packet, not faked green.

What's here

.agents/board-scorecard.yml — the board registry (task list + scoring contract).
scripts/ops/score_boards_lib.py — pure scoring math.
scripts/ops/score-boards.py — aggregator → board-scorecard.md + .json.
New accuracy scorers: diarization DER, dictation-correction WER, meeting-detection F1, summary LLM-judge rubric.
scripts/ops/fixtures/board-scorecard/ — synthetic fixtures (tests/demos only, never fed as real product output).
transcripted-qa-bench.sh — new --mode scorecard.
docs/board-scorecard.md + test-score-boards.py (15 tests).

Verification

Runnable in this Linux/cloud session (the macOS app + Swift tools can't build here):

python3 scripts/ops/test-score-boards.py → 15/15 green
bash -n scripts/ops/transcripted-qa-bench.sh → OK
py_compile on all scorers → OK
end-to-end demo scorecard generated and eyeballed

Still needs a Mac run to wire ui-smoke/validate-all evidence and tighten check_globs against real check names — see the "Tuning the registry" section in the docs.

https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT

Generated by Claude Code

Give an AI agent one task list and one scoring contract so it can test every surface of the app and report a 0-100 score per board instead of a flat pass/fail pile. - .agents/board-scorecard.yml: registry of 20 boards (screens, quality lanes, recording surfaces) with ui/functional/accuracy dimensions, weights, thresholds, and auto/hardware/human classification. - scripts/ops/score_boards_lib.py: pure scoring math (dimension blend with renormalization, status tiers, worst-board roll-up). No-evidence stays INCOMPLETE, never green or red. - scripts/ops/score-boards.py: aggregator that ingests existing transcripted-qa validate-all / ui-smoke JSON plus accuracy scorer outputs and writes board-scorecard.md + .json. - New accuracy scorers: diarization DER (score-diarization.py), dictation correction WER (score-dictation.py), meeting-detection F1 (score-detection.py), and summary LLM-judge rubric (score-summary-judge.py). - Synthetic fixtures under scripts/ops/fixtures/board-scorecard for tests and demos only (never fed as real product output). - transcripted-qa-bench.sh: new --mode scorecard wiring evidence + scoring. - docs/board-scorecard.md and test-score-boards.py (15 tests, all green). https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT

r3dbars mentioned this pull request Jun 14, 2026

Board scorecard: wire real evidence + tune on a Mac (follow-up to #1115) #1120

Open

9 tasks

r3dbars marked this pull request as ready for review June 14, 2026 01:33

r3dbars added 5 commits June 13, 2026 21:25

Fix scorecard mode wiring

42bf204

Harden scorecard evidence parsing

9ca9bfa

Fix scorecard report edge cases

1292f8a

Remove scorecard parser ordering dependency

c90d45e

Update QA bench mode contract

4f5fd99

r3dbars merged commit 28a5cf9 into main Jun 14, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add per-board accuracy scorecard for agent-driven QA#1115

Add per-board accuracy scorecard for agent-driven QA#1115
r3dbars merged 6 commits into
mainfrom
claude/ai-code-testing-accuracy-e7z4cf

r3dbars commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

r3dbars commented Jun 13, 2026

What

How it scores

What's here

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants