Add per-board accuracy scorecard for agent-driven QA #1115

Merged
r3dbars merged 6 commits into main from claude/ai-code-testing-accuracy-e7z4cf on Jun 14, 2026
r3dbars (Owner) commented Jun 13, 2026

What

Gives an AI agent (Claude Code, Codex, or any capable assistant) one task list and one scoring contract so it can test every surface of the app and report a 0-100 accuracy score per board instead of a flat "504/504 checks passed" pile.

A board is one testable surface: a screen (Home, Speakers, Settings · General), a quality lane (Transcription, Diarization, Summary), or a recording surface (Meeting capture). All 20 are enumerated in .agents/board-scorecard.yml.

How it scores

Each board blends up to three dimensions by weight:

| Dimension | Answers | Evidence |
| --- | --- | --- |
| ui | renders + responds? | transcripted-qa ui-smoke --format json |
| functional | valid artifacts? | transcripted-qa validate-all --format json |
| accuracy | close to ground truth? | a score-<board>.json from a scorer |

Honesty rules baked in:

  • No evidence → INCOMPLETE, never green or red.
  • Board score = weighted mean over dimensions that have evidence (weights renormalize).
  • Overall verdict = worst auto board — one RED can't hide behind greens.
  • hardware/human boards are listed and routed to the manual packet, not faked green.
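
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in score_boards_lib.py: the function names, the YELLOW tier, the 90/70 cut-offs, and the ranking of INCOMPLETE between YELLOW and RED are all assumptions made for the example. Only the renormalized weighted mean, the no-evidence-means-INCOMPLETE rule, and the worst-board roll-up come from the description.

```python
from typing import Optional

# Hypothetical tier cut-offs; the real thresholds live in the board registry.
GREEN_MIN = 90.0
YELLOW_MIN = 70.0

# Hypothetical severity order for the roll-up (higher = worse).
# Where INCOMPLETE sits in this order is an assumption of this sketch.
RANK = {"GREEN": 0, "YELLOW": 1, "INCOMPLETE": 2, "RED": 3}


def blend(scores: dict[str, Optional[float]],
          weights: dict[str, float]) -> Optional[float]:
    """Weighted mean over the dimensions that have evidence.

    A dimension with no evidence is passed as None and is dropped, so the
    remaining weights renormalize. Returns None when nothing has evidence.
    """
    present = {dim: s for dim, s in scores.items()
               if s is not None and weights.get(dim, 0) > 0}
    total_weight = sum(weights[dim] for dim in present)
    if total_weight == 0:
        return None
    return sum(s * weights[dim] for dim, s in present.items()) / total_weight


def status(score: Optional[float]) -> str:
    """Map a blended score to a tier; no score means INCOMPLETE."""
    if score is None:
        return "INCOMPLETE"
    if score >= GREEN_MIN:
        return "GREEN"
    if score >= YELLOW_MIN:
        return "YELLOW"
    return "RED"


def overall(auto_board_statuses: dict[str, str]) -> str:
    """Worst status among the auto boards, so one RED outweighs any greens."""
    if not auto_board_statuses:
        return "INCOMPLETE"
    return max(auto_board_statuses.values(), key=RANK.__getitem__)


if __name__ == "__main__":
    weights = {"ui": 2, "functional": 3, "accuracy": 5}
    # No accuracy evidence: the ui and functional weights renormalize to 2:3.
    home = blend({"ui": 100.0, "functional": 80.0, "accuracy": None}, weights)
    print(home, status(home))  # 88.0 YELLOW
```

Dropping the missing dimension instead of scoring it as zero keeps a board from turning red just because a scorer has not run yet, which matches the first honesty rule.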

What's here

  • .agents/board-scorecard.yml — the board registry (task list + scoring contract).
  • scripts/ops/score_boards_lib.py — pure scoring math.
  • scripts/ops/score-boards.py — aggregator → board-scorecard.md + .json.
  • New accuracy scorers: diarization DER, dictation-correction WER, meeting-detection F1, summary LLM-judge rubric.
  • scripts/ops/fixtures/board-scorecard/ — synthetic fixtures (tests/demos only, never fed as real product output).
  • transcripted-qa-bench.sh — new --mode scorecard.
  • docs/board-scorecard.md + test-score-boards.py (15 tests).
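
For readers unfamiliar with the metrics named in the scorer list, here is a rough sketch of two of them using their textbook definitions. It is not taken from score-dictation.py or score-detection.py. The function names are hypothetical, and the mapping from an error rate to a 0-100 board score is an assumption of this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to count."""
    denom = 2 * true_pos + false_pos + false_neg
    return 0.0 if denom == 0 else 2 * true_pos / denom


def error_rate_to_score(error_rate: float) -> float:
    """Hypothetical mapping: 0% error -> 100, 100% error or worse -> 0."""
    return max(0.0, 100.0 * (1.0 - error_rate))
```

WER can exceed 1.0 when the hypothesis contains many inserted words, which is why the mapping clamps at zero.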

Verification

Runnable in this Linux/cloud session (the macOS app + Swift tools can't build here):

  • python3 scripts/ops/test-score-boards.py → 15/15 green
  • bash -n scripts/ops/transcripted-qa-bench.sh → OK
  • py_compile on all scorers → OK
  • end-to-end demo scorecard generated and eyeballed

Still needs a Mac run to wire ui-smoke/validate-all evidence and tighten check_globs against real check names — see the "Tuning the registry" section in the docs.
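
One way to approach the check_globs tuning on a Mac is to list the check names that no board's globs claim. The sketch below assumes check_globs are shell-style patterns matched with Python's fnmatch. The board ids and check names are invented for illustration, and the real registry schema may differ.

```python
from fnmatch import fnmatchcase


def unmatched_checks(check_names: list[str],
                     board_globs: dict[str, list[str]]) -> list[str]:
    """Return check names that no board's globs match (candidates for tuning)."""
    all_globs = [g for globs in board_globs.values() for g in globs]
    return [name for name in check_names
            if not any(fnmatchcase(name, g) for g in all_globs)]


def boards_without_checks(check_names: list[str],
                          board_globs: dict[str, list[str]]) -> list[str]:
    """Return boards whose globs match nothing, i.e. boards that would stay INCOMPLETE."""
    return [board for board, globs in board_globs.items()
            if not any(fnmatchcase(name, g) for name in check_names for g in globs)]


if __name__ == "__main__":
    # Hypothetical registry excerpt and check names, for illustration only.
    globs = {"home": ["home.*"], "settings-general": ["settings.general.*"]}
    checks = ["home.renders", "speakers.list"]
    print(unmatched_checks(checks, globs))
    print(boards_without_checks(checks, globs))
```

fnmatchcase is used instead of fnmatch so that matching stays case-sensitive on every platform.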

https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT


Generated by Claude Code

Give an AI agent one task list and one scoring contract so it can test
every surface of the app and report a 0-100 score per board instead of a
flat pass/fail pile.

- .agents/board-scorecard.yml: registry of 20 boards (screens, quality
  lanes, recording surfaces) with ui/functional/accuracy dimensions,
  weights, thresholds, and auto/hardware/human classification.
- scripts/ops/score_boards_lib.py: pure scoring math (dimension blend with
  renormalization, status tiers, worst-board roll-up). No-evidence stays
  INCOMPLETE, never green or red.
- scripts/ops/score-boards.py: aggregator that ingests existing
  transcripted-qa validate-all / ui-smoke JSON plus accuracy scorer outputs
  and writes board-scorecard.md + .json.
- New accuracy scorers: diarization DER (score-diarization.py), dictation
  correction WER (score-dictation.py), meeting-detection F1
  (score-detection.py), and summary LLM-judge rubric (score-summary-judge.py).
- Synthetic fixtures under scripts/ops/fixtures/board-scorecard for tests
  and demos only (never fed as real product output).
- transcripted-qa-bench.sh: new --mode scorecard wiring evidence + scoring.
- docs/board-scorecard.md and test-score-boards.py (15 tests, all green).

https://claude.ai/code/session_01XkfcGgmAvwBbHFBWTBkBaT

r3dbars marked this pull request as ready for review June 14, 2026 01:33
r3dbars merged commit 28a5cf9 into main Jun 14, 2026
3 checks passed