feat(review-pr): add per-finding confidence scoring model by Sayt-0 · Pull Request #17 · docker/docker-agent-action

Sayt-0 · 2026-06-24T21:02:46Z

Summary

Adds a precise, multi-criteria confidence scoring model for the PR reviewer. Each verified finding gets a 0 to 100 confidence score, a band, and a posting disposition, computed deterministically from several signals rather than a subjective guess.

The model lives in two synchronized surfaces:

Surface	Role
`src/score-confidence/` (pure module + CLI + unit tests)	Single source of truth for the model
`review-pr/agents/pr-review.yaml` "Confidence Scoring" section	Strict lookup table the orchestrator applies inline (the gitignored `dist/` is not available at agent runtime)

Criteria

Criterion	Source	Role
Verdict (CONFIRMED / LIKELY / DISMISSED)	verifier	Primary agreement signal
`evidence_strength` (direct / circumstantial / speculative)	verifier (new field)	Pattern / snippet match strength
`context_completeness` (full / partial / none)	verifier (new field)	Whether the verifier saw the code it needed
Severity concordance	derived	Drafter vs verifier severity rank distance
Scope (`in_diff` and `in_changed_code`)	derived	Hard gate
Category / severity	verifier	Posting policy only (security floor, high-severity always-post)

Scale and threshold

Score 0 to 100 from a precomputed verdict x evidence x context table (CONFIRMED 70 / LIKELY 40 base; evidence +18 / +8 / -4; context +12 / +4 / -10), plus concordance (+5 / 0 / -8), clamped.
Bands: strong >= 80, moderate 55..79, weak 30..54, negligible < 30.
Default posting threshold is 55 (the moderate-band floor, so the band and the post cutoff cannot drift apart).
Only CONFIRMED can reach strong (LIKELY tops out at 75), a property the unit tests pin.

Posting policy (first match wins; the 5-comment cap is applied last)

Rule	Behavior
Out-of-scope / DISMISSED non-security	not surfaced
Security floor (CONFIRMED/LIKELY)	always inline, never auto-suppressed, cap-exempt
High-severity (CONFIRMED/LIKELY)	always inline, cap-exempt
Band strong or moderate	inline (subject to the cap)
Band weak, non-forced	lower-confidence summary list (not silently dropped)
Negligible band, verifier severity medium	summary (visibility floor)
DISMISSED security	dismissed-security audit line (human-reviewable)
Non-forced inline overflow past 5	demoted to the summary list

Maps to the requested design

Requested	Implemented
Verifier agreement score	verdict + drafter/verifier severity concordance
Pattern match strength	`evidence_strength` field
Context completeness	`context_completeness` field
Scale (high/medium/low or 0-100)	0 to 100 plus four named bands
Default threshold	55 (moderate floor)
Precise, multiple criteria	deterministic lookup table; six criteria; 84 unit tests pin every value

How it was hardened

A design review (three independent lenses: calibration, security policy, LLM reproducibility) locked the constants, replacing error-prone post-hoc caps with the lookup table and removing band dead-zones. An adversarial verification pass then confirmed and fixed three defects:

Defect	Severity	Resolution
Verifier escalating severity could silently drop a medium-severity finding (concordance is non-monotonic in severity)	major	Medium-severity visibility floor: a still-believed medium finding is kept in the summary, never dropped
`category` was the only enum not validated, so a misspelled value silently disabled the security floor	minor	Validate `category` via `assertEnum`
Severity-disagreement can move a borderline finding from inline to summary	minor	Documented as intended: the finding stays visible, and confidence legitimately reflects assessor agreement (a different axis from severity)

TS-to-prompt numeric consistency (all 18 table cells, bands, threshold, posting precedence, schema) was verified clean.

Compatibility with #15

#15 ("always post a review comment even with zero findings") and this change touch distinct regions of pr-review.yaml and are complementary: an empty inline set yields a APPROVE assessment label while the summary and audit sections still go in the review body, and the review is still posted.

Validation

Check	Result
`pnpm build`	pass
`pnpm test` (554 tests, 84 new)	pass
`tsc --noEmit`	pass
`biome ci`	pass
`actionlint`	pass

Note on runtime placement

The model is applied by the orchestrator prompt (mirroring the tested TS module) rather than invoked as a dist bundle at agent runtime, because dist/ is gitignored and not present in the agent's working tree. If invoking the compiled scorer at runtime is preferred, that is a larger pipeline change (staging the bundle plus a permission) and can be done separately.

Score each verified finding 0-100 from the verifier verdict, evidence strength, context completeness, drafter/verifier severity concordance, and scope. Bands (strong/moderate/weak/negligible) with a default posting threshold of 55 gate inline comments; security and high-severity CONFIRMED/LIKELY findings are always posted, weak-band findings go to a visible lower-confidence summary instead of being dropped, and a medium-severity floor keeps a still-believed finding visible. The model is implemented and unit-tested in src/score-confidence (single source of truth) and mirrored in the orchestrator prompt as a strict lookup table. The verifier now emits evidence_strength and context_completeness.

CodeQL js/insecure-temporary-file: the CLI defaulted its output to a hardcoded /tmp path. Default to stdout instead (composable, no fixed temp file) and write to a file only when the caller passes an explicit output path.

docker-agent

Assessment: 🟢 APPROVE

The confidence scoring model looks correct. The lookup table arithmetic, band boundaries, clamping, posting policy precedence, enum validation, and CLI wiring all check out. The TypeScript module and the YAML mirror are consistent. No bugs were found in the code introduced by this PR.

derekmisler

two confirmed bugs in the cli entrypoint: get() instead of require() for the scope-gate flags silently drops all findings (including forced security ones), and non-array json input silently produces an empty report with exit 0. both need fixing before merge. three non-blocking items also flagged inline (nan line serialization, forced-first sort ordering, direct+none test fixture vs yaml disjointness rule).

derekmisler · 2026-06-25T13:44:01Z

+    contextCompleteness: require('context_completeness') as FindingInput['contextCompleteness'],
+    drafterSeverity: require('drafter_severity') as FindingInput['drafterSeverity'],
+    verifierSeverity: require('verifier_severity') as FindingInput['verifierSeverity'],
+    inDiff: get('in_diff') === true,


issue: get('in_diff') === true silently returns false whenever the field is absent or comes in as the string "true" instead of a boolean. get() returns undefined on a missing key, and undefined === true is false. so every finding from an agent that omits these fields, or JSON-encodes them as strings, gets both flags false, the scope hard gate fires at STEP 0, and the finding is silently dropped with no stderr output. that includes forced security findings.

the fix is to use require('in_diff') and require('in_changed_code') (which already throw on missing/null, same as the other required fields). wrap with Boolean(...) if you want to coerce a non-boolean to a boolean rather than throw:

inDiff: Boolean(require('in_diff')), inChangedCode: Boolean(require('in_changed_code')),

the same issue is on line 61 for in_changed_code.

derekmisler · 2026-06-25T13:44:01Z

+  }
+
+  const parsed = JSON.parse(readFileSync(findingsPath, 'utf-8')) as unknown;
+  const records = Array.isArray(parsed) ? (parsed as Record<string, unknown>[]) : [];


issue: when the top-level JSON value isn't an array (e.g. {}, null, a scalar, or an agent that wraps findings in {"findings": [...]}) records silently becomes [], scoreFindings([]) runs without error, and the process exits 0 with a completely empty report. a caller has no way to tell this apart from a real zero-findings run.

every other bad-input path here either throws (via assertEnum) or exits 1 (via the outer try/catch). this one should too:

if (!Array.isArray(parsed)) { process.stderr.write(`error: input must be a JSON array, got ${typeof parsed}\n`); process.exit(1); } const records = parsed as Record<string, unknown>[];

derekmisler · 2026-06-25T13:44:01Z

+  };
+  return {
+    file: String(require('file')),
+    line: Number(require('line')),


suggestion (non-blocking): require() throws on a missing key, good. but Number('abc') returns NaN, not an error, and JSON.stringify({line: NaN}) produces {"line": null}. if that reaches the GitHub review POST, the API will either silently omit the line anchor or return a 422.

add a post-conversion check:

const lineNum = Number(require('line')); if (!Number.isInteger(lineNum) || lineNum < 1) { throw new Error(`'line' must be a positive integer, got: ${require('line')}`); } // then use lineNum

derekmisler · 2026-06-25T13:44:01Z

+  const byDisposition = (d: Disposition): ScoredFinding[] =>
+    scored
+      .filter((s) => s.result.disposition === d)
+      .sort((a, b) => b.result.sortKey - a.result.sortKey);


suggestion (non-blocking): the JSDoc on ConfidenceReport.inline (line 195) says "forced first, then capped default-band", but byDisposition sorts all inline findings purely by sortKey with no forced-first partition. a LIKELY security finding with speculative evidence and none context scores 26, so it will sort after a non-forced CONFIRMED/direct/full at score 100. if the consumer renders in array order (likely), security findings could appear buried at the bottom.

fix: partition by forced before sorting within the inline bucket:

const byDisposition = (d: Disposition): ScoredFinding[] => { const matches = scored.filter((s) => s.result.disposition === d); if (d !== 'inline') return matches.sort((a, b) => b.result.sortKey - a.result.sortKey); const forced = matches.filter((s) => s.result.forced).sort((a, b) => b.result.sortKey - a.result.sortKey); const nonForced = matches.filter((s) => !s.result.forced).sort((a, b) => b.result.sortKey - a.result.sortKey); return [...forced, ...nonForced]; };

derekmisler · 2026-06-25T13:44:01Z

+    ['direct', 'partial', 97],
+    ['circumstantial', 'full', 95],
+    ['circumstantial', 'partial', 87],
+    ['direct', 'none', 83],


question (non-blocking): pr-review.yaml line 839 has an explicit disjointness rule: "if context_completeness is none you MUST NOT set evidence_strength to direct" (without the defining context you can't claim direct evidence). this fixture exercises that exact forbidden combination and the TS scorer silently accepts it.

two ways to resolve this:

add validation in scoreFinding to reject direct+none the same way it validates enums via assertEnum, and update this fixture to use a valid combo (e.g. ['speculative', 'none', ...]).

if direct+none is intentionally supported in the TS model (the YAML rule being aspirational/prompt-only), document that explicitly and note the divergence.

right now the TS model and the YAML rule silently contradict each other, which will cause confusion when the verifier agent produces a direct+none combo and the scorer happily accepts it.

derekmisler · 2026-06-25T13:44:01Z

+      band: 'negligible',
+      disposition: 'drop',
+      forced: false,
+      reason: 'out-of-scope (not in_diff && in_changed_code)',


nitpick (super non-blocking): 'out-of-scope (not in_diff && in_changed_code)' reads (by standard operator precedence) as (!in_diff) && in_changed_code, i.e. only in_diff is negated. the actual condition is !(in_diff && in_changed_code). anyone reading this in a debug log will be confused about which flag caused the drop.

'out-of-scope (not (in_diff && in_changed_code))' or 'out-of-scope: requires both in_diff and in_changed_code' would be unambiguous.

- parseRecord: require strict booleans for in_diff/in_changed_code so a missing or string-encoded flag fails loudly instead of silently dropping the finding (including forced security findings) at the scope gate - parseRecord: validate line as a positive integer (NaN would serialize to null and break the GitHub review line anchor) - main: reject a non-array top-level JSON value via toFindingRecords instead of producing an empty report with exit 0 - scoreFindings: order the inline bucket forced-first so security and high-severity findings are never buried; mirror in both pr-review.yaml spots - clarify the out-of-scope reason string - document that the verifier direct+none disjointness rule is an output constraint, not a scorer invariant (graceful scoring, not a batch abort) Export parseRecord/toFindingRecords and guard main() for unit testing; add tests covering all of the above.

derekmisler

all six findings addressed: requireBool() for the scope-gate flags, toFindingRecords() throwing on non-array input, requireInt() with Number.isInteger guard, forced-first partitioning in byDisposition, the reason string, and the direct+none test fixture. ci is green, approving.

Reconcile the abuse safeguards with the requester-based authorization rework that landed on main (PRs #16, #17, #18). check-org-membership: take main's evaluateMembership model. It subsumes this branch's resolveUsername helper (the directly wired pull_request auto-review path now verifies the PR author via a live API lookup) and adds the review_requested requester path so an org member can pull an external contributor's PR into review. review-pr.yml: provide EVENT_NAME, EVENT_ACTION and REQUESTER to the membership step; drop the PR_AUTHOR env that main's code never reads. tsup.config.ts: keep both the rate-limit (this branch) and score-confidence (main) entry points. review-pr/README.md: keep main's requester-model wording and preserve this branch's audit, throttle and force-push safeguards. SECURITY.md: update section 1 to the requester-based model and remove the stale resolveUsername and PR_AUTHOR references.

github-advanced-security AI found potential problems Jun 24, 2026

View reviewed changes

Comment thread src/score-confidence/index.ts Fixed

fix(score-confidence): write CLI output to stdout by default

85b1517

CodeQL js/insecure-temporary-file: the CLI defaulted its output to a hardcoded /tmp path. Default to stdout instead (composable, no fixed temp file) and write to a file only when the caller passes an explicit output path.

Sayt-0 enabled auto-merge (squash) June 24, 2026 21:10

Sayt-0 requested a review from derekmisler June 24, 2026 21:10

docker-agent reviewed Jun 24, 2026

View reviewed changes

derekmisler reviewed Jun 25, 2026

View reviewed changes

Sayt-0 added 2 commits June 25, 2026 18:31

Merge remote-tracking branch 'origin/main' into feat/confidence-scoring

df80287

derekmisler approved these changes Jun 25, 2026

View reviewed changes

Sayt-0 merged commit 70b6d7b into main Jun 25, 2026
8 checks passed

Sayt-0 deleted the feat/confidence-scoring branch June 25, 2026 16:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(review-pr): add per-finding confidence scoring model#17

feat(review-pr): add per-finding confidence scoring model#17
Sayt-0 merged 4 commits into
mainfrom
feat/confidence-scoring

Sayt-0 commented Jun 24, 2026

Uh oh!

Uh oh!

docker-agent left a comment

Uh oh!

derekmisler left a comment

Uh oh!

derekmisler Jun 25, 2026

Uh oh!

derekmisler Jun 25, 2026

Uh oh!

derekmisler Jun 25, 2026

Uh oh!

derekmisler Jun 25, 2026

Uh oh!

derekmisler Jun 25, 2026

Uh oh!

derekmisler Jun 25, 2026

Uh oh!

derekmisler left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

Sayt-0 commented Jun 24, 2026

Summary

Criteria

Scale and threshold

Posting policy (first match wins; the 5-comment cap is applied last)

Maps to the requested design

How it was hardened

Compatibility with #15

Validation

Note on runtime placement

Uh oh!

Uh oh!

docker-agent left a comment

Choose a reason for hiding this comment

Assessment: 🟢 APPROVE

Uh oh!

derekmisler left a comment

Choose a reason for hiding this comment

Uh oh!

derekmisler Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

derekmisler Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

derekmisler Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

derekmisler Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

derekmisler Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

derekmisler Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

derekmisler left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants