feat(review-pr): add per-finding confidence scoring model#17
Conversation
Score each verified finding 0-100 from the verifier verdict, evidence strength, context completeness, drafter/verifier severity concordance, and scope. Bands (strong/moderate/weak/negligible) with a default posting threshold of 55 gate inline comments; security and high-severity CONFIRMED/LIKELY findings are always posted, weak-band findings go to a visible lower-confidence summary instead of being dropped, and a medium-severity floor keeps a still-believed finding visible. The model is implemented and unit-tested in src/score-confidence (single source of truth) and mirrored in the orchestrator prompt as a strict lookup table. The verifier now emits evidence_strength and context_completeness.
CodeQL js/insecure-temporary-file: the CLI defaulted its output to a hardcoded /tmp path. Default to stdout instead (composable, no fixed temp file) and write to a file only when the caller passes an explicit output path.
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟢 APPROVE
The confidence scoring model looks correct. The lookup table arithmetic, band boundaries, clamping, posting policy precedence, enum validation, and CLI wiring all check out. The TypeScript module and the YAML mirror are consistent. No bugs were found in the code introduced by this PR.
derekmisler
left a comment
There was a problem hiding this comment.
two confirmed bugs in the cli entrypoint: get() instead of require() for the scope-gate flags silently drops all findings (including forced security ones), and non-array json input silently produces an empty report with exit 0. both need fixing before merge. three non-blocking items also flagged inline (nan line serialization, forced-first sort ordering, direct+none test fixture vs yaml disjointness rule).
| contextCompleteness: require('context_completeness') as FindingInput['contextCompleteness'], | ||
| drafterSeverity: require('drafter_severity') as FindingInput['drafterSeverity'], | ||
| verifierSeverity: require('verifier_severity') as FindingInput['verifierSeverity'], | ||
| inDiff: get('in_diff') === true, |
There was a problem hiding this comment.
issue: get('in_diff') === true silently returns false whenever the field is absent or comes in as the string "true" instead of a boolean. get() returns undefined on a missing key, and undefined === true is false. so every finding from an agent that omits these fields, or JSON-encodes them as strings, gets both flags false, the scope hard gate fires at STEP 0, and the finding is silently dropped with no stderr output. that includes forced security findings.
the fix is to use require('in_diff') and require('in_changed_code') (which already throw on missing/null, same as the other required fields). wrap with Boolean(...) if you want to coerce a non-boolean to a boolean rather than throw:
inDiff: Boolean(require('in_diff')),
inChangedCode: Boolean(require('in_changed_code')),the same issue is on line 61 for in_changed_code.
| } | ||
|
|
||
| const parsed = JSON.parse(readFileSync(findingsPath, 'utf-8')) as unknown; | ||
| const records = Array.isArray(parsed) ? (parsed as Record<string, unknown>[]) : []; |
There was a problem hiding this comment.
issue: when the top-level JSON value isn't an array (e.g. {}, null, a scalar, or an agent that wraps findings in {"findings": [...]}) records silently becomes [], scoreFindings([]) runs without error, and the process exits 0 with a completely empty report. a caller has no way to tell this apart from a real zero-findings run.
every other bad-input path here either throws (via assertEnum) or exits 1 (via the outer try/catch). this one should too:
if (!Array.isArray(parsed)) {
process.stderr.write(`error: input must be a JSON array, got ${typeof parsed}\n`);
process.exit(1);
}
const records = parsed as Record<string, unknown>[];| }; | ||
| return { | ||
| file: String(require('file')), | ||
| line: Number(require('line')), |
There was a problem hiding this comment.
suggestion (non-blocking): require() throws on a missing key, good. but Number('abc') returns NaN, not an error, and JSON.stringify({line: NaN}) produces {"line": null}. if that reaches the GitHub review POST, the API will either silently omit the line anchor or return a 422.
add a post-conversion check:
const lineNum = Number(require('line'));
if (!Number.isInteger(lineNum) || lineNum < 1) {
throw new Error(`'line' must be a positive integer, got: ${require('line')}`);
}
// then use lineNum| const byDisposition = (d: Disposition): ScoredFinding[] => | ||
| scored | ||
| .filter((s) => s.result.disposition === d) | ||
| .sort((a, b) => b.result.sortKey - a.result.sortKey); |
There was a problem hiding this comment.
suggestion (non-blocking): the JSDoc on ConfidenceReport.inline (line 195) says "forced first, then capped default-band", but byDisposition sorts all inline findings purely by sortKey with no forced-first partition. a LIKELY security finding with speculative evidence and none context scores 26, so it will sort after a non-forced CONFIRMED/direct/full at score 100. if the consumer renders in array order (likely), security findings could appear buried at the bottom.
fix: partition by forced before sorting within the inline bucket:
const byDisposition = (d: Disposition): ScoredFinding[] => {
const matches = scored.filter((s) => s.result.disposition === d);
if (d !== 'inline') return matches.sort((a, b) => b.result.sortKey - a.result.sortKey);
const forced = matches.filter((s) => s.result.forced).sort((a, b) => b.result.sortKey - a.result.sortKey);
const nonForced = matches.filter((s) => !s.result.forced).sort((a, b) => b.result.sortKey - a.result.sortKey);
return [...forced, ...nonForced];
};| ['direct', 'partial', 97], | ||
| ['circumstantial', 'full', 95], | ||
| ['circumstantial', 'partial', 87], | ||
| ['direct', 'none', 83], |
There was a problem hiding this comment.
question (non-blocking): pr-review.yaml line 839 has an explicit disjointness rule: "if context_completeness is none you MUST NOT set evidence_strength to direct" (without the defining context you can't claim direct evidence). this fixture exercises that exact forbidden combination and the TS scorer silently accepts it.
two ways to resolve this:
- add validation in
scoreFindingto rejectdirect+nonethe same way it validates enums viaassertEnum, and update this fixture to use a valid combo (e.g.['speculative', 'none', ...]). - if
direct+noneis intentionally supported in the TS model (the YAML rule being aspirational/prompt-only), document that explicitly and note the divergence.
right now the TS model and the YAML rule silently contradict each other, which will cause confusion when the verifier agent produces a direct+none combo and the scorer happily accepts it.
| band: 'negligible', | ||
| disposition: 'drop', | ||
| forced: false, | ||
| reason: 'out-of-scope (not in_diff && in_changed_code)', |
There was a problem hiding this comment.
nitpick (super non-blocking): 'out-of-scope (not in_diff && in_changed_code)' reads (by standard operator precedence) as (!in_diff) && in_changed_code, i.e. only in_diff is negated. the actual condition is !(in_diff && in_changed_code). anyone reading this in a debug log will be confused about which flag caused the drop.
'out-of-scope (not (in_diff && in_changed_code))' or 'out-of-scope: requires both in_diff and in_changed_code' would be unambiguous.
- parseRecord: require strict booleans for in_diff/in_changed_code so a missing or string-encoded flag fails loudly instead of silently dropping the finding (including forced security findings) at the scope gate - parseRecord: validate line as a positive integer (NaN would serialize to null and break the GitHub review line anchor) - main: reject a non-array top-level JSON value via toFindingRecords instead of producing an empty report with exit 0 - scoreFindings: order the inline bucket forced-first so security and high-severity findings are never buried; mirror in both pr-review.yaml spots - clarify the out-of-scope reason string - document that the verifier direct+none disjointness rule is an output constraint, not a scorer invariant (graceful scoring, not a batch abort) Export parseRecord/toFindingRecords and guard main() for unit testing; add tests covering all of the above.
derekmisler
left a comment
There was a problem hiding this comment.
all six findings addressed: requireBool() for the scope-gate flags, toFindingRecords() throwing on non-array input, requireInt() with Number.isInteger guard, forced-first partitioning in byDisposition, the reason string, and the direct+none test fixture. ci is green, approving.
Reconcile the abuse safeguards with the requester-based authorization rework that landed on main (PRs #16, #17, #18). check-org-membership: take main's evaluateMembership model. It subsumes this branch's resolveUsername helper (the directly wired pull_request auto-review path now verifies the PR author via a live API lookup) and adds the review_requested requester path so an org member can pull an external contributor's PR into review. review-pr.yml: provide EVENT_NAME, EVENT_ACTION and REQUESTER to the membership step; drop the PR_AUTHOR env that main's code never reads. tsup.config.ts: keep both the rate-limit (this branch) and score-confidence (main) entry points. review-pr/README.md: keep main's requester-model wording and preserve this branch's audit, throttle and force-push safeguards. SECURITY.md: update section 1 to the requester-based model and remove the stale resolveUsername and PR_AUTHOR references.
Summary
Adds a precise, multi-criteria confidence scoring model for the PR reviewer. Each verified finding gets a 0 to 100 confidence score, a band, and a posting disposition, computed deterministically from several signals rather than a subjective guess.
The model lives in two synchronized surfaces:
src/score-confidence/(pure module + CLI + unit tests)review-pr/agents/pr-review.yaml"Confidence Scoring" sectiondist/is not available at agent runtime)Criteria
evidence_strength(direct / circumstantial / speculative)context_completeness(full / partial / none)in_diffandin_changed_code)Scale and threshold
verdict x evidence x contexttable (CONFIRMED 70/LIKELY 40base; evidence+18 / +8 / -4; context+12 / +4 / -10), plus concordance (+5 / 0 / -8), clamped.strong >= 80,moderate 55..79,weak 30..54,negligible < 30.strong(LIKELY tops out at 75), a property the unit tests pin.Posting policy (first match wins; the 5-comment cap is applied last)
Maps to the requested design
evidence_strengthfieldcontext_completenessfieldHow it was hardened
A design review (three independent lenses: calibration, security policy, LLM reproducibility) locked the constants, replacing error-prone post-hoc caps with the lookup table and removing band dead-zones. An adversarial verification pass then confirmed and fixed three defects:
categorywas the only enum not validated, so a misspelled value silently disabled the security floorcategoryviaassertEnumTS-to-prompt numeric consistency (all 18 table cells, bands, threshold, posting precedence, schema) was verified clean.
Compatibility with #15
#15 ("always post a review comment even with zero findings") and this change touch distinct regions of
pr-review.yamland are complementary: an empty inline set yields aAPPROVEassessment label while the summary and audit sections still go in the review body, and the review is still posted.Validation
pnpm buildpnpm test(554 tests, 84 new)tsc --noEmitbiome ciactionlintNote on runtime placement
The model is applied by the orchestrator prompt (mirroring the tested TS module) rather than invoked as a
distbundle at agent runtime, becausedist/is gitignored and not present in the agent's working tree. If invoking the compiled scorer at runtime is preferred, that is a larger pipeline change (staging the bundle plus a permission) and can be done separately.