Skip to content

feat(review-pr): add per-finding confidence scoring model#17

Merged
Sayt-0 merged 4 commits into
mainfrom
feat/confidence-scoring
Jun 25, 2026
Merged

feat(review-pr): add per-finding confidence scoring model#17
Sayt-0 merged 4 commits into
mainfrom
feat/confidence-scoring

Conversation

@Sayt-0

@Sayt-0 Sayt-0 commented Jun 24, 2026

Copy link
Copy Markdown
Member

Summary

Adds a precise, multi-criteria confidence scoring model for the PR reviewer. Each verified finding gets a 0 to 100 confidence score, a band, and a posting disposition, computed deterministically from several signals rather than a subjective guess.

The model lives in two synchronized surfaces:

Surface Role
src/score-confidence/ (pure module + CLI + unit tests) Single source of truth for the model
review-pr/agents/pr-review.yaml "Confidence Scoring" section Strict lookup table the orchestrator applies inline (the gitignored dist/ is not available at agent runtime)

Criteria

Criterion Source Role
Verdict (CONFIRMED / LIKELY / DISMISSED) verifier Primary agreement signal
evidence_strength (direct / circumstantial / speculative) verifier (new field) Pattern / snippet match strength
context_completeness (full / partial / none) verifier (new field) Whether the verifier saw the code it needed
Severity concordance derived Drafter vs verifier severity rank distance
Scope (in_diff and in_changed_code) derived Hard gate
Category / severity verifier Posting policy only (security floor, high-severity always-post)

Scale and threshold

  • Score 0 to 100 from a precomputed verdict x evidence x context table (CONFIRMED 70 / LIKELY 40 base; evidence +18 / +8 / -4; context +12 / +4 / -10), plus concordance (+5 / 0 / -8), clamped.
  • Bands: strong >= 80, moderate 55..79, weak 30..54, negligible < 30.
  • Default posting threshold is 55 (the moderate-band floor, so the band and the post cutoff cannot drift apart).
  • Only CONFIRMED can reach strong (LIKELY tops out at 75), a property the unit tests pin.

Posting policy (first match wins; the 5-comment cap is applied last)

Rule Behavior
Out-of-scope / DISMISSED non-security not surfaced
Security floor (CONFIRMED/LIKELY) always inline, never auto-suppressed, cap-exempt
High-severity (CONFIRMED/LIKELY) always inline, cap-exempt
Band strong or moderate inline (subject to the cap)
Band weak, non-forced lower-confidence summary list (not silently dropped)
Negligible band, verifier severity medium summary (visibility floor)
DISMISSED security dismissed-security audit line (human-reviewable)
Non-forced inline overflow past 5 demoted to the summary list

Maps to the requested design

Requested Implemented
Verifier agreement score verdict + drafter/verifier severity concordance
Pattern match strength evidence_strength field
Context completeness context_completeness field
Scale (high/medium/low or 0-100) 0 to 100 plus four named bands
Default threshold 55 (moderate floor)
Precise, multiple criteria deterministic lookup table; six criteria; 84 unit tests pin every value

How it was hardened

A design review (three independent lenses: calibration, security policy, LLM reproducibility) locked the constants, replacing error-prone post-hoc caps with the lookup table and removing band dead-zones. An adversarial verification pass then confirmed and fixed three defects:

Defect Severity Resolution
Verifier escalating severity could silently drop a medium-severity finding (concordance is non-monotonic in severity) major Medium-severity visibility floor: a still-believed medium finding is kept in the summary, never dropped
category was the only enum not validated, so a misspelled value silently disabled the security floor minor Validate category via assertEnum
Severity-disagreement can move a borderline finding from inline to summary minor Documented as intended: the finding stays visible, and confidence legitimately reflects assessor agreement (a different axis from severity)

TS-to-prompt numeric consistency (all 18 table cells, bands, threshold, posting precedence, schema) was verified clean.

Compatibility with #15

#15 ("always post a review comment even with zero findings") and this change touch distinct regions of pr-review.yaml and are complementary: an empty inline set yields a APPROVE assessment label while the summary and audit sections still go in the review body, and the review is still posted.

Validation

Check Result
pnpm build pass
pnpm test (554 tests, 84 new) pass
tsc --noEmit pass
biome ci pass
actionlint pass

Note on runtime placement

The model is applied by the orchestrator prompt (mirroring the tested TS module) rather than invoked as a dist bundle at agent runtime, because dist/ is gitignored and not present in the agent's working tree. If invoking the compiled scorer at runtime is preferred, that is a larger pipeline change (staging the bundle plus a permission) and can be done separately.

Score each verified finding 0-100 from the verifier verdict, evidence
strength, context completeness, drafter/verifier severity concordance,
and scope. Bands (strong/moderate/weak/negligible) with a default
posting threshold of 55 gate inline comments; security and high-severity
CONFIRMED/LIKELY findings are always posted, weak-band findings go to a
visible lower-confidence summary instead of being dropped, and a
medium-severity floor keeps a still-believed finding visible.

The model is implemented and unit-tested in src/score-confidence
(single source of truth) and mirrored in the orchestrator prompt as a
strict lookup table. The verifier now emits evidence_strength and
context_completeness.
Comment thread src/score-confidence/index.ts Fixed
CodeQL js/insecure-temporary-file: the CLI defaulted its output to a
hardcoded /tmp path. Default to stdout instead (composable, no fixed
temp file) and write to a file only when the caller passes an explicit
output path.
@Sayt-0 Sayt-0 enabled auto-merge (squash) June 24, 2026 21:10
@Sayt-0 Sayt-0 requested a review from derekmisler June 24, 2026 21:10

@docker-agent docker-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟢 APPROVE

The confidence scoring model looks correct. The lookup table arithmetic, band boundaries, clamping, posting policy precedence, enum validation, and CLI wiring all check out. The TypeScript module and the YAML mirror are consistent. No bugs were found in the code introduced by this PR.

@derekmisler derekmisler left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two confirmed bugs in the cli entrypoint: get() instead of require() for the scope-gate flags silently drops all findings (including forced security ones), and non-array json input silently produces an empty report with exit 0. both need fixing before merge. three non-blocking items also flagged inline (nan line serialization, forced-first sort ordering, direct+none test fixture vs yaml disjointness rule).

Comment thread src/score-confidence/index.ts Outdated
contextCompleteness: require('context_completeness') as FindingInput['contextCompleteness'],
drafterSeverity: require('drafter_severity') as FindingInput['drafterSeverity'],
verifierSeverity: require('verifier_severity') as FindingInput['verifierSeverity'],
inDiff: get('in_diff') === true,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: get('in_diff') === true silently returns false whenever the field is absent or comes in as the string "true" instead of a boolean. get() returns undefined on a missing key, and undefined === true is false. so every finding from an agent that omits these fields, or JSON-encodes them as strings, gets both flags false, the scope hard gate fires at STEP 0, and the finding is silently dropped with no stderr output. that includes forced security findings.

the fix is to use require('in_diff') and require('in_changed_code') (which already throw on missing/null, same as the other required fields). wrap with Boolean(...) if you want to coerce a non-boolean to a boolean rather than throw:

inDiff: Boolean(require('in_diff')),
inChangedCode: Boolean(require('in_changed_code')),

the same issue is on line 61 for in_changed_code.

Comment thread src/score-confidence/index.ts Outdated
}

const parsed = JSON.parse(readFileSync(findingsPath, 'utf-8')) as unknown;
const records = Array.isArray(parsed) ? (parsed as Record<string, unknown>[]) : [];

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: when the top-level JSON value isn't an array (e.g. {}, null, a scalar, or an agent that wraps findings in {"findings": [...]}) records silently becomes [], scoreFindings([]) runs without error, and the process exits 0 with a completely empty report. a caller has no way to tell this apart from a real zero-findings run.

every other bad-input path here either throws (via assertEnum) or exits 1 (via the outer try/catch). this one should too:

if (!Array.isArray(parsed)) {
  process.stderr.write(`error: input must be a JSON array, got ${typeof parsed}\n`);
  process.exit(1);
}
const records = parsed as Record<string, unknown>[];

Comment thread src/score-confidence/index.ts Outdated
};
return {
file: String(require('file')),
line: Number(require('line')),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): require() throws on a missing key, good. but Number('abc') returns NaN, not an error, and JSON.stringify({line: NaN}) produces {"line": null}. if that reaches the GitHub review POST, the API will either silently omit the line anchor or return a 422.

add a post-conversion check:

const lineNum = Number(require('line'));
if (!Number.isInteger(lineNum) || lineNum < 1) {
  throw new Error(`'line' must be a positive integer, got: ${require('line')}`);
}
// then use lineNum

const byDisposition = (d: Disposition): ScoredFinding[] =>
scored
.filter((s) => s.result.disposition === d)
.sort((a, b) => b.result.sortKey - a.result.sortKey);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): the JSDoc on ConfidenceReport.inline (line 195) says "forced first, then capped default-band", but byDisposition sorts all inline findings purely by sortKey with no forced-first partition. a LIKELY security finding with speculative evidence and none context scores 26, so it will sort after a non-forced CONFIRMED/direct/full at score 100. if the consumer renders in array order (likely), security findings could appear buried at the bottom.

fix: partition by forced before sorting within the inline bucket:

const byDisposition = (d: Disposition): ScoredFinding[] => {
  const matches = scored.filter((s) => s.result.disposition === d);
  if (d !== 'inline') return matches.sort((a, b) => b.result.sortKey - a.result.sortKey);
  const forced = matches.filter((s) => s.result.forced).sort((a, b) => b.result.sortKey - a.result.sortKey);
  const nonForced = matches.filter((s) => !s.result.forced).sort((a, b) => b.result.sortKey - a.result.sortKey);
  return [...forced, ...nonForced];
};

['direct', 'partial', 97],
['circumstantial', 'full', 95],
['circumstantial', 'partial', 87],
['direct', 'none', 83],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking): pr-review.yaml line 839 has an explicit disjointness rule: "if context_completeness is none you MUST NOT set evidence_strength to direct" (without the defining context you can't claim direct evidence). this fixture exercises that exact forbidden combination and the TS scorer silently accepts it.

two ways to resolve this:

  1. add validation in scoreFinding to reject direct+none the same way it validates enums via assertEnum, and update this fixture to use a valid combo (e.g. ['speculative', 'none', ...]).
  2. if direct+none is intentionally supported in the TS model (the YAML rule being aspirational/prompt-only), document that explicitly and note the divergence.

right now the TS model and the YAML rule silently contradict each other, which will cause confusion when the verifier agent produces a direct+none combo and the scorer happily accepts it.

band: 'negligible',
disposition: 'drop',
forced: false,
reason: 'out-of-scope (not in_diff && in_changed_code)',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick (super non-blocking): 'out-of-scope (not in_diff && in_changed_code)' reads (by standard operator precedence) as (!in_diff) && in_changed_code, i.e. only in_diff is negated. the actual condition is !(in_diff && in_changed_code). anyone reading this in a debug log will be confused about which flag caused the drop.

'out-of-scope (not (in_diff && in_changed_code))' or 'out-of-scope: requires both in_diff and in_changed_code' would be unambiguous.

Sayt-0 added 2 commits June 25, 2026 18:31
- parseRecord: require strict booleans for in_diff/in_changed_code so a
  missing or string-encoded flag fails loudly instead of silently dropping
  the finding (including forced security findings) at the scope gate
- parseRecord: validate line as a positive integer (NaN would serialize to
  null and break the GitHub review line anchor)
- main: reject a non-array top-level JSON value via toFindingRecords instead
  of producing an empty report with exit 0
- scoreFindings: order the inline bucket forced-first so security and
  high-severity findings are never buried; mirror in both pr-review.yaml spots
- clarify the out-of-scope reason string
- document that the verifier direct+none disjointness rule is an output
  constraint, not a scorer invariant (graceful scoring, not a batch abort)

Export parseRecord/toFindingRecords and guard main() for unit testing; add
tests covering all of the above.

@derekmisler derekmisler left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all six findings addressed: requireBool() for the scope-gate flags, toFindingRecords() throwing on non-array input, requireInt() with Number.isInteger guard, forced-first partitioning in byDisposition, the reason string, and the direct+none test fixture. ci is green, approving.

@Sayt-0 Sayt-0 merged commit 70b6d7b into main Jun 25, 2026
8 checks passed
@Sayt-0 Sayt-0 deleted the feat/confidence-scoring branch June 25, 2026 16:33
Sayt-0 added a commit that referenced this pull request Jun 25, 2026
Reconcile the abuse safeguards with the requester-based authorization
rework that landed on main (PRs #16, #17, #18).

check-org-membership: take main's evaluateMembership model. It subsumes
this branch's resolveUsername helper (the directly wired pull_request
auto-review path now verifies the PR author via a live API lookup) and
adds the review_requested requester path so an org member can pull an
external contributor's PR into review.

review-pr.yml: provide EVENT_NAME, EVENT_ACTION and REQUESTER to the
membership step; drop the PR_AUTHOR env that main's code never reads.

tsup.config.ts: keep both the rate-limit (this branch) and
score-confidence (main) entry points.

review-pr/README.md: keep main's requester-model wording and preserve
this branch's audit, throttle and force-push safeguards.

SECURITY.md: update section 1 to the requester-based model and remove
the stale resolveUsername and PR_AUTHOR references.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants