Skip to content

feat: drop ge/le schema bounds on LLM finding confidence and start_line#133

Merged
rng1995 merged 1 commit into
NVIDIA:mainfrom
wernerkasselman-au:fix/structured-output-schema
Jun 22, 2026
Merged

feat: drop ge/le schema bounds on LLM finding confidence and start_line#133
rng1995 merged 1 commit into
NVIDIA:mainfrom
wernerkasselman-au:fix/structured-output-schema

Conversation

@wernerkasselman-au

Copy link
Copy Markdown
Contributor

Closes #132. Split out of #19 at @rng1995's request, since it changes the LLM structured-output contract and deserves its own rationale.

Problem

LLMFinding and MetaAnalyzerFinding express their numeric ranges with Pydantic ge/le (start_line >= 1, confidence in [0.0, 1.0]). Pydantic emits those as JSON-schema minimum / maximum in the response schema. Some OpenAI-compatible structured-output and tool-calling endpoints validate that schema and reject minimum / maximum, so the LLM call fails before a single finding comes back.

Change

  • Drop the ge/le Field bounds on LLMFinding.start_line, LLMFinding.confidence, and MetaAnalyzerFinding.confidence, and convey the ranges in the field descriptions.
  • Enforce the ranges with runtime field_validators that clamp into range: confidence to [0.0, 1.0], start_line to >= 1.

Clamping rather than raising is deliberate: an LLM occasionally returns a value just outside the range (confidence 1.01, or start_line 0 for a whole-file finding), and normalising it is better than a ValidationError that fails the whole structured-output parse and drops the batch.

Result

The emitted JSON schema no longer carries minimum / maximum, so it stays portable across structured-output backends, and the in-code [0, 1] / >= 1 guarantee is unchanged. Tests assert both: that neither model's model_json_schema() contains minimum or maximum, and that the validators clamp.

Notes

  • 625 passed, 12 skipped, ruff clean. The two existing tests that asserted a ValueError on out-of-range confidence now assert the clamp.
  • meta_analyzer already imported field_validator (for the overall_assessment parser), so this reuses it.

Pydantic ge/le constraints emit JSON-schema minimum/maximum. Some
OpenAI-compatible structured-output and tool-calling endpoints reject a
response schema that carries those keywords, so the LLM call fails before
any finding is returned.

Express the ranges in the field descriptions and enforce them with runtime
field_validators that clamp into range instead. Clamping rather than raising
means a slightly out-of-range model value normalises rather than failing the
whole structured-output parse and dropping the batch:

- LLMFinding.confidence and MetaAnalyzerFinding.confidence clamp to [0.0, 1.0]
- LLMFinding.start_line clamps to >= 1

Only the numeric bound is removed, not the requiredness: start_line stays a
required field (no default), so a finding with no location is still rejected
rather than materialised at line 1.

The emitted JSON schema no longer contains minimum/maximum, so it stays
portable across structured-output backends, while the in-code guarantee is
unchanged. Adds tests asserting the schema carries no numeric bounds, that
start_line stays required, and that the validators clamp.

Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: APPROVE

Thanks for splitting this out of #19 with its own rationale. The change is focused, the intent is well-documented in both the code comments and the description, and the tests lock in the guarantees that matter.

Verification against the stated contract

  • (a) ge/le removed — Confirmed on LLMFinding.start_line, LLMFinding.confidence, and MetaAnalyzerFinding.confidence; each is now a bare Field(description=...) with the range conveyed in the description.
  • (b) Validators genuinely clamp_clamp_confidence returns min(1.0, max(0.0, v)) ([0.0, 1.0]) and _clamp_start_line returns v if v >= 1 else 1 (>= 1). Both are @field_validator in the default ("after") mode, so they run on the already-coerced numeric value, which is the right choice here — a "before" validator would see raw/uncoerced input. Non-numeric input still fails type coercion before reaching the clamp, which is the desired behavior.
  • (c) start_line stays requiredLLMFinding.start_line has no default, so a finding with no location is still rejected rather than silently materialized at line 1. test_llm_finding_start_line_is_required pins this down. Good, explicit separation between "drop the numeric bound" and "drop requiredness".
  • (d) Emitted schema is clean — The recursive _numeric_keywords walk asserts neither model's model_json_schema() contains minimum/maximum, which is exactly the portability goal.

Design trade-off (clamp vs. raise)

Reasonable for this contract. A single value that lands a hair outside the range shouldn't fail the whole structured-output parse and drop the rest of the batch. Confidence is advisory, and snapping a whole-file finding's start_line 0 -> 1 keeps the finding instead of losing it over an off-by-one.

Minor / optional (non-blocking): the clamp is silent, so a genuinely misbehaving model (e.g. confidence=5.0 -> 1.0, or start_line=-50 -> 1) is normalized with no trace. For a security tool, emitting a logger.debug/warning when a value is actually out of range (vs. within range) would preserve that signal without changing the resilience behavior. Not required to merge.

Tests

Good coverage: schema-has-no-bounds for both models, clamp high/low for both, start_line clamp, and the required-field guarantee. Converting the two prior "raises on confidence 1.5" tests into clamp assertions is a deliberate, documented contract change, and the updates are consistent with it.

Code quality / compatibility

Clear and minimal. It reuses the field_validator import already present in the meta-analyzer module (used by the existing assessment parser), the comments explain the "why" rather than the "what", and there's no dead code. The only behavior change for callers is the raise -> clamp contract on out-of-range numeric input, which is intentional and covered by the updated tests.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LLM finding schemas emit JSON-schema minimum/maximum, which some structured-output endpoints reject

2 participants