feat(eval): recall-quality eval harness with CI baseline gate#241
Conversation
…ouchdev#226) Add vouch eval recall: score kb.context retrieval against a labeled query set (P@k, R@k, MRR, nDCG), compare against a committed baseline, and fail CI on a >5% P@5 regression. Pure-Python metrics; deterministic report; starter labeled set + fixture KB + eval workflow.
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds a retrieval recall evaluation harness: a new ChangesRecall Evaluation Harness
Sequence Diagram(s)sequenceDiagram
participant CI as GitHub Actions eval workflow
participant CLI as vouch eval recall
participant RunRecall as run_recall
participant KBStore as KBStore / build_context_pack
participant Compare as compare_baseline
CI->>CLI: python -m vouch.cli eval recall queries.jsonl --baseline baseline.json --k 5 --max-regression 0.05
CLI->>RunRecall: run_recall(store, queries_path, k=5)
loop 10 labeled queries
RunRecall->>KBStore: build_context_pack(store, query=q, limit=10)
KBStore-->>RunRecall: ranked claim ids
RunRecall->>RunRecall: score_query(ranked_ids, expected, k=5)
end
RunRecall-->>CLI: report {k, n_queries, macro, per_query}
CLI->>Compare: compare_baseline(report, baseline, max_regression=0.05)
Compare-->>CLI: (ok, message)
alt ok is False
CLI-->>CI: ClickException → non-zero exit
else ok is True
CLI-->>CI: exit 0
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
.github/workflows/eval.yml (1)
19-24: 💤 Low valueConsider pinning GitHub Actions to commit SHAs for supply-chain security.
The workflow currently references actions by tag (e.g.,
@v4,@v5). For enhanced security, consider pinning to specific commit SHAs to prevent tag-rewriting attacks.Additionally, consider adding
persist-credentials: falseto the checkout action to prevent credential leakage through build artifacts.🔒 Optional security hardening
- uses: actions/checkout@v4 + with: + persist-credentials: false - uses: actions/setup-python@v5For SHA pinning, you can use a tool like Dependabot to manage action versions with SHA references.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/eval.yml around lines 19 - 24, Replace the tag-based references in the GitHub Actions workflow with specific commit SHAs to prevent tag-rewriting attacks. Update actions/checkout@v4 and actions/setup-python@v5 to pin to their respective commit SHAs (you can find these on the GitHub Actions marketplace pages). Additionally, add persist-credentials: false to the actions/checkout step to prevent credential leakage through build artifacts.src/vouch/cli.py (1)
672-672: ⚡ Quick winConsider adding validation for
kparameter.The
--koption accepts any integer, including zero or negative values. Whilerun_recallguards withmax(k, 10), a negative or zerokis semantically invalid for precision/recall metrics.🛡️ Proposed validation
-@click.option("--k", default=5, show_default=True, type=int) +@click.option("--k", default=5, show_default=True, type=click.IntRange(min=1))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/cli.py` at line 672, The `--k` option in the click decorator accepts any integer value, including zero or negative numbers, which are semantically invalid for precision/recall metrics. Add a click callback validator to the `@click.option("--k")` decorator that ensures the value is a positive integer (greater than zero). The validator should raise a click.BadParameter exception if the user provides a zero or negative value, providing a clear error message about the requirement.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/vouch/cli.py`:
- Around line 687-689: The message is being output to stderr twice due to
click.echo printing it unconditionally on Line 687, and then ClickException
printing it again with an "Error: " prefix when raised on Line 689. Move the
click.echo call inside a conditional block so it only executes when the check
passes (when ok is True), and remove it from before the ClickException block.
This way the success message prints to stderr when ok is True, and only the
ClickException prints the error message to stderr when ok is False, eliminating
the duplicate output.
- Line 685: The baseline file loading at the line with
json.loads(Path(baseline).read_text(encoding="utf-8")) lacks error handling for
file reading or JSON decoding errors, which will result in Python tracebacks
instead of clean error messages. Wrap this baseline loading operation in the
_cli_errors() context manager (similar to how the run_recall call is handled) so
that both JSONDecodeError and file read errors are caught and handled cleanly,
since JSONDecodeError is a subclass of ValueError which is already handled by
the _cli_errors() context.
---
Nitpick comments:
In @.github/workflows/eval.yml:
- Around line 19-24: Replace the tag-based references in the GitHub Actions
workflow with specific commit SHAs to prevent tag-rewriting attacks. Update
actions/checkout@v4 and actions/setup-python@v5 to pin to their respective
commit SHAs (you can find these on the GitHub Actions marketplace pages).
Additionally, add persist-credentials: false to the actions/checkout step to
prevent credential leakage through build artifacts.
In `@src/vouch/cli.py`:
- Line 672: The `--k` option in the click decorator accepts any integer value,
including zero or negative numbers, which are semantically invalid for
precision/recall metrics. Add a click callback validator to the
`@click.option("--k")` decorator that ensures the value is a positive integer
(greater than zero). The validator should raise a click.BadParameter exception
if the user provides a zero or negative value, providing a clear error message
about the requirement.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 37830106-c2da-4678-bef5-eb32fa62bfed
📒 Files selected for processing (22)
.github/workflows/eval.ymlCHANGELOG.mdeval/baseline.jsoneval/fixture-kb/.vouch/.gitignoreeval/fixture-kb/.vouch/claims/api-rate-limit.yamleval/fixture-kb/.vouch/claims/auth-jwt.yamleval/fixture-kb/.vouch/claims/auth-session.yamleval/fixture-kb/.vouch/claims/cache-redis.yamleval/fixture-kb/.vouch/claims/db-migrations.yamleval/fixture-kb/.vouch/claims/db-postgres.yamleval/fixture-kb/.vouch/claims/deploy-ci.yamleval/fixture-kb/.vouch/claims/deploy-docker.yamleval/fixture-kb/.vouch/claims/search-embeddings.yamleval/fixture-kb/.vouch/claims/search-fts5.yamleval/fixture-kb/.vouch/config.yamleval/fixture-kb/.vouch/sources/1410e7845543213186bddf486561536087d01cbefcf7f0c35d0fe6e7b008113e/contenteval/fixture-kb/.vouch/sources/1410e7845543213186bddf486561536087d01cbefcf7f0c35d0fe6e7b008113e/meta.yamleval/queries.jsonlsrc/vouch/cli.pysrc/vouch/eval/__init__.pysrc/vouch/eval/recall.pytests/test_eval_recall.py
| report = run_recall(store, queries, k=k) | ||
| click.echo(json.dumps(report, indent=2)) | ||
| if baseline is not None: | ||
| base = json.loads(Path(baseline).read_text(encoding="utf-8")) |
There was a problem hiding this comment.
Add error handling for baseline JSON loading.
The baseline file reading and JSON decoding is not wrapped in error handling, unlike the run_recall call. A malformed JSON file or read error will produce a Python traceback instead of a clean error message.
🔧 Proposed fix
Wrap the baseline loading in a try/except or extend the _cli_errors() context:
store = _load_store()
with _cli_errors():
report = run_recall(store, queries, k=k)
- click.echo(json.dumps(report, indent=2))
- if baseline is not None:
- base = json.loads(Path(baseline).read_text(encoding="utf-8"))
+ click.echo(json.dumps(report, indent=2))
+ if baseline is not None:
+ base = json.loads(Path(baseline).read_text(encoding="utf-8"))
- ok, message = compare_baseline(report, base, max_regression=max_regression)
- click.echo(message, err=True)
- if not ok:
- raise click.ClickException(message)
+ ok, message = compare_baseline(report, base, max_regression=max_regression)
+ click.echo(message, err=True)
+ if not ok:
+ raise click.ClickException(message)This also catches JSONDecodeError as a ValueError subclass already handled by _cli_errors().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/vouch/cli.py` at line 685, The baseline file loading at the line with
json.loads(Path(baseline).read_text(encoding="utf-8")) lacks error handling for
file reading or JSON decoding errors, which will result in Python tracebacks
instead of clean error messages. Wrap this baseline loading operation in the
_cli_errors() context manager (similar to how the run_recall call is handled) so
that both JSONDecodeError and file read errors are caught and handled cleanly,
since JSONDecodeError is a subclass of ValueError which is already handled by
the _cli_errors() context.
| click.echo(message, err=True) | ||
| if not ok: | ||
| raise click.ClickException(message) |
There was a problem hiding this comment.
Remove duplicate message output.
The comparison message is printed to stderr on Line 687, then raised again in the ClickException on Line 689. Click will print the exception message to stderr prefixed with "Error: ", causing users to see the message twice.
🔧 Proposed fix
Remove the redundant click.echo before raising the exception:
base = json.loads(Path(baseline).read_text(encoding="utf-8"))
ok, message = compare_baseline(report, base, max_regression=max_regression)
- click.echo(message, err=True)
if not ok:
raise click.ClickException(message)
+ else:
+ click.echo(message, err=True)This prints the success message only when the check passes, and lets Click handle the failure message cleanly.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| click.echo(message, err=True) | |
| if not ok: | |
| raise click.ClickException(message) | |
| if not ok: | |
| raise click.ClickException(message) | |
| else: | |
| click.echo(message, err=True) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/vouch/cli.py` around lines 687 - 689, The message is being output to
stderr twice due to click.echo printing it unconditionally on Line 687, and then
ClickException printing it again with an "Error: " prefix when raised on Line
689. Move the click.echo call inside a conditional block so it only executes
when the check passes (when ok is True), and remove it from before the
ClickException block. This way the success message prints to stderr when ok is
True, and only the ClickException prints the error message to stderr when ok is
False, eliminating the duplicate output.
feat(eval): recall-quality eval harness with CI baseline gate
Closes #226
What
Adds a retrieval-quality eval that scores the current
kb.contextretrieval(
build_context_pack) against a hand-labeled query set and gates regressionsin CI.
src/vouch/eval/recall.py— pure-Python metrics, no numpy:load_queries(path)reads a JSONL labeled set ({"query", "expected"};expected_idsaccepted as an alias).score_query(ranked_ids, expected, *, k=5)returns{p_at_k, r_at_k, mrr, ndcg_at_k}(math.log2for nDCG).run_recall(store, queries_path, *, k=5)ranks each query viabuild_context_pack(..., limit=max(k, 10))["items"], scores it, and returnsa deterministic
{k, n_queries, macro, per_query}report.compare_baseline(report, baseline, *, max_regression=0.05)returns(ok, message); not-ok when macrop_at_kdrops belowbaseline.macro.p_at_k - max_regression.vouch eval recall <queries.jsonl> [--k N] [--baseline b.json] [--max-regression F](added to the existingevalgroup). Prints the JSONreport; with
--baseline, prints a P@k diff and exits non-zero(
click.ClickException) on a regression beyond tolerance.eval/queries.jsonl(a 10-query hand-labeled starterset, not the full 50) and
eval/baseline.json, both reproducible against asmall committed fixture KB under
eval/fixture-kb/.vouch/(10 claims + 1source).
state.dbis rebuilt at eval time (it stays gitignored as a derivedindex).
.github/workflows/eval.ymlruns on PRs touchingsrc/vouch/embeddings/**,src/vouch/context.py,src/vouch/eval/**, oreval/**; installs.[dev], rebuilds the fixture index, and fails on a >5%P@5 regression vs
eval/baseline.json.tests/test_eval_recall.py: hand-computed metric math,deterministic
run_recallover a temp KB, and the baseline gate(regression / within-tolerance / improvement).
## [Unreleased] → ### Addedbullet.Why
We had no guardrail against silent retrieval-quality regressions. A labeled set
plus a committed baseline turns "did that embedding/context change make search
worse?" into a CI signal instead of a manual hunch. Metrics are pure Python so
the gate runs in the base
.[dev]job with no numpy/embedding extras.This is intentionally a starter labeled set (~10 queries). Growing it toward
the full ~50 is follow-up work; the harness, fixtures, and gate are complete.
Test plan
Gate — all green (5 skips are the numpy-gated context/embedding tests, expected
in the base
.[dev]venv):CLI against the committed fixture (
cd eval/fixture-kb && vouch reindex):Baseline gate — passes at/above baseline, fails on regression:
Summary by CodeRabbit
New Features
vouch eval recallcommand to measure retrieval quality using precision@k, recall@k, mean reciprocal rank, and normalized discounted gain metrics.Tests
Chores