feat(eval): recall-quality eval harness with CI baseline gate by dripsmvcp · Pull Request #241 · vouchdev/vouch

dripsmvcp · 2026-06-17T03:35:39Z

feat(eval): recall-quality eval harness with CI baseline gate

Closes #226

What

Adds a retrieval-quality eval that scores the current kb.context retrieval
(build_context_pack) against a hand-labeled query set and gates regressions
in CI.

src/vouch/eval/recall.py — pure-Python metrics, no numpy:
- load_queries(path) reads a JSONL labeled set ({"query", "expected"};
  expected_ids accepted as an alias).
- score_query(ranked_ids, expected, *, k=5) returns
  {p_at_k, r_at_k, mrr, ndcg_at_k} (math.log2 for nDCG).
- run_recall(store, queries_path, *, k=5) ranks each query via
  build_context_pack(..., limit=max(k, 10))["items"], scores it, and returns
  a deterministic {k, n_queries, macro, per_query} report.
- compare_baseline(report, baseline, *, max_regression=0.05) returns
  (ok, message); not-ok when macro p_at_k drops below
  baseline.macro.p_at_k - max_regression.
CLI — vouch eval recall <queries.jsonl> [--k N] [--baseline b.json] [--max-regression F] (added to the existing eval group). Prints the JSON
report; with --baseline, prints a P@k diff and exits non-zero
(click.ClickException) on a regression beyond tolerance.
Repo fixtures — eval/queries.jsonl (a 10-query hand-labeled starter
set, not the full 50) and eval/baseline.json, both reproducible against a
small committed fixture KB under eval/fixture-kb/.vouch/ (10 claims + 1
source). state.db is rebuilt at eval time (it stays gitignored as a derived
index).
CI — .github/workflows/eval.yml runs on PRs touching
src/vouch/embeddings/**, src/vouch/context.py, src/vouch/eval/**, or
eval/**; installs .[dev], rebuilds the fixture index, and fails when macro
P@5 drops more than 0.05 (absolute, via --max-regression 0.05) below
eval/baseline.json.
Tests — tests/test_eval_recall.py: hand-computed metric math,
deterministic run_recall over a temp KB, and the baseline gate
(regression / within-tolerance / improvement).
CHANGELOG — ## [Unreleased] → ### Added bullet.
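
To make the four metrics listed above concrete, here is a minimal pure-Python sketch of a score_query-style function. It is not the code in src/vouch/eval/recall.py: it assumes binary relevance (an id is either expected or not), MRR computed over the full ranking rather than the top k, and nDCG normalized against the best possible ordering of min(len(expected), k) hits.

```python
import math
from typing import Dict, Iterable, Sequence


def score_query(ranked_ids: Sequence[str], expected: Iterable[str], *, k: int = 5) -> Dict[str, float]:
    """Score one ranked result list against a set of expected ids (illustrative sketch)."""
    relevant = set(expected)
    top_k = list(ranked_ids[:k])
    hits = sum(1 for cid in top_k if cid in relevant)

    # Precision divides by k; recall divides by the number of expected ids.
    p_at_k = hits / k if k > 0 else 0.0
    r_at_k = hits / len(relevant) if relevant else 0.0

    # MRR: reciprocal rank of the first relevant id anywhere in the ranking
    # (assumption: not truncated at k).
    mrr = 0.0
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant:
            mrr = 1.0 / rank
            break

    # nDCG@k with gain 1 per relevant id, discounted by log2(rank + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, cid in enumerate(top_k, start=1)
        if cid in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {"p_at_k": p_at_k, "r_at_k": r_at_k, "mrr": mrr, "ndcg_at_k": ndcg}
```

Under these assumptions, score_query(["a", "b", "c"], ["b"], k=5) gives P@5 = 0.2, R@5 = 1.0, and MRR = 0.5. A query with a single expected id can never exceed P@5 = 0.2, which is worth keeping in mind when reading low macro P@5 values on a small labeled set.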

Why

We had no guardrail against silent retrieval-quality regressions. A labeled set
plus a committed baseline turns "did that embedding/context change make search
worse?" into a CI signal instead of a manual hunch. Metrics are pure Python so
the gate runs in the base .[dev] job with no numpy/embedding extras.

This is intentionally a starter labeled set (~10 queries). Growing it toward
the full ~50 is follow-up work; the harness, fixtures, and gate are complete.

Test plan

Gate — all green (5 skips are the numpy-gated context/embedding tests, expected
in the base .[dev] venv):

$ ruff check src tests
All checks passed!

$ mypy src
Success: no issues found in 31 source files

$ pytest -q
.........................sssss.......................................... [ 72%]
............................                                             [100%]

CLI against the committed fixture (cd eval/fixture-kb && vouch reindex):

$ vouch eval recall ../queries.jsonl --k 5
{
  "k": 5,
  "n_queries": 10,
  "macro": {
    "p_at_k": 0.18,
    "r_at_k": 0.9,
    "mrr": 0.9,
    "ndcg_at_k": 0.9
  },
  "per_query": [ ... ]
}

Baseline gate — passes at/above baseline, fails on regression:

$ vouch eval recall ../queries.jsonl --k 5 --baseline ../baseline.json ; echo exit=$?
P@5 ok: 0.1800 vs baseline 0.1800 (delta +0.0000, tol 0.0500)
exit=0

# inflated baseline (p_at_k=0.5) to simulate a regression
$ vouch eval recall ../queries.jsonl --k 5 --baseline hi.json ; echo exit=$?
P@5 regression: 0.1800 < baseline 0.5000 - tol 0.0500 = 0.4500 (delta -0.3200)
Error: P@5 regression: 0.1800 < baseline 0.5000 - tol 0.0500 = 0.4500 (delta -0.3200)
exit=1
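
The gate logic behind the two runs above can be sketched as follows. This is a reconstruction from the PR description and the printed messages, not the shipped compare_baseline; the message formats are copied from the output shown, and the report and baseline are assumed to be dicts with a macro.p_at_k field.

```python
from typing import Tuple


def compare_baseline(report: dict, baseline: dict, *, max_regression: float = 0.05) -> Tuple[bool, str]:
    """Return (ok, message); not ok when macro P@k falls below baseline - max_regression."""
    k = report.get("k", 5)
    current = report["macro"]["p_at_k"]
    base = baseline["macro"]["p_at_k"]
    floor = base - max_regression  # absolute tolerance, not a relative percentage
    delta = current - base

    if current < floor:
        message = (
            f"P@{k} regression: {current:.4f} < baseline {base:.4f} "
            f"- tol {max_regression:.4f} = {floor:.4f} (delta {delta:+.4f})"
        )
        return False, message

    message = (
        f"P@{k} ok: {current:.4f} vs baseline {base:.4f} "
        f"(delta {delta:+.4f}, tol {max_regression:.4f})"
    )
    return True, message
```

Because the tolerance is subtracted rather than multiplied, a baseline of 0.18 with the default tolerance only fails once macro P@5 drops below 0.13.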

Summary by CodeRabbit

New Features
- Added vouch eval recall command to measure retrieval quality using precision@k, recall@k, mean reciprocal rank, and normalized discounted cumulative gain (nDCG) metrics.
- Baseline comparison and regression detection with configurable tolerance to prevent retrieval performance degradation.
- Includes a starter labeled query set and reproducible fixture knowledge base for evaluation.
Tests
- Comprehensive test coverage for retrieval evaluation metrics and baseline comparison logic.
Chores
- Added CI workflow to automatically gate retrieval changes against baseline performance.

…ouchdev#226) Add vouch eval recall: score kb.context retrieval against a labeled query set (P@k, R@k, MRR, nDCG), compare against a committed baseline, and fail CI on a >5% P@5 regression. Pure-Python metrics; deterministic report; starter labeled set + fixture KB + eval workflow.

coderabbitai · 2026-06-17T03:35:52Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8e70609a-24c8-452e-8ee2-31cb7ad24f4e


Walkthrough

Adds a retrieval recall evaluation harness: a new src/vouch/eval/recall.py module computing P@k, R@k, MRR, and nDCG@k against labeled JSONL queries, a vouch eval recall CLI command with baseline comparison, a fixture knowledge base with 10 claim YAML files and labeled queries, a committed eval/baseline.json, and a CI workflow that fails on P@5 regressions exceeding 5%.

Changes

Recall Evaluation Harness

Layer / File(s)	Summary
Core recall evaluation module `src/vouch/eval/recall.py`, `src/vouch/eval/__init__.py`	`recall.py` implements `load_queries` (JSONL loading with `expected_ids` alias), `score_query` (P@k, R@k, MRR, nDCG@k), `_macro` (macro-averaging), `run_recall` (invokes `build_context_pack` per query and assembles a report), and `compare_baseline` (P@k regression check with tolerance). `__init__.py` re-exports all four functions via `__all__`.
CLI `eval recall` command `src/vouch/cli.py`	Adds `eval_recall` as a Click command under `eval_group`. Runs `run_recall`, prints the JSON report, and optionally loads a baseline JSON to run `compare_baseline`, printing results to stderr and raising `click.ClickException` on failure.
Fixture KB, labeled queries, and baseline `eval/fixture-kb/.vouch/claims/*.yaml`, `eval/fixture-kb/.vouch/config.yaml`, `eval/fixture-kb/.vouch/sources/...`, `eval/fixture-kb/.vouch/.gitignore`, `eval/queries.jsonl`, `eval/baseline.json`	Adds a reproducible fixture knowledge base with 10 observation claims (auth-jwt, auth-session, db-postgres, db-migrations, cache-redis, search-fts5, search-embeddings, deploy-ci, deploy-docker, api-rate-limit), a source content/meta file, and Vouch config. Adds 10 labeled JSONL queries and a committed baseline JSON with macro and per-query metrics for the CI gate.
Tests, CI workflow, and changelog `tests/test_eval_recall.py`, `.github/workflows/eval.yml`, `CHANGELOG.md`	Unit tests cover hand-computed metric correctness, no-hit edge cases, MRR full-ranking behavior, JSONL key aliasing, deterministic `run_recall` output, and all three `compare_baseline` outcomes. The `eval` workflow triggers on PRs touching `src/vouch/embeddings/`, `src/vouch/context.py`, `src/vouch/eval/`, or `eval/**`, rebuilds the fixture KB index, and runs `vouch eval recall --k 5 --max-regression 0.05`.
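
As an illustration of the JSONL loading with the expected_ids alias described in the table above, a loader could look like the sketch below. The exact validation and return shape in recall.py are not shown in this PR page, so the error message and the normalized {"query", "expected"} dicts here are assumptions.

```python
import json
from pathlib import Path
from typing import Dict, List, Union


def load_queries(path: Union[str, Path]) -> List[Dict[str, object]]:
    """Read one JSON object per line; accept 'expected_ids' as an alias for 'expected'."""
    queries: List[Dict[str, object]] = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            row = json.loads(line)
            expected = row.get("expected", row.get("expected_ids"))
            if "query" not in row or expected is None:
                # Hypothetical error format, chosen for this sketch only.
                raise ValueError(f"{path}:{lineno}: need 'query' and 'expected'")
            queries.append({"query": row["query"], "expected": list(expected)})
    return queries
```

Normalizing the alias at load time keeps the scoring code free of key-name branching.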

Sequence Diagram(s)

sequenceDiagram
  participant CI as GitHub Actions eval workflow
  participant CLI as vouch eval recall
  participant RunRecall as run_recall
  participant KBStore as KBStore / build_context_pack
  participant Compare as compare_baseline

  CI->>CLI: python -m vouch.cli eval recall queries.jsonl --baseline baseline.json --k 5 --max-regression 0.05
  CLI->>RunRecall: run_recall(store, queries_path, k=5)
  loop 10 labeled queries
    RunRecall->>KBStore: build_context_pack(store, query=q, limit=10)
    KBStore-->>RunRecall: ranked claim ids
    RunRecall->>RunRecall: score_query(ranked_ids, expected, k=5)
  end
  RunRecall-->>CLI: report {k, n_queries, macro, per_query}
  CLI->>Compare: compare_baseline(report, baseline, max_regression=0.05)
  Compare-->>CLI: (ok, message)
  alt ok is False
    CLI-->>CI: ClickException → non-zero exit
  else ok is True
    CLI-->>CI: exit 0
  end
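
The loop in the diagram above can be sketched without a real KB by passing the ranker and the scorer in as callables. Both rank_fn and score_fn are hypothetical stand-ins (for build_context_pack and score_query respectively), and run_recall_sketch is a name invented for this example; only the report shape {k, n_queries, macro, per_query} and the limit of max(k, 10) come from the PR description.

```python
from statistics import fmean
from typing import Callable, Dict, List, Sequence

METRICS = ("p_at_k", "r_at_k", "mrr", "ndcg_at_k")


def macro_average(per_query: List[Dict[str, object]]) -> Dict[str, float]:
    """Unweighted mean of each metric across queries (0.0 for an empty set)."""
    if not per_query:
        return {name: 0.0 for name in METRICS}
    return {name: fmean(float(row[name]) for row in per_query) for name in METRICS}


def run_recall_sketch(
    rank_fn: Callable[[str, int], Sequence[str]],
    queries: List[Dict[str, object]],
    score_fn: Callable[..., Dict[str, float]],
    *,
    k: int = 5,
) -> Dict[str, object]:
    """Rank each labeled query, score it, and assemble a report in input order."""
    per_query: List[Dict[str, object]] = []
    for q in queries:
        # Fetch at least 10 ids so rank-based metrics can see past the top k.
        ranked_ids = rank_fn(str(q["query"]), max(k, 10))
        scores = score_fn(ranked_ids, q["expected"], k=k)
        per_query.append({"query": q["query"], **scores})
    return {
        "k": k,
        "n_queries": len(per_query),
        "macro": macro_average(per_query),
        "per_query": per_query,
    }
```

Keeping per_query in input order and avoiding any timestamps or random state is what makes a report like this reproducible enough to diff against a committed baseline.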

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

vouchdev/vouch#44: Both PRs add subcommands to the eval CLI group in src/vouch/cli.py; the retrieved PR adds vouch eval embedding while this one adds vouch eval recall.

Suggested reviewers

plind-junior

🐇 A rabbit hops through ranked lists with glee,
P@5 and nDCG computed with care,
The fixture KB blossoms like clover so free,
Baselines committed so regressions don't dare!
CI shall guard every embed and query — 🎯
No silent decline shall pass without flair!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(eval): recall-quality eval harness with CI baseline gate' directly and accurately summarizes the main feature added—a recall-quality evaluation harness with CI baseline gating to detect retrieval regressions.
Linked Issues check	✅ Passed	The PR fully addresses all primary objectives from issue `#226`: implements `vouch eval recall` command with P@k/R@k/MRR/nDCG metrics, creates committed baseline with 5% regression tolerance, provides deterministic JSON reports, uses fixture KB for reproducibility, and requires only base dependencies without numpy.
Out of Scope Changes check	✅ Passed	All changes are in-scope: eval harness implementation, CLI integration, fixture KB/queries, CI workflow, and comprehensive tests directly address issue `#226` objectives; no unrelated code or feature creep detected.


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

.github/workflows/eval.yml (1)
19-24: 💤 Low value

Consider pinning GitHub Actions to commit SHAs for supply-chain security.

The workflow currently references actions by tag (e.g., @v4, @v5). For enhanced security, consider pinning to specific commit SHAs to prevent tag-rewriting attacks.

Additionally, consider adding persist-credentials: false to the checkout action to prevent credential leakage through build artifacts.
🔒 Optional security hardening
       - uses: actions/checkout@v4
+        with:
+          persist-credentials: false
 
       - uses: actions/setup-python@v5
For SHA pinning, you can use a tool like Dependabot to manage action versions with SHA references.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/eval.yml around lines 19 - 24, Replace the tag-based
references in the GitHub Actions workflow with specific commit SHAs to prevent
tag-rewriting attacks. Update actions/checkout@v4 and actions/setup-python@v5 to
pin to their respective commit SHAs (you can find these on the GitHub Actions
marketplace pages). Additionally, add persist-credentials: false to the
actions/checkout step to prevent credential leakage through build artifacts.
src/vouch/cli.py (1)
672-672: ⚡ Quick win

Consider adding validation for k parameter.

The --k option accepts any integer, including zero or negative values. run_recall only clamps the retrieval limit with max(k, 10); it does not validate k itself, and a zero or negative k is semantically invalid for precision/recall metrics.
🛡️ Proposed validation
-@click.option("--k", default=5, show_default=True, type=int)
+@click.option("--k", default=5, show_default=True, type=click.IntRange(min=1))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/cli.py` at line 672, The `--k` option in the click decorator
accepts any integer value, including zero or negative numbers, which are
semantically invalid for precision/recall metrics. Add a click callback
validator to the `@click.option("--k")` decorator that ensures the value is a
positive integer (greater than zero). The validator should raise a
click.BadParameter exception if the user provides a zero or negative value,
providing a clear error message about the requirement.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vouch/cli.py`:
- Around line 687-689: The message is being output to stderr twice due to
click.echo printing it unconditionally on Line 687, and then ClickException
printing it again with an "Error: " prefix when raised on Line 689. Move the
click.echo call inside a conditional block so it only executes when the check
passes (when ok is True), and remove it from before the ClickException block.
This way the success message prints to stderr when ok is True, and only the
ClickException prints the error message to stderr when ok is False, eliminating
the duplicate output.
- Line 685: The baseline file loading at the line with
json.loads(Path(baseline).read_text(encoding="utf-8")) lacks error handling for
file reading or JSON decoding errors, which will result in Python tracebacks
instead of clean error messages. Wrap this baseline loading operation in the
_cli_errors() context manager (similar to how the run_recall call is handled) so
that both JSONDecodeError and file read errors are caught and handled cleanly,
since JSONDecodeError is a subclass of ValueError which is already handled by
the _cli_errors() context.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 37830106-c2da-4678-bef5-eb32fa62bfed

📥 Commits

Reviewing files that changed from the base of the PR and between 3beb821 and 54e3b22.

📒 Files selected for processing (22)

.github/workflows/eval.yml
CHANGELOG.md
eval/baseline.json
eval/fixture-kb/.vouch/.gitignore
eval/fixture-kb/.vouch/claims/api-rate-limit.yaml
eval/fixture-kb/.vouch/claims/auth-jwt.yaml
eval/fixture-kb/.vouch/claims/auth-session.yaml
eval/fixture-kb/.vouch/claims/cache-redis.yaml
eval/fixture-kb/.vouch/claims/db-migrations.yaml
eval/fixture-kb/.vouch/claims/db-postgres.yaml
eval/fixture-kb/.vouch/claims/deploy-ci.yaml
eval/fixture-kb/.vouch/claims/deploy-docker.yaml
eval/fixture-kb/.vouch/claims/search-embeddings.yaml
eval/fixture-kb/.vouch/claims/search-fts5.yaml
eval/fixture-kb/.vouch/config.yaml
eval/fixture-kb/.vouch/sources/1410e7845543213186bddf486561536087d01cbefcf7f0c35d0fe6e7b008113e/content
eval/fixture-kb/.vouch/sources/1410e7845543213186bddf486561536087d01cbefcf7f0c35d0fe6e7b008113e/meta.yaml
eval/queries.jsonl
src/vouch/cli.py
src/vouch/eval/__init__.py
src/vouch/eval/recall.py
tests/test_eval_recall.py

coderabbitai · 2026-06-17T03:42:45Z

+        report = run_recall(store, queries, k=k)
+    click.echo(json.dumps(report, indent=2))
+    if baseline is not None:
+        base = json.loads(Path(baseline).read_text(encoding="utf-8"))


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add error handling for baseline JSON loading.

The baseline file reading and JSON decoding is not wrapped in error handling, unlike the run_recall call. A malformed JSON file or read error will produce a Python traceback instead of a clean error message.

🔧 Proposed fix

Wrap the baseline loading in a try/except or extend the _cli_errors() context:

     store = _load_store()
     with _cli_errors():
         report = run_recall(store, queries, k=k)
-    click.echo(json.dumps(report, indent=2))
-    if baseline is not None:
-        base = json.loads(Path(baseline).read_text(encoding="utf-8"))
+        click.echo(json.dumps(report, indent=2))
+        if baseline is not None:
+            base = json.loads(Path(baseline).read_text(encoding="utf-8"))
-        ok, message = compare_baseline(report, base, max_regression=max_regression)
-        click.echo(message, err=True)
-        if not ok:
-            raise click.ClickException(message)
+            ok, message = compare_baseline(report, base, max_regression=max_regression)
+            click.echo(message, err=True)
+            if not ok:
+                raise click.ClickException(message)

This also catches JSONDecodeError as a ValueError subclass already handled by _cli_errors().
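
A standalone illustration of the same idea, using only the standard library: the loader below converts read and decode failures into one clean error type instead of a traceback. BaselineError and load_baseline are hypothetical names for this sketch; the project's own _cli_errors() context manager is not reproduced here.

```python
import json
from pathlib import Path
from typing import Union


class BaselineError(Exception):
    """Hypothetical error type carrying a user-facing message."""


def load_baseline(path: Union[str, Path]) -> dict:
    """Load a baseline JSON file, mapping I/O and parse failures to BaselineError."""
    try:
        text = Path(path).read_text(encoding="utf-8")
        data = json.loads(text)
    except OSError as exc:
        # Missing file, permission denied, path is a directory, and so on.
        raise BaselineError(f"cannot read baseline {path}: {exc}") from exc
    except ValueError as exc:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        raise BaselineError(f"baseline {path} is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise BaselineError(f"baseline {path} must be a JSON object")
    return data
```

In a Click command, a caller could catch BaselineError and re-raise it as click.ClickException so the user sees a one-line "Error: ..." message.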

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/cli.py` at line 685, The baseline file loading at the line with json.loads(Path(baseline).read_text(encoding="utf-8")) lacks error handling for file reading or JSON decoding errors, which will result in Python tracebacks instead of clean error messages. Wrap this baseline loading operation in the _cli_errors() context manager (similar to how the run_recall call is handled) so that both JSONDecodeError and file read errors are caught and handled cleanly, since JSONDecodeError is a subclass of ValueError which is already handled by the _cli_errors() context.

coderabbitai · 2026-06-17T03:42:45Z

+        click.echo(message, err=True)
+        if not ok:
+            raise click.ClickException(message)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate message output.

The comparison message is printed to stderr on Line 687, then raised again in the ClickException on Line 689. Click will print the exception message to stderr prefixed with "Error: ", causing users to see the message twice.

🔧 Proposed fix

Remove the redundant click.echo before raising the exception:

         base = json.loads(Path(baseline).read_text(encoding="utf-8"))
         ok, message = compare_baseline(report, base, max_regression=max_regression)
-        click.echo(message, err=True)
         if not ok:
             raise click.ClickException(message)
+        else:
+            click.echo(message, err=True)

This prints the success message only when the check passes, and lets Click handle the failure message cleanly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        click.echo(message, err=True)
-        if not ok:
-            raise click.ClickException(message)
+        if not ok:
+            raise click.ClickException(message)
+        else:
+            click.echo(message, err=True)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/cli.py` around lines 687 - 689, The message is being output to stderr twice due to click.echo printing it unconditionally on Line 687, and then ClickException printing it again with an "Error: " prefix when raised on Line 689. Move the click.echo call inside a conditional block so it only executes when the check passes (when ok is True), and remove it from before the ClickException block. This way the success message prints to stderr when ok is True, and only the ClickException prints the error message to stderr when ok is False, eliminating the duplicate output.

coderabbitai Bot reviewed Jun 17, 2026


plind-junior changed the base branch from main to test June 17, 2026 04:40

Merge branch 'test' into feat/226-recall-eval

383d0e5

plind-junior merged commit 0a083c6 into vouchdev:test Jun 17, 2026
6 checks passed

plind-junior merged 2 commits into vouchdev:test from dripsmvcp:feat/226-recall-eval
