
feat(eval): recall-quality eval harness with CI baseline gate #241

Merged
plind-junior merged 2 commits into vouchdev:test from dripsmvcp:feat/226-recall-eval
Jun 17, 2026
@dripsmvcp dripsmvcp commented Jun 17, 2026

Contributor

feat(eval): recall-quality eval harness with CI baseline gate

Closes #226

What

Adds a retrieval-quality eval that scores the current kb.context retrieval
(build_context_pack) against a hand-labeled query set and gates regressions
in CI.

  • src/vouch/eval/recall.py — pure-Python metrics, no numpy:
    • load_queries(path) reads a JSONL labeled set ({"query", "expected"};
      expected_ids accepted as an alias).
    • score_query(ranked_ids, expected, *, k=5) returns
      {p_at_k, r_at_k, mrr, ndcg_at_k} (math.log2 for nDCG).
    • run_recall(store, queries_path, *, k=5) ranks each query via
      build_context_pack(..., limit=max(k, 10))["items"], scores it, and returns
      a deterministic {k, n_queries, macro, per_query} report.
    • compare_baseline(report, baseline, *, max_regression=0.05) returns
      (ok, message); not-ok when macro p_at_k drops below
      baseline.macro.p_at_k - max_regression.
  • CLI: vouch eval recall <queries.jsonl> [--k N] [--baseline b.json] [--max-regression F] (added to the existing eval group). Prints the JSON
    report; with --baseline, prints a P@k diff and exits non-zero
    (click.ClickException) on a regression beyond tolerance.
  • Repo fixtures: eval/queries.jsonl (a 10-query hand-labeled starter
    set, not the full 50) and eval/baseline.json, both reproducible against a
    small committed fixture KB under eval/fixture-kb/.vouch/ (10 claims + 1
    source). state.db is rebuilt at eval time (it stays gitignored as a derived
    index).
  • CI: .github/workflows/eval.yml runs on PRs touching
    src/vouch/embeddings/**, src/vouch/context.py, src/vouch/eval/**, or
    eval/**; installs .[dev], rebuilds the fixture index, and fails on a P@5
    regression of more than 0.05 (the 5% tolerance, applied as an absolute
    drop) vs eval/baseline.json.
  • Tests: tests/test_eval_recall.py covers hand-computed metric math,
    deterministic run_recall over a temp KB, and the baseline gate
    (regression / within-tolerance / improvement).
  • CHANGELOG: ## [Unreleased] → ### Added bullet.
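
As a rough illustration of the four metrics listed for score_query, here is a minimal pure-Python sketch. It is not the code from src/vouch/eval/recall.py. It assumes binary relevance, and it assumes the reciprocal rank is searched over the full ranking rather than only the top k; only the function name, the signature, and the returned keys come from the description above.

```python
import math
from typing import Iterable, Sequence


def score_query(ranked_ids: Sequence[str], expected: Iterable[str], *, k: int = 5) -> dict:
    """Score one ranked result list against its labeled relevant ids."""
    relevant = set(expected)
    top_k = list(ranked_ids[:k])
    hits = sum(1 for cid in top_k if cid in relevant)

    # Precision divides by k; recall divides by the number of labeled ids.
    p_at_k = hits / k if k > 0 else 0.0
    r_at_k = hits / len(relevant) if relevant else 0.0

    # Reciprocal rank of the first relevant id, searched over the full ranking
    # (an assumption; truncating to the top k is an equally common convention).
    mrr = 0.0
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant:
            mrr = 1.0 / rank
            break

    # Binary-relevance nDCG: DCG over the top k divided by the ideal DCG.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, cid in enumerate(top_k, start=1)
        if cid in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg_at_k = dcg / idcg if idcg > 0 else 0.0

    return {"p_at_k": p_at_k, "r_at_k": r_at_k, "mrr": mrr, "ndcg_at_k": ndcg_at_k}
```

Under these assumptions, a query with a single labeled id that is retrieved at rank 1 scores P@5 = 0.2 and 1.0 on the other three metrics, because precision still divides by k.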

Why

We had no guardrail against silent retrieval-quality regressions. A labeled set
plus a committed baseline turns "did that embedding/context change make search
worse?" into a CI signal instead of a manual hunch. Metrics are pure Python so
the gate runs in the base .[dev] job with no numpy/embedding extras.

This is intentionally a starter labeled set (~10 queries). Growing it toward
the full ~50 is follow-up work; the harness, fixtures, and gate are complete.

Test plan

Gate — all green (5 skips are the numpy-gated context/embedding tests, expected
in the base .[dev] venv):

$ ruff check src tests
All checks passed!

$ mypy src
Success: no issues found in 31 source files

$ pytest -q
.........................sssss.......................................... [ 72%]
............................                                             [100%]

CLI against the committed fixture (cd eval/fixture-kb && vouch reindex):

$ vouch eval recall ../queries.jsonl --k 5
{
  "k": 5,
  "n_queries": 10,
  "macro": {
    "p_at_k": 0.18,
    "r_at_k": 0.9,
    "mrr": 0.9,
    "ndcg_at_k": 0.9
  },
  "per_query": [ ... ]
}
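
The macro block in the report above is an average over per_query. The sketch below shows one way such macro-averaging could work. The per-query scores are hypothetical, since the real per_query list is elided: nine queries with a single labeled id found at rank 1 and one complete miss is simply one configuration that reproduces the reported numbers. The helper name macro_average and the rounding to four decimals are also assumptions.

```python
from statistics import fmean

METRICS = ("p_at_k", "r_at_k", "mrr", "ndcg_at_k")


def macro_average(per_query: list) -> dict:
    """Unweighted mean of each metric across queries (0.0 for an empty set)."""
    if not per_query:
        return {m: 0.0 for m in METRICS}
    # Rounding keeps the report stable against float noise (an assumption).
    return {m: round(fmean(q[m] for q in per_query), 4) for m in METRICS}


# Hypothetical per-query scores: nine hits at rank 1 with one labeled id each,
# and one query whose labeled id was not retrieved at all.
hit = {"p_at_k": 0.2, "r_at_k": 1.0, "mrr": 1.0, "ndcg_at_k": 1.0}
miss = {"p_at_k": 0.0, "r_at_k": 0.0, "mrr": 0.0, "ndcg_at_k": 0.0}
macro = macro_average([hit] * 9 + [miss])
print(macro)
```

This also explains why P@5 can sit at 0.18 while recall is 0.9: with one labeled id per query, P@5 cannot exceed 0.2 for any single query.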

Baseline gate — passes at/above baseline, fails on regression:

$ vouch eval recall ../queries.jsonl --k 5 --baseline ../baseline.json ; echo exit=$?
P@5 ok: 0.1800 vs baseline 0.1800 (delta +0.0000, tol 0.0500)
exit=0

# inflated baseline (p_at_k=0.5) to simulate a regression
$ vouch eval recall ../queries.jsonl --k 5 --baseline hi.json ; echo exit=$?
P@5 regression: 0.1800 < baseline 0.5000 - tol 0.0500 = 0.4500 (delta -0.3200)
Error: P@5 regression: 0.1800 < baseline 0.5000 - tol 0.0500 = 0.4500 (delta -0.3200)
exit=1
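
The gate logic behind these two runs can be sketched as follows. This is a reconstruction from the PR description and the messages printed above, not the merged code; the message formats are copied from the output shown, and everything else is an assumption.

```python
def compare_baseline(report: dict, baseline: dict, *, max_regression: float = 0.05):
    """Return (ok, message); not ok when macro P@k falls below baseline - tolerance."""
    k = report.get("k", 5)
    current = float(report["macro"]["p_at_k"])
    base = float(baseline["macro"]["p_at_k"])
    floor = base - max_regression  # lowest acceptable P@k
    delta = current - base
    if current < floor:
        return False, (
            f"P@{k} regression: {current:.4f} < baseline {base:.4f} "
            f"- tol {max_regression:.4f} = {floor:.4f} (delta {delta:+.4f})"
        )
    return True, (
        f"P@{k} ok: {current:.4f} vs baseline {base:.4f} "
        f"(delta {delta:+.4f}, tol {max_regression:.4f})"
    )


report = {"k": 5, "macro": {"p_at_k": 0.18}}
print(compare_baseline(report, {"macro": {"p_at_k": 0.18}}))
print(compare_baseline(report, {"macro": {"p_at_k": 0.5}}))
```

Note that the tolerance is subtracted from the baseline as an absolute value, so with a baseline of 0.18 the gate only fails once P@5 drops below 0.13.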

Summary by CodeRabbit

  • New Features

    • Added vouch eval recall command to measure retrieval quality using precision@k, recall@k, mean reciprocal rank, and normalized discounted cumulative gain (nDCG) metrics.
    • Baseline comparison and regression detection with configurable tolerance to prevent retrieval performance degradation.
    • Includes a starter labeled query set and reproducible fixture knowledge base for evaluation.
  • Tests

    • Comprehensive test coverage for retrieval evaluation metrics and baseline comparison logic.
  • Chores

    • Added CI workflow to automatically gate retrieval changes against baseline performance.

…ouchdev#226)

Add vouch eval recall: score kb.context retrieval against a labeled query set (P@k, R@k, MRR, nDCG), compare against a committed baseline, and fail CI on a >5% P@5 regression. Pure-Python metrics; deterministic report; starter labeled set + fixture KB + eval workflow.
@coderabbitai

coderabbitai Bot commented Jun 17, 2026


Review Change Stack

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8e70609a-24c8-452e-8ee2-31cb7ad24f4e

Walkthrough

Adds a retrieval recall evaluation harness: a new src/vouch/eval/recall.py module computing P@k, R@k, MRR, and nDCG@k against labeled JSONL queries, a vouch eval recall CLI command with baseline comparison, a fixture knowledge base with 10 claim YAML files and labeled queries, a committed eval/baseline.json, and a CI workflow that fails on P@5 regressions exceeding 5%.

Changes

Recall Evaluation Harness

| Layer | File(s) | Summary |
| --- | --- | --- |
| Core recall evaluation module | src/vouch/eval/recall.py, src/vouch/eval/__init__.py | recall.py implements load_queries (JSONL loading with expected_ids alias), score_query (P@k, R@k, MRR, nDCG@k), _macro (macro-averaging), run_recall (invokes build_context_pack per query and assembles a report), and compare_baseline (P@k regression check with tolerance). __init__.py re-exports all four functions via __all__. |
| CLI eval recall command | src/vouch/cli.py | Adds eval_recall as a Click command under eval_group. Runs run_recall, prints the JSON report, and optionally loads a baseline JSON to run compare_baseline, printing results to stderr and raising click.ClickException on failure. |
| Fixture KB, labeled queries, and baseline | eval/fixture-kb/.vouch/claims/*.yaml, eval/fixture-kb/.vouch/config.yaml, eval/fixture-kb/.vouch/sources/..., eval/fixture-kb/.vouch/.gitignore, eval/queries.jsonl, eval/baseline.json | Adds a reproducible fixture knowledge base with 10 observation claims (auth-jwt, auth-session, db-postgres, db-migrations, cache-redis, search-fts5, search-embeddings, deploy-ci, deploy-docker, api-rate-limit), a source content/meta file, and Vouch config. Adds 10 labeled JSONL queries and a committed baseline JSON with macro and per-query metrics for the CI gate. |
| Tests, CI workflow, and changelog | tests/test_eval_recall.py, .github/workflows/eval.yml, CHANGELOG.md | Unit tests cover hand-computed metric correctness, no-hit edge cases, MRR full-ranking behavior, JSONL key aliasing, deterministic run_recall output, and all three compare_baseline outcomes. The eval workflow triggers on PRs touching src/vouch/embeddings/**, src/vouch/context.py, src/vouch/eval/**, or eval/**, rebuilds the fixture KB index, and runs vouch eval recall --k 5 --max-regression 0.05. |
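
To make the "JSONL loading with expected_ids alias" item in the summary above concrete, here is a small sketch of what such a loader might look like. The function name and the two accepted keys come from the PR description; the error message, the blank-line handling, and the normalized output shape are assumptions.

```python
import json
from pathlib import Path


def load_queries(path):
    """Read a JSONL labeled set: one {"query", "expected"} object per line."""
    queries = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines (an assumption)
            row = json.loads(line)
            # "expected_ids" is accepted as an alias for "expected".
            expected = row.get("expected", row.get("expected_ids"))
            if "query" not in row or expected is None:
                raise ValueError(f"{path}:{lineno}: need 'query' and 'expected'")
            queries.append({"query": row["query"], "expected": list(expected)})
    return queries
```

A line such as {"query": "how are sessions stored", "expected_ids": ["auth-session"]} (a made-up query, using one of the fixture claim names as a hypothetical id) would load the same way as one that uses the "expected" key.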

Sequence Diagram(s)

sequenceDiagram
  participant CI as GitHub Actions eval workflow
  participant CLI as vouch eval recall
  participant RunRecall as run_recall
  participant KBStore as KBStore / build_context_pack
  participant Compare as compare_baseline

  CI->>CLI: python -m vouch.cli eval recall queries.jsonl --baseline baseline.json --k 5 --max-regression 0.05
  CLI->>RunRecall: run_recall(store, queries_path, k=5)
  loop 10 labeled queries
    RunRecall->>KBStore: build_context_pack(store, query=q, limit=10)
    KBStore-->>RunRecall: ranked claim ids
    RunRecall->>RunRecall: score_query(ranked_ids, expected, k=5)
  end
  RunRecall-->>CLI: report {k, n_queries, macro, per_query}
  CLI->>Compare: compare_baseline(report, baseline, max_regression=0.05)
  Compare-->>CLI: (ok, message)
  alt ok is False
    CLI-->>CI: ClickException → non-zero exit
  else ok is True
    CLI-->>CI: exit 0
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • vouchdev/vouch#44: Both PRs add subcommands to the eval CLI group in src/vouch/cli.py; the retrieved PR adds vouch eval embedding while this one adds vouch eval recall.

Suggested reviewers

  • plind-junior

🐇 A rabbit hops through ranked lists with glee,
P@5 and nDCG computed with care,
The fixture KB blossoms like clover so free,
Baselines committed so regressions don't dare!
CI shall guard every embed and query — 🎯
No silent decline shall pass without flair!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(eval): recall-quality eval harness with CI baseline gate' directly and accurately summarizes the main feature added: a recall-quality evaluation harness with CI baseline gating to detect retrieval regressions. |
| Linked Issues check | ✅ Passed | The PR fully addresses all primary objectives from issue #226: implements vouch eval recall command with P@k/R@k/MRR/nDCG metrics, creates committed baseline with 5% regression tolerance, provides deterministic JSON reports, uses fixture KB for reproducibility, and requires only base dependencies without numpy. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: eval harness implementation, CLI integration, fixture KB/queries, CI workflow, and comprehensive tests directly address issue #226 objectives; no unrelated code or feature creep detected. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
.github/workflows/eval.yml (1)

19-24: 💤 Low value

Consider pinning GitHub Actions to commit SHAs for supply-chain security.

The workflow currently references actions by tag (e.g., @v4, @v5). For enhanced security, consider pinning to specific commit SHAs to prevent tag-rewriting attacks.

Additionally, consider adding persist-credentials: false to the checkout action to prevent credential leakage through build artifacts.

🔒 Optional security hardening
       - uses: actions/checkout@v4
+        with:
+          persist-credentials: false
 
       - uses: actions/setup-python@v5

For SHA pinning, you can use a tool like Dependabot to manage action versions with SHA references.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/eval.yml around lines 19 - 24, Replace the tag-based
references in the GitHub Actions workflow with specific commit SHAs to prevent
tag-rewriting attacks. Update actions/checkout@v4 and actions/setup-python@v5 to
pin to their respective commit SHAs (you can find these on the GitHub Actions
marketplace pages). Additionally, add persist-credentials: false to the
actions/checkout step to prevent credential leakage through build artifacts.
src/vouch/cli.py (1)

672-672: ⚡ Quick win

Consider adding validation for k parameter.

The --k option accepts any integer, including zero or negative values. run_recall's max(k, 10) only keeps the retrieval limit at 10 or more; it does not validate k itself, and a negative or zero k is semantically invalid for precision/recall metrics.

🛡️ Proposed validation
-@click.option("--k", default=5, show_default=True, type=int)
+@click.option("--k", default=5, show_default=True, type=click.IntRange(min=1))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/cli.py` at line 672, The `--k` option in the click decorator
accepts any integer value, including zero or negative numbers, which are
semantically invalid for precision/recall metrics. Add a click callback
validator to the `@click.option("--k")` decorator that ensures the value is a
positive integer (greater than zero). The validator should raise a
click.BadParameter exception if the user provides a zero or negative value,
providing a clear error message about the requirement.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 37830106-c2da-4678-bef5-eb32fa62bfed

📥 Commits

Reviewing files that changed from the base of the PR and between 3beb821 and 54e3b22.

📒 Files selected for processing (22)
  • .github/workflows/eval.yml
  • CHANGELOG.md
  • eval/baseline.json
  • eval/fixture-kb/.vouch/.gitignore
  • eval/fixture-kb/.vouch/claims/api-rate-limit.yaml
  • eval/fixture-kb/.vouch/claims/auth-jwt.yaml
  • eval/fixture-kb/.vouch/claims/auth-session.yaml
  • eval/fixture-kb/.vouch/claims/cache-redis.yaml
  • eval/fixture-kb/.vouch/claims/db-migrations.yaml
  • eval/fixture-kb/.vouch/claims/db-postgres.yaml
  • eval/fixture-kb/.vouch/claims/deploy-ci.yaml
  • eval/fixture-kb/.vouch/claims/deploy-docker.yaml
  • eval/fixture-kb/.vouch/claims/search-embeddings.yaml
  • eval/fixture-kb/.vouch/claims/search-fts5.yaml
  • eval/fixture-kb/.vouch/config.yaml
  • eval/fixture-kb/.vouch/sources/1410e7845543213186bddf486561536087d01cbefcf7f0c35d0fe6e7b008113e/content
  • eval/fixture-kb/.vouch/sources/1410e7845543213186bddf486561536087d01cbefcf7f0c35d0fe6e7b008113e/meta.yaml
  • eval/queries.jsonl
  • src/vouch/cli.py
  • src/vouch/eval/__init__.py
  • src/vouch/eval/recall.py
  • tests/test_eval_recall.py

Comment thread src/vouch/cli.py
        report = run_recall(store, queries, k=k)
    click.echo(json.dumps(report, indent=2))
    if baseline is not None:
        base = json.loads(Path(baseline).read_text(encoding="utf-8"))


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add error handling for baseline JSON loading.

The baseline file reading and JSON decoding is not wrapped in error handling, unlike the run_recall call. A malformed JSON file or read error will produce a Python traceback instead of a clean error message.

🔧 Proposed fix

Wrap the baseline loading in a try/except or extend the _cli_errors() context:

     store = _load_store()
     with _cli_errors():
         report = run_recall(store, queries, k=k)
-    click.echo(json.dumps(report, indent=2))
-    if baseline is not None:
-        base = json.loads(Path(baseline).read_text(encoding="utf-8"))
+        click.echo(json.dumps(report, indent=2))
+        if baseline is not None:
+            base = json.loads(Path(baseline).read_text(encoding="utf-8"))
-        ok, message = compare_baseline(report, base, max_regression=max_regression)
-        click.echo(message, err=True)
-        if not ok:
-            raise click.ClickException(message)
+            ok, message = compare_baseline(report, base, max_regression=max_regression)
+            click.echo(message, err=True)
+            if not ok:
+                raise click.ClickException(message)

This also catches JSONDecodeError as a ValueError subclass already handled by _cli_errors().
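
The point about JSONDecodeError can be checked in isolation. The sketch below uses only the standard library. Here cli_errors is a hypothetical stand-in for the project's _cli_errors() helper, whose real implementation is not shown in this thread; catching OSError alongside ValueError and exiting via SystemExit are assumptions made for illustration.

```python
import json
import sys
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def cli_errors():
    """Hypothetical stand-in for _cli_errors(): turn expected failures into a clean exit."""
    try:
        yield
    except (ValueError, OSError) as exc:
        # json.JSONDecodeError subclasses ValueError; a missing or unreadable
        # file raises an OSError subclass such as FileNotFoundError.
        print(f"Error: {exc}", file=sys.stderr)
        raise SystemExit(1) from exc


def load_baseline(path):
    """Load the baseline JSON, reporting read/parse failures without a traceback."""
    with cli_errors():
        return json.loads(Path(path).read_text(encoding="utf-8"))
```

With a wrapper like this, a malformed or missing baseline file produces a one-line error and a non-zero exit instead of a traceback.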

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/cli.py` at line 685, The baseline file loading at the line with
json.loads(Path(baseline).read_text(encoding="utf-8")) lacks error handling for
file reading or JSON decoding errors, which will result in Python tracebacks
instead of clean error messages. Wrap this baseline loading operation in the
_cli_errors() context manager (similar to how the run_recall call is handled) so
that both JSONDecodeError and file read errors are caught and handled cleanly,
since JSONDecodeError is a subclass of ValueError which is already handled by
the _cli_errors() context.

Comment thread src/vouch/cli.py
Comment on lines +687 to +689
        click.echo(message, err=True)
        if not ok:
            raise click.ClickException(message)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate message output.

The comparison message is printed to stderr on Line 687, then raised again in the ClickException on Line 689. Click will print the exception message to stderr prefixed with "Error: ", causing users to see the message twice.

🔧 Proposed fix

Remove the redundant click.echo before raising the exception:

         base = json.loads(Path(baseline).read_text(encoding="utf-8"))
         ok, message = compare_baseline(report, base, max_regression=max_regression)
-        click.echo(message, err=True)
         if not ok:
             raise click.ClickException(message)
+        else:
+            click.echo(message, err=True)

This prints the success message only when the check passes, and lets Click handle the failure message cleanly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        click.echo(message, err=True)
-        if not ok:
-            raise click.ClickException(message)
+        if not ok:
+            raise click.ClickException(message)
+        else:
+            click.echo(message, err=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/cli.py` around lines 687 - 689, The message is being output to
stderr twice due to click.echo printing it unconditionally on Line 687, and then
ClickException printing it again with an "Error: " prefix when raised on Line
689. Move the click.echo call inside a conditional block so it only executes
when the check passes (when ok is True), and remove it from before the
ClickException block. This way the success message prints to stderr when ok is
True, and only the ClickException prints the error message to stderr when ok is
False, eliminating the duplicate output.

@plind-junior plind-junior changed the base branch from main to test June 17, 2026 04:40
@plind-junior plind-junior merged commit 0a083c6 into vouchdev:test Jun 17, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

feat(eval): recall-quality eval harness with labeled query set + CI baseline

2 participants