test(eval): harder corpus + MRR/P@k discriminating metrics for recall tuning#423

Merged
ohdearquant merged 1 commit into main from v023/eval-corpus on May 25, 2026
@ohdearquant
Owner

Summary

P1 of v023-operationalize. Replaces the flat eval corpus from #408 with a discriminating one. Grid search now produces variation across configs.

Corpus stats (memories_corpus_v2.json)

  • 203 memories across 10 semantic domains + 46 distractors + 8 importance-trap entries
  • 48 queries with ground-truth expected_top_k:
    • 20 synonym (queries with terms NOT in memory content)
    • 20 partial (queries where only some terms match)
    • 8 importance_trap (queries that match low-importance distractors by FTS but should score lower than high-importance true positives)

Discrimination achieved

Grid search across 232 configs:

  • MRR range: 0.0625 → 0.1667 (Δ 0.1042)
  • Combined_score range: 0.0500 → 0.1302 (Δ 0.0802)

Previously (v1 corpus): recall@10 = 0.9333 for all 116 configs, so no config could be ranked above another. On the v2 corpus the metrics do separate configs.
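For reference, the two metrics named in the title can be sketched as below. This is a minimal illustration of the standard definitions of reciprocal rank and precision@k, not the code in grid_search.py; the function names and the `(ranked_ids, relevant)` input shape are assumptions made for the example.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant ID, or 0.0 if none is retrieved."""
    for rank, mem_id in enumerate(ranked_ids, start=1):
        if mem_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for mem_id in ranked_ids[:k] if mem_id in relevant)
    return hits / k


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant) pairs, one per query."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Unlike recall@10, which saturates once every expected memory lands anywhere in the top 10, both of these metrics are sensitive to where in the ranking the expected memories appear.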

Mechanism

The dimension that currently discriminates is min_salience (0.0 vs 0.40). Configs with min_salience=0.0 retrieve importance-trap memories (importance 0.26-0.32) that satisfy unique-term queries; min_salience=0.40 filters them out, dropping combined_score from 0.1302 to 0.0500.
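The effect of the threshold can be sketched as a simple filter. This assumes that salience is compared directly against a per-memory `importance` field; the field name, the function name, and the sample records are hypothetical and only illustrate the behaviour described above.

```python
from typing import Dict, List


def filter_by_salience(candidates: List[Dict], min_salience: float) -> List[Dict]:
    """Keep candidates whose importance meets the threshold (assumed semantics)."""
    return [c for c in candidates if c["importance"] >= min_salience]


# Hypothetical candidates: one importance-trap memory inside the 0.26-0.32
# band mentioned above, and one high-importance memory.
candidates = [
    {"id": "trap-1", "importance": 0.30},
    {"id": "core-7", "importance": 0.85},
]

kept_open = [c["id"] for c in filter_by_salience(candidates, 0.0)]
kept_strict = [c["id"] for c in filter_by_salience(candidates, 0.40)]
```

With the threshold at 0.0 both memories survive; at 0.40 the trap memory is dropped, so any query whose expected hit is that memory can no longer score.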

Limitation: weight triples, decay models, and fusion strategies are still indistinguishable in FTS-only mode, because FTS5 AND-logic returns identical candidate sets regardless of scoring parameters. Separating those dimensions requires vector recall, which in turn requires the dual-embedding work from v022 and #421 (EmbedderRegistry) to be available, with KHIVE_ADDITIONAL_EMBEDDING_MODELS=paraphrase set.
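A rough sketch of why the candidate set is insensitive to scoring parameters. It emulates AND-matching in plain Python as a stand-in for FTS5 and is not the project's retrieval code; the memory texts, the feature tuples, and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple


def and_match(query: str, memories: Dict[str, str]) -> List[str]:
    """Return IDs of memories that contain every query term (AND semantics)."""
    terms = set(query.lower().split())
    return [mid for mid, text in memories.items()
            if terms <= set(text.lower().split())]


def retrieve(query: str, memories: Dict[str, str],
             features: Dict[str, Tuple[float, float, float]],
             weights: Tuple[float, float, float]) -> List[str]:
    """Candidates are fixed by and_match; weights are only applied afterwards."""
    candidates = and_match(query, memories)

    def score(mid: str) -> float:
        return sum(w * f for w, f in zip(weights, features[mid]))

    return sorted(candidates, key=score, reverse=True)


memories = {
    "m1": "rotate api keys quarterly",
    "m2": "api keys live in vault",
}
features = {"m1": (0.9, 0.2, 0.1), "m2": (0.3, 0.8, 0.5)}

set_a = set(retrieve("api keys", memories, features, (1.0, 0.0, 0.0)))
set_b = set(retrieve("api keys", memories, features, (0.0, 1.0, 0.0)))
# A synonym-style query shares no terms with any memory, so nothing is retrieved.
synonym_hits = retrieve("credential renewal", memories, features, (0.0, 0.0, 1.0))
```

In this sketch the weights never influence which memories become candidates, and a query with no overlapping terms returns nothing under any weighting. A vector recall path would add candidates that term matching alone cannot reach.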

Deliverables

  • tests/khive-contract/fixtures/memories_corpus_v2.json (preserves v1 for regression)
  • tests/khive-contract/tune/grid_search.py computes MRR + precision@k (discriminating metrics)
  • tests/khive-contract/tune/REPORT-v2.md documents the discrimination + corpus methodology
  • tests/khive-contract/tests/test_eval_corpus.py enforces corpus schema (every query.expected_top_k references real memory IDs; distractor count documented; query types distributed as claimed)
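The schema checks in the last item can be sketched roughly as below. The JSON layout (top-level `memories` and `queries` lists, with `id`, `type`, and `expected_top_k` fields) is an assumption for illustration; only `expected_top_k` and the three query type names come from this description, and the real test file may be organized differently.

```python
from collections import Counter
from typing import Dict, List, Tuple


def dangling_expected_ids(corpus: Dict) -> List[Tuple[str, str]]:
    """Return (query_id, memory_id) pairs where expected_top_k names an unknown memory."""
    known = {m["id"] for m in corpus["memories"]}
    return [
        (q["id"], mem_id)
        for q in corpus["queries"]
        for mem_id in q["expected_top_k"]
        if mem_id not in known
    ]


def query_type_counts(corpus: Dict) -> Counter:
    """Count queries per type, for comparison against the documented distribution."""
    return Counter(q["type"] for q in corpus["queries"])


# Tiny hypothetical corpus; the second query references a missing memory.
sample = {
    "memories": [{"id": "m1"}, {"id": "m2"}],
    "queries": [
        {"id": "q1", "type": "synonym", "expected_top_k": ["m1"]},
        {"id": "q2", "type": "partial", "expected_top_k": ["m2", "m9"]},
    ],
}
```

For the v2 corpus, a test built on these helpers would assert that the dangling list is empty and that the counts equal 20 synonym, 20 partial, and 8 importance_trap queries.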

Follow-up

To unlock weight/decay/fusion discrimination: run python -m tune --with-embed (TODO) against this corpus with dual-embedding enabled.

Commit

cf75200

Replaces flat 20-query corpus with 203-memory / 48-query v2 corpus
spanning 10 domains with synonym, partial, and importance_trap query
types; adds MRR + precision@k + exclusion_penalty combined_score to
grid_search.py; min_salience grid dimension produces range=0.0802.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ohdearquant ohdearquant merged commit 4e1cff6 into main May 25, 2026
1 of 3 checks passed
@ohdearquant ohdearquant deleted the v023/eval-corpus branch May 25, 2026 18:07