test(eval): harder corpus + MRR/P@k discriminating metrics for recall tuning #423
Merged
Replaces flat 20-query corpus with 203-memory / 48-query v2 corpus spanning 10 domains with synonym, partial, and importance_trap query types; adds MRR + precision@k + exclusion_penalty combined_score to grid_search.py; min_salience grid dimension produces range=0.0802.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
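For readers unfamiliar with the metrics named above, here is a minimal sketch of MRR and precision@k using their standard definitions. It is not the code in grid_search.py: the function names and the toy rankings are hypothetical, and the exclusion_penalty and the way combined_score weighs its parts are not described in this PR text, so they are left out.

```python
from typing import Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant hit, or 0.0 if none is retrieved."""
    for rank, mem_id in enumerate(ranked_ids, start=1):
        if mem_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant memory."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for mem_id in ranked_ids[:k] if mem_id in relevant)
    return hits / k


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (ranking, relevant-set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings from two configs for the same query.
relevant = {"m1"}
config_a = ["m1", "m7", "m9"]  # relevant memory ranked first
config_b = ["m7", "m9", "m1"]  # relevant memory ranked last

# Both rankings contain "m1" in the top 3, so a recall@3 check cannot
# tell them apart; the rank-sensitive metrics can.
rr_a = reciprocal_rank(config_a, relevant)
rr_b = reciprocal_rank(config_b, relevant)
p1_a = precision_at_k(config_a, relevant, 1)
p1_b = precision_at_k(config_b, relevant, 1)
```

Because both toy rankings retrieve the relevant memory somewhere in the top 3, a recall-style metric scores them the same, while reciprocal rank and precision@1 separate them.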
Summary
P1 of v023-operationalize. Replaces the flat eval corpus from #408 with a discriminating one. Grid search now produces variation across configs.
Corpus stats (memories_corpus_v2.json)
expected_top_k:
Discrimination achieved
Grid search across 232 configs:
Previously (v1 corpus): recall@10 = 0.9333 for ALL 116 configs. Now configs DO discriminate.
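As a toy illustration of how a min_salience floor can change which memories a config retrieves, the sketch below AND-matches query terms and then drops low-importance memories. Everything in it is an assumption made for illustration: the Memory class, the candidates helper, the sample texts, and the idea that the floor is compared directly against an importance field. The real retrieval path is not shown in this PR text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Memory:
    mem_id: str
    text: str
    importance: float


def candidates(memories: Sequence[Memory], query_terms: Sequence[str],
               min_salience: float) -> List[str]:
    """AND-match every query term, then drop memories below the salience floor."""
    kept = []
    for m in memories:
        words = set(m.text.lower().split())
        if all(term in words for term in query_terms) and m.importance >= min_salience:
            kept.append(m.mem_id)
    return kept


# Hypothetical corpus: one low-importance memory that is the only match
# for a unique term, and one high-importance memory on a related topic.
corpus = [
    Memory("trap-1", "rotate the staging kubeconfig quarterly", 0.30),
    Memory("hi-1", "rotate api keys every release", 0.80),
]

with_floor_off = candidates(corpus, ["kubeconfig"], min_salience=0.0)
with_floor_on = candidates(corpus, ["kubeconfig"], min_salience=0.40)
```

With the floor at 0.0 the low-importance memory is returned for the unique-term query; with the floor at 0.40 the query returns nothing, so any rank-based score for that query drops to zero.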
Mechanism
The dimension that currently discriminates is min_salience (0.0 vs 0.40). Configs with min_salience=0.0 retrieve importance-trap memories (importance 0.26-0.32) that satisfy unique-term queries; min_salience=0.40 filters them out, dropping combined_score from 0.1302 to 0.0500.
Limitation: weight triples, decay models, and fusion strategies are still indistinguishable in FTS-only mode because FTS5 AND-logic returns identical candidate sets regardless of scoring parameters. Vector recall is required to separate those dimensions, which in turn requires the dual-embedding work landed in v022 + #421 (EmbedderRegistry) to be available, with KHIVE_ADDITIONAL_EMBEDDING_MODELS=paraphrase set.
Deliverables
- tests/khive-contract/fixtures/memories_corpus_v2.json (preserves v1 for regression)
- tests/khive-contract/tune/grid_search.py computes MRR + precision@k (discriminating metrics)
- tests/khive-contract/tune/REPORT-v2.md documents the discrimination + corpus methodology
- tests/khive-contract/tests/test_eval_corpus.py enforces corpus schema (every query.expected_top_k references real memory IDs; distractor count documented; query types distributed as claimed)
Follow-up
To unlock weight/decay/fusion discrimination: run python -m tune --with-embed (TODO) against this corpus with dual-embedding enabled.
Commit
cf75200