tune(recall): grid search infra + PARTIAL default changes by ohdearquant · Pull Request #408 · ohdearquant/khive

ohdearquant · 2026-05-25T15:24:34Z

Summary

Slice 8/9 from PR #399 (v022-polish show). Stacked on slice 7 (dual-embedding).

Parameter tuning infrastructure for recall defaults: grid search runner, synthetic eval corpus, config nudges from grid results.

What is added

tests/khive-contract/tune/grid_search.py — grid runner over recall knob space (116 configs in 0.75s)
tests/khive-contract/tune/__main__.py — CLI entry
tests/khive-contract/fixtures/memories_corpus.json — synthetic eval corpus
tests/khive-contract/tune/results.json — sweep results
tests/khive-contract/tune/tuned-config.toml — recommended config
tests/khive-contract/tune/REPORT.md — analysis writeup
3 config nudges applied to defaults based on grid results

PARTIAL status

The play landed with grid search infra working but the synthetic eval set has a corpus ceiling — recall@10 = 0.93 for ALL configs. Cannot ground defaults with this corpus. Follow-up: harder eval corpus (embed-enabled, synonyms, partial matches).

This PR ships the infrastructure so it can be re-run when a better corpus exists.

Dependencies

Depends on #407 (slice 7: dual-embedding). Base branch is slice/v022/07-dual-embedding.

Test plan

grid runner completes on the fixture
CI green

Route pack-memory's fuse_candidates through khive_retrieval::fuse_search_results, making khive-retrieval a real consumed facade instead of an orphan crate. - Add khive-retrieval dep to khive-pack-memory/Cargo.toml - Replace direct fuse_with_strategy call with retrieval adapter (CandidateMeta side-map, HybridConfig builder, FusionStrategy conversion) - Fix issue #309: resolve --all-features compile failures in khive-retrieval (stale SqliteStore imports, missing NodeId/LinkStore imports) - Add 5 integration tests (3 fusion_surface, 2 pack-memory recall adapter) - RRF k=1 discriminator test proves strategy propagation (30x score gap) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…recall (ADR-033 §6) - Add three optional per-request fields to RecallParams: top_k (usize), fusion_strategy (string), and score_floor (f32) - fusion_strategy validated against {"rrf","weighted","union"}; clear error with valid values on invalid input - top_k overrides the result limit for a single call (capped at 100) - score_floor applied as a post-filter on the composite score after compute_score - Add parse_fusion_strategy_str helper; wire override into cfg.fuse_strategy before passing to fuse_candidates - Add 4 integration tests: default_identity, top_k_override, fusion_strategy_override (including rejection), score_floor - Document knobs in ADR-033 §6.1 with table, semantics, and example DSL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Multi-model embedding support landed across the runtime + storage + memory stack. Workspace dual-embedding now reachable end-to-end: khive-runtime: - RuntimeConfig.additional_embedding_models: Vec<EmbeddingModel> - Replaces single OnceCell<embedder> with HashMap<model_name, embedder> - default_embedder_name() + embedder(name) public methods - KHIVE_ADDITIONAL_EMBEDDING_MODELS env-var parsing - configured_embedding_models() helper enumerates active set khive-db: - V16 migration: add `embedding_model TEXT NOT NULL DEFAULT '<default>'` column to vectors table with backfill + composite index - VectorStore.insert / search scoped by embedding_model khive-storage: - VectorRecord carries model tag - vector search params include model scope khive-pack-memory: - recall + remember accept optional embedding_model arg - validation: must be a registered model name kkernel: - engine list now returns real loaded models (no longer empty Vec) - engine migrate / drift-check still return not-implemented (#380/#385) Notes: - 16 files changed, +582/-138 lines - Tests rebaselined for V16 (failed_migration_rolls_back tests V17 now; store_ddl_then_event_migration_is_idempotent expects V16 head) - Workspace: cargo build + cargo test + clippy clean + fmt clean Lattice gap status: N/A — lattice-embed 0.2.4 already exposes both MiniLM + paraphrase as 384-d local models with EmbeddingRoutingConfig primitives. khive-runtime now uses these directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Addresses codex finding on PR #405: `khive_retrieval::hnsw::HnswIndex` doesn't resolve because `hnsw` is not a public submodule — `HnswIndex` is re-exported at the crate root (lib.rs:145). The doctest at persist/mod.rs:29 must use the public facade import. Closes the remaining gap on issue #309 (--all-features doctest failure). Verified: `RUSTC_WRAPPER= cargo test --offline -p khive-retrieval --all-features --doc` passes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ges unjustified Codex review on PR #408 flagged that the three default changes from the tune sweep (temporal_half_life_days 30 to 14, decay exp to hyp, multiplier 20 to 10) were made even though REPORT.md explicitly states the synthetic eval set produced identical recall@10 = 0.9333 for ALL 116 configs — the landscape was too flat to discriminate these parameters. Reverts those three lines back to their prior values. The tuned-config.toml artifact stays as an experimental record; the grid_search.py infra stays runnable; only RecallConfig::default() is restored. A discriminating eval corpus (embed-enabled, synonym queries, partial matches) is the prerequisite before changing runtime defaults. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ohdearquant · 2026-05-25T15:39:02Z

Verdict: REQUEST CHANGES
Findings: 2 Major, 0 Minor

[Major] Revert or quarantine the Rust default changes until the eval can distinguish them

Evidence: crates/khive-pack-memory/src/config.rs:54 says the tuning is PARTIAL and that the eval is "too easy to discriminate params"; crates/khive-pack-memory/src/config.rs:63 changes temporal_half_life_days from 30.0 to 14.0, crates/khive-pack-memory/src/config.rs:64 changes the default decay model to Hyperbolic, and crates/khive-pack-memory/src/config.rs:65 changes candidate_multiplier to 10. tests/khive-contract/tune/REPORT.md:32 says all 116 configs have identical recall@10, and tests/khive-contract/tune/REPORT.md:34 through tests/khive-contract/tune/REPORT.md:36 says candidate_multiplier, decay_model, and temporal_half_life_days have zero measurable effect. The report is explicit at tests/khive-contract/tune/REPORT.md:43 through tests/khive-contract/tune/REPORT.md:45 that the three committed default changes are not empirically distinguished from the old defaults.

Why this matters: the PR is honest in the report, but the unsupported choices are still shipped as runtime defaults via RecallConfig::default(). That makes a synthetic, acknowledged-ceiling eval load-bearing for all downstream users, which is exactly the risk called out in the review focus.

Suggested fix: leave the tuned values in tests/khive-contract/tune/tuned-config.toml as an experimental artifact, but revert RecallConfig::default() to the previous defaults until a harder corpus shows a real win. If the defaults stay changed, add a regression test that exercises the changed recall behavior against a discriminating fixture, not just serde/default-shape tests like crates/khive-pack-memory/src/config.rs:435 through crates/khive-pack-memory/src/config.rs:446.

[Major] The tuning CLI is not re-runnable from this branch

Evidence: tests/khive-contract/tune/grid_search.py:21 imports khive_contract.client.KhiveMcpSession, but git ls-tree -r --name-only HEAD tests/khive-contract shows this PR branch only contains the new fixtures/ and tune/ files under tests/khive-contract; there is no tracked tests/khive-contract/khive_contract/client.py on this branch. Running PYTHONPATH=. python -m tune --quick --output-dir /tmp/khive-tune-review from tests/khive-contract fails with ModuleNotFoundError: No module named 'khive_contract'.

Why this matters: one stated OSS utility goal is that a downstream user can re-run the tuning script with their own corpus. As committed, the script depends on a Python helper package that is not present in the PR branch, so the new CLI entry point at tests/khive-contract/tune/__main__.py:1 cannot run in a clean checkout.

Suggested fix: either include the contract harness package in this stack/base, or make the tune runner self-contained enough to spawn khive-mcp directly. Also add a short smoke test or documented command that runs python -m tune --quick --output-dir <tmp> from a clean checkout.

Requested Verification

Default changes applied: yes, the three Rust defaults are temporal_half_life_days 30.0 -> 14.0, decay_model Exponential -> Hyperbolic, and candidate_multiplier 20 -> 10 in crates/khive-pack-memory/src/config.rs:63 through crates/khive-pack-memory/src/config.rs:65. They are not justified by results.json; the report says those dimensions had zero measurable effect.
Corpus ceiling: surfaced clearly, not buried. tests/khive-contract/tune/REPORT.md:30 names "Flat Optimization Landscape", and tests/khive-contract/tune/REPORT.md:32 states identical recall@10 = 0.9333 for all configs.
grid_search.py determinism/coverage: grid generation is deterministic loop order plus configs[::10] for quick mode at tests/khive-contract/tune/grid_search.py:212 through tests/khive-contract/tune/grid_search.py:235. It exercises config-object knobs for weights, candidate pools, fusion strategy, decay model, and half-life at tests/khive-contract/tune/grid_search.py:181 through tests/khive-contract/tune/grid_search.py:230, but it does not exercise the per-request top_k and score_floor knobs from ADR-033 §6.
tuned-config.toml startup format: not a startup format as-is. The artifact uses [recall] at tests/khive-contract/tune/tuned-config.toml:7, while the MCP startup path shown in crates/khive-mcp/src/main.rs:73 through crates/khive-mcp/src/main.rs:79 builds RuntimeConfig from CLI/env defaults, and MemoryPack::new initializes from RecallConfig::default() at crates/khive-pack-memory/src/lib.rs:90 through crates/khive-pack-memory/src/lib.rs:95.
Corpus size/representativeness: the corpus has exactly 100 memories and 20 eval queries. It is broad across programming/math/science/history/food/health/transit, but the query set is mostly exact keyword/topic matching; examples start at tests/khive-contract/fixtures/memories_corpus.json:704 through tests/khive-contract/fixtures/memories_corpus.json:803.
Rust vs TOML defaults: the three defaults are changed in Rust source, not only TOML, at crates/khive-pack-memory/src/config.rs:63 through crates/khive-pack-memory/src/config.rs:65.
Regression test for Rust defaults: no discriminating regression test. The touched tests validate default shape/serde and explicit scoring formulas, but do not prove the new defaults improve or preserve recall behavior under a corpus that distinguishes them.
Honesty in code: partially yes. The PARTIAL caveat is present in code at crates/khive-pack-memory/src/config.rs:54, and the report is direct at tests/khive-contract/tune/REPORT.md:30 through tests/khive-contract/tune/REPORT.md:45; the issue is that the caveated values still become runtime defaults.

Looks Right

The report is candid about the eval ceiling and root cause.
The quick grid covers the same main RecallConfig dimensions it claims to sweep, and results.json contains 116 configs.
The memory-pack unit tests pass after disabling sccache.
Focused integration tests for top_k, fusion_strategy, score_floor, and PackTunable::apply_config pass.

Commands Run

git status --short --branch: clean branch slice/v022/08-param-tuning.
git diff --name-status origin/slice/v022/07-dual-embedding..HEAD: inspected all changed files.
jq 'length' tests/khive-contract/tune/results.json: 116 configs.
jq '{memories:(.memories|length), eval_queries:(.eval_queries|length)}' tests/khive-contract/fixtures/memories_corpus.json: 100 memories, 20 eval queries.
RUSTC_WRAPPER= cargo test -p khive-pack-memory --lib --offline: passed, 61 tests.
RUSTC_WRAPPER= cargo test -p khive-pack-memory --offline --test integration test_recall_top_k_override -- --exact: passed.
RUSTC_WRAPPER= cargo test -p khive-pack-memory --offline --test integration test_recall_fusion_strategy_override -- --exact: passed.
RUSTC_WRAPPER= cargo test -p khive-pack-memory --offline --test integration test_recall_score_floor -- --exact: passed.
RUSTC_WRAPPER= cargo test -p khive-pack-memory --offline --test integration test_pack_tunable_apply_config_affects_recall_score -- --exact: passed.
PYTHONPATH=. python -m tune --quick --output-dir /tmp/khive-tune-review: failed, missing khive_contract.

What I Did Not Check

I could not fetch PR metadata or post the review initially because gh pr view could not connect to api.github.com from this environment.
I did not run the full workspace test suite due the 15-minute review budget.

Re-Review Guidance

Narrow re-review is enough: check that runtime defaults are either reverted or backed by a discriminating eval, and that python -m tune --quick works from a clean checkout.

Domain utility: SKIPPED - no lore/domain MCP tools were available in this runtime.

…ger tests Two Medium findings from codex review: 1. top_k cast bug: `(k as u32).min(100)` truncates k before capping. A request with `top_k = 4_294_967_297` (larger than u32::MAX) truncates to 1 BEFORE the cap is applied, so the result limit becomes 1, not 100. Fixed: `u32::try_from(k.min(100)).unwrap_or(100)` clamps to usize first, then narrows safely. 2. Weak tests: test_recall_default_identity only checked length and the top hit. Strengthened to compare full ordered note_id+score lists across all positions, with all three knobs explicitly set to null. test_recall_fusion_strategy_override only validated string acceptance. Added a new unit test (fusion_strategy_change_produces_observable_ordering_difference) with a deterministic fixture where RRF and Weighted strategies MUST produce different orderings — proving the fusion_strategy override actually flows into fuse_candidates, not just validation. Verified: cargo test -p khive-pack-memory --lib passes (62 unit tests). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ta-safety fixes Per Ocean direction (2026-05-25): amend ADR-043 to formalize the V16 string-tag schema design, claim V16 in ADR-015 ledger, and fix the data-safety bugs codex flagged. Defer sqlite-vec data preservation to a follow-up issue. ADR amendments: - ADR-043 §1.1 (vector store column addition): replaces the old FK-based description with the actual V16 design (TEXT embedding_model column with composite index). Includes rationale for TEXT vs BLOB FK (hot-path join cost, end-to-end shape consistency with kkernel/env-vars/registry). Documents sqlite-vec rebuild behavior and follow-up. - ADR-015 schema ledger: V16 row added with cluster-22 amendment notes. High 3 fix — atomic unknown-model validation (handlers.rs): - handle_remember now calls resolve_embedding_model(Some(name)) BEFORE create_note_with_decay_for_embedding_model. resolve_embedding_model is synchronous and doesn't load the model — it only checks registration. An unknown model is rejected before any note/FTS/vector row is written. High 2 fix — scoped delete across all model stores (operations.rs): - delete_note now iterates over registered_embedding_model_names() (new public method on KhiveRuntime) and deletes the note's vector from EVERY registered model's vector store. Previously only the default model's store was touched, leaving non-default vectors orphaned. Medium fix — KHIVE_ADDITIONAL_EMBEDDING_MODELS warning on bad names: - parse_embedding_model_list now logs tracing::warn for non-empty raw names that don't parse, instead of silently filtering them out. The function still returns a Vec (no startup failure on partial validity), but operator typos now surface at startup rather than as UnknownModel errors at request time. Deferred (follow-up issue, see ADR-043 §1.1 final paragraph): - High 1 — V16 backfill hard-codes 'all-minilm-l6-v2' for all regular vec_* tables, and sqlite-vec virtual tables are still dropped-and-rebuilt on schema mismatch (data loss for non-default deployments). A copy-with- default rebuild path is tracked separately because it requires a careful multi-step migration with vec0 INSERT INTO SELECT FROM and a verification step. Operators are warned via ADR §1.1 to back up before upgrading. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Codex Major 2 (PR #408): the tune CLI is not re-runnable from a fresh clone of this branch because khive_contract is in the parent package (slice 06, PR #403 — already merged to main). Codex tested slice 08 in isolation and hit ModuleNotFoundError. Adds tests/khive-contract/tune/README.md explaining: - Install khive_contract first: uv pip install -e . from tests/khive-contract/ - Run via uv run python -m tune --quick (or full) - Documents the corpus-ceiling limitation surfaced in REPORT.md - Explicit note that RecallConfig::default() was intentionally NOT changed (defaults reverted in d943b7f based on codex Major 1) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ohdearquant · 2026-05-25T15:56:49Z

Codex findings addressed

Verdict: REQUEST CHANGES → fixes pushed

Major 1: Unjustified default changes

Fixed by reverting the three RecallConfig::default() changes in commit d943b7f:

temporal_half_life_days: 14.0 → restored to 30.0
decay_model: Hyperbolic → restored to DecayModel::default() (Exponential)
candidate_multiplier: 10 → restored to 20

Codex correctly identified that REPORT.md acknowledges recall@10 = 0.9333 for ALL 116 configs (corpus ceiling), so these changes had no empirical justification. The tuned-config.toml artifact, grid_search.py infra, and corpus all remain — only the load-bearing defaults reverted. A discriminating eval corpus is the prerequisite before changing runtime defaults.

Major 2: Tune CLI not re-runnable

Documented in commit be85904. Added tests/khive-contract/tune/README.md explaining:

Install the harness package first: cd tests/khive-contract && uv pip install -e .
Then run: uv run python -m tune --quick

The khive_contract package is in slice 06 (PR #403 — already merged to main). When this slice merges, the package is in main and the CLI works in a clean checkout. Codex tested slice 08 in isolation and saw the missing module.

Also documents the corpus-ceiling limitation explicitly.

Commits: d943b7f (default revert), be85904 (tune README), a9f5585 (fmt), d973c29 (deno fmt).

Ready for re-review.

…l consistency Two findings from codex round 2 on PR #407: 1. Medium: runtime-level unknown-model atomicity. handle_remember validated embedding_model before calling create_note_with_decay_for_embedding_model (round 1 fix), but the runtime API itself was permissive — direct Rust callers (other packs, integration tests) would still hit the write-after- failure bug. Fix: resolve_embedding_model is now called at the start of create_note_inner BEFORE any note/FTS/vector write. The pack-handler check remains as an earlier error boundary, but atomicity is enforced at the lowest layer that performs the write. 2. High: ADR-043 internal inconsistency. The §1.1 amendment landed in round 1, but the rest of the ADR still contained normative text for the rejected FK-based design: - §8 (Backward compat) described embedding_model_id BLOB column + CHECK rebuild - Alternatives table rejected per-record model_id (which is what V16 ships) - Migration version section claimed V5 + BLOB FK Fix: - §8 rewritten to describe V14 + V16 (registry + tag column) split that actually shipped - Alternatives table row strikes-through with "Superseded by V16 (2026-05-25)" and points back to §1.1 for rationale - Migration version section split into V14 (cluster-20) and V16 (v022-polish) with the actual SQL each migration runs Codex's other round-1 findings remain addressed: - Critical (ADR drift) — now resolved end-to-end across §1.1, §8, alternatives, and §Migration version - High 2 (scoped delete) — confirmed by round-2 review at operations.rs:1395, 1408 - Medium (env var warning) — confirmed at runtime.rs:549, 560 Verified: cargo test -p khive-runtime -p khive-pack-memory --lib passes (262 + 62 tests), cargo fmt clean, deno fmt clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…uning

ohdearquant and others added 5 commits May 25, 2026 11:21

style(adr-033): deno fmt re-pad recall knob table (post-merge cleanup)

1169b41

tune(recall): grid search infra + PARTIAL default changes

092170c

ohdearquant mentioned this pull request May 25, 2026

Show: v022-polish (overnight) — wire retrieval/recall/dual-embed + tests + plugin polish — SLICE BEFORE MERGE #399

Closed

ohdearquant and others added 2 commits May 25, 2026 11:36

ohdearquant and others added 3 commits May 25, 2026 11:48

ohdearquant changed the base branch from slice/v022/07-dual-embedding to main May 25, 2026 15:54

ohdearquant added 2 commits May 25, 2026 11:55

style: cargo fmt --all

a9f5585

style: deno fmt ADR tables

d973c29

ohdearquant and others added 4 commits May 25, 2026 11:57

fix(test): use single-token query matching all fixture memories

0bf1927

fix(test): align partial_config test with reverted default decay_model

c194923

ci: force re-trigger

c3d8caf

ohdearquant marked this pull request as draft May 25, 2026 17:25

ohdearquant marked this pull request as ready for review May 25, 2026 17:25

ohdearquant closed this May 25, 2026

ohdearquant reopened this May 25, 2026

ohdearquant added 3 commits May 25, 2026 13:27

ci: force re-trigger via doc whitespace touch

36cfa61

style: deno fmt

a275972

Merge remote-tracking branch 'origin/main' into slice/v022/08-param-t…

1a80278

…uning

ohdearquant merged commit 2717cd2 into main May 25, 2026
3 checks passed

ohdearquant deleted the slice/v022/08-param-tuning branch May 25, 2026 17:33

ohdearquant mentioned this pull request May 25, 2026

test(eval): harder corpus + MRR/P@k discriminating metrics for recall tuning #423

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tune(recall): grid search infra + PARTIAL default changes#408

tune(recall): grid search infra + PARTIAL default changes#408
ohdearquant merged 19 commits into
mainfrom
slice/v022/08-param-tuning

ohdearquant commented May 25, 2026 •

edited

Loading

Uh oh!

ohdearquant commented May 25, 2026

Uh oh!

ohdearquant commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ohdearquant commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What is added

PARTIAL status

Dependencies

Test plan

Uh oh!

ohdearquant commented May 25, 2026

[Major] Revert or quarantine the Rust default changes until the eval can distinguish them

[Major] The tuning CLI is not re-runnable from this branch

Requested Verification

Looks Right

Commands Run

What I Did Not Check

Re-Review Guidance

Uh oh!

ohdearquant commented May 25, 2026

Codex findings addressed

Major 1: Unjustified default changes

Major 2: Tune CLI not re-runnable

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ohdearquant commented May 25, 2026 •

edited

Loading