fix(health): rebuild_index must write embeddings to embedding_index#173
fix(health): rebuild_index must write embeddings to embedding_index#173seekmistar01 wants to merge 1 commit into
Conversation
`rebuild_index` calls `index_db.reset()` (which clears `embedding_index`, the table semantic search reads via `search_embedding`/`search_semantic`) and then `_rebuild_embeddings` re-encoded every artifact through `index_db.index_embedding`, which writes the legacy `embeddings` table that no reader queries. Net effect: after every `vouch index` / `reindex` / import-apply, semantic search silently returned nothing (degrading to the FTS5/substring fallback) until each artifact happened to be re-written by the live `KBStore._embed_and_store` hook. Write to `embedding_index` via `index_db.put_embedding`, supplying the same content_hash + model/version/dim metadata the live write hook uses so the per-content/model skip-check stays consistent. Add a regression test that rebuilds and asserts each artifact's embedding is readable from `embedding_index`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Looking for one thing? Review this PR in Change Stack to search files, summaries, diffs, and code without losing your place. 📝 WalkthroughWalkthroughThe PR fixes ChangesEmbedding rebuild correctness
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/vouch/health.py (1)
194-197:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the same canonical page/entity text as the live embedding hook.
content_hash(text)is now part of the persisted contract, but this rebuild path still hashes"{title} {body}"/"{name} {description}"whileKBStore._embed_and_storewrites"{title}\n\n{body}"and"{name}\n\n{description}"insrc/vouch/storage.py. After a rebuild, unchanged pages/entities will look stale to the live hook and get re-embedded again; the stored vectors also differ from the live path for the same artifact.Proposed fix
for p in store.list_pages(): - texts.append(("page", p.id, f"{p.title} {p.body}")) + texts.append(("page", p.id, f"{p.title}\n\n{p.body}")) for e in store.list_entities(): - texts.append(("entity", e.id, f"{e.name} {e.description or ''}")) + texts.append(("entity", e.id, f"{e.name}\n\n{e.description or ''}"))Also applies to: 204-215
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/health.py` around lines 194 - 197, The rebuild path in health.py builds page/entity text as "{title} {body}" and "{name} {description}" which diverges from the canonical text used by KBStore._embed_and_store ("{title}\n\n{body}" and "{name}\n\n{description}"), causing content_hash mismatches and unnecessary re-embeds; update the loops that call store.list_pages() and store.list_entities() to assemble the text exactly the same way as KBStore._embed_and_store (use a double newline between title/name and body/description, and treat missing description as empty string) so content_hash(text) and embeddings match the live hook.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/vouch/health.py`:
- Around line 194-197: The rebuild path in health.py builds page/entity text as
"{title} {body}" and "{name} {description}" which diverges from the canonical
text used by KBStore._embed_and_store ("{title}\n\n{body}" and
"{name}\n\n{description}"), causing content_hash mismatches and unnecessary
re-embeds; update the loops that call store.list_pages() and
store.list_entities() to assemble the text exactly the same way as
KBStore._embed_and_store (use a double newline between title/name and
body/description, and treat missing description as empty string) so
content_hash(text) and embeddings match the live hook.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: a44844f3-7c06-45c3-9277-3f73fef40a54
📒 Files selected for processing (2)
src/vouch/health.pytests/embeddings/test_search.py
ReviewSummary: The fix correctly redirects What works
Suggestions
Verdictapprove — The fix is minimal, mechanically correct, and directly addresses the root cause described in #172. The regression test is well-formed and will catch regressions cleanly. The non-blocking notes are pre-existing gaps outside this PR's stated scope. |
|
PR should target the test branch. If not, will be auto-closed |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0819f512f2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # left semantic search empty until each artifact was next | ||
| # re-written. Mirror the live `KBStore._embed_and_store` | ||
| # metadata so the per-content/model skip-check stays consistent. | ||
| index_db.put_embedding( |
There was a problem hiding this comment.
Rebuild every semantic artifact kind
reset() has already deleted embedding_index, but this new put_embedding loop only writes the claim/page/entity tuples assembled above. The live write path and search_embedding() support source, relation, and evidence embeddings, so after a plain kb.index_rebuild/import-apply those rows are lost and semantic-only searches over those artifact types return nothing until a separate kb.reindex_embeddings/rewrite. Please rebuild all artifact kinds that _embed_and_store creates.
Useful? React with 👍 / 👎.
| # metadata so the per-content/model skip-check stays consistent. | ||
| index_db.put_embedding( | ||
| conn, kind=kind, id=eid, vec=row.tolist(), | ||
| content_hash=content_hash(text), |
There was a problem hiding this comment.
Use live text when hashing rebuilt embeddings
For rebuilt pages/entities, this stores a hash for the text assembled above with single spaces, but the normal KBStore._embed_and_store path hashes and embeds f"{title}\n\n{body}" and f"{name}\n\n{description or ''}". After kb.index_rebuild, those rows therefore look stale to backfill_embeddings/future write hooks and are embedded from different text than live writes; use the same text extractor as the live path before persisting content_hash.
Useful? React with 👍 / 👎.
| # left semantic search empty until each artifact was next | ||
| # re-written. Mirror the live `KBStore._embed_and_store` | ||
| # metadata so the per-content/model skip-check stays consistent. | ||
| index_db.put_embedding( |
There was a problem hiding this comment.
Record rebuilt embedding model metadata
reset() deletes the index_meta embedding keys, and this new direct rebuild path repopulates embedding_index without the set_embedding_meta call used by _embed_and_store/backfill_embeddings. After kb.index_rebuild or import-apply, embedding stats report no model/dim and detect_mismatch() has no stored model to compare, so a later model switch can leave these rebuilt vectors in the old vector space without warning. Please update index_meta after a successful embedding rebuild.
Useful? React with 👍 / 👎.
Summary
Closes #172
rebuild_indexregenerated embeddings into the wrong table, so semantic search silently returned nothing after every reindex/import.reset()clearsembedding_index(the tablesearch_embedding/search_semanticread), but_rebuild_embeddingsre-encoded every artifact throughindex_db.index_embedding, which writes the legacyembeddingstable that no reader queries. Until each artifact was next re-written by the liveKBStore._embed_and_storehook, embedding search degraded to the FTS5/substring fallback.Change
src/vouch/health.py—_rebuild_embeddingsnow writes viaindex_db.put_embeddingtoembedding_index, supplying the samecontent_hash+model/model_version/dimmetadata the live write hook uses (so the per-content/model skip-check stays consistent). Batching is preserved.tests/embeddings/test_search.py—test_rebuild_index_repopulates_semantic_index: afterrebuild_index, each artifact's embedding is readable fromembedding_index.Why it's safe
put_embedding→embedding_index).Verification
pytest tests/embeddings/test_search.py— 15 passed (existing + the new regression test). The new test fails before this change (embeddings landed in the deadembeddingstable, soget_embeddingfromembedding_indexreturns None after rebuild) and passes after.Summary by CodeRabbit
Bug Fixes
Tests