Skip to content

fix(health): rebuild_index must write embeddings to embedding_index#173

Open
seekmistar01 wants to merge 1 commit into
vouchdev:mainfrom
seekmistar01:fix/rebuild-index-embedding-table
Open

fix(health): rebuild_index must write embeddings to embedding_index#173
seekmistar01 wants to merge 1 commit into
vouchdev:mainfrom
seekmistar01:fix/rebuild-index-embedding-table

Conversation

@seekmistar01

@seekmistar01 seekmistar01 commented Jun 6, 2026

Copy link
Copy Markdown

Summary

Closes #172

rebuild_index regenerated embeddings into the wrong table, so semantic search silently returned nothing after every reindex/import. reset() clears embedding_index (the table search_embedding/search_semantic read), but _rebuild_embeddings re-encoded every artifact through index_db.index_embedding, which writes the legacy embeddings table that no reader queries. Until each artifact was next re-written by the live KBStore._embed_and_store hook, embedding search degraded to the FTS5/substring fallback.

Change

  • src/vouch/health.py_rebuild_embeddings now writes via index_db.put_embedding to embedding_index, supplying the same content_hash + model/model_version/dim metadata the live write hook uses (so the per-content/model skip-check stays consistent). Batching is preserved.
  • tests/embeddings/test_search.pytest_rebuild_index_repopulates_semantic_index: after rebuild_index, each artifact's embedding is readable from embedding_index.

Why it's safe

  • Only the rebuild's write target/metadata changes; it now matches the canonical live path (put_embeddingembedding_index).
  • Well-formed installs are strictly improved (semantic search works after reindex); no behavior is removed.

Verification

  • pytest tests/embeddings/test_search.py15 passed (existing + the new regression test). The new test fails before this change (embeddings landed in the dead embeddings table, so get_embedding from embedding_index returns None after rebuild) and passes after.

Summary by CodeRabbit

  • Bug Fixes

    • Embedding rebuild now stores comprehensive metadata including content hash, model version, and embedding dimensions for improved tracking
  • Tests

    • Added regression test to verify semantic index is properly repopulated during embedding rebuild operations

`rebuild_index` calls `index_db.reset()` (which clears `embedding_index`,
the table semantic search reads via `search_embedding`/`search_semantic`)
and then `_rebuild_embeddings` re-encoded every artifact through
`index_db.index_embedding`, which writes the legacy `embeddings` table that
no reader queries. Net effect: after every `vouch index` / `reindex` /
import-apply, semantic search silently returned nothing (degrading to the
FTS5/substring fallback) until each artifact happened to be re-written by the
live `KBStore._embed_and_store` hook.

Write to `embedding_index` via `index_db.put_embedding`, supplying the same
content_hash + model/version/dim metadata the live write hook uses so the
per-content/model skip-check stays consistent. Add a regression test that
rebuilds and asserts each artifact's embedding is readable from
`embedding_index`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown

Looking for one thing? Review this PR in Change Stack to search files, summaries, diffs, and code without losing your place.

Review Change Stack

📝 Walkthrough

Walkthrough

The PR fixes rebuild_index writing embeddings to a legacy table that no reader queries. The rebuild now calls index_db.put_embedding() with full metadata and includes a regression test confirming embeddings are correctly stored and retrievable after reindex.

Changes

Embedding rebuild correctness

Layer / File(s) Summary
Embedding write path correction
src/vouch/health.py
Import content_hash and replace index_db.index_embedding() with index_db.put_embedding(), now recording per-text content_hash, model, model_version, and embedding dim alongside each embedding during the batch write loop.
Rebuild semantic index regression test
tests/embeddings/test_search.py
New test test_rebuild_index_repopulates_semantic_index populates a KBStore, calls rebuild_index, and asserts embeddings for claim, page, and entity are retrievable via index_db.get_embedding after rebuild to confirm the fix writes to the correct table.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

Possibly related PRs

  • vouchdev/vouch#41: The embedding persistence API (index_db.put_embedding()) and schema with content_hash, model_version, and dim fields that this PR now uses for rebuild writes.
  • vouchdev/vouch#40: The semantic search backend and embedding index table infrastructure that rebuild now correctly populates via the updated write hook.

Poem

🐰 A rebuild that lost its way so far,
Writing to tables under a dead star.
But now embeddings find the right home,
Search thrives again, no more to roam!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main fix: redirecting rebuild_index embeddings to the correct embedding_index table.
Linked Issues check ✅ Passed The PR fully addresses issue #172's requirements: changes _rebuild_embeddings to use index_db.put_embedding with content_hash/model/version/dim metadata, and adds a regression test asserting embeddings are readable from embedding_index after rebuild.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing issue #172: modifying the rebuild write path and adding a corresponding regression test, with no unrelated alterations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/vouch/health.py (1)

194-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the same canonical page/entity text as the live embedding hook.

content_hash(text) is now part of the persisted contract, but this rebuild path still hashes "{title} {body}" / "{name} {description}" while KBStore._embed_and_store writes "{title}\n\n{body}" and "{name}\n\n{description}" in src/vouch/storage.py. After a rebuild, unchanged pages/entities will look stale to the live hook and get re-embedded again; the stored vectors also differ from the live path for the same artifact.

Proposed fix
         for p in store.list_pages():
-            texts.append(("page", p.id, f"{p.title} {p.body}"))
+            texts.append(("page", p.id, f"{p.title}\n\n{p.body}"))
         for e in store.list_entities():
-            texts.append(("entity", e.id, f"{e.name} {e.description or ''}"))
+            texts.append(("entity", e.id, f"{e.name}\n\n{e.description or ''}"))

Also applies to: 204-215

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/health.py` around lines 194 - 197, The rebuild path in health.py
builds page/entity text as "{title} {body}" and "{name} {description}" which
diverges from the canonical text used by KBStore._embed_and_store
("{title}\n\n{body}" and "{name}\n\n{description}"), causing content_hash
mismatches and unnecessary re-embeds; update the loops that call
store.list_pages() and store.list_entities() to assemble the text exactly the
same way as KBStore._embed_and_store (use a double newline between title/name
and body/description, and treat missing description as empty string) so
content_hash(text) and embeddings match the live hook.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/vouch/health.py`:
- Around line 194-197: The rebuild path in health.py builds page/entity text as
"{title} {body}" and "{name} {description}" which diverges from the canonical
text used by KBStore._embed_and_store ("{title}\n\n{body}" and
"{name}\n\n{description}"), causing content_hash mismatches and unnecessary
re-embeds; update the loops that call store.list_pages() and
store.list_entities() to assemble the text exactly the same way as
KBStore._embed_and_store (use a double newline between title/name and
body/description, and treat missing description as empty string) so
content_hash(text) and embeddings match the live hook.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a44844f3-7c06-45c3-9277-3f73fef40a54

📥 Commits

Reviewing files that changed from the base of the PR and between 3beb821 and 0819f51.

📒 Files selected for processing (2)
  • src/vouch/health.py
  • tests/embeddings/test_search.py

@plind-junior

Copy link
Copy Markdown
Collaborator

Review

Summary: The fix correctly redirects _rebuild_embeddings from the dead index_embeddingembeddings path to the canonical put_embeddingembedding_index path, mirroring the live KBStore._embed_and_store hook. The regression test directly validates the corrected write target. The root cause diagnosis matches the code exactly.

What works

  • src/vouch/health.py:204-214 — Swapping index_db.index_embedding for index_db.put_embedding with the full content_hash/model/model_version/dim metadata is correct. put_embedding writes embedding_index (line 272 of index_db.py), which is the table search_embedding and search_semantic both query, and the table reset() clears. The per-content/model skip-check in the live write path will now stay consistent because the metadata keys match.
  • src/vouch/health.py:203 — Unpacking text from the batch tuple (was _) is the right mechanical prerequisite for calling content_hash(text).
  • tests/embeddings/test_search.py:52-76 — The test drives a full rebuild_index round-trip across the three artifact kinds that _rebuild_embeddings iterates (claim, page, entity) and asserts get_embedding returns non-None from embedding_index. It will fail on the unfixed code (write lands in dead embeddings table) and pass after — exactly the regression gate the issue requested.
  • src/vouch/health.py:204from .embeddings import content_hash, get_embeddercontent_hash is correctly exported from vouch/embeddings/__init__.py (__all__ line 36), so the import is safe.

Suggestions

  • [non-blocking] src/vouch/health.py:191-197_rebuild_embeddings collects claim, page, and entity artifacts but omits source (and relation/evidence kinds). The live KBStore._embed_and_store hook is called for all five kinds including source (storage.py:260). After rebuild_index, source embeddings are left stale/missing in embedding_index, which is the same class of bug this PR fixes for the other kinds. This is pre-existing scope (the old index_embedding path also skipped sources), but worth a follow-up issue since the fix now makes the gap more visible.
  • [non-blocking] tests/embeddings/test_search.py:52-76 — The new test asserts kind="source" is NOT covered by rebuild_index implicitly (a put_source is added but no assertion on its post-rebuild embedding is made). A comment clarifying the intentional omission would prevent future readers from treating it as an oversight.

Verdict

approve — The fix is minimal, mechanically correct, and directly addresses the root cause described in #172. The regression test is well-formed and will catch regressions cleanly. The non-blocking notes are pre-existing gaps outside this PR's stated scope.

@plind-junior

Copy link
Copy Markdown
Collaborator

PR should target the test branch. If not, will be auto-closed

@plind-junior plind-junior reopened this Jun 9, 2026
@plind-junior

Copy link
Copy Markdown
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0819f512f2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/vouch/health.py
# left semantic search empty until each artifact was next
# re-written. Mirror the live `KBStore._embed_and_store`
# metadata so the per-content/model skip-check stays consistent.
index_db.put_embedding(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Rebuild every semantic artifact kind

reset() has already deleted embedding_index, but this new put_embedding loop only writes the claim/page/entity tuples assembled above. The live write path and search_embedding() support source, relation, and evidence embeddings, so after a plain kb.index_rebuild/import-apply those rows are lost and semantic-only searches over those artifact types return nothing until a separate kb.reindex_embeddings/rewrite. Please rebuild all artifact kinds that _embed_and_store creates.

Useful? React with 👍 / 👎.

Comment thread src/vouch/health.py
# metadata so the per-content/model skip-check stays consistent.
index_db.put_embedding(
conn, kind=kind, id=eid, vec=row.tolist(),
content_hash=content_hash(text),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Use live text when hashing rebuilt embeddings

For rebuilt pages/entities, this stores a hash for the text assembled above with single spaces, but the normal KBStore._embed_and_store path hashes and embeds f"{title}\n\n{body}" and f"{name}\n\n{description or ''}". After kb.index_rebuild, those rows therefore look stale to backfill_embeddings/future write hooks and are embedded from different text than live writes; use the same text extractor as the live path before persisting content_hash.

Useful? React with 👍 / 👎.

Comment thread src/vouch/health.py
# left semantic search empty until each artifact was next
# re-written. Mirror the live `KBStore._embed_and_store`
# metadata so the per-content/model skip-check stays consistent.
index_db.put_embedding(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Record rebuilt embedding model metadata

reset() deletes the index_meta embedding keys, and this new direct rebuild path repopulates embedding_index without the set_embedding_meta call used by _embed_and_store/backfill_embeddings. After kb.index_rebuild or import-apply, embedding stats report no model/dim and detect_mismatch() has no stored model to compare, so a later model switch can leave these rebuilt vectors in the old vector space without warning. Please update index_meta after a successful embedding rebuild.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rebuild_index writes embeddings to a dead table, breaking semantic search after every reindex

2 participants