fix(health): rebuild_index must write embeddings to embedding_index by seekmistar01 · Pull Request #173 · vouchdev/vouch

seekmistar01 · 2026-06-06T07:17:42Z

Summary

Closes #172

rebuild_index regenerated embeddings into the wrong table, so semantic search silently returned nothing after every reindex/import. reset() clears embedding_index (the table search_embedding/search_semantic read), but _rebuild_embeddings re-encoded every artifact through index_db.index_embedding, which writes the legacy embeddings table that no reader queries. Until each artifact was next re-written by the live KBStore._embed_and_store hook, embedding search degraded to the FTS5/substring fallback.

Change

src/vouch/health.py — _rebuild_embeddings now writes via index_db.put_embedding to embedding_index, supplying the same content_hash + model/model_version/dim metadata the live write hook uses (so the per-content/model skip-check stays consistent). Batching is preserved.
tests/embeddings/test_search.py — test_rebuild_index_repopulates_semantic_index: after rebuild_index, each artifact's embedding is readable from embedding_index.

Why it's safe

Only the rebuild's write target/metadata changes; it now matches the canonical live path (put_embedding → embedding_index).
Well-formed installs are strictly improved (semantic search works after reindex); no behavior is removed.

Verification

pytest tests/embeddings/test_search.py — 15 passed (existing + the new regression test). The new test fails before this change (embeddings landed in the dead embeddings table, so get_embedding from embedding_index returns None after rebuild) and passes after.

Summary by CodeRabbit

Bug Fixes
- Embedding rebuild now stores comprehensive metadata including content hash, model version, and embedding dimensions for improved tracking
Tests
- Added regression test to verify semantic index is properly repopulated during embedding rebuild operations

`rebuild_index` calls `index_db.reset()` (which clears `embedding_index`, the table semantic search reads via `search_embedding`/`search_semantic`) and then `_rebuild_embeddings` re-encoded every artifact through `index_db.index_embedding`, which writes the legacy `embeddings` table that no reader queries. Net effect: after every `vouch index` / `reindex` / import-apply, semantic search silently returned nothing (degrading to the FTS5/substring fallback) until each artifact happened to be re-written by the live `KBStore._embed_and_store` hook. Write to `embedding_index` via `index_db.put_embedding`, supplying the same content_hash + model/version/dim metadata the live write hook uses so the per-content/model skip-check stays consistent. Add a regression test that rebuilds and asserts each artifact's embedding is readable from `embedding_index`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-06T07:17:52Z

Looking for one thing? Review this PR in Change Stack to search files, summaries, diffs, and code without losing your place.

📝 Walkthrough

Walkthrough

The PR fixes rebuild_index writing embeddings to a legacy table that no reader queries. The rebuild now calls index_db.put_embedding() with full metadata and includes a regression test confirming embeddings are correctly stored and retrievable after reindex.

Changes

Embedding rebuild correctness

Layer / File(s)	Summary
Embedding write path correction `src/vouch/health.py`	Import `content_hash` and replace `index_db.index_embedding()` with `index_db.put_embedding()`, now recording per-text `content_hash`, `model`, `model_version`, and embedding `dim` alongside each embedding during the batch write loop.
Rebuild semantic index regression test `tests/embeddings/test_search.py`	New test `test_rebuild_index_repopulates_semantic_index` populates a KBStore, calls `rebuild_index`, and asserts embeddings for claim, page, and entity are retrievable via `index_db.get_embedding` after rebuild to confirm the fix writes to the correct table.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

rebuild_index writes embeddings to a dead table, breaking semantic search after every reindex #172: Rebuild_index writes embeddings to the wrong table (legacy embeddings instead of live embedding_index); this PR fixes it by switching to put_embedding() with full metadata.

Possibly related PRs

vouchdev/vouch#41: The embedding persistence API (index_db.put_embedding()) and schema with content_hash, model_version, and dim fields that this PR now uses for rebuild writes.
vouchdev/vouch#40: The semantic search backend and embedding index table infrastructure that rebuild now correctly populates via the updated write hook.

Poem

🐰 A rebuild that lost its way so far,
Writing to tables under a dead star.
But now embeddings find the right home,
Search thrives again, no more to roam! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically identifies the main fix: redirecting rebuild_index embeddings to the correct embedding_index table.
Linked Issues check	✅ Passed	The PR fully addresses issue `#172`'s requirements: changes _rebuild_embeddings to use index_db.put_embedding with content_hash/model/version/dim metadata, and adds a regression test asserting embeddings are readable from embedding_index after rebuild.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing issue `#172`: modifying the rebuild write path and adding a corresponding regression test, with no unrelated alterations.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/vouch/health.py (1)
194-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the same canonical page/entity text as the live embedding hook.

content_hash(text) is now part of the persisted contract, but this rebuild path still hashes "{title} {body}" / "{name} {description}" while KBStore._embed_and_store writes "{title}\n\n{body}" and "{name}\n\n{description}" in src/vouch/storage.py. After a rebuild, unchanged pages/entities will look stale to the live hook and get re-embedded again; the stored vectors also differ from the live path for the same artifact.
Proposed fix
         for p in store.list_pages():
-            texts.append(("page", p.id, f"{p.title} {p.body}"))
+            texts.append(("page", p.id, f"{p.title}\n\n{p.body}"))
         for e in store.list_entities():
-            texts.append(("entity", e.id, f"{e.name} {e.description or ''}"))
+            texts.append(("entity", e.id, f"{e.name}\n\n{e.description or ''}"))
Also applies to: 204-215
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/health.py` around lines 194 - 197, The rebuild path in health.py
builds page/entity text as "{title} {body}" and "{name} {description}" which
diverges from the canonical text used by KBStore._embed_and_store
("{title}\n\n{body}" and "{name}\n\n{description}"), causing content_hash
mismatches and unnecessary re-embeds; update the loops that call
store.list_pages() and store.list_entities() to assemble the text exactly the
same way as KBStore._embed_and_store (use a double newline between title/name
and body/description, and treat missing description as empty string) so
content_hash(text) and embeddings match the live hook.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/vouch/health.py`:
- Around line 194-197: The rebuild path in health.py builds page/entity text as
"{title} {body}" and "{name} {description}" which diverges from the canonical
text used by KBStore._embed_and_store ("{title}\n\n{body}" and
"{name}\n\n{description}"), causing content_hash mismatches and unnecessary
re-embeds; update the loops that call store.list_pages() and
store.list_entities() to assemble the text exactly the same way as
KBStore._embed_and_store (use a double newline between title/name and
body/description, and treat missing description as empty string) so
content_hash(text) and embeddings match the live hook.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a44844f3-7c06-45c3-9277-3f73fef40a54

📥 Commits

Reviewing files that changed from the base of the PR and between 3beb821 and 0819f51.

📒 Files selected for processing (2)

src/vouch/health.py
tests/embeddings/test_search.py

plind-junior · 2026-06-09T06:37:31Z

Review

Summary: The fix correctly redirects _rebuild_embeddings from the dead index_embedding → embeddings path to the canonical put_embedding → embedding_index path, mirroring the live KBStore._embed_and_store hook. The regression test directly validates the corrected write target. The root cause diagnosis matches the code exactly.

What works

src/vouch/health.py:204-214 — Swapping index_db.index_embedding for index_db.put_embedding with the full content_hash/model/model_version/dim metadata is correct. put_embedding writes embedding_index (line 272 of index_db.py), which is the table search_embedding and search_semantic both query, and the table reset() clears. The per-content/model skip-check in the live write path will now stay consistent because the metadata keys match.
src/vouch/health.py:203 — Unpacking text from the batch tuple (was _) is the right mechanical prerequisite for calling content_hash(text).
tests/embeddings/test_search.py:52-76 — The test drives a full rebuild_index round-trip across the three artifact kinds that _rebuild_embeddings iterates (claim, page, entity) and asserts get_embedding returns non-None from embedding_index. It will fail on the unfixed code (write lands in dead embeddings table) and pass after — exactly the regression gate the issue requested.
src/vouch/health.py:204 — from .embeddings import content_hash, get_embedder — content_hash is correctly exported from vouch/embeddings/__init__.py (__all__ line 36), so the import is safe.

Suggestions

[non-blocking] src/vouch/health.py:191-197 — _rebuild_embeddings collects claim, page, and entity artifacts but omits source (and relation/evidence kinds). The live KBStore._embed_and_store hook is called for all five kinds including source (storage.py:260). After rebuild_index, source embeddings are left stale/missing in embedding_index, which is the same class of bug this PR fixes for the other kinds. This is pre-existing scope (the old index_embedding path also skipped sources), but worth a follow-up issue since the fix now makes the gap more visible.
[non-blocking] tests/embeddings/test_search.py:52-76 — The new test asserts kind="source" is NOT covered by rebuild_index implicitly (a put_source is added but no assertion on its post-rebuild embedding is made). A comment clarifying the intentional omission would prevent future readers from treating it as an oversight.

Verdict

approve — The fix is minimal, mechanically correct, and directly addresses the root cause described in #172. The regression test is well-formed and will catch regressions cleanly. The non-blocking notes are pre-existing gaps outside this PR's stated scope.

plind-junior · 2026-06-09T06:53:12Z

PR should target the test branch. If not, will be auto-closed

plind-junior · 2026-06-17T10:27:54Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0819f512f2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-17T10:33:18Z

+                # left semantic search empty until each artifact was next
+                # re-written. Mirror the live `KBStore._embed_and_store`
+                # metadata so the per-content/model skip-check stays consistent.
+                index_db.put_embedding(


Rebuild every semantic artifact kind

reset() has already deleted embedding_index, but this new put_embedding loop only writes the claim/page/entity tuples assembled above. The live write path and search_embedding() support source, relation, and evidence embeddings, so after a plain kb.index_rebuild/import-apply those rows are lost and semantic-only searches over those artifact types return nothing until a separate kb.reindex_embeddings/rewrite. Please rebuild all artifact kinds that _embed_and_store creates.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T10:33:18Z

+                # metadata so the per-content/model skip-check stays consistent.
+                index_db.put_embedding(
+                    conn, kind=kind, id=eid, vec=row.tolist(),
+                    content_hash=content_hash(text),


Use live text when hashing rebuilt embeddings

For rebuilt pages/entities, this stores a hash for the text assembled above with single spaces, but the normal KBStore._embed_and_store path hashes and embeds f"{title}\n\n{body}" and f"{name}\n\n{description or ''}". After kb.index_rebuild, those rows therefore look stale to backfill_embeddings/future write hooks and are embedded from different text than live writes; use the same text extractor as the live path before persisting content_hash.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T10:33:18Z

+                # left semantic search empty until each artifact was next
+                # re-written. Mirror the live `KBStore._embed_and_store`
+                # metadata so the per-content/model skip-check stays consistent.
+                index_db.put_embedding(


Record rebuilt embedding model metadata

reset() deletes the index_meta embedding keys, and this new direct rebuild path repopulates embedding_index without the set_embedding_meta call used by _embed_and_store/backfill_embeddings. After kb.index_rebuild or import-apply, embedding stats report no model/dim and detect_mismatch() has no stored model to compare, so a later model switch can leave these rebuilt vectors in the old vector space without warning. Please update index_meta after a successful embedding rebuild.

Useful? React with 👍 / 👎.

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

plind-junior closed this Jun 9, 2026

plind-junior reopened this Jun 9, 2026

chatgpt-codex-connector Bot reviewed Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(health): rebuild_index must write embeddings to embedding_index#173

fix(health): rebuild_index must write embeddings to embedding_index#173
seekmistar01 wants to merge 1 commit into
vouchdev:mainfrom
seekmistar01:fix/rebuild-index-embedding-table

seekmistar01 commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

plind-junior commented Jun 9, 2026

Uh oh!

plind-junior commented Jun 9, 2026

Uh oh!

plind-junior commented Jun 17, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

seekmistar01 commented Jun 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change

Why it's safe

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

plind-junior commented Jun 9, 2026

Review

What works

Suggestions

Verdict

Uh oh!

plind-junior commented Jun 9, 2026

Uh oh!

plind-junior commented Jun 17, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

seekmistar01 commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading