Skip to content

fix(health,index_db): rebuild_index atomic swap prevents index wipeou…#160

Open
greatjourney589 wants to merge 4 commits into
vouchdev:mainfrom
greatjourney589:fix/rebuild-index-atomic-swap
Open

fix(health,index_db): rebuild_index atomic swap prevents index wipeou…#160
greatjourney589 wants to merge 4 commits into
vouchdev:mainfrom
greatjourney589:fix/rebuild-index-atomic-swap

Conversation

@greatjourney589

@greatjourney589 greatjourney589 commented Jun 3, 2026

Copy link
Copy Markdown

What changed

Closes #159

rebuild_index() now writes the new index into a temporary file in the same directory, then atomically renames it into place on success. Per-artifact parse errors are caught individually — bad files are skipped with a warning rather than aborting the whole build. vouch doctor gains an index_blown check that fires when state.db is empty but artifact files exist on disk. A new open_db_at(path) helper in index_db.py accepts an explicit path for the temp-file build.

Why

rebuild_index() called index_db.reset() — which committed destructive DELETEs — before reading a single artifact file. Any parse error (bad YAML, truncated write, ValidationError) caused the context manager to exit with an empty, committed state.db. The existing index was destroyed with no recovery path, silently blacking out vouch context, vouch search, and all MCP/JSONL retrieval. Fixes the critical index-wipeout bug; also addresses the gap where vouch doctor reported no errors after the crash.

What might break

Nothing. rebuild_index is an internal function — its signature is unchanged. No on-disk format changes. Users with an existing .vouch/ directory will see no difference in normal operation; a corrupted artifact that previously crashed vouch index will now be skipped with a warning instead.

VEP

Not required — no surface change (object model, kb.* method signatures, on-disk layout, bundle format, or audit-log shape).

Tests

  • make check passes locally (lint + mypy + pytest)
  • New / changed behaviour has a test
  • CHANGELOG.md updated under ## [Unreleased]

Summary by CodeRabbit

  • Improvements
    • Index rebuild is now atomic and more resilient: corrupt artifacts are skipped while valid entries and embeddings are preserved.
    • Health diagnostics now detect when the on-disk index is out of sync with stored data and surface a specific warning.
  • Bug Fixes
    • Rebuild failures no longer leave partial or broken index state on disk.
  • Tests
    • Added tests covering corrupted artifact handling and the new index-health detection.

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@greatjourney589, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 47 minutes and 28 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 31872677-8867-4c0d-a19e-00717608ab72

📥 Commits

Reviewing files that changed from the base of the PR and between 68c7ee9 and 53f1d7d.

📒 Files selected for processing (1)
  • src/vouch/health.py
📝 Walkthrough

Walkthrough

Replaces destructive reset-and-reindex with an atomic temp-file rebuild that indexes durable artifacts into a temp DB while collecting per-artifact failures. On success the temp DB is swapped into place; unreadable artifacts are warned. doctor() gains a check to detect an index that is empty while artifacts exist on disk. Tests added.

Changes

Index rebuild resilience and diagnostics

Layer / File(s) Summary
Atomic rebuild infrastructure
src/vouch/index_db.py, src/vouch/health.py
Adds open_db_at(path: Path) for explicit-path DB creation and updates imports in health.py to support tempfile, shutil, warnings, and artifact deserialization for file-based indexing.
Safe artifact indexing with atomic swap
src/vouch/health.py
rebuild_index() rewritten to build the index in state.db.tmp, read and validate claims/pages/entities from disk, accumulate unreadable artifacts into skipped, write embeddings within the same temp DB transaction, atomically move the temp DB into place on success, and clean up/raise on unexpected failures.
Blown-index detection in doctor()
src/vouch/health.py
doctor() now compares index_db.stats() row counts to on-disk artifact counts and emits an index_blown finding when artifacts exist but the index contains zero rows.
Test coverage for rebuild safety and diagnostics
tests/test_index.py
Adds tests: one corrupts a claim YAML and asserts rebuild_index() warns and preserves healthy claims; another simulates a reset index and asserts doctor() reports index_blown.

Sequence Diagram

sequenceDiagram
  participant Caller
  participant rebuild_index
  participant temp_db as TempDB(.db.tmp)
  participant disk as ArtifactFiles
  participant final_db as state.db

  Caller->>rebuild_index: rebuild_index(store)
  rebuild_index->>temp_db: open temp DB via open_db_at(path)
  rebuild_index->>disk: list claims/pages/entities
  loop per artifact
    disk-->>rebuild_index: read file (YAML/MD)
    alt deserialize success
      rebuild_index->>temp_db: index_claim/page/entity (+ embeddings)
      temp_db-->>rebuild_index: row inserted
    else deserialize fails
      rebuild_index->>rebuild_index: add artifact to skipped list
    end
  end
  rebuild_index->>final_db: atomic move .db.tmp → state.db
  alt skipped not empty
    rebuild_index->>Caller: warnings.warn(skipped artifacts)
  end
  rebuild_index-->>Caller: return stats
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A temp DB hummed in the night,

I read each file by soft moonlight.
Bad YAML tripped, I marked it skipped,
Then swapped the whole and neatly zipped.
Now searches smile, no longer flipped.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: atomic swap in rebuild_index to prevent index wipeout.
Linked Issues check ✅ Passed All core objectives from issue #159 are met: atomic rebuild with temp DB and swap (Option A), per-artifact error handling with skip-and-warn, index_blown doctor check, open_db_at helper, and regression tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #159 objectives. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vouch/health.py`:
- Around line 199-207: The current per-artifact try/except around Claim
deserialization also catches database write errors from index_db.index_claim;
narrow the rescue so only the file read/parse/validation is caught: read and
validate the YAML into Claim (using Claim.model_validate and _yaml_load) inside
a try/except that appends to skipped on deserialization/read failures, but call
index_db.index_claim outside that except so any SQLite/indexing errors bubble
and abort the rebuild; apply the same pattern to the page/entity loops where
index_db.index_page/index_entity are invoked.
- Around line 226-238: rebuild_index currently swaps tmp_path into place with
shutil.move then calls _rebuild_embeddings(store), which can mutate the live DB
and break atomicity if it fails; fix by either (A) moving the embedding rebuild
to the temp DB before the swap (invoke whatever API _rebuild_embeddings or a
variant that targets tmp_path / the new DB file so all mutations happen on the
temp DB) and only shutil.move after that succeeds, or (B) make the post-swap
embedding refresh explicitly best-effort by wrapping the
_rebuild_embeddings(store) call in a try/except that logs the failure (including
exception details) but does not raise, ensuring the swapped DB remains the
canonical state; update rebuild_index to use one of these approaches and
reference shutil.move, tmp_path, db_path, _rebuild_embeddings and rebuild_index
in the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c3a7afe1-a409-44a1-b95e-be519f633e15

📥 Commits

Reviewing files that changed from the base of the PR and between 3beb821 and 6e26f18.

📒 Files selected for processing (3)
  • src/vouch/health.py
  • src/vouch/index_db.py
  • tests/test_index.py

Comment thread src/vouch/health.py Outdated
Comment thread src/vouch/health.py
@plind-junior

Copy link
Copy Markdown
Collaborator

Review

Summary: Implements the atomic-swap approach (Option A) from #159. The fix is structurally sound — build into a temp file, swap only on success, catch per-artifact parse errors with a skip-and-warn fallback — and the two new tests cover the critical regression paths. A few issues worth discussing below.

What works

  • health.py:205open_db_at(tmp_path) correctly scopes the entire FTS build + embedding write inside the temp-file transaction, so the rename only happens after everything commits. The try/except with tmp_path.unlink(missing_ok=True) on failure is the right cleanup pattern.
  • health.py:185–220 — Per-artifact exception handling is tight: each file is tried independently, errors are collected into skipped, and the whole build continues. The warnings.warn at the end with the full list is more useful than a per-file stderr print.
  • health.py:143–174index_blown doctor check is a good complement: it will catch the pre-fix state of a KB that was damaged before upgrading, and the condition (artifacts_on_disk > 0 and indexed == 0) matches exactly what the issue described.
  • index_db.py:98–110open_db_at is a minimal, clean helper. Sharing SCHEMA with open_db means the temp DB always has the same tables, so no schema drift is possible between the temp build and the live path.
  • tests/test_index.py:51–101 — Both new tests are well-scoped. test_rebuild_index_preserves_good_claims_when_one_yaml_is_corrupt exercises the exact repro from the issue report; test_doctor_detects_blown_index covers the diagnostic gap. The warnings.catch_warnings block is the right way to assert on warnings.warn output.

Suggestions

  • [blocking] health.py:246–254_rebuild_embeddings is still defined and still has a caller at health.py:180 in the old code path, but the PR leaves this stub behind that just calls _write_embeddings(conn, store) on a separate open_db connection to the live DB — not to the temp DB. This means the old _rebuild_embeddings is now dead code in the post-PR world (the new rebuild_index calls _write_embeddings directly inside the temp-file connection), but if anything outside the module still references _rebuild_embeddings it would write embeddings to the live DB rather than the temp file, breaking the atomicity guarantee. Verify nothing calls _rebuild_embeddings externally and consider removing it to avoid confusion, or at least add a deprecation comment.

  • [blocking] health.py:185–220 — The loop variable p is reused across all three artifact types (claims, pages, entities). In the claims loop p is a Path to a .yaml file; in the pages loop p is rebound to a .md Path; in the entities loop it's rebound again. This is fine at runtime but the inner type annotations c: Claim, pg: Page, e: Entity suggest the author was aware of the shadowing yet used p throughout anyway. No bug here, but the pages block also shadows the outer variable p that holds tmp_path's parent in store.kb_dir — worth renaming for clarity (claim_path, page_path, entity_path) to avoid future confusion.

  • [non-blocking] health.py:197–199_write_embeddings rebuilds the text list by calling store.list_claims(), store.list_pages(), and store.list_entities() — which re-reads every artifact file from disk a second time (they were already read during the FTS loop above). On large KBs this doubles the I/O. Consider passing the already-parsed objects through or caching them locally in rebuild_index. This is a performance concern, not a correctness issue.

  • [non-blocking] health.py:143–153 — The index_blown check calls index_db.stats(store.kb_dir) which opens a new SQLite connection and issues three COUNT(*) queries. Immediately before this, the doctor flow may have already opened/closed the DB for other checks. Not a bug, just potentially worth noting that stats() is called every doctor run even when the DB is present, and on cold storage this could be slow. Lower priority.

  • [non-blocking] tests/test_index.py:94–101test_doctor_detects_blown_index uses raw executescript DELETE statements to manually empty the FTS tables. This works, but it bypasses index_db.reset() and any future schema changes could silently stop the test from simulating the blown-index condition correctly. A comment explaining why reset() isn't used here (or alternatively just calling reset() directly) would improve maintainability.

Verdict

request changes — The orphaned _rebuild_embeddings stub is a clarity/maintenance hazard and should be removed or explicitly deprecated before merge; the p variable shadowing in the three-loop block is harmless today but is a trap. Both are easy fixes. Core logic is correct and the tests are solid.

@greatjourney589

Copy link
Copy Markdown
Author

Hi,@plind-junior
I understood your suggestion exactly.when you reopen this PR, I will resolve it perfectly.

@plind-junior plind-junior reopened this Jun 9, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vouch/health.py`:
- Around line 268-272: The current broad except in the get_embedder
import/initialization swallows any error and leaves embedding_index empty after
rebuild_index(); change it to only catch the explicit "embedder unavailable"
exception exported by the embeddings module (e.g. EmbedderUnavailableError or
similarly named exception) by importing that symbol from .embeddings and using
"except EmbedderUnavailableError:" so real initialization errors in
get_embedder() propagate (or are re-raised) and unexpected failures are not
suppressed; also add a processLogger.error or raise after catching to record the
unavailability case while preserving other exceptions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 59df0a0b-214d-4364-a8b4-fc18d41a1c27

📥 Commits

Reviewing files that changed from the base of the PR and between 6e26f18 and 68c7ee9.

📒 Files selected for processing (2)
  • src/vouch/health.py
  • tests/test_index.py

Comment thread src/vouch/health.py
@plind-junior

Copy link
Copy Markdown
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53f1d7dac5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/vouch/health.py
else:
# Detect a blown index: DB exists but FTS is empty while artifact files
# are present on disk. This is the aftermath of a crashed rebuild_index.
stats = index_db.stats(store.kb_dir)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle unreadable state.db in doctor

When state.db exists but is corrupt or not a SQLite database (for example after an interrupted write or accidental overwrite), this new index_db.stats() call raises a sqlite3.DatabaseError, so vouch doctor aborts instead of returning a health report that can tell the user to rebuild the derived index. Since the index is explicitly treated as disposable, this path should catch stats/open failures and report a warning/error rather than crashing.

Useful? React with 👍 / 👎.

Comment thread src/vouch/health.py
Comment on lines +207 to +209
except Exception as exc:
skipped.append((claim_path.name, str(exc)))
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Abort rebuild on artifact read failures

If read_text() fails here for an otherwise valid artifact (for example a transient OSError/PermissionError on one claim file), this broad catch treats it like a corrupt YAML file, skips it, and still swaps the temp database into state.db. That drops the artifact from the live search index even though the old index was still usable; read/I/O failures should abort the temp rebuild so the atomic-swap path preserves the existing index.

Useful? React with 👍 / 👎.

Comment thread src/vouch/health.py
Comment on lines +191 to +193
with tempfile.NamedTemporaryFile(
dir=store.kb_dir, suffix=".db.tmp", delete=False,
) as tmp:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Use an ignored temp filename for rebuilds

If the process is killed after creating this temp DB but before cleanup, the leftover file is named like tmpabcd.db.tmp under .vouch/. KBStore.init() only ignores state.db and state.db-*, so the crash-safety path can leave an untracked derived SQLite file in the repo; use a state.db- prefix or extend the ignore pattern so crash leftovers stay out of source control.

Useful? React with 👍 / 👎.

Comment thread src/vouch/health.py
Comment on lines 275 to +277
embedder = get_embedder()
except Exception:
except (KeyError, ImportError):
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep FTS rebuild independent of embedder startup

With an embedder registered, any non-ImportError startup failure from get_embedder() (for example a sentence-transformers model cache/download or runtime error) now aborts vouch index before the temp DB is swapped, so even the FTS index stays stale or missing. Embeddings are an optional enhancement elsewhere in the codebase, and the previous path ignored these startup failures; this should still skip embeddings and let the durable FTS rebuild complete.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: rebuild_index() destroys state.db before reading artifacts — any parse error leaves the KB with a permanently blank search index

2 participants