Skip to content

Fix HNSW boot-integrity check: compare element count, not label_map size#27

Open
danielhertz1999-bit wants to merge 1 commit into
CodeAbra:mainfrom
danielhertz1999-bit:fix/hnsw-boot-integrity-windows
Open

Fix HNSW boot-integrity check: compare element count, not label_map size#27
danielhertz1999-bit wants to merge 1 commit into
CodeAbra:mainfrom
danielhertz1999-bit:fix/hnsw-boot-integrity-windows

Conversation

@danielhertz1999-bit

Copy link
Copy Markdown
Contributor

The bug

HippoDB._initialize_hnsw_index decides whether to rebuild the HNSW index by comparing the in-memory _label_map size against the SQLite row count:

active_label_count = len(self._label_map)
if active_label_count != sqlite_count:
    ... rebuild ...

But _label_map is repopulated from SQLite (_repopulate_label_map_from_sqlite()) before this check runs, so len(self._label_map) already equals sqlite_count — the condition is effectively always false and the rebuild never fires.

When the on-disk HNSW file is stale relative to SQLite (e.g. an interrupted write), the daemon boots with the HNSW index holding fewer elements than the label map / SQLite believe exist. The integrity check that's supposed to catch and repair this silently passes. A subsequent knn_query(k=2) against a 1-element index then raises RuntimeError on every insert.

Impact (observed on Windows)

This blocks the deferred-capture drain completely: with HNSW count=1 but _label_map/SQLite count=2, every capture insert crashes, so a backlog of session files never drains. After the fix, the same store drains cleanly (a 66-file / 322-turn backlog went from N=2 to draining normally).

The fix

Compare the actual HNSW element count against SQLite, so a real divergence triggers the rebuild:

hnsw_count = self._hnsw.get_current_count()
if hnsw_count != sqlite_count:
    ... rebuild ...

4 lines, no behavioral change when the index is already consistent. Surfaced while validating the new Windows support (v1.2.0) end-to-end.

🤖 Generated with Claude Code

_repopulate_label_map_from_sqlite() runs immediately before
_initialize_hnsw_index(), so len(_label_map) always equals sqlite_count
and the integrity check never fired. The HNSW file could have fewer
elements (e.g. 1 vs 2 in SQLite) causing knn_query(k=2) on a 1-element
index to raise RuntimeError, breaking every insert via the pattern
separation gate.

Fix: compare self._hnsw.get_current_count() (actual file elements) to
sqlite_count instead of len(_label_map).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@danielhertz1999-bit

Copy link
Copy Markdown
Contributor Author

Heads up: the red build & test (macOS) here is pre-existing on main, not caused by this change.

This PR's branch differs from main by exactly 4 lines in one file (src/iai_mcp/hippo/_db.py), and the failures are all in unrelated suites:

  • tests/test_mcp_tools.py:36NameError: name 'tmp_path' is not defined — a module-scoped daemon_sock fixture references the function-scoped tmp_path fixture (which isn't a parameter). This is in main as-is.
  • OSError: AF_UNIX path too long (test_capture_transcript_no_spawn_defer, test_concurrency) — the macOS runner's pytest-of-runner/... tmp path exceeds the 104-char sun_path limit.
  • daemon socket-bind timeouts (test_bridge_socket_first, test_concurrency) — the known flaky daemon-subprocess tests.

The same job on the Release v1.2.0 push to main (run 28267001306) fails with the identical test_mcp_tools.py:36 NameError — i.e. the gate was already red before this branch existed. The HNSW boot tests this change actually affects (test_daemon_hippo_boot, test_hnsw_reuse_recall_parity) pass.

Happy to send a separate PR fixing the macOS-CI issues (the tmp_pathtmp_path_factory fixture bug is a quick one; the AF_UNIX path-length one needs a short socket dir) if that's useful — just didn't want to scope-creep this 4-line correctness fix.

🤖 via Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant