embeddings: replace pending_embeddings queue with per-message embed_gen (scan-and-fill) #411
webgress wants to merge 6 commits into
…can-and-fill) Replaces the separate pending_embeddings work queue with a per-message embed_gen column on messages plus a scan-and-fill worker, a per-generation embed_watermark, and a full-scan backstop. New messages persist with embed_gen=NULL in the same transaction that writes the message, so an embedding is never orphaned by a failed enqueue. The worker finds work via (embed_gen IS NULL OR embed_gen <> target) and stamps embed_gen on success or skip; the daemon auto-backstop (default 24h) recovers below-watermark stragglers. Coverage (live/embedded/blank/missing) is surfaced in "msgvault embeddings". Both backends: SQLite (embed_gen in the main DB, embeddings/generations/watermark in vectors.db) and PostgreSQL/pgvector. Concurrency: embed_gen is stamped via optimistic CAS on a DDL-maintained messages.last_modified; a one-time, ledger-guarded upgrade backfill stamps already-embedded rows so upgraded archives are not re-embedded. Retires pending_embeddings (plus the enqueuer and seedPending) and the sync-time enqueue sites. Implements kenn-io#387. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
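The optimistic CAS on `last_modified` described in this commit can be pictured with a small in-memory sketch. It is only an illustration under stated assumptions: the `row` type and `stampIfUnchanged` function are hypothetical names, and the real implementation presumably does the comparison in SQL rather than in Go.

```go
package main

import "fmt"

// row is a hypothetical in-memory stand-in for one messages row.
type row struct {
	LastModified int64  // Unix seconds; models messages.last_modified
	EmbedGen     *int64 // nil models embed_gen = NULL
}

// stampIfUnchanged sets EmbedGen only if LastModified still equals the value
// the worker saw before it started embedding (compare-and-swap).
// It reports whether the stamp was applied.
func stampIfUnchanged(r *row, seenModified, gen int64) bool {
	if r.LastModified != seenModified {
		return false // row changed while embedding; leave it for a later scan
	}
	r.EmbedGen = &gen
	return true
}

func main() {
	r := &row{LastModified: 100}
	seen := r.LastModified // worker reads the row, then embeds it
	r.LastModified = 101   // a concurrent writer modifies the row

	fmt.Println(stampIfUnchanged(r, seen, 2))           // stale view: false
	fmt.Println(stampIfUnchanged(r, r.LastModified, 2)) // fresh view: true
}
```

With a timestamp that only has one-second resolution, a write landing in the same second as the worker's read would carry the same `LastModified` value and pass the check, which is presumably the window the later commit documents as accepted.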
roborev: Combined Review (
…grade re-embed signal)

- repair-encoding now lowers embed_watermark below the repaired ids so an incremental run re-embeds below-watermark repaired messages instead of leaving them stale until a backstop.
- the one-time upgrade backfill preserves the active-generation pending_embeddings re-embed signal (those messages stay embed_gen=NULL and are re-embedded) instead of stamping them covered.
- document the accepted SQLite 1-second last_modified CAS window.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…est handle leak)

- repair-encoding now opens the vector backend (running the one-time upgrade backfill) BEFORE clearing embed_gen, so a first-run backfill can no longer re-cover messages that repair just marked for re-embedding.
- close leaked DB handles in the new upgrade/backfill tests so Windows TempDir cleanup succeeds.
- scrub stray review-severity tags from comments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… tag

- direct test for the upgrade path where a pre-existing messages table lacks the new last_modified column: InitSchema creates the trigger before LegacyColumnMigrations adds the column (SQLite deferred trigger resolution), then backfillLastModified populates it.
- remove a stray review-severity tag left in a test comment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- EmbedJob.lastBackstop is now keyed per generation, so a newly targeted building generation runs its first backstop instead of inheriting a recent active-generation backstop's throttle (could otherwise delay recovery of a below-watermark straggler and block auto-activation for up to BackstopInterval).
- update docs to the renamed stats field missing_embeddings_total.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
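The per-generation throttle described in this commit can be sketched as follows. The names `EmbedJob`, `lastBackstop`, and `BackstopInterval` come from the commit message, but the struct layout and the `backstopDue` and `markBackstop` methods are assumptions made for illustration, not the project's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// EmbedJob here is a hypothetical reduction of the real job type.
type EmbedJob struct {
	BackstopInterval time.Duration
	lastBackstop     map[int64]time.Time // keyed by generation id
}

// backstopDue reports whether a full-scan backstop should run for gen.
// A generation with no recorded backstop is always due.
func (j *EmbedJob) backstopDue(gen int64, now time.Time) bool {
	last, ok := j.lastBackstop[gen]
	return !ok || now.Sub(last) >= j.BackstopInterval
}

// markBackstop records that a backstop for gen completed at now.
func (j *EmbedJob) markBackstop(gen int64, now time.Time) {
	if j.lastBackstop == nil {
		j.lastBackstop = make(map[int64]time.Time)
	}
	j.lastBackstop[gen] = now
}

// runDemo walks through the scenario from the commit message.
func runDemo() []bool {
	j := &EmbedJob{BackstopInterval: 24 * time.Hour}
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	var out []bool

	out = append(out, j.backstopDue(1, t0)) // active gen, never run: due
	j.markBackstop(1, t0)
	out = append(out, j.backstopDue(1, t0.Add(time.Hour)))    // throttled
	out = append(out, j.backstopDue(2, t0.Add(time.Hour)))    // new building gen: due
	out = append(out, j.backstopDue(1, t0.Add(24*time.Hour))) // interval elapsed: due
	return out
}

func main() {
	fmt.Println(runDemo()) // [true false true true]
}
```

With a single shared timestamp instead of the map, the third check would have been throttled by generation 1's recent backstop, which is the delay the commit sets out to remove.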
looking at this
The scan-and-fill worker only revisits messages whose embed_gen is NULL or differs from the target generation. Preserving a current stamp while sync/import writes new subject or body text can therefore leave stale vectors behind indefinitely. Clear the stamp when embeddable message inputs change, while preserving it for metadata-only re-persist operations so routine sync churn does not force unnecessary re-embedding. Generated with Codex Co-authored-by: Codex <codex@openai.com>
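A minimal sketch of the rule in this commit, assuming that subject and body are the only embeddable inputs and that labels stand in for metadata. The `message` type and `rePersist` function are hypothetical names used only for this example.

```go
package main

import "fmt"

// message is a hypothetical in-memory stand-in for a stored message.
type message struct {
	Subject  string
	Body     string
	Labels   []string // metadata; assumed not to feed the embedding
	EmbedGen *int64   // nil models embed_gen = NULL (needs work)
}

// rePersist overwrites stored with incoming, as a sync or import would.
// The stamp is cleared only when an embeddable input actually changed,
// so metadata-only churn does not force a re-embed.
func rePersist(stored *message, incoming message) {
	if stored.Subject != incoming.Subject || stored.Body != incoming.Body {
		stored.EmbedGen = nil
	}
	stored.Subject = incoming.Subject
	stored.Body = incoming.Body
	stored.Labels = incoming.Labels
}

func main() {
	g := int64(2)
	m := &message{Subject: "hello", Body: "first draft", EmbedGen: &g}

	// Metadata-only change: the stamp survives.
	rePersist(m, message{Subject: "hello", Body: "first draft", Labels: []string{"inbox"}})
	fmt.Println(m.EmbedGen != nil) // true

	// Body change: the stamp is cleared so the worker revisits the message.
	rePersist(m, message{Subject: "hello", Body: "second draft"})
	fmt.Println(m.EmbedGen != nil) // false
}
```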
Implements the embedding-queue redesign proposed in #387.
Replaces the separate `pending_embeddings` work queue with a per-message `embed_gen` column plus a scan-and-fill worker, a per-generation `embed_watermark`, and a full-scan backstop. Backend-agnostic (SQLite and PostgreSQL/pgvector).

Logic: changing the embedding generation is a rare procedure, and keeping embeddings other than the target configuration is at most nice to have, not strictly necessary. The complexity required to correctly maintain multiple embeddings during a transition is not justified.
What changed
- `embed_gen` column on `messages` (the generation a message is embedded/skipped for; `NULL` = needs work) and a small `embed_watermark` table. The old `pending_embeddings` table is dropped on upgrade.
- New messages persist with `embed_gen = NULL` in the same transaction that stores the message. There is no separate enqueue step, so an embedding can no longer be orphaned by a failed enqueue.
- The worker finds work via `embed_gen IS NULL OR embed_gen <> <target>` (id-ordered) instead of claiming queue rows, and stamps `embed_gen` on success or skip. A periodic full-scan backstop (default 24h, watermark-ignoring) recovers any stragglers.
- `msgvault embeddings` reports live / embedded / blank / missing counts.

Upgrade behavior
Existing archives are not re-embedded. A one-time, idempotent backfill stamps already-embedded messages with the active generation on first run (both backends); new rows need no backfill.
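As a rough illustration of the one-time backfill, the sketch below models the ledger guard and the stamping rule in memory. It assumes, following the commit history, that messages carrying an old re-embed signal are left at `NULL`. The `msg` type, `backfillOnce` function, and ledger key are all hypothetical.

```go
package main

import "fmt"

// msg is a hypothetical in-memory stand-in for a messages row.
type msg struct {
	ID       int64
	EmbedGen *int64 // nil models embed_gen = NULL
}

// backfillOnce stamps already-embedded messages with the active generation.
// The ledger entry makes later calls a no-op, and messages flagged for
// re-embedding are skipped so they stay NULL and get picked up by the worker.
// It returns the number of rows stamped.
func backfillOnce(ledger map[string]bool, msgs []msg, hasVector, pendingReembed map[int64]bool, active int64) int {
	const key = "embed_gen_backfill" // hypothetical ledger key
	if ledger[key] {
		return 0
	}
	stamped := 0
	for i := range msgs {
		id := msgs[i].ID
		if msgs[i].EmbedGen == nil && hasVector[id] && !pendingReembed[id] {
			g := active
			msgs[i].EmbedGen = &g
			stamped++
		}
	}
	ledger[key] = true
	return stamped
}

func main() {
	ledger := map[string]bool{}
	msgs := []msg{{ID: 1}, {ID: 2}, {ID: 3}}
	hasVector := map[int64]bool{1: true, 2: true} // 3 was never embedded
	pending := map[int64]bool{2: true}            // 2 was queued for re-embed

	fmt.Println(backfillOnce(ledger, msgs, hasVector, pending, 7)) // 1
	fmt.Println(backfillOnce(ledger, msgs, hasVector, pending, 7)) // 0
}
```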
Usage
- `msgvault embeddings`: show coverage.
- `msgvault embeddings build [--backstop]`: build/resume embeddings; `--backstop` forces a full watermark-ignoring scan.

🤖 Generated with Claude Code
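The difference between an incremental build and a `--backstop` run can be sketched in memory. This assumes the watermark marks the highest id an incremental pass has already scanned, which is an interpretation of the description rather than something it states. The `message` type and the `needsWork`, `scan`, and `gen` helpers are hypothetical.

```go
package main

import "fmt"

// message is a hypothetical in-memory stand-in for a messages row.
type message struct {
	ID       int64
	EmbedGen *int64 // nil models embed_gen = NULL
}

// gen is a small helper to build an *int64 for the examples.
func gen(v int64) *int64 { return &v }

// needsWork mirrors the predicate: embed_gen IS NULL OR embed_gen <> target.
func needsWork(m message, target int64) bool {
	return m.EmbedGen == nil || *m.EmbedGen != target
}

// scan returns the ids needing work, in id order (msgs is assumed sorted).
// An incremental pass skips ids at or below the watermark; a backstop pass
// ignores the watermark and checks every message.
func scan(msgs []message, target, watermark int64, backstop bool) []int64 {
	var ids []int64
	for _, m := range msgs {
		if !backstop && m.ID <= watermark {
			continue
		}
		if needsWork(m, target) {
			ids = append(ids, m.ID)
		}
	}
	return ids
}

func main() {
	msgs := []message{
		{ID: 1, EmbedGen: gen(2)}, // already on the target generation
		{ID: 2},                   // below-watermark straggler
		{ID: 3, EmbedGen: gen(1)}, // stamped for an older generation
		{ID: 4},                   // new message
	}
	fmt.Println(scan(msgs, 2, 3, false)) // incremental: [4]
	fmt.Println(scan(msgs, 2, 3, true))  // backstop: [2 3 4]
}
```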
API note
This renames the vector-search stats field `pending_embeddings_total` to `missing_embeddings_total` (on `/api/v1/stats` and the MCP `get_stats` tool). Under scan-and-fill there is no pending queue; the value is the count of live messages still needing embedding under the active (and building, if any) generation. No backward-compatibility alias is included. Happy to keep `pending_embeddings_total` as a deprecated alias if you'd prefer to preserve the existing field name.
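If the alias were wanted, one possible shape is to emit both keys with the same value. The two JSON field names come from this note, but the `vectorStats` struct and `render` helper below are hypothetical and are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vectorStats is a hypothetical reduction of the stats payload.
type vectorStats struct {
	MissingEmbeddingsTotal int64 `json:"missing_embeddings_total"`
	// Deprecated alias carrying the same value; a nil pointer omits the key.
	PendingEmbeddingsTotal *int64 `json:"pending_embeddings_total,omitempty"`
}

// render encodes the stats, optionally including the deprecated alias.
func render(missing int64, withAlias bool) string {
	s := vectorStats{MissingEmbeddingsTotal: missing}
	if withAlias {
		s.PendingEmbeddingsTotal = &missing
	}
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(render(3, false)) // {"missing_embeddings_total":3}
	fmt.Println(render(3, true))  // {"missing_embeddings_total":3,"pending_embeddings_total":3}
}
```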