Skip to content

embeddings: replace pending_embeddings queue with per-message embed_gen (scan-and-fill)#411

Open
webgress wants to merge 6 commits into
kenn-io:mainfrom
webgress:embed-gen-scan-fill
Open

embeddings: replace pending_embeddings queue with per-message embed_gen (scan-and-fill)#411
webgress wants to merge 6 commits into
kenn-io:mainfrom
webgress:embed-gen-scan-fill

Conversation

@webgress

@webgress webgress commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Implements the embedding-queue redesign proposed in #387.

Replaces the separate pending_embeddings work queue with a per-message embed_gen column plus a scan-and-fill worker, a per-generation embed_watermark, and a full-scan backstop. Backend-agnostic (SQLite and PostgreSQL/pgvector).

Logic: changing embedding generation is a rare procedure, keeping embeddings other than the target configuration is at most nice to have, not strictly necessary. The complexity required to correctly maintain multiple embeddings during transition is not justified.

What changed

  • Schema: new embed_gen column on messages (the generation a message is embedded/skipped for; NULL = needs work) and a small embed_watermark table. The old pending_embeddings table is dropped on upgrade.
  • Write path: new messages persist with embed_gen = NULL in the same transaction that stores the message — there is no separate enqueue step, so an embedding can no longer be orphaned by a failed enqueue.
  • Worker: finds work by scanning embed_gen IS NULL OR embed_gen <> <target> (id-ordered) instead of claiming queue rows, and stamps embed_gen on success or skip. A periodic full-scan backstop (default 24h, watermark-ignoring) recovers any stragglers.
  • Coverage: msgvault embeddings reports live / embedded / blank / missing counts.

Upgrade behavior

Existing archives are not re-embedded. A one-time, idempotent backfill stamps already-embedded messages with the active generation on first run (both backends); new rows need no backfill.

Usage

  • msgvault embeddings — show coverage.
  • msgvault embeddings build [--backstop] — build/resume embeddings; --backstop forces a full watermark-ignoring scan.
  • Run only one embedding process at a time.

🤖 Generated with Claude Code

API note

This renames the vector-search stats field pending_embeddings_total to missing_embeddings_total (on /api/v1/stats and the MCP get_stats tool). Under scan-and-fill there is no pending queue; the value is the count of live messages still needing embedding under the active (and building, if any) generation. No backward-compatibility alias is included — happy to keep pending_embeddings_total as a deprecated alias if you'd prefer to preserve the existing field name.

…can-and-fill)

Replaces the separate pending_embeddings work queue with a per-message
embed_gen column on messages plus a scan-and-fill worker, a per-generation
embed_watermark, and a full-scan backstop. New messages persist with
embed_gen=NULL in the same transaction that writes the message, so an
embedding is never orphaned by a failed enqueue. The worker finds work via
(embed_gen IS NULL OR embed_gen <> target) and stamps embed_gen on success or
skip; the daemon auto-backstop (default 24h) recovers below-watermark
stragglers. Coverage (live/embedded/blank/missing) is surfaced in
"msgvault embeddings". Both backends: SQLite (embed_gen in the main DB,
embeddings/generations/watermark in vectors.db) and PostgreSQL/pgvector.

Concurrency: embed_gen is stamped via optimistic CAS on a DDL-maintained
messages.last_modified; a one-time, ledger-guarded upgrade backfill stamps
already-embedded rows so upgraded archives are not re-embedded.

Retires pending_embeddings (plus the enqueuer and seedPending) and the
sync-time enqueue sites.

Implements kenn-io#387.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown

roborev: Combined Review (295fa02)

Synthesis unavailable. Showing individual review outputs.

claude-code — default (done)

I've completed a thorough review of this large scan-and-fill refactor. The change replaces the pending_embeddings queue with a per-message embed_gen column plus a per-generation watermark, with extensive cross-DB ordering care. The core worker logic (CAS stamping, watermark advancement, downshift drain, loop termination), the upgrade backfill (ledger-guarded, atomic, pending-signal preservation), the orphan-stamp reset, the repair-encoding ordering, and the read-only/MCP guards are all sound and exceptionally well-documented. The PG-vs-SQLite trigger-ordering asymmetry is deliberate and correct (SQLite defers trigger column resolution to fire time).

Review Findings

Severity: Low
Location: internal/vector/sqlitevec/backend.goEmbeddedMessageCount (and its caller fillFullCoverage in cmd/msgvault/cmd/embeddings_manage.go)
Problem: On SQLite the coverage display materializes the entire set of distinct embedded message ids for a generation into a Go slice, marshals it to a JSON blob, and feeds it through json_each. For the active generation of a fully-embedded archive this is the whole corpus — on this tool's stated target (20+ years of Gmail, potentially millions of messages) that is a multi-million-element slice plus a large JSON string, computed once per non-retired generation on every msgvault embeddings list. The comment defends this as mirroring dropDeletedFromSource, but that path typically operates on a deletion subset, whereas this scales with the full embedded corpus.
Fix: Acceptable as-is for an infrequent management command, but consider a bounded/streaming intersection (e.g. iterate stamped-live ids from the main DB and probe embeddings existence in batches), or derive embedded without materializing the whole id set, if large-archive embeddings list latency/memory becomes a concern.


Severity: Low
Location: internal/store/store.go InitSchema (schema.sql trigger creation at line ~669 vs last_modified column add at line ~718) / internal/store/last_modified_test.go
Problem: The universal SQLite upgrade path — a pre-existing messages table lacking last_modified, where schema.sql creates trg_messages_last_modified (referencing last_modified) before LegacyColumnMigrations adds the column, then backfillLastModified runs — has no direct test. The new trigger tests all use NewTestStore (fresh DB, column present at trigger-create time), and makeUpgradedMainDB only drops embed_gen, never last_modified. The path is relied upon to work via SQLite's deferred trigger column-resolution; it is the one path every existing SQLite user will hit on upgrade.
Fix: Add a store-level test that builds a messages table without last_modified (or drops the column), runs InitSchema, and asserts it succeeds and the backfill/trigger then function — guarding the upgrade path the same way TestEmbed_UpgradedDBMissingEmbedGen_NeedsInitSchema guards embed_gen.

Summary

A well-engineered, thoroughly documented migration from a pending_embeddings queue to a per-message embed_gen scan-and-fill model with watermark optimization, backstop recovery, and optimistic-CAS staleness handling; no correctness bugs found, only minor scalability and test-coverage nits.


claude-code — security (done)

I have completed a thorough review of all production code changes in this diff.

Summary

These three commits replace the pending_embeddings queue table with a per-message embed_gen column ("scan-and-fill"), add a last_modified CAS token (with DB triggers maintaining it), a per-generation embed_watermark, a --backstop full-scan flag, and a periodic daemon backstop pass. The bulk is internal plumbing: threading *store.Store through setupVectorFeatures, the embed worker, and the scheduler, plus a one-time upgrade backfill that stamps embed_gen on already-embedded rows.

I checked the security-relevant surfaces:

  • SQL construction — Every dynamically-built SQL fragment is either a ?-placeholder list sized to the id count (ScanForEmbedding, SetEmbedGen, ResetEmbedGen, embedBatch, SetEmbedGenIfUnchanged) or a hard-coded dialect constant (lastModifiedExpr ∈ {"CAST(m.last_modified AS TEXT)", "m.last_modified"}, dialect.Now(), LiveMessagesWhere("", true)). No attacker- or user-controlled string reaches SQL text. All bound parameters are int64/int/driver-scanned tokens.
  • PostgreSQL triggers (EnsureTriggers) are static plpgsql using CURRENT_TIMESTAMP and NEW.message_id (a column reference, not interpolation) — no injection.
  • Trust boundaries — No new HTTP/MCP endpoint, auth check, credential path, or untrusted-input sink is introduced. The sync changes only remove the enqueuer wiring; no new processing of ingested email content. Backend paths (vecPath/mainPath) derive from config, not runtime input.
  • MCP read-only path — The new ReadOnly: true propagation into the pgvector/sqlitevec backends is a hardening improvement: it suppresses DDL/UPDATE writes (schema apply, orphan reset, embed_gen backfill) that a query-only connection would otherwise attempt.
  • Documented residuals — The 1-second CAS resolution window and the "two concurrent first-opens both run the backfill" window are explicitly called out in code/docs. Both are data-freshness/correctness concerns for a single-user tool, not security boundaries: there is no weaker actor, no privilege separation, and no asset reachable by an unauthorized party. Per the project guidelines (single-user personal tool; the CLI/API user is trusted), these are out of scope.

No injection, privilege-escalation, credential-exposure, path-traversal, or trust-boundary weakening is introduced by these changes.

No issues found.

webgress and others added 3 commits June 23, 2026 07:41
…grade re-embed signal)

- repair-encoding now lowers embed_watermark below the repaired ids so an
  incremental run re-embeds below-watermark repaired messages instead of
  leaving them stale until a backstop.
- the one-time upgrade backfill preserves the active-generation
  pending_embeddings re-embed signal (those messages stay embed_gen=NULL
  and are re-embedded) instead of stamping them covered.
- document the accepted SQLite 1-second last_modified CAS window.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…est handle leak)

- repair-encoding now opens the vector backend (running the one-time
  upgrade backfill) BEFORE clearing embed_gen, so a first-run backfill can
  no longer re-cover messages that repair just marked for re-embedding.
- close leaked DB handles in the new upgrade/backfill tests so Windows
  TempDir cleanup succeeds.
- scrub stray review-severity tags from comments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… tag

- direct test for the upgrade path where a pre-existing messages table
  lacks the new last_modified column: InitSchema creates the trigger
  before LegacyColumnMigrations adds the column (SQLite deferred trigger
  resolution), then backfillLastModified populates it.
- remove a stray review-severity tag left in a test comment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (2a9cda5)

Verdict: Medium-risk issues remain around embedding recovery throttling and a stats API contract change.

Medium

  • internal/scheduler/embed_job.go:216
    Backstop throttling is global to the job, not per generation. If the daemon recently backstopped the active generation, a newly selected building generation can skip its first recovery pass until the interval elapses, leaving below-watermark gaps stuck and auto-activation blocked for up to the default 24h.
    Fix: Track lastBackstop per generation, or reset/force the backstop when the target generation changes.

  • internal/vector/stats.go:34
    The stats JSON field was renamed from pending_embeddings_total to missing_embeddings_total, which breaks existing /api/v1/stats and MCP clients; the public docs still advertise the old field.
    Fix: Keep pending_embeddings_total as a deprecated alias while adding the new field, or update the documented API contract and release notes if this is an intentional breaking change.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 6m40s), codex_security (codex/security, done, 5m18s) | Total: 12m7s

- EmbedJob.lastBackstop is now keyed per generation, so a newly targeted
  building generation runs its first backstop instead of inheriting a
  recent active-generation backstop's throttle (could otherwise delay
  recovery of a below-watermark straggler and block auto-activation for
  up to BackstopInterval).
- update docs to the renamed stats field missing_embeddings_total.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (bfede51)

Stale embeddings can persist after message content updates; one medium-severity finding should be fixed before merge.

Medium

  • internal/store/messages.go:190 — Existing messages keep their current embed_gen when PersistMessage updates embedding inputs such as subject or body. The new worker only scans rows where embed_gen is NULL or different from the target generation, so an already embedded message changed during resync can be skipped permanently and keep stale vector content.
    • Fix: Reset messages.embed_gen to NULL when persisted message fields that feed embeddings change, including body upserts, ideally only when relevant content actually differs.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 7m40s), codex_security (codex/security, done, 3m33s) | Total: 11m20s

@wesm

wesm commented Jun 25, 2026

Copy link
Copy Markdown
Member

looking at this

The scan-and-fill worker only revisits messages whose embed_gen is NULL or differs from the target generation. Preserving a current stamp while sync/import writes new subject or body text can therefore leave stale vectors behind indefinitely.

Clear the stamp when embeddable message inputs change, while preserving it for metadata-only re-persist operations so routine sync churn does not force unnecessary re-embedding.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (346d8e0)

Summary verdict: One medium issue needs attention before merge.

Medium

  • internal/store/messages.go:191 - Content updates clear messages.embed_gen for changed subjects/bodies, but do not lower or reset the embedding watermark. The normal embedding worker scans with id > watermark, so an edited message whose ID is already below the watermark can remain unembedded indefinitely during normal serve/resume processing until a full backstop happens.
    • Fix: When invalidating embed_gen for existing messages, also persist/lower the vector watermark below the changed message ID, or record a dirty min-ID that the embedding worker consumes before its next incremental scan. Add coverage for updating an already-watermarked message and verifying the normal worker re-embeds it.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 9m31s), codex_security (codex/security, done, 5m7s) | Total: 14m45s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants