Skip to content

feat(models): @embed directive + write-time hook + auto-HNSW (Phase 5 of #510)#747

Open
heskew wants to merge 7 commits into
mainfrom
feat/models-embed-directive
Open

feat(models): @embed directive + write-time hook + auto-HNSW (Phase 5 of #510)#747
heskew wants to merge 7 commits into
mainfrom
feat/models-embed-directive

Conversation

@heskew
Copy link
Copy Markdown
Member

@heskew heskew commented May 22, 2026

Summary

Phase 5 of #510. @embed schema directive that auto-fires embedding on writes, stores the vector on the record, and HNSW-indexes the column — no app code.

type Document @table @export {
  id: ID @primaryKey
  content: String
  embedding: Vector @embed(source: "content", model: "default")
}

Three lines replace a custom embedding pipeline.

What ships

  • Parser (resources/graphql.ts): @embed(source, model) branch added alongside @computed. Vector added to PRIMITIVE_TYPES. Validates both args are present and string-typed. Auto-attaches @indexed(type: "HNSW") when no explicit @indexed.
  • API (resources/Table.ts): Table.setEmbedAttribute(name, embedder) mirrors setComputedAttribute for component-author overrides. Default embedder auto-registered at schema load.
  • Write-time hook (resources/Table.ts, resources/models/embedHook.ts): chained into the existing preCommitBlobsForRecordBefore slot at the main put/patch site. Commit blocks on the embedder (sync-by-default).
  • Vector coerceType case: JSON Array<number> and Float32Array both accepted and normalized to Float32Array for storage.
  • Replicated-write predicate: skips on options.isNotification === true (cluster-subscribe), context.replicateFrom === false (REST x-replicate-from: none), and context.alreadyLogged === true (audit replay). Receivers store the originator's vector as-is — preserves cross-cluster embedding-space consistency.
  • Embedder error sanitization: backend URLs / model IDs / API-key tails stay in server logs; the caller sees Failed to compute embedding for attribute "<name>".
  • Schema-reload safety: the static embedAttributes filter is recomputed in updatedAttributes() so in-place Table.attributes.splice() at databases.ts:940 doesn't leave a stale snapshot.

Open decision (resolved per Kris's #632 comment)

Sync-by-default per Kris's preference — fits the existing write.before pattern with zero new infrastructure and is expected to be the overwhelming common case ("I would expect that's overwhelmingly the mode that should and would be used").

Low-latency follow-up direction

Per Kris's #632 comment, the right way to minimize write latency is not queued-via-server/jobs/ (the issue body's original suggestion). It's a derived caching table pattern: the origin table commits immediately without waiting for the embedder; a separate derived table holds the vector and follows sync mode against the origin. Origin write latency stays at LMDB-speed; derived-table writes do the embedding work asynchronously.

The same primitive lands re-embed-on-model-change backfill — the derived table is the natural place to iterate-and-rewrite when the schema's @embed(model: ...) changes.

Model-change invalidation — partial

The parser sets property.version = "embed:<model>", so databases.ts:1111's version-change-driven reindex fires when the model changes. But that pathway re-INDEXES existing vectors; it does NOT re-EMBED source fields through the new model. New writes after a model swap pick up the new model; existing rows keep their old-model vectors until re-written. Full re-embed-on-model-change backfill lands with the derived-table follow-up (above).

Verifier cadence

Per the phase-cadence plan, Phase 5 was scheduled for /ultrareview (write-time hook + auto-indexing is non-trivial new behavior). Locally I ran deep-review --multipass (4 domain agents × 2 runs = 8 agents) over the working tree pre-PR. The multipass surfaced:

  • (fixed pre-PR) cluster-subscribe replication path doesn't set replicateFrom/alreadyLogged — predicate now also checks options.isNotification === true.
  • (fixed pre-PR) Vector was added to PRIMITIVE_TYPES but lacked a coerceType case — added.
  • (fixed pre-PR) stale embedAttributes snapshot on schema reload — recomputed in updatedAttributes().
  • (fixed pre-PR) @embed(...) with missing source: or model: silently degraded to a no-op — now errors with directive location.
  • (fixed pre-PR) embedder backend errors could leak URLs / key tails — sanitized.

/ultrareview recommended pre-merge for multi-model cross-check (this PR's domain is new-behavior class).

Test plan

  • Unit: 17 tests in unitTests/resources/models/embedHook.test.js — default embedder, replication predicate (all 3 signals), source-presence semantics, error sanitization.
  • Build: npm run build clean.
  • Lint: oxlint --deny-warnings clean for touched files.
  • Integration: end-to-end with a fake Ollama HTTP server inside the test (integrationTests/server/embed-directive.test.ts) — happy path, source-unchanged PATCH (no re-embed), REST x-replicate-from: none (receiver path) skip.
  • CI green (3 Node versions).
  • /ultrareview cross-model pass before merge.

Files

Path Change
resources/graphql.ts extended — parse @embed, add 'Vector', validate args, auto-HNSW
resources/Table.ts extended — setEmbedAttribute, write-time hook, Vector coerceType, embedAttributes refresh in updatedAttributes()
resources/models/embedHook.ts new — default embedder + pre-commit builder + replication predicate + error sanitization
unitTests/resources/models/embedHook.test.js new — 17 tests

Out of scope (deferred follow-ups)

  • Derived caching table for low-latency writes (Kris's preferred direction — see above) — tracked at [Models] Derived caching table for @embed (low-latency writes + model-change backfill) #750. Same primitive lands re-embed-on-model-change backfill.
  • Cross-model migration tooling (harper kb reembed --from X --to Y).
  • Multimodal embeddings (image, audio).
  • Pre-existing tech debt at resources/graphql.ts:170 — the unknown-directive warning condition reads server.knownGraphQLDirectives.includes(directiveName) (warns when the directive IS known). Not touched in this PR.

Tracking: #510
Closes #632

🤖 Generated with Claude Code

heskew and others added 2 commits May 22, 2026 11:59
…hase 5 of #510)

Schema-level RAG without app code: marking a field with
`@embed(source: "<field>", model: "<logical-name>")` makes Harper compute
the embedding on every originating write, store it on the record, and
index it via HNSW — no application hooks required.

Parser (resources/graphql.ts):
  - 'embed' added to knownGraphQLDirectives; 'Vector' added to
    PRIMITIVE_TYPES (with a coerceType case in Table.ts that accepts
    Float32Array / Array<number> / typed-array views).
  - `@embed` populates `property.embed = { source, model }` and sets
    `property.version = "embed:<model>"` so the existing schema-load
    version-change-driven reindex pathway (databases.ts:1111) fires on
    model swap. (Note: today that pathway re-indexes existing vectors
    but does NOT re-embed source fields through the new model; full
    re-embed backfill is a follow-up.)
  - Auto-attaches `property.indexed = { type: 'HNSW' }` when no
    explicit `@indexed` is present — the issue's "no app code needed"
    statement only holds with auto-indexing.
  - Validates both `source:` and `model:` are present and string-typed;
    rejects variables / non-string nodes with a directive-location
    error so typoed schemas fail loudly instead of silently producing
    `undefined`.

API + write-time hook (resources/Table.ts):
  - Static `userEmbedders` map and `embedAttributes` filtered list (the
    list is recomputed in `updatedAttributes()` so in-place schema
    reload at `databases.ts:940` doesn't leave a stale snapshot).
  - `Table.setEmbedAttribute(name, embedder)` static method mirrors the
    `setComputedAttribute` registration pattern; component authors
    override the default via this surface.
  - Default embedder auto-registered at schema load when an attribute
    carries `@embed` — sits OUTSIDE the if/else-if resolver chain so
    `@embed`-decorated fields still get their `customIndex.propertyResolver`
    wired for HNSW.
  - `buildEmbedBefore` chained into the existing
    `preCommitBlobsForRecordBefore` slot at the main put/patch site —
    the embedder mutates the recordUpdate during the transaction's
    `before` phase, so commit blocks on it (sync-by-default; queued
    mode deferred to a follow-up).

Default embedder + replicated-write predicate (resources/models/embedHook.ts):
  - `createDefaultEmbedder({source, model})` returns a callback that
    calls `Models.embed(record[source], { model, inputType: 'document' })`
    and returns the first vector. `Models` is lazy-required so the
    module is unit-testable without dragging the transaction stack
    into module load.
  - `buildEmbedBefore` skips on ALL three replication-receiver signals:
    `options.isNotification === true` (cluster-subscribe path in
    Table.ts:325-373), `context.replicateFrom === false` (REST
    `x-replicate-from: none`), and `context.alreadyLogged === true`
    (local audit replay at replayLogs.ts:52). On receivers the
    originator's vector is stored as-is, preserving cross-cluster
    embedding-space consistency.
  - Embedder errors are caught and rethrown as a sanitized message
    `Failed to compute embedding for attribute "<name>"` — backend
    URLs, model IDs, and API-key tails stay in server logs only.

Tracking: #510
Closes #632

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
17 tests covering createDefaultEmbedder + buildEmbedBefore:

  - default embedder: source-field read, document inputType, null/undefined
    source handling, non-string source stringification.
  - buildEmbedBefore replication predicate: skip on
    `options.isNotification === true` (cluster-subscribe),
    `context.replicateFrom === false` (REST suppression),
    `context.alreadyLogged === true` (replay); fire on
    local-originating writes where replicateFrom is undefined.
  - source-field semantics: skip when source not in payload (PATCH
    that omits source survives via patch-merge), clear embedding to
    null when source is explicitly null, run when source is present.
  - per-attribute correctness: skip attributes whose source is
    absent in multi-`@embed` tables; skip attributes with no
    registered embedder; write null when embedder returns null.
  - error sanitization: backend errors (URLs, API-key tails) are
    NOT propagated to the caller; the sanitized message references
    only the attribute name.

Test seam `__setEmbedFnForTest` lets the suite drive the embedder
without instantiating Harper's full transaction stack.

Tracking: #510
Refs #632

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@heskew heskew requested a review from kriszyp May 22, 2026 19:01
@heskew heskew marked this pull request as ready for review May 22, 2026 19:01
heskew and others added 2 commits May 22, 2026 12:03
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spins up a fake Ollama HTTP server inside the test, points Harper's
models.embedding.default config at it, deploys a schema with @embed,
and exercises three paths end-to-end. Replaces the "manual smoke test"
bullet that was on the PR description with a CI-runnable test.

Covered paths:

  1. Happy path — POST a record → fake-ollama returns a deterministic
     3-element Float32 vector → record stores it at the @embed-decorated
     field. Confirms exact-byte equality between the fake's response and
     what's persisted.

  2. Source-unchanged PATCH — PATCH a record with a non-source field →
     fake-ollama call count stays flat; the existing embedding survives
     via patch-merge.

  3. Replication-receiver skip (REST path) — POST with
     x-replicate-from: none header and a pre-supplied vector → no embed
     call is made; the supplied vector is stored as-is. The
     cluster-subscribe replication path (options.isNotification === true)
     is already covered by the embedHook unit tests.

The fake-ollama server tracks call count + last inputs, so we assert
not just on "vector got written" but on "embedder fired exactly when
we expect, and not when we don't".

The fake matches Ollama's actual /api/embed wire format
(POST { model, input: string[] } → { embeddings: number[][] }) so the
real OllamaBackend code path runs unchanged.

Local note: macOS dev machines without
`harper-integration-test-setup-loopback` can run via
`HARPER_INTEGRATION_TEST_FORCE_LOOPBACK=1` to bypass the loopback
address pool. CI uses the pool natively.

Refs #632 #510

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread resources/Table.ts Outdated
// `customIndex.propertyResolver` branch needs to fire for them.
// Component authors override the default via
// `Table.setEmbedAttribute(name, customEmbedder)` after schema load.
if (attribute.embed && !this.userEmbedders[attribute.name]) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default embedder not refreshed on model change.

The !this.userEmbedders[attribute.name] guard correctly avoids clobbering a component-author override set via setEmbedAttribute, but it also prevents the auto-generated default embedder from being updated when the schema is reloaded with a different model: via the databases.ts:940 in-place attributes.splice() path.

updatedAttributes() is called on the existing Table class (not a fresh makeTable()), so userEmbedders carries over. After a @embed(model: "modelB") schema reload, userEmbedders["embedding"] still holds the modelA embedder, and new writes silently use the wrong model. The PR body claims "New writes after a model swap pick up the new model" — the code doesn't support that claim for in-place schema reloads.

To fix without clobbering user overrides, track which names were explicitly set by setEmbedAttribute (e.g., a static userSetEmbedders: Set<string> = new Set()) and skip refresh only for those:

Suggested change
if (attribute.embed && !this.userEmbedders[attribute.name]) {
if (attribute.embed && !TableResource.userSetEmbedders?.has(attribute.name)) {

…then add static userSetEmbedders: Set<string> = new Set() alongside userEmbedders and have setEmbedAttribute populate it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied — setEmbedAttribute now tracks user-set names in static userSetEmbedders: Set<string>; defaults refresh on every updatedAttributes() call, user overrides are preserved. Pushed at b108804.

Comment thread resources/Table.ts
if (value instanceof Float32Array) return value;
if (Array.isArray(value)) return Float32Array.from(value);
if (ArrayBuffer.isView(value)) return new Float32Array((value as ArrayBufferView).buffer);
throw new SyntaxError();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayBuffer.isView branch reinterprets bytes rather than converting values.

new Float32Array(value.buffer) copies the raw bytes of the backing ArrayBuffer from offset 0. If value is a non-Float32Array typed array — e.g., Float64Array([1.0, 2.0, 3.0]) — the eight-byte-per-element encoding is reinterpreted as four-byte float32s, producing 6 values of garbage. The same issue exists for typed arrays with a nonzero byteOffset (subarrays/slices).

The fix is to convert element values rather than reinterpret bytes:

Suggested change
throw new SyntaxError();
if (ArrayBuffer.isView(value)) return Float32Array.from(value as any);

Float32Array.from iterates the view's elements and produces the correct float32 representation of each numeric value. (DataView is not iterable, but it is also not a plausible input here.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied — switched to Float32Array.from(value) which iterates element values rather than reinterpreting bytes. Float64Array / subarray inputs now convert correctly. Pushed at b108804.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

2 blockers found (new findings on 47bd103c4).

1. Source-changing PATCH (the CRITICAL fix) has no test

File: resources/Table.ts lines ~1248, ~1257
What: update() now wraps both _writeUpdate call sites with when(...) to propagate the embed promise. The integration test only exercises the skip-path of update() — PATCH without the source field, where buildEmbedBefore returns undefined and _writeUpdate returns synchronously. No test exercises PATCH with the source field present, which is the only path that exercises the async branch the fix was written for.
Why it matters: This was the CRITICAL fix in 47bd103c4 — silent data loss / "transaction already closed" crash on source-changing PATCHes. The fix is plausible but the production path is unverified end-to-end.
Suggested fix: Add an integration test case: seed a record, PATCH it with { content: 'updated text' }, verify embed fires once and the stored vector matches deterministicVector('updated text').

2. getFromSource caching-table embed path has no test

File: resources/Table.ts ~line 4509
What: The MEDIUM fix from Kris's review wires buildEmbedBefore into the getFromSource write path (the path that bypasses _writeUpdate). Only the false branch of if (embedBefore) has pre-existing coverage; the true branch (embed fires on a caching table) has none.
Why it matters: Kris called this "one of the most compelling use cases." A silent mis-wire here would mean caching tables with @embed never embed — the same failure mode the fix was supposed to close.
Suggested fix: Add an integration test: a source table + a caching table with @embed, write to the source, verify the derived-table record has a populated vector.

heskew and others added 2 commits May 22, 2026 15:55
…s commit

Three bugs surfaced once the new integration test exercised the full REST
write path against a fake-Ollama HTTP server. All three were fixed in this
patch; the integration test now passes 4/4 locally and exercises the bugs
that would otherwise have shipped silent.

1) `require('./Models.ts')` doesn't survive the dist build (the `.ts`
   extension stays literal at runtime; production resolves from
   `dist/*.js`). Switched to `require('#src/resources/models/Models')`
   which goes through package.json conditional exports — resolves to
   `.ts` under the `typestrip` condition (unit tests) and to `dist/*.js`
   in production.

2) The embedder was wired into the pre-commit `before` slot, which is
   awaited at `Promise.all(completions)` at txn-commit time — AFTER each
   write's `commit(...)` closure has already stored the record. That
   pattern works for blob byte-writes (which reference pre-allocated
   blob IDs) but not for `@embed` where the vector itself must be on
   the record at commit time. Moved the embedder call to fire BEFORE
   `transaction.addWrite(...)` and threaded the resulting promise back
   through `_writeUpdate` → `create()` / `update()` via `when()` so the
   caller chain (and ultimately the static-create transaction wrapper)
   awaits the embedding before committing.

3) The default embedder returned a `Float32Array`, but Harper's record
   encoder mangles typed arrays via `updateAndFreeze` (it enumerates
   them as `{0,1,2,...}` maps, breaking the byte round-trip). The
   integration test confirmed: a record POSTed with a plain
   `Array<number>` round-trips correctly; one with a `Float32Array`
   came back as zero-bytes. Changed the default embedder to return
   `Array<number>` (HNSW's `propertyResolver` and `customIndex.index`
   both accept it). Unit test updated to assert the Array shape.

Integration test (`integrationTests/server/embed-directive.test.ts`)
exercises:
  - happy path: POST → embedder runs → vector stored, decoded back
  - source-unchanged PATCH: no new embed call; existing vector survives
  - REST `x-replicate-from: none`: embedder skipped, supplied vector stored

Refs #632 #510

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two real blockers from PR #747 review:

1) Default embedder not refreshed on model change. The
   `!this.userEmbedders[attribute.name]` guard prevented refreshing the
   auto-generated default embedder when a schema reload via
   `databases.ts:940` in-place `attributes.splice()` calls
   `updatedAttributes()` again on the existing Table class — `userEmbedders`
   carried over, so a `@embed(model: "B")` swap from `"A"` left writes
   silently using the old model. Track user-set override names in
   `static userSetEmbedders: Set<string>` and only skip refresh for those;
   defaults are refreshed every time. `setEmbedAttribute` populates the set.

2) `ArrayBuffer.isView` branch reinterpreted bytes. `new Float32Array(value.buffer)`
   copies the raw bytes of the backing ArrayBuffer from offset 0 — when
   `value` is a `Float64Array`, the 8-byte-per-element encoding gets
   reinterpreted as 4-byte float32s (garbage). Same issue with typed-array
   subarrays (nonzero `byteOffset`). Use `Float32Array.from(value)` to
   iterate element values and convert numerically.

Both flagged inline by claude-bot on `resources/Table.ts:3433` and
`resources/Table.ts:4822`. Integration test still passes 4/4, unit
tests 17/17.

Refs #632 #510

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@kriszyp kriszyp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are entirely missing support for embeddings with getFromSource writes. And using embeddings with caching tables actually seems like one of the most compelling uses cases.

My Gemini review caught a number of things:

1. Unhandled Promise in TableResource.update leads to Transaction Commits before Embed resolves — CRITICAL (Confirmed)

File: Table.ts
Domain: concurrency
What: The update(updates, fullUpdate) instance method of TableResource calls this._writeUpdate(...) which returns a Promise when @embed is active. However, update() completely ignores this promise and returns this
synchronously.
Why it matters: Because update() returns synchronously, the calling application will immediately proceed to run await instance.save() . However, the asynchronous embedder operation is still pending, meaning the write has not been
registered in the transaction's write list yet. The transaction will either commit without saving the updated record (causing silent data loss) or crash with a "Transaction already closed" error when the embedder finally resolves and
tries to invoke addWrite() on a committed transaction.
Code:

// lines 1243-1247                                                                                                                                                                                                                     
return when(loading, () => {                                                                                                                                                                                                           
    this.#changes = updates;                                                                                                                                                                                                              
    this._writeUpdate(id, this.#changes, false);                                                                                                                                                                                          
    return this;                                                                                                                                                                                                                          
});                                                                                                                                                                                                                                    
...                                                                                                                                                                                                                                    
// line 1251-1252                                                                                                                                                                                                                      
this._writeUpdate(id, this.#changes, fullUpdate);                                                                                                                                                                                      
return this;                                                                                                                                                                                                                           

Suggested fix:
Wrap both invocations of _writeUpdate inside the update method with Harper's when() helper so that the promise is properly propagated back to the caller when an asynchronous hook is active:

return when(loading, () => {                                                                                                                                                                                                           
    this.#changes = updates;                                                                                                                                                                                                              
    return when(this._writeUpdate(id, this.#changes, false), () => this);                                                                                                                                                                 
});                                                                                                                                                                                                                                    
...                                                                                                                                                                                                                                    
return when(this._writeUpdate(id, this.#changes, fullUpdate), () => this);                                                                                                                                                             

──────

2. coerceType for Vector returns Float32Array which is mangled by the Record Encoder — HIGH (Confirmed)

File: Table.ts
Domain: data-integrity
What: coerceType converts incoming Vector attribute values into a Float32Array . However, as documented in createDefaultEmbedder , Harper's record encoder ( updateAndFreeze ) does not cleanly serialize typed arrays, instead
mangling them into {0: val1, 1: val2, ...} objects in the database.
Why it matters: While the default embedder circumvents this by casting its result back to Array , any user-supplied vectors (e.g., replication receiver paths, programmatic writes, or manual API uploads) will pass through
coerceType and be coerced to a Float32Array . These will get stored in the database in a mangled object format, corrupting the vector column and breaking subsequent HNSW indexing and search operations.
Code:

case 'Vector':                                                                                                                                                                                                                         
    if (value === null || value === 'null') return null;                                                                                                                                                                                  
    if (value instanceof Float32Array) return value;                                                                                                                                                                                      
    if (Array.isArray(value)) return Float32Array.from(value);                                                                                                                                                                            
    if (ArrayBuffer.isView(value)) return Float32Array.from(value as any);                                                                                                                                                                
    throw new SyntaxError();                                                                                                                                                                                                              

Suggested fix:
Coerce vectors to standard JavaScript arrays ( number[] ) instead of Float32Array s, ensuring they bypass record-encoder mangling while remaining compatible with HNSW:

case 'Vector':                                                                                                                                                                                                                         
    if (value === null || value === 'null') return null;                                                                                                                                                                                  
    if (value instanceof Float32Array) return Array.from(value);                                                                                                                                                                          
    if (Array.isArray(value)) return value.map(Number);                                                                                                                                                                                   
    if (ArrayBuffer.isView(value)) return Array.from(value as any);                                                                                                                                                                       
    throw new SyntaxError();                                                                                                                                                                                                              

──────

3. Model Change via Schema Reload causes Dimension Mismatches or Mixed-Model Vectors in HNSW — MEDIUM (Confirmed)

File: graphql.ts and embedHook.ts (comments)
Domain: data-integrity / config
What: Modifying the @embed(model: "...") argument in a schema file updates the attribute version, triggering a schema reload and an HNSW reindex. However, existing rows are not automatically re-embedded through the new model.
Why it matters: If a user changes their embedding model (e.g., swapping a 384-dimension model for a 1536-dimension model), existing rows retain their old-dimension vectors. Loading these mixed-dimension vectors into the same HNSW
index
will either trigger a crash during index construction or severely pollute the embedding space, returning garbage query results.
Suggested fix:

  1. Document this operational hazard explicitly.
  2. Consider adding validation to block schema updates that change model definitions on populated tables unless the vector attribute is cleared, or plan a background job pathway to invalidate and backfill the vector column when a
    model version change is detected.
    ──────

4. Malformed @embed Directives Fail Silently during Schema Deployment — LOW (Confirmed)

File: graphql.ts
Domain: api / config
What: When a malformed @embed directive is encountered (e.g., omitting source or model , or supplying non-string literals), the schema compiler logs via console.error and skips registering the @embed property metadata, but
continues compiling.
Why it matters: Because no exception is thrown, the overall schema installation succeeds with a 200 OK . The field is silently registered as a regular, un-embedded Vector attribute. Users receive no indication that their @embed
directive was discarded unless they inspect the server console logs.
Code:

if (!embedDefinition.source || !embedDefinition.model) {
    console.error(
        `@embed on "${property.name}" requires both "source" and "model" arguments, at`,
        directive.loc
    );
}

Suggested fix:
Throw a descriptive ClientError to halt compilation and fail the schema deployment loudly, returning a helpful error message to the client:

if (!embedDefinition.source || !embedDefinition.model) {
    throw new ClientError(
        `@embed on "${property.name}" requires both "source" and "model" arguments`,
        400
    );
}

──────

5. Dependency on globalThis.logger in embedHook.ts — LOW (Likely)

File: embedHook.ts
Domain: config
What: Failure logging in embedHook.ts relies on resolving the logger dynamically via (globalThis as any).logger rather than using standard module imports.
Why it matters: Relying on globalThis for logger references deviates from ES module standards, bypasses static type checks, and can lead to unhandled runtime exceptions if the logger isn't populated on globalThis in specific
worker threads.
Code:

const logger = (globalThis as any).logger;
logger?.error?.(`Embedder for attribute "${attr.name}" failed:`, err);

Suggested fix:
Import the standard Harper logger statically:

import { logger } from '#src/utility/logging/logger.ts';

Comment thread resources/models/embedHook.ts Outdated
if (!embedder) continue;
let vector;
try {
vector = await embedder(record);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sequential? Why not in parallel?

Comment thread resources/models/embedHook.ts Outdated
for (const attr of embedAttributes) {
const sourceKey = attr.embed?.source;
if (!sourceKey) continue;
if (!Object.prototype.hasOwnProperty.call(record, sourceKey)) continue;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So paranoid. We have prototype pollution protection (freezing). (not a blocker, just a nit)

Comment thread resources/Table.ts
return date;
}
return new Date(+value); // epoch ms number
case 'Vector':
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever idea. But does it really mean only Float32? Or should this also allow a Float64 and Float16? (seems like it should). (I would actually love to move to using Float16 in HNSW)

Kris's review (with Gemini cross-check) flagged six items + one nit; all
applied. Local: 17/17 unit + 4/4 integration green.

CRITICAL — silent data loss on PATCH:
  TableResource.update() called this._writeUpdate(...) at two sites
  (lines ~1247 and ~1257) and dropped the returned promise. With @embed
  active, _writeUpdate returns a pending promise; without awaiting it,
  the caller's `save()` ran before the write was registered on the txn —
  the embed never reached storage and the txn committed empty (or
  crashed with "transaction already closed" when the embedder finally
  resolved). Threaded both call sites through `when(...)` matching the
  pattern already in put/create.

HIGH — Float32Array storage mangling on user-supplied vectors:
  coerceType returned a Float32Array for `type: 'Vector'`. The record
  encoder mangles typed arrays via `updateAndFreeze` into `{0,1,2,...}`
  maps — the same bug we already worked around in `createDefaultEmbedder`
  by emitting `Array<number>`. coerceType is on the input path for
  user-supplied vectors (REST writes, replication receivers,
  programmatic API), so it had the same hazard. Switched to
  `Array<number>` here too.

MEDIUM — getFromSource cache writes did not run the embedder:
  Caching-table populations (`sourcedFrom()` → `getFromSource()`) build
  their own write op at Table.ts:~4511 and call `addWrite` directly,
  bypassing `_writeUpdate`. So an `@embed` declared on a caching table
  silently never embedded. Wired `buildEmbedBefore` into that path with
  the same "await before addWrite" pattern. This is the canonical use
  case Kris highlighted (and matches #750's derived-cache-table follow-
  up — caching tables are where embeddings on derived data live).

LOW — malformed @embed silently degraded:
  Missing `source:` or `model:` only logged `console.error` and let
  schema deployment succeed — the resulting Vector field was registered
  as a regular un-embedded attribute. Promote to a thrown Error so the
  component install fails loudly. Note: this breaks the file's existing
  "log and continue" convention for other directive validations (dup
  primary key etc.); justified inline because @embed's silent-degrade
  failure mode is harder to diagnose than a missing primary key.

LOW — globalThis.logger:
  Replaced with a lazy `require('#src/utility/logging/logger')` resolver.
  The cleaner static `import { logger } from ...` trips the documented
  `common_utils.ts ↔ harper_logger.ts` CJS cycle (ERR_REQUIRE_CYCLE_MODULE)
  the moment this module loads via a unit-test path that bypasses the
  full transaction stack. Lazy resolution sidesteps the cycle while still
  going through standard module resolution (not globalThis lookup).

NITS — applied:
  - Parallelize per-attribute embedders via Promise.all (Kris). Typical
    schemas have one @embed field per table, but multi-@embed shouldn't
    serialize HTTP roundtrips for no benefit.
  - Simplify `Object.prototype.hasOwnProperty.call(record, key)` to
    `key in record` (Kris's "so paranoid, we have prototype-pollution
    protection (freezing)").

Open question (not in this commit):
  Should `Vector` storage preserve Float64/Float16 precision? Today
  everything flattens to Array<number> and is re-narrowed when HNSW
  needs Float32 for distance computation. Kris asked about Float16
  specifically; deferring to PR thread.

Refs #632 #510 #750

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Models] Phase 5 — @embed directive + write-time hook + auto-HNSW + model tracking

2 participants