feat(models): @embed directive + write-time hook + auto-HNSW (Phase 5 of #510) by heskew · Pull Request #747 · HarperFast/harper

heskew · 2026-05-22T19:01:33Z

Summary

Phase 5 of #510. @embed schema directive that auto-fires embedding on writes, stores the vector on the record, and HNSW-indexes the column — no app code.

type Document @table @export {
  id: ID @primaryKey
  content: String
  embedding: Vector @embed(source: "content", model: "default")
}

Three lines replace a custom embedding pipeline.

What ships

Parser (resources/graphql.ts): @embed(source, model) branch added alongside @computed. Vector added to PRIMITIVE_TYPES. Validates both args are present and string-typed. Auto-attaches @indexed(type: "HNSW") when no explicit @indexed.
API (resources/Table.ts): Table.setEmbedAttribute(name, embedder) mirrors setComputedAttribute for component-author overrides. Default embedder auto-registered at schema load.
Write-time hook (resources/Table.ts, resources/models/embedHook.ts): chained into the existing preCommitBlobsForRecordBefore slot at the main put/patch site. Commit blocks on the embedder (sync-by-default).
Vector coerceType case: JSON Array<number> and Float32Array both accepted and normalized to Float32Array for storage.
Replicated-write predicate: skips on options.isNotification === true (cluster-subscribe), context.replicateFrom === false (REST x-replicate-from: none), and context.alreadyLogged === true (audit replay). Receivers store the originator's vector as-is — preserves cross-cluster embedding-space consistency.
Embedder error sanitization: backend URLs / model IDs / API-key tails stay in server logs; the caller sees Failed to compute embedding for attribute "<name>".
Schema-reload safety: the static embedAttributes filter is recomputed in updatedAttributes() so in-place Table.attributes.splice() at databases.ts:940 doesn't leave a stale snapshot.

Open decision (resolved per Kris's #632 comment)

Sync-by-default per Kris's preference — fits the existing write.before pattern with zero new infrastructure and is expected to be the overwhelming common case ("I would expect that's overwhelmingly the mode that should and would be used").

Low-latency follow-up direction

Per Kris's #632 comment, the right way to minimize write latency is not queued-via-server/jobs/ (the issue body's original suggestion). It's a derived caching table pattern: the origin table commits immediately without waiting for the embedder; a separate derived table holds the vector and follows sync mode against the origin. Origin write latency stays at LMDB-speed; derived-table writes do the embedding work asynchronously.

The same primitive lands re-embed-on-model-change backfill — the derived table is the natural place to iterate-and-rewrite when the schema's @embed(model: ...) changes.

Model-change invalidation — partial

The parser sets property.version = "embed:<model>", so databases.ts:1111's version-change-driven reindex fires when the model changes. But that pathway re-INDEXES existing vectors; it does NOT re-EMBED source fields through the new model. New writes after a model swap pick up the new model; existing rows keep their old-model vectors until re-written. Full re-embed-on-model-change backfill lands with the derived-table follow-up (above).

Verifier cadence

Per the phase-cadence plan, Phase 5 was scheduled for /ultrareview (write-time hook + auto-indexing is non-trivial new behavior). Locally I ran deep-review --multipass (4 domain agents × 2 runs = 8 agents) over the working tree pre-PR. The multipass surfaced:

(fixed pre-PR) cluster-subscribe replication path doesn't set replicateFrom/alreadyLogged — predicate now also checks options.isNotification === true.
(fixed pre-PR) Vector was added to PRIMITIVE_TYPES but lacked a coerceType case — added.
(fixed pre-PR) stale embedAttributes snapshot on schema reload — recomputed in updatedAttributes().
(fixed pre-PR) @embed(...) with missing source: or model: silently degraded to a no-op — now errors with directive location.
(fixed pre-PR) embedder backend errors could leak URLs / key tails — sanitized.

/ultrareview recommended pre-merge for multi-model cross-check (this PR's domain is new-behavior class).

Test plan

Unit: 17 tests in unitTests/resources/models/embedHook.test.js — default embedder, replication predicate (all 3 signals), source-presence semantics, error sanitization.
Build: npm run build clean.
Lint: oxlint --deny-warnings clean for touched files.
Integration: end-to-end with a fake Ollama HTTP server inside the test (integrationTests/server/embed-directive.test.ts) — happy path, source-unchanged PATCH (no re-embed), REST x-replicate-from: none (receiver path) skip.
CI green (3 Node versions).
/ultrareview cross-model pass before merge.

Files

Path	Change
`resources/graphql.ts`	extended — parse `@embed`, add `'Vector'`, validate args, auto-HNSW
`resources/Table.ts`	extended — `setEmbedAttribute`, write-time hook, `Vector` coerceType, `embedAttributes` refresh in `updatedAttributes()`
`resources/models/embedHook.ts`	new — default embedder + pre-commit builder + replication predicate + error sanitization
`unitTests/resources/models/embedHook.test.js`	new — 17 tests

Out of scope (deferred follow-ups)

Derived caching table for low-latency writes (Kris's preferred direction — see above) — tracked at [Models] Derived caching table for @embed (low-latency writes + model-change backfill) #750. Same primitive lands re-embed-on-model-change backfill.
Cross-model migration tooling (harper kb reembed --from X --to Y).
Multimodal embeddings (image, audio).
Pre-existing tech debt at resources/graphql.ts:170 — the unknown-directive warning condition reads server.knownGraphQLDirectives.includes(directiveName) (warns when the directive IS known). Not touched in this PR.

Tracking: #510
Closes #632

🤖 Generated with Claude Code

…hase 5 of #510) Schema-level RAG without app code: marking a field with `@embed(source: "<field>", model: "<logical-name>")` makes Harper compute the embedding on every originating write, store it on the record, and index it via HNSW — no application hooks required. Parser (resources/graphql.ts): - 'embed' added to knownGraphQLDirectives; 'Vector' added to PRIMITIVE_TYPES (with a coerceType case in Table.ts that accepts Float32Array / Array<number> / typed-array views). - `@embed` populates `property.embed = { source, model }` and sets `property.version = "embed:<model>"` so the existing schema-load version-change-driven reindex pathway (databases.ts:1111) fires on model swap. (Note: today that pathway re-indexes existing vectors but does NOT re-embed source fields through the new model; full re-embed backfill is a follow-up.) - Auto-attaches `property.indexed = { type: 'HNSW' }` when no explicit `@indexed` is present — the issue's "no app code needed" statement only holds with auto-indexing. - Validates both `source:` and `model:` are present and string-typed; rejects variables / non-string nodes with a directive-location error so typoed schemas fail loudly instead of silently producing `undefined`. API + write-time hook (resources/Table.ts): - Static `userEmbedders` map and `embedAttributes` filtered list (the list is recomputed in `updatedAttributes()` so in-place schema reload at `databases.ts:940` doesn't leave a stale snapshot). - `Table.setEmbedAttribute(name, embedder)` static method mirrors the `setComputedAttribute` registration pattern; component authors override the default via this surface. - Default embedder auto-registered at schema load when an attribute carries `@embed` — sits OUTSIDE the if/else-if resolver chain so `@embed`-decorated fields still get their `customIndex.propertyResolver` wired for HNSW. - `buildEmbedBefore` chained into the existing `preCommitBlobsForRecordBefore` slot at the main put/patch site — the embedder mutates the recordUpdate during the transaction's `before` phase, so commit blocks on it (sync-by-default; queued mode deferred to a follow-up). Default embedder + replicated-write predicate (resources/models/embedHook.ts): - `createDefaultEmbedder({source, model})` returns a callback that calls `Models.embed(record[source], { model, inputType: 'document' })` and returns the first vector. `Models` is lazy-required so the module is unit-testable without dragging the transaction stack into module load. - `buildEmbedBefore` skips on ALL three replication-receiver signals: `options.isNotification === true` (cluster-subscribe path in Table.ts:325-373), `context.replicateFrom === false` (REST `x-replicate-from: none`), and `context.alreadyLogged === true` (local audit replay at replayLogs.ts:52). On receivers the originator's vector is stored as-is, preserving cross-cluster embedding-space consistency. - Embedder errors are caught and rethrown as a sanitized message `Failed to compute embedding for attribute "<name>"` — backend URLs, model IDs, and API-key tails stay in server logs only. Tracking: #510 Closes #632 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

17 tests covering createDefaultEmbedder + buildEmbedBefore: - default embedder: source-field read, document inputType, null/undefined source handling, non-string source stringification. - buildEmbedBefore replication predicate: skip on `options.isNotification === true` (cluster-subscribe), `context.replicateFrom === false` (REST suppression), `context.alreadyLogged === true` (replay); fire on local-originating writes where replicateFrom is undefined. - source-field semantics: skip when source not in payload (PATCH that omits source survives via patch-merge), clear embedding to null when source is explicitly null, run when source is present. - per-attribute correctness: skip attributes whose source is absent in multi-`@embed` tables; skip attributes with no registered embedder; write null when embedder returns null. - error sanitization: backend errors (URLs, API-key tails) are NOT propagated to the caller; the sanitized message references only the attribute name. Test seam `__setEmbedFnForTest` lets the suite drive the embedder without instantiating Harper's full transaction stack. Tracking: #510 Refs #632 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@embed

Spins up a fake Ollama HTTP server inside the test, points Harper's models.embedding.default config at it, deploys a schema with @embed, and exercises three paths end-to-end. Replaces the "manual smoke test" bullet that was on the PR description with a CI-runnable test. Covered paths: 1. Happy path — POST a record → fake-ollama returns a deterministic 3-element Float32 vector → record stores it at the @embed-decorated field. Confirms exact-byte equality between the fake's response and what's persisted. 2. Source-unchanged PATCH — PATCH a record with a non-source field → fake-ollama call count stays flat; the existing embedding survives via patch-merge. 3. Replication-receiver skip (REST path) — POST with x-replicate-from: none header and a pre-supplied vector → no embed call is made; the supplied vector is stored as-is. The cluster-subscribe replication path (options.isNotification === true) is already covered by the embedHook unit tests. The fake-ollama server tracks call count + last inputs, so we assert not just on "vector got written" but on "embedder fired exactly when we expect, and not when we don't". The fake matches Ollama's actual /api/embed wire format (POST { model, input: string[] } → { embeddings: number[][] }) so the real OllamaBackend code path runs unchanged. Local note: macOS dev machines without `harper-integration-test-setup-loopback` can run via `HARPER_INTEGRATION_TEST_FORCE_LOOPBACK=1` to bypass the loopback address pool. CI uses the pool natively. Refs #632 #510 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-22T22:19:38Z

+				// `customIndex.propertyResolver` branch needs to fire for them.
+				// Component authors override the default via
+				// `Table.setEmbedAttribute(name, customEmbedder)` after schema load.
+				if (attribute.embed && !this.userEmbedders[attribute.name]) {


Default embedder not refreshed on model change.

The !this.userEmbedders[attribute.name] guard correctly avoids clobbering a component-author override set via setEmbedAttribute, but it also prevents the auto-generated default embedder from being updated when the schema is reloaded with a different model: via the databases.ts:940 in-place attributes.splice() path.

updatedAttributes() is called on the existing Table class (not a fresh makeTable()), so userEmbedders carries over. After a @embed(model: "modelB") schema reload, userEmbedders["embedding"] still holds the modelA embedder, and new writes silently use the wrong model. The PR body claims "New writes after a model swap pick up the new model" — the code doesn't support that claim for in-place schema reloads.

To fix without clobbering user overrides, track which names were explicitly set by setEmbedAttribute (e.g., a static userSetEmbedders: Set<string> = new Set()) and skip refresh only for those:

Suggested change

if (attribute.embed && !this.userEmbedders[attribute.name]) {

if (attribute.embed && !TableResource.userSetEmbedders?.has(attribute.name)) {

…then add static userSetEmbedders: Set<string> = new Set() alongside userEmbedders and have setEmbedAttribute populate it.

Applied — setEmbedAttribute now tracks user-set names in static userSetEmbedders: Set<string>; defaults refresh on every updatedAttributes() call, user overrides are preserved. Pushed at b108804.

claude · 2026-05-22T22:19:43Z

+				if (value instanceof Float32Array) return value;
+				if (Array.isArray(value)) return Float32Array.from(value);
+				if (ArrayBuffer.isView(value)) return new Float32Array((value as ArrayBufferView).buffer);
+				throw new SyntaxError();


ArrayBuffer.isView branch reinterprets bytes rather than converting values.

new Float32Array(value.buffer) copies the raw bytes of the backing ArrayBuffer from offset 0. If value is a non-Float32Array typed array — e.g., Float64Array([1.0, 2.0, 3.0]) — the eight-byte-per-element encoding is reinterpreted as four-byte float32s, producing 6 values of garbage. The same issue exists for typed arrays with a nonzero byteOffset (subarrays/slices).

The fix is to convert element values rather than reinterpret bytes:

Suggested change

throw new SyntaxError();

if (ArrayBuffer.isView(value)) return Float32Array.from(value as any);

Float32Array.from iterates the view's elements and produces the correct float32 representation of each numeric value. (DataView is not iterable, but it is also not a plausible input here.)

Applied — switched to Float32Array.from(value) which iterates element values rather than reinterpreting bytes. Float64Array / subarray inputs now convert correctly. Pushed at b108804.

claude · 2026-05-22T22:19:54Z

2 blockers found (new findings on 47bd103c4).

1. Source-changing PATCH (the CRITICAL fix) has no test

File: resources/Table.ts lines ~1248, ~1257
What: update() now wraps both _writeUpdate call sites with when(...) to propagate the embed promise. The integration test only exercises the skip-path of update() — PATCH without the source field, where buildEmbedBefore returns undefined and _writeUpdate returns synchronously. No test exercises PATCH with the source field present, which is the only path that exercises the async branch the fix was written for.
Why it matters: This was the CRITICAL fix in 47bd103c4 — silent data loss / "transaction already closed" crash on source-changing PATCHes. The fix is plausible but the production path is unverified end-to-end.
Suggested fix: Add an integration test case: seed a record, PATCH it with { content: 'updated text' }, verify embed fires once and the stored vector matches deterministicVector('updated text').

2. `getFromSource` caching-table embed path has no test

File: resources/Table.ts ~line 4509
What: The MEDIUM fix from Kris's review wires buildEmbedBefore into the getFromSource write path (the path that bypasses _writeUpdate). Only the false branch of if (embedBefore) has pre-existing coverage; the true branch (embed fires on a caching table) has none.
Why it matters: Kris called this "one of the most compelling use cases." A silent mis-wire here would mean caching tables with @embed never embed — the same failure mode the fix was supposed to close.
Suggested fix: Add an integration test: a source table + a caching table with @embed, write to the source, verify the derived-table record has a populated vector.

…s commit Three bugs surfaced once the new integration test exercised the full REST write path against a fake-Ollama HTTP server. All three were fixed in this patch; the integration test now passes 4/4 locally and exercises the bugs that would otherwise have shipped silent. 1) `require('./Models.ts')` doesn't survive the dist build (the `.ts` extension stays literal at runtime; production resolves from `dist/*.js`). Switched to `require('#src/resources/models/Models')` which goes through package.json conditional exports — resolves to `.ts` under the `typestrip` condition (unit tests) and to `dist/*.js` in production. 2) The embedder was wired into the pre-commit `before` slot, which is awaited at `Promise.all(completions)` at txn-commit time — AFTER each write's `commit(...)` closure has already stored the record. That pattern works for blob byte-writes (which reference pre-allocated blob IDs) but not for `@embed` where the vector itself must be on the record at commit time. Moved the embedder call to fire BEFORE `transaction.addWrite(...)` and threaded the resulting promise back through `_writeUpdate` → `create()` / `update()` via `when()` so the caller chain (and ultimately the static-create transaction wrapper) awaits the embedding before committing. 3) The default embedder returned a `Float32Array`, but Harper's record encoder mangles typed arrays via `updateAndFreeze` (it enumerates them as `{0,1,2,...}` maps, breaking the byte round-trip). The integration test confirmed: a record POSTed with a plain `Array<number>` round-trips correctly; one with a `Float32Array` came back as zero-bytes. Changed the default embedder to return `Array<number>` (HNSW's `propertyResolver` and `customIndex.index` both accept it). Unit test updated to assert the Array shape. Integration test (`integrationTests/server/embed-directive.test.ts`) exercises: - happy path: POST → embedder runs → vector stored, decoded back - source-unchanged PATCH: no new embed call; existing vector survives - REST `x-replicate-from: none`: embedder skipped, supplied vector stored Refs #632 #510 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two real blockers from PR #747 review: 1) Default embedder not refreshed on model change. The `!this.userEmbedders[attribute.name]` guard prevented refreshing the auto-generated default embedder when a schema reload via `databases.ts:940` in-place `attributes.splice()` calls `updatedAttributes()` again on the existing Table class — `userEmbedders` carried over, so a `@embed(model: "B")` swap from `"A"` left writes silently using the old model. Track user-set override names in `static userSetEmbedders: Set<string>` and only skip refresh for those; defaults are refreshed every time. `setEmbedAttribute` populates the set. 2) `ArrayBuffer.isView` branch reinterpreted bytes. `new Float32Array(value.buffer)` copies the raw bytes of the backing ArrayBuffer from offset 0 — when `value` is a `Float64Array`, the 8-byte-per-element encoding gets reinterpreted as 4-byte float32s (garbage). Same issue with typed-array subarrays (nonzero `byteOffset`). Use `Float32Array.from(value)` to iterate element values and convert numerically. Both flagged inline by claude-bot on `resources/Table.ts:3433` and `resources/Table.ts:4822`. Integration test still passes 4/4, unit tests 17/17. Refs #632 #510 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kriszyp

It seems like we are entirely missing support for embeddings with getFromSource writes. And using embeddings with caching tables actually seems like one of the most compelling uses cases.

My Gemini review caught a number of things:

1. Unhandled Promise in TableResource.update leads to Transaction Commits before Embed resolves — CRITICAL (Confirmed)

File: Table.ts
Domain: concurrency
What: The update(updates, fullUpdate) instance method of TableResource calls this._writeUpdate(...) which returns a Promise when @embed is active. However, update() completely ignores this promise and returns this
synchronously.
Why it matters: Because update() returns synchronously, the calling application will immediately proceed to run await instance.save() . However, the asynchronous embedder operation is still pending, meaning the write has not been
registered in the transaction's write list yet. The transaction will either commit without saving the updated record (causing silent data loss) or crash with a "Transaction already closed" error when the embedder finally resolves and
tries to invoke addWrite() on a committed transaction.
Code:

// lines 1243-1247                                                                                                                                                                                                                     
return when(loading, () => {                                                                                                                                                                                                           
    this.#changes = updates;                                                                                                                                                                                                              
    this._writeUpdate(id, this.#changes, false);                                                                                                                                                                                          
    return this;                                                                                                                                                                                                                          
});                                                                                                                                                                                                                                    
...                                                                                                                                                                                                                                    
// line 1251-1252                                                                                                                                                                                                                      
this._writeUpdate(id, this.#changes, fullUpdate);                                                                                                                                                                                      
return this;

Suggested fix:
Wrap both invocations of _writeUpdate inside the update method with Harper's when() helper so that the promise is properly propagated back to the caller when an asynchronous hook is active:

return when(loading, () => {                                                                                                                                                                                                           
    this.#changes = updates;                                                                                                                                                                                                              
    return when(this._writeUpdate(id, this.#changes, false), () => this);                                                                                                                                                                 
});                                                                                                                                                                                                                                    
...                                                                                                                                                                                                                                    
return when(this._writeUpdate(id, this.#changes, fullUpdate), () => this);

──────

2. coerceType for Vector returns Float32Array which is mangled by the Record Encoder — HIGH (Confirmed)

File: Table.ts
Domain: data-integrity
What: coerceType converts incoming Vector attribute values into a Float32Array . However, as documented in createDefaultEmbedder , Harper's record encoder ( updateAndFreeze ) does not cleanly serialize typed arrays, instead
mangling them into {0: val1, 1: val2, ...} objects in the database.
Why it matters: While the default embedder circumvents this by casting its result back to Array , any user-supplied vectors (e.g., replication receiver paths, programmatic writes, or manual API uploads) will pass through
coerceType and be coerced to a Float32Array . These will get stored in the database in a mangled object format, corrupting the vector column and breaking subsequent HNSW indexing and search operations.
Code:

case 'Vector':                                                                                                                                                                                                                         
    if (value === null || value === 'null') return null;                                                                                                                                                                                  
    if (value instanceof Float32Array) return value;                                                                                                                                                                                      
    if (Array.isArray(value)) return Float32Array.from(value);                                                                                                                                                                            
    if (ArrayBuffer.isView(value)) return Float32Array.from(value as any);                                                                                                                                                                
    throw new SyntaxError();

Suggested fix:
Coerce vectors to standard JavaScript arrays ( number[] ) instead of Float32Array s, ensuring they bypass record-encoder mangling while remaining compatible with HNSW:

case 'Vector':                                                                                                                                                                                                                         
    if (value === null || value === 'null') return null;                                                                                                                                                                                  
    if (value instanceof Float32Array) return Array.from(value);                                                                                                                                                                          
    if (Array.isArray(value)) return value.map(Number);                                                                                                                                                                                   
    if (ArrayBuffer.isView(value)) return Array.from(value as any);                                                                                                                                                                       
    throw new SyntaxError();

──────

3. Model Change via Schema Reload causes Dimension Mismatches or Mixed-Model Vectors in HNSW — MEDIUM (Confirmed)

File: graphql.ts and embedHook.ts (comments)
Domain: data-integrity / config
What: Modifying the @embed(model: "...") argument in a schema file updates the attribute version, triggering a schema reload and an HNSW reindex. However, existing rows are not automatically re-embedded through the new model.
Why it matters: If a user changes their embedding model (e.g., swapping a 384-dimension model for a 1536-dimension model), existing rows retain their old-dimension vectors. Loading these mixed-dimension vectors into the same HNSW
index
will either trigger a crash during index construction or severely pollute the embedding space, returning garbage query results.
Suggested fix:

Document this operational hazard explicitly.
Consider adding validation to block schema updates that change model definitions on populated tables unless the vector attribute is cleared, or plan a background job pathway to invalidate and backfill the vector column when a
model version change is detected.
──────

4. Malformed @embed Directives Fail Silently during Schema Deployment — LOW (Confirmed)

File: graphql.ts
Domain: api / config
What: When a malformed @embed directive is encountered (e.g., omitting source or model , or supplying non-string literals), the schema compiler logs via console.error and skips registering the @embed property metadata, but
continues compiling.
Why it matters: Because no exception is thrown, the overall schema installation succeeds with a 200 OK . The field is silently registered as a regular, un-embedded Vector attribute. Users receive no indication that their @embed
directive was discarded unless they inspect the server console logs.
Code:

if (!embedDefinition.source || !embedDefinition.model) {
    console.error(
        `@embed on "${property.name}" requires both "source" and "model" arguments, at`,
        directive.loc
    );
}

Suggested fix:
Throw a descriptive ClientError to halt compilation and fail the schema deployment loudly, returning a helpful error message to the client:

if (!embedDefinition.source || !embedDefinition.model) {
    throw new ClientError(
        `@embed on "${property.name}" requires both "source" and "model" arguments`,
        400
    );
}

──────

5. Dependency on globalThis.logger in embedHook.ts — LOW (Likely)

File: embedHook.ts
Domain: config
What: Failure logging in embedHook.ts relies on resolving the logger dynamically via (globalThis as any).logger rather than using standard module imports.
Why it matters: Relying on globalThis for logger references deviates from ES module standards, bypasses static type checks, and can lead to unhandled runtime exceptions if the logger isn't populated on globalThis in specific
worker threads.
Code:

const logger = (globalThis as any).logger;
logger?.error?.(`Embedder for attribute "${attr.name}" failed:`, err);

Suggested fix:
Import the standard Harper logger statically:

import { logger } from '#src/utility/logging/logger.ts';

kriszyp · 2026-05-23T02:06:31Z

+			if (!embedder) continue;
+			let vector;
+			try {
+				vector = await embedder(record);


Sequential? Why not in parallel?

kriszyp · 2026-05-23T02:07:16Z

+		for (const attr of embedAttributes) {
+			const sourceKey = attr.embed?.source;
+			if (!sourceKey) continue;
+			if (!Object.prototype.hasOwnProperty.call(record, sourceKey)) continue;


So paranoid. We have prototype pollution protection (freezing). (not a blocker, just a nit)

kriszyp · 2026-05-23T02:14:58Z

 					return date;
 				}
 				return new Date(+value); // epoch ms number
+			case 'Vector':


This is a clever idea. But does it really mean only Float32? Or should this also allow a Float64 and Float16? (seems like it should). (I would actually love to move to using Float16 in HNSW)

@embed

Kris's review (with Gemini cross-check) flagged six items + one nit; all applied. Local: 17/17 unit + 4/4 integration green. CRITICAL — silent data loss on PATCH: TableResource.update() called this._writeUpdate(...) at two sites (lines ~1247 and ~1257) and dropped the returned promise. With @embed active, _writeUpdate returns a pending promise; without awaiting it, the caller's `save()` ran before the write was registered on the txn — the embed never reached storage and the txn committed empty (or crashed with "transaction already closed" when the embedder finally resolved). Threaded both call sites through `when(...)` matching the pattern already in put/create. HIGH — Float32Array storage mangling on user-supplied vectors: coerceType returned a Float32Array for `type: 'Vector'`. The record encoder mangles typed arrays via `updateAndFreeze` into `{0,1,2,...}` maps — the same bug we already worked around in `createDefaultEmbedder` by emitting `Array<number>`. coerceType is on the input path for user-supplied vectors (REST writes, replication receivers, programmatic API), so it had the same hazard. Switched to `Array<number>` here too. MEDIUM — getFromSource cache writes did not run the embedder: Caching-table populations (`sourcedFrom()` → `getFromSource()`) build their own write op at Table.ts:~4511 and call `addWrite` directly, bypassing `_writeUpdate`. So an `@embed` declared on a caching table silently never embedded. Wired `buildEmbedBefore` into that path with the same "await before addWrite" pattern. This is the canonical use case Kris highlighted (and matches #750's derived-cache-table follow- up — caching tables are where embeddings on derived data live). LOW — malformed @embed silently degraded: Missing `source:` or `model:` only logged `console.error` and let schema deployment succeed — the resulting Vector field was registered as a regular un-embedded attribute. Promote to a thrown Error so the component install fails loudly. Note: this breaks the file's existing "log and continue" convention for other directive validations (dup primary key etc.); justified inline because @embed's silent-degrade failure mode is harder to diagnose than a missing primary key. LOW — globalThis.logger: Replaced with a lazy `require('#src/utility/logging/logger')` resolver. The cleaner static `import { logger } from ...` trips the documented `common_utils.ts ↔ harper_logger.ts` CJS cycle (ERR_REQUIRE_CYCLE_MODULE) the moment this module loads via a unit-test path that bypasses the full transaction stack. Lazy resolution sidesteps the cycle while still going through standard module resolution (not globalThis lookup). NITS — applied: - Parallelize per-attribute embedders via Promise.all (Kris). Typical schemas have one @embed field per table, but multi-@embed shouldn't serialize HTTP roundtrips for no benefit. - Simplify `Object.prototype.hasOwnProperty.call(record, key)` to `key in record` (Kris's "so paranoid, we have prototype-pollution protection (freezing)"). Open question (not in this commit): Should `Vector` storage preserve Float64/Float16 precision? Today everything flattens to Array<number> and is re-narrowed when HNSW needs Float32 for distance computation. Kris asked about Float16 specifically; deferring to PR thread. Refs #632 #510 #750 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

heskew and others added 2 commits May 22, 2026 11:59

heskew requested a review from kriszyp May 22, 2026 19:01

heskew marked this pull request as ready for review May 22, 2026 19:01

heskew and others added 2 commits May 22, 2026 12:03

style: prettier --write

b1d0a47

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude Bot reviewed May 22, 2026

View reviewed changes

heskew and others added 2 commits May 22, 2026 15:55

kriszyp requested changes May 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(models): @embed directive + write-time hook + auto-HNSW (Phase 5 of #510)#747

feat(models): @embed directive + write-time hook + auto-HNSW (Phase 5 of #510)#747
heskew wants to merge 7 commits into
mainfrom
feat/models-embed-directive

heskew commented May 22, 2026 •

edited

Loading

Uh oh!

claude Bot May 22, 2026

Uh oh!

heskew May 22, 2026

Uh oh!

claude Bot May 22, 2026

Uh oh!

heskew May 22, 2026

Uh oh!

claude Bot commented May 22, 2026 •

edited

Loading

Uh oh!

kriszyp left a comment

Uh oh!

kriszyp May 23, 2026

Uh oh!

kriszyp May 23, 2026

Uh oh!

kriszyp May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	if (attribute.embed && !this.userEmbedders[attribute.name]) {
	if (attribute.embed && !TableResource.userSetEmbedders?.has(attribute.name)) {

	throw new SyntaxError();
	if (ArrayBuffer.isView(value)) return Float32Array.from(value as any);

Conversation

heskew commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What ships

Open decision (resolved per Kris's #632 comment)

Low-latency follow-up direction

Model-change invalidation — partial

Verifier cadence

Test plan

Files

Out of scope (deferred follow-ups)

Uh oh!

claude Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

heskew May 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

heskew May 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Source-changing PATCH (the CRITICAL fix) has no test

2. getFromSource caching-table embed path has no test

Uh oh!

kriszyp left a comment

Choose a reason for hiding this comment

1. Unhandled Promise in TableResource.update leads to Transaction Commits before Embed resolves — CRITICAL (Confirmed)

2. coerceType for Vector returns Float32Array which is mangled by the Record Encoder — HIGH (Confirmed)

3. Model Change via Schema Reload causes Dimension Mismatches or Mixed-Model Vectors in HNSW — MEDIUM (Confirmed)

4. Malformed @embed Directives Fail Silently during Schema Deployment — LOW (Confirmed)

5. Dependency on globalThis.logger in embedHook.ts — LOW (Likely)

Uh oh!

kriszyp May 23, 2026

Choose a reason for hiding this comment

Uh oh!

kriszyp May 23, 2026

Choose a reason for hiding this comment

Uh oh!

kriszyp May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

heskew commented May 22, 2026 •

edited

Loading

claude Bot commented May 22, 2026 •

edited

Loading

2. `getFromSource` caching-table embed path has no test