
feat: SimString - automatic embedding backed similarity search #11

Merged
matthewmcneely merged 5 commits into main from matthewmcneely/add-similarity-string-type
Feb 28, 2026


@matthewmcneely matthewmcneely commented Feb 27, 2026

Description

Introduces SimString, a string type that transparently manages vector embeddings and HNSW-indexed shadow predicates, eliminating the need for users to manually maintain VectorFloat32 fields or call embedding APIs.
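As a rough illustration of the tagging convention described here, the sketch below declares a stand-in `SimString` type and derives the `<field>__vec` shadow predicate name by reflection. The `Product` model, the `shadowPredicates` helper, and the choice to name the predicate after the JSON tag are all assumptions made for illustration, not the library's actual implementation.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// SimString is a stand-in for the type this PR introduces; the real
// definition lives in the library and may differ.
type SimString string

// Product is a hypothetical model with one embedding-backed field.
type Product struct {
	Name        string    `json:"name"`
	Description SimString `json:"description" dgraph:"embedding"`
}

// shadowPredicates returns "<field>__vec" for every struct field whose
// dgraph tag lists the "embedding" option. Using the JSON name as the
// predicate base is an assumption.
func shadowPredicates(model any) []string {
	t := reflect.TypeOf(model)
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return nil
	}
	var out []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		for _, opt := range strings.Split(f.Tag.Get("dgraph"), ",") {
			if strings.TrimSpace(opt) != "embedding" {
				continue
			}
			// Drop tag options such as ",omitempty" from the JSON name.
			name := strings.Split(f.Tag.Get("json"), ",")[0]
			if name == "" {
				name = f.Name
			}
			out = append(out, name+"__vec")
		}
	}
	return out
}

func main() {
	fmt.Println(shadowPredicates(Product{})) // prints [description__vec]
}
```

Only `Description` carries the tag, so it is the only field that would get a shadow vector predicate; `Name` stays a plain string.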

Checklist

  • Code compiles correctly and linting passes locally
  • Tests added for new functionality, or regression tests for bug fixes added as applicable

Summary by cubic

Adds SimString for automatic text embeddings with HNSW-backed similarity search and a simpler SimilarToText that embeds, queries, and fills the model. CI now starts a Dgraph container on Linux and runs unit tests against it.

  • New Features

    • SimString fields tagged dgraph:"embedding" auto-create a <field>__vec float32vector (HNSW) predicate via UpdateSchema.
    • EmbeddingProvider interface and OpenAI-compatible provider; configure with WithEmbeddingProvider(...) (client cache key includes the provider).
    • Automatic embedding on Insert/Upsert/Update; clears vectors when text is empty or below threshold.
    • Query helpers: SimilarTo(...) for precomputed vectors; SimilarToText(...) now embeds text, runs the query, and populates the model.
  • Migration

    • Tag fields as SimString with dgraph:"embedding"; optionally set metric, exponent, threshold. Provide an EmbeddingProvider and run UpdateSchema(...); ensure provider Dims() matches.
    • Update SimilarToText call sites: it returns only an error and populates the passed model.
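The migration note about matching `Dims()` can be made concrete with a small check. The sketch below is hypothetical: only `Dims()` is named in the summary, so the `Embed` signature (a real provider would likely also take a `context.Context`), the `fakeProvider` type, and the `embedChecked` helper are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// EmbeddingProvider is a guess at the interface's shape. Only Dims() is
// named in the PR summary; the Embed signature is an assumption.
type EmbeddingProvider interface {
	Embed(text string) ([]float32, error)
	Dims() int
}

// fakeProvider is a deterministic stand-in so the example needs no network.
type fakeProvider struct {
	dims     int // what Dims() reports
	produced int // how long the returned vectors actually are
}

func (p fakeProvider) Embed(text string) ([]float32, error) {
	vec := make([]float32, p.produced)
	for i := 0; i < len(text) && p.produced > 0; i++ {
		vec[i%p.produced] += float32(text[i])
	}
	return vec, nil
}

func (p fakeProvider) Dims() int { return p.dims }

var errDimsMismatch = errors.New("embedding dimension mismatch")

// embedChecked returns a nil vector for empty text (the caller would then
// clear the stored vector) and rejects vectors whose length disagrees
// with what the provider reports through Dims().
func embedChecked(p EmbeddingProvider, text string) ([]float32, error) {
	if text == "" {
		return nil, nil
	}
	vec, err := p.Embed(text)
	if err != nil {
		return nil, fmt.Errorf("embed: %w", err)
	}
	if len(vec) != p.Dims() {
		return nil, fmt.Errorf("%w: got %d, want %d", errDimsMismatch, len(vec), p.Dims())
	}
	return vec, nil
}

func main() {
	vec, err := embedChecked(fakeProvider{dims: 4, produced: 4}, "hello")
	fmt.Println(len(vec), err)

	_, err = embedChecked(fakeProvider{dims: 4, produced: 3}, "hello")
	fmt.Println(errors.Is(err, errDimsMismatch))
}
```

Catching the mismatch before the mutation is sent seems preferable to letting the database reject a vector of the wrong length for the indexed predicate.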

Written for commit f7125e3. Summary will update on new commits.


@cubic-dev-ai cubic-dev-ai bot left a comment


3 issues found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="mutate.go">

<violation number="1" location="mutate.go:126">
P2: The embedding transaction is committed without a subsequent `Discard`, which the Dgraph client docs recommend to clean up resources (and it’s safe after commit). Add a discard after commit and on commit error to avoid leaking txn resources.</violation>
</file>
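The pattern the reviewer is asking for is usually written as a deferred discard placed right after the transaction is created. The sketch below uses a hypothetical `txn` interface and `fakeTxn` type in place of the real Dgraph client so it can run on its own; the real methods also take a `context.Context`, which is omitted here, and `writeEmbedding` is an illustrative name rather than the function in `mutate.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// txn is a stand-in exposing only the two calls the review comment is
// about. It is not the Dgraph client's type.
type txn interface {
	Commit() error
	Discard() error
}

// fakeTxn records which calls were made so the behaviour can be observed.
type fakeTxn struct {
	commitErr error
	committed bool
	discarded bool
}

func (t *fakeTxn) Commit() error {
	if t.commitErr != nil {
		return t.commitErr
	}
	t.committed = true
	return nil
}

func (t *fakeTxn) Discard() error {
	t.discarded = true
	return nil
}

var errCommit = errors.New("commit failed")

// writeEmbedding runs the mutation and commits. The deferred Discard fires
// on every return path: after a mutation error, after a commit error, and
// after a successful commit.
func writeEmbedding(t txn, mutate func() error) error {
	defer func() { _ = t.Discard() }()

	if err := mutate(); err != nil {
		return fmt.Errorf("mutate: %w", err)
	}
	if err := t.Commit(); err != nil {
		return fmt.Errorf("commit: %w", err)
	}
	return nil
}

func main() {
	ok := &fakeTxn{}
	err := writeEmbedding(ok, func() error { return nil })
	fmt.Println(err, ok.committed, ok.discarded)

	bad := &fakeTxn{commitErr: errCommit}
	err = writeEmbedding(bad, func() error { return nil })
	fmt.Println(err, bad.committed, bad.discarded)
}
```

Using `defer` means no separate cleanup call is needed on each error branch.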

<file name="client.go">

<violation number="1" location="client.go:125">
P2: The client cache key does not include the new EmbeddingProvider option. Creating a client with a different embedding provider will reuse an existing cached client and ignore the new provider, which can generate embeddings with the wrong model or unexpectedly skip embedding generation.</violation>
</file>
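One way to address this kind of issue is to fold a provider identity into the cache key. The sketch below is a minimal, self-contained model of that idea; `provider`, `client`, `cacheKey`, `clientCache`, and the model-plus-dimensions identity string are all hypothetical and are not taken from `client.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// provider and client are simplified stand-ins for the real types.
type provider struct {
	model string
	dims  int
}

type client struct {
	uri      string
	provider *provider
}

// cacheKey includes a provider identity, so two clients that differ only
// in their embedding provider no longer collide.
type cacheKey struct {
	uri        string
	providerID string
}

// providerID derives a stable identity; a nil provider maps to "".
func providerID(p *provider) string {
	if p == nil {
		return ""
	}
	return fmt.Sprintf("%s/%d", p.model, p.dims)
}

type clientCache struct {
	mu      sync.Mutex
	clients map[cacheKey]*client
}

// get returns the cached client for (uri, provider), creating it on a miss.
func (c *clientCache) get(uri string, p *provider) *client {
	c.mu.Lock()
	defer c.mu.Unlock()

	key := cacheKey{uri: uri, providerID: providerID(p)}
	if cl, ok := c.clients[key]; ok {
		return cl
	}
	if c.clients == nil {
		c.clients = make(map[cacheKey]*client)
	}
	cl := &client{uri: uri, provider: p}
	c.clients[key] = cl
	return cl
}

func main() {
	cache := &clientCache{}
	a := cache.get("dgraph://localhost:9080", &provider{model: "model-a", dims: 4})
	b := cache.get("dgraph://localhost:9080", &provider{model: "model-b", dims: 4})
	fmt.Println(a == b, b.provider.model)
}
```

With the provider in the key, the second call builds a new client instead of silently reusing the one configured for `model-a`.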

<file name="embedding.go">

<violation number="1" location="embedding.go:483">
P1: Bug: `defer cleanup()` releases the pooled `*dgo.Dgraph` connection back to the pool when `SimilarToText` returns, but the returned `*dg.QueryBlock` still holds a reference to a transaction on that connection. When the caller later calls `.Scan()`, the underlying connection may already be in use by another operation, causing data races or query failures.

The cleanup function and the `*dgo.Dgraph` client should not be deferred here — they need to remain alive until after the caller finishes with the QueryBlock. Consider either: (1) returning the cleanup function alongside the QueryBlock so the caller can manage the lifecycle, or (2) executing the query inside this function and returning the results directly.</violation>
</file>
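Option (2) from the comment, running the query before the connection is released, can be sketched as follows. The `pool`, `conn`, and `similarToText` names are hypothetical stand-ins rather than the code in `embedding.go`; the shape (return only an error, fill the caller's value) follows the migration note in the summary.

```go
package main

import (
	"errors"
	"fmt"
)

var errReleased = errors.New("connection already released")

// conn is a stand-in for a pooled connection; query fails once the
// connection has been handed back to the pool.
type conn struct{ inUse bool }

func (c *conn) query(text string) ([]string, error) {
	if !c.inUse {
		return nil, errReleased
	}
	return []string{"match for " + text}, nil
}

// pool hands out its single connection together with a cleanup function.
type pool struct{ c *conn }

func (p *pool) acquire() (*conn, func()) {
	p.c.inUse = true
	return p.c, func() { p.c.inUse = false }
}

// similarToText executes the query while it still owns the connection and
// copies the results into out, so nothing that outlives the function
// refers to the released connection.
func similarToText(p *pool, text string, out *[]string) error {
	c, cleanup := p.acquire()
	defer cleanup()

	res, err := c.query(text)
	if err != nil {
		return fmt.Errorf("similarity query: %w", err)
	}
	*out = res
	return nil
}

func main() {
	p := &pool{c: &conn{}}
	var results []string
	err := similarToText(p, "wireless headphones", &results)
	fmt.Println(results, err, p.c.inUse)
}
```

The deferred cleanup is safe in this version because the query has already finished by the time the function returns.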

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".github/workflows/ci-go-unit-tests.yaml">

<violation number="1" location=".github/workflows/ci-go-unit-tests.yaml:45">
P2: Pin the Dgraph Docker image to a specific version to keep CI runs reproducible and avoid unexpected breakages when `latest` changes.</violation>

<violation number="2" location=".github/workflows/ci-go-unit-tests.yaml:47">
P2: Fail the job when Dgraph does not become ready after the retry loop so test runs don’t proceed against an unavailable service.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@matthewmcneely matthewmcneely merged commit 1685071 into main Feb 28, 2026
5 checks passed