Skip to content

fix(embedding): respect EMBEDDING_MODEL for local provider#943

Open
skywalker-35 wants to merge 1 commit into
rohitg00:mainfrom
skywalker-35:fix/issue-881-local-embedding-model
Open

fix(embedding): respect EMBEDDING_MODEL for local provider#943
skywalker-35 wants to merge 1 commit into
rohitg00:mainfrom
skywalker-35:fix/issue-881-local-embedding-model

Conversation

@skywalker-35

@skywalker-35 skywalker-35 commented Jun 16, 2026

Copy link
Copy Markdown

Summary

Fixes #881

LocalEmbeddingProvider currently hardcodes Xenova/all-MiniLM-L6-v2 (English-only, 384-dim) and ignores the EMBEDDING_MODEL environment variable. This makes non-English embedding models (e.g. Xenova/bge-large-zh-v1.5 for Chinese, 1024-dim) impossible to configure on the local provider, even though the same env var is the standard convention across the OpenAI / Gemini providers.

Changes

src/providers/embedding/local.ts:

  • Read EMBEDDING_MODEL env var to select the model (default: Xenova/all-MiniLM-L6-v2 — preserves existing behavior)
  • Resolve dimensions from a KNOWN_DIMS map (MiniLM / bge-zh / bge-m3 / multilingual-e5), with OPENAI_EMBEDDING_DIMENSIONS as an explicit override
  • Pass { local_files_only: true, quantized: false } to the pipeline — required because BGE's ONNX release ships only model.onnx, not model_quantized.onnx, and offline / restricted-network setups need the no-fetch path

Why a separate PR from #793

#793 introduces a new LOCAL_EMBEDDING_MODEL env var and changes the default model to paraphrase-multilingual-MiniLM-L12-v2. This PR takes a complementary, more conservative approach:

#793 This PR
Env var LOCAL_EMBEDDING_MODEL (new) EMBEDDING_MODEL (cross-provider convention)
Default model paraphrase-multilingual-MiniLM-L12-v2 Xenova/all-MiniLM-L6-v2 (unchanged)
Dimension resolution not handled KNOWN_DIMS map + OPENAI_EMBEDDING_DIMENSIONS override
Offline support unchanged local_files_only: true + quantized: false

Reasoning for reusing EMBEDDING_MODEL: it is already the cross-provider convention (used by the Gemini / OpenAI factories in index.ts). Adding a second env var for the local provider would force users to remember which one applies to which provider. The KNOWN_DIMS map covers every Xenova/* model currently in use, so the typical user never needs to set OPENAI_EMBEDDING_DIMENSIONS.

Happy to consolidate with #793 if reviewers prefer a single source of truth — the two approaches are not mutually exclusive. The dimension + offline bits here are independent improvements that would still apply regardless of which env-var name wins.

Verification

  • EMBEDDING_MODEL=Xenova/bge-large-zh-v1.5 → pipeline loads in ~3.9s, single embed ~0.23s, 1024-dim vector returned
  • End-to-end semantic search: stored observation "深度学习框架对比分析", queried with "神经网络训练平台比较" (zero shared keywords) → matches via vector cosine similarity
  • EMBEDDING_MODEL unset → behavior unchanged (Xenova/all-MiniLM-L6-v2, 384-dim)
  • OPENAI_EMBEDDING_DIMENSIONS=512 override for an unknown model → 512-dim returned
  • OPENAI_EMBEDDING_DIMENSIONS=abc (invalid) → throws explicit error rather than silently defaulting

Environment

  • agentmemory v0.9.21 (base)
  • @xenova/transformers v2.17.2
  • Node.js 22.x
  • Windows 11

Operational note for reviewers

@xenova/transformers does not respect the standard ~/.cache/huggingface redirect via mklink / junction — it looks only inside node_modules/@xenova/transformers/models/{namespace}/{model}/ (a flat layout, not the usual blobs/snapshots structure). For pre-downloaded models, that is the path that must contain the model files. The namespace is always Xenova/, regardless of what EMBEDDING_MODEL is set to.

Related

Summary by CodeRabbit

  • Improvements
    • Enhanced local embedding provider with configurable model selection via environment variables
    • Added support for custom embedding dimensions with validation
    • Improved error messages for invalid configuration overrides

…itg00#881)

LocalEmbeddingProvider currently hardcodes Xenova/all-MiniLM-L6-v2
(English-only, 384-dim) and ignores EMBEDDING_MODEL, making non-English
embedding models (bge-large-zh, bge-m3, multilingual-e5, etc.)
impossible to configure on the local provider.

Changes:
- Read EMBEDDING_MODEL env var (default: Xenova/all-MiniLM-L6-v2)
- Resolve dimensions from KNOWN_DIMS map (MiniLM / bge-zh / bge-m3 /
  multilingual-e5) with OPENAI_EMBEDDING_DIMENSIONS as override
- Pass local_files_only + quantized=false to the pipeline
  (BGE's ONNX release ships only model.onnx, not model_quantized.onnx,
  and offline / restricted-network setups need the no-fetch path)

Verified locally: EMBEDDING_MODEL=Xenova/bge-large-zh-v1.5 loads in
~3.9s, single embed ~0.23s, 1024-dim vector returned. Chinese
semantic search (stored 深度学习框架对比分析 / queried 神经网络训练平台比较,
zero shared keywords) returns matches via cosine similarity.

Fixes rohitg00#881
@vercel

vercel Bot commented Jun 16, 2026

Copy link
Copy Markdown

@skywalker-35 is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

LocalEmbeddingProvider in src/providers/embedding/local.ts gains a constructor that reads EMBEDDING_MODEL and OPENAI_EMBEDDING_DIMENSIONS environment variables, validates the dimension override as a positive integer, and resolves dimensions via a KNOWN_DIMS model-to-dimension map. The transformers pipeline initialization is updated to use the resolved model name with local_files_only: true and quantized: false.

Changes

Dynamic LocalEmbeddingProvider Model and Dimensions

Layer / File(s) Summary
KNOWN_DIMS map, resolveDimensions helper, and constructor
src/providers/embedding/local.ts
Adds KNOWN_DIMS record mapping model names to dimensions, DEFAULT_MODEL/DEFAULT_DIMS constants, a resolveDimensions helper that validates and parses OPENAI_EMBEDDING_DIMENSIONS, and a constructor that sets modelName from EMBEDDING_MODEL and computes dimensions via the helper. Adds getEnvVar import to support env-driven resolution.
Transformers pipeline initialization
src/providers/embedding/local.ts
Updates the pipeline call to use the instance's resolved modelName and adds { local_files_only: true, quantized: false } to model options.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

  • #881 (LocalEmbeddingProvider should respect EMBEDDING_MODEL env var for non-English embeddings): This PR directly implements the proposed fix from the issue — reading EMBEDDING_MODEL, resolving dimensions via KNOWN_DIMS, validating OPENAI_EMBEDDING_DIMENSIONS, and using local_files_only in the pipeline call.
  • #725: This PR makes the local embedding model and dimensions configurable via environment variables, matching the feature described in that issue.

Suggested reviewers

  • rohitg00

🐇 No more hardcoded 384,
The model name now comes from the env today!
KNOWN_DIMS maps each model right,
And local_files_only keeps things tight.
Multilingual embeddings, hip hip hooray! 🌍✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main change: enabling the local embedding provider to respect the EMBEDDING_MODEL environment variable for model selection.
Linked Issues check ✅ Passed The PR fully addresses issue #881 requirements: respects EMBEDDING_MODEL env var, supports non-English models, resolves dimensions from a KNOWN_DIMS map with OPENAI_EMBEDDING_DIMENSIONS override, and passes local_files_only and quantized parameters.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #881: modifying LocalEmbeddingProvider constructor to read EMBEDDING_MODEL, implement dimension resolution, and update pipeline initialization parameters.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/providers/embedding/local.ts (1)

4-12: ⚠️ Potential issue | 🔴 Critical

Pipeline type signature does not match the call at lines 93–97.

The type definition accepts only 2 parameters, but the call passes a third options object with { local_files_only: true, quantized: false }. This causes a TypeScript type error.

Fix
 type Pipeline = (
   task: string,
   model: string,
+  options?: { local_files_only?: boolean; quantized?: boolean },
 ) => Promise<
   (
     texts: string[],
     options: { pooling: string; normalize: boolean },
   ) => Promise<{ tolist: () => number[][] }>
 >;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/providers/embedding/local.ts` around lines 4 - 12, The Pipeline type
definition in this file only accepts two parameters (task and model), but the
actual call at lines 93–97 passes a third parameter containing options like
local_files_only and quantized. Update the Pipeline type signature to accept a
third optional parameter that represents the options object with the properties
being passed at the call site, ensuring the type matches what is actually being
invoked.
🧹 Nitpick comments (1)
src/providers/embedding/local.ts (1)

14-29: ⚡ Quick win

Remove WHAT-style inline comments in the model map.

These comments narrate category labels instead of encoding intent through structure/naming, which conflicts with the TypeScript guideline for this repo.

As per coding guidelines, “In TypeScript source code, avoid code comments explaining WHAT — use clear naming instead.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/providers/embedding/local.ts` around lines 14 - 29, The KNOWN_DIMS
constant contains inline comments that describe categories of models (such as
"// MiniLM 系列(英文)", "// BGE 中文系列", etc.) which explain WHAT the code does rather
than WHY. Per the TypeScript coding guidelines for this repo, these WHAT-style
comments should be removed since the model names themselves are sufficiently
clear and self-documenting. Delete all category label comments from the
KNOWN_DIMS mapping while keeping the model names and their corresponding
dimension values intact.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/providers/embedding/local.ts`:
- Around line 35-49: The resolveDimensions function has two issues that need
fixing. First, replace the parseInt and Number.isFinite validation for the
override parameter with strict regex validation that ensures the entire override
string is a positive integer with no partial matches or decimals allowed (reject
strings like "123abc" or "12.34"). Second, replace the fallback logic at the end
where KNOWN_DIMS[modelName] ?? DEFAULT_DIMS returns a default dimension value
for unknown models—instead, explicitly check if the model exists in KNOWN_DIMS
and throw an error if it does not, since unknown models require the
OPENAI_EMBEDDING_DIMENSIONS override to be set.

---

Outside diff comments:
In `@src/providers/embedding/local.ts`:
- Around line 4-12: The Pipeline type definition in this file only accepts two
parameters (task and model), but the actual call at lines 93–97 passes a third
parameter containing options like local_files_only and quantized. Update the
Pipeline type signature to accept a third optional parameter that represents the
options object with the properties being passed at the call site, ensuring the
type matches what is actually being invoked.

---

Nitpick comments:
In `@src/providers/embedding/local.ts`:
- Around line 14-29: The KNOWN_DIMS constant contains inline comments that
describe categories of models (such as "// MiniLM 系列(英文)", "// BGE 中文系列", etc.)
which explain WHAT the code does rather than WHY. Per the TypeScript coding
guidelines for this repo, these WHAT-style comments should be removed since the
model names themselves are sufficiently clear and self-documenting. Delete all
category label comments from the KNOWN_DIMS mapping while keeping the model
names and their corresponding dimension values intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1f10d129-0fbd-4197-aa75-c5fd6aa3d0cf

📥 Commits

Reviewing files that changed from the base of the PR and between f6f9e3c and ff9f7e5.

📒 Files selected for processing (1)
  • src/providers/embedding/local.ts

Comment on lines +35 to +49
function resolveDimensions(
modelName: string,
override: string | undefined,
): number {
if (override !== undefined && override.trim().length > 0) {
const parsed = parseInt(override, 10);
if (!Number.isFinite(parsed) || parsed <= 0) {
throw new Error(
`OPENAI_EMBEDDING_DIMENSIONS must be a positive integer, got: ${override}`,
);
}
return parsed;
}
return KNOWN_DIMS[modelName] ?? DEFAULT_DIMS;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify current behavior and contract evidence in code/docs.
rg -n -C3 'function resolveDimensions|parseInt\(|KNOWN_DIMS\[modelName\] \?\? DEFAULT_DIMS' src/providers/embedding/local.ts
rg -n -C2 'OPENAI_EMBEDDING_DIMENSIONS.*Required when the model is not in the known-models table' README.md

Repository: rohitg00/agentmemory

Length of output: 1006


🏁 Script executed:

# Check KNOWN_DIMS definition and what models are included
rg -n -B5 -A10 'const KNOWN_DIMS|KNOWN_DIMS\s*=' src/providers/embedding/local.ts

Repository: rohitg00/agentmemory

Length of output: 605


🏁 Script executed:

# Check if provider.dimensions is used elsewhere and could cause breakage
rg -n 'provider\.dimensions|\.dimensions' src/ --type ts

Repository: rohitg00/agentmemory

Length of output: 1440


🏁 Script executed:

# Search for any TODO, FIXME, or issue comments about embedding dimensions
rg -n 'TODO|FIXME|BUG.*dimension|dimension.*error|unknown.*model' src/providers/embedding/local.ts

Repository: rohitg00/agentmemory

Length of output: 46


🏁 Script executed:

# Verify parseInt behavior with edge cases
node <<'JS'
console.log("parseInt('123abc', 10):", parseInt('123abc', 10));
console.log("parseInt('12.34', 10):", parseInt('12.34', 10));
console.log("parseInt('  456  ', 10):", parseInt('  456  ', 10));
console.log("Number.isFinite(parseInt('123abc', 10)):", Number.isFinite(parseInt('123abc', 10)));
console.log("Number('123abc'):", Number('123abc'));
console.log("Number('456'):", Number('456'));
JS

Repository: rohitg00/agentmemory

Length of output: 234


Dimension resolution accepts malformed overrides and silently defaults unknown models, breaking vector search contracts.

The current implementation has two issues:

  1. parseInt accepts partial numeric strings (e.g., "123abc"123) and decimals (e.g., "12.34"12), which bypasses validation since Number.isFinite() returns true for the parsed integer result.

  2. Line 48 silently defaults to 384 for unknown models, contradicting the README which documents that OPENAI_EMBEDDING_DIMENSIONS is "Required when the model is not in the known-models table". This causes runtime failures in src/functions/search.ts and src/functions/migrate-vector-index.ts, which validate that embedding.length === provider.dimensions.

Use strict regex validation for the override and throw an error for unknown models instead of falling back to a default dimension value.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/providers/embedding/local.ts` around lines 35 - 49, The resolveDimensions
function has two issues that need fixing. First, replace the parseInt and
Number.isFinite validation for the override parameter with strict regex
validation that ensures the entire override string is a positive integer with no
partial matches or decimals allowed (reject strings like "123abc" or "12.34").
Second, replace the fallback logic at the end where KNOWN_DIMS[modelName] ??
DEFAULT_DIMS returns a default dimension value for unknown models—instead,
explicitly check if the model exists in KNOWN_DIMS and throw an error if it does
not, since unknown models require the OPENAI_EMBEDDING_DIMENSIONS override to be
set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LocalEmbeddingProvider should respect EMBEDDING_MODEL env var for non-English embeddings

1 participant