fix(embedding): respect EMBEDDING_MODEL for local provider#943
fix(embedding): respect EMBEDDING_MODEL for local provider#943skywalker-35 wants to merge 1 commit into
Conversation
…itg00#881) LocalEmbeddingProvider currently hardcodes Xenova/all-MiniLM-L6-v2 (English-only, 384-dim) and ignores EMBEDDING_MODEL, making non-English embedding models (bge-large-zh, bge-m3, multilingual-e5, etc.) impossible to configure on the local provider. Changes: - Read EMBEDDING_MODEL env var (default: Xenova/all-MiniLM-L6-v2) - Resolve dimensions from KNOWN_DIMS map (MiniLM / bge-zh / bge-m3 / multilingual-e5) with OPENAI_EMBEDDING_DIMENSIONS as override - Pass local_files_only + quantized=false to the pipeline (BGE's ONNX release ships only model.onnx, not model_quantized.onnx, and offline / restricted-network setups need the no-fetch path) Verified locally: EMBEDDING_MODEL=Xenova/bge-large-zh-v1.5 loads in ~3.9s, single embed ~0.23s, 1024-dim vector returned. Chinese semantic search (stored 深度学习框架对比分析 / queried 神经网络训练平台比较, zero shared keywords) returns matches via cosine similarity. Fixes rohitg00#881
|
@skywalker-35 is attempting to deploy a commit to the rohitg00's projects Team on Vercel. A member of the Team first needs to authorize it. |
📝 WalkthroughWalkthrough
ChangesDynamic LocalEmbeddingProvider Model and Dimensions
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related issues
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/providers/embedding/local.ts (1)
4-12:⚠️ Potential issue | 🔴 Critical
Pipelinetype signature does not match the call at lines 93–97.The type definition accepts only 2 parameters, but the call passes a third options object with
{ local_files_only: true, quantized: false }. This causes a TypeScript type error.Fix
type Pipeline = ( task: string, model: string, + options?: { local_files_only?: boolean; quantized?: boolean }, ) => Promise< ( texts: string[], options: { pooling: string; normalize: boolean }, ) => Promise<{ tolist: () => number[][] }> >;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/providers/embedding/local.ts` around lines 4 - 12, The Pipeline type definition in this file only accepts two parameters (task and model), but the actual call at lines 93–97 passes a third parameter containing options like local_files_only and quantized. Update the Pipeline type signature to accept a third optional parameter that represents the options object with the properties being passed at the call site, ensuring the type matches what is actually being invoked.
🧹 Nitpick comments (1)
src/providers/embedding/local.ts (1)
14-29: ⚡ Quick winRemove WHAT-style inline comments in the model map.
These comments narrate category labels instead of encoding intent through structure/naming, which conflicts with the TypeScript guideline for this repo.
As per coding guidelines, “In TypeScript source code, avoid code comments explaining WHAT — use clear naming instead.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/providers/embedding/local.ts` around lines 14 - 29, The KNOWN_DIMS constant contains inline comments that describe categories of models (such as "// MiniLM 系列(英文)", "// BGE 中文系列", etc.) which explain WHAT the code does rather than WHY. Per the TypeScript coding guidelines for this repo, these WHAT-style comments should be removed since the model names themselves are sufficiently clear and self-documenting. Delete all category label comments from the KNOWN_DIMS mapping while keeping the model names and their corresponding dimension values intact.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/providers/embedding/local.ts`:
- Around line 35-49: The resolveDimensions function has two issues that need
fixing. First, replace the parseInt and Number.isFinite validation for the
override parameter with strict regex validation that ensures the entire override
string is a positive integer with no partial matches or decimals allowed (reject
strings like "123abc" or "12.34"). Second, replace the fallback logic at the end
where KNOWN_DIMS[modelName] ?? DEFAULT_DIMS returns a default dimension value
for unknown models—instead, explicitly check if the model exists in KNOWN_DIMS
and throw an error if it does not, since unknown models require the
OPENAI_EMBEDDING_DIMENSIONS override to be set.
---
Outside diff comments:
In `@src/providers/embedding/local.ts`:
- Around line 4-12: The Pipeline type definition in this file only accepts two
parameters (task and model), but the actual call at lines 93–97 passes a third
parameter containing options like local_files_only and quantized. Update the
Pipeline type signature to accept a third optional parameter that represents the
options object with the properties being passed at the call site, ensuring the
type matches what is actually being invoked.
---
Nitpick comments:
In `@src/providers/embedding/local.ts`:
- Around line 14-29: The KNOWN_DIMS constant contains inline comments that
describe categories of models (such as "// MiniLM 系列(英文)", "// BGE 中文系列", etc.)
which explain WHAT the code does rather than WHY. Per the TypeScript coding
guidelines for this repo, these WHAT-style comments should be removed since the
model names themselves are sufficiently clear and self-documenting. Delete all
category label comments from the KNOWN_DIMS mapping while keeping the model
names and their corresponding dimension values intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1f10d129-0fbd-4197-aa75-c5fd6aa3d0cf
📒 Files selected for processing (1)
src/providers/embedding/local.ts
| function resolveDimensions( | ||
| modelName: string, | ||
| override: string | undefined, | ||
| ): number { | ||
| if (override !== undefined && override.trim().length > 0) { | ||
| const parsed = parseInt(override, 10); | ||
| if (!Number.isFinite(parsed) || parsed <= 0) { | ||
| throw new Error( | ||
| `OPENAI_EMBEDDING_DIMENSIONS must be a positive integer, got: ${override}`, | ||
| ); | ||
| } | ||
| return parsed; | ||
| } | ||
| return KNOWN_DIMS[modelName] ?? DEFAULT_DIMS; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify current behavior and contract evidence in code/docs.
rg -n -C3 'function resolveDimensions|parseInt\(|KNOWN_DIMS\[modelName\] \?\? DEFAULT_DIMS' src/providers/embedding/local.ts
rg -n -C2 'OPENAI_EMBEDDING_DIMENSIONS.*Required when the model is not in the known-models table' README.mdRepository: rohitg00/agentmemory
Length of output: 1006
🏁 Script executed:
# Check KNOWN_DIMS definition and what models are included
rg -n -B5 -A10 'const KNOWN_DIMS|KNOWN_DIMS\s*=' src/providers/embedding/local.tsRepository: rohitg00/agentmemory
Length of output: 605
🏁 Script executed:
# Check if provider.dimensions is used elsewhere and could cause breakage
rg -n 'provider\.dimensions|\.dimensions' src/ --type tsRepository: rohitg00/agentmemory
Length of output: 1440
🏁 Script executed:
# Search for any TODO, FIXME, or issue comments about embedding dimensions
rg -n 'TODO|FIXME|BUG.*dimension|dimension.*error|unknown.*model' src/providers/embedding/local.tsRepository: rohitg00/agentmemory
Length of output: 46
🏁 Script executed:
# Verify parseInt behavior with edge cases
node <<'JS'
console.log("parseInt('123abc', 10):", parseInt('123abc', 10));
console.log("parseInt('12.34', 10):", parseInt('12.34', 10));
console.log("parseInt(' 456 ', 10):", parseInt(' 456 ', 10));
console.log("Number.isFinite(parseInt('123abc', 10)):", Number.isFinite(parseInt('123abc', 10)));
console.log("Number('123abc'):", Number('123abc'));
console.log("Number('456'):", Number('456'));
JSRepository: rohitg00/agentmemory
Length of output: 234
Dimension resolution accepts malformed overrides and silently defaults unknown models, breaking vector search contracts.
The current implementation has two issues:
-
parseIntaccepts partial numeric strings (e.g.,"123abc"→123) and decimals (e.g.,"12.34"→12), which bypasses validation sinceNumber.isFinite()returnstruefor the parsed integer result. -
Line 48 silently defaults to
384for unknown models, contradicting the README which documents thatOPENAI_EMBEDDING_DIMENSIONSis "Required when the model is not in the known-models table". This causes runtime failures insrc/functions/search.tsandsrc/functions/migrate-vector-index.ts, which validate thatembedding.length === provider.dimensions.
Use strict regex validation for the override and throw an error for unknown models instead of falling back to a default dimension value.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/providers/embedding/local.ts` around lines 35 - 49, The resolveDimensions
function has two issues that need fixing. First, replace the parseInt and
Number.isFinite validation for the override parameter with strict regex
validation that ensures the entire override string is a positive integer with no
partial matches or decimals allowed (reject strings like "123abc" or "12.34").
Second, replace the fallback logic at the end where KNOWN_DIMS[modelName] ??
DEFAULT_DIMS returns a default dimension value for unknown models—instead,
explicitly check if the model exists in KNOWN_DIMS and throw an error if it does
not, since unknown models require the OPENAI_EMBEDDING_DIMENSIONS override to be
set.
Summary
Fixes #881
LocalEmbeddingProvidercurrently hardcodesXenova/all-MiniLM-L6-v2(English-only, 384-dim) and ignores theEMBEDDING_MODELenvironment variable. This makes non-English embedding models (e.g.Xenova/bge-large-zh-v1.5for Chinese, 1024-dim) impossible to configure on the local provider, even though the same env var is the standard convention across the OpenAI / Gemini providers.Changes
src/providers/embedding/local.ts:EMBEDDING_MODELenv var to select the model (default:Xenova/all-MiniLM-L6-v2— preserves existing behavior)dimensionsfrom aKNOWN_DIMSmap (MiniLM / bge-zh / bge-m3 / multilingual-e5), withOPENAI_EMBEDDING_DIMENSIONSas an explicit override{ local_files_only: true, quantized: false }to the pipeline — required because BGE's ONNX release ships onlymodel.onnx, notmodel_quantized.onnx, and offline / restricted-network setups need the no-fetch pathWhy a separate PR from #793
#793 introduces a new
LOCAL_EMBEDDING_MODELenv var and changes the default model toparaphrase-multilingual-MiniLM-L12-v2. This PR takes a complementary, more conservative approach:LOCAL_EMBEDDING_MODEL(new)EMBEDDING_MODEL(cross-provider convention)paraphrase-multilingual-MiniLM-L12-v2Xenova/all-MiniLM-L6-v2(unchanged)KNOWN_DIMSmap +OPENAI_EMBEDDING_DIMENSIONSoverridelocal_files_only: true+quantized: falseReasoning for reusing
EMBEDDING_MODEL: it is already the cross-provider convention (used by the Gemini / OpenAI factories inindex.ts). Adding a second env var for the local provider would force users to remember which one applies to which provider. TheKNOWN_DIMSmap covers everyXenova/*model currently in use, so the typical user never needs to setOPENAI_EMBEDDING_DIMENSIONS.Happy to consolidate with #793 if reviewers prefer a single source of truth — the two approaches are not mutually exclusive. The dimension + offline bits here are independent improvements that would still apply regardless of which env-var name wins.
Verification
EMBEDDING_MODEL=Xenova/bge-large-zh-v1.5→ pipeline loads in ~3.9s, single embed ~0.23s, 1024-dim vector returnedEMBEDDING_MODELunset → behavior unchanged (Xenova/all-MiniLM-L6-v2, 384-dim)OPENAI_EMBEDDING_DIMENSIONS=512override for an unknown model → 512-dim returnedOPENAI_EMBEDDING_DIMENSIONS=abc(invalid) → throws explicit error rather than silently defaultingEnvironment
Operational note for reviewers
@xenova/transformersdoes not respect the standard~/.cache/huggingfaceredirect via mklink / junction — it looks only insidenode_modules/@xenova/transformers/models/{namespace}/{model}/(a flat layout, not the usualblobs/snapshotsstructure). For pre-downloaded models, that is the path that must contain the model files. Thenamespaceis alwaysXenova/, regardless of whatEMBEDDING_MODELis set to.Related
Summary by CodeRabbit