Description
Problem / Motivation
Cross-language recall currently feels weaker than the README/config surface suggests.
After reviewing the current code path and runtime config, I found two related gaps that together make multilingual retrieval harder to tune and less predictable:
- BM25 only indexes the `text` column, so cross-language lexical recall depends on authoring conventions inside prose. In practice, Chinese -> English recall improves only when entries manually include bilingual anchors such as `Keywords (zh): ...`. There is no first-class structured bilingual keyword field or query expansion path.
- `vectorWeight` / `bm25Weight` are exposed in config but are not actually used in fusion. This removes an obvious tuning lever for multilingual retrieval, where users often need to bias the system toward vector evidence when BM25 has little or no same-language overlap.
These two issues are tightly connected: when lexical overlap is weak across languages, retrieval quality depends more on vector quality, but today the user-facing tuning knobs and lexical support do not fully support that workflow.
Proposed Solution
I would like memory-lancedb-pro to make cross-language retrieval a first-class, explicitly supported path.
The cleanest direction is to treat this as two complementary layers instead of a single heuristic patch:
1. Memory-side bilingual keywords / aliases (preferred long-term contract)
Add a first-class bilingual keywords / alias mechanism for stored memories.
Concretely, this means a memory should be able to carry structured bilingual lookup terms in addition to its main text, for example via a dedicated keywords / aliases field. BM25 / FTS should be able to index and search those lookup terms alongside text.
This is different from query expansion:
- memory-side aliases answer "what other names or bilingual terms can this memory be found by?"
- they are attached to the memory itself, not guessed at query time
Why this matters:
- this makes cross-language lexical recall a native data capability, not just a writing convention
- it is more stable and controllable than manually appending `Keywords (zh): ...` to the main memory text
- it keeps the body text human-readable while still allowing Chinese <-> English lexical matches
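To make the proposed contract concrete, here is a minimal sketch of what a memory record with a structured alias field could look like. The field names (`keywords`, `MemoryRecord`) and the naive substring matcher are illustrative assumptions, not the project's actual schema or its BM25/FTS implementation:

```typescript
// Hypothetical schema sketch: field names are illustrative, not the project's actual types.
interface MemoryRecord {
  id: string;
  text: string;       // main prose, stays human-readable
  keywords: string[]; // structured bilingual lookup terms (aliases)
}

// Naive stand-in for BM25/FTS matching over both text and keywords.
// The real index would search a combined or secondary FTS column instead.
function lexicalMatch(memories: MemoryRecord[], query: string): MemoryRecord[] {
  const q = query.toLowerCase();
  return memories.filter(
    (m) =>
      m.text.toLowerCase().includes(q) ||
      m.keywords.some((k) => k.toLowerCase().includes(q))
  );
}

const memories: MemoryRecord[] = [
  {
    id: "m1",
    text: "Deployment checklist for the staging cluster.",
    keywords: ["部署", "deployment", "staging"],
  },
];

// A Chinese query now matches via the structured alias,
// without any bilingual anchor text inside the prose body.
const hits = lexicalMatch(memories, "部署"); // matches m1
```

The point of the sketch is the contract: the alias terms travel with the memory, so lexical cross-language recall no longer depends on how the body text was authored.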
2. Query-side expansion / dictionary (good short-term / complementary layer)
Add an optional query expansion path for bilingual or colloquial queries.
Concretely, this means taking the incoming query and expanding it into additional bilingual or technical variants before BM25 search, for example through:
- a static synonym dictionary
- bilingual alias expansion
- domain-specific colloquial -> technical term expansion
This is different from memory-side aliases:
- query expansion answers "what other forms of this query should we try?"
- it enriches the search input at retrieval time, instead of attaching structured aliases to stored memories
Why this matters:
- it is easier to ship quickly
- it improves recall for fuzzy Chinese queries and colloquial wording
- it complements, but should not replace, the memory-side alias mechanism
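A minimal sketch of the query-side layer, assuming a static dictionary; the dictionary contents and function name are illustrative, and a real implementation would likely run each variant through BM25 and merge candidates:

```typescript
// Illustrative synonym dictionary: entries are assumptions, not shipped data.
const synonymDict: Record<string, string[]> = {
  "部署": ["deploy", "deployment"],
  "报错": ["error", "exception"],
};

// Expand an incoming query into bilingual/technical variants before BM25 search.
function expandQuery(query: string): string[] {
  const variants = new Set<string>([query]);
  for (const [term, synonyms] of Object.entries(synonymDict)) {
    if (query.includes(term)) {
      for (const s of synonyms) variants.add(s);
    }
  }
  return [...variants];
}

const variants = expandQuery("部署失败");
// ["部署失败", "deploy", "deployment"]
```

Because the original query is always kept as the first variant, this layer can only add candidates, never lose same-language recall.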
3. Fix the fusion contract so lexical improvements can actually be tuned
Make retrieval.vectorWeight / retrieval.bm25Weight participate in the real fusion logic.
Why this matters:
- lexical improvements alone only help if those candidates can be combined reasonably with vector results
- in cross-language cases, BM25 may still be sparse or uneven, so users need an actual way to bias toward vector evidence when appropriate
- right now the public weighting knobs are exposed in config, but the current fusion implementation never reads them, so changing them has no effect
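For reference, a sketch of fusion that actually consumes the configured weights. The parameter names mirror the config keys (`vectorWeight`, `bm25Weight`), but the max-normalization and the linear combination are assumptions about one reasonable shape, not the project's existing logic:

```typescript
interface Scored {
  id: string;
  score: number;
}

// Normalize scores into [0, 1] so vector and BM25 scales are comparable.
function normalize(results: Scored[]): Map<string, number> {
  const max = Math.max(...results.map((r) => r.score), 1e-9);
  return new Map(results.map((r) => [r.id, r.score / max]));
}

// Weighted linear fusion that actually reads the configured weights.
function fuse(
  vector: Scored[],
  bm25: Scored[],
  vectorWeight = 0.7,
  bm25Weight = 0.3
): Scored[] {
  const v = normalize(vector);
  const b = normalize(bm25);
  const ids = new Set([...v.keys(), ...b.keys()]);
  return [...ids]
    .map((id) => ({
      id,
      score: vectorWeight * (v.get(id) ?? 0) + bm25Weight * (b.get(id) ?? 0),
    }))
    .sort((x, y) => y.score - x.score);
}
```

With this shape, raising `vectorWeight` genuinely biases ranking toward vector evidence in cross-language cases where BM25 candidates are sparse, which is exactly the tuning workflow the exposed config implies.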
Suggested rollout shape
If this is easier to stage incrementally, I think the order should be:
- short-term: query expansion / dictionary-based lexical help
- mid-term: structured bilingual keywords / alias support on the memory side
- alongside or immediately after: fix fusion weighting so the new lexical candidates can be tuned correctly
That gives a practical path forward without locking the design into prose-only keyword stuffing.
Alternatives Considered
Current workarounds are possible but all feel weaker than a first-class fix:
- manually append `Keywords (zh): ...` to memory text
- switch to a more multilingual embedding model and hope vector similarity is enough
- lower thresholds or increase candidate pool
These help, but they do not solve the underlying contract/tuning gaps.
Area
Retrieval / Search
Additional Context
Evidence from the current repository state:
- `src/store.ts` - FTS index is created on `text`, and BM25 search reads from the same `text` field
- `README.md` - recommends the `Keywords (zh)` authoring pattern, which currently acts as an implicit bilingual retrieval aid
- `vectorWeight` / `bm25Weight` - exposed in config/types, but not consumed by the current fusion implementation
This issue is intentionally framed as a feature request rather than a narrow bug report because the main problem is the current multilingual retrieval contract and tuning surface, not a single crash or regression.