
Feature Request: strengthen cross-language recall with explicit multilingual rerank/fusion support #130

@furedericca-lab

Description


Problem / Motivation

Cross-language recall currently feels weaker than the README/config surface suggests.

After reviewing the current code path and runtime config, I found two related gaps that together make multilingual retrieval harder to tune and less predictable:

  1. BM25 only indexes the text column, so cross-language lexical recall depends on authoring conventions inside prose.
    In practice, Chinese -> English recall improves only when entries manually include bilingual anchors such as Keywords (zh): .... There is no first-class structured bilingual keyword field or query expansion path.

  2. vectorWeight / bm25Weight are exposed in config but are not actually used in fusion.
    This removes an obvious tuning lever for multilingual retrieval, where users often need to bias the system toward vector evidence when BM25 has little or no same-language overlap.

These two issues are tightly connected: when lexical overlap is weak across languages, retrieval quality depends more on vector quality, but today neither the user-facing tuning knobs nor the lexical layer fully supports that workflow.
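For concreteness, this is roughly the tuning surface the config already advertises. The exact nesting is an assumption on my part; the key names `retrieval.vectorWeight` / `retrieval.bm25Weight` are the ones exposed in the current config/types:

```json
{
  "retrieval": {
    "vectorWeight": 0.7,
    "bm25Weight": 0.3
  }
}
```

Setting these today has no effect on ranking, which is the core of gap 2.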

Proposed Solution

I would like memory-lancedb-pro to make cross-language retrieval a first-class, explicitly supported path.

The cleanest direction is to treat this as two complementary layers instead of a single heuristic patch:

1. Memory-side bilingual keywords / aliases (preferred long-term contract)

Add a first-class bilingual keywords / alias mechanism for stored memories.

Concretely, this means a memory should be able to carry structured bilingual lookup terms in addition to its main text, for example via a dedicated keywords / aliases field. BM25 / FTS should be able to index and search those lookup terms alongside text.

This is different from query expansion:

  • memory-side aliases answer "what other names or bilingual terms can this memory be found by?"
  • they are attached to the memory itself, not guessed at query time

Why this matters:

  • this makes cross-language lexical recall a native data capability, not just a writing convention
  • it is more stable and controllable than manually appending Keywords (zh): ... to the main memory text
  • it keeps the body text human-readable while still allowing Chinese <-> English lexical matches
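A minimal sketch of what the memory-side contract could look like. The `keywords` field name, the `MemoryRecord` shape, and the `ftsDocument` helper are all illustrative, not existing project API; the only point is that BM25/FTS indexes the structured aliases together with `text` while the body stays clean:

```typescript
// Hypothetical shape: a stored memory carries structured bilingual
// lookup terms alongside its main text. Field names are illustrative.
interface MemoryRecord {
  id: string;
  text: string;       // human-readable body, kept free of keyword stuffing
  keywords: string[]; // bilingual aliases attached to the memory itself
}

// Build the document string that BM25/FTS would index: the body text
// plus the structured aliases, so Chinese <-> English terms match
// lexically without being written into the prose.
function ftsDocument(mem: MemoryRecord): string {
  return [mem.text, ...mem.keywords].join(" ");
}

const mem: MemoryRecord = {
  id: "m1",
  text: "Notes on hybrid retrieval with LanceDB.",
  keywords: ["混合检索", "hybrid retrieval", "重排序", "rerank"],
};

console.log(ftsDocument(mem));
```

With this in place, a Chinese query like 混合检索 gets a lexical hit on an English-bodied memory without the author ever writing `Keywords (zh): ...` into the text.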

2. Query-side expansion / dictionary (good short-term / complementary layer)

Add an optional query expansion path for bilingual or colloquial queries.

Concretely, this means taking the incoming query and expanding it into additional bilingual or technical variants before BM25 search, for example through:

  • a static synonym dictionary
  • bilingual alias expansion
  • domain-specific colloquial -> technical term expansion

This is different from memory-side aliases:

  • query expansion answers "what other forms of this query should we try?"
  • it enriches the search input at retrieval time, instead of attaching structured aliases to stored memories

Why this matters:

  • it is easier to ship quickly
  • it improves recall for fuzzy Chinese queries and colloquial wording
  • it complements, but should not replace, the memory-side alias mechanism
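The query-side layer above can be sketched in a few lines. The dictionary entries and the `expandQuery` helper are hypothetical examples, not proposed API; the idea is just substring-triggered expansion into bilingual variants before the BM25 call:

```typescript
// A static synonym dictionary mapping terms to bilingual/technical
// variants. Entries here are illustrative placeholders.
const synonymDict: Record<string, string[]> = {
  "向量检索": ["vector search", "vector retrieval"],
  "重排": ["rerank", "reranking"],
};

// Expand an incoming query into additional variants to try in BM25.
// The original query is always kept as the first variant.
function expandQuery(query: string): string[] {
  const variants = new Set<string>([query]);
  for (const [term, aliases] of Object.entries(synonymDict)) {
    if (query.includes(term)) {
      for (const alias of aliases) variants.add(alias);
    }
  }
  return [...variants];
}

console.log(expandQuery("如何配置向量检索"));
```

Each variant would be issued as its own BM25 search and the candidates merged, which is exactly why this layer depends on the fusion fix below being real.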

3. Fix the fusion contract so lexical improvements can actually be tuned

Make retrieval.vectorWeight / retrieval.bm25Weight participate in the real fusion logic.

Why this matters:

  • lexical improvements alone only help if those candidates can be combined reasonably with vector results
  • in cross-language cases, BM25 may still be sparse or uneven, so users need an actual way to bias toward vector evidence when appropriate
  • right now the public weighting knobs are exposed, but the current fusion implementation does not consume them
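One plausible shape for honoring the weights: min-max normalize each result list's scores onto [0, 1], then combine with the configured weights. This is a sketch of one reasonable fusion scheme, not the project's current implementation, and all names here are assumptions:

```typescript
interface Scored { id: string; score: number; }

// Min-max normalize a result list's scores so vector similarity and
// raw BM25 scores become comparable before weighting.
function normalize(results: Scored[]): Map<string, number> {
  const scores = results.map(r => r.score);
  const lo = Math.min(...scores);
  const hi = Math.max(...scores);
  const span = hi - lo || 1; // guard against all-equal scores
  return new Map(results.map(r => [r.id, (r.score - lo) / span]));
}

// Weighted fusion: candidates missing from one list contribute 0 from
// that list, so vectorWeight/bm25Weight genuinely bias the ranking.
function fuse(vector: Scored[], bm25: Scored[],
              vectorWeight: number, bm25Weight: number): Scored[] {
  const v = normalize(vector);
  const b = normalize(bm25);
  const ids = new Set([...v.keys(), ...b.keys()]);
  return [...ids]
    .map(id => ({
      id,
      score: vectorWeight * (v.get(id) ?? 0) + bm25Weight * (b.get(id) ?? 0),
    }))
    .sort((a, c) => c.score - a.score);
}

const fused = fuse(
  [{ id: "a", score: 0.9 }, { id: "b", score: 0.2 }],
  [{ id: "b", score: 12 }, { id: "c", score: 3 }],
  0.7, 0.3,
);
console.log(fused.map(r => r.id));
```

In a cross-language query where BM25 returns little, raising `vectorWeight` would then actually shift ranking toward vector evidence, which is the missing lever described above. Rank-based fusion (e.g. reciprocal rank fusion with per-list weights) would be an equally valid alternative.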

Suggested rollout shape

If this is easier to stage incrementally, I think the order should be:

  1. short-term: query expansion / dictionary-based lexical help
  2. mid-term: structured bilingual keywords / alias support on the memory side
  3. alongside or immediately after: fix fusion weighting so the new lexical candidates can be tuned correctly

That gives a practical path forward without locking the design into prose-only keyword stuffing.

Alternatives Considered

Current workarounds are possible but all feel weaker than a first-class fix:

  • manually append Keywords (zh): ... to memory text
  • switch to a more multilingual embedding model and hope vector similarity is enough
  • lower thresholds or increase candidate pool

These help, but they do not solve the underlying contract/tuning gaps.

Area

Retrieval / Search

Additional Context

Evidence from the current repository state:

  • src/store.ts
    • FTS index is created on text
    • BM25 search reads from the same text field
  • README.md
    • recommends Keywords (zh) authoring patterns, which currently act as an implicit bilingual retrieval aid
  • vectorWeight / bm25Weight
    • exposed in config/types, but not consumed by the current fusion implementation

This issue is intentionally framed as a feature request rather than a narrow bug report because the main problem is the current multilingual retrieval contract and tuning surface, not a single crash or regression.
