feat(nodes): add hybrid search node with BM25 keyword and vector similarity fusion #602
Conversation
… similarity search

Split from rocketride-org#514 per reviewer request. Addresses review feedback:

- RuntimeError on uninitialized engine (not silent return)
- Deep copy question to prevent fan-out mutation corruption
- Structured answer output instead of concatenated text blob

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

A new hybrid search node is introduced that fuses vector and BM25 keyword search using Reciprocal Rank Fusion (RRF). The implementation includes a core search engine, global initialization and instance execution handlers, configuration, dependencies, and comprehensive test coverage.
Sequence Diagram

```mermaid
sequenceDiagram
    participant IInstance
    participant HybridSearchEngine
    participant BM25
    participant VectorRanker
    participant RRF
    IInstance->>HybridSearchEngine: search(query, docs, vector_scores, top_k, rrf_k)
    HybridSearchEngine->>BM25: bm25_search(query, docs)
    BM25-->>HybridSearchEngine: bm25_results
    HybridSearchEngine->>VectorRanker: build vector_results from scores
    VectorRanker-->>HybridSearchEngine: vector_results
    HybridSearchEngine->>RRF: reciprocal_rank_fusion(vector_results, bm25_results, weights=[alpha, 1-alpha])
    RRF->>RRF: deduplicate by id, accumulate RRF scores
    RRF-->>HybridSearchEngine: merged & sorted rrf_results
    HybridSearchEngine-->>IInstance: top_k reranked documents
    IInstance->>IInstance: map results to original docs, update scores
    IInstance-->>Client: reranked_documents + answer
```
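For orientation, the RRF step in the diagram can be sketched in a few lines. This is an illustrative re-implementation, not the PR's actual `reciprocal_rank_fusion` (which also deduplicates document dicts by id and carries the fused score on each document); it only shows the core scoring rule, where each list contributes `weight / (k + rank)` per item:

```python
# Illustrative sketch of Reciprocal Rank Fusion (RRF); names and the
# weights=[0.5, 0.5] split are assumptions, not the PR's actual code.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60, weights=None):
    """Merge ranked lists of ids: each item accumulates weight / (k + rank)."""
    if weights is None:
        weights = [1.0] * len(result_lists)
    scores = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for rank, doc_id in enumerate(results, start=1):  # ranks are 1-based
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector = ['d1', 'd2', 'd3']   # vector-similarity ranking
bm25 = ['d3', 'd1', 'd4']     # keyword ranking
print(reciprocal_rank_fusion([vector, bm25], k=60, weights=[0.5, 0.5]))
# → ['d1', 'd3', 'd2', 'd4']: d1 ranks high in both lists, so it wins
```

Documents appearing in both lists (here `d1` and `d3`) accumulate score from each, which is what lets hybrid search outrank single-signal matches.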
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
…output
The structured answer was reading from doc.metadata.get('hybrid_score')
which was never populated — the RRF score is stored on doc.score. This
caused every score in the answer output to display 'N/A'. The test was
masking this bug by pre-populating hybrid_score in test metadata.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
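The bug described in this commit can be reproduced in miniature (field names are taken from the commit message; the values are illustrative):

```python
# Minimal repro of the score-display bug: the answer lane read
# doc.metadata['hybrid_score'], which was never written, while the real
# RRF score lived on doc.score. Values here are illustrative.
metadata = {}                      # 'hybrid_score' was never populated
score = 0.0317                     # the actual RRF score, stored on doc.score
shown = metadata.get('hybrid_score', 'N/A')
print(shown)                       # → 'N/A' for every document
fixed = score if score is not None else 'N/A'
print(fixed)                       # → 0.0317 once the read targets doc.score
```

A test that pre-populates `hybrid_score` in its fixtures exercises the read path but never the write path, which is exactly how the regression stayed hidden.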
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/search_hybrid/hybrid_search.py`:
- Around line 160-167: The search function currently ignores misaligned
vector_scores (leaving vector_results empty) which silently changes ranking;
instead validate that when vector_scores is not None its length matches
documents and raise a clear exception if not; update the validation in search
(and the same logic block referenced around the vector_results handling) to
check len(vector_scores) == len(documents) and raise ValueError (or similar)
with a message including the function name and the mismatched lengths so
upstream callers fail fast and fix the alignment bug rather than receiving
BM25-only results.
- Around line 55-67: The code must validate ranking parameters before any
slicing or Reciprocal Rank Fusion math: in bm25_search ensure the incoming top_k
is an integer >= 1 and raise a clear ValueError if not; likewise add the same
non-negative/positive checks for any RRF-related parameters used in the fusion
path (e.g., the RRF k/threshold parameters inside the method that performs
RRF/fusion — locate functions named rrf_fuse or hybrid_search/fuse_scores) so
you fail fast instead of allowing Python slicing to return unexpected results or
causing divide-by-zero during fusion.
In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 39-45: The depends(requirements) call in validateConfig should be
moved into the non-CONFIG branch of beginGlobal to match other nodes: remove the
dependency resolution from validateConfig, and in beginGlobal's else branch
import depends (from depends import depends), compute requirements using
os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt' and call
depends(requirements) before any engine/hybrid initialization (so lazy imports
in hybrid_search.py succeed at runtime); keep beginGlobal's CONFIG branch
unchanged.
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 94-102: The loop that builds reranked_docs currently only sets
reranked_doc.score when rrf_score exists, leaving the emitted Doc with its
upstream score in BM25-only or vector-only fallbacks; update the logic in the
results iteration (the block handling result, reranked_doc, rrf_score) to set
reranked_doc.score to the actual ranking score: use rrf_score if present,
otherwise fall back to result.get('bm25_score') then result.get('vector_score')
(and only if none exist keep the original docs[orig_idx].score), and then append
to reranked_docs so documents/answers report the score that determined ranking.
In `@nodes/src/nodes/search_hybrid/requirements.txt`:
- Line 1: Update the node manifest to remove the unnecessary version bounds:
replace the current pinned entry with just the package name "rank_bm25" in this
node's requirements.txt; if you believe a concrete compatibility reason requires
the range, add a brief documented justification in the PR instead of keeping the
bound. Ensure the only identifier changed is the package token "rank_bm25" in
the requirements.txt for this node.
In `@test/nodes/test_search_hybrid.py`:
- Around line 36-39: The test currently installs a fake rank_bm25 at module
import via _install_bm25_stub() but never restores it, leaking the stub into
other tests; change this to scope the stub to the test lifecycle by moving the
install/restore into a pytest fixture or into setup/teardown so
_install_bm25_stub() is called before the test(s) run and _restore_modules() is
always called after; update uses of _STUB_MODULES, _install_bm25_stub, and
_restore_modules (and the other stub-install code at the other occurrences) so
the fake module is only present during the test run and is reliably removed
afterward.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 96c70873-89be-4644-a2b7-d9e0a3a54db1
📒 Files selected for processing (7)
- nodes/src/nodes/search_hybrid/IGlobal.py
- nodes/src/nodes/search_hybrid/IInstance.py
- nodes/src/nodes/search_hybrid/__init__.py
- nodes/src/nodes/search_hybrid/hybrid_search.py
- nodes/src/nodes/search_hybrid/requirements.txt
- nodes/src/nodes/search_hybrid/services.json
- test/nodes/test_search_hybrid.py
```python
def bm25_search(self, query: str, documents: List[Dict[str, Any]], top_k: int = 10) -> List[Dict[str, Any]]:
    """
    BM25 keyword search over document texts.

    Args:
        query: The search query string.
        documents: List of dicts, each must have a 'text' key with the document content.
            May also have an 'id' key for identification.
        top_k: Maximum number of results to return.

    Returns:
        List of documents with 'bm25_score' added, sorted by score descending.
    """
```
Validate negative ranking parameters before slicing/RRF math.
Negative top_k values silently return all-but-last results because of Python slicing, and k=-1 hits a divide-by-zero on the first fused item. Since these values can come straight from node config, fail fast here instead of letting them produce truncated output or runtime crashes.
🛠️ Proposed fix

```diff
 def bm25_search(self, query: str, documents: List[Dict[str, Any]], top_k: int = 10) -> List[Dict[str, Any]]:
     """
     BM25 keyword search over document texts.
@@
     Returns:
         List of documents with 'bm25_score' added, sorted by score descending.
     """
+    if top_k < 0:
+        raise ValueError('top_k must be non-negative')
     if not query or not documents:
         return []
@@
 def reciprocal_rank_fusion(
     *result_lists: List[Dict[str, Any]],
     k: int = 60,
     id_key: str = 'id',
     weights: Optional[List[float]] = None,
@@
     Returns:
         Merged, deduplicated list sorted by RRF score descending.
     """
+    if k < 0:
+        raise ValueError('k must be non-negative')
     if weights is not None and len(weights) != len(result_lists):
         raise ValueError('weights length must match the number of result lists')
@@
 def search(
     self,
     query: str,
     documents: List[Dict[str, Any]],
     vector_scores: Optional[List[float]] = None,
@@
     Returns:
         List of documents ranked by hybrid RRF score.
     """
+    if top_k < 0:
+        raise ValueError('top_k must be non-negative')
     if not documents:
         return []
```

Also applies to: 105-128, 160-167
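Both failure modes the comment calls out can be seen with plain Python (illustrative values):

```python
# Illustrative demo of why the guard is needed before slicing / RRF math.
docs = ['d1', 'd2', 'd3', 'd4']

# 1. A negative top_k silently truncates via slicing instead of failing:
top_k = -1
print(docs[:top_k])  # ['d1', 'd2', 'd3'], all but the last, not a top-k list

# 2. k=-1 triggers a divide-by-zero on the first fused item (rank 1),
#    since each RRF contribution is weight / (k + rank):
k = -1
try:
    contribution = 1.0 / (k + 1)
except ZeroDivisionError:
    print('RRF contribution for rank 1 divides by zero when k == -1')
```

Because slicing never raises on negative bounds, the truncation variant is the more insidious of the two: it quietly returns the wrong result set.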
```python
def search(
    self,
    query: str,
    documents: List[Dict[str, Any]],
    vector_scores: Optional[List[float]] = None,
    top_k: int = 10,
    rrf_k: int = 60,
) -> List[Dict[str, Any]]:
```
Reject misaligned vector_scores instead of silently dropping the vector signal.
When vector_scores is provided with the wrong length, this code just leaves vector_results empty and the caller gets BM25-only results. That hides an upstream data-alignment bug and changes ranking semantics without any error.
🛠️ Proposed fix

```diff
     # Build vector-ranked list if scores are provided
     vector_results: List[Dict[str, Any]] = []
-    if vector_scores is not None and len(vector_scores) == len(documents):
+    if vector_scores is not None:
+        if len(vector_scores) != len(documents):
+            raise ValueError('vector_scores length must match documents length')
         scored = []
         for i, doc in enumerate(documents):
             doc_copy = dict(doc)
             doc_copy['vector_score'] = float(vector_scores[i])
             scored.append(doc_copy)
```

Also applies to: 185-195
```python
def validateConfig(self):
    """Validate that the rank_bm25 dependency is available."""
    try:
        from depends import depends

        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
printf '== search_hybrid IGlobal ==\n'
sed -n '1,140p' nodes/src/nodes/search_hybrid/IGlobal.py
printf '\n== sibling nodes resolving requirements in beginGlobal ==\n'
rg -n -C2 "depends\(requirements\)|from depends import depends" nodes/src/nodes --type py
printf '\n== framework callers of validateConfig ==\n'
rg -n -C2 "\bvalidateConfig\s*\(" --type py
printf '\n== framework callers of beginGlobal ==\n'
rg -n -C2 "\bbeginGlobal\s*\(" --type py
```

Repository: rocketride-org/rocketride-server
Length of output: 50388
Move dependency resolution to beginGlobal() to match the standard pattern used across the codebase.
The depends(requirements) call is currently isolated in validateConfig(), but runtime execution of beginGlobal() does not load dependencies. This pattern deviates from virtually all other nodes in the codebase (vectorizer, question, prompt, summarization, text_output, and 60+ additional nodes), which consistently load dependencies in the non-CONFIG branch of beginGlobal(). If the framework does not guarantee validateConfig() executes before beginGlobal(), the lazy import of rank_bm25 in hybrid_search.py will fail when requests reach the engine.
Move the depends(requirements) call into the else branch of beginGlobal() (the execution path for non-CONFIG mode) to ensure dependencies are available before engine initialization:
Suggested pattern (already used by sibling nodes):
```python
def beginGlobal(self):
    if self.IEndpoint.endpoint.openMode == OPEN_MODE.CONFIG:
        pass
    else:
        from depends import depends
        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
        # existing config/engine init code
```
```python
for result in results:
    orig_idx = result.get('original_index')
    if orig_idx is not None and 0 <= orig_idx < len(docs):
        reranked_doc = copy.deepcopy(docs[orig_idx])
        # Update score with RRF score
        rrf_score = result.get('rrf_score')
        if rrf_score is not None:
            reranked_doc.score = rrf_score
        reranked_docs.append(reranked_doc)
```
Copy the actual ranking score onto the emitted Doc.
In BM25-only mode (alpha == 0) or any non-fused fallback, result carries bm25_score/vector_score but no rrf_score. This branch leaves reranked_doc.score at the upstream value, so the documents and answers lanes can report a score that did not determine the ranking.
🛠️ Proposed fix

```diff
-        rrf_score = result.get('rrf_score')
-        if rrf_score is not None:
-            reranked_doc.score = rrf_score
+        score = result.get('rrf_score')
+        if score is None:
+            score = result.get('bm25_score')
+        if score is None:
+            score = result.get('vector_score')
+        if score is not None:
+            reranked_doc.score = score
```
+ reranked_doc.score = score🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/search_hybrid/IInstance.py` around lines 94 - 102, The loop
that builds reranked_docs currently only sets reranked_doc.score when rrf_score
exists, leaving the emitted Doc with its upstream score in BM25-only or
vector-only fallbacks; update the logic in the results iteration (the block
handling result, reranked_doc, rrf_score) to set reranked_doc.score to the
actual ranking score: use rrf_score if present, otherwise fall back to
result.get('bm25_score') then result.get('vector_score') (and only if none exist
keep the original docs[orig_idx].score), and then append to reranked_docs so
documents/answers report the score that determined ranking.
```python
if reranked_docs and self.instance.hasListener('answers'):
    context_parts = []
    for i, doc in enumerate(reranked_docs):
        score = doc.score if doc.score is not None else 'N/A'
        snippet = (doc.page_content or '')[:500]
        context_parts.append(f'[Document {i + 1}] (score: {score})\n{snippet}')
    answer_text = f'Hybrid search returned {len(reranked_docs)} results:\n\n' + '\n\n'.join(context_parts)
    ans = Answer()
    ans.setAnswer(answer_text)
    self.instance.writeAnswers(ans)
```
The answer payload still drops source attribution.
Each block is only labeled [Document N], so any upstream source/title/url metadata disappears from the answer lane. That misses the PR goal of preserving source attribution and makes the emitted answer much harder to audit.
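One way to carry attribution through is to pull a source label out of each document's metadata when formatting the context blocks. The sketch below assumes upstream docs expose a metadata dict with `source`/`title` keys; those key names are hypothetical, and the real `Doc` shape may differ:

```python
# Hypothetical sketch: include source metadata in each answer block.
# The 'source' and 'title' metadata keys are assumptions about what
# upstream nodes attach; adapt to the real Doc fields. Plain dicts stand
# in for Doc objects to keep the example self-contained.
def format_context(reranked_docs, snippet_len=500):
    parts = []
    for i, doc in enumerate(reranked_docs):
        score = doc.get('score') if doc.get('score') is not None else 'N/A'
        meta = doc.get('metadata') or {}
        source = meta.get('source') or meta.get('title') or 'unknown source'
        snippet = (doc.get('page_content') or '')[:snippet_len]
        parts.append(f'[Document {i + 1} | {source}] (score: {score})\n{snippet}')
    return '\n\n'.join(parts)

docs = [{'score': 0.032, 'page_content': 'Some passage...', 'metadata': {'source': 'faq.md'}}]
print(format_context(docs))
```

With the source embedded in each block header, a reader of the answer lane can trace every snippet back to its origin without consulting the documents lane.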
```diff
@@ -0,0 +1 @@
+rank_bm25>=0.2.2,<1.0.0
```
🧹 Nitpick | 🔵 Trivial
Prefer an unconstrained dependency in this node manifest.
These node-level requirements.txt files are used as abstract manifests, so the >=0.2.2,<1.0.0 bound mostly just constrains the resolver without providing the actual lock. Please keep only rank_bm25 here unless there is a documented compatibility break that truly requires the range.
Please confirm there is a concrete compatibility reason before keeping the bound.
Based on learnings: "The project uses uv pip compile to resolve and lock exact versions, so the requirements.txt file serves only as an abstract dependency manifest. A version pin in requirements.txt is unnecessary and was declined in PR #456."
```python
def _install_bm25_stub():
    """Install a minimal rank_bm25 stub for testing."""
    for name in _STUB_MODULES:
        _original_modules[name] = sys.modules.get(name)
Scope the rank_bm25 stub to the test lifecycle.
_install_bm25_stub() runs at module import, but _restore_modules() is never paired with that top-level installation. Every later test in the session will see the fake rank_bm25 module, which makes the suite order-dependent and can mask real dependency/import regressions.
Also applies to: 78-83, 86-102
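One way to scope the stub, sketched as a context manager that a pytest fixture could wrap directly. The stub body (a bare `BM25Okapi` stand-in) is an assumption about what the tests need; the point is the guaranteed restore in `finally`:

```python
# Sketch: scope the fake rank_bm25 module to a test's lifetime instead of
# installing it at module import. The BM25Okapi stand-in is an assumption
# about the minimal surface the tests exercise.
import sys
import types
from contextlib import contextmanager

@contextmanager
def bm25_stub():
    original = sys.modules.get('rank_bm25')
    stub = types.ModuleType('rank_bm25')
    stub.BM25Okapi = object  # minimal stand-in for the real class
    sys.modules['rank_bm25'] = stub
    try:
        yield stub
    finally:
        # Always restore, even when a test raises
        if original is None:
            sys.modules.pop('rank_bm25', None)
        else:
            sys.modules['rank_bm25'] = original

with bm25_stub():
    import rank_bm25  # resolves to the stub while inside the context
    assert rank_bm25.BM25Okapi is object
```

In pytest terms, the same body can live in a yield fixture so that the teardown half runs after every test that requests it, removing the session-wide ordering hazard.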
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 36-36: Missing return type annotation for private function _install_bm25_stub
Add return type annotation: None
(ANN202)
Thank you for the contribution and the detailed write-up. After review, this PR is not ready for merge in its current state and is being closed to keep the PR list clean. The concept and proposed changes have been archived in #619 for further elaboration and future consideration.
Summary
Review feedback addressed
raise RuntimeError

Test plan
Splits #514
🤖 Generated with Claude Code