feat(nodes): add hybrid search node with BM25 keyword and vector similarity fusion #602

Closed
nihalnihalani wants to merge 2 commits into rocketride-org:develop from nihalnihalani:feature/rag-hybrid-search-node

Conversation

Contributor

@nihalnihalani nihalnihalani commented Apr 2, 2026

Summary

Review feedback addressed

  1. RuntimeError on uninitialized engine — replaced silent return with raise RuntimeError
  2. Deep copy question for fan-out safety — prevents mutation corruption when question is shared across branches
  3. Structured answer output — replaced concatenated text blob with structured format preserving document boundaries, scores, and source attribution
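
The fan-out concern in item 2 can be reduced to a minimal sketch (a plain dict stands in for the node's actual question type, which is not shown here):

```python
import copy

# Without a deep copy, fan-out branches share one mutable question object,
# so a mutation in one branch leaks into the others.
question = {"text": "what is RRF?", "metadata": {"seen_by": []}}

branch_a = copy.deepcopy(question)  # safe: independent copy per branch
branch_b = copy.deepcopy(question)

branch_a["metadata"]["seen_by"].append("search_hybrid")
print(branch_b["metadata"]["seen_by"])  # [] — branch_b is unaffected
```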

Test plan

  • 39 hybrid search tests pass
  • ruff check/format clean
  • BM25 keyword matching works correctly
  • RRF merging with deduplication handles edge cases
  • Alpha-weighted fusion (alpha=0.8 boosts vector, alpha=0.2 boosts BM25)
  • Deep copy prevents fan-out mutation
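
The alpha-weighted fusion behavior in the test plan can be shown with a minimal RRF computation (`rrf_fuse` is an illustrative helper, not the node's actual API):

```python
def rrf_fuse(vector_ids, bm25_ids, alpha, k=60):
    """Fuse two ranked id lists; alpha weights the vector list, 1 - alpha the BM25 list."""
    scores = {}
    for weight, ranking in ((alpha, vector_ids), (1 - alpha, bm25_ids)):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_ids = ["a", "b", "c"]  # vector-similarity ranking
bm25_ids = ["c", "b", "a"]    # keyword ranking disagrees

# alpha=0.8 favors the vector ranking; alpha=0.2 favors BM25.
print(rrf_fuse(vector_ids, bm25_ids, alpha=0.8)[0])  # a
print(rrf_fuse(vector_ids, bm25_ids, alpha=0.2)[0])  # c
```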

Splits #514

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features
    • Added hybrid search functionality combining vector similarity and keyword-based search using Reciprocal Rank Fusion for improved relevance
    • Included pre-configured search profiles (balanced, semantic, keyword) with adjustable parameters for controlling vector/keyword weighting, result count limits, and RRF tuning

… similarity search

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized engine (not silent return)
- Deep copy question to prevent fan-out mutation corruption
- Structured answer output instead of concatenated text blob

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the module:nodes Python pipeline nodes label Apr 2, 2026
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough


A new hybrid search node is introduced that fuses vector and BM25 keyword search using Reciprocal Rank Fusion (RRF). The implementation includes a core search engine, global initialization and instance execution handlers, configuration, dependencies, and comprehensive test coverage.

Changes

  • Core Engine — nodes/src/nodes/search_hybrid/hybrid_search.py: Implements HybridSearchEngine with BM25 search, reciprocal rank fusion, and multi-source result merging. Tokenization, RRF deduplication by document id, and weighted fusion of vector and keyword rankings.
  • Integration — nodes/src/nodes/search_hybrid/IGlobal.py, nodes/src/nodes/search_hybrid/IInstance.py: Global context initialization (config loading, engine creation, alpha clamping) and instance-level execution (question processing, document reranking, answer generation).
  • Configuration & Dependencies — nodes/src/nodes/search_hybrid/services.json, nodes/src/nodes/search_hybrid/requirements.txt: Service definition with the search_hybrid:// protocol, parameter exposure (alpha, top_k, rrf_k), profile presets, and the rank_bm25 dependency pin.
  • Tests — test/nodes/test_search_hybrid.py: Comprehensive unit and integration tests covering BM25 search, RRF merging, full pipeline behavior, parameter validation, edge cases, deep-copy semantics, and question/document processing.

Sequence Diagram

```mermaid
sequenceDiagram
    participant IInstance
    participant HybridSearchEngine
    participant BM25
    participant VectorRanker
    participant RRF

    IInstance->>HybridSearchEngine: search(query, docs, vector_scores, top_k, rrf_k)
    HybridSearchEngine->>BM25: bm25_search(query, docs)
    BM25-->>HybridSearchEngine: bm25_results
    HybridSearchEngine->>VectorRanker: build vector_results from scores
    VectorRanker-->>HybridSearchEngine: vector_results
    HybridSearchEngine->>RRF: reciprocal_rank_fusion(vector_results, bm25_results, weights=[alpha, 1-alpha])
    RRF->>RRF: deduplicate by id, accumulate RRF scores
    RRF-->>HybridSearchEngine: merged & sorted rrf_results
    HybridSearchEngine-->>IInstance: top_k reranked documents
    IInstance->>IInstance: map results to original docs, update scores
    IInstance-->>Client: reranked_documents + answer
```
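
The fusion step at the heart of this sequence can be condensed into a small, self-contained sketch (simplified; the real signatures in hybrid_search.py may differ):

```python
def reciprocal_rank_fusion(result_lists, weights, k=60, id_key="id"):
    """Merge ranked lists, deduplicating by id and accumulating weighted RRF scores."""
    fused = {}
    for weight, results in zip(weights, result_lists):
        for rank, doc in enumerate(results):
            entry = fused.setdefault(doc[id_key], dict(doc, rrf_score=0.0))
            entry["rrf_score"] += weight / (k + rank + 1)
    return sorted(fused.values(), key=lambda d: d["rrf_score"], reverse=True)

# Two rankings over the same corpus: vector order and BM25 order.
vector_results = [{"id": "d1"}, {"id": "d2"}]
bm25_results = [{"id": "d2"}, {"id": "d3"}]

alpha = 0.5
merged = reciprocal_rank_fusion([vector_results, bm25_results], weights=[alpha, 1 - alpha])
print([d["id"] for d in merged])  # d2 first: it appears in both rankings
```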

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • jmaionchi
  • stepmikhaylov
  • Rod-Christensen

Poem

🐰 Hopping through the search with joy,
Vector and keyword words deploy,
RRF fuses them with care,
Hybrid results float in the air!

🚥 Pre-merge checks — ✅ 3 passed
  • Description check ✅ Passed — check skipped: CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed — the title accurately and specifically describes the main change: introducing a hybrid search node that combines BM25 keyword search with vector similarity fusion.
  • Docstring coverage ✅ Passed — docstring coverage is 81.25%, above the required 80.00% threshold.



…output

The structured answer was reading from doc.metadata.get('hybrid_score')
which was never populated — the RRF score is stored on doc.score. This
caused every score in the answer output to display 'N/A'. The test was
masking this bug by pre-populating hybrid_score in test metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/search_hybrid/hybrid_search.py`:
- Around line 160-167: The search function currently ignores misaligned
vector_scores (leaving vector_results empty) which silently changes ranking;
instead validate that when vector_scores is not None its length matches
documents and raise a clear exception if not; update the validation in search
(and the same logic block referenced around the vector_results handling) to
check len(vector_scores) == len(documents) and raise ValueError (or similar)
with a message including the function name and the mismatched lengths so
upstream callers fail fast and fix the alignment bug rather than receiving
BM25-only results.
- Around line 55-67: The code must validate ranking parameters before any
slicing or Reciprocal Rank Fusion math: in bm25_search ensure the incoming top_k
is an integer >= 1 and raise a clear ValueError if not; likewise add the same
non-negative/positive checks for any RRF-related parameters used in the fusion
path (e.g., the RRF k/threshold parameters inside the method that performs
RRF/fusion — locate functions named rrf_fuse or hybrid_search/fuse_scores) so
you fail fast instead of allowing Python slicing to return unexpected results or
causing divide-by-zero during fusion.

In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 39-45: The depends(requirements) call in validateConfig should be
moved into the non-CONFIG branch of beginGlobal to match other nodes: remove the
dependency resolution from validateConfig, and in beginGlobal's else branch
import depends (from depends import depends), compute requirements using
os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt' and call
depends(requirements) before any engine/ hybrid initialization (so lazy imports
in hybrid_search.py succeed at runtime); keep beginGlobal's CONFIG branch
unchanged.

In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 94-102: The loop that builds reranked_docs currently only sets
reranked_doc.score when rrf_score exists, leaving the emitted Doc with its
upstream score in BM25-only or vector-only fallbacks; update the logic in the
results iteration (the block handling result, reranked_doc, rrf_score) to set
reranked_doc.score to the actual ranking score: use rrf_score if present,
otherwise fall back to result.get('bm25_score') then result.get('vector_score')
(and only if none exist keep the original docs[orig_idx].score), and then append
to reranked_docs so documents/answers report the score that determined ranking.

In `@nodes/src/nodes/search_hybrid/requirements.txt`:
- Line 1: Update the node manifest to remove the unnecessary version bounds:
replace the current pinned entry with just the package name "rank_bm25" in this
node's requirements.txt; if you believe a concrete compatibility reason requires
the range, add a brief documented justification in the PR instead of keeping the
bound. Ensure the only identifier changed is the package token "rank_bm25" in
the requirements.txt for this node.

In `@test/nodes/test_search_hybrid.py`:
- Around line 36-39: The test currently installs a fake rank_bm25 at module
import via _install_bm25_stub() but never restores it, leaking the stub into
other tests; change this to scope the stub to the test lifecycle by moving the
install/restore into a pytest fixture or into setup/teardown so
_install_bm25_stub() is called before the test(s) run and _restore_modules() is
always called after; update uses of _STUB_MODULES, _install_bm25_stub, and
_restore_modules (and the other stub-install code at the other occurrences) so
the fake module is only present during the test run and is reliably removed
afterward.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 96c70873-89be-4644-a2b7-d9e0a3a54db1

📥 Commits

Reviewing files that changed from the base of the PR and between 7d81b7e and 1bf2439.

📒 Files selected for processing (7)
  • nodes/src/nodes/search_hybrid/IGlobal.py
  • nodes/src/nodes/search_hybrid/IInstance.py
  • nodes/src/nodes/search_hybrid/__init__.py
  • nodes/src/nodes/search_hybrid/hybrid_search.py
  • nodes/src/nodes/search_hybrid/requirements.txt
  • nodes/src/nodes/search_hybrid/services.json
  • test/nodes/test_search_hybrid.py

Comment on lines +55 to +67
```python
def bm25_search(self, query: str, documents: List[Dict[str, Any]], top_k: int = 10) -> List[Dict[str, Any]]:
    """
    BM25 keyword search over document texts.

    Args:
        query: The search query string.
        documents: List of dicts, each must have a 'text' key with the document content.
            May also have an 'id' key for identification.
        top_k: Maximum number of results to return.

    Returns:
        List of documents with 'bm25_score' added, sorted by score descending.
    """
```

⚠️ Potential issue | 🟠 Major

Validate negative ranking parameters before slicing/RRF math.

Negative top_k values silently return all-but-last results because of Python slicing, and k=-1 hits a divide-by-zero on the first fused item. Since these values can come straight from node config, fail fast here instead of letting them produce truncated output or runtime crashes.

🛠️ Proposed fix
```diff
     def bm25_search(self, query: str, documents: List[Dict[str, Any]], top_k: int = 10) -> List[Dict[str, Any]]:
         """
         BM25 keyword search over document texts.
@@
         Returns:
             List of documents with 'bm25_score' added, sorted by score descending.
         """
+        if top_k < 0:
+            raise ValueError('top_k must be non-negative')
         if not query or not documents:
             return []
@@
     def reciprocal_rank_fusion(
         *result_lists: List[Dict[str, Any]],
         k: int = 60,
         id_key: str = 'id',
         weights: Optional[List[float]] = None,
@@
         Returns:
             Merged, deduplicated list sorted by RRF score descending.
         """
+        if k < 0:
+            raise ValueError('k must be non-negative')
         if weights is not None and len(weights) != len(result_lists):
             raise ValueError('weights length must match the number of result lists')
@@
     def search(
         self,
         query: str,
         documents: List[Dict[str, Any]],
         vector_scores: Optional[List[float]] = None,
@@
         Returns:
             List of documents ranked by hybrid RRF score.
         """
+        if top_k < 0:
+            raise ValueError('top_k must be non-negative')
         if not documents:
             return []
```

Also applies to: 105-128, 160-167


Comment on lines +160 to +167
```python
def search(
    self,
    query: str,
    documents: List[Dict[str, Any]],
    vector_scores: Optional[List[float]] = None,
    top_k: int = 10,
    rrf_k: int = 60,
) -> List[Dict[str, Any]]:
```

⚠️ Potential issue | 🟠 Major

Reject misaligned vector_scores instead of silently dropping the vector signal.

When vector_scores is provided with the wrong length, this code just leaves vector_results empty and the caller gets BM25-only results. That hides an upstream data-alignment bug and changes ranking semantics without any error.

🛠️ Proposed fix
```diff
         # Build vector-ranked list if scores are provided
         vector_results: List[Dict[str, Any]] = []
-        if vector_scores is not None and len(vector_scores) == len(documents):
+        if vector_scores is not None:
+            if len(vector_scores) != len(documents):
+                raise ValueError('vector_scores length must match documents length')
             scored = []
             for i, doc in enumerate(documents):
                 doc_copy = dict(doc)
                 doc_copy['vector_score'] = float(vector_scores[i])
                 scored.append(doc_copy)
```

Also applies to: 185-195


Comment on lines +39 to +45
```python
def validateConfig(self):
    """Validate that the rank_bm25 dependency is available."""
    try:
        from depends import depends

        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

printf '== search_hybrid IGlobal ==\n'
sed -n '1,140p' nodes/src/nodes/search_hybrid/IGlobal.py

printf '\n== sibling nodes resolving requirements in beginGlobal ==\n'
rg -n -C2 "depends\(requirements\)|from depends import depends" nodes/src/nodes --type py

printf '\n== framework callers of validateConfig ==\n'
rg -n -C2 "\bvalidateConfig\s*\(" --type py

printf '\n== framework callers of beginGlobal ==\n'
rg -n -C2 "\bbeginGlobal\s*\(" --type py
```

Repository: rocketride-org/rocketride-server

Length of output: 50388


Move dependency resolution to beginGlobal() to match the standard pattern used across the codebase.

The depends(requirements) call is currently isolated in validateConfig(), but runtime execution of beginGlobal() does not load dependencies. This pattern deviates from virtually all other nodes in the codebase (vectorizer, question, prompt, summarization, text_output, and 60+ additional nodes), which consistently load dependencies in the non-CONFIG branch of beginGlobal(). If the framework does not guarantee validateConfig() executes before beginGlobal(), the lazy import of rank_bm25 in hybrid_search.py will fail when requests reach the engine.

Move the depends(requirements) call into the else branch of beginGlobal() (the execution path for non-CONFIG mode) to ensure dependencies are available before engine initialization:

Suggested pattern (already used by sibling nodes):
```python
def beginGlobal(self):
    if self.IEndpoint.endpoint.openMode == OPEN_MODE.CONFIG:
        pass
    else:
        from depends import depends
        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)

        # existing config/engine init code
```

Comment on lines +94 to +102
```python
for result in results:
    orig_idx = result.get('original_index')
    if orig_idx is not None and 0 <= orig_idx < len(docs):
        reranked_doc = copy.deepcopy(docs[orig_idx])
        # Update score with RRF score
        rrf_score = result.get('rrf_score')
        if rrf_score is not None:
            reranked_doc.score = rrf_score
        reranked_docs.append(reranked_doc)
```

⚠️ Potential issue | 🟠 Major

Copy the actual ranking score onto the emitted Doc.

In BM25-only mode (alpha == 0) or any non-fused fallback, result carries bm25_score/vector_score but no rrf_score. This branch leaves reranked_doc.score at the upstream value, so the documents and answers lanes can report a score that did not determine the ranking.

🛠️ Proposed fix
```diff
-                rrf_score = result.get('rrf_score')
-                if rrf_score is not None:
-                    reranked_doc.score = rrf_score
+                score = result.get('rrf_score')
+                if score is None:
+                    score = result.get('bm25_score')
+                if score is None:
+                    score = result.get('vector_score')
+                if score is not None:
+                    reranked_doc.score = score
```

Comment on lines +113 to +122
```python
if reranked_docs and self.instance.hasListener('answers'):
    context_parts = []
    for i, doc in enumerate(reranked_docs):
        score = doc.score if doc.score is not None else 'N/A'
        snippet = (doc.page_content or '')[:500]
        context_parts.append(f'[Document {i + 1}] (score: {score})\n{snippet}')
    answer_text = f'Hybrid search returned {len(reranked_docs)} results:\n\n' + '\n\n'.join(context_parts)
    ans = Answer()
    ans.setAnswer(answer_text)
    self.instance.writeAnswers(ans)
```

⚠️ Potential issue | 🟠 Major

The answer payload still drops source attribution.

Each block is only labeled [Document N], so any upstream source/title/url metadata disappears from the answer lane. That misses the PR goal of preserving source attribution and makes the emitted answer much harder to audit.
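
One way to keep attribution in the answer lane — a sketch assuming each document's metadata carries a `source` key (an assumption; the actual metadata schema is not shown in this excerpt):

```python
def format_answer_block(i, doc):
    """Label each answer block with its source so attribution survives into the answer text."""
    source = doc.get("metadata", {}).get("source", "unknown")
    score = doc.get("score", "N/A")
    snippet = (doc.get("page_content") or "")[:500]
    return f"[Document {i + 1}] (source: {source}, score: {score})\n{snippet}"

doc = {"page_content": "RRF fuses rankings.", "score": 0.0161,
       "metadata": {"source": "docs/rrf.md"}}
print(format_answer_block(0, doc))
```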

```diff
@@ -0,0 +1 @@
+rank_bm25>=0.2.2,<1.0.0
```

🧹 Nitpick | 🔵 Trivial

Prefer an unconstrained dependency in this node manifest.

These node-level requirements.txt files are used as abstract manifests, so the >=0.2.2,<1.0.0 bound mostly just constrains the resolver without providing the actual lock. Please keep only rank_bm25 here unless there is a documented compatibility break that truly requires the range.

Please confirm there is a concrete compatibility reason before keeping the bound.

Based on learnings: "The project uses uv pip compile to resolve and lock exact versions, so the requirements.txt file serves only as an abstract dependency manifest. A version pin in requirements.txt is unnecessary and was declined in PR #456."


Comment on lines +36 to +39
```python
def _install_bm25_stub():
    """Install a minimal rank_bm25 stub for testing."""
    for name in _STUB_MODULES:
        _original_modules[name] = sys.modules.get(name)
```

⚠️ Potential issue | 🟠 Major

Scope the rank_bm25 stub to the test lifecycle.

_install_bm25_stub() runs at module import, but _restore_modules() is never paired with that top-level installation. Every later test in the session will see the fake rank_bm25 module, which makes the suite order-dependent and can mask real dependency/import regressions.

Also applies to: 78-83, 86-102

🧰 Tools
🪛 Ruff (0.15.7)

[warning] 36-36: Missing return type annotation for private function _install_bm25_stub

Add return type annotation: None

(ANN202)


@stepmikhaylov stepmikhaylov added hackathon Hackathon project or submission needs-refinement Requires further refinement before merging good-concept Solid concept worth pursuing labels Apr 6, 2026
@stepmikhaylov
Collaborator

Thank you for the contribution and the detailed write-up.

After review, this PR is not ready for merge in its current state and is being closed to keep the PR list clean. The concept and proposed changes have been archived in #619 for further elaboration and future consideration.


Labels

  • good-concept — Solid concept worth pursuing
  • hackathon — Hackathon project or submission
  • module:nodes — Python pipeline nodes
  • needs-refinement — Requires further refinement before merging
