
feat(nodes): add text chunking strategies and hybrid search for RAG pipelines #514

Closed

nihalnihalani wants to merge 2 commits into rocketride-org:develop from nihalnihalani:feature/rag-chunking-hybrid-search

Conversation


@nihalnihalani nihalnihalani commented Mar 30, 2026

#Hack-with-bay-2

Contribution Type

Feature — Add text chunking strategies and hybrid search for RAG pipelines

Problem / Use Case

RocketRide's RAG pipeline has embedding and vector storage nodes, but is missing two critical preprocessing steps:

  1. No text chunking — Documents must be manually split before embedding. Large documents exceed embedding model context windows, and naive splitting loses semantic coherence.
  2. No hybrid search — Vector similarity search alone misses keyword-relevant results. A query for "RocketRide API authentication" may retrieve documents about "security" (semantically similar) but miss documents containing the exact phrase "API authentication" (keyword match).

The complete RAG pipeline should be: Chunking → Embedding → Vector Store → Hybrid Search → Rerank → LLM

Proposed Solution

Chunker Node (nodes/src/nodes/chunker/)

Three strategies via chunker_strategies.py:

  1. RecursiveCharacterChunker — Splits by separators ("\n\n", "\n", ". ", " ") with configurable overlap, recurses on oversized chunks
  2. SentenceChunker — Splits at sentence boundaries (. ! ?), groups sentences until chunk_size, tracks search_start position for repeated text
  3. TokenChunker — Splits by token count using tiktoken, incremental character position tracking (O(n) not O(n²)), respects model context windows
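To illustrate the recursive strategy, here is a minimal standalone sketch of separator-based recursive splitting. The function name, signature, and default separator list are assumptions for illustration, not the node's actual API; for brevity, overlap is applied only at the final hard-split fallback.

```python
def recursive_chunk(text, chunk_size=200, overlap=20,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split by the coarsest separator first, recursing with finer
    separators on pieces that still exceed chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-split with character overlap.
        step = max(chunk_size - overlap, 1)
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Piece itself is oversized: recurse with the next, finer separator.
            chunks.extend(recursive_chunk(piece, chunk_size, overlap, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

The key property, preserved from the description above, is that semantically coarse boundaries (paragraphs) are tried before fine ones (sentences, words), so chunks stay coherent whenever the size budget allows.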

Hybrid Search Node (nodes/src/nodes/search_hybrid/)

HybridSearchEngine combining vector + keyword search:

  1. BM25 keyword search — Okapi BM25 with consistent tokenization (lowercase, punctuation stripping)
  2. Alpha-weighted Reciprocal Rank Fusion: score = alpha * (1/(k+rank_vector)) + (1-alpha) * (1/(k+rank_bm25)), where alpha controls vector vs. keyword weight
  3. Configurable: alpha (0.0=BM25 only, 1.0=vector only), top_k, RRF k parameter
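The fusion formula above can be sketched as a small standalone function over two ranked id lists (a hypothetical helper, not the HybridSearchEngine's actual interface):

```python
def rrf_fuse(vector_ranked, bm25_ranked, alpha=0.5, k=60, top_k=5):
    """Alpha-weighted Reciprocal Rank Fusion over two ranked lists of
    document ids. alpha=1.0 -> vector only, alpha=0.0 -> BM25 only;
    the k constant dampens the gap between adjacent ranks."""
    scores = {}
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    # Highest fused score first, truncated to top_k.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Because RRF works on ranks rather than raw scores, the vector and BM25 signals need no score normalization before fusion, which is the usual motivation for choosing RRF over weighted score sums.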

Tests: 69 total (32 chunker + 28 hybrid search + 9 new after review)

Value Added

  • Better RAG quality: Proper chunking preserves semantic context; hybrid search catches both semantic and keyword matches
  • Flexible chunking: Choose strategy based on document type (recursive for prose, sentence for articles, token for model-specific splitting)
  • Tunable search: Alpha parameter lets users balance precision (keyword) vs recall (semantic)
  • Complete RAG pipeline: Fills the two missing steps between raw documents and the Cohere Rerank node (PR #499: feat(nodes): add Cohere Rerank pipeline node for RAG quality improvement)

Why This Feature Fits This Codebase

Both nodes follow the standard IGlobal/IInstance pattern. The chunker's writeDocuments() method emits multiple documents per input (one per chunk) with metadata (chunk_index, start_char, end_char, parent_id, total_chunks) — following the same document emission pattern used in nodes/src/nodes/llamaparse/IInstance.py for document parsing.

The hybrid search node's writeQuestions() receives questions with attached documents from an upstream vector DB (e.g., Chroma, Pinecone) that include vector similarity scores. It runs BM25 on the document texts, then merges via alpha-weighted RRF. This matches the question→documents→answers lane flow defined in vector DB nodes like nodes/src/nodes/chroma/services.json (lines 80-90).

The depends() call pattern in IGlobal.beginGlobal ensures tiktoken and rank_bm25 are installed at runtime, following the same dependency auto-loading pattern as all other nodes (e.g., nodes/src/nodes/llm_openai/IGlobal.py lines 40-43).

Deep copy (copy.deepcopy) is used on individual documents before mutation, preventing corruption in fan-out pipelines where the engine's parallel pipe execution model shares objects across branches.

Validation

$ pytest test/nodes/test_chunker.py test/nodes/test_search_hybrid.py -v
===== 69 passed =====

Chunker tests (32): recursive splitting with overlap, sentence boundary detection, token-based splitting, empty text handling, chunk metadata correctness, repeated sentence start_char tracking, overlap computation, IGlobal lifecycle

Hybrid search tests (28+9): BM25 keyword matching, RRF merging with deduplication, alpha-weighted fusion (alpha=0.8 boosts vector, alpha=0.2 boosts BM25, equal weights verification), full hybrid end-to-end, deep copy prevention, weights length validation

How This Could Be Extended

  • Add semantic chunking using embedding similarity between sentences
  • Add document-type-aware chunking (PDF page breaks, Markdown headers, code blocks)
  • Add learned reranking weights (auto-tune alpha based on query type)
  • Integrate chunk metadata with the VS Code extension for source highlighting

🤖 Generated with Claude Code

…ipelines

- Add chunker node with recursive character, sentence, and token-based strategies
- Configurable chunk_size, chunk_overlap, and encoding for token chunking
- Add search_hybrid node combining vector similarity + BM25 keyword search
- Reciprocal Rank Fusion (RRF) for merging multi-signal search results
- Configurable alpha weight between vector and keyword scores
- Deep copy prevents input mutation in fan-out pipelines
- Comprehensive tests for all chunking strategies and hybrid search (60 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai bot commented Mar 30, 2026

Warning

Rate limit exceeded

@nihalnihalani has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 0 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 0 minutes and 0 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9888894c-4f9b-4880-8e40-2801a8a3fef6

📥 Commits

Reviewing files that changed from the base of the PR and between b6b2dc1 and 455765b.

📒 Files selected for processing (14)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/__init__.py
  • nodes/src/nodes/chunker/chunker_strategies.py
  • nodes/src/nodes/chunker/requirements.txt
  • nodes/src/nodes/chunker/services.json
  • nodes/src/nodes/search_hybrid/IGlobal.py
  • nodes/src/nodes/search_hybrid/IInstance.py
  • nodes/src/nodes/search_hybrid/__init__.py
  • nodes/src/nodes/search_hybrid/hybrid_search.py
  • nodes/src/nodes/search_hybrid/requirements.txt
  • nodes/src/nodes/search_hybrid/services.json
  • test/nodes/test_chunker.py
  • test/nodes/test_search_hybrid.py


@github-actions github-actions bot added the module:nodes Python pipeline nodes label Mar 30, 2026
…h PR

- Fix alpha parameter to actually weight RRF fusion (alpha * vector + (1-alpha) * BM25)
- Fix SentenceChunker start_char tracking for repeated sentences
- Fix TokenChunker O(n²) prefix decoding with incremental character tracking
- Remove unnecessary deep copy of entire question in hybrid search
- Update tests for alpha-weighted RRF and repeated sentence handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@asclearuc asclearuc left a comment


Thank you for this contribution — both nodes are well-motivated additions to the RAG pipeline, and the algorithmic implementations (BM25, RRF, recursive/sentence/token chunking) are solid.

Inline comments have been left on specific lines in both nodes.

One structural request: please split this into two separate PRs — one for chunker and one for search_hybrid. The two nodes share no code, no dependencies, and occupy different positions in the pipeline. Splitting them allows each to be reviewed and merged independently, and avoids blocking chunker on any unresolved search_hybrid issues (or vice versa).

if needed.
"""
if self.IGlobal.strategy is None:
debug('Chunker strategy not initialized; passing documents through')
@asclearuc (Collaborator):

Silent return here drops all documents without any error signal. If the strategy failed to initialize, this should be a hard failure:

    raise RuntimeError('Chunker strategy not initialized')

@nihalnihalani (Author):

@asclearuc Good catch — you're absolutely right. A silent return here masks a real initialization failure and makes it impossible to debug when documents silently disappear from the pipeline. Replacing this with raise RuntimeError('Chunker strategy not initialized') is the correct approach: if the strategy failed to initialize, the pipeline should fail loudly rather than producing an empty result set that looks like "no documents found." Will fix this in the next push.

continue

# Deep copy to avoid mutating the original
base_doc = copy.deepcopy(document)
@asclearuc (Collaborator):

base_doc is a deepcopy of document, then on line 84 each chunk_doc is a deepcopy of base_doc — that is N+1 full copies per document for every call to writeDocuments. For large document batches this will be a significant performance bottleneck.

Consider avoiding deepcopy entirely: page_content is always replaced (line 85) and chunkId is always overwritten (line 89/91), so the only field that needs isolation is metadata. A shallow copy of the doc with an explicit copy of just the metadata object would be significantly cheaper.

@nihalnihalani (Author):

@asclearuc Great performance observation. You're right — the double deepcopy (once for base_doc, then again for each chunk_doc) is unnecessarily expensive, especially with large document batches. Since page_content is always replaced on line 85 and chunkId is always overwritten on line 89/91, the only field that actually needs isolation is the metadata dict.

I'll refactor this to use a shallow copy of the document with an explicit copy.copy() (or dict copy) of just the metadata object. This avoids the overhead of recursively copying the entire document structure N+1 times per input document while still ensuring metadata isolation between chunks.
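A minimal sketch of that refactor, with illustrative Doc/Meta classes standing in for the node's real document and metadata types:

```python
import copy

class Meta:
    """Stand-in for the document metadata object (assumed shape)."""
    def __init__(self, chunkId=None):
        self.chunkId = chunkId

class Doc:
    """Stand-in for the pipeline document type (assumed shape)."""
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata

def make_chunk_doc(document, chunk_text, chunk_id):
    # Shallow copy: page_content is replaced below and chunkId is always
    # overwritten, so only the metadata object needs its own copy.
    chunk_doc = copy.copy(document)
    chunk_doc.page_content = chunk_text
    if document.metadata is not None:
        chunk_doc.metadata = copy.copy(document.metadata)  # isolate metadata
        chunk_doc.metadata.chunkId = chunk_id
    return chunk_doc
```

This keeps per-chunk cost proportional to the metadata size rather than the full document (which may carry large embedding vectors), while the original document remains untouched.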


# Update metadata
if chunk_doc.metadata is not None:
chunk_doc.metadata.chunkId = self.chunkId
@asclearuc (Collaborator):

When metadata already exists, only chunkId is updated. The parent_id extracted on line 72 is silently lost in this branch, so the new chunk's metadata has no reference back to the original document. Consider storing it:

    chunk_doc.metadata.parentId = parent_id

@nihalnihalani (Author):

@asclearuc You're right — this is a data integrity issue. The parent_id extracted on line 72 is critical for downstream consumers that need to trace chunks back to their source document (e.g., for citation, deduplication, or context reconstruction). Losing it silently in the existing-metadata branch means any document that already has metadata will have no lineage reference.

Will add chunk_doc.metadata.parentId = parent_id in both branches so every chunk consistently carries a reference to its originating document, regardless of whether the metadata object was freshly created or already existed.
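A small sketch of that both-branches fix, using SimpleNamespace as a stand-in for the real metadata object (field names are illustrative, not the node's actual API):

```python
from types import SimpleNamespace

def attach_lineage(chunk_doc, parent_id, chunk_id):
    """Set chunkId and parentId whether metadata already exists or must be
    created, so every chunk carries a reference to its source document."""
    if chunk_doc.metadata is None:
        chunk_doc.metadata = SimpleNamespace()
    chunk_doc.metadata.chunkId = chunk_id
    chunk_doc.metadata.parentId = parent_id  # previously lost when metadata existed
    return chunk_doc
```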

"""
if self.IGlobal.engine is None:
debug('Hybrid search engine not initialized; passing question through')
return
@asclearuc (Collaborator):

Same issue as the chunker: the silent return drops the question without any output. If the engine is not initialized this should raise, not silently discard.

@nihalnihalani (Author):

@asclearuc Agreed — same reasoning as the chunker. A silent return here means the question is silently dropped from the pipeline with no trace, making it extremely difficult to debug why a query produced no results. Will replace with raise RuntimeError('Hybrid search engine not initialized') to match the chunker fix and ensure consistent fail-fast behavior across both nodes.

reranked_docs.append(reranked_doc)

# Update the question with reranked documents
question.documents = reranked_docs
@asclearuc (Collaborator):

question is mutated directly. In a fan-out pipeline the same question object is shared across branches — branch A's reranking will corrupt the document list seen by branch B. Deep copy the question at the top of this method:

    question = copy.deepcopy(question)

@nihalnihalani (Author):

@asclearuc This is a subtle and important catch. In a fan-out pipeline topology, the same question object reference is shared across branches, so mutating it directly here (modifying the document list via reranking) would corrupt the state seen by other branches. This is exactly the kind of bug that only surfaces in production fan-out configurations and is nearly impossible to reproduce in simple linear test pipelines.

Will add question = copy.deepcopy(question) at the top of this method to ensure each branch operates on its own isolated copy. Since question objects are typically small (query text + metadata), the deepcopy cost here is negligible compared to the search operations that follow.

Comment on lines +111 to +117
if reranked_docs and self.instance.hasListener('answers'):
# Build a combined answer from the top documents
answer_texts = [doc.page_content for doc in reranked_docs if doc.page_content]
if answer_texts:
ans = Answer()
ans.setAnswer('\n\n'.join(answer_texts))
self.instance.writeAnswers(ans)
@asclearuc (Collaborator):

Writing a concatenation of all document texts to the answers lane — is that the intention here? Any downstream LLM node receiving this would get a large unstructured blob instead of structured documents.

@nihalnihalani (Author):

@asclearuc Good question. The intent here was to provide a quick "combined context" string that a downstream LLM node could consume directly as a single input. However, you're right that this is problematic — writing a concatenated blob of all document texts to the answers lane loses document boundaries, source attribution, and any metadata structure. A downstream LLM node receiving this would have no way to distinguish between individual source documents or attribute citations.

I'll revisit this approach. The better pattern is likely to either: (1) write individual documents to the documents lane and let the downstream LLM node handle context assembly, or (2) if a combined context string is genuinely needed, make it opt-in via a node parameter rather than the default behavior. Will update this in the next revision.
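If the combined-context option is kept, one way to preserve structure is to render numbered snippets with score and source attribution rather than a raw concatenation. The dict shapes below (page_content, score, metadata["source"]) are illustrative assumptions, not the node's document type:

```python
def format_answer(docs, max_docs=3):
    """Render the top documents as numbered snippets with score and
    source, instead of one undifferentiated text blob."""
    lines = []
    for i, doc in enumerate(docs[:max_docs], start=1):
        # Fall back to "unknown" when a document carries no metadata.
        source = (doc.get("metadata") or {}).get("source", "unknown")
        lines.append(f"[{i}] (score={doc['score']:.3f}, source={source})\n"
                     f"{doc['page_content']}")
    return "\n\n".join(lines)
```

With boundaries and attribution intact, a downstream LLM node can still cite individual sources even when it receives a single string.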

@nihalnihalani (Author):

@asclearuc Thank you for the thorough and detailed review. All of the inline comments are valid and well-reasoned — I've responded to each one individually above with the planned fixes.

Regarding the structural request to split this into two separate PRs (one for chunker, one for search_hybrid): I agree this makes sense. The two nodes share no code, no dependencies, and sit at different positions in the pipeline. Keeping them in separate PRs will allow each node to be reviewed and merged independently, make the git history cleaner, and reduce blast radius if either node needs further iteration.

I'll split this PR into two focused PRs and apply all the fixes from your inline comments in the process. Will close this PR once the two new ones are up.

nihalnihalani added a commit to nihalnihalani/rocketride-server that referenced this pull request Apr 2, 2026
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nihalnihalani added a commit to nihalnihalani/rocketride-server that referenced this pull request Apr 2, 2026
… similarity search

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized engine (not silent return)
- Deep copy question to prevent fan-out mutation corruption
- Structured answer output instead of concatenated text blob

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nihalnihalani (Author):

@asclearuc Thank you for the thorough review and the structural feedback. All inline comments have been addressed and the PR has been split as requested.

Split PRs

  1. Chunker node: feat(nodes): add text chunker node with recursive, sentence, and token strategies #601 (feature/rag-chunker-node)

    • RecursiveCharacterChunker, SentenceChunker, TokenChunker
    • 40 tests passing
  2. Hybrid search node: feat(nodes): add hybrid search node with BM25 keyword and vector similarity fusion #602 (feature/rag-hybrid-search-node)

    • BM25 keyword search + alpha-weighted Reciprocal Rank Fusion
    • 39 tests passing

Review feedback addressed in both PRs

Chunker (#601)

  • Silent return → RuntimeError on uninitialized strategy
  • N+1 deepcopy → shallow copy + metadata copy for performance (avoids deepcopy on documents with embedding vectors)
  • parent_id preserved in all metadata branches — every chunk carries parentId back to source
  • Cleaned up dead code (unreachable else branch, unused _STRATEGY_MAP)
  • Fixed services.json documentation to match actual metadata fields

Hybrid Search (#602)

  • Silent return → RuntimeError on uninitialized engine
  • Deep copy question at top of writeQuestions to prevent fan-out mutation corruption
  • Structured answer output — replaced concatenated text blob with numbered document snippets, scores, and source attribution
  • Fixed score display bug (was reading from metadata instead of doc.score)

Closing this PR in favor of the two focused PRs above. Thank you again for the guidance!

🤖 Generated with Claude Code

@nihalnihalani (Author):

Closing in favor of split PRs: #601 (chunker) and #602 (hybrid search). All review feedback from @asclearuc has been addressed in the new PRs.

nihalnihalani added a commit to nihalnihalani/rocketride-server that referenced this pull request Apr 6, 2026
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>