
feat(nodes): add text chunking strategies and hybrid search for RAG pipelines #514

Closed

nihalnihalani wants to merge 2 commits into rocketride-org:develop from nihalnihalani:feature/rag-chunking-hybrid-search

Conversation


@nihalnihalani nihalnihalani commented Mar 30, 2026

#Hack-with-bay-2

Contribution Type

Feature — Add text chunking strategies and hybrid search for RAG pipelines

Problem / Use Case

RocketRide's RAG pipeline has embedding and vector storage nodes, but is missing two critical preprocessing steps:

  1. No text chunking — Documents must be manually split before embedding. Large documents exceed embedding model context windows, and naive splitting loses semantic coherence.
  2. No hybrid search — Vector similarity search alone misses keyword-relevant results. A query for "RocketRide API authentication" may retrieve documents about "security" (semantically similar) but miss documents containing the exact phrase "API authentication" (keyword match).

The complete RAG pipeline should be: Chunking → Embedding → Vector Store → Hybrid Search → Rerank → LLM

Proposed Solution

Chunker Node (nodes/src/nodes/chunker/)

Three strategies via chunker_strategies.py:

  1. RecursiveCharacterChunker — Splits by separators ("\n\n", "\n", ". ", " ") with configurable overlap, recurses on oversized chunks
  2. SentenceChunker — Splits at sentence boundaries (. ! ?), groups sentences until chunk_size, tracks search_start position for repeated text
  3. TokenChunker — Splits by token count using tiktoken, incremental character position tracking (O(n) not O(n²)), respects model context windows
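To illustrate the recursive strategy, here is a minimal standalone sketch of separator-based recursive splitting. The function name, signature, and default separator list are assumptions for illustration, not the node's actual API; for brevity, overlap is applied only at the final hard-split fallback.

```python
def recursive_chunk(text, chunk_size=200, overlap=20,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split by the coarsest separator first, recursing with finer
    separators on pieces that still exceed chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-split with character overlap.
        step = max(chunk_size - overlap, 1)
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Piece itself is oversized: recurse with the next, finer separator.
            chunks.extend(recursive_chunk(piece, chunk_size, overlap, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

The key property, preserved from the description above, is that semantically coarse boundaries (paragraphs) are tried before fine ones (sentences, words), so chunks stay coherent whenever the size budget allows.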

Hybrid Search Node (nodes/src/nodes/search_hybrid/)

HybridSearchEngine combining vector + keyword search:

  1. BM25 keyword search — Okapi BM25 with consistent tokenization (lowercase, punctuation stripping)
  2. Alpha-weighted Reciprocal Rank Fusion: score = alpha * (1/(k+rank_vector)) + (1-alpha) * (1/(k+rank_bm25)), where alpha controls vector vs. keyword weight
  3. Configurable: alpha (0.0=BM25 only, 1.0=vector only), top_k, RRF k parameter
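The fusion formula above can be sketched as a small standalone function over two ranked id lists (a hypothetical helper, not the HybridSearchEngine's actual interface):

```python
def rrf_fuse(vector_ranked, bm25_ranked, alpha=0.5, k=60, top_k=5):
    """Alpha-weighted Reciprocal Rank Fusion over two ranked lists of
    document ids. alpha=1.0 -> vector only, alpha=0.0 -> BM25 only;
    the k constant dampens the gap between adjacent ranks."""
    scores = {}
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    # Highest fused score first, truncated to top_k.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Because RRF works on ranks rather than raw scores, the vector and BM25 signals need no score normalization before fusion, which is the usual motivation for choosing RRF over weighted score sums.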

Tests: 69 total (32 chunker + 28 hybrid search + 9 new after review)

Value Added

  • Better RAG quality: Proper chunking preserves semantic context; hybrid search catches both semantic and keyword matches
  • Flexible chunking: Choose strategy based on document type (recursive for prose, sentence for articles, token for model-specific splitting)
  • Tunable search: Alpha parameter lets users balance precision (keyword) vs recall (semantic)
  • Complete RAG pipeline: Fills the two missing steps between raw documents and the Cohere Rerank node (PR #499: feat(nodes): add Cohere Rerank pipeline node for RAG quality improvement)

Why This Feature Fits This Codebase

Both nodes follow the standard IGlobal/IInstance pattern. The chunker's writeDocuments() method emits multiple documents per input (one per chunk) with metadata (chunk_index, start_char, end_char, parent_id, total_chunks) — following the same document emission pattern used in nodes/src/nodes/llamaparse/IInstance.py for document parsing.

The hybrid search node's writeQuestions() receives questions with attached documents from an upstream vector DB (e.g., Chroma, Pinecone) that include vector similarity scores. It runs BM25 on the document texts, then merges via alpha-weighted RRF. This matches the question→documents→answers lane flow defined in vector DB nodes like nodes/src/nodes/chroma/services.json (lines 80-90).

The depends() call pattern in IGlobal.beginGlobal ensures tiktoken and rank_bm25 are installed at runtime, following the same dependency auto-loading pattern as all other nodes (e.g., nodes/src/nodes/llm_openai/IGlobal.py lines 40-43).

Deep copy (copy.deepcopy) is used on individual documents before mutation, preventing corruption in fan-out pipelines where the engine's parallel pipe execution model shares objects across branches.

Validation

$ pytest test/nodes/test_chunker.py test/nodes/test_search_hybrid.py -v
===== 69 passed =====

Chunker tests (32): recursive splitting with overlap, sentence boundary detection, token-based splitting, empty text handling, chunk metadata correctness, repeated sentence start_char tracking, overlap computation, IGlobal lifecycle

Hybrid search tests (28+9): BM25 keyword matching, RRF merging with deduplication, alpha-weighted fusion (alpha=0.8 boosts vector, alpha=0.2 boosts BM25, equal weights verification), full hybrid end-to-end, deep copy prevention, weights length validation

How This Could Be Extended

  • Add semantic chunking using embedding similarity between sentences
  • Add document-type-aware chunking (PDF page breaks, Markdown headers, code blocks)
  • Add learned reranking weights (auto-tune alpha based on query type)
  • Integrate chunk metadata with the VS Code extension for source highlighting

🤖 Generated with Claude Code

…ipelines

- Add chunker node with recursive character, sentence, and token-based strategies
- Configurable chunk_size, chunk_overlap, and encoding for token chunking
- Add search_hybrid node combining vector similarity + BM25 keyword search
- Reciprocal Rank Fusion (RRF) for merging multi-signal search results
- Configurable alpha weight between vector and keyword scores
- Deep copy prevents input mutation in fan-out pipelines
- Comprehensive tests for all chunking strategies and hybrid search (60 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai bot commented Mar 30, 2026

Warning

Rate limit exceeded

@nihalnihalani has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 0 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 0 minutes and 0 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9888894c-4f9b-4880-8e40-2801a8a3fef6

📥 Commits

Reviewing files that changed from the base of the PR and between b6b2dc1 and 455765b.

📒 Files selected for processing (14)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/__init__.py
  • nodes/src/nodes/chunker/chunker_strategies.py
  • nodes/src/nodes/chunker/requirements.txt
  • nodes/src/nodes/chunker/services.json
  • nodes/src/nodes/search_hybrid/IGlobal.py
  • nodes/src/nodes/search_hybrid/IInstance.py
  • nodes/src/nodes/search_hybrid/__init__.py
  • nodes/src/nodes/search_hybrid/hybrid_search.py
  • nodes/src/nodes/search_hybrid/requirements.txt
  • nodes/src/nodes/search_hybrid/services.json
  • test/nodes/test_chunker.py
  • test/nodes/test_search_hybrid.py


@github-actions github-actions bot added the module:nodes Python pipeline nodes label Mar 30, 2026
…h PR

- Fix alpha parameter to actually weight RRF fusion (alpha * vector + (1-alpha) * BM25)
- Fix SentenceChunker start_char tracking for repeated sentences
- Fix TokenChunker O(n²) prefix decoding with incremental character tracking
- Remove unnecessary deep copy of entire question in hybrid search
- Update tests for alpha-weighted RRF and repeated sentence handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@asclearuc asclearuc left a comment


Thank you for this contribution — both nodes are well-motivated additions to the RAG pipeline, and the algorithmic implementations (BM25, RRF, recursive/sentence/token chunking) are solid.

Inline comments have been left on specific lines in both nodes.

One structural request: please split this into two separate PRs — one for chunker and one for search_hybrid. The two nodes share no code, no dependencies, and occupy different positions in the pipeline. Splitting them allows each to be reviewed and merged independently, and avoids blocking chunker on any unresolved search_hybrid issues (or vice versa).

if needed.
"""
if self.IGlobal.strategy is None:
debug('Chunker strategy not initialized; passing documents through')
@asclearuc (Collaborator):

Silent return here drops all documents without any error signal. If the strategy failed to initialize, this should be a hard failure:

    raise RuntimeError('Chunker strategy not initialized')

@nihalnihalani (Author):

@asclearuc Good catch — you're absolutely right. A silent return here masks a real initialization failure and makes it impossible to debug when documents silently disappear from the pipeline. Replacing this with raise RuntimeError('Chunker strategy not initialized') is the correct approach: if the strategy failed to initialize, the pipeline should fail loudly rather than producing an empty result set that looks like "no documents found." Will fix this in the next push.

continue

# Deep copy to avoid mutating the original
base_doc = copy.deepcopy(document)
@asclearuc (Collaborator):

base_doc is a deepcopy of document, then on line 84 each chunk_doc is a deepcopy of base_doc — that is N+1 full copies per document for every call to writeDocuments. For large document batches this will be a significant performance bottleneck.

Consider avoiding deepcopy entirely: page_content is always replaced (line 85) and chunkId is always overwritten (line 89/91), so the only field that needs isolation is metadata. A shallow copy of the doc with an explicit copy of just the metadata object would be significantly cheaper.

@nihalnihalani (Author):

@asclearuc Great performance observation. You're right — the double deepcopy (once for base_doc, then again for each chunk_doc) is unnecessarily expensive, especially with large document batches. Since page_content is always replaced on line 85 and chunkId is always overwritten on line 89/91, the only field that actually needs isolation is the metadata dict.

I'll refactor this to use a shallow copy of the document with an explicit copy.copy() (or dict copy) of just the metadata object. This avoids the overhead of recursively copying the entire document structure N+1 times per input document while still ensuring metadata isolation between chunks.
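A minimal sketch of that refactor, with illustrative Doc/Meta classes standing in for the node's real document and metadata types:

```python
import copy

class Meta:
    """Stand-in for the document metadata object (assumed shape)."""
    def __init__(self, chunkId=None):
        self.chunkId = chunkId

class Doc:
    """Stand-in for the pipeline document type (assumed shape)."""
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata

def make_chunk_doc(document, chunk_text, chunk_id):
    # Shallow copy: page_content is replaced below and chunkId is always
    # overwritten, so only the metadata object needs its own copy.
    chunk_doc = copy.copy(document)
    chunk_doc.page_content = chunk_text
    if document.metadata is not None:
        chunk_doc.metadata = copy.copy(document.metadata)  # isolate metadata
        chunk_doc.metadata.chunkId = chunk_id
    return chunk_doc
```

This keeps per-chunk cost proportional to the metadata size rather than the full document (which may carry large embedding vectors), while the original document remains untouched.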


# Update metadata
if chunk_doc.metadata is not None:
chunk_doc.metadata.chunkId = self.chunkId
@asclearuc (Collaborator):

When metadata already exists, only chunkId is updated. The parent_id extracted on line 72 is silently lost in this branch, so the new chunk's metadata has no reference back to the original document. Consider storing it:

    chunk_doc.metadata.parentId = parent_id

@nihalnihalani (Author):

@asclearuc You're right — this is a data integrity issue. The parent_id extracted on line 72 is critical for downstream consumers that need to trace chunks back to their source document (e.g., for citation, deduplication, or context reconstruction). Losing it silently in the existing-metadata branch means any document that already has metadata will have no lineage reference.

Will add chunk_doc.metadata.parentId = parent_id in both branches so every chunk consistently carries a reference to its originating document, regardless of whether the metadata object was freshly created or already existed.
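A small sketch of that both-branches fix, using SimpleNamespace as a stand-in for the real metadata object (field names are illustrative, not the node's actual API):

```python
from types import SimpleNamespace

def attach_lineage(chunk_doc, parent_id, chunk_id):
    """Set chunkId and parentId whether metadata already exists or must be
    created, so every chunk carries a reference to its source document."""
    if chunk_doc.metadata is None:
        chunk_doc.metadata = SimpleNamespace()
    chunk_doc.metadata.chunkId = chunk_id
    chunk_doc.metadata.parentId = parent_id  # previously lost when metadata existed
    return chunk_doc
```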

"""
if self.IGlobal.engine is None:
debug('Hybrid search engine not initialized; passing question through')
return
@asclearuc (Collaborator):

Same issue as the chunker: the silent return drops the question without any output. If the engine is not initialized this should raise, not silently discard.

@nihalnihalani (Author):

@asclearuc Agreed — same reasoning as the chunker. A silent return here means the question is silently dropped from the pipeline with no trace, making it extremely difficult to debug why a query produced no results. Will replace with raise RuntimeError('Hybrid search engine not initialized') to match the chunker fix and ensure consistent fail-fast behavior across both nodes.

reranked_docs.append(reranked_doc)

# Update the question with reranked documents
question.documents = reranked_docs
@asclearuc (Collaborator):

question is mutated directly. In a fan-out pipeline the same question object is shared across branches — branch A's reranking will corrupt the document list seen by branch B. Deep copy the question at the top of this method:

    question = copy.deepcopy(question)

@nihalnihalani (Author):

@asclearuc This is a subtle and important catch. In a fan-out pipeline topology, the same question object reference is shared across branches, so mutating it directly here (modifying the document list via reranking) would corrupt the state seen by other branches. This is exactly the kind of bug that only surfaces in production fan-out configurations and is nearly impossible to reproduce in simple linear test pipelines.

Will add question = copy.deepcopy(question) at the top of this method to ensure each branch operates on its own isolated copy. Since question objects are typically small (query text + metadata), the deepcopy cost here is negligible compared to the search operations that follow.

Comment on lines +111 to +117
if reranked_docs and self.instance.hasListener('answers'):
# Build a combined answer from the top documents
answer_texts = [doc.page_content for doc in reranked_docs if doc.page_content]
if answer_texts:
ans = Answer()
ans.setAnswer('\n\n'.join(answer_texts))
self.instance.writeAnswers(ans)
@asclearuc (Collaborator):

Writing a concatenation of all document texts to the answers lane — is that the intention here? Any downstream LLM node receiving this would get a large unstructured blob instead of structured documents.

@nihalnihalani (Author):

@asclearuc Good question. The intent here was to provide a quick "combined context" string that a downstream LLM node could consume directly as a single input. However, you're right that this is problematic — writing a concatenated blob of all document texts to the answers lane loses document boundaries, source attribution, and any metadata structure. A downstream LLM node receiving this would have no way to distinguish between individual source documents or attribute citations.

I'll revisit this approach. The better pattern is likely to either: (1) write individual documents to the documents lane and let the downstream LLM node handle context assembly, or (2) if a combined context string is genuinely needed, make it opt-in via a node parameter rather than the default behavior. Will update this in the next revision.
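If the combined-context option is kept, one way to preserve structure is to render numbered snippets with score and source attribution rather than a raw concatenation. The dict shapes below (page_content, score, metadata["source"]) are illustrative assumptions, not the node's document type:

```python
def format_answer(docs, max_docs=3):
    """Render the top documents as numbered snippets with score and
    source, instead of one undifferentiated text blob."""
    lines = []
    for i, doc in enumerate(docs[:max_docs], start=1):
        # Fall back to "unknown" when a document carries no metadata.
        source = (doc.get("metadata") or {}).get("source", "unknown")
        lines.append(f"[{i}] (score={doc['score']:.3f}, source={source})\n"
                     f"{doc['page_content']}")
    return "\n\n".join(lines)
```

With boundaries and attribution intact, a downstream LLM node can still cite individual sources even when it receives a single string.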

@nihalnihalani (Author):

@asclearuc Thank you for the thorough and detailed review. All of the inline comments are valid and well-reasoned — I've responded to each one individually above with the planned fixes.

Regarding the structural request to split this into two separate PRs (one for chunker, one for search_hybrid): I agree this makes sense. The two nodes share no code, no dependencies, and sit at different positions in the pipeline. Keeping them in separate PRs will allow each node to be reviewed and merged independently, make the git history cleaner, and reduce blast radius if either node needs further iteration.

I'll split this PR into two focused PRs and apply all the fixes from your inline comments in the process. Will close this PR once the two new ones are up.

nihalnihalani added a commit to nihalnihalani/rocketride-server that referenced this pull request Apr 2, 2026
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nihalnihalani added a commit to nihalnihalani/rocketride-server that referenced this pull request Apr 2, 2026
… similarity search

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized engine (not silent return)
- Deep copy question to prevent fan-out mutation corruption
- Structured answer output instead of concatenated text blob

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nihalnihalani (Author):

@asclearuc Thank you for the thorough review and the structural feedback. All inline comments have been addressed and the PR has been split as requested.

Split PRs

  1. Chunker node: feat(nodes): add text chunker node with recursive, sentence, and token strategies #601 (feature/rag-chunker-node)

    • RecursiveCharacterChunker, SentenceChunker, TokenChunker
    • 40 tests passing
  2. Hybrid search node: feat(nodes): add hybrid search node with BM25 keyword and vector similarity fusion #602 (feature/rag-hybrid-search-node)

    • BM25 keyword search + alpha-weighted Reciprocal Rank Fusion
    • 39 tests passing

Review feedback addressed in both PRs

Chunker (#601)

  • Silent return → RuntimeError on uninitialized strategy
  • N+1 deepcopy → shallow copy + metadata copy for performance (avoids deepcopy on documents with embedding vectors)
  • parent_id preserved in all metadata branches — every chunk carries parentId back to source
  • Cleaned up dead code (unreachable else branch, unused _STRATEGY_MAP)
  • Fixed services.json documentation to match actual metadata fields

Hybrid Search (#602)

  • Silent return → RuntimeError on uninitialized engine
  • Deep copy question at top of writeQuestions to prevent fan-out mutation corruption
  • Structured answer output — replaced concatenated text blob with numbered document snippets, scores, and source attribution
  • Fixed score display bug (was reading from metadata instead of doc.score)

Closing this PR in favor of the two focused PRs above. Thank you again for the guidance!

🤖 Generated with Claude Code

@nihalnihalani (Author):

Closing in favor of split PRs: #601 (chunker) and #602 (hybrid search). All review feedback from @asclearuc has been addressed in the new PRs.

nihalnihalani added a commit to nihalnihalani/rocketride-server that referenced this pull request Apr 6, 2026
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>