feat(nodes): add text chunking strategies and hybrid search for RAG pipelines #514
Conversation
…ipelines

- Add chunker node with recursive character, sentence, and token-based strategies
- Configurable chunk_size, chunk_overlap, and encoding for token chunking
- Add search_hybrid node combining vector similarity + BM25 keyword search
- Reciprocal Rank Fusion (RRF) for merging multi-signal search results
- Configurable alpha weight between vector and keyword scores
- Deep copy prevents input mutation in fan-out pipelines
- Comprehensive tests for all chunking strategies and hybrid search (60 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…h PR

- Fix alpha parameter to actually weight RRF fusion (alpha * vector + (1-alpha) * BM25)
- Fix SentenceChunker start_char tracking for repeated sentences
- Fix TokenChunker O(n²) prefix decoding with incremental character tracking
- Remove unnecessary deep copy of entire question in hybrid search
- Update tests for alpha-weighted RRF and repeated sentence handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
asclearuc
left a comment
Thank you for this contribution — both nodes are well-motivated additions to the RAG pipeline, and the algorithmic implementations (BM25, RRF, recursive/sentence/token chunking) are solid.
Inline comments have been left on specific lines in both nodes.
One structural request: please split this into two separate PRs — one for chunker and
one for search_hybrid. The two nodes share no code, no dependencies, and occupy different positions in the pipeline. Splitting them allows each to be reviewed and merged independently, and avoids blocking chunker on any unresolved search_hybrid issues (or vice versa).
```python
        if needed.
        """
        if self.IGlobal.strategy is None:
            debug('Chunker strategy not initialized; passing documents through')
```
Silent return here drops all documents without any error signal. If the strategy failed
to initialize, this should be a hard failure:

    raise RuntimeError('Chunker strategy not initialized')
@asclearuc Good catch — you're absolutely right. A silent return here masks a real initialization failure and makes it impossible to debug when documents silently disappear from the pipeline. Replacing this with raise RuntimeError('Chunker strategy not initialized') is the correct approach: if the strategy failed to initialize, the pipeline should fail loudly rather than producing an empty result set that looks like "no documents found." Will fix this in the next push.
```python
            continue

        # Deep copy to avoid mutating the original
        base_doc = copy.deepcopy(document)
```
base_doc is a deepcopy of document, then on line 84 each chunk_doc is a deepcopy
of base_doc — that is N+1 full copies per document for every call to writeDocuments.
For large document batches this will be a significant performance bottleneck.
Consider avoiding deepcopy entirely: page_content is always replaced (line 85) and
chunkId is always overwritten (line 89/91), so the only field that needs isolation is
metadata. A shallow copy of the doc with an explicit copy of just the metadata object
would be significantly cheaper.
@asclearuc Great performance observation. You're right — the double deepcopy (once for base_doc, then again for each chunk_doc) is unnecessarily expensive, especially with large document batches. Since page_content is always replaced on line 85 and chunkId is always overwritten on line 89/91, the only field that actually needs isolation is the metadata dict.
I'll refactor this to use a shallow copy of the document with an explicit copy.copy() (or dict copy) of just the metadata object. This avoids the overhead of recursively copying the entire document structure N+1 times per input document while still ensuring metadata isolation between chunks.
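A rough sketch of the planned refactor. `Document` and `Metadata` here are minimal hypothetical stand-ins for the pipeline's actual types, and `make_chunk_doc` is an illustrative helper, not the node's real API:

```python
import copy

class Metadata:
    """Hypothetical stand-in for the node's metadata object."""
    def __init__(self, parent_id=None):
        self.parentId = parent_id
        self.chunkId = None

class Document:
    """Hypothetical stand-in for the pipeline document type."""
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

def make_chunk_doc(document, chunk_text, chunk_id):
    # Shallow copy: page_content is replaced anyway, so only the
    # mutable metadata object needs isolation between chunks.
    chunk_doc = copy.copy(document)
    chunk_doc.page_content = chunk_text
    chunk_doc.metadata = copy.copy(document.metadata)
    chunk_doc.metadata.chunkId = chunk_id
    return chunk_doc
```

This keeps metadata isolated between chunks while copying each document's structure only once per chunk instead of recursively.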
```python
        # Update metadata
        if chunk_doc.metadata is not None:
            chunk_doc.metadata.chunkId = self.chunkId
```
When metadata already exists, only chunkId is updated. The parent_id extracted
on line 72 is silently lost in this branch, so the new chunk's metadata has no
reference back to the original document. Consider storing it:

    chunk_doc.metadata.parentId = parent_id
@asclearuc You're right — this is a data integrity issue. The parent_id extracted on line 72 is critical for downstream consumers that need to trace chunks back to their source document (e.g., for citation, deduplication, or context reconstruction). Losing it silently in the existing-metadata branch means any document that already has metadata will have no lineage reference.
Will add chunk_doc.metadata.parentId = parent_id in both branches so every chunk consistently carries a reference to its originating document, regardless of whether the metadata object was freshly created or already existed.
```python
        """
        if self.IGlobal.engine is None:
            debug('Hybrid search engine not initialized; passing question through')
            return
```
Same issue as the chunker: silent return drops the question without any output.
If the engine is not initialized this should raise, not silently discard.
@asclearuc Agreed — same reasoning as the chunker. A silent return here means the question is silently dropped from the pipeline with no trace, making it extremely difficult to debug why a query produced no results. Will replace with raise RuntimeError('Hybrid search engine not initialized') to match the chunker fix and ensure consistent fail-fast behavior across both nodes.
```python
            reranked_docs.append(reranked_doc)

        # Update the question with reranked documents
        question.documents = reranked_docs
```
question is mutated directly. In a fan-out pipeline the same question object
is shared across branches — branch A's reranking will corrupt the document list
seen by branch B. Deep copy the question at the top of this method:
    question = copy.deepcopy(question)
@asclearuc This is a subtle and important catch. In a fan-out pipeline topology, the same question object reference is shared across branches, so mutating it directly here (modifying the document list via reranking) would corrupt the state seen by other branches. This is exactly the kind of bug that only surfaces in production fan-out configurations and is nearly impossible to reproduce in simple linear test pipelines.
Will add question = copy.deepcopy(question) at the top of this method to ensure each branch operates on its own isolated copy. Since question objects are typically small (query text + metadata), the deepcopy cost here is negligible compared to the search operations that follow.
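A minimal sketch of the isolation this buys. `Question` is a hypothetical stand-in for the pipeline type, and `rerank_branch` is an illustrative helper, not the node's real method:

```python
import copy

class Question:
    """Hypothetical stand-in for the pipeline question type."""
    def __init__(self, text, documents):
        self.text = text
        self.documents = documents

def rerank_branch(question, keep):
    # Isolate this branch's copy before mutating the document list,
    # so sibling branches in a fan-out still see the original.
    question = copy.deepcopy(question)
    question.documents = [d for d in question.documents if d in keep]
    return question
```

Without the deepcopy, the filtered list assigned by one branch would replace the list every other branch reads.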
```python
        if reranked_docs and self.instance.hasListener('answers'):
            # Build a combined answer from the top documents
            answer_texts = [doc.page_content for doc in reranked_docs if doc.page_content]
            if answer_texts:
                ans = Answer()
                ans.setAnswer('\n\n'.join(answer_texts))
                self.instance.writeAnswers(ans)
```
Is writing a concatenation of all document texts to the answers lane intentional?
Any downstream LLM node receiving this would get a large
unstructured blob instead of structured documents.
@asclearuc Good question. The intent here was to provide a quick "combined context" string that a downstream LLM node could consume directly as a single input. However, you're right that this is problematic — writing a concatenated blob of all document texts to the answers lane loses document boundaries, source attribution, and any metadata structure. A downstream LLM node receiving this would have no way to distinguish between individual source documents or attribute citations.
I'll revisit this approach. The better pattern is likely to either: (1) write individual documents to the documents lane and let the downstream LLM node handle context assembly, or (2) if a combined context string is genuinely needed, make it opt-in via a node parameter rather than the default behavior. Will update this in the next revision.
@asclearuc Thank you for the thorough and detailed review. All of the inline comments are valid and well-reasoned — I've responded to each one individually above with the planned fixes.

Regarding the structural request to split this into two separate PRs (one for chunker, one for search_hybrid): agreed. I'll split this PR into two focused PRs and apply all the fixes from your inline comments in the process. Will close this PR once the two new ones are up.
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:

- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… similarity search

Split from rocketride-org#514 per reviewer request. Addresses review feedback:

- RuntimeError on uninitialized engine (not silent return)
- Deep copy question to prevent fan-out mutation corruption
- Structured answer output instead of concatenated text blob

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@asclearuc Thank you for the thorough review and the structural feedback. All inline comments have been addressed and the PR has been split as requested.

Split PRs

Review feedback addressed in both PRs
- Chunker (#601)
- Hybrid Search (#602)

Closing this PR in favor of the two focused PRs above. Thank you again for the guidance!

🤖 Generated with Claude Code
Closing in favor of split PRs: #601 (chunker) and #602 (hybrid search). All review feedback from @asclearuc has been addressed in the new PRs.
#Hack-with-bay-2
Contribution Type
Feature — Add text chunking strategies and hybrid search for RAG pipelines
Problem / Use Case
RocketRide's RAG pipeline has embedding and vector storage nodes, but is missing two critical pieces: text chunking before embedding, and hybrid search over the retrieved documents.
The complete RAG pipeline should be: Chunking → Embedding → Vector Store → Hybrid Search → Rerank → LLM
Proposed Solution
Chunker Node (nodes/src/nodes/chunker/)

Three strategies via chunker_strategies.py:

- Recursive character: splits on a separator hierarchy (\n\n, \n, ., ...) with configurable overlap, recurses on oversized chunks
- Sentence: splits on sentence boundaries (., !, ?), groups sentences until chunk_size, tracks a search_start position for repeated text
- Token: token-based splitting with configurable encoding (tiktoken)

Hybrid Search Node (nodes/src/nodes/search_hybrid/)

HybridSearchEngine combining vector + keyword search:

score = alpha * (1/(k+rank_vector)) + (1-alpha) * (1/(k+rank_bm25)) — alpha controls vector vs keyword weight

Tests: 69 total (32 chunker + 28 hybrid search + 9 new after review)
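The fusion formula above can be sketched as a small function; names here are illustrative, not the node's actual API:

```python
def rrf_fuse(vector_ranking, bm25_ranking, alpha=0.5, k=60):
    """Alpha-weighted Reciprocal Rank Fusion over two ranked lists of doc ids.
    Ranks are 1-based; a doc missing from one list gets no score from it."""
    scores = {}
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha * (1.0 / (k + rank))
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 - alpha) * (1.0 / (k + rank))
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

With alpha=1.0 the fused order follows the vector ranking alone; with alpha=0.0 it follows BM25 alone; values in between trade the two signals off.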
Value Added
Why This Feature Fits This Codebase
Both nodes follow the standard IGlobal/IInstance pattern. The chunker's writeDocuments() method emits multiple documents per input (one per chunk) with metadata (chunk_index, start_char, end_char, parent_id, total_chunks) — following the same document emission pattern used in nodes/src/nodes/llamaparse/IInstance.py for document parsing.

The hybrid search node's writeQuestions() receives questions with attached documents from an upstream vector DB (e.g., Chroma, Pinecone) that include vector similarity scores. It runs BM25 on the document texts, then merges via alpha-weighted RRF. This matches the question→documents→answers lane flow defined in vector DB nodes like nodes/src/nodes/chroma/services.json (lines 80-90).

The depends() call pattern in IGlobal.beginGlobal ensures tiktoken and rank_bm25 are installed at runtime, following the same dependency auto-loading pattern as all other nodes (e.g., nodes/src/nodes/llm_openai/IGlobal.py lines 40-43).

Deep copy (copy.deepcopy) is used on individual documents before mutation, preventing corruption in fan-out pipelines where the engine's parallel pipe execution model shares objects across branches.

Validation
Chunker tests (32): recursive splitting with overlap, sentence boundary detection, token-based splitting, empty text handling, chunk metadata correctness, repeated sentence start_char tracking, overlap computation, IGlobal lifecycle
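For reference, the recursive splitting behavior these tests exercise can be sketched roughly as follows. This is a simplified version without overlap handling, and the function name is hypothetical, not the node's real API:

```python
def recursive_chunk(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse on oversized pieces.
    Simplified sketch: no chunk_overlap handling."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep in text:
            chunks, current = [], ""
            for piece in text.split(sep):
                candidate = current + sep + piece if current else piece
                if len(candidate) <= chunk_size:
                    current = candidate
                    continue
                if current:
                    chunks.append(current)
                if len(piece) > chunk_size:
                    # Piece still too large: recurse with the finer separators.
                    chunks.extend(recursive_chunk(piece, chunk_size, separators[i + 1:]))
                    current = ""
                else:
                    current = piece
            if current:
                chunks.append(current)
            return chunks
    # No separator applies: hard split by size.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Each emitted chunk stays within chunk_size, and oversized pieces fall through to progressively finer separators before a hard character split.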
Hybrid search tests (28+9): BM25 keyword matching, RRF merging with deduplication, alpha-weighted fusion (alpha=0.8 boosts vector, alpha=0.2 boosts BM25, equal weights verification), full hybrid end-to-end, deep copy prevention, weights length validation
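The BM25 keyword-matching side of these tests can be sketched in pure Python. This is a simplified Okapi BM25 scorer for illustration; the node itself uses the rank_bm25 package:

```python
import math

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 sketch: returns one relevance score per corpus document."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs
    # Document frequency per distinct query term.
    df = {t: sum(1 for d in corpus_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in corpus_tokens:
        score = 0.0
        for t in query_tokens:
            tf = doc.count(t)
            if tf == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Documents that never mention a query term score zero, and shorter documents with repeated matches score higher thanks to the length normalization term.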
How This Could Be Extended
🤖 Generated with Claude Code