
feat(nodes): add text chunker node with recursive, sentence, and token strategies#601

Closed
nihalnihalani wants to merge 8 commits into rocketride-org:develop from nihalnihalani:feature/rag-chunker-node

Conversation


@nihalnihalani nihalnihalani commented Apr 2, 2026

Summary

Review feedback addressed

  1. RuntimeError on uninitialized strategy — replaced silent return with raise RuntimeError
  2. Performance: shallow copy instead of N+1 deepcopy — only metadata dict is copied, not entire document with embeddings
  3. parent_id preserved in all metadata branches — every chunk carries parentId back to source document
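The shallow-copy fix in item 2 can be illustrated with a minimal sketch (Doc and DocMetadata here are simplified stand-ins for the project's real types, not its actual definitions):

```python
import copy

class DocMetadata:
    """Simplified stand-in for the node's metadata type."""
    def __init__(self, **fields):
        self.__dict__.update(fields)

class Doc:
    """Simplified stand-in document carrying a large embedding."""
    def __init__(self, text, metadata, embedding):
        self.text = text
        self.metadata = metadata
        self.embedding = embedding

def make_chunk_doc(parent, chunk_text, chunk_id):
    # Shallow-copy the document, then copy ONLY the metadata object.
    # A deepcopy here would duplicate the embedding once per chunk
    # (the N+1 deepcopy problem this fix removes).
    chunk = copy.copy(parent)
    chunk.text = chunk_text
    chunk.metadata = copy.copy(parent.metadata)
    chunk.metadata.chunkId = chunk_id
    chunk.metadata.parentId = parent.metadata.objectId
    return chunk

parent = Doc("full text", DocMetadata(objectId="doc-1"), embedding=[0.0] * 1536)
chunk = make_chunk_doc(parent, "full", "doc-1#0")
```

The chunk shares the parent's embedding list (no copy) while its metadata is independent, so setting chunkId/parentId never mutates the source document.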

Test plan

  • 40 chunker tests pass
  • ruff check/format clean
  • Recursive splitting with overlap works correctly
  • Sentence boundary detection handles edge cases
  • Token-based splitting respects model context windows
  • Chunk metadata includes parent_id in all cases
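The overlap behavior these tests exercise can be sketched with a toy fixed-window splitter (a simplified stand-in for the node's RecursiveCharacterChunker, which additionally recurses over separators):

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size chars, stepping chunk_size - chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end],
                       "metadata": {"start_char": start, "end_char": end}})
        if end == len(text):
            break
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
# Adjacent chunks share chunk_overlap characters, and every chunk's text
# equals the original slice text[start_char:end_char].
```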

Split from #514

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Text Chunker node: three chunking strategies (recursive, sentence, token), selectable profiles, configurable chunk size/overlap/encoding, per-chunk metadata and UI tile.
    • Hybrid Search node: BM25 + vector fusion with tunable alpha, top-k and RRF fusion, selectable profiles, emits reranked documents and optional answer summaries.
  • Tests

    • Added comprehensive tests covering chunking behavior, lifecycle/emission, hybrid ranking/fusion, and node I/O.

@github-actions github-actions bot added the module:nodes Python pipeline nodes label Apr 2, 2026

coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a new text chunker node (three chunking strategies, lifecycle, instance emission, config, tests) and a hybrid search node (BM25 + optional vector fusion engine, lifecycle, instance reranking, config, tests).

Changes

  • Chunker — Strategies (nodes/src/nodes/chunker/chunker_strategies.py): New ChunkingStrategy base and three implementations: RecursiveCharacterChunker, SentenceChunker, TokenChunker. Each produces chunk dicts with text and metadata (chunk_index, start_char, end_char) and enforces size/overlap validations.
  • Chunker — Global & Instance (nodes/src/nodes/chunker/IGlobal.py, nodes/src/nodes/chunker/IInstance.py): IGlobal manages strategy lifecycle, dependency probing, and config-driven initialization; IInstance applies the strategy per input Doc, emits one chunk-Doc per chunk with chunkId, parentId, and chunk metadata.
  • Chunker — Config & deps (nodes/src/nodes/chunker/services.json, nodes/src/nodes/chunker/requirements.txt): Adds node/service definition chunker:// with selectable profiles and UI fields; adds tiktoken>=0.7.0,<1.0.0.
  • Chunker — Tests (test/nodes/test_chunker.py): Extensive unit and integration-style tests for all chunkers, edge cases, metadata/spans, lifecycle and instance emission (with dependency stubs).
  • Hybrid Search — Engine (nodes/src/nodes/search_hybrid/hybrid_search.py): Adds HybridSearchEngine implementing BM25 ranking, reciprocal rank fusion (RRF), and orchestration to fuse BM25 and optional vector scores respecting alpha, top_k, rrf_k.
  • Hybrid Search — Global & Instance (nodes/src/nodes/search_hybrid/IGlobal.py, nodes/src/nodes/search_hybrid/IInstance.py): IGlobal manages engine lifecycle and dependency probing; IInstance.writeQuestions builds engine inputs, calls search(), maps results back to Docs, emits reranked documents and assembles answers.
  • Hybrid Search — Config & deps (nodes/src/nodes/search_hybrid/services.json, nodes/src/nodes/search_hybrid/requirements.txt): Adds node/service definition search_hybrid:// with profiles and UI fields; adds rank_bm25>=0.2.2,<1.0.0.
  • Hybrid Search — Tests (test/nodes/test_search_hybrid.py): Comprehensive tests for BM25, RRF, fusion modes, alpha behavior, tokenization, and instance-level integration with injected rank_bm25 stub and lifecycle stubs.
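The reciprocal rank fusion used by HybridSearchEngine follows the standard RRF formula, score(d) = Σ 1/(rrf_k + rank of d in each list). A minimal sketch (illustrative only; the function name and signature are not the node's actual API):

```python
def rrf_fuse(bm25_ranking, vector_ranking, rrf_k=60, top_k=5):
    """Fuse two ranked lists of document ids via reciprocal rank fusion."""
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(rrf_k + rank) for every doc it ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]

fused = rrf_fuse(["d1", "d2", "d3"], ["d3", "d1", "d4"])
# d1 is ranked high by both lists, so it fuses above d3, d2, and d4.
```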

Sequence Diagram

sequenceDiagram
    participant Client
    participant IGlobal as IGlobal (Chunker)
    participant Strategy as ChunkingStrategy
    participant IInstance as IInstance
    participant Output

    Client->>IGlobal: beginGlobal()
    IGlobal->>IGlobal: validateConfig() / load node config
    IGlobal->>IGlobal: instantiate Strategy
    IGlobal-->>Client: strategy available

    Client->>IInstance: open(entry) / writeDocuments(docs)
    IInstance->>IGlobal: ensure strategy initialized
    loop per input document
        IInstance->>Strategy: chunk(text)
        Strategy-->>IInstance: [chunks with metadata]
        IInstance->>IInstance: create per-chunk Doc (chunkId,parentId,metadata)
    end
    IInstance->>Output: emit chunked documents

    Client->>IGlobal: endGlobal()
    IGlobal->>IGlobal: release strategy (set None)
sequenceDiagram
    participant Client
    participant IGlobal as IGlobal (Hybrid)
    participant Engine as HybridSearchEngine
    participant IInstance as IInstance
    participant Docs as Documents
    participant Answers as Answers

    Client->>IGlobal: beginGlobal()
    IGlobal->>IGlobal: validateConfig() / load node config
    IGlobal->>Engine: instantiate HybridSearchEngine(alpha,...)
    IGlobal-->>Client: engine available

    Client->>IInstance: writeQuestions(question)
    IInstance->>IGlobal: ensure engine initialized
    IInstance->>Engine: search(query, documents, vector_scores, top_k, rrf_k)
    Engine-->>IInstance: ranked results (bm25/rrf/vector fused)
    IInstance->>Docs: emit reranked documents
    alt answers listener exists
        IInstance->>Answers: build & emit textual answer (snippets + hybrid_score)
    end

    Client->>IGlobal: endGlobal()
    IGlobal->>IGlobal: release engine (set None)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • jmaionchi
  • stepmikhaylov
  • Rod-Christensen

Poem

🐇 I nibbled text into tidy bites,

Chunks that hop through pipelines bright.
Sentences, tokens, recursive art,
BM25 and vectors play their part.
Hop, split, and rank — a happy start!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly and specifically describes the main change: adding a new text chunker node with three concrete strategies (recursive, sentence, token). It aligns directly with the changeset's primary objective.
  • Docstring Coverage (✅ Passed): Docstring coverage is 84.85%, which meets the required threshold of 80.00%.



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@nihalnihalani nihalnihalani changed the title feat(nodes): add hybrid search node with BM25 keyword and vector similarity fusion feat(nodes): add text chunker node with recursive, sentence, and token strategies Apr 2, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 213-216: The chunker currently rebuilds chunks with ' '.join(current_sentences), which loses original whitespace and invalidates
offsets; update the logic in chunker_strategies.py (where chunk_text,
current_sentences, sentence_positions are used) to slice the original source
text by each sentence's original span and concatenate those exact source slices
(preserving whitespace) to form chunk_text, and compute start_char and end_char
from the original span boundaries (e.g., first sentence's start index and last
sentence's end index) instead of using lengths; apply the same change to the
second occurrence around lines 260-263 so all chunks use original spans for text
and offsets.

In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 52-53: The code in IGlobal.py currently uses a blind "except
Exception as e" around dependency/config loading (calling warning(str(e))),
which can mask real runtime errors; update the handler in the
validateConfig/config-loading block to catch only the expected exceptions (e.g.,
ImportError, ModuleNotFoundError, KeyError, ValueError) and call warning(str(e))
for those, or if you intentionally must catch everything keep the broad except
but append a targeted linter suppression ( "# noqa: BLE001" ) with a one-line
justification comment explaining why a broad catch is required; reference the
existing validateConfig/config-loading block and the warning(...) call to locate
where to make this change.

In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 84-90: The current metadata assignment in the chunk emission logic
overwrites/preserves only chunkId and parentId and drops strategy-provided
fields (chunk_index, start_char, end_char, total_chunks). Fix the block that
sets chunk_doc.metadata (referencing chunk_doc, DocMetadata, self.chunkId,
parent_id) to merge/preserve existing strategy metadata: if chunk_doc.metadata
exists, set/overwrite only chunkId and parentId and leave or copy through
chunk_index, start_char, end_char, total_chunks; if not, construct a new
DocMetadata that includes chunkId, parentId plus chunk_index, start_char,
end_char, and total_chunks (sourcing those values from the strategy output
variables on the chunk or from the available total_chunks value). Ensure you do
not drop any of the listed fields when creating or updating metadata.

In `@test/nodes/test_chunker.py`:
- Around line 493-509: The test named test_iglobal_creates_strategy currently
only instantiates RecursiveCharacterChunker directly; update it to import
IGlobal (IGlobal.beginGlobal) after forcing module reload, set up the mocked
endpoint/config used by beginGlobal (reuse self._install_stubs() or
saved/names), call IGlobal.beginGlobal(), and then assert that the instance's
strategy attribute (e.g., iglobal_instance.strategy) is an instance of
RecursiveCharacterChunker and that chunking works (e.g., call strategy.chunk and
check length) to validate the lifecycle wiring rather than direct construction.
- Around line 255-258: The no-overlap assertion in test_chunker.py is wrong: it
compares chunks[i]['metadata']['start_char'] to the previous chunk's start_char,
which allows overlaps; change the assertion to ensure each chunk's start_char is
>= the previous chunk's metadata 'end_char' so spans don't overlap (use
chunks[i]['metadata']['start_char'] >= chunks[i-1]['metadata']['end_char']
referencing the same chunks and metadata fields).
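The corrected no-overlap assertion suggested above can be written as a small helper (the chunk shape is assumed to match the strategy output described in the walkthrough):

```python
def assert_no_overlap(chunks):
    """Each chunk must start at or after the previous chunk's end_char."""
    for i in range(1, len(chunks)):
        prev_end = chunks[i - 1]["metadata"]["end_char"]
        start = chunks[i]["metadata"]["start_char"]
        assert start >= prev_end, f"chunk {i} overlaps previous span"

# Passes: spans touch but never overlap.
assert_no_overlap([{"metadata": {"start_char": 0, "end_char": 5}},
                   {"metadata": {"start_char": 5, "end_char": 9}}])
```

Comparing start_char against the previous chunk's start_char (the original assertion) would accept overlapping spans such as (0, 5) followed by (3, 9).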

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 255f96a0-24d3-4772-b561-67e9b8bec0de

📥 Commits

Reviewing files that changed from the base of the PR and between 7d81b7e and 5d78492.

📒 Files selected for processing (7)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/__init__.py
  • nodes/src/nodes/chunker/chunker_strategies.py
  • nodes/src/nodes/chunker/requirements.txt
  • nodes/src/nodes/chunker/services.json
  • test/nodes/test_chunker.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
nodes/src/nodes/chunker/IGlobal.py (1)

44-45: ⚠️ Potential issue | 🟡 Minor

Make the broad except explicitly intentional.

Line 44 still catches Exception broadly. If this is intentional for save-time config probing, add a targeted Ruff suppression with a one-line justification to avoid masking accidental regressions silently.

Proposed minimal fix
-        except Exception as e:
+        except Exception as e:  # noqa: BLE001 - validateConfig probe should warn, not block node configuration
             warning(str(e))

Based on learnings: follow the existing validateConfig style, and when broad catches are intentional, use targeted Ruff # noqa with specific BLE001 justification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/IGlobal.py` around lines 44 - 45, The broad except in
IGlobal.py (the except Exception as e block that calls warning(str(e))) is
intentional for save-time config probing but must be explicitly documented and
suppressed for Ruff; update that except to include a one-line targeted Ruff
suppression comment (e.g. "# noqa: BLE001 - intentional broad Exception for
save-time config probing, follows validateConfig pattern") immediately on the
except line so the broad catch remains but is clearly justified and silences
BLE001 while keeping the warning(str(e)) behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 58-61: Validate and sanitize chunk parameters before creating
strategies: after parsing strategy_name, chunk_size, chunk_overlap, and
encoding_name in IGlobal.py, ensure chunk_size is an integer > 0 and
chunk_overlap is an integer >= 0 and strictly < chunk_size (or clamp to
chunk_size - 1 if you prefer permissive behavior); also catch and handle
ValueError from int() conversions and log/raise a clear error when values are
invalid. Apply this validation in the same block where
strategy_name/chunk_size/chunk_overlap/encoding_name are read and also in the
other similar parsing block referenced (lines 64-80) so both code paths perform
the same range checks and consistent error handling. Use the variable names
chunk_size and chunk_overlap and the strategy creation entrypoint(s) in
IGlobal.py to locate where to add the guards.

In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 51-53: The docstring in IInstance (file IInstance.py) incorrectly
documents emitted metadata keys as chunk_index and total_chunks; update the
docstring to match runtime behavior and state that emitted documents include
metadata keys "chunkId" and "parentId" (not chunk_index/total_chunks), and
adjust any other docstring occurrences in the same file (the comments around the
emission logic referenced at lines ~85-86) to use the same "chunkId"/"parentId"
wording so documentation matches the actual emitted metadata keys.

In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 113-122: The answer output is reading hybrid_score from
doc.metadata but reranking stores scores on the document object
(reranked_doc.score), so the displayed score is always 'N/A'; update the answer
assembly in the block that uses reranked_docs to read the score from the
document's score attribute (e.g., use doc.score with a fallback like 'N/A' if
missing) when building context_parts before creating Answer and calling
self.instance.writeAnswers(ans).
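The range checks requested in the first inline comment above can be sketched as a small guard run before strategy creation (a hedged sketch; the real IGlobal reads these values from node config):

```python
def validate_chunk_params(chunk_size, chunk_overlap):
    """Coerce and range-check chunk parameters, failing with a clear message."""
    try:
        chunk_size = int(chunk_size)
        chunk_overlap = int(chunk_overlap)
    except (TypeError, ValueError) as e:
        raise ValueError(f"chunk parameters must be integers: {e}") from e
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
    if chunk_overlap >= chunk_size:
        raise ValueError(f"chunk_overlap ({chunk_overlap}) must be < chunk_size ({chunk_size})")
    return chunk_size, chunk_overlap

# Config values often arrive as strings from the UI.
size, overlap = validate_chunk_params("512", "64")
```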

---

Duplicate comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 44-45: The broad except in IGlobal.py (the except Exception as e
block that calls warning(str(e))) is intentional for save-time config probing
but must be explicitly documented and suppressed for Ruff; update that except to
include a one-line targeted Ruff suppression comment (e.g. "# noqa: BLE001 -
intentional broad Exception for save-time config probing, follows validateConfig
pattern") immediately on the except line so the broad catch remains but is
clearly justified and silences BLE001 while keeping the warning(str(e))
behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5dccd2c3-49af-40a7-ace1-93178201c391

📥 Commits

Reviewing files that changed from the base of the PR and between 5d78492 and e505691.

📒 Files selected for processing (10)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/services.json
  • nodes/src/nodes/search_hybrid/IGlobal.py
  • nodes/src/nodes/search_hybrid/IInstance.py
  • nodes/src/nodes/search_hybrid/__init__.py
  • nodes/src/nodes/search_hybrid/hybrid_search.py
  • nodes/src/nodes/search_hybrid/requirements.txt
  • nodes/src/nodes/search_hybrid/services.json
  • test/nodes/test_search_hybrid.py


@asclearuc asclearuc left a comment


2 independent nodes

@asclearuc
Collaborator

Thank you for your contribution.

What is the additional value of the new chunker node, taking into consideration that chunking is already present (see preprocess_langchain, preprocessor_llm, preprocessor_code)?

@nihalnihalani
Contributor Author

Great question! The dedicated chunker node adds value over existing preprocessors (preprocess_langchain, preprocessor_llm, preprocessor_code) in several ways:

  1. Strategy-configurable: Exposes 3 pluggable strategies (recursive, sentence, token) as a first-class pipeline step, vs preprocessors which have fixed chunking behavior
  2. Position metadata: Emits chunk_index, start_char, end_char, total_chunks per chunk — enabling downstream nodes to reconstruct original documents or highlight source spans in RAG answers
  3. Overlap control: Fine-grained chunk_overlap parameter for tuning retrieval recall per embedding model
  4. Composability: Sits between any extraction node and any embedding node, vs preprocessors which bundle extraction + chunking together

Regarding the "2 independent nodes" feedback — the chunker and hybrid search are in separate directories with zero code dependencies. Happy to split into two PRs if preferred, though they were developed as a cohesive RAG pipeline feature set. Let me know your preference.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
nodes/src/nodes/chunker/IInstance.py (1)

79-82: ⚠️ Potential issue | 🟠 Major

Construct fallback DocMetadata with required fields.

Line 81 uses DocMetadata() bare. In the real schema, objectId and chunkId are required, so the metadata=None path will raise before any chunk is emitted. The permissive stub in test/nodes/test_chunker.py currently hides this.

🛠️ Suggested fix
-                chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
+                chunk_doc.metadata = (
+                    copy.copy(document.metadata)
+                    if document.metadata
+                    else DocMetadata(objectId=parent_id, chunkId=self.chunkId)
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/IInstance.py` around lines 79 - 82, The fallback
DocMetadata() used when document.metadata is falsy lacks required fields
(objectId, chunkId); update the chunk creation logic in the chunker where
chunk_doc is made (the copy.copy(document) / chunk_doc.metadata assignment) to
construct a DocMetadata instance populated with at least the required fields
(e.g., objectId from document or a generated value and chunkId from chunk_data
or an index) instead of calling bare DocMetadata(), ensuring
chunk_doc.page_content = chunk_data['text'] remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 211-217: The size and overlap checks for chunking currently assume
a single-character separator (using current_len + 1 + sentence_len) while the
emitted chunk slices the original text with preserved whitespace, causing
undercounts; in the chunking logic (look for variables/current_len,
sentence_len, sentence_positions, start_char, end_char, chunk_text, chunk_size,
chunk_overlap in chunker_strategies.py) replace the synthetic "+ 1" separator
math with calculations based on actual original spans (use sentence_positions
and string indices to compute widths as end_index - start_index for both the
candidate addition and existing current span) so that both the capacity check
and overlap bookkeeping use the original slice widths; apply the same fix in the
other similar blocks referenced around the later checks (the sections handling
finalization and overlap at the other noted ranges).
- Around line 129-145: The problem is that overlap is prepended using
prev[-self.chunk_overlap:] from the split output (raw) which may have lost
separators, so chunk_text no longer matches text[start_char:end_char] and
offsets are wrong; to fix, first locate the raw slice's start in the original
text (compute start_char = text.find(raw, search_start) as done), then compute
overlap_text by slicing the original text (overlap_text = text[max(0, start_char
- self.chunk_overlap): start_char]) and prepend that to raw to form chunk_text,
and finally set start_char = start_char - len(overlap_text) and end_char
accordingly; apply the same change to the second similar block (the one around
lines 157-160) and reference the functions/variables chunk_overlap, raw_chunks,
raw, chunk_text, start_char, end_char, and _split_text when making the update.

---

Duplicate comments:
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 79-82: The fallback DocMetadata() used when document.metadata is
falsy lacks required fields (objectId, chunkId); update the chunk creation logic
in the chunker where chunk_doc is made (the copy.copy(document) /
chunk_doc.metadata assignment) to construct a DocMetadata instance populated
with at least the required fields (e.g., objectId from document or a generated
value and chunkId from chunk_data or an index) instead of calling bare
DocMetadata(), ensuring chunk_doc.page_content = chunk_data['text'] remains
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f902dbba-b734-4d31-9b3e-ee0acfcc03b9

📥 Commits

Reviewing files that changed from the base of the PR and between e505691 and bf59730.

📒 Files selected for processing (4)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/chunker_strategies.py
  • test/nodes/test_chunker.py


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
nodes/src/nodes/chunker/chunker_strategies.py (1)

129-145: ⚠️ Potential issue | 🟠 Major

Overlap-derived chunk_text still doesn't match the original text slice.

The past review flagged that chunk_text = overlap_text + raw produces text that won't match text[start_char:end_char] because _split_text() drops separators. The code still uses this pattern despite being marked as addressed.

For example, with text='abc\n\ndef' and chunk_overlap=1:

  • raw_chunks might be ['abc', 'def'] (separators dropped)
  • chunk_text for chunk 2 becomes 'c' + 'def' = 'cdef'
  • But text[start_char:end_char] would be 'c\n\ndef' (includes separator)
🛠️ Suggested fix: derive chunk_text from original text
-            # Compute overlap prefix from the previous chunk
-            if i > 0 and self.chunk_overlap > 0:
-                prev = raw_chunks[i - 1]
-                overlap_text = prev[-self.chunk_overlap :]
-                chunk_text = overlap_text + raw
-            else:
-                chunk_text = raw
-
-            # Find start_char position in original text
-            start_char = text.find(raw, search_start)
-            if start_char == -1:
-                start_char = search_start
-            # Adjust for overlap prefix
-            if i > 0 and self.chunk_overlap > 0:
-                overlap_start = max(0, start_char - self.chunk_overlap)
-                start_char = overlap_start
-            end_char = start_char + len(chunk_text)
+            # Find raw chunk position in original text
+            raw_start = text.find(raw, search_start)
+            if raw_start == -1:
+                raw_start = search_start
+            raw_end = raw_start + len(raw)
+
+            # Compute start_char with overlap adjustment
+            if i > 0 and self.chunk_overlap > 0:
+                start_char = max(0, raw_start - self.chunk_overlap)
+            else:
+                start_char = raw_start
+            end_char = raw_end
+
+            # Slice chunk_text from original to preserve separators
+            chunk_text = text[start_char:end_char]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 129 - 145, The
current construction chunk_text = overlap_text + raw (using raw_chunks and
chunk_overlap) can mismatch the original text because _split_text drops
separators; instead compute chunk_text by slicing the original text using
calculated indices: find start_char = text.find(raw, search_start) (fall back to
search_start if -1), then when overlap applies set overlap_start = max(0,
start_char - self.chunk_overlap) and set end_char = overlap_start +
len(chunk_text) or better end_char = start_char + len(raw) + (self.chunk_overlap
if needed), and finally derive chunk_text = text[overlap_start:end_char]; update
uses of start_char/end_char and remove the overlap_text + raw concatenation so
chunk_text always matches text[start_char:end_char] and honors separators
dropped by _split_text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 129-145: The current construction chunk_text = overlap_text + raw
(using raw_chunks and chunk_overlap) can mismatch the original text because
_split_text drops separators; instead compute chunk_text by slicing the original
text using calculated indices: find start_char = text.find(raw, search_start)
(fall back to search_start if -1), then when overlap applies set overlap_start =
max(0, start_char - self.chunk_overlap) and set end_char = overlap_start +
len(chunk_text) or better end_char = start_char + len(raw) + (self.chunk_overlap
if needed), and finally derive chunk_text = text[overlap_start:end_char]; update
uses of start_char/end_char and remove the overlap_text + raw concatenation so
chunk_text always matches text[start_char:end_char] and honors separators
dropped by _split_text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4dac1c14-a233-42b7-bda4-d5d0aa37b128

📥 Commits

Reviewing files that changed from the base of the PR and between bf59730 and fab5892.

📒 Files selected for processing (2)
  • nodes/src/nodes/chunker/chunker_strategies.py
  • test/nodes/test_chunker.py

nihalnihalani and others added 8 commits April 6, 2026 11:41
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unreachable else branch in IInstance.writeDocuments (metadata is
always non-None after shallow-copy-or-create on line 81). Remove unused
_STRATEGY_MAP dict in IGlobal. Fix services.json output description that
incorrectly referenced total_chunks metadata which is never set on chunk
documents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Add early validation for chunk_size > 0, chunk_overlap >= 0, and
chunk_overlap < chunk_size in beginGlobal() so invalid config is
caught before strategy instantiation. Also add noqa BLE001 annotation
for the intentional broad exception catch in validateConfig().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of joining sentences with ' '.join() and computing end_char
from the joined string length, derive start_char/end_char from the
sentence position map and slice the original text. This preserves
the actual whitespace between sentences (e.g., double spaces, newlines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy chunk_index, start_char, end_char, and total_chunks from the
strategy's per-chunk metadata dict onto each emitted Doc's metadata
so downstream nodes can locate chunks within the source document.
Update the writeDocuments docstring to list all metadata keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change the no-overlap position assertion to compare start_char against
the previous chunk's end_char (not start_char) so it actually verifies
non-overlapping spans.

Rewrite test_iglobal_creates_strategy to instantiate IGlobal, mock
the endpoint and glb with strategy='sentence', call beginGlobal(),
and assert the resulting strategy type and parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nting

Replace the current_len tracker (which assumed 1-char separators) with
position-based span computation that accounts for multi-character
whitespace between sentences.  After the whitespace-preservation change
(text[start_char:end_char] slicing), current_len was underestimating
the real chunk width whenever the original separator was longer than one
character (e.g. \n\n), causing chunks to silently exceed chunk_size.

The overlap budget is now also computed from actual text spans so
carried-over sentences fit within the configured overlap window.

Add two regression tests:
- overlap + repeated sentences: verify spans match text and respect size
- multi-char whitespace: verify chunks are split, not all swallowed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
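The position-based span computation described in this commit can be sketched as follows (a minimal illustration under assumed names; the actual helper in chunker_strategies.py may differ):

```python
def span_len(positions: list[int], sentences: list[str]) -> int:
    """Width of the original-text span covered by a sentence buffer.

    positions[i] is the start offset of sentences[i] in the source text,
    so the result includes whatever whitespace actually separated the
    sentences (double spaces, blank lines, ...) rather than assuming a
    1-character separator as the old current_len tracker did.
    """
    if not sentences:
        return 0
    return (positions[-1] + len(sentences[-1])) - positions[0]
```

With a two-sentence buffer drawn from `'One.\n\nTwo.'`, the span is 10 characters, not the 9 a length-sum-plus-one-separator count would give.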
@nihalnihalani nihalnihalani force-pushed the feature/rag-chunker-node branch from fab5892 to 4181638 Compare April 6, 2026 18:42
@nihalnihalani
Contributor Author

Great question — here's why the dedicated chunker node adds value beyond the existing preprocessors:

1. Pipeline composability vs preprocessing stage
preprocess_langchain is a PreProcessorBase — it runs in the preprocessing stage only. The chunker is an IInstanceBase node that can be placed anywhere in the pipeline. This means you can chunk after extraction, after filtering, after enrichment — wherever it makes sense for the use case.

2. Rich document metadata for reconstruction
The chunker emits Doc objects with chunk_index, start_char, end_char, total_chunks, and parentId. This lets downstream nodes (a) highlight source spans in RAG answers, (b) reconstruct the original document from chunks, and (c) deduplicate overlapping results. Existing preprocessors return text without this provenance tracking.

3. Strategy-configurable with precise token counting
Three pluggable strategies (recursive, sentence, token) with fine-grained chunk_overlap control. The token strategy uses tiktoken for exact token counting vs the preprocessor's byte-length heuristic (_bytes_per_token = 3.0), which matters when tuning for specific embedding model context windows.

4. Zero external dependencies
The recursive and sentence strategies are pure Python — no langchain_text_splitters required. Only the token strategy optionally uses tiktoken.

Regarding "2 independent nodes" — you're right that the chunker and hybrid search are independent. They share zero code dependencies and live in separate directories. Happy to split into two PRs if you'd prefer to review and merge them separately. Just let me know.
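As a concrete illustration of point 2, downstream code could rebuild a source document from its chunks using only the emitted offsets (a sketch; the dict shape mirrors the metadata keys named above, and the helper itself is hypothetical):

```python
def reconstruct(chunks: list[dict]) -> str:
    """Rebuild source text from chunks carrying start_char/end_char.

    Assumes each chunk looks like {'text': ..., 'metadata':
    {'start_char': int, 'end_char': int, 'parentId': str}}; overlapping
    spans are resolved by writing each chunk at its recorded offset.
    """
    if not chunks:
        return ''
    end = max(c['metadata']['end_char'] for c in chunks)
    buf = [''] * end
    for c in sorted(chunks, key=lambda c: c['metadata']['start_char']):
        start = c['metadata']['start_char']
        for i, ch in enumerate(c['text']):
            buf[start + i] = ch
    return ''.join(buf)
```

This kind of round trip is what the provenance metadata enables; plain text output from a preprocessor cannot support it.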


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

♻️ Duplicate comments (2)
nodes/src/nodes/search_hybrid/IInstance.py (1)

113-122: ⚠️ Potential issue | 🟡 Minor

Answer text still references hybrid_score from metadata instead of doc.score.

Line 116 reads doc.metadata.get('hybrid_score'), but line 101 stores the RRF score in reranked_doc.score, not in metadata. This will always display 'N/A' for the score in the answer output.

🐛 Proposed fix
         if reranked_docs and self.instance.hasListener('answers'):
             context_parts = []
             for i, doc in enumerate(reranked_docs):
-                score = doc.metadata.get('hybrid_score', 'N/A') if doc.metadata else 'N/A'
+                score = f'{doc.score:.4f}' if doc.score is not None else 'N/A'
                 snippet = (doc.page_content or '')[:500]
                 context_parts.append(f'[Document {i + 1}] (score: {score})\n{snippet}')
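For context, the RRF score stored on reranked_doc.score is typically computed like this (a standard reciprocal-rank-fusion sketch, not necessarily the node's exact implementation):

```python
def rrf_fuse(bm25_ranking: list[str], vector_ranking: list[str], rrf_k: int = 60):
    """Fuse two ranked lists of doc ids with reciprocal rank fusion.

    score(d) = sum over rankings of 1 / (rrf_k + rank(d)), with 1-based
    ranks. Returns (doc_id, score) pairs sorted best-first.
    """
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```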
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/search_hybrid/IInstance.py` around lines 113 - 122, The
answer assembly loop in IInstance.py is incorrectly reading the score from
metadata ('hybrid_score') but the RRF score was saved on the Document object as
doc.score; update the loop that builds context_parts (used before ans.setAnswer
and self.instance.writeAnswers) to read the score from doc.score (falling back
to 'N/A' if doc.score is None) instead of doc.metadata.get('hybrid_score') so
the displayed scores reflect the reranked_doc.score values.
nodes/src/nodes/chunker/chunker_strategies.py (1)

127-160: ⚠️ Potential issue | 🟠 Major

Build overlapped recursive chunks from the source slice.

Line 132 prepends overlap from raw_chunks, but _split_text() has already removed boundary separators. When a split lands on '\n\n', '. ', etc., chunk_text no longer matches text[start_char:end_char], so the emitted offsets are wrong.

🛠️ Suggested fix
@@
         for i, raw in enumerate(raw_chunks):
-            # Compute overlap prefix from the previous chunk
-            if i > 0 and self.chunk_overlap > 0:
-                prev = raw_chunks[i - 1]
-                overlap_text = prev[-self.chunk_overlap :]
-                chunk_text = overlap_text + raw
-            else:
-                chunk_text = raw
-
-            # Find start_char position in original text
-            start_char = text.find(raw, search_start)
-            if start_char == -1:
-                start_char = search_start
-            # Adjust for overlap prefix
-            if i > 0 and self.chunk_overlap > 0:
-                overlap_start = max(0, start_char - self.chunk_overlap)
-                start_char = overlap_start
-            end_char = start_char + len(chunk_text)
+            raw_start = text.find(raw, search_start)
+            if raw_start == -1:
+                raw_start = search_start
+            raw_end = raw_start + len(raw)
+
+            if i > 0 and self.chunk_overlap > 0:
+                start_char = max(0, raw_start - self.chunk_overlap)
+            else:
+                start_char = raw_start
+            end_char = raw_end
+            chunk_text = text[start_char:end_char]
@@
-            raw_start = text.find(raw, search_start)
-            if raw_start >= 0:
-                search_start = raw_start + len(raw)
+            search_start = raw_end
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 127 - 160, The
overlap prepending step uses text transformed by _split_text(), so chunk_text no
longer aligns with the original source and emitted offsets are wrong; update the
overlap construction in the chunking routine to derive overlap and offsets from
the original/raw text/raw_chunks (use the original raw_chunks text or original
source slice to compute the overlap substring and compute start_char/end_char
relative to the original source) rather than the post-split chunk_text, and
ensure emitted offsets are calculated from that original text so returned
chunk.offsets correspond to indices in the original source; locate references to
raw_chunks, _split_text(), chunk_text and the code that sets start_char/end_char
to make this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 216-277: SentenceChunker currently can emit chunks larger than
chunk_size because the budget check at SentenceChunker (use of current_sentences
and chunk_size) skips the first sentence and the finalize path restores overlap
without re-checking sizes; fix by enforcing chunk_size for any single sentence
and during finalization: when building a chunk in add_sentence (or wherever
current_sentences is appended), detect if a single sentence length > chunk_size
and split that sentence into smaller spans (respecting chunk_size and desired
overlap policy) before emitting; and in finalize/restore-overlap logic, before
appending overlap+sentence repeatedly trim the carried overlap (pop or shorten
overlap segments from current_sentences) until overlap + next sentence <=
chunk_size, and if still larger, apply the same fallback split to the oversized
sentence so every emitted chunk is <= chunk_size.
- Around line 336-349: The current mapping uses len(encoder.decode(...)) as a
character offset which breaks for multibyte UTF‑8 tokens; instead compute token
byte lengths using encoder.decode_single_token_bytes() (or aggregate its
results) to produce start_byte and end_byte offsets, then convert those byte
offsets back to character indices into the original text (e.g., by slicing
text.encode('utf-8') or using incremental decoding to map byte->char index)
before slicing; update the logic around encoder.decode(...) and the code that
builds slices in the functions that assemble token spans so all boundaries use
byte-accurate offsets mapped to character indices.
- Around line 311-312: The call to Encoding.encode(...) in chunker_strategies.py
is rejecting sentinel substrings like "<|endoftext|>" and raising ValueError;
update the Encoding.encode invocation (the call at/near line 312) to include the
parameter disallowed_special=() so sentinel strings are treated as normal text
(i.e., call Encoding.encode(..., disallowed_special=()) wherever the
chunking/tokenization routine—e.g., the method that invokes Encoding.encode in
this module—is encoding document text); ensure any similar Encoding.encode
usages in the same class/function are updated likewise.

In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 79-82: The shallow copy of Document in the chunking logic
(chunk_doc = copy.copy(document) in IInstance) leaves inherited embedding state
on the chunk; after setting chunk_doc.page_content and chunk_doc.metadata,
explicitly clear chunk_doc.embedding and chunk_doc.embedding_model (set them to
None) so each emitted chunk does not carry the parent document's vector state.
Ensure you update the code around where chunk_doc is created/modified
(references: chunk_doc, copy.copy(document), chunk_doc.metadata,
chunk_doc.page_content) before appending/emitting the chunk.

In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 60-65: The code casts top_k and rrf_k from config without
validating their ranges; update the initialization in IGlobal (where alpha,
self.top_k, self.rrf_k are set) to validate that top_k and rrf_k are positive
integers (e.g., > 0) and clamp or raise a clear error/log if they are invalid,
similar to the alpha validation; ensure you convert the values to int after
validation and keep existing behavior for defaults (top_k=10, rrf_k=60) when
values are missing or invalid.
- Around line 46-47: The except block that catches a broad Exception (the line
with "except Exception as e:" followed by "warning(str(e))") should include the
suppression comment "# noqa: BLE001" to match other nodes; update that except
line to append the comment so the linter suppression is consistent with
chunker/IGlobal.py.

In `@test/nodes/test_chunker.py`:
- Around line 517-523: The _restore_modules method currently only restores
entries named in stub_names but leaves cached imports like modules starting with
"nodes.src.nodes.chunker." that were loaded against the stubs; update
_restore_modules (the function named _restore_modules) to also remove from
sys.modules any keys that match the chunker module namespace (e.g. startwith
"nodes.src.nodes.chunker.") when cleaning up: for each such matching module, pop
it from sys.modules so subsequent tests don't reuse modules wired to the fakes;
keep the existing logic that restores saved[name] for stub_names that were
present.

In `@test/nodes/test_search_hybrid.py`:
- Around line 78-83: The teardown _restore_modules currently only restores exact
dependency names in _STUB_MODULES/_original_modules, leaving stubbed packages
(e.g., the fake rank_bm25 and any modules under search_hybrid that were imported
while stubs were active) cached in sys.modules; update _restore_modules so it
removes from sys.modules any entry whose name is equal to or starts with any
name in _STUB_MODULES (or whose module object is the stub) before restoring
originals, ensuring that both top-level stub modules and any submodules (e.g.,
"rank_bm25" and "rank_bm25.*" or "rocketlib.*") are purged so later tests import
clean copies.

---

Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 127-160: The overlap prepending step uses text transformed by
_split_text(), so chunk_text no longer aligns with the original source and
emitted offsets are wrong; update the overlap construction in the chunking
routine to derive overlap and offsets from the original/raw text/raw_chunks (use
the original raw_chunks text or original source slice to compute the overlap
substring and compute start_char/end_char relative to the original source)
rather than the post-split chunk_text, and ensure emitted offsets are calculated
from that original text so returned chunk.offsets correspond to indices in the
original source; locate references to raw_chunks, _split_text(), chunk_text and
the code that sets start_char/end_char to make this change.

In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 113-122: The answer assembly loop in IInstance.py is incorrectly
reading the score from metadata ('hybrid_score') but the RRF score was saved on
the Document object as doc.score; update the loop that builds context_parts
(used before ans.setAnswer and self.instance.writeAnswers) to read the score
from doc.score (falling back to 'N/A' if doc.score is None) instead of
doc.metadata.get('hybrid_score') so the displayed scores reflect the
reranked_doc.score values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 639ea044-72b7-4b42-a58b-933fa072c7b3

📥 Commits

Reviewing files that changed from the base of the PR and between fab5892 and 4181638.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml, !pnpm-lock.yaml
📒 Files selected for processing (14)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/__init__.py
  • nodes/src/nodes/chunker/chunker_strategies.py
  • nodes/src/nodes/chunker/requirements.txt
  • nodes/src/nodes/chunker/services.json
  • nodes/src/nodes/search_hybrid/IGlobal.py
  • nodes/src/nodes/search_hybrid/IInstance.py
  • nodes/src/nodes/search_hybrid/__init__.py
  • nodes/src/nodes/search_hybrid/hybrid_search.py
  • nodes/src/nodes/search_hybrid/requirements.txt
  • nodes/src/nodes/search_hybrid/services.json
  • test/nodes/test_chunker.py
  • test/nodes/test_search_hybrid.py

Comment on lines +216 to +277
    # Compute what the span would be if we added this sentence
    if current_sentences:
        candidate_len = (next_pos + len(sentence)) - sentence_positions[0]
    else:
        candidate_len = len(sentence)

    # If adding this sentence exceeds chunk_size and we have content, finalize current
    if current_sentences and candidate_len > self.chunk_size:
        start_char = sentence_positions[0]
        end_char = sentence_positions[-1] + len(current_sentences[-1])
        chunk_text = text[start_char:end_char]

        result.append(
            {
                'text': chunk_text,
                'metadata': {
                    'chunk_index': chunk_index,
                    'start_char': start_char,
                    'end_char': end_char,
                },
            }
        )
        chunk_index += 1

        # Compute overlap: keep trailing sentences whose actual span fits within overlap
        if self.chunk_overlap > 0:
            overlap_sentences: list[str] = []
            overlap_positions: list[int] = []
            for i in range(len(current_sentences) - 1, -1, -1):
                trial_positions = [sentence_positions[i]] + overlap_positions
                trial_sentences = [current_sentences[i]] + overlap_sentences
                if _span_len(trial_positions, trial_sentences) <= self.chunk_overlap:
                    overlap_sentences = trial_sentences
                    overlap_positions = trial_positions
                else:
                    break
            current_sentences = overlap_sentences
            sentence_positions = overlap_positions
        else:
            current_sentences = []
            sentence_positions = []

    # Add the sentence
    current_sentences.append(sentence)
    sentence_positions.append(next_pos)

# Emit the final chunk
if current_sentences:
    start_char = sentence_positions[0]
    end_char = sentence_positions[-1] + len(current_sentences[-1])
    chunk_text = text[start_char:end_char]

    result.append(
        {
            'text': chunk_text,
            'metadata': {
                'chunk_index': chunk_index,
                'start_char': start_char,
                'end_char': end_char,
            },
        }
    )

⚠️ Potential issue | 🟠 Major

SentenceChunker can still emit chunks larger than chunk_size.

Line 223 only enforces the budget when current_sentences is non-empty, so a first sentence longer than the limit is emitted unchanged. After finalization, Lines 252-259 also restore overlap and append the current sentence without another size check, so overlap + sentence can overflow as well. Add a fallback split for oversized single sentences, and trim carried overlap until the new span fits before appending.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 216 - 277,
SentenceChunker currently can emit chunks larger than chunk_size because the
budget check at SentenceChunker (use of current_sentences and chunk_size) skips
the first sentence and the finalize path restores overlap without re-checking
sizes; fix by enforcing chunk_size for any single sentence and during
finalization: when building a chunk in add_sentence (or wherever
current_sentences is appended), detect if a single sentence length > chunk_size
and split that sentence into smaller spans (respecting chunk_size and desired
overlap policy) before emitting; and in finalize/restore-overlap logic, before
appending overlap+sentence repeatedly trim the carried overlap (pop or shorten
overlap segments from current_sentences) until overlap + next sentence <=
chunk_size, and if still larger, apply the same fallback split to the oversized
sentence so every emitted chunk is <= chunk_size.

Comment on lines +311 to +312
encoder = self._get_encoder()
tokens = encoder.encode(text)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import pathlib
import re
import subprocess
import sys

pkg = 'tiktoken'
req_path = pathlib.Path('nodes/src/nodes/chunker/requirements.txt')
if req_path.exists():
    match = re.search(r'^\s*tiktoken(?:==([^\s#]+))?', req_path.read_text(), re.M)
    if match and match.group(1):
        pkg = f'tiktoken=={match.group(1)}'

subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
sample = 'prefix <|endoftext|> suffix'

try:
    print('default:', enc.encode(sample))
except Exception as exc:
    print('default raised:', type(exc).__name__, exc)

print('disallowed_special=():', enc.encode(sample, disallowed_special=()))
PY

Repository: rocketride-org/rocketride-server

Length of output: 197


🏁 Script executed:

git ls-files nodes/src/nodes/chunker/ | head -20

Repository: rocketride-org/rocketride-server

Length of output: 308


🏁 Script executed:

cat -n nodes/src/nodes/chunker/chunker_strategies.py | sed -n '305,320p'

Repository: rocketride-org/rocketride-server

Length of output: 612


🏁 Script executed:

cat nodes/src/nodes/chunker/requirements.txt

Repository: rocketride-org/rocketride-server

Length of output: 98


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import subprocess
import sys

# Install tiktoken with version constraint from requirements.txt
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'tiktoken>=0.7.0,<1.0.0'])

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
sample = 'prefix <|endoftext|> suffix'

print("Testing tiktoken behavior with sentinel strings:")
print()

try:
    result = enc.encode(sample)
    print('default call succeeded:', result)
except Exception as exc:
    print('default call raised:', type(exc).__name__, '-', exc)

print()

try:
    result = enc.encode(sample, disallowed_special=())
    print('with disallowed_special=() call succeeded:', result)
except Exception as exc:
    print('with disallowed_special=() raised:', type(exc).__name__, '-', exc)

print()
print("Testing with just the sentinel string:")
try:
    result = enc.encode('<|endoftext|>')
    print('sentinel alone (default) succeeded:', result)
except Exception as exc:
    print('sentinel alone (default) raised:', type(exc).__name__)

try:
    result = enc.encode('<|endoftext|>', disallowed_special=())
    print('sentinel alone (disallowed_special=()) succeeded:', result)
except Exception as exc:
    print('sentinel alone (disallowed_special=()) raised:', type(exc).__name__)
PY

Repository: rocketride-org/rocketride-server

Length of output: 865


Add disallowed_special=() parameter to treat sentinel strings as normal text.

Encoding.encode() rejects substrings such as <|endoftext|> by default, raising a ValueError. Line 312 will therefore raise on valid document content instead of chunking it.

🛠️ Suggested fix
-        tokens = encoder.encode(text)
+        tokens = encoder.encode(text, disallowed_special=())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 311 - 312, The
call to Encoding.encode(...) in chunker_strategies.py is rejecting sentinel
substrings like "<|endoftext|>" and raising ValueError; update the
Encoding.encode invocation (the call at/near line 312) to include the parameter
disallowed_special=() so sentinel strings are treated as normal text (i.e., call
Encoding.encode(..., disallowed_special=()) wherever the chunking/tokenization
routine—e.g., the method that invokes Encoding.encode in this module—is encoding
document text); ensure any similar Encoding.encode usages in the same
class/function are updated likewise.

Comment on lines +336 to +349
# Decode tokens back to text, handling errors gracefully
try:
chunk_text = encoder.decode(chunk_tokens)
except Exception:
chunk_text = encoder.decode(chunk_tokens, errors='replace')

# Use cached character position for start_char (O(1) lookup)
start_char = char_pos_cache.get(start)
if start_char is None:
# Fallback: decode prefix (should not happen with correct step caching)
prefix_text = encoder.decode(tokens[:start])
start_char = len(prefix_text)
char_pos_cache[start] = start_char
end_char = start_char + len(chunk_text)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import pathlib
import re
import subprocess
import sys

pkg = 'tiktoken'
req_path = pathlib.Path('nodes/src/nodes/chunker/requirements.txt')
if req_path.exists():
    match = re.search(r'^\s*tiktoken(?:==([^\s#]+))?', req_path.read_text(), re.M)
    if match and match.group(1):
        pkg = f'tiktoken=={match.group(1)}'

subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
candidates = ['🙂', '🚀', '你好', 'café', 'naïve', '👨‍👩‍👧‍👦', '🏳️‍🌈']
candidates.extend(chr(cp) for cp in range(0x80, 0x2000) if chr(cp).isprintable())

for text in candidates:
    tokens = enc.encode(text, disallowed_special=())
    if len(tokens) < 2:
        continue

    prefix = b''
    for i, tok in enumerate(tokens[:-1], 1):
        prefix += enc.decode_single_token_bytes(tok)
        try:
            prefix.decode('utf-8')
        except UnicodeDecodeError:
            print('invalid_utf8_boundary', repr(text), tokens, 'boundary', i)
            print('decoded_prefix', repr(enc.decode(tokens[:i], errors='replace')))
            raise SystemExit(0)

print('no_invalid_boundary_found')
PY

Repository: rocketride-org/rocketride-server

Length of output: 197


🏁 Script executed:

cd nodes && find . -name "chunker_strategies.py" -o -name "requirements.txt" | head -20

Repository: rocketride-org/rocketride-server

Length of output: 919


🏁 Script executed:

cd nodes && wc -l ./src/nodes/chunker/chunker_strategies.py

Repository: rocketride-org/rocketride-server

Length of output: 121


🏁 Script executed:

cd nodes && sed -n '336,372p' ./src/nodes/chunker/chunker_strategies.py

Repository: rocketride-org/rocketride-server

Length of output: 1727


🏁 Script executed:

cd nodes && sed -n '320,380p' ./src/nodes/chunker/chunker_strategies.py

Repository: rocketride-org/rocketride-server

Length of output: 2532


🏁 Script executed:

cd nodes && cat ./src/nodes/chunker/requirements.txt

Repository: rocketride-org/rocketride-server

Length of output: 98


🏁 Script executed:

cd nodes && python - <<'PY'
import sys
import subprocess

# Install tiktoken with the version constraint
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'tiktoken>=0.7.0,<1.0.0'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
candidates = ['🙂', '🚀', '你好', 'café', 'naïve', '👨‍👩‍👧‍👦', '🏳️‍🌈']
candidates.extend(chr(cp) for cp in range(0x80, 0x2000) if chr(cp).isprintable())

invalid_boundaries_found = []
for text in candidates:
    tokens = enc.encode(text, disallowed_special=())
    if len(tokens) < 2:
        continue

    prefix = b''
    for i, tok in enumerate(tokens[:-1], 1):
        prefix += enc.decode_single_token_bytes(tok)
        try:
            prefix.decode('utf-8')
        except UnicodeDecodeError:
            invalid_boundaries_found.append({
                'text': text,
                'tokens': tokens,
                'boundary_index': i,
                'decoded_with_replacement': enc.decode(tokens[:i], errors='replace')
            })
            break

if invalid_boundaries_found:
    print(f"FOUND {len(invalid_boundaries_found)} INVALID UTF-8 BOUNDARIES:")
    for example in invalid_boundaries_found[:5]:  # Show first 5
        print(f"  Text: {repr(example['text'])}, Tokens: {example['tokens']}, Boundary at token {example['boundary_index']}")
        print(f"    Decoded with replacement: {repr(example['decoded_with_replacement'])}")
else:
    print("NO INVALID UTF-8 BOUNDARIES FOUND")
PY

Repository: rocketride-org/rocketride-server

Length of output: 715


Use byte-level offsets to correctly map token boundaries to character positions in non-ASCII text.

Tiktoken's token boundaries are byte-based and can split multibyte UTF-8 sequences. Lines 343–349 and 365–372 incorrectly use len(encoder.decode(...)) as a character offset, which fails when token boundaries land mid-character. When errors='replace' substitutes invalid sequences with '�', the character count diverges from the original text's indices. Build offsets from decode_single_token_bytes() aggregates, then map byte positions back to character indices before slicing text.
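A minimal sketch of the conversion step this suggests (helper name hypothetical): given a byte offset produced by summing decode_single_token_bytes() lengths, map it back to a character index before slicing the original text:

```python
def byte_to_char_index(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into `text` to a character index.

    A boundary that lands mid-character (possible, since BPE tokens are
    byte sequences) is rounded down to the last complete character via
    errors='ignore'.
    """
    prefix = text.encode('utf-8')[:byte_offset]
    return len(prefix.decode('utf-8', errors='ignore'))
```

Unlike len(encoder.decode(...)), this never counts a '�' replacement character, so the returned index always points into the original string.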

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 336 - 349, The
current mapping uses len(encoder.decode(...)) as a character offset which breaks
for multibyte UTF‑8 tokens; instead compute token byte lengths using
encoder.decode_single_token_bytes() (or aggregate its results) to produce
start_byte and end_byte offsets, then convert those byte offsets back to
character indices into the original text (e.g., by slicing text.encode('utf-8')
or using incremental decoding to map byte->char index) before slicing; update
the logic around encoder.decode(...) and the code that builds slices in the
functions that assemble token spans so all boundaries use byte-accurate offsets
mapped to character indices.

Comment on lines +79 to +82
# Shallow copy of document, explicit copy of metadata only
chunk_doc = copy.copy(document)
chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
chunk_doc.page_content = chunk_data['text']

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo 'Schema files exposing Doc / DocMetadata fields:'
fd -i 'schema\.py$' -x rg -n -C2 '\bclass\s+Doc\b|\bclass\s+DocMetadata\b|\bembedding(s)?\b|\bvector(s)?\b' {}

echo
echo 'Node code referencing existing vector-bearing fields:'
rg -n -C2 '\b(embedding|embeddings|vector|vectors)\b' --glob '*.py' -g '!test/**' -g '!**/test/**'

echo
echo 'Chunk emission loop under review:'
sed -n '79,94p' nodes/src/nodes/chunker/IInstance.py

Repository: rocketride-org/rocketride-server

Length of output: 50389


Clear inherited embeddings when cloning documents for chunks.

Line 80 shallow-copies the entire document, and line 82 replaces only page_content. The Doc schema defines both embedding and embedding_model fields (see packages/client-python/src/rocketride/schema/doc.py), and each emitted chunk will inherit the parent document's vector state even though its text has changed. This causes stale embeddings to propagate downstream, poisoning indexing or allowing later nodes to skip required re-embedding.

Reset both fields before appending:

chunk_doc.embedding = None
chunk_doc.embedding_model = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/IInstance.py` around lines 79 - 82, The shallow copy
of Document in the chunking logic (chunk_doc = copy.copy(document) in IInstance)
leaves inherited embedding state on the chunk; after setting
chunk_doc.page_content and chunk_doc.metadata, explicitly clear
chunk_doc.embedding and chunk_doc.embedding_model (set them to None) so each
emitted chunk does not carry the parent document's vector state. Ensure you
update the code around where chunk_doc is created/modified (references:
chunk_doc, copy.copy(document), chunk_doc.metadata, chunk_doc.page_content)
before appending/emitting the chunk.

Comment on lines +46 to +47
except Exception as e:
    warning(str(e))

🧹 Nitpick | 🔵 Trivial

Add # noqa: BLE001 for consistency with other nodes.

The broad Exception catch should include the suppression comment for consistency with the established pattern (as done in chunker/IGlobal.py line 44).

♻️ Proposed fix
-        except Exception as e:
+        except Exception as e:  # noqa: BLE001 - intentional broad catch for dependency probing
             warning(str(e))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 46 - 47, The except
block that catches a broad Exception (the line with "except Exception as e:"
followed by "warning(str(e))") should include the suppression comment "# noqa:
BLE001" to match other nodes; update that except line to append the comment so
the linter suppression is consistent with chunker/IGlobal.py.

Comment on lines +60 to +65
alpha = float(config.get('alpha', 0.5))
self.top_k = int(config.get('top_k', 10))
self.rrf_k = int(config.get('rrf_k', 60))

# Validate alpha range
alpha = max(0.0, min(1.0, alpha))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider validating top_k and rrf_k parameters.

chunk_size/chunk_overlap are validated in the chunker node, but top_k and rrf_k are cast without range checks. Invalid values (e.g., top_k <= 0 or rrf_k <= 0) could cause unexpected behavior in the search engine.

🛡️ Proposed guardrails
             alpha = float(config.get('alpha', 0.5))
             self.top_k = int(config.get('top_k', 10))
             self.rrf_k = int(config.get('rrf_k', 60))

+            if self.top_k <= 0:
+                raise ValueError(f'top_k must be positive, got {self.top_k}')
+            if self.rrf_k <= 0:
+                raise ValueError(f'rrf_k must be positive, got {self.rrf_k}')
+
             # Validate alpha range
             alpha = max(0.0, min(1.0, alpha))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 60 - 65, The code
casts top_k and rrf_k from config without validating their ranges; update the
initialization in IGlobal (where alpha, self.top_k, self.rrf_k are set) to
validate that top_k and rrf_k are positive integers (e.g., > 0) and clamp or
raise a clear error/log if they are invalid, similar to the alpha validation;
ensure you convert the values to int after validation and keep existing behavior
for defaults (top_k=10, rrf_k=60) when values are missing or invalid.
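
For context, rrf_k sits in the denominator of the reciprocal rank fusion score, which is why the suggested range check matters: a non-positive value can make the denominator zero or negative. A minimal sketch of RRF fusion — the function and names here are illustrative assumptions, not the search_hybrid node's actual API:

```python
# Illustrative sketch of reciprocal rank fusion (RRF); rrf_fuse and its
# signature are assumptions for this example, not the node's real code.
def rrf_fuse(rankings, rrf_k=60):
    """Fuse several ranked lists of doc ids into a single ordering.

    Each doc scores sum(1 / (rrf_k + rank)) over the lists it appears in,
    with rank starting at 1. A non-positive rrf_k can zero or flip the
    denominator, hence the guard below.
    """
    if rrf_k <= 0:
        raise ValueError(f'rrf_k must be positive, got {rrf_k}')
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both the BM25 list and the vector list accumulates contributions from both, so it outranks documents that appear in only one — which is the behaviour the node's rrf_k tunes.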

Comment on lines +517 to +523
def _restore_modules(self, saved, stub_names):
for name in stub_names:
if saved.get(name) is None:
sys.modules.pop(name, None)
else:
sys.modules[name] = saved[name]

⚠️ Potential issue | 🟠 Major

Evict the stub-imported chunker modules during cleanup.

Lines 533-550 and 589-592 import nodes.src.nodes.chunker.* while rocketlib / ai.common are stubbed, but Lines 517-523 only restore the stubs themselves. The cached nodes.src.nodes.chunker.* modules survive in sys.modules, so later tests can reuse modules that are already wired to the fakes.

🧹 Suggested cleanup
     def _restore_modules(self, saved, stub_names):
+        for mod_name in list(sys.modules):
+            if mod_name.startswith('nodes.src.nodes.chunker'):
+                sys.modules.pop(mod_name, None)
         for name in stub_names:
             if saved.get(name) is None:
                 sys.modules.pop(name, None)
             else:
                 sys.modules[name] = saved[name]

Also applies to: 524-557, 581-592

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/nodes/test_chunker.py` around lines 517 - 523, The _restore_modules
method currently only restores entries named in stub_names but leaves cached
imports like modules starting with "nodes.src.nodes.chunker." that were loaded
against the stubs; update _restore_modules (the function named _restore_modules)
to also remove from sys.modules any keys that match the chunker module namespace
(e.g. startwith "nodes.src.nodes.chunker.") when cleaning up: for each such
matching module, pop it from sys.modules so subsequent tests don't reuse modules
wired to the fakes; keep the existing logic that restores saved[name] for
stub_names that were present.

Comment on lines +78 to +83
def _restore_modules():
for name in _STUB_MODULES:
if _original_modules.get(name) is None:
sys.modules.pop(name, None)
else:
sys.modules[name] = _original_modules[name]

⚠️ Potential issue | 🟠 Major

Restore the sys.modules stubs in teardown.

Lines 86-87 install a fake rank_bm25 at module import time, and Lines 578-588 import nodes.src.nodes.search_hybrid.* while rocketlib / ai.common are stubbed. The cleanup paths only restore the dependency modules, so the BM25 stub and the stub-bound search_hybrid modules stay cached for later tests. That makes the suite order-dependent.

Also applies to: 86-87, 571-588

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/nodes/test_search_hybrid.py` around lines 78 - 83, The teardown
_restore_modules currently only restores exact dependency names in
_STUB_MODULES/_original_modules, leaving stubbed packages (e.g., the fake
rank_bm25 and any modules under search_hybrid that were imported while stubs
were active) cached in sys.modules; update _restore_modules so it removes from
sys.modules any entry whose name is equal to or starts with any name in
_STUB_MODULES (or whose module object is the stub) before restoring originals,
ensuring that both top-level stub modules and any submodules (e.g., "rank_bm25"
and "rank_bm25.*" or "rocketlib.*") are purged so later tests import clean
copies.

@asclearuc
Collaborator

Thank you for the contribution and for the detailed follow-up explaining the value of the chunker node — it helped clarify the design intent.

After review, we've decided to close this PR in favour of improving the existing preprocessor nodes instead. The reasoning:

  1. preprocessor_langchain already covers most of this ground. It supports multiple splitter strategies, runs as an IInstanceBase node, and handles chunking today. The gaps this PR addresses — configurable overlap, richer position metadata (start_char, end_char, total_chunks), and a writeDocuments input lane for already-materialised Doc objects — are incremental improvements to that node, not reasons to introduce a new one.

  2. Node proliferation has a real cost. Users would need to understand when to pick preprocessor_langchain vs chunker, two nodes doing overlapping work would diverge over time, and the existing chunk_overlap = 0 limitation would remain unfixed in the node that pipelines already use.

  3. The right fix is extension, not addition. Adding a writeDocuments lane, exposing chunk_overlap as a config field, and emitting position metadata to preprocessor_langchain achieves everything this PR sets out to do — with no new node.

If you'd like to continue contributing, we'd welcome a PR that extends preprocessor_langchain along those lines. Happy to discuss the approach before you start.
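
For reference, the overlap and position metadata under discussion (chunk_overlap, start_char, end_char, total_chunks) can be sketched as a character-based toy — this is an illustration of the behaviour, not preprocessor_langchain's actual splitter or config:

```python
# Toy fixed-size chunker with overlap and position metadata; names mirror
# the fields mentioned in this thread but are illustrative only.
def chunk_with_overlap(text, chunk_size=200, chunk_overlap=50):
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        end = min(start + chunk_size, len(text))
        chunks.append({'text': text[start:end],
                       'start_char': start,
                       'end_char': end})
        if end == len(text):
            break
    # total_chunks is only known once splitting finishes.
    for i, chunk in enumerate(chunks):
        chunk.update(chunk_index=i, total_chunks=len(chunks))
    return chunks
```

Exposing chunk_overlap as a config field and attaching this kind of position metadata to each emitted chunk is essentially the extension proposed above.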

@asclearuc asclearuc closed this Apr 7, 2026
@nihalnihalani
Contributor Author

@asclearuc How can I reach out to you? In one of the earlier PRs I was told to use this approach and split the work into two new PRs. I would like to point that review out to you and discuss the next steps to complete this feature.


Labels

module:nodes Python pipeline nodes
