
feat(nodes): add text chunker node with recursive, sentence, and token strategies#601

Closed
nihalnihalani wants to merge 8 commits into rocketride-org:develop from nihalnihalani:feature/rag-chunker-node

Conversation


@nihalnihalani nihalnihalani commented Apr 2, 2026

Summary

Review feedback addressed

  1. RuntimeError on uninitialized strategy — replaced silent return with raise RuntimeError
  2. Performance: shallow copy instead of N+1 deepcopy — only metadata dict is copied, not entire document with embeddings
  3. parent_id preserved in all metadata branches — every chunk carries parentId back to source document
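The shallow-copy fix in item 2 can be illustrated with a minimal sketch (Doc and DocMetadata here are simplified stand-ins for the project's real types, not its actual definitions):

```python
import copy

class DocMetadata:
    """Simplified stand-in for the node's metadata type."""
    def __init__(self, **fields):
        self.__dict__.update(fields)

class Doc:
    """Simplified stand-in document carrying a large embedding."""
    def __init__(self, text, metadata, embedding):
        self.text = text
        self.metadata = metadata
        self.embedding = embedding

def make_chunk_doc(parent, chunk_text, chunk_id):
    # Shallow-copy the document, then copy ONLY the metadata object.
    # A deepcopy here would duplicate the embedding once per chunk
    # (the N+1 deepcopy problem this fix removes).
    chunk = copy.copy(parent)
    chunk.text = chunk_text
    chunk.metadata = copy.copy(parent.metadata)
    chunk.metadata.chunkId = chunk_id
    chunk.metadata.parentId = parent.metadata.objectId
    return chunk

parent = Doc("full text", DocMetadata(objectId="doc-1"), embedding=[0.0] * 1536)
chunk = make_chunk_doc(parent, "full", "doc-1#0")
```

The chunk shares the parent's embedding list (no copy) while its metadata is independent, so setting chunkId/parentId never mutates the source document.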

Test plan

  • 40 chunker tests pass
  • ruff check/format clean
  • Recursive splitting with overlap works correctly
  • Sentence boundary detection handles edge cases
  • Token-based splitting respects model context windows
  • Chunk metadata includes parent_id in all cases
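The overlap behavior these tests exercise can be sketched with a toy fixed-window splitter (a simplified stand-in for the node's RecursiveCharacterChunker, which additionally recurses over separators):

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size chars, stepping chunk_size - chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end],
                       "metadata": {"start_char": start, "end_char": end}})
        if end == len(text):
            break
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
# Adjacent chunks share chunk_overlap characters, and every chunk's text
# equals the original slice text[start_char:end_char].
```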

Split from #514

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Text Chunker node: three chunking strategies (recursive, sentence, token), selectable profiles, configurable chunk size/overlap/encoding, per-chunk metadata and UI tile.
    • Hybrid Search node: BM25 + vector fusion with tunable alpha, top-k and RRF fusion, selectable profiles, emits reranked documents and optional answer summaries.
  • Tests

    • Added comprehensive tests covering chunking behavior, lifecycle/emission, hybrid ranking/fusion, and node I/O.

@github-actions github-actions bot added the module:nodes Python pipeline nodes label Apr 2, 2026

coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a new text chunker node (three chunking strategies, lifecycle, instance emission, config, tests) and a hybrid search node (BM25 + optional vector fusion engine, lifecycle, instance reranking, config, tests).

Changes

  • Chunker — Strategies (nodes/src/nodes/chunker/chunker_strategies.py): New ChunkingStrategy base and three implementations: RecursiveCharacterChunker, SentenceChunker, TokenChunker. Each produces chunk dicts with text and metadata (chunk_index, start_char, end_char) and enforces size/overlap validations.
  • Chunker — Global & Instance (nodes/src/nodes/chunker/IGlobal.py, nodes/src/nodes/chunker/IInstance.py): IGlobal manages strategy lifecycle, dependency probing, and config-driven initialization; IInstance applies the strategy per input Doc, emits one chunk-Doc per chunk with chunkId, parentId, and chunk metadata.
  • Chunker — Config & deps (nodes/src/nodes/chunker/services.json, nodes/src/nodes/chunker/requirements.txt): Adds node/service definition chunker:// with selectable profiles and UI fields; adds tiktoken>=0.7.0,<1.0.0.
  • Chunker — Tests (test/nodes/test_chunker.py): Extensive unit and integration-style tests for all chunkers, edge cases, metadata/spans, lifecycle and instance emission (with dependency stubs).
  • Hybrid Search — Engine (nodes/src/nodes/search_hybrid/hybrid_search.py): Adds HybridSearchEngine implementing BM25 ranking, reciprocal rank fusion (RRF), and orchestration to fuse BM25 and optional vector scores respecting alpha, top_k, rrf_k.
  • Hybrid Search — Global & Instance (nodes/src/nodes/search_hybrid/IGlobal.py, nodes/src/nodes/search_hybrid/IInstance.py): IGlobal manages engine lifecycle and dependency probing; IInstance.writeQuestions builds engine inputs, calls search(), maps results back to Docs, emits reranked documents and assembles answers.
  • Hybrid Search — Config & deps (nodes/src/nodes/search_hybrid/services.json, nodes/src/nodes/search_hybrid/requirements.txt): Adds node/service definition search_hybrid:// with profiles and UI fields; adds rank_bm25>=0.2.2,<1.0.0.
  • Hybrid Search — Tests (test/nodes/test_search_hybrid.py): Comprehensive tests for BM25, RRF, fusion modes, alpha behavior, tokenization, and instance-level integration with injected rank_bm25 stub and lifecycle stubs.
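The reciprocal rank fusion used by HybridSearchEngine follows the standard RRF formula, score(d) = Σ 1/(rrf_k + rank of d in each list). A minimal sketch (illustrative only; the function name and signature are not the node's actual API):

```python
def rrf_fuse(bm25_ranking, vector_ranking, rrf_k=60, top_k=5):
    """Fuse two ranked lists of document ids via reciprocal rank fusion."""
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(rrf_k + rank) for every doc it ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]

fused = rrf_fuse(["d1", "d2", "d3"], ["d3", "d1", "d4"])
# d1 is ranked high by both lists, so it fuses above d3, d2, and d4.
```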

Sequence Diagram

sequenceDiagram
    participant Client
    participant IGlobal as IGlobal (Chunker)
    participant Strategy as ChunkingStrategy
    participant IInstance as IInstance
    participant Output

    Client->>IGlobal: beginGlobal()
    IGlobal->>IGlobal: validateConfig() / load node config
    IGlobal->>IGlobal: instantiate Strategy
    IGlobal-->>Client: strategy available

    Client->>IInstance: open(entry) / writeDocuments(docs)
    IInstance->>IGlobal: ensure strategy initialized
    loop per input document
        IInstance->>Strategy: chunk(text)
        Strategy-->>IInstance: [chunks with metadata]
        IInstance->>IInstance: create per-chunk Doc (chunkId,parentId,metadata)
    end
    IInstance->>Output: emit chunked documents

    Client->>IGlobal: endGlobal()
    IGlobal->>IGlobal: release strategy (set None)
sequenceDiagram
    participant Client
    participant IGlobal as IGlobal (Hybrid)
    participant Engine as HybridSearchEngine
    participant IInstance as IInstance
    participant Docs as Documents
    participant Answers as Answers

    Client->>IGlobal: beginGlobal()
    IGlobal->>IGlobal: validateConfig() / load node config
    IGlobal->>Engine: instantiate HybridSearchEngine(alpha,...)
    IGlobal-->>Client: engine available

    Client->>IInstance: writeQuestions(question)
    IInstance->>IGlobal: ensure engine initialized
    IInstance->>Engine: search(query, documents, vector_scores, top_k, rrf_k)
    Engine-->>IInstance: ranked results (bm25/rrf/vector fused)
    IInstance->>Docs: emit reranked documents
    alt answers listener exists
        IInstance->>Answers: build & emit textual answer (snippets + hybrid_score)
    end

    Client->>IGlobal: endGlobal()
    IGlobal->>IGlobal: release engine (set None)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • jmaionchi
  • stepmikhaylov
  • Rod-Christensen

Poem

🐇 I nibbled text into tidy bites,

Chunks that hop through pipelines bright.
Sentences, tokens, recursive art,
BM25 and vectors play their part.
Hop, split, and rank — a happy start!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly and specifically describes the main change: adding a new text chunker node with three concrete strategies (recursive, sentence, token). It aligns directly with the changeset's primary objective.
  • Docstring Coverage (✅ Passed): Docstring coverage is 84.85%, which meets the required threshold of 80.00%.



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@nihalnihalani nihalnihalani changed the title feat(nodes): add hybrid search node with BM25 keyword and vector similarity fusion feat(nodes): add text chunker node with recursive, sentence, and token strategies Apr 2, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 213-216: The chunker currently rebuilds chunks with ' '.join(current_sentences), which loses original whitespace and invalidates
offsets; update the logic in chunker_strategies.py (where chunk_text,
current_sentences, sentence_positions are used) to slice the original source
text by each sentence's original span and concatenate those exact source slices
(preserving whitespace) to form chunk_text, and compute start_char and end_char
from the original span boundaries (e.g., first sentence's start index and last
sentence's end index) instead of using lengths; apply the same change to the
second occurrence around lines 260-263 so all chunks use original spans for text
and offsets.

In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 52-53: The code in IGlobal.py currently uses a blind "except
Exception as e" around dependency/config loading (calling warning(str(e))),
which can mask real runtime errors; update the handler in the
validateConfig/config-loading block to catch only the expected exceptions (e.g.,
ImportError, ModuleNotFoundError, KeyError, ValueError) and call warning(str(e))
for those, or if you intentionally must catch everything keep the broad except
but append a targeted linter suppression ( "# noqa: BLE001" ) with a one-line
justification comment explaining why a broad catch is required; reference the
existing validateConfig/config-loading block and the warning(...) call to locate
where to make this change.

In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 84-90: The current metadata assignment in the chunk emission logic
overwrites/preserves only chunkId and parentId and drops strategy-provided
fields (chunk_index, start_char, end_char, total_chunks). Fix the block that
sets chunk_doc.metadata (referencing chunk_doc, DocMetadata, self.chunkId,
parent_id) to merge/preserve existing strategy metadata: if chunk_doc.metadata
exists, set/overwrite only chunkId and parentId and leave or copy through
chunk_index, start_char, end_char, total_chunks; if not, construct a new
DocMetadata that includes chunkId, parentId plus chunk_index, start_char,
end_char, and total_chunks (sourcing those values from the strategy output
variables on the chunk or from the available total_chunks value). Ensure you do
not drop any of the listed fields when creating or updating metadata.

In `@test/nodes/test_chunker.py`:
- Around line 493-509: The test named test_iglobal_creates_strategy currently
only instantiates RecursiveCharacterChunker directly; update it to import
IGlobal (IGlobal.beginGlobal) after forcing module reload, set up the mocked
endpoint/config used by beginGlobal (reuse self._install_stubs() or
saved/names), call IGlobal.beginGlobal(), and then assert that the instance's
strategy attribute (e.g., iglobal_instance.strategy) is an instance of
RecursiveCharacterChunker and that chunking works (e.g., call strategy.chunk and
check length) to validate the lifecycle wiring rather than direct construction.
- Around line 255-258: The no-overlap assertion in test_chunker.py is wrong: it
compares chunks[i]['metadata']['start_char'] to the previous chunk's start_char,
which allows overlaps; change the assertion to ensure each chunk's start_char is
>= the previous chunk's metadata 'end_char' so spans don't overlap (use
chunks[i]['metadata']['start_char'] >= chunks[i-1]['metadata']['end_char']
referencing the same chunks and metadata fields).
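The corrected no-overlap assertion suggested above can be written as a small helper (the chunk shape is assumed to match the strategy output described in the walkthrough):

```python
def assert_no_overlap(chunks):
    """Each chunk must start at or after the previous chunk's end_char."""
    for i in range(1, len(chunks)):
        prev_end = chunks[i - 1]["metadata"]["end_char"]
        start = chunks[i]["metadata"]["start_char"]
        assert start >= prev_end, f"chunk {i} overlaps previous span"

# Passes: spans touch but never overlap.
assert_no_overlap([{"metadata": {"start_char": 0, "end_char": 5}},
                   {"metadata": {"start_char": 5, "end_char": 9}}])
```

Comparing start_char against the previous chunk's start_char (the original assertion) would accept overlapping spans such as (0, 5) followed by (3, 9).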

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 255f96a0-24d3-4772-b561-67e9b8bec0de

📥 Commits

Reviewing files that changed from the base of the PR and between 7d81b7e and 5d78492.

📒 Files selected for processing (7)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/__init__.py
  • nodes/src/nodes/chunker/chunker_strategies.py
  • nodes/src/nodes/chunker/requirements.txt
  • nodes/src/nodes/chunker/services.json
  • test/nodes/test_chunker.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
nodes/src/nodes/chunker/IGlobal.py (1)

44-45: ⚠️ Potential issue | 🟡 Minor

Make the broad except explicitly intentional.

Line 44 still catches Exception broadly. If this is intentional for save-time config probing, add a targeted Ruff suppression with a one-line justification to avoid masking accidental regressions silently.

Proposed minimal fix
-        except Exception as e:
+        except Exception as e:  # noqa: BLE001 - validateConfig probe should warn, not block node configuration
             warning(str(e))

Based on learnings: follow the existing validateConfig style, and when broad catches are intentional, use targeted Ruff # noqa with specific BLE001 justification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/IGlobal.py` around lines 44 - 45, The broad except in
IGlobal.py (the except Exception as e block that calls warning(str(e))) is
intentional for save-time config probing but must be explicitly documented and
suppressed for Ruff; update that except to include a one-line targeted Ruff
suppression comment (e.g. "# noqa: BLE001 - intentional broad Exception for
save-time config probing, follows validateConfig pattern") immediately on the
except line so the broad catch remains but is clearly justified and silences
BLE001 while keeping the warning(str(e)) behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 58-61: Validate and sanitize chunk parameters before creating
strategies: after parsing strategy_name, chunk_size, chunk_overlap, and
encoding_name in IGlobal.py, ensure chunk_size is an integer > 0 and
chunk_overlap is an integer >= 0 and strictly < chunk_size (or clamp to
chunk_size - 1 if you prefer permissive behavior); also catch and handle
ValueError from int() conversions and log/raise a clear error when values are
invalid. Apply this validation in the same block where
strategy_name/chunk_size/chunk_overlap/encoding_name are read and also in the
other similar parsing block referenced (lines 64-80) so both code paths perform
the same range checks and consistent error handling. Use the variable names
chunk_size and chunk_overlap and the strategy creation entrypoint(s) in
IGlobal.py to locate where to add the guards.

In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 51-53: The docstring in IInstance (file IInstance.py) incorrectly
documents emitted metadata keys as chunk_index and total_chunks; update the
docstring to match runtime behavior and state that emitted documents include
metadata keys "chunkId" and "parentId" (not chunk_index/total_chunks), and
adjust any other docstring occurrences in the same file (the comments around the
emission logic referenced at lines ~85-86) to use the same "chunkId"/"parentId"
wording so documentation matches the actual emitted metadata keys.

In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 113-122: The answer output is reading hybrid_score from
doc.metadata but reranking stores scores on the document object
(reranked_doc.score), so the displayed score is always 'N/A'; update the answer
assembly in the block that uses reranked_docs to read the score from the
document's score attribute (e.g., use doc.score with a fallback like 'N/A' if
missing) when building context_parts before creating Answer and calling
self.instance.writeAnswers(ans).
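The range checks requested in the first inline comment above can be sketched as a small guard run before strategy creation (a hedged sketch; the real IGlobal reads these values from node config):

```python
def validate_chunk_params(chunk_size, chunk_overlap):
    """Coerce and range-check chunk parameters, failing with a clear message."""
    try:
        chunk_size = int(chunk_size)
        chunk_overlap = int(chunk_overlap)
    except (TypeError, ValueError) as e:
        raise ValueError(f"chunk parameters must be integers: {e}") from e
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
    if chunk_overlap >= chunk_size:
        raise ValueError(f"chunk_overlap ({chunk_overlap}) must be < chunk_size ({chunk_size})")
    return chunk_size, chunk_overlap

# Config values often arrive as strings from the UI.
size, overlap = validate_chunk_params("512", "64")
```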

---

Duplicate comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 44-45: The broad except in IGlobal.py (the except Exception as e
block that calls warning(str(e))) is intentional for save-time config probing
but must be explicitly documented and suppressed for Ruff; update that except to
include a one-line targeted Ruff suppression comment (e.g. "# noqa: BLE001 -
intentional broad Exception for save-time config probing, follows validateConfig
pattern") immediately on the except line so the broad catch remains but is
clearly justified and silences BLE001 while keeping the warning(str(e))
behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5dccd2c3-49af-40a7-ace1-93178201c391

📥 Commits

Reviewing files that changed from the base of the PR and between 5d78492 and e505691.

📒 Files selected for processing (10)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/services.json
  • nodes/src/nodes/search_hybrid/IGlobal.py
  • nodes/src/nodes/search_hybrid/IInstance.py
  • nodes/src/nodes/search_hybrid/__init__.py
  • nodes/src/nodes/search_hybrid/hybrid_search.py
  • nodes/src/nodes/search_hybrid/requirements.txt
  • nodes/src/nodes/search_hybrid/services.json
  • test/nodes/test_search_hybrid.py


@asclearuc asclearuc left a comment


2 independent nodes

@asclearuc
Collaborator

Thank you for your contribution.

What is the additional value of the new chunker node, taking into consideration that chunking is already present (see preprocess_langchain, preprocessor_llm, preprocessor_code)?

@nihalnihalani
Contributor Author

Great question! The dedicated chunker node adds value over existing preprocessors (preprocess_langchain, preprocessor_llm, preprocessor_code) in several ways:

  1. Strategy-configurable: Exposes 3 pluggable strategies (recursive, sentence, token) as a first-class pipeline step, vs preprocessors which have fixed chunking behavior
  2. Position metadata: Emits chunk_index, start_char, end_char, total_chunks per chunk — enabling downstream nodes to reconstruct original documents or highlight source spans in RAG answers
  3. Overlap control: Fine-grained chunk_overlap parameter for tuning retrieval recall per embedding model
  4. Composability: Sits between any extraction node and any embedding node, vs preprocessors which bundle extraction + chunking together

Regarding the "2 independent nodes" feedback — the chunker and hybrid search are in separate directories with zero code dependencies. Happy to split into two PRs if preferred, though they were developed as a cohesive RAG pipeline feature set. Let me know your preference.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
nodes/src/nodes/chunker/IInstance.py (1)

79-82: ⚠️ Potential issue | 🟠 Major

Construct fallback DocMetadata with required fields.

Line 81 uses DocMetadata() bare. In the real schema, objectId and chunkId are required, so the metadata=None path will raise before any chunk is emitted. The permissive stub in test/nodes/test_chunker.py currently hides this.

🛠️ Suggested fix
-                chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
+                chunk_doc.metadata = (
+                    copy.copy(document.metadata)
+                    if document.metadata
+                    else DocMetadata(objectId=parent_id, chunkId=self.chunkId)
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/IInstance.py` around lines 79 - 82, The fallback
DocMetadata() used when document.metadata is falsy lacks required fields
(objectId, chunkId); update the chunk creation logic in the chunker where
chunk_doc is made (the copy.copy(document) / chunk_doc.metadata assignment) to
construct a DocMetadata instance populated with at least the required fields
(e.g., objectId from document or a generated value and chunkId from chunk_data
or an index) instead of calling bare DocMetadata(), ensuring
chunk_doc.page_content = chunk_data['text'] remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 211-217: The size and overlap checks for chunking currently assume
a single-character separator (using current_len + 1 + sentence_len) while the
emitted chunk slices the original text with preserved whitespace, causing
undercounts; in the chunking logic (look for variables/current_len,
sentence_len, sentence_positions, start_char, end_char, chunk_text, chunk_size,
chunk_overlap in chunker_strategies.py) replace the synthetic "+ 1" separator
math with calculations based on actual original spans (use sentence_positions
and string indices to compute widths as end_index - start_index for both the
candidate addition and existing current span) so that both the capacity check
and overlap bookkeeping use the original slice widths; apply the same fix in the
other similar blocks referenced around the later checks (the sections handling
finalization and overlap at the other noted ranges).
- Around line 129-145: The problem is that overlap is prepended using
prev[-self.chunk_overlap:] from the split output (raw) which may have lost
separators, so chunk_text no longer matches text[start_char:end_char] and
offsets are wrong; to fix, first locate the raw slice's start in the original
text (compute start_char = text.find(raw, search_start) as done), then compute
overlap_text by slicing the original text (overlap_text = text[max(0, start_char
- self.chunk_overlap): start_char]) and prepend that to raw to form chunk_text,
and finally set start_char = start_char - len(overlap_text) and end_char
accordingly; apply the same change to the second similar block (the one around
lines 157-160) and reference the functions/variables chunk_overlap, raw_chunks,
raw, chunk_text, start_char, end_char, and _split_text when making the update.

---

Duplicate comments:
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 79-82: The fallback DocMetadata() used when document.metadata is
falsy lacks required fields (objectId, chunkId); update the chunk creation logic
in the chunker where chunk_doc is made (the copy.copy(document) /
chunk_doc.metadata assignment) to construct a DocMetadata instance populated
with at least the required fields (e.g., objectId from document or a generated
value and chunkId from chunk_data or an index) instead of calling bare
DocMetadata(), ensuring chunk_doc.page_content = chunk_data['text'] remains
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f902dbba-b734-4d31-9b3e-ee0acfcc03b9

📥 Commits

Reviewing files that changed from the base of the PR and between e505691 and bf59730.

📒 Files selected for processing (4)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/chunker_strategies.py
  • test/nodes/test_chunker.py


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
nodes/src/nodes/chunker/chunker_strategies.py (1)

129-145: ⚠️ Potential issue | 🟠 Major

Overlap-derived chunk_text still doesn't match the original text slice.

The past review flagged that chunk_text = overlap_text + raw produces text that won't match text[start_char:end_char] because _split_text() drops separators. The code still uses this pattern despite being marked as addressed.

For example, with text='abc\n\ndef' and chunk_overlap=1:

  • raw_chunks might be ['abc', 'def'] (separators dropped)
  • chunk_text for chunk 2 becomes 'c' + 'def' = 'cdef'
  • But text[start_char:end_char] would be 'c\n\ndef' (includes separator)
🛠️ Suggested fix: derive chunk_text from original text
-            # Compute overlap prefix from the previous chunk
-            if i > 0 and self.chunk_overlap > 0:
-                prev = raw_chunks[i - 1]
-                overlap_text = prev[-self.chunk_overlap :]
-                chunk_text = overlap_text + raw
-            else:
-                chunk_text = raw
-
-            # Find start_char position in original text
-            start_char = text.find(raw, search_start)
-            if start_char == -1:
-                start_char = search_start
-            # Adjust for overlap prefix
-            if i > 0 and self.chunk_overlap > 0:
-                overlap_start = max(0, start_char - self.chunk_overlap)
-                start_char = overlap_start
-            end_char = start_char + len(chunk_text)
+            # Find raw chunk position in original text
+            raw_start = text.find(raw, search_start)
+            if raw_start == -1:
+                raw_start = search_start
+            raw_end = raw_start + len(raw)
+
+            # Compute start_char with overlap adjustment
+            if i > 0 and self.chunk_overlap > 0:
+                start_char = max(0, raw_start - self.chunk_overlap)
+            else:
+                start_char = raw_start
+            end_char = raw_end
+
+            # Slice chunk_text from original to preserve separators
+            chunk_text = text[start_char:end_char]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 129 - 145, The
current construction chunk_text = overlap_text + raw (using raw_chunks and
chunk_overlap) can mismatch the original text because _split_text drops
separators; instead compute chunk_text by slicing the original text using
calculated indices: find start_char = text.find(raw, search_start) (fall back to
search_start if -1), then when overlap applies set overlap_start = max(0,
start_char - self.chunk_overlap) and set end_char = overlap_start +
len(chunk_text) or better end_char = start_char + len(raw) + (self.chunk_overlap
if needed), and finally derive chunk_text = text[overlap_start:end_char]; update
uses of start_char/end_char and remove the overlap_text + raw concatenation so
chunk_text always matches text[start_char:end_char] and honors separators
dropped by _split_text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 129-145: The current construction chunk_text = overlap_text + raw
(using raw_chunks and chunk_overlap) can mismatch the original text because
_split_text drops separators; instead compute chunk_text by slicing the original
text using calculated indices: find start_char = text.find(raw, search_start)
(fall back to search_start if -1), then when overlap applies set overlap_start =
max(0, start_char - self.chunk_overlap) and set end_char = overlap_start +
len(chunk_text) or better end_char = start_char + len(raw) + (self.chunk_overlap
if needed), and finally derive chunk_text = text[overlap_start:end_char]; update
uses of start_char/end_char and remove the overlap_text + raw concatenation so
chunk_text always matches text[start_char:end_char] and honors separators
dropped by _split_text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4dac1c14-a233-42b7-bda4-d5d0aa37b128

📥 Commits

Reviewing files that changed from the base of the PR and between bf59730 and fab5892.

📒 Files selected for processing (2)
  • nodes/src/nodes/chunker/chunker_strategies.py
  • test/nodes/test_chunker.py

nihalnihalani and others added 8 commits April 6, 2026 11:41
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unreachable else branch in IInstance.writeDocuments (metadata is
always non-None after shallow-copy-or-create on line 81). Remove unused
_STRATEGY_MAP dict in IGlobal. Fix services.json output description that
incorrectly referenced total_chunks metadata which is never set on chunk
documents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Add early validation for chunk_size > 0, chunk_overlap >= 0, and
chunk_overlap < chunk_size in beginGlobal() so invalid config is
caught before strategy instantiation. Also add noqa BLE001 annotation
for the intentional broad exception catch in validateConfig().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of joining sentences with ' '.join() and computing end_char
from the joined string length, derive start_char/end_char from the
sentence position map and slice the original text. This preserves
the actual whitespace between sentences (e.g., double spaces, newlines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy chunk_index, start_char, end_char, and total_chunks from the
strategy's per-chunk metadata dict onto each emitted Doc's metadata
so downstream nodes can locate chunks within the source document.
Update the writeDocuments docstring to list all metadata keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change the no-overlap position assertion to compare start_char against
the previous chunk's end_char (not start_char) so it actually verifies
non-overlapping spans.

Rewrite test_iglobal_creates_strategy to instantiate IGlobal, mock
the endpoint and glb with strategy='sentence', call beginGlobal(),
and assert the resulting strategy type and parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nting

Replace the current_len tracker (which assumed 1-char separators) with
position-based span computation that accounts for multi-character
whitespace between sentences.  After the whitespace-preservation change
(text[start_char:end_char] slicing), current_len was underestimating
the real chunk width whenever the original separator was longer than one
character (e.g. \n\n), causing chunks to silently exceed chunk_size.

The overlap budget is now also computed from actual text spans so
carried-over sentences fit within the configured overlap window.

Add two regression tests:
- overlap + repeated sentences: verify spans match text and respect size
- multi-char whitespace: verify chunks are split, not all swallowed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
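The position-based span computation described in this commit can be sketched as follows (a minimal illustration under assumed names; the actual helper in chunker_strategies.py may differ):

```python
def span_len(positions: list[int], sentences: list[str]) -> int:
    """Width of the original-text span covered by a sentence buffer.

    positions[i] is the start offset of sentences[i] in the source text,
    so the result includes whatever whitespace actually separated the
    sentences (double spaces, blank lines, ...) rather than assuming a
    1-character separator as the old current_len tracker did.
    """
    if not sentences:
        return 0
    return (positions[-1] + len(sentences[-1])) - positions[0]
```

With a two-sentence buffer drawn from `'One.\n\nTwo.'`, the span is 10 characters, not the 9 a length-sum-plus-one-separator count would give.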
@nihalnihalani nihalnihalani force-pushed the feature/rag-chunker-node branch from fab5892 to 4181638 Compare April 6, 2026 18:42
@nihalnihalani
Contributor Author

Great question — here's why the dedicated chunker node adds value beyond the existing preprocessors:

1. Pipeline composability vs preprocessing stage
preprocess_langchain is a PreProcessorBase — it runs in the preprocessing stage only. The chunker is an IInstanceBase node that can be placed anywhere in the pipeline. This means you can chunk after extraction, after filtering, after enrichment — wherever it makes sense for the use case.

2. Rich document metadata for reconstruction
The chunker emits Doc objects with chunk_index, start_char, end_char, total_chunks, and parentId. This lets downstream nodes (a) highlight source spans in RAG answers, (b) reconstruct the original document from chunks, and (c) deduplicate overlapping results. Existing preprocessors return text without this provenance tracking.

3. Strategy-configurable with precise token counting
Three pluggable strategies (recursive, sentence, token) with fine-grained chunk_overlap control. The token strategy uses tiktoken for exact token counting vs the preprocessor's byte-length heuristic (_bytes_per_token = 3.0), which matters when tuning for specific embedding model context windows.

4. Zero external dependencies
The recursive and sentence strategies are pure Python — no langchain_text_splitters required. Only the token strategy optionally uses tiktoken.

Regarding "2 independent nodes" — you're right that the chunker and hybrid search are independent. They share zero code dependencies and live in separate directories. Happy to split into two PRs if you'd prefer to review and merge them separately. Just let me know.
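As a concrete illustration of point 2, downstream code could rebuild a source document from its chunks using only the emitted offsets (a sketch; the dict shape mirrors the metadata keys named above, and the helper itself is hypothetical):

```python
def reconstruct(chunks: list[dict]) -> str:
    """Rebuild source text from chunks carrying start_char/end_char.

    Assumes each chunk looks like {'text': ..., 'metadata':
    {'start_char': int, 'end_char': int, 'parentId': str}}; overlapping
    spans are resolved by writing each chunk at its recorded offset.
    """
    if not chunks:
        return ''
    end = max(c['metadata']['end_char'] for c in chunks)
    buf = [''] * end
    for c in sorted(chunks, key=lambda c: c['metadata']['start_char']):
        start = c['metadata']['start_char']
        for i, ch in enumerate(c['text']):
            buf[start + i] = ch
    return ''.join(buf)
```

This kind of round trip is what the provenance metadata enables; plain text output from a preprocessor cannot support it.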


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

♻️ Duplicate comments (2)
nodes/src/nodes/search_hybrid/IInstance.py (1)

113-122: ⚠️ Potential issue | 🟡 Minor

Answer text still references hybrid_score from metadata instead of doc.score.

Line 116 reads doc.metadata.get('hybrid_score'), but line 101 stores the RRF score in reranked_doc.score, not in metadata. This will always display 'N/A' for the score in the answer output.

🐛 Proposed fix
         if reranked_docs and self.instance.hasListener('answers'):
             context_parts = []
             for i, doc in enumerate(reranked_docs):
-                score = doc.metadata.get('hybrid_score', 'N/A') if doc.metadata else 'N/A'
+                score = f'{doc.score:.4f}' if doc.score is not None else 'N/A'
                 snippet = (doc.page_content or '')[:500]
                 context_parts.append(f'[Document {i + 1}] (score: {score})\n{snippet}')
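For context, the RRF score stored on reranked_doc.score is typically computed like this (a standard reciprocal-rank-fusion sketch, not necessarily the node's exact implementation):

```python
def rrf_fuse(bm25_ranking: list[str], vector_ranking: list[str], rrf_k: int = 60):
    """Fuse two ranked lists of doc ids with reciprocal rank fusion.

    score(d) = sum over rankings of 1 / (rrf_k + rank(d)), with 1-based
    ranks. Returns (doc_id, score) pairs sorted best-first.
    """
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```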
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/search_hybrid/IInstance.py` around lines 113 - 122, The
answer assembly loop in IInstance.py is incorrectly reading the score from
metadata ('hybrid_score') but the RRF score was saved on the Document object as
doc.score; update the loop that builds context_parts (used before ans.setAnswer
and self.instance.writeAnswers) to read the score from doc.score (falling back
to 'N/A' if doc.score is None) instead of doc.metadata.get('hybrid_score') so
the displayed scores reflect the reranked_doc.score values.
nodes/src/nodes/chunker/chunker_strategies.py (1)

127-160: ⚠️ Potential issue | 🟠 Major

Build overlapped recursive chunks from the source slice.

Line 132 prepends overlap from raw_chunks, but _split_text() has already removed boundary separators. When a split lands on '\n\n', '. ', etc., chunk_text no longer matches text[start_char:end_char], so the emitted offsets are wrong.

🛠️ Suggested fix
@@
         for i, raw in enumerate(raw_chunks):
-            # Compute overlap prefix from the previous chunk
-            if i > 0 and self.chunk_overlap > 0:
-                prev = raw_chunks[i - 1]
-                overlap_text = prev[-self.chunk_overlap :]
-                chunk_text = overlap_text + raw
-            else:
-                chunk_text = raw
-
-            # Find start_char position in original text
-            start_char = text.find(raw, search_start)
-            if start_char == -1:
-                start_char = search_start
-            # Adjust for overlap prefix
-            if i > 0 and self.chunk_overlap > 0:
-                overlap_start = max(0, start_char - self.chunk_overlap)
-                start_char = overlap_start
-            end_char = start_char + len(chunk_text)
+            raw_start = text.find(raw, search_start)
+            if raw_start == -1:
+                raw_start = search_start
+            raw_end = raw_start + len(raw)
+
+            if i > 0 and self.chunk_overlap > 0:
+                start_char = max(0, raw_start - self.chunk_overlap)
+            else:
+                start_char = raw_start
+            end_char = raw_end
+            chunk_text = text[start_char:end_char]
@@
-            raw_start = text.find(raw, search_start)
-            if raw_start >= 0:
-                search_start = raw_start + len(raw)
+            search_start = raw_end
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 127 - 160, The
overlap prepending step uses text transformed by _split_text(), so chunk_text no
longer aligns with the original source and emitted offsets are wrong; update the
overlap construction in the chunking routine to derive overlap and offsets from
the original/raw text/raw_chunks (use the original raw_chunks text or original
source slice to compute the overlap substring and compute start_char/end_char
relative to the original source) rather than the post-split chunk_text, and
ensure emitted offsets are calculated from that original text so returned
chunk.offsets correspond to indices in the original source; locate references to
raw_chunks, _split_text(), chunk_text and the code that sets start_char/end_char
to make this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 216-277: SentenceChunker currently can emit chunks larger than
chunk_size because the budget check at SentenceChunker (use of current_sentences
and chunk_size) skips the first sentence and the finalize path restores overlap
without re-checking sizes; fix by enforcing chunk_size for any single sentence
and during finalization: when building a chunk in add_sentence (or wherever
current_sentences is appended), detect if a single sentence length > chunk_size
and split that sentence into smaller spans (respecting chunk_size and desired
overlap policy) before emitting; and in finalize/restore-overlap logic, before
appending overlap+sentence repeatedly trim the carried overlap (pop or shorten
overlap segments from current_sentences) until overlap + next sentence <=
chunk_size, and if still larger, apply the same fallback split to the oversized
sentence so every emitted chunk is <= chunk_size.
- Around line 336-349: The current mapping uses len(encoder.decode(...)) as a
character offset which breaks for multibyte UTF‑8 tokens; instead compute token
byte lengths using encoder.decode_single_token_bytes() (or aggregate its
results) to produce start_byte and end_byte offsets, then convert those byte
offsets back to character indices into the original text (e.g., by slicing
text.encode('utf-8') or using incremental decoding to map byte->char index)
before slicing; update the logic around encoder.decode(...) and the code that
builds slices in the functions that assemble token spans so all boundaries use
byte-accurate offsets mapped to character indices.
- Around line 311-312: The call to Encoding.encode(...) in chunker_strategies.py
is rejecting sentinel substrings like "<|endoftext|>" and raising ValueError;
update the Encoding.encode invocation (the call at/near line 312) to include the
parameter disallowed_special=() so sentinel strings are treated as normal text
(i.e., call Encoding.encode(..., disallowed_special=()) wherever the
chunking/tokenization routine—e.g., the method that invokes Encoding.encode in
this module—is encoding document text); ensure any similar Encoding.encode
usages in the same class/function are updated likewise.

In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 79-82: The shallow copy of Document in the chunking logic
(chunk_doc = copy.copy(document) in IInstance) leaves inherited embedding state
on the chunk; after setting chunk_doc.page_content and chunk_doc.metadata,
explicitly clear chunk_doc.embedding and chunk_doc.embedding_model (set them to
None) so each emitted chunk does not carry the parent document's vector state.
Ensure you update the code around where chunk_doc is created/modified
(references: chunk_doc, copy.copy(document), chunk_doc.metadata,
chunk_doc.page_content) before appending/emitting the chunk.

In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 60-65: The code casts top_k and rrf_k from config without
validating their ranges; update the initialization in IGlobal (where alpha,
self.top_k, self.rrf_k are set) to validate that top_k and rrf_k are positive
integers (e.g., > 0) and clamp or raise a clear error/log if they are invalid,
similar to the alpha validation; ensure you convert the values to int after
validation and keep existing behavior for defaults (top_k=10, rrf_k=60) when
values are missing or invalid.
- Around line 46-47: The except block that catches a broad Exception (the line
with "except Exception as e:" followed by "warning(str(e))") should include the
suppression comment "# noqa: BLE001" to match other nodes; update that except
line to append the comment so the linter suppression is consistent with
chunker/IGlobal.py.

In `@test/nodes/test_chunker.py`:
- Around line 517-523: The _restore_modules method currently only restores
entries named in stub_names but leaves cached imports like modules starting with
"nodes.src.nodes.chunker." that were loaded against the stubs; update
_restore_modules (the function named _restore_modules) to also remove from
sys.modules any keys that match the chunker module namespace (e.g. startwith
"nodes.src.nodes.chunker.") when cleaning up: for each such matching module, pop
it from sys.modules so subsequent tests don't reuse modules wired to the fakes;
keep the existing logic that restores saved[name] for stub_names that were
present.

In `@test/nodes/test_search_hybrid.py`:
- Around line 78-83: The teardown _restore_modules currently only restores exact
dependency names in _STUB_MODULES/_original_modules, leaving stubbed packages
(e.g., the fake rank_bm25 and any modules under search_hybrid that were imported
while stubs were active) cached in sys.modules; update _restore_modules so it
removes from sys.modules any entry whose name is equal to or starts with any
name in _STUB_MODULES (or whose module object is the stub) before restoring
originals, ensuring that both top-level stub modules and any submodules (e.g.,
"rank_bm25" and "rank_bm25.*" or "rocketlib.*") are purged so later tests import
clean copies.

---

Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 127-160: The overlap prepending step uses text transformed by
_split_text(), so chunk_text no longer aligns with the original source and
emitted offsets are wrong; update the overlap construction in the chunking
routine to derive overlap and offsets from the original/raw text/raw_chunks (use
the original raw_chunks text or original source slice to compute the overlap
substring and compute start_char/end_char relative to the original source)
rather than the post-split chunk_text, and ensure emitted offsets are calculated
from that original text so returned chunk.offsets correspond to indices in the
original source; locate references to raw_chunks, _split_text(), chunk_text and
the code that sets start_char/end_char to make this change.

In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 113-122: The answer assembly loop in IInstance.py is incorrectly
reading the score from metadata ('hybrid_score') but the RRF score was saved on
the Document object as doc.score; update the loop that builds context_parts
(used before ans.setAnswer and self.instance.writeAnswers) to read the score
from doc.score (falling back to 'N/A' if doc.score is None) instead of
doc.metadata.get('hybrid_score') so the displayed scores reflect the
reranked_doc.score values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 639ea044-72b7-4b42-a58b-933fa072c7b3

📥 Commits

Reviewing files that changed from the base of the PR and between fab5892 and 4181638.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml, !pnpm-lock.yaml
📒 Files selected for processing (14)
  • nodes/src/nodes/chunker/IGlobal.py
  • nodes/src/nodes/chunker/IInstance.py
  • nodes/src/nodes/chunker/__init__.py
  • nodes/src/nodes/chunker/chunker_strategies.py
  • nodes/src/nodes/chunker/requirements.txt
  • nodes/src/nodes/chunker/services.json
  • nodes/src/nodes/search_hybrid/IGlobal.py
  • nodes/src/nodes/search_hybrid/IInstance.py
  • nodes/src/nodes/search_hybrid/__init__.py
  • nodes/src/nodes/search_hybrid/hybrid_search.py
  • nodes/src/nodes/search_hybrid/requirements.txt
  • nodes/src/nodes/search_hybrid/services.json
  • test/nodes/test_chunker.py
  • test/nodes/test_search_hybrid.py

Comment on lines +216 to +277
    # Compute what the span would be if we added this sentence
    if current_sentences:
        candidate_len = (next_pos + len(sentence)) - sentence_positions[0]
    else:
        candidate_len = len(sentence)

    # If adding this sentence exceeds chunk_size and we have content, finalize current
    if current_sentences and candidate_len > self.chunk_size:
        start_char = sentence_positions[0]
        end_char = sentence_positions[-1] + len(current_sentences[-1])
        chunk_text = text[start_char:end_char]

        result.append(
            {
                'text': chunk_text,
                'metadata': {
                    'chunk_index': chunk_index,
                    'start_char': start_char,
                    'end_char': end_char,
                },
            }
        )
        chunk_index += 1

        # Compute overlap: keep trailing sentences whose actual span fits within overlap
        if self.chunk_overlap > 0:
            overlap_sentences: list[str] = []
            overlap_positions: list[int] = []
            for i in range(len(current_sentences) - 1, -1, -1):
                trial_positions = [sentence_positions[i]] + overlap_positions
                trial_sentences = [current_sentences[i]] + overlap_sentences
                if _span_len(trial_positions, trial_sentences) <= self.chunk_overlap:
                    overlap_sentences = trial_sentences
                    overlap_positions = trial_positions
                else:
                    break
            current_sentences = overlap_sentences
            sentence_positions = overlap_positions
        else:
            current_sentences = []
            sentence_positions = []

    # Add the sentence
    current_sentences.append(sentence)
    sentence_positions.append(next_pos)

# Emit the final chunk
if current_sentences:
    start_char = sentence_positions[0]
    end_char = sentence_positions[-1] + len(current_sentences[-1])
    chunk_text = text[start_char:end_char]

    result.append(
        {
            'text': chunk_text,
            'metadata': {
                'chunk_index': chunk_index,
                'start_char': start_char,
                'end_char': end_char,
            },
        }
    )

⚠️ Potential issue | 🟠 Major

SentenceChunker can still emit chunks larger than chunk_size.

Line 223 only enforces the budget when current_sentences is non-empty, so a first sentence longer than the limit is emitted unchanged. After finalization, Lines 252-259 also restore overlap and append the current sentence without another size check, so overlap + sentence can overflow as well. Add a fallback split for oversized single sentences, and trim carried overlap until the new span fits before appending.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 216 - 277,
SentenceChunker currently can emit chunks larger than chunk_size because the
budget check at SentenceChunker (use of current_sentences and chunk_size) skips
the first sentence and the finalize path restores overlap without re-checking
sizes; fix by enforcing chunk_size for any single sentence and during
finalization: when building a chunk in add_sentence (or wherever
current_sentences is appended), detect if a single sentence length > chunk_size
and split that sentence into smaller spans (respecting chunk_size and desired
overlap policy) before emitting; and in finalize/restore-overlap logic, before
appending overlap+sentence repeatedly trim the carried overlap (pop or shorten
overlap segments from current_sentences) until overlap + next sentence <=
chunk_size, and if still larger, apply the same fallback split to the oversized
sentence so every emitted chunk is <= chunk_size.

Comment on lines +311 to +312
encoder = self._get_encoder()
tokens = encoder.encode(text)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import pathlib
import re
import subprocess
import sys

pkg = 'tiktoken'
req_path = pathlib.Path('nodes/src/nodes/chunker/requirements.txt')
if req_path.exists():
    match = re.search(r'^\s*tiktoken(?:==([^\s#]+))?', req_path.read_text(), re.M)
    if match and match.group(1):
        pkg = f'tiktoken=={match.group(1)}'

subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
sample = 'prefix <|endoftext|> suffix'

try:
    print('default:', enc.encode(sample))
except Exception as exc:
    print('default raised:', type(exc).__name__, exc)

print('disallowed_special=():', enc.encode(sample, disallowed_special=()))
PY

Repository: rocketride-org/rocketride-server

Length of output: 197


🏁 Script executed:

git ls-files nodes/src/nodes/chunker/ | head -20

Repository: rocketride-org/rocketride-server

Length of output: 308


🏁 Script executed:

cat -n nodes/src/nodes/chunker/chunker_strategies.py | sed -n '305,320p'

Repository: rocketride-org/rocketride-server

Length of output: 612


🏁 Script executed:

cat nodes/src/nodes/chunker/requirements.txt

Repository: rocketride-org/rocketride-server

Length of output: 98


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import subprocess
import sys

# Install tiktoken with version constraint from requirements.txt
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'tiktoken>=0.7.0,<1.0.0'])

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
sample = 'prefix <|endoftext|> suffix'

print("Testing tiktoken behavior with sentinel strings:")
print()

try:
    result = enc.encode(sample)
    print('default call succeeded:', result)
except Exception as exc:
    print('default call raised:', type(exc).__name__, '-', exc)

print()

try:
    result = enc.encode(sample, disallowed_special=())
    print('with disallowed_special=() call succeeded:', result)
except Exception as exc:
    print('with disallowed_special=() raised:', type(exc).__name__, '-', exc)

print()
print("Testing with just the sentinel string:")
try:
    result = enc.encode('<|endoftext|>')
    print('sentinel alone (default) succeeded:', result)
except Exception as exc:
    print('sentinel alone (default) raised:', type(exc).__name__)

try:
    result = enc.encode('<|endoftext|>', disallowed_special=())
    print('sentinel alone (disallowed_special=()) succeeded:', result)
except Exception as exc:
    print('sentinel alone (disallowed_special=()) raised:', type(exc).__name__)
PY

Repository: rocketride-org/rocketride-server

Length of output: 865


Add disallowed_special=() parameter to treat sentinel strings as normal text.

Encoding.encode() rejects substrings such as <|endoftext|> by default, raising a ValueError. Line 312 will therefore raise on valid document content instead of chunking it.

🛠️ Suggested fix
-        tokens = encoder.encode(text)
+        tokens = encoder.encode(text, disallowed_special=())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 311 - 312, The
call to Encoding.encode(...) in chunker_strategies.py is rejecting sentinel
substrings like "<|endoftext|>" and raising ValueError; update the
Encoding.encode invocation (the call at/near line 312) to include the parameter
disallowed_special=() so sentinel strings are treated as normal text (i.e., call
Encoding.encode(..., disallowed_special=()) wherever the chunking/tokenization
routine—e.g., the method that invokes Encoding.encode in this module—is encoding
document text); ensure any similar Encoding.encode usages in the same
class/function are updated likewise.

Comment on lines +336 to +349
# Decode tokens back to text, handling errors gracefully
try:
chunk_text = encoder.decode(chunk_tokens)
except Exception:
chunk_text = encoder.decode(chunk_tokens, errors='replace')

# Use cached character position for start_char (O(1) lookup)
start_char = char_pos_cache.get(start)
if start_char is None:
# Fallback: decode prefix (should not happen with correct step caching)
prefix_text = encoder.decode(tokens[:start])
start_char = len(prefix_text)
char_pos_cache[start] = start_char
end_char = start_char + len(chunk_text)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import pathlib
import re
import subprocess
import sys

pkg = 'tiktoken'
req_path = pathlib.Path('nodes/src/nodes/chunker/requirements.txt')
if req_path.exists():
    match = re.search(r'^\s*tiktoken(?:==([^\s#]+))?', req_path.read_text(), re.M)
    if match and match.group(1):
        pkg = f'tiktoken=={match.group(1)}'

subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
candidates = ['🙂', '🚀', '你好', 'café', 'naïve', '👨‍👩‍👧‍👦', '🏳️‍🌈']
candidates.extend(chr(cp) for cp in range(0x80, 0x2000) if chr(cp).isprintable())

for text in candidates:
    tokens = enc.encode(text, disallowed_special=())
    if len(tokens) < 2:
        continue

    prefix = b''
    for i, tok in enumerate(tokens[:-1], 1):
        prefix += enc.decode_single_token_bytes(tok)
        try:
            prefix.decode('utf-8')
        except UnicodeDecodeError:
            print('invalid_utf8_boundary', repr(text), tokens, 'boundary', i)
            print('decoded_prefix', repr(enc.decode(tokens[:i], errors='replace')))
            raise SystemExit(0)

print('no_invalid_boundary_found')
PY

Repository: rocketride-org/rocketride-server

Length of output: 197


🏁 Script executed:

cd nodes && find . -name "chunker_strategies.py" -o -name "requirements.txt" | head -20

Repository: rocketride-org/rocketride-server

Length of output: 919


🏁 Script executed:

cd nodes && wc -l ./src/nodes/chunker/chunker_strategies.py

Repository: rocketride-org/rocketride-server

Length of output: 121


🏁 Script executed:

cd nodes && sed -n '336,372p' ./src/nodes/chunker/chunker_strategies.py

Repository: rocketride-org/rocketride-server

Length of output: 1727


🏁 Script executed:

cd nodes && sed -n '320,380p' ./src/nodes/chunker/chunker_strategies.py

Repository: rocketride-org/rocketride-server

Length of output: 2532


🏁 Script executed:

cd nodes && cat ./src/nodes/chunker/requirements.txt

Repository: rocketride-org/rocketride-server

Length of output: 98


🏁 Script executed:

cd nodes && python - <<'PY'
import sys
import subprocess

# Install tiktoken with the version constraint
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'tiktoken>=0.7.0,<1.0.0'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
candidates = ['🙂', '🚀', '你好', 'café', 'naïve', '👨‍👩‍👧‍👦', '🏳️‍🌈']
candidates.extend(chr(cp) for cp in range(0x80, 0x2000) if chr(cp).isprintable())

invalid_boundaries_found = []
for text in candidates:
    tokens = enc.encode(text, disallowed_special=())
    if len(tokens) < 2:
        continue

    prefix = b''
    for i, tok in enumerate(tokens[:-1], 1):
        prefix += enc.decode_single_token_bytes(tok)
        try:
            prefix.decode('utf-8')
        except UnicodeDecodeError:
            invalid_boundaries_found.append({
                'text': text,
                'tokens': tokens,
                'boundary_index': i,
                'decoded_with_replacement': enc.decode(tokens[:i], errors='replace')
            })
            break

if invalid_boundaries_found:
    print(f"FOUND {len(invalid_boundaries_found)} INVALID UTF-8 BOUNDARIES:")
    for example in invalid_boundaries_found[:5]:  # Show first 5
        print(f"  Text: {repr(example['text'])}, Tokens: {example['tokens']}, Boundary at token {example['boundary_index']}")
        print(f"    Decoded with replacement: {repr(example['decoded_with_replacement'])}")
else:
    print("NO INVALID UTF-8 BOUNDARIES FOUND")
PY

Repository: rocketride-org/rocketride-server

Length of output: 715


Use byte-level offsets to correctly map token boundaries to character positions in non-ASCII text.

Tiktoken's token boundaries are byte-based and can split multibyte UTF-8 sequences. Lines 343–349 and 365–372 incorrectly use len(encoder.decode(...)) as a character offset, which fails when token boundaries land mid-character. When errors='replace' substitutes invalid sequences with '�', the character count diverges from the original text's indices. Build offsets from decode_single_token_bytes() aggregates, then map byte positions back to character indices before slicing text.
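A minimal sketch of the conversion step this suggests (helper name hypothetical): given a byte offset produced by summing decode_single_token_bytes() lengths, map it back to a character index before slicing the original text:

```python
def byte_to_char_index(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into `text` to a character index.

    A boundary that lands mid-character (possible, since BPE tokens are
    byte sequences) is rounded down to the last complete character via
    errors='ignore'.
    """
    prefix = text.encode('utf-8')[:byte_offset]
    return len(prefix.decode('utf-8', errors='ignore'))
```

Unlike len(encoder.decode(...)), this never counts a '�' replacement character, so the returned index always points into the original string.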

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 336 - 349, The
current mapping uses len(encoder.decode(...)) as a character offset which breaks
for multibyte UTF‑8 tokens; instead compute token byte lengths using
encoder.decode_single_token_bytes() (or aggregate its results) to produce
start_byte and end_byte offsets, then convert those byte offsets back to
character indices into the original text (e.g., by slicing text.encode('utf-8')
or using incremental decoding to map byte->char index) before slicing; update
the logic around encoder.decode(...) and the code that builds slices in the
functions that assemble token spans so all boundaries use byte-accurate offsets
mapped to character indices.

Comment on lines +79 to +82
# Shallow copy of document, explicit copy of metadata only
chunk_doc = copy.copy(document)
chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
chunk_doc.page_content = chunk_data['text']

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo 'Schema files exposing Doc / DocMetadata fields:'
fd -i 'schema\.py$' -x rg -n -C2 '\bclass\s+Doc\b|\bclass\s+DocMetadata\b|\bembedding(s)?\b|\bvector(s)?\b' {}

echo
echo 'Node code referencing existing vector-bearing fields:'
rg -n -C2 '\b(embedding|embeddings|vector|vectors)\b' --glob '*.py' -g '!test/**' -g '!**/test/**'

echo
echo 'Chunk emission loop under review:'
sed -n '79,94p' nodes/src/nodes/chunker/IInstance.py

Repository: rocketride-org/rocketride-server

Length of output: 50389


Clear inherited embeddings when cloning documents for chunks.

Line 80 shallow-copies the entire document, and line 82 replaces only page_content. The Doc schema defines both embedding and embedding_model fields (see packages/client-python/src/rocketride/schema/doc.py), and each emitted chunk will inherit the parent document's vector state even though its text has changed. This causes stale embeddings to propagate downstream, poisoning indexing or allowing later nodes to skip required re-embedding.

Reset both fields before appending:

chunk_doc.embedding = None
chunk_doc.embedding_model = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/chunker/IInstance.py` around lines 79 - 82, The shallow copy
of Document in the chunking logic (chunk_doc = copy.copy(document) in IInstance)
leaves inherited embedding state on the chunk; after setting
chunk_doc.page_content and chunk_doc.metadata, explicitly clear
chunk_doc.embedding and chunk_doc.embedding_model (set them to None) so each
emitted chunk does not carry the parent document's vector state. Ensure you
update the code around where chunk_doc is created/modified (references:
chunk_doc, copy.copy(document), chunk_doc.metadata, chunk_doc.page_content)
before appending/emitting the chunk.

Comment on lines +46 to +47
except Exception as e:
    warning(str(e))

🧹 Nitpick | 🔵 Trivial

Add # noqa: BLE001 for consistency with other nodes.

The broad Exception catch should include the suppression comment for consistency with the established pattern (as done in chunker/IGlobal.py line 44).

♻️ Proposed fix
-        except Exception as e:
+        except Exception as e:  # noqa: BLE001 - intentional broad catch for dependency probing
             warning(str(e))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 46 - 47, The except
block that catches a broad Exception (the line with "except Exception as e:"
followed by "warning(str(e))") should include the suppression comment "# noqa:
BLE001" to match other nodes; update that except line to append the comment so
the linter suppression is consistent with chunker/IGlobal.py.

Comment on lines +60 to +65
alpha = float(config.get('alpha', 0.5))
self.top_k = int(config.get('top_k', 10))
self.rrf_k = int(config.get('rrf_k', 60))

# Validate alpha range
alpha = max(0.0, min(1.0, alpha))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider validating top_k and rrf_k parameters.

chunk_size/chunk_overlap are validated in the chunker node, but top_k and rrf_k are cast without range checks. Invalid values (e.g., top_k <= 0 or rrf_k <= 0) could cause unexpected behavior in the search engine.

🛡️ Proposed guardrails
             alpha = float(config.get('alpha', 0.5))
             self.top_k = int(config.get('top_k', 10))
             self.rrf_k = int(config.get('rrf_k', 60))

+            if self.top_k <= 0:
+                raise ValueError(f'top_k must be positive, got {self.top_k}')
+            if self.rrf_k <= 0:
+                raise ValueError(f'rrf_k must be positive, got {self.rrf_k}')
+
             # Validate alpha range
             alpha = max(0.0, min(1.0, alpha))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 60 - 65, The code
casts top_k and rrf_k from config without validating their ranges; update the
initialization in IGlobal (where alpha, self.top_k, self.rrf_k are set) to
validate that top_k and rrf_k are positive integers (e.g., > 0) and clamp or
raise a clear error/log if they are invalid, similar to the alpha validation;
ensure you convert the values to int after validation and keep existing behavior
for defaults (top_k=10, rrf_k=60) when values are missing or invalid.
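
For context, rrf_k sits in the denominator of the reciprocal rank fusion score, which is why the suggested range check matters: a non-positive value can make the denominator zero or negative. A minimal sketch of RRF fusion — the function and names here are illustrative assumptions, not the search_hybrid node's actual API:

```python
# Illustrative sketch of reciprocal rank fusion (RRF); rrf_fuse and its
# signature are assumptions for this example, not the node's real code.
def rrf_fuse(rankings, rrf_k=60):
    """Fuse several ranked lists of doc ids into a single ordering.

    Each doc scores sum(1 / (rrf_k + rank)) over the lists it appears in,
    with rank starting at 1. A non-positive rrf_k can zero or flip the
    denominator, hence the guard below.
    """
    if rrf_k <= 0:
        raise ValueError(f'rrf_k must be positive, got {rrf_k}')
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both the BM25 list and the vector list accumulates contributions from both, so it outranks documents that appear in only one — which is the behaviour the node's rrf_k tunes.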

Comment on lines +517 to +523
def _restore_modules(self, saved, stub_names):
for name in stub_names:
if saved.get(name) is None:
sys.modules.pop(name, None)
else:
sys.modules[name] = saved[name]

⚠️ Potential issue | 🟠 Major

Evict the stub-imported chunker modules during cleanup.

Lines 533-550 and 589-592 import nodes.src.nodes.chunker.* while rocketlib / ai.common are stubbed, but Lines 517-523 only restore the stubs themselves. The cached nodes.src.nodes.chunker.* modules survive in sys.modules, so later tests can reuse modules that are already wired to the fakes.

🧹 Suggested cleanup
     def _restore_modules(self, saved, stub_names):
+        for mod_name in list(sys.modules):
+            if mod_name.startswith('nodes.src.nodes.chunker'):
+                sys.modules.pop(mod_name, None)
         for name in stub_names:
             if saved.get(name) is None:
                 sys.modules.pop(name, None)
             else:
                 sys.modules[name] = saved[name]

Also applies to: 524-557, 581-592

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/nodes/test_chunker.py` around lines 517 - 523, The _restore_modules
method currently only restores entries named in stub_names but leaves cached
imports like modules starting with "nodes.src.nodes.chunker." that were loaded
against the stubs; update _restore_modules (the function named _restore_modules)
to also remove from sys.modules any keys that match the chunker module namespace
(e.g. startwith "nodes.src.nodes.chunker.") when cleaning up: for each such
matching module, pop it from sys.modules so subsequent tests don't reuse modules
wired to the fakes; keep the existing logic that restores saved[name] for
stub_names that were present.

Comment on lines +78 to +83
def _restore_modules():
for name in _STUB_MODULES:
if _original_modules.get(name) is None:
sys.modules.pop(name, None)
else:
sys.modules[name] = _original_modules[name]

⚠️ Potential issue | 🟠 Major

Restore the sys.modules stubs in teardown.

Lines 86-87 install a fake rank_bm25 at module import time, and Lines 578-588 import nodes.src.nodes.search_hybrid.* while rocketlib / ai.common are stubbed. The cleanup paths only restore the dependency modules, so the BM25 stub and the stub-bound search_hybrid modules stay cached for later tests. That makes the suite order-dependent.

Also applies to: 86-87, 571-588

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/nodes/test_search_hybrid.py` around lines 78 - 83, The teardown
_restore_modules currently only restores exact dependency names in
_STUB_MODULES/_original_modules, leaving stubbed packages (e.g., the fake
rank_bm25 and any modules under search_hybrid that were imported while stubs
were active) cached in sys.modules; update _restore_modules so it removes from
sys.modules any entry whose name is equal to or starts with any name in
_STUB_MODULES (or whose module object is the stub) before restoring originals,
ensuring that both top-level stub modules and any submodules (e.g., "rank_bm25"
and "rank_bm25.*" or "rocketlib.*") are purged so later tests import clean
copies.

@asclearuc
Collaborator

Thank you for the contribution and for the detailed follow-up explaining the value of the chunker node — it helped clarify the design intent.

After review, we've decided to close this PR in favour of improving the existing preprocessor nodes instead. The reasoning:

  1. preprocessor_langchain already covers most of this ground. It supports multiple splitter strategies, runs as an IInstanceBase node, and handles chunking today. The gaps this PR addresses — configurable overlap, richer position metadata (start_char, end_char, total_chunks), and a writeDocuments input lane for already-materialised Doc objects — are incremental improvements to that node, not reasons to introduce a new one.

  2. Node proliferation has a real cost. Users would need to understand when to pick preprocessor_langchain vs chunker, two nodes doing overlapping work would diverge over time, and the existing chunk_overlap = 0 limitation would remain unfixed in the node that pipelines already use.

  3. The right fix is extension, not addition. Adding a writeDocuments lane, exposing chunk_overlap as a config field, and emitting position metadata to preprocessor_langchain achieves everything this PR sets out to do — with no new node.

If you'd like to continue contributing, we'd welcome a PR that extends preprocessor_langchain along those lines. Happy to discuss the approach before you start.
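
For reference, the overlap and position metadata under discussion (chunk_overlap, start_char, end_char, total_chunks) can be sketched as a character-based toy — this is an illustration of the behaviour, not preprocessor_langchain's actual splitter or config:

```python
# Toy fixed-size chunker with overlap and position metadata; names mirror
# the fields mentioned in this thread but are illustrative only.
def chunk_with_overlap(text, chunk_size=200, chunk_overlap=50):
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        end = min(start + chunk_size, len(text))
        chunks.append({'text': text[start:end],
                       'start_char': start,
                       'end_char': end})
        if end == len(text):
            break
    # total_chunks is only known once splitting finishes.
    for i, chunk in enumerate(chunks):
        chunk.update(chunk_index=i, total_chunks=len(chunks))
    return chunks
```

Exposing chunk_overlap as a config field and attaching this kind of position metadata to each emitted chunk is essentially the extension proposed above.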

@asclearuc asclearuc closed this Apr 7, 2026
@nihalnihalani
Contributor Author

@asclearuc How can I reach out to you? In one of the earlier PRs I was told to use this approach and split the work into two new PRs. I would like to point that review out to you and discuss the next steps to complete this feature.


Labels

module:nodes Python pipeline nodes
