feat(nodes): add text chunker node with recursive, sentence, and token strategies (#2)
Conversation
Note: Reviews paused. This branch appears to be under active development, so CodeRabbit has automatically paused this review to avoid overwhelming you with comments during an influx of new commits. You can configure this behavior in the settings.
📝 Walkthrough
Adds a Text Chunker node (recursive, sentence, token) with global lifecycle and per-document chunk emission, and a Hybrid Search node implementing BM25 + vector RRF fusion. Includes service definitions, runtime requirements, and comprehensive unit and integration tests.
Sequence Diagram(s)
sequenceDiagram
actor Node
participant IGlobal as IGlobal (chunker)
participant Strategy as ChunkingStrategy
participant IInstance as ChunkerInstance
Node->>IGlobal: validateConfig()
IGlobal->>IGlobal: probe local requirements.txt
Node->>IGlobal: beginGlobal()
IGlobal->>Strategy: instantiate strategy (recursive|sentence|token)
Node->>IInstance: writeDocuments([Doc])
IInstance->>Strategy: chunk(text)
Strategy-->>IInstance: [chunk1, chunk2, ...]
loop per chunk
IInstance->>IInstance: copy Doc, set metadata(chunkId,parentId,chunk_index,start_char,end_char)
IInstance->>Node: emit chunked Doc
end
Node->>IGlobal: endGlobal()
IGlobal->>IGlobal: strategy = None
sequenceDiagram
actor Client
participant IGlobal as IGlobal (hybrid)
participant Engine as HybridSearchEngine
participant IInstance as HybridInstance
Client->>IGlobal: validateConfig()
IGlobal->>IGlobal: probe local requirements.txt
Client->>IGlobal: beginGlobal()
IGlobal->>Engine: instantiate HybridSearchEngine(alpha)
Client->>IInstance: writeQuestions(Question)
IInstance->>Engine: search(query, docs, vector_scores, top_k, rrf_k)
Engine->>Engine: bm25_search(...) and/or build vector ranking
Engine->>Engine: reciprocal_rank_fusion(...)
Engine-->>IInstance: ranked results (with rrf_score)
IInstance->>IInstance: map results -> Doc copies, set score/metadata
IInstance->>Client: emit documents (and answers if configured)
Client->>IGlobal: endGlobal()
IGlobal->>IGlobal: engine = None
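The reciprocal_rank_fusion step in the hybrid-search diagram above can be sketched as follows. This is a minimal standalone illustration, not the repository's HybridSearchEngine; the ranking inputs and the rrf_k default of 60 are assumptions.

```python
# Minimal RRF sketch: fuse ranked lists of doc ids into one ordering.
# Illustrative only; the real engine's names and signatures may differ.
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists; each list contributes 1/(rrf_k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranking = ["d2", "d1", "d3"]      # e.g. ordered by BM25 score
vector_ranking = ["d2", "d4", "d1"]    # e.g. ordered by cosine similarity
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

A document ranked first in both lists scores 2/(rrf_k + 1), so agreement between the BM25 and vector rankings dominates the fused order, while rrf_k damps the influence of any single top rank.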
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 213-215: The code rebuilds chunks using ' '.join(current_sentences), which normalizes whitespace and breaks character
offsets; instead compute chunk boundaries from the original text using
sentence_positions and slice the original text for chunk_text so spacing is
preserved and end_char is accurate: use start_char = sentence_positions[0] and
end_char = sentence_positions[-1] +
len(original_sentence_text_for_that_position) (or maintain a
sentence_end_positions array) then set chunk_text =
original_text[start_char:end_char]; update the code paths that set chunk_text,
start_char, and end_char (references: chunk_text, current_sentences,
sentence_positions, start_char, end_char, original_text/original_sentences)
accordingly.
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 45-53: The validateConfig method currently calls
depends(requirements) unconditionally and swallows all exceptions; change it so
you only invoke the installer when the node is configured to use the token
strategy (check the config/profile field used by this class, e.g.,
self.config.profile == 'token' or similar), and if depends(...) fails during
that branch surface the failure by raising an exception (or returning a failed
validation) instead of calling warning(); keep other profiles (e.g.,
'recursive', 'sentence') free of side effects and do not call depends for them.
Ensure you reference validateConfig and the depends(requirements) call when
making the change so the installer is only triggered for the token profile and
failures fail validation.
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 41-45: The code uses a single entry-scoped counter self.chunkId
(defined on IInstance and reset in open) and overwrites per-chunk metadata,
losing per-document chunk indices and start_char/end_char; update the logic so
emitted chunk metadata uses the per-document index from chunk_data['metadata']
(e.g., metadata.get('chunk') or metadata['chunk']) instead of incrementing
self.chunkId, or if you prefer an explicit per-document counter,
initialize/reset a document-scoped counter in IInstance.open and use that only
for chunks of the current document; adjust the emission code (the block
referenced around lines 76-92) to read and preserve metadata['start_char'] and
metadata['end_char'] and to reference either metadata['chunk'] or the reset
per-document counter rather than self.chunkId.
- Around line 79-89: The current code creates DocMetadata() with no required
fields when document.metadata is falsy, causing construction to fail before the
later fallback runs; fix the chunk metadata handling in the IInstance chunking
logic by initializing chunk_doc.metadata to a properly populated DocMetadata
when document.metadata is missing (e.g., DocMetadata(objectId=parent_id,
chunkId=self.chunkId, parentId=parent_id)), otherwise shallow-copy
document.metadata and then set chunk_doc.metadata.chunkId/self.chunkId and
parentId=parent_id as before (update references: chunk_doc, DocMetadata,
self.chunkId, parent_id).
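As a hedged sketch of the offset bookkeeping the chunker comments above ask for (slice the original text rather than re-join sentences), assuming a naive regex splitter and a max_sentences grouping rule rather than the node's real strategy:

```python
import re

# Illustrative offset-preserving chunker: boundaries come from spans in the
# ORIGINAL text, so whitespace is never normalized and end_char stays exact.
def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character spans covering the text sentence by sentence."""
    return [(m.start(), m.end()) for m in re.finditer(r'[^.!?]+[.!?]*', text)]

def chunk_by_sentences(text: str, max_sentences: int = 2) -> list[dict]:
    chunks = []
    spans = sentence_spans(text)
    for i in range(0, len(spans), max_sentences):
        group = spans[i:i + max_sentences]
        start_char, end_char = group[0][0], group[-1][1]
        # Slicing the source text preserves "\n\n", tabs, and repeated spaces,
        # and guarantees text[start_char:end_char] == chunk_text.
        chunks.append({'text': text[start_char:end_char],
                       'start_char': start_char, 'end_char': end_char})
    return chunks
```

Because every chunk is a literal slice of the source, downstream consumers can always recover the chunk from (start_char, end_char) alone.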
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 81815d7d-7bd7-4c85-a219-0d14c343c8e0
📒 Files selected for processing (7)
nodes/src/nodes/chunker/IGlobal.py
nodes/src/nodes/chunker/IInstance.py
nodes/src/nodes/chunker/__init__.py
nodes/src/nodes/chunker/chunker_strategies.py
nodes/src/nodes/chunker/requirements.txt
nodes/src/nodes/chunker/services.json
test/nodes/test_chunker.py
nodes/src/nodes/chunker/IGlobal.py
Outdated
def validateConfig(self):
    """Validate that tiktoken dependency is available (only needed for token strategy)."""
    try:
        from depends import depends

        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
    except Exception as e:
        warning(str(e))
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that validateConfig installs requirements unconditionally and that
# depends() performs installs rather than a pure validation check.
sed -n '45,53p' nodes/src/nodes/chunker/IGlobal.py
sed -n '71,175p' packages/ai/src/ai/common/config.py
sed -n '778,809p' packages/server/engine-lib/rocketlib-python/lib/depends.py

Repository: nihalnihalani/rocketride-server
Length of output: 5536
🏁 Script executed:
#!/bin/bash
# Get the full IGlobal.py file to see imports and complete validateConfig method
cat -n nodes/src/nodes/chunker/IGlobal.py | head -80

Repository: nihalnihalani/rocketride-server
Length of output: 4274
🏁 Script executed:
#!/bin/bash
# Check if Config is imported/available in IGlobal.py
grep -n "^from\|^import\|Config" nodes/src/nodes/chunker/IGlobal.py

Repository: nihalnihalani/rocketride-server
Length of output: 414
🏁 Script executed:
#!/bin/bash
# Verify strategy field in config by checking preconfig in a service definition
# Look for where strategy might be defined
find . -type f -name "*.py" -o -name "*.json" | head -20
rg -i "strategy.*token\|token.*strategy" --type=py --type=json -m 3

Repository: nihalnihalani/rocketride-server
Length of output: 724
Only resolve tiktoken requirements for the token profile, and fail validation if that install fails.
depends(requirements) is an installer, not a cheap availability check. Calling it unconditionally in validateConfig() adds unnecessary side effects for recursive and sentence configs, and the blanket except Exception masks true token setup failures that surface later at runtime instead of during validation.
🔧 Suggested change
 def validateConfig(self):
     """Validate that tiktoken dependency is available (only needed for token strategy)."""
-    try:
-        from depends import depends
-
-        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
-        depends(requirements)
-    except Exception as e:
-        warning(str(e))
+    config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)
+    if config.get('strategy', 'recursive') != 'token':
+        return
+
+    from depends import depends
+
+    requirements = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')
+    depends(requirements)

🧰 Tools
🪛 Ruff (0.15.7)
[warning] 52-52: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IGlobal.py` around lines 45 - 53, The validateConfig
method currently calls depends(requirements) unconditionally and swallows all
exceptions; change it so you only invoke the installer when the node is
configured to use the token strategy (check the config/profile field used by
this class, e.g., self.config.profile == 'token' or similar), and if
depends(...) fails during that branch surface the failure by raising an
exception (or returning a failed validation) instead of calling warning(); keep
other profiles (e.g., 'recursive', 'sentence') free of side effects and do not
call depends for them. Ensure you reference validateConfig and the
depends(requirements) call when making the change so the installer is only
triggered for the token profile and failures fail validation.
chunkId: int = 0

def open(self, obj: Entry):
    """Reset chunk counter for each new object."""
    self.chunkId = 0
Use a per-document chunk index instead of the entry-scoped counter.
chunk_data['metadata'] already contains the chunk's index within its source document, but this code discards it and increments self.chunkId across the whole entry. If writeDocuments() receives multiple documents, the first chunk of the second document will no longer be 0, and the strategy's start_char/end_char metadata is lost before emission.
🔧 Suggested change
-for chunk_data in chunks:
+for chunk_idx, chunk_data in enumerate(chunks):
     # Shallow copy of document, explicit copy of metadata only
     chunk_doc = copy.copy(document)
     if document.metadata is not None:
         chunk_doc.metadata = copy.copy(document.metadata)
     else:
@@
-    chunk_doc.metadata.chunkId = self.chunkId
+    chunk_doc.metadata.chunkId = chunk_idx
     chunk_doc.metadata.parentId = parent_id
+    chunk_doc.metadata.startChar = chunk_data['metadata']['start_char']
+    chunk_doc.metadata.endChar = chunk_data['metadata']['end_char']
+    chunk_doc.metadata.totalChunks = total_chunks
-    self.chunkId += 1
     output_docs.append(chunk_doc)

Also applies to: 76-92
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 43-43: Unused method argument: obj
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IInstance.py` around lines 41 - 45, The code uses a
single entry-scoped counter self.chunkId (defined on IInstance and reset in
open) and overwrites per-chunk metadata, losing per-document chunk indices and
start_char/end_char; update the logic so emitted chunk metadata uses the
per-document index from chunk_data['metadata'] (e.g., metadata.get('chunk') or
metadata['chunk']) instead of incrementing self.chunkId, or if you prefer an
explicit per-document counter, initialize/reset a document-scoped counter in
IInstance.open and use that only for chunks of the current document; adjust the
emission code (the block referenced around lines 76-92) to read and preserve
metadata['start_char'] and metadata['end_char'] and to reference either
metadata['chunk'] or the reset per-document counter rather than self.chunkId.
nodes/src/nodes/chunker/IInstance.py
Outdated
# Shallow copy of document, explicit copy of metadata only
chunk_doc = copy.copy(document)
chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
chunk_doc.page_content = chunk_data['text']

# Update metadata
if chunk_doc.metadata is not None:
    chunk_doc.metadata.chunkId = self.chunkId
    chunk_doc.metadata.parentId = parent_id
else:
    chunk_doc.metadata = DocMetadata(objectId=parent_id, chunkId=self.chunkId, parentId=parent_id)
DocMetadata() cannot be constructed without required fields here.
When document.metadata is None, Line 81 still builds DocMetadata() with no objectId/chunkId, so the real model will raise before the fallback on Line 89 can run. That makes metadata-less inputs fail in production; the current test stub only passes because it silently fills those fields in.
🔧 Suggested change
 for chunk_data in chunks:
     # Shallow copy of document, explicit copy of metadata only
     chunk_doc = copy.copy(document)
-    chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
+    if document.metadata is not None:
+        chunk_doc.metadata = copy.copy(document.metadata)
+    else:
+        chunk_doc.metadata = DocMetadata(
+            objectId=parent_id,
+            chunkId=0,
+            parentId=parent_id,
+        )
     chunk_doc.page_content = chunk_data['text']

     # Update metadata
-    if chunk_doc.metadata is not None:
-        chunk_doc.metadata.chunkId = self.chunkId
-        chunk_doc.metadata.parentId = parent_id
-    else:
-        chunk_doc.metadata = DocMetadata(objectId=parent_id, chunkId=self.chunkId, parentId=parent_id)
+    chunk_doc.metadata.chunkId = self.chunkId
+    chunk_doc.metadata.parentId = parent_id
+ chunk_doc.metadata.parentId = parent_id🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IInstance.py` around lines 79 - 89, The current code
creates DocMetadata() with no required fields when document.metadata is falsy,
causing construction to fail before the later fallback runs; fix the chunk
metadata handling in the IInstance chunking logic by initializing
chunk_doc.metadata to a properly populated DocMetadata when document.metadata is
missing (e.g., DocMetadata(objectId=parent_id, chunkId=self.chunkId,
parentId=parent_id)), otherwise shallow-copy document.metadata and then set
chunk_doc.metadata.chunkId/self.chunkId and parentId=parent_id as before (update
references: chunk_doc, DocMetadata, self.chunkId, parent_id).
Actionable comments posted: 7
♻️ Duplicate comments (1)
nodes/src/nodes/chunker/IGlobal.py (1)
42-42: 🧹 Nitpick | 🔵 Trivial

Use os.path.join for path construction.
Prefer os.path.join() over string concatenation for cross-platform compatibility.

♻️ Suggested fix
-requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
+requirements = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IGlobal.py` at line 42, The path for the requirements file is built via string concatenation into the variable requirements; replace that concatenation with os.path.join to be cross-platform safe: use os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt') (update the assignment to the requirements variable in IGlobal.py) so the path is constructed correctly on Windows and POSIX systems.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/IInstance.py`:
- Line 28: Replace the typing.List import with the built-in list type: remove
the line "from typing import List" and update all type hints in this module
(e.g., any method signatures, return types or attributes in IInstance) that use
"List[...]" to use "list[...]" instead; ensure imports are cleaned up and run a
quick type-check to confirm all references to List are updated.
In `@nodes/src/nodes/search_hybrid/hybrid_search.py`:
- Line 29: Replace deprecated typing generics with built-in generics in
hybrid_search.py: remove Dict and List from the import line (keep Any and
Optional) and update all type hints in the file (e.g., change Dict[str, Any] to
dict[str, Any] and List[str] to list[str]); ensure function signatures and
variable annotations in functions like any that reference List or Dict are
updated accordingly and tests/types still pass.
- Around line 47-53: The _tokenize function accepts None (it checks `if not
text`) but is annotated as `text: str`; update the type hint to `Optional[str]`
and import Optional from typing to reflect allowed None input, e.g., change the
signature of `_tokenize` to accept Optional[str] and ensure callers still pass
strings or None; keep the implementation same and update any docstring if
desired to note None handling.
In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 39-47: The validateConfig method currently swallows all exceptions
from depends(...) by calling warning(str(e)), which will lead to obscure
failures later (e.g., bm25_search raising ModuleNotFoundError); update
validateConfig to log the failure at error level with context (include that
rank_bm25 installation failed and the requirements path) and then re-raise the
original exception (or raise a new RuntimeError) so startup clearly fails;
specifically modify validateConfig (the try/except around depends(requirements))
to use an error-level log and then raise the caught exception so callers like
bm25_search see a clear startup error.
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 114-118: The answer shows "score: N/A" because the reranker stores
the score on each Document as reranked_doc.score (used in reranked_docs) but the
code in IInstance.py reads metadata['hybrid_score']; fix by using the
Document.score property when building context_parts (replace the metadata lookup
with doc.score or a formatted fallback) so context_parts uses the actual
rrf_score from reranked_docs; update the block that constructs context_parts
(where reranked_docs, context_parts and snippet are referenced) to pull score =
doc.score if present.
- Line 28: The file IInstance.py imports typing.List and uses List[Doc] in type
annotations; update to Python 3.10+ built-in generics by removing the
typing.List import and replacing every occurrence of List[Doc] with list[Doc]
(specifically the uses on the lines indicated where List[Doc] appears — e.g., in
the function/class annotations around the symbols that reference Doc); ensure
import statements are cleaned up so only necessary types remain.
In `@test/nodes/test_search_hybrid.py`:
- Around line 646-675: The test currently passes hybrid_score in metadata which
production ignores; update test_structured_answer_output to rely on and assert
the Doc.score values instead of metadata: create docs with score=0.85 and
score=0.72 (remove/ignore metadata={'hybrid_score': ...}), call
inst.writeQuestions as before, then assert the answer_text contains the actual
formatted score strings (e.g., '(score: 0.85' and '(score: 0.72') alongside the
document markers; reference symbols: test_structured_answer_output,
schema_mod.Doc, HybridSearchEngine, inst.writeQuestions, and
mock_instance.writeAnswers.
---
Duplicate comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Line 42: The path for the requirements file is built via string concatenation
into the variable requirements; replace that concatenation with os.path.join to
be cross-platform safe—use
os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')
(update the assignment to the requirements variable in IGlobal.py) so the path
is constructed correctly on Windows and POSIX systems.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: aa90780e-cf90-48d9-bf49-85fd414fd033
📒 Files selected for processing (10)
nodes/src/nodes/chunker/IGlobal.py
nodes/src/nodes/chunker/IInstance.py
nodes/src/nodes/chunker/services.json
nodes/src/nodes/search_hybrid/IGlobal.py
nodes/src/nodes/search_hybrid/IInstance.py
nodes/src/nodes/search_hybrid/__init__.py
nodes/src/nodes/search_hybrid/hybrid_search.py
nodes/src/nodes/search_hybrid/requirements.txt
nodes/src/nodes/search_hybrid/services.json
test/nodes/test_search_hybrid.py
def validateConfig(self):
    """Validate that the rank_bm25 dependency is available."""
    try:
        from depends import depends

        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
    except Exception as e:
        warning(str(e))
Suppressing dependency installation errors may cause confusing downstream failures.
validateConfig() catches all exceptions and only logs a warning. If rank_bm25 fails to install, bm25_search() will later fail with an obscure ModuleNotFoundError rather than a clear startup error. Consider re-raising after logging, or at minimum logging at error level with more context.
Proposed improvement
 def validateConfig(self):
     """Validate that the rank_bm25 dependency is available."""
     try:
         from depends import depends
         requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
         depends(requirements)
-    except Exception as e:
-        warning(str(e))
+    except Exception as e:
+        warning(f'Failed to install rank_bm25 dependency: {e}')
+        raise

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def validateConfig(self):
    """Validate that the rank_bm25 dependency is available."""
    try:
        from depends import depends
        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
    except Exception as e:
        warning(f'Failed to install rank_bm25 dependency: {e}')
        raise
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 46-46: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 39 - 47, The
validateConfig method currently swallows all exceptions from depends(...) by
calling warning(str(e)), which will lead to obscure failures later (e.g.,
bm25_search raising ModuleNotFoundError); update validateConfig to log the
failure at error level with context (include that rank_bm25 installation failed
and the requirements path) and then re-raise the original exception (or raise a
new RuntimeError) so startup clearly fails; specifically modify validateConfig
(the try/except around depends(requirements)) to use an error-level log and then
raise the caught exception so callers like bm25_search see a clear startup
error.
Actionable comments posted: 3
♻️ Duplicate comments (2)
nodes/src/nodes/chunker/IGlobal.py (1)
39-45: ⚠️ Potential issue | 🟠 Major

Only probe tiktoken for token configs, and let failures fail validation.
Line 43 still runs depends(requirements) for every profile, and Line 45 downgrades real token setup failures to warnings. That makes recursive/sentence validation side-effectful and defers broken token configs until runtime.

🔧 Suggested fix
-def validateConfig(self):
-    """Validate that tiktoken dependency is available (only needed for token strategy)."""
-    try:
-        from depends import depends
-
-        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
-        depends(requirements)
-    except Exception as e:  # noqa: BLE001 - intentional broad catch for dependency probing
-        warning(str(e))
+def validateConfig(self, syntaxOnly: bool) -> None:
+    """Validate token-chunker dependencies when that strategy is selected."""
+    if syntaxOnly:
+        return
+
+    config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)
+    if config.get('strategy', 'recursive') != 'token':
+        return
+
+    from depends import depends
+
+    requirements = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')
+    depends(requirements)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IGlobal.py` around lines 39 - 45, The current code unconditionally calls depends(requirements) and catches all exceptions, turning missing/invalid token setup into a warning; instead, only probe for the tiktoken dependency when validating token-related configs and surface failures as errors. Replace the unconditional depends(requirements) call with a targeted import/check for tiktoken inside the token validation path (where token configs are validated), and on failure raise a ValidationError (or re-raise the exception) rather than calling warning(str(e)); keep depends() out of generic profile probing and limit use to explicit token config checks (referencing depends, requirements, and warning in the diff).

nodes/src/nodes/chunker/chunker_strategies.py (1)

211-247: ⚠️ Potential issue | 🟠 Major

Use real source spans when enforcing sentence chunk_size and chunk_overlap.
Lines 212, 238, and 255 still treat every sentence boundary as a single space, but Lines 217 and 266 now slice the original text with its original whitespace. With \n\n, tabs, or repeated spaces between sentences, a chunk can exceed chunk_size and the retained overlap can exceed chunk_overlap even though the bookkeeping says it fits. Compute those lengths from the actual sentence_start_map spans instead of synthetic +1 separators.

Also applies to: 253-258
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 211 - 247, The bookkeeping currently assumes a single-space separator between sentences (using +1) which mismatches actual spans and can let chunks/overlaps exceed chunk_size/chunk_overlap; replace synthetic +1 logic with real spans from sentence_start_map: compute sentence_len and gaps using sentence_start_map (e.g., gap = sentence_start_map[i] - (prev_start + len(prev_sentence))) and derive current_len as the span from sentence_positions[0] start to the end of the candidate sentence (end_char - start_char), use that span when checking current_len + new_sentence_span > self.chunk_size (and similarly when computing candidate overlap_len for the overlap loop against self.chunk_overlap), and update current_len, sentence_positions, and overlap_len consistently from those real start/end spans rather than adding 1 per boundary (apply to the checks and updates around current_len, sentence_len, candidate, and overlap_len involving current_sentences, sentence_positions, and sentence_start_map).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 129-145: The current logic unconditionally shifts start_char back
by self.chunk_overlap which can misalign offsets when the previous chunk
contributed fewer characters or separators were dropped; instead, compute the
actual overlap length from the previous chunk slice used to build chunk_text
(e.g., overlap_text = prev[-self.chunk_overlap:]), locate that overlap_text
immediately before the found start (using text.rfind or similar constrained
search), set overlap_len to the number of characters actually matched in the
source (0 if not found), and then adjust start_char by that actual overlap_len
(not the configured self.chunk_overlap); update end_char = start_char +
len(chunk_text) as before so that text[start_char:end_char] matches chunk_text.
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Line 37: The override of validateConfig currently uses def
validateConfig(self) but must match the engine signature validateConfig(self,
syntaxOnly: bool); update the IGlobal.validateConfig method to accept the
syntaxOnly: bool parameter, preserve existing logic, and if the method calls the
parent implementation or other helpers, pass the syntaxOnly flag through (e.g.,
when invoking super().validateConfig or other validators) so calling the
framework with a syntaxOnly argument no longer raises a TypeError.
In `@test/nodes/test_chunker.py`:
- Around line 344-369: The test currently only counts decode() calls but misses
large-prefix decodes; modify TrackingEncoder.decode to record the length of the
tokens argument (e.g., append len(tokens) to a new list or update a
max_decoded_len variable) so you can assert no decode call ever receives a slice
larger than the expected bound (use chunker.chunk_size or computed step =
chunker.chunk_size - chunker.chunk_overlap); after calling chunker.chunk assert
max_decoded_len <= chunker.chunk_size (or <= step if that's the intended bound)
in addition to the existing call count check.
---
Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 211-247: The bookkeeping currently assumes a single-space
separator between sentences (using +1) which mismatches actual spans and can let
chunks/overlaps exceed chunk_size/chunk_overlap; replace synthetic +1 logic with
real spans from sentence_start_map: compute sentence_len and gaps using
sentence_start_map (e.g., gap = sentence_start_map[i] - (prev_start +
len(prev_sentence))) and derive current_len as the span from
sentence_positions[0] start to the end of the candidate sentence (end_char -
start_char), use that span when checking current_len + new_sentence_span >
self.chunk_size (and similarly when computing candidate overlap_len for the
overlap loop against self.chunk_overlap), and update current_len,
sentence_positions, and overlap_len consistently from those real start/end spans
rather than adding 1 per boundary (apply to the checks and updates around
current_len, sentence_len, candidate, and overlap_len involving
current_sentences, sentence_positions, and sentence_start_map).
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 39-45: The current code unconditionally calls
depends(requirements) and catches all exceptions, turning missing/invalid token
setup into a warning; instead, only probe for the tiktoken dependency when
validating token-related configs and surface failures as errors. Replace the
unconditional depends(requirements) call with a targeted import/check for
tiktoken inside the token validation path (where token configs are validated),
and on failure raise a ValidationError (or re-raise the exception) rather than
calling warning(str(e)); keep depends() out of generic profile probing and limit
use to explicit token config checks (referencing depends, requirements, and
warning in the diff).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: aa447cd1-e070-4408-9c9b-919e10d019ea
📒 Files selected for processing (4)
nodes/src/nodes/chunker/IGlobal.py
nodes/src/nodes/chunker/IInstance.py
nodes/src/nodes/chunker/chunker_strategies.py
test/nodes/test_chunker.py
```python
class IGlobal(IGlobalBase):
    strategy: ChunkingStrategy | None = None

    def validateConfig(self):
```
Match the engine's validateConfig signature.
The lifecycle hook is `validateConfig(self, syntaxOnly: bool)`. Overriding it as `def validateConfig(self)` means the first framework call with `syntaxOnly` will raise `TypeError` before validation runs.
🔧 Suggested fix
```diff
- def validateConfig(self):
+ def validateConfig(self, syntaxOnly: bool) -> None:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IGlobal.py` at line 37, The override of
validateConfig currently uses def validateConfig(self) but must match the engine
signature validateConfig(self, syntaxOnly: bool); update the
IGlobal.validateConfig method to accept the syntaxOnly: bool parameter, preserve
existing logic, and if the method calls the parent implementation or other
helpers, pass the syntaxOnly flag through (e.g., when invoking
super().validateConfig or other validators) so calling the framework with a
syntaxOnly argument no longer raises a TypeError.
```python
def test_start_char_incremental_tracking(self):
    """start_char should be computed incrementally (not via O(n^2) prefix decode)."""
    chunker = TokenChunker(chunk_size=10, chunk_overlap=0)

    # Mock encoder where each character is a token and decode returns exact chars
    call_counts = {'decode': 0}

    class TrackingEncoder:
        def encode(self, text):
            return list(range(len(text)))

        def decode(self, tokens, **kwargs):
            call_counts['decode'] += 1
            return 'x' * len(tokens)

    chunker._encoder = TrackingEncoder()
    text = 'A' * 50  # 50 tokens -> 5 chunks of 10
    chunks = chunker.chunk(text)
    assert len(chunks) == 5

    # With O(n^2) approach, decode would be called 5 (chunks) + 4 (prefixes) = 9 times
    # With incremental approach, decode is called 5 (chunks) + 4 (steps) = 9 times
    # but no prefix decode calls scale with chunk index.
    # Key check: decode should NOT be called with tokens[:N] for large N.
    # We verify the count is bounded linearly: at most 2*num_chunks calls
    assert call_counts['decode'] <= 2 * len(chunks), f'decode called {call_counts["decode"]} times for {len(chunks)} chunks; expected at most {2 * len(chunks)} (linear)'
```
🧹 Nitpick | 🔵 Trivial
This test doesn't detect the quadratic prefix-decode regression.
Line 369 only counts decode() calls. An implementation that repeatedly does decode(tokens[:start]) would make the same number of calls, just with much larger slices, so this assertion still passes. Record the token-window lengths (or the largest decoded slice) and assert they stay bounded by chunk_size/step instead.
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 352-352: Missing return type annotation for private function encode
(ANN202)
[warning] 355-355: Missing return type annotation for private function decode
(ANN202)
[warning] 355-355: Missing type annotation for **kwargs
(ANN003)
[warning] 355-355: Unused method argument: kwargs
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/nodes/test_chunker.py` around lines 344 - 369, The test currently only
counts decode() calls but misses large-prefix decodes; modify
TrackingEncoder.decode to record the length of the tokens argument (e.g., append
len(tokens) to a new list or update a max_decoded_len variable) so you can
assert no decode call ever receives a slice larger than the expected bound (use
chunker.chunk_size or computed step = chunker.chunk_size -
chunker.chunk_overlap); after calling chunker.chunk assert max_decoded_len <=
chunker.chunk_size (or <= step if that's the intended bound) in addition to the
existing call count check.
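The bounded-slice assertion suggested above can be sketched standalone. The toy chunk loop below stands in for `TokenChunker.chunk` (an assumption — the real loop lives in the chunker), but it shows the key idea: track the largest slice handed to `decode()` so a quadratic prefix decode fails the test even though the call count stays the same.

```python
# Sketch of the suggested regression check: record the largest token slice
# passed to decode(). A quadratic implementation calling decode(tokens[:N])
# with growing N makes the same number of calls, but with ever-larger slices,
# so asserting on max_decoded_len catches it where a call count cannot.
class TrackingEncoder:
    def __init__(self):
        self.max_decoded_len = 0

    def encode(self, text):
        # One token per character, mirroring the test's mock encoder.
        return list(range(len(text)))

    def decode(self, tokens):
        self.max_decoded_len = max(self.max_decoded_len, len(tokens))
        return 'x' * len(tokens)

enc = TrackingEncoder()
chunk_size = 10
step = chunk_size  # chunk_overlap == 0
tokens = enc.encode('A' * 50)

# Toy stand-in for the chunk loop: decode only fixed-size windows.
chunks = [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```

With this in place, the test can assert `enc.max_decoded_len <= chunk_size` after chunking, which is the bound the review comment asks for.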
Actionable comments posted: 2
♻️ Duplicate comments (1)
nodes/src/nodes/chunker/chunker_strategies.py (1)
129-144: ⚠️ Potential issue | 🟠 Major — Overlap offset derived from config, not actual overlap length.
When the previous chunk is shorter than `chunk_overlap`, `overlap_text` will have fewer characters than `chunk_overlap`. However, `start_char` is still shifted back by the full `self.chunk_overlap` (line 142), causing `text[start_char:end_char]` to not match `chunk_text`. Additionally, separators consumed during splitting (e.g., `". "`) appear in the original text between chunks but not in the reconstructed overlap. This means the text slice and the constructed `chunk_text` will differ.
🔧 Proposed fix
```diff
 for i, raw in enumerate(raw_chunks):
     # Compute overlap prefix from the previous chunk
     if i > 0 and self.chunk_overlap > 0:
         prev = raw_chunks[i - 1]
-        overlap_text = prev[-self.chunk_overlap :]
-        chunk_text = overlap_text + raw
-    else:
-        chunk_text = raw
+        actual_overlap_len = min(self.chunk_overlap, len(prev))
+    else:
+        actual_overlap_len = 0

     # Find start_char position in original text
-    start_char = text.find(raw, search_start)
-    if start_char == -1:
-        start_char = search_start
-    # Adjust for overlap prefix
-    if i > 0 and self.chunk_overlap > 0:
-        overlap_start = max(0, start_char - self.chunk_overlap)
-        start_char = overlap_start
-    end_char = start_char + len(chunk_text)
+    raw_start = text.find(raw, search_start)
+    if raw_start == -1:
+        raw_start = search_start
+    raw_end = raw_start + len(raw)
+
+    # Derive overlap from actual source text, not reconstructed chunk
+    start_char = max(0, raw_start - actual_overlap_len)
+    chunk_text = text[start_char:raw_end]
+    end_char = raw_end

     result.append(
         {
             'text': chunk_text,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 129 - 144, The overlap adjustment uses the configured self.chunk_overlap instead of the actual overlap string length and thus misaligns slices; in the chunk creation logic (symbols: raw_chunks, overlap_text, chunk_text, start_char, search_start, self.chunk_overlap) compute the actual_overlap = len(overlap_text) (or locate overlap_text in the prior region of text) and subtract actual_overlap from start_char instead of self.chunk_overlap, and also ensure any separators consumed by splitting are considered when locating overlap_text so that text[start_char:end_char] matches chunk_text; update the overlap-start calculation and keep end_char = start_char + len(chunk_text).
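As a standalone illustration of the fix described above (a hypothetical helper, not the node's actual code), deriving the chunk text as a slice of the source makes the `text[start_char:end_char] == chunk_text` invariant hold by construction, even when the previous chunk is shorter than the configured overlap:

```python
# Minimal sketch, assuming raw_chunks are substrings of text in order:
# clamp the overlap to what the previous chunk actually contains, then take
# the chunk as a contiguous slice of the source so offsets can never drift.
def chunk_with_overlap(text, raw_chunks, chunk_overlap):
    result, search_start = [], 0
    for i, raw in enumerate(raw_chunks):
        raw_start = text.find(raw, search_start)
        if raw_start == -1:
            raw_start = search_start
        raw_end = raw_start + len(raw)
        # Overlap is bounded by the previous chunk's real length.
        overlap = min(chunk_overlap, len(raw_chunks[i - 1])) if i > 0 else 0
        start_char = max(0, raw_start - overlap)
        result.append({'text': text[start_char:raw_end],
                       'start_char': start_char,
                       'end_char': raw_end})
        search_start = raw_end
    return result

text = 'Hi. Second sentence here.'
chunks = chunk_with_overlap(text, ['Hi.', 'Second sentence here.'], 10)
```

Here the first chunk is only 3 characters, so the second chunk's overlap is clamped to 3 rather than the configured 10, and every recorded span still matches its text exactly.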
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Line 29: Replace the deprecated typing.List usage: remove List from the import
statement and update all type annotations that use List[str] to use the built-in
list[str] (e.g., change any function signatures in chunker_strategies.py that
currently use List[str] to list[str]); keep Optional if still used (so import
only Optional from typing or remove typing import entirely if unused). Ensure
the import line becomes "from typing import Optional" (or remove it) and update
the annotations in the functions that referenced List[str].
In `@test/nodes/test_chunker.py`:
- Around line 608-618: The test method
test_raises_runtime_error_when_strategy_is_none unpacks RC from
self._load_iinstance() but never uses it; update the tuple unpacking returned by
_load_iinstance() to omit RC (e.g., assign only saved, names, IInstance) or
replace RC with an underscore to indicate an unused value, ensuring the call
remains _load_iinstance() and other symbols (Doc, _make_instance,
inst.writeDocuments) are unchanged.
---
Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 129-144: The overlap adjustment uses the configured
self.chunk_overlap instead of the actual overlap string length and thus
misaligns slices; in the chunk creation logic (symbols: raw_chunks,
overlap_text, chunk_text, start_char, search_start, self.chunk_overlap) compute
the actual_overlap = len(overlap_text) (or locate overlap_text in the prior
region of text) and subtract actual_overlap from start_char instead of
self.chunk_overlap, and also ensure any separators consumed by splitting are
considered when locating overlap_text so that text[start_char:end_char] matches
chunk_text; update the overlap-start calculation and keep end_char = start_char
+ len(chunk_text).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4e58c6b5-bae4-4f10-b249-976393025ff9
📒 Files selected for processing (2)
nodes/src/nodes/chunker/chunker_strategies.py
test/nodes/test_chunker.py
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unreachable else branch in IInstance.writeDocuments (metadata is always non-None after shallow-copy-or-create on line 81). Remove unused _STRATEGY_MAP dict in IGlobal. Fix services.json output description that incorrectly referenced total_chunks metadata which is never set on chunk documents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion Add early validation for chunk_size > 0, chunk_overlap >= 0, and chunk_overlap < chunk_size in beginGlobal() so invalid config is caught before strategy instantiation. Also add noqa BLE001 annotation for the intentional broad exception catch in validateConfig(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of joining sentences with ' '.join() and computing end_char from the joined string length, derive start_char/end_char from the sentence position map and slice the original text. This preserves the actual whitespace between sentences (e.g., double spaces, newlines). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy chunk_index, start_char, end_char, and total_chunks from the strategy's per-chunk metadata dict onto each emitted Doc's metadata so downstream nodes can locate chunks within the source document. Update the writeDocuments docstring to list all metadata keys. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change the no-overlap position assertion to compare start_char against the previous chunk's end_char (not start_char) so it actually verifies non-overlapping spans. Rewrite test_iglobal_creates_strategy to instantiate IGlobal, mock the endpoint and glb with strategy='sentence', call beginGlobal(), and assert the resulting strategy type and parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nting

Replace the current_len tracker (which assumed 1-char separators) with position-based span computation that accounts for multi-character whitespace between sentences. After the whitespace-preservation change (text[start_char:end_char] slicing), current_len was underestimating the real chunk width whenever the original separator was longer than one character (e.g. \n\n), causing chunks to silently exceed chunk_size. The overlap budget is now also computed from actual text spans so carried-over sentences fit within the configured overlap window.

Add two regression tests:
- overlap + repeated sentences: verify spans match text and respect size
- multi-char whitespace: verify chunks are split, not all swallowed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from fab5892 to 4181638
Actionable comments posted: 7
♻️ Duplicate comments (5)
nodes/src/nodes/chunker/IGlobal.py (1)
37-45: ⚠️ Potential issue | 🟠 Major — Only install `tiktoken` for the token profile, and let validation fail if that install breaks.
`depends(requirements)` is an installer, so this adds side effects to `recursive` and `sentence` configs as well. The blanket catch then downgrades token setup failures to a warning, so the first hard failure moves to `TokenChunker._get_encoder()` during document processing instead of config validation.
Verify against the real config and dependency helpers; you should see `validateConfig()` always calling `depends(...)`, the chunker config exposing a bare `strategy` value, and `TokenChunker` importing `tiktoken` lazily at runtime.
```bash
#!/bin/bash
set -e
sed -n '37,45p' nodes/src/nodes/chunker/IGlobal.py
sed -n '29,50p' nodes/src/nodes/chunker/services.json
sed -n '71,171p' packages/ai/src/ai/common/config.py
sed -n '778,824p' packages/server/engine-lib/rocketlib-python/lib/depends.py
sed -n '298,304p' nodes/src/nodes/chunker/chunker_strategies.py
```
🔧 Suggested change
```diff
 def validateConfig(self):
     """Validate that tiktoken dependency is available (only needed for token strategy)."""
-    try:
-        from depends import depends
-
-        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
-        depends(requirements)
-    except Exception as e:  # noqa: BLE001 - intentional broad catch for dependency probing
-        warning(str(e))
+    config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)
+    if config.get('strategy', 'recursive') != 'token':
+        return
+
+    from depends import depends
+
+    requirements = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')
+    depends(requirements)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IGlobal.py` around lines 37 - 45, validateConfig currently always calls depends(requirements) which installs tiktoken for all strategies and swallows errors; change validateConfig to check the chunker configuration's strategy value and only call depends(requirements) when strategy == "token", and if depends throws let the exception propagate (do not convert to warning) so failures surface during config validation; reference the validateConfig method in IGlobal.py, the chunker config's "strategy" field, the depends(...) installer, and TokenChunker._get_encoder (which depends on tiktoken) when making this change.
nodes/src/nodes/chunker/IInstance.py (1)
77-95: ⚠️ Potential issue | 🔴 Critical — Populate the fallback `DocMetadata` up front, and stop using the entry-scoped counter for `chunkId`.
`DocMetadata()` without `objectId` and `chunkId` fails against the real schema, so metadata-less inputs will raise before the later assignments run. Using `self.chunkId` also means the first chunk of the second source document no longer starts at `0`, even though `chunkId` is defined as the position within that source document. The current test stub only passes because it auto-fills those required fields.
Verify against the real schema and the current test stub; you should see `IInstance` still calling `DocMetadata()` with no required fields, the production model requiring `objectId`/`chunkId`, and the test double masking that mismatch.
```bash
#!/bin/bash
set -e
sed -n '77,95p' nodes/src/nodes/chunker/IInstance.py
sed -n '49,130p' packages/client-python/src/rocketride/schema/doc_metadata.py
sed -n '494,502p' test/nodes/test_chunker.py
```
🔧 Suggested change
```diff
 # Build output documents
 output_docs: List[Doc] = []
-for chunk_data in chunks:
+for chunk_idx, chunk_data in enumerate(chunks):
     # Shallow copy of document, explicit copy of metadata only
     chunk_doc = copy.copy(document)
-    chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
+    strategy_meta = chunk_data.get('metadata', {})
+    if document.metadata is not None:
+        chunk_doc.metadata = copy.copy(document.metadata)
+    else:
+        chunk_doc.metadata = DocMetadata(objectId=parent_id, chunkId=chunk_idx)
     chunk_doc.page_content = chunk_data['text']

     # Update metadata (always non-None after the copy/create above)
-    chunk_doc.metadata.chunkId = self.chunkId
+    chunk_doc.metadata.chunkId = strategy_meta.get('chunk_index', chunk_idx)
     chunk_doc.metadata.parentId = parent_id

     # Propagate strategy metadata (chunk_index, start_char, end_char)
-    strategy_meta = chunk_data.get('metadata', {})
     chunk_doc.metadata.chunk_index = strategy_meta.get('chunk_index', 0)
     chunk_doc.metadata.start_char = strategy_meta.get('start_char', 0)
     chunk_doc.metadata.end_char = strategy_meta.get('end_char', 0)
     chunk_doc.metadata.total_chunks = total_chunks
-    self.chunkId += 1
     output_docs.append(chunk_doc)
```
🤖 Prompt for AI Agents
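The per-document numbering in the suggested change above can be illustrated in isolation. Plain dicts stand in for `DocMetadata` here (an assumption for the sketch), and `build_chunk_metadata` is a hypothetical helper, not the node's real method:

```python
# Sketch: number chunks per source document (restarting at 0 for each
# document) and always populate the schema-required fields up front,
# rather than incrementing a counter shared across all input documents.
def build_chunk_metadata(parent_id, chunks):
    out = []
    for chunk_idx, chunk in enumerate(chunks):
        meta = dict(chunk.get('metadata', {}))
        meta.update({'objectId': parent_id,   # required field, never omitted
                     'chunkId': chunk_idx,    # position within THIS document
                     'parentId': parent_id})
        out.append(meta)
    return out

first = build_chunk_metadata('doc-1', [{'metadata': {}}, {'metadata': {}}])
second = build_chunk_metadata('doc-2', [{'metadata': {}}])
```

With an instance-level counter instead, `second[0]['chunkId']` would come out as `2`, which is the bug the review comment describes.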
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IInstance.py` around lines 77 - 95, The code creates a fallback DocMetadata() with no required fields and uses self.chunkId (a global counter) for chunkId, causing schema validation failures and incorrect chunk numbering across source documents; update the chunking in IInstance so that when document.metadata is falsy you instantiate DocMetadata with required fields (e.g., objectId populated from the source document identifier) before assigning any chunk fields, and compute chunkId as the position within the current source (reset a local per-document counter or use the chunk index from strategy_meta) instead of self.chunkId; adjust references to chunk_doc.metadata.chunkId, chunk_doc.metadata.parentId, and the increment logic so chunkId starts at 0 for each source document and schema-required fields are always present.
nodes/src/nodes/chunker/chunker_strategies.py (1)
127-160: ⚠️ Potential issue | 🟠 Major — Derive recursive overlap from the source span, not `prev[-overlap:] + raw`.
When a separator is dropped between raw chunks, that concatenation no longer matches any contiguous slice of `text`. `start_char` and `end_char` then point at the wrong substring, and downstream code sees chunk text that does not match the recorded offsets.
🔧 Suggested change
```diff
 result = []
 chunk_index = 0
 search_start = 0
+prev_raw_start: int | None = None

 for i, raw in enumerate(raw_chunks):
-    # Compute overlap prefix from the previous chunk
-    if i > 0 and self.chunk_overlap > 0:
-        prev = raw_chunks[i - 1]
-        overlap_text = prev[-self.chunk_overlap :]
-        chunk_text = overlap_text + raw
-    else:
-        chunk_text = raw
-
-    # Find start_char position in original text
-    start_char = text.find(raw, search_start)
-    if start_char == -1:
-        start_char = search_start
-    # Adjust for overlap prefix
-    if i > 0 and self.chunk_overlap > 0:
-        overlap_start = max(0, start_char - self.chunk_overlap)
-        start_char = overlap_start
-    end_char = start_char + len(chunk_text)
+    raw_start = text.find(raw, search_start)
+    if raw_start == -1:
+        raw_start = search_start
+    raw_end = raw_start + len(raw)
+
+    start_char = raw_start
+    if i > 0 and self.chunk_overlap > 0 and prev_raw_start is not None:
+        overlap_text = raw_chunks[i - 1][-self.chunk_overlap :]
+        overlap_start = text.rfind(overlap_text, prev_raw_start, raw_start)
+        if overlap_start != -1:
+            start_char = overlap_start
+
+    chunk_text = text[start_char:raw_end]
+    end_char = raw_end
@@
-    raw_start = text.find(raw, search_start)
-    if raw_start >= 0:
-        search_start = raw_start + len(raw)
+    search_start = raw_end
+    prev_raw_start = raw_start
```
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 127 - 160, The current code builds chunk_text by concatenating prev[-chunk_overlap:] + raw which can create strings that don't correspond to a contiguous slice of the original text when separators were dropped; instead compute the overlap by slicing the original text using the previously computed offsets: use the previous chunk's recorded end_char (from result[-1]['metadata']['end_char']) to derive the overlap_start = max(prev_end_char - chunk_overlap, prev_start_char) and set overlap_text = text[overlap_start:prev_end_char]; then build chunk_text = overlap_text + raw and compute start_char = overlap_start (or the found raw start if no overlap), update end_char = start_char + len(chunk_text), and ensure search_start is advanced using the actual raw's start position found in text so start_char/end_char always map to contiguous spans in text (update uses of raw_chunks, prev, chunk_overlap, start_char, end_char, search_start and result accordingly).
nodes/src/nodes/search_hybrid/IGlobal.py (1)
39-47: ⚠️ Potential issue | 🟠 Major — Fail fast when dependency validation fails.
Line 47 turns a missing/failed `rank_bm25` install into a warning and lets the node boot without its BM25 backend. The first BM25 call then fails much later with a harder-to-diagnose runtime error.
Suggested fix
```diff
 def validateConfig(self):
     """Validate that the rank_bm25 dependency is available."""
+    requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
     try:
         from depends import depends

-        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
         depends(requirements)
-    except Exception as e:
-        warning(str(e))
+    except Exception as exc:
+        warning(f'Failed to validate rank_bm25 dependency from {requirements}: {exc}')
+        raise
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 39 - 47, The validateConfig method currently swallows dependency installation failures by calling warning(str(e)); change it to fail fast: when depends(requirements) raises, log/record the error and re-raise a RuntimeError (or call sys.exit(1)) so the node does not boot without BM25; update validateConfig to propagate the exception instead of only calling warning, referencing the validateConfig function, the local requirements variable, and the depends import to locate the change.
nodes/src/nodes/search_hybrid/IInstance.py (1)
114-118: ⚠️ Potential issue | 🟠 Major — Render scores from `doc.score`, not missing metadata.
Lines 99-101 store the fused score on `doc.score`, but line 116 reads `metadata['hybrid_score']`, which this node never populates. That makes every structured answer print `N/A` unless a test fixture happens to pre-fill that metadata key.
Suggested fix
```diff
 if reranked_docs and self.instance.hasListener('answers'):
     context_parts = []
     for i, doc in enumerate(reranked_docs):
-        score = doc.metadata.get('hybrid_score', 'N/A') if doc.metadata else 'N/A'
+        score = f'{doc.score:.4f}' if doc.score is not None else 'N/A'
         snippet = (doc.page_content or '')[:500]
         context_parts.append(f'[Document {i + 1}] (score: {score})\n{snippet}')
```
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/search_hybrid/IInstance.py` around lines 114 - 118, The context-building loop is reading a non-existent metadata key 'hybrid_score' so scores show as "N/A"; update the loop that constructs context_parts (the for i, doc in enumerate(reranked_docs) block in IInstance) to use doc.score (with a safe fallback like 'N/A' or 0 when None) instead of doc.metadata.get('hybrid_score'), and keep the existing snippet extraction (snippet = (doc.page_content or '')[:500]) unchanged so rendered answers display the actual fused score stored on doc.score.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 47-91: The lifecycle hooks beginGlobal and endGlobal are missing
PEP 257 docstrings; add concise triple-single-quoted ('''...''') docstrings to
both methods: beginGlobal should have a one-line summary plus a short second
line describing that it loads node config and initializes the chunking strategy
(mention the strategy selection: 'token', 'sentence', default 'recursive') and
note it raises ValueError for bad chunk sizes; endGlobal should have a one-line
summary saying it releases/cleans up the strategy (sets self.strategy to None).
Use single quotes for the docstrings to match project style and ensure
ruff/formatting passes for Python 3.10+.
In `@nodes/src/nodes/search_hybrid/hybrid_search.py`:
- Around line 127-141: Validate the RRF parameter `k` (from `rrf_k`/`k`) before
entering the RRF scoring loop in the hybrid search logic: ensure `k` is a
non-negative integer (e.g., k >= 0) and raise a clear ValueError if it's
negative or otherwise invalid so you never compute rrf_score = weight * (1.0 /
(k + rank + 1)) with a zero denominator; place this check immediately before the
loop that iterates over `result_lists` (the block that references `weights`,
`result_lists`, `id_key`, and computes `rrf_score`).
- Around line 186-194: The code silently ignores misaligned vector_scores by
only using them when len(vector_scores) == len(documents); update the block
around vector_scores/documents handling (the vector_results construction using
vector_scores, scored, and doc_copy['vector_score']) to validate lengths and
raise a clear exception (e.g., ValueError) if vector_scores is not None but its
length != len(documents) so upstream callers fail fast instead of falling back
to BM25-only; keep the successful path the same (attach float(vector_scores[i])
to each document and sort by 'vector_score') and ensure the raised error message
references vector_scores and documents for easier debugging.
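The fail-fast length check from the entry above can be sketched like this (hypothetical helper name and dict-shaped documents — the real node works on its own document type):

```python
# Sketch: reject a misaligned vector_scores list instead of silently
# falling back to BM25-only ranking.
def attach_vector_scores(documents, vector_scores):
    if vector_scores is not None and len(vector_scores) != len(documents):
        raise ValueError(
            f'vector_scores has {len(vector_scores)} entries '
            f'for {len(documents)} documents'
        )
    scored = []
    for i, doc in enumerate(documents):
        doc_copy = dict(doc)
        if vector_scores is not None:
            doc_copy['vector_score'] = float(vector_scores[i])
        scored.append(doc_copy)
    # Highest vector score first, as in the successful path.
    return sorted(scored, key=lambda d: d.get('vector_score', 0.0), reverse=True)

docs = [{'id': 'a'}, {'id': 'b'}]
ranked = attach_vector_scores(docs, [0.2, 0.9])
```

A caller passing one score for two documents now gets a `ValueError` naming both lengths, rather than a quietly degraded BM25-only result.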
In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 34-72: Add PEP 257 docstrings to the public lifecycle surface: add
a module/class docstring for IGlobal and add triple-quoted single-quote
docstrings to the IGlobal class and its public lifecycle methods beginGlobal()
and endGlobal() that briefly describe the purpose, when they are called, key
parameters/attributes used (e.g., engine, top_k, rrf_k, alpha) and their effects
(e.g., engine creation in beginGlobal, engine teardown in endGlobal); ensure
docstrings follow the project's style (single-quoted triple quotes) and are
concise per PEP 257 for IGlobal, beginGlobal, and endGlobal.
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 97-101: The reranked_doc only copies result.get('rrf_score') into
reranked_doc.score, so when HybridSearchEngine.search() returns bm25_score
(e.g., alpha == 0.0) the emitted Doc.score remains the stale upstream vector
value; update the logic that builds reranked_doc (the block that reads result
and sets rrf_score) to also check for and use result.get('bm25_score') (or the
active scoring key returned by HybridSearchEngine.search()) when rrf_score is
None, assigning that score to reranked_doc.score so emitted Doc.score reflects
the actual active ranking score.
- Around line 120-122: The code constructs an Answer via Answer() and calls
ans.setAnswer(...), but then passes the bare Answer to writeAnswers which
expects a List[Answer]; update the call to pass a list containing the Answer
(e.g., [ans]) so writeAnswers receives a List[Answer] and matches the rocketlib
contract; adjust any surrounding logic that assumes a single Answer return
accordingly and ensure imports/type hints align with List[Answer] if present.
In `@test/nodes/test_search_hybrid.py`:
- Around line 78-87: Install and remove the BM25 stub using a pytest
module-scoped fixture rather than calling _install_bm25_stub() at import time:
create a fixture (scope="module") that calls _install_bm25_stub() during setup
and ensures _restore_modules() is invoked in its teardown, so the stub in
sys.modules (names in _STUB_MODULES) is present only for this module's tests and
is restored afterward to avoid leaking the fake rank_bm25 into other tests.
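The setup/teardown pairing the entry above asks for can be sketched with a context manager (the `_install_bm25_stub`/`_restore_modules` bodies below are hypothetical stand-ins for the test module's helpers; with pytest available, the same two calls would wrap a `@pytest.fixture(scope='module', autouse=True)` generator):

```python
# Sketch: scope the fake rank_bm25 module to a block of tests instead of
# installing it at import time, and restore whatever was there before.
import sys
import types
from contextlib import contextmanager

_STUB_MODULES = ('rank_bm25',)
_saved = {}

def _install_bm25_stub():
    for name in _STUB_MODULES:
        _saved[name] = sys.modules.get(name)   # remember the real module, if any
        sys.modules[name] = types.ModuleType(name)

def _restore_modules():
    for name in _STUB_MODULES:
        if _saved.get(name) is None:
            sys.modules.pop(name, None)        # stub was the only entry: drop it
        else:
            sys.modules[name] = _saved[name]   # put the real module back

@contextmanager
def bm25_stub():
    _install_bm25_stub()
    try:
        yield
    finally:
        _restore_modules()

with bm25_stub():
    stub_visible = 'rank_bm25' in sys.modules
```

The teardown runs even if a test inside the block fails, so the fake module never leaks into other test modules.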
---
Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 127-160: The current code builds chunk_text by concatenating
prev[-chunk_overlap:] + raw which can create strings that don't correspond to a
contiguous slice of the original text when separators were dropped; instead
compute the overlap by slicing the original text using the previously computed
offsets: use the previous chunk's recorded end_char (from
result[-1]['metadata']['end_char']) to derive the overlap_start =
max(prev_end_char - chunk_overlap, prev_start_char) and set overlap_text =
text[overlap_start:prev_end_char]; then build chunk_text = overlap_text + raw
and compute start_char = overlap_start (or the found raw start if no overlap),
update end_char = start_char + len(chunk_text), and ensure search_start is
advanced using the actual raw's start position found in text so
start_char/end_char always map to contiguous spans in text (update uses of
raw_chunks, prev, chunk_overlap, start_char, end_char, search_start and result
accordingly).
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 37-45: validateConfig currently always calls depends(requirements)
which installs tiktoken for all strategies and swallows errors; change
validateConfig to check the chunker configuration's strategy value and only call
depends(requirements) when strategy == "token", and if depends throws let the
exception propagate (do not convert to warning) so failures surface during
config validation; reference the validateConfig method in IGlobal.py, the
chunker config's "strategy" field, the depends(...) installer, and
TokenChunker._get_encoder (which depends on tiktoken) when making this change.
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 77-95: The code creates a fallback DocMetadata() with no required
fields and uses self.chunkId (a global counter) for chunkId, causing schema
validation failures and incorrect chunk numbering across source documents;
update the chunking in IInstance so that when document.metadata is falsy you
instantiate DocMetadata with required fields (e.g., objectId populated from the
source document identifier) before assigning any chunk fields, and compute
chunkId as the position within the current source (reset a local per-document
counter or use the chunk index from strategy_meta) instead of self.chunkId;
adjust references to chunk_doc.metadata.chunkId, chunk_doc.metadata.parentId,
and the increment logic so chunkId starts at 0 for each source document and
schema-required fields are always present.
In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 39-47: The validateConfig method currently swallows dependency
installation failures by calling warning(str(e)); change it to fail fast: when
depends(requirements) raises, log/record the error and re-raise a RuntimeError
(or call sys.exit(1)) so the node does not boot without BM25; update
validateConfig to propagate the exception instead of only calling warning,
referencing the validateConfig function, the local requirements variable, and
the depends import to locate the change.
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 114-118: The context-building loop is reading a non-existent
metadata key 'hybrid_score' so scores show as "N/A"; update the loop that
constructs context_parts (the for i, doc in enumerate(reranked_docs) block in
IInstance) to use doc.score (with a safe fallback like 'N/A' or 0 when None)
instead of doc.metadata.get('hybrid_score'), and keep the existing snippet
extraction (snippet = (doc.page_content or '')[:500]) unchanged so rendered
answers display the actual fused score stored on doc.score.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4f1a6f92-efe2-4514-a6ab-ba032bd4b6d1
📒 Files selected for processing (14)
nodes/src/nodes/chunker/IGlobal.py
nodes/src/nodes/chunker/IInstance.py
nodes/src/nodes/chunker/__init__.py
nodes/src/nodes/chunker/chunker_strategies.py
nodes/src/nodes/chunker/requirements.txt
nodes/src/nodes/chunker/services.json
nodes/src/nodes/search_hybrid/IGlobal.py
nodes/src/nodes/search_hybrid/IInstance.py
nodes/src/nodes/search_hybrid/__init__.py
nodes/src/nodes/search_hybrid/hybrid_search.py
nodes/src/nodes/search_hybrid/requirements.txt
nodes/src/nodes/search_hybrid/services.json
test/nodes/test_chunker.py
test/nodes/test_search_hybrid.py
```python
def beginGlobal(self):
    # Are we in config mode or some other mode?
    if self.IEndpoint.endpoint.openMode == OPEN_MODE.CONFIG:
        # We are going to get a call to configureService but
        # we don't actually need to load the strategy for that
        pass
    else:
        # Get this node's config
        config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)

        # Read strategy parameters from config
        strategy_name = config.get('strategy', 'recursive')
        chunk_size = int(config.get('chunk_size', 1000))
        chunk_overlap = int(config.get('chunk_overlap', 200))
        encoding_name = config.get('encoding_name', 'cl100k_base')

        if chunk_size <= 0:
            raise ValueError(f'chunk_size must be positive, got {chunk_size}')
        if chunk_overlap < 0:
            raise ValueError(f'chunk_overlap must be non-negative, got {chunk_overlap}')
        if chunk_overlap >= chunk_size:
            raise ValueError(f'chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size})')

        # Build the appropriate strategy
        if strategy_name == 'token':
            self.strategy = TokenChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                encoding_name=encoding_name,
            )
        elif strategy_name == 'sentence':
            self.strategy = SentenceChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )
        else:
            # Default to recursive character chunker
            self.strategy = RecursiveCharacterChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )

def endGlobal(self):
    # Release the strategy
    self.strategy = None
```
Add PEP 257 docstrings to the lifecycle hooks.
beginGlobal() and endGlobal() are public methods in a nodes/**/*.py module, but they currently have no docstrings.
📝 Suggested change
```diff
 def beginGlobal(self):
+    """Initialize the configured chunking strategy for runtime execution."""
     # Are we in config mode or some other mode?
     if self.IEndpoint.endpoint.openMode == OPEN_MODE.CONFIG:
         # We are going to get a call to configureService but
         # we don't actually need to load the strategy for that
         pass
@@
 def endGlobal(self):
+    """Release the configured chunking strategy."""
     # Release the strategy
     self.strategy = None
```

As per coding guidelines, nodes/**/*.py: Python pipeline nodes: use single quotes, ruff for linting/formatting, PEP 257 docstrings, target Python 3.10+.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def beginGlobal(self):
    """Initialize the configured chunking strategy for runtime execution."""
    # Are we in config mode or some other mode?
    if self.IEndpoint.endpoint.openMode == OPEN_MODE.CONFIG:
        # We are going to get a call to configureService but
        # we don't actually need to load the strategy for that
        pass
    else:
        # Get this node's config
        config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)
        # Read strategy parameters from config
        strategy_name = config.get('strategy', 'recursive')
        chunk_size = int(config.get('chunk_size', 1000))
        chunk_overlap = int(config.get('chunk_overlap', 200))
        encoding_name = config.get('encoding_name', 'cl100k_base')
        if chunk_size <= 0:
            raise ValueError(f'chunk_size must be positive, got {chunk_size}')
        if chunk_overlap < 0:
            raise ValueError(f'chunk_overlap must be non-negative, got {chunk_overlap}')
        if chunk_overlap >= chunk_size:
            raise ValueError(f'chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size})')
        # Build the appropriate strategy
        if strategy_name == 'token':
            self.strategy = TokenChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                encoding_name=encoding_name,
            )
        elif strategy_name == 'sentence':
            self.strategy = SentenceChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )
        else:
            # Default to recursive character chunker
            self.strategy = RecursiveCharacterChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )

def endGlobal(self):
    """Release the configured chunking strategy."""
    # Release the strategy
    self.strategy = None
```
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 64-64: Avoid specifying long messages outside the exception class (TRY003)
[warning] 66-66: Avoid specifying long messages outside the exception class (TRY003)
[warning] 68-68: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IGlobal.py` around lines 47 - 91, The lifecycle hooks
beginGlobal and endGlobal are missing PEP 257 docstrings; add concise
triple-single-quoted ('''...''') docstrings to both methods: beginGlobal should
have a one-line summary plus a short second line describing that it loads node
config and initializes the chunking strategy (mention the strategy selection:
'token', 'sentence', default 'recursive') and note it raises ValueError for bad
chunk sizes; endGlobal should have a one-line summary saying it releases/cleans
up the strategy (sets self.strategy to None). Use single quotes for the
docstrings to match project style and ensure ruff/formatting passes for Python
3.10+.
…rch nodes

Critical fixes:
- writeAnswers() now passes a list[Answer] matching the rocketlib contract
- hybrid_score N/A bug fixed: use doc.score instead of nonexistent metadata key
- Score propagation now works in all modes (RRF, BM25-only, vector-only)
- RRF k parameter validated to prevent division-by-zero
- vector_scores length mismatch now raises ValueError instead of silent fallback

Improvements:
- chunker validateConfig only installs tiktoken when token strategy is configured
- search_hybrid validateConfig re-raises dependency failures instead of swallowing
- Added PEP 257 docstrings to all public lifecycle hooks
- Modernized typing: List/Dict -> list/dict for Python 3.10+ (ruff UP035)
- _tokenize type hint updated to Optional[str] matching implementation
- BM25 test stub now properly cleaned up via module-scoped pytest fixture
- test_structured_answer_output verifies actual score values, not N/A

All 42 chunker tests pass. All 39 hybrid search tests pass. ruff check/format clean.

https://claude.ai/code/session_017DgznbpdyRsQ3svzCjYLXo
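The module-scoped cleanup for the BM25 test stub mentioned above can be sketched like this. The module name `rank_bm25` and the stub's shape are assumptions; in the real tests the stubbed interface may differ. Wrapping this context manager in `@pytest.fixture(scope='module')` with a `yield` gives the described teardown behavior.

```python
import sys
import types
from contextlib import contextmanager


@contextmanager
def bm25_stub(module_name: str = 'rank_bm25'):
    """Install a throwaway BM25 module stub, restoring sys.modules on exit.

    Sketch only: the stub exposes a placeholder BM25Okapi attribute so
    imports in the code under test resolve without the real package.
    """
    stub = types.ModuleType(module_name)
    stub.BM25Okapi = object  # minimal placeholder class
    saved = sys.modules.get(module_name)
    sys.modules[module_name] = stub
    try:
        yield stub
    finally:
        # Restore whatever was there before, so later test modules see
        # the real package (if installed) rather than the stub.
        if saved is None:
            sys.modules.pop(module_name, None)
        else:
            sys.modules[module_name] = saved
```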
Review Feedback Resolution — All 23 CodeRabbit threads addressed

I've gone through every review comment and pushed a comprehensive fix in commit

Critical Fixes (3 bugs)

Major Fixes (3 bugs)

Improvement Fixes (2 items)

Typing Modernization (5 items)

Test Fixes (2 items)

Intentionally Not Changed (4 items)

Verification

Generated by Claude Code
Review Feedback Response —
| Issue | File | Fix |
|---|---|---|
| Path construction — string concat instead of `os.path.join()` | `chunker/IGlobal.py`, `search_hybrid/IGlobal.py` | Replaced with `os.path.join()` for cross-platform safety |
| Dependency failure silently swallowed — `warning(str(e))` masks install failures | `chunker/IGlobal.py` | Now raises `RuntimeError` with chained exception. Removed unused `warning` import (caught by ruff) |
| Overlap offset bug — used config value instead of actual overlap length | `chunker/chunker_strategies.py` | Fixed: now uses `len(overlap_text)` to compute `start_char` adjustment, so `text[start_char:end_char]` matches `chunk_text` when the previous chunk is shorter than the configured overlap |
| Deprecated `typing.Optional` | `chunker_strategies.py`, `hybrid_search.py` | Replaced with PEP 604 union syntax (`X \| None`), removed unused `typing` imports |
| Token decode test coverage — only tracked call count, not window sizes | `test_chunker.py` | Added `decoded_lengths` tracking + assertion that no decode call exceeds `chunk_size`, verifying O(n) behavior |
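The overlap-offset fix can be illustrated with a simplified character chunker. This sketch is an assumption, not the node's actual `RecursiveCharacterChunker`: its point is that the next chunk's start offset is derived from the length of the *actual* overlap text, so `text[start:end]` always equals the emitted chunk.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int):
    """Yield (chunk, start, end) triples with character-level overlap.

    Simplified sketch: offsets step back by len(overlap_text) rather than
    the configured chunk_overlap, preserving the slice invariant.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append((chunk, start, end))
        if end == len(text):
            break
        # Use the actual overlap text, which may be shorter than the
        # configured overlap when the chunk itself is short.
        overlap_text = chunk[-chunk_overlap:] if chunk_overlap else ''
        start = end - len(overlap_text)
    return chunks
```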
⏭️ No Changes Needed (already correct or incorrect feedback)

**`validateConfig(self)` signature (Thread 13)**
> The codebase convention across all nodes (`llamaparse`, `qdrant`, `mcp_client`, `index_search`, `llm_deepseek`, `llm_gmi_cloud`, `local_text_output`, etc.) is `def validateConfig(self)` — no `syntaxOnly` parameter. Our implementation is consistent with the framework contract.

**Sentence chunker whitespace (Thread 1)**
> Already fixed in prior commit `4181638` — chunk text is derived via the `text[start_char:end_char]` slice, not `' '.join()`. Whitespace is preserved exactly as it appears in the original text.

**`doc.score` vs `metadata['hybrid_score']` (Thread 10)**
> Code already uses `doc.score` directly at line 117 of `search_hybrid/IInstance.py`:
> `score = f'{doc.score:.4f}' if doc.score is not None else 'N/A'`
> After RRF reranking, `reranked_doc.score` is set to the ranking signal (lines 98-102), so scores are always numeric.

**Per-document `chunk_index` (Thread 3)**
> Two separate fields serve different purposes:
> - `chunk_index` — from the strategy's `.chunk()`, resets to 0 per document ✅
> - `chunkId` — entry-scoped monotonic ID, intentionally unique across all chunks in an entry for downstream dedup/tracking
>
> Both are correct by design.

**`writeAnswers()` interface (Thread 22)**
> Already passes a list: `self.instance.writeAnswers([ans])` (line 123). No interface violation.

**`rrf_k` validation (Thread 18)**
> Already validated: `if k < 0: raise ValueError('k must be non-negative')` (lines 129-130 of `hybrid_search.py`).

**`vector_scores` length mismatch (Thread 19)**
> Already validated: `if len(vector_scores) != len(documents): raise ValueError(...)` (lines 190-191).

**Score propagation for different alpha modes (Thread 21)**
> Already handled with fallback chain (lines 98-102):
> ```python
> ranking_score = result.get('rrf_score')
> if ranking_score is None:
>     ranking_score = result.get('bm25_score', result.get('vector_score'))
> ```
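For context on the RRF-related threads above, the fusion step with its `k` validation can be sketched as follows. This is a minimal illustration over document-ID rankings, not the node's actual `HybridSearchEngine` implementation:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked document-ID lists with Reciprocal Rank Fusion (sketch).

    Each document scores sum(1 / (k + rank)) across the lists it appears
    in, with rank starting at 1; k must be non-negative, mirroring the
    validation described above.
    """
    if k < 0:
        raise ValueError('k must be non-negative')
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document ranked highly by both BM25 and the vector search accumulates two large reciprocal terms, so it rises above documents that only one ranker favors.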
Test Results
81/81 tests pass (42 chunker + 39 hybrid search) ✅
Ready for re-review.
Generated by Claude Code
Summary
Review feedback addressed
- `raise RuntimeError` on dependency install failure
- `parentId` links each chunk back to its source document

Test plan
Splits rocketride-org#514
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests