feat(nodes): add text chunker node with recursive, sentence, and token strategies #601
nihalnihalani wants to merge 8 commits into rocketride-org:develop
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the … setting, and use the listed commands and checkboxes to manage reviews.
📝 Walkthrough

Adds a new text chunker node (three chunking strategies, lifecycle, instance emission, config, tests) and a hybrid search node (BM25 + optional vector fusion engine, lifecycle, instance reranking, config, tests).
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant IGlobal as IGlobal (Chunker)
    participant Strategy as ChunkingStrategy
    participant IInstance as IInstance
    participant Output
    Client->>IGlobal: beginGlobal()
    IGlobal->>IGlobal: validateConfig() / load node config
    IGlobal->>IGlobal: instantiate Strategy
    IGlobal-->>Client: strategy available
    Client->>IInstance: open(entry) / writeDocuments(docs)
    IInstance->>IGlobal: ensure strategy initialized
    loop per input document
        IInstance->>Strategy: chunk(text)
        Strategy-->>IInstance: [chunks with metadata]
        IInstance->>IInstance: create per-chunk Doc (chunkId, parentId, metadata)
    end
    IInstance->>Output: emit chunked documents
    Client->>IGlobal: endGlobal()
    IGlobal->>IGlobal: release strategy (set None)
```
```mermaid
sequenceDiagram
    participant Client
    participant IGlobal as IGlobal (Hybrid)
    participant Engine as HybridSearchEngine
    participant IInstance as IInstance
    participant Docs as Documents
    participant Answers as Answers
    Client->>IGlobal: beginGlobal()
    IGlobal->>IGlobal: validateConfig() / load node config
    IGlobal->>Engine: instantiate HybridSearchEngine(alpha, ...)
    IGlobal-->>Client: engine available
    Client->>IInstance: writeQuestions(question)
    IInstance->>IGlobal: ensure engine initialized
    IInstance->>Engine: search(query, documents, vector_scores, top_k, rrf_k)
    Engine-->>IInstance: ranked results (bm25/rrf/vector fused)
    IInstance->>Docs: emit reranked documents
    alt answers listener exists
        IInstance->>Answers: build & emit textual answer (snippets + hybrid_score)
    end
    Client->>IGlobal: endGlobal()
    IGlobal->>IGlobal: release engine (set None)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
⚠️ Warning: Review ran into problems. 🔥 Git: Failed to clone repository.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 213-216: The chunker currently rebuilds chunks with ' '.join(current_sentences), which loses original whitespace and invalidates
offsets; update the logic in chunker_strategies.py (where chunk_text,
current_sentences, sentence_positions are used) to slice the original source
text by each sentence's original span and concatenate those exact source slices
(preserving whitespace) to form chunk_text, and compute start_char and end_char
from the original span boundaries (e.g., first sentence's start index and last
sentence's end index) instead of using lengths; apply the same change to the
second occurrence around lines 260-263 so all chunks use original spans for text
and offsets.
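The span-slicing fix described above can be sketched with a standalone helper. This is a hedged illustration, not the node's actual API: the regex sentence splitter and dict-based chunks are stand-ins for whatever SentenceChunker really uses, but the key idea is identical — slice the original text by sentence spans so `chunk_text == text[start_char:end_char]` and whitespace survives.

```python
import re

def chunk_by_sentence_spans(text, max_sentences=2):
    """Build chunks by slicing the ORIGINAL text at sentence spans,
    preserving whitespace between sentences (e.g. double spaces, newlines)."""
    # Record each sentence's (start, end) span in the original text.
    spans = [m.span() for m in re.finditer(r'[^.!?]+[.!?]?', text) if m.group().strip()]
    chunks = []
    for i in range(0, len(spans), max_sentences):
        group = spans[i:i + max_sentences]
        start_char = group[0][0]   # first sentence's original start
        end_char = group[-1][1]    # last sentence's original end
        chunks.append({
            'text': text[start_char:end_char],  # exact source slice, separators intact
            'start_char': start_char,
            'end_char': end_char,
        })
    return chunks

text = 'One.  Two.\n\nThree.'
chunks = chunk_by_sentence_spans(text, max_sentences=2)
```

Note that the double space inside the first chunk is preserved verbatim, which a `' '.join()` rebuild would have collapsed.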
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 52-53: The code in IGlobal.py currently uses a blind "except
Exception as e" around dependency/config loading (calling warning(str(e))),
which can mask real runtime errors; update the handler in the
validateConfig/config-loading block to catch only the expected exceptions (e.g.,
ImportError, ModuleNotFoundError, KeyError, ValueError) and call warning(str(e))
for those, or if you intentionally must catch everything keep the broad except
but append a targeted linter suppression ( "# noqa: BLE001" ) with a one-line
justification comment explaining why a broad catch is required; reference the
existing validateConfig/config-loading block and the warning(...) call to locate
where to make this change.
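A minimal sketch of the narrowed handler the comment asks for. The function name `load_node_config_safely` and the `loader` callable are hypothetical; only the exception tuple mirrors the suggestion above.

```python
from logging import warning

def load_node_config_safely(loader):
    """Catch only the expected probe-time failures; anything else
    (e.g. an AttributeError from a real bug) still propagates."""
    try:
        return loader()
    except (ImportError, KeyError, ValueError) as e:
        # ModuleNotFoundError is a subclass of ImportError, so it is covered too.
        warning(str(e))
        return None

def _probe_ok():
    return {'strategy': 'recursive'}

def _probe_missing_key():
    raise KeyError('chunk_size')

ok = load_node_config_safely(_probe_ok)
failed = load_node_config_safely(_probe_missing_key)
```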
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 84-90: The current metadata assignment in the chunk emission logic
overwrites/preserves only chunkId and parentId and drops strategy-provided
fields (chunk_index, start_char, end_char, total_chunks). Fix the block that
sets chunk_doc.metadata (referencing chunk_doc, DocMetadata, self.chunkId,
parent_id) to merge/preserve existing strategy metadata: if chunk_doc.metadata
exists, set/overwrite only chunkId and parentId and leave or copy through
chunk_index, start_char, end_char, total_chunks; if not, construct a new
DocMetadata that includes chunkId, parentId plus chunk_index, start_char,
end_char, and total_chunks (sourcing those values from the strategy output
variables on the chunk or from the available total_chunks value). Ensure you do
not drop any of the listed fields when creating or updating metadata.
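A sketch of the merge rule, with plain dicts standing in for `DocMetadata` (an assumption — the real type may be an object, but the overwrite-only-identity-keys idea is the same):

```python
def merge_chunk_metadata(existing, chunk_id, parent_id):
    """Merge identity keys into strategy-provided metadata without dropping
    chunk_index / start_char / end_char / total_chunks."""
    merged = dict(existing or {})   # keep all strategy fields if present
    merged['chunkId'] = chunk_id    # overwrite identity keys only
    merged['parentId'] = parent_id
    return merged

strategy_meta = {'chunk_index': 0, 'start_char': 0, 'end_char': 5, 'total_chunks': 3}
m = merge_chunk_metadata(strategy_meta, 'c1', 'p1')
```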
In `@test/nodes/test_chunker.py`:
- Around line 493-509: The test named test_iglobal_creates_strategy currently
only instantiates RecursiveCharacterChunker directly; update it to import
IGlobal (IGlobal.beginGlobal) after forcing module reload, set up the mocked
endpoint/config used by beginGlobal (reuse self._install_stubs() or
saved/names), call IGlobal.beginGlobal(), and then assert that the instance's
strategy attribute (e.g., iglobal_instance.strategy) is an instance of
RecursiveCharacterChunker and that chunking works (e.g., call strategy.chunk and
check length) to validate the lifecycle wiring rather than direct construction.
- Around line 255-258: The no-overlap assertion in test_chunker.py is wrong: it
compares chunks[i]['metadata']['start_char'] to the previous chunk's start_char,
which allows overlaps; change the assertion to ensure each chunk's start_char is
>= the previous chunk's metadata 'end_char' so spans don't overlap (use
chunks[i]['metadata']['start_char'] >= chunks[i-1]['metadata']['end_char']
referencing the same chunks and metadata fields).
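The corrected assertion can be expressed as a tiny predicate (a sketch; the dict shape mirrors the `chunks[i]['metadata']` access used in the test):

```python
def spans_do_not_overlap(chunks):
    """True when each chunk starts at or after the previous chunk's END.
    Comparing start_char to the previous start_char (the old assertion)
    would accept overlapping spans."""
    return all(
        chunks[i]['metadata']['start_char'] >= chunks[i - 1]['metadata']['end_char']
        for i in range(1, len(chunks))
    )

good = [{'metadata': {'start_char': 0, 'end_char': 5}},
        {'metadata': {'start_char': 5, 'end_char': 9}}]
# Starts after the previous start, yet overlaps its span:
bad = [{'metadata': {'start_char': 0, 'end_char': 5}},
       {'metadata': {'start_char': 3, 'end_char': 8}}]
```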
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 255f96a0-24d3-4772-b561-67e9b8bec0de
📒 Files selected for processing (7)
- nodes/src/nodes/chunker/IGlobal.py
- nodes/src/nodes/chunker/IInstance.py
- nodes/src/nodes/chunker/__init__.py
- nodes/src/nodes/chunker/chunker_strategies.py
- nodes/src/nodes/chunker/requirements.txt
- nodes/src/nodes/chunker/services.json
- test/nodes/test_chunker.py
Actionable comments posted: 3
♻️ Duplicate comments (1)
nodes/src/nodes/chunker/IGlobal.py (1)
44-45: ⚠️ Potential issue | 🟡 Minor — Make the broad `except` explicitly intentional.

Line 44 still catches `Exception` broadly. If this is intentional for save-time config probing, add a targeted Ruff suppression with a one-line justification to avoid silently masking accidental regressions.

Proposed minimal fix:

```diff
- except Exception as e:
+ except Exception as e:  # noqa: BLE001 - validateConfig probe should warn, not block node configuration
      warning(str(e))
```

Based on learnings: follow the existing validateConfig style, and when broad catches are intentional, use a targeted Ruff `# noqa` with a specific BLE001 justification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IGlobal.py` around lines 44 - 45, The broad except in IGlobal.py (the except Exception as e block that calls warning(str(e))) is intentional for save-time config probing but must be explicitly documented and suppressed for Ruff; update that except to include a one-line targeted Ruff suppression comment (e.g. "# noqa: BLE001 - intentional broad Exception for save-time config probing, follows validateConfig pattern") immediately on the except line so the broad catch remains but is clearly justified and silences BLE001 while keeping the warning(str(e)) behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 58-61: Validate and sanitize chunk parameters before creating
strategies: after parsing strategy_name, chunk_size, chunk_overlap, and
encoding_name in IGlobal.py, ensure chunk_size is an integer > 0 and
chunk_overlap is an integer >= 0 and strictly < chunk_size (or clamp to
chunk_size - 1 if you prefer permissive behavior); also catch and handle
ValueError from int() conversions and log/raise a clear error when values are
invalid. Apply this validation in the same block where
strategy_name/chunk_size/chunk_overlap/encoding_name are read and also in the
other similar parsing block referenced (lines 64-80) so both code paths perform
the same range checks and consistent error handling. Use the variable names
chunk_size and chunk_overlap and the strategy creation entrypoint(s) in
IGlobal.py to locate where to add the guards.
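A sketch of the guards described above (the function name and error messages are illustrative, not the node's actual code):

```python
def validate_chunk_params(chunk_size, chunk_overlap):
    """Parse and range-check chunk parameters before strategy creation."""
    try:
        chunk_size = int(chunk_size)
        chunk_overlap = int(chunk_overlap)
    except ValueError as e:
        raise ValueError(f'chunk parameters must be integers: {e}') from e
    if chunk_size <= 0:
        raise ValueError(f'chunk_size must be > 0, got {chunk_size}')
    if chunk_overlap < 0:
        raise ValueError(f'chunk_overlap must be >= 0, got {chunk_overlap}')
    if chunk_overlap >= chunk_size:
        raise ValueError(
            f'chunk_overlap ({chunk_overlap}) must be < chunk_size ({chunk_size})'
        )
    return chunk_size, chunk_overlap

ok = validate_chunk_params('512', '64')   # string config values are accepted
try:
    validate_chunk_params(100, 100)       # overlap == size is rejected
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
```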
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 51-53: The docstring in IInstance (file IInstance.py) incorrectly
documents emitted metadata keys as chunk_index and total_chunks; update the
docstring to match runtime behavior and state that emitted documents include
metadata keys "chunkId" and "parentId" (not chunk_index/total_chunks), and
adjust any other docstring occurrences in the same file (the comments around the
emission logic referenced at lines ~85-86) to use the same "chunkId"/"parentId"
wording so documentation matches the actual emitted metadata keys.
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 113-122: The answer output is reading hybrid_score from
doc.metadata but reranking stores scores on the document object
(reranked_doc.score), so the displayed score is always 'N/A'; update the answer
assembly in the block that uses reranked_docs to read the score from the
document's score attribute (e.g., use doc.score with a fallback like 'N/A' if
missing) when building context_parts before creating Answer and calling
self.instance.writeAnswers(ans).
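A sketch of the corrected score lookup, using a minimal `Doc` stand-in for the pipeline's Document class (an assumption — only the `doc.score`-not-metadata point is from the review):

```python
class Doc:
    """Minimal stand-in for the pipeline Document."""
    def __init__(self, page_content, score=None, metadata=None):
        self.page_content = page_content
        self.score = score
        self.metadata = metadata

def build_context_parts(reranked_docs):
    """Read the fused score from doc.score (where reranking stores it),
    not from metadata['hybrid_score'], so the answer never shows 'N/A'
    for scored documents."""
    parts = []
    for i, doc in enumerate(reranked_docs):
        score = f'{doc.score:.4f}' if doc.score is not None else 'N/A'
        snippet = (doc.page_content or '')[:500]
        parts.append(f'[Document {i + 1}] (score: {score})\n{snippet}')
    return parts

parts = build_context_parts([Doc('alpha text', score=0.0625), Doc('beta text')])
```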
---
Duplicate comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 44-45: The broad except in IGlobal.py (the except Exception as e
block that calls warning(str(e))) is intentional for save-time config probing
but must be explicitly documented and suppressed for Ruff; update that except to
include a one-line targeted Ruff suppression comment (e.g. "# noqa: BLE001 -
intentional broad Exception for save-time config probing, follows validateConfig
pattern") immediately on the except line so the broad catch remains but is
clearly justified and silences BLE001 while keeping the warning(str(e))
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5dccd2c3-49af-40a7-ace1-93178201c391
📒 Files selected for processing (10)
- nodes/src/nodes/chunker/IGlobal.py
- nodes/src/nodes/chunker/IInstance.py
- nodes/src/nodes/chunker/services.json
- nodes/src/nodes/search_hybrid/IGlobal.py
- nodes/src/nodes/search_hybrid/IInstance.py
- nodes/src/nodes/search_hybrid/__init__.py
- nodes/src/nodes/search_hybrid/hybrid_search.py
- nodes/src/nodes/search_hybrid/requirements.txt
- nodes/src/nodes/search_hybrid/services.json
- test/nodes/test_search_hybrid.py
Thank you for your contribution. What is the additional value of the new chunker node, taking into consideration that chunking is already present (see …)?
Great question! The dedicated chunker node adds value over the existing preprocessors (…). Regarding the "2 independent nodes" feedback — the chunker and hybrid search are in separate directories with zero code dependencies. Happy to split into two PRs if preferred, though they were developed as a cohesive RAG pipeline feature set. Let me know your preference.
Actionable comments posted: 2
♻️ Duplicate comments (1)
nodes/src/nodes/chunker/IInstance.py (1)
79-82: ⚠️ Potential issue | 🟠 Major — Construct fallback `DocMetadata` with required fields.

Line 81 uses `DocMetadata()` bare. In the real schema, `objectId` and `chunkId` are required, so the `metadata=None` path will raise before any chunk is emitted. The permissive stub in test/nodes/test_chunker.py currently hides this.

🛠️ Suggested fix:

```diff
- chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
+ chunk_doc.metadata = (
+     copy.copy(document.metadata)
+     if document.metadata
+     else DocMetadata(objectId=parent_id, chunkId=self.chunkId)
+ )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IInstance.py` around lines 79 - 82, The fallback DocMetadata() used when document.metadata is falsy lacks required fields (objectId, chunkId); update the chunk creation logic in the chunker where chunk_doc is made (the copy.copy(document) / chunk_doc.metadata assignment) to construct a DocMetadata instance populated with at least the required fields (e.g., objectId from document or a generated value and chunkId from chunk_data or an index) instead of calling bare DocMetadata(), ensuring chunk_doc.page_content = chunk_data['text'] remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 211-217: The size and overlap checks for chunking currently assume
a single-character separator (using current_len + 1 + sentence_len) while the
emitted chunk slices the original text with preserved whitespace, causing
undercounts; in the chunking logic (look for variables/current_len,
sentence_len, sentence_positions, start_char, end_char, chunk_text, chunk_size,
chunk_overlap in chunker_strategies.py) replace the synthetic "+ 1" separator
math with calculations based on actual original spans (use sentence_positions
and string indices to compute widths as end_index - start_index for both the
candidate addition and existing current span) so that both the capacity check
and overlap bookkeeping use the original slice widths; apply the same fix in the
other similar blocks referenced around the later checks (the sections handling
finalization and overlap at the other noted ranges).
- Around line 129-145: The problem is that overlap is prepended using
prev[-self.chunk_overlap:] from the split output (raw) which may have lost
separators, so chunk_text no longer matches text[start_char:end_char] and
offsets are wrong; to fix, first locate the raw slice's start in the original
text (compute start_char = text.find(raw, search_start) as done), then compute
overlap_text by slicing the original text (overlap_text = text[max(0, start_char
- self.chunk_overlap): start_char]) and prepend that to raw to form chunk_text,
and finally set start_char = start_char - len(overlap_text) and end_char
accordingly; apply the same change to the second similar block (the one around
lines 157-160) and reference the functions/variables chunk_overlap, raw_chunks,
raw, chunk_text, start_char, end_char, and _split_text when making the update.
---
Duplicate comments:
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 79-82: The fallback DocMetadata() used when document.metadata is
falsy lacks required fields (objectId, chunkId); update the chunk creation logic
in the chunker where chunk_doc is made (the copy.copy(document) /
chunk_doc.metadata assignment) to construct a DocMetadata instance populated
with at least the required fields (e.g., objectId from document or a generated
value and chunkId from chunk_data or an index) instead of calling bare
DocMetadata(), ensuring chunk_doc.page_content = chunk_data['text'] remains
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f902dbba-b734-4d31-9b3e-ee0acfcc03b9
📒 Files selected for processing (4)
- nodes/src/nodes/chunker/IGlobal.py
- nodes/src/nodes/chunker/IInstance.py
- nodes/src/nodes/chunker/chunker_strategies.py
- test/nodes/test_chunker.py
♻️ Duplicate comments (1)
nodes/src/nodes/chunker/chunker_strategies.py (1)
129-145: ⚠️ Potential issue | 🟠 Major — Overlap-derived `chunk_text` still doesn't match the original text slice.

The past review flagged that `chunk_text = overlap_text + raw` produces text that won't match `text[start_char:end_char]` because `_split_text()` drops separators. The code still uses this pattern despite being marked as addressed.

For example, with `text='abc\n\ndef'` and `chunk_overlap=1`:
- `raw_chunks` might be `['abc', 'def']` (separators dropped)
- `chunk_text` for chunk 2 becomes `'c' + 'def' = 'cdef'`
- But `text[start_char:end_char]` would be `'c\n\ndef'` (includes the separator)

🛠️ Suggested fix: derive chunk_text from the original text

```diff
- # Compute overlap prefix from the previous chunk
- if i > 0 and self.chunk_overlap > 0:
-     prev = raw_chunks[i - 1]
-     overlap_text = prev[-self.chunk_overlap :]
-     chunk_text = overlap_text + raw
- else:
-     chunk_text = raw
-
- # Find start_char position in original text
- start_char = text.find(raw, search_start)
- if start_char == -1:
-     start_char = search_start
- # Adjust for overlap prefix
- if i > 0 and self.chunk_overlap > 0:
-     overlap_start = max(0, start_char - self.chunk_overlap)
-     start_char = overlap_start
- end_char = start_char + len(chunk_text)
+ # Find raw chunk position in original text
+ raw_start = text.find(raw, search_start)
+ if raw_start == -1:
+     raw_start = search_start
+ raw_end = raw_start + len(raw)
+
+ # Compute start_char with overlap adjustment
+ if i > 0 and self.chunk_overlap > 0:
+     start_char = max(0, raw_start - self.chunk_overlap)
+ else:
+     start_char = raw_start
+ end_char = raw_end
+
+ # Slice chunk_text from original to preserve separators
+ chunk_text = text[start_char:end_char]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 129 - 145, The current construction chunk_text = overlap_text + raw (using raw_chunks and chunk_overlap) can mismatch the original text because _split_text drops separators; instead compute chunk_text by slicing the original text using calculated indices: find start_char = text.find(raw, search_start) (fall back to search_start if -1), then when overlap applies set overlap_start = max(0, start_char - self.chunk_overlap) and set end_char = overlap_start + len(chunk_text) or better end_char = start_char + len(raw) + (self.chunk_overlap if needed), and finally derive chunk_text = text[overlap_start:end_char]; update uses of start_char/end_char and remove the overlap_text + raw concatenation so chunk_text always matches text[start_char:end_char] and honors separators dropped by _split_text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 129-145: The current construction chunk_text = overlap_text + raw
(using raw_chunks and chunk_overlap) can mismatch the original text because
_split_text drops separators; instead compute chunk_text by slicing the original
text using calculated indices: find start_char = text.find(raw, search_start)
(fall back to search_start if -1), then when overlap applies set overlap_start =
max(0, start_char - self.chunk_overlap) and set end_char = overlap_start +
len(chunk_text) or better end_char = start_char + len(raw) + (self.chunk_overlap
if needed), and finally derive chunk_text = text[overlap_start:end_char]; update
uses of start_char/end_char and remove the overlap_text + raw concatenation so
chunk_text always matches text[start_char:end_char] and honors separators
dropped by _split_text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4dac1c14-a233-42b7-bda4-d5d0aa37b128
📒 Files selected for processing (2)
- nodes/src/nodes/chunker/chunker_strategies.py
- test/nodes/test_chunker.py
…n strategies Split from rocketride-org#514 per reviewer request. Addresses review feedback: - RuntimeError on uninitialized strategy (not silent return) - Shallow copy + metadata copy instead of N+1 deepcopy - parent_id preserved in all metadata branches Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unreachable else branch in IInstance.writeDocuments (metadata is always non-None after shallow-copy-or-create on line 81). Remove unused _STRATEGY_MAP dict in IGlobal. Fix services.json output description that incorrectly referenced total_chunks metadata which is never set on chunk documents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion Add early validation for chunk_size > 0, chunk_overlap >= 0, and chunk_overlap < chunk_size in beginGlobal() so invalid config is caught before strategy instantiation. Also add noqa BLE001 annotation for the intentional broad exception catch in validateConfig(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of joining sentences with ' '.join() and computing end_char from the joined string length, derive start_char/end_char from the sentence position map and slice the original text. This preserves the actual whitespace between sentences (e.g., double spaces, newlines). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy chunk_index, start_char, end_char, and total_chunks from the strategy's per-chunk metadata dict onto each emitted Doc's metadata so downstream nodes can locate chunks within the source document. Update the writeDocuments docstring to list all metadata keys. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change the no-overlap position assertion to compare start_char against the previous chunk's end_char (not start_char) so it actually verifies non-overlapping spans. Rewrite test_iglobal_creates_strategy to instantiate IGlobal, mock the endpoint and glb with strategy='sentence', call beginGlobal(), and assert the resulting strategy type and parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nting Replace the current_len tracker (which assumed 1-char separators) with position-based span computation that accounts for multi-character whitespace between sentences. After the whitespace-preservation change (text[start_char:end_char] slicing), current_len was underestimating the real chunk width whenever the original separator was longer than one character (e.g. \n\n), causing chunks to silently exceed chunk_size. The overlap budget is now also computed from actual text spans so carried-over sentences fit within the configured overlap window. Add two regression tests: - overlap + repeated sentences: verify spans match text and respect size - multi-char whitespace: verify chunks are split, not all swallowed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed fab5892 to 4181638 (Compare)
Great question — here's why the dedicated chunker node adds value beyond the existing preprocessors:
1. Pipeline composability vs preprocessing stage
2. Rich document metadata for reconstruction
3. Strategy-configurable with precise token counting
4. Zero external dependencies

Regarding "2 independent nodes" — you're right that the chunker and hybrid search are independent. They share zero code dependencies and live in separate directories. Happy to split into two PRs if you'd prefer to review and merge them separately. Just let me know.
Actionable comments posted: 8
♻️ Duplicate comments (2)
nodes/src/nodes/search_hybrid/IInstance.py (1)
113-122: ⚠️ Potential issue | 🟡 Minor — Answer text still references `hybrid_score` from metadata instead of `doc.score`.

Line 116 reads `doc.metadata.get('hybrid_score')`, but line 101 stores the RRF score in `reranked_doc.score`, not in metadata. This will always display `'N/A'` for the score in the answer output.

🐛 Proposed fix:

```diff
  if reranked_docs and self.instance.hasListener('answers'):
      context_parts = []
      for i, doc in enumerate(reranked_docs):
-         score = doc.metadata.get('hybrid_score', 'N/A') if doc.metadata else 'N/A'
+         score = f'{doc.score:.4f}' if doc.score is not None else 'N/A'
          snippet = (doc.page_content or '')[:500]
          context_parts.append(f'[Document {i + 1}] (score: {score})\n{snippet}')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/search_hybrid/IInstance.py` around lines 113 - 122, The answer assembly loop in IInstance.py is incorrectly reading the score from metadata ('hybrid_score') but the RRF score was saved on the Document object as doc.score; update the loop that builds context_parts (used before ans.setAnswer and self.instance.writeAnswers) to read the score from doc.score (falling back to 'N/A' if doc.score is None) instead of doc.metadata.get('hybrid_score') so the displayed scores reflect the reranked_doc.score values.

nodes/src/nodes/chunker/chunker_strategies.py (1)
127-160: ⚠️ Potential issue | 🟠 Major — Build overlapped recursive chunks from the source slice.

Line 132 prepends overlap from `raw_chunks`, but `_split_text()` has already removed boundary separators. When a split lands on `'\n\n'`, `'. '`, etc., `chunk_text` no longer matches `text[start_char:end_char]`, so the emitted offsets are wrong.

🛠️ Suggested fix:

```diff
 @@
  for i, raw in enumerate(raw_chunks):
-     # Compute overlap prefix from the previous chunk
-     if i > 0 and self.chunk_overlap > 0:
-         prev = raw_chunks[i - 1]
-         overlap_text = prev[-self.chunk_overlap :]
-         chunk_text = overlap_text + raw
-     else:
-         chunk_text = raw
-
-     # Find start_char position in original text
-     start_char = text.find(raw, search_start)
-     if start_char == -1:
-         start_char = search_start
-     # Adjust for overlap prefix
-     if i > 0 and self.chunk_overlap > 0:
-         overlap_start = max(0, start_char - self.chunk_overlap)
-         start_char = overlap_start
-     end_char = start_char + len(chunk_text)
+     raw_start = text.find(raw, search_start)
+     if raw_start == -1:
+         raw_start = search_start
+     raw_end = raw_start + len(raw)
+
+     if i > 0 and self.chunk_overlap > 0:
+         start_char = max(0, raw_start - self.chunk_overlap)
+     else:
+         start_char = raw_start
+     end_char = raw_end
+     chunk_text = text[start_char:end_char]
 @@
-     raw_start = text.find(raw, search_start)
-     if raw_start >= 0:
-         search_start = raw_start + len(raw)
+     search_start = raw_end
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 127 - 160, The overlap prepending step uses text transformed by _split_text(), so chunk_text no longer aligns with the original source and emitted offsets are wrong; update the overlap construction in the chunking routine to derive overlap and offsets from the original/raw text/raw_chunks (use the original raw_chunks text or original source slice to compute the overlap substring and compute start_char/end_char relative to the original source) rather than the post-split chunk_text, and ensure emitted offsets are calculated from that original text so returned chunk.offsets correspond to indices in the original source; locate references to raw_chunks, _split_text(), chunk_text and the code that sets start_char/end_char to make this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 216-277: SentenceChunker currently can emit chunks larger than
chunk_size because the budget check at SentenceChunker (use of current_sentences
and chunk_size) skips the first sentence and the finalize path restores overlap
without re-checking sizes; fix by enforcing chunk_size for any single sentence
and during finalization: when building a chunk in add_sentence (or wherever
current_sentences is appended), detect if a single sentence length > chunk_size
and split that sentence into smaller spans (respecting chunk_size and desired
overlap policy) before emitting; and in finalize/restore-overlap logic, before
appending overlap+sentence repeatedly trim the carried overlap (pop or shorten
overlap segments from current_sentences) until overlap + next sentence <=
chunk_size, and if still larger, apply the same fallback split to the oversized
sentence so every emitted chunk is <= chunk_size.
- Around line 336-349: The current mapping uses len(encoder.decode(...)) as a
character offset which breaks for multibyte UTF‑8 tokens; instead compute token
byte lengths using encoder.decode_single_token_bytes() (or aggregate its
results) to produce start_byte and end_byte offsets, then convert those byte
offsets back to character indices into the original text (e.g., by slicing
text.encode('utf-8') or using incremental decoding to map byte->char index)
before slicing; update the logic around encoder.decode(...) and the code that
builds slices in the functions that assemble token spans so all boundaries use
byte-accurate offsets mapped to character indices.
- Around line 311-312: The call to Encoding.encode(...) in chunker_strategies.py
is rejecting sentinel substrings like "<|endoftext|>" and raising ValueError;
update the Encoding.encode invocation (the call at/near line 312) to include the
parameter disallowed_special=() so sentinel strings are treated as normal text
(i.e., call Encoding.encode(..., disallowed_special=()) wherever the
chunking/tokenization routine—e.g., the method that invokes Encoding.encode in
this module—is encoding document text); ensure any similar Encoding.encode
usages in the same class/function are updated likewise.
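One way to implement the oversized-sentence fallback suggested for SentenceChunker above (a sketch; a real implementation might prefer splitting at whitespace or token boundaries rather than fixed-width slices):

```python
def split_oversized(sentence, chunk_size):
    """Hard-split a single sentence longer than chunk_size into
    pieces of at most chunk_size characters, so no emitted chunk
    can exceed the configured budget."""
    return [sentence[i:i + chunk_size] for i in range(0, len(sentence), chunk_size)]

pieces = split_oversized('x' * 25, 10)
```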
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 79-82: The shallow copy of Document in the chunking logic
(chunk_doc = copy.copy(document) in IInstance) leaves inherited embedding state
on the chunk; after setting chunk_doc.page_content and chunk_doc.metadata,
explicitly clear chunk_doc.embedding and chunk_doc.embedding_model (set them to
None) so each emitted chunk does not carry the parent document's vector state.
Ensure you update the code around where chunk_doc is created/modified
(references: chunk_doc, copy.copy(document), chunk_doc.metadata,
chunk_doc.page_content) before appending/emitting the chunk.
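A sketch of the embedding reset after the shallow copy; the `Document` class here is a minimal stand-in for the pipeline's type (assumption), but the clearing step mirrors the comment above.

```python
import copy

class Document:
    """Minimal stand-in for the pipeline Document."""
    def __init__(self, page_content, embedding=None, embedding_model=None):
        self.page_content = page_content
        self.embedding = embedding
        self.embedding_model = embedding_model

def make_chunk_doc(document, chunk_text):
    chunk_doc = copy.copy(document)   # shallow copy keeps every attribute...
    chunk_doc.page_content = chunk_text
    chunk_doc.embedding = None        # ...so drop the parent's vector state:
    chunk_doc.embedding_model = None  # the chunk's text no longer matches it
    return chunk_doc

parent = Document('full text', embedding=[0.1, 0.2], embedding_model='model-x')
chunk = make_chunk_doc(parent, 'full')
```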
In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 60-65: The code casts top_k and rrf_k from config without
validating their ranges; update the initialization in IGlobal (where alpha,
self.top_k, self.rrf_k are set) to validate that top_k and rrf_k are positive
integers (e.g., > 0) and clamp or raise a clear error/log if they are invalid,
similar to the alpha validation; ensure you convert the values to int after
validation and keep existing behavior for defaults (top_k=10, rrf_k=60) when
values are missing or invalid.
- Around line 46-47: The except block that catches a broad Exception (the line
with "except Exception as e:" followed by "warning(str(e))") should include the
suppression comment "# noqa: BLE001" to match other nodes; update that except
line to append the comment so the linter suppression is consistent with
chunker/IGlobal.py.
In `@test/nodes/test_chunker.py`:
- Around line 517-523: The _restore_modules method currently only restores
entries named in stub_names but leaves cached imports like modules starting with
"nodes.src.nodes.chunker." that were loaded against the stubs; update
_restore_modules (the function named _restore_modules) to also remove from
sys.modules any keys that match the chunker module namespace (e.g. start with
"nodes.src.nodes.chunker.") when cleaning up: for each such matching module, pop
it from sys.modules so subsequent tests don't reuse modules wired to the fakes;
keep the existing logic that restores saved[name] for stub_names that were
present.
In `@test/nodes/test_search_hybrid.py`:
- Around line 78-83: The teardown _restore_modules currently only restores exact
dependency names in _STUB_MODULES/_original_modules, leaving stubbed packages
(e.g., the fake rank_bm25 and any modules under search_hybrid that were imported
while stubs were active) cached in sys.modules; update _restore_modules so it
removes from sys.modules any entry whose name is equal to or starts with any
name in _STUB_MODULES (or whose module object is the stub) before restoring
originals, ensuring that both top-level stub modules and any submodules (e.g.,
"rank_bm25" and "rank_bm25.*" or "rocketlib.*") are purged so later tests import
clean copies.
---
Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 127-160: The overlap prepending step uses text transformed by
_split_text(), so chunk_text no longer aligns with the original source and
emitted offsets are wrong; update the overlap construction in the chunking
routine to derive overlap and offsets from the original/raw text/raw_chunks (use
the original raw_chunks text or original source slice to compute the overlap
substring and compute start_char/end_char relative to the original source)
rather than the post-split chunk_text, and ensure emitted offsets are calculated
from that original text so returned chunk.offsets correspond to indices in the
original source; locate references to raw_chunks, _split_text(), chunk_text and
the code that sets start_char/end_char to make this change.
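The intent of this comment can be shown with a small sketch: overlap and offsets are both derived from the original source text, so every emitted offset pair slices back to exactly the chunk text (the `emit_with_overlap` helper is illustrative, not the node's actual API):

```python
def emit_with_overlap(text: str, spans: list[tuple[int, int]], overlap: int) -> list[dict]:
    """Attach overlap taken from the ORIGINAL text so offsets stay valid."""
    chunks = []
    for start, end in spans:
        ov_start = max(0, start - overlap)  # extend backwards into the raw source
        chunks.append({
            'text': text[ov_start:end],
            'start_char': ov_start,         # index into the original source
            'end_char': end,
        })
    return chunks
```

Because the overlap substring is a slice of the raw source rather than of post-split text, `text[start_char:end_char]` always reproduces `chunk['text']`.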
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 113-122: The answer assembly loop in IInstance.py is incorrectly
reading the score from metadata ('hybrid_score') but the RRF score was saved on
the Document object as doc.score; update the loop that builds context_parts
(used before ans.setAnswer and self.instance.writeAnswers) to read the score
from doc.score (falling back to 'N/A' if doc.score is None) instead of
doc.metadata.get('hybrid_score') so the displayed scores reflect the
reranked_doc.score values.
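For context on the score being read here: the hybrid node fuses BM25 and vector rankings with Reciprocal Rank Fusion before attaching a score to each document. A minimal sketch of that fusion step, assuming the default `rrf_k` of 60 (the `rrf_fuse` name is illustrative, not the node's actual API):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document ranked well by both retrievers accumulates two reciprocal terms and floats to the top, which is the value that should land on `doc.score`.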
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 639ea044-72b7-4b42-a58b-933fa072c7b3
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml, !pnpm-lock.yaml
📒 Files selected for processing (14)
- nodes/src/nodes/chunker/IGlobal.py
- nodes/src/nodes/chunker/IInstance.py
- nodes/src/nodes/chunker/__init__.py
- nodes/src/nodes/chunker/chunker_strategies.py
- nodes/src/nodes/chunker/requirements.txt
- nodes/src/nodes/chunker/services.json
- nodes/src/nodes/search_hybrid/IGlobal.py
- nodes/src/nodes/search_hybrid/IInstance.py
- nodes/src/nodes/search_hybrid/__init__.py
- nodes/src/nodes/search_hybrid/hybrid_search.py
- nodes/src/nodes/search_hybrid/requirements.txt
- nodes/src/nodes/search_hybrid/services.json
- test/nodes/test_chunker.py
- test/nodes/test_search_hybrid.py
```python
# Compute what the span would be if we added this sentence
if current_sentences:
    candidate_len = (next_pos + len(sentence)) - sentence_positions[0]
else:
    candidate_len = len(sentence)

# If adding this sentence exceeds chunk_size and we have content, finalize current
if current_sentences and candidate_len > self.chunk_size:
    start_char = sentence_positions[0]
    end_char = sentence_positions[-1] + len(current_sentences[-1])
    chunk_text = text[start_char:end_char]

    result.append(
        {
            'text': chunk_text,
            'metadata': {
                'chunk_index': chunk_index,
                'start_char': start_char,
                'end_char': end_char,
            },
        }
    )
    chunk_index += 1

    # Compute overlap: keep trailing sentences whose actual span fits within overlap
    if self.chunk_overlap > 0:
        overlap_sentences: list[str] = []
        overlap_positions: list[int] = []
        for i in range(len(current_sentences) - 1, -1, -1):
            trial_positions = [sentence_positions[i]] + overlap_positions
            trial_sentences = [current_sentences[i]] + overlap_sentences
            if _span_len(trial_positions, trial_sentences) <= self.chunk_overlap:
                overlap_sentences = trial_sentences
                overlap_positions = trial_positions
            else:
                break
        current_sentences = overlap_sentences
        sentence_positions = overlap_positions
    else:
        current_sentences = []
        sentence_positions = []

# Add the sentence
current_sentences.append(sentence)
sentence_positions.append(next_pos)

# Emit the final chunk
if current_sentences:
    start_char = sentence_positions[0]
    end_char = sentence_positions[-1] + len(current_sentences[-1])
    chunk_text = text[start_char:end_char]

    result.append(
        {
            'text': chunk_text,
            'metadata': {
                'chunk_index': chunk_index,
                'start_char': start_char,
                'end_char': end_char,
            },
        }
    )
```
SentenceChunker can still emit chunks larger than chunk_size.
Line 223 only enforces the budget when current_sentences is non-empty, so a first sentence longer than the limit is emitted unchanged. After finalization, Lines 252-259 also restore overlap and append the current sentence without another size check, so overlap + sentence can overflow as well. Add a fallback split for oversized single sentences, and trim carried overlap until the new span fits before appending.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 216 - 277,
SentenceChunker currently can emit chunks larger than chunk_size because the
budget check at SentenceChunker (use of current_sentences and chunk_size) skips
the first sentence and the finalize path restores overlap without re-checking
sizes; fix by enforcing chunk_size for any single sentence and during
finalization: when building a chunk in add_sentence (or wherever
current_sentences is appended), detect if a single sentence length > chunk_size
and split that sentence into smaller spans (respecting chunk_size and desired
overlap policy) before emitting; and in finalize/restore-overlap logic, before
appending overlap+sentence repeatedly trim the carried overlap (pop or shorten
overlap segments from current_sentences) until overlap + next sentence <=
chunk_size, and if still larger, apply the same fallback split to the oversized
sentence so every emitted chunk is <= chunk_size.
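The fallback split the prompt suggests for oversized single sentences can be as simple as a fixed-window cut. A hedged sketch (the helper name is illustrative; the node may want to respect overlap policy as well):

```python
def split_oversized(sentence: str, chunk_size: int) -> list[str]:
    """Hard-split a single sentence longer than chunk_size into fixed-size
    windows so no emitted chunk can exceed the budget."""
    if len(sentence) <= chunk_size:
        return [sentence]
    return [sentence[i:i + chunk_size] for i in range(0, len(sentence), chunk_size)]
```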
```python
encoder = self._get_encoder()
tokens = encoder.encode(text)
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python - <<'PY'
import pathlib
import re
import subprocess
import sys
pkg = 'tiktoken'
req_path = pathlib.Path('nodes/src/nodes/chunker/requirements.txt')
if req_path.exists():
match = re.search(r'^\s*tiktoken(?:==([^\s#]+))?', req_path.read_text(), re.M)
if match and match.group(1):
pkg = f'tiktoken=={match.group(1)}'
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
sample = 'prefix <|endoftext|> suffix'
try:
print('default:', enc.encode(sample))
except Exception as exc:
print('default raised:', type(exc).__name__, exc)
print('disallowed_special=():', enc.encode(sample, disallowed_special=()))
PY
🏁 Script executed:
git ls-files nodes/src/nodes/chunker/ | head -20
🏁 Script executed:
cat -n nodes/src/nodes/chunker/chunker_strategies.py | sed -n '305,320p'
🏁 Script executed:
cat nodes/src/nodes/chunker/requirements.txt
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python - <<'PY'
import subprocess
import sys
# Install tiktoken with version constraint from requirements.txt
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'tiktoken>=0.7.0,<1.0.0'])
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
sample = 'prefix <|endoftext|> suffix'
print("Testing tiktoken behavior with sentinel strings:")
print()
try:
result = enc.encode(sample)
print('default call succeeded:', result)
except Exception as exc:
print('default call raised:', type(exc).__name__, '-', exc)
print()
try:
result = enc.encode(sample, disallowed_special=())
print('with disallowed_special=() call succeeded:', result)
except Exception as exc:
print('with disallowed_special=() raised:', type(exc).__name__, '-', exc)
print()
print("Testing with just the sentinel string:")
try:
result = enc.encode('<|endoftext|>')
print('sentinel alone (default) succeeded:', result)
except Exception as exc:
print('sentinel alone (default) raised:', type(exc).__name__)
try:
result = enc.encode('<|endoftext|>', disallowed_special=())
print('sentinel alone (disallowed_special=()) succeeded:', result)
except Exception as exc:
print('sentinel alone (disallowed_special=()) raised:', type(exc).__name__)
PY
Add disallowed_special=() parameter to treat sentinel strings as normal text.
Encoding.encode() rejects substrings such as <|endoftext|> by default, raising a ValueError. Line 312 will therefore raise on valid document content instead of chunking it.
🛠️ Suggested fix

```diff
- tokens = encoder.encode(text)
+ tokens = encoder.encode(text, disallowed_special=())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 311 - 312, The
call to Encoding.encode(...) in chunker_strategies.py is rejecting sentinel
substrings like "<|endoftext|>" and raising ValueError; update the
Encoding.encode invocation (the call at/near line 312) to include the parameter
disallowed_special=() so sentinel strings are treated as normal text (i.e., call
Encoding.encode(..., disallowed_special=()) wherever the chunking/tokenization
routine—e.g., the method that invokes Encoding.encode in this module—is encoding
document text); ensure any similar Encoding.encode usages in the same
class/function are updated likewise.
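The suggested call-site change can be isolated into a small wrapper. A sketch, assuming `encoder` is a tiktoken Encoding such as `tiktoken.get_encoding('cl100k_base')` (the `encode_document` name is hypothetical):

```python
def encode_document(encoder, text: str) -> list[int]:
    # disallowed_special=() makes tiktoken treat sentinel substrings such as
    # "<|endoftext|>" as ordinary text instead of raising ValueError.
    return encoder.encode(text, disallowed_special=())
```

Routing every encode call through one helper also keeps the behavior consistent if more call sites are added later.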
```python
# Decode tokens back to text, handling errors gracefully
try:
    chunk_text = encoder.decode(chunk_tokens)
except Exception:
    chunk_text = encoder.decode(chunk_tokens, errors='replace')

# Use cached character position for start_char (O(1) lookup)
start_char = char_pos_cache.get(start)
if start_char is None:
    # Fallback: decode prefix (should not happen with correct step caching)
    prefix_text = encoder.decode(tokens[:start])
    start_char = len(prefix_text)
    char_pos_cache[start] = start_char
end_char = start_char + len(chunk_text)
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python - <<'PY'
import pathlib
import re
import subprocess
import sys
pkg = 'tiktoken'
req_path = pathlib.Path('nodes/src/nodes/chunker/requirements.txt')
if req_path.exists():
match = re.search(r'^\s*tiktoken(?:==([^\s#]+))?', req_path.read_text(), re.M)
if match and match.group(1):
pkg = f'tiktoken=={match.group(1)}'
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
candidates = ['🙂', '🚀', '你好', 'café', 'naïve', '👨👩👧👦', '🏳️🌈']
candidates.extend(chr(cp) for cp in range(0x80, 0x2000) if chr(cp).isprintable())
for text in candidates:
tokens = enc.encode(text, disallowed_special=())
if len(tokens) < 2:
continue
prefix = b''
for i, tok in enumerate(tokens[:-1], 1):
prefix += enc.decode_single_token_bytes(tok)
try:
prefix.decode('utf-8')
except UnicodeDecodeError:
print('invalid_utf8_boundary', repr(text), tokens, 'boundary', i)
print('decoded_prefix', repr(enc.decode(tokens[:i], errors='replace')))
raise SystemExit(0)
print('no_invalid_boundary_found')
PY
🏁 Script executed:
cd nodes && find . -name "chunker_strategies.py" -o -name "requirements.txt" | head -20
🏁 Script executed:
cd nodes && wc -l ./src/nodes/chunker/chunker_strategies.py
🏁 Script executed:
cd nodes && sed -n '336,372p' ./src/nodes/chunker/chunker_strategies.py
🏁 Script executed:
cd nodes && sed -n '320,380p' ./src/nodes/chunker/chunker_strategies.py
🏁 Script executed:
cd nodes && cat ./src/nodes/chunker/requirements.txt
🏁 Script executed:
cd nodes && python - <<'PY'
import sys
import subprocess
# Install tiktoken with the version constraint
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'tiktoken>=0.7.0,<1.0.0'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
candidates = ['🙂', '🚀', '你好', 'café', 'naïve', '👨👩👧👦', '🏳️🌈']
candidates.extend(chr(cp) for cp in range(0x80, 0x2000) if chr(cp).isprintable())
invalid_boundaries_found = []
for text in candidates:
tokens = enc.encode(text, disallowed_special=())
if len(tokens) < 2:
continue
prefix = b''
for i, tok in enumerate(tokens[:-1], 1):
prefix += enc.decode_single_token_bytes(tok)
try:
prefix.decode('utf-8')
except UnicodeDecodeError:
invalid_boundaries_found.append({
'text': text,
'tokens': tokens,
'boundary_index': i,
'decoded_with_replacement': enc.decode(tokens[:i], errors='replace')
})
break
if invalid_boundaries_found:
print(f"FOUND {len(invalid_boundaries_found)} INVALID UTF-8 BOUNDARIES:")
for example in invalid_boundaries_found[:5]: # Show first 5
print(f" Text: {repr(example['text'])}, Tokens: {example['tokens']}, Boundary at token {example['boundary_index']}")
print(f" Decoded with replacement: {repr(example['decoded_with_replacement'])}")
else:
print("NO INVALID UTF-8 BOUNDARIES FOUND")
PY
Use byte-level offsets to correctly map token boundaries to character positions in non-ASCII text.
Tiktoken's token boundaries are byte-based and can split multibyte UTF-8 sequences. Lines 343–349 and 365–372 incorrectly use len(encoder.decode(...)) as a character offset, which fails when token boundaries land mid-character. When errors='replace' substitutes invalid sequences with '�', the character count diverges from the original text's indices. Build offsets from decode_single_token_bytes() aggregates, then map byte positions back to character indices before slicing text.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 336 - 349, The
current mapping uses len(encoder.decode(...)) as a character offset which breaks
for multibyte UTF‑8 tokens; instead compute token byte lengths using
encoder.decode_single_token_bytes() (or aggregate its results) to produce
start_byte and end_byte offsets, then convert those byte offsets back to
character indices into the original text (e.g., by slicing text.encode('utf-8')
or using incremental decoding to map byte->char index) before slicing; update
the logic around encoder.decode(...) and the code that builds slices in the
functions that assemble token spans so all boundaries use byte-accurate offsets
mapped to character indices.
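The byte-to-character mapping the prompt describes can be built in one pass over the original string. A minimal sketch (the helper name is illustrative): only byte offsets that start a character appear as keys, so a token boundary that lands mid-character is detectable as a missing key rather than silently producing a wrong index.

```python
def byte_to_char_index(text: str) -> dict[int, int]:
    """Map each character-starting UTF-8 byte offset of `text` to its char index."""
    mapping: dict[int, int] = {}
    byte_pos = 0
    for char_idx, ch in enumerate(text):
        mapping[byte_pos] = char_idx
        byte_pos += len(ch.encode('utf-8'))
    mapping[byte_pos] = len(text)  # end-of-text boundary
    return mapping
```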
```python
# Shallow copy of document, explicit copy of metadata only
chunk_doc = copy.copy(document)
chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
chunk_doc.page_content = chunk_data['text']
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo 'Schema files exposing Doc / DocMetadata fields:'
fd -i 'schema\.py$' -x rg -n -C2 '\bclass\s+Doc\b|\bclass\s+DocMetadata\b|\bembedding(s)?\b|\bvector(s)?\b' {}
echo
echo 'Node code referencing existing vector-bearing fields:'
rg -n -C2 '\b(embedding|embeddings|vector|vectors)\b' --glob '*.py' -g '!test/**' -g '!**/test/**'
echo
echo 'Chunk emission loop under review:'
sed -n '79,94p' nodes/src/nodes/chunker/IInstance.pyRepository: rocketride-org/rocketride-server
Length of output: 50389
Clear inherited embeddings when cloning documents for chunks.
Line 80 shallow-copies the entire document, and line 82 replaces only page_content. The Doc schema defines both embedding and embedding_model fields (see packages/client-python/src/rocketride/schema/doc.py), and each emitted chunk will inherit the parent document's vector state even though its text has changed. This causes stale embeddings to propagate downstream, poisoning indexing or allowing later nodes to skip required re-embedding.
Reset both fields before appending:

```python
chunk_doc.embedding = None
chunk_doc.embedding_model = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IInstance.py` around lines 79 - 82, The shallow copy
of Document in the chunking logic (chunk_doc = copy.copy(document) in IInstance)
leaves inherited embedding state on the chunk; after setting
chunk_doc.page_content and chunk_doc.metadata, explicitly clear
chunk_doc.embedding and chunk_doc.embedding_model (set them to None) so each
emitted chunk does not carry the parent document's vector state. Ensure you
update the code around where chunk_doc is created/modified (references:
chunk_doc, copy.copy(document), chunk_doc.metadata, chunk_doc.page_content)
before appending/emitting the chunk.
```python
except Exception as e:
    warning(str(e))
```
🧹 Nitpick | 🔵 Trivial
Add # noqa: BLE001 for consistency with other nodes.
The broad Exception catch should include the suppression comment for consistency with the established pattern (as done in chunker/IGlobal.py line 44).
♻️ Proposed fix

```diff
-        except Exception as e:
+        except Exception as e:  # noqa: BLE001 - intentional broad catch for dependency probing
             warning(str(e))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 46 - 47, The except
block that catches a broad Exception (the line with "except Exception as e:"
followed by "warning(str(e))") should include the suppression comment "# noqa:
BLE001" to match other nodes; update that except line to append the comment so
the linter suppression is consistent with chunker/IGlobal.py.
```python
alpha = float(config.get('alpha', 0.5))
self.top_k = int(config.get('top_k', 10))
self.rrf_k = int(config.get('rrf_k', 60))

# Validate alpha range
alpha = max(0.0, min(1.0, alpha))
```
🧹 Nitpick | 🔵 Trivial
Consider validating top_k and rrf_k parameters.
chunk_size/chunk_overlap are validated in the chunker node, but top_k and rrf_k are cast without range checks. Invalid values (e.g., top_k <= 0 or rrf_k <= 0) could cause unexpected behavior in the search engine.
🛡️ Proposed guardrails

```diff
 alpha = float(config.get('alpha', 0.5))
 self.top_k = int(config.get('top_k', 10))
 self.rrf_k = int(config.get('rrf_k', 60))
+if self.top_k <= 0:
+    raise ValueError(f'top_k must be positive, got {self.top_k}')
+if self.rrf_k <= 0:
+    raise ValueError(f'rrf_k must be positive, got {self.rrf_k}')

 # Validate alpha range
 alpha = max(0.0, min(1.0, alpha))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 60 - 65, The code
casts top_k and rrf_k from config without validating their ranges; update the
initialization in IGlobal (where alpha, self.top_k, self.rrf_k are set) to
validate that top_k and rrf_k are positive integers (e.g., > 0) and clamp or
raise a clear error/log if they are invalid, similar to the alpha validation;
ensure you convert the values to int after validation and keep existing behavior
for defaults (top_k=10, rrf_k=60) when values are missing or invalid.
```python
def _restore_modules(self, saved, stub_names):
    for name in stub_names:
        if saved.get(name) is None:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = saved[name]
```
Evict the stub-imported chunker modules during cleanup.
Lines 533-550 and 589-592 import nodes.src.nodes.chunker.* while rocketlib / ai.common are stubbed, but Lines 517-523 only restore the stubs themselves. The cached nodes.src.nodes.chunker.* modules survive in sys.modules, so later tests can reuse modules that are already wired to the fakes.
🧹 Suggested cleanup

```diff
 def _restore_modules(self, saved, stub_names):
+    for mod_name in list(sys.modules):
+        if mod_name.startswith('nodes.src.nodes.chunker'):
+            sys.modules.pop(mod_name, None)
     for name in stub_names:
         if saved.get(name) is None:
             sys.modules.pop(name, None)
         else:
             sys.modules[name] = saved[name]
```

Also applies to: 524-557, 581-592
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/nodes/test_chunker.py` around lines 517 - 523, The _restore_modules
method currently only restores entries named in stub_names but leaves cached
imports like modules starting with "nodes.src.nodes.chunker." that were loaded
against the stubs; update _restore_modules (the function named _restore_modules)
to also remove from sys.modules any keys that match the chunker module namespace
(e.g. start with "nodes.src.nodes.chunker.") when cleaning up: for each such
matching module, pop it from sys.modules so subsequent tests don't reuse modules
wired to the fakes; keep the existing logic that restores saved[name] for
stub_names that were present.
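The namespace purge this comment asks for can be sketched as a standalone helper. A hedged illustration, not the test file's actual code (`restore_modules` and its `namespace` parameter are hypothetical):

```python
import sys
import types

def restore_modules(saved: dict, stub_names: list[str], namespace: str) -> None:
    """Restore stubbed modules and evict everything imported under `namespace`
    while the stubs were active, so later tests get clean imports."""
    for mod_name in list(sys.modules):
        if mod_name == namespace or mod_name.startswith(namespace + '.'):
            sys.modules.pop(mod_name, None)
    for name in stub_names:
        if saved.get(name) is None:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = saved[name]
```

Iterating over `list(sys.modules)` avoids mutating the dict while walking it, and the prefix check catches submodules loaded transitively against the fakes.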
```python
def _restore_modules():
    for name in _STUB_MODULES:
        if _original_modules.get(name) is None:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = _original_modules[name]
```
Restore the sys.modules stubs in teardown.
Lines 86-87 install a fake rank_bm25 at module import time, and Lines 578-588 import nodes.src.nodes.search_hybrid.* while rocketlib / ai.common are stubbed. The cleanup paths only restore the dependency modules, so the BM25 stub and the stub-bound search_hybrid modules stay cached for later tests. That makes the suite order-dependent.
Also applies to: 86-87, 571-588
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/nodes/test_search_hybrid.py` around lines 78 - 83, The teardown
_restore_modules currently only restores exact dependency names in
_STUB_MODULES/_original_modules, leaving stubbed packages (e.g., the fake
rank_bm25 and any modules under search_hybrid that were imported while stubs
were active) cached in sys.modules; update _restore_modules so it removes from
sys.modules any entry whose name is equal to or starts with any name in
_STUB_MODULES (or whose module object is the stub) before restoring originals,
ensuring that both top-level stub modules and any submodules (e.g., "rank_bm25"
and "rank_bm25.*" or "rocketlib.*") are purged so later tests import clean
copies.
Thank you for the contribution and for the detailed follow-up explaining the value of the chunker node — it helped clarify the design intent. After review, we've decided to close this PR in favour of improving the existing preprocessor nodes instead. The reasoning:
If you'd like to continue contributing, we'd welcome a PR that extends
@asclearuc How can I reach out to you? In one of the PRs I was told to use this approach and divide this into two new PRs. I would like to point this out to you and discuss the next steps to complete this feature.
Summary
Review feedback addressed
- raise RuntimeError
- parentId back to source document

Test plan
Splits #514
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests