feat(nodes): add text chunker node with recursive, sentence, and token strategies (#2)
Conversation
Note: Reviews paused. This branch appears to be under active development, so CodeRabbit has automatically paused this review to avoid overwhelming you with comments during an influx of new commits. You can configure this behavior in the settings.
📝 Walkthrough
Adds a Text Chunker node (recursive, sentence, token) with global lifecycle and per-document chunk emission, and a Hybrid Search node implementing BM25 + vector RRF fusion. Includes service definitions, runtime requirements, and comprehensive unit and integration tests.
Sequence Diagram(s)
sequenceDiagram
actor Node
participant IGlobal as IGlobal (chunker)
participant Strategy as ChunkingStrategy
participant IInstance as ChunkerInstance
Node->>IGlobal: validateConfig()
IGlobal->>IGlobal: probe local requirements.txt
Node->>IGlobal: beginGlobal()
IGlobal->>Strategy: instantiate strategy (recursive|sentence|token)
Node->>IInstance: writeDocuments([Doc])
IInstance->>Strategy: chunk(text)
Strategy-->>IInstance: [chunk1, chunk2, ...]
loop per chunk
IInstance->>IInstance: copy Doc, set metadata(chunkId,parentId,chunk_index,start_char,end_char)
IInstance->>Node: emit chunked Doc
end
Node->>IGlobal: endGlobal()
IGlobal->>IGlobal: strategy = None
sequenceDiagram
actor Client
participant IGlobal as IGlobal (hybrid)
participant Engine as HybridSearchEngine
participant IInstance as HybridInstance
Client->>IGlobal: validateConfig()
IGlobal->>IGlobal: probe local requirements.txt
Client->>IGlobal: beginGlobal()
IGlobal->>Engine: instantiate HybridSearchEngine(alpha)
Client->>IInstance: writeQuestions(Question)
IInstance->>Engine: search(query, docs, vector_scores, top_k, rrf_k)
Engine->>Engine: bm25_search(...) and/or build vector ranking
Engine->>Engine: reciprocal_rank_fusion(...)
Engine-->>IInstance: ranked results (with rrf_score)
IInstance->>IInstance: map results -> Doc copies, set score/metadata
IInstance->>Client: emit documents (and answers if configured)
Client->>IGlobal: endGlobal()
IGlobal->>IGlobal: engine = None
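The reciprocal_rank_fusion step in the hybrid-search diagram above can be sketched as follows. This is a minimal standalone illustration, not the repository's HybridSearchEngine; the ranking inputs and the rrf_k default of 60 are assumptions.

```python
# Minimal RRF sketch: fuse ranked lists of doc ids into one ordering.
# Illustrative only; the real engine's names and signatures may differ.
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists; each list contributes 1/(rrf_k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranking = ["d2", "d1", "d3"]      # e.g. ordered by BM25 score
vector_ranking = ["d2", "d4", "d1"]    # e.g. ordered by cosine similarity
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

A document ranked first in both lists scores 2/(rrf_k + 1), so agreement between the BM25 and vector rankings dominates the fused order, while rrf_k damps the influence of any single top rank.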
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 213-215: The code rebuilds chunks using ' '.join(current_sentences), which normalizes whitespace and breaks character
offsets; instead compute chunk boundaries from the original text using
sentence_positions and slice the original text for chunk_text so spacing is
preserved and end_char is accurate: use start_char = sentence_positions[0] and
end_char = sentence_positions[-1] +
len(original_sentence_text_for_that_position) (or maintain a
sentence_end_positions array) then set chunk_text =
original_text[start_char:end_char]; update the code paths that set chunk_text,
start_char, and end_char (references: chunk_text, current_sentences,
sentence_positions, start_char, end_char, original_text/original_sentences)
accordingly.
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 45-53: The validateConfig method currently calls
depends(requirements) unconditionally and swallows all exceptions; change it so
you only invoke the installer when the node is configured to use the token
strategy (check the config/profile field used by this class, e.g.,
self.config.profile == 'token' or similar), and if depends(...) fails during
that branch surface the failure by raising an exception (or returning a failed
validation) instead of calling warning(); keep other profiles (e.g.,
'recursive', 'sentence') free of side effects and do not call depends for them.
Ensure you reference validateConfig and the depends(requirements) call when
making the change so the installer is only triggered for the token profile and
failures fail validation.
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 41-45: The code uses a single entry-scoped counter self.chunkId
(defined on IInstance and reset in open) and overwrites per-chunk metadata,
losing per-document chunk indices and start_char/end_char; update the logic so
emitted chunk metadata uses the per-document index from chunk_data['metadata']
(e.g., metadata.get('chunk') or metadata['chunk']) instead of incrementing
self.chunkId, or if you prefer an explicit per-document counter,
initialize/reset a document-scoped counter in IInstance.open and use that only
for chunks of the current document; adjust the emission code (the block
referenced around lines 76-92) to read and preserve metadata['start_char'] and
metadata['end_char'] and to reference either metadata['chunk'] or the reset
per-document counter rather than self.chunkId.
- Around line 79-89: The current code creates DocMetadata() with no required
fields when document.metadata is falsy, causing construction to fail before the
later fallback runs; fix the chunk metadata handling in the IInstance chunking
logic by initializing chunk_doc.metadata to a properly populated DocMetadata
when document.metadata is missing (e.g., DocMetadata(objectId=parent_id,
chunkId=self.chunkId, parentId=parent_id)), otherwise shallow-copy
document.metadata and then set chunk_doc.metadata.chunkId/self.chunkId and
parentId=parent_id as before (update references: chunk_doc, DocMetadata,
self.chunkId, parent_id).
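As a hedged sketch of the offset bookkeeping the chunker comments above ask for (slice the original text rather than re-join sentences), assuming a naive regex splitter and a max_sentences grouping rule rather than the node's real strategy:

```python
import re

# Illustrative offset-preserving chunker: boundaries come from spans in the
# ORIGINAL text, so whitespace is never normalized and end_char stays exact.
def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character spans covering the text sentence by sentence."""
    return [(m.start(), m.end()) for m in re.finditer(r'[^.!?]+[.!?]*', text)]

def chunk_by_sentences(text: str, max_sentences: int = 2) -> list[dict]:
    chunks = []
    spans = sentence_spans(text)
    for i in range(0, len(spans), max_sentences):
        group = spans[i:i + max_sentences]
        start_char, end_char = group[0][0], group[-1][1]
        # Slicing the source text preserves "\n\n", tabs, and repeated spaces,
        # and guarantees text[start_char:end_char] == chunk_text.
        chunks.append({'text': text[start_char:end_char],
                       'start_char': start_char, 'end_char': end_char})
    return chunks
```

Because every chunk is a literal slice of the source, downstream consumers can always recover the chunk from (start_char, end_char) alone.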
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 81815d7d-7bd7-4c85-a219-0d14c343c8e0
📒 Files selected for processing (7)
nodes/src/nodes/chunker/IGlobal.py
nodes/src/nodes/chunker/IInstance.py
nodes/src/nodes/chunker/__init__.py
nodes/src/nodes/chunker/chunker_strategies.py
nodes/src/nodes/chunker/requirements.txt
nodes/src/nodes/chunker/services.json
test/nodes/test_chunker.py
nodes/src/nodes/chunker/IGlobal.py
Outdated
def validateConfig(self):
    """Validate that tiktoken dependency is available (only needed for token strategy)."""
    try:
        from depends import depends

        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
    except Exception as e:
        warning(str(e))
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that validateConfig installs requirements unconditionally and that
# depends() performs installs rather than a pure validation check.
sed -n '45,53p' nodes/src/nodes/chunker/IGlobal.py
sed -n '71,175p' packages/ai/src/ai/common/config.py
sed -n '778,809p' packages/server/engine-lib/rocketlib-python/lib/depends.py

Repository: nihalnihalani/rocketride-server
Length of output: 5536
🏁 Script executed:
#!/bin/bash
# Get the full IGlobal.py file to see imports and complete validateConfig method
cat -n nodes/src/nodes/chunker/IGlobal.py | head -80

Repository: nihalnihalani/rocketride-server
Length of output: 4274
🏁 Script executed:
#!/bin/bash
# Check if Config is imported/available in IGlobal.py
grep -n "^from\|^import\|Config" nodes/src/nodes/chunker/IGlobal.py

Repository: nihalnihalani/rocketride-server
Length of output: 414
🏁 Script executed:
#!/bin/bash
# Verify strategy field in config by checking preconfig in a service definition
# Look for where strategy might be defined
find . -type f -name "*.py" -o -name "*.json" | head -20
rg -i "strategy.*token\|token.*strategy" --type=py --type=json -m 3

Repository: nihalnihalani/rocketride-server
Length of output: 724
Only resolve tiktoken requirements for the token profile, and fail validation if that install fails.
depends(requirements) is an installer, not a cheap availability check. Calling it unconditionally in validateConfig() adds unnecessary side effects for recursive and sentence configs, and the blanket except Exception masks true token setup failures that surface later at runtime instead of during validation.
🔧 Suggested change
 def validateConfig(self):
     """Validate that tiktoken dependency is available (only needed for token strategy)."""
-    try:
-        from depends import depends
-
-        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
-        depends(requirements)
-    except Exception as e:
-        warning(str(e))
+    config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)
+    if config.get('strategy', 'recursive') != 'token':
+        return
+
+    from depends import depends
+
+    requirements = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')
+    depends(requirements)

🧰 Tools
🪛 Ruff (0.15.7)
[warning] 52-52: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IGlobal.py` around lines 45 - 53, The validateConfig
method currently calls depends(requirements) unconditionally and swallows all
exceptions; change it so you only invoke the installer when the node is
configured to use the token strategy (check the config/profile field used by
this class, e.g., self.config.profile == 'token' or similar), and if
depends(...) fails during that branch surface the failure by raising an
exception (or returning a failed validation) instead of calling warning(); keep
other profiles (e.g., 'recursive', 'sentence') free of side effects and do not
call depends for them. Ensure you reference validateConfig and the
depends(requirements) call when making the change so the installer is only
triggered for the token profile and failures fail validation.
chunkId: int = 0

def open(self, obj: Entry):
    """Reset chunk counter for each new object."""
    self.chunkId = 0
Use a per-document chunk index instead of the entry-scoped counter.
chunk_data['metadata'] already contains the chunk's index within its source document, but this code discards it and increments self.chunkId across the whole entry. If writeDocuments() receives multiple documents, the first chunk of the second document will no longer be 0, and the strategy's start_char/end_char metadata is lost before emission.
🔧 Suggested change
-for chunk_data in chunks:
+for chunk_idx, chunk_data in enumerate(chunks):
     # Shallow copy of document, explicit copy of metadata only
     chunk_doc = copy.copy(document)
     if document.metadata is not None:
         chunk_doc.metadata = copy.copy(document.metadata)
     else:
@@
-    chunk_doc.metadata.chunkId = self.chunkId
+    chunk_doc.metadata.chunkId = chunk_idx
     chunk_doc.metadata.parentId = parent_id
+    chunk_doc.metadata.startChar = chunk_data['metadata']['start_char']
+    chunk_doc.metadata.endChar = chunk_data['metadata']['end_char']
+    chunk_doc.metadata.totalChunks = total_chunks
-    self.chunkId += 1
     output_docs.append(chunk_doc)

Also applies to: 76-92
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 43-43: Unused method argument: obj
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IInstance.py` around lines 41 - 45, The code uses a
single entry-scoped counter self.chunkId (defined on IInstance and reset in
open) and overwrites per-chunk metadata, losing per-document chunk indices and
start_char/end_char; update the logic so emitted chunk metadata uses the
per-document index from chunk_data['metadata'] (e.g., metadata.get('chunk') or
metadata['chunk']) instead of incrementing self.chunkId, or if you prefer an
explicit per-document counter, initialize/reset a document-scoped counter in
IInstance.open and use that only for chunks of the current document; adjust the
emission code (the block referenced around lines 76-92) to read and preserve
metadata['start_char'] and metadata['end_char'] and to reference either
metadata['chunk'] or the reset per-document counter rather than self.chunkId.
nodes/src/nodes/chunker/IInstance.py
Outdated
# Shallow copy of document, explicit copy of metadata only
chunk_doc = copy.copy(document)
chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
chunk_doc.page_content = chunk_data['text']

# Update metadata
if chunk_doc.metadata is not None:
    chunk_doc.metadata.chunkId = self.chunkId
    chunk_doc.metadata.parentId = parent_id
else:
    chunk_doc.metadata = DocMetadata(objectId=parent_id, chunkId=self.chunkId, parentId=parent_id)
DocMetadata() cannot be constructed without required fields here.
When document.metadata is None, Line 81 still builds DocMetadata() with no objectId/chunkId, so the real model will raise before the fallback on Line 89 can run. That makes metadata-less inputs fail in production; the current test stub only passes because it silently fills those fields in.
🔧 Suggested change
 for chunk_data in chunks:
     # Shallow copy of document, explicit copy of metadata only
     chunk_doc = copy.copy(document)
-    chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
+    if document.metadata is not None:
+        chunk_doc.metadata = copy.copy(document.metadata)
+    else:
+        chunk_doc.metadata = DocMetadata(
+            objectId=parent_id,
+            chunkId=0,
+            parentId=parent_id,
+        )
     chunk_doc.page_content = chunk_data['text']

     # Update metadata
-    if chunk_doc.metadata is not None:
-        chunk_doc.metadata.chunkId = self.chunkId
-        chunk_doc.metadata.parentId = parent_id
-    else:
-        chunk_doc.metadata = DocMetadata(objectId=parent_id, chunkId=self.chunkId, parentId=parent_id)
+    chunk_doc.metadata.chunkId = self.chunkId
+    chunk_doc.metadata.parentId = parent_id
+ chunk_doc.metadata.parentId = parent_id🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IInstance.py` around lines 79 - 89, The current code
creates DocMetadata() with no required fields when document.metadata is falsy,
causing construction to fail before the later fallback runs; fix the chunk
metadata handling in the IInstance chunking logic by initializing
chunk_doc.metadata to a properly populated DocMetadata when document.metadata is
missing (e.g., DocMetadata(objectId=parent_id, chunkId=self.chunkId,
parentId=parent_id)), otherwise shallow-copy document.metadata and then set
chunk_doc.metadata.chunkId/self.chunkId and parentId=parent_id as before (update
references: chunk_doc, DocMetadata, self.chunkId, parent_id).
Actionable comments posted: 7
♻️ Duplicate comments (1)
nodes/src/nodes/chunker/IGlobal.py (1)
42-42: 🧹 Nitpick | 🔵 Trivial

Use os.path.join for path construction.
Prefer os.path.join() over string concatenation for cross-platform compatibility.

♻️ Suggested fix
-requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
+requirements = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IGlobal.py` at line 42, The path for the requirements file is built via string concatenation into the variable requirements; replace that concatenation with os.path.join to be cross-platform safe: use os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt') (update the assignment to the requirements variable in IGlobal.py) so the path is constructed correctly on Windows and POSIX systems.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/IInstance.py`:
- Line 28: Replace the typing.List import with the built-in list type: remove
the line "from typing import List" and update all type hints in this module
(e.g., any method signatures, return types or attributes in IInstance) that use
"List[...]" to use "list[...]" instead; ensure imports are cleaned up and run a
quick type-check to confirm all references to List are updated.
In `@nodes/src/nodes/search_hybrid/hybrid_search.py`:
- Line 29: Replace deprecated typing generics with built-in generics in
hybrid_search.py: remove Dict and List from the import line (keep Any and
Optional) and update all type hints in the file (e.g., change Dict[str, Any] to
dict[str, Any] and List[str] to list[str]); ensure function signatures and
variable annotations in functions like any that reference List or Dict are
updated accordingly and tests/types still pass.
- Around line 47-53: The _tokenize function accepts None (it checks `if not
text`) but is annotated as `text: str`; update the type hint to `Optional[str]`
and import Optional from typing to reflect allowed None input, e.g., change the
signature of `_tokenize` to accept Optional[str] and ensure callers still pass
strings or None; keep the implementation same and update any docstring if
desired to note None handling.
In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 39-47: The validateConfig method currently swallows all exceptions
from depends(...) by calling warning(str(e)), which will lead to obscure
failures later (e.g., bm25_search raising ModuleNotFoundError); update
validateConfig to log the failure at error level with context (include that
rank_bm25 installation failed and the requirements path) and then re-raise the
original exception (or raise a new RuntimeError) so startup clearly fails;
specifically modify validateConfig (the try/except around depends(requirements))
to use an error-level log and then raise the caught exception so callers like
bm25_search see a clear startup error.
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 114-118: The answer shows "score: N/A" because the reranker stores
the score on each Document as reranked_doc.score (used in reranked_docs) but the
code in IInstance.py reads metadata['hybrid_score']; fix by using the
Document.score property when building context_parts (replace the metadata lookup
with doc.score or a formatted fallback) so context_parts uses the actual
rrf_score from reranked_docs; update the block that constructs context_parts
(where reranked_docs, context_parts and snippet are referenced) to pull score =
doc.score if present.
- Line 28: The file IInstance.py imports typing.List and uses List[Doc] in type
annotations; update to Python 3.10+ built-in generics by removing the
typing.List import and replacing every occurrence of List[Doc] with list[Doc]
(specifically the uses on the lines indicated where List[Doc] appears — e.g., in
the function/class annotations around the symbols that reference Doc); ensure
import statements are cleaned up so only necessary types remain.
In `@test/nodes/test_search_hybrid.py`:
- Around line 646-675: The test currently passes hybrid_score in metadata which
production ignores; update test_structured_answer_output to rely on and assert
the Doc.score values instead of metadata: create docs with score=0.85 and
score=0.72 (remove/ignore metadata={'hybrid_score': ...}), call
inst.writeQuestions as before, then assert the answer_text contains the actual
formatted score strings (e.g., '(score: 0.85' and '(score: 0.72') alongside the
document markers; reference symbols: test_structured_answer_output,
schema_mod.Doc, HybridSearchEngine, inst.writeQuestions, and
mock_instance.writeAnswers.
---
Duplicate comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Line 42: The path for the requirements file is built via string concatenation
into the variable requirements; replace that concatenation with os.path.join to
be cross-platform safe—use
os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')
(update the assignment to the requirements variable in IGlobal.py) so the path
is constructed correctly on Windows and POSIX systems.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: aa90780e-cf90-48d9-bf49-85fd414fd033
📒 Files selected for processing (10)
nodes/src/nodes/chunker/IGlobal.py
nodes/src/nodes/chunker/IInstance.py
nodes/src/nodes/chunker/services.json
nodes/src/nodes/search_hybrid/IGlobal.py
nodes/src/nodes/search_hybrid/IInstance.py
nodes/src/nodes/search_hybrid/__init__.py
nodes/src/nodes/search_hybrid/hybrid_search.py
nodes/src/nodes/search_hybrid/requirements.txt
nodes/src/nodes/search_hybrid/services.json
test/nodes/test_search_hybrid.py
def validateConfig(self):
    """Validate that the rank_bm25 dependency is available."""
    try:
        from depends import depends

        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
    except Exception as e:
        warning(str(e))
Suppressing dependency installation errors may cause confusing downstream failures.
validateConfig() catches all exceptions and only logs a warning. If rank_bm25 fails to install, bm25_search() will later fail with an obscure ModuleNotFoundError rather than a clear startup error. Consider re-raising after logging, or at minimum logging at error level with more context.
Proposed improvement
 def validateConfig(self):
     """Validate that the rank_bm25 dependency is available."""
     try:
         from depends import depends
         requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
         depends(requirements)
-    except Exception as e:
-        warning(str(e))
+    except Exception as e:
+        warning(f'Failed to install rank_bm25 dependency: {e}')
+        raise

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def validateConfig(self):
    """Validate that the rank_bm25 dependency is available."""
    try:
        from depends import depends
        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
        depends(requirements)
    except Exception as e:
        warning(f'Failed to install rank_bm25 dependency: {e}')
        raise
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 46-46: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 39 - 47, The
validateConfig method currently swallows all exceptions from depends(...) by
calling warning(str(e)), which will lead to obscure failures later (e.g.,
bm25_search raising ModuleNotFoundError); update validateConfig to log the
failure at error level with context (include that rank_bm25 installation failed
and the requirements path) and then re-raise the original exception (or raise a
new RuntimeError) so startup clearly fails; specifically modify validateConfig
(the try/except around depends(requirements)) to use an error-level log and then
raise the caught exception so callers like bm25_search see a clear startup
error.
Actionable comments posted: 3
♻️ Duplicate comments (2)
nodes/src/nodes/chunker/IGlobal.py (1)
39-45: ⚠️ Potential issue | 🟠 Major

Only probe tiktoken for token configs, and let failures fail validation.
Line 43 still runs depends(requirements) for every profile, and Line 45 downgrades real token setup failures to warnings. That makes recursive/sentence validation side-effectful and defers broken token configs until runtime.

🔧 Suggested fix
-def validateConfig(self):
-    """Validate that tiktoken dependency is available (only needed for token strategy)."""
-    try:
-        from depends import depends
-
-        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
-        depends(requirements)
-    except Exception as e:  # noqa: BLE001 - intentional broad catch for dependency probing
-        warning(str(e))
+def validateConfig(self, syntaxOnly: bool) -> None:
+    """Validate token-chunker dependencies when that strategy is selected."""
+    if syntaxOnly:
+        return
+
+    config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)
+    if config.get('strategy', 'recursive') != 'token':
+        return
+
+    from depends import depends
+
+    requirements = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')
+    depends(requirements)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IGlobal.py` around lines 39 - 45, The current code unconditionally calls depends(requirements) and catches all exceptions, turning missing/invalid token setup into a warning; instead, only probe for the tiktoken dependency when validating token-related configs and surface failures as errors. Replace the unconditional depends(requirements) call with a targeted import/check for tiktoken inside the token validation path (where token configs are validated), and on failure raise a ValidationError (or re-raise the exception) rather than calling warning(str(e)); keep depends() out of generic profile probing and limit use to explicit token config checks (referencing depends, requirements, and warning in the diff).

nodes/src/nodes/chunker/chunker_strategies.py (1)

211-247: ⚠️ Potential issue | 🟠 Major

Use real source spans when enforcing sentence chunk_size and chunk_overlap.
Lines 212, 238, and 255 still treat every sentence boundary as a single space, but Lines 217 and 266 now slice the original text with its original whitespace. With \n\n, tabs, or repeated spaces between sentences, a chunk can exceed chunk_size and the retained overlap can exceed chunk_overlap even though the bookkeeping says it fits. Compute those lengths from the actual sentence_start_map spans instead of synthetic +1 separators.

Also applies to: 253-258
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 211 - 247, The bookkeeping currently assumes a single-space separator between sentences (using +1) which mismatches actual spans and can let chunks/overlaps exceed chunk_size/chunk_overlap; replace synthetic +1 logic with real spans from sentence_start_map: compute sentence_len and gaps using sentence_start_map (e.g., gap = sentence_start_map[i] - (prev_start + len(prev_sentence))) and derive current_len as the span from sentence_positions[0] start to the end of the candidate sentence (end_char - start_char), use that span when checking current_len + new_sentence_span > self.chunk_size (and similarly when computing candidate overlap_len for the overlap loop against self.chunk_overlap), and update current_len, sentence_positions, and overlap_len consistently from those real start/end spans rather than adding 1 per boundary (apply to the checks and updates around current_len, sentence_len, candidate, and overlap_len involving current_sentences, sentence_positions, and sentence_start_map).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 129-145: The current logic unconditionally shifts start_char back
by self.chunk_overlap which can misalign offsets when the previous chunk
contributed fewer characters or separators were dropped; instead, compute the
actual overlap length from the previous chunk slice used to build chunk_text
(e.g., overlap_text = prev[-self.chunk_overlap:]), locate that overlap_text
immediately before the found start (using text.rfind or similar constrained
search), set overlap_len to the number of characters actually matched in the
source (0 if not found), and then adjust start_char by that actual overlap_len
(not the configured self.chunk_overlap); update end_char = start_char +
len(chunk_text) as before so that text[start_char:end_char] matches chunk_text.
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Line 37: The override of validateConfig currently uses def
validateConfig(self) but must match the engine signature validateConfig(self,
syntaxOnly: bool); update the IGlobal.validateConfig method to accept the
syntaxOnly: bool parameter, preserve existing logic, and if the method calls the
parent implementation or other helpers, pass the syntaxOnly flag through (e.g.,
when invoking super().validateConfig or other validators) so calling the
framework with a syntaxOnly argument no longer raises a TypeError.
In `@test/nodes/test_chunker.py`:
- Around line 344-369: The test currently only counts decode() calls but misses
large-prefix decodes; modify TrackingEncoder.decode to record the length of the
tokens argument (e.g., append len(tokens) to a new list or update a
max_decoded_len variable) so you can assert no decode call ever receives a slice
larger than the expected bound (use chunker.chunk_size or computed step =
chunker.chunk_size - chunker.chunk_overlap); after calling chunker.chunk assert
max_decoded_len <= chunker.chunk_size (or <= step if that's the intended bound)
in addition to the existing call count check.
---
Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 211-247: The bookkeeping currently assumes a single-space
separator between sentences (using +1) which mismatches actual spans and can let
chunks/overlaps exceed chunk_size/chunk_overlap; replace synthetic +1 logic with
real spans from sentence_start_map: compute sentence_len and gaps using
sentence_start_map (e.g., gap = sentence_start_map[i] - (prev_start +
len(prev_sentence))) and derive current_len as the span from
sentence_positions[0] start to the end of the candidate sentence (end_char -
start_char), use that span when checking current_len + new_sentence_span >
self.chunk_size (and similarly when computing candidate overlap_len for the
overlap loop against self.chunk_overlap), and update current_len,
sentence_positions, and overlap_len consistently from those real start/end spans
rather than adding 1 per boundary (apply to the checks and updates around
current_len, sentence_len, candidate, and overlap_len involving
current_sentences, sentence_positions, and sentence_start_map).
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 39-45: The current code unconditionally calls
depends(requirements) and catches all exceptions, turning missing/invalid token
setup into a warning; instead, only probe for the tiktoken dependency when
validating token-related configs and surface failures as errors. Replace the
unconditional depends(requirements) call with a targeted import/check for
tiktoken inside the token validation path (where token configs are validated),
and on failure raise a ValidationError (or re-raise the exception) rather than
calling warning(str(e)); keep depends() out of generic profile probing and limit
use to explicit token config checks (referencing depends, requirements, and
warning in the diff).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: aa447cd1-e070-4408-9c9b-919e10d019ea
📒 Files selected for processing (4)
nodes/src/nodes/chunker/IGlobal.py
nodes/src/nodes/chunker/IInstance.py
nodes/src/nodes/chunker/chunker_strategies.py
test/nodes/test_chunker.py
```python
class IGlobal(IGlobalBase):
    strategy: ChunkingStrategy | None = None

    def validateConfig(self):
```
Match the engine's validateConfig signature.
The lifecycle hook is `validateConfig(self, syntaxOnly: bool)`. Overriding it as `def validateConfig(self)` means the first framework call with `syntaxOnly` will raise `TypeError` before validation runs.
🔧 Suggested fix
```diff
- def validateConfig(self):
+ def validateConfig(self, syntaxOnly: bool) -> None:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IGlobal.py` at line 37, The override of
validateConfig currently uses def validateConfig(self) but must match the engine
signature validateConfig(self, syntaxOnly: bool); update the
IGlobal.validateConfig method to accept the syntaxOnly: bool parameter, preserve
existing logic, and if the method calls the parent implementation or other
helpers, pass the syntaxOnly flag through (e.g., when invoking
super().validateConfig or other validators) so calling the framework with a
syntaxOnly argument no longer raises a TypeError.
```python
def test_start_char_incremental_tracking(self):
    """start_char should be computed incrementally (not via O(n^2) prefix decode)."""
    chunker = TokenChunker(chunk_size=10, chunk_overlap=0)

    # Mock encoder where each character is a token and decode returns exact chars
    call_counts = {'decode': 0}

    class TrackingEncoder:
        def encode(self, text):
            return list(range(len(text)))

        def decode(self, tokens, **kwargs):
            call_counts['decode'] += 1
            return 'x' * len(tokens)

    chunker._encoder = TrackingEncoder()
    text = 'A' * 50  # 50 tokens -> 5 chunks of 10
    chunks = chunker.chunk(text)
    assert len(chunks) == 5

    # With O(n^2) approach, decode would be called 5 (chunks) + 4 (prefixes) = 9 times
    # With incremental approach, decode is called 5 (chunks) + 4 (steps) = 9 times
    # but no prefix decode calls scale with chunk index.
    # Key check: decode should NOT be called with tokens[:N] for large N.
    # We verify the count is bounded linearly: at most 2*num_chunks calls
    assert call_counts['decode'] <= 2 * len(chunks), f'decode called {call_counts["decode"]} times for {len(chunks)} chunks; expected at most {2 * len(chunks)} (linear)'
```
🧹 Nitpick | 🔵 Trivial
This test doesn't detect the quadratic prefix-decode regression.
Line 369 only counts decode() calls. An implementation that repeatedly does decode(tokens[:start]) would make the same number of calls, just with much larger slices, so this assertion still passes. Record the token-window lengths (or the largest decoded slice) and assert they stay bounded by chunk_size/step instead.
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 352-352: Missing return type annotation for private function encode
(ANN202)
[warning] 355-355: Missing return type annotation for private function decode
(ANN202)
[warning] 355-355: Missing type annotation for **kwargs
(ANN003)
[warning] 355-355: Unused method argument: kwargs
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/nodes/test_chunker.py` around lines 344 - 369, The test currently only
counts decode() calls but misses large-prefix decodes; modify
TrackingEncoder.decode to record the length of the tokens argument (e.g., append
len(tokens) to a new list or update a max_decoded_len variable) so you can
assert no decode call ever receives a slice larger than the expected bound (use
chunker.chunk_size or computed step = chunker.chunk_size -
chunker.chunk_overlap); after calling chunker.chunk assert max_decoded_len <=
chunker.chunk_size (or <= step if that's the intended bound) in addition to the
existing call count check.
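The bounded-slice assertion suggested above can be sketched standalone. The toy chunk loop below stands in for `TokenChunker.chunk` (an assumption — the real loop lives in the chunker), but it shows the key idea: track the largest slice handed to `decode()` so a quadratic prefix decode fails the test even though the call count stays the same.

```python
# Sketch of the suggested regression check: record the largest token slice
# passed to decode(). A quadratic implementation calling decode(tokens[:N])
# with growing N makes the same number of calls, but with ever-larger slices,
# so asserting on max_decoded_len catches it where a call count cannot.
class TrackingEncoder:
    def __init__(self):
        self.max_decoded_len = 0

    def encode(self, text):
        # One token per character, mirroring the test's mock encoder.
        return list(range(len(text)))

    def decode(self, tokens):
        self.max_decoded_len = max(self.max_decoded_len, len(tokens))
        return 'x' * len(tokens)

enc = TrackingEncoder()
chunk_size = 10
step = chunk_size  # chunk_overlap == 0
tokens = enc.encode('A' * 50)

# Toy stand-in for the chunk loop: decode only fixed-size windows.
chunks = [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```

With this in place, the test can assert `enc.max_decoded_len <= chunk_size` after chunking, which is the bound the review comment asks for.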
Actionable comments posted: 2
♻️ Duplicate comments (1)
nodes/src/nodes/chunker/chunker_strategies.py (1)
129-144: ⚠️ Potential issue | 🟠 Major — Overlap offset derived from config, not actual overlap length.
When the previous chunk is shorter than `chunk_overlap`, `overlap_text` will have fewer characters than `chunk_overlap`. However, `start_char` is still shifted back by the full `self.chunk_overlap` (line 142), causing `text[start_char:end_char]` to not match `chunk_text`. Additionally, separators consumed during splitting (e.g., `". "`) appear in the original text between chunks but not in the reconstructed overlap. This means the text slice and the constructed `chunk_text` will differ.
🔧 Proposed fix
```diff
 for i, raw in enumerate(raw_chunks):
     # Compute overlap prefix from the previous chunk
     if i > 0 and self.chunk_overlap > 0:
         prev = raw_chunks[i - 1]
-        overlap_text = prev[-self.chunk_overlap :]
-        chunk_text = overlap_text + raw
-    else:
-        chunk_text = raw
+        actual_overlap_len = min(self.chunk_overlap, len(prev))
+    else:
+        actual_overlap_len = 0

     # Find start_char position in original text
-    start_char = text.find(raw, search_start)
-    if start_char == -1:
-        start_char = search_start
-    # Adjust for overlap prefix
-    if i > 0 and self.chunk_overlap > 0:
-        overlap_start = max(0, start_char - self.chunk_overlap)
-        start_char = overlap_start
-    end_char = start_char + len(chunk_text)
+    raw_start = text.find(raw, search_start)
+    if raw_start == -1:
+        raw_start = search_start
+    raw_end = raw_start + len(raw)
+
+    # Derive overlap from actual source text, not reconstructed chunk
+    start_char = max(0, raw_start - actual_overlap_len)
+    chunk_text = text[start_char:raw_end]
+    end_char = raw_end

     result.append(
         {
             'text': chunk_text,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 129 - 144, The overlap adjustment uses the configured self.chunk_overlap instead of the actual overlap string length and thus misaligns slices; in the chunk creation logic (symbols: raw_chunks, overlap_text, chunk_text, start_char, search_start, self.chunk_overlap) compute the actual_overlap = len(overlap_text) (or locate overlap_text in the prior region of text) and subtract actual_overlap from start_char instead of self.chunk_overlap, and also ensure any separators consumed by splitting are considered when locating overlap_text so that text[start_char:end_char] matches chunk_text; update the overlap-start calculation and keep end_char = start_char + len(chunk_text).
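As a standalone illustration of the fix described above (a hypothetical helper, not the node's actual code), deriving the chunk text as a slice of the source makes the `text[start_char:end_char] == chunk_text` invariant hold by construction, even when the previous chunk is shorter than the configured overlap:

```python
# Minimal sketch, assuming raw_chunks are substrings of text in order:
# clamp the overlap to what the previous chunk actually contains, then take
# the chunk as a contiguous slice of the source so offsets can never drift.
def chunk_with_overlap(text, raw_chunks, chunk_overlap):
    result, search_start = [], 0
    for i, raw in enumerate(raw_chunks):
        raw_start = text.find(raw, search_start)
        if raw_start == -1:
            raw_start = search_start
        raw_end = raw_start + len(raw)
        # Overlap is bounded by the previous chunk's real length.
        overlap = min(chunk_overlap, len(raw_chunks[i - 1])) if i > 0 else 0
        start_char = max(0, raw_start - overlap)
        result.append({'text': text[start_char:raw_end],
                       'start_char': start_char,
                       'end_char': raw_end})
        search_start = raw_end
    return result

text = 'Hi. Second sentence here.'
chunks = chunk_with_overlap(text, ['Hi.', 'Second sentence here.'], 10)
```

Here the first chunk is only 3 characters, so the second chunk's overlap is clamped to 3 rather than the configured 10, and every recorded span still matches its text exactly.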
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Line 29: Replace the deprecated typing.List usage: remove List from the import
statement and update all type annotations that use List[str] to use the built-in
list[str] (e.g., change any function signatures in chunker_strategies.py that
currently use List[str] to list[str]); keep Optional if still used (so import
only Optional from typing or remove typing import entirely if unused). Ensure
the import line becomes "from typing import Optional" (or remove it) and update
the annotations in the functions that referenced List[str].
In `@test/nodes/test_chunker.py`:
- Around line 608-618: The test method
test_raises_runtime_error_when_strategy_is_none unpacks RC from
self._load_iinstance() but never uses it; update the tuple unpacking returned by
_load_iinstance() to omit RC (e.g., assign only saved, names, IInstance) or
replace RC with an underscore to indicate an unused value, ensuring the call
remains _load_iinstance() and other symbols (Doc, _make_instance,
inst.writeDocuments) are unchanged.
---
Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 129-144: The overlap adjustment uses the configured
self.chunk_overlap instead of the actual overlap string length and thus
misaligns slices; in the chunk creation logic (symbols: raw_chunks,
overlap_text, chunk_text, start_char, search_start, self.chunk_overlap) compute
the actual_overlap = len(overlap_text) (or locate overlap_text in the prior
region of text) and subtract actual_overlap from start_char instead of
self.chunk_overlap, and also ensure any separators consumed by splitting are
considered when locating overlap_text so that text[start_char:end_char] matches
chunk_text; update the overlap-start calculation and keep end_char = start_char
+ len(chunk_text).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4e58c6b5-bae4-4f10-b249-976393025ff9
📒 Files selected for processing (2)
nodes/src/nodes/chunker/chunker_strategies.py
test/nodes/test_chunker.py
…n strategies

Split from rocketride-org#514 per reviewer request. Addresses review feedback:
- RuntimeError on uninitialized strategy (not silent return)
- Shallow copy + metadata copy instead of N+1 deepcopy
- parent_id preserved in all metadata branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unreachable else branch in IInstance.writeDocuments (metadata is always non-None after shallow-copy-or-create on line 81). Remove unused _STRATEGY_MAP dict in IGlobal. Fix services.json output description that incorrectly referenced total_chunks metadata which is never set on chunk documents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion Add early validation for chunk_size > 0, chunk_overlap >= 0, and chunk_overlap < chunk_size in beginGlobal() so invalid config is caught before strategy instantiation. Also add noqa BLE001 annotation for the intentional broad exception catch in validateConfig(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of joining sentences with ' '.join() and computing end_char from the joined string length, derive start_char/end_char from the sentence position map and slice the original text. This preserves the actual whitespace between sentences (e.g., double spaces, newlines). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy chunk_index, start_char, end_char, and total_chunks from the strategy's per-chunk metadata dict onto each emitted Doc's metadata so downstream nodes can locate chunks within the source document. Update the writeDocuments docstring to list all metadata keys. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change the no-overlap position assertion to compare start_char against the previous chunk's end_char (not start_char) so it actually verifies non-overlapping spans. Rewrite test_iglobal_creates_strategy to instantiate IGlobal, mock the endpoint and glb with strategy='sentence', call beginGlobal(), and assert the resulting strategy type and parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nting

Replace the current_len tracker (which assumed 1-char separators) with position-based span computation that accounts for multi-character whitespace between sentences. After the whitespace-preservation change (text[start_char:end_char] slicing), current_len was underestimating the real chunk width whenever the original separator was longer than one character (e.g. \n\n), causing chunks to silently exceed chunk_size. The overlap budget is now also computed from actual text spans so carried-over sentences fit within the configured overlap window.

Add two regression tests:
- overlap + repeated sentences: verify spans match text and respect size
- multi-char whitespace: verify chunks are split, not all swallowed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from fab5892 to 4181638
Actionable comments posted: 7
♻️ Duplicate comments (5)
nodes/src/nodes/chunker/IGlobal.py (1)
37-45: ⚠️ Potential issue | 🟠 Major — Only install `tiktoken` for the token profile, and let validation fail if that install breaks.
`depends(requirements)` is an installer, so this adds side effects to `recursive` and `sentence` configs as well. The blanket catch then downgrades token setup failures to a warning, so the first hard failure moves to `TokenChunker._get_encoder()` during document processing instead of config validation.
Verify against the real config and dependency helpers; you should see `validateConfig()` always calling `depends(...)`, the chunker config exposing a bare `strategy` value, and `TokenChunker` importing `tiktoken` lazily at runtime.
```bash
#!/bin/bash
set -e
sed -n '37,45p' nodes/src/nodes/chunker/IGlobal.py
sed -n '29,50p' nodes/src/nodes/chunker/services.json
sed -n '71,171p' packages/ai/src/ai/common/config.py
sed -n '778,824p' packages/server/engine-lib/rocketlib-python/lib/depends.py
sed -n '298,304p' nodes/src/nodes/chunker/chunker_strategies.py
```
🔧 Suggested change
```diff
 def validateConfig(self):
     """Validate that tiktoken dependency is available (only needed for token strategy)."""
-    try:
-        from depends import depends
-
-        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
-        depends(requirements)
-    except Exception as e:  # noqa: BLE001 - intentional broad catch for dependency probing
-        warning(str(e))
+    config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)
+    if config.get('strategy', 'recursive') != 'token':
+        return
+
+    from depends import depends
+
+    requirements = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'requirements.txt')
+    depends(requirements)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IGlobal.py` around lines 37 - 45, validateConfig currently always calls depends(requirements) which installs tiktoken for all strategies and swallows errors; change validateConfig to check the chunker configuration's strategy value and only call depends(requirements) when strategy == "token", and if depends throws let the exception propagate (do not convert to warning) so failures surface during config validation; reference the validateConfig method in IGlobal.py, the chunker config's "strategy" field, the depends(...) installer, and TokenChunker._get_encoder (which depends on tiktoken) when making this change.
nodes/src/nodes/chunker/IInstance.py (1)
77-95: ⚠️ Potential issue | 🔴 Critical — Populate the fallback `DocMetadata` up front, and stop using the entry-scoped counter for `chunkId`.
`DocMetadata()` without `objectId` and `chunkId` fails against the real schema, so metadata-less inputs will raise before the later assignments run. Using `self.chunkId` also means the first chunk of the second source document no longer starts at `0`, even though `chunkId` is defined as the position within that source document. The current test stub only passes because it auto-fills those required fields.
Verify against the real schema and the current test stub; you should see `IInstance` still calling `DocMetadata()` with no required fields, the production model requiring `objectId`/`chunkId`, and the test double masking that mismatch.
```bash
#!/bin/bash
set -e
sed -n '77,95p' nodes/src/nodes/chunker/IInstance.py
sed -n '49,130p' packages/client-python/src/rocketride/schema/doc_metadata.py
sed -n '494,502p' test/nodes/test_chunker.py
```
🔧 Suggested change
```diff
 # Build output documents
 output_docs: List[Doc] = []
-for chunk_data in chunks:
+for chunk_idx, chunk_data in enumerate(chunks):
     # Shallow copy of document, explicit copy of metadata only
     chunk_doc = copy.copy(document)
-    chunk_doc.metadata = copy.copy(document.metadata) if document.metadata else DocMetadata()
+    strategy_meta = chunk_data.get('metadata', {})
+    if document.metadata is not None:
+        chunk_doc.metadata = copy.copy(document.metadata)
+    else:
+        chunk_doc.metadata = DocMetadata(objectId=parent_id, chunkId=chunk_idx)
     chunk_doc.page_content = chunk_data['text']

     # Update metadata (always non-None after the copy/create above)
-    chunk_doc.metadata.chunkId = self.chunkId
+    chunk_doc.metadata.chunkId = strategy_meta.get('chunk_index', chunk_idx)
     chunk_doc.metadata.parentId = parent_id

     # Propagate strategy metadata (chunk_index, start_char, end_char)
-    strategy_meta = chunk_data.get('metadata', {})
     chunk_doc.metadata.chunk_index = strategy_meta.get('chunk_index', 0)
     chunk_doc.metadata.start_char = strategy_meta.get('start_char', 0)
     chunk_doc.metadata.end_char = strategy_meta.get('end_char', 0)
     chunk_doc.metadata.total_chunks = total_chunks
-    self.chunkId += 1
     output_docs.append(chunk_doc)
```
🤖 Prompt for AI Agents
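The per-document numbering in the suggested change above can be illustrated in isolation. Plain dicts stand in for `DocMetadata` here (an assumption for the sketch), and `build_chunk_metadata` is a hypothetical helper, not the node's real method:

```python
# Sketch: number chunks per source document (restarting at 0 for each
# document) and always populate the schema-required fields up front,
# rather than incrementing a counter shared across all input documents.
def build_chunk_metadata(parent_id, chunks):
    out = []
    for chunk_idx, chunk in enumerate(chunks):
        meta = dict(chunk.get('metadata', {}))
        meta.update({'objectId': parent_id,   # required field, never omitted
                     'chunkId': chunk_idx,    # position within THIS document
                     'parentId': parent_id})
        out.append(meta)
    return out

first = build_chunk_metadata('doc-1', [{'metadata': {}}, {'metadata': {}}])
second = build_chunk_metadata('doc-2', [{'metadata': {}}])
```

With an instance-level counter instead, `second[0]['chunkId']` would come out as `2`, which is the bug the review comment describes.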
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/IInstance.py` around lines 77 - 95, The code creates a fallback DocMetadata() with no required fields and uses self.chunkId (a global counter) for chunkId, causing schema validation failures and incorrect chunk numbering across source documents; update the chunking in IInstance so that when document.metadata is falsy you instantiate DocMetadata with required fields (e.g., objectId populated from the source document identifier) before assigning any chunk fields, and compute chunkId as the position within the current source (reset a local per-document counter or use the chunk index from strategy_meta) instead of self.chunkId; adjust references to chunk_doc.metadata.chunkId, chunk_doc.metadata.parentId, and the increment logic so chunkId starts at 0 for each source document and schema-required fields are always present.
nodes/src/nodes/chunker/chunker_strategies.py (1)
127-160: ⚠️ Potential issue | 🟠 Major — Derive recursive overlap from the source span, not `prev[-overlap:] + raw`.
When a separator is dropped between raw chunks, that concatenation no longer matches any contiguous slice of `text`. `start_char` and `end_char` then point at the wrong substring, and downstream code sees chunk text that does not match the recorded offsets.
🔧 Suggested change
```diff
 result = []
 chunk_index = 0
 search_start = 0
+prev_raw_start: int | None = None

 for i, raw in enumerate(raw_chunks):
-    # Compute overlap prefix from the previous chunk
-    if i > 0 and self.chunk_overlap > 0:
-        prev = raw_chunks[i - 1]
-        overlap_text = prev[-self.chunk_overlap :]
-        chunk_text = overlap_text + raw
-    else:
-        chunk_text = raw
-
-    # Find start_char position in original text
-    start_char = text.find(raw, search_start)
-    if start_char == -1:
-        start_char = search_start
-    # Adjust for overlap prefix
-    if i > 0 and self.chunk_overlap > 0:
-        overlap_start = max(0, start_char - self.chunk_overlap)
-        start_char = overlap_start
-    end_char = start_char + len(chunk_text)
+    raw_start = text.find(raw, search_start)
+    if raw_start == -1:
+        raw_start = search_start
+    raw_end = raw_start + len(raw)
+
+    start_char = raw_start
+    if i > 0 and self.chunk_overlap > 0 and prev_raw_start is not None:
+        overlap_text = raw_chunks[i - 1][-self.chunk_overlap :]
+        overlap_start = text.rfind(overlap_text, prev_raw_start, raw_start)
+        if overlap_start != -1:
+            start_char = overlap_start
+
+    chunk_text = text[start_char:raw_end]
+    end_char = raw_end
@@
-    raw_start = text.find(raw, search_start)
-    if raw_start >= 0:
-        search_start = raw_start + len(raw)
+    search_start = raw_end
+    prev_raw_start = raw_start
```
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/chunker/chunker_strategies.py` around lines 127 - 160, The current code builds chunk_text by concatenating prev[-chunk_overlap:] + raw which can create strings that don't correspond to a contiguous slice of the original text when separators were dropped; instead compute the overlap by slicing the original text using the previously computed offsets: use the previous chunk's recorded end_char (from result[-1]['metadata']['end_char']) to derive the overlap_start = max(prev_end_char - chunk_overlap, prev_start_char) and set overlap_text = text[overlap_start:prev_end_char]; then build chunk_text = overlap_text + raw and compute start_char = overlap_start (or the found raw start if no overlap), update end_char = start_char + len(chunk_text), and ensure search_start is advanced using the actual raw's start position found in text so start_char/end_char always map to contiguous spans in text (update uses of raw_chunks, prev, chunk_overlap, start_char, end_char, search_start and result accordingly).
nodes/src/nodes/search_hybrid/IGlobal.py (1)
39-47: ⚠️ Potential issue | 🟠 Major — Fail fast when dependency validation fails.
Line 47 turns a missing/failed `rank_bm25` install into a warning and lets the node boot without its BM25 backend. The first BM25 call then fails much later with a harder-to-diagnose runtime error.
Suggested fix
```diff
 def validateConfig(self):
     """Validate that the rank_bm25 dependency is available."""
+    requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
     try:
         from depends import depends

-        requirements = os.path.dirname(os.path.realpath(__file__)) + '/requirements.txt'
         depends(requirements)
-    except Exception as e:
-        warning(str(e))
+    except Exception as exc:
+        warning(f'Failed to validate rank_bm25 dependency from {requirements}: {exc}')
+        raise
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/search_hybrid/IGlobal.py` around lines 39 - 47, The validateConfig method currently swallows dependency installation failures by calling warning(str(e)); change it to fail fast: when depends(requirements) raises, log/record the error and re-raise a RuntimeError (or call sys.exit(1)) so the node does not boot without BM25; update validateConfig to propagate the exception instead of only calling warning, referencing the validateConfig function, the local requirements variable, and the depends import to locate the change.
nodes/src/nodes/search_hybrid/IInstance.py (1)
114-118: ⚠️ Potential issue | 🟠 Major — Render scores from `doc.score`, not missing metadata.
Lines 99-101 store the fused score on `doc.score`, but line 116 reads `metadata['hybrid_score']`, which this node never populates. That makes every structured answer print `N/A` unless a test fixture happens to pre-fill that metadata key.
Suggested fix
```diff
 if reranked_docs and self.instance.hasListener('answers'):
     context_parts = []
     for i, doc in enumerate(reranked_docs):
-        score = doc.metadata.get('hybrid_score', 'N/A') if doc.metadata else 'N/A'
+        score = f'{doc.score:.4f}' if doc.score is not None else 'N/A'
         snippet = (doc.page_content or '')[:500]
         context_parts.append(f'[Document {i + 1}] (score: {score})\n{snippet}')
```
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/search_hybrid/IInstance.py` around lines 114 - 118, The context-building loop is reading a non-existent metadata key 'hybrid_score' so scores show as "N/A"; update the loop that constructs context_parts (the for i, doc in enumerate(reranked_docs) block in IInstance) to use doc.score (with a safe fallback like 'N/A' or 0 when None) instead of doc.metadata.get('hybrid_score'), and keep the existing snippet extraction (snippet = (doc.page_content or '')[:500]) unchanged so rendered answers display the actual fused score stored on doc.score.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 47-91: The lifecycle hooks beginGlobal and endGlobal are missing
PEP 257 docstrings; add concise triple-single-quoted ('''...''') docstrings to
both methods: beginGlobal should have a one-line summary plus a short second
line describing that it loads node config and initializes the chunking strategy
(mention the strategy selection: 'token', 'sentence', default 'recursive') and
note it raises ValueError for bad chunk sizes; endGlobal should have a one-line
summary saying it releases/cleans up the strategy (sets self.strategy to None).
Use single quotes for the docstrings to match project style and ensure
ruff/formatting passes for Python 3.10+.
In `@nodes/src/nodes/search_hybrid/hybrid_search.py`:
- Around line 127-141: Validate the RRF parameter `k` (from `rrf_k`/`k`) before
entering the RRF scoring loop in the hybrid search logic: ensure `k` is a
non-negative integer (e.g., k >= 0) and raise a clear ValueError if it's
negative or otherwise invalid so you never compute rrf_score = weight * (1.0 /
(k + rank + 1)) with a zero denominator; place this check immediately before the
loop that iterates over `result_lists` (the block that references `weights`,
`result_lists`, `id_key`, and computes `rrf_score`).
- Around line 186-194: The code silently ignores misaligned vector_scores by
only using them when len(vector_scores) == len(documents); update the block
around vector_scores/documents handling (the vector_results construction using
vector_scores, scored, and doc_copy['vector_score']) to validate lengths and
raise a clear exception (e.g., ValueError) if vector_scores is not None but its
length != len(documents) so upstream callers fail fast instead of falling back
to BM25-only; keep the successful path the same (attach float(vector_scores[i])
to each document and sort by 'vector_score') and ensure the raised error message
references vector_scores and documents for easier debugging.
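The fail-fast length check from the entry above can be sketched like this (hypothetical helper name and dict-shaped documents — the real node works on its own document type):

```python
# Sketch: reject a misaligned vector_scores list instead of silently
# falling back to BM25-only ranking.
def attach_vector_scores(documents, vector_scores):
    if vector_scores is not None and len(vector_scores) != len(documents):
        raise ValueError(
            f'vector_scores has {len(vector_scores)} entries '
            f'for {len(documents)} documents'
        )
    scored = []
    for i, doc in enumerate(documents):
        doc_copy = dict(doc)
        if vector_scores is not None:
            doc_copy['vector_score'] = float(vector_scores[i])
        scored.append(doc_copy)
    # Highest vector score first, as in the successful path.
    return sorted(scored, key=lambda d: d.get('vector_score', 0.0), reverse=True)

docs = [{'id': 'a'}, {'id': 'b'}]
ranked = attach_vector_scores(docs, [0.2, 0.9])
```

A caller passing one score for two documents now gets a `ValueError` naming both lengths, rather than a quietly degraded BM25-only result.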
In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 34-72: Add PEP 257 docstrings to the public lifecycle surface: add
a module/class docstring for IGlobal and add triple-quoted single-quote
docstrings to the IGlobal class and its public lifecycle methods beginGlobal()
and endGlobal() that briefly describe the purpose, when they are called, key
parameters/attributes used (e.g., engine, top_k, rrf_k, alpha) and their effects
(e.g., engine creation in beginGlobal, engine teardown in endGlobal); ensure
docstrings follow the project's style (single-quoted triple quotes) and are
concise per PEP 257 for IGlobal, beginGlobal, and endGlobal.
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 97-101: The reranked_doc only copies result.get('rrf_score') into
reranked_doc.score, so when HybridSearchEngine.search() returns bm25_score
(e.g., alpha == 0.0) the emitted Doc.score remains the stale upstream vector
value; update the logic that builds reranked_doc (the block that reads result
and sets rrf_score) to also check for and use result.get('bm25_score') (or the
active scoring key returned by HybridSearchEngine.search()) when rrf_score is
None, assigning that score to reranked_doc.score so emitted Doc.score reflects
the actual active ranking score.
- Around line 120-122: The code constructs an Answer via Answer() and calls
ans.setAnswer(...), but then passes the bare Answer to writeAnswers which
expects a List[Answer]; update the call to pass a list containing the Answer
(e.g., [ans]) so writeAnswers receives a List[Answer] and matches the rocketlib
contract; adjust any surrounding logic that assumes a single Answer return
accordingly and ensure imports/type hints align with List[Answer] if present.
In `@test/nodes/test_search_hybrid.py`:
- Around line 78-87: Install and remove the BM25 stub using a pytest
module-scoped fixture rather than calling _install_bm25_stub() at import time:
create a fixture (scope="module") that calls _install_bm25_stub() during setup
and ensures _restore_modules() is invoked in its teardown, so the stub in
sys.modules (names in _STUB_MODULES) is present only for this module's tests and
is restored afterward to avoid leaking the fake rank_bm25 into other tests.
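The setup/teardown pairing the entry above asks for can be sketched with a context manager (the `_install_bm25_stub`/`_restore_modules` bodies below are hypothetical stand-ins for the test module's helpers; with pytest available, the same two calls would wrap a `@pytest.fixture(scope='module', autouse=True)` generator):

```python
# Sketch: scope the fake rank_bm25 module to a block of tests instead of
# installing it at import time, and restore whatever was there before.
import sys
import types
from contextlib import contextmanager

_STUB_MODULES = ('rank_bm25',)
_saved = {}

def _install_bm25_stub():
    for name in _STUB_MODULES:
        _saved[name] = sys.modules.get(name)   # remember the real module, if any
        sys.modules[name] = types.ModuleType(name)

def _restore_modules():
    for name in _STUB_MODULES:
        if _saved.get(name) is None:
            sys.modules.pop(name, None)        # stub was the only entry: drop it
        else:
            sys.modules[name] = _saved[name]   # put the real module back

@contextmanager
def bm25_stub():
    _install_bm25_stub()
    try:
        yield
    finally:
        _restore_modules()

with bm25_stub():
    stub_visible = 'rank_bm25' in sys.modules
```

The teardown runs even if a test inside the block fails, so the fake module never leaks into other test modules.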
---
Duplicate comments:
In `@nodes/src/nodes/chunker/chunker_strategies.py`:
- Around line 127-160: The current code builds chunk_text by concatenating
prev[-chunk_overlap:] + raw which can create strings that don't correspond to a
contiguous slice of the original text when separators were dropped; instead
compute the overlap by slicing the original text using the previously computed
offsets: use the previous chunk's recorded end_char (from
result[-1]['metadata']['end_char']) to derive the overlap_start =
max(prev_end_char - chunk_overlap, prev_start_char) and set overlap_text =
text[overlap_start:prev_end_char]; then build chunk_text = overlap_text + raw
and compute start_char = overlap_start (or the found raw start if no overlap),
update end_char = start_char + len(chunk_text), and ensure search_start is
advanced using the actual raw's start position found in text so
start_char/end_char always map to contiguous spans in text (update uses of
raw_chunks, prev, chunk_overlap, start_char, end_char, search_start and result
accordingly).
In `@nodes/src/nodes/chunker/IGlobal.py`:
- Around line 37-45: validateConfig currently always calls depends(requirements)
which installs tiktoken for all strategies and swallows errors; change
validateConfig to check the chunker configuration's strategy value and only call
depends(requirements) when strategy == "token", and if depends throws let the
exception propagate (do not convert to warning) so failures surface during
config validation; reference the validateConfig method in IGlobal.py, the
chunker config's "strategy" field, the depends(...) installer, and
TokenChunker._get_encoder (which depends on tiktoken) when making this change.
In `@nodes/src/nodes/chunker/IInstance.py`:
- Around line 77-95: The code creates a fallback DocMetadata() with no required
fields and uses self.chunkId (a global counter) for chunkId, causing schema
validation failures and incorrect chunk numbering across source documents;
update the chunking in IInstance so that when document.metadata is falsy you
instantiate DocMetadata with required fields (e.g., objectId populated from the
source document identifier) before assigning any chunk fields, and compute
chunkId as the position within the current source (reset a local per-document
counter or use the chunk index from strategy_meta) instead of self.chunkId;
adjust references to chunk_doc.metadata.chunkId, chunk_doc.metadata.parentId,
and the increment logic so chunkId starts at 0 for each source document and
schema-required fields are always present.
In `@nodes/src/nodes/search_hybrid/IGlobal.py`:
- Around line 39-47: The validateConfig method currently swallows dependency
installation failures by calling warning(str(e)); change it to fail fast: when
depends(requirements) raises, log/record the error and re-raise a RuntimeError
(or call sys.exit(1)) so the node does not boot without BM25; update
validateConfig to propagate the exception instead of only calling warning,
referencing the validateConfig function, the local requirements variable, and
the depends import to locate the change.
In `@nodes/src/nodes/search_hybrid/IInstance.py`:
- Around line 114-118: The context-building loop is reading a non-existent
metadata key 'hybrid_score' so scores show as "N/A"; update the loop that
constructs context_parts (the for i, doc in enumerate(reranked_docs) block in
IInstance) to use doc.score (with a safe fallback like 'N/A' or 0 when None)
instead of doc.metadata.get('hybrid_score'), and keep the existing snippet
extraction (snippet = (doc.page_content or '')[:500]) unchanged so rendered
answers display the actual fused score stored on doc.score.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4f1a6f92-efe2-4514-a6ab-ba032bd4b6d1
📒 Files selected for processing (14)
nodes/src/nodes/chunker/IGlobal.py
nodes/src/nodes/chunker/IInstance.py
nodes/src/nodes/chunker/__init__.py
nodes/src/nodes/chunker/chunker_strategies.py
nodes/src/nodes/chunker/requirements.txt
nodes/src/nodes/chunker/services.json
nodes/src/nodes/search_hybrid/IGlobal.py
nodes/src/nodes/search_hybrid/IInstance.py
nodes/src/nodes/search_hybrid/__init__.py
nodes/src/nodes/search_hybrid/hybrid_search.py
nodes/src/nodes/search_hybrid/requirements.txt
nodes/src/nodes/search_hybrid/services.json
test/nodes/test_chunker.py
test/nodes/test_search_hybrid.py
```python
def beginGlobal(self):
    # Are we in config mode or some other mode?
    if self.IEndpoint.endpoint.openMode == OPEN_MODE.CONFIG:
        # We are going to get a call to configureService but
        # we don't actually need to load the strategy for that
        pass
    else:
        # Get this node's config
        config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)

        # Read strategy parameters from config
        strategy_name = config.get('strategy', 'recursive')
        chunk_size = int(config.get('chunk_size', 1000))
        chunk_overlap = int(config.get('chunk_overlap', 200))
        encoding_name = config.get('encoding_name', 'cl100k_base')

        if chunk_size <= 0:
            raise ValueError(f'chunk_size must be positive, got {chunk_size}')
        if chunk_overlap < 0:
            raise ValueError(f'chunk_overlap must be non-negative, got {chunk_overlap}')
        if chunk_overlap >= chunk_size:
            raise ValueError(f'chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size})')

        # Build the appropriate strategy
        if strategy_name == 'token':
            self.strategy = TokenChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                encoding_name=encoding_name,
            )
        elif strategy_name == 'sentence':
            self.strategy = SentenceChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )
        else:
            # Default to recursive character chunker
            self.strategy = RecursiveCharacterChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )

def endGlobal(self):
    # Release the strategy
    self.strategy = None
```
Add PEP 257 docstrings to the lifecycle hooks.
beginGlobal() and endGlobal() are public methods in a nodes/**/*.py module, but they currently have no docstrings.
📝 Suggested change
```diff
 def beginGlobal(self):
+    """Initialize the configured chunking strategy for runtime execution."""
     # Are we in config mode or some other mode?
     if self.IEndpoint.endpoint.openMode == OPEN_MODE.CONFIG:
         # We are going to get a call to configureService but
         # we don't actually need to load the strategy for that
         pass
@@
 def endGlobal(self):
+    """Release the configured chunking strategy."""
     # Release the strategy
     self.strategy = None
```

As per coding guidelines, nodes/**/*.py: Python pipeline nodes: use single quotes, ruff for linting/formatting, PEP 257 docstrings, target Python 3.10+.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def beginGlobal(self):
    """Initialize the configured chunking strategy for runtime execution."""
    # Are we in config mode or some other mode?
    if self.IEndpoint.endpoint.openMode == OPEN_MODE.CONFIG:
        # We are going to get a call to configureService but
        # we don't actually need to load the strategy for that
        pass
    else:
        # Get this node's config
        config = Config.getNodeConfig(self.glb.logicalType, self.glb.connConfig)
        # Read strategy parameters from config
        strategy_name = config.get('strategy', 'recursive')
        chunk_size = int(config.get('chunk_size', 1000))
        chunk_overlap = int(config.get('chunk_overlap', 200))
        encoding_name = config.get('encoding_name', 'cl100k_base')
        if chunk_size <= 0:
            raise ValueError(f'chunk_size must be positive, got {chunk_size}')
        if chunk_overlap < 0:
            raise ValueError(f'chunk_overlap must be non-negative, got {chunk_overlap}')
        if chunk_overlap >= chunk_size:
            raise ValueError(f'chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size})')
        # Build the appropriate strategy
        if strategy_name == 'token':
            self.strategy = TokenChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                encoding_name=encoding_name,
            )
        elif strategy_name == 'sentence':
            self.strategy = SentenceChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )
        else:
            # Default to recursive character chunker
            self.strategy = RecursiveCharacterChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )

def endGlobal(self):
    """Release the configured chunking strategy."""
    # Release the strategy
    self.strategy = None
```
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 64-64: Avoid specifying long messages outside the exception class (TRY003)
[warning] 66-66: Avoid specifying long messages outside the exception class (TRY003)
[warning] 68-68: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/chunker/IGlobal.py` around lines 47 - 91, The lifecycle hooks
beginGlobal and endGlobal are missing PEP 257 docstrings; add concise
triple-single-quoted ('''...''') docstrings to both methods: beginGlobal should
have a one-line summary plus a short second line describing that it loads node
config and initializes the chunking strategy (mention the strategy selection:
'token', 'sentence', default 'recursive') and note it raises ValueError for bad
chunk sizes; endGlobal should have a one-line summary saying it releases/cleans
up the strategy (sets self.strategy to None). Use single quotes for the
docstrings to match project style and ensure ruff/formatting passes for Python
3.10+.
…rch nodes

Critical fixes:
- writeAnswers() now passes a list[Answer] matching the rocketlib contract
- hybrid_score N/A bug fixed: use doc.score instead of nonexistent metadata key
- Score propagation now works in all modes (RRF, BM25-only, vector-only)
- RRF k parameter validated to prevent division-by-zero
- vector_scores length mismatch now raises ValueError instead of silent fallback

Improvements:
- chunker validateConfig only installs tiktoken when token strategy is configured
- search_hybrid validateConfig re-raises dependency failures instead of swallowing
- Added PEP 257 docstrings to all public lifecycle hooks
- Modernized typing: List/Dict -> list/dict for Python 3.10+ (ruff UP035)
- _tokenize type hint updated to Optional[str] matching implementation
- BM25 test stub now properly cleaned up via module-scoped pytest fixture
- test_structured_answer_output verifies actual score values, not N/A

All 42 chunker tests pass. All 39 hybrid search tests pass. ruff check/format clean.

https://claude.ai/code/session_017DgznbpdyRsQ3svzCjYLXo
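The module-scoped cleanup for the BM25 test stub mentioned above can be sketched like this. The module name `rank_bm25` and the stub's shape are assumptions; in the real tests the stubbed interface may differ. Wrapping this context manager in `@pytest.fixture(scope='module')` with a `yield` gives the described teardown behavior.

```python
import sys
import types
from contextlib import contextmanager


@contextmanager
def bm25_stub(module_name: str = 'rank_bm25'):
    """Install a throwaway BM25 module stub, restoring sys.modules on exit.

    Sketch only: the stub exposes a placeholder BM25Okapi attribute so
    imports in the code under test resolve without the real package.
    """
    stub = types.ModuleType(module_name)
    stub.BM25Okapi = object  # minimal placeholder class
    saved = sys.modules.get(module_name)
    sys.modules[module_name] = stub
    try:
        yield stub
    finally:
        # Restore whatever was there before, so later test modules see
        # the real package (if installed) rather than the stub.
        if saved is None:
            sys.modules.pop(module_name, None)
        else:
            sys.modules[module_name] = saved
```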
Review Feedback Resolution — All 23 CodeRabbit threads addressed

I've gone through every review comment and pushed a comprehensive fix in commit

Critical Fixes (3 bugs)

Major Fixes (3 bugs)

Improvement Fixes (2 items)

Typing Modernization (5 items)

Test Fixes (2 items)

Intentionally Not Changed (4 items)

Verification

Generated by Claude Code
Review Feedback Response —
| Issue | File | Fix |
|---|---|---|
| Path construction — string concat instead of `os.path.join()` | `chunker/IGlobal.py`, `search_hybrid/IGlobal.py` | Replaced with `os.path.join()` for cross-platform safety |
| Dependency failure silently swallowed — `warning(str(e))` masks install failures | `chunker/IGlobal.py` | Now raises `RuntimeError` with chained exception. Removed unused `warning` import (caught by ruff) |
| Overlap offset bug — used config value instead of actual overlap length | `chunker/chunker_strategies.py` | Fixed: now uses `len(overlap_text)` to compute `start_char` adjustment, so `text[start_char:end_char]` matches `chunk_text` when the previous chunk is shorter than the configured overlap |
| Deprecated `typing.Optional` | `chunker_strategies.py`, `hybrid_search.py` | Replaced with PEP 604 union syntax (`X \| None`), removed unused `typing` imports |
| Token decode test coverage — only tracked call count, not window sizes | `test_chunker.py` | Added `decoded_lengths` tracking + assertion that no decode call exceeds `chunk_size`, verifying O(n) behavior |
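The overlap-offset fix can be illustrated with a simplified character chunker. This sketch is an assumption, not the node's actual `RecursiveCharacterChunker`: its point is that the next chunk's start offset is derived from the length of the *actual* overlap text, so `text[start:end]` always equals the emitted chunk.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int):
    """Yield (chunk, start, end) triples with character-level overlap.

    Simplified sketch: offsets step back by len(overlap_text) rather than
    the configured chunk_overlap, preserving the slice invariant.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append((chunk, start, end))
        if end == len(text):
            break
        # Use the actual overlap text, which may be shorter than the
        # configured overlap when the chunk itself is short.
        overlap_text = chunk[-chunk_overlap:] if chunk_overlap else ''
        start = end - len(overlap_text)
    return chunks
```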
⏭️ No Changes Needed (already correct or incorrect feedback)

**`validateConfig(self)` signature (Thread 13)**
> The codebase convention across all nodes (`llamaparse`, `qdrant`, `mcp_client`, `index_search`, `llm_deepseek`, `llm_gmi_cloud`, `local_text_output`, etc.) is `def validateConfig(self)` — no `syntaxOnly` parameter. Our implementation is consistent with the framework contract.

**Sentence chunker whitespace (Thread 1)**
> Already fixed in prior commit `4181638` — chunk text is derived via the `text[start_char:end_char]` slice, not `' '.join()`. Whitespace is preserved exactly as it appears in the original text.

**`doc.score` vs `metadata['hybrid_score']` (Thread 10)**
> Code already uses `doc.score` directly at line 117 of `search_hybrid/IInstance.py`:
> `score = f'{doc.score:.4f}' if doc.score is not None else 'N/A'`
> After RRF reranking, `reranked_doc.score` is set to the ranking signal (lines 98-102), so scores are always numeric.

**Per-document `chunk_index` (Thread 3)**
> Two separate fields serve different purposes:
> - `chunk_index` — from the strategy's `.chunk()`, resets to 0 per document ✅
> - `chunkId` — entry-scoped monotonic ID, intentionally unique across all chunks in an entry for downstream dedup/tracking
>
> Both are correct by design.

**`writeAnswers()` interface (Thread 22)**
> Already passes a list: `self.instance.writeAnswers([ans])` (line 123). No interface violation.

**`rrf_k` validation (Thread 18)**
> Already validated: `if k < 0: raise ValueError('k must be non-negative')` (lines 129-130 of `hybrid_search.py`).

**`vector_scores` length mismatch (Thread 19)**
> Already validated: `if len(vector_scores) != len(documents): raise ValueError(...)` (lines 190-191).

**Score propagation for different alpha modes (Thread 21)**
> Already handled with fallback chain (lines 98-102):
> ```python
> ranking_score = result.get('rrf_score')
> if ranking_score is None:
>     ranking_score = result.get('bm25_score', result.get('vector_score'))
> ```
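For context on the RRF-related threads above, the fusion step with its `k` validation can be sketched as follows. This is a minimal illustration over document-ID rankings, not the node's actual `HybridSearchEngine` implementation:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked document-ID lists with Reciprocal Rank Fusion (sketch).

    Each document scores sum(1 / (k + rank)) across the lists it appears
    in, with rank starting at 1; k must be non-negative, mirroring the
    validation described above.
    """
    if k < 0:
        raise ValueError('k must be non-negative')
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document ranked highly by both BM25 and the vector search accumulates two large reciprocal terms, so it rises above documents that only one ranker favors.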
Test Results
81/81 tests pass (42 chunker + 39 hybrid search) ✅
Ready for re-review.
Generated by Claude Code
Summary
Review feedback addressed
- `raise RuntimeError` on dependency install failure
- `parentId` links each chunk back to its source document

Test plan
Splits rocketride-org#514
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests