-
Notifications
You must be signed in to change notification settings - Fork 444
fix: Optimize chunk strategy #996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev-20260202-v2.0.5
Are you sure you want to change the base?
Changes from all commits
7a1e121
861ce14
b2ce876
7f723c3
1e80f0c
edeb180
db143a6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -216,3 +216,5 @@ cython_debug/ | |
| outputs | ||
|
|
||
| evaluation/data/temporal_locomo | ||
| test_add_pipeline.py | ||
| test_file_pipeline.py | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,7 @@ | ||
| from abc import ABC, abstractmethod | ||
|
|
||
| from memos.configs.chunker import BaseChunkerConfig | ||
| import re | ||
|
|
||
|
|
||
| class Chunk: | ||
|
|
@@ -22,3 +23,42 @@ def __init__(self, config: BaseChunkerConfig): | |
| @abstractmethod | ||
| def chunk(self, text: str) -> list[Chunk]: | ||
| """Chunk the given text into smaller chunks.""" | ||
|
|
||
| def protect_urls(self, text: str) -> tuple[str, dict[str, str]]: | ||
| """ | ||
| Protect URLs in text from being split during chunking. | ||
|
|
||
| Args: | ||
| text: Text to process | ||
|
|
||
| Returns: | ||
| tuple: (Text with URLs replaced by placeholders, URL mapping dictionary) | ||
| """ | ||
| url_pattern = r'https?://[^\s<>"{}|\\^`\[\]]+' | ||
| url_map = {} | ||
|
|
||
| def replace_url(match): | ||
| url = match.group(0) | ||
| placeholder = f"__URL_{len(url_map)}__" | ||
| url_map[placeholder] = url | ||
| return placeholder | ||
|
|
||
| protected_text = re.sub(url_pattern, replace_url, text) | ||
| return protected_text, url_map | ||
|
|
||
| def restore_urls(self, text: str, url_map: dict[str, str]) -> str: | ||
| """ | ||
| Restore protected URLs in text back to their original form. | ||
|
|
||
| Args: | ||
| text: Text with URL placeholders | ||
| url_map: URL mapping dictionary from protect_urls | ||
|
|
||
| Returns: | ||
| str: Text with URLs restored | ||
| """ | ||
| restored_text = text | ||
| for placeholder, url in url_map.items(): | ||
| restored_text = restored_text.replace(placeholder, url) | ||
|
|
||
| return restored_text | ||
|
Comment on lines
+27
to
+64
|
||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -2,6 +2,8 @@ | |||||
| from memos.dependency import require_python_package | ||||||
| from memos.log import get_logger | ||||||
|
|
||||||
| import re | ||||||
|
|
||||||
| from .base import BaseChunker, Chunk | ||||||
|
|
||||||
|
|
||||||
|
|
@@ -22,13 +24,15 @@ def __init__( | |||||
| chunk_size: int = 1000, | ||||||
| chunk_overlap: int = 200, | ||||||
| recursive: bool = False, | ||||||
| auto_fix_headers: bool = True, | ||||||
|
||||||
| auto_fix_headers: bool = True, | |
| auto_fix_headers: bool = True, |
Mozy403 marked this conversation as resolved.
Show resolved
Hide resolved
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The header hierarchy fix strategy may not produce the desired results in all cases. The current approach increments all headers after the first by 1 level, but this doesn't account for the original hierarchy structure. For example, if the original has headers at levels [1, 1, 1], they become [1, 2, 2], but if the original was [1, 2, 1], they become [1, 3, 2], which breaks the hierarchy (level 3 appears before level 2 is closed). A more robust approach would be to normalize all level-1 headers to level-2 except the first, preserving the relative structure of non-level-1 headers.
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new header hierarchy detection and fixing functionality lacks test coverage. Since there are tests for mem_reader components, tests should be added to verify that malformed header hierarchies are correctly detected and fixed, including edge cases like all headers being level 1, mixed header levels, and empty documents.
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -43,11 +43,13 @@ def __init__(self, config: SentenceChunkerConfig): | |||||
|
|
||||||
| def chunk(self, text: str) -> list[str] | list[Chunk]: | ||||||
| """Chunk the given text into smaller chunks based on sentences.""" | ||||||
| chonkie_chunks = self.chunker.chunk(text) | ||||||
| protected_text, url_map = self.protect_urls(text) | ||||||
| chonkie_chunks = self.chunker.chunk(protected_text) | ||||||
|
|
||||||
| chunks = [] | ||||||
| for c in chonkie_chunks: | ||||||
| chunk = Chunk(text=c.text, token_count=c.token_count, sentences=c.sentences) | ||||||
| chunk = self.restore_urls(chunk.text, url_map) | ||||||
|
||||||
| chunk = self.restore_urls(chunk.text, url_map) | |
| chunk.text = self.restore_urls(chunk.text, url_map) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URL restoration uses sequential string replacements in a loop, which has O(n*m) complexity where n is the number of URLs and m is the text length. For large texts with many URLs, this could be inefficient. Consider using a single regex substitution with a replacement function that looks up placeholders in the url_map dictionary, or building the result string in a single pass.