fix: Optimize chunk strategy #996
base: dev-20260202-v2.0.5
Conversation
Pull request overview
This PR optimizes the chunking strategy by introducing URL protection during text chunking to prevent URLs from being split, adding hierarchical header context to image processing in markdown documents, and implementing automatic detection and fixing of malformed markdown header hierarchies. It also improves language detection by filtering out URLs that could dilute Chinese character detection.
Changes:
- Added URL protection/restoration methods to base chunker to prevent URLs from being split during chunking
- Implemented markdown header extraction and hierarchical context tracking for better image content processing
- Added automatic detection and correction of malformed markdown header hierarchies
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| src/memos/chunkers/base.py | Adds protect_urls() and restore_urls() methods to base chunker class for URL preservation during chunking |
| src/memos/chunkers/simple_chunker.py | Integrates URL protection/restoration into simple text chunking logic |
| src/memos/chunkers/sentence_chunker.py | Applies URL protection to sentence-based chunking |
| src/memos/chunkers/charactertext_chunker.py | Applies URL protection to character-based chunking |
| src/memos/chunkers/markdown_chunker.py | Adds URL protection plus header hierarchy detection and auto-fix functionality |
| src/memos/mem_reader/read_multi_modal/file_content_parser.py | Implements markdown header extraction and adds hierarchical header context to image processing |
| src/memos/mem_reader/read_multi_modal/utils.py | Updates language detection to remove URLs before analyzing text |
| .gitignore | Adds test pipeline files to ignore list |
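The language-detection change in utils.py can be sketched roughly as follows. This is an illustrative assumption, not the PR's actual code: the function name, CJK range, and 30% threshold are hypothetical, but it shows why stripping URLs first matters — a long ASCII URL can drown out the Chinese characters in an otherwise Chinese sentence.

```python
import re

def detect_language(text: str) -> str:
    # Hypothetical sketch: remove URLs first so long ASCII URLs do not
    # dilute the ratio of CJK characters used for language detection.
    cleaned = re.sub(r'https?://\S+', '', text)
    cjk = sum(1 for ch in cleaned if '\u4e00' <= ch <= '\u9fff')
    total = sum(1 for ch in cleaned if not ch.isspace())
    # Threshold of 30% CJK characters is an assumption for illustration.
    return 'zh' if total and cjk / total > 0.3 else 'en'
```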
```python
header_context = None
if headers:
    header_context = self._get_header_context(text, image_position, headers)
```
Copilot AI · Feb 2, 2026
Line contains only whitespace. This empty line should be completely empty with no trailing spaces.
```python
# Subsequent headers: increment by 1, cap at level 6
new_level = min(current_level + 1, 6)
new_hashes = '#' * new_level
fixed_line = f"{new_hashes} {title_content}"
logger.debug(f"Adjust header level: {current_level} -> {new_level}: {title_content[:50]}...")
```
Copilot AI · Feb 2, 2026
The header hierarchy fix strategy may not produce the desired results in all cases. The current approach increments all headers after the first by 1 level, but this doesn't account for the original hierarchy structure. For example, if the original has headers at levels [1, 1, 1], they become [1, 2, 2], but if the original was [1, 2, 1], they become [1, 3, 2], which breaks the hierarchy (level 3 appears before level 2 is closed). A more robust approach would be to normalize all level-1 headers to level-2 except the first, preserving the relative structure of non-level-1 headers.
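A minimal sketch of the suggested normalization (the helper name is hypothetical): only duplicate level-1 headers are demoted to level 2, and all other header levels are left untouched, so the relative structure survives.

```python
import re

HEADER_RE = re.compile(r'^(#{1,6})\s+(.+)$')

def fix_header_hierarchy(text: str) -> str:
    # Sketch of the reviewer's suggestion: keep the first H1 as the
    # document title, demote every subsequent H1 to H2, and leave
    # non-level-1 headers alone so their relative structure is preserved.
    seen_h1 = False
    out = []
    for line in text.split('\n'):
        m = HEADER_RE.match(line)
        if m and len(m.group(1)) == 1:
            if not seen_h1:
                seen_h1 = True
            else:
                line = '#' + line  # '# Title' -> '## Title'
        out.append(line)
    return '\n'.join(out)
```

With this approach, levels [1, 2, 1] become [1, 2, 2] rather than [1, 3, 2], avoiding the broken ordering the comment describes.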
```python
def protect_urls(self, text: str) -> tuple[str, dict[str, str]]:
    """
    Protect URLs in text from being split during chunking.

    Args:
        text: Text to process

    Returns:
        tuple: (Text with URLs replaced by placeholders, URL mapping dictionary)
    """
    url_pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    url_map = {}

    def replace_url(match):
        url = match.group(0)
        placeholder = f"__URL_{len(url_map)}__"
        url_map[placeholder] = url
        return placeholder

    protected_text = re.sub(url_pattern, replace_url, text)
    return protected_text, url_map

def restore_urls(self, text: str, url_map: dict[str, str]) -> str:
    """
    Restore protected URLs in text back to their original form.

    Args:
        text: Text with URL placeholders
        url_map: URL mapping dictionary from protect_urls

    Returns:
        str: Text with URLs restored
    """
    restored_text = text
    for placeholder, url in url_map.items():
        restored_text = restored_text.replace(placeholder, url)

    return restored_text
```
Copilot AI · Feb 2, 2026
The new URL protection and restoration functionality in the base chunker lacks test coverage. Since other chunkers have tests (e.g., test_sentence_chunker.py), tests should be added to verify that URLs are properly protected during chunking and restored afterwards, including edge cases like URLs at chunk boundaries or multiple URLs in the same chunk.
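A standalone round-trip test along the lines the review suggests could look like this. It uses the protect/restore logic from the diff as free functions, since the chunker class itself isn't shown here; in the real test suite these would be methods on the base chunker.

```python
import re

def protect_urls(text):
    # Same logic as the base chunker's protect_urls, lifted out as a
    # free function for a self-contained test sketch.
    url_pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    url_map = {}

    def replace_url(match):
        url = match.group(0)
        placeholder = f"__URL_{len(url_map)}__"
        url_map[placeholder] = url
        return placeholder

    return re.sub(url_pattern, replace_url, text), url_map

def restore_urls(text, url_map):
    for placeholder, url in url_map.items():
        text = text.replace(placeholder, url)
    return text

def test_url_round_trip():
    original = "see https://example.com/a and https://example.org/b?x=1"
    protected, url_map = protect_urls(original)
    # No raw URL should survive in the protected text.
    assert "https" not in protected
    assert len(url_map) == 2
    # Restoring must reproduce the original text exactly.
    assert restore_urls(protected, url_map) == original
```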
```python
    Args:
        text: Markdown text to parse
    """
    if not text:
        return {}

    headers = {}
    # Pattern to match markdown headers: # Title, ## Title, etc.
    header_pattern = r'^(#{1,6})\s+(.+)$'

    lines = text.split('\n')
    char_position = 0

    for line_num, line in enumerate(lines):
        # Match header pattern (must be at start of line)
        match = re.match(header_pattern, line.strip())
        if match:
            level = len(match.group(1))  # Number of # symbols (1-6)
            title = match.group(2).strip()  # Extract title text

            # Store header info with its position
            headers[line_num] = {
                'level': level,
                'title': title,
                'position': char_position
            }

            logger.debug(
                f"[FileContentParser] Found H{level} at line {line_num}: {title}"
            )

        # Update character position for next line (+1 for newline character)
        char_position += len(line) + 1
```
Copilot AI · Feb 2, 2026
The docstring is missing the "Returns:" section. According to the pattern seen in other methods in this file, docstrings should include a "Returns:" section describing what the method returns. This method returns a dictionary mapping line numbers to header information.
Suggested change (the rest of the method body is unchanged):

```python
    Args:
        text: Markdown text to parse

    Returns:
        dict[int, dict]: A mapping from 0-based line numbers to header metadata
            dictionaries, each containing the header level, title, and character
            position within the original text.
    """
```
```python
if extracted_texts:
    # Combine all extracted texts
    extracted_content = "\n".join(extracted_texts)
    #build final replacement text
```
Copilot AI · Feb 2, 2026
Missing space after the comment marker: this should be `# build final replacement text` instead of `#build final replacement text`.
Suggested change:

```diff
-    #build final replacement text
+    # build final replacement text
```
```python
chunk_size: int = 1000,
chunk_overlap: int = 200,
recursive: bool = False,
auto_fix_headers: bool = True,
```
Copilot AI · Feb 2, 2026
Trailing whitespace after the comma. This should be removed to maintain code cleanliness.
Suggested change (removes the trailing whitespace after the comma):

```diff
-auto_fix_headers: bool = True, 
+auto_fix_headers: bool = True,
```
Description
Please include a summary of the change, the problem it solves, the implementation approach, and relevant context. List any dependencies required for this change.
Related Issue (Required): Fixes @issue_number
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Checklist
Reviewer Checklist