fix: Optimize chunk strategy #996
base: dev-20260202-v2.0.5
Conversation
Pull request overview
This PR optimizes the chunking strategy by introducing URL protection during text chunking to prevent URLs from being split, adding hierarchical header context to image processing in markdown documents, and implementing automatic detection and fixing of malformed markdown header hierarchies. It also improves language detection by filtering out URLs that could dilute Chinese character detection.
Changes:
- Added URL protection/restoration methods to base chunker to prevent URLs from being split during chunking
- Implemented markdown header extraction and hierarchical context tracking for better image content processing
- Added automatic detection and correction of malformed markdown header hierarchies
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| src/memos/chunkers/base.py | Adds protect_urls() and restore_urls() methods to base chunker class for URL preservation during chunking |
| src/memos/chunkers/simple_chunker.py | Integrates URL protection/restoration into simple text chunking logic |
| src/memos/chunkers/sentence_chunker.py | Applies URL protection to sentence-based chunking |
| src/memos/chunkers/charactertext_chunker.py | Applies URL protection to character-based chunking |
| src/memos/chunkers/markdown_chunker.py | Adds URL protection plus header hierarchy detection and auto-fix functionality |
| src/memos/mem_reader/read_multi_modal/file_content_parser.py | Implements markdown header extraction and adds hierarchical header context to image processing |
| src/memos/mem_reader/read_multi_modal/utils.py | Updates language detection to remove URLs before analyzing text |
| .gitignore | Adds test pipeline files to ignore list |
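The language-detection change in utils.py can be sketched roughly as follows. This is an illustrative assumption, not the PR's actual code: the function name, CJK range, and 30% threshold are hypothetical, but it shows why stripping URLs first matters — a long ASCII URL can drown out the Chinese characters in an otherwise Chinese sentence.

```python
import re

def detect_language(text: str) -> str:
    # Hypothetical sketch: remove URLs first so long ASCII URLs do not
    # dilute the ratio of CJK characters used for language detection.
    cleaned = re.sub(r'https?://\S+', '', text)
    cjk = sum(1 for ch in cleaned if '\u4e00' <= ch <= '\u9fff')
    total = sum(1 for ch in cleaned if not ch.isspace())
    # Threshold of 30% CJK characters is an assumption for illustration.
    return 'zh' if total and cjk / total > 0.3 else 'en'
```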
```python
header_context = None
if headers:
    header_context = self._get_header_context(text, image_position, headers)
```
Copilot AI · Feb 2, 2026
Line contains only whitespace. This empty line should be completely empty with no trailing spaces.
```python
# Subsequent headers: increment by 1, cap at level 6
new_level = min(current_level + 1, 6)
new_hashes = '#' * new_level
fixed_line = f"{new_hashes} {title_content}"
logger.debug(f"Adjust header level: {current_level} -> {new_level}: {title_content[:50]}...")
```
Copilot AI · Feb 2, 2026
The header hierarchy fix strategy may not produce the desired results in all cases. The current approach increments all headers after the first by 1 level, but this doesn't account for the original hierarchy structure. For example, if the original has headers at levels [1, 1, 1], they become [1, 2, 2], but if the original was [1, 2, 1], they become [1, 3, 2], which breaks the hierarchy (level 3 appears before level 2 is closed). A more robust approach would be to normalize all level-1 headers to level-2 except the first, preserving the relative structure of non-level-1 headers.
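A minimal sketch of the suggested normalization (the helper name is hypothetical): only duplicate level-1 headers are demoted to level 2, and all other header levels are left untouched, so the relative structure survives.

```python
import re

HEADER_RE = re.compile(r'^(#{1,6})\s+(.+)$')

def fix_header_hierarchy(text: str) -> str:
    # Sketch of the reviewer's suggestion: keep the first H1 as the
    # document title, demote every subsequent H1 to H2, and leave
    # non-level-1 headers alone so their relative structure is preserved.
    seen_h1 = False
    out = []
    for line in text.split('\n'):
        m = HEADER_RE.match(line)
        if m and len(m.group(1)) == 1:
            if not seen_h1:
                seen_h1 = True
            else:
                line = '#' + line  # '# Title' -> '## Title'
        out.append(line)
    return '\n'.join(out)
```

With this approach, levels [1, 2, 1] become [1, 2, 2] rather than [1, 3, 2], avoiding the broken ordering the comment describes.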
```python
def protect_urls(self, text: str) -> tuple[str, dict[str, str]]:
    """
    Protect URLs in text from being split during chunking.

    Args:
        text: Text to process

    Returns:
        tuple: (Text with URLs replaced by placeholders, URL mapping dictionary)
    """
    url_pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    url_map = {}

    def replace_url(match):
        url = match.group(0)
        placeholder = f"__URL_{len(url_map)}__"
        url_map[placeholder] = url
        return placeholder

    protected_text = re.sub(url_pattern, replace_url, text)
    return protected_text, url_map

def restore_urls(self, text: str, url_map: dict[str, str]) -> str:
    """
    Restore protected URLs in text back to their original form.

    Args:
        text: Text with URL placeholders
        url_map: URL mapping dictionary from protect_urls

    Returns:
        str: Text with URLs restored
    """
    restored_text = text
    for placeholder, url in url_map.items():
        restored_text = restored_text.replace(placeholder, url)

    return restored_text
```
Copilot AI · Feb 2, 2026
The new URL protection and restoration functionality in the base chunker lacks test coverage. Since other chunkers have tests (e.g., test_sentence_chunker.py), tests should be added to verify that URLs are properly protected during chunking and restored afterwards, including edge cases like URLs at chunk boundaries or multiple URLs in the same chunk.
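A standalone round-trip test along the lines the review suggests could look like this. It uses the protect/restore logic from the diff as free functions, since the chunker class itself isn't shown here; in the real test suite these would be methods on the base chunker.

```python
import re

def protect_urls(text):
    # Same logic as the base chunker's protect_urls, lifted out as a
    # free function for a self-contained test sketch.
    url_pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    url_map = {}

    def replace_url(match):
        url = match.group(0)
        placeholder = f"__URL_{len(url_map)}__"
        url_map[placeholder] = url
        return placeholder

    return re.sub(url_pattern, replace_url, text), url_map

def restore_urls(text, url_map):
    for placeholder, url in url_map.items():
        text = text.replace(placeholder, url)
    return text

def test_url_round_trip():
    original = "see https://example.com/a and https://example.org/b?x=1"
    protected, url_map = protect_urls(original)
    # No raw URL should survive in the protected text.
    assert "https" not in protected
    assert len(url_map) == 2
    # Restoring must reproduce the original text exactly.
    assert restore_urls(protected, url_map) == original
```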
```python
    Args:
        text: Markdown text to parse
    """
    if not text:
        return {}

    headers = {}
    # Pattern to match markdown headers: # Title, ## Title, etc.
    header_pattern = r'^(#{1,6})\s+(.+)$'

    lines = text.split('\n')
    char_position = 0

    for line_num, line in enumerate(lines):
        # Match header pattern (must be at start of line)
        match = re.match(header_pattern, line.strip())
        if match:
            level = len(match.group(1))  # Number of # symbols (1-6)
            title = match.group(2).strip()  # Extract title text

            # Store header info with its position
            headers[line_num] = {
                'level': level,
                'title': title,
                'position': char_position
            }

            logger.debug(
                f"[FileContentParser] Found H{level} at line {line_num}: {title}"
            )

        # Update character position for next line (+1 for newline character)
        char_position += len(line) + 1
```
Copilot AI · Feb 2, 2026
The docstring is missing the "Returns:" section. According to the pattern seen in other methods in this file, docstrings should include a "Returns:" section describing what the method returns. This method returns a dictionary mapping line numbers to header information.
Suggested change (the rest of the method body is unchanged):

```python
    Args:
        text: Markdown text to parse

    Returns:
        dict[int, dict]: A mapping from 0-based line numbers to header metadata
            dictionaries, each containing the header level, title, and character
            position within the original text.
    """
```
```python
if extracted_texts:
    # Combine all extracted texts
    extracted_content = "\n".join(extracted_texts)
    #build final replacement text
```
Copilot AI · Feb 2, 2026
Missing space after the comment marker: this should be `# build final replacement text` instead of `#build final replacement text`.
Suggested change:

```diff
-    #build final replacement text
+    # build final replacement text
```
```python
chunk_size: int = 1000,
chunk_overlap: int = 200,
recursive: bool = False,
auto_fix_headers: bool = True,
```
Copilot AI · Feb 2, 2026
Trailing whitespace after the comma. This should be removed to maintain code cleanliness.
Suggested change (removes the trailing whitespace after the comma):

```diff
-auto_fix_headers: bool = True, 
+auto_fix_headers: bool = True,
```
Description
Please include a summary of the change, the problem it solves, the implementation approach, and relevant context. List any dependencies required for this change.
Related Issue (Required): Fixes @issue_number
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Checklist
Reviewer Checklist