Skip to content

Conversation

@Mozy403
Copy link

@Mozy403 Mozy403 commented Feb 2, 2026

Description

Please include a summary of the change, the problem it solves, the implementation approach, and relevant context. List any dependencies required for this change.

Related Issue (Required): Fixes @issue_number

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g. code style improvements, linting)
  • Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Unit Test
  • Test Script Or Test Steps (please provide)
  • Pipeline Automated API Test (please provide)

Checklist

  • I have performed a self-review of my own code | 我已自行检查了自己的代码
  • I have commented my code in hard-to-understand areas | 我已在难以理解的地方对代码进行了注释
  • I have added tests that prove my fix is effective or that my feature works | 我已添加测试以证明我的修复有效或功能正常
  • I have created related documentation issue/PR in MemOS-Docs (if applicable) | 我已在 MemOS-Docs 中创建了相关的文档 issue/PR(如果适用)
  • I have linked the issue to this PR (if applicable) | 我已将 issue 链接到此 PR(如果适用)
  • I have mentioned the person who will review this PR | 我已提及将审查此 PR 的人

Reviewer Checklist

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Made sure Checks passed
  • Tests have been provided

Copilot AI review requested due to automatic review settings February 2, 2026 06:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the chunking strategy by introducing URL protection during text chunking to prevent URLs from being split, adding hierarchical header context to image processing in markdown documents, and implementing automatic detection and fixing of malformed markdown header hierarchies. It also improves language detection by filtering out URLs that could dilute Chinese character detection.

Changes:

  • Added URL protection/restoration methods to base chunker to prevent URLs from being split during chunking
  • Implemented markdown header extraction and hierarchical context tracking for better image content processing
  • Added automatic detection and correction of malformed markdown header hierarchies

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
src/memos/chunkers/base.py Adds protect_urls() and restore_urls() methods to base chunker class for URL preservation during chunking
src/memos/chunkers/simple_chunker.py Integrates URL protection/restoration into simple text chunking logic
src/memos/chunkers/sentence_chunker.py Applies URL protection to sentence-based chunking
src/memos/chunkers/charactertext_chunker.py Applies URL protection to character-based chunking
src/memos/chunkers/markdown_chunker.py Adds URL protection plus header hierarchy detection and auto-fix functionality
src/memos/mem_reader/read_multi_modal/file_content_parser.py Implements markdown header extraction and adds hierarchical header context to image processing
src/memos/mem_reader/read_multi_modal/utils.py Updates language detection to remove URLs before analyzing text
.gitignore Adds test pipeline files to ignore list

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

header_context = None
if headers:
header_context = self._get_header_context(text, image_position, headers)

Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line contains only whitespace. This empty line should be completely empty with no trailing spaces.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +143 to +147
# Subsequent headers: increment by 1, cap at level 6
new_level = min(current_level + 1, 6)
new_hashes = '#' * new_level
fixed_line = f"{new_hashes} {title_content}"
logger.debug(f"Adjust header level: {current_level} -> {new_level}: {title_content[:50]}...")
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header hierarchy fix strategy may not produce the desired results in all cases. The current approach increments all headers after the first by 1 level, but this doesn't account for the original hierarchy structure. For example, if the original has headers at levels [1, 1, 1], they become [1, 2, 2], but if the original was [1, 2, 1], they become [1, 3, 2], which breaks the hierarchy (level 3 appears before level 2 is closed). A more robust approach would be to normalize all level-1 headers to level-2 except the first, preserving the relative structure of non-level-1 headers.

Copilot uses AI. Check for mistakes.
Comment on lines +27 to +64
def protect_urls(self, text: str) -> tuple[str, dict[str, str]]:
"""
Protect URLs in text from being split during chunking.

Args:
text: Text to process

Returns:
tuple: (Text with URLs replaced by placeholders, URL mapping dictionary)
"""
url_pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
url_map = {}

def replace_url(match):
url = match.group(0)
placeholder = f"__URL_{len(url_map)}__"
url_map[placeholder] = url
return placeholder

protected_text = re.sub(url_pattern, replace_url, text)
return protected_text, url_map

def restore_urls(self, text: str, url_map: dict[str, str]) -> str:
"""
Restore protected URLs in text back to their original form.

Args:
text: Text with URL placeholders
url_map: URL mapping dictionary from protect_urls

Returns:
str: Text with URLs restored
"""
restored_text = text
for placeholder, url in url_map.items():
restored_text = restored_text.replace(placeholder, url)

return restored_text No newline at end of file
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new URL protection and restoration functionality in the base chunker lacks test coverage. Since other chunkers have tests (e.g., test_sentence_chunker.py), tests should be added to verify that URLs are properly protected during chunking and restored afterwards, including edge cases like URLs at chunk boundaries or multiple URLs in the same chunk.

Copilot uses AI. Check for mistakes.
Comment on lines +861 to +895

Args:
text: Markdown text to parse
"""
if not text:
return {}

headers = {}
# Pattern to match markdown headers: # Title, ## Title, etc.
header_pattern = r'^(#{1,6})\s+(.+)$'

lines = text.split('\n')
char_position = 0

for line_num, line in enumerate(lines):
# Match header pattern (must be at start of line)
match = re.match(header_pattern, line.strip())
if match:
level = len(match.group(1)) # Number of # symbols (1-6)
title = match.group(2).strip() # Extract title text

# Store header info with its position
headers[line_num] = {
'level': level,
'title': title,
'position': char_position
}

logger.debug(
f"[FileContentParser] Found H{level} at line {line_num}: {title}"
)

# Update character position for next line (+1 for newline character)
char_position += len(line) + 1

Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring is missing the "Returns:" section. According to the pattern seen in other methods in this file, docstrings should include a "Returns:" section describing what the method returns. This method returns a dictionary mapping line numbers to header information.

Suggested change
Args:
text: Markdown text to parse
"""
if not text:
return {}
headers = {}
# Pattern to match markdown headers: # Title, ## Title, etc.
header_pattern = r'^(#{1,6})\s+(.+)$'
lines = text.split('\n')
char_position = 0
for line_num, line in enumerate(lines):
# Match header pattern (must be at start of line)
match = re.match(header_pattern, line.strip())
if match:
level = len(match.group(1)) # Number of # symbols (1-6)
title = match.group(2).strip() # Extract title text
# Store header info with its position
headers[line_num] = {
'level': level,
'title': title,
'position': char_position
}
logger.debug(
f"[FileContentParser] Found H{level} at line {line_num}: {title}"
)
# Update character position for next line (+1 for newline character)
char_position += len(line) + 1
Args:
text: Markdown text to parse
Returns:
dict[int, dict]: A mapping from 0-based line numbers to header metadata
dictionaries, each containing the header level, title, and character
position within the original text.
"""
if not text:
return {}
headers = {}
# Pattern to match markdown headers: # Title, ## Title, etc.
header_pattern = r'^(#{1,6})\s+(.+)$'
lines = text.split('\n')
char_position = 0
for line_num, line in enumerate(lines):
# Match header pattern (must be at start of line)
match = re.match(header_pattern, line.strip())
if match:
level = len(match.group(1)) # Number of # symbols (1-6)
title = match.group(2).strip() # Extract title text
# Store header info with its position
headers[line_num] = {
'level': level,
'title': title,
'position': char_position
}
logger.debug(
f"[FileContentParser] Found H{level} at line {line_num}: {title}"
)
# Update character position for next line (+1 for newline character)
char_position += len(line) + 1

Copilot uses AI. Check for mistakes.
if extracted_texts:
# Combine all extracted texts
extracted_content = "\n".join(extracted_texts)
#build final replacement text
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after the comment marker. Should be # build final replacement text instead of #build final replacement text.

Suggested change
#build final replacement text
# build final replacement text

Copilot uses AI. Check for mistakes.
chunk_size: int = 1000,
chunk_overlap: int = 200,
recursive: bool = False,
auto_fix_headers: bool = True,
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace after the comma. This should be removed to maintain code cleanliness.

Suggested change
auto_fix_headers: bool = True,
auto_fix_headers: bool = True,

Copilot uses AI. Check for mistakes.
@Mozy403 Mozy403 closed this Feb 2, 2026
@Mozy403 Mozy403 reopened this Feb 2, 2026
@Mozy403 Mozy403 closed this Feb 2, 2026
@Mozy403 Mozy403 reopened this Feb 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant