fix(memory): parse YAML frontmatter in scan and improve search relevance (#12)
Conversation
The memory scanner previously treated the first non-empty line as the description, returning a raw `---` for files with YAML frontmatter. This broke search matching because the title and description fields, the only fields the search function inspected, contained no meaningful content.

Changes:

- Parse `name`/`description`/`type` from YAML frontmatter in scan, consistent with the existing skill loader pattern.
- Add `body_preview` and `memory_type` fields to `MemoryHeader` so downstream consumers can leverage structured metadata.
- Search now matches against body content in addition to metadata, with metadata hits weighted 2x for relevance ordering.
- Tokenizer handles CJK characters for multilingual memory queries.
- Eight new tests covering frontmatter parsing, search relevance ranking, body content matching, and CJK tokenization.
Pull request overview
This PR improves OpenHarness “memory” retrieval by correctly extracting metadata from Markdown files that use YAML frontmatter, and by expanding search to include a bounded body preview so relevant memories can be found even when query terms aren’t in the title/description.
Changes:
- Extend `MemoryHeader` with `memory_type` and `body_preview` to capture frontmatter type and a short body snippet.
- Update memory scanning to parse YAML frontmatter (`name`, `description`, `type`) and compute a 300-character body preview.
- Improve search relevance by tokenizing queries (including CJK ideographs) and scoring metadata matches higher than body matches; add tests covering frontmatter and ranking behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/openharness/memory/types.py` | Adds new header fields to carry frontmatter type and body preview used by search. |
| `src/openharness/memory/scan.py` | Parses YAML frontmatter and extracts `body_preview` for downstream search/scoring. |
| `src/openharness/memory/search.py` | Searches both metadata and body preview with weighted scoring and updated tokenization. |
| `tests/test_memory/test_memdir.py` | Adds test coverage for frontmatter parsing, fallback behavior, and relevance ordering (including a CJK query). |
| `CHANGELOG.md` | Documents the scanner/search behavior changes under Unreleased. |
```python
# Fallback: first non-empty, non-frontmatter line as description
if not description:
    for line in lines[body_start:body_start + 10]:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            description = stripped[:200]
            break

# Build body preview from content after frontmatter
body_lines = [
    line.strip()
    for line in lines[body_start:]
    if line.strip() and not line.strip().startswith("#")
]
body_preview = " ".join(body_lines)[:300]
```
body_preview is built from lines[body_start:]. When a file has no frontmatter (body_start == 0), this includes the same line that was used to populate description, so the same text can be counted both as metadata and as body content during scoring. This effectively over-weights the first paragraph for non-frontmatter files and makes the meta-vs-body weighting inconsistent. Consider excluding the chosen description line from body_preview (or advancing body_start past it when frontmatter is absent) so body scoring reflects only additional content.
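The reviewer's suggestion can be sketched as follows. `build_preview` and `desc_line_idx` are hypothetical names for illustration; this is a standalone rewrite under assumed signatures, not the merged implementation.

```python
from __future__ import annotations

def build_preview(lines: list[str], body_start: int, desc_line_idx: int | None) -> str:
    """Build a bounded body preview, skipping the line already used as description.

    When the description came from a body line (desc_line_idx is not None),
    that line is excluded so its text is not scored as both metadata and body.
    """
    body_lines = [
        line.strip()
        for idx, line in enumerate(lines[body_start:], body_start)
        if idx != desc_line_idx
        and line.strip()
        and not line.strip().startswith("#")
    ]
    return " ".join(body_lines)[:300]
```

With this, a non-frontmatter file whose first paragraph became the description contributes only its remaining content to body scoring.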
src/openharness/memory/scan.py (Outdated)
```python
if not description:
    for line in lines[body_start:body_start + 10]:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
```
The fallback description extraction only skips Markdown headings, but it does not skip frontmatter delimiter lines (---). If a file starts with --- but the frontmatter block is malformed/unclosed (or if body_start remains 0 for any reason), the fallback can still select --- as the description, reintroducing the original search-quality issue. Consider adding a guard to ignore lines that start with/are --- in the fallback loop (similar to skills/loader.py’s behavior).
```diff
-        if stripped and not stripped.startswith("#"):
+        if stripped and stripped != "---" and not stripped.startswith("#"):
```
src/openharness/memory/search.py (Outdated)
```python
    """Extract search tokens from *text*, handling ASCII and CJK."""
    # ASCII word tokens (3+ chars)
    ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
    # CJK individual characters (each is meaningful on its own)
    cjk_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
    return ascii_tokens | cjk_chars
```
_tokenize() claims to handle “CJK”, but the regex only matches CJK Unified Ideographs (Han) and excludes Japanese kana (\u3040-\u30FF), Hangul (\uAC00-\uD7AF), etc. Either broaden the character ranges so the implementation matches the docstring/comment, or narrow the wording to reflect that only Han characters are tokenized individually.
```diff
-    """Extract search tokens from *text*, handling ASCII and CJK."""
+    """Extract search tokens from *text*, handling ASCII and individual Han ideographs."""
     # ASCII word tokens (3+ chars)
     ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
-    # CJK individual characters (each is meaningful on its own)
-    cjk_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
-    return ascii_tokens | cjk_chars
+    # Han individual characters (each is meaningful on its own)
+    han_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
+    return ascii_tokens | han_chars
```
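Had the other option been taken (broadening the ranges rather than renaming), the tokenizer might look like the sketch below. The kana and Hangul ranges are an assumption illustrating the reviewer's point; the PR as merged kept Han-only single-character tokenization.

```python
import re

def tokenize(text: str) -> set[str]:
    """Extract search tokens: ASCII words plus single CJK characters.

    Sketch of a broadened variant: in addition to the Han ranges the PR
    matches, this also tokenizes Japanese kana (U+3040-U+30FF) and Hangul
    syllables (U+AC00-U+D7AF) as individual characters.
    """
    # ASCII word tokens (3+ chars)
    ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
    # Han ideographs, kana, and Hangul syllables as single-character tokens
    cjk_chars = set(
        re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf\u3040-\u30ff\uac00-\ud7af]", text)
    )
    return ascii_tokens | cjk_chars
```

Note the trade-off the author cites in the follow-up: single kana or Hangul jamo-level matches carry less lexical meaning than single Han ideographs, so broadening can add noise.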
CHANGELOG.md (Outdated)
```markdown
### Fixed

- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
- Memory search tokenizer handles CJK characters for multilingual queries.
- `CONTRIBUTING.md` with local setup, validation commands, and PR expectations.
- `docs/SHOWCASE.md` with concrete OpenHarness usage patterns and demo commands.
- GitHub issue templates and a pull request template.
```
The new ### Fixed section was inserted above existing bullets for CONTRIBUTING.md, docs/SHOWCASE.md, and GitHub templates, which makes those additions appear under “Fixed”. Those items look like new docs/templates rather than bug fixes, so they should likely remain under ### Added (or be moved to an appropriate section) to keep the changelog semantically correct.
```diff
-### Fixed
-
-- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
-- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
-- Memory search tokenizer handles CJK characters for multilingual queries.
 - `CONTRIBUTING.md` with local setup, validation commands, and PR expectations.
 - `docs/SHOWCASE.md` with concrete OpenHarness usage patterns and demo commands.
 - GitHub issue templates and a pull request template.
+
+### Fixed
+
+- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
+- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
+- Memory search tokenizer handles CJK characters for multilingual queries.
```
- Exclude the description line from `body_preview` for non-frontmatter files to prevent double-counting in search scoring (meta 2x + body 1x).
- Guard the fallback description against `---` delimiters from malformed frontmatter, consistent with `skills/loader.py` behavior.
- Rename CJK→Han in the tokenizer docstring and variable names to reflect that only Han ideographs are tokenized individually (kana/hangul excluded by design, as single characters lack lexical meaning there).
- Fix CHANGELOG section ordering so existing Added entries stay under their original heading.
- Add tests for malformed frontmatter and `body_preview` exclusion.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
```diff
     for path in memory_dir.glob("*.md"):
         if path.name == "MEMORY.md":
             continue
         try:
-            lines = path.read_text(encoding="utf-8").splitlines()
+            text = path.read_text(encoding="utf-8")
         except OSError:
             continue
-        title = path.stem
-        description = ""
-        for line in lines[:10]:
-            stripped = line.strip()
-            if stripped:
-                description = stripped[:160]
-                break
-        headers.append(
-            MemoryHeader(
-                path=path,
-                title=title,
-                description=description,
-                modified_at=path.stat().st_mtime,
-            )
-        )
+        header = _parse_memory_file(path, text)
+        headers.append(header)
```
scan_memory_files() now reads the entire contents of every *.md file before sorting/slicing by max_files. If the memory directory grows or files are large, this can significantly increase IO and latency for calls like find_relevant_memories() (which uses max_files=100). Consider reading only a bounded prefix (enough to cover frontmatter + ~300 chars of body) or streaming line-by-line to build the header without loading full files into memory.
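One way to bound the IO, per the review note, is to read only a fixed-size prefix of each file, enough to cover frontmatter plus the 300-character preview. `MAX_HEADER_CHARS` and `read_header_text` are hypothetical names for illustration, not part of the PR.

```python
from pathlib import Path

# Assumed budget: frontmatter is typically a few hundred characters, and the
# body preview is capped at 300, so a 4096-character prefix is ample.
MAX_HEADER_CHARS = 4096

def read_header_text(path: Path) -> str:
    """Read at most MAX_HEADER_CHARS characters instead of the whole file."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        return fh.read(MAX_HEADER_CHARS)
```

`_parse_memory_file` could then consume this bounded prefix unchanged, since everything it extracts (title, description, type, preview) lives near the top of the file.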
```python
# Parse YAML frontmatter (--- ... ---)
if lines and lines[0].strip() == "---":
    for i, line in enumerate(lines[1:], 1):
        if line.strip() == "---":
            for fm_line in lines[1:i]:
                key, _, value = fm_line.partition(":")
                key = key.strip()
                value = value.strip().strip("'\"")
                if not value:
                    continue
                if key == "name":
                    title = value
                elif key == "description":
                    description = value
                elif key == "type":
                    memory_type = value
            body_start = i + 1
            break

# Fallback: first non-empty, non-frontmatter line as description
desc_line_idx: int | None = None
if not description:
    for idx, line in enumerate(lines[body_start:body_start + 10], body_start):
        stripped = line.strip()
        if stripped and stripped != "---" and not stripped.startswith("#"):
            description = stripped[:200]
            desc_line_idx = idx
            break
```
When a file starts with a frontmatter delimiter (---) but is missing the closing delimiter, body_start remains 0 and the fallback description scan can pick up frontmatter key lines like name: ... as the description (e.g. name: oops). If the intent is to treat this as malformed frontmatter and fall back to real body content, consider skipping the initial delimiter and any leading key: value-style lines until the first blank/non-metadata line before selecting a description.
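The reviewer's suggestion could be sketched like this: skip the leading delimiter and any `key: value`-style lines before choosing a fallback description. `fallback_description` and the `_KV` pattern are hypothetical names for illustration.

```python
from __future__ import annotations
import re

# Heuristic for frontmatter-style metadata lines such as "name: oops".
_KV = re.compile(r"^[A-Za-z_][\w-]*\s*:")

def fallback_description(lines: list[str], limit: int = 10) -> str:
    """Pick a fallback description, ignoring delimiters, headings, and key: value lines.

    Guards against unclosed frontmatter: if body_start stayed at 0, the scan
    would otherwise select a metadata line like "name: oops" as the description.
    """
    for line in lines[:limit]:
        stripped = line.strip()
        if not stripped or stripped == "---" or stripped.startswith("#"):
            continue
        if _KV.match(stripped):
            continue
        return stripped[:200]
    return ""
```

The `_KV` heuristic can also skip legitimate body lines that happen to start with `word:`, so a stricter variant might only apply it when the file opened with an unclosed `---`.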
```diff
     for header in scan_memory_files(cwd, max_files=100):
-        haystack = f"{header.title} {header.description}".lower()
-        score = sum(1 for token in tokens if token in haystack)
-        if score:
+        meta = f"{header.title} {header.description}".lower()
+        body = header.body_preview.lower()
+
+        # Metadata matches are weighted 2x; body matches 1x.
+        meta_hits = sum(1 for t in tokens if t in meta)
+        body_hits = sum(1 for t in tokens if t in body)
+        score = meta_hits * 2.0 + body_hits
```
find_relevant_memories() says it searches “metadata and content”, but the metadata haystack only includes title and description. Since MemoryHeader now has memory_type (parsed from frontmatter), consider including memory_type in the metadata scoring so queries like “project” / “reference” can match on type as well (or clarify that type is intentionally excluded).
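Including `memory_type` in the metadata haystack, as the reviewer suggests, is a one-line change to the scoring. The `MemoryHeader` layout below is assumed from the PR description, not copied from the source.

```python
from dataclasses import dataclass

@dataclass
class MemoryHeader:
    title: str
    description: str
    memory_type: str = ""
    body_preview: str = ""

def score(header: MemoryHeader, tokens: set[str]) -> float:
    """Score a header: metadata hits weighted 2x, body hits 1x."""
    # memory_type joins the metadata haystack so queries like "project"
    # or "reference" can match on the frontmatter type.
    meta = f"{header.title} {header.description} {header.memory_type}".lower()
    body = header.body_preview.lower()
    meta_hits = sum(1 for t in tokens if t in meta)
    body_hits = sum(1 for t in tokens if t in body)
    return meta_hits * 2.0 + body_hits
```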
Problem

The memory scanner (`memory/scan.py`) treats the first non-empty line of each file as the description. For files with YAML frontmatter, the format encouraged by Claude-style workflows and already parsed by the skill loader, this returns the raw `---` delimiter as the description.

This silently degrades search quality in `find_relevant_memories()`, which only inspects `title` and `description` for matching. With frontmatter files, neither field contains meaningful content, so relevant memories never surface.

Before (frontmatter file):

After:
Changes

- `memory/types.py`: add `memory_type` and `body_preview` fields to `MemoryHeader`
- `memory/scan.py`: parse YAML frontmatter following the `skills/loader.py` pattern; extract body preview
- `memory/search.py`: match body content with weighted scoring and CJK-aware tokenization
- `tests/test_memory/test_memdir.py`: eight new tests covering frontmatter parsing, ranking, and tokenization
- `CHANGELOG.md`: document the behavior changes under Unreleased

Verification
All 131 tests pass (including the 8 new ones). Existing tests remain unchanged — the 3 original memory tests pass without modification since files without frontmatter use the same fallback path.
Design notes
- The frontmatter parser mirrors `skills/loader.py:_parse_skill_markdown` but lives in `memory/scan.py` to avoid coupling the two subsystems. Both handle `---` delimiters, key-value extraction, and quoted value stripping.
- Scoring uses a simple `meta_hits × 2 + body_hits` weighting rather than TF-IDF, keeping the implementation zero-dependency and predictable. This matches the project's philosophy of clarity over complexity.
- `body_preview` is capped at 300 characters to bound memory overhead while providing enough signal for search.