fix(memory): parse YAML frontmatter in scan and improve search relevance #12

Merged
tjb-tech merged 2 commits into HKUDS:main from ZenAlexa:fix/memory-frontmatter-search
Apr 4, 2026
Conversation

ZenAlexa (Contributor) commented Apr 3, 2026

Problem

The memory scanner (memory/scan.py) treats the first non-empty line of each file as the description. For files with YAML frontmatter — the format encouraged by Claude-style workflows and already parsed by the skill loader — this returns the raw --- delimiter as the description.

This silently degrades search quality in find_relevant_memories(), which only inspects title and description for matching. With frontmatter files, neither field contains meaningful content, so relevant memories never surface.

Before (frontmatter file):

title: "project_auth"  (filename stem, not frontmatter name)
description: "---"      (first non-empty line)

After:

title: "auth-rewrite"                            (from frontmatter name)
description: "Auth middleware driven by compliance" (from frontmatter description)
memory_type: "project"                            (from frontmatter type)
body_preview: "Session token storage rework..."    (first 300 chars of body)
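
The "before" result can be reproduced with a small sketch of the pre-fix rule (first non-empty line among the first 10, capped at 160 characters). The function name `old_description` and the sample file contents are assumptions made for illustration; the sample is reconstructed from the field values shown above, not copied from the repository.

```python
def old_description(text: str) -> str:
    """Pre-fix rule: return the first non-empty line of the first 10 lines."""
    for line in text.splitlines()[:10]:
        stripped = line.strip()
        if stripped:
            return stripped[:160]
    return ""


# Hypothetical memory file using YAML frontmatter.
sample = """---
name: auth-rewrite
description: Auth middleware driven by compliance
type: project
---
Session token storage rework...
"""

# The opening delimiter is the first non-empty line, so it wins.
print(old_description(sample))  # prints: ---
```

Because the delimiter is always the first non-empty line of a frontmatter file, every such file ends up with the same uninformative description.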

Changes

| File | Change |
| --- | --- |
| memory/types.py | Add memory_type and body_preview fields to MemoryHeader |
| memory/scan.py | Parse YAML frontmatter (name/description/type), consistent with skills/loader.py pattern; extract body preview |
| memory/search.py | Search body content in addition to metadata; weight metadata matches 2×; handle CJK characters in tokenizer |
| tests/test_memory/test_memdir.py | 8 new tests: frontmatter parsing, fallback behavior, quoted values, search relevance ordering, body content matching, CJK queries |
| CHANGELOG.md | Add entries under Unreleased/Fixed |

Verification

$ uv run ruff check src/openharness/memory/ tests/test_memory/
All checks passed!

$ uv run pytest tests/ -q
131 passed, 3 warnings in 6.65s

All 131 tests pass (including the 8 new ones). Existing tests remain unchanged — the 3 original memory tests pass without modification since files without frontmatter use the same fallback path.

Design notes

  • Frontmatter parsing mirrors the approach in skills/loader.py:_parse_skill_markdown but lives in memory/scan.py to avoid coupling the two subsystems. Both handle --- delimiters, key-value extraction, and quoted value stripping.
  • Search scoring uses a simple meta_hits × 2 + body_hits weighting rather than TF-IDF, keeping the implementation zero-dependency and predictable. This matches the project's philosophy of clarity over complexity.
  • CJK tokenization treats each CJK character as an independent token (standard for Chinese/Japanese), enabling multilingual memory retrieval.
  • body_preview is capped at 300 characters to bound memory overhead while providing enough signal for search.
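
The weighting described in the design notes can be sketched as follows. This is a minimal illustration that assumes substring matching against lowercased text, as in the search diff in this PR; the standalone `score` function and the two sample memories are hypothetical.

```python
def score(tokens: set[str], title: str, description: str, body_preview: str) -> float:
    """Count each token once per field group; metadata hits weigh 2x, body hits 1x."""
    meta = f"{title} {description}".lower()
    body = body_preview.lower()
    meta_hits = sum(1 for t in tokens if t in meta)
    body_hits = sum(1 for t in tokens if t in body)
    return meta_hits * 2.0 + body_hits


tokens = {"auth", "token"}

# "auth" matches the metadata (2.0) and "token" matches the body (1.0).
a = score(tokens, "auth-rewrite", "Auth middleware driven by compliance",
          "Session token storage rework")

# Both tokens match only the body (1.0 each).
b = score(tokens, "notes", "Misc notes", "auth token handling")

print(a, b)  # prints: 3.0 2.0
```

A memory whose metadata matches a single query token therefore outranks one that matches two tokens only in its body, which is the intended bias toward curated titles and descriptions.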

The memory scanner previously treated the first non-empty line as the
description, returning raw "---" for files with YAML frontmatter.  This
broke search matching because the title and description fields—the only
fields the search function inspected—contained no meaningful content.

Changes:
- Parse name/description/type from YAML frontmatter in scan, consistent
  with the existing skill loader pattern.
- Add body_preview and memory_type fields to MemoryHeader so downstream
  consumers can leverage structured metadata.
- Search now matches against body content in addition to metadata, with
  metadata hits weighted 2x for relevance ordering.
- Tokenizer handles CJK characters for multilingual memory queries.
- Eight new tests covering frontmatter parsing, search relevance ranking,
  body content matching, and CJK tokenization.
Copilot AI review requested due to automatic review settings April 3, 2026 10:38

Copilot AI left a comment

Pull request overview

This PR improves OpenHarness “memory” retrieval by correctly extracting metadata from Markdown files that use YAML frontmatter, and by expanding search to include a bounded body preview so relevant memories can be found even when query terms aren’t in the title/description.

Changes:

  • Extend MemoryHeader with memory_type and body_preview to capture frontmatter type and a short body snippet.
  • Update memory scanning to parse YAML frontmatter (name, description, type) and compute a 300-character body preview.
  • Improve search relevance by tokenizing queries (including CJK ideographs) and scoring metadata matches higher than body matches; add tests covering frontmatter and ranking behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/openharness/memory/types.py | Adds new header fields to carry frontmatter type and body preview used by search. |
| src/openharness/memory/scan.py | Parses YAML frontmatter and extracts body_preview for downstream search/scoring. |
| src/openharness/memory/search.py | Searches both metadata and body preview with weighted scoring and updated tokenization. |
| tests/test_memory/test_memdir.py | Adds test coverage for frontmatter parsing, fallback behavior, and relevance ordering (including CJK query). |
| CHANGELOG.md | Documents the scanner/search behavior changes under Unreleased. |

Comment on lines +55 to +69
    # Fallback: first non-empty, non-frontmatter line as description
    if not description:
        for line in lines[body_start:body_start + 10]:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                description = stripped[:200]
                break

    # Build body preview from content after frontmatter
    body_lines = [
        line.strip()
        for line in lines[body_start:]
        if line.strip() and not line.strip().startswith("#")
    ]
    body_preview = " ".join(body_lines)[:300]

Copilot AI Apr 3, 2026

body_preview is built from lines[body_start:]. When a file has no frontmatter (body_start == 0), this includes the same line that was used to populate description, so the same text can be counted both as metadata and as body content during scoring. This effectively over-weights the first paragraph for non-frontmatter files and makes the meta-vs-body weighting inconsistent. Consider excluding the chosen description line from body_preview (or advancing body_start past it when frontmatter is absent) so body scoring reflects only additional content.
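
One way to avoid the double count is to remember which line supplied the fallback description and skip it when building the preview. The sketch below assumes the same line-based scan as the reviewed snippet; the function name `describe_and_preview` and the sample lines are hypothetical.

```python
def describe_and_preview(lines: list[str], body_start: int = 0) -> tuple[str, str]:
    """Pick a fallback description, then build a preview that excludes that line."""
    description = ""
    desc_line_idx = None
    for idx, line in enumerate(lines[body_start:body_start + 10], body_start):
        stripped = line.strip()
        if stripped and stripped != "---" and not stripped.startswith("#"):
            description = stripped[:200]
            desc_line_idx = idx  # remember where the description came from
            break

    body_lines = [
        line.strip()
        for idx, line in enumerate(lines[body_start:], body_start)
        if idx != desc_line_idx and line.strip() and not line.strip().startswith("#")
    ]
    return description, " ".join(body_lines)[:300]


lines = ["# Notes", "Auth middleware overview", "", "Session tokens are rotated daily."]
desc, preview = describe_and_preview(lines)
print(desc)     # prints: Auth middleware overview
print(preview)  # prints: Session tokens are rotated daily.
```

With the description line excluded, a query token that appears only in the first paragraph is scored once as metadata instead of once as metadata and again as body.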

    if not description:
        for line in lines[body_start:body_start + 10]:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):

Copilot AI Apr 3, 2026

The fallback description extraction only skips Markdown headings, but it does not skip frontmatter delimiter lines (---). If a file starts with --- but the frontmatter block is malformed/unclosed (or if body_start remains 0 for any reason), the fallback can still select --- as the description, reintroducing the original search-quality issue. Consider adding a guard to ignore lines that start with/are --- in the fallback loop (similar to skills/loader.py’s behavior).

Suggested change
-            if stripped and not stripped.startswith("#"):
+            if stripped and stripped != "---" and not stripped.startswith("#"):

Comment on lines +44 to +49
    """Extract search tokens from *text*, handling ASCII and CJK."""
    # ASCII word tokens (3+ chars)
    ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
    # CJK individual characters (each is meaningful on its own)
    cjk_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
    return ascii_tokens | cjk_chars

Copilot AI Apr 3, 2026

_tokenize() claims to handle “CJK”, but the regex only matches CJK Unified Ideographs (Han) and excludes Japanese kana (\u3040-\u30FF), Hangul (\uAC00-\uD7AF), etc. Either broaden the character ranges so the implementation matches the docstring/comment, or narrow the wording to reflect that only Han characters are tokenized individually.

Suggested change
-    """Extract search tokens from *text*, handling ASCII and CJK."""
-    # ASCII word tokens (3+ chars)
-    ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
-    # CJK individual characters (each is meaningful on its own)
-    cjk_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
-    return ascii_tokens | cjk_chars
+    """Extract search tokens from *text*, handling ASCII and individual Han ideographs."""
+    # ASCII word tokens (3+ chars)
+    ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
+    # Han individual characters (each is meaningful on its own)
+    han_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
+    return ascii_tokens | han_chars
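
To make the scope of the character ranges concrete, here is a self-contained version of the tokenizer run on mixed input. The regular expressions are taken from the snippet above; the public name `tokenize` and the sample string are assumptions for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """ASCII words of 3+ characters plus individual Han ideographs."""
    ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
    han_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
    return ascii_tokens | han_chars


# Mixed input: an ASCII word, two Han ideographs, hiragana, and Hangul.
tokens = tokenize("Auth 認証 かな 한글")
print(sorted(tokens))  # prints: ['auth', '証', '認']
```

The hiragana and Hangul characters fall outside both ranges, so they produce no tokens, and ASCII words shorter than three characters (for example `db`) are dropped as well.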

CHANGELOG.md Outdated
Comment on lines 12 to 20

### Fixed

- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
- Memory search tokenizer handles CJK characters for multilingual queries.
- `CONTRIBUTING.md` with local setup, validation commands, and PR expectations.
- `docs/SHOWCASE.md` with concrete OpenHarness usage patterns and demo commands.
- GitHub issue templates and a pull request template.

Copilot AI Apr 3, 2026

The new ### Fixed section was inserted above existing bullets for CONTRIBUTING.md, docs/SHOWCASE.md, and GitHub templates, which makes those additions appear under “Fixed”. Those items look like new docs/templates rather than bug fixes, so they should likely remain under ### Added (or be moved to an appropriate section) to keep the changelog semantically correct.

Suggested change
-### Fixed
-- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
-- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
-- Memory search tokenizer handles CJK characters for multilingual queries.
-- `CONTRIBUTING.md` with local setup, validation commands, and PR expectations.
-- `docs/SHOWCASE.md` with concrete OpenHarness usage patterns and demo commands.
-- GitHub issue templates and a pull request template.
+- `CONTRIBUTING.md` with local setup, validation commands, and PR expectations.
+- `docs/SHOWCASE.md` with concrete OpenHarness usage patterns and demo commands.
+- GitHub issue templates and a pull request template.
+### Fixed
+- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
+- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
+- Memory search tokenizer handles CJK characters for multilingual queries.

- Exclude description line from body_preview for non-frontmatter files
  to prevent double-counting in search scoring (meta 2x + body 1x).
- Guard fallback description against '---' delimiters from malformed
  frontmatter, consistent with skills/loader.py behavior.
- Rename CJK→Han in tokenizer docstring and variable names to reflect
  that only Han ideographs are tokenized individually (kana/hangul
  excluded by design as single characters lack lexical meaning).
- Fix CHANGELOG section ordering so existing Added entries stay under
  their original heading.
- Add tests for malformed frontmatter and body_preview exclusion.

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment on lines 15 to +23
     for path in memory_dir.glob("*.md"):
         if path.name == "MEMORY.md":
             continue
         try:
-            lines = path.read_text(encoding="utf-8").splitlines()
+            text = path.read_text(encoding="utf-8")
         except OSError:
             continue
-        title = path.stem
-        description = ""
-        for line in lines[:10]:
-            stripped = line.strip()
-            if stripped:
-                description = stripped[:160]
-                break
-        headers.append(
-            MemoryHeader(
-                path=path,
-                title=title,
-                description=description,
-                modified_at=path.stat().st_mtime,
-            )
-        )
+        header = _parse_memory_file(path, text)
+        headers.append(header)

Copilot AI Apr 3, 2026

scan_memory_files() now reads the entire contents of every *.md file before sorting/slicing by max_files. If the memory directory grows or files are large, this can significantly increase IO and latency for calls like find_relevant_memories() (which uses max_files=100). Consider reading only a bounded prefix (enough to cover frontmatter + ~300 chars of body) or streaming line-by-line to build the header without loading full files into memory.
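
A bounded read along the lines suggested here might look like the sketch below. The helper `read_prefix` and the 4096-byte budget are assumptions made for illustration, not part of the PR; the budget would need to be large enough to cover the frontmatter plus roughly 300 characters of body.

```python
import tempfile
from pathlib import Path

PREFIX_BYTES = 4096  # hypothetical budget: frontmatter plus a short body preview


def read_prefix(path: Path, limit: int = PREFIX_BYTES) -> str:
    """Read at most `limit` bytes and decode them as UTF-8.

    errors="ignore" drops a multibyte character that is cut in half at the
    boundary; it also silently drops any other invalid bytes.
    """
    with path.open("rb") as fh:
        raw = fh.read(limit)
    return raw.decode("utf-8", errors="ignore")


# Demonstration on a large temporary file.
with tempfile.TemporaryDirectory() as tmp:
    big = Path(tmp) / "big.md"
    big.write_bytes(b"---\nname: big\n---\n" + b"x" * 100_000)
    prefix = read_prefix(big, limit=64)

print(len(prefix))  # prints: 64
```

The trade-off is that frontmatter longer than the budget would be treated as unclosed, so a real implementation would need to decide how to handle that case.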

Comment on lines +36 to +63
    # Parse YAML frontmatter (--- ... ---)
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], 1):
            if line.strip() == "---":
                for fm_line in lines[1:i]:
                    key, _, value = fm_line.partition(":")
                    key = key.strip()
                    value = value.strip().strip("'\"")
                    if not value:
                        continue
                    if key == "name":
                        title = value
                    elif key == "description":
                        description = value
                    elif key == "type":
                        memory_type = value
                body_start = i + 1
                break

    # Fallback: first non-empty, non-frontmatter line as description
    desc_line_idx: int | None = None
    if not description:
        for idx, line in enumerate(lines[body_start:body_start + 10], body_start):
            stripped = line.strip()
            if stripped and stripped != "---" and not stripped.startswith("#"):
                description = stripped[:200]
                desc_line_idx = idx
                break

Copilot AI Apr 3, 2026

When a file starts with a frontmatter delimiter (---) but is missing the closing delimiter, body_start remains 0 and the fallback description scan can pick up frontmatter key lines like name: ... as the description (e.g. name: oops). If the intent is to treat this as malformed frontmatter and fall back to real body content, consider skipping the initial delimiter and any leading key: value-style lines until the first blank/non-metadata line before selecting a description.
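
The skipping behavior described in this comment could be sketched as follows. The helper `fallback_description` and the `_KEY_VALUE` pattern are hypothetical, and the heuristic has a known cost: a genuine body line such as `Note: ...` placed directly after an unclosed delimiter would also be skipped.

```python
import re

# Hypothetical heuristic: a line that looks like "key: value" metadata.
_KEY_VALUE = re.compile(r"^[A-Za-z_][\w-]*\s*:")


def fallback_description(lines: list[str]) -> str:
    """Pick a description when frontmatter is opened but never closed."""
    start = 0
    if lines and lines[0].strip() == "---":
        start = 1
        # Skip orphaned metadata lines until the first blank or non-metadata line.
        while start < len(lines) and _KEY_VALUE.match(lines[start].strip()):
            start += 1
    for line in lines[start:start + 10]:
        stripped = line.strip()
        if stripped and stripped != "---" and not stripped.startswith("#"):
            return stripped[:200]
    return ""


malformed = ["---", "name: oops", "", "Real body text here."]
print(fallback_description(malformed))  # prints: Real body text here.
```

Files without a leading delimiter are unaffected, since the skip loop only runs when the first line is `---`.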

Comment on lines 28 to +35
     for header in scan_memory_files(cwd, max_files=100):
-        haystack = f"{header.title} {header.description}".lower()
-        score = sum(1 for token in tokens if token in haystack)
-        if score:
+        meta = f"{header.title} {header.description}".lower()
+        body = header.body_preview.lower()
+
+        # Metadata matches are weighted 2x; body matches 1x.
+        meta_hits = sum(1 for t in tokens if t in meta)
+        body_hits = sum(1 for t in tokens if t in body)
+        score = meta_hits * 2.0 + body_hits

Copilot AI Apr 3, 2026

find_relevant_memories() says it searches “metadata and content”, but the metadata haystack only includes title and description. Since MemoryHeader now has memory_type (parsed from frontmatter), consider including memory_type in the metadata scoring so queries like “project” / “reference” can match on type as well (or clarify that type is intentionally excluded).
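
If the type were included, the change would amount to one extra field in the metadata haystack. The sketch below uses a hypothetical `Header` dataclass as a stand-in for MemoryHeader (only the fields relevant to scoring) and a hypothetical `score_with_type` function; whether the project wants type matches weighted like title matches is an open design question.

```python
from dataclasses import dataclass


@dataclass
class Header:
    """Hypothetical stand-in for MemoryHeader with only the scored fields."""
    title: str
    description: str
    memory_type: str = ""
    body_preview: str = ""


def score_with_type(tokens: set[str], header: Header) -> float:
    """Same 2x/1x weighting, with memory_type added to the metadata haystack."""
    meta = f"{header.title} {header.description} {header.memory_type}".lower()
    body = header.body_preview.lower()
    meta_hits = sum(1 for t in tokens if t in meta)
    body_hits = sum(1 for t in tokens if t in body)
    return meta_hits * 2.0 + body_hits


typed = Header("auth-rewrite", "Auth middleware driven by compliance", "project")
untyped = Header("auth-rewrite", "Auth middleware driven by compliance")

print(score_with_type({"project"}, typed))    # prints: 2.0
print(score_with_type({"project"}, untyped))  # prints: 0.0
```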

@tjb-tech tjb-tech merged commit ac582c7 into HKUDS:main Apr 4, 2026
3 of 4 checks passed
tjb-tech added a commit that referenced this pull request Apr 4, 2026
6 tasks testing PR #17 (diagnose skill), #12 (memory frontmatter),
#14 (OpenAI client), #16 (session resume, cron, cost tracking),
and all PRs combined — every task runs on AutoAgent (17K LOC).

5/6 pass (1 test assertion typo: 'failure' vs 'failed').