fix(memory): parse YAML frontmatter in scan and improve search relevance (#12)
Conversation
The memory scanner previously treated the first non-empty line as the description, returning a raw `---` for files with YAML frontmatter. This broke search matching because the title and description fields, the only fields the search function inspected, contained no meaningful content.

Changes:

- Parse `name`/`description`/`type` from YAML frontmatter in scan, consistent with the existing skill loader pattern.
- Add `body_preview` and `memory_type` fields to `MemoryHeader` so downstream consumers can leverage structured metadata.
- Search now matches against body content in addition to metadata, with metadata hits weighted 2x for relevance ordering.
- Tokenizer handles CJK characters for multilingual memory queries.
- Eight new tests covering frontmatter parsing, search relevance ranking, body content matching, and CJK tokenization.
Pull request overview
This PR improves OpenHarness “memory” retrieval by correctly extracting metadata from Markdown files that use YAML frontmatter, and by expanding search to include a bounded body preview so relevant memories can be found even when query terms aren’t in the title/description.
Changes:
- Extend `MemoryHeader` with `memory_type` and `body_preview` to capture frontmatter type and a short body snippet.
- Update memory scanning to parse YAML frontmatter (`name`, `description`, `type`) and compute a 300-character body preview.
- Improve search relevance by tokenizing queries (including CJK ideographs) and scoring metadata matches higher than body matches; add tests covering frontmatter and ranking behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/openharness/memory/types.py` | Adds new header fields to carry frontmatter type and body preview used by search. |
| `src/openharness/memory/scan.py` | Parses YAML frontmatter and extracts `body_preview` for downstream search/scoring. |
| `src/openharness/memory/search.py` | Searches both metadata and body preview with weighted scoring and updated tokenization. |
| `tests/test_memory/test_memdir.py` | Adds test coverage for frontmatter parsing, fallback behavior, and relevance ordering (including a CJK query). |
| `CHANGELOG.md` | Documents the scanner/search behavior changes under Unreleased. |
```python
# Fallback: first non-empty, non-frontmatter line as description
if not description:
    for line in lines[body_start:body_start + 10]:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            description = stripped[:200]
            break

# Build body preview from content after frontmatter
body_lines = [
    line.strip()
    for line in lines[body_start:]
    if line.strip() and not line.strip().startswith("#")
]
body_preview = " ".join(body_lines)[:300]
```
body_preview is built from lines[body_start:]. When a file has no frontmatter (body_start == 0), this includes the same line that was used to populate description, so the same text can be counted both as metadata and as body content during scoring. This effectively over-weights the first paragraph for non-frontmatter files and makes the meta-vs-body weighting inconsistent. Consider excluding the chosen description line from body_preview (or advancing body_start past it when frontmatter is absent) so body scoring reflects only additional content.
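The reviewer's suggestion can be sketched as follows. `build_preview` and `desc_line_idx` are hypothetical names for illustration; this is a standalone rewrite under assumed signatures, not the merged implementation.

```python
from __future__ import annotations

def build_preview(lines: list[str], body_start: int, desc_line_idx: int | None) -> str:
    """Build a bounded body preview, skipping the line already used as description.

    When the description came from a body line (desc_line_idx is not None),
    that line is excluded so its text is not scored as both metadata and body.
    """
    body_lines = [
        line.strip()
        for idx, line in enumerate(lines[body_start:], body_start)
        if idx != desc_line_idx
        and line.strip()
        and not line.strip().startswith("#")
    ]
    return " ".join(body_lines)[:300]
```

With this, a non-frontmatter file whose first paragraph became the description contributes only its remaining content to body scoring.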
src/openharness/memory/scan.py (Outdated)
```python
if not description:
    for line in lines[body_start:body_start + 10]:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
```
The fallback description extraction only skips Markdown headings, but it does not skip frontmatter delimiter lines (---). If a file starts with --- but the frontmatter block is malformed/unclosed (or if body_start remains 0 for any reason), the fallback can still select --- as the description, reintroducing the original search-quality issue. Consider adding a guard to ignore lines that start with/are --- in the fallback loop (similar to skills/loader.py’s behavior).
```diff
-        if stripped and not stripped.startswith("#"):
+        if stripped and stripped != "---" and not stripped.startswith("#"):
```
src/openharness/memory/search.py (Outdated)
```python
    """Extract search tokens from *text*, handling ASCII and CJK."""
    # ASCII word tokens (3+ chars)
    ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
    # CJK individual characters (each is meaningful on its own)
    cjk_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
    return ascii_tokens | cjk_chars
```
_tokenize() claims to handle “CJK”, but the regex only matches CJK Unified Ideographs (Han) and excludes Japanese kana (\u3040-\u30FF), Hangul (\uAC00-\uD7AF), etc. Either broaden the character ranges so the implementation matches the docstring/comment, or narrow the wording to reflect that only Han characters are tokenized individually.
```diff
-    """Extract search tokens from *text*, handling ASCII and CJK."""
+    """Extract search tokens from *text*, handling ASCII and individual Han ideographs."""
     # ASCII word tokens (3+ chars)
     ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
-    # CJK individual characters (each is meaningful on its own)
-    cjk_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
-    return ascii_tokens | cjk_chars
+    # Han individual characters (each is meaningful on its own)
+    han_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
+    return ascii_tokens | han_chars
```
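Had the other option been taken (broadening the ranges rather than renaming), the tokenizer might look like the sketch below. The kana and Hangul ranges are an assumption illustrating the reviewer's point; the PR as merged kept Han-only single-character tokenization.

```python
import re

def tokenize(text: str) -> set[str]:
    """Extract search tokens: ASCII words plus single CJK characters.

    Sketch of a broadened variant: in addition to the Han ranges the PR
    matches, this also tokenizes Japanese kana (U+3040-U+30FF) and Hangul
    syllables (U+AC00-U+D7AF) as individual characters.
    """
    # ASCII word tokens (3+ chars)
    ascii_tokens = {t for t in re.findall(r"[A-Za-z0-9_]+", text.lower()) if len(t) >= 3}
    # Han ideographs, kana, and Hangul syllables as single-character tokens
    cjk_chars = set(
        re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf\u3040-\u30ff\uac00-\ud7af]", text)
    )
    return ascii_tokens | cjk_chars
```

Note the trade-off the author cites in the follow-up: single kana or Hangul jamo-level matches carry less lexical meaning than single Han ideographs, so broadening can add noise.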
CHANGELOG.md (Outdated)
```markdown
### Fixed

- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
- Memory search tokenizer handles CJK characters for multilingual queries.
- `CONTRIBUTING.md` with local setup, validation commands, and PR expectations.
- `docs/SHOWCASE.md` with concrete OpenHarness usage patterns and demo commands.
- GitHub issue templates and a pull request template.
```
The new ### Fixed section was inserted above existing bullets for CONTRIBUTING.md, docs/SHOWCASE.md, and GitHub templates, which makes those additions appear under “Fixed”. Those items look like new docs/templates rather than bug fixes, so they should likely remain under ### Added (or be moved to an appropriate section) to keep the changelog semantically correct.
```diff
-### Fixed
-
-- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
-- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
-- Memory search tokenizer handles CJK characters for multilingual queries.
 - `CONTRIBUTING.md` with local setup, validation commands, and PR expectations.
 - `docs/SHOWCASE.md` with concrete OpenHarness usage patterns and demo commands.
 - GitHub issue templates and a pull request template.
+
+### Fixed
+
+- Memory scanner now parses YAML frontmatter (`name`, `description`, `type`) instead of returning raw `---` as description.
+- Memory search matches against body content in addition to metadata, with metadata weighted higher for relevance.
+- Memory search tokenizer handles CJK characters for multilingual queries.
```
- Exclude the description line from `body_preview` for non-frontmatter files to prevent double-counting in search scoring (meta 2x + body 1x).
- Guard the fallback description against `---` delimiters from malformed frontmatter, consistent with `skills/loader.py` behavior.
- Rename CJK→Han in the tokenizer docstring and variable names to reflect that only Han ideographs are tokenized individually (kana/hangul excluded by design, as single characters lack lexical meaning there).
- Fix CHANGELOG section ordering so existing Added entries stay under their original heading.
- Add tests for malformed frontmatter and `body_preview` exclusion.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
```diff
     for path in memory_dir.glob("*.md"):
         if path.name == "MEMORY.md":
             continue
         try:
-            lines = path.read_text(encoding="utf-8").splitlines()
+            text = path.read_text(encoding="utf-8")
         except OSError:
             continue
-        title = path.stem
-        description = ""
-        for line in lines[:10]:
-            stripped = line.strip()
-            if stripped:
-                description = stripped[:160]
-                break
-        headers.append(
-            MemoryHeader(
-                path=path,
-                title=title,
-                description=description,
-                modified_at=path.stat().st_mtime,
-            )
-        )
+        header = _parse_memory_file(path, text)
+        headers.append(header)
```
scan_memory_files() now reads the entire contents of every *.md file before sorting/slicing by max_files. If the memory directory grows or files are large, this can significantly increase IO and latency for calls like find_relevant_memories() (which uses max_files=100). Consider reading only a bounded prefix (enough to cover frontmatter + ~300 chars of body) or streaming line-by-line to build the header without loading full files into memory.
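One way to bound the IO, per the review note, is to read only a fixed-size prefix of each file, enough to cover frontmatter plus the 300-character preview. `MAX_HEADER_CHARS` and `read_header_text` are hypothetical names for illustration, not part of the PR.

```python
from pathlib import Path

# Assumed budget: frontmatter is typically a few hundred characters, and the
# body preview is capped at 300, so a 4096-character prefix is ample.
MAX_HEADER_CHARS = 4096

def read_header_text(path: Path) -> str:
    """Read at most MAX_HEADER_CHARS characters instead of the whole file."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        return fh.read(MAX_HEADER_CHARS)
```

`_parse_memory_file` could then consume this bounded prefix unchanged, since everything it extracts (title, description, type, preview) lives near the top of the file.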
```python
# Parse YAML frontmatter (--- ... ---)
if lines and lines[0].strip() == "---":
    for i, line in enumerate(lines[1:], 1):
        if line.strip() == "---":
            for fm_line in lines[1:i]:
                key, _, value = fm_line.partition(":")
                key = key.strip()
                value = value.strip().strip("'\"")
                if not value:
                    continue
                if key == "name":
                    title = value
                elif key == "description":
                    description = value
                elif key == "type":
                    memory_type = value
            body_start = i + 1
            break

# Fallback: first non-empty, non-frontmatter line as description
desc_line_idx: int | None = None
if not description:
    for idx, line in enumerate(lines[body_start:body_start + 10], body_start):
        stripped = line.strip()
        if stripped and stripped != "---" and not stripped.startswith("#"):
            description = stripped[:200]
            desc_line_idx = idx
            break
```
When a file starts with a frontmatter delimiter (---) but is missing the closing delimiter, body_start remains 0 and the fallback description scan can pick up frontmatter key lines like name: ... as the description (e.g. name: oops). If the intent is to treat this as malformed frontmatter and fall back to real body content, consider skipping the initial delimiter and any leading key: value-style lines until the first blank/non-metadata line before selecting a description.
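The reviewer's suggestion could be sketched like this: skip the leading delimiter and any `key: value`-style lines before choosing a fallback description. `fallback_description` and the `_KV` pattern are hypothetical names for illustration.

```python
from __future__ import annotations
import re

# Heuristic for frontmatter-style metadata lines such as "name: oops".
_KV = re.compile(r"^[A-Za-z_][\w-]*\s*:")

def fallback_description(lines: list[str], limit: int = 10) -> str:
    """Pick a fallback description, ignoring delimiters, headings, and key: value lines.

    Guards against unclosed frontmatter: if body_start stayed at 0, the scan
    would otherwise select a metadata line like "name: oops" as the description.
    """
    for line in lines[:limit]:
        stripped = line.strip()
        if not stripped or stripped == "---" or stripped.startswith("#"):
            continue
        if _KV.match(stripped):
            continue
        return stripped[:200]
    return ""
```

The `_KV` heuristic can also skip legitimate body lines that happen to start with `word:`, so a stricter variant might only apply it when the file opened with an unclosed `---`.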
```diff
     for header in scan_memory_files(cwd, max_files=100):
-        haystack = f"{header.title} {header.description}".lower()
-        score = sum(1 for token in tokens if token in haystack)
-        if score:
+        meta = f"{header.title} {header.description}".lower()
+        body = header.body_preview.lower()
+
+        # Metadata matches are weighted 2x; body matches 1x.
+        meta_hits = sum(1 for t in tokens if t in meta)
+        body_hits = sum(1 for t in tokens if t in body)
+        score = meta_hits * 2.0 + body_hits
```
find_relevant_memories() says it searches “metadata and content”, but the metadata haystack only includes title and description. Since MemoryHeader now has memory_type (parsed from frontmatter), consider including memory_type in the metadata scoring so queries like “project” / “reference” can match on type as well (or clarify that type is intentionally excluded).
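Including `memory_type` in the metadata haystack, as the reviewer suggests, is a one-line change to the scoring. The `MemoryHeader` layout below is assumed from the PR description, not copied from the source.

```python
from dataclasses import dataclass

@dataclass
class MemoryHeader:
    title: str
    description: str
    memory_type: str = ""
    body_preview: str = ""

def score(header: MemoryHeader, tokens: set[str]) -> float:
    """Score a header: metadata hits weighted 2x, body hits 1x."""
    # memory_type joins the metadata haystack so queries like "project"
    # or "reference" can match on the frontmatter type.
    meta = f"{header.title} {header.description} {header.memory_type}".lower()
    body = header.body_preview.lower()
    meta_hits = sum(1 for t in tokens if t in meta)
    body_hits = sum(1 for t in tokens if t in body)
    return meta_hits * 2.0 + body_hits
```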
Problem

The memory scanner (`memory/scan.py`) treats the first non-empty line of each file as the description. For files with YAML frontmatter, the format encouraged by Claude-style workflows and already parsed by the skill loader, this returns the raw `---` delimiter as the description.

This silently degrades search quality in `find_relevant_memories()`, which only inspects `title` and `description` for matching. With frontmatter files, neither field contains meaningful content, so relevant memories never surface.

Before (frontmatter file):

After:
Changes

- `memory/types.py`: add `memory_type` and `body_preview` fields to `MemoryHeader`
- `memory/scan.py`: parse YAML frontmatter following the `skills/loader.py` pattern; extract body preview
- `memory/search.py`: match body content with weighted scoring and CJK-aware tokenization
- `tests/test_memory/test_memdir.py`: eight new tests covering frontmatter parsing, ranking, and tokenization
- `CHANGELOG.md`: document the behavior changes under Unreleased

Verification
All 131 tests pass (including the 8 new ones). Existing tests remain unchanged — the 3 original memory tests pass without modification since files without frontmatter use the same fallback path.
Design notes
- The frontmatter parser mirrors `skills/loader.py:_parse_skill_markdown` but lives in `memory/scan.py` to avoid coupling the two subsystems. Both handle `---` delimiters, key-value extraction, and quoted value stripping.
- Scoring uses a simple `meta_hits × 2 + body_hits` weighting rather than TF-IDF, keeping the implementation zero-dependency and predictable. This matches the project's philosophy of clarity over complexity.
- `body_preview` is capped at 300 characters to bound memory overhead while providing enough signal for search.