feat(memory): wiki self-healing — cao memory heal (Phase 4 U1)#306

Open
fanhongy wants to merge 8 commits into main from feat/wiki-self-healing

@fanhongy

Contributor

Summary

Phase 3 shipped wiki lint detectors (cao memory lint) that report drift but offer no path from a finding to a fix. This adds the other half: services/wiki_healer.py + a cao memory heal CLI that consumes the exact LintIssue list run_lint() produces and applies a fix per issue type.

Dry-run by default, explicit --apply to mutate, one awaited audit event per mutation. Backend only — no web UI, no new detectors. Closes #297.

Fixes per issue type

| issue_type | Fix | Gate | Audit event |
|---|---|---|---|
| orphan_page | delete wiki file + index.md line + SQLite row | --apply | orphan_pruned |
| contradiction | keep newer updated_at article, forget the loser | --apply | contradiction_resolved |
| stale_claim | strip the paragraph naming the stale path/symbol, atomic rewrite, stash pre-strip paragraph in audit | --apply | stale_claim_pruned |
| poison_frequency | zero access_count | --apply --aggressive (dual gate) | poison_access_zeroed |
| graph_density | flag-only, never mutates | | |
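
The Gate column can be read as a small decision function. The sketch below is illustrative only: `may_mutate` and the three set names are hypothetical, and only the issue types and their gates are taken from the table.

```python
# Hypothetical helper: the names below are not from the PR.
# Issue types and gates mirror the "Fixes per issue type" table.
APPLY_ONLY = {"orphan_page", "contradiction", "stale_claim"}
DUAL_GATED = {"poison_frequency"}
FLAG_ONLY = {"graph_density"}


def may_mutate(issue_type: str, apply: bool = False, aggressive: bool = False) -> bool:
    """Return True only when the flags permit a mutation for this issue type."""
    if issue_type in FLAG_ONLY:
        return False  # reported, never mutated
    if issue_type in DUAL_GATED:
        return apply and aggressive  # needs both --apply and --aggressive
    if issue_type in APPLY_ONLY:
        return apply
    raise ValueError(f"unknown issue type: {issue_type}")
```

With both flags left at their defaults the function returns False for every type, which matches the dry-run default.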

Design invariants

  • Dry-run default: apply=False mutates nothing; poison needs --apply --aggressive.
  • SQL row authoritative — contradiction/poison re-read the DB row by (key, scope, scope_id) and trust the DB, never the LintIssue payload.
  • Atomic per-issue-type batch — each group's DB writes run in one transaction; partial failure rolls that group back, others proceed.
  • Concurrency guard — dedicated .heal.lock (flock, LOCK_NB), separate from the index lock.
  • Bounded blast radius — run-level MAX_HEAL_ACTIONS=200 + per-type caps; truncation reported, never silent.
  • Recovery field: stale_claim stashes the pre-strip paragraph (size-capped) in the audit record.
  • Deterministic contradiction tiebreak — on a same-second updated_at tie, keep the lexicographically-smaller key (reproducible, never order-dependent).
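
The contradiction tiebreak in the last bullet can be sketched as follows. `Article` and `pick_winner` are hypothetical names, and truncating to whole seconds is an assumption about what "same-second" means here; the healer's actual comparison may differ.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Article:  # hypothetical stand-in for a metadata row
    key: str
    updated_at: datetime


def pick_winner(a: Article, b: Article) -> Article:
    """Keep the newer article; on a same-second tie keep the smaller key."""
    # Assumption: compare at one-second resolution so sub-second noise
    # cannot decide the winner.
    ta = a.updated_at.replace(microsecond=0)
    tb = b.updated_at.replace(microsecond=0)
    if ta != tb:
        return a if ta > tb else b
    # Tie: the lexicographically smaller key wins, regardless of argument order.
    return a if a.key <= b.key else b
```

Because the tie is settled by the key alone, swapping the arguments yields the same winner.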

CLI

cao memory heal --scope <s> [--apply] [--issue-type <t>] [--aggressive] [--format table|json]

Testing

  • test/services/test_wiki_healer.py — 27 tests (dry-run read-only, each fix, dual gate, caps/truncation, SQL-authoritative, lock conflict, deterministic tiebreak, unparseable-skip).
  • CLI + audit-whitelist tests added.
  • Built via a design → implement → 3-lens adversarial review → fix workflow, then run through a pre-PR gate seeded with the bug-classes Copilot caught on prior memory PRs (timestamp-serialization, same-second-tie, plan/apply drift — all fixed).
  • Targeted suites green; mypy/black/isort clean on touched files. The 2 failures in the full suite (test_bm25_performance_within_budget, test_real_kiro_initialization_and_idle) are pre-existing flakes unrelated to this change.

Acceptance (#297)

  • heal() covers all five issue types; the two flag-only/gated types documented.
  • Default invocation is a dry-run plan; no mutation without --apply.
  • Every applied fix emits an audit event.
  • Run-level + per-type caps enforced; exceeding them truncates with a clear report.
  • Concurrent heal + store does not corrupt index.md or the DB (.heal.lock).
  • 0 regressions in existing lint / memory tests.
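
For the concurrency item, a non-blocking flock guard might look like the sketch below. It assumes a POSIX system (the `fcntl` module); `heal_lock` and `HealLockBusy` are hypothetical names, and only the `.heal.lock` file name and the LOCK_NB behavior come from the PR description.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


class HealLockBusy(RuntimeError):
    """Raised when another heal run already holds the lock (hypothetical name)."""


@contextmanager
def heal_lock(wiki_dir: Path) -> Iterator[None]:
    """Hold an exclusive, non-blocking flock on <wiki_dir>/.heal.lock."""
    lock_path = wiki_dir / ".heal.lock"
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes a busy lock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise HealLockBusy(str(lock_path)) from exc
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Failing fast lets the CLI report a lock conflict to the user rather than hang behind a long-running heal.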

Out of scope (separate issues)

Daily heal cron / flow_service wiring; memory import/export (Phase 4 U2); cross-project federation (Phase 4 U3); any web UI surface.

🤖 Generated with Claude Code

Turn lint findings into fixes: orphan/contradiction/stale_claim repaired
under --apply, poison dual-gated, graph_density flag-only. Dry-run default,
audit trail, .heal.lock, bounded caps. Closes #297.

Copilot AI left a comment

Contributor

Pull request overview

Adds a “self-healing” backend workflow for the memory wiki by introducing a wiki_healer service and a new cao memory heal CLI command that turns wiki_lint.run_lint() findings into planned/applied remediation actions (dry-run by default), with audit logging for each mutation.

Changes:

  • Introduces services/wiki_healer.py with dry-run planning, --apply gating, per-type/run-level caps, and a .heal.lock concurrency guard.
  • Adds cao memory heal CLI command plus CLI tests, and extends the audit-log sync whitelist for the new heal events.
  • Adds a dedicated unit test suite for healer behavior across issue types (including gating, caps, and lock contention) and updates memory docs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| src/cli_agent_orchestrator/services/wiki_healer.py | Core implementation of wiki “heal” planning/apply logic, per-issue-type fixers, caps, and locking. |
| src/cli_agent_orchestrator/cli/commands/memory.py | Adds cao memory heal command that runs lint then invokes healer with formatting options. |
| src/cli_agent_orchestrator/services/audit_log.py | Adds new heal-related event types to the SYNC audit whitelist. |
| test/services/test_wiki_healer.py | New end-to-end/unit tests for healer behavior, including dry-run invariants, fixes, caps, and lock conflict. |
| test/cli/commands/test_memory.py | CLI-level tests for heal flag plumbing, filtering, output formats, and lock conflict surfacing. |
| test/services/test_audit_log.py | Updates whitelist expectation to include heal events. |
| docs/memory.md | Documents delivery status and references the new healing functionality. |

Comment on lines +365 to +371
@click.option(
    "--scope",
    type=click.Choice([s.value for s in MemoryScope], case_sensitive=False),
    default="project",
    show_default=True,
    help="Scope to heal.",
)
Comment on lines +711 to +714
if cap is not None and (
    sum(1 for a in actions if _action_src_type(a.issue_type) == t) >= cap
):
    truncated_by_type[t] = truncated_by_type.get(t, 0) + 1
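
The cap check in this hunk can be read as a budget loop with a run-level cap and optional per-type caps. The sketch below is a simplified illustration: `plan_with_caps` is a hypothetical name, the per-type cap values passed in are made up, and only MAX_HEAL_ACTIONS=200 and the `truncated_by_type` counter come from the PR.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

MAX_HEAL_ACTIONS = 200  # run-level cap named in the PR description


def plan_with_caps(
    issue_types: Iterable[str],
    per_type_caps: Dict[str, int],
    run_cap: int = MAX_HEAL_ACTIONS,
) -> Tuple[List[str], Dict[str, int]]:
    """Plan one action per issue until a cap is hit; count what was dropped."""
    planned: List[str] = []
    planned_per_type: Counter = Counter()
    truncated_by_type: Dict[str, int] = {}
    for t in issue_types:
        cap: Optional[int] = per_type_caps.get(t)
        over_run_cap = len(planned) >= run_cap
        over_type_cap = cap is not None and planned_per_type[t] >= cap
        if over_run_cap or over_type_cap:
            # Truncation is recorded so the report can surface it.
            truncated_by_type[t] = truncated_by_type.get(t, 0) + 1
            continue
        planned.append(t)
        planned_per_type[t] += 1
    return planned, truncated_by_type
```

Returning the truncation counts alongside the plan keeps the "truncation reported, never silent" invariant visible to the caller.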
Comment on lines +788 to +792
action = await healer(svc, issue, scope, scope_id, db)
actions.append(action)
n_this_type += 1
actions_applied += 1
db.commit()
Comment on lines +788 to +796
action = await healer(svc, issue, scope, scope_id, db)
actions.append(action)
n_this_type += 1
actions_applied += 1
db.commit()
except Exception as e:
# Roll the whole group back; surface as errored actions so the
# report is truthful (no silent partial-commit).
logger.warning("heal batch group rolled back type=%s: %s", t, type(e).__name__)
Comment on lines +301 to +306
return HealAction(
    "orphan_pruned",
    key,
    status="applied",
    description="file already absent; index/metadata cleaned",
)
Comment thread docs/memory.md Outdated
Comment on lines +27 to +29
**In progress:** Phase 4 U1 wiki self-healing adds `cao memory heal`, which consumes the
Phase 3 lint findings and applies a fix per issue type (dry-run by default, `--apply` to
mutate, full audit trail). It lives on `feat/wiki-self-healing` and is not yet PR'd.
haofeif added the enhancement label Jun 16, 2026
…ing, scope choice

Buffer per-action audit payloads and emit only after the group commit
succeeds, so a rolled-back heal never records a false mutation (notably
poison_frequency). Exclude skipped no-ops from cap budget in both apply
and dry-run. Restrict `heal --scope` to global/project. Doc + wording fixes.
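
The buffering described in this commit message can be illustrated with plain `sqlite3`. This is a minimal sketch under stated assumptions: the `memory` table, its columns, and `zero_access_counts` are hypothetical, and the real healer uses its own session and audit service.

```python
import sqlite3
from typing import Callable, List, Tuple

AuditEvent = Tuple[str, str]  # (event_type, key), simplified for the sketch


def zero_access_counts(
    db: sqlite3.Connection,
    keys: List[str],
    emit: Callable[[AuditEvent], None],
) -> bool:
    """Zero access_count for each key in one transaction; audit only on commit."""
    pending: List[AuditEvent] = []
    try:
        for key in keys:
            cur = db.execute(
                "UPDATE memory SET access_count = 0 WHERE key = ?", (key,)
            )
            if cur.rowcount != 1:
                raise LookupError(f"no row for key {key!r}")
            pending.append(("poison_access_zeroed", key))
        db.commit()
    except Exception:
        db.rollback()  # undo the whole group; buffered events are dropped
        return False
    # The commit succeeded, so every buffered event describes a real mutation.
    for event in pending:
        emit(event)
    return True
```

Emitting inside the loop instead would record a mutation that a later rollback undoes, which is the false audit this commit removes.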

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Comment thread src/cli_agent_orchestrator/services/wiki_healer.py
Comment thread src/cli_agent_orchestrator/services/wiki_healer.py
Comment on lines +313 to +322
audit=(
    "orphan_pruned",
    f"deleted orphan wiki file: {key}",
    {
        "key": key,
        "scope": scope,
        "scope_id": scope_id or "",
        "file_path": str(wiki_path),
    },
),
Comment thread test/services/test_wiki_healer.py
Comment thread docs/memory.md Outdated
Comment on lines +19 to +29
| — | **Phase 4 U1 — wiki self-healing** (`cao memory heal`): turn lint findings into fixes, dry-run by default | 🟡 In progress — branch `feat/wiki-self-healing` |
| — | **Phase 4 — import/export, federation** | ⏳ Pending — not yet split into a PR |

**What works on `main` today:** store, recall, forget, four scopes, SQLite-indexed BM25
search, 3-factor recall scoring, CLI inspection, MCP tools, retention/cleanup, all Phase 2.5
hardening, auto-injection into provider config files, LLM wiki compaction, cross-references,
`cao memory lint` detectors, the daily audit log, and the memory Web UI.

**In progress:** Phase 4 U1 wiki self-healing adds `cao memory heal`, which consumes the
Phase 3 lint findings and applies a fix per issue type (dry-run by default, `--apply` to
mutate, full audit trail). It is up for review on `feat/wiki-self-healing` (PR #306).

@fanhongy fanhongy Jun 18, 2026

Contributor Author

I will use a separate commit to update the doc. The doc is a bit messy at the moment and needs a formal lint pass.

Comment thread src/cli_agent_orchestrator/services/wiki_healer.py
fanhongy and others added 4 commits June 18, 2026 14:27
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
run_lint(scope="project") returns orphans across all project containers
but LintIssue carries no scope_id, so _heal_orphan rebuilt the path in the
current container — a key collision could delete a live memory. Guard now
skips delete when a SQLite row or index entry exists for this (scope,
scope_id).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Remove the maintainer Delivery Status table; fix stale claims (recall
search_mode/sort_by + 3-factor scoring, per-scope injection caps, project
identity precedence chain, plugin config-file injection); document lint,
compact, and heal CLI commands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@haofeif

haofeif commented Jun 24, 2026

Contributor

@fanhongy

I rechecked the latest head (f3263d3) against current origin/main. I still think this needs changes before merge.

The main blocker is broader than the orphan case Copilot called out: cao memory heal --scope project calls run_lint(project_hash, scope="project"), but run_lint() only filters by scope, not by the current project scope_id, and LintIssue does not carry scope_id. The apply path then re-reads/mutates using the current project's scope_id. The orphan re-check protects _heal_orphan(), but contradiction, stale claim, and poison frequency can still apply a finding from project B to same-key memories in project A.

I reproduced this with poison_frequency: a finding for project B's shared-key zeroed project A's shared-key row while project B remained unchanged.

There are also two audit/rollback issues worth fixing:

  • If DB commit fails after an orphan/contradiction/stale filesystem mutation, the file/index change can remain but no mutation audit is emitted.
  • _heal_stale_claim() emits stale_claim_pruned even for status="skipped" when no paragraph matched.

One functional stale-claim bug: _strip_stale_paragraph() uses \b...\b, so lint findings for paths like ./src/gone.py or /tmp/repo/src/gone.py are skipped even though the detector accepts those path forms.

Focused tests and formatting pass (102 passed, 6 skipped; black/isort/diff-check clean). Full mypy still has unrelated existing status_monitor.py failures.

Findings

P1 - Project heal can apply another project's lint finding to the current project

src/cli_agent_orchestrator/cli/commands/memory.py:430 calls:

issues = _run_async(run_lint(project_hash, scope=scope))

project_hash is not used to filter the rows loaded by run_lint(). The lint code only filters by scope:

q = session.query(MemoryMetadataModel)
if scope is not None:
    q = q.filter(MemoryMetadataModel.scope == scope)

That means cao memory heal --scope project --apply collects findings from all project containers. The returned LintIssue contains only issue_type, key, related_key, description, severity, and detected_at; it does not include the source scope_id.

Apply-time healers then re-read or mutate using the current CLI scope_id:

  • src/cli_agent_orchestrator/services/wiki_healer.py:429-441 for contradiction rows
  • src/cli_agent_orchestrator/services/wiki_healer.py:545-546 for stale claim wiki paths
  • src/cli_agent_orchestrator/services/wiki_healer.py:664-675 for poison frequency rows

The orphan-specific re-check added after Copilot's comment helps only _heal_orphan(). It does not fix contradiction, stale claim, or poison frequency.

I reproduced this with poison_frequency: a synthetic finding for project B with key shared-key was applied to project A's shared-key row because the healer was invoked with project A's scope_id.

P1 cross-project poison: applied reset access_count 12 -> 0
  project_a_count= 0 project_b_count= 999

Impact: running the default project healer from one repository can delete/strip/reset a same-key memory in the current project for a finding that actually came from another project. This is cross-project data corruption.

Suggested fix: carry full identity through lint findings, or restrict run_lint() to the current project scope_id when cao memory heal --scope project is invoked. The healers should only mutate when the finding's (scope, scope_id, key) matches the target identity.
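
One way to picture the suggested identity check is a filter applied before any healer runs. The sketch below is hypothetical: `Finding` is a trimmed stand-in for LintIssue with `scope` and `scope_id` fields added for illustration, and `findings_for_target` is not a function in the codebase.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Finding:  # hypothetical, trimmed stand-in for LintIssue
    issue_type: str
    key: str
    scope: str
    scope_id: Optional[str]  # assumed new field carrying the source container


def findings_for_target(
    findings: Iterable[Finding], scope: str, scope_id: Optional[str]
) -> List[Finding]:
    """Keep only findings whose (scope, scope_id) matches the heal target."""
    return [f for f in findings if f.scope == scope and f.scope_id == scope_id]
```

With this filter in place, a shared-key finding from project B never reaches a healer invoked for project A.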

P2 - DB commit failure can leave filesystem mutations without a mutation audit

Several healers mutate files/indexes before the batch commit:

  • src/cli_agent_orchestrator/services/wiki_healer.py:384-387 unlinks an orphan file, rewrites the index, then deletes the DB row.
  • src/cli_agent_orchestrator/services/wiki_healer.py:488-491 unlinks the contradiction loser, rewrites the index, then deletes the DB row.
  • src/cli_agent_orchestrator/services/wiki_healer.py:598-599 atomically rewrites a stale-claim article before the batch commit path continues.

If db.commit() fails, the handler rolls back DB state and drops the buffered mutation audits:

db.rollback()
...
description=(
    f"commit failed ({type(e).__name__}); DB rolled back; "
    f"filesystem changes may be partial: {a.description}"
)

The description now admits partial filesystem changes, so the previous "rolled back" wording problem is partly addressed. The remaining issue is the audit invariant: a real file/index mutation can happen with no mutation audit event.

I reproduced this with an orphan file and a forced commit failure:

P2 commit failure orphan: error
  file_exists_after_failure= False has_orphan_audit= False

Impact: forensic review can miss a real file deletion/index rewrite when the DB transaction fails after filesystem mutation.

Suggested fix: either make the filesystem mutation compensating/transactional enough to restore on DB rollback, or emit an explicit failure/partial-mutation audit event when pre-commit filesystem changes may have occurred.

P2 - Stale-claim healing misses valid path identifiers that start with ./ or /

wiki_lint.PATH_CANDIDATE_RE accepts relative paths with ./ and absolute paths:

(?:\./|/|[A-Za-z0-9_\-./]+/)

But _strip_stale_paragraph() wraps the escaped stale identifier in word boundaries:

pattern = re.compile(r"\b" + re.escape(stale_id) + r"\b")

For identifiers beginning with non-word characters, the leading \b does not match. I confirmed:

P2 stale path match: src/gone.py True
P2 stale path match: ./src/gone.py False
P2 stale path match: /tmp/repo/src/gone.py False

Impact: valid "file not found:" findings produced by the lint detector can be reported as healable, but apply will skip them because the paragraph matcher cannot find the exact path.

Suggested fix: replace \b...\b with custom boundaries that work for path punctuation, for example "not preceded/followed by a path/token character", and add tests for ./src/gone.py and absolute paths.

P3 - Skipped stale-claim actions still emit stale_claim_pruned

In the pre_strip is None path, _heal_stale_claim() returns status="skipped" but still buffers a stale_claim_pruned audit payload:

return HealAction(
    "stale_claim_pruned",
    key,
    status="skipped",
    ...
    audit=("stale_claim_pruned", ...)
)

The batch loop flushes any non-None audit payload after commit regardless of action.status.

I reproduced:

P3 skipped stale audit: skipped has_stale_claim_pruned_audit= True

Impact: the audit stream uses a mutation event type for a no-op. The summary says no paragraph was found, but the event type still reads as a prune and contradicts the "every applied mutation emits" invariant.

Suggested fix: do not attach a mutation audit to skipped no-op actions, or use a distinct non-mutation event type if read-only no-op outcomes need to be recorded.

@fanhongy

Contributor Author

Thank you @haofeif for looking into this PR.

I can confirm all three issues are reproducible (I reproduced them on my side as well). Thanks for picking up those edge cases.

I will implement at least the P1 and P2 fixes in the next commit and push.

@fanhongy

fanhongy commented Jun 24, 2026

Contributor Author

P1 (blocker) — cross-project apply — wiki_lint.py + wiki_healer.py

  • Added scope_id to LintIssue and _make_issue, populated it at the four mutating detector call sites (orphan, stale-claim file+symbol, poison, contradiction). Bookkeeping rows leave it None.
  • heal() now drops any finding whose scope_id != the run's target before it reaches a healer (both dry-run and apply paths). The existing orphan re-check stays as belt-and-suspenders.

P2a — stale-claim regex — wiki_healer.py

  • Replaced \b…\b with path-aware lookarounds (?<![\w./-])…(?![\w/-]).
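
The difference between the two boundary styles can be checked directly with Python's `re` module. The two patterns are the ones quoted in this thread; the helper names `word_boundary_pattern` and `path_aware_pattern` are hypothetical.

```python
import re


def word_boundary_pattern(stale_id: str) -> "re.Pattern[str]":
    """Old form: \\b needs a word character on one side, so ./ and / prefixes fail."""
    return re.compile(r"\b" + re.escape(stale_id) + r"\b")


def path_aware_pattern(stale_id: str) -> "re.Pattern[str]":
    """New form: reject matches embedded in a longer path or token."""
    return re.compile(r"(?<![\w./-])" + re.escape(stale_id) + r"(?![\w/-])")


if __name__ == "__main__":
    for stale_id in ("src/gone.py", "./src/gone.py", "/tmp/repo/src/gone.py"):
        text = f"The loader lives in {stale_id} and is imported at startup."
        old = word_boundary_pattern(stale_id).search(text) is not None
        new = path_aware_pattern(stale_id).search(text) is not None
        print(stale_id, "word-boundary:", old, "path-aware:", new)
```

The lookahead leaves out the dot, so a path that ends a sentence still matches, while the lookbehind stops `src/gone.py` from matching inside `lib/src/gone.py`.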

P2b — commit-failure audit gap — wiki_healer.py + audit_log.py

  • Added fs_mutated to HealAction (set on orphan/contradiction/stale success; poison stays False since rollback fully restores it). On batch-commit failure, each rolled action that already mutated the filesystem now emits a new heal_partial_mutation audit event (added to the SYNC whitelist), so a real file deletion is never lost from the trail.
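
The failure path described here might be sketched as below. `RolledAction` and `audits_after_failed_commit` are hypothetical names; only the `fs_mutated` flag and the `heal_partial_mutation` event type come from this comment.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RolledAction:  # hypothetical, trimmed stand-in for HealAction
    event_type: str
    key: str
    fs_mutated: bool = False  # True once a file or index change has happened


def audits_after_failed_commit(actions: List[RolledAction]) -> List[Tuple[str, str]]:
    """Events to emit when the group's commit failed and the DB was rolled back."""
    # Only actions that already touched the filesystem leave a trace to record;
    # DB-only actions (such as poison) were fully undone by the rollback.
    return [("heal_partial_mutation", a.key) for a in actions if a.fs_mutated]
```

Using a separate event type keeps the normal mutation events reserved for changes that were fully committed.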

P3 — skipped stale-claim audited a mutation — wiki_healer.py

  • The no-paragraph-match path now emits audit=None, and the post-commit flush is gated on status == "applied" as a systemic guard.
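
The status gate on the post-commit flush can be pictured as follows. This is a simplified sketch: `Action` and `flush_audits` are hypothetical, and the audit payload is reduced to a two-field tuple.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

AuditPayload = Tuple[str, str]  # (event_type, summary), simplified


@dataclass
class Action:  # hypothetical, trimmed stand-in for HealAction
    key: str
    status: str  # "applied", "skipped", or "error"
    audit: Optional[AuditPayload] = None


def flush_audits(actions: List[Action], emit: Callable[[AuditPayload], None]) -> int:
    """Emit buffered audits for applied actions only; return how many were sent."""
    sent = 0
    for action in actions:
        # The status check is the systemic guard: a skipped action never
        # produces a mutation event even if a payload was attached by mistake.
        if action.status == "applied" and action.audit is not None:
            emit(action.audit)
            sent += 1
    return sent
```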

Verification

  • All four original reproductions now report FIXED.
  • Targeted suites: 167 passed, 6 skipped (incl. 9 new tests across P1/P2/P3 + audit whitelist).
  • black / isort / mypy clean on touched files.
  • Broader test/services/ test/cli/: 892 passed, 1 failed — the failure is the pre-existing test_bm25_performance_within_budget timing flake (609ms vs 500ms cap, untouched code, already noted in the PR description).

fanhongy and others added 2 commits June 24, 2026 14:25
P1: thread scope_id through LintIssue so heal() never applies one
container's finding to a same-key memory in another. P2a: path-aware
boundaries strip ./ and / paths. P2b: emit heal_partial_mutation when a
commit fails after a filesystem change. P3: skipped stale-claim no longer
audits a mutation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement] Wiki self-healing: turn lint findings into fixes

3 participants