feat(memory): wiki self-healing — cao memory heal (Phase 4 U1) #306
fanhongy wants to merge 8 commits
Turn lint findings into fixes: orphan/contradiction/stale_claim repaired under --apply, poison dual-gated, graph_density flag-only. Dry-run default, audit trail, .heal.lock, bounded caps. Closes #297.
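The gating described above (dry-run default, `--apply` for most fixes, a dual gate for poison, flag-only for `graph_density`) can be sketched as a small decision function. This is a minimal illustration, not the PR's code: the `Decision` enum, its labels, and the `decide` helper are hypothetical names; only the issue types and the gate semantics come from the description.

```python
from enum import Enum


class Decision(str, Enum):
    # Hypothetical labels; the PR does not name these states.
    FLAG_ONLY = "flag_only"
    DRY_RUN = "dry_run"
    NEEDS_AGGRESSIVE = "needs_aggressive"
    APPLY = "apply"


def decide(issue_type: str, apply: bool, aggressive: bool) -> Decision:
    """Return what a heal run may do with one finding."""
    if issue_type == "graph_density":
        # Flag-only: never mutated, whatever flags are passed.
        return Decision.FLAG_ONLY
    if not apply:
        # Dry-run is the default; nothing is mutated without --apply.
        return Decision.DRY_RUN
    if issue_type == "poison_frequency" and not aggressive:
        # Dual gate: poison needs both --apply and --aggressive.
        return Decision.NEEDS_AGGRESSIVE
    return Decision.APPLY
```

Keeping the gate in one pure function makes the "dry-run mutates nothing" invariant easy to test in isolation from the fixers.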
f6d3bf7 to 3115067
Pull request overview
Adds a “self-healing” backend workflow for the memory wiki by introducing a wiki_healer service and a new cao memory heal CLI command that turns wiki_lint.run_lint() findings into planned/applied remediation actions (dry-run by default), with audit logging for each mutation.
Changes:
- Introduces `services/wiki_healer.py` with dry-run planning, `--apply` gating, per-type/run-level caps, and a `.heal.lock` concurrency guard.
- Adds `cao memory heal` CLI command plus CLI tests, and extends the audit-log sync whitelist for the new heal events.
- Adds a dedicated unit test suite for healer behavior across issue types (including gating, caps, and lock contention) and updates memory docs.
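As an illustration of the `.heal.lock` concurrency guard, here is a minimal sketch of a non-blocking exclusive file lock built on `fcntl.flock` with `LOCK_NB` (the PR description names both). The `heal_lock` context manager and the `HealLockBusy` exception are hypothetical names, and the sketch assumes a Unix platform; it is not the healer's actual implementation.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


class HealLockBusy(RuntimeError):
    """Another heal run already holds the lock (hypothetical exception name)."""


@contextmanager
def heal_lock(wiki_dir: Path):
    """Hold an exclusive, non-blocking lock on <wiki_dir>/.heal.lock."""
    lock_path = wiki_dir / ".heal.lock"
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise HealLockBusy(f"heal already running: {lock_path}") from exc
        yield lock_path
    finally:
        # Closing the descriptor releases the flock if it was acquired.
        os.close(fd)
```

Failing fast lets the CLI surface a lock conflict to the user rather than hanging behind another run.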
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `src/cli_agent_orchestrator/services/wiki_healer.py` | Core implementation of wiki “heal” planning/apply logic, per-issue-type fixers, caps, and locking. |
| `src/cli_agent_orchestrator/cli/commands/memory.py` | Adds `cao memory heal` command that runs lint then invokes healer with formatting options. |
| `src/cli_agent_orchestrator/services/audit_log.py` | Adds new heal-related event types to the SYNC audit whitelist. |
| `test/services/test_wiki_healer.py` | New end-to-end/unit tests for healer behavior, including dry-run invariants, fixes, caps, and lock conflict. |
| `test/cli/commands/test_memory.py` | CLI-level tests for heal flag plumbing, filtering, output formats, and lock conflict surfacing. |
| `test/services/test_audit_log.py` | Updates whitelist expectation to include heal events. |
| `docs/memory.md` | Documents delivery status and references the new healing functionality. |
    @click.option(
        "--scope",
        type=click.Choice([s.value for s in MemoryScope], case_sensitive=False),
        default="project",
        show_default=True,
        help="Scope to heal.",
    )

    if cap is not None and (
        sum(1 for a in actions if _action_src_type(a.issue_type) == t) >= cap
    ):
        truncated_by_type[t] = truncated_by_type.get(t, 0) + 1
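The cap check in the snippet above re-counts planned actions on every iteration. A minimal sketch of the same idea with running counters follows, assuming a per-type cap table and a run-level maximum; `plan_with_caps` and its parameters are hypothetical names, and the sketch only models issue types, not real `HealAction` objects.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def plan_with_caps(
    issue_types: Iterable[str],
    per_type_cap: Dict[str, int],
    max_total: int,
) -> Tuple[List[str], Dict[str, int]]:
    """Plan at most max_total actions, honouring per-type caps.

    Returns the planned issue types and a per-type count of findings that
    were dropped, so truncation is reported rather than silent.
    """
    planned: List[str] = []
    planned_by_type: Counter = Counter()
    truncated_by_type: Counter = Counter()
    for t in issue_types:
        cap: Optional[int] = per_type_cap.get(t)
        over_type_cap = cap is not None and planned_by_type[t] >= cap
        if len(planned) >= max_total or over_type_cap:
            truncated_by_type[t] += 1
            continue
        planned.append(t)
        planned_by_type[t] += 1
    return planned, dict(truncated_by_type)
```

For example, `plan_with_caps(["a", "a", "a", "b"], {"a": 2}, 200)` plans two `a` actions and one `b`, and reports one truncated `a`.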

        action = await healer(svc, issue, scope, scope_id, db)
        actions.append(action)
        n_this_type += 1
        actions_applied += 1
        db.commit()
    except Exception as e:
        # Roll the whole group back; surface as errored actions so the
        # report is truthful (no silent partial-commit).
        logger.warning("heal batch group rolled back type=%s: %s", t, type(e).__name__)

    return HealAction(
        "orphan_pruned",
        key,
        status="applied",
        description="file already absent; index/metadata cleaned",
    )

**In progress:** Phase 4 U1 wiki self-healing adds `cao memory heal`, which consumes the
Phase 3 lint findings and applies a fix per issue type (dry-run by default, `--apply` to
mutate, full audit trail). It lives on `feat/wiki-self-healing` and is not yet PR'd.
…ing, scope choice

Buffer per-action audit payloads and emit only after the group commit succeeds, so a rolled-back heal never records a false mutation (notably poison_frequency). Exclude skipped no-ops from cap budget in both apply and dry-run. Restrict `heal --scope` to global/project. Doc + wording fixes.
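The buffering described in this commit message can be sketched as follows: audit events are collected while the group runs and emitted only once the commit has succeeded. This is a minimal illustration using the standard-library `sqlite3` module; `apply_group`, `AuditEvent`, and the `emit` callback are hypothetical names, not the healer's actual API.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple

# (event_type, message, payload); the shape is an assumption for this sketch.
AuditEvent = Tuple[str, str, Dict[str, str]]
Mutation = Callable[[sqlite3.Connection], None]


def apply_group(
    conn: sqlite3.Connection,
    steps: List[Tuple[Mutation, AuditEvent]],
    emit: Callable[[AuditEvent], None],
) -> bool:
    """Run every mutation in one transaction; audit only after commit."""
    pending: List[AuditEvent] = []
    try:
        for mutate, event in steps:
            mutate(conn)
            pending.append(event)  # buffered, not yet emitted
        conn.commit()
    except Exception:
        # The group is rolled back, so none of its events are recorded.
        conn.rollback()
        return False
    for event in pending:
        emit(event)
    return True
```

With this ordering, a rolled-back `poison_frequency` reset never shows up in the audit trail as a mutation that happened.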

    audit=(
        "orphan_pruned",
        f"deleted orphan wiki file: {key}",
        {
            "key": key,
            "scope": scope,
            "scope_id": scope_id or "",
            "file_path": str(wiki_path),
        },
    ),

| — | **Phase 4 U1 — wiki self-healing** (`cao memory heal`): turn lint findings into fixes, dry-run by default | 🟡 In progress — branch `feat/wiki-self-healing` |
| — | **Phase 4 — import/export, federation** | ⏳ Pending — not yet split into a PR |

**What works on `main` today:** store, recall, forget, four scopes, SQLite-indexed BM25
search, 3-factor recall scoring, CLI inspection, MCP tools, retention/cleanup, all Phase 2.5
hardening, auto-injection into provider config files, LLM wiki compaction, cross-references,
`cao memory lint` detectors, the daily audit log, and the memory Web UI.

**In progress:** Phase 4 U1 wiki self-healing adds `cao memory heal`, which consumes the
Phase 3 lint findings and applies a fix per issue type (dry-run by default, `--apply` to
mutate, full audit trail). It is up for review on `feat/wiki-self-healing` (PR #306).
I will use a separate commit to update the doc. The doc is a bit messy at the moment and needs a formal lint.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
run_lint(scope="project") returns orphans across all project containers, but LintIssue carries no scope_id, so _heal_orphan rebuilt the path in the current container and a key collision could delete a live memory. The guard now skips the delete when a SQLite row or index entry exists for this (scope, scope_id).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
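The guard described in this commit message can be sketched as a re-check against the database before any file is removed. This is a simplified illustration under stated assumptions: the `memory_metadata` table and its columns, the `prune_orphan_file` helper, and the returned status strings are hypothetical, and the index-entry half of the check is omitted.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def prune_orphan_file(
    conn: sqlite3.Connection,
    wiki_path: Path,
    key: str,
    scope: str,
    scope_id: Optional[str],
) -> str:
    """Delete an orphan wiki file only if no live row claims the same identity."""
    row = conn.execute(
        # Table and column names are assumptions for this sketch.
        "SELECT 1 FROM memory_metadata "
        "WHERE key = ? AND scope = ? AND scope_id = ? LIMIT 1",
        (key, scope, scope_id or ""),
    ).fetchone()
    if row is not None:
        # A live memory exists for (key, scope, scope_id): not an orphan here.
        return "skipped"
    wiki_path.unlink(missing_ok=True)
    return "applied"
```

The point of the re-check is that the finding may have come from a different container, so the database for the current `(scope, scope_id)` is treated as authoritative.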
Remove the maintainer Delivery Status table; fix stale claims (recall search_mode/sort_by + 3-factor scoring, per-scope injection caps, project identity precedence chain, plugin config-file injection); document lint, compact, and heal CLI commands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
I rechecked the latest head (

The main blocker is broader than the orphan case Copilot called out:

I reproduced this with

There are also two audit/rollback issues worth fixing:

One functional stale-claim bug:

Focused tests and formatting pass (

Findings

P1 - Project heal can apply another project's lint finding to the current project

    issues = _run_async(run_lint(project_hash, scope=scope))

    q = session.query(MemoryMetadataModel)
    if scope is not None:
        q = q.filter(MemoryMetadataModel.scope == scope)

That means

Apply-time healers then re-read or mutate using the current CLI

The orphan-specific re-check added after Copilot's comment helps only

I reproduced this with

Impact: running the default project healer from one repository can delete/strip/reset a same-key memory in the current project for a finding that actually came from another project. This is cross-project data corruption.

Suggested fix: carry full identity through lint findings, or restrict

P2 - DB commit failure can leave filesystem mutations without a mutation audit

Several healers mutate files/indexes before the batch commit:

If

    db.rollback()
    ...
    description=(
        f"commit failed ({type(e).__name__}); DB rolled back; "
        f"filesystem changes may be partial: {a.description}"
    )

The description now admits partial filesystem changes, so the previous "rolled back" wording problem is partly addressed. The remaining issue is the audit invariant: a real file/index mutation can happen with no mutation audit event.

I reproduced this with an orphan file and a forced commit failure:

Impact: forensic review can miss a real file deletion/index rewrite when the DB transaction fails after filesystem mutation.

Suggested fix: either make the filesystem mutation compensating/transactional enough to restore on DB rollback, or emit an explicit failure/partial-mutation audit event when pre-commit filesystem changes may have occurred.

P2 - Stale-claim healing misses valid path identifiers that start with
Thank you @haofeif for looking into this PR. I can confirm all three issues are reproducible (I reproduced them on my side as well). Thanks for picking up those edge cases. I will implement at least the first P1 and P2 fixes in the next commit and push.
P1 (blocker) — cross-project apply —
P2a — stale-claim regex — wiki_healer.py
P2b — commit-failure audit gap — wiki_healer.py + audit_log.py
P3 — skipped stale-claim audited a mutation — wiki_healer.py
Verification
P1: thread scope_id through LintIssue so heal() never applies one container's finding to a same-key memory in another. P2a: path-aware boundaries strip ./ and / paths. P2b: emit heal_partial_mutation when a commit fails after a filesystem change. P3: skipped stale-claim no longer audits a mutation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
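The P1 fix amounts to carrying the container identity on each finding and dropping findings that belong to another container before any healer runs. A minimal sketch, assuming a simplified stand-in for `LintIssue`: the `LintFinding` dataclass and `findings_for_container` helper are hypothetical and only show the filtering idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LintFinding:
    """Hypothetical stand-in for LintIssue with scope_id threaded through."""

    issue_type: str
    key: str
    scope: str
    scope_id: Optional[str]


def findings_for_container(
    findings: List[LintFinding], scope: str, scope_id: Optional[str]
) -> List[LintFinding]:
    """Keep only findings whose full identity matches the current container."""
    return [f for f in findings if f.scope == scope and f.scope_id == scope_id]
```

Filtering on the full `(scope, scope_id)` pair means a same-key memory in another project can never be selected for mutation by accident.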
Summary

Phase 3 shipped wiki lint detectors (`cao memory lint`) that report drift but offer no path from a finding to a fix. This adds the other half: `services/wiki_healer.py` + a `cao memory heal` CLI that consumes the exact `LintIssue` list `run_lint()` produces and applies a fix per issue type.

Dry-run by default, explicit `--apply` to mutate, one awaited audit event per mutation. Backend only: no web UI, no new detectors. Closes #297.

Fixes per issue type
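| `issue_type` | Fix | Gate | Audit event |
|---|---|---|---|
| `orphan_page` | `index.md` line + SQLite row | `--apply` | `orphan_pruned` |
| `contradiction` | `updated_at` article, forget the loser | `--apply` | `contradiction_resolved` |
| `stale_claim` | | `--apply` | `stale_claim_pruned` |
| `poison_frequency` | `access_count` | `--apply --aggressive` (dual gate) | `poison_access_zeroed` |
| `graph_density` | | | |

Design invariants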
- `apply=False` mutates nothing; poison needs `--apply --aggressive`.
- `(key, scope, scope_id)` and trust the DB, never the `LintIssue` payload.
- `.heal.lock` (`flock`, `LOCK_NB`), separate from the index lock.
- `MAX_HEAL_ACTIONS=200` + per-type caps; truncation reported, never silent.
- `stale_claim` stashes the pre-strip paragraph (size-capped) in the audit record.
- `updated_at` tie, keep the lexicographically-smaller key (reproducible, never order-dependent).

CLI
Testing
- `test/services/test_wiki_healer.py`: 27 tests (dry-run read-only, each fix, dual gate, caps/truncation, SQL-authoritative, lock conflict, deterministic tiebreak, unparseable-skip).
- `mypy`/`black`/`isort` clean on touched files. The 2 failures in the full suite (`test_bm25_performance_within_budget`, `test_real_kiro_initialization_and_idle`) are pre-existing flakes unrelated to this change.

Acceptance (#297)
- `heal()` covers all five issue types; the two flag-only/gated types documented.
- `--apply`.
- `index.md` or the DB (`.heal.lock`).

Out of scope (separate issues)
Daily heal cron / `flow_service` wiring; memory import/export (Phase 4 U2); cross-project federation (Phase 4 U3); any web UI surface.

🤖 Generated with Claude Code
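The deterministic tiebreak listed under the design invariants can be sketched as below. The rule for ties (keep the lexicographically smaller key) comes from the description; that the most recent `updated_at` wins otherwise is an assumption of this sketch, and the `Article` dataclass and `pick_survivor` helper are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Article:
    """Hypothetical minimal view of a contradicting wiki article."""

    key: str
    updated_at: datetime


def pick_survivor(articles: List[Article]) -> Article:
    """Choose which contradicting article to keep.

    Assumption: the most recent updated_at wins. On an exact tie the
    lexicographically smaller key is kept, so the result never depends
    on input order.
    """
    newest = max(a.updated_at for a in articles)
    tied = [a for a in articles if a.updated_at == newest]
    return min(tied, key=lambda a: a.key)
```

Because the choice depends only on `(updated_at, key)`, reversing the input list yields the same survivor.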