From 38e4378f55b6085992abcf303cf185f2d5660ade Mon Sep 17 00:00:00 2001
From: Fotis Stamatelopoulos
Date: Thu, 14 May 2026 16:11:27 -0700
Subject: [PATCH] feat(clio): auto-ingest documenter output (docs/**/*.md) into Clio
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes a long-standing inconsistency: cfcf auto-ingests almost every
workspace artifact (iteration logs, judge assessments, reflection
analyses, plan.md, decision-log, architect-review, problem-pack,
context-pack...) — but explicitly excluded the documenter's docs/*.md
output. The exclusion's stated rationale ("docs/ is canonical, Clio is
redundant") applied equally to plan.md, which IS auto-ingested.
Carve-out was inconsistent and cost cross-workspace discoverability of
the most polished, integrative artifact a workspace produces.

Behaviour:
- After documenter completes (both auto-document path in iteration-loop
  AND standalone cfcf document), walk /docs/ recursively; ingest every
  *.md as a separate Clio doc.
- Stable per-file title: : docs/
- updateIfExists: true → re-runs overwrite in place; sha256 dedup makes
  unchanged content a no-op.
- Author: documenter||
- Metadata includes artifact_type=documenter-output + file_path +
  ingest_trigger ("loop-auto" | "manual") for filtering.
- Non-.md files skipped; dot-dirs skipped; empty files skipped.
- Per-file errors logged + counted but never fail the batch.
- Respects clio.ingestPolicy (off / summaries-only / all). Documenter
  output runs on summaries-only — it IS a summary.

Implementation: new ingestDocumenterOutput() in loop-ingest.ts +
walkMarkdownFiles() helper; two call sites (iteration-loop auto-document
path, documenter-runner standalone path); documenter template prose
updated to match reality.

Test coverage: 11 new tests covering multi-file ingest, recursive walk,
updateIfExists round-trip, non-.md filtering, dot-dir skipping, missing
docs/ no-op, empty file skipping, author stamp, trigger metadata,
ingestPolicy gates. All 1049 tests pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CHANGELOG.md                                  |  83 +++++++-
 packages/core/src/clio/loop-ingest.test.ts    | 186 ++++++++++++++++++
 packages/core/src/clio/loop-ingest.ts         | 138 ++++++++++++-
 packages/core/src/documenter-runner.ts        |  27 +++
 packages/core/src/iteration-loop.ts           |  24 +++
 .../templates/cfcf-documenter-instructions.md |  13 +-
 6 files changed, 464 insertions(+), 7 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index a06ab3c..6858562 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,7 +9,88 @@ Changes are tracked via git tags. Each release tag corresponds to an entry here.
 
 ## [Unreleased]
 
-_No changes yet._
+### Added — Documenter output auto-ingested into Clio
+
+Closes a long-standing inconsistency the user spotted post-v0.24.3:
+cfcf auto-ingests almost every workspace artifact into Clio
+(iteration logs, judge assessments, reflection analyses, plan.md,
+decision-log, architect-review, problem-pack, context-pack…) —
+but explicitly **excluded** the documenter's `docs/*.md` output.
+
+The documenter agent template even called this out: *"cf² doesn't
+auto-ingest the documenter output (the `docs/` tree is the
+canonical surface)."* The carve-out's stated rationale was that
+`docs/` is canonical, so Clio is redundant. But that same argument
+applies to `plan.md` — which IS auto-ingested. The carve-out was
+inconsistent and cost cross-workspace discoverability of the
+*most polished, integrative* artifact a workspace produces.
+
+**Behaviour**:
+
+- After the documenter completes (both auto-document path inside
+  the iteration loop AND standalone `cfcf document`), walk
+  `/docs/` recursively and ingest every `*.md` file as a
+  separate Clio document.
+- Stable per-file title: `: docs/`
+  (e.g. `gmbot: docs/architecture.md`, `gmbot: docs/api/auth.md`).
+- `updateIfExists: true` — re-running the documenter overwrites
+  in place, never produces duplicates. sha256 dedup means
+  unchanged content is a no-op.
+- Author stamp: `documenter||` (matches the
+  existing actor convention).
+- Metadata: `{role: "documenter", artifact_type:
+  "documenter-output", file_path: "", tier: "semantic",
+  ingest_trigger: "loop-auto" | "manual", …}`. The
+  `documenter-output` artifact_type makes the new docs filterable
+  in `cfcf clio search --metadata`.
+- Non-`.md` files (images, JSON config, etc.) are skipped. Dot-
+  directories (`.git`, `.vscode`, …) under `docs/` are skipped.
+- Empty / whitespace-only files are skipped.
+- Per-file errors are logged + counted but never fail the rest
+  of the batch (same best-effort policy as the other auto-ingest
+  hooks).
+- Respects `clio.ingestPolicy` (per-workspace or global): `"off"`
+  → no-op; `"summaries-only"` and `"all"` → runs. Documenter
+  output is treated as a summary — it's the cleanest
+  cross-workspace artifact a workspace produces.
+- Pre-existing user-authored files in `docs/` are also ingested.
+  Intentional: they're authoritative workspace content; surfacing
+  them in cross-workspace Clio search is a feature, not a leak.
+  (Different directory than `cfcf-docs/`, so no overlap with
+  existing ingests.)
+
+**Implementation** (~165 LoC + tests):
+
+- `packages/core/src/clio/loop-ingest.ts` — new
+  `ingestDocumenterOutput(backend, workspace, trigger)` helper +
+  `walkMarkdownFiles(dir)` internal helper (recursive walk, skips
+  dot-dirs).
+- `packages/core/src/iteration-loop.ts` — call site inside the
+  auto-document branch, after the commit + history-event update.
+- `packages/core/src/documenter-runner.ts` — call site after the
+  standalone `cfcf document` run completes successfully. (Failed
+  runs don't ingest — partial / broken output shouldn't pollute
+  cross-workspace search.)
+- `packages/core/src/templates/cfcf-documenter-instructions.md` —
+  prose updated to match reality. Agent no longer needs to be
+  asked to push to Clio; the harness handles it.
+
+**Test coverage** (11 new tests in `loop-ingest.test.ts`, all 1049
+total pass):
+
+- Multi-file ingest with per-file titles
+- Recursive walk through nested `docs/` subdirectories
+- `updateIfExists` round-trip (one doc per file, content updates
+  in place across re-runs)
+- Non-`.md` files ignored
+- Dot-directories skipped
+- Empty `docs/` returns `{ingested: 0, errors: 0}` (no-op safe)
+- Whitespace-only files skipped
+- Author stamp = `documenter||`
+- Trigger captured in metadata (`loop-auto` vs `manual`)
+- `clio.ingestPolicy: "off"` → no-op
+- `clio.ingestPolicy: "summaries-only"` → runs (documenter output
+  IS a summary)
 
 ## [0.24.3] -- 2026-05-13
 
diff --git a/packages/core/src/clio/loop-ingest.test.ts b/packages/core/src/clio/loop-ingest.test.ts
index a976b5f..373dadb 100644
--- a/packages/core/src/clio/loop-ingest.test.ts
+++ b/packages/core/src/clio/loop-ingest.test.ts
@@ -22,6 +22,7 @@ import {
   ingestPlanMd,
   ingestDevIterationArtifacts,
   ingestJudgeArtifact,
+  ingestDocumenterOutput,
   PROBLEM_PACK_FILES,
 } from "./loop-ingest.js";
 import type { WorkspaceConfig } from "../types.js";
@@ -1029,3 +1030,188 @@ describe("ingestContextPack", () => {
     expect(r.perFile).toEqual([]);
   });
 });
+
+// ── ingestDocumenterOutput (v0.24.4) ──────────────────────────────────────
+
+describe("ingestDocumenterOutput", () => {
+  async function seedDocs(files: Record<string, string>): Promise<void> {
+    for (const [rel, content] of Object.entries(files)) {
+      const full = join(repoDir, rel);
+      const dir = full.substring(0, full.lastIndexOf("/"));
+      await mkdir(dir, { recursive: true });
+      await writeFile(full, content, "utf-8");
+    }
+  }
+
+  it("ingests every *.md under docs/ as a separate Clio doc with stable per-file titles", async () => {
+    const ws = makeWorkspace();
+    await seedDocs({
+      "docs/architecture.md": "# Architecture\n\nSystem overview.\n",
+      "docs/api.md": "# API\n\nEndpoints.\n",
+      "docs/deployment.md": "# Deployment\n\nDeploy guide.\n",
+    });
+
+    const result = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(result.ingested).toBe(3);
+    expect(result.errors).toBe(0);
+
+    const docs = await clio.listDocuments({ project: "test-project" });
+    const docOutputs = docs.filter(
+      (d) => (d.metadata as { artifact_type?: string })?.artifact_type === "documenter-output",
+    );
+    expect(docOutputs).toHaveLength(3);
+
+    // Each file gets `: ` as title.
+    const titles = docOutputs.map((d) => d.title).sort();
+    expect(titles).toEqual([
+      `${ws.name}: docs/api.md`,
+      `${ws.name}: docs/architecture.md`,
+      `${ws.name}: docs/deployment.md`,
+    ]);
+  });
+
+  it("walks nested docs/ subdirectories", async () => {
+    const ws = makeWorkspace();
+    await seedDocs({
+      "docs/architecture.md": "# Top-level\n",
+      "docs/api/auth.md": "# Auth API\n",
+      "docs/api/users.md": "# Users API\n",
+      "docs/guides/quickstart.md": "# Quickstart\n",
+    });
+
+    const result = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(result.ingested).toBe(4);
+
+    const docs = await clio.listDocuments({ project: "test-project" });
+    const titles = docs
+      .filter((d) => (d.metadata as { artifact_type?: string })?.artifact_type === "documenter-output")
+      .map((d) => d.title)
+      .sort();
+    expect(titles).toEqual([
+      `${ws.name}: docs/api/auth.md`,
+      `${ws.name}: docs/api/users.md`,
+      `${ws.name}: docs/architecture.md`,
+      `${ws.name}: docs/guides/quickstart.md`,
+    ]);
+  });
+
+  it("updates docs in place across re-runs (updateIfExists — one doc per file, NOT one per call)", async () => {
+    const ws = makeWorkspace();
+    await seedDocs({ "docs/architecture.md": "# Architecture v1\n\nInitial.\n" });
+
+    const r1 = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(r1.ingested).toBe(1);
+    const docsAfterRun1 = await clio.listDocuments({ project: "test-project" });
+    expect(docsAfterRun1).toHaveLength(1);
+    const docId1 = docsAfterRun1[0].id;
+
+    // Second documenter run with revised content. The previously
+    // ingested doc should be UPDATED, not duplicated.
+    await seedDocs({ "docs/architecture.md": "# Architecture v2\n\nUpdated after a loop.\n" });
+    const r2 = await ingestDocumenterOutput(clio, ws, "manual");
+    expect(r2.ingested).toBe(1);
+
+    const docsAfterRun2 = await clio.listDocuments({ project: "test-project" });
+    expect(docsAfterRun2).toHaveLength(1);
+    expect(docsAfterRun2[0].id).toBe(docId1); // same doc id, updated in place
+
+    // Content reflects the second pass.
+    const fetched = await clio.getDocumentContent(docId1);
+    expect(fetched?.content).toContain("v2");
+    expect(fetched?.content).toContain("Updated after a loop");
+  });
+
+  it("ignores non-.md files in docs/", async () => {
+    const ws = makeWorkspace();
+    await seedDocs({
+      "docs/architecture.md": "# Architecture\n",
+      "docs/diagram.png": "fake binary", // .png — should NOT be ingested
+      "docs/config.json": "{}", // .json — should NOT be ingested
+    });
+
+    const result = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(result.ingested).toBe(1);
+    const docs = await clio.listDocuments({ project: "test-project" });
+    expect(docs.filter((d) => (d.metadata as { artifact_type?: string })?.artifact_type === "documenter-output")).toHaveLength(1);
+  });
+
+  it("skips dot-directories (.git, .vscode, etc.)", async () => {
+    const ws = makeWorkspace();
+    await seedDocs({
+      "docs/architecture.md": "# Real doc\n",
+      "docs/.git/HEAD": "ref: refs/heads/main\n", // unlikely but defensive
+      "docs/.cache/build.md": "should be skipped\n",
+    });
+
+    const result = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(result.ingested).toBe(1);
+  });
+
+  it("returns {ingested: 0, errors: 0} when docs/ doesn't exist (no-op, safe)", async () => {
+    const ws = makeWorkspace();
+    // No docs/ created — this can happen if the documenter agent
+    // failed early or the workspace has no docs phase yet.
+    const result = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(result.ingested).toBe(0);
+    expect(result.errors).toBe(0);
+  });
+
+  it("skips empty files (no ingest for whitespace-only content)", async () => {
+    const ws = makeWorkspace();
+    await seedDocs({
+      "docs/empty.md": " \n\n \n",
+      "docs/real.md": "# Real content\n",
+    });
+
+    const result = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(result.ingested).toBe(1); // only real.md
+  });
+
+  it("stamps author as documenter|| for audit-log attribution", async () => {
+    const ws = makeWorkspace({
+      documenterAgent: { adapter: "codex", model: "gpt-5" },
+    });
+    await seedDocs({ "docs/architecture.md": "# Architecture\n" });
+
+    await ingestDocumenterOutput(clio, ws, "manual");
+    const docs = await clio.listDocuments({ project: "test-project" });
+    const doc = docs.find((d) => d.title === `${ws.name}: docs/architecture.md`);
+    expect(doc?.author).toBe("documenter|codex|gpt-5");
+  });
+
+  it("captures the trigger (loop-auto vs manual) in metadata for audit", async () => {
+    const ws = makeWorkspace();
+    await seedDocs({ "docs/architecture.md": "# Architecture\n" });
+
+    await ingestDocumenterOutput(clio, ws, "loop-auto");
+    const docs1 = await clio.listDocuments({ project: "test-project" });
+    const doc1 = docs1.find((d) => d.title === `${ws.name}: docs/architecture.md`);
+    expect((doc1?.metadata as { ingest_trigger?: string })?.ingest_trigger).toBe("loop-auto");
+
+    // Re-run via standalone documenter (manual trigger) — overrides
+    // the previous trigger stamp.
+    await ingestDocumenterOutput(clio, ws, "manual");
+    const docs2 = await clio.listDocuments({ project: "test-project" });
+    const doc2 = docs2.find((d) => d.title === `${ws.name}: docs/architecture.md`);
+    expect((doc2?.metadata as { ingest_trigger?: string })?.ingest_trigger).toBe("manual");
+  });
+
+  it("respects clio.ingestPolicy = 'off' (no-op)", async () => {
+    const ws = makeWorkspace({ clio: { ingestPolicy: "off" } });
+    await seedDocs({ "docs/architecture.md": "# Architecture\n" });
+
+    const result = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(result.ingested).toBe(0);
+    expect(result.errors).toBe(0);
+    const docs = await clio.listDocuments({ project: "test-project" });
+    expect(docs).toHaveLength(0);
+  });
+
+  it("runs on policy 'summaries-only' (documenter output IS a summary)", async () => {
+    const ws = makeWorkspace({ clio: { ingestPolicy: "summaries-only" } });
+    await seedDocs({ "docs/architecture.md": "# Architecture\n" });
+
+    const result = await ingestDocumenterOutput(clio, ws, "loop-auto");
+    expect(result.ingested).toBe(1);
+  });
+});
diff --git a/packages/core/src/clio/loop-ingest.ts b/packages/core/src/clio/loop-ingest.ts
index 1bd9640..2af97bc 100644
--- a/packages/core/src/clio/loop-ingest.ts
+++ b/packages/core/src/clio/loop-ingest.ts
@@ -12,8 +12,8 @@
  * adjunct service.
  */
 
-import { readFile, access } from "fs/promises";
-import { join } from "path";
+import { readFile, access, readdir } from "fs/promises";
+import { join, relative } from "path";
 import type { WorkspaceConfig, JudgeSignals, ReflectionSignals } from "../types.js";
 import type { MemoryBackend } from "./backend/types.js";
 import type { IngestResult } from "./types.js";
@@ -349,6 +349,140 @@ export async function ingestPlanMd(
   }
 }
 
+// ── Hook: documenter output (docs/**/*.md) ────────────────────────────────
+
+/**
+ * Walk a directory recursively and return paths of every `.md` file.
+ * Skips dot-directories (`.git`, `.vscode`, etc.). Returns absolute
+ * paths. Returns an empty array (not throwing) when the directory
+ * doesn't exist.
+ */
+async function walkMarkdownFiles(dir: string): Promise<string[]> {
+  const out: string[] = [];
+  async function walk(current: string): Promise<void> {
+    let entries;
+    try {
+      entries = await readdir(current, { withFileTypes: true });
+    } catch {
+      return; // dir doesn't exist or unreadable — caller treats as empty
+    }
+    for (const entry of entries) {
+      const full = join(current, entry.name);
+      if (entry.isDirectory()) {
+        if (entry.name.startsWith(".")) continue; // skip dot-dirs
+        await walk(full);
+      } else if (entry.isFile() && entry.name.endsWith(".md")) {
+        out.push(full);
+      }
+    }
+  }
+  await walk(dir);
+  return out;
+}
+
+/**
+ * Mirror the documenter's `docs/**\/*.md` output into Clio.
+ *
+ * Called after the documenter completes — both from the
+ * iteration-loop's auto-document path AND from the standalone
+ * `cfcf document` runner. Walks `/docs/` recursively; for
+ * each `.md` file, ingests it as a separate Clio document with a
+ * stable title `: ` and
+ * `updateIfExists: true` so re-running the documenter overwrites
+ * in place rather than producing duplicates. sha256 dedup makes
+ * unchanged content a no-op on the backend.
+ *
+ * Pre-existing user-authored files in `docs/` are ALSO ingested.
+ * Intentional: they're authoritative workspace content; including
+ * them in cross-workspace Clio search is a feature, not a leak.
+ * (`cfcf-docs/` artifacts are separate and tracked by their own
+ * hooks — different directory, no overlap.)
+ *
+ * Respects `clio.ingestPolicy`: skipped on `"off"`, runs on
+ * `"summaries-only"` and `"all"` (documenter output is the most
+ * polished cross-workspace summary a workspace produces — it
+ * belongs in summaries-only).
+ *
+ * Best-effort: per-file errors are logged + counted; one bad file
+ * doesn't fail the rest of the batch. The function never throws.
+ */
+export async function ingestDocumenterOutput(
+  backend: MemoryBackend,
+  workspace: WorkspaceConfig,
+  trigger: "loop-auto" | "manual",
+): Promise<{ ingested: number; errors: number }> {
+  const policy = await resolveIngestPolicy(workspace);
+  if (policy === "off") return { ingested: 0, errors: 0 };
+
+  const docsDir = join(workspace.repoPath, "docs");
+  const files = await walkMarkdownFiles(docsDir);
+  if (files.length === 0) return { ingested: 0, errors: 0 };
+
+  const project = resolveClioProject(workspace);
+  const author = actorForRole(workspace, "documenter");
+
+  let ingested = 0;
+  let errors = 0;
+
+  for (const filePath of files) {
+    let content: string;
+    try {
+      content = await readFile(filePath, "utf-8");
+    } catch (err) {
+      errors++;
+      console.warn(
+        `[clio] documenter output read failed for ${filePath}: ${err instanceof Error ? err.message : String(err)}`,
+      );
+      continue;
+    }
+    if (!content.trim()) continue; // skip empty files
+
+    const rel = relative(workspace.repoPath, filePath); // e.g. "docs/architecture.md"
+
+    try {
+      const result = await backend.ingest({
+        project,
+        title: `${workspace.name}: ${rel}`,
+        content,
+        author,
+        source: `cfcf-auto:documenter-output:${trigger}`,
+        // Singleton-per-(workspace, file) — `--update-if-exists` looks
+        // up by title within the project. Re-running the documenter
+        // overwrites in place; unchanged content is a sha256-dedup
+        // no-op (same pattern as plan.md, decision-log, architect-review).
+        updateIfExists: true,
+        metadata: baseMetadata(workspace, {
+          role: "documenter",
+          artifact_type: "documenter-output",
+          tier: "semantic",
+          file_path: rel,
+          ingest_trigger: trigger,
+        }),
+      });
+      recordInternalUsage(backend, {
+        operation: "ingest",
+        requestor: author,
+        documentId: result.document?.id,
+        projectId: result.document?.projectId,
+        extra: {
+          artifact_type: "documenter-output",
+          file_path: rel,
+          ingest_trigger: trigger,
+          action: result.action,
+        },
+      });
+      ingested++;
+    } catch (err) {
+      errors++;
+      console.warn(
+        `[clio] documenter output ingest failed for ${rel}: ${err instanceof Error ? err.message : String(err)}`,
+      );
+    }
+  }
+
+  return { ingested, errors };
+}
+
 // ── Hook: decision-log.md (tagged semantic entries) ───────────────────────
 
 const DECISION_LOG_HEADER_RE =
diff --git a/packages/core/src/documenter-runner.ts b/packages/core/src/documenter-runner.ts
index d134896..eafa0c9 100644
--- a/packages/core/src/documenter-runner.ts
+++ b/packages/core/src/documenter-runner.ts
@@ -20,6 +20,8 @@ import { dispatchForWorkspace, makeEvent } from "./notifications/index.js";
 import { getTemplate } from "./templates.js";
 import { effectiveClioProject } from "./clio/system-projects.js";
 import { formatClioActor } from "./clio/actor.js";
+import { getClioBackend } from "./clio/singleton.js";
+import { ingestDocumenterOutput } from "./clio/loop-ingest.js";
 
 /**
  * Count markdown files in the workspace's docs/ directory.
@@ -432,6 +434,31 @@ async function runDocument(
       docsFileCount,
       committed,
     } as Partial);
+
+    // Auto-ingest the documenter's docs/ output into Clio. Mirrors
+    // the call site in iteration-loop.ts's auto-document path (the
+    // post-SUCCESS branch). Only runs when the documenter actually
+    // succeeded — failed runs may have partial / broken output we
+    // don't want to surface in cross-workspace search. Best-effort:
+    // ingest failures never fail the run.
+    if (result.exitCode === 0) {
+      try {
+        const res = await ingestDocumenterOutput(
+          getClioBackend(),
+          workspace,
+          "manual",
+        );
+        if (res.ingested > 0 || res.errors > 0) {
+          console.log(
+            `[clio] documenter-output ingest: ${res.ingested} ingested, ${res.errors} errors`,
+          );
+        }
+      } catch (err) {
+        console.warn(
+          `[clio] documenter-output ingest failed (manual): ${err instanceof Error ? err.message : String(err)}`,
+        );
+      }
+    }
   } finally {
     documentProcessStore.delete(workspace.id);
     unregister();
diff --git a/packages/core/src/iteration-loop.ts b/packages/core/src/iteration-loop.ts
index 4944ed2..ec5022e 100644
--- a/packages/core/src/iteration-loop.ts
+++ b/packages/core/src/iteration-loop.ts
@@ -50,6 +50,7 @@ import {
   ingestDevIterationArtifacts,
   ingestJudgeArtifact,
   ingestDecisionLogEntries,
+  ingestDocumenterOutput,
   ingestIterationSummary,
   ingestRawIterationArtifacts,
   writeClioRelevant,
@@ -2386,6 +2387,29 @@ async function runJudgeAndDecide(
             await updateHistoryEvent(workspace.id, docResult.historyEventId, {
               committed,
             } as Partial);
+
+            // Auto-ingest the documenter's docs/ output into Clio.
+            // Mirrors the per-iteration ingest pattern (decision-log,
+            // plan.md, architect-review): every file in docs/ gets a
+            // stable title + updateIfExists. Best-effort; never breaks
+            // the loop. Same call lives in documenter-runner.ts for
+            // the standalone `cfcf document` invocation.
+            try {
+              const res = await ingestDocumenterOutput(
+                getClioBackend(),
+                workspace,
+                "loop-auto",
+              );
+              if (res.ingested > 0 || res.errors > 0) {
+                console.log(
+                  `[clio] documenter-output ingest: ${res.ingested} ingested, ${res.errors} errors`,
+                );
+              }
+            } catch (err) {
+              console.warn(
+                `[clio] documenter-output ingest failed (loop-auto): ${err instanceof Error ? err.message : String(err)}`,
+              );
+            }
           } catch {
             // Documenter failure is not fatal -- the code is done
           }
diff --git a/packages/core/src/templates/cfcf-documenter-instructions.md b/packages/core/src/templates/cfcf-documenter-instructions.md
index 716b62a..21af478 100644
--- a/packages/core/src/templates/cfcf-documenter-instructions.md
+++ b/packages/core/src/templates/cfcf-documenter-instructions.md
@@ -38,10 +38,15 @@ that didn't make it into the source-tree comments (item 6.9):
     cfcf clio search "" --project {{WORKSPACE_CLIO_PROJECT}} \
       --metadata '{"artifact_type":"decision-log-entry"}'
 
-- The final docs you produce go into `docs/` on disk, NOT into Clio.
-  cf² doesn't auto-ingest the documenter output (the `docs/` tree is
-  the canonical surface). If the user explicitly asks you to push a
-  copy to Clio, use `--author "documenter||"`.
+- The final docs you produce go into `docs/` on disk. cf² **also
+  auto-ingests** every `*.md` file under `docs/` into Clio after
+  you finish (since v0.24.4), with `--update-if-exists` + stable
+  per-file titles (`: docs/`). You don't
+  need to call `cfcf clio docs ingest` yourself — the harness
+  handles it. Author is stamped as `documenter||`
+  automatically. If you want to push something to Clio that ISN'T
+  in `docs/` (an ad-hoc cross-workspace note, say), use
+  `--author "documenter||"` explicitly.
 
 ## What to Produce
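
Reviewer note on the update-in-place behaviour this patch relies on: the commit message says that `updateIfExists: true` keeps one Clio document per file and that sha256 dedup turns unchanged content into a no-op. The backend code is not part of this diff, so the following is only a minimal sketch of that idea, assuming a hypothetical in-memory store. `InMemoryStore`, its `ingest` signature, and the `"created" | "updated" | "unchanged"` action names are illustrative and are not the real `MemoryBackend` API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical action names; the real backend's `result.action` values are not shown in this patch.
type Action = "created" | "updated" | "unchanged";

interface StoredDoc {
  id: number;
  title: string;
  content: string;
  sha256: string;
}

// Illustrative stand-in for a Clio backend: documents are looked up by
// (project, title), which is what makes a stable per-file title behave
// like a singleton key across documenter re-runs.
class InMemoryStore {
  private docs = new Map<string, StoredDoc>();
  private nextId = 1;

  ingest(project: string, title: string, content: string): { id: number; action: Action } {
    const key = `${project}\u0000${title}`;
    const sha256 = createHash("sha256").update(content, "utf8").digest("hex");
    const existing = this.docs.get(key);

    if (!existing) {
      const doc: StoredDoc = { id: this.nextId++, title, content, sha256 };
      this.docs.set(key, doc);
      return { id: doc.id, action: "created" };
    }
    if (existing.sha256 === sha256) {
      // Same bytes as last time: nothing to write.
      return { id: existing.id, action: "unchanged" };
    }
    // Same title, new bytes: overwrite in place and keep the id.
    existing.content = content;
    existing.sha256 = sha256;
    return { id: existing.id, action: "updated" };
  }

  count(): number {
    return this.docs.size;
  }
}
```

Under these assumptions, ingesting `gmbot: docs/architecture.md` three times (new content, same content, revised content) yields one stored document whose id never changes, which is the property the `updateIfExists` round-trip test in this patch checks against the real backend.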