From 0c14e4461885b69007704500528b79eef10100a2 Mon Sep 17 00:00:00 2001
From: AP3X
Date: Fri, 12 Jun 2026 20:19:02 -0700
Subject: [PATCH 1/6] ecosystem audit: align bundled agent-template names, gate sprint launch, add ecosystem map

Deep cross-skill audit of all 23 skills. Applied the highest-leverage fixes;
filed the rest to agent-state for follow-up cycles. Structural gate stays green.

Fixes:
- Align bundled subagent-template names in the reaper/backfill/drift kits with
  the manifest/install role names (dead-code-reaper-*, test-backfill-*,
  arch-drift-watcher). The unprefixed names had drifted from the SKILL.md
  install lines and broke the agents/README prefixed-naming policy; they live
  inside fenced blocks so the audit gate never saw them.
- sprint-ticket-runner: add an explicit launch gate (offers launch, never
  auto-launches) plus a stop condition. It was the only loop skill that would
  auto-launch code-writing makers with no termination clause.
- Normalize the lone ../references/state-templates.md link to ./ to match its
  siblings (same resolved target).

Docs and state:
- Add docs/ecosystem-map.md: intent routing, the two pipeline backbones, the
  autonomy ladder and human gate, a bundled-vs-external dependency matrix, the
  23-skill relationship table, the shared-state map, and the gates note.
- File remaining findings to triage-inbox.md (F-1..F-12), audit-policy
  proposals to decisions.md (HD-1..HD-5), and record completed work plus open
  tasks in completed.md and loop-state.md.

Gate: python scripts/audit-jar.py -> 208 checks, 0 failed. Verified by a
separate checker (maker != checker).
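
Illustration only, not part of the commit: a minimal Python sketch of why a
gate that strips fenced blocks cannot see template names, and of how such
names might be compared with manifest role names. The helper names, both
regexes, and the sample kit text are assumptions made for this example; none
of it is taken from scripts/audit-jar.py.

```python
import re

# Hypothetical helpers for illustration; not the real audit-jar.py logic.
FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block
FENCE_RE = re.compile(FENCE + r"(?:md|yaml)\n(.*?)" + FENCE, re.DOTALL)
NAME_RE = re.compile(r"^name:\s*(\S+)\s*$", re.MULTILINE)


def fenced_template_names(markdown: str) -> list[str]:
    """Return every `name:` value found inside md/yaml fenced blocks."""
    names = []
    for block in FENCE_RE.findall(markdown):
        names.extend(NAME_RE.findall(block))
    return names


def unknown_template_names(markdown: str, manifest_roles: set[str]) -> list[str]:
    """Return fenced template names that are not roles in the manifest."""
    return [n for n in fenced_template_names(markdown) if n not in manifest_roles]


# Sample kit text (assumed): one stale unprefixed name, one prefixed name.
KIT = "\n".join([
    "## Scout template",
    FENCE + "md",
    "---",
    "name: reaper-scout",
    "model: sonnet",
    "---",
    FENCE,
    "",
    FENCE + "md",
    "---",
    "name: dead-code-reaper-validator",
    "model: opus",
    "---",
    FENCE,
])

# Role names as they would appear in a manifest (assumed set for the example).
ROLES = {
    "dead-code-reaper-scout",
    "dead-code-reaper-reaper",
    "dead-code-reaper-validator",
}

print(unknown_template_names(KIT, ROLES))  # -> ['reaper-scout']
```

In this sketch the stale unprefixed name is reported while the prefixed one
passes. A real check would also need a way to exempt intentionally generic
examples, which is left out here.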
---
 agent-state/completed.md | 4 +
 agent-state/decisions.md | 34 +++
 agent-state/loop-state.md | 25 ++
 agent-state/triage-inbox.md | 111 ++++++++-
 .../arch-drift-watch/references/drift-kit.md | 2 +-
 .../dead-code-reaper/references/reaper-kit.md | 6 +-
 .../references/loop-architecture.md | 2 +-
 development/sprint-ticket-runner/SKILL.md | 17 +-
 .../references/backfill-kit.md | 6 +-
 docs/ecosystem-map.md | 228 ++++++++++++++++++
 10 files changed, 418 insertions(+), 17 deletions(-)
 create mode 100644 docs/ecosystem-map.md

diff --git a/agent-state/completed.md b/agent-state/completed.md
index 5fd279e..e37814d 100644
--- a/agent-state/completed.md
+++ b/agent-state/completed.md
@@ -48,3 +48,7 @@
 | C-2026-06-12-SF-023-RED | Capture RED pressure evidence for instrument-observability | skill-forge-26 | this commit | Recorded three fresh-agent pressure passes for `instrument-observability`, moved SF-023 to `red-captured`, and queued a GREEN-only patch stage for shortcuts around skipped investigation, deferred source maps/dashboards/privacy hardening, partial smoke coverage, and unsafe identifier context. |
 | C-2026-06-12-PUBLISH-CLEANUP | Remove local planning drafts before publish | publish-cleanup-1 | this commit | Removed tracked local planning draft files, added ignore coverage for draft plan/spec directories, and kept product-facing loop artifacts tracked. |
 | C-2026-06-12-README-HYGIENE-CORRECTION | Remove repo-facing hygiene section | readme-hygiene-correction-1 | this commit | Removed the README hygiene section so the README stays focused on product-facing skill usage while ignore enforcement remains in place for local draft plan and spec paths. |
+| C-2026-06-12-ECO-MAP | Add docs/ecosystem-map.md (routing + composition + relationship table) | ecosystem-audit-1 | this commit | Added the hand-authored edges-between-skills map: intent routing table, the two pipeline backbones, the autonomy ladder + human gate, a bundled-vs-external dependency matrix, the 23-skill relationship table, the shared-state-files map, and the gates note. Carries only what skills.json/core-skills.md cannot. Audit gate green (208/208). |
+| C-2026-06-12-ECO-KIT-NAMES | Align bundled subagent-template names with manifest/install roles | ecosystem-audit-1 | this commit | Renamed the fenced template `name:` frontmatter in reaper-kit (reaper-scout/reaper/reaper-validator -> dead-code-reaper-*), backfill-kit (backfill-* -> test-backfill-*), and drift-kit (drift-watcher -> arch-drift-watcher) to match the manifest roles the SKILL.md install lines name. Closes a gate-invisible drift that made a fresh agent copy mis-named, collision-prone agents (the agents/README naming policy mandates prefixed names). Independent checker verified. |
+| C-2026-06-12-ECO-SPRINT-GATE | Add launch gate + stop condition to sprint-ticket-runner | ecosystem-audit-1 | this commit | The only loop skill that auto-launched with no stop condition; contradicted the "ask before launching loops" rule. Added an Operating-Contract launch gate (offers launch, never auto-launches; human "go" before any code-writing maker) and an explicit stop condition (no ready tickets / blocked-NEEDS-DECISION / approved budget exhausted -> Closeout), plus a Phase 4 pointer. Independent checker verified. |
+| C-2026-06-12-ECO-LINK | Normalize loop-architecture state-templates link | ecosystem-audit-1 | this commit | Changed the lone `../references/state-templates.md` outlier to `./state-templates.md` to match its sibling links (same resolved target, which exists; the gate already passed it). Removes the inconsistency that caused one audit lens to mis-read it as a doubled-path break. |
diff --git a/agent-state/decisions.md b/agent-state/decisions.md
index 1e25e47..1ab89d0 100644
--- a/agent-state/decisions.md
+++ b/agent-state/decisions.md
@@ -6,3 +6,37 @@
 | Decision | Rationale | Cycle |
 |----------|-----------|-------|
 | Hook declarations live in role files and generated agent packs | Keeps hook behavior attached to the agent roles that produce evidence while `skill-forge` remains the only path that edits skills | hook-runtime-drop-in |
+| Durable ecosystem knowledge goes in `docs/ecosystem-map.md`; the audit narrative is delivered in-response, not as a repo file | Avoids a point-in-time report rotting in the tree (the jar's own plan-prune philosophy). Edges between skills are the durable part; findings live in triage-inbox/decisions | ecosystem-audit-1 |
+| Did NOT implement the proposed new audit gates this cycle | Audit-policy changes need separate review (maker != checker) and a careful false-positive assessment; proposed below with evidence + risk for human sign-off | ecosystem-audit-1 |
+| Aligned bundled `references/*-kit.md` template `name:` to the manifest/install names | The agents/README "Naming" section already mandates skill-prefixed role names to avoid `scout`/`validator` collisions; the kits violated the repo's own stated policy. Names inside fenced blocks are invisible to the gate, so this was a real, gate-undetectable drift | ecosystem-audit-1 |
+
+## Human-Decision Items (pending)
+
+> Audit-policy / public-contract questions surfaced by ecosystem-audit-1. Each
+> needs a human yes before a maker acts. Do NOT silently change audit policy.
+
+### HD-1 -- Add a gate: bundled kit template `name:` must resolve to a manifest role
+- **Evidence:** Before this cycle, `reaper-kit.md`/`backfill-kit.md`/`drift-kit.md` shipped templates whose `name:` diverged from the manifest/install names.
The audit gate strips fenced code blocks (`strip_code`) and only name-checks `references/role-skills/*.SKILL.md`, so embedded kit names are 100% invisible to it. The drift was fixed by hand this cycle but nothing prevents regression.
+- **Proposed check:** parse `name:` from fenced ```md/```yaml frontmatter in `*/*/references/*-kit.md`; assert each is a real role in the sibling `agents/manifest.json` OR carries an explicit `# example-only` marker.
+- **Risk:** moderate maintenance (a mini fenced-block parser + a whitelist for intentionally-generic examples). Net: closes a real, gate-undetectable drift class affecting 3+ skills at bounded cost.
+- **Recommendation:** APPROVE as a narrow, additive check (does not weaken any existing gate).
+
+### HD-2 -- Add a gate: every installable skill must carry a "NOT for" boundary
+- **Evidence:** Exactly 2 of 23 skills (add-to-jar, instrument-observability) have no negative boundary; instrument-observability's empty boundary creates a real routing collision (F-3). All 21 others carry one.
+- **Proposed check:** assert each `*/*/SKILL.md` contains a case-insensitive `not for` / `when not to use` marker.
+- **Risk:** low (2 current violators, stable phrasing). The DEEPER check — that each NOT-for redirect resolves to a real jar skill or a hedged external one — is NOT recommended for automation: external redirects (`bugfix`/`tdd`/`writing-plans`) are legitimate and distinguishing "hedged-optional" from "dangling" needs judgment.
+- **Recommendation:** APPROVE the marker-presence subset only; keep redirect-resolution a human review item.
+
+### HD-3 -- Add a gate: maturity vs. evidence consistency
+- **Evidence:** The gate verifies `evidence:` paths EXIST but not that they hold real proof. `bug-pipeline` declares `maturity: dogfooded` against `proof/bug-pipeline/README.md` which still says "No completed public proof packet has been added yet." `skill-forge` is honest (`maturity: dry-run` + empty packet).
+- **Proposed check:** if maturity is in {dogfooded, external-tested, battle-tested}, assert the evidence file does NOT contain the placeholder sentinel line.
+- **Risk:** low/medium — sentinel string is a brittle magic constant; gating only top tiers is a judgment boundary.
+- **Recommendation:** Either APPROVE the top-tier sentinel check, OR (lower effort) downgrade bug-pipeline to `maturity: linted` until a real packet lands. Pick one.
+
+### HD-4 -- Do NOT add the broad "automation without a stop condition" or "redirect resolves to a real skill" lints
+- **Evidence:** Both are valuable in concept but judgment-heavy and noise-prone: a keyword lint would mis-flag detection-only / by-design-autonomous skills; redirect-resolution can't mechanically tell a hedged external skill from a dangling one.
+- **Recommendation:** REJECT for CI. Keep only the mechanical subsets (HD-2 marker presence; and the already-applied launch-gate fix for sprint-ticket-runner). A high false-positive rate would erode trust in the GREEN signal — the opposite of the goal.
+
+### HD-5 -- MemBerry / `memberry-setup` contract: optional adapter, jar-wide
+- **Evidence:** autonomous-advisor (F-1) and clean-room (F-2) present an unbundled user-global skill as a hard prerequisite/halt, contradicting optimization-loop's "OPTIONAL adapter" posture and the user's recorded "jar skills self-contained" rule.
+- **Recommendation:** APPROVE the stance "MemBerry is an optional persistence adapter; absence is a clean skip" jar-wide, and let F-1/F-2 fixers reframe the two skills to match. Documented in docs/ecosystem-map.md §4.
diff --git a/agent-state/loop-state.md b/agent-state/loop-state.md
index 36c5a2a..1aaa7b0 100644
--- a/agent-state/loop-state.md
+++ b/agent-state/loop-state.md
@@ -30,13 +30,24 @@ Keep the skill jar publish-ready via three loops, one task per cycle each:

 ## Open Tasks

+> Promoted from triage-inbox.md by ecosystem-audit-1. One task per cycle, maker
+> then a separate checker.
Full evidence in `agent-state/triage-inbox.md`.
+
 | ID | Task | Owner | Status | Files | Acceptance (exits 0) |
 |----|------|-------|--------|-------|----------------------|
+| T-ECO-1 | Add NOT-for boundary to instrument-observability (F-3) | implementer | pending | development/instrument-observability/SKILL.md, skills.json | `grep -iE "not for\|when not to use" development/instrument-observability/SKILL.md && python scripts/audit-jar.py` |
+| T-ECO-2 | Reframe MemBerry/memberry-setup as optional adapter in autonomous-advisor + clean-room (F-1, F-2, HD-5) | implementer | pending | development/autonomous-advisor/SKILL.md, development/clean-room/SKILL.md | `! grep -n "surface the error and halt" development/autonomous-advisor/SKILL.md && python scripts/audit-jar.py` |
+| T-ECO-3 | Wire reciprocal handoffs incl. test-backfill suspected-bug -> BUG_TRACKER.md (F-7, F-8) | implementer | pending | development/{instrument-observability,improve-architecture,dead-code-reaper,test-backfill-loop}/SKILL.md | `grep -n "BUG_TRACKER" development/test-backfill-loop/SKILL.md && python scripts/audit-jar.py` |
+| T-ECO-4 | Add committed-clean precondition before plan-prune deletes a doc (F-10) | implementer | pending | development/plan-prune/SKILL.md | `grep -in "committed\|untracked" development/plan-prune/SKILL.md && python scripts/audit-jar.py` |

 ## Completed Tasks

 | ID | Task | Cycle | Commit | Result |
 |----|------|-------|--------|--------|
+| C-2026-06-12-ECO-MAP | Add docs/ecosystem-map.md | ecosystem-audit-1 | this commit | Edges-between-skills map: routing table, two pipeline backbones, autonomy ladder + human gate, dependency matrix, 23-skill relationship table, state-files map, gates note. |
+| C-2026-06-12-ECO-KIT-NAMES | Align bundled kit template names with manifest roles | ecosystem-audit-1 | this commit | Renamed fenced `name:` in reaper/backfill/drift kits to dead-code-reaper-*/test-backfill-*/arch-drift-watcher; closes gate-invisible drift; independent checker verified. |
+| C-2026-06-12-ECO-SPRINT-GATE | Add launch gate + stop condition to sprint-ticket-runner | ecosystem-audit-1 | this commit | The lone auto-launch/no-stop loop skill now offers launch and defines a stop condition; aligns with the "ask before launching loops" rule; independent checker verified. |
+| C-2026-06-12-ECO-LINK | Normalize loop-architecture state-templates link | ecosystem-audit-1 | this commit | `../references/state-templates.md` -> `./state-templates.md` (same target, already gate-green); removes the inconsistency one audit lens mis-read as a break. |
 | C-2026-06-10-SF-001-RED | Capture RED pressure evidence for `arch-drift-watch` | skill-forge-1 | this commit | RED surfaced six concrete rationalizations: ad hoc `rg` scanning instead of FUGAZI, inferred zones, silent baseline reset, immediate auto-fix, audit-gate overconfidence, and loose triage routing. |
 | C-2026-06-10-SF-001-GREEN | Patch `arch-drift-watch` for captured RED rationalizations | skill-forge-2 | this commit | GREEN tightened `development/arch-drift-watch/SKILL.md` against ad hoc scanning, inferred zones, silent baseline resets, detection-cycle fixes, audit-green overconfidence, and vague triage routing. |
 | C-2026-06-10-SF-001-REF1 | Judge `arch-drift-watch` pressure pass 1 | skill-forge-3 | this commit | Independent judge returned COMPLY and counted clean run 1/3 for the captured pressure scenario. |
@@ -144,3 +155,17 @@ For future drop-in skills, run `python scripts/sync-jar.py` before `python scripts/audit-jar.py`; missing skill-forge rows should appear as `pending-red`, and hook evidence should accumulate in `agent-state/skill-usage.md`.
+
+ecosystem-audit-1 ran a deep, evidence-backed audit of all 23 skills (29-agent
+workflow: one reader per skill + 6 cross-cutting lenses). The structural gate is
+GREEN (208 checks, 0 failed after this cycle's edits). It applied the four
+highest-leverage fixes this cycle (kit-name alignment, sprint-ticket-runner
+launch gate, the loop-architecture link, and `docs/ecosystem-map.md`) and filed
+the rest to `triage-inbox.md` (F-1..F-12) and `decisions.md` (HD-1..HD-5). Next
+jar-audit cycle: take ONE Open Task (T-ECO-1 recommended first — instrument-
+observability NOT-for is low-risk and resolves a real routing collision), maker
+fixes it, a SEPARATE checker verifies, run `python scripts/audit-jar.py`, commit
+code + state together, stop. The proposed new gates (HD-1..HD-3) are audit-policy
+changes that need explicit human approval before a maker implements them — do not
+add them silently. Note: prior "27 checks"/"182 checks" narration in this file is
+stale; the current count is 208.
diff --git a/agent-state/triage-inbox.md b/agent-state/triage-inbox.md
index 853c37a..e1ffcff 100644
--- a/agent-state/triage-inbox.md
+++ b/agent-state/triage-inbox.md
@@ -6,11 +6,106 @@
 ## Findings
-
+> Source for all F-* below: ecosystem-audit-1 (deep evidence-backed audit of all
+> 23 skills, 2026-06-12). The structural gate is GREEN (208/208); these are
+> defects the gate cannot see (routing, handoff wiring, external-dep framing,
+> fresh-agent safety). The highest-leverage fixes from the same audit were
+> already applied this cycle (see completed.md): bundled-template name alignment,
+> the loop-architecture link, sprint-ticket-runner's launch gate, and
+> docs/ecosystem-map.md. The items below are deferred for the next cycles.
+
+### F-1 -- autonomous-advisor treats unbundled `memberry-setup` as a hard halt, and contradicts itself on whether MemBerry is optional
+- **Source:** ecosystem-audit-1 (reference-integrity + verification-gates lenses)
+- **Priority:** High
+- **Risk:** medium
+- **Evidence:** `development/autonomous-advisor/SKILL.md:57` "invoke the `memberry-setup` skill to bootstrap before continuing"; `:142` "If the call fails, surface the error and halt" vs `:585` "MemBerry memory — optional ... when available". `memberry-setup` is user-global, not bundled in the jar. A fresh jar-only checkout cannot satisfy the halt path. optimization-loop already frames MemBerry as an OPTIONAL adapter; make autonomous-advisor match.
+- **Suggested owner:** implementer (then skill-forge-judge)
+- **Verification command:** `grep -n "optional" development/autonomous-advisor/SKILL.md && ! grep -n "surface the error and halt" development/autonomous-advisor/SKILL.md`
+
+### F-2 -- clean-room presents `memberry-setup` as default-mandatory rather than an optional accelerator
+- **Source:** ecosystem-audit-1 (reference-integrity lens)
+- **Priority:** Medium
+- **Risk:** low
+- **Evidence:** `development/clean-room/SKILL.md:263,265` "invoke the `memberry-setup` skill ... do **not** silently skip MemBerry setup" (has a §0 opt-out at `:734`, which softens it). Frame MemBerry the way FUGAZI is framed at `:655` ("if available"): absence is a clean skip, not a blocker.
+- **Suggested owner:** implementer
+- **Verification command:** `grep -n "if MemBerry is available\|optional" development/clean-room/SKILL.md`
+
+### F-3 -- instrument-observability has no "NOT for" boundary; collides with diagnose-loop / optimization-loop on the "production debugging" trigger
+- **Source:** ecosystem-audit-1 (routing-overlap + verification-gates lenses)
+- **Priority:** Medium
+- **Risk:** low
+- **Evidence:** `development/instrument-observability/SKILL.md` description ends at "...only when the repo or user requires them." with no NOT-for clause; body has none. It is the only development skill with an empty negative boundary, yet lists "production debugging" as a trigger. Add a NOT-for: diagnose a single live incident → diagnose-loop; general hardening → optimization-loop; one known bug → bugfix. (Editing the description requires `python scripts/gen-index.py` to re-sync skills.json.)
+- **Suggested owner:** implementer
+- **Verification command:** `grep -iE "not for|when not to use" development/instrument-observability/SKILL.md && python scripts/audit-jar.py --quiet`
+
+### F-4 -- add-to-jar has no "NOT for" boundary
+- **Source:** ecosystem-audit-1 (map: add-to-jar)
+- **Priority:** Low
+- **Risk:** low
+- **Evidence:** `development/add-to-jar/SKILL.md` has a clear trigger but no negative boundary (e.g. NOT for authoring a skill from scratch — that is skill-forge — or bulk-importing many skills in one pass). Same gen-index re-sync applies if the description changes.
+- **Suggested owner:** implementer
+- **Verification command:** `grep -iE "not for|when not to use" development/add-to-jar/SKILL.md && python scripts/audit-jar.py --quiet`
+
+### F-5 -- external-skill redirects (`bugfix`, `tdd`, `to-issues`, `triage`, `writing-plans`) are named without an external/optional marker or fallback
+- **Source:** ecosystem-audit-1 (reference-integrity lens)
+- **Priority:** Medium
+- **Risk:** low
+- **Evidence:** loop-engineer (desc + `:66` "use **bugfix**"), optimization-loop (`:61` + desc), improve-architecture (handoffs `to-issues`/`triage`/`tdd`), unit-test-quality (`:32` "use TDD"), design-system (`:79` "feeds **writing-plans**"). None marks these as external/global skills absent from a jar-only checkout, and none gives a plain fallback. diagnose-loop and design-panel handle their external refs correctly (explicit optional + bundled fallback) — copy that pattern. Document the convention in docs/ecosystem-map.md §4 (already added) and hedge each redirect.
+- **Suggested owner:** implementer
+- **Verification command:** `grep -nE "external|optional|or just" development/loop-engineer/SKILL.md development/optimization-loop/SKILL.md development/unit-test-quality/SKILL.md systems-design/design-system/SKILL.md`
+
+### F-6 -- loop-scaffold entrypoint paths are placeholdered/ambiguous in dependent loops
+- **Source:** ecosystem-audit-1 (reference-integrity lens)
+- **Priority:** Low
+- **Risk:** low
+- **Evidence:** `development/optimization-loop/SKILL.md:179` invokes `python /scripts/scaffold-loop.py ...` (unresolved placeholder); `development/bug-pipeline/SKILL.md:87` names `scaffold-loop.py` as if at repo `scripts/` (it lives at `development/loop-engineer/scripts/scaffold-loop.py`). Replace with the concrete relative path `../loop-engineer/scripts/scaffold-loop.py` (bug-pipeline already offers a "go minimal" fallback).
+- **Suggested owner:** implementer
+- **Verification command:** `grep -n "scaffold-loop.py" development/optimization-loop/SKILL.md development/bug-pipeline/SKILL.md | grep -v ""`
+
+### F-7 -- missing/one-directional handoff wiring across naturally-paired skills
+- **Source:** ecosystem-audit-1 (relationships lens; 8 documented missing edges)
+- **Priority:** Medium
+- **Risk:** low
+- **Evidence:** (a) instrument-observability ↔ production-readiness absent both ways (instrument-observability has empty related/handoff arrays) though it produces production-readiness's telemetry input; (b) improve-architecture and dead-code-reaper do not name arch-drift-watch as their upstream detector (arch-drift-watch names them); (c) autonomous-advisor lists review-panel only as "related", not a branch-gate handoff; (d) test-backfill-loop's suspected-bug escalation never names the canonical `agent-state/BUG_TRACKER.md` sink; (e) design-panel→autonomous-advisor handoff omits the spec→PRP transition (autonomous-advisor rejects "no PRP"); (f) plan-prune does not reciprocally name sprint-ticket-runner; (g) design-system cites clean-room by bare name with no link. docs/ecosystem-map.md §2 now documents the intended edges — wire the SKILL.md bodies to match.
+- **Suggested owner:** implementer
+- **Verification command:** `grep -l "production-readiness" development/instrument-observability/SKILL.md && grep -l "arch-drift-watch" development/improve-architecture/SKILL.md development/dead-code-reaper/SKILL.md`
+
+### F-8 -- test-backfill-loop suspected-bug escalation has no concrete sink
+- **Source:** ecosystem-audit-1 (relationships + map: test-backfill-loop)
+- **Priority:** Medium
+- **Risk:** low
+- **Evidence:** `development/test-backfill-loop/SKILL.md:32` "it's a **defect to file** (to a tracker, or hand it to diagnose-loop)" and `references/backfill-kit.md` writer marks `blocked-suspected-bug` — but neither names `agent-state/BUG_TRACKER.md` or defines who routes the entry onward.
Blocked entries pile up with nothing downstream consuming them. Name BUG_TRACKER.md as the canonical sink and state the route.
+- **Suggested owner:** implementer
+- **Verification command:** `grep -n "BUG_TRACKER" development/test-backfill-loop/SKILL.md development/test-backfill-loop/references/backfill-kit.md`
+
+### F-9 -- bug-pipeline worked example uses bare `hunter/fixer/validator` names alongside install-directed `bug-pipeline-*` roles
+- **Source:** ecosystem-audit-1 (reference-integrity lens)
+- **Priority:** Low
+- **Risk:** low
+- **Evidence:** `development/bug-pipeline/SKILL.md:170` directs installing `bug-pipeline-hunter/-fixer/-validator`, while the inline templates and dogfooded example use bare `.claude/agents/hunter.md` etc. Both exist; the ambiguity is which is canonical. State that the bare names are the jar's legacy dogfooded instance and the prefixed manifest roles are the canonical installable artifacts. (Mirrors the now-fixed kit-name drift in dead-code-reaper/test-backfill-loop/arch-drift-watch.)
+- **Suggested owner:** implementer
+- **Verification command:** `grep -n "legacy\|illustrative\|canonical" development/bug-pipeline/SKILL.md`
+
+### F-10 -- plan-prune can permanently delete UNCOMMITTED planning docs (relies on "git history is the archive" without checking it)
+- **Source:** ecosystem-audit-1 (map: plan-prune, fresh-agent-unsafe)
+- **Priority:** Medium
+- **Risk:** medium
+- **Evidence:** `development/plan-prune/SKILL.md:115` "delete it after its useful claims are represented ... Git history is the archive." Preflight only checks tree dirtiness; nothing requires the doc being deleted to be committed first. A fresh agent could delete an untracked/dirty planning doc, losing content git never held. Add a precondition: never delete a planning doc unless it is committed clean; archive/block untracked or dirty ones instead.
+- **Suggested owner:** implementer (then verifier)
+- **Verification command:** `grep -in "committed\|untracked\|git ls-files" development/plan-prune/SKILL.md`
+
+### F-11 -- SKILL.md install pointers send agents to `../agents/README.md`, which does not list the named roles
+- **Source:** ecosystem-audit-1 (reference-integrity + documentation lenses)
+- **Priority:** Low
+- **Risk:** low
+- **Evidence:** dead-code-reaper, diagnose-loop, review-panel, improve-architecture, production-readiness (and others) say "Copy-ready generated agents live in ../agents/README.md" then name roles, but `development/agents/README.md` documents only test-backfill-loop and review-panel examples and otherwise defers to manifest.json; `systems-design/agents/README.md` names none. Roles DO exist in manifest.json and the generated `claude/`/`codex/` files. Fix: point install lines at `../agents/manifest.json` and `../agents//` directly, OR list the role names in the README.
+- **Suggested owner:** implementer
+- **Verification command:** `python -c "import json; n=set(r['name'] for r in json.load(open('development/agents/manifest.json'))['agents']); t=open('development/agents/README.md').read(); print(sum(x in t for x in n),'of',len(n))"`
+
+### F-12 -- autonomous-advisor "Confirming the Loop Is Running" list double-numbers item 3
+- **Source:** ecosystem-audit-1 (map: autonomous-advisor)
+- **Priority:** Low
+- **Risk:** low
+- **Evidence:** `development/autonomous-advisor/SKILL.md:471` "3. The loop is self-sustaining." and `:478` "3. Loop termination." — the second should be 4. Muddies a step-by-step procedure.
+- **Suggested owner:** implementer
+- **Verification command:** `grep -nE "^[0-9]+\. " development/autonomous-advisor/SKILL.md`
diff --git a/development/arch-drift-watch/references/drift-kit.md b/development/arch-drift-watch/references/drift-kit.md
index a152d5e..bc5c3eb 100644
--- a/development/arch-drift-watch/references/drift-kit.md
+++ b/development/arch-drift-watch/references/drift-kit.md
@@ -58,7 +58,7 @@ Advance the baseline **only** when a human accepts the current state (post-revie

 ```md
 ---
-name: drift-watcher
+name: arch-drift-watcher
 description: "Producer for the arch-drift-watch loop. Runs FUGAZI structural rules read-only, diffs against the committed baseline, and files NEW violations to the triage inbox. Use during the loop's scan stage. Writes no code; never edits the baseline."
 model: sonnet
 ---
diff --git a/development/dead-code-reaper/references/reaper-kit.md b/development/dead-code-reaper/references/reaper-kit.md
index 261324c..4546909 100644
--- a/development/dead-code-reaper/references/reaper-kit.md
+++ b/development/dead-code-reaper/references/reaper-kit.md
@@ -46,7 +46,7 @@ Never reapable by this loop: `circular-dependencies`, `boundary-violations`, `co

 ```md
 ---
-name: reaper-scout
+name: dead-code-reaper-scout
 description: "Producer for the dead-code-reaper loop. Runs FUGAZI's dead-code family read-only, proves each candidate unreachable with trace, and files high-confidence clusters to the ledger. Use during the loop's discovery stage. Never deletes code."
 model: sonnet
 ---
@@ -67,7 +67,7 @@ Return: candidate count, one line each (kind + proof), what you filtered and why

 ```md
 ---
-name: reaper
+name: dead-code-reaper-reaper
 description: "Maker for the dead-code-reaper loop. Removes exactly one proven-dead cluster per cycle with the smallest diff, re-scans, and runs the repo gate. Use during the loop's execution stage. Never validates its own removal."
 model: sonnet
 ---
@@ -88,7 +88,7 @@ Return: cluster removed, LOC removed, re-scan result, gate result, ledger update

 ```md
 ---
-name: reaper-validator
+name: dead-code-reaper-validator
 description: "Checker for the dead-code-reaper loop. Independently re-runs FUGAZI and the full gate on a removal, enforces the finding-count/LOC ratchet, and promotes or reopens. Use during the loop's verification stage. Run on a different model than the reaper."
 model: opus
 ---
diff --git a/development/loop-engineer/references/loop-architecture.md b/development/loop-engineer/references/loop-architecture.md
index 0778036..b40ee95 100644
--- a/development/loop-engineer/references/loop-architecture.md
+++ b/development/loop-engineer/references/loop-architecture.md
@@ -6,7 +6,7 @@ Everything here describes **Layer 2** — the system that runs cycle after cycle

 Companion references:

-- [../references/state-templates.md](../references/state-templates.md) — the `agent-state/` files
+- [./state-templates.md](./state-templates.md) — the `agent-state/` files
 - [./subagent-templates.md](./subagent-templates.md) — explorer / implementer / verifier / security-reviewer bodies (Claude Code `.md` + Codex `.toml`)
 - [./automation-templates.md](./automation-templates.md) — trigger prompts + the per-cycle driver prompt
 - [./safety-and-gates.md](./safety-and-gates.md) — `AGENTS.md` rules + runnable gates
diff --git a/development/sprint-ticket-runner/SKILL.md b/development/sprint-ticket-runner/SKILL.md
index 16f6054..df9fb8a 100644
--- a/development/sprint-ticket-runner/SKILL.md
+++ b/development/sprint-ticket-runner/SKILL.md
@@ -32,6 +32,20 @@ Maker and checker stay separate. Parallel work is allowed only from the parallelism audit and is invalidated when actual touches exceed the predicted write set.
+**Launch gate — this skill OFFERS launch, it never auto-launches.** Before
+launching any code-writing maker (Phase 4 onward), present the parallelism map
+and the first-cycle plan and get an explicit human "go". A compute-spending
+execution loop starts only on an explicit human yes; ticket creation and the
+parallelism audit (Phases 0–3) and Analysis-Only Mode need no such approval.
+
+**Stop condition.** Repeat the execute → verify → update cycle only until one of
+these holds, then write the Closeout and stop — never spin cycles past an empty
+`ready` lane or an approved budget:
+
+- no `ready` tickets remain;
+- a `blocked` / `NEEDS-DECISION` ticket needs a human; or
+- the human-approved budget (cycle count, wall-clock, or cost) is exhausted.
+
 ## State Layout

 Create or maintain these files:
@@ -108,7 +122,8 @@ the map.

 ### 4. Execute One Cycle

-Pick the next ticket from the board. For a `parallel-build` lane, create one
+Only after the launch gate (Operating Contract) is cleared, pick the next ticket
+from the board. For a `parallel-build` lane, create one
 worktree/branch per ticket and run only one maker per worktree. For a serial lane, finish and verify the earlier ticket before starting the next.
diff --git a/development/test-backfill-loop/references/backfill-kit.md b/development/test-backfill-loop/references/backfill-kit.md
index fb37a69..e7cd845 100644
--- a/development/test-backfill-loop/references/backfill-kit.md
+++ b/development/test-backfill-loop/references/backfill-kit.md
@@ -34,7 +34,7 @@ A new test must be able to fail. Prove it before counting the test:

 ```md
 ---
-name: backfill-scout
+name: test-backfill-scout
 description: "Producer for the test-backfill loop. Finds the highest-value uncovered module and files one target per cycle. Use during the loop's discovery stage. Writes no tests."
 model: sonnet
 ---
@@ -50,7 +50,7 @@ Return: the target, its coverage gap, and why it's the highest value now.
```md --- -name: backfill-writer +name: test-backfill-writer description: "Maker for the test-backfill loop. Writes characterization tests for one module that pin current behaviour and raise coverage. Use during the loop's execution stage. Never validates its own tests." model: sonnet --- @@ -68,7 +68,7 @@ Return: tests added, coverage before→after, any suspected bug filed. ```md --- -name: backfill-verifier +name: test-backfill-verifier description: "Checker for the test-backfill loop. Confirms each new test bites and the coverage ratchet advanced. Use during the loop's verification stage. Run on a different model than the writer." model: opus --- diff --git a/docs/ecosystem-map.md b/docs/ecosystem-map.md new file mode 100644 index 0000000..4706cda --- /dev/null +++ b/docs/ecosystem-map.md @@ -0,0 +1,228 @@ +# Skill Jar — Ecosystem Map + +> Hand-authored routing + composition map for the jar's skills. It carries the +> one thing the generated indexes cannot: the **edges between skills** — +> routing, composition, verification, autonomy, and shared state. +> +> Companion surfaces (do not duplicate them here): [`skills.json`](../skills.json) +> is the generated per-skill index (name / description / path / tags / maturity); +> [core-skills.md](core-skills.md) is the generated list of the self-improvement +> substrate; [evidence-model.md](evidence-model.md) defines the metadata fields. +> This map is **not generated** — update §5 when a skill is added or renamed. + +## How to read this + +The jar is an agent operating system. 
Skills are executable capabilities; +[`skills.json`](../skills.json) is the routing index; category folders are +installable plugins; the agent packs ([development](../development/agents/manifest.json), +[systems-design](../systems-design/agents/manifest.json)) are reusable worker +roles; `agent-state/` is durable loop memory; [`scripts/audit-jar.py`](../scripts/audit-jar.py) +is the quality gate; [AGENTS.md](../AGENTS.md) is the safety constitution. + +A fresh agent should: (1) pick the skill from §1, (2) check the pipeline it +belongs to in §2, (3) confirm the autonomy/human-gate posture in §3, (4) treat +every dependency in §4 as optional unless marked bundled, and (5) use the +canonical roles and state files in §5–§6 instead of inventing new ones. + +## 1. Routing — which skill for which intent + +Pick by the **shape of the request**, not by keyword. Each row also names what +NOT to use, so overlapping skills stay disambiguated. + +| The user wants to… | Use | Not | +|---|---|---| +| Fix one known bug | the host's `bugfix` / your debugger (external) | a loop skill | +| Root-cause ONE hard bug of unknown cause | [diagnose-loop](../development/diagnose-loop/SKILL.md) | bug-pipeline (that's a backlog) | +| Continuously find→fix→verify many defects | [bug-pipeline](../development/bug-pipeline/SKILL.md) | diagnose-loop (single bug) | +| Harden an existing codebase on measured metrics | [optimization-loop](../development/optimization-loop/SKILL.md) | auto-research, improve-architecture | +| Optimize ONE scalar metric on a frozen harness | [auto-research](../development/auto-research/SKILL.md) | optimization-loop (multi-metric) | +| Deepen shallow modules (human picks direction) | [improve-architecture](../development/improve-architecture/SKILL.md) | optimization-loop (automated) | +| Get continuous early warning on architecture drift | [arch-drift-watch](../development/arch-drift-watch/SKILL.md) | improve-architecture (it decides) | +| Safely remove confirmed-dead 
code | [dead-code-reaper](../development/dead-code-reaper/SKILL.md) | improve-architecture (live code) | +| Raise test coverage one module per cycle | [test-backfill-loop](../development/test-backfill-loop/SKILL.md) | unit-test-quality (judging tests) | +| Judge / repair / reject existing or slop tests | [unit-test-quality](../development/unit-test-quality/SKILL.md) | test-backfill-loop (a loop) | +| Build a NEW custom loop (job ≠ the named loops) | [loop-engineer](../development/loop-engineer/SKILL.md) | the specialized loops | +| Explore a feature/component design (design-it-twice) | [design-panel](../development/design-panel/SKILL.md) | design-system (whole system) | +| Adversarial multi-lens review of a diff/branch/PR | [review-panel](../development/review-panel/SKILL.md) | bug-pipeline (continuous) | +| Add observability/telemetry to an app | [instrument-observability](../development/instrument-observability/SKILL.md) | diagnose-loop (it debugs) | +| Reconcile fragmented/stale planning docs | [plan-prune](../development/plan-prune/SKILL.md) | sprint-ticket-runner (it executes) | +| Run a long multi-ticket sprint from a board | [sprint-ticket-runner](../development/sprint-ticket-runner/SKILL.md) | autonomous-advisor (single PRP pass) | +| Hands-off execute a complete PRP | [autonomous-advisor](../development/autonomous-advisor/SKILL.md) | sprint-ticket-runner, clean-room | +| Reimplement / port / clone an existing codebase | [clean-room](../development/clean-room/SKILL.md) | autonomous-advisor (no analysis pass) | +| Harden a skill against agent rationalizations | [skill-forge](../development/skill-forge/SKILL.md) | loop-engineer (code loop) | +| Add/import one skill into this jar | [add-to-jar](../development/add-to-jar/SKILL.md) | skill-forge (authoring) | +| Design a whole system / size a topology | [design-system](../systems-design/design-system/SKILL.md) | design-panel (one feature) | +| Pin an API contract (retries/versioning/errors) | 
[api-design](../systems-design/api-design/SKILL.md) | design-system, data-store-selection | +| Choose the data layer from access patterns | [data-store-selection](../systems-design/data-store-selection/SKILL.md) | design-system, api-design | +| Gate a service for launch (SLOs/runbooks/drills) | [production-readiness](../systems-design/production-readiness/SKILL.md) | design-system (it designs) | + +The four well-modelled overlap clusters (copy this discipline when adding +skills): **defect triad** diagnose-loop ↔ bug-pipeline ↔ optimization-loop; +**improvement triad** optimization-loop ↔ auto-research ↔ improve-architecture; +**test pair** test-backfill-loop ↔ unit-test-quality; **detector→actor** +arch-drift-watch → improve-architecture / dead-code-reaper. Each names the +others in its "NOT for" boundary. + +## 2. Composition — the two backbones + +**Loop backbone.** [loop-engineer](../development/loop-engineer/SKILL.md) is the +scaffolding spine (state files, maker≠checker subagents, drivers, gates, +worktree isolation). 
Five specialized loops run ON it and should be invoked +directly when their job matches — loop-engineer is the builder of last resort, +not a competitor: + +``` +loop-engineer (scaffold) + ├─ bug-pipeline (Hunter → Fixer → Validator over BUG_TRACKER.md) + ├─ optimization-loop (audit → fix → measure, metric ratchet) + ├─ auto-research (hypothesize → mutate → run → keep/discard, one scalar) + ├─ dead-code-reaper (Scout proves dead → Reaper removes → Validator gates) + └─ test-backfill-loop (Scout → Writer characterization tests → Verifier "bite") +``` + +**Design → build → launch backbone (systems-design + autonomous execution).** + +``` +design-system ──▶ api-design / data-store-selection (detail the contract & data layer) + │ + ▼ +design-panel (feature spec) or clean-room (port/clone → DESIGN_DOC + PRP) + │ + ▼ +autonomous-advisor (execute the PRP hands-off; advisor ≠ verifier) + │ + ├─▶ review-panel (adversarial pre-merge review of the branch) + ├─▶ optimization-loop (Phase 5 hardening, delegated not reimplemented) + └─▶ production-readiness (launch gate: SLOs, runbooks, drill) +``` + +**Detection → decision.** [arch-drift-watch](../development/arch-drift-watch/SKILL.md) +(continuous, detection-only) files NEW drift and routes it: structural-judgment → +improve-architecture; duplication/dead code → dead-code-reaper. It never decides +or applies a refactor itself. + +**Cross-cutting.** [instrument-observability](../development/instrument-observability/SKILL.md) +produces the telemetry/alerts that [production-readiness](../systems-design/production-readiness/SKILL.md)'s +launch gate consumes (instrument first, then gate). [plan-prune](../development/plan-prune/SKILL.md) +keeps the planning surface canonical *before* [sprint-ticket-runner](../development/sprint-ticket-runner/SKILL.md) +executes it. + +## 3. Autonomy ladder and the human gate + +Skills declare one of four autonomy postures. 
The rule: **compute-spending +execution loops are OFFERED, never auto-launched** — they start only on an +explicit human "go". The human always owns architecture, merge, push, cost, and +any irreversible external action (deploys, publishes, emails). See +[AGENTS.md](../AGENTS.md) for the safety floor. + +| Posture | Meaning | Example skills | +|---|---|---| +| detection-only | reads + reports, writes no code | arch-drift-watch (L1), improve-architecture (explore), production-readiness, data-store-selection | +| offers-launch | builds + dry-runs one cycle, then asks before running | bug-pipeline, dead-code-reaper, optimization-loop, auto-research, test-backfill-loop, loop-engineer | +| fully-autonomous (gated) | runs the whole pipeline, but behind hard phase gates, a 50-cycle cap, maker≠checker, and no irreversible actions | autonomous-advisor | +| **offers-launch (after fix)** | sprint-ticket-runner now carries an explicit launch gate + stop condition (see its Operating Contract) so it offers launch rather than auto-launching | sprint-ticket-runner | + +`offers-launch` skills MUST present the plan/baseline and wait for a human yes; +they MUST define a stop condition (converged / stalled / budget exhausted / no +work left). A loop skill missing either is a defect. + +## 4. Dependencies and adapters — bundled vs external + +The jar is **self-contained**: every skill runs from its own `SKILL.md` + +bundled `references/`. The items below are *optional accelerators* — absence is +a clean skip, never a blocker. A fresh jar-only checkout has none of them and +must still work. 
+ +| Dependency | Status | Used by (referenced) | Posture | +|---|---|---|---| +| FUGAZI (`fugazi` / `fugazi-mcp`) | external static-analysis CLI/MCP | dead-code-reaper, arch-drift-watch (required for the loop), diagnose-loop, review-panel, auto-research, test-backfill-loop (optional) | if present, use; never run `fugazi fix` unattended | +| MemBerry + `memberry-setup` | external user-global MCP + skill, NOT bundled | optimization-loop (optional), autonomous-advisor & clean-room (should be optional — see open findings) | optional persistence adapter; absent = files-only | +| Superpowers suite | external skills | design-panel, diagnose-loop, review-panel, autonomous-advisor (all bundle a standalone fallback) | optional lineage/accelerators | +| `bugfix`, `tdd`, `to-issues`, `triage`, `writing-plans` | external/global skills, NOT in the jar | named as redirects by loop-engineer, optimization-loop, improve-architecture, unit-test-quality, design-system | should be marked external + carry a plain fallback | + +## 5. Skill relationship table + +One row per installable skill. Prerequisite = what usually runs first; Hands +off to = what runs after / where output goes; Roles = canonical installable +agent names (from the category `manifest.json`; `—` = scaffolds/bundles its +own or needs none). 
+ +| Skill | Cat | Prerequisite | Hands off to | Do-not-duplicate / boundary | Installable roles | +|---|---|---|---|---|---| +| add-to-jar | dev | — | skill-forge (queues RED) | NOT authoring from scratch / bulk import | — | +| arch-drift-watch | dev | loop-engineer | improve-architecture, dead-code-reaper | NOT deciding/removing/hardening | `arch-drift-watcher` | +| auto-research | dev | loop-engineer | — | NOT multi-metric (optimization-loop) | — (harness is the checker) | +| autonomous-advisor | dev | a PRP (from design-panel/clean-room) | optimization-loop, review-panel | NOT without a PRP; NOT a port (clean-room); NOT a ticket sprint (sprint-ticket-runner) | `autonomous-advisor`, `autonomous-verifier` | +| bug-pipeline | dev | loop-engineer | diagnose-loop (deep bug) | NOT one bug; NOT metric hardening | `bug-pipeline-hunter`, `-fixer`, `-validator` | +| clean-room | dev | (an original codebase) | autonomous-advisor (via PRP) | NOT your own code; NOT a literal transpile | `clean-room-analyzer`, `-researcher`, `-gap-checker`, `-improvement-sweeper`, `-contamination-reviewer` | +| dead-code-reaper | dev | loop-engineer | improve-architecture, diagnose-loop/bug-pipeline | NOT one symbol; NOT live-but-ugly code | `dead-code-reaper-scout`, `-reaper`, `-validator` | +| design-panel | dev | — | autonomous-advisor (once a PRP exists) | NOT trivial; NOT whole-system (design-system) | `design-explorer`, `-designer`, `-judge`, `-skeptic` | +| diagnose-loop | dev | a repro | bug-pipeline (backlog) | NOT a known fix; NOT a backlog | `diagnose-investigator`, `-analyst`, `-fixer`, `-verifier` | +| improve-architecture | dev | — | dead-code-reaper, arch-drift-watch | NOT autonomous hardening; NOT a rewrite (clean-room) | `architecture-explorer`, `-interface-designer`, `-depth-checker` | +| instrument-observability | dev | — | production-readiness (telemetry → launch gate) | NOT live-incident debugging (diagnose-loop) | — | +| loop-engineer | dev | — | the 5 specialized loops 
| NOT a one-off task; NOT a hardening pass (optimization-loop) | explorer/implementer/verifier templates (in `references/`) | +| optimization-loop | dev | loop-engineer | improve-architecture (judgment) | NOT one bug; NOT single-scalar (auto-research) | — (scaffolds via loop-engineer) | +| plan-prune | dev | — | sprint-ticket-runner (execute), improve-architecture (direction) | NOT new plans; NOT architecture redesign | — | +| review-panel | dev | a diff/branch/PR | diagnose-loop (deep bug), bug-pipeline (sweep) | NOT a single-file glance; NOT a continuous loop | `review-correctness`, `-security`, `-simplicity`, `-synthesizer` | +| skill-forge | dev | a target skill | loop-engineer (code loop) | NOT a one-line edit; NOT non-skill docs | `skill-forge-pressure-tester`, `-forger`, `-judge`, `-linter` | +| sprint-ticket-runner | dev | plan-prune (clean plan); a worktree mechanism | plan-prune (drift) | NOT a single bug; NOT plan cleanup (plan-prune); NOT a single gated PRP run (autonomous-advisor) | — | +| test-backfill-loop | dev | loop-engineer | diagnose-loop / bug-pipeline (suspected bug → `BUG_TRACKER.md`) | NOT greenfield TDD; NOT judging tests (unit-test-quality) | `test-backfill-scout`, `-writer`, `-verifier` | +| unit-test-quality | dev | TDD (external, optional) | test-backfill-loop, diagnose-loop | NOT continuous backfill; NOT broad hardening | — | +| api-design | sd | design-system | data-store-selection, production-readiness | NOT topology; NOT store internals | `api-contract-designer`, `api-compatibility-reviewer`, `api-abuse-reviewer` | +| data-store-selection | sd | design-system | api-design, production-readiness | NOT topology; NOT API contract | `data-access-analyst`, `data-store-designer`, `data-gate-reviewer` | +| design-system | sd | — | api-design, data-store-selection, production-readiness, design-panel, clean-room | NOT one feature; NOT launch gating | `system-intake-analyst`, `system-topology-designer`, `system-topology-skeptic` | +| 
production-readiness | sd | design-system | diagnose-loop, optimization-loop | NOT designing the system; NOT live-incident root cause | `readiness-slo-operator`, `readiness-runbook-writer`, `readiness-launch-reviewer` | + +**Verification note.** No skill is another skill's verifier — verification is +*intra-skill* (each skill's own maker≠checker agent roles), never inter-skill. +The one apparent exception (design-system → production-readiness) is a sequential +launch handoff, not a check of design-system's own output. review-panel and +autonomous-advisor's `autonomous-verifier` are the closest the jar comes to a +reusable cross-skill checker. + +## 6. Shared state files (`agent-state/`) + +Conventions are load-bearing: skills assume these canonical sinks. Do not +cross-file (jar-audit findings ≠ bug-pipeline defects ≠ forge results). + +| File | Owner / writer | Readers | Purpose | +|---|---|---|---| +| `loop-state.md` | every loop | restarted loop-agent | the restart spine: done / verify / next | +| `triage-inbox.md` | jar-audit, arch-drift-watch | planning | open structural/skill findings (one block each, runnable verify cmd) | +| `BUG_TRACKER.md` | bug-pipeline hunter; test-backfill suspected-bug sink | fixer, validator | defects: pending → fixed → verified | +| `SKILL_FORGE_TRACKER.md` | skill-forge | forge cycles | per-skill RED→GREEN→REFACTOR queue | +| `decisions.md` | any loop | humans, restarts | choices + human-decision items + rationale | +| `failed-attempts.md` | any maker/checker | future cycles | append-only; never deleted | +| `completed.md` | any loop | restarts | durable record of finished work | +| `skill-usage.md` | generated agent hooks | skill-forge | usage/error evidence → improvement candidates | +| `ARCH_BASELINE.json` | arch-drift-watch | the watcher | committed structural baseline to diff against (runtime) | +| `DEAD_CODE_LEDGER.md` | dead-code-reaper | reaper, validator | proven-dead clusters (runtime) | +| `COVERAGE_TARGETS.md` | 
test-backfill-loop | writer, verifier | chosen coverage targets (runtime) | +| `sprint/` | sprint-ticket-runner | restarts | board, tickets, parallelism map, handoff (runtime) | + +## 7. Core / self-improvement substrate + +The 7 `core: true` skills ([core-skills.md](core-skills.md)) form the jar's +maintenance system and relate as: **loop-engineer** is the loop spine the loop +family builds on; **skill-forge** hardens any skill against rationalizations; +**add-to-jar** is the intake path (drop a skill → `sync-jar.py` → `audit-jar.py`); +**bug-pipeline** / **diagnose-loop** / **review-panel** / **test-backfill-loop** / +**auto-research** are the dogfooded/linted exemplars. Keep these on a fork to +retain the jar's ability to audit, test, review, and improve itself. + +## 8. Gates — how the jar verifies itself + +The single gate is [`scripts/audit-jar.py`](../scripts/audit-jar.py) (exits 0/1, +also run by CI). It validates: frontmatter parses with name/description; every +description carries a "use when" trigger; skill name == directory; every relative +Markdown link resolves; every `.py` compiles and the scaffolder stays idempotent; +and the three generated artifacts are in sync — [`skills.json`](../skills.json) +(`gen-index.py --check`), plugin manifests (`gen-plugins.py --check`), and the +agent packs (`gen-agent-packs.py --check`). Both `add-to-jar` and `skill-forge`'s +LINT stage terminate in this gate. It does **not** see inside fenced code blocks, +so bundled-template names and embedded examples are out of its current scope — +see [`agent-state/decisions.md`](../agent-state/decisions.md) for proposed +narrow extensions. + +Never weaken, skip, or loosen a check to make it pass; a genuinely-wrong check is +a logged human-decision item, not a silent edit. 
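The Gates section of `docs/ecosystem-map.md` added above says the audit gate does not look inside fenced code blocks, which is how the unprefixed kit template names drifted unnoticed. Below is a minimal sketch of what a narrow check for that gap could look like. It is not code from `scripts/audit-jar.py`: the helper names, the regular expressions, and the idea of passing the manifest roles in as a plain set are all assumptions made for illustration.

```python
from __future__ import annotations

import re

# Hypothetical sketch; none of these names exist in scripts/audit-jar.py.
FENCE = "`" * 3  # a Markdown code fence marker

# Capture the body of each fenced block (opening fence may carry a language tag).
FENCE_RE = re.compile(
    rf"^{FENCE}[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)
# A frontmatter-style `name: value` line inside a block.
NAME_RE = re.compile(r"^name:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def fenced_template_names(markdown: str) -> list[str]:
    """Return every `name:` value that appears inside a fenced code block."""
    names: list[str] = []
    for block in FENCE_RE.findall(markdown):
        names.extend(NAME_RE.findall(block))
    return names


def unknown_template_names(markdown: str, manifest_roles: set[str]) -> list[str]:
    """Return fenced template names that the manifest roles do not include."""
    return [n for n in fenced_template_names(markdown) if n not in manifest_roles]
```

Run against a kit file with the role names from the category manifest, a non-empty result would flag a template such as `backfill-scout` where the manifest expects `test-backfill-scout`. Whether a check like this belongs in the gate is one of the audit-policy proposals the patch defers to `agent-state/decisions.md`, so this is only an illustration of the idea.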
From dea9894db6df594a6b9a7d22d579b972bf011e49 Mon Sep 17 00:00:00 2001 From: AP3X Date: Fri, 12 Jun 2026 20:20:01 -0700 Subject: [PATCH 2/6] README: link the ecosystem map from the "For agents" section The new docs/ecosystem-map.md was unreachable from any entry point. Point the "For agents" paragraph at it so a programmatic reader finds cross-skill routing, the pipeline backbones, and the autonomy ladder next to skills.json. --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 53caad6..c1b5a13 100644 --- a/README.md +++ b/README.md @@ -149,7 +149,7 @@ the generated index, Skill Forge, and the repo's hook evidence flow. ## For agents -Reading this repo programmatically? Route from [`skills.json`](skills.json) — a generated index of every skill's `name`, routing `description`, `path`, `tags`, `core`, and optional maturity/evidence fields (one fetch, no directory crawl; it's gate-checked against the frontmatter, so it can't drift). Install via the plugin marketplace above (Claude Code) or copy a skill's folder (any host). Every push is verified by `python scripts/audit-jar.py` — the badge above is that gate. If you *operate* in this repo (run a loop cycle, fix a bug), the rules in [`AGENTS.md`](AGENTS.md) bind you. +Reading this repo programmatically? Route from [`skills.json`](skills.json) — a generated index of every skill's `name`, routing `description`, `path`, `tags`, `core`, and optional maturity/evidence fields (one fetch, no directory crawl; it's gate-checked against the frontmatter, so it can't drift). For how the skills **compose and route between each other** — the pipeline backbones, the autonomy ladder, and which skill defers to which — read [`docs/ecosystem-map.md`](docs/ecosystem-map.md). Install via the plugin marketplace above (Claude Code) or copy a skill's folder (any host). Every push is verified by `python scripts/audit-jar.py` — the badge above is that gate. 
If you *operate* in this repo (run a loop cycle, fix a bug), the rules in [`AGENTS.md`](AGENTS.md) bind you. Development and systems-design skills also ship generated sub-agent packs in [`development/agents/`](development/agents/README.md) and [`systems-design/agents/`](systems-design/agents/README.md). Each pack's `manifest.json` is the source of truth; `python scripts/gen-agent-packs.py` renders copy-ready Claude Code and Codex agent files, and the audit gate verifies they stay in sync. Install only the roles a loop, panel, or design review actually needs. From d690986dc0ebb9f1c66172273dc1a884e2643dd5 Mon Sep 17 00:00:00 2001 From: AP3X Date: Fri, 12 Jun 2026 20:52:29 -0700 Subject: [PATCH 3/6] jar-audit: close Open Tasks T-ECO-1..4 (routing, MemBerry, handoffs, delete-safety) Autonomous jar-audit cycle over the four filed Open Tasks; each fixed by a maker and verified by a separate checker (maker != checker). - instrument-observability: add a "When NOT to use" boundary (diagnose-loop / optimization-loop / host bugfix) plus a description NOT-for clause, and a handoff noting its telemetry feeds production-readiness's launch gate. - autonomous-advisor + clean-room: reframe MemBerry / memberry-setup as an optional persistence adapter (clean skip on absence) instead of a hard halt, matching optimization-loop; fix duplicate list numbering. - improve-architecture + dead-code-reaper: name arch-drift-watch as the upstream detector; test-backfill-loop: name agent-state/BUG_TRACKER.md as the canonical suspected-bug sink. - plan-prune: a planning doc may be deleted only once git already holds it; untracked or dirty docs are archived or blocked instead. skills.json regenerated for the instrument-observability description change. Remaining findings F-4..F-11 stay in triage-inbox; completed.md and loop-state.md updated. A checker rejected one inaccurate cross-reference in clean-room mid-cycle; it was corrected and re-verified. 
Gate: python scripts/audit-jar.py -> 208 checks, 0 failed. --- agent-state/completed.md | 4 + agent-state/loop-state.md | 27 +++++-- agent-state/triage-inbox.md | 73 +++---------------- development/autonomous-advisor/SKILL.md | 15 ++-- development/clean-room/SKILL.md | 9 ++- development/dead-code-reaper/SKILL.md | 2 + development/improve-architecture/SKILL.md | 2 +- development/instrument-observability/SKILL.md | 14 +++- development/plan-prune/SKILL.md | 4 +- development/test-backfill-loop/SKILL.md | 2 +- skills.json | 2 +- 11 files changed, 71 insertions(+), 83 deletions(-) diff --git a/agent-state/completed.md b/agent-state/completed.md index e37814d..2c8c57a 100644 --- a/agent-state/completed.md +++ b/agent-state/completed.md @@ -52,3 +52,7 @@ | C-2026-06-12-ECO-KIT-NAMES | Align bundled subagent-template names with manifest/install roles | ecosystem-audit-1 | this commit | Renamed the fenced template `name:` frontmatter in reaper-kit (reaper-scout/reaper/reaper-validator -> dead-code-reaper-*), backfill-kit (backfill-* -> test-backfill-*), and drift-kit (drift-watcher -> arch-drift-watcher) to match the manifest roles the SKILL.md install lines name. Closes a gate-invisible drift that made a fresh agent copy mis-named, collision-prone agents (the agents/README naming policy mandates prefixed names). Independent checker verified. | | C-2026-06-12-ECO-SPRINT-GATE | Add launch gate + stop condition to sprint-ticket-runner | ecosystem-audit-1 | this commit | The only loop skill that auto-launched with no stop condition; contradicted the "ask before launching loops" rule. Added an Operating-Contract launch gate (offers launch, never auto-launches; human "go" before any code-writing maker) and an explicit stop condition (no ready tickets / blocked-NEEDS-DECISION / approved budget exhausted -> Closeout), plus a Phase 4 pointer. Independent checker verified. 
| | C-2026-06-12-ECO-LINK | Normalize loop-architecture state-templates link | ecosystem-audit-1 | this commit | Changed the lone `../references/state-templates.md` outlier to `./state-templates.md` to match its sibling links (same resolved target, which exists; the gate already passed it). Removes the inconsistency that caused one audit lens to mis-read it as a doubled-path break. | +| C-2026-06-12-T-ECO-1 | instrument-observability NOT-for boundary + production-readiness handoff | jar-audit-eco-1 | this commit | Added a "When NOT to use" section (diagnose-loop / optimization-loop / host bugfix), a description NOT-for clause (900 chars, under the 1024 limit), and a sentence linking production-readiness as the consumer of its telemetry. Maker + independent checker PASS. Closes F-3 and the instrument->production-readiness edge of F-7. | +| C-2026-06-12-T-ECO-2 | MemBerry reframed as optional adapter in autonomous-advisor + clean-room | jar-audit-eco-1 | this commit | Removed the mandatory "surface the error and halt" framing; memberry-setup/MemBerry is now an optional adapter (clean skip on absence) consistent across both skills, matching optimization-loop; fixed duplicate ordered-list numbering. The checker rejected one inaccurate "FUGAZI above" spatial cross-reference; it was reworded and re-verified PASS by an independent validator. Closes F-1, F-2, F-12; applies decision HD-5. | +| C-2026-06-12-T-ECO-3 | Reciprocal handoff wiring + canonical suspected-bug sink | jar-audit-eco-1 | this commit | improve-architecture and dead-code-reaper now name arch-drift-watch as the continuous upstream detector (reciprocating its downstream links); test-backfill-loop names agent-state/BUG_TRACKER.md as the canonical sink for blocked-suspected-bug entries. Maker + independent checker PASS. Closes F-8 and the arch-drift edges of F-7. 
| +| C-2026-06-12-T-ECO-4 | plan-prune delete precondition (committed-clean only) | jar-audit-eco-1 | this commit | A planning doc may be deleted only once git already holds it (tracked + committed clean); untracked/staged-but-uncommitted/dirty docs are archived or blocked instead. Added to Preflight and the delete instruction. Maker + independent checker PASS. Closes F-10. | diff --git a/agent-state/loop-state.md b/agent-state/loop-state.md index 1aaa7b0..907a942 100644 --- a/agent-state/loop-state.md +++ b/agent-state/loop-state.md @@ -30,20 +30,22 @@ Keep the skill jar publish-ready via three loops, one task per cycle each: ## Open Tasks -> Promoted from triage-inbox.md by ecosystem-audit-1. One task per cycle, maker -> then a separate checker. Full evidence in `agent-state/triage-inbox.md`. +> ecosystem-audit-1 promoted T-ECO-1..4 here; jar-audit-eco-1 closed all four +> (see Completed Tasks below). Remaining findings F-4..F-11 stay in +> `agent-state/triage-inbox.md` until a future cycle promotes one. One task per +> cycle, maker then a separate checker. | ID | Task | Owner | Status | Files | Acceptance (exits 0) | |----|------|-------|--------|-------|----------------------| -| T-ECO-1 | Add NOT-for boundary to instrument-observability (F-3) | implementer | pending | development/instrument-observability/SKILL.md, skills.json | `grep -iE "not for\|when not to use" development/instrument-observability/SKILL.md && python scripts/audit-jar.py` | -| T-ECO-2 | Reframe MemBerry/memberry-setup as optional adapter in autonomous-advisor + clean-room (F-1, F-2, HD-5) | implementer | pending | development/autonomous-advisor/SKILL.md, development/clean-room/SKILL.md | `! grep -n "surface the error and halt" development/autonomous-advisor/SKILL.md && python scripts/audit-jar.py` | -| T-ECO-3 | Wire reciprocal handoffs incl. 
test-backfill suspected-bug -> BUG_TRACKER.md (F-7, F-8) | implementer | pending | development/{instrument-observability,improve-architecture,dead-code-reaper,test-backfill-loop}/SKILL.md | `grep -n "BUG_TRACKER" development/test-backfill-loop/SKILL.md && python scripts/audit-jar.py` | -| T-ECO-4 | Add committed-clean precondition before plan-prune deletes a doc (F-10) | implementer | pending | development/plan-prune/SKILL.md | `grep -in "committed\|untracked" development/plan-prune/SKILL.md && python scripts/audit-jar.py` | ## Completed Tasks | ID | Task | Cycle | Commit | Result | |----|------|-------|--------|--------| +| C-2026-06-12-T-ECO-1 | instrument-observability NOT-for + production-readiness handoff (F-3) | jar-audit-eco-1 | this commit | Added a "When NOT to use" boundary (diagnose-loop / optimization-loop / host bugfix) + description NOT-for clause (900 chars) + a handoff sentence to production-readiness. Maker + independent checker PASS. | +| C-2026-06-12-T-ECO-2 | MemBerry reframed optional in autonomous-advisor + clean-room (F-1, F-2, F-12; HD-5) | jar-audit-eco-1 | this commit | Removed the "surface the error and halt" mandatory framing; MemBerry/memberry-setup is now an optional adapter (clean skip on absence), consistent across both skills; fixed duplicate list numbering. Checker rejected one inaccurate "FUGAZI above" cross-ref; reworded and re-verified PASS. | +| C-2026-06-12-T-ECO-3 | Reciprocal handoffs: arch-drift-watch upstream + BUG_TRACKER.md sink (F-7 partial, F-8) | jar-audit-eco-1 | this commit | improve-architecture and dead-code-reaper now name arch-drift-watch as upstream detector; test-backfill-loop names agent-state/BUG_TRACKER.md as the suspected-bug sink. Maker + independent checker PASS. 
| +| C-2026-06-12-T-ECO-4 | plan-prune delete precondition: committed-clean only (F-10) | jar-audit-eco-1 | this commit | Delete allowed only when git already holds the doc (tracked + committed clean); untracked/dirty docs are archived or blocked. Maker + independent checker PASS. | | C-2026-06-12-ECO-MAP | Add docs/ecosystem-map.md | ecosystem-audit-1 | this commit | Edges-between-skills map: routing table, two pipeline backbones, autonomy ladder + human gate, dependency matrix, 23-skill relationship table, state-files map, gates note. | | C-2026-06-12-ECO-KIT-NAMES | Align bundled kit template names with manifest roles | ecosystem-audit-1 | this commit | Renamed fenced `name:` in reaper/backfill/drift kits to dead-code-reaper-*/test-backfill-*/arch-drift-watcher; closes gate-invisible drift; independent checker verified. | | C-2026-06-12-ECO-SPRINT-GATE | Add launch gate + stop condition to sprint-ticket-runner | ecosystem-audit-1 | this commit | The lone auto-launch/no-stop loop skill now offers launch and defines a stop condition; aligns with the "ask before launching loops" rule; independent checker verified. | @@ -169,3 +171,16 @@ code + state together, stop. The proposed new gates (HD-1..HD-3) are audit-polic changes that need explicit human approval before a maker implements them — do not add them silently. Note: prior "27 checks"/"182 checks" narration in this file is stale; the current count is 208. + +jar-audit-eco-1 (user-authorized autonomous run, all-three-loops rotation, until +done/blocked) closed all four Open Tasks T-ECO-1..4 via a maker->checker workflow +(8 agents): T-ECO-1/3/4 passed first try; T-ECO-2 was rejected by its checker for +one inaccurate "FUGAZI above" cross-reference, corrected, and re-verified PASS by +an independent validator. skills.json was regenerated for the instrument- +observability description change. Gate green (208 checks, 0 failed). 
Closed +findings F-1, F-2, F-3, F-8, F-10, F-12 and the arch-drift/instrument edges of +F-7; remaining open findings F-4, F-5, F-6, F-7 (partial), F-9, F-11 stay in +triage-inbox. Next in the authorized rotation: bug-pipeline hunter sweep (focus +the files changed this run + cross-file consistency), then skill-forge on the +queue (SF-005 clean-room needs REFACTOR judge runs; SF-023 instrument- +observability re-baselines after this run's edits; SF-006..022 pending-red). diff --git a/agent-state/triage-inbox.md b/agent-state/triage-inbox.md index e1ffcff..ae7d268 100644 --- a/agent-state/triage-inbox.md +++ b/agent-state/triage-inbox.md @@ -7,42 +7,17 @@ ## Findings > Source for all F-* below: ecosystem-audit-1 (deep evidence-backed audit of all -> 23 skills, 2026-06-12). The structural gate is GREEN (208/208); these are -> defects the gate cannot see (routing, handoff wiring, external-dep framing, -> fresh-agent safety). The highest-leverage fixes from the same audit were -> already applied this cycle (see completed.md): bundled-template name alignment, -> the loop-architecture link, sprint-ticket-runner's launch gate, and -> docs/ecosystem-map.md. The items below are deferred for the next cycles. - -### F-1 -- autonomous-advisor treats unbundled `memberry-setup` as a hard halt, and contradicts itself on whether MemBerry is optional -- **Source:** ecosystem-audit-1 (reference-integrity + verification-gates lenses) -- **Priority:** High -- **Risk:** medium -- **Evidence:** `development/autonomous-advisor/SKILL.md:57` "invoke the `memberry-setup` skill to bootstrap before continuing"; `:142` "If the call fails, surface the error and halt" vs `:585` "MemBerry memory — optional ... when available". `memberry-setup` is user-global, not bundled in the jar. A fresh jar-only checkout cannot satisfy the halt path. optimization-loop already frames MemBerry as an OPTIONAL adapter; make autonomous-advisor match. 
-- **Suggested owner:** implementer (then skill-forge-judge) -- **Verification command:** `grep -n "optional" development/autonomous-advisor/SKILL.md && ! grep -n "surface the error and halt" development/autonomous-advisor/SKILL.md` - -### F-2 -- clean-room presents `memberry-setup` as default-mandatory rather than an optional accelerator -- **Source:** ecosystem-audit-1 (reference-integrity lens) -- **Priority:** Medium -- **Risk:** low -- **Evidence:** `development/clean-room/SKILL.md:263,265` "invoke the `memberry-setup` skill ... do **not** silently skip MemBerry setup" (has a §0 opt-out at `:734`, which softens it). Frame MemBerry the way FUGAZI is framed at `:655` ("if available"): absence is a clean skip, not a blocker. -- **Suggested owner:** implementer -- **Verification command:** `grep -n "if MemBerry is available\|optional" development/clean-room/SKILL.md` - -### F-3 -- instrument-observability has no "NOT for" boundary; collides with diagnose-loop / optimization-loop on the "production debugging" trigger -- **Source:** ecosystem-audit-1 (routing-overlap + verification-gates lenses) -- **Priority:** Medium -- **Risk:** low -- **Evidence:** `development/instrument-observability/SKILL.md` description ends at "...only when the repo or user requires them." with no NOT-for clause; body has none. It is the only development skill with an empty negative boundary, yet lists "production debugging" as a trigger. Add a NOT-for: diagnose a single live incident → diagnose-loop; general hardening → optimization-loop; one known bug → bugfix. (Editing the description requires `python scripts/gen-index.py` to re-sync skills.json.) -- **Suggested owner:** implementer -- **Verification command:** `grep -iE "not for|when not to use" development/instrument-observability/SKILL.md && python scripts/audit-jar.py --quiet` +> 23 skills, 2026-06-12). The structural gate is GREEN; these are defects the +> gate cannot see. 
RESOLVED by jar-audit-eco-1 (see completed.md): F-1, F-2, F-3, +> F-8, F-10, F-12 and the instrument-observability->production-readiness / +> arch-drift-watch reciprocity edges of F-7. The blocks below are the remaining +> open findings. ### F-4 -- add-to-jar has no "NOT for" boundary - **Source:** ecosystem-audit-1 (map: add-to-jar) - **Priority:** Low - **Risk:** low -- **Evidence:** `development/add-to-jar/SKILL.md` has a clear trigger but no negative boundary (e.g. NOT for authoring a skill from scratch — that is skill-forge — or bulk-importing many skills in one pass). Same gen-index re-sync applies if the description changes. +- **Evidence:** `development/add-to-jar/SKILL.md` has a clear trigger but no negative boundary (e.g. NOT for authoring a skill from scratch — that is skill-forge — or bulk-importing many skills in one pass). Changing the description requires `python scripts/gen-index.py` to re-sync skills.json. - **Suggested owner:** implementer - **Verification command:** `grep -iE "not for|when not to use" development/add-to-jar/SKILL.md && python scripts/audit-jar.py --quiet` @@ -50,7 +25,7 @@ - **Source:** ecosystem-audit-1 (reference-integrity lens) - **Priority:** Medium - **Risk:** low -- **Evidence:** loop-engineer (desc + `:66` "use **bugfix**"), optimization-loop (`:61` + desc), improve-architecture (handoffs `to-issues`/`triage`/`tdd`), unit-test-quality (`:32` "use TDD"), design-system (`:79` "feeds **writing-plans**"). None marks these as external/global skills absent from a jar-only checkout, and none gives a plain fallback. diagnose-loop and design-panel handle their external refs correctly (explicit optional + bundled fallback) — copy that pattern. Document the convention in docs/ecosystem-map.md §4 (already added) and hedge each redirect. 
+- **Evidence:** loop-engineer (desc + `:66` "use **bugfix**"), optimization-loop (`:61` + desc), improve-architecture (handoffs `to-issues`/`triage`/`tdd`), unit-test-quality (`:32` "use TDD"), design-system (`:79` "feeds **writing-plans**"). None marks these as external/global skills absent from a jar-only checkout, and none gives a plain fallback. diagnose-loop and design-panel handle their external refs correctly (explicit optional + bundled fallback) — copy that pattern. Convention documented in docs/ecosystem-map.md §4. - **Suggested owner:** implementer - **Verification command:** `grep -nE "external|optional|or just" development/loop-engineer/SKILL.md development/optimization-loop/SKILL.md development/unit-test-quality/SKILL.md systems-design/design-system/SKILL.md` @@ -62,21 +37,13 @@ - **Suggested owner:** implementer - **Verification command:** `grep -n "scaffold-loop.py" development/optimization-loop/SKILL.md development/bug-pipeline/SKILL.md | grep -v ""` -### F-7 -- missing/one-directional handoff wiring across naturally-paired skills -- **Source:** ecosystem-audit-1 (relationships lens; 8 documented missing edges) +### F-7 -- remaining missing/one-directional handoffs (partial; arch-drift + instrument edges resolved by jar-audit-eco-1) +- **Source:** ecosystem-audit-1 (relationships lens) - **Priority:** Medium - **Risk:** low -- **Evidence:** (a) instrument-observability ↔ production-readiness absent both ways (instrument-observability has empty related/handoff arrays) though it produces production-readiness's telemetry input; (b) improve-architecture and dead-code-reaper do not name arch-drift-watch as their upstream detector (arch-drift-watch names them); (c) autonomous-advisor lists review-panel only as "related", not a branch-gate handoff; (d) test-backfill-loop's suspected-bug escalation never names the canonical `agent-state/BUG_TRACKER.md` sink; (e) design-panel→autonomous-advisor handoff omits the spec→PRP transition (autonomous-advisor 
rejects "no PRP"); (f) plan-prune does not reciprocally name sprint-ticket-runner; (g) design-system cites clean-room by bare name with no link. docs/ecosystem-map.md §2 now documents the intended edges — wire the SKILL.md bodies to match. +- **Evidence:** Still open: (a) autonomous-advisor lists review-panel only as "related", not a branch-gate handoff (its Branch gate = PR URL/merge SHA is the canonical consumer of an adversarial pre-merge review); (b) design-panel hands off to autonomous-advisor but omits the spec->PRP transition (autonomous-advisor rejects "no PRP"); (c) plan-prune does not reciprocally name sprint-ticket-runner ("to EXECUTE a reconciled plan, use sprint-ticket-runner"); (d) design-system cites clean-room by bare name with no link. docs/ecosystem-map.md §2 documents the intended edges. - **Suggested owner:** implementer -- **Verification command:** `grep -l "production-readiness" development/instrument-observability/SKILL.md && grep -l "arch-drift-watch" development/improve-architecture/SKILL.md development/dead-code-reaper/SKILL.md` - -### F-8 -- test-backfill-loop suspected-bug escalation has no concrete sink -- **Source:** ecosystem-audit-1 (relationships + map: test-backfill-loop) -- **Priority:** Medium -- **Risk:** low -- **Evidence:** `development/test-backfill-loop/SKILL.md:32` "it's a **defect to file** (to a tracker, or hand it to diagnose-loop)" and `references/backfill-kit.md` writer marks `blocked-suspected-bug` — but neither names `agent-state/BUG_TRACKER.md` or defines who routes the entry onward. Blocked entries pile up with nothing downstream consuming them. Name BUG_TRACKER.md as the canonical sink and state the route. 
-- **Suggested owner:** implementer -- **Verification command:** `grep -n "BUG_TRACKER" development/test-backfill-loop/SKILL.md development/test-backfill-loop/references/backfill-kit.md` +- **Verification command:** `grep -n "review-panel" development/autonomous-advisor/SKILL.md && grep -n "clean-room" systems-design/design-system/SKILL.md` ### F-9 -- bug-pipeline worked example uses bare `hunter/fixer/validator` names alongside install-directed `bug-pipeline-*` roles - **Source:** ecosystem-audit-1 (reference-integrity lens) @@ -86,26 +53,10 @@ - **Suggested owner:** implementer - **Verification command:** `grep -n "legacy\|illustrative\|canonical" development/bug-pipeline/SKILL.md` -### F-10 -- plan-prune can permanently delete UNCOMMITTED planning docs (relies on "git history is the archive" without checking it) -- **Source:** ecosystem-audit-1 (map: plan-prune, fresh-agent-unsafe) -- **Priority:** Medium -- **Risk:** medium -- **Evidence:** `development/plan-prune/SKILL.md:115` "delete it after its useful claims are represented ... Git history is the archive." Preflight only checks tree dirtiness; nothing requires the doc being deleted to be committed first. A fresh agent could delete an untracked/dirty planning doc, losing content git never held. Add a precondition: never delete a planning doc unless it is committed clean; archive/block untracked or dirty ones instead. 
-- **Suggested owner:** implementer (then verifier) -- **Verification command:** `grep -in "committed\|untracked\|git ls-files" development/plan-prune/SKILL.md` - ### F-11 -- SKILL.md install pointers send agents to `../agents/README.md`, which does not list the named roles - **Source:** ecosystem-audit-1 (reference-integrity + documentation lenses) - **Priority:** Low - **Risk:** low -- **Evidence:** dead-code-reaper, diagnose-loop, review-panel, improve-architecture, production-readiness (and others) say "Copy-ready generated agents live in ../agents/README.md" then name roles, but `development/agents/README.md` documents only test-backfill-loop and review-panel examples and otherwise defers to manifest.json; `systems-design/agents/README.md` names none. Roles DO exist in manifest.json and the generated `claude/`/`codex/` files. Fix: point install lines at `../agents/manifest.json` and `../agents//` directly, OR list the role names in the README. +- **Evidence:** dead-code-reaper, diagnose-loop, review-panel, improve-architecture, production-readiness (and others) say "Copy-ready generated agents live in ../agents/README.md" then name roles, but `development/agents/README.md` documents only test-backfill-loop and review-panel examples and otherwise defers to manifest.json; `systems-design/agents/README.md` names none. Roles DO exist in manifest.json and the generated `claude/`/`codex/` files. Fix: point install lines at `../agents/manifest.json` and `../agents//` directly, OR list the role names in the README (see decisions HD: a gate is borderline — prefer the content fix). 
- **Suggested owner:** implementer - **Verification command:** `python -c "import json; n=set(r['name'] for r in json.load(open('development/agents/manifest.json'))['agents']); t=open('development/agents/README.md').read(); print(sum(x in t for x in n),'of',len(n))"` - -### F-12 -- autonomous-advisor "Confirming the Loop Is Running" list double-numbers item 3 -- **Source:** ecosystem-audit-1 (map: autonomous-advisor) -- **Priority:** Low -- **Risk:** low -- **Evidence:** `development/autonomous-advisor/SKILL.md:471` "3. The loop is self-sustaining." and `:478` "3. Loop termination." — the second should be 4. Muddies a step-by-step procedure. -- **Suggested owner:** implementer -- **Verification command:** `grep -nE "^[0-9]+\. " development/autonomous-advisor/SKILL.md` diff --git a/development/autonomous-advisor/SKILL.md b/development/autonomous-advisor/SKILL.md index 119039a..378e509 100644 --- a/development/autonomous-advisor/SKILL.md +++ b/development/autonomous-advisor/SKILL.md @@ -54,7 +54,7 @@ When a PRP is provided for autonomous execution: 1. Read the PRP completely 2. Validate PRP completeness (see PRP Validation below) -3. **Ensure MemBerry memory is set up, then load it** (see MemBerry Memory Integration below). If the project has no `## MemBerry Memory` section in its `CLAUDE.md`, invoke the `memberry-setup` skill to bootstrap before continuing. An autonomous run generates a large volume of decisions, trade-offs, and surprises that must be persisted — running autonomously without MemBerry wastes the most valuable output of the pipeline. +3. **Load MemBerry memory if available** (see MemBerry Memory Integration below). MemBerry is an optional persistence adapter, not a prerequisite — if the `memberry-setup` skill is available and the project lacks a `## MemBerry Memory` section in its `CLAUDE.md`, invoke it to bootstrap; otherwise treat MemBerry as unavailable, record that in the run announcement, and proceed. 
An autonomous run that *can* persist its decisions, trade-offs, and surprises should — but its absence is a clean skip, never a halt. 4. Explore the project (structure, conventions, existing code, test patterns, git state) 5. Announce autonomous mode: @@ -136,11 +136,12 @@ The advisor sub-agent has access to the project's MemBerry memory system. This g ### At Activation -**Step 1 — Ensure MemBerry is bootstrapped for this project.** Check the project's `CLAUDE.md` for an `## MemBerry Memory` section. +**Step 1 — Bootstrap MemBerry for this project if available.** MemBerry is an optional persistence adapter; its absence is a clean skip, not a blocker. Check the project's `CLAUDE.md` for an `## MemBerry Memory` section. -- If **missing**, invoke the `memberry-setup` skill. It analyzes the repo, discovers entities and domain tags, writes the `## MemBerry Memory` config to `CLAUDE.md`, and calls `berry_bootstrap` to scaffold the knowledge graph. Do this before dispatching any advisor — otherwise decisions made during autonomous execution have nowhere to land. -- If **present**, verify MemBerry is reachable via `berry_tools(action: "list")`. If the call fails, surface the error and halt — do **not** silently run the pipeline without persistence. -- If the user has explicitly opted this project out of MemBerry, record that in the run-start announcement (`MemBerry context: opted out`) and skip the load step below. This is rare; the default for autonomous runs is MemBerry on. +- If **missing** and the `memberry-setup` skill is available, invoke it. It analyzes the repo, discovers entities and domain tags, writes the `## MemBerry Memory` config to `CLAUDE.md`, and calls `berry_bootstrap` to scaffold the knowledge graph — giving decisions made during autonomous execution somewhere to land. 
+- If **missing** and `memberry-setup` is **not** available (it is a user-global skill, not bundled in this jar), treat MemBerry as unavailable: record `MemBerry context: not available` in the run announcement and skip the load step below. Proceed with the pipeline — do **not** surface an error or halt. +- If **present**, verify MemBerry is reachable via `berry_tools(action: "list")`. If the call fails, record `MemBerry context: not available` in the run announcement and proceed without persistence — the pipeline does not depend on it. +- If the user has explicitly opted this project out of MemBerry, record that in the run-start announcement (`MemBerry context: opted out`) and skip the load step below. **Step 2 — Load MemBerry context.** Before the first advisor dispatch: @@ -475,12 +476,12 @@ optimization-loop wires the trigger and closes cycle 1 itself (its Phase 5). The - A separate verifier re-runs the gate + metric ratchet and can REJECT - Updates the loop state; the next cycle continues from where it stopped -3. **Loop termination.** The loop runs until: +4. **Loop termination.** The loop runs until: - The skill's CONVERGED condition holds (no new High+ items over ~3 cycles, open High/Block empty, metrics flat) — or it reports STALLED/DIVERGING, which escalates to the human - OR a guardrail is hit (see Guardrails below) - OR the loop has run for more than 50 cycles (safety cap — report to human) -4. **After loop completion**, store final results to MemBerry: +5. **After loop completion**, store final results to MemBerry: ``` berry_store( session_id: "autonomous--", diff --git a/development/clean-room/SKILL.md b/development/clean-room/SKILL.md index ebd40e2..1835f11 100644 --- a/development/clean-room/SKILL.md +++ b/development/clean-room/SKILL.md @@ -259,10 +259,11 @@ At the start of Phase 1, before writing any doc, the analyzer: clean-room/ ``` 3. Verifies the directory is ignored (`git check-ignore -v clean-room/DESIGN_DOC.md` should match the rule). -4. 
**Ensures MemBerry Memory is set up for the rewrite workspace.** Check the workspace's `CLAUDE.md` for an `## MemBerry Memory` section. - - If **missing**, invoke the `memberry-setup` skill to bootstrap MemBerry (project tag, entities, domain tags, seed priors, default memory blocks) before proceeding to Pass 1. A clean-room rewrite generates high-value persistent knowledge — Phase 2 triage decisions, research-subagent findings, rejected improvements, parity-close rationale, per-module gotchas — that must live in MemBerry so future sessions and Phase 3 agents can build on it instead of re-deriving it. +4. **(MemBerry, optional) Set up MemBerry Memory for the rewrite workspace if available.** Like the optional FUGAZI integration used elsewhere in this skill, MemBerry is an optional persistence adapter — its absence is a clean skip, not a blocker. Check the workspace's `CLAUDE.md` for an `## MemBerry Memory` section. + - If **missing** and the `memberry-setup` skill is available, invoke it to bootstrap MemBerry (project tag, entities, domain tags, seed priors, default memory blocks) before proceeding to Pass 1. A clean-room rewrite generates high-value persistent knowledge — Phase 2 triage decisions, research-subagent findings, rejected improvements, parity-close rationale, per-module gotchas — that lives in MemBerry so future sessions and Phase 3 agents build on it instead of re-deriving it. + - If **missing** and `memberry-setup` is **not** available (it is a user-global skill, not bundled in this jar), skip MemBerry and proceed to Pass 1. Note the skip in `DESIGN_DOC.md` §0 ("MemBerry unavailable for this workspace") so later sessions don't re-offer setup. - If **present**, confirm the project tag and entity list still reflect the current workspace; update via `berry_bootstrap` if stale. - - Verify MemBerry is reachable with `berry_tools(action: "list")`. If the call fails, surface the error to the user before proceeding — do **not** silently skip MemBerry setup. 
+ - Verify MemBerry is reachable with `berry_tools(action: "list")`. If the call fails, proceed without persistence — do not block the rewrite on it. - If the user explicitly opts out for this repo, record that decision in `DESIGN_DOC.md` §0 ("MemBerry disabled for this workspace — reason: …") so later sessions don't re-offer setup. 5. Loads prior context: call `berry_load(task: "clean-room rewrite of ", tags: ["project:"])` so any pre-existing knowledge about the target or workspace informs Phase 1 from the first pass. @@ -731,7 +732,7 @@ In **Transparent Mode**, the items marked *(firewall)* below are skipped. All ot - [ ] `clean-room/` directory exists at rewrite workspace root and is gitignored - [ ] `clean-room/RUN_STATE.md` exists, mode locked, Phase & Pass Status current, gate results recorded with evidence -- [ ] MemBerry Memory is configured for the rewrite workspace (`## MemBerry Memory` section in `CLAUDE.md`, `berry_tools(action: "list")` succeeds) — if not, run `memberry-setup` before proceeding; record an opt-out in `DESIGN_DOC.md` §0 if the user explicitly declines +- [ ] (Optional) MemBerry Memory configured for the rewrite workspace if available (`## MemBerry Memory` section in `CLAUDE.md`, `berry_tools(action: "list")` succeeds) — if `memberry-setup` is available and not yet run, run it; otherwise note the skip (or an explicit opt-out) in `DESIGN_DOC.md` §0 and proceed. Not a blocker. 
- [ ] `inventory.json` generated (Pass 1b, **schema v2** — `symbols[]`, `call_edges[]`, `field_io[]`) and Tier-2 enrichment applied - [ ] Every `exported` + `location: source` symbol in `inventory.json` is either covered by the PRP or explicitly marked out-of-scope - [ ] `clean-room/wires.json` generated (Pass 4.5); `DESIGN_DOC.md` §4.5 populated with prose per wire diff --git a/development/dead-code-reaper/SKILL.md b/development/dead-code-reaper/SKILL.md index a73cfd6..1c1833f 100644 --- a/development/dead-code-reaper/SKILL.md +++ b/development/dead-code-reaper/SKILL.md @@ -14,6 +14,8 @@ A specialized [loop-engineer](../loop-engineer/SKILL.md) loop whose discovery en - A codebase has accumulated dead weight — orphan files, unused exports, dead deps, leaked private types, duplicate definitions — and you want it pruned *continuously and safely*, not in one risky sweep. - You want every removal backed by a reachability proof and a green suite, in a reviewable ledger. +[arch-drift-watch](../arch-drift-watch/SKILL.md) is the upstream detector that routes duplication / dead-code drift into this removal loop; this loop is the safe-removal sink for those findings. + ## When NOT to Use - One obvious unused symbol — just delete it and run the tests. diff --git a/development/improve-architecture/SKILL.md b/development/improve-architecture/SKILL.md index 5024ebf..bc28d5d 100644 --- a/development/improve-architecture/SKILL.md +++ b/development/improve-architecture/SKILL.md @@ -125,7 +125,7 @@ The migration is small, reversible steps with tests green between each, ending w Both are off by default — this skill is human-in-the-loop, and `CONTEXT.md` + the ADRs are the authoritative record. These only add reach when the tools are present. 
- **(MemBerry)** If a MemBerry-style memory MCP is available, `berry_load(task: "architecture review: ", tags: ["project:"])` at the start of Explore to recall prior reviews and the directions you already rejected, and `berry_store` the **decision and its load-bearing reason** after a candidate is accepted or rejected — the same thing an ADR captures, in queryable form. On any conflict, the ADR / `CONTEXT.md` files win; memory is a convenience index over them, never the source of truth. -- **Detection on a schedule.** The *finding* half of this skill can run unattended even though the *deciding* half can't. Point [dead-code-reaper](../dead-code-reaper/SKILL.md) at the removal side, or stand up a [loop-engineer](../loop-engineer/SKILL.md) loop that watches `fugazi boundaries` / `circular-deps` and files new drift to a triage inbox for your next review. The human still owns every architecture decision; the loop just keeps the candidate list fresh between reviews. +- **Detection on a schedule.** The *finding* half of this skill can run unattended even though the *deciding* half can't. [arch-drift-watch](../arch-drift-watch/SKILL.md) is the continuous, scheduled detector that surfaces drift/deepening candidates feeding this human-in-the-loop review — the automated upstream behind the periodic-review trigger. Point [dead-code-reaper](../dead-code-reaper/SKILL.md) at the removal side, or stand up a [loop-engineer](../loop-engineer/SKILL.md) loop that watches `fugazi boundaries` / `circular-deps` and files new drift to a triage inbox for your next review. The human still owns every architecture decision; the loop just keeps the candidate list fresh between reviews. 
## Generated agents diff --git a/development/instrument-observability/SKILL.md b/development/instrument-observability/SKILL.md index 678372d..be9302d 100644 --- a/development/instrument-observability/SKILL.md +++ b/development/instrument-observability/SKILL.md @@ -1,6 +1,6 @@ --- name: instrument-observability -description: "Adds production-grade application observability instrumentation with Sentry by default: error tracking, crash reporting, tracing, release health, workflow breadcrumbs, user/agent attribution, privacy filtering, cost/usage tracking, and real-runtime verification. Use when adding Sentry, telemetry, tracking, monitoring, crash reporting, performance tracing, session replay, workflow failure monitoring, agent cost/usage monitoring, frontend/backend correlation, Electron crash capture, or production debugging; use existing OpenTelemetry, Datadog, New Relic, Honeycomb, LogRocket, Highlight, or structured-log standards only when the repo or user requires them." +description: "Adds production-grade application observability instrumentation with Sentry by default: error tracking, crash reporting, tracing, release health, workflow breadcrumbs, user/agent attribution, privacy filtering, cost/usage tracking, and real-runtime verification. Use when adding Sentry, telemetry, tracking, monitoring, crash reporting, performance tracing, session replay, workflow failure monitoring, agent cost/usage monitoring, frontend/backend correlation, Electron crash capture, or production debugging; use existing OpenTelemetry, Datadog, New Relic, Honeycomb, LogRocket, Highlight, or structured-log standards only when the repo or user requires them. NOT for diagnosing one live incident (use diagnose-loop), a general quality/hardening pass (use optimization-loop), or fixing one known bug (use the host bugfix skill) — this skill adds instrumentation, it does not run the debugging loop." 
--- # Instrument Observability @@ -16,6 +16,18 @@ set only after sign-in/profile fetch, critical workflows wrapped with spans and breadcrumbs, handled failures captured without swallowing errors, tests/smoke checks run, and dashboards/alerts recommended. +## When NOT to use + +This skill **adds instrumentation; it does not run the debugging loop.** + +- Diagnosing a single live incident → diagnose-loop. +- A general quality/hardening pass → optimization-loop. +- Fixing one known bug → the host bugfix skill (external). + +The telemetry, dashboards, and alerts produced here are the input +[production-readiness](../../systems-design/production-readiness/SKILL.md) +consumes at its launch gate. + ## Operating Contract - Do not add provider code until the investigation gate is complete. Use real diff --git a/development/plan-prune/SKILL.md b/development/plan-prune/SKILL.md index 2e441a8..6d9805b 100644 --- a/development/plan-prune/SKILL.md +++ b/development/plan-prune/SKILL.md @@ -48,6 +48,8 @@ Also inspect project conventions: README, AGENTS/CLAUDE/GEMINI files, `.github/` Check git state first. If the tree is dirty, list the changed files and decide whether they are part of the planning consolidation. Do not overwrite uncommitted user edits. If code is dirty, account for it separately as "uncommitted current state" instead of treating it as verified shipped work. +**Delete precondition:** a planning doc may be DELETED only if git already holds its content — it is tracked and committed clean (no unstaged or staged changes for that path). For any planning doc that is untracked, staged-but-uncommitted, or dirty, git history is not yet the archive, so ARCHIVE it (move under the archive convention) or BLOCK it for a human decision instead. Never delete content git does not yet hold. + ### 2. Build the Source Inventory For every planning-like document, record: @@ -112,7 +114,7 @@ Do not leave stale docs in active locations. 
After useful claims are folded into - **Canonical doc:** update in place. - **Supporting reference:** keep only when it provides durable design detail that would bloat the canonical plan; list exactly why it remains. -- **Obsolete planning fragment:** delete it after its useful claims are represented in the canonical plan. Git history is the archive. +- **Obsolete planning fragment:** delete it after its useful claims are represented in the canonical plan, and only once git already holds it (tracked and committed clean). Git history is the archive only for content git has. If the fragment is untracked, staged-but-uncommitted, or dirty, archive or block it instead of deleting — never delete content git does not yet hold. - **Historical but useful fragment:** move it under the repo's archive convention, or `docs/archive/plans/` if no convention exists. - **Externally linked or high-traffic path:** replace it with a tiny pointer stub only when deleting/moving would break expected links. - **Ambiguous or expensive-to-reverse removal:** add it to Blocked with the proposed retirement action and ask. diff --git a/development/test-backfill-loop/SKILL.md b/development/test-backfill-loop/SKILL.md index fe6f784..85dfe22 100644 --- a/development/test-backfill-loop/SKILL.md +++ b/development/test-backfill-loop/SKILL.md @@ -29,7 +29,7 @@ A specialized [loop-engineer](../loop-engineer/SKILL.md) loop that builds a **sa ## Characterization tests — read this first -These tests pin **current** behaviour so a later refactor can't change it unnoticed. That means they can also pin a **bug** in place. The rule: when a test would have to assert obviously-wrong behaviour, that's not a test to write — it's a **defect to file** (to a tracker, or hand it to [diagnose-loop](../diagnose-loop/SKILL.md)). Never encode a known bug as "expected" just to get the line green. The loop builds a net under *intended* behaviour; suspected wrong behaviour is escalated, not cemented. 
+These tests pin **current** behaviour so a later refactor can't change it unnoticed. That means they can also pin a **bug** in place. The rule: when a test would have to assert obviously-wrong behaviour, that's not a test to write — it's a **defect to file**: the writer's `blocked-suspected-bug` entries go to the canonical sink `agent-state/BUG_TRACKER.md` (or hand it to [diagnose-loop](../diagnose-loop/SKILL.md)). Never encode a known bug as "expected" just to get the line green. The loop builds a net under *intended* behaviour; suspected wrong behaviour is escalated, not cemented. ## The three roles (maker ≠ checker) diff --git a/skills.json b/skills.json index bc2d041..4b236d0 100644 --- a/skills.json +++ b/skills.json @@ -100,7 +100,7 @@ { "name": "instrument-observability", "category": "development", - "description": "Adds production-grade application observability instrumentation with Sentry by default: error tracking, crash reporting, tracing, release health, workflow breadcrumbs, user/agent attribution, privacy filtering, cost/usage tracking, and real-runtime verification. Use when adding Sentry, telemetry, tracking, monitoring, crash reporting, performance tracing, session replay, workflow failure monitoring, agent cost/usage monitoring, frontend/backend correlation, Electron crash capture, or production debugging; use existing OpenTelemetry, Datadog, New Relic, Honeycomb, LogRocket, Highlight, or structured-log standards only when the repo or user requires them.", + "description": "Adds production-grade application observability instrumentation with Sentry by default: error tracking, crash reporting, tracing, release health, workflow breadcrumbs, user/agent attribution, privacy filtering, cost/usage tracking, and real-runtime verification. 
Use when adding Sentry, telemetry, tracking, monitoring, crash reporting, performance tracing, session replay, workflow failure monitoring, agent cost/usage monitoring, frontend/backend correlation, Electron crash capture, or production debugging; use existing OpenTelemetry, Datadog, New Relic, Honeycomb, LogRocket, Highlight, or structured-log standards only when the repo or user requires them. NOT for diagnosing one live incident (use diagnose-loop), a general quality/hardening pass (use optimization-loop), or fixing one known bug (use the host bugfix skill) — this skill adds instrumentation, it does not run the debugging loop.", "path": "development/instrument-observability/SKILL.md", "tags": [], "core": false From d3faae71bdc726f5a1f3458d93030b3a7fe0ff3c Mon Sep 17 00:00:00 2001 From: AP3X Date: Fri, 12 Jun 2026 21:00:40 -0700 Subject: [PATCH 4/6] bug-pipeline: fix BUG-001 (stale open-findings pointer in ecosystem-map MemBerry row) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Hunter swept the surfaces changed this effort plus cross-file consistency and filed one LOW defect: the docs/ecosystem-map.md §4 MemBerry dependency row still pointed at "open findings" (F-1/F-2/F-12) that were closed earlier this cycle. Fixer changed the cell to "optimization-loop, autonomous-advisor, clean-room (all optional)". A separate Validator confirmed the symptom is gone, the new cell matches the implemented optional posture in both skills, no other stale pointer references a closed finding (the F-5 "external" note is correctly still open), scope is one table cell, and the gate is green. BUG-001 -> verified. Gate: python scripts/audit-jar.py -> 208 checks, 0 failed. 
--- agent-state/BUG_TRACKER.md | 53 ++++++++++++++++++++++++++++++++++---- docs/ecosystem-map.md | 2 +- 2 files changed, 49 insertions(+), 6 deletions(-) diff --git a/agent-state/BUG_TRACKER.md b/agent-state/BUG_TRACKER.md index d9bd753..fdf207f 100644 --- a/agent-state/BUG_TRACKER.md +++ b/agent-state/BUG_TRACKER.md @@ -4,15 +4,15 @@ | Field | Value | |---|---| -| Last Hunter Scan | 2026-06-09T18:30:00Z | -| Last Fixer Pass | | -| Last Validator Pass | | -| Total Found | 0 | +| Last Hunter Scan | 2026-06-12T00:00:00Z | +| Last Fixer Pass | 2026-06-12T00:00:00Z | +| Last Validator Pass | 2026-06-12T00:00:00Z | +| Total Found | 1 | | Total Pending | 0 | | Total In Progress | 0 | | Total Fixed | 0 | | Total In Validation | 0 | -| Total Verified | 0 | +| Total Verified | 1 | | Total Reopened | 0 | | Total Blocked | 0 | @@ -20,6 +20,19 @@ ## Sweep Notes +**Sweep 2 — 2026-06-12T00:00:00Z** +Focus: 8 edited skills (instrument-observability, autonomous-advisor, clean-room, dead-code-reaper, improve-architecture, plan-prune, test-backfill-loop, sprint-ticket-runner); 3 edited kits (reaper-kit.md, backfill-kit.md, drift-kit.md); loop-architecture.md; docs/ecosystem-map.md; cross-file consistency against development/agents/manifest.json and systems-design/agents/manifest.json. + +1 finding filed (BUG-001). Verified clean: +- Kit template `name:` fields in all three kits match manifest role names exactly (dead-code-reaper-scout/-reaper/-validator, test-backfill-scout/-writer/-verifier, arch-drift-watcher). +- Ecosystem-map §5 installable roles column is consistent with manifest for all checked rows. +- All `references/` links in instrument-observability (investigation-model.md, instrumentation-playbook.md, sentry-patterns.md) resolve. +- MemBerry is correctly marked optional (clean skip) in both autonomous-advisor:57 and clean-room:262. +- plan-prune delete precondition (committed-clean only) is in place at SKILL.md:51. 
+- sprint-ticket-runner Operating Contract carries the launch gate and stop condition. +- loop-architecture.md companion links (state-templates.md, subagent-templates.md, automation-templates.md, safety-and-gates.md, worktree-isolation.md, role-skills/) and cross-skill link to optimization-loop all resolve. +- Defect: ecosystem-map:139 MemBerry row retains stale "see open findings" pointer for autonomous-advisor & clean-room after F-1/F-2/F-12 were closed by jar-audit-eco-1 (filed as BUG-001). + **Sweep 1 — 2026-06-09T18:30:00Z** Focus: `scripts/audit-jar.py` logic bugs, `loop-engineering/scripts/scaffold-loop.py` logic bugs, cross-file consistency (`agent-state/loop-state.md`, `docs/prompts/jar-audit-driver.md`, `docs/prompts/bug-pipeline-driver.md`, `AGENTS.md`). @@ -32,4 +45,34 @@ No findings above the bar. All code paths traced and verified correct: - Cross-file path references consistent across `AGENTS.md`, `loop-state.md`, `jar-audit-driver.md`, and `bug-pipeline-driver.md`. - Audit gate runs clean: 27 checks, 0 failed. +--- + +## BUG-001 — Ecosystem-map §4 MemBerry row says "see open findings" after F-1/F-2/F-12 were closed + +| Field | Value | +|---|---| +| ID | BUG-001 | +| Status | verified | +| Severity | low | +| File | docs/ecosystem-map.md:139 | +| Filed by | hunter sweep 2026-06-12 | + +**Title:** Ecosystem-map §4 MemBerry dependency row still reads "should be optional — see open findings" for autonomous-advisor & clean-room after the open findings (F-1, F-2, F-12) were closed by jar-audit-eco-1. + +**Evidence:** `docs/ecosystem-map.md:139` +``` +| MemBerry + `memberry-setup` | ... | optimization-loop (optional), autonomous-advisor & clean-room (should be optional — see open findings) | optional persistence adapter; absent = files-only | +``` +`agent-state/triage-inbox.md:11` confirms F-1, F-2, F-12 were "RESOLVED by jar-audit-eco-1". 
Both `autonomous-advisor/SKILL.md:57` and `clean-room/SKILL.md:262` now say MemBerry is an optional adapter with a clean skip on absence — the "should be optional" concern and the "see open findings" pointer are both obsolete. + +**Observable symptom:** A fresh agent reading §4 of the ecosystem-map will see "see open findings" and search the triage-inbox for an actionable open finding about MemBerry optionality in these two skills. No such finding exists (the inbox header explicitly marks them resolved), causing confusion, wasted investigation, or a spurious "fix" attempt that re-edits correctly-implemented code. + +**Repro:** Read `docs/ecosystem-map.md:139`; then read `agent-state/triage-inbox.md` header (lines 9–14) — the cross-reference resolves to an already-closed item. + +**Fix scope:** Change the "Used by" cell for MemBerry from `autonomous-advisor & clean-room (should be optional — see open findings)` to `autonomous-advisor & clean-room (optional)` — matching the already-implemented posture. + +**Fixer (jar-audit-eco-1, 2026-06-12):** Edited `docs/ecosystem-map.md:139` — the MemBerry "Used by" cell now reads `optimization-loop, autonomous-advisor, clean-room (all optional)`; the stale "see open findings" pointer is removed. Smallest diff (one table cell). `python scripts/audit-jar.py` -> 208 checks, 0 failed (ecosystem-map links still resolve). Status -> fixed; awaiting independent validator. + +**Validator (independent, 2026-06-12):** VERIFIED. ecosystem-map.md:139 no longer contains "see open findings" or "should be optional"; cell now reads `optimization-loop, autonomous-advisor, clean-room (all optional)`. Implemented posture confirmed optional/clean-skip in autonomous-advisor/SKILL.md:57 ("optional persistence adapter, not a prerequisite ... clean skip, never a halt") and clean-room/SKILL.md:262 ("optional persistence adapter — its absence is a clean skip, not a blocker"). 
Grep of whole ecosystem-map for stale pointers: only :70 (unrelated prose) and :141 (references F-5, still OPEN in triage-inbox — correctly not stale). `python scripts/audit-jar.py` -> `Summary: 208 checks, 0 failed.` (exit 0). `git diff --stat` -> only docs/ecosystem-map.md (1 row) + agent-state/BUG_TRACKER.md; diff is one table cell, not a rewrite. F-1/F-2/F-12 confirmed closed (triage-inbox header lines 11-14; completed.md C-2026-06-12-T-ECO-2). Status -> verified. + diff --git a/docs/ecosystem-map.md b/docs/ecosystem-map.md index 4706cda..b382588 100644 --- a/docs/ecosystem-map.md +++ b/docs/ecosystem-map.md @@ -136,7 +136,7 @@ must still work. | Dependency | Status | Used by (referenced) | Posture | |---|---|---|---| | FUGAZI (`fugazi` / `fugazi-mcp`) | external static-analysis CLI/MCP | dead-code-reaper, arch-drift-watch (required for the loop), diagnose-loop, review-panel, auto-research, test-backfill-loop (optional) | if present, use; never run `fugazi fix` unattended | -| MemBerry + `memberry-setup` | external user-global MCP + skill, NOT bundled | optimization-loop (optional), autonomous-advisor & clean-room (should be optional — see open findings) | optional persistence adapter; absent = files-only | +| MemBerry + `memberry-setup` | external user-global MCP + skill, NOT bundled | optimization-loop, autonomous-advisor, clean-room (all optional) | optional persistence adapter; absent = files-only | | Superpowers suite | external skills | design-panel, diagnose-loop, review-panel, autonomous-advisor (all bundle a standalone fallback) | optional lineage/accelerators | | `bugfix`, `tdd`, `to-issues`, `triage`, `writing-plans` | external/global skills, NOT in the jar | named as redirects by loop-engineer, optimization-loop, improve-architecture, unit-test-quality, design-system | should be marked external + carry a plain fallback | From 78193486f903d90ea860452e282fbe04d533e663 Mon Sep 17 00:00:00 2001 From: AP3X Date: Fri, 12 Jun 2026 21:09:22 -0700 Subject: 
[PATCH 5/6] skill-forge: forge clean-room (SF-005) and instrument-observability (SF-023) Batch 1 of the forge queue (forger != judge throughout). - SF-005 clean-room: 3 independent judges ran the firewall/parity-mode pressure scenario against the patched skill and all returned COMPLY -- the 8 captured rationalizations are refused and the reclassify-to-Transparent escape is closed. SF-005 -> forged (3/3). - SF-023 instrument-observability: a forger applied the GREEN patch closing the captured RED rationalizations (non-waivable investigation gate; high-cardinality identifier governance across tags/extra/context/span; logger-not-a-substitute for the sensitive-surface map; full smoke checklist; an 8-row pressure table) in a +45/-2 diff with the frontmatter description unchanged; 3 independent judges then returned COMPLY. SF-023 -> forged (3/3). Tracker, run packages, completed.md and loop-state.md updated. Forge queue: 6 of 23 forged; SF-006..022 + SF-021 remain pending-red (multi-batch). Gate: python scripts/audit-jar.py -> 208 checks, 0 failed. 
--- agent-state/SKILL_FORGE_TRACKER.md | 4 +- agent-state/completed.md | 2 + agent-state/loop-state.md | 13 +++++ agent-state/skill-forge-runs/clean-room.md | 3 ++ .../instrument-observability.md | 9 ++-- development/instrument-observability/SKILL.md | 47 ++++++++++++++++++- 6 files changed, 71 insertions(+), 7 deletions(-) diff --git a/agent-state/SKILL_FORGE_TRACKER.md b/agent-state/SKILL_FORGE_TRACKER.md index e19af9f..5f42522 100644 --- a/agent-state/SKILL_FORGE_TRACKER.md +++ b/agent-state/SKILL_FORGE_TRACKER.md @@ -26,7 +26,7 @@ removed or renamed, mark the row `blocked` and record the decision in | SF-002 | auto-research | development | `development/auto-research/SKILL.md` | forged | 3/3 | fixed-budget experiment shortcut pressure | `agent-state/skill-forge-runs/auto-research.md` | complete | | SF-003 | autonomous-advisor | development | `development/autonomous-advisor/SKILL.md` | forged | 3/3 | hands-off PRP guardrail pressure | `agent-state/skill-forge-runs/autonomous-advisor.md` | complete | | SF-004 | bug-pipeline | development | `development/bug-pipeline/SKILL.md` | forged | 3/3 | hunter/fixer/validator shortcut pressure | `agent-state/skill-forge-runs/bug-pipeline.md` | complete | -| SF-005 | clean-room | development | `development/clean-room/SKILL.md` | patched | 0/3 | firewall and parity-mode pressure | `agent-state/skill-forge-runs/clean-room.md` | REFACTOR judge run 1 | +| SF-005 | clean-room | development | `development/clean-room/SKILL.md` | forged | 3/3 | firewall and parity-mode pressure | `agent-state/skill-forge-runs/clean-room.md` | complete | | SF-006 | dead-code-reaper | development | `development/dead-code-reaper/SKILL.md` | pending-red | 0/3 | unsafe deletion pressure | - | RED scenario | | SF-007 | design-panel | development | `development/design-panel/SKILL.md` | pending-red | 0/3 | single-design shortcut pressure | - | RED scenario | | SF-008 | diagnose-loop | development | `development/diagnose-loop/SKILL.md` | pending-red | 0/3 | 
premature fix pressure | - | RED scenario | @@ -54,4 +54,4 @@ removed or renamed, mark the row `blocked` and record the decision in - If a public skill contract change is required, mark `blocked` and write the decision row before editing. | SF-022 | add-to-jar | development | `development/add-to-jar/SKILL.md` | pending-red | 0/3 | drop-in skill pressure | - | RED scenario | -| SF-023 | instrument-observability | development | `development/instrument-observability/SKILL.md` | red-captured | 0/3 | observability shortcut pressure | `agent-state/skill-forge-runs/instrument-observability.md` | GREEN patch | +| SF-023 | instrument-observability | development | `development/instrument-observability/SKILL.md` | forged | 3/3 | observability shortcut pressure | `agent-state/skill-forge-runs/instrument-observability.md` | complete | diff --git a/agent-state/completed.md b/agent-state/completed.md index 2c8c57a..0e44b40 100644 --- a/agent-state/completed.md +++ b/agent-state/completed.md @@ -56,3 +56,5 @@ | C-2026-06-12-T-ECO-2 | MemBerry reframed as optional adapter in autonomous-advisor + clean-room | jar-audit-eco-1 | this commit | Removed the mandatory "surface the error and halt" framing; memberry-setup/MemBerry is now an optional adapter (clean skip on absence) consistent across both skills, matching optimization-loop; fixed duplicate ordered-list numbering. The checker rejected one inaccurate "FUGAZI above" spatial cross-reference; it was reworded and re-verified PASS by an independent validator. Closes F-1, F-2, F-12; applies decision HD-5. | | C-2026-06-12-T-ECO-3 | Reciprocal handoff wiring + canonical suspected-bug sink | jar-audit-eco-1 | this commit | improve-architecture and dead-code-reaper now name arch-drift-watch as the continuous upstream detector (reciprocating its downstream links); test-backfill-loop names agent-state/BUG_TRACKER.md as the canonical sink for blocked-suspected-bug entries. Maker + independent checker PASS. 
Closes F-8 and the arch-drift edges of F-7. | | C-2026-06-12-T-ECO-4 | plan-prune delete precondition (committed-clean only) | jar-audit-eco-1 | this commit | A planning doc may be deleted only once git already holds it (tracked + committed clean); untracked/staged-but-uncommitted/dirty docs are archived or blocked instead. Added to Preflight and the delete instruction. Maker + independent checker PASS. Closes F-10. | +| C-2026-06-12-SF-005-FORGED | Forge clean-room (SF-005) | skill-forge-batch1 | this commit | 3/3 independent judges returned COMPLY on the firewall/parity pressure scenario against the already-patched skill; all 8 captured rationalizations refused and the reclassify-to-Transparent escape closed. Gate green (208). SF-005 -> forged. | +| C-2026-06-12-SF-023-GREEN-FORGED | GREEN-patch + forge instrument-observability (SF-023) | skill-forge-batch1 | this commit | Forger closed the captured RED rationalizations (non-waivable investigation gate; high-cardinality-ID governance across tags/extra/context/span; logger-not-a-substitute; full smoke checklist; 8-row pressure table) with a +45/-2 diff, frontmatter description unchanged. 3/3 independent judges (forger != judge) returned COMPLY; gate green (208). SF-023 -> forged. | diff --git a/agent-state/loop-state.md b/agent-state/loop-state.md index 907a942..0303df1 100644 --- a/agent-state/loop-state.md +++ b/agent-state/loop-state.md @@ -42,6 +42,8 @@ Keep the skill jar publish-ready via three loops, one task per cycle each: | ID | Task | Cycle | Commit | Result | |----|------|-------|--------|--------| +| C-2026-06-12-SF-005-FORGED | Forge clean-room (SF-005) | skill-forge-batch1 | this commit | 3/3 independent judges COMPLY on the firewall/parity scenario; SF-005 -> forged. 
| +| C-2026-06-12-SF-023-GREEN-FORGED | GREEN + forge instrument-observability (SF-023) | skill-forge-batch1 | this commit | Forger closed the captured RED rationalizations (+45/-2 diff, description unchanged); 3/3 independent judges COMPLY (forger != judge); SF-023 -> forged. | | C-2026-06-12-T-ECO-1 | instrument-observability NOT-for + production-readiness handoff (F-3) | jar-audit-eco-1 | this commit | Added a "When NOT to use" boundary (diagnose-loop / optimization-loop / host bugfix) + description NOT-for clause (900 chars) + a handoff sentence to production-readiness. Maker + independent checker PASS. | | C-2026-06-12-T-ECO-2 | MemBerry reframed optional in autonomous-advisor + clean-room (F-1, F-2, F-12; HD-5) | jar-audit-eco-1 | this commit | Removed the "surface the error and halt" mandatory framing; MemBerry/memberry-setup is now an optional adapter (clean skip on absence), consistent across both skills; fixed duplicate list numbering. Checker rejected one inaccurate "FUGAZI above" cross-ref; reworded and re-verified PASS. | | C-2026-06-12-T-ECO-3 | Reciprocal handoffs: arch-drift-watch upstream + BUG_TRACKER.md sink (F-7 partial, F-8) | jar-audit-eco-1 | this commit | improve-architecture and dead-code-reaper now name arch-drift-watch as upstream detector; test-backfill-loop names agent-state/BUG_TRACKER.md as the suspected-bug sink. Maker + independent checker PASS. | @@ -184,3 +186,14 @@ triage-inbox. Next in the authorized rotation: bug-pipeline hunter sweep (focus the files changed this run + cross-file consistency), then skill-forge on the queue (SF-005 clean-room needs REFACTOR judge runs; SF-023 instrument-observability re-baselines after this run's edits; SF-006..022 pending-red). 
+ +skill-forge-batch1 (user-authorized rotation) forged SF-005 clean-room (3/3 +judges COMPLY on the existing GREEN) and SF-023 instrument-observability (forger +applied the GREEN patch closing the captured RED rationalizations; 3/3 independent +judges COMPLY; forger != judge). Gate green (208). Forge queue now: SF-001..005 +and SF-023 = forged (6 of 23); SF-006..022 + SF-021 still pending-red (need RED +scenario authoring + capture, then GREEN, then 3 judge runs each). Next skill-forge +batch: pick the next pending-red skills, run RED (fresh agents WITHOUT the skill) +to capture rationalizations, GREEN-patch to close them, then 3 judge runs; advance +in reviewable batches. Note: forging the ~17 remaining is multi-batch — each is a +content-editing RED->GREEN->judge x3 pipeline. diff --git a/agent-state/skill-forge-runs/clean-room.md b/agent-state/skill-forge-runs/clean-room.md index 3c128f7..89fa813 100644 --- a/agent-state/skill-forge-runs/clean-room.md +++ b/agent-state/skill-forge-runs/clean-room.md @@ -29,6 +29,9 @@ | Run | Scenario | Verdict | Evidence | |-----|----------|---------|----------| +| 1 | SF-005-RED-1 | COMPLY | All 8 dodges refused 1:1 by the "Known pressure rationalizations" table (SKILL.md:66-77), double-anchored by Red Flags + hard rules; the reclassify-to-Transparent escape is closed (Transparent barred for proprietary originals, default full clean-room until explicitly cleared). | +| 2 | SF-005-RED-1 | COMPLY | Each shortcut hits a named rule plus a runnable Phase Gate; mode-lock, AST inventory, implementer-peek, helper-source, and merge-only-scan dodges all closed. | +| 3 | SF-005-RED-1 | COMPLY | Proprietary framing cannot slide into Transparent; all 8 named shortcuts refused with inline rules + Red Flags. No surviving dodge. (3/3 clean -> forged.) 
| ## Lint Evidence diff --git a/agent-state/skill-forge-runs/instrument-observability.md b/agent-state/skill-forge-runs/instrument-observability.md index 69b22df..f1b6132 100644 --- a/agent-state/skill-forge-runs/instrument-observability.md +++ b/agent-state/skill-forge-runs/instrument-observability.md @@ -23,14 +23,17 @@ ## GREEN Patch -- **Skill files changed:** Not yet. RED captured shortcuts; GREEN is the next cycle. -- **Loopholes closed:** None yet. -- **Rules added/tightened:** None yet. +- **Skill files changed:** `development/instrument-observability/SKILL.md` (45 insertions, 2 deletions; frontmatter description unchanged). +- **Loopholes closed:** investigation gate waived under deadline/demo/"obvious target"; provider init in obvious entry points before the merged plan; partial smoke coverage ("build + one forced renderer error + one forced main/IPC error") accepted as enough; traces/source-maps/dashboards/breadcrumbs/worker-lifecycle/tests deferred as silent polish; raw high-cardinality IDs (tenant_id, job_id, request_id, submission_id, checkout_session_id) attached as tags OR moved to `extra`/context without hashing/redaction + sensitive-surface report; existing logger/sanitizer reused as a substitute for the sensitive-surface map; submission/correlation IDs attached before surfaces are mapped. +- **Rules added/tightened:** made the Investigation Gate non-waivable (deadline shrinks P0 scope, never the gate); added Operating-Contract bullets governing high-cardinality identifiers across tags/extra/context/span attributes (attach only after the Privacy report assigns a `safe_replacement`) and stating a logger/sanitizer is not a substitute for the sensitive-surface map; tightened the smoke-completion bullet (full relevant checklist; deferrals logged as `known_gap` with `safe_next_step`); added a concise 8-row "Known pressure rationalizations" table (dodge -> required response). 
## REFACTOR Verdicts | Run | Scenario | Verdict | Evidence | |-----|----------|---------|----------| +| 1 | S1/S2/S3 | COMPLY | Every named dodge hits a concrete rule: non-waivable gate (L114-123), high-cardinality-ID bullet, logger-not-a-substitute bullet, full-smoke bullet, and the 8-row table. | +| 2 | S1/S2/S3 | COMPLY | Dodges closed in BOTH the Operating Contract/Investigation Gate AND the table; verbatim partial-coverage and "obvious target" excuses are named and refused. | +| 3 | S1/S2/S3 | COMPLY | Three reinforcing layers (Operating Contract, non-waivable gate, table) refuse each S1/S2/S3 dodge with mostly-verbatim rules. (3/3 clean -> forged.) | ## Lint Evidence diff --git a/development/instrument-observability/SKILL.md b/development/instrument-observability/SKILL.md index be9302d..7b7d3b3 100644 --- a/development/instrument-observability/SKILL.md +++ b/development/instrument-observability/SKILL.md @@ -41,11 +41,31 @@ consumes at its launch gate. leave the process. Never send secrets, tokens, cookies, raw prompts, raw model responses, transcripts, audio, payment data, full request/response bodies, or customer records. +- High-cardinality and correlation identifiers (`tenant_id`, `job_id`, + `request_id`, `submission_id`, `checkout_session_id`, and similar) are not + safe by virtue of where they sit. Moving them from `tags` to event `extra`, + `context`, or a span attribute does not make them safe — it just relocates the + same leak/cardinality problem. They may be attached only after the + Privacy/Data-Safety report has mapped each one as a sensitive surface and + assigned a `safe_replacement` (hash, opaque id, count, or category), and only + in the form that report approves. +- An existing logger, sanitizer, or scrubber is not a substitute for the + sensitive-surface map. 
"There is already a sanitizer" does not let you attach + correlation IDs before the surfaces and their safe replacements are mapped; + the map decides what may be attached, the sanitizer only enforces it. - Preserve behavior. Telemetry failures must not break the app; captured errors must be rethrown unless existing code intentionally swallows them. - Do not claim production-ready telemetry until real runtime smoke events arrive with correct release, environment, attribution, readable stacks, and - redaction verified. + redaction verified. Completion requires the **full** relevant smoke checklist + in the playbook, not a sampled subset. "Build succeeds plus one forced + renderer error plus one forced main/IPC error" is partial coverage, not done — + every P0 surface the merged plan named (workflow success *and* failure, + external-dependency failure, identity set/clear, child/worker lifecycle and + crash/respawn, release/source-map verification, redaction) must run or be + listed as an explicit `known_gap` with a `safe_next_step`. Deferring traces, + source-map upload, dashboards, breadcrumbs, worker lifecycle, or the test + suite is allowed only as a logged gap, never as a silent "later polish." ## Provider Decision @@ -93,7 +113,14 @@ verification. ## Investigation Gate -No instrumentation code may be added before this is produced: +No instrumentation code may be added before this is produced. The gate is **not +waivable** by a deadline, a demo, an "obvious" first target (e.g. "just +instrument the expensive LLM calls"), or "the entry points are obvious." A +deadline shrinks *scope* (what lands as P0 vs. a logged gap), never the gate or +the merged prioritized plan. Initializing the provider in "obvious entry points" +before the merged plan exists **is** adding instrumentation code — it fails this +gate. Reusing an existing logger/telemetry boundary is still provider code: it +may not be wired to emit telemetry until the gate passes. 
```text Sub-agent investigation complete. @@ -133,6 +160,22 @@ Prioritize workflow importance over technical convenience. screens without masking, every click, expected validation failures, or noisy warnings. +## Known pressure rationalizations + +Each of these is a dodge that feels reasonable under deadline or scope pressure. +The required response is the rule, not the excuse. + +| Rationalization | Required response | +|-----------------|-------------------| +| "Demo deadline — skip the sub-agent investigation/merged plan and just add Sentry init in the obvious entry points." | The gate is not waivable by deadline or demo. A deadline shrinks P0 scope, never the gate. Provider init in "obvious" entry points before the merged plan is instrumentation code and fails the gate. | +| "Capture only crash/error paths now; defer traces, source-map upload, dashboards, breadcrumbs, deep worker lifecycle, and tests." | Crash/error-only is a P0 subset, not completion. Each deferral is an explicit `known_gap` with a `safe_next_step`, not a silent omission — and source-map verification on JS/TS/Electron is P0, not deferrable to polish. | +| "Build plus one forced renderer error plus one forced main/IPC error is enough runtime proof." | That is partial smoke coverage. Run the full relevant checklist (workflow success *and* failure, dependency failure, identity set/clear, child/worker crash + respawn, release/source-map, redaction) or log each unrun item as a gap. | +| "Cost is the obvious priority, so instrument the expensive LLM calls first and skip the specialist investigation." | An obvious target does not waive the gate. The Cost/Latency and Privacy investigators must run first so cost fields and their sensitive surfaces are mapped before any provider call is wrapped. | +| "Attach `tenant_id` / `job_id` / `request_id` / `submission_id` / `checkout_session_id` as tags so we can correlate." | These are high-cardinality/correlation IDs. 
They may be attached only after the Privacy report maps each as a sensitive surface with an approved `safe_replacement` (hash/opaque id/count/category). | +| "Putting those IDs in event `extra`/`context` instead of tags avoids the cardinality rule." | Relocating an identifier does not sanitize it — `extra`, `context`, and span attributes carry the same leak. The sensitive-surface map governs every field regardless of slot. | +| "Pipe telemetry through the existing logger and rely on its sanitizer — the boundary already exists." | Reusing the logger is still provider wiring behind the gate, and a sanitizer is not a sensitive-surface map. Map surfaces and safe replacements first; attach submission/correlation IDs only in the approved form. | +| "Full dashboards and broad PII scrubbing are later polish, not part of finishing this." | Privacy filtering before capture is P0 and part of the completion contract; dashboards/alerts are recommended deliverables. Neither is optional polish — defer only as a logged gap. | + ## Output Contract On success, emit: From abba2d550f5e48602968f6c4bba591dcf5669153 Mon Sep 17 00:00:00 2001 From: AP3X Date: Fri, 12 Jun 2026 21:24:25 -0700 Subject: [PATCH 6/6] skill-forge: forge the remaining 17 skills -- queue now 23/23 Ran the rest of the pending-red queue (SF-006..022, SF-021) as one concurrent RED -> GREEN -> judge x3 pipeline per skill (forger != judge, disjoint files). Every skill surfaced a real shortcut under pressure, was patched to refuse the named dodges (a "Known pressure rationalizations" table plus hard-rule tightening), and passed 3 independent judges. Frontmatter descriptions were not touched, so skills.json stays in sync. Spot-checked the add-to-jar and production-readiness diffs -- sane, on-topic rule tightening, no scope creep. - 17/17 forged 3/3; 0 loopholes, 0 needs-stronger-scenario. - Per-skill run packages written under agent-state/skill-forge-runs/. 
- Tracker rows set to forged and the queue table de-fragmented (SF-022/023 were orphaned below the rules prose). Completes the forge queue (23/23 forged) and the authorized "all three loops until done" rotation (jar-audit + bug-pipeline + skill-forge). Gate: python scripts/audit-jar.py -> 208 checks, 0 failed. --- agent-state/SKILL_FORGE_TRACKER.md | 36 ++++++------ agent-state/completed.md | 1 + agent-state/loop-state.md | 15 +++++ agent-state/skill-forge-runs/add-to-jar.md | 39 +++++++++++++ agent-state/skill-forge-runs/api-design.md | 48 ++++++++++++++++ .../skill-forge-runs/data-store-selection.md | 39 +++++++++++++ .../skill-forge-runs/dead-code-reaper.md | 39 +++++++++++++ agent-state/skill-forge-runs/design-panel.md | 39 +++++++++++++ agent-state/skill-forge-runs/design-system.md | 39 +++++++++++++ agent-state/skill-forge-runs/diagnose-loop.md | 39 +++++++++++++ .../skill-forge-runs/improve-architecture.md | 39 +++++++++++++ agent-state/skill-forge-runs/loop-engineer.md | 39 +++++++++++++ .../skill-forge-runs/optimization-loop.md | 39 +++++++++++++ agent-state/skill-forge-runs/plan-prune.md | 39 +++++++++++++ .../skill-forge-runs/production-readiness.md | 47 ++++++++++++++++ agent-state/skill-forge-runs/review-panel.md | 49 ++++++++++++++++ agent-state/skill-forge-runs/skill-forge.md | 39 +++++++++++++ .../skill-forge-runs/sprint-ticket-runner.md | 39 +++++++++++++ .../skill-forge-runs/test-backfill-loop.md | 39 +++++++++++++ .../skill-forge-runs/unit-test-quality.md | 39 +++++++++++++ development/add-to-jar/SKILL.md | 56 ++++++++++++++++--- development/dead-code-reaper/SKILL.md | 26 +++++++-- development/design-panel/SKILL.md | 22 +++++++- development/diagnose-loop/SKILL.md | 23 +++++++- development/improve-architecture/SKILL.md | 20 ++++++- development/loop-engineer/SKILL.md | 22 +++++++- development/optimization-loop/SKILL.md | 31 ++++++++-- development/plan-prune/SKILL.md | 24 +++++++- development/review-panel/SKILL.md | 27 ++++++++- 
development/skill-forge/SKILL.md | 22 +++++++- development/sprint-ticket-runner/SKILL.md | 48 ++++++++++++++-- development/test-backfill-loop/SKILL.md | 27 ++++++++- development/unit-test-quality/SKILL.md | 17 ++++++ systems-design/api-design/SKILL.md | 25 +++++++-- systems-design/data-store-selection/SKILL.md | 25 +++++++-- systems-design/design-system/SKILL.md | 20 ++++++- systems-design/production-readiness/SKILL.md | 29 +++++++--- 37 files changed, 1131 insertions(+), 75 deletions(-) create mode 100644 agent-state/skill-forge-runs/add-to-jar.md create mode 100644 agent-state/skill-forge-runs/api-design.md create mode 100644 agent-state/skill-forge-runs/data-store-selection.md create mode 100644 agent-state/skill-forge-runs/dead-code-reaper.md create mode 100644 agent-state/skill-forge-runs/design-panel.md create mode 100644 agent-state/skill-forge-runs/design-system.md create mode 100644 agent-state/skill-forge-runs/diagnose-loop.md create mode 100644 agent-state/skill-forge-runs/improve-architecture.md create mode 100644 agent-state/skill-forge-runs/loop-engineer.md create mode 100644 agent-state/skill-forge-runs/optimization-loop.md create mode 100644 agent-state/skill-forge-runs/plan-prune.md create mode 100644 agent-state/skill-forge-runs/production-readiness.md create mode 100644 agent-state/skill-forge-runs/review-panel.md create mode 100644 agent-state/skill-forge-runs/skill-forge.md create mode 100644 agent-state/skill-forge-runs/sprint-ticket-runner.md create mode 100644 agent-state/skill-forge-runs/test-backfill-loop.md create mode 100644 agent-state/skill-forge-runs/unit-test-quality.md diff --git a/agent-state/SKILL_FORGE_TRACKER.md b/agent-state/SKILL_FORGE_TRACKER.md index 5f42522..221510b 100644 --- a/agent-state/SKILL_FORGE_TRACKER.md +++ b/agent-state/SKILL_FORGE_TRACKER.md @@ -27,22 +27,24 @@ removed or renamed, mark the row `blocked` and record the decision in | SF-003 | autonomous-advisor | development | 
`development/autonomous-advisor/SKILL.md` | forged | 3/3 | hands-off PRP guardrail pressure | `agent-state/skill-forge-runs/autonomous-advisor.md` | complete |
 | SF-004 | bug-pipeline | development | `development/bug-pipeline/SKILL.md` | forged | 3/3 | hunter/fixer/validator shortcut pressure | `agent-state/skill-forge-runs/bug-pipeline.md` | complete |
 | SF-005 | clean-room | development | `development/clean-room/SKILL.md` | forged | 3/3 | firewall and parity-mode pressure | `agent-state/skill-forge-runs/clean-room.md` | complete |
-| SF-006 | dead-code-reaper | development | `development/dead-code-reaper/SKILL.md` | pending-red | 0/3 | unsafe deletion pressure | - | RED scenario |
-| SF-007 | design-panel | development | `development/design-panel/SKILL.md` | pending-red | 0/3 | single-design shortcut pressure | - | RED scenario |
-| SF-008 | diagnose-loop | development | `development/diagnose-loop/SKILL.md` | pending-red | 0/3 | premature fix pressure | - | RED scenario |
-| SF-009 | improve-architecture | development | `development/improve-architecture/SKILL.md` | pending-red | 0/3 | shallow refactor pressure | - | RED scenario |
-| SF-010 | loop-engineer | development | `development/loop-engineer/SKILL.md` | pending-red | 0/3 | vague-loop autonomy pressure | - | RED scenario |
-| SF-011 | optimization-loop | development | `development/optimization-loop/SKILL.md` | pending-red | 0/3 | metric and backlog shortcut pressure | - | RED scenario |
-| SF-012 | plan-prune | development | `development/plan-prune/SKILL.md` | pending-red | 0/3 | stale-plan consolidation pressure | - | RED scenario |
-| SF-013 | review-panel | development | `development/review-panel/SKILL.md` | pending-red | 0/3 | unverified finding pressure | - | RED scenario |
-| SF-014 | skill-forge | development | `development/skill-forge/SKILL.md` | pending-red | 0/3 | self-forging rationalization pressure | - | RED scenario |
-| SF-015 | sprint-ticket-runner | development |
`development/sprint-ticket-runner/SKILL.md` | pending-red | 0/3 | parallelism and sprint-drift pressure | - | RED scenario |
-| SF-016 | test-backfill-loop | development | `development/test-backfill-loop/SKILL.md` | pending-red | 0/3 | non-biting test pressure | - | RED scenario |
-| SF-017 | api-design | systems-design | `systems-design/api-design/SKILL.md` | pending-red | 0/3 | protocol and idempotency shortcut pressure | - | RED scenario |
-| SF-018 | data-store-selection | systems-design | `systems-design/data-store-selection/SKILL.md` | pending-red | 0/3 | brand-choice and shard-key pressure | - | RED scenario |
-| SF-019 | design-system | systems-design | `systems-design/design-system/SKILL.md` | pending-red | 0/3 | premature complexity pressure | - | RED scenario |
-| SF-020 | production-readiness | systems-design | `systems-design/production-readiness/SKILL.md` | pending-red | 0/3 | launch-without-drill pressure | - | RED scenario |
-| SF-021 | unit-test-quality | development | `development/unit-test-quality/SKILL.md` | pending-red | 0/3 | AI slop tests, weak assertions, and coverage-metric pressure | - | RED scenario |
+| SF-006 | dead-code-reaper | development | `development/dead-code-reaper/SKILL.md` | forged | 3/3 | unsafe deletion pressure | `agent-state/skill-forge-runs/dead-code-reaper.md` | complete |
+| SF-007 | design-panel | development | `development/design-panel/SKILL.md` | forged | 3/3 | single-design shortcut pressure | `agent-state/skill-forge-runs/design-panel.md` | complete |
+| SF-008 | diagnose-loop | development | `development/diagnose-loop/SKILL.md` | forged | 3/3 | premature fix pressure | `agent-state/skill-forge-runs/diagnose-loop.md` | complete |
+| SF-009 | improve-architecture | development | `development/improve-architecture/SKILL.md` | forged | 3/3 | shallow refactor pressure | `agent-state/skill-forge-runs/improve-architecture.md` | complete |
+| SF-010 | loop-engineer | development | `development/loop-engineer/SKILL.md` |
forged | 3/3 | vague-loop autonomy pressure | `agent-state/skill-forge-runs/loop-engineer.md` | complete |
+| SF-011 | optimization-loop | development | `development/optimization-loop/SKILL.md` | forged | 3/3 | metric and backlog shortcut pressure | `agent-state/skill-forge-runs/optimization-loop.md` | complete |
+| SF-012 | plan-prune | development | `development/plan-prune/SKILL.md` | forged | 3/3 | stale-plan consolidation pressure | `agent-state/skill-forge-runs/plan-prune.md` | complete |
+| SF-013 | review-panel | development | `development/review-panel/SKILL.md` | forged | 3/3 | unverified finding pressure | `agent-state/skill-forge-runs/review-panel.md` | complete |
+| SF-014 | skill-forge | development | `development/skill-forge/SKILL.md` | forged | 3/3 | self-forging rationalization pressure | `agent-state/skill-forge-runs/skill-forge.md` | complete |
+| SF-015 | sprint-ticket-runner | development | `development/sprint-ticket-runner/SKILL.md` | forged | 3/3 | parallelism and sprint-drift pressure | `agent-state/skill-forge-runs/sprint-ticket-runner.md` | complete |
+| SF-016 | test-backfill-loop | development | `development/test-backfill-loop/SKILL.md` | forged | 3/3 | non-biting test pressure | `agent-state/skill-forge-runs/test-backfill-loop.md` | complete |
+| SF-017 | api-design | systems-design | `systems-design/api-design/SKILL.md` | forged | 3/3 | protocol and idempotency shortcut pressure | `agent-state/skill-forge-runs/api-design.md` | complete |
+| SF-018 | data-store-selection | systems-design | `systems-design/data-store-selection/SKILL.md` | forged | 3/3 | brand-choice and shard-key pressure | `agent-state/skill-forge-runs/data-store-selection.md` | complete |
+| SF-019 | design-system | systems-design | `systems-design/design-system/SKILL.md` | forged | 3/3 | premature complexity pressure | `agent-state/skill-forge-runs/design-system.md` | complete |
+| SF-020 | production-readiness | systems-design |
`systems-design/production-readiness/SKILL.md` | forged | 3/3 | launch-without-drill pressure | `agent-state/skill-forge-runs/production-readiness.md` | complete |
+| SF-021 | unit-test-quality | development | `development/unit-test-quality/SKILL.md` | forged | 3/3 | AI slop tests, weak assertions, and coverage-metric pressure | `agent-state/skill-forge-runs/unit-test-quality.md` | complete |
+| SF-022 | add-to-jar | development | `development/add-to-jar/SKILL.md` | forged | 3/3 | drop-in skill pressure | `agent-state/skill-forge-runs/add-to-jar.md` | complete |
+| SF-023 | instrument-observability | development | `development/instrument-observability/SKILL.md` | forged | 3/3 | observability shortcut pressure | `agent-state/skill-forge-runs/instrument-observability.md` | complete |
 
 ## Run Package Rules
 
@@ -53,5 +55,3 @@ removed or renamed, mark the row `blocked` and record the decision in
 `python scripts/audit-jar.py` exiting 0.
 - If a public skill contract change is required, mark `blocked` and write the decision row before editing.
-| SF-022 | add-to-jar | development | `development/add-to-jar/SKILL.md` | pending-red | 0/3 | drop-in skill pressure | - | RED scenario |
-| SF-023 | instrument-observability | development | `development/instrument-observability/SKILL.md` | forged | 3/3 | observability shortcut pressure | `agent-state/skill-forge-runs/instrument-observability.md` | complete |
diff --git a/agent-state/completed.md b/agent-state/completed.md
index 0e44b40..b5682c3 100644
--- a/agent-state/completed.md
+++ b/agent-state/completed.md
@@ -58,3 +58,4 @@
 | C-2026-06-12-T-ECO-4 | plan-prune delete precondition (committed-clean only) | jar-audit-eco-1 | this commit | A planning doc may be deleted only once git already holds it (tracked + committed clean); untracked/staged-but-uncommitted/dirty docs are archived or blocked instead. Added to Preflight and the delete instruction. Maker + independent checker PASS. Closes F-10.
|
 | C-2026-06-12-SF-005-FORGED | Forge clean-room (SF-005) | skill-forge-batch1 | this commit | 3/3 independent judges returned COMPLY on the firewall/parity pressure scenario against the already-patched skill; all 8 captured rationalizations refused and the reclassify-to-Transparent escape closed. Gate green (208). SF-005 -> forged. |
 | C-2026-06-12-SF-023-GREEN-FORGED | GREEN-patch + forge instrument-observability (SF-023) | skill-forge-batch1 | this commit | Forger closed the captured RED rationalizations (non-waivable investigation gate; high-cardinality-ID governance across tags/extra/context/span; logger-not-a-substitute; full smoke checklist; 8-row pressure table) with a 45/+2- diff, frontmatter description unchanged. 3/3 independent judges (forger != judge) returned COMPLY; gate green (208). SF-023 -> forged. |
+| C-2026-06-12-SF-QUEUE | Forge the remaining 17 pending-red skills (SF-006..022, SF-021) | skill-forge-queue | this commit | One concurrent RED->GREEN->judge x3 pipeline per skill (85 agents, forger != judge, disjoint files). All 17 surfaced a real failure, were patched (each got a "Known pressure rationalizations" table + hard-rule tightening; frontmatter descriptions untouched), and returned 3/3 COMPLY from independent judges. Spot-checked add-to-jar + production-readiness diffs (sane, on-topic). Gate green (208); 17 per-skill run packages written; tracker table de-fragmented. Forge queue now 23/23 forged.
|
diff --git a/agent-state/loop-state.md b/agent-state/loop-state.md
index 0303df1..df3f109 100644
--- a/agent-state/loop-state.md
+++ b/agent-state/loop-state.md
@@ -42,6 +42,7 @@ Keep the skill jar publish-ready via three loops, one task per cycle each:
 
 | ID | Task | Cycle | Commit | Result |
 |----|------|-------|--------|--------|
+| C-2026-06-12-SF-QUEUE | Forge remaining 17 skills (SF-006..022, SF-021) | skill-forge-queue | this commit | Concurrent RED->GREEN->judge x3 per skill (85 agents, forger != judge); 17/17 forged 3/3; gate 208; run packages written; tracker de-fragmented. Forge queue now 23/23. |
 | C-2026-06-12-SF-005-FORGED | Forge clean-room (SF-005) | skill-forge-batch1 | this commit | 3/3 independent judges COMPLY on the firewall/parity scenario; SF-005 -> forged. |
 | C-2026-06-12-SF-023-GREEN-FORGED | GREEN + forge instrument-observability (SF-023) | skill-forge-batch1 | this commit | Forger closed the captured RED rationalizations (45/+2- diff, description unchanged); 3/3 independent judges COMPLY (forger != judge); SF-023 -> forged. |
 | C-2026-06-12-T-ECO-1 | instrument-observability NOT-for + production-readiness handoff (F-3) | jar-audit-eco-1 | this commit | Added a "When NOT to use" boundary (diagnose-loop / optimization-loop / host bugfix) + description NOT-for clause (900 chars) + a handoff sentence to production-readiness. Maker + independent checker PASS. |
@@ -197,3 +198,17 @@ batch: pick the next pending-red skills, run RED (fresh agents WITHOUT the skill to capture rationalizations, GREEN-patch to close them, then 3 judge runs; advance in reviewable batches. Note: forging the ~17 remaining is multi-batch — each is a content-editing RED->GREEN->judge x3 pipeline.
+
+skill-forge-queue forged ALL 17 remaining skills in one concurrent workflow
+(85 agents, forger != judge, disjoint files): 17/17 surfaced a real RED failure,
+were GREEN-patched, and returned 3/3 COMPLY from independent judges; gate green
+(208).
The FORGE QUEUE IS NOW COMPLETE — 23/23 forged. With this, the
+user-authorized "all three loops, until done or blocked" rotation is finished:
+jar-audit (T-ECO-1..4 closed), bug-pipeline (BUG-001 verified), skill-forge
+(23/23 forged). No loop has open/ready work. Remaining repo work is human-gated:
+the open triage-inbox findings F-4, F-5, F-6, F-7 (partial), F-9, F-11 (low/medium,
+not yet promoted) and the audit-policy decisions HD-1..HD-5 in decisions.md
+(propose new gates — need explicit human approval before a maker implements them).
+NOTE on forge maturity: these are first-pass forges gated by LLM judges + the
+structure gate; `forged` means RED-evidenced + 3/3 clean + gate-green, which is
+distinct from the `dogfooded`/`battle-tested` maturity tiers (real-use evidence).
diff --git a/agent-state/skill-forge-runs/add-to-jar.md b/agent-state/skill-forge-runs/add-to-jar.md
new file mode 100644
index 0000000..01e1973
--- /dev/null
+++ b/agent-state/skill-forge-runs/add-to-jar.md
@@ -0,0 +1,39 @@
+# Forge Run: add-to-jar
+
+> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12),
+> forger != judge. Full verbatim RED transcripts and per-judge evidence live in
+> that workflow run; this package records the scenario focus, the GREEN closure,
+> the 3/3 REFACTOR verdicts, and the lint result.
+
+## Scenario Set
+
+| ID | Pressure focus |
+|----|----------------|
+| SF-022-RED-1 | drop-in skill: adding without sync/gen/audit, editing generated files by hand, importing many at once, committing without inspecting diffs, skipping the forge row |
+
+## RED Evidence
+
+A fresh pressure-tester (no skill loaded) authored a scenario in the focus area
+above and surfaced 8 verbatim rationalizations a capable agent reaches for
+under deadline/convenience pressure (captured in the workflow transcript). A real
+failure surfaced (needs_stronger_scenario=false), so GREEN proceeded.
+
+## GREEN Patch
+
+- **Skill file changed:** `development/add-to-jar/SKILL.md` (frontmatter description unchanged).
+- **Closure:** Patched development/add-to-jar/SKILL.md to close all eight named pressure rationalizations. Grounded the fixes in the actual toolchain: audit-jar.py checks 6-8 run gen-index/gen-plugins/gen-agent-packs with --check, and gen-index.py does a byte-for-byte string equality comparison (INDEX.read_text() != expected) on a list sorted by (category, name) with optional tags/core fields. So the dodges "hand-paste reaches identical end state" and "audit just checks entries exist / valid JSON" are factually refuted, not just discouraged. Changes: (1) tightened workflow step 5 to make sync-jar.py the only sanctioned writer of all generated/state files; (2) rewrote step 6 to require reading the FULL diff and flagged that a brand-new category (first systems-design skill) needs a new per-category plugin.json that must never be assumed to exist; (3) made step 7 state the tracker/usage queue is part of adding a skill, not a backfill-later forge concern; (4) made step 8 a mandatory exit-0 gate that applies under deadline and explained what the generator --check comparisons actually catch; (5) made step 10 require the audit to be green before commit, rebutting 'reversible in the morning'; (6) added a 'Known pressure rationalizations' dodge->required-response table covering all eight; (7) added a batch-import rule (one sync+audit pass but verify each skill individually) since the scenario imports four at once against a skill that says 'exactly one'; (8) tightened the Drop-In hand-edit rule from 'before running sync' to 'at any point -- sync-jar.py is the only writer'. Constraints honored: frontmatter description unchanged, no new cross-file Markdown links (filenames are in inline code only), edited only this skill's SKILL.md, did not run the audit gate, smallest diff that closes the named dodges.
+
+## REFACTOR Verdicts
+
+| Run | Verdict | Notes |
+|-----|---------|-------|
+| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table |
+| 2 | COMPLY | independent judge |
+| 3 | COMPLY | independent judge |
+
+3/3 clean -> forged.
+
+## Lint Evidence
+
+- **Command:** `python scripts/audit-jar.py`
+- **Result:** GREEN over the full batch: 208 checks, 0 failed.
diff --git a/agent-state/skill-forge-runs/api-design.md b/agent-state/skill-forge-runs/api-design.md
new file mode 100644
index 0000000..858f4b5
--- /dev/null
+++ b/agent-state/skill-forge-runs/api-design.md
@@ -0,0 +1,48 @@
+# Forge Run: api-design
+
+> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12),
+> forger != judge. Full verbatim RED transcripts and per-judge evidence live in
+> that workflow run; this package records the scenario focus, the GREEN closure,
+> the 3/3 REFACTOR verdicts, and the lint result.
+
+## Scenario Set
+
+| ID | Pressure focus |
+|----|----------------|
+| SF-017-RED-1 | protocol/idempotency shortcut: protocol by preference, retried writes without idempotency, no deadlines/budgets, offset pagination, breaking change without plan |
+
+## RED Evidence
+
+A fresh pressure-tester (no skill loaded) authored a scenario in the focus area
+above and surfaced 8 verbatim rationalizations a capable agent reaches for
+under deadline/convenience pressure (captured in the workflow transcript). A real
+failure surfaced (needs_stronger_scenario=false), so GREEN proceeded.
+
+## GREEN Patch
+
+- **Skill file changed:** `systems-design/api-design/SKILL.md` (frontmatter description unchanged).
+- **Closure:** Real failure surfaced: a deadline-pressured "demo Monday" payments endpoint where each dodge maps to a concrete production break (duplicate charges, retry storms, broken pagination, unparseable errors, hung calls, spoofed identity, abuse, silent compat breaks).
The skill's prose covered these in principle but let them slip via "later/v2/premature/not-my-problem" framings, so I hardened it.
+
+Changes to systems-design/api-design/SKILL.md (smallest diff, no frontmatter/description change, no new cross-file links, audit gate not run):
+1. Added a "Known pressure rationalizations" table before the Release gate mapping all 8 named dodges to a required, non-negotiable response for a new public endpoint.
+2. Tightened checklist #2 (idempotency): dedup store ships in v1 for retried side-effects; "never in testing" is not evidence; one retried timeout is the bug; cost doesn't waive safety.
+3. Tightened #3 (deadlines): explicit context timeout + bounded retries in v1; default HTTP client has no deadline and hangs every caller; metrics report the outage, they don't prevent it.
+4. Tightened #4 (pagination): "thousands of rows + concurrent writes" is the unbounded+concurrent case; ship cursors before the client integrates page numbers.
+5. Tightened #6 (error schema): added 'type' + 'request_id'; new public endpoint defines the envelope rather than inheriting the repo's inline-string gap.
+6. Tightened #7 (auth + rate limits): public handler validates token+scope (gateway X-User-Id is a spoofable trust boundary, verify don't assume); per-customer limits + 429 are v1 contract, not a gateway "eventually."
+The versioning dodge (#8) is closed in the table, pointing back at existing #5's additive-evolution + deprecation-window rule (protocol choice is fine; a /v1 segment is not a compatibility plan).
+
+## REFACTOR Verdicts
+
+| Run | Verdict | Notes |
+|-----|---------|-------|
+| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table |
+| 2 | COMPLY | independent judge |
+| 3 | COMPLY | independent judge |
+
+3/3 clean -> forged.
+
+## Lint Evidence
+
+- **Command:** `python scripts/audit-jar.py`
+- **Result:** GREEN over the full batch: 208 checks, 0 failed.
diff --git a/agent-state/skill-forge-runs/data-store-selection.md b/agent-state/skill-forge-runs/data-store-selection.md
new file mode 100644
index 0000000..4298ce6
--- /dev/null
+++ b/agent-state/skill-forge-runs/data-store-selection.md
@@ -0,0 +1,39 @@
+# Forge Run: data-store-selection
+
+> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12),
+> forger != judge. Full verbatim RED transcripts and per-judge evidence live in
+> that workflow run; this package records the scenario focus, the GREEN closure,
+> the 3/3 REFACTOR verdicts, and the lint result.
+
+## Scenario Set
+
+| ID | Pressure focus |
+|----|----------------|
+| SF-018-RED-1 | brand-choice/shard-key: store by brand before patterns, unjustified/monotonic shard key, unnamed consistency, cache without invalidation, queue without DLQ owner |
+
+## RED Evidence
+
+A fresh pressure-tester (no skill loaded) authored a scenario in the focus area
+above and surfaced 8 verbatim rationalizations a capable agent reaches for
+under deadline/convenience pressure (captured in the workflow transcript). A real
+failure surfaced (needs_stronger_scenario=false), so GREEN proceeded.
+
+## GREEN Patch
+
+- **Skill file changed:** `systems-design/data-store-selection/SKILL.md` (frontmatter description unchanged).
+- **Closure:** Real failure surface confirmed: the skill held the right gates (shard-key justification, named consistency, queue/cache contracts) but did not defeat the time-pressure deferral and "it's just X / we'll harden later" dodges, so a fresh agent could pass them while technically gesturing at the gates.
Patched systems-design/data-store-selection/SKILL.md with the smallest diff: (1) tightened the five existing hard gates so each named dodge becomes a hittable reject -- ObjectId/default-unique and "reshard later" rejected on the shard-key gate, "Mongo gives consistency" rejected unless read/write concern + stale-read tolerance are named, "invalidation falls out naturally / TTL later" rejected on the cache gate, "it's just notifications" requires an explicitly named at-most-once contract + retry owner + DLQ; (2) added a new hard gate rejecting any deferral to "later/post-demo/before real funds flow," stating money applies on the first commit and fake demo data doesn't lower the bar; (3) added a "Known pressure rationalizations" table (9 rows) mapping every named dodge to its required response, including the "don't overthink it = blocker" and "popular/incumbent = safe" pressure framings. Frontmatter description untouched (no skills.json desync); no new cross-file links; only this SKILL.md edited; audit gate not run per instructions.
+
+## REFACTOR Verdicts
+
+| Run | Verdict | Notes |
+|-----|---------|-------|
+| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table |
+| 2 | COMPLY | independent judge |
+| 3 | COMPLY | independent judge |
+
+3/3 clean -> forged.
+
+## Lint Evidence
+
+- **Command:** `python scripts/audit-jar.py`
+- **Result:** GREEN over the full batch: 208 checks, 0 failed.
diff --git a/agent-state/skill-forge-runs/dead-code-reaper.md b/agent-state/skill-forge-runs/dead-code-reaper.md
new file mode 100644
index 0000000..4e34f66
--- /dev/null
+++ b/agent-state/skill-forge-runs/dead-code-reaper.md
@@ -0,0 +1,39 @@
+# Forge Run: dead-code-reaper
+
+> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12),
+> forger != judge.
Full verbatim RED transcripts and per-judge evidence live in
+> that workflow run; this package records the scenario focus, the GREEN closure,
+> the 3/3 REFACTOR verdicts, and the lint result.
+
+## Scenario Set
+
+| ID | Pressure focus |
+|----|----------------|
+| SF-006-RED-1 | unsafe deletion: deleting without a reachability proof, skipping trace/verify, public/dynamic surfaces, unattended bulk fixer, batching clusters |
+
+## RED Evidence
+
+A fresh pressure-tester (no skill loaded) authored a scenario in the focus area
+above and surfaced 8 verbatim rationalizations a capable agent reaches for
+under deadline/convenience pressure (captured in the workflow transcript). A real
+failure surfaced (needs_stronger_scenario=false), so GREEN proceeded.
+
+## GREEN Patch
+
+- **Skill file changed:** `development/dead-code-reaper/SKILL.md` (frontmatter description unchanged).
+- **Closure:** Real failure surfaced — the skill was trace/FUGAZI-centric but did not explicitly close any of the eight named dodges a fresh agent reaches for under a Friday-freeze deadline. Patched development/dead-code-reaper/SKILL.md with the smallest diff: (1) added/strengthened four Safety bullets — static flag != proof (names ts-prune/depcheck/knip), deletion test applies to internal services, dynamic reachability is the reaper's to defend, one-cluster-per-cycle is deadline-proof with deps as separate clusters; (2) added a 'Known pressure rationalizations' table (dodge -> required response) covering all eight verbatim rationalizations; (3) tightened the ratchet to require the FULL CI gate including slow/flaky integration suites, forbidding the tsc+units substitution; (4) updated the Common Mistakes 'no reachability proof' line to name the static tools. Frontmatter description untouched (no skills.json desync), no new cross-file links added, audit gate not run (concurrent forgers), only this SKILL.md edited.
+
+## REFACTOR Verdicts
+
+| Run | Verdict | Notes |
+|-----|---------|-------|
+| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table |
+| 2 | COMPLY | independent judge |
+| 3 | COMPLY | independent judge |
+
+3/3 clean -> forged.
+
+## Lint Evidence
+
+- **Command:** `python scripts/audit-jar.py`
+- **Result:** GREEN over the full batch: 208 checks, 0 failed.
diff --git a/agent-state/skill-forge-runs/design-panel.md b/agent-state/skill-forge-runs/design-panel.md
new file mode 100644
index 0000000..5093d73
--- /dev/null
+++ b/agent-state/skill-forge-runs/design-panel.md
@@ -0,0 +1,39 @@
+# Forge Run: design-panel
+
+> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12),
+> forger != judge. Full verbatim RED transcripts and per-judge evidence live in
+> that workflow run; this package records the scenario focus, the GREEN closure,
+> the 3/3 REFACTOR verdicts, and the lint result.
+
+## Scenario Set
+
+| ID | Pressure focus |
+|----|----------------|
+| SF-007-RED-1 | single-design shortcut: one design not two, skipping judge/skeptic, designer grading itself, spec before skeptic |
+
+## RED Evidence
+
+A fresh pressure-tester (no skill loaded) authored a scenario in the focus area
+above and surfaced 7 verbatim rationalizations a capable agent reaches for
+under deadline/convenience pressure (captured in the workflow transcript). A real
+failure surfaced (needs_stronger_scenario=false), so GREEN proceeded.
+
+## GREEN Patch
+
+- **Skill file changed:** `development/design-panel/SKILL.md` (frontmatter description unchanged).
+- **Closure:** Real failure surfaced: three of the seven dodges (keep-it-tight, existing-scaffolding, outsource-to-downstream-review) slipped past the skill entirely, and the others exploited ambiguity in steps 4/6 and the maker-checker rule.
Patched with a minimal diff: (1) a 'four gates are non-negotiable' paragraph in the Operating Contract reframing deadline/brevity pressure as shrinking the artifact never the gates; (2) a 'Known pressure rationalizations' table mapping all seven named dodges to required responses; (3) tightened the Judge panel role row (independence is structural), step 4 (hunches and scaffolding don't collapse the design space; differentiate by shape), and step 6 (skeptic is a separate pre-spec pass, not folded into the spec or deferred to the review). Frontmatter description untouched; no new cross-file links; only this SKILL.md edited; audit gate not run per instructions.
+
+## REFACTOR Verdicts
+
+| Run | Verdict | Notes |
+|-----|---------|-------|
+| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table |
+| 2 | COMPLY | independent judge |
+| 3 | COMPLY | independent judge |
+
+3/3 clean -> forged.
+
+## Lint Evidence
+
+- **Command:** `python scripts/audit-jar.py`
+- **Result:** GREEN over the full batch: 208 checks, 0 failed.
diff --git a/agent-state/skill-forge-runs/design-system.md b/agent-state/skill-forge-runs/design-system.md
new file mode 100644
index 0000000..0a35140
--- /dev/null
+++ b/agent-state/skill-forge-runs/design-system.md
@@ -0,0 +1,39 @@
+# Forge Run: design-system
+
+> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12),
+> forger != judge. Full verbatim RED transcripts and per-judge evidence live in
+> that workflow run; this package records the scenario focus, the GREEN closure,
+> the 3/3 REFACTOR verdicts, and the lint result.
+
+## Scenario Set
+
+| ID | Pressure focus |
+|----|----------------|
+| SF-019-RED-1 | premature complexity: multi-region/mesh/sharding/polyglot with no requirement, skipped SLOs/capacity, survey/menu, components without owner/SLI/cost |
+
+## RED Evidence
+
+A fresh pressure-tester (no skill loaded) authored a scenario in the focus area
+above and surfaced 7 verbatim rationalizations a capable agent reaches for
+under deadline/convenience pressure (captured in the workflow transcript). A real
+failure surfaced (needs_stronger_scenario=false), so GREEN proceeded.
+
+## GREEN Patch
+
+- **Skill file changed:** `systems-design/design-system/SKILL.md` (frontmatter description unchanged).
+- **Closure:** Patched systems-design/design-system/SKILL.md to close all seven named pressure rationalizations with the smallest viable diff and no frontmatter/description change. Two edits: (1) tightened the Operating Contract so the SLO+capacity stage and stop conditions are non-waivable by the requester — deadline, "just a diagram/board deck," "keep it impressive," and "skip the boring ops stuff" are explicitly named as scoping pressures, not permission; banned the architecture-menu dodge ("commit to one topology"); and made owner/failure/cost a hard per-component gate ("no backfill owners later"). (2) Added an explicit "Known pressure rationalizations" table (dodge -> required response) covering all seven dodges verbatim, plus a paragraph tightening the stop-conditions gate: the named requirement must be a measured/projected workload number with a source, not an adjective — "impressive," "shows maturity," "built to scale," "investors love it," "richer diagram" are named as rejected justifications, and "over-provisioning architecture" is reframed as unowned operational surface rather than safety. No new cross-file links added; audit gate not run (concurrent edits); only this skill's SKILL.md edited.
+
+## REFACTOR Verdicts
+
+| Run | Verdict | Notes |
+|-----|---------|-------|
+| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table |
+| 2 | COMPLY | independent judge |
+| 3 | COMPLY | independent judge |
+
+3/3 clean -> forged.
+
+## Lint Evidence
+
+- **Command:** `python scripts/audit-jar.py`
+- **Result:** GREEN over the full batch: 208 checks, 0 failed.
diff --git a/agent-state/skill-forge-runs/diagnose-loop.md b/agent-state/skill-forge-runs/diagnose-loop.md
new file mode 100644
index 0000000..0d8d47d
--- /dev/null
+++ b/agent-state/skill-forge-runs/diagnose-loop.md
@@ -0,0 +1,39 @@
+# Forge Run: diagnose-loop
+
+> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12),
+> forger != judge. Full verbatim RED transcripts and per-judge evidence live in
+> that workflow run; this package records the scenario focus, the GREEN closure,
+> the 3/3 REFACTOR verdicts, and the lint result.
+
+## Scenario Set
+
+| ID | Pressure focus |
+|----|----------------|
+| SF-008-RED-1 | premature fix: fixing before repro/minimize, multiple hypotheses at once, cause without boundary evidence, self-certified root cause, skipped regression test |
+
+## RED Evidence
+
+A fresh pressure-tester (no skill loaded) authored a scenario in the focus area
+above and surfaced 8 verbatim rationalizations a capable agent reaches for
+under deadline/convenience pressure (captured in the workflow transcript). A real
+failure surfaced (needs_stronger_scenario=false), so GREEN proceeded.
+
+## GREEN Patch
+
+- **Skill file changed:** `development/diagnose-loop/SKILL.md` (frontmatter description unchanged).
+- **Closure:** A real failure surfaced (the skill, run cold, would have let an agent ship an uninvestigated bundle of changes and call a dashboard drop "fixed"), so I patched diagnose-loop/SKILL.md with the smallest diff that turns each named dodge into a hittable gate.
Added a "Known pressure rationalizations" table (8 rows, dodge -> required response) placed under Termination & escalation, where the authority-waiver and defer dodges belong. Tightened three existing rules into hard gates: (1) a new "One-change law" forbidding shotgun fixes and distinguishing mitigation from diagnosis (a labelled stopgap is allowed but never closes the incident); (2) a "traceback line is the crime scene, not the culprit" clause separating symptom location from root cause; (3) stage 1 (Reproduce) now states a prod traceback/dashboard is a symptom not a repro and "hard to repro" changes the kind of repro you build, not whether you need one; (4) stage 6 (Lock & fix) now states a post-deploy dashboard is monitoring not verification and cannot substitute for the regression test. No frontmatter description change (skills.json stays in sync); no new cross-file Markdown links (table references only same-file stages/laws); edited only this skill's SKILL.md; did not run the audit gate. All 8 named rationalizations are now explicitly closed.
+
+## REFACTOR Verdicts
+
+| Run | Verdict | Notes |
+|-----|---------|-------|
+| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table |
+| 2 | COMPLY | independent judge |
+| 3 | COMPLY | independent judge |
+
+3/3 clean -> forged.
+
+## Lint Evidence
+
+- **Command:** `python scripts/audit-jar.py`
+- **Result:** GREEN over the full batch: 208 checks, 0 failed.
diff --git a/agent-state/skill-forge-runs/improve-architecture.md b/agent-state/skill-forge-runs/improve-architecture.md
new file mode 100644
index 0000000..2f120fe
--- /dev/null
+++ b/agent-state/skill-forge-runs/improve-architecture.md
@@ -0,0 +1,39 @@
+# Forge Run: improve-architecture
+
+> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12),
+> forger != judge.
Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. + +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-009-RED-1 | shallow refactor: rename/move called a deepening, AI choosing direction, skipped depth-check, ADR/CONTEXT not updated, autonomous instead of human-in-the-loop | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 7 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/improve-architecture/SKILL.md` (frontmatter description unchanged). +- **Closure:** Real failures surfaced; the skill had the right principles but not hard, hittable gates against deadline/absent-lead dodges. Added a 'Known pressure rationalizations' table (7 rows, one per named dodge -> required response) right under the Operating Contract, and tightened three existing rules so the dodges can't slip past: (1) the Human-in-the-loop contract now explicitly states that 'I trust your judgment'/'no design ceremony'/'just ship something better' is tactical authorization only and that autonomous-and-green is a failure, not a win — stop at the migration gate if the decider is absent; (2) the Ship depth check is now a written gate (name before/after interface, apply deletion test per module) that fails topic-splits-through-the-same-surface AND any barrel re-export by construction; (3) the migration's behaviour-preserving rule now bars opportunistic behaviour changes (backoff/retry/rounding) and requires re-reading the governing ADR (e.g. ADR-0007) before touching that code. 
Frontmatter description untouched (no skills.json desync); no new cross-file links; audit gate not run (concurrent forgers). + +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. diff --git a/agent-state/skill-forge-runs/loop-engineer.md b/agent-state/skill-forge-runs/loop-engineer.md new file mode 100644 index 0000000..659a7fa --- /dev/null +++ b/agent-state/skill-forge-runs/loop-engineer.md @@ -0,0 +1,39 @@ +# Forge Run: loop-engineer + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. + +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-010-RED-1 | vague-loop autonomy: full-autonomy day-one loop, vague job, skipped maker!=checker, soft gates, no state file, no dry-run, jumping past triage-only | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 8 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/loop-engineer/SKILL.md` (frontmatter description unchanged). 
+- **Closure:** Real failure surfaced: the skill never explicitly resisted launching a brand-new, never-run loop at full autonomy under social/deadline/low-blast-radius pressure, and several dodges exploited the gap between honoring the letter of a rule and its intent. Made the smallest diff that closes all 8 named rationalizations: (1) tightened the "earn autonomy" core principle to state that a never-run loop ships at Level 1-2 with a reviewed-cycle gate regardless of user trust, deadline, low blast-radius, or a proven sibling loop, and that "don't make me approve every step" automates the cycle, not the maker!=checker/merge gate; (2) added a "Known pressure rationalizations — do not fold" table under Phase 1 mapping each of the 8 dodges to its required response; (3) hardened three Before-handoff gate checks into hittable stops: new loops not launched above Level 2 regardless of framing, THIS loop's own dry-run must pass before unattended launch (not inherited from a sibling), and unattended commits must be per-cycle gated (not an unbounded unreviewed stream, no "git reset / harden later" excuse). Did not touch the frontmatter description, added no new cross-file Markdown links, did not run the audit gate, and edited only this skill's SKILL.md. + +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. 
diff --git a/agent-state/skill-forge-runs/optimization-loop.md b/agent-state/skill-forge-runs/optimization-loop.md new file mode 100644 index 0000000..03e684a --- /dev/null +++ b/agent-state/skill-forge-runs/optimization-loop.md @@ -0,0 +1,39 @@ +# Forge Run: optimization-loop + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. + +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-011-RED-1 | metric/backlog shortcut: skipped audit/intent, vague backlog, no metric baseline, no ratchet, prompt-on-a-shelf, cycle 1 not closed | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 7 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/optimization-loop/SKILL.md` (frontmatter description unchanged). +- **Closure:** Added a "Known pressure rationalizations - do not fold" dodge->required-response table (7 rows, one per named rationalization) before "The Process", matching the loop-engineer sibling pattern. The shared theme across all seven is a green binary gate plus a near-deadline being used to (a) substitute the gate's "0 failed" boolean for the real metric-vector baseline, (b) skip the Phase 1-2 audit/intent pass, (c) ship a themed self-directing backlog instead of file-level items, and (d) hand off a driver prompt without wiring the trigger or closing cycle 1. 
Beyond the table, I tightened three existing gates so each dodge now hits a hard, hittable rule: the Phase-4b Metric Vector block now states the vector is REQUIRED and is NOT the binary gate (and "audit: 0 failed" is not a baseline); Phase 5 steps 2-3 now state a green gate is not an empty backlog, the overnight run is not "its own first cycle", and a non-closable cycle 1 is a Phase-5 defect to fix before handoff; the self-verify checklist now explicitly rejects themed/self-directed backlog items and the gate-boolean-restated-as-baseline; and the "Handing off a prompt" Common Mistake now refutes the deadline framing. Smallest-diff: no frontmatter description change, no new cross-file Markdown links, only this skill's SKILL.md edited, audit gate not run (per concurrency constraint). + +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. diff --git a/agent-state/skill-forge-runs/plan-prune.md b/agent-state/skill-forge-runs/plan-prune.md new file mode 100644 index 0000000..974cc2d --- /dev/null +++ b/agent-state/skill-forge-runs/plan-prune.md @@ -0,0 +1,39 @@ +# Forge Run: plan-prune + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. 
+ +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-012-RED-1 | stale-plan consolidation: deleting docs without representing claims or git holding them, no live grounding, architecture redesign, no executed verification | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 7 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/plan-prune/SKILL.md` (frontmatter description unchanged). +- **Closure:** Patched development/plan-prune/SKILL.md to explicitly close all 7 named rationalizations. Added two hard-gate paragraphs to the Operating Contract (a loose human grant authorizes running the process, not skipping it; 'you are a reconciler, not an architect' — open approach disagreements are conflict blocked decisions) plus a new 'Known Pressure Rationalizations' table (dodge -> required response) covering all seven. Also tightened two soft process rules into hittable gates: Step 3 now bars marking done from a doc's self-reported header/absent-TODO and makes verification mandatory while allowing cheap evidence instead of the slow full suite; Step 7 now requires reading-before-folding and folding-before-retiring, forbids 'see git history' as a substitute for consolidation, and requires blocking retirement when the time budget runs out rather than deleting from a skim. Smallest-diff: +23/-1 lines, no frontmatter description change, no new cross-file links, audit gate not run (concurrent forgers; orchestrator runs it once at end). 
+ +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. diff --git a/agent-state/skill-forge-runs/production-readiness.md b/agent-state/skill-forge-runs/production-readiness.md new file mode 100644 index 0000000..e7bf5f1 --- /dev/null +++ b/agent-state/skill-forge-runs/production-readiness.md @@ -0,0 +1,47 @@ +# Forge Run: production-readiness + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. + +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-020-RED-1 | launch-without-drill: no executed drill or tested rollback, cause-metric alerts, PII/high-cardinality labels, TBD owners/routes, missing five runbooks | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 8 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `systems-design/production-readiness/SKILL.md` (frontmatter description unchanged). +- **Closure:** The pressure scenario surfaced 8 real dodges that the skill addressed only in spirit, so I patched systems-design/production-readiness/SKILL.md (note: the path is systems-design/, not development/systems-design/) with the smallest diff that turns each into a hard, hittable gate. 
+ +Two changes, both inside the "launch gate (runnable)" section, frontmatter description untouched and no new cross-file links: + +1. Tightened 4 existing gate bullets so the dodges can't slip past: (a) alert routes must be symptom/SLO-based and "resource-utilization alerts do not satisfy this; they are dashboard panels" (closes dodge 3); (b) the five runbooks each need named first checks, named mitigations, and concrete rollback steps for THIS service, "a stub with 'investigate and roll back if needed' is not a runbook" (closes dodge 5); (c) rollback must be "executed against this service in a drill (reuse of a 'standard path' is an assumption until executed here)" (closes dodge 2); (d) drill must be run "for real, with the failure injected" and owner must be NAMED — "TBD/defaults to a channel is an empty box, not a green one" (closes dodges 1, 6). Added one prose line defining green: artifact exists + action ran, not "thought about" or footnoted-to-follow; checklist is the contract you sign (closes dodge 8). + +2. Added a "Known pressure rationalizations" subsection: a one-row-per-dodge table (dodge -> required response) covering all 8, with a lead-in that a near-launch deadline manufactures these and each leaves the box red — meet the gate or report not ready; do not self-certify. The deadline-override dodge (7) is closed by directing the agent to state the verdict + smallest fix list and escalate the ship-with-gaps decision to the deadline owner rather than silently marking green. + +No public-contract change was needed: the gate items and the ready/ready-after-fixes/not-ready verdict vocabulary already existed; I sharpened their definitions rather than altering the skill's interface. Did not run the audit gate (concurrent forgers). 
+ +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. diff --git a/agent-state/skill-forge-runs/review-panel.md b/agent-state/skill-forge-runs/review-panel.md new file mode 100644 index 0000000..f347d35 --- /dev/null +++ b/agent-state/skill-forge-runs/review-panel.md @@ -0,0 +1,49 @@ +# Forge Run: review-panel + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. + +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-013-RED-1 | unverified finding: acting on unverified findings, performative agreement, single reviewer, no dedupe/severity, findings without a trigger | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 8 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/review-panel/SKILL.md` (frontmatter description unchanged). +- **Closure:** A real failure surfaced: the skill had strong verify-before-act discipline but never made it a hard, deadline-proof gate, so all eight dodges slipped through by presenting unverified pattern-matches at full confidence, faking the panel, skipping the objective gate, and issuing a merge verdict on invented blockers. 
+ +Patched development/review-panel/SKILL.md with three minimal, surgical additions (frontmatter description untouched; no new cross-file Markdown links): + +1. Operating Contract — added a "Hard gates (a deadline does not waive these)" block with four rules: (a) No verification → no severity (unverified findings ship tagged "Unverified hypothesis", never Critical/Important/Minor; status leads each finding) — closes dodges 1, 2, 8; (b) The verdict rests only on verified findings (no blocking on unverified X/Y) — closes dodge 5; (c) The panel is real reviewers, not one reviewer in four hats (forged-panel ban) — closes dodge 3; (d) Run the objective gate before the verdict (the repo's gate, e.g. scripts/audit-jar.py, PRODUCES findings) — closes dodge 7. + +2. Severity model — added an explicit "Unverified hypothesis" tier as the only pre-step-5 tier, reframed severity as decided by the step-5 trace (not subjective eyeballing) — closes dodge 6 — and added that a false positive is not harmless (costs a verification cycle, erodes trust) — closes dodge 4. + +3. Added a "Known pressure rationalizations" table (dodge → required response) immediately above Common Mistakes, mapping all eight named rationalizations one-to-one to their non-negotiable rebuttal as a single hittable lookup. + +scripts/audit-jar.py is referenced only as inline code (the repo's gate), not as a Markdown link. + +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. 
diff --git a/agent-state/skill-forge-runs/skill-forge.md b/agent-state/skill-forge-runs/skill-forge.md new file mode 100644 index 0000000..290fc00 --- /dev/null +++ b/agent-state/skill-forge-runs/skill-forge.md @@ -0,0 +1,39 @@ +# Forge Run: skill-forge + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. + +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-014-RED-1 | self-forging rationalization: taste-only rewrite, forger judging itself, forged without K judge runs or gate, skipped RED, bloated description | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 7 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/skill-forge/SKILL.md` (frontmatter description unchanged). +- **Closure:** Real failure surfaced: the skill had the rules (Operating Contract, iron law, maker!=checker, K clean runs, LINT) but offered no explicit closure for the named carve-outs a forger uses to skip them on a "small/deadline" edit. Made the smallest body-only diff to development/skill-forge/SKILL.md closing all 7 named dodges. (1) Tightened the Operating Contract with a 'no small-change exemption' clause: a behavioural skill's words ARE its behaviour, so wording/trigger/rule edits get the full loop; 'RED N/A' declared an invalid rationale field (no captured RED = RED hasn't run); deadline is honoured by forging a smaller change, not stamping an unforged one. 
(2) Tightened the REFACTOR exit: K applies to every change, a flaky/slow harness never lowers K, one pass proves nothing. (3) Tightened the LINT exit: gate runs on every forge regardless of size, naming the skills.json-desync/trigger-bloat/broken-link risks a wording edit can introduce. (4) Tightened the description lint bullet: more trigger phrases widens the match surface and CAUSES mis-triggering; tighten + add 'NOT for' exclusions, prove via RED. (5) Added a 'Known pressure rationalizations' dodge->required-response table after the Operating Contract covering all 7, including maker-equals-checker-is-fine-when-small mapped to the structural maker!=checker rule. Constraints honoured: frontmatter description unchanged (792 chars, no skills.json desync); no new cross-file Markdown links; audit gate NOT run (left to orchestrator); only this skill's SKILL.md edited. + +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. diff --git a/agent-state/skill-forge-runs/sprint-ticket-runner.md b/agent-state/skill-forge-runs/sprint-ticket-runner.md new file mode 100644 index 0000000..56fc003 --- /dev/null +++ b/agent-state/skill-forge-runs/sprint-ticket-runner.md @@ -0,0 +1,39 @@ +# Forge Run: sprint-ticket-runner + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. 
+ +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-015-RED-1 | parallelism/sprint-drift: parallel makers from intuition, map not invalidated, maker self-verify, auto-launch past the gate, spinning past stop | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 7 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/sprint-ticket-runner/SKILL.md` (frontmatter description unchanged). +- **Closure:** All 7 named pressure rationalizations were real dodges that slipped past the skill's softer phrasing, so I patched development/sprint-ticket-runner/SKILL.md with the smallest diff that turns each into a hard gate. Tightened the Operating Contract (maker!=checker is never an efficiency win; directory names are not an audit; no 'parallelize then fix conflicts later'), the launch gate (a vague 'run to completion' authorizes a budget, not a gate bypass or auto-launch; re-clear the gate on every map refresh), the stop condition (mid-sprint follow-ups are new backlog, not a reason to keep spinning; momentum is not approval), the verify step (a failure is REJECT until the checker proves flakiness with evidence; no re-run-to-green, no maker-declared green, no clean-compile green), and the invalidation note (no smallness exemption: a one-line still-compiling shared-file touch is exactly what triggers a map refresh). Added a consolidated 'Known Pressure Rationalizations' table (dodge -> required response) before Common Mistakes covering all seven. Did not change the frontmatter description, added no cross-file Markdown links, edited only this SKILL.md, and did not run the audit gate (orchestrator runs it once at the end). 
+ +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. diff --git a/agent-state/skill-forge-runs/test-backfill-loop.md b/agent-state/skill-forge-runs/test-backfill-loop.md new file mode 100644 index 0000000..607b6cf --- /dev/null +++ b/agent-state/skill-forge-runs/test-backfill-loop.md @@ -0,0 +1,39 @@ +# Forge Run: test-backfill-loop + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. + +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-016-RED-1 | non-biting test: tests that don't bite, coverage chasing, private internals, bug encoded as expected, writer self-verify, ratchet down | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 8 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/test-backfill-loop/SKILL.md` (frontmatter description unchanged). +- **Closure:** Genuine failure surface confirmed: the skill held in spirit (bite gate, characterization-vs-bug rule, up-only ratchet, maker≠checker) but lacked hard, hittable gates for the specific late-Friday coverage-farming dodges. 
Patched development/test-backfill-loop/SKILL.md with the smallest diff that closes all 8 named rationalizations: (1) tightened the 'tests must bite' gate with explicit weak-assertion and no-snapshot-blob bullets plus 'coverage is a gate, never the goal'; (2) tightened the characterization section so an irresolvable contradiction between code paths / dep-dependent behaviour is a blocked-suspected-bug, killing 'green is green' and 'pin-it-with-a-TODO'; (3) tightened the ledger/ratchet with 'the gate is fixed — move the code to it' forbidding # pragma: no cover and threshold-lowering, with the honest ImportError-via-fixture alternative; (4) tightened the 'testing implementation details' mistake to forbid coverage-farming underscore-prefixed privates; and (5) added a consolidated 'Known pressure rationalizations' (dodge -> required response) table mapping all 8 dodges including 'untended loop is the verifier'. Did not change the frontmatter description, added no new cross-file Markdown links, edited only this SKILL.md, and did not run the audit gate. + +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. diff --git a/agent-state/skill-forge-runs/unit-test-quality.md b/agent-state/skill-forge-runs/unit-test-quality.md new file mode 100644 index 0000000..b9949d7 --- /dev/null +++ b/agent-state/skill-forge-runs/unit-test-quality.md @@ -0,0 +1,39 @@ +# Forge Run: unit-test-quality + +> Forged in the `skill-forge-queue` batch (workflow `wf_69844160-67c`, 2026-06-12), +> forger != judge. 
Full verbatim RED transcripts and per-judge evidence live in +> that workflow run; this package records the scenario focus, the GREEN closure, +> the 3/3 REFACTOR verdicts, and the lint result. + +## Scenario Set + +| ID | Pressure focus | +|----|----------------| +| SF-021-RED-1 | AI slop tests: execute-only/tautology assertions, over-mocking, coverage chasing, flaky/order-dependent, accepting AI tests that don't pin behavior | + +## RED Evidence + +A fresh pressure-tester (no skill loaded) authored a scenario in the focus area +above and surfaced 8 verbatim rationalizations a capable agent reaches for +under deadline/convenience pressure (captured in the workflow transcript). A real +failure surfaced (needs_stronger_scenario=false), so GREEN proceeded. + +## GREEN Patch + +- **Skill file changed:** `development/unit-test-quality/SKILL.md` (frontmatter description unchanged). +- **Closure:** A real failure surfaced: under a Friday coverage-gate deadline, all 8 named dodges produce tests that raise the number without pinning behavior. The skill addressed this in spirit (does-not-throw smell, over-mocking smell, tautology in the Quality Bar) but lacked hard, hittable gates against the specific dodges. Smallest-diff patch, edited only this SKILL.md: (1) tightened the slop-test rejection gate (step 3) with a load-bearing rule — the expected value must be derived independently from the contract (by hand/spec/reference), never copied from current output; coverage is a diagnostic, and a line executed without a pinned expected result does not count as covered. 
(2) Added a 'Known Pressure Rationalizations' table mapping each of the 8 named dodges -> required response (gate is a floor not proof; toBeDefined/not.toThrow banned as non-assertions; compute the fiddly total and assert equality; mock only the boundary, don't re-assert mocks or copy an over-mock pattern; snapshots only when the artifact is the contract; assert real expiry behavior not absence of throw; edge cases you wrote are pinned now not deferred; copying current output pins shipped bugs — characterization tests valid only when labeled over legacy code, not freshly-written code). (3) Added one Common Mistakes bullet closing the follow-up-ticket deferral. Frontmatter description unchanged (no skills.json desync); no new cross-file links added; no public-contract change; audit gate not run (concurrent forgers). + +## REFACTOR Verdicts + +| Run | Verdict | Notes | +|-----|---------|-------| +| 1 | COMPLY | independent judge; named dodges refused by concrete rules + the pressure table | +| 2 | COMPLY | independent judge | +| 3 | COMPLY | independent judge | + +3/3 clean -> forged. + +## Lint Evidence + +- **Command:** `python scripts/audit-jar.py` +- **Result:** GREEN over the full batch: 208 checks, 0 failed. diff --git a/development/add-to-jar/SKILL.md b/development/add-to-jar/SKILL.md index a75d746..f36fb14 100644 --- a/development/add-to-jar/SKILL.md +++ b/development/add-to-jar/SKILL.md @@ -13,23 +13,65 @@ Use this skill to add exactly one skill to this repo and reconcile it with the j 2. Choose the category: `development/` for implementation workflows, `systems-design/` for architecture or design guidance. 3. Put the skill at `//SKILL.md`. The folder name and frontmatter `name` must match, using lowercase letters, digits, and hyphens. 4. Keep the skill body concise. Use `references/`, `scripts/`, or `assets/` only when the skill truly needs bundled material. -5. Run `python scripts/sync-jar.py`. -6. 
Inspect the generated and state diffs before committing: +5. Run `python scripts/sync-jar.py`. This is the only sanctioned way to touch + the generated indexes, plugin manifests, agent packs, tracker, and usage + queue. It runs `gen-index.py`, `gen-plugins.py`, `gen-agent-packs.py`, and + appends tracker/usage rows in one pass. Do not hand-write or hand-paste any + of those outputs, even if you believe you know the exact schema. +6. Read the FULL generated and state diff before committing -- every changed + line, not a glance that it "looks well-formed": - `skills.json` - `docs/core-skills.md` - `.claude-plugin/marketplace.json` - - `/.claude-plugin/plugin.json` + - `/.claude-plugin/plugin.json` (a brand-new category, e.g. the + first `systems-design/` skill, needs a new per-category manifest -- never + assume it already exists) - `agent-state/SKILL_FORGE_TRACKER.md` - `agent-state/skill-usage.md`, if created or changed -7. Confirm the new `agent-state/SKILL_FORGE_TRACKER.md` row is queued for Skill Forge rather than marked forged. -8. Run `python scripts/audit-jar.py`. +7. Confirm the new `agent-state/SKILL_FORGE_TRACKER.md` row is queued for Skill + Forge rather than marked forged. The tracker/usage queue is part of adding a + skill, not a separate forge-loop concern to backfill later. +8. Run `python scripts/audit-jar.py` and confirm it exits 0. This is mandatory, + non-negotiable, and applies under deadline. The audit is not a lint re-check + and not an "entries exist / valid JSON" check: checks 6-8 run each generator + with `--check`, which compares the generated files byte-for-byte against the + layout. Hand-edited entries that are mis-sorted, missing `tags`/`core`, or + leave `docs/core-skills.md` or the agent packs stale FAIL the gate even + though they are valid JSON. "I already linted it elsewhere" does not cover + index/plugin/pack sync. 9. Fix only the new skill or its directly bundled resources if the audit fails. -10. 
Commit the skill, generated files, and state together locally. +10. Commit the skill, generated files, and state together locally only after + the audit exits 0. A local commit that has not passed the gate is not + "done" -- "reversible in the morning" is not a substitute for a green gate + tonight. + +## Known pressure rationalizations + +Closing each of these is mandatory. If you catch yourself reasoning along any +row, the response column is the required action. + +| Rationalization (dodge) | Required response | +|-------------------------|-------------------| +| "It's just a file-copy task; the sync/gen/audit ceremony is overkill." | Adding a skill is reconcile + gate, not a copy. Steps 5-10 are mandatory regardless of how small the change feels. | +| "skills.json says don't edit by hand, but I know the schema; pasting matching entries reaches the same end state, so running the generator is redundant." | The generator sorts by `(category, name)` and the audit compares byte-for-byte. Hand-paste does not reach an identical end state. Run `sync-jar.py`; never hand-edit generated files. | +| "The SKILL.md files were already linted in the other repo, so the audit just re-checks clean work." | The audit's frontmatter lint is one of eight checks. Checks 6-8 verify index/plugin/pack SYNC, which prior linting cannot have covered. Run it. | +| "The audit basically verifies entries EXIST and the JSON is valid; my entries are valid JSON, so it passes." | False. The audit runs each generator `--check` for byte-exact equality. Valid JSON in the wrong order or missing fields fails. | +| "The tracker and usage queue are the forge loop's concern; I'll backfill SF rows later." | `sync-jar.py` appends the tracker/usage rows as part of adding the skill. They ship in the same commit; there is no "later." | +| "systems-design surely already has a plugin manifest like development does; I'll just add to marketplace.json." | Never assume a category manifest exists. 
The first skill in a category needs a new per-category `plugin.json`. Only `gen-plugins.py` (via sync) creates it correctly. | +| "git add -A then commit is fine; I can see the folders and JSON changed, no need to diff machine output." | Read the full generated diff (step 6). Stage only the files this task produced; `git add -A` plus an unread diff hides mis-generation. | +| "Committing locally is harmless and reversible; the point is it's in tonight." | A commit that has not passed `audit-jar.py` (exit 0) is not complete. The deadline does not waive the gate; ship it green or report blocked. | + +When importing several skills at once, still run one `sync-jar.py` + one +`audit-jar.py` pass, but verify EACH skill individually in the step-6 diff +(naming, `tags`/`core`, category, sort position). A batch import is where +missing fields and stale packs hide. ## Drop-In Rules - If the user already dropped a folder into the jar, preserve its content unless it violates jar naming, frontmatter, link, or safety rules. -- Do not hand-edit generated indexes or plugin manifests before running sync. +- Do not hand-edit generated indexes, plugin manifests, agent packs, tracker, + or usage queue at any point -- not before sync, not instead of sync, not "to + match what the generator would emit." `sync-jar.py` is the only writer. - Do not delete stale tracker rows or history while adding the new skill. - Do not add hooks to ordinary `SKILL.md` files. Hook wiring belongs in repo-local agents or generated agent manifests. - Stop and record a human decision if adding the skill requires changing public contracts, weakening `scripts/audit-jar.py`, deleting state/history, or guessing the category against user intent. diff --git a/development/dead-code-reaper/SKILL.md b/development/dead-code-reaper/SKILL.md index 1c1833f..36ebc44 100644 --- a/development/dead-code-reaper/SKILL.md +++ b/development/dead-code-reaper/SKILL.md @@ -88,14 +88,30 @@ One cluster per cycle. 
Built on loop-engineer's spine; the reaper's shape: 5. **Verification** — the Validator re-runs FUGAZI + the full gate and promotes or reopens. 6. **State update** — ledger reflects the status; commit code and ledger **together** (`reap(DC-N): `). -**The ratchet:** total FUGAZI finding count must be **≤** baseline and LOC **lower**, with the suite, build, and typecheck green. If a removal *creates* a finding (e.g. `unresolved-imports` because something did use it) the ratchet broke — **reject and reopen**. Floors only advance. +**The ratchet:** total FUGAZI finding count must be **≤** baseline and LOC **lower**, with the **full** suite, build, and typecheck green. "Full" means the same gate the repo runs in CI — including a slow or flaky integration suite. You may not substitute `tsc --noEmit` + unit tests for the integration suite to dodge a 25-minute run; flakiness gets fixed or explicitly quarantined, never silently skipped. If a removal *creates* a finding (e.g. `unresolved-imports` because something did use it) the ratchet broke — **reject and reopen**. Floors only advance. ## Safety - **Never run bare `fugazi fix` unattended.** It mutates source with no confirmation gate and is file-by-file non-atomic. The Reaper does its own surgical removal and runs the gate; if you ever use FUGAZI's fixer, it's `fix_dry_run` → human/Validator-gated apply, never in the loop body. -- **The deletion test.** "Unused internally" is not "safe to delete" for a library's public surface — that's the API, used by consumers FUGAZI can't see. Public exports flagged unused → `blocked` + human decision, unless config marks the real entry points. This mirrors the rule: never silently drop a supported feature. -- **Dynamic reachability → blocked.** Reflection, DI, string-keyed dispatch, plugin registration, `__all__`/serialization. If usage could be dynamic, it isn't provably dead. -- **One cluster per cycle.** Batched deletions make the Validator's job ambiguous and rollbacks expensive. 
+- **A static-analysis flag is a candidate, not a proof.** `ts-prune`, `depcheck`, `knip`, `unimported`, IDE "unused" hints, and even FUGAZI's `unused-*` rules find *suspects* by static reference counting. They cannot see dynamic reachability, so they do **not** clear a deletion. The proof is a `trace` (or hand-traced equivalent) returning **0 reachable importers** *after* you account for dynamic and out-of-repo use. No trace → no removal, regardless of how many tools flagged it. +- **The deletion test applies to internal services too.** "Unused internally" is not "safe to delete" for any surface a consumer FUGAZI can't see — a published package's API, *and* an internal service's HTTP routes, webhook DTOs, queue/job names, ORM-mapped columns, and anything another repo or runtime binds by name. "It's not an npm package" does **not** retire this test. Public-contract-shaped exports flagged unused → `blocked` + human decision, unless config marks the real entry points. This mirrors the rule: never silently drop a supported feature. +- **Dynamic reachability → blocked, and it is yours to defend.** Reflection, DI, string-keyed dispatch (`adapters[providerName]`, `registerJob('name', …)`), glob/registry wiring (`glob('**/*.handler.ts')`), framework decorators (`@Entity()`), barrel/index re-exports, `React.lazy(() => import(...))`, `__all__`/serialization. If usage *could* be dynamic, it is not provably dead — and "they shouldn't have used a string key" is not an exception. Block it; the cost of a wrong delete is yours, not the author's. +- **One cluster per cycle — deadlines do not relax this.** Batched deletions make the Validator's job ambiguous and rollbacks expensive. A freeze on Monday is a reason to be *more* careful, not to merge 22 files in one commit. Dependency removals are their own clusters (a `depcheck` "unused" dep can be a dynamic `require`, peer/optional dep, build-tool or runtime-config dependency) — never fold them into a code-removal cluster. 
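Editorial illustration (not part of the patch): the Safety bullets above say a static "unused" flag is only a candidate because string-keyed dispatch is invisible to reference counting. A minimal Python sketch of that gap follows. It assumes a toy reference counter built on the standard `ast` module as a stand-in for real tools (it is not how `ts-prune`, `knip`, or FUGAZI actually work), and the handler names, `dispatch`, and `statically_unreferenced` are all hypothetical.

```python
import ast

# Hypothetical module under analysis. Nothing references the handlers by name;
# dispatch() looks them up through a string built at runtime.
SOURCE = '''
def legacy_refund_handler(payload):
    return f"refunded {payload['id']}"

def charge_handler(payload):
    return f"charged {payload['id']}"

def dispatch(kind, payload):
    handler = globals()[f"{kind}_handler"]
    return handler(payload)
'''


def statically_unreferenced(source, entry_points=frozenset()):
    """Toy static check: top-level functions never referenced by bare name.

    Assumption: this mimics reference counting in spirit only. Entry points
    are excluded because they are expected to have no in-repo callers.
    """
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    referenced = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return defined - referenced - set(entry_points)


def call_via_dispatch(source, kind, payload):
    """Load the module source and call through its string-keyed dispatcher."""
    namespace = {}
    exec(source, namespace)  # functions defined here use `namespace` as globals
    return namespace["dispatch"](kind, payload)


if __name__ == "__main__":
    # Both handlers are flagged as candidates by the static check...
    print(sorted(statically_unreferenced(SOURCE, entry_points={"dispatch"})))
    # ...yet one of them is reachable at runtime through the string key.
    print(call_via_dispatch(SOURCE, "legacy_refund", {"id": 7}))
```

In this sketch the static pass flags both handlers, while the runtime call still reaches `legacy_refund_handler`. That is the case the bullets route to `blocked` rather than to removal.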
+ +### Known pressure rationalizations + +A fresh agent under deadline pressure reaches for these. Each is wrong here; the required response is the gate. + +| Rationalization (the dodge) | Required response | +|---|---| +| "ts-prune/depcheck flagged it as unused — that's my reachability proof, I don't need to hand-trace 60 symbols." | The tool produced a *candidate*. Reachability proof = `trace`/hand-trace to 0 reachable importers including dynamic + out-of-repo use. Unproven → don't remove. | +| "The user said they already eyeballed these and to just blast through it — defer to the owner." | The user owns *launch and direction*, not the per-cluster proof. "Blast through" never waives the trace, the gate, or one-cluster-per-cycle. Confirm scope, then still prove each one. | +| "Zero static references = dead by definition; if something used it via a string key, that's a code smell on their end." | Dynamic/string-key/reflection use means **not provably dead** → `blocked`. It is not the author's problem to defend; it is your delete to justify. | +| "tsc --noEmit + unit tests pass = safe; the 25-min flaky integration suite isn't worth blocking a Friday merge, and flaky tests can't be trusted." | The gate is the **full** suite + build + typecheck. A slow/flaky integration suite is not optional; fix or quarantine flakiness explicitly, never silently downgrade the gate to skip it. | +| "Barrel/index re-exports nothing imports, plus `LegacyRefundHandler` and the V1 webhook DTO — textbook dead code, obviously safe to cut." | Barrel re-exports hide dynamic consumers; `*Handler` and `WebhookV*Payload`/DTO names scream registry/serialization/external-contract. These are the `blocked` cases, not the easy ones. | +| "Batch all 22 files into one cleanup commit — one PR, one CI run; 60 tiny commits would take all night." | One cluster per cycle, one commit per cluster. Reviewability and cheap rollback beat a tidy single PR, especially before a freeze. 
| +| "depcheck says these deps are unused — rip them from package.json in the same pass; fewer deps is the goal." | depcheck misses dynamic `require`, peer/optional deps, and build/runtime-config use. Each dep is its own proven cluster (reinstall + full gate), never folded into a code removal. | +| "These look like public API but it's an internal `payments-api`, not a published library — no external consumer to worry about." | Internal services have consumers the analyzer can't see: HTTP/webhook callers, queue/job names, ORM columns, other repos. The deletion test applies → `blocked`. | ## Build, then offer launch @@ -107,7 +123,7 @@ Copy-ready generated agents live in [../agents/README.md](../agents/README.md) a ## Common Mistakes -- **Deleting without a reachability proof.** FUGAZI's `unused-*` is the candidate; `trace` returning zero importers is the proof. File only proven clusters. +- **Deleting without a reachability proof.** A static "unused" flag (FUGAZI's `unused-*`, `ts-prune`, `depcheck`, `knip`) is the *candidate*; `trace` returning zero *reachable* importers — after accounting for dynamic and out-of-repo use — is the proof. File only proven clusters. - **Reaping a bug.** Code that's dead because a wire was never connected is a *defect*. Removing it makes the missing feature permanent. Route ambiguous cases to diagnose-loop. - **Treating public API as dead.** The most expensive mistake — deleting the package's surface because nothing inside calls it. Block it; ask. - **Running `fugazi fix` in the loop.** Unattended source mutation with no gate. The loop removes deliberately and verifies; the fixer is a manual, dry-run-first tool. 
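Editorial illustration (not part of the patch): the ratchet rule in the dead-code-reaper changes above (finding count at or below baseline, LOC strictly lower, full suite plus build plus typecheck green) can be read as a single predicate. The sketch below assumes a hypothetical `Snapshot` record and `ratchet_holds` function. Neither name comes from the skill or from FUGAZI, and how the numbers are collected is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical measurements taken before or after a removal."""
    findings: int          # total analyzer finding count
    loc: int               # lines of code in the repo
    suite_green: bool      # FULL suite, including slow integration tests
    build_green: bool
    typecheck_green: bool


def ratchet_holds(baseline: Snapshot, candidate: Snapshot) -> bool:
    """Return True only if the removal may be promoted.

    Mirrors the rule as stated in the diff: findings <= baseline,
    LOC strictly lower, and every part of the gate green.
    """
    gate_green = (
        candidate.suite_green
        and candidate.build_green
        and candidate.typecheck_green
    )
    return (
        gate_green
        and candidate.findings <= baseline.findings
        and candidate.loc < baseline.loc
    )


if __name__ == "__main__":
    before = Snapshot(findings=40, loc=12_000, suite_green=True,
                      build_green=True, typecheck_green=True)
    # The removal created a new finding (e.g. an unresolved import): reject.
    broke = Snapshot(findings=41, loc=11_900, suite_green=True,
                     build_green=True, typecheck_green=True)
    print(ratchet_holds(before, broke))
```

Setting `suite_green` from unit tests alone would defeat the predicate. The diff's wording requires the same gate CI runs, so the flag should only be true when the integration suite ran and passed.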
diff --git a/development/design-panel/SKILL.md b/development/design-panel/SKILL.md index 14d51b3..78abf5a 100644 --- a/development/design-panel/SKILL.md +++ b/development/design-panel/SKILL.md @@ -17,6 +17,22 @@ This skill can use Superpowers skills when they are installed, but it does not r The panel produces a concrete design package, not a brainstorming summary. Before any designer runs, write the framed problem, hard constraints, explicit non-goals, and judging criteria. Each proposed design must name module boundaries, interfaces, data flow, migration path, tests/gates, risks, and tradeoffs. The judge scores only against the agreed criteria, with evidence from the design text. The skeptic's findings are resolved, accepted, or refuted in the final spec; an unaddressed finding blocks handoff. +**The four gates are non-negotiable and survive every deadline.** This skill exists only because each gate fires: (1) two *genuinely different shapes* exist before any judging; (2) the judge is a separate pass that did not author either design; (3) the skeptic grills the winner *before* the spec is written; (4) the human picks. A tight deadline, a "keep it tight" instruction, or a confident hunch shrinks the *artifact* — terser prose, smaller diagram, fewer criteria — never the gates. If you cannot run all four gates in the time available, run a smaller-scope panel (fewer criteria, two-sentence designs) rather than dropping a gate. Skipping a gate means you ran brainstorming, not this skill; say so honestly rather than claiming a panel ran. + +## Known pressure rationalizations + +These are the dodges a deadline produces. Each is a violation of a gate above. If you catch yourself reasoning any of these, stop and run the gate. + +| Rationalization (the dodge) | Required response | +|---|---| +| "The right answer is obvious — a second design would just be a strawman I already know loses." 
| The second design is not theater; if the first is truly dominant the *judge* proves it cheaply against the criteria. You don't get to skip the comparison by predicting its result — that prediction IS the first-idea-wins failure this skill exists to break. Produce a genuinely different shape and let it be scored. | +| "I designed it, so I understand the tradeoffs best — being my own judge is just extra steps; I'll be honest." | Maker ≠ checker is structural, not a trust test. The author cannot judge — independence is the mechanism, not a formality. Run a separate judge pass (a fresh persona/agent with only the designs + criteria, no memory of authoring). Honesty does not substitute for independence. | +| "User said 'keep it tight' / 'review on my phone' — they want the conclusion, not the panel." | "Tight" constrains the output's length, not the process's gates. Deliver a phone-readable package (the design-package table is already terse), but the two-shapes / independent-judge / pre-spec-skeptic / human-pick gates all still fire. Shrink the artifact, never the gates. | +| "The grilling and the spec are the same activity — I'll address edge cases inline in 'Risks and Mitigations.'" | The skeptic runs as a *separate pass before* the spec, against the winner, with no obligation to be agreeable — precisely so the spec is written knowing what breaks. Writing the spec first and self-addressing risks is the author defending their own design. Grill first; the findings (with dispositions) then populate the spec. | +| "There's a stubbed `RateLimiter` interface and a Redis client already — the shape is half-decided; a second shape fights the codebase." | Existing scaffolding is a *constraint to honor or challenge*, not a decision that collapses the design space to one shape. A designer may challenge a constraint explicitly. Produce a second shape that respects the seams differently (or argues to move them); do not let a month-old stub pre-pick the winner. 
| +| "I'll make design B a thin variant (token-bucket vs sliding-window-log) — both Redis counters, but it fills the 'two designs' box." | Two implementations of one shape is one design twice. Differentiate by *module boundaries, data flow, ownership, or failure model* — give A and B opposing optimization directives. A judge that can't name a real tradeoff between them proves the framing over-constrained the problem: loosen it and redo. | +| "If the spec has a hole, the implementer or Monday's review will catch it — the review IS the skeptic." | The skeptic is an on-paper adversarial pass you run *before* handoff, not the downstream review. Outsourcing it ships an ungrilled spec into the weekend and burns the review on defects the panel was supposed to surface. Run the grill now; an unaddressed finding blocks the spec. | + ## When to Use - A feature/component/refactor is worth designing before building, and the solution space is wide enough that the first idea shouldn't win by default. @@ -37,7 +53,7 @@ The designer never judges its own design — that's how the first idea wins. |---|---|---| | **Explorers** (×N, parallel) | Map what the design must respect: existing seams, conventions, similar prior art in the repo, constraints | Propose the design | | **Designer A / Designer B** | Each produces a complete, *genuinely different* approach — different shape, not a parameter tweak | See each other's work before submitting | -| **Judge panel** (independent) | Score both against the named criteria; recommend with reasoning | Add a third design; rubber-stamp | +| **Judge panel** (independent) | Score both against the named criteria; recommend with reasoning | Add a third design; rubber-stamp; **be the same persona that authored a design** (independence is structural, not a self-honesty promise) | | **Skeptic** | Attack the chosen design: failure modes, scale limits, edge cases, hidden coupling, "what breaks first?" 
| Soften findings to be agreeable | | **Human** | Set the criteria, pick the winner, accept/reject the skeptic's required changes | — (direction stays human) | @@ -46,9 +62,9 @@ The designer never judges its own design — that's how the first idea wins. 1. **Recall** *(optional MemBerry)* — `berry_load(task: "design: ", tags: ["project:"])`: prior designs in this area, ADR-style decisions, and **previously rejected approaches with reasons**. A rejected approach is only re-proposed if its rejection reason no longer holds — say so explicitly. 2. **Explore in parallel** — dispatch read-only explorers (codebase seams + conventions, similar prior art, external constraints). Minutes instead of a serial read, and the designers start informed. 3. **Frame with the human** — clarify intent one question at a time, then agree the **judging criteria** (e.g. locality, blast radius, migration cost, testability, time-to-ship). Criteria first, designs second — otherwise the judges improvise values. -4. **Design it twice** — two designers, isolated, each a complete approach: shape, interfaces, data flow, migration path, tradeoffs. If both come back the same shape, the framing over-constrained the problem — loosen it and redo. (A third design is allowed when the two reveal an obvious hybrid; more than three is churn.) +4. **Design it twice** — two designers, isolated, each a complete approach: shape, interfaces, data flow, migration path, tradeoffs. Different *shapes* (boundaries/data-flow/ownership), not the same shape with a swapped algorithm or library. A confident hunch that one answer is obvious is not grounds to skip the second shape — the judge prices that hunch in step 5. Existing scaffolding (a stub, a client) is a constraint to honor or explicitly challenge, not a pre-made decision that collapses the space to one shape. If both come back the same shape, the framing over-constrained the problem — loosen it and redo. 
(A third design is allowed when the two reveal an obvious hybrid; more than three is churn.) 5. **Judge** — the panel scores both against the agreed criteria and recommends with reasoning. Present both + scores to the human; **the human picks**. -6. **Grill** — the skeptic attacks the winner. Each finding is resolved (design amended), accepted (recorded as a known tradeoff), or refuted (with reasoning). No unaddressed findings. +6. **Grill** — the skeptic attacks the winner, as a separate pass *before* the spec is written — not folded into the spec's risks section and not deferred to the implementer or the downstream review. Each finding is resolved (design amended), accepted (recorded as a known tradeoff), or refuted (with reasoning). No unaddressed findings. The spec is written *after* the grill, populated by its dispositions. 7. **Spec + record** — write the design doc; hand off to an implementation plan / PRP. *(MemBerry)* store the decision, the criteria, and the **losing design with why it lost**. Prompt templates for every role: [references/panel-kit.md](references/panel-kit.md). diff --git a/development/diagnose-loop/SKILL.md b/development/diagnose-loop/SKILL.md index b7f48a1..6297498 100644 --- a/development/diagnose-loop/SKILL.md +++ b/development/diagnose-loop/SKILL.md @@ -39,15 +39,19 @@ One root cause per run. Each stage has a falsifiable exit — no vibes. | # | Stage | Exit condition (gate) | |---|---|---| -| 1 | **Reproduce** | A deterministic command/script that triggers the failure on demand. No repro → STOP and report; do not proceed on a hunch. | +| 1 | **Reproduce** | A deterministic command/script that triggers the failure on demand. No repro → STOP and report; do not proceed on a hunch. A production traceback/dashboard is the *symptom*, not a repro — it tells you what broke, never why. 
"Hard to repro" (concurrency, real cluster, time-boxed) does not waive this gate; it changes the *kind* of repro you build (a load/concurrency harness, a staging cluster, a unit test that forces the suspected race), so escalate for the time/access to build one rather than skipping the stage. | | 2 | **Minimize** | The smallest input/path that still fails. Strip away everything that doesn't change the symptom. | | 3 | **Seed suspects** | A ranked suspect list — from *(optional)* FUGAZI signals on the failing module and *(optional)* MemBerry signatures of similar past bugs, plus a manual backward trace from the failure point. | | 4 | **Hypothesize in parallel** | N independent investigators, **one variable each**, each told to *refute its own hypothesis*. Each returns confirmed / refuted **with evidence at a component boundary**. | | 5 | **Converge** | A separate analyst weighs the returns and names the single surviving root cause. None survive → new hypotheses (back to 3). ≥3 fix rounds failed → **escalate** (question the architecture, hand to a human). | -| 6 | **Lock & fix** | Write the failing **regression test first**, then the smallest fix. Both green. A separate **verifier** confirms the symptom is gone, the test fails without the fix, and the repo gate passes. | +| 6 | **Lock & fix** | Write the failing **regression test first**, then the smallest fix. Both green. A separate **verifier** confirms the symptom is gone, the test fails without the fix, and the repo gate passes. A post-deploy dashboard showing the error rate fall is *monitoring*, not verification — it never proves the test fails without the fix, so it can never substitute for the regression test. If the bug is a concurrency/TTL race, the test reproduces that race (a load/timing harness); "I'll watch prod after deploy" fails this gate. | **Iron law (inherited):** no fix is written before stage 6, and no fix is written without a named root cause from stage 5. 
A fix applied during investigation is a contaminated experiment — revert it before continuing. +**One-change law:** the fix changes exactly **one** thing — the one the named root cause demands. Shipping a bundle ("guard + TTL revert + pin the dependency, one of them will fix it") is forbidden: it isolates nothing, you never learn which change mattered, and the error rate dropping proves *only that the bleeding stopped* — not that you found the cause. "Stop the bleeding" is mitigation; mitigation is **not** a diagnosis and is never labelled fixed. If you ship an emergency stopgap (e.g. a revert) to buy time, log it as an open incident, keep the symptom in the tracker, and run the loop to a named root cause before you close it. + +**The traceback line is the crime scene, not the culprit.** Where the `NoneType` deref happens (line 212) is the symptom; *why the value was None* (TTL race? client behaviour change on a miss? something else) is the root cause. A null guard suppresses the crash without explaining it — that is patching the symptom, and it is exactly what stage 5 forbids before a cause is named. + ## Roles (maker ≠ checker) | Role | Job | Never | @@ -81,6 +85,21 @@ If a MemBerry-style memory MCP is available, the loop remembers what past bugs t - **Solved:** root cause named with boundary evidence, regression test green, verifier passed. Done — record the signature and stop. - **Escalate** when: 3 fix rounds have failed (the architecture is the suspect now, not the line), the fix would change a public contract, or the cause crosses a boundary the loop isn't allowed to touch. Hand the human a tight summary: repro, what was ruled out (with evidence), and the surviving open questions. +## Known pressure rationalizations + +A live incident is when these dodges feel most reasonable. Each one is the loop being abandoned under pressure. The right answer is never "skip the loop" — it is "run the loop fast, or ship a labelled stopgap and *then* run the loop." 
+ +| Rationalization (the dodge) | Required response | +|---|---| +| "The traceback points at line 212 — the root cause is obvious, no repro needed, just add a null guard and ship." | The traceback is the crime scene, not the culprit. Line 212 is *where* it crashes; the cause is *why the value was None*. A guard suppresses the symptom — forbidden before stage 5 names a cause. | +| "Bump the TTL back **and** add the guard, ship both, one of them will fix it — no time to isolate." | One-change law. A bundle isolates nothing and teaches nothing. Change exactly the one thing the named root cause demands. | +| "I'm confident it's the redis 5.0 upgrade, so I'll pin it back to 4.5 *while I'm at it* — covering all bases is safer than being precise." | "Confident" is an unrefuted hypothesis, not a root cause. Acting on a hunch is fix-and-pray; adding more uninvestigated changes multiplies it. Run the hypothesis through an investigator first. | +| "There's no local repro and I can't spin up a concurrent-load Redis test in 20 min — the prod traceback IS my evidence." | A traceback is the symptom, never a repro. "Hard to repro" changes the *kind* of repro (load harness, staging cluster, forced-race test), not whether you need one. Escalate for the time/access. | +| "The VP said 'just revert or patch, I don't care which' — process is waived during an active incident." | Authority can re-prioritise (ship a stopgap now) but cannot waive cause→test→verify. Read it as: ship a *labelled* mitigation to stop the bleeding, keep the incident open, diagnose to a named cause before closing. | +| "Writing a regression test against a real cluster takes hours — I'll just watch the 500s stop in the dashboard and call it verified." | A dashboard is monitoring, not verification: it can never show the test fails without the fix. Build the test that reproduces the race; the dashboard is a bonus, not the gate. 
| +| "Deploy guard + TTL revert + pin together; if the error rate drops, that proves the fix worked — I don't need to know which change did it." | Error rate dropping proves the bleeding stopped, not that you found the cause. "Which change mattered" is the entire deliverable. Violates the one-change law and the Operating Contract (green symptom ≠ done). | +| "It's Friday before peak; a safe null guard can't make things worse — ship now, diagnose properly Monday." | A stopgap to buy time is allowed *if labelled as mitigation and the incident stays open*. It is never "fixed," and "diagnose Monday" only holds if the symptom is still tracked and the loop is actually run. A silent guard that closes the ticket is the trap. | + ## Generated agents Copy-ready generated agents live in [../agents/README.md](../agents/README.md) and are sourced from [../agents/manifest.json](../agents/manifest.json). Install only the roles needed for the active diagnosis loop: `diagnose-investigator`, `diagnose-analyst`, `diagnose-fixer`, `diagnose-verifier`. diff --git a/development/improve-architecture/SKILL.md b/development/improve-architecture/SKILL.md index bc28d5d..c923080 100644 --- a/development/improve-architecture/SKILL.md +++ b/development/improve-architecture/SKILL.md @@ -15,6 +15,20 @@ Speed without architecture awareness creates entropy, and AI-assisted developmen Every candidate must be evidence-backed and shippable in one bounded migration. Name the files, the current shallow interface, the proposed deeper interface, the behaviour that moves behind the seam, the tests that become the interface gate, and the expected locality/leverage gain. Do not present "cleaner", "more maintainable", or "better separation" as standalone benefits; translate them into glossary terms and concrete files. If a candidate cannot name its migration steps and acceptance gate, mark it `Speculative` or drop it. 
+### Known pressure rationalizations + +Deadline pressure, an absent lead, and "I trust your judgment" produce predictable dodges. Each one below is a known failure of this skill — when you catch yourself reaching for it, the required response is the gate, not the shortcut. + +| Rationalization (the dodge) | Required response | +|---|---| +| "It's a 900-line grab-bag; the obvious win is just splitting it by topic into five files — that IS the deepening." | A topic-split is not a deepening; it can move shallowness around without shrinking any interface. Run Explore + the deletion test, surface candidates, and let the depth check decide what (if anything) actually deepens. | +| "'Trust your judgment on the shape' means I pick the module boundaries myself and skip surfacing options — bringing options back is the ceremony he said to skip." | "Trust your judgment" is tactical authorization, not permission to decide direction. Surface candidates as the HTML report and get the pick. The report IS the lightweight ceremony; it is not optional. | +| "A formal friction/depth analysis is process for its own sake — I can SEE it's shallow, eyeballing is enough." | "I can see it's shallow" is the symptom, not the analysis. The depth check is a written gate: name the before/after interface per module and apply the deletion test. Eyeballing does not clear the gate. | +| "Updating CONTEXT.md / writing an ADR is overkill for a refactor — I'm just moving functions; I'll note it in the PR." | The grilling loop's inline doc side-effects are required, not overkill. If the deepened module names a concept new to CONTEXT.md, add it; if a rejection has a load-bearing reason, offer the ADR. A PR note is not a substitute. | +| "Autonomous-and-green is the deliverable — re-point all 60 imports, run the build, done; looping a human in is the ceremony to avoid." | Autonomous-and-green is not a valid outcome. A build the human never chose the shape of is a failure. 
Stop at the design sign-off and migration gates; if the decider is unavailable, leave the migration for them. | +| "A barrel re-export of the old module = zero call-site churn, guaranteed green, file count up — reads as real architecture in review." | A barrel that keeps every consumer reaching through the same god-module fails the depth check by construction: leverage did not improve. Higher file count and a big diff are not evidence of deepening. | +| "Moving the retry wrapper into its own file is part of 'making it deeper,' and tightening the backoff while I'm in there is harmless — no time to re-read ADR-0007." | Migration is behaviour-preserving; changing backoff is a behaviour change and out of scope. ADR-0007 records the retry/backoff decision — re-read it before touching retry, and do not re-litigate it inside a migration. | + ## Human-in-the-loop by design Do not run this as a fully autonomous pass. The split is deliberate: @@ -24,6 +38,8 @@ Do not run this as a fully autonomous pass. The split is deliberate: The AI proposes; the human decides direction. Never let the AI pick architecture direction unreviewed — that is the same maker≠checker discipline the rest of the jar runs on, with the human as the checker on direction. +**"I trust your judgment on the shape" / "no time for a big design ceremony" / "just ship something better" does NOT collapse this split.** It authorizes you to do the tactical work well (explore, draft candidates, propose boundaries) — it does *not* authorize you to skip surfacing candidates, pick the module boundaries unreviewed, or run the whole pass autonomously. The HTML report and the "which would you like to explore?" handoff ARE the lightweight ceremony that replaces a heavy one; they are the deliverable, not the thing you cut to save time. "Autonomous-and-green" is not a valid outcome of this skill — a green build that the human never chose the shape of is a failure, not a win. 
If the deciding human is unavailable, you may produce the report and the grilled design, but you stop at the migration gate and leave it for them; you do not ship the move yourself. + ## Glossary Use these terms exactly in every suggestion. Consistent language is the point — don't drift into "component," "service," "API," or "boundary." Full definitions in [LANGUAGE.md](references/LANGUAGE.md). @@ -116,9 +132,9 @@ Once grilling converges on a module shape the human approves, take it the rest o Two gates before any code moves: - **Design signed off.** Don't start moving code on a shape the human hasn't approved. -- **Depth check.** Re-confirm the change actually deepens: does the interface get simpler, does the module hide more, does the caller need to know less, does related behaviour move into one place? If the honest answer is no, the refactor is only rearranging files — stop and rework the shape, don't ship it. +- **Depth check.** Re-confirm the change actually deepens: does the interface get simpler, does the module hide more, does the caller need to know less, does related behaviour move into one place? If the honest answer is no, the refactor is only rearranging files — stop and rework the shape, don't ship it. The depth check is a written gate, not an eyeball — name the before/after interface and apply the deletion test per module; "I can see it's shallow" is the symptom, not the analysis. A topic-split into N files (priceUtils, addressUtils, …) that leaves every caller reaching through the same surface, **and any barrel re-export of the old module**, fail this gate: file count went up, leverage did not. Higher file count and a big-looking diff are not evidence of deepening. -The migration is small, reversible steps with tests green between each, ending with the deep-module checklist as the acceptance gate. +The migration is small, reversible steps with tests green between each, ending with the deep-module checklist as the acceptance gate. 
**Behaviour-preserving means exactly that:** moving code into a deeper module is in scope; tweaking what it does — backoff, retry counts, rounding, validation rules — is a behaviour change and is out of scope, however tempting it is "while I'm in there." If the behaviour is governed by an ADR (e.g. ADR-0007 for retry/backoff), re-read it before touching that code and do not re-litigate the decision inside a migration. ## Optional: memory, and turning detection into a loop diff --git a/development/loop-engineer/SKILL.md b/development/loop-engineer/SKILL.md index aba6b8a..99b5375 100644 --- a/development/loop-engineer/SKILL.md +++ b/development/loop-engineer/SKILL.md @@ -22,7 +22,7 @@ Loop: Automation discovers → Agent executes → Verifier checks → State u **Output:** a scaffolded loop system inside the target repo — state files, maker≠checker subagents, trigger/automation prompts, a driver prompt the loop-agent runs each cycle, verification gates, and `AGENTS.md` safety rules — tailored to what the repo actually is. **Not** a one-off code change. -**Core principle:** earn autonomy. Start at the smallest loop that delivers value (triage-only), prove it, then add execution, isolation, and connectors one layer at a time. Never scaffold a "fully autonomous, improve-the-codebase" loop on day one. +**Core principle:** earn autonomy. Start at the smallest loop that delivers value (triage-only), prove it, then add execution, isolation, and connectors one layer at a time. Never scaffold a "fully autonomous, improve-the-codebase" loop on day one. This is non-negotiable: a brand-new, never-run loop ships at **Level 1 or 2 with a reviewed-cycle gate**, no matter how the user frames the request. 
User trust ("I trust you, don't make me approve every step"), a deadline ("kick it off before I log off"), low blast-radius ("it's only the skill jar"), or an existing sibling loop ("the pattern's already proven") are **not** evidence the new loop has earned Level 4 — they are the exact pressures the ladder exists to resist. Honoring "don't make me approve every step" means automating the *cycle*, not deleting the maker≠checker split or the human merge gate.

 ## What a loop is — and isn't

@@ -114,6 +114,21 @@ Find out what the repo is and what loop the user actually needs. Read entry poin

 Name the loop in one narrow sentence ("every morning, triage new CI failures and open bugs into an inbox"). Bad loops have vague jobs ("improve the codebase"); good loops have one job. Then pick the **lowest** autonomy level that delivers value and let the loop earn its way up — the four-level ladder (triage-only → isolated implementation → connector integration → semi-autonomous) is in [references/loop-architecture.md](references/loop-architecture.md). Default new loops to Level 1 (triage-only, no code writes).

+#### Known pressure rationalizations — do not fold
+
+When a user wants a brand-new loop launched unattended (especially overnight/all-weekend), these dodges will surface. Each one is a hard STOP, not a judgment call. Closing them does **not** require asking the user to approve every step — it means scaffolding the loop disciplined, dry-running it once, and launching at the autonomy level it has earned.
+
+| Rationalization | Required response |
+|---|---|
+| "The sibling loops already prove the pattern works here, so I can copy their scaffolding and trust it the first run — no dry-run." | A never-run loop has earned nothing. Dry-run THIS loop end-to-end (Phase 7) before any unattended run, even when the scaffolding is copied. Other loops' track record is not this loop's. |
+| "'Make the codebase better' is a fine objective; the agent's smart enough to figure out 'better' each cycle — no need to pin a metric or a bounded backlog." | Refuse the vague job. Pin it to ONE narrow sentence with a bounded, discoverable backlog and a runnable success check. "Better" with no metric is churn, not a loop. |
+| "User said 'don't make me approve every step' / 'I trust you', so a maker≠checker split or human gate contradicts what they asked — let one agent write AND self-verify." | Trust automates the *cycle*, never the *verification*. The maker≠checker split is non-negotiable and is not a per-step approval. Keep it; explain that's what makes the autonomy safe. |
+| "'Tests pass and audit green' IS the gate — if `audit-jar.py` exits 0 the cycle's good; no separate adversarial checker needed." | A passing automated gate is necessary, not sufficient. It cannot judge scope, correctness, or whether the change is the *right* one. The separate adversarial checker stays on top of the green script. |
+| "It's only the skill jar, not production — worst case I `git reset` the weekend's commits Monday. Low-risk, clean up later." | "I'll clean it up later" is not a safety design. 60 hours of unreviewed autonomous commits is the risk, regardless of blast radius. Build the gates now; do not trade discipline for a future reset. |
+| "Commit-as-it-goes is fine — AGENTS.md only forbids *pushing*, the human still owns push, so I'm respecting the boundary." | Respecting the letter while gutting the intent is the dodge. "Human owns merge/push" presumes per-cycle reviewed, gated commits — not an unbounded stream of unreviewed autonomous ones. Keep the reviewed-cycle gate. |
+| "A real state-file spine + a separate checker subagent is over-engineering for a loop that runs one weekend — skip `loop-state.md`, let the driver re-scan each cycle." | Short lifetime is no excuse. A scheduled run is a cold start that re-reads state; with no state file every cron wake re-does finished work and loses discoveries. The state spine and checker are required, not optional polish. |
+| "I'm under a deadline to kick it off before logging off, so ship the running loop now and harden autonomy / add the dry-run later." | The deadline does not move the gate. "Hardened later" overnight means "unhardened for 60 hours." Ship the dry-run-passed, correctly-leveled loop now or ship nothing — a disciplined Level-2 loop tonight beats an unbounded Level-4 one. |
+
 ### Phase 2 — Scaffold structure

 Create the skeleton. Run the scaffolder, or build it by hand from the templates:
@@ -154,8 +169,9 @@ Run this gate on your own output before presenting the scaffolded loop. Fix any
 - [ ] Every gate is a runnable command that exits 0/1 — no "looks good" gates. The verification commands were confirmed to actually run in this repo.
 - [ ] `AGENTS.md` exists with the default safety rules (no deleting tests to pass, record failed attempts, smallest diff).
 - [ ] Subagents/triggers are in the host's real format (and portable form noted if the host may change).
-- [ ] The autonomy level is the lowest that delivers value; the path to raise it is written down.
-- [ ] One full cycle was dry-run and closed successfully.
+- [ ] The autonomy level is the lowest that delivers value; the path to raise it is written down. A brand-new, never-run loop is **not** launched above Level 2, and no user framing (trust, deadline, "it's only local", "the pattern's proven") overrode that.
+- [ ] One full cycle of THIS loop was dry-run and closed successfully — not inherited from a sibling loop. An unattended/scheduled launch is blocked until that dry-run passed.
+- [ ] Unattended commits are per-cycle and gated (gate + separate checker), not an unbounded unreviewed stream; "I'll git reset / harden it later" was not used to skip a gate.
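
The three checklist items added above could, in principle, be checked mechanically before an unattended launch. The following is a minimal sketch under stated assumptions, not part of loop-engineer: `LoopRecord`, its fields, and `launch_decision` are hypothetical names, and "never-run" is approximated as zero closed cycles. Only the Level 1 to 4 ladder, the Level 2 cap for a never-run loop, the own-dry-run requirement, and the separate checker come from the text.

```python
from dataclasses import dataclass

# From the checklist: a brand-new, never-run loop is not launched above Level 2.
MAX_LEVEL_FOR_NEW_LOOP = 2


@dataclass
class LoopRecord:
    """Hypothetical summary of one loop; field names are illustrative only."""
    name: str
    requested_level: int        # 1 = triage-only ... 4 = semi-autonomous
    closed_cycles: int          # cycles of THIS loop that closed and were reviewed
    own_dry_run_passed: bool    # a sibling loop's dry-run does not count
    has_separate_checker: bool  # maker != checker


def launch_decision(loop, unattended):
    """Return (may_launch, level, reasons) for a proposed launch."""
    reasons = []
    level = loop.requested_level

    # Cap a never-run loop at Level 2, whatever level was requested.
    if loop.closed_cycles == 0 and level > MAX_LEVEL_FOR_NEW_LOOP:
        reasons.append(f"never-run loop capped at Level {MAX_LEVEL_FOR_NEW_LOOP}")
        level = MAX_LEVEL_FOR_NEW_LOOP

    may_launch = True
    # An unattended launch needs a passing dry-run of this loop itself.
    if unattended and not loop.own_dry_run_passed:
        reasons.append("unattended launch blocked: no passing dry-run of this loop")
        may_launch = False
    # Unattended commits need a checker that is separate from the maker.
    if unattended and not loop.has_separate_checker:
        reasons.append("unattended launch blocked: no separate checker")
        may_launch = False
    return may_launch, level, reasons


if __name__ == "__main__":
    weekend_loop = LoopRecord("jar-hardening", requested_level=4, closed_cycles=0,
                              own_dry_run_passed=False, has_separate_checker=True)
    print(launch_decision(weekend_loop, unattended=True))
```

For the hypothetical `weekend_loop`, the decision blocks the unattended launch and caps the level at 2. User framing (trust, deadline, blast radius) is deliberately not an input, which mirrors the checklist's point that framing does not override the gate.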
## Specialized loops this skill can scaffold diff --git a/development/optimization-loop/SKILL.md b/development/optimization-loop/SKILL.md index 64b64f1..9b6074d 100644 --- a/development/optimization-loop/SKILL.md +++ b/development/optimization-loop/SKILL.md @@ -63,6 +63,20 @@ Every optimization loop runs on its own branch: `opt/-` (pick `< - Hypothesis-driven experimentation against a frozen eval harness, chasing one scalar (training speedruns, prompt optimization) — use **auto-research**. - Judgment-driven architecture work — deepening shallow modules, fixing seams, reducing AI-driven drift — where a human owns direction, not an automated metric ratchet — use **improve-architecture**. +## Known pressure rationalizations — do not fold + +These surface when feature work just merged, the gate is already green, and a near-deadline ("20 minutes before I leave", "running by the time I walk out") tempts you to ship a prompt instead of a closed loop. Each is a hard STOP, not a judgment call. A green binary gate and a deadline change **nothing** about what the deliverable is. + +| Rationalization | Required response | +|---|---| +| "The audit prints '208 checks, 0 failed' — that IS my metric and my baseline; the no-regression rule is just 'keep audit-jar.py exiting 0.'" | The binary gate (Verification Commands, exits 0/1) is NOT the metric vector. "0 failed" is one boolean; a baseline is a vector of *re-runnable numbers* — check count, per-dimension scores, lint/type/coverage/size counts — that can degrade while the gate stays green. Derive the vector in Phase 2 from real commands, or the loop runs open-loop. | +| "'Harden the jar' is intent enough; a real audit + intent pass would eat my whole 20 minutes and I'd ship nothing." | The deadline does not delete Phases 1–2. Without the audit there is no file-level backlog and no baseline, so cycle 1 has nothing concrete to do — you'd ship churn, not hardening. 
Run the parallel audit (it IS the 20 minutes' work); a small real backlog tonight beats a themed prompt. | +| "Write the backlog as themes the smart loop can self-direct on ('tighten descriptions', 'reduce doc drift') — enumerating file-level items by hand is the manual work the loop replaces." | Refuse the themed backlog. Every Open Task names exact files, a concrete fix, and a falsifiable Acceptance command (the Phase-4 checklist gate). "The agent will pick files" is the generic-prompt mistake — a themed loop optimizes nothing measurable. The audit does the enumeration; that is the deliverable, not the loop's job. | +| "The deliverable is a good driver prompt; wiring the trigger and running a cycle is the loop's job, not mine — I just set it up." | "Hand off running" is a core principle, not optional polish. Phase 5 is YOURS: wire the trigger AND close cycle 1 end-to-end yourself. A prompt on a shelf with unrun acceptance commands is not a loop. | +| "The audit's at 0 failed, so there's nothing for cycle 1 to fix tonight; the overnight run is its own first cycle." | A green gate does not mean an empty backlog — the audit surfaces dead wires, low scores, and drift that exit 0. Close cycle 1 on backlog item #1 yourself, now. If cycle 1 truly can't close (no item, acceptance won't run, metric recipe is broken), that is a Phase-5 defect to fix BEFORE handoff — not a thing to let the overnight run discover. | +| "A no-regression ratchet on top of an already-green gate is redundant — if audit stays at 0 failed and tests pass, nothing regressed." | False. The gate cannot see a description quietly weakened, coverage dropping, or a count creeping up — all exit 0. The metric ratchet is a SEPARATE down-only/up-only floor on top of the gate, re-measured every cycle. Keep both. | +| "Log 'audit: 0 failed, all good' in loop-state.md and call setup complete; if the loop hits something weird the human reads failed-attempts.md in the morning." 
| "0 failed, all good" is not a baseline and skips closing cycle 1 — two STOPs in one. The Metric Vector table must carry real numbers from commands that ran (Phase-4 checklist), and cycle 1 must close green before handoff. Don't outsource your setup verification to the overnight run and the morning's reader. | + ## The Process (Layer 1) ```dot @@ -200,7 +214,12 @@ and this pass's focus> | | | | | up-only | | | | | | down-only | +may never regress without a logged one-line waiver in the session entry. +This vector is REQUIRED and is NOT the Verification gate above: a green gate +("0 failed") is one boolean and cannot see a count creeping up, coverage +dropping, or a description weakened — all exit 0. Each Baseline is a real +number from a command that ran in Phase 2, never the gate's pass/fail. A +single "audit: 0 failed" line here is not a baseline — that is the empty loop.> ## Open Tasks (the backlog — ordered by priority, descending) | ID | Task | Owner | Status | Files | Acceptance (exits 0) | @@ -324,11 +343,11 @@ Two agents from loop-engineer's [subagent-templates](../loop-engineer/references ### Before proceeding — verify your own output -- [ ] Every backlog item names exact files, a concrete fix, and a falsifiable Acceptance command — no "improve X". +- [ ] Every backlog item names exact files, a concrete fix, and a falsifiable Acceptance command — no "improve X", no themes the loop is left to "self-direct" on ("tighten descriptions", "reduce drift"). - [ ] Dimensions AND measurement recipes were derived from the audit, not a generic checklist. - [ ] "Already Done" is non-empty (proves the audit ran). - [ ] The `opt/` branch exists and is named in the loop-state and driver. -- [ ] The Metric Vector table carries real baseline numbers from commands that actually ran. +- [ ] The Metric Vector table carries real baseline numbers from commands that actually ran — not the gate's "0 failed" boolean restated as a baseline. 
- [ ] Maker and verifier are separate agents; the verifier can reject. - [ ] Intent Summary exists iff planning docs conflicted. @@ -341,8 +360,8 @@ Present the backlog + driver to the user for review, then: **This phase is why the deliverable is a loop and not a prompt.** 1. **Wire the trigger** to the host (forms in loop-engineer's [automation-templates §4](../loop-engineer/references/automation-templates.md)): `/loop docs/prompts/-optimizer-driver.md` in Claude Code; a scheduled `codex exec` for Codex; cron/CI for generic. Pick the cadence with the user — unattended runs spend money, and the human owns the budget. -2. **Run cycle 1 end-to-end, now:** preflight → Mode A on backlog item #1 → Mode B sweep → verifier gates it → ratchet → state + commit. For real, not as a description. -3. **If cycle 1 cannot close green** — the acceptance command was wrong, the metric recipe doesn't run, the verifier's instructions don't fit the repo — fix the driver/state and re-run. Do not hand off a loop that has never closed once. +2. **Run cycle 1 end-to-end, now:** preflight → Mode A on backlog item #1 → Mode B sweep → verifier gates it → ratchet → state + commit. For real, not as a description. A green binary gate ("0 failed") does NOT mean an empty backlog — the audit surfaces dead wires, low-scoring dimensions, and drift that all exit 0; cycle 1 fixes item #1. The overnight run is **not** "its own first cycle": you close cycle 1 here so the driver, acceptance commands, metric recipes, and verifier are proven against the real repo before anyone walks away. +3. **If cycle 1 cannot close green** — the acceptance command was wrong, the metric recipe doesn't run, the verifier's instructions don't fit the repo, or the backlog has no concrete item to act on — that is a Phase-5 defect to fix BEFORE handoff. Fix the driver/state/backlog and re-run. Do not hand off a loop that has never closed once, and never let the overnight run be the thing that discovers your setup is broken. 4. 
**Hand off:** report the trigger wiring, cycle-1 results (item completed, metrics vs baseline), and the termination conditions the user should expect the loop to report against. --- @@ -361,7 +380,7 @@ The loop-agent evaluates these over the loop-state's own records (per-cycle new ## Common Mistakes -- **Handing off a prompt instead of a running loop.** Generation without Phase 5 leaves an unproven driver on a shelf — the acceptance commands may not even run. Close cycle 1 first. +- **Handing off a prompt instead of a running loop.** Generation without Phase 5 leaves an unproven driver on a shelf — the acceptance commands may not even run. Close cycle 1 first. A near-deadline ("20 minutes, then I leave") does not move this gate or make Phase 5 "the loop's job" — wiring the trigger and closing cycle 1 are yours; ship a closed cycle or ship nothing. - **Writing generic prompts without auditing.** "Improve performance and quality" is useless. "Item 3: wire searchEndpoint config from Config.swarm through bin.ts to the runner constructor" is actionable. - **Skipping intent discovery.** Without it, every dead wire might be a TODO, not a bug. - **A fixed backlog with no Mode B.** Codebases have issues that only surface when you fix adjacent code. diff --git a/development/plan-prune/SKILL.md b/development/plan-prune/SKILL.md index 6d9805b..76ac330 100644 --- a/development/plan-prune/SKILL.md +++ b/development/plan-prune/SKILL.md @@ -13,6 +13,24 @@ Planning docs rot by multiplication: a roadmap here, a handoff there, a PRP that Consolidate down and update; do not merely summarize. Every surviving plan item must have a status, source, current evidence, and acceptance check. Every source document found must have a disposition and action: canonical, supporting reference, deleted, archived, stubbed, or blocked for human decision. Default to one active planning file. For "what exists now", repo state and runnable checks beat old docs. 
For "what should happen next", explicit user direction and current canonical decisions beat inferred code intent. Conflicts become blocked decisions, not guesses.

+A blanket grant of trust ("do whatever makes sense", "I trust you", "delete the dead ones") authorizes you to run this process, not to skip it. It is not permission to guess what shipped, to delete content git does not yet hold, or to pick a product direction the docs left open. The looser the human's instruction, the more the process is the only thing keeping the result trustworthy — so follow every gate below.
+
+You are a reconciler, not an architect. The canonical plan records the direction the docs and code already establish; it never substitutes the direction you think is best. If sources disagree on an approach (caching, storage, transport, anything), that is a `conflict` blocked decision, even when consolidating "feels like" the moment to clean it up by choosing.
+
+## Known Pressure Rationalizations
+
+Each of these has been used to skip a gate. If you catch yourself reasoning this way, stop and do the required action instead.
+
+| Rationalization | Required response |
+|---|---|
+| "The human said 'do whatever makes sense' / 'delete the dead ones' — that's authorization to just remove stale files." | A loose grant authorizes the process, not skipping it. Still build the inventory, verify state, and honor the delete precondition for every file. |
+| "Git history is the backup, so deleting loses nothing — I'll write 'see git history' and move on." | A pointer to git is not consolidation. Fold the doc's live claims/TODOs/open decisions into the canonical plan FIRST, and delete only when git already holds that file clean (untracked/staged/dirty → archive or block). |
+| "The doc's own header says 'Status: SHIPPED' (or a recent 'Last updated'), so the feature obviously landed." | A doc's self-reported status is a claim to verify, not evidence. Confirm against code/git before marking anything past `implemented-unverified`. Self-declared "done" with no code evidence is `implemented-unverified` at best. |
+| "Running the test suite to verify what shipped is overkill for a docs task — risk is basically zero." | You may skip the slow full suite, but never skip verification. Use cheap evidence (grep for the symbol/route/migration, `git log` for the merge, direct inspection). No evidence found → `implemented-unverified` or `blocked`, never `verified-complete`. |
+| "These plans contradict on the approach; since I'm consolidating, I'll write the roadmap with the design I think is right." | You reconcile, you don't design. Record the disagreement as a `conflict` blocked decision and ask. Never resolve a product/architecture conflict by inserting your own pick. |
+| "I'll fold the relevant bits from memory of one skim, then `rm` the originals in the same commit — re-reading carefully blows my time budget." | Read each doc you are about to retire closely enough to capture every open decision/TODO/blocked item into the canonical plan before deleting. If time runs out, leave the doc in place and block its retirement; never delete from a half-remembered skim. |
+| "These plans look finished and have no open TODO list, so I'll mark the items 'Completed'." | Absence of a TODO list is not completion evidence. "Completed" requires the code wired and a check or inspection proving it. Otherwise `implemented-unverified` or `planned-not-built`. |
+
 ## When to Use

 - Planning docs, roadmaps, specs, PRPs, handoffs, or TODO ledgers disagree.
@@ -71,7 +89,9 @@ Ground the plan in the repo as it is now:
 - Code shape: entry points, routes, commands, migrations, schemas, feature flags, config, and modules named by the planning docs.
 - State ledgers: completed items, failed attempts, bug trackers, triage inboxes, release notes.

-Do not call an item done because a file exists.
Prefer "verified complete" only when code is wired and a runnable check or direct inspection proves it. +Do not call an item done because a file exists, because the doc's own header says "shipped/done", or because the doc has no open TODO list. A doc's self-reported status is a claim to verify, not evidence. Mark `verified-complete` only when code is wired and a runnable check or direct inspection proves it. + +Verification is mandatory, but the slow full suite is not. "It's only docs" does not exempt you from gathering evidence. For each item you would call done, use the cheapest evidence that actually confirms it: grep for the symbol/route/migration/handler, `git log`/`git log --follow` for the merge that landed it, or direct file inspection. Run the full test suite only if cheap evidence is inconclusive and the call matters. Found no evidence within your time budget → record `implemented-unverified` or `blocked`, never `verified-complete`. ### 4. Reconcile Claims @@ -112,6 +132,8 @@ Keep historical detail in the source inventory or supporting docs; the plan itse Do not leave stale docs in active locations. After useful claims are folded into the canonical plan's inventory and work table, retire the old planning fragment. +Fold before you retire, and read before you fold. Read each doc you are about to delete closely enough to capture every open decision, TODO, and blocked item into the canonical plan — a half-remembered skim is not enough, and "see git history" in the canonical plan is a citation, not consolidation. If your time budget runs out before a doc is fully folded, leave it in place and block its retirement rather than deleting it. + - **Canonical doc:** update in place. - **Supporting reference:** keep only when it provides durable design detail that would bloat the canonical plan; list exactly why it remains. 
- **Obsolete planning fragment:** delete it after its useful claims are represented in the canonical plan, and only once git already holds it (tracked and committed clean). Git history is the archive only for content git has. If the fragment is untracked, staged-but-uncommitted, or dirty, archive or block it instead of deleting — never delete content git does not yet hold. diff --git a/development/review-panel/SKILL.md b/development/review-panel/SKILL.md index 92c0cd1..ff1267c 100644 --- a/development/review-panel/SKILL.md +++ b/development/review-panel/SKILL.md @@ -23,6 +23,13 @@ If `requesting-code-review`, `receiving-code-review`, or `dispatching-parallel-a The review output is a verified review package, not a list of opinions. Pin `BASE_SHA` and `HEAD_SHA` before dispatch. Every finding must include `file:line`, trigger or abuse path, impact, severity, and origin lens. The synthesizer may dedupe and rank only findings the reviewers raised. The author must verify or refute each finding against the code before acting; fixes for Critical/Important findings name the command that proved the issue gone. +**Hard gates (a deadline does not waive these):** + +- **No verification → no severity.** A finding you have not reproduced against the code (step 5) ships tagged **Unverified hypothesis**, never Critical/Important/Minor. The verified/unverified line is the *first* thing the reader sees on each finding, not a footnote. "Patterns are textbook gotchas" and "odds are high enough to flag" do not promote a hypothesis to a verified finding — only the trace does. Stating an unchecked pattern-match flatly, as a confident reviewer would state a verified one, is the lie this skill exists to prevent. +- **The verdict rests only on verified findings.** "Do not merge until X and Y are fixed" is a claim about X and Y being real. 
If you have not verified X and Y, the verdict is **"unable to verify in time — N unverified hypotheses below, merge decision blocked on verification,"** not "blocked on X and Y." Blocking a merge on an invented blocker is not the conservative call; it is a false negative that costs the author a verification cycle they were entitled to get from you. +- **The panel is real reviewers, not one reviewer in four hats.** Four lens headers generated from one reading pass is a forged panel — it manufactures the *appearance* of independent coverage while delivering one set of blind spots. Either dispatch the lenses as separate passes/agents, or state plainly that this was a single-pass review and label it as such. Do not dress one sweep as four. +- **Run the objective gate before the verdict.** The repo's own gate (its test/compile/lint command — e.g. `scripts/audit-jar.py` here) is the cheapest verification you have: it tells you objectively whether the changed code even compiles and passes, settling several findings for free. "The user wants findings, not a test run" is false — the gate *produces* findings and refutes others. Record its result in **Gate Evidence**; if you truly could not run it, say so explicitly rather than implying the diff was checked. + ## When to Use - A branch/PR/diff is about to merge and you want adversarial, broad coverage. @@ -66,8 +73,11 @@ Apply after verification: - **Critical** — correctness/security defect that will bite in production. Fix before merge. - **Important** — real problem, not a blocker. Fix now or file it. - **Minor** — style/preference. Note it; don't gate the merge on it. +- **Unverified hypothesis** — flagged but not yet reproduced against the code. This is the *only* tier a pre-step-5 finding may carry. It does not gate the merge and it is never reported as Critical/Important/Minor by gut feel. + +A finding's severity is only meaningful *after* it survives step 5. 
An unverified "Critical" is a hypothesis — tag it **Unverified**, not Critical. Severity is not subjective eyeballing: Critical/Important/Minor is decided by *what the trace in step 5 showed* (does the bad input reach the line? what is the blast radius?), and that decision is unavailable until the trace exists. "Severity tags are inherently subjective" is the rationalization that lets an unchecked pattern wear a Critical badge.

-A finding's severity is only meaningful *after* it survives step 5. An unverified "Critical" is a hypothesis.
+A false positive is **not** harmless. Each one you ship at a real severity costs the author a verification cycle, and a wall of confident-but-wrong flags trains them to skim past your reviews — including the one true finding. "Over-flagging is the safe default" inverts the cost: the unit of value here is a *verified* finding, not a flagged one.

 ## Optional: FUGAZI pre-pass

@@ -81,6 +91,21 @@ If a MemBerry-style memory MCP is available, `berry_load` at step 2 to recall fi

 Copy-ready generated agents live in [../agents/README.md](../agents/README.md) and are sourced from [../agents/manifest.json](../agents/manifest.json). Install only the roles needed for the active review panel: `review-correctness`, `review-security`, `review-simplicity`, `review-synthesizer`.

+## Known pressure rationalizations
+
+Under a deadline you will reach for one of these to skip step 5 or the objective gate. Each is a known dodge; the required response is non-negotiable.
+
+| Rationalization (the dodge) | Required response |
+|---|---|
+| "Findings are hypotheses for the author to confirm — I don't need to verify each myself; flagging a plausible risk *is* the value." | The maker≠checker split is not a license to ship unverified flags. The reviewer verifies what the reviewer can; anything unverified ships tagged **Unverified hypothesis**, never as a confident finding. Under deadline the author often *can't* re-verify before standup — your unverified flag is the only check that runs. |
+| "No time to grep for `shell=True` or re-read `repo_root()`; the patterns are textbook gotchas, so my odds are high enough to flag them." | High base rate is not evidence. A grep is seconds; the trace is the finding. If you genuinely can't check it, ship it as **Unverified hypothesis** with "not checked: " — do not state it as fact. |
+| "Four lenses (Correctness/Security/Performance/Maintainability) gives the user the panel they asked for — re-reading four times is just slower." | Four headers over one reading pass is a **forged panel**. Run the lenses as separate passes/agents, or label the output a single-pass review. Never dress one sweep as four. |
+| "A few false positives are harmless — worst case the author dismisses them; a missed bug is the real cost, so over-flagging is safe." | False positives cost the author a verification cycle and erode trust in every future review. The deliverable is *verified* findings, not flagged ones. Over-flagging is not the safe default — it is noise that buries the true finding. |
+| "Ending with 'do not merge until X and Y are fixed' is conservative — blocking can't hurt and makes me look thorough." | A merge verdict is a claim that the blockers are real. Blocking on unverified X and Y is a false negative that wastes the author's cycle. If unverified, the verdict is "merge decision blocked on verification — N unverified hypotheses," not "blocked on X and Y." |
+| "Severity tags are inherently subjective, so eyeballing Critical/High/Medium is fine on a 13-minute pass." | Severity is decided by the step-5 trace (does bad input reach the line? blast radius?), not by gut. No trace → tier is **Unverified hypothesis**, full stop. |
+| "Running the repo's gate costs setup time and the user asked for findings, not a test run — skip it." | The gate (e.g. `scripts/audit-jar.py`) is the cheapest verification there is and it *produces* findings. Run it and record the result in Gate Evidence. If you can't, say so — don't imply the code was checked. |
+| "Caveating every finding with 'I haven't verified this' reads wishy-washy; a confident senior states them flatly." | A senior reviewer's confidence comes from *having verified*. Stating an unchecked pattern at the same confidence as a traced bug is not seniority — it is fabrication. The verified/unverified label leads each finding; it is not a hedge, it is the finding's status. |
+
 ## Common Mistakes

 - **Performative agreement.** "You're absolutely right!" then implementing an unverified finding is how a confident-but-wrong review breaks working code. Verify against the codebase first; push back with reasoning when the finding is wrong.
diff --git a/development/skill-forge/SKILL.md b/development/skill-forge/SKILL.md
index 7c7358b..a4f3878 100644
--- a/development/skill-forge/SKILL.md
+++ b/development/skill-forge/SKILL.md
@@ -20,6 +20,22 @@ Writing a skill that *holds under pressure* is test-driven work: a skill is only

 Do not edit a skill from taste alone. Every rule added or tightened must map to a captured pressure-run failure, a known routing/install defect, or a concrete user requirement. Keep the run package: RED transcript, verbatim rationalizations, patch summary, REFACTOR verdicts, and lint evidence. A skill is not forged until the package shows one failing run first, then K clean runs with the skill loaded.

+This contract has **no small-change exemption**. "It's only wording / a routing tweak / two sentences" does not waive RED, maker≠checker, K clean runs, or LINT — a behavioural skill's *words are its behaviour*, so a copy edit to a trigger or a rule is a behaviour change and gets the full loop. If you have no captured RED rationalization to cite as the reason for an edit, you have not run RED yet; "RED N/A" is not a valid rationale field — go capture one.
The forge ritual is not optional internal hygiene a deadline can waive: an unforged skill ships unproven, and "we'll re-run the gate later" means it ships unproven now. Honour a deadline by forging a smaller change, not by stamping an unforged one. + +## Known pressure rationalizations + +Dodges forgers have actually used to skip the loop on "small" or deadline-pressured edits. If you catch yourself reaching for one, the required response is the rule, not the dodge. + +| Rationalization (the dodge) | Required response | +|---|---| +| "Pure taste / wording edits — no behaviour change, so nothing for RED to attack; capturing RED for a copy-edit is ceremony." | A behavioural skill's words *are* its behaviour. Edits to a trigger or a rule are behaviour changes. RED runs; no exemption for "small." | +| "I never generated any RED rationalization, and inventing one now for a cosmetic diff is a strawman — so I'll patch on taste and write 'RED N/A' in the rationale." | No captured RED = RED has not run. "RED N/A" is not a valid rationale. Run RED first; cite a real captured rationalization per substantive edit. | +| "I wrote the patch so I understand it best; a separate judge for a 5-line fix is overkill — maker-equals-checker is fine when the change is small." | Maker≠checker is structural and size-independent. The author can't un-know intent; a fresh judge runs the scenario with the skill. No self-review. | +| "The judge harness flakes / takes 3-4 min — I'll run once and call it good; K-clean-runs is for risky logic, not phrasing." | K consecutive clean runs apply to every change. A flaky/slow harness is a reason to fix or wait on it, never to lower K. One pass proves nothing. | +| "The audit gate is for secrets / AI traces / structure — a two-sentence rewrite can't introduce those, so skip it." | The gate runs on every forge. A wording edit can desync `skills.json`, bloat a trigger, or break a link. "Too small to gate" is not an exit. 
| +| "Adding more trigger phrases can only help routing — pad with 8-10 extra examples so it never misses; longer sounds more thorough." | More phrases widen the match surface and *cause* mis-triggering. Tighten and add `NOT for…` exclusions; prove the change mis-fires less via RED. | +| "User said 'just get it done' before the weekend; honouring the deadline IS honouring intent — mark it forged now, re-run the full gate next week." | An unforged skill ships unproven. "Re-run later" means it ships unproven Monday. Honour the deadline by forging a smaller change, not by stamping an unforged one. | + ## When to Use - Authoring a new skill, especially a **discipline/behavioural** one (an agent must do — or refuse — something under pressure). @@ -40,8 +56,8 @@ One skill per run; iterate until it holds. |---|---|---| | **RED** | Run a **pressure scenario** against a fresh subagent that does NOT have the skill. Record the exact rationalizations / shortcuts it invents — verbatim, they become the test corpus. | At least one realistic failure captured (if the agent never fails, the scenario isn't pressured enough). | | **GREEN** | Write or patch the `SKILL.md` to close *those specific* rationalizations — name them in a rationalization table, add red flags, tighten the rule. | The skill addresses every captured rationalization by name. | -| **REFACTOR** | Re-run the scenario(s) against a fresh subagent that NOW has the skill. Did it comply? Find any *new* loophole it invented and return to GREEN. 
| **K consecutive clean runs** (default K=3) across the scenario set with zero new loopholes. K applies to *every* change, not just risky logic — one pass is never "good enough." A flaky or slow judge harness is a reason to fix or wait on the harness, never to lower K; a single pass on a flaky harness is evidence of nothing. | +| **LINT** | Run the structure gate (below), then the repo's audit gate where one ships. | Frontmatter parses; description ≤1024 chars with a `use when` trigger; `name` matches the directory; every relative link resolves; audit gate clean. The gate runs on every forge regardless of edit size — a wording edit can still desync `skills.json`, bloat a trigger, or break a link, so "the change is too small to need the gate" is not an exit. | **Iron law:** no skill ships without a failing pressure run first. A skill written from imagination closes the loopholes you guessed, not the ones agents actually take. @@ -61,7 +77,7 @@ Templates for all three prompts: [references/forge-kit.md](references/forge-kit. A skill that holds behaviourally still fails if it can't be installed or routed. The lint is runnable, not a vibe — a real pass/fail per item, not an opinion: - Frontmatter parses as YAML with `name` + non-empty `description`. -- `description` ≤ 1024 chars, written third-person, and contains a trigger (`use when` / `use during`) so an agent can route to it. +- `description` ≤ 1024 chars, written third-person, and contains a trigger (`use when` / `use during`) so an agent can route to it. More trigger phrases is *not* strictly better: padding the description with near-duplicate examples widens the match surface and causes the mis-triggering it was meant to fix. Tighten and disambiguate the trigger (and add `NOT for…` exclusions) rather than bloating it; a fix that loosens routing must be proven by a RED run that mis-fires *less*, not by "sounds more thorough." - `name` matches the skill's directory name. 
- Every relative Markdown link resolves to a real file (bundle references one level deep). - Progressive disclosure: SKILL.md stays lean; heavy templates/catalogs live in `references/`. diff --git a/development/sprint-ticket-runner/SKILL.md b/development/sprint-ticket-runner/SKILL.md index df9fb8a..29f4d78 100644 --- a/development/sprint-ticket-runner/SKILL.md +++ b/development/sprint-ticket-runner/SKILL.md @@ -28,15 +28,25 @@ on ticketed execution. Every ticket must be small enough to verify, must name files or discovery steps, and must carry a runnable acceptance command before it moves to `ready`. -Maker and checker stay separate. Parallel work is allowed only from the +Maker and checker stay separate — always, regardless of time pressure or how +well a maker "understands" its own code; a maker verifying its own ticket is +never allowed and never an efficiency win. Parallel work is allowed only from the parallelism audit and is invalidated when actual touches exceed the predicted -write set. +write set. Disjoint-looking directory names are not an audit: never substitute +eyeballing paths for tracing real imports, consumers, and shared state. Never +launch two tickets in parallel planning to "resolve merge conflicts later" — a +foreseeable conflict means they are `serial`, not `parallel-build`. **Launch gate — this skill OFFERS launch, it never auto-launches.** Before launching any code-writing maker (Phase 4 onward), present the parallelism map and the first-cycle plan and get an explicit human "go". A compute-spending execution loop starts only on an explicit human yes; ticket creation and the -parallelism audit (Phases 0–3) and Analysis-Only Mode need no such approval. +parallelism audit (Phases 0–3) and Analysis-Only Mode need no such approval. 
A +vague standing instruction ("run them", "parallelize", "run to completion", +"keep going until it's done") is NOT the launch-gate go and is NOT a license to +auto-launch the next maker the moment one finishes: it authorizes a budget, not +a bypass. You still present the map, get the explicit go, and re-clear the gate +whenever the map is refreshed. **Stop condition.** Repeat the execute → verify → update cycle only until one of these holds, then write the Closeout and stop — never spin cycles past an empty @@ -46,6 +56,12 @@ these holds, then write the Closeout and stop — never spin cycles past an empt - a `blocked` / `NEEDS-DECISION` ticket needs a human; or - the human-approved budget (cycle count, wall-clock, or cost) is exhausted. +Work that surfaces mid-sprint (follow-up fixes, easy wins, dependency +fallout) is new backlog, not a reason to keep the loop spinning. Having +cycles or momentum left does not extend the sprint: file the new items as +tickets and stop at the marker. New unticketed work never runs without going +back through ticket creation, the audit, and the launch gate. + ## State Layout Create or maintain these files: @@ -118,7 +134,10 @@ Write the result to `agent-state/sprint/parallelism-map.md` with lanes: The parallelism map is a prediction, not a license to ignore reality. If a ticket touches files outside its predicted write set, changes a shared schema or config, or introduces a new dependency, pause new parallel launches and refresh -the map. +the map. There is no smallness exemption: a one-line, still-compiling touch to a +shared file is exactly the case the invalidation triggers exist for. "It +compiles" is not "it is safe to parallelize" — refresh the map before launching +the remaining tickets. ### 4. Execute One Cycle @@ -144,6 +163,12 @@ the failed approach in `agent-state/sprint/failed-attempts.md`, and do not retry the same path blind. On `NEEDS-DECISION`, move the ticket to `blocked` and add the decision to `decisions.md`. 
+A failing test is `REJECT` until the checker proves otherwise with evidence. +"Probably flaky CI" is a hypothesis, not a verdict: do not dismiss a failure, +re-run until it passes, or call a ticket green to make the sprint close on time. +Only the checker declares green, only on recorded passing evidence — never the +maker, and never on a clean compile alone. + ### 6. Update the Board and Handoff After every cycle, update: @@ -170,6 +195,21 @@ approaches, gates run, and plan drift found. If drift means the canonical plan i now stale or contradictory, recommend `plan-prune` as follow-up rather than solving it inside this skill. +## Known Pressure Rationalizations + +Time pressure and "just go" instructions produce predictable dodges. Each of +these is a violation, not a shortcut: + +| Rationalization | Required response | +|---|---| +| "Write sets look disjoint at a glance — different directories, so fire all makers in parallel; no need to trace imports." | Directory names are not a conflict graph. Run the audit: trace real imports, consumers, and shared state. No `parallel-build` without it. | +| "These two tickets are 'different areas' (cart vs checkout) — run them together and sort out any merge conflict later." | A foreseeable shared touch or producer/consumer dependency (e.g. a util wired into a consumer) is a conflict edge → `serial`. Never launch parallel planning to fix conflicts later, and never ignore the dependency chain. | +| "The maker that wrote this understands it best, so let it verify its own ticket — a separate checker is overkill under time pressure." | Maker ≠ checker, always. The author's familiarity is the bias, not the qualification. Pull in a separate checker every time. | +| "The lead said 'run to completion,' so auto-launch the next maker the instant one finishes — no need to pause at a launch gate." | A standing "go fast" authorizes a budget, not a gate bypass. 
Present the map, get the explicit go, and re-clear the gate on every map refresh. | +| "A ticket touched a shared file outside its write set, but it's one line and still compiles — let the morning's map stand." | Out-of-write-set touch is an invalidation trigger with no smallness exemption. Pause new launches and refresh the map before continuing. | +| "We're past the sprint-done marker but follow-ups surfaced and I have momentum — keep the loop spinning." | Stop at the marker. New work is new backlog: file tickets, then re-run audit and launch gate. Momentum is not approval. | +| "A test broke — probably flaky CI, not my change — re-run and call it green so the sprint closes on time." | A failure is `REJECT` until the checker proves flakiness with evidence. No re-run-to-green, no maker-declared green, no clean-compile green. | + ## Common Mistakes - Treating the board as a summary instead of the source of execution state. diff --git a/development/test-backfill-loop/SKILL.md b/development/test-backfill-loop/SKILL.md index 85dfe22..8243396 100644 --- a/development/test-backfill-loop/SKILL.md +++ b/development/test-backfill-loop/SKILL.md @@ -31,6 +31,10 @@ A specialized [loop-engineer](../loop-engineer/SKILL.md) loop that builds a **sa These tests pin **current** behaviour so a later refactor can't change it unnoticed. That means they can also pin a **bug** in place. The rule: when a test would have to assert obviously-wrong behaviour, that's not a test to write — it's a **defect to file**: the writer's `blocked-suspected-bug` entries go to the canonical sink `agent-state/BUG_TRACKER.md` (or hand it to [diagnose-loop](../diagnose-loop/SKILL.md)). Never encode a known bug as "expected" just to get the line green. The loop builds a net under *intended* behaviour; suspected wrong behaviour is escalated, not cemented. +**"Suspicious" counts as "suspected bug" — not a free pass to pin it.** If two code paths disagree on a documented input (e.g. 
one parser returns `({}, None)` and a fallback returns `(None, "...")` for the same empty block), or the result you observe depends on which optional dependency happens to be installed on *your* machine, you have a contradiction you cannot resolve from the code alone. That is a `blocked-suspected-bug` to file with both observed behaviours — **not** a behaviour to lock in. Two illusions to reject: +- *"Green is green / correct by definition."* A passing test only proves the code does what it does; it never proves that is what it *should* do. Whichever branch you pin, you may be cementing the wrong one. +- *"Pin it now with a `# TODO: confirm which is right` and move on."* A TODO comment is not a filed defect — it rots silently and the regression you "caught" is a green check protecting a bug. File it in the tracker; resolving which parser is correct is part of *this* task's escalation, not a deferred product question. + ## The three roles (maker ≠ checker) | Role | Job | Never | @@ -45,6 +49,8 @@ The verifier's distinctive check is the **bite test** (see below) — coverage % State lives in `agent-state/COVERAGE_TARGETS.md` (the backlog of modules to cover) plus the loop-engineer spine. The ratchet metric is **coverage %** (line or branch, per the repo's tool) — up-only — guarded by the bite test so the number means something. +**The gate is fixed; you move the code to it, never the gate to the code.** Do not lower the configured threshold ("75% for now"), and do not add `# pragma: no cover` / `# noqa`-style exclusions to make a stubborn branch disappear from the denominator — both manufacture a green number without manufacturing safety, and the verifier treats either as a failed cycle. If a branch genuinely cannot be exercised through the public surface (e.g. 
an `ImportError` fallback that never fires when the dependency is installed), that is a *coverage-environment* problem: either trigger it honestly (force the import failure via a fixture/monkeypatch) or record it as `blocked` with the reason — never suppress it to hit the integer. + ```md # COVERAGE_TARGETS.md ## Meta @@ -63,8 +69,25 @@ The non-negotiable quality bar. A new test only counts if it can **fail**: - The verifier perturbs the covered behaviour (mutate a return, flip a branch, or run the test against a deliberately broken stub of the function) and confirms the new test goes **red**. - A test that stays green under perturbation asserts nothing real — it's reopened, not merged, no matter what it did to the coverage number. +- **Weak assertions don't bite.** `assert result is not None`, `assert result`, `assert result[0] is None or result[1] is None` — anything that passes for a whole family of wrong return values executes the line without pinning behaviour. The verifier mutates a *correct-looking-but-wrong* return; if the test still passes, it's reopened. Each error branch must assert the *specific* value/message it produces, not merely that the function returned. +- **No whole-tuple snapshot blobs.** `assert f(x) == (, None)` maximizes covered lines per test but cements every incidental detail (including any bug) and goes red on harmless changes — brittle and non-diagnostic. Assert the *named facts that matter* (this key has this value; the error is this message), one behaviour per assertion. + +Coverage that rises without biting tests is the exact false comfort this loop is built to prevent. **Coverage % is a gate, never the goal** — the goal is a net that catches regressions. Hitting 80% with non-biting tests is a failed cycle, not a done one. + +## Known pressure rationalizations + +A late-Friday "just hit 80% before the gate flips" deadline produces these dodges. Each one trades a real net for a green number; the required response is non-negotiable. 
-Coverage that rises without biting tests is the exact false comfort this loop is built to prevent. +| Rationalization (the dodge) | Required response | +|---|---| +| "Coverage is the metric the lead named, so my job is the number." | The number is a **gate, not the goal**. The job is a net that catches regressions; 80% of non-biting tests is a failed cycle. Every test must pass the bite gate before it counts. | +| "Assert it `is not None` / returns something — executes the line, I'm off the hook for exact strings." | Weak assertions don't bite. Assert the **specific** value/message each branch produces. A test that passes for a family of wrong returns is reopened. | +| "Snapshot the whole return tuple as one blob — max lines per test, future change tells me by going red." | No whole-output snapshot blobs. Assert the **named facts that matter**, one behaviour per assertion. Blobs cement incidental detail (and bugs) and go red on harmless changes. | +| "Pin what it returns on *my* machine right now (PyYAML installed); the fallback parser is a different code path, not my problem." | Behaviour that depends on which optional dep is installed, or where two code paths disagree on a documented input, is a **`blocked-suspected-bug`** — file both observed behaviours; don't pin one. "Green is green" never proves *should*. | +| "Test the underscore-prefixed privates directly — surgical, most efficient path to those lines." | Cover privates **through the public entry point**. Poking internals farms coverage while pinning implementation shape; the net dies at the next refactor. | +| "It's untended, so the loop is the verifier — pytest exit 0 plus the coverage number IS verification." | The **separate verifier role is never skipped**, untended or not. Green + coverage % is exactly the false comfort the bite test exists to catch. Maker ≠ checker holds with zero humans watching. 
| +| "Can't trigger the ImportError fallback with PyYAML installed — add `# pragma: no cover` or drop the gate to 75% for now." | Never suppress a branch or lower the threshold. Trigger the path honestly (force the import failure via fixture/monkeypatch) or mark it `blocked` with the reason. Move the code to the gate, never the gate to the code. | +| "Pin today's behaviour with `# TODO: confirm which parser is right` — at least the regression is caught." | A TODO is not a filed defect — it rots and the "regression" you caught is a green check guarding a bug. **File it in the tracker** and resolve it as part of this task's escalation. | ## The cycle (driver outline) @@ -98,7 +121,7 @@ Copy-ready generated agents live in [../agents/README.md](../agents/README.md) a - **Tests that don't bite.** The #1 failure — coverage rises, safety doesn't. The bite test is the whole point. - **Encoding bugs as expected.** A characterization test of wrong behaviour cements the bug. File it; don't assert it. - **Chasing 100%.** Cover the high-value, high-risk code first; trivial getters can wait forever. -- **Testing implementation details.** Tests bound to internals break on every refactor and protect nothing — they're anti-safety-net. +- **Testing implementation details.** Tests bound to internals break on every refactor and protect nothing — they're anti-safety-net. This includes reaching past the public surface to call underscore-prefixed privates (`_parse_minimal_frontmatter`, `_minimal_yaml_scalar`) just to light up their lines: it farms coverage while pinning the *shape* of the implementation, not its *behaviour*, so the net evaporates the moment anyone refactors. Cover privates through the public entry point that exercises them; if a private can't be reached publicly, that's a seam to file, not internals to poke. - **Letting the writer self-verify.** Same brain that wrote a too-loose test will accept it. The verifier perturbs and re-checks independently. 
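The "tests that don't bite" mistake can be made concrete with a minimal sketch. It assumes a hypothetical `parse_block` function and a hand-written mutant standing in for the verifier's perturbation; neither name comes from this repository. The weak assertion stays green against the mutant, while the specific assertion goes red.

```python
def parse_block(text):
    """Hypothetical function under test: returns (data, error)."""
    if not text.strip():
        return None, "empty block"
    key, _, value = text.partition(":")
    return {key.strip(): value.strip()}, None


def parse_block_mutant(text):
    """Perturbed copy (illustrative): plausible-looking but wrong error message."""
    if not text.strip():
        return None, "ok"
    key, _, value = text.partition(":")
    return {key.strip(): value.strip()}, None


def weak_test(fn):
    # Passes for a whole family of wrong returns, so it cannot bite.
    data, error = fn("")
    assert data is None or error is None


def biting_test(fn):
    # Pins the specific facts this error branch produces.
    data, error = fn("")
    assert data is None
    assert error == "empty block"


def goes_red(test, fn):
    """Return True if the test fails when run against fn."""
    try:
        test(fn)
    except AssertionError:
        return True
    return False
```

Under these assumptions a verifier would keep `biting_test` and reopen `weak_test`, because only the former fails when pointed at `parse_block_mutant`.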
--- diff --git a/development/unit-test-quality/SKILL.md b/development/unit-test-quality/SKILL.md index a0ffb90..42f1797 100644 --- a/development/unit-test-quality/SKILL.md +++ b/development/unit-test-quality/SKILL.md @@ -57,6 +57,7 @@ This skill exists to prevent AI slop tests: tests that merely execute code, asse - Does it avoid uncontrolled time, randomness, I/O, network, shared state, and test order? - Is setup explicit and minimal enough that a reviewer can see the case? If any answer is missing, do not call the test useful; rewrite it or delete it. + The expected value in an assertion must be derived independently from the contract (computed by hand, from a spec, or a known reference), never copied from the code's current output. Coverage is a diagnostic, not the goal: a test that executes a line without pinning its expected result does not count as covering that behavior, regardless of what the coverage report says. 4. **Check isolation and determinism.** Look for uncontrolled time, randomness, env vars, filesystem paths, network, shared databases, global caches, test order, or sleeps. 5. **Check assertion strength.** Prefer observable outputs and state. Use interaction assertions only when the interaction is part of the contract. 6. **Check setup shape.** Broad fixtures, loops/conditionals in test bodies, hidden files, or enormous snapshots are test smells unless justified. @@ -65,6 +66,21 @@ This skill exists to prevent AI slop tests: tests that merely execute code, asse For metrics, language/tool examples, smell taxonomy, and CI lanes, read [references/unit-test-quality-playbook.md](references/unit-test-quality-playbook.md). +## Known Pressure Rationalizations + +Under a deadline or a coverage gate, agents talk themselves into shipping slop. Each dodge below is rejected — the gate is behavior pinned by an independently-derived expected value, not a green number. 
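As a rough illustration of an independently derived expected value, the sketch below uses a hypothetical `discounted_total_cents` helper with an assumed contract (the discount rounds half-up to the nearest cent) and a deliberately planted truncation bug. The function, the contract, and the numbers are all assumptions made for the example, not code from this repository.

```python
def discounted_total_cents(price_cents, percent_off):
    """Hypothetical unit under test.

    Assumed contract: the discount is rounded half-up to the nearest cent.
    """
    # Planted bug for the illustration: truncates instead of rounding half-up.
    discount = (price_cents * percent_off) // 100
    return price_cents - discount


# Derived by hand from the assumed contract, without running the code:
# 15% of 1999 is 299.85, which rounds half-up to 300, so the total is 1699.
EXPECTED_FROM_CONTRACT = 1699

# Copied from the code's current output: it agrees with whatever bug shipped.
EXPECTED_COPIED_FROM_OUTPUT = discounted_total_cents(1999, 15)


def passes(expected):
    """Return True if an equality test against `expected` would be green."""
    return discounted_total_cents(1999, 15) == expected
```

In this sketch the copied value keeps the test green while the truncation bug is live, and the hand-derived value turns it red, which is the signal the test exists to give.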
+ +| Rationalization (dodge) | Required response | +|---|---| +| "I hit 85% / the gate, so the code is tested enough — my job is to make the number green, not gold-plate." | The gate is a floor on execution, not proof of behavior. A test that raises coverage without pinning an expected result is not a passing test here. Pin behavior; the number follows. | +| "I'll call it with several inputs in one test and assert `toBeDefined()` / `not.toThrow()` — that hits all the branches." | Banned. `toBeDefined`, `not.toThrow`, and "it ran" are non-assertions. Each behavior gets its own named case asserting the actual outcome. | +| "Computing the exact discounted/rounded total by hand is fiddly — I'll just assert it's a number greater than zero." | The fiddly case is exactly the one worth pinning. Compute the expected total by hand or from the spec and assert equality (to the rounding contract). Loose bounds like `> 0` do not catch the regression. | +| "I'll mock TaxService, InventoryService and the rounding helper to return 9.99, then assert it returns 9.99 — the existing tests already mock everything." | Re-asserting a mock proves nothing. Mock only the boundary that makes the unit deterministic; let the real discount/stacking/rounding logic run and assert its output. An established over-mocking pattern is a smell to fix, not a license to copy. | +| "Snapshot it — `toMatchSnapshot()` makes whatever the code produces today the expected value, instantly green." | Snapshotting current output pins a behavior nobody verified. Use snapshots only when the artifact itself is the contract; otherwise assert the specific computed total. | +| "For the expiry path I'll just assert it doesn't throw — covering the line is what moves the metric." | Covering the line is not testing it. Assert the actual expiry behavior (e.g. discount is zero / coupon rejected), not the absence of an exception. 
| +| "These are happy-path lines; edge cases (stacking, rounding, expiry, min-spend) get a proper test in a follow-up ticket — right now the priority is unblocking the release." | The edge cases are the behavior contract you just wrote, not a follow-up. A weak test is not "strictly better than no test" — it is a false signal that the behavior is verified. Pin them now or mark the PR not-ready. | +| "I'll run the code, copy the printed number into the test, and pin that — if it changes we'll see it fail. Pinning current output IS pinning behavior." | No. Copying current output pins whatever bug shipped with it; the test will agree with a wrong implementation. Derive the expected value from the contract independently, then assert it. Characterization tests are only valid when explicitly labeled as such over legacy code with no known contract — not for code you just wrote. | + ## Patterns | Situation | Strong pattern | @@ -110,3 +126,4 @@ Start with these signals: - Accepting snapshots that no reviewer can understand. - Moving slow or flaky tests out of sight without tracking ownership. - Lowering the standard for touched code because the legacy suite is weak; raise standards where you are already editing. +- Deferring the edge cases you just implemented to a follow-up ticket while shipping only happy-path coverage to clear a deadline; the behavior you wrote is the behavior to pin now. diff --git a/systems-design/api-design/SKILL.md b/systems-design/api-design/SKILL.md index 8c79d76..0336d76 100644 --- a/systems-design/api-design/SKILL.md +++ b/systems-design/api-design/SKILL.md @@ -41,16 +41,31 @@ Decision tree + the full playbook: [references/api-playbook.md](references/api-p ## The contract checklist (pin before implementation) 1. **Consumers + trust boundaries** — who calls, from where, with what auth. -2. **Safety & idempotency per endpoint** — reads safe; side-effecting endpoints get **idempotency keys** so retries can't duplicate effects. 
*Any endpoint that will be auto-retried must be idempotent — non-negotiable.*
-3. **Deadlines + retry budgets** — every call carries a deadline; retries use exponential backoff + jitter under a budget. Watch layered retries (client + gateway + mesh) — they multiply into retry storms.
-4. **Pagination** — cursor-based for anything unbounded; offset pagination breaks under concurrent writes.
+2. **Safety & idempotency per endpoint** — reads safe; side-effecting endpoints get **idempotency keys** so retries can't duplicate effects. *Any endpoint that will be auto-retried must be idempotent — non-negotiable.* If a client retries (mobile/flaky network) and the effect is money or other non-reversible state, the dedup store ships in v1 — it is the contract, not a v2 optimization. "Double-charges basically never happen in testing" is not evidence; a single retried timeout is the bug.
+3. **Deadlines + retry budgets** — every call carries an explicit deadline (context timeout); retries use exponential backoff + jitter under a budget. The default HTTP client has no deadline — shipping it means one slow upstream hangs every caller. Bounded deadline + retry budget on every outbound call is part of v1, not "premature tuning"; metrics tell you it already broke. Watch layered retries (client + gateway + mesh) — they multiply into retry storms.
+4. **Pagination** — cursor-based for anything unbounded; offset pagination breaks under concurrent writes and degrades on deep pages. "Thousands of rows" and "writes happen while paging" is exactly the unbounded+concurrent case — `OFFSET/LIMIT` is trivial to write and wrong here; ship cursors. The mobile team integrating against page numbers now is the reason to fix the contract before launch, not after.
 5. **Versioning = additive evolution** — add optional fields; never repurpose or remove without a deprecation window. Breaking changes are a new major surface.
-6. **Error schema** — one machine-readable shape (code, message, correlation id, retryability) across all endpoints.
-7. **Auth + rate limits + abuse review** — TLS, OAuth-class authn/z, object- and property-level access checks, per-principal limits. Baseline: OWASP API Security Top 10.
+6. **Error schema** — one machine-readable shape (code, type, message, request/correlation id, retryability) across all endpoints. A *new public* endpoint defines the envelope even if existing services lack one — "consistency" with hand-rolled inline strings means clients parse prose forever; the new surface sets the standard, it doesn't inherit the gap.
+7. **Auth + rate limits + abuse review** — TLS, OAuth-class authn/z, object- and property-level access checks, per-principal limits. A public endpoint validates the caller's token and scope at the handler; reading a gateway-set header (`X-User-Id`) as trusted identity is a spoofable trust boundary — verify it, don't assume it. Per-customer rate limits + `429` semantics are part of the contract from v1, not a gateway "eventually" — a public/mobile endpoint is reachable abuse surface, and the limit defines client retry behavior. Baseline: OWASP API Security Top 10.
 8. **Cacheability** — which responses are cacheable, the key dimensions, TTL/staleness budget (feeds CDN policy).
 9. **Ownership + observability** — every endpoint maps to an owner, an SLI, and a dashboard.
 10. **Multi-service writes** — transactional outbox or saga, never ad-hoc distributed transactions.
 
+## Known pressure rationalizations
+
+Deadline pressure ("demo Monday") manufactures reasons to drop the exact promises that make a public, retried, money-moving API survive. Each below is a real failure dressed as pragmatism. The required response is non-negotiable for a new public endpoint.
+
+| Rationalization (the dodge) | Required response |
+|---|---|
+| "Idempotency keys are a v2 concern; double-charges basically never happen in testing." | A side-effecting endpoint a client will retry is unsafe without an idempotency key + dedup store **in v1**. "Never in testing" isn't evidence; one retried timeout duplicates the charge. |
+| "Mobile retries but the provider usually succeeds; a dedup table is a whole extra day." | The named retrier (flaky-network mobile) is precisely the trigger. The dedup store is the contract, not optional scope — cost doesn't waive the safety promise. |
+| "Offset/limit is what every example uses; cursor pagination is over-engineering." | Thousands of rows + concurrent writes = unbounded+concurrent. `OFFSET/LIMIT` skips/duplicates rows and degrades deep. Ship cursors before the client integrates against page numbers. |
+| "Match the repo's inline error strings; an envelope is nice-to-have since other services lack one." | A new public endpoint defines the machine-readable error envelope (code/type/message/request_id/retryability). Inheriting the gap forces clients to parse prose. |
+| "No deadlines/retry budget; default `http.Client` is fine — premature tuning." | Every outbound call gets an explicit deadline + bounded retries in v1. The default client has no timeout; one slow upstream (200ms–2s, occasional timeout) hangs every caller. Metrics report the outage, they don't prevent it. |
+| "Auth is handled — the gateway sets `X-User-Id` and we trust it." | A gateway-set header is spoofable if reachable directly. The public handler validates the token + scope itself; trust boundaries are verified, not assumed. |
+| "Rate limiting belongs at the gateway eventually; abuse isn't realistic before the demo." | A public/mobile endpoint is abuse surface from day one. Per-customer limits + `429` are part of the v1 contract — the limit also defines correct client retry/backoff. |
+| "REST-over-JSON because that's what mobile expects; `/v1` in the path is enough of a compatibility story." | Protocol choice (HTTP+JSON for a public/mobile API) is fine — but a version segment is not a compatibility plan. State the additive-evolution + deprecation policy: optional-field-only changes, no field repurposing, a deprecation window before any break. |
+
 ## Release gate
 
 No public API change ships without: backward-compatibility assessment · schema/OpenAPI updated · operation matrix completed · retry/idempotency review · abuse & resource-consumption review · conformance/eval cases for the changed contract. A gate failure is a redesign, not a footnote.
diff --git a/systems-design/data-store-selection/SKILL.md b/systems-design/data-store-selection/SKILL.md
index 03203a9..5f4c9de 100644
--- a/systems-design/data-store-selection/SKILL.md
+++ b/systems-design/data-store-selection/SKILL.md
@@ -49,11 +49,28 @@ Full tables, key-design rules, and patterns: [references/data-playbook.md](refer
 
 ## Release gates (hard)
 
-- **Reject** if the partition/shard key cannot be justified against the dominant query path.
-- **Reject** if consistency expectations are not explicitly named (per data class, including what readers may see mid-flight).
+- **Reject** if the partition/shard key cannot be justified against the dominant query path. "It's the default / it's always unique" (e.g. Mongo `_id`/ObjectId) is not a justification — ObjectId is monotonically increasing and hot-spots the latest chunk on write-heavy data. "We can reshard later" is not allowed: the shard key is a load-bearing decision now, and resharding a live ledger is a migration, not a tweak.
+- **Reject** if consistency expectations are not explicitly named (per data class, including what readers may see mid-flight). On money/ledger data you MUST name the concrete primitives, not the adjective: read concern / write concern (e.g. `majority` vs `local`), whether stale balance reads are acceptable and for how long. "Mongo gives us consistency" is a non-answer — name read/write concern or it fails.
 - **Reject** if a stateful component has no owner, source of truth, retention/deletion rule, or recovery path.
-- No queue/topic is production-ready without delivery semantics, replay policy, idempotency contract, and DLQ ownership.
-- No cached object without owner, source of truth, invalidation mechanism, TTL, and acceptable staleness.
+- No queue/topic is production-ready without delivery semantics (at-least-once vs at-most-once, named), replay policy, idempotency contract, DLQ ownership, and a named owner of retries on a failed delivery. "It's just notifications, a dropped email is fine" still requires you to state that at-most-once is the chosen contract — silence is a fail, not a default.
+- No cached object without owner, source of truth, invalidation mechanism, TTL, and acceptable staleness. "Invalidation will fall out naturally" / "we'll set a TTL later" is a fail: name *when* the entry is written and *what event busts it* (e.g. a balance write), or do not add the cache.
+- **Reject** if the design defers any of the above to "later" / "post-demo" / "before real funds flow." Money is involved on the first commit; the gates apply to the design you ship Monday, not to a future hardening pass. Fake demo data does not lower the durability/consistency bar — the schema and keys you ship are the ones backend builds on.
+
+## Known pressure rationalizations
+
+Each row is a dodge a time-pressured agent will reach for. The skill rejects it at the named gate. If you catch yourself reasoning the left column, you are skipping a gate.
+
+| Rationalization (dodge) | Required response |
+|---|---|
+| "Team already knows Mongo / it's in prod / one datastore is simpler to operate." | Operational familiarity is not an access pattern. Run the patterns → consistency → family steps; Mongo may win, but only after the matrix justifies it, not because it's incumbent. |
+| "Picking what the team runs + the most popular tools (Redis, RabbitMQ) is the safe, defensible choice." | Popularity/incumbency defends nothing if the access pattern isn't written down. The defensible choice is the one justified against the dominant query path. Brand-first is the failure mode, not the safe path. |
+| "Shard on `_id`/ObjectId — it's the default and always unique." | ObjectId is monotonic → hot-spots the latest chunk on a write-heavy ledger. Unique ≠ good shard key. Pick a high-cardinality, evenly-spread key aligned to the dominant query. |
+| "We can reshard later if we have to." | Resharding a live money ledger is a migration with downtime/risk, not a later tweak. The shard key is decided now, against the write path, or the design fails. |
+| "Redis cache-aside, invalidation will fall out naturally / we'll set a TTL later." | Name the write point and the bust event (balance change) now. An unspecified-invalidation cache on balances serves wrong money. No invalidation rule → no cache. |
+| "A ledger needs to be consistent — Mongo gives us consistency, done." | Name the primitives: read concern / write concern (`majority`? `local`?) and whether stale balance reads are acceptable. The adjective "consistent" is not a consistency model. |
+| "Drop in a queue for notifications — delivery semantics / DLQ / retry owner are 'just notifications', harden post-demo." | State the delivery contract explicitly (at-least-once vs at-most-once), who owns retries on a failed SMS, and the DLQ. Choosing at-most-once is fine; leaving it unnamed is a fail. |
+| "PM said don't overthink it / we'll iterate — demanding access patterns makes me a blocker." | The matrices ARE the deliverable, and they take minutes, not a discovery sprint. Naming patterns + consistency before tools is decisiveness, not blocking. Ship the design *with* the gates filled. |
+| "It's a Wednesday demo with fake data — correctness/durability can wait, leave a TODO." | Money is involved on the first commit. Fake data doesn't lower the bar; the schema, keys, and consistency you ship are what real funds run on. The gates apply to the Monday design. |
 
 ## Generated agents
 
diff --git a/systems-design/design-system/SKILL.md b/systems-design/design-system/SKILL.md
index c489d87..40a5a06 100644
--- a/systems-design/design-system/SKILL.md
+++ b/systems-design/design-system/SKILL.md
@@ -12,8 +12,8 @@ System design is making **explicit tradeoffs under uncertainty** — not assembl
 ## Operating Contract
 
 - Start by converting the request into a design contract: known inputs, missing inputs, explicit assumptions, non-goals, and the user journeys being protected.
-- Produce a decision record, not an architecture survey. Every major component must name the requirement it satisfies, the failure mode it addresses, the owner who would run it, and the operational cost it introduces.
-- If latency, traffic, consistency, or survival requirements are missing, ask for them; if the user wants a draft anyway, proceed with conservative assumptions and label confidence.
+- Produce a decision record, not an architecture survey. Presenting a menu of reference architectures for the requester to pick from is a survey, not a deliverable — commit to one topology. Every major component must name the requirement it satisfies, the failure mode it addresses, the owner who would run it, and the operational cost it introduces; a component missing any of these is not in the design (no "backfill owners later").
+- If latency, traffic, consistency, or survival requirements are missing, ask for them; if the user wants a draft anyway, proceed with conservative assumptions and label confidence. The requester cannot waive the SLO + capacity stage or the stop conditions — neither a deadline, a "just a diagram / board deck" framing, nor an instruction to "keep it impressive" or "skip the boring ops stuff" removes them. Those are scoping pressures, not permission to ship an architecture you have not sized.
 - Keep detailed API, data, and launch design in the sibling skills. This skill chooses the topology and hands off concrete surfaces; it does not hide those contracts in prose.
 
 ## When to Use
@@ -52,8 +52,24 @@ Do **not** recommend any of these unless intake names a consistency, latency, sc
 - microservice decomposition
 - event sourcing / CQRS
 
+The named requirement must be a **measured or projected workload number with a source** (e.g. "12k writes/sec sustained per the FY26 plan"), not an adjective. "Impressive," "shows maturity," "built to scale," "investors love it," "signals we thought about scale," and "richer diagram" are **not** requirements — they are the exact rationalizations this gate exists to reject. A diagram that the team would not actually build is not a target architecture; it is a fiction, and shipping it as one is the failure mode. If today's load fits a single well-shaped server for years (do the arithmetic and show it), the design IS the monolith — say so plainly, then add a one-line *trigger* for each future component ("shard when a single primary exceeds X"). Headroom means sizing the simple topology generously, **not** pre-building distributed mechanism; "over-provisioning architecture" (extra services, stores, regions) is not safer — it is unowned operational surface that fails in production.
+
 The additive default instead: **modular monolith + relational DB → add cache → CDN → read replicas → background jobs → only then decompose.** Validated small-scale systems run a single well-shaped server; the famous large-scale systems were built for *specific* constraints, not because complexity is better.
 
+## Known pressure rationalizations (and the required response)
+
+These are real dodges agents reach for under deadline + authority + "make it impressive" pressure. Each is **closed**: the gate still applies.
+
+| Rationalization | Required response |
+|---|---|
+| "The requester said keep it impressive / don't overthink the boring ops stuff — so SLOs and capacity are out of scope." | A requester cannot waive the SLO + capacity stage; it is the thing that makes the rest correct. "Boring ops" framing does not delete Stage 3. Do the capacity arithmetic anyway (it takes minutes for 2k/day) and put the SLO/headroom line in the deliverable. Skipping it is how you ship a fiction. |
+| "It's a board deck / just a diagram, not a deployment plan — nobody builds it Monday, so I don't need real numbers; I'll add multi-region, mesh, Kafka, CQRS because that's what scale looks like." | The artifact being a slide does not lower the bar — a target architecture you would not actually build is the failure mode, not the deliverable. Stop conditions apply to diagrams exactly as to deploys. Draw the topology you would defend in an incident review. |
+| "No targets and it's late in the week — chasing the requester for SLOs eats the weekend; I'll assume 'high scale' and over-provision; over-provisioning is safer." | Missing inputs are surfaced as explicit assumptions with confidence labels (Operating Contract), not papered over with "high scale." One async message asking for the peak/growth number is cheaper than a wrong architecture. Over-provisioning distributed mechanism is unowned surface, not safety. |
+| "Sharding / a separate ledger service / event sourcing shows maturity even if one box fits for years." | "Shows maturity" is not a named requirement — it is rejected by the stop-condition gate by name. Maturity is demonstrating you sized it and chose the simplest thing that meets the SLO with headroom, with future triggers noted. |
+| "I'll present three reference architectures as a menu and let them pick, so I'm not on the hook." | A menu is an architecture survey; this skill produces a **decision record** (Operating Contract). Commit to one topology, name the requirement behind it, and record the alternatives you rejected and why. Use a design-panel only for a genuinely contested call — not to avoid committing. |
+| "Ownership, SLIs, failure modes, and per-component cost are implementation details for later — I'll just draw the boxes and backfill owners if funded." | Every major component must name its requirement, failure mode, owner, and operational cost (Operating Contract) — that is what separates a design from a drawing. A box with no owner and no failure path is not in the design. Backfill-later is how unowned components reach production. |
+| "Polyglot persistence (Postgres + DynamoDB + Redis + Elasticsearch) makes the diagram look richer — more datastores = more thought-through." | "Looks richer" is the polyglot-persistence stop condition firing. Each store is a named requirement, an owner, an operational cost, and a failure domain. Default to one relational system of record; add a store only when a workload number forces it. |
+
 ## Recommended defaults (override only with a named reason)
 
 Modular monolith · HTTP + OpenAPI for public APIs (gRPC only for internal hot paths) · PostgreSQL as system of record · Redis for cache and short-lived coordination · CDN for static/cacheable responses · a simple queue for background jobs · declarative infrastructure · progressive delivery · OpenTelemetry + Prometheus-style metrics · SLOs from day one · runbooks before launch.
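Reviewer sketch (not part of the patch): the "do the arithmetic and show it" step in the stop-condition gate can be illustrated in a few lines of Python. This is a minimal example under stated assumptions. `Workload`, `peak_rps`, `years_of_headroom`, the peak-to-average factor, the growth rate, and the single-server capacity figure are all hypothetical, chosen only for illustration; the 2k/day figure is the one quoted in the rationalization table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    requests_per_day: int   # measured or projected number, with a source
    peak_to_average: float  # assumption: how spiky the traffic is
    annual_growth: float    # assumption: 1.0 means traffic doubles each year


def peak_rps(w: Workload, years: int = 0) -> float:
    """Peak requests per second after `years` of compound growth."""
    average_rps = w.requests_per_day / 86_400  # seconds in a day
    return average_rps * w.peak_to_average * (1 + w.annual_growth) ** years


def years_of_headroom(
    w: Workload,
    server_rps: float,
    target_utilization: float = 0.5,
    horizon: int = 10,
) -> int:
    """Whole years a single server stays under the utilization target.

    Returns -1 if the workload does not fit today, and `horizon` if it
    still fits at the end of the planning horizon.
    """
    budget = server_rps * target_utilization
    for year in range(horizon + 1):
        if peak_rps(w, year) > budget:
            return year - 1
    return horizon


if __name__ == "__main__":
    # Hypothetical inputs: 2k requests/day, 10x peak factor, doubling yearly,
    # one server assumed to sustain 200 rps, held to 50% utilization.
    demo = Workload(requests_per_day=2_000, peak_to_average=10.0, annual_growth=1.0)
    print(f"peak today: {peak_rps(demo):.2f} rps")
    print(f"single server fits for {years_of_headroom(demo, server_rps=200.0)} years")
```

Under these assumed inputs the single server fits for 8 years, which is the kind of sourced number the gate asks for before any "shard when a single primary exceeds X" trigger is written down. Real inputs would come from measurement or a plan, not from the defaults used here.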
diff --git a/systems-design/production-readiness/SKILL.md b/systems-design/production-readiness/SKILL.md
index 76c09bd..41f8fc7 100644
--- a/systems-design/production-readiness/SKILL.md
+++ b/systems-design/production-readiness/SKILL.md
@@ -46,14 +46,29 @@ No production launch without **all** of:
 
 - [ ] SLOs + error-budget policy written and agreed
 - [ ] Dashboard URLs live (golden signals + capacity panels)
-- [ ] Alert routes wired; every page names a runbook
+- [ ] Alert routes wired; every page is symptom/SLO-based and names a runbook (resource-utilization alerts do not satisfy this; they are dashboard panels)
 - [ ] Liveness/readiness/startup probes configured and correct
-- [ ] The five standard runbooks written
-- [ ] Rollback command exists and was **executed** in a drill
-- [ ] **At least one failure drill run** (kill a dependency, fail over the DB, or roll back a canary — for real)
-- [ ] Owner, escalation path, postmortem template on file
-
-A red box is a blocker, not a footnote. "We'll add runbooks after launch" is how 3am pages become 4-hour outages.
+- [ ] The five standard runbooks written — each with named first checks, named safe mitigations, and concrete rollback steps for *this* service (a stub with "investigate and roll back if needed" is not a runbook)
+- [ ] Rollback command exists and was **executed against this service in a drill** (reuse of a "standard path" is an assumption until executed here)
+- [ ] **At least one failure drill run** (kill a dependency, fail over the DB, or roll back a canary — for real, with the failure injected)
+- [ ] A **named** owner and escalation route on file; postmortem template on file (`TBD`/`defaults to a channel` is an empty box, not a green one)
+
+A box is green only when the named artifact exists and the named action was performed — not when the item was "thought about" or planned to follow. A red box is a blocker, not a footnote: "we'll add it week 1" leaves it red. The checklist is the contract, not a thinking aid; the green check you paste means *done*, and you sign it.
+
+### Known pressure rationalizations
+
+A near-launch deadline manufactures these dodges. Each leaves the corresponding box **red** — meet the gate or report `not ready`; do not self-certify green.
+
+| Rationalization (the dodge) | Required response |
+|---|---|
+| "Staging ran clean for days with synthetic traffic — that *is* my failure test." | A clean soak is observation, not a drill. The gate requires an **injected** failure (kill the dependency, fail the DB, throttle the upstream) and an observed recovery. No injection → drill box stays red. |
+| "Rollback is the same `rollout undo`/redeploy path every service uses — it obviously works, I don't need to run it." | "Standard path" is the most common untested rollback. Reversibility is *proven per service*, not inherited. Not executed against this service → red. |
+| "Wire CPU>80% / mem>90% now; add SLO/burn-rate alerts week 1 — can't set a latency-burn threshold with no prod data." | Resource alerts are cause-based dashboard panels, not the paging gate. Ship with **symptom/SLO-burn** alerts using a defensible launch target (start conservative, tune on real traffic) — the absence of a baseline is not a reason to page on the wrong signal. SLO alert unwired → red. |
+| "Put `customer_id`/email/`charge_id` in the alert labels so on-call can triage faster." | PII and high-cardinality identifiers as labels are forbidden regardless of triage convenience — they leak data and can take down the monitoring system. Put the *link to the dashboard/trace query* in the annotation; identifiers live behind that link, not on the alert. |
+| "A stub runbook with `Owner: TBD` and 'investigate and roll back' beats a perfect one that blocks launch." | A stub is not a written runbook (gate item). Each of the five needs named first checks, named mitigations, and this service's concrete rollback steps. Stub → runbook box red. |
+| "`Owner: TBD` is honest because the rotation isn't finalized." | Honest, and still a blocker. An unassigned owner means no one is accountable at 3am. Name an interim owner by person/role before launch, or the owner box is red. |
+| "It's a one-way-door date the VP announced; cost of slipping outweighs an untested failure mode — ship and harden in-flight." | Deadline pressure does not flip a red box green. State the verdict (`not ready` / `ready after fixes`) plus the smallest fix list, and escalate the ship-with-known-gaps decision to the deadline owner. You surface the tradeoff; you do not silently absorb it by marking green. |
+| "Green means 'we thought about each item' + a footnote that the drill and SLO alerts follow." | Green means the artifact exists and the action ran. "Considered it" is not done; a footnote deferring a gate item is a red box with extra words. Report the real verdict. |
 
 ## Generated agents