blundergoat · mattyhansen · Jun 1, 2026 · May 30, 2026
diff --git a/.agents/skills/goat-critique/SKILL.md b/.agents/skills/goat-critique/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: goat-critique
 description: "Use when a decision or analysis needs multi-lens critique to surface blind spots before shipping."
-goat-flow-skill-version: "1.7.0"
+goat-flow-skill-version: "1.9.0"
 ---
 # /goat-critique
 
@@ -67,7 +67,7 @@ All three perspectives must appear in every critique from Agents A and B. The te
 
 | Agent | Reads | Does NOT read |
 |---|---|---|
-| A (Risk) | artifact + architecture.md + footguns + lessons + rubric | git history, config.yaml |
+| A (Risk) | artifact + architecture.md + targeted grep-first footgun/lesson hits + rubric | git history, config.yaml |
 | B (Alternatives) | artifact + architecture.md + `git log --oneline -20` + config.yaml + rubric | footguns, lessons |
 | C (Fresh Eyes) | artifact + rubric ONLY | everything else (isolation enforced) |
 
@@ -79,7 +79,7 @@ Full directives: `references/sub-agent-directives.md`.
 - **B (Alternatives):** SKEPTIC/ANALYST/STRATEGIST on alternatives, ranked by implementation friction. Must surface at least one alternative.
 - **C (Fresh Eyes):** No project context. Flags unstated assumptions. ISOLATION RULE enforced.
 
-Each sub-agent MUST return 3-7 findings, each with: title, severity, evidence (file + semantic anchor), confidence, Proof attempt, Evidence quality (OBSERVED/INFERRED/UNVERIFIED), SKEPTIC/ANALYST/STRATEGIST lines, and rubric dimensions covered. Plus: overall assessment (STRONG/ADEQUATE/WEAK/FLAWED) and one thing the artifact gets RIGHT.
+Each sub-agent MUST return 3-7 findings, each with: title, severity, evidence (file + semantic anchor), confidence, Proof attempt, Proof class (`RUNTIME | CONTRACT-GREP | STATIC | NOT-REPRODUCED`), Evidence quality (OBSERVED/INFERRED/UNVERIFIED), SKEPTIC/ANALYST/STRATEGIST lines, and rubric dimensions covered. Plus: overall assessment (STRONG/ADEQUATE/WEAK/FLAWED) and one thing the artifact gets RIGHT.
 
 **Lens-finding floor:** each lens must surface >= 1 finding per sub-agent or re-run once; convergence allowed after one re-run. See anti-fabrication constraint. Full floor spec in the sub-agent directives reference pack.
 
@@ -154,7 +154,7 @@ Then the full critique:
 
 **Blind spot check:** List unaddressed artifact sections, unmapped rubric aspects, and unread referenced files as "What Wasn't Critiqued." Must never be empty.
 
-**Proof Gate:** Apply the Proof Gate (see Constraints) to every synthesised finding before inclusion.
+**Proof Gate:** Apply the Proof Gate (see Constraints) to every synthesised finding before inclusion. Every synthesised finding must carry proof class `RUNTIME | CONTRACT-GREP | STATIC | NOT-REPRODUCED`.
 
 **Phase 5.5 - Meta-audit.** Spawn a lightweight meta-agent (budget: 2 tool calls, no context beyond the draft Phase 5 output). Audit the critique for internal consistency against the 10-point rubric in `references/rubric-examples.md`. If issues found, insert an `## Auto-Detected Issues` block before presenting. Verdict block updated with `Meta-score: N/100`.
 
@@ -190,10 +190,10 @@ The rubric determines what sub-agents evaluate. Match to artifact type. Dimensio
 - MUST set max 5 tool-call budget per critique sub-agent; log calls/limit when exposed, otherwise unavailable markers. Do not claim mechanical enforcement when counts are unavailable.
 - MUST log per spawned critique/cross-exam/meta agent: id/handle if exposed, calls/limit, or unavailable markers.
 - MUST Scan Agent C output for context leaks before any other Phase 2 work. Only flag references absent from the input artifact. Any untraceable match = CONTEXT LEAK; discard and re-spawn.
-- MUST Check sub-agent completeness: verify each sub-agent returned 3-7 findings plus required lens fields, severity, evidence, confidence, rubric dimensions, overall assessment, and preservation note. Incomplete → re-spawn once; if still incomplete, record `sub-agent completeness limited`.
+- MUST Check sub-agent completeness: verify each sub-agent returned 3-7 findings plus required lens fields, severity, evidence, confidence, proof class, rubric dimensions, and overall assessment. Incomplete → re-spawn once; if still incomplete, record `sub-agent completeness limited`.
 - MUST enforce cross-examination budget: Max 3 cross-examination agents total, max 3 tool calls per agent.
 - Recommendations are never auto-applied. After synthesis, stop. Do not enter implementation mode unless the user explicitly asks to apply changes.
-- MUST apply the Proof Gate from `skill-preamble.md` to every synthesised finding. Sub-agent reports are inputs to verify, not evidence to launder. Re-read applies to findings surviving to Phase 5 (typically 3-7 after Phase 3/4 filtering), not to all findings raised in Phase 1.
+- MUST apply the Proof Gate from `skill-preamble.md` to every synthesised finding and preserve one proof class tag (`RUNTIME | CONTRACT-GREP | STATIC | NOT-REPRODUCED`) on each. Sub-agent reports are inputs to verify, not evidence to launder. Re-read applies to findings surviving to Phase 5 (typically 3-7 after Phase 3/4 filtering), not to all findings raised in Phase 1.
 - MUST NOT fabricate findings. Do not fabricate findings to meet the lens-finding floor; convergence allowed after one re-run.
 - Universal constraints from skill-preamble.md apply.
 
@@ -209,13 +209,13 @@ The rubric determines what sub-agents evaluate. Match to artifact type. Dimensio
 ## Sub-Agent Rankings
 ## Rubric Coverage Gaps
 ## Control Group Delta
-## Validated Findings  <!-- source pool for Recommended Changes -->
+## Validated Findings  <!-- source pool for Recommended Changes; every finding includes proof class -->
 ## Cross-Examination Results
 ## Auto-Detected Issues  <!-- from Phase 5.5 meta-audit, if any -->
 ## Retracted Findings
 ## Human Decisions
 ## Strengths
-## Recommended Changes  <!-- subset of Validated Findings; ordered by severity; each with concrete action -->
+## Recommended Changes  <!-- subset of Validated Findings; ordered by severity; each with concrete action and proof class -->
 ## Open Questions
 ## Integration Hooks  <!-- for-goat-plan, for-goat-debug, for-implementation -->
 ## What Wasn't Critiqued

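Reviewer note: the completeness constraint in this file (3-7 findings, re-spawn once, then record `sub-agent completeness limited`) could be sketched roughly as below. This is only an illustration; `collect_with_one_respawn` and the `spawn` callable are hypothetical names, and the sketch checks the finding count only, not the required per-finding fields.

```python
from typing import Callable, List, Optional, Tuple

# Bounds taken from the "3-7 findings" wording in SKILL.md.
MIN_FINDINGS, MAX_FINDINGS = 3, 7


def collect_with_one_respawn(
    spawn: Callable[[], List[dict]],
) -> Tuple[List[dict], Optional[str]]:
    """Run a sub-agent, re-spawning at most once if the finding count is out of range.

    `spawn` is a hypothetical callable that returns one sub-agent's findings.
    Returns the findings plus a note, which is None when the count is acceptable.
    """
    findings = spawn()
    if MIN_FINDINGS <= len(findings) <= MAX_FINDINGS:
        return findings, None

    findings = spawn()  # the single permitted re-spawn
    if MIN_FINDINGS <= len(findings) <= MAX_FINDINGS:
        return findings, None

    # Still incomplete after one re-spawn: record the limitation instead of looping.
    return findings, "sub-agent completeness limited"
```

Returning the note rather than raising keeps the critique moving while still leaving a record of the limitation.
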
diff --git a/.agents/skills/goat-critique/references/rubric-examples.md b/.agents/skills/goat-critique/references/rubric-examples.md
@@ -1,46 +1,46 @@
 ---
-goat-flow-reference-version: "1.7.0"
+goat-flow-reference-version: "1.9.0"
 ---
 # Critique Rubric Examples (Reference Pack)
 
 *Extracted from the goat-critique SKILL.md to stay within the 2500-word skill cap. Canonical rubric definitions remain in SKILL.md; worked examples and context-map details live here.*
 
 ## Rubric Context Maps
 
-Each rubric has a context map that Step 0 reads and passes to sub-agent spawn directives. Agent C's isolation enforcement (Phase 2 step 1 grep check) is unchanged regardless of context map. Generic fallback uses the default split.
+Each rubric has a context map that Step 0 reads and passes to sub-agent spawn directives. Footgun/lesson entries mean targeted grep-first hits from those buckets, not whole-directory reads. Agent C's isolation enforcement (Phase 2 step 1 grep check) is unchanged regardless of context map. Generic fallback uses the default split.
 
 ### Plan
-- **A:** footguns, lessons, `.goat-flow/decisions/`
+- **A:** targeted grep-first footgun/lesson hits, `.goat-flow/decisions/`
 - **B:** `.goat-flow/tasks/.active`, `git log --oneline -20`, milestone logs
 - **C:** [] (isolation enforced)
 
 ### Security assessment
-- **A:** footguns, lessons, threat-model docs, `.goat-flow/decisions/`
+- **A:** targeted grep-first footgun/lesson hits, threat-model docs, `.goat-flow/decisions/`
 - **B:** `git log --oneline -20`, config.yaml, dependency manifests
 - **C:** [] (isolation enforced)
 
 ### Debug hypotheses
-- **A:** footguns, lessons, `.goat-flow/logs/sessions/`
+- **A:** targeted grep-first footgun/lesson hits, `.goat-flow/logs/sessions/`
 - **B:** `git log --oneline -20`, config.yaml, test output
 - **C:** [] (isolation enforced)
 
 ### Review findings
-- **A:** footguns, lessons, `.goat-flow/decisions/`
+- **A:** targeted grep-first footgun/lesson hits, `.goat-flow/decisions/`
 - **B:** `git log --oneline -20`, config.yaml, CI logs
 - **C:** [] (isolation enforced)
 
 ### Test strategy
-- **A:** footguns, lessons, `.goat-flow/decisions/`
+- **A:** targeted grep-first footgun/lesson hits, `.goat-flow/decisions/`
 - **B:** `git log --oneline -20`, config.yaml, test manifests
 - **C:** [] (isolation enforced)
 
 ### Architecture/refactor
-- **A:** footguns, lessons, `.goat-flow/decisions/`, dependency maps
+- **A:** targeted grep-first footgun/lesson hits, `.goat-flow/decisions/`, dependency maps
 - **B:** `git log --oneline -20`, config.yaml, module boundaries
 - **C:** [] (isolation enforced)
 
 ### Generic (fallback)
-- **A:** footguns, lessons
+- **A:** targeted grep-first footgun/lesson hits
 - **B:** `git log --oneline -20`, config.yaml
 - **C:** [] (isolation enforced)
 
@@ -53,6 +53,7 @@ Each rubric has a context map that Step 0 reads and passes to sub-agent spawn di
 - **Severity:** HIGH | **Confidence:** HIGH
 - **Evidence:** Milestone plan excerpt (search: "Phase 2 additions") - Phase 2 additions depend on Phase 1 extraction completing first
 - **Proof attempt:** Read the milestone plan excerpt, confirmed extraction must precede additions
+- **Proof class:** STATIC
 - **Evidence quality:** OBSERVED
 - **SKEPTIC:** If extraction doesn't reclaim enough words, Phase 2 additions blow the 2500 cap
 - **ANALYST:** Current 2532w minus ~100w extraction gives ~80w budget for additions; tight but feasible
@@ -67,6 +68,7 @@ Each rubric has a context map that Step 0 reads and passes to sub-agent spawn di
 - **Severity:** CRITICAL | **Confidence:** HIGH
 - **Evidence:** `src/api/handler.ts` (search: "database query") - user input passed directly to database query
 - **Proof attempt:** Read handler.ts around the database query, confirmed no sanitization before query construction
+- **Proof class:** STATIC
 - **Evidence quality:** OBSERVED
 - **SKEPTIC:** SQL injection vector; worst case is full database compromise
 - **ANALYST:** Direct string interpolation in query; parameterised queries would eliminate the risk at zero performance cost
@@ -79,7 +81,7 @@ Each rubric has a context map that Step 0 reads and passes to sub-agent spawn di
 The meta-agent scores the draft critique against these 10 points:
 
 1. **Gate-finding match** - Gate value matches highest surviving severity
-2. **Evidence quality per finding** - every finding has Proof attempt + Evidence quality fields
+2. **Evidence quality per finding** - every finding has Proof attempt + Proof class + Evidence quality fields
 3. **Rubric coverage completeness** - no unaddressed mandatory dimensions
 4. **Rec-changes actionability** - every recommendation has a concrete next step
 5. **No orphan retractions** - every retracted finding has rationale

diff --git a/.agents/skills/goat-critique/references/sub-agent-directives.md b/.agents/skills/goat-critique/references/sub-agent-directives.md
@@ -1,15 +1,15 @@
 ---
-goat-flow-reference-version: "1.7.0"
+goat-flow-reference-version: "1.9.0"
 ---
 # Critique Sub-Agent Directives (Reference Pack)
 
 *Extracted from the goat-critique SKILL.md to stay within the 2500-word skill cap. Canonical detail lives here; SKILL.md retains concise summaries.*
 
 ## Sub-agent A (Risk Focus - backward-looking context)
 
-**Directive:** "Apply SKEPTIC/ANALYST/STRATEGIST. Focus on RISKS: what could go wrong, what the evidence says about cost/benefit, what the 2nd-order systemic impacts are (local fix → global break patterns), and what the fastest safe path looks like. For any 2nd-order claim, you MUST cite the downstream file or system by name - speculation without a named target gets retracted in Phase 3. Your context includes past mistakes (footguns, lessons) - use them."
+**Directive:** "Apply SKEPTIC/ANALYST/STRATEGIST. Focus on RISKS: what could go wrong, what the evidence says about cost/benefit, what the 2nd-order systemic impacts are (local fix → global break patterns), and what the fastest safe path looks like. For any 2nd-order claim, you MUST cite the downstream file or system by name - speculation without a named target gets retracted in Phase 3. Your context includes targeted grep-first past-mistake hits - use them."
 
-**Context reads:** artifact + architecture.md + footguns + lessons + rubric
+**Context reads:** artifact + architecture.md + targeted grep-first footgun/lesson hits + rubric
 **Does NOT read:** git history, config.yaml
 
 ## Sub-agent B (Alternatives Focus - current-state context)
@@ -31,6 +31,7 @@ goat-flow-reference-version: "1.7.0"
 Every finding MUST include:
 
 - **Proof attempt:** exact command/read executed in sub-agent's tool budget, or "N/A - purely structural"
+- **Proof class:** `RUNTIME | CONTRACT-GREP | STATIC | NOT-REPRODUCED`
 - **Evidence quality:** OBSERVED / INFERRED / UNVERIFIED
 - Title, severity (CRITICAL/HIGH/MEDIUM/LOW), evidence (file + semantic anchor or artifact section reference), confidence (HIGH/MEDIUM/LOW)
 - **SKEPTIC:** one line - what could go wrong, worst case (or "N/A - [reason]" if genuinely inapplicable)

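Reviewer note: a minimal sketch of how the per-finding fields listed in this hunk, including the new proof class, might be checked mechanically. The allowed values are copied from the directive text; the dictionary key names and the function `finding_problems` are assumptions, since the skill does not define a machine-readable schema. The SKEPTIC/ANALYST/STRATEGIST lines and rubric dimensions are left out for brevity.

```python
from typing import Dict, List

# Allowed values copied from the directive text in this diff.
PROOF_CLASSES = {"RUNTIME", "CONTRACT-GREP", "STATIC", "NOT-REPRODUCED"}
EVIDENCE_QUALITIES = {"OBSERVED", "INFERRED", "UNVERIFIED"}
SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}
CONFIDENCES = {"HIGH", "MEDIUM", "LOW"}

# Hypothetical key names; chosen for illustration only.
ENUM_FIELDS = {
    "proof_class": PROOF_CLASSES,
    "evidence_quality": EVIDENCE_QUALITIES,
    "severity": SEVERITIES,
    "confidence": CONFIDENCES,
}
TEXT_FIELDS = ("title", "evidence", "proof_attempt")


def finding_problems(finding: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the finding passes."""
    problems = []
    # Free-text fields only need to be present and non-blank.
    for key in TEXT_FIELDS:
        if not finding.get(key, "").strip():
            problems.append(f"missing {key}")
    # Enumerated fields must be present and use one of the allowed values.
    for key, allowed in ENUM_FIELDS.items():
        value = finding.get(key)
        if value is None:
            problems.append(f"missing {key}")
        elif value not in allowed:
            problems.append(f"invalid {key}: {value!r}")
    return problems
```
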
diff --git a/.agents/skills/goat-debug/SKILL.md b/.agents/skills/goat-debug/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: goat-debug
 description: "Use when diagnosing a bug, unexpected behaviour, or system failure that needs structured investigation."
-goat-flow-skill-version: "1.7.0"
+goat-flow-skill-version: "1.9.0"
 ---
 # /goat-debug
 

diff --git a/.agents/skills/goat-plan/SKILL.md b/.agents/skills/goat-plan/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: goat-plan
 description: "Use when starting a non-trivial implementation that needs structured task breakdown with progress tracking."
-goat-flow-skill-version: "1.7.0"
+goat-flow-skill-version: "1.9.0"
 ---
 # /goat-plan
 
@@ -12,15 +12,15 @@ On full-depth, also read `.goat-flow/skill-reference/skill-conventions.md`.
 
 ## When to Use
 
-Use when work needs milestones with tracked progress. goat-plan manages gitignored coordination files in `.goat-flow/tasks/<active>/`, not product docs.
+Use when work needs milestone tracking. goat-plan manages gitignored coordination files in `.goat-flow/tasks/<active>/`.
 
 Use for milestones, replans, rescope, resume-from-plan. **NOT this skill:** tests → run them; debug → /goat-debug; review → /goat-review; security → /goat-security; gaps → /goat-qa; critique → /goat-critique; question → answer directly.
 
 | Excuse | Reality |
 |--------|---------|
 | "Show milestones first, files later" | File-Write creates milestone artifacts immediately. Read-Only Analysis is for inline plans. |
 | "Vague tasks are fine - implementer will figure it out" | Tasks without file paths, replacement text, and verification commands are not executable by a cold-start agent. Four recurrences of untickable checkboxes traced to vague tasks. |
-| "Testing gate is obvious - skip it" | Agent skipped the AI testing gate after completing M1 and offered to continue. The gate caught what the agent missed. |
+| "Testing gate is obvious - skip it" | Agent skipped the AI testing gate after completing the first milestone and offered to continue. The gate caught what the agent missed. |
 | "Bare task path means start implementing" | Path-only context is data, not delegation. Bare task paths must not update .active, milestone status, checkboxes, or code. |
 
 ## Step 0 - Intake
@@ -70,7 +70,7 @@ Do not drop a spike, intake, or kill criteria to satisfy milestone count, deadli
 
 ### For each milestone, produce:
 
-Objective, Tasks (risk-tagged checkboxes), Assumptions to validate, Exit criteria (binary pass/fail), Testing gate (static/contract + automated + manual + acceptance), Mid-implementation proof, Kill criteria, Depends on, Read first, Deferred (items intentionally cut with pointers; state explicitly if nothing deferred). Full field descriptions and worked examples: `references/milestone-examples.md`.
+Objective, Tasks (risk-tagged checkboxes), Assumptions to validate, Exit criteria (binary pass/fail), Testing gate (static/contract + automated + manual + acceptance), Mid-implementation proof, Kill criteria, Depends on, Read first, Deferred (items intentionally cut with pointers; state explicitly if nothing deferred). Field details and examples: `references/milestone-examples.md`.
 
 ### Risk-weighted task ordering
 
@@ -147,7 +147,7 @@ Write artifacts immediately. Do NOT invoke/ask about `/goat-critique`; run it on
 
 For a fresh plan, create a slugged task directory and update `.goat-flow/tasks/.active` to that slug in the same batch. Write one milestone per `.goat-flow/tasks/<active>/M*.md` file.
 
-**Filename format:** `M<NN>-<slug>.md`, e.g. `M01-prove-api-integration.md`.
+**Filename format:** start with `M` so dashboard and task tooling can discover it; use a readable slug, e.g. `Milestone-prove-api-integration.md`.
 
 **File format:** use existing milestone structure: title, Status, Objective, Depends on, Kill criteria, Read first, Assumptions, Tasks (risk-tagged), Exit Criteria, Testing Gate (static/contract + automated + manual + acceptance), Mid-implementation proof.
 
@@ -249,12 +249,12 @@ Summary format for presentation:
 ```markdown
 ## Milestones for [feature]
 
-### M01: [name] - [archetype]
+### Milestone 01: [name] - [archetype]
 **Objective:** [1-2 sentences]
 **Tasks:** [N] | **Exit criteria:** [N] | **Testing gate:** [auto + manual + acceptance]
 **Kill criteria:** [condition]
 
-### M02: [name] - [archetype]
+### Milestone 02: [name] - [archetype]
 ...
 
 **Total milestones:** [N] | **Estimated sessions:** [rough guess]

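Reviewer note: the relaxed filename rule in this file relies on tooling discovering milestone files through the `M*.md` pattern. A small sketch of that check follows, assuming discovery is a case-sensitive glob match; the helper name `is_discoverable` is hypothetical and the real dashboard logic may differ.

```python
from fnmatch import fnmatchcase


def is_discoverable(filename: str) -> bool:
    """Return True if the name matches the `M*.md` pattern, case-sensitively.

    Assumption: task tooling uses a plain glob like this one.
    """
    return fnmatchcase(filename, "M*.md")
```

Under this assumption both the old `M01-prove-api-integration.md` style and the new `Milestone-prove-api-integration.md` style match, while a lowercase `milestone-...md` name would not.
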
diff --git a/.agents/skills/goat-plan/references/issue-format.md b/.agents/skills/goat-plan/references/issue-format.md
@@ -1,5 +1,5 @@
 ---
-goat-flow-reference-version: "1.7.0"
+goat-flow-reference-version: "1.9.0"
 ---
 # ISSUE.md Format
 

diff --git a/.agents/skills/goat-plan/references/milestone-examples.md b/.agents/skills/goat-plan/references/milestone-examples.md
@@ -1,5 +1,5 @@
 ---
-goat-flow-reference-version: "1.7.0"
+goat-flow-reference-version: "1.9.0"
 ---
 # Milestone Template - Detailed Field Reference
 
@@ -29,9 +29,9 @@ Assumptions are not tasks - they're beliefs about the system that affect the pla
 
 ```markdown
 ## Assumptions
-- [x] Background job queue handles 500-item batches (benchmarked in M1)
+- [x] Background job queue handles 500-item batches (benchmarked in the spike)
 - [ ] File upload endpoint accepts multipart form data (untested)
-- [x] Database migration runs without downtime (spike confirmed in M1)
+- [x] Database migration runs without downtime (spike confirmed in the first milestone)
 - [ ] Rate limiting handles concurrent requests correctly (assumed, not tested)
 ```