mandarnilange · mandarnilange · May 2, 2026 · May 2, 2026 · May 2, 2026 · May 2, 2026
diff --git a/.claude/scheduled_tasks.lock b/.claude/scheduled_tasks.lock
diff --git a/.gitignore b/.gitignore
@@ -18,4 +18,10 @@ coverage/
 /_archive/
 .grepai/
 
-graphify-out/
+graphify-out/
+
+# AI coding agent local state (per-developer, not for git)
+.claude/
+.agents/
+CLAUDE.md
+skills-lock.json
diff --git a/README.md b/README.md
@@ -222,13 +222,19 @@ When `ANTHROPIC_API_KEY` isn't set, the dashboard renders a read-only banner —
 
 ## Agent Skills
 
-Designing an AgentForge workflow from scratch is a lot of YAML. The repo ships an [agent skill](https://skills.sh) that walks any Claude Code / Cursor / Codex session through the design — agents, phases, gates, loops, parallelism, wiring, nodes — and emits a working `.agentforge/` directory.
+The repo ships [agent skills](https://skills.sh) that drop into any Claude Code / Cursor / Codex session — installable in one command:
 
 ```bash
-npx skills add mandarnilange/agentforge/agentforge-workflow
+npx skills add mandarnilange/agentforge
 ```
 
-Then ask the agent something like *"help me design an AgentForge pipeline for PR triage"* and the skill kicks in. Catalog and authoring docs: [`skills/`](skills/).
+| Skill | Audience | What it does |
+|---|---|---|
+| `agentforge-workflow` | End user | Walks through designing an AgentForge workflow and emits a complete `.agentforge/` directory. |
+| `agentforge-template-author` | Contributor | Guides adding a new shipped template under `packages/{core,platform}/src/templates/`. |
+| `agentforge-debug` | Operator | Triages stuck or failing pipeline runs from symptom to fix path. |
+
+Trigger phrases live in each skill's frontmatter — e.g. *"help me design an AgentForge pipeline for PR triage"* fires `agentforge-workflow`; *"why is pipeline X stuck?"* fires `agentforge-debug`. Catalog, changelog, and authoring docs: [`skills/`](skills/).
 
 ---
 

diff --git a/skills/CHANGELOG.md b/skills/CHANGELOG.md
@@ -0,0 +1,56 @@
+# Skills changelog
+
+Top-level history for every skill under [`skills/`](.). Each skill also
+keeps a per-skill `CHANGELOG.md` with finer-grained notes — link to that
+file from the entry below.
+
+## Versioning policy
+
+Skills are versioned independently via `metadata.version` in
+`SKILL.md`'s frontmatter. Use semver:
+
+| Bump | Trigger |
+|---|---|
+| **major** | Trigger conditions changed (the skill fires on different prompts now), required tool surface changed, file structure renamed, or an existing reference doc removed. Anything a user has muscle memory for. |
+| **minor** | New reference docs, new sections in `SKILL.md`, additive support for new prompts. Existing behaviour preserved. |
+| **patch** | Prose tightening, typo fixes, schema cheat-sheet updates that mirror upstream changes, broken-link repairs. |
+
+**Bump on every PR that lands a skill change**, even a tiny one — the
+version is what distinguishes one published skill from the next, and
+users may pin to it. CI's frontmatter validator enforces presence; the
+contributor enforces honesty.
+
+## Release flow
+
+1. Land your change locally and run `npm run skills:validate`.
+2. Bump the affected skill's `metadata.version` in its `SKILL.md`.
+3. Add an entry to that skill's `CHANGELOG.md` (create the file if it
+   does not exist).
+4. Add a one-line entry here under the next "Unreleased" or current
+   version block, linking to the per-skill changelog.
+5. Open a PR. The `Publish Skills` workflow validates frontmatter on
+   every push.
+
+There is no separate "publish" command. Once the PR merges into `main`,
+`npx skills add mandarnilange/agentforge` picks up the new version on
+next install.
+
+## History
+
+### Unreleased
+
+- _add entries here as they merge_
+
+### 2026-05-02
+
+- **`agentforge-template-author` 0.1.0** — initial release. Guides
+  contributors through adding a new shipped template under
+  `packages/{core,platform}/src/templates/`. See
+  [`agentforge-template-author/CHANGELOG.md`](agentforge-template-author/CHANGELOG.md).
+- **`agentforge-debug` 0.1.0** — initial release. Triages stuck or
+  failing pipeline runs from symptom to fix path. See
+  [`agentforge-debug/CHANGELOG.md`](agentforge-debug/CHANGELOG.md).
+- **`agentforge-workflow` 0.1.0** — initial release. Walks users through
+  designing an AgentForge workflow and emits a complete `.agentforge/`
+  directory. See
+  [`agentforge-workflow/CHANGELOG.md`](agentforge-workflow/CHANGELOG.md).
diff --git a/skills/README.md b/skills/README.md
@@ -11,17 +11,21 @@ Install all skills from this repo into your agent runtime:
 npx skills add mandarnilange/agentforge
 ```
 
-Or install a single skill by path:
+The Vercel skills CLI scans the repo's `skills/` directory and installs every skill it finds. Want a single skill instead? Use the full GitHub tree URL:
 
 ```bash
-npx skills add mandarnilange/agentforge/agentforge-workflow
+npx skills add https://github.com/mandarnilange/agentforge/tree/main/skills/agentforge-workflow
 ```
 
 ## Available skills
 
-| Skill | What it does |
-|---|---|
-| [`agentforge-workflow`](./agentforge-workflow/SKILL.md) | Walks a user through designing an AgentForge workflow — agents, pipeline, gates, loops, parallelism, wiring, nodes — and emits a complete `.agentforge/` directory. |
+| Skill | Audience | What it does |
+|---|---|---|
+| [`agentforge-workflow`](./agentforge-workflow/SKILL.md) | End user | Walks through designing an AgentForge workflow — agents, pipeline, gates, loops, parallelism, wiring, nodes — and emits a complete `.agentforge/` directory. |
+| [`agentforge-template-author`](./agentforge-template-author/SKILL.md) | Contributor | Guides adding a new shipped template under `packages/{core,platform}/src/templates/`. Scope check → package selection → manifest → tests → PR checklist. |
+| [`agentforge-debug`](./agentforge-debug/SKILL.md) | Operator | Triages a stuck or failing pipeline run from symptom → state inspection → root-cause classification → fix path. |
+
+Versions, release notes, and the bump policy live in [`CHANGELOG.md`](./CHANGELOG.md).
 
 ## Layout
 

diff --git a/skills/agentforge-debug/CHANGELOG.md b/skills/agentforge-debug/CHANGELOG.md
@@ -0,0 +1,15 @@
+# `agentforge-debug` changelog
+
+## 0.1.0 — 2026-05-02
+
+Initial release.
+
+- Triages a stuck or failing AgentForge pipeline run: symptom →
+  state inspection → root-cause classification → fix path.
+- Read-first discipline: never propose state-mutating actions until run
+  state has been read from the state store / dashboard.
+- Reference docs: symptom tree (symptom → cause → first command),
+  state model (every status enum and transition), fix paths (concrete
+  recipes per root cause).
+- Hard rules: no SQLite-file deletes, no gate bypass without
+  authorisation, no validation skip flags, one fix at a time.
diff --git a/skills/agentforge-debug/SKILL.md b/skills/agentforge-debug/SKILL.md
@@ -0,0 +1,167 @@
+---
+name: agentforge-debug
+description: >
+  Triages a stuck or failing AgentForge pipeline run. Walks the user from
+  symptom → state inspection → root-cause classification → fix path. Trigger
+  when the user reports an AgentForge run that is "stuck", "failing",
+  "looping", "blowing its budget", "not picking up at the gate", "schema
+  invalid", "won't dispatch", "node not picking it up", or asks "why is
+  pipeline X not finishing?". Do NOT trigger for general AgentForge design
+  questions (use `agentforge-workflow`) or for debugging unrelated AI
+  pipelines.
+license: MIT
+metadata:
+  author: mandarnilange
+  version: "0.1.0"
+---
+
+# AgentForge Pipeline Debug
+
+You are helping a user triage an AgentForge pipeline run that is not
+behaving. Run state lives in the SQLite/Postgres state store; every agent
+step, gate decision, and node heartbeat is recorded there. The dashboard
+visualises the same data. This skill walks from the symptom the user
+reports to a concrete fix path.
+
+The bias is **read first, change last**. Never propose mutations (re-run,
+abort, force-approve gate) until you have read the current run state and
+classified the root cause.
+
+## When to use this skill
+
+Trigger on any of:
+- "Pipeline X is stuck / not finishing / hung"
+- "My run is looping / hitting max iterations / failing schema validation"
+- "Gate is not picking up / no one can approve"
+- "Agent blew its budget / cost ceiling hit"
+- "Node not picking up the job / scheduler isn't dispatching"
+- "Dashboard says paused but nothing is happening"
+- "Why is pipeline `<id>` failing?"
+
+Do **not** trigger for:
+- Workflow design questions (use `agentforge-workflow`).
+- General LLM-app debugging unrelated to AgentForge.
+
+## Reference material
+
+Read on demand:
+
+- `references/symptom-tree.md` — symptom → likely cause → first command to run
+- `references/state-model.md` — pipeline / agent run / gate status enums and
+  what each transition means
+- `references/fix-paths.md` — concrete recipes for each root cause
+  (re-run agent, edit prompt, swap node, raise budget, force gate, etc.)
+
+The authoritative status enums live at:
+- `packages/core/src/domain/models/pipeline-run.model.ts` — `PipelineRunStatus`
+- `packages/core/src/domain/models/agent-run.model.ts` — `AgentRunRecordStatus`, `AgentRunExitReason`
+- `packages/core/src/domain/models/gate.model.ts` — `GateStatus`
+
+Read the status enum from the source if you need the exact list — do not
+rely on memory.
+
+## The flow
+
+### 1. Capture the symptom precisely
+
+Restate what you heard in one or two lines. Resist jumping to fixes.
+
+Ask only what you need:
+- The pipeline run ID (or pipeline name + project name).
+- What the user expected to happen.
+- What they observe (dashboard state, error message, last log line).
+- When it started — was it working before? What changed?
+
+Skip questions whose answers you can infer from the dashboard / state.
+
+### 2. Read the run state
+
+Run these in order. Stop reading the moment you have enough to classify:
+
+```bash
+# High-level run state — status, current phase, last update
+agentforge get pipeline <run-id>
+
+# Per-agent run records — which one is stuck/failed
+agentforge get pipeline <run-id> --agents          # if available; otherwise:
+# Inspect via dashboard at http://localhost:3001/runs/<run-id>
+
+# Logs for the suspect agent run
+agentforge logs <run-id> --agent <agent-name>
+
+# Node pool health — relevant if the symptom is "not dispatching"
+agentforge get nodes
+```
+
+If the user does not have CLI access (e.g. on a worker host), point them at
+the dashboard pages: `/runs/<id>`, `/runs/<id>/agents/<agent-name>`,
+`/nodes`.
+
+### 3. Classify the root cause
+
+Use `references/symptom-tree.md`. The classifications:
+
+| Class | One-line tell |
+|---|---|
+| **Gate hung** | `pipeline_run.status = paused_at_gate`, no recent gate decision |
+| **Agent failed (LLM error)** | `agent_run.status = failed`, `error` field non-empty, `exitReason = error` |
+| **Agent failed (schema invalid)** | `failed` with `error` mentioning Zod or `validate` step |
+| **Agent failed (script step)** | `failed` with last step type `script` and non-zero `exitCode` |
+| **Loop spinning** | Agent `running` with high `retryCount` or `loop.iteration ≈ maxIterations` |
+| **Budget hit** | `failed` with `exitReason = budget-tokens` or `budget-cost` |
+| **Timeout** | `failed` with `exitReason = timeout` |
+| **Stuck dispatching** | `agent_run.status = pending`/`scheduled` for long, no node claim |
+| **Node down** | `nodes` table shows the targeted node `offline` or stale heartbeat |
+| **Node affinity miss** | `pending` forever; no node satisfies `nodeAffinity.required` |
+| **Cancelled** | `pipeline_run.status = cancelled` — was someone else's `agentforge cancel <run-id>`? |
+
+Tell the user the class in one sentence and cite the evidence ("agent
+`developer` is `failed` with `exitReason: budget-cost`, run cost was
+$0.62 vs the $0.60 ceiling").
+
+### 4. Propose the fix path
+
+Use `references/fix-paths.md` for the matched class. Present **the smallest
+reversible action first**, then escalating options.
+
+For example, on a schema-invalid failure:
+1. Inspect the agent's last LLM output via dashboard or logs.
+2. If the LLM hallucinated a missing field, the prompt likely needs a
+   tightening — propose a one-line addition to `prompts/<agent>.system.md`.
+3. If the schema itself is wrong, point at the schema file and propose the
+   minimum change.
+4. Re-run only the failed agent: `agentforge run --continue <run-id>`.
+
+Do not skip to "abort and re-start the pipeline" unless the run is
+unrecoverable.
+
+### 5. Confirm before mutating state
+
+Any action that changes shared state — approving a gate, cancelling a run,
+re-running an agent, force-clearing a stuck claim — requires explicit user
+confirmation. State your understanding, the proposed action, and the
+expected outcome. Wait.
+
+Read-only investigation does not need confirmation.
+
+## Hard rules
+
+- **Read before write.** Always inspect run state before proposing a fix.
+- **Never tell the user to delete the SQLite file** or `TRUNCATE pipeline_runs`
+  to "clear stuck state". That is data loss. Use the proper CLI.
+- **Never bypass a gate** without explicit user authorisation. Gates are the
+  audit trail.
+- **Do not propose `--no-verify` or skip-validation flags** to bypass
+  schema failures. Fix the schema or the prompt.
+- **Do not assume the dashboard is wrong.** When the dashboard and CLI
+  disagree, the state store is the source of truth — read it directly.
+- **Propose one fix at a time.** Avoid stacked changes that make it
+  impossible to know which one solved the problem.
+
+## What success looks like
+
+- The user knows exactly which agent / phase / gate is the failure point.
+- The user knows the root-cause class with cited evidence.
+- The user has a concrete next action (CLI command, file edit, gate
+  decision) and an idea of what will tell them whether it worked.
+- No state has been mutated without their explicit go-ahead.