diff --git a/.claude/scheduled_tasks.lock b/.claude/scheduled_tasks.lock deleted file mode 100644 index a9fcb25..0000000 --- a/.claude/scheduled_tasks.lock +++ /dev/null @@ -1 +0,0 @@ -{"sessionId":"3c40b50c-1a7d-4957-a8df-6dd80f691ee9","pid":29655,"acquiredAt":1776712915943} \ No newline at end of file diff --git a/.gitignore b/.gitignore index 82b5388..82c14ff 100644 --- a/.gitignore +++ b/.gitignore @@ -18,4 +18,10 @@ coverage/ /_archive/ .grepai/ -graphify-out/ \ No newline at end of file +graphify-out/ + +# AI coding agent local state (per-developer, not for git) +.claude/ +.agents/ +CLAUDE.md +skills-lock.json \ No newline at end of file diff --git a/README.md b/README.md index 4c61660..a958853 100644 --- a/README.md +++ b/README.md @@ -222,13 +222,19 @@ When `ANTHROPIC_API_KEY` isn't set, the dashboard renders a read-only banner — ## Agent Skills -Designing an AgentForge workflow from scratch is a lot of YAML. The repo ships an [agent skill](https://skills.sh) that walks any Claude Code / Cursor / Codex session through the design — agents, phases, gates, loops, parallelism, wiring, nodes — and emits a working `.agentforge/` directory. +The repo ships [agent skills](https://skills.sh) that drop into any Claude Code / Cursor / Codex session — installable in one command: ```bash -npx skills add mandarnilange/agentforge/agentforge-workflow +npx skills add mandarnilange/agentforge ``` -Then ask the agent something like *"help me design an AgentForge pipeline for PR triage"* and the skill kicks in. Catalog and authoring docs: [`skills/`](skills/). +| Skill | Audience | What it does | +|---|---|---| +| `agentforge-workflow` | End user | Walks through designing an AgentForge workflow and emits a complete `.agentforge/` directory. | +| `agentforge-template-author` | Contributor | Guides adding a new shipped template under `packages/{core,platform}/src/templates/`. | +| `agentforge-debug` | Operator | Triages stuck or failing pipeline runs from symptom to fix path. | + +Trigger phrases live in each skill's frontmatter — e.g. *"help me design an AgentForge pipeline for PR triage"* fires `agentforge-workflow`; *"why is pipeline X stuck?"* fires `agentforge-debug`. Catalog, changelog, and authoring docs: [`skills/`](skills/). --- diff --git a/skills/CHANGELOG.md b/skills/CHANGELOG.md new file mode 100644 index 0000000..a9f770b --- /dev/null +++ b/skills/CHANGELOG.md @@ -0,0 +1,56 @@ +# Skills changelog + +Top-level history for every skill under [`skills/`](.). Each skill also +keeps a per-skill `CHANGELOG.md` with finer-grained notes — link to that +file from the entry below. + +## Versioning policy + +Skills are versioned independently via `metadata.version` in +`SKILL.md`'s frontmatter. Use semver: + +| Bump | Trigger | +|---|---| +| **major** | Trigger conditions changed (the skill fires on different prompts now), required tool surface changed, file structure renamed, or an existing reference doc removed. Anything a user has muscle memory for. | +| **minor** | New reference docs, new sections in `SKILL.md`, additive support for new prompts. Existing behaviour preserved. | +| **patch** | Prose tightening, typo fixes, schema cheat-sheet updates that mirror upstream changes, broken-link repairs. | + +**Bump on every PR that lands a skill change**, even a tiny one — the +version is what distinguishes one published skill from the next, and +users may pin to it. CI's frontmatter validator enforces presence; the +contributor enforces honesty. + +## Release flow + +1. Land your change locally and run `npm run skills:validate`. +2. Bump the affected skill's `metadata.version` in its `SKILL.md`. +3. Add an entry to that skill's `CHANGELOG.md` (create the file if it + does not exist). +4. Add a one-line entry here under the next "Unreleased" or current + version block, linking to the per-skill changelog. +5. Open a PR. The `Publish Skills` workflow validates frontmatter on + every push. + +There is no separate "publish" command. Once the PR merges into `main`, +`npx skills add mandarnilange/agentforge` picks up the new version on +next install. + +## History + +### Unreleased + +- _add entries here as they merge_ + +### 2026-05-02 + +- **`agentforge-template-author` 0.1.0** — initial release. Guides + contributors through adding a new shipped template under + `packages/{core,platform}/src/templates/`. See + [`agentforge-template-author/CHANGELOG.md`](agentforge-template-author/CHANGELOG.md). +- **`agentforge-debug` 0.1.0** — initial release. Triages stuck or + failing pipeline runs from symptom to fix path. See + [`agentforge-debug/CHANGELOG.md`](agentforge-debug/CHANGELOG.md). +- **`agentforge-workflow` 0.1.0** — initial release. Walks users through + designing an AgentForge workflow and emits a complete `.agentforge/` + directory. See + [`agentforge-workflow/CHANGELOG.md`](agentforge-workflow/CHANGELOG.md). diff --git a/skills/README.md b/skills/README.md index a0377ab..dfe98c7 100644 --- a/skills/README.md +++ b/skills/README.md @@ -11,17 +11,21 @@ Install all skills from this repo into your agent runtime: npx skills add mandarnilange/agentforge ``` -Or install a single skill by path: +The Vercel skills CLI scans the repo's `skills/` directory and installs every skill it finds. Want a single skill instead? Use the full GitHub tree URL: ```bash -npx skills add mandarnilange/agentforge/agentforge-workflow +npx skills add https://github.com/mandarnilange/agentforge/tree/main/skills/agentforge-workflow ``` ## Available skills -| Skill | What it does | -|---|---| -| [`agentforge-workflow`](./agentforge-workflow/SKILL.md) | Walks a user through designing an AgentForge workflow — agents, pipeline, gates, loops, parallelism, wiring, nodes — and emits a complete `.agentforge/` directory. | +| Skill | Audience | What it does | +|---|---|---| +| [`agentforge-workflow`](./agentforge-workflow/SKILL.md) | End user | Walks through designing an AgentForge workflow — agents, pipeline, gates, loops, parallelism, wiring, nodes — and emits a complete `.agentforge/` directory. | +| [`agentforge-template-author`](./agentforge-template-author/SKILL.md) | Contributor | Guides adding a new shipped template under `packages/{core,platform}/src/templates/`. Scope check → package selection → manifest → tests → PR checklist. | +| [`agentforge-debug`](./agentforge-debug/SKILL.md) | Operator | Triages a stuck or failing pipeline run from symptom → state inspection → root-cause classification → fix path. | + +Versions, release notes, and the bump policy live in [`CHANGELOG.md`](./CHANGELOG.md). ## Layout diff --git a/skills/agentforge-debug/CHANGELOG.md b/skills/agentforge-debug/CHANGELOG.md new file mode 100644 index 0000000..f923793 --- /dev/null +++ b/skills/agentforge-debug/CHANGELOG.md @@ -0,0 +1,15 @@ +# `agentforge-debug` changelog + +## 0.1.0 — 2026-05-02 + +Initial release. + +- Triages a stuck or failing AgentForge pipeline run: symptom → + state inspection → root-cause classification → fix path. +- Read-first discipline: never propose state-mutating actions until run + state has been read from the state store / dashboard. +- Reference docs: symptom tree (symptom → cause → first command), + state model (every status enum and transition), fix paths (concrete + recipes per root cause). +- Hard rules: no SQLite-file deletes, no gate bypass without + authorisation, no validation skip flags, one fix at a time. diff --git a/skills/agentforge-debug/SKILL.md b/skills/agentforge-debug/SKILL.md new file mode 100644 index 0000000..b5316e4 --- /dev/null +++ b/skills/agentforge-debug/SKILL.md @@ -0,0 +1,167 @@ +--- +name: agentforge-debug +description: > + Triages a stuck or failing AgentForge pipeline run. Walks the user from + symptom → state inspection → root-cause classification → fix path. Trigger + when the user reports an AgentForge run that is "stuck", "failing", + "looping", "blowing its budget", "not picking up at the gate", "schema + invalid", "won't dispatch", "node not picking it up", or asks "why is + pipeline X not finishing?". Do NOT trigger for general AgentForge design + questions (use `agentforge-workflow`) or for debugging unrelated AI + pipelines. +license: MIT +metadata: + author: mandarnilange + version: "0.1.0" +--- + +# AgentForge Pipeline Debug + +You are helping a user triage an AgentForge pipeline run that is not +behaving. Run state lives in the SQLite/Postgres state store; every agent +step, gate decision, and node heartbeat is recorded there. The dashboard +visualises the same data. This skill walks from the symptom the user +reports to a concrete fix path. + +The bias is **read first, change last**. Never propose mutations (re-run, +abort, force-approve gate) until you have read the current run state and +classified the root cause. + +## When to use this skill + +Trigger on any of: +- "Pipeline X is stuck / not finishing / hung" +- "My run is looping / hitting max iterations / failing schema validation" +- "Gate is not picking up / no one can approve" +- "Agent blew its budget / cost ceiling hit" +- "Node not picking up the job / scheduler isn't dispatching" +- "Dashboard says paused but nothing is happening" +- "Why is pipeline `` failing?" + +Do **not** trigger for: +- Workflow design questions (use `agentforge-workflow`). +- General LLM-app debugging unrelated to AgentForge. + +## Reference material + +Read on demand: + +- `references/symptom-tree.md` — symptom → likely cause → first command to run +- `references/state-model.md` — pipeline / agent run / gate status enums and + what each transition means +- `references/fix-paths.md` — concrete recipes for each root cause + (re-run agent, edit prompt, swap node, raise budget, force gate, etc.) + +The authoritative status enums live at: +- `packages/core/src/domain/models/pipeline-run.model.ts` — `PipelineRunStatus` +- `packages/core/src/domain/models/agent-run.model.ts` — `AgentRunRecordStatus`, `AgentRunExitReason` +- `packages/core/src/domain/models/gate.model.ts` — `GateStatus` + +Read the status enum from the source if you need the exact list — do not +rely on memory. + +## The flow + +### 1. Capture the symptom precisely + +Restate what you heard in one or two lines. Resist jumping to fixes. + +Ask only what you need: +- The pipeline run ID (or pipeline name + project name). +- What the user expected to happen. +- What they observe (dashboard state, error message, last log line). +- When it started — was it working before? What changed? + +Skip questions whose answers you can infer from the dashboard / state. + +### 2. Read the run state + +Run these in order. Stop reading the moment you have enough to classify: + +```bash +# High-level run state — status, current phase, last update +agentforge get pipeline + +# Per-agent run records — which one is stuck/failed +agentforge get pipeline --agents # if available; otherwise: +# Inspect via dashboard at http://localhost:3001/runs/ + +# Logs for the suspect agent run +agentforge logs --agent + +# Node pool health — relevant if the symptom is "not dispatching" +agentforge get nodes +``` + +If the user does not have CLI access (e.g. on a worker host), point them at +the dashboard pages: `/runs/`, `/runs//agents/`, +`/nodes`. + +### 3. Classify the root cause + +Use `references/symptom-tree.md`. The classifications: + +| Class | One-line tell | +|---|---| +| **Gate hung** | `pipeline_run.status = paused_at_gate`, no recent gate decision | +| **Agent failed (LLM error)** | `agent_run.status = failed`, `error` field non-empty, `exitReason = error` | +| **Agent failed (schema invalid)** | `failed` with `error` mentioning Zod or `validate` step | +| **Agent failed (script step)** | `failed` with last step type `script` and non-zero `exitCode` | +| **Loop spinning** | Agent `running` with high `retryCount` or `loop.iteration ≈ maxIterations` | +| **Budget hit** | `failed` with `exitReason = budget-tokens` or `budget-cost` | +| **Timeout** | `failed` with `exitReason = timeout` | +| **Stuck dispatching** | `agent_run.status = pending`/`scheduled` for long, no node claim | +| **Node down** | `nodes` table shows the targeted node `offline` or stale heartbeat | +| **Node affinity miss** | `pending` forever; no node satisfies `nodeAffinity.required` | +| **Cancelled** | `pipeline_run.status = cancelled` — was someone else's `agentforge cancel `? | + +Tell the user the class in one sentence and cite the evidence ("agent +`developer` is `failed` with `exitReason: budget-cost`, run cost was +$0.62 vs the $0.60 ceiling"). + +### 4. Propose the fix path + +Use `references/fix-paths.md` for the matched class. Present **the smallest +reversible action first**, then escalating options. + +For example, on a schema-invalid failure: +1. Inspect the agent's last LLM output via dashboard or logs. +2. If the LLM hallucinated a missing field, the prompt likely needs a + tightening — propose a one-line addition to `prompts/.system.md`. +3. If the schema itself is wrong, point at the schema file and propose the + minimum change. +4. Re-run only the failed agent: `agentforge run --continue `. + +Do not skip to "abort and re-start the pipeline" unless the run is +unrecoverable. + +### 5. Confirm before mutating state + +Any action that changes shared state — approving a gate, cancelling a run, +re-running an agent, force-clearing a stuck claim — requires explicit user +confirmation. State your understanding, the proposed action, and the +expected outcome. Wait. + +Read-only investigation does not need confirmation. + +## Hard rules + +- **Read before write.** Always inspect run state before proposing a fix. +- **Never tell the user to delete the SQLite file** or `TRUNCATE pipeline_runs` + to "clear stuck state". That is data loss. Use the proper CLI. +- **Never bypass a gate** without explicit user authorisation. Gates are the + audit trail. +- **Do not propose `--no-verify` or skip-validation flags** to bypass + schema failures. Fix the schema or the prompt. +- **Do not assume the dashboard is wrong.** When the dashboard and CLI + disagree, the state store is the source of truth — read it directly. +- **Propose one fix at a time.** Avoid stacked changes that make it + impossible to know which one solved the problem. + +## What success looks like + +- The user knows exactly which agent / phase / gate is the failure point. +- The user knows the root-cause class with cited evidence. +- The user has a concrete next action (CLI command, file edit, gate + decision) and an idea of what will tell them whether it worked. +- No state has been mutated without their explicit go-ahead. diff --git a/skills/agentforge-debug/references/fix-paths.md b/skills/agentforge-debug/references/fix-paths.md new file mode 100644 index 0000000..ebbbedb --- /dev/null +++ b/skills/agentforge-debug/references/fix-paths.md @@ -0,0 +1,199 @@ +# Fix paths — concrete recipes per root cause + +Each section: smallest reversible action first, then escalations. Always +get explicit user confirmation before any state-mutating action. + +## Gate hung + +**Smallest action — make the decision via dashboard or CLI:** +```bash +agentforge gate approve # accept +agentforge gate reject --reason "..." # reject +agentforge gate revise --notes "Tighten section 3" # ask for revision +``` + +**If approver is unavailable:** check `gateDefaults.timeout` in the pipeline. +Expired gates auto-reject. Adjust the timeout in the pipeline YAML for +future runs. + +**If the gate should never have existed:** edit the pipeline to drop the +`gate:` block on that phase. This requires a re-run; cannot retroactively +remove a gate from an in-flight run. + +## Agent failed — LLM error + +**Smallest action — read the error field, retry once if transient:** +```bash +agentforge run --continue +``` +Anthropic 529 (overloaded) and transient network errors are retried 3× by +the runtime already. A failure surfaced to the agent run record means +those retries were exhausted. + +**If the LLM keeps producing the same wrong output:** the prompt is the +problem. Read the agent's `prompts/.system.md`, identify the +ambiguity, and tighten with a one-line constraint. Re-run the failed +agent only: +```bash +agentforge run --continue +``` + +**If the model is the problem (capability gap):** swap `spec.model.name` +to a stronger model on that agent. Note the cost impact; warn the user. + +## Agent failed — schema invalid + +**Smallest action — inspect the failing artifact:** +- Dashboard: `/runs//agents/` shows the LLM's last + output and the validation error. +- CLI: `agentforge logs --agent ` (then look for the + `validate` step). + +**Branch on root cause:** + +| Cause | Fix | +|---|---| +| LLM hallucinated a missing field | Tighten the prompt to require it | +| LLM emitted the right field with the wrong type | Add an example to the prompt | +| Schema is too strict for the use case | Loosen the schema — minimum change wins | +| Schema is wrong (typo, missing optional) | Fix the schema | + +**Re-run only the failed agent:** +```bash +agentforge run --continue +``` + +**Do not** disable validation to make the run pass. The validate step +exists exactly to catch this; bypassing it pollutes downstream phases. + +## Agent failed — script step + +**Smallest action — read the error field for stderr + exit code.** + +**Most common causes and fixes:** +- Missing tool on the node (`eslint`, `pytest`, `gofmt`): + ```bash + # On the node, install it. Then re-run. + npm i -g eslint # or whatever + agentforge run --continue + ``` +- Wrong working directory: the script `cd {{run.workdir}}`s but `workdir` + is empty because a prior step did not create it. Add `mkdir -p` to the + setup step. +- Path / env var missing on the worker: the local node has it but the + Docker / SSH worker does not. Bake the dep into the worker image, or + declare it as a `nodeAffinity.required` capability and run the agent + on a different node. + +## Loop spinning at maxIterations + +**First — read what the LLM is doing each iteration.** The dashboard's +agent timeline shows every iteration's input and output. If the LLM is +making no progress, the prompt is the problem; if each iteration improves, +the predicate is wrong. + +**Fixes:** +- Predicate wrong (script always emits the same value): rewrite the + predicate's `run` block. Re-run only the agent. +- Prompt too vague: add a "what the previous iteration got wrong" hint + using `{{steps.run-tests.output}}`. +- Genuinely too hard: raise `maxIterations` (with caution — costs + multiply) or break the work into two agents. + +**Never raise `maxIterations` past 10** without explicit user buy-in. +Cost and wall-clock both compound. + +## Budget hit (`budget-tokens` / `budget-cost`) + +**Read `tokenUsage` and `cost_usd` to see how close to the ceiling the run +came.** + +**Three honest fixes:** +1. **Raise the budget** in `spec.resources.budget`. Justified when the + task is genuinely larger than the original estimate. +2. **Shorten the prompt** — system prompts that have grown over time + often duplicate context. Read for redundancy. +3. **Pick a cheaper model** — `claude-haiku-4-5` for cost-sensitive + agents that do not need deep reasoning. + +Do not silently disable budgets. They exist to bound runaway spend. + +## Timeout + +**Smallest action — bump the timeout for that agent only:** +```yaml +spec: + resources: + timeoutSeconds: 1800 # was 600 (default) +``` + +`0` disables. Use sparingly — disabled timeouts mean a hung LLM call +holds the worker forever. + +**If the agent has multiple steps:** add per-step timeouts in +`spec.definitions..timeoutSeconds` rather than raising the +agent-level one. The script step is usually the one that needs more +time, not the LLM call. + +## Stuck dispatching / node affinity miss + +**First — confirm a node satisfies the agent's required capabilities:** +```bash +agentforge get nodes +grep -A 5 "nodeAffinity" .agentforge/agents/.agent.yaml +``` + +**Two fix paths:** +- **Relax the agent**: drop a `required:` capability the user does not + actually need. +- **Strengthen the pool**: start a worker with the missing capability: + ```bash + NODE_NAME=worker-gpu \ + NODE_CAPABILITIES=llm-access,docker,gpu \ + CONTROL_PLANE_URL=http://cp:3001 \ + docker compose -f packages/platform/docker-compose.worker.yml up -d + ``` + +If all nodes are at `maxConcurrentRuns`: wait, raise the limit on a node, +or scale the pool horizontally. + +## Node down + +**Verify:** +```bash +agentforge get nodes # is the node `offline` / heartbeat stale? +``` + +**Fixes:** +- Restart the worker process. The reconciler reclaims its in-flight runs + within `claim_ttl` (default ~60s) and re-dispatches. +- If the node is permanently gone, drop it from the pool. Existing + agent runs reassigned automatically. + +## Cancelled by accident + +`cancelled` is terminal. There is no "uncancel." + +**Recovery path:** +- Read what artifacts the cancelled run produced via dashboard or CLI. +- Start a fresh pipeline. If you want to skip phases that already + completed, manually feed those artifacts as inputs to the new run. + +Make sure the user understands the cost / time implications before +re-running the full pipeline. + +--- + +## When to escalate to a code change + +If the same class of failure has happened more than twice on different +runs of the same template / agent, the root cause is in the YAML or the +framework, not the run. Examples: +- Repeated schema-invalid failures from the same agent → prompt or schema + bug; fix and PR. +- Repeated timeouts on the same agent → unrealistic budget; raise default. +- Repeated stuck-dispatch on the same affinity → template author should + loosen the `required:` to `preferred:`. + +In these cases, file an issue or a PR rather than re-running the same +broken pipeline. diff --git a/skills/agentforge-debug/references/state-model.md b/skills/agentforge-debug/references/state-model.md new file mode 100644 index 0000000..9d100fc --- /dev/null +++ b/skills/agentforge-debug/references/state-model.md @@ -0,0 +1,114 @@ +# State model — what every status means + +Authoritative source files (read directly when in doubt): + +- `packages/core/src/domain/models/pipeline-run.model.ts` +- `packages/core/src/domain/models/agent-run.model.ts` +- `packages/core/src/domain/models/gate.model.ts` +- `packages/core/src/domain/models/node.model.ts` + +## `PipelineRunStatus` + +``` +running → at least one agent is executing or pending +paused_at_gate → all phase-N agents done; waiting for human gate decision +completed → all phases finished, every gate approved +failed → an agent's failure was unrecoverable; pipeline halted +cancelled → manually aborted via CLI / dashboard +``` + +Transitions: +- `running` → `paused_at_gate` after a gated phase's agents complete +- `paused_at_gate` → `running` after gate `approved` +- `paused_at_gate` → `failed` after gate `rejected` or timeout +- `running` → `failed` when an agent run fails and there are no retries left +- any → `cancelled` via explicit cancel + +`completed` and `failed` are terminal except via `agentforge run --continue` +(only valid when `paused_at_gate` or in the middle of a recoverable failure). + +## `AgentRunRecordStatus` + +``` +pending → created but not yet claimed by a node +scheduled → assigned to a node, awaiting execution slot +running → executing on a node right now +succeeded → finished, output validated, artifacts persisted +failed → run did not produce a valid output +``` + +A `pending` agent run with no matching node hangs the pipeline. A +`scheduled` agent run that never advances to `running` suggests the worker +crashed between claim and start — the reconciler should reclaim within +`claim_ttl`. + +## `AgentRunExitReason` + +Populated only when non-obvious. ADR-0003 has the rationale. + +``` +timeout → wall-clock budget hit (LLM call or step) +budget-tokens → token ceiling exceeded +budget-cost → USD ceiling exceeded +cancelled → user cancelled mid-run +error → catch-all (LLM error, script non-zero exit, schema invalid) +``` + +Absent `exitReason` on `succeeded` is normal. Absent on `failed` means the +failure path is unspecified — read the `error` field for the message. + +## `GateStatus` + +``` +pending → awaiting decision +approved → human approved; pipeline proceeds to next phase +rejected → human rejected; pipeline marked failed +revision_requested → human asked for revision; phase agents re-run with revisionNotes +``` + +`revision_requested` re-runs the phase's agents with the revision note +inlined into their input. The new agent runs are full executions — +budgets and timeouts apply afresh. + +## `NodeStatus` + +``` +online → recent heartbeat, accepting work +offline → no recent heartbeat (default ~60s threshold) +unknown → never heartbeated since registration +degraded → online but reporting partial capability loss +``` + +`offline` for longer than the heartbeat threshold triggers reclaim of any +agent runs claimed by that node. `degraded` is informational — the node +still receives work that fits its remaining capabilities. + +## How transitions are persisted + +Every status change writes a row to the state store: + +- `pipeline_runs` — one row per run, `version` column for optimistic + concurrency. +- `agent_runs` — one row per agent execution, includes `tokenUsage`, + `cost_usd`, `duration_ms`, `error`, `exitReason`, `recoveryToken`, + `lastStatusAt`, `statusMessage`. +- `gates` — one row per gate, status updates atomic. + +Reading the state store directly (read-only) is fine for diagnostics. +Mutating it directly is not — always go through the CLI / API. + +## Reading the state store directly (last resort) + +```bash +sqlite3 ./output/.state.db "SELECT id, status, current_phase, started_at FROM pipeline_runs WHERE id LIKE '%' ORDER BY created_at DESC LIMIT 10;" + +sqlite3 ./output/.state.db "SELECT agent_name, phase, status, exit_reason, error, cost_usd FROM agent_runs WHERE pipeline_run_id = '' ORDER BY phase, started_at;" + +sqlite3 ./output/.state.db "SELECT id, phase_completed, status, reviewer, decided_at FROM gates WHERE pipeline_run_id = '';" +``` + +For Postgres deployments swap to `psql $AGENTFORGE_POSTGRES_URL`. Same +schema. + +Treat raw SQL output as a debugging aid — present findings to the user +in plain English, not as table dumps. diff --git a/skills/agentforge-debug/references/symptom-tree.md b/skills/agentforge-debug/references/symptom-tree.md new file mode 100644 index 0000000..7776ce3 --- /dev/null +++ b/skills/agentforge-debug/references/symptom-tree.md @@ -0,0 +1,161 @@ +# Symptom → cause → first command + +Walk the user from observable symptom to a single CLI command that will +either confirm or eliminate a suspected cause. + +## "The run is stuck — nothing is happening" + +**Step 1.** Read pipeline state: +```bash +agentforge get pipeline +``` + +Branch on `status`: + +| `status` | Likely cause | Next | +|---|---|---| +| `running` | Agent stuck inside an LLM call or loop | Read agent run records | +| `paused_at_gate` | Gate awaiting approval | List pending gates | +| `failed` | Latest agent failed, run halted | Inspect last agent run | +| `completed` | The run is done — user is looking at the wrong run | Confirm run-id | +| `cancelled` | Someone aborted it | Check audit log | + +**Step 2 — `running` branch.** Find the active agent: +- Dashboard: `/runs/` shows the live phase + agent. +- CLI: `agentforge logs ` and look at the most recent agent name. + +If the agent has been `running` for more than its `timeoutSeconds`, +suspect a timeout that has not yet propagated. Check the worker node: +```bash +agentforge get nodes +``` +Stale heartbeat → node crashed mid-run; reconciliation should mark the +agent `failed` within ~60s. + +**Step 2 — `paused_at_gate` branch.** Pending gate is waiting: +```bash +# List pending gates for the run (via dashboard or DB query) +# Dashboard: /runs//gates +``` +Gate status `pending` with a non-empty `phaseCompleted` → the user (or +their reviewer) needs to take action. Gate timeout (`gateDefaults.timeout` +or per-gate) automatically rejects when expired. + +## "Agent X is failing" + +**Step 1.** Read agent run record: +```bash +agentforge logs --agent # if your CLI supports it +# or via dashboard /runs//agents/ +``` + +Branch on `exitReason`: + +| `exitReason` | Cause | First fix path | +|---|---|---| +| `error` | LLM returned bad output, script step failed, or unhandled error | Read `error` field — message tells you which | +| `budget-tokens` | Hit `resources.budget.maxTotalTokens` | Inspect `tokenUsage`; raise budget or shorten prompt | +| `budget-cost` | Hit `resources.budget.maxCostUsd` | Inspect `cost_usd`; raise budget or pick cheaper model | +| `timeout` | Wall-clock budget exhausted | Raise `timeoutSeconds` or split work | +| `cancelled` | Manually cancelled via CLI / dashboard | Confirm intent | + +If `exitReason` is missing, the agent succeeded or failed for an +unspecified reason — read the `error` field directly. + +**Step 2 — Schema validation failures** (within `error: ...`): +- Look for `ZodError` or `validate` step output. +- Read the failing step's `instructions` (LLM prompt) and the schema file. +- 90% of the time: the prompt does not constrain the LLM toward the schema. + Fix is a one-line prompt addition stating the required fields. + +**Step 2 — Script step failures** (within `error: ...`): +- The error contains the step's stderr / exit code. +- Read the agent's `definitions..run` block. +- Common: missing tool on the node (`eslint`, `pytest`, `gofmt`), missing + env var (`PATH`), or a `cd {{run.workdir}}` referring to a path the + agent did not create. Fix in the YAML. + +## "It's looping — same error over and over" + +This is a `loop` block hitting `maxIterations` without the `until` +predicate going truthy. + +**Step 1.** Read the agent definition's loop: +```bash +# Pull the YAML — could be in .agentforge/ or under packages//src/templates/ +cat .agentforge/agents/.agent.yaml +``` + +**Step 2.** Inspect the predicate step's output across iterations via the +dashboard timeline or logs. The predicate (`test-gate`, etc.) should emit a +truthy sentinel when work is done. If it never does, either: +- The work genuinely cannot be completed by the LLM with current + constraints (raise `maxIterations`, change model, refine prompt). +- The predicate is wrong (logic error in the script that decides done-ness). + +Do **not** raise `maxIterations` past a sane ceiling (5–10) without first +reading what the LLM is doing — silent infinite-loop on the user's bill. + +## "Node not picking up the job" + +**Step 1.**: +```bash +agentforge get nodes +``` + +Branch: + +| State | Cause | Fix | +|---|---|---| +| Node `offline` / stale heartbeat | Worker process crashed | Restart the worker; reconciler will re-dispatch | +| All nodes online but agent `pending` | `nodeAffinity.required` not satisfied | Inspect agent's required capabilities vs node capabilities | +| All nodes at max capacity | `maxConcurrentRuns` reached | Wait, scale the pool, or raise per-node concurrency | +| No nodes at all | No worker registered | Start one — `agentforge node start --control-plane-url ...` | + +**Step 2.** For `nodeAffinity` mismatches: +```bash +# What the agent requires +grep -A 5 "nodeAffinity" .agentforge/agents/.agent.yaml + +# What the nodes provide +agentforge get nodes +``` +Resolve by either relaxing the agent's requirements or starting a node +with the missing capability. + +## "Cost ceiling hit / budget blown" + +**Step 1.** The agent's `exitReason` will be `budget-tokens` or +`budget-cost`. Read its `tokenUsage` and `cost_usd` from the run record. + +**Step 2.** Decide path: +- **The agent's task is genuinely larger than the budget** → raise + `resources.budget.maxTotalTokens` / `maxCostUsd` in the agent YAML. +- **The prompt is leaky** (LLM rambling, stuffed system prompt) → tighten + the prompt, lower `maxTokens`, lower `thinking` from `high` to `medium`. +- **A loop is consuming budget** → lower `maxIterations` or fix the + predicate (see "looping" above). + +## "Pipeline stuck dispatching — never reaches an agent" + +This is the scheduler not finding a node, OR a worker not claiming the +pending job. Order of investigation: + +1. `agentforge get nodes` — any nodes online? +2. Inspect `nodeAffinity` requirements (above). +3. If nodes online and capabilities match, suspect the event bus or job + queue. Check the control-plane logs for "no eligible node" messages + and the reconciler logs for claim races. +4. Worst case: restart the control plane. The state store survives; + dispatch resumes. + +## "I cancelled by accident — can I resume?" + +`pipeline_run.status = cancelled` is terminal. The state store keeps the +record but `agentforge run --continue ` will not work. Use: +- `agentforge get pipeline ` to read the artifacts produced so far. +- Start a fresh pipeline that wires those artifacts as inputs (manual + inputs to the new run). + +There is no automatic "uncancel". Confirm with the user before they +re-run from scratch (cost, time). diff --git a/skills/agentforge-template-author/CHANGELOG.md b/skills/agentforge-template-author/CHANGELOG.md new file mode 100644 index 0000000..9599a2f --- /dev/null +++ b/skills/agentforge-template-author/CHANGELOG.md @@ -0,0 +1,17 @@ +# `agentforge-template-author` changelog + +## 0.1.0 — 2026-05-02 + +Initial release. + +- Guides contributors adding a new shipped AgentForge template under + `packages/{core,platform}/src/templates//`. +- Walks scope check → package selection → domain & inputs → agent set + → pipeline shape → prompts → schemas → manifest, README, and tests + → wire-up → emit. +- Reference docs: template anatomy (directory layout + `template.json` + manifest contract), core-vs-platform decision matrix, test-and-publish + PR checklist. +- Hard rules: no invented registry fields, one package only, no + platform-only features in core templates, every output type has a + schema, parse tests required. diff --git a/skills/agentforge-template-author/SKILL.md b/skills/agentforge-template-author/SKILL.md new file mode 100644 index 0000000..b8ebf2c --- /dev/null +++ b/skills/agentforge-template-author/SKILL.md @@ -0,0 +1,231 @@ +--- +name: agentforge-template-author +description: > + Guides a contributor through adding a new shipped AgentForge template under + `packages/core/src/templates//` (or platform). Walks the user through + scope check, agent set, prompts, schemas, `template.json` metadata, README, + and tests, then emits the complete template directory ready for PR. Trigger + when the user asks to "add a new template", "ship a template for X", + "contribute a template", "create a starter for Y workflow", or wants to + publish a reusable pipeline shape that other AgentForge users can scaffold + via `agentforge init --template `. Do NOT trigger for + end-user `.agentforge/` projects (use `agentforge-workflow` for those). +license: MIT +metadata: + author: mandarnilange + version: "0.1.0" +--- + +# AgentForge Template Author + +You are helping a contributor add a new **shipped template** to AgentForge. +Templates live under `packages/core/src/templates//` (core, no platform +deps) or `packages/platform/src/templates//` (platform, can use the +multi-provider middleware, Postgres, Docker / SSH executors). They are +auto-discovered by the template registry — no registration step. + +End users scaffold a copy into their project with: + +```bash +npx @mandarnilange/agentforge init --template +``` + +A new template is the right contribution shape when a useful pipeline pattern +is missing from the catalog. It is **not** the right shape when a user is +building something for their own project — point them at +`agentforge-workflow` instead. + +## When to use this skill + +Trigger on any of: +- "Add / contribute / ship a new AgentForge template" +- "Create a starter template for ``" +- "Modify the existing `