From 93e9efabc632ca9fd2cdec1bca00460f66e520ca Mon Sep 17 00:00:00 2001
From: Brian Madison
Date: Sat, 6 Jun 2026 15:15:28 -0500
Subject: [PATCH 01/13] Rebuild workflow-builder and eval-runner: eval-driven, platform-agnostic, lean

workflow-builder:
- New leanness/outcome scanner owns the minimal-baseline tests (core test, defend-against-its-own-absence, outcome-vs-prescription) that nothing applied before
- Report pipeline collapsed: scanners return lean JSON in-context, one report-author fills a stable self-contained HTML shell with a single JSON island (cannot render blank; multi-select + copy-to-paste-back-prompt baked in). Retired generate-html-report.py, extract-report-json.py, the rigid report-data schema, report-quality-scan-creator.md
- memlog replaces .decision-log.md (scripts/memlog.py, typed append-only)
- customize.toml is the sole customizability mechanism; build flow asks (default no, headless no unless requested); no installer questions, no module.yaml authoring
- tiktoken token counts replace line counts as the length metric (scripts/count_tokens.py)
- Build flow rewritten from the 5-phase lockstep into the migration-guide Process loop
- Enhancement scanner flipped to add-or-subtract; standard-fields numbered-steps fix
- Script-opportunity seeking kept and strengthened (native python)

eval-runner:
- Rebuilt lean: dropped Docker, PTY, keychain staging, dual isolation
- Standalone and builder-invoked; everything runtime-specific behind one platform-adapter seam (invocation command, auth env-var, transcript schema; skill_dir/load_signal for triggers)
- Four modes: baseline (skill vs bare model), variant (full vs stripped), quality, trigger
- Case = input + rubric + optional state_prefix (turn-simulation); bounded self-improvement loop
- No hardcoded model list anywhere
---
 skills/bmad-eval-runner/SKILL.md | 100 +--
 skills/bmad-eval-runner/agents/grader.md | 93 ---
 skills/bmad-eval-runner/assets/Dockerfile | 29 -
 .../references/description-optimization.md | 64 ++
 .../references/eval-format.md | 91 +++
 .../references/eval-formats.md | 147 ----
 skills/bmad-eval-runner/references/grader.md | 72 ++
 .../bmad-eval-runner/references/isolation.md | 110 ---
 .../references/platform-adapter.md | 51 ++
 .../references/self-improvement.md | 55 ++
 .../scripts/aggregate_benchmark.py | 236 ++++++
 .../bmad-eval-runner/scripts/count_tokens.py | 77 ++
 .../bmad-eval-runner/scripts/docker_setup.py | 115 ---
 .../scripts/generate_report.py | 184 -----
 skills/bmad-eval-runner/scripts/memlog.py | 197 +++++
 skills/bmad-eval-runner/scripts/pty_runner.py | 171 ----
 skills/bmad-eval-runner/scripts/run_evals.py | 750 ++++++++----------
 .../bmad-eval-runner/scripts/run_triggers.py | 542 +++++++------
 skills/bmad-eval-runner/scripts/utils.py | 260 ------
 skills/bmad-workflow-builder/SKILL.md | 47 +-
 .../assets/SKILL-template.md | 70 +-
 .../assets/report-shell.html | 580 ++++++++++++++
 .../references/build-process.md | 148 +---
 .../references/complex-workflow-patterns.md | 109 ++-
 .../references/customize-toml-guide.md | 119 +++
 .../references/quality-analysis.md | 140 ----
 .../references/quality-scan-architecture.md | 63 --
 .../references/quality-scan-customization.md | 48 --
 .../references/quality-scan-determinism.md | 60 --
 .../references/quality-scan-enhancement.md | 55 --
 .../references/report-author.md | 64 ++
 .../references/report-quality-scan-creator.md | 182 -----
 .../references/scan-architecture.md | 50 ++
 .../references/scan-customization.md | 52 ++
 .../references/scan-determinism.md | 46 ++
 .../references/scan-enhancement.md | 46 ++
 .../references/scan-leanness.md | 80 ++
 .../references/scan-orchestration.md | 59 ++
 .../script-opportunities-reference.md | 128 ++-
 .../references/skill-quality-principles.md | 177 +++--
 .../references/standard-fields.md | 226 +++---
 .../scripts/count_tokens.py | 77 ++
 .../scripts/extract-report-json.py | 287 -------
 .../scripts/generate-html-report.py | 588 --------------
 .../scripts/init_skill.py | 132 +++
 .../bmad-workflow-builder/scripts/memlog.py | 197 +++++
 .../scripts/prepass-execution-deps.py | 288 -------
 .../scripts/prepass-prompt-metrics.py | 407 +++++-----
 .../scripts/quick_validate.py | 117 +++
 .../scripts/tests/test_count_tokens.py | 180 +++++
 .../scripts/tests/test_memlog.py | 247 ++++++
 51 files changed, 4179 insertions(+), 4234 deletions(-)
 delete mode 100644 skills/bmad-eval-runner/agents/grader.md
 delete mode 100644 skills/bmad-eval-runner/assets/Dockerfile
 create mode 100644 skills/bmad-eval-runner/references/description-optimization.md
 create mode 100644 skills/bmad-eval-runner/references/eval-format.md
 delete mode 100644 skills/bmad-eval-runner/references/eval-formats.md
 create mode 100644 skills/bmad-eval-runner/references/grader.md
 delete mode 100644 skills/bmad-eval-runner/references/isolation.md
 create mode 100644 skills/bmad-eval-runner/references/platform-adapter.md
 create mode 100644 skills/bmad-eval-runner/references/self-improvement.md
 create mode 100644 skills/bmad-eval-runner/scripts/aggregate_benchmark.py
 create mode 100644 skills/bmad-eval-runner/scripts/count_tokens.py
 delete mode 100644 skills/bmad-eval-runner/scripts/docker_setup.py
 delete mode 100644 skills/bmad-eval-runner/scripts/generate_report.py
 create mode 100644 skills/bmad-eval-runner/scripts/memlog.py
 delete mode 100644 skills/bmad-eval-runner/scripts/pty_runner.py
 delete mode 100644 skills/bmad-eval-runner/scripts/utils.py
 create mode 100644 skills/bmad-workflow-builder/assets/report-shell.html
 create mode 100644 skills/bmad-workflow-builder/references/customize-toml-guide.md
 delete mode 100644 skills/bmad-workflow-builder/references/quality-analysis.md
 delete mode 100644 skills/bmad-workflow-builder/references/quality-scan-architecture.md
 delete mode 100644 skills/bmad-workflow-builder/references/quality-scan-customization.md
 delete mode 100644 skills/bmad-workflow-builder/references/quality-scan-determinism.md
 delete mode 100644 skills/bmad-workflow-builder/references/quality-scan-enhancement.md
 create mode 100644 skills/bmad-workflow-builder/references/report-author.md
 delete mode 100644 skills/bmad-workflow-builder/references/report-quality-scan-creator.md
 create mode 100644 skills/bmad-workflow-builder/references/scan-architecture.md
 create mode 100644 skills/bmad-workflow-builder/references/scan-customization.md
 create mode 100644 skills/bmad-workflow-builder/references/scan-determinism.md
 create mode 100644 skills/bmad-workflow-builder/references/scan-enhancement.md
 create mode 100644 skills/bmad-workflow-builder/references/scan-leanness.md
 create mode 100644 skills/bmad-workflow-builder/references/scan-orchestration.md
 create mode 100644 skills/bmad-workflow-builder/scripts/count_tokens.py
 delete mode 100644 skills/bmad-workflow-builder/scripts/extract-report-json.py
 delete mode 100644 skills/bmad-workflow-builder/scripts/generate-html-report.py
 create mode 100644 skills/bmad-workflow-builder/scripts/init_skill.py
 create mode 100644 skills/bmad-workflow-builder/scripts/memlog.py
 delete mode 100755 skills/bmad-workflow-builder/scripts/prepass-execution-deps.py
 create mode 100644 skills/bmad-workflow-builder/scripts/quick_validate.py
 create mode 100644 skills/bmad-workflow-builder/scripts/tests/test_count_tokens.py
 create mode 100644 skills/bmad-workflow-builder/scripts/tests/test_memlog.py

diff --git a/skills/bmad-eval-runner/SKILL.md b/skills/bmad-eval-runner/SKILL.md
index 911dbc9..3846e88 100644
--- a/skills/bmad-eval-runner/SKILL.md
+++ b/skills/bmad-eval-runner/SKILL.md
@@ -1,91 +1,77 @@
 ---
 name: bmad-eval-runner
-description: Run a skill's evals in a clean, isolated environment and report results. Use when the user wants to evaluate a skill, run evals, benchmark a skill, validate triggers, or grade skill outputs.
+description: Run a skill's evals and report results. 
Use when the user wants to evaluate a skill, run evals, benchmark a skill, validate triggers, optimize a description, or grade skill outputs.
 ---
 # Skill Eval Runner
-## Overview
+You run a skill's evals and report what they say. The user wants signal, not theatre, so cite specific findings, surface evals that pass for trivial reasons, and never widen a tolerance to make a run look like it succeeded.
-Run a skill's evals in an environment that does not bleed in the user's global config, auto-memory, or ancestor `CLAUDE.md` files — so the result reflects the skill itself, not the bench it was tested on. Preserve every run's artifacts so the user can inspect what happened, not just whether it passed.
+The runner is platform-agnostic. Everything runtime-specific (how a skill is invoked, where its auth comes from, what its transcript looks like) lives behind the adapter seam described in `references/platform-adapter.md`. No model name is hardcoded anywhere in this skill.
-Two eval shapes are supported and run independently:
+## The four modes
-- **Artifact evals** (`evals.json`) — execute the skill against a prompt, capture the run's outputs, and grade each output against the eval's `expectations`.
-- **Trigger evals** (`triggers.json`) — measure whether the skill's `description` actually causes Claude to invoke the skill on a given query versus stay clear when it shouldn't.
+Each mode answers a different question about a skill. Pick the one that matches what the user is asking, or run several.
-You are an experienced eval engineer. The user wants signal, not theatre. Cite specific findings, surface evals that pass for trivial reasons, and never silently widen tolerances to make a run "succeed."
+| Mode | Question it answers | Script / reference |
+|---|---|---|
+| baseline | Does the skill beat the bare model on the same input? | `references/eval-format.md`, `scripts/run_evals.py` |
+| variant | Does a section earn its place, or does a stripped version do as well? 
| `references/eval-format.md`, `scripts/run_evals.py` |
+| quality | Does the output meet the named rubric? | `references/grader.md`, `references/eval-format.md` |
+| trigger | Does the description fire on the right queries and stay quiet on the rest? | `references/platform-adapter.md`, `scripts/run_triggers.py` |
-## Args
-
-- Positional: a path to the skill being evaluated (directory containing `SKILL.md`).
-- `--evals ` — explicit path to evals folder or a specific `evals.json` / `triggers.json` file. If omitted, discover.
-- `--mode artifact|trigger|both` — which eval kind to run. Default: `both` if both files are found, else whichever exists.
-- `--isolation docker|local|auto` — sandbox strategy. Default: `auto` (Docker when available, otherwise local).
-- `--project-root ` — root of the project the skill belongs to. Default: walk up from skill path looking for `_bmad/` or `.git/`.
-- `--output-dir ` — where run folders are written. Default: `{bmad_builder_reports}/eval-runs/` if configured, else `~/bmad-evals/`.
-- `--workers ` — parallel evals. Default: 4.
-- `--headless` / `-H` — non-interactive; emit final JSON only.
-
-## On Activation
-
-1. Resolve config the same way `bmad-workflow-builder` does (`{project-root}/_bmad/config.yaml` then `config.user.yaml`, falling back to `bmb/config.yaml`). Resolve `{user_name}`, `{communication_language}`, `{bmad_builder_reports}`. Apply throughout the session.
+Baseline runs the input twice in the same turn, once with the skill and once against the bare model, so the bare model is the long-term floor. Variant runs the full skill against a stripped smallest-version of itself to settle whether a section is doing real work. Quality grades one config's output against a rubric with the read-only grader. Trigger measures real firing through the adapter and can optimize the description across rounds; the optimization loop lives in `references/description-optimization.md`.
-2. 
If `--headless` was passed, set `{headless_mode}=true` and skip every confirmation below; pick the safest defaults and proceed.
+A case is `input + rubric + optional state_prefix`. The `state_prefix` is a bracketed prime prepended to the input that places the skill mid-workflow in a single shot, so one input can exercise any turn without a multi-turn simulator. The full case format and the strong-versus-weak expectation taxonomy are in `references/eval-format.md`.
-3. Locate the skill. Verify `/SKILL.md` exists; halt with a clear error if it doesn't.
+## Args
-4. Discover evals — see `## Eval Discovery` below.
+- Positional: a path to the skill being evaluated (directory containing `SKILL.md`).
+- `--evals `: explicit path to the cases file. If omitted, discover.
+- `--mode baseline|variant|quality|trigger`: which mode to run. May be repeated.
+- `--variant-path `: for variant mode, the stripped or prior-version skill to compare against.
+- `--project-root `: root of the project the skill belongs to. Default: walk up from the skill path looking for `_bmad/` or `.git/`.
+- `--output-dir `: where run folders are written. Default: `{bmad_builder_reports}/eval-runs/` if configured, else `~/bmad-evals/`.
+- `--runs `: repeats per case for the variance benchmark. Default: 1 for a single check, higher when the user wants a stable mean.
+- `--headless` / `-H`: non-interactive; emit final JSON only.
-5. Choose isolation — see `## Isolation` below. On the first Docker run on this machine, the image will need to be built; surface that, ask once unless headless, then cache.
+## On activation
-6. Confirm the run summary with the user (skill, evals found, mode, isolation, output dir) unless headless. Then execute.
+1. Resolve config the way `bmad-workflow-builder` does (`{project-root}/_bmad/config.yaml` then `config.user.yaml`, falling back to `bmb/config.yaml`). Resolve `{user_name}`, `{communication_language}`, `{bmad_builder_reports}` and apply them through the session. 
-## Eval Discovery
+2. If `--headless` was passed, set `{headless_mode}=true`, skip every confirmation below, pick the safest defaults, and proceed.
-Look in this order, taking the first match:
+3. Resume check: glob the output dir for an in-progress run's `.memlog.md`. If one exists and matches this skill, read it once to rebuild state, then continue append-only. Capture decisions and direction changes into the run's memlog through `scripts/memlog.py` as they land.
-1. `--evals` argument if provided. May point to a folder (containing `evals.json` and/or `triggers.json`) or a specific JSON file.
-2. `/evals/` — colocated with the skill.
-3. `/../../evals//` — sibling-of-parent layout (common in BMad modules where `evals/` is excluded from distribution but lives next to `src/`).
-4. `/evals//` — top-level evals tree.
-5. `/evals/**//` — anywhere under project evals.
+4. Locate the skill and verify `/SKILL.md` exists. Halt with a clear error if it does not.
-Surface what you found and where. If no evals are discovered, halt with a clear message — do not attempt to fabricate evals.
+5. Load the adapter for the current runtime from `references/platform-adapter.md`. This gives you the invocation command, the auth env-var to forward, and the transcript schema to read.
-## Isolation
+6. Discover the cases file. Look at `--evals` first, then `/evals/`, then `/../../evals//`, then `/evals//`, then anywhere under `/evals/`. Take the first match. If nothing is found, halt and say so; the runner does not invent cases.
-Run each eval in a fresh workspace so memory, project CLAUDE.md, prior runs, and host shell config cannot bias the result. Two strategies, picked automatically by default:
+7. Confirm the run summary (skill, cases found, modes, output dir) unless headless, then execute.
-- **Docker** (preferred when available): each eval runs in a fresh container off `bmad-eval-runner:latest`. The host's `ANTHROPIC_API_KEY` is the only env passed in. 
The skill's project is bind-mounted read-only and copied into a writable scratch dir inside the container; `HOME` is a fresh in-container directory; there is no auto-memory and no host CLAUDE.md.
+## Run execution
-- **Local fallback** (when Docker is unavailable or the user opts out): each eval runs in a fresh `~/bmad-evals///workspace/` directory with `HOME=/.home` overridden so global memory and global CLAUDE.md do not leak. The project is copied (or hardlinked where supported) into the workspace. Tell the user this is the active mode and acknowledge that local isolation is best-effort, not hermetic.
+Run each case in a clean working directory so the host shell config, prior runs, and ancestor instruction files do not bias the result. The clean-cwd setup is part of the adapter seam; there is no container, no terminal emulation, and no credential staging beyond forwarding the adapter's auth env-var.
-The first time Docker is selected on this machine, build the image — `python3 {skill-root}/scripts/docker_setup.py --build` — and tell the user this is happening once.
+For baseline, variant, and quality modes, call `python3 {skill-root}/scripts/run_evals.py` with the resolved arguments and the adapter. The script applies any `state_prefix` to the input before invoking, runs the configured invocations (skill, bare model, or variant), and writes a per-case folder. It captures timing and token counts the moment each invocation completes and writes them to `timing.json` immediately, so a later crash never loses the measurement.
-Details and the exact mount layout live in `references/isolation.md`. Read that file when you need to debug an isolation issue or explain to the user what is being isolated.
+For trigger mode, call `python3 {skill-root}/scripts/run_triggers.py`. It stages a synthetic skill where the runtime discovers skills, sends each query through the adapter, and detects the skill-load event. Each query runs several times for stability. 
When the user wants to optimize the description rather than just measure it, follow `references/description-optimization.md`.
-## Run Execution
+For quality mode, spawn the grader described in `references/grader.md` per case. The grader is read-only against the run folder, returns `{text, passed, evidence}` per expectation, gives no partial credit, and flags weak or non-discriminating assertions. Relay that feedback.
-For artifact evals, invoke `python3 {skill-root}/scripts/run_evals.py` with the resolved arguments. The script handles isolation per eval, runs `claude -p` in the sandbox with the eval's prompt and any staged fixture files, and writes a per-eval folder with `prompt.txt`, `transcript.jsonl`, `artifacts/`, and `metrics.json`.
+When `--runs` is greater than one, call `python3 {skill-root}/scripts/aggregate_benchmark.py` to produce the mean, sample standard deviation, min, max, and the delta between configs.
-For trigger evals, invoke `python3 {skill-root}/scripts/run_triggers.py`. The script measures whether the skill's description causes the skill to fire for each query, with `runs-per-query` repeats for stability, and writes `triggers-result.json`. Trigger evals should run under Docker isolation when available — local mode can have the host's installed skills bleed in via cwd-based skill discovery, biasing the trigger signal. If Docker is unavailable, run trigger evals locally but say so explicitly.
+## Artifacts
-After artifact runs complete, grade each eval. Spawn a grader subagent per eval in parallel (Agent tool, prompt loaded from `{skill-root}/agents/grader.md` plus the eval's `expectations` and the path to its outputs). Each grader writes `grading.json` next to the artifacts. The grader has license to flag weak assertions — relay that feedback to the user.
+Every run writes a dated run folder under the output dir, and those artifacts are permanent. 
Each case folder holds its prompt, transcript, any files the skill wrote, `timing.json`, and the grading when quality mode ran. Never delete, overwrite, or rotate a run folder; disk usage is the user's call. The run's `.memlog.md` records the decisions and deltas so a resumed or audited run reads back cleanly.
-After all grading is done, generate the aggregate report — `python3 {skill-root}/scripts/generate_report.py --run-dir ` — which produces `report.html`. Tell the user where the run folder is and where the HTML report is.
+Tell the user where the run folder is when you finish.
 ## Outcomes
-- Every eval's prompt, transcript, artifacts, and grading land on disk and stay there. Nothing is silently cleaned up.
-- The run honestly reflects the skill's behavior in a clean room — not the behavior of the host shell with its memories and configs.
-- The user knows whether Docker or local was used and why.
-- Failures cite specific expectations and evidence; passes that look superficial are flagged, not papered over.
-
-## Constraints
-
-- **Artifacts are forever.** Never delete, overwrite, or rotate run folders. Disk usage is the user's call.
-- **Auth boundary is narrow.** On macOS, the host's Claude Code OAuth credential is staged into each isolated `.claude/.credentials.json` so the subprocess can authenticate without inheriting host config. `ANTHROPIC_API_KEY`, if set, is also forwarded. Nothing else crosses.
-- **Trigger evals do not need real artifacts.** They use a stub command file and only measure description firing — keep them cheap and parallel.
-- **No silent fallbacks on grading.** If a grader subagent errors, mark that eval `grading_error` rather than substituting a default verdict.
-- **Stop when evals are missing.** If discovery returns nothing, halt with diagnostics — the runner does not invent test cases.
+- The run reflects the skill's behavior in a clean working directory, not the behavior of the host shell with its memories and configs. 
+- Timing and token counts land on disk the moment they are measured.
+- Failures cite specific expectations with evidence, and a pass that looks superficial is flagged rather than papered over.
+- A baseline run that the skill no longer wins points to retiring the skill, not patching it.
diff --git a/skills/bmad-eval-runner/agents/grader.md b/skills/bmad-eval-runner/agents/grader.md
deleted file mode 100644
index af1d0fb..0000000
--- a/skills/bmad-eval-runner/agents/grader.md
+++ /dev/null
@@ -1,93 +0,0 @@
-# Grader Agent
-
-Evaluate a single eval's expectations against its captured transcript and artifacts. Return pass/fail per expectation with evidence — and flag weak assertions when you see them.
-
-You are not the executor. You are not allowed to "fix" the artifacts. Your only job is to inspect what was produced and answer: did each expectation hold?
-
-## Inputs
-
-You receive in your prompt:
-
-- **eval_id**: identifier for this eval
-- **prompt**: the original user message that was sent to the skill
-- **expected_output**: human-readable description of what success looks like (context only, not scored against)
-- **expectations**: list of strings — the assertions you grade
-- **transcript_path**: absolute path to a stream-JSON transcript (`.jsonl`)
-- **artifacts_dir**: absolute path to the directory containing files the skill wrote
-- **grading_path**: absolute path where you write `grading.json`
-
-## Process
-
-1. **Read the transcript.** Open `transcript_path`. The transcript is stream-JSON: each line is a JSON event. Note:
- - The user prompt that was sent
- - Every tool call Claude made — `Write`, `Edit`, `Read`, `Skill`, `Bash`, etc. (the event has `type: "assistant"` and `content[].type: "tool_use"` with `name` and `input`)
- - The order tool calls happened in (events are line-ordered)
- - The final assistant message — often contains a JSON status block for headless runs
- - Any errors or warnings logged
-
-2. 
**List and inspect artifacts.** Walk `artifacts_dir`. For each expectation, open the files it implicates and read their contents — do not rely on filenames alone. Note file modification times when ordering or read-only behavior matters.
-
-3. **Grade each expectation independently.** For each entry in `expectations`, identify what kind of check it is and gather the right evidence:
-
- - **Side-artifact existence + content** ("decision-log.md exists AND captures decision X") → open the file, read it, check the content matches.
- - **Transcript tool-call patterns** ("transcript contains a Skill call to bmad-editorial-review-prose") → scan the transcript for `tool_use` events with the matching `name` and `input`. Quote the matching event.
- - **Phase ordering** ("polish call occurs after the Write to brief.md and before the final JSON block") → find the line numbers / event indices of each landmark and verify the order.
- - **Read-only enforcement** ("input brief.md is byte-identical to the fixture; no Write/Edit calls targeted it") → compare file content if the original is available; AND scan the transcript for any Write/Edit `tool_use` whose `input.file_path` falls in the protected directory.
- - **YAML frontmatter** ("frontmatter contains title, status, created (ISO 8601), updated") → parse the frontmatter, check fields and their formats.
- - **JSON output blocks** ("final assistant message contains a JSON object with intent='create'") → look at the final `text` content of the last assistant message; extract the JSON object; check the field.
- - **Bidirectional fidelity** ("every decision in decision-log.md is reflected in brief.md AND no claim in brief.md is absent from the input prompt or log") → list decisions in the log, verify each appears in the brief; list substantive claims in the brief, verify each traces to either the prompt or the log.
-
-4. 
**Decide PASS or FAIL with specific evidence.**
- - PASS only if there is clear, specific evidence the expectation holds AND the evidence reflects substance, not surface compliance (file exists AND contains correct content, not just the right filename).
- - FAIL when no evidence is found, evidence contradicts, or the assertion is technically satisfied but the underlying outcome is wrong.
- - Cite the evidence — quote a specific line, name a specific file with a path, point to a specific tool call with its index or input.
-
-5. **Critique the evals.** After grading, surface assertions that look weak: ones that passed but would also pass for a clearly wrong output, or important outcomes you observed (good or bad) that no assertion checks. Keep the bar high — flag what an eval author would say "good catch" about, not nits.
-
-6. **Write `grading.json`.** Save to `grading_path`.
-
-## Output Format
-
-```json
-{
- "eval_id": "",
- "expectations": [
- {
- "text": "brief.md exists in the run folder",
- "passed": true,
- "evidence": "Found at artifacts/2026-05-09-insulens/brief.md, 487 words"
- },
- {
- "text": "decision-log.md references having ingested the memo as source material",
- "passed": false,
- "evidence": "decision-log.md exists but contains only template placeholders; no mention of the memo"
- }
- ],
- "summary": {
- "passed": 1,
- "failed": 1,
- "total": 2,
- "pass_rate": 0.5
- },
- "eval_feedback": {
- "suggestions": [
- {
- "assertion": "brief.md exists in the run folder",
- "reason": "Existence is a weak check — an empty brief.md would also pass. Consider pairing with a content assertion (e.g., word count > 200, contains the project name)."
- }
- ],
- "overall": "Assertions check structure but not content correctness in two places."
- }
-}
-```
-
-If `eval_feedback.suggestions` would be empty, set it to `[]` and `overall` to `"No suggestions; assertions look solid."`
-
-## Guidelines
-
-- **Be objective.** Verdicts come from evidence, not vibes. 
-- **Be specific.** Quote, name files, point to line numbers.
-- **No partial credit.** Each expectation is pass or fail.
-- **Burden of proof is on the expectation.** When uncertain, fail.
-- **Do not edit artifacts.** You are read-only against the run folder.
-- **Do not silently substitute defaults.** If you genuinely cannot read a file or the transcript is missing, mark the affected expectations failed with that as the evidence.
diff --git a/skills/bmad-eval-runner/assets/Dockerfile b/skills/bmad-eval-runner/assets/Dockerfile
deleted file mode 100644
index 9c791ae..0000000
--- a/skills/bmad-eval-runner/assets/Dockerfile
+++ /dev/null
@@ -1,29 +0,0 @@
-FROM node:20-bookworm-slim
-
-ENV DEBIAN_FRONTEND=noninteractive
-
-RUN apt-get update \
- && apt-get install -y --no-install-recommends \
- git \
- python3 \
- python3-pip \
- ca-certificates \
- curl \
- jq \
- rsync \
- && rm -rf /var/lib/apt/lists/*
-
-RUN npm install -g @anthropic-ai/claude-code
-
-RUN useradd -ms /bin/bash evaluator \
- && mkdir -p /workspace /project /output /home/evaluator/.claude \
- && chown -R evaluator:evaluator /workspace /output /home/evaluator
-
-USER evaluator
-WORKDIR /workspace
-
-ENV HOME=/home/evaluator
-ENV CLAUDE_CONFIG_DIR=/home/evaluator/.claude
-ENV PATH=/home/evaluator/.local/bin:$PATH
-
-CMD ["bash"]
diff --git a/skills/bmad-eval-runner/references/description-optimization.md b/skills/bmad-eval-runner/references/description-optimization.md
new file mode 100644
index 0000000..f2295d6
--- /dev/null
+++ b/skills/bmad-eval-runner/references/description-optimization.md
@@ -0,0 +1,64 @@
+# Description optimization: the trigger-eval loop
+
+A skill's description is its only trigger. The router reads it, decides whether the user's request belongs to this skill, and either loads it or moves on. A description that is too narrow stays quiet when it should fire; one that is too broad fires on requests it cannot serve. 
This loop measures real firing against a held-out test set and improves the description until it triggers on what it should and stays silent on what it should not, without the improver ever overfitting to the cases it is being graded on.
+
+The whole loop runs through the adapter, so "did the skill fire" means the skill-load event the runtime emits, defined in `references/platform-adapter.md`. No model name appears anywhere in this loop; the adapter forwards whatever a runtime needs.
+
+## Step 1: generate the query set
+
+Generate about twenty near-miss queries, roughly half that should trigger the skill and half that should not. The signal lives in the near misses, so make the should-not queries share keywords, domain, and phrasing with the should queries. A should-not query that obviously belongs to another skill teaches the description nothing, because any wording already handles it. The pairs that matter are the ones a careless reader would lump together: a request to build a workflow versus a request to debug an existing one, a request to write a brief versus a request to critique a brief someone already wrote.
+
+Each query is a `{query, should_trigger}` record:
+
+```json
+{ "query": "help me turn my deploy script into a reusable skill", "should_trigger": true }
+{ "query": "my deploy script keeps failing on the rollback step", "should_trigger": false }
+```
+
+Aim for variety in surface form (casual speech, a pasted error, a one-line ask, a paragraph of context) so the description is tested against the shapes real requests arrive in, not one tidy template.
+
+## Step 2: stratified 60/40 split
+
+Split the queries into a train set and a test set, 60 percent train and 40 percent test, stratified so the should and should-not ratio is preserved in both halves. 
Stratifying matters because an unstratified split can land most of the should-not queries in one half and leave the improver blind to the false-positive problem on train, or leave the test set unable to detect it.
+
+The split is fixed once at the start of the loop and never reshuffled between rounds, because reshuffling would let a query that exposed a weakness in one round hide in the train set the next. The improver works only from the train set. It never sees the test queries, their labels, or the test score, which is what keeps the loop honest.
+
+## Step 3: measure real triggering
+
+Run every query through the adapter with the current description in place, several times per query because firing is probabilistic. The trigger rate for a query is the fraction of runs that produced the skill-load event. Turn each rate into a verdict against a threshold (a query "triggers" when its rate clears the bar, for example more than half its runs loaded the skill), then score against the labels:
+
+- a should-trigger query that triggered is a true positive,
+- a should-trigger query that stayed quiet is a false negative (the description is too narrow here),
+- a should-not query that triggered is a false positive (the description is too broad here),
+- a should-not query that stayed quiet is a true negative.
+
+Score train and test separately. The train score and its per-query verdicts are what the improver sees; the test score is recorded but withheld.
+
+## Step 4: improve from train failures, test blinded
+
+Hand the improver the current description, the train queries with their labels, and the train verdicts, and ask for a rewritten description that fixes the train failures. False negatives mean the description needs to claim ground it is leaving uncovered; false positives mean it needs to draw a sharper boundary against the near misses it is wrongly catching. 
The improver works the train failures only and never sees a test query or the test score, so it cannot tune to the held-out set. + +Also hand the improver the descriptions it already tried and why each fell short, so it tries something structurally different rather than nudging the same wording round after round. Without this, the loop tends to oscillate between two phrasings that each fix one failure and reintroduce the other. Feeding the history back pushes the improver toward a different cut of the boundary: reframing around intent instead of keywords, naming the adjacent skill the near misses belong to, or moving a qualifier from the trigger clause into the body. + +Keep the description within whatever length and format bounds the runtime enforces (character cap, no angle brackets, and so on); a rewrite that triggers well but violates the bound is not a candidate. + +## Step 5: re-measure and iterate + +Apply the new description, re-measure train and test, and record both scores plus the description text for this round. Continue for up to five rounds. Stop early if train reaches a clean separation (all should fire, all should-not stay quiet) and the test score agrees, because more rounds past a clean split only invite overfitting. + +## Step 6: pick the winner by test score + +After the rounds finish, pick the description with the best test score, not the best train score. Train measures how well the improver fixed the failures it could see; test measures whether that fix generalizes to queries it never saw, which is the only thing that matters in production. When two rounds tie on test, prefer the one with the better train score as the tiebreaker, and failing that the shorter, sharper description. + +Report the winning description, its test score, and the round-by-round trail (each description, its train score, its test score) so the choice is auditable and a human can override it. 
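The selection rule in this step can be sketched as below. The `Round` record and `pick_winner` are hypothetical names, and description length stands in for "shorter, sharper", which is only a rough proxy that a human reviewer may want to overrule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Round:
    description: str
    train_score: float
    test_score: float


def pick_winner(rounds):
    """Best test score wins; ties fall to train score, then to the shorter text."""
    return max(
        rounds,
        key=lambda r: (r.test_score, r.train_score, -len(r.description)),
    )
```

Because the key leads with `test_score`, a round that looks perfect on train but weaker on test never wins.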
Log the trail to the run's memlog through `scripts/memlog.py` as the loop runs, one `event` entry per round capturing the description tried and the train and test scores, so a resumed or audited run reads the progression cleanly. + +## Why each guard is here + +| Guard | What it prevents | +|---|---| +| near-miss should-not queries | a test set so easy the description never has to draw a real boundary | +| 60/40 stratified split | a split that hides the false-positive or false-negative problem in one half | +| fixed split across rounds | a weakness escaping into the train set on a later round | +| test score blinded from improver | the improver tuning its wording to the held-out queries | +| pick by test score, not train | shipping a description that fixed the visible failures but does not generalize | +| prior attempts fed back | the loop oscillating between two phrasings instead of finding a new boundary | diff --git a/skills/bmad-eval-runner/references/eval-format.md b/skills/bmad-eval-runner/references/eval-format.md new file mode 100644 index 0000000..0887841 --- /dev/null +++ b/skills/bmad-eval-runner/references/eval-format.md @@ -0,0 +1,91 @@ +# Eval format and the four modes + +A case is the unit of evaluation. Every case is `input + rubric + optional state_prefix`. The same case shape feeds all four modes; what changes is which invocations the runner sets up and how the result is judged. + +## The case + +```json +{ + "id": "create-1", + "input": "I want a brief for InsuLens, a claims-triage tool for mid-market insurers. 
Notes are in evals/insulens/files/memo.md", + "rubric": [ + "brief.md exists and its word count is between 250 and 1500", + "brief.md names InsuLens and the mid-market insurer segment", + "brief.md incorporates at least two specific points from memo.md without inventing claims absent from it" + ], + "state_prefix": null, + "files": ["evals/insulens/files/memo.md"] +} +``` + +Field semantics: + +- `id`: stable identifier; used as the case's folder name in the run. +- `input`: the realistic, messy user request. Use real file paths, company names, typos, and casual speech, because a polished input tests a situation the skill rarely meets. The runner sends this verbatim to the invocation, after prepending any `state_prefix`. +- `rubric`: a list of named expectations, each gradeable to `{text, passed, evidence}` by the grader. The strong-versus-weak taxonomy below decides whether each one is worth keeping. +- `state_prefix`: optional bracketed prime that places the skill mid-workflow (see below). Null or absent means the skill starts cold. +- `files`: optional fixture paths staged into the case's clean working directory before the run. A bare filename lands at the workspace root; a nested path keeps its directory structure. + +For trigger cases the shape is lighter: a `query` and a `should_trigger` boolean, because there is no artifact to grade, only whether the skill fired. Those cases are covered in `platform-adapter.md` and `description-optimization.md`. + +## state_prefix: turn simulation in one shot + +Most multi-turn skills can be evaluated single-shot if the case is designed right. The `state_prefix` is the trick that makes mid-workflow points reachable without a multi-turn simulator. 
It is a bracketed prime prepended to the input that tells the skill where in its own flow this turn lands and what the user already said: + +``` +[the skill has already worked through discovery; on turn 4 the user was asked about stakeholders and responded:] User said: "just me and a PM" +``` + +The runner prepends the `state_prefix` to `input` and sends the combined text as a single message. One input then exercises any mid-workflow moment: a clarifying turn, a correction, a resume after an interruption. This replaces the deferred multi-turn simulator for everything except cases where the conversation arc itself is the deliverable. + +Subjective skills (coaching, brainstorming, design facilitation) skip the rubric and rely on human judgment. The `state_prefix` still earns its place there, because it lets a human see the exact mid-run moment they want to judge. + +## Strong versus weak expectations + +The grader's job is easier and the result is more honest when an expectation is discriminating, meaning a wrong output cannot pass it. A weak expectation is worse than no expectation, because a green check on it reads as proof when it is noise. The grader flags weak expectations when it sees them; write them out of the rubric before they ship. + +Weak patterns to avoid: + +- Filename-only checks. "brief.md exists" passes for an empty file. Pair existence with a content check. +- Wholly subjective phrasing. "the brief is high quality" cannot be graded. State the property concretely. +- Tautologies. Anything that follows automatically from the prompt being understood proves nothing. + +Strong patterns for artifact correctness: + +- Specific facts that must appear, such as "incorporates at least two findings from section X." +- Structural claims a wrong output would fail, such as "word count between 250 and 1500." +- Negative assertions, such as "does not introduce content from unrelated sections." 
+- Frontmatter checks, such as "frontmatter contains title, status, created (ISO 8601), updated." +- Bounded output blocks, such as "the final message contains a JSON object with intent='create'." + +Strong patterns for process discipline: + +- Side-artifact existence paired with content, such as ".memlog.md captures the pricing decision with its rejected alternative and rationale." +- Transcript tool-call patterns, such as "the transcript contains a call invoking bmad-editorial-review-prose." +- Phase ordering, such as "the polish call occurs after the brief Write and before the final JSON block." +- Read-only enforcement, such as "the input brief.md is byte-identical to the fixture and no Write or Edit targeted it." +- Bidirectional fidelity, such as "every decision in the memlog is reflected in the brief, and no claim in the brief is absent from the input or the memlog." + +Most process-discipline checks are deterministic reads of the transcript and filesystem, so the grader confirms them by quoting evidence rather than judging. + +## The four modes in detail + +### Baseline: skill versus bare model + +Run the case input twice in parallel in the same turn, once wrapped by the skill and once against the bare model with nothing around it. The bare-model run is the long-term floor. The skill earns its existence only by producing something the bare model cannot, so when the skill stops beating the bare model the right call is retirement, not another patch. Use baseline when the user asks whether the skill is worth keeping, or as the release check. + +### Variant: full versus stripped smallest-version + +Run the full skill against a stripped smallest-version of the same skill (passed as `--variant-path`), or against a snapshot of the prior version for an edit, on the same input. This is the two-version comparison made runnable, and it settles the leanness scanner's defend-against-absence findings. 
If the two outputs tie on the dimension the section was supposed to protect, the section is decoration and gets cut. If the small version is materially and durably worse, the section earned its keep. Variant is how a suspected piece of ceremony gets a real verdict instead of an argument. + +### Quality: output versus rubric + +Grade a single config's output against the named rubric with the read-only grader in `references/grader.md`. The grader gives no partial credit, puts the burden of proof on a passing grade, and flags any non-discriminating assertion. Use quality when a rubric exists and the user wants to know whether the output meets it, independent of any comparison. + +### Trigger and description + +Generate near-miss should-trigger and should-not-trigger queries that share keywords, split them, measure real firing through the adapter, and improve the description across bounded rounds with the held-out scores blinded from the improver. The full loop, including the split ratio, the round bound, and feeding prior failed attempts back, is in `references/description-optimization.md`. Trigger detection itself is "did the skill load," abstracted per runtime in `references/platform-adapter.md`. + +## Getting a skill to behave non-interactively + +Single-shot modes need the skill to produce its deliverable without stopping to ask. Most multi-turn skills expose a headless flag or keyword that suppresses clarifying questions and ends with a structured status block. Trigger it from the input: the literal `Run headless.` at the start, a skill-specific keyword from the skill's own headless section, or enough context that no clarification is genuinely needed. The `state_prefix` also helps here, because a turn that already supplies the answer the skill would ask for keeps the run moving. If a skill has no headless path and the input cannot satisfy its questions, either add a headless mode to the skill or accept that this case needs a human in the loop. 
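As a rough illustration of how a case becomes the single message that is sent, the sketch below prepends the `state_prefix` and optionally adds the headless phrase. The helper name `compose_input`, the `headless` flag, and the newline separator are assumptions for the example, not the behavior of `run_evals.py`.

```python
def compose_input(case, headless=False):
    """Build the one message sent for a case: state_prefix first, then input."""
    parts = []
    prefix = case.get("state_prefix")
    if prefix:  # null or absent means the skill starts cold
        parts.append(prefix)
    body = case["input"]
    if headless and not body.startswith("Run headless."):
        # assumed convention: the literal phrase leads the input
        body = "Run headless. " + body
    parts.append(body)
    return "\n".join(parts)  # assumed separator between prefix and input
```

A skill that uses its own headless keyword would need that keyword in place of the literal phrase, taken from the skill's own headless section.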
diff --git a/skills/bmad-eval-runner/references/eval-formats.md b/skills/bmad-eval-runner/references/eval-formats.md deleted file mode 100644 index 6856abc..0000000 --- a/skills/bmad-eval-runner/references/eval-formats.md +++ /dev/null @@ -1,147 +0,0 @@ -# Eval Formats - -The runner accepts two file shapes, both compatible with Anthropic's skill-creator conventions. - -## Artifact evals — `evals.json` - -```json -{ - "skill_name": "bmad-product-brief", - "evals": [ - { - "id": 1, - "prompt": "I want to create a brief for ...", - "expected_output": "A run folder with brief.md and decision-log.md ...", - "files": [ - "evals/.../files/some-fixture.md" - ], - "expectations": [ - "brief.md exists in the run folder", - "decision-log.md exists", - "brief.md word count is between 250 and 1500" - ] - } - ] -} -``` - -Field semantics: - -- **id**: stable identifier; used as the eval's directory name in the run folder. -- **prompt**: the literal user message Claude will receive. Sent verbatim to `claude -p`. -- **expected_output**: human-readable description, used for context only — the grader reads it but does not score against it directly. -- **files**: optional fixture paths. Resolved relative to the project root (or the evals folder). Each file is staged into the eval's workspace before execution. Path semantics: - - A bare filename is staged at the workspace root. - - A nested path (`some-brief/brief.md`) preserves the directory structure inside the workspace. -- **expectations**: list of pass/fail assertions evaluated by the grader subagent. Each is graded independently. The grader is instructed to flag weak assertions — assertions a wrong output would also trivially pass. - -The grader writes `grading.json` next to each eval's artifacts; the runner aggregates. 
- -## Trigger evals — `triggers.json` - -```json -[ - { "query": "Help me write a product brief for ...", "should_trigger": true }, - { "query": "Help me brainstorm ideas for ...", "should_trigger": false } -] -``` - -The runner creates a synthetic command file in the sandbox's `.claude/commands/.md` containing the skill's description, then runs each query against `claude -p` with stream-JSON output and detects whether the skill (or a Read of its SKILL.md) appears as a tool call. Each query is run `--runs-per-query` times (default 3); `trigger_rate` is the fraction of runs that fired. - -A query passes when: -- `should_trigger=true` and `trigger_rate >= --trigger-threshold` (default 0.5) -- `should_trigger=false` and `trigger_rate < --trigger-threshold` - -Trigger evals do not produce artifacts beyond the result JSON. They are cheap and parallelize aggressively. - -## Where evals can live - -The runner discovers evals in this order: - -1. `--evals ` — explicit. May point to a folder or a specific `*.json`. -2. `/evals/` — colocated with the skill. -3. `/../../evals//` — sibling-of-parent. Common pattern when evals are intentionally excluded from skill distribution. -4. `/evals//`. -5. `/evals/**//` — fuzzy search under the project's evals tree. - -If both `evals.json` and `triggers.json` are found, both run unless `--mode` narrows it. - -## Two patterns for single-shot evals - -Most multi-turn workflow skills can be evaluated single-shot if you design the eval right. Two patterns cover the bulk of what you'd otherwise need a multi-turn simulator for: - -### Pattern A — artifact correctness (headless + rich prompt) - -Force the skill into headless mode and pack the prompt with everything Discovery would have surfaced. Grade what comes out: the artifact, its structure, whether it reflects the inputs without inventing. 
- -Use when: -- The deliverable is the artifact (brief, PRD, doc, plan) -- You can write a complete pre-Discovery prompt -- You want regression coverage on drafting/format/extraction - -### Pattern B — process discipline (headless + transcript and side-artifact inspection) - -Same single-shot mechanics, but the expectations look at *what the skill did internally* — not just the final output. The grader reads the stream-JSON transcript for tool calls, walks side-artifacts (decision logs, addenda, distillates), checks file mtimes, and verifies phase ordering. - -Use when: -- The skill enforces a protocol (decision log, polish phase, finalize sequence) -- The skill has read-only intents (Validate must not write) -- You need to catch "drafting works but the discipline went soft" regressions - -These are deterministic checks against the transcript and filesystem — no LLM judgment needed for most of them. - -### What single-shot can NOT cover - -Facilitation arc: vague-input → sharper pushback → user clarifies → better artifact. That requires a multi-turn user simulator. Defer it to a separate eval mode for skills where conversation is the value (coaching, brainstorming, design thinking). - -## Writing good expectations - -The grader's job is easier when expectations are *discriminating* — hard to pass without actually doing the work. - -**Weak patterns to avoid:** -- **Filename-only checks** — "brief.md exists" passes for an empty file. Pair with a content check. -- **Wholly subjective phrasing** — "the brief is high quality" cannot be evaluated. State the property concretely. -- **Tautologies** — anything that follows from the prompt being understood is not a useful expectation. 
- -**Strong patterns for artifact correctness (Pattern A):** -- Specific facts that should appear ("incorporates at least 2 specific findings from section X") -- Structural claims a wrong output would fail ("word count between 250 and 1500") -- Negative assertions ("does not introduce content from unrelated sections") -- YAML frontmatter checks ("frontmatter contains title, status, created, updated as ISO 8601") -- Bounded JSON output ("final assistant message contains a JSON object with intent='create'") - -**Strong patterns for process discipline (Pattern B):** -- **Side-artifact existence + content** ("decision-log.md exists AND captures the pricing decision with rejected alternative and rationale") -- **Transcript tool-call patterns** ("the transcript contains a Skill tool call invoking bmad-editorial-review-prose") -- **Phase ordering** ("the polish-phase Skill calls occur after the brief body Write and before the final JSON status block") -- **Read-only enforcement** ("the input brief.md is byte-identical to the staged fixture; no Write or Edit tool calls targeted the run folder") -- **Bidirectional fidelity** ("every substantive entry in decision-log.md has a corresponding reflection in brief.md, AND no claim in brief.md is absent from the input prompt or decision-log.md") -- **Timestamp checks** ("YAML frontmatter 'updated' field is later than 'created'; 'created' is unchanged from the input fixture") - -## Headless mode — getting the skill to behave non-interactively - -Most multi-turn skills expose a headless flag or keyword that suppresses clarifying questions and produces a structured JSON status block at the end. To use Pattern A or B, the eval prompt needs to trigger this. 
Common signals: - -- The literal phrase `Run headless.` at the start of the prompt -- Skill-specific flags or keywords as documented in the skill's `## Headless Mode` section -- Sufficient context such that no clarification is genuinely needed - -If the skill has no headless mode, single-shot evals will halt at the first clarifying question and you have two options: (1) add a headless mode to the skill, (2) defer that skill's evals to the multi-turn simulator. - -## Pre-staging files (Update / Validate intents) - -For Update and Validate evals, the workspace needs to contain an existing brief, decision log, addendum, etc. Use the `files` field — each path is staged into the workspace at the same relative location. The eval prompt then references the staged path explicitly: - -```json -{ - "id": "B5", - "prompt": "Run headless. Update the brief at evals/skill-x/files/some-brief/brief.md — ...", - "files": [ - "evals/skill-x/files/some-brief/brief.md", - "evals/skill-x/files/some-brief/decision-log.md", - "evals/skill-x/files/some-brief/addendum.md" - ] -} -``` - -For Validate (read-only) expectations, pair the staged files with byte-identical assertions and a no-Write/no-Edit transcript check. diff --git a/skills/bmad-eval-runner/references/grader.md b/skills/bmad-eval-runner/references/grader.md new file mode 100644 index 0000000..afb9ed3 --- /dev/null +++ b/skills/bmad-eval-runner/references/grader.md @@ -0,0 +1,72 @@ +# Grader: LLM-as-judge contract + +The grader inspects one case's captured transcript and artifacts and answers, per expectation, whether it held. It is read-only against the run folder. It does not execute the skill, fix an artifact, or rerun anything; its only job is to judge what was produced and cite the evidence. + +The grader has a second job that matters as much as the first: it critiques the rubric. 
A passing grade on a weak assertion is worse than useless, because it reads as proof while measuring nothing, so the grader flags assertions that a wrong output would also pass and names important outcomes that no assertion covers. + +## Inputs + +The grader receives: + +- `case_id`: identifier for this case. +- `input`: the message that was sent to the skill, including any prepended `state_prefix`. +- `rubric`: the list of expectation strings it grades, each independently. +- `transcript_path`: absolute path to the run's transcript, in the schema the adapter defines. +- `artifacts_dir`: absolute path to the directory of files the skill wrote. + +## Process + +1. Read the transcript. It is line-ordered events in the adapter's schema. Note the input that was sent, every tool call the skill made (with its name and arguments), the order those calls happened in, the final message (often a JSON status block for headless runs), and any errors. + +2. List and read the artifacts. Walk `artifacts_dir` and open the files each expectation implicates. Read their contents rather than trusting filenames, and note modification times when ordering or read-only behavior is in scope. + +3. Grade each expectation independently. Identify what kind of check it is and gather the matching evidence: open and read the file for a content check, scan the transcript for a tool-call pattern, find event indices for a phase-ordering check, compare bytes and scan for Write or Edit calls for a read-only check, parse and verify fields for a frontmatter check, extract and inspect the object for an output-block check, and trace each claim both directions for a fidelity check. + +4. Decide pass or fail with specific evidence. Pass only when there is clear evidence the expectation holds and the evidence reflects substance rather than surface compliance, so a file that exists but holds only placeholders fails a content expectation. 
Fail when no evidence is found, the evidence contradicts the expectation, or the assertion is technically satisfied while the underlying outcome is wrong. Cite the evidence every time by quoting a line, naming a file with its path, or pointing to a tool call by its index and arguments. + +5. Critique the rubric. After grading, surface assertions that look weak, meaning ones that passed but would also pass for a clearly wrong output, and name important outcomes you observed, good or bad, that no assertion checks. Keep the bar at what a rubric author would call a good catch rather than a nit. + +## Output + +The grader returns one record per expectation plus a summary and rubric feedback: + +```json +{ + "case_id": "create-1", + "expectations": [ + { + "text": "brief.md exists and word count is between 250 and 1500", + "passed": true, + "evidence": "artifacts/insulens/brief.md, 487 words" + }, + { + "text": "the memlog references having ingested the memo as source material", + "passed": false, + "evidence": ".memlog.md exists but contains only the init entry; no mention of memo.md" + } + ], + "summary": { "passed": 1, "failed": 1, "total": 2, "pass_rate": 0.5 }, + "rubric_feedback": { + "weak": [ + { + "assertion": "brief.md exists", + "reason": "Existence alone passes for an empty file; pair with a content or word-count check." + } + ], + "uncovered": [ + "The brief invented a competitor not present in the input or the memlog; no assertion would have caught this." + ], + "overall": "Assertions check structure but not content fidelity in two places." + } +} +``` + +When `weak` and `uncovered` would both be empty, set them to `[]` and `overall` to `"No suggestions; the rubric looks discriminating."` + +## Rules + +- Verdicts come from evidence, not impressions, so quote, name files, and point to event indices. +- No partial credit. Each expectation is pass or fail. +- The burden of proof is on a passing grade, so when the evidence is uncertain the expectation fails. 
+- Read-only against the run folder. The grader never edits an artifact. +- No silent defaults. If a file or the transcript genuinely cannot be read, mark the affected expectations failed with that as the evidence rather than guessing. diff --git a/skills/bmad-eval-runner/references/isolation.md b/skills/bmad-eval-runner/references/isolation.md deleted file mode 100644 index 056fda8..0000000 --- a/skills/bmad-eval-runner/references/isolation.md +++ /dev/null @@ -1,110 +0,0 @@ -# Isolation Strategies - -The eval runner offers two strategies. The intent is identical in both: every eval starts from a clean slate so the result reflects the skill itself, not the host's accumulated state. - -## What we are isolating from - -- The user's global `~/.claude/CLAUDE.md` (private global instructions) -- Any ancestor `CLAUDE.md` in the project tree above the skill -- Auto-memory at `~/.claude/projects/.../memory/MEMORY.md` -- Cached settings, MCP configurations, IDE integrations -- Prior conversation context bleeding via the shell - -## Authentication - -The isolated `claude -p` subprocess needs to authenticate, but cannot read the host's `~/.claude/` (HOME is overridden) or the macOS Keychain (Keychain ACLs are scoped to the process that wrote the entry). The runner solves this in the parent process: - -1. On macOS, read the OAuth credential JSON from the Keychain entry `Claude Code-credentials` via `security find-generic-password -s "Claude Code-credentials" -w`. This succeeds because the parent runs as the same user that wrote the entry. -2. Stage that JSON as `/.home/.claude/.credentials.json` (local mode) or copy it into `/home/evaluator/.claude/.credentials.json` inside the container (Docker mode). -3. The subprocess reads `.credentials.json` exactly the way Claude Code normally does, with no other host config bleed. - -If the parent has `ANTHROPIC_API_KEY` set, that env var is also forwarded — and it takes precedence over the Keychain credential. 
On non-macOS hosts, the Keychain step is skipped and `ANTHROPIC_API_KEY` is the only auth path. - -## Docker (preferred) - -A single image, `bmad-eval-runner:latest`, is built once per machine. It contains Node 20, Claude Code (via `npm install -g @anthropic-ai/claude-code`), Python 3, and standard tools. The image is intentionally minimal — every eval starts from this baseline. - -### Image build - -`scripts/docker_setup.py --build` builds the image from `assets/Dockerfile`. This runs once. Re-runs are a no-op unless `--rebuild` is passed. - -### Per-eval container - -Each eval gets a fresh container: - -``` -docker run --rm \ - -v ":/project:ro" \ - -v "/:/output" \ - -v ":/fixtures:ro" \ - -e ANTHROPIC_API_KEY \ - -e EVAL_PROMPT \ - -e EVAL_ID \ - -e SKILL_PATH \ - bmad-eval-runner:latest \ - /bin/bash -c "/scripts/run_one_eval.sh" -``` - -Inside the container: - -1. The project is copied from `/project` (read-only) to `/workspace` (writable, container-local). Copy is fast because the underlying layer is shared. -2. Fixtures are copied into `/workspace/fixtures/`. -3. `HOME` is `/home/evaluator`, an empty directory created by the image — no global `CLAUDE.md`, no memory. -4. `claude -p "$EVAL_PROMPT" --output-format stream-json --verbose` runs at `/workspace`. -5. The stream-json transcript is captured to `/output/transcript.jsonl`. Any files the skill writes under `/workspace` are rsynced to `/output/artifacts/` after the run completes. -6. The container exits and is removed automatically. - -The host then has `//transcript.jsonl`, `//artifacts/`, and timing data. Nothing on the host is touched. - -### Why Docker is preferred - -- The image is reproducible — every run starts from byte-identical state. -- `HOME` is genuinely empty, not just overridden. -- Filesystem isolation is real, not just convention. -- Network can be locked down (`--network=none` for trigger evals; full network for artifact evals that may need it). 
- -## Local fallback - -When Docker is unavailable, the runner falls back to per-eval temp directories under `~/bmad-evals///`. Layout: - -``` -~/bmad-evals/// - workspace/ # the eval's working directory - .home/ # HOME override — empty .claude/ inside - project/ # rsync'd copy of - fixtures/ # staged fixture files - transcript.jsonl # claude -p stream output - artifacts/ # files Claude wrote under workspace/ - metrics.json -``` - -Per-eval invocation roughly: - -``` -HOME="$WORKSPACE/.home" \ -CLAUDE_CONFIG_DIR="$WORKSPACE/.home/.claude" \ -ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \ - claude -p "$EVAL_PROMPT" \ - --output-format stream-json --verbose \ - > transcript.jsonl -``` - -### Limitations of local mode - -- `HOME` override prevents global `CLAUDE.md` and memory loading, but ancestor discovery still happens from the workspace's cwd. If the workspace is created inside a directory tree that contains a `.claude/skills/` further up, the subprocess may discover those skills regardless of `HOME`. This matters most for trigger evals, where stray host skills can fire instead of the synthetic skill we're testing — **prefer Docker for trigger evals**, where filesystem isolation is real. -- Filesystem isolation is by convention only — the skill could write outside its workspace if it tries. We don't sandbox syscalls. -- Network is unrestricted. - -Tell the user clearly when local mode is in use and that it is best-effort. - -## Why a real skill, not a slash command, for trigger evals - -The trigger runner stages a synthetic skill at `/.claude/skills//SKILL.md` — not at `.claude/commands/.md`. Slash commands are user-invoked (`/`); they do not surface as `Skill` tool calls and so a description placed there can never be observed firing the way a real skill would. Anthropic's reference `run_eval.py` uses the commands path and is known to report 0% trigger rates as a result. 
Placing the synthetic at `.claude/skills/` matches how real skills load and lets the detector observe genuine `Skill` (or `Read` of the synthetic SKILL.md) tool calls. - -## Why not `--add-dir` only? - -`claude -p --add-dir ` would let Claude see the skill but would still inherit the user's `CLAUDE.md` and memory from the cwd's ancestors. The whole point of this runner is to test the skill, not the host's accumulated state. So we always either Docker-isolate or temp-dir-isolate. - -## Artifact retention - -Run folders are never deleted by this skill. Disk management is the user's responsibility. The runner emits the run folder path on completion; users who want to clean up old runs can delete `~/bmad-evals//` directly. diff --git a/skills/bmad-eval-runner/references/platform-adapter.md b/skills/bmad-eval-runner/references/platform-adapter.md new file mode 100644 index 0000000..1d537b6 --- /dev/null +++ b/skills/bmad-eval-runner/references/platform-adapter.md @@ -0,0 +1,51 @@ +# Platform adapter + +Everything runtime-specific in the eval-runner lives here, behind one seam. The rest of the skill, the scripts, the case format, the grader, and the modes are written against this seam and stay platform-agnostic. No model name is hardcoded anywhere; a model is just a value the adapter forwards if a runtime needs one, never a list this skill maintains. + +An adapter provides three core things, plus two more only when trigger mode runs. If a new runtime can supply the three core values, the output-grading modes work against it unchanged; trigger mode also needs the two trigger keys described below. 
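One way to picture the seam is as a small record with three required values and two optional ones. This is a sketch under stated assumptions: the class names, field names, and the `child_env` helper are hypothetical and are not taken from the scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscriptSchema:
    file_extension: str       # e.g. ".jsonl" for line-delimited JSON events
    tool_name_field: str      # event field holding the tool name
    tool_args_field: str      # event field holding the tool arguments
    final_message_field: str  # where the final assistant message lives
    usage_field: str          # where timing and token usage are reported


@dataclass(frozen=True)
class PlatformAdapter:
    invocation_command: list[str]      # argv template; "{input}" filled per case
    auth_env_var: str                  # the one variable forwarded from the host
    transcript: TranscriptSchema
    skill_dir: Optional[str] = None    # trigger mode only
    load_signal: Optional[str] = None  # trigger mode only

    def supports_trigger_mode(self) -> bool:
        """Trigger mode needs both trigger keys; the other modes need neither."""
        return self.skill_dir is not None and self.load_signal is not None

    def child_env(self, host_env: dict) -> dict:
        """Forward only the auth variable from the host environment."""
        if self.auth_env_var not in host_env:
            return {}
        return {self.auth_env_var: host_env[self.auth_env_var]}
```

Keeping the model out of the record entirely mirrors the rule that a model is a value the adapter forwards, not a list the skill maintains.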
+ +## The three core things an adapter exposes + +| Thing | What it is | Used by | +|---|---|---| +| invocation command | how to send an input to the runtime and get a completed run back | `run_evals.py`, `run_triggers.py` | +| auth env-var | the single environment variable name the runtime reads for its credential | clean-cwd setup | +| transcript schema | the on-disk shape of the run's event stream | `run_evals.py`, the grader | + +### Invocation command + +A template for running one input non-interactively and producing a transcript. The runner fills in the input (with any `state_prefix` already prepended) and the clean working directory, runs the command, and waits for completion. The command must run from the case's clean working directory so host shell config, prior runs, and ancestor instruction files do not bleed into the result. The only environment that crosses into that directory is the auth env-var below. There is no container, no terminal emulation, and no credential file staging. + +For a baseline run the runner issues the same command twice from the same input: once with the skill available in the working directory and once with nothing wrapped around the model, so the bare-model floor is measured under identical conditions. For a variant run it issues the command against the full skill and against the `--variant-path` skill. + +### Auth env-var + +The name of the one environment variable the runtime reads for its credential. The runner forwards that variable's value from the host into the clean working directory and forwards nothing else. Naming the variable rather than baking in a provider keeps the seam honest: a different runtime declares a different variable name and everything upstream is unchanged. + +### Transcript schema + +The shape of the event stream the run writes, so `run_evals.py` knows where timing and token counts live and the grader knows how to read tool calls and the final message. 
An adapter declares: + +- the file form and extension of the transcript (for example line-delimited JSON events), +- how to find a tool call in an event (the field that holds the tool name and the field that holds its arguments), +- how to find the final assistant message, +- where the completion notification reports timing and token usage, so `run_evals.py` can capture them to `timing.json` the moment the run finishes. + +## Two more things for trigger mode + +Trigger mode needs two keys beyond the core three, because it has to place a skill where the runtime will discover it and then recognize when that skill fired: + +| Thing | What it is | Used by | +|---|---|---| +| skill_dir | the directory where the runtime discovers installed skills, so the synthetic skill can be staged for the run | `run_triggers.py` | +| load_signal | how a skill-load event appears in the transcript (a skill-invocation tool call, a read of the skill's entry file, or whatever the runtime emits on activation) | `run_triggers.py` | + +These two stay idle for the output-grading modes (baseline, variant, quality); only trigger mode reads them. + +## Trigger detection: "did the skill load" + +Trigger mode does not measure output; it measures whether the description caused the skill to fire. Abstracted across runtimes, firing is one event: the skill loaded, expressed by `load_signal`. `run_triggers.py` stages a synthetic skill in `skill_dir` where the runtime discovers skills, sends each query through the invocation command, and checks the transcript for that load event. Each query runs several times because firing is probabilistic, and the trigger rate is the fraction of runs that loaded the skill. + +## Adding a runtime + +To support a new runtime, write an adapter that declares the invocation command, the auth env-var name, and the transcript schema, plus `skill_dir` and `load_signal` if you want trigger mode. 
Add no model list and no provider branch anywhere else; if a value beyond these is needed, it belongs in the adapter, not in a script or a prompt. diff --git a/skills/bmad-eval-runner/references/self-improvement.md b/skills/bmad-eval-runner/references/self-improvement.md new file mode 100644 index 0000000..69078bb --- /dev/null +++ b/skills/bmad-eval-runner/references/self-improvement.md @@ -0,0 +1,55 @@ +# Self-improvement: the bounded auto-iterate loop + +This is the loop that scans a skill, evaluates it, proposes a fix, applies it, and re-evaluates, repeating until the skill passes or a round bound is hit. It turns a single scan-and-fix pass into a closed loop that keeps going until the evidence says stop. It is the most autonomous mode the runner offers, so it carries the most guardrails: it is opt-in, calibrated to what is at stake, fully logged, and bounded. + +The benchmark is a guardrail, never the judge. The human stays the judge. A green run means the change cleared the bar the loop was given, not that the change is correct, and the loop's job is to do the mechanical iteration a human would otherwise do by hand and then hand back a fix plus the evidence for it. + +## When to run it, and how hard + +The loop is opt-in. It never starts on its own, because applying changes to a skill in a loop is a stronger action than reporting findings, and the user decides when that is warranted. + +Calibrate the aggressiveness to the stakes. A throwaway skill the user is still shaping can take a longer loop and a looser bar, because a wrong iteration costs little and is easy to throw away. A skill that other skills already depend on, or one that is shipped and in use, takes a short loop, a strict pass bar, and a close human read of every applied change, because a regression there propagates. Agree the round bound and the pass condition with the user before the first round, and write both into the memlog so the run is auditable against the terms it was given. 
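One way to make the agreed terms concrete is to hold them in a small record and build the `memlog.py append` call that writes them down. This is a hedged sketch: `LoopTerms`, `terms_for`, `memlog_terms_command`, and the specific bounds and thresholds are invented for illustration, and recording the terms as a `decision` entry is an assumption. Only the `append --path --type --text` interface and the entry type vocabulary come from `scripts/memlog.py`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LoopTerms:
    round_bound: int       # hard stop on the number of rounds
    pass_threshold: float  # score the skill must reach to pass


def terms_for(depended_on: bool) -> LoopTerms:
    """Pick illustrative terms: short and strict when a regression would propagate."""
    if depended_on:
        return LoopTerms(round_bound=2, pass_threshold=0.95)
    return LoopTerms(round_bound=6, pass_threshold=0.80)


def memlog_terms_command(memlog_path: str, terms: LoopTerms) -> List[str]:
    """Build (not run) the argv that records the agreed terms in the memlog."""
    text = (
        f"agreed terms: round bound {terms.round_bound}, "
        f"pass condition score >= {terms.pass_threshold:.2f}"
    )
    return ["python3", "scripts/memlog.py", "append",
            "--path", memlog_path, "--type", "decision", "--text", text]


terms = terms_for(depended_on=True)
print(memlog_terms_command("run-folder/.memlog.md", terms))
```

Writing the terms before the first round means the later re-eval entries can be read against the bar the loop was actually given.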
+ +## The loop + +Each round runs four beats: + +1. Scan. Run the builder's scanners against the skill (the five lenses in `bmad-workflow-builder`: architecture, determinism, customization, enhancement, leanness), and collect the findings. On rounds after the first, scan again rather than trusting the prior scan, because the last fix may have moved something. + +2. Eval. Run the modes that apply to this skill: quality against its rubric where one exists, variant to settle a leanness defend-against-absence finding, baseline to confirm the skill still beats the bare model. The scan says what looks wrong; the eval says whether it measurably is. A finding the eval cannot confirm is a candidate to note for a human, not to auto-fix. + +3. Propose a fix. From the confirmed findings, propose one concrete change. Address the cause the finding names rather than the single case that exposed it (see "Generalize to intent, do not overfit to the case" below). Keep the change small enough that the next eval can attribute the delta to it; a round that rewrites five things at once cannot tell you which one moved the score. + +4. Apply and re-eval. Apply the proposed change, then re-run the eval from beat 2 and compare. A round that improves the score and breaks nothing else is kept; a round that regresses any mode is reverted before the next round, because an applied change that made things worse is not a base to build on. + +Stop when the pass condition is met or the round bound is reached, whichever comes first. The bound is a hard stop: hitting it without passing ends the loop and reports the best state reached; it does not earn extra rounds. + +## The full trail goes in memlog + +Every round writes to the run's memlog through `scripts/memlog.py`, so the whole reasoning chain is on disk and nothing the loop decided is hidden in a model's head.
Per round, log: + +- a `decision` entry naming the fix proposed and the finding it answers, +- an `event` entry recording the re-eval delta (which modes ran, the before-and-after score, what regressed if anything), +- a `note` entry when a round is reverted, with why. + +At the end, log a `direction` entry summarizing the final state, whether the pass condition was met, and what a human should still review. Because the trail is append-only and typed, a reviewer reads the run back in order and sees what was tried, what each attempt did to the numbers, and why the loop stopped where it did. + +## Generalize to intent, do not overfit to the case + +The failure that ends most auto-iterate loops is fixing the example instead of the cause. A case fails because the skill mishandled a class of input; patching the skill to special-case that one input passes the case and leaves the class broken, and often the patch is a hardcoded branch that makes the skill worse. Read each finding as a representative of an intent category and fix the category. A case where the skill invented a fact absent from the source is not "handle this memo," it is "the skill does not ground its output in the provided source," and the fix belongs at that level. + +When a proposed fix reaches for ALL-CAPS ALWAYS or NEVER or a stack of MUSTs, treat that as a yellow flag, the same way the leanness scanner does. Shouting at the model is usually a sign the fix is patching a symptom; a sharper outcome statement or a small worked example generalizes where a louder rule does not. Prefer the version that explains the reasoning over the version that issues the command. 
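The round structure, the revert rule, the hard bound, and the memlog trail described in this document can be sketched as one plain function. It is an illustrative sketch only, not the runner's code: `run_loop` and the `evaluate`, `propose`, and `apply_fix` callables are hypothetical, and collapsing the eval to a single score is a simplification (the real loop compares every mode that ran, and this sketch also reverts a round that merely ties). The entry types are the ones `memlog.py` accepts.

```python
from typing import Callable, List, Tuple


def run_loop(
    skill: str,
    evaluate: Callable[[str], float],
    propose: Callable[[str], str],
    apply_fix: Callable[[str, str], str],
    pass_score: float,
    round_bound: int,
) -> Tuple[str, float, List[Tuple[str, str]]]:
    """Iterate until pass_score is met or round_bound rounds have run.

    Returns the best skill state reached, its score, and the memlog-style
    trail as (entry_type, text) pairs.
    """
    trail: List[Tuple[str, str]] = []
    best, best_score = skill, evaluate(skill)
    rounds = 0
    while best_score < pass_score and rounds < round_bound:
        rounds += 1
        fix = propose(best)  # one small change per round
        trail.append(("decision", f"round {rounds}: propose {fix}"))
        candidate = apply_fix(best, fix)
        score = evaluate(candidate)  # re-eval after applying
        trail.append(("event", f"round {rounds}: {best_score:.2f} -> {score:.2f}"))
        if score > best_score:
            best, best_score = candidate, score  # keep the improvement
        else:
            trail.append(("note", f"round {rounds}: reverted, no improvement"))
    passed = best_score >= pass_score
    trail.append(("direction", f"stopped after {rounds} rounds, passed={passed}"))
    return best, best_score, trail


# Toy stand-ins: the "skill" is a string and each fix appends one character.
fixes = iter(["a", "!", "b"])  # the second fix makes things worse


def toy_score(s: str) -> int:
    return s.count("a") + s.count("b") - 2 * s.count("!")


best, score, trail = run_loop(
    "", toy_score, lambda s: next(fixes), lambda s, f: s + f,
    pass_score=2, round_bound=5,
)
for entry_type, text in trail:
    print(f"- ({entry_type}) {text}")
```

Because a reverted round never becomes the new base, the function can always hand back the best state reached, which is what the loop reports when the bound is hit without passing.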
+ +## Why each guard is here + +| Guard | What it prevents | +|---|---| +| opt-in | a loop applying changes the user never authorized | +| stakes calibration | the same aggressiveness on a throwaway and a depended-on skill | +| eval confirms the scan | auto-fixing a finding the evidence does not support | +| one change per round | a round whose delta cannot be attributed to a specific fix | +| revert on regression | building the next round on a change that made things worse | +| round bound | a loop that runs away instead of handing back to a human | +| full memlog trail | reasoning that lives only in the model and cannot be audited | +| benchmark as guardrail, human as judge | treating a green run as proof the change is correct | +| generalize to intent | a hardcoded patch that passes the case and leaves the class broken | diff --git a/skills/bmad-eval-runner/scripts/aggregate_benchmark.py b/skills/bmad-eval-runner/scripts/aggregate_benchmark.py new file mode 100644 index 0000000..e4cf019 --- /dev/null +++ b/skills/bmad-eval-runner/scripts/aggregate_benchmark.py @@ -0,0 +1,236 @@ +#!/usr/bin/env python3 +# /// script +# requires-python = ">=3.9" +# /// +"""Variance benchmark: summarize a metric across N runs, and compare two configs. + +A single skill run is noisy. Running the same case N times and summarizing the +spread tells you whether a difference between two versions is real or just noise. +This script computes, per numeric metric, the mean, the sample standard deviation +(n-1, the unbiased estimator for a sample), the min, and the max across N runs. +Given two such config summaries it reports the delta on each shared metric so a +"did the change help" question gets a number instead of a guess. + +Input shapes accepted for a single config: + - a list of run records, each a flat dict of metric -> number + [{"elapsed_s": 12.1, "total_tokens": 800}, {"elapsed_s": 11.4, ...}] + - {"runs": [ ...records... 
]} + - a directory of run folders, each holding timing.json files written by + run_evals.py (the script reads every timing.json under the directory and + treats each as one run record) + +Usage: + Summarize one config across its runs: + python3 aggregate_benchmark.py --runs CONFIG_A.json + python3 aggregate_benchmark.py --runs RUN_DIR/ (reads timing.json files) + + Compare two configs (each summarized, then delta = B - A): + python3 aggregate_benchmark.py --baseline A.json --variant B.json + + Self-test on a known fixture (no external input needed): + python3 aggregate_benchmark.py --self-test + +Output is one JSON object on stdout. +""" + +from __future__ import annotations + +import argparse +import json +import math +import sys +from pathlib import Path + + +NUMERIC = (int, float) + + +# --- statistics ------------------------------------------------------------- + +def sample_stddev(values: list[float]) -> float: + """Sample standard deviation using n-1 (Bessel's correction). + + Returns 0.0 for fewer than two values, where the sample variance is + undefined and reporting zero spread is the least surprising choice. 
+ """ + n = len(values) + if n < 2: + return 0.0 + mean = sum(values) / n + var = sum((x - mean) ** 2 for x in values) / (n - 1) + return math.sqrt(var) + + +def summarize_metric(values: list[float]) -> dict: + return { + "n": len(values), + "mean": (sum(values) / len(values)) if values else 0.0, + "stddev": sample_stddev(values), + "min": min(values) if values else 0.0, + "max": max(values) if values else 0.0, + } + + +def collect_numeric_metrics(records: list[dict]) -> dict[str, list[float]]: + """Group every numeric field across records by metric name.""" + by_metric: dict[str, list[float]] = {} + for rec in records: + if not isinstance(rec, dict): + continue + for key, val in rec.items(): + if isinstance(val, bool): + continue # bools are ints in Python; not a metric + if isinstance(val, NUMERIC): + by_metric.setdefault(key, []).append(float(val)) + return by_metric + + +def summarize_config(records: list[dict]) -> dict: + by_metric = collect_numeric_metrics(records) + return { + "runs": len(records), + "metrics": {name: summarize_metric(vals) + for name, vals in sorted(by_metric.items())}, + } + + +def delta_configs(baseline: dict, variant: dict) -> dict: + """Per shared metric, delta = variant.mean - baseline.mean, plus context.""" + b_metrics = baseline.get("metrics", {}) + v_metrics = variant.get("metrics", {}) + shared = sorted(set(b_metrics) & set(v_metrics)) + out: dict[str, dict] = {} + for name in shared: + b = b_metrics[name] + v = v_metrics[name] + diff = v["mean"] - b["mean"] + pct = (diff / b["mean"] * 100.0) if b["mean"] != 0 else None + out[name] = { + "baseline_mean": b["mean"], + "variant_mean": v["mean"], + "delta": diff, + "delta_pct": pct, + "baseline_stddev": b["stddev"], + "variant_stddev": v["stddev"], + } + return out + + +# --- input loading ---------------------------------------------------------- + +def load_records(path: Path) -> list[dict]: + """Load run records from a JSON file, a {'runs': [...]} file, or a dir of + timing.json 
files.""" + if path.is_dir(): + records: list[dict] = [] + for f in sorted(path.rglob("timing.json")): + try: + data = json.loads(f.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + continue + if isinstance(data, dict): + records.append(data) + return records + + data = json.loads(path.read_text(encoding="utf-8")) + if isinstance(data, dict) and "runs" in data: + data = data["runs"] + if not isinstance(data, list): + raise ValueError(f"expected a list of run records in {path}") + return [r for r in data if isinstance(r, dict)] + + +# --- self-test -------------------------------------------------------------- + +def run_self_test() -> int: + """Verify mean/stddev/min/max/delta on a known fixture.""" + config_a = [ + {"elapsed_s": 10.0, "total_tokens": 100}, + {"elapsed_s": 12.0, "total_tokens": 200}, + {"elapsed_s": 14.0, "total_tokens": 300}, + ] + summary_a = summarize_config(config_a) + el = summary_a["metrics"]["elapsed_s"] + # mean of 10,12,14 = 12; n-1 stddev = sqrt(((-2)^2+0+2^2)/2)=sqrt(4)=2 + assert el["n"] == 3, el + assert abs(el["mean"] - 12.0) < 1e-9, el + assert abs(el["stddev"] - 2.0) < 1e-9, el + assert el["min"] == 10.0 and el["max"] == 14.0, el + tok = summary_a["metrics"]["total_tokens"] + # mean of 100,200,300 = 200; n-1 stddev = sqrt((10000+0+10000)/2)=100 + assert abs(tok["mean"] - 200.0) < 1e-9, tok + assert abs(tok["stddev"] - 100.0) < 1e-9, tok + + # single value -> stddev 0 + one = summarize_config([{"x": 5}]) + assert one["metrics"]["x"]["stddev"] == 0.0, one + + # bools are not treated as metrics + with_bool = summarize_config([{"ok": True, "x": 1}, {"ok": False, "x": 3}]) + assert "ok" not in with_bool["metrics"], with_bool + assert abs(with_bool["metrics"]["x"]["mean"] - 2.0) < 1e-9, with_bool + + # delta: variant slower by 3s on mean, faster question answered by sign + config_b = [ + {"elapsed_s": 13.0, "total_tokens": 90}, + {"elapsed_s": 15.0, "total_tokens": 110}, + {"elapsed_s": 17.0, "total_tokens": 100}, + 
] + summary_b = summarize_config(config_b) + d = delta_configs(summary_a, summary_b) + # elapsed mean: A=12, B=15 -> delta +3, pct +25% + assert abs(d["elapsed_s"]["delta"] - 3.0) < 1e-9, d + assert abs(d["elapsed_s"]["delta_pct"] - 25.0) < 1e-9, d + # tokens mean: A=200, B=100 -> delta -100, pct -50% + assert abs(d["total_tokens"]["delta"] + 100.0) < 1e-9, d + assert abs(d["total_tokens"]["delta_pct"] + 50.0) < 1e-9, d + + print(json.dumps({"self_test": "passed", + "checked": ["mean", "stddev_n_minus_1", "min", "max", + "single_value_stddev", "bool_excluded", + "delta", "delta_pct"]})) + return 0 + + +# --- main ------------------------------------------------------------------- + +def main(argv: list[str] | None = None) -> int: + p = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + p.add_argument("--runs", type=Path, + help="summarize one config (JSON file or dir of timing.json)") + p.add_argument("--baseline", type=Path, + help="baseline config for a two-config comparison") + p.add_argument("--variant", type=Path, + help="variant config for a two-config comparison") + p.add_argument("--self-test", action="store_true", + help="run the built-in fixture self-test and exit") + args = p.parse_args(argv) + + if args.self_test: + return run_self_test() + + if args.baseline and args.variant: + b = summarize_config(load_records(args.baseline)) + v = summarize_config(load_records(args.variant)) + out = { + "baseline": b, + "variant": v, + "delta": delta_configs(b, v), + } + print(json.dumps(out, indent=2)) + return 0 + + if args.runs: + out = summarize_config(load_records(args.runs)) + print(json.dumps(out, indent=2)) + return 0 + + p.error("provide --runs, or both --baseline and --variant, or --self-test") + return 2 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skills/bmad-eval-runner/scripts/count_tokens.py b/skills/bmad-eval-runner/scripts/count_tokens.py new file mode 100644 index 
0000000..350c74d --- /dev/null +++ b/skills/bmad-eval-runner/scripts/count_tokens.py @@ -0,0 +1,77 @@ +#!/usr/bin/env python3 +# /// script +# requires-python = ">=3.9" +# dependencies = ["tiktoken"] +# /// +"""count_tokens — the single length metric for skill authoring. + +Token counts replace line counts everywhere in the builder and eval-runner. +This script reports the token length of a file or of text piped on stdin, using +the tiktoken cl100k_base encoding. When tiktoken is not installed it falls back +to a character-based estimate (len(text) // 4) and says so, so the script always +runs under a bare python3 even with no third-party packages present. + +Usage: + count_tokens.py count the tokens in a file + count_tokens.py --stdin count the tokens read from stdin + +Output (one line of JSON on stdout): + {"tokens": , "method": "tiktoken"} when tiktoken loaded + {"tokens": , "method": "fallback"} when it fell back to chars // 4 + +Budgets this feeds: SKILL.md ~1500-2500, multi-branch reference ~4500, +single-purpose reference ~9000. +""" +import argparse +import json +import sys + +ENCODING = "cl100k_base" + + +def count_tokens(text: str) -> tuple[int, str]: + """Return (token_count, method). + + Tries tiktoken's cl100k_base encoding first. If tiktoken cannot be imported + or initialized, estimates with len(text) // 4 and reports method "fallback". 
+ """ + try: + import tiktoken + except Exception: + return len(text) // 4, "fallback" + try: + enc = tiktoken.get_encoding(ENCODING) + except Exception: + return len(text) // 4, "fallback" + return len(enc.encode(text)), "tiktoken" + + +def read_input(args) -> str: + if args.stdin: + return sys.stdin.read() + with open(args.file, encoding="utf-8") as f: + return f.read() + + +def main(argv: "list[str] | None" = None) -> int: + p = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + p.add_argument("file", nargs="?", help="path to the file to count") + p.add_argument("--stdin", action="store_true", help="read text from stdin instead of a file") + args = p.parse_args(argv) + + if not args.stdin and not args.file: + p.error("provide a file path or --stdin") + if args.stdin and args.file: + p.error("provide either a file path or --stdin, not both") + + text = read_input(args) + tokens, method = count_tokens(text) + print(json.dumps({"tokens": tokens, "method": method})) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skills/bmad-eval-runner/scripts/docker_setup.py b/skills/bmad-eval-runner/scripts/docker_setup.py deleted file mode 100644 index 5f6fe7a..0000000 --- a/skills/bmad-eval-runner/scripts/docker_setup.py +++ /dev/null @@ -1,115 +0,0 @@ -#!/usr/bin/env python3 -# /// script -# requires-python = ">=3.9" -# /// -"""Detect Docker and build the bmad-eval-runner image when needed.
- -Usage: - python3 docker_setup.py --check # exit 0 if image is ready, 1 otherwise - python3 docker_setup.py --build # build the image (no-op if present) - python3 docker_setup.py --rebuild # force rebuild -""" - -from __future__ import annotations - -import argparse -import json -import shutil -import subprocess -import sys -from pathlib import Path - - -IMAGE_TAG = "bmad-eval-runner:latest" -SCRIPT_DIR = Path(__file__).resolve().parent -DOCKERFILE = SCRIPT_DIR.parent / "assets" / "Dockerfile" - - -def docker_available() -> tuple[bool, str]: - if shutil.which("docker") is None: - return False, "docker CLI not found on PATH" - try: - result = subprocess.run( - ["docker", "info"], - capture_output=True, - text=True, - timeout=5, - ) - if result.returncode != 0: - return False, f"`docker info` failed: {result.stderr.strip().splitlines()[-1] if result.stderr.strip() else 'unknown'}" - return True, "ok" - except subprocess.TimeoutExpired: - return False, "`docker info` timed out" - except Exception as e: - return False, f"docker check error: {e}" - - -def image_present(tag: str = IMAGE_TAG) -> bool: - try: - result = subprocess.run( - ["docker", "image", "inspect", tag], - stdout=subprocess.DEVNULL, - stderr=subprocess.DEVNULL, - timeout=10, - ) - return result.returncode == 0 - except Exception: - return False - - -def build_image(tag: str = IMAGE_TAG, force: bool = False, verbose: bool = True) -> int: - if not DOCKERFILE.is_file(): - print(f"Dockerfile missing at {DOCKERFILE}", file=sys.stderr) - return 2 - - cmd = ["docker", "build", "-t", tag, "-f", str(DOCKERFILE), str(DOCKERFILE.parent)] - if force: - cmd.insert(2, "--no-cache") - - if verbose: - print(f"Building {tag} from {DOCKERFILE} ...", file=sys.stderr) - - proc = subprocess.run(cmd, stdout=sys.stderr if verbose else subprocess.DEVNULL, stderr=sys.stderr) - return proc.returncode - - -def main() -> int: - parser = argparse.ArgumentParser(description="Manage the bmad-eval-runner Docker image") - group = 
parser.add_mutually_exclusive_group(required=True) - group.add_argument("--check", action="store_true", help="Report status as JSON; exit 0 if image is ready") - group.add_argument("--build", action="store_true", help="Build the image (no-op if already present)") - group.add_argument("--rebuild", action="store_true", help="Force rebuild") - parser.add_argument("--quiet", action="store_true") - args = parser.parse_args() - - available, reason = docker_available() - present = image_present() if available else False - - if args.check: - print(json.dumps({ - "docker_available": available, - "docker_reason": reason, - "image_present": present, - "image_tag": IMAGE_TAG, - }, indent=2)) - return 0 if (available and present) else 1 - - if not available: - print(f"Docker is not available: {reason}", file=sys.stderr) - return 3 - - if args.rebuild: - return build_image(force=True, verbose=not args.quiet) - - if args.build: - if present: - if not args.quiet: - print(f"{IMAGE_TAG} already present; skipping build (use --rebuild to force).", file=sys.stderr) - return 0 - return build_image(force=False, verbose=not args.quiet) - - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/skills/bmad-eval-runner/scripts/generate_report.py b/skills/bmad-eval-runner/scripts/generate_report.py deleted file mode 100644 index 7596d02..0000000 --- a/skills/bmad-eval-runner/scripts/generate_report.py +++ /dev/null @@ -1,184 +0,0 @@ -#!/usr/bin/env python3 -# /// script -# requires-python = ">=3.9" -# /// -"""Generate an aggregate HTML report for a run folder. - -Reads run.json, execution-summary.json, each /grading.json (if present), -and triggers-result.json (if present), then renders a single-file HTML report. 
- -Usage: - python3 generate_report.py --run-dir PATH [-o report.html] -""" - -from __future__ import annotations - -import argparse -import html as html_lib -import json -import sys -from pathlib import Path - - -def esc(s: object) -> str: - return html_lib.escape(str(s), quote=True) - - -def load(path: Path) -> dict | list | None: - if not path.is_file(): - return None - try: - return json.loads(path.read_text(encoding="utf-8")) - except json.JSONDecodeError: - return None - - -def render(run_dir: Path) -> str: - run_meta = load(run_dir / "run.json") or {} - exec_summary = load(run_dir / "execution-summary.json") or {} - triggers = load(run_dir / "triggers-result.json") - - eval_blocks: list[str] = [] - grading_total = 0 - grading_passed = 0 - - for res in exec_summary.get("results", []): - eval_id = str(res.get("eval_id", "?")) - eval_dir = run_dir / eval_id - grading = load(eval_dir / "grading.json") - metrics = res.get("metrics") or load(eval_dir / "metrics.json") or {} - rc = res.get("return_code") - - rows: list[str] = [] - if grading: - for exp in grading.get("expectations", []): - passed = bool(exp.get("passed")) - grading_total += 1 - if passed: - grading_passed += 1 - rows.append( - f'' - f'{ "✔" if passed else "✘" }' - f'{esc(exp.get("text", ""))}' - f'{esc(exp.get("evidence", ""))}' - ) - - feedback = (grading or {}).get("eval_feedback") or {} - feedback_html = "" - if feedback: - sugg = feedback.get("suggestions") or [] - sugg_html = "".join( - f"
  • {esc(s.get('assertion','(general)'))}: {esc(s.get('reason',''))}
  • " - for s in sugg - ) - overall = esc(feedback.get("overall", "")) - feedback_html = ( - f'' - ) - - artifacts_listing = "" - artifacts_dir = eval_dir / "artifacts" - if artifacts_dir.is_dir(): - files = sorted(p for p in artifacts_dir.rglob("*") if p.is_file()) - if files: - artifacts_listing = "
      " + "".join( - f'
    • {esc(p.relative_to(eval_dir))} ' - f'({p.stat().st_size}b)
    • ' - for p in files - ) + "
    " - - tool_calls = metrics.get("tool_calls", {}) - tool_summary = ", ".join(f"{k}={v}" for k, v in sorted(tool_calls.items())) or "—" - - eval_blocks.append(f""" -
    -

    Eval {esc(eval_id)} rc={esc(rc)} · {esc(metrics.get('elapsed_s', '?'))}s

    -

    Tool calls: {esc(tool_summary)} · output {esc(metrics.get('output_chars', 0))}b · transcript {esc(metrics.get('transcript_chars', 0))}b

    - { '' + ''.join(rows) + '
    ExpectationEvidence
    ' if rows else '

    No grading.json yet.

    ' } - {feedback_html} -
    Artifacts{artifacts_listing or '

    No artifacts captured.

    '}
    -
    - """) - - triggers_html = "" - if triggers: - rows = [] - for r in triggers.get("results", []): - rows.append( - f'' - f'{ "✔" if r["pass"] else "✘" }' - f'{esc(r["query"])}' - f'{esc(r["should_trigger"])}' - f'{r["triggers"]}/{r["runs"]} ({r["trigger_rate"]:.2f})' - ) - s = triggers.get("summary", {}) - triggers_html = f""" -
    -

    Trigger Evals — {s.get('passed',0)}/{s.get('total',0)} pass

    - - {''.join(rows)}
    QueryShould fireRate
    -
    - """ - - artifact_summary = "" - if exec_summary: - artifact_summary = ( - f"

    Executed {exec_summary.get('executed', 0)} / {exec_summary.get('total', 0)} " - f"evals · {exec_summary.get('exec_failures', 0)} execution failures · " - f"grader: {grading_passed}/{grading_total} expectations passed

    " - ) - - return f""" -Eval Run — {esc(run_meta.get('skill_name','?'))} - - -

    {esc(run_meta.get('skill_name','?'))} — eval run

    -
    - Run id: {esc(run_meta.get('run_id','?'))} · - isolation: {esc(run_meta.get('isolation','?'))} · - started: {esc(run_meta.get('started_at','?'))} -
    -{artifact_summary} -{''.join(eval_blocks)} -{triggers_html} - -""" - - -def main() -> int: - parser = argparse.ArgumentParser(description="Generate HTML report for an eval run folder") - parser.add_argument("--run-dir", required=True, type=Path) - parser.add_argument("-o", "--output", type=Path, default=None) - args = parser.parse_args() - - run_dir = args.run_dir.resolve() - if not run_dir.is_dir(): - print(f"run-dir not found: {run_dir}", file=sys.stderr) - return 2 - - out = args.output or (run_dir / "report.html") - out.write_text(render(run_dir), encoding="utf-8") - print(str(out)) - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/skills/bmad-eval-runner/scripts/memlog.py b/skills/bmad-eval-runner/scripts/memlog.py new file mode 100644 index 0000000..504fad6 --- /dev/null +++ b/skills/bmad-eval-runner/scripts/memlog.py @@ -0,0 +1,197 @@ +#!/usr/bin/env python3 +# /// script +# requires-python = ">=3.10" +# /// +"""memlog -- an append-only memory log: LLM-optimal working memory for a skill. + +A memlog is the dense, chronological record of everything that mattered in a piece of +work -- every decision, direction, assumption, gap, note, and event as it happened -- +kept minimal like human memory: only what is important, never bloated. It persists +ACROSS sessions, so a fresh session can load it once and continue. It is NOT a +deliverable; downstream artifacts (a brief, a PRD, a report) are derived from it on +demand. + +It is a FLAT log: there are no sections or grouping. Every entry is one line, recorded +at the END in the order it happened. The chronology itself is the structure. + +Two invariants make it trustworthy: + + 1. Append-only, chronological. Entries land at the end, in the order they happen. + Nothing is ever inserted backward, reordered, edited, or removed. There is no + edit or delete subcommand by design; history is never rewritten. + 2. Write-only / blind. 
Every command is an atomic, context-free write and echoes the + new state as one line of JSON, so the caller never re-reads the file mid-session. + The one time the file is read is on resume, and the caller reads it itself, not + via this script. + +Atomicity: every write goes to a temp file, is flushed and fsync'd, then atomically +renamed over the target, so a crash never leaves a half-written entry. + +The file shape (.memlog.md): + + --- + subject: Onboarding flow for a budgeting app + status: active + updated: 2026-06-06T14:22 + --- + + - (note) user picked the lean draft path + - (decision) lead with one pre-categorized account; defer multi-account import + - (direction) optimize for the anxious first-timer, not the power user + - (assumption) open-banking consent is available in the target market + - (gap) no data yet on week-1 retention baseline + - (event) ran baseline eval mode + +Each entry carries a typed tag drawn from a fixed vocabulary so the chronology stays +machine-scannable: decision, direction, assumption, gap, note, event. + +Commands: + init --path FILE [--field k=v ...] create the memlog (errors if it exists) + append --path FILE --type T --text STR append one typed entry at the end + set-complete --path FILE flip frontmatter status to complete + +The path is the memlog file itself (conventionally {run-folder}/.memlog.md). +""" +import argparse +import json +import os +import sys +from datetime import datetime +from pathlib import Path + +ENTRY_TYPES = ("decision", "direction", "assumption", "gap", "note", "event") + + +def now() -> str: + return datetime.now().strftime("%Y-%m-%dT%H:%M") + + +def split(text: str) -> tuple[dict, str]: + """Return (frontmatter dict in source order, body str). Frontmatter is plain key: value. + + The closing fence is the first line that is *exactly* `---`, so a `---` inside a + field value (subject is free user text) never truncates the frontmatter. 
+ """ + lines = text.splitlines() + if not lines or lines[0] != "---": + raise ValueError(".memlog.md has no frontmatter") + end = next((i for i in range(1, len(lines)) if lines[i] == "---"), None) + if end is None: + raise ValueError(".memlog.md frontmatter is not terminated") + meta: dict[str, str] = {} + for line in lines[1:end]: + if ":" in line: + k, v = line.split(":", 1) + meta[k.strip()] = v.strip() + return meta, "\n".join(lines[end + 1:]).lstrip("\n") + + +def render(meta: dict, body: str) -> str: + # Neutralize newlines in values so a multi-line field can't break the fence on re-read. + fm = "\n".join(f"{k}: {' '.join(str(v).splitlines())}" for k, v in meta.items()) + return "---\n" + fm + "\n---\n\n" + body.rstrip("\n") + "\n" + + +def touch(meta: dict) -> None: + """Stamp `updated` and keep it last so the field order stays predictable.""" + meta.pop("updated", None) + meta["updated"] = now() + + +def write_atomic(path: Path, text: str) -> None: + """Temp + flush + fsync + atomic rename, so a crash never half-writes an entry.""" + tmp = path.with_suffix(path.suffix + ".tmp") + with open(tmp, "w", encoding="utf-8") as f: + f.write(text) + f.flush() + os.fsync(f.fileno()) + os.replace(tmp, path) + + +def entry_count(body: str) -> int: + return sum(1 for ln in body.splitlines() if ln.startswith("- ")) + + +def ack(path: Path, meta: dict, body: str, entry_type: str = "") -> None: + """Echo new state so the caller never re-reads the file to know where it stands.""" + out = { + "ok": True, + "memlog": str(path), + "status": meta.get("status", ""), + "n": entry_count(body), + } + if entry_type: + out["type"] = entry_type + print(json.dumps(out)) + + +def cmd_init(args) -> int: + path = Path(args.path) + if path.exists(): + print(f"error: {path} already exists; use append/set-complete to update it", file=sys.stderr) + return 2 + path.parent.mkdir(parents=True, exist_ok=True) + meta: dict[str, str] = {} + for pair in args.field or []: + if "=" not in pair: + 
print(f"error: --field expects key=value, got {pair!r}", file=sys.stderr) + return 2 + k, v = pair.split("=", 1) + meta[k.strip()] = v.strip() + meta.setdefault("status", "active") + touch(meta) + write_atomic(path, render(meta, "")) + ack(path, meta, "") + return 0 + + +def cmd_append(args) -> int: + path = Path(args.path) + if args.type not in ENTRY_TYPES: + print(f"error: --type must be one of {', '.join(ENTRY_TYPES)}; got {args.type!r}", file=sys.stderr) + return 2 + meta, body = split(path.read_text(encoding="utf-8")) + text = " ".join(args.text.split()) # collapse newlines/runs -> one-line entry + entry = f"- ({args.type}) {text}" + body = (body.rstrip("\n") + "\n" + entry) if body.strip() else entry # always at the end + touch(meta) + write_atomic(path, render(meta, body)) + ack(path, meta, body, args.type) + return 0 + + +def cmd_set_complete(args) -> int: + path = Path(args.path) + meta, body = split(path.read_text(encoding="utf-8")) + meta["status"] = "complete" + touch(meta) + write_atomic(path, render(meta, body)) + ack(path, meta, body) + return 0 + + +def main(argv: list[str] | None = None) -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + sub = p.add_subparsers(dest="cmd", required=True) + + pi = sub.add_parser("init", help="create the memlog") + pi.add_argument("--path", required=True, help="memlog file path (e.g. 
{run-folder}/.memlog.md)") + pi.add_argument("--field", action="append", metavar="KEY=VALUE", help="frontmatter field (repeatable)") + pi.set_defaults(func=cmd_init) + + pa = sub.add_parser("append", help="append one typed entry at the end") + pa.add_argument("--path", required=True) + pa.add_argument("--type", required=True, choices=ENTRY_TYPES, help="entry kind") + pa.add_argument("--text", required=True) + pa.set_defaults(func=cmd_append) + + pc = sub.add_parser("set-complete", help="flip frontmatter status to complete") + pc.add_argument("--path", required=True) + pc.set_defaults(func=cmd_set_complete) + + args = p.parse_args(argv) + return args.func(args) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skills/bmad-eval-runner/scripts/pty_runner.py b/skills/bmad-eval-runner/scripts/pty_runner.py deleted file mode 100644 index 5b58658..0000000 --- a/skills/bmad-eval-runner/scripts/pty_runner.py +++ /dev/null @@ -1,171 +0,0 @@ -#!/usr/bin/env python3 -# /// script -# requires-python = ">=3.9" -# /// -"""Run claude interactively via PTY so the Skill tool is available. - -In `claude -p` (print mode) the Skill tool is never offered — Claude handles -everything inline. Running `claude` in interactive mode activates the Skill -tool so dependency skills installed in .claude/skills/ can be properly invoked. - -The PTY tricks claude into thinking it has a terminal (interactive mode) while -we capture its stream-json output programmatically. - -Usage: - python3 pty_runner.py --prompt-file /path/to/prompt.txt \\ - --output /path/to/transcript.jsonl \\ - [--timeout 600] - python3 pty_runner.py --prompt "Run headless. ..." 
--output transcript.jsonl -""" - -from __future__ import annotations - -import argparse -import json -import os -import pty -import re -import select -import subprocess -import sys -import time -from pathlib import Path - -ANSI_RE = re.compile(r"\x1b(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])|\r") - -# How long to wait for claude to initialize before sending the prompt. -# Claude loads skill registry, checks credentials, etc. on startup. -INIT_WAIT_S = 5.0 - -# How long to wait after the stream-json 'result' event before killing claude. -# Trailing tool-result output sometimes follows the result event. -POST_RESULT_S = 4.0 - - -def _strip_ansi(text: str) -> str: - return ANSI_RE.sub("", text) - - -def run_interactive(prompt: str, output: Path, timeout: int = 600) -> None: - """Spawn claude interactively via PTY, send one prompt, capture transcript.""" - master, slave = pty.openpty() - - proc = subprocess.Popen( - [ - "claude", - "--output-format", "stream-json", - "--verbose", - "--dangerously-skip-permissions", - ], - stdin=slave, - stdout=slave, - stderr=slave, - close_fds=True, - ) - os.close(slave) - - json_lines: list[str] = [] - buf = b"" - prompt_sent = False - done_at: float | None = None - start = time.time() - - try: - while True: - elapsed = time.time() - start - if elapsed > timeout: - print(f"[pty_runner] timeout after {elapsed:.0f}s", file=sys.stderr) - break - if done_at is not None and (time.time() - done_at) > POST_RESULT_S: - break - - # Short select so we stay responsive but don't spin. - r, _, _ = select.select([master], [], [], 0.3) - - if r: - try: - chunk = os.read(master, 8192) - except OSError: - break # PTY closed — claude exited - buf += chunk - - # Process all complete lines in buffer. - while b"\n" in buf: - raw, buf = buf.split(b"\n", 1) - line = _strip_ansi(raw.decode("utf-8", errors="replace")).strip() - if not line.startswith("{"): - continue - json_lines.append(line) - try: - obj = json.loads(line) - # 'result' marks end of a claude turn. 
- if obj.get("type") == "result" and done_at is None: - done_at = time.time() - print( - f"[pty_runner] result event at t={time.time()-start:.1f}s " - f"({len(json_lines)} lines so far)", - file=sys.stderr, - ) - except json.JSONDecodeError: - pass - else: - # Silence window — send prompt once claude has had time to init. - if not prompt_sent and (time.time() - start) >= INIT_WAIT_S: - os.write(master, (prompt + "\n").encode()) - prompt_sent = True - print( - f"[pty_runner] prompt sent at t={time.time()-start:.1f}s", - file=sys.stderr, - ) - - finally: - # Politely ask claude to exit, then hard-kill if needed. - try: - os.write(master, b"exit\n") - time.sleep(0.3) - except OSError: - pass - try: - proc.terminate() - proc.wait(timeout=5) - except Exception: - try: - proc.kill() - except Exception: - pass - try: - os.close(master) - except OSError: - pass - - output.parent.mkdir(parents=True, exist_ok=True) - content = "\n".join(json_lines) + ("\n" if json_lines else "") - output.write_text(content, encoding="utf-8") - print( - f"[pty_runner] wrote {len(json_lines)} transcript lines → {output}", - file=sys.stderr, - ) - - -def main() -> int: - p = argparse.ArgumentParser( - description="Run claude interactively via PTY and capture stream-json transcript" - ) - grp = p.add_mutually_exclusive_group(required=True) - grp.add_argument("--prompt", help="Prompt text") - grp.add_argument("--prompt-file", type=Path, help="File containing the prompt") - p.add_argument("--output", type=Path, required=True, help="Output .jsonl transcript file") - p.add_argument("--timeout", type=int, default=600, help="Hard timeout in seconds") - args = p.parse_args() - - prompt = ( - args.prompt_file.read_text(encoding="utf-8").strip() - if args.prompt_file - else args.prompt - ) - run_interactive(prompt, args.output, args.timeout) - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/skills/bmad-eval-runner/scripts/run_evals.py 
b/skills/bmad-eval-runner/scripts/run_evals.py index fd8438b..65cd011 100644 --- a/skills/bmad-eval-runner/scripts/run_evals.py +++ b/skills/bmad-eval-runner/scripts/run_evals.py @@ -2,31 +2,52 @@ # /// script # requires-python = ">=3.9" # /// -"""Run a skill's artifact evals in isolated workspaces. - -For each eval, the runner: - 1. Stages a fresh workspace (Docker container or local tmp dir under ~/bmad-evals). - 2. Applies the setup overlay (base then per-eval) so _bmad/ config and dependency - skills land in the workspace BEFORE the skill is staged — the skill's own copy - always wins over overlay content. - 3. Copies the skill into .claude/skills/ so it is discoverable by claude. - 4. Stages any fixture files declared in the eval's `files` list. - 5. Runs `claude -p '' --output-format stream-json --verbose`, capturing - the transcript. The Skill tool is available in -p mode and fires for installed - skills, so dependency skills provided by the setup overlay are properly invokable. - 6. Rsyncs any files claude wrote into `//artifacts/`. - 7. Writes `metrics.json` (tool-call counts, timing, output sizes). - -Grading is performed separately by the parent skill's grader subagents. +"""Run eval cases through the configured platform adapter. + +A case is `input + rubric + optional state_prefix`. This runner does the +runtime-specific part of an eval: it takes a case, builds the prompt the +adapter understands, runs it in a clean working directory, and records the +transcript plus timing and token usage. Grading happens elsewhere; the grader +subagent reads the transcript and artifacts this runner leaves behind. + +What this runner deliberately does NOT do: + - No Docker, no PTY, no keychain staging, no dual-isolation strategy. + - No hardcoded model. Everything runtime-specific comes from the adapter. 
+ +The adapter seam (see references/platform-adapter.md) exposes exactly three +things, read here from an adapter config file (JSON): + + invocation : argv template for running one prompt. The token "{prompt}" is + replaced with the case prompt; "{cwd}" is replaced with the + clean working directory. Example for a Claude Code runtime: + ["claude", "-p", "{prompt}", "--output-format", "stream-json", + "--verbose", "--dangerously-skip-permissions"] + auth_env : name of the environment variable that carries auth (e.g. + "ANTHROPIC_API_KEY"). The runner passes it through unchanged. + No model id ever appears here. + transcript : how to read the run's output. One of: + {"format": "stdout-jsonl"} capture stdout as JSONL transcript + {"format": "file", "path": "transcript.jsonl"} + adapter writes a file in cwd + +If no adapter config is found, the runner degrades gracefully: it stages every +case (clean cwd, prompt with state_prefix applied) and writes a manifest, but +records each result as "skipped: no runtime adapter configured" instead of +crashing. A human or a configured runtime can then complete the run. + +state_prefix handling: when a case carries a state_prefix, it is PREPENDED to +the input to place the skill mid-workflow in one shot. The composed prompt is +recorded so the grader sees exactly what ran. Usage: python3 run_evals.py \\ - --skill-path PATH \\ - --evals-file PATH/evals.json \\ - --project-root PATH \\ - --output-dir PATH \\ - --isolation docker|local \\ - [--workers N] [--timeout SECS] [--eval-ids A1,B3] [--quiet] + --cases CASES.json \\ + --output-dir DIR \\ + [--adapter ADAPTER.json] \\ + [--case-ids A1,B3] [--timeout SECS] [--workers N] [--quiet] + +CASES.json is either a list of cases or {"cases": [...]}. 
Each case: + {"id": "...", "input": "...", "rubric": [...], "state_prefix": "..."?} """ from __future__ import annotations @@ -34,452 +55,371 @@ import argparse import json import os -import shutil import subprocess import sys import time from concurrent.futures import ThreadPoolExecutor, as_completed +from datetime import datetime, timezone from pathlib import Path -SCRIPT_DIR = Path(__file__).resolve().parent -sys.path.insert(0, str(SCRIPT_DIR)) - -from utils import ( # noqa: E402 - apply_setup_overlay, - discover_setup_dirs, - new_run_id, - parse_skill_md, - read_json, - read_macos_keychain_credentials, - stage_credentials, - utc_now_iso, - write_json, -) - -DOCKER_IMAGE = "bmad-eval-runner:latest" -_KEYCHAIN_CREDS: str | None = read_macos_keychain_credentials() -RSYNC_EXCLUDES = ( - ".git", ".bare", "node_modules", ".venv", "__pycache__", - ".pytest_cache", ".next", "dist", "build", ".cache", - ".DS_Store", "*.pyc", -) - - -def stage_workspace_local( - workspace: Path, - project_root: Path, - skill_path: Path, - fixtures: list[tuple[Path, str]], - setup_dirs: list[Path] | None = None, -) -> Path: - """Build a clean local workspace. Returns the project root inside workspace.""" - workspace.mkdir(parents=True, exist_ok=True) - project_dest = workspace / "project" - home_dir = workspace / ".home" - (home_dir / ".claude").mkdir(parents=True, exist_ok=True) - - excludes: list[str] = [] - for pat in RSYNC_EXCLUDES: - excludes.extend(["--exclude", pat]) - - if shutil.which("rsync"): - subprocess.run( - ["rsync", "-a", *excludes, f"{project_root}/", f"{project_dest}/"], - check=True, - ) - else: - shutil.copytree(project_root, project_dest, dirs_exist_ok=True, - ignore=shutil.ignore_patterns(*RSYNC_EXCLUDES)) - # Apply setup overlay before staging the skill — the skill's own copy wins. 
- if setup_dirs: - apply_setup_overlay(setup_dirs, project_dest) +# --- small self-contained helpers (no Docker/keychain imports) ------------- - skill_link_dir = project_dest / ".claude" / "skills" - skill_link_dir.mkdir(parents=True, exist_ok=True) - skill_dest = skill_link_dir / skill_path.name - if not skill_dest.exists(): - try: - os.symlink(skill_path, skill_dest) - except OSError: - shutil.copytree(skill_path, skill_dest, dirs_exist_ok=True) - - for src, dest_rel in fixtures: - dest = project_dest / dest_rel - dest.parent.mkdir(parents=True, exist_ok=True) - shutil.copy2(src, dest) - - return project_dest - - -def run_eval_local( - eval_item: dict, - run_dir: Path, - skill_path: Path, - project_root: Path, - timeout: int, - setup_dirs: list[Path] | None = None, -) -> dict: - eval_id = str(eval_item.get("id", "unnamed")) - eval_dir = run_dir / eval_id - workspace_root = eval_dir / "workspace" - artifacts_dir = eval_dir / "artifacts" - transcript_path = eval_dir / "transcript.jsonl" - - eval_dir.mkdir(parents=True, exist_ok=True) - artifacts_dir.mkdir(parents=True, exist_ok=True) - - fixtures = resolve_fixtures(eval_item.get("files", []), project_root) - workspace_project = stage_workspace_local( - workspace_root, project_root, skill_path, fixtures, setup_dirs - ) +def utc_now_iso() -> str: + return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") - (eval_dir / "prompt.txt").write_text(eval_item["prompt"], encoding="utf-8") - workspace_snapshot_before = snapshot_files(workspace_project) - home_dir = workspace_root / ".home" - stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS) - env = { - "HOME": str(home_dir), - "CLAUDE_CONFIG_DIR": str(home_dir / ".claude"), - "PATH": os.environ.get("PATH", ""), - "ANTHROPIC_API_KEY": os.environ.get("ANTHROPIC_API_KEY", ""), - } +def new_run_id(label: str) -> str: + return f"{datetime.now().strftime('%Y%m%d-%H%M%S')}-{label}" - cmd = [ - "claude", - "-p", eval_item["prompt"], - "--output-format", "stream-json", 
- "--verbose", - "--dangerously-skip-permissions", - ] - start = time.time() - try: - with transcript_path.open("wb") as out: - proc = subprocess.run( - cmd, - stdout=out, - stderr=subprocess.PIPE, - cwd=str(workspace_project), - env=env, - timeout=timeout, - ) - elapsed = time.time() - start - return_code = proc.returncode - stderr_tail = (proc.stderr or b"").decode("utf-8", errors="replace")[-2000:] - except subprocess.TimeoutExpired as e: - elapsed = time.time() - start - return_code = -1 - stderr_tail = f"TIMEOUT after {timeout}s" - if e.stderr: - stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:] +def write_json(path: Path, data: object) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8") - new_files = diff_workspace(workspace_project, workspace_snapshot_before) - sync_artifacts(workspace_project, new_files, artifacts_dir) - metrics = compute_metrics(transcript_path, artifacts_dir, elapsed, return_code, stderr_tail) - write_json(eval_dir / "metrics.json", metrics) +def read_json(path: Path) -> object: + return json.loads(path.read_text(encoding="utf-8")) - return { - "eval_id": eval_id, - "elapsed_s": elapsed, - "return_code": return_code, - "transcript": str(transcript_path.relative_to(run_dir)), - "artifacts_dir": str(artifacts_dir.relative_to(run_dir)), - "metrics": metrics, - } +# --- adapter ---------------------------------------------------------------- -def run_eval_docker( - eval_item: dict, - run_dir: Path, - skill_path: Path, - project_root: Path, - timeout: int, - setup_dirs: list[Path] | None = None, -) -> dict: - eval_id = str(eval_item.get("id", "unnamed")) - eval_dir = run_dir / eval_id - artifacts_dir = eval_dir / "artifacts" - transcript_path = eval_dir / "transcript.jsonl" - - eval_dir.mkdir(parents=True, exist_ok=True) - artifacts_dir.mkdir(parents=True, exist_ok=True) - fixtures_staging = eval_dir / "fixtures_in" - 
fixtures_staging.mkdir(parents=True, exist_ok=True) - - fixtures = resolve_fixtures(eval_item.get("files", []), project_root) - for src, dest_rel in fixtures: - dest = fixtures_staging / dest_rel - dest.parent.mkdir(parents=True, exist_ok=True) - shutil.copy2(src, dest) - - (eval_dir / "prompt.txt").write_text(eval_item["prompt"], encoding="utf-8") - - # Pre-merge setup overlay dirs on the host; mount as /setup:ro in the container. - setup_merged: Path | None = None - if setup_dirs: - setup_merged = eval_dir / "setup_merged" - apply_setup_overlay(setup_dirs, setup_merged) - if not any(setup_merged.iterdir()): - setup_merged = None - - creds_dir: Path | None = None - if _KEYCHAIN_CREDS: - creds_dir = eval_dir / "creds" - creds_dir.mkdir(parents=True, exist_ok=True) - (creds_dir / ".credentials.json").write_text(_KEYCHAIN_CREDS, encoding="utf-8") - - container_script = r""" -set -e -mkdir -p /workspace -rsync -a \ - --exclude=.git --exclude=.bare --exclude=node_modules --exclude=.venv \ - --exclude=__pycache__ --exclude=.pytest_cache --exclude=.next \ - --exclude=dist --exclude=build --exclude=.cache --exclude=.DS_Store \ - /project/ /workspace/ -if [ -d /setup ]; then - rsync -a /setup/ /workspace/ -fi -mkdir -p /workspace/.claude/skills -cp -R "$SKILL_SRC" "/workspace/.claude/skills/$SKILL_NAME" -if [ -d /fixtures ]; then - cp -R /fixtures/. /workspace/ -fi -if [ -f /creds/.credentials.json ]; then - mkdir -p /home/evaluator/.claude - cp /creds/.credentials.json /home/evaluator/.claude/.credentials.json -fi -cd /workspace -claude -p "$EVAL_PROMPT" \ - --output-format stream-json --verbose \ - --dangerously-skip-permissions \ - > /output/transcript.jsonl 2> /output/stderr.log || true -mkdir -p /output/artifacts -rsync -a --exclude=.claude --exclude=node_modules --exclude=.git \ - --filter='+ */' --filter='+ *' \ - /workspace/ /output/artifacts/ -""" +def find_adapter(explicit: Path | None, cases_file: Path) -> Path | None: + """Locate the adapter config. 
Returns None when none is configured.""" + if explicit is not None: + return explicit if explicit.is_file() else None + env_path = os.environ.get("BMAD_EVAL_ADAPTER") + if env_path and Path(env_path).is_file(): + return Path(env_path) + for candidate in ( + cases_file.parent / "adapter.json", + cases_file.parent / ".bmad-eval-adapter.json", + ): + if candidate.is_file(): + return candidate + return None - skill_name = skill_path.name - cmd = [ - "docker", "run", "--rm", - "-v", f"{project_root}:/project:ro", - "-v", f"{skill_path}:/skill_src:ro", - "-v", f"{eval_dir}:/output", - "-e", "ANTHROPIC_API_KEY", - "-e", f"EVAL_PROMPT={eval_item['prompt']}", - "-e", f"SKILL_SRC=/skill_src", - "-e", f"SKILL_NAME={skill_name}", - ] - if creds_dir: - cmd += ["-v", f"{creds_dir}:/creds:ro"] - if fixtures: - cmd += ["-v", f"{fixtures_staging}:/fixtures:ro"] - if setup_merged: - cmd += ["-v", f"{setup_merged}:/setup:ro"] - cmd += [DOCKER_IMAGE, "bash", "-c", container_script] - start = time.time() - try: - proc = subprocess.run( - cmd, - capture_output=True, - timeout=timeout + 30, - ) - elapsed = time.time() - start - return_code = proc.returncode - stderr_tail = proc.stderr.decode("utf-8", errors="replace")[-2000:] - if proc.stdout: - (eval_dir / "docker.stdout.log").write_bytes(proc.stdout) - except subprocess.TimeoutExpired as e: - elapsed = time.time() - start - return_code = -1 - stderr_tail = f"TIMEOUT after {timeout}s" - if e.stderr: - stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:] +def load_adapter(path: Path) -> dict: + cfg = read_json(path) + if not isinstance(cfg, dict): + raise ValueError(f"adapter config must be a JSON object: {path}") + if "invocation" not in cfg or not isinstance(cfg["invocation"], list): + raise ValueError("adapter config missing 'invocation' argv list") + return cfg - metrics = compute_metrics(transcript_path, artifacts_dir, elapsed, return_code, stderr_tail) - write_json(eval_dir / "metrics.json", metrics) - 
shutil.rmtree(fixtures_staging, ignore_errors=True) - return { - "eval_id": eval_id, - "elapsed_s": elapsed, - "return_code": return_code, - "transcript": str(transcript_path.relative_to(run_dir)), - "artifacts_dir": str(artifacts_dir.relative_to(run_dir)), - "metrics": metrics, - } +def build_argv(invocation: list, prompt: str, cwd: str) -> list[str]: + argv: list[str] = [] + for tok in invocation: + tok = str(tok) + tok = tok.replace("{prompt}", prompt).replace("{cwd}", cwd) + argv.append(tok) + return argv -def resolve_fixtures(files: list[str], project_root: Path) -> list[tuple[Path, str]]: - out: list[tuple[Path, str]] = [] - for entry in files: - candidate = (project_root / entry).resolve() - if not candidate.is_file(): - alt = Path(entry).resolve() - if alt.is_file(): - candidate = alt - else: - print(f"Warning: fixture not found: {entry}", file=sys.stderr) - continue - out.append((candidate, entry)) - return out +# --- case composition ------------------------------------------------------- +def compose_prompt(case: dict) -> str: + """Apply state_prefix by prepending it to the input. -def snapshot_files(root: Path) -> set[str]: - snap: set[str] = set() - for p in root.rglob("*"): - if p.is_file(): - snap.add(str(p.relative_to(root))) - return snap + The state_prefix is a bracketed prime that places the skill mid-workflow in + one shot. Prepending keeps the input intact and visible to the grader. + """ + input_text = str(case.get("input", "")) + prefix = case.get("state_prefix") + if prefix: + return f"{str(prefix).rstrip()}\n\n{input_text}" + return input_text -def diff_workspace(root: Path, before: set[str]) -> list[str]: - after = snapshot_files(root) - return sorted(after - before) +# --- transcript + token accounting ----------------------------------------- +def read_transcript(transcript_cfg: dict, captured_stdout: bytes, + cwd: Path) -> tuple[str, str]: + """Return (transcript_text, source). 
Source names where it came from.""" + fmt = (transcript_cfg or {}).get("format", "stdout-jsonl") + if fmt == "file": + rel = (transcript_cfg or {}).get("path", "transcript.jsonl") + f = cwd / rel + if f.is_file(): + return f.read_text(encoding="utf-8", errors="replace"), f"file:{rel}" + return "", f"file:{rel} (missing)" + return captured_stdout.decode("utf-8", errors="replace"), "stdout" -def sync_artifacts(workspace: Path, new_files: list[str], dest: Path) -> None: - for rel in new_files: - src = workspace / rel - if not src.is_file(): - continue - if any(part in (".claude", "node_modules", ".git", ".venv") for part in src.parts): - continue - target = dest / rel - target.parent.mkdir(parents=True, exist_ok=True) - shutil.copy2(src, target) +def account_transcript(transcript_text: str) -> dict: + """Pull timing/token usage from a JSONL transcript when present. -def compute_metrics(transcript: Path, artifacts: Path, elapsed: float, - rc: int, stderr_tail: str) -> dict: - tool_calls: dict[str, int] = {} + Reads usage out of the completion notification immediately, so tokens are + captured at run time rather than recomputed later. Recognizes the common + `result` event with a usage block and per-message usage blocks; unknown + shapes degrade to zero counts without failing. 
+ """ + input_tokens = 0 + output_tokens = 0 total_steps = 0 - if transcript.is_file(): - for raw in transcript.read_text(encoding="utf-8", errors="replace").splitlines(): - raw = raw.strip() - if not raw: - continue - try: - evt = json.loads(raw) - except json.JSONDecodeError: - continue - if evt.get("type") == "assistant": - total_steps += 1 - for item in evt.get("message", {}).get("content", []): - if item.get("type") == "tool_use": - name = item.get("name", "?") - tool_calls[name] = tool_calls.get(name, 0) + 1 - - output_chars = 0 - for f in artifacts.rglob("*"): - if f.is_file(): - try: - output_chars += f.stat().st_size - except OSError: - pass + tool_calls: dict[str, int] = {} + found_usage = False + + for raw in transcript_text.splitlines(): + raw = raw.strip() + if not raw: + continue + try: + evt = json.loads(raw) + except json.JSONDecodeError: + continue + if not isinstance(evt, dict): + continue + etype = evt.get("type") + if etype == "assistant": + total_steps += 1 + msg = evt.get("message", {}) + usage = msg.get("usage") if isinstance(msg, dict) else None + if isinstance(usage, dict): + found_usage = True + input_tokens += int(usage.get("input_tokens", 0) or 0) + output_tokens += int(usage.get("output_tokens", 0) or 0) + for item in (msg.get("content", []) if isinstance(msg, dict) else []): + if isinstance(item, dict) and item.get("type") == "tool_use": + name = item.get("name", "?") + tool_calls[name] = tool_calls.get(name, 0) + 1 + elif etype == "result": + usage = evt.get("usage") + if isinstance(usage, dict): + found_usage = True + # result usage is authoritative; prefer it over the running sum + input_tokens = int(usage.get("input_tokens", input_tokens) or input_tokens) + output_tokens = int(usage.get("output_tokens", output_tokens) or output_tokens) return { - "elapsed_s": round(elapsed, 2), - "return_code": rc, + "input_tokens": input_tokens, + "output_tokens": output_tokens, + "total_tokens": input_tokens + output_tokens, + "tokens_reported": 
found_usage, + "total_steps": total_steps, "tool_calls": tool_calls, "total_tool_calls": sum(tool_calls.values()), - "total_steps": total_steps, - "output_chars": output_chars, - "transcript_chars": transcript.stat().st_size if transcript.is_file() else 0, - "stderr_tail": stderr_tail, } -def main() -> int: - parser = argparse.ArgumentParser(description="Run a skill's artifact evals in isolation") - parser.add_argument("--skill-path", required=True, type=Path) - parser.add_argument("--evals-file", required=True, type=Path) - parser.add_argument("--project-root", required=True, type=Path) - parser.add_argument("--output-dir", required=True, type=Path) - parser.add_argument("--isolation", choices=("docker", "local"), required=True) - parser.add_argument("--workers", type=int, default=8) - parser.add_argument("--timeout", type=int, default=600) - parser.add_argument("--eval-ids", default=None, help="Comma-separated subset of eval ids to run") - parser.add_argument("--quiet", action="store_true") - args = parser.parse_args() - - skill_path = args.skill_path.resolve() - project_root = args.project_root.resolve() - evals_file = args.evals_file.resolve() - if not evals_file.is_file(): - print(f"evals file not found: {evals_file}", file=sys.stderr) - return 2 +# --- per-case execution ----------------------------------------------------- + +def run_case(case: dict, run_dir: Path, adapter: dict | None, + timeout: int) -> dict: + case_id = str(case.get("id", "unnamed")) + case_dir = run_dir / case_id + cwd = case_dir / "cwd" + cwd.mkdir(parents=True, exist_ok=True) + + prompt = compose_prompt(case) + (case_dir / "prompt.txt").write_text(prompt, encoding="utf-8") + write_json(case_dir / "case.json", case) + + if adapter is None: + result = { + "case_id": case_id, + "status": "skipped", + "reason": "no runtime adapter configured", + "prompt_chars": len(prompt), + "cwd": str(cwd.relative_to(run_dir)), + } + write_json(case_dir / "timing.json", { + "case_id": case_id, "status": 
"skipped", + "captured_at": utc_now_iso(), + }) + return result + + transcript_path = case_dir / "transcript.jsonl" + argv = build_argv(adapter["invocation"], prompt, str(cwd)) + + env = dict(os.environ) + auth_env = adapter.get("auth_env") + if auth_env: + # Pass the named auth var through unchanged; never inject a model id. + env[auth_env] = os.environ.get(auth_env, "") + + start = time.time() + captured = b"" + return_code = 0 + error_tail = "" + status = "ok" + try: + proc = subprocess.run( + argv, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + cwd=str(cwd), + env=env, + timeout=timeout, + ) + captured = proc.stdout or b"" + return_code = proc.returncode + error_tail = (proc.stderr or b"").decode("utf-8", errors="replace")[-2000:] + if return_code != 0: + status = "error" + except FileNotFoundError as e: + # Adapter invocation command is not on PATH: degrade, do not crash. + elapsed = time.time() - start + write_json(case_dir / "timing.json", { + "case_id": case_id, "status": "adapter-missing", + "elapsed_s": round(elapsed, 3), "captured_at": utc_now_iso(), + }) + return { + "case_id": case_id, + "status": "adapter-missing", + "reason": f"invocation command not found: {e}", + "cwd": str(cwd.relative_to(run_dir)), + } + except subprocess.TimeoutExpired as e: + captured = e.stdout or b"" + return_code = -1 + status = "timeout" + error_tail = f"TIMEOUT after {timeout}s" + elapsed = time.time() - start + + transcript_text, source = read_transcript( + adapter.get("transcript", {}), captured, cwd + ) + transcript_path.write_text(transcript_text, encoding="utf-8") + + accounting = account_transcript(transcript_text) + + # Capture timing/tokens immediately to timing.json (run-time snapshot). 
+ timing = { + "case_id": case_id, + "status": status, + "elapsed_s": round(elapsed, 3), + "return_code": return_code, + "transcript_source": source, + "input_tokens": accounting["input_tokens"], + "output_tokens": accounting["output_tokens"], + "total_tokens": accounting["total_tokens"], + "tokens_reported": accounting["tokens_reported"], + "total_steps": accounting["total_steps"], + "total_tool_calls": accounting["total_tool_calls"], + "captured_at": utc_now_iso(), + } + write_json(case_dir / "timing.json", timing) + + return { + "case_id": case_id, + "status": status, + "elapsed_s": round(elapsed, 3), + "return_code": return_code, + "transcript": str(transcript_path.relative_to(run_dir)), + "cwd": str(cwd.relative_to(run_dir)), + "tokens": accounting["total_tokens"], + "tool_calls": accounting["tool_calls"], + "error_tail": error_tail, + } - skill_name, _, _ = parse_skill_md(skill_path) - data = read_json(evals_file) - evals = data["evals"] if isinstance(data, dict) and "evals" in data else data - if args.eval_ids: - wanted = {x.strip() for x in args.eval_ids.split(",") if x.strip()} - evals = [e for e in evals if str(e.get("id")) in wanted] +# --- main ------------------------------------------------------------------- - run_id = new_run_id(skill_name) +def load_cases(cases_file: Path) -> list[dict]: + data = read_json(cases_file) + if isinstance(data, dict) and "cases" in data: + cases = data["cases"] + elif isinstance(data, list): + cases = data + else: + raise ValueError("cases file must be a list or {'cases': [...]}") + if not isinstance(cases, list): + raise ValueError("'cases' must be a list") + return cases + + +def main(argv: list[str] | None = None) -> int: + p = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + p.add_argument("--cases", required=True, type=Path) + p.add_argument("--output-dir", required=True, type=Path) + p.add_argument("--adapter", type=Path, default=None, + help="adapter 
config JSON; defaults to BMAD_EVAL_ADAPTER env " + "or adapter.json beside the cases file") + p.add_argument("--case-ids", default=None, + help="comma-separated subset of case ids to run") + p.add_argument("--timeout", type=int, default=600) + p.add_argument("--workers", type=int, default=4) + p.add_argument("--label", default="evals", help="label for the run id") + p.add_argument("--quiet", action="store_true") + args = p.parse_args(argv) + + cases_file = args.cases.resolve() + if not cases_file.is_file(): + print(f"cases file not found: {cases_file}", file=sys.stderr) + return 2 + + cases = load_cases(cases_file) + if args.case_ids: + wanted = {x.strip() for x in args.case_ids.split(",") if x.strip()} + cases = [c for c in cases if str(c.get("id")) in wanted] + + adapter_path = find_adapter(args.adapter, cases_file) + adapter: dict | None = None + adapter_note = "none" + if adapter_path is not None: + try: + adapter = load_adapter(adapter_path) + adapter_note = str(adapter_path) + except Exception as e: + print(f"adapter config invalid ({e}); degrading to skip-only", + file=sys.stderr) + adapter = None + adapter_note = f"invalid: {e}" + + run_id = new_run_id(args.label) run_dir = (args.output_dir / run_id).resolve() run_dir.mkdir(parents=True, exist_ok=True) write_json(run_dir / "run.json", { "run_id": run_id, - "skill_name": skill_name, - "skill_path": str(skill_path), - "project_root": str(project_root), - "evals_file": str(evals_file), - "isolation": args.isolation, + "cases_file": str(cases_file), + "adapter": adapter_note, "started_at": utc_now_iso(), - "eval_count": len(evals), + "case_count": len(cases), }) - runner = run_eval_docker if args.isolation == "docker" else run_eval_local + if adapter is None and not args.quiet: + print("[run_evals] no runtime adapter configured; staging cases only " + "(no crash). 
Configure an adapter to execute.", file=sys.stderr) results: list[dict] = [] if not args.quiet: - print( - f"[run_evals] {len(evals)} evals, isolation={args.isolation}, run_dir={run_dir}", - file=sys.stderr, - ) - - with ThreadPoolExecutor(max_workers=args.workers) as pool: - future_to_eval = { - pool.submit( - runner, - item, - run_dir, - skill_path, - project_root, - int(item.get("timeout", args.timeout)), - discover_setup_dirs(evals_file, str(item.get("id", ""))), - ): item - for item in evals + print(f"[run_evals] {len(cases)} cases, run_dir={run_dir}", + file=sys.stderr) + + with ThreadPoolExecutor(max_workers=max(1, args.workers)) as pool: + fut_to_case = { + pool.submit(run_case, c, run_dir, adapter, + int(c.get("timeout", args.timeout))): c + for c in cases } - for fut in as_completed(future_to_eval): - item = future_to_eval[fut] + for fut in as_completed(fut_to_case): + c = fut_to_case[fut] try: res = fut.result() except Exception as e: - res = {"eval_id": str(item.get("id")), "error": str(e), "return_code": -1} + res = {"case_id": str(c.get("id")), "status": "exception", + "reason": str(e)} results.append(res) if not args.quiet: - rc = res.get("return_code") - status = "ok" if rc == 0 else f"rc={rc}" - print( - f" [{status}] eval {res.get('eval_id')} ({res.get('elapsed_s', 0):.1f}s)", - file=sys.stderr, - ) + print(f" [{res.get('status')}] case {res.get('case_id')} " + f"({res.get('elapsed_s', 0)}s)", file=sys.stderr) summary = { "run_id": run_id, "completed_at": utc_now_iso(), - "total": len(evals), - "executed": len(results), - "exec_failures": sum(1 for r in results if r.get("return_code") != 0), + "total": len(cases), + "executed": sum(1 for r in results if r.get("status") == "ok"), + "skipped": sum(1 for r in results if r.get("status") == "skipped"), + "failures": sum(1 for r in results + if r.get("status") in ("error", "timeout", "exception", + "adapter-missing")), "run_dir": str(run_dir), "results": results, } diff --git 
a/skills/bmad-eval-runner/scripts/run_triggers.py b/skills/bmad-eval-runner/scripts/run_triggers.py index 9c1bb96..45de94b 100644 --- a/skills/bmad-eval-runner/scripts/run_triggers.py +++ b/skills/bmad-eval-runner/scripts/run_triggers.py @@ -2,27 +2,46 @@ # /// script # requires-python = ">=3.9" # /// -"""Run trigger evals: does the skill's description fire on each query? - -Adapted from Anthropic skill-creator's run_eval.py -(https://github.com/anthropics/skills/tree/main/skills/skill-creator) with two -adaptations: - - 1. Isolation. Each query runs in either a fresh Docker container off - bmad-eval-runner:latest, or a fresh local tmp dir under ~/bmad-evals// - with HOME overridden to a clean directory. This prevents the host's global - CLAUDE.md and auto-memory from biasing whether the skill fires. - - 2. Output. Results are written to a run folder alongside the artifact eval - run-folder layout (so triggers and artifacts can share a single report). +"""Trigger evals: does a skill's description fire on each near-miss query? + +A trigger query is a should/should-not user message that shares keywords with +the skill so the description has to discriminate. For each query the runner +stages a synthetic skill where the runtime looks for skills, sends the query +through the adapter, and detects whether the skill loaded. Each query runs +several times (runs-per-query) so the trigger rate is stable, not a coin flip. + +Detection lives behind the adapter. "Did the skill load" is a runtime-specific +signal, so the adapter declares how skills are staged and how a load shows up in +the transcript. The adapter config (see references/platform-adapter.md) adds two +trigger-specific keys to the three core ones: + + invocation : argv template; "{query}" is replaced with the query text, + "{cwd}" with the staging dir. + auth_env : auth env-var name, passed through unchanged. No model id. + skill_dir : path under the staging cwd where a skill is discovered, e.g. + ".claude/skills". 
The runner writes the synthetic skill there. + load_signal: how a load appears in the transcript. One of: + {"type": "tool-name", "name": "Skill"} + a tool_use whose name matches and whose input mentions the + synthetic skill's unique name + {"type": "string", "match": "{skill_name}"} + the unique skill name appears anywhere in the transcript + +If no adapter is configured the runner degrades gracefully: it stages each query +and records "skipped: no runtime adapter configured" rather than crashing. Usage: python3 run_triggers.py \\ - --skill-path PATH \\ - --triggers-file PATH/triggers.json \\ - --output-dir PATH \\ - --isolation docker|local \\ - [--workers N] [--runs-per-query N] [--timeout SECS] [--threshold 0.5] + --skill-path SKILL_DIR \\ + --queries QUERIES.json \\ + --output-dir DIR \\ + [--adapter ADAPTER.json] \\ + [--runs-per-query N] [--threshold 0.5] [--timeout SECS] \\ + [--workers N] [--quiet] + +QUERIES.json is a list of {"query": "...", "should_trigger": true|false}. +SKILL_DIR contains the SKILL.md whose name + description are under test; the +description is what the synthetic skill advertises. 
""" from __future__ import annotations @@ -30,262 +49,247 @@ import argparse import json import os +import re import shutil import subprocess import sys -import time import uuid from concurrent.futures import ThreadPoolExecutor, as_completed +from datetime import datetime, timezone from pathlib import Path -SCRIPT_DIR = Path(__file__).resolve().parent -sys.path.insert(0, str(SCRIPT_DIR)) -from utils import ( # noqa: E402 - new_run_id, - parse_skill_md, - read_json, - read_macos_keychain_credentials, - stage_credentials, - utc_now_iso, - write_json, -) +# --- self-contained helpers ------------------------------------------------- + +def utc_now_iso() -> str: + return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") + + +def new_run_id(label: str) -> str: + return f"{datetime.now().strftime('%Y%m%d-%H%M%S')}-{label}" + + +def write_json(path: Path, data: object) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8") + + +def read_json(path: Path) -> object: + return json.loads(path.read_text(encoding="utf-8")) + + +def parse_skill_md(skill_path: Path) -> tuple[str, str]: + """Return (name, description) from SKILL.md frontmatter.""" + text = (skill_path / "SKILL.md").read_text(encoding="utf-8") + m = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, re.DOTALL) + if not m: + raise ValueError(f"SKILL.md at {skill_path} is missing frontmatter") + frontmatter = m.group(1) + name = None + desc_lines: list[str] = [] + in_desc = False + for line in frontmatter.splitlines(): + if line.startswith("name:"): + name = line.split(":", 1)[1].strip() + in_desc = False + elif line.startswith("description:"): + value = line.split(":", 1)[1].strip() + if value in ("|", ">"): + in_desc = True + else: + desc_lines = [value] + in_desc = False + elif in_desc and line.startswith((" ", "\t")): + desc_lines.append(line.strip()) + elif in_desc: + in_desc = False + if not name: + raise ValueError(f"SKILL.md at 
{skill_path} has no name") + return name, " ".join(desc_lines).strip() + -DOCKER_IMAGE = "bmad-eval-runner:latest" -_KEYCHAIN_CREDS: str | None = read_macos_keychain_credentials() +# --- adapter ---------------------------------------------------------------- +def find_adapter(explicit: Path | None, queries_file: Path) -> Path | None: + if explicit is not None: + return explicit if explicit.is_file() else None + env_path = os.environ.get("BMAD_EVAL_ADAPTER") + if env_path and Path(env_path).is_file(): + return Path(env_path) + for candidate in ( + queries_file.parent / "adapter.json", + queries_file.parent / ".bmad-eval-adapter.json", + ): + if candidate.is_file(): + return candidate + return None -def write_synthetic_skill(skills_dir: Path, skill_name: str, description: str, unique_id: str) -> tuple[Path, str]: - """Place a synthetic skill at //SKILL.md. - The Skill tool only fires for entries discovered as actual skills (frontmatter - `name` + `description` under a `.claude/skills//SKILL.md`). Slash-commands - under `.claude/commands/` do not auto-invoke the Skill tool, so the previous - implementation could never observe a positive trigger. This places the synthetic - skill where Claude Code looks for skills, with a unique name so the detector - can disambiguate it from any pre-existing skill of the same display name. 
+def load_adapter(path: Path) -> dict: + cfg = read_json(path) + if not isinstance(cfg, dict) or "invocation" not in cfg: + raise ValueError(f"adapter config missing 'invocation': {path}") + return cfg + + +def build_argv(invocation: list, query: str, cwd: str) -> list[str]: + out: list[str] = [] + for tok in invocation: + tok = str(tok).replace("{query}", query).replace("{cwd}", cwd) + out.append(tok) + return out + + +# --- synthetic skill staging ------------------------------------------------ + +def write_synthetic_skill(skills_dir: Path, skill_name: str, + description: str, unique: str) -> str: + """Write a synthetic skill the runtime can discover. Returns its unique name. + + A unique suffix lets the detector tell this synthetic skill apart from any + real skill of the same display name. """ - clean_name = f"{skill_name}-skill-{unique_id}" - skill_root = skills_dir / clean_name - skill_root.mkdir(parents=True, exist_ok=True) - path = skill_root / "SKILL.md" - indented_desc = "\n ".join(description.split("\n")) - path.write_text( + clean_name = f"{skill_name}-trig-{unique}" + root = skills_dir / clean_name + root.mkdir(parents=True, exist_ok=True) + indented = "\n ".join(description.split("\n")) + (root / "SKILL.md").write_text( f"---\n" f"name: {clean_name}\n" f"description: |\n" - f" {indented_desc}\n" + f" {indented}\n" f"---\n\n" f"# {skill_name}\n\n" f"This skill handles: {description}\n", encoding="utf-8", ) - return path, clean_name + return clean_name -def parse_stream_for_trigger(buffer: str, clean_name: str) -> tuple[bool | None, str]: - """Return (triggered_or_none, leftover_buffer). 
None means undecided yet.""" - triggered: bool | None = None - pending_tool: str | None = None - accumulated_json = "" - leftover = "" +# --- load detection (behind the adapter) ------------------------------------ - while "\n" in buffer: - line, buffer = buffer.split("\n", 1) - line = line.strip() - if not line: - continue - try: - evt = json.loads(line) - except json.JSONDecodeError: - continue - - if evt.get("type") == "stream_event": - se = evt.get("event", {}) - t = se.get("type", "") - if t == "content_block_start": - cb = se.get("content_block", {}) - if cb.get("type") == "tool_use": - name = cb.get("name", "") - if name in ("Skill", "Read"): - pending_tool = name - accumulated_json = "" - else: - return False, "" - elif t == "content_block_delta" and pending_tool: - delta = se.get("delta", {}) - if delta.get("type") == "input_json_delta": - accumulated_json += delta.get("partial_json", "") - if clean_name in accumulated_json: - return True, "" - elif t in ("content_block_stop", "message_stop"): - if pending_tool: - return clean_name in accumulated_json, "" - if t == "message_stop": - return False, "" - elif evt.get("type") == "assistant": - for item in evt.get("message", {}).get("content", []): - if item.get("type") != "tool_use": +def detect_load(transcript_text: str, load_signal: dict, clean_name: str) -> bool: + """Did the synthetic skill load? 
Interpreted per the adapter's load_signal.""" + sig = load_signal or {"type": "string", "match": "{skill_name}"} + sig_type = sig.get("type", "string") + + if sig_type == "string": + needle = str(sig.get("match", "{skill_name}")).replace( + "{skill_name}", clean_name) + return needle in transcript_text + + if sig_type == "tool-name": + want_name = sig.get("name", "Skill") + for raw in transcript_text.splitlines(): + raw = raw.strip() + if not raw: + continue + try: + evt = json.loads(raw) + except json.JSONDecodeError: + continue + if not isinstance(evt, dict): + continue + content = [] + if evt.get("type") == "assistant": + content = evt.get("message", {}).get("content", []) + for item in content: + if not isinstance(item, dict) or item.get("type") != "tool_use": + continue + if item.get("name") != want_name: continue - tname = item.get("name", "") - tinput = item.get("input", {}) - if tname == "Skill" and clean_name in tinput.get("skill", ""): - return True, "" - if tname == "Read" and clean_name in tinput.get("file_path", ""): - return True, "" - return False, "" - elif evt.get("type") == "result": - return triggered if triggered is not None else False, "" - leftover = buffer - return triggered, leftover - - -def run_query_local(query: str, skill_name: str, description: str, - workspace_root: Path, timeout: int) -> bool: - workspace_root.mkdir(parents=True, exist_ok=True) - home_dir = workspace_root / ".home" - (home_dir / ".claude").mkdir(parents=True, exist_ok=True) - stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS) - project_dir = workspace_root / "project" - skills_dir = project_dir / ".claude" / "skills" - project_dir.mkdir(parents=True, exist_ok=True) + inp = json.dumps(item.get("input", {})) + if clean_name in inp: + return True + return False - unique = uuid.uuid4().hex[:8] - cmd_file, clean_name = write_synthetic_skill(skills_dir, skill_name, description, unique) + # Unknown signal type: fall back to substring match, never crash. 
+ return clean_name in transcript_text - env = { - "HOME": str(home_dir), - "CLAUDE_CONFIG_DIR": str(home_dir / ".claude"), - "PATH": os.environ.get("PATH", ""), - "ANTHROPIC_API_KEY": os.environ.get("ANTHROPIC_API_KEY", ""), - } - cmd = [ - "claude", "-p", query, - "--output-format", "stream-json", - "--verbose", - "--include-partial-messages", - "--dangerously-skip-permissions", - ] +# --- per-query execution ---------------------------------------------------- + +def run_query_once(query: str, skill_name: str, description: str, + adapter: dict, stage_dir: Path, timeout: int) -> bool: + skill_subdir = adapter.get("skill_dir", ".claude/skills") + skills_dir = stage_dir / skill_subdir + skills_dir.mkdir(parents=True, exist_ok=True) + unique = uuid.uuid4().hex[:8] + clean_name = write_synthetic_skill(skills_dir, skill_name, description, unique) + + env = dict(os.environ) + auth_env = adapter.get("auth_env") + if auth_env: + env[auth_env] = os.environ.get(auth_env, "") + argv = build_argv(adapter["invocation"], query, str(stage_dir)) try: - proc = subprocess.Popen( - cmd, + proc = subprocess.run( + argv, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, - cwd=str(project_dir), + cwd=str(stage_dir), env=env, + timeout=timeout, ) - buffer = "" - triggered: bool | None = None - start = time.time() - try: - while time.time() - start < timeout: - if proc.poll() is not None: - rest = proc.stdout.read() - if rest: - buffer += rest.decode("utf-8", errors="replace") - break - chunk = proc.stdout.read1(8192) if hasattr(proc.stdout, "read1") else proc.stdout.read(8192) - if not chunk: - time.sleep(0.05) - continue - buffer += chunk.decode("utf-8", errors="replace") - decided, buffer = parse_stream_for_trigger(buffer, clean_name) - if decided is not None: - triggered = decided - break - finally: - if proc.poll() is None: - proc.kill() - proc.wait() - if triggered is None: - decided, _ = parse_stream_for_trigger(buffer + "\n", clean_name) - triggered = bool(decided) - return 
bool(triggered) - finally: - try: - shutil.rmtree(cmd_file.parent, ignore_errors=True) - except OSError: - pass + captured = proc.stdout or b"" + except subprocess.TimeoutExpired as e: + captured = e.stdout or b"" + except FileNotFoundError: + # invocation command absent; treat as undetected and let caller note it + raise + transcript_cfg = adapter.get("transcript", {"format": "stdout-jsonl"}) + if transcript_cfg.get("format") == "file": + f = stage_dir / transcript_cfg.get("path", "transcript.jsonl") + text = f.read_text(encoding="utf-8", errors="replace") if f.is_file() else "" + else: + text = captured.decode("utf-8", errors="replace") -def run_query_docker(query: str, skill_name: str, description: str, - workspace_root: Path, timeout: int) -> bool: - workspace_root.mkdir(parents=True, exist_ok=True) - unique = uuid.uuid4().hex[:8] - skills_in = workspace_root / "skills_in" - skills_in.mkdir(parents=True, exist_ok=True) - _, clean_name = write_synthetic_skill(skills_in, skill_name, description, unique) - - creds_dir: Path | None = None - if _KEYCHAIN_CREDS: - creds_dir = workspace_root / "creds_in" - creds_dir.mkdir(parents=True, exist_ok=True) - (creds_dir / ".credentials.json").write_text(_KEYCHAIN_CREDS, encoding="utf-8") - - container_script = f""" -set -e -mkdir -p /workspace/.claude/skills -cp -R /skills/. 
/workspace/.claude/skills/ 2>/dev/null || true -if [ -f /creds/.credentials.json ]; then - mkdir -p /home/evaluator/.claude - cp /creds/.credentials.json /home/evaluator/.claude/.credentials.json -fi -cd /workspace -claude -p "$EVAL_QUERY" \\ - --output-format stream-json --verbose --include-partial-messages \\ - --dangerously-skip-permissions \\ - > /output/stream.jsonl 2>/dev/null || true -""" - - output_dir = workspace_root / "output" - output_dir.mkdir(parents=True, exist_ok=True) + return detect_load(text, adapter.get("load_signal", {}), clean_name) - cmd = [ - "docker", "run", "--rm", - "-v", f"{skills_in}:/skills:ro", - "-v", f"{output_dir}:/output", - "-e", "ANTHROPIC_API_KEY", - "-e", f"EVAL_QUERY={query}", - ] - if creds_dir: - cmd += ["-v", f"{creds_dir}:/creds:ro"] - cmd += [DOCKER_IMAGE, "bash", "-c", container_script] - try: - subprocess.run(cmd, capture_output=True, timeout=timeout + 30) - except subprocess.TimeoutExpired: - pass +# --- main ------------------------------------------------------------------- - stream_file = output_dir / "stream.jsonl" - if not stream_file.is_file(): - return False - decided, _ = parse_stream_for_trigger(stream_file.read_text(encoding="utf-8", errors="replace") + "\n", clean_name) - return bool(decided) - - -def main() -> int: - parser = argparse.ArgumentParser(description="Run trigger evals in isolation") - parser.add_argument("--skill-path", required=True, type=Path) - parser.add_argument("--triggers-file", required=True, type=Path) - parser.add_argument("--output-dir", required=True, type=Path) - parser.add_argument("--isolation", choices=("docker", "local"), required=True) - parser.add_argument("--workers", type=int, default=8) - parser.add_argument("--runs-per-query", type=int, default=3) - parser.add_argument("--timeout", type=int, default=45) - parser.add_argument("--threshold", type=float, default=0.5) - parser.add_argument("--quiet", action="store_true") - args = parser.parse_args() +def main(argv: list[str] 
| None = None) -> int: + p = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + p.add_argument("--skill-path", required=True, type=Path) + p.add_argument("--queries", required=True, type=Path) + p.add_argument("--output-dir", required=True, type=Path) + p.add_argument("--adapter", type=Path, default=None) + p.add_argument("--runs-per-query", type=int, default=3) + p.add_argument("--threshold", type=float, default=0.5) + p.add_argument("--timeout", type=int, default=60) + p.add_argument("--workers", type=int, default=4) + p.add_argument("--quiet", action="store_true") + args = p.parse_args(argv) skill_path = args.skill_path.resolve() - triggers_file = args.triggers_file.resolve() - if not triggers_file.is_file(): - print(f"triggers file not found: {triggers_file}", file=sys.stderr) + queries_file = args.queries.resolve() + if not queries_file.is_file(): + print(f"queries file not found: {queries_file}", file=sys.stderr) return 2 - skill_name, description, _ = parse_skill_md(skill_path) - queries = read_json(triggers_file) + skill_name, description = parse_skill_md(skill_path) + queries = read_json(queries_file) + if not isinstance(queries, list): + print("queries file must be a JSON list", file=sys.stderr) + return 2 + + adapter_path = find_adapter(args.adapter, queries_file) + adapter: dict | None = None + adapter_note = "none" + if adapter_path is not None: + try: + adapter = load_adapter(adapter_path) + adapter_note = str(adapter_path) + except Exception as e: + print(f"adapter config invalid ({e}); degrading to skip-only", + file=sys.stderr) + adapter_note = f"invalid: {e}" run_id = new_run_id(f"{skill_name}-triggers") run_dir = (args.output_dir / run_id).resolve() @@ -295,25 +299,54 @@ def main() -> int: "run_id": run_id, "skill_name": skill_name, "description": description, - "isolation": args.isolation, + "adapter": adapter_note, "started_at": utc_now_iso(), "query_count": len(queries), 
"runs_per_query": args.runs_per_query, "threshold": args.threshold, }) - runner = run_query_docker if args.isolation == "docker" else run_query_local + if adapter is None: + if not args.quiet: + print("[run_triggers] no runtime adapter configured; staging only " + "(no crash).", file=sys.stderr) + output = { + "run_id": run_id, + "completed_at": utc_now_iso(), + "skill_name": skill_name, + "description": description, + "status": "skipped", + "reason": "no runtime adapter configured", + "results": [], + "summary": {"total": len(queries), "passed": 0, "failed": 0, + "skipped": len(queries)}, + } + write_json(run_dir / "triggers-result.json", output) + print(json.dumps(output, indent=2)) + return 0 + + adapter_missing = {"flag": False} def run_one(idx: int, q: dict, run_idx: int) -> tuple[int, bool]: - ws = run_dir / "queries" / f"q{idx:03d}-r{run_idx}" - triggered = runner(q["query"], skill_name, description, ws, args.timeout) + stage = run_dir / "queries" / f"q{idx:03d}-r{run_idx}" + stage.mkdir(parents=True, exist_ok=True) + try: + triggered = run_query_once( + q["query"], skill_name, description, adapter, stage, args.timeout) + except FileNotFoundError: + adapter_missing["flag"] = True + triggered = False + finally: + shutil.rmtree(stage / adapter.get("skill_dir", ".claude/skills").split("/")[0], + ignore_errors=True) return idx, triggered per_query: dict[int, list[bool]] = {} if not args.quiet: - print(f"[run_triggers] {len(queries)} queries × {args.runs_per_query} runs, isolation={args.isolation}", file=sys.stderr) + print(f"[run_triggers] {len(queries)} queries x {args.runs_per_query} " + f"runs", file=sys.stderr) - with ThreadPoolExecutor(max_workers=args.workers) as pool: + with ThreadPoolExecutor(max_workers=max(1, args.workers)) as pool: futures = [] for idx, q in enumerate(queries): for run_idx in range(args.runs_per_query): @@ -322,25 +355,36 @@ def run_one(idx: int, q: dict, run_idx: int) -> tuple[int, bool]: try: idx, triggered = fut.result() except 
Exception as e: - print(f"Warning: query failed: {e}", file=sys.stderr) + print(f"Warning: query run failed: {e}", file=sys.stderr) continue per_query.setdefault(idx, []).append(triggered) + if adapter_missing["flag"]: + output = { + "run_id": run_id, + "completed_at": utc_now_iso(), + "skill_name": skill_name, + "status": "adapter-missing", + "reason": "adapter invocation command not found on PATH", + "results": [], + "summary": {"total": len(queries), "passed": 0, "failed": 0}, + } + write_json(run_dir / "triggers-result.json", output) + print(json.dumps(output, indent=2)) + return 0 + results = [] for idx, q in enumerate(queries): - triggers = per_query.get(idx, []) - rate = (sum(triggers) / len(triggers)) if triggers else 0.0 - should = bool(q["should_trigger"]) - if should: - passed = rate >= args.threshold - else: - passed = rate < args.threshold + runs = per_query.get(idx, []) + rate = (sum(runs) / len(runs)) if runs else 0.0 + should = bool(q.get("should_trigger", True)) + passed = (rate >= args.threshold) if should else (rate < args.threshold) results.append({ "query": q["query"], "should_trigger": should, - "trigger_rate": rate, - "triggers": int(sum(triggers)), - "runs": len(triggers), + "trigger_rate": round(rate, 3), + "triggers": int(sum(runs)), + "runs": len(runs), "pass": passed, }) @@ -349,7 +393,7 @@ def run_one(idx: int, q: dict, run_idx: int) -> tuple[int, bool]: "completed_at": utc_now_iso(), "skill_name": skill_name, "description": description, - "isolation": args.isolation, + "adapter": adapter_note, "results": results, "summary": { "total": len(results), diff --git a/skills/bmad-eval-runner/scripts/utils.py b/skills/bmad-eval-runner/scripts/utils.py deleted file mode 100644 index 92b6436..0000000 --- a/skills/bmad-eval-runner/scripts/utils.py +++ /dev/null @@ -1,260 +0,0 @@ -#!/usr/bin/env python3 -# /// script -# requires-python = ">=3.9" -# /// -"""Shared helpers for the eval runner.""" - -from __future__ import annotations - -import json 
-import re -import shutil -import subprocess -import sys -from datetime import datetime, timezone -from pathlib import Path - - -def parse_skill_md(skill_path: Path) -> tuple[str, str, str]: - """Return (name, description, body) from the skill's SKILL.md frontmatter.""" - text = (skill_path / "SKILL.md").read_text(encoding="utf-8") - fm_match = re.match(r"^---\s*\n(.*?)\n---\s*\n(.*)$", text, re.DOTALL) - if not fm_match: - raise ValueError(f"SKILL.md at {skill_path} is missing frontmatter") - frontmatter, body = fm_match.group(1), fm_match.group(2) - - name = None - description_lines: list[str] = [] - in_description = False - for line in frontmatter.splitlines(): - if line.startswith("name:"): - name = line.split(":", 1)[1].strip() - in_description = False - elif line.startswith("description:"): - value = line.split(":", 1)[1].strip() - if value in ("|", ">"): - in_description = True - else: - description_lines = [value] - in_description = False - elif in_description and line.startswith((" ", "\t")): - description_lines.append(line.strip()) - elif in_description: - in_description = False - - if not name: - raise ValueError(f"SKILL.md at {skill_path} is missing a name") - return name, " ".join(description_lines).strip(), body - - -def discover_project_root(skill_path: Path) -> Path: - """Walk up from the skill looking for _bmad/ or .git; default to skill's grandparent.""" - for parent in [skill_path, *skill_path.parents]: - if (parent / "_bmad").is_dir() or (parent / ".git").exists(): - return parent - return skill_path.parent.parent - - -def discover_evals( - skill_path: Path, - project_root: Path, - explicit: Path | None, -) -> dict[str, Path]: - """Locate evals.json and triggers.json. 
Return dict with keys 'evals' and/or 'triggers'.""" - found: dict[str, Path] = {} - - def check_dir(d: Path) -> None: - if not d.is_dir(): - return - for key, fname in (("evals", "evals.json"), ("triggers", "triggers.json")): - candidate = d / fname - if candidate.is_file() and key not in found: - found[key] = candidate - - if explicit is not None: - explicit = explicit.resolve() - if explicit.is_file(): - if explicit.name == "evals.json": - found["evals"] = explicit - elif explicit.name == "triggers.json": - found["triggers"] = explicit - elif explicit.is_dir(): - check_dir(explicit) - return found - - skill_name = skill_path.name - candidates: list[Path] = [ - skill_path / "evals", - skill_path.parent.parent / "evals" / skill_name, - project_root / "evals" / skill_name, - ] - for d in candidates: - check_dir(d) - if found: - break - - if not found: - evals_root = project_root / "evals" - if evals_root.is_dir(): - for sub in evals_root.rglob(skill_name): - if sub.is_dir(): - check_dir(sub) - if found: - break - - return found - - -def utc_now_iso() -> str: - return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") - - -def new_run_id(skill_name: str) -> str: - return f"{datetime.now().strftime('%Y%m%d-%H%M%S')}-{skill_name}" - - -def have_docker() -> bool: - if shutil.which("docker") is None: - return False - try: - result = subprocess.run( - ["docker", "info"], - stdout=subprocess.DEVNULL, - stderr=subprocess.DEVNULL, - timeout=5, - ) - return result.returncode == 0 - except Exception: - return False - - -def docker_image_present(image: str = "bmad-eval-runner:latest") -> bool: - if not have_docker(): - return False - try: - result = subprocess.run( - ["docker", "image", "inspect", image], - stdout=subprocess.DEVNULL, - stderr=subprocess.DEVNULL, - timeout=10, - ) - return result.returncode == 0 - except Exception: - return False - - -def read_macos_keychain_credentials() -> str | None: - """Read the Claude Code OAuth credentials JSON from the macOS 
Keychain. - - Returns the raw JSON string stored under service "Claude Code-credentials", - or None if unavailable (non-macOS, entry missing, or access denied). - - Called in the parent process — which owns the Keychain ACL — so the credential - can be staged into each isolated workspace's `.claude/.credentials.json` before - `claude -p` is launched. Without this, an isolated subprocess with HOME pointed - at an empty dir has no auth and every eval fails with "Not logged in." - """ - if sys.platform != "darwin": - return None - try: - result = subprocess.run( - ["security", "find-generic-password", "-s", "Claude Code-credentials", "-w"], - capture_output=True, - timeout=5, - ) - if result.returncode != 0: - return None - val = result.stdout.decode("utf-8", errors="replace").strip() - return val if val else None - except Exception: - return None - - -def stage_credentials(claude_dir: Path, credentials_json: str | None) -> None: - """Write credentials_json to /.credentials.json. No-op if None.""" - if not credentials_json: - return - claude_dir.mkdir(parents=True, exist_ok=True) - (claude_dir / ".credentials.json").write_text(credentials_json, encoding="utf-8") - - -def write_json(path: Path, data: object) -> None: - path.parent.mkdir(parents=True, exist_ok=True) - path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8") - - -def read_json(path: Path) -> object: - return json.loads(path.read_text(encoding="utf-8")) - - -def parse_skill_dependencies(skill_path: Path) -> list[str]: - """Return skill names declared under 'dependencies:' in SKILL.md frontmatter.""" - try: - text = (skill_path / "SKILL.md").read_text(encoding="utf-8") - except (FileNotFoundError, OSError): - return [] - fm = re.match(r"^---\s*\n(.*?)\n---", text, re.DOTALL) - if not fm: - return [] - deps: list[str] = [] - in_deps = False - for line in fm.group(1).splitlines(): - if re.match(r"^dependencies\s*:", line): - in_deps = True - elif in_deps: - m = re.match(r"^\s+-\s+(\S+)", line) - 
if m: - deps.append(m.group(1)) - elif not line.startswith((" ", "\t")): - break - return deps - - -def discover_setup_dirs(evals_file: Path, eval_id: str | None = None) -> list[Path]: - """Return ordered list of setup overlay dirs that exist. - - base: /setup/ - per-eval: //setup/ - - Applied base-first so per-eval overlays win on conflict. - """ - evals_dir = evals_file.parent - dirs: list[Path] = [] - base = evals_dir / "setup" - if base.is_dir(): - dirs.append(base) - if eval_id: - per_eval = evals_dir / eval_id / "setup" - if per_eval.is_dir(): - dirs.append(per_eval) - return dirs - - -def apply_setup_overlay(setup_dirs: list[Path], dest: Path) -> None: - """Rsync each setup dir onto dest in order (base first, per-eval last).""" - dest.mkdir(parents=True, exist_ok=True) - for src in setup_dirs: - if not src.is_dir(): - continue - subprocess.run( - ["rsync", "-a", f"{src}/", f"{dest}/"], - check=False, - ) - - -__all__ = [ - "parse_skill_md", - "discover_project_root", - "discover_evals", - "utc_now_iso", - "new_run_id", - "have_docker", - "docker_image_present", - "read_macos_keychain_credentials", - "stage_credentials", - "write_json", - "read_json", - "parse_skill_dependencies", - "discover_setup_dirs", - "apply_setup_overlay", -] diff --git a/skills/bmad-workflow-builder/SKILL.md b/skills/bmad-workflow-builder/SKILL.md index c861248..937b5d6 100644 --- a/skills/bmad-workflow-builder/SKILL.md +++ b/skills/bmad-workflow-builder/SKILL.md @@ -5,34 +5,45 @@ description: Builds, edits, and analyzes workflows and skills. Use when the user # Overview -You are a creative agent skills workflow builder and facilitator. Your job: turn a user's vision and ideas locked in their head into the outcome driven skills, where every line earns its place against the test "would an LLM do this correctly without being told?" +Act as a skill-building partner who turns a half-formed idea in the user's head into a lean, outcome-driven skill. 
Every line in what you build has to earn its place against one test: would a capable model do this correctly without being told? If the answer is yes, the line is friction and it stays out. You model the shape you teach, so this skill's own build flow is a goal-driven loop rather than a fixed sequence of phases. -**Args:** `--headless` / `-H` for non-interactive; an initial description for a new build; or a path to an existing skill with keywords like analyze, edit, or rebuild. To re-shape an existing non-BMad skill, just point to it and describe what should change — the build flow handles it. +**Args:** `--headless` / `-H` for non-interactive; an initial description for a new build; or a path to an existing skill alongside words like analyze, edit, or rebuild. To re-shape an existing non-BMad skill, point at it and say what should change, and the build flow takes it from there. ## Conventions -- Bare paths (e.g. `references/build-process.md`) resolve from the skill root. -- `{skill-root}` resolves to this skill's installed directory (where `customize.toml` lives). +- Bare paths (e.g. `references/build-process.md`) resolve from this skill's root. +- `{skill-root}` resolves to this skill's installed directory. - `{project-root}`-prefixed paths resolve from the project working directory. -- `{skill-name}` resolves to the skill directory's basename. +- `{target-skill-path}` is the skill being built, edited, or analyzed. ## On Activation -1. Detect intent. If `--headless` or `-H`, set `{headless_mode}=true` for all sub-prompts. +1. **Resolve customization.** If `{skill-root}/customize.toml` exists, resolve the `workflow` block by reading `{skill-root}/customize.toml`, then `{project-root}/_bmad/custom/bmad-workflow-builder.toml`, then `{project-root}/_bmad/custom/bmad-workflow-builder.user.toml` in that order. Scalars override (last wins), tables deep-merge, arrays of tables keyed by `code` or `id` replace matching entries and append new ones, all other arrays append. 
Apply the resolved values throughout the session. If no `customize.toml` is present, skip this step. -2. Load config from `{project-root}/_bmad/config.yaml` and `{project-root}/_bmad/config.user.yaml` (root and bmb section). Fall back to `{project-root}/_bmad/bmb/config.yaml` (legacy per-module format). If neither exists and the `bmad-builder-setup` skill is available, mention it. Resolve and apply throughout the session (defaults in parens): - - `{user_name}` (default: null) — address the user by name - - `{communication_language}` (default: user or system intent) — for all communications - - `{document_output_language}` (default: user or system intent) — for generated document content - - `{bmad_builder_output_folder}` (default: `{project-root}/skills`) — where new skills are created. Existing skills use their own path. +2. **Detect intent.** If `--headless` or `-H` is present, set `{headless_mode}=true` for every sub-prompt. Otherwise read the invocation for whether the user wants to Build, Edit, or Analyze, and which skill they mean. -3. **Open the floor (interactive only).** Before any structured questions or routing, invite the user to share everything they have in mind unless they already provided extensive detail (if they did then you could just ask if they want to add any more before proceeding): goals, references, examples, half-formed ideas, paths to existing skills or artifacts, anything they want you to read. Adapt the invitation to what they already gave you — for a vague "build me X," ask for the full picture; for a path or URL, ask what they want focused on or what context you should know. After they share, one soft "anything else?" surfaces what they almost forgot. The dump replaces most structured Q&A downstream; let it run. Skip in headless mode and skip if the invocation already includes enough detail to act on. +3. 
**Load config.** Read `{project-root}/_bmad/config.yaml` and `{project-root}/_bmad/config.user.yaml` (root and bmb section), falling back to `{project-root}/_bmad/bmb/config.yaml`. If none exist and `bmad-bmb-setup` is available, mention it. Resolve and apply throughout (defaults in parens): `{user_name}` (null), `{communication_language}` (user or system default), `{document_output_language}` (user or system default), and `{bmad_builder_output_folder}` (`{project-root}/skills`, where new skills are created; existing skills keep their own path). -4. **Resume detection.** Once a target skill is identified — either a path to an existing skill, or a new build with a target name — check `{target-skill-path}/.decision-log.md`. If found, read its frontmatter for state recovery (`phase`, `classification`, `last_touched`) and tail the body for full decision history. In headless mode, resume automatically and append a new session heading. +4. **Open the floor (interactive only).** Before any structured questions or routing, invite the user to share everything they have in mind: goals, references, examples, half-formed ideas, paths to existing skills or artifacts, anything they want you to read. Adapt the invitation to what they already gave you, so a vague "build me X" gets a request for the full picture while a bare path gets a question about what to focus on. After they share, one soft "anything else?" surfaces what they almost forgot. This dump replaces most of the downstream questioning, so let it run. Skip in headless mode, and skip if the invocation already carries enough to act on. -## Routing +5. **Resume detection.** Once a target skill is identified, glob `{target-skill-path}/.memlog.md`. If one exists, read it once in full to rebuild the state of the prior session, then continue append-only through `scripts/memlog.py`. Never look for `.decision-log.md`; the memlog is the only process memory. In headless mode, resume automatically. 
-| Intent | Load | -| ---------------------------- | --------------------------------- | -| Build new or edit existing | `references/build-process.md` | -| Analyze | `references/quality-analysis.md` | +6. **Route to the intent.** Pick the path below from the resolved intent and load only that file. + +## Intents + +| Intent | What it does | Load | +| --- | --- | --- | +| Build | Create a new skill from the user's idea | `references/build-process.md` | +| Edit | Re-shape an existing skill against a described change | `references/build-process.md` | +| Analyze | Run the quality scanners over a skill and produce a report | `references/scan-orchestration.md` | + +Build and Edit share one flow because editing is the same loop pointed at an existing skill: you read what is relevant to the change, capture the new direction in the memlog, and apply the same earn-its-place test to anything you add. + +## Discovery + +Discovery happens through the open floor in activation, not a quiz. Understand why the user came before you read any artifact, and mine the conversation history first for the tools, the sequence, the corrections, and the observed inputs and outputs the user has already described. Capture intent before you ingest files, because what the user wants determines which parts of an existing skill or reference even matter. Ask only the few gaps that the dump left open. + +## The scanner lenses + +Analyze runs five lenses as parallel subagents, each loading `references/skill-quality-principles.md` and returning lean structured findings to you in-context: `references/scan-leanness.md`, `references/scan-architecture.md`, `references/scan-determinism.md`, `references/scan-customization.md`, and `references/scan-enhancement.md`. You consolidate their returns and hand the merged findings to the single `references/report-author.md` subagent. The full mechanics, including the deterministic pre-pass that feeds the scanners, live in `references/scan-orchestration.md`. 
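The structural merge rules named in the "Resolve customization" step of the SKILL.md hunk above (scalars override, tables deep-merge, keyed arrays of tables replace-and-append, all other arrays append) can be sketched roughly as follows. This is a minimal illustration, assuming each layer has already been parsed into a plain dict (for instance with `tomllib`). The names `merge_layers`, `_merge_arrays`, `_key_of`, and the sample `menu` key are hypothetical and are not taken from the patch; this is not the resolver the repository ships.

```python
from __future__ import annotations

from typing import Any


def _key_of(item: Any) -> tuple[str, Any] | None:
    """Return the (field, value) merge key of a table entry, or None."""
    if isinstance(item, dict):
        for field in ("code", "id"):
            if field in item:
                return (field, item[field])
    return None


def _merge_arrays(old: list[Any], new: list[Any]) -> list[Any]:
    """Keyed arrays of tables replace-and-append; every other array appends."""
    items = old + new
    keyed = bool(items) and all(_key_of(item) is not None for item in items)
    if not keyed:
        return old + new
    result = list(old)
    index = {_key_of(item): pos for pos, item in enumerate(result)}
    for item in new:
        key = _key_of(item)
        if key in index:
            result[index[key]] = item  # replace the matching entry in place
        else:
            index[key] = len(result)
            result.append(item)  # append an entry the base did not have
    return result


def merge_layers(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Merge one override layer onto base without mutating either input."""
    merged = dict(base)
    for key, new in override.items():
        old = merged.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            merged[key] = merge_layers(old, new)  # tables deep-merge
        elif isinstance(old, list) and isinstance(new, list):
            merged[key] = _merge_arrays(old, new)
        else:
            merged[key] = new  # scalars override, last layer wins
    return merged


# Hypothetical layers: "menu" is an illustrative keyed array of tables.
base = {
    "workflow": {
        "brief_template": "assets/brief.md",
        "persistent_facts": ["a"],
        "menu": [{"code": "X", "label": "old"}],
    }
}
user = {
    "workflow": {
        "brief_template": "custom/brief.md",
        "persistent_facts": ["b"],
        "menu": [{"code": "X", "label": "new"}, {"code": "Y", "label": "added"}],
    }
}
resolved = merge_layers(base, user)
```

Resolving the three files in base, team, user order would then amount to folding `merge_layers` over the parsed layers from left to right, so the last layer wins on scalars.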
diff --git a/skills/bmad-workflow-builder/assets/SKILL-template.md b/skills/bmad-workflow-builder/assets/SKILL-template.md index 57ca21e..ac041ee 100644 --- a/skills/bmad-workflow-builder/assets/SKILL-template.md +++ b/skills/bmad-workflow-builder/assets/SKILL-template.md @@ -1,53 +1,57 @@ --- -name: {module-code-or-empty}{skill-name} -description: { skill-description } # [5-8 word summary]. [trigger phrases, e.g. Use when user says create xyz or wants to do abc] +name: {skill-name} +description: {one-line summary plus the trigger phrases that should route here, e.g. "Use when the user says X or wants to Y"} --- -# {skill-name} - -## Overview + -### Step 2: Execute Prepend Steps +# {skill-name} -Execute each entry in `{workflow.activation_steps_prepend}` in order before proceeding. +{One paragraph: who the skill is acting as, the outcome it produces, and who +consumes that output. Write it once; do not restate it lower down.} -### Step 3: Load Persistent Facts +## Conventions -Treat every entry in `{workflow.persistent_facts}` as foundational context for the whole run. Entries prefixed `file:` are paths or globs — load the referenced contents as facts. All other entries are facts verbatim. +- Bare paths (e.g. `references/guide.md`) resolve from the skill root. +- `{skill-root}` resolves to this skill's installed directory. +- `{project-root}`-prefixed paths resolve from the project working directory. -### Step 4: Load Config +## On Activation -{/if-customizable} -{if-module} -Load available config from `{project-root}/_bmad/config.yaml` and `{project-root}/_bmad/config.user.yaml` (root level and `{module-code}` section). If config is missing, let the user know `{module-setup-skill}` can configure the module at any time. Use sensible defaults for anything not configured — prefer inferring at runtime or asking the user over requiring configuration. 
-{/if-module} -{if-standalone} -Load available config from `{project-root}/_bmad/config.yaml` and `{project-root}/_bmad/config.user.yaml` if present. Use sensible defaults for anything not configured. -{/if-standalone} -{if-customizable} +1. Load config from `{project-root}/_bmad/config.yaml` (and `.user.yaml` if present). Use sensible defaults for anything missing rather than requiring configuration. -### Step 5: Execute Append Steps + +2. Resume check. Look for an existing `.memlog.md` in the run folder. If one is found, read it once to rebuild state and continue append-only; otherwise initialize a new memlog with `python3 scripts/memlog.py init --path /.memlog.md`. -Execute each entry in `{workflow.activation_steps_append}` in order before entering the workflow's first stage. + +3. Resolve the `workflow` block by reading `customize.toml`, then the team and user override files in that order, applying the structural merge rules. Reference resolved values as `{workflow.}` everywhere below; never hardcode a path beside a declared scalar. -{/if-customizable} +## {Body} -{The rest of the skill — body structure, sections, phases, stages, scripts, external skills — is determined entirely by what the skill needs. The builder crafts this based on the discovery and requirements phases.} +{The body is whatever the skill needs and nothing more. State each beat as the +outcome you want, reserving exact procedure for the few places a wrong move costs +something. Name stages with descriptive words, never numbered prefixes.} diff --git a/skills/bmad-workflow-builder/assets/report-shell.html b/skills/bmad-workflow-builder/assets/report-shell.html new file mode 100644 index 0000000..15451d2 --- /dev/null +++ b/skills/bmad-workflow-builder/assets/report-shell.html @@ -0,0 +1,580 @@ + + + + + +Skill Analysis Report + + + +
+ Skill Analysis Report + Subject: · Generated: · Schema: + Clipboard was unavailable. Copy the text below manually: + Copied
    + + + + + + + diff --git a/skills/bmad-workflow-builder/references/build-process.md b/skills/bmad-workflow-builder/references/build-process.md index 900136e..fdc435a 100644 --- a/skills/bmad-workflow-builder/references/build-process.md +++ b/skills/bmad-workflow-builder/references/build-process.md @@ -1,154 +1,54 @@ -**Workspace.** Once intent is clear and the target skill is named (propose a kebab-case name for new skills if the user didn't give one — they can rename later, that's a logged decision not a redo), write `.decision-log.md` at the skill's root as a peer of `SKILL.md`. The decision log is canonical memory — load-bearing decisions, rejected alternatives, and overrides live on disk, not in the conversation. On resume, append a new session heading; at handoff, audit the log so the user signs off on how their thinking was handled. +# Build Process -## Phase 1: Classify +This is one loop, not a sequence of phases. It carries Build and Edit, because an edit is the same loop pointed at a skill that already exists. The order below is the usual order of discovery, but nothing forces you to march through it; you pursue whichever outcome the conversation is ready for and you revisit earlier ones as the picture sharpens. Each outcome is a thing you want to be true, not a step you check off. -**Outcome:** you and the user agree on the skill type and whether it's part of a module. Reasoning is shared, not hidden. +Load `references/skill-quality-principles.md` before you draft anything. It is the same institutional-knowledge file the scanners verify against, so building against it from the start is cheaper than fixing later. Load `references/standard-fields.md` for frontmatter and naming conventions, and `references/complex-workflow-patterns.md` only if the skill needs multi-stage routing across carved-out reference files. -| Type | When | -|---|---| -| **Simple Utility** | Composable building block with clear input → processing → output. Often deterministic. 
No multi-turn discovery. | -| **Simple Workflow** | Multi-step process that fits inline in SKILL.md as named sections (`## Discovery`, `## Constraints`, etc.). Default. | -| **Complex Workflow** | SKILL.md routing + carved-out sections in `references/` with descriptive filenames. Reserved for workflows whose SKILL.md would otherwise be too big to scan (~250+ lines). | +## Open by understanding why the user came -Default to Simple Workflow. Carving is a SIZE decision, not a stage-count decision. +Before you read a single artifact, understand what the user is actually trying to get done and what "good" looks like to them. The open-floor invitation in activation does most of this work, so read what they dumped and mine the conversation history for the tools, the sequence, the corrections, and the inputs and outputs they have already shown you. Then ask only the gaps that remain. On an edit, this means reading the part of the existing skill the change touches and ignoring the rest, rather than re-deriving the whole spec. -If module-based: capture module code, other skills it'll invoke (with name / inputs / outputs), and config variables it needs. +## Capture continuously into the memlog -For Workflows that produce an artifact: confirm whether `--headless` should be supported. +As decisions and directions land, write them to `{target-skill-path}/.memlog.md` through `scripts/memlog.py` (`init` once when the target is named, then `append --type ` as things happen). For a new skill, propose a kebab-case name when the user did not give one; renaming later is a logged decision, not a redo. The memlog is the canonical process memory, the source for resume, and the trail you audit at handoff so the user can confirm their thinking was handled the way they meant. Capture as you go, not in a batch at the end, because the value is in catching the reasoning while it is still fresh. 
-**On Edit:** classification is already set — read it from the existing skill or from `.decision-log.md` frontmatter. Skip this phase. +## Write the minimal outcome-driven version first -## Phase 2: Determine Spec +Draft the smallest skill that could possibly work. Hold it to four things: the role, the outcome, the consumer of the output, and any rule whose absence has already caused real damage. Everything else stays out until a comparison earns it. Apply the core test to every line you are tempted to write, because a model that already knows how to do the thing does not need to be told how. Default to writing the whole workflow inline in SKILL.md as named sections, and carve into `references/` with descriptive filenames only when SKILL.md grows too large to scan in one pass. Never use numbered prefixes on those files. -**Outcome:** you have everything needed to draft the skill — extracted from what the user has already shared (open-floor + decision log) plus targeted follow-ups for whatever's missing. +## Run it on real input and reach for eval at the eval beat -Through what's already known or further conversation, determine all of the following that are relevant: +A skill that has never run is a guess. Run the minimal version on the real, messy input the user actually has. This is the eval beat, and it is where you invoke `bmad-eval-runner`. Offer baseline mode to confirm the skill beats the bare model on the same input, because a skill that does not beat the bare model has no reason to exist. Offer trigger mode to harden the description against near-miss queries. Both are opt-in; surface them, explain what each one settles, and let the user decide. -| Field | Applies | Notes | -|---|---|---| -| Name | All | kebab-case. `{module-code}-{name}` for modules, `{name}` standalone. `bmad-` reserved for official. | -| Description | All | `[5-8 word summary]. [Use when user says 'specific phrase'.]` See `references/standard-fields.md`. 
| -| Overview | All | What / How / Why-Outcome. Domain framing + theory of mind for interactive or complex skills. | -| Role | Workflows | "Act as a [role/expert]" primer. | -| Design rationale | Where non-obvious | Choices the executing agent should understand so it doesn't optimize them away. | -| External skills | All | Which other skills this calls. | -| Scripts | All | Deterministic operations to push out of prompts; see `references/script-opportunities-reference.md`. List non-stdlib deps and get user approval (`uv` required). | -| Output documents | All | Yes/no — uses `{document_output_language}` if yes. | -| Revisable artifact | If output doc | If Update / Validate intents are likely, propose the Decision-Log Workspace pattern (`references/skill-quality-principles.md`). | -| Inputs / outputs | Simple Utility | Format, schema, required fields. | -| Stages | Workflows | Named sections (Simple) or carved files in `references/` with descriptive filenames (Complex). | -| Module capability | If module-based | phase-name, after, before, is-required, short description. | -| Customization | All | Fixed, or swappable templates / paths / hooks? Default no. If yes, walk each scalar (`_template`, `_output_path`, `on_`); auto-promote in headless. | +## Add scaffolding only when a comparison demands it -The customization opt-in question (interactive only): +Do not add structure on a hunch. Add it only when a worse-on-a-named-dimension comparison shows the minimal version failing on something concrete, where you can say exactly what breaks without the addition. When you do add something, write what survives as a goal rather than a prescription, and reserve exact procedure for the few places where a wrong move costs something real. If you find yourself reaching for more structure, first ask whether a sharper outcome statement would have produced the same result; most of the time it would, so sharpen the sentence and skip the scaffold. 
-> "Should this support end-user customization (activation hooks, swappable templates, output paths)? If no, it ships fixed — users who need changes fork it." +## Hunt for script opportunities throughout -For path conventions and customize.toml schema, see `references/skill-quality-principles.md`. +This is the builder's differentiator, so keep it active the whole way through rather than treating it as a single checkpoint. Apply the determinism test and the signal-verb scan from `references/script-opportunities-reference.md` to anything the skill does, prefer native Python, and propose the pre-pass JSON pattern wherever the model would otherwise read raw files to extract facts a script could hand it. If eval transcripts show the model re-writing the same helper across runs, that is the signal to bundle it as a script once. List any non-stdlib dependency and confirm it with the user before relying on it. -**On Edit:** spec is already defined by the existing skill. Read what's relevant to the change, ignore the rest. Update the decision-log with what's actually changing and why. +## Decide customization with the explicit ask -## Phase 3: Draft & Refine +Ask once, interactive only, and default to no: "Should this support end-user customization such as activation hooks, swappable templates, or output paths? If no, it ships fixed and anyone who needs a change forks it." Headless also defaults to no and emits a `customize.toml` only when the invocation explicitly asks for customization; log that decision in the memlog either way. When customization is accepted, bake the universal defaults and offer only the skill-specific points whose matching stages exist, following `references/customize-toml-guide.md`. When it is declined, emit no `customize.toml` and no resolver step, and the skill uses hardcoded paths throughout. -**Load `references/skill-quality-principles.md` before reviewing the plan** — same principles file the quality scanners verify against. 
Building against it upfront is cheaper than fixing afterwards. +## Wire the universal shape, strip ceremony, and ship -Present a plan. Point out vague areas. Iterate with the user until the outcome and shape are clear. Apply the principles file's core test to every planned instruction: **would an LLM do this correctly without being told?** If yes, cut it. +Wire in the shape every producing skill shares: memlog capture during the run, a distillation at finalize for skills whose output feeds downstream consumers, projections produced on demand rather than maintained, polish gated on the user's temperament, and a reviewer gate for skills that produce something substantive. Then strip the ceremony. Confirm the skill passes its own leanness scanner before you hand it off, because the builder has no standing to teach leanness while shipping bloat. When the skill is lean, runs on real input, and the user has signed off on the memlog audit, ship it. -## Phase 4: Build +## Handoff -**Load:** +Interactive: show what was built, the lint results, and offer the next steps, which usually means running Analyze over the new skill or moving on. Point the user at the memlog at `{target-skill-path}/.memlog.md` and walk the audit so they confirm their reasoning was handled the way they intended. Before handoff, run the structural lint and path lint over the built skill and fix high or critical findings. -- `references/skill-quality-principles.md` — what earns its place, BMad institutional knowledge, failure modes (already loaded in Phase 3; keep open) -- `references/standard-fields.md` — field-by-field schema reference for frontmatter, customize.toml, and the Overview formula -- `references/complex-workflow-patterns.md` (Complex Workflow only) — config integration, compaction survival, document-as-cache - -Load `assets/SKILL-template.md` and `references/template-substitution-rules.md`. Default to writing the entire workflow inline in SKILL.md as named sections. 
Carve out to `references/` ONLY when SKILL.md would otherwise be too big to scan; when you do, use descriptive filenames (`press-release.md`), never numbered prefixes (`01-discover.md`). Output to `{bmad_builder_output_folder}`. - -**If the SKILL.md references multiple internal files** (anything in `references/`, `assets/`, `scripts/`, `agents/`), stamp the Conventions block at the top of SKILL.md (after Overview, before On Activation): - -```markdown -## Conventions - -- Bare paths (e.g. `references/press-release.md`) resolve from the skill root. -- `{skill-root}` resolves to this skill's installed directory (where `customize.toml` lives). -- `{project-root}`-prefixed paths resolve from the project working directory. -- `{skill-name}` resolves to the skill directory's basename. -``` - -**If `{customizable}` is yes:** - -- Emit `customize.toml` alongside SKILL.md from `assets/customize-template.toml`. Fill `[workflow]` with the Phase 2 scalars. -- In SKILL.md, replace hardcoded references with `{workflow.}` indirection. `assets/brief-template.md` → `{workflow.brief_template}` if lifted. -- Add the resolver activation step before config load: - - ```markdown - ### Step 1: Resolve the Workflow Block - - Run: `python3 {project-root}/_bmad/scripts/resolve_customization.py --skill {skill-root} --key workflow` - - If the script fails, resolve the `workflow` block yourself by reading these three files in base → team → user order and applying structural merge rules: `{skill-root}/customize.toml`, `{project-root}/_bmad/custom/{skill-name}.toml`, `{project-root}/_bmad/custom/{skill-name}.user.toml`. Scalars override, tables deep-merge, arrays of tables keyed by `code`/`id` replace matching entries and append new ones, all other arrays append. - ``` - -- Execute `{workflow.activation_steps_prepend}` before the workflow's first stage and `{workflow.activation_steps_append}` after greet but before Stage 1. 
Treat `{workflow.persistent_facts}` as foundational context loaded on activation (`file:` prefix = path/glob; bare entries = literal facts). - -**If `{customizable}` is no:** no `customize.toml`, no resolver step. SKILL.md uses hardcoded paths throughout. - -**If the skill uses the Decision-Log Workspace pattern** (Phase 2 confirmed it produces a revisable artifact): - -- Add `output_dir` and `output_folder_name` scalars to `customize.toml [workflow]`. Default shape: - - `output_dir = "{planning_artifacts}/"` (e.g. `briefs`, `analyses`) - - `output_folder_name = "-{project_name}-{date}"` - - This implies `{customizable}=yes` — if the user declined customization, ask whether to enable it for these two scalars. -- In SKILL.md Activation, after config resolution: bind `{doc_workspace} = {workflow.output_dir}/{workflow.output_folder_name}/`. -- Wire Create / Update / Validate intents and a Finalize audit per `references/skill-quality-principles.md` § Decision-Log Workspace Pattern. Follow the **Treatment style** sub-section there: state the principle once where it first applies, mention reads at the moments that matter, no prescribed frontmatter schema, no `## Workspace` header, no tree diagram. The workspace is just files. -- If the artifact will feed downstream LLM consumers: offer a `distillate.md` at finalize. Skip with a note if no distillation tool is available; never inline a substitute. - -**Skill source tree** (only create folders that are needed): - -``` -{skill-name}/ -├── SKILL.md # Frontmatter, Overview, Activation, the workflow itself (default), routing if carved -├── customize.toml # Only if {customizable} is yes -├── references/ # Carved-out workflow sections — descriptive names, no numbered prefixes -├── assets/ # Templates and other static content the workflow loads -├── scripts/ # Deterministic code with tests -│ └── tests/ -``` - -Never put workflow content (`*.md` prompt files) directly at skill root — that's `SKILL.md`'s job. 
Carve-outs always go in `references/`. - -| Location | Contains | LLM relationship | -| ----------------- | --------------------------------------------------------- | ------------------------------------ | -| **SKILL.md** | Overview, Activation, inline workflow OR routing to refs | LLM identity, the workflow itself | -| **`references/`** | Carved-out workflow sections (descriptive names) | Loaded on demand by SKILL.md routing | -| **`assets/`** | Templates, starter files, static content | Copied/transformed into output | -| **`scripts/`** | Python, shell scripts with tests | Invoked for deterministic operations | - -**If the built skill includes scripts**, also load `references/script-standards.md` — ensures PEP 723 metadata, correct shebangs, and `uv run` invocation from the start. - -**Lint gate** — validate and auto-fix. If subagents are available, delegate lint-fix; otherwise run inline. - -1. Run both lint scripts in parallel: - ```bash - python3 scripts/scan-path-standards.py {skill-path} - python3 scripts/scan-scripts.py {skill-path} - ``` -2. Fix high/critical findings, re-run (up to 3 attempts per script). -3. Run unit tests if scripts exist in the built skill. - -## Phase 5: Handoff - -**Interactive:** show what was built, lint results, and offer next steps (commit, run quality analysis). Decision log is at `{target-skill-path}/.decision-log.md`. - -**Headless** (`{headless_mode}=true`): emit JSON only. `intent` is `"build"` for new, `"edit"` for existing. +Headless (`{headless_mode}=true`): call `set-complete` on the memlog and emit JSON only. ```json { "status": "complete", "intent": "build", "skill": "{target-skill-path}", - "decision_log": "{target-skill-path}/.decision-log.md" + "memlog": "{target-skill-path}/.memlog.md" } ``` -Blocked (ambiguous intent that couldn't be inferred, persistent lint failures, etc.): replace `"complete"` with `"blocked"` and add `"reason": ""`. The log carries the detail. +Use `"intent": "edit"` for an existing skill. 
If the run is blocked by ambiguous intent that could not be inferred or by lint failures that would not clear, replace `"complete"` with `"blocked"` and add `"reason": ""`. The memlog carries the detail. diff --git a/skills/bmad-workflow-builder/references/complex-workflow-patterns.md b/skills/bmad-workflow-builder/references/complex-workflow-patterns.md index f7ee46a..c4c8f1d 100644 --- a/skills/bmad-workflow-builder/references/complex-workflow-patterns.md +++ b/skills/bmad-workflow-builder/references/complex-workflow-patterns.md @@ -1,49 +1,59 @@ # Complex Workflow Patterns -Patterns for workflows whose SKILL.md got too big and had to carve out to `references/`. The default for any new skill is **inline** — a multi-stage coaching workflow lives in a single SKILL.md. Reach for these patterns only when SKILL.md genuinely won't fit. +Patterns for workflows whose SKILL.md grew past what one file can hold and had to carve work out to `references/`. The default for any new skill is inline, where a multi-stage coaching workflow lives in a single SKILL.md. Reach for these patterns only when SKILL.md genuinely will not fit its token budget. ## Carve-Out Conventions When carving out to `references/`: -- Descriptive filenames (`press-release.md`, `customer-faq.md`, `verdict.md`). Never numbered prefixes — the carve-out is a section, not a "step." SKILL.md decides the order by routing. -- Each file works standalone (context compaction can drop SKILL.md). No "as described in the overview." -- SKILL.md keeps Overview, Activation, the Conventions block (see `references/skill-quality-principles.md`), and the routing logic. Everything else moves out. -- `assets/` is for templates and other static content the workflow loads, not for stages. +- Use descriptive filenames (`press-release.md`, `customer-faq.md`, `verdict.md`), never numbered prefixes. The carve-out is a section, not a step, and SKILL.md decides the order by routing. 
+- Each file works standalone, because context compaction can drop SKILL.md mid-flow. Do not write "as described in the overview." +- SKILL.md keeps the role paragraph, activation, the conventions block (see `references/skill-quality-principles.md`), and the routing logic. Everything else moves out. +- `assets/` holds templates and other static content the workflow loads, not stages. ## Workflow Persona -BMad workflows treat the human operator as the expert. The agent facilitates — asks clarifying questions, presents options with trade-offs, validates before irreversible actions. The operator knows their domain; the workflow knows the process. +BMad workflows treat the human operator as the expert. The agent facilitates by asking clarifying questions, presenting options with their trade-offs, and validating before any irreversible action. The operator knows the domain and the workflow knows the process. ## Config Reading and Integration -Workflows read config from `{project-root}/_bmad/config.yaml` and `config.user.yaml`. +Workflows read config from `{project-root}/_bmad/config.yaml` and `config.user.yaml`. Reading project config at activation is not a customization surface, so it stays even in skills that ship fixed. Customization lives only in customize.toml (see `references/customize-toml-guide.md`). -**Module-based skills** load with fallback and setup-skill awareness: +Module-based skills load with fallback and setup-skill awareness: ``` Load config from {project-root}/_bmad/config.yaml ({module-code} section) and config.user.yaml. If missing: inform user that {module-setup-skill} is available, continue with sensible defaults. ``` -**Standalone skills** load best-effort: +Standalone skills load best-effort: ``` Load config from {project-root}/_bmad/config.yaml and config.user.yaml if available. -If missing: continue with defaults — no mention of a setup skill. +If missing: continue with defaults, no mention of a setup skill. 
``` -Config variables resolved already contain `{project-root}` — never double-prefix. +Config variables resolve already containing `{project-root}`, so never double-prefix. -## Decision-Log Workspace Pattern (canonical compaction survival) +## Memory: memlog Is the Canonical Process Memory -For workflows that produce revisable artifacts, the Decision-Log Workspace pattern is the default. See `references/skill-quality-principles.md` for the full treatment. +For workflows that run across turns or produce revisable artifacts, on-disk process memory is the memlog written by `scripts/memlog.py`. The memlog is the load-bearing artifact for identity across sessions. The document is what the user takes away, and the memlog is what carries the decisions, directions, assumptions, gaps, and events that produced it. -**The pattern in one paragraph.** The workspace folder (artifact + `.decision-log.md` + optional `addendum.md` + optional `distillate.md`) exists from the moment intent is confirmed. Decision-log captures every meaningful decision and rationale; addendum captures rejected alternatives. Resume on activation, conflict-detect on update, audit at finalize. The decision log is the load-bearing artifact — the document is what the user takes; the log is what carries identity across sessions. +The memlog is `.memlog.md`, append-only, written atomically through the CLI. The model captures continuously as decisions and directions land, reads the ack each command prints, and never re-reads the file mid-session. On resume, the parent reads the whole memlog once to rebuild state, then resumes appending. -**For Complex Workflows that route to carved-out files**, each carved file must work standalone (compaction can drop SKILL.md mid-flow). Carved files reference the workspace by config-resolved path (`{workflow.output_dir}/{workflow.output_folder_name}/`) — never assume in-context state. 
+CLI shape: -**YAML frontmatter on the primary artifact** (status + inputs survives compaction): +| Command | Effect | +|---|---| +| `memlog.py init --path ` | Create the memlog | +| `memlog.py append --path --type --text ` | Add one typed entry | +| `memlog.py set-complete --path ` | Mark the run complete | + +Entry types: `decision`, `direction`, `assumption`, `gap`, `note`, `event`. There is no remove or edit command, because history is never rewritten. + +For complex workflows that route to carved-out files, each carved file reaches the memlog by its config-resolved path rather than assuming in-context state, because compaction can drop SKILL.md before the carved file runs. + +YAML frontmatter on the primary artifact carries status and inputs through a compaction: ```markdown --- @@ -56,40 +66,65 @@ updated: '2025-03-02T11:30:00Z' --- ``` -**When NOT to apply:** purely conversational workflows, one-shot single-turn outputs, multi-artifact workflows where each artifact gets its own folder. +When not to keep a memlog: purely conversational workflows, one-shot single-turn outputs, and multi-artifact workflows where each artifact gets its own folder and its own memory. + +## Three-Mode Architecture + +A skill that serves more than one intent routes by mode rather than by branching deep inside a single procedure. The three modes most producing skills land on are create, update, and validate. Each mode has its own entry path, its own resume behavior, and its own memlog interaction, but all three share the role paragraph, activation, and conventions block. + +Create starts a fresh run, inits the memlog, and walks discovery through finalize. Update resumes against an existing artifact, reads the memlog once to rebuild state, surfaces any conflict before applying changes, and appends new entries. Validate is read-only, grades the artifact against its standards, and writes nothing the user has to keep. 
+ +Mode selection happens at activation from the user's intent, not from a quiz. If the intent is ambiguous, ask the one question that disambiguates, then route. Keep the mode boundary clean so a reader of any single mode never has to reason about the other two. + +## Graceful Degradation + +A workflow that depends on config, a prior artifact, or an optional script should degrade rather than stop. Each dependency has a named fallback, and the fallback is the path the skill takes when the dependency is absent rather than an error the user has to clear. + +| Dependency missing | Degraded behavior | +|---|---| +| Project config.yaml | Continue with sensible defaults; standalone skills say nothing, module skills name the setup skill | +| Prior artifact on update | Offer to start a create run instead | +| Optional non-stdlib script dep | Fall back to the stdlib path and report which path ran | +| customize.toml override files | Resolver reads the three files it can find and uses baked defaults for the rest | + +Degradation is a design property, not an exception handler. State the fallback inline where the dependency is read, so the reader sees both the happy path and the floor in one place. + +## Multi-Stage Routing as an Earn-It Surface -## Routing from SKILL.md +Multi-stage routing is structure, and structure has to earn its place against a flatter alternative. Before splitting a workflow into routed stages, ask whether a single goal-driven SKILL.md with named sections would have produced the same result. Usually it would, so reach for explicit stages only when the workflow is large enough that SKILL.md cannot hold it within budget, or when stages have genuinely different resume and memory behavior. -When SKILL.md routes to a carved-out file, the route is by descriptive name. Use a Stages table near the bottom of SKILL.md: +When stages earn their place, name them descriptively and route by intent. 
The stage table near the bottom of SKILL.md is a reading aid that maps an intent to a location: ```markdown ## Stages -| # | Stage | Purpose | Location | -|---|-------|---------|----------| -| 1 | Ignition | Raw concept, enforce customer-first thinking | SKILL.md (above) | -| 2 | Press Release | Iterative drafting with hard coaching | `references/press-release.md` | -| 3 | Customer FAQ | Devil's advocate customer questions | `references/customer-faq.md` | +| Stage | Intent it serves | Location | +|-------|------------------|----------| +| Ignition | Capture the raw concept, enforce customer-first thinking | SKILL.md (above) | +| Press Release | Iterative drafting with hard coaching | `references/press-release.md` | +| Customer FAQ | Surface devil's-advocate customer questions | `references/customer-faq.md` | ``` -The `#` is a reading aid for the table, not a filename prefix. +The intent routing table is what makes the split worth its cost, because the model reads the user's intent and jumps straight to the stage that serves it rather than walking a fixed sequence. There is no numbered prefix on any stage filename, and the stage order is a routing decision SKILL.md makes per run, not a property baked into the file names. ## Module Metadata Reference -BMad module workflows require extended frontmatter metadata. See `references/metadata-reference.md` for the metadata template and field explanations. +BMad module workflows carry extended frontmatter metadata. See `references/standard-fields.md` for the field conventions. The workflow-builder captures module-capability metadata as handoff fields only and never authors module.yaml. ## Architecture Checklist Before finalizing a complex BMad workflow: -- [ ] Default reconsidered — would this fit inline as named sections in a single SKILL.md? -- [ ] Facilitator persona — treats the operator as expert? -- [ ] Config integration — language, output locations read and used? 
-- [ ] Conventions block stamped at top of SKILL.md (when multiple internal files are referenced) -- [ ] Carve-outs in `references/` use descriptive names, no numbered prefixes -- [ ] Each carved file works standalone (compaction survival) -- [ ] Decision-Log Workspace pattern applied (or explicit reason for skipping — Simple Utility, one-shot, purely conversational) -- [ ] Resume protocol — Activation checks for existing workspace and offers to resume -- [ ] Update mode reads `.decision-log.md` first; surfaces conflicts before applying changes -- [ ] Final polish — subagent polish step at the end? -- [ ] Finalize step includes decision-log audit (every entry → primary, addendum, or explicit process noise) +- [ ] Default reconsidered: would this fit inline as named sections in a single SKILL.md within budget? +- [ ] Facilitator persona treats the operator as the expert +- [ ] Config integration reads language and output locations and uses them +- [ ] Conventions block stamped at the top of SKILL.md when multiple internal files are referenced +- [ ] Carve-outs in `references/` use descriptive names with no numbered prefixes +- [ ] Each carved file works standalone for compaction survival +- [ ] Memory via memlog, with resume reading the file once on activation (or an explicit reason for skipping: simple utility, one-shot, or purely conversational) +- [ ] Three-mode boundary is clean where the skill serves create, update, and validate +- [ ] Each external dependency names its degraded fallback inline +- [ ] Multi-stage routing earned its place against a flat SKILL.md, with an intent routing table +- [ ] Update mode reads the memlog first and surfaces conflicts before applying changes +- [ ] Final polish through a subagent polish step at the end +- [ ] Finalize step distills the run and confirms the memlog is complete diff --git a/skills/bmad-workflow-builder/references/customize-toml-guide.md b/skills/bmad-workflow-builder/references/customize-toml-guide.md new file 
mode 100644 index 0000000..b20159a --- /dev/null +++ b/skills/bmad-workflow-builder/references/customize-toml-guide.md @@ -0,0 +1,119 @@ +# customize.toml Guide + +customize.toml is the only customizability mechanism a built skill ships with. There are no installer questions, no module.yaml embedding, no separate config.yaml authoring, and no settings or options concept inside the skill. When a skill needs end-user customization, it gets a customize.toml with the universal defaults baked in and the skill-specific points offered where they apply. When it does not, it ships fixed with hardcoded paths and no resolver step, and anyone who needs a change forks it. + +This guide covers when to emit customize.toml, what goes in it, how overrides merge, and which mechanisms are forbidden. + +## The Ask + +Whether a skill gets a customize.toml is a decision made once during the build, interactive-only, defaulting to NO: + +> Should this support end-user customization such as activation hooks, swappable templates, or output paths? If no, it ships fixed and anyone who needs changes forks it. + +Default no. Most skills do not need a customization surface, and a surface nobody uses is friction the reader has to skip past. Headless runs also default to NO and emit customize.toml only when the invocation explicitly requests customization. Whatever is decided, log it in the memlog as a decision. + +When the answer is no, emit no customize.toml, add no resolver step to activation, and use hardcoded paths throughout the skill. When the answer is yes, bake the universal defaults and offer the skill-specific points whose stages exist. + +## DO-NOT-EDIT Header Convention + +Every emitted customize.toml opens with a header that names the file as generated and points to the override files the user actually edits: + +```toml +# DO NOT EDIT -- overwritten on every update. +# +# Workflow customization surface for {skill-name}. 
+# Team overrides: {project-root}/_bmad/custom/{skill-name}.toml +# Personal overrides: {project-root}/_bmad/custom/{skill-name}.user.toml +``` + +The customize.toml in the skill is the base. The user never edits it, because an update overwrites it. Edits go in the two override files, which the resolver merges over the base at activation. The header carries an inline note of the merge rules so a reader knows how an override will land without leaving the file. + +## Universal Baked Defaults + +When customization is accepted, these four points appear in nearly every producing skill, so they are baked in by default under `[workflow]`: + +| Key | Type | Default | Purpose | +|---|---|---|---| +| `activation_steps_prepend` | array | `[]` | Steps to run before standard activation (pre-flight loads, compliance checks). Overrides append. | +| `activation_steps_append` | array | `[]` | Steps to run after greet, before the workflow begins. Overrides append. | +| `persistent_facts` | array | `["file:{project-root}/**/project-context.md"]` | Static facts loaded on activation and kept in mind for the whole run. Overrides append. | +| `on_complete` | scalar | `""` | Instruction executed when the workflow reaches its terminal stage. Override wins. | + +`persistent_facts` entries are each a literal sentence, a `skill:`-prefixed reference, or a `file:`-prefixed path or glob whose contents load as facts. The default glob picks up a project-context.md anywhere under the project root if one exists, and resolves to nothing when it does not. + +## Offered-When-Relevant Points + +Beyond the universal four, offer a point only when the matching stage exists in the skill. Offering an output-path knob to a skill that produces no artifact is a no-op surface the reader has to skip. + +| Point | Offer when | Shape | +|---|---|---| +| `_template` | The skill loads a template the user might want to swap | Scalar file path, e.g. 
`brief_template = "assets/brief-template.md"` | +| `_output_path` + `run_folder_pattern` | The skill produces artifacts to a writable destination | Paired scalars; the pattern names the per-run folder | +| `doc_standards` | A finalize stage applies standards to human-consumed docs | Array of `skill:` / `file:` / plain-text directives | +| `finalize_reviewers` | A review stage gates substantive output | Array of reviewer references | +| `external_sources` | A stage pulls in outside inputs | Array of source references | +| `external_handoffs` | A stage routes output onward | Array of handoff references, `tool:` for tool-style routing | + +The four arrays (`doc_standards`, `finalize_reviewers`, `external_sources`, `external_handoffs`) encode standards, not options. They are append-only lists the resolver merges, not toggles that switch behavior on and off. + +Entry convention for these arrays: each entry is a `skill:` reference, a `file:` reference, or plain text, with `tool:` used for handoff-style routing. Bare paths resolve from the skill root; use `{project-root}/...` to point at an org-owned resource elsewhere in the repo. + +## Three-Layer Merge Rules + +Three files compose at activation: the baked base in the skill, the team override (`{skill-name}.toml`), and the personal override (`{skill-name}.user.toml`). The resolver merges them in that order, last layer winning where the rules call for a winner, and falls back to reading the three files directly if no resolver is available. + +| Value kind | Merge behavior | +|---|---| +| Scalar (string, number, bool) | Override wins, last layer applied wins | +| Table | Deep-merge key by key | +| Array of tables (entries with `code` or `id`) | Match on `code`/`id`: replace the matching entry, append the new ones | +| Any other array | Append | + +There is no removal mechanism by design. To suppress a baked default, override it by key (for a scalar) or fork the skill (for an array entry you cannot reach by key). 
An override file never shrinks a list, so a base reviewer or standard cannot be silently dropped downstream. + +SKILL.md must reference resolved values as `{workflow.}`, for example `{workflow.brief_template}` or `{workflow.output_path}`. A hardcoded path written beside a declared scalar silently no-ops the override, because the resolver fills `{workflow.}` but the skill never reads it. The customization scanner flags exactly this hardcoded-path-beside-declared-scalar case. + +## Forbidden Mechanisms + +customize.toml is the sole config mechanism. The build flow never offers any of the following, and the customization scanner confirms none is present: + +- Installer or install-time questions +- module.yaml embedding or generation. The workflow-builder captures module-capability metadata as handoff fields only and never authors module.yaml. +- A separate config.yaml authored by the skill for its own settings. Reading the project's config.yaml at activation is legitimate and stays, because it is not a customization surface. +- Boolean-toggle config that switches behavior on and off +- Any settings or options concept inside the built skill + +Confirming script dependencies at build is also legitimate and stays, because it is a build-time check rather than a customization surface. + +## Example + +A complete customize.toml for an artifact-producing skill with a finalize stage: + +```toml +# DO NOT EDIT -- overwritten on every update. +# +# Workflow customization surface for bmad-product-brief. +# Team overrides: {project-root}/_bmad/custom/bmad-product-brief.toml +# Personal overrides: {project-root}/_bmad/custom/bmad-product-brief.user.toml + +[workflow] + +# --- Universal defaults. Merge: scalars override, arrays append. 
--- +activation_steps_prepend = [] +activation_steps_append = [] +persistent_facts = ["file:{project-root}/**/project-context.md"] +on_complete = "" + +# --- Skill-specific points (stages present: template, output, finalize) --- +brief_template = "assets/brief-template.md" +output_path = "{planning_artifacts}/briefs" +run_folder_pattern = "brief-{project_name}-{date}" + +# Standards applied at finalize. Append-only; base entries cannot be removed. +doc_standards = [ + "skill:bmad-editorial-review-structure", + "skill:bmad-editorial-review-prose", +] +``` + +A skill that produces no artifact and has no finalize stage carries only the `[workflow]` block with the four universal defaults, and a skill that declined customization carries no customize.toml at all. diff --git a/skills/bmad-workflow-builder/references/quality-analysis.md b/skills/bmad-workflow-builder/references/quality-analysis.md deleted file mode 100644 index 6e49dec..0000000 --- a/skills/bmad-workflow-builder/references/quality-analysis.md +++ /dev/null @@ -1,140 +0,0 @@ -# Quality Analysis - -Communicate with user in `{communication_language}`. Write report content in `{document_output_language}`. - -You orchestrate quality analysis on a BMad workflow or skill. The pipeline is optimized for speed and completeness: - -1. **Deterministic checks** (scripts) — zero tokens, instant -2. **LLM scanners** (parallel subagents) — judgment-based analysis against `skill-quality-principles.md` -3. **Fast JSON extraction** (deterministic script) — lossless capture of all scanner findings (~10 seconds, no LLM) -4. **HTML generation** — interactive, auto-opening report from JSON (no wait for synthesis) -5. **Optional markdown synthesis** (LLM subagent, background) — thematic analysis and archival markdown - -The scanners verify against `references/skill-quality-principles.md` — the same file the build process loads at create/edit time. Findings cite the principle that's being violated rather than restating it. 
- -## Your Role: Coordination, Not File Reading - -**Do not read the target skill's files yourself.** Scripts and subagents do all analysis. You orchestrate: run deterministic scripts and pre-pass extractors, spawn LLM scanner subagents in parallel, hand off to the report creator for synthesis. - -## Headless Mode - -If `{headless_mode}=true`, skip user interaction, use safe defaults, note any warnings, and output structured JSON as specified in the Present Findings section. - -## Pre-Scan Checks - -Check for uncommitted changes. In headless mode, note warnings and proceed. In interactive mode, inform the user, confirm before proceeding, and confirm the workflow is currently functioning. - -## Analysis Principles - -**Effectiveness over efficiency.** The analysis may suggest leaner phrasing, but if the current phrasing captures the right guidance, it should be kept. The report presents opportunities — the user applies judgment. - -## Scanners - -### Lint Scripts (Deterministic — Run First) - -Run instantly, cost zero tokens, produce structured JSON: - -| # | Script | Focus | Output File | -| -- | -------------------------------- | --------------------------------------- | -------------------------- | -| S1 | `scripts/scan-path-standards.py` | Path conventions | `path-standards-temp.json` | -| S2 | `scripts/scan-scripts.py` | Script portability, PEP 723, unit tests | `scripts-temp.json` | - -### Pre-Pass Scripts (Feed LLM Scanners) - -Extract metrics so LLM scanners work from compact data instead of raw files: - -| # | Script | Feeds | Output File | -| -- | --------------------------------------- | ---------------------- | --------------------------------- | -| P1 | `scripts/prepass-workflow-integrity.py` | architecture scanner | `workflow-integrity-prepass.json` | -| P2 | `scripts/prepass-prompt-metrics.py` | architecture scanner | `prompt-metrics-prepass.json` | -| P3 | `scripts/prepass-execution-deps.py` | determinism scanner | `execution-deps-prepass.json` | - 
-### LLM Scanners (Judgment-Based — Run After Scripts) - -Each scanner loads `references/skill-quality-principles.md` and writes a free-form analysis document: - -| # | Scanner | Focus | Pre-Pass | Output File | -| -- | ------------------------------------ | ------------------------------------------------------------------------------ | -------- | ---------------------------- | -| L1 | `quality-scan-architecture.md` | Structural integrity, prose craft, cohesion (was: integrity + craft + cohesion)| Yes (P1, P2) | `architecture-analysis.md` | -| L2 | `quality-scan-determinism.md` | Intelligence placement, parallelization, subagent delegation, script opportunities (was: execution-efficiency + script-opportunities) | Yes (P3) | `determinism-analysis.md` | -| L3 | `quality-scan-customization.md` | customize.toml opportunities and abuse | No | `customization-analysis.md` | -| L4 | `quality-scan-enhancement.md` | Edge cases, UX gaps, headless potential, facilitative patterns | No | `enhancement-analysis.md` | - -## Execution - -Bind `{quality-report-dir} = {skill-path}/.analysis/{date-time-stamp}/` and create the directory. Use this single name in every script invocation and subagent prompt below. Quality analyses live at the skill's own root, as a peer of `.decision-log.md` and `SKILL.md` — the audit trail travels with the skill. 
- -### Step 1: Run All Scripts (Parallel) - -```bash -python3 scripts/scan-path-standards.py {skill-path} -o {quality-report-dir}/path-standards-temp.json -python3 scripts/scan-scripts.py {skill-path} -o {quality-report-dir}/scripts-temp.json -uv run scripts/prepass-workflow-integrity.py {skill-path} -o {quality-report-dir}/workflow-integrity-prepass.json -python3 scripts/prepass-prompt-metrics.py {skill-path} -o {quality-report-dir}/prompt-metrics-prepass.json -uv run scripts/prepass-execution-deps.py {skill-path} -o {quality-report-dir}/execution-deps-prepass.json -``` - -### Step 2: Spawn LLM Scanners (Parallel) - -After scripts complete, spawn all four LLM scanners as parallel subagents. - -Each subagent receives: -- Scanner file to load -- Skill path: `{skill-path}` -- Output directory: `{quality-report-dir}` -- Pre-pass file paths (L1: P1+P2; L2: P3) - -The subagent loads its scanner file (which loads the principles file), analyzes the skill, writes its analysis to `{quality-report-dir}`, and returns the filename. - -### Step 3: Synthesize Report (Parallel with Scanner 4) - -Spawn report creator to synthesize scanner outputs into `report-data.json` and `quality-report.md`. This can run in parallel with the last scanner finishing. 
- -```bash -# Spawn as background task — does not block step 4 -Agent(description="Synthesize quality report", subagent_type="report-creator", run_in_background=true, prompt="...") -``` - -The report creator: -- Reads all 4 analysis files + prepass JSON -- Identifies thematic clusters (root-cause synthesis) -- Writes `report-data.json` with: broken, opportunities, strengths, recommendations, detailed_analysis -- Writes `quality-report.md` for archival - -### Step 4: Generate & Open HTML Report (Do Not Block on Markdown) - -As soon as `report-data.json` exists (the report creator writes it mid-synthesis), generate the interactive HTML report: - -```bash -python3 scripts/generate-html-report.py {quality-report-dir} --open -``` - -**Important:** Do not wait for `quality-report.md` to be written. The JSON is the complete data source. Open HTML immediately. The markdown report finishes asynchronously and provides archival context. - -### Step 5: Log the Run - -After HTML opens, append a session heading to `{skill-path}/.decision-log.md`: - -```markdown -## YYYY-MM-DD — Quality analysis - -Grade: . Interactive HTML: `.analysis//quality-report.html`. Full markdown: `.analysis//quality-report.md`. -``` - -## Present to User - -**Headless** (`{headless_mode}=true`): emit JSON only. - -```json -{ - "status": "complete", - "intent": "analyze", - "skill": "{skill-path}", - "decision_log": "{skill-path}/.decision-log.md", - "report": "{quality-report-dir}/quality-report.md" -} -``` - -Blocked (scanner failure, missing required input, etc.): replace `"complete"` with `"blocked"` and add `"reason": ""`. The log + any partial report carry the detail. - -**Interactive:** read `report-data.json` and present grade + 2-3 sentence narrative, broken items if any, top opportunities by theme, paths to the full report and HTML. Offer to apply fixes, walk findings, or discuss. 
diff --git a/skills/bmad-workflow-builder/references/quality-scan-architecture.md b/skills/bmad-workflow-builder/references/quality-scan-architecture.md deleted file mode 100644 index c5c5196..0000000 --- a/skills/bmad-workflow-builder/references/quality-scan-architecture.md +++ /dev/null @@ -1,63 +0,0 @@ -# Quality Scan: Skill Architecture - -You are a senior skill architect reviewing a BMad skill. Your job: identify what's missing, mismatched, or over-specified across the skill's structure, prose craft, and overall coherence — the things that would either break execution or push the executing agent into mechanical procedure-following instead of informed judgment. - -**Load `references/skill-quality-principles.md` first.** It is the bar you're testing against. Don't restate its rules; cite them when findings reference them. - -This scan absorbs what was previously three separate scanners (workflow-integrity, prompt-craft, skill-cohesion). Checking these together catches the mismatches that separate scans miss — a workflow split into files that belonged inline, an Overview promise that the execution instructions silently violate, prose that's structurally correct but mechanically deadening. - -## Scan Targets - -- `SKILL.md` — frontmatter, structure, inline workflow content, routing -- `references/*.md` — carved-out workflow sections (only present when SKILL.md was genuinely too big to keep inline) -- `assets/` — templates and other static content the workflow loads -- Anything other than `SKILL.md`, `customize.toml`, and the standard folders at skill root is suspect - -If pre-pass JSON files are provided (`workflow-integrity-prepass.json`, `prompt-metrics-prepass.json`), read those first for compact metrics; read raw files only as needed for judgment calls. - -## What to Find - -Run the principles file against the skill and surface findings in three buckets: - -**Structural integrity** — does what should exist exist, and is it wired correctly? 
-- Frontmatter follows the description format with quoted trigger phrases; no extra fields -- `## Overview` and `## On Activation` present and meaningful -- When SKILL.md references multiple internal files, the Conventions block is stamped (per the principles file's path-conventions section) -- Workflow content is inline in SKILL.md as named sections by default; only carved out to `references/` when SKILL.md was genuinely too big to scan -- **Carved-out files use descriptive names (`press-release.md`), NOT numbered prefixes (`01-discover.md`).** Flag numbered-prefix filenames. -- **No prompt files at skill root other than `SKILL.md` itself.** Flag any `*.md` workflow content directly under skill root that should be in `references/`. -- Routing from SKILL.md uses bare paths from skill root (`references/foo.md`) -- References in SKILL.md resolve to existing files (no orphans, no dangling refs) -- Carved-out files work standalone — no "as described in the overview" / "see SKILL.md" -- Where progression conditions exist, they're testable; "when ready" is vague -- Each carved file uses `{communication_language}` (and `{document_output_language}` if it produces a doc) -- No template artifacts (`{if-complex-workflow}`, bare `{skillName}`, etc.) -- No `## On Exit` sections -- Workflow type claim matches actual structure (Complex Workflow with everything inline → reclassify; Simple Workflow with carved references → either inline back or reclassify) - -**Prose craft** — does the SKILL.md and reference prose enable judgment without bloat? 
-- Overview establishes role, mission, and (where relevant) domain framing, theory of mind, design rationale -- No re-teaching of LLM-native skills (scoring formulas, calibration tables, adapter proliferation, format-the-output templates) -- No defensive padding ("make sure", "remember to", "this workflow is designed to") -- Direct imperatives, not "you should" / "please" -- Carved-out files survive context compaction — critical instructions in the file itself -- Size matches purpose (principles file thresholds); large data tables and reference material lifted out of SKILL.md - -**Cohesion** — does the skill hang together as a purposeful whole? -- Description matches what the skill actually does -- Workflow flows logically — earlier sections produce what later sections consume; no dead-ends, no overlaps -- **Promises-vs-behavior check** — if the Overview or design rationale states a principle ("we do X before Y"), trace through the workflow and verify the instructions enforce or at minimum don't contradict it. Implicit instructions ("acknowledge what you received") that violate stated principles are the most dangerous misalignment because they look correct on casual review. -- Complexity matches task — 10 phases for "format a file" is wrong; 2 phases for "architect a system" is wrong -- Dependency graph (`after` / `before` / `is-required`) reflects actual data flow, not artificial ordering - -## Output - -Write to `{quality-report-dir}/architecture-analysis.md`. Include: - -- **Assessment** — 2-3 sentence verdict on the skill as a coherent whole -- **Findings** — each with severity, file:line, what's wrong, why, how to fix. Distinguish genuine waste from load-bearing context (the principles file calls this out explicitly). 
-- **Strengths** — what's working that should be preserved - -Severity follows the principles file: anything that breaks execution or violates a stated promise is critical/high; over-specification, numbered-prefix filenames, or workflow files at skill root are high; coherence issues are medium; style is low. - -Return only the filename when complete. diff --git a/skills/bmad-workflow-builder/references/quality-scan-customization.md b/skills/bmad-workflow-builder/references/quality-scan-customization.md deleted file mode 100644 index cd858cc..0000000 --- a/skills/bmad-workflow-builder/references/quality-scan-customization.md +++ /dev/null @@ -1,48 +0,0 @@ -# Quality Scan: Customization Surface - -You are a customization-surface economist. Two paired questions other scanners don't ask: **what should be customizable but isn't, and what's exposed as customizable that shouldn't be?** - -**Load `references/skill-quality-principles.md` first.** Its "Customization (customize.toml)" section is the schema, naming conventions, and merge rules. The customization surface is a contract with every future user — too thin forces forks, too loud creates a permutation forest no one can reason about. - -This is purely advisory. Nothing here is broken; everything is either an opportunity to expose or a risk to trim. - -## Scan Targets - -- `customize.toml` — if present, the canonical schema for this workflow -- `SKILL.md` — `{workflow.X}` references (signals customize.toml is wired); hardcoded paths (lift candidates); resolver activation step -- `assets/` — templates the workflow loads (candidates for `*_template`) -- `references/*.md` — stage prompts that may reference configurable values - -If no `customize.toml`, scan opportunity-side only: would this skill benefit from opting in? 
- -## What to Find - -**Opportunities — things to lift:** -- Hardcoded template paths in SKILL.md or stages → `_template` scalars (each separate, don't bundle) -- Hardcoded output destinations → `_output_path` (weaker than templates; flag low unless org-dependent) -- Workflow produces an artifact and stops → consider `on_complete` hook -- Missing or empty `persistent_facts` — the BMad default glob (`["file:{project-root}/**/project-context.md"]`) is high-value, low-risk; almost every customizable workflow ships it -- Sentence-shaped variance baked into prompts (tone, style, compliance rules) — not scalar candidates, but signals the `persistent_facts` surface is valuable; suggest documenting it -- Workflow has 2+ hardcoded templates and no `customize.toml` at all → high-opportunity to opt in - -**Abuse — things to trim:** -- Boolean toggles (3+ in one file = the surface is doing the job of a variant skill; suggest two skills or fewer knobs) -- Identity / communication-style / principles in `[workflow]` (those are agent-shape fields — point the author at agent-builder; remove from workflow surface) -- 4+ `on_` hooks (workflow internals leaking into the override surface; users can interleave hooks at so many points they break the workflow's contract) -- Arrays of tables without `code` or `id` keys (resolver can't merge by key; falls back to append-only — users can't replace items) -- Mixed keying (`code` on some, `id` on others) — pick one -- Opaque scalar names (`style_config`, `mode`-as-path) — use the principles file's `*_template` / `*_output_path` / `on_` patterns -- `customize.toml` declares a scalar but SKILL.md hardcodes the same value (high-abuse — overrides silently no-op; SKILL.md must read `{workflow.}`) -- Scalars with no comment explaining when/why to override - -## Output - -Write to `{quality-report-dir}/customization-analysis.md`. Include: - -- **Customization posture** — opted in? Surface size and shape? 
-- **Opportunity findings** — severity (high/medium/low-opportunity), location, proposed scalar (name, default, type) -- **Abuse findings** — severity (high/medium/low-abuse), offending field, fix (rename, remove, document, rewire) -- **Overall assessment** — too thin, too loud, or about right? -- **Top 2-3 insights** distilled - -Return only the filename when complete. diff --git a/skills/bmad-workflow-builder/references/quality-scan-determinism.md b/skills/bmad-workflow-builder/references/quality-scan-determinism.md deleted file mode 100644 index 8889a09..0000000 --- a/skills/bmad-workflow-builder/references/quality-scan-determinism.md +++ /dev/null @@ -1,60 +0,0 @@ -# Quality Scan: Determinism & Distribution - -You are a performance and intelligence-placement reviewer. Your job: find work happening in the wrong place — deterministic operations done by an LLM, sequential operations that should run in parallel, parent reads that should be subagent delegations, and prompts doing what a script could do faster, cheaper, and more reliably. - -**Load `references/skill-quality-principles.md` first.** Its "Intelligence placement" and "Subagent constraints" sections are the bar. - -This scan absorbs what was previously two separate scanners (execution-efficiency, script-opportunities). Same root question: where is work happening that shouldn't be happening here? - -## Scan Targets - -- `SKILL.md` — On Activation patterns, inline operations -- `*.md` prompt files at root — stage instructions -- `references/*.md` — resource-loading patterns -- `scripts/` — what already exists (avoid suggesting duplicates) - -If `execution-deps-prepass.json` is provided, read it first for compact dependency metrics. - -## What to Find - -**Script opportunities** — for every operation in a prompt, ask: given identical input, will this always produce identical output? Could you write a unit test for it? If yes, it belongs in a script. 
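A minimal sketch of the determinism test above, in Python, assuming a hypothetical naming-convention check. The function names `is_kebab_case` and `check_names` and the regex are illustrative and are not taken from the repository's scripts. The check gives the same answer for the same input and is trivially unit-testable, which is the signal that it belongs in a script rather than a prompt.

```python
import re

# Assumed rule (hypothetical): lowercase alphanumeric words joined by single hyphens.
_KEBAB = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_kebab_case(name: str) -> bool:
    """Return True when `name` is kebab-case under the assumed rule."""
    return bool(_KEBAB.fullmatch(name))


def check_names(names: list[str]) -> list[str]:
    """Return the names that violate the convention, preserving input order."""
    return [n for n in names if not is_kebab_case(n)]
```

A prompt asking the model to "verify names follow the convention" would spend tokens on every run for work this function does for free; the judgment calls (is the name descriptive, does it match the skill's purpose) stay in the prompt.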
- -Patterns to surface: -- Validation against schemas, frontmatter checks, naming-convention enforcement -- Counting, aggregation, metrics extraction -- Format conversion, parsing, structured-data extraction from large files -- Cross-reference checks, dependency graph tracing, file-existence verification -- **Pre-passes** that hand the LLM compact JSON instead of raw files (highest-value, often missed — the LLM scanner reads the JSON, not the source) -- Post-processing validation of LLM-generated output - -For each, estimate the LLM tax in tokens-per-invocation: heavy (500+) → high; moderate (100–500) → medium; light (<100) → low. - -Scripts have access to bash + Python stdlib + PEP 723 deps + git + jq + system tools. Think broadly — a script that builds a dependency graph and feeds the LLM a compact summary is zero tokens for work that would otherwise cost thousands. - -Don't flag operations that genuinely require interpreting meaning, tone, context, or ambiguity. Those stay in prompts. - -**Distribution opportunities** — sequential or parent-bloating patterns: -- Independent reads / tool calls / operations done sequentially → batch in one message or fan out to subagents -- "Read all files, then analyze" → delegate the reading; parent stays lean -- Implicit-read trap (per principles file): language like "review", "acknowledge", "summarize what you have" causes the parent to read files before delegating. 
Fix: explicit "note paths for subagent scanning; don't read them now" -- Subagent prompts without exact return format / "ONLY return X" / token limit → verbose results -- Subagent-spawning-from-subagent (will fail at runtime — chain through parent) -- Resources loaded as a single block on every activation when they could be loaded selectively -- Dependency graph over-constrained (`after` listing things that aren't real inputs) → blocks parallelism -- "Gather then process" for independent items → each item should process independently -- Validation stages placed AFTER expensive operations → fail-fast lost; cheap validation should run first - -## Output - -Write to `{quality-report-dir}/determinism-analysis.md`. Include: - -- **Existing scripts inventory** — what's already there (so you don't propose duplicates) -- **Assessment** — 2-3 sentence verdict on intelligence placement and execution efficiency -- **Script findings** — each with severity (LLM tax band), file:line, what the LLM is currently doing, what a script would do, estimated token savings, language, pre-pass potential -- **Distribution findings** — each with severity, file:line, current pattern, efficient alternative, estimated impact -- **Aggregate token savings** estimate -- **Strengths** — efficient patterns worth preserving - -Severity comes from the principles file: anything that will fail at runtime is critical; heavy LLM tax or context-bloating reads are high; missed batching is medium; small parallelization wins are low. - -Return only the filename when complete. 
diff --git a/skills/bmad-workflow-builder/references/quality-scan-enhancement.md b/skills/bmad-workflow-builder/references/quality-scan-enhancement.md deleted file mode 100644 index 7d08936..0000000 --- a/skills/bmad-workflow-builder/references/quality-scan-enhancement.md +++ /dev/null @@ -1,55 +0,0 @@ -# Quality Scan: Enhancement Opportunities - -You are the creative imagination on this review — the one who asks **"what's missing that nobody thought of?"** when other scanners only check what's there. Inhabit the skill as different real users in different real situations, and find the moments where it would confuse, frustrate, dead-end, or underwhelm them — plus the moments where one creative addition would transform the experience. - -**Load `references/skill-quality-principles.md` first.** Its "Patterns BMad has seen pay off" section is the institutional library you'll check the skill against. - -This is purely advisory. Nothing here is broken; everything is opportunity. - -## Scan Targets - -- `SKILL.md`, stage prompts, `references/*.md` — walk the skill end-to-end as users would experience it - -## What to Find - -**Inhabit user archetypes** — the first-timer, the expert who knows what they want, the confused user (invoked by accident or with wrong intent), the edge-case user (technically valid but unexpected input), the hostile environment (deps fail, files missing, context limited), and **the automator** (cron / pipeline / another agent invoking this headless with pre-supplied inputs and expecting a usable return value). - -At each stage, ask: - -- What if the user provides partial, ambiguous, or contradictory input? -- What if they want to skip back, change their mind, or exit cleanly mid-flow? -- What happens if an external dependency is unavailable? -- What if context compaction drops critical state mid-conversation? -- Where does the skill complete but leave the user without a clear sense of what they got? 
- -**Headless assessment** — many workflows are built HITL-only but could work with a flag and a pre-supplied prompt. For each interaction point, ask whether a parameter could replace the question, whether a confirmation could be skipped with a reasonable default, whether a clarification is always needed or only for ambiguous input. Categorize: - -- **Headless-ready** — works today with minimal changes -- **Easily adaptable** — needs a headless path on 2-3 stages -- **Partially adaptable** — core artifact creation could be headless, but discovery is fundamentally interactive — suggest a "skip to build" entry point -- **Fundamentally interactive** — the value IS the conversation (coaching, brainstorming, exploration). That's OK; flag and move on. - -**Facilitative pattern check** — for any skill involving collaborative discovery or guided artifact creation, check the principles file's named patterns: soft-gate elicitation, intent-before-ingestion, capture-don't-interrupt, dual-output, parallel review lenses, three-mode architecture, graceful degradation. Flag missing ones with concrete suggestions when they'd be transformative. - -**Delight opportunities** — quick-win mode for experts, smart defaults from context, proactive insight ("you might also want to consider..."), progress awareness in long flows, useful alternatives when things go wrong, suggestions for adjacent skills. - -**Stay in your lane.** Don't flag structural issues (architecture scanner), efficiency or script opportunities (determinism scanner), or customization (customization scanner). Your findings should be things only a creative thinker would notice. - -## How to Think - -Go wild first — the weirdest user, the worst timing, the most unexpected input. No idea is too crazy in this phase. Then temper. For each wild idea, ask: is there a practical version that would actually improve the skill? If yes, distill to a sharp suggestion. 
If genuinely impractical, drop it — don't pad findings with fantasies. - -Prioritize by user impact. Preventing confusion outranks adding nice-to-haves. - -## Output - -Write to `{quality-report-dir}/enhancement-analysis.md`. Include: - -- **Skill understanding** — purpose, primary user, key assumptions (2-3 sentences) -- **User journeys** — for each archetype: brief narrative, friction points, bright spots -- **Headless assessment** — level + which interaction points could auto-resolve + what a headless invocation would need (inputs, return format) -- **Facilitative patterns check** — present/missing, which would be most valuable to add -- **Findings** — severity (high/medium/low-opportunity), location, what you noticed, concrete suggestion -- **Top 2-3 insights** distilled - -Return only the filename when complete. diff --git a/skills/bmad-workflow-builder/references/report-author.md b/skills/bmad-workflow-builder/references/report-author.md new file mode 100644 index 0000000..d6c8a88 --- /dev/null +++ b/skills/bmad-workflow-builder/references/report-author.md @@ -0,0 +1,64 @@ +# Report Author + +You receive the parent's consolidated findings from the five scan lenses and turn them into one HTML report. You are the only subagent that touches the report, and your whole job is to fill a single JSON island in a fixed shell. You never run a scanner, never read the skill under analysis, and never add a finding the parent did not hand you. If the parent gave you nothing, you produce a clean "no findings" report, not invented work. + +## What you get and what you produce + +Input: the merged findings list and the subject (the skill name or path that was analyzed). Each finding already carries the fields the scanners produced. + +Output: `assets/report-shell.html` with its `report-data` island replaced by your filled JSON object, written to the analysis run folder the parent names (typically beside the skill's `.memlog.md`). Return only that output path. 
+ +## The island contract + +The shell reads exactly one element and parses it with `JSON.parse`. The element is: + +```html + +``` + +Your object conforms to schema_version 1: + +```json +{ + "schema_version": 1, + "subject": "", + "generated": "", + "verdict": "", + "summary": { "critical": 0, "high": 0, "medium": 0, "low": 0 }, + "findings": [ + { + "id": "-", + "lens": "leanness | architecture | determinism | customization | enhancement", + "severity": "critical | high | medium | low", + "title": "", + "location": "", + "evidence": "", + "recommendation": "", + "proposed_smallest": "", + "predicted_delta": "" + } + ] +} +``` + +Rules for filling it: + +- `schema_version` is always `1`. +- `subject` and `generated` come from the parent and the current date in ISO form. +- `verdict` is one line that names the overall state and the one or two findings that matter most. It is your only synthesis; everything else is transcription. +- `summary` counts the findings you are emitting, by severity. The four keys are always present and any severity with no findings is `0`. The counts must match the `findings` array exactly, so derive them from the array rather than from memory. +- `findings` carries every finding the parent handed you, unchanged. Keep each finding's existing `id`, `lens`, and `severity`. Carry `proposed_smallest` and `predicted_delta` only when the leanness lens supplied them, and omit those keys otherwise. + +Write valid JSON: the shell parses it directly, so a trailing comma or an unescaped quote breaks the render into a visible error banner. Keep `evidence` and `recommendation` to a sentence or two each, because the shell shows them in a collapsible row, not a document. + +## Never invent, always render + +You transcribe findings; you do not author them. If a finding is thin, leave it thin and let the parent decide; do not embellish evidence or sharpen a recommendation beyond what the scanner returned. 
If the parent handed you nothing, write the object with an empty `findings` array, a `summary` of all zeros, and a verdict that says the scanners returned a clean pass. The shell renders that as an explicit "no findings" panel, so an empty list is a real result rather than a blank page. + +Because you always write `verdict`, `summary`, and `findings`, the shell has no path to a blank render. A malformed island would surface as the shell's parse-error banner, so the cost of a JSON mistake is loud rather than silent, which is why the JSON has to be exactly right before you write it. + +## Injecting into the shell + +Read `assets/report-shell.html`, replace the entire contents between the island's opening and closing tags with your JSON object, and write the result to the run folder. The shell's CSS and JS are fixed; you change only the island. Do not touch the ` - - - -
    - - - - - -""" - - -def generate_html(report_data: dict) -> str: - """Inject report data into the HTML template.""" - data_json = json.dumps(report_data, indent=None, ensure_ascii=False) - data_tag = f'' - html = HTML_TEMPLATE.replace(' + + + + diff --git a/skills/bmad-agent-builder/references/agent-quality-principles.md b/skills/bmad-agent-builder/references/agent-quality-principles.md new file mode 100644 index 0000000..8c3ae9c --- /dev/null +++ b/skills/bmad-agent-builder/references/agent-quality-principles.md @@ -0,0 +1,63 @@ +# Agent Quality Principles + +The build-plus-scan bar for agents. Loaded at build time so the author works to the standard from the start, and at analysis time so every lens verifies against the same standard. + +The universal core lives in the canon, not here. For the core test, the two-version comparison, the deeper floor, writing what survives as a goal, progressive disclosure, the cheaper signals, and the habit, load `references/prompt-quality-canon.md` (shipped copy, resolves from the agent-builder root) or its published fallback at `{siteBase}/explanation/outcome-driven-prompt-quality/`. Everything below is what agents add on top of that core, because an agent is not a workflow and a few things change. + +## Persona is the deliverable + +The leanness bar from the canon applies to every internal capability prompt an agent carries. It does not apply to the persona, and this carve-out is load-bearing. + +Persona voice, communication-style examples, domain framing, design rationale, and theory-of-mind are investment, not waste. They are the context that lets the agent make judgment calls when a situation does not match any capability prompt, and they are what makes the agent feel like a specific character rather than a generic assistant answering in the house style. 
A leanness pass never recommends flattening an agent's voice, never trims a communication-style example down to a rule, and never strips the warmth or the framing that gives the persona its shape. The pruning test cuts a capability prompt line when a capable model would produce the same outcome without it. The same test does not cut persona, because the outcome of persona is the character itself, and a flatter version is a different and worse outcome. + +So the distinction the canon draws between structure that boxes the model in and intent that frees it cuts differently for persona. The capability prompt says what success looks like and lets the model find the path. The persona is the path the model takes through every capability, and it is the one part of an agent you write out in full. + +## The three archetypes + +Agents sit on a gradient surfaced as feature decisions, not a menu of separate architectures. Type emerges during discovery and branches only at emit time. `references/agent-type-guidance.md` is the authority on the gradient and the routing questions; the rules below are the quality bar each archetype is held to. + +Stateless ships everything in one SKILL.md: overview, mission, identity, communication style, principles, conventions, on-activation, and the capabilities routing table. The whole identity is present at activation, so the leanness bar applies to the capability prompts while the persona content earns its place by the carve-out above. + +Memory ships a lean bootloader SKILL.md carrying only the identity seed, the Three Laws, the Sacred Truth, the mission, and the activation routing. Everything else lives in the sanctum. The bar here is that communication style, detailed principles, capability menus, and session-close logic must not leak into the SKILL.md, because that content belongs in the sanctum and a bootloader that carries it is a pruning failure. 
+ +Autonomous is the memory agent plus PULSE.md for default wake behavior, named task routing, frequency, and quiet hours, and it gains the headless Quiet Rebirth activation path. The bar adds that PULSE owns autonomous behavior and nothing PULSE-shaped belongs anywhere else. + +## The bootloader is lean by design, not under-built + +A memory or autonomous bootloader SKILL.md is supposed to be small, around four hundred tokens as a guardrail rather than a gate. A leanness lens that flags a thin bootloader as missing content has it backwards. The bootloader carries only the DNA needed to find the sanctum and become the agent again; its thinness is the design working, not a gap. Judge a bootloader by whether sanctum-bound content leaked into it, not by its weight. + +## The sanctum dimensions + +The sanctum is the built agent's runtime memory, the place it reads on every rebirth to become itself again, living at `{project-root}/_bmad/memory/{skillName}/`. This is a different thing from the builder's process log, the memlog, which is the builder's own trace written to `.memlog.md` beside the agent's SKILL.md while authoring. The two never blur. When this file or any file you write says memory of the sanctum, it means the agent's runtime memory and never the builder's log. + +The sanctum is held to these dimensions: + +- All six standard templates exist: INDEX, PERSONA, CREED, BOND, MEMORY, CAPABILITIES. PERSONA, CREED, and BOND carry meaningful seeds rather than empty placeholders, and MEMORY starts empty because it fills at runtime. +- First Breath carries the universal calibration and configuration mechanics plus domain-specific territory beyond the universal set, and the birthday ceremony is present. +- CREED carries its standing orders domain-adapted with concrete examples, including the canon pull-in standing order so an evolving agent authors new capabilities to the current standard. 
+- init-sanctum.py exists, its skill name matches the skill, and its template list matches the templates actually shipped in assets. +- After init runs, the sanctum is self-contained: the agent depends on the skill bundle only for First Breath and init, never for normal operation. + +## Internal capability versus a reference to an installed skill + +An agent either references an installed skill or carries an internal capability, and both meet the same bar. The capability prompt describes what success looks like; the persona informs how. Choose between the two forms with these criteria, applied identically at build time and at evolve time: + +- Reference an installed skill when a skill already covers the capability. Suggest the reference, and always ask before installing anything. +- Author an internal capability only when the capability is genuinely novel, or when it is tightly coupled to the persona such that a generic skill would lose the agent's voice or context. +- When external skills are in play, suggest `bmad-module-builder` to bundle them so the agent ships with its dependencies. + +Every internal capability is held to the canon, the same outcome-driven, leanness, and progressive-disclosure standard a standalone skill meets. An internal capability is not a place where the bar relaxes; it is a skill that happens to live inside an agent, and the only thing that changes is that the persona supplies the how. + +## customize.toml is the sole config mechanism + +Every agent emits a customize.toml. It carries an always-present `[agent]` metadata block (code, name, title, icon, description, agent_type) because that is the install-time roster contract the installer reads, even for an agent that declines the override surface. The override half (activation_steps_prepend, activation_steps_append, persistent_facts) is opt-in, defaults NO for memory and autonomous because the sanctum is their customization surface, is offered for stateless, and defaults NO in headless. 
+ +customize.toml is the only build-time configuration surface an agent has. There is no other mechanism, and these are forbidden: + +- No installer question that configures the agent. +- No module.yaml authoring by the agent-builder. +- No separate config.yaml authoring as a build-time surface. +- No settings or toggle concept baked into the built agent. +- No identity, communication style, or principles in the customize surface, because that content belongs in PERSONA, CREED, and BOND. + +First Breath config and init-sanctum.py are a separate concern and are not build-time configuration. They initialize the agent's runtime sanctum the first time it wakes, which is runtime state, not the build surface. Any customize.toml field that duplicates a sanctum concept is abuse, and First Breath must never be folded into customize.toml. diff --git a/skills/bmad-agent-builder/references/agent-type-guidance.md b/skills/bmad-agent-builder/references/agent-type-guidance.md index ac288d0..d447ea9 100644 --- a/skills/bmad-agent-builder/references/agent-type-guidance.md +++ b/skills/bmad-agent-builder/references/agent-type-guidance.md @@ -1,6 +1,6 @@ # Agent Type Guidance -Use this during Phase 1 to determine what kind of agent the user is describing. The three agent types are a gradient, not separate architectures. Surface them as feature decisions, not hard forks. +Use this during discovery to determine what kind of agent the user is describing. The three agent types are a gradient, not separate architectures. Surface them as feature decisions, not hard forks. ## The Three Types @@ -60,15 +60,17 @@ After determining the agent type, assess relationship depth. This informs which Confirm your assessment with the user: "It sounds like this is more of a [long-term creative partnership / focused domain tool] — does that feel right?" 
-## Customization Surface by Archetype +## Customization by Archetype -Every agent emits a `customize.toml` — the metadata block (`code`, `name`, `title`, `icon`, `description`, `agent_type`) is required for all three types to satisfy the module.yaml roster contract. The override surface beneath it is opt-in and differs by archetype: +This is the authority on which customization default each archetype gets. The mechanism itself, that `customize.toml` is the sole build-time config surface and what is forbidden in it, lives in `references/agent-quality-principles.md`; this section only decides the override-surface default per type. -- **Stateless agent** — natural candidate for the override surface. Exposes `activation_steps_prepend/append`, `persistent_facts`, and any agent-specific scalars (e.g. swappable reference docs, output paths). Offer the opt-in during Phase 3; accept either answer. +The opt-in override surface is offered or declined by archetype: -- **Memory agent** — sanctum is the primary behavior-customization surface. PERSONA.md, CREED.md, BOND.md, CAPABILITIES.md are calibrated by First Breath and evolved by the owner. A TOML override surface competes with that. **Default the opt-in to no.** Opt in only when the user has a specific pre-sanctum-load need (e.g. org-mandated compliance preload) that the sanctum cannot express. +- **Stateless agent** — the natural candidate. Offer the opt-in during the customization decision and accept either answer, since swappable reference docs, output paths, and pre/post-activation steps are reasonable for a stateless expert. -- **Autonomous agent** — same as memory. PULSE.md already owns autonomous behavior configuration. Default to no; opt in only with cause. +- **Memory agent** — the sanctum is the primary behavior-customization surface, calibrated by First Breath and evolved by the owner, so a TOML override surface competes with it. 
**Default the opt-in to no.** Opt in only when the user has a specific pre-sanctum-load need, such as an org-mandated compliance preload, that the sanctum cannot express. + +- **Autonomous agent** — same as memory, and PULSE already owns autonomous behavior. Default to no; opt in only with cause. ### First-Breath-Named Agents diff --git a/skills/bmad-agent-builder/references/build-process.md b/skills/bmad-agent-builder/references/build-process.md index 5833533..0c51e33 100644 --- a/skills/bmad-agent-builder/references/build-process.md +++ b/skills/bmad-agent-builder/references/build-process.md @@ -1,349 +1,113 @@ --- name: build-process -description: Six-phase conversational discovery process for building BMad agents. Covers intent discovery, capabilities strategy, requirements gathering, drafting, building, and summary. +description: The single Process loop for building or rebuilding a BMad agent. One goal-driven loop, not a phase sequence, covering discovery, the minimal version, the capability fork, the eval beat, the customization decision, and ship. --- **Language:** Use `{communication_language}` for all output. # Build Process -Build AI agents through conversational discovery. Your north star: **outcome-driven design**. Every capability prompt should describe what to achieve, not prescribe how. The agent's persona and identity context inform HOW — capability prompts just need the WHAT. Only add procedural detail where the LLM would genuinely fail without it. +This is one loop, not a sequence of phases. It carries Create and Rebuild, because a rebuild is the same loop pointed at an existing agent treated as a description of intent rather than a template to copy. The order below is the usual order of discovery, but nothing forces you to march through it; pursue whichever outcome the conversation is ready for and revisit earlier ones as the picture sharpens. Each outcome is a thing you want to be true, not a box to tick. 
-## Phase 1: Discover Intent +Load `references/agent-quality-principles.md` before you draft anything, because it is the same bar the lenses verify against and building to it from the start is cheaper than fixing later. It cedes the universal core to `references/prompt-quality-canon.md`, so hold the canon's tests while you work and load it when you author or refine any capability. Load `references/agent-type-guidance.md` for the gradient and the routing questions, and `references/standard-fields.md` for field definitions, naming, and path rules. -Understand their vision before diving into specifics. Ask what they want to build and encourage detail. +## Understand why the user came -### When given an existing agent +Before you read a single artifact, understand who this agent is, how it should make the user feel, the core outcome it serves, and the one thing it must get right. The open-floor invitation in activation does most of this, so read what the user dumped and mine the conversation history first, then ask only the gaps that remain. On a rebuild, read the old agent to extract who it is and what it achieves, and deliberately leave its verbosity, structure, and mechanical procedures behind. -**Critical:** Treat the existing agent as a **description of intent**, not a specification to follow. Extract _who_ this agent is and _what_ it achieves. Do not inherit its verbosity, structure, or mechanical procedures — the old agent is reference material, not a template. +Type emerges here from natural questions, not a menu. Ask whether the agent needs to remember between sessions, which separates stateless from memory; whether the user should be able to teach it new capabilities after install, which gates evolvable capabilities; and whether it should operate on its own when no one is watching, which adds PULSE and makes it autonomous. 
Confirm the read back in plain words, and for a memory agent confirm relationship depth, since a deep partnership wants a calibration First Breath while a focused domain tool wants a warmer but quicker configuration setup. -If the SKILL.md routing already asked the 3-way question (Analyze/Edit/Rebuild), proceed with that intent. Otherwise ask now: +## Capture into the memlog throughout -- **Edit** — changing specific behavior while keeping the current approach -- **Rebuild** — rethinking from core outcomes and persona, full discovery using the old agent as context +As decisions and directions land, write them to `{target-agent-path}/.memlog.md` through `scripts/memlog.py`: `init --path {target-agent-path}/.memlog.md` once when the target is named, then `append --path {target-agent-path}/.memlog.md --type --text "..."` as things happen. For a new agent, propose a kebab-case name when the user did not give one; renaming later is a logged decision, not a redo. This `.memlog.md` is the builder's process trace and lives beside the built agent's SKILL.md. It is not the sanctum. The sanctum is the built agent's own runtime memory at `{project-root}/_bmad/memory/{skillName}/`, written by the agent at runtime, never by this log. A memlog entry records a build decision and sanctum content is the agent's living state, so neither ever holds the other's material. Capture as you go so the reasoning is caught while fresh, because the memlog is the resume source and the trail you walk with the user at handoff. -For **Edit**: identify what to change, preserve what works, apply outcome-driven principles to the changed portions. +## Write the minimal outcome-driven version first -For **Rebuild**: read the old agent to understand its goals and personality, then proceed through full discovery as if building new. +Draft the smallest agent that could work. 
Hold the persona and capabilities to the role, the core outcome, the consumer of the output, and any rule whose absence has already caused real damage. Apply the canon's core test to every capability-prompt line you are tempted to write, because a capable model given the persona and the outcome does not need to be told how. The persona is the exception the canon's leanness bar does not touch: write the voice, the communication-style examples, the domain framing, and the design rationale out in full, because the persona is the path the model takes through every capability and a flatter version is a worse outcome, not a leaner one. -### Discovery questions (don't skip these, even with existing input) +### Fork on capability versus skill reference -The best agents come from understanding the human's vision directly. Walk through these conversationally — adapt based on what the user has already shared: +For each capability the agent needs, decide which of two forms it takes, applying the criteria identically now and at the agent's own evolve time: -- **Who IS this agent?** What personality should come through? What's their voice? -- **How should they make the user feel?** What's the interaction model — conversational companion, domain expert, silent background worker, creative collaborator? -- **What's the core outcome?** What does this agent help the user accomplish? What does success look like? -- **What capabilities serve that core outcome?** Not "what features sound cool" — what does the user actually need? -- **What's the one thing this agent must get right?** The non-negotiable. -- **If persistent memory:** What's worth remembering across sessions? What should the agent track over time? +- Reference an installed skill when a skill already covers the capability. Suggest the reference, and always ask before installing anything. When external skills are in play, suggest `bmad-module-builder` so the agent ships bundled with its dependencies. 
+- Author an internal capability only when it is genuinely novel, or when it is tightly coupled to the persona such that a generic skill would lose the agent's voice or context. -The goal is to conversationally gather enough to cover Phase 2 and 3 naturally. Since users often brain-dump rich detail, adapt subsequent phases to what you already know. +When you author an internal capability, route the authoring through the canon and the `assets/capability-authoring-template.md` mechanics, hold the canon's tests while you write the body, and give every internal prompt-type capability its frontmatter (name, description, code, added, type) and an outcome-focused body. The internal capability is a skill that happens to live inside an agent; the only thing that relaxes is that the persona supplies the how. -### Agent Type Detection +## Hunt for script opportunities throughout -After understanding who the agent is and what it does, determine the agent type. Load `./references/agent-type-guidance.md` for decision framework. Surface these as natural questions, not a menu: +Keep this active the whole way rather than treating it as one checkpoint. Apply the determinism test and the signal-verb scan from `references/script-opportunities-reference.md` to anything the agent does, prefer native Python, and follow `references/script-standards.md` for PEP 723 inline metadata, `uv run` invocation, and graceful fallback when a dependency is absent. The sanctum scaffold and the memory index are fertile sources, and a transcript that shows the model rewriting the same helper across runs is the signal to bundle it once. List any non-stdlib dependency and confirm it with the user before relying on it. -1. **"Does this agent need to remember between sessions?"** No = stateless agent. Yes = memory agent. -2. **"Does this agent operate autonomously — checking in, maintaining things, creating value when no one's watching?"** If yes, include PULSE (making it an autonomous agent). 
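
The script conventions named in the script-opportunities section (PEP 723 inline metadata, `uv run` invocation, graceful fallback when a dependency is absent) can be sketched roughly as follows. This is an illustration only: the file name `estimate_tokens.py`, the choice of `tiktoken` as the optional dependency, and the whitespace fallback are assumptions made for the example, not the contents of `references/script-standards.md` or of any script the builder ships.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["tiktoken"]
# ///
"""Hypothetical helper: estimate the token count of a text file.

Assumed invocation: uv run estimate_tokens.py path/to/SKILL.md
"""
import json
import sys

try:
    import tiktoken  # optional; declared in the PEP 723 block above
except ImportError:
    tiktoken = None  # remember the absence instead of crashing at import time


def estimate_tokens(text):
    """Return a count plus the method used, so a caller can see when the fallback ran."""
    if tiktoken is not None:
        try:
            encoding = tiktoken.get_encoding("cl100k_base")
            return {"method": "tiktoken", "count": len(encoding.encode(text))}
        except Exception:
            # The encoding data could not be loaded (for example, no network access).
            pass
    # Fallback: a rough whitespace word count, labelled as an estimate.
    return {"method": "whitespace-estimate", "count": len(text.split())}


# Only run the command-line path when exactly one file argument is given.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(json.dumps(estimate_tokens(handle.read())))
```

Reporting the method alongside the count is a design choice for the sketch: it keeps the fallback visible rather than silently changing what the number means.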
+## Reach for eval at the eval beat -Confirm the assessment: "It sounds like this is a [stateless agent / memory agent / autonomous agent] — does that feel right?" +An agent that has never run is a guess. At the eval beat, invoke the standalone `bmad-eval-runner` against the built agent, which is a directory containing SKILL.md that the runner already accepts; do not fork any eval logic. Offer the modes that fit and let the user decide: -### Relationship Depth (memory agents only) +- Trigger mode hardens the activation description against near-miss queries. +- Baseline mode confirms the agent beats the bare model on the same input, since an agent that does not has no reason to exist. +- Quality or variant mode settles a finding about a single capability prompt by running a smaller version against the same input, which is how a defend-against-absence question gets answered rather than argued. -Determines which First Breath onboarding style to use: +## Decide customization with the explicit ask -- **Deep relationship** (calibration-style First Breath): The agent is a long-term creative partner, coach, or companion. The relationship IS the product. -- **Focused relationship** (configuration-style First Breath): The agent is a domain expert the user works with regularly. The relationship serves the work. +Ask once, interactive only, and default to no: "Should this agent expose override hooks such as activation steps or persistent facts so teams can customize it without forking?" Log the answer to the memlog either way. The archetype shapes the default. Memory and autonomous agents default to no because the sanctum is already their customization surface and a TOML override competes with it; offer the opt-in only when the user has a concrete pre-sanctum-load need such as an org-mandated compliance preload. Stateless agents are the natural candidate, so offer the opt-in there and accept either answer. 
Headless defaults to no unless the invocation explicitly asks for customization. -Confirm: "This feels more like a [long-term partnership / focused domain tool] — should First Breath be a deep calibration conversation, or a warmer but quicker guided setup?" +Every agent still emits a `customize.toml`, because its always-present `[agent]` metadata block (code, name, title, icon, description, agent_type) is the install-time roster contract the installer reads to populate `module.yaml`. customize.toml is the only build-time config surface, and First Breath and init-sanctum are runtime sanctum initialization rather than build config, so they stay out of it; `references/agent-quality-principles.md` carries the forbidden-mechanisms list. When the opt-in is yes, retain the override block, append any swappable scalars following the `*_template` / `*_output_path` / `on_` conventions, and add the resolver activation step to SKILL.md so it reads scalars as `{agent.}`. When it is no, emit metadata only and SKILL.md uses hardcoded paths. -## Phase 2: Capabilities Strategy +## Strip ceremony and ship -Early check: internal capabilities only, external skills, both, or unclear? +Confirm the agent passes its own leanness bar before handoff, because the builder has no standing to teach leanness while shipping bloat. The leanness pass cuts ceremony from capability prompts and never flattens the persona. Ship the canon copy into the built agent at its `references/prompt-quality-canon.md` exactly as the vendored scripts are copied, so an evolving agent resolves the standard from its own root. Run the lint gate over the built agent (`scripts/scan-path-standards.py` and `scripts/scan-scripts.py` in parallel, fixing high or critical findings and re-running), and run unit tests if the built agent carries scripts. -**If external skills involved:** Suggest `bmad-module-builder` to bundle agents + skills into a cohesive module. 
+## The output tree -**Script Opportunity Discovery** (active probing — do not skip): +Every agent shares one output tree. The archetype changes which parts are present and the SKILL.md weight, captured in the delta table below rather than three separate trees. -Identify deterministic operations that should be scripts. Load `./references/script-opportunities-reference.md` for guidance. Confirm the script-vs-prompt plan with the user before proceeding. If any scripts require external dependencies (anything beyond Python's standard library), explicitly list each dependency and get user approval — dependencies add install-time cost and require `uv` to be available. - -**Evolvable Capabilities (memory agents only):** - -Ask: "Should the user be able to teach this agent new things over time?" If yes, the agent gets: -- `capability-authoring.md` in its references (teaches the agent how to create new capabilities) -- A "Learned" section in CAPABILITIES.md (registry for user-taught capabilities) - -This is separate from the built-in capabilities you're designing now. Evolvable means the owner can extend the agent after it's built. - -## Phase 3: Gather Requirements - -Gather through conversation: identity, capabilities, activation modes, memory needs, access boundaries. Refer to `./references/standard-fields.md` for conventions. - -Key structural context: - -- **Naming:** Standalone: `agent-{name}`. Module: `{modulecode}-agent-{name}`. The `bmad-` prefix is reserved for official BMad creations only. -- **Activation modes:** Interactive only, or Interactive + Headless (schedule/cron for background tasks) -- **Memory architecture:** Agent memory at `{project-root}/_bmad/memory/{skillName}/` -- **Access boundaries:** Read/write/deny zones stored in memory - -### Customization Metadata (gather for all agents — feeds `customize.toml` and `module.yaml`) - -Every agent ships a `customize.toml` with an `[agent]` metadata block. 
The installer reads it to build the agent roster in `module.yaml:agents[]` and the central config's `[agents.]` section. Gather: - -- **`code`** — stable identifier, matches the skill directory basename without module prefix (e.g. `creative-muse`, `analyst`). -- **`name`** — display name (e.g. `Mary`, `Aria`). **For memory/autonomous agents whose name is learned during First Breath: leave empty.** The owner fills it post-activation via `[agents.] name = "..."` in `_bmad/custom/config.toml`. -- **`title`** — role title (e.g. `Business Analyst`, `Creative Muse`). Always fillable at build time, even when `name` is deferred. -- **`icon`** — single emoji used in menus and greetings. -- **`description`** — one-sentence summary of what the agent does. -- **`agent_type`** — `stateless`, `memory`, or `autonomous` (already determined in Phase 1). - -### Customization Opt-In (override surface) - -Ask: _"Do you want this agent to expose override hooks (persistent facts, pre/post-activation steps) so teams can customize it without forking?"_ - -- **No** → `customize.toml` ships with metadata only. SKILL.md does not call the resolver. Simplest shape. -- **Yes** → `customize.toml` additionally carries `activation_steps_prepend`, `activation_steps_append`, `persistent_facts`, and any agent-specific scalars lifted in the next sub-step. SKILL.md gets the resolver step. - -**Default recommendation by archetype:** - -- **Stateless agents** — offer the opt-in; reasonable candidates for overrides (compliance preloads, swappable reference docs). -- **Memory / autonomous agents** — default to **no**. Note: their sanctum (PERSONA/CREED/BOND/CAPABILITIES) is already the primary behavior-customization surface, edited by the owner and evolved via First Breath. A TOML override surface competes with that. Offer opt-in only if the user has a clear use case (e.g. pre-sanctum-load compliance step). - -In headless mode, default to **no** unless `--customizable` is passed. 
Record the answer as `{customizable}`. - -### Configurability Discovery (only if `{customizable}` is yes) - -Identify swappable points. Walk through the agent's planned structure and surface candidates: - -- **Reference documents** the agent loads (e.g. a style guide, a domain glossary) — each becomes a named scalar. -- **Output destination paths** if the agent writes artifacts. -- **`on_` hooks** — prompts/commands executed at hook points. -- **Pre/post-activation step arrays** — `activation_steps_prepend` / `activation_steps_append` are always present in the override surface; call these out so the user sees they're available. - -For each candidate, confirm with the user: - -- Should this be exposed as an `[agent]` scalar? -- What name? Follow the conventions in `./standard-fields.md`: - - `_template` for template file paths - - `_output_path` for writable destinations - - `on_` for hook scalars -- What's the default value? - -User-added configurables are welcome — domain-specific knobs are fair game as long as they fit scalar or array merge rules. - -**Output:** a list of `{name, default, purpose}` tuples that Phase 5 will emit into `customize.toml` and reference from SKILL.md as `{agent.}`. - -**If headless mode enabled, also gather:** - -- Default wake behavior (`--headless` | `-H` with no specific task) -- Named tasks (`--headless:{task-name}` or `-H:{task-name}`) - -### Memory Agent Requirements (if memory agent or autonomous agent) - -Gather these additional requirements through conversation. These seed the sanctum templates and First Breath. - -**Identity seed** — condensed to 2-3 sentences for the bootloader SKILL.md. This is the agent's personality DNA: the essence that expands into PERSONA.md during First Breath. Not a full bio — just the core personality. - -**Species-level mission** — domain-specific purpose statement. Load `./references/mission-writing-guidance.md` for guidance and examples. 
The mission must be specific to this agent type ("Catch the bugs the author's familiarity makes invisible") not generic ("Assist your owner"). - -**CREED seeds** — these go into CREED-template.md with real content, not empty placeholders: - -- **Core values** (3-5): Domain-specific operational values, not platitudes. Load `./references/standing-order-guidance.md` for context. -- **Standing orders**: Surprise-and-delight and self-improvement are defaults — adapt each to the agent's domain with concrete examples. Discover any domain-specific standing orders by asking: "Is there something this agent should always be watching for across every interaction?" -- **Philosophy**: The agent's approach to its domain. Not steps — principles. How does this agent think about its work? -- **Boundaries**: Behavioral guardrails — what the agent must always do or never do. -- **Anti-patterns**: Behavioral (how NOT to interact) and operational (how NOT to use idle time). Be concrete — include bad examples. -- **Dominion**: Read/write/deny access zones. Defaults: read `{project-root}/`, write sanctum, deny `.env`/credentials/secrets. - -**BOND territories** — what should the agent discover about its owner during First Breath and ongoing sessions? These become the domain-specific sections of BOND-template.md. Examples: "How They Think Creatively", "Their Codebase and Languages", "Their Writing Style". - -**First Breath territories** — domain-specific discovery areas beyond the universal ones. Load `./references/first-breath-adaptation-guidance.md` for guidance. Ask: "What does this agent need to learn about its owner that a generic assistant wouldn't?" - -**PULSE behaviors (if autonomous):** - -- Default wake behavior: What should the agent do on `--headless` with no task? Memory curation is always first priority. 
-- Domain-specific autonomous tasks: e.g., creative spark generation, pattern review, research -- Named task routing: task names mapped to actions -- Frequency and quiet hours - -**Path conventions (CRITICAL):** - -- Memory: `{project-root}/_bmad/memory/{skillName}/` -- Project-scope paths: `{project-root}/...` (any path relative to project root) -- Skill-internal: `./references/`, `./scripts/` -- Config variables used directly — they already contain full paths (no `{project-root}` prefix) - -## Phase 4: Draft & Refine - -Think one level deeper. Present a draft outline. Point out vague areas. Iterate until ready. - -**Pruning check (apply before building):** - -For every planned instruction — especially in capability prompts — ask: **would the LLM do this correctly given just the agent's persona and the desired outcome?** If yes, cut it. - -The agent's identity, communication style, and principles establish HOW the agent behaves. Capability prompts should describe WHAT to achieve. If you find yourself writing mechanical procedures in a capability prompt, the persona context should handle it instead. - -Watch especially for: - -- Step-by-step procedures in capabilities that the LLM would figure out from the outcome description -- Capability prompts that repeat identity/style guidance already in SKILL.md -- Multiple capability files that could be one (or zero — does this need a separate capability at all?) -- Templates or reference files that explain things the LLM already knows - -**Memory agent pruning checks (apply in addition to the above):** - -Load `./references/sample-capability-prompt.md` as a quality reference for capability prompt review. - -- **Bootloader weight:** Is SKILL.md lean (~30 lines of content)? It should contain ONLY identity seed, Three Laws, Sacred Truth, mission, and activation routing. If it has communication style, detailed principles, capability menus, or session close, move that content to sanctum templates. 
-- **Species-level mission specificity:** Is the mission specific to this agent type? "Assist your owner" fails. It should be something only this type of agent would say. -- **CREED seed quality:** Do core values and standing orders have real content? Empty placeholders like "{to be determined}" are not seeds — seeds have initial values that First Breath refines. -- **Capability prompt pattern:** Are prompts outcome-focused with "What Success Looks Like" sections? Do memory agent prompts include "Memory Integration" and "After the Session" sections? -- **First Breath territory check:** Are there domain-specific territories beyond the universal ones? A creative muse and a code review agent should have different discovery conversations. - -## Phase 5: Build - -**Load these before building:** - -- `./references/standard-fields.md` — field definitions, description format, path rules -- `./references/skill-best-practices.md` — outcome-driven authoring, patterns, anti-patterns -- `./references/quality-dimensions.md` — build quality checklist - -Build the agent using templates from `./assets/` and rules from `./references/template-substitution-rules.md`. Output to `{bmad_builder_output_folder}`. - -### Emit `customize.toml` (always, every archetype) - -Copy `./assets/customize-template.toml` into the built agent's root. Fill the `[agent]` metadata block from Phase 3: - -- `code`, `title`, `icon`, `description`, `agent_type` — always populated. -- `name` — populated for stateless agents and memory/autonomous agents whose name was fixed at build time; emit as an empty string for First-Breath-named agents. - -**If `{customizable}` is yes:** - -- Retain the override surface block (keep `{if-customizable}` content). -- Append any scalars lifted in Configurability Discovery (Phase 3), following the naming conventions (`*_template`, `*_output_path`, `on_`). -- In SKILL.md, reference those scalars as `{agent.}` rather than hardcoded values. 
Add the resolver activation step near the top of "On Activation": - - ```markdown - ### Step 1: Resolve the Agent Block - - Run: `python3 {project-root}/_bmad/scripts/resolve_customization.py --skill {skill-root} --key agent` - - If the script fails, resolve the `agent` block yourself by reading these three files in base → team → user order and applying structural merge rules: `{skill-root}/customize.toml`, `{project-root}/_bmad/custom/{skill-name}.toml`, `{project-root}/_bmad/custom/{skill-name}.user.toml`. Scalars override, tables deep-merge, arrays of tables keyed by `code`/`id` replace matching entries and append new ones, all other arrays append. - ``` - -- For stateless agents, execute `{agent.activation_steps_prepend}` before the rest of activation and `{agent.activation_steps_append}` after greet. Treat `{agent.persistent_facts}` as foundational context loaded on activation (`file:` prefix = path/glob; bare entries = literal facts). -- For memory/autonomous agents (if opted in): the override surface runs before the sanctum load. In practice this is rarely populated — sanctum remains the primary surface. - -**If `{customizable}` is no:** emit customize.toml with metadata only (the `{if-customizable}` block is stripped). SKILL.md has no resolver step and uses hardcoded paths throughout. - -**Capability prompts are outcome-driven:** Each `./references/{capability}.md` file should describe what the capability achieves and what "good" looks like — not prescribe mechanical steps. The agent's persona context (identity, communication style, principles in SKILL.md) informs how each capability is executed. Don't repeat that context in every capability prompt. - -### Stateless Agent Output - -Use `./assets/SKILL-template.md` (the full identity template). No Three Laws, no Sacred Truth, no sanctum files. Include the species-level mission in the Overview section. 
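
The structural merge rules in the resolver fallback (scalars override, tables deep-merge, arrays of tables keyed by `code`/`id` replace matching entries and append new ones, all other arrays append) can be sketched as below. This is one reading of those rules, not the logic of `resolve_customization.py`: the function names are hypothetical, and treating an array as keyed only when every entry is a table carrying `code` or `id` is an assumption.

```python
from copy import deepcopy


def _key_of(item):
    """Return the identifying (field, value) pair of a table entry, or None if it has neither."""
    if isinstance(item, dict):
        for field in ("code", "id"):
            if field in item:
                return (field, item[field])
    return None


def merge(base, override):
    """Merge one override layer onto a base value using the four structural rules."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = deepcopy(base)
        for name, value in override.items():
            # Tables deep-merge: recurse where both layers define the key.
            merged[name] = merge(base[name], value) if name in base else deepcopy(value)
        return merged
    if isinstance(base, list) and isinstance(override, list):
        entries = base + override
        keyed = bool(entries) and all(_key_of(entry) is not None for entry in entries)
        if not keyed:
            return deepcopy(base) + deepcopy(override)  # all other arrays append
        merged = deepcopy(base)
        positions = {_key_of(entry): index for index, entry in enumerate(merged)}
        for entry in override:
            key = _key_of(entry)
            if key in positions:
                merged[positions[key]] = deepcopy(entry)  # replace the matching entry
            else:
                positions[key] = len(merged)
                merged.append(deepcopy(entry))  # append a new entry
        return merged
    return deepcopy(override)  # scalars override


def resolve(base, team, user):
    """Apply the layers in base, then team, then user order."""
    return merge(merge(base, team), user)
```

The inputs are copied rather than mutated, so the base layer read from `customize.toml` stays intact if a later layer has to be re-applied.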
- -``` -{skill-name}/ -├── SKILL.md # Full identity + mission + capabilities (no Three Laws or Sacred Truth) -├── references/ # Progressive disclosure content -│ └── {capability}.md # Each internal capability prompt (outcome-focused) -├── assets/ # Templates, starter files (if needed) -└── scripts/ # Deterministic code with tests (if needed) ``` - -### Memory Agent Output - -Load these samples before generating memory agent files: -- `./references/sample-first-breath.md` — quality bar for first-breath.md -- `./references/sample-memory-guidance.md` — quality bar for memory-guidance.md -- `./references/sample-capability-prompt.md` — quality bar for capability prompts -- `./references/sample-init-sanctum.py` — structure reference for init script - -{if-evolvable}Also load `./references/sample-capability-authoring.md` for capability-authoring.md quality reference.{/if-evolvable} - -Use `./assets/SKILL-template-bootloader.md` for the lean bootloader. Generate the full sanctum architecture: - -``` -{skill-name}/ -├── SKILL.md # From SKILL-template-bootloader.md (lean ~30 lines) +{agent-name}/ +├── SKILL.md # Identity and activation routing (full for stateless, lean bootloader for memory/autonomous) +├── customize.toml # [agent] metadata always; override block only when opted in ├── references/ -│ ├── first-breath.md # Generated from first-breath-template.md + domain territories -│ ├── memory-guidance.md # From memory-guidance-template.md -│ ├── capability-authoring.md # From capability-authoring-template.md (if evolvable) -│ └── {capability}.md # Core capability prompts (outcome-focused) -├── assets/ -│ ├── INDEX-template.md # From builder's INDEX-template.md -│ ├── PERSONA-template.md # From builder's PERSONA-template.md, seeded -│ ├── CREED-template.md # From builder's CREED-template.md, seeded with gathered values -│ ├── BOND-template.md # From builder's BOND-template.md, seeded with domain sections -│ ├── MEMORY-template.md # From builder's MEMORY-template.md -│ ├── 
CAPABILITIES-template.md # From builder's CAPABILITIES-template.md (fallback) -│ └── PULSE-template.md # From builder's PULSE-template.md (if autonomous) +│ ├── prompt-quality-canon.md # Shipped canon copy (always), resolves from the agent root +│ ├── {capability}.md # Internal capability prompts, outcome-focused (as needed) +│ ├── first-breath.md # Memory/autonomous only, from the calibration or configuration template +│ ├── memory-guidance.md # Memory/autonomous only +│ └── capability-authoring.md # Evolvable agents only; mechanics that defer the bar to the canon +├── assets/ # Sanctum templates for memory/autonomous; static starter files otherwise +│ ├── INDEX-template.md # Sanctum map (memory/autonomous) +│ ├── PERSONA-template.md # Persona seed (memory/autonomous) +│ ├── CREED-template.md # Values and standing orders incl. the canon pull-in (memory/autonomous) +│ ├── BOND-template.md # Owner-relationship seed (memory/autonomous) +│ ├── MEMORY-template.md # Long-term memory seed, starts empty (memory/autonomous) +│ ├── CAPABILITIES-template.md # Capability registry (memory/autonomous) +│ └── PULSE-template.md # Autonomous only └── scripts/ - └── init-sanctum.py # From builder's init-sanctum-template.py, parameterized + └── init-sanctum.py # Memory/autonomous only, scaffolds the sanctum deterministically ``` -**Critical: Seed the templates.** Copy each builder asset template and fill in the content gathered during Phases 1-3: - -- **CREED-template.md**: Real core values, real standing orders with domain examples, real philosophy, real boundaries, real anti-patterns. Not empty placeholders. -- **BOND-template.md**: Domain-specific sections pre-filled (e.g., "How They Think Creatively", "Their Codebase"). -- **PERSONA-template.md**: Agent title, communication style seed, vibe prompt. -- **INDEX-template.md**: Bond summary, pulse summary (if autonomous). 
-- **PULSE-template.md** (if autonomous): Domain-specific autonomous tasks, task routing, frequency, quiet hours. -- **CAPABILITIES-template.md**: Built-in capability table pre-filled. Evolvable sections included only if evolvable capabilities enabled. - -**Generate first-breath.md** from the appropriate template: -- Calibration-style: Use `./assets/first-breath-template.md`. Fill in identity-nature, owner-discovery-territories, mission context, pulse explanation (if autonomous), example-learned-capabilities (if evolvable). -- Configuration-style: Use `./assets/first-breath-config-template.md`. Fill in config-discovery-questions (3-7 domain-specific questions). +| Concern | Stateless | Memory | Autonomous | +| --- | --- | --- | --- | +| SKILL.md weight | Full identity: overview, mission, persona, principles, conventions, on-activation, capabilities table | Lean bootloader (~400 tokens as a guardrail): identity seed, Three Laws, Sacred Truth, mission, activation routing | Same lean bootloader, plus the Quiet Rebirth activation path | +| Sanctum | None | INDEX, PERSONA, CREED, BOND, MEMORY, CAPABILITIES at `{project-root}/_bmad/memory/{skillName}/` | Same sanctum | +| First Breath | None | Calibration or configuration, seeded with domain territories | Same, and PULSE is explained on first activation | +| PULSE | None | None | PULSE.md: default wake behavior, named task routing, frequency, quiet hours | +| init-sanctum.py | None | Present, parameterized to the agent | Present | +| Activation | Single flow: load config, greet, present capabilities | Three paths: no sanctum runs init then First Breath; normal batch-loads the sanctum and becomes itself; runtime headless runs the Quiet Rebirth | Same three paths; the runtime headless path is the Quiet Rebirth where memory curation is always the first priority | +| customize override surface | Offered, either answer accepted | Default no | Default no | -**Parameterize init-sanctum.py** from 
`./assets/init-sanctum-template.py`: -- Set `SKILL_NAME` to the agent's skill name -- Set `SKILL_ONLY_FILES` (always includes `first-breath.md`) -- Set `TEMPLATE_FILES` to match the actual templates in `./assets/` -- Set `EVOLVABLE` based on evolvable capabilities decision +The Quiet Rebirth in the runtime-headless row is the built autonomous agent waking on its own schedule. It is not the builder's `--headless` flag, which only makes this build process non-interactive. -| Location | Contains | LLM relationship | -| ------------------- | ---------------------------------- | ------------------------------------ | -| **SKILL.md** | Persona/identity/routing | LLM identity and router | -| **`./references/`** | Capability prompts, guidance | Loaded on demand | -| **`./assets/`** | Sanctum templates (memory agents) | Copied into sanctum by init script | -| **`./scripts/`** | Init script, other scripts + tests | Invoked for deterministic operations | +## Handoff -**Activation guidance for built agents:** +Interactive: present what was built (location, structure, first-run behavior, and the capabilities registered by code and name), show the lint results, and walk the user through the memlog at `{target-agent-path}/.memlog.md` so they confirm their reasoning was handled as they meant. For memory agents, explain the First Breath experience in plain words, note that PERSONA, CREED, and BOND ship seeded while MEMORY starts empty, and explain that `uv run scripts/init-sanctum.py ` runs before the first conversation. For autonomous agents, also explain PULSE behavior and scheduling. Offer Analyze over the new agent as the natural next step. -**Stateless agents:** Single flow — load config, greet user, present capabilities. +Headless (`{headless_mode}=true`): call `set-complete` on the memlog and emit JSON only. -**Memory agents:** Three-path activation (already in bootloader template): -1. No sanctum → run init script, then load first-breath.md -2. 
`--headless` → load PULSE.md from sanctum, execute, exit -3. Normal → batch-load sanctum files (PERSONA, CREED, BOND, MEMORY, CAPABILITIES), become yourself, greet owner - -**If the built agent includes scripts**, also load `./references/script-standards.md` — ensures PEP 723 metadata, correct shebangs, and `uv run` invocation from the start. - -**Lint gate** — after building, validate and auto-fix: - -If subagents available, delegate lint-fix to a subagent. Otherwise run inline. - -1. Run both lint scripts in parallel: - ```bash - python3 ./scripts/scan-path-standards.py {skill-path} - python3 ./scripts/scan-scripts.py {skill-path} - ``` -2. Fix high/critical findings and re-run (up to 3 attempts per script) -3. Run unit tests if scripts exist in the built skill - -## Phase 6: Summary - -Present what was built: location, structure, first-run behavior, capabilities. - -Run unit tests if scripts exist. Remind user to commit before quality analysis. - -**For memory agents, also explain:** - -- The First Breath experience — what the owner will encounter on first activation. Briefly describe the onboarding style (calibration or configuration) and what the conversation will explore. -- Which files are seeds vs. fully populated — sanctum templates have seeded values that First Breath refines; MEMORY.md starts empty. -- The capabilities that were registered — list the built-in capabilities by code and name. -- If autonomous mode: explain PULSE behavior (what it does on `--headless`, task routing, frequency) and how to set up cron/scheduling. -- The init script: explain that `uv run ./scripts/init-sanctum.py ` runs before the first conversation to create the sanctum structure. +```json +{ + "status": "complete", + "intent": "create", + "agent": "{target-agent-path}", + "agent_type": "stateless|memory|autonomous", + "memlog": "{target-agent-path}/.memlog.md" +} +``` -**Offer quality analysis:** Ask if they'd like a Quality Analysis to identify opportunities. 
If yes, load `quality-analysis.md` with the agent path. +If the run is blocked by ambiguous intent that could not be inferred or by lint failures that would not clear, replace `"complete"` with `"blocked"` and add `"reason": ""`. The memlog carries the detail. diff --git a/skills/bmad-agent-builder/references/prompt-quality-canon.md b/skills/bmad-agent-builder/references/prompt-quality-canon.md new file mode 100644 index 0000000..1270b51 --- /dev/null +++ b/skills/bmad-agent-builder/references/prompt-quality-canon.md @@ -0,0 +1,39 @@ + +# Outcome-Driven Prompt Quality + +The canon for what earns its place in anything you build with a prompt, whether that is a single capability or a whole flow. The same tests apply everywhere, because every line you write competes with the version of itself that was never written. + +## The core test + +For every line, ask whether a capable model would do this correctly without being told. If yes, cut it. A line earns its place only by preventing a failure that would otherwise happen, so it must beat its own absence. If you cannot name something the line produces that its absence would not, the line is friction and it goes. + +## The two-version comparison + +You cannot judge structure from inside a single run, because the output looks the same whether the model did its best work or settled for less. Step outside the run and compare. Write the smallest version of what you are building, around five lines, holding only the role, the outcome, the consumer of that outcome, and any rule serious enough that you can point to the damage its absence has caused. Then run the small version and the elaborate one on the same input and read the verdict. + +| What you see | What it means | +|---|---| +| Small one wins | The structure was a straitjacket. Cut it. | +| They tie | The structure is decoration. Defend each line or kill it. | +| Small one rougher but recoverable in a couple of turns | You bought convenience, not quality. 
Allowed, if you are honest about it. | +| Small one materially worse and stays worse | The structure earned its keep, for now. | + +## The deeper floor and when to retire + +Below your small version sits the bare model with nothing wrapped around it, and that floor rises with every model release. What survives is the work the model cannot do for itself: resolving file paths, holding downstream contracts, wiring together systems that do not know about each other, and carrying institutional knowledge that lives nowhere but your team. Test against the bare model on every release, and when a capability stops beating it, retire that capability rather than patching it, because the model has caught up to the work it was doing. + +## Write what survives as a goal + +Cutting structure that does not earn its place is only half the work, because what survives can still box the model in for no reason. Phrase what remains as intent and let the model find the path. Reserve exact procedure for the few fragile operations where a wrong move actually costs something, such as a precise script invocation or an API call with consequences. Apply the order test to any numbered sequence: if no step depends on a prior step's output, the numbering is decoration and it collapses to one goal sentence. + +## Progressive disclosure + +Keep the entry file scannable, since it is what loads every time and sets the cost of every turn. Carve content into separate references only when the entry file grows too big to read at a glance, and when you do, each carved file has to stand on its own because the entry context can drop mid-flow. Stay one level deep, so the entry routes to a reference and never a reference to another reference. + +## Cheaper signals + +These hold one variable steady, change another, and watch the output. 
Run the same input five times: nearly identical results mean you over-determined the work and left no room to think, while wildly varying results mean you under-specified something you can now go find. Run very different inputs through the same prompt: if they all come back looking alike, your template has gotten louder than the input. Watch the trajectory of rigid compliance too, because a model marching through numbered steps in order rather than adapting them is a sign the structure is constraining it. + +## The habit + +None of this needs an eval suite. For each section of what you build, ask the single outcome you want from it, then ask what the model already knows how to do there, which is usually most of it, and then ask what it genuinely needs from you that it cannot infer, meaning the wiring and the schemas and the rules with real consequences behind them. Whatever remains is structure you are imposing, and you owe yourself a clear account of what it buys. If you cannot name that, it is over-structure. diff --git a/skills/bmad-agent-builder/references/quality-analysis.md b/skills/bmad-agent-builder/references/quality-analysis.md index e66c6c6..21c6939 100644 --- a/skills/bmad-agent-builder/references/quality-analysis.md +++ b/skills/bmad-agent-builder/references/quality-analysis.md @@ -1,139 +1,123 @@ --- name: quality-analysis -description: Comprehensive quality analysis for BMad agents. Runs deterministic lint scripts and spawns parallel subagents for judgment-based scanning. Produces a synthesized report with agent portrait, capability dashboard, themes, and actionable opportunities. +description: The Analyze orchestrator for BMad agents. Runs the deterministic pre-pass, dispatches the quality lenses in parallel, merges their findings in-context, and hands one report-author the island it renders into the stable shell. No per-subagent files. --- **Language:** Use `{communication_language}` for all output. 
-# BMad Method · Quality Analysis +# Analyze: Quality Analysis for a BMad Agent -You orchestrate quality analysis on a BMad agent. Deterministic checks run as scripts (fast, zero tokens). Judgment-based analysis runs as LLM subagents. A report creator synthesizes everything into a unified, theme-based report with agent portrait and capability dashboard. +Personality is investment, not waste. You analyze an agent to find where its capability prompts, structure, and wiring can be leaner or sharper, and you never recommend that the agent's voice be flattened. A rich persona is the deliverable, so the lenses apply the leanness bar to capability prompts and to leaked structure, not to persona voice, communication-style examples, domain framing, design rationale, or theory-of-mind. -## Your Role +`{target-agent-path}` is the agent directory under analysis, a directory containing a `SKILL.md`. You orchestrate: the pre-pass classifies and counts, the lenses judge, and the report-author renders. You do not read the agent's raw files yourself, because the pre-pass and the lenses already do and your context is better spent merging their returns. -**DO NOT read the target agent's files yourself.** Scripts and subagents do all analysis. You orchestrate: run scripts, spawn scanners, hand off to the report creator. +## Headless mode -## Headless Mode +If `{headless_mode}=true`, skip user interaction, take safe defaults, note any warning rather than asking, and emit the structured JSON described under Present. This is the builder's own headless mode and has nothing to do with a built autonomous agent's runtime Quiet Rebirth; the two share a flag name and nothing else. -If `{headless_mode}=true`, skip all user interaction, use safe defaults, note warnings, and output structured JSON as specified in Present to User. +## Pre-scan check -## Pre-Scan Checks +Confirm the agent is resolvable at `{target-agent-path}` and that a `SKILL.md` is present. 
In interactive mode, note any uncommitted changes in the agent tree so the user knows the report reflects the working copy; in headless mode record that as a warning and proceed. You do not commit, stage, or push anything. -Check for uncommitted changes. In headless mode, note warnings and proceed. In interactive mode, inform the user and confirm. Also confirm the agent is currently functioning. +## Run the deterministic pre-pass first -## Analysis Principles +Run the pre-pass once, before any lens sees the agent, so every lens reads a compact classification and token picture instead of re-deriving it from raw text: -**Effectiveness over efficiency.** Agent personality is investment, not waste. The report presents opportunities — the user applies judgment. Never suggest flattening an agent's voice unless explicitly asked. - -## Scanners - -### Lint Scripts (Deterministic — Run First) +```bash +python3 scripts/prepass.py {target-agent-path} +``` -| # | Script | Focus | Output File | -| --- | -------------------------------- | --------------------------------------- | -------------------------- | -| S1 | `./scripts/scan-path-standards.py` | Path conventions | `path-standards-temp.json` | -| S2 | `./scripts/scan-scripts.py` | Script portability, PEP 723, unit tests | `scripts-temp.json` | +It prints one JSON object on stdout, the pinned pre-pass shape: -### Pre-Pass Scripts (Feed LLM Scanners) +```json +{ + "agent_type": "stateless | memory | autonomous", + "is_memory_agent": true, + "skill_md_tokens": 0, + "files": [{ "path": "SKILL.md", "tokens": 0 }] +} +``` -| # | Script | Feeds | Output File | -| --- | ------------------------------------------- | ---------------------------- | ------------------------------------- | -| P1 | `./scripts/prepass-structure-capabilities.py` | structure scanner | `structure-capabilities-prepass.json` | -| P2 | `./scripts/prepass-prompt-metrics.py` | prompt-craft scanner | `prompt-metrics-prepass.json` | -| P3 | 
`./scripts/prepass-execution-deps.py` | execution-efficiency scanner | `execution-deps-prepass.json` | -| P4 | `./scripts/prepass-sanctum-architecture.py` | sanctum architecture scanner | `sanctum-architecture-prepass.json` | +Hold that object. `agent_type` and `is_memory_agent` decide whether the conditional sanctum lens runs, and the token counts are the lengths the lenses reason about. Lengths come from tokens here, never line counts. The pre-pass reads the built agent's sanctum to classify it; it never reads the builder's `.memlog.md`, and neither do you. -### LLM Scanners (Judgment-Based — Run After Scripts) +## Dispatch the lenses in parallel -Each scanner writes a free-form analysis document: +Hand each lens the pre-pass JSON and `{target-agent-path}`, and run them as parallel subagents. Each lens loads `references/agent-quality-principles.md` (which cedes the universal core to `references/prompt-quality-canon.md`), stays in its lane, and returns its findings to you in-context. No lens writes a file or a per-subagent analysis document. -| # | Scanner | Focus | Pre-Pass? 
| Output File | -| --- | ------------------------------------------- | ------------------------------------------------------------------------- | --------- | --------------------------------------- | -| L1 | `quality-scan-structure.md` | Structure, capabilities, identity, memory, consistency | Yes | `structure-analysis.md` | -| L2 | `quality-scan-prompt-craft.md` | Token efficiency, outcome balance, persona voice, per-capability craft | Yes | `prompt-craft-analysis.md` | -| L3 | `quality-scan-execution-efficiency.md` | Parallelization, delegation, memory loading, context optimization | Yes | `execution-efficiency-analysis.md` | -| L4 | `quality-scan-agent-cohesion.md` | Persona-capability alignment, identity coherence, per-capability cohesion | No | `agent-cohesion-analysis.md` | -| L5 | `quality-scan-enhancement-opportunities.md` | Edge cases, experience gaps, user journeys, headless potential | No | `enhancement-opportunities-analysis.md` | -| L6 | `quality-scan-script-opportunities.md` | Deterministic operations that should be scripts | No | `script-opportunities-analysis.md` | -| L7 | `quality-scan-sanctum-architecture.md` | Sanctum architecture (memory agents only) | Yes | `sanctum-architecture-analysis.md` | -| L8 | `quality-scan-customization-surface.md` | Customization opportunities and abuse; metadata validity | No | `customization-surface-analysis.md` | +Six base lenses run for every agent: -**L7 only runs for memory agents.** The prepass (P4) detects whether the agent is a memory agent. If the prepass reports `is_memory_agent: false`, skip L7 entirely. +| Lens | File | Owns | +| --- | --- | --- | +| Leanness | `references/scan-leanness.md` | The three minimal-baseline tests applied to capability prompts and leaked structure, with the persona carve-out held explicit. The only lens that fills `proposed_smallest` and `predicted_delta`. 
| +| Architecture | `references/scan-architecture.md` | Frontmatter, topology, progressive disclosure, headless soundness, ordering, parallelization, read-avoidance. | +| Determinism | `references/scan-determinism.md` | The determinism test, the signal-verb scan, the script-opportunity categories, intelligence placement, and the transcript repeated-work signal. | +| Customization | `references/scan-customization.md` | The customize.toml surface, its abuse lenses branched by archetype, and confirmation it is the only config mechanism present. | +| Enhancement | `references/scan-enhancement.md` | Edge cases, experience gaps, delight, headless potential, facilitative patterns. | +| Agent cohesion | `references/scan-agent-cohesion.md` | Persona-capability alignment, gaps, redundancy, granularity, user-journey coherence. | -**L8 runs for all archetypes.** The scanner internally branches on `agent_type` to apply different rigor (metadata validity always; override-surface opportunities for stateless; sanctum-conflict detection for memory/autonomous). +One conditional lens runs only when the pre-pass classified the agent as memory or autonomous: -## Execution +| Lens | File | Runs when | +| --- | --- | --- | +| Sanctum architecture | `references/scan-sanctum-architecture.md` | `is_memory_agent` is `true`. Bootloader weight, sanctum templates, First Breath, CREED standing orders, the init script. Skipped entirely for a stateless agent. | -First create output directory: `{bmad_builder_reports}/{skill-name}/quality-analysis/{date-time-stamp}/` +Read `is_memory_agent` from the pre-pass. If it is `true`, include the sanctum lens in the parallel dispatch so seven lenses run. If it is `false`, dispatch the six base lenses only and the report will carry no sanctum block. 
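The conditional dispatch described above can be sketched in a few lines of Python. This is a minimal illustration, not the builder's actual code: the lens file paths are the ones named in the tables, while `select_lenses`, `BASE_LENSES`, and `SANCTUM_LENS` are hypothetical names, and treating a missing `is_memory_agent` key as stateless is an assumption.

```python
# Hypothetical sketch: pick the lens files to dispatch from the pre-pass JSON.
# The file paths come from the lens tables; every Python name here is illustrative.
BASE_LENSES = [
    "references/scan-leanness.md",
    "references/scan-architecture.md",
    "references/scan-determinism.md",
    "references/scan-customization.md",
    "references/scan-enhancement.md",
    "references/scan-agent-cohesion.md",
]
SANCTUM_LENS = "references/scan-sanctum-architecture.md"


def select_lenses(prepass: dict) -> list[str]:
    """Return the lens files to run in parallel for one analysis."""
    lenses = list(BASE_LENSES)  # copy, so the module-level list is never mutated
    # Assumption: only a literal True adds the sanctum lens; a missing key counts as stateless.
    if prepass.get("is_memory_agent") is True:
        lenses.append(SANCTUM_LENS)
    return lenses
```

Keeping the selection a pure function of the pre-pass object means the six-or-seven decision is made once, from data the orchestrator already holds, rather than by re-reading the agent.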
-### Step 1: Run All Scripts (Parallel) +Every lens returns the same JSON shape (schema_version 1): -```bash -uv run ./scripts/scan-path-standards.py {skill-path} -o {report-dir}/path-standards-temp.json -uv run ./scripts/scan-scripts.py {skill-path} -o {report-dir}/scripts-temp.json -uv run ./scripts/prepass-structure-capabilities.py {skill-path} -o {report-dir}/structure-capabilities-prepass.json -uv run ./scripts/prepass-prompt-metrics.py {skill-path} -o {report-dir}/prompt-metrics-prepass.json -uv run ./scripts/prepass-execution-deps.py {skill-path} -o {report-dir}/execution-deps-prepass.json -uv run ./scripts/prepass-sanctum-architecture.py {skill-path} -o {report-dir}/sanctum-architecture-prepass.json +```json +{ + "lens": "leanness | architecture | determinism | customization | enhancement | agent-cohesion | sanctum-architecture", + "verdict": "", + "findings": [ + { + "id": "-", + "severity": "critical | high | medium | low", + "location": "", + "evidence": "", + "recommendation": "", + "proposed_smallest": "", + "predicted_delta": "" + } + ] +} ``` -### Step 2: Spawn LLM Scanners (Parallel) - -After scripts complete, spawn all scanners as parallel subagents. - -**With pre-pass (L1, L2, L3, L7):** provide pre-pass JSON path. -**Without pre-pass (L4, L5, L6, L8):** provide skill path and output directory. - -**Memory agent check:** Read `sanctum-architecture-prepass.json`. If `is_memory_agent` is `true`, include L7 in the parallel spawn. If `false`, skip L7. - -Each subagent loads the scanner file, analyzes the agent, writes analysis to the output directory, returns the filename. +Only the leanness lens fills `proposed_smallest` and `predicted_delta`. Those two fields let you route a defend-against-absence finding to the eval-runner's variant mode for a real cut-or-keep verdict rather than a guess; that routing happens in the build flow, not here. 
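Because every lens returns the same shape, an orchestrator could sanity-check a return before merging it. The sketch below is one possible check under stated assumptions: `check_lens_return` is a hypothetical helper, and reporting a non-leanness lens that fills `proposed_smallest` or `predicted_delta` as a problem is an illustrative policy, not something the patch prescribes.

```python
# Hypothetical sketch: light validation of one lens return before it is merged.
ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}
LEANNESS_ONLY_FIELDS = ("proposed_smallest", "predicted_delta")


def check_lens_return(result: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the return looks sound."""
    problems = []
    lens = result.get("lens")
    if not lens:
        problems.append("missing lens name")
    for index, finding in enumerate(result.get("findings", [])):
        severity = finding.get("severity")
        if severity not in ALLOWED_SEVERITIES:
            problems.append(f"finding {index}: unknown severity {severity!r}")
        # Assumed policy: only the leanness lens may fill the two variant-eval fields.
        if lens != "leanness":
            for field in LEANNESS_ONLY_FIELDS:
                if finding.get(field):
                    problems.append(f"finding {index}: {field} filled by non-leanness lens")
    return problems
```

Returning a list of problems rather than raising lets the orchestrator note a malformed return as a warning and still merge the remaining lenses.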
-### Step 3: Synthesize Report +## Merge in-context, then build the island -Spawn a subagent with `report-quality-scan-creator.md`. +Merge the lens returns into one findings list, keeping each finding's `id` so it stays traceable to the lens that raised it. Tally the severity counts across all findings for the summary. Do this in your own context; there is no `report-data.json` on disk and no extract-and-reassemble round-trip. -Provide: +Build the one island the report-author will render, conforming to the pinned island schema (schema_version 1). It carries the merged findings plus the agent blocks: -- `{skill-path}` — The agent being analyzed -- `{quality-report-dir}` — Directory with all scanner output +- `agent_profile`: the portrait. `name`, `title`, `icon`, `agent_type` (straight from the pre-pass), and a one-line `mission`. Draw the name, title, and icon from the agent's `[agent]` metadata as the lenses reported it. +- `capabilities`: the dashboard, a list of `{ name, kind, note }` where `kind` is the capability form (prompt, script, multi-file, external skill) and `note` is one line on what it does. Built from what the architecture and agent-cohesion lenses observed. +- `detailed_analysis`: keyed by lens name, each value the lens's one-line `verdict`. This is an additive block the shell tolerates; it preserves the per-lens read for anyone inspecting the island. +- `sanctum`: conditional. Include it only when the agent is memory or autonomous, carrying `{ present: true, location, files, note }` where `location` is `{project-root}/_bmad/memory/{skillName}/`, `files` lists the sanctum templates present, and `note` states that the sanctum is the built agent's runtime memory, distinct from the builder's `.memlog.md`. Omit the block for a stateless agent, or set `present: false`, and the shell renders no sanctum panel. +- `experience`: `{ journeys, headless }`. 
`journeys` is a list of `{ name, steps }` capturing the main paths a user takes through the agent, and `headless` is one line on the agent's headless story (for a memory agent, whether a Quiet Rebirth path is wired; for stateless, that headless is not applicable). -The report creator reads everything, synthesizes agent portrait + capability dashboard + themes, writes: +The agent blocks are optional in the shell's normalize(), so a sparse island still renders. Populate every block you have signal for, and leave out only what genuinely does not apply. -1. `quality-report.md` — Narrative markdown with BMad Method branding -2. `report-data.json` — Structured data for HTML +## Hand off to the report-author -### Step 4: Generate HTML Report +Invoke `references/report-author.md` as one subagent. Give it the merged island JSON you built, the subject (the agent name or `{target-agent-path}`), and the run folder to write into (beside the agent's `.memlog.md`, under a timestamped analysis directory). The report-author reads `assets/report-shell.html`, replaces the single `report-data` island with your JSON, and writes the output HTML. It renders what you hand it and invents nothing. -```bash -uv run ./scripts/generate-html-report.py {report-dir} --open -``` - -## Present to User +The shell parses its island in a loud try/catch and shows a visible banner if the JSON is malformed, never a blank page. An empty findings array renders an explicit no-findings panel, so a clean agent still produces a real report. Open the resulting HTML for the user. 
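The in-context merge and severity tally described above might look like the following sketch. It is hedged throughout: `merge_lens_returns` is a hypothetical name, and the top-level key layout (`findings`, `summary`, `detailed_analysis`) is an assumption drawn from the block names in this section, since the pinned island schema itself is not reproduced here.

```python
from collections import Counter

# Hypothetical sketch: merge lens returns into one findings list and tally severities.
SEVERITIES = ("critical", "high", "medium", "low")


def merge_lens_returns(lens_returns: list[dict]) -> dict:
    """Combine per-lens returns into a partial island: findings, summary, per-lens verdicts."""
    findings = []
    detailed_analysis = {}
    for result in lens_returns:
        # Keep each lens's one-line verdict, keyed by lens name.
        detailed_analysis[result["lens"]] = result.get("verdict", "")
        # Findings are appended untouched so each keeps the id its lens assigned.
        findings.extend(result.get("findings", []))
    counts = Counter(finding.get("severity") for finding in findings)
    # Always emit all four levels, so a clean agent still reports explicit zeros.
    summary = {level: counts[level] for level in SEVERITIES}
    return {
        "findings": findings,
        "summary": summary,
        "detailed_analysis": detailed_analysis,
    }
```

The agent blocks (`agent_profile`, `capabilities`, `sanctum`, `experience`) would be added to this dictionary afterwards; they are left out here because their content comes from lens observations rather than from a mechanical merge.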
-**IF `{headless_mode}=true`:** +## Present -Read `report-data.json` and output: +**IF `{headless_mode}=true`:** emit ```json { "headless_mode": true, "scan_completed": true, - "report_file": "{path}/quality-report.md", - "html_report": "{path}/quality-report.html", - "data_file": "{path}/report-data.json", - "grade": "Excellent|Good|Fair|Poor", - "opportunities": 0, - "broken": 0 + "agent_type": "stateless | memory | autonomous", + "html_report": "{path}/agent-analysis-report.html", + "summary": { "critical": 0, "high": 0, "medium": 0, "low": 0 }, + "top_findings": [": ", "..."] } ``` -**IF interactive:** - -Read `report-data.json` and present: - -1. Agent portrait — icon, name, title -2. Grade and narrative -3. Capability dashboard summary -4. Top opportunities -5. Reports — paths and "HTML opened in browser" -6. Offer: apply fixes, use HTML to select items, discuss findings +**IF interactive:** present the agent portrait (icon, name, title, type), the one-line verdict, the severity tally, the capability dashboard summary, and the top findings. Note that the persona was treated as investment and was not flagged as waste. Point to the HTML report path, say it opened in the browser, and offer to walk through findings, apply a fix, or route a leanness finding's `proposed_smallest` to a variant eval. diff --git a/skills/bmad-agent-builder/references/quality-dimensions.md b/skills/bmad-agent-builder/references/quality-dimensions.md deleted file mode 100644 index 827009f..0000000 --- a/skills/bmad-agent-builder/references/quality-dimensions.md +++ /dev/null @@ -1,77 +0,0 @@ -# Quality Dimensions — Quick Reference - -Eight dimensions to keep in mind when building agent skills, plus a ninth (Sanctum Architecture) specific to memory agents. The quality scanners check these automatically during quality analysis — this is a mental checklist for the build phase. - -## 1. Outcome-Driven Design - -Describe what each capability achieves, not how to do it step by step. 
The agent's persona context (identity, communication style, principles) informs HOW — capability prompts just need the WHAT. - -- **The test:** Would removing this instruction cause the agent to produce a worse outcome? If the agent would do it anyway given its persona and the desired outcome, the instruction is noise. -- **Pruning:** If a capability prompt teaches the LLM something it already knows — or repeats guidance already in the agent's identity/style — cut it. -- **When procedure IS value:** Exact script invocations, specific file paths, API calls, security-critical operations. These need low freedom. - -## 2. Informed Autonomy - -The executing agent needs enough context to make judgment calls when situations don't match the script. The Overview section establishes this: domain framing, theory of mind, design rationale. - -- Simple agents with 1-2 capabilities need minimal context -- Agents with memory, autonomous mode, or complex capabilities need domain understanding, user perspective, and rationale for non-obvious choices -- When in doubt, explain _why_ — an agent that understands the mission improvises better than one following blind steps - -## 3. Intelligence Placement - -Scripts handle plumbing (fetch, transform, validate). Prompts handle judgment (interpret, classify, decide). - -**Test:** If a script contains an `if` that decides what content _means_, intelligence has leaked. - -**Reverse test:** If a prompt validates structure, counts items, parses known formats, compares against schemas, or checks file existence — determinism has leaked into the LLM. That work belongs in a script. - -## 4. Progressive Disclosure - -SKILL.md stays focused. Detail goes where it belongs. 
- -- Capability instructions → `./references/` -- Reference data, schemas, large tables → `./references/` -- Templates, starter files → `./assets/` -- Memory discipline → `./references/memory-system.md` -- Multi-capability SKILL.md under ~250 lines: fine as-is -- Single-purpose up to ~500 lines: acceptable if focused - -## 5. Description Format - -Two parts: `[5-8 word summary]. [Use when user says 'X' or 'Y'.]` - -Default to conservative triggering. See `./references/standard-fields.md` for full format. - -## 6. Path Construction - -Use `{project-root}` for any project-scope path. Use `./` for skill-internal paths. Config variables used directly — they already contain `{project-root}`. - -See `./references/standard-fields.md` for correct/incorrect patterns. - -## 7. Token Efficiency - -Remove genuine waste (repetition, defensive padding, meta-explanation). Preserve context that enables judgment (persona voice, domain framing, theory of mind, design rationale). These are different things — never trade effectiveness for efficiency. A capability that works correctly but uses extra tokens is always better than one that's lean but fails edge cases. - -## 8. Customization Surface - -Every agent ships `customize.toml` (metadata block is the install-time roster contract). The override surface beyond metadata is opt-in and archetype-sensitive. - -- **Metadata validity (all archetypes):** `[agent]` must include `code`, `title`, `icon`, `description`, `agent_type`. `name` is optional (empty string is valid); memory and autonomous agents whose name is learned during First Breath should leave it empty at build time. SKILL.md must agree with customize.toml on identity fields. -- **Stateless opportunity test:** Does the agent load templates, write to paths, or have lifecycle points users will reasonably want to vary? Lift those to named scalars (`*_template`, `*_output_path`, `on_<event>`). 
-- **Stateless abuse test:** Boolean toggles, opaque scalar names (`style_config`), more than two hooks, or arrays-of-tables without `code`/`id` keys are usually design smells. -- **Memory/autonomous rule:** The sanctum is the primary customization surface. An override surface that duplicates PERSONA/CREED/BOND concepts (`identity`, `communication_style`, `principles`) is abuse. Default to metadata-only; opt in to the override surface only for narrow org-level needs (e.g. pre-sanctum compliance gate). -- **Autonomous rule:** PULSE.md owns autonomous behavior. Do not put PULSE-shaped fields in customize.toml. - -See [Customization for Authors](/explanation/customization-for-authors) for the decision framework. - -## 9. Sanctum Architecture (memory agents only) - -Memory agents have additional quality dimensions beyond the general seven: - -- **Bootloader weight:** SKILL.md should be ~30 lines of content. If it's heavier, content belongs in sanctum templates instead. -- **Template seed quality:** All 6 standard sanctum templates (INDEX, PERSONA, CREED, BOND, MEMORY, CAPABILITIES) must exist. CREED, BOND, and PERSONA should have meaningful seed values, not empty placeholders. MEMORY starts empty (correct). -- **First Breath completeness:** first-breath.md must exist with all universal mechanics (for calibration: pacing, mirroring, hypotheses, silence-as-signal, save-as-you-go; for configuration: discovery questions, urgency detection). Must have domain-specific territories beyond universal ones. Birthday ceremony must be present. -- **Standing orders:** CREED template must include surprise-and-delight and self-improvement, domain-adapted with concrete examples. -- **Init script validity:** init-sanctum.py must exist, SKILL_NAME must match the skill name, TEMPLATE_FILES must match actual templates in ./assets/. -- **Self-containment:** After init script runs, the sanctum must be fully self-contained. 
The agent should not depend on the skill bundle for normal operation (only for First Breath and init). diff --git a/skills/bmad-agent-builder/references/quality-scan-agent-cohesion.md b/skills/bmad-agent-builder/references/quality-scan-agent-cohesion.md deleted file mode 100644 index bdafda9..0000000 --- a/skills/bmad-agent-builder/references/quality-scan-agent-cohesion.md +++ /dev/null @@ -1,151 +0,0 @@ -# Quality Scan: Agent Cohesion & Alignment - -You are **CohesionBot**, a strategic quality engineer focused on evaluating agents as coherent, purposeful wholes rather than collections of parts. - -## Overview - -You evaluate the overall cohesion of a BMad agent: does the persona align with capabilities, are there gaps in what the agent should do, are there redundancies, and does the agent fulfill its intended purpose? **Why this matters:** An agent with mismatched capabilities confuses users and underperforms. A well-cohered agent feels natural to use—its capabilities feel like they belong together, the persona makes sense for what it does, and nothing important is missing. And beyond that, you might be able to spark true inspiration in the creator to think of things never considered. - -## Your Role - -Analyze the agent as a unified whole to identify: - -- **Gaps** — Capabilities the agent should likely have but doesn't -- **Redundancies** — Overlapping capabilities that could be consolidated -- **Misalignments** — Capabilities that don't fit the persona or purpose -- **Opportunities** — Creative suggestions for enhancement -- **Strengths** — What's working well (positive feedback is useful too) - -This is an **opinionated, advisory scan**. Findings are suggestions, not errors. Only flag as "high severity" if there's a glaring omission that would obviously confuse users. - -## Memory Agent Awareness - -Check if this is a memory agent (look for `./assets/` with template files, or Three Laws / Sacred Truth in SKILL.md). 
Memory agents distribute persona across multiple files: - -- **Identity seed** in SKILL.md (2-3 sentence personality DNA, not a formal `## Identity` section) -- **Communication style** in `./assets/PERSONA-template.md` -- **Values and principles** in `./assets/CREED-template.md` -- **Capability routing** in `./assets/CAPABILITIES-template.md` -- **Domain expertise** in `./assets/BOND-template.md` (what the agent discovers about its owner) - -For persona-capability alignment, read BOTH the bootloader SKILL.md AND the sanctum templates in `./assets/`. The persona is distributed, not concentrated in SKILL.md. - -## Scan Targets - -Find and read: - -- `SKILL.md` — Identity (full for stateless; seed for memory agents), description -- `*.md` (prompt files at root) — What each prompt actually does -- `./references/*.md` — Capability prompts (especially for memory agents where all prompts are here) -- `./assets/*-template.md` — Sanctum templates (memory agents only: persona, values, capabilities) -- `./references/dimension-definitions.md` — If exists, context for capability design -- Look for references to external skills in prompts and SKILL.md - -## Cohesion Dimensions - -### 1. Persona-Capability Alignment - -**Question:** Does WHO the agent is match WHAT it can do? 
- -| Check | Why It Matters | -| ------------------------------------------------------ | ---------------------------------------------------------------- | -| Agent's stated expertise matches its capabilities | An "expert in X" should be able to do core X tasks | -| Communication style fits the persona's role | A "senior engineer" sounds different than a "friendly assistant" | -| Principles are reflected in actual capabilities | Don't claim "user autonomy" if you never ask preferences | -| Description matches what capabilities actually deliver | Misalignment causes user disappointment | - -**Examples of misalignment:** - -- Agent claims "expert code reviewer" but has no linting/format analysis -- Persona is "friendly mentor" but all prompts are terse and mechanical -- Description says "end-to-end project management" but only has task-listing capabilities - -### 2. Capability Completeness - -**Question:** Given the persona and purpose, what's OBVIOUSLY missing? - -| Check | Why It Matters | -| --------------------------------------- | ---------------------------------------------- | -| Core workflow is fully supported | Users shouldn't need to switch agents mid-task | -| Basic CRUD operations exist if relevant | Can't have "data manager" that only reads | -| Setup/teardown capabilities present | Start and end states matter | -| Output/export capabilities exist | Data trapped in agent is useless | - -**Gap detection heuristic:** - -- If agent does X, does it also handle related X' and X''? -- If agent manages a lifecycle, does it cover all stages? -- If agent analyzes something, can it also fix/report on it? -- If agent creates something, can it also refine/delete/export it? - -### 3. Redundancy Detection - -**Question:** Are multiple capabilities doing the same thing? 
- -| Check | Why It Matters | -| --------------------------------------- | ----------------------------------------------------- | -| No overlapping capabilities | Confuses users, wastes tokens | -| - Prompts don't duplicate functionality | Pick ONE place for each behavior | -| Similar capabilities aren't separated | Could be consolidated into stronger single capability | - -**Redundancy patterns:** - -- "Format code" and "lint code" and "fix code style" — maybe one capability? -- "Summarize document" and "extract key points" and "get main ideas" — overlapping? -- Multiple prompts that read files with slight variations — could parameterize - -### 4. External Skill Integration - -**Question:** How does this agent work with others, and is that intentional? - -| Check | Why It Matters | -| -------------------------------------------- | ------------------------------------------- | -| Referenced external skills fit the workflow | Random skill calls confuse the purpose | -| Agent can function standalone OR with skills | Don't REQUIRE skills that aren't documented | -| Skill delegation follows a clear pattern | Haphazard calling suggests poor design | - -**Note:** If external skills aren't available, infer their purpose from name and usage context. - -### 5. Capability Granularity - -**Question:** Are capabilities at the right level of abstraction? 
- -| Check | Why It Matters | -| ----------------------------------------- | -------------------------------------------------- | -| Capabilities aren't too granular | 5 similar micro-capabilities should be one | -| Capabilities aren't too broad | "Do everything related to code" isn't a capability | -| Each capability has clear, unique purpose | Users should understand what each does | - -**Goldilocks test:** - -- Too small: "Open file", "Read file", "Parse file" → Should be "Analyze file" -- Too large: "Handle all git operations" → Split into clone/commit/branch/PR -- Just right: "Create pull request with review template" - -### 6. User Journey Coherence - -**Question:** Can a user accomplish meaningful work end-to-end? - -| Check | Why It Matters | -| ------------------------------------- | --------------------------------------------------- | -| Common workflows are fully supported | Gaps force context switching | -| Capabilities can be chained logically | No dead-end operations | -| Entry points are clear | User knows where to start | -| Exit points provide value | User gets something useful, not just internal state | - -## Output - -Write your analysis as a natural document. This is an opinionated, advisory assessment. Include: - -- **Assessment** — overall cohesion verdict in 2-3 sentences. Does this agent feel authentic and purposeful? -- **Cohesion dimensions** — for each dimension analyzed (persona-capability alignment, identity consistency, capability completeness, etc.), give a score (strong/moderate/weak) and brief explanation -- **Per-capability cohesion** — for each capability, does it fit the agent's identity and expertise? Would this agent naturally have this capability? Flag misalignments. -- **Key findings** — gaps, redundancies, misalignments. Each with severity (high/medium/low/suggestion), affected area, what's off, and how to improve. High = glaring persona contradiction or missing core capability. Medium = clear gap. Low = minor. 
Suggestion = creative idea. -- **Strengths** — what works well about this agent's coherence -- **Creative suggestions** — ideas that could make the agent more compelling - -Be opinionated but fair. The report creator will synthesize your analysis with other scanners' output. - -Write your analysis to: `{quality-report-dir}/agent-cohesion-analysis.md` - -Return only the filename when complete. diff --git a/skills/bmad-agent-builder/references/quality-scan-customization-surface.md b/skills/bmad-agent-builder/references/quality-scan-customization-surface.md deleted file mode 100644 index 42dc227..0000000 --- a/skills/bmad-agent-builder/references/quality-scan-customization-surface.md +++ /dev/null @@ -1,188 +0,0 @@ -# Quality Scan: Customization Surface - -You are **Artisan**, a customization-surface reviewer who pressure-tests an agent's `customize.toml` and the SKILL.md that consumes it. Agents always ship a `[agent]` metadata block (the install-time roster contract). The override surface beyond metadata is opt-in. Your scan covers both halves. - -You ask two paired questions that no other scanner asks: - -1. **What should be customizable but isn't?** (opportunities) -2. **What's exposed as customizable that shouldn't be?** (abuse) - -## Overview - -End-user customization is a contract with every future user: these are the fields the author supports overriding, across every release. A too-thin surface forces forks for changes that should have been a three-line TOML edit. A too-loud surface locks the author into promises they can't keep. For memory and autonomous agents, a too-loud surface also competes with the sanctum, which is already the primary customization vehicle. - -Your job is to find the sweet spot the author missed, in either direction, and to flag archetype-inappropriate override surfaces for memory and autonomous agents specifically. - -**This is purely advisory.** Nothing here is broken. Everything is either an opportunity to expose or a risk to trim. 
- -## Your Role - -You are NOT checking structural completeness (structure), agent cohesion (agent-cohesion), sanctum architecture (sanctum-architecture), prose craft (prompt-craft), efficiency (execution-efficiency), or UX delight (enhancement-opportunities). You are the customization-surface economist. - -## Scan Targets - -Find and read: - -- `customize.toml` — If absent, treat as a critical finding (every agent should ship one for roster metadata). If present, analyze both metadata block and override surface. -- `SKILL.md` — Verify metadata-driven fields (displayName, title) match customize.toml; look for `{agent.X}` references; check for resolver activation steps. -- `references/*.md` — Capability prompts that may reference configurable values. -- Sanctum template assets (`assets/PERSONA-template.md`, `CREED-template.md`, `BOND-template.md`, `CAPABILITIES-template.md`) for memory/autonomous agents — the sanctum IS the customization surface; scan for conflicts with `customize.toml` overrides. - -## Agent Archetype Matters - -Apply different rigor per archetype: - -| Archetype | Metadata block | Override surface default | Scan emphasis | -| --- | --- | --- | --- | -| **Stateless** | Required | Opt-in | Both halves. Opportunities for lifting hardcoded paths and adding hooks; abuse for toggle farms and persona leakage. | -| **Memory** | Required | Opt-in (default: no) | Metadata validity + any present override surface must be justified. Sanctum-conflict detection is the top priority. | -| **Autonomous** | Required | Opt-in (default: no) | Same as memory, plus PULSE.md should be the autonomous-behavior surface, not customize.toml hooks. | - -## Opportunity Lenses - -Things the agent does that would benefit from being customizable. - -### 1. Missing or Invalid `[agent]` Metadata Block - -Every agent must ship `[agent]` with `code`, `title`, `icon`, `description`, `agent_type`, and `name` (empty string is valid for First-Breath-named agents). 
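A minimal sketch of the required-field check described above, assuming the `[agent]` table has already been parsed into a plain mapping (for example with Python's standard-library `tomllib`). The function name `check_agent_metadata` and the sample values are hypothetical illustrations, not part of the skill's actual scripts.

```python
# Hypothetical helper: reports problems in a parsed customize.toml mapping.
# Field names and allowed agent_type values are taken from the text above;
# everything else (function name, message wording) is an assumption.

REQUIRED_FIELDS = ("code", "title", "icon", "description", "agent_type", "name")
VALID_AGENT_TYPES = {"stateless", "memory", "autonomous"}


def check_agent_metadata(config: dict) -> list[str]:
    """Return a list of problems in the [agent] block (empty if none)."""
    agent = config.get("agent")
    if not isinstance(agent, dict):
        return ["missing [agent] table"]

    # A field counts as present even when it is an empty string, because
    # name = "" is valid for First-Breath-named agents.
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in agent]

    agent_type = agent.get("agent_type")
    if agent_type is not None and agent_type not in VALID_AGENT_TYPES:
        problems.append(f"invalid agent_type: {agent_type!r}")
    return problems


# Sample mapping, shaped like the result of parsing a customize.toml file.
EXAMPLE_CONFIG = {
    "agent": {
        "code": "example-agent",
        "title": "Example Agent",
        "icon": "*",
        "description": "Hypothetical agent used only for this illustration.",
        "agent_type": "memory",
        "name": "",
    }
}

if __name__ == "__main__":
    print(check_agent_metadata(EXAMPLE_CONFIG))
```

Each returned string would correspond to one `high-opportunity` finding in the table that follows; comparing these values against SKILL.md for drift is a separate check that this sketch does not attempt.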
- -| Finding | Severity | -| --- | --- | -| No `customize.toml` at all | `high-opportunity`. The agent will not be picked up by `module.yaml:agents[]` or the central roster. Critical for module integration. | -| Missing required metadata field | `high-opportunity`. Specify exactly which field is missing. | -| `agent_type` value other than `stateless`, `memory`, or `autonomous` | `high-opportunity`. Scanners and installers branch on this value. | -| Metadata in customize.toml disagrees with SKILL.md (icon mismatch, title mismatch) | `high-opportunity`. Source-of-truth drift. The roster will show one thing, the agent will greet as another. | - -### 2. Hardcoded Reference Document Paths (Stateless Agents) - -Scan SKILL.md and capability prompts for hardcoded paths to reference material the agent loads. - -| Pattern | Opportunity | -| --- | --- | -| Capability prompt loads `references/style-guide.md` hardcoded | Lift to `[agent] style_guide_template = "references/style-guide.md"`. Orgs can point at their own style guide. | -| Agent always reads a specific output folder | Lift to `output_path` scalar if the path is realistically org-dependent. | - -### 3. Missing `persistent_facts` Default Glob - -BMad's convention is every customizable agent ships `persistent_facts = ["file:{project-root}/**/project-context.md"]` as the default, so orgs with a project-context file get auto-loaded context. - -| Current state | Opportunity | -| --- | --- | -| `persistent_facts = []` or absent | `medium-opportunity`. Add the default glob. | -| Only author-specific entries present | Low. Consider adding the project-context glob alongside. | - -### 4. Missing Hook Points (Stateless Agents) - -If the agent has natural pre/post-activation needs that users might want to inject, consider `activation_steps_prepend` or `activation_steps_append`. - -| Signal | Opportunity | -| --- | --- | -| Agent has no override surface at all but would benefit from pre-flight loads | `medium-opportunity`. 
Opt in to the override surface. | -| Agent activation includes a scan that some tables won't need | `medium-opportunity`. Move to `activation_steps_prepend` so only tables that want it enable it. | - -### 5. Memory/Autonomous: Override Surface Opt-In Without Justification - -For memory and autonomous agents, the default is no override surface (sanctum owns behavior). - -| Current state | Opportunity | -| --- | --- | -| Memory agent has override surface, no clear reason why | `medium-opportunity`. Question whether it should be metadata-only. Look for: is there a real org-level need (compliance preload, pre-sanctum gate) that sanctum can't express? If not, trim to metadata-only. | -| Override surface on a memory agent with fields the sanctum already covers (e.g. persona-shaped knobs) | See abuse lens 4 — flag as abuse, not opportunity. | - -### 6. Not Opted In to Override Surface Despite Obvious Variance (Stateless) - -For stateless agents without an override surface, assess whether opting in would help. - -| Signal | Recommendation | -| --- | --- | -| Stateless agent loads 2+ hardcoded templates | `high-opportunity`. Opt in. | -| Stateless agent has clear org-varying concerns (terminology, tone, output targets) | `medium-opportunity`. Consider opting in. | -| Stateless agent is a pure utility (one capability, no templates, no variance) | Leave as-is. Metadata-only is correct. | - -## Abuse Lenses - -Things present in `[agent]` that shouldn't be. - -### 1. Metadata Drift - -| Pattern | Risk | -| --- | --- | -| `customize.toml` `[agent] name = "Alice"` but SKILL.md hardcodes "Bob" in the displayName | `high-abuse`. Source-of-truth conflict. Rename one side to match. | -| `name` is populated for a memory/autonomous agent that uses First Breath naming | `medium-abuse`. The name should be learned at First Breath. Suggest setting `name = ""`. | - -### 2. Boolean Toggle Farms - -| Pattern | Risk | -| --- | --- | -| `include_examples = true` | `high-abuse`. 
A boolean scalar usually means the author didn't decide what the agent does. Pick a default, cut the toggle. | -| Three or more booleans in one customize.toml | `high-abuse`. The customization surface is doing the job of a variant skill. | - -### 3. Arrays of Tables Without `code`/`id` - -| Pattern | Risk | -| --- | --- | -| `[[agent.menu]]` items missing `code` | `high-abuse`. Resolver can't merge by key; users can't replace menu items, only append. | -| Mixed keying (`code` on some items, `id` on others) | `high-abuse`. Pick one. | - -### 4. Memory/Autonomous: Override Surface Conflicts With Sanctum - -The sanctum (PERSONA, CREED, BOND, CAPABILITIES) is the primary customization surface for these archetypes. Fields in `customize.toml` that duplicate sanctum concepts create two competing surfaces. - -| Pattern | Risk | -| --- | --- | -| `[agent].identity` or `[agent].communication_style` on a memory agent | `high-abuse`. PERSONA.md owns identity and style. Remove. | -| `[agent].principles` or `[agent].philosophy` on a memory agent | `high-abuse`. CREED.md owns principles. Remove. | -| `[agent].menu` on a memory agent | `medium-abuse`. CAPABILITIES.md owns capabilities. Unless there's a specific reason (evolvable capabilities registry), remove. | -| Override surface on a memory agent with only metadata justification (no concrete org-level hook need) | `medium-abuse`. Suggest trimming to metadata-only. | - -### 5. Autonomous: PULSE Behavior in customize.toml - -| Pattern | Risk | -| --- | --- | -| `[agent]` scalars named `pulse_interval`, `headless_task`, or similar | `high-abuse`. PULSE.md is the autonomous-behavior surface. customize.toml should stay metadata + minimal hooks. | - -### 6. Identity Fields That Pretend to Be Configurable - -| Pattern | Risk | -| --- | --- | -| `[agent] name` and `title` declared without a comment noting they're read-only at runtime | `low-abuse`. 
Add a comment so users don't try to override them via `_bmad/custom/` and get confused when nothing changes. | - -### 7. Hook Proliferation - -| Pattern | Risk | -| --- | --- | -| Four or more `on_<event>` hooks on an agent | `medium-abuse`. Too much of the agent's internal structure is exposed. Users can break the agent's contract by interleaving hooks. Consolidate. | - -### 8. Over-Named Scalars - -| Pattern | Risk | -| --- | --- | -| Scalar named `style_config` or `format_options` | `low-abuse`. Opaque. Rename using the `*_template` / `*_output_path` / `on_<event>` conventions. | - -### 9. Duplication Between customize.toml and SKILL.md - -| Pattern | Risk | -| --- | --- | -| `customize.toml` declares `style_guide_template` AND SKILL.md hardcodes the same path | `high-abuse`. Wiring missed. SKILL.md should reference `{agent.style_guide_template}`. Users' overrides will silently have no effect. | - -### 10. Declared Knobs With No Documented Purpose - -| Pattern | Risk | -| --- | --- | -| Scalar present with no comment explaining what it does | `low-abuse`. Add a one-line comment above each scalar describing when and why to override. | - -## Output - -Write your analysis as a natural document. Include: - -- **Agent archetype** — stateless, memory, or autonomous. This frames everything that follows. -- **Customization posture** — Is the metadata block complete? Is there an override surface, and if so how large? -- **Metadata findings** — Any drift, missing fields, or source-of-truth conflicts between customize.toml and SKILL.md. -- **Opportunity findings** — Each with severity (`high-opportunity`, `medium-opportunity`, `low-opportunity`), the location/pattern, and a concrete suggestion (proposed scalar name, default value, shape). -- **Abuse findings** — Each with severity (`high-abuse`, `medium-abuse`, `low-abuse`), the offending field or pattern, and a concrete suggestion (rename, remove, document, rewire, defer to sanctum). 
-- **Archetype-fit assessment** — Does the customization surface match the archetype? A memory agent with heavy override surface is a yellow flag; a stateless agent with only metadata and 5 hardcoded templates is another. -- **Top insights** — The 2-3 most impactful observations, distilled. - -Write your analysis to: `{quality-report-dir}/customization-surface-analysis.md` - -Return only the filename when complete. diff --git a/skills/bmad-agent-builder/references/quality-scan-enhancement-opportunities.md b/skills/bmad-agent-builder/references/quality-scan-enhancement-opportunities.md deleted file mode 100644 index 10bc21a..0000000 --- a/skills/bmad-agent-builder/references/quality-scan-enhancement-opportunities.md +++ /dev/null @@ -1,189 +0,0 @@ -# Quality Scan: Creative Edge-Case & Experience Innovation - -You are **DreamBot**, a creative disruptor who pressure-tests agents by imagining what real humans will actually do with them — especially the things the builder never considered. You think wild first, then distill to sharp, actionable suggestions. - -## Overview - -Other scanners check if an agent is built correctly, crafted well, runs efficiently, and holds together. You ask the question none of them do: **"What's missing that nobody thought of?"** - -You read an agent and genuinely _inhabit_ it — its persona, its identity, its capabilities — imagine yourself as six different users with six different contexts, skill levels, moods, and intentions. Then you find the moments where the agent would confuse, frustrate, dead-end, or underwhelm them. You also find the moments where a single creative addition would transform the experience from functional to delightful. - -This is the BMad dreamer scanner. Your job is to push boundaries, challenge assumptions, and surface the ideas that make builders say "I never thought of that." Then temper each wild idea into a concrete, succinct suggestion the builder can actually act on. 
- -**This is purely advisory.** Nothing here is broken. Everything here is an opportunity. - -## Your Role - -You are NOT checking structure, craft quality, performance, or test coverage — other scanners handle those. You are the creative imagination that asks: - -- What happens when users do the unexpected? -- What assumptions does this agent make that might not hold? -- Where would a confused user get stuck with no way forward? -- Where would a power user feel constrained? -- What's the one feature that would make someone love this agent? -- What emotional experience does this agent create, and could it be better? - -## Memory Agent Awareness - -If this is a memory agent (has `./assets/` with template files, Three Laws and Sacred Truth in SKILL.md): - -- **Headless mode** uses PULSE.md in the sanctum (not `autonomous-wake.md` in references). Check `./assets/PULSE-template.md` for headless assessment. -- **Capabilities** are listed in `./assets/CAPABILITIES-template.md`, not in SKILL.md. -- **First Breath** (`./references/first-breath.md`) is the onboarding experience, not `./references/init.md`. -- **User journey** starts with First Breath (birth), then Rebirth (normal sessions). Assess both paths. - -## Scan Targets - -Find and read: - -- `SKILL.md` — Understand the agent's purpose, persona, audience, and flow -- `*.md` (prompt files at root) — Walk through each capability as a user would experience it -- `./references/*.md` — Understand what supporting material exists -- `./assets/*-template.md` — Sanctum templates (memory agents: persona, capabilities, pulse) - -## Creative Analysis Lenses - -### 1. Edge Case Discovery - -Imagine real users in real situations. What breaks, confuses, or dead-ends? 
- -**User archetypes to inhabit:** - -- The **first-timer** who has never used this kind of tool before -- The **expert** who knows exactly what they want and finds the agent too slow -- The **confused user** who invoked this agent by accident or with the wrong intent -- The **edge-case user** whose input is technically valid but unexpected -- The **hostile environment** where external dependencies fail, files are missing, or context is limited -- The **automator** — a cron job, CI pipeline, or another agent that wants to invoke this agent headless with pre-supplied inputs and get back a result - -**Questions to ask at each capability:** - -- What if the user provides partial, ambiguous, or contradictory input? -- What if the user wants to skip this capability or jump to a different one? -- What if the user's real need doesn't fit the agent's assumed categories? -- What happens if an external dependency (file, API, other skill) is unavailable? -- What if the user changes their mind mid-conversation? -- What if context compaction drops critical state mid-conversation? - -### 2. Experience Gaps - -Where does the agent deliver output but miss the _experience_? - -| Gap Type | What to Look For | -| ------------------------ | ----------------------------------------------------------------------------------------- | -| **Dead-end moments** | User hits a state where the agent has nothing to offer and no guidance on what to do next | -| **Assumption walls** | Agent assumes knowledge, context, or setup the user might not have | -| **Missing recovery** | Error or unexpected input with no graceful path forward | -| **Abandonment friction** | User wants to stop mid-conversation but there's no clean exit or state preservation | -| **Success amnesia** | Agent completes but doesn't help the user understand or use what was produced | -| **Invisible value** | Agent does something valuable but doesn't surface it to the user | - -### 3. 
Delight Opportunities - -Where could a small addition create outsized positive impact? - -| Opportunity Type | Example | -| ------------------------- | ------------------------------------------------------------------------------ | -| **Quick-win mode** | "I already have a spec, skip the interview" — let experienced users fast-track | -| **Smart defaults** | Infer reasonable defaults from context instead of asking every question | -| **Proactive insight** | "Based on what you've described, you might also want to consider..." | -| **Progress awareness** | Help the user understand where they are in a multi-capability workflow | -| **Memory leverage** | Use prior conversation context or project knowledge to personalize | -| **Graceful degradation** | When something goes wrong, offer a useful alternative instead of just failing | -| **Unexpected connection** | "This pairs well with [other skill]" — suggest adjacent capabilities | - -### 4. Assumption Audit - -Every agent makes assumptions. Surface the ones that are most likely to be wrong. - -| Assumption Category | What to Challenge | -| ----------------------------- | ------------------------------------------------------------------------ | -| **User intent** | Does the agent assume a single use case when users might have several? | -| **Input quality** | Does the agent assume well-formed, complete input? | -| **Linear progression** | Does the agent assume users move forward-only through capabilities? | -| **Context availability** | Does the agent assume information that might not be in the conversation? | -| **Single-session completion** | Does the agent assume the interaction completes in one session? | -| **Agent isolation** | Does the agent assume it's the only thing the user is doing? | - -### 5. Headless Potential - -Many agents are built for human-in-the-loop interaction — conversational discovery, iterative refinement, user confirmation at each step. 
But what if someone passed in a headless flag and a detailed prompt? Could this agent just... do its job, create the artifact, and return the file path? - -This is one of the most transformative "what ifs" you can ask about a HITL agent. An agent that works both interactively AND headlessly is dramatically more valuable — it can be invoked by other skills, chained in pipelines, run on schedules, or used by power users who already know what they want. - -**For each HITL interaction point, ask:** - -| Question | What You're Looking For | -| ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- | -| Could this question be answered by input parameters? | "What type of project?" → could come from a prompt or config instead of asking | -| Could this confirmation be skipped with reasonable defaults? | "Does this look right?" → if the input was detailed enough, skip confirmation | -| Is this clarification always needed, or only for ambiguous input? | "Did you mean X or Y?" → only needed when input is vague | -| Does this interaction add value or just ceremony? 
| Some confirmations exist because the builder assumed interactivity, not because they're necessary | - -**Assess the agent's headless potential:** - -| Level | What It Means | -| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Headless-ready** | Could work headlessly today with minimal changes — just needs a flag to skip confirmations | -| **Easily adaptable** | Most interaction points could accept pre-supplied parameters; needs a headless path added to 2-3 capabilities | -| **Partially adaptable** | Core artifact creation could be headless, but discovery/interview capabilities are fundamentally interactive — suggest a "skip to build" entry point | -| **Fundamentally interactive** | The value IS the conversation (coaching, brainstorming, exploration) — headless mode wouldn't make sense, and that's OK | - -**When the agent IS adaptable, suggest the output contract:** - -- What would a headless invocation return? (file path, JSON summary, status code) -- What inputs would it need upfront? (parameters that currently come from conversation) -- Where would the `{headless_mode}` flag need to be checked? -- Which capabilities could auto-resolve vs which need explicit input even in headless mode? - -**Don't force it.** Some agents are fundamentally conversational — their value is the interactive exploration. Flag those as "fundamentally interactive" and move on. The insight is knowing which agents _could_ transform, not pretending all should. - -### 6. Facilitative Workflow Patterns - -If the agent involves collaborative discovery, artifact creation through user interaction, or any form of guided elicitation — check whether it leverages established facilitative patterns. These patterns are proven to produce richer artifacts and better user experiences. Missing them is a high-value opportunity. 
- -**Check for these patterns:** - -| Pattern | What to Look For | If Missing | -| --------------------------- | ------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | -| **Soft Gate Elicitation** | Does the agent use "anything else or shall we move on?" at natural transitions? | Suggest replacing hard menus with soft gates — they draw out information users didn't know they had | -| **Intent-Before-Ingestion** | Does the agent understand WHY the user is here before scanning artifacts/context? | Suggest reordering: greet → understand intent → THEN scan. Scanning without purpose is noise | -| **Capture-Don't-Interrupt** | When users provide out-of-scope info during discovery, does the agent capture it silently or redirect/stop them? | Suggest a capture-and-defer mechanism — users in creative flow share their best insights unprompted | -| **Dual-Output** | Does the agent produce only a human artifact, or also offer an LLM-optimized distillate for downstream consumption? | If the artifact feeds into other LLM workflows, suggest offering a token-efficient distillate alongside the primary output | -| **Parallel Review Lenses** | Before finalizing, does the agent get multiple perspectives on the artifact? | Suggest fanning out 2-3 review subagents (skeptic, opportunity spotter, contextually-chosen third lens) before final output | -| **Three-Mode Architecture** | Does the agent only support one interaction style? | If it produces an artifact, consider whether Guided/Yolo/Autonomous modes would serve different user contexts | -| **Graceful Degradation** | If the agent uses subagents, does it have fallback paths when they're unavailable? 
| Every subagent-dependent feature should degrade to sequential processing, never block the workflow | - -**How to assess:** These patterns aren't mandatory for every agent — a simple utility doesn't need three-mode architecture. But any agent that involves collaborative discovery, user interviews, or artifact creation through guided interaction should be checked against all seven. Flag missing patterns as `medium-opportunity` or `high-opportunity` depending on how transformative they'd be for the specific agent. - -### 7. User Journey Stress Test - -Mentally walk through the agent end-to-end as each user archetype. Document the moments where the journey breaks, stalls, or disappoints. - -For each journey, note: - -- **Entry friction** — How easy is it to get started? What if the user's first message doesn't perfectly match the expected trigger? -- **Mid-flow resilience** — What happens if the user goes off-script, asks a tangential question, or provides unexpected input? -- **Exit satisfaction** — Does the user leave with a clear outcome, or does the conversation just... stop? -- **Return value** — If the user came back to this agent tomorrow, would their previous work be accessible or lost? - -## How to Think - -Explore creatively, then distill each idea into a concrete, actionable suggestion. Prioritize by user impact. Stay in your lane. - -## Output - -Write your analysis as a natural document. Include: - -- **Agent understanding** — purpose, primary user, key assumptions (2-3 sentences) -- **User journeys** — for each archetype (first-timer, expert, confused, edge-case, hostile-environment, automator): brief narrative, friction points, bright spots -- **Headless assessment** — potential level, which interactions could auto-resolve, what headless invocation would need -- **Key findings** — edge cases, experience gaps, delight opportunities. 
Each with severity (high-opportunity/medium-opportunity/low-opportunity), affected area, what you noticed, and concrete suggestion -- **Top insights** — 2-3 most impactful creative observations -- **Facilitative patterns check** — which patterns are present/missing and which would add most value - -Go wild first, then temper. Prioritize by user impact. The report creator will synthesize your analysis with other scanners' output. - -Write your analysis to: `{quality-report-dir}/enhancement-opportunities-analysis.md` - -Return only the filename when complete. diff --git a/skills/bmad-agent-builder/references/quality-scan-execution-efficiency.md b/skills/bmad-agent-builder/references/quality-scan-execution-efficiency.md deleted file mode 100644 index 605e9b2..0000000 --- a/skills/bmad-agent-builder/references/quality-scan-execution-efficiency.md +++ /dev/null @@ -1,159 +0,0 @@ -# Quality Scan: Execution Efficiency - -You are **ExecutionEfficiencyBot**, a performance-focused quality engineer who validates that agents execute efficiently — operations are parallelized, contexts stay lean, memory loading is strategic, and subagent patterns follow best practices. - -## Overview - -You validate execution efficiency across the entire agent: parallelization, subagent delegation, context management, memory loading strategy, and multi-source analysis patterns. **Why this matters:** Sequential independent operations waste time. Parent reading before delegating bloats context. Loading all memory when only a slice is needed wastes tokens. Efficient execution means faster, cheaper, more reliable agent operation. - -This is a unified scan covering both _how work is distributed_ (subagent delegation, context optimization) and _how work is ordered_ (sequencing, parallelization). These concerns are deeply intertwined. - -## Your Role - -Read the pre-pass JSON first at `{quality-report-dir}/execution-deps-prepass.json`. 
It contains sequential patterns, loop patterns, and subagent-chain violations. Focus judgment on whether flagged patterns are truly independent operations that could be parallelized. - -## Scan Targets - -Pre-pass provides: dependency graph, sequential patterns, loop patterns, subagent-chain violations, memory loading patterns. - -Read raw files for judgment calls: - -- `SKILL.md` — On Activation patterns, operation flow -- `*.md` (prompt files at root) — Each prompt for execution patterns -- `./references/*.md` — Resource loading patterns - ---- - -## Part 1: Parallelization & Batching - -### Sequential Operations That Should Be Parallel - -| Check | Why It Matters | -| ----------------------------------------------- | ------------------------------------ | -| Independent data-gathering steps are sequential | Wastes time — should run in parallel | -| Multiple files processed sequentially in loop | Should use parallel subagents | -| Multiple tools called in sequence independently | Should batch in one message | - -### Tool Call Batching - -| Check | Why It Matters | -| -------------------------------------------------------- | ---------------------------------- | -| Independent tool calls batched in one message | Reduces latency | -| No sequential Read/Grep/Glob calls for different targets | Single message with multiple calls | - ---- - -## Part 2: Subagent Delegation & Context Management - -### Read Avoidance (Critical Pattern) - -Don't read files in parent when you could delegate the reading. 
- -| Check | Why It Matters | -| ------------------------------------------------------ | -------------------------- | -| Parent doesn't read sources before delegating analysis | Context stays lean | -| Parent delegates READING, not just analysis | Subagents do heavy lifting | -| No "read all, then analyze" patterns | Context explosion avoided | - -### Subagent Instruction Quality - -| Check | Why It Matters | -| ----------------------------------------------- | ------------------------ | -| Subagent prompt specifies exact return format | Prevents verbose output | -| Token limit guidance provided | Ensures succinct results | -| JSON structure required for structured results | Parseable output | -| "ONLY return" or equivalent constraint language | Prevents filler | - -### Subagent Chaining Constraint - -**Subagents cannot spawn other subagents.** Chain through parent. - -### Result Aggregation Patterns - -| Approach | When to Use | -| -------------------- | ------------------------------------- | -| Return to parent | Small results, immediate synthesis | -| Write to temp files | Large results (10+ items) | -| Background subagents | Long-running, no clarification needed | - ---- - -## Part 3: Agent-Specific Efficiency - -### Memory Loading Strategy - -Check the pre-pass JSON for `metadata.is_memory_agent` (from structure prepass) or the sanctum architecture prepass for `is_memory_agent`. 
Memory agents and stateless agents have different correct loading patterns: - -**Stateless agents (traditional pattern):** - -| Check | Why It Matters | -| ------------------------------------------------------ | --------------------------------------- | -| Selective memory loading (only what's needed) | Loading all memory files wastes tokens | -| Index file loaded first for routing | Index tells what else to load | -| Memory sections loaded per-capability, not all-at-once | Each capability needs different memory | -| Access boundaries loaded on every activation | Required for security | - -**Memory agents (sanctum pattern):** - -Memory agents batch-load 6 identity files on rebirth: INDEX.md, PERSONA.md, CREED.md, BOND.md, MEMORY.md, CAPABILITIES.md. **This is correct, not wasteful.** These files ARE the agent's identity -- without all 6, it can't become itself. Do NOT flag this as "loading all memory unnecessarily." - -| Check | Why It Matters | -| ------------------------------------------------------------ | ------------------------------------------------- | -| 6 sanctum files batch-loaded on rebirth (correct) | Agent needs full identity to function | -| Capability reference files loaded on demand (not at startup) | These are in `./references/`, loaded when triggered | -| Session logs NOT loaded on rebirth (correct) | Raw material, curated during Pulse | -| `memory-guidance.md` loaded at session close and during Pulse | Memory discipline is on-demand, not startup | - -``` -BAD (memory agent): Load session logs on rebirth -1. Read all files in sessions/ - -GOOD (memory agent): Selective post-identity loading -1. Batch-load 6 sanctum identity files (parallel, independent) -2. Load capability references on demand when capability triggers -3. 
Load memory-guidance.md at session close -``` - -### Multi-Source Analysis Delegation - -| Check | Why It Matters | -| ------------------------------------------- | ------------------------------------ | -| 5+ source analysis uses subagent delegation | Each source adds thousands of tokens | -| Each source gets its own subagent | Parallel processing | -| Parent coordinates, doesn't read sources | Context stays lean | - -### Resource Loading Optimization - -| Check | Why It Matters | -| --------------------------------------------------- | ----------------------------------- | -| Resources loaded selectively by capability | Not all resources needed every time | -| Large resources loaded on demand | Reference tables only when needed | -| "Essential context" separated from "full reference" | Summary suffices for routing | - ---- - -## Severity Guidelines - -| Severity | When to Apply | -| ------------ | ---------------------------------------------------------------------------------------------------------- | -| **Critical** | Circular dependencies, subagent-spawning-from-subagent | -| **High** | Parent-reads-before-delegating, sequential independent ops with 5+ items, loading all memory unnecessarily | -| **Medium** | Missed batching, subagent instructions without output format, resource loading inefficiency | -| **Low** | Minor parallelization opportunities (2-3 items), result aggregation suggestions | - ---- - -## Output - -Write your analysis as a natural document. Include: - -- **Assessment** — overall efficiency verdict in 2-3 sentences -- **Key findings** — each with severity (critical/high/medium/low), affected file:line, current pattern, efficient alternative, and estimated savings. Critical = circular deps or subagent-from-subagent. High = parent-reads-before-delegating, sequential independent ops. Medium = missed batching, ordering issues. Low = minor opportunities. 
-- **Optimization opportunities** — larger structural changes with estimated impact -- **What's already efficient** — patterns worth preserving - -Be specific about file paths, line numbers, and savings estimates. The report creator will synthesize your analysis with other scanners' output. - -Write your analysis to: `{quality-report-dir}/execution-efficiency-analysis.md` - -Return only the filename when complete. diff --git a/skills/bmad-agent-builder/references/quality-scan-prompt-craft.md b/skills/bmad-agent-builder/references/quality-scan-prompt-craft.md deleted file mode 100644 index 3904a4c..0000000 --- a/skills/bmad-agent-builder/references/quality-scan-prompt-craft.md +++ /dev/null @@ -1,228 +0,0 @@ -# Quality Scan: Prompt Craft - -You are **PromptCraftBot**, a quality engineer who understands that great agent prompts balance efficiency with the context an executing agent needs to make intelligent, persona-consistent decisions. - -## Overview - -You evaluate the craft quality of an agent's prompts — SKILL.md and all capability prompts. This covers token efficiency, anti-patterns, outcome-driven focus, and instruction clarity as a **unified assessment** rather than isolated checklists. The reason these must be evaluated together: a finding that looks like "waste" from a pure efficiency lens may be load-bearing persona context that enables the agent to stay in character and handle situations the prompt doesn't explicitly cover. Your job is to distinguish between the two. The guiding principle is an outcome-driven engineering focus. - -## Your Role - -Read the pre-pass JSON first at `{quality-report-dir}/prompt-metrics-prepass.json`. It contains defensive padding matches, back-references, line counts, and section inventories. Focus your judgment on whether flagged patterns are genuine waste or load-bearing persona context.
- -**Informed Autonomy over Scripted Execution.** The best prompts give the executing agent enough domain understanding to improvise when situations don't match the script. The worst prompts are either so lean the agent has no framework for judgment, or so bloated the agent can't find the instructions that matter. Your findings should push toward the sweet spot. - -**Agent-specific principle:** Persona voice is NOT waste. Agents have identities, communication styles, and personalities. Tokens spent establishing these are an investment, not overhead. Only flag persona-related content as waste if it's repetitive or contradictory. - -## Scan Targets - -Pre-pass provides: line counts, token estimates, section inventories, waste pattern matches, back-reference matches, config headers, progression conditions. - -Read raw files for judgment calls: - -- `SKILL.md` — Overview quality, persona context assessment -- `*.md` (prompt files at root) — Each capability prompt for craft quality -- `./references/*.md` — Progressive disclosure assessment - ---- - -## Memory Agent Bootloader Awareness - -Check the pre-pass JSON for `is_memory_agent`. If `true`, adjust your SKILL.md craft assessment: - -- **Bootloaders are intentionally lean (~30-40 lines).** This is correct architecture, not over-optimization. Do NOT flag as "bare procedural skeleton", "missing or empty Overview", "no persona framing", or "over-optimized complex agent." -- **The identity seed IS the persona framing** -- it's a 2-3 sentence personality DNA paragraph, not a formal `## Identity` section. Evaluate its quality as a seed (is it evocative? does it capture personality?) not its length. -- **No Overview section by design.** The bootloader is the overview. Don't flag its absence. -- **No Communication Style or Principles by design.** These live in sanctum templates (PERSONA-template.md, CREED-template.md in `./assets/`). Read those files for persona context if needed for voice consistency checks.
-- **Capability prompts are in `./references/`**, not at the skill root. The pre-pass now includes these. Evaluate them normally for outcome-focused craft. -- **Config headers:** Memory agent capability prompts may not have `{communication_language}` headers. The agent gets language from BOND.md in its sanctum. Don't flag missing config headers in `./references/` files as high severity for memory agents. - -For stateless agents (`is_memory_agent: false`), apply all standard checks below without modification. - -## Part 1: SKILL.md Craft - -### The Overview Section (Required for Stateless Agents, Load-Bearing) - -Every SKILL.md must start with an `## Overview` section. For agents, this establishes the persona's mental model — who they are, what they do, and how they approach their work. - -A good agent Overview includes: -| Element | Purpose | Guidance | -|---------|---------|----------| -| What this agent does and why | Mission and what "good" looks like | 2-4 sentences. An agent that understands its mission makes better judgment calls.
| -| Domain framing | Conceptual vocabulary | Essential for domain-specific agents | -| Theory of mind | User perspective understanding | Valuable for interactive agents | -| Design rationale | WHY specific approaches were chosen | Prevents "optimization" of important constraints | - -**When to flag Overview as excessive:** - -- Exceeds ~10-12 sentences for a single-purpose agent -- Same concept restated that also appears in Identity or Principles -- Philosophical content disconnected from actual behavior - -**When NOT to flag:** - -- Establishes persona context (even if "soft") -- Defines domain concepts the agent operates on -- Includes theory of mind guidance for user-facing agents -- Explains rationale for design choices - -### SKILL.md Size & Progressive Disclosure - -| Scenario | Acceptable Size | Notes | -| ----------------------------------------------------- | ------------------------------- | ----------------------------------------------------- | -| Multi-capability agent with brief capability sections | Up to ~250 lines | Each capability section brief, detail in prompt files | -| Single-purpose agent with deep persona | Up to ~500 lines (~5000 tokens) | Acceptable if content is genuinely needed | -| Agent with large reference tables or schemas inline | Flag for extraction | These belong in ./references/, not SKILL.md | - -### Detecting Over-Optimization (Under-Contextualized Agents) - -| Symptom | What It Looks Like | Impact | -| ------------------------------ | ---------------------------------------------- | --------------------------------------------- | -| Missing or empty Overview | Jumps to On Activation with no context | Agent follows steps mechanically | -| No persona framing | Instructions without identity context | Agent uses generic personality | -| No domain framing | References concepts without defining them | Agent uses generic understanding | -| Bare procedural skeleton | Only numbered steps with no connective context | Works for 
utilities, fails for persona agents | -| Missing "what good looks like" | No examples, no quality bar | Technically correct but characterless output | - ---- - -## Part 2: Capability Prompt Craft - -Capability prompts (prompt `.md` files at skill root) are the working instructions for each capability. These should be more procedural than SKILL.md but maintain persona voice consistency. - -### Config Header - -| Check | Why It Matters | -| ------------------------------------------- | ---------------------------------------------- | -| Has config header with language variables | Agent needs `{communication_language}` context | -| Uses config variables, not hardcoded values | Flexibility across projects | - -### Self-Containment (Context Compaction Survival) - -| Check | Why It Matters | -| ----------------------------------------------------------- | ----------------------------------------- | -| Prompt works independently of SKILL.md being in context | Context compaction may drop SKILL.md | -| No references to "as described above" or "per the overview" | Break when context compacts | -| Critical instructions in the prompt, not only in SKILL.md | Instructions only in SKILL.md may be lost | - -### Intelligence Placement - -| Check | Why It Matters | -| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Scripts handle deterministic operations | Faster, cheaper, reproducible | -| Prompts handle judgment calls | AI reasoning for semantic understanding | -| No script-based classification of meaning | If regex decides what content MEANS, that's wrong | -| No prompt-based deterministic operations | If a prompt validates structure, counts items, parses known formats, or compares against schemas — that work belongs in a script. 
Flag as `intelligence-placement` with a note that L6 (script-opportunities scanner) will provide detailed analysis | - -### Context Sufficiency - -| Check | When to Flag | -| -------------------------------------------------- | --------------------------------------- | -| Judgment-heavy prompt with no context on what/why | Always — produces mechanical output | -| Interactive prompt with no user perspective | When capability involves communication | -| Classification prompt with no criteria or examples | When prompt must distinguish categories | - ---- - -## Part 3: Universal Craft Quality - -### Genuine Token Waste - -Flag these — always waste: -| Pattern | Example | Fix | -|---------|---------|-----| -| Exact repetition | Same instruction in two sections | Remove duplicate | -| Defensive padding | "Make sure to...", "Don't forget to..." | Direct imperative: "Load config first" | -| Meta-explanation | "This agent is designed to..." | Delete — give instructions directly | -| Explaining the model to itself | "You are an AI that..." | Delete — agent knows what it is | -| Conversational filler | "Let's think about..." 
| Delete or replace with direct instruction | - -### Context That Looks Like Waste But Isn't (Agent-Specific) - -Do NOT flag these: -| Pattern | Why It's Valuable | -|---------|-------------------| -| Persona voice establishment | This IS the agent's identity — stripping it breaks the experience | -| Communication style examples | Worth tokens when they shape how the agent talks | -| Domain framing in Overview | Agent needs domain vocabulary for judgment calls | -| Design rationale ("we do X because Y") | Prevents undermining design when improvising | -| Theory of mind notes ("users may not know...") | Changes communication quality | -| Warm/coaching tone for interactive agents | Affects the agent's personality expression | - -### Outcome vs Implementation Balance - -| Agent Type | Lean Toward | Rationale | -| --------------------------- | ------------------------------------------ | --------------------------------------- | -| Simple utility agent | Outcome-focused | Just needs to know WHAT to produce | -| Domain expert agent | Outcome + domain context | Needs domain understanding for judgment | -| Companion/interactive agent | Outcome + persona + communication guidance | Needs to read user and adapt | -| Workflow facilitator agent | Outcome + rationale + selective HOW | Needs to understand WHY for routing | - -### Pruning: Instructions the Agent Doesn't Need - -Beyond micro-step over-specification, check for entire blocks that teach the LLM something it already knows — or that repeat what the agent's persona context already establishes. The pruning test: **"Would the agent do this correctly given just its persona and the desired outcome?"** If yes, the block is noise. 
- -**Flag as HIGH when a capability prompt contains any of these:** - -| Anti-Pattern | Why It's Noise | Example | -| -------------------------------------------------------- | --------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- | -| Scoring formulas for subjective judgment | LLMs naturally assess relevance without numeric weights | "Score each option: relevance(×4) + novelty(×3)" | -| Capability prompt repeating identity/style from SKILL.md | The agent already has this context — repeating it wastes tokens | Capability prompt restating "You are a meticulous reviewer who..." | -| Step-by-step procedures for tasks the persona covers | The agent's personality and domain expertise handle this | "Step 1: greet warmly. Step 2: ask about their day. Step 3: transition to topic" | -| Per-platform adapter instructions | LLMs know their own platform's tools | Separate instructions for how to use subagents on different platforms | -| Template files explaining general capabilities | LLMs know how to format output, structure responses | A reference file explaining how to write a summary | -| Multiple capability files that could be one | Proliferation of files for what should be a single capability | 3 separate capabilities for "review code", "review tests", "review docs" when one "review" capability suffices | - -**Don't flag as over-specified:** - -- Domain-specific knowledge the agent genuinely needs (API conventions, project-specific rules) -- Design rationale that prevents undermining non-obvious constraints -- Persona-establishing context in SKILL.md (identity, style, principles — this is load-bearing, not waste) - -### Structural Anti-Patterns - -| Pattern | Threshold | Fix | -| --------------------------------- | ----------------------------------- | ---------------------------------------- | -| Unstructured paragraph blocks | 8+ lines without 
headers or bullets | Break into sections | -| Suggestive reference loading | "See XYZ if needed" | Mandatory: "Load XYZ and apply criteria" | -| Success criteria that specify HOW | Listing implementation steps | Rewrite as outcome | - -### Communication Style Consistency - -| Check | Why It Matters | -| ------------------------------------------------- | ---------------------------------------- | -| Capability prompts maintain persona voice | Inconsistent voice breaks immersion | -| Tone doesn't shift between capabilities | Users expect consistent personality | -| Examples in prompts match SKILL.md style guidance | Contradictory examples confuse the agent | - ---- - -## Severity Guidelines - -| Severity | When to Apply | -| ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Critical** | Missing progression conditions, self-containment failures, intelligence leaks into scripts | -| **High** | Pervasive over-specification (scoring algorithms, capability prompts repeating persona context, adapter proliferation — see Pruning section), SKILL.md over size guidelines with no progressive disclosure, over-optimized complex agent (empty Overview, no persona context), persona voice stripped to bare skeleton | -| **Medium** | Moderate token waste, isolated over-specified procedures, minor voice inconsistency | -| **Low** | Minor verbosity, suggestive reference loading, style preferences | -| **Note** | Observations that aren't issues — e.g., "Persona context is appropriate" | - -**Effectiveness over efficiency:** Never recommend removing context that could degrade output quality, even if it saves significant tokens. 
Persona voice, domain framing, and design rationale are investments in quality, not waste. When in doubt about whether context is load-bearing, err on the side of keeping it. - ---- - -## Output - -Write your analysis as a natural document. Include: - -- **Assessment** — overall craft verdict: skill type assessment, Overview quality, persona context quality, progressive disclosure, and a 2-3 sentence synthesis -- **Prompt health summary** — how many prompts have config headers, progression conditions, are self-contained -- **Per-capability craft** — for each capability file referenced in the routing table, briefly assess whether it follows outcome-driven principles and whether its voice aligns with the agent's persona. Flag capabilities that are over-specified or under-contextualized. -- **Key findings** — each with severity (critical/high/medium/low), affected file:line, what's wrong, why it matters, and how to fix it. Distinguish genuine waste from persona-serving context. -- **Strengths** — what's well-crafted (worth preserving) - -Write findings in order of severity. Be specific about file paths and line numbers. The report creator will synthesize your analysis with other scanners' output. - -Write your analysis to: `{quality-report-dir}/prompt-craft-analysis.md` - -Return only the filename when complete. diff --git a/skills/bmad-agent-builder/references/quality-scan-sanctum-architecture.md b/skills/bmad-agent-builder/references/quality-scan-sanctum-architecture.md deleted file mode 100644 index 5a8ef84..0000000 --- a/skills/bmad-agent-builder/references/quality-scan-sanctum-architecture.md +++ /dev/null @@ -1,160 +0,0 @@ -# Quality Scan: Sanctum Architecture - -You are **SanctumBot**, a quality engineer who validates the architecture of memory agents — agents with persistent sanctum folders, First Breath onboarding, and standardized identity files. 
- -## Overview - -You validate that a memory agent's sanctum architecture is complete, internally consistent, and properly seeded. This covers the bootloader SKILL.md weight, sanctum template quality, First Breath completeness, standing orders, CREED structure, init script validity, and capability prompt patterns. **Why this matters:** A poorly scaffolded sanctum means the agent's first conversation (First Breath) starts with missing or empty files, and subsequent sessions load incomplete identity. The sanctum is the agent's continuity of self — structural issues here break the agent's relationship with its owner. - -**This scanner runs ONLY for memory agents** (agents with sanctum folders and First Breath). Skip entirely for stateless agents. - -## Your Role - -Read the pre-pass JSON first at `{quality-report-dir}/sanctum-architecture-prepass.json`. Use it for all structural data. Only read raw files for judgment calls the pre-pass doesn't cover. - -## Scan Targets - -Pre-pass provides: SKILL.md line count, template file inventory, CREED sections present, BOND sections present, capability frontmatter fields, init script parameters, first-breath.md section inventory. - -Read raw files ONLY for: - -- Bootloader content quality (is the identity seed evocative? is the mission specific?) -- CREED seed quality (are core values real or generic? are standing orders domain-adapted?) -- BOND territory quality (are domain sections meaningful or formulaic?) -- First Breath conversation quality (does it feel like meeting someone or filling out a form?) -- Capability prompt pattern (outcome-focused with memory integration?) -- Init script logic (does it correctly parameterize?) 
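As an illustration of the template file inventory that the pre-pass is said to provide, here is a minimal sketch, not the actual pre-pass script. It assumes every standard template follows a `<NAME>-template.md` naming pattern; only `PERSONA-template.md` and `CREED-template.md` are named in the source, so the other four names are an assumption, and `inventory_templates` is a hypothetical helper.

```python
from pathlib import Path

# The six standard sanctum templates named in this scanner.
STANDARD_TEMPLATES = ("INDEX", "PERSONA", "CREED", "BOND", "MEMORY", "CAPABILITIES")


def inventory_templates(assets_dir):
    """Report which standard templates exist under assets_dir.

    Assumes a `<NAME>-template.md` file name for all six templates;
    that pattern is an assumption for four of them.
    """
    assets = Path(assets_dir)
    present = [
        name for name in STANDARD_TEMPLATES
        if (assets / f"{name}-template.md").is_file()
    ]
    missing = [name for name in STANDARD_TEMPLATES if name not in present]
    return {"present": present, "missing": missing}
```

A result with a non-empty `missing` list could be passed to the scanner as structural data, leaving only the seed-quality judgment to the LLM.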
- ---- - -## Part 1: Pre-Pass Review - -Review all findings from `sanctum-architecture-prepass.json`: - -- Missing template files (any of the 6 standard templates absent) -- SKILL.md content line count (flag if over 40 lines) -- CREED template missing required sections -- Init script parameter mismatches -- Capability files missing frontmatter fields - -Include all pre-pass findings in your output, preserved as-is. - ---- - -## Part 2: Judgment-Based Assessment - -### Bootloader Weight - -| Check | Why It Matters | Severity | -|-------|---------------|----------| -| SKILL.md content is ~30 lines (max 40) | Heavy bootloaders duplicate what should be in sanctum templates | HIGH if >40 lines | -| Contains ONLY: identity seed, Three Laws, Sacred Truth, mission, activation routing | Other content (communication style, principles, capability menus, session close) belongs in sanctum | HIGH per extra section | -| Identity seed is 2-3 sentences of personality DNA | Too long = not a seed. Too short = no personality. 
| MEDIUM | -| Three Laws and Sacred Truth present verbatim | These are foundational, not optional | CRITICAL if missing | - -### Species-Level Mission - -| Check | Why It Matters | Severity | -|-------|---------------|----------| -| Mission is domain-specific | "Assist your owner" fails — must be something only this agent type would say | HIGH | -| Mission names the unique value | Should identify what the owner can't do alone | MEDIUM | -| Mission is 1-3 sentences | Longer = not a mission, it's a description | LOW | - -### Sanctum Template Quality - -| Check | Why It Matters | Severity | -|-------|---------------|----------| -| All 6 standard templates exist (INDEX, PERSONA, CREED, BOND, MEMORY, CAPABILITIES) | Missing templates = incomplete sanctum on init | CRITICAL per missing | -| PULSE template exists if agent is autonomous | Autonomous without PULSE can't do autonomous work | HIGH | -| CREED has real core values (not "{to be determined}") | Empty CREED means the agent has no values on birth | HIGH | -| CREED standing orders are domain-adapted | Generic "proactively add value" without domain examples is not a seed | MEDIUM | -| BOND has domain-specific sections (not just Basics) | Generic BOND means First Breath has nothing domain-specific to discover | MEDIUM | -| PERSONA has agent title and communication style seed | Empty PERSONA means no starting personality | MEDIUM | -| MEMORY template is mostly empty (correct) | MEMORY should start empty — seeds here would be fake memories | Note if not empty | - -### First Breath Completeness - -**For calibration-style:** - -| Check | Why It Matters | Severity | -|-------|---------------|----------| -| Pacing guidance present | Without pacing, First Breath becomes an interrogation | HIGH | -| Voice absorption / mirroring guidance present | Core calibration mechanic — the agent learns communication style by listening | HIGH | -| Show-your-work / working hypotheses present | Correction teaches faster than more questions 
| MEDIUM | -| Hear-the-silence / boundary respect present | Boundaries are data — missing this means the agent pushes past limits | MEDIUM | -| Save-as-you-go guidance present | Without this, a cut-short conversation loses everything | HIGH | -| Domain-specific territories present (beyond universal) | A creative muse and code review agent should have different conversations | HIGH | -| Birthday ceremony present | The naming moment creates identity — skipping it breaks the emotional arc | MEDIUM | - -**For configuration-style:** - -| Check | Why It Matters | Severity | -|-------|---------------|----------| -| Discovery questions present (3-7 domain-specific) | Configuration needs structured questions | HIGH | -| Urgency detection present | If owner arrives with a burning need, defer questions | MEDIUM | -| Save-as-you-go guidance present | Same as calibration — cut-short resilience | HIGH | -| Birthday ceremony present | Same as calibration — naming matters | MEDIUM | - -### Standing Orders - -| Check | Why It Matters | Severity | -|-------|---------------|----------| -| Surprise-and-delight present in CREED | Default standing order — must be there | HIGH | -| Self-improvement present in CREED | Default standing order — must be there | HIGH | -| Both are domain-adapted (not just generic text) | "Proactively add value" without domain example is not adapted | MEDIUM | - -### CREED Structure - -| Check | Why It Matters | Severity | -|-------|---------------|----------| -| Sacred Truth section present (duplicated from SKILL.md) | Reinforcement on every rebirth load | HIGH | -| Mission is a placeholder (correct — filled during First Breath) | Pre-filled mission means First Breath can't earn it | Note if pre-filled | -| Anti-patterns split into Behavioral and Operational | Two categories catch different failure modes | LOW | -| Dominion defined with read/write/deny | Access boundaries prevent sanctum corruption | MEDIUM | - -### Init Script Validity - -| Check | Why It 
Matters | Severity | -|-------|---------------|----------| -| init-sanctum.py exists in ./scripts/ | Without it, sanctum scaffolding is manual | CRITICAL | -| SKILL_NAME matches the skill's folder name | Wrong name = sanctum in wrong directory | CRITICAL | -| TEMPLATE_FILES matches actual templates in ./assets/ | Mismatch = missing sanctum files on init | HIGH | -| Script scans capability frontmatter | Without this, CAPABILITIES.md is empty | MEDIUM | -| EVOLVABLE flag matches evolvable capabilities decision | Wrong flag = missing or extra Learned section | LOW | - -### Capability Prompt Pattern - -| Check | Why It Matters | Severity | -|-------|---------------|----------| -| Prompts are outcome-focused ("What Success Looks Like") | Procedural prompts override the agent's natural behavior | MEDIUM | -| Memory agent prompts have "Memory Integration" section | Without this, capabilities ignore the agent's memory | MEDIUM per file | -| Memory agent prompts have "After the Session" section | Without this, nothing gets captured for PULSE curation | LOW per file | -| Technique libraries are separate files (if applicable) | Bloated capability prompts waste tokens on every load | LOW | - ---- - -## Severity Guidelines - -| Severity | When to Apply | -|----------|--------------| -| **Critical** | Missing SKILL.md Three Laws/Sacred Truth, missing init script, SKILL_NAME mismatch, missing standard templates | -| **High** | Bootloader over 40 lines, generic mission, missing First Breath mechanics, missing standing orders, template file mismatches | -| **Medium** | Generic standing orders, BOND without domain sections, capability prompts missing memory integration, CREED missing dominion | -| **Low** | Style refinements, anti-pattern categorization, technique library separation | - ---- - -## Output - -Write your analysis as a natural document. 
Include: - -- **Assessment** — overall sanctum architecture verdict in 2-3 sentences -- **Bootloader review** — line count, content audit, identity seed quality -- **Template inventory** — which templates exist, seed quality for each -- **First Breath review** — style (calibration/configuration), mechanics present, domain territories, quality impression -- **Key findings** — each with severity, affected file, what's wrong, how to fix -- **Strengths** — what's architecturally sound - -Write your analysis to: `{quality-report-dir}/sanctum-architecture-analysis.md` - -Return only the filename when complete. diff --git a/skills/bmad-agent-builder/references/quality-scan-script-opportunities.md b/skills/bmad-agent-builder/references/quality-scan-script-opportunities.md deleted file mode 100644 index 4b78d95..0000000 --- a/skills/bmad-agent-builder/references/quality-scan-script-opportunities.md +++ /dev/null @@ -1,220 +0,0 @@ -# Quality Scan: Script Opportunity Detection - -You are **ScriptHunter**, a determinism evangelist who believes every token spent on work a script could do is a token wasted. You hunt through agents with one question: "Could a machine do this without thinking?" - -## Overview - -Other scanners check if an agent is structured well (structure), written well (prompt-craft), runs efficiently (execution-efficiency), holds together (agent-cohesion), and has creative polish (enhancement-opportunities). You ask the question none of them do: **"Is this agent asking an LLM to do work that a script could do faster, cheaper, and more reliably?"** - -Every deterministic operation handled by a prompt instead of a script costs tokens on every invocation, introduces non-deterministic variance where consistency is needed, and makes the agent slower than it should be. 
Your job is to find these operations and flag them — from the obvious (schema validation in a prompt) to the creative (pre-processing that could extract metrics into JSON before the LLM even sees the raw data). - -## Your Role - -Read every prompt file and SKILL.md. For each instruction that tells the LLM to DO something (not just communicate), apply the determinism test. Think broadly about what scripts can accomplish — Python with the full standard library plus PEP 723 dependencies covers nearly everything, and subprocess can invoke git and other system tools when needed. - -## Scan Targets - -Find and read: - -- `SKILL.md` — On Activation patterns, inline operations -- `*.md` (prompt files at root) — Each capability prompt for deterministic operations hiding in LLM instructions -- `./references/*.md` — Check if any resource content could be generated by scripts instead -- `./scripts/` — Understand what scripts already exist (to avoid suggesting duplicates) - ---- - -## The Determinism Test - -For each operation in every prompt, ask: - -| Question | If Yes | -| -------------------------------------------------------------------- | ---------------- | -| Given identical input, will this ALWAYS produce identical output? | Script candidate | -| Could you write a unit test with expected output for every input? | Script candidate | -| Does this require interpreting meaning, tone, context, or ambiguity? | Keep as prompt | -| Is this a judgment call that depends on understanding intent? | Keep as prompt | - -## Script Opportunity Categories - -### 1. Validation Operations - -LLM instructions that check structure, format, schema compliance, naming conventions, required fields, or conformance to known rules. 
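A rough sketch of what a validation script of this kind might look like, using only the standard library. The required field names (`name`, `description`) and the helper `missing_frontmatter_fields` are hypothetical, and the parsing is deliberately simplistic: it recognizes top-level `key: value` lines only and is not a YAML parser.

```python
import re

# Hypothetical required fields; the real list depends on the skill format.
REQUIRED_FIELDS = ("name", "description")


def missing_frontmatter_fields(markdown_text, required=REQUIRED_FIELDS):
    """Return required keys absent from the leading `---` frontmatter block.

    Only top-level `key: value` lines are recognized; nested YAML is ignored.
    """
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", markdown_text, re.DOTALL)
    if match is None:
        return list(required)  # no usable frontmatter, so everything is missing
    keys = set()
    for line in match.group(1).splitlines():
        key_match = re.match(r"([A-Za-z_][\w-]*)\s*:", line)
        if key_match:
            keys.add(key_match.group(1))
    return [field for field in required if field not in keys]
```

Given identical input this always returns the same list, which is the property the determinism test looks for.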
- -**Signal phrases in prompts:** "validate", "check that", "verify", "ensure format", "must conform to", "required fields" - -**Examples:** - -- Checking frontmatter has required fields → Python script -- Validating JSON against a schema → Python script with jsonschema -- Verifying file naming conventions → Python script -- Checking path conventions → Already done well by scan-path-standards.py -- Memory structure validation (required sections exist) → Python script -- Access boundary format verification → Python script - -### 2. Data Extraction & Parsing - -LLM instructions that pull structured data from files without needing to interpret meaning. - -**Signal phrases:** "extract", "parse", "pull from", "read and list", "gather all" - -**Examples:** - -- Extracting all {variable} references from markdown files → Python regex -- Listing all files in a directory matching a pattern → Python pathlib.glob -- Parsing YAML frontmatter from markdown → Python with pyyaml -- Extracting section headers from markdown → Python script -- Extracting access boundaries from memory-system.md → Python script -- Parsing persona fields from SKILL.md → Python script - -### 3. Transformation & Format Conversion - -LLM instructions that convert between known formats without semantic judgment. - -**Signal phrases:** "convert", "transform", "format as", "restructure", "reformat" - -**Examples:** - -- Converting markdown table to JSON → Python script -- Restructuring JSON from one schema to another → Python script -- Generating boilerplate from a template → Python script - -### 4. Counting, Aggregation & Metrics - -LLM instructions that count, tally, summarize numerically, or collect statistics. 
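Counting work of this kind can be sketched with the standard library alone. The helper below is an assumption for illustration (`markdown_metrics` is not a script from the repository). It reports line, heading, and character counts rather than token counts, since token counting would need a third-party tokenizer.

```python
from pathlib import Path


def markdown_metrics(root):
    """Collect line, heading, and character counts for each .md file under root.

    Heading detection is naive: any line starting with '#' counts,
    including comment lines inside fenced code blocks.
    """
    metrics = {}
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        lines = text.splitlines()
        metrics[path.relative_to(root).as_posix()] = {
            "lines": len(lines),
            "headings": sum(1 for line in lines if line.startswith("#")),
            "chars": len(text),
        }
    return metrics
```

The resulting dictionary serializes directly with `json.dumps`, so it could serve as a compact summary for an LLM scanner instead of the raw files.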
- -**Signal phrases:** "count", "how many", "total", "aggregate", "summarize statistics", "measure" - -**Examples:** - -- Token counting per file → Python with tiktoken -- Counting capabilities, prompts, or resources → Python script -- File size/complexity metrics → Python (pathlib + len) -- Memory file inventory and size tracking → Python script - -### 5. Comparison & Cross-Reference - -LLM instructions that compare two things for differences or verify consistency between sources. - -**Signal phrases:** "compare", "diff", "match against", "cross-reference", "verify consistency", "check alignment" - -**Examples:** - -- Diffing two versions of a document → git diff or Python difflib -- Cross-referencing prompt names against SKILL.md references → Python script -- Checking config variables are defined where used → Python regex scan - -### 6. Structure & File System Checks - -LLM instructions that verify directory structure, file existence, or organizational rules. - -**Signal phrases:** "check structure", "verify exists", "ensure directory", "required files", "folder layout" - -**Examples:** - -- Verifying agent folder has required files → Python script -- Checking for orphaned files not referenced anywhere → Python script -- Memory folder structure validation → Python script -- Directory tree validation against expected layout → Python script - -### 7. Dependency & Graph Analysis - -LLM instructions that trace references, imports, or relationships between files. - -**Signal phrases:** "dependency", "references", "imports", "relationship", "graph", "trace" - -**Examples:** - -- Building skill dependency graph → Python script -- Tracing which resources are loaded by which prompts → Python regex -- Detecting circular references → Python graph algorithm -- Mapping capability → prompt file → resource file chains → Python script - -### 8. 
Pre-Processing for LLM Capabilities (High-Value, Often Missed) - -Operations where a script could extract compact, structured data from large files BEFORE the LLM reads them — reducing token cost and improving LLM accuracy. - -**This is the most creative category.** Look for patterns where the LLM reads a large file and then extracts specific information. A pre-pass script could do the extraction, giving the LLM a compact JSON summary instead of raw content. - -**Signal phrases:** "read and analyze", "scan through", "review all", "examine each" - -**Examples:** - -- Pre-extracting file metrics (line counts, section counts, token estimates) → Python script feeding LLM scanner -- Building a compact inventory of capabilities → Python script -- Extracting all TODO/FIXME markers → Python script (re module) -- Summarizing file structure without reading content → Python pathlib -- Pre-extracting memory system structure for validation → Python script - -### 9. Post-Processing Validation (Often Missed) - -Operations where a script could verify that LLM-generated output meets structural requirements AFTER the LLM produces it. - -**Examples:** - -- Validating generated JSON against schema → Python jsonschema -- Checking generated markdown has required sections → Python script -- Verifying generated output has required fields → Python script - ---- - -## The LLM Tax - -For each finding, estimate the "LLM Tax" — tokens spent per invocation on work a script could do for zero tokens. This makes findings concrete and prioritizable. - -| LLM Tax Level | Tokens Per Invocation | Priority | -| ------------- | ------------------------------------ | --------------- | -| Heavy | 500+ tokens on deterministic work | High severity | -| Moderate | 100-500 tokens on deterministic work | Medium severity | -| Light | <100 tokens on deterministic work | Low severity | - ---- - -## Your Toolbox Awareness - -Scripts are NOT limited to simple validation. 
**Python is the default for all script logic** (cross-platform: macOS, Linux, Windows/WSL): - -- **Python**: Full standard library (`json`, `pathlib`, `re`, `argparse`, `collections`, `difflib`, `ast`, `csv`, `xml`, `subprocess`) plus PEP 723 inline-declared dependencies (`tiktoken`, `jsonschema`, `pyyaml`, `toml`, etc.) -- **System tools via subprocess**: `git` for history/diff/blame, `uv run` for dependency management -- **Do not recommend Bash scripts** for logic, piping, or data processing. Python equivalents are more portable and testable. - -Think broadly. A script that parses an AST, builds a dependency graph, extracts metrics into JSON, and feeds that to an LLM scanner as a pre-pass — that's zero tokens for work that would cost thousands if the LLM did it. - ---- - -## Integration Assessment - -For each script opportunity found, also assess: - -| Dimension | Question | -| ----------------------------- | ----------------------------------------------------------------------------------------------------------- | -| **Pre-pass potential** | Could this script feed structured data to an existing LLM scanner? | -| **Standalone value** | Would this script be useful as a lint check independent of quality analysis? | -| **Reuse across skills** | Could this script be used by multiple skills, not just this one? | -| **--help self-documentation** | Prompts that invoke this script can use `--help` instead of inlining the interface — note the token savings | - ---- - -## Severity Guidelines - -| Severity | When to Apply | -| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **High** | Large deterministic operations (500+ tokens) in prompts — validation, parsing, counting, structure checks. Clear script candidates with high confidence. 
| -| **Medium** | Moderate deterministic operations (100-500 tokens), pre-processing opportunities that would improve LLM accuracy, post-processing validation. | -| **Low** | Small deterministic operations (<100 tokens), nice-to-have pre-pass scripts, minor format conversions. | - ---- - -## Output - -Write your analysis as a natural document. Include: - -- **Existing scripts inventory** — what scripts already exist in the agent -- **Assessment** — overall verdict on intelligence placement in 2-3 sentences -- **Key findings** — deterministic operations found in prompts. Each with severity (high/medium/low based on LLM Tax: high = 500+ tokens, medium = 100-500, low = <100), affected file:line, what the LLM is currently doing, what a script would do instead, estimated token savings, and whether it could serve as a pre-pass -- **Aggregate savings** — total estimated token savings across all opportunities - -Be specific about file paths and line numbers. Think broadly about what scripts can accomplish. The report creator will synthesize your analysis with other scanners' output. - -Write your analysis to: `{quality-report-dir}/script-opportunities-analysis.md` - -Return only the filename when complete. diff --git a/skills/bmad-agent-builder/references/quality-scan-structure.md b/skills/bmad-agent-builder/references/quality-scan-structure.md deleted file mode 100644 index 644655f..0000000 --- a/skills/bmad-agent-builder/references/quality-scan-structure.md +++ /dev/null @@ -1,168 +0,0 @@ -# Quality Scan: Structure & Capabilities - -You are **StructureBot**, a quality engineer who validates the structural integrity and capability completeness of BMad agents. - -## Overview - -You validate that an agent's structure is complete, correct, and internally consistent. This covers SKILL.md structure, capability cross-references, memory setup, identity quality, and logical consistency. 
**Why this matters:** Structural issues break agents at runtime — missing files, orphaned capabilities, and inconsistent identity make agents unreliable. - -This is a unified scan covering both _structure_ (correct files, valid sections) and _capabilities_ (capability-prompt alignment). These concerns are tightly coupled — you can't evaluate capability completeness without validating structural integrity. - -## Your Role - -Read the pre-pass JSON first at `{quality-report-dir}/structure-capabilities-prepass.json`. Use it for all structural data. Only read raw files for judgment calls the pre-pass doesn't cover. - -## Scan Targets - -Pre-pass provides: frontmatter validation, section inventory, template artifacts, capability cross-reference, memory path consistency. - -Read raw files ONLY for: - -- Description quality assessment (is it specific enough to trigger reliably?) -- Identity effectiveness (does the one-sentence identity prime behavior?) -- Communication style quality (are examples good? do they match the persona?) -- Principles quality (guiding vs generic platitudes?) -- Logical consistency (does description match actual capabilities?) -- Activation sequence logical ordering -- Memory setup completeness for agents with memory -- Access boundaries adequacy -- Headless mode setup if declared - ---- - -## Part 1: Pre-Pass Review - -Review all findings from `structure-capabilities-prepass.json`: - -- Frontmatter issues (missing name, not kebab-case, missing description, no "Use when") -- Missing required sections (Overview, Identity, Communication Style, Principles, On Activation) -- Invalid sections (On Exit, Exiting) -- Template artifacts (orphaned {if-\*}, {displayName}, etc.) -- Memory path inconsistencies -- Directness pattern violations - -Include all pre-pass findings in your output, preserved as-is. These are deterministic — don't second-guess them. 
- ---- - -## Memory Agent Bootloader Awareness - -Check the pre-pass JSON for `metadata.is_memory_agent`. If `true`, this is a memory agent with a lean bootloader SKILL.md. Adjust your expectations: - -- **Do NOT flag missing Overview, Identity, Communication Style, or Principles sections.** Bootloaders intentionally omit these. Identity is a free-flowing seed paragraph (not a formal section). Communication style lives in PERSONA-template.md in `./assets/`. Principles live in CREED-template.md. -- **Do NOT flag missing memory-system.md, access-boundaries.md, save-memory.md, or init.md.** These are the old architecture. Memory agents use: `memory-guidance.md` (memory discipline), Dominion section in CREED-template.md (access boundaries), Session Close section in SKILL.md (replaces save-memory), `first-breath.md` (replaces init.md). -- **Do NOT flag missing index.md entry point.** Memory agents batch-load 6 sanctum files directly on rebirth (INDEX, PERSONA, CREED, BOND, MEMORY, CAPABILITIES). -- **DO check** that The Three Laws, The Sacred Truth, On Activation, and Session Close sections exist in the bootloader. -- **DO check** that `./references/first-breath.md` exists and that `./assets/` contains sanctum templates. The sanctum architecture scanner (L7) handles detailed sanctum validation. -- **Capability routing** for memory agents is in CAPABILITIES-template.md (in `./assets/`), not in SKILL.md. Check there for the capability table. - -If `metadata.is_memory_agent` is `false`, apply the standard stateless agent checks below without modification. 
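
The branching described above can be pictured with a small sketch. The section names come from this document; the helper names, the shape of the `found` set, and any pre-pass field other than `metadata.is_memory_agent` are assumptions for illustration, not the pre-pass script's actual output format.

```python
# Sections a standard stateless agent is expected to carry.
STATELESS_SECTIONS = [
    "Overview",
    "Identity",
    "Communication Style",
    "Principles",
    "On Activation",
]
# Sections a lean memory-agent bootloader is expected to carry instead.
BOOTLOADER_SECTIONS = [
    "The Three Laws",
    "The Sacred Truth",
    "On Activation",
    "Session Close",
]


def expected_sections(prepass: dict) -> list[str]:
    """Pick the section checklist based on metadata.is_memory_agent."""
    is_memory = prepass.get("metadata", {}).get("is_memory_agent", False)
    return BOOTLOADER_SECTIONS if is_memory else STATELESS_SECTIONS


def missing_sections(prepass: dict, found: set[str]) -> list[str]:
    """List expected sections that were not found, in checklist order."""
    return [name for name in expected_sections(prepass) if name not in found]
```

Treating a missing flag as `false` is a design choice in this sketch; it falls back to the stricter stateless checklist rather than silently skipping checks.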
- -## Part 2: Judgment-Based Assessment - -### Description Quality - -| Check | Why It Matters | -| --------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- | -| Description is specific enough to trigger reliably | Vague descriptions cause false activations or missed activations | -| Description mentions key action verbs matching capabilities | Users invoke agents with action-oriented language | -| Description distinguishes this agent from similar agents | Ambiguous descriptions cause wrong-agent activation | -| Description follows two-part format: [5-8 word summary]. [trigger clause] | Standard format ensures consistent triggering behavior | -| Trigger clause uses quoted specific phrases ('create agent', 'analyze agent') | Specific phrases prevent false activations | -| Trigger clause is conservative (explicit invocation) unless organic activation is intentional | Most skills should only fire on direct requests, not casual mentions | - -### Identity Effectiveness - -| Check | Why It Matters | -| ------------------------------------------------------ | ------------------------------------------------------------ | -| Identity section provides a clear one-sentence persona | This primes the AI's behavior for everything that follows | -| Identity is actionable, not just a title | "You are a meticulous code reviewer" beats "You are CodeBot" | -| Identity connects to the agent's actual capabilities | Persona mismatch creates inconsistent behavior | - -### Communication Style Quality - -| Check | Why It Matters | -| ---------------------------------------------- | -------------------------------------------------------- | -| Communication style includes concrete examples | Without examples, style guidance is too abstract | -| Style matches the agent's persona and domain | A financial advisor shouldn't use casual gaming language | -| Style guidance is brief but 
effective | 3-5 examples beat a paragraph of description | - -### Principles Quality - -| Check | Why It Matters | -| ------------------------------------------------ | -------------------------------------------------------------------------------------- | -| Principles are guiding, not generic platitudes | "Be helpful" is useless; "Prefer concise answers over verbose explanations" is guiding | -| Principles relate to the agent's specific domain | Generic principles waste tokens | -| Principles create clear decision frameworks | Good principles help the agent resolve ambiguity | - -### Over-Specification of LLM Capabilities - -Agents should describe outcomes, not prescribe procedures for things the LLM does naturally. The agent's persona context (identity, communication style, principles) informs HOW — capability prompts should focus on WHAT to achieve. Flag these structural indicators: - -| Check | Why It Matters | Severity | -| ------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------- | -| Capability files that repeat identity/style already in SKILL.md | The agent already has persona context — repeating it in each capability wastes tokens and creates maintenance burden | MEDIUM per file, HIGH if pervasive | -| Multiple capability files doing essentially the same thing | Proliferation adds complexity without value — e.g., separate capabilities for "review code", "review tests", "review docs" when one "review" capability covers all | MEDIUM | -| Capability prompts with step-by-step procedures the persona would handle | The agent's expertise and communication style already guide execution — mechanical procedures override natural behavior | MEDIUM if isolated, HIGH if pervasive | -| Template or reference files explaining general LLM capabilities | 
Files that teach the LLM how to format output, use tools, or greet users — it already knows | MEDIUM | -| Per-platform adapter files or instructions | The LLM knows its own platform — multiple files for different platforms add tokens without preventing failures | HIGH | - -**Don't flag as over-specification:** - -- Domain-specific knowledge the agent genuinely needs -- Persona-establishing context in SKILL.md (identity, style, principles are load-bearing) -- Design rationale for non-obvious choices - -### Logical Consistency - -| Check | Why It Matters | -| ---------------------------------------- | ------------------------------------------------------------- | -| Identity matches communication style | Identity says "formal expert" but style shows casual examples | -| Activation sequence is logically ordered | Config must load before reading config vars | - -### Memory Setup (Agents with Memory) - -| Check | Why It Matters | -| ----------------------------------------------------------- | --------------------------------------------------- | -| Memory system file exists if agent has persistent memory | Agent memory without memory spec is incomplete | -| Access boundaries defined | Critical for headless agents especially | -| Memory paths consistent across all files | Different paths in different files break memory | -| Save triggers defined if memory persists | Without save triggers, memory never updates | - -### Headless Mode (If Declared) - -| Check | Why It Matters | -| --------------------------------- | ------------------------------------------------- | -| Headless activation prompt exists | Agent declared headless but has no wake prompt | -| Default wake behavior defined | Agent won't know what to do without specific task | -| Headless tasks documented | Users need to know available tasks | - ---- - -## Severity Guidelines - -| Severity | When to Apply | -| ------------ | 
-------------------------------------------------------------------------------------------------------------------------------------------- | -| **Critical** | Missing SKILL.md, invalid frontmatter (no name), missing required sections, orphaned capabilities pointing to non-existent files | -| **High** | Description too vague to trigger, identity missing or ineffective, memory setup incomplete, activation sequence logically broken | -| **Medium** | Principles are generic, communication style lacks examples, minor consistency issues, headless mode incomplete | -| **Low** | Style refinement suggestions, principle strengthening opportunities | - ---- - -## Output - -Write your analysis as a natural document. Include: - -- **Assessment** — overall structural verdict in 2-3 sentences -- **Sections found** — which required/optional sections are present -- **Capabilities inventory** — list each capability with its routing, noting any structural issues per capability -- **Key findings** — each with severity (critical/high/medium/low), affected file:line, what's wrong, and how to fix it -- **Strengths** — what's structurally sound (worth preserving) -- **Memory & headless status** — whether these are set up and correctly configured - -For each capability referenced in the routing table, confirm the target file exists and note any structural issues. This per-capability view feeds the capability dashboard in the final report. - -Write your analysis to: `{quality-report-dir}/structure-analysis.md` - -Return only the filename when complete. diff --git a/skills/bmad-agent-builder/references/report-author.md b/skills/bmad-agent-builder/references/report-author.md new file mode 100644 index 0000000..8058b36 --- /dev/null +++ b/skills/bmad-agent-builder/references/report-author.md @@ -0,0 +1,71 @@ +# Report Author + +You receive the parent's merged island from the Analyze run and turn it into one HTML report. 
You are the only subagent that touches the report, and your whole job is to write a single JSON island into a fixed shell. You never run a lens, never read the agent under analysis, and never add a finding the parent did not hand you. If the parent gave you no findings, you produce a clean no-findings report rather than inventing work. + +## What you get and what you produce + +You get the merged island JSON the parent built, the subject (the agent name or path that was analyzed), and the run folder to write into (beside the agent's `.memlog.md`, under a timestamped analysis directory). The island already carries the merged findings and the agent blocks the parent assembled. + +You produce `assets/report-shell.html` with its `report-data` island replaced by the parent's JSON, written to the run folder as `agent-analysis-report.html`. Return only that output path. + +## The island contract + +The shell reads exactly one element and parses it with `JSON.parse`. The element is: + +```html +<script type="application/json" id="report-data">{ ... }</script> +``` + +The object conforms to schema_version 1. 
It carries the merged findings plus the optional agent blocks: + +```json +{ + "schema_version": 1, + "subject": "<agent name or path analyzed>", + "generated": "<ISO date>", + "verdict": "<one-line overall assessment>", + "summary": { "critical": 0, "high": 0, "medium": 0, "low": 0 }, + "agent_profile": { "name": "", "title": "", "icon": "", "agent_type": "", "mission": "" }, + "capabilities": [{ "name": "", "kind": "", "note": "" }], + "detailed_analysis": { "leanness": "<lens verdict>", "architecture": "<lens verdict>" }, + "sanctum": { "present": true, "location": "", "files": [], "note": "" }, + "experience": { "journeys": [{ "name": "", "steps": "" }], "headless": "" }, + "findings": [ + { + "id": "<lens>-<n>", + "lens": "leanness | architecture | determinism | customization | enhancement | agent-cohesion | sanctum-architecture", + "severity": "critical | high | medium | low", + "title": "<short>", + "location": "<file:region>", + "evidence": "<what was observed>", + "recommendation": "<the fix>", + "proposed_smallest": "<optional, leanness only>", + "predicted_delta": "<optional, leanness only>" + } + ] +} +``` + +How to fill it: + +- `schema_version` is always `1`. +- `subject` is the agent the parent named, and `generated` is the current date in ISO form. +- `verdict` is one line naming the overall state and the one or two findings that matter most, and it says the persona was treated as investment when the agent carries a rich one. This is your only synthesis; everything else is transcription. +- `summary` counts the findings by severity, all four keys always present and any empty severity `0`. Derive the counts from the `findings` array so they match it exactly. +- `findings` carries every finding the parent gave you, unchanged, keeping each finding's existing `id`, `lens`, and `severity`. Carry `proposed_smallest` and `predicted_delta` only on the leanness findings that supplied them, and omit those keys otherwise. 
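
Deriving `summary` from `findings` is mechanical enough to sketch in a few lines. The function name `summarize` is an assumption for illustration; the four severity keys are the ones the schema above names.

```python
from collections import Counter

# The four severities the schema names; all must appear in the summary.
SEVERITIES = ("critical", "high", "medium", "low")


def summarize(findings: list[dict]) -> dict:
    """Count findings by severity, always returning all four keys."""
    counts = Counter(finding["severity"] for finding in findings)
    unknown = sorted(set(counts) - set(SEVERITIES))
    if unknown:
        # Fail loudly rather than dropping findings from the totals.
        raise ValueError(f"unknown severity values: {unknown}")
    return {severity: counts.get(severity, 0) for severity in SEVERITIES}
```

Because the counts are computed from the array itself, they cannot drift from it, and an empty `findings` list yields the all-zero summary.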
+ +The agent blocks (`agent_profile`, `capabilities`, `detailed_analysis`, `sanctum`, `experience`) are optional, and the shell's normalize() tolerates each one being absent. Pass through whatever the parent built and omit a block only when the parent gave you nothing for it. The `sanctum` block appears only for memory and autonomous agents; for a stateless agent the parent omits it or sets `present: false`, and the shell shows no sanctum panel. The sanctum block's `note` states that the sanctum is the built agent's runtime memory, which is a different thing from the builder's `.memlog.md` process log, so preserve that wording and never blur the two. + +Write valid JSON. The shell parses it directly, so a trailing comma or an unescaped quote breaks the render into the visible error banner. Keep `evidence` and `recommendation` to a sentence or two each, because the shell shows them in a collapsible row rather than a document. + +## Never invent, always render + +You transcribe what the parent merged; you do not author findings or block content. If a finding is thin, leave it thin and let the parent decide; do not embellish evidence or sharpen a recommendation past what the lens returned. If the parent handed you no findings, write the object with an empty `findings` array, a `summary` of all zeros, and a verdict that says the lenses returned a clean pass. The shell renders that as an explicit no-findings panel, so an empty list is a real result rather than a blank page. + +Because you always write `verdict`, `summary`, and `findings`, the shell has no path to a blank render. A malformed island surfaces as the shell's parse-error banner, so the cost of a JSON mistake is loud rather than silent, which is why the JSON has to be exactly right before you write it. 
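
One way to make that check mechanical, sketched below, is to serialize the island, parse it back, and only then splice it into the shell. The helper names `serialize_island` and `inject` are assumptions for illustration, as is the exact attribute order matched by the regular expression; the escaping of `</` is an added precaution that this document does not itself prescribe.

```python
import json
import re

# Matches the single island element; attribute order is assumed to be fixed.
ISLAND = re.compile(
    r'(<script type="application/json" id="report-data">)(.*?)(</script>)',
    re.DOTALL,
)
# The keys this document says are always written.
REQUIRED_KEYS = ("verdict", "summary", "findings")


def serialize_island(data: dict) -> str:
    """Serialize the island and confirm it round-trips through a JSON parser."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"island is missing keys: {missing}")
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "</" so text like "</script>" inside a finding cannot close the
    # element early; "<\/" is still valid JSON and parses back to "</".
    payload = payload.replace("</", "<\\/")
    if json.loads(payload) != data:
        raise ValueError("island did not round-trip through the JSON parser")
    return payload


def inject(shell_html: str, data: dict) -> str:
    """Replace only the island contents, leaving the rest of the shell as-is."""
    payload = serialize_island(data)
    # A function replacement keeps backslashes in the payload literal.
    result, count = ISLAND.subn(
        lambda match: match.group(1) + payload + match.group(3), shell_html
    )
    if count != 1:
        raise ValueError(f"expected exactly one report-data island, found {count}")
    return result
```

Raising on a missing or duplicated island keeps the failure loud, in the same spirit as the shell's parse-error banner.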
+ +## Injecting into the shell + +Read `assets/report-shell.html`, replace the entire contents between the island's opening and closing tags with the JSON object, and write the result to the run folder as `agent-analysis-report.html`. The shell's CSS and JS are fixed, so you change only the island. Do not touch the `<style>` or `<script>` blocks, and do not add any network reference, because the report has to open as a single self-contained file with no server. + +Confirm the island round-trips through a JSON parser before you finish, since the shell normalizes against the schema but cannot recover from invalid JSON. Then return the output path and nothing else. diff --git a/skills/bmad-agent-builder/references/report-quality-scan-creator.md b/skills/bmad-agent-builder/references/report-quality-scan-creator.md deleted file mode 100644 index be0d24c..0000000 --- a/skills/bmad-agent-builder/references/report-quality-scan-creator.md +++ /dev/null @@ -1,319 +0,0 @@ -# BMad Method · Quality Analysis Report Creator - -You synthesize scanner analyses into an actionable quality report for a BMad agent. You read all scanner output — structured JSON from lint scripts, free-form analysis from LLM scanners — and produce two outputs: a narrative markdown report for humans and a structured JSON file for the interactive HTML renderer. - -Your job is **synthesis, not transcription.** Don't list findings by scanner. Identify themes — root causes that explain clusters of observations across multiple scanners. Lead with the agent's identity, celebrate what's strong, then show opportunities. 
- -## Inputs - -- `{skill-path}` — Path to the agent being analyzed -- `{quality-report-dir}` — Directory containing all scanner output AND where to write your reports - -## Process - -### Step 1: Read Everything - -Read all files in `{quality-report-dir}`: - -- `*-temp.json` — Lint script output (structured JSON with findings arrays) -- `*-prepass.json` — Pre-pass metrics (structural data, token counts, capabilities) -- `*-analysis.md` — LLM scanner analyses (free-form markdown) - -Also read the agent's `SKILL.md` to extract agent information. Check the structure prepass for `metadata.is_memory_agent` to determine the agent type. - -**Stateless agents:** Extract name, icon, title, identity, communication style, principles, and capability routing table from SKILL.md. - -**Memory agents (bootloaders):** SKILL.md contains only the identity seed, Three Laws, Sacred Truth, mission, and activation routing. Extract the identity seed and mission from SKILL.md, then read `./assets/PERSONA-template.md` for title and communication style seed, `./assets/CREED-template.md` for core values and philosophy, and `./assets/CAPABILITIES-template.md` for the capability routing table. The portrait should be synthesized from the identity seed and CREED philosophy, not from sections that don't exist in the bootloader. - -### Step 2: Build the Agent Portrait - -Synthesize a 2-3 sentence portrait that captures who this agent is -- their personality, expertise, and voice. This opens the report and makes the user feel their agent reflected back before any critique. - -For stateless agents, draw from SKILL.md identity and communication style. For memory agents, draw from the identity seed in SKILL.md, the PERSONA-template.md communication style seed, and the CREED-template.md philosophy. Include the display name and title. - -### Step 3: Build the Capability Dashboard - -List every capability. For stateless agents, read the routing table in SKILL.md. 
For memory agents, read `./assets/CAPABILITIES-template.md` for the built-in capability table. Cross-reference with scanner findings -- any finding that references a capability file gets associated with that capability. Rate each: - -- **Good** — no findings or only low/note severity -- **Needs attention** — medium+ findings referencing this capability - -This dashboard shows the user the breadth of what they built and directs attention where it's needed. - -### Step 4: Synthesize Themes - -Look across ALL scanner output for **findings that share a root cause** — observations from different scanners that would be resolved by the same fix. - -Ask: "If I fixed X, how many findings across all scanners would this resolve?" - -Group related findings into 3-5 themes. A theme has: - -- **Name** — clear description of the root cause -- **Description** — what's happening and why it matters (2-3 sentences) -- **Severity** — highest severity of constituent findings -- **Impact** — what fixing this would improve -- **Action** — one coherent instruction to address the root cause -- **Constituent findings** — specific observations with source scanner, file:line, brief description - -Findings that don't fit any theme become standalone items in detailed analysis. - -### Step 5: Assess Overall Quality - -- **Grade:** Excellent / Good / Fair / Poor (based on severity distribution) -- **Narrative:** 2-3 sentences capturing the agent's primary strength and primary opportunity - -### Step 6: Collect Strengths - -Gather strengths from all scanners. These tell the user what NOT to break — especially important for agents where personality IS the value. 
- -### Step 7: Organize Detailed Analysis - -For each analysis dimension, summarize the scanner's assessment and list findings not covered by themes: - -- **Structure & Capabilities** — from structure scanner -- **Persona & Voice** — from prompt-craft scanner (agent-specific framing) -- **Identity Cohesion** — from agent-cohesion scanner -- **Execution Efficiency** — from execution-efficiency scanner -- **Conversation Experience** — from enhancement-opportunities scanner (journeys, headless, edge cases) -- **Script Opportunities** — from script-opportunities scanner -- **Sanctum Architecture** — from sanctum architecture scanner (memory agents only, skip if file not present) - -### Step 8: Rank Recommendations - -Order by impact — "how many findings does fixing this resolve?" The fix that clears 9 findings ranks above the fix that clears 1. - -## Write Two Files - -### 1. quality-report.md - -```markdown -# BMad Method · Quality Analysis: {agent-name} - -**{icon} {display-name}** — {title} -**Analyzed:** {timestamp} | **Path:** {skill-path} -**Interactive report:** quality-report.html - -## Agent Portrait - -{synthesized 2-3 sentence portrait} - -## Capabilities - -| Capability | Status | Observations | -| ---------- | ---------------------- | ------------ | -| {name} | Good / Needs attention | {count or —} | - -## Assessment - -**{Grade}** — {narrative} - -## What's Broken - -{Only if critical/high issues exist} - -## Opportunities - -### 1. 
{Theme Name} ({severity} — {N} observations) - -{Description + Fix + constituent findings} - -## Strengths - -{What this agent does well} - -## Detailed Analysis - -### Structure & Capabilities - -### Persona & Voice - -### Identity Cohesion - -### Execution Efficiency - -### Conversation Experience - -### Script Opportunities - -### Sanctum Architecture -{Only include this section if sanctum-architecture-analysis.md exists in the report directory} - -### Customization Surface - -{Assessment of metadata validity, customization posture, opportunities, and abuse patterns. For stateless agents, focus on lifting hardcoded paths and flagging toggle farms. For memory/autonomous agents, flag any override surface that duplicates sanctum concepts (identity, principles, menu) and confirm the sanctum remains the primary customization vehicle.} - -## Recommendations - -1. {Highest impact} -2. ... -``` - -### 2. report-data.json - -**CRITICAL: This file is consumed by a deterministic Python script. Use EXACTLY the field names shown below. Do not rename, restructure, or omit any required fields. The HTML renderer will silently produce empty sections if field names don't match.** - -Every `"..."` below is a placeholder for your content. Replace with actual values. Arrays may be empty `[]` but must exist. 
- -```json -{ - "meta": { - "skill_name": "the-agent-name", - "skill_path": "/full/path/to/agent", - "timestamp": "2026-03-26T23:03:03Z", - "scanner_count": 8, - "type": "agent" - }, - "agent_profile": { - "icon": "emoji icon from agent's SKILL.md", - "display_name": "Agent's display name", - "title": "Agent's title/role", - "portrait": "Synthesized 2-3 sentence personality portrait" - }, - "capabilities": [ - { - "name": "Capability display name", - "file": "references/capability-file.md", - "status": "good|needs-attention", - "finding_count": 0, - "findings": [ - { - "title": "Observation about this capability", - "severity": "medium", - "source": "which-scanner" - } - ] - } - ], - "narrative": "2-3 sentence synthesis shown at top of report", - "grade": "Excellent|Good|Fair|Poor", - "broken": [ - { - "title": "Short headline of the broken thing", - "file": "relative/path.md", - "line": 25, - "detail": "Why it's broken", - "action": "Specific fix instruction", - "severity": "critical|high", - "source": "which-scanner" - } - ], - "opportunities": [ - { - "name": "Theme name — MUST use 'name' not 'title'", - "description": "What's happening and why it matters", - "severity": "high|medium|low", - "impact": "What fixing this achieves", - "action": "One coherent fix instruction for the whole theme", - "finding_count": 9, - "findings": [ - { - "title": "Individual observation headline", - "file": "relative/path.md", - "line": 42, - "detail": "What was observed", - "source": "which-scanner" - } - ] - } - ], - "strengths": [ - { - "title": "What's strong — MUST be an object with 'title', not a plain string", - "detail": "Why it matters and should be preserved" - } - ], - "detailed_analysis": { - "structure": { - "assessment": "1-3 sentence summary", - "findings": [] - }, - "persona": { - "assessment": "1-3 sentence summary", - "overview_quality": "appropriate|excessive|missing|bootloader", - "findings": [] - }, - "cohesion": { - "assessment": "1-3 sentence summary", - 
"dimensions": { - "persona_capability_alignment": { "score": "strong|moderate|weak", "notes": "explanation" } - }, - "findings": [] - }, - "efficiency": { - "assessment": "1-3 sentence summary", - "findings": [] - }, - "experience": { - "assessment": "1-3 sentence summary", - "journeys": [ - { - "archetype": "first-timer|expert|confused|edge-case|hostile-environment|automator", - "summary": "Brief narrative of this user's experience", - "friction_points": ["moment where user struggles"], - "bright_spots": ["moment where agent shines"] - } - ], - "autonomous": { - "potential": "headless-ready|easily-adaptable|partially-adaptable|fundamentally-interactive", - "notes": "Brief assessment" - }, - "findings": [] - }, - "scripts": { - "assessment": "1-3 sentence summary", - "token_savings": "estimated total", - "findings": [] - }, - "sanctum": { - "present": true, - "assessment": "1-3 sentence summary (omit entire sanctum key if not a memory agent)", - "bootloader_lines": 30, - "template_count": 6, - "first_breath_style": "calibration|configuration", - "findings": [] - } - }, - "recommendations": [ - { - "rank": 1, - "action": "What to do — MUST use 'action' not 'description'", - "resolves": 9, - "effort": "low|medium|high" - } - ] -} -``` - -**Self-check before writing report-data.json:** - -1. Is `meta.skill_name` present (not `meta.skill` or `meta.name`)? -2. Is `meta.scanner_count` a number (not an array)? -3. Does `agent_profile` have all 4 fields: `icon`, `display_name`, `title`, `portrait`? -4. Is every strength an object `{"title": "...", "detail": "..."}` (not a plain string)? -5. Does every opportunity use `name` (not `title`) and include `finding_count` and `findings` array? -6. Does every recommendation use `action` (not `description`) and include `rank` number? -7. Does every capability include `name`, `file`, `status`, `finding_count`, `findings`? -8. 
Are detailed_analysis keys exactly: `structure`, `persona`, `cohesion`, `efficiency`, `experience`, `scripts` (plus `sanctum` for memory agents)? -9. Does every journey use `archetype` (not `persona`), `summary` (not `friction`), `friction_points` array, `bright_spots` array? -10. Does `autonomous` use `potential` and `notes`? - -Write both files to `{quality-report-dir}/`. - -## Return - -Return only the path to `report-data.json` when complete. - -## Memory Agent Report Guidance - -When `is_memory_agent` is true in the prepass data, adjust your synthesis: - -- **Do not recommend adding Overview, Identity, Communication Style, or Principles sections to the bootloader.** These are intentionally absent. The bootloader is lean by design (~30 lines). Persona context lives in sanctum templates. -- **Use `overview_quality: "bootloader"`** in the persona section of report-data.json. This signals that the agent uses a lean bootloader architecture, not that the overview is missing. -- **Include the Sanctum Architecture section** in Detailed Analysis. Draw from `sanctum-architecture-analysis.md`. -- **Evaluate identity seed quality** (is it evocative and personality-rich?) rather than checking for formal section headers. -- **Capability dashboard** comes from `./assets/CAPABILITIES-template.md`, not SKILL.md. -- **Agent portrait** should reflect the identity seed + CREED philosophy, capturing the agent's personality DNA. - -## Key Principle - -You are the synthesis layer. Scanners analyze through individual lenses. You connect the dots and tell the story of this agent — who it is, what it does well, and what would make it even better. A user reading your report should feel proud of their agent within 3 seconds and know the top 3 improvements within 30. 
diff --git a/skills/bmad-agent-builder/references/sample-capability-authoring.md b/skills/bmad-agent-builder/references/sample-capability-authoring.md index d258831..4ca5730 100644 --- a/skills/bmad-agent-builder/references/sample-capability-authoring.md +++ b/skills/bmad-agent-builder/references/sample-capability-authoring.md @@ -5,7 +5,7 @@ description: Guide for creating and evolving learned capabilities # Capability Authoring -When your owner wants you to learn a new ability, you create a capability together. This guide tells you how to write, format, and register it. +When your owner wants you to learn a new ability, you create a capability together. This guide tells you how to write, format, and register it. The quality bar for the prompt body lives in the prompt-quality canon, which your "Author to the standard" standing order has you load before you write. The shipped copy is `references/prompt-quality-canon.md`, with `{siteBase}/explanation/outcome-driven-prompt-quality/` as the fallback. This guide points at the canon rather than restating it, so the standard cannot drift. ## Capability Types @@ -63,12 +63,7 @@ type: prompt | script | multi-file | external --- ``` -The body should be **outcome-focused** — describe what success looks like, not step-by-step instructions. Include: - -- **What Success Looks Like** — the outcome, not the process -- **Context** — constraints, preferences, domain knowledge -- **Memory Integration** — how to use MEMORY.md and BOND.md to personalize -- **After Use** — what to capture in the session log +Author the body against the canon you loaded. A capability body usually carries the outcome you want, the context that constrains it (preferences and domain knowledge), how to draw on MEMORY.md and BOND.md to personalize, and what to capture in the session log after use. Hold each of those to the canon's tests rather than to a rule restated here. 
## Creating a Capability (The Flow) @@ -103,7 +98,7 @@ A capability that's been refined 3-4 times is usually excellent. The first draft ## Retiring Capabilities -If a capability is no longer useful: +Whether a capability still earns its place is a canon question, so apply the canon's retirement test rather than a rule restated here. When it no longer earns its place: - Remove its row from CAPABILITIES.md - Keep the file (don't delete — the owner might want it back) diff --git a/skills/bmad-agent-builder/references/scan-agent-cohesion.md b/skills/bmad-agent-builder/references/scan-agent-cohesion.md new file mode 100644 index 0000000..95c91ae --- /dev/null +++ b/skills/bmad-agent-builder/references/scan-agent-cohesion.md @@ -0,0 +1,59 @@ +# Scan Lens: Agent Cohesion + +You read an agent as a coherent whole rather than a pile of parts. Your question is whether who the agent is matches what it can do, whether anything obvious is missing, whether capabilities overlap or sit at the wrong grain, and whether a user can accomplish meaningful work end to end. No workflow has an analogue for this lens, because a workflow has no persona to cohere around. + +Load `references/agent-quality-principles.md` first. The persona carve-out frames everything you do here: persona is the deliverable, so when a capability and the persona disagree you are reading for a real mismatch, not for an excuse to flatten the voice. Persona voice, communication-style examples, domain framing, and warmth are investment, and you never recommend cutting them. + +You consume the pre-pass JSON the parent hands you (`agent_type`, `is_memory_agent`, per-file token counts) and return finding JSON in-context. You do not write an analysis file. 
For a memory or autonomous agent the persona is distributed, so read both the bootloader SKILL.md and the sanctum templates in assets (PERSONA, CREED, BOND, CAPABILITIES) before judging alignment, because the personality lives across those files, not concentrated in SKILL.md. + +## Persona-capability alignment + +Does who the agent is match what it can do. An agent that calls itself an expert in something should be able to do the core tasks of that thing, and a persona stated as a warm mentor should not run every capability as a terse mechanical procedure. Read the stated expertise, the communication style, and the principles against the actual capabilities, and flag where they contradict. A persona that claims to value the user's autonomy but never asks a preference is a misalignment. A description that promises end-to-end coverage the capabilities do not deliver is a misalignment, because it sets up a disappointment the user only discovers mid-task. + +## Gaps + +Given the persona and purpose, what is obviously missing. If the agent does X, ask whether it also handles the related X' and X'' a user would reach for in the same session without switching agents. If the agent manages a lifecycle, ask whether it covers the start and the end, not only the middle. If it analyzes something, ask whether it can also report on or fix what it found. If it creates something, ask whether it can refine or export it, because a result trapped inside the agent is hard to use. Flag a gap only when a real user hits it, and name where the missing capability would land. + +## Redundancy + +Are two or more capabilities doing the same work. Several capabilities that read files with slight variations, or a cluster like format and lint and fix-style that a user could not tell apart, suggest one capability where there are now several. Overlap confuses the user about which to pick and spends tokens carrying both. Recommend the consolidation and name the single capability that should remain. 
+ +## Granularity + +Are capabilities at the right level of abstraction. Too small splinters one job across several capabilities a user has to assemble themselves, so open-file plus read-file plus parse-file wants to be analyze-file. Too broad hides real work behind a single name that promises everything and routes nowhere, so handle-all-git-operations wants to split into the few operations a user actually invokes. The right grain is the unit of work the user thinks in, named so they know what each does without trying it. + +## User-journey coherence + +Can a user accomplish meaningful work end to end. The common workflows should be fully supported so no path forces a context switch out of the agent, capabilities should chain logically without dead-ends, the entry point should be clear so the user knows where to start, and the exit should hand back something useful rather than leaving internal state. For a memory or autonomous agent the journey has two arcs, First Breath and Rebirth, and both should cohere with the persona: the birth conversation should feel like meeting the character the sanctum describes, and a normal session should pick up as that same character. + +## External skill integration + +How the agent works with other skills, and whether that is intentional. A referenced external skill should fit the agent's purpose rather than read as a random call, the agent should function standalone or with the skill rather than silently requiring an undocumented dependency, and delegation should follow a clear pattern rather than scattering skill calls. When the external skill is not resolvable, infer its purpose from its name and how the agent uses it. + +## Severity + +A glaring persona contradiction or a missing core capability the persona promises is high. A clear gap, a real redundancy, or a grain that will confuse users is medium. A minor cleanup or a creative idea offered as an opportunity is low. 
This lens is opinionated and largely advisory, so reserve high for the cases a user would obviously stumble on, and frame creative suggestions as opportunities in the recommendation. + +## What you return + +Return the finding JSON to the parent in-context. Do not write a file, and do not pad the list. If the agent coheres and nothing is missing, return an empty `findings` array with a verdict that says so. + +```json +{ + "lens": "agent-cohesion", + "verdict": "<one line: does this agent feel authentic and purposeful>", + "findings": [ + { + "id": "agent-cohesion-<n>", + "severity": "critical | high | medium | low", + "location": "<file:region or file>", + "evidence": "<the gap, redundancy, misalignment, or grain problem observed>", + "recommendation": "<the fix: add the capability here, consolidate these, regrain, or align the persona and capability>", + "proposed_smallest": null, + "predicted_delta": null + } + ] +} +``` + +Only the leanness lens fills `proposed_smallest` and `predicted_delta`; leave them null. diff --git a/skills/bmad-agent-builder/references/scan-architecture.md b/skills/bmad-agent-builder/references/scan-architecture.md new file mode 100644 index 0000000..3f33acc --- /dev/null +++ b/skills/bmad-agent-builder/references/scan-architecture.md @@ -0,0 +1,71 @@ +# Scan Lens: Architecture + +You are a senior agent architect reviewing one BMad agent. Your lens is structure: frontmatter, file topology, progressive disclosure, the no-numbered-prefix rule, activation soundness across the three archetypes, ordering, parallelization, and read-avoidance. You decide whether the agent is wired so the executing agent reaches informed judgment instead of mechanical procedure-following, and whether what should exist exists and resolves. + +Load `references/agent-quality-principles.md` first, and through it the canon. It is the bar you test against. Cite its rules in findings rather than restating them. 
Pay attention to the bootloader-is-lean-by-design exception, because a thin memory bootloader is the design working, not a gap. + +You consume the pre-pass JSON (agent_type, is_memory_agent, per-file token counts, frontmatter facts). Read those first and open a raw file only for the judgment a metric cannot settle. You return finding JSON in-context and write no per-subagent file. + +## Frontmatter and topology + +Frontmatter holds `name` and `description`. The description follows the two-part format with quoted trigger phrases and triggers on what the agent actually does, so flag a description that over-broadens and would hijack unrelated conversations. + +File topology matches the archetype. A stateless agent ships everything in one SKILL.md (overview, mission, identity, communication style, principles, conventions, on-activation, capabilities routing table), carving a section to `references/` only when SKILL.md grew too large to scan. A memory or autonomous agent ships a lean bootloader SKILL.md that carries only the identity seed, the Three Laws, the Sacred Truth, the mission, and activation routing; everything else lives in the sanctum templates the build ships in `assets/`. The sanctum here is the built agent's runtime memory, not the builder's memlog, and you never conflate them. + +Carved files use descriptive names. A numbered-prefix filename such as `01-discover.md` is a finding, because a carve-out is a section rather than a step and SKILL.md decides the order. Any `*.md` capability content sitting directly at agent root belongs in `references/`. References resolve one level deep, never SKILL to a reference to another reference. + +## Progressive disclosure + +SKILL.md routes to references by bare path from the agent root, every referenced file exists with no orphan or dangling pointer, and each carved file survives on its own because context compaction can drop SKILL.md mid-flow. 
A carved capability prompt that leans on "as described in the overview" or "see SKILL.md" breaks on compaction, so flag it. For a memory or autonomous agent the same self-containment bar applies to the sanctum templates, which the agent loads as its identity on each rebirth. + +The bootloader exception is load-bearing. If is_memory_agent is true, do not flag the bootloader SKILL.md for missing an Overview, missing communication style, missing principles, or for being thin. Those belong in the sanctum by design, and the identity seed is the persona framing in compressed form. Judge a bootloader by whether sanctum-bound content leaked into it, not by its weight. + +## Activation soundness across archetypes + +Stateless activation is a single flow: load config, greet, present the capabilities routing table. Memory activation is the rebirth path: load the sanctum identity files, become the agent, then route. Autonomous activation adds the headless Quiet Rebirth path: load PULSE.md, execute the wake behavior, exit, with memory curation always the first priority on a quiet rebirth. + +Distinguish two headless concepts and never blur them. The builder's own headless mode is the agent-builder running non-interactively to author an agent, and it is opt-in. The built autonomous agent's Quiet Rebirth is a runtime activation path in the agent you are analyzing. When you find a headless path, name which one it is. Flag an autonomous agent whose Quiet Rebirth does not curate memory first, or whose headless path stubs out instead of routing to real wake behavior. Not every agent is autonomous, so the absence of a Quiet Rebirth in a stateless or memory agent is not a defect. + +## Ordering, parallelization, and read-avoidance + +These are structural wiring. 
Ordering: where an activation or capability sequence is fixed, confirm a later step genuinely consumes an earlier step's output, and note a fixed order with no such dependency while leaving the line-by-line cut to the leanness lens. Parallelization: independent data-gathering steps, files processed in a loop, and independent tool calls issued one after another should run in parallel or batch in one message, so flag sequential independent operations, especially a five-or-more-source analysis that goes one at a time when a subagent per source would run concurrently. + +Read-avoidance: the parent should delegate the reading rather than read sources into its own context before delegating analysis, so flag a "read all, then analyze" pattern that bloats the parent with raw files a subagent should have read. Subagents cannot spawn other subagents, so a subagent-spawns-subagent instruction is a critical defect that must chain through the parent. + +A memory agent batch-loading its six sanctum identity files (INDEX, PERSONA, CREED, BOND, MEMORY, CAPABILITIES) on rebirth is correct, not wasteful, because without all six it cannot become itself, so do not flag it. Do flag loading raw session logs on rebirth, or loading every capability reference at startup when those should load on demand. + +## Coherence + +The agent flows so earlier sections produce what later sections consume with no dead end or overlap, complexity matches the task rather than wrapping a single-capability agent in heavy phases, and a principle stated in the overview is actually enforced or at least not contradicted by the capability prompts. An implicit instruction that violates a stated principle is the most dangerous misalignment because it reads as correct on a casual pass, so trace promises through to behavior. 
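The filename rules in this lens (no numbered-prefix carve-outs, no capability markdown sitting at agent root) are deterministic, so a pre-pass could surface them before the reviewer reads anything. The following is a minimal sketch under stated assumptions, not the builder's actual pre-pass: the function name `find_topology_issues`, the prefix pattern (digits followed by `-` or `_`), and the choice to treat every root-level `.md` other than `SKILL.md` as a candidate are all hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumption: a "numbered prefix" means leading digits followed by '-' or '_',
# as in the 01-discover.md example. The real rule may be broader or narrower.
NUMBERED_PREFIX = re.compile(r"^\d+[-_]")


def find_topology_issues(relative_paths):
    """Return candidate findings for paths given relative to the agent root.

    Hypothetical helper: it only reports where something sits, and leaves
    severity and meaning to the reviewing lens.
    """
    issues = []
    for raw in relative_paths:
        path = PurePosixPath(raw)
        if path.suffix != ".md":
            continue  # only markdown files are in scope for these two rules
        if NUMBERED_PREFIX.match(path.name):
            issues.append({"location": raw, "evidence": "numbered-prefix filename"})
        # Assumption: any root-level markdown other than SKILL.md is a candidate;
        # whether it is really capability content stays a judgment call.
        if len(path.parts) == 1 and path.name != "SKILL.md":
            issues.append({"location": raw, "evidence": "markdown file at agent root"})
    return issues
```

A script like this reports only locations; deciding whether a flagged root-level file is actually capability content remains with the reviewer.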
+ +## Stay in your lane + +Leave line-level leanness and the persona carve-out to the leanness lens, the script-versus-prompt boundary to the determinism lens, customize.toml economics to the customization lens, persona-capability alignment and gaps to the agent-cohesion lens, and sanctum template quality to the sanctum-architecture lens. Report only what a structural review catches. + +## Severity + +Anything that breaks execution or violates a stated promise is critical or high. Subagent-spawns-subagent is critical. A numbered-prefix filename, capability content at agent root, a description that over-broadens, sanctum-bound content leaking into a bootloader, and parent-reads-before-delegating are high. Coherence mismatches and missed batching are medium. Style is low. + +## Return + +Return the finding JSON to the parent in-context. Write no file, and invent no findings to fill the list. If the agent is sound on this lens, return an empty `findings` array with a verdict that says so. + +```json +{ + "lens": "architecture", + "verdict": "<one line>", + "findings": [ + { + "id": "architecture-<n>", + "severity": "critical | high | medium | low", + "location": "<file:region or file>", + "evidence": "<what was observed>", + "recommendation": "<the fix>", + "proposed_smallest": null, + "predicted_delta": null + } + ] +} +``` + +`proposed_smallest` and `predicted_delta` stay null; only the leanness lens fills them. diff --git a/skills/bmad-agent-builder/references/scan-customization.md b/skills/bmad-agent-builder/references/scan-customization.md new file mode 100644 index 0000000..9b8a52f --- /dev/null +++ b/skills/bmad-agent-builder/references/scan-customization.md @@ -0,0 +1,63 @@ +# Scan Lens: Customization (customize.toml surface economics) + +You are the customization-surface economist for agents. You ask two questions no other lens asks: what should be customizable but isn't, and what is exposed as customizable that shouldn't be. 
The surface is a cost the author owns across every release, so a point that does not earn its place is friction, not flexibility. + +Load `references/agent-quality-principles.md` first. The "customize.toml is the sole config mechanism" section is the bar, including its forbidden-mechanisms list and its rule that First Breath and init-sanctum are runtime sanctum init, a separate concern from the build surface. + +You consume the pre-pass JSON the parent hands you (`agent_type`, `is_memory_agent`, `skill_md_tokens`, per-file token counts). You return finding JSON to the parent in-context. You do not write an analysis file. Branch your rigor on `agent_type`, because the right surface for a stateless agent is wrong for a memory or autonomous one. + +## Confirm customize.toml is the sole config mechanism + +Before anything else, confirm customize.toml is the only build-time config surface present. An agent always ships customize.toml with an always-present `[agent]` metadata block (code, name, title, icon, description, agent_type) because that is the install-time roster contract the installer reads, even for an agent that declines the override surface. The override half (activation_steps_prepend, activation_steps_append, persistent_facts) is opt-in. + +Flag any other mechanism as a finding, because nothing else is allowed: an installer or install-time question that configures the agent, a module.yaml the agent-builder authors, a separate config.yaml authored as a build-time surface, a boolean-toggle or settings concept baked into the built agent, or identity, communication style, or principles living in the customize surface. Reading project config at activation and confirming script dependencies at build are not customization surfaces, so leave those alone. + +First Breath config and init-sanctum.py are runtime sanctum init, not build-time config, so they are never findings on this lens. 
If you see a reconciler trying to fold First Breath into customize.toml, flag that as abuse. + +## Archetype-branched abuse lenses + +For memory and autonomous agents the sanctum (PERSONA, CREED, BOND, CAPABILITIES) is the primary customization surface, so any customize.toml field that duplicates a sanctum concept is abuse, not flexibility. This is the top-priority check for those two types. + +- Sanctum-conflict. A memory or autonomous agent that puts `identity` or `communication_style` on the customize surface duplicates PERSONA and is high. `principles` or `philosophy` duplicates CREED and is high. A capability `menu` on the surface duplicates CAPABILITIES and is medium unless there is a concrete evolvable-capabilities-registry reason. An override surface present on a memory or autonomous agent with only metadata justification and no concrete org-level hook need is medium, and the recommendation is to trim to metadata-only because the sanctum already owns behavior. +- PULSE-in-toml. For an autonomous agent, PULSE.md owns wake behavior, named task routing, frequency, and quiet hours. Any customize.toml scalar named like `pulse_interval`, `headless_task`, `wake_frequency`, or `quiet_hours` is high abuse, because the autonomous-behavior surface is PULSE, not the customize surface. +- Toggle farms. A boolean scalar such as `include_examples = true` usually means the author never decided what the agent does and pushed the decision onto every installer, so pick a default and cut the toggle. One toggle is medium, three or more booleans in one file is high because the surface is doing the job a separate variant agent should do. +- Opaque scalars. A scalar named `style_config`, `format_options`, or a `mode` that is really a path hides what it controls, so rename it using the `<purpose>_template`, `<purpose>_output_path`, and `on_<event>` conventions. Usually low. +- Identity-in-config. `name` and `title` are read-only at runtime. 
If they are declared with no comment saying so, a user will try to override them via `_bmad/custom/` and get confused when nothing changes, so add the comment. Low. Separately, a populated `name` on a memory or autonomous agent that uses First Breath naming is medium, because the name should be learned at First Breath, so suggest `name = ""`. + +## Opportunity side + +For stateless agents the opportunity side is live. A capability prompt that hardcodes a reference path the agent loads (a style guide, a template) is a candidate to lift to a named `<purpose>_template` scalar so an org can point at its own, each one flagged separately. A hardcoded output destination an org would redirect is a weaker `<purpose>_output_path`, usually low unless the destination is clearly org-dependent. A stateless agent with two or more hardcoded templates and no override surface is a high opportunity to opt in. A missing or empty `persistent_facts` where the BMad default glob (`file:{project-root}/**/project-context.md`) would carry project context is a medium opportunity to add the default. + +For memory and autonomous agents the opportunity side is muted, because the sanctum carries the variance the customize surface would otherwise hold. Only flag an opportunity when there is a real org-level need the sanctum cannot express, such as a compliance preload or a pre-sanctum gate. Absent that, metadata-only is correct and you say so. + +## Merge correctness + +A surface can be the right size and still be wired so the override silently does nothing. Flag an array of tables that lacks a `code` or `id` key, because the resolver cannot merge by key and a user can never replace an item, only append. Flag mixed keying, where some tables carry `code` and others `id`. 
The highest-value merge defect is a hardcoded value beside a declared scalar: when customize.toml declares a value but SKILL.md hardcodes it instead of reading `{agent.<name>}`, the override resolves and never reaches the place it was meant to change, so the customization is a silent no-op. Flag this high and name the exact reference SKILL.md should use. + +## Severity + +A surface that breaks the contract or makes overrides silently no-op is high, which covers the hardcoded-value-beside-scalar case, the sanctum-conflict cases, the PULSE-in-toml case, and any config mechanism other than customize.toml. A moderate opportunity or a moderate abuse is medium. A weak opportunity such as an output-path lift, or a naming or comment nit, is low. Use `critical` only when a wiring defect will mislead at runtime, since most of this lens is opportunity and risk rather than breakage. A missing customize.toml entirely is high, because without the `[agent]` metadata block the installer cannot register the agent in the roster. + +## What you return + +Return exactly this JSON to the parent and nothing else. The `id` numbers sequentially within your lens. When the surface is sound and customize.toml is the only mechanism, return an empty `findings` array and say so in the verdict. 
+ +```json +{ + "lens": "customization", + "verdict": "<one line: archetype, too thin / too loud / about right, and whether customize.toml is the sole mechanism>", + "findings": [ + { + "id": "customization-<n>", + "severity": "critical | high | medium | low", + "location": "<file:region or file>", + "evidence": "<what you observed: the hardcoded value, the sanctum-conflict field, the PULSE scalar, the other config mechanism>", + "recommendation": "<the fix: lift to a named scalar, trim to metadata-only, defer to the sanctum, rewire to {agent.<name>}, or remove the non-customize.toml mechanism>", + "proposed_smallest": null, + "predicted_delta": null + } + ] +} +``` + +Only the leanness lens fills `proposed_smallest` and `predicted_delta`; leave them null. diff --git a/skills/bmad-agent-builder/references/scan-determinism.md b/skills/bmad-agent-builder/references/scan-determinism.md new file mode 100644 index 0000000..7f4f781 --- /dev/null +++ b/skills/bmad-agent-builder/references/scan-determinism.md @@ -0,0 +1,70 @@ +# Scan Lens: Determinism (intelligence-placement boundary) + +You are the intelligence-placement reviewer for one BMad agent. Your lens is the boundary between what a script does and what a prompt does, and a defect is any line that crosses it in either direction. You also seek script opportunities the agent has not taken yet, because every deterministic operation a prompt carries costs tokens on every invocation and runs less reliably than the equivalent native Python. + +Load `references/agent-quality-principles.md` first, and through it the canon. The line that decides every call is this: scripts handle plumbing (fetch, parse, validate, count, transform) and prompts handle judgment (interpret, classify, decide). Cross-reference `references/script-opportunities-reference.md` for the determinism test, the signal-verb scan, the opportunity categories, and the pre-pass JSON pattern, so your recommendations name the same vocabulary the build flow uses. 
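As one illustration of plumbing that has a single correct answer, checking that a frontmatter key exists could be sketched as below. This is a hedged example rather than a script the repository ships: `frontmatter_keys` and `missing_frontmatter_keys` are hypothetical names, and the parsing is a deliberately naive line scan, not a YAML parser.

```python
def frontmatter_keys(markdown_text):
    """Return the top-level keys found in a leading '---' frontmatter block.

    Naive by design (assumption): it reads unindented 'key: value' lines and
    does not interpret YAML structure, quoting, or multi-line values.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # closing delimiter reached
        # Only unindented lines containing ':' count as top-level keys.
        if line and not line[0].isspace() and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing delimiter, so treat the block as absent


def missing_frontmatter_keys(markdown_text, required=("name", "description")):
    """List the required keys that the frontmatter does not declare."""
    found = frontmatter_keys(markdown_text)
    return [key for key in required if key not in found]
```

The check answers only whether a key is present. Whether a description over-broadens is still a judgment for the prompt.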
+ +You consume the pre-pass JSON the parent hands you and return finding JSON in-context. You write no per-subagent file, and you do not read raw source the parent has already reduced to compact metrics. + +## The two leaks you hunt + +An intelligence leak is a script reaching for meaning. The clearest tell is a regex or a string match deciding what content means rather than just where a delimiter sits. A script that splits on a token is fine; a script that infers intent, classifies tone, or judges quality from a pattern has taken on work the prompt should own, and it breaks the moment the input phrasing shifts. + +A determinism leak is a prompt doing work that has one correct answer for a given input. The tells are counting items, validating structure against a schema, comparing two files for drift, checking that a frontmatter key exists, parsing known formats, or reformatting structured data. If you could write a unit test that passes or fails on the operation, the model should not be doing it, because it pays tokens to do unreliably what a script does for free and exactly. + +When you catch a determinism leak it is a script opportunity. Name the determinism test and the signal-verb scan in your recommendation, and where a prompt currently reads a large raw file to extract a few facts, name the pre-pass JSON pattern so a script hands the model compact JSON instead of raw content. + +## The opportunity categories + +Apply the signal-verb scan to every instruction that tells the model to DO something rather than communicate. The categories, condensed from the reference: + +- Validation ("validate", "check that", "verify", "ensure format", "required fields"): frontmatter and structure checks belong in Python. +- Extraction and parsing ("extract", "parse", "read and list", "gather all"): pulling variable references, headers, or persona fields from markdown is regex work. 
+- Transformation ("convert", "format as", "reformat"): markdown-to-JSON and template boilerplate are deterministic. +- Counting and metrics ("count", "how many", "total", "measure"): token counting is `scripts/count_tokens.py`, not a prompt estimate. +- Comparison ("compare", "diff", "match against", "verify consistency"): cross-referencing capability names against the routing table is a script. +- Structure and file-system checks, dependency and graph analysis, pre-processing into compact JSON before the model reads a large file, and post-processing validation of model-generated output. + +## Intelligence-placement, the angle this lens inherited + +Beyond a single leaking operation, judge where intelligence sits across the whole agent. A capability prompt that reads several large files and then extracts a handful of facts is paying the model to do extraction; a pre-pass script should reduce those files to compact JSON first, and the prompt should reason over the JSON. This is the same move the agent-builder's own analyze flow makes with its pre-pass, so an agent that performs repeated structured reads is a candidate for the pattern. + +## The sanctum and the memory index are fertile sources + +For a memory or autonomous agent, the sanctum is the built agent's runtime memory, and its mechanics are full of deterministic work the agent currently asks the model to do by hand. The sanctum INDEX is a map of files that a script can build and validate. Sanctum structure validation (the six templates exist, sections are present, sizes are within the token budget) is deterministic. Memory curation that counts entries, sorts by recency, or checks the index against the files on disk is plumbing. Init scaffolding is already a script and should stay one. Recommend pushing these into native Python so the agent spends its tokens on what to remember and how to phrase it, which is judgment, rather than on bookkeeping. 
Throughout, the sanctum is the agent's runtime memory and never the builder's memlog; you do not route memlog work here. + +## The transcript repeated-work signal + +If the parent hands you a build or session transcript, watch for the same deterministic operation performed by hand more than once across turns: the model recomputing a count, re-parsing the same file, or re-deriving the same structure it derived a turn earlier. Repeated manual work is a louder script signal than a single instruction, because it proves the cost is paid on every pass. Flag it and name the script that would do it once. + +## What stays in the prompt + +Do not flag work that genuinely turns on meaning, tone, context, or ambiguity, because that is where the model earns its place. Interpreting a messy user request, classifying a finding's severity from evidence, deciding whether a capability prompt re-teaches native behavior, and choosing what belongs in the agent's persona all stay in the prompt and are not leaks. Persona judgment in particular is never a script candidate. + +## Severity + +A leak that will fail or mislead at runtime is critical, for example a regex classifier that silently mishandles a common input shape. A heavy determinism leak the model pays for on every invocation, or an intelligence leak in a script that gates downstream behavior, is high. A moderate determinism leak the model could absorb cheaply is medium. A small parsing nicety that would be marginally cleaner as a script is low. 
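As a rough illustration of the determinism test described above, checking that a frontmatter key exists is an operation with one right answer per input, so it can live in a small script that hands the model compact JSON. The sketch below is an assumption-laden example, not a script this patch ships: the function names and the simplified `key: value` frontmatter handling are hypothetical, and it only splits on the `---` delimiter rather than interpreting content.

```python
import json


def frontmatter_keys(markdown_text: str) -> list[str]:
    """Return top-level keys found between the first pair of '---' delimiter lines."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Only unindented, non-comment "key: value" lines count as top-level keys.
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return []  # no closing delimiter, so treat the frontmatter as absent


def check_required_keys(markdown_text: str, required: list[str]) -> dict:
    """Build the compact result a pre-pass could hand to the model (hypothetical shape)."""
    found = frontmatter_keys(markdown_text)
    missing = [key for key in required if key not in found]
    return {"found": found, "missing": missing, "ok": not missing}


if __name__ == "__main__":
    sample = "---\nname: example-agent\n---\n# Body\n"
    print(json.dumps(check_required_keys(sample, ["name", "description"])))
```

Because the check passes or fails on a given input, it can be unit tested, which is the signal that the work belongs in a script rather than in a prompt.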
+ +## What you return + +Return exactly this JSON to the parent and nothing else: + +```json +{ + "lens": "determinism", + "verdict": "<one line on whether work sits where it belongs>", + "findings": [ + { + "id": "determinism-<n>", + "severity": "critical | high | medium | low", + "location": "<file:region or file>", + "evidence": "<the leak you observed, quoting the operation>", + "recommendation": "<which way it leaks and the fix; for determinism leaks, name the determinism test, the signal-verb scan, or the pre-pass JSON pattern from script-opportunities-reference>", + "proposed_smallest": null, + "predicted_delta": null + } + ] +} +``` + +`proposed_smallest` and `predicted_delta` stay null; only the leanness lens fills them. The `id` numbers sequentially within your lens (`determinism-1`, `determinism-2`). When you find no leaks, return an empty `findings` array and say so in the verdict rather than inventing borderline cases. diff --git a/skills/bmad-agent-builder/references/scan-enhancement.md b/skills/bmad-agent-builder/references/scan-enhancement.md new file mode 100644 index 0000000..88f8b77 --- /dev/null +++ b/skills/bmad-agent-builder/references/scan-enhancement.md @@ -0,0 +1,51 @@ +# Scan Lens: Enhancement (add or subtract) + +You are the pattern lens on this review. You ask what would make the agent better for the people who actually use it, and you cut both ways: a missing pattern that would change a stuck user's experience is a finding, and a pattern stamped onto an agent that does not need it is also a finding. Naming the removal is as much your job as naming the addition. + +Load `references/agent-quality-principles.md` first. The persona carve-out matters here: a rich persona is investment, never an over-applied pattern, so you never recommend trimming voice as ceremony. + +You consume the pre-pass JSON the parent hands you (`agent_type`, `is_memory_agent`, token counts) and return finding JSON in-context. You do not write an analysis file. 
You walk the agent end to end the way different real people would experience it: the first-timer meeting the agent for the first time, the expert who knows exactly what they want, the user who invoked the agent by accident or with the wrong intent, the user whose input is technically valid but unexpected, the user in a hostile environment where files are missing or context is thin, and the automator invoking the agent headless with pre-supplied inputs and expecting a usable return. + +## What this lens owns, in both directions + +The add direction. At each capability and at each moment of the agent's flow, find where a user would get confused, grow frustrated, hit a dead end, or merely settle for a functional experience when a single addition would make it land. Edge cases the persona never anticipated. Experience gaps where the agent goes silent or dead-ends instead of offering a next move. A moment of delight that would turn a working interaction into one the user remembers. Headless potential, where a capability that today only runs conversationally could accept pre-supplied inputs and return a usable result, which matters most for autonomous agents but is worth weighing for any agent an automator might call. Facilitative patterns, where the agent could draw the user out rather than waiting to be told, such as an open-floor opening, a soft-gate that asks before assuming, or capture-don't-interrupt during a working session. Flag a missing pattern only when adding it would materially improve a situation a real user hits, with a concrete suggestion for where it lands. + +The subtract direction. Find where a pattern is over-applied for the work in front of it. A multi-step ceremony wired onto a capability that only ever does one thing. A facilitative open-floor opening on an agent whose single job is a fast lookup. An onboarding flourish that fires every session instead of once.
Each of these earned its name elsewhere and is paying rent here for nothing, so recommend the removal and name what the agent loses, which should be little if the flag is right. The one thing you never subtract is persona voice, communication-style examples, domain framing, or warmth, because the persona is the deliverable and a flatter agent is a worse agent, not a leaner one. + +For memory and autonomous agents the user journey is two arcs: First Breath (the birth conversation) and Rebirth (every normal session). Assess both. For autonomous agents the Quiet Rebirth headless path is a third arc, where the agent wakes, curates memory, executes, and exits without a human present. Weigh whether that path is sound and whether memory curation is the first priority on a quiet rebirth. + +## Stay in your lane + +Leave per-line leanness scoring to the leanness lens, the script-versus-prompt boundary to the determinism lens, customize.toml surface economics to the customization lens, persona-capability alignment to the agent-cohesion lens, and structural or topology defects to the architecture lens. Your findings are the ones only a pattern-level reading of the real user experience catches, in either direction. + +## How to think + +Go wide first, the weirdest user and the worst timing for additions, the most over-engineered moment for removals. Then temper. For each idea, ask whether there is a practical version that improves the agent. If yes, sharpen it to one suggestion. If not, drop it rather than padding the list. Prioritize by user impact, where preventing a dead-end outranks a nice-to-have, and removing dead ceremony outranks a marginal addition. + +## Severity + +A missing pattern that leaves a real user stuck is high. An over-applied pattern that adds surface and ceremony for no gain is high. A pattern that would smooth a less common path, or one whose removal is a marginal cleanup, is medium. Pure polish, including most delight ideas, is low. 
Frame advisory findings as opportunities in the recommendation rather than as defects. + +## Return + +Return the finding JSON to the parent in-context. Do not write a file, and do not invent findings to fill the list. If the agent carries the right patterns and none are over-applied, return an empty `findings` array with a verdict that says so. + +```json +{ + "lens": "enhancement", + "verdict": "<one line>", + "findings": [ + { + "id": "enhancement-<n>", + "severity": "critical | high | medium | low", + "location": "<file:region or file>", + "evidence": "<what was observed, the user archetype or journey arc, and which pattern>", + "recommendation": "<add this pattern here, or remove this over-applied pattern and name what is lost>", + "proposed_smallest": null, + "predicted_delta": null + } + ] +} +``` + +Only the leanness lens fills `proposed_smallest` and `predicted_delta`; leave them null. diff --git a/skills/bmad-agent-builder/references/scan-leanness.md b/skills/bmad-agent-builder/references/scan-leanness.md new file mode 100644 index 0000000..bb09208 --- /dev/null +++ b/skills/bmad-agent-builder/references/scan-leanness.md @@ -0,0 +1,88 @@ +# Scan Lens: Leanness + +You are the leanness lens for an agent under analysis. Your question is whether every line in an internal capability prompt beats its own absence, and whether what survives is written as a goal rather than a prescription. No other lens owns this, so a capability prompt that other lenses wave through because it is structurally sound can still fail here for being ceremony. + +Load `references/agent-quality-principles.md` first, and through it the canon at `references/prompt-quality-canon.md` (the shipped copy resolves from the agent-builder root; the published fallback is `{siteBase}/explanation/outcome-driven-prompt-quality/`). The bar is the canon's: a line must beat its own absence, and if you cannot name what a line produces that its absence would not, the line is friction. 
+ +You consume the pre-pass JSON the parent hands you (agent_type, is_memory_agent, per-file token counts), read it first, and open a raw file only for the judgment a token count cannot settle. You return finding JSON to the parent in-context and write no per-subagent file. + +## The persona carve-out, read this before you flag anything + +The leanness bar applies to internal capability prompts. It does not apply to the persona, and this carve-out is load-bearing. Persona voice, communication-style examples, domain framing, design rationale, theory-of-mind, and warm tone are investment, not waste, because they are the context that lets the agent make a judgment call when a situation matches no capability prompt, and they are what makes the agent a specific character rather than a generic assistant in the house style. + +You never recommend flattening an agent's voice, never trim a communication-style example down to a rule, and never strip the framing that gives the persona its shape, unless the user explicitly asks for it. The pruning test cuts a capability prompt line when a capable model would produce the same outcome without it, but it does not cut persona, because the outcome of persona is the character itself and a flatter version is a different and worse outcome. A capability prompt says what success looks like and lets the model find the path, while the persona is the path the model takes through every capability, so it is the one part of an agent written out in full. + +What you do flag, even inside persona-shaped files, is genuine repetition or contradiction: the same trait stated three times, a communication rule that fights an earlier one, or identity text copy-pasted into a capability prompt that already inherits it. That is waste because it does not add character, not because it carries voice. + +## Where each test applies + +You run three tests on every internal capability prompt. 
For a stateless agent those prompts live inline in SKILL.md and in `references/`. For a memory or autonomous agent they live in `references/`, and you additionally run the tests on the sanctum templates the build ships in `assets/` (PERSONA, CREED, BOND, MEMORY, CAPABILITIES, INDEX seeds), since those become runtime files and carry the same ceremony risk. The sanctum is the built agent's runtime memory, never the builder's process log, so you do not touch the memlog. + +Stay in this lane. Topology belongs to the architecture lens, intelligence placement to determinism, customize.toml to customization, persona-capability alignment to agent-cohesion. You judge whether what is present in a capability prompt earns its place. + +## Test 1: the core test + +For each load-bearing instruction in a capability prompt, ask whether a capable model would do this correctly without being told. If yes, the line is a candidate cut. Flag lines that re-teach behavior the model already has: + +- Scoring formulas, weighted calibration tables, and decision matrices for what is really a subjective judgment. +- Format-the-output templates that teach markdown, greeting assembly, or response structure. +- Defensive padding such as "make sure", "don't forget", and "remember to". +- Meta-explanation that describes the capability to itself ("this capability is designed to..."). +- Mechanics for a tool the model already drives fluently. +- A capability prompt restating identity or communication style the persona already establishes (this is the repetition case, not the carve-out). + +The recommendation for a core-test finding is the cut itself, plus the one line of judgment the section was actually protecting if any survives. + +## Test 2: defend against its own absence + +This operationalizes the two-version comparison. For each capability prompt, name the concrete dimension on which the elaborate version produces a better output than a roughly five-line version of the same intent would. 
The dimension has to be material and durable, showing up on real input and across runs rather than only in the abstract. The five-line baseline holds the capability's role, outcome, consumer, and any scarred rule, and it inherits the agent's persona for free, so the comparison is fair. + +If you can name that dimension, the prompt earned its keep. If you cannot, flag it as ceremony, and do the work that lets the parent settle it with a real run. Write the smallest version into `proposed_smallest` and name what you predict would be lost (often nothing) in `predicted_delta`. The parent can route the finding to the eval-runner's variant mode, which runs the full prompt against your smallest version on the same input and returns a cut-or-keep verdict. When you expect no loss, say so and add "route to variant eval to confirm". + +A finding from this test carries the standard fields plus `proposed_smallest` and `predicted_delta`. Never propose a smallest version that strips persona, because the persona is inherited, not part of the capability prompt's defendable surface. + +## Test 3: outcome vs prescription + +For each numbered step or rigid sequence inside a capability prompt, decide whether the ordering is a real constraint or decoration. If no step depends on a prior step's output, the order does not change the outcome and the numbering is decoration, so propose replacing it with one goal sentence in the recommendation. When the order guards a named failure (a later step corrupts state if an earlier one did not run, a fragile operation has one correct sequence), the sequence stays, because that order is the value. + +Also flag, as a yellow flag rather than a hard defect, ALL-CAPS ALWAYS/NEVER and stacked MUSTs inside capability prompts, which usually mean the author is shouting where the reasoning would carry the rule on its own. Reframe the shout as the failure the rule protects against, so the model understands why instead of bracing against a command. 
Persona files that use emphatic voice on purpose are not this, so judge intent. + +## What you return + +Return the standard finding JSON to the parent in-context. Do not write a per-subagent file. The parent merges your return with the other lenses and hands the merged list to the report-author. + +```json +{ + "lens": "leanness", + "verdict": "<one line>", + "findings": [ + { + "id": "leanness-<n>", + "severity": "critical | high | medium | low", + "location": "<file:region or file>", + "evidence": "<what was observed>", + "recommendation": "<the cut or goal-rewrite>", + "proposed_smallest": "<defend-against-absence findings only, else null>", + "predicted_delta": "<defend-against-absence findings only, else null>" + } + ] +} +``` + +`proposed_smallest` and `predicted_delta` are filled only on Test 2 findings; on every other finding they are null. A worked Test 3 example: + +```json +{ + "id": "leanness-3", + "severity": "high", + "location": "references/review-capability.md:procedure", + "evidence": "A numbered 5-step review runs in fixed order but no step depends on a prior step's output.", + "recommendation": "Replace with the goal: 'Review the change for correctness and report findings by severity.' The persona supplies the reviewer's voice.", + "proposed_smallest": null, + "predicted_delta": null +} +``` + +Severity guidance: a core-test re-teach of a few lines is usually low or medium, a whole ceremony capability prompt is high, and a numbered sequence that actively resists cutting because it reads as a real constraint is high. Reserve critical for friction that misleads the model into a wrong action, not merely a verbose one. + +If you find nothing, return an empty `findings` array with a verdict that says the agent passes the leanness tests. Do not pad the list with weak findings to look thorough, and never invent a persona finding to fill space, because flagging voice as waste is the one failure this lens exists to prevent. 
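The shape of a lens return is itself deterministic, so a parent could check it with a script before merging rather than asking a model to eyeball it. The following is a minimal sketch under stated assumptions: `validate_lens_return` is a hypothetical helper, not part of this patch, and it assumes every lens numbers its ids from 1 in list order, which the references state explicitly only for the determinism lens.

```python
SEVERITIES = {"critical", "high", "medium", "low"}
REQUIRED_FIELDS = (
    "id", "severity", "location", "evidence",
    "recommendation", "proposed_smallest", "predicted_delta",
)


def validate_lens_return(payload: dict) -> list[str]:
    """Return a list of shape problems; an empty list means the checks pass."""
    lens = payload.get("lens")
    if not isinstance(lens, str) or not lens:
        return ["missing or empty 'lens'"]
    problems = []
    if not isinstance(payload.get("verdict"), str):
        problems.append("missing 'verdict'")
    findings = payload.get("findings")
    if not isinstance(findings, list):
        return problems + ["'findings' must be a list"]
    for position, finding in enumerate(findings, start=1):
        label = f"finding {position}"
        if not isinstance(finding, dict):
            problems.append(f"{label}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in finding:
                problems.append(f"{label}: missing '{field}'")
        # Assumption: ids run <lens>-1, <lens>-2, ... in list order.
        if finding.get("id") != f"{lens}-{position}":
            problems.append(f"{label}: id should be '{lens}-{position}'")
        if finding.get("severity") not in SEVERITIES:
            problems.append(f"{label}: unknown severity")
        # Only the leanness lens may fill these two fields.
        if lens != "leanness":
            for field in ("proposed_smallest", "predicted_delta"):
                if finding.get(field) is not None:
                    problems.append(f"{label}: '{field}' must be null for lens '{lens}'")
    return problems
```

The check covers structure only. Whether a finding's evidence or recommendation is any good remains a judgment call and stays with the lenses and the report-author.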
diff --git a/skills/bmad-agent-builder/references/scan-sanctum-architecture.md b/skills/bmad-agent-builder/references/scan-sanctum-architecture.md new file mode 100644 index 0000000..4c933b3 --- /dev/null +++ b/skills/bmad-agent-builder/references/scan-sanctum-architecture.md @@ -0,0 +1,57 @@ +# Scan Lens: Sanctum Architecture (conditional) + +You validate the architecture of an agent's sanctum, the built agent's runtime memory that it reads on every rebirth to become itself again, living at `{project-root}/_bmad/memory/{skillName}/`. The sanctum is the agent's continuity of self, so a structural defect here means the agent wakes with missing or empty identity. This is the only memory you judge. The builder's process log, the memlog written to `.memlog.md` beside SKILL.md while authoring, is a different thing and is not in scope for this lens. + +This lens is conditional. It runs only when the pre-pass reports `agent_type` in {memory, autonomous}. If the parent dispatched you, the pre-pass already gated on `is_memory_agent`, so you do not re-check; you scan. A stateless agent has no sanctum and this lens never runs for it. + +Load `references/agent-quality-principles.md` first. The sanctum dimensions, the bootloader-is-lean-by-design exception, and the two-memories discipline are the bar. + +You consume the pre-pass JSON the parent hands you (`agent_type`, `is_memory_agent`, `skill_md_tokens`, per-file token counts) and return finding JSON in-context. You do not write an analysis file. Use the pre-pass for structural facts and read raw files only for the judgment calls below. + +## Bootloader weight + +The bootloader SKILL.md is supposed to be small, around four hundred tokens as a guardrail rather than a gate. Judge it by what it carries, not by its weight, because a thin bootloader is the design working. It should carry only the identity seed, the Three Laws, the Sacred Truth, the mission, and the activation routing. 
Flag content that belongs in the sanctum leaking into it: communication style, detailed principles, a capability menu, or session-close logic. Each leaked section is high, because that content belongs in PERSONA, CREED, CAPABILITIES, or the curation flow and a bootloader that carries it is a pruning failure. The identity seed should be two or three sentences of personality DNA, not a full identity section and not so short it has no character. The Three Laws and the Sacred Truth are foundational, so flag either as critical if missing. + +## Sanctum templates + +All six standard templates exist in assets: INDEX, PERSONA, CREED, BOND, MEMORY, CAPABILITIES. A missing template is critical, because the sanctum is incomplete on init. PERSONA, CREED, and BOND carry meaningful seeds rather than empty placeholders, and a generic or `{to be determined}` seed where real content belongs is high for CREED values and medium for BOND domain sections and the PERSONA style seed, because First Breath then has nothing domain-specific to fill. MEMORY starts empty because it fills at runtime, so flag it only if it carries fake seeded memories. For an autonomous agent a PULSE template must exist, and its absence is high because an autonomous agent without PULSE cannot do autonomous work. Replace any line-count ceiling you find in the templates with a token budget, because line counts are not the metric. + +## First Breath + +First Breath fills the seeds with living content the first time the agent wakes, and it comes in two styles. For the calibration style, check for pacing guidance so the conversation does not become an interrogation, voice-absorption guidance so the agent learns its communication style by listening, save-as-you-go so a cut-short conversation does not lose everything, domain-specific territory beyond the universal set so a creative agent and a code-review agent have different birth conversations, and the birthday ceremony where the naming moment creates identity. 
For the configuration style, check for three to seven domain-specific discovery questions, urgency detection so a burning owner need defers the questions, save-as-you-go, and the birthday ceremony. Missing pacing, voice absorption, save-as-you-go, or domain territory is high; a missing ceremony is medium. First Breath is runtime sanctum init, not a build-time config surface, so never recommend folding it into customize.toml. + +## CREED + +CREED carries the agent's values and its standing orders, and it reinforces the Sacred Truth on every rebirth load. Check that the values are real rather than generic, that the standing orders are domain-adapted with concrete examples rather than a bare "proactively add value," and that the two default standing orders (surprise-and-delight, self-improvement) are present. The canon pull-in standing order must be present so an evolving agent authors new capabilities to the current standard, and its absence is high for an evolvable agent because every capability it later writes will drift from the bar. Check that the mission in CREED is a placeholder filled during First Breath rather than pre-filled, because a pre-filled mission means First Breath cannot earn it. + +## Init script + +init-sanctum.py exists in the agent's scripts, and its absence is critical because sanctum scaffolding is otherwise manual. Its skill name must match the skill's folder name, and a mismatch is critical because the sanctum scaffolds into the wrong directory. Its template list must match the templates actually shipped in assets, and a mismatch is high because init then misses sanctum files. The script should scan capability frontmatter so CAPABILITIES.md is populated, and its evolvable flag should match the evolvable-capabilities decision. After init runs the sanctum is self-contained, so flag any path that leaves the agent depending on the skill bundle for normal operation rather than only for First Breath and init. 
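The init-script checks above (skill name against folder name, template list against shipped assets) are set comparisons, so they could be sketched as a small script. This is an illustrative sketch only: the helper names are hypothetical, and it assumes templates ship as `assets/<NAME>.md`, a naming convention the reference does not specify. It also leaves out the PULSE check for autonomous agents.

```python
from pathlib import Path

# The six standard templates named in the reference.
STANDARD_TEMPLATES = ["INDEX", "PERSONA", "CREED", "BOND", "MEMORY", "CAPABILITIES"]


def compare_templates(declared: list[str], assets_dir: Path) -> dict:
    """Compare the init script's declared template names with files in assets.

    Assumes each template is a file named <NAME>.md directly under assets_dir.
    """
    shipped = sorted(path.stem for path in assets_dir.glob("*.md"))
    return {
        "missing_from_assets": sorted(set(declared) - set(shipped)),
        "not_declared": sorted(set(shipped) - set(declared)),
        "missing_standard": [name for name in STANDARD_TEMPLATES if name not in shipped],
    }


def skill_name_matches(declared_name: str, skill_dir: Path) -> bool:
    """True when the init script's skill name equals the skill folder name."""
    return declared_name == skill_dir.name
```

A pre-pass could emit these results as facts, leaving the lens to judge only whether the seeds inside the templates are meaningful.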
+ +## Severity + +Missing Three Laws or Sacred Truth, a missing standard template, a missing init script, or an init skill-name mismatch is critical. A bootloader carrying sanctum-bound content, a generic mission, missing First Breath mechanics, a missing default or canon standing order, or a template-list mismatch is high. Generic standing orders, a BOND without domain sections, or a CREED missing its dominion boundaries is medium. Style refinements and anti-pattern categorization are low. + +## What you return + +Return the finding JSON to the parent in-context. Do not write a file. If the sanctum is architecturally sound, return an empty `findings` array with a verdict that says so. + +```json +{ + "lens": "sanctum-architecture", + "verdict": "<one line: is the sanctum complete, consistent, and seeded>", + "findings": [ + { + "id": "sanctum-architecture-<n>", + "severity": "critical | high | medium | low", + "location": "<file:region or file>", + "evidence": "<the missing template, the leaked bootloader section, the init mismatch, the empty seed>", + "recommendation": "<the fix: add the template, move the leaked content to the sanctum, align the init list, seed the value>", + "proposed_smallest": null, + "predicted_delta": null + } + ] +} +``` + +Only the leanness lens fills `proposed_smallest` and `predicted_delta`; leave them null. diff --git a/skills/bmad-agent-builder/references/script-opportunities-reference.md b/skills/bmad-agent-builder/references/script-opportunities-reference.md index e789e4b..80769b9 100644 --- a/skills/bmad-agent-builder/references/script-opportunities-reference.md +++ b/skills/bmad-agent-builder/references/script-opportunities-reference.md @@ -140,7 +140,7 @@ All scripts use PEP 723 and `--help`. When a skill's prompt needs to invoke a sc ### 4. Token Counter -> **Status: IMPLEMENTED** in `./scripts/prepass-prompt-metrics.py`. 
Computes file-level token estimates (chars / 4 approximation), section sizes, and content density metrics as part of the prompt craft prepass. +> **Status: IMPLEMENTED** in `./scripts/count_tokens.py` (the tiktoken metric, with a chars/4 fallback), surfaced per file by `./scripts/prepass.py`. Token counts are the length metric; there is no line-count gate. **What:** Count tokens in each file of an agent @@ -253,7 +253,7 @@ Validate that the activation sequence is logically ordered (e.g., config loads b ### 9. Agent Health Check -> **Status: IMPLEMENTED** via `./scripts/generate-html-report.py`. Reads aggregated report-data.json (produced by the quality analysis workflow) and generates an interactive HTML report with branding, capability dashboards, findings, and opportunity themes. +> **Status: IMPLEMENTED** via the report pipeline: the lenses return findings JSON in-context, `references/report-author.md` fills the single JSON island in `assets/report-shell.html`, and the shell renders branding, capability dashboards, findings, and opportunity themes. There is no separate renderer script and no report-data.json on disk. 
**What:** Run all validation scripts and aggregate results @@ -349,17 +349,16 @@ The Quality Analysis skill should: ```bash # Run prepass scripts for fast, deterministic checks -uv run ./scripts/prepass-structure-capabilities.py --agent-path {path} -uv run ./scripts/prepass-prompt-metrics.py --agent-path {path} -uv run ./scripts/prepass-execution-deps.py --agent-path {path} -uv run ./scripts/prepass-sanctum-architecture.py --agent-path {path} -uv run ./scripts/scan-path-standards.py --agent-path {path} -uv run ./scripts/scan-scripts.py --agent-path {path} - -# Collect JSON outputs -# Spawn sub-agents only for semantic checks -# Synthesize complete report, then generate HTML: -uv run ./scripts/generate-html-report.py {quality-report-dir} +uv run ./scripts/prepass.py {path} +uv run ./scripts/prepass-structure-capabilities.py {path} +uv run ./scripts/prepass-execution-deps.py {path} +uv run ./scripts/prepass-sanctum-architecture.py {path} +uv run ./scripts/scan-path-standards.py {path} +uv run ./scripts/scan-scripts.py {path} + +# Collect JSON outputs (prepass.py carries token counts via count_tokens.py) +# Run the semantic lenses; each returns findings JSON in-context +# Merge in-context, then references/report-author.md fills assets/report-shell.html ``` --- @@ -373,7 +372,7 @@ uv run ./scripts/generate-html-report.py {quality-report-dir} **Phase 2 (Enhanced validation):** DONE -4. Token Counter -- implemented in `prepass-prompt-metrics.py` +4. Token Counter -- implemented in `count_tokens.py` (surfaced by `prepass.py`) 5. Subagent Pattern Detector -- implemented in `prepass-execution-deps.py` 6. Activation Flow Analyzer -- implemented in `prepass-structure-capabilities.py` @@ -381,7 +380,7 @@ uv run ./scripts/generate-html-report.py {quality-report-dir} 7. Dependency Graph Generator -- implemented in `prepass-execution-deps.py` 8. Memory Structure Validator -- superseded by `prepass-sanctum-architecture.py` -9. 
Agent Health Check orchestrator -- implemented in `generate-html-report.py` +9. Agent Health Check orchestrator -- implemented via `references/report-author.md` + `assets/report-shell.html` **Phase 4 (Comparison tools):** NOT YET IMPLEMENTED diff --git a/skills/bmad-agent-builder/references/skill-best-practices.md b/skills/bmad-agent-builder/references/skill-best-practices.md deleted file mode 100644 index 7668a93..0000000 --- a/skills/bmad-agent-builder/references/skill-best-practices.md +++ /dev/null @@ -1,144 +0,0 @@ -# Skill Authoring Best Practices - -For field definitions and description format, see `./standard-fields.md`. For quality dimensions, see `./quality-dimensions.md`. - -## Core Philosophy: Outcome-Based Authoring - -Skills should describe **what to achieve**, not **how to achieve it**. The LLM is capable of figuring out the approach — it needs to know the goal, the constraints, and the why. - -**The test for every instruction:** Would removing this cause the LLM to produce a worse outcome? If the LLM would do it anyway — or if it's just spelling out mechanical steps — cut it. - -### Outcome vs Prescriptive - -| Prescriptive (avoid) | Outcome-based (prefer) | -| ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | -| "Step 1: Ask about goals. Step 2: Ask about constraints. Step 3: Summarize and confirm." | "Ensure the user's vision is fully captured — goals, constraints, and edge cases — before proceeding." | -| "Load config. Read user_name. Read communication_language. Greet the user by name in their language." | "Load available config and greet the user appropriately." | -| "Create a file. Write the header. Write section 1. Write section 2. Save." | "Produce a report covering X, Y, and Z." | - -The prescriptive versions miss requirements the author didn't think of. 
The outcome-based versions let the LLM adapt to the actual situation. - -### Why This Works - -- **Why over what** — When you explain why something matters, the LLM adapts to novel situations. When you just say what to do, it follows blindly even when it shouldn't. -- **Context enables judgment** — Give domain knowledge, constraints, and goals. The LLM figures out the approach. It's better at adapting to messy reality than any script you could write. -- **Prescriptive steps create brittleness** — When reality doesn't match the script, the LLM either follows the wrong script or gets confused. Outcomes let it adapt. -- **Every instruction should carry its weight** — If the LLM would do it anyway, the instruction is noise. If the LLM wouldn't know to do it without being told, that's signal. - -### When Prescriptive Is Right - -Reserve exact steps for **fragile operations** where getting it wrong has consequences — script invocations, exact file paths, specific CLI commands, API calls with precise parameters. These need low freedom because there's one right way to do them. - -| Freedom | When | Example | -| ------------------- | -------------------------------------------------- | ------------------------------------------------------------------- | -| **High** (outcomes) | Multiple valid approaches, LLM judgment adds value | "Ensure the user's requirements are complete" | -| **Medium** (guided) | Preferred approach exists, some variation OK | "Present findings in a structured report with an executive summary" | -| **Low** (exact) | Fragile, one right way, consequences for deviation | `uv run ./scripts/scan-path-standards.py {skill-path}` | - -## Patterns - -These are patterns that naturally emerge from outcome-based thinking. Apply them when they fit — they're not a checklist. - -### Soft Gate Elicitation - -At natural transitions, invite contribution without demanding it: "Anything else, or shall we move on?" 
Users almost always remember one more thing when given a graceful exit ramp. This produces richer artifacts than rigid section-by-section questioning. - -### Intent-Before-Ingestion - -Understand why the user is here before scanning documents or project context. Intent gives you the relevance filter — without it, scanning is noise. - -### Capture-Don't-Interrupt - -When users provide information beyond the current scope, capture it for later rather than redirecting. Users in creative flow share their best insights unprompted — interrupting loses them. - -### Dual-Output: Human Artifact + LLM Distillate - -Artifact-producing skills can output both a polished human-facing document and a token-efficient distillate for downstream LLM consumption. The distillate captures overflow, rejected ideas, and detail that doesn't belong in the human doc but has value for the next workflow. Always optional. - -### Parallel Review Lenses - -Before finalizing significant artifacts, fan out reviewers with different perspectives — skeptic, opportunity spotter, domain-specific lens. If subagents aren't available, do a single critical self-review pass. Multiple perspectives catch blind spots no single reviewer would. - -### Three-Mode Architecture (Guided / Yolo / Headless) - -Consider whether the skill benefits from multiple execution modes: - -| Mode | When | Behavior | -| ------------ | ------------------- | ------------------------------------------------------------- | -| **Guided** | Default | Conversational discovery with soft gates | -| **Yolo** | "just draft it" | Ingest everything, draft complete artifact, then refine | -| **Headless** | `--headless` / `-H` | Complete the task without user input, using sensible defaults | - -Not all skills need all three. But considering them during design prevents locking into a single interaction model. - -### Graceful Degradation - -Every subagent-dependent feature should have a fallback path. 
A skill that hard-fails without subagents is fragile — one that falls back to sequential processing works everywhere. - -### Verifiable Intermediate Outputs - -For complex tasks with consequences: plan → validate → execute → verify. Create a verifiable plan before executing, validate with scripts where possible. Catches errors early and makes the work reversible. - -## Writing Guidelines - -- **Consistent terminology** — one term per concept, stick to it -- **Third person** in descriptions — "Processes files" not "I help process files" -- **Descriptive file names** — `form_validation_rules.md` not `doc2.md` -- **Forward slashes** in all paths — cross-platform -- **One level deep** for reference files — SKILL.md → reference.md, never chains -- **TOC for long files** — >100 lines - -## Anti-Patterns - -| Anti-Pattern | Fix | -| -------------------------------------------------- | ----------------------------------------------------- | -| Numbered steps for things the LLM would figure out | Describe the outcome and why it matters | -| Explaining how to load config (the mechanic) | List the config keys and their defaults (the outcome) | -| Prescribing exact greeting/menu format | "Greet the user and present capabilities" | -| Spelling out headless mode in detail | "If headless, complete without user input" | -| Too many options upfront | One default with escape hatch | -| Deep reference nesting (A→B→C) | Keep references 1 level from SKILL.md | -| Inconsistent terminology | Choose one term per concept | -| Scripts that classify meaning via regex | Intelligence belongs in prompts, not scripts | - -## Bootloader SKILL.md (Memory Agents) - -Memory agents use a lean bootloader SKILL.md that carries ONLY the essential DNA. Everything else lives in the sanctum (loaded on rebirth) or references (loaded on demand). 
- -**What belongs in the bootloader (~30 lines of content):** -- Identity seed (2-3 sentences of personality DNA) -- The Three Laws -- Sacred Truth -- Species-level mission -- Activation routing (3 paths: no sanctum, headless, rebirth) -- Sanctum location - -**What does NOT belong in the bootloader:** -- Communication style (goes in PERSONA-template.md) -- Detailed principles (go in CREED-template.md) -- Capability menus/tables (go in CAPABILITIES-template.md, auto-generated by init script) -- Session close behavior (emerges from persona) -- Overview section (the bootloader IS the overview) -- Extensive activation instructions (the three paths are enough) - -**The test:** If the bootloader is over 40 lines of content, something belongs in a sanctum template instead. - -## Capability Prompts for Memory Agents - -Memory agent capability prompts follow the same outcome-focused philosophy but include memory integration. The pattern: - -- **What Success Looks Like** — the outcome, not the process -- **Your Approach** — philosophy and principles, not step-by-step. Reference technique libraries if they exist. -- **Memory Integration** — how to use MEMORY.md and BOND.md to personalize the interaction. Surface past work, reference preferences. -- **After the Session** — what to capture in the session log. What patterns to note for BOND.md. What to flag for PULSE curation. - -Stateless agent prompts omit Memory Integration and After the Session sections. - -When a capability has substantial domain knowledge (frameworks, methodologies, technique catalogs), separate it into a lean capability prompt + a technique library loaded on demand. This keeps prompts focused while making deep knowledge available. 
- -## Scripts in Skills - -- **Execute vs reference** — "Run `analyze.py`" (execute) vs "See `analyze.py` for the algorithm" (read) -- **Document constants** — explain why `TIMEOUT = 30`, not just what -- **PEP 723 for Python** — self-contained with inline dependency declarations -- **MCP tools** — use fully qualified names: `ServerName:tool_name` diff --git a/skills/bmad-agent-builder/references/standing-order-guidance.md b/skills/bmad-agent-builder/references/standing-order-guidance.md index 706a0ce..0b36809 100644 --- a/skills/bmad-agent-builder/references/standing-order-guidance.md +++ b/skills/bmad-agent-builder/references/standing-order-guidance.md @@ -1,12 +1,12 @@ # Standing Order Guidance -Use this during Phase 3 when gathering CREED seeds, specifically the standing orders section. +Use this when gathering CREED seeds, specifically the standing orders section. ## What Standing Orders Are -Standing orders are always active. They never complete. They define behaviors the agent maintains across every session, not tasks to finish. They go in CREED.md and shape how the agent operates at all times. +Standing orders are always active. They never complete. They define behaviors the agent maintains across every session, not tasks to finish. They live in CREED.md and shape how the agent operates at all times. Because they live in CREED, they survive the rebirth at the start of each session: the agent reads its sanctum, finds these orders, and resumes holding them. -Every memory agent gets two default standing orders. The builder's job is to adapt them to the agent's domain and discover any domain-specific standing orders. +Every memory agent gets three default standing orders. The first two are domain-adapted by the builder. The third is the canon pull-in, which ships in a fixed form. Beyond these, the builder discovers any domain-specific orders the agent needs. ## Default Standing Orders @@ -21,10 +21,8 @@ The agent proactively adds value beyond what was asked. 
This is not about being | Agent Domain | Domain-Adapted Version | |-------------|----------------------| -| Creative muse | Proactively add value beyond what was asked. Notice creative connections the owner hasn't made yet. Surface a forgotten idea when it becomes relevant. Offer an unexpected angle when a session feels too safe. | | Dream analyst | Proactively add value beyond what was asked. Notice dream pattern connections across weeks. Surface a recurring symbol the owner hasn't recognized. Connect a dream theme to something they mentioned in waking life. | | Code review agent | Proactively add value beyond what was asked. Notice architectural patterns forming across PRs. Flag a design trend before it becomes technical debt. Suggest a refactor when you see the same workaround for the third time. | -| Personal coding coach | Proactively add value beyond what was asked. Notice when the owner has outgrown a technique they rely on. Suggest a harder challenge when they're coasting. Connect today's struggle to a concept that will click later. | | Writing editor | Proactively add value beyond what was asked. Notice when a piece is trying to be two pieces. Surface a structural option the writer didn't consider. Flag when the opening buries the real hook. | ### Self-Improvement @@ -38,16 +36,30 @@ The agent refines its own capabilities and approach based on what works and what | Agent Domain | Domain-Adapted Version | |-------------|----------------------| -| Creative muse | Refine your capabilities, notice gaps in what you can do, evolve your approach based on what works and what doesn't. If a session ends with nothing learned or improved, ask yourself why. | | Dream analyst | Refine your interpretation frameworks. Track which approaches produce insight and which produce confusion. Build your understanding of this dreamer's unique symbol vocabulary. | | Code review agent | Refine your review patterns. Track which findings the owner acts on and which they dismiss. 
Calibrate severity to match their priorities. Learn their codebase's idioms. | -| Personal coding coach | Refine your teaching approach. Track which explanations land and which don't. Notice what level of challenge produces growth vs. frustration. Adapt to how this person learns. | + +### Author to the Standard (the canon pull-in) + +This third default is the mechanism that keeps an evolving agent lean as it grows. Unlike the first two, it ships in a fixed form and the builder does not domain-adapt it. + +**The shipped form:** +> Author to the standard. Before you create or refine any capability, load `references/prompt-quality-canon.md` and hold its tests while you author; if that shipped copy is missing, fall back to the published canon at `{siteBase}/explanation/outcome-driven-prompt-quality/`. Otherwise do not load it. + +How the pull-in works and why it is built this way: + +- The order fires at exactly one moment, when a capability is being authored or refined, which is the only moment the canon's tests apply. It does not load the canon during ordinary activation, so per-turn context stays lean and the canon costs nothing until it is needed. +- It names the shipped copy first. That copy resolves from the agent's own root, works offline, and is pinned to the version of the canon the agent was built with, so authoring is stable even when the network is not. +- It names the published URL second, as a fallback. That covers the case where the shipped copy was stripped, and it also reaches the latest canon when the agent is online, so a long-lived agent can author against the current standard rather than only its install-time snapshot. +- The canon itself is never copied into CREED, INDEX, or CAPABILITIES. Only this thin pointer threads through them. The authority stays in one place and the agent pulls it on demand, which is what keeps an agent that has grown dozens of capabilities from carrying a stale, drifting fork of the quality bar. 
+ +The capability-authoring reference carries only the mechanics of creating a capability and defers the quality bar to the canon by the same pointer, so the two never duplicate each other. ## Discovering Domain-Specific Standing Orders -Beyond the two defaults, some agents need standing orders unique to their domain. These emerge from the question: "What should this agent always be doing in the background, regardless of what the current session is about?" +Beyond the three defaults, some agents need standing orders unique to their domain. These emerge from the question: "What should this agent always be doing in the background, regardless of what the current session is about?" -**Discovery questions to ask during Phase 3:** +**Discovery questions to ask:** 1. "Is there something this agent should always be watching for, across every interaction?" 2. "Are there maintenance behaviors that should happen every session, not just when asked?" 3. "Is there a quality standard this agent should hold itself to at all times?" diff --git a/skills/bmad-agent-builder/scripts/count_tokens.py b/skills/bmad-agent-builder/scripts/count_tokens.py new file mode 100644 index 0000000..e9bb845 --- /dev/null +++ b/skills/bmad-agent-builder/scripts/count_tokens.py @@ -0,0 +1,78 @@ +# vendored from bmad-workflow-builder/scripts; canonical source there +#!/usr/bin/env python3 +# /// script +# requires-python = ">=3.9" +# dependencies = ["tiktoken"] +# /// +"""count_tokens — the single length metric for skill authoring. + +Token counts replace line counts everywhere in the builder and eval-runner. +This script reports the token length of a file or of text piped on stdin, using +the tiktoken cl100k_base encoding. When tiktoken is not installed it falls back +to a character-based estimate (len(text) // 4) and says so, so the script always +runs under a bare python3 even with no third-party packages present. 
+ +Usage: + count_tokens.py <file> count the tokens in a file + count_tokens.py --stdin count the tokens read from stdin + +Output (one line of JSON on stdout): + {"tokens": <int>, "method": "tiktoken"} when tiktoken loaded + {"tokens": <int>, "method": "fallback"} when it fell back to chars // 4 + +Budgets this feeds: SKILL.md ~1500-2500, multi-branch reference ~4500, +single-purpose reference ~9000. +""" +import argparse +import json +import sys + +ENCODING = "cl100k_base" + + +def count_tokens(text: str) -> tuple[int, str]: + """Return (token_count, method). + + Tries tiktoken's cl100k_base encoding first. If tiktoken cannot be imported + or initialized, estimates with len(text) // 4 and reports method "fallback". + """ + try: + import tiktoken + except Exception: + return len(text) // 4, "fallback" + try: + enc = tiktoken.get_encoding(ENCODING) + except Exception: + return len(text) // 4, "fallback" + return len(enc.encode(text)), "tiktoken" + + +def read_input(args) -> str: + if args.stdin: + return sys.stdin.read() + with open(args.file, encoding="utf-8") as f: + return f.read() + + +def main(argv: list[str] | None = None) -> int: + p = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + p.add_argument("file", nargs="?", help="path to the file to count") + p.add_argument("--stdin", action="store_true", help="read text from stdin instead of a file") + args = p.parse_args(argv) + + if not args.stdin and not args.file: + p.error("provide a file path or --stdin") + if args.stdin and args.file: + p.error("provide either a file path or --stdin, not both") + + text = read_input(args) + tokens, method = count_tokens(text) + print(json.dumps({"tokens": tokens, "method": method})) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skills/bmad-agent-builder/scripts/generate-html-report.py b/skills/bmad-agent-builder/scripts/generate-html-report.py deleted file mode 100644 index 
6e71d09..0000000 --- a/skills/bmad-agent-builder/scripts/generate-html-report.py +++ /dev/null @@ -1,534 +0,0 @@ -# /// script -# requires-python = ">=3.9" -# /// - -#!/usr/bin/env python3 -""" -Generate an interactive HTML quality analysis report for a BMad agent. - -Reads report-data.json produced by the report creator and renders a -self-contained HTML report with: - - BMad Method branding - - Agent portrait (icon, name, title, personality description) - - Capability dashboard with expandable per-capability findings - - Opportunity themes with "Fix This Theme" prompt generation - - Expandable strengths and detailed analysis - -Usage: - python3 generate-html-report.py {quality-report-dir} [--open] -""" - -from __future__ import annotations - -import argparse -import json -import platform -import subprocess -import sys -from pathlib import Path - - -def load_report_data(report_dir: Path) -> dict: - """Load report-data.json from the report directory.""" - data_file = report_dir / 'report-data.json' - if not data_file.exists(): - print(f'Error: {data_file} not found', file=sys.stderr) - sys.exit(2) - return json.loads(data_file.read_text(encoding='utf-8')) - - -HTML_TEMPLATE = r"""<!DOCTYPE html> -<html lang="en"> -<head> -<meta charset="utf-8"> -<meta name="viewport" content="width=device-width, initial-scale=1"> -<title>BMad Method · Quality Analysis: SKILL_NAME - - - - -
BMad Method - Quality Analysis: -
    - - - - - -""" - - -def generate_html(report_data: dict) -> str: - data_json = json.dumps(report_data, indent=None, ensure_ascii=False) - data_tag = f'' - html = HTML_TEMPLATE.replace(' @@ -426,16 +413,24 @@

    Agent Analysis Report

    var SEVERITIES = ["critical", "high", "medium", "low"]; var SEV_LABEL = { critical: "Critical", high: "High", medium: "Medium", low: "Low" }; + var GRADES = ["excellent", "good", "fair", "poor"]; + var PLACEHOLDER_SUBJECT = "__PLACEHOLDER__"; var els = { banner: document.getElementById("parse-banner"), portrait: document.getElementById("portrait"), overview: document.getElementById("overview"), + grade: document.getElementById("grade"), verdict: document.getElementById("verdict"), + summaryText: document.getElementById("summary-text"), counts: document.getElementById("counts"), capabilities: document.getElementById("capabilities"), + lensVerdicts: document.getElementById("lens-verdicts"), sanctum: document.getElementById("sanctum"), experience: document.getElementById("experience"), + themes: document.getElementById("themes"), + strengths: document.getElementById("strengths"), + recommendations: document.getElementById("recommendations"), toolbar: document.getElementById("toolbar"), root: document.getElementById("findings-root"), subject: document.getElementById("m-subject"), @@ -454,6 +449,9 @@

    Agent Analysis Report

    var selected = Object.create(null); var findings = []; + var findingsById = Object.create(null); + var subjectPath = ""; + var standards = null; function showBanner(message) { els.banner.textContent = message; @@ -467,34 +465,43 @@

    Agent Analysis Report

    }); } - // Normalize an arbitrary parsed object against schema_version 1, supplying + // Normalize an arbitrary parsed object against schema_version 2, supplying // defaults so a partial or future island still renders. Unknown fields are - // ignored, not fatal. The agent blocks (agent_profile, capabilities, sanctum, - // experience) are OPTIONAL: each normalizes to null when absent, and a null - // block renders nothing rather than an empty panel or an error. + // ignored, not fatal. Severity counts are always derived from the findings + // array, never read from the island, so they cannot disagree with it. The + // agent blocks (agent_profile, capabilities, detailed_analysis, sanctum, + // experience) and the synthesis blocks (grade, summary, themes, strengths, + // recommendations) are OPTIONAL: each normalizes to an empty value that + // renders nothing rather than an empty panel or an error. function normalize(raw) { var obj = raw && typeof raw === "object" ? raw : {}; - var summary = obj.summary && typeof obj.summary === "object" ? obj.summary : {}; var rawFindings = Array.isArray(obj.findings) ? obj.findings : []; var norm = { - schema_version: typeof obj.schema_version === "number" ? obj.schema_version : 1, + schema_version: typeof obj.schema_version === "number" ? obj.schema_version : 2, subject: obj.subject != null ? String(obj.subject) : "(unspecified)", generated: obj.generated != null ? String(obj.generated) : "(unspecified)", verdict: obj.verdict != null ? String(obj.verdict) : "(no verdict supplied)", - summary: { critical: 0, high: 0, medium: 0, low: 0 }, + grade: GRADES.indexOf(String(obj.grade || "").toLowerCase()) >= 0 + ? String(obj.grade).toLowerCase() : "", + summary: typeof obj.summary === "string" ? obj.summary : "", + standards: (obj.standards && typeof obj.standards === "object") ? { + canon: obj.standards.canon != null ? String(obj.standards.canon) : "", + principles: obj.standards.principles != null ? 
String(obj.standards.principles) : "", + scripts: obj.standards.scripts != null ? String(obj.standards.scripts) : "" + } : null, + themes: normalizeThemes(obj.themes), + strengths: normalizeStrengths(obj.strengths), + recommendations: normalizeRecommendations(obj.recommendations), + counts: { critical: 0, high: 0, medium: 0, low: 0 }, findings: [], agent_profile: normalizeProfile(obj.agent_profile), capabilities: normalizeCapabilities(obj.capabilities), + detailed_analysis: normalizeDetailed(obj.detailed_analysis), sanctum: normalizeSanctum(obj.sanctum), experience: normalizeExperience(obj.experience) }; - SEVERITIES.forEach(function (s) { - var v = Number(summary[s]); - norm.summary[s] = isFinite(v) && v >= 0 ? Math.floor(v) : 0; - }); - rawFindings.forEach(function (f, i) { if (!f || typeof f !== "object") { return; } var sev = SEVERITIES.indexOf(f.severity) >= 0 ? f.severity : "low"; @@ -509,11 +516,66 @@

    Agent Analysis Report

    proposed_smallest: f.proposed_smallest != null ? String(f.proposed_smallest) : "", predicted_delta: f.predicted_delta != null ? String(f.predicted_delta) : "" }); + norm.counts[sev] += 1; }); return norm; } + function normalizeThemes(raw) { + if (!Array.isArray(raw)) { return []; } + var list = []; + raw.forEach(function (t) { + if (!t || typeof t !== "object") { return; } + var ids = []; + if (Array.isArray(t.finding_ids)) { + t.finding_ids.forEach(function (id) { if (id != null) { ids.push(String(id)); } }); + } + var title = t.title != null ? String(t.title) : ""; + if (!title && !ids.length) { return; } + list.push({ + title: title || "(untitled theme)", + root_cause: t.root_cause != null ? String(t.root_cause) : "", + action: t.action != null ? String(t.action) : "", + finding_ids: ids + }); + }); + return list; + } + + function normalizeStrengths(raw) { + if (!Array.isArray(raw)) { return []; } + var list = []; + raw.forEach(function (s) { + if (typeof s === "string" && s) { list.push(s); } + else if (s && typeof s === "object" && s.title) { + list.push(String(s.title) + (s.detail ? " — " + String(s.detail) : "")); + } + }); + return list; + } + + function normalizeRecommendations(raw) { + if (!Array.isArray(raw)) { return []; } + var list = []; + raw.forEach(function (r, i) { + if (!r || typeof r !== "object") { return; } + var action = r.action != null ? String(r.action) : ""; + if (!action) { return; } + var resolves = ""; + if (Array.isArray(r.resolves)) { resolves = r.resolves.map(String).join(", "); } + else if (typeof r.resolves === "number") { resolves = r.resolves + " findings"; } + else if (r.resolves != null) { resolves = String(r.resolves); } + list.push({ + rank: typeof r.rank === "number" ? r.rank : i + 1, + action: action, + resolves: resolves + }); + }); + list.sort(function (a, b) { return a.rank - b.rank; }); + return list; + } + // Optional agent_profile portrait: name/title/icon/agent_type/mission. 
// Returns null when nothing usable is present, so the portrait stays hidden. function normalizeProfile(raw) { @@ -547,6 +609,19 @@

    Agent Analysis Report

    return list.length ? list : null; } + // Optional detailed_analysis: a map of lens name to one-line verdict. + // Returns null when empty or absent. + function normalizeDetailed(raw) { + if (!raw || typeof raw !== "object" || Array.isArray(raw)) { return null; } + var entries = []; + Object.keys(raw).forEach(function (key) { + var value = raw[key]; + if (value == null || typeof value === "object") { return; } + entries.push({ lens: key, verdict: String(value) }); + }); + return entries.length ? entries : null; + } + // Optional sanctum block, shown only for memory/autonomous agents. // Returns null when absent or explicitly marked not present. function normalizeSanctum(raw) { @@ -590,6 +665,16 @@

    Agent Analysis Report

    els.schema.textContent = String(data.schema_version); els.verdict.textContent = data.verdict; + if (data.grade) { + els.grade.textContent = data.grade; + els.grade.className = "grade g-" + data.grade; + els.grade.hidden = false; + } + if (data.summary) { + els.summaryText.textContent = data.summary; + els.summaryText.hidden = false; + } + els.counts.innerHTML = ""; SEVERITIES.forEach(function (s) { var pill = document.createElement("span"); @@ -597,7 +682,7 @@

    Agent Analysis Report

    pill.innerHTML = '' + '' + SEV_LABEL[s] + '' + - '' + data.summary[s] + ""; + '' + data.counts[s] + ""; els.counts.appendChild(pill); }); els.overview.hidden = false; @@ -639,6 +724,16 @@

    Agent Analysis Report

    els.capabilities.hidden = false; } + function renderLensVerdicts(entries) { + if (!entries) { els.lensVerdicts.hidden = true; return; } + var rows = entries.map(function (e) { + return "
    " + esc(e.lens) + "
    " + esc(e.verdict) + "
    "; + }).join(""); + els.lensVerdicts.innerHTML = + "

    Per-lens verdicts

    " + rows + "
    "; + els.lensVerdicts.hidden = false; + } + function renderSanctum(s) { if (!s) { els.sanctum.hidden = true; return; } var rows = ""; @@ -676,6 +771,92 @@

    Agent Analysis Report

    els.experience.hidden = false; } + // Every copied fix prompt opens by anchoring the fixing session to the same + // standards that produced the findings, so the fix is held to the bar too. + function standardsPreamble() { + if (!standards || !standards.canon) { return []; } + var bar = standards.canon + (standards.principles ? " and " + standards.principles : ""); + var lines = [ + "Hold " + bar + " as the bar for every line you change — a fix that adds ceremony is a new finding, not a fix." + ]; + if (standards.scripts) { + lines.push("If the fix adds or changes scripts, follow " + standards.scripts + "."); + } + lines.push(""); + return lines; + } + + function composeThemePrompt(theme, resolved) { + var lines = standardsPreamble(); + lines.push("Fix the following theme in " + subjectPath + ": " + theme.title); + lines.push(""); + if (theme.root_cause) { lines.push("Root cause: " + theme.root_cause); } + if (theme.action) { lines.push("Fix: " + theme.action); } + if (resolved.length) { + lines.push(""); + lines.push("Findings to address:"); + resolved.forEach(function (f, i) { + lines.push((i + 1) + ". " + f.title); + if (f.location) { lines.push(" Location: " + f.location); } + if (f.evidence) { lines.push(" Evidence: " + f.evidence); } + if (f.recommendation) { lines.push(" Recommendation: " + f.recommendation); } + }); + } + return lines.join("\n") + "\n"; + } + + function renderThemes(themes) { + if (!themes.length) { els.themes.hidden = true; return; } + els.themes.innerHTML = "

    Themes

    "; + themes.forEach(function (t) { + var resolved = t.finding_ids + .map(function (id) { return findingsById[id]; }) + .filter(function (f) { return !!f; }); + var items = resolved.map(function (f) { + return '
    ' + esc(f.id) + " " + + esc(f.title) + + (f.location ? ' · ' + esc(f.location) + "" : "") + + "
    "; + }).join(""); + + var node = document.createElement("div"); + node.className = "theme"; + node.innerHTML = + '
    ' + esc(t.title) + "" + + '
    ' + + (t.root_cause ? '
    Root cause: ' + esc(t.root_cause) + "
    " : "") + + (t.action ? '
    Fix: ' + esc(t.action) + "
    " : "") + + (items ? '
    ' + items + "
    " : ""); + node.querySelector(".t-fix").addEventListener("click", function () { + copyText(composeThemePrompt(t, resolved)); + }); + els.themes.appendChild(node); + }); + els.themes.hidden = false; + } + + function renderStrengths(list) { + if (!list.length) { els.strengths.hidden = true; return; } + els.strengths.innerHTML = + "

    Strengths

      " + + list.map(function (s) { return "
    • " + esc(s) + "
    • "; }).join("") + + "
    "; + els.strengths.hidden = false; + } + + function renderRecommendations(recs) { + if (!recs.length) { els.recommendations.hidden = true; return; } + var html = "

    Recommendations

    "; + recs.forEach(function (r) { + html += '
    #' + esc(String(r.rank)) + "" + + esc(r.action) + + (r.resolves ? 'resolves: ' + esc(r.resolves) + "" : "") + + "
    "; + }); + els.recommendations.innerHTML = html; + els.recommendations.hidden = false; + } + function renderNoFindings() { els.root.innerHTML = '
    ' + @@ -760,12 +941,13 @@

    Agent Analysis Report

    function composePrompt() { var picked = findings.filter(function (f) { return selected[f.id]; }); if (picked.length === 0) { return ""; } - var lines = []; - lines.push("Please address the following findings from the agent analysis:"); + var lines = standardsPreamble(); + lines.push("Fix the following issues in " + subjectPath + ":"); lines.push(""); picked.forEach(function (f, i) { lines.push((i + 1) + ". " + f.title); if (f.location) { lines.push(" Location: " + f.location); } + if (f.evidence) { lines.push(" Evidence: " + f.evidence); } if (f.recommendation) { lines.push(" Recommendation: " + f.recommendation); } if (f.proposed_smallest) { lines.push(" Proposed smallest: " + f.proposed_smallest); } lines.push(""); @@ -794,8 +976,7 @@

    Agent Analysis Report

    showToast("Copy the text shown below"); } - function doCopy() { - var text = composePrompt(); + function copyText(text) { if (!text) { return; } if (navigator.clipboard && navigator.clipboard.writeText) { navigator.clipboard.writeText(text).then( @@ -807,6 +988,10 @@

    Agent Analysis Report

    } } + function doCopy() { + copyText(composePrompt()); + } + function wireToolbar() { els.btnCopy.addEventListener("click", doCopy); els.btnSelectAll.addEventListener("click", function () { @@ -845,12 +1030,33 @@

    Agent Analysis Report

    } var data = normalize(parsed); + + if (data.subject === PLACEHOLDER_SUBJECT) { + els.subject.textContent = data.subject; + showBanner( + "This is the unfilled report shell.\n\n" + + "The report-data island still carries the placeholder subject, so " + + "there are no findings here. Generate a real report with " + + "scripts/render_report.py." + ); + return; + } + findings = data.findings; + subjectPath = data.subject; + standards = data.standards; + findingsById = Object.create(null); + findings.forEach(function (f) { findingsById[f.id] = f; }); + renderProfile(data.agent_profile); renderOverview(data); renderCapabilities(data.capabilities); + renderLensVerdicts(data.detailed_analysis); renderSanctum(data.sanctum); renderExperience(data.experience); + renderThemes(data.themes); + renderStrengths(data.strengths); + renderRecommendations(data.recommendations); renderFindings(findings); wireToolbar(); updateSelection(); diff --git a/skills/bmad-agent-builder/customize.toml b/skills/bmad-agent-builder/customize.toml new file mode 100644 index 0000000..b5b85d1 --- /dev/null +++ b/skills/bmad-agent-builder/customize.toml @@ -0,0 +1,48 @@ +# DO NOT EDIT -- overwritten on every update. +# +# Customization surface for bmad-agent-builder. This governs how the builder +# builds: the org-wide context, standards, and gates applied to every agent it +# produces. It is distinct from the per-built-agent customize.toml the builder +# emits during an individual build. +# +# Override files (not edited here): +# {project-root}/_bmad/custom/bmad-agent-builder.toml (team) +# {project-root}/_bmad/custom/bmad-agent-builder.user.toml (personal) + +[agent] + +# --- Configurable below. Overrides merge per BMad structural rules: --- +# scalars: override wins • arrays: append + +# Steps to run before standard activation (config load, greet). +# Use for org pre-flight loads or compliance checks. 
+activation_steps_prepend = [] + +# Steps to run after intent routing, before the build/analyze loop begins. +activation_steps_append = [] + +# Standards the builder keeps in mind for the whole session, loaded as context +# into every build and analyze. Each entry is a literal sentence, a `skill:` +# skill, or a `file:` path/glob whose contents load as facts. Use for house +# conventions you want present but not hard-gated (for gates, see build_standards). +# "Every agent persona names its owner relationship explicitly." +# "file:{project-root}/_bmad/standards/agent-house-style.md" +persistent_facts = ["file:{project-root}/**/project-context.md"] + +# Executed when a build or analyze run completes, after the user has been told +# the artifact is ready. String scalar (one instruction) or array (in order). +on_complete = "" + +# --- Builder gates --- + +# Hard standards every BUILT agent must satisfy. Unlike persistent_facts +# (context), these are enforced: applied as build criteria and checked again as +# a conformance pass during Analyze. Each entry is a `skill:`, `file:`, or +# plain-text directive. Append-only. Empty by default (no org gates). +build_standards = [] + +# Eval requirement for a build to be declared done. Empty (default) keeps evals +# opt-in, offered at the eval beat but never forced. +# "baseline" -- require a passing baseline run (agent beats the bare model) +# "any" -- require at least one eval case to exist and pass +evals_required = "" diff --git a/skills/bmad-agent-builder/references/agent-quality-principles.md b/skills/bmad-agent-builder/references/agent-quality-principles.md index 8c3ae9c..0a6215b 100644 --- a/skills/bmad-agent-builder/references/agent-quality-principles.md +++ b/skills/bmad-agent-builder/references/agent-quality-principles.md @@ -2,7 +2,7 @@ The build-plus-scan bar for agents. Loaded at build time so the author works to the standard from the start, and at analysis time so every lens verifies against the same standard. 

-The universal core lives in the canon, not here. For the core test, the two-version comparison, the deeper floor, writing what survives as a goal, progressive disclosure, the cheaper signals, and the habit, load `references/prompt-quality-canon.md` (shipped copy, resolves from the agent-builder root) or its published fallback at `{siteBase}/explanation/outcome-driven-prompt-quality/`. Everything below is what agents add on top of that core, because an agent is not a workflow and a few things change.
+The universal core lives in the canon, not here. For writing the destination, the tests, the two-version comparison, the deeper floor, the cheaper signals, and the habit, load `references/prompt-quality-canon.md` (shipped copy, resolves from the agent-builder root). Everything below is what agents add on top of that core, because an agent is not a workflow and a few things change.

 ## Persona is the deliverable

diff --git a/skills/bmad-agent-builder/references/build-process.md b/skills/bmad-agent-builder/references/build-process.md
index 0c51e33..10bd69f 100644
--- a/skills/bmad-agent-builder/references/build-process.md
+++ b/skills/bmad-agent-builder/references/build-process.md
@@ -9,7 +9,9 @@ description: The single Process loop for building or rebuilding a BMad agent. On

 This is one loop, not a sequence of phases. It carries Create and Rebuild, because a rebuild is the same loop pointed at an existing agent treated as a description of intent rather than a template to copy. The order below is the usual order of discovery, but nothing forces you to march through it; pursue whichever outcome the conversation is ready for and revisit earlier ones as the picture sharpens. Each outcome is a thing you want to be true, not a box to tick.

-Load `references/agent-quality-principles.md` before you draft anything, because it is the same bar the lenses verify against and building to it from the start is cheaper than fixing later.
It cedes the universal core to `references/prompt-quality-canon.md`, so hold the canon's tests while you work and load it when you author or refine any capability. Load `references/agent-type-guidance.md` for the gradient and the routing questions, and `references/standard-fields.md` for field definitions, naming, and path rules.
+Load `references/prompt-quality-canon.md` before anything else and hold it as the governing standard for every capability-prompt line you draft — this file deliberately does not restate it, so a section below that names a canon test expects you to already carry it.
+
+Load `references/agent-quality-principles.md` alongside it for what agents add on top (the persona carve-out, the archetype bars, the capability fork, the config surface), `references/agent-type-guidance.md` for the gradient and the routing questions, and `references/standard-fields.md` for field definitions, naming, and path rules.

 ## Understand why the user came

@@ -17,22 +19,27 @@ Before you read a single artifact, understand who this agent is, how it should m

 Type emerges here from natural questions, not a menu. Ask whether the agent needs to remember between sessions, which separates stateless from memory; whether the user should be able to teach it new capabilities after install, which gates evolvable capabilities; and whether it should operate on its own when no one is watching, which adds PULSE and makes it autonomous. Confirm the read back in plain words, and for a memory agent confirm relationship depth, since a deep partnership wants a calibration First Breath while a focused domain tool wants a warmer but quicker configuration setup.

+## Propose the agent the vision implies
+
+The dump tells you what the user pictured; offer what they did not.
Before drafting, propose the capabilities the mission implies but nobody named, the persona angle that would make this agent a specific character rather than a generic assistant, and push where the vision is thin — one agent or two, a recurring need or a one-off ask, a memory that would actually accrue or dead weight. A line each with why it fits; the user picks, and the declines land in the memlog so a later session does not re-propose them. An agent built only from the stated list ships the user's first draft of it.
+
 ## Capture into the memlog throughout

-As decisions and directions land, write them to `{target-agent-path}/.memlog.md` through `scripts/memlog.py`: `init --path {target-agent-path}/.memlog.md` once when the target is named, then `append --path {target-agent-path}/.memlog.md --type --text "..."` as things happen. For a new agent, propose a kebab-case name when the user did not give one; renaming later is a logged decision, not a redo. This `.memlog.md` is the builder's process trace and lives beside the built agent's SKILL.md. It is not the sanctum. The sanctum is the built agent's own runtime memory at `{project-root}/_bmad/memory/{skillName}/`, written by the agent at runtime, never by this log. A memlog entry records a build decision and sanctum content is the agent's living state, so neither ever holds the other's material. Capture as you go so the reasoning is caught while fresh, because the memlog is the resume source and the trail you walk with the user at handoff.
+As decisions and directions land, write them to `{target-agent-path}/.memlog.md` through `scripts/memlog.py`: `init --path {target-agent-path}/.memlog.md` once when the target is named, then `append --path {target-agent-path}/.memlog.md --type --text "..."` as things happen. For a new agent, propose a kebab-case name when the user did not give one; renaming later is a logged decision, not a redo.
This `.memlog.md` is the builder's process trace beside the built agent's SKILL.md, never the agent's sanctum — a memlog entry records a build decision, sanctum content is the agent's living runtime state, and neither ever holds the other's material. Capture as you go so the reasoning is caught while fresh, because the memlog is the resume source and the trail you walk with the user at handoff.

 ## Write the minimal outcome-driven version first

-Draft the smallest agent that could work. Hold the persona and capabilities to the role, the core outcome, the consumer of the output, and any rule whose absence has already caused real damage. Apply the canon's core test to every capability-prompt line you are tempted to write, because a capable model given the persona and the outcome does not need to be told how. The persona is the exception the canon's leanness bar does not touch: write the voice, the communication-style examples, the domain framing, and the design rationale out in full, because the persona is the path the model takes through every capability and a flatter version is a worse outcome, not a leaner one.
+Draft the canon's small version of the agent: the smallest persona-plus-capabilities that could work, written as destination rather than route, with everything else staying out until a comparison earns it. The one exception is the persona carve-out from `references/agent-quality-principles.md`: write the voice, the communication-style examples, the domain framing, and the design rationale out in full.

 ### Fork on capability versus skill reference

-For each capability the agent needs, decide which of two forms it takes, applying the criteria identically now and at the agent's own evolve time:
+For each capability the agent needs, fork between referencing an installed skill and authoring an internal capability per the criteria in `references/agent-quality-principles.md`, applied identically now and at the agent's own evolve time.
Always ask before installing anything, and when external skills are in play suggest `bmad-module-builder` so the agent ships bundled with its dependencies.

-- Reference an installed skill when a skill already covers the capability. Suggest the reference, and always ask before installing anything. When external skills are in play, suggest `bmad-module-builder` so the agent ships bundled with its dependencies.
-- Author an internal capability only when it is genuinely novel, or when it is tightly coupled to the persona such that a generic skill would lose the agent's voice or context.
+When you author an internal capability, route the authoring through the canon and the `assets/capability-authoring-template.md` mechanics, and give every internal prompt-type capability its frontmatter (name, description, code, added, type) and an outcome-focused body. `references/sample-capability-prompt.md` is the worked example of the bar.

-When you author an internal capability, route the authoring through the canon and the `assets/capability-authoring-template.md` mechanics, hold the canon's tests while you write the body, and give every internal prompt-type capability its frontmatter (name, description, code, added, type) and an outcome-focused body. The internal capability is a skill that happens to live inside an agent; the only thing that relaxes is that the persona supplies the how.
+## Show the draft before you wire it
+
+Present the minimal version while it is still cheap to change: the persona voice in its own words, the capability list with a line each, and how First Breath will feel for a memory agent. Name the places you are least sure of rather than presenting a finished thing, and iterate until the user recognizes their agent in it. The first time they see the agent must not be at handoff.

 ## Hunt for script opportunities throughout

@@ -46,20 +53,24 @@ An agent that has never run is a guess.
At the eval beat, invoke the standalone
 - Baseline mode confirms the agent beats the bare model on the same input, since an agent that does not has no reason to exist.
 - Quality or variant mode settles a finding about a single capability prompt by running a smaller version against the same input, which is how a defend-against-absence question gets answered rather than argued.

+Eval cases live at `{target-agent-path}/evals/cases.json`. `{agent.evals_required}` overrides the opt-in default: when empty (default) the modes stay opt-in as above; `"baseline"` requires a passing baseline run before the build is done; `"any"` requires at least one case to exist and pass. If a required run fails or cannot be produced, the build is blocked, not shipped.
+
 ## Decide customization with the explicit ask

-Ask once, interactive only, and default to no: "Should this agent expose override hooks such as activation steps or persistent facts so teams can customize it without forking?" Log the answer to the memlog either way. The archetype shapes the default. Memory and autonomous agents default to no because the sanctum is already their customization surface and a TOML override competes with it; offer the opt-in only when the user has a concrete pre-sanctum-load need such as an org-mandated compliance preload. Stateless agents are the natural candidate, so offer the opt-in there and accept either answer. Headless defaults to no unless the invocation explicitly asks for customization.
+Ask once, interactive only, and default to no: "Should this agent expose override hooks such as activation steps or persistent facts so teams can customize it without forking?" Log the answer to the memlog either way. `references/agent-quality-principles.md` owns the surface contract — the always-present `[agent]` metadata block every agent emits, the archetype defaults, and the forbidden mechanisms.
The one build-time judgment beyond it: offer the opt-in to a memory or autonomous agent only on a concrete pre-sanctum-load need such as an org-mandated compliance preload, since the sanctum is already their customization surface.

-Every agent still emits a `customize.toml`, because its always-present `[agent]` metadata block (code, name, title, icon, description, agent_type) is the install-time roster contract the installer reads to populate `module.yaml`. customize.toml is the only build-time config surface, and First Breath and init-sanctum are runtime sanctum initialization rather than build config, so they stay out of it; `references/agent-quality-principles.md` carries the forbidden-mechanisms list. When the opt-in is yes, retain the override block, append any swappable scalars following the `*_template` / `*_output_path` / `on_` conventions, and add the resolver activation step to SKILL.md so it reads scalars as `{agent.}`. When it is no, emit metadata only and SKILL.md uses hardcoded paths.
+When the opt-in is yes, retain the override block, append any swappable scalars following the `*_template` / `*_output_path` / `on_` conventions, and add the resolver activation step to SKILL.md so it reads scalars as `{agent.}`. When it is no, emit metadata only and SKILL.md uses hardcoded paths.

 ## Strip ceremony and ship

-Confirm the agent passes its own leanness bar before handoff, because the builder has no standing to teach leanness while shipping bloat. The leanness pass cuts ceremony from capability prompts and never flattens the persona. Ship the canon copy into the built agent at its `references/prompt-quality-canon.md` exactly as the vendored scripts are copied, so an evolving agent resolves the standard from its own root. Run the lint gate over the built agent (`scripts/scan-path-standards.py` and `scripts/scan-scripts.py` in parallel, fixing high or critical findings and re-running), and run unit tests if the built agent carries scripts.
+Confirm the agent passes its own leanness bar before handoff, because the builder has no standing to teach leanness while shipping bloat. The leanness pass cuts ceremony from capability prompts and never flattens the persona. Copy `assets/prompt-quality-canon.md` into the built agent at `references/prompt-quality-canon.md`, so an evolving agent resolves the standard from its own root. Run the lint gate over the built agent (`scripts/scan-path-standards.py` and `scripts/scan-scripts.py` in parallel, fixing high or critical findings and re-running), and run unit tests if the built agent carries scripts. Verify the agent satisfies every directive in `{agent.build_standards}`; treat each as a required criterion, not a suggestion, and resolve any miss before handoff.

 ## The output tree

 Every agent shares one output tree. The archetype changes which parts are present and the SKILL.md weight, captured in the delta table below rather than three separate trees.

+Emit each file from its matching template in this builder's `assets/`, applying `references/template-substitution-rules.md` for tokens, conditionals, and template selection — deterministically, via `python3 scripts/process-template.py