diff --git a/.agents/skills/build.md b/.agents/skills/build.md index 95e5e292..510dd634 100644 --- a/.agents/skills/build.md +++ b/.agents/skills/build.md @@ -116,6 +116,17 @@ Every commit in SDP-managed repos SHOULD carry provenance trailer: - Edit files in main tree (always use worktree) - Commit raw `.sdp/runs/pi-review/*` telemetry unless the workstream explicitly requires it; use compact verdict/evidence instead. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. | +| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. | +| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. | +| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. | +| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. | +| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. | + ## Response Format After completing work, report: diff --git a/.agents/skills/review.md b/.agents/skills/review.md index bf39c9f0..924241f0 100644 --- a/.agents/skills/review.md +++ b/.agents/skills/review.md @@ -54,6 +54,17 @@ gates are green, no P0/P1 remain, and `.sdp/review_verdict.json` records a compact maintainer note. Never commit raw `.sdp/runs/pi-review/*` telemetry by default. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. | +| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. | +| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. | +| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. | +| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. | +| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. | + ## Routing Rules Dimension based on: (1) Diff size: small (<50 lines) → code only, large → multiple dimensions. diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl index 29291214..a314916b 100644 --- a/.beads/issues.jsonl +++ b/.beads/issues.jsonl @@ -358,6 +358,7 @@ {"_type":"issue","id":"sdplab-7","title":"F061-02: bd ready → sdp ready bridge","status":"closed","priority":1,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-02-28T21:35:17Z","created_by":"Andrey Zhukov","updated_at":"2026-04-20T14:27:34Z","closed_at":"2026-04-20T14:27:34Z","close_reason":"Verified: code exists, 218 tests pass across guard/evidence/monitor/beads/workstream packages. WS files marked done with all acceptance criteria checked.","labels":["F061","beads","ecosystem"],"dependency_count":0,"dependent_count":0,"comment_count":0} {"_type":"issue","id":"sdplab-2","title":"F059-02: Session evidence emitter","status":"closed","priority":1,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-02-28T21:34:39Z","created_by":"Andrey Zhukov","updated_at":"2026-04-20T14:27:00Z","closed_at":"2026-04-20T14:27:00Z","close_reason":"Verified: code exists, 218 tests pass across guard/evidence/monitor/beads/workstream packages. WS files marked done with all acceptance criteria checked.","labels":["F059","ecosystem","ohmyopencode"],"dependency_count":0,"dependent_count":0,"comment_count":0} {"_type":"issue","id":"sdplab-3","title":"F059-01: Pre-tool-call guard hook","status":"closed","priority":1,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-02-28T21:34:39Z","created_by":"Andrey Zhukov","updated_at":"2026-04-20T14:27:00Z","closed_at":"2026-04-20T14:27:00Z","close_reason":"Verified: code exists, 218 tests pass across guard/evidence/monitor/beads/workstream packages. WS files marked done with all acceptance criteria checked.","labels":["F059","ecosystem","ohmyopencode"],"dependency_count":0,"dependent_count":0,"comment_count":0} +{"_type":"issue","id":"sdplab-4cxu","title":"F168-09: Apply harness/skill operating discipline phase 1","description":"Follow-up to the 2026-05-15 harness/skill synthesis. Apply Phase 1 only: update skill authoring policy with trigger/exclusion/verification/degraded-evidence requirements; add reference vocabulary for tool risk classes and degraded evidence; add common rationalizations to build and review. Runtime manifest enforcement and model-routing measurement are out of scope.","acceptance_criteria":"- [ ] docs/reference/skill-authoring.md defines Do Not Use When, Verification, and Degraded Evidence requirements.\n- [ ] docs/reference contains a reusable tool-risk/degraded-evidence reference.\n- [ ] prompts/skills/build/SKILL.md has Common Rationalizations for skipped specs, evidence, prompt-only safety, and review shortcuts.\n- [ ] prompts/skills/review/SKILL.md has Common Rationalizations for empty reviewer output, single-family review, missing provenance, and rubber-stamp coverage.\n- [ ] Runtime support claims introduced by this work are explicitly not_assessed_runtime unless dispatch evidence exists.\n- [ ] Skill lint is run and result recorded.","status":"in_progress","priority":2,"issue_type":"task","assignee":"Andrei","owner":"a_v_zhukov@outlook.com","created_at":"2026-05-15T08:26:12Z","created_by":"Andrei","updated_at":"2026-05-15T08:26:18Z","started_at":"2026-05-15T08:26:18Z","labels":["F168","docs","harness","skills"],"dependency_count":0,"dependent_count":0,"comment_count":0} {"_type":"issue","id":"sdplab-tsbi","title":"F168 finding: stale Claude sweep command is outside manifest source of truth","description":"source=pi-review+local verification; feature=F168; workstream=00-168-02; blocking=false. .claude/commands/sweep.md exists and advertises broad autonomous backlog execution, but sdp.manifest.yaml and prompts/commands have no sweep command source. Decide whether to delete it, add it to manifest as experimental, or move it behind explicit future-work docs so generated adapter inventory is truthful.","status":"closed","priority":2,"issue_type":"bug","owner":"a_v_zhukov@outlook.com","created_at":"2026-05-13T09:34:15Z","created_by":"Andrei","updated_at":"2026-05-14T08:47:41Z","closed_at":"2026-05-14T08:47:41Z","close_reason":"merged in PR #153 (F168 onboarding quality taxonomy)","dependencies":[{"issue_id":"sdplab-tsbi","depends_on_id":"sdplab-o8gk","type":"discovered-from","created_at":"2026-05-13T12:34:15Z","created_by":"Andrei","metadata":"{}"}],"dependency_count":0,"dependent_count":0,"comment_count":0} {"_type":"issue","id":"sdplab-o8gk.8","title":"F168-08: End-to-end onboarding quality calibration run","description":"Run the completed F168 flow against SDP onboarding and record calibration evidence: actual commands, docs promises, review axes, created findings, and unresolved gaps.","status":"closed","priority":2,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-05-13T05:46:50Z","created_by":"Andrei","updated_at":"2026-05-14T08:47:53Z","closed_at":"2026-05-14T08:47:53Z","close_reason":"merged in PR #153 (F168 onboarding quality taxonomy)","labels":["F168","calibration","onboarding","pi-review","quality"],"dependencies":[{"issue_id":"sdplab-o8gk.8","depends_on_id":"sdplab-o8gk","type":"parent-child","created_at":"2026-05-13T08:46:49Z","created_by":"Andrei","metadata":"{}"}],"dependency_count":0,"dependent_count":0,"comment_count":0} {"_type":"issue","id":"sdplab-o8gk.7","title":"F168-07: CI/advisory rollout and Beads findings loop","description":"Connect deterministic and model-review axes into CI/advisory rollout with Beads finding creation for blocking issues. Avoid fake-green checks; absent credentials and missing tools must produce cannot_verify or not_assessed.","status":"closed","priority":2,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-05-13T05:46:49Z","created_by":"Andrei","updated_at":"2026-05-14T08:47:52Z","closed_at":"2026-05-14T08:47:52Z","close_reason":"merged in PR #153 (F168 onboarding quality taxonomy)","labels":["F168","beads","ci","onboarding","pi-review","quality"],"dependencies":[{"issue_id":"sdplab-o8gk.7","depends_on_id":"sdplab-o8gk","type":"parent-child","created_at":"2026-05-13T08:46:48Z","created_by":"Andrei","metadata":"{}"}],"dependency_count":0,"dependent_count":0,"comment_count":0} diff --git a/.sdp/generated/.codex/skills/build.md b/.sdp/generated/.codex/skills/build.md index f47a5263..2c9d6cd2 100644 --- a/.sdp/generated/.codex/skills/build.md +++ b/.sdp/generated/.codex/skills/build.md @@ -40,6 +40,17 @@ Continuation is the orchestrator's job (@oneshot / sdp orchestrate). 5. **MODERN GO FOR GO CODE** — When touched files are Go, load `@go-modern` and prefer safe stdlib modernizations before inventing helpers. 6. **PI FINDINGS NEED REGRESSION TESTS** — For prompt-injection or review-finding fixes, add a deterministic regression test for the exact failed vector before closing the finding bead. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. | +| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. | +| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. | +| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. | +| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. | +| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. | + --- ## Git Safety diff --git a/.sdp/generated/.codex/skills/review.md b/.sdp/generated/.codex/skills/review.md index b687c334..fc911f25 100644 --- a/.sdp/generated/.codex/skills/review.md +++ b/.sdp/generated/.codex/skills/review.md @@ -111,6 +111,17 @@ Rules: huge provider error text or full prompts into the verdict, replace it with a compact verdict that preserves model status, P0/P1 counts, and override reason. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. | +| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. | +| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. | +| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. | +| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. | +| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. | + ## Write Plan (F101) Before writing review output files (verdict, findings), emit a write plan: diff --git a/.sdp/generated/.opencode/skill/build.md b/.sdp/generated/.opencode/skill/build.md index f47a5263..2c9d6cd2 100644 --- a/.sdp/generated/.opencode/skill/build.md +++ b/.sdp/generated/.opencode/skill/build.md @@ -40,6 +40,17 @@ Continuation is the orchestrator's job (@oneshot / sdp orchestrate). 5. **MODERN GO FOR GO CODE** — When touched files are Go, load `@go-modern` and prefer safe stdlib modernizations before inventing helpers. 6. **PI FINDINGS NEED REGRESSION TESTS** — For prompt-injection or review-finding fixes, add a deterministic regression test for the exact failed vector before closing the finding bead. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. | +| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. | +| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. | +| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. | +| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. | +| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. | + --- ## Git Safety diff --git a/.sdp/generated/.opencode/skill/review.md b/.sdp/generated/.opencode/skill/review.md index b687c334..fc911f25 100644 --- a/.sdp/generated/.opencode/skill/review.md +++ b/.sdp/generated/.opencode/skill/review.md @@ -111,6 +111,17 @@ Rules: huge provider error text or full prompts into the verdict, replace it with a compact verdict that preserves model status, P0/P1 counts, and override reason. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. | +| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. | +| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. | +| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. | +| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. | +| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. | + ## Write Plan (F101) Before writing review output files (verdict, findings), emit a write plan: diff --git a/.sdp/generated/.pi/skills/build/SKILL.md b/.sdp/generated/.pi/skills/build/SKILL.md index f47a5263..2c9d6cd2 100644 --- a/.sdp/generated/.pi/skills/build/SKILL.md +++ b/.sdp/generated/.pi/skills/build/SKILL.md @@ -40,6 +40,17 @@ Continuation is the orchestrator's job (@oneshot / sdp orchestrate). 5. **MODERN GO FOR GO CODE** — When touched files are Go, load `@go-modern` and prefer safe stdlib modernizations before inventing helpers. 6. **PI FINDINGS NEED REGRESSION TESTS** — For prompt-injection or review-finding fixes, add a deterministic regression test for the exact failed vector before closing the finding bead. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. | +| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. | +| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. | +| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. | +| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. | +| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. | + --- ## Git Safety diff --git a/.sdp/generated/.pi/skills/review/SKILL.md b/.sdp/generated/.pi/skills/review/SKILL.md index b687c334..fc911f25 100644 --- a/.sdp/generated/.pi/skills/review/SKILL.md +++ b/.sdp/generated/.pi/skills/review/SKILL.md @@ -111,6 +111,17 @@ Rules: huge provider error text or full prompts into the verdict, replace it with a compact verdict that preserves model status, P0/P1 counts, and override reason. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. | +| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. | +| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. | +| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. | +| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. | +| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. | + ## Write Plan (F101) Before writing review output files (verdict, findings), emit a write plan: diff --git a/docs/reference/harness-risk-and-evidence.md b/docs/reference/harness-risk-and-evidence.md new file mode 100644 index 00000000..bed4e76e --- /dev/null +++ b/docs/reference/harness-risk-and-evidence.md @@ -0,0 +1,58 @@ +# Harness Risk And Evidence + +Status: reference + +This vocabulary keeps SDP skill and harness claims honest. It is intentionally +small: use it in skills, review reports, adapter checks, and evidence summaries +without turning every task into a policy project. + +## Tool Risk Classes + +| Class | Meaning | Default policy | +|---|---|---| +| `perception` | Read-only inspection: files, logs, docs, links, local state. | Allowed for most roles. | +| `analysis` | Local computation or synthesis without writes or external side effects. | Allowed with recorded evidence. | +| `local_write` | Edits, generated artifacts, local database or checkpoint changes. | Implementer/workflow scope only. | +| `external_write` | Push, publish, create or update a remote system, send messages. | Explicit workflow gate required. | +| `irreversible` | Merge, deploy, delete, rotate credentials, spend money. | Explicit human or workflow authorization required. | + +Prompt text may describe a boundary, but it is not the boundary. If the harness +cannot enforce a risk-class gate, record the claim as `not_assessed_runtime` or +`manual_gate_only`. + +## Evidence States + +| State | Meaning | +|---|---| +| `passed` | Evidence completed and supports the claim. | +| `failed` | Evidence completed and contradicts the claim. | +| `not_assessed` | The plane was not run. | +| `failed_provider` | Provider returned an explicit error. | +| `timeout` | Run exceeded the bounded window. | +| `empty_output` | Run completed with no useful content. | +| `off_task` | Output did not address the requested plane. | +| `unavailable_cli` | Required local tool was missing or could not run. | +| `unverified_benchmark` | Vendor or third-party claim was not validated on SDP tasks. | +| `not_assessed_runtime` | Static files exist, but runtime behavior was not proven. | +| `manual_gate_only` | The workflow used an explicit human/workflow gate because runtime enforcement is unavailable. | + +Missing evidence is not a pass. Use the degraded state that preserves what +actually happened. + +## Assignment Rule + +Deterministic tool output wins over model prose. If a model claims a check +passed but the tool output is missing, classify the check as `not_assessed`. +If states conflict, report the more conservative degraded state until a human or +orchestrator inspects the evidence. + +## Common Examples + +- A review provider returns no findings because the process timed out: + `timeout`, not `passed`. +- A harness adapter file exists but no dispatch run proves it loads: + `not_assessed_runtime`. +- A skill says "do not push", but the harness cannot block pushing: + `manual_gate_only` for that action class unless another runtime gate exists. +- A model vendor page claims strong coding benchmarks: + `unverified_benchmark` until reproduced on SDP tasks. diff --git a/docs/reference/skill-authoring.md b/docs/reference/skill-authoring.md index 76772507..88aad49b 100644 --- a/docs/reference/skill-authoring.md +++ b/docs/reference/skill-authoring.md @@ -1,7 +1,8 @@ # Skill Authoring -- SDP Multi-Harness > **Audience:** Authors creating a new SDP skill. -> **Canonical location:** `.agents/skills/.md` (multi-harness; see `.agents/skills/README.md`). +> **Canonical structured source:** `prompts/skills//SKILL.md` (manifest-backed). +> **Runtime alias surface:** `.agents/skills/.md` for harnesses that read flat skills. > **Policy source:** F127-03 (`docs/plans/2026-04-16-f127-multi-harness-modernization-design.md`). ## Why a policy @@ -11,16 +12,27 @@ SDP-skills must work identically in all major harnesses (Claude Code, OpenCode, - versionable (semver → breaking changes are visible); - portable between harnesses without modifications. -## File location +## File Location -**Canonical:** `.agents/skills/.md`. +**Canonical structured source:** `prompts/skills//SKILL.md`. + +This is the path listed in `sdp.manifest.yaml`, published to the public `sdp` +repo when protocol artifacts are exported, and used by generated adapters. + +**Runtime alias/stub surface:** `.agents/skills/.md`. + +Flat `.agents/skills/*.md` files serve harnesses that discover flat skills +directly, especially OpenCode/Cursor/Kimi-style loaders. Keep the alias/stub in +sync with the structured source when changing behavior. **Do not** put real files in: - `skills/` (removed in F128; no longer exists); -- `.claude/skills/` (symlink to `.agents/skills/`); -- `prompts/skills//SKILL.md` (published artifacts: edit the source in `.agents/skills/` and publish to the public `sdp` repo via `scripts/sdp-publish.sh`). +- `.claude/skills/` (symlink/adapted surface for harness loading); +- generated adapter directories such as `.sdp/generated/`, `.codex/`, `.opencode/`, or `.pi/`. -Filename is `.md`, matches `name:` in frontmatter. +For structured skills, the directory is `/` and frontmatter +`name:` matches the directory. For flat aliases, the filename is +`.md` and matches `name:`. ## YAML Frontmatter -- required @@ -49,6 +61,7 @@ compatibility: requires_mcp: [] # list of MCP servers if skill expects MCP-tools requires_cli: [] # list of CLI binaries on PATH (sdp, bd, gh, …) tags: [discovery, review] # free tags for search +tool_risk_classes: [perception, analysis] ``` | Field | When to specify | @@ -57,6 +70,11 @@ tags: [discovery, review] # free tags for search | `requires_mcp` | If skill expects a specific MCP server (e.g., `beads`, `claude-api`). Empty array = no requirements. | | `requires_cli` | CLI binaries without which the skill will not run (example: `[sdp, bd]`). | | `tags` | For lint/search/marketplace indexing. | +| `tool_risk_classes` | Side-effect classes used by the skill. Use the vocabulary in [harness-risk-and-evidence.md](harness-risk-and-evidence.md). | + +`compatibility` is a runtime claim only after the harness has dispatch evidence. +Before that, treat it as intended portability and report runtime coverage as +`not_assessed_runtime` for that harness. ## Body structure (recommended) @@ -75,7 +93,9 @@ compatibility: [claude-code, opencode, cursor, codex] ## Use When - bullet-list of situations when to apply -- and when NOT to apply (anti-patterns) + +## Do Not Use When +- common false triggers and anti-patterns ## Inputs / Outputs What skill expects on input (context, files, args), what it returns. @@ -83,6 +103,17 @@ What skill expects on input (context, files, args), what it returns. ## Process Execution steps. If >5 steps -- break into subsections. +## Verification +How the agent proves the skill did what it claims. Prefer deterministic evidence: +tests, lint output, schema validation, file existence, Beads/GitHub state, or +captured runtime logs. + +## Degraded Evidence +How the skill reports missing or partial evidence. Use +`not_assessed`, `timeout`, `empty_output`, `off_task`, `unavailable_cli`, +`not_assessed_runtime`, or `manual_gate_only` rather than collapsing unknowns +into pass. + ## References - related skills - design docs (docs/plans/YYYY-MM-DD-*.md) @@ -90,6 +121,8 @@ Execution steps. If >5 steps -- break into subsections. ``` Do not duplicate common rules (beads rules, git workflow) -- reference `AGENTS.md`. +Use [harness-risk-and-evidence.md](harness-risk-and-evidence.md) for shared +tool-risk classes and degraded-evidence vocabulary. ## Boundary With AGENTS.md diff --git a/docs/research/2026-05-15-agent-skill-operating-rules.md b/docs/research/2026-05-15-agent-skill-operating-rules.md new file mode 100644 index 00000000..76ddc7bd --- /dev/null +++ b/docs/research/2026-05-15-agent-skill-operating-rules.md @@ -0,0 +1,263 @@ +# Agent Skill Operating Rules + +Status: research draft +Date: 2026-05-15 + +This document converts the useful parts of `addyosmani/agent-skills` into SDP +rules. It is not a request to copy that repository. SDP already has the stronger +system frame: manifest inventory, workstreams, Beads, evidence, adapter parity, +and prompt-injection boundaries. The borrowed value is the smaller operational +discipline that makes agents less likely to rationalize skipped steps. + +## Sources + +- `addyosmani/agent-skills`: `README.md`, `docs/skill-anatomy.md`, + `skills/using-agent-skills/SKILL.md`, + `skills/doubt-driven-development/SKILL.md`, + `skills/source-driven-development/SKILL.md`, + `skills/incremental-implementation/SKILL.md`, + `references/orchestration-patterns.md` +- SDP local references: + `docs/reference/skills.md`, + `docs/reference/skill-authoring.md`, + `docs/reference/agent-skill-entry-map.md`, + `docs/reference/multi-agent-patterns.md`, + `docs/reference/harness-integration.md` + +## Design Position + +Do not add one more generic skill stack on top of SDP. Improve the authoring and +runtime discipline of existing SDP skills. + +The target behavior is: + +- humans choose from a small intent surface; +- skills execute workflows, not advice; +- agents load only the context needed for the current decision; +- every non-trivial claim has evidence or an explicit `not_assessed` state; +- reviewers are independent lanes, not a single blended opinion; +- high-risk actions are bounded by runtime controls, not prompt text alone. + +## Rule 1: Skills Need Triggers And Exclusions + +Every skill description must say both what it does and when to use it. + +Required shape: + +```yaml +description: Does X. Use when Y. Do not use when Z if Z is a common false trigger. +``` + +The body must include: + +- `Use When` +- `Do Not Use When` +- `Inputs` +- `Outputs` +- `Process` +- `Verification` + +Why this matters: routing quality is UX. If the agent cannot decide when a skill +applies, the operator pays with clarification, retries, or silent wrong mode. + +## Rule 2: Skills Are Workflows, Not Reference Docs + +A skill should tell the agent what to do, in order, with evidence requirements. +It should not become a library of background facts. + +Keep module-local facts out of skills: + +- package APIs; +- internal import rules; +- runtime assumptions for one subtree; +- provider credentials; +- local package gates. + +Those belong in the nearest module-local `AGENTS.md`. Root `AGENTS.md` routes +policy. Skills own executable workflow. + +## Rule 3: Add Anti-Rationalization To Critical Skills + +For high-risk skills, add a `Common Rationalizations` table: + +| Rationalization | Reality | +|---|---| +| "This is simple enough to skip the spec." | Simple tasks may need a short spec, but still need acceptance criteria. | +| "I will test at the end." | Late testing hides which slice introduced the bug. | +| "The reviewer output was empty, so it passed." | Empty, hung, or off-task output is `not_assessed`. | +| "The model says it verified this." | Model prose is not evidence; tool output or inspected state is evidence. | +| "Prompt instructions are enough to enforce safety." | Prompt-only protection is not a security boundary. | +| "One generic review is enough." | Trust-sensitive work needs separate code, evidence/tracing, and requirements planes. | + +Apply first to: + +- `build` +- `review` +- `ship` +- `debug` +- `delivery-loop` +- `spec-interrogate` + +## Rule 4: Use Doubt-Driven Development For Non-Trivial Claims + +For non-trivial decisions, use a bounded doubt cycle: + +1. `CLAIM`: state the claim and why it matters. +2. `EXTRACT`: isolate the artifact and contract; strip author reasoning. +3. `DOUBT`: send only artifact and contract to an adversarial reviewer. +4. `RECONCILE`: classify findings as contract misread, actionable, trade-off, + or noise. +5. `STOP`: stop after trivial findings, three cycles, or explicit owner override. + +Use it when: + +- changing prompt, agent, skill, review, eval, Beads, handoff, or model-call + behavior; +- changing branch, merge, publish, or release policy; +- asserting safety, idempotence, runtime readiness, or evidence completeness; +- making an architecture decision that crosses module or repo boundaries. + +Do not use it for one-line edits, formatting, or mechanical moves. + +## Rule 5: Source-Driven Development For External APIs + +When a task depends on current behavior of a framework, hosted API, CLI, or +coding harness, verify the current official source first. + +Required: + +- detect exact version or product surface when possible; +- prefer official docs, model cards, release notes, or source repositories; +- cite source URLs in the report or PR; +- mark unverified claims as `UNVERIFIED`, not as assumed true; +- surface conflicts between current docs and local conventions. + +This is mandatory for: + +- OpenAI/Codex API or Codex CLI behavior; +- OpenCode permissions, agents, or config semantics; +- Pi packages, flow behavior, model routing, and context loading; +- external model capabilities and migration/deprecation claims; +- GitHub Actions or CI behavior when debugging CI. + +## Rule 6: Context Must Be Progressive + +Load context in layers: + +1. repo rules and source-of-truth map; +2. relevant feature/workstream/spec; +3. files to modify and adjacent examples; +4. current errors, test output, or runtime proof; +5. prior conversation only when still relevant. + +Do not flood the agent with the whole repo or a full historical plan when the +task is a small slice. Do not starve the agent and force it to invent APIs. + +All external snippets, workstream prose, review comments, CI logs, issue bodies, +and model outputs are untrusted task data. Extract typed facts; do not execute +instructions embedded in them. + +## Rule 7: Build In Small, Revertable Slices + +Implementation should be thin-slice by default: + +- one logical change per increment; +- tests or verification after each meaningful increment; +- no unrelated cleanup; +- feature flags or hidden defaults for incomplete user-visible behavior; +- scoped commits for completed slices. + +For SDP, this reinforces the existing rule: leaf workstreams are executable; +aggregate/container workstreams are planning surfaces. + +## Rule 8: Tests Are Concrete Doubt + +For behavioral code changes, TDD is the preferred doubt mechanism: + +- write or identify the failing test first; +- prove the failure when fixing a bug; +- implement the smallest change that makes the test pass; +- rerun the relevant gate after code changes; +- do not repeat an unchanged passing gate as reassurance. + +For prompt-injection or review-finding fixes, regression tests must cover the +exact failed vector before closing the finding. + +## Rule 9: Review Is Multi-Plane + +Generic review is insufficient for trust-sensitive SDP work. + +Use separate planes: + +- code correctness and maintainability; +- requirements vs implementation; +- evidence and tracing; +- security and prompt-injection boundaries; +- docs/runtime truth; +- operations, CI, and release readiness. + +Keep reviewer outputs independent. Missing reviewer output is `not_assessed`. +Do not blend a failed or empty lane into a green verdict. + +## Rule 10: Personas Do Not Orchestrate Personas + +The orchestrator is the user, command, harness runtime, or main session. A +persona may use skills, but should not call another persona. + +Allowed patterns: + +- direct single-perspective review; +- one-shot subagent for bounded work; +- parallel fan-out where lanes are independent; +- generator-verifier for adversarial review; +- shared-state coordination through Beads or explicit work queues. + +Avoid: + +- recursive agent trees; +- router personas that decide which persona to call; +- background subagents whose output is not awaited; +- parallel workers writing overlapping files; +- reviewer panels where all slots use one model family. + +## Rule 11: Simplification Is A Separate Lane + +Do not mix simplification with feature work unless the simplification is required +for the feature. + +A simplification pass may run only after: + +- current behavior is understood; +- tests or equivalent behavior proof exist; +- the scope is bounded to recently touched or explicitly named code; +- every change preserves behavior. + +Good target: `@review --dimension simplification` or a dedicated `simplify` +skill that reports candidates before editing. + +## Rule 12: Runtime Beats Prompt Policy + +Safety policy belongs in the runtime wherever possible: + +- sandbox and writable roots; +- network allowlists and approval gates; +- per-agent tool permissions; +- timeouts and graceful shutdown; +- append-only logs and evidence artifacts; +- scoped credentials and model/provider allowlists. + +Prompt text can instruct. Runtime controls can enforce. + +## Adoption Backlog + +1. Update `docs/reference/skill-authoring.md` to require `Use When`, + `Do Not Use When`, `Verification`, and trigger-rich descriptions. +2. Add manifest/protocol lint for weak descriptions and missing verification. +3. Add `Common Rationalizations` to `build`, `review`, `ship`, and `debug`. +4. Add a `doubt-driven` internal practice to `spec-interrogate` and + trust-sensitive review loops. +5. Add source-driven rules for OpenAI/Codex, OpenCode, Pi, model, and API work. +6. Extract reusable review checklists into `docs/reference/checklists/`. +7. Add simplification as a review dimension, not a default side-effect of + feature delivery. + diff --git a/docs/research/2026-05-15-harness-engineering-landscape.md b/docs/research/2026-05-15-harness-engineering-landscape.md new file mode 100644 index 00000000..2e613695 --- /dev/null +++ b/docs/research/2026-05-15-harness-engineering-landscape.md @@ -0,0 +1,291 @@ +# Harness Engineering Landscape, May 2026 + +Status: research draft +Date: 2026-05-15 + +This document records current harness-engineering patterns and model/tool +changes relevant to SDP. It separates vendor claims, official docs, academic +findings, and local observations. Numbers from vendor launch pages are useful +for model routing hypotheses, not for merge or release authority. + +## Source Classes + +- Official / vendor docs: OpenAI Codex and GPT-5.5 docs, Z.AI GLM-5.1 docs, + Kimi K2.6 page, Qwen3.6 GitHub, OpenCode docs, Pi docs. +- Platform / model infrastructure docs: Hugging Face DeepSeek-V4 analysis, + AWS Qwen SageMaker availability note. +- Academic / preprint evidence: terminal coding agent architecture, instruction + adherence in config files, tool-enabled agent security, MCP tool usage. +- Local observations: installed `opencode` is `1.15.0`; installed `codex` is + `codex-cli 0.130.0`; `pi` is installed at `/opt/homebrew/bin/pi`, but did not + print a version with `-v`/`--version` in this checkout. + +## Executive Summary + +The 2026 harness shift is not "models got better at code". The more important +shift is that coding agents are becoming managed runtimes: + +- bounded sandboxes and writable roots; +- network policy and approval gates; +- per-agent permissions and model routing; +- context compaction and memory; +- task/flow persistence; +- telemetry and audit trails; +- long-horizon budgets and graceful timeouts; +- explicit degraded-evidence states. + +SDP should treat model choice as one routing dimension inside a governed +harness, not as the architecture. + +## Current Harness Engineering Patterns + +### 1. Agent Loop Is The Product Boundary + +OpenAI describes Codex as a suite, but the core reusable concept is the harness: +an agent loop that prepares prompts, calls model inference, invokes tools, +observes results, updates state, and returns an outcome. Codex CLI uses the +Responses API and can be configured against compatible endpoints. + +Implication for SDP: keep the canonical model at the loop/runtime level: +instructions, tools, state, evidence, approvals, and model routing. Do not make +one harness-specific prompt stack the product. + +### 2. Safety Is Moving To Managed Runtime Controls + +OpenAI's May 2026 Codex safety write-up emphasizes sandbox modes, approval +policy, managed network access, managed configuration, and audit logs. OpenCode +documents per-agent permission keys for reads, edits, bash, web, LSP, skills, +task delegation, and external directory access. Pi intentionally keeps the core +minimal and expects teams to add workflows through packages/extensions or +external process controls. + +Implication for SDP: prompt-injection safety must be backed by runtime: + +- read-only reviewer roles; +- explicit write-enabled implementer roles; +- network allowlists or ask gates; +- `git push`, publish, merge, and external side effects behind explicit gates; +- append-only evidence for what the agent actually did. + +### 3. Context Engineering Is A Runtime Concern + +The terminal-agent literature now names strict context management as a core +architecture problem. OPENDEV-style patterns include workload-specialized model +routing, planner/executor separation, lazy tool discovery, adaptive compaction, +memory, and event-driven reminders. + +A May 2026 instruction-adherence study found no strong effect from several +static config-file structure variables, but did find a within-session compliance +drop as generated work accumulates. This supports SDP's existing bias toward +shorter bounded loops, compaction, and fresh reviewer contexts. + +Implication for SDP: improve session lifecycle, not just AGENTS formatting. + +### 4. Tool Layer Risk Is Growing + +Recent MCP/tool studies show software development dominates agent tooling, and +the share of action tools has grown sharply. Security analysis of privileged +agent environments points to over-privileged tools, capability-intent mismatch, +and ambient authority leakage as practical risk categories. + +Implication for SDP: classify tools by side effect, not by convenience: + +- perception/read tools; +- analysis tools; +- local writes; +- external writes; +- irreversible or identity-mediated actions. + +Each class needs separate permission and evidence policy. + +## Model Landscape + +### GPT / Codex + +OpenAI's current public docs position GPT-5.5 as a strong fit for coding, +tool-heavy agents, long-context retrieval, and product-spec-to-plan workflows. +The migration guidance stresses outcome-first prompts, explicit success +criteria, tool descriptions, structured outputs, prompt caching, and reasoning +effort tuning. It also says coding workflows need stronger orchestration: +reuse, delegation, tests, acceptance criteria, and clear stop/ask rules. + +Codex Cloud/Web supports background tasks in cloud environments, parallel work, +GitHub PR workflows, environment setup, internet-access control, IDE delegation, +and GitHub tagging. + +SDP routing hypothesis: + +- use GPT/Codex-class models for hard synthesis, complex tool orchestration, + final review, and high-risk refactors; +- use lower reasoning levels or smaller variants only after evals prove they + preserve evidence quality; +- prefer structured outputs over prompt-described schemas where available. + +### GLM + +Z.AI documents GLM-5.1 as a flagship long-horizon model with 200K context, +128K maximum output, thinking modes, function calling, context caching, +structured output, MCP integration, and agentic coding use cases. Vendor claims +include up to 8-hour sustained autonomous work and strong SWE-Bench Pro results. + +SDP routing hypothesis: + +- good candidate for long-horizon implementer/reviewer lanes; +- verify provider stability and context degradation on real SDP tasks before + trusting it for unattended work; +- use as model-diversity reviewer for prompt/skill/spec work. + +### Kimi + +Kimi K2.6 is positioned as open-source with coding, long-horizon execution, +agent swarm capabilities, document-to-skill workflows, and Claw Groups for +coordinated multi-agent work. Official access paths include Kimi website, app, +API, and Kimi Code. + +SDP routing hypothesis: + +- strong fit for UI/front-end generation critique, skill extraction, and + document-to-workflow experiments; +- useful as a non-OpenAI review lane; +- vendor swarm claims should be treated as design inspiration, not evidence + until reproduced in SDP. + +### Qwen + +The Qwen3.6 repository positions Qwen3.6 as focused on stability and real-world +utility, with stronger agentic coding, front-end workflows, repository-level +reasoning, and thinking preservation across iterative development. Official +weights are on Hugging Face and ModelScope; Qwen Code is the terminal agent +optimized for Qwen models; Alibaba Cloud Model Studio provides OpenAI- and +Anthropic-compatible APIs. AWS now lists Qwen3.6-35B-A3B as a 3B-active MoE +optimized for agentic coding workflows. + +SDP routing hypothesis: + +- good local/open-weight candidate for cheap exploration, codebase mapping, and + review diversity; +- Qwen3.6-35B-A3B and 27B should be evaluated separately; small active-parameter + models may be fast enough for many scout/reviewer lanes; +- thinking-preservation is relevant to long-horizon harness design, but should + not replace external evidence. + +### DeepSeek + +DeepSeek V4 is current but needs extra caution because official English docs +were harder to source directly. Hugging Face's April 2026 analysis reports V4-Pro +and V4-Flash checkpoints with 1M context, agent-focused long-context +architecture, interleaved thinking across tool calls, a dedicated tool-call +schema, and sandbox infrastructure used for RL rollouts. The same source says +benchmark numbers are competitive but not uniformly SOTA. + +SDP routing hypothesis: + +- promising for long-context agent workloads and cost-sensitive reviewer lanes; +- tool-call schema differences may require harness adaptation; +- use with explicit provider/endpoint provenance because community reports mix + API, OpenRouter, chat, and preview variants. + +## Tool Landscape + +### Codex + +Codex is now a family: CLI, Cloud/Web, IDE extension, and app surfaces. The +important 2026 lessons are: + +- agent loop and tool mediation are first-class; +- cloud tasks run in task-specific environments; +- background and parallel work are product features; +- safety posture centers on sandbox, approvals, network policy, managed config, + and telemetry. + +SDP implication: Codex should be treated as one high-reliability execution lane +and as a model/harness reference for managed controls. + +### OpenCode + +OpenCode's current docs support per-agent model overrides, per-agent +permissions, wildcard command permissions, primary/subagent/all modes, hidden +subagents, and task-delegation permissions. Legacy `tools` config is deprecated +in favor of `permission`. + +SDP implication: adapter parity should include permission semantics, not just +file presence. Existing SDP guidance to use `--agent implementer` for +non-interactive dispatch remains important, but should be refreshed against the +current OpenCode agent/permission model. + +### Pi + +Pi's official posture is minimal core plus extensions, skills, prompt templates, +themes, packages, SDK/RPC/event-stream modes, and model registry customization. +It loads `AGENTS.md`/`CLAUDE.md` from global, parent, and current directories. +It intentionally does not include built-in MCP, subagents, permission popups, +plan mode, to-dos, or background bash; those are built through packages or +external tools. The `pi-agents` and `pi-agent-flow` packages add spawn, +sequence, fork, join, loop, budgets, flow persistence, and deadline handling. + +SDP implication: Pi is a good experimental harness for explicit workflow +graphs, diverse model review lanes, and local package-based extension. It needs +strict external guardrails for side effects because the core is intentionally +small. + +## Consequences For SDP + +1. Add runtime permission semantics to the manifest, not just skill/agent file + parity. +2. Separate static adapter parity from runtime dispatch evidence for every + harness. +3. Treat long-horizon work as a flow with budgets, checkpoints, compaction, and + review stops. +4. Route models by role: scout, planner, implementer, reviewer, security, + synthesis, and judge. +5. Keep model diversity in review lanes; avoid all-reviewer panels from one + vendor family. +6. Track endpoint provenance and model version in evidence for any external + model result. +7. Prefer bounded workflows over one giant autonomous run. +8. Encode degraded evidence explicitly: failed provider, timeout, empty output, + unavailable CLI, unverified benchmark, not-assessed runtime. +9. Treat identity-mediated tools as higher risk than API tools with scoped + credentials. +10. Measure harness behavior on SDP tasks; do not assume benchmark rankings + transfer to our workflows. + +## References + +- OpenAI, "Unrolling the Codex agent loop": + https://openai.com/index/unrolling-the-codex-agent-loop/ +- OpenAI, "Running Codex safely at OpenAI": + https://openai.com/index/running-codex-safely/ +- OpenAI API docs, "Using GPT-5.5": + https://developers.openai.com/api/docs/guides/latest-model +- OpenAI Developers, "Codex web": + https://developers.openai.com/codex/cloud +- Z.AI Developer Docs, "GLM-5.1": + https://docs.z.ai/guides/llm/glm-5.1 +- Kimi, "Kimi K2.6": + https://www.kimi.com/ai-models/kimi-k2-6 +- QwenLM/Qwen3.6: + https://github.com/QwenLM/Qwen3.6 +- AWS, "Qwen models on SageMaker JumpStart", 2026-05-04: + https://aws.amazon.com/about-aws/whats-new/2026/05/qwen-models-on-sagemaker-jumpstart/ +- Hugging Face, "DeepSeek-V4: a million-token context that agents can actually use": + https://huggingface.co/blog/deepseekv4 +- OpenCode docs, "Agents": + https://opencode.ai/docs/agents/ +- Pi docs: + https://pi.dev/docs/latest +- Pi usage docs: + https://pi.dev/docs/latest/usage +- Pi agents package: + https://pi.dev/packages/pi-agents +- Pi agent flow package: + https://pi.dev/packages/pi-agent-flow +- Bui, "Building Effective AI Coding Agents for the Terminal", arXiv:2603.05344: + https://arxiv.org/abs/2603.05344 +- McMillan, "Instruction Adherence in Coding Agent Configuration Files", arXiv:2605.10039: + https://arxiv.org/abs/2605.10039 +- Goel, "Security Risks in Tool-Enabled AI Agents", arXiv:2605.09721: + https://arxiv.org/abs/2605.09721 +- Stein, "How are AI agents used? Evidence from 177,000 MCP tools", arXiv:2603.23802: + https://arxiv.org/abs/2603.23802 + diff --git a/docs/research/2026-05-15-sdp-harness-skill-synthesis.md b/docs/research/2026-05-15-sdp-harness-skill-synthesis.md new file mode 100644 index 00000000..201320d5 --- /dev/null +++ b/docs/research/2026-05-15-sdp-harness-skill-synthesis.md @@ -0,0 +1,455 @@ +# SDP Harness And Skill Operating Strategy + +Status: research synthesis +Date: 2026-05-15 + +This document synthesizes two inputs: + +- `2026-05-15-agent-skill-operating-rules.md` +- `2026-05-15-harness-engineering-landscape.md` + +It also incorporates independent cross-review from: + +- `docs/reviews/2026-05-15-harness-cross-review-glm.md` +- `docs/reviews/2026-05-15-harness-cross-review-kimi.md` +- `docs/reviews/2026-05-15-harness-cross-review-qwen.md` + +The cross-reviews are model-generated review evidence, not validation authority. +This synthesis treats them as adversarial input curated by the author. Any +recommendation below still needs normal repo adoption, implementation, and +measurement before it becomes policy. + +## Executive Position + +SDP should not copy `addyosmani/agent-skills` as a command model. SDP already +has a stronger control plane: manifest inventory, Beads, workstreams, evidence, +multi-harness parity, prompt-injection boundaries, and explicit degraded states. + +The useful import is narrower: + +- clearer skill anatomy; +- trigger and exclusion discipline; +- anti-rationalization tables; +- doubt-driven review for high-risk claims; +- source-driven checks for external APIs and harness behavior; +- explicit separation between prompt policy and runtime enforcement. + +The 2026 harness shift is bigger than "better coding models". Coding agents are +becoming managed runtimes: sandboxing, network policy, tool permissions, model +routing, context lifecycle, telemetry, and evidence trails now matter as much as +prompt quality. + +SDP should therefore optimize for governed harness execution, not for a larger +instruction pile. + +## Non-Negotiable Design Lines + +### 1. Skills Are Workflows, Not Libraries + +Skills should say what to do, in order, with inputs, outputs, stop conditions, +and verification. They should not absorb module-local facts, provider secrets, +package APIs, or historical roadmap context. + +Module facts stay in the nearest `AGENTS.md`. Stable reference belongs in +`docs/reference/`. Runtime inventory belongs in `sdp.manifest.yaml` or generated +adapter metadata. + +### 2. Runtime Beats Prompt Policy + +Prompt text can instruct. Runtime controls enforce. + +For SDP, that means: + +- declared writable roots; +- network allowlists or approval gates; +- per-agent tool permissions where the harness supports them; +- read-only reviewer roles by default; +- explicit gates for `git push`, publish, merge, external issue creation, and + other identity-mediated actions; +- append-only evidence for model outputs, tool calls, failures, and degraded + coverage. + +### 3. Long-Horizon Models Do Not Remove Slice Discipline + +GLM, Kimi, Qwen, DeepSeek, and GPT/Codex-class models now advertise stronger +long-horizon coding and tool use. SDP should treat those claims as routing +hypotheses, not as permission for unbounded execution. + +Long-horizon capability is useful for executing many bounded slices in sequence. +It is not a reason to skip workstreams, checkpoints, review stops, or evidence. + +### 4. Vendor Claims Are Quarantined + +Model launch claims and benchmark tables can inform experiments. They cannot +drive merge, release, or trust-sensitive approval. + +Every model-routing claim should carry: + +- provider and endpoint; +- model id or snapshot when available; +- source class: official docs, vendor claim, third-party analysis, local eval; +- `routing_confidence`: `vendor_only`, `local_spike`, or + `validated_on_sdp_tasks`. + +All current model-family routing notes in the source landscape are +`vendor_only` unless separately measured on SDP tasks. + +## Skill Authoring Standard + +Every SDP skill should converge toward this shape: + +```yaml +--- +name: +description: Does X. Use when Y. Do not use when Z if Z is a common false trigger. +version: +compatibility: [claude-code, opencode, cursor, codex, pi] +requires_cli: [] +requires_mcp: [] +tool_risk_classes: [perception, analysis] +runtime_requires: + sandbox: read_only | workspace_write | full_access + network: none | allowlisted | approval_required + approvals: [] +--- +``` + +Required body sections: + +- `Purpose` +- `Use When` +- `Do Not Use When` +- `Inputs` +- `Outputs` +- `Process` +- `Verification` +- `Degraded Evidence` + +For high-risk skills, also add: + +- `Common Rationalizations` +- `Runtime Preconditions` +- `Model Routing` +- `Stop Conditions` + +Initial high-risk skills confirmed in the manifest: + +| Skill | Canonical path | Status for adoption | +|---|---|---| +| `build` | `prompts/skills/build/SKILL.md` | exists; high-priority | +| `review` | `prompts/skills/review/SKILL.md` | exists; high-priority | +| `ship` | `prompts/skills/ship/SKILL.md` | exists; high-priority | +| `debug` | `prompts/skills/debug/SKILL.md` | exists; reconcile with `.agents/skills/debug.md` deprecation note | +| `delivery-loop` | `prompts/skills/delivery-loop/SKILL.md` | exists; high-priority | +| `spec-interrogate` | `prompts/skills/spec-interrogate/SKILL.md` | exists; already closest to target | + +`compatibility` is a claim only after runtime dispatch evidence exists for that +harness. Until then it means intended portability and should be reported as +`not_assessed_runtime` per harness. + +## Tool Risk Classes + +Skills and harness adapters should classify tool use by side-effect class: + +| Class | Meaning | Default policy | +|---|---|---| +| `perception` | read files, list paths, inspect logs, browse docs | allowed for most roles | +| `analysis` | local compute without writes or external side effects | allowed with evidence | +| `local_write` | edit files, generate artifacts, modify local state | implementer only; scoped | +| `external_write` | push, publish, create remote issue, update remote system | explicit gate | +| `irreversible` | merge, deploy, delete, spend money, rotate credentials | explicit human/workflow authorization | + +This should live in manifest/runtime metadata, not prose alone. Pi may enforce +through packages or external wrappers; OpenCode can map to permission keys; +Codex can map to sandbox, approvals, and managed config. The manifest declares +requirements. The harness enforces them where it can. + +If a harness cannot enforce `external_write` or `irreversible` gates at runtime, +the skill may not claim runtime support for that action class. It can only run +under an explicit workflow-level authorization path, and evidence must say +`not_assessed_runtime` or `manual_gate_only`. + +## Degraded Evidence Taxonomy + +Do not collapse missing evidence into green. + +Canonical states: + +| State | Meaning | +|---|---| +| `passed` | deterministic or reviewer evidence completed and supports the claim | +| `failed` | deterministic or reviewer evidence contradicts the claim | +| `not_assessed` | the plane was not run | +| `failed_provider` | provider returned an explicit error | +| `timeout` | run exceeded the bounded window | +| `empty_output` | run completed with no useful content | +| `off_task` | output did not address the requested plane | +| `unavailable_cli` | required tool was missing or could not run | +| `unverified_benchmark` | vendor/third-party claim not validated on SDP tasks | +| `not_assessed_runtime` | static files exist, but runtime behavior was not proven | + +Evidence artifacts and review summaries should report these states directly. +`not_assessed` is not a weak pass. It is missing coverage. + +Assignment rule: deterministic harness/tool output wins over model prose. If no +deterministic owner exists yet, the human/operator or orchestrator assigns the +state and records the reason. Conflicting evidence states are resolved toward +the more conservative degraded state until inspected. + +## Model Routing Policy + +Role-based routing is useful only if it is measured and auditable. + +Recommended target roles: + +| Role | Primary concern | Routing rule | +|---|---|---| +| scout | fast repo/context mapping | low-cost model acceptable after eval | +| planner | decomposition and acceptance criteria | stronger reasoning model | +| implementer | code changes | model with proven local gate performance | +| reviewer | independent critique | different family from implementer where possible | +| security | prompt-injection and side-effect risk | stronger model; strict evidence | +| synthesis | reconcile conflicting reviews | high-reasoning model | +| judge | final spec/review adjudication | different provider from critic | + +Evidence for every model-generated artifact should include: + +- `model_id`; +- provider or endpoint; +- harness name and version when available; +- prompt/context source; +- tool permissions enabled; +- degraded evidence state if coverage is partial. + +Default review posture: for trust-sensitive work, at least one reviewer should +come from a different model family or provider than the implementer. All slots +from one vendor family are not independent enough for high-risk claims. + +This rule requires implementer provenance to be known. If the implementer model +is unknown, the review must record `not_assessed_runtime` for model-family +independence rather than claiming diversity. + +## Doubt Mechanisms: Decision Tree + +Use the cheapest sufficient doubt mechanism: + +| Situation | Mechanism | +|---|---| +| Behavioral code change | test-first or regression test evidence | +| Bug fix | prove failing case first, then prove pass | +| Prompt, skill, agent, review, eval, Beads, or model-routing change | bounded doubt cycle | +| Trust-sensitive workstream | doubt cycle plus selected multi-plane review | +| External API/framework/harness claim | source-driven verification from official docs | +| Release, publish, merge, or external side effect | runtime gate plus explicit authorization | +| One-line mechanical edit | standard local verification only | + +Bounded doubt cycle: + +1. `CLAIM`: state the claim and why it matters. +2. `EXTRACT`: isolate artifact and contract; strip author explanation. +3. `DOUBT`: send artifact and contract to an adversarial reviewer. +4. `RECONCILE`: classify findings as actionable, trade-off, misread, or noise. +5. `STOP`: stop after trivial findings, three cycles, or explicit owner override. + +Do not combine every review plane with every doubt cycle by default. That turns +discipline into latency theater. + +Budget: one doubt cycle should fit in a bounded review window and produce a +short finding list. If the reviewer returns a broad redesign request, split the +claim or reject the output as off-scope. Three cycles is a hard ceiling, not a +target. + +## Multi-Plane Review Scope + +The six-plane matrix is a library, not the default checklist. + +Available planes: + +- code correctness and maintainability; +- requirements vs implementation; +- evidence and tracing; +- security and prompt-injection boundary; +- docs/runtime truth; +- operations, CI, and release readiness; +- model/provider routing and provenance; +- tool-side-effect policy. + +Default for ordinary work: one to two planes. + +Default for trust-sensitive SDP work: two to three planes, selected by risk. + +Mandatory planes: + +- write-capable prompt/agent/skill changes: requirements, security/PI, + evidence/tracing; +- release/publish/merge readiness: operations/CI/release, evidence/tracing, + requirements; +- harness adapter support claims: docs/runtime truth, runtime dispatch evidence, + tool-side-effect policy. + +## Harness Adapter Parity + +Static parity is not runtime support. + +Static parity means: + +- adapter file exists; +- manifest entry exists; +- generated docs or symlink points to the expected location. + +Runtime dispatch evidence means: + +- the harness loaded the intended skill/agent/command; +- the intended model was selected; +- permissions or sandbox settings matched the declaration; +- denied actions were actually denied or escalated; +- tool calls and failures were logged; +- output artifacts include evidence status. + +Do not mark a harness as `supported` for a capability when only static parity is +verified. Use `not_assessed_runtime`. + +## Source-Driven Development + +When the task depends on current behavior of a hosted API, CLI, coding harness, +model, or framework, verify current official sources first. + +Mandatory for: + +- OpenAI/Codex API, Codex CLI, Codex app, or model behavior; +- OpenCode permissions, agent config, or delegation semantics; +- Pi packages, flow behavior, model routing, and context loading; +- external model capabilities, migrations, or deprecations; +- GitHub Actions or CI behavior when debugging CI. + +Use current docs, release notes, model cards, source repositories, or installed +CLI versions. Mark conflicts and unverified claims explicitly. + +## Adoption Plan + +### Phase 1: One-Week Foundation + +Goal: make authoring discipline enforceable without changing runtime. + +- Update `docs/reference/skill-authoring.md` with required sections: + `Do Not Use When`, `Verification`, `Degraded Evidence`. +- Add `Common Rationalizations` to `build` and `review`. +- Extend lint to warn on missing `Do Not Use When` and `Verification`. +- Add a small reference doc for tool risk classes and degraded evidence. + +Acceptance gate: + +- the current skill linter exists and runs, or the phase explicitly creates the + minimal lint path first; +- `build` and `review` contain `Common Rationalizations`, `Verification`, and + `Degraded Evidence`; +- the reference doc defines tool risk classes and degraded evidence states; +- all changed docs/skills pass the existing docs and skill lint checks; +- any runtime support claim added in this phase is marked `not_assessed_runtime` + unless dispatch evidence exists. + +### Phase 2: One-Sprint Harness Metadata + +Goal: manifest can describe runtime needs even before every harness enforces +them. + +- Draft manifest metadata for `tool_risk_classes` and `runtime_requires`. +- Map OpenCode permissions, Codex sandbox/approval concepts, and Pi wrapper or + package enforcement points. +- Add evidence fields for `model_id`, provider, endpoint, harness version, and + evidence status. +- Define `not_assessed_runtime` in review and adapter reports. + +Acceptance gate: + +- manifest schema impact is documented before fields are treated as canonical; +- at least one harness has a concrete mapping from manifest metadata to runtime + behavior or explicit `manual_gate_only` fallback; +- evidence output can carry model/provider/harness metadata without manual + copy-paste for the common path. + +### Phase 3: Review And Doubt Integration + +Goal: reduce skipped-risk rationalization without turning every task into a +research project. + +- Add bounded doubt cycle to `spec-interrogate` and high-risk `review` modes. +- Add model/provider diversity rule for trust-sensitive review lanes. +- Add simplification as a review dimension, not as default feature cleanup. +- Add source-driven checks for OpenAI/Codex, OpenCode, Pi, and external model + claims. + +Acceptance gate: + +- doubt cycle is limited to high-risk prompt/agent/skill/policy changes; +- ordinary daily work keeps a one- or two-plane review default; +- provider-diversity rules are advisory unless implementer provenance and model + availability are known. + +### Phase 4: Measurement + +Goal: turn routing hypotheses into SDP evidence. + +- Build a small SDP task suite: scout, spec critique, bug fix, review, harness + adapter check. +- Run GPT/Codex, GLM, Kimi, Qwen, DeepSeek, and MiniMax lanes where available. +- Track cost, latency, correctness, evidence quality, and failure modes. +- Promote only measured routing rules from `vendor_only` to `local_spike` or + `validated_on_sdp_tasks`. + +Promotion criteria: + +- `vendor_only`: source exists, no SDP run. +- `local_spike`: at least one successful bounded SDP task with recorded + provider/model/harness metadata and failure notes. +- `validated_on_sdp_tasks`: repeated runs across at least three representative + SDP task types with acceptable cost, latency, deterministic gate results, and + evidence quality. The threshold must be recorded with the measurement packet; + one lucky run is not validation. + +## What Not To Adopt + +- Do not replace SDP's intent surface with `/spec -> /plan -> /build -> /test -> + /review -> /ship`. +- Do not invoke a skill at "1% chance"; it is too noisy for this repo. +- Do not replace SDP's multi-plane review with a generic three-persona review. +- Do not copy generic security checklists without SDP prompt-injection and + evidence-boundary semantics. +- Do not let model benchmark rankings determine role routing without local SDP + measurements. +- Do not treat swarms or long-horizon claims as architecture. Treat them as + possible implementation mechanisms inside bounded flows. + +## Immediate Recommendation + +Adopt the operating discipline, not the external command stack. + +The first concrete product move should be: + +1. strengthen skill authoring and lint; +2. add degraded evidence and tool-risk vocabulary; +3. add anti-rationalization to `build` and `review`; +4. draft manifest runtime metadata; +5. measure model routing before changing default roles. + +This is enough to improve UX and DX now without pretending the harness runtime +is already more enforceable than it is. + +## Source Notes + +Current public sources checked during this research include: + +- OpenAI, GPT-5.5 model docs: `https://developers.openai.com/api/docs/models/gpt-5.5` +- OpenAI, running Codex safely: `https://openai.com/index/running-codex-safely/` +- OpenAI, Windows Codex sandbox: `https://openai.com/index/building-codex-windows-sandbox/` +- OpenCode agents and permissions: `https://opencode.ai/docs/agents/` +- Pi docs and packages: `https://pi.dev/docs/latest`, + `https://pi.dev/packages/pi-agents`, `https://pi.dev/packages/pi-agent-flow` +- Z.AI GLM-5.1 docs: `https://docs.z.ai/guides/llm/glm-5.1` +- Kimi K2.6 official blog/help: `https://www.kimi.com/blog/kimi-k2-6` +- Qwen3.6 official repository: `https://github.com/QwenLM/Qwen3.6` +- Hugging Face DeepSeek-V4 analysis: + `https://huggingface.co/blog/deepseekv4` + +External model claims in these sources remain routing hypotheses until validated +on SDP tasks. diff --git a/docs/reviews/2026-05-15-harness-cross-review-glm.md b/docs/reviews/2026-05-15-harness-cross-review-glm.md new file mode 100644 index 00000000..0745f05a --- /dev/null +++ b/docs/reviews/2026-05-15-harness-cross-review-glm.md @@ -0,0 +1,138 @@ +# Cross-Review: Operating Rules vs. Harness/Model Landscape + +**Reviewer:** A (harness/runtime engineer) +**Scope:** Whether Rules 1–12 are sufficient given the landscape document's findings +**Date:** 2026-05-15 + +--- + +## Verdict + +**Conditional pass.** The rules are internally consistent and address most skill-authoring discipline. They are *insufficient* for the harness/runtime concerns the landscape document identifies as the defining 2026 shift: model routing as a first-class concern, long-horizon flow lifecycle, tool side-effect classification, degraded-evidence protocol, and harness adapter parity. These are not edge cases—they are the central claim of the landscape doc (§Executive Summary: "coding agents are becoming managed runtimes"). + +--- + +## Blocking Gaps + +Gaps where a rule claims coverage it does not have, or where a landscape-mandated concern has no rule at all. + +### B1. No rule addresses model routing as a structural requirement + +The landscape identifies model routing by role (scout, planner, implementer, reviewer, security, synthesis, judge) as a harness-level concern. Rule 10 bans single-vendor reviewer panels but says nothing about: + +- declaring model capability prerequisites in skill triggers; +- tracking endpoint provenance and model version in evidence artifacts; +- requiring model diversity for specific review planes; +- routing degraded or fallback behavior when a preferred provider is unavailable. + +Rule 5 cites specific products for source verification but does not extend to model selection. Rule 12 enumerates runtime controls but omits model/provider allowlists. + +**Impact:** Without this, skills can silently route all work to one model with no evidence trail and no diversity requirement. The landscape's consequences §items 4–6 are unaddressed. + +### B2. No rule addresses long-horizon flow lifecycle + +The landscape identifies budgets, checkpoints, compaction, and review stops as essential for long-horizon harness work. Rule 7 mandates small slices (good for leaf work) but says nothing about: + +- when to start a fresh session vs. continue (landscape §3: within-session compliance drops); +- compaction or memory management policy; +- flow primitives (spawn, sequence, fork, join, loop) and their governance; +- harness session lifecycle (create, run, release) as a skill concern. + +**Impact:** The landscape explicitly warns against "one giant autonomous run" and for bounded workflows with checkpoints. The rules have no structural countermeasure for sessions that run long beyond "slice smaller." + +### B3. Tool side-effect classification is absent from all rules + +The landscape §4 and consequences §item 9 call for classifying tools into five categories (perception/read, analysis, local write, external write, irreversible/identity-mediated) with separate permission and evidence policy. Rule 12 mentions per-agent tool permissions but does not require skills to declare which side-effect classes they exercise or to accumulate differentiated evidence per class. + +**Impact:** A skill that performs external writes is governed by the same prose as one that only reads. The landscape identifies this as a growing risk category. + +--- + +## Major Gaps + +Important but not blocking synthesis; can be addressed as amendments. + +### M1. Degraded-evidence states are under-specified + +Rule 3 mentions `not_assessed` for empty/hung reviewer output. The landscape (consequences §item 8) enumerates six degraded states: failed provider, timeout, empty output, unavailable CLI, unverified benchmark, and not-assessed runtime. No rule makes these a general principle or requires skills to handle/report them. + +### M2. Rule 9 (multi-plane review) missing two planes + +Current planes: code correctness, requirements, evidence/tracing, security/PI, docs/runtime truth, ops/CI/release. Missing: + +- **model/provider routing and provenance plane** — did the right model see the right slice with the right evidence chain? +- **tool-side-effect policy plane** — did the skill stay within its declared side-effect classes? + +### M3. Rule 1 (triggers) missing runtime precondition declarations + +Triggers specify *when* to use a skill but not *what runtime controls* the skill requires (sandbox mode, network access, specific tool classes, model capabilities). A skill whose workflow calls external APIs has different precondition requirements than one that only reads local files. + +### M4. Rule 3 anti-rationalization table missing harness-specific entries + +Absent rationalizations the landscape implies: + +| Rationalization | Reality | +|---|---| +| "The harness sandbox covers it." | Sandbox is necessary but not sufficient; prompt-injection and capability-intent mismatch persist inside sandboxes. | +| "This model passed our evals." | Benchmark performance does not transfer to SDP task workflows without local measurement. | +| "The agent had network access so it verified." | Network access is a permission, not evidence of verification. | +| "Context window is large enough to skip compaction." | Within-session compliance degrades regardless of window size. | + +### M5. No rule addresses harness adapter parity + +The landscape (consequences §item 2) calls for separating static adapter parity (file presence) from runtime dispatch evidence (did the harness actually route as configured?). No rule requires verifying that an adapter's runtime behavior matches its spec. + +--- + +## Useful Tensions + +Tensions that are features, not bugs—worth preserving in synthesis. + +### T1. Rule 12 vs. prompt-only minimalism + +Rule 12 ("runtime beats prompt") directly contradicts any temptation to solve safety by writing better instructions. The landscape reinforces this repeatedly. This tension is correct and should be sharpened, not softened. + +### T2. Rule 7 (small slices) vs. landscape long-horizon models + +Rule 7 mandates thin-slice implementation. The landscape describes models claiming 8-hour autonomous runs. The tension is real: long-horizon capability exists, but SDP should still slice. The synthesis should acknowledge this explicitly and state that long-horizon models are for executing *many bounded slices in sequence*, not for eliminating slice discipline. + +### T3. Rule 10 (no persona trees) vs. landscape swarm/flow patterns + +Rule 10 restricts orchestration patterns. The landscape describes swarm capabilities (Kimi Claw Groups) and flow graphs (pi-agent-flow). The tension is healthy: SDP should adopt flow-graph primitives (spawn, fork, join) without adopting recursive agent trees. The synthesis should draw this line clearly. + +### T4. Rule 5 (source-driven) product list vs. ecosystem volatility + +Rule 5 hardcodes specific products (OpenAI/Codex, OpenCode, Pi). The landscape shows this ecosystem changes fast. Useful tension: keep the mandatory-source discipline but make the product list a versioned appendix, not a fixed rule body. + +--- + +## Concrete Changes for Synthesis Document + +1. **Add Rule 13 or extend Rule 12: Model routing and provenance.** Require skills to declare model capability prerequisites, require evidence to include model version/provider/endpoint, require model diversity for trust-sensitive review planes. + +2. **Add Rule 14 or extend Rule 7: Long-horizon flow lifecycle.** Mandate session budgets, compaction triggers, checkpoint evidence, and explicit fresh-session criteria. State that long-horizon models execute bounded slices in sequence, not unbounded runs. + +3. **Extend Rule 1: Runtime preconditions in triggers.** Add a `Requires` field covering sandbox mode, network access, tool side-effect classes, and minimum model capabilities. + +4. **Extend Rule 2: Side-effect class declaration in workflows.** Require each workflow step to declare its tool side-effect class and for the skill's evidence to be differentiated by class. + +5. **Extend Rule 3: Add harness-specific rationalizations** (sandbox sufficiency, benchmark transfer, network-access-as-verification, context-window sufficiency). + +6. **Extend Rule 9: Add two review planes** — model/provider routing and provenance, and tool-side-effect policy. + +7. **Extract degraded-evidence protocol as a cross-cutting concern** (applicable to all rules, not just Rule 3). Enumerate the six states from the landscape. + +8. **Add Rule 15 or extend Rule 12: Harness adapter parity must be runtime-verified**, not just file-presence checked. + +9. **Make Rule 5's product list a versioned appendix** with a review cadence, not a fixed rule body. + +10. **Preserve tensions T1–T4 explicitly** in the synthesis with the framing given above. + +--- + +## Evidence State + +- Operating rules internal consistency: **assessed** (read in full). +- Landscape document accuracy against external sources: **not_assessed** (not in scope for this review). +- Whether proposed changes would be sufficient after adoption: **not_assessed** (requires implementation and measurement on real SDP tasks, per landscape consequence §item 10). +- Runtime feasibility of proposed changes in current Pi/Codex/OpenCode harnesses: **not_assessed** (would require adapter-specific prototyping). diff --git a/docs/reviews/2026-05-15-harness-cross-review-kimi.md b/docs/reviews/2026-05-15-harness-cross-review-kimi.md new file mode 100644 index 00000000..97b96a6a --- /dev/null +++ b/docs/reviews/2026-05-15-harness-cross-review-kimi.md @@ -0,0 +1,81 @@ +**Verdict:** Partially aligned. The Agent Skill Operating Rules establishes strong individual-skill discipline, while the Harness Engineering Landscape demands systemic runtime, manifest, and evidence-policy changes that mostly exceed the Rules’ scope. The pair can proceed as companion research drafts, but the synthesis must close manifest-schema, evidence-taxonomy, and tool-risk gaps before operational adoption. No safety-critical contradiction, but the Rules doc under-specifies the runtime implications raised by the Landscape doc. + +--- + +### Blocking Gaps + +1. **Manifest schema lacks runtime permission semantics.** + Landscape Consequence #1 requires adding runtime permission semantics to the manifest. Rules doc Rule 12 lists runtime controls (sandbox, network allowlists, per-agent tool permissions) but never connects them to manifest structure or harness-specific schemas (e.g., OpenCode `permission` keys, Pi tool scopes, Codex sandbox modes). Without manifest changes, these controls remain prompt-level advice. + *[not_assessed: neither document reproduces the current `sdp.manifest.yaml` schema.]* + +2. **Degraded evidence taxonomy is incomplete.** + Landscape Consequence #8 lists six degraded evidence states (`failed_provider`, `timeout`, `empty_output`, `unavailable_cli`, `unverified_benchmark`, `not_assessed_runtime`). Rules doc Rule 3 recognizes only empty/hung output as `not_assessed`. The gap means a timeout or provider failure could be silently treated as passing evidence, breaking Rule 4 (Doubt-Driven Development) and Rule 9 (Multi-Plane Review). + +3. **Tool-risk classification missing from skill authoring.** + Landscape Section 4 proposes five tool classes (perception/read, analysis, local writes, external writes, irreversible/identity-mediated). Rules doc Rule 12 mentions “per-agent tool permissions” and “scoped credentials” but provides no taxonomy. Skills cannot declare risk class, so the runtime cannot enforce class-based gates. + +--- + +### Major Gaps + +1. **No model-routing-by-role policy.** + Landscape Consequence #4 proposes routing by role (scout, planner, implementer, reviewer, security, synthesis, judge) and Consequence #5 requires model diversity in review lanes. Rules doc Rule 10 prohibits recursive persona orchestration but offers no positive routing matrix or vendor-diversity constraint. + +2. **No endpoint provenance requirement in evidence artifacts.** + Landscape Consequence #6 requires tracking endpoint provenance and model version. Rules doc Rule 5 requires detecting exact version for external APIs but does not mandate provenance metadata on every model-generated artifact, weakening auditability. + +3. **Long-horizon flow semantics absent.** + Landscape Consequence #3 and Pi `pi-agent-flow` describe budgets, checkpoints, compaction, and review stops. Rules doc Rules 6–7 cover progressive context and small slices but do not define flow-control semantics (token/time budgets, checkpoint intervals) for multi-slice work. + *[not_assessed: whether SDP currently supports flow persistence.]* + +4. **Adapter parity separated from runtime dispatch evidence only at headline level.** + Landscape Consequence #2 demands this separation. Rules doc Rule 5 verifies official docs but does not define what constitutes runtime dispatch evidence (e.g., tool-call logs, permission-denial logs) versus static config parity. Adoption backlog item #2 (“manifest/protocol lint”) is too vague to enforce this. + +5. **Vendor benchmark skepticism not encoded in rules.** + Landscape Consequence #10 warns against assuming benchmark rankings transfer to SDP workflows. Rules doc Rule 3 anti-rationalization table lacks an entry for “high benchmark score means safe for this role.” + +--- + +### Useful Tensions + +1. **Rule 2 (Skills are workflows, not reference docs) vs Landscape’s demand for harness metadata in skills.** + Landscape implies skills must carry tool-risk classes, model routing hints, and harness-specific permissions. This risks bloating skills with reference material. The productive tension forces the synthesis to draw a hard boundary between executable workflow steps (skill body) and runtime configuration (manifest), preventing harness metadata from leaking into skill prose. + +2. **Rule 7 (Small revertable slices) vs vendor long-horizon autonomy claims.** + Landscape reports GLM “8-hour sustained autonomous work” and Kimi “agent swarm” claims, then recommends bounded workflows. Rules doc mandates thin slices. The tension challenges the synthesis to explicitly reject or conditionally cage vendor autonomy claims—e.g., permitting long-horizon flows only with explicit budgets, checkpoints, and review stops. + +3. **Rule 9 (Multi-plane review) vs Landscape model-diversity requirement.** + Rule 9 defines independence by output plane (code, requirements, evidence, security). Landscape adds vendor-family diversity as a separate axis. A review could be multi-plane yet single-vendor (e.g., all planes use GPT-5.5). The tension forces the synthesis to decide whether plane independence is sufficient or vendor diversity is mandatory for certain planes (e.g., security). + +4. **Rule 12 (Runtime beats prompt policy) vs Landscape context-engineering findings.** + Landscape cites a study finding instruction adherence drops as generated work accumulates, supporting runtime context management. Rule 12 says runtime enforces, prompts instruct. The tension: if runtime manages context compaction and memory, the skill author’s responsibility for context hygiene becomes unclear. The synthesis must delineate author duties (progressive loading per Rule 6) from harness duties (compaction, event-driven reminders). + +--- + +### Concrete Changes for the Synthesis Doc + +1. **Add manifest runtime permissions schema.** + Define a `runtime` block in `sdp.manifest.yaml` (or equivalent) with `sandbox_mode`, `network_policy`, `tool_class_allowlist`, `approval_gates`, and `harness_adapter` fields. Map OpenCode `permission` semantics, Codex sandbox modes, and Pi tool scopes to this schema. + +2. **Add degraded evidence taxonomy.** + Enumerate canonical `degraded` states: `failed_provider`, `timeout`, `empty_output`, `unavailable_cli`, `unverified_benchmark`, `not_assessed_runtime`. Require every review, test, and model output to carry an `evidence_status` field drawn from this taxonomy. + +3. **Add tool-risk classification to skill authoring.** + Require every skill to declare `tool_risk_classes` used (perception, analysis, local_write, external_write, irreversible). The manifest must gate `external_write` and `irreversible` behind explicit approval; skills lacking this declaration fail manifest lint. + +4. **Add model routing matrix.** + Define SDP roles (scout, planner, implementer, reviewer, security, synthesis, judge) and require: (a) reviewer and security roles must use model families distinct from the implementer role; (b) every model result in evidence must include `model_id`, `endpoint_url`, and `harness_version`. + +5. **Add long-horizon flow policy.** + For work exceeding one slice, require a `flow_spec` with `budget` (tokens / time / tool_calls), `checkpoint_interval`, and `review_stop_gates`. Prohibit unbounded autonomous runs; vendor long-horizon claims must be treated as `UNVERIFIED` until reproduced in SDP evals. + +6. **Extend Anti-Rationalization table.** + Add rows: + - “This model scores high on SWE-Bench / benchmark X” → “Benchmarks do not transfer to SDP workflows without local eval evidence.” + - “The skill prompt forbids this tool, so the agent cannot use it” → “Prompt-only prohibition is not a security boundary; runtime gates required.” + +7. **Add adapter parity verification checklist.** + Distinguish static parity (config file exists) from runtime dispatch evidence (logged tool calls, permission denials, model routing). Require runtime evidence before marking a harness adapter as `supported`. + +8. **Add harness-specific source-driven requirements.** + When asserting behavior for Codex, OpenCode, Pi, or model APIs, cite exact doc version and URL. Mark vendor claims (e.g., GLM 8-hour autonomy, Kimi swarm) as `UNVERIFIED` in SDP planning artifacts. diff --git a/docs/reviews/2026-05-15-harness-cross-review-qwen.md b/docs/reviews/2026-05-15-harness-cross-review-qwen.md new file mode 100644 index 00000000..8b69850f --- /dev/null +++ b/docs/reviews/2026-05-15-harness-cross-review-qwen.md @@ -0,0 +1,108 @@ +# Cross-Review: Agent Skill Operating Rules × Harness Engineering Landscape + +**Reviewer:** C (product/runtime adoption) +**Date:** 2026-05-15 +**Verdict:** `needs_work` — both docs are useful research, but neither is ready for adoption without prioritization cuts, existence proofs, and vendor-claim quarantine. + +--- + +## Blocking Gaps (must close before adoption) + +### B1 — Skill existence proof for Rule 3 +Rule 3 says to add anti-rationalization tables to `build`, `review`, `ship`, `debug`, `delivery-loop`, and `spec-interrogate`. None of these skills are referenced by path in the operating-rules doc. The adoption backlog mentions Rule 3 but does not confirm these skills exist in the SDP skill inventory (presumably `docs/reference/skills.md`). **Status:** `not_assessed` — need a manifest scan to confirm which exist, which are aspirational, which are package-provided. + +### B2 — No adoption priority +The adoption backlog is a flat list of 7 items across both documents. A maintainer cannot triage without knowing: what blocks what, what is a one-hour change versus a week-long spike, and what is cosmetic versus structural. **Status:** `needs_priority_map`. + +### B3 — Runtime-permission semantics conflict with Pi's design posture +Harness doc consequence #1 says "add runtime permission semantics to the manifest." The harness doc itself acknowledges Pi's core is intentionally minimal and expects permissions to come from packages or external process controls. The operating-rules doc Rule 12 says "policy belongs in the runtime." No reconciliation is provided: does the manifest declare permission requirements that a Pi package enforces externally, or does the manifest itself become a runtime policy engine? **Status:** architectural conflict, `not_resolved`. + +### B4 — Multi-plane review (Rule 9) vs. cost +Rule 9 proposes up to six review planes (correctness, requirements, evidence, security, docs, ops/CI). Rule 4 adds a doubt cycle with adversarial review and three maximum cycles. Combined, a single non-trivial claim could spawn 18 review invocations before a verdict. No cost, latency, or model-selection budget is defined. This is unusable for day-to-day SDP work as written. **Status:** `not_feasible_without_scoping`. + +--- + +## Major Gaps + +### M1 — Doubt cycle overlap +Rule 4 (doubt-driven: CLAIM/EXTRACT/DOUBT/RECONCILE/STOP) and Rule 8 (tests are concrete doubt) and Rule 9 (multi-plane review) are all doubt mechanisms for overlapping situations. The doc does not define when to use which, or whether they compose. A bug fix, for instance, might trigger all three simultaneously. **Status:** `not_disambiguated`. + +### M2 — "Degraded evidence" is undefined +The harness doc consequence #8 says "encode degraded evidence explicitly" and lists examples (failed provider, timeout, empty output, etc.). The operating-rules doc mentions `not_assessed` state. No schema, type, or storage location is specified. Is `not_assessed` a Beads field, a review verdict tag, or a manifest state? **Status:** `not_assessed`. + +### M3 — Model routing claims unevaluated on SDP tasks +The harness doc has routing hypotheses for five model families but explicitly says (consequence #10) "measure harness behavior on SDP tasks; do not assume benchmark rankings transfer." The doc does not contain any such measurements. Every routing hypothesis should carry a "validated: yes/no/not_assessed" flag before being actionable. **Status:** `not_assessed` for all five. + +### M4 — Context layers (Rule 6) lack concrete triggers +Rule 6's five context layers describe *what* to load progressively but not *when* to promote to the next layer or how to detect starvation vs. flood. The harness doc discusses context compaction but does not offer thresholds or signals. **Status:** `operationalized_not_feasible`. + +--- + +## Vendor Hype Quarantine + +These claims are flagged `vendor_claim` and should not drive SDP policy until independently validated: + +| Claim | Source | Risk if accepted uncritically | +|---|---|---| +| GLM-5.1 "8-hour sustained autonomous work" | Z.AI docs | Could justify too much unattended execution; contradicts bounded-workflow doctrine | +| Kimi "agent swarm" and "Claw Groups" | Kimi product page | Swarm coordination is orthogonal to SDP's Beads model; no evidence of interoperability | +| DeepSeek V4 "competitive but not uniformly SOTA" | Hugging Face blog | Third-party interpretation, not the primary source; DeepSeek-V4 official English docs noted as `not_assessed` | +| Qwen3.6-35B-A3B "optimized for agentic coding" | AWS/Alibaba docs | Marketing characterization; the 3B active parameter count may produce brittle long-range reasoning | +| GPT-5.5 "strongest fit for hard synthesis" | OpenAI migration guidance | Migration docs say this; independent evals on SDP tasks are `not_assessed` | + +The harness doc already says "vendor claims are useful for routing hypotheses, not for merge or release authority." This discipline must be carried forward into any synthesis. + +--- + +## Useful Tensions + +These are productive disagreements worth preserving in synthesis: + +1. **Pi minimalism vs. runtime safety** (Harness doc vs. Rule 12): Pi intentionally lacks built-in MCP, subagents, and permission popups. Rule 12 demands runtime-enforced safety. The synthesis doc should say explicitly: Pi packages like `pi-agents`/`pi-agent-flow` are the intended enforcement point, not Pi core. + +2. **Small slices vs. long-horizon models** (Rule 7 vs. Harness §3): The harness doc identifies long-horizon execution as the new boundary, while Rule 7 insists on small revertable slices. These are compatible — small slices *within* a long-horizon bounded flow with budgets and review stops — but the synthesis doc must articulate this reconciliation. + +3. **Skill workflow vs. module-local convention** (Rule 2 vs. AGENTS.md): Rule 2 says skills should not contain module-local facts (package APIs, import rules, etc.) — those go in `AGENTS.md`. This is correct but introduces a loading-order dependency: the agent must read `AGENTS.md` before skill execution. The operating-rules doc should specify this explicitly. + +--- + +## Concrete Changes for the Synthesis Document + +These are specific edits to make in whichever doc becomes the synthesis: + +1. **Add an existence table** for every skill named in Rule 3: `build`, `review`, `ship`, `debug`, `delivery-loop`, `spec-interrogate`. Columns: `exists_in_repo` (yes/no), `path_if_exists`, `status` (stable/experimental/missing). If missing, move from adoption to "design needed." + +2. **Replace the flat adoption backlog** with a phased table: + - Phase 1 (this week): Rule 1 description template enforcement, manifest lint skeleton. + - Phase 2 (this sprint): Anti-rationalization in `build` and `review` only. + - Phase 3 (next sprint): Simplification as a review dimension; degraded-evidence schema draft. + - Phase 4+: Remaining items, dependent on Phase 2 measurements. + +3. **Add a `routing_confidence` column** to the model landscape table. Values: `vendor_only`, `local_spike`, or `validated_on_sdp_tasks`. All current entries should be `vendor_only`. + +4. **Resolve B3 explicitly**: Add a paragraph stating manifest declares permission *requirements*, and enforcement is delegated to runtime (Pi packages, OpenCode permissions, or external gate). No new runtime engine is being built in SDP manifest. + +5. **Scope Rule 9 multi-plane review**: Specify a default of 2-3 planes per workstream. Security and evidence planes are mandatory only for write-capable actions. The six-plane matrix is a reference library, not a default checklist. + +6. **Disambiguate doubt mechanisms**: Add a decision tree to the synthesis doc: + - Behavioral code change → Rule 8 (tests first) + - Prompt/agent/model/skill/policy change → Rule 4 (doubt cycle) + - Trust-sensitive workstream → Rule 4 + subset of Rule 9 planes + - Everything else → standard review (single plane) + +7. **Mark all GLM/Kimi/DeepSeek routing hypotheses** with a `vendor_claim` flag and remove them from the "Consequences for SDP" section. They belong in a separate routing-research doc, not in a document claiming to state consequences. + +8. **Define `not_assessed`**: Add one paragraph or a small schema describing where `not_assessed` appears (review verdict, Beads state, evidence artifact), how it displays in reports, and that it is not equivalent to `pass`. + +--- + +## Summary + +| Dimension | Assessment | +|---|---| +| Operating rules quality | Good operational discipline; needs scoping and prioritization | +| Harness landscape quality | Good survey; routing hypotheses are under-evaluated for SDP | +| Adoption readiness | Neither doc is ready as-is; synthesis needs cuts | +| Vendor hype exposure | Managed but not quaranted; routing claims need confidence tags | +| Internal consistency | Three unresolved tensions (B3, B4, M1) | +| **Overall verdict** | `needs_work` → synthesize with priority map, scope cuts, and `not_assessed` markings | diff --git a/docs/reviews/2026-05-15-sdp-harness-synthesis-socratic-review.md b/docs/reviews/2026-05-15-sdp-harness-synthesis-socratic-review.md new file mode 100644 index 00000000..f77999c5 --- /dev/null +++ b/docs/reviews/2026-05-15-sdp-harness-synthesis-socratic-review.md @@ -0,0 +1,103 @@ +# Socratic Review: SDP Harness And Skill Operating Strategy + +Status: review evidence +Date: 2026-05-15 +Target: `docs/research/2026-05-15-sdp-harness-skill-synthesis.md` + +## Reviewer Runs + +| Lane | Model | Status | Notes | +|---|---|---|---| +| initial qwen | `openrouter/qwen/qwen3.6-plus` | `off_task` | Returned a fake tool-call preamble instead of reviewing the document. Not counted as evidence. | +| initial glm | `zai/glm-5.1` | `off_task` | Critiqued claims not present in the synthesis. Not counted as evidence. | +| socratic critic | `zai/glm-5.1` with attached file | `assessed` | Produced blocking and major questions tied to the synthesis. | +| adoption-risk reviewer | `kimi-coding/kimi-for-coding` with attached file | `assessed` | Produced feasibility and process-risk questions tied to the synthesis. | + +## Blocking Questions Accepted + +### B1. Cross-review provenance may self-contaminate the synthesis + +The synthesis cited model-generated cross-reviews while also saying vendor/model +claims are quarantined. The document now states that cross-reviews are review +evidence curated by the author, not validation authority. + +Disposition: accepted and fixed. + +### B2. Phase 1 had no pass/fail gate + +The one-week foundation phase described deliverables but not completion +criteria. The document now has an explicit Phase 1 acceptance gate. + +Disposition: accepted and fixed. + +### B3. Phase 1 declares runtime fields before runtime enforcement exists + +The reviewers flagged that Phase 1 can only create authoring discipline; runtime +claims remain unproven. The document now requires `not_assessed_runtime` when +dispatch evidence does not exist. + +Disposition: accepted and fixed. + +## Major Questions Accepted + +### M1. Model-family diversity needs implementer provenance + +The review asked how "different model family" can be verified. The synthesis now +states that unknown implementer provenance makes diversity `not_assessed_runtime`. + +Disposition: accepted and fixed. + +### M2. `compatibility` was ambiguous + +The review asked whether `compatibility` is a verified portability claim or an +intent. The synthesis now says it is only a claim after runtime dispatch +evidence exists. + +Disposition: accepted and fixed. + +### M3. Doubt cycles needed a per-cycle bound + +The synthesis now says a doubt cycle must stay inside a bounded review window, +with short findings; broad redesign output is off-scope. + +Disposition: accepted and fixed. + +### M4. Routing promotion criteria were undefined + +The synthesis now defines `vendor_only`, `local_spike`, and +`validated_on_sdp_tasks` promotion criteria. + +Disposition: accepted and fixed. + +### M5. Irreversible action gates may not exist in every harness + +The synthesis now says a harness that cannot enforce `external_write` or +`irreversible` runtime gates cannot claim runtime support for that action class; +it must use workflow-level authorization and degraded evidence. + +Disposition: accepted and fixed. + +## Questions Deferred + +### Trigger thresholds + +The Socratic review asked for quantified trigger thresholds to replace "invoke +at 1% chance." This is deferred. The current recommendation is trigger-rich +prose plus lint, not numerical confidence scoring. Numerical thresholds would +likely create false precision until SDP has routing telemetry. + +Disposition: deferred. + +### Phase 4 provider procurement and rate limits + +The adoption-risk reviewer noted that measuring all listed providers may require +API contracts and budget. This is valid but belongs in a future measurement +workstream, not this synthesis. + +Disposition: deferred. + +## Verdict + +The review found real adoption blockers, and the synthesis was updated for the +ones that affect immediate decision quality. Remaining deferred items should be +handled during implementation planning, not by expanding this research document. diff --git a/docs/workstreams/INDEX.md b/docs/workstreams/INDEX.md index e93fb3fb..228b42c7 100644 --- a/docs/workstreams/INDEX.md +++ b/docs/workstreams/INDEX.md @@ -146,11 +146,11 @@ | **F165** | Indirect Prompt Injection Through SDP Task Data — Day-12 defensive demo pack for Beads/workstream/evidence task-data poisoning with deterministic unsafe/defended outcomes | 00-165-00 ... 00-165-05 | Backlog | P1 | | **F166** | Runtime LLM Guard Gateway — core-first Go guard/audit layer for SDP model calls: input/output secret checks, local chunked classifier, Codex/Pi harness compatibility, gateway surfaces, redaction/blocking, token/cost evidence, deterministic corpus | 00-166-01 ... 00-166-09 | In Progress | P1 | | **F167** | Security Verdict Gate — Day-14 runtime security step after green tests and before commit/promotion-ready state, with gateway sanitation, blocking Critical/High findings, warning Medium/Low findings, escalation on provider/sanitizer/evidence failure, and demo evidence | 00-167-01 ... 00-167-04 | Backlog | P1 | -| **F168** | Onboarding Quality Taxonomy — honest first-run promise map plus deterministic/model-review quality axes for Go hygiene, complexity, spec drift, CleanCode, CleanArchitecture, Security, DX, UX, and docs completeness | 00-168-00 ... 00-168-08 | Done on branch; PR review pending | P1 | +| **F168** | Onboarding Quality Taxonomy — honest first-run promise map plus deterministic/model-review quality axes for Go hygiene, complexity, spec drift, CleanCode, CleanArchitecture, Security, DX, UX, and docs completeness | 00-168-00 ... 00-168-09 | Done on branch; PR review pending | P1 | > **Beads:** `F161=sdplab-tffu`, `F163=sdplab-n7a9`, `F164=sdplab-9wxx`, `F165=sdplab-28xb`, `F166=sdplab-mp83`, `F167=sdplab-xe5c`, `F168=sdplab-o8gk` > **F167 DAG:** `01 → 02 → 03 → 04` -> **F168 DAG:** `01 → {02,03,04,05}; {02,03,04,05} → 06 → 07 → 08` +> **F168 DAG:** `01 → {02,03,04,05}; {02,03,04,05} → 06 → 07 → 08 → 09` > **Boundary:** this produces evidence for spec readiness; it is not a process gate. Gate policy remains owned by downstream gate tooling. #### F161 Workstreams @@ -194,6 +194,7 @@ | 00-168-06 | Operator-facing quality report UX | Done on branch | sdplab-f16806 | | 00-168-07 | CI/advisory rollout and Beads findings loop | Done on branch | sdplab-f16807 | | 00-168-08 | End-to-end onboarding quality calibration run | Done on branch | sdplab-f16808 | +| 00-168-09 | Harness and skill operating discipline phase 1 | Done on branch | sdplab-4cxu | ### Phase Product Documentation And Adoption Clarity diff --git a/docs/workstreams/backlog/00-168-09.md b/docs/workstreams/backlog/00-168-09.md new file mode 100644 index 00000000..df7b4352 --- /dev/null +++ b/docs/workstreams/backlog/00-168-09.md @@ -0,0 +1,66 @@ +--- +ws_id: 00-168-09 +feature_id: F168 +status: done +priority: P2 +size: S +depends_on: ["00-168-08"] +ws_kind: leaf +parent_ws_id: "00-168-00" +dispatch_lifecycle: active +--- + +# 00-168-09: Harness and skill operating discipline phase 1 + +Feature: F168 (sdplab-o8gk) +Design reference: [Harness and skill synthesis](../../research/2026-05-15-sdp-harness-skill-synthesis.md) + +## Goal + +Apply the first repo-cleanup slice from the harness/skill synthesis: make skill +authoring expectations more explicit and add shared vocabulary for tool risk and +degraded evidence without claiming new runtime enforcement. + +## Scope Files + +- `docs/reference/skill-authoring.md` +- `docs/reference/harness-risk-and-evidence.md` +- `prompts/skills/build/SKILL.md` +- `prompts/skills/review/SKILL.md` +- `.agents/skills/build.md` +- `.agents/skills/review.md` + +## Beads + +- primary: sdplab-4cxu + +## Acceptance Criteria + +- [x] `docs/reference/skill-authoring.md` defines `Do Not Use When`, + `Verification`, and `Degraded Evidence` requirements. +- [x] `docs/reference/` contains reusable tool-risk and degraded-evidence + vocabulary. +- [x] `build` has common rationalizations for skipped specs, weak evidence, + prompt-only safety, and review shortcuts. +- [x] `review` has common rationalizations for empty reviewer output, + single-family review, missing provenance, and rubber-stamp coverage. +- [x] Runtime support claims introduced by this work are marked + `not_assessed_runtime` unless dispatch evidence exists. +- [x] Skill lint result is recorded. + +## Completion Evidence + +- `docs/reference/skill-authoring.md` +- `docs/reference/harness-risk-and-evidence.md` +- `prompts/skills/build/SKILL.md` +- `prompts/skills/review/SKILL.md` +- `.agents/skills/build.md` +- `.agents/skills/review.md` +- `go run ./cmd/sdp-protocol-check --lint-skills`: 0 errors, 4 pre-existing warnings. + +## Out of Scope + +- Manifest schema changes. +- Runtime permission enforcement. +- Model-routing measurements. +- Public `sdp` publishing. diff --git a/prompts/skills/build/SKILL.md b/prompts/skills/build/SKILL.md index c4de4854..0e4f2407 100644 --- a/prompts/skills/build/SKILL.md +++ b/prompts/skills/build/SKILL.md @@ -38,6 +38,17 @@ Continuation is the orchestrator's job (@oneshot / sdp orchestrate). 5. **MODERN GO FOR GO CODE** — When touched files are Go, load `@go-modern` and prefer safe stdlib modernizations before inventing helpers. 6. **PI FINDINGS NEED REGRESSION TESTS** — For prompt-injection or review-finding fixes, add a deterministic regression test for the exact failed vector before closing the finding bead. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. | +| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. | +| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. | +| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. | +| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. | +| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. | + --- ## Git Safety diff --git a/prompts/skills/review/SKILL.md b/prompts/skills/review/SKILL.md index be253234..e042f6cc 100644 --- a/prompts/skills/review/SKILL.md +++ b/prompts/skills/review/SKILL.md @@ -109,6 +109,17 @@ Rules: huge provider error text or full prompts into the verdict, replace it with a compact verdict that preserves model status, P0/P1 counts, and override reason. +## Common Rationalizations + +| Rationalization | Reality | +|---|---| +| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. | +| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. | +| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. | +| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. | +| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. | +| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. | + ## Write Plan (F101) Before writing review output files (verdict, findings), emit a write plan: