Merged
11 changes: 11 additions & 0 deletions .agents/skills/build.md
@@ -116,6 +116,17 @@ Every commit in SDP-managed repos SHOULD carry provenance trailer:
- Edit files in main tree (always use worktree)
- Commit raw `.sdp/runs/pi-review/*` telemetry unless the workstream explicitly requires it; use compact verdict/evidence instead.

## Common Rationalizations

| Rationalization | Reality |
|---|---|
| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. |
| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. |
| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. |
| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. |
| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. |
| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. |

## Response Format

After completing work, report:
11 changes: 11 additions & 0 deletions .agents/skills/review.md
@@ -54,6 +54,17 @@ gates are green, no P0/P1 remain, and `.sdp/review_verdict.json` records a
compact maintainer note. Never commit raw `.sdp/runs/pi-review/*` telemetry by
default.

## Common Rationalizations

| Rationalization | Reality |
|---|---|
| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. |
| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. |
| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. |
| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. |
| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. |
| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. |

## Routing Rules

Dimension based on: (1) Diff size: small (<50 lines) → code only, large → multiple dimensions.
1 change: 1 addition & 0 deletions .beads/issues.jsonl
@@ -358,6 +358,7 @@
{"_type":"issue","id":"sdplab-7","title":"F061-02: bd ready → sdp ready bridge","status":"closed","priority":1,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-02-28T21:35:17Z","created_by":"Andrey Zhukov","updated_at":"2026-04-20T14:27:34Z","closed_at":"2026-04-20T14:27:34Z","close_reason":"Verified: code exists, 218 tests pass across guard/evidence/monitor/beads/workstream packages. WS files marked done with all acceptance criteria checked.","labels":["F061","beads","ecosystem"],"dependency_count":0,"dependent_count":0,"comment_count":0}
{"_type":"issue","id":"sdplab-2","title":"F059-02: Session evidence emitter","status":"closed","priority":1,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-02-28T21:34:39Z","created_by":"Andrey Zhukov","updated_at":"2026-04-20T14:27:00Z","closed_at":"2026-04-20T14:27:00Z","close_reason":"Verified: code exists, 218 tests pass across guard/evidence/monitor/beads/workstream packages. WS files marked done with all acceptance criteria checked.","labels":["F059","ecosystem","ohmyopencode"],"dependency_count":0,"dependent_count":0,"comment_count":0}
{"_type":"issue","id":"sdplab-3","title":"F059-01: Pre-tool-call guard hook","status":"closed","priority":1,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-02-28T21:34:39Z","created_by":"Andrey Zhukov","updated_at":"2026-04-20T14:27:00Z","closed_at":"2026-04-20T14:27:00Z","close_reason":"Verified: code exists, 218 tests pass across guard/evidence/monitor/beads/workstream packages. WS files marked done with all acceptance criteria checked.","labels":["F059","ecosystem","ohmyopencode"],"dependency_count":0,"dependent_count":0,"comment_count":0}
{"_type":"issue","id":"sdplab-4cxu","title":"F168-09: Apply harness/skill operating discipline phase 1","description":"Follow-up to the 2026-05-15 harness/skill synthesis. Apply Phase 1 only: update skill authoring policy with trigger/exclusion/verification/degraded-evidence requirements; add reference vocabulary for tool risk classes and degraded evidence; add common rationalizations to build and review. Runtime manifest enforcement and model-routing measurement are out of scope.","acceptance_criteria":"- [ ] docs/reference/skill-authoring.md defines Do Not Use When, Verification, and Degraded Evidence requirements.\n- [ ] docs/reference contains a reusable tool-risk/degraded-evidence reference.\n- [ ] prompts/skills/build/SKILL.md has Common Rationalizations for skipped specs, evidence, prompt-only safety, and review shortcuts.\n- [ ] prompts/skills/review/SKILL.md has Common Rationalizations for empty reviewer output, single-family review, missing provenance, and rubber-stamp coverage.\n- [ ] Runtime support claims introduced by this work are explicitly not_assessed_runtime unless dispatch evidence exists.\n- [ ] Skill lint is run and result recorded.","status":"in_progress","priority":2,"issue_type":"task","assignee":"Andrei","owner":"a_v_zhukov@outlook.com","created_at":"2026-05-15T08:26:12Z","created_by":"Andrei","updated_at":"2026-05-15T08:26:18Z","started_at":"2026-05-15T08:26:18Z","labels":["F168","docs","harness","skills"],"dependency_count":0,"dependent_count":0,"comment_count":0}
{"_type":"issue","id":"sdplab-tsbi","title":"F168 finding: stale Claude sweep command is outside manifest source of truth","description":"source=pi-review+local verification; feature=F168; workstream=00-168-02; blocking=false. .claude/commands/sweep.md exists and advertises broad autonomous backlog execution, but sdp.manifest.yaml and prompts/commands have no sweep command source. Decide whether to delete it, add it to manifest as experimental, or move it behind explicit future-work docs so generated adapter inventory is truthful.","status":"closed","priority":2,"issue_type":"bug","owner":"a_v_zhukov@outlook.com","created_at":"2026-05-13T09:34:15Z","created_by":"Andrei","updated_at":"2026-05-14T08:47:41Z","closed_at":"2026-05-14T08:47:41Z","close_reason":"merged in PR #153 (F168 onboarding quality taxonomy)","dependencies":[{"issue_id":"sdplab-tsbi","depends_on_id":"sdplab-o8gk","type":"discovered-from","created_at":"2026-05-13T12:34:15Z","created_by":"Andrei","metadata":"{}"}],"dependency_count":0,"dependent_count":0,"comment_count":0}
{"_type":"issue","id":"sdplab-o8gk.8","title":"F168-08: End-to-end onboarding quality calibration run","description":"Run the completed F168 flow against SDP onboarding and record calibration evidence: actual commands, docs promises, review axes, created findings, and unresolved gaps.","status":"closed","priority":2,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-05-13T05:46:50Z","created_by":"Andrei","updated_at":"2026-05-14T08:47:53Z","closed_at":"2026-05-14T08:47:53Z","close_reason":"merged in PR #153 (F168 onboarding quality taxonomy)","labels":["F168","calibration","onboarding","pi-review","quality"],"dependencies":[{"issue_id":"sdplab-o8gk.8","depends_on_id":"sdplab-o8gk","type":"parent-child","created_at":"2026-05-13T08:46:49Z","created_by":"Andrei","metadata":"{}"}],"dependency_count":0,"dependent_count":0,"comment_count":0}
{"_type":"issue","id":"sdplab-o8gk.7","title":"F168-07: CI/advisory rollout and Beads findings loop","description":"Connect deterministic and model-review axes into CI/advisory rollout with Beads finding creation for blocking issues. Avoid fake-green checks; absent credentials and missing tools must produce cannot_verify or not_assessed.","status":"closed","priority":2,"issue_type":"task","owner":"a_v_zhukov@outlook.com","created_at":"2026-05-13T05:46:49Z","created_by":"Andrei","updated_at":"2026-05-14T08:47:52Z","closed_at":"2026-05-14T08:47:52Z","close_reason":"merged in PR #153 (F168 onboarding quality taxonomy)","labels":["F168","beads","ci","onboarding","pi-review","quality"],"dependencies":[{"issue_id":"sdplab-o8gk.7","depends_on_id":"sdplab-o8gk","type":"parent-child","created_at":"2026-05-13T08:46:48Z","created_by":"Andrei","metadata":"{}"}],"dependency_count":0,"dependent_count":0,"comment_count":0}
11 changes: 11 additions & 0 deletions .sdp/generated/.codex/skills/build.md
@@ -40,6 +40,17 @@ Continuation is the orchestrator's job (@oneshot / sdp orchestrate).
5. **MODERN GO FOR GO CODE** — When touched files are Go, load `@go-modern` and prefer safe stdlib modernizations before inventing helpers.
6. **PI FINDINGS NEED REGRESSION TESTS** — For prompt-injection or review-finding fixes, add a deterministic regression test for the exact failed vector before closing the finding bead.

## Common Rationalizations

| Rationalization | Reality |
|---|---|
| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. |
| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. |
| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. |
| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. |
| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. |
| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. |

---

## Git Safety
11 changes: 11 additions & 0 deletions .sdp/generated/.codex/skills/review.md
@@ -111,6 +111,17 @@ Rules:
huge provider error text or full prompts into the verdict, replace it with a
compact verdict that preserves model status, P0/P1 counts, and override reason.

## Common Rationalizations

| Rationalization | Reality |
|---|---|
| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. |
| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. |
| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. |
| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. |
| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. |
| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. |

## Write Plan (F101)

Before writing review output files (verdict, findings), emit a write plan:
11 changes: 11 additions & 0 deletions .sdp/generated/.opencode/skill/build.md
@@ -40,6 +40,17 @@ Continuation is the orchestrator's job (@oneshot / sdp orchestrate).
5. **MODERN GO FOR GO CODE** — When touched files are Go, load `@go-modern` and prefer safe stdlib modernizations before inventing helpers.
6. **PI FINDINGS NEED REGRESSION TESTS** — For prompt-injection or review-finding fixes, add a deterministic regression test for the exact failed vector before closing the finding bead.

## Common Rationalizations

| Rationalization | Reality |
|---|---|
| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. |
| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. |
| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. |
| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. |
| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. |
| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. |

---

## Git Safety
11 changes: 11 additions & 0 deletions .sdp/generated/.opencode/skill/review.md
@@ -111,6 +111,17 @@ Rules:
huge provider error text or full prompts into the verdict, replace it with a
compact verdict that preserves model status, P0/P1 counts, and override reason.

## Common Rationalizations

| Rationalization | Reality |
|---|---|
| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. |
| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. |
| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. |
| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. |
| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. |
| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. |

## Write Plan (F101)

Before writing review output files (verdict, findings), emit a write plan:
11 changes: 11 additions & 0 deletions .sdp/generated/.pi/skills/build/SKILL.md
@@ -40,6 +40,17 @@ Continuation is the orchestrator's job (@oneshot / sdp orchestrate).
5. **MODERN GO FOR GO CODE** — When touched files are Go, load `@go-modern` and prefer safe stdlib modernizations before inventing helpers.
6. **PI FINDINGS NEED REGRESSION TESTS** — For prompt-injection or review-finding fixes, add a deterministic regression test for the exact failed vector before closing the finding bead.

## Common Rationalizations

| Rationalization | Reality |
|---|---|
| "The change is small enough to skip the workstream." | Small changes still need an executable owner. If no WS exists, stop and create or request one. |
| "I can test at the end." | Late testing hides which slice introduced the failure. Use the narrowest relevant test before and after behavior changes. |
| "The model says it verified this." | Model prose is not evidence. Use tool output, file state, schema validation, or Beads/GitHub state. |
| "Prompt instructions are enough to prevent unsafe actions." | Prompt-only boundaries are not security boundaries. Runtime support is `not_assessed_runtime` unless dispatch evidence proves enforcement. |
| "One broad review after implementation is enough." | Trust-sensitive changes need selected review planes, and degraded evidence must remain visible. |
| "Unrelated cleanup will leave the repo better." | Cleanup is in scope only when required by the WS or explicitly accepted in the write plan. |

---

## Git Safety
11 changes: 11 additions & 0 deletions .sdp/generated/.pi/skills/review/SKILL.md
@@ -111,6 +111,17 @@ Rules:
huge provider error text or full prompts into the verdict, replace it with a
compact verdict that preserves model status, P0/P1 counts, and override reason.

## Common Rationalizations

| Rationalization | Reality |
|---|---|
| "The reviewer returned nothing, so there were no findings." | Empty, timed-out, or off-task output is degraded evidence, not PASS. |
| "All reviewers used the same strong model, so the panel is strong." | Multi-plane review and model-family diversity are separate. For trust-sensitive work, record missing diversity as `not_assessed_runtime`. |
| "The adapter files exist, so the harness is supported." | Static parity is not runtime dispatch evidence. Mark runtime coverage `not_assessed_runtime` until the harness loads and runs the surface. |
| "Network access means the reviewer verified the current docs." | Network permission is not evidence. Cite the source or mark the claim unverified. |
| "Rubber-stamp roles are harmless." | They are acceptable only when explicitly recorded as shallow coverage; do not blend them into a full green verdict. |
| "A compact maintainer note can hide provider failure." | It may justify accepting degraded coverage, but the degraded state must remain visible. |

## Write Plan (F101)

Before writing review output files (verdict, findings), emit a write plan:
58 changes: 58 additions & 0 deletions docs/reference/harness-risk-and-evidence.md
@@ -0,0 +1,58 @@
# Harness Risk And Evidence

Status: reference

This vocabulary keeps SDP skill and harness claims honest. It is intentionally
small: use it in skills, review reports, adapter checks, and evidence summaries
without turning every task into a policy project.

## Tool Risk Classes

| Class | Meaning | Default policy |
|---|---|---|
| `perception` | Read-only inspection: files, logs, docs, links, local state. | Allowed for most roles. |
| `analysis` | Local computation or synthesis without writes or external side effects. | Allowed with recorded evidence. |
| `local_write` | Edits, generated artifacts, local database or checkpoint changes. | Implementer/workflow scope only. |
| `external_write` | Push, publish, create or update a remote system, send messages. | Explicit workflow gate required. |
| `irreversible` | Merge, deploy, delete, rotate credentials, spend money. | Explicit human or workflow authorization required. |

Prompt text may describe a boundary, but it is not the boundary. If the harness
cannot enforce a risk-class gate, record the claim as `not_assessed_runtime` or
`manual_gate_only`.
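
As a rough illustration of the rule above, the following Python sketch picks the state to record for a gate claim. The names `RiskClass`, `gate_state`, `proven_enforced`, and `manual_gate_used` are hypothetical and not part of SDP. Returning `passed` when dispatch evidence exists is also an assumption made for this sketch.

```python
from enum import Enum
from typing import Set


class RiskClass(Enum):
    """The five tool risk classes from the table above."""
    PERCEPTION = "perception"
    ANALYSIS = "analysis"
    LOCAL_WRITE = "local_write"
    EXTERNAL_WRITE = "external_write"
    IRREVERSIBLE = "irreversible"


def gate_state(risk: RiskClass,
               proven_enforced: Set[RiskClass],
               manual_gate_used: bool) -> str:
    """Return the evidence state to record for a gate claim on `risk`.

    `proven_enforced` holds the classes for which dispatch evidence shows
    the harness really blocks the action (hypothetical input).
    """
    if risk in proven_enforced:
        # Assumption: proven runtime enforcement is recorded as `passed`.
        return "passed"
    if manual_gate_used:
        # A human or workflow gate stood in for missing runtime enforcement.
        return "manual_gate_only"
    # Prompt text alone is not a boundary, so nothing was proven at runtime.
    return "not_assessed_runtime"


proven = {RiskClass.LOCAL_WRITE}
print(gate_state(RiskClass.LOCAL_WRITE, proven, False))
print(gate_state(RiskClass.EXTERNAL_WRITE, proven, True))
print(gate_state(RiskClass.IRREVERSIBLE, proven, False))
```

The point of the sketch is that a prompt instruction never appears as an input: only dispatch evidence or an explicit gate changes the recorded state.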

## Evidence States

| State | Meaning |
|---|---|
| `passed` | Evidence completed and supports the claim. |
| `failed` | Evidence completed and contradicts the claim. |
| `not_assessed` | The plane was not run. |
| `failed_provider` | Provider returned an explicit error. |
| `timeout` | Run exceeded the bounded window. |
| `empty_output` | Run completed with no useful content. |
| `off_task` | Output did not address the requested plane. |
| `unavailable_cli` | Required local tool was missing or could not run. |
| `unverified_benchmark` | Vendor or third-party claim was not validated on SDP tasks. |
| `not_assessed_runtime` | Static files exist, but runtime behavior was not proven. |
| `manual_gate_only` | The workflow used an explicit human/workflow gate because runtime enforcement is unavailable. |

Missing evidence is not a pass. Use the degraded state that preserves what
actually happened.
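
A minimal sketch of how a review run might be mapped to one of the degraded states above. The function name `degraded_state`, its parameters, and the order of the checks are assumptions made for illustration; the reference does not define a precedence between degraded states.

```python
from typing import Optional


def degraded_state(provider_error: bool,
                   timed_out: bool,
                   output: str,
                   on_task: bool) -> Optional[str]:
    """Return the degraded state for a run, or None if the run is usable.

    None does not mean `passed`: it only means the output still has to be
    evaluated against the claim.
    """
    if provider_error:
        return "failed_provider"
    if timed_out:
        return "timeout"
    if not output.strip():
        return "empty_output"
    if not on_task:
        return "off_task"
    return None


# A timed-out reviewer with no findings is `timeout`, never a pass.
print(degraded_state(False, True, "", True))
# An empty but otherwise clean run is `empty_output`.
print(degraded_state(False, False, "   ", True))
```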

## Assignment Rule

Deterministic tool output wins over model prose. If a model claims a check
passed but the tool output is missing, classify the check as `not_assessed`.
If states conflict, report the more conservative degraded state until a human or
orchestrator inspects the evidence.
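
The assignment rule can be sketched in a few lines of Python. `assign_state` and `reconcile` are hypothetical helper names. Treating any non-`passed` state as more conservative than `passed` is an assumption of this sketch, since the reference does not rank the degraded states against each other.

```python
from typing import Optional


def assign_state(tool_state: Optional[str], model_claim: str) -> str:
    """Deterministic tool output wins over model prose.

    `model_claim` is accepted only to make the point that it is ignored:
    with no tool output, the check is `not_assessed` whatever the model says.
    """
    return tool_state if tool_state is not None else "not_assessed"


def reconcile(first: str, second: str) -> str:
    """Pick the state to report when two evidence sources disagree."""
    if first == second:
        return first
    if first == "passed":
        # Assumption: any non-passed state is more conservative than `passed`.
        return second
    # Both differ and `first` is not a pass; keep it until someone inspects.
    return first


print(assign_state(None, "passed"))
print(assign_state("failed", "passed"))
print(reconcile("passed", "timeout"))
```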

## Common Examples

- A review provider returns no findings because the process timed out:
`timeout`, not `passed`.
- A harness adapter file exists but no dispatch run proves it loads:
`not_assessed_runtime`.
- A skill says "do not push", but the harness cannot block pushing:
`manual_gate_only` for that action class unless another runtime gate exists.
- A model vendor page claims strong coding benchmarks:
`unverified_benchmark` until reproduced on SDP tasks.