28 changes: 28 additions & 0 deletions docs/CORE.md
@@ -27,6 +27,8 @@ then returns the result to the same conversation by foreground, background, or h
12. Async without drift: long workflows may use SDK/background/heartbeat adapters, but the coordinator contract and safety gates stay the same.
13. Controller-first state: `cwf-start.mjs` creates preview, run-plan, state, return-envelope, final, worker-packets, and worker-results slots before any worker dispatch.
14. Adapter honesty: native subagent, SDK, Desktop-thread, and heartbeat helpers must write `fixture`, `real-smoke`, `requires_approval`, `unavailable`, or `deferred` evidence labels instead of upgrading claims silently.
15. Checker-owned verification: maker workers may write attempted/proposed/changed state, but `verified`, `passed`, `done`, and `regression_locked` belong to a verifier, deterministic test, replay, or human reviewer.
16. Failure to regression: recurring workflow, helper, route, connector, skill, or harness failures should preserve the failing input or trace and leave behind a regression artifact or explicit skip reason.

## Failure Modes

@@ -49,6 +51,8 @@ A non-trivial run plan should name:
- exact scope and exclusions;
- phases and workers;
- verifier or challenger role;
- verified-state owner;
- failure-to-regression receipt when applicable;
- write scopes;
- untrusted input route;
- token budget and stop rule;
@@ -95,6 +99,30 @@ Run experience is part of the core contract:

The proven return path is coordinator synthesis in the originating conversation. Heartbeat synthesis is allowed only after a real heartbeat reply with the expected marker is observed in the originating thread; `heartbeat-scheduled` and `heartbeat-scheduled-not-returned` are not delivery proof. Platform automatic callback is not claimed until a future Codex platform API and real smoke prove it.

## Verified State

CWF treats verification state as a separate ownership boundary:

- maker workers can write `attempted`, `proposed`, `changed`, and `needs_review`;
- verifier workers, deterministic tests, replay commands, external evidence, or human reviewers can write `verified`, `passed`, `done`, and `regression_locked`;
- the coordinator may synthesize verified state only by pointing at the verifier receipt.

Persistent run artifacts should avoid mixing maker narrative with checker-owned truth. If a status file or `goal_delta` will be read by a future run, write verified state only after the verifier receipt exists, and keep partial writes from looking authoritative.
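
The ownership boundary above could be sketched roughly as follows. This is a minimal illustration, not code from the CWF scripts: the `canWriteStatus` helper and the role names (`maker`, `verifier`, `coordinator`, and so on) are assumptions introduced here.

```javascript
// Hypothetical helper for illustration only; not part of the CWF scripts.
const MAKER_OWNED = new Set(["attempted", "proposed", "changed", "needs_review"]);
const CHECKER_OWNED = new Set(["verified", "passed", "done", "regression_locked"]);
// Role names are assumptions made for this sketch.
const CHECKER_ROLES = new Set([
  "verifier",
  "deterministic_test",
  "replay",
  "external_evidence",
  "human_reviewer",
]);

function canWriteStatus(role, status, receipt = "") {
  if (CHECKER_OWNED.has(status)) {
    if (CHECKER_ROLES.has(role)) return true;
    // The coordinator may only restate verified state by pointing at a receipt.
    if (role === "coordinator") return String(receipt).trim().length > 0;
    return false;
  }
  // The text above only says makers may write these; other roles are not modeled.
  if (MAKER_OWNED.has(status)) return role === "maker";
  return false; // unknown status values are rejected
}

console.log(canWriteStatus("maker", "verified")); // false
console.log(canWriteStatus("coordinator", "done", "npm run check passed")); // true
```

Rejecting unknown statuses keeps the sketch fail-closed, which matches the spirit of the rule that partial writes should not look authoritative.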

## Failure To Regression

When a CWF run repairs a repeated failure or a harness-level issue, the repair is not complete until the failing input is replayed or preserved as a future check when feasible:

```text
failing input / trace
-> diagnosis
-> fix or mitigation
-> replay
-> regression artifact
```

Valid regression artifacts include a test, fixture, eval case, route trigger case, helper smoke, documented replay command, or sanitized error-pattern entry. If the input contains secrets, customer data, or private chat, sanitize or hash it before storing. If no safe artifact exists, record the skip reason in the run plan and closeout.
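
As a rough sketch of the loop above, a receipt could be reviewed before closeout like this. The field names follow the `failure_to_regression` fixture in `scripts/check-core.mjs`; the `reviewRegressionReceipt` helper and its return shape are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helper for illustration; field names follow the
// failure_to_regression fixture in scripts/check-core.mjs.
const STAGE_FIELDS = [
  "failing_input_or_trace",
  "diagnosis",
  "fix_or_mitigation",
  "replay_command_or_fixture",
];

function reviewRegressionReceipt(receipt) {
  // Runs that do not repair a recurring failure need no regression lock.
  if (receipt.required !== true) return { closeable: true, missing: [] };

  const filled = (key) => String(receipt[key] ?? "").trim().length > 0;
  // Stages of the loop that were left empty in the receipt.
  const missing = STAGE_FIELDS.filter((key) => !filled(key));
  // Closeout needs either a regression artifact or an explicit skip reason.
  const closeable = filled("regression_artifact") || filled("skip_reason");
  return { closeable, missing };
}

const review = reviewRegressionReceipt({
  required: true,
  failing_input_or_trace: "sanitized trace id fixture",
  diagnosis: "fixture recurring helper failure",
  fix_or_mitigation: "fixture patch",
  replay_command_or_fixture: "npm run check",
  regression_artifact: "",
  skip_reason: "",
});
console.log(review.closeable); // false
```

Reporting empty stages separately from the closeout decision is a design choice of this sketch; the rule in the text only requires an artifact or a skip reason.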

## Budget

Every saved workflow should include a visible `budget` with a token cap and stop rule. Dynamic workflows can cost far more than a normal Codex turn; budget is part of the contract, not an afterthought.
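
A minimal sketch of such a budget check, assuming the `budget.max_tokens` and `budget.stop_when` field names and the 50000-token warning threshold listed in the E8 row of `docs/CWF_RELEASE_READINESS.md`. The `checkBudget` helper, its return shape, and the treatment of `stop_when` as a string are assumptions for illustration, not the preview helpers' actual code.

```javascript
// Hypothetical helper for illustration; not the CWF preview helper.
const EXPENSIVE_RUN_TOKENS = 50000; // warning threshold named in the readiness checklist

function checkBudget(workflow) {
  const budget = workflow.budget ?? {};
  const errors = [];
  const warnings = [];

  // Fail closed: a missing or non-positive cap is an error, not a default.
  if (!Number.isFinite(budget.max_tokens) || budget.max_tokens <= 0) {
    errors.push("budget.max_tokens is required");
  }
  // Assumption: stop_when is a non-empty string describing the stop rule.
  if (!String(budget.stop_when ?? "").trim()) {
    errors.push("budget.stop_when is required");
  }
  if (errors.length === 0 && budget.max_tokens > EXPENSIVE_RUN_TOKENS) {
    warnings.push(`max_tokens ${budget.max_tokens} exceeds ${EXPENSIVE_RUN_TOKENS}`);
  }
  return { ok: errors.length === 0, errors, warnings };
}

console.log(checkBudget({ budget: { max_tokens: 20000, stop_when: "verifier passes" } }).ok); // true
console.log(checkBudget({}).ok); // false
```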
3 changes: 3 additions & 0 deletions docs/CWF_ASYNC_RUNTIME.md
@@ -62,6 +62,9 @@ Async runs should record these fields in `.cwf/runs/RUN_ID/return-envelope.json`
- `heartbeat_status`: `not_requested`, `fixture`, `scheduled`, `scheduled-not-returned`, `delivered`, `failed`, or `unavailable`;
- `sdk_thread_ids`: SDK worker ids when known;
- `desktop_thread_ids`: visible Desktop worker thread ids when created;
- `closeout_gate`: whether completed status can stand or must be downgraded pending checker-owned verification or regression lock;
- `verified_state`: maker-owned versus checker-owned state and the verification receipt;
- `failure_to_regression`: recurring-failure receipt, including regression artifact or skip reason when required;
- `final_summary_path`;
- `evidence_path`;
- `deferred_items`.
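
For orientation, the three closeout-related fields might look roughly like the fragment below. The field shapes follow the fixtures in `scripts/check-core.mjs`; the values are illustrative, and the surrounding check is a hypothetical sketch, not the validation performed by `npm run check`.

```javascript
// Illustrative fragment only; values are made up and other envelope fields are omitted.
const envelopeFragment = {
  closeout_gate: { status: "pass", issues: [] },
  verified_state: {
    maker_owned: ["changed"],
    checker_owned: ["npm run check"],
    verification_receipt: "npm run check passed",
    status: "verified",
  },
  failure_to_regression: {
    required: false,
    regression_artifact: "",
    verified_by: "",
    skip_reason: "",
  },
};

// Hypothetical presence check for the three closeout fields.
const missingFields = ["closeout_gate", "verified_state", "failure_to_regression"].filter(
  (key) => !(key in envelopeFragment),
);
console.log(missingFields.length === 0 ? "all closeout fields present" : `missing: ${missingFields.join(", ")}`);
// prints "all closeout fields present"
```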
4 changes: 2 additions & 2 deletions docs/CWF_RELEASE_READINESS.md
@@ -31,14 +31,14 @@ This checklist tracks public-release readiness evidence. It is not an npm publis

| Phase | Current implementation evidence |
|---|---|
| E1 Return envelope | `scripts/cwf-return-envelope.mjs`; `cwf-run-state init/update` writes `.cwf/runs/RUN_ID/return-envelope.json`; `npm run check` validates required fields and deferred platform callback status. |
| E1 Return envelope | `scripts/cwf-return-envelope.mjs`; `cwf-run-state init/update` writes `.cwf/runs/RUN_ID/return-envelope.json`; `npm run check` validates required fields, closeout gate downgrade, checker-owned verified state, regression lock fields, and deferred platform callback status. |
| Full native runtime v1 real smoke | `scripts/cwf-start.mjs` initializes controller artifacts; `scripts/cwf-worker-sdk.mjs` now calls `@openai/codex-sdk` for real marker runs; host-native `spawn_agent` explorers returned to the coordinator. Checked-in evidence: [docs/evidence/CWF_FULL_NATIVE_RUNTIME_REAL_SMOKE_20260609.md](evidence/CWF_FULL_NATIVE_RUNTIME_REAL_SMOKE_20260609.md). Fixture evidence remains in [docs/evidence/CWF_FULL_NATIVE_RUNTIME_FIXTURES_20260608.md](evidence/CWF_FULL_NATIVE_RUNTIME_FIXTURES_20260608.md). |
| E2 Desktop-thread preflight | `desktop-thread-stdio-observed`: the failed probe used the wrong path (`codex app-server proxy` against the remote-control socket). The correct path is a fresh `codex app-server --listen stdio://` JSONL session. Historical evidence recorded thread `019ea726-a070-73f2-b182-602b905cd9ec` and marker `CWF_LEFT_THREAD_TURN_OK_20260608`. Latest checked-in local dynamic smoke evidence is [docs/evidence/CWF_REAL_DYNAMIC_SMOKE_20260608.md](evidence/CWF_REAL_DYNAMIC_SMOKE_20260608.md). This proves Desktop-thread creation/execution/readback locally, not platform automatic callback. |
| E3 Resume/checkpoint | `scripts/cwf-run-state.mjs` resumes only from the last contiguous completed phase boundary; `npm run check` covers completed, blocked, failed, skipped, missing, and partial fixtures. |
| E4 Safe write | `scripts/cwf-safe-write.mjs` evaluates approval gate, changed paths, forbidden/out-of-scope paths, apply-check result, verification status, changed files, and rollback command. A disposable `/tmp` git-repo real-smoke passed after approval with `git apply --check`, apply, verification, changed files, and rollback evidence. |
| E5 Dynamic generation | `scripts/cwf-generate-workflow.mjs` generates bounded data-only repo-audit and safe-fix-loop workflows and rejects unsafe generated content tokens. |
| E6 Catalog/user workflows | `scripts/cwf-catalog.mjs` contains built-in catalog metadata and project-local `.cwf/workflows/*.workflow.js` discovery with fail-closed validation. |
| E7 Verifier gates | `scripts/cwf-safe-write.mjs` implements `pass`, `blocked`, `needs-waiver`, and `advisory`; `blocked` and unwaived findings prevent final pass. |
| E7 Verifier gates | `scripts/cwf-safe-write.mjs` implements `pass`, `blocked`, `needs-waiver`, and `advisory`; `scripts/cwf-return-envelope.mjs` prevents completed status unless checker-owned closeout state passes; `blocked` and unwaived findings prevent final pass. |
| E8 Budget/cost | Preview helpers fail closed without `budget.max_tokens` or `budget.stop_when`, warn before workers run when `max_tokens > 50000`, and label local token accounting as `estimated`. `npm run check` covers expensive-run warning and unbounded-refusal fixtures. |
| E9 Human status UX | `scripts/cwf-run-state.mjs status` includes conclusion, phase, worker counts, blocker, evidence, next action, final destination, return mode, and verifier status. Final summaries start with a Chinese conclusion. |
| E10 Public readiness | This file plus README/docs/skill synchronization, package dry-run, old-runtime absence, and final review. |
2 changes: 1 addition & 1 deletion docs/RUN_EXPERIENCE.md
@@ -129,7 +129,7 @@ node scripts/cwf-run-state.mjs status --run-id demo
node scripts/cwf-run-state.mjs resume-plan --run-id demo
```

The return envelope records `final_destination`, `return_mode`, `final_summary_path`, `evidence_path`, `verifier_status`, deferred items, and completion status. `return_mode=coordinator_synthesis` is the proven default. Platform automatic callback remains deferred until a real platform smoke proves it.
The return envelope records `final_destination`, `return_mode`, `final_summary_path`, `evidence_path`, `verifier_status`, `closeout_gate`, `verified_state`, `failure_to_regression`, deferred items, and completion status. A completed run is downgraded to pending closeout when checker-owned verified state is missing, or when a required failure-to-regression receipt has neither a regression artifact nor a skip reason. `return_mode=coordinator_synthesis` is the proven default. Platform automatic callback remains deferred until a real platform smoke proves it.

For async runs, also record `runtime_mode`, adapter status, `sdk_thread_ids`, and `desktop_thread_ids` when known. `return_mode=heartbeat_synthesis` means the background run completed, a follow-up in the originating conversation read the local result, and the coordinator observed the expected marker reply before recording delivery. It is not the same as platform automatic callback.

57 changes: 57 additions & 0 deletions scripts/check-core.mjs
@@ -147,6 +147,8 @@ mustContain(skill, "sunny_skill_type: library");
mustContain(skill, "Agent-readable Skill Registry");
mustContain(skill, "Goal Anchor");
mustContain(skill, "goal_delta");
mustContain(skill, "checker-owned");
mustContain(skill, "failure-to-regression");
mustContain(skill, "Output Contract");
mustContain(skill, "references/routing.md");
mustContain(skillRouting, "goal-writer");
@@ -157,6 +159,8 @@ mustContain(skillRunPlanTemplate, "## Objective");
mustContain(skillRunPlanTemplate, "## Goal Anchor");
mustContain(skillRunPlanTemplate, "## Goal Delta");
mustContain(skillRunPlanTemplate, "## Resume Checkpoint");
mustContain(skillRunPlanTemplate, "## Verified State Ownership");
mustContain(skillRunPlanTemplate, "## Failure To Regression");
for (const key of ["should_trigger", "should_not_trigger", "near_neighbors"]) {
if (!Array.isArray(skillTriggerCases[key]) || skillTriggerCases[key].length === 0) {
throw new Error(`skills/codex-workflows/evals/trigger_cases.json missing ${key}`);
@@ -589,9 +593,13 @@ async function checkRunPlanRules() {
"## Budget",
"## Stop Rules",
"## Evidence",
"## Verified State Ownership",
"## Failure To Regression",
"## Resume Checkpoint",
"## Goal Delta",
"goal_delta:",
"verified_by:",
"regression_added:",
".cwf/runs/check/",
]) {
mustContain(markdown, needle);
@@ -718,6 +726,18 @@ function checkReturnEnvelopeRules() {
    status: "completed",
    updated_at: "2026-06-08T00:00:00.000Z",
    verifier_evaluations: [{ status: "advisory", summary: "follow-up optional" }],
    verified_state: {
      maker_owned: ["changed"],
      checker_owned: ["npm run check"],
      verification_receipt: "npm run check passed",
      status: "verified",
    },
    failure_to_regression: {
      required: false,
      regression_artifact: "",
      verified_by: "",
      skip_reason: "",
    },
    deferred_items: [{ id: "desktop-thread-execution-preflight", status: "requires_approval" }],
  };
  const envelope = buildReturnEnvelope(state);
@@ -735,6 +755,9 @@ function checkReturnEnvelopeRules() {
"sdk_thread_ids",
"desktop_thread_ids",
"verifier_status",
"closeout_gate",
"verified_state",
"failure_to_regression",
"deferred_items",
"completion_status",
]) {
@@ -752,6 +775,9 @@ function checkReturnEnvelopeRules() {
  if (!envelope.deferred_items.some((item) => item.status === "requires_approval")) {
    throw new Error("return envelope must preserve deferred approval items");
  }
  if (envelope.completion_status !== "completed" || envelope.closeout_gate.status !== "pass") {
    throw new Error("return envelope must require closeout gate pass before completed status");
  }

const idsEnvelope = buildReturnEnvelope({
...state,
@@ -768,6 +794,37 @@ function checkReturnEnvelopeRules() {
if (heartbeatEnvelope.return_mode !== "heartbeat_synthesis") {
throw new Error("return envelope must preserve state return_mode when no override is provided");
}

  const missingVerifiedEnvelope = buildReturnEnvelope({
    ...state,
    verified_state: {
      maker_owned: ["changed"],
      checker_owned: [],
      verification_receipt: "",
      status: "pending",
    },
  });
  if (missingVerifiedEnvelope.completion_status !== "pending-verified-state") {
    throw new Error("return envelope must not complete without checker-owned verified state");
  }

  const missingRegressionEnvelope = buildReturnEnvelope({
    ...state,
    failure_to_regression: {
      required: true,
      failing_input_or_trace: "sanitized trace id fixture",
      diagnosis: "fixture recurring helper failure",
      fix_or_mitigation: "fixture patch",
      replay_command_or_fixture: "npm run check",
      regression_artifact: "",
      verified_by: "npm run check",
      sensitive_data_handling: "sanitized",
      skip_reason: "",
    },
  });
  if (missingRegressionEnvelope.completion_status !== "pending-regression-lock") {
    throw new Error("return envelope must not complete required regression loop without artifact or skip reason");
  }
}

function checkDynamicGenerationRules() {
63 changes: 60 additions & 3 deletions scripts/cwf-return-envelope.mjs
@@ -8,7 +8,8 @@ import { parseArgs, printHelp, readJsonFile, wantsHelp } from "./lib/cli.mjs";
export function buildReturnEnvelope(state, options = {}) {
const runDir = options.runDir ?? `.cwf/runs/${state.run_id}`;
const verifier = evaluateVerifierGate(state.verifier_evaluations ?? []);
const completionStatus = deriveCompletionStatus(state, verifier);
const closeoutGate = evaluateCloseoutGate(state, verifier);
const completionStatus = deriveCompletionStatus(state, verifier, closeoutGate);
const deferredItems = [
...(state.deferred_items ?? []),
...(options.deferredItems ?? []),
@@ -39,13 +40,66 @@ export function buildReturnEnvelope(state, options = {}) {
    desktop_thread_ids: collectWorkerIds(state, "desktop_thread_id"),
    verifier_status: verifier.status,
    verifier: verifier,
    closeout_gate: closeoutGate,
    verified_state: state.verified_state ?? {
      maker_owned: [],
      checker_owned: [],
      verification_receipt: "",
      status: "pending",
    },
    failure_to_regression: state.failure_to_regression ?? {
      required: false,
      regression_artifact: "",
      verified_by: "",
      skip_reason: "",
    },
    deferred_items: deferredItems,
    completion_status: completionStatus,
    run_status: state.status ?? "planned",
    updated_at: state.updated_at ?? new Date().toISOString(),
  };
}

export function evaluateCloseoutGate(state, verifier = evaluateVerifierGate(state.verifier_evaluations ?? [])) {
  if (state.status !== "completed" || !verifier.final_pass) {
    return { status: "not_applicable", issues: [] };
  }

  const issues = [];
  const verifiedState = state.verified_state ?? {};
  const checkerOwned = Array.isArray(verifiedState.checker_owned) ? verifiedState.checker_owned.filter(Boolean) : [];
  const verificationReceipt = String(verifiedState.verification_receipt ?? "").trim();
  const verifiedStatus = String(verifiedState.status ?? "pending");
  if (
    verifiedStatus === "pending" ||
    verifiedStatus === "needs_review" ||
    (checkerOwned.length === 0 && !verificationReceipt)
  ) {
    issues.push({
      id: "verified-state-missing",
      status: "pending-verified-state",
      reason: "Completed runs need checker-owned state or a verification receipt before they can claim done.",
    });
  }

  const regression = state.failure_to_regression ?? {};
  if (
    regression.required === true &&
    !String(regression.regression_artifact ?? "").trim() &&
    !String(regression.skip_reason ?? "").trim()
  ) {
    issues.push({
      id: "regression-lock-missing",
      status: "pending-regression-lock",
      reason: "Recurring-failure repairs need a regression artifact or an explicit skip reason before closeout.",
    });
  }

  if (issues.length === 0) return { status: "pass", issues: [] };
  if (issues.length === 1) return { status: issues[0].status, issues };
  return { status: "pending-closeout-gate", issues };
}

function collectWorkerIds(state, key) {
return [...new Set((state.workers ?? []).map((worker) => worker[key]).filter(Boolean))];
}
@@ -58,9 +112,12 @@ export async function writeReturnEnvelope(runDir, state, options = {}) {
return { path: outputPath, envelope };
}

function deriveCompletionStatus(state, verifier) {
if (state.status === "completed" && verifier.final_pass) return "completed";
function deriveCompletionStatus(state, verifier, closeoutGate = evaluateCloseoutGate(state, verifier)) {
if (state.status === "completed" && verifier.final_pass && closeoutGate.status === "pass") return "completed";
if (state.status === "completed" && verifier.status === "pending") return "pending-verification";
if (state.status === "completed" && closeoutGate.status !== "not_applicable" && closeoutGate.status !== "pass") {
return closeoutGate.status;
}
if (verifier.status === "blocked") return "blocked";
if (verifier.status === "needs-waiver") return "needs-waiver";
if (state.status === "cancelled") return "cancelled";
24 changes: 24 additions & 0 deletions scripts/cwf-run-plan.mjs
@@ -130,6 +130,28 @@ export function renderRunPlanMarkdown(plan) {
lines.push("- Record final synthesis in the originating Codex conversation.");
lines.push("- Label evidence as local, fixture, dry-run, real-smoke, requires_approval, or blocked.");

  lines.push("", "## Verified State Ownership");
  lines.push("- Maker-owned fields: attempted / proposed / changed / needs_review");
  lines.push("- Checker-owned fields: verified / passed / done / regression_locked");
  if (verifierAgents.length > 0) {
    lines.push(`- Checker: ${verifierAgents.map((agent) => agent.id).join(", ")}`);
  } else {
    lines.push("- Checker: coordinator-held deterministic test, replay command, external evidence, or human reviewer");
  }
  lines.push("- Verification receipt: required before any verified/passed/done claim");
  lines.push(`- Atomic status artifact: ${runId ? `.cwf/runs/${runId}/state.json` : ".cwf/runs/RUN_ID/state.json"}`);
  lines.push("- Rule: implementer/maker workers must not write verified state directly.");

  lines.push("", "## Failure To Regression");
  lines.push("- Failing input or trace: N/A unless this run repairs a recurring workflow, helper, route, connector, skill, or harness failure.");
  lines.push("- Diagnosis:");
  lines.push("- Fix or mitigation:");
  lines.push("- Replay command or fixture:");
  lines.push("- Regression artifact:");
  lines.push("- Verified by:");
  lines.push("- Sensitive data handling:");
  lines.push("- Skip reason:");

lines.push("", "## Resume Checkpoint");
lines.push(`- ${resumeCheckpoint}`);

@@ -139,6 +161,8 @@ export function renderRunPlanMarkdown(plan) {
lines.push(` run_id: ${runId || ""}`);
lines.push(" completed:");
lines.push(" evidence_added:");
lines.push(" verified_by:");
lines.push(" regression_added:");
lines.push(" blockers:");
lines.push(" next_slice:");
lines.push(" next_cwf_run:");