[codex] add reviewed parallel worktree starter #9
Draft
ChaosRealmsAI wants to merge 37 commits into
Why: each codex node lazily starts a codexctl daemon via --session-socket but never stopped it, so finished sessions left a daemon + codex child resident (dogfooding accumulated 200+ orphaned daemons, ~3GB) until codexctl's idle-timeout reaped them. What: exec/resume/answer stop the daemon once a session is terminal — guarded by session_is_terminal (not awaiting input, not running async), so a needs_input / answer --no-wait continuation is never orphaned. resume() resolves a live run by mirroring that rule: a parked (needs_input) session's daemon was kept, so its run is still live and is continued directly; any other session was reaped, so its ephemeral run_id is dead and the thread is resurrected from codex's persisted rollout (thread_id) into a fresh run used for send AND execute. This keeps resume working across daemon reaping (verified: exec -> daemon=0 -> resume recalls context, incl. multi-hop). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
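The stop-and-resume rule in this commit message can be sketched as below. This is a minimal illustration, not the project's code: the session field names (`awaitingInput`, `runningAsync`, `runId`, `threadId`) and the helper names `shouldStopDaemon` and `resolveResume` are assumptions. Only `session_is_terminal` is named in the commit.

```javascript
// Hypothetical session shape: { awaitingInput, runningAsync, runId, threadId }.
// These field names are illustrative assumptions, not the project's schema.

// A session is terminal only when nothing can still continue it:
// it is not awaiting input and not running asynchronously.
function sessionIsTerminal(session) {
  return !session.awaitingInput && !session.runningAsync;
}

// exec/resume/answer stop the per-session daemon only for terminal sessions,
// so a needs_input or `answer --no-wait` continuation is never orphaned.
function shouldStopDaemon(session) {
  return sessionIsTerminal(session);
}

// Resume mirrors the rule as the commit message states it: a parked
// (needs_input) session kept its daemon, so its run is continued directly;
// any other session is resurrected from the persisted thread into a fresh run.
function resolveResume(session) {
  if (session.awaitingInput) {
    return { mode: "continue", runId: session.runId };
  }
  return { mode: "resurrect", threadId: session.threadId };
}
```

Keeping the stop decision and the resume decision keyed to the same predicate is what makes resume safe after reaping: whenever the daemon was stopped, resume never trusts the old ephemeral run id.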
Why: .odw/runs grew unbounded (dogfooding accumulated 4882 run dirs). What: after creating a run dir, prune all but the newest ODW_RUNS_KEEP runs (default 50, 0=off). Protects this run's own dir and the resume source, so a concurrent exec in the same repo cannot delete an in-progress run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
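The pruning rule can be sketched as a pure selection step. This is a sketch under assumptions: the function name `selectRunsToPrune` and the `{ name, mtimeMs }` entry shape are hypothetical, and ordering by modification time is a guess at what "newest" means here. Only `ODW_RUNS_KEEP`, its default of 50, and the `0` off switch come from the commit message.

```javascript
// Decide which run directories to delete, without touching the filesystem.
// `runs`: list of { name, mtimeMs } entries (hypothetical shape).
// `keep`: models ODW_RUNS_KEEP (default 50, 0 disables pruning).
// `protectedNames`: this run's own dir and the resume source, never deleted.
function selectRunsToPrune(runs, keep = 50, protectedNames = []) {
  if (keep <= 0) return []; // 0 = pruning off
  const protectedSet = new Set(protectedNames);
  // Sort a copy newest-first so the caller's array is left untouched.
  const newestFirst = [...runs].sort((a, b) => b.mtimeMs - a.mtimeMs);
  return newestFirst
    .slice(keep) // everything beyond the newest `keep` runs
    .filter((run) => !protectedSet.has(run.name))
    .map((run) => run.name);
}
```

Separating selection from deletion keeps the protection rule easy to test: a protected directory is filtered out even when it is old enough to fall outside the newest `keep` runs.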
What changed
- `odw starter parallel-review-apply` workflow for large project work: parallel implementation worktrees, review gate, approve-only atomic apply, final read-only verification, and report visibility.
- `args.request` / `args.spec` planning: when explicit `args.tasks` are omitted, the starter asks a structured planner for owned task files, then runs the same preflight, parallel implementation, review, repair, apply, and verify gates.
- `odw runs show` summaries for review/apply/snapshot outcomes.
- `odw runs show` now summarizes final workflow history so request-only runs show plan, review reject/approve, targeted repair, and verify status without opening the full HTML report.
- `odw runs list` now defaults to a compact human-readable run list, while `odw runs list --json` preserves the raw machine-readable records.
- `odw runs show` views now surface `state.result.error` in the header, so failed planner/implementation runs show the actionable cause before the event tail.
- (`actions/checkout@v6`, `actions/setup-node@v6`, `actions/cache@v5`) ahead of the June 2026 GitHub Actions Node 20 deprecation warning.
- `--task`, `--task-file`, `--cd`, `--runtime`, `--timeout-ms`, and `--json` instead of showing blank option descriptions.
- `doctor`, `models`, and `list`; `--json` preserves the full machine-readable reports for automation.
- `status` / `logs` fall back to saved session state plus local logs when a completed per-session Codex daemon has already been reaped.
- `codexctl session start` returns, so `status` / `list` can see a just-launched long task instead of reporting no latest session.
- `needs_input` with a non-zero exit code.
- `odw runs show`, so patch conflicts or review-workspace failures are actionable without scraping stderr.
- `createDecisionDigest` repair the owning core task without broadly rerunning docs/tests tasks.
- `odw runs show` and HTML report workflow history now include the first review blocker sample, so a reject line shows the actionable cause instead of only `blockers=1`.
- `allowDuplicateTaskFiles:true` intent.
- `runtime` / `permission` values so a planner cannot route implementation work to unsupported PandaCode commands such as `node`.

Why
This supports the intended remote product-work loop: owner comments/specs can fan out into isolated AI tasks, be reviewed adversarially as a combined candidate, and only land after an explicit approve gate and verification evidence. The request/spec planner reduces owner decision cost when the owner is working from a short doc comment or iPad note instead of hand-authoring task/file decomposition.
Root causes fixed
- `needs_input` with question context during `start` and exit non-zero. PandaCode previously only persisted pending input on `ok` executor output, so ODW auto-answer could fail with `cannot infer Codex question id for --text`.
- `reject`, forcing repeated `schema_mismatch` retries. The same run showed a tests/docs task inventing an undeclared `index.js` public entrypoint and trying to skip tests in an isolated worktree. The runtime and starter prompts now make those failure modes explicit.
- `index.js` ownership fixed showed tests/docs still invented a different schema (`runId`, `complete`, `pending`) because implementation nodes only saw their own task prompt. The starter now gives every task the shared run context and full planned task contracts.
- `runtime` like `node`; the starter now rejects/normalizes unsupported planner runtimes before implementation fan-out.
- `odw runs list` was raw JSON by default, which is hard to scan during remote/iPad-style project checks. The default list is now compact, with `--json` for automation.
- `runs show` buried terminal causes like `planning_failed` and `no captured worktree changes` in the event tail. Failed run headers now extract object/string result errors directly from state.
- `doctor`, `models`, and `list` all printed large raw JSON by default, so the common "is this usable right now?" check was harder than necessary. The defaults now summarize runtime health, model counts/defaults, and session counts, while scripts keep the old full shape via `--json`.
- `node test.mjs exits ... src/annotations.js does not infer...` were repaired as `public-api-tests-docs` because `test.mjs` appeared first. The repair selector now scores root-cause file mentions and penalizes test-failure evidence, so the next repair targets `src/annotations.js`.
- `error: corrupt patch at line ...` even though each candidate diff applied individually. Root cause: ODW used `trimEnd()` before concatenating git patches, which stripped the single-space line that represents a blank context line in unified diffs. The combiner now preserves patch text and only appends a missing newline.
- `review r1/r2/r3: reject ... blockers=0` while the actionable `corrupt patch` cause only appeared in stderr. Review preflight failures now emit a structured blocker plus preflight category/message so repair prompts, `runs show`, and reports can surface the cause directly.
- (`odw-exec-1780521870647-88709`) showed repeated integration-only repairs while reviewers identified the exported `createDecisionDigest` core contract as the root cause. The repair selector now uses candidate-defined symbols to map function-named blockers back to the owning task.
- (`odw-exec-1780523246667-9602`) showed prompt-symbol matching was too broad and caused full-batch repairs, which made independent task contracts churn. Symbol matching now prefers definitions found in captured diffs and only falls back to task prompts when no definition evidence exists.
- `status` could report a successful completed session as failed after PandaCode intentionally stopped the per-session daemon: `codexctl session read` returned `ok:false unknown run id`, while `list` and `artifacts` still had the session record. PandaCode now persists last state/summary and treats that read failure as a local-record fallback instead of a task failure.
- `codexctl session start`, prompt/log files already existed but no session record had been saved yet, so `status` and `list` still showed no latest session. PandaCode now writes a provisional `starting` session record with prompt/log/socket artifacts before invoking start.

Validation
- `node scripts/selftest.mjs` in `odw`: 87/87 passed, including structured review preflight blocker/event assertions, symbol-named root-cause repair selection, blocker sample history, and 4+ task default review-round coverage
- `cargo test` at workspace root: ODW unit tests + parity selftest, PandaCode 122 unit tests, and 16 fake runtime tests passed
- `cargo clippy --workspace --all-targets -- -D warnings`: passed
- `pandacode` binary that `pandacode status --runtime codex --cd /Users/Zhuanz/workspace/pandacode-direct-dogfood-20260604` now returns `ok:true`, `state:"completed"`, and `live_read_unavailable:true` instead of surfacing the dead-daemon `unknown run id` as the session state.
- `pandacode logs --runtime codex --json` falls back to `.pandacode/codex/runs/<session>/logs/latest.jsonl` and returns a local log tail when live read is unavailable.
- `pandacode codex status --session latest` while `codexctl session start` is still sleeping; the latest pointer exists, `status.state` is `starting`, and top-level `list` shows one Codex session.
- `cargo fmt --package open-dynamic-workflow --check`: passed
- `odw 0.3.1`, `pandacode 0.3.1`
- `pandacode run --help` shows descriptions for inline task text, task files, workspace directory, runtime selection, machine-readable JSON, and timeout.
- `pandacode doctor`, `pandacode models`, and `pandacode list` default to compact summaries, and `--json` output for all three still parses as JSON.
- `odw runs show odw-exec-1780515008596-9952 --path /Users/Zhuanz/workspace/odw-request-planner-dogfood-20260604-v3` prints a compact workflow history: plan → reject → targeted repair → approve → verify.
- `/Users/Zhuanz/workspace/odw-request-planner-dogfood-20260604-v3/.odw/runs/odw-exec-1780515008596-9952/report.html` with the latest `odw`: report overview contains plan → reject → targeted repair → approve → verify and `"failed":false`.
- `odw runs list --path /Users/Zhuanz/workspace/odw-request-planner-dogfood-20260604-v3` prints a compact one-line run summary and `odw runs list --json` still parses as raw run records.
- `odw runs show` on failed dogfood runs `odw-exec-1780514257522-79981` and `odw-exec-1780514812502-99690` surfaces `Failure: planning_failed: ...` and `Failure: no captured worktree changes` in the header.
- `pandacode run --help` and found key task/session flags had empty descriptions. Added help text plus a fake-runtime regression test for the common run options.
- `odw-exec-1780505513941-97703`: parallel docs run approved, applied atomically, final `npm test` passed, verify snapshot clean.
- `odw-exec-1780507859710-45655`: implementation/tests/docs split across 3 worktrees, Codex `needs_input` auto-answered, first review rejected docs only, docs-only repair retained implementation/test candidates, second review approved, 3 patches applied atomically, final `npm test` passed, verify snapshot clean.
- `odw-exec-1780510655854-15704`: schema/analysis/report/tests-docs split across 4 worktrees, review rejected failing/incoherent tests, repair stayed local to implicated tasks, Codex `needs_input` auto-answer was exercised, max review rounds stopped without applying unsafe patches, and the final gate retained blocker evidence.
- `odw-exec-1780512381288-43153`: explicit `index.js` ownership, no reviewer `schema_mismatch` after prompt hardening, tests/docs no longer skipped isolated verification, final gate safely rejected an invented API contract and retained evidence for the new shared-context fix.
- `odw-exec-1780513476465-57923`: mobile planning-board module split across 4 worktrees, every implementation prompt included batch context/full planned task contracts, review approved, 4 captured patches applied atomically, final `node test.mjs` passed, verify snapshot clean, and output aligned on `index.js` plus canonical `planned/running/blocked/done` statuses instead of the prior invented contract.
- `odw-exec-1780514257522-79981`: exposed long structured JSON extraction failure before any unsafe implementation landed.
- `odw-exec-1780514812502-99690`: planner passed schema but selected unsupported `runtime:"node"`; implementation failed safely before applying changes.
- `odw-exec-1780515008596-9952`: high-level request only, planner produced 4 owned tasks, first review rejected a failing public test, targeted repair reran only `public-fixtures-tests-docs`, second review approved, 4 captured patches applied atomically, final `node test.mjs` passed, verify snapshot clean.
- `odw-exec-1780518366711-1253`: high-level request only, planner produced 4 owned tasks, implementation completed 4 worktrees, review correctly rejected failing source/test behavior, and the run safely refused to land after exposing that root-cause blockers were repeatedly routed to the tests/docs task. Added a mock regression test for that repair-targeting failure.
- `odw-exec-1780519403300-26020` after the root-cause repair fix: same high-level owner request, planner produced 5 owned tasks, implementation completed 5 worktrees, dual review approved in round 1 after `node test.mjs` and `npm test`, 5 patches applied atomically, final `node test.mjs` passed, verify snapshot clean.
- `odw-exec-1780519926354-29730`: existing first-slice project plus a high-level request for Feishu-friendly Markdown specs; planner produced 4 owned tasks and implementation completed, but review preflight repeatedly failed safely with `corrupt patch` before landing. Added a mock regression test covering combined patches with trailing blank context lines.
- `odw-exec-1780520642507-54665` after the patch combiner fix: planner produced 3 owned tasks, implementation completed 3 worktrees, review workspace applied the combined candidate without corrupt patch, first review rejected missing prompt content, targeted repair reran only generator/docs while retaining API/tests, second review approved, 3 patches applied atomically, final `node test.mjs` passed, verify snapshot clean.
- `odw-exec-1780521870647-88709`: planner produced 2 owned tasks, review found failing digest source-id behavior and batch-only input bugs, targeted repair repeatedly reran only integration and safely refused to land when the root cause belonged to `createDecisionDigest`.
- `odw-exec-1780523246667-9602`: after initial symbol matching, planner produced 4 owned tasks and review found a batch metadata mismatch, but prompt-symbol matching was too broad and all tasks were repaired; final review safely rejected inconsistent `trace.batch.id` / `trace.batch.batchId` contract churn without landing.
- `odw-exec-1780524017351-26812` after preferring diff-defined symbols: planner produced 4 owned tasks, first review rejected over-broad open-question classification in `src/decision-digest.js`, targeted repair reran only `create-decision-digest|render-decision-digest` while retaining 4 files, second review approved, 4 patches applied atomically, final `node test.mjs` passed, verify snapshot clean. Re-running `odw runs show` with the latest CLI now shows that first blocker sample directly in workflow history.
- `odw-exec-1780525079576-47490`: artifact snapshot diff + owner handoff packet request, planner produced 4 owned tasks, review/repair targeted the right core/Markdown/docs tasks while retaining tests, and the run failed safely at max review round 3 with concrete blocker evidence instead of landing a contract mismatch.
- `odw-exec-1780525853321-53849` with a 4-round ceiling: planner produced 5 owned tasks, first review rejected inconsistent digest/source-comment expectations, targeted repair reran 4 tasks; second review rejected `sourceComments.changed` contract mismatch, targeted repair reran only 3 tasks while retaining README/tests; third review approved, 5 patches applied atomically, final `node test.mjs` passed, verify snapshot clean. Dogfood project commit: `748355c add owner handoff packet dogfood slice`.
- `odw-exec-1780527063497-83594` using the new default with no explicit `maxReviewRounds`: request-only planner produced 4 owned tasks, first reject showed the real default as `round 2/4`, targeted repair reran only tests/core; second reject reran tests/core/Markdown-API while retaining docs; third review approved, 4 captured patches applied atomically, final `node test.mjs` passed, verify snapshot clean. Dogfood project commit: `432c57b add owner review queue dogfood slice`.
- `/Users/Zhuanz/workspace/pandacode-direct-dogfood-20260604`: `pandacode run --runtime codex` built a dependency-free ESM remote product-comment inbox library, generated tests/docs, and reported `npm test` passing. During the long run, `status` could not see `latest` until the session record was saved; after completion, old `status` showed `unknown run id` despite successful artifacts. Added record-save/fallback behavior and fake-runtime coverage.
- `/Users/Zhuanz/workspace/pandacode-active-status-dogfood-20260604`: while execute was running, patched `status` returned `state:"running"` and the last agent message; after completion, patched `status` returned `ok:true`, `state:"completed"`, `live_read_unavailable:true`, and the saved last summary. The generated remote-spec workflow package passed `npm test` and was committed as `342221a add remote spec workflow pandacode dogfood`.
- `preflight_category` / `preflight_message`, for `runs show` rendering those fields in compact recent events, for symbol-named root-cause repair selection, and for review blocker samples in CLI/HTML workflow history.
- Reports:
  - `/Users/Zhuanz/workspace/odw-real-codex-dogfood-20260603/.odw/runs/odw-exec-1780505513941-97703/report.html`
  - `/Users/Zhuanz/workspace/odw-dependent-task-dogfood-20260603/.odw/runs/odw-exec-1780507859710-45655/report.html`
  - `/Users/Zhuanz/workspace/odw-four-task-dogfood-20260604/.odw/runs/odw-exec-1780510655854-15704/report.html`
  - `/Users/Zhuanz/workspace/odw-four-task-success-dogfood-20260604/.odw/runs/odw-exec-1780512381288-43153/report.html`
  - `/Users/Zhuanz/workspace/odw-shared-context-success-dogfood-20260604/.odw/runs/odw-exec-1780513476465-57923/report.html`
  - `/Users/Zhuanz/workspace/odw-request-planner-dogfood-20260604/.odw/runs/odw-exec-1780514257522-79981/report.html`
  - `/Users/Zhuanz/workspace/odw-request-planner-dogfood-20260604-v2/.odw/runs/odw-exec-1780514812502-99690/report.html`
  - `/Users/Zhuanz/workspace/odw-request-planner-dogfood-20260604-v3/.odw/runs/odw-exec-1780515008596-9952/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780518366711-1253/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780519403300-26020/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780519926354-29730/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780520642507-54665/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780521870647-88709/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780523246667-9602/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780524017351-26812/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780525079576-47490/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780525853321-53849/report.html`
  - `/Users/Zhuanz/workspace/odw-remote-product-loop-dogfood-20260604-v1/.odw/runs/odw-exec-1780527063497-83594/report.html`
- `/Users/Zhuanz/workspace/pandacode-installed-start-smoke-20260604`: `env FAKE_CODEX_START_SLEEP=30 pandacode codex exec ...` used the installed `/Users/Zhuanz/.cargo/bin/pandacode`; while `codexctl session start` was still sleeping, `pandacode codex status` returned `ok:true`, `state:"starting"`, `live_read_unavailable:true`, and top-level `pandacode list` showed `codex: 1`; after start completed, `status` and `logs` returned the completed fake session successfully.
- `/Users/Zhuanz/workspace/odw-installed-complex-dogfood-20260604`: `odw exec --backend pandacode --input-file odw-input.json --effort medium` used the installed CLI on a fresh repo, planner produced 4 owned tasks, 4 Codex worktrees implemented 9 files, two review agents approved in round 1 after `npm test` and export/integration smoke checks, 4 patches landed atomically, final `npm test` passed, verify guard was clean, and dogfood project commit `8dc4311 add remote product ops dogfood slice` captured the result. Report: `/Users/Zhuanz/workspace/odw-installed-complex-dogfood-20260604/.odw/runs/odw-exec-1780530282012-8494/report.html`.
- `/Users/Zhuanz/workspace/odw-installed-complex-dogfood-20260604`: run `odw-exec-1780530747351-12325` iterated an existing package with owner inbox, Markdown handoff, and batch progress APIs. Review round 1 correctly rejected a Markdown/progress integration bug, then old repair targeting overmatched all 4 tasks (`markdown-handoff`, `batch-progress`, `owner-inbox`, `public-api-docs`) even though the blocker named `src/markdown.js` / `src/progress.js`. Added `b238689 fix(odw): avoid prompt-symbol repair overmatching`, so explicit blocker file paths now take precedence over broad prompt-symbol fallback; new selftest `does not let prompt API mentions broaden file-path repair` covers the dogfood failure. Dogfood result still converged after repair, final `npm test` passed, verify guard clean, dogfood project commit `435544b add owner inbox and markdown handoff dogfood slice`; report: `/Users/Zhuanz/workspace/odw-installed-complex-dogfood-20260604/.odw/runs/odw-exec-1780530747351-12325/report.html`.
- `b238689`: `node scripts/selftest.mjs` passed 88/88, `ODW=$(which odw) node scripts/selftest.mjs` passed 88/88 against the installed CLI, `cargo test` passed, `cargo clippy --workspace --all-targets -- -D warnings` passed, `cargo fmt --all --check` passed, and `git diff --check` passed.
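
The `corrupt patch` root cause listed under "Root causes fixed" can be reproduced in a few lines. The sketch below is illustrative only: the function names `combinePatchesBuggy` and `combinePatches` and the sample diff are assumptions, not ODW's actual combiner. It shows why `trimEnd()` is unsafe on unified diffs whose last hunk line is a blank context line (a single space), and what "preserve patch text and only append a missing newline" looks like.

```javascript
// Buggy approach: trimEnd() strips all trailing whitespace, including the
// single-space line that encodes a blank context line at the end of a hunk.
function combinePatchesBuggy(patches) {
  return patches.map((p) => p.trimEnd()).join("\n") + "\n";
}

// Fixed approach: keep each patch byte-for-byte and only append a final
// newline when one is missing.
function combinePatches(patches) {
  return patches.map((p) => (p.endsWith("\n") ? p : p + "\n")).join("");
}

// Sample diff whose hunk ends with a blank context line.
const patchA = [
  "--- a/a.txt",
  "+++ b/a.txt",
  "@@ -1,3 +1,3 @@",
  "-old",
  "+new",
  " keep",
  " ", // blank context line: exactly one space
  "",
].join("\n");

// Second diff with no trailing newline, to exercise the append step.
const patchB = ["--- a/b.txt", "+++ b/b.txt", "@@ -1 +1 @@", "-x", "+y"].join("\n");

const good = combinePatches([patchA, patchB]);
const bad = combinePatchesBuggy([patchA, patchB]);
```

In `bad`, the first hunk header still declares three lines per side but only two remain, which is the kind of mismatch `git apply` reports as a corrupt patch even though each input diff applies on its own.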