Live checklist tracked by the drift-sentinel loop. Source of truth for "where are we right now."
Conventions:
- `[ ]` = not started
- `[~]` = in progress
- `[✓]` = complete + verified
- `[!]` = blocked / failed
- Each item ends with `· verified <timestamp>` once Loop 2 confirms.
Last loop-1 sentinel run: tick 4 (2026-05-04T18:13Z)
Last loop-2 gate run: (none)
Last E2E sweep: i2p-smoke-2 webapp counter (2026-05-04 20:27) — verdict=blocked (honest, real merge bug surfaced), $1.21, ~560s, full pipeline compile→build→merge→audit→render with no crash. Report: docs/i2p-smoke-2-20260504-202757.md. Session: /tmp/otto-i2p-smoke-2-20260504-202757/otto_logs/sessions/2026-05-05-032804-b07585/.
Prior E2E sweep: tick 10/11 cli — verdict=blocked (honest), $1.52, 586s. Smoke contract PASS.
Prior E2E sweep: tick 5/6 webapp — verdict=blocked (honest), $0.55, 212s. Smoke contract PASS.
- [✓] A8 — bundled screenshot/video capture — `_synthesized_webapp_walkthrough` now runs Playwright for default webapp walkthroughs after Flask/static discovery. Static discovery includes generated output dirs plus a vanilla root `index.html`. Artifacts: `screenshot-home.png`, `dom-home.html`, `walkthrough.webm` when video is produced, `browser-capture.log`, and conservative `walkthrough.jsonl`. BrowserJourney checks also recover printed screenshot/video paths when `evidence_globs` miss. Missing Playwright/browser binary is logged honestly and falls back to HTML evidence. Codex verified Chromium launches and captures screenshot/video in this worktree. · verified 2026-05-05
- [✓] A9 — provider session continuity — `BuildAgentOutput.session_id` and `BuildAgentInput.agent_session_id` now thread through build retries, merge repair, audit compatibility repair, and Layer 2 repair; `default_build_agent` maps it to `AgentOptions.resume`. · verified 2026-05-05
- [✓] A10 — live retry-layer collapse — `run_pipeline` calls `run_audit(..., fix_agent=None)` and reserves the supplied fix agent for `repair_failing_features`; direct `run_audit` callers retain the old compatibility loop. · verified 2026-05-05
- [✓] A11 — superseded merge eligibility — merge queue now computes latest Group/Component result per id, ignores older PASSING entries superseded by later results, and uses the latest result's branch/worktree for the candidate. Base freshness remains the merge-into-current-HEAD + verification/rollback strategy. · verified 2026-05-05
- [✓] A3.2 — proof templates — added `otto/web/templates/proof-packet.html.j2` and `feature-proof.html.j2`; the proof packet and per-Feature proof renderers now load those templates through dependency-free placeholder substitution. · verified 2026-05-05
- [✓] B5 — combined lifecycle render fixture — `tests/test_render.py` now has one render-run fixture with a landed Group and blocked Group, asserting HTML + JSON lifecycle state together. · verified 2026-05-05
- [✓] Regression sweep for this pass — `uv run pytest -q tests/test_audit.py tests/test_build.py tests/test_merge_queue.py tests/test_runner.py tests/test_runner_layer2_fix.py tests/test_audit_loop_repair.py tests/test_render.py tests/test_render_per_feature.py tests/test_a1a_dataclasses.py -k 'not test_autopilot_full_executes_safe_recovery_once'` -> 294 passed. `uv run python scripts/test_tiers.py fast` -> 1391 passed, 531 deselected.
- [✓] Issue 1 — `--yes` rejected with stale `otto run` hint — `otto/cli.py` build-path: `--yes` removed from i2p ignored-flag list (silently accepted; the i2p path has no interactive spec-approval step, so it's a definitional no-op). Stale "pass them to `otto run`" hint reworded to drop the bogus subcommand reference (build + certify paths). No tests assert on the message text. · verified 2026-05-04
- [✓] Issue 2 — `--budget` semantics — no code change. `otto/cli.py:1006,1173` already declare `--budget` help as "Total wall-clock budget in seconds, must be > 0"; CLAUDE.md doesn't claim USD. Confirmed via `otto build --help` output. · verified 2026-05-04
- [✓] Issue 3 — missing `summary.json` at session root — real regression. `otto.runner.run_pipeline` is intentionally headless and only emits `proof-packet.{html,json}`, but `summary.json` is consumed by `otto/resume.py:_read_prior_accounting`, `otto/runs/history.py`, `otto/runs/atomic_repair.py`, and Mission Control. Added `_persist_session_summary` helper in `otto/cli_run.py` that calls existing `_write_session_summary` from `otto.runs.lifecycle` after every i2p completion; wired into greenfield tail of `orchestrate_run` and into `_drive_brownfield_pipeline` (covering certify + improve). Threaded `command="build|certify|improve"` through call sites. Failures are logged + non-fatal (bookkeeping must not mask verdict). Full sweep `tests/ -q --ignore=integration --ignore=browser`: 1585 passed. · verified 2026-05-04
- [✓] A6 — Mid-build spec edit invalidation — landed end-to-end. New lifecycle state `editing_in_flight` (`otto/web/spec_review_routes.py:52`); the runner sets it before `run_build` and reverts to `approved` after re-dispatch (`otto/runner.py` build-phase block + `_set_lifecycle_best_effort`). `compute_invalidation(old_spec, new_spec) -> InvalidationPlan` in `otto/spec_amend.py` returns direct (name / feature_ids / dependencies / owned_paths / checks deltas) + cascading (Groups whose deps include an invalidated id) Group invalidations, plus `removed_group_ids` / `added_group_ids` informational sets. `POST /api/specs/<id>/edit` accepts edits during `editing_in_flight` and emits `group.invalidated_by_spec_edit` events per affected Group. New event kind in `otto/spec_state.py:96` plus an `INVALIDATED` phase. Build loop's `_run_slice` checks the journal between attempts and aborts in place with `BLOCKED + "invalidated by spec edit: <reason>"` (helper `_spec_edit_invalidation_reason`); the worktree is left unmerged for forensic value. Runner re-dispatches once via `_redispatch_invalidated_groups`: scans the journal, re-loads the post-edit Spec from `<session>/spec/spec.json`, runs a second `run_build` over the invalidated subset (`skip_components` covers everything else), and merges the GroupResult lists with cost/wall accumulation. Tests: 8 new `compute_invalidation` cases in `tests/test_spec_amend.py` (no-op, name change + cascade, feature_ids change, dependency change, unrelated group untouched, removed group, added group, owned_paths change); runner integration test in `tests/test_runner.py::test_spec_edit_invalidation_redispatches_affected_groups` exercises the full re-dispatch with stubbed agents asserting two build calls + lifecycle revert + cost accumulation; new spec-review-route tests cover the `editing_in_flight` accept path and the still-blocked `approved` path. Design note: `docs/i2p-spec-edit-design.md` (records the resolved ambiguities: tier-1 stays locked; ANY round-tripping edit accepted; component invalidation, multi-edit-per-run, and worktree GC explicitly deferred). Full sweep `tests/ -q --ignore=integration --ignore=browser`: 1640 passed. · verified 2026-05-04
- [✓] A13 — Review gate between compile and build — landed in
`otto/runner.py:run_pipeline` (review-gate poll loop between compile and seed) + `otto/cli_run.py` + `otto/cli.py` (new `--review-gate` / `--auto-approve` / `--gate-timeout` flags on `otto run` and `otto build --i2p`). Two new event kinds added to `otto/spec_state.py`: `spec.review_pending` (runner emits when gate engages) and `spec.review_approved` (resume signal the runner polls for). `otto/web/spec_review_routes.py` `POST /api/specs/<id>/approve` now additionally emits `spec.review_approved` on every call (idempotent at the lifecycle layer; the journal accepts repeats so a transient runner restart can re-read the signal). `--review-gate` is OFF by default to preserve CI/script automation; `--auto-approve` is the explicit-opt-out flag. CLI banner prints `http://localhost:<OTTO_WEB_PORT|8765>/runs/<session>/spec/review` so the operator knows where to approve. Resume path bypasses the gate (operator already approved on the prior session). Gate timeout (default 24h) halts the run with verdict=BLOCKED honestly when no approval lands. Tests: `tests/test_runner.py` (5 new — pause-until-approved, timeout, off-by-default, announce callback, resume bypass), `tests/test_cli_run.py` (5 new — flag wiring, default omitted, mutual-exclusion error, `otto build --i2p --review-gate`, help-text exposure), `tests/test_spec_review_routes.py` (1 new — POST /approve emits spec.review_approved). Full sweep `tests/ -q --ignore=integration --ignore=browser`: 1640 passed. · verified 2026-05-04
- [✓] A14 — Bench parity verdict — `scripts/bench_microfeed_i2p.py` `_verdict()` now honors all plan.md Step 11 parity criteria. Wall excess no longer silently emits `i2p_passed`; it returns `i2p_partial_wall_exceeded` and writes a per-criterion `summary.parity` decomposition into `result.json`. Functional fails (audit/hidden/browser/blocked) outrank wall-excess; quality-low outranks wall-excess. New unit suite `tests/test_bench_microfeed_i2p_parity.py` (9 tests) locks the ladder; bench itself was not re-run (real-cost). · verified 2026-05-04
- [✓] A7 — Pause + Resume + Abort-a-Group verbs — Mission Control now exposes the verbs the plan promised. Pause:
`otto/mission_control/actions.py:execute_pause_run` / `_resume_run` append `run.paused_by_user` / `run.resumed_by_user` events to `<session>/spec-state.jsonl`; `otto/runner.py:_wait_while_paused` polls the journal at every `_phase()` boundary (1s cadence; tunable via `runner.PAUSE_POLL_INTERVAL_S`) and sleeps until cleared. SIGSTOP rejected on portability + IO-safety grounds — freezing async I/O mid-flight risks corrupted tempfiles + held fd locks; the journal poll-flag gives clean phase boundaries. Existing Cancel (SIGTERM) is unchanged. Abort-a-Group: `execute_abort_group(session_dir, group_id, reason)` appends `group.aborted_by_user`; `otto/build.py:_run_slice` checks `is_group_aborted_by_user` at the top of each retry attempt and exits with `status=BLOCKED, failure_narrative="aborted_by_user"`. The merge queue pre-populates `blocked_ids` with `aborted_group_ids(session_dir)` so aborted groups never become merge candidates. Component abort intentionally not exposed (Group is the user-facing dispatch unit per research §3). Routes: `POST /api/run-view/{sid}/actions/{pause,resume}` and `POST /api/run-view/{sid}/groups/{gid}/abort` (mounted from `otto/web/run_view_routes.py`). UI: `RunDrawer` shows Pause+Resume in a run-action-bar while non-terminal; `GroupList` renders per-Group Abort buttons (hidden on landed/blocked/failed_scope groups); `RunViewPage` wires `onAfterAction={reload}` so the next poll picks up state immediately. Three new event kinds added to `otto/spec_state.py` + helper predicates `is_run_paused_by_user` / `is_group_aborted_by_user` / `aborted_group_ids`. Tests: 7 unit tests in `tests/test_mission_control_actions.py` (event emission, idempotence, missing-session guards, pause→resume flip), 2 integration-ish tests in `tests/test_runner.py` (real `run_build` with a 2-group spec where the aborted group is BLOCKED + the live group runs to PASSING; `_wait_while_paused` blocks until a resume event lands). Full sweep `tests/ -q --ignore=integration --ignore=browser`: 1640 passed; `npm run web:typecheck && npm run web:build` green. · verified 2026-05-04
- [✓] A15 — CLAUDE.md stale CLI surface — added
`otto run` to the Quick diagnosis block; added `proof-packet.html`, `proof-packet.json`, `spec/spec.json`, `spec-state.jsonl` rows to the per-session layout table; header line now lists `otto run` alongside `build|certify|improve`. · verified 2026-05-04
- [✓] B1 — `compile_validator` plan-vs-reality drift — `plan.md:252` now documents `validate_spec(spec) -> ValidationResult` as the shipping symbol with rationale for why it is broader than schema-only (dep-cycle/vagueness/dup-id checks must run before Build to avoid corrupting the merge queue). No alias added — the "schema-only" framing was a premature constraint. · verified 2026-05-04
- [✓] B6 — stale `oracles.py` doc-comment refs — `otto/checks.py:4` and `otto/spec_compile.py:32,123` no longer reference the deleted `codex-i2p/otto/oracles.py`. Comments now point at the live `otto/checks.py` browser_journey executor and frame the prototype `oracles.py` as historical context only. 1608 tests green post-fix. · verified 2026-05-04
- [✓] A1 — Slice→Group rename across runtime + serialized form —
`otto/spec_state.py` renamed `SliceState`→`GroupState`, `slice_id`→`group_id`, `RunState.slices`→`groups`, all `slice.*` event kinds → `group.*`. `otto/spec_compile.py` dropped `slices=` / `cross_slice_checks=` kwargs from `Spec.__init__`, removed the back-compat `Spec.slices` / `Spec.cross_slice_checks` properties, and `spec_to_dict` now emits ONLY `groups` / `cross_group_checks` (one-cycle deprecation read fallback for legacy keys preserved with deprecation warnings via new `spec.deprecated.*` warning codes). `otto/build.py`, `otto/audit.py`, `otto/render.py`, `otto/merge_queue.py`, `otto/resume.py`, `otto/cli_run.py`, `otto/runner.py`, `otto/spec_amend.py`: `SliceResult`→`GroupResult`, `SliceStatus`→`GroupStatus`, `SliceVerdict`→`GroupVerdict`, `SlicePacket`→`GroupPacket`, `slice_id` field→`group_id`, `slice_results`→`group_results`, `total_passing_slices`→`total_passing_groups`, `slice_verdicts`→`group_verdicts` (Python field; JSON wire key also updated to `group_verdicts` / `group_id` since tests live in repo), `passing_slice_ids`→`passing_group_ids`, `_merge_slice_branch`→`_merge_group_branch`, `BuildAgentInput.slice`→`group`, `branch_for_slice`→`branch_for_group`, etc. Commit messages and event payloads use `group_id`. Branch naming `i2p/<id>` left as opaque prefix per scope. `otto/web/` left untouched per scope; tests of web routes that read `spec_data["slices"]` still fail (4 web-only failures, see "Open gaps" below). 1604 / 1608 non-integration tests pass. · verified 2026-05-04
- [✓] A2 — Group field renames —
`otto/spec_compile.py:Group` renamed `tasks`→`feature_ids`, `title`→`name`; dropped the `dependencies` property alias and renamed `deps`→`dependencies` directly. Added optional `dispatch_plan: str = ""` field with docstring documenting deferral honestly (plan.md only lists it as a placeholder; shape is intentionally underspecified, persisted only when non-empty). `_coerce_slice_id` renamed `_coerce_group_id`; `_SLICE_ID_RE`→`_GROUP_ID_RE`. Validator messages updated (`group id …`, `feature_ids field empty`, `multi-group spec declares no cross_group_checks`). All consumers updated: `otto/build.py:_component_as_slice`, `otto/merge_queue.py:_component_as_merge_slice`, `otto/spec_compile.py:render_spec_md` and the spec-md round-trip parser. Test fixtures and helpers updated across `tests/test_spec_compile.py`, `tests/test_spec_amend.py`, `tests/test_audit.py`, `tests/test_build*.py`, etc. The compile-spec prompt example in `otto/prompts/compile-spec.md` now emits `groups` / `name` / `feature_ids` / `dependencies` JSON keys. · verified 2026-05-04
- [✓] A1 web-surface propagation (cc-i2p-2 follow-up, 2026-05-04) — `otto/web/i2p_routes.py` migrated to canonical `groups` / `group_id` / `group_count`. Reads `spec_data["groups"]`, calls `replay(group_ids=...)`, emits `group_count` (was `slice_count`), `landed_count` from `landed_group_ids` (was `landed_slice_ids`), `blocked_count` from `blocked_group_ids` (was `blocked_slice_ids`). State serialiser emits `groups: [{group_id, phase, ...}]` (was `slices: [{slice_id, ...}]`). Embedded HTML/JS shell + CSS classes renamed `slice`→`group`. `otto/web/client/src/types.ts` comment updated. Test assertions in `tests/test_web_i2p_routes.py` updated to match (`s["group_count"]`, `ss["group_id"]`). `npm run web:typecheck && npm run web:build` clean. `tests/test_web_i2p_routes.py` 17/17 green; full sweep 1608/1608. · verified 2026-05-04
- Legacy-name back-compat surface preserved on read in the parser (one cycle): `"slices"`→`"groups"`, `"cross_slice_checks"`→`"cross_group_checks"`, inner `tasks`/`title`/`deps`→`feature_ids`/`name`/`dependencies`. Each emits a deprecation warning via `spec.deprecated.*` codes. Drop these reads in a future cycle.
- `BuildBudget.per_slice_*` legacy kwargs and property aliases in `otto/build.py` were preserved as-is (caller back-compat is broader than A1's stated scope); the canonical field names are `per_group_*`.
- [✓] i2p --resume implementation — landed per
docs/i2p-resume-design.md. New otto/resume.py derives a
ResumePlan from the paused session's spec-state.jsonl (replay) +
summary.json (cost-carry). The runner accepts resume_plan= and
threads it into run_build (skip-already-landed Components via
synthesised PASSING SliceResult/ComponentResult entries) and into the
audit phase (short-circuit when journal recorded audit.finished with
non-empty verdict). Cost-carry is enforced by charging
prior_cost_usd to the shared BuildBudget on entry; --reset-budget
zeroes it. Spec-edit policy v1: refuse on hash mismatch with --force
escape (logs spec.regenerated event). otto build|certify|run --resume flags wired in otto/cli.py + otto/cli_run.py. Mid-merge
git recovery probes the project worktree at resume entry. Tests:
tests/test_resume.py (11 unit tests covering classification,
spec-hash check, mid-merge detection), tests/test_cli_run.py
(4 new flag-propagation tests), tests/test_runner.py (2 new tests
for skip-components plumbing + audit short-circuit). Full sweep
(1585 tests) passes.
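A minimal sketch of the resume derivation described in the entry above, for orientation only. The function name `derive_resume_plan`, the `group.landed` event kind, the event field names, and the `cost_usd` key in `summary.json` are assumptions made for illustration; only `audit.finished`, `spec.regenerated`, the force escape, and the budget reset come from the notes.

```python
from dataclasses import dataclass, field


class SpecHashMismatch(Exception):
    """Raised when the spec changed since the paused session and force is off."""


@dataclass
class ResumePlan:
    skip_ids: set = field(default_factory=set)       # already-landed ids to skip
    audit_done: bool = False                         # short-circuit the audit phase
    prior_cost_usd: float = 0.0                      # charged to the budget on entry
    extra_events: list = field(default_factory=list) # events to append on resume


def derive_resume_plan(events, summary, recorded_hash, current_hash,
                       *, force=False, reset_budget=False):
    plan = ResumePlan()
    # Spec-edit policy v1: refuse on hash mismatch unless forced.
    if recorded_hash != current_hash:
        if not force:
            raise SpecHashMismatch("spec changed since pause; force to continue")
        plan.extra_events.append({"kind": "spec.regenerated"})
    for event in events:
        kind = event.get("kind")
        if kind == "group.landed":  # hypothetical event kind
            plan.skip_ids.add(event["id"])
        elif kind == "audit.finished" and event.get("verdict"):
            plan.audit_done = True  # only a non-empty verdict counts
    # Cost-carry: prior spend is kept unless the budget is explicitly reset.
    plan.prior_cost_usd = 0.0 if reset_budget else float(summary.get("cost_usd", 0.0))
    return plan
```

Raising on a hash mismatch, instead of silently rebuilding, keeps the policy visible at the call site, so the CLI can map the exception to a message that mentions `--force`.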
E2E SWEEPS PAUSED (user directive 2026-05-04): prioritize implementation over E2E. Resume only at major implementation milestones (A1a complete, A2 complete, A3 complete, A4 complete, before Phase B cutover, before Phase C deletion). Until then the loop builds new design code; E2E validates after each milestone instead of every 5 ticks.
STRATEGY SHIFT (user directive 2026-05-04): jump ahead from incremental A0 vocabulary cleanup to A1a (Feature dataclass + new design data model). Add Feature/Component/Guardrail dataclasses ALONGSIDE the existing Slice/Group structures. Vocab cleanup of legacy names continues opportunistically when files are touched, but is no longer the bottleneck.
Current phase: A0.3 (Slice→Group rename; tick 7 resumes — Spec.slices field rename + call-site propagation)
New sequence (per plan reviewer's recommendation): A0 → A1a → A1.5-types → A1b → A1c → A1.5-seed → A2 → A3 → A4 → A5 → A6 → B → C
Phase reasoning:
- A0 is rename + JSON shims (5-7 days, not 1-2 — most code already exists)
- A1 split into A1a (dataclasses) / A1b (build+checks) / A1c (merge)
- A1.5-types between A1a and A1b: lock RunView TS interface upfront
- A1.5-seed between A1c and A2: pre-Audit fixture seeding for multi-user products
Before any phase begins, these must hold:
- `research.md` reviewed by user, signed off
- `plan.md` reviewed by user, signed off
- All review-walkthrough-* agent reports read and any blocking changes folded into research.md / plan.md
- `docs/otto-wireframes.md` reviewed by user, signed off
- `progress.md` (this file) initialized with phase checklists
- `drift-log.md` initialized
- `review.md` initialized
Goal: zero hits for retired words; otto/defaults.py owns all
numeric knobs.
- [✓] A0.1 — Inventory current vocabulary debt · verified 2026-05-04T17:55Z
  - [✓] Vocab grep run; baselines:
    - `otto/`: 1672 hits across 67 files
    - `tests/`: 700 hits
    - `docs/` (excluding signed-off): 265 hits
    - per-term in otto/: slice=520, capability=23, capability_verdict=0, certifier=214, story=753, stories_passed=95, stories_tested=110
  - [✓] Magic-number grep run; baseline: 23 hits outside `otto/defaults.py` + `otto/prompts/`
  - [✓] Baseline recorded above
- [✓] A0.2 — Create `otto/defaults.py` and wire it · verified 2026-05-04T18:08Z
  - [✓] Define schema: `retries`, `budgets`, `audit`, `agents` · 2026-05-04T18:01Z
  - [✓] Read from `otto.yaml` if present; fall back to baked-in defaults · 2026-05-04T18:01Z
  - [✓] Wire `BuildBudget` defaults through `defaults.py` via `field(default_factory=...)` · 37 build tests passing 2026-05-04T18:08Z
  - [✓] Test: `tests/test_defaults.py` covers override precedence (CLI > otto.yaml > baked-in) · 11 tests passing 2026-05-04T18:01Z
  - 23 magic-number hits remaining are transport-layer subprocess/network timeouts in legacy modules (Phase C deletion targets); not configurable budgets. Documented in drift-log.md as info severity. Future scans treat 23 as legacy floor; only halt on increase.
- [~] A0.3 — Rename `slice`/`Slice` → `group`/`Group` (in progress)
  - [~] Python: dataclass + variables + JSON keys + prompt files
    - [✓] `class Slice` → `class Group` in spec_compile.py with backward-compat alias `Slice = Group` · 33+37+11+11=92 tests passing 2026-05-04T18:13Z; slice-hits 520→517
    - [✓] Spec.slices field → Spec.groups with backward-compat property + custom init accepting both `slices=` and `groups=` kwargs + JSON serialization emits both keys + parse_spec reads either · 81 tests passing 2026-05-04T20:23Z; slice-hits flat at 517 (back-compat aliases preserve refs; will drop after external callers migrate in tick 8+)
    - [✓] Spec.cross_slice_checks → Spec.cross_group_checks with back-compat property + init kwarg alias + JSON dual-write/dual-read · audit.py call sites migrated · 92 tests passing 2026-05-04T20:50Z
    - Propagate Group through call sites (otto/build.py, otto/audit.py, otto/render.py, otto/cli_run.py, otto/merge_queue.py)
    - [✓] Update prompt files (otto/prompts/*.md) — only otto/prompts/compile-spec.md contained legacy `slice`/`slices` prose; renamed to `group`/`groups` while preserving the wire-format `<spec_json>` JSON example block (which still emits `"slices":` under dual-write back-compat). New tests/test_prompt_group_vocabulary.py (18 tests) lints every prompt file: prose outside fenced code blocks must use `group`; the legacy `slices` JSON key in compile-spec.md's example is asserted to still be present until the wire cutover. 166 prompt+spec tests passing 2026-05-04T22:30Z.
    - [✓] Remove `Slice = Group` alias once vocab scan = 0 · all otto/ + tests/ identifier-position uses migrated to `Group`; remaining `\bSlice\b` hits are docstrings/comments documenting historical vocabulary or string-literal slug-validation inputs (e.g. `"Auth Slice"`); 1655 tests passing 2026-05-04 (alias deleted at otto/spec_compile.py:256)
  - Schema migration: old `proof-packet.json` files (v2 with `slices[]`) still readable; new files emit `groups[]` (v3)
  - Frontend types and field names
- A0.4 — Rename `capability`/`capability_verdict` → `feature`/`feature_verdict`
  - otto/audit.py: `CapabilityVerdict` dataclass → `FeatureAudit`. Name disambiguates from TS-layer `FeatureVerdict` Literal (verdict outcome string) — Python dataclass carries name+status+detail+evidence_refs. Legacy `CapabilityVerdict = FeatureAudit` alias dropped post-cutover.
  - `AuditResult` and `AuditAgentOutput` carry `feature_audits` only. The mirrored `capability_verdicts` field + `__post_init__` mirror logic was removed after the back-compat window.
  - Propagated to otto/render.py and scripts/bench_todo_cli_i2p.py. `ProofPacket` carries `feature_audits` only; `render_json` emits only the canonical key. Bench script reads `feature_audits` directly (legacy `.get("capability_verdicts")` fallback removed).
  - Audit prompts updated: `_audit_prompt` requests `feature_audits`; `_parse_audit_output` reads `feature_audits` only. The legacy wire-key parser branch and back-compat note in the prompt were removed. Coverage in `tests/test_a0_4_propagation.py` (rewritten to assert canonical-only) and `tests/test_audit*.py`.
  - Back-compat removal (post-cutover, ~2 weeks stable): all production references to `capability_verdicts`/`CapabilityVerdict` removed from `otto/` and `scripts/`. The obsolete `tests/test_audit_vocab_renames.py` (which pinned the alias contract) was deleted. Only historical comments mention the old name. 1568/1568 tests green.
- [✗] A0.5 — Rename `certifier` → `audit` — SUPERSEDED by Phase C deletion (tick 60)
  - The new-stack `otto/audit.py` + `otto/audit_loop.py` modules already exist and are populated. Renaming `otto/certifier/` is wasted work because that directory is slated for deletion in Phase C (see `docs/phase-c-deletion-audit.md`). Leave certifier in place until Phase C deletes it; new code uses `otto/audit*.py` directly.
- [✗] A0.6 — Retire `story`/`stories_passed`/`stories_tested` — SUPERSEDED by Phase C deletion (tick 60)
  - Story/stories vocabulary lives in `otto/certifier/__init__.py` (legacy) and `tests/test_certifier_stories.py`. Both delete in Phase C. The new stack uses `Feature` and `feature_audits` everywhere.
- [✗] A0.7 — Retire `task` from user-facing surfaces — DEFERRED (tick 60)
  - Most legacy `task` references are in v3 pipeline modules (slated for Phase C). Surfaces in the new stack already use `step`/`todo_item` per progress.md guidance. Mark as deferred until a user surface concretely needs the rename.
- Grep: zero hits for retired vocabulary across `otto/`, `tests/`, `docs/` (excluding `docs/otto-redesign-conversation.md` and `docs/legacy/*` if any)
- Grep: zero magic-number occurrences outside `otto/defaults.py` and `tests/`
- Full test suite green
- `npm run web:typecheck && npm run web:build` green
- Bench A (greenfield e2e tier) runs to completion with no behavioral regression vs pre-A0 baseline
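The override precedence recorded under A0.2 (CLI > otto.yaml > baked-in) can be sketched as a layered merge. This is an illustration, not the contents of `otto/defaults.py`: the helper names `deep_merge` and `resolve_defaults`, the leaf keys, and every numeric value are placeholders; only the four top-level section names come from the schema item.

```python
import copy
from collections.abc import Mapping

# Placeholder values; only the four section names come from the A0.2 schema.
BAKED_IN = {
    "retries": {"check_loop": {"max_attempts": 3}},
    "budgets": {"per_group_cost_usd": 1.0},
    "audit": {},
    "agents": {},
}


def deep_merge(base, override):
    """Return a new dict where override wins key by key, recursing into mappings."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_defaults(yaml_cfg=None, cli_overrides=None):
    """Apply layers lowest to highest: baked-in, then otto.yaml, then CLI."""
    resolved = copy.deepcopy(BAKED_IN)  # never mutate the baked-in table
    for layer in (yaml_cfg or {}, cli_overrides or {}):
        resolved = deep_merge(resolved, layer)
    return resolved
```

Merging per key, instead of replacing whole sections, lets an `otto.yaml` that sets one budget leave the remaining baked-in knobs intact.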
Goal: Refactor existing otto/spec_compile.py (Slice → Group,
add Feature/Guardrail/Component dataclasses + shared_paths +
audit_fixtures).
- [✓] A1a.1 — Rename `Slice` → `Group` (done at A0.3 with backward-compat alias)
- [✓] A1a.2 — Add `Feature` dataclass (id, name, description, acceptance_detail, evidence_kinds[], group_id, verdict?, evidence_completeness, coverage_confidence, multi_actor_required, audit_pre_merge) · 2026-05-04T tick 12
- [✓] A1a.3 — Add `Guardrail` dataclass (id, text, applies_to) · tick 12
- [✓] A1a.4 — Add `Component` dataclass (research §2.6) — id, name, description, owned_paths, dependencies, checks, consumed_by · tick 12
- [✓] A1a.5 — Add `Spec.shared_paths: list[str]` (research §2.6) · tick 12
- [✓] A1a.7 — Add `Spec.audit_fixtures[]` (research audit-honesty) · `AuditFixture` dataclass with kind + payload · tick 12
- [✓] A1a.8 — `Finding` dataclass + `FINDING_SEVERITIES = (critical, important, polish)` for severity ladder (research §4) · tick 12
- [✓] A1a.9 — `Spec.__init__` accepts new kwargs (features, components, guardrails, shared_paths, audit_fixtures) · tick 12
- [✓] A1a.10 — Unit tests: 19 new tests in tests/test_a1a_dataclasses.py covering construction, defaults, scoping, id stability · tick 12
- [✓] A1a.11 — JSON round-trip: spec_to_dict emits features / components / guardrails / shared_paths / audit_fixtures keys; parse_spec permissively reads them (defaults to [] if absent → legacy spec compat); 6 new round-trip tests pass · tick 13 (117/117 total)
- [✓] A1a.6 — Per-`project_kind` structure schemas (research §2.7): webapp/api/library/cli JSON schemas exist on disk; `DEFAULT_EVIDENCE_KINDS_PER_KIND` constant + `default_evidence_kinds_for()` helper added to spec_compile.py; 5 new tests pass · tick 14 (122/122 total)
- [✓] A1a.12 — `parse_spec_md(md_text, base=None) -> tuple[Spec, list[str]]` with id stability across rename + mechanical-field preservation via base · tick 33 (123/123 total)
- [✓] A1a.13 — `render_spec_md(spec) -> str` user-facing prose with `<!-- group: id -->` / `<!-- feature: id | evidence: ... -->` metadata comments · tick 32
- `pytest tests/test_spec.py` green
- Round-trip property test passes
- JSON back-compat: existing session dirs with `"slices"` keys still load correctly
- Compile produces specs with Feature ids stable from `name` slug
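To make "Feature ids stable from `name` slug" and "id stability across rename" concrete, here is a small sketch. The slug rule, the helper names, and the trimmed three-field `Feature` are assumptions for illustration; the shipped dataclass carries the longer field list from A1a.2, and the real slug rule may differ.

```python
import re
from dataclasses import dataclass, replace


def slug_id(name: str) -> str:
    """Lowercase the name and collapse every non-alphanumeric run to one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive an id from name {name!r}")
    return slug


@dataclass(frozen=True)
class Feature:  # trimmed, hypothetical subset of the A1a.2 fields
    id: str
    name: str
    group_id: str


def new_feature(name: str, group_id: str) -> Feature:
    # The id is derived once, at compile time, from the initial name.
    return Feature(id=slug_id(name), name=name, group_id=group_id)


def rename_feature(feature: Feature, new_name: str) -> Feature:
    # A later rename changes only the display name; the id stays the join key.
    return replace(feature, name=new_name)
```

Deriving the id once and never again is what lets evidence and audit records keep joining on `id` after a user edits the display name.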
Goal: Lock the API contract before backend implementation continues.
- [✓] Define `otto/web/client/src/types/run.ts` with full `RunView`, `FeatureView`, `GroupView`, `ComponentView`, `GuardrailView`, `StageView`, `EvidenceRef`, `RunMeta`, `FindingView` interfaces · tick 15 2026-05-04T21:14Z
- [✓] Include `null` semantics for in-flight fields · verdict, finished_at, duration_s, cost_usd at stage level all nullable for active runs
- [✓] Reflect review-walkthrough findings: `evidence_completeness` (full/proxy_only/partial), `coverage_confidence` (high/medium/low), `multi_actor_required`, severity-tagged `FindingView`, Component/Guardrail first-class types
- Stub `otto/mission_control/run_view.py:build_run_view` returning the correct shape (deferred to A4 / when backend wires up)
- [✓] `npm run web:typecheck` green · tick 15
- Stub returns valid `RunView` for fixture session dir (deferred — A4)
Goal: Refactor otto/build.py + otto/checks.py to dispatch by
Group/Component, thread feature_id through Evidence, add new Check
kinds.
- [✓] A1b.1 — `build_groups` is the canonical dispatch entry. The historical name was `run_build` (no `build_slices` ever shipped); `build_groups` and `build_slices` are now module-level aliases pointing at `run_build`, both re-exported via `__all__`. Verified by `tests/test_build_renames.py::test_build_groups_is_canonical_dispatch_entry`.
- [✓] A1b.2 — `BuildBudget` canonical fields are now `per_group_retries_hard_cap` / `per_group_wall_s` / `per_group_cost_usd`; default factories routed through `otto/defaults.py` (`budgets.per_group_cost_usd`, `retries.check_loop.*`). Legacy `per_slice_*` names remain accepted as constructor kwargs and as attribute reads/writes via `@property` aliases that proxy to the canonical fields. Passing both legacy and canonical for the same field raises `TypeError`. Internal callers in `otto/build.py` updated to canonical names. Verified by `tests/test_build_renames.py` (10 tests).
- [✓] A1b.3 — Component dispatch — Components run alongside Groups in same parallel build phase · `run_build` now iterates `ready_slices` + `ready_components` in one loop, builds each Component on its own branch, populates `BuildResult.component_results`, and propagates dep-blocked Components as BLOCKED with attempts=0. Build agent reuses existing `BuildAgentCallable` via a synthetic Slice adapter (`_component_as_slice`).
- [✓] A1b.4 — Shared-paths handling — Groups may freely edit shared_paths; merge queue serializes lands · `detect_scope_violations` now treats `Spec.shared_paths` as globally writeable (in addition to legacy `shared_scaffold`); Component owned_paths participate in the peer-vs-dep partition. shared_paths overrides peer ownership.
- [✓] A1b.5 — `otto/checks.py:Evidence` gets `feature_id` field. Verified in the current build/audit/render path: check evidence carries `feature_id`, audit/render prefer Feature ids over display names, and scoped Layer 2 re-audit uses Feature ids as the stable join key.
- A1b.6 — Add `CLIProbe`, `ImportCheck`, `TypeCheck` kinds (research §2.7) · dataclasses live in `spec_compile.py`; executors `_run_cli_probe`, `_run_import_check`, `_run_type_check` wired in `otto/checks.py` with happy + failure tests in `tests/test_checks.py` (closes codex-followups gap A3 — full sweep `pytest tests/ -q --ignore=tests/integration --ignore=tests/browser` 1599 passed).
- A1b.7 — Walkthrough actions emit `action_kind` discriminator + per-kind fields (research §2.7)
- `pytest tests/test_build.py tests/test_checks.py` green
- Greenfield Run produces multi-Group artifacts
- Component dispatch verified
- All four `project_kind` variants compile + dispatch successfully
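The A1b.2 alias contract (legacy `per_slice_*` accepted as kwargs and attributes, `TypeError` when both spellings are passed) can be sketched as below. This is a plain-class approximation with placeholder default values, not the dataclass in `otto/build.py`, which routes its defaults through `otto/defaults.py`.

```python
class BuildBudget:
    # Legacy name -> canonical name, per the A1b.2 notes.
    _ALIASES = {
        "per_slice_retries_hard_cap": "per_group_retries_hard_cap",
        "per_slice_wall_s": "per_group_wall_s",
        "per_slice_cost_usd": "per_group_cost_usd",
    }
    # Placeholder defaults, for illustration only.
    _DEFAULTS = {
        "per_group_retries_hard_cap": 3,
        "per_group_wall_s": 600.0,
        "per_group_cost_usd": 1.0,
    }

    def __init__(self, **kwargs):
        values = dict(self._DEFAULTS)
        for key, value in kwargs.items():
            canonical = self._ALIASES.get(key, key)
            if canonical not in self._DEFAULTS:
                raise TypeError(f"unexpected keyword argument {key!r}")
            if canonical != key and canonical in kwargs:
                # Both spellings supplied for the same field: refuse to guess.
                raise TypeError(f"got both {key!r} and {canonical!r}")
            values[canonical] = value
        for name, value in values.items():
            setattr(self, name, value)


def _alias(canonical):
    """Build a property that reads and writes through to the canonical field."""
    return property(
        lambda self: getattr(self, canonical),
        lambda self, value: setattr(self, canonical, value),
    )


for _legacy, _canonical in BuildBudget._ALIASES.items():
    setattr(BuildBudget, _legacy, _alias(_canonical))
```

Because the aliases hold no state of their own, a write through either spelling is visible through the other, so the two names cannot drift apart.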
Goal: Eligibility-gated FIFO supports Groups + Components + shared_paths.
- A1c.1 — Eligibility logic uses `Group.dependencies` (legacy `Group.deps` aliased; `shared_paths_set` helper retained for the in-progress shared_paths rule)
- A1c.2 — Component dependencies threaded into eligibility ordering (Group<->Component cross-deps resolved via union of landed ids)
- A1c.3 — Per-Component conflict repair (Components have agents like Groups) · `run_merge_queue` now iterates `eligible_candidates` (Groups) + `eligible_components` (Components) in one FIFO with a shared `landed_ids` / `blocked_ids` set. Components flow through the same `_process_candidate` path via a `_component_as_merge_slice` adapter that surfaces `Component.owned_paths` / `dependencies` / `checks` verbatim, so the existing conflict-repair flow re-invokes the build agent on the Component branch and is pinned to the Component's owned_paths through `BuildAgentInput.slice`. Tests: `tests/test_merge_component_repair.py` (5 tests).
- A1c.4 — Stories module renamed: `otto/merge/stories.py` → `otto/merge/features.py` with backward-compat reader · The legacy import path remains as a shim that re-exports every symbol from `otto.merge.features` and emits a `DeprecationWarning` on import. Internal callers (`otto/merge/orchestrator.py`) now import from the new path.
- `pytest tests/test_merge.py` green
- Integration test: two Groups + one Component land in dep order
- Conflict repair on shared_paths works
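A rough sketch of the eligibility-gated FIFO from A1c.1 to A1c.3: Groups and Components share one queue and one pair of landed/blocked id sets, and a candidate lands only once all its dependencies have landed. `Candidate`, `run_fifo`, and the `try_land` callback are illustrative names; conflict repair and post-land verification are folded into `try_land` here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    id: str
    dependencies: tuple = ()


def run_fifo(candidates, try_land, blocked_ids=()):
    """Land candidates in FIFO order, gated on their dependencies having landed."""
    landed = []
    blocked = set(blocked_ids)  # e.g. pre-populated with aborted group ids
    queue = deque(c for c in candidates if c.id not in blocked)
    progressed = True
    while queue and progressed:
        progressed = False
        for _ in range(len(queue)):
            cand = queue.popleft()
            deps = set(cand.dependencies)
            if deps & blocked:
                blocked.add(cand.id)       # a blocked dependency blocks the dependent
            elif deps <= set(landed):
                if try_land(cand):
                    landed.append(cand.id)
                else:
                    blocked.add(cand.id)
            else:
                queue.append(cand)         # not eligible yet; keep its FIFO position
                continue
            progressed = True
    blocked.update(c.id for c in queue)    # dependencies that never landed
    return landed, blocked
```

Stopping when a full pass makes no progress guarantees termination even if a dependency id is unknown or cyclic; whatever remains in the queue is reported as blocked.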
Goal: Apply pre-existing fixtures before Audit, for multi-user products.
- New file: `otto/seed.py` — `seed_fixtures(spec, session_dir)`
- Reads `Spec.audit_fixtures[]`, applies to live product
- Idempotent on rerun
- Failed seed = blocked Run, not silent proceed-with-empty-state
- Per-fixture-kind handlers: user, channel, follow, data
- `pytest tests/test_seed.py` green
- Integration: Run with `audit_fixtures` declares pre-existing test users before audit walks
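The seeding contract above (per-kind handlers, idempotent on rerun, failure halts the Run) can be sketched against an in-memory store. This deliberately differs from the planned `seed_fixtures(spec, session_dir)` signature: the `store` dict, the payload field names, and `SeedError` are assumptions, and only two of the four handler kinds are shown.

```python
class SeedError(RuntimeError):
    """A failed seed; the caller is expected to mark the Run as blocked."""


def _seed_user(store, payload):
    # Upsert keyed by name, so a rerun overwrites instead of duplicating.
    store.setdefault("users", {})[payload["name"]] = dict(payload)


def _seed_follow(store, payload):
    # A set of (follower, followee) pairs is naturally idempotent.
    store.setdefault("follows", set()).add((payload["follower"], payload["followee"]))


HANDLERS = {"user": _seed_user, "follow": _seed_follow}


def seed_fixtures(fixtures, store):
    """Apply each fixture through its kind handler; raise instead of skipping."""
    for fixture in fixtures:
        handler = HANDLERS.get(fixture["kind"])
        if handler is None:
            raise SeedError(f"no handler for fixture kind {fixture['kind']!r}")
        try:
            handler(store, fixture["payload"])
        except KeyError as exc:
            raise SeedError(f"fixture {fixture['kind']!r} missing field {exc}") from exc
    return store
```

Raising on an unknown kind or a malformed payload, instead of continuing, matches the rule that a failed seed must block the Run so the audit never walks a half-seeded product.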
This section is preserved only for cross-reference with older plan versions. Do not work from it; use the split sub-phases above.
- A1.1 — `otto/spec.py`
  - `Spec`, `Feature`, `Group`, `Guardrail` dataclasses
  - `compile_spec(intent, project_kind, base=None) -> Spec`
  - `validate_spec(spec) -> ValidationResult`
  - `parse_spec_md(md_text, base=None) -> Spec | ParseError`
  - `render_spec_md(spec) -> str`
  - `apply_user_edits(spec, edits) -> Spec`
  - Round-trip property test: `parse_spec_md(render_spec_md(s)) == s`
- A1.2 — `otto/checks.py`
  - `Check` base + kinds
  - `run_check(check, project_dir, *, feature_id) -> Evidence`
  - Evidence carries `feature_id`
- A1.3 — `otto/state.py`
  - Append-only `state.jsonl` event log
  - `replay(session_dir) -> RunState`
  - All event kinds defined
- A1.4 — `otto/build.py`
  - `build_groups(spec, session_dir, *, defaults) -> BuildResult`
  - Per-Group worktree + branch + long-lived agent
  - Check loop bounded by `defaults.retries.check_loop`
  - `owned_paths` write-scope enforced at commit time
- A1.5 — `otto/merge.py`
  - Eligibility-gated FIFO merge queue per `(project, target_branch)`
  - Conflict repair in Group's worktree
  - Post-land verification
- [~] A1.6 — `otto/runner.py`
  - [✓] Top-level Run orchestrator: `run_pipeline(intent, project_dir, session_dir, *, project_kind, brownfield, base_url, config, build_agent, audit_agent, fix_agent, ...) -> RunResult` drives compile → seed → build → merge → audit → repair (Layer 2) → render. `RunResult` dataclass exposes per-phase results + `wall_s`/`cost_usd`/`verdict`/`halted_reason`. `cli_run.orchestrate_run` refactored to delegate the chain (compile stays inside the project lock, then `run_pipeline(spec=...)` drives the rest). 9 new tests in `tests/test_runner.py`; all 25 existing `test_cli_run.py` tests still green.
  - [✓] Seed stage (`otto/seed.py`) + Layer 2 retry (`otto/audit_loop.py:repair_failing_features`) wired into the live pipeline for the first time. Seed failure halts pre-audit with `verdict=BLOCKED` (auditing a half-seeded product produces meaningless verdicts). Layer 2 fires only when the audit verdict is non-PASS AND a fix_agent is wired.
  - Budget enforcement (shared BuildBudget threaded across phases — already done; honest `cost_usd` aggregation in `RunResult`). TODO: dedicated wall-clock budget cap per phase.
  - Resume from `state.jsonl`. Design written 2026-05-04: `docs/i2p-resume-design.md` (reuse `paused` pointer + `spec_state.replay()`; add `otto/resume.py:plan_resume`; refuse spec-hash mismatch unless `--force`).
  - [✓] Layer 2 wired: `_make_layer2_fix_agent` now constructs a real `BuildAgentInput` with `feature_id=failing.feature_id` (new field) so `_build_agent_prompt` emits a "FIX ONLY THIS FEATURE" preamble naming the Feature and threading the audit detail. The bridge invokes the build agent, translates `BuildAgentOutput` → `RepairAttempt`, and turns agent crashes into honest `succeeded=False` repairs (no longer a no-op stub). 4 new tests in `tests/test_runner_layer2_fix.py`.
  - [✓] `orchestrate_certify`/`orchestrate_improve` now delegate the audit → render chain to `runner.run_pipeline(brownfield=True, spec=spec)`. CLI plumbing (lock acquisition, brownfield compile with its own console heading, intent resolution, exit codes) stays in `cli_run.py` via two small helpers (`_brownfield_compile_locked`, `_drive_brownfield_pipeline`), but the audit-budget construction, fix_agent wiring, cost/wall-time aggregation, and render call all flow through `run_pipeline`. Both flows reuse `_print_run_result` for the post-run summary. Phase headings (`Audit phase`, `Render phase`, plus the certify/improve-specific overrides) are preserved via a small `_make_phase_callback` shim around `_PHASE_HEADINGS`. ~140 lines of duplicated inline-chain code removed; 34/34 tests pass in `tests/test_cli_run.py` + `tests/test_runner.py`.
- A1.7 — `otto/cli.py` `otto run` integration
  - Routes through new stack
  - CLI flags for budget/retry overrides
- All A1.* unit tests green
- [✓] Integration: `tests/integration/test_intent_to_proof.py` written (gap A4) — drives real `otto build --provider codex` against a tmp project; asserts spec.json shape (groups/project_kind/intent), no blocked groups, proof-packet.{html,json} on disk, audit verdict ∈ {passed, partial} (strict==passed for the happy path), and at least one screenshot artifact under `audit/` now that A8 is closed. Gated behind `OTTO_ALLOW_REAL_COST=1`, 15min wall budget. New `i2p-e2e` tier in `scripts/test_tiers.py` (gap A5) runs only this file. Collection verified: `uv run pytest tests/integration/test_intent_to_proof.py --collect-only -q` → 1 test collected; default suite still skips it correctly.
- Resume test: kill mid-Build, resume, complete
- Greenfield Bench A passes for ≥1 fixture intent
Goal: every walkthrough action tagged with feature_ids[];
feature-verdicts.json correct.
- A2.1 — `otto/audit.py`
  - `audit_run(spec, session_dir, *, scope=ALL_FEATURES) -> AuditResult`
  - [✓] Walkthrough Feature-tag coverage validator wired into `run_audit` via `_validate_walkthrough_jsonl(walk_log_dir, spec)`. Reads `walkthrough.jsonl`, parses each line via `parse_walkthrough_entry`, runs `validate_walkthrough_coverage`, attaches result to `AuditResult.walkthrough_coverage`. Below-threshold coverage logs a WARNING. 8 new tests · tick 58.
  - [✓] Coverage cap on verdict (deferred from tick 58): when `walkthrough_coverage["meets_threshold"]` is False, force the post-judge verdict to at least PARTIAL via `_strictest` (BLOCKED stays BLOCKED). Cap reason narrated under `[walkthrough coverage cap]` and recorded in new `AuditResult.verdict_cap_reasons: list[str]` for render surfacing. Honesty contract: audit cannot certify Features it did not observe. 5 new tests in `tests/test_audit_coverage_cap.py` cover the five cases (full/below/vacuous/no-jsonl/blocked-not-downgraded).
  - [✓] Parsed walkthrough entries plumbed through to `AuditResult.walkthrough_entries` (single read pass — `_validate_walkthrough_jsonl` now returns `(entries, coverage)`). `compose_proof_packet` reads the field and feeds it into `build_feature_proof_blocks`, so per-Feature proof blocks carry their walkthrough trace. 7 new tests in `tests/test_audit_walkthrough_entries.py` cover the helper signature, the field default, and the end-to-end render flow. Closes the A3.1 gap surfaced by the A3 sub-agent (tick 60).
  - [✓] Untagged actions outside "exploration" allowlist rejected by parser. `_validate_walkthrough_jsonl` now keeps permissive coverage stats but only strict entries enter proof evidence; strict parse errors cap `walkthrough_coverage.meets_threshold` to false.
- [~] A2.2 — `otto/audit_loop.py`
  - [✓] `repair_failing_features(...)` — Layer 2 loop (orchestration + caps wired in earlier tick).
  - [✓] Live build-agent dispatch: `runner._make_layer2_fix_agent` now adapts a `BuildAgentCallable` to the `FixAgentCallable` contract by constructing a `BuildAgentInput` with the new `feature_id` field set to the failing feature's id. The build prompt's Layer 2 preamble names the feature and surfaces the audit detail so the agent fixes only the failing feature, not the whole group. Tests: `tests/test_runner_layer2_fix.py`.
  - [✓] Re-audit narrows to affected Features. `runner.run_pipeline` passes the failing Feature ids into `run_audit(feature_scope_ids=...)`; the audit prompt names the scoped ids, and the returned FeatureAudit list is filtered by Feature id with display-name fallback for old judges.
  - [✓] Caps respected from `defaults` (`retries.audit_loop.max_repair_attempts_per_run`, `max_audit_passes_per_run`).
- [✓] A2.3 — Audit prompt rewrite — `otto/prompts/audit-feature-tagging.md` loaded by `_audit_prompt` and embedded into every audit-agent rendering. `default_audit_agent` inherits the contract via `_audit_prompt`. 5 new tests in `tests/test_audit_prompt_feature_tagging.py` pin the markers (`feature_ids[]`, `action_kind`, ≥90%, per-kind examples, `feature_audits` wire-format).
  - [✓] Explicit Feature-tagging requirement (contract markdown reads "every walkthrough action carries `feature_ids: list[str]`")
  - [✓] Examples of tagged vs exploration actions (per-kind JSONL blocks: webapp, api, library, cli)
  - [✓] Failure-mode handling instructions (verdict honesty rules: zero walkthrough lines → `missing`; proxy-only / low-confidence flags; severity ladder)
- Walkthrough.jsonl tagging coverage ≥ 95% on real audit pass (greenfield fixture)
- Audit loop unit tests green
- Failing Feature → re-audit narrows → other Features unaffected
- Greenfield Bench A: all Features get verdicts; per-Feature evidence refs resolve
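The coverage cap from A2.1 is easy to get backwards, so here is a small sketch of the intended direction: below-threshold coverage can only make a verdict stricter, never looser. The `Verdict` enum, `strictest`, and `apply_coverage_cap` are hypothetical names for illustration; how the real `_strictest` orders verdicts, and how it treats missing coverage, is assumed here.

```python
from enum import IntEnum
from typing import Optional


class Verdict(IntEnum):
    """Hypothetical severity ordering: a higher value is stricter."""
    PASSED = 0
    PARTIAL = 1
    BLOCKED = 2


def strictest(a: Verdict, b: Verdict) -> Verdict:
    """Return whichever verdict is more severe."""
    return max(a, b)


def apply_coverage_cap(
    verdict: Verdict,
    coverage: Optional[dict],
    cap_reasons: list[str],
) -> Verdict:
    """Force the verdict to at least PARTIAL when coverage misses threshold."""
    # Assumption: missing coverage data (None) applies no cap in this sketch.
    if coverage is not None and coverage.get("meets_threshold") is False:
        cap_reasons.append("[walkthrough coverage cap] coverage below threshold")
        return strictest(verdict, Verdict.PARTIAL)
    return verdict
```

Because the cap goes through `strictest`, a BLOCKED verdict passes through unchanged while a PASSED verdict drops to PARTIAL.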
Goal: proof-packet.html + per-Feature pages render from any
session dir; deterministic; re-runnable.
- [✓] A3.1 — `otto/render.py`
  - [✓] Pure deterministic function (`render_html` + `render_json` — no IO, no time)
  - [✓] Reads spec + audit + group logs + state (compose pulls FeatureAudit per Feature, maps name→id, builds blocks via `build_feature_proof_blocks`)
  - [✓] Writes whole-product packet with `<h2>Features</h2>` block (one `<section class="feature-proof">` per Feature, ordered by spec) — emitted before per-Group dispatch details for primacy (research §3 atomic units). 8 new tests in `tests/test_render_per_feature.py` cover the section, JSON `features[]` array, multi-Feature cross-link, per-Feature finding filter, escape, and legacy-pass-through. 19/19 render tests green.
  - [✓] Walkthrough-entry plumbing closed in A2.1: `AuditResult.walkthrough_entries` flows from the single-pass `_validate_walkthrough_jsonl` tuple return into `build_feature_proof_blocks`, so per-Feature blocks now carry verdicts + findings + walkthrough trace.
- [✓] A3.2 — Templates · verified 2026-05-05
  - [✓] `otto/web/templates/proof-packet.html.j2`
  - [✓] `otto/web/templates/feature-proof.html.j2`
  - [✓] `render_html` and `feature_proof_block_to_html` load these repo-owned templates via dependency-free placeholder substitution; `tests/test_render.py::test_proof_templates_exist_and_are_used` verifies both files are used.
- [✓] A3.3 — `otto render <session-id>` CLI
  - [✓] Re-renders without LLM cost by loading `proof-packet.json` and regenerating `proof-packet.html`.
  - [✓] Idempotent by default: JSON is left untouched unless `--rewrite-json` is passed.
- Determinism test: render twice, byte-identical output
- Snapshot test: golden HTML for fixture spec passes
- All per-Feature pages contain ≥1 evidence ref
- [✓] Multi-Feature evidence cross-linked (no duplication) — verified by `test_multi_feature_walkthrough_entry_appears_in_each_feature`
- [✓] Old session dir (pre-A3) renders successfully — verified by `test_render_html_no_features_section_for_legacy_packet`
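The determinism requirement above (render twice, byte-identical output) follows from keeping the renderer free of IO and clock reads. The sketch below shows the shape of such a check under simplified assumptions: `render_html_sketch`, `render_json_sketch`, and the packet layout are hypothetical, and the real `render_html` works from repo-owned templates.

```python
import hashlib
import html
import json


def render_html_sketch(packet: dict) -> str:
    """Render a tiny proof packet; output depends only on `packet`."""
    sections = []
    for feature in packet["features"]:  # input order (spec order) is preserved
        name = html.escape(feature["name"])
        verdict = html.escape(feature["verdict"])
        sections.append(
            f'<section class="feature-proof"><h3>{name}</h3><p>{verdict}</p></section>'
        )
    return "<h2>Features</h2>\n" + "\n".join(sections) + "\n"


def render_json_sketch(packet: dict) -> str:
    """Serialize with sorted keys so dict insertion order cannot leak in."""
    return json.dumps(packet, sort_keys=True, indent=2) + "\n"


def digest(text: str) -> str:
    """Hash the UTF-8 bytes, which is what a byte-identical test compares."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

A determinism test then reduces to comparing `digest(render_html_sketch(p))` across two calls, and a snapshot test compares the same digest against a stored golden value.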
Goal: otto web shows new RunsView + RunDrawer; legacy panel
untouched.
- A4.1 — `otto/mission_control/run_view.py`
  - `build_run_view(session_dir, *, live_state=None) -> RunView`
  - Pure function, reads Proof + state.jsonl
- A4.2 — Frontend `otto/web/client/src/components/run/`
  - `RunsView.tsx`, `RunDrawer.tsx`
  - `VerdictHeader`, `FeatureList`, `GroupList`, `StageTimeline`, `Guardrails`, `RunMetadata`
  - Primitives: `MetricChip`, `EvidenceLink`
  - [✓] Live polling — `useRunView` refetches every 3s while the run is in-flight (queued/compiling/awaiting_spec_review/building/auditing/rendering/landing) and stops at terminal status (passed/partial/blocked/landed/aborted/failed). Mid-poll fetch failures keep the last successful snapshot; cleanup clears the interval on unmount. Data flows through to RunDrawer + FeatureDrilldown via shared hook.
- A4.3 — Types `otto/web/client/src/types/run.ts`
  - `RunView`, `Feature`, `Group`, `Guardrail`, `Stage`, `EvidenceRef`
- A4.4 — Routing in `App.tsx`
  - New runs → new drawer
  - Legacy runs → legacy panel (unchanged)
- Backend tests pass
- Typecheck + build green
- RUA pass: ≥3 fixture sessions through every screen with screenshots (docs/rua/2026-05-04-172101/, 16 screenshots — passed/partial/blocked, RunDrawer + FeatureDrilldown + SpecReviewPage edit/cancel/save/approve)
- Regression: legacy `otto build` runs render unchanged (snapshot)
Goal: spec-review gate works through MC; user can edit, approve, recompile.
- A5.1 — Backend
  - [✓] `otto/web/spec_review_routes.py` — GET markdown, POST edit (parse_spec_md round-trip + version archive), POST approve (lifecycle flip, idempotent); install_spec_review_routes wired into otto/web/app.py · tick 35; 9 route tests green
  - [✓] State events: `spec.review.opened`, `spec.edited`, `spec.approved` wired into spec-review routes (with dedupe on opened + idempotent approve); `spec.regenerated` reserved for compile-agent recompile path (lands when spec_compile gains a regen entry point) · tick 36; 5 new event tests (14/14 spec-review total)
- A5.2 — Frontend
  - [✓] Skeleton: `useSpecMd` hook + `SpecReviewPage` (read-only + edit-toggle + Save/Approve stubs) + `?view=spec-review&spec=<id>` URL routing · tick 34. typecheck + build green.
  - [✓] Full markdown rendering of the spec body via `react-markdown` (^9.1.0, repo-root dep). Replaced the raw `<pre>` with a `.spec-markdown` container; default plugins only, no GFM / rehype add-ons (KISS). `.spec-markdown` styles in styles.css match surrounding page weight.
  - [✓] Spec history widget (wireframe A5.2): on-mount fetch of `/api/specs/<sid>/versions`, sticky right-column aside listing v1..vN with `?view=spec-diff&from=<v>&to=<latest>` links. Empty/single-version state shows "No prior versions yet". Re-fires after Save via `data.updated_at` dependency. typecheck + build green; `tests/test_spec_review_routes.py` 19/19 still green. No React component tests (no vitest+RTL harness — documented gap).
  - `SpecReview.tsx` (Markdown view + Form view) — full styling
  - Add Feature modal with Otto-suggestion micro-compile
  - [✓] Spec diff view (vN → v(N+1)) — wireframe 4d. Backend: `GET /api/specs/<sid>/versions` lists archived `spec-v<N>.json` integers; `GET /api/specs/<sid>/diff?from=N&to=M` returns paired markdown + parsed JSON, 404 on missing version. Frontend: `SpecDiffPage` mounted at `?view=spec-diff&session=<sid>` with from/to dropdowns and an inline LCS line diff (no extra deps; reuses `.diff-pane`/`.diff-add`/`.diff-del`). 5 new backend tests (19/19 spec-review). typecheck + build green. Frontend component tests skipped — no vitest+RTL harness in repo (honest gap, route is keyboard/URL-accessible).
- A5.3 — Round-trip
  - Edit operations preserve Feature id stability
  - Spec versioned (`spec-v1.{md,json}`, `spec-v2.{md,json}`, ...)
- Pause Run at gate, edit via API, approve, Build proceeds with edited Spec — integration test
- Browser RUA: spec review flow against an in-flight pause
- Recompile preserves user-added Features when compatible
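The `spec-v<N>.{md,json}` archive scheme lends itself to a small helper that extracts version integers from file names. The sketch below is an assumption about how such a listing could work, not the code behind `GET /api/specs/<sid>/versions`; both function names are hypothetical, and it operates on plain file names so it stays free of filesystem access.

```python
import re
from typing import Iterable

# Matches only the archived JSON snapshots, e.g. "spec-v3.json".
_VERSION_RE = re.compile(r"^spec-v(\d+)\.json$")


def list_spec_versions(filenames: Iterable[str]) -> list[int]:
    """Return archived version numbers in ascending numeric order."""
    versions = []
    for name in filenames:
        match = _VERSION_RE.match(name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)  # numeric sort, so v10 comes after v2


def next_version_stem(filenames: Iterable[str]) -> str:
    """Return the stem for the next archive, starting at spec-v1."""
    latest = max(list_spec_versions(filenames), default=0)
    return f"spec-v{latest + 1}"
```

Sorting on the parsed integer, and not on the file name, avoids the usual `v10 < v2` ordering surprise once a spec passes nine edits.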
Goal: otto run "add image upload" against existing project
produces deltas only. Plus: compile_spec(intent="", brownfield=True)
on a no-prior-spec existing project emits a baseline AS-IS spec
that otto certify/otto improve (post-B.1/B.2 cutover) can drive
their audit/repair flows from.
Two modes conflated under "brownfield":
- Brownfield-fresh (no prior spec). Compile reads project tree; emits an AS-IS spec describing existing Features. Required by B.1/B.2 cutovers. Lower complexity.
- Brownfield-additive (prior spec + intent). Compile reads tree + loads base spec; emits delta: new/changed Features, existing Groups carry forward. Higher complexity; matches research §9.4.
API decision: single compile_spec(intent, project_dir, run_dir, config, *, project_kind, brownfield: bool = False, base_spec: Spec | None = None). brownfield=False is the existing greenfield path
(unchanged). brownfield=True switches to the brownfield prompt
template; base_spec is consulted for additive mode.
Prompt template: new otto/prompts/compile-spec-brownfield.md.
Content sketch:
- Project preamble: top-level dir tree (depth=2, max=200 entries), README.md first 200 lines, package.json/pyproject.toml manifest, prior `spec.json` summary if `base_spec` provided.
- Instructions: agent uses Read/Glob/Grep to dive deeper. Emits Features that REFLECT the project (not the intent). Intent is a scope hint, not a derivation source.
- Diff rule: if `base_spec` provided, only emit new + changed Features. Carry forward existing Groups verbatim by id.
- "Leave it alone" markers: deferred to tick 43 (out of A6.1 scope).
Project preamble generator (Python helper, not LLM):
- New helper `otto.spec_compile.build_project_preamble(project_dir) -> str`. Reads file tree (using `git ls-files` for tracked files only, falling back to `Path.glob("**")` capped at 200 entries), reads README.md / pyproject.toml / package.json / Cargo.toml / go.mod (whichever exist, capped at 200 lines each), formats as fenced-code preamble for the prompt.
- Determines project_kind heuristically if not specified: presence of pyproject.toml → library/cli/api candidate; package.json → webapp candidate; tests/ folder + cli entry_point → cli; etc. Final decision still surfaced to user via spec-review gate.
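One way the preamble helper described above could be put together is sketched below. It is a simplified assumption, not the shipped `build_project_preamble`: the function names and constants are hypothetical, the common-ignore filter and the project_kind heuristic are left out, and the cap here is applied after sorting.

```python
import subprocess
from pathlib import Path

MAX_FILES = 200           # mirrors the 200-entry cap described above
MAX_LINES_PER_FILE = 200  # mirrors the 200-line cap described above
MANIFESTS = ("README.md", "pyproject.toml", "package.json", "Cargo.toml", "go.mod")


def list_project_files(project_dir: Path) -> list[str]:
    """Prefer git-tracked files; fall back to a glob if git is unusable."""
    try:
        result = subprocess.run(
            ["git", "ls-files"],
            cwd=project_dir,
            capture_output=True,
            text=True,
            check=True,
        )
        files = result.stdout.splitlines()
    except (OSError, subprocess.CalledProcessError):
        files = [
            p.relative_to(project_dir).as_posix()
            for p in project_dir.glob("**/*")
            if p.is_file()
        ]
    return sorted(files)[:MAX_FILES]  # sorted so the output is deterministic


def build_preamble_sketch(project_dir: Path) -> str:
    """Format the file tree and any manifests as fenced blocks."""
    fence = "`" * 3
    tree = "\n".join(list_project_files(project_dir))
    parts = [f"Files:\n{fence}\n{tree}\n{fence}"]
    for name in MANIFESTS:
        path = project_dir / name
        if path.is_file():
            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
            body = "\n".join(lines[:MAX_LINES_PER_FILE])
            parts.append(f"{name}:\n{fence}\n{body}\n{fence}")
    return "\n\n".join(parts)
```

Keeping this step in plain Python means the expensive LLM call always starts from the same bounded, reproducible context.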
Out-of-scope detection (research §9.5b): if intent text contains "browser", "kernel", "compiler", "OS-level" etc., emit a warning before LLM cost. v1: simple keyword match. v2: LLM-based classifier.
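The v1 keyword match can be as small as the sketch below. The keyword tuple contains only the four examples named above, the function name is hypothetical, and the `override-scope` escape token is taken from the A6.5 notes; the landed guard reportedly uses 13 multi-token phrases, which this sketch does not reproduce.

```python
import re

# Only the example keywords from the paragraph above; the real list differs.
KEYWORDS_SKETCH = ("browser", "kernel", "compiler", "os-level")
OVERRIDE_TOKEN = "override-scope"


def detect_out_of_scope_sketch(intent: str) -> list[str]:
    """Return the keywords found in the intent, or [] if overridden."""
    if OVERRIDE_TOKEN in intent:
        return []  # the user explicitly opted out of the guard
    lowered = intent.lower()
    hits = []
    for keyword in KEYWORDS_SKETCH:
        # Lookarounds keep "kernel" from matching inside "kernels".
        pattern = rf"(?<!\w){re.escape(keyword)}(?!\w)"
        if re.search(pattern, lowered):
            hits.append(keyword)
    return hits
```

Single words such as "browser" will also flag ordinary webapp intents, which is presumably why the landed version moved to multi-token phrases.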
- [✓] A6.1 — `build_project_preamble(project_dir) -> str` helper (file tree + README + first manifest, capped via `BROWNFIELD_PREAMBLE_MAX_FILES=200` / `MAX_LINES_PER_FILE=200` in defaults.py). git-tracked when available, glob fallback, common-ignore filter; deterministic. 11 new tests pass · tick 41
- [✓] A6.2 — `otto/prompts/compile-spec-brownfield.md` prompt template with `{project_preamble}` interpolation; greenfield prompt unchanged. Anti-derivation rules ("read; don't invent"; intent is scope hint not source); empty-project bootstrap branch; per-Feature evidence_kinds guidance per project_kind. 2 new render_prompt tests pass · tick 42 (13/13 brownfield total)
- [✓] A6.3 — `compile_spec(brownfield=True, base_spec=None)` wires preamble + brownfield prompt + same parsing pipeline as greenfield; base_spec emits a warning + is ignored until A6.4. Greenfield path entirely unchanged. 4 new tests pass · tick 43
- [✓] A6.4 — Additive mode (`_reconcile_brownfield(new_spec, base_spec)`): Group ids carry forward (warning on title conflict); Feature audit/coverage state preserved on matching ids; Components union-by-id; Guardrails union-deduped by text; intent + intent_hash from base; mechanical/historical fields (structure, shared_paths, non_goals, done_means, amendments, audit_fixtures, cross_group_checks, shared_scaffold) preserved from base. 4 new reconciliation tests (8/8 brownfield-compile total) · tick 44
- [✓] A6.5 — Out-of-scope keyword guard before LLM call (research §9.5b). `OUT_OF_SCOPE_KEYWORDS` (13 multi-token phrases) + `detect_out_of_scope_intent(intent)` helper + `compile_spec` pre-LLM check raising `SpecValidationError`. User override via literal `override-scope` token in intent (proof packet will mark verdict suggestive). Greenfield + brownfield share the check. 22 new tests · tick 45 (43/43 A6 total)
- A6.6 — File-level "preserve" markers (mechanism TBD; deferred until A6.4 lands; likely `.otto/preserve` file pattern).
- [✓] A6.7 — Integration test (`tests/integration/test_brownfield_compile_real.py`) builds a realistic CLI fixture in tmp_path, runs full compile_spec(brownfield=True) plumbing (preamble + prompt + parsing + reconciliation) with stubbed agent. Empty-base + base additive paths both verified. 2 new tests · tick 46 (45/45 A6 total)
- `otto run --brownfield` (or auto-detect via project state) compiles against an existing project and emits a Spec with ≥1 Feature reflecting actual project state.
- Bench C (brownfield add-feature) passes — measures delta mode (additive) against a fixture project + intent. Defer until A6.1-A6.5 land.
- B.1 (`otto certify`) and B.2 (`otto improve`) cutovers unblock.
Goal: legacy CLI commands route through new stack.
| Legacy entry point | Current backing | New-stack equivalent | Cutover notes |
|---|---|---|---|
| `otto build` | `otto.pipeline.build_agentic_v3` + `run_certify_fix_loop` | `compile_spec → build.run_build → merge_queue → audit_loop → render` (same chain as `otto run`) | Largest blast radius. Wraps spec gate, build, certify, fix in one loop. Cut last. |
| `otto certify` | `otto.certifier.run_agentic_certifier` | `audit_loop + render` | Read-mostly (no merge queue). Smallest blast radius; cut FIRST as a smoke proof. |
| `otto improve` | `otto.cli_improve` → certify+build feedback loop | `audit_loop` (multi-round) + `build.run_build` for fixes | Mid blast radius. Logically a long-running loop of certify→build→certify; can wrap the same pieces. |
| `otto run` | `otto.cli_run` (new stack, already routes correctly) | n/a (already on new stack ✓) | Reference implementation; the others should converge here. |
| `otto history / pow / setup / cleanup / merge / queue / dashboard / web` | `otto.cli_logs`, `cli_pow`, `cli_setup`, `cli_cleanup`, `cli_merge`, `cli_queue`, `cli.dashboard`, `cli.web_command` | No cutover required: utility commands; `web` mounts both `/api/runs` (legacy) and `/api/run-view` (new) | Read-only or auxiliary; defer to Phase C deletion of any legacy view. |
| `/api/runs/<id>/...` artifact routes | legacy MC inspector | `/api/run-view/<id>` (RunView JSON) + `/api/specs/<id>/markdown` (SpecMdView) | Both currently coexist. Phase B leaves both; Phase C deletes the legacy `/api/runs/*` body once MC default switches over. |
Cutover order — REVISED tick 39 after design-gap discovery:
Critical finding (tick 39): run_audit(spec, ..., build_result, merge_result, ...) REQUIRES populated BuildResult and MergeQueueResult from the new-stack chain. Legacy otto certify runs without these (no spec, no build phase). Direct cutover would require either:
(a) synthesizing fake BuildResult/MergeQueueResult — bandaid, violates anti-slop rule, OR
(b) brownfield-compiling a spec on the fly + running a no-op build → A6 dependency.
This means B.1 (otto certify) and B.2 (otto improve) BLOCK on A6 (brownfield compile) because they're called on projects without an existing spec/build cycle. otto build does NOT block — it always compiles a spec first, which is exactly what otto run already does end-to-end.
Revised order:
- B.0 — opt-in `otto build --i2p` (new step): add a flag that routes `otto build` through the existing `otto.cli_run.run_command` body without removing the legacy v3 stack. Lets users dogfood the new chain on real projects without breaking anyone. Smallest-possible safe move.
- B.3 — flip default to the new stack once bench evidence + dogfood shows feature parity. Add a `--legacy` escape hatch for one cycle. Removes `otto.pipeline.build_agentic_v3` callers.
- A6 — brownfield compile (was deferred): implement `compile_spec(intent, *, project_dir, brownfield=True)` that reads the existing project to seed groups + features. Required by B.1/B.2.
- B.1 — `otto certify` cutover (after A6): brownfield-compile a spec, run `audit_loop + render`. Becomes "audit a project that already exists".
- B.2 — `otto improve` cutover (after A6 + B.1): wrap multi-round `audit_loop` with the `cli_improve` retry pattern.
- B.4 — DeprecationWarnings on any legacy module functions slated for Phase C deletion.
- [✓] B.0 — `otto build --i2p` opt-in flag routes through new stack via extracted `otto.cli_run.orchestrate_run`; legacy default untouched · tick 39; 4 new tests (16/16 cli_run + cli_smoke pass)
- [✓] B.3 (PREP) — `default_pipeline` config field added (default still `"legacy"`); `--legacy` flag added to `otto build/certify/improve bugs`; `resolve_pipeline_choice(i2p, legacy, project_dir)` helper centralizes dispatch logic. Mutual-exclusion with `--i2p` raises `click.UsageError`. Actual default flip awaits bench validation. 9 new tests · tick 50 (38/38 cli + deprecation + pipeline-choice total).
- [✓] B.1 — `otto certify --i2p` routes through brownfield-compile + `run_audit` (with placeholder BuildResult/MergeQueueResult, no build phase, fix_agent=None) + `render_run`. Legacy default unchanged. `orchestrate_certify` extracted as reusable helper in cli_run.py. 4 new tests · tick 47 (20/20 cli total). Follow-up tick: `orchestrate_certify` now delegates the audit + render chain to `runner.run_pipeline(brownfield=True, fix_agent=None, spec=...)`; only the brownfield-compile (which lives inside the project lock and prints its own heading) plus CLI plumbing stay in cli_run.py.
- [✓] B.2 — `otto improve bugs --i2p` routes through brownfield-compile + `run_audit` (with `fix_agent=default_build_agent` enabling repair loop, `--rounds` mapped to `AuditBudget.audit_retries`) + `render_run`. Legacy default unchanged. `orchestrate_improve` extracted in cli_run.py. 4 new tests · tick 48; `feature` and `target` subcommands wired with the same pattern · tick 51. 6 additional tests (43/43 cli + deprecation + pipeline-choice total). Follow-up tick: `orchestrate_improve` now delegates the audit + render chain to `runner.run_pipeline(brownfield=True, fix_agent=default_build_agent, audit_budget=..., spec=...)`, sharing `_brownfield_compile_locked`/`_drive_brownfield_pipeline` with `orchestrate_certify`. The duplicated inline chain (run_audit placeholder BuildResult/MergeQueueResult + render_run) is gone.
- [✓] B.4 — DeprecationWarnings on legacy paths slated for Phase C deletion. `build_agentic_v3` and `run_agentic_certifier` emit `DeprecationWarning` on each call (not at module import) naming Phase C and the `--i2p` migration path. 3 new tests · tick 49. NOTE: `_run_improve` not warned (private helper; user-facing surface is `otto improve bugs --i2p` which already warns about ignored flags).
- Bench B (Microfeed parity) passes (criteria in research §12.7)
- All four legacy commands route through new stack with passing fixture tests
- Legacy `otto.pipeline.build_agentic_v3` and `otto.certifier.run_agentic_certifier` reachable only through deprecation shim
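The B.4 distinction between warning "on each call" and warning "at module import" is worth a concrete picture. The decorator below is a generic sketch using the standard `warnings` module; `deprecated_for_phase_c` and `legacy_build_stub` are hypothetical names, not the wrappers actually applied to `build_agentic_v3` or `run_agentic_certifier`.

```python
import functools
import warnings


def deprecated_for_phase_c(func):
    """Wrap a legacy function so every call emits a DeprecationWarning."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{func.__name__} is slated for Phase C deletion; "
            "migrate to the --i2p path",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return func(*args, **kwargs)

    return wrapper


@deprecated_for_phase_c
def legacy_build_stub() -> str:
    """Hypothetical stand-in for a legacy entry point."""
    return "legacy"
```

Importing the module stays silent with this pattern, so code that imports a legacy symbol without calling it produces no warning.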
Audit: see `docs/phase-c-deletion-audit.md` for module line counts, caller surface, MC route enumeration, and the proposed deletion order. Re-run the audit before landing the deletion PR, since line counts shift as Phase B work continues.
Goal: delete every module/component listed in research §13.
- C.1 — Delete legacy backend modules
-
[✓] C.1a — Gut
otto/cli_improve.pylegacy bodies (_run_improve,_run_improve_locked,_apply_improver_agent_aliases,_exit_for_lock_busy,_create_improve_branch,_resolve_improve_certifier_mode,_resolve_feature_certifier_mode). The click subcommands stay;--legacyis now a hard error pointing at--i2p. Tick 63. Removed test files:test_improvement_report_splits_pass_warn_fail.py,test_improve_writes_build_journal_single_round.py,test_improve_phase_writes_to_improve_dir.py. Legacy improve hardening tests intest_hardening.pydeleted in place. -
[✓] C.1b — Gut
otto/certifier/__init__.py(tick 64). Reduced from 4,456 lines → ~80 lines (pure shim re-exportingcontracts.py+report.py, plus a hard-error stub forrun_agentic_certifierso legacy lazy imports inotto/pipeline.pyandotto/merge/orchestrator.py(Phase C.3 deletion targets) fail loudly instead of withImportError).otto/certifier/contracts.py(292 lines) andotto/certifier/report.py(40 lines) kept — referenced byotto/merge/orchestrator.py(slated for C.3) andtests/test_merge_orchestrator.py/tests/test_hardening.py. The new stack (otto/audit.py,otto/render.py,otto/audit_loop.py) does not import them.otto/cli.py::_certify_lockeddeleted (-229 lines);otto certify --legacynow hard-errors via_exit_legacy_certify_removed(sibling to_exit_legacy_build_removedfrom C.3).tests/test_certifier_stories.pydeleted (-1,839 lines);tests/test_proof_provenance.py's certifier-coupledtest_visual_evidence_manifest_written_at_captureremoved (-76 lines).tests/test_legacy_deprecation.pyupdated: the run_agentic_certifier deprecation test became a hard-error assertion (call →RuntimeErrornaming Phase C.2). -
[✓] C.1b cleanup — Phase C cleanup pass (tick 64 follow-up). Pruned the 23 orphaned tests that exercised legacy certifier internals deleted in W8-A:
TestProofOfWorkRendering(13 tests, -525 lines —_build_pow_report_data/_render_pow_html/_render_pow_markdown/_write_pow_report/_intent_excerpt),TestSpecTimeoutTolerance(3 tests, -64 lines —run_agentic_certifier),TestCertifyPassesConfig(2 tests, -65 lines —run_agentic_certifier),TestCertifierStoryDedup(2 tests, -67 lines —run_agentic_certifier), and the twotest_standalone_certifier_target_*standalone functions (-62 lines —run_agentic_certifier) intests/test_hardening.py; plusTestStandaloneCertifierPrompt(3 tests, -63 lines —_render_certifier_prompt) intests/test_spec.py. Cross-cutting parser/marker/resume tests retained.tests/test_cli_run.py::test_certify_without_i2p_uses_legacy_pathrewritten astest_certify_without_i2p_hard_errors_after_phase_c2(mirrors the build-side_after_phase_c3pattern from C.1c). Targeted suite: 135 passed. -
[✓] C.1c — Gut
otto/pipeline.py(Phase C.3, this tick). Reduced from 2,875 lines → 61 lines (thin shim that re-exports shared lifecycle helpers from the newotto/runs/lifecycle.pymodule and hard-errors on access tobuild_agentic_v3,run_certify_fix_loop,BuildResult,InfraFailureError). Shared run-lifecycle helpers (process cleanup, atomic publisher / heartbeat / cancel ack, session summary writer, terminal history append, runtime metadata) moved tootto/runs/lifecycle.py(~600 lines) — not deleted, sinceotto/agent.py,otto/cli.py:_run_spec_phase, and the test suite still need them._build_lockedbody (-654 lines) gutted inotto/cli.py;--legacyroute now exits via_exit_legacy_build_removed. Removed tests:tests/test_v3_pipeline.py(-1,354 lines, deleted),tests/test_build_fallback_to_intent_md.py(-85 lines, deleted), 18 v3-only classes/functions intests/test_hardening.py(-2,732 lines pruned in place; file is now 1,953 lines vs 4,685). Re-pointedtests/test_run_history.py,tests/test_token_usage_phase_logs.py,tests/test_agent.pyto import the shared helpers fromotto.runs.lifecycle.tests/test_legacy_deprecation.pyalready updated by W8-A to assert the hard-error contract. Final pipeline.py shim deletion tracked separately — keeps existing whileotto/merge/orchestrator.py's lazy import ofrun_agentic_certifiersurvives. -
C.1d — Delete
otto/spec.py(legacy markdown spec gate). Inlined the smallwrite_spec_review_decisionhelper intootto/mission_control/actions.pyas_write_spec_review_decision(single consumer, ~12 LOC), then deletedotto/spec.pyand the legacytests/test_spec.py. Sidecarreview-decision.jsonis currently informational — no consumer reads it; queue resume works off the checkpoint. New web spec-review flow lives inotto/web/spec_review_routes.pyon top ofspec_compile. Verified:grep -rn 'from otto\.spec\b'returns zero matches.tests/test_mission_control_actions.py+ spec test sweep (138 tests) green; broad sweep (1652 tests) green modulo a pre-existing test-ordering flake unrelated to this change. -
[✓] C.1f — Phase C cleanup pass (W8-B follow-up, tick 66). Removed orphaned references left by C.3: (1) Deleted `otto/cli.py:_run_spec_phase` (-450 LOC) — its only caller was the gutted `_build_locked` stub. (2) Deleted `tests/integration/test_resume_flow.py` (-47 LOC) — single test depended on `from otto.pipeline import build_agentic_v3` (now hard-error stub). (3) Surgically removed `tests/integration/test_build_flow.py::test_build_agentic_v3_dedupes_repeated_certify_round_markers` and pruned the resulting unused imports; the canonical+mirror manifest test in the same file is unchanged. (4) Annotated the lazy `from otto.certifier import run_agentic_certifier` import at `otto/merge/orchestrator.py:1730` with a 10-line comment explaining the C.2/C.3 status and why a substitution into `otto.audit.run_audit` is a structural rewrite rather than a cleanup. The orchestrator stays reachable via `otto/cli_merge.py`; tests in `tests/test_merge_orchestrator.py` exercise the path through monkeypatched stubs. **Post-compact finding (W10-D, this tick):** Re-audited the reachability claim. The lazy import is **NOT dead** — call chain confirmed: `otto/cli.py:1289 register_merge_command` → `otto/cli_merge.py:206 await run_merge(...)` → `otto.merge.orchestrator.run_merge` (line ~1080-1100) → `_run_post_merge_verification` (line 1654) → unconditional `await run_agentic_certifier(...)` at line 1763 (unless `--no-certify` set). **Real `otto merge` invocations without `--no-certify` will fail with the Phase C.2 RuntimeError stub.** Tests pass only because they monkeypatch `otto.certifier.run_agentic_certifier`, masking the production hard-error. **Decision required (user input):** (a) delete `otto merge` CLI + orchestrator entirely (Phase C.3 expansion); (b) wrap `_run_post_merge_verification` in a `--no-certify` short-circuit at `run_merge` level so the orchestrator never reaches the stub; or (c) do the structural rewrite to call `otto.audit.run_audit` (incompatible call shapes — `intent + stories + merge_context` vs `Spec + BuildResult + MergeQueueResult`). i2p stack is unaffected — uses `otto/merge_queue.py`, not the legacy orchestrator. **Resolution (W10-E, this tick):** Option (a) executed in C.1g below — the lazy import (and everything around it) is now deleted, not annotated. Verification: `uv run pytest tests/test_merge_orchestrator.py -q` → 49 passed; `tests/integration/test_build_flow.py` → 1 passed (the canonical+mirror manifest test).
- [✓] C.1f cleanup — Dead helper sweep (this tick). Deleted the residual `write_test_pow_report` helper in `tests/_helpers.py` (-72 lines) — its body imported `_build_pow_report_data` and `_write_pow_report` from `otto.certifier`, both removed in Phase C.2 (tick 64), so the helper could no longer execute. Grep confirmed zero callers across `otto/` and `tests/` (only historical comments in `otto/certifier/__init__.py:36` and `tests/test_hardening.py:576-577` reference the deleted symbols by name). Also dropped the now-unused `Any` import. Verification: `uv run pytest tests/ -q --ignore=tests/integration --ignore=tests/browser --tb=no` → 1700 passed.
- [✓] C.1g — Delete legacy `otto merge` CLI + orchestrator (W10-E, this tick). Option (a) of the C.1f decision tree. i2p stack uses `otto/merge_queue.py`; the legacy multi-mode merge orchestrator was reachable only through `otto merge`, and any real invocation without `--no-certify` would crash on the C.2 `run_agentic_certifier` stub.
  - Production deletions:
    1. `otto/merge/orchestrator.py` (-2382 LOC)
    2. `otto/cli_merge.py` (-359 LOC)
    3. `otto/merge/conflict_agent.py` (-312 LOC, production callers: only the orchestrator)
    4. `otto/merge/edit_scope.py` (-230 LOC, production callers: only conflict_agent)
    5. `otto/merge/stories.py` (-30 LOC, deprecated re-export shim used only by the orchestrator)
    6. `otto/cli.py` — dropped the `register_merge_command` registration block (-3 LOC)
  - Helpers preserved (in active use by i2p stack / mission_control / test suite): `otto/merge/git_ops.py`, `otto/merge/state.py`, `otto/merge/features.py`, `otto/merge/verification.py`, `otto/merge/__init__.py`.
  - Test deletions (1996 + 260 + 46 + 267 + 109 + 127 + 77 + 142 = 3024 LOC, 8 files): `tests/test_merge_orchestrator.py`, `tests/test_cli_merge.py`, `tests/test_registry_gc.py`, `tests/test_merge_conflict_agent.py`, `tests/merge/test_conflict_agent_scope.py`, `tests/merge/test_edit_scope.py`, `tests/integration/test_merge_flow.py`, `tests/integration/test_conflict_scope_flow.py`.
  - Test rewrites (3 files):
    - `tests/test_cli_smoke.py` — `merge --help` test inverted to assert `No such command 'merge'` (Click exits non-zero).
    - `tests/test_run_history.py` — dropped the `_append_merge_history` import + the merge slice of the terminal-history multi-domain test (other 4 domains retained).
    - `tests/conftest.py` — removed `tests/test_merge_orchestrator.py` from `HEAVY_TEST_FILES`.
  - Doc/comment updates: `otto/certifier/__init__.py` and `tests/test_legacy_deprecation.py` no longer claim a lazy import lives in `otto/merge/orchestrator.py` (the file is gone). The C.2 hard-error stub stays for `otto/pipeline.py`'s lazy import (still alive).
  - Total: -3334 production LOC + -3024 test LOC = -6358 LOC.
  - Verification: full unit sweep `uv run pytest tests/ -q --ignore=tests/integration --ignore=tests/browser --tb=short -x` → 1568 passed in 128s; `uv run otto --help | grep -i merge` → exit 1 (no match, command removed).
- [✓] C.1e — Delete legacy `/api/runs/<run_id>/...` MC routes from `otto/web/app.py` (audit-doc step 4). Tick 65. 11 GET/POST routes removed: `run_detail`, `run_logs`, `run_artifacts`, `run_artifact_content`, `run_artifact_raw`, `run_proof_report`, `run_proof_asset`, `run_legacy_proof_evidence_asset`, `run_legacy_proof_file`, `run_diff`, `run_action`. Frontend default switched simultaneously: `main.tsx` now renders `<RunListLanding/>` (lists sessions from `/api/run-view`) when no `?view=` param is set; legacy `App.tsx` deleted (1731 lines). New component: `otto/web/client/src/components/run/RunListLanding.tsx`. Removed test files: `test_web_landing.py`, `test_web_mission_control.py`, `test_web_review_packet.py` (55 legacy tests). Surgically dropped 1 test in `test_web_events_history.py` and 1 legacy `/api/runs/<id>` call in `test_web_queue_actions.py`. Targeted suite green: 120 tests across `tests/test_run_view*.py`, `tests/test_spec_review_routes.py`, and `tests/test_web_*.py`.
-
- C.2 — Delete legacy frontend components
- C.3 — Remove `domain` field from `HistoryRow`
- C.4 — Single-PR diff with full test suite + RUA pass
- All Layer-1 to Layer-5 evaluations from research §12 green
- Sign-off criteria from research §4.8 met
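
The C.1f finding above rests on a general pitfall: a test that monkeypatches a stubbed-out entry point exercises the fake, so it stays green while the unpatched production path raises. The sketch below reproduces that effect in isolation. The `certifier` namespace, `post_merge_verification`, and `_fake_certifier` are hypothetical stand-ins, not the real `otto` code; only the shape (an attribute looked up at call time, a `RuntimeError` stub, a `--no-certify` style bypass) is taken from the entry.

```python
import asyncio
import types
from unittest import mock

# Hypothetical stand-in for a module whose entry point was replaced
# by a hard-error stub (names are illustrative, not otto's real API).
certifier = types.SimpleNamespace()


async def _removed_stub(*args, **kwargs):
    raise RuntimeError("run_agentic_certifier was removed")


certifier.run_agentic_certifier = _removed_stub


async def post_merge_verification(no_certify: bool = False) -> str:
    """Call the certifier unless the caller opted out."""
    if no_certify:
        return "skipped"
    # The attribute is resolved at call time, so a test-time patch
    # replaces the stub and hides the production failure.
    return await certifier.run_agentic_certifier()


async def _fake_certifier(*args, **kwargs):
    return "certified"


def run_patched() -> str:
    """What a monkeypatching test sees: the fake runs, nothing raises."""
    with mock.patch.object(certifier, "run_agentic_certifier", _fake_certifier):
        return asyncio.run(post_merge_verification())


def run_unpatched() -> str:
    """What a real invocation sees: the stub raises."""
    try:
        return asyncio.run(post_merge_verification())
    except RuntimeError as exc:
        return f"failed: {exc}"
```

Under these assumptions, `run_patched()` succeeds while `run_unpatched()` reports the `RuntimeError`, which is why a green suite alone could not confirm the reachability claim. A smoke test that runs at least one path without the patch would surface the stub.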
Updated by the drift sentinel each tick. If any counter goes nonzero outside the
permitted allowlist, the drift sentinel halts and writes to drift-log.md.
| Counter | Target | Last value | Last checked |
|---|---|---|---|
| Retired-vocab hits | 0 (post-A0) | 1669 in otto/ (down from 1672; A0 active) | 2026-05-04T18:13Z |
| Magic-number hits outside defaults.py | 23 (legacy floor) | 23 (unchanged; transport timeouts in Phase-C-doomed modules) | 2026-05-04T18:08Z |
| Out-of-scope file edits in current phase | 0 | 0 | 2026-05-04T17:55Z |
| Failing unit tests | 0 | (not run this tick) | (never) |
| Frontend typecheck errors | 0 | (not run this tick) | (never) |
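
One possible reading of the halting rule, sketched in code with values from the table. Everything here is an assumption made for illustration: the `Counter` type, `find_breaches`, `render_log_lines`, the log line format, treating "value above target" as a breach, and modeling the allowlist as a set of exempt counter names. The real sentinel's allowlist handling is not described in this checklist.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Counter:
    name: str
    target: int
    value: Optional[int]  # None means "not run this tick"


def find_breaches(counters: Iterable[Counter], allowlist: Set[str]) -> List[Counter]:
    """Return counters that were measured, exceed their target, and are not exempt."""
    return [
        c
        for c in counters
        if c.value is not None and c.value > c.target and c.name not in allowlist
    ]


def render_log_lines(tick: int, breaches: Iterable[Counter]) -> List[str]:
    """Format breaches as lines for a drift log (hypothetical format)."""
    return [f"tick {tick}: {c.name} = {c.value} (target {c.target})" for c in breaches]


counters = [
    Counter("Retired-vocab hits", target=0, value=1669),
    Counter("Magic-number hits outside defaults.py", target=23, value=23),
    Counter("Out-of-scope file edits in current phase", target=0, value=0),
    Counter("Failing unit tests", target=0, value=None),
]

# Assumption: retired-vocab hits are exempt while phase A0 is still active.
allowlist = {"Retired-vocab hits"}

breaches = find_breaches(counters, allowlist)
should_halt = bool(breaches)
```

With the allowlist above nothing breaches, so `should_halt` is `False`. With an empty allowlist the retired-vocab counter would be reported. Counters that were not run this tick are skipped rather than treated as passing, which keeps "(not run this tick)" distinct from a verified zero.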
User has explicitly removed cost constraints. Track spend per phase for retrospective only. No phase blocks on cost.
| Phase | Spent so far | Wall (real time) |
|---|---|---|
| A0 | (n/a) | (n/a) |
| A1a | (n/a) | (n/a) |
| A1.5-types | (n/a) | (n/a) |
| A1b | (n/a) | (n/a) |
| A1c | (n/a) | (n/a) |
| A1.5-seed | (n/a) | (n/a) |
| A2 | (n/a) | (n/a) |
| A3 | (n/a) | (n/a) |
| A4 | (n/a) | (n/a) |
| A5 | (n/a) | (n/a) |
| A6 | (n/a) | (n/a) |
| B | (n/a) | (n/a) |
| C | (n/a) | (n/a) |