From 73974b2c54668a682be0f6fb841fbaba85f25fd5 Mon Sep 17 00:00:00 2001 From: "Arty S." <248714260+arty-kk@users.noreply.github.com> Date: Mon, 30 Mar 2026 23:08:49 +0300 Subject: [PATCH] Tighten evidence anchors in evolution plan --- evolution_plan.md | 229 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 229 insertions(+) create mode 100644 evolution_plan.md diff --git a/evolution_plan.md b/evolution_plan.md new file mode 100644 index 0000000..3a1b0be --- /dev/null +++ b/evolution_plan.md @@ -0,0 +1,229 @@ +# Evolution Plan + +## 0. Baseline (from audit) +- Architecture map: + - CLI entrypoint orchestrates rollback and cycle execution in `sif.cli` (`src/sif/cli.py:39-57#build_parser`, `src/sif/cli.py:60-217#_run_main`). + - Core autonomy loop is centered in `SpiralEngine.step()` and evolve/apply stages (`src/core/spiral_engine.py:1160-1300#SpiralEngine`, `src/core/spiral_engine.py:760-1025#SpiralEngine.evolve`). + - Candidate evaluation pipeline is `ExperimentManager -> evaluate_async -> selector.should_accept` (`src/core/experiment_manager.py:124-323#ExperimentManager.run_async`, `src/core/evaluator.py:54-164#evaluate_async`, `src/core/selector.py:9-29#should_accept`). + - Persistence and recovery are split between state store and version snapshots (`src/core/state_store.py:203-255#load_state`, `src/core/versioning.py:106-325#create_version_async`, `src/core/versioning.py:198-325#restore_version_async`). + - Mutation boundary is policy-gated to `src/components` and `src/evolvable` (`src/core/policy.py:17-58#is_path_allowed`). +- Critical flows: + 1. `sif.cli` single-cycle run with state load/save (`src/sif/cli.py:79-127#_run_main`). + 2. Continuous unattended run with restart/error counters (`src/sif/cli.py:115-193#_run_main`). + 3. Plan/evaluate/reflect/code-change decision inside spiral cycle (`src/core/spiral_engine.py:1160-1268#SpiralEngine.step`). + 4. Candidate experiment selection and application (`src/core/spiral_engine.py:743-855#SpiralEngine.evolve`, `src/core/experiment_manager.py:172-323#ExperimentManager.run_async`). + 5. Post-apply degradation detection and rollback to LKG/latest version (`src/core/spiral_engine.py:866-980#SpiralEngine.evolve`). + 6. Snapshot creation/restore for rollback (`src/core/versioning.py:106-195#create_version_async`, `src/core/versioning.py:198-325#restore_version_async`). + 7. Policy enforcement for file changes (`src/core/evolution.py:116-151#apply_code_changes_to_root_async`, `src/core/policy.py:52-58#is_path_allowed`). + 8. Event durability via async writer and fail-safe queue (`src/core/events.py:244-280#AsyncEventWriter`, `src/core/events.py:142-160#_enqueue_fail_safe_lines`). + 9. LLM fallback path without API key (`src/core/llm.py:31-44#LLMOrchestrator.start`, `src/core/llm.py:84-89#LLMOrchestrator.request_response`). + 10. Repository-native validation path (`Makefile:3-13`, `scripts/check.sh:5-10`, `docs/evaluation.md:14-21`). +- Current pain points: + - **P0** Evaluation command in runtime uses `unittest discover`, but repository test contract is pytest-based. Runtime evaluator executes `python -m unittest discover ...` (`src/core/evaluator.py:91-100#evaluate_async`), while repo-native checks are `pytest -q` (`Makefile:3-4`, `scripts/check.sh:7-8`, `docs/evaluation.md:16-19`), and tests use pytest fixtures/functions (`tests/test_cli.py:13-29`, `tests/test_versioning.py:11-25`). This causes evaluation mismatch and candidate rejection risk. + - **P0** Candidate acceptance hard-requires `tests_success=True` (`src/core/selector.py:13-29#should_accept`) and `ExperimentManager` relies on evaluator output for accept/reject (`src/core/experiment_manager.py:254-260#ExperimentManager.run_async`). With the evaluator mismatch above, valid candidates can be consistently rejected. + - **P1** Critical evolution/rollback paths are effectively untested in automated suite: tests currently cover only CLI smoke, policy, state store, version smoke, and LLM fallback (`tests/test_cli.py:13-29`, `tests/test_policy.py:1-31`, `tests/test_state_store.py:12-27`, `tests/test_versioning.py:11-25`, `tests/test_llm.py:8-22`), but there is no direct test for `ExperimentManager.run_async` (`src/core/experiment_manager.py:149-323`) and no direct test for post-apply rollback branch in `SpiralEngine.evolve` (`src/core/spiral_engine.py:866-980`). + - **P2** LLM task contract is misleading: `self_evolve` is declared as supported (`src/core/llm.py:27#LLMOrchestrator`) but no `self_evolve` branch exists in fallback builder (`src/core/llm.py:145-155#LLMOrchestrator.build_fallback`), while call-sites in the core loop explicitly request only `plan`, `evaluate`, `reflect`, and `code_changes` (`src/core/spiral_engine.py:1941-2108#SpiralEngine._resolve_llm_plan`, `src/core/spiral_engine.py:1976-2108#SpiralEngine._resolve_llm_evaluation`). +- Constraints: + - Mutation policy intentionally restricts writable code to evolvable/component areas (`src/core/policy.py:17-58#is_path_allowed`). + - Safety and rollback are first-class contracts in docs and runtime (`README.md:27-39`, `docs/guarantees.md:5-10`, `src/core/spiral_engine.py:866-980#SpiralEngine.evolve`). + - Validation commands already standardized in repository scripts/docs (`Makefile:3-13`, `scripts/check.sh:5-10`, `CONTRIBUTING.md:11-17`). + +## 1. North Star +- UX outcomes: + - Candidate evaluation outcomes become predictable: same result from runtime evaluator and contributor local checks (proxy metric: zero command-set drift between evaluator and repo check commands). + - Failed candidate explanations map to actionable reasons (proxy metric: each rejection reason corresponds to a tested branch in evaluator/selector). +- Domain outcomes: + - Single source of truth for “candidate is valid” contract across evaluator and selector. + - Rollback invariants remain verifiable for post-apply degradation path. +- Engineering outcomes: + - Reduced false-negative candidate rejection rate (proxy: accepted candidates after evaluator fix under unchanged baseline). + - Higher regression resistance by direct tests for ExperimentManager + rollback branch. + +## 2. Roadmap (incremental) +### Phase 1 (Stabilize Core) +- Goal + - Remove correctness defects that currently distort candidate acceptance. +- Scope (what we touch / what we don’t) + - Touch: evaluator command path, acceptance contract tests, experiment-manager evaluation tests. + - Don’t touch: mutation policy boundary, component strategy logic, LLM prompt design. +- Deliverables (concrete changes) + 1. Replace `unittest discover` invocation with repo-native pytest invocation in evaluator. + 2. Add regression tests proving evaluator executes pytest-compatible suites in workspace mode. + 3. Add acceptance-flow tests: evaluator output -> `should_accept` behavior. + 4. Add `ExperimentManager.run_async` test for accepted candidate under passing evaluator. +- Dependencies + - Existing repo command contract (`Makefile:3-4`, `scripts/check.sh:7-8`). +- Risk & Rollback strategy (if migration/contract changes are required) + - Risk: evaluator runtime cost increase when running pytest. + - Rollback: keep a feature flag to switch evaluator test command back only if CI/runtime regressions appear; revert evaluator command change commit. +- Validation (how to verify: tests/linter/commands from the repo) + - `pytest -q` + - `python -m compileall -q src` + - `PYTHONPATH=src python -m sif.cli --cycles 1 --json` + +### Phase 2 (UX & Domain Consolidation) +- Goal + - Make rejection/rollback behavior transparent and directly testable. +- Scope (what we touch / what we don’t) + - Touch: SpiralEngine rollback branch observability and tests. + - Don’t touch: snapshot file format, version id format. +- Deliverables (concrete changes) + 1. Add deterministic test for post-apply degradation branch and rollback metadata persistence. + 2. Add contract test for fallback restore path (`lkg_version_id` absent -> latest version restore attempt). + 3. Normalize rejection reason taxonomy (selector + experiment manager) and test branch mapping. +- Dependencies + - Phase 1 evaluator reliability. +- Risk & Rollback strategy (if migration/contract changes are required) + - Risk: tests may require controlled stubs/mocks for versioning + evaluator. + - Rollback: keep behavior unchanged, add tests first; if behavior adjustments needed, gate behind explicit branch conditions with backward-compatible defaults. +- Validation (how to verify: tests/linter/commands from the repo) + - `pytest -q` + - `python -m compileall -q src` + - `python scripts/build_proof_pack.py` + +### Phase 3 (Scale & Maintainability) +- Goal + - Remove misleading contracts and reduce future decision drift. +- Scope (what we touch / what we don’t) + - Touch: LLM orchestrator task registry contract and related tests/docs. + - Don’t touch: model provider integration mechanics. +- Deliverables (concrete changes) + 1. Remove or implement `self_evolve` task end-to-end so supported task list matches actual behavior. + 2. Add tests asserting every `supported_tasks` entry has runtime handling/fallback and at least one call-site or explicit “reserved” marker. + 3. Update docs describing supported model tasks and expected payload contracts. +- Dependencies + - None blocking core correctness; can proceed after Phase 1. +- Risk & Rollback strategy (if migration/contract changes are required) + - Risk: external prompts/scripts depending on old task list. + - Rollback: retain compatibility alias for one release cycle if task key removal is required. +- Validation (how to verify: tests/linter/commands from the repo) + - `pytest -q` + - `python -m compileall -q src` + +## 3. Task Specs (atomic, single-strategy) +- ID: EVO-001 +- Priority: P0 +- Theme: Reliability +- Problem: + - Runtime evaluator uses non-repo-native test runner and fails to evaluate actual suite consistently. +- Evidence: `src/core/evaluator.py:91-100#evaluate_async`, `Makefile:3-4`, `scripts/check.sh:7-8`, `docs/evaluation.md:16-19`, `tests/test_cli.py:13-29`. +- Root Cause + - Evaluator hardcodes `unittest discover` while project standard and tests are pytest-based. +- Impact + - False negatives during candidate validation; valid candidates rejected. +- Fix (single solution) + - Switch evaluator test command to `pytest -q` against workspace root. +- Steps + 1. Update evaluator subprocess command construction. + 2. Preserve timeout and output capture contract. + 3. Add regression test for evaluator command behavior. +- Acceptance Criteria (verifiable) + - Evaluator passes when pytest suite passes in workspace. + - Evaluator marks tests failed when pytest exits non-zero. +- Validation Commands (if visible in the project) + - `pytest -q` + - `python -m compileall -q src` +- Migration/Rollback (if needed) + - Revert evaluator command change if execution time becomes unacceptable and add explicit documented flag before reattempt. + +- ID: EVO-002 +- Priority: P0 +- Theme: Domain +- Problem: + - Acceptance gate can reject all candidates when evaluator output is systematically wrong. +- Evidence: `src/core/selector.py:13-29#should_accept`, `src/core/experiment_manager.py:254-260#ExperimentManager.run_async`, `src/core/evaluator.py:74-113#evaluate_async`. +- Root Cause + - Strict acceptance depends entirely on evaluator fields without contract tests for evaluator-selector coupling. +- Impact + - Self-evolution loop stalls despite valid code changes. +- Fix (single solution) + - Add contract tests for evaluator→selector coupling and enforce required metric fields before decision. +- Steps + 1. Add tests with synthetic evaluator payloads for accept/reject branches. + 2. Validate required fields (`compile_success`, `tests_success`, `tests_skipped`) before `should_accept` call. + 3. Reject malformed payloads with explicit reason. +- Acceptance Criteria (verifiable) + - Accepted candidate path reachable with valid metrics. + - Malformed payload yields deterministic rejection reason. +- Validation Commands (if visible in the project) + - `pytest -q` +- Migration/Rollback (if needed) + - No data migration; rollback by reverting coupling checks. + +- ID: EVO-003 +- Priority: P1 +- Theme: Reliability +- Problem: + - Post-apply rollback branch is complex but not covered by direct tests. +- Evidence: `src/core/spiral_engine.py:866-980#SpiralEngine.evolve`, `tests/test_versioning.py:11-25`, `tests/test_cli.py:13-29`. +- Root Cause + - Existing tests validate only snapshot smoke and CLI smoke, not degradation-triggered rollback path. +- Impact + - High regression risk in safety-critical recovery path. +- Fix (single solution) + - Add deterministic unit/integration tests for degradation-triggered rollback metadata and restore path selection. +- Steps + 1. Stub evaluator metrics to trigger degradation. + 2. Stub versioning outcomes for LKG success/fallback/failure. + 3. Assert memory keys/events set as specified. +- Acceptance Criteria (verifiable) + - Rollback keys (`rollback_triggered`, `rollback_reason`, `rollback_version_id|rollback_failed`) are correct for each branch. +- Validation Commands (if visible in the project) + - `pytest -q` +- Migration/Rollback (if needed) + - No migration. + +- ID: EVO-004 +- Priority: P1 +- Theme: Platform +- Problem: + - Candidate evaluation behavior is under-tested despite being a critical integration point. +- Evidence: `src/core/experiment_manager.py:149-323#ExperimentManager.run_async`, `tests/test_code_mutation.py:11-38`, `tests/test_llm.py:15-22`. +- Root Cause + - Current tests focus on isolated modules; integration path (workspace materialization + code apply + evaluator + cache) lacks direct coverage. +- Impact + - Refactors can silently break candidate throughput and selection. +- Fix (single solution) + - Add integration tests for `ExperimentManager.run_async` covering accepted, blocked, timeout, and cached outcomes. +- Steps + 1. Use temporary repo fixture with synthetic candidates. + 2. Inject deterministic evaluator coroutine. + 3. Assert result payload and `best_candidate` selection. +- Acceptance Criteria (verifiable) + - All documented result reasons are covered by tests. +- Validation Commands (if visible in the project) + - `pytest -q` +- Migration/Rollback (if needed) + - No migration. + +- ID: EVO-005 +- Priority: P2 +- Theme: Domain +- Problem: + - `LLMOrchestrator.supported_tasks` advertises `self_evolve` without actual runtime handling. +- Evidence: `src/core/llm.py:27#LLMOrchestrator`, `src/core/llm.py:145-155#LLMOrchestrator.build_fallback`, `src/core/spiral_engine.py:2072-2111#SpiralEngine._resolve_llm_code_changes`. +- Root Cause + - Task registry grew without end-to-end implementation/use. +- Impact + - Misleading contract for maintainers and future integrations. +- Fix (single solution) + - Remove `self_evolve` from supported list until a concrete call path and schema exist. +- Steps + 1. Update supported task list. + 2. Update tests/docs for supported tasks. + 3. Add guard test that supported tasks map to explicit handlers. +- Acceptance Criteria (verifiable) + - No orphan task entries in supported list. +- Validation Commands (if visible in the project) + - `pytest -q` +- Migration/Rollback (if needed) + - If external dependency exists, keep temporary compatibility alias in one release. + +## 4. Explicit Non-Goals +- Do not broaden mutation policy boundaries beyond current allowed paths (`src/core/policy.py:17-58#is_path_allowed`). +- Do not redesign component strategy heuristics or prompt content in this plan. +- Do not introduce new external dependencies for validation tooling. +- Do not refactor unrelated modules for style-only reasons. + +Stopping rule: +- This audit found actionable items in whitelist categories (P0/P1/P2), therefore no “no-op” conclusion is applicable.