[plan][feat]: Add raw and full execution modes to SWE-bench eval harness

## Description

Add a `--mode {raw,full}` flag to the SWE-bench evaluation harness to enable A/B comparison between baseline Claude (`raw` mode: bare `claude -p` subprocess) and the full agentize pipeline (`full` mode: planning via `run_planner_pipeline()` + implementation via FSM orchestrator with review/simp iterations).

The `raw` mode tests the model alone (baseline), while `full` mode tests the complete agentize framework. The planning step is the main differentiator between the two.

## Proposed Solution

Adopt the kernel-substitution approach (~100 LOC) over a custom transition table (~415 LOC): replace `pr_stage_kernel` and `rebase_stage_kernel` with no-ops that emit pass events, and reuse the production transition table unchanged. Defer per-stage token tracking (TokenLedger) to a follow-up.

### Consensus Summary

Adopt the reducer's kernel-substitution approach (~100 LOC) over the bold proposal's custom transition table (~415 LOC), validated by the critique's finding that `run_fsm_orchestrator()` does not accept a custom transition table. The key insight: replace `pr_stage_kernel` and `rebase_stage_kernel` with no-ops that emit pass events, reusing the production transition table unchanged. Defer per-stage token tracking (TokenLedger) to a follow-up — total tokens from the result dict suffice for MVP.
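
A minimal sketch of the kernel-substitution idea, assuming the registry is a plain dict from stage name to handler. The `StageResult` shape, the event string values, and the `"pr"` / `"rebase"` keys are stand-ins chosen for illustration; the real definitions in `state.py` and `kernels.py` may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-ins for constants the plan says live in state.py.
EVENT_PR_PASS = "pr_pass"
EVENT_REBASE_OK = "rebase_ok"


@dataclass
class StageResult:
    """Assumed shape: the event that drives the next FSM transition."""
    event: str
    reason: str = ""


Kernel = Callable[[dict], StageResult]


def pr_stage_kernel(context: dict) -> StageResult:
    raise RuntimeError("production kernel: would open a GitHub PR")


def rebase_stage_kernel(context: dict) -> StageResult:
    raise RuntimeError("production kernel: would fetch and rebase")


# Stand-in for the production registry (stage name -> handler).
KERNELS: Dict[str, Kernel] = {
    "pr": pr_stage_kernel,
    "rebase": rebase_stage_kernel,
}


def _eval_pr_kernel(context: dict) -> StageResult:
    """No-op: report a passing PR stage without touching GitHub."""
    return StageResult(event=EVENT_PR_PASS, reason="skipped in eval")


def _eval_rebase_kernel(context: dict) -> StageResult:
    """No-op: report a successful rebase without running git."""
    return StageResult(event=EVENT_REBASE_OK, reason="skipped in eval")


def build_eval_kernels(base: Dict[str, Kernel]) -> Dict[str, Kernel]:
    """Copy the registry and swap in the no-ops; `base` is left untouched."""
    kernels = dict(base)
    kernels["pr"] = _eval_pr_kernel
    kernels["rebase"] = _eval_rebase_kernel
    return kernels
```

Because the transition table only sees the emitted events, the orchestrator should walk the same path as in production while the two side-effecting stages do nothing.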

### Goal

Add `--mode {raw,full}` to the SWE-bench eval harness so we can A/B compare baseline Claude (`raw`: bare `claude -p`) against the full agentize pipeline (`full`: planning via `run_planner_pipeline()` + implementation via FSM orchestrator with review/simp iterations).

**Success criteria:**
- `--mode raw` preserves existing behavior (bare `claude -p` subprocess)
- `--mode full` runs the planner pipeline, writes the consensus plan as an issue file, then drives the FSM orchestrator through impl → review → simp → finish (skipping PR/rebase via kernel substitution)
- Patches from both modes are extractable via `git diff` and scoreable by SWE-bench Docker evaluator
- No modifications to upstream `impl.py`, `orchestrator.py`, `pipeline.py`, or `kernels.py`
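
The patch-extraction criterion could be sketched roughly as below. `extract_patch` and its `base` and `run` parameters are hypothetical names, not part of the harness. The sketch assumes changes are either uncommitted or can be diffed against a known base commit; if `full` mode commits its work, passing `base` would be needed to capture it.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Optional


def extract_patch(
    wt_path: Path,
    base: Optional[str] = None,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Return a unified diff of the worktree at `wt_path`.

    With `base` unset this is plain `git diff` (unstaged changes only).
    With `base` set, the working tree is compared against that commit.
    `run` is injectable so the function can be tested without git.
    """
    cmd: List[str] = ["git", "-C", str(wt_path), "diff"]
    if base is not None:
        cmd.append(base)
    result = run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Using the same function for both modes keeps the scoring input identical in format, which is what the A/B comparison relies on.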

**Out of scope:**
- Per-stage token tracking (TokenLedger): deferred, good to have in the future
- Parallel task execution: deferred, good to have in the future
- Resume support for `full` mode: deferred, good to have in the future

### Codebase Analysis

**Files verified (docs/code checked by agents):**
- `python/agentize/workflow/impl/orchestrator.py`: Confirmed `run_fsm_orchestrator()` accepts `kernels: KernelRegistry` dict — kernel substitution is the correct approach
- `python/agentize/workflow/impl/kernels.py`: Confirmed `KERNELS` dict maps stages to handlers, can be copied and modified
- `python/agentize/workflow/impl/checkpoint.py`: Confirmed `create_initial_state(issue_no, worktree, plan_file)` factory
- `python/agentize/workflow/impl/state.py`: Confirmed `WorkflowContext`, `StageResult`, and all event constants
- `python/agentize/workflow/planner/pipeline.py`: Confirmed `run_planner_pipeline(feature_desc, output_dir=...)` returns dict with consensus text
- `python/agentize/eval/eval_harness.py`: Confirmed current structure with `run_impl()`, `write_overrides()`, `_cmd_run()`

**File changes:**

| File | Level | Purpose |
|------|-------|---------|
| `python/agentize/eval/eval_harness.py` | major | Add `--mode` flag, `run_planning_phase()`, `run_full_impl()`, eval kernel stubs, extended shell overrides (~100 LOC) |
| `python/agentize/eval/eval_harness.md` | medium | Document dual-mode architecture, full mode data flow, kernel substitution rationale |
| `python/tests/test_eval_harness.py` | medium | Add tests for new functionality (~40 LOC) |

### Interface Design

**New interfaces:**

1. `_eval_pr_kernel(context) -> StageResult` — No-op returning `EVENT_PR_PASS`
2. `_eval_rebase_kernel(context) -> StageResult` — No-op returning `EVENT_REBASE_OK`
3. `run_planning_phase(problem_statement, output_dir, model) -> str` — Runs planner pipeline, returns formatted issue content (~25 LOC)
4. `run_full_impl(wt_path, instance_id, problem_statement, timeout, model, enable_review, enable_simp, max_iterations) -> dict` — Drives FSM orchestrator with kernel substitution (~55 LOC)

**Modified interfaces:**

CLI argument parser:
```diff
  run_parser.add_argument("--model", default="sonnet")
+ run_parser.add_argument("--mode", choices=["raw", "full"], default="raw")
+ run_parser.add_argument("--enable-review", action="store_true", default=False)
+ run_parser.add_argument("--enable-simp", action="store_true", default=False)
+ run_parser.add_argument("--max-iterations", type=int, default=10)
```
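
For reference, the flags in the diff above behave as follows in a standalone parser. This is a sketch only: the `run` subcommand name is inferred from `run_parser` and `_cmd_run()`, and `build_parser` and the `prog` name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Standalone parser mirroring the proposed `run` flags."""
    parser = argparse.ArgumentParser(prog="eval_harness")  # prog name assumed
    subparsers = parser.add_subparsers(dest="command", required=True)

    run_parser = subparsers.add_parser("run")
    run_parser.add_argument("--model", default="sonnet")
    # `raw` stays the default so existing invocations keep their behavior.
    run_parser.add_argument("--mode", choices=["raw", "full"], default="raw")
    run_parser.add_argument("--enable-review", action="store_true", default=False)
    run_parser.add_argument("--enable-simp", action="store_true", default=False)
    run_parser.add_argument("--max-iterations", type=int, default=10)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["run", "--mode", "full", "--enable-review"])
    print(args.mode, args.enable_review, args.enable_simp, args.max_iterations)
```

Since `choices` is enforced by `argparse`, an unknown mode fails at parse time rather than deep inside the run loop.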

`write_overrides()`:
```diff
- git() { case "$1" in push) ... ;; esac; }
+ git() { case "$1" in push|fetch|rebase) ... ;; esac; }
- wt() { echo "STUB: wt skipped"; }
+ wt() { case "$1" in pathto) echo "." ;; *) echo "STUB: wt $1 skipped" ;; esac; }
```

### Implementation Steps

- **Step 1: Update documentation** (Est: 20 LOC)
- **Step 2: Add tests** (Est: 40 LOC): 5 new test cases
- **Step 3: Implement dual-mode** (Est: 100 LOC): kernel stubs, planning phase, full impl, CLI flags, mode dispatch

**Total:** ~160 LOC — Medium — Single session
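
The mode dispatch in Step 3 might look something like the sketch below. `dispatch_mode` and the stub runners are hypothetical; in the harness the table would presumably map `raw` to the existing `run_impl()` and `full` to the new `run_full_impl()`, whose real arguments and result dicts are not shown here.

```python
from typing import Callable, Dict


def dispatch_mode(mode: str, runners: Dict[str, Callable[..., dict]], **kwargs) -> dict:
    """Look up the runner for `mode` and call it with the shared arguments."""
    try:
        runner = runners[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    return runner(**kwargs)


# Stub runners standing in for run_impl() and run_full_impl().
def _raw_stub(**kwargs) -> dict:
    return {"mode": "raw", **kwargs}


def _full_stub(**kwargs) -> dict:
    return {"mode": "full", **kwargs}


RUNNERS: Dict[str, Callable[..., dict]] = {"raw": _raw_stub, "full": _full_stub}

if __name__ == "__main__":
    print(dispatch_mode("full", RUNNERS, instance_id="example-1"))
```

A lookup table keeps `_cmd_run()` free of mode-specific branches, so a third mode later would only need a new entry.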

### Success Criteria

- [ ] `--mode raw --limit 1 --dry-run` works (existing behavior unchanged)
- [ ] `--mode full --limit 1 --dry-run` exercises the full mode path
- [ ] No-op PR and rebase kernels correctly skip GitHub operations
- [ ] All existing tests pass (14 tests)
- [ ] All new tests pass (5 tests)
- [ ] No modifications to upstream modules

### Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| `WorkflowContext.data` missing keys → KeyError | H | H | Mirror `run_impl_workflow()` exactly |
| `Session` requires AGENTIZE_HOME | M | M | Set env var; validate path |
| `run_planner_pipeline()` requires API | M | M | Planning optional; skip-planning flag later |
| `template_path` relative to AGENTIZE_HOME | M | M | Use `get_agentize_home()` |
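
As one illustration of the "set env var; validate path" mitigation, a pre-flight check could fail fast before any task starts. `resolve_agentize_home` is a hypothetical helper, not the existing `get_agentize_home()`, and the error wording is illustrative.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_agentize_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return AGENTIZE_HOME as a Path, raising early if it is unusable.

    `env` defaults to os.environ; passing a mapping makes this testable.
    """
    env = os.environ if env is None else env
    raw = env.get("AGENTIZE_HOME")
    if not raw:
        raise RuntimeError("AGENTIZE_HOME is not set; full mode requires it")
    home = Path(raw).expanduser()
    if not home.is_dir():
        raise RuntimeError(f"AGENTIZE_HOME is not a directory: {home}")
    return home
```

Running this once at the start of a `--mode full` run would surface a misconfigured environment before any planner or model calls are made.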

## Related PR

TBD — will be updated when the PR is created.

