[plan][feat]: Add raw and full execution modes to SWE-bench eval harness #974

@ayazhankadessova

Description

Add a --mode {raw,full} flag to the SWE-bench evaluation harness to enable A/B comparison between baseline Claude and the full agentize pipeline. Raw mode runs a bare claude -p subprocess. Full mode runs planning via run_planner_pipeline(), then implementation via the FSM orchestrator with review/simp iterations.

Raw mode tests the model alone (the baseline), while full mode tests the complete agentize framework; the planning step is the differentiator.

Proposed Solution

Adopt the kernel-substitution approach (~100 LOC) over a custom transition table (~415 LOC): replace pr_stage_kernel and rebase_stage_kernel with no-ops that emit pass events, and reuse the production transition table unchanged. Defer per-stage token tracking (TokenLedger) to a follow-up.

Consensus Summary

Adopt the reducer's kernel-substitution approach (~100 LOC) over the bold proposal's custom transition table (~415 LOC), validated by the critique's finding that run_fsm_orchestrator() does not accept a custom transition table. The key insight: replace pr_stage_kernel and rebase_stage_kernel with no-ops that emit pass events, reusing the production transition table unchanged. Defer per-stage token tracking (TokenLedger) to a follow-up — total tokens from the result dict suffice for MVP.

Goal

Add --mode {raw,full} to the SWE-bench eval harness so we can A/B compare baseline Claude (raw: bare claude -p) against the full agentize pipeline (full: planning via run_planner_pipeline() + implementation via FSM orchestrator with review/simp iterations).

Success criteria:

  • --mode raw preserves existing behavior (bare claude -p subprocess)
  • --mode full runs the planner pipeline, writes consensus plan as issue file, then drives the FSM orchestrator through impl → review → simp → finish (skipping PR/rebase via kernel substitution)
  • Patches from both modes are extractable via git diff and scoreable by SWE-bench Docker evaluator
  • No modifications to upstream impl.py, orchestrator.py, pipeline.py, or kernels.py
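For the patch-extraction criterion, one possible shape is sketched below. The helper name `extract_patch()` is hypothetical, and the sketch assumes the agent leaves its changes uncommitted in the worktree; if a mode commits its work, the diff would have to be taken against the task's base commit instead.

```python
import subprocess
from pathlib import Path


def extract_patch(wt_path: Path) -> str:
    """Return a unified diff of all uncommitted changes in the worktree."""
    # Stage everything first so newly created files show up in the diff.
    subprocess.run(["git", "add", "-A"], cwd=wt_path, check=True)
    result = subprocess.run(
        ["git", "diff", "--cached"],
        cwd=wt_path,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Using the same helper for both modes keeps the scoring input identical in format, so any difference in results comes from the mode rather than from how the patch was collected.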

Out of scope:

  • Per-stage token tracking (TokenLedger) — ✅ Good to have in the future
  • Parallel task execution — ✅ Good to have in the future
  • Resume support for full mode — ✅ Good to have in the future

Codebase Analysis

Files verified (docs/code checked by agents):

  • python/agentize/workflow/impl/orchestrator.py: Confirmed run_fsm_orchestrator() accepts kernels: KernelRegistry dict — kernel substitution is the correct approach
  • python/agentize/workflow/impl/kernels.py: Confirmed KERNELS dict maps stages to handlers, can be copied and modified
  • python/agentize/workflow/impl/checkpoint.py: Confirmed create_initial_state(issue_no, worktree, plan_file) factory
  • python/agentize/workflow/impl/state.py: Confirmed WorkflowContext, StageResult, and all event constants
  • python/agentize/workflow/planner/pipeline.py: Confirmed run_planner_pipeline(feature_desc, output_dir=...) returns dict with consensus text
  • python/agentize/eval/eval_harness.py: Confirmed current structure with run_impl(), write_overrides(), _cmd_run()

File changes:

| File | Level | Purpose |
| --- | --- | --- |
| python/agentize/eval/eval_harness.py | major | Add --mode flag, run_planning_phase(), run_full_impl(), eval kernel stubs, extended shell overrides (~100 LOC) |
| python/agentize/eval/eval_harness.md | medium | Document dual-mode architecture, full mode data flow, kernel substitution rationale |
| python/tests/test_eval_harness.py | medium | Add tests for new functionality (~40 LOC) |

Interface Design

New interfaces:

  1. _eval_pr_kernel(context) -> StageResult — No-op returning EVENT_PR_PASS
  2. _eval_rebase_kernel(context) -> StageResult — No-op returning EVENT_REBASE_OK
  3. run_planning_phase(problem_statement, output_dir, model) -> str — Runs planner pipeline, returns formatted issue content (~25 LOC)
  4. run_full_impl(wt_path, instance_id, problem_statement, timeout, model, enable_review, enable_simp, max_iterations) -> dict — Drives FSM orchestrator with kernel substitution (~55 LOC)
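A rough sketch of how run_planning_phase() might look follows. Several details are assumptions rather than facts from the codebase: the `planner` parameter is added here only so the example runs without agentize installed (the planned signature has three parameters), the result key `"consensus"` is a guess at where the consensus text lives, the section headings in the returned text are invented, and the fallback to the bare problem statement is one possible policy.

```python
from pathlib import Path
from typing import Callable


def run_planning_phase(
    problem_statement: str,
    output_dir: Path,
    model: str,
    planner: Callable[..., dict],  # hypothetical injection point for run_planner_pipeline
) -> str:
    """Run the planner and return issue-file content for the implementation stage."""
    output_dir.mkdir(parents=True, exist_ok=True)

    # How the model is forwarded to the planner is not specified in this plan,
    # so the sketch leaves `model` unused.
    result = planner(problem_statement, output_dir=output_dir)

    # Assumed key name; the plan only says the dict contains the consensus text.
    consensus = str(result.get("consensus", "")).strip()
    if not consensus:
        # Assumed fallback: hand the raw problem statement to the implementer.
        return problem_statement

    return (
        f"## Problem\n\n{problem_statement}\n\n"
        f"## Consensus plan\n\n{consensus}\n"
    )
```

In the harness itself the planner would presumably be run_planner_pipeline, and the returned string would be written out as the issue file that the FSM orchestrator reads.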

Modified interfaces:

CLI argument parser:

  run_parser.add_argument("--model", default="sonnet")
+ run_parser.add_argument("--mode", choices=["raw", "full"], default="raw")
+ run_parser.add_argument("--enable-review", action="store_true", default=False)
+ run_parser.add_argument("--enable-simp", action="store_true", default=False)
+ run_parser.add_argument("--max-iterations", type=int, default=10)
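The flags above could be wired up roughly as follows. This is a self-contained sketch: the `prog` name, the `run` subcommand layout, and the `select_runner()` helper are assumptions made for illustration, not the harness's actual structure.

```python
import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="eval_harness")
    sub = parser.add_subparsers(dest="command", required=True)

    run_parser = sub.add_parser("run")
    run_parser.add_argument("--model", default="sonnet")
    # Defaulting to "raw" keeps existing invocations behaving as before.
    run_parser.add_argument("--mode", choices=["raw", "full"], default="raw")
    # store_true already implies default=False.
    run_parser.add_argument("--enable-review", action="store_true")
    run_parser.add_argument("--enable-simp", action="store_true")
    run_parser.add_argument("--max-iterations", type=int, default=10)
    return parser


def select_runner(args: argparse.Namespace, raw_runner: Callable, full_runner: Callable) -> Callable:
    """Hypothetical dispatch helper: pick the implementation path by mode."""
    return full_runner if args.mode == "full" else raw_runner
```

For example, `build_parser().parse_args(["run", "--mode", "full", "--enable-review"])` yields `mode="full"` and `enable_review=True`, while omitting `--mode` falls back to the raw path.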

write_overrides():

- git() { case "$1" in push) ... ;; esac; }
+ git() { case "$1" in push|fetch|rebase) ... ;; esac; }
- wt() { echo "STUB: wt skipped"; }
+ wt() { case "$1" in pathto) echo "." ;; *) echo "STUB: wt $1 skipped" ;; esac; }

Implementation Steps

Step 1: Update documentation (Est: 20 LOC)
Step 2: Add tests (Est: 40 LOC) — 5 new test cases
Step 3: Implement dual-mode (Est: 100 LOC) — kernel stubs, planning phase, full impl, CLI flags, mode dispatch

Total: ~160 LOC — Medium — Single session

Success Criteria

  • --mode raw --limit 1 --dry-run works (existing behavior unchanged)
  • --mode full --limit 1 --dry-run exercises the full mode path
  • No-op PR and rebase kernels correctly skip GitHub operations
  • All existing tests pass (14 tests)
  • All new tests pass (5 tests)
  • No modifications to upstream modules

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| WorkflowContext.data missing keys → KeyError | H | H | Mirror run_impl_workflow() exactly |
| Session requires AGENTIZE_HOME | M | M | Set env var; validate path |
| run_planner_pipeline() requires API | M | M | Planning optional; skip-planning flag later |
| template_path relative to AGENTIZE_HOME | M | M | Use get_agentize_home() |
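Two of these mitigations lend themselves to early, explicit checks before the orchestrator starts. The sketch below is illustrative only: `resolve_agentize_home()` is a stand-in and not the existing get_agentize_home(), `missing_context_keys()` is a hypothetical helper, and the set of keys WorkflowContext.data actually requires is not listed in this plan.

```python
import os
from pathlib import Path
from typing import Iterable, List, Mapping, Optional


def resolve_agentize_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Fail fast with a clear message if AGENTIZE_HOME is unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get("AGENTIZE_HOME")
    if not raw:
        raise RuntimeError("AGENTIZE_HOME is not set")
    home = Path(raw).expanduser()
    if not home.is_dir():
        raise RuntimeError(f"AGENTIZE_HOME is not a directory: {home}")
    return home


def missing_context_keys(data: Mapping[str, object], required: Iterable[str]) -> List[str]:
    """List required keys absent from the context data, in the order given."""
    return [key for key in required if key not in data]
```

Checking up front turns a mid-run KeyError into a readable error before any model tokens are spent on the task.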

Related PR

TBD — will be updated when the PR is created.

Metadata

Assignees

No one assigned

Labels

agentize:plan (Plan created by /ultra-planner command)
