Description
Add a --mode {raw,full} flag to the SWE-bench evaluation harness to enable A/B comparison between baseline Claude (raw mode: bare claude -p subprocess) and the full agentize pipeline (full mode: planning via run_planner_pipeline() + implementation via FSM orchestrator with review/simp iterations).
Raw mode tests the model alone (the baseline), while full mode tests the complete agentize framework; the planning step is the differentiator.
Proposed Solution
Adopt the kernel-substitution approach (~100 LOC) over a custom transition table (~415 LOC). Replace pr_stage_kernel and rebase_stage_kernel with no-ops that emit pass events, reusing the production transition table unchanged. Defer per-stage token tracking (TokenLedger) to a follow-up.
Consensus Summary
Adopt the reducer's kernel-substitution approach (~100 LOC) over the bold proposal's custom transition table (~415 LOC), validated by the critique's finding that run_fsm_orchestrator() does not accept a custom transition table. The key insight: replace pr_stage_kernel and rebase_stage_kernel with no-ops that emit pass events, reusing the production transition table unchanged. Defer per-stage token tracking (TokenLedger) to a follow-up — total tokens from the result dict suffice for MVP.
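A minimal sketch of the kernel-substitution idea, under stated assumptions: the `StageResult` and `WorkflowContext` classes, the event constant values, the `"pr"` / `"rebase"` stage keys, and the `build_eval_kernels` helper are all stand-ins invented for illustration. The real types live in `state.py` and `kernels.py` and may differ in shape.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Stand-in event constants; the real values are defined in state.py.
EVENT_PR_PASS = "pr_pass"
EVENT_REBASE_OK = "rebase_ok"


@dataclass
class StageResult:
    """Hypothetical stand-in for the real StageResult."""
    event: str
    reason: str = ""


@dataclass
class WorkflowContext:
    """Hypothetical stand-in for the real WorkflowContext."""
    data: dict = field(default_factory=dict)


Kernel = Callable[[WorkflowContext], StageResult]


def pr_stage_kernel(context: WorkflowContext) -> StageResult:
    # Placeholder for the production kernel, which would talk to GitHub.
    raise RuntimeError("production PR kernel must not run during eval")


def rebase_stage_kernel(context: WorkflowContext) -> StageResult:
    # Placeholder for the production kernel, which would fetch and rebase.
    raise RuntimeError("production rebase kernel must not run during eval")


# Stand-in for the production registry; the stage keys are assumptions.
KERNELS: Dict[str, Kernel] = {
    "pr": pr_stage_kernel,
    "rebase": rebase_stage_kernel,
}


def _eval_pr_kernel(context: WorkflowContext) -> StageResult:
    """No-op: report a passing PR stage without touching GitHub."""
    return StageResult(EVENT_PR_PASS, "PR stage skipped in eval")


def _eval_rebase_kernel(context: WorkflowContext) -> StageResult:
    """No-op: report a successful rebase without running git."""
    return StageResult(EVENT_REBASE_OK, "rebase stage skipped in eval")


def build_eval_kernels(base: Dict[str, Kernel]) -> Dict[str, Kernel]:
    """Copy the registry and swap in the two no-op kernels."""
    kernels = dict(base)  # shallow copy; the production dict stays untouched
    kernels["pr"] = _eval_pr_kernel
    kernels["rebase"] = _eval_rebase_kernel
    return kernels
```

Because only the registry copy changes, the transition table sees the same pass events it would see after a real PR and rebase, which is why it can be reused as-is.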
Goal
Add --mode {raw,full} to the SWE-bench eval harness so we can A/B compare baseline Claude (raw: bare claude -p) against the full agentize pipeline (full: planning via run_planner_pipeline() + implementation via FSM orchestrator with review/simp iterations).
Success criteria:
- `--mode raw` preserves existing behavior (bare `claude -p` subprocess)
- `--mode full` runs the planner pipeline, writes the consensus plan as an issue file, then drives the FSM orchestrator through impl → review → simp → finish (skipping PR/rebase via kernel substitution)
- Patches from both modes are extractable via `git diff` and scoreable by the SWE-bench Docker evaluator
- No modifications to upstream `impl.py`, `orchestrator.py`, `pipeline.py`, or `kernels.py`
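The patch-extraction criterion could be met with something like the sketch below. The `extract_patch` name is hypothetical and not part of the harness; the only assumption about the environment is that `git` is on `PATH` and the worktree is a git checkout.

```python
import subprocess
from pathlib import Path


def extract_patch(worktree: Path) -> str:
    """Return the unified diff of unstaged changes to tracked files."""
    completed = subprocess.run(
        ["git", "diff"],
        cwd=worktree,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git itself fails
    )
    return completed.stdout
```

Note that plain `git diff` omits untracked files and staged changes, so a mode that creates new files or stages its work would need to account for that before the diff is taken.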
Out of scope:
- Per-stage token tracking (TokenLedger): ✅ Good to have in the future
- Parallel task execution: ✅ Good to have in the future
- Resume support for `full` mode: ✅ Good to have in the future
Codebase Analysis
Files verified (docs/code checked by agents):
- `python/agentize/workflow/impl/orchestrator.py`: Confirmed `run_fsm_orchestrator()` accepts a `kernels: KernelRegistry` dict, so kernel substitution is the correct approach
- `python/agentize/workflow/impl/kernels.py`: Confirmed `KERNELS` dict maps stages to handlers and can be copied and modified
- `python/agentize/workflow/impl/checkpoint.py`: Confirmed `create_initial_state(issue_no, worktree, plan_file)` factory
- `python/agentize/workflow/impl/state.py`: Confirmed `WorkflowContext`, `StageResult`, and all event constants
- `python/agentize/workflow/planner/pipeline.py`: Confirmed `run_planner_pipeline(feature_desc, output_dir=...)` returns a dict with the consensus text
- `python/agentize/eval/eval_harness.py`: Confirmed current structure with `run_impl()`, `write_overrides()`, `_cmd_run()`
File changes:
| File | Level | Purpose |
|---|---|---|
| `python/agentize/eval/eval_harness.py` | major | Add `--mode` flag, `run_planning_phase()`, `run_full_impl()`, eval kernel stubs, extended shell overrides (~100 LOC) |
| `python/agentize/eval/eval_harness.md` | medium | Document dual-mode architecture, full mode data flow, kernel substitution rationale |
| `python/tests/test_eval_harness.py` | medium | Add tests for new functionality (~40 LOC) |
Interface Design
New interfaces:
- `_eval_pr_kernel(context) -> StageResult`: No-op returning `EVENT_PR_PASS`
- `_eval_rebase_kernel(context) -> StageResult`: No-op returning `EVENT_REBASE_OK`
- `run_planning_phase(problem_statement, output_dir, model) -> str`: Runs the planner pipeline, returns formatted issue content (~25 LOC)
- `run_full_impl(wt_path, instance_id, problem_statement, timeout, model, enable_review, enable_simp, max_iterations) -> dict`: Drives the FSM orchestrator with kernel substitution (~55 LOC)
Modified interfaces:
CLI argument parser:

      run_parser.add_argument("--model", default="sonnet")
    + run_parser.add_argument("--mode", choices=["raw", "full"], default="raw")
    + run_parser.add_argument("--enable-review", action="store_true", default=False)
    + run_parser.add_argument("--enable-simp", action="store_true", default=False)
    + run_parser.add_argument("--max-iterations", type=int, default=10)

`write_overrides()`:

    - git() { case "$1" in push) ... ;; esac; }
    + git() { case "$1" in push|fetch|rebase) ... ;; esac; }
    - wt() { echo "STUB: wt skipped"; }
    + wt() { case "$1" in pathto) echo "." ;; *) echo "STUB: wt $1 skipped" ;; esac; }

Implementation Steps
Step 1: Update documentation (Est: 20 LOC)
Step 2: Add tests (Est: 40 LOC) — 5 new test cases
Step 3: Implement dual-mode (Est: 100 LOC) — kernel stubs, planning phase, full impl, CLI flags, mode dispatch
Total: ~160 LOC — Medium — Single session
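As a rough illustration of the CLI flags and mode dispatch from Step 3, the sketch below wires the proposed arguments into a standalone parser. The `eval_harness` program name, the `run` subcommand, and the `build_parser` / `select_runner` helpers are assumptions made for this example; the real harness may structure its parser differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser carrying the flags proposed in this issue."""
    parser = argparse.ArgumentParser(prog="eval_harness")  # hypothetical name
    sub = parser.add_subparsers(dest="command", required=True)
    run_parser = sub.add_parser("run")  # assumed subcommand
    run_parser.add_argument("--model", default="sonnet")
    run_parser.add_argument("--mode", choices=["raw", "full"], default="raw")
    run_parser.add_argument("--enable-review", action="store_true", default=False)
    run_parser.add_argument("--enable-simp", action="store_true", default=False)
    run_parser.add_argument("--max-iterations", type=int, default=10)
    return parser


def select_runner(mode: str) -> str:
    """Map a mode to the name of the function that would handle it."""
    runners = {"raw": "run_impl", "full": "run_full_impl"}
    return runners[mode]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"mode={args.mode} -> {select_runner(args.mode)}")
```

Defaulting `--mode` to `raw` keeps existing invocations unchanged, which matches the requirement that raw mode preserves current behavior.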
Success Criteria
- `--mode raw --limit 1 --dry-run` works (existing behavior unchanged)
- `--mode full --limit 1 --dry-run` exercises the full mode path
- No-op PR and rebase kernels correctly skip GitHub operations
- All existing tests pass (14 tests)
- All new tests pass (5 tests)
- No modifications to upstream modules
Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| `WorkflowContext.data` missing keys → `KeyError` | H | H | Mirror `run_impl_workflow()` exactly |
| Session requires `AGENTIZE_HOME` | M | M | Set env var; validate path |
| `run_planner_pipeline()` requires API | M | M | Planning optional; skip-planning flag later |
| `template_path` relative to `AGENTIZE_HOME` | M | M | Use `get_agentize_home()` |
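The "set env var; validate path" mitigation could take the form of a pre-flight check run before full mode starts. This is a sketch only: `require_agentize_home` is a hypothetical helper written for this example, separate from the existing `get_agentize_home()`, whose behavior is not shown here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def require_agentize_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return AGENTIZE_HOME as a Path, or fail early with a clear message."""
    env = os.environ if env is None else env
    raw = env.get("AGENTIZE_HOME")
    if not raw:
        raise RuntimeError("AGENTIZE_HOME is not set")
    home = Path(raw).expanduser()
    if not home.is_dir():
        raise RuntimeError(f"AGENTIZE_HOME is not a directory: {home}")
    return home
```

Failing before the planner or orchestrator starts turns a mid-run session error into an immediate, readable one.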
Related PR
TBD — will be updated when the PR is created.