feat: add SWE-bench Pro benchmark and preflight check system#3

Draft
greynewell wants to merge 12 commits into main from feat/swe-bench-pro

Conversation

@greynewell greynewell commented Feb 25, 2026

Summary

  • SWE-bench Pro benchmark: Multi-language benchmark support (Python, Go, TypeScript, JavaScript) with 731 instances from ScaleAI/SWE-bench_Pro
  • Preflight check system: mcpbr preflight CLI command validates golden patches pass all tests in Docker environments before running agent evaluations
  • Docker image override: Support for DockerHub-hosted images (_image_override, _workdir_override) alongside existing GHCR pattern

Key implementation details

  • SWEBenchProBenchmark class with language-specific test runners (Go: go test, TS/JS: npx jest)
  • Docker entrypoint override for images with /bin/bash entrypoint (SWE-bench Pro images)
  • Editable reinstall (pip install -e .) after patching for SWE-bench Pro Python images (they install to site-packages, not editable mode)
  • Case-insensitive test list field access (fail_to_pass/FAIL_TO_PASS)
  • Skip Claude CLI installation during preflight (not needed, avoids OOM under QEMU)
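
The case-insensitive field access mentioned above can be sketched as follows. This is a minimal hypothetical version (the real `get_test_list_field` in `evaluation.py` may have a different signature), assuming the lowercase key takes precedence as the PR's tests suggest:

```python
def get_test_list_field(task: dict, field: str) -> list:
    """Return the test list stored under either naming convention.

    SWE-bench uses uppercase keys (FAIL_TO_PASS); SWE-bench Pro uses
    lowercase (fail_to_pass). Lowercase is checked first.
    """
    for key in (field.lower(), field.upper()):
        if key in task:
            return task[key]
    return []
```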

Preflight validation status

  • All Python instances passing (qutebrowser, ansible, openlibrary)
  • Go/JS/TS instances require x86_64 host (QEMU emulation on ARM64 causes binary execution failures)
  • Full 731-instance validation in progress

Test plan

  • All unit tests pass (4407/4407)
  • Ruff lint clean
  • mypy clean (137 source files)
  • Pre-commit hooks pass
  • Smoke test: 1 Python instance passes preflight (4/4 FTP, 10/10 PTP)
  • Full preflight validation across all 731 instances (in progress)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • SWE-bench Pro: multi-language benchmark (Python, Go, TypeScript, JavaScript) with many instances and prebuilt Docker images
    • preflight command: validate golden patches before evaluation with concurrency, fail-fast, per-instance and aggregate reporting
    • Per-task Docker image, working directory and entrypoint overrides; ability to specify prebuilt images
  • Behavior

    • Case-insensitive test-list field access for improved metadata compatibility
  • Tests

    • Extensive test coverage for SWE-bench Pro and preflight validation features

Add multi-language benchmark support (Python, Go, TypeScript, JavaScript)
with 731 instances from ScaleAI/SWE-bench_Pro, plus a preflight validation
system that verifies golden patches pass all tests before agent evaluation.

Key changes:
- SWEBenchProBenchmark class with DockerHub image support and
  language-specific test runners (Go: go test, TS/JS: npx jest)
- Preflight check system (mcpbr preflight CLI command) that validates
  golden patches in Docker environments
- Docker image override support (_image_override, _workdir_override)
  for non-GHCR registries
- Entrypoint override for images with /bin/bash entrypoint
- Editable reinstall after patching for SWE-bench Pro Python images
- Case-insensitive test list field access (fail_to_pass/FAIL_TO_PASS)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
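
The language-specific test runners described above can be illustrated with a small sketch. The mapping and function names mirror `LANGUAGE_TO_TEST_CMD` and `get_language_test_command` from the PR, but the exact flags and structure here are assumptions, not the actual implementation:

```python
import shlex

# Illustrative mapping only; the real LANGUAGE_TO_TEST_CMD may differ in flags.
LANGUAGE_TO_TEST_CMD = {
    "go": lambda t: f"go test -v -count=1 -run {shlex.quote(t)} ./...",
    "typescript": lambda t: f"npx jest -t {shlex.quote(t)}",
    "javascript": lambda t: f"npx jest -t {shlex.quote(t)}",
    "python": lambda t: f"python -m pytest {shlex.quote(t)}",
}

def get_language_test_command(language: str, test: str) -> str:
    """Build a shell command to run a single test for the given language."""
    builder = LANGUAGE_TO_TEST_CMD.get(language)
    if builder is None:
        raise ValueError(f"unsupported language: {language}")
    return builder(test)
```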

coderabbitai bot commented Feb 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds SWE‑bench Pro multi-language benchmark and wiring; introduces an async preflight validator that applies golden patches and runs language-aware tests in Docker with image/workdir/entrypoint overrides; adds a CLI preflight command, case-insensitive test-list access, Python post-patch reinstall, and extensive tests and changelog updates.

Changes

  • Benchmark registration & config (src/mcpbr/config.py, src/mcpbr/benchmarks/__init__.py): Register swe-bench-pro in VALID_BENCHMARKS and BENCHMARK_REGISTRY; export SWEBenchProBenchmark; special-case dataset mapping in the factory.
  • SWE-bench Pro implementation (src/mcpbr/benchmarks/swebench_pro.py): New multi-language benchmark supporting python/go/typescript/javascript: task loading/filtering, normalization, Docker image/workdir/entrypoint overrides, multi-language evaluation and test-command builders, prebuilt-image logic, prompt template.
  • Preflight orchestration (src/mcpbr/benchmark_preflight.py): New PreflightResult/PreflightReport dataclasses and async orchestration: per-instance Docker setup, golden/test patch application, language-aware test runs, concurrency, timeouts, fail-fast, aggregation, and cleanup.
  • CLI (src/mcpbr/cli.py): New preflight CLI command (appears duplicated in the file): loads benchmark tasks, runs preflight, prints per-instance results and a summary, and exits nonzero on failures.
  • Docker environment manager (src/mcpbr/docker_env.py): Add _try_pull_image; support task "_image_override" and "_workdir_override"; conditional entrypoint override (tail keep-alive); optional skip of CLI install; dynamic workspace copy behavior.
  • Evaluation & test handling (src/mcpbr/evaluation.py, src/mcpbr/swebench_test_specs.py): Add get_test_list_field (case-insensitive key lookup) and _normalize_test_id; quote pytest -k-style test ids; post-patch pip install -e . for SWE-bench Pro Python images; add LANGUAGE_TO_TEST_CMD and get_language_test_command.
  • Tests & changelog (CHANGELOG.md, tests/test_benchmark_preflight.py, tests/test_swebench_pro.py): Changelog entries; extensive unit tests for preflight flows, SWE-bench Pro normalization, test-command generation, image/workdir overrides, and aggregated reporting.
  • Minor (src/mcpbr/benchmarks/codegraph.py): Small datetime import/formatting change to use UTC isoformat.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as CLI (preflight)
    participant Bench as SWE-bench Pro
    participant Pref as Preflight Orchestrator
    participant DockerMgr as Docker Manager
    participant Container as Container

    User->>CLI: invoke preflight --benchmark swe-bench-pro ...
    CLI->>Bench: load_tasks(filter, sample_size)
    Bench-->>CLI: tasks[]
    CLI->>Pref: run_benchmark_preflight(tasks, max_concurrent, fail_fast)

    par concurrent checks
        Pref->>DockerMgr: create_environment(task)
        DockerMgr->>DockerMgr: _try_pull_image(_image_override?)
        DockerMgr->>Container: start(workdir=_workdir_override?, entrypoint_override?)
        Container-->>Pref: TaskEnvironment ready

        Pref->>Container: apply golden patch
        Pref->>Container: (optional) apply test patch
        Pref->>Container: (if python) pip install -e .
        Pref->>Container: run fail_to_pass tests
        Pref->>Container: run pass_to_pass tests
        Container-->>Pref: test results
        Pref->>Pref: record PreflightResult, cleanup container
    end

    Pref-->>CLI: PreflightReport (per-instance results, success_rate)
    CLI->>User: print report, exit code (0/1)
sequenceDiagram
    actor Evaluator
    participant Bench as SWE-bench Pro
    participant Env as TaskEnvironment
    participant Docker as Container
    participant Runner as Language Test Runner

    Evaluator->>Bench: evaluate(env, task, solution_patch)
    Bench->>Env: write patch & git apply
    Env->>Docker: prepare workspace

    alt language == python
        Env->>Docker: pip install -e .
        Docker-->>Runner: run pytest (normalized, quoted ids)
    else language == go
        Docker-->>Runner: go test -v ...
    else language == ts/js
        Docker-->>Runner: npx jest ...
    end

    Runner-->>Bench: pass/fail, outputs
    Bench-->>Evaluator: evaluation dict (status, ftp/ptp metrics)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🌊 Docker wakes, tests speak in many tongues,
Preflight tends golden patches, one by one.
Go, Python, JS — they run, they report,
Concurrent checks tidy the CI port.
Small flags, big care — smoother runs begun.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 43.01%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly summarizes the main changes: adding SWE-bench Pro benchmark support and a preflight validation system, which are the core additions across all modified files.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mcpbr/docker_env.py (1)

886-892: ⚠️ Potential issue | 🟡 Minor

Bug: error message still says "/testbed" instead of using source_dir.

You updated the copy commands (lines 816, 862, 866) to use the dynamic source_dir, but the final error message on line 889 is hardcoded to say "from /testbed". If the image uses /app (like SWE-bench Pro), this error message will be misleading.

🐛 Proposed fix
         # --- Exhausted ---
         raise RuntimeError(
-            f"Workspace /workspace appears empty after copy from /testbed "
+            f"Workspace /workspace appears empty after copy from {source_dir} "
             f"(instance={env.instance_id}). "
             f"The filesystem may not have synced correctly."
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/docker_env.py` around lines 886 - 892, The error string raised
after the workspace copy still hardcodes "/testbed"; change that message to
reference the dynamic source_dir variable so it accurately reflects where files
were copied from (use the existing source_dir and env.instance_id in the
f-string). Locate the RuntimeError raise in docker_env.py (the block that
currently builds the message with env.instance_id) and replace "/testbed" with
{source_dir} in the formatted message so it matches the updated copy commands.
🧹 Nitpick comments (7)
src/mcpbr/benchmarks/__init__.py (1)

143-145: Prefer setdefault so explicit dataset= overrides still work.

Right now this always overwrites caller intent.

💡 Proposed fix
-    if name == "swe-bench-pro":
-        kwargs["dataset"] = "ScaleAI/SWE-bench_Pro"
+    if name == "swe-bench-pro":
+        kwargs.setdefault("dataset", "ScaleAI/SWE-bench_Pro")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/__init__.py` around lines 143 - 145, The current
conditional for name == "swe-bench-pro" unconditionally assigns
kwargs["dataset"] = "ScaleAI/SWE-bench_Pro", which overwrites any
caller-provided dataset; change this to only apply the default when no dataset
was provided by using kwargs.setdefault("dataset", "ScaleAI/SWE-bench_Pro")
inside the same branch so explicit dataset= callers continue to take precedence.
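
The behavioral difference between direct assignment and setdefault can be seen in a short standalone example (the caller-supplied dataset name here is hypothetical):

```python
# Caller-provided dataset survives setdefault, unlike direct assignment.
kwargs = {"dataset": "my-org/custom-dataset"}  # hypothetical caller value
kwargs.setdefault("dataset", "ScaleAI/SWE-bench_Pro")
assert kwargs["dataset"] == "my-org/custom-dataset"

# When the caller supplies nothing, the default is applied.
defaults = {}
defaults.setdefault("dataset", "ScaleAI/SWE-bench_Pro")
assert defaults["dataset"] == "ScaleAI/SWE-bench_Pro"
```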
tests/test_swebench_pro.py (1)

96-99: Loose assertion for Python delegation — consider tightening.

The assertion "pytest" in cmd or "test_foo" in cmd is pretty permissive. It basically says "as long as the command mentions something vaguely test-related, we're good." Since _build_pro_test_command for Python delegates to _build_test_command, you could assert more specifically — for example, that "pytest" is in the command. If the delegation ever breaks and returns garbage, this test might still pass.

Not a big deal though — this is just a nit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_swebench_pro.py` around lines 96 - 99, The test
test_python_delegates in tests/test_swebench_pro.py currently uses a loose
assertion; tighten it to verify Python delegation by asserting that
_build_pro_test_command("tests/test_foo.py::test_bar", "python") produces a
command that contains "pytest" (i.e., replace the or-check with a single assert
"pytest" in cmd) to ensure it truly delegates to _build_test_command.
src/mcpbr/docker_env.py (2)

448-470: Consider DRYing up _try_pull_image and _try_pull_prebuilt.

These two methods are nearly identical — both pull an image with platform="linux/amd64" and catch the same exceptions. The only difference is that _try_pull_prebuilt calls get_swebench_image_name() first to compute the image name.

You could have _try_pull_prebuilt just call _try_pull_image(get_swebench_image_name(instance_id)) internally. Not critical, but it removes a bit of duplication.

♻️ Suggested refactor
     async def _try_pull_prebuilt(self, instance_id: str) -> str | None:
         image_name = get_swebench_image_name(instance_id)
-
-        def _pull() -> str | None:
-            try:
-                self.client.images.pull(image_name, platform="linux/amd64")
-                return image_name
-            except docker.errors.ImageNotFound:
-                return None
-            except docker.errors.APIError:
-                return None
-
-        loop = asyncio.get_event_loop()
-        return await loop.run_in_executor(None, _pull)
+        return await self._try_pull_image(image_name)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/docker_env.py` around lines 448 - 470, Refactor the duplicate
image-pulling logic by having _try_pull_prebuilt call the existing
_try_pull_image instead of duplicating the pull/catch logic: replace the body of
_try_pull_prebuilt so it computes the image via
get_swebench_image_name(instance_id) and returns await
_try_pull_image(image_name). Keep _try_pull_image as the single implementation
that does client.images.pull(..., platform="linux/amd64") and handles
docker.errors.ImageNotFound and docker.errors.APIError via run_in_executor;
remove the redundant pull logic from _try_pull_prebuilt.

617-619: Entrypoint override is applied for all _image_override images — is that always correct?

Right now, has_entrypoint_override = bool(task.get("_image_override")) means every single override image gets the entrypoint replaced. This works fine if all SWE-bench Pro images set /bin/bash as their entrypoint (which the PR description says they do). But if someone later adds an override image that doesn't set an entrypoint, the behavior is still harmless — the entrypoint just becomes ["/bin/sh", "-c", "tail -f /dev/null"] which achieves the same keep-alive as the normal command="tail -f /dev/null".

So this is safe, just a bit of an overly broad brush. Worth a comment noting why this is safe for all cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/docker_env.py` around lines 617 - 619, Add a short inline comment
next to the has_entrypoint_override = bool(task.get("_image_override"))
assignment explaining the rationale: we intentionally treat every
_image_override as having an entrypoint because if an override image actually
lacks an entrypoint, replacing it with our keep-alive entrypoint
(["/bin/sh","-c","tail -f /dev/null"]) is harmless and functionally equivalent
to setting command="tail -f /dev/null"; reference the has_entrypoint_override
variable and the task.get("_image_override") check in the comment so future
readers understand this intentional and safe behavior.
tests/test_benchmark_preflight.py (1)

315-340: Consider hoisting the import to the top of the file or class level.

Right now, from mcpbr.evaluation import get_test_list_field is repeated in every single test method (lines 319, 325, 331, 337). It works fine, but it's a bit noisy. You could import it once at the top of the file alongside the other imports (lines 3-12) and keep things DRY.

That said, this is purely a readability nit — no functional issue.

♻️ Suggested refactor
 from mcpbr.benchmark_preflight import (
     PreflightReport,
     PreflightResult,
     _check_single_instance,
     run_benchmark_preflight,
 )
+from mcpbr.evaluation import get_test_list_field

Then remove the from mcpbr.evaluation import get_test_list_field lines inside each test method.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_benchmark_preflight.py` around lines 315 - 340, Hoist the repeated
import of get_test_list_field out of each test method by adding a single "from
mcpbr.evaluation import get_test_list_field" alongside the module-level imports
and deleting the per-test imports inside the TestGetTestListField methods
(test_lowercase_field, test_uppercase_field, test_lowercase_preferred,
test_missing_field) so the tests use the shared top-level import.
src/mcpbr/benchmarks/swebench_pro.py (2)

241-243: pass_to_pass_tests[:10] silently truncates — could hide regressions.

You're capping pass-to-pass tests at 10 for performance, which makes sense for a large test suite. But if a patch breaks test #11, you'd never know. Consider at least logging when truncation happens, so people debugging a missed regression have a breadcrumb.

Something like:

if len(pass_to_pass_tests) > 10:
    logger.info("Truncating pass_to_pass from %d to 10 tests", len(pass_to_pass_tests))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 241 - 243, The call to
_run_lang_tests is silently truncating pass_to_pass_tests with
pass_to_pass_tests[:10]; before calling _run_lang_tests compute the intended
slice (e.g., selected_tests = pass_to_pass_tests[:10]) and if
len(pass_to_pass_tests) > len(selected_tests) emit a logger.info message noting
the original count and the truncated count (for example: "Truncating
pass_to_pass from %d to %d tests"), then pass selected_tests into
_run_lang_tests and assign the result to pass_to_pass_results; reference the
variables pass_to_pass_tests, selected_tests, _run_lang_tests, and
pass_to_pass_results when making this change.

409-416: Shell metacharacters in test identifiers are interpolated unsanitized.

The test string is f-string'd directly into the shell command (e.g., f"go test -v -count=1 -run '{test}' ./... 2>&1"). Since these come from the HuggingFace dataset (not arbitrary user input), the practical risk is low. But if a test name ever contained a single quote like Test'Something, the command would break or behave unexpectedly.

Consider using shlex.quote(test) for defense-in-depth:

import shlex
return f"{activate}go test -v -count=1 -run {shlex.quote(test)} ./... 2>&1"

Same applies to the TS/JS commands at lines 422 and 425.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 409 - 416, The code
interpolates the test identifier directly into shell commands (e.g., in the
block checking language == "go") which can break or allow shell injection if the
test contains quotes or special chars; fix by importing shlex and using
shlex.quote(test) wherever test is placed in shell strings (for example in the
go branch return f"... -run {shlex.quote(test)} ..." and similarly for the TS/JS
command constructions that inject test), ensuring all interpolated values are
properly quoted before being embedded in the command string.
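
The quoting behavior suggested above can be verified directly. The test name with an embedded quote is a hypothetical example, not one taken from the dataset:

```python
import shlex

test = "Test'Something"  # hypothetical test name containing a single quote
# shlex.quote escapes the embedded quote so the shell sees one literal argument.
cmd = f"go test -v -count=1 -run {shlex.quote(test)} ./..."
# A benign name passes through unchanged.
safe = shlex.quote("TestFoo")
```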
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mcpbr/benchmark_preflight.py`:
- Around line 207-227: The code creates an asyncio.Semaphore(max_concurrent)
without validating max_concurrent, so passing 0 will deadlock; update the
function that declares max_concurrent to validate it before creating semaphore
(e.g., in the start of the function that constructs PreflightReport and
semaphore) by either raising a ValueError when max_concurrent < 1 or normalizing
it to a safe minimum like 1, then proceed to create semaphore =
asyncio.Semaphore(max_concurrent); ensure references to max_concurrent and
semaphore are adjusted accordingly.
- Around line 140-157: The preflight currently calls run_tests for ftp_results
and ptp_results which is Python-specific; instead detect the task language
(e.g., from task.get("language") or repo metadata) and dispatch to a
language-specific runner or a unified dispatcher before running tests. Update
the code around run_tests (the calls producing ftp_results and ptp_results) to:
if language is "python" call run_tests with the existing args (env,
fail_to_pass_tests / pass_to_pass_tests, timeout, uses_prebuilt,
workdir=eval_workdir, repo), otherwise call the appropriate runner (e.g.,
run_go_tests, run_js_tests/run_ts_tests) or a new
choose_runner_for_language(task_language, ...) function that accepts the same
parameters; ensure pass_to_pass_tests and fail_to_pass_tests are passed in the
right format for non-Python runners and preserve
timeout/uses_prebuilt/eval_workdir handling.
- Around line 123-127: The current preflight call swallows editable reinstall
failures by running env.exec_command with the command string "pip install -e .
-q 2>/dev/null || true"; remove the "|| true" and the stderr redirection so
install errors surface (i.e., call env.exec_command with a command that fails on
error, preserving stderr), keeping the same timeout and workdir (eval_workdir)
and invocation site (env.exec_command) so the preflight fails when the editable
install fails.

In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 224-231: The block that runs "pip install -e ." inside
_evaluate_multilang() is dead because evaluate() already returns early for
language == "python", so remove this unreachable code from
_evaluate_multilang(); alternatively, if a reinstall is actually needed for
Python evaluations, move the pip reinstall logic into evaluate() before the call
to evaluate_patch() (referencing evaluate(), _evaluate_multilang(),
evaluate_patch(), the language variable, and eval_workdir to locate the relevant
logic).

In `@src/mcpbr/cli.py`:
- Around line 1540-1642: The preflight command ignores the --config option;
update preflight to, when config_path is provided, load the YAML/JSON config
(e.g., with yaml.safe_load or json.load) and pass the parsed settings into
DockerEnvironmentManager instead of always instantiating
DockerEnvironmentManager(use_prebuilt=True). Modify the DockerEnvironmentManager
instantiation in preflight to accept the loaded config (e.g.,
DockerEnvironmentManager(config=cfg, use_prebuilt=True)) or apply the loaded
settings via its public API before use; ensure you handle file read errors and
missing keys gracefully and still default to use_prebuilt=True when no
config_path is supplied.

In `@src/mcpbr/evaluation.py`:
- Around line 404-413: The reinstall step for DockerHub Python images uses
eval_workdir (prebuilt path) which can be /testbed; change it to use the
task-specific workdir (e.g., task.get("workdir") or fallback to eval_workdir) so
SWE-bench Pro images run pip install in the actual repo directory. In the branch
that checks task.get("dockerhub_tag") and repo_language == "python", compute
task_workdir = task.get("workdir", eval_workdir) and pass that as the workdir
argument to env.exec_command (same place where "pip install -e . -q ..." is
invoked) so the reinstall runs in the correct directory.
- Around line 32-54: The evaluate_patch function currently reads uppercase
fields directly for Python tasks (variables fail_to_pass_tests /
pass_to_pass_tests) which breaks on SWE-bench Pro lowercase names; replace the
direct task.get(...) calls in evaluate_patch with calls to
get_test_list_field(task, "fail_to_pass") and get_test_list_field(task,
"pass_to_pass"), then pass those results through parse_test_list (e.g.,
fail_to_pass_tests = parse_test_list(get_test_list_field(task, "fail_to_pass"))
and similarly for pass_to_pass_tests) so Python handling matches the non-Python
branch and supports both conventions.


ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 655b467 and 5108867.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • CHANGELOG.md
  • src/mcpbr/benchmark_preflight.py
  • src/mcpbr/benchmarks/__init__.py
  • src/mcpbr/benchmarks/swebench_pro.py
  • src/mcpbr/cli.py
  • src/mcpbr/config.py
  • src/mcpbr/docker_env.py
  • src/mcpbr/evaluation.py
  • src/mcpbr/swebench_test_specs.py
  • tests/test_benchmark_preflight.py
  • tests/test_swebench_pro.py

Comment on lines +123 to +127
await env.exec_command(
    "pip install -e . -q 2>/dev/null || true",
    timeout=120,
    workdir=eval_workdir,
)

⚠️ Potential issue | 🟠 Major

Don’t hide editable reinstall failures in preflight.

Swallowing install errors (|| true) can yield false positives by testing stale installed code.

💡 Proposed fix
-        if eval_workdir and language == "python":
-            await env.exec_command(
-                "pip install -e . -q 2>/dev/null || true",
-                timeout=120,
-                workdir=eval_workdir,
-            )
+        if eval_workdir and language == "python":
+            exit_code, _stdout, stderr = await env.exec_command(
+                "pip install -e . -q",
+                timeout=120,
+                workdir=eval_workdir,
+            )
+            if exit_code != 0:
+                return PreflightResult(
+                    instance_id=instance_id,
+                    status="error",
+                    error=f"Editable reinstall failed: {stderr[:300]}",
+                    language=language,
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmark_preflight.py` around lines 123 - 127, The current
preflight call swallows editable reinstall failures by running env.exec_command
with the command string "pip install -e . -q 2>/dev/null || true"; remove the
"|| true" and the stderr redirection so install errors surface (i.e., call
env.exec_command with a command that fails on error, preserving stderr), keeping
the same timeout and workdir (eval_workdir) and invocation site
(env.exec_command) so the preflight fails when the editable install fails.

Comment on lines +207 to +227
    max_concurrent: int = 4,
    timeout: int = 300,
    fail_fast: bool = False,
) -> PreflightReport:
    """Run preflight validation on benchmark tasks.

    Applies golden patches and verifies all tests pass for each instance.

    Args:
        benchmark: Benchmark instance.
        tasks: List of task dictionaries to validate.
        docker_manager: Docker environment manager.
        max_concurrent: Maximum concurrent validations.
        timeout: Timeout per test in seconds.
        fail_fast: Stop on first failure.

    Returns:
        PreflightReport with aggregate results.
    """
    report = PreflightReport(total=len(tasks))
    semaphore = asyncio.Semaphore(max_concurrent)

⚠️ Potential issue | 🟠 Major

Validate max_concurrent before creating the semaphore.

max_concurrent=0 will block all checks forever in this flow.
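A small self-contained demonstration of the deadlock (hypothetical code, not the project's): `asyncio.Semaphore(0)` starts with zero permits, so the first `acquire()` waits forever, which is what every preflight worker would do.

```python
import asyncio

async def main() -> bool:
    sem = asyncio.Semaphore(0)  # zero permits: acquire() can never succeed
    try:
        await asyncio.wait_for(sem.acquire(), timeout=0.2)
        return False
    except asyncio.TimeoutError:
        return True  # acquire() was still waiting when the timeout fired

print(asyncio.run(main()))  # True — every worker would hang here
```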

💡 Proposed fix
 async def run_benchmark_preflight(
@@
 ) -> PreflightReport:
@@
-    report = PreflightReport(total=len(tasks))
+    if max_concurrent < 1:
+        raise ValueError("max_concurrent must be at least 1")
+
+    report = PreflightReport(total=len(tasks))
     semaphore = asyncio.Semaphore(max_concurrent)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmark_preflight.py` around lines 207 - 227, The code creates an
asyncio.Semaphore(max_concurrent) without validating max_concurrent, so passing
0 will deadlock; update the function that declares max_concurrent to validate it
before creating semaphore (e.g., in the start of the function that constructs
PreflightReport and semaphore) by either raising a ValueError when
max_concurrent < 1 or normalizing it to a safe minimum like 1, then proceed to
create semaphore = asyncio.Semaphore(max_concurrent); ensure references to
max_concurrent and semaphore are adjusted accordingly.

Comment on lines +224 to +231
# Reinstall package so patched code is active (SWE-bench Pro images
# install into site-packages, not editable mode)
if eval_workdir and language == "python":
await env.exec_command(
"pip install -e . -q 2>/dev/null || true",
timeout=120,
workdir=eval_workdir,
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the _evaluate_multilang function and where it's called
rg -n '_evaluate_multilang' --type=py -B5 -A10

Repository: supermodeltools/mcpbr

Length of output: 1446


🏁 Script executed:

# Look at the evaluate() method around lines 180-192
sed -n '175,195p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 850


🏁 Script executed:

# Read lines 193 onwards to see the _evaluate_multilang function, especially lines 224-231
sed -n '193,235p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 1696


Dead code: the pip install -e . block at lines 224-231 can never execute.

Here's what's happening — in the evaluate() method, there's an if/else split: if the language is Python, it handles evaluation right there and returns early (line 188). For everything else (non-Python languages), it delegates to _evaluate_multilang() at line 191.

So inside _evaluate_multilang(), the language parameter can only be something other than Python. The condition at line 226 (if eval_workdir and language == "python") will never be true — it's checking for a language that's already been ruled out upstream.

This pip reinstall block looks like it was copied from the Python evaluation path but never actually applies to multilang evaluation. If you need this for Python tasks, it belongs earlier in the evaluate() method (before calling evaluate_patch()). If not, just remove it.

💡 Proposed fix — remove the dead block
         if test_patch:
             await _apply_test_patch(env, test_patch, workdir=eval_workdir)
 
-        # Reinstall package so patched code is active (SWE-bench Pro images
-        # install into site-packages, not editable mode)
-        if eval_workdir and language == "python":
-            await env.exec_command(
-                "pip install -e . -q 2>/dev/null || true",
-                timeout=120,
-                workdir=eval_workdir,
-            )
-
         fail_to_pass_str = get_test_list_field(task, "fail_to_pass")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 224 - 231, The block that
runs "pip install -e ." inside _evaluate_multilang() is dead because evaluate()
already returns early for language == "python", so remove this unreachable code
from _evaluate_multilang(); alternatively, if a reinstall is actually needed for
Python evaluations, move the pip reinstall logic into evaluate() before the call
to evaluate_patch() (referencing evaluate(), _evaluate_multilang(),
evaluate_patch(), the language variable, and eval_workdir to locate the relevant
logic).

Comment on lines 1540 to 1642
@click.option(
"-c",
"--config",
"config_path",
type=click.Path(exists=True),
help="Path to configuration YAML file (used for Docker settings).",
)
@click.option(
"-b",
"--benchmark",
"benchmark_name",
type=click.Choice(list(VALID_BENCHMARKS)),
default="swe-bench-pro",
help="Benchmark to validate (default: swe-bench-pro).",
)
@click.option(
"-n",
"--sample",
"sample_size",
type=int,
default=None,
help="Number of instances to validate (default: all).",
)
@click.option(
"--task",
"task_ids",
multiple=True,
help="Specific task ID(s) to validate.",
)
@click.option(
"--max-concurrent",
type=int,
default=4,
help="Maximum concurrent validations (default: 4).",
)
@click.option(
"--fail-fast",
is_flag=True,
help="Stop on first failure.",
)
@click.option(
"--filter-category",
multiple=True,
help="Filter by language or repo substring.",
)
@click.option(
"--timeout",
type=int,
default=300,
help="Timeout per test in seconds (default: 300).",
)
def preflight(
config_path: str | None,
benchmark_name: str,
sample_size: int | None,
task_ids: tuple[str, ...],
max_concurrent: int,
fail_fast: bool,
filter_category: tuple[str, ...],
timeout: int,
) -> None:
"""Validate golden patches pass all tests before evaluation.

Runs the benchmark's golden (reference) patches against Docker
environments and verifies all tests pass. Use this to catch
environment or configuration issues before running agent evaluations.

\b
Examples:
mcpbr preflight -b swe-bench-pro -n 5 # Check 5 instances
mcpbr preflight --fail-fast # Stop on first failure
mcpbr preflight --filter-category python -n 10 # Check 10 Python instances
mcpbr preflight --task django__django-16046 # Check specific instance
"""
from .benchmark_preflight import run_benchmark_preflight
from .benchmarks import create_benchmark
from .docker_env import DockerEnvironmentManager

benchmark = create_benchmark(benchmark_name)

# Load tasks
task_id_list = list(task_ids) if task_ids else None
category_list = list(filter_category) if filter_category else None

console.print(f"[bold]Preflight Check: {benchmark_name}[/bold]\n")
dataset_name = getattr(benchmark, "dataset", benchmark_name)
console.print(f"Loading tasks from {dataset_name}...")

tasks = benchmark.load_tasks(
sample_size=sample_size,
task_ids=task_id_list,
filter_category=category_list,
)

if not tasks:
console.print("[yellow]No tasks found matching the criteria.[/yellow]")
return

console.print(f"Validating {len(tasks)} instance(s)...\n")

# Create Docker manager
docker_manager = DockerEnvironmentManager(use_prebuilt=True)


⚠️ Potential issue | 🟠 Major

--config is currently ignored by mcpbr preflight.

The option says it drives Docker settings, but the command never loads config and always creates DockerEnvironmentManager(use_prebuilt=True).

💡 Proposed fix
 def preflight(
@@
 ) -> None:
@@
-    benchmark = create_benchmark(benchmark_name)
+    benchmark = create_benchmark(benchmark_name)
+
+    cfg = load_config(Path(config_path)) if config_path else None
@@
-    # Create Docker manager
-    docker_manager = DockerEnvironmentManager(use_prebuilt=True)
+    # Create Docker manager (respect config when provided)
+    docker_manager = DockerEnvironmentManager(
+        use_prebuilt=cfg.use_prebuilt_images if cfg is not None else True,
+        extra_volumes=cfg.volumes if cfg is not None else None,
+        claude_code_version=cfg.claude_code_version if cfg is not None else None,
+    )
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 1592-1592: Unused function argument: config_path

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/cli.py` around lines 1540 - 1642, The preflight command ignores the
--config option; update preflight to, when config_path is provided, load the
YAML/JSON config (e.g., with yaml.safe_load or json.load) and pass the parsed
settings into DockerEnvironmentManager instead of always instantiating
DockerEnvironmentManager(use_prebuilt=True). Modify the DockerEnvironmentManager
instantiation in preflight to accept the loaded config (e.g.,
DockerEnvironmentManager(config=cfg, use_prebuilt=True)) or apply the loaded
settings via its public API before use; ensure you handle file read errors and
missing keys gracefully and still default to use_prebuilt=True when no
config_path is supplied.

Comment on lines +32 to +54
def get_test_list_field(task: dict[str, Any], field_name: str) -> str:
"""Get a test list field from a task, checking both lowercase and uppercase names.

SWE-bench uses FAIL_TO_PASS/PASS_TO_PASS while SWE-bench Pro uses
fail_to_pass/pass_to_pass. This helper provides backward-compatible access.

Args:
task: Task dictionary.
field_name: Field name in lowercase (e.g., "fail_to_pass").

Returns:
Field value as string, or "[]" if not found.
"""
# Try lowercase first (SWE-bench Pro convention)
value = task.get(field_name)
if value is not None:
return str(value)
# Fall back to uppercase (SWE-bench convention)
upper_name = field_name.upper()
value = task.get(upper_name)
if value is not None:
return str(value)
return "[]"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check usage of get_test_list_field and how test fields are accessed
rg -n -C3 'evaluate_patch\(' src/mcpbr/benchmarks/swebench_pro.py src/mcpbr/evaluation.py
rg -n -C3 'FAIL_TO_PASS|PASS_TO_PASS|fail_to_pass|pass_to_pass|get_test_list_field' src/mcpbr/benchmarks/swebench_pro.py src/mcpbr/evaluation.py

Repository: supermodeltools/mcpbr

Length of output: 10673


Fix evaluate_patch() to use get_test_list_field() for backward compatibility with SWE-bench Pro lowercase field names.

When Python SWE-bench Pro tasks hit evaluate_patch() at line 187, they have lowercase field names like fail_to_pass, but the function reads uppercase names like FAIL_TO_PASS directly (lines 391-392). Replace those hardcoded .get() calls with get_test_list_field():

fail_to_pass_tests = parse_test_list(get_test_list_field(task, "fail_to_pass"))
pass_to_pass_tests = parse_test_list(get_test_list_field(task, "pass_to_pass"))

Currently, non-Python languages (lines 233-234) handle this correctly, but Python tasks silently get empty test lists instead of actual tests.
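The silent-empty-list behavior is easy to reproduce in isolation (illustrative task dict, not real dataset contents):

```python
# SWE-bench Pro convention: lowercase field names
task = {"fail_to_pass": '["tests/test_x.py::test_a"]'}

# Uppercase-only lookup, as in the current evaluate_patch():
print(task.get("FAIL_TO_PASS", "[]"))  # [] — the tests silently vanish

# Case-tolerant lookup, the behavior get_test_list_field() provides:
print(task.get("fail_to_pass") or task.get("FAIL_TO_PASS", "[]"))
```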

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/evaluation.py` around lines 32 - 54, The evaluate_patch function
currently reads uppercase fields directly for Python tasks (variables
fail_to_pass_tests / pass_to_pass_tests) which breaks on SWE-bench Pro lowercase
names; replace the direct task.get(...) calls in evaluate_patch with calls to
get_test_list_field(task, "fail_to_pass") and get_test_list_field(task,
"pass_to_pass"), then pass those results through parse_test_list (e.g.,
fail_to_pass_tests = parse_test_list(get_test_list_field(task, "fail_to_pass"))
and similarly for pass_to_pass_tests) so Python handling matches the non-Python
branch and supports both conventions.

Comment on lines +404 to +413
elif task.get("dockerhub_tag") and task.get("repo_language", "python").lower() == "python":
# SWE-bench Pro images install packages into site-packages (not
# editable). After patching we must reinstall so the new code is
# importable.
with contextlib.suppress(TimeoutError):
await env.exec_command(
"pip install -e . -q 2>/dev/null || true",
timeout=120,
workdir=eval_workdir,
)

⚠️ Potential issue | 🔴 Critical

Use the actual task workdir for SWE-bench Pro DockerHub images before reinstall.

This branch is for DockerHub images, but eval_workdir is still derived from the generic prebuilt path, which can target /testbed instead of /app.

💡 Proposed fix
-    # For pre-built images, apply patch and run tests in /testbed
-    # because that's where the editable install points to
-    eval_workdir = "/testbed" if env.uses_prebuilt else None
+    # For pre-built images, use the environment workdir (supports /testbed and /app overrides)
+    eval_workdir = env.workdir if env.uses_prebuilt else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/evaluation.py` around lines 404 - 413, The reinstall step for
DockerHub Python images uses eval_workdir (prebuilt path) which can be /testbed;
change it to use the task-specific workdir (e.g., task.get("workdir") or
fallback to eval_workdir) so SWE-bench Pro images run pip install in the actual
repo directory. In the branch that checks task.get("dockerhub_tag") and
repo_language == "python", compute task_workdir = task.get("workdir",
eval_workdir) and pass that as the workdir argument to env.exec_command (same
place where "pip install -e . -q ..." is invoked) so the reinstall runs in the
correct directory.

Test IDs from SWE-bench Pro datasets can contain single quotes and
literal \uXXXX escape sequences that break shell command construction.

- Replace bare single-quote wrapping with shlex.quote() in
  _build_test_command and _build_pro_test_command
- Add _normalize_test_id() to decode \uXXXX to actual unicode chars
- Fixes qutebrowser PTP test failures where 8/10 tests had IDs
  containing embedded single quotes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
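The quoting problem this commit describes can be shown with the standard library alone (the test ID below is made up):

```python
import shlex

# A test ID containing an embedded single quote
test_id = "tests/test_misc.py::test_msg[I'm feeling lucky]"

# Naive single-quote wrapping produces a broken shell token:
# the embedded quote terminates the string early.
naive = f"pytest '{test_id}'"

# shlex.quote() escapes the embedded quote safely.
safe = f"pytest {shlex.quote(test_id)}"
print(safe)
```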

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
src/mcpbr/benchmarks/swebench_pro.py (1)

224-231: Remove unreachable Python reinstall block from _evaluate_multilang().

evaluate() returns early for Python tasks (Line 185–188), so Line 226 (language == "python") can’t be true in this function.

💡 Suggested cleanup
-        # Reinstall package so patched code is active (SWE-bench Pro images
-        # install into site-packages, not editable mode)
-        if eval_workdir and language == "python":
-            await env.exec_command(
-                "pip install -e . -q 2>/dev/null || true",
-                timeout=120,
-                workdir=eval_workdir,
-            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 224 - 231, The reinstall
pip block inside _evaluate_multilang is unreachable because evaluate() returns
early for Python tasks, so remove the conditional block that checks "if
eval_workdir and language == 'python':" (including the await env.exec_command
call) from _evaluate_multilang; ensure any necessary reinstall logic for Python
remains only in evaluate() or is relocated to a shared helper if needed—search
for the symbols _evaluate_multilang, evaluate(), eval_workdir, and language to
find and delete or move this code.
src/mcpbr/evaluation.py (2)

403-404: ⚠️ Potential issue | 🔴 Critical

Use get_test_list_field() here to preserve SWE-bench Pro compatibility.

Line 403 and Line 404 still read uppercase-only keys, so lowercase fail_to_pass/pass_to_pass tasks can evaluate with empty lists.

💡 Suggested fix
-    fail_to_pass_tests = parse_test_list(task.get("FAIL_TO_PASS", "[]"))
-    pass_to_pass_tests = parse_test_list(task.get("PASS_TO_PASS", "[]"))
+    fail_to_pass_tests = parse_test_list(get_test_list_field(task, "fail_to_pass"))
+    pass_to_pass_tests = parse_test_list(get_test_list_field(task, "pass_to_pass"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/evaluation.py` around lines 403 - 404, The code is using
task.get("FAIL_TO_PASS", ...) and task.get("PASS_TO_PASS", ...) which misses
lowercase keys; replace these two calls to first retrieve the test-list value
via get_test_list_field(task, "fail_to_pass", "FAIL_TO_PASS") and
get_test_list_field(task, "pass_to_pass", "PASS_TO_PASS") and then pass those
results into parse_test_list (i.e., fail_to_pass_tests =
parse_test_list(get_test_list_field(...)) and pass_to_pass_tests =
parse_test_list(get_test_list_field(...))). This preserves SWE-bench Pro
compatibility while keeping parse_test_list usage.

385-387: ⚠️ Potential issue | 🔴 Critical

Don’t hardcode /testbed for all prebuilt images.

Line 387 forces /testbed, but SWE-bench Pro DockerHub images use /app; then Line 424 runs reinstall in the wrong directory.

💡 Suggested fix
-    # For pre-built images, apply patch and run tests in /testbed
-    # because that's where the editable install points to
-    eval_workdir = "/testbed" if env.uses_prebuilt else None
+    # For pre-built images, use the environment's configured workdir
+    # (supports both /testbed and /app overrides).
+    eval_workdir = env.workdir if env.uses_prebuilt else None
@@
-            await env.exec_command(
+            task_workdir = task.get("workdir", eval_workdir)
+            await env.exec_command(
                 "pip install -e . -q 2>/dev/null || true",
                 timeout=120,
-                workdir=eval_workdir,
+                workdir=task_workdir,
             )

Also applies to: 421-425

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/evaluation.py` around lines 385 - 387, When selecting the workdir
for pre-built images, don't hardcode "/testbed"; instead, when env.uses_prebuilt
is true use a configurable attribute (e.g., eval_workdir = getattr(env,
"prebuilt_workdir", "/testbed")) so images that set prebuilt_workdir="/app" are
respected; then pass that eval_workdir into the reinstall/test invocation
(replace any hardcoded paths used around the reinstall call with
workdir=eval_workdir) so the reinstall runs in the correct directory for
SWE-bench Pro and other prebuilt images.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 224-231: The reinstall pip block inside _evaluate_multilang is
unreachable because evaluate() returns early for Python tasks, so remove the
conditional block that checks "if eval_workdir and language == 'python':"
(including the await env.exec_command call) from _evaluate_multilang; ensure any
necessary reinstall logic for Python remains only in evaluate() or is relocated
to a shared helper if needed—search for the symbols _evaluate_multilang,
evaluate(), eval_workdir, and language to find and delete or move this code.

In `@src/mcpbr/evaluation.py`:
- Around line 403-404: The code is using task.get("FAIL_TO_PASS", ...) and
task.get("PASS_TO_PASS", ...) which misses lowercase keys; replace these two
calls to first retrieve the test-list value via get_test_list_field(task,
"fail_to_pass", "FAIL_TO_PASS") and get_test_list_field(task, "pass_to_pass",
"PASS_TO_PASS") and then pass those results into parse_test_list (i.e.,
fail_to_pass_tests = parse_test_list(get_test_list_field(...)) and
pass_to_pass_tests = parse_test_list(get_test_list_field(...))). This preserves
SWE-bench Pro compatibility while keeping parse_test_list usage.
- Around line 385-387: When selecting the workdir for pre-built images, don't
hardcode "/testbed"; instead, when env.uses_prebuilt is true use a configurable
attribute (e.g., eval_workdir = getattr(env, "prebuilt_workdir", "/testbed")) so
images that set prebuilt_workdir="/app" are respected; then pass that
eval_workdir into the reinstall/test invocation (replace any hardcoded paths
used around the reinstall call with workdir=eval_workdir) so the reinstall runs
in the correct directory for SWE-bench Pro and other prebuilt images.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5108867 and df54b19.

📒 Files selected for processing (3)
  • src/mcpbr/benchmarks/codegraph.py
  • src/mcpbr/benchmarks/swebench_pro.py
  • src/mcpbr/evaluation.py

greynewell and others added 2 commits February 25, 2026 17:05
_normalize_test_id was incorrectly converting literal \uXXXX sequences
to actual unicode characters. SWE-bench Pro test IDs already contain
the correct literal sequences that match pytest node IDs — converting
them breaks test matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dataset uses 'js' and 'ts' as language values, but
filter_category accepted 'javascript' and 'typescript' without
mapping to the dataset values. Added _LANGUAGE_ALIASES dict
to resolve user-friendly names to dataset values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
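A sketch of the alias resolution this commit describes (the exact contents of the repo's `_LANGUAGE_ALIASES` dict may differ):

```python
# Maps user-friendly language names to the short values stored in the dataset.
_LANGUAGE_ALIASES = {
    "javascript": "js",
    "typescript": "ts",
}

def resolve_language(name: str) -> str:
    lowered = name.lower()
    return _LANGUAGE_ALIASES.get(lowered, lowered)

print(resolve_language("TypeScript"))  # ts
print(resolve_language("js"))          # js (already a dataset value)
print(resolve_language("python"))      # python (no alias needed)
```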

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
src/mcpbr/benchmarks/swebench_pro.py (4)

272-280: Return type Any hides the actual TestResults contract from callers.

Both code paths return a TestResults instance. Annotating it as Any means type checkers can't catch misuse of .passed / .total at the call sites.

♻️ Proposed fix
+from ..evaluation import TestResults
 
     async def _run_lang_tests(
         self,
         env: TaskEnvironment,
         tests: list[str],
         language: str,
         workdir: str | None = None,
         timeout: int = 120,
-    ) -> Any:
+    ) -> "TestResults":

(The TestResults import inside the function body at line 298 can move to the top-level import block too.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 272 - 280, The function
_run_lang_tests currently annotates its return type as Any but always returns a
TestResults instance; change the async def _run_lang_tests(...) -> Any to ->
TestResults and update the import by moving the TestResults import (now inside
the function body) to the module-level imports so type checkers can see the
symbol; ensure any callers still compile and remove the internal import at line
298.

407-409: Move shlex and internal imports to the module-level import block.

There's no circular import or optional-dependency reason to lazy-import shlex (stdlib) or _build_test_command/_normalize_test_id (same package) inside the function body. Deferred imports are a code smell unless you need them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 407 - 409, The imports for
shlex and the package-internal functions _build_test_command and
_normalize_test_id are currently deferred inside the function body; move those
three imports into the module-level import block with the other imports so they
are imported at module load time and remove the lazy imports from the function;
keep the same import names (shlex, _build_test_command, _normalize_test_id) so
existing references in the function continue to work.

60-65: Consider silencing the Ruff ARG002 warnings on intentionally unused params.

level, filter_difficulty, and filter_tags are documented as unused, but Ruff still flags them. The cleanest way to fix this without changing the signature (which is probably dictated by a base class interface) is a targeted suppression:

🔧 Proposed suppression
     def load_tasks(
         self,
         sample_size: int | None = None,
         task_ids: list[str] | None = None,
-        level: int | None = None,
-        filter_difficulty: list[str] | None = None,
+        level: int | None = None,  # noqa: ARG002
+        filter_difficulty: list[str] | None = None,  # noqa: ARG002
         filter_category: list[str] | None = None,
-        filter_tags: list[str] | None = None,
+        filter_tags: list[str] | None = None,  # noqa: ARG002
     ) -> list[dict[str, Any]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 60 - 65, Ruff is flagging
ARG002 for intentionally unused parameters level, filter_difficulty, and
filter_tags in the function signature; silence these with targeted per-parameter
noqa comments by appending "# noqa: ARG002" to each unused parameter in the
signature (e.g., level: int | None = None  # noqa: ARG002), leaving the overall
signature unchanged and preserving the base-class interface.

260-269: if fail_to_pass_results: / if pass_to_pass_results: don't actually guard against anything — these checks are always true.

Here's the deal: _run_lang_tests always returns a TestResults instance (it never returns None). Since TestResults is a dataclass without a custom __bool__ (or __len__) method, it's always truthy in Python, like any plain object instance. So those if checks will always pass, and both keys get added to the result dict every time.

This isn't a bug, but the conditions are misleading. They look like safety checks, but they're not doing any actual guarding. You can simplify this:

♻️ Suggested simplification
-        if fail_to_pass_results:
-            result["fail_to_pass"] = {
+        result["fail_to_pass"] = {
                 "passed": fail_to_pass_results.passed,
                 "total": fail_to_pass_results.total,
-            }
-        if pass_to_pass_results:
-            result["pass_to_pass"] = {
+        }
+        result["pass_to_pass"] = {
                 "passed": pass_to_pass_results.passed,
                 "total": pass_to_pass_results.total,
-            }
+        }
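The truthiness claim is easy to verify with a stand-in dataclass (hypothetical; it only mirrors the shape of the real TestResults):

```python
from dataclasses import dataclass

@dataclass
class TestResults:  # stand-in: no custom __bool__ or __len__
    passed: int
    total: int

# Even an all-zero instance is truthy, so `if results:` always passes.
print(bool(TestResults(passed=0, total=0)))  # True
```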
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 260 - 269, The current
guards "if fail_to_pass_results:" and "if pass_to_pass_results:" are misleading
because _run_lang_tests always returns a TestResults dataclass instance (which
is always truthy); remove those redundant if checks and unconditionally assign
the dictionaries using fail_to_pass_results and pass_to_pass_results (e.g., set
result["fail_to_pass"] = {"passed": fail_to_pass_results.passed, "total":
fail_to_pass_results.total} and similarly for pass_to_pass) so the code is
simpler and clearer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 430-437: The TypeScript/JavaScript branch currently checks if
language in ("typescript", "javascript") and thus misses the short identifiers
"ts" and "js" from PRO_LANGUAGES; update the conditional that builds the Jest
command (the if language in (...) branch that returns f"{activate}npx jest ...")
to include "ts" and "js" so short-form tasks use Jest rather than falling back
to the raw shell path, and similarly review/update the analogous Go branch (the
if language in (...) branch near the Go handling) to include any short Go
identifiers if present in PRO_LANGUAGES.

---

Nitpick comments:
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 272-280: The function _run_lang_tests currently annotates its
return type as Any but always returns a TestResults instance; change the async
def _run_lang_tests(...) -> Any to -> TestResults and update the import by
moving the TestResults import (now inside the function body) to the module-level
imports so type checkers can see the symbol; ensure any callers still compile
and remove the internal import at line 298.
- Around line 407-409: The imports for shlex and the package-internal functions
_build_test_command and _normalize_test_id are currently deferred inside the
function body; move those three imports into the module-level import block with
the other imports so they are imported at module load time and remove the lazy
imports from the function; keep the same import names (shlex,
_build_test_command, _normalize_test_id) so existing references in the function
continue to work.
- Around line 60-65: Ruff is flagging ARG002 for intentionally unused parameters
level, filter_difficulty, and filter_tags in the function signature; silence
these with targeted per-parameter noqa comments by appending "# noqa: ARG002" to
each unused parameter in the signature (e.g., level: int | None = None  # noqa:
ARG002), leaving the overall signature unchanged and preserving the base-class
interface.
- Around line 260-269: The current guards "if fail_to_pass_results:" and "if
pass_to_pass_results:" are misleading because _run_lang_tests always returns a
TestResults dataclass instance (which is always truthy); remove those redundant
if checks and unconditionally assign the dictionaries using fail_to_pass_results
and pass_to_pass_results (e.g., set result["fail_to_pass"] = {"passed":
fail_to_pass_results.passed, "total": fail_to_pass_results.total} and similarly
for pass_to_pass) so the code is simpler and clearer.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78487e8 and cd5378f.

📒 Files selected for processing (1)
  • src/mcpbr/benchmarks/swebench_pro.py

Comment on lines 430 to 437
if language in ("typescript", "javascript"):
# Jest-style test identifiers
if "/" in test or test.endswith((".ts", ".js", ".tsx", ".jsx")):
# File path
return f"{activate}npx jest {shlex.quote(test)} --verbose --no-cache 2>&1"
else:
# Test name pattern
return f"{activate}npx jest -t {shlex.quote(test)} --verbose --no-cache 2>&1"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for _LANGUAGE_ALIASES and understand the alias mapping
rg -n "_LANGUAGE_ALIASES" --type=py -B2 -A10

Repository: supermodeltools/mcpbr

Length of output: 2066


🏁 Script executed:

#!/bin/bash
# Look at the evaluate() function to see how language is retrieved and passed
rg -n "def evaluate" --type=py -A20 | head -60

Repository: supermodeltools/mcpbr

Length of output: 3829


🏁 Script executed:

#!/bin/bash
# Check the _build_pro_test_command function signature and context
sed -n '420,450p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 992


🏁 Script executed:

#!/bin/bash
# Search for how repo_language values are set in the dataset
rg -n "repo_language" --type=py -B2 -A2

Repository: supermodeltools/mcpbr

Length of output: 8486


🏁 Script executed:

#!/bin/bash
# Search for where the dataset is loaded/fetched
rg -n "load_pro_dataset\|json\.load\|\.jsonl" --type=py -B2 -A5

Repository: supermodeltools/mcpbr

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Check what SWEBenchProBenchmark.evaluate() gets from the task parameter
sed -n '188,210p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 876


🏁 Script executed:

#!/bin/bash
# Look for any place where language might be transformed before _build_pro_test_command
rg -n "_build_pro_test_command\|_evaluate_multilang\|_run_lang_tests" --type=py -B5 -A3

Repository: supermodeltools/mcpbr

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Get the full _evaluate_multilang implementation
sed -n '207,250p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 1877


🏁 Script executed:

#!/bin/bash
# Search for all calls to _build_pro_test_command to see what gets passed
rg -n "_build_pro_test_command" --type=py -B3 -A1

Repository: supermodeltools/mcpbr

Length of output: 4080


🏁 Script executed:

#!/bin/bash
# Check test files to see if there are actual dataset loading tests
find . -name "*.json" -o -name "*.jsonl" | head -20

Repository: supermodeltools/mcpbr

Length of output: 312


🏁 Script executed:

#!/bin/bash
# Look at the entire _build_pro_test_command function from start to finish
sed -n '410,460p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 1267


🏁 Script executed:

#!/bin/bash
# Check if there's any transformation of repo_language values in the load process
rg -n "repo_language" --type=py -B3 -A3 | head -100

Repository: supermodeltools/mcpbr

Length of output: 6225


🏁 Script executed:

#!/bin/bash
# Look at what the actual tests show - are they PASSING or FAILING with current code?
grep -n "test_typescript\|test_javascript" tests/test_swebench_pro.py -A5

Repository: supermodeltools/mcpbr

Length of output: 957


🏁 Script executed:

#!/bin/bash
# Check if there are any fixtures or sample data that show actual repo_language values
find . -type f \( -name "*.py" -o -name "*.json" \) -exec grep -l "repo_language.*[ts]\|repo_language.*js" {} \;

Repository: supermodeltools/mcpbr

Length of output: 229


🏁 Script executed:

#!/bin/bash
# Let me check PRO_LANGUAGES constant to see what values are expected
rg -n "PRO_LANGUAGES" --type=py -B2 -A10

Repository: supermodeltools/mcpbr

Length of output: 3635


🏁 Script executed:

#!/bin/bash
# Check if the tests for typescript/javascript actually pass with the current code
# by looking at the test assertions more carefully
sed -n '113,132p' tests/test_swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 819


🏁 Script executed:

#!/bin/bash
# Verify if there's any place in the code that might normalize repo_language to short form
rg -n "\.get\(.*repo_language" --type=py -B2 -A2

Repository: supermodeltools/mcpbr

Length of output: 2154


🏁 Script executed:

#!/bin/bash
# Run the specific test to see if it's currently failing
cd /root && python -m pytest tests/test_swebench_pro.py::TestSWEBenchProInit::test_pro_languages_set -xvs 2>&1

Repository: supermodeltools/mcpbr

Length of output: 120


🏁 Script executed:

#!/bin/bash
# Check if the test assertion matches what PRO_LANGUAGES actually contains
rg -n "test_pro_languages_set" --type=py -A5

Repository: supermodeltools/mcpbr

Length of output: 441


🏁 Script executed:

#!/bin/bash
# Get the full PRO_LANGUAGES definition
sed -n '29,29p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 137


🏁 Script executed:

#!/bin/bash
# Search for any actual dataset loading or task creation that would show real repo_language values
rg -n "load_dataset\|from_dataset" --type=py -B2 -A10

Repository: supermodeltools/mcpbr

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Check the filter_by logic more carefully to understand when short vs long forms are used
sed -n '95,115p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 1068


🏁 Script executed:

#!/bin/bash
# Check if there's any normalization happening before _build_pro_test_command is called
# Look at _run_lang_tests to trace the full flow
sed -n '290,320p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr

Length of output: 1165


Add support for short-form language identifiers ("ts", "js") in the TypeScript/JavaScript test builder.

The check at line 430 only matches the long forms ("typescript", "javascript"), but PRO_LANGUAGES explicitly includes the short forms "ts" and "js" (line 29), indicating the dataset uses them. Any TypeScript or JavaScript task with those short forms will silently fall through to the bare fallback at line 440, running the test identifier as a raw shell command instead of with Jest—breaking test execution entirely.

The fix is straightforward:

🐛 Proposed fix
-    if language in ("typescript", "javascript"):
+    if language in ("typescript", "javascript", "ts", "js"):

(The same issue should be checked for the Go branch at line 420 if Go also uses short forms in the dataset.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 430 - 437, The
TypeScript/JavaScript branch currently checks if language in ("typescript",
"javascript") and thus misses the short identifiers "ts" and "js" from
PRO_LANGUAGES; update the conditional that builds the Jest command (the if
language in (...) branch that returns f"{activate}npx jest ...") to include "ts"
and "js" so short-form tasks use Jest rather than falling back to the raw shell
path, and similarly review/update the analogous Go branch (the if language in
(...) branch near the Go handling) to include any short Go identifiers if
present in PRO_LANGUAGES.

Preflight was using the Python-only run_tests() (pytest) for all
languages, causing 100% failure for Go, JavaScript, and TypeScript
instances. Now routes non-Python languages through
_build_pro_test_command() which generates the correct commands:
- Go: go test -v -count=1 ...
- JS/TS: npx jest ...

Also fixes conda activation bug in _run_lang_tests: SWE-bench Pro
images don't have conda, so uses_prebuilt=False is passed for
non-Python test commands to avoid prepending conda activation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
src/mcpbr/benchmark_preflight.py (2)

200-205: ⚠️ Potential issue | 🟠 Major

Editable reinstall failures are still silently swallowed.

Hey, so the || true at the end of the pip command (line 202) means if the install fails — like a missing setup.py, broken dependencies, whatever — you'll never know. The preflight will happily keep going and run tests against the old installed code, giving you a false "passed" result. That's the opposite of what preflight is supposed to catch!

Think of it this way: if you order food and the kitchen says "we might have burned it, but let's serve it anyway" — that defeats the purpose of quality checks.

Proposed fix
         if eval_workdir and language == "python":
-            await env.exec_command(
-                "pip install -e . -q 2>/dev/null || true",
+            exit_code, _stdout, stderr = await env.exec_command(
+                "pip install -e . -q",
                 timeout=120,
                 workdir=eval_workdir,
             )
+            if exit_code != 0:
+                return PreflightResult(
+                    instance_id=instance_id,
+                    status="error",
+                    error=f"Editable reinstall failed: {stderr[:300] if stderr else 'unknown error'}",
+                    language=language,
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmark_preflight.py` around lines 200 - 205, The pip reinstall
command is silently swallowing failures via "|| true"; update the block that
runs env.exec_command when eval_workdir and language == "python" to remove the
"|| true" so failures surface, and ensure exec_command's non-zero exit is
propagated (or explicitly check its result) so the preflight aborts and logs the
error; refer to the eval_workdir conditional and the call to env.exec_command in
benchmark_preflight.py to locate and change this behavior.

306-307: ⚠️ Potential issue | 🟠 Major

max_concurrent=0 will deadlock all preflight checks.

If someone passes max_concurrent=0 (say, by misconfiguration or a bad CLI default), asyncio.Semaphore(0) means no coroutine can ever acquire the semaphore. Every check hangs forever. A simple guard at the top of the function prevents a really confusing debugging session.

Proposed fix
 async def run_benchmark_preflight(
     benchmark: Any,
     tasks: list[dict[str, Any]],
     docker_manager: DockerEnvironmentManager,
     max_concurrent: int = 4,
     timeout: int = 300,
     fail_fast: bool = False,
 ) -> PreflightReport:
+    if max_concurrent < 1:
+        raise ValueError("max_concurrent must be at least 1")
+
     report = PreflightReport(total=len(tasks))
     semaphore = asyncio.Semaphore(max_concurrent)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmark_preflight.py` around lines 306 - 307, The code creates
asyncio.Semaphore(max_concurrent) which deadlocks when max_concurrent <= 0; add
a guard at the start of the function (before creating report and semaphore) to
ensure max_concurrent is at least 1 (e.g., max_concurrent = max(1,
max_concurrent)) or raise a clear ValueError if you prefer failing fast; update
the place that constructs semaphore (the variable named semaphore) so it uses
the validated value and reference PreflightReport/semaphore as needed.
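The guard is easy to sketch in isolation, assuming a plain asyncio worker pool (names here are illustrative, not the actual preflight code):

```python
import asyncio
from typing import Awaitable, TypeVar

T = TypeVar("T")


async def run_with_limit(coros: list[Awaitable[T]], max_concurrent: int = 4) -> list[T]:
    """Run awaitables under a bounded semaphore.

    Guards against max_concurrent < 1, where asyncio.Semaphore(0) could
    never be acquired and every worker would hang forever.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _guarded(coro: Awaitable[T]) -> T:
        async with semaphore:
            return await coro

    return await asyncio.gather(*(_guarded(c) for c in coros))
```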
src/mcpbr/benchmarks/swebench_pro.py (2)

432-439: ⚠️ Potential issue | 🟠 Major

Short-form language identifiers "ts" and "js" fall through to the raw command fallback.

PRO_LANGUAGES on line 29 explicitly includes "ts" and "js", which means tasks can arrive with those as their language. But _build_pro_test_command only checks for "typescript" and "javascript" on line 432. If a task has language="ts", it skips the Jest branch entirely and hits the fallback on line 441, which just runs the test identifier as a raw shell command. That'll probably fail or do something weird.

Same thing could apply to Go if short forms ever show up, but "go" doesn't have an alias so it's fine for now.

Proposed fix
-    if language in ("typescript", "javascript"):
+    if language in ("typescript", "javascript", "ts", "js"):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 432 - 439, The Jest branch
in _build_pro_test_command only checks for "typescript" and "javascript", so
short-form languages ("ts", "js") from PRO_LANGUAGES fall through to the raw
fallback; update the language check to include the short aliases (e.g., check if
language in ("typescript","javascript","ts","js")) or normalize language at the
top of _build_pro_test_command (map "ts" -> "typescript", "js" -> "javascript")
so that Jest commands are generated for short-form inputs; ensure you update the
conditional that currently uses language in ("typescript", "javascript") to the
new set or use the normalization logic before the existing checks.

232-239: 🛠️ Refactor suggestion | 🟠 Major

Dead code: the language == "python" check inside _evaluate_multilang can never be true.

So here's the thing — evaluate() (line 193) already returns early when language == "python". That means _evaluate_multilang is only ever called for non-Python languages. So this entire block (lines 232-239) checking if language == "python" will never execute. It's not harmful, but it's confusing for the next person reading this code — they'll wonder "wait, can Python end up here?"

Proposed fix — just remove it
         if test_patch:
             await _apply_test_patch(env, test_patch, workdir=eval_workdir)
 
-        # Reinstall package so patched code is active (SWE-bench Pro images
-        # install into site-packages, not editable mode)
-        if eval_workdir and language == "python":
-            await env.exec_command(
-                "pip install -e . -q 2>/dev/null || true",
-                timeout=120,
-                workdir=eval_workdir,
-            )
-
         fail_to_pass_str = get_test_list_field(task, "fail_to_pass")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 232 - 239, The block in
_evaluate_multilang that checks `if language == "python"` and runs the pip
reinstall is dead because evaluate() returns early for Python, so remove that
entire conditional and its pip install call to avoid confusing dead code; locate
the code inside _evaluate_multilang (the await env.exec_command("pip install -e
. -q 2>/dev/null || true", ...)) and delete the if-check plus the enclosed lines
so the function contains no unreachable Python-specific reinstall logic.
🧹 Nitpick comments (4)
src/mcpbr/benchmark_preflight.py (2)

313-325: fail_fast mode still acquires the semaphore unnecessarily.

In the fail_fast branch (lines 313-325), tasks are run sequentially with await. The semaphore acquisition inside _check_with_semaphore is pointless here since there's no concurrency. It won't cause a bug, but it's misleading — a reader might think concurrency is happening. You could just call _check_single_instance directly.

This is totally optional and won't break anything as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmark_preflight.py` around lines 313 - 325, The fail_fast
branch is calling _check_with_semaphore which needlessly acquires a semaphore
even though tasks are executed sequentially; replace the call to
_check_with_semaphore(task) with a direct call to the non-concurrent checker
_check_single_instance(task) (or the underlying function that performs the
check) so the semaphore isn't acquired when fail_fast is true, then append the
returned result to report.results and update
report.passed/report.failed/report.errors exactly as before.

265-280: Use lazy %s formatting in logger calls instead of f-strings.

Lines 267 and 280 use f-strings inside logger.exception() and logger.warning(). This means the string interpolation happens every time, even if the log level is disabled. It's a small thing, but it's the idiomatic way to do logging in Python — let the logger decide whether to format the string.

Proposed fix
-        logger.exception(f"Preflight error for {instance_id}")
+        logger.exception("Preflight error for %s", instance_id)
-                logger.warning(f"Failed to clean up container for {instance_id}")
+                logger.warning("Failed to clean up container for %s", instance_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmark_preflight.py` around lines 265 - 280, Replace the
f-strings in the logging calls with lazy %-style formatting so interpolation
only happens when the message will be emitted: change
logger.exception(f"Preflight error for {instance_id}") to
logger.exception("Preflight error for %s", instance_id) and change
logger.warning(f"Failed to clean up container for {instance_id}") to
logger.warning("Failed to clean up container for %s", instance_id); these occur
in the exception/finally block around the Preflight logic (references: variable
instance_id and env.cleanup()).
tests/test_swebench_pro.py (1)

173-240: Filter tests don't cover the "typescript" ↔ "ts" alias path.

The test_filter_by_language test uses filter_category=["go"] and dataset repo_language="go" — a case with no aliasing involved. There's no test where the user passes a long-form name (like "typescript") and the dataset uses the other form. Adding such a test would catch the alias resolution issue flagged in the implementation review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_swebench_pro.py` around lines 173 - 240, Add a unit test to cover
the typescript ↔ ts alias path: in tests/test_swebench_pro.py create a new test
(e.g., test_filter_typescript_alias) that patches
mcpbr.benchmarks.swebench_pro.load_dataset to return a dataset where one item
has repo_language set to "ts" and then call
SWEBenchProBenchmark().load_tasks(filter_category=["typescript"]) (and vice
versa) asserting the item is included; reference SWEBenchProBenchmark.load_tasks
and PRO_LANGUAGES to ensure the alias mapping is exercised by the test.
src/mcpbr/benchmarks/swebench_pro.py (1)

272-337: _run_lang_tests nicely mirrors _run_preflight_tests in the preflight module.

These two functions (_run_lang_tests here and _run_preflight_tests in benchmark_preflight.py) share almost identical logic for the non-Python path — loop over tests, call _build_pro_test_command, collect results. The only real differences are:

  1. Here, uses_prebuilt=False is always passed (line 309) vs. the preflight version passing uses_conda.
  2. The preflight version accepts repo as a param for the Python path.

Not a blocker, but if this code drifts apart it could cause subtle inconsistencies in how tests are run during evaluation vs. preflight. Consider extracting a shared helper down the road.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 272 - 337, Extract the
duplicated non-Python test-running logic from _run_lang_tests and
benchmark_preflight._run_preflight_tests into a shared helper (e.g.,
_run_non_python_tests or run_language_tests_helper) that accepts env, tests,
language, workdir, timeout, and a uses_prebuilt/uses_conda flag; move the loop
that builds commands via _build_pro_test_command, calls env.exec_command,
constructs the results list, and returns TestResults into that helper, then have
both _run_lang_tests and _run_preflight_tests call the new helper with the
appropriate uses_prebuilt/uses_conda value to eliminate duplication and keep
behavior consistent.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd5378f and 6472bbb.

📒 Files selected for processing (3)
  • src/mcpbr/benchmark_preflight.py
  • src/mcpbr/benchmarks/swebench_pro.py
  • tests/test_swebench_pro.py

Comment on lines +94 to +112
        if filter_category:
            filtered = []
            for task in tasks:
                repo = task.get("repo", "")
                language = task.get("repo_language", "").lower()
                for category in filter_category:
                    cat_lower = category.lower()
                    # If the category is a known language, match by language only
                    if cat_lower in PRO_LANGUAGES:
                        # Resolve aliases (e.g., "javascript" -> "js")
                        resolved = _LANGUAGE_ALIASES.get(cat_lower, cat_lower)
                        if resolved == language:
                            filtered.append(task)
                            break
                    elif cat_lower in repo.lower():
                        # Otherwise, match by repo substring
                        filtered.append(task)
                        break
            tasks = filtered

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find PRO_LANGUAGES and _LANGUAGE_ALIASES definitions
rg -n "_LANGUAGE_ALIASES|PRO_LANGUAGES" --type=py -B2 -A5 src/

Repository: supermodeltools/mcpbr

Length of output: 1809


🏁 Script executed:

#!/bin/bash
# Check test data to see what repo_language values are used
rg -n "repo_language" --type=py -C2 tests/

Repository: supermodeltools/mcpbr

Length of output: 6005


🏁 Script executed:

#!/bin/bash
# Check if there are any tests for filtering by "typescript" or "javascript"
rg -n "filter_category.*typescript|filter_category.*javascript" --type=py -B3 -A10 tests/

Repository: supermodeltools/mcpbr

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Look for all filter_category tests to see what's tested
rg -n "filter_category" --type=py -B5 -A10 tests/

Repository: supermodeltools/mcpbr

Length of output: 35465


Language alias mismatch will silently drop TypeScript and JavaScript tasks from filter results.

Here's the concrete issue: the test dataset has "repo_language": "typescript" (long form), but _LANGUAGE_ALIASES maps "typescript" → "ts". When someone filters by ["typescript"], the code resolves it to "ts" (line 104), then compares "ts" == "typescript" — no match, task is skipped silently.

The fix is straightforward — normalize the dataset's language value the same way:

Proposed fix
                     if cat_lower in PRO_LANGUAGES:
                         # Resolve aliases (e.g., "javascript" -> "js")
                         resolved = _LANGUAGE_ALIASES.get(cat_lower, cat_lower)
-                        if resolved == language:
+                        lang_resolved = _LANGUAGE_ALIASES.get(language, language)
+                        if resolved == lang_resolved:
                             filtered.append(task)
                             break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 94 - 112, The filter logic
drops JS/TS because you resolve the filter category via _LANGUAGE_ALIASES but
compare it against the raw task language; update the normalization so both sides
use the same alias mapping: in the loop that reads task.get("repo_language",
"").lower() (variable language) apply _LANGUAGE_ALIASES.get(language, language)
(or otherwise normalize task language to its canonical short form) before
comparing to resolved, and then use that normalized language in the equality
check inside the filter_category handling (functions/vars: filter_category,
tasks, PRO_LANGUAGES, _LANGUAGE_ALIASES, repo_language).
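The suggested fix amounts to normalizing both sides before comparing; a tiny sketch with an abbreviated alias map (illustrative, not the repo's actual code):

```python
# Abbreviated version of the alias map in swebench_pro.py.
_LANGUAGE_ALIASES = {"typescript": "ts", "javascript": "js"}


def languages_match(category: str, task_language: str) -> bool:
    """Compare a filter category and a task's repo_language in the same
    canonical form, so "typescript" matches "ts" and vice versa."""
    cat = _LANGUAGE_ALIASES.get(category.lower(), category.lower())
    lang = _LANGUAGE_ALIASES.get(task_language.lower(), task_language.lower())
    return cat == lang
```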

Comment on lines 93 to 139
class TestBuildProTestCommand:
    """Tests for language-specific test command building."""

    def test_python_delegates(self) -> None:
        """Python should delegate to existing _build_test_command."""
        cmd = _build_pro_test_command("tests/test_foo.py::test_bar", "python")
        assert "pytest" in cmd or "test_foo" in cmd

    def test_go_package_path(self) -> None:
        cmd = _build_pro_test_command("./pkg/router", "go")
        assert "go test" in cmd
        assert "./pkg/router" in cmd
        assert "-v" in cmd

    def test_go_function_name(self) -> None:
        cmd = _build_pro_test_command("TestRouteMatching", "go")
        assert "go test" in cmd
        assert "-run" in cmd
        assert "TestRouteMatching" in cmd

    def test_typescript_file(self) -> None:
        cmd = _build_pro_test_command("src/__tests__/parser.test.ts", "typescript")
        assert "npx jest" in cmd
        assert "parser.test.ts" in cmd

    def test_typescript_pattern(self) -> None:
        cmd = _build_pro_test_command("should parse tokens", "typescript")
        assert "npx jest" in cmd
        assert "-t" in cmd

    def test_javascript_file(self) -> None:
        cmd = _build_pro_test_command("test/index.test.js", "javascript")
        assert "npx jest" in cmd
        assert "index.test.js" in cmd

    def test_javascript_pattern(self) -> None:
        cmd = _build_pro_test_command("handles edge case", "javascript")
        assert "npx jest" in cmd
        assert "-t" in cmd

    def test_prebuilt_conda_activation(self) -> None:
        cmd = _build_pro_test_command("TestFoo", "go", uses_prebuilt=True)
        assert "conda activate testbed" in cmd

    def test_unknown_language_fallback(self) -> None:
        cmd = _build_pro_test_command("test_something", "rust")
        assert "test_something" in cmd

⚠️ Potential issue | 🟡 Minor

Good coverage of the command builder, but missing tests for short-form language identifiers "ts" / "js".

The test class covers Python, Go, TypeScript (long form), JavaScript (long form), prebuilt conda, and unknown fallback — nice work! But there's no test for what happens when language="ts" or language="js" is passed. Given the bug flagged in _build_pro_test_command (where those short forms silently fall through to the raw command fallback), adding tests for these would both document the expected behavior and catch regressions once the fix lands.

Something like:

def test_ts_short_form_uses_jest(self) -> None:
    cmd = _build_pro_test_command("src/test.ts", "ts")
    assert "npx jest" in cmd

def test_js_short_form_uses_jest(self) -> None:
    cmd = _build_pro_test_command("test/index.test.js", "js")
    assert "npx jest" in cmd

Based on learnings: "New features must have comprehensive test coverage; bug fixes should include regression tests; aim for at least 80% code coverage on new code."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_swebench_pro.py` around lines 93 - 139, Add two tests to
TestBuildProTestCommand to cover short-form language identifiers: add
test_ts_short_form_uses_jest and test_js_short_form_uses_jest that call
_build_pro_test_command("src/test.ts","ts") and
_build_pro_test_command("test/index.test.js","js") respectively and assert that
"npx jest" is in the returned cmd (and that the relevant filename appears), so
short forms "ts"/"js" no longer fall back to the raw-command path.

greynewell and others added 7 commits February 25, 2026 18:39
Go: Test IDs like "TestFoo/#00" and "TestFoo//api/v1" are subtests,
not package paths. Always extract the top-level function name and
use -run with ./... to search all packages.

JS/TS: SWE-bench Pro uses "file.js | test description" format.
Split on " | " to get the file path and test name separately.
"test suite" as description runs the whole file without -t filter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Different SWE-bench Pro JS/TS repos use different test frameworks:
- NodeBB: mocha
- element-web: jest
- protonmail/webclients: jest
- tutanota: custom (testdouble)

Added _detect_js_runner() that checks for runner binaries in
node_modules/.bin/ inside the container. Mocha uses --grep for
filtering, jest uses -t, vitest uses -t.

Also refactored JS/TS command building into _build_js_test_command()
for clearer separation of runner-specific logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
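The runner probe described above might look roughly like this. The real `_detect_js_runner()` checks for binaries inside the container; this sketch probes a local path instead, and the probe order and default are assumptions:

```python
from pathlib import Path

# Filter flag per runner, per the commit message: mocha uses --grep,
# jest and vitest use -t.
RUNNER_FLAGS = {"mocha": "--grep", "jest": "-t", "vitest": "-t"}

def detect_js_runner(repo_root: Path) -> str:
    """Return the first test runner with a binary in node_modules/.bin."""
    for runner in ("jest", "mocha", "vitest"):
        if (repo_root / "node_modules" / ".bin" / runner).exists():
            return runner
    return "jest"  # default when nothing is found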
…/TS runner detection

Add docker image prune after each preflight instance to prevent disk
exhaustion on CI runners (each SWE-bench Pro image is ~1.5GB).

Enhance JS/TS test runner detection to check package.json scripts.test
and support ospec, ava, and npm test fallback. Previously, unrecognized
projects (like tutanota) defaulted to jest, causing 100% test failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace custom language-specific test command building (jest/mocha/go test)
with official run_script.sh + parser.py from scaleapi/SWE-bench_Pro-os.
Each of the 11 repos has unique test infrastructure that the official
scripts handle correctly (e.g., Redis for NodeBB, ansible-test for ansible,
custom runners for tutanota). Parser runs locally on the host, avoiding
Python dependency in Go/JS/TS container images.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ation

The official SWE-bench Pro evaluation harness runs before_repo_set_cmd
(last line) between patch application and test execution. This typically
restores specific test files from the fix commit, e.g.:
  git checkout <commit> -- test/tests/SomeTest.ts

Without this, 2 tutanota instances fail because their test files end up
in a broken state after patch + test_patch application.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
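Extracting that last line might be sketched as below. The field name `before_repo_set_cmd` comes from the commit message; the helper name and the decision to skip blank lines are assumptions:

```python
def repo_reset_command(before_repo_set_cmd: str) -> str:
    """Return the last non-empty line of before_repo_set_cmd.

    The official harness runs only this last line between patch
    application and test execution; it typically restores test files,
    e.g. `git checkout <commit> -- test/tests/SomeTest.ts`.
    """
    lines = [ln for ln in before_repo_set_cmd.splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```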
Protonmail/webclients images are ~4.8GB compressed. docker image prune
alone doesn't clear build cache and volumes. docker system prune -af
--volumes reclaims all reclaimable space between instances.