feat: add SWE-bench Pro benchmark and preflight check system #3

greynewell wants to merge 12 commits into main
Conversation
Add multi-language benchmark support (Python, Go, TypeScript, JavaScript) with 731 instances from ScaleAI/SWE-bench_Pro, plus a preflight validation system that verifies golden patches pass all tests before agent evaluation.

Key changes:

- `SWEBenchProBenchmark` class with DockerHub image support and language-specific test runners (Go: `go test`, TS/JS: `npx jest`)
- Preflight check system (`mcpbr preflight` CLI command) that validates golden patches in Docker environments
- Docker image override support (`_image_override`, `_workdir_override`) for non-GHCR registries
- Entrypoint override for images with a `/bin/bash` entrypoint
- Editable reinstall after patching for SWE-bench Pro Python images
- Case-insensitive test list field access (`fail_to_pass`/`FAIL_TO_PASS`)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
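The language-specific test runners described above can be sketched as a small command builder. This is purely illustrative: the helper name `build_test_command` and the exact flags are assumptions, not the actual `SWEBenchProBenchmark` API.

```python
import shlex


def build_test_command(test_id: str, language: str) -> str:
    """Sketch of a per-language test command builder (names are illustrative)."""
    quoted = shlex.quote(test_id)  # defend against shell metacharacters in test ids
    if language == "python":
        return f"python -m pytest -x {quoted} 2>&1"
    if language == "go":
        return f"go test -v -count=1 -run {quoted} ./... 2>&1"
    if language in ("ts", "js", "typescript", "javascript"):
        return f"npx jest -t {quoted} 2>&1"
    raise ValueError(f"unsupported language: {language}")
```

The point of the dispatch is that a single `evaluate()` entry point can route to the right toolchain per instance, rather than assuming pytest everywhere.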
Note: Reviews paused. This branch appears to be under active development; to avoid overwhelming you with review comments from an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable in the settings.
Walkthrough

Adds the SWE-bench Pro multi-language benchmark and wiring; introduces an async preflight validator that applies golden patches and runs language-aware tests in Docker with image/workdir/entrypoint overrides; and adds a `preflight` CLI command.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant CLI as CLI (preflight)
    participant Bench as SWE-bench Pro
    participant Pref as Preflight Orchestrator
    participant DockerMgr as Docker Manager
    participant Container as Container
    User->>CLI: invoke preflight --benchmark swe-bench-pro ...
    CLI->>Bench: load_tasks(filter, sample_size)
    Bench-->>CLI: tasks[]
    CLI->>Pref: run_benchmark_preflight(tasks, max_concurrent, fail_fast)
    par concurrent checks
        Pref->>DockerMgr: create_environment(task)
        DockerMgr->>DockerMgr: _try_pull_image(_image_override?)
        DockerMgr->>Container: start(workdir=_workdir_override?, entrypoint_override?)
        Container-->>Pref: TaskEnvironment ready
        Pref->>Container: apply golden patch
        Pref->>Container: (optional) apply test patch
        Pref->>Container: (if python) pip install -e .
        Pref->>Container: run fail_to_pass tests
        Pref->>Container: run pass_to_pass tests
        Container-->>Pref: test results
        Pref->>Pref: record PreflightResult, cleanup container
    end
    Pref-->>CLI: PreflightReport (per-instance results, success_rate)
    CLI->>User: print report, exit code (0/1)
```
```mermaid
sequenceDiagram
    actor Evaluator
    participant Bench as SWE-bench Pro
    participant Env as TaskEnvironment
    participant Docker as Container
    participant Runner as Language Test Runner
    Evaluator->>Bench: evaluate(env, task, solution_patch)
    Bench->>Env: write patch & git apply
    Env->>Docker: prepare workspace
    alt language == python
        Env->>Docker: pip install -e .
        Docker-->>Runner: run pytest (normalized, quoted ids)
    else language == go
        Docker-->>Runner: go test -v ...
    else language == ts/js
        Docker-->>Runner: npx jest ...
    end
    Runner-->>Bench: pass/fail, outputs
    Bench-->>Evaluator: evaluation dict (status, ftp/ptp metrics)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 7
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mcpbr/docker_env.py (1)
886-892: ⚠️ Potential issue | 🟡 Minor

Bug: error message still says "/testbed" instead of using `source_dir`.

You updated the copy commands (lines 816, 862, 866) to use the dynamic `source_dir`, but the final error message on line 889 is hardcoded to say "from /testbed". If the image uses `/app` (like SWE-bench Pro), this error message will be misleading.

🐛 Proposed fix

```diff
 # --- Exhausted ---
 raise RuntimeError(
-    f"Workspace /workspace appears empty after copy from /testbed "
+    f"Workspace /workspace appears empty after copy from {source_dir} "
     f"(instance={env.instance_id}). "
     f"The filesystem may not have synced correctly."
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/docker_env.py` around lines 886 - 892, The error string raised after the workspace copy still hardcodes "/testbed"; change that message to reference the dynamic source_dir variable so it accurately reflects where files were copied from (use the existing source_dir and env.instance_id in the f-string). Locate the RuntimeError raise in docker_env.py (the block that currently builds the message with env.instance_id) and replace "/testbed" with {source_dir} in the formatted message so it matches the updated copy commands.
🧹 Nitpick comments (7)
src/mcpbr/benchmarks/__init__.py (1)

143-145: Prefer `setdefault` so explicit `dataset=` overrides still work.

Right now this always overwrites caller intent.

💡 Proposed fix

```diff
-    if name == "swe-bench-pro":
-        kwargs["dataset"] = "ScaleAI/SWE-bench_Pro"
+    if name == "swe-bench-pro":
+        kwargs.setdefault("dataset", "ScaleAI/SWE-bench_Pro")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/__init__.py` around lines 143 - 145, The current conditional for name == "swe-bench-pro" unconditionally assigns kwargs["dataset"] = "ScaleAI/SWE-bench_Pro", which overwrites any caller-provided dataset; change this to only apply the default when no dataset was provided by using kwargs.setdefault("dataset", "ScaleAI/SWE-bench_Pro") inside the same branch so explicit dataset= callers continue to take precedence.

tests/test_swebench_pro.py (1)
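To see the difference concretely, here is a standalone sketch (not the actual `create_benchmark` code): direct assignment clobbers a caller-supplied value, while `setdefault` only fills in a missing key.

```python
def make_kwargs_overwrite(**kwargs: str) -> dict:
    kwargs["dataset"] = "ScaleAI/SWE-bench_Pro"  # clobbers the caller's value
    return kwargs


def make_kwargs_default(**kwargs: str) -> dict:
    kwargs.setdefault("dataset", "ScaleAI/SWE-bench_Pro")  # fills only a missing key
    return kwargs
```

With `setdefault`, a caller passing `dataset="custom/ds"` keeps their choice; callers who omit it still get the SWE-bench Pro default.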
96-99: Loose assertion for Python delegation — consider tightening.

The assertion `"pytest" in cmd or "test_foo" in cmd` is pretty permissive. It basically says "as long as the command mentions something vaguely test-related, we're good." Since `_build_pro_test_command` for Python delegates to `_build_test_command`, you could assert more specifically — for example, that `"pytest"` is in the command. If the delegation ever breaks and returns garbage, this test might still pass.

Not a big deal though — this is just a nit.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_swebench_pro.py` around lines 96 - 99, The test test_python_delegates in tests/test_swebench_pro.py currently uses a loose assertion; tighten it to verify Python delegation by asserting that _build_pro_test_command("tests/test_foo.py::test_bar", "python") produces a command that contains "pytest" (i.e., replace the or-check with a single assert "pytest" in cmd) to ensure it truly delegates to _build_test_command.

src/mcpbr/docker_env.py (2)
448-470: Consider DRYing up `_try_pull_image` and `_try_pull_prebuilt`.

These two methods are nearly identical — both pull an image with `platform="linux/amd64"` and catch the same exceptions. The only difference is that `_try_pull_prebuilt` calls `get_swebench_image_name()` first to compute the image name.

You could have `_try_pull_prebuilt` just call `_try_pull_image(get_swebench_image_name(instance_id))` internally. Not critical, but it removes a bit of duplication.

♻️ Suggested refactor

```diff
 async def _try_pull_prebuilt(self, instance_id: str) -> str | None:
     image_name = get_swebench_image_name(instance_id)
-
-    def _pull() -> str | None:
-        try:
-            self.client.images.pull(image_name, platform="linux/amd64")
-            return image_name
-        except docker.errors.ImageNotFound:
-            return None
-        except docker.errors.APIError:
-            return None
-
-    loop = asyncio.get_event_loop()
-    return await loop.run_in_executor(None, _pull)
+    return await self._try_pull_image(image_name)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/docker_env.py` around lines 448 - 470, Refactor the duplicate image-pulling logic by having _try_pull_prebuilt call the existing _try_pull_image instead of duplicating the pull/catch logic: replace the body of _try_pull_prebuilt so it computes the image via get_swebench_image_name(instance_id) and returns await _try_pull_image(image_name). Keep _try_pull_image as the single implementation that does client.images.pull(..., platform="linux/amd64") and handles docker.errors.ImageNotFound and docker.errors.APIError via run_in_executor; remove the redundant pull logic from _try_pull_prebuilt.
617-619: Entrypoint override is applied for all `_image_override` images — is that always correct?

Right now, `has_entrypoint_override = bool(task.get("_image_override"))` means every single override image gets the entrypoint replaced. This works fine if all SWE-bench Pro images set `/bin/bash` as their entrypoint (which the PR description says they do). But if someone later adds an override image that doesn't set an entrypoint, the behavior is still harmless — the entrypoint just becomes `["/bin/sh", "-c", "tail -f /dev/null"]`, which achieves the same keep-alive as the normal `command="tail -f /dev/null"`.

So this is safe, just a bit of an overly broad brush. Worth a comment noting why this is safe for all cases.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/docker_env.py` around lines 617 - 619, Add a short inline comment next to the has_entrypoint_override = bool(task.get("_image_override")) assignment explaining the rationale: we intentionally treat every _image_override as having an entrypoint because if an override image actually lacks an entrypoint, replacing it with our keep-alive entrypoint (["/bin/sh","-c","tail -f /dev/null"]) is harmless and functionally equivalent to setting command="tail -f /dev/null"; reference the has_entrypoint_override variable and the task.get("_image_override") check in the comment so future readers understand this intentional and safe behavior.

tests/test_benchmark_preflight.py (1)
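The rationale can be made concrete with a tiny helper that mirrors the override logic. Note this is purely illustrative; `container_create_kwargs` is not a function in this PR, just a sketch of the decision being reviewed.

```python
def container_create_kwargs(task: dict) -> dict:
    """Illustrative: compute docker-py container args given an image override.

    Override images may ship ENTRYPOINT /bin/bash, so we replace the
    entrypoint with a keep-alive; for an image without an entrypoint this is
    functionally equivalent to passing command="tail -f /dev/null".
    """
    if task.get("_image_override"):
        return {"entrypoint": ["/bin/sh", "-c", "tail -f /dev/null"]}
    return {"command": "tail -f /dev/null"}
```

Either branch leaves the container idling so later `exec` calls can run patches and tests, which is why the broad-brush override is safe.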
315-340: Consider hoisting the import to the top of the file or class level.

Right now, `from mcpbr.evaluation import get_test_list_field` is repeated in every single test method (lines 319, 325, 331, 337). It works fine, but it's a bit noisy. You could import it once at the top of the file alongside the other imports (lines 3-12) and keep things DRY.

That said, this is purely a readability nit — no functional issue.

♻️ Suggested refactor

```diff
 from mcpbr.benchmark_preflight import (
     PreflightReport,
     PreflightResult,
     _check_single_instance,
     run_benchmark_preflight,
 )
+from mcpbr.evaluation import get_test_list_field
```

Then remove the `from mcpbr.evaluation import get_test_list_field` lines inside each test method.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_benchmark_preflight.py` around lines 315 - 340, Hoist the repeated import of get_test_list_field out of each test method by adding a single "from mcpbr.evaluation import get_test_list_field" alongside the module-level imports and deleting the per-test imports inside the TestGetTestListField methods (test_lowercase_field, test_uppercase_field, test_lowercase_preferred, test_missing_field) so the tests use the shared top-level import.

src/mcpbr/benchmarks/swebench_pro.py (2)
241-243: `pass_to_pass_tests[:10]` silently truncates — could hide regressions.

You're capping pass-to-pass tests at 10 for performance, which makes sense for a large test suite. But if a patch breaks test #11, you'd never know. Consider at least logging when truncation happens, so people debugging a missed regression have a breadcrumb.

Something like:

```python
if len(pass_to_pass_tests) > 10:
    logger.info("Truncating pass_to_pass from %d to 10 tests", len(pass_to_pass_tests))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 241 - 243, The call to _run_lang_tests is silently truncating pass_to_pass_tests with pass_to_pass_tests[:10]; before calling _run_lang_tests compute the intended slice (e.g., selected_tests = pass_to_pass_tests[:10]) and if len(pass_to_pass_tests) > len(selected_tests) emit a logger.info message noting the original count and the truncated count (for example: "Truncating pass_to_pass from %d to %d tests"), then pass selected_tests into _run_lang_tests and assign the result to pass_to_pass_results; reference the variables pass_to_pass_tests, selected_tests, _run_lang_tests, and pass_to_pass_results when making this change.
409-416: Shell metacharacters in test identifiers are interpolated unsanitized.

The `test` string is f-string'd directly into the shell command (e.g., `f"go test -v -count=1 -run '{test}' ./... 2>&1"`). Since these come from the HuggingFace dataset (not arbitrary user input), the practical risk is low. But if a test name ever contained a single quote like `Test'Something`, the command would break or behave unexpectedly.

Consider using `shlex.quote(test)` for defense-in-depth:

```python
import shlex

return f"{activate}go test -v -count=1 -run {shlex.quote(test)} ./... 2>&1"
```

Same applies to the TS/JS commands at lines 422 and 425.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 409 - 416, The code interpolates the test identifier directly into shell commands (e.g., in the block checking language == "go") which can break or allow shell injection if the test contains quotes or special chars; fix by importing shlex and using shlex.quote(test) wherever test is placed in shell strings (for example in the go branch return f"... -run {shlex.quote(test)} ..." and similarly for the TS/JS command constructions that inject test), ensuring all interpolated values are properly quoted before being embedded in the command string.
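For reference, `shlex.quote` leaves shell-safe strings untouched and single-quotes anything containing metacharacters, rewriting embedded single quotes so they can no longer terminate the quoted region:

```python
import shlex

# Safe identifiers pass through unchanged.
print(shlex.quote("TestFoo"))

# A pipe character gets wrapped in single quotes.
print(shlex.quote("Test|Other"))

# An embedded single quote is spliced as '"'"' so the shell sees one literal quote.
print(shlex.quote("Test'Something"))
```

This is why wrapping each interpolated `test` value in `shlex.quote(...)` is a cheap defense-in-depth measure even when the data source is trusted.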
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mcpbr/benchmark_preflight.py`:
- Around line 140-157: The preflight currently calls run_tests for ftp_results
and ptp_results which is Python-specific; instead detect the task language
(e.g., from task.get("language") or repo metadata) and dispatch to a
language-specific runner or a unified dispatcher before running tests. Update
the code around run_tests (the calls producing ftp_results and ptp_results) to:
if language is "python" call run_tests with the existing args (env,
fail_to_pass_tests / pass_to_pass_tests, timeout, uses_prebuilt,
workdir=eval_workdir, repo), otherwise call the appropriate runner (e.g.,
run_go_tests, run_js_tests/run_ts_tests) or a new
choose_runner_for_language(task_language, ...) function that accepts the same
parameters; ensure pass_to_pass_tests and fail_to_pass_tests are passed in the
right format for non-Python runners and preserve
timeout/uses_prebuilt/eval_workdir handling.
In `@src/mcpbr/cli.py`:
- Around line 1540-1642: The preflight command ignores the --config option;
update preflight to, when config_path is provided, load the YAML/JSON config
(e.g., with yaml.safe_load or json.load) and pass the parsed settings into
DockerEnvironmentManager instead of always instantiating
DockerEnvironmentManager(use_prebuilt=True). Modify the DockerEnvironmentManager
instantiation in preflight to accept the loaded config (e.g.,
DockerEnvironmentManager(config=cfg, use_prebuilt=True)) or apply the loaded
settings via its public API before use; ensure you handle file read errors and
missing keys gracefully and still default to use_prebuilt=True when no
config_path is supplied.
In `@src/mcpbr/evaluation.py`:
- Around line 404-413: The reinstall step for DockerHub Python images uses
eval_workdir (prebuilt path) which can be /testbed; change it to use the
task-specific workdir (e.g., task.get("workdir") or fallback to eval_workdir) so
SWE-bench Pro images run pip install in the actual repo directory. In the branch
that checks task.get("dockerhub_tag") and repo_language == "python", compute
task_workdir = task.get("workdir", eval_workdir) and pass that as the workdir
argument to env.exec_command (same place where "pip install -e . -q ..." is
invoked) so the reinstall runs in the correct directory.
- Around line 32-54: The evaluate_patch function currently reads uppercase
fields directly for Python tasks (variables fail_to_pass_tests /
pass_to_pass_tests) which breaks on SWE-bench Pro lowercase names; replace the
direct task.get(...) calls in evaluate_patch with calls to
get_test_list_field(task, "fail_to_pass") and get_test_list_field(task,
"pass_to_pass"), then pass those results through parse_test_list (e.g.,
fail_to_pass_tests = parse_test_list(get_test_list_field(task, "fail_to_pass"))
and similarly for pass_to_pass_tests) so Python handling matches the non-Python
branch and supports both conventions.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- CHANGELOG.md
- src/mcpbr/benchmark_preflight.py
- src/mcpbr/benchmarks/__init__.py
- src/mcpbr/benchmarks/swebench_pro.py
- src/mcpbr/cli.py
- src/mcpbr/config.py
- src/mcpbr/docker_env.py
- src/mcpbr/evaluation.py
- src/mcpbr/swebench_test_specs.py
- tests/test_benchmark_preflight.py
- tests/test_swebench_pro.py
```python
await env.exec_command(
    "pip install -e . -q 2>/dev/null || true",
    timeout=120,
    workdir=eval_workdir,
)
```
Don’t hide editable reinstall failures in preflight.
Swallowing install errors (`|| true`) can yield false positives by testing stale installed code.
💡 Proposed fix
```diff
-    if eval_workdir and language == "python":
-        await env.exec_command(
-            "pip install -e . -q 2>/dev/null || true",
-            timeout=120,
-            workdir=eval_workdir,
-        )
+    if eval_workdir and language == "python":
+        exit_code, _stdout, stderr = await env.exec_command(
+            "pip install -e . -q",
+            timeout=120,
+            workdir=eval_workdir,
+        )
+        if exit_code != 0:
+            return PreflightResult(
+                instance_id=instance_id,
+                status="error",
+                error=f"Editable reinstall failed: {stderr[:300]}",
+                language=language,
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmark_preflight.py` around lines 123 - 127, The current
preflight call swallows editable reinstall failures by running env.exec_command
with the command string "pip install -e . -q 2>/dev/null || true"; remove the
"|| true" and the stderr redirection so install errors surface (i.e., call
env.exec_command with a command that fails on error, preserving stderr), keeping
the same timeout and workdir (eval_workdir) and invocation site
(env.exec_command) so the preflight fails when the editable install fails.
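The masking behavior the review objects to is easy to demonstrate in a plain shell: `|| true` rewrites any failure to exit status 0, so the caller can never distinguish a broken install from a clean one. (`no-such-command` below is just a placeholder for any failing command.)

```shell
# `|| true` forces exit status 0, so the caller never sees the failure:
sh -c 'no-such-command 2>/dev/null || true'
echo "masked exit: $?"    # prints: masked exit: 0

# Without it, the real status propagates and preflight can react:
status=0
sh -c 'exit 7' || status=$?
echo "real exit: $status"   # prints: real exit: 7
```

Dropping the `|| true` (and the stderr redirect) is what lets the preflight turn a failed editable install into a visible error instead of silently testing stale site-packages code.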
```python
    max_concurrent: int = 4,
    timeout: int = 300,
    fail_fast: bool = False,
) -> PreflightReport:
    """Run preflight validation on benchmark tasks.

    Applies golden patches and verifies all tests pass for each instance.

    Args:
        benchmark: Benchmark instance.
        tasks: List of task dictionaries to validate.
        docker_manager: Docker environment manager.
        max_concurrent: Maximum concurrent validations.
        timeout: Timeout per test in seconds.
        fail_fast: Stop on first failure.

    Returns:
        PreflightReport with aggregate results.
    """
    report = PreflightReport(total=len(tasks))
    semaphore = asyncio.Semaphore(max_concurrent)
```
Validate `max_concurrent` before creating the semaphore.
`max_concurrent=0` will block all checks forever in this flow.
💡 Proposed fix
```diff
 async def run_benchmark_preflight(
@@
 ) -> PreflightReport:
@@
-    report = PreflightReport(total=len(tasks))
+    if max_concurrent < 1:
+        raise ValueError("max_concurrent must be at least 1")
+
+    report = PreflightReport(total=len(tasks))
     semaphore = asyncio.Semaphore(max_concurrent)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmark_preflight.py` around lines 207 - 227, The code creates an
asyncio.Semaphore(max_concurrent) without validating max_concurrent, so passing
0 will deadlock; update the function that declares max_concurrent to validate it
before creating semaphore (e.g., in the start of the function that constructs
PreflightReport and semaphore) by either raising a ValueError when
max_concurrent < 1 or normalizing it to a safe minimum like 1, then proceed to
create semaphore = asyncio.Semaphore(max_concurrent); ensure references to
max_concurrent and semaphore are adjusted accordingly.
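The deadlock risk and the proposed guard can be demonstrated in isolation. This is an illustrative sketch; `run_with_limit` is not part of the codebase, it just mirrors the semaphore-bounded fan-out the preflight uses.

```python
import asyncio


async def run_with_limit(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks no-op checks, at most max_concurrent at a time."""
    if max_concurrent < 1:
        # asyncio.Semaphore(0) never grants a permit, so without this guard
        # every check would await the semaphore forever.
        raise ValueError("max_concurrent must be at least 1")

    semaphore = asyncio.Semaphore(max_concurrent)
    done = 0

    async def one() -> None:
        nonlocal done
        async with semaphore:
            await asyncio.sleep(0)  # stand-in for a per-instance check
            done += 1

    await asyncio.gather(*(one() for _ in range(n_tasks)))
    return done
```

Raising early turns a silent hang into an immediate, diagnosable error, which is exactly the trade the review is asking for.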
```python
# Reinstall package so patched code is active (SWE-bench Pro images
# install into site-packages, not editable mode)
if eval_workdir and language == "python":
    await env.exec_command(
        "pip install -e . -q 2>/dev/null || true",
        timeout=120,
        workdir=eval_workdir,
    )
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, let's find and examine the _evaluate_multilang function and where it's called
rg -n '_evaluate_multilang' --type=py -B5 -A10
```

🏁 Script executed:

```shell
# Look at the evaluate() method around lines 180-192
sed -n '175,195p' src/mcpbr/benchmarks/swebench_pro.py
```

🏁 Script executed:

```shell
# Read lines 193 onwards to see the _evaluate_multilang function, especially lines 224-231
sed -n '193,235p' src/mcpbr/benchmarks/swebench_pro.py
```

Repository: supermodeltools/mcpbr
Dead code: the `pip install -e .` block at lines 224-231 can never execute.

Here's what's happening — in the `evaluate()` method, there's an if/else split: if the language is Python, it handles evaluation right there and returns early (line 188). For everything else (non-Python languages), it delegates to `_evaluate_multilang()` at line 191.

So inside `_evaluate_multilang()`, the `language` parameter can only be something other than Python. The condition at line 226 (`if eval_workdir and language == "python"`) will never be true — it's checking for a language that's already been ruled out upstream.

This pip reinstall block looks like it was copied from the Python evaluation path but never actually applies to multilang evaluation. If you need this for Python tasks, it belongs earlier in the `evaluate()` method (before calling `evaluate_patch()`). If not, just remove it.
💡 Proposed fix — remove the dead block

```diff
     if test_patch:
         await _apply_test_patch(env, test_patch, workdir=eval_workdir)

-    # Reinstall package so patched code is active (SWE-bench Pro images
-    # install into site-packages, not editable mode)
-    if eval_workdir and language == "python":
-        await env.exec_command(
-            "pip install -e . -q 2>/dev/null || true",
-            timeout=120,
-            workdir=eval_workdir,
-        )
-
     fail_to_pass_str = get_test_list_field(task, "fail_to_pass")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 224 - 231, The block that
runs "pip install -e ." inside _evaluate_multilang() is dead because evaluate()
already returns early for language == "python", so remove this unreachable code
from _evaluate_multilang(); alternatively, if a reinstall is actually needed for
Python evaluations, move the pip reinstall logic into evaluate() before the call
to evaluate_patch() (referencing evaluate(), _evaluate_multilang(),
evaluate_patch(), the language variable, and eval_workdir to locate the relevant
logic).
```python
@click.option(
    "-c",
    "--config",
    "config_path",
    type=click.Path(exists=True),
    help="Path to configuration YAML file (used for Docker settings).",
)
@click.option(
    "-b",
    "--benchmark",
    "benchmark_name",
    type=click.Choice(list(VALID_BENCHMARKS)),
    default="swe-bench-pro",
    help="Benchmark to validate (default: swe-bench-pro).",
)
@click.option(
    "-n",
    "--sample",
    "sample_size",
    type=int,
    default=None,
    help="Number of instances to validate (default: all).",
)
@click.option(
    "--task",
    "task_ids",
    multiple=True,
    help="Specific task ID(s) to validate.",
)
@click.option(
    "--max-concurrent",
    type=int,
    default=4,
    help="Maximum concurrent validations (default: 4).",
)
@click.option(
    "--fail-fast",
    is_flag=True,
    help="Stop on first failure.",
)
@click.option(
    "--filter-category",
    multiple=True,
    help="Filter by language or repo substring.",
)
@click.option(
    "--timeout",
    type=int,
    default=300,
    help="Timeout per test in seconds (default: 300).",
)
def preflight(
    config_path: str | None,
    benchmark_name: str,
    sample_size: int | None,
    task_ids: tuple[str, ...],
    max_concurrent: int,
    fail_fast: bool,
    filter_category: tuple[str, ...],
    timeout: int,
) -> None:
    """Validate golden patches pass all tests before evaluation.

    Runs the benchmark's golden (reference) patches against Docker
    environments and verifies all tests pass. Use this to catch
    environment or configuration issues before running agent evaluations.

    \b
    Examples:
        mcpbr preflight -b swe-bench-pro -n 5           # Check 5 instances
        mcpbr preflight --fail-fast                     # Stop on first failure
        mcpbr preflight --filter-category python -n 10  # Check 10 Python instances
        mcpbr preflight --task django__django-16046     # Check specific instance
    """
    from .benchmark_preflight import run_benchmark_preflight
    from .benchmarks import create_benchmark
    from .docker_env import DockerEnvironmentManager

    benchmark = create_benchmark(benchmark_name)

    # Load tasks
    task_id_list = list(task_ids) if task_ids else None
    category_list = list(filter_category) if filter_category else None

    console.print(f"[bold]Preflight Check: {benchmark_name}[/bold]\n")
    dataset_name = getattr(benchmark, "dataset", benchmark_name)
    console.print(f"Loading tasks from {dataset_name}...")

    tasks = benchmark.load_tasks(
        sample_size=sample_size,
        task_ids=task_id_list,
        filter_category=category_list,
    )

    if not tasks:
        console.print("[yellow]No tasks found matching the criteria.[/yellow]")
        return

    console.print(f"Validating {len(tasks)} instance(s)...\n")

    # Create Docker manager
    docker_manager = DockerEnvironmentManager(use_prebuilt=True)
```
--config is currently ignored by mcpbr preflight.
The option says it drives Docker settings, but the command never loads config and always creates DockerEnvironmentManager(use_prebuilt=True).
💡 Proposed fix
def preflight(
@@
) -> None:
@@
- benchmark = create_benchmark(benchmark_name)
+ benchmark = create_benchmark(benchmark_name)
+
+ cfg = load_config(Path(config_path)) if config_path else None
@@
- # Create Docker manager
- docker_manager = DockerEnvironmentManager(use_prebuilt=True)
+ # Create Docker manager (respect config when provided)
+ docker_manager = DockerEnvironmentManager(
+ use_prebuilt=cfg.use_prebuilt_images if cfg is not None else True,
+ extra_volumes=cfg.volumes if cfg is not None else None,
+ claude_code_version=cfg.claude_code_version if cfg is not None else None,
+ )
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 1592-1592: Unused function argument: config_path
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/cli.py` around lines 1540 - 1642, The preflight command ignores the
--config option; update preflight to, when config_path is provided, load the
YAML/JSON config (e.g., with yaml.safe_load or json.load) and pass the parsed
settings into DockerEnvironmentManager instead of always instantiating
DockerEnvironmentManager(use_prebuilt=True). Modify the DockerEnvironmentManager
instantiation in preflight to accept the loaded config (e.g.,
DockerEnvironmentManager(config=cfg, use_prebuilt=True)) or apply the loaded
settings via its public API before use; ensure you handle file read errors and
missing keys gracefully and still default to use_prebuilt=True when no
config_path is supplied.
```python
def get_test_list_field(task: dict[str, Any], field_name: str) -> str:
    """Get a test list field from a task, checking both lowercase and uppercase names.

    SWE-bench uses FAIL_TO_PASS/PASS_TO_PASS while SWE-bench Pro uses
    fail_to_pass/pass_to_pass. This helper provides backward-compatible access.

    Args:
        task: Task dictionary.
        field_name: Field name in lowercase (e.g., "fail_to_pass").

    Returns:
        Field value as string, or "[]" if not found.
    """
    # Try lowercase first (SWE-bench Pro convention)
    value = task.get(field_name)
    if value is not None:
        return str(value)
    # Fall back to uppercase (SWE-bench convention)
    upper_name = field_name.upper()
    value = task.get(upper_name)
    if value is not None:
        return str(value)
    return "[]"
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check usage of get_test_list_field and how test fields are accessed
rg -n -C3 'evaluate_patch\(' src/mcpbr/benchmarks/swebench_pro.py src/mcpbr/evaluation.py
rg -n -C3 'FAIL_TO_PASS|PASS_TO_PASS|fail_to_pass|pass_to_pass|get_test_list_field' src/mcpbr/benchmarks/swebench_pro.py src/mcpbr/evaluation.py

Repository: supermodeltools/mcpbr
Length of output: 10673
Fix evaluate_patch() to use get_test_list_field() for backward compatibility with SWE-bench Pro lowercase field names.
When Python SWE-bench Pro tasks hit evaluate_patch() at line 187, they have lowercase field names like fail_to_pass, but the function reads uppercase names like FAIL_TO_PASS directly (lines 391-392). Replace those hardcoded .get() calls with get_test_list_field():
fail_to_pass_tests = parse_test_list(get_test_list_field(task, "fail_to_pass"))
pass_to_pass_tests = parse_test_list(get_test_list_field(task, "pass_to_pass"))

Currently, non-Python languages (lines 233-234) handle this correctly, but Python tasks silently get empty test lists instead of actual tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/evaluation.py` around lines 32 - 54, The evaluate_patch function
currently reads uppercase fields directly for Python tasks (variables
fail_to_pass_tests / pass_to_pass_tests) which breaks on SWE-bench Pro lowercase
names; replace the direct task.get(...) calls in evaluate_patch with calls to
get_test_list_field(task, "fail_to_pass") and get_test_list_field(task,
"pass_to_pass"), then pass those results through parse_test_list (e.g.,
fail_to_pass_tests = parse_test_list(get_test_list_field(task, "fail_to_pass"))
and similarly for pass_to_pass_tests) so Python handling matches the non-Python
branch and supports both conventions.
```python
elif task.get("dockerhub_tag") and task.get("repo_language", "python").lower() == "python":
    # SWE-bench Pro images install packages into site-packages (not
    # editable). After patching we must reinstall so the new code is
    # importable.
    with contextlib.suppress(TimeoutError):
        await env.exec_command(
            "pip install -e . -q 2>/dev/null || true",
            timeout=120,
            workdir=eval_workdir,
        )
```
Use the actual task workdir for SWE-bench Pro DockerHub images before reinstall.
This branch is for DockerHub images, but eval_workdir is still derived from the generic prebuilt path, which can target /testbed instead of /app.
💡 Proposed fix
- # For pre-built images, apply patch and run tests in /testbed
- # because that's where the editable install points to
- eval_workdir = "/testbed" if env.uses_prebuilt else None
+ # For pre-built images, use the environment workdir (supports /testbed and /app overrides)
+ eval_workdir = env.workdir if env.uses_prebuilt else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/evaluation.py` around lines 404 - 413, The reinstall step for
DockerHub Python images uses eval_workdir (prebuilt path) which can be /testbed;
change it to use the task-specific workdir (e.g., task.get("workdir") or
fallback to eval_workdir) so SWE-bench Pro images run pip install in the actual
repo directory. In the branch that checks task.get("dockerhub_tag") and
repo_language == "python", compute task_workdir = task.get("workdir",
eval_workdir) and pass that as the workdir argument to env.exec_command (same
place where "pip install -e . -q ..." is invoked) so the reinstall runs in the
correct directory.
Test IDs from SWE-bench Pro datasets can contain single quotes and literal \uXXXX escape sequences that break shell command construction. - Replace bare single-quote wrapping with shlex.quote() in _build_test_command and _build_pro_test_command - Add _normalize_test_id() to decode \uXXXX to actual unicode chars - Fixes qutebrowser PTP test failures where 8/10 tests had IDs containing embedded single quotes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (3)
src/mcpbr/benchmarks/swebench_pro.py (1)
224-231: Remove unreachable Python reinstall block from `_evaluate_multilang()`.
`evaluate()` returns early for Python tasks (Line 185–188), so Line 226 (`language == "python"`) can't be true in this function.
💡 Suggested cleanup
-        # Reinstall package so patched code is active (SWE-bench Pro images
-        # install into site-packages, not editable mode)
-        if eval_workdir and language == "python":
-            await env.exec_command(
-                "pip install -e . -q 2>/dev/null || true",
-                timeout=120,
-                workdir=eval_workdir,
-            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 224 - 231, The reinstall pip block inside _evaluate_multilang is unreachable because evaluate() returns early for Python tasks, so remove the conditional block that checks "if eval_workdir and language == 'python':" (including the await env.exec_command call) from _evaluate_multilang; ensure any necessary reinstall logic for Python remains only in evaluate() or is relocated to a shared helper if needed—search for the symbols _evaluate_multilang, evaluate(), eval_workdir, and language to find and delete or move this code.
src/mcpbr/evaluation.py (2)
403-404: ⚠️ Potential issue | 🔴 Critical
Use `get_test_list_field()` here to preserve SWE-bench Pro compatibility.
Line 403 and Line 404 still read uppercase-only keys, so lowercase `fail_to_pass`/`pass_to_pass` tasks can evaluate with empty lists.
💡 Suggested fix
-    fail_to_pass_tests = parse_test_list(task.get("FAIL_TO_PASS", "[]"))
-    pass_to_pass_tests = parse_test_list(task.get("PASS_TO_PASS", "[]"))
+    fail_to_pass_tests = parse_test_list(get_test_list_field(task, "fail_to_pass"))
+    pass_to_pass_tests = parse_test_list(get_test_list_field(task, "pass_to_pass"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/evaluation.py` around lines 403 - 404, The code is using task.get("FAIL_TO_PASS", ...) and task.get("PASS_TO_PASS", ...) which misses lowercase keys; replace these two calls to first retrieve the test-list value via get_test_list_field(task, "fail_to_pass", "FAIL_TO_PASS") and get_test_list_field(task, "pass_to_pass", "PASS_TO_PASS") and then pass those results into parse_test_list (i.e., fail_to_pass_tests = parse_test_list(get_test_list_field(...)) and pass_to_pass_tests = parse_test_list(get_test_list_field(...))). This preserves SWE-bench Pro compatibility while keeping parse_test_list usage.
385-387: ⚠️ Potential issue | 🔴 Critical
Don't hardcode `/testbed` for all prebuilt images.
Line 387 forces `/testbed`, but SWE-bench Pro DockerHub images use `/app`; then Line 424 runs the reinstall in the wrong directory.
💡 Suggested fix
-    # For pre-built images, apply patch and run tests in /testbed
-    # because that's where the editable install points to
-    eval_workdir = "/testbed" if env.uses_prebuilt else None
+    # For pre-built images, use the environment's configured workdir
+    # (supports both /testbed and /app overrides).
+    eval_workdir = env.workdir if env.uses_prebuilt else None
@@
-    await env.exec_command(
+    task_workdir = task.get("workdir", eval_workdir)
+    await env.exec_command(
         "pip install -e . -q 2>/dev/null || true",
         timeout=120,
-        workdir=eval_workdir,
+        workdir=task_workdir,
     )

Also applies to: 421-425
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/evaluation.py` around lines 385 - 387, When selecting the workdir for pre-built images, don't hardcode "/testbed"; instead, when env.uses_prebuilt is true use a configurable attribute (e.g., eval_workdir = getattr(env, "prebuilt_workdir", "/testbed")) so images that set prebuilt_workdir="/app" are respected; then pass that eval_workdir into the reinstall/test invocation (replace any hardcoded paths used around the reinstall call with workdir=eval_workdir) so the reinstall runs in the correct directory for SWE-bench Pro and other prebuilt images.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 224-231: The reinstall pip block inside _evaluate_multilang is
unreachable because evaluate() returns early for Python tasks, so remove the
conditional block that checks "if eval_workdir and language == 'python':"
(including the await env.exec_command call) from _evaluate_multilang; ensure any
necessary reinstall logic for Python remains only in evaluate() or is relocated
to a shared helper if needed—search for the symbols _evaluate_multilang,
evaluate(), eval_workdir, and language to find and delete or move this code.
In `@src/mcpbr/evaluation.py`:
- Around line 403-404: The code is using task.get("FAIL_TO_PASS", ...) and
task.get("PASS_TO_PASS", ...) which misses lowercase keys; replace these two
calls to first retrieve the test-list value via get_test_list_field(task,
"fail_to_pass", "FAIL_TO_PASS") and get_test_list_field(task, "pass_to_pass",
"PASS_TO_PASS") and then pass those results into parse_test_list (i.e.,
fail_to_pass_tests = parse_test_list(get_test_list_field(...)) and
pass_to_pass_tests = parse_test_list(get_test_list_field(...))). This preserves
SWE-bench Pro compatibility while keeping parse_test_list usage.
- Around line 385-387: When selecting the workdir for pre-built images, don't
hardcode "/testbed"; instead, when env.uses_prebuilt is true use a configurable
attribute (e.g., eval_workdir = getattr(env, "prebuilt_workdir", "/testbed")) so
images that set prebuilt_workdir="/app" are respected; then pass that
eval_workdir into the reinstall/test invocation (replace any hardcoded paths
used around the reinstall call with workdir=eval_workdir) so the reinstall runs
in the correct directory for SWE-bench Pro and other prebuilt images.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/mcpbr/benchmarks/codegraph.py
- src/mcpbr/benchmarks/swebench_pro.py
- src/mcpbr/evaluation.py
_normalize_test_id was incorrectly converting literal \uXXXX sequences to actual unicode characters. SWE-bench Pro test IDs already contain the correct literal sequences that match pytest node IDs — converting them breaks test matching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
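Why decoding breaks matching can be shown with a toy ID (illustrative, not a real dataset entry): `unicode_escape` decoding produces a different string than the literal form pytest expects.

```python
# Toy node ID containing a literal backslash-u sequence, as SWE-bench Pro
# test IDs do (illustrative value only).
raw_id = r"tests/test_x.py::test_name[caf\u00e9]"

# Decoding the escape yields a *different* string with a real accented
# character in it, so it would no longer match a pytest node ID that uses
# the literal "\u00e9" form.
decoded = raw_id.encode().decode("unicode_escape")
assert decoded != raw_id
assert "\\u00e9" in raw_id
assert "\\u00e9" not in decoded
```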
The dataset uses 'js' and 'ts' as language values, but filter_category accepted 'javascript' and 'typescript' without mapping to the dataset values. Added _LANGUAGE_ALIASES dict to resolve user-friendly names to dataset values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
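A minimal sketch of the alias resolution the commit describes; the actual `_LANGUAGE_ALIASES` contents and the resolver's name may differ from this stand-in:

```python
# Sketch: map user-friendly filter values to the dataset's language values.
_LANGUAGE_ALIASES = {"javascript": "js", "typescript": "ts"}


def resolve_language_filter(category: str) -> str:
    """Return the dataset value for a user-supplied language filter."""
    name = category.lower()
    return _LANGUAGE_ALIASES.get(name, name)


assert resolve_language_filter("TypeScript") == "ts"
assert resolve_language_filter("js") == "js"  # dataset short forms pass through
assert resolve_language_filter("python") == "python"
```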
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/mcpbr/benchmarks/swebench_pro.py (4)
272-280: Return type `Any` hides the actual `TestResults` contract from callers.
Both code paths return a `TestResults` instance. Annotating it as `Any` means type checkers can't catch misuse of `.passed`/`.total` at the call sites.
♻️ Proposed fix
+from ..evaluation import TestResults

     async def _run_lang_tests(
         self,
         env: TaskEnvironment,
         tests: list[str],
         language: str,
         workdir: str | None = None,
         timeout: int = 120,
-    ) -> Any:
+    ) -> "TestResults":

(The `TestResults` import inside the function body at line 298 can move to the top-level import block too.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 272 - 280, The function _run_lang_tests currently annotates its return type as Any but always returns a TestResults instance; change the async def _run_lang_tests(...) -> Any to -> TestResults and update the import by moving the TestResults import (now inside the function body) to the module-level imports so type checkers can see the symbol; ensure any callers still compile and remove the internal import at line 298.
407-409: Move `shlex` and internal imports to the module-level import block.
There's no circular import or optional-dependency reason to lazy-import `shlex` (stdlib) or `_build_test_command`/`_normalize_test_id` (same package) inside the function body. Deferred imports are a code smell unless you need them.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 407 - 409, The imports for shlex and the package-internal functions _build_test_command and _normalize_test_id are currently deferred inside the function body; move those three imports into the module-level import block with the other imports so they are imported at module load time and remove the lazy imports from the function; keep the same import names (shlex, _build_test_command, _normalize_test_id) so existing references in the function continue to work.
60-65: Consider silencing the Ruff ARG002 warnings on intentionally unused params.
`level`, `filter_difficulty`, and `filter_tags` are documented as unused, but Ruff still flags them. The cleanest way to fix this without changing the signature (which is probably dictated by a base class interface) is a targeted suppression:
🔧 Proposed suppression
     def load_tasks(
         self,
         sample_size: int | None = None,
         task_ids: list[str] | None = None,
-        level: int | None = None,
-        filter_difficulty: list[str] | None = None,
+        level: int | None = None,  # noqa: ARG002
+        filter_difficulty: list[str] | None = None,  # noqa: ARG002
         filter_category: list[str] | None = None,
-        filter_tags: list[str] | None = None,
+        filter_tags: list[str] | None = None,  # noqa: ARG002
     ) -> list[dict[str, Any]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 60 - 65, Ruff is flagging ARG002 for intentionally unused parameters level, filter_difficulty, and filter_tags in the function signature; silence these with targeted per-parameter noqa comments by appending "# noqa: ARG002" to each unused parameter in the signature (e.g., level: int | None = None # noqa: ARG002), leaving the overall signature unchanged and preserving the base-class interface.
260-269: `if fail_to_pass_results:` / `if pass_to_pass_results:` don't actually guard against anything — these checks are always true.
Here's the deal: `_run_lang_tests` always returns a `TestResults` instance (it never returns `None`). Since `TestResults` is a dataclass without a custom `__bool__` method, it's always truthy in Python — just like any other object instance would be. So those `if` checks will always pass, and both keys get added to the result dict every time.
This isn't a bug, but the conditions are misleading. They look like safety checks, but they're not doing any actual guarding. You can simplify this:
♻️ Suggested simplification
-        if fail_to_pass_results:
-            result["fail_to_pass"] = {
+        result["fail_to_pass"] = {
             "passed": fail_to_pass_results.passed,
             "total": fail_to_pass_results.total,
-            }
-        if pass_to_pass_results:
-            result["pass_to_pass"] = {
+        }
+        result["pass_to_pass"] = {
             "passed": pass_to_pass_results.passed,
             "total": pass_to_pass_results.total,
-            }
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 260 - 269, The current guards "if fail_to_pass_results:" and "if pass_to_pass_results:" are misleading because _run_lang_tests always returns a TestResults dataclass instance (which is always truthy); remove those redundant if checks and unconditionally assign the dictionaries using fail_to_pass_results and pass_to_pass_results (e.g., set result["fail_to_pass"] = {"passed": fail_to_pass_results.passed, "total": fail_to_pass_results.total} and similarly for pass_to_pass) so the code is simpler and clearer.
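The truthiness claim above is pure Python semantics and easy to verify with a stand-in dataclass:

```python
from dataclasses import dataclass


@dataclass
class TestResults:  # stand-in for the project's dataclass, not its real definition
    passed: int = 0
    total: int = 0


# A dataclass defines no __bool__ (and no __len__), so instances use
# default object truthiness: always True, even for an "empty" result.
assert bool(TestResults()) is True
assert bool(TestResults(passed=0, total=0)) is True
```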
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 430-437: The TypeScript/JavaScript branch currently checks if
language in ("typescript", "javascript") and thus misses the short identifiers
"ts" and "js" from PRO_LANGUAGES; update the conditional that builds the Jest
command (the if language in (...) branch that returns f"{activate}npx jest ...")
to include "ts" and "js" so short-form tasks use Jest rather than falling back
to the raw shell path, and similarly review/update the analogous Go branch (the
if language in (...) branch near the Go handling) to include any short Go
identifiers if present in PRO_LANGUAGES.
---
Nitpick comments:
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 272-280: The function _run_lang_tests currently annotates its
return type as Any but always returns a TestResults instance; change the async
def _run_lang_tests(...) -> Any to -> TestResults and update the import by
moving the TestResults import (now inside the function body) to the module-level
imports so type checkers can see the symbol; ensure any callers still compile
and remove the internal import at line 298.
- Around line 407-409: The imports for shlex and the package-internal functions
_build_test_command and _normalize_test_id are currently deferred inside the
function body; move those three imports into the module-level import block with
the other imports so they are imported at module load time and remove the lazy
imports from the function; keep the same import names (shlex,
_build_test_command, _normalize_test_id) so existing references in the function
continue to work.
- Around line 60-65: Ruff is flagging ARG002 for intentionally unused parameters
level, filter_difficulty, and filter_tags in the function signature; silence
these with targeted per-parameter noqa comments by appending "# noqa: ARG002" to
each unused parameter in the signature (e.g., level: int | None = None # noqa:
ARG002), leaving the overall signature unchanged and preserving the base-class
interface.
- Around line 260-269: The current guards "if fail_to_pass_results:" and "if
pass_to_pass_results:" are misleading because _run_lang_tests always returns a
TestResults dataclass instance (which is always truthy); remove those redundant
if checks and unconditionally assign the dictionaries using fail_to_pass_results
and pass_to_pass_results (e.g., set result["fail_to_pass"] = {"passed":
fail_to_pass_results.passed, "total": fail_to_pass_results.total} and similarly
for pass_to_pass) so the code is simpler and clearer.
src/mcpbr/benchmarks/swebench_pro.py
Outdated
```python
if language in ("typescript", "javascript"):
    # Jest-style test identifiers
    if "/" in test or test.endswith((".ts", ".js", ".tsx", ".jsx")):
        # File path
        return f"{activate}npx jest {shlex.quote(test)} --verbose --no-cache 2>&1"
    else:
        # Test name pattern
        return f"{activate}npx jest -t {shlex.quote(test)} --verbose --no-cache 2>&1"
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for _LANGUAGE_ALIASES and understand the alias mapping
rg -n "_LANGUAGE_ALIASES" --type=py -B2 -A10Repository: supermodeltools/mcpbr
Length of output: 2066
🏁 Script executed:
#!/bin/bash
# Look at the evaluate() function to see how language is retrieved and passed
rg -n "def evaluate" --type=py -A20 | head -60Repository: supermodeltools/mcpbr
Length of output: 3829
🏁 Script executed:
#!/bin/bash
# Check the _build_pro_test_command function signature and context
sed -n '420,450p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr
Length of output: 992
🏁 Script executed:
#!/bin/bash
# Search for how repo_language values are set in the dataset
rg -n "repo_language" --type=py -B2 -A2Repository: supermodeltools/mcpbr
Length of output: 8486
🏁 Script executed:
#!/bin/bash
# Search for where the dataset is loaded/fetched
rg -n "load_pro_dataset\|json\.load\|\.jsonl" --type=py -B2 -A5Repository: supermodeltools/mcpbr
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Check what SWEBenchProBenchmark.evaluate() gets from the task parameter
sed -n '188,210p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr
Length of output: 876
🏁 Script executed:
#!/bin/bash
# Look for any place where language might be transformed before _build_pro_test_command
rg -n "_build_pro_test_command\|_evaluate_multilang\|_run_lang_tests" --type=py -B5 -A3Repository: supermodeltools/mcpbr
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Get the full _evaluate_multilang implementation
sed -n '207,250p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr
Length of output: 1877
🏁 Script executed:
#!/bin/bash
# Search for all calls to _build_pro_test_command to see what gets passed
rg -n "_build_pro_test_command" --type=py -B3 -A1Repository: supermodeltools/mcpbr
Length of output: 4080
🏁 Script executed:
#!/bin/bash
# Check test files to see if there are actual dataset loading tests
find . -name "*.json" -o -name "*.jsonl" | head -20Repository: supermodeltools/mcpbr
Length of output: 312
🏁 Script executed:
#!/bin/bash
# Look at the entire _build_pro_test_command function from start to finish
sed -n '410,460p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr
Length of output: 1267
🏁 Script executed:
#!/bin/bash
# Check if there's any transformation of repo_language values in the load process
rg -n "repo_language" --type=py -B3 -A3 | head -100Repository: supermodeltools/mcpbr
Length of output: 6225
🏁 Script executed:
#!/bin/bash
# Look at what the actual tests show - are they PASSING or FAILING with current code?
grep -n "test_typescript\|test_javascript" tests/test_swebench_pro.py -A5Repository: supermodeltools/mcpbr
Length of output: 957
🏁 Script executed:
#!/bin/bash
# Check if there are any fixtures or sample data that show actual repo_language values
find . -type f \( -name "*.py" -o -name "*.json" \) -exec grep -l "repo_language.*[ts]\|repo_language.*js" {} \;

Repository: supermodeltools/mcpbr
Length of output: 229
🏁 Script executed:
#!/bin/bash
# Let me check PRO_LANGUAGES constant to see what values are expected
rg -n "PRO_LANGUAGES" --type=py -B2 -A10Repository: supermodeltools/mcpbr
Length of output: 3635
🏁 Script executed:
#!/bin/bash
# Check if the tests for typescript/javascript actually pass with the current code
# by looking at the test assertions more carefully
sed -n '113,132p' tests/test_swebench_pro.py

Repository: supermodeltools/mcpbr
Length of output: 819
🏁 Script executed:
#!/bin/bash
# Verify if there's any place in the code that might normalize repo_language to short form
rg -n "\.get\(.*repo_language" --type=py -B2 -A2Repository: supermodeltools/mcpbr
Length of output: 2154
🏁 Script executed:
#!/bin/bash
# Run the specific test to see if it's currently failing
cd /root && python -m pytest tests/test_swebench_pro.py::TestSWEBenchProInit::test_pro_languages_set -xvs 2>&1

Repository: supermodeltools/mcpbr
Length of output: 120
🏁 Script executed:
#!/bin/bash
# Check if the test assertion matches what PRO_LANGUAGES actually contains
rg -n "test_pro_languages_set" --type=py -A5Repository: supermodeltools/mcpbr
Length of output: 441
🏁 Script executed:
#!/bin/bash
# Get the full PRO_LANGUAGES definition
sed -n '29,29p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr
Length of output: 137
🏁 Script executed:
#!/bin/bash
# Search for any actual dataset loading or task creation that would show real repo_language values
rg -n "load_dataset\|from_dataset" --type=py -B2 -A10Repository: supermodeltools/mcpbr
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Check the filter_by logic more carefully to understand when short vs long forms are used
sed -n '95,115p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr
Length of output: 1068
🏁 Script executed:
#!/bin/bash
# Check if there's any normalization happening before _build_pro_test_command is called
# Look at _run_lang_tests to trace the full flow
sed -n '290,320p' src/mcpbr/benchmarks/swebench_pro.py

Repository: supermodeltools/mcpbr
Length of output: 1165
Add support for short-form language identifiers ("ts", "js") in the TypeScript/JavaScript test builder.
The check at line 430 only matches the long forms ("typescript", "javascript"), but PRO_LANGUAGES explicitly includes the short forms "ts" and "js" (line 29), indicating the dataset uses them. Any TypeScript or JavaScript task with those short forms will silently fall through to the bare fallback at line 440, running the test identifier as a raw shell command instead of with Jest—breaking test execution entirely.
The fix is straightforward:
🐛 Proposed fix
- if language in ("typescript", "javascript"):
+ if language in ("typescript", "javascript", "ts", "js"):

(The same issue should be checked for the Go branch at line 420 if Go also uses short forms in the dataset.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 430 - 437, The
TypeScript/JavaScript branch currently checks if language in ("typescript",
"javascript") and thus misses the short identifiers "ts" and "js" from
PRO_LANGUAGES; update the conditional that builds the Jest command (the if
language in (...) branch that returns f"{activate}npx jest ...") to include "ts"
and "js" so short-form tasks use Jest rather than falling back to the raw shell
path, and similarly review/update the analogous Go branch (the if language in
(...) branch near the Go handling) to include any short Go identifiers if
present in PRO_LANGUAGES.
Preflight was using the Python-only run_tests() (pytest) for all languages, causing 100% failure for Go, JavaScript, and TypeScript instances. Now routes non-Python languages through _build_pro_test_command() which generates the correct commands: - Go: go test -v -count=1 ... - JS/TS: npx jest ... Also fixes conda activation bug in _run_lang_tests: SWE-bench Pro images don't have conda, so uses_prebuilt=False is passed for non-Python test commands to avoid prepending conda activation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
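The routing the commit describes can be sketched as a small dispatcher. This is a hypothetical stand-in for `_build_pro_test_command`: the commit message elides the arguments after `go test -v -count=1` and `npx jest`, so the `-run` and `-t` flag usage below is an assumption, not the project's actual command construction.

```python
import shlex


def build_pro_test_command(language: str, test: str) -> str:
    """Hypothetical sketch of per-language test routing; real flags may differ."""
    lang = language.lower()
    if lang == "go":
        # Assumption: select a single Go test by name with -run.
        return f"go test -v -count=1 -run {shlex.quote(test)} ./..."
    if lang in ("js", "ts", "javascript", "typescript"):
        # Assumption: select a Jest test by name pattern with -t.
        return f"npx jest -t {shlex.quote(test)}"
    # Python falls back to pytest on the node ID.
    return f"python -m pytest {shlex.quote(test)}"


assert build_pro_test_command("go", "TestParse").startswith("go test -v -count=1")
assert build_pro_test_command("ts", "renders widget").startswith("npx jest -t")
assert build_pro_test_command("python", "tests/test_a.py::test_one").startswith("python -m pytest")
```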
Actionable comments posted: 2
♻️ Duplicate comments (4)
src/mcpbr/benchmark_preflight.py (2)
200-205: ⚠️ Potential issue | 🟠 Major
Editable reinstall failures are still silently swallowed.
Hey, so the `|| true` at the end of the pip command (line 202) means if the install fails — like a missing `setup.py`, broken dependencies, whatever — you'll never know. The preflight will happily keep going and run tests against the old installed code, giving you a false "passed" result. That's the opposite of what preflight is supposed to catch!
Think of it this way: if you order food and the kitchen says "we might have burned it, but let's serve it anyway" — that defeats the purpose of quality checks.
Proposed fix
         if eval_workdir and language == "python":
-            await env.exec_command(
-                "pip install -e . -q 2>/dev/null || true",
+            exit_code, _stdout, stderr = await env.exec_command(
+                "pip install -e . -q",
                 timeout=120,
                 workdir=eval_workdir,
             )
+            if exit_code != 0:
+                return PreflightResult(
+                    instance_id=instance_id,
+                    status="error",
+                    error=f"Editable reinstall failed: {stderr[:300] if stderr else 'unknown error'}",
+                    language=language,
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmark_preflight.py` around lines 200 - 205, The pip reinstall command is silently swallowing failures via "|| true"; update the block that runs env.exec_command when eval_workdir and language == "python" to remove the "|| true" so failures surface, and ensure exec_command's non-zero exit is propagated (or explicitly check its result) so the preflight aborts and logs the error; refer to the eval_workdir conditional and the call to env.exec_command in benchmark_preflight.py to locate and change this behavior.
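The masking behavior is independent of Docker; a small stand-alone demonstration with `subprocess` (a subshell is used so that `exit 3` does not terminate the outer shell before `|| true` runs):

```python
import subprocess

# "(exit 3) || true" masks the failure: the shell reports success either way.
masked = subprocess.run(["sh", "-c", "(exit 3) || true"])
assert masked.returncode == 0  # the real failure is invisible to the caller

# Without the mask, the exit code surfaces and the caller can abort,
# which is exactly what the proposed fix checks for.
unmasked = subprocess.run(["sh", "-c", "(exit 3)"])
assert unmasked.returncode == 3
```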
306-307:⚠️ Potential issue | 🟠 Major
max_concurrent=0will deadlock all preflight checks.If someone passes
max_concurrent=0(say, by misconfiguration or a bad CLI default),asyncio.Semaphore(0)means no coroutine can ever acquire the semaphore. Every check hangs forever. A simple guard at the top of the function prevents a really confusing debugging session.Proposed fix
```diff
 async def run_benchmark_preflight(
     benchmark: Any,
     tasks: list[dict[str, Any]],
     docker_manager: DockerEnvironmentManager,
     max_concurrent: int = 4,
     timeout: int = 300,
     fail_fast: bool = False,
 ) -> PreflightReport:
+    if max_concurrent < 1:
+        raise ValueError("max_concurrent must be at least 1")
+
     report = PreflightReport(total=len(tasks))
     semaphore = asyncio.Semaphore(max_concurrent)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmark_preflight.py` around lines 306 - 307, The code creates asyncio.Semaphore(max_concurrent) which deadlocks when max_concurrent <= 0; add a guard at the start of the function (before creating report and semaphore) to ensure max_concurrent is at least 1 (e.g., max_concurrent = max(1, max_concurrent)) or raise a clear ValueError if you prefer failing fast; update the place that constructs semaphore (the variable named semaphore) so it uses the validated value and reference PreflightReport/semaphore as needed.

src/mcpbr/benchmarks/swebench_pro.py (2)
432-439: ⚠️ Potential issue | 🟠 Major

Short-form language identifiers `"ts"` and `"js"` fall through to the raw command fallback.

`PRO_LANGUAGES` on line 29 explicitly includes `"ts"` and `"js"`, which means tasks can arrive with those as their language. But `_build_pro_test_command` only checks for `"typescript"` and `"javascript"` on line 432. If a task has `language="ts"`, it skips the Jest branch entirely and hits the fallback on line 441, which just runs the test identifier as a raw shell command. That'll probably fail or do something weird.

Same thing could apply to Go if short forms ever show up, but `"go"` doesn't have an alias so it's fine for now.

Proposed fix
```diff
-    if language in ("typescript", "javascript"):
+    if language in ("typescript", "javascript", "ts", "js"):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 432 - 439, The Jest branch in _build_pro_test_command only checks for "typescript" and "javascript", so short-form languages ("ts", "js") from PRO_LANGUAGES fall through to the raw fallback; update the language check to include the short aliases (e.g., check if language in ("typescript","javascript","ts","js")) or normalize language at the top of _build_pro_test_command (map "ts" -> "typescript", "js" -> "javascript") so that Jest commands are generated for short-form inputs; ensure you update the conditional that currently uses language in ("typescript", "javascript") to the new set or use the normalization logic before the existing checks.
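A standalone sketch of the normalization approach the prompt suggests — mapping short forms to long forms before the existing checks. The alias table and function names here are illustrative assumptions, not the module's real `_LANGUAGE_ALIASES` or `_build_pro_test_command`:

```python
# Hypothetical helper table: short form -> long form.
_JS_TS_LONG_FORMS = {"ts": "typescript", "js": "javascript"}

def normalize_for_test_command(language: str) -> str:
    lang = language.lower()
    return _JS_TS_LONG_FORMS.get(lang, lang)

def build_test_command(test_id: str, language: str) -> str:
    language = normalize_for_test_command(language)
    if language in ("typescript", "javascript"):
        return f"npx jest {test_id}"
    # Raw-command fallback, mirroring the original code's behavior.
    return test_id
```

With normalization in place, `"ts"` and `"js"` tasks take the Jest branch instead of falling through to the raw fallback.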
232-239: 🛠️ Refactor suggestion | 🟠 Major

Dead code: the `language == "python"` check inside `_evaluate_multilang` can never be true.

So here's the thing — `evaluate()` (line 193) already returns early when `language == "python"`. That means `_evaluate_multilang` is only ever called for non-Python languages. So this entire block (lines 232-239) checking `if language == "python"` will never execute. It's not harmful, but it's confusing for the next person reading this code — they'll wonder "wait, can Python end up here?"

Proposed fix — just remove it
```diff
 if test_patch:
     await _apply_test_patch(env, test_patch, workdir=eval_workdir)

-# Reinstall package so patched code is active (SWE-bench Pro images
-# install into site-packages, not editable mode)
-if eval_workdir and language == "python":
-    await env.exec_command(
-        "pip install -e . -q 2>/dev/null || true",
-        timeout=120,
-        workdir=eval_workdir,
-    )
-
 fail_to_pass_str = get_test_list_field(task, "fail_to_pass")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 232 - 239, The block in _evaluate_multilang that checks `if language == "python"` and runs the pip reinstall is dead because evaluate() returns early for Python, so remove that entire conditional and its pip install call to avoid confusing dead code; locate the code inside _evaluate_multilang (the await env.exec_command("pip install -e . -q 2>/dev/null || true", ...)) and delete the if-check plus the enclosed lines so the function contains no unreachable Python-specific reinstall logic.
🧹 Nitpick comments (4)
src/mcpbr/benchmark_preflight.py (2)
313-325: `fail_fast` mode still acquires the semaphore unnecessarily.

In the `fail_fast` branch (lines 313-325), tasks are run sequentially with `await`. The semaphore acquisition inside `_check_with_semaphore` is pointless here since there's no concurrency. It won't cause a bug, but it's misleading — a reader might think concurrency is happening. You could just call `_check_single_instance` directly.

This is totally optional and won't break anything as-is.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmark_preflight.py` around lines 313 - 325, The fail_fast branch is calling _check_with_semaphore which needlessly acquires a semaphore even though tasks are executed sequentially; replace the call to _check_with_semaphore(task) with a direct call to the non-concurrent checker _check_single_instance(task) (or the underlying function that performs the check) so the semaphore isn't acquired when fail_fast is true, then append the returned result to report.results and update report.passed/report.failed/report.errors exactly as before.
265-280: Use lazy `%s` formatting in logger calls instead of f-strings.

Lines 267 and 280 use f-strings inside `logger.exception()` and `logger.warning()`. This means the string interpolation happens every time, even if the log level is disabled. It's a small thing, but it's the idiomatic way to do logging in Python — let the logger decide whether to format the string.

Proposed fix
```diff
-logger.exception(f"Preflight error for {instance_id}")
+logger.exception("Preflight error for %s", instance_id)
```

```diff
-logger.warning(f"Failed to clean up container for {instance_id}")
+logger.warning("Failed to clean up container for %s", instance_id)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmark_preflight.py` around lines 265 - 280, Replace the f-strings in the logging calls with lazy %-style formatting so interpolation only happens when the message will be emitted: change logger.exception(f"Preflight error for {instance_id}") to logger.exception("Preflight error for %s", instance_id) and change logger.warning(f"Failed to clean up container for {instance_id}") to logger.warning("Failed to clean up container for %s", instance_id); these occur in the exception/finally block around the Preflight logic (references: variable instance_id and env.cleanup()).

tests/test_swebench_pro.py (1)
173-240: Filter tests don't cover the `"typescript"` → `"ts"` alias path.

The `test_filter_by_language` test uses `filter_category=["go"]` and dataset `repo_language="go"` — a case with no aliasing involved. There's no test where the user passes a long-form name (like `"typescript"`) and the dataset uses the other form. Adding such a test would catch the alias resolution issue flagged in the implementation review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_swebench_pro.py` around lines 173 - 240, Add a unit test to cover the typescript ↔ ts alias path: in tests/test_swebench_pro.py create a new test (e.g., test_filter_typescript_alias) that patches mcpbr.benchmarks.swebench_pro.load_dataset to return a dataset where one item has repo_language set to "ts" and then call SWEBenchProBenchmark().load_tasks(filter_category=["typescript"]) (and vice versa) asserting the item is included; reference SWEBenchProBenchmark.load_tasks and PRO_LANGUAGES to ensure the alias mapping is exercised by the test.

src/mcpbr/benchmarks/swebench_pro.py (1)
272-337: `_run_lang_tests` nicely mirrors `_run_preflight_tests` in the preflight module.

These two functions (`_run_lang_tests` here and `_run_preflight_tests` in `benchmark_preflight.py`) share almost identical logic for the non-Python path — loop over tests, call `_build_pro_test_command`, collect results. The only real differences are:

- Here, `uses_prebuilt=False` is always passed (line 309) vs. the preflight version passing `uses_conda`.
- The preflight version accepts `repo` as a param for the Python path.

Not a blocker, but if this code drifts apart it could cause subtle inconsistencies in how tests are run during evaluation vs. preflight. Consider extracting a shared helper down the road.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 272 - 337, Extract the duplicated non-Python test-running logic from _run_lang_tests and benchmark_preflight._run_preflight_tests into a shared helper (e.g., _run_non_python_tests or run_language_tests_helper) that accepts env, tests, language, workdir, timeout, and a uses_prebuilt/uses_conda flag; move the loop that builds commands via _build_pro_test_command, calls env.exec_command, constructs the results list, and returns TestResults into that helper, then have both _run_lang_tests and _run_preflight_tests call the new helper with the appropriate uses_prebuilt/uses_conda value to eliminate duplication and keep behavior consistent.
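A sketch of what the extracted helper could look like. The signatures, the stand-in `_build_pro_test_command`, and the result shape are all assumptions — the real functions in the PR differ in detail:

```python
import asyncio
from typing import Any

def _build_pro_test_command(test_id: str, language: str, uses_prebuilt: bool = False) -> str:
    # Simplified stand-in for the real builder in swebench_pro.py (assumption).
    prefix = "conda run -n testbed " if uses_prebuilt else ""
    if language == "go":
        return f"{prefix}go test ./... -run '{test_id}' -v"
    return f"{prefix}npx jest {test_id}"

async def run_language_tests(
    env: Any,
    tests: list[str],
    language: str,
    workdir: str,
    timeout: int,
    uses_prebuilt: bool = False,
) -> list[dict[str, Any]]:
    # Shared loop: build a command per test, run it, record pass/fail.
    results = []
    for test in tests:
        cmd = _build_pro_test_command(test, language, uses_prebuilt=uses_prebuilt)
        exit_code, _stdout, _stderr = await env.exec_command(cmd, timeout=timeout, workdir=workdir)
        results.append({"test": test, "passed": exit_code == 0})
    return results
```

Both call sites would then pass their own flag (`uses_prebuilt=False` in evaluation, `uses_conda` in preflight), keeping the execution semantics in one place.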
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 94-112: The filter logic drops JS/TS because you resolve the
filter category via _LANGUAGE_ALIASES but compare it against the raw task
language; update the normalization so both sides use the same alias mapping: in
the loop that reads task.get("repo_language", "").lower() (variable language)
apply _LANGUAGE_ALIASES.get(language, language) (or otherwise normalize task
language to its canonical short form) before comparing to resolved, and then use
that normalized language in the equality check inside the filter_category
handling (functions/vars: filter_category, tasks, PRO_LANGUAGES,
_LANGUAGE_ALIASES, repo_language).
In `@tests/test_swebench_pro.py`:
- Around line 93-139: Add two tests to TestBuildProTestCommand to cover
short-form language identifiers: add test_ts_short_form_uses_jest and
test_js_short_form_uses_jest that call
_build_pro_test_command("src/test.ts","ts") and
_build_pro_test_command("test/index.test.js","js") respectively and assert that
"npx jest" is in the returned cmd (and that the relevant filename appears), so
short forms "ts"/"js" no longer fall back to the raw-command path.
---
Duplicate comments:
In `@src/mcpbr/benchmark_preflight.py`:
- Around line 200-205: The pip reinstall command is silently swallowing failures
via "|| true"; update the block that runs env.exec_command when eval_workdir and
language == "python" to remove the "|| true" so failures surface, and ensure
exec_command's non-zero exit is propagated (or explicitly check its result) so
the preflight aborts and logs the error; refer to the eval_workdir conditional
and the call to env.exec_command in benchmark_preflight.py to locate and change
this behavior.
- Around line 306-307: The code creates asyncio.Semaphore(max_concurrent) which
deadlocks when max_concurrent <= 0; add a guard at the start of the function
(before creating report and semaphore) to ensure max_concurrent is at least 1
(e.g., max_concurrent = max(1, max_concurrent)) or raise a clear ValueError if
you prefer failing fast; update the place that constructs semaphore (the
variable named semaphore) so it uses the validated value and reference
PreflightReport/semaphore as needed.
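A minimal version of the guard this comment asks for, with the rest of the function stubbed out (the function name here is a hypothetical wrapper, not the PR's actual code):

```python
import asyncio

def make_preflight_semaphore(max_concurrent: int) -> asyncio.Semaphore:
    # Fail fast on a value that would deadlock every check:
    # Semaphore(0) can never be acquired.
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return asyncio.Semaphore(max_concurrent)
```

Raising `ValueError` (rather than silently clamping with `max(1, ...)`) makes the misconfiguration visible instead of papering over it.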
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 432-439: The Jest branch in _build_pro_test_command only checks
for "typescript" and "javascript", so short-form languages ("ts", "js") from
PRO_LANGUAGES fall through to the raw fallback; update the language check to
include the short aliases (e.g., check if language in
("typescript","javascript","ts","js")) or normalize language at the top of
_build_pro_test_command (map "ts" -> "typescript", "js" -> "javascript") so that
Jest commands are generated for short-form inputs; ensure you update the
conditional that currently uses language in ("typescript", "javascript") to the
new set or use the normalization logic before the existing checks.
- Around line 232-239: The block in _evaluate_multilang that checks `if language
== "python"` and runs the pip reinstall is dead because evaluate() returns early
for Python, so remove that entire conditional and its pip install call to avoid
confusing dead code; locate the code inside _evaluate_multilang (the await
env.exec_command("pip install -e . -q 2>/dev/null || true", ...)) and delete the
if-check plus the enclosed lines so the function contains no unreachable
Python-specific reinstall logic.
---
Nitpick comments:
In `@src/mcpbr/benchmark_preflight.py`:
- Around line 313-325: The fail_fast branch is calling _check_with_semaphore
which needlessly acquires a semaphore even though tasks are executed
sequentially; replace the call to _check_with_semaphore(task) with a direct call
to the non-concurrent checker _check_single_instance(task) (or the underlying
function that performs the check) so the semaphore isn't acquired when fail_fast
is true, then append the returned result to report.results and update
report.passed/report.failed/report.errors exactly as before.
- Around line 265-280: Replace the f-strings in the logging calls with lazy
%-style formatting so interpolation only happens when the message will be
emitted: change logger.exception(f"Preflight error for {instance_id}") to
logger.exception("Preflight error for %s", instance_id) and change
logger.warning(f"Failed to clean up container for {instance_id}") to
logger.warning("Failed to clean up container for %s", instance_id); these occur
in the exception/finally block around the Preflight logic (references: variable
instance_id and env.cleanup()).
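The lazy-vs-eager difference is directly observable. Below, a contrived probe class counts how often it is stringified; with `%s` formatting at a disabled log level, the argument is never formatted:

```python
import logging

class Expensive:
    """Probe that counts how often it is stringified."""
    calls = 0

    def __str__(self) -> str:
        Expensive.calls += 1
        return "expensive"

logger = logging.getLogger("lazy-demo")
logger.setLevel(logging.ERROR)  # DEBUG records are discarded

logger.debug("value: %s", Expensive())  # lazy: argument is never stringified
lazy_calls = Expensive.calls

logger.debug(f"value: {Expensive()}")   # eager: f-string formats before the call
eager_calls = Expensive.calls
```

The object is still constructed in both cases; what the lazy form avoids is the formatting work, which is where the cost usually is.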
In `@src/mcpbr/benchmarks/swebench_pro.py`:
- Around line 272-337: Extract the duplicated non-Python test-running logic from
_run_lang_tests and benchmark_preflight._run_preflight_tests into a shared
helper (e.g., _run_non_python_tests or run_language_tests_helper) that accepts
env, tests, language, workdir, timeout, and a uses_prebuilt/uses_conda flag;
move the loop that builds commands via _build_pro_test_command, calls
env.exec_command, constructs the results list, and returns TestResults into that
helper, then have both _run_lang_tests and _run_preflight_tests call the new
helper with the appropriate uses_prebuilt/uses_conda value to eliminate
duplication and keep behavior consistent.
In `@tests/test_swebench_pro.py`:
- Around line 173-240: Add a unit test to cover the typescript ↔ ts alias path:
in tests/test_swebench_pro.py create a new test (e.g.,
test_filter_typescript_alias) that patches
mcpbr.benchmarks.swebench_pro.load_dataset to return a dataset where one item
has repo_language set to "ts" and then call
SWEBenchProBenchmark().load_tasks(filter_category=["typescript"]) (and vice
versa) asserting the item is included; reference SWEBenchProBenchmark.load_tasks
and PRO_LANGUAGES to ensure the alias mapping is exercised by the test.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/mcpbr/benchmark_preflight.py
- src/mcpbr/benchmarks/swebench_pro.py
- tests/test_swebench_pro.py
```python
if filter_category:
    filtered = []
    for task in tasks:
        repo = task.get("repo", "")
        language = task.get("repo_language", "").lower()
        for category in filter_category:
            cat_lower = category.lower()
            # If the category is a known language, match by language only
            if cat_lower in PRO_LANGUAGES:
                # Resolve aliases (e.g., "javascript" -> "js")
                resolved = _LANGUAGE_ALIASES.get(cat_lower, cat_lower)
                if resolved == language:
                    filtered.append(task)
                    break
            elif cat_lower in repo.lower():
                # Otherwise, match by repo substring
                filtered.append(task)
                break
    tasks = filtered
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Find PRO_LANGUAGES and _LANGUAGE_ALIASES definitions
rg -n "_LANGUAGE_ALIASES|PRO_LANGUAGES" --type=py -B2 -A5 src/
```

Repository: supermodeltools/mcpbr
Length of output: 1809
🏁 Script executed:
```bash
#!/bin/bash
# Check test data to see what repo_language values are used
rg -n "repo_language" --type=py -C2 tests/
```

Repository: supermodeltools/mcpbr
Length of output: 6005
🏁 Script executed:
```bash
#!/bin/bash
# Check if there are any tests for filtering by "typescript" or "javascript"
rg -n "filter_category.*typescript|filter_category.*javascript" --type=py -B3 -A10 tests/
```

Repository: supermodeltools/mcpbr
Length of output: 47
🏁 Script executed:
```bash
#!/bin/bash
# Look for all filter_category tests to see what's tested
rg -n "filter_category" --type=py -B5 -A10 tests/
```

Repository: supermodeltools/mcpbr
Length of output: 35465
Language alias mismatch will silently drop TypeScript and JavaScript tasks from filter results.
Here's the concrete issue: the test dataset has "repo_language": "typescript" (long form), but _LANGUAGE_ALIASES maps "typescript" → "ts". When someone filters by ["typescript"], the code resolves it to "ts" (line 104), then compares "ts" == "typescript" — no match, task is skipped silently.
The fix is straightforward — normalize the dataset's language value the same way:
Proposed fix
```diff
 if cat_lower in PRO_LANGUAGES:
     # Resolve aliases (e.g., "javascript" -> "js")
     resolved = _LANGUAGE_ALIASES.get(cat_lower, cat_lower)
-    if resolved == language:
+    lang_resolved = _LANGUAGE_ALIASES.get(language, language)
+    if resolved == lang_resolved:
         filtered.append(task)
         break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mcpbr/benchmarks/swebench_pro.py` around lines 94 - 112, The filter logic
drops JS/TS because you resolve the filter category via _LANGUAGE_ALIASES but
compare it against the raw task language; update the normalization so both sides
use the same alias mapping: in the loop that reads task.get("repo_language",
"").lower() (variable language) apply _LANGUAGE_ALIASES.get(language, language)
(or otherwise normalize task language to its canonical short form) before
comparing to resolved, and then use that normalized language in the equality
check inside the filter_category handling (functions/vars: filter_category,
tasks, PRO_LANGUAGES, _LANGUAGE_ALIASES, repo_language).
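The core of the fix — normalizing both sides through the same table — can be shown standalone. The alias table here follows the direction the review describes (long form → short form) but is a simplified assumption:

```python
# Per the review's description: long form maps to canonical short form.
_LANGUAGE_ALIASES = {"typescript": "ts", "javascript": "js"}

def languages_match(filter_lang: str, task_lang: str) -> bool:
    # Normalize BOTH the filter category and the task's language
    # through the same alias table before comparing.
    resolved = _LANGUAGE_ALIASES.get(filter_lang.lower(), filter_lang.lower())
    lang_resolved = _LANGUAGE_ALIASES.get(task_lang.lower(), task_lang.lower())
    return resolved == lang_resolved
```

With one-sided resolution, `"typescript"` resolves to `"ts"` but the dataset's `"typescript"` stays long-form and never matches; two-sided resolution makes the comparison symmetric.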
tests/test_swebench_pro.py
```python
class TestBuildProTestCommand:
    """Tests for language-specific test command building."""

    def test_python_delegates(self) -> None:
        """Python should delegate to existing _build_test_command."""
        cmd = _build_pro_test_command("tests/test_foo.py::test_bar", "python")
        assert "pytest" in cmd or "test_foo" in cmd

    def test_go_package_path(self) -> None:
        cmd = _build_pro_test_command("./pkg/router", "go")
        assert "go test" in cmd
        assert "./pkg/router" in cmd
        assert "-v" in cmd

    def test_go_function_name(self) -> None:
        cmd = _build_pro_test_command("TestRouteMatching", "go")
        assert "go test" in cmd
        assert "-run" in cmd
        assert "TestRouteMatching" in cmd

    def test_typescript_file(self) -> None:
        cmd = _build_pro_test_command("src/__tests__/parser.test.ts", "typescript")
        assert "npx jest" in cmd
        assert "parser.test.ts" in cmd

    def test_typescript_pattern(self) -> None:
        cmd = _build_pro_test_command("should parse tokens", "typescript")
        assert "npx jest" in cmd
        assert "-t" in cmd

    def test_javascript_file(self) -> None:
        cmd = _build_pro_test_command("test/index.test.js", "javascript")
        assert "npx jest" in cmd
        assert "index.test.js" in cmd

    def test_javascript_pattern(self) -> None:
        cmd = _build_pro_test_command("handles edge case", "javascript")
        assert "npx jest" in cmd
        assert "-t" in cmd

    def test_prebuilt_conda_activation(self) -> None:
        cmd = _build_pro_test_command("TestFoo", "go", uses_prebuilt=True)
        assert "conda activate testbed" in cmd

    def test_unknown_language_fallback(self) -> None:
        cmd = _build_pro_test_command("test_something", "rust")
        assert "test_something" in cmd
```
Good coverage of the command builder, but missing tests for short-form language identifiers "ts" / "js".
The test class covers Python, Go, TypeScript (long form), JavaScript (long form), prebuilt conda, and unknown fallback — nice work! But there's no test for what happens when language="ts" or language="js" is passed. Given the bug flagged in _build_pro_test_command (where those short forms silently fall through to the raw command fallback), adding tests for these would both document the expected behavior and catch regressions once the fix lands.
Something like:
```python
def test_ts_short_form_uses_jest(self) -> None:
    cmd = _build_pro_test_command("src/test.ts", "ts")
    assert "npx jest" in cmd

def test_js_short_form_uses_jest(self) -> None:
    cmd = _build_pro_test_command("test/index.test.js", "js")
    assert "npx jest" in cmd
```

Based on learnings: "New features must have comprehensive test coverage; bug fixes should include regression tests; aim for at least 80% code coverage on new code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_swebench_pro.py` around lines 93 - 139, Add two tests to
TestBuildProTestCommand to cover short-form language identifiers: add
test_ts_short_form_uses_jest and test_js_short_form_uses_jest that call
_build_pro_test_command("src/test.ts","ts") and
_build_pro_test_command("test/index.test.js","js") respectively and assert that
"npx jest" is in the returned cmd (and that the relevant filename appears), so
short forms "ts"/"js" no longer fall back to the raw-command path.
Go: Test IDs like "TestFoo/#00" and "TestFoo//api/v1" are subtests, not package paths. Always extract the top-level function name and use -run with ./... to search all packages.

JS/TS: SWE-bench Pro uses "file.js | test description" format. Split on " | " to get the file path and test name separately. "test suite" as description runs the whole file without -t filter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
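The two parsing rules from this commit message can be sketched as small helpers. The exact command templates and function names are assumptions; only the splitting rules come from the commit text:

```python
from __future__ import annotations

def go_test_command(test_id: str) -> str:
    # "TestFoo/#00" and "TestFoo//api/v1" are subtests: run the top-level
    # function across all packages, since the ID carries no package path.
    top_level = test_id.split("/", 1)[0]
    return f"go test ./... -run '^{top_level}$' -v"

def split_js_test_id(test_id: str) -> tuple[str, str | None]:
    # SWE-bench Pro JS/TS IDs look like "file.js | test description".
    if " | " in test_id:
        file_path, description = test_id.split(" | ", 1)
        # "test suite" means: run the whole file without a -t filter.
        return file_path, None if description == "test suite" else description
    return test_id, None
```

`go test -run` anchors the pattern per path element, so matching only the top-level name (`^TestFoo$`) still runs all of its subtests.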
Different SWE-bench Pro JS/TS repos use different test frameworks:

- NodeBB: mocha
- element-web: jest
- protonmail/webclients: jest
- tutanota: custom (testdouble)

Added _detect_js_runner() that checks for runner binaries in node_modules/.bin/ inside the container. Mocha uses --grep for filtering, jest uses -t, vitest uses -t.

Also refactored JS/TS command building into _build_js_test_command() for clearer separation of runner-specific logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
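A filesystem sketch of the detection idea. The runner list and the node_modules/.bin check come from the commit message; the function shape, the local-filesystem access (the real code checks inside the container), and the fallback value are assumptions:

```python
from pathlib import Path

def detect_js_runner(repo_root: Path) -> str:
    # Check for known runner binaries installed in node_modules/.bin.
    for runner in ("jest", "mocha", "vitest"):
        if (repo_root / "node_modules" / ".bin" / runner).exists():
            return runner
    return "npm test"  # hypothetical fallback when no known runner is found
```

The order of the tuple decides precedence when multiple runners are installed, which is one of the subtleties the real implementation would need to pin down.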
…/TS runner detection

Add docker image prune after each preflight instance to prevent disk exhaustion on CI runners (each SWE-bench Pro image is ~1.5GB).

Enhance JS/TS test runner detection to check package.json scripts.test and support ospec, ava, and npm test fallback. Previously, unrecognized projects (like tutanota) defaulted to jest, causing 100% test failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace custom language-specific test command building (jest/mocha/go test) with official run_script.sh + parser.py from scaleapi/SWE-bench_Pro-os. Each of the 11 repos has unique test infrastructure that the official scripts handle correctly (e.g., Redis for NodeBB, ansible-test for ansible, custom runners for tutanota). Parser runs locally on the host, avoiding Python dependency in Go/JS/TS container images. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ation

The official SWE-bench Pro evaluation harness runs before_repo_set_cmd (last line) between patch application and test execution. This typically restores specific test files from the fix commit, e.g.:

git checkout <commit> -- test/tests/SomeTest.ts

Without this, 2 tutanota instances fail because their test files end up in a broken state after patch + test_patch application.
Protonmail/webclients images are ~4.8GB compressed. docker image prune alone doesn't clear build cache and volumes. docker system prune -af --volumes reclaims all reclaimable space between instances.
Summary
- Multi-language benchmark support (Python, Go, TypeScript, JavaScript) with 731 instances from ScaleAI/SWE-bench_Pro
- mcpbr preflight CLI command validates golden patches pass all tests in Docker environments before running agent evaluations
- Docker image override support (_image_override, _workdir_override) alongside existing GHCR pattern

Key implementation details
- SWEBenchProBenchmark class with language-specific test runners (Go: go test, TS/JS: npx jest)
- Entrypoint override for images with /bin/bash entrypoint (SWE-bench Pro images)
- Editable reinstall (pip install -e .) after patching for SWE-bench Pro Python images (they install to site-packages, not editable mode)
- Case-insensitive test list field access (fail_to_pass/FAIL_TO_PASS)

Preflight validation status
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Behavior
Tests