Raw Bedrock model IDs, Opus 4.7/4.8, /swe + /summarize skills, uv-managed deps #1
Merged
The /swe skill drives any Claude Code model through a software-engineering
task in a real repo and lands four artifacts (github-issue, lld, review,
testing) under benchmarks/swe-benchmark-data/{repo}/{problem}/{model}/.
/summarize produces a per-run report with token usage, errors, and themes.
Includes a worked example targeting agentic-community/mcp-gateway-registry
at tag 1.24.4, with the remove-faiss problem already attempted by
qwen.qwen3-coder-next so contributors can see expected output before running
the skill themselves.
The cloned target repo lives at benchmarks/swe-benchmark-data/{repo}/repo/
and is gitignored so each contributor pins their own checkout.
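The artifact layout described above lends itself to a mechanical completeness check. The following is a minimal sketch, not part of either skill: the helpers `run_dir` and `missing_artifacts` are hypothetical, the `.md` file names are taken from the PR summary, and the `{repo}` directory name `mcp-gateway-registry` used in the demo is an assumption about how the target repo is named on disk.

```python
from pathlib import Path

# Artifact file names as listed in the PR summary. The helpers below are
# hypothetical illustrations, not code from the /swe or /summarize skills.
EXPECTED_ARTIFACTS = ("github-issue.md", "lld.md", "review.md", "testing.md")


def run_dir(data_root: Path, repo: str, problem: str, model: str) -> Path:
    """Build {data_root}/{repo}/{problem}/{model} for one run."""
    return data_root / repo / problem / model


def missing_artifacts(data_root: Path, repo: str, problem: str, model: str) -> list[str]:
    """Return the expected artifact names that are absent for one run."""
    target = run_dir(data_root, repo, problem, model)
    return [name for name in EXPECTED_ARTIFACTS if not (target / name).is_file()]


if __name__ == "__main__":
    root = Path("benchmarks/swe-benchmark-data")
    # Assumed directory name for the worked-example repo.
    gaps = missing_artifacts(root, "mcp-gateway-registry", "remove-faiss", "qwen.qwen3-coder-next")
    print("missing:", gaps or "none")
```

Run from the repository root, this prints which of the four files are absent for the given run; an empty result means the run directory is complete.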
Drops the alias layer in claude-model.sh / litellm-config.yaml / humaneval_runner.py / README so the value passed via --model is the same Bedrock model ID that hits the wire. Picker output, --list, and the benchmark runner all key on the raw IDs.

Audited the OpenAI-compatible Bedrock endpoint (bedrock-mantle.us-east-1.api.aws/v1/models): the 38 third-party entries match the catalog exactly; no new providers to add.

Adds Claude Opus 4.7 and 4.8 cross-region inference profiles to the native Bedrock list (verified ACTIVE via aws bedrock list-inference-profiles), bumping the model count from 43 -> 45 and native count from 5 -> 7.

Stamps "latest" / "newest" descriptions with the audit date (June 5 2026) so the recency claim is falsifiable; relabels MiniMax M2.5's 80.2% SWE-bench number as vendor-claimed since this repo did not measure it.

Adds bedrock/pyproject.toml (litellm[proxy], aws-bedrock-token-generator, datasets) so contributors run `uv sync` instead of letting setup-proxy.sh discover-and-pip-install at runtime (which broke under PEP 668 on Python 3.12+). Pins requires-python ">=3.10,<3.14" because uvloop 0.21 cannot import on Python 3.14.

setup-proxy.sh now binds to 127.0.0.1 by default with an optional --host override (prints a warning when non-loopback). The proxy itself does not authenticate clients; the only gate has always been the bearer token upstream to Bedrock, so leaving it on 0.0.0.0 was unnecessarily exposed.

Reframes the root README around the two purposes the repo now serves: running Claude Code against non-Anthropic models, and measuring how well each model does coding work. The Evaluation section splits into the new SWE-skill mode (real-world tasks, design-package artifacts) and the existing HumanEval mode (single-function pass@1), with a clear note that "SWE" in this repo means software engineering generally and is not the SWE-bench dataset.
Also fixes paths in bedrock/README.md that pointed at the old shekharprateek standalone repos (LICENSE, clone URL, See Also link, alias paths in Shell Aliases section).
Summary
This PR makes two complementary changes that map to the two purposes the repo now serves: run non-Anthropic models through Claude Code, and measure how well each model actually does coding work.
1. /swe and /summarize skills + worked example

- `.claude/skills/swe/SKILL.md`: drives any Claude Code model through a real software-engineering task in any GitHub repo, producing four artifacts (`github-issue.md`, `lld.md`, `review.md`, `testing.md`) under `benchmarks/swe-benchmark-data/{repo}/{problem}/{model}/`. The skill stops at design; it does not implement the change.
- `.claude/skills/summarize/SKILL.md`: post-run report covering artifact completeness, error signals, token usage broken down by model + cache type, and recurring themes.
- Worked example targeting agentic-community/mcp-gateway-registry at tag `1.24.4` with two scoped problems (`remove-faiss`, `remove-efs-from-terraform-aws-ecs`). `qwen.qwen3-coder-next` already has the four artifacts on `remove-faiss` so reviewers can see expected output before running anything.
- The cloned target repo lives at `benchmarks/swe-benchmark-data/{repo}/repo/` and is gitignored: each contributor pins their own checkout rather than this repo carrying large third-party trees.

2. Raw Bedrock model IDs end-to-end + Opus 4.7/4.8 + uv-managed deps

- Drops the alias layer in `claude-model.sh`, `litellm-config.yaml`, `humaneval_runner.py`, and `bedrock/README.md`. What you pass to `--model` is now the same raw Bedrock model ID that hits the wire (e.g. `qwen.qwen3-coder-next`, `us.anthropic.claude-opus-4-8`).
- Adds Claude Opus 4.7 and 4.8 cross-region inference profiles to the native Bedrock list (verified ACTIVE via `aws bedrock list-inference-profiles`). Counts go 43 → 45 total, 5 → 7 native.
- Audited the OpenAI-compatible Bedrock endpoint (`bedrock-mantle.us-east-1.api.aws/v1/models`): the 38 third-party entries match the catalog exactly, nothing missing.
- Adds `bedrock/pyproject.toml` so `uv sync` works directly. Replaces the runtime `pip install` calls in `setup-proxy.sh` that broke under PEP 668 on Python 3.12+. Pins `requires-python = ">=3.10,<3.14"` because uvloop 0.21 cannot import on Python 3.14.
- `setup-proxy.sh` now binds to `127.0.0.1` by default with an optional `--host` override (prints a warning when non-loopback). The proxy does not authenticate clients itself, so the previous `0.0.0.0` default was unnecessarily exposed.

3. README reframe (root + bedrock)

- `README.md` reframed around the two purposes. New "Evaluation 1: SWE skill" and "Evaluation 2: HumanEval" sections, with a clear note that "SWE" here means software engineering generally and is not SWE-bench.
- `.claude/` and `benchmarks/`.
- `bedrock/README.md` path fixes: `LICENSE`, clone URL, "See Also" link, alias paths. All previously pointed at the old `shekharprateek/...` standalone repos that this monorepo replaced.

Test plan

- `cd bedrock && uv sync` succeeds on Python 3.10–3.13
- `./scripts/setup-proxy.sh` starts cleanly, binds to `127.0.0.1:4000`, `curl http://127.0.0.1:4000/health` returns 200
- `./scripts/setup-proxy.sh --host 0.0.0.0` prints the network-exposure warning and binds to all interfaces
- `./scripts/claude-model.sh --list` shows 45 models with the date-stamped "latest" tags
- `./scripts/claude-model.sh --model us.anthropic.claude-opus-4-8 -p "hi"` succeeds (after enabling Opus 4.8 in Bedrock model access)
- `./scripts/claude-model.sh --model qwen.qwen3-coder-next -p "hi"` succeeds (with proxy running)
- `/swe` skill end-to-end on a fresh problem produces all four artifacts under the expected path
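The first test-plan item depends on the interpreter falling inside the pinned `requires-python` range. A quick pre-flight check along these lines can tell a contributor whether their interpreter is in range before they run `uv sync`. This is a hedged sketch: the bounds are copied from the pin quoted in this PR, the function name `python_supported` is hypothetical, and uv performs its own resolution against `pyproject.toml` regardless of this check.

```python
import sys

# Bounds copied from requires-python = ">=3.10,<3.14" in this PR.
MIN_INCLUSIVE = (3, 10)
MAX_EXCLUSIVE = (3, 14)


def python_supported(version: tuple[int, int]) -> bool:
    """True when (major, minor) satisfies >=3.10,<3.14."""
    return MIN_INCLUSIVE <= version < MAX_EXCLUSIVE


if __name__ == "__main__":
    current = (sys.version_info.major, sys.version_info.minor)
    verdict = "supported" if python_supported(current) else "outside the pinned range"
    print(f"Python {current[0]}.{current[1]} is {verdict}")
```

Comparing `(major, minor)` tuples avoids string comparison pitfalls such as `"3.9" > "3.10"`.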