Raw Bedrock model IDs, Opus 4.7/4.8, /swe + /summarize skills, uv-managed deps by aarora79 · Pull Request #1 · aws-samples/sample-claude-code-multi-model

aarora79 · 2026-06-05T20:57:33Z

Summary

This PR makes two complementary changes that map to the two purposes the repo now serves: run non-Anthropic models through Claude Code, and measure how well each model actually does coding work.

1. `/swe` and `/summarize` skills + worked example

Adds .claude/skills/swe/SKILL.md — drives any Claude Code model through a real software-engineering task in any GitHub repo, producing four artifacts (github-issue.md, lld.md, review.md, testing.md) under benchmarks/swe-benchmark-data/{repo}/{problem}/{model}/. The skill stops at design — it does not implement the change.
Adds .claude/skills/summarize/SKILL.md — post-run report covering artifact completeness, error signals, token usage broken down by model + cache type, and recurring themes.
Ships a worked example targeting agentic-community/mcp-gateway-registry at tag 1.24.4 with two scoped problems (remove-faiss, remove-efs-from-terraform-aws-ecs). qwen.qwen3-coder-next already has the four artifacts on remove-faiss so reviewers can see expected output before running anything.
The cloned target lives at benchmarks/swe-benchmark-data/{repo}/repo/ and is gitignored — each contributor pins their own checkout rather than this repo carrying large third-party trees.
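The artifact layout described above lends itself to a mechanical completeness check. The following is a minimal sketch, assuming only the four file names and the directory template given in this description. The helper names `artifact_dir` and `missing_artifacts` are hypothetical and are not part of the `/swe` or `/summarize` skills, and the `{repo}` directory name used in the usage line is a guess at how the worked example is laid out on disk.

```python
from pathlib import Path

# File names come from the PR description; everything else here is illustrative.
ARTIFACTS = ("github-issue.md", "lld.md", "review.md", "testing.md")
DATA_ROOT = Path("benchmarks/swe-benchmark-data")


def artifact_dir(repo: str, problem: str, model: str, root: Path = DATA_ROOT) -> Path:
    """Build {root}/{repo}/{problem}/{model} for one run."""
    return root / repo / problem / model


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the expected artifact names that are absent from run_dir."""
    return [name for name in ARTIFACTS if not (run_dir / name).is_file()]


if __name__ == "__main__":
    # Assumed directory name for the worked example; adjust to the real layout.
    run = artifact_dir("mcp-gateway-registry", "remove-faiss", "qwen.qwen3-coder-next")
    print(run, "missing:", missing_artifacts(run))
```

An empty list from `missing_artifacts` would mean the run produced all four files; it says nothing about their quality.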

2. Raw Bedrock model IDs end-to-end + Opus 4.7/4.8 + uv-managed deps

Drops the alias layer across claude-model.sh, litellm-config.yaml, humaneval_runner.py, and bedrock/README.md. What you pass to --model is now the same raw Bedrock model ID that hits the wire (e.g. qwen.qwen3-coder-next, us.anthropic.claude-opus-4-8).
Adds Claude Opus 4.7 and 4.8 to the native Bedrock list (verified ACTIVE via aws bedrock list-inference-profiles). Model counts go from 43 to 45 total and from 5 to 7 native.
Audited the OpenAI-compatible Bedrock endpoint (bedrock-mantle.us-east-1.api.aws/v1/models): the 38 third-party entries match the catalog exactly, with nothing missing.
Adds bedrock/pyproject.toml so uv sync works directly. Replaces the runtime pip install calls in setup-proxy.sh that broke under PEP 668 on Python 3.12+. Pins requires-python = ">=3.10,<3.14" because uvloop 0.21 cannot import on Python 3.14.
setup-proxy.sh now binds to 127.0.0.1 by default with an optional --host override (prints a warning when non-loopback). The proxy does not authenticate clients itself, so the previous 0.0.0.0 default was unnecessarily exposed.
Date-stamps "latest"/"newest" descriptions in the picker so recency claims are falsifiable; relabels MiniMax M2.5's 80.2% SWE-bench number as vendor-claimed.

3. README reframe (root + bedrock)

Root README.md reframed around the two purposes. New Evaluation 1 — SWE skill and Evaluation 2 — HumanEval sections, with a clear note that "SWE" here means software engineering generally and is not SWE-bench.
Repository structure block updated for .claude/ and benchmarks/.
bedrock/README.md path fixes: LICENSE, clone URL, "See Also" link, alias paths — all previously pointed at the old shekharprateek/... standalone repos that this monorepo replaced.

Test plan

cd bedrock && uv sync succeeds on Python 3.10–3.13
./scripts/setup-proxy.sh starts cleanly and binds to 127.0.0.1:4000, and curl http://127.0.0.1:4000/health returns 200
./scripts/setup-proxy.sh --host 0.0.0.0 prints the network-exposure warning and binds to all interfaces
./scripts/claude-model.sh --list shows 45 models with the date-stamped "latest" tags
./scripts/claude-model.sh --model us.anthropic.claude-opus-4-8 -p "hi" succeeds (after enabling Opus 4.8 in Bedrock model access)
./scripts/claude-model.sh --model qwen.qwen3-coder-next -p "hi" succeeds (with proxy running)
All local markdown links in root + bedrock READMEs resolve
/swe skill end-to-end on a fresh problem produces all four artifacts under the expected path
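Before running the first test-plan item, it can help to confirm that the local interpreter falls inside the pinned range. The snippet below is a rough sketch that mirrors requires-python = ">=3.10,<3.14" with a plain tuple comparison; it is not how uv resolves versions, and the `is_supported` helper is hypothetical.

```python
import sys

# Bounds mirror the requires-python pin quoted in the PR description.
MIN_VERSION = (3, 10)
MAX_EXCLUSIVE = (3, 14)


def is_supported(version: tuple[int, int]) -> bool:
    """True when (major, minor) satisfies >=3.10,<3.14."""
    return MIN_VERSION <= version < MAX_EXCLUSIVE


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    print(f"Python {major}.{minor} supported: {is_supported((major, minor))}")
```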

The /swe skill drives any Claude Code model through a software-engineering task in a real repo and lands four artifacts (github-issue, lld, review, testing) under benchmarks/swe-benchmark-data/{repo}/{problem}/{model}/. /summarize produces a per-run report with token usage, errors, and themes. Includes a worked example targeting agentic-community/mcp-gateway-registry at tag 1.24.4, with the remove-faiss problem already attempted by qwen.qwen3-coder-next so contributors can see expected output before running the skill themselves. The cloned target repo lives at benchmarks/swe-benchmark-data/{repo}/repo/ and is gitignored so each contributor pins their own checkout.

Drops the alias layer in claude-model.sh / litellm-config.yaml / humaneval_runner.py / README so the value passed via --model is the same Bedrock model ID that hits the wire. Picker output, --list, and the benchmark runner all key on the raw IDs.

Audited the OpenAI-compatible Bedrock endpoint (bedrock-mantle.us-east-1.api.aws/v1/models): the 38 third-party entries match the catalog exactly; no new providers to add. Adds Claude Opus 4.7 and 4.8 cross-region inference profiles to the native Bedrock list (verified ACTIVE via aws bedrock list-inference-profiles), bumping the model count from 43 -> 45 and native count from 5 -> 7. Stamps "latest" / "newest" descriptions with the audit date (June 5 2026) so the recency claim is falsifiable; relabels MiniMax M2.5's 80.2% SWE-bench number as vendor-claimed since this repo did not measure it.

Adds bedrock/pyproject.toml (litellm[proxy], aws-bedrock-token-generator, datasets) so contributors run `uv sync` instead of letting setup-proxy.sh discover-and-pip-install at runtime (which broke under PEP 668 on Python 3.12+). Pins requires-python ">=3.10,<3.14" because uvloop 0.21 cannot import on Python 3.14.

setup-proxy.sh now binds to 127.0.0.1 by default with an optional --host override (prints a warning when non-loopback). The proxy itself does not authenticate clients; the only gate has always been the bearer token upstream to Bedrock, so leaving it on 0.0.0.0 was unnecessarily exposed.

Reframes the root README around the two purposes the repo now serves: running Claude Code against non-Anthropic models, and measuring how well each model does coding work. The Evaluation section splits into the new SWE-skill mode (real-world tasks, design-package artifacts) and the existing HumanEval mode (single-function pass@1), with a clear note that "SWE" in this repo means software engineering generally and is not the SWE-bench dataset.
Also fixes paths in bedrock/README.md that pointed at the old shekharprateek standalone repos (LICENSE, clone URL, See Also link, alias paths in Shell Aliases section).
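With the alias layer gone, the raw model ID is the only input available for deciding how a request is routed. The sketch below shows one way that decision could be derived from the ID alone. It rests on two assumptions that this PR does not confirm: that "native" means Anthropic models reached directly on Bedrock while every other provider goes through the proxy, and that cross-region inference profile IDs start with one of the geography prefixes listed in the code. The function names are hypothetical and do not come from claude-model.sh.

```python
# Assumed geography prefixes for cross-region inference profile IDs.
GEO_PREFIXES = {"us", "eu", "apac", "global"}


def provider_of(model_id: str) -> str:
    """Return the provider segment of a raw Bedrock model ID.

    A leading geography prefix such as "us." is skipped when present.
    """
    parts = model_id.split(".")
    if len(parts) > 2 and parts[0] in GEO_PREFIXES:
        return parts[1]
    return parts[0]


def needs_proxy(model_id: str) -> bool:
    """Assumption: only non-Anthropic providers are routed through the proxy."""
    return provider_of(model_id) != "anthropic"


if __name__ == "__main__":
    for mid in ("qwen.qwen3-coder-next", "us.anthropic.claude-opus-4-8"):
        print(mid, "provider:", provider_of(mid), "proxy:", needs_proxy(mid))
```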

aarora79 added 2 commits June 5, 2026 20:52

aarora79 merged commit 805db2c into main Jun 5, 2026
5 checks passed

aarora79 merged 2 commits into main from trying-out-first-time
