
Raw Bedrock model IDs, Opus 4.7/4.8, /swe + /summarize skills, uv-managed deps #1

Merged
aarora79 merged 2 commits into main from trying-out-first-time
Jun 5, 2026

@aarora79 aarora79 commented Jun 5, 2026

Summary

This PR makes two complementary changes that map to the two purposes the repo now serves: run non-Anthropic models through Claude Code, and measure how well each model actually does coding work.

1. /swe and /summarize skills + worked example

  • Adds .claude/skills/swe/SKILL.md — drives any Claude Code model through a real software-engineering task in any GitHub repo, producing four artifacts (github-issue.md, lld.md, review.md, testing.md) under benchmarks/swe-benchmark-data/{repo}/{problem}/{model}/. The skill stops at design — it does not implement the change.
  • Adds .claude/skills/summarize/SKILL.md — post-run report covering artifact completeness, error signals, token usage broken down by model + cache type, and recurring themes.
  • Ships a worked example targeting agentic-community/mcp-gateway-registry at tag 1.24.4 with two scoped problems (remove-faiss, remove-efs-from-terraform-aws-ecs). qwen.qwen3-coder-next already has the four artifacts on remove-faiss so reviewers can see expected output before running anything.
  • The cloned target lives at benchmarks/swe-benchmark-data/{repo}/repo/ and is gitignored — each contributor pins their own checkout rather than this repo carrying large third-party trees.
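The artifact layout described above can be checked mechanically. The following is a minimal sketch, assuming only the four artifact names and the `{repo}/{problem}/{model}` path shape stated in this PR; the helper names `run_dir` and `missing_artifacts` are hypothetical and are not part of the /swe or /summarize skills.

```python
from pathlib import Path

# The four artifact names listed in this PR description.
EXPECTED_ARTIFACTS = ("github-issue.md", "lld.md", "review.md", "testing.md")


def run_dir(base: Path, repo: str, problem: str, model: str) -> Path:
    """Build {base}/{repo}/{problem}/{model} (hypothetical helper)."""
    return base / repo / problem / model


def missing_artifacts(base: Path, repo: str, problem: str, model: str) -> list[str]:
    """Return the expected artifact names that are absent for one run."""
    directory = run_dir(base, repo, problem, model)
    return [name for name in EXPECTED_ARTIFACTS if not (directory / name).is_file()]
```

With `base` pointing at `benchmarks/swe-benchmark-data`, an empty list from `missing_artifacts` would mean the run produced the full design package.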

2. Raw Bedrock model IDs end-to-end + Opus 4.7/4.8 + uv-managed deps

  • Drops the alias layer across claude-model.sh, litellm-config.yaml, humaneval_runner.py, and bedrock/README.md. What you pass to --model is now the same raw Bedrock model ID that hits the wire (e.g. qwen.qwen3-coder-next, us.anthropic.claude-opus-4-8).
  • Adds Claude Opus 4.7 and 4.8 to the native Bedrock list (verified ACTIVE via aws bedrock list-inference-profiles). The catalog grows from 43 to 45 models in total, and from 5 to 7 native models.
  • Audited the OpenAI-compat Bedrock endpoint (bedrock-mantle.us-east-1.api.aws/v1/models) — the 38 third-party entries match the catalog exactly, nothing missing.
  • Adds bedrock/pyproject.toml so uv sync works directly. Replaces the runtime pip install calls in setup-proxy.sh that broke under PEP 668 on Python 3.12+. Pins requires-python = ">=3.10,<3.14" because uvloop 0.21 cannot import on Python 3.14.
  • setup-proxy.sh now binds to 127.0.0.1 by default with an optional --host override (prints a warning when non-loopback). The proxy does not authenticate clients itself, so the previous 0.0.0.0 default was unnecessarily exposed.
  • Date-stamps "latest"/"newest" descriptions in the picker so recency claims are falsifiable; relabels MiniMax M2.5's 80.2% SWE-bench number as vendor-claimed.
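The loopback-by-default behavior in the setup-proxy.sh item above amounts to a small decision rule: warn whenever the requested host is not a loopback address. The sketch below illustrates that rule in Python; it is not the script's actual implementation (setup-proxy.sh is a shell script whose code is not shown here), and the function names and warning text are assumptions.

```python
import ipaddress
from typing import Optional

DEFAULT_HOST = "127.0.0.1"  # default bind address stated in this PR


def is_loopback(host: str) -> bool:
    """True for 'localhost' and loopback IP literals such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; treat other hostnames as potentially exposed.
        return False


def exposure_warning(host: str = DEFAULT_HOST) -> Optional[str]:
    """Return a warning for non-loopback binds, or None when no warning is needed."""
    if is_loopback(host):
        return None
    # Hypothetical wording; the real script's message may differ.
    return f"WARNING: binding to {host} exposes the unauthenticated proxy to the network"
```

Treating unknown hostnames as non-loopback errs on the side of warning, which seems the safer choice given that the proxy does not authenticate clients.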

3. README reframe (root + bedrock)

  • Root README.md reframed around the two purposes. New Evaluation 1 — SWE skill and Evaluation 2 — HumanEval sections, with a clear note that "SWE" here means software engineering generally and is not SWE-bench.
  • Repository structure block updated for .claude/ and benchmarks/.
  • bedrock/README.md path fixes: LICENSE, clone URL, "See Also" link, alias paths — all previously pointed at the old shekharprateek/... standalone repos that this monorepo replaced.

Test plan

  • cd bedrock && uv sync succeeds on Python 3.10–3.13
  • ./scripts/setup-proxy.sh starts cleanly, binds to 127.0.0.1:4000, curl http://127.0.0.1:4000/health returns 200
  • ./scripts/setup-proxy.sh --host 0.0.0.0 prints the network-exposure warning and binds to all interfaces
  • ./scripts/claude-model.sh --list shows 45 models with the date-stamped "latest" tags
  • ./scripts/claude-model.sh --model us.anthropic.claude-opus-4-8 -p "hi" succeeds (after enabling Opus 4.8 in Bedrock model access)
  • ./scripts/claude-model.sh --model qwen.qwen3-coder-next -p "hi" succeeds (with proxy running)
  • All local markdown links in root + bedrock READMEs resolve
  • /swe skill end-to-end on a fresh problem produces all four artifacts under the expected path
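The Python range in the first test-plan item follows from the requires-python pin ">=3.10,<3.14". As a rough illustration, assuming only that pin, the check below mirrors the range with plain tuple comparison; the function name is hypothetical, and uv performs its own resolution rather than using code like this.

```python
import sys

# Bounds taken from the requires-python pin in this PR: >=3.10,<3.14
MIN_VERSION = (3, 10)      # inclusive lower bound
MAX_EXCLUSIVE = (3, 14)    # exclusive upper bound


def satisfies_pin(version: tuple) -> bool:
    """Return True when a (major, minor) version falls inside the pinned range."""
    return MIN_VERSION <= tuple(version[:2]) < MAX_EXCLUSIVE


if __name__ == "__main__":
    current = sys.version_info[:2]
    print(f"Python {current[0]}.{current[1]} inside pin: {satisfies_pin(current)}")
```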

aarora79 added 2 commits June 5, 2026 20:52
The /swe skill drives any Claude Code model through a software-engineering
task in a real repo and lands four artifacts (github-issue, lld, review,
testing) under benchmarks/swe-benchmark-data/{repo}/{problem}/{model}/.
/summarize produces a per-run report with token usage, errors, and themes.

Includes a worked example targeting agentic-community/mcp-gateway-registry
at tag 1.24.4, with the remove-faiss problem already attempted by
qwen.qwen3-coder-next so contributors can see expected output before running
the skill themselves.

The cloned target repo lives at benchmarks/swe-benchmark-data/{repo}/repo/
and is gitignored so each contributor pins their own checkout.

Drops the alias layer in claude-model.sh / litellm-config.yaml /
humaneval_runner.py / README so the value passed via --model is the same
Bedrock model ID that hits the wire. Picker output, --list, and the
benchmark runner all key on the raw IDs. Audited the OpenAI-compatible
Bedrock endpoint (bedrock-mantle.us-east-1.api.aws/v1/models): the 38
third-party entries match the catalog exactly; no new providers to add.

Adds Claude Opus 4.7 and 4.8 cross-region inference profiles to the native
Bedrock list (verified ACTIVE via aws bedrock list-inference-profiles),
bumping the model count from 43 -> 45 and native count from 5 -> 7. Stamps
"latest" / "newest" descriptions with the audit date (June 5 2026) so the
recency claim is falsifiable; relabels MiniMax M2.5's 80.2% SWE-bench
number as vendor-claimed since this repo did not measure it.

Adds bedrock/pyproject.toml (litellm[proxy], aws-bedrock-token-generator,
datasets) so contributors run `uv sync` instead of letting setup-proxy.sh
discover-and-pip-install at runtime (which broke under PEP 668 on Python
3.12+). Pins requires-python ">=3.10,<3.14" because uvloop 0.21 cannot
import on Python 3.14.

setup-proxy.sh now binds to 127.0.0.1 by default with an optional --host
override (prints a warning when non-loopback). The proxy itself does not
authenticate clients; the only gate has always been the bearer token
upstream to Bedrock, so leaving it on 0.0.0.0 was unnecessarily exposed.

Reframes the root README around the two purposes the repo now serves:
running Claude Code against non-Anthropic models, and measuring how well
each model does coding work. The Evaluation section splits into the new
SWE-skill mode (real-world tasks, design-package artifacts) and the
existing HumanEval mode (single-function pass@1), with a clear note that
"SWE" in this repo means software engineering generally and is not the
SWE-bench dataset.

Also fixes paths in bedrock/README.md that pointed at the old
shekharprateek standalone repos (LICENSE, clone URL, See Also link, alias
paths in Shell Aliases section).
@aarora79 aarora79 merged commit 805db2c into main Jun 5, 2026
5 checks passed
