research: evaluate GPT-4.1 Mini via SAM AI Gateway #913

Merged
simple-agent-manager[bot] merged 3 commits into main from sam/use-skill-continue-sam-01kqx7
May 6, 2026
Conversation


simple-agent-manager Bot commented May 5, 2026

Summary

  • Evaluates GPT-4.1 Mini as a small OpenAI harness fallback via SAM's AI Gateway OpenAI path.
  • Corrects the follow-up cost model: Workers AI/Gemma are Cloudflare-billed, not free.
  • Updates model registry metadata, admin AI proxy labels, and harness docs to use a low-cost Workers AI tier with current per-token costs.
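The kind of metadata change the last two bullets describe can be sketched as follows. The interface name, field names, and tier labels here are assumptions for illustration, not the actual shape used in `packages/shared/src/constants/ai-services.ts`:

```typescript
// Hypothetical sketch of a low-cost Workers AI registry entry.
// Field names and tier labels are illustrative assumptions, not
// the real contents of packages/shared/src/constants/ai-services.ts.
interface ModelRegistryEntry {
  id: string;
  provider: "workers-ai" | "openai";
  tier: "low-cost" | "paid-fallback";
  // Approximate USD per 1K tokens, for budget estimation only;
  // actual usage is expected from AI Gateway logs.
  inputCostPer1K: number;
  outputCostPer1K: number;
}

const gemma4: ModelRegistryEntry = {
  id: "gemma-4-26b",
  provider: "workers-ai",
  tier: "low-cost", // previously mislabeled as free
  inputCostPer1K: 0.0001,
  outputCostPer1K: 0.0003,
};

console.log(gemma4.tier);
```

The point of the nonzero cost fields is that Workers AI usage is Cloudflare-billed, so a zero value would reintroduce the bug this PR corrects.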

Findings

| Metric | GPT-4.1 Mini | Gemma 4 26B |
| --- | --- | --- |
| Two-tool loop | PASS | PASS |
| Workarounds needed | 0 | 0 |
| Total tokens (2-tool) | 606 | 1,159 |
| Total latency | ~2.6s | ~4.0s |
| Reasoning traces | None | Built-in reasoning field |
| Billing | OpenAI via Cloudflare Unified Billing | Cloudflare Workers AI billing |
| Context window | 1M | 32K |

Recommendation: Keep Gemma 4 as the low-cost default and continue evaluating small OpenAI fallbacks. GPT-4.1 Mini is strong for latency/efficiency-sensitive workloads; GPT-5 Mini is the next cost-aligned candidate to test. GPT-4.1 Nano is not recommended because duplicate tool calls were observed.
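For context, one turn of the two-tool loop through the AI Gateway's OpenAI-compatible path can be sketched as below. The account and gateway slugs, environment variable names, and helper function are placeholders; only the `/openai` path, `tool_choice: "auto"`, and the `cf-aig-authorization` (Unified Billing) header come from this PR's description:

```typescript
// Sketch of building one turn of the tool-call loop against the
// AI Gateway's OpenAI-compatible /openai path. ACCOUNT_ID and
// GATEWAY_SLUG are placeholders; cf-aig-authorization enables
// Unified Billing as described in this PR.
function buildGatewayRequest(model: string, messages: unknown[]) {
  const account = "ACCOUNT_ID"; // placeholder
  const gateway = "GATEWAY_SLUG"; // placeholder
  return {
    url: `https://gateway.ai.cloudflare.com/v1/${account}/${gateway}/openai/chat/completions`,
    headers: {
      authorization: `Bearer ${process.env.OPENAI_API_KEY ?? ""}`,
      "cf-aig-authorization": `Bearer ${process.env.AIG_TOKEN ?? ""}`,
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model, // e.g. "gpt-4.1-mini"
      messages,
      tools: [], // the two coding tools in the real experiment
      tool_choice: "auto", // no workarounds needed for GPT-4.1 Mini
    }),
  };
}

const req = buildGatewayRequest("gpt-4.1-mini", [
  { role: "user", content: "list the repo files" },
]);
console.log(req.url);
```

The "zero workarounds" finding means exactly this plain OpenAI request shape worked as-is, with no provider-specific tool-call patching.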

Test Evidence

  • Verified GPT-4.1 Mini two-tool loop via manual curl (3 turns)
  • Verified GPT-4.1 Mini harness-style coding tools (grep first action)
  • Verified content: null handling (works, native OpenAI format)
  • Verified GPT-4.1 Nano quality issue (duplicate tool calls)
  • Ran full experiment.ts suite (4/5 models pass)
  • pnpm --filter @simple-agent-manager/shared test -- ai-model-registry
  • pnpm --filter @simple-agent-manager/api test -- ai-proxy
  • pnpm --filter @simple-agent-manager/web typecheck
  • pnpm exec playwright test tests/playwright/admin-ai-proxy-audit.spec.ts

Specialist Review Evidence

| Reviewer | Status | Outcome |
| --- | --- | --- |
| N/A | DEFERRED | No separate reviewer dispatched for this narrow pricing metadata/docs/UI correction; focused tests and Playwright audit were run locally. |

Staging

  • Deployed with deploy-staging.yml run 25413180777 from branch sam/use-skill-continue-sam-01kqx7.
  • Staging smoke tests passed.
  • Live API verification: GET https://api.sammy.party/api/admin/ai-proxy/config returns Workers AI models with tier: "low-cost" and nonzero per-1K token costs; Gemma 4 returns 0.0001 input / 0.0003 output.
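Those per-1K rates imply a per-run cost in the hundredths-of-a-cent range. The sketch below assumes a 900/259 input/output split for the 1,159-token Gemma run, since the findings report only the total:

```typescript
// Rough budget estimate for one Gemma 4 two-tool run, using the
// per-1K rates returned by the admin config endpoint. The 900/259
// input/output split is an assumption; only the 1,159-token total
// is reported in this PR.
function costUSD(
  inputTokens: number,
  outputTokens: number,
  inPer1K: number,
  outPer1K: number,
): number {
  return (inputTokens / 1000) * inPer1K + (outputTokens / 1000) * outPer1K;
}

const gemmaRun = costUSD(900, 259, 0.0001, 0.0003);
console.log(gemmaRun.toFixed(7)); // on the order of $0.0002 per run
```

Even at 1,159 tokens per loop, the corrected metadata still puts Gemma 4 firmly in a low-cost tier; the fix is about billing accuracy, not affordability.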

Agent Preflight (Required)

  • Preflight completed before code changes

Classification

  • external-api-change
  • cross-component-change
  • business-logic-change
  • public-surface-change
  • docs-sync-change
  • security-sensitive-change
  • ui-change
  • infra-change

External References

Official documentation was consulted for current pricing before changing metadata.

Codebase Impact Analysis

Cross-component impact is limited to pricing/tier metadata and surfaces that display or test it: packages/shared/src/constants/ai-services.ts, packages/shared/tests/unit/ai-model-registry.test.ts, apps/api/tests/unit/routes/ai-proxy.test.ts, apps/web/src/lib/api/admin.ts, apps/web/src/pages/AdminAIProxy.tsx, and apps/web/tests/playwright/admin-ai-proxy-audit.spec.ts.

Documentation & Specs

Updated harness and architecture docs that previously described Workers AI/Gemma as free: experiments/ai-gateway-tool-call/FINDINGS.md, experiments/ai-gateway-tool-call/FINDINGS-gemma.md, experiments/ai-gateway-tool-call/FINDINGS-openai.md, and docs/architecture/agent-harness-integration.md.

Constitution & Risk Check

Checked Principle XI for hardcoded values risk. Pricing metadata remains centralized in the platform model registry and is explicitly approximate for budget estimation, with actual usage expected from AI Gateway logs. Risk is pricing staleness, mitigated by documenting official sources and updating tests to prevent future zero-cost Workers AI assumptions.
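The guard against future zero-cost Workers AI assumptions could look like this sketch; the registry shape and helper name are hypothetical, not the actual `ai-model-registry` test:

```typescript
// Sketch of a guard test that would prevent Workers AI models from
// regressing to zero-cost metadata. The Entry shape and model list
// are illustrative assumptions, not the real registry test.
type Entry = {
  id: string;
  provider: string;
  inputCostPer1K: number;
  outputCostPer1K: number;
};

function assertNoFreeWorkersAI(registry: Entry[]): void {
  for (const m of registry) {
    if (
      m.provider === "workers-ai" &&
      (m.inputCostPer1K <= 0 || m.outputCostPer1K <= 0)
    ) {
      throw new Error(
        `${m.id}: Workers AI is Cloudflare-billed, costs must be nonzero`,
      );
    }
  }
}

// Passes for the corrected metadata:
assertNoFreeWorkersAI([
  {
    id: "gemma-4-26b",
    provider: "workers-ai",
    inputCostPer1K: 0.0001,
    outputCostPer1K: 0.0003,
  },
]);
```

A test in this spirit turns the pricing-staleness mitigation from a documentation promise into an enforced invariant.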

raphaeltm and others added 3 commits May 5, 2026 23:32
Run focused experiment against gpt-4.1-mini through SAM's existing
AI Gateway (/openai path with cf-aig-authorization / Unified Billing).
Compare tool-call behavior, token efficiency, latency, and response
shape against the merged Gemma 4 26B baseline.

Key findings:
- Two-tool loop PASS with tool_choice: "auto" (zero workarounds)
- 1.9x more token-efficient than Gemma 4 (606 vs 1,159 tokens)
- ~1.5x faster latency (~2.6s vs ~4.0s total)
- No reasoning field (unlike Gemma's free CoT traces)
- gpt-4.1-nano NOT recommended (duplicate tool calls observed)
- Validates existing SAM proxy code path works without changes
- Unified Billing required (cf-aig-authorization header)

Recommendation: keep Gemma 4 as free default, add GPT-4.1 Mini as
paid fallback tier for latency/efficiency-sensitive workloads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sonarqubecloud Bot commented May 6, 2026

simple-agent-manager Bot merged commit d8dea4c into main May 6, 2026
22 checks passed
