research: evaluate GPT-4.1 Mini via SAM AI Gateway #913

Merged
simple-agent-manager[bot] merged 3 commits into main from sam/use-skill-continue-sam-01kqx7
May 6, 2026
Conversation


simple-agent-manager Bot commented May 5, 2026

Summary

  • Evaluates GPT-4.1 Mini as a small OpenAI harness fallback via SAM's AI Gateway OpenAI path.
  • Corrects the follow-up cost model: Workers AI/Gemma are Cloudflare-billed, not free.
  • Updates model registry metadata, admin AI proxy labels, and harness docs to use a low-cost Workers AI tier with current per-token costs.
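The kind of metadata change the last two bullets describe can be sketched as follows. The interface name, field names, and tier labels here are assumptions for illustration, not the actual shape used in `packages/shared/src/constants/ai-services.ts`:

```typescript
// Hypothetical sketch of a low-cost Workers AI registry entry.
// Field names and tier labels are illustrative assumptions, not
// the real contents of packages/shared/src/constants/ai-services.ts.
interface ModelRegistryEntry {
  id: string;
  provider: "workers-ai" | "openai";
  tier: "low-cost" | "paid-fallback";
  // Approximate USD per 1K tokens, for budget estimation only;
  // actual usage is expected from AI Gateway logs.
  inputCostPer1K: number;
  outputCostPer1K: number;
}

const gemma4: ModelRegistryEntry = {
  id: "gemma-4-26b",
  provider: "workers-ai",
  tier: "low-cost", // previously mislabeled as free
  inputCostPer1K: 0.0001,
  outputCostPer1K: 0.0003,
};

console.log(gemma4.tier);
```

The point of the nonzero cost fields is that Workers AI usage is Cloudflare-billed, so a zero value would reintroduce the bug this PR corrects.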

Findings

| Metric | GPT-4.1 Mini | Gemma 4 26B |
| --- | --- | --- |
| Two-tool loop | PASS | PASS |
| Workarounds needed | 0 | 0 |
| Total tokens (2-tool) | 606 | 1,159 |
| Total latency | ~2.6s | ~4.0s |
| Reasoning traces | None | Built-in reasoning field |
| Billing | OpenAI via Cloudflare Unified Billing | Cloudflare Workers AI billing |
| Context window | 1M | 32K |

Recommendation: Keep Gemma 4 as the low-cost default and continue evaluating small OpenAI fallbacks. GPT-4.1 Mini is strong for latency/efficiency-sensitive workloads; GPT-5 Mini is the next cost-aligned candidate to test. GPT-4.1 Nano is not recommended because duplicate tool calls were observed.
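For context, one turn of the two-tool loop through the AI Gateway's OpenAI-compatible path can be sketched as below. The account and gateway slugs, environment variable names, and helper function are placeholders; only the `/openai` path, `tool_choice: "auto"`, and the `cf-aig-authorization` (Unified Billing) header come from this PR's description:

```typescript
// Sketch of building one turn of the tool-call loop against the
// AI Gateway's OpenAI-compatible /openai path. ACCOUNT_ID and
// GATEWAY_SLUG are placeholders; cf-aig-authorization enables
// Unified Billing as described in this PR.
function buildGatewayRequest(model: string, messages: unknown[]) {
  const account = "ACCOUNT_ID"; // placeholder
  const gateway = "GATEWAY_SLUG"; // placeholder
  return {
    url: `https://gateway.ai.cloudflare.com/v1/${account}/${gateway}/openai/chat/completions`,
    headers: {
      authorization: `Bearer ${process.env.OPENAI_API_KEY ?? ""}`,
      "cf-aig-authorization": `Bearer ${process.env.AIG_TOKEN ?? ""}`,
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model, // e.g. "gpt-4.1-mini"
      messages,
      tools: [], // the two coding tools in the real experiment
      tool_choice: "auto", // no workarounds needed for GPT-4.1 Mini
    }),
  };
}

const req = buildGatewayRequest("gpt-4.1-mini", [
  { role: "user", content: "list the repo files" },
]);
console.log(req.url);
```

The "zero workarounds" finding means exactly this plain OpenAI request shape worked as-is, with no provider-specific tool-call patching.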

Test Evidence

  • Verified GPT-4.1 Mini two-tool loop via manual curl (3 turns)
  • Verified GPT-4.1 Mini harness-style coding tools (grep first action)
  • Verified content: null handling (works, native OpenAI format)
  • Verified GPT-4.1 Nano quality issue (duplicate tool calls)
  • Ran full experiment.ts suite (4/5 models pass)
  • pnpm --filter @simple-agent-manager/shared test -- ai-model-registry
  • pnpm --filter @simple-agent-manager/api test -- ai-proxy
  • pnpm --filter @simple-agent-manager/web typecheck
  • pnpm exec playwright test tests/playwright/admin-ai-proxy-audit.spec.ts

Specialist Review Evidence

| Reviewer | Status | Outcome |
| --- | --- | --- |
| N/A | DEFERRED | No separate reviewer dispatched for this narrow pricing metadata/docs/UI correction; focused tests and Playwright audit were run locally. |

Staging

  • Deployed with deploy-staging.yml run 25413180777 from branch sam/use-skill-continue-sam-01kqx7.
  • Staging smoke tests passed.
  • Live API verification: GET https://api.sammy.party/api/admin/ai-proxy/config returns Workers AI models with tier: "low-cost" and nonzero per-1K token costs; Gemma 4 returns 0.0001 input / 0.0003 output.
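Those per-1K rates imply a per-run cost in the hundredths-of-a-cent range. The sketch below assumes a 900/259 input/output split for the 1,159-token Gemma run, since the findings report only the total:

```typescript
// Rough budget estimate for one Gemma 4 two-tool run, using the
// per-1K rates returned by the admin config endpoint. The 900/259
// input/output split is an assumption; only the 1,159-token total
// is reported in this PR.
function costUSD(
  inputTokens: number,
  outputTokens: number,
  inPer1K: number,
  outPer1K: number,
): number {
  return (inputTokens / 1000) * inPer1K + (outputTokens / 1000) * outPer1K;
}

const gemmaRun = costUSD(900, 259, 0.0001, 0.0003);
console.log(gemmaRun.toFixed(7)); // on the order of $0.0002 per run
```

Even at 1,159 tokens per loop, the corrected metadata still puts Gemma 4 firmly in a low-cost tier; the fix is about billing accuracy, not affordability.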

Agent Preflight (Required)

  • Preflight completed before code changes

Classification

  • external-api-change
  • cross-component-change
  • business-logic-change
  • public-surface-change
  • docs-sync-change
  • security-sensitive-change
  • ui-change
  • infra-change

External References

Official documentation was consulted for current pricing before changing metadata.

Codebase Impact Analysis

Cross-component impact is limited to pricing/tier metadata and surfaces that display or test it: packages/shared/src/constants/ai-services.ts, packages/shared/tests/unit/ai-model-registry.test.ts, apps/api/tests/unit/routes/ai-proxy.test.ts, apps/web/src/lib/api/admin.ts, apps/web/src/pages/AdminAIProxy.tsx, and apps/web/tests/playwright/admin-ai-proxy-audit.spec.ts.

Documentation & Specs

Updated harness and architecture docs that previously described Workers AI/Gemma as free: experiments/ai-gateway-tool-call/FINDINGS.md, experiments/ai-gateway-tool-call/FINDINGS-gemma.md, experiments/ai-gateway-tool-call/FINDINGS-openai.md, and docs/architecture/agent-harness-integration.md.

Constitution & Risk Check

Checked Principle XI for hardcoded values risk. Pricing metadata remains centralized in the platform model registry and is explicitly approximate for budget estimation, with actual usage expected from AI Gateway logs. Risk is pricing staleness, mitigated by documenting official sources and updating tests to prevent future zero-cost Workers AI assumptions.
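The guard against future zero-cost Workers AI assumptions could look like this sketch; the registry shape and helper name are hypothetical, not the actual `ai-model-registry` test:

```typescript
// Sketch of a guard test that would prevent Workers AI models from
// regressing to zero-cost metadata. The Entry shape and model list
// are illustrative assumptions, not the real registry test.
type Entry = {
  id: string;
  provider: string;
  inputCostPer1K: number;
  outputCostPer1K: number;
};

function assertNoFreeWorkersAI(registry: Entry[]): void {
  for (const m of registry) {
    if (
      m.provider === "workers-ai" &&
      (m.inputCostPer1K <= 0 || m.outputCostPer1K <= 0)
    ) {
      throw new Error(
        `${m.id}: Workers AI is Cloudflare-billed, costs must be nonzero`,
      );
    }
  }
}

// Passes for the corrected metadata:
assertNoFreeWorkersAI([
  {
    id: "gemma-4-26b",
    provider: "workers-ai",
    inputCostPer1K: 0.0001,
    outputCostPer1K: 0.0003,
  },
]);
```

A test in this spirit turns the pricing-staleness mitigation from a documentation promise into an enforced invariant.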

raphaeltm and others added 3 commits May 5, 2026 23:32
Run focused experiment against gpt-4.1-mini through SAM's existing
AI Gateway (/openai path with cf-aig-authorization / Unified Billing).
Compare tool-call behavior, token efficiency, latency, and response
shape against the merged Gemma 4 26B baseline.

Key findings:
- Two-tool loop PASS with tool_choice: "auto" (zero workarounds)
- 1.9x more token-efficient than Gemma 4 (606 vs 1,159 tokens)
- ~1.5x faster latency (~2.6s vs ~4.0s total)
- No reasoning field (unlike Gemma's free CoT traces)
- gpt-4.1-nano NOT recommended (duplicate tool calls observed)
- Validates existing SAM proxy code path works without changes
- Unified Billing required (cf-aig-authorization header)

Recommendation: keep Gemma 4 as free default, add GPT-4.1 Mini as
paid fallback tier for latency/efficiency-sensitive workloads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sonarqubecloud Bot commented May 6, 2026

simple-agent-manager Bot merged commit d8dea4c into main May 6, 2026
22 checks passed
