feat: cost-aware harness model eval suite by simple-agent-manager[bot] · Pull Request #914 · raphaeltm/simple-agent-manager

simple-agent-manager · 2026-05-06T02:26:44Z

Summary

Add cost-aware model evaluation suite under experiments/harness-eval/ with 6 scenarios (1 baseline + 5 coding), 3 models (Gemma 4 26B, GPT-4.1 Mini, Claude Haiku 4.5), cost-per-success scoring, and full JSON trace persistence
All scenarios use deterministic virtual filesystems — no network, no side effects
Routes through SAM's single AI Gateway (ID: "sam"), treating Workers AI as Cloudflare-billed (not free)
Documents credential blockers (Unified Billing scope for OpenAI/Anthropic) and GPT-5 Mini unavailability
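
As a rough illustration of the cost-per-success scoring mentioned above, the sketch below aggregates runs per model. It assumes cost per success means total spend (failed runs included) divided by the number of successful runs; the PR text does not spell out the formula. `RunResult`, `ModelScore`, and `costPerSuccess` are hypothetical names, not the suite's actual `EvalTrace`/`ScenarioResult` types.

```typescript
// Hypothetical result shape for illustration only; not the suite's real trace schema.
interface RunResult {
  model: string;
  scenario: string;
  success: boolean;
  costUsd: number;
}

interface ModelScore {
  model: string;
  runs: number;
  successes: number;
  totalCostUsd: number;
  // null when the model never succeeded, since the ratio is undefined.
  costPerSuccessUsd: number | null;
}

// Assumption: failed runs still cost money, so they count toward the numerator.
function costPerSuccess(results: RunResult[]): ModelScore[] {
  const byModel: { [model: string]: ModelScore } = {};
  for (const r of results) {
    if (!byModel[r.model]) {
      byModel[r.model] = {
        model: r.model,
        runs: 0,
        successes: 0,
        totalCostUsd: 0,
        costPerSuccessUsd: null,
      };
    }
    const score = byModel[r.model];
    score.runs += 1;
    if (r.success) score.successes += 1;
    score.totalCostUsd += r.costUsd;
  }
  return Object.keys(byModel).map((model) => {
    const score = byModel[model];
    score.costPerSuccessUsd =
      score.successes > 0 ? score.totalCostUsd / score.successes : null;
    return score;
  });
}
```

Under this definition a cheap model that fails often can still rank worse than a pricier model that succeeds reliably, which is presumably the point of scoring on cost per success rather than raw token price.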

Validation

pnpm lint
pnpm typecheck
pnpm test
Additional validation run: local tsc --noEmit on experiment tsconfig

Staging Verification (REQUIRED for all code changes — merge-blocking)

N/A: docs-only — this PR only adds files under experiments/harness-eval/. No deployed code is changed.

End-to-End Verification (Required for multi-component changes)

N/A: standalone experiment, no multi-component interaction.

Post-Mortem (Required for bug fix PRs)

N/A: not a bug fix.

Agent Preflight (Required)

Preflight completed before code changes

Classification

docs-sync-change
business-logic-change

External References

Read existing experiment at experiments/ai-gateway-tool-call/experiment.ts and FINDINGS-gemma.md
Read model registry at packages/shared/src/constants/ai-services.ts (lines 100-399)
Read harness architecture at docs/architecture/agent-harness-integration.md
Read Go harness at packages/harness/agent/loop.go, packages/harness/llm/types.go

Codebase Impact Analysis

Only experiments/harness-eval/ — no changes to deployed packages (apps/, packages/).

Documentation & Specs

experiments/harness-eval/README.md — comprehensive docs for running, interpreting, and extending the suite

Constitution & Risk Check

Principle XI (No Hardcoded Values): Workers AI cost is configurable via WORKERS_AI_COST_PER_1K_TOKENS env var. Gateway ID configurable via AI_GATEWAY_ID. Scenarios and models filterable via env vars. No security risk — experiment does not modify any deployed code.
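
A minimal sketch of how the two named environment variables might be read, assuming a plain key/value environment object. Only `WORKERS_AI_COST_PER_1K_TOKENS`, `AI_GATEWAY_ID`, and the gateway ID "sam" come from this PR; `loadEvalConfig`, `estimateCostUsd`, and the placeholder default cost are hypothetical, and the placeholder is not a real Cloudflare price.

```typescript
type Env = { [key: string]: string | undefined };

interface EvalConfig {
  gatewayId: string;
  workersAiCostPer1kTokens: number;
}

// Placeholder default for illustration; the suite's actual default is not shown in the PR.
const PLACEHOLDER_COST_PER_1K_TOKENS = 0.001;

function loadEvalConfig(env: Env): EvalConfig {
  const rawCost = env["WORKERS_AI_COST_PER_1K_TOKENS"];
  const cost =
    rawCost === undefined || rawCost.trim() === ""
      ? PLACEHOLDER_COST_PER_1K_TOKENS
      : Number(rawCost);
  // Reject non-numeric or negative values instead of silently pricing runs at NaN.
  if (!isFinite(cost) || cost < 0) {
    throw new Error("Invalid WORKERS_AI_COST_PER_1K_TOKENS: " + rawCost);
  }
  return {
    // "sam" is the gateway ID named in the PR summary.
    gatewayId: env["AI_GATEWAY_ID"] || "sam",
    workersAiCostPer1kTokens: cost,
  };
}

// Converts a token count into an estimated cost at a flat per-1K-token rate.
function estimateCostUsd(totalTokens: number, costPer1kTokens: number): number {
  return (totalTokens / 1000) * costPer1kTokens;
}
```

In a Node script this would typically be called as `loadEvalConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test with fixed inputs.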

Specialist Review Evidence

Reviewer	Status	Findings
task-completion-validator	WARN (no blockers)	MEDIUM: rubric.ts inlined per-scenario (functional). LOW: cost comment fixed.
doc-sync-validator	ADDRESSED	2 HIGH fixed (model ID, env var name), 1 MEDIUM fixed (trace schema).

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

6 scenarios (weather baseline + 5 coding tasks), 3 models (Gemma 4 26B, GPT-4.1 Mini, Claude Haiku 4.5), cost-per-success scoring, full JSON trace persistence.

Scenarios: read-and-summarize, grep-locate-code, missing-file-recovery, propose-patch, interpret-test-failure, weather-baseline.

All scenarios use virtual filesystems for deterministic execution. Routes through SAM's single AI Gateway ("sam"). Workers AI treated as Cloudflare-billed, not free.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix Gemma model ID: gemma-4-27b-it → gemma-4-26b-a4b-it
- Fix env var name: WORKERS_AI_COST_PER_1K → WORKERS_AI_COST_PER_1K_TOKENS
- Rewrite trace schema to match actual EvalTrace/ScenarioResult types
- Fix cost.ts comment about PLATFORM_AI_MODELS import

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sonarqubecloud · 2026-05-06T02:39:15Z

Quality Gate failed

Failed conditions
3 Security Hotspots
8.0% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

raphaeltm and others added 6 commits May 6, 2026 02:07

chore: move harness eval task to active

cf8d189

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: archive harness eval suite task

ac5b271

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ci: rerun with preflight evidence

110cdbb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ci: rerun with preflight evidence markers

eea45a2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

simple-agent-manager Bot merged commit 15d9048 into main May 6, 2026
18 of 19 checks passed

simple-agent-manager[bot] merged 6 commits into main from sam/use-skill-build-next-01kqxg
