
feat: cost-aware harness model eval suite#914

Merged
simple-agent-manager[bot] merged 6 commits into main from sam/use-skill-build-next-01kqxg
May 6, 2026


simple-agent-manager Bot commented May 6, 2026

Summary

  • Add cost-aware model evaluation suite under experiments/harness-eval/ with 6 scenarios (1 baseline + 5 coding), 3 models (Gemma 4 26B, GPT-4.1 Mini, Claude Haiku 4.5), cost-per-success scoring, and full JSON trace persistence
  • All scenarios use deterministic virtual filesystems — no network, no side effects
  • Routes through SAM's single AI Gateway (ID: "sam"), treating Workers AI as Cloudflare-billed (not free)
  • Documents credential blockers (Unified Billing scope for OpenAI/Anthropic) and GPT-5 Mini unavailability
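
For readers unfamiliar with cost-per-success scoring, here is a minimal sketch of one plausible definition: total spend across all runs divided by the number of successful runs. The `RunResult` shape and both function names are hypothetical and are not taken from the suite; the actual `ScenarioResult` type and scoring formula in experiments/harness-eval/ may differ.

```typescript
// Hypothetical result shape for illustration; not the suite's real type.
interface RunResult {
  model: string;
  scenario: string;
  success: boolean;
  costUsd: number;
}

// Assumed definition: total spend divided by the number of successful runs.
// Failed runs still add to spend, so an unreliable cheap model can score
// worse than a reliable, pricier one. With zero successes the score is Infinity.
function costPerSuccess(runs: RunResult[]): number {
  const totalCost = runs.reduce((sum, run) => sum + run.costUsd, 0);
  const successes = runs.filter((run) => run.success).length;
  return successes === 0 ? Number.POSITIVE_INFINITY : totalCost / successes;
}

// Group runs by model and score each group independently.
function costPerSuccessByModel(runs: RunResult[]): Map<string, number> {
  const byModel = new Map<string, RunResult[]>();
  for (const run of runs) {
    const group = byModel.get(run.model) ?? [];
    group.push(run);
    byModel.set(run.model, group);
  }
  const scores = new Map<string, number>();
  byModel.forEach((group, model) => {
    scores.set(model, costPerSuccess(group));
  });
  return scores;
}
```

Under this definition a model that never succeeds sorts last regardless of how cheap its calls are, which seems to match the intent of treating Workers AI as billed rather than free.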

Validation

  • pnpm lint
  • pnpm typecheck
  • pnpm test
  • Additional validation run: local tsc --noEmit on experiment tsconfig

Staging Verification (REQUIRED for all code changes — merge-blocking)

N/A: docs-only — this PR only adds files under experiments/harness-eval/. No deployed code is changed.

End-to-End Verification (Required for multi-component changes)

N/A: standalone experiment, no multi-component interaction.

Post-Mortem (Required for bug fix PRs)

N/A: not a bug fix.

Agent Preflight (Required)

  • Preflight completed before code changes

Classification

  • docs-sync-change
  • business-logic-change

External References

  • Read existing experiment at experiments/ai-gateway-tool-call/experiment.ts and FINDINGS-gemma.md
  • Read model registry at packages/shared/src/constants/ai-services.ts (lines 100-399)
  • Read harness architecture at docs/architecture/agent-harness-integration.md
  • Read Go harness at packages/harness/agent/loop.go, packages/harness/llm/types.go

Codebase Impact Analysis

Only experiments/harness-eval/ — no changes to deployed packages (apps/, packages/).

Documentation & Specs

  • experiments/harness-eval/README.md — comprehensive docs for running, interpreting, and extending the suite

Constitution & Risk Check

Principle XI (No Hardcoded Values): Workers AI cost is configurable via WORKERS_AI_COST_PER_1K_TOKENS env var. Gateway ID configurable via AI_GATEWAY_ID. Scenarios and models filterable via env vars. No security risk — experiment does not modify any deployed code.
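
A rough sketch of how this configuration could be read follows. Only `WORKERS_AI_COST_PER_1K_TOKENS` and `AI_GATEWAY_ID` are named in this PR; the filter variable names `EVAL_SCENARIOS` and `EVAL_MODELS`, the fallback to gateway ID "sam", the placeholder default price, and every function name are assumptions made for illustration. The environment is passed in as a plain object so the example does not depend on Node typings.

```typescript
type Env = Record<string, string | undefined>;

interface EvalConfig {
  gatewayId: string;
  workersAiCostPer1kTokens: number;
  scenarioFilter: string[] | null; // null means "run everything"
  modelFilter: string[] | null;
}

// Placeholder only; this is NOT a real Workers AI price.
const PLACEHOLDER_COST_PER_1K_TOKENS = 0.001;

// Split a comma-separated env value into trimmed, non-empty names.
function parseList(value: string | undefined): string[] | null {
  if (value === undefined) return null;
  const items = value
    .split(",")
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
  return items.length > 0 ? items : null;
}

function loadConfig(env: Env): EvalConfig {
  const rawCost = env["WORKERS_AI_COST_PER_1K_TOKENS"];
  const cost =
    rawCost === undefined || rawCost.trim() === ""
      ? PLACEHOLDER_COST_PER_1K_TOKENS
      : Number(rawCost);
  if (!Number.isFinite(cost) || cost < 0) {
    throw new Error(`Invalid WORKERS_AI_COST_PER_1K_TOKENS: ${rawCost}`);
  }
  return {
    gatewayId: env["AI_GATEWAY_ID"] ?? "sam", // assumed default
    workersAiCostPer1kTokens: cost,
    scenarioFilter: parseList(env["EVAL_SCENARIOS"]), // hypothetical name
    modelFilter: parseList(env["EVAL_MODELS"]), // hypothetical name
  };
}

// Estimated Workers AI spend for a run: tokens are billed per 1,000.
function estimateWorkersAiCost(totalTokens: number, config: EvalConfig): number {
  return (totalTokens / 1000) * config.workersAiCostPer1kTokens;
}
```

Rejecting a malformed price at startup, rather than silently falling back to zero, keeps Workers AI from being scored as free by accident.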

Specialist Review Evidence

| Reviewer | Status | Findings |
| --- | --- | --- |
| task-completion-validator | WARN (no blockers) | MEDIUM: rubric.ts inlined per-scenario (functional). LOW: cost comment fixed. |
| doc-sync-validator | ADDRESSED | 2 HIGH fixed (model ID, env var name), 1 MEDIUM fixed (trace schema). |

🤖 Generated with Claude Code

raphaeltm and others added 6 commits May 6, 2026 02:07
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6 scenarios (weather baseline + 5 coding tasks), 3 models
(Gemma 4 26B, GPT-4.1 Mini, Claude Haiku 4.5), cost-per-success
scoring, full JSON trace persistence.

Scenarios: read-and-summarize, grep-locate-code, missing-file-recovery,
propose-patch, interpret-test-failure, weather-baseline.

All scenarios use virtual filesystems for deterministic execution.
Routes through SAM's single AI Gateway ("sam").
Workers AI treated as Cloudflare-billed, not free.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix Gemma model ID: gemma-4-27b-it → gemma-4-26b-a4b-it
- Fix env var name: WORKERS_AI_COST_PER_1K → WORKERS_AI_COST_PER_1K_TOKENS
- Rewrite trace schema to match actual EvalTrace/ScenarioResult types
- Fix cost.ts comment about PLATFORM_AI_MODELS import

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sonarqubecloud Bot commented May 6, 2026

Quality Gate failed

Failed conditions
3 Security Hotspots
8.0% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@simple-agent-manager simple-agent-manager Bot merged commit 15d9048 into main May 6, 2026
18 of 19 checks passed