GitHub - Mike-E-Log/agentic-eval-harness: Eval-gated runner driving Claude Code through phases with cross-vendor decision-support gates.

agentic-eval-harness · eval-gated runner for coding agents

Drives Claude Code through ideate → spec → plan → implement → review; at each
boundary, a cross-vendor judge panel scores against a known-good exemplar.
Decision-support, not an automated verdict.

It doesn't pretend an uncalibrated multi-LLM vote is a trustworthy classifier. You approve each phase; the gate informs the call. The disciplined path to automation — calibrating the gate against your own approve/reject decisions — is the v0.2 roadmap.
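The "you approve, the gate informs" flow can be sketched as a small loop. This is a minimal illustration, not the harness's actual code: `run_phases`, `score_phase`, and `decide` are hypothetical names, and the judge panel is stubbed out.

```python
from typing import Callable, Dict, List

# Phase order taken from the README; everything else here is hypothetical.
PHASES = ["ideate", "spec", "plan", "implement", "review"]


def run_phases(
    score_phase: Callable[[str], Dict[str, int]],
    decide: Callable[[str, Dict[str, int]], str],
) -> List[str]:
    """Walk the phases in order and stop at the first one the human rejects."""
    approved: List[str] = []
    for phase in PHASES:
        scores = score_phase(phase)  # judge panel output, advisory only
        if decide(phase, scores) != "approve":
            break  # the human's decision, not the scores, halts the run
        approved.append(phase)
    return approved


def fake_scores(phase: str) -> Dict[str, int]:
    # Stub standing in for the cross-vendor judge panel.
    return {"claude": 9, "gpt": 5, "gemini": 6}


def approve_until_implement(phase: str, scores: Dict[str, int]) -> str:
    # Stub standing in for the interactive approve/reject prompt.
    return "reject" if phase == "implement" else "approve"


result = run_phases(fake_scores, approve_until_implement)
print(result)  # ['ideate', 'spec', 'plan']
```

Note that the scores never appear in the control flow: they are passed to `decide` as information, which is the decision-support stance described above.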

What the gate shows you at a phase boundary (aeh demo prints exactly this):

PHASE: plan (gated) - exemplar match 6.67/10 · dispersion σ=2.1
  [σ = stdev of judge scores, 0-10; ⚠ judges disagree]
  judge   score critique
  claude  9     tasks are bite-sized and tested
  gpt     5     dependency order unclear in places
  gemini  6     some steps lack concrete code

[a]pprove  [r]eject  [d]iff  [v]erbose-critiques
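The header numbers in the scorecard above can be reproduced from the three judge scores. The sketch below assumes the exemplar match is the plain mean and that σ is the sample standard deviation (n − 1 denominator); that choice gives 2.1 for these scores, whereas the population form would give 1.7. The `summarize` helper is a hypothetical name.

```python
from statistics import mean, stdev

# Judge scores copied from the sample scorecard.
scores = {"claude": 9, "gpt": 5, "gemini": 6}


def summarize(judge_scores):
    """Return (exemplar match, dispersion) rounded as the scorecard shows them."""
    values = list(judge_scores.values())
    match = mean(values)   # assumed: unweighted mean of the judge scores
    sigma = stdev(values)  # assumed: sample standard deviation (n - 1)
    return round(match, 2), round(sigma, 1)


match, sigma = summarize(scores)
print(f"exemplar match {match}/10 · dispersion σ={sigma}")
# exemplar match 6.67/10 · dispersion σ=2.1
```

The cutoff at which the gate prints the "judges disagree" warning is not stated in this README, so it is left out of the sketch.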

Quickstart (no keys, no CLI)

pipx install agentic-eval-harness   # or: uvx agentic-eval-harness demo
aeh demo

aeh demo replays a recorded run through the same renderer a live run uses — in under 2 seconds, with no claude CLI and no API keys.

For a real run

Needs: Python 3.11+, git 2.5+, the claude CLI + a Claude Code subscription, and API keys for the judges (ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY). Missing a key just disables that judge.
Cost: a full 5-phase run makes up to 15 judge API calls (3 judges × 5 boundaries; fewer if a judge is disabled), plus the claude --print drive. The demo costs nothing.
Safe on your repo: each run works in an isolated git worktree; your working tree is never mutated; run state lives outside the worktree and survives cleanup.

aeh run <project>     # drive a project; the gate prompts you at each phase
aeh list              # see run ids
aeh resume <id>       # pick up an interrupted run
aeh show <id>         # re-print a run's last gate scorecard
aeh cleanup <id>      # remove a run's worktree (dry run by default; pass --force to delete)
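The key handling and the call-count ceiling described under "For a real run" can be sketched as follows. The judge-to-variable mapping is an assumption inferred from the judge names in the sample scorecard, and `enabled_judges` and `max_judge_calls` are hypothetical helpers, not part of the aeh CLI.

```python
from typing import List, Mapping

# Assumed mapping from judge name to the API key variable it needs.
JUDGE_KEYS = {
    "claude": "ANTHROPIC_API_KEY",
    "gpt": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}
BOUNDARIES = 5  # one gate per phase: ideate, spec, plan, implement, review


def enabled_judges(env: Mapping[str, str]) -> List[str]:
    """A judge is enabled only if its key is present and non-empty."""
    return [judge for judge, key in JUDGE_KEYS.items() if env.get(key)]


def max_judge_calls(env: Mapping[str, str]) -> int:
    """Upper bound on judge API calls for a full run with this environment."""
    return len(enabled_judges(env)) * BOUNDARIES


# Example: only two of the three keys are set, so one judge is disabled.
example_env = {"ANTHROPIC_API_KEY": "sk-example", "GEMINI_API_KEY": "example"}
print(enabled_judges(example_env))   # ['claude', 'gemini']
print(max_judge_calls(example_env))  # 10
```

In practice you would pass `os.environ` instead of `example_env`; with all three keys set the bound is the 15 calls quoted above.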

Status: v0.1 in progress. See docs/SPEC.md for the design and docs/DECISIONS.md for how it was reached (including the reviews that shaped it).

AI-evaluation engineering portfolio — five repos, one discipline:

ai-eval-toolkit — judge-vs-human calibration (Cohen's κ / Kendall-τ vs Landis–Koch bands)
agentic-eval-harness (you are here) — eval-gated Claude Code phase boundaries with cross-vendor scorecards
ai-eval-atlas — practitioner + technique map, source-linked
ai-engineer-best-practices — handbook + score MCP tool (3-vendor judge ensemble)
learn-ai-eval — Claude-tutored learning engine for the eval canon

Profile: github.com/Mike-E-Log · website: mikeilog.com

Repository contents (30 commits):

.github/workflows
docs
examples/recorded-run
plan
prompts
src/aeh
tests
.gitignore
CHANGELOG.md
LICENSE
README.md
pyproject.toml
social-preview.png
