
Mike-E-Log/ai-eval-toolkit


ai-eval-toolkit · judge-vs-human calibration

Cohen's κ, Kendall-τ, and disclosed Landis–Koch bands: this toolkit measures whether
your LLM judge agrees with human labels, the question most eval tooling skips.


$ python -m eval_tk calibrate --demo
metric: cohen_kappa
value:  0.688042
band:   substantial (Landis-Koch 1977 (kappa))
n_items: 60
verdict: substantial agreement (cohen_kappa=0.688): judge is trustworthy enough to reduce human review; spot-check periodically for drift [operating cutoffs are this tool's recommendation, not an established standard]
note:    no rows dropped

The bundled fixture is real MT-Bench data: 60 pairwise matchups that both humans and GPT-4 judged (lmsys/mt_bench_human_judgments, Apache-2.0). κ=0.688 is the genuine GPT-4-vs-human agreement on that sample.
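
The `band` line in the demo output places κ on the Landis & Koch (1977) scale. A minimal sketch of that lookup follows, assuming the published ranges with their lower bounds as cut points. The function name `landis_koch_band` is hypothetical and is not the toolkit's API; how the tool itself handles values that fall between published ranges is not stated here.

```python
def landis_koch_band(kappa: float) -> str:
    """Return the Landis & Koch (1977) descriptive label for a kappa value.

    Hypothetical helper for illustration only; not part of eval_tk.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    # (lower bound, label), checked from the highest band down.
    bands = [
        (0.81, "almost perfect"),
        (0.61, "substantial"),
        (0.41, "moderate"),
        (0.21, "fair"),
        (0.00, "slight"),
    ]
    for lower, label in bands:
        if kappa >= lower:
            return label
    return "poor"  # kappa below zero: agreement worse than chance


print(landis_koch_band(0.688042))  # substantial
```

The demo value 0.688 lands in the 0.61 to 0.80 range, which is why the tool reports "substantial".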

How this differs

Most eval tooling answers "did the output match a reference?" or "which output did the model-judge prefer?" This answers a different question: does your automated judge actually agree with human labels — and by how much?

| Tool / approach | Core question it answers | Centers judge-vs-human calibration? |
| --- | --- | --- |
| ai-eval-toolkit | Does my LLM judge agree with human labels? | Yes: Cohen's κ / Kendall-τ vs Landis-Koch bands, honest cutoffs |
| LLM-as-judge harnesses | Which output does the model-judge prefer? | No: judge-vs-judge |
| promptfoo, OpenAI evals | Did output match an assertion / reference? | No: output-vs-expected |
| RAGAS, DeepEval | How good is the output on metric X? | No: judge-vs-reference, not human agreement |

This table compares a single axis, judge-vs-human calibration, not overall capability; these tools do much more. Calibration against human labels is simply the one question they don't center.

Install & run (30 seconds)

pip install -r requirements.txt
python -m eval_tk calibrate --demo
python -m eval_tk calibrate --input your_labels.csv   # columns: item_id,human_label,judge_label
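
To try the `--input` path, you need a CSV with the three columns named above. The sketch below builds one with the standard library. The label values (`"A"`, `"B"`, `"tie"`) and item ids are made up for illustration; which label vocabulary the tool accepts is an assumption, and `labels_to_csv` is a hypothetical helper, not part of `eval_tk`.

```python
import csv
import io

COLUMNS = ["item_id", "human_label", "judge_label"]

# Hypothetical rows: one pairwise verdict per item from a human and a judge.
SAMPLE_ROWS = [
    {"item_id": "q1", "human_label": "A", "judge_label": "A"},
    {"item_id": "q2", "human_label": "B", "judge_label": "A"},
    {"item_id": "q3", "human_label": "tie", "judge_label": "tie"},
]


def labels_to_csv(rows):
    """Serialize label rows to CSV text with the header the CLI comment lists."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(labels_to_csv(SAMPLE_ROWS))
```

Writing the returned text to `your_labels.csv` gives a file shaped for the third command above.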

Design decisions (why these metrics)

The choices here are the point — each is a deliberate call about not over-claiming:

  • Cohen's κ for agreement above chance (Cohen 1960) — raw % agreement lies under label imbalance; κ corrects for chance.
  • Landis & Koch (1977) bands, used as-is — an established κ-interpretation standard, so no invented threshold applies there.
  • The drift verdict's action cutoffs (≥0.61 / ≥0.41) are disclosed as editorial, not dressed up as standard — every verdict carries an explicit caveat.
  • Kendall-τ has no established band standard — τ is rank correlation, not κ, so Landis-Koch's κ interpretation doesn't formally apply. The tool reuses the same κ cutoffs as a disclosed heuristic (the τ output's band_basis says exactly that), rather than inventing a second arbitrary scale or pretending a τ standard exists.
  • Metrics via scipy / scikit-learn, never hand-rolled — auditable against reference impls.
  • The bias command is an open question, not a verdict — experimental, threshold-free by design.
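
The first design point, that raw agreement misleads under label imbalance, is easiest to see with numbers. The sketch below computes unweighted Cohen's κ by hand as κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. It is hand-rolled only to expose the arithmetic; the toolkit itself states that it relies on scipy / scikit-learn. The data are invented: a judge that always answers "pass" against humans who pass 95 of 100 items.

```python
from collections import Counter


def raw_agreement(human, judge):
    """Fraction of items on which the two raters give the same label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human, judge):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e). Illustration only."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    p_o = raw_agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their observed frequencies.
    p_e = sum(
        (h_counts[label] / n) * (j_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1.0 - p_e)


# Invented, heavily imbalanced data: the judge never says "fail".
human = ["pass"] * 95 + ["fail"] * 5
judge = ["pass"] * 100

print(raw_agreement(human, judge))  # 0.95
print(cohen_kappa(human, judge))    # 0.0
```

Here 95% raw agreement corresponds to κ = 0: the judge does no better than a rater who ignores the item, which is the failure mode κ is meant to expose.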

CLI

python -m eval_tk calibrate --demo                       # bundled fixture
python -m eval_tk calibrate --input labels.csv           # your data
python -m eval_tk calibrate --input labels.csv --metric kendall_tau --format json
python -m eval_tk bias --input judges.json               # experimental, no threshold
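
For readers unfamiliar with what `--metric kendall_tau` measures: Kendall's τ counts item pairs that the two raters order the same way (concordant) versus opposite ways (discordant). The sketch below implements the tau-b variant, which adjusts for ties and is the default of `scipy.stats.kendalltau`. That the toolkit uses tau-b is an assumption; this pure-Python version only illustrates the definition and is not the tool's code. The scores are invented.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numeric scores."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both ratings: counts toward neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    if denom == 0:
        raise ValueError("tau-b is undefined when a rating sequence is constant")
    return (concordant - discordant) / denom


# Invented 1-5 quality scores; the judge swaps the order of the top two items.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 2, 3, 5, 4]
print(kendall_tau_b(human_scores, judge_scores))  # 0.8
```

With 9 concordant pairs and 1 discordant pair out of 10, τ = (9 − 1) / 10 = 0.8.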

Error responses are structured: {error_code, message, fix_hint, docs_url} — every failure tells you the concrete next step (codes EVAL_TK_001-004, 999).

MCP (secondary surface)

The same functions are exposed as MCP tools (eval_calibrate, eval_bias) over stdio.

  • Claude Code: claude mcp add eval-tk -- python -m eval_tk.server
  • Claude Desktop / Cursor: add an stdio MCP server running python -m eval_tk.server.

Methodology, attribution, limitations

  • Cohen, J. (1960). A coefficient of agreement for nominal scales.
  • Landis, J.R. & Koch, G.G. (1977). The measurement of observer agreement for categorical data.
  • Judge-vs-human calibration framing follows Hamel Husain's and Shreya Shankar's eval practice.
  • v0.1 uses unweighted Cohen's κ. The bundled fixture is a 60-row subset of lmsys/mt_bench_human_judgments (Apache-2.0), reproducible via eval_tk/data/build_fixture.py (see eval_tk/data/PROVENANCE.md).
  • Not in v0.1: Fleiss' κ, plots/PNGs, anti-collusion verdicts. See docs/SPEC.md.

License

Apache-2.0.


AI-evaluation engineering portfolio — five repos, one discipline:

  • ai-eval-toolkit (you are here) — judge-vs-human calibration (Cohen's κ / Kendall-τ vs Landis–Koch bands)
  • agentic-eval-harness — eval-gated Claude Code phase boundaries with cross-vendor scorecards
  • ai-eval-atlas — practitioner + technique map, source-linked
  • ai-engineer-best-practices — handbook + score MCP tool (3-vendor judge ensemble)
  • learn-ai-eval — Claude-tutored learning engine for the eval canon

Profile: github.com/Mike-E-Log · website: mikeilog.com

About

Eval toolkit for LLM-as-judge calibration — Cohen's kappa, Kendall-tau, regression gates.
