Cohen's κ, Kendall-τ, and disclosed Landis–Koch bands — measures whether your
LLM judge agrees with human labels, the question most eval tooling skips.
$ python -m eval_tk calibrate --demo
metric: cohen_kappa
value: 0.688042
band: substantial (Landis-Koch 1977 (kappa))
n_items: 60
verdict: substantial agreement (cohen_kappa=0.688): judge is trustworthy enough to reduce human review; spot-check periodically for drift [operating cutoffs are this tool's recommendation, not an established standard]
note: no rows dropped
The bundled fixture is real MT-Bench data: 60 pairwise matchups that both humans and GPT-4 judged (lmsys/mt_bench_human_judgments, Apache-2.0). κ=0.688 is the genuine GPT-4-vs-human agreement on that sample.
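To make the `band` line in the demo output concrete, here is a minimal sketch of a Landis-Koch lookup. The function name `landis_koch_band` is hypothetical and is not taken from the tool. The band labels are the ones published in Landis & Koch (1977); treating each band's lower edge as inclusive (so 0.61 and up is "substantial") is an assumption about values that fall between the published two-decimal edges.

```python
def landis_koch_band(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch (1977) label.

    Hypothetical helper for illustration; not the tool's own code.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError(f"kappa must be in [-1, 1], got {kappa}")
    # (inclusive lower edge, label), checked from the top band down.
    bands = [
        (0.81, "almost perfect"),
        (0.61, "substantial"),
        (0.41, "moderate"),
        (0.21, "fair"),
        (0.00, "slight"),
    ]
    for lower, label in bands:
        if kappa >= lower:
            return label
    return "poor"  # negative kappa: agreement below chance


# Value taken from the demo output above.
print(landis_koch_band(0.688042))
```

For the demo value this returns `substantial`, matching the `band` field shown above.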
Most eval tooling answers "did the output match a reference?" or "which output did the model-judge prefer?" This answers a different question: does your automated judge actually agree with human labels — and by how much?
| Tool / approach | Core question it answers | Centers judge-vs-human calibration? |
|---|---|---|
| ai-eval-toolkit | Does my LLM judge agree with human labels? | Yes — Cohen's κ / Kendall-τ vs Landis-Koch bands, honest cutoffs |
| LLM-as-judge harnesses | Which output does the model-judge prefer? | No — judge-vs-judge |
| promptfoo, OpenAI evals | Did output match an assertion / reference? | No — output-vs-expected |
| RAGAS, DeepEval | How good is the output on metric X? | No — judge-vs-reference, not human agreement |
This is a single-axis comparison (judge-vs-human calibration), not a ranking of overall capability; these tools do much more. The point is that this is the one question they don't center.
pip install -r requirements.txt
python -m eval_tk calibrate --demo
python -m eval_tk calibrate --input your_labels.csv # columns: item_id,human_label,judge_label

The choices here are the point. Each is a deliberate call about not over-claiming:
- Cohen's κ for agreement above chance (Cohen 1960) — raw % agreement lies under label imbalance; κ corrects for chance.
- Landis & Koch (1977) bands, used as-is — an established κ-interpretation standard, so no invented threshold applies there.
- The drift verdict's action cutoffs (≥0.61 / ≥0.41) are disclosed as editorial, not dressed up as standard — every verdict carries an explicit caveat.
- Kendall-τ has no established band standard. τ is a rank correlation, not κ, so Landis-Koch's κ interpretation doesn't formally apply. The tool reuses the same κ cutoffs as a disclosed heuristic (the τ output's `band_basis` says exactly that), rather than inventing a second arbitrary scale or pretending a τ standard exists.
- Metrics via scipy / scikit-learn, never hand-rolled, so they are auditable against reference implementations.
- The `bias` command is an open question, not a verdict: experimental, threshold-free by design.
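The first bullet's claim, that raw percent agreement misleads under label imbalance, can be checked by hand. The sketch below computes unweighted κ = (p_o - p_e) / (1 - p_e) in plain Python purely as a worked illustration; per the list above, the tool itself relies on scipy / scikit-learn rather than code like this. The names `raw_agreement` and `cohen_kappa` and the toy labels are made up for this example.

```python
from collections import Counter


def raw_agreement(human, judge):
    """Fraction of items where the two raters gave the same label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human, judge):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    p_o = raw_agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own marginal rates.
    p_e = sum((h_counts[lab] / n) * (j_counts[lab] / n) for lab in h_counts)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa undefined when expected agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (not from the fixture): 90 "pass" and 10 "fail" human labels,
# and a degenerate judge that answers "pass" every time.
human = ["pass"] * 90 + ["fail"] * 10
judge = ["pass"] * 100

print(f"raw agreement: {raw_agreement(human, judge):.2f}")
print(f"cohen kappa:   {cohen_kappa(human, judge):.2f}")
```

On this toy data the judge agrees with humans on 90% of items yet earns κ = 0, because a rater who always says "pass" would reach 90% agreement by chance alone.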
python -m eval_tk calibrate --demo # bundled fixture
python -m eval_tk calibrate --input labels.csv # your data
python -m eval_tk calibrate --input labels.csv --metric kendall_tau --format json
python -m eval_tk bias --input judges.json # experimental, no threshold

Error responses are structured: `{error_code, message, fix_hint, docs_url}`. Every failure tells you the concrete next step (codes EVAL_TK_001-004, 999).
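A caller can branch on these fields programmatically. The sketch below assumes the error arrives as a JSON object with exactly the four keys listed above. The transport (stdout, stderr, or an MCP result), the sample payload, and the `format_error` helper are all assumptions for illustration, not documented behavior.

```python
import json

REQUIRED_KEYS = ("error_code", "message", "fix_hint", "docs_url")


def format_error(raw: str) -> str:
    """Turn a structured error payload into a one-line, actionable message."""
    payload = json.loads(raw)
    missing = [k for k in REQUIRED_KEYS if k not in payload]
    if missing:
        raise KeyError(f"error payload is missing keys: {missing}")
    return (
        f"[{payload['error_code']}] {payload['message']} "
        f"Fix: {payload['fix_hint']} (docs: {payload['docs_url']})"
    )


# Hypothetical payload: the code is in the documented range, but the
# message, hint, and URL are invented placeholders.
sample = json.dumps({
    "error_code": "EVAL_TK_001",
    "message": "placeholder message.",
    "fix_hint": "placeholder hint",
    "docs_url": "https://example.invalid/docs",
})

print(format_error(sample))
```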
The same functions are exposed as MCP tools (eval_calibrate, eval_bias) over stdio.
- Claude Code: `claude mcp add eval-tk -- python -m eval_tk.server`
- Claude Desktop / Cursor: add an stdio MCP server running `python -m eval_tk.server`.
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.
- Landis, J.R. & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
- Judge-vs-human calibration framing follows Hamel Husain's and Shreya Shankar's eval practice.
- v0.1 uses unweighted Cohen's κ. The bundled fixture is a 60-row subset of `lmsys/mt_bench_human_judgments` (Apache-2.0), reproducible via `eval_tk/data/build_fixture.py` (see `eval_tk/data/PROVENANCE.md`).
- Not in v0.1: Fleiss' κ, plots/PNGs, anti-collusion verdicts. See `docs/SPEC.md`.
Apache-2.0.
AI-evaluation engineering portfolio — five repos, one discipline:
- ai-eval-toolkit (you are here): judge-vs-human calibration (Cohen's κ / Kendall-τ vs Landis–Koch bands)
- agentic-eval-harness: eval-gated Claude Code phase boundaries with cross-vendor scorecards
- ai-eval-atlas: practitioner + technique map, source-linked
- ai-engineer-best-practices: handbook + `score` MCP tool (3-vendor judge ensemble)
- learn-ai-eval: Claude-tutored learning engine for the eval canon
Profile: github.com/Mike-E-Log · website: mikeilog.com