GitHub - Mike-E-Log/ai-eval-toolkit: Eval toolkit for LLM-as-judge calibration — Cohen's kappa, Kendall-tau, regression gates.

ai-eval-toolkit · judge-vs-human calibration

Cohen's κ, Kendall-τ, and disclosed Landis–Koch bands: this toolkit measures whether
your LLM judge agrees with human labels, the question most eval tooling skips.

$ python -m eval_tk calibrate --demo
metric: cohen_kappa
value:  0.688042
band:   substantial (Landis-Koch 1977 (kappa))
n_items: 60
verdict: substantial agreement (cohen_kappa=0.688): judge is trustworthy enough to reduce human review; spot-check periodically for drift [operating cutoffs are this tool's recommendation, not an established standard]
note:    no rows dropped

The bundled fixture is real MT-Bench data: 60 pairwise matchups that both humans and GPT-4 judged (lmsys/mt_bench_human_judgments, Apache-2.0). κ=0.688 is the genuine GPT-4-vs-human agreement on that sample.

How this differs

Most eval tooling answers "did the output match a reference?" or "which output did the model-judge prefer?" This answers a different question: does your automated judge actually agree with human labels — and by how much?

| Tool / approach | Core question it answers | Centers judge-vs-human calibration? |
| --- | --- | --- |
| ai-eval-toolkit | Does my LLM judge agree with human labels? | Yes: Cohen's κ / Kendall-τ vs Landis-Koch bands, honest cutoffs |
| LLM-as-judge harnesses | Which output does the model-judge prefer? | No: judge-vs-judge |
| promptfoo, OpenAI evals | Did output match an assertion / reference? | No: output-vs-expected |
| RAGAS, DeepEval | How good is the output on metric X? | No: judge-vs-reference, not human agreement |

This is a single-axis comparison (judge-vs-human calibration), not a ranking of overall capability; these tools do much more. The point is that judge-vs-human agreement is the one question they don't center.

Install & run (30 seconds)

pip install -r requirements.txt
python -m eval_tk calibrate --demo
python -m eval_tk calibrate --input your_labels.csv   # columns: item_id,human_label,judge_label
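
To sanity-check a label file before handing it to the CLI, the same statistic can be computed independently. The following is a minimal sketch, not the toolkit's own code: it assumes only the column names listed above and scikit-learn's `cohen_kappa_score`; the inline CSV, the label values, and the `kappa_from_csv` helper are made up for illustration.

```python
import csv
import io

from sklearn.metrics import cohen_kappa_score

# Hypothetical stand-in for your_labels.csv; only the header comes from the README.
CSV_TEXT = """item_id,human_label,judge_label
q1,A,A
q2,A,B
q3,B,B
q4,tie,tie
q5,B,B
q6,A,A
q7,tie,A
q8,B,B
"""


def kappa_from_csv(handle):
    """Read item_id,human_label,judge_label rows; return (unweighted kappa, row count)."""
    rows = list(csv.DictReader(handle))
    human = [row["human_label"] for row in rows]
    judge = [row["judge_label"] for row in rows]
    return cohen_kappa_score(human, judge), len(rows)


kappa, n_items = kappa_from_csv(io.StringIO(CSV_TEXT))
print(f"n_items={n_items} cohen_kappa={kappa:.3f}")
```

For a real file, replace the `io.StringIO` wrapper with `open("your_labels.csv", newline="")`. If the value printed here differs from the CLI's, the likeliest cause is row handling (for example, dropped rows), which the CLI reports in its `note:` line.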

Design decisions (why these metrics)

The choices here are the point — each is a deliberate call about not over-claiming:

Cohen's κ for agreement above chance (Cohen 1960) — raw % agreement lies under label imbalance; κ corrects for chance.
Landis & Koch (1977) bands, used as-is — an established κ-interpretation standard, so no invented threshold applies there.
The drift verdict's action cutoffs (≥0.61 / ≥0.41) are disclosed as editorial, not dressed up as standard — every verdict carries an explicit caveat.
Kendall-τ has no established band standard — τ is rank correlation, not κ, so Landis-Koch's κ interpretation doesn't formally apply. The tool reuses the same κ cutoffs as a disclosed heuristic (the τ output's band_basis says exactly that), rather than inventing a second arbitrary scale or pretending a τ standard exists.
Metrics via scipy / scikit-learn, never hand-rolled — auditable against reference impls.
The bias command is an open question, not a verdict — experimental, threshold-free by design.
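
The first two decisions can be made concrete with a small worked example. This is an illustrative sketch only: the toolkit computes κ through scikit-learn, whereas the version below spells out the arithmetic by hand so the chance correction is visible. The function names are hypothetical, and the band function assumes each Landis & Koch band starts at its published lower edge, which may not match how the tool treats values between the published edges.

```python
from collections import Counter


def raw_agreement(human, judge):
    """Fraction of items where the two raters gave the same label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human, judge):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e).

    Illustration only. Undefined (division by zero) when p_e == 1,
    i.e. both raters use one identical label for every item.
    """
    n = len(human)
    p_o = raw_agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label if each
    # labelled independently according to their own label frequencies.
    p_e = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def landis_koch_band(kappa):
    """Map kappa to a Landis & Koch (1977) label (lower-edge assumption)."""
    if kappa < 0:
        return "poor"
    for lower, name in [(0.81, "almost perfect"), (0.61, "substantial"),
                        (0.41, "moderate"), (0.21, "fair")]:
        if kappa >= lower:
            return name
    return "slight"


# 100 items, 95 labelled "pass" by humans; the judge answers "pass" every time.
human = ["pass"] * 95 + ["fail"] * 5
judge = ["pass"] * 100

k = cohen_kappa(human, judge)
print("raw agreement:", raw_agreement(human, judge))
print("kappa:", k, "band:", landis_koch_band(k))
```

Here the judge never discriminates between items, yet raw agreement is 0.95. Expected chance agreement is also 0.95, so κ is 0: the judge adds nothing beyond always guessing the majority label.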

CLI

python -m eval_tk calibrate --demo                       # bundled fixture
python -m eval_tk calibrate --input labels.csv           # your data
python -m eval_tk calibrate --input labels.csv --metric kendall_tau --format json
python -m eval_tk bias --input judges.json               # experimental, no threshold
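
For the `kendall_tau` metric, the underlying statistic can be reproduced with SciPy. A minimal sketch, assuming the human and judge labels are ordinal scores (the 1-5 values below are invented, and the README does not state which label types the CLI accepts for this metric):

```python
from scipy.stats import kendalltau

# Hypothetical 1-5 quality scores for the same eight items.
human_scores = [1, 2, 3, 4, 5, 3, 2, 4]
judge_scores = [1, 3, 3, 5, 4, 2, 2, 4]

# kendalltau defaults to the tau-b variant, which adjusts for tied scores.
tau, p_value = kendalltau(human_scores, judge_scores)
print(f"kendall_tau={tau:.3f} p_value={p_value:.4f}")
```

τ measures how consistently the two raters order items, not how often they assign identical labels, which is why the κ bands apply to it only as a disclosed heuristic.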

Error responses are structured: {error_code, message, fix_hint, docs_url} — every failure tells you the concrete next step (codes EVAL_TK_001-004, 999).
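
A caller that wraps the CLI could turn that structure into a typed object. The sketch below rests on assumptions: that the four fields arrive as a JSON object, and that all four are always present. The `EvalTkError` class, the `parse_error` helper, and the sample payload (including its message, hint, and URL) are hypothetical and not taken from the toolkit.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalTkError:
    """The four documented fields of a structured error response."""
    error_code: str
    message: str
    fix_hint: str
    docs_url: str


def parse_error(payload):
    """Parse a JSON error payload, rejecting objects that lack a documented field."""
    data = json.loads(payload)
    expected = [f.name for f in fields(EvalTkError)]
    missing = [name for name in expected if name not in data]
    if missing:
        raise ValueError(f"not a structured eval_tk error, missing: {missing}")
    return EvalTkError(**{name: data[name] for name in expected})


# Made-up payload for illustration; real messages and URLs will differ.
sample = json.dumps({
    "error_code": "EVAL_TK_001",
    "message": "placeholder message",
    "fix_hint": "placeholder hint",
    "docs_url": "https://example.invalid/docs",
})

err = parse_error(sample)
print(f"[{err.error_code}] {err.message} -> {err.fix_hint}")
```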

MCP (secondary surface)

The same functions are exposed as MCP tools (eval_calibrate, eval_bias) over stdio.

Claude Code: claude mcp add eval-tk -- python -m eval_tk.server
Claude Desktop / Cursor: add an stdio MCP server running python -m eval_tk.server.
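
For clients configured through a JSON file, the entry usually takes the `mcpServers` shape below. This is a sketch under assumptions: the server name `eval-tk` mirrors the Claude Code command above, the file location depends on the client and platform, and the exact schema should be confirmed against each client's current documentation.

```python
import json

# Assumed stdio server entry; "eval-tk" is just the name used in the command above.
config = {
    "mcpServers": {
        "eval-tk": {
            "command": "python",
            "args": ["-m", "eval_tk.server"],
        }
    }
}

config_text = json.dumps(config, indent=2)
print(config_text)  # merge this into the client's MCP config file
```

The `python` on the client's PATH must be the interpreter where the toolkit's requirements are installed; otherwise point `command` at that interpreter's full path.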

Methodology, attribution, limitations

Cohen, J. (1960). A coefficient of agreement for nominal scales.
Landis, J.R. & Koch, G.G. (1977). The measurement of observer agreement for categorical data.
Judge-vs-human calibration framing follows Hamel Husain's and Shreya Shankar's eval practice.
v0.1 uses unweighted Cohen's κ. The bundled fixture is a 60-row subset of lmsys/mt_bench_human_judgments (Apache-2.0), reproducible via eval_tk/data/build_fixture.py (see eval_tk/data/PROVENANCE.md).
Not in v0.1: Fleiss' κ, plots/PNGs, anti-collusion verdicts. See docs/SPEC.md.

License

Apache-2.0.

AI-evaluation engineering portfolio — five repos, one discipline:

ai-eval-toolkit (you are here) — judge-vs-human calibration (Cohen's κ / Kendall-τ vs Landis–Koch bands)
agentic-eval-harness — eval-gated Claude Code phase boundaries with cross-vendor scorecards
ai-eval-atlas — practitioner + technique map, source-linked
ai-engineer-best-practices — handbook + score MCP tool (3-vendor judge ensemble)
learn-ai-eval — Claude-tutored learning engine for the eval canon

Profile: github.com/Mike-E-Log · website: mikeilog.com
