| **CI/CD integration** | All operations scriptable, JSON output, non-zero exit on regression |
| **Cost efficiency** | ~$2.50 per 1,000 evaluations at GPT-4o pricing |
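As a sketch of how the CI/CD integration row above is meant to be wired in, a pipeline step can run an evaluation, read the JSON report, and propagate the non-zero exit so the job fails on regression. The subcommand, flags, and report fields below are illustrative assumptions, not the documented interface:

```python
# Hypothetical CI gate -- the evalkit invocation and JSON fields are placeholders;
# substitute the real command and report schema.
import json
import subprocess
import sys

proc = subprocess.run(
    ["evalkit", "compare", "--baseline", "prod", "--candidate", "HEAD", "--json"],
    capture_output=True,
    text=True,
)

# The tool exits non-zero when it detects a regression; surface the per-criterion
# details from the JSON report and fail the CI job.
if proc.returncode != 0:
    report = json.loads(proc.stdout)
    for criterion in report.get("regressions", []):
        print(f"regressed: {criterion['name']} ({criterion['delta']:+.3f})")
    sys.exit(proc.returncode)
```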
## What This System Does Best
- **Catches regressions before production.** Compare model versions on every CI run -- know within minutes if quality degraded.
- **Replaces expensive human annotation.** LLM-as-judge at ~$2.50/1,000 samples vs $50-125 per 1,000 for human annotators.
- **Produces explainable scores.** Per-criterion reasoning ("factual accuracy dropped on medical queries") instead of opaque BLEU numbers.
- **Composable subsystems.** Use just the judge, just the tracker, or just the generator -- no framework lock-in.
## Limitations
- **Judge quality depends on the judge model.** Blind spots in the judge model produce inflated scores. Mitigate with ensemble voting and golden-set calibration.
- **No concurrent writes.** DuckDB's single-writer model means parallel CI jobs need separate database files.
- **Cost scales linearly.** 10,000 samples × 3 judges = 30,000 API calls. Start with representative samples (~100-500).
- **No built-in inference.** evalkit evaluates outputs but does not run models -- intentionally framework-agnostic.
- **Single-machine ceiling.** DuckDB handles ~100M rows; beyond that, migrate to a columnar warehouse.
5. **Schema-versioned**: Rubrics carry version strings. Database migrations are idempotent. Evaluation results are immutable once stored.
## What This System Does Best
1. **Catches regressions before production.** The regression tracker compares model versions on every CI run. Teams know within minutes if a prompt change, fine-tune, or model swap degraded quality -- before any user sees the output.
2. **Replaces expensive human annotation at scale.** LLM-as-judge evaluation costs ~$2.50 per 1,000 samples (GPT-4o at standard pricing). A human annotation team doing the same work costs 10-50x more and takes days instead of minutes. Ensemble voting with 3 judges still costs under $10 per 1,000 samples.
3. **Produces explainable, per-criterion scores.** Unlike BLEU/ROUGE, which output a single number, evalkit returns structured rubric scores with reasoning. Engineers can see *why* a score dropped -- "factual accuracy degraded on medical queries" is actionable; "BLEU went from 0.43 to 0.41" is not.
4. **Zero-config local development.** DuckDB means no database server to install, no connection strings to manage, no Docker Compose. Clone, install, run. The entire evaluation history lives in a single `.duckdb` file that can be committed, shared, or backed up (a query sketch follows this list).
5. **Composable subsystems.** Each piece works independently. Use just the judge engine for one-off evaluations. Use just the regression tracker with your own scoring. Use just the synthetic generator to build test sets. No framework lock-in.
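Point 4 is easy to see in practice: the whole history is one file that the `duckdb` Python package can open directly. A minimal sketch, assuming a hypothetical `evaluation_results` table (the file and column names are illustrative, not evalkit's actual schema):

```python
import duckdb

# Open (or create) the single-file evaluation history -- no server, no connection string.
# "evals.duckdb" and the table/column names below are illustrative assumptions.
con = duckdb.connect("evals.duckdb")

# Average score per criterion for each model version in the history.
rows = con.execute("""
    SELECT model_version, criterion, AVG(score) AS avg_score
    FROM evaluation_results
    GROUP BY model_version, criterion
    ORDER BY model_version, criterion
""").fetchall()

for model_version, criterion, avg_score in rows:
    print(f"{model_version:>12}  {criterion:<24}  {avg_score:.3f}")

con.close()
```

The same snippet works locally, in CI, or against a `.duckdb` file pulled from a build artifact.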
## Limitations
1. **LLM judge quality depends on the judge model.** If the judge model has blind spots (e.g., poor math reasoning), it will give inflated scores on tasks it cannot evaluate well. Mitigation: use ensemble voting with diverse providers and calibrate against a human-labeled golden set.
2. **No concurrent write support.** DuckDB uses a single-writer model. Two CI jobs writing to the same database file simultaneously will fail. For teams with parallel CI pipelines, use separate database files per job and merge results (a merge sketch follows this list), or migrate to PostgreSQL using the provided adapter interface.
3. **Cost scales linearly with sample count and judge count.** Evaluating 10,000 samples with a 3-judge ensemble requires 30,000 LLM API calls. There is no caching or deduplication of identical inputs across runs. Teams should start with small representative samples (~100-500) and scale up selectively.
4. **No built-in model inference.** evalkit evaluates outputs but does not run models. Users must implement their own inference loop and pass (input, output) pairs to the judge. This is intentional -- evalkit stays framework-agnostic -- but it means more integration code.
5. **Single-machine scale ceiling.** DuckDB handles ~100M rows on a single machine before query performance degrades. For teams generating millions of evaluation records per month, plan a migration to a columnar warehouse (BigQuery, Snowflake) using the storage adapter pattern.
6. **Rubric drift over time.** As products evolve, rubrics need manual updates. There is no automatic detection of criteria becoming stale or irrelevant. Teams should review rubrics quarterly alongside prompt and model changes.
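For the workaround in limitation 2, per-job database files can be merged after a parallel CI run with DuckDB's `ATTACH`. A minimal sketch, assuming hypothetical file names and an `evaluation_results` table that already exists in the combined database:

```python
import glob
import duckdb

# Combine per-CI-job result files into one history file.
# Paths and the table name are illustrative assumptions.
con = duckdb.connect("evals.duckdb")

for i, path in enumerate(sorted(glob.glob("ci-artifacts/job-*.duckdb"))):
    con.execute(f"ATTACH '{path}' AS job{i} (READ_ONLY)")
    con.execute(f"INSERT INTO evaluation_results SELECT * FROM job{i}.evaluation_results")
    con.execute(f"DETACH job{i}")

con.close()
```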