
Commit cfe72da

Add E2E customer scenario tests and enhance documentation

- 8 E2E test scenarios: chatbot eval, CI/CD gate, multi-model comparison, version history, domain rubrics, ensemble judge, batch pipeline, prompt A/B
- Add data flow diagrams to README and architecture docs
- Add functional/non-functional goals tables
- Add "What This System Does Best" and "Limitations" sections

1 parent d0dd40b commit cfe72da

3 files changed

Lines changed: 868 additions & 8 deletions


README.md

Lines changed: 60 additions & 0 deletions
@@ -45,6 +45,66 @@ Built for engineering teams that ship LLM features and need to catch regressions

```
+-----------------+
```
### Data Flow

```
Rubric + Strategy                 Your Model
        |                             |
        v                             v
SyntheticGenerator ──> test inputs ──> inference ──> (input, output) pairs
                                      |
                                      v
                         LLMJudge / EnsembleJudge
                                      |
                                      v
                              list[JudgeScore]
                                      |
                                      v
                        EvalResult ──> DuckDB Storage
                                      |
                                      v
                     RegressionTracker.compare_versions()
                                      |
                          +-----------+-----------+
                          |                       |
                          v                       v
                 Console / Markdown      Streamlit Dashboard
```
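The flow above can be sketched in code. This is a minimal, self-contained sketch with stubbed components: the class names mirror the diagram, but the field names and signatures are assumptions, and the judge here is a stub rather than a real LLM call.

```python
from dataclasses import dataclass

# Stub data model mirroring the diagram (field names are assumptions).
@dataclass(frozen=True)
class JudgeScore:
    criterion: str
    score: float      # e.g. a 1-5 rubric scale
    reasoning: str

@dataclass(frozen=True)
class EvalResult:
    model_id: str
    version: str
    scores: list
    aggregate: float

def stub_judge(inp: str, out: str) -> list:
    """Stand-in for the judging step; a real judge calls an LLM per criterion."""
    return [JudgeScore("relevance", 4.0, "on-topic"),
            JudgeScore("accuracy", 5.0, "no factual errors")]

def run_pipeline(test_inputs, model, model_id, version) -> EvalResult:
    scores = []
    for inp in test_inputs:
        out = model(inp)                     # user-supplied inference (external step)
        scores.extend(stub_judge(inp, out))  # judging step
    aggregate = sum(s.score for s in scores) / len(scores)
    return EvalResult(model_id, version, scores, aggregate)

result = run_pipeline(["What is DuckDB?"],
                      lambda x: "An embedded OLAP database.",
                      "demo-model", "v1")
print(result.aggregate)  # 4.5
```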
### Functional Goals

| Goal | How It Works |
|------|-------------|
| **Evaluate LLM outputs** | Structured rubrics scored by LLM judges with per-criterion reasoning |
| **Detect regressions** | Automated version comparison with configurable thresholds |
| **Generate test data** | Synthetic inputs via 4 strategies (standard, adversarial, edge case, distribution) |
| **Visualize trends** | Interactive Streamlit dashboard backed by DuckDB analytics |
### Non-Functional Goals

| Goal | Design Decision |
|------|----------------|
| **Zero-config setup** | DuckDB embedded storage -- no database server required |
| **Reproducibility** | Immutable Pydantic models (`frozen=True`) + schema-versioned rubrics |
| **Provider independence** | Thin LLM abstraction -- swap OpenAI/Anthropic/Gemini without code changes |
| **CI/CD integration** | All operations scriptable, JSON output, non-zero exit on regression |
| **Cost efficiency** | ~$2.50 per 1,000 evaluations at GPT-4o pricing |
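The "thin LLM abstraction" row can be illustrated with a structural protocol. A sketch only: the `LLMProvider` name and `complete` method are illustrative assumptions, not evalkit's actual interface.

```python
from typing import Protocol

class LLMProvider(Protocol):
    """Any object with this shape can back the judge (name is illustrative)."""
    def complete(self, prompt: str) -> str: ...

class FakeProvider:
    # A test double; real adapters would wrap the OpenAI/Anthropic/Gemini SDKs.
    def complete(self, prompt: str) -> str:
        return '{"score": 4, "reasoning": "clear and accurate"}'

def judge_with(provider: LLMProvider, prompt: str) -> str:
    # Callers depend only on the protocol, so providers swap without code changes.
    return provider.complete(prompt)

print(judge_with(FakeProvider(), "Rate this answer 1-5, reply as JSON."))
```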
## What This System Does Best

- **Catches regressions before production.** Compare model versions on every CI run -- know within minutes if quality degraded.
- **Replaces expensive human annotation.** LLM-as-judge at ~$2.50/1,000 samples vs $50-125 for human annotators.
- **Produces explainable scores.** Per-criterion reasoning ("factual accuracy dropped on medical queries") instead of opaque BLEU numbers.
- **Composable subsystems.** Use just the judge, just the tracker, or just the generator -- no framework lock-in.
## Limitations

- **Judge quality depends on the judge model.** Blind spots in the judge model produce inflated scores. Mitigate with ensemble voting and golden-set calibration.
- **No concurrent writes.** DuckDB's single-writer model means parallel CI jobs need separate database files.
- **Cost scales linearly.** 10,000 samples × 3 judges = 30,000 API calls. Start with representative samples (~100-500).
- **No built-in inference.** evalkit evaluates outputs but does not run models -- intentionally framework-agnostic.
- **Single-machine ceiling.** DuckDB handles ~100M rows; beyond that, migrate to a columnar warehouse.
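The linear cost scaling is easy to estimate up front. A back-of-envelope helper using the ~$2.50 per 1,000 evaluations figure from above (the rate is this README's estimate, not a live price):

```python
COST_PER_EVAL = 2.50 / 1000  # ~$2.50 per 1,000 single-judge evaluations (README estimate)

def estimated_cost(samples: int, judges: int = 1) -> float:
    """API calls -- and therefore cost -- scale linearly with samples x judges."""
    return samples * judges * COST_PER_EVAL

print(estimated_cost(10_000, judges=3))  # 30,000 calls, roughly $75
print(estimated_cost(500))               # a pilot run stays cheap, roughly $1.25
```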
## Quick Start

### Install

docs/architecture.md

Lines changed: 113 additions & 8 deletions
@@ -61,15 +61,94 @@ evalkit.dashboard.app <-- Streamlit UI (depends on storage)
## Data Flow
### End-to-End Evaluation Pipeline

```
User defines Rubric + selects GenerationStrategy
      |
      v
SyntheticGenerator.generate(domain, count)
      |
      v
list[str] (test inputs)
      |
      +---> User runs their model on each input (external)
      |
      v
list[(input, output)] pairs
      |
      v
LLMJudge.evaluate(input, output) -- or EnsembleJudge for multi-judge
      |
      v
list[JudgeScore] per sample
      |
      v
EvalResult(model_id, version, scores, aggregate)
      |
      v
DuckDBStorage.store_results(results)
      |
      v
RegressionTracker.compare_versions("v1", "v2")
      |
      v
RegressionReport { deltas[], has_regression }
      |
      +---> RegressionReporter.to_console() / .to_markdown() / .to_json()
      |
      +---> Dashboard (Streamlit) reads from DuckDB
```
### Ensemble Judging Flow

```
Input + Output
      |
      +---> Judge A (e.g., GPT-4o) --> list[JudgeScore]
      |
      +---> Judge B (e.g., Claude) --> list[JudgeScore]
      |
      +---> Judge C (e.g., Gemini) --> list[JudgeScore]
      |
      v
EnsembleJudge._aggregate(all_scores, strategy)
      |
      +---> WEIGHTED_AVERAGE: sum(score * weight) / sum(weights)
      +---> MAJORITY: most common score wins
      +---> UNANIMOUS: min(scores) -- conservative
      |
      v
list[JudgeScore] (consensus scores with combined reasoning)
```
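The three aggregation strategies reduce to a few lines each. A sketch of what the per-criterion aggregation might look like -- the function names and the equal-weight default are assumptions, not the actual `_aggregate` implementation:

```python
from collections import Counter

def weighted_average(scores, weights=None):
    # WEIGHTED_AVERAGE: sum(score * weight) / sum(weights); equal weights by default.
    weights = weights or [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def majority(scores):
    # MAJORITY: the most common score wins.
    return Counter(scores).most_common(1)[0][0]

def unanimous(scores):
    # UNANIMOUS: conservative -- consensus is only as good as the harshest judge.
    return min(scores)

judges = [4.0, 5.0, 4.0]  # one criterion scored by three judges
print(weighted_average(judges))                      # ~4.33
print(weighted_average(judges, [0.5, 0.25, 0.25]))   # 4.25 -- first judge weighted double
print(majority(judges))                              # 4.0
print(unanimous(judges))                             # 4.0
```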
### Regression Detection Flow

```
DuckDBStorage
      |
      +---> query baseline  (model_id, v1) --> list[EvalResult]
      +---> query candidate (model_id, v2) --> list[EvalResult]
      |
      v
RegressionTracker._compute_deltas()
      |
      v
per-criterion delta = candidate_mean - baseline_mean
      |
      +---> if delta < -threshold --> REGRESSION DETECTED
      +---> if delta >= 0         --> IMPROVEMENT or STABLE
      |
      v
RegressionReport
      |
      v
OutputComparator.compare(baseline_output, candidate_output)
      |
      +---> EXACT:    strict equality
      +---> FUZZY:    normalized similarity ratio
      +---> SEMANTIC: embedding cosine similarity
```
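The delta and comparator logic above can be sketched with assumed names (`compute_deltas`, `fuzzy_similarity` are illustrative, not the real API). FUZZY is shown via stdlib `difflib`; SEMANTIC would need an embedding model, so it is omitted here.

```python
from difflib import SequenceMatcher

def compute_deltas(baseline: dict, candidate: dict, threshold: float = 0.5) -> dict:
    """Per-criterion delta = candidate_mean - baseline_mean, flagged against a threshold."""
    report = {}
    for criterion, base_mean in baseline.items():
        delta = candidate[criterion] - base_mean
        if delta < -threshold:
            status = "REGRESSION"
        elif delta > 0:
            status = "IMPROVEMENT"
        else:
            status = "STABLE"
        report[criterion] = (round(delta, 2), status)
    return report

def fuzzy_similarity(a: str, b: str) -> float:
    # Normalized similarity ratio in [0, 1], in the spirit of the FUZZY comparator.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

baseline  = {"accuracy": 4.2, "relevance": 4.0}
candidate = {"accuracy": 3.5, "relevance": 4.1}
print(compute_deltas(baseline, candidate))
# accuracy drops 0.7 past the 0.5 threshold -> REGRESSION; relevance -> IMPROVEMENT
```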

### Data Model Hierarchy

@@ -100,6 +179,32 @@ RegressionReport

5. **Schema-versioned**: Rubrics carry version strings. Database migrations are idempotent. Evaluation results are immutable once stored.
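The "immutable once stored" guarantee can be demonstrated with a frozen model -- shown here with stdlib dataclasses as a stand-in for the Pydantic `frozen=True` config the codebase uses; the `Rubric` fields are illustrative.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Rubric:
    name: str
    schema_version: str  # rubrics carry a version string

r = Rubric("helpfulness", "1.2.0")
try:
    r.schema_version = "2.0.0"   # any mutation after construction is rejected
except FrozenInstanceError:
    print("immutable")
```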
## What This System Does Best

1. **Catches regressions before production.** The regression tracker compares model versions on every CI run. Teams know within minutes if a prompt change, fine-tune, or model swap degraded quality -- before any user sees the output.

2. **Replaces expensive human annotation at scale.** LLM-as-judge evaluation costs ~$2.50 per 1,000 samples (GPT-4o at standard pricing). A human annotation team doing the same work costs 10-50x more and takes days instead of minutes. Ensemble voting with 3 judges still costs under $10 per 1,000 samples.

3. **Produces explainable, per-criterion scores.** Unlike BLEU/ROUGE which output a single number, evalkit returns structured rubric scores with reasoning. Engineers can see *why* a score dropped -- "factual accuracy degraded on medical queries" is actionable; "BLEU went from 0.43 to 0.41" is not.

4. **Zero-config local development.** DuckDB means no database server to install, no connection strings to manage, no Docker compose. Clone, install, run. The entire evaluation history lives in a single `.duckdb` file that can be committed, shared, or backed up.

5. **Composable subsystems.** Each piece works independently. Use just the judge engine for one-off evaluations. Use just the regression tracker with your own scoring. Use just the synthetic generator to build test sets. No framework lock-in.

## Limitations

1. **LLM judge quality depends on the judge model.** If the judge model has blind spots (e.g., poor math reasoning), it will give inflated scores on tasks it cannot evaluate well. Mitigation: use ensemble voting with diverse providers and calibrate against a human-labeled golden set.

2. **No concurrent write support.** DuckDB uses a single-writer model. Two CI jobs writing to the same database file simultaneously will fail. For teams with parallel CI pipelines, use separate database files per job and merge results, or migrate to PostgreSQL using the provided adapter interface.

3. **Cost scales linearly with sample count and judge count.** Evaluating 10,000 samples with a 3-judge ensemble requires 30,000 LLM API calls. There is no caching or deduplication of identical inputs across runs. Teams should start with small representative samples (~100-500) and scale up selectively.

4. **No built-in model inference.** evalkit evaluates outputs but does not run models. Users must implement their own inference loop and pass (input, output) pairs to the judge. This is intentional -- evalkit stays framework-agnostic -- but it means more integration code.

5. **Single-machine scale ceiling.** DuckDB handles ~100M rows on a single machine before query performance degrades. For teams generating millions of evaluation records per month, plan a migration to a columnar warehouse (BigQuery, Snowflake) using the storage adapter pattern.

6. **Rubric drift over time.** As products evolve, rubrics need manual updates. There is no automatic detection of criteria becoming stale or irrelevant. Teams should review rubrics quarterly alongside prompt and model changes.
## Key Decisions
- [ADR-001: LLM-as-Judge Architecture](adr/001-llm-as-judge-architecture.md)
