
Commit cfe72da

Add E2E customer scenario tests and enhance documentation

- 8 E2E test scenarios: chatbot eval, CI/CD gate, multi-model comparison, version history, domain rubrics, ensemble judge, batch pipeline, prompt A/B
- Add data flow diagrams to README and architecture docs
- Add functional/non-functional goals tables
- Add "What This System Does Best" and "Limitations" sections

1 parent d0dd40b commit cfe72da

3 files changed

Lines changed: 868 additions & 8 deletions


README.md

Lines changed: 60 additions & 0 deletions
@@ -45,6 +45,66 @@ Built for engineering teams that ship LLM features and need to catch regressions

```
+-----------------+
```
### Data Flow

```
Rubric + Strategy                 Your Model
        |                             |
        v                             v
SyntheticGenerator ──> test inputs ──> inference ──> (input, output) pairs
                                      |
                                      v
                         LLMJudge / EnsembleJudge
                                      |
                                      v
                              list[JudgeScore]
                                      |
                                      v
                        EvalResult ──> DuckDB Storage
                                      |
                                      v
                     RegressionTracker.compare_versions()
                                      |
                          +-----------+-----------+
                          |                       |
                          v                       v
                 Console / Markdown      Streamlit Dashboard
```
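The flow above can be sketched in code. This is a minimal, self-contained sketch with stubbed components: the class names mirror the diagram, but the field names and signatures are assumptions, and the judge here is a stub rather than a real LLM call.

```python
from dataclasses import dataclass

# Stub data model mirroring the diagram (field names are assumptions).
@dataclass(frozen=True)
class JudgeScore:
    criterion: str
    score: float      # e.g. a 1-5 rubric scale
    reasoning: str

@dataclass(frozen=True)
class EvalResult:
    model_id: str
    version: str
    scores: list
    aggregate: float

def stub_judge(inp: str, out: str) -> list:
    """Stand-in for the judging step; a real judge calls an LLM per criterion."""
    return [JudgeScore("relevance", 4.0, "on-topic"),
            JudgeScore("accuracy", 5.0, "no factual errors")]

def run_pipeline(test_inputs, model, model_id, version) -> EvalResult:
    scores = []
    for inp in test_inputs:
        out = model(inp)                     # user-supplied inference (external step)
        scores.extend(stub_judge(inp, out))  # judging step
    aggregate = sum(s.score for s in scores) / len(scores)
    return EvalResult(model_id, version, scores, aggregate)

result = run_pipeline(["What is DuckDB?"],
                      lambda x: "An embedded OLAP database.",
                      "demo-model", "v1")
print(result.aggregate)  # 4.5
```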
### Functional Goals

| Goal | How It Works |
|------|-------------|
| **Evaluate LLM outputs** | Structured rubrics scored by LLM judges with per-criterion reasoning |
| **Detect regressions** | Automated version comparison with configurable thresholds |
| **Generate test data** | Synthetic inputs via 4 strategies (standard, adversarial, edge case, distribution) |
| **Visualize trends** | Interactive Streamlit dashboard backed by DuckDB analytics |
### Non-Functional Goals

| Goal | Design Decision |
|------|----------------|
| **Zero-config setup** | DuckDB embedded storage -- no database server required |
| **Reproducibility** | Immutable Pydantic models (`frozen=True`) + schema-versioned rubrics |
| **Provider independence** | Thin LLM abstraction -- swap OpenAI/Anthropic/Gemini without code changes |
| **CI/CD integration** | All operations scriptable, JSON output, non-zero exit on regression |
| **Cost efficiency** | ~$2.50 per 1,000 evaluations at GPT-4o pricing |
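The "thin LLM abstraction" row can be illustrated with a structural protocol. A sketch only: the `LLMProvider` name and `complete` method are illustrative assumptions, not evalkit's actual interface.

```python
from typing import Protocol

class LLMProvider(Protocol):
    """Any object with this shape can back the judge (name is illustrative)."""
    def complete(self, prompt: str) -> str: ...

class FakeProvider:
    # A test double; real adapters would wrap the OpenAI/Anthropic/Gemini SDKs.
    def complete(self, prompt: str) -> str:
        return '{"score": 4, "reasoning": "clear and accurate"}'

def judge_with(provider: LLMProvider, prompt: str) -> str:
    # Callers depend only on the protocol, so providers swap without code changes.
    return provider.complete(prompt)

print(judge_with(FakeProvider(), "Rate this answer 1-5, reply as JSON."))
```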
## What This System Does Best

- **Catches regressions before production.** Compare model versions on every CI run -- know within minutes if quality degraded.
- **Replaces expensive human annotation.** LLM-as-judge at ~$2.50/1,000 samples vs $50-125 for human annotators.
- **Produces explainable scores.** Per-criterion reasoning ("factual accuracy dropped on medical queries") instead of opaque BLEU numbers.
- **Composable subsystems.** Use just the judge, just the tracker, or just the generator -- no framework lock-in.
## Limitations

- **Judge quality depends on the judge model.** Blind spots in the judge model produce inflated scores. Mitigate with ensemble voting and golden-set calibration.
- **No concurrent writes.** DuckDB's single-writer model means parallel CI jobs need separate database files.
- **Cost scales linearly.** 10,000 samples × 3 judges = 30,000 API calls. Start with representative samples (~100-500).
- **No built-in inference.** evalkit evaluates outputs but does not run models -- intentionally framework-agnostic.
- **Single-machine ceiling.** DuckDB handles ~100M rows; beyond that, migrate to a columnar warehouse.
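The linear cost scaling is easy to estimate up front. A back-of-envelope helper using the ~$2.50 per 1,000 evaluations figure from above (the rate is this README's estimate, not a live price):

```python
COST_PER_EVAL = 2.50 / 1000  # ~$2.50 per 1,000 single-judge evaluations (README estimate)

def estimated_cost(samples: int, judges: int = 1) -> float:
    """API calls -- and therefore cost -- scale linearly with samples x judges."""
    return samples * judges * COST_PER_EVAL

print(estimated_cost(10_000, judges=3))  # 30,000 calls, roughly $75
print(estimated_cost(500))               # a pilot run stays cheap, roughly $1.25
```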
## Quick Start

### Install

docs/architecture.md

Lines changed: 113 additions & 8 deletions
@@ -61,15 +61,94 @@ evalkit.dashboard.app <-- Streamlit UI (depends on storage)
## Data Flow
### End-to-End Evaluation Pipeline

```
User defines Rubric + selects GenerationStrategy
      |
      v
SyntheticGenerator.generate(domain, count)
      |
      v
list[str] (test inputs)
      |
      +---> User runs their model on each input (external)
      |
      v
list[(input, output)] pairs
      |
      v
LLMJudge.evaluate(input, output) -- or EnsembleJudge for multi-judge
      |
      v
list[JudgeScore] per sample
      |
      v
EvalResult(model_id, version, scores, aggregate)
      |
      v
DuckDBStorage.store_results(results)
      |
      v
RegressionTracker.compare_versions("v1", "v2")
      |
      v
RegressionReport { deltas[], has_regression }
      |
      +---> RegressionReporter.to_console() / .to_markdown() / .to_json()
      |
      +---> Dashboard (Streamlit) reads from DuckDB
```
### Ensemble Judging Flow

```
Input + Output
      |
      +---> Judge A (e.g., GPT-4o) --> list[JudgeScore]
      |
      +---> Judge B (e.g., Claude) --> list[JudgeScore]
      |
      +---> Judge C (e.g., Gemini) --> list[JudgeScore]
      |
      v
EnsembleJudge._aggregate(all_scores, strategy)
      |
      +---> WEIGHTED_AVERAGE: sum(score * weight) / sum(weights)
      +---> MAJORITY: most common score wins
      +---> UNANIMOUS: min(scores) -- conservative
      |
      v
list[JudgeScore] (consensus scores with combined reasoning)
```
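The three aggregation strategies reduce to a few lines each. A sketch of what the per-criterion aggregation might look like -- the function names and the equal-weight default are assumptions, not the actual `_aggregate` implementation:

```python
from collections import Counter

def weighted_average(scores, weights=None):
    # WEIGHTED_AVERAGE: sum(score * weight) / sum(weights); equal weights by default.
    weights = weights or [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def majority(scores):
    # MAJORITY: the most common score wins.
    return Counter(scores).most_common(1)[0][0]

def unanimous(scores):
    # UNANIMOUS: conservative -- consensus is only as good as the harshest judge.
    return min(scores)

judges = [4.0, 5.0, 4.0]  # one criterion scored by three judges
print(weighted_average(judges))                      # ~4.33
print(weighted_average(judges, [0.5, 0.25, 0.25]))   # 4.25 -- first judge weighted double
print(majority(judges))                              # 4.0
print(unanimous(judges))                             # 4.0
```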
### Regression Detection Flow

```
DuckDBStorage
      |
      +---> query baseline  (model_id, v1) --> list[EvalResult]
      +---> query candidate (model_id, v2) --> list[EvalResult]
      |
      v
RegressionTracker._compute_deltas()
      |
      v
per-criterion delta = candidate_mean - baseline_mean
      |
      +---> if delta < -threshold --> REGRESSION DETECTED
      +---> if delta >= 0         --> IMPROVEMENT or STABLE
      |
      v
RegressionReport
      |
      v
OutputComparator.compare(baseline_output, candidate_output)
      |
      +---> EXACT:    strict equality
      +---> FUZZY:    normalized similarity ratio
      +---> SEMANTIC: embedding cosine similarity
```
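The delta and comparator logic above can be sketched with assumed names (`compute_deltas`, `fuzzy_similarity` are illustrative, not the real API). FUZZY is shown via stdlib `difflib`; SEMANTIC would need an embedding model, so it is omitted here.

```python
from difflib import SequenceMatcher

def compute_deltas(baseline: dict, candidate: dict, threshold: float = 0.5) -> dict:
    """Per-criterion delta = candidate_mean - baseline_mean, flagged against a threshold."""
    report = {}
    for criterion, base_mean in baseline.items():
        delta = candidate[criterion] - base_mean
        if delta < -threshold:
            status = "REGRESSION"
        elif delta > 0:
            status = "IMPROVEMENT"
        else:
            status = "STABLE"
        report[criterion] = (round(delta, 2), status)
    return report

def fuzzy_similarity(a: str, b: str) -> float:
    # Normalized similarity ratio in [0, 1], in the spirit of the FUZZY comparator.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

baseline  = {"accuracy": 4.2, "relevance": 4.0}
candidate = {"accuracy": 3.5, "relevance": 4.1}
print(compute_deltas(baseline, candidate))
# accuracy drops 0.7 past the 0.5 threshold -> REGRESSION; relevance -> IMPROVEMENT
```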

### Data Model Hierarchy

@@ -100,6 +179,32 @@ RegressionReport

5. **Schema-versioned**: Rubrics carry version strings. Database migrations are idempotent. Evaluation results are immutable once stored.
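The "immutable once stored" guarantee can be demonstrated with a frozen model -- shown here with stdlib dataclasses as a stand-in for the Pydantic `frozen=True` config the codebase uses; the `Rubric` fields are illustrative.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Rubric:
    name: str
    schema_version: str  # rubrics carry a version string

r = Rubric("helpfulness", "1.2.0")
try:
    r.schema_version = "2.0.0"   # any mutation after construction is rejected
except FrozenInstanceError:
    print("immutable")
```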
## What This System Does Best

1. **Catches regressions before production.** The regression tracker compares model versions on every CI run. Teams know within minutes if a prompt change, fine-tune, or model swap degraded quality -- before any user sees the output.

2. **Replaces expensive human annotation at scale.** LLM-as-judge evaluation costs ~$2.50 per 1,000 samples (GPT-4o at standard pricing). A human annotation team doing the same work costs 10-50x more and takes days instead of minutes. Ensemble voting with 3 judges still costs under $10 per 1,000 samples.

3. **Produces explainable, per-criterion scores.** Unlike BLEU/ROUGE which output a single number, evalkit returns structured rubric scores with reasoning. Engineers can see *why* a score dropped -- "factual accuracy degraded on medical queries" is actionable; "BLEU went from 0.43 to 0.41" is not.

4. **Zero-config local development.** DuckDB means no database server to install, no connection strings to manage, no Docker compose. Clone, install, run. The entire evaluation history lives in a single `.duckdb` file that can be committed, shared, or backed up.

5. **Composable subsystems.** Each piece works independently. Use just the judge engine for one-off evaluations. Use just the regression tracker with your own scoring. Use just the synthetic generator to build test sets. No framework lock-in.

## Limitations

1. **LLM judge quality depends on the judge model.** If the judge model has blind spots (e.g., poor math reasoning), it will give inflated scores on tasks it cannot evaluate well. Mitigation: use ensemble voting with diverse providers and calibrate against a human-labeled golden set.

2. **No concurrent write support.** DuckDB uses a single-writer model. Two CI jobs writing to the same database file simultaneously will fail. For teams with parallel CI pipelines, use separate database files per job and merge results, or migrate to PostgreSQL using the provided adapter interface.

3. **Cost scales linearly with sample count and judge count.** Evaluating 10,000 samples with a 3-judge ensemble requires 30,000 LLM API calls. There is no caching or deduplication of identical inputs across runs. Teams should start with small representative samples (~100-500) and scale up selectively.

4. **No built-in model inference.** evalkit evaluates outputs but does not run models. Users must implement their own inference loop and pass (input, output) pairs to the judge. This is intentional -- evalkit stays framework-agnostic -- but it means more integration code.

5. **Single-machine scale ceiling.** DuckDB handles ~100M rows on a single machine before query performance degrades. For teams generating millions of evaluation records per month, plan a migration to a columnar warehouse (BigQuery, Snowflake) using the storage adapter pattern.

6. **Rubric drift over time.** As products evolve, rubrics need manual updates. There is no automatic detection of criteria becoming stale or irrelevant. Teams should review rubrics quarterly alongside prompt and model changes.
## Key Decisions
- [ADR-001: LLM-as-Judge Architecture](adr/001-llm-as-judge-architecture.md)
