Comprehensive LLM evaluation framework with custom metrics, LLM-as-judge pattern, and detailed reporting. Evaluate language model outputs across quality, faithfulness, and reference-based dimensions.
```
Input Samples
      |
      v
[Eval Runner] ---> [Metrics] ---> Text Quality (coherence, conciseness, completeness)
      |                      ---> Reference (ROUGE, exact match, semantic similarity)
      |                      ---> Faithfulness (claim verification, groundedness)
      |
      +----------> [Judges] ----> Quality Judge (helpfulness, accuracy, harmlessness)
      |                     ----> Pairwise Judge (A/B comparison)
      |
      v
[Report Generator] ---> JSON | Markdown | Summary Table
```
```bash
# Install
pip install -e ".[dev]"

# Run demo (no API keys needed)
python -m src.cli demo

# Run evaluation on built-in dataset
python -m src.cli run --dataset qa --metrics coherence completeness faithfulness

# Run with LLM judge
python -m src.cli run --dataset qa --judge --format markdown

# Start API server
uvicorn src.api.app:app --reload
```

Features:

- 9 Built-in Metrics - Coherence, conciseness, completeness, exact match, ROUGE-1/2, semantic similarity, faithfulness, groundedness
- LLM-as-Judge - Quality judge with 4 dimensions + pairwise A/B comparison
- Mock Mode - Full framework runs without external API keys
- Built-in Datasets - QA, RAG, edge case datasets for testing
- Multiple Report Formats - JSON, Markdown, summary tables
- REST API - FastAPI endpoints for programmatic evaluation
- CLI Interface - Command-line tools for quick evaluations
- Custom JSONL Datasets - Load your own evaluation data
Text quality metrics:

| Metric | Description |
|---|---|
| coherence | Sentence structure, logical connectors, repetition |
| conciseness | Filler phrase penalty, hedging detection, length ratio |
| completeness | Topic coverage, structural completeness, depth |
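To make the heuristic flavor concrete, here is a minimal sketch of a conciseness-style scorer built from a filler-phrase penalty and a length ratio. It is illustrative only: the phrase list, weights, and capping are assumptions, not the logic in src/metrics/text_quality.py, and hedging detection is omitted.

```python
# Illustrative sketch only; the framework's conciseness metric in
# src/metrics/text_quality.py may use different phrases, weights, and scaling.
FILLER_PHRASES = ["in order to", "it should be noted that", "basically", "as a matter of fact"]

def conciseness_sketch(output: str, reference: str = "") -> float:
    """Score in [0, 1]: fewer filler phrases and a tighter length ratio score higher."""
    text = output.lower()
    filler_penalty = min(1.0, 0.2 * sum(text.count(p) for p in FILLER_PHRASES))

    length_penalty = 0.0
    if reference:  # penalize outputs much longer than the reference
        ratio = len(output.split()) / max(1, len(reference.split()))
        length_penalty = min(1.0, max(0.0, ratio - 1.0) * 0.5)

    return max(0.0, 1.0 - filler_penalty - length_penalty)

print(conciseness_sketch("AI is machine intelligence.", "AI is the study of machine intelligence."))
# 1.0 -- no filler phrases and shorter than the reference
```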
Reference-based metrics:

| Metric | Description |
|---|---|
| exact_match | Case-insensitive exact match |
| rouge_1 | Unigram overlap F1 |
| rouge_2 | Bigram overlap F1 |
| semantic_similarity | Content word Jaccard similarity |
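These scores reduce to simple token arithmetic. The sketch below shows ROUGE-N F1 and a Jaccard word-overlap proxy; it is not the code in src/metrics/reference.py, which may normalize differently (for instance, semantic_similarity filters to content words).

```python
# Illustrative token-level computations, not the framework's implementation.
from collections import Counter

def rouge_n_f1(output: str, reference: str, n: int = 1) -> float:
    """F1 over n-gram overlap between output and reference."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    out_counts, ref_counts = ngrams(output), ngrams(reference)
    overlap = sum((out_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

def jaccard_similarity(output: str, reference: str) -> float:
    """Jaccard overlap of lowercased word sets (no content-word filtering here)."""
    a, b = set(output.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(rouge_n_f1("the cat sat on the mat", "a cat sat on a mat", n=1))    # ~0.67
print(jaccard_similarity("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67
```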
Faithfulness metrics:

| Metric | Description |
|---|---|
| faithfulness | Claim-level verification against context |
| groundedness | Context term utilization in output |
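As a rough illustration of the groundedness idea, the score can be read as the fraction of the output's content words that appear in the retrieved context. This is a hypothetical proxy, not the implementation in src/metrics/faithfulness.py, and the faithfulness metric layers claim extraction and verification on top of it.

```python
# Hypothetical groundedness proxy; tokenization and stopword handling are assumptions.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def groundedness_sketch(output: str, context: list[str]) -> float:
    """Fraction of the output's content words that also occur in the context."""
    def tokens(text: str) -> list[str]:
        return re.findall(r"[a-z0-9']+", text.lower())

    context_words = {w for chunk in context for w in tokens(chunk)}
    content_words = [w for w in tokens(output) if w not in STOPWORDS]
    if not content_words:
        return 0.0
    return sum(w in context_words for w in content_words) / len(content_words)

print(groundedness_sketch("Paris is the capital of France.",
                          ["France's capital city is Paris."]))
# ~0.67 -- "paris" and "capital" are grounded; "france" has no exact token match
```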
LLM judges:

| Judge | Dimensions |
|---|---|
| quality_judge | Helpfulness, accuracy, harmlessness, instruction following |
| pairwise_judge | A/B comparison between two outputs |
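The judge prompts are not reproduced in this README. The sketch below shows what a rubric prompt over the four quality dimensions could look like; the actual prompt and output schema in src/judges/quality_judge.py will differ.

```python
# Hypothetical rubric prompt for an LLM-as-judge call; the wording and JSON
# output schema are assumptions, not the framework's actual prompt.
QUALITY_DIMENSIONS = ["helpfulness", "accuracy", "harmlessness", "instruction_following"]

def build_quality_prompt(user_input: str, model_output: str) -> str:
    dims = "\n".join(f"- {d}: rate 1 (poor) to 5 (excellent)" for d in QUALITY_DIMENSIONS)
    return (
        "You are an impartial judge. Rate the response on each dimension:\n"
        f"{dims}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Response:\n{model_output}\n\n"
        'Return JSON: {"helpfulness": int, "accuracy": int, '
        '"harmlessness": int, "instruction_following": int, "rationale": str}'
    )
```

A pairwise judge follows the same pattern but presents two candidate responses and asks which one better satisfies the rubric.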
REST API endpoints:

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check with available metrics/datasets |
| POST | /evaluate | Run evaluation |
| GET | /datasets | List available datasets |
| GET | /metrics | List available metrics |
Example request:

```bash
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "eval_name": "my-eval",
    "dataset": "qa",
    "metrics": ["coherence", "faithfulness"],
    "use_judge": true
  }'
```
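The same call from Python, assuming the server started in the quick start is listening on localhost:8000 and that the third-party requests package is installed (any HTTP client works):

```python
# Python equivalent of the curl example above.
import requests

resp = requests.post(
    "http://localhost:8000/evaluate",
    json={
        "eval_name": "my-eval",
        "dataset": "qa",
        "metrics": ["coherence", "faithfulness"],
        "use_judge": True,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```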
Project layout:

```
llm-eval-framework/
  src/
    metrics/
      base.py            # BaseMetric interface
      text_quality.py    # Coherence, conciseness, completeness
      reference.py       # ROUGE, exact match, semantic similarity
      faithfulness.py    # Faithfulness, groundedness
    judges/
      base.py            # BaseJudge interface
      quality_judge.py   # Quality judge + pairwise judge
    datasets/
      loader.py          # Built-in datasets + JSONL loader
    reports/
      formatter.py       # JSON, Markdown, summary table
    api/
      app.py             # FastAPI application
    config.py            # EvalConfig
    models.py            # EvalSample, MetricScore, EvalReport
    runner.py            # EvalRunner orchestrator
    cli.py               # CLI interface
  tests/
    unit/                # Per-component tests
    integration/         # Full pipeline + API tests
```
Programmatic usage:

```python
from src.config import EvalConfig
from src.runner import EvalRunner
from src.metrics.text_quality import CoherenceMetric
from src.judges.quality_judge import QualityJudge

config = EvalConfig(
    mode="mock",          # mock or production
    model_name="gpt-4",   # Model for judge
    max_samples=100,      # Limit evaluation size
    seed=42,              # Reproducibility
)

runner = (
    EvalRunner(config=config, pass_threshold=0.6)
    .add_metric(CoherenceMetric())
    .add_judge(QualityJudge(config))
)
```
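Because metrics are pluggable through add_metric, you can register your own scorer. The sketch below assumes BaseMetric (src/metrics/base.py) exposes a name attribute and a per-sample score method; those names are guesses for illustration, so check the actual interface before copying this.

```python
# Hypothetical custom metric. The attribute and method names used here
# (name, score) are assumptions and may not match the real BaseMetric interface.
from src.metrics.base import BaseMetric


class KeywordCoverageMetric(BaseMetric):
    """Fraction of required keywords that appear in the model output."""

    name = "keyword_coverage"  # assumed attribute

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    def score(self, sample) -> float:  # assumed signature
        output = sample.output.lower()
        hits = sum(1 for k in self.keywords if k in output)
        return hits / len(self.keywords) if self.keywords else 0.0


runner.add_metric(KeywordCoverageMetric(["neural", "training"]))
```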
{"input": "What is AI?", "output": "AI is...", "reference": "Expected...", "context": ["ctx1"]}Load it:
```python
from src.datasets.loader import load_jsonl

samples = load_jsonl("my_data.jsonl")
```
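To tie the pieces together, here is a hedged sketch of feeding those samples to the runner configured above; the run method name and the printed report are assumptions (see src/runner.py and src/reports/formatter.py for the real API):

```python
# Hypothetical end-to-end call; EvalRunner's actual method names are defined
# in src/runner.py and may differ from what is shown here.
report = runner.run(samples)  # assumed method name
print(report)                 # an EvalReport (see src/models.py)
```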
Development:

```bash
pip install -e ".[dev]"
pytest --cov=src --cov-report=term-missing
mypy src/
ruff check src/ tests/
```

License: MIT