LLM Eval Framework

Tests | Python 3.11+ | License: MIT

A comprehensive LLM evaluation framework with custom metrics, the LLM-as-judge pattern, and detailed reporting. Evaluate language model outputs across quality, faithfulness, and reference-based dimensions.

Architecture

Input Samples
    |
    v
[Eval Runner] ---> [Metrics] ---> Text Quality (coherence, conciseness, completeness)
    |          |              ---> Reference (ROUGE, exact match, semantic similarity)
    |          |              ---> Faithfulness (claim verification, groundedness)
    |          |
    |          +---> [Judges] ---> Quality Judge (helpfulness, accuracy, harmlessness)
    |                         ---> Pairwise Judge (A/B comparison)
    |
    v
[Report Generator] ---> JSON | Markdown | Summary Table

Quick Start

# Install
pip install -e ".[dev]"

# Run demo (no API keys needed)
python -m src.cli demo

# Run evaluation on built-in dataset
python -m src.cli run --dataset qa --metrics coherence completeness faithfulness

# Run with LLM judge
python -m src.cli run --dataset qa --judge --format markdown

# Start API server
uvicorn src.api.app:app --reload

Features

  • 9 Built-in Metrics - Coherence, conciseness, completeness, exact match, ROUGE-1/2, semantic similarity, faithfulness, groundedness
  • LLM-as-Judge - Quality judge with 4 dimensions + pairwise A/B comparison
  • Mock Mode - Full framework runs without external API keys
  • Built-in Datasets - QA, RAG, edge case datasets for testing
  • Multiple Report Formats - JSON, Markdown, summary tables
  • REST API - FastAPI endpoints for programmatic evaluation
  • CLI Interface - Command-line tools for quick evaluations
  • Custom JSONL Datasets - Load your own evaluation data

Metrics

Text Quality (No Reference Needed)

Metric        Description
coherence     Sentence structure, logical connectors, repetition
conciseness   Filler phrase penalty, hedging detection, length ratio
completeness  Topic coverage, structural completeness, depth
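
The text-quality metrics can also be called directly. A minimal sketch, assuming BaseMetric exposes an evaluate(sample) method and that EvalSample takes the same fields as the JSONL schema shown under Custom Datasets below; check src/metrics/base.py and src/models.py for the actual signatures:

from src.metrics.text_quality import CoherenceMetric
from src.models import EvalSample

# The field names and evaluate() are assumptions, not confirmed by this README.
sample = EvalSample(
    input="What is AI?",
    output="AI is the simulation of human intelligence by machines.",
)
score = CoherenceMetric().evaluate(sample)
print(score.value)  # MetricScore is defined in src/models.py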

Reference-Based

Metric               Description
exact_match          Case-insensitive exact match
rouge_1              Unigram overlap F1
rouge_2              Bigram overlap F1
semantic_similarity  Content word Jaccard similarity
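
For intuition, ROUGE-1 here means unigram-overlap F1 between output and reference. The snippet below is an illustrative, standalone computation, not the framework's own implementation in src/metrics/reference.py:

from collections import Counter

def rouge_1_f1(output: str, reference: str) -> float:
    # Count overlapping unigrams, then combine precision and recall into F1.
    out_counts = Counter(output.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((out_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

rouge_1_f1("AI is machine intelligence", "AI is the study of machine intelligence")  # ~0.73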

Faithfulness (Context-Based)

Metric        Description
faithfulness  Claim-level verification against context
groundedness  Context term utilization in output
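
As a rough sketch of what "context term utilization" can mean: the share of content words in the output that also appear in the supplied context. The framework's own scoring in src/metrics/faithfulness.py may weigh terms differently:

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "by"}

def groundedness_sketch(output: str, context: list[str]) -> float:
    # Fraction of output content words that occur anywhere in the context.
    context_terms = {word for chunk in context for word in chunk.lower().split()}
    content_words = [w for w in output.lower().split() if w not in STOPWORDS]
    if not content_words:
        return 0.0
    return sum(w in context_terms for w in content_words) / len(content_words)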

LLM-as-Judge

Judge           Dimensions
quality_judge   Helpfulness, accuracy, harmlessness, instruction following
pairwise_judge  A/B comparison between two outputs
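
Judges can also be invoked outside the runner. A hedged sketch, assuming the judge exposes a judge(sample) method and that EvalSample mirrors the JSONL fields; the real interfaces live in src/judges/base.py and src/judges/quality_judge.py:

from src.config import EvalConfig
from src.judges.quality_judge import QualityJudge
from src.models import EvalSample

config = EvalConfig(mode="mock")  # mock mode: no API key required
judge = QualityJudge(config)

# judge() and the EvalSample fields are assumptions, not confirmed by this README.
sample = EvalSample(input="What is AI?", output="AI is ...")
verdict = judge.judge(sample)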

API Endpoints

Method  Path       Description
GET     /health    Health check with available metrics/datasets
POST    /evaluate  Run evaluation
GET     /datasets  List available datasets
GET     /metrics   List available metrics

Example API Call

curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "eval_name": "my-eval",
    "dataset": "qa",
    "metrics": ["coherence", "faithfulness"],
    "use_judge": true
  }'
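
The same call from Python, using the third-party requests library (not a project dependency; any HTTP client works):

import requests

resp = requests.post(
    "http://localhost:8000/evaluate",
    json={
        "eval_name": "my-eval",
        "dataset": "qa",
        "metrics": ["coherence", "faithfulness"],
        "use_judge": True,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())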

Project Structure

llm-eval-framework/
  src/
    metrics/
      base.py              # BaseMetric interface
      text_quality.py       # Coherence, conciseness, completeness
      reference.py          # ROUGE, exact match, semantic similarity
      faithfulness.py       # Faithfulness, groundedness
    judges/
      base.py              # BaseJudge interface
      quality_judge.py      # Quality judge + pairwise judge
    datasets/
      loader.py            # Built-in datasets + JSONL loader
    reports/
      formatter.py         # JSON, Markdown, summary table
    api/
      app.py               # FastAPI application
    config.py              # EvalConfig
    models.py              # EvalSample, MetricScore, EvalReport
    runner.py              # EvalRunner orchestrator
    cli.py                 # CLI interface
  tests/
    unit/                  # Per-component tests
    integration/           # Full pipeline + API tests

Configuration

from src.config import EvalConfig
from src.runner import EvalRunner
from src.metrics.text_quality import CoherenceMetric
from src.judges.quality_judge import QualityJudge

config = EvalConfig(
    mode="mock",           # mock or production
    model_name="gpt-4",    # Model for judge
    max_samples=100,       # Limit evaluation size
    seed=42,               # Reproducibility
)

runner = (
    EvalRunner(config=config, pass_threshold=0.6)
    .add_metric(CoherenceMetric())
    .add_judge(QualityJudge(config))
)
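
From there, a run is samples in, report out. The run() call below is an assumption about EvalRunner's interface (see src/runner.py for the actual orchestration); the JSONL loader is described under Custom Datasets below:

from src.datasets.loader import load_jsonl

samples = load_jsonl("my_data.jsonl")  # see Custom Datasets below
report = runner.run(samples)           # hypothetical method name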

Custom Datasets

Create a JSONL file with one sample per line:

{"input": "What is AI?", "output": "AI is...", "reference": "Expected...", "context": ["ctx1"]}

Load it:

from src.datasets.loader import load_jsonl
samples = load_jsonl("my_data.jsonl")
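
If you generate evaluation data from code, one JSON object per line is all the format requires:

import json

rows = [
    {"input": "What is AI?", "output": "AI is ...", "reference": "Expected ...", "context": ["ctx1"]},
]
with open("my_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")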

Development

pip install -e ".[dev]"
pytest --cov=src --cov-report=term-missing
mypy src/
ruff check src/ tests/

Design Decisions

License

MIT
