Related work: Primary development in this problem space has converged on evalharness, which covers prompt, agent, and RAG-pipeline red-teaming, regression, and CI testing. This repo remains available, but check that canonical repo first for the latest tooling.
# EvalBench

LLM evaluation toolkit — BLEU, ROUGE, semantic similarity, and custom metrics for benchmarking AI outputs.

EvalBench exists to make this workflow practical. It favours a small, inspectable surface over sprawling configuration.
- CLI command: `evalbench`
- `TestCase`, `EvalResult`, `EvalReport` — exported from `src/evalbench/core.py`
- Included test suite
- Dedicated documentation folder
- Runtime: Python
- Frameworks: Typer
- Tooling: Rich, Pydantic
The codebase is organised into `docs/`, `src/`, and `tests/`. The primary entry points are `src/evalbench/core.py` and `src/evalbench/__init__.py`; `src/evalbench/core.py` exposes `TestCase`, `EvalResult`, and `EvalReport` — the core types that drive the behaviour.
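To give a feel for how those three types fit together, here is a minimal sketch using stdlib dataclasses rather than the Pydantic models the package actually ships; every field and method name below is an assumption for illustration, not the real `core.py` API.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Hypothetical fields; the real TestCase in src/evalbench/core.py may differ.
    prompt: str
    expected: str

@dataclass
class EvalResult:
    # One model output scored against one test case.
    case: TestCase
    output: str
    scores: dict  # metric name -> score, e.g. {"bleu": 0.42}

@dataclass
class EvalReport:
    # Collects results and aggregates a metric across them (assumed helper).
    results: list = field(default_factory=list)

    def mean(self, metric: str) -> float:
        vals = [r.scores[metric] for r in self.results if metric in r.scores]
        return sum(vals) / len(vals) if vals else 0.0
```

In this sketch a report is just a list of scored results plus an aggregation helper, which mirrors the test-case / result / report split the exports suggest.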
```bash
pip install -e .
evalbench --help
```

Project layout:

```
EvalBench/
├── .env.example
├── CONTRIBUTING.md
├── Makefile
├── README.md
├── docs/
├── pyproject.toml
├── src/
├── tests/
```
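To illustrate the kind of custom metric the toolkit targets, here is a standalone sketch of a token-level F1 overlap score; `token_f1` is a hypothetical example metric, not part of the EvalBench API.

```python
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1 overlap between a candidate and a reference string."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A metric like this could populate the `scores` mapping of an evaluation result, alongside BLEU or ROUGE scores from a dedicated library.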