
Related work: primary development for this problem space has converged on evalharness, which covers prompt, agent, and RAG-pipeline red-teaming, regression, and CI testing. This repo remains available, but check the canonical repo first for the latest tooling.

# EvalBench

LLM evaluation toolkit — BLEU, ROUGE, semantic similarity, and custom metrics for benchmarking AI outputs.

Why EvalBench

EvalBench exists to make this workflow practical: an LLM evaluation toolkit offering BLEU, ROUGE, semantic similarity, and custom metrics for benchmarking AI outputs. It favours a small, inspectable surface over sprawling configuration.
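Of the metrics named above, ROUGE-1 is the easiest to sketch from first principles. The following is a self-contained illustration of a unigram ROUGE-1 F1 score, not EvalBench's own implementation:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped matches: each reference token counts at most as often as it occurs.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat is on the mat")
```

With five of six unigrams shared in each direction, both precision and recall are 5/6, so the F1 is also 5/6.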

Features

  • CLI command evalbench
  • Core types TestCase, EvalResult, and EvalReport, exported from src/evalbench/core.py
  • Included test suite
  • Dedicated documentation folder

Tech Stack

  • Runtime: Python
  • Frameworks: Typer
  • Tooling: Rich, Pydantic

How It Works

The codebase is organised into docs/, src/, and tests/. The primary entry points are src/evalbench/core.py and src/evalbench/__init__.py; core.py exposes TestCase, EvalResult, and EvalReport, the core types that drive the behaviour.
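The README does not show the signatures of TestCase, EvalResult, and EvalReport, so the following is a hedged sketch of the shape such types typically take in an evaluation toolkit. These are stand-in dataclasses for illustration, not the actual exports from src/evalbench/core.py:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Hypothetical shape: one prompt plus the output we expect.
    prompt: str
    expected: str

@dataclass
class EvalResult:
    # Hypothetical shape: a single scored case.
    case: TestCase
    score: float

@dataclass
class EvalReport:
    # Hypothetical shape: an aggregate over many results.
    results: list = field(default_factory=list)

    def mean_score(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.score for r in self.results) / len(self.results)

report = EvalReport()
report.results.append(EvalResult(TestCase("2+2?", "4"), 1.0))
report.results.append(EvalResult(TestCase("Capital of France?", "Paris"), 0.5))
mean = report.mean_score()  # 0.75
```

The case/result/report split mirrors the usual pipeline: define inputs, score each one, then aggregate into a report.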

Getting Started

```shell
pip install -e .
evalbench --help
```

Usage

```shell
evalbench --help
```
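`evalbench --help` lists the available commands; beyond that, the README advertises semantic similarity and custom metrics. As an illustration of what such a metric can look like (plain Python, not EvalBench's API), here is a bag-of-words cosine similarity:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings.

    A crude stand-in for embedding-based semantic similarity: identical
    texts score 1.0, texts with no shared tokens score 0.0.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

sim = cosine_similarity("fast red fox", "fast red fox")  # 1.0
```

A real semantic-similarity metric would compare dense embeddings rather than token counts, but the scoring interface (two strings in, a 0-to-1 float out) is the same.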

Project Structure

```
EvalBench/
├── .env.example
├── CONTRIBUTING.md
├── Makefile
├── README.md
├── docs/
├── pyproject.toml
├── src/
├── tests/
```
