Intereval: The Interactive LLM Evaluation Tool

Intereval is a command-line tool for evaluating and comparing Large Language Models (LLMs) available through Ollama. It offers both an interactive, guided interface and a non-interactive one, streamlining the process of testing prompts against different models and evaluating their responses.

Features

  • Interactive Mode: A user-friendly, guided experience for setting up evaluations.
  • Non-Interactive Mode: Run evaluations using command-line arguments for easy scripting and automation.
  • Configuration Files: Save and reuse evaluation setups in YAML format.
  • Two Evaluation Modes:
    • One-Prompt-Many-Models: Test a single prompt against multiple LLMs.
    • Many-Prompts-One-Model: Test multiple prompts against a single LLM.
  • Flexible Evaluation: Evaluate responses based on an expected "golden" response or a set of instructions (rubric).
  • Rich Output: Presents evaluation results in a clean, readable table format.
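To make the evaluation modes concrete, the one-prompt-many-models flow can be pictured as a simple loop over models. The sketch below is illustrative only, not Intereval's actual implementation: `query_model` and `judge` are hypothetical stand-ins for the two Ollama calls (one to generate a response, one to evaluate it against the rubric).

```python
# Illustrative sketch only -- not Intereval's actual implementation.
# `query_model` and `judge` are hypothetical stand-ins for the calls
# made to Ollama (one to generate, one to evaluate).
def run_one_prompt_many_models(prompt, models, query_model, judge):
    """Return one result row per model: its response plus a verdict."""
    results = []
    for model in models:
        response = query_model(model, prompt)
        results.append(
            {"model": model, "response": response, "verdict": judge(response)}
        )
    return results


# Usage with stubbed functions (no Ollama required):
rows = run_one_prompt_many_models(
    "What is the capital of France?",
    ["llama3", "qwen:7b"],
    query_model=lambda model, prompt: f"{model}: Paris",
    judge=lambda response: "Paris" in response,
)
```

The many-prompts-one-model mode is the same loop with the roles of prompt and model swapped.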

Getting Started

Prerequisites

  • Python 3.8+
  • Docker (optional, for containerized execution)
  • Ollama installed and running, with at least one model already downloaded.

Installation

  1. Clone the repository:

    git clone https://github.com/avedave/intereval.git
    cd intereval
  2. Create and activate a virtual environment:

    # On Windows, you may need to use 'python' instead of 'python3'
    python3 -m venv .venv
    source .venv/bin/activate
  3. Install the dependencies:

    pip install -r requirements.txt

Usage

Intereval can be run in three main ways: interactive mode, non-interactive mode, and from a configuration file.

Interactive Mode

To start the interactive session, run the following command:

python -m src.intereval.main

The tool will guide you through selecting the evaluation mode, providing prompts, choosing models, and setting up the evaluation criteria.

Non-Interactive Mode

For quick evaluations, you can use command-line arguments.

Example: One prompt against multiple models

python -m src.intereval.main \
  --mode one-prompt-many-models \
  --prompt "What is the capital of France?" \
  --models llama3 qwen:7b \
  --instructions "Is the answer Paris?" \
  --eval-model llama3

Using a Configuration File

You can also run an evaluation from a YAML configuration file.

  1. Create a config.yaml file (or let the interactive mode generate one for you in the config/ directory).
  2. Run the evaluation:
    python -m src.intereval.main --config /path/to/your/config.yaml
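For reference, a config for the non-interactive example above might look roughly like this. The field names here are illustrative guesses, not a confirmed schema; a file generated by interactive mode is the authoritative format.

```yaml
# Illustrative only -- field names are assumed, not confirmed.
# Generate a real file via interactive mode to see the exact schema.
mode: one-prompt-many-models
prompt: "What is the capital of France?"
models:
  - llama3
  - qwen:7b
instructions: "Is the answer Paris?"
eval_model: llama3
```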

Docker

You can build and run Intereval using Docker.

  1. Build the Docker image:
    docker build -t intereval .
  2. Run the Docker container:
    docker run -it --rm --network=host intereval
    Note: --network=host lets the container reach the Ollama service running on the host machine. This works on Linux; on Docker Desktop for macOS or Windows, you may need to point the tool at http://host.docker.internal:11434 instead.

Project Structure

/
├── config/              # Stores generated YAML configuration files
├── eval_prompts/        # Stores evaluation prompts/instructions
├── prompts/             # Stores user-defined prompts
├── results/             # Stores evaluation results in JSON format
└── src/
    └── intereval/
        ├── main.py      # Main application logic
        └── templates/   # Templates for evaluation prompts

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue.

License

This project is licensed under the MIT License. See the LICENSE file for details.

