A lightweight Python framework for testing Large Language Model APIs. Send prompts, validate responses against customizable criteria, and generate detailed reports.
- Multiple Validation Types: Contains, regex, length limits, JSON validation, exact match, and custom validators
- OpenAI-Compatible: Works with OpenAI, Azure OpenAI, and any OpenAI-compatible API
- Flexible Configuration: Define tests in JSON files or programmatically
- Detailed Reporting: HTML reports with visual dashboard, plus console and JSON output
- Environment Support: Loads configuration from `.env` files

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with your API key:

  ```
  OPENAI_API_KEY=sk-your-key-here
  ```

- Run the example tests:

  ```bash
  python llm_tester.py test_cases.json
  ```

```bash
# Basic usage (generates report.html by default)
python llm_tester.py test_cases.json

# Specify model
python llm_tester.py test_cases.json --model gpt-4o

# Custom API endpoint
python llm_tester.py test_cases.json --base-url https://your-api.com/v1

# Custom output file
python llm_tester.py test_cases.json -o results.html

# JSON format instead of HTML
python llm_tester.py test_cases.json --format json -o report.json

# Verbose logging
python llm_tester.py test_cases.json -v
```

```python
from llm_tester import LLMClient, TestRunner, ReportGenerator

# Initialize with custom settings
client = LLMClient(
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
    model="gpt-4o"
)

# Run tests from file
runner = TestRunner(client)
report = runner.run_from_file("test_cases.json")

# Or run tests programmatically
test_cases = [
    {
        "name": "Greeting test",
        "prompt": "Say hello",
        "criteria": {"contains": ["hello"]}
    }
]
report = runner.run_tests(test_cases)

# Generate reports
ReportGenerator.to_console(report)
ReportGenerator.to_html(report, "report.html")  # HTML dashboard
ReportGenerator.to_json(report, "report.json")  # JSON export
```

Tests are defined in JSON with the following structure:
```json
{
  "test_cases": [
    {
      "name": "Test name",
      "prompt": "The prompt to send",
      "system_prompt": "Optional system prompt",
      "criteria": {
        "contains": ["required", "words"],
        "not_contains": ["forbidden"],
        "regex": "pattern.*match",
        "min_length": 10,
        "max_length": 500,
        "equals": "exact match",
        "json_valid": true
      }
    }
  ]
}
```

| Criteria | Description | Example |
|---|---|---|
| `contains` | Response must contain all of the strings (case-insensitive) | `["hello", "world"]` |
| `not_contains` | Response must not contain any of the strings | `["error", "fail"]` |
| `regex` | Response must match the regex pattern | `"\\d{3}-\\d{4}"` |
| `min_length` | Minimum response character count | `50` |
| `max_length` | Maximum response character count | `1000` |
| `equals` | Exact match (case-insensitive, trimmed) | `"yes"` |
| `json_valid` | Response must be valid JSON (handles markdown code blocks) | `true` |
| `custom` | Custom validator function (programmatic only) | `lambda r: (bool, reason)` |
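For the `custom` criterion, a validator is any callable that takes the response string and returns a `(passed, reason)` tuple. The sketch below is illustrative, not part of the framework: it combines the fence-stripping idea behind `json_valid` with a custom check, and the helper name `json_after_fences` is an assumption.

```python
import json
import re

def json_after_fences(response: str):
    """Illustrative custom validator: accept responses that parse as JSON,
    even when the model wraps them in a markdown code block."""
    # Strip an optional ```json ... ``` fence before parsing
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", response.strip())
    try:
        json.loads(text)
        return True, "valid JSON"
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"

# Attached to a programmatic test case via the "custom" criterion:
test_case = {
    "name": "JSON output test",
    "prompt": "Return an object with an 'ok' field as JSON",
    "criteria": {"custom": json_after_fences},
}

print(json_after_fences('```json\n{"ok": true}\n```'))  # (True, 'valid JSON')
```

Because the validator is a plain function, it can be unit-tested on its own before being handed to the test runner.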
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | API key for authentication | - |
| `LLM_API_KEY` | Alternative API key variable | - |
| `LLM_BASE_URL` | API base URL | `https://api.openai.com/v1` |
| `LLM_MODEL` | Model to use | `gpt-4o` |
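The resolution order implied by the table can be sketched as follows. This is a hypothetical helper, not the framework's actual code; the assumption that `OPENAI_API_KEY` takes precedence over the alternative `LLM_API_KEY` follows from the table's wording.

```python
import os

def resolve_config(env=os.environ):
    """Sketch of how the variables above might be resolved:
    the primary API key wins over the alternative, and base URL
    and model fall back to the documented defaults."""
    return {
        "api_key": env.get("OPENAI_API_KEY") or env.get("LLM_API_KEY"),
        "base_url": env.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("LLM_MODEL", "gpt-4o"),
    }

# With no variables set, only the documented defaults apply:
print(resolve_config(env={}))
```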
The default HTML report includes:
- Summary cards showing total tests, passed, failed, and pass rate
- Detailed results table with prompts, responses, and failure reasons
- Color-coded status indicators
- Responsive design
```
============================================================
TEST REPORT
============================================================
Start: 2025-01-25T10:30:00
End: 2025-01-25T10:30:15
Total: 8 | Passed: 7 | Failed: 1
Pass Rate: 87.5%
------------------------------------------------------------
FAILURES:
------------------------------------------------------------
[FAIL] Math calculation
  Prompt: What is 15 + 27? Reply with just the number....
  Response: The answer is 42....
    - Response doesn't match pattern: '^\d+$'
============================================================
```
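The failure above comes from the `regex` criterion: the anchored pattern `^\d+$` only accepts a string consisting entirely of digits, so any surrounding words cause a mismatch. This can be checked in plain Python:

```python
import re

pattern = r"^\d+$"

# The model's reply contains extra words, so the anchored pattern fails...
print(bool(re.search(pattern, "The answer is 42.")))  # False
# ...while a bare number passes.
print(bool(re.search(pattern, "42")))  # True
```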
MIT