A lightweight Python framework for testing Large Language Model APIs. Send prompts, validate responses against customizable criteria, and generate detailed reports.
- Multiple Validation Types: Contains, regex, length limits, JSON validation, exact match, and custom validators
- OpenAI-Compatible: Works with OpenAI, Azure OpenAI, and any OpenAI-compatible API
- Flexible Configuration: Define tests in JSON files or programmatically
- Detailed Reporting: HTML reports with visual dashboard, plus console and JSON output
- Environment Support: Loads configuration from `.env` files

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with your API key:

  ```
  OPENAI_API_KEY=sk-your-key-here
  ```

- Run the example tests:

  ```bash
  python llm_tester.py test_cases.json
  ```

```bash
# Basic usage (generates report.html by default)
python llm_tester.py test_cases.json

# Specify model
python llm_tester.py test_cases.json --model gpt-4o

# Custom API endpoint
python llm_tester.py test_cases.json --base-url https://your-api.com/v1

# Custom output file
python llm_tester.py test_cases.json -o results.html

# JSON format instead of HTML
python llm_tester.py test_cases.json --format json -o report.json

# Verbose logging
python llm_tester.py test_cases.json -v
```

```python
from llm_tester import LLMClient, TestRunner, ReportGenerator

# Initialize with custom settings
client = LLMClient(
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
    model="gpt-4o"
)

# Run tests from file
runner = TestRunner(client)
report = runner.run_from_file("test_cases.json")

# Or run tests programmatically
test_cases = [
    {
        "name": "Greeting test",
        "prompt": "Say hello",
        "criteria": {"contains": ["hello"]}
    }
]
report = runner.run_tests(test_cases)

# Generate reports
ReportGenerator.to_console(report)
ReportGenerator.to_html(report, "report.html")  # HTML dashboard
ReportGenerator.to_json(report, "report.json")  # JSON export
```

Tests are defined in JSON with the following structure:
```json
{
  "test_cases": [
    {
      "name": "Test name",
      "prompt": "The prompt to send",
      "system_prompt": "Optional system prompt",
      "criteria": {
        "contains": ["required", "words"],
        "not_contains": ["forbidden"],
        "regex": "pattern.*match",
        "min_length": 10,
        "max_length": 500,
        "equals": "exact match",
        "json_valid": true
      }
    }
  ]
}
```

| Criteria | Description | Example |
|---|---|---|
| `contains` | Response must contain all of the strings (case-insensitive) | `["hello", "world"]` |
| `not_contains` | Response must not contain any of the strings | `["error", "fail"]` |
| `regex` | Response must match the regex pattern | `"\\d{3}-\\d{4}"` |
| `min_length` | Minimum response character count | `50` |
| `max_length` | Maximum response character count | `1000` |
| `equals` | Exact match (case-insensitive, trimmed) | `"yes"` |
| `json_valid` | Response must be valid JSON (handles markdown code blocks) | `true` |
| `custom` | Custom validator function (programmatic only) | `lambda r: (bool, reason)` |
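For the `custom` criterion, a validator is any callable that takes the response string and returns a `(passed, reason)` tuple. The sketch below is illustrative, not part of the framework: it combines the fence-stripping idea behind `json_valid` with a custom check, and the helper name `json_after_fences` is an assumption.

```python
import json
import re

def json_after_fences(response: str):
    """Illustrative custom validator: accept responses that parse as JSON,
    even when the model wraps them in a markdown code block."""
    # Strip an optional ```json ... ``` fence before parsing
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", response.strip())
    try:
        json.loads(text)
        return True, "valid JSON"
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"

# Attached to a programmatic test case via the "custom" criterion:
test_case = {
    "name": "JSON output test",
    "prompt": "Return an object with an 'ok' field as JSON",
    "criteria": {"custom": json_after_fences},
}

print(json_after_fences('```json\n{"ok": true}\n```'))  # (True, 'valid JSON')
```

Because the validator is a plain function, it can be unit-tested on its own before being handed to the test runner.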
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | API key for authentication | - |
| `LLM_API_KEY` | Alternative API key variable | - |
| `LLM_BASE_URL` | API base URL | `https://api.openai.com/v1` |
| `LLM_MODEL` | Model to use | `gpt-4o` |
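The resolution order implied by the table can be sketched as follows. This is a hypothetical helper, not the framework's actual code; the assumption that `OPENAI_API_KEY` takes precedence over the alternative `LLM_API_KEY` follows from the table's wording.

```python
import os

def resolve_config(env=os.environ):
    """Sketch of how the variables above might be resolved:
    the primary API key wins over the alternative, and base URL
    and model fall back to the documented defaults."""
    return {
        "api_key": env.get("OPENAI_API_KEY") or env.get("LLM_API_KEY"),
        "base_url": env.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("LLM_MODEL", "gpt-4o"),
    }

# With no variables set, only the documented defaults apply:
print(resolve_config(env={}))
```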
The default HTML report includes:
- Summary cards showing total tests, passed, failed, and pass rate
- Detailed results table with prompts, responses, and failure reasons
- Color-coded status indicators
- Responsive design
```
============================================================
TEST REPORT
============================================================
Start: 2025-01-25T10:30:00
End: 2025-01-25T10:30:15
Total: 8 | Passed: 7 | Failed: 1
Pass Rate: 87.5%
------------------------------------------------------------
FAILURES:
------------------------------------------------------------
[FAIL] Math calculation
  Prompt: What is 15 + 27? Reply with just the number....
  Response: The answer is 42....
    - Response doesn't match pattern: '^\d+$'
============================================================
```
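The failure above comes from the `regex` criterion: the anchored pattern `^\d+$` only accepts a string consisting entirely of digits, so any surrounding words cause a mismatch. This can be checked in plain Python:

```python
import re

pattern = r"^\d+$"

# The model's reply contains extra words, so the anchored pattern fails...
print(bool(re.search(pattern, "The answer is 42.")))  # False
# ...while a bare number passes.
print(bool(re.search(pattern, "42")))  # True
```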
MIT