Powerful, minimal framework for LLM prompt evaluation with YAML configuration, tool execution support, and comprehensive result tracking.
Most prompt testing tools are either too academic or too bloated.
RawBench is for devs who want:
- YAML-first, CLI-native minimal workflow
- Built-in tool-call mocking with recursive support
- Dynamic variables (functions, env, time, etc.)
- Multi-model testing with latency + cost metrics
- Zero setup, just run
rawbench init && rawbench run
Features

- Multi-model testing with simultaneous evaluation
- YAML configuration with Docker Compose-style anchors
- Variable substitution and template system
- Metrics for latency, tokens, and costs
- CLI and Python API interfaces
- Extensible tool mocking system
- Dynamic variable injection
- Beautiful HTML reports
- Local dashboard for interactive result viewing
Roadmap

- Assertions
- Response caching
- AI judge
- Prompt auto-finetuning
- More LLM providers
- ...
Setup
git clone https://github.com/0xsomesh/rawbench.git
cd rawbench
make install
# Initialize rawbench
rawbench init rawbench_tests
cd rawbench_tests

Enter the API keys for your inference providers in .env. RawBench uses LiteLLM to interact with the providers; see the LiteLLM documentation for the full list of providers supported by RawBench.
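For example, if you plan to test OpenAI and Anthropic models, the .env would contain the standard environment variable names that LiteLLM reads (a sketch; add an entry for whichever providers you actually use):

# .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...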
# Run evaluation
rawbench run tests/template.yaml --html -o template_result
# Start local dashboard server
rawbench serve --port 8000

RawBench now includes a local React dashboard for interactive result viewing:
- Interactive Results Viewer: Browse and analyze evaluation results with a modern web interface
- Real-time Updates: View results as they're generated
- Detailed Metrics: Explore latency, token usage, and cost breakdowns
- Test Case Analysis: Drill down into individual test cases and responses
- Model Comparison: Compare performance across different models side-by-side
To start the dashboard:
rawbench serve --port 8000

Then open your browser to http://localhost:8000 to access the dashboard.
RawBench uses YAML files for configuration. Here's a comprehensive guide to the configuration options:
id: evaluation-name
description: Optional description of the evaluation

models:
  - id: model-id
    provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 1024

prompts:
  - id: prompt-id
    system: |
      System prompt text here

tests:
  - id: test-id
    messages:
      - role: user
        content: Test message content

RawBench supports powerful tool mocking for testing agents that use function calling:
- Recursive: Handles multiple tool calls in sequence
- Priority Resolution: Test-specific mocks override global mocks
- Loop Prevention: max_iterations prevents infinite loops
- Clean: Simple YAML structure
tools:
  - id: search_tool
    name: search_tool
    description: Search for information
    parameters:
      type: object
      properties:
        query:
          type: string
          description: Search query
      required: [query]
    mock:
      output: '{"results": [{"title": "Example", "content": "Search result"}]}'

tests:
  - id: search-test
    tool_execution:
      mode: mock          # mock or actual
      max_iterations: 5   # Prevent infinite loops
      output:             # Test-specific mocks (overrides global)
        - id: search_tool
          output: '{"results": [{"title": "Custom", "content": "Custom result"}]}'
    messages:
      - role: user
        content: "Search for information about AI"

You can compare multiple models or different configurations of the same model:
models:
  - id: gpt4-conservative
    provider: openai
    name: gpt-4
    temperature: 0.2
  - id: gpt4-creative
    provider: openai
    name: gpt-4
    temperature: 0.8

You can compare multiple prompts:
prompts:
  - id: default_researcher
    system: |
      You are a helpful crypto research assistant.
  - id: default_teacher
    system: |
      You are a knowledgeable teacher.

RawBench supports dynamic variables in your prompts:
variables:
  - id: current_time
    function: current_datetime  # Loads from variables/current_datetime.py

prompts:
  - id: time_aware_prompt
    system: |
      Current time is {{current_time}}
      Please consider this timestamp in your responses.

Note: you'll have to create the file variables/current_datetime.py and define a current_datetime function in it that returns the string to inject.
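For example, variables/current_datetime.py could be as simple as the sketch below. Treat the exact calling convention as an assumption: a zero-argument current_datetime function whose string return value replaces {{current_time}}.

# variables/current_datetime.py
from datetime import datetime

def current_datetime() -> str:
    # Returned string is substituted wherever {{current_time}} appears in the prompt.
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")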
Examples

- Multi-Model Comparison
  - Location: examples/evaluations/multi-model-comparison.yaml
  - Compare responses from different models or configurations
  - Track performance metrics across models
- Complex Evaluation Criteria
  - Location: examples/evaluations/complex-criteria.yaml
  - Define sophisticated evaluation rules
  - Apply multiple test cases
- Variable Usage
  - Location: examples/evaluations/variable-usage.yaml
  - Inject dynamic content into prompts
  - Use environment variables and functions
- Tool Mocking
  - Location: examples/evaluations/tool-mock-example.yaml
  - Mock external tool calls
  - Test tool-using agents
- Recursive Tool Testing
  - Location: examples/evaluations/recursive-tool-test.yaml
  - Test agents that make multiple tool calls
  - Complex workflow testing
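Any of these can be run with the same command shown earlier, for example (the -o output name is arbitrary):

rawbench run examples/evaluations/multi-model-comparison.yaml --html -o multi_model_result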
Requirements
- Python ≥ 3.8

License
MIT



