This comprehensive guide is designed for developers who want to contribute to or modify the Eval Protocol codebase. It covers all key aspects of development, from environment setup to testing and contributing new reward functions.
We are committed to fostering an open and welcoming environment. All contributors are expected to adhere to our Code of Conduct.
# Clone the repository
git clone https://github.com/fireworks-ai/eval-protocol.git
cd eval-protocol
# Set up environment with uv
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]" # Includes development dependencies
# Run tests
uv run pytest
# Type check and lint
make pre-commit

- Clone the repository:
git clone https://github.com/fireworks-ai/eval-protocol.git
cd eval-protocol

- Create and activate a virtual environment with uv:
uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate

  Important for LLMs and automated scripts: After activating the virtual environment, always explicitly use executables from the .venv/bin/ directory (e.g., .venv/bin/pip, .venv/bin/pytest, .venv/bin/python). This ensures commands run within the isolated environment, even if the shell's PATH isn't immediately updated or if context is lost between commands.
- Install the package in development mode: Use uv pip from the virtual environment:

  uv pip install -e .          # Basic installation
  uv pip install -e ".[dev]"   # With development dependencies
For a streamlined local development experience, especially when managing multiple environment variables, Eval Protocol utilizes a .env.dev file in the root of the project. This file is used to load environment variables automatically when running the application locally.
Setup:
- Create the .env.dev file: Copy the example environment file to create your local development configuration:

  cp .env.example .env.dev

- Populate .env.dev: Open .env.dev and fill in the necessary environment variables, such as FIREWORKS_API_KEY and any other variables required for your development tasks (e.g., E2B_API_KEY). Example content for .env.dev:

  FIREWORKS_API_KEY="your_dev_fireworks_api_key"
  FIREWORKS_API_BASE="https://api.fireworks.ai"
  E2B_API_KEY="your_e2b_api_key"

Important:

- The .env.dev file should not be committed to version control. It is already listed in the .gitignore file.
- Variables set directly in your shell environment will take precedence over those defined in .env.dev if python-dotenv is configured to not override existing variables (which is the default behavior).
This file simplifies managing development-specific settings without needing to export them in every terminal session.
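
For reference, here is a minimal sketch of how such a file can be loaded with python-dotenv; the exact point where Eval Protocol loads it may differ, and the variable names simply mirror the example above:

import os

from dotenv import load_dotenv  # provided by the python-dotenv package

# Load variables from .env.dev; by default, existing shell variables are NOT overridden.
load_dotenv(".env.dev")

# Confirm the key is visible without printing the full secret.
api_key = os.environ.get("FIREWORKS_API_KEY", "")
print("FIREWORKS_API_KEY set:", bool(api_key))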
For development and testing interactions with the Fireworks AI platform, you need to configure your Fireworks AI credentials. Eval Protocol supports two methods:
A. Environment Variables (Highest Priority)
Set the following environment variables. For development, you might use specific development keys or a dedicated development account:
- FIREWORKS_API_KEY: Your Fireworks AI API key. For development, you might use a specific dev key:

  export FIREWORKS_API_KEY="your_dev_fireworks_api_key"

- FIREWORKS_API_BASE: (Optional) If you need to target a non-production Fireworks API endpoint. For development:

  export FIREWORKS_API_BASE="https://dev.api.fireworks.ai"
Example for a typical development setup:
export FIREWORKS_API_KEY="your_development_api_key"
export FIREWORKS_API_BASE="https://dev.api.fireworks.ai"  # If targeting dev API

B. Configuration File (Lower Priority)
Eval Protocol does not read ~/.fireworks/auth.ini (or any firectl profiles). Use environment variables instead.
Credential Sourcing Order: Eval Protocol prioritizes credentials as follows:
- Environment Variables (FIREWORKS_API_KEY)
Purpose of Credentials:
- FIREWORKS_API_KEY: Authenticates your requests to the Fireworks AI service.
- FIREWORKS_API_BASE: Allows targeting different API environments (e.g., development, staging).
Other Environment Variables:
- E2B_API_KEY: (Optional) If you are working on or testing features involving E2B code execution:

  export E2B_API_KEY="your_e2b_api_key"
eval-protocol/
├── eval_protocol/ # Main package source code
│ ├── __init__.py # Package initialization
│ ├── reward_function.py # Core reward function decorator
│ ├── models.py # Data models and types
│ ├── typed_interface.py # Type interfaces for reward functions
│ ├── evaluation.py # Evaluation pipeline
│ ├── auth.py # Authentication utilities
│ ├── cli.py # Command line interface
│ ├── rewards/ # Out-of-the-box reward functions
│ │ ├── __init__.py # Reward functions registry
│ │ ├── code_execution.py # Code execution rewards
│ │ ├── function_calling.py # Function calling rewards
│ │ ├── json_schema.py # JSON schema validation
│ │ ├── math.py # Math evaluation
│ │ ├── format.py # Format validation
│ │ ├── tag_count.py # Tag counting
│ │ ├── accuracy.py # Accuracy evaluation
│ │ ├── language_consistency.py # Language consistency
│ │ ├── reasoning_steps.py # Reasoning steps evaluation
│ │ ├── length.py # Response length evaluation
│ │ ├── repetition.py # Repetition detection
│ │ ├── cpp_code.py # C/C++ code evaluation
│ │ └── accuracy_length.py # Combined accuracy and length evaluation
├── examples/ # Example code and tutorials
│ ├── metrics/ # Example metric implementations
│ ├── samples/ # Sample data for evaluation
│ └── ... # Example scripts
├── tests/ # Unit and integration tests
├── docs/ # Documentation
└── ... # Project configuration files
- Create a new module in eval_protocol/rewards/ if needed
- Implement your reward function using the @reward_function decorator
- Update eval_protocol/rewards/__init__.py to expose your function
- Add unit tests in the tests/ directory
Example structure:
from typing import Dict, List, Any, Union, Optional
from ..typed_interface import reward_function
from ..models import Message, EvaluateResult, MetricResult
@reward_function
def my_reward_function(
    messages: Union[List[Dict[str, Any]], List[Message]],
    ground_truth: Optional[str] = None,
    **kwargs: Any
) -> EvaluateResult:
    """
    Evaluate responses based on custom criteria.

    Args:
        messages: List of conversation messages
        ground_truth: Expected correct answer
        **kwargs: Additional arguments

    Returns:
        EvaluateResult with evaluation score and metrics
    """
    # Your evaluation logic here
    # score = ...
    # reason = "..."
    # metric_score = ...
    # metric_success = ... (e.g., True if a condition is met)
    # metric_reason = "..."

    # For demonstration:
    score = 0.75
    reason = "The response met most criteria."
    is_score_valid = True  # Example: the overall score is considered valid
    metric_score = 0.8
    metric_success = True  # Example: metric condition was met
    metric_reason = "Specific aspect evaluated positively."

    return EvaluateResult(
        score=score,
        reason=reason,
        is_score_valid=is_score_valid,
        metrics={
            "metric_name": MetricResult(
                score=metric_score,
                is_score_valid=metric_success,
                reason=metric_reason
            )
        }
    )
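
As a rough usage sketch (hypothetical values; depending on how @reward_function wraps the return value, the result may be an EvaluateResult instance or a plain dict):

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"},
]

result = my_reward_function(messages=messages, ground_truth="4")
print(result)  # Overall score plus per-metric details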
To maintain code quality and consistency, please adhere to the following standards:

- Formatting:
  - Use black for code formatting. The maximum line length is 88 characters.
  - Use isort for organizing imports.
- Linting:
  - Adhere to flake8 guidelines.
- Type Hinting:
  - Use type hints for all function parameters, return values, and variables where appropriate.
  - Run mypy eval_protocol to check for type errors.
- Naming Conventions:
  - snake_case for functions, methods, and variables.
  - PascalCase for classes and dataclasses.
  - UPPER_SNAKE_CASE for constants.
- Imports:
  - Group imports in the following order:
    - Standard library imports
    - Third-party library imports
    - Local application/library specific imports (e.g., from ..module import something)
  - Separate each group with a blank line.
- Docstrings:
  - Write clear and concise docstrings for all public modules, classes, functions, and methods. Follow PEP 257 conventions.
  - For functions, explain the arguments, what the function does, and what it returns.
- Error Handling:
  - Use specific exception types rather than generic Exception.
  - Provide meaningful error messages.
- Function Design:
  - Keep functions and methods short and focused on a single responsibility.
  - Aim for readability and maintainability.
- Testing:
  - Write unit tests for all new public functions and significant private logic.
  - Ensure tests cover a variety of cases, including edge cases and expected failures.
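
A short, hypothetical helper illustrating several of these conventions together (type hints, naming, a PEP 257 docstring, and a specific exception):

from typing import List

MAX_SCORE: float = 1.0  # Constants use UPPER_SNAKE_CASE


def average_score(scores: List[float]) -> float:
    """Return the arithmetic mean of a non-empty list of scores.

    Args:
        scores: Individual scores, each expected to fall between 0.0 and MAX_SCORE.

    Returns:
        The mean of the provided scores.

    Raises:
        ValueError: If the list is empty.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)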
To help enforce coding standards and catch issues early, we use pre-commit hooks. These hooks run automatically before each commit to check your code for issues like formatting, linting errors, and type errors.
Installation and Setup:
- Install pre-commit: If you installed development dependencies with uv pip install -e ".[dev]", pre-commit should already be installed. If not, you can install it via uv:

  uv pip install pre-commit

- Install the git hooks: Navigate to the root of the repository and run:

  pre-commit install

  This will set up the pre-commit hooks to run automatically when you git commit.
Usage:
- Once installed, pre-commit hooks will run on any changed files before you commit. If any hook fails, the commit will be aborted. You'll need to fix the reported issues and then git add the files again before attempting to commit.
- Some hooks (like black and isort) may automatically fix issues. If they do, you'll still need to git add the modified files.
- You can also run the hooks manually on all files at any time:

  pre-commit run --all-files

  This is useful for checking the entire codebase or after pulling new changes.
By using pre-commit hooks, we can ensure a consistent code style and catch many common errors before they even reach the CI pipeline, saving time and effort.
Use uv to run tests with pytest:
# Run all tests
uv run pytest tests/
# Run specific test file
uv run pytest tests/test_evaluation.py
# Run specific test function
uv run pytest tests/test_file.py::test_function
# Run with coverage report
uv run pytest --cov=eval_protocol

For now, focus on the tests/ and examples/ folders, since much of the related code lives in other repositories.
Create test files in the tests/ directory following this pattern:
import unittest
# Import the function you're testing
from eval_protocol.rewards.your_module import your_function
class TestYourFunction(unittest.TestCase):
"""Test your reward function."""
def test_basic_functionality(self):
"""Test basic functionality."""
messages = [
{"role": "user", "content": "Test question"},
{"role": "assistant", "content": "Test response"}
]
result = your_function(messages=messages)
# Assert expectations
self.assertIsNotNone(result)
self.assertIsInstance(result, dict)
self.assertIn("score", result)
self.assertGreaterEqual(result["score"], 0.0)
        self.assertLessEqual(result["score"], 1.0)

Use uv to run code quality tools:
# Type checking
uv run mypy eval_protocol
# Linting
uv run flake8 eval_protocol
# Format code
uv run black eval_protocol

Eval Protocol includes these out-of-the-box reward functions:
| Category | Reward Functions |
|---|---|
| Format | format_reward |
| Tag Count | tag_count_reward |
| Accuracy | accuracy_reward |
| Language | language_consistency_reward |
| Reasoning | reasoning_steps_reward |
| Length | length_reward, cosine_length_reward |
| Repetition | repetition_penalty_reward |
| Code Execution | binary_code_reward, fractional_code_reward |
| C/C++ Code | ioi_cpp_code_reward, binary_cpp_code_reward |
| Combined | cosine_scaled_accuracy_length_reward |
| Function Calling | schema_jaccard_reward, llm_judge_reward, composite_function_call_reward |
| JSON Schema | json_schema_reward |
| Math | math_reward |
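
The built-in functions are exposed through the rewards registry (eval_protocol/rewards/__init__.py). As a hypothetical sketch of invoking one directly (check the module for each function's exact name and keyword arguments):

from eval_protocol.rewards import format_reward

messages = [
    {"role": "user", "content": "Answer with your reasoning and final answer."},
    {"role": "assistant", "content": "<think>Adding the numbers...</think><answer>42</answer>"},
]

result = format_reward(messages=messages)
print(result)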
The examples folder contains sample code for using Eval Protocol:
# Run evaluation preview example
FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY \
FIREWORKS_API_BASE=https://dev.api.fireworks.ai \
uv run python examples/evaluation_preview_example.py
# Run deployment example
FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY \
FIREWORKS_API_BASE=https://dev.api.fireworks.ai \
uv run python examples/deploy_example.py

Several example scripts, particularly those involving local evaluations (local_eval.py) and TRL integration (trl_grpo_integration.py) within directories like examples/math_example/, examples/math_example_openr1/, and examples/tool_calling_example/, have been refactored to use Hydra for configuration management.
How to Run:
- Navigate to the repository root if you aren't already there.
- Run the script directly using uv: Hydra will automatically pick up the configuration from the conf subdirectory relative to the script's location.

  # Example for math_example local_eval.py
  uv run python examples/math_example/local_eval.py

  # Example for math_example trl_grpo_integration.py
  uv run python examples/math_example/trl_grpo_integration.py
Configuration:
- Configuration files are typically found in a conf subdirectory alongside the script (e.g., examples/math_example/conf/local_eval_config.yaml).
- These YAML files define various parameters, including dataset paths, model names, and training arguments.
Overriding Configuration:
You can easily override any configuration parameter from the command line:
- Dataset Path:
uv run python examples/math_example/local_eval.py dataset_file_path=path/to/your/specific_dataset.jsonl
- Model Name (for TRL scripts):
uv run python examples/math_example/trl_grpo_integration.py model_name=mistralai/Mistral-7B-Instruct-v0.2
- GRPO Training Arguments (for TRL scripts):
Access nested parameters using dot notation.
uv run python examples/math_example/trl_grpo_integration.py grpo.learning_rate=5e-5 grpo.num_train_epochs=3
- Multiple Overrides:
uv run python examples/tool_calling_example/trl_grpo_integration.py dataset_file_path=my_tools_data.jsonl model_name=google/gemma-7b grpo.per_device_train_batch_size=4
Output Directory:
Hydra manages output directories for each run. By default, outputs (logs, saved models, etc.) are saved to a timestamped directory structure like:
outputs/YYYY-MM-DD/HH-MM-SS/ (relative to where the command is run, typically the repo root).
The exact base output path can also be configured within the YAML files (e.g., hydra.run.dir).
Refer to the specific conf/*.yaml file for each example to see all available configuration options.
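
For orientation, here is a minimal sketch of how such a Hydra-driven entry point is typically structured; the config_path, config_name, and fields are illustrative, so see each example's conf/ folder for the real ones:

import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="conf", config_name="local_eval_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg mirrors the YAML file, so command-line overrides like
    # dataset_file_path=... or grpo.learning_rate=5e-5 show up here.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()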
Use the Eval Protocol CLI for common operations during development. Use uv to run the CLI commands:
# Preview an evaluator
uv run eval-protocol preview --metrics-folders "word_count=./examples/metrics/word_count" \
--samples ./examples/samples/samples.jsonl
# Deploy an evaluator
uv run eval-protocol deploy --id my-test-evaluator \
--metrics-folders "word_count=./examples/metrics/word_count" --force
# Deploy as local development server with tunnel (ideal for development/testing)
uv run eval-protocol deploy --id test-local-serve-eval --target local-serve \
--function-ref examples.row_wise.dummy_example.dummy_rewards.simple_echo_reward --verbose --force

For local development and testing, you can use the --target local-serve option to run a reward function server locally with external tunnel access:
uv run eval-protocol deploy --id test-local-serve-eval --target local-serve \
--function-ref examples.row_wise.dummy_example.dummy_rewards.simple_echo_reward --verbose --force

What this does:
- Starts a local HTTP server on port 8001 serving your reward function
- Creates an external tunnel (using ngrok or serveo.net) to make the server publicly accessible
- Registers the tunnel URL with Fireworks AI for remote evaluation
- Returns to command prompt but keeps server running in background
Important Notes:
- The CLI returns control to you, but the server processes continue running in the background
- Check running processes: ps aux | grep -E "(generic_server|ngrok)"
- Test locally: curl -X POST http://localhost:8001/evaluate -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "test"}]}'
- Monitor server logs: tail -f logs/eval-protocol-local/generic_server_*.log
- Monitor tunnel logs: tail -f logs/eval-protocol-local/ngrok_*.log
- Stop when done: Kill the background processes manually
This is perfect for development, webhook testing, or making your reward function accessible to remote services without deploying to cloud infrastructure.
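
If you prefer Python over curl for poking the local server, here is a rough equivalent (assuming the requests package is installed and the server started by --target local-serve is running on port 8001):

import requests

payload = {"messages": [{"role": "user", "content": "test"}]}
response = requests.post("http://localhost:8001/evaluate", json=payload, timeout=30)
print(response.status_code, response.json())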
If you encounter authentication issues:
- Check Credential Sources:
  - Verify that FIREWORKS_API_KEY is correctly set as an environment variable.
- Verify API Key Permissions: Ensure the API key has the necessary permissions for the operations you are attempting.
- API Base URL: If using FIREWORKS_API_BASE, ensure it points to the correct API endpoint (e.g., https://dev.api.fireworks.ai for development).
You can use the following snippet to check what credentials Eval Protocol is resolving:
from eval_protocol.auth import get_fireworks_api_key, get_fireworks_account_id
api_key = get_fireworks_api_key()
account_id = get_fireworks_account_id()
if api_key:
    print(f"Retrieved API Key (first 4, last 4 chars): {api_key[:4]}...{api_key[-4:]}")
else:
    print("API Key not found.")

if account_id:
    print(f"Retrieved Account ID: {account_id}")
else:
print("Account ID not found.")For verbose API logging:
import logging
logging.basicConfig(level=logging.DEBUG)

Or use the --verbose flag with CLI commands (from the venv):
.venv/bin/eval-protocol --verbose preview --metrics-folders "word_count=./examples/metrics/word_count" \
--samples ./examples/samples/samples.jsonl

# Clean previous builds
rm -rf dist/ build/ *.egg-info
# Build the package
uv build
# Install locally from the built package
uv pip install dist/eval_protocol-*.whl
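
A quick sanity check that the freshly built wheel is the one actually being imported:

import eval_protocol

print(eval_protocol.__file__)  # Should point at the newly installed package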
We welcome contributions to Eval Protocol! Please follow these steps to contribute:

- Find or Create an Issue:
  - Look for existing issues on the GitHub Issues page that you'd like to work on.
  - If you have a new feature or bug fix, please create a new issue first to discuss it with the maintainers, unless it's a very minor change.
- Fork and Clone the Repository:
  - Fork the repository to your own GitHub account.
  - Clone your fork locally: git clone https://github.com/YOUR_USERNAME/eval-protocol.git
  - Add the upstream repository: git remote add upstream https://github.com/fireworks-ai/eval-protocol.git
- Create a New Branch:
  - Create a descriptive branch name for your feature or fix (e.g., feat/add-new-reward-metric or fix/resolve-auth-bug).
  - git checkout -b your-branch-name
- Make Your Changes:
  - Implement your changes, adhering to the Coding Style and Standards.
  - Ensure your code is well-documented with docstrings.
- Add Tests:
  - Write new tests for any new functionality.
  - Ensure all tests pass by running .venv/bin/pytest (after activating the virtual environment).
- Run Code Quality Checks:
  - Format your code: uv run black eval_protocol tests
  - Check linting: uv run flake8 eval_protocol tests
  - Check types: uv run mypy eval_protocol
  - Run pre-commit hooks: pre-commit run --all-files
- Update Documentation:
  - If your changes affect user-facing features or APIs, update the relevant documentation in the docs/ directory.
  - Add examples if applicable.
- Commit Your Changes:
  - Write clear and concise commit messages. Reference the issue number if applicable (e.g., feat: Add awesome new metric (closes #123)).
  - git commit -m "Your descriptive commit message"
- Push to Your Fork:
  - git push origin your-branch-name
- Submit a Pull Request (PR):
  - Open a pull request from your branch to the main branch of the fireworks-ai/eval-protocol repository.
  - Provide a clear title and a detailed description of your changes in the PR.
  - Explain the "what" and "why" of your contribution.
  - Link to the relevant issue(s) using keywords like Closes #123 or Fixes #456.
  - Ensure all CI checks pass on your PR.
  - Be responsive to any feedback or questions from the maintainers during the code review process.
Update the documentation when adding new functionality:
- Update relevant files in docs/
- Add examples for new reward functions
- Update the reward functions overview
- API Connection Errors: Check your internet connection and API base URL
- Authentication Failures: Verify your API key and account ID
- Import Errors: Ensure you're using the correct virtual environment
- Deployment Failures: Check API logs and your account permissions
- Type Errors: Run mypy eval_protocol to identify typing issues
For more help, consult the official documentation or file an issue on GitHub.