This comprehensive guide is designed for developers who want to contribute to or modify the Eval Protocol codebase. It covers all key aspects of development, from environment setup to testing and contributing new reward functions.
We are committed to fostering an open and welcoming environment. All contributors are expected to adhere to our Code of Conduct.
# Clone the repository
git clone https://github.com/fireworks-ai/eval-protocol.git
cd eval-protocol
# Set up environment with uv
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]" # Includes development dependencies
# Run tests
uv run pytest
# Type check and lint
make pre-commit

- Clone the repository:
git clone https://github.com/fireworks-ai/eval-protocol.git
cd eval-protocol

- Create and activate a virtual environment with uv:
uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate

  Important for LLMs and automated scripts: After activating the virtual environment, always explicitly use executables from the .venv/bin/ directory (e.g., .venv/bin/pip, .venv/bin/pytest, .venv/bin/python). This ensures commands run within the isolated environment, even if the shell's PATH isn't immediately updated or if context is lost between commands.
- Install the package in development mode: Use uv pip from the virtual environment:

  uv pip install -e .          # Basic installation
  uv pip install -e ".[dev]"   # With development dependencies
For a streamlined local development experience, especially when managing multiple environment variables, Eval Protocol utilizes a .env.dev file in the root of the project. This file is used to load environment variables automatically when running the application locally.
Setup:
- Create the .env.dev file: Copy the example environment file to create your local development configuration:

  cp .env.example .env.dev

- Populate .env.dev: Open .env.dev and fill in the necessary environment variables, such as FIREWORKS_API_KEY and any other variables required for your development tasks (e.g., E2B_API_KEY). Example content for .env.dev:

  FIREWORKS_API_KEY="your_dev_fireworks_api_key"
  FIREWORKS_API_BASE="https://api.fireworks.ai"
  E2B_API_KEY="your_e2b_api_key"

Important:

- The .env.dev file should not be committed to version control. It is already listed in the .gitignore file.
- Variables set directly in your shell environment will take precedence over those defined in .env.dev if python-dotenv is configured to not override existing variables (which is the default behavior).
This file simplifies managing development-specific settings without needing to export them in every terminal session.
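
For reference, here is a minimal sketch of how such a file can be loaded with python-dotenv; the exact point where Eval Protocol loads it may differ, and the variable names simply mirror the example above:

import os

from dotenv import load_dotenv  # provided by the python-dotenv package

# Load variables from .env.dev; by default, existing shell variables are NOT overridden.
load_dotenv(".env.dev")

# Confirm the key is visible without printing the full secret.
api_key = os.environ.get("FIREWORKS_API_KEY", "")
print("FIREWORKS_API_KEY set:", bool(api_key))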
For development and testing interactions with the Fireworks AI platform, you need to configure your Fireworks AI credentials. Eval Protocol supports two methods:
A. Environment Variables (Highest Priority)
Set the following environment variables. For development, you might use specific development keys or a dedicated development account:
- FIREWORKS_API_KEY: Your Fireworks AI API key. For development, you might use a specific dev key:

  export FIREWORKS_API_KEY="your_dev_fireworks_api_key"

- FIREWORKS_API_BASE: (Optional) If you need to target a non-production Fireworks API endpoint. For development:

  export FIREWORKS_API_BASE="https://dev.api.fireworks.ai"
Example for a typical development setup:
export FIREWORKS_API_KEY="your_development_api_key"
export FIREWORKS_API_BASE="https://dev.api.fireworks.ai"  # If targeting dev API

B. Configuration File (Lower Priority)
Eval Protocol does not read ~/.fireworks/auth.ini (or any firectl profiles). Use environment variables instead.
Credential Sourcing Order: Eval Protocol prioritizes credentials as follows:
- Environment Variables (FIREWORKS_API_KEY)
Purpose of Credentials:
- FIREWORKS_API_KEY: Authenticates your requests to the Fireworks AI service.
- FIREWORKS_API_BASE: Allows targeting different API environments (e.g., development, staging).
Other Environment Variables:
- E2B_API_KEY: (Optional) If you are working on or testing features involving E2B code execution:

  export E2B_API_KEY="your_e2b_api_key"
eval-protocol/
├── eval_protocol/ # Main package source code
│ ├── __init__.py # Package initialization
│ ├── reward_function.py # Core reward function decorator
│ ├── models.py # Data models and types
│ ├── typed_interface.py # Type interfaces for reward functions
│ ├── evaluation.py # Evaluation pipeline
│ ├── auth.py # Authentication utilities
│ ├── cli.py # Command line interface
│ ├── rewards/ # Out-of-the-box reward functions
│ │ ├── __init__.py # Reward functions registry
│ │ ├── code_execution.py # Code execution rewards
│ │ ├── function_calling.py # Function calling rewards
│ │ ├── json_schema.py # JSON schema validation
│ │ ├── math.py # Math evaluation
│ │ ├── format.py # Format validation
│ │ ├── tag_count.py # Tag counting
│ │ ├── accuracy.py # Accuracy evaluation
│ │ ├── language_consistency.py # Language consistency
│ │ ├── reasoning_steps.py # Reasoning steps evaluation
│ │ ├── length.py # Response length evaluation
│ │ ├── repetition.py # Repetition detection
│ │ ├── cpp_code.py # C/C++ code evaluation
│ │ └── accuracy_length.py # Combined accuracy and length evaluation
├── examples/ # Example code and tutorials
│ ├── metrics/ # Example metric implementations
│ ├── samples/ # Sample data for evaluation
│ └── ... # Example scripts
├── tests/ # Unit and integration tests
├── docs/ # Documentation
└── ... # Project configuration files
- Create a new module in eval_protocol/rewards/ if needed
- Implement your reward function using the @reward_function decorator
- Update eval_protocol/rewards/__init__.py to expose your function
- Add unit tests in the tests/ directory
Example structure:
from typing import Dict, List, Any, Union, Optional
from ..typed_interface import reward_function
from ..models import Message, EvaluateResult, MetricResult
@reward_function
def my_reward_function(
    messages: Union[List[Dict[str, Any]], List[Message]],
    ground_truth: Optional[str] = None,
    **kwargs: Any
) -> EvaluateResult:
    """
    Evaluate responses based on custom criteria.

    Args:
        messages: List of conversation messages
        ground_truth: Expected correct answer
        **kwargs: Additional arguments

    Returns:
        EvaluateResult with evaluation score and metrics
    """
    # Your evaluation logic here
    # score = ...
    # reason = "..."
    # metric_score = ...
    # metric_success = ... (e.g., True if a condition is met)
    # metric_reason = "..."

    # For demonstration:
    score = 0.75
    reason = "The response met most criteria."
    is_score_valid = True  # Example: the overall score is considered valid
    metric_score = 0.8
    metric_success = True  # Example: metric condition was met
    metric_reason = "Specific aspect evaluated positively."

    return EvaluateResult(
        score=score,
        reason=reason,
        is_score_valid=is_score_valid,
        metrics={
            "metric_name": MetricResult(
                score=metric_score,
                is_score_valid=metric_success,
                reason=metric_reason
            )
        }
    )
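
As a rough usage sketch (hypothetical values; depending on how @reward_function wraps the return value, the result may be an EvaluateResult instance or a plain dict):

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"},
]

result = my_reward_function(messages=messages, ground_truth="4")
print(result)  # Overall score plus per-metric details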
To maintain code quality and consistency, please adhere to the following standards:

- Formatting:
  - Use black for code formatting. The maximum line length is 88 characters.
  - Use isort for organizing imports.
- Linting:
  - Adhere to flake8 guidelines.
- Type Hinting:
  - Use type hints for all function parameters, return values, and variables where appropriate.
  - Run mypy eval_protocol to check for type errors.
- Naming Conventions:
  - snake_case for functions, methods, and variables.
  - PascalCase for classes and dataclasses.
  - UPPER_SNAKE_CASE for constants.
- Imports:
  - Group imports in the following order:
    - Standard library imports
    - Third-party library imports
    - Local application/library specific imports (e.g., from ..module import something)
  - Separate each group with a blank line.
- Docstrings:
  - Write clear and concise docstrings for all public modules, classes, functions, and methods. Follow PEP 257 conventions.
  - For functions, explain the arguments, what the function does, and what it returns.
- Error Handling:
  - Use specific exception types rather than generic Exception.
  - Provide meaningful error messages.
- Function Design:
  - Keep functions and methods short and focused on a single responsibility.
  - Aim for readability and maintainability.
- Testing:
  - Write unit tests for all new public functions and significant private logic.
  - Ensure tests cover a variety of cases, including edge cases and expected failures.
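
A short, hypothetical helper illustrating several of these conventions together (type hints, naming, a PEP 257 docstring, and a specific exception):

from typing import List

MAX_SCORE: float = 1.0  # Constants use UPPER_SNAKE_CASE


def average_score(scores: List[float]) -> float:
    """Return the arithmetic mean of a non-empty list of scores.

    Args:
        scores: Individual scores, each expected to fall between 0.0 and MAX_SCORE.

    Returns:
        The mean of the provided scores.

    Raises:
        ValueError: If the list is empty.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)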
To help enforce coding standards and catch issues early, we use pre-commit hooks. These hooks run automatically before each commit to check your code for issues like formatting, linting errors, and type errors.
Installation and Setup:
- Install pre-commit: If you installed development dependencies with uv pip install -e ".[dev]", pre-commit should already be installed. If not, you can install it via uv:

  uv pip install pre-commit

- Install the git hooks: Navigate to the root of the repository and run:

  pre-commit install

  This will set up the pre-commit hooks to run automatically when you git commit.
Usage:
- Once installed, pre-commit hooks will run on any changed files before you commit. If any hook fails, the commit will be aborted. You'll need to fix the reported issues and then git add the files again before attempting to commit.
- Some hooks (like black and isort) may automatically fix issues. If they do, you'll still need to git add the modified files.
- You can also run the hooks manually on all files at any time:

  pre-commit run --all-files

  This is useful for checking the entire codebase or after pulling new changes.
By using pre-commit hooks, we can ensure a consistent code style and catch many common errors before they even reach the CI pipeline, saving time and effort.
Use uv to run tests with pytest:
# Run all tests
uv run pytest tests/
# Run specific test file
uv run pytest tests/test_evaluation.py
# Run specific test function
uv run pytest tests/test_file.py::test_function
# Run with coverage report
uv run pytest --cov=eval_protocol

For now, focus on the tests/ and examples/ folders, since much of the related code lives in other repositories.
Create test files in the tests/ directory following this pattern:
import unittest
# Import the function you're testing
from eval_protocol.rewards.your_module import your_function
class TestYourFunction(unittest.TestCase):
"""Test your reward function."""
def test_basic_functionality(self):
"""Test basic functionality."""
messages = [
{"role": "user", "content": "Test question"},
{"role": "assistant", "content": "Test response"}
]
result = your_function(messages=messages)
# Assert expectations
self.assertIsNotNone(result)
self.assertIsInstance(result, dict)
self.assertIn("score", result)
self.assertGreaterEqual(result["score"], 0.0)
        self.assertLessEqual(result["score"], 1.0)

Use uv to run code quality tools:
# Type checking
uv run mypy eval_protocol
# Linting
uv run flake8 eval_protocol
# Format code
uv run black eval_protocol

Eval Protocol includes these out-of-the-box reward functions:
| Category | Reward Functions |
|---|---|
| Format | format_reward |
| Tag Count | tag_count_reward |
| Accuracy | accuracy_reward |
| Language | language_consistency_reward |
| Reasoning | reasoning_steps_reward |
| Length | length_reward, cosine_length_reward |
| Repetition | repetition_penalty_reward |
| Code Execution | binary_code_reward, fractional_code_reward |
| C/C++ Code | ioi_cpp_code_reward, binary_cpp_code_reward |
| Combined | cosine_scaled_accuracy_length_reward |
| Function Calling | schema_jaccard_reward, llm_judge_reward, composite_function_call_reward |
| JSON Schema | json_schema_reward |
| Math | math_reward |
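
The built-in functions are exposed through the rewards registry (eval_protocol/rewards/__init__.py). As a hypothetical sketch of invoking one directly (check the module for each function's exact name and keyword arguments):

from eval_protocol.rewards import format_reward

messages = [
    {"role": "user", "content": "Answer with your reasoning and final answer."},
    {"role": "assistant", "content": "<think>Adding the numbers...</think><answer>42</answer>"},
]

result = format_reward(messages=messages)
print(result)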
The examples folder contains sample code for using Eval Protocol:
# Run evaluation preview example
FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY \
FIREWORKS_API_BASE=https://dev.api.fireworks.ai \
uv run python examples/evaluation_preview_example.py
# Run deployment example
FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY \
FIREWORKS_API_BASE=https://dev.api.fireworks.ai \
uv run python examples/deploy_example.py

Several example scripts, particularly those involving local evaluations (local_eval.py) and TRL integration (trl_grpo_integration.py) within directories like examples/math_example/, examples/math_example_openr1/, and examples/tool_calling_example/, have been refactored to use Hydra for configuration management.
How to Run:
- Navigate to the repository root if you aren't already there.
- Run the script directly using uv: Hydra will automatically pick up the configuration from the conf subdirectory relative to the script's location.

  # Example for math_example local_eval.py
  uv run python examples/math_example/local_eval.py

  # Example for math_example trl_grpo_integration.py
  uv run python examples/math_example/trl_grpo_integration.py
Configuration:
- Configuration files are typically found in a conf subdirectory alongside the script (e.g., examples/math_example/conf/local_eval_config.yaml).
- These YAML files define various parameters, including dataset paths, model names, and training arguments.
Overriding Configuration:
You can easily override any configuration parameter from the command line:
- Dataset Path:
uv run python examples/math_example/local_eval.py dataset_file_path=path/to/your/specific_dataset.jsonl
- Model Name (for TRL scripts):
uv run python examples/math_example/trl_grpo_integration.py model_name=mistralai/Mistral-7B-Instruct-v0.2
- GRPO Training Arguments (for TRL scripts):
Access nested parameters using dot notation.
uv run python examples/math_example/trl_grpo_integration.py grpo.learning_rate=5e-5 grpo.num_train_epochs=3
- Multiple Overrides:
uv run python examples/tool_calling_example/trl_grpo_integration.py dataset_file_path=my_tools_data.jsonl model_name=google/gemma-7b grpo.per_device_train_batch_size=4
Output Directory:
Hydra manages output directories for each run. By default, outputs (logs, saved models, etc.) are saved to a timestamped directory structure like:
outputs/YYYY-MM-DD/HH-MM-SS/ (relative to where the command is run, typically the repo root).
The exact base output path can also be configured within the YAML files (e.g., hydra.run.dir).
Refer to the specific conf/*.yaml file for each example to see all available configuration options.
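
For orientation, here is a minimal sketch of how such a Hydra-driven entry point is typically structured; the config_path, config_name, and fields are illustrative, so see each example's conf/ folder for the real ones:

import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="conf", config_name="local_eval_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg mirrors the YAML file, so command-line overrides like
    # dataset_file_path=... or grpo.learning_rate=5e-5 show up here.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()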
Use the Eval Protocol CLI for common operations during development. Use uv to run the CLI commands:
# Preview an evaluator
uv run eval-protocol preview --metrics-folders "word_count=./examples/metrics/word_count" \
--samples ./examples/samples/samples.jsonl
# Deploy an evaluator
uv run eval-protocol deploy --id my-test-evaluator \
--metrics-folders "word_count=./examples/metrics/word_count" --force
# Deploy as local development server with tunnel (ideal for development/testing)
uv run eval-protocol deploy --id test-local-serve-eval --target local-serve \
--function-ref examples.row_wise.dummy_example.dummy_rewards.simple_echo_reward --verbose --force

For local development and testing, you can use the --target local-serve option to run a reward function server locally with external tunnel access:
uv run eval-protocol deploy --id test-local-serve-eval --target local-serve \
--function-ref examples.row_wise.dummy_example.dummy_rewards.simple_echo_reward --verbose --force

What this does:
- Starts a local HTTP server on port 8001 serving your reward function
- Creates an external tunnel (using ngrok or serveo.net) to make the server publicly accessible
- Registers the tunnel URL with Fireworks AI for remote evaluation
- Returns to command prompt but keeps server running in background
Important Notes:
- The CLI returns control to you, but the server processes continue running in the background
- Check running processes: ps aux | grep -E "(generic_server|ngrok)"
- Test locally: curl -X POST http://localhost:8001/evaluate -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "test"}]}'
- Monitor server logs: tail -f logs/eval-protocol-local/generic_server_*.log
- Monitor tunnel logs: tail -f logs/eval-protocol-local/ngrok_*.log
- Stop when done: Kill the background processes manually
This is perfect for development, webhook testing, or making your reward function accessible to remote services without deploying to cloud infrastructure.
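
If you prefer Python over curl for poking the local server, here is a rough equivalent (assuming the requests package is installed and the server started by --target local-serve is running on port 8001):

import requests

payload = {"messages": [{"role": "user", "content": "test"}]}
response = requests.post("http://localhost:8001/evaluate", json=payload, timeout=30)
print(response.status_code, response.json())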
If you encounter authentication issues:
- Check Credential Sources:
  - Verify that FIREWORKS_API_KEY is correctly set as an environment variable.
- Verify API Key Permissions: Ensure the API key has the necessary permissions for the operations you are attempting.
- API Base URL: If using FIREWORKS_API_BASE, ensure it points to the correct API endpoint (e.g., https://dev.api.fireworks.ai for development).
You can use the following snippet to check what credentials Eval Protocol is resolving:
from eval_protocol.auth import get_fireworks_api_key, get_fireworks_account_id
api_key = get_fireworks_api_key()
account_id = get_fireworks_account_id()
if api_key:
    print(f"Retrieved API Key (first 4, last 4 chars): {api_key[:4]}...{api_key[-4:]}")
else:
    print("API Key not found.")

if account_id:
    print(f"Retrieved Account ID: {account_id}")
else:
print("Account ID not found.")For verbose API logging:
import logging
logging.basicConfig(level=logging.DEBUG)

Or use the --verbose flag with CLI commands (from the venv):
.venv/bin/eval-protocol --verbose preview --metrics-folders "word_count=./examples/metrics/word_count" \
--samples ./examples/samples/samples.jsonl

# Clean previous builds
rm -rf dist/ build/ *.egg-info
# Build the package
uv build
# Install locally from the built package
uv pip install dist/eval_protocol-*.whl
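
A quick sanity check that the freshly built wheel is the one actually being imported:

import eval_protocol

print(eval_protocol.__file__)  # Should point at the newly installed package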
We welcome contributions to Eval Protocol! Please follow these steps to contribute:

- Find or Create an Issue:
  - Look for existing issues on the GitHub Issues page that you'd like to work on.
  - If you have a new feature or bug fix, please create a new issue first to discuss it with the maintainers, unless it's a very minor change.
- Fork and Clone the Repository:
  - Fork the repository to your own GitHub account.
  - Clone your fork locally: git clone https://github.com/YOUR_USERNAME/eval-protocol.git
  - Add the upstream repository: git remote add upstream https://github.com/fireworks-ai/eval-protocol.git
- Create a New Branch:
  - Create a descriptive branch name for your feature or fix (e.g., feat/add-new-reward-metric or fix/resolve-auth-bug).
  - git checkout -b your-branch-name
- Make Your Changes:
  - Implement your changes, adhering to the Coding Style and Standards.
  - Ensure your code is well-documented with docstrings.
- Add Tests:
  - Write new tests for any new functionality.
  - Ensure all tests pass by running .venv/bin/pytest (after activating the virtual environment).
- Run Code Quality Checks:
  - Format your code: uv run black eval_protocol tests
  - Check linting: uv run flake8 eval_protocol tests
  - Check types: uv run mypy eval_protocol
  - Run pre-commit hooks: pre-commit run --all-files
- Update Documentation:
  - If your changes affect user-facing features or APIs, update the relevant documentation in the docs/ directory.
  - Add examples if applicable.
- Commit Your Changes:
  - Write clear and concise commit messages. Reference the issue number if applicable (e.g., feat: Add awesome new metric (closes #123)).
  - git commit -m "Your descriptive commit message"
- Push to Your Fork:
  - git push origin your-branch-name
- Submit a Pull Request (PR):
  - Open a pull request from your branch to the main branch of the fireworks-ai/eval-protocol repository.
  - Provide a clear title and a detailed description of your changes in the PR.
  - Explain the "what" and "why" of your contribution.
  - Link to the relevant issue(s) using keywords like Closes #123 or Fixes #456.
  - Ensure all CI checks pass on your PR.
  - Be responsive to any feedback or questions from the maintainers during the code review process.
Update the documentation when adding new functionality:
- Update relevant files in docs/
- Add examples for new reward functions
- Update the reward functions overview
- API Connection Errors: Check your internet connection and API base URL
- Authentication Failures: Verify your API key and account ID
- Import Errors: Ensure you're using the correct virtual environment
- Deployment Failures: Check API logs and your account permissions
- Type Errors: Run mypy eval_protocol to identify typing issues
For more help, consult the official documentation or file an issue on GitHub.