-
Notifications
You must be signed in to change notification settings - Fork 0
Evaluation Module
gitpavleenbali edited this page Feb 17, 2026
·
2 revisions
The Evaluation module provides comprehensive tools for testing, benchmarking, and evaluating AI agent performance.
from pyai.evaluation import Evaluator, TestCase, EvalSet, EvalCriteria| Component | Description |
|---|---|
| Evaluator | Main evaluation engine |
| TestCase | Individual test case definition |
| EvalSet | Collection of test cases |
| EvalCriteria | Evaluation criteria and metrics |
| EvalResult | Evaluation results container |
from pyai.evaluation import Evaluator, TestCase, EvalSet
# Create test cases
test_cases = [
TestCase(
input="What is 2+2?",
expected_output="4",
criteria=["accuracy", "conciseness"]
),
TestCase(
input="Explain quantum computing",
expected_output=None, # Open-ended
criteria=["relevance", "clarity"]
)
]
# Create evaluation set
eval_set = EvalSet(name="Math Tests", test_cases=test_cases)
# Run evaluation
evaluator = Evaluator()
results = evaluator.evaluate(eval_set, agent=my_agent)
# View results
print(f"Pass Rate: {results.pass_rate}%")
print(f"Average Score: {results.avg_score}")from pyai.evaluation.criteria import AccuracyCriteria
criteria = AccuracyCriteria(
threshold=0.8, # 80% accuracy required
comparison_method="exact" # or "semantic", "fuzzy"
)from pyai.evaluation import EvalCriteria
class ToneCriteria(EvalCriteria):
def evaluate(self, output: str, expected: str) -> float:
# Custom evaluation logic
if "professional" in output.lower():
return 1.0
return 0.5# Evaluate multiple agents
agents = [agent1, agent2, agent3]
comparison = evaluator.compare(eval_set, agents=agents)
# Generate comparison report
comparison.to_markdown("comparison_report.md")
comparison.to_json("comparison_results.json")The evaluation module tracks:
- Pass Rate: Percentage of tests passed
- Average Score: Mean score across all criteria
- Latency: Response time metrics
- Token Usage: Token consumption per test
- Cost: Estimated API cost
# .github/workflows/eval.yml
- name: Run Agent Evaluation
run: |
python -m pyai.evaluation run \
--eval-set tests/eval_cases.yaml \
--threshold 0.85 \
--output results.jsonIntelligence, Embedded.