Evaluation Module

The Evaluation module provides tools for testing and benchmarking AI agents: you define test cases, score agent outputs against criteria, and compare results across agents.

Overview

from pyai.evaluation import Evaluator, TestCase, EvalSet, EvalCriteria

Key Components

Component	Description
Evaluator	Main evaluation engine
TestCase	Individual test case definition
EvalSet	Collection of test cases
EvalCriteria	Evaluation criteria and metrics
EvalResult	Evaluation results container

Quick Start

from pyai.evaluation import Evaluator, TestCase, EvalSet

# Create test cases
test_cases = [
    TestCase(
        input="What is 2+2?",
        expected_output="4",
        criteria=["accuracy", "conciseness"]
    ),
    TestCase(
        input="Explain quantum computing",
        expected_output=None,  # Open-ended
        criteria=["relevance", "clarity"]
    )
]

# Create evaluation set
eval_set = EvalSet(name="Sample Tests", test_cases=test_cases)

# Run evaluation (my_agent is an agent instance you have already created)
evaluator = Evaluator()
results = evaluator.evaluate(eval_set, agent=my_agent)

# View results
print(f"Pass Rate: {results.pass_rate}%")
print(f"Average Score: {results.avg_score}")

Built-in Criteria

Accuracy Criteria

from pyai.evaluation.criteria import AccuracyCriteria

criteria = AccuracyCriteria(
    threshold=0.8,  # 80% accuracy required
    comparison_method="exact"  # or "semantic", "fuzzy"
)
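
This page does not define the three comparison_method values. As a rough, library-independent illustration of how "exact" and "fuzzy" matching against a threshold could differ, the sketch below uses Python's standard difflib. The names exact_score, fuzzy_score, and passes are hypothetical and are not part of pyai. "Semantic" comparison would typically need an embedding model and is left out.

```python
from difflib import SequenceMatcher


def exact_score(output: str, expected: str) -> float:
    """Return 1.0 only when the whitespace-stripped strings are identical."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def fuzzy_score(output: str, expected: str) -> float:
    """Return a similarity ratio in [0, 1] based on matching character runs."""
    a = output.strip().lower()
    b = expected.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def passes(score: float, threshold: float = 0.8) -> bool:
    """A score passes when it meets or exceeds the threshold."""
    return score >= threshold


# Exact matching rejects a correct but wordy answer; fuzzy matching
# tolerates small spelling differences.
print(exact_score("The answer is 4", "4"))
print(passes(fuzzy_score("colour", "color")))
```

Under these assumptions, the threshold matters most for the fuzzy and semantic methods, since exact matching only ever yields 0.0 or 1.0.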

Custom Criteria

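from typing import Optional

from pyai.evaluation import EvalCriteria

class ToneCriteria(EvalCriteria):
    def evaluate(self, output: str, expected: Optional[str]) -> float:
        # expected may be None for open-ended test cases
        # Score 1.0 if the output mentions "professional", otherwise 0.5
        if "professional" in output.lower():
            return 1.0
        return 0.5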
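
Because evaluate returns a plain float, a custom criterion can be checked on its own before it is attached to a test case. The sketch below uses a stand-in EvalCriteria base class so it runs without pyai installed. This is an assumption: the real base class may require constructor arguments or methods not shown on this page.

```python
from typing import Optional


class EvalCriteria:
    """Stand-in for pyai.evaluation.EvalCriteria (assumed interface:
    subclasses only need to implement evaluate())."""

    def evaluate(self, output: str, expected: Optional[str]) -> float:
        raise NotImplementedError


class ToneCriteria(EvalCriteria):
    def evaluate(self, output: str, expected: Optional[str] = None) -> float:
        # Full score when the output mentions "professional", half otherwise
        if "professional" in output.lower():
            return 1.0
        return 0.5


criteria = ToneCriteria()
print(criteria.evaluate("We offer Professional support."))
print(criteria.evaluate("hey, what's up"))
```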

Batch Evaluation

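# Evaluate multiple agents against the same eval set
# (evaluator and eval_set come from the Quick Start; agent1..agent3 are your own agents)
agents = [agent1, agent2, agent3]
comparison = evaluator.compare(eval_set, agents=agents)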

# Generate comparison report
comparison.to_markdown("comparison_report.md")
comparison.to_json("comparison_results.json")

Metrics

The evaluation module tracks:

Pass Rate: Percentage of tests passed
Average Score: Mean score across all criteria
Latency: Response time metrics
Token Usage: Token consumption per test
Cost: Estimated API cost
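
How these aggregates are computed is not specified here. A minimal sketch, assuming each test yields a single score in [0, 1] and counts as passed when that score meets a threshold. Outcome and summarize are hypothetical names for illustration, not pyai APIs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Outcome:
    score: float      # criterion score in [0, 1]
    latency_s: float  # wall-clock response time in seconds


def summarize(outcomes: List[Outcome], threshold: float = 0.8) -> dict:
    """Aggregate per-test outcomes into pass rate, mean score, and mean latency."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    passed = sum(1 for o in outcomes if o.score >= threshold)
    return {
        "pass_rate": 100.0 * passed / len(outcomes),  # percentage
        "avg_score": mean(o.score for o in outcomes),
        "avg_latency_s": mean(o.latency_s for o in outcomes),
    }


summary = summarize([Outcome(1.0, 0.5), Outcome(0.5, 1.5)])
print(summary)
```

Token usage and cost could be aggregated the same way if the per-test records carry those fields.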

Integration with CI/CD

# .github/workflows/eval.yml
- name: Run Agent Evaluation
  run: |
    python -m pyai.evaluation run \
      --eval-set tests/eval_cases.yaml \
      --threshold 0.85 \
      --output results.json
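
This page does not say whether the command above exits with a non-zero status when the threshold is missed. If it does not, a small gate script run as a following step could fail the job explicitly. The sketch below assumes results.json has a top-level numeric avg_score field; that field name is a guess based on results.avg_score in the Quick Start, and the script name gate.py is hypothetical.

```python
import json
import sys
from pathlib import Path


def gate(results: dict, threshold: float) -> int:
    """Return a process exit code: 0 if the score meets the threshold, else 1."""
    score = results.get("avg_score")  # assumed field name, not confirmed
    if score is None:
        print("results file has no 'avg_score' field", file=sys.stderr)
        return 1
    print(f"avg_score={score} threshold={threshold}")
    return 0 if score >= threshold else 1


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python gate.py results.json 0.85
    data = json.loads(Path(sys.argv[1]).read_text())
    sys.exit(gate(data, float(sys.argv[2])))
```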
