**feat: Add comprehensive AI Agent Evaluation Framework with Azure AI Foundry integration**#375

Merged
james-tn merged 10 commits into int-agentic from james-dev on Feb 3, 2026

@james-tn james-tn commented Feb 3, 2026

Description

Summary

This PR introduces a comprehensive evaluation framework for testing AI agents against customer support scenarios. The framework supports both local evaluation with custom metrics and remote evaluation via Azure AI Foundry with LLM-as-judge capabilities.

🎯 Key Features

Evaluation Framework (evaluations)

  • 30 test scenarios covering billing, internet, mobile, account, TV, bundles, and support categories
  • Single-turn and multi-turn evaluation with different weighting strategies
  • Custom evaluators: Tool behavior (recall/precision/efficiency), completeness, grounded accuracy, safety
  • Azure AI Foundry integration: Built-in evaluators for coherence, fluency, relevance, groundedness, task adherence, intent resolution
  • 1-5 scale scoring that matches the Azure AI Foundry portal, so local and portal scores can be compared consistently
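
To make the tool behavior evaluator concrete, here is a minimal sketch of how recall, precision, and efficiency could be computed from an expected tool list and the tools an agent actually called. This is not the code in `metrics.py`: the function name `score_tool_behavior`, the `ToolBehaviorScore` type, the tool names, and the definition of efficiency as distinct tools divided by total calls are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolBehaviorScore:
    recall: float      # share of expected tools that were actually called
    precision: float   # share of distinct called tools that were expected
    efficiency: float  # distinct tools / total calls (assumed definition)


def score_tool_behavior(expected: list[str], called: list[str]) -> ToolBehaviorScore:
    """Hypothetical scorer; compares tool names only, ignoring arguments."""
    expected_set = set(expected)
    called_set = set(called)
    hits = expected_set & called_set

    # With no expected tools there is nothing to miss, so recall is perfect.
    recall = len(hits) / len(expected_set) if expected_set else 1.0

    # With no calls, precision is perfect only if no tools were expected.
    if called_set:
        precision = len(hits) / len(called_set)
    else:
        precision = 1.0 if not expected_set else 0.0

    # Repeated calls to the same tool lower efficiency.
    efficiency = len(called_set) / len(called) if called else 1.0
    return ToolBehaviorScore(recall, precision, efficiency)


# Hypothetical tool names, for illustration only.
score = score_tool_behavior(
    expected=["get_billing_summary", "get_invoice"],
    called=["get_billing_summary", "get_billing_summary", "search_kb"],
)
print(score)
```

In this example one of two expected tools was called (recall 0.5), one of two distinct called tools was expected (precision 0.5), and one call was a repeat (efficiency 2/3).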

Evaluation Methodology

| Mode | Focus | Key Metrics |
| --- | --- | --- |
| Single-Turn | Tool-level accuracy | Tool behavior (25%), tool call accuracy, completeness, response quality |
| Multi-Turn | Outcome-focused | Solution accuracy (30%), task adherence (20%), intent resolution (20%), coherence, fluency, relevance |

Multi-turn evaluation intentionally excludes tool-level metrics because what matters is the final outcome, not the intermediate steps taken across conversation turns.
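
As a rough illustration of how the multi-turn weights could combine into one 1-5 score, the sketch below computes a weighted mean. The 30% / 20% / 20% weights come from the table above. The 10% each for coherence, fluency, and relevance is an assumption (the description does not state them), and the names `MULTI_TURN_WEIGHTS` and `weighted_score` are hypothetical rather than taken from `evaluator.py`.

```python
from __future__ import annotations

# Weights for solution accuracy, task adherence, and intent resolution are
# from the PR description; the three 0.10 values are ASSUMED for illustration.
MULTI_TURN_WEIGHTS = {
    "solution_accuracy": 0.30,
    "task_adherence": 0.20,
    "intent_resolution": 0.20,
    "coherence": 0.10,
    "fluency": 0.10,
    "relevance": 0.10,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of 1-5 scores, renormalized over the metrics present."""
    total = 0.0
    total_weight = 0.0
    for name, weight in weights.items():
        value = scores.get(name)
        if value is None:
            continue  # metric not produced for this run; skip and renormalize
        if not 1.0 <= value <= 5.0:
            raise ValueError(f"{name} must be on the 1-5 scale, got {value}")
        total += weight * value
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no weighted metrics were provided")
    return total / total_weight


example = {
    "solution_accuracy": 4,
    "task_adherence": 5,
    "intent_resolution": 3,
    "coherence": 5,
    "fluency": 5,
    "relevance": 4,
}
print(round(weighted_score(example, MULTI_TURN_WEIGHTS), 2))
```

Renormalizing over the metrics that are present keeps the result on the 1-5 scale even when an evaluator is skipped. Whether the framework does this is not stated in the description.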

CLI Options

# Local evaluation only (default)
python run_agent_eval.py --agent my_agent

# Remote evaluation (push to Azure AI Foundry)
python run_agent_eval.py --agent my_agent --remote

# Both local and remote
python run_agent_eval.py --agent my_agent --local --remote

# Single-turn or multi-turn only
python run_agent_eval.py --agent my_agent --single-turn-only
python run_agent_eval.py --agent my_agent --multi-turn-only

# Quick test with limited cases
python run_agent_eval.py --agent my_agent --limit 5
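
For readers who want to see how these flags might fit together, the following is a minimal `argparse` sketch. It is not the parser in `run_agent_eval.py`. It assumes that local evaluation runs when neither `--local` nor `--remote` is given (matching the "default" comment above) and that `--single-turn-only` and `--multi-turn-only` are mutually exclusive. The helper names `build_parser` and `resolve_modes` are hypothetical.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed in the PR description."""
    parser = argparse.ArgumentParser(prog="run_agent_eval.py")
    parser.add_argument("--agent", required=True, help="name of the agent to evaluate")
    parser.add_argument("--local", action="store_true", help="run local evaluation")
    parser.add_argument("--remote", action="store_true", help="push to Azure AI Foundry")
    # ASSUMPTION: the two scope flags cannot be combined.
    scope = parser.add_mutually_exclusive_group()
    scope.add_argument("--single-turn-only", action="store_true")
    scope.add_argument("--multi-turn-only", action="store_true")
    parser.add_argument("--limit", type=int, default=None, help="max test cases to run")
    return parser


def resolve_modes(args: argparse.Namespace) -> tuple[bool, bool]:
    """Return (run_local, run_remote); local is the default when neither is set."""
    if not args.local and not args.remote:
        return True, False
    return args.local, args.remote


args = build_parser().parse_args(["--agent", "my_agent", "--remote", "--limit", "5"])
print(resolve_modes(args), args.limit)
```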

📁 New Files

| File | Purpose |
| --- | --- |
| run_agent_eval.py | Main evaluation script - orchestrates tests, local eval, and remote push |
| evaluator.py | Evaluation runner, weight definitions, result aggregation |
| metrics.py | All metric implementations (custom + Azure AI wrappers) |
| eval_dataset.json | 30 test cases with ground truth and rubrics |
| telemetry.py | Azure Monitor tracing configuration |
| README.md | Comprehensive documentation |
| agent-evaluation.yml | CI/CD workflow for automated evaluation |

🔧 Configuration

Only two environment variables are needed for evaluation. The first is required only for --remote runs; the second is optional:

# Required for --remote evaluation
AZURE_AI_PROJECT_ENDPOINT=https://your-account.services.ai.azure.com/api/projects/your-project

# Optional - separate model for evaluation (defaults to AZURE_OPENAI_CHAT_DEPLOYMENT)
AZURE_OPENAI_EVAL_DEPLOYMENT=gpt-4o-mini
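
The fallback described in the comments above could be resolved along these lines. This is a sketch under stated assumptions, not the framework's own loader: `EvalConfig` and `load_eval_config` are hypothetical names, and raising `RuntimeError` when the endpoint is missing is an assumed behavior.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EvalConfig:
    project_endpoint: Optional[str]  # needed only for remote evaluation
    eval_deployment: Optional[str]   # model deployment used as the judge


def load_eval_config(remote: bool, env: Mapping[str, str] = os.environ) -> EvalConfig:
    """Hypothetical loader for the two variables named in the PR description."""
    endpoint = env.get("AZURE_AI_PROJECT_ENDPOINT") or None
    if remote and not endpoint:
        # ASSUMPTION: fail fast rather than silently skipping the remote push.
        raise RuntimeError("AZURE_AI_PROJECT_ENDPOINT must be set for --remote evaluation")

    # Prefer the dedicated eval deployment; fall back to the chat deployment.
    deployment = (
        env.get("AZURE_OPENAI_EVAL_DEPLOYMENT")
        or env.get("AZURE_OPENAI_CHAT_DEPLOYMENT")
        or None
    )
    return EvalConfig(project_endpoint=endpoint, eval_deployment=deployment)


config = load_eval_config(remote=False, env={"AZURE_OPENAI_CHAT_DEPLOYMENT": "gpt-4o"})
print(config.eval_deployment)
```

Passing the environment as a mapping keeps the function easy to test without touching real process variables.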

🏗️ Agent Framework Updates

  • Updated base_agent.py with improved tool call tracking
  • Enhanced single_agent.py and handoff_multi_domain_agent.py for better observability
  • Added ToolCallTrackingMixin support for accurate tool usage metrics

✅ Testing

  • Unit tests for evaluation metrics
  • Integration tests for agent evaluation pipeline
  • GitHub Actions workflow for automated evaluation on PRs

📚 Documentation

  • Comprehensive README explaining evaluation methodology
  • Step-by-step setup guide
  • Troubleshooting section
  • Environment variables reference

Checklist

  • Added evaluation framework with local and remote modes
  • Implemented single-turn (tool-focused) and multi-turn (outcome-focused) evaluation strategies
  • Integrated Azure AI Foundry built-in evaluators
  • Created 30 test scenarios with ground truth
  • Added comprehensive documentation
  • Added GitHub Actions workflow for CI/CD
  • Updated .gitignore for generated files

Breaking Changes

None. This is an additive feature.


@james-tn james-tn merged commit b9b1f03 into int-agentic Feb 3, 2026
7 of 9 checks passed