**feat: Add comprehensive AI Agent Evaluation Framework with Azure AI Foundry integration** by james-tn · Pull Request #375 · microsoft/OpenAIWorkshop

james-tn · 2026-02-03T21:13:02Z

Description

Summary

This PR introduces a comprehensive evaluation framework for testing AI agents against customer support scenarios. The framework supports both local evaluation with custom metrics and remote evaluation via Azure AI Foundry with LLM-as-judge capabilities.

🎯 Key Features

Evaluation Framework (evaluations)

30 test scenarios covering billing, internet, mobile, account, TV, bundles, and support categories
Single-turn and multi-turn evaluation with different weighting strategies
Custom evaluators: Tool behavior (recall/precision/efficiency), completeness, grounded accuracy, safety
Azure AI Foundry integration: Built-in evaluators for coherence, fluency, relevance, groundedness, task adherence, intent resolution
1-5 scale scoring matching Azure AI Foundry portal for consistent comparison
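
To make the tool behavior metrics concrete, here is a minimal sketch of how recall, precision, and efficiency could be computed from expected and observed tool calls. It is not the code in metrics.py: the function names, the efficiency definition (unique calls divided by total calls), and the linear mapping onto the 1-5 scale are all assumptions for illustration.

```python
def tool_behavior_scores(expected_tools, called_tools):
    """Compare the expected tool names with the tools the agent actually called.

    Hypothetical helper; the definitions below are assumptions, not the PR's code.
    """
    expected = set(expected_tools)
    called = set(called_tools)
    hits = expected & called

    # Recall: share of expected tools that were actually called.
    recall = len(hits) / len(expected) if expected else 1.0
    # Precision: share of distinct called tools that were expected.
    if called:
        precision = len(hits) / len(called)
    else:
        precision = 1.0 if not expected else 0.0
    # Efficiency (assumed definition): distinct calls / total calls, so repeats cost points.
    efficiency = len(called) / len(called_tools) if called_tools else 1.0
    return {"recall": recall, "precision": precision, "efficiency": efficiency}


def to_five_point(score):
    """Map a score in [0, 1] linearly onto a 1-5 scale (assumed mapping)."""
    clamped = max(0.0, min(1.0, score))
    return 1.0 + 4.0 * clamped


if __name__ == "__main__":
    # Tool names are made up for the example.
    result = tool_behavior_scores(
        ["get_bill", "get_customer"],
        ["get_bill", "get_bill", "search_kb"],
    )
    print(result, to_five_point(result["recall"]))
```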

Evaluation Methodology

Mode	Focus	Key Metrics
Single-Turn	Tool-level accuracy	Tool behavior (25%), tool call accuracy, completeness, response quality
Multi-Turn	Outcome-focused	Solution accuracy (30%), task adherence (20%), intent resolution (20%), coherence, fluency, relevance

Multi-turn evaluation intentionally excludes tool-level metrics because what matters is the final outcome, not the intermediate steps taken across conversation turns.
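
As an illustration of the weighting, the sketch below combines per-metric scores on the 1-5 scale into one weighted score. Only the 30% / 20% / 20% weights come from the table above; the 10% weights for coherence, fluency, and relevance, and the names `MULTI_TURN_WEIGHTS` and `weighted_score`, are assumptions and may not match evaluator.py.

```python
MULTI_TURN_WEIGHTS = {
    "solution_accuracy": 0.30,  # from the methodology table
    "task_adherence": 0.20,     # from the methodology table
    "intent_resolution": 0.20,  # from the methodology table
    "coherence": 0.10,          # assumed: the remaining 30% split evenly
    "fluency": 0.10,            # assumed
    "relevance": 0.10,          # assumed
}


def weighted_score(scores, weights):
    """Weighted average over the metrics present in `scores`.

    Weights are renormalized so that a missing metric (for example, an
    evaluator that failed to run) does not drag the total toward zero.
    """
    used = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError("no weighted metrics present in scores")
    return sum(scores[name] * w for name, w in used.items()) / total_weight


if __name__ == "__main__":
    example = {"solution_accuracy": 5, "task_adherence": 3}
    print(round(weighted_score(example, MULTI_TURN_WEIGHTS), 2))
```

Renormalizing over the metrics that are present is one possible design choice; a stricter runner could instead fail when any weighted metric is missing.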

CLI Options

# Local evaluation only (default)
python run_agent_eval.py --agent my_agent

# Remote evaluation (push to Azure AI Foundry)
python run_agent_eval.py --agent my_agent --remote

# Both local and remote
python run_agent_eval.py --agent my_agent --local --remote

# Single-turn or multi-turn only
python run_agent_eval.py --agent my_agent --single-turn-only
python run_agent_eval.py --agent my_agent --multi-turn-only

# Quick test with limited cases
python run_agent_eval.py --agent my_agent --limit 5
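
For readers who want to see how these flags fit together, here is a hedged `argparse` sketch. It only mirrors the options listed above; the real run_agent_eval.py may parse them differently. Treating `--single-turn-only` and `--multi-turn-only` as mutually exclusive, and treating `--remote` alone as "remote only", are assumptions inferred from the comments on the commands.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_agent_eval.py")
    parser.add_argument("--agent", required=True, help="name of the agent to evaluate")
    parser.add_argument("--local", action="store_true", help="run local evaluation")
    parser.add_argument("--remote", action="store_true", help="push to Azure AI Foundry")
    # Assumption: the two scope flags cannot be combined.
    scope = parser.add_mutually_exclusive_group()
    scope.add_argument("--single-turn-only", action="store_true")
    scope.add_argument("--multi-turn-only", action="store_true")
    parser.add_argument("--limit", type=int, default=None, help="max number of test cases")
    return parser


def resolve_modes(args):
    """Return (run_local, run_remote); local is the default when no mode flag is set."""
    run_remote = args.remote
    run_local = args.local or not args.remote
    return run_local, run_remote


if __name__ == "__main__":
    parsed = build_parser().parse_args(["--agent", "my_agent", "--limit", "5"])
    print(resolve_modes(parsed), parsed.limit)
```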

📁 New Files

File	Purpose
run_agent_eval.py	Main evaluation script - orchestrates tests, local eval, and remote push
evaluator.py	Evaluation runner, weight definitions, result aggregation
metrics.py	All metric implementations (custom + Azure AI wrappers)
eval_dataset.json	30 test cases with ground truth and rubrics
telemetry.py	Azure Monitor tracing configuration
README.md	Comprehensive documentation
agent-evaluation.yml	CI/CD workflow for automated evaluation

🔧 Configuration

Only two environment variables are needed for evaluation:

# Required for --remote evaluation
AZURE_AI_PROJECT_ENDPOINT=https://your-account.services.ai.azure.com/api/projects/your-project

# Optional - separate model for evaluation (defaults to AZURE_OPENAI_CHAT_DEPLOYMENT)
AZURE_OPENAI_EVAL_DEPLOYMENT=gpt-4o-mini
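
The fallback described above could be resolved roughly as follows. This is a sketch, not the PR's code: `load_eval_config` and its `require_remote` parameter are hypothetical names, and only the three environment variable names come from this description.

```python
import os


def load_eval_config(env=None, require_remote=False):
    """Read evaluation settings from the environment (hypothetical helper)."""
    env = os.environ if env is None else env

    endpoint = env.get("AZURE_AI_PROJECT_ENDPOINT")
    if require_remote and not endpoint:
        # The endpoint is only mandatory when pushing results to Azure AI Foundry.
        raise RuntimeError("AZURE_AI_PROJECT_ENDPOINT must be set for --remote evaluation")

    # Prefer the dedicated evaluation deployment; otherwise fall back to the chat deployment.
    deployment = env.get("AZURE_OPENAI_EVAL_DEPLOYMENT") or env.get("AZURE_OPENAI_CHAT_DEPLOYMENT")
    return {"project_endpoint": endpoint, "eval_deployment": deployment}


if __name__ == "__main__":
    print(load_eval_config({"AZURE_OPENAI_CHAT_DEPLOYMENT": "gpt-4o"}))
```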

🏗️ Agent Framework Updates

Updated base_agent.py with improved tool call tracking
Enhanced single_agent.py and handoff_multi_domain_agent.py for better observability
Added ToolCallTrackingMixin support for accurate tool usage metrics
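
This description does not show the interface of ToolCallTrackingMixin, so the following is only an illustrative sketch of the general mixin pattern for recording tool calls. The class `ToolCallTrackingSketch`, its methods, and `DemoAgent` are hypothetical and should not be read as the API in base_agent.py.

```python
class ToolCallTrackingSketch:
    """Hypothetical mixin that records each tool call an agent makes."""

    def _record_tool_call(self, name, arguments):
        # Create the log lazily so the mixin needs no cooperative __init__.
        if not hasattr(self, "_tool_calls"):
            self._tool_calls = []
        self._tool_calls.append({"name": name, "arguments": dict(arguments)})

    def tool_calls(self):
        """Return a copy of the recorded calls, in call order."""
        return list(getattr(self, "_tool_calls", []))

    def reset_tool_calls(self):
        """Clear the log, e.g. between test cases."""
        self._tool_calls = []


class DemoAgent(ToolCallTrackingSketch):
    """Toy agent with one made-up tool, used only to exercise the mixin."""

    def lookup_bill(self, customer_id):
        self._record_tool_call("lookup_bill", {"customer_id": customer_id})
        return {"customer_id": customer_id, "amount_due": 0.0}


if __name__ == "__main__":
    agent = DemoAgent()
    agent.lookup_bill(42)
    print([call["name"] for call in agent.tool_calls()])
```

A recorded call list like this is the kind of input a tool behavior evaluator would compare against the expected tools for a test case.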

✅ Testing

Unit tests for evaluation metrics
Integration tests for agent evaluation pipeline
GitHub Actions workflow for automated evaluation on PRs

📚 Documentation

Comprehensive README explaining evaluation methodology
Step-by-step setup guide
Troubleshooting section
Environment variables reference

Checklist

Added evaluation framework with local and remote modes
Implemented single-turn (tool-focused) and multi-turn (outcome-focused) evaluation strategies
Integrated Azure AI Foundry built-in evaluators
Created 30 test scenarios with ground truth
Added comprehensive documentation
Added GitHub Actions workflow for CI/CD
Updated .gitignore for generated files

Breaking Changes

None. This is an additive feature.

James N. added 10 commits January 12, 2026 14:06

add ppt

ed557c3

Merge branch 'james-dev' of https://github.com/microsoft/OpenAIWorkshop…

5274c63

… into james-dev

add ppt

acdc7aa

add ppt

2910b59

clean up old documentation references

ca35d9e

add bullet point

fa1c0c6

add database

9c62328

add evaluation

b966c4c

add evaluation

83db3a9

update CI/CD workflow

f2ba847

james-tn requested review from heenajvs and nicoleserafino February 3, 2026 21:13

james-tn had a problem deploying to dev February 3, 2026 21:13 — with GitHub Actions Failure

james-tn merged commit b9b1f03 into int-agentic Feb 3, 2026
7 of 9 checks passed

james-tn merged 10 commits into int-agentic from james-dev