Evaluation strategies that catch failures before your users do.
Companion code for the Medium blog post "My AI agent caused a Sev2 in five days".
A three-layer evaluation framework for production AI agents using Amazon Bedrock AgentCore Evaluations:
- Offline evaluation — test suites that catch regressions before deployment
- Online evaluation — continuous monitoring of production agent behavior
- Gate evaluation — synchronous checks that block irreversible actions
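
As a rough illustration of the gate layer, the sketch below shows one way a synchronous check could allow or block a proposed action. It assumes a Lambda-style handler that receives an action name and an evaluator score. The event fields (`action`, `score`), the `IRREVERSIBLE_ACTIONS` set, and the `MIN_SCORE` threshold are hypothetical and are not taken from gates/gate_evaluator.py.

```python
from typing import Any

# Hypothetical set of actions treated as irreversible; adjust to your agent's tools.
IRREVERSIBLE_ACTIONS = {"delete_resource", "send_payment", "rotate_credentials"}

# Hypothetical minimum evaluator score required to allow an irreversible action.
MIN_SCORE = 0.8


def gate_decision(action: str, score: float) -> dict[str, Any]:
    """Return an allow/block decision for a proposed agent action."""
    if action not in IRREVERSIBLE_ACTIONS:
        return {"allow": True, "reason": "action is reversible"}
    if score >= MIN_SCORE:
        return {"allow": True, "reason": "score meets threshold"}
    return {"allow": False, "reason": f"score {score:.2f} below {MIN_SCORE:.2f}"}


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Lambda-style entry point; the event shape is an assumption."""
    action = event.get("action")
    if not action:
        # Fail closed: a request without a named action is blocked.
        return {"allow": False, "reason": "missing action"}
    # A missing score is treated as 0.0, so irreversible actions are blocked.
    score = float(event.get("score", 0.0))
    return gate_decision(action, score)
```

Because the caller waits for the returned decision before acting, a blocked irreversible action never runs. Failing closed on missing input is a design choice here, not a documented behavior of this repository.
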
evaluations/
    list_evaluators.py          # List all 16 built-in AgentCore evaluators
    offline_evaluation.py       # Run offline test suite against an agent
    online_evaluation.py        # Configure continuous online monitoring
gates/
    gate_evaluator.py           # Lambda function for synchronous action gates
    workflow.asl.json           # Step Functions workflow with gate integration
patterns/
    research_verify.py          # Two-pass "Research + Verify" pattern
tests/
    test_offline_evaluation.py  # Tests for offline evaluation
    test_gate_evaluator.py      # Tests for gate Lambda
    test_research_verify.py     # Tests for Research + Verify pattern
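
The two-pass "Research + Verify" pattern listed under patterns/ is not spelled out in this README. The sketch below is one plausible reading of it, not the contents of research_verify.py: a first call drafts an answer and a second call checks that draft. The `Model` type, the prompt wording, and the PASS/FAIL convention are assumptions. Models are injected as plain callables so the example runs without AWS credentials.

```python
from typing import Callable, Optional, TypedDict

# A model is any callable mapping a prompt to a text response.
# In practice it would wrap an inference call; here it is injected.
Model = Callable[[str], str]


class VerifiedAnswer(TypedDict):
    answer: Optional[str]  # the draft if verified, otherwise None
    verified: bool
    draft: str


def research_and_verify(question: str, research: Model, verify: Model) -> VerifiedAnswer:
    """Run a research pass, then a second pass that checks the draft."""
    draft = research(question)
    verdict = verify(
        f"Question: {question}\n"
        f"Draft answer: {draft}\n"
        "Reply PASS if every claim is supported, otherwise FAIL."
    )
    # Hypothetical convention: the verifier's reply starts with PASS or FAIL.
    verified = verdict.strip().upper().startswith("PASS")
    return {"answer": draft if verified else None, "verified": verified, "draft": draft}


if __name__ == "__main__":
    # Stub models stand in for real inference calls.
    result = research_and_verify(
        "How many hosts are unhealthy?",
        research=lambda prompt: "3 hosts are unhealthy.",
        verify=lambda prompt: "FAIL: no tool output supports this count.",
    )
    print(result["verified"])
```

Withholding the answer when verification fails keeps an unverified draft from reaching downstream steps. The caller can still inspect `draft` for logging.
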
- Python 3.11+
- AWS account with Amazon Bedrock AgentCore access
- boto3 >= 1.35.0
- An AgentCore Runtime with CloudWatch logging enabled
pip install -r requirements.txt
# List available evaluators
python evaluations/list_evaluators.py
# Run offline evaluation
python evaluations/offline_evaluation.py
# Run tests
pytest tests/ -v

- Hallucinated facts: agent fabricates data and presents it as tool output
- Unauthorized actions: agent reasons around prompt-based guardrails
- Compounding errors: subtle mistakes amplify across multi-agent pipelines
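
To make the first failure mode concrete, here is a deliberately crude offline check that flags numbers in an agent response that appear in none of the recorded tool outputs. It is a heuristic for illustration only and does not reflect how the AgentCore evaluators work. The function name and the number-matching rule are assumptions.

```python
import re

# Matches integers and simple decimals such as "340" or "0.003".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(response: str, tool_outputs: list[str]) -> list[str]:
    """Return numbers cited in the response that appear in no tool output."""
    seen: set[str] = set()
    for output in tool_outputs:
        seen.update(_NUMBER.findall(output))
    return [n for n in _NUMBER.findall(response) if n not in seen]


if __name__ == "__main__":
    flagged = unsupported_numbers(
        "Latency was 340 ms across 12 hosts.",
        ['{"latency_ms": 340}'],
    )
    print(flagged)  # ['12']
```

A check like this only catches fabricated figures. Invented names or statuses would need a different comparison.
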
- Offline evaluation: included in AgentCore pricing
- Online evaluation: included in AgentCore pricing
- Gate evaluation (Lambda): ~$0.003 per invocation, ~340ms latency
- Research + Verify: one additional inference call per agent invocation
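
For a rough sense of scale, the per-invocation gate figure above can be turned into a monthly estimate. The workload of 100,000 gated actions per month is a hypothetical number chosen for illustration, and the cost and latency constants are the approximate values quoted in the list.

```python
GATE_COST_USD = 0.003  # approximate cost per gate invocation (from the list above)
GATE_LATENCY_MS = 340  # approximate added latency per gated action


def monthly_gate_cost(gated_actions_per_month: int) -> float:
    """Estimate monthly gate spend in USD, rounded to cents."""
    return round(gated_actions_per_month * GATE_COST_USD, 2)


if __name__ == "__main__":
    # Hypothetical workload: 100,000 gated actions per month.
    print(monthly_gate_cost(100_000))  # 300.0
```

Each gated action also waits roughly `GATE_LATENCY_MS` for the decision, so gating only irreversible actions keeps both cost and added latency down.
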
MIT