Skip to content

aws-samples/sample-bedrock-agentcore-evaluations-framework

Repository files navigation

Building AI agents that work

Evaluation strategies that catch failures before your users do.

Companion code for the Medium blog post: My AI agent caused a Sev2 in five days

What this repo covers

A three-layer evaluation framework for production AI agents using Amazon Bedrock AgentCore Evaluations:

  1. Offline evaluation — test suites that catch regressions before deployment
  2. Online evaluation — continuous monitoring of production agent behavior
  3. Gate evaluation — synchronous checks that block irreversible actions

Structure

evaluations/
  list_evaluators.py          # List all 16 built-in AgentCore evaluators
  offline_evaluation.py       # Run offline test suite against an agent
  online_evaluation.py        # Configure continuous online monitoring

gates/
  gate_evaluator.py           # Lambda function for synchronous action gates
  workflow.asl.json           # Step Functions workflow with gate integration

patterns/
  research_verify.py          # Two-pass "Research + Verify" pattern

tests/
  test_offline_evaluation.py  # Tests for offline evaluation
  test_gate_evaluator.py      # Tests for gate Lambda
  test_research_verify.py     # Tests for Research + Verify pattern

Prerequisites

  • Python 3.11+
  • AWS account with Amazon Bedrock AgentCore access
  • boto3 >= 1.35.0
  • An AgentCore Runtime with CloudWatch logging enabled

Quick start

pip install -r requirements.txt

# List available evaluators
python evaluations/list_evaluators.py

# Run offline evaluation
python evaluations/offline_evaluation.py

# Run tests
pytest tests/ -v

Three failure modes this framework addresses

  1. Hallucinated facts — agent fabricates data and presents it as tool output
  2. Unauthorized actions — agent reasons around prompt-based guardrails
  3. Compounding errors — subtle mistakes amplify across multi-agent pipelines

Cost

  • Offline evaluation: included in AgentCore pricing
  • Online evaluation: included in AgentCore pricing
  • Gate evaluation (Lambda): ~$0.003 per invocation, ~340ms latency
  • Research + Verify: one additional inference call per agent invocation

License

MIT

About

Three-layer evaluation framework for Amazon Bedrock AgentCore agents — offline scoring, online monitoring, and progressive trust gates with Lambda deployment.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages