Building AI agents that work

Evaluation strategies that catch failures before your users do.

Companion code for the Medium blog post "My AI agent caused a Sev2 in five days".

What this repo covers

A three-layer evaluation framework for production AI agents using Amazon Bedrock AgentCore Evaluations:

Offline evaluation — test suites that catch regressions before deployment
Online evaluation — continuous monitoring of production agent behavior
Gate evaluation — synchronous checks that block irreversible actions
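
The gate layer can be pictured as a small synchronous check that runs before any irreversible step. The sketch below is a minimal illustration and is not the code in gates/gate_evaluator.py: the event fields (`action`, `evaluation_score`), the `IRREVERSIBLE_ACTIONS` set, and the 0.8 threshold are all assumptions made for this example.

```python
from typing import Any

# Hypothetical names and values, chosen for illustration only.
IRREVERSIBLE_ACTIONS = {"delete_resource", "send_payment", "rotate_credentials"}
SCORE_THRESHOLD = 0.8


def lambda_handler(event: dict[str, Any], context: Any = None) -> dict[str, Any]:
    """Return an allow/block decision for a proposed agent action."""
    action = event.get("action", "")
    score = event.get("evaluation_score")

    # Reversible actions pass through without a gate decision.
    if action not in IRREVERSIBLE_ACTIONS:
        return {"allowed": True, "reason": "reversible action"}

    # Fail closed: a missing or malformed score blocks the action.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return {"allowed": False, "reason": "no evaluation score"}

    if score < SCORE_THRESHOLD:
        return {"allowed": False, "reason": f"score {score:.2f} below threshold"}
    return {"allowed": True, "reason": "score meets threshold"}
```

Failing closed is the design choice that matters here: if the evaluation result never arrives, the irreversible action does not run.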

Structure

evaluations/
  list_evaluators.py          # List all 16 built-in AgentCore evaluators
  offline_evaluation.py       # Run offline test suite against an agent
  online_evaluation.py        # Configure continuous online monitoring

gates/
  gate_evaluator.py           # Lambda function for synchronous action gates
  workflow.asl.json           # Step Functions workflow with gate integration

patterns/
  research_verify.py          # Two-pass "Research + Verify" pattern

tests/
  test_offline_evaluation.py  # Tests for offline evaluation
  test_gate_evaluator.py      # Tests for gate Lambda
  test_research_verify.py     # Tests for Research + Verify pattern
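
As a rough picture of the two-pass "Research + Verify" pattern listed under patterns/, the sketch below runs one inference call to draft an answer and a second call to check it. It is an assumption about the general shape of such a pattern, not the contents of research_verify.py: `call_model`, the prompt wording, and the VERIFIED/UNVERIFIED convention are hypothetical.

```python
from typing import Callable

# Hypothetical stand-in for a single inference call (prompt in, text out).
ModelFn = Callable[[str], str]


def research_and_verify(question: str, call_model: ModelFn) -> dict[str, object]:
    """Run a research pass, then a verification pass over the draft."""
    # Pass 1: produce a draft answer.
    draft = call_model(f"Research and answer: {question}")

    # Pass 2: the one extra inference call, asked to check the draft.
    verdict = call_model(
        "Reply VERIFIED or UNVERIFIED. Does this answer contain only "
        f"supported claims?\nQuestion: {question}\nAnswer: {draft}"
    )
    verified = verdict.strip().upper().startswith("VERIFIED")
    return {"answer": draft, "verified": verified}
```

Passing the model call in as a function keeps the pattern testable without AWS credentials, since a test can supply a stub in place of a real inference call.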

Prerequisites

Python 3.11+
AWS account with Amazon Bedrock AgentCore access
boto3 >= 1.35.0
An AgentCore Runtime with CloudWatch logging enabled

Quick start

pip install -r requirements.txt

# List available evaluators
python evaluations/list_evaluators.py

# Run offline evaluation
python evaluations/offline_evaluation.py

# Run tests
pytest tests/ -v

Three failure modes this framework addresses

Hallucinated facts — agent fabricates data and presents it as tool output
Unauthorized actions — agent reasons around prompt-based guardrails
Compounding errors — subtle mistakes amplify across multi-agent pipelines
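
To see why compounding errors matter, consider a simplified model in which every step of a pipeline succeeds independently with the same probability. Both the independence assumption and the 95% figure are illustrative, not measurements from this repo.

```python
def pipeline_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# Illustrative only: 95% per-step accuracy across pipelines of growing length.
for steps in (1, 5, 10):
    print(steps, round(pipeline_success_rate(0.95, steps), 3))
```

Under these assumptions a step that is right 95% of the time yields an end-to-end success rate of about 77% after five steps and about 60% after ten.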

Cost

Offline evaluation: included in AgentCore pricing
Online evaluation: included in AgentCore pricing
Gate evaluation (Lambda): ~$0.003 per invocation, ~340 ms latency
Research + Verify: one additional inference call per agent invocation
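
The gate figures above can be turned into a back-of-the-envelope estimate. The sketch below uses the approximate per-invocation cost and latency from the list; the traffic numbers, the 30-day month, and the assumption that gates in one request run sequentially are illustrative.

```python
GATE_COST_PER_INVOCATION_USD = 0.003  # approximate figure from the list above
GATE_LATENCY_MS = 340                 # approximate figure from the list above


def monthly_gate_cost(gated_actions_per_day: int, days: int = 30) -> float:
    """Estimated gate spend in USD over `days` days."""
    return gated_actions_per_day * days * GATE_COST_PER_INVOCATION_USD


def added_latency_seconds(gated_actions_in_request: int) -> float:
    """Extra latency per request, assuming the gates run one after another."""
    return gated_actions_in_request * GATE_LATENCY_MS / 1000


# Hypothetical workload: 1,000 gated actions per day, 2 gates per request.
print(f"${monthly_gate_cost(1000):.2f} per month")
print(f"{added_latency_seconds(2):.2f} s added per request")
```

For that hypothetical workload the estimate comes to roughly $90 per month and about 0.68 s of added latency per request.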

License

MIT
