An interactive showcase of the SAFE framework for designing and evaluating responsible agentic systems, based on the article "SAFE: Designing Responsible Agentic Systems".
SAFE stands for Scope · Anchored Decisions · Flow Integrity · Escalation — four principles that keep agent autonomy bounded, evidence-grounded, and stoppable.
- Overview page — explains all four SAFE principles with examples and observable signals
- Live Demo — chat with an agent (banking or clinical triage), pick any configured model, and see real-time SAFE scores per response
- Compare — send the same message to an aligned agent and a failure-mode agent side by side
- Multi-provider model selector — switch between Anthropic, OpenAI, Azure OpenAI, and Azure AI Foundry from the UI
| Layer | Tech |
|---|---|
| Backend | FastAPI + Anthropic SDK + OpenAI SDK |
| Frontend | React 18 + TypeScript + Tailwind CSS |
| Agent | Any configured model (see providers below) |
| Evaluator (judge) | Auto-selected: Haiku → gpt-4o-mini → first available |
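
The judge fallback order in the table above can be read as a priority search over whatever models are configured. The sketch below is one way to express that idea. The function name `pick_judge`, the substring matching, and the model IDs in the usage note are assumptions for illustration, not code taken from `safe_evaluator.py`.

```python
from typing import Optional, Sequence

# Preference order from the table: Haiku first, then gpt-4o-mini.
# Matching on lowercase substrings is an assumption made for this sketch.
JUDGE_PREFERENCES = ("haiku", "gpt-4o-mini")


def pick_judge(available: Sequence[str]) -> Optional[str]:
    """Return the preferred judge model, or the first available one."""
    for preferred in JUDGE_PREFERENCES:
        for model in available:
            if preferred in model.lower():
                return model
    # Nothing preferred is configured: fall back to the first model, if any.
    return available[0] if available else None
```

For example, `pick_judge(["gpt-4o", "gpt-4o-mini"])` would select `gpt-4o-mini`, while a list containing only `o3-mini` would fall through to that model.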
Configure one or more providers by adding keys to .env. All configured providers appear in the model selector; only the keys you set are required.
| Provider | Required env vars | Models shown |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | Opus 4.6, Sonnet 4.6, Haiku 4.5 |
| OpenAI | OPENAI_API_KEY | GPT-4o, GPT-4o mini, o3-mini |
| Azure OpenAI | AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT | Your deployment |
| Azure AI Foundry | AZURE_FOUNDRY_ENDPOINT, AZURE_FOUNDRY_API_KEY, AZURE_FOUNDRY_MODEL | Your deployed model |
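
The rule that only fully configured providers appear in the selector could be implemented roughly as follows. This is a minimal sketch, assuming a provider counts as configured when every required variable from the table is set and non-empty. The provider identifiers (`anthropic`, `azure_openai`, and so on) and the function name are hypothetical and may not match `backend/config.py`.

```python
import os
from typing import Dict, List, Mapping, Tuple

# Required variables per provider, copied from the table above.
# The dictionary keys are illustrative identifiers, not the project's own.
REQUIRED_VARS: Dict[str, Tuple[str, ...]] = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
    "azure_openai": (
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_DEPLOYMENT",
    ),
    "azure_foundry": (
        "AZURE_FOUNDRY_ENDPOINT",
        "AZURE_FOUNDRY_API_KEY",
        "AZURE_FOUNDRY_MODEL",
    ),
}


def configured_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """List providers whose required variables are all set and non-empty."""
    return [
        name
        for name, keys in REQUIRED_VARS.items()
        if all(env.get(key, "").strip() for key in keys)
    ]
```

Passing the environment as a mapping keeps the check easy to exercise without touching the real process environment. A partially configured Azure provider (for example, an endpoint without a key) is simply left out.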
Copy .env and fill in at least one provider:
# Anthropic
ANTHROPIC_API_KEY=sk-ant-...
# OpenAI (optional)
# OPENAI_API_KEY=sk-...
# OPENAI_MODEL=gpt-4o
# Azure OpenAI (optional)
# AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com/
# AZURE_OPENAI_API_KEY=...
# AZURE_OPENAI_DEPLOYMENT=gpt-4o
# AZURE_OPENAI_API_VERSION=2024-12-01-preview
# Azure AI Foundry serverless (optional)
# AZURE_FOUNDRY_ENDPOINT=https://<endpoint>.inference.ai.azure.com
# AZURE_FOUNDRY_API_KEY=...
# AZURE_FOUNDRY_MODEL=Meta-Llama-3.1-70B-Instruct

pip install -r requirements.txt
uvicorn backend.main:app --reload

cd frontend
npm install
npm run dev

Open http://localhost:5173.
| Method | Path | Description |
|---|---|---|
| GET | /api/providers | Returns configured providers and their available models |
| POST | /api/chat | Run agent + SAFE evaluation for one turn |
| POST | /api/compare | Run aligned and failure agents in parallel |
/api/chat and /api/compare accept optional provider and model fields. If omitted, the first configured provider and its default model are used.
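
To illustrate the optional fields, here is a small sketch of a request-body builder that omits `provider` and `model` unless they are given, so the server-side defaults apply. Only the `provider` and `model` field names come from the text above. The `message` field name and the `"openai"` provider identifier are assumptions, so check `backend/schemas.py` for the actual request schema.

```python
import json
from typing import Any, Dict, Optional


def build_chat_payload(
    message: str,
    provider: Optional[str] = None,
    model: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a request body, leaving out provider/model when not given."""
    # "message" is an assumed field name, not confirmed by the README.
    payload: Dict[str, Any] = {"message": message}
    if provider is not None:
        payload["provider"] = provider
    if model is not None:
        payload["model"] = model
    return payload


# Default provider and model (fields omitted), then an explicit choice.
default_body = json.dumps(build_chat_payload("What is my balance?"))
explicit_body = json.dumps(
    build_chat_payload("What is my balance?", provider="openai", model="gpt-4o")
)
```

Either body could then be sent as JSON in a POST to `/api/chat` or `/api/compare` with any HTTP client.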
├── backend/
│   ├── main.py                  # FastAPI app
│   ├── config.py                # Multi-provider env config
│   ├── schemas.py               # Pydantic models
│   ├── routes/
│   │   ├── agent.py             # POST /api/chat
│   │   ├── compare.py           # POST /api/compare
│   │   └── providers.py         # GET /api/providers
│   └── services/
│       ├── provider.py          # Provider abstraction (Anthropic / OpenAI / Azure)
│       ├── agent_service.py     # Scenario prompts + agent runner
│       └── safe_evaluator.py    # SAFE evaluation (LLM-as-judge)
└── frontend/src/
    ├── App.tsx
    ├── api/client.ts
    ├── components/
    │   ├── Layout.tsx
    │   ├── ModelSelector.tsx    # Provider + model dropdown
    │   └── SafeScorePanel.tsx
    └── pages/
        ├── Home.tsx             # SAFE overview
        ├── Demo.tsx             # Interactive demo
        └── Compare.tsx          # Side-by-side comparison
A complete evaluation pipeline for measuring SAFE framework compliance across a hybrid 100-scenario golden dataset.
100 scenarios across two domains (banking, triage), drawn from three sources:
| Source | Count | License |
|---|---|---|
| τ²-bench (Yao et al., 2024) | 40 | MIT |
| Agent-SafetyBench (Zhang et al., 2024) | 30 | MIT |
| Custom (this study) | 30 | — |
Each scenario has per-dimension ground truth labels (compliant / violation) for all four SAFE dimensions.
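
As a rough picture of what a labelled scenario might look like, the sketch below validates one JSONL record against the four SAFE dimensions. The field names (`id`, `domain`, `ground_truth`) and the dimension keys are hypothetical, so the real schema in `data/golden_dataset/` may differ. Only the two label values and the four dimensions come from the text.

```python
import json
from typing import Dict

# Dimension keys are illustrative spellings of the four SAFE principles.
SAFE_DIMENSIONS = ("scope", "anchored_decisions", "flow_integrity", "escalation")
VALID_LABELS = {"compliant", "violation"}


def validate_labels(labels: Dict[str, str]) -> None:
    """Raise ValueError unless every SAFE dimension has a valid label."""
    missing = [dim for dim in SAFE_DIMENSIONS if dim not in labels]
    if missing:
        raise ValueError(f"missing labels for: {', '.join(missing)}")
    invalid = {dim: labels[dim] for dim in SAFE_DIMENSIONS if labels[dim] not in VALID_LABELS}
    if invalid:
        raise ValueError(f"invalid labels: {invalid}")


# A made-up record, serialized as one JSON object per line as in a .jsonl file.
line = json.dumps({
    "id": "example-001",
    "domain": "banking",
    "ground_truth": {
        "scope": "compliant",
        "anchored_decisions": "violation",
        "flow_integrity": "compliant",
        "escalation": "compliant",
    },
})
record = json.loads(line)
validate_labels(record["ground_truth"])  # passes silently for a valid record
```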
python -m experiments.merge_datasets
# → data/golden_dataset/banking.jsonl (71 scenarios)
# → data/golden_dataset/triage.jsonl (29 scenarios)
# → data/golden_dataset/manifest.json

# Evaluate aligned agent on both scenarios
python -m experiments.batch_eval --mode aligned
# Evaluate failure-mode agent (for comparison)
python -m experiments.batch_eval --mode failure
# Resume an interrupted run
python -m experiments.batch_eval --mode aligned --resume
# Test on a small subset first
python -m experiments.batch_eval --scenario banking --limit 5

Results are written to experiments/results/ (gitignored).
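
One plausible way for a resumable run to work is to skip scenarios whose IDs already appear in the results file. The sketch below shows that pattern for a JSONL file. It is an assumption about how `--resume` could behave, not a description of `batch_eval.py`, and the `id` field and function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, Set


def completed_ids(results_path: Path) -> Set[str]:
    """Collect scenario IDs already present in a JSONL results file."""
    if not results_path.exists():
        return set()
    done: Set[str] = set()
    with results_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A truncated or malformed line is ignored, so that
                # scenario is evaluated again on the next run.
                continue
    return done


def pending(scenarios: Iterable[Dict], results_path: Path) -> Iterator[Dict]:
    """Yield only the scenarios that still need to be evaluated."""
    done = completed_ids(results_path)
    for scenario in scenarios:
        if scenario["id"] not in done:
            yield scenario
```

Appending one JSON line per finished scenario makes this kind of resume cheap: an interrupted run loses at most the scenario that was in flight.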
# Analyze aligned results only
python -m experiments.analyze --mode aligned
# Full mode comparison (generates Mann-Whitney U tests)
python -m experiments.analyze --mode both
# Figures saved to experiments/figures/ (gitignored)

Produces:
- Per-dimension Precision / Recall / F1 / Cohen's κ
- Mean scores by source (τ²-bench vs ASB vs custom)
- Mann-Whitney U tests + rank-biserial effect sizes (aligned vs failure)
- PNG figures at 300 dpi
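
For readers unfamiliar with the statistics listed above, here is a dependency-free sketch of how they are conventionally defined. It is not the code in `analyze.py`, which may use a statistics library instead. It treats "violation" as the positive class and uses the pair-counting form of the rank-biserial correlation, both of which are assumptions about the setup.

```python
from typing import Dict, Sequence


def detection_metrics(truth: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, F1 and Cohen's kappa, with True meaning 'violation'."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = n - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    observed = (tp + tn) / n
    # Chance agreement computed from the marginal rates of each label.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


def rank_biserial(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank-biserial r = 2U / (n1 * n2) - 1, where U counts pairs with x > y."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    # U is the Mann-Whitney statistic for x; ties contribute one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return 2 * u / (len(x) * len(y)) - 1
```

With this sign convention, a value of 1 means every score in the first sample exceeds every score in the second, and 0 means neither sample tends to rank higher.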
python -m experiments.annotate \
  --input experiments/results/banking_aligned_results.jsonl \
  --output experiments/annotations/rater1_banking.jsonl \
  --annotator rater1

python -m experiments.generate_paper_tables
# → experiments/paper_tables/table1_detection_*.tex
# → experiments/paper_tables/table2_sources_*.tex
# → experiments/paper_tables/table3_mode_comparison.tex

experiments/
├── __init__.py
├── merge_datasets.py # Phase 1: build golden dataset
├── batch_eval.py # Phase 2: run LLM evaluations
├── analyze.py # Phase 2: statistical analysis + figures
├── annotate.py # Phase 2: human annotation CLI
└── generate_paper_tables.py # Phase 4: LaTeX table output
data/
├── public_adapted/
│   ├── tau_bench/
│   │   ├── adapted_banking.jsonl      (20 scenarios)
│   │   └── adapted_retail.jsonl       (20 scenarios)
│   └── agent_safety_bench/
│       └── adapted_scenarios.jsonl    (30 scenarios)
├── custom/
│   ├── banking_safe_specific.jsonl    (15 scenarios)
│   └── triage_safe_specific.jsonl     (15 scenarios)
└── golden_dataset/
    ├── banking.jsonl                  (71 scenarios, generated)
    ├── triage.jsonl                   (29 scenarios, generated)
    ├── manifest.json                  (source stats + citations)
    └── README.md
The original Microsoft Foundry evaluation platform demo is preserved in demo_backup/.