SAFE Framework — Responsible Agentic Systems Demo

An interactive showcase of the SAFE framework for designing and evaluating responsible agentic systems, based on the article "SAFE: Designing Responsible Agentic Systems".

SAFE stands for Scope · Anchored Decisions · Flow Integrity · Escalation — four principles that keep agent autonomy bounded, evidence-grounded, and stoppable.

Features

Overview page — explains all four SAFE principles with examples and observable signals
Live Demo — chat with an agent (banking or clinical triage), pick any configured model, and see real-time SAFE scores per response
Compare — send the same message to an aligned agent and a failure-mode agent side by side
Multi-provider model selector — switch between Anthropic, OpenAI, Azure OpenAI, and Azure AI Foundry from the UI

Stack

Layer	Tech
Backend	FastAPI + Anthropic SDK + OpenAI SDK
Frontend	React 18 + TypeScript + Tailwind CSS
Agent	Any configured model (see providers below)
Evaluator (judge)	Auto-selected: Haiku → gpt-4o-mini → first available
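
The judge fallback order in the table above (Haiku, then gpt-4o-mini, then the first available model) can be sketched as follows. This is a minimal illustration, not the project's actual selection code: the function name `pick_judge`, the `JUDGE_PREFERENCE` tuple, and the substring matching on model IDs are all assumptions.

```python
from typing import Optional, Sequence

# Hypothetical preference order; these are substrings to match, not real model IDs.
JUDGE_PREFERENCE = ("haiku", "gpt-4o-mini")


def pick_judge(available: Sequence[str]) -> Optional[str]:
    """Return the most preferred judge model, else the first available one."""
    for wanted in JUDGE_PREFERENCE:
        for model in available:
            if wanted in model.lower():
                return model
    # No preferred model is configured: fall back to whatever comes first.
    return available[0] if available else None
```

With this ordering, a cheaper model is used for judging whenever one is configured, regardless of which model runs the agent.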

Providers

Configure one or more providers by adding keys to .env. All configured providers appear in the model selector; only the keys you set are required.

Provider	Required env vars	Models shown
Anthropic	`ANTHROPIC_API_KEY`	Opus 4.6, Sonnet 4.6, Haiku 4.5
OpenAI	`OPENAI_API_KEY`	GPT-4o, GPT-4o mini, o3-mini
Azure OpenAI	`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`	Your deployment
Azure AI Foundry	`AZURE_FOUNDRY_ENDPOINT`, `AZURE_FOUNDRY_API_KEY`, `AZURE_FOUNDRY_MODEL`	Your deployed model
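
One way to decide which providers count as configured is to check that every required variable from the table above is set. The sketch below assumes that rule; the provider keys (`anthropic`, `azure_openai`, and so on) and the function name are hypothetical and may not match backend/config.py.

```python
import os
from typing import Dict, List, Mapping, Optional

# Required env vars per provider, copied from the table above.
# The dictionary keys are hypothetical provider identifiers.
REQUIRED_VARS: Dict[str, List[str]] = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "azure_openai": [
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_DEPLOYMENT",
    ],
    "azure_foundry": [
        "AZURE_FOUNDRY_ENDPOINT",
        "AZURE_FOUNDRY_API_KEY",
        "AZURE_FOUNDRY_MODEL",
    ],
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List providers whose required variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, keys in REQUIRED_VARS.items()
        if all(env.get(key, "").strip() for key in keys)
    ]
```

A provider with only some of its variables set (for example an Azure endpoint without a key) is treated as not configured.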

Setup

1. Environment

Copy .env and fill in at least one provider:

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI (optional)
# OPENAI_API_KEY=sk-...
# OPENAI_MODEL=gpt-4o

# Azure OpenAI (optional)
# AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com/
# AZURE_OPENAI_API_KEY=...
# AZURE_OPENAI_DEPLOYMENT=gpt-4o
# AZURE_OPENAI_API_VERSION=2024-12-01-preview

# Azure AI Foundry serverless (optional)
# AZURE_FOUNDRY_ENDPOINT=https://<endpoint>.inference.ai.azure.com
# AZURE_FOUNDRY_API_KEY=...
# AZURE_FOUNDRY_MODEL=Meta-Llama-3.1-70B-Instruct

2. Backend

pip install -r requirements.txt
uvicorn backend.main:app --reload

3. Frontend

cd frontend
npm install
npm run dev

Open http://localhost:5173.

API

Method	Path	Description
`GET`	`/api/providers`	Returns configured providers and their available models
`POST`	`/api/chat`	Run agent + SAFE evaluation for one turn
`POST`	`/api/compare`	Run aligned and failure agents in parallel

/api/chat and /api/compare accept optional provider and model fields. If omitted, the first configured provider and its default model are used.
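
A request to /api/chat might be built as in the sketch below. Only the `provider` and `model` fields are documented above; the `message` field name and the base URL (uvicorn's default port 8000) are assumptions, so check backend/schemas.py for the real request model.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000"  # assumption: uvicorn's default host and port


def build_chat_payload(
    message: str,
    provider: Optional[str] = None,
    model: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a request body; provider and model are left out when not given."""
    payload: Dict[str, Any] = {"message": message}  # "message" is a hypothetical field
    if provider is not None:
        payload["provider"] = provider
    if model is not None:
        payload["model"] = model
    return payload


def post_json(path: str, payload: Dict[str, Any]) -> Any:
    """POST a JSON body to the backend and decode the JSON response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `post_json("/api/chat", build_chat_payload("What is my balance?"))` with the backend running would leave provider and model selection to the server defaults.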

Project structure

├── backend/
│   ├── main.py                  # FastAPI app
│   ├── config.py                # Multi-provider env config
│   ├── schemas.py               # Pydantic models
│   ├── routes/
│   │   ├── agent.py             # POST /api/chat
│   │   ├── compare.py           # POST /api/compare
│   │   └── providers.py         # GET /api/providers
│   └── services/
│       ├── provider.py          # Provider abstraction (Anthropic / OpenAI / Azure)
│       ├── agent_service.py     # Scenario prompts + agent runner
│       └── safe_evaluator.py    # SAFE evaluation (LLM-as-judge)
└── frontend/src/
    ├── App.tsx
    ├── api/client.ts
    ├── components/
    │   ├── Layout.tsx
    │   ├── ModelSelector.tsx    # Provider + model dropdown
    │   └── SafeScorePanel.tsx
    └── pages/
        ├── Home.tsx             # SAFE overview
        ├── Demo.tsx             # Interactive demo
        └── Compare.tsx          # Side-by-side comparison

Experiment pipeline (AIES 2026)

A complete evaluation pipeline for measuring SAFE framework compliance across a 100-scenario golden dataset that combines adapted public benchmarks with custom scenarios.

Dataset

100 scenarios across two domains (banking, triage), drawn from three sources:

Source	Count	License
τ²-bench (Yao et al., 2024)	40	MIT
Agent-SafetyBench (Zhang et al., 2024)	30	MIT
Custom (this study)	30	—

Each scenario has per-dimension ground truth labels (compliant / violation) for all four SAFE dimensions.
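
A check that a scenario carries one label per SAFE dimension could look like the sketch below. The record layout is an assumption: the `ground_truth` key and the dimension keys (`scope`, `anchored_decisions`, `flow_integrity`, `escalation`) are hypothetical names derived from the four principles, not the dataset's confirmed schema.

```python
from typing import Any, Dict

# Hypothetical dimension keys, derived from the four SAFE principles.
SAFE_DIMENSIONS = ("scope", "anchored_decisions", "flow_integrity", "escalation")
ALLOWED_LABELS = {"compliant", "violation"}


def validate_labels(record: Dict[str, Any]) -> None:
    """Raise ValueError unless every SAFE dimension has a valid label."""
    labels = record.get("ground_truth", {})  # hypothetical field name
    missing = [dim for dim in SAFE_DIMENSIONS if dim not in labels]
    if missing:
        raise ValueError(f"missing labels for: {', '.join(missing)}")
    invalid = {dim: labels[dim] for dim in SAFE_DIMENSIONS
               if labels[dim] not in ALLOWED_LABELS}
    if invalid:
        raise ValueError(f"invalid labels: {invalid}")
```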

Phase 1 — Build golden dataset

python -m experiments.merge_datasets
# → data/golden_dataset/banking.jsonl  (71 scenarios)
# → data/golden_dataset/triage.jsonl   (29 scenarios)
# → data/golden_dataset/manifest.json
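
Conceptually, the merge step groups scenario records by domain and records per-source counts in a manifest. The sketch below shows that idea only; the field names `scenario_id`, `domain`, and `source` are hypothetical, and how experiments/merge_datasets.py actually assigns each adapted scenario to a domain is not shown here.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def merge_by_domain(records: Iterable[Record]) -> Tuple[Dict[str, List[Record]], Dict[str, Any]]:
    """Group records by domain and build a small manifest of counts."""
    by_domain: Dict[str, List[Record]] = defaultdict(list)
    per_source: Counter = Counter()
    seen = set()
    for rec in records:
        # Reject duplicate ids so the combined dataset stays unambiguous.
        if rec["scenario_id"] in seen:
            raise ValueError(f"duplicate scenario id: {rec['scenario_id']}")
        seen.add(rec["scenario_id"])
        by_domain[rec["domain"]].append(rec)
        per_source[rec["source"]] += 1
    manifest = {
        "total": len(seen),
        "by_domain": {domain: len(items) for domain, items in by_domain.items()},
        "by_source": dict(per_source),
    }
    return dict(by_domain), manifest
```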

Phase 2 — Run batch evaluation

# Evaluate the aligned agent on both scenarios (banking and triage)
python -m experiments.batch_eval --mode aligned

# Evaluate failure-mode agent (for comparison)
python -m experiments.batch_eval --mode failure

# Resume an interrupted run
python -m experiments.batch_eval --mode aligned --resume

# Test on a small subset first
python -m experiments.batch_eval --scenario banking --limit 5

Results are written to experiments/results/ (gitignored).
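
A common way to implement a resume option like `--resume` is to skip scenarios whose ids already appear in the results file. The sketch below assumes one JSON object per line with a `scenario_id` field; both the field name and the helper names are hypothetical and may differ from experiments/batch_eval.py.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Set


def completed_ids(results_path: Path) -> Set[str]:
    """Collect scenario ids already present in a JSONL results file."""
    if not results_path.exists():
        return set()
    done: Set[str] = set()
    with results_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["scenario_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # Skip lines that cannot be read, e.g. a half-written last line.
                continue
    return done


def pending(
    scenarios: Sequence[Dict[str, Any]],
    results_path: Path,
    limit: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Return scenarios not yet evaluated, optionally capped at `limit`."""
    done = completed_ids(results_path)
    todo = [s for s in scenarios if s["scenario_id"] not in done]
    return todo if limit is None else todo[:limit]
```

Appending each result as soon as it is produced keeps an interrupted run recoverable, since at most the last line can be incomplete.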

Phase 2 — Statistical analysis

# Analyze aligned results only
python -m experiments.analyze --mode aligned

# Compare both modes (runs Mann-Whitney U tests, aligned vs failure)
python -m experiments.analyze --mode both

# Figures saved to experiments/figures/ (gitignored)

Produces:

Per-dimension Precision / Recall / F1 / Cohen's κ
Mean scores by source (τ²-bench vs ASB vs custom)
Mann-Whitney U tests + rank-biserial effect sizes (aligned vs failure)
PNG figures at 300 dpi
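
For reference, the detection metrics and the rank-biserial effect size listed above can be computed as in the standard-library sketch below. It is not the project's analysis code: treating `violation` as the positive class is an assumption, and the p-value of the Mann-Whitney U test is left to a statistics library.

```python
from typing import Dict, Sequence, Tuple


def detection_metrics(
    truth: Sequence[str],
    predicted: Sequence[str],
    positive: str = "violation",  # assumption: violations are the positive class
) -> Dict[str, float]:
    """Precision, recall and F1 for one SAFE dimension."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, rank-biserial correlation)."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    # U counts pairs where x beats y; ties contribute one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    # Rank-biserial r = 2U / (n1 * n2) - 1, ranging from -1 to 1.
    r = 2.0 * u / (len(x) * len(y)) - 1.0
    return u, r
```

A rank-biserial value near 1 means scores in the first sample (for example the aligned agent) almost always exceed those in the second; a value near 0 means no systematic difference.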

Phase 2 — Human annotation (inter-rater reliability)

python -m experiments.annotate \
    --input experiments/results/banking_aligned_results.jsonl \
    --output experiments/annotations/rater1_banking.jsonl \
    --annotator rater1
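
Once two raters have annotated the same results, inter-rater reliability can be summarized with Cohen's κ. The sketch below computes κ from two aligned label lists; loading the labels from the annotation files is omitted because their schema is not described here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa(["compliant", "compliant", "violation", "violation"], ["compliant", "violation", "violation", "violation"])` gives 0.5: the raters agree on 3 of 4 items, against an expected chance agreement of 0.5.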

Phase 4 — Generate LaTeX paper tables

python -m experiments.generate_paper_tables
# → experiments/paper_tables/table1_detection_*.tex
# → experiments/paper_tables/table2_sources_*.tex
# → experiments/paper_tables/table3_mode_comparison.tex

Experiment directory structure

experiments/
├── __init__.py
├── merge_datasets.py        # Phase 1: build golden dataset
├── batch_eval.py            # Phase 2: run LLM evaluations
├── analyze.py               # Phase 2: statistical analysis + figures
├── annotate.py              # Phase 2: human annotation CLI
└── generate_paper_tables.py # Phase 4: LaTeX table output

data/
├── public_adapted/
│   ├── tau_bench/
│   │   ├── adapted_banking.jsonl   (20 scenarios)
│   │   └── adapted_retail.jsonl    (20 scenarios)
│   └── agent_safety_bench/
│       └── adapted_scenarios.jsonl (30 scenarios)
├── custom/
│   ├── banking_safe_specific.jsonl (15 scenarios)
│   └── triage_safe_specific.jsonl  (15 scenarios)
└── golden_dataset/
    ├── banking.jsonl               (71 scenarios — generated)
    ├── triage.jsonl                (29 scenarios — generated)
    ├── manifest.json               (source stats + citations)
    └── README.md

Original demo

The original Microsoft Foundry evaluation platform demo is preserved in demo_backup/.
