ogkranthi/evals

SAFE Framework — Responsible Agentic Systems Demo

An interactive showcase of the SAFE framework for designing and evaluating responsible agentic systems, based on the article "SAFE: Designing Responsible Agentic Systems".

SAFE stands for Scope · Anchored Decisions · Flow Integrity · Escalation — four principles that keep agent autonomy bounded, evidence-grounded, and stoppable.

Features

  • Overview page — explains all four SAFE principles with examples and observable signals
  • Live Demo — chat with an agent (banking or clinical triage), pick any configured model, and see real-time SAFE scores per response
  • Compare — send the same message to an aligned agent and a failure-mode agent side by side
  • Multi-provider model selector — switch between Anthropic, OpenAI, Azure OpenAI, and Azure AI Foundry from the UI

Stack

| Layer | Tech |
| --- | --- |
| Backend | FastAPI + Anthropic SDK + OpenAI SDK |
| Frontend | React 18 + TypeScript + Tailwind CSS |
| Agent | Any configured model (see Providers below) |
| Evaluator (judge) | Auto-selected: Haiku → gpt-4o-mini → first available |
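
The judge fallback order (Haiku, then gpt-4o-mini, then whatever is available) can be read as a preference chain. The sketch below is illustrative only and is not taken from the repository: `pick_judge`, `JUDGE_PREFERENCE`, and the substring-matching rule are assumptions, and the real model identifiers may differ.

```python
from typing import Optional, Sequence

# Hypothetical preference keys; the backend's actual model IDs may differ.
JUDGE_PREFERENCE = ("haiku", "gpt-4o-mini")


def pick_judge(available: Sequence[str]) -> Optional[str]:
    """Return the most preferred judge model, else the first available one."""
    for preferred in JUDGE_PREFERENCE:
        for model in available:
            # Substring match so that versioned names still qualify.
            if preferred in model.lower():
                return model
    # No preferred judge is configured: fall back to the first model, if any.
    return available[0] if available else None
```

Matching on a substring keeps the chain stable when a provider renames or versions a model, at the cost of being less strict than an exact ID lookup.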

Providers

Configure one or more providers by adding their keys to .env. Every configured provider appears in the model selector; you only need keys for the providers you want to use.

| Provider | Required env vars | Models shown |
| --- | --- | --- |
| Anthropic | ANTHROPIC_API_KEY | Opus 4.6, Sonnet 4.6, Haiku 4.5 |
| OpenAI | OPENAI_API_KEY | GPT-4o, GPT-4o mini, o3-mini |
| Azure OpenAI | AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT | Your deployment |
| Azure AI Foundry | AZURE_FOUNDRY_ENDPOINT, AZURE_FOUNDRY_API_KEY, AZURE_FOUNDRY_MODEL | Your deployed model |
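
A provider counts as configured only when all of its required variables are set. One way this check might look is sketched below; it is not the code in backend/config.py, and the provider keys (`anthropic`, `azure_openai`, and so on) and the function name are hypothetical. Only the environment variable names come from the table.

```python
import os
from typing import Dict, List, Mapping, Optional

# Required variables per provider, as listed in the Providers table.
# The dictionary keys are hypothetical labels chosen for this sketch.
REQUIRED_VARS: Dict[str, List[str]] = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "azure_openai": [
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_DEPLOYMENT",
    ],
    "azure_foundry": [
        "AZURE_FOUNDRY_ENDPOINT",
        "AZURE_FOUNDRY_API_KEY",
        "AZURE_FOUNDRY_MODEL",
    ],
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List providers whose required variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, required in REQUIRED_VARS.items()
        if all(env.get(var, "").strip() for var in required)
    ]
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.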

Setup

1. Environment

Create a .env file and fill in the keys for at least one provider:

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI (optional)
# OPENAI_API_KEY=sk-...
# OPENAI_MODEL=gpt-4o

# Azure OpenAI (optional)
# AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com/
# AZURE_OPENAI_API_KEY=...
# AZURE_OPENAI_DEPLOYMENT=gpt-4o
# AZURE_OPENAI_API_VERSION=2024-12-01-preview

# Azure AI Foundry serverless (optional)
# AZURE_FOUNDRY_ENDPOINT=https://<endpoint>.inference.ai.azure.com
# AZURE_FOUNDRY_API_KEY=...
# AZURE_FOUNDRY_MODEL=Meta-Llama-3.1-70B-Instruct

2. Backend

pip install -r requirements.txt
uvicorn backend.main:app --reload

3. Frontend

cd frontend
npm install
npm run dev

Open http://localhost:5173.

API

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/providers | Returns configured providers and their available models |
| POST | /api/chat | Run agent + SAFE evaluation for one turn |
| POST | /api/compare | Run aligned and failure agents in parallel |

/api/chat and /api/compare accept optional provider and model fields. If omitted, the first configured provider and its default model are used.
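
The defaulting rule for omitted fields can be sketched as follows. This is a minimal illustration, not the backend's implementation: `resolve_model` is a hypothetical helper, and treating the first model in each list as that provider's default is an assumption of the sketch.

```python
from typing import Dict, List, Optional, Tuple


def resolve_model(
    configured: Dict[str, List[str]],
    provider: Optional[str] = None,
    model: Optional[str] = None,
) -> Tuple[str, str]:
    """Fill in omitted provider/model fields.

    `configured` maps provider name -> model list in configuration order.
    The first model in each list is treated as the provider's default
    (an assumption of this sketch).
    """
    if not configured:
        raise ValueError("no providers configured")
    if provider is None:
        provider = next(iter(configured))  # first configured provider
    if provider not in configured:
        raise ValueError(f"unknown provider: {provider}")
    if model is None:
        model = configured[provider][0]  # that provider's default model
    return provider, model
```

Because Python dictionaries preserve insertion order, "first configured provider" here simply means the first key added to `configured`.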

Project structure

├── backend/
│   ├── main.py                  # FastAPI app
│   ├── config.py                # Multi-provider env config
│   ├── schemas.py               # Pydantic models
│   ├── routes/
│   │   ├── agent.py             # POST /api/chat
│   │   ├── compare.py           # POST /api/compare
│   │   └── providers.py         # GET /api/providers
│   └── services/
│       ├── provider.py          # Provider abstraction (Anthropic / OpenAI / Azure)
│       ├── agent_service.py     # Scenario prompts + agent runner
│       └── safe_evaluator.py    # SAFE evaluation (LLM-as-judge)
└── frontend/src/
    ├── App.tsx
    ├── api/client.ts
    ├── components/
    │   ├── Layout.tsx
    │   ├── ModelSelector.tsx    # Provider + model dropdown
    │   └── SafeScorePanel.tsx
    └── pages/
        ├── Home.tsx             # SAFE overview
        ├── Demo.tsx             # Interactive demo
        └── Compare.tsx          # Side-by-side comparison

Experiment pipeline (AIES 2026)

A complete evaluation pipeline for measuring SAFE framework compliance across a hybrid 100-scenario golden dataset.

Dataset

100 scenarios across two domains (banking, triage), drawn from three sources:

| Source | Count | License |
| --- | --- | --- |
| τ²-bench (Yao et al., 2024) | 40 | MIT |
| Agent-SafetyBench (Zhang et al., 2024) | 30 | MIT |
| Custom (this study) | 30 | |

Each scenario has per-dimension ground truth labels (compliant / violation) for all four SAFE dimensions.
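
To make the label structure concrete, the sketch below validates that a record carries one label per SAFE dimension. The field names (`id`, `domain`, `ground_truth`) and the dimension keys are assumptions for illustration; the actual JSONL schema may differ.

```python
import json
from typing import Dict

# Hypothetical keys for the four SAFE dimensions.
SAFE_DIMENSIONS = ("scope", "anchored_decisions", "flow_integrity", "escalation")
VALID_LABELS = {"compliant", "violation"}


def validate_labels(labels: Dict[str, str]) -> None:
    """Raise ValueError unless every SAFE dimension has a valid label."""
    missing = [d for d in SAFE_DIMENSIONS if d not in labels]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    invalid = {d: labels[d] for d in SAFE_DIMENSIONS if labels[d] not in VALID_LABELS}
    if invalid:
        raise ValueError(f"invalid labels: {invalid}")


# Hypothetical record, made up for this example.
record = json.loads(
    '{"id": "example-001", "domain": "banking", "ground_truth": {'
    '"scope": "violation", "anchored_decisions": "compliant", '
    '"flow_integrity": "compliant", "escalation": "compliant"}}'
)
validate_labels(record["ground_truth"])  # passes silently
```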

Phase 1 — Build golden dataset

python -m experiments.merge_datasets
# → data/golden_dataset/banking.jsonl  (71 scenarios)
# → data/golden_dataset/triage.jsonl   (29 scenarios)
# → data/golden_dataset/manifest.json
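
Conceptually, the merge step groups scenarios by domain and records per-source counts for the manifest. The sketch below shows that idea only; it is not the logic in experiments/merge_datasets.py, and the `domain` and `source` field names are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def merge_by_domain(
    records: Iterable[dict],
) -> Tuple[Dict[str, List[dict]], Dict[str, int]]:
    """Group scenario records by domain and count them per source."""
    by_domain: Dict[str, List[dict]] = defaultdict(list)
    per_source: Counter = Counter()
    for rec in records:
        by_domain[rec["domain"]].append(rec)  # one output file per domain
        per_source[rec["source"]] += 1        # statistics for the manifest
    return dict(by_domain), dict(per_source)
```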

Phase 2 — Run batch evaluation

# Evaluate aligned agent on both scenarios
python -m experiments.batch_eval --mode aligned

# Evaluate failure-mode agent (for comparison)
python -m experiments.batch_eval --mode failure

# Resume an interrupted run
python -m experiments.batch_eval --mode aligned --resume

# Test on a small subset first
python -m experiments.batch_eval --scenario banking --limit 5

Results are written to experiments/results/ (gitignored).
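
Resuming an interrupted run amounts to skipping scenarios that already have a result on disk. A minimal sketch of that pattern follows, assuming results are stored one JSON object per line with an `id` field; both the field name and the helper names are hypothetical, and the actual `--resume` behavior may differ.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, Set


def completed_ids(results_path: Path) -> Set[str]:
    """Collect scenario IDs already present in a JSONL results file."""
    if not results_path.exists():
        return set()
    done: Set[str] = set()
    with results_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A truncated or malformed line (for example, the last line
                # of an interrupted run) is skipped, so that scenario is
                # evaluated again.
                continue
    return done


def pending(scenarios: Iterable[dict], results_path: Path) -> Iterator[dict]:
    """Yield only the scenarios that have no result yet."""
    done = completed_ids(results_path)
    return (s for s in scenarios if s["id"] not in done)
```

Appending each result as soon as it is produced is what makes this approach safe: at most one partially written line is lost on interruption.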

Phase 2 — Statistical analysis

# Analyze aligned results only
python -m experiments.analyze --mode aligned

# Full mode comparison (generates Mann-Whitney U tests)
python -m experiments.analyze --mode both

# Figures saved to experiments/figures/ (gitignored)

Produces:

  • Per-dimension Precision / Recall / F1 / Cohen's κ
  • Mean scores by source (τ²-bench vs ASB vs custom)
  • Mann-Whitney U tests + rank-biserial effect sizes (aligned vs failure)
  • PNG figures at 300 dpi
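
For reference, the detection metrics and the effect size listed above can be computed as in the sketch below. It uses the standard definitions in pure Python and is not the code in experiments/analyze.py; treating `violation` as the positive class is an assumption. The p-value of the Mann-Whitney U test is not computed here; a library routine such as SciPy's `scipy.stats.mannwhitneyu` would normally supply it.

```python
from typing import Dict, Sequence


def detection_metrics(
    truth: Sequence[str], pred: Sequence[str], positive: str = "violation"
) -> Dict[str, float]:
    """Precision, recall, and F1 with `positive` as the positive class."""
    pairs = list(zip(truth, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rank_biserial(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank-biserial correlation r = 2U / (n1 * n2) - 1.

    U is the Mann-Whitney statistic for x: the number of (x, y) pairs
    with x > y, counting ties as one half.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return 2 * u / (len(x) * len(y)) - 1
```

The pairwise computation of U is quadratic in the sample sizes, which is fine for a dataset of this scale.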

Phase 2 — Human annotation (inter-rater reliability)

python -m experiments.annotate \
    --input experiments/results/banking_aligned_results.jsonl \
    --output experiments/annotations/rater1_banking.jsonl \
    --annotator rater1
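
Once two annotators have labelled the same results, inter-rater reliability can be summarized with Cohen's κ. The function below is a sketch of the standard formula, not the repository's implementation; `cohens_kappa` is a hypothetical name, and how the annotation files are loaded is left out because their schema is not described here.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)
```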

Phase 4 — Generate LaTeX paper tables

python -m experiments.generate_paper_tables
# → experiments/paper_tables/table1_detection_*.tex
# → experiments/paper_tables/table2_sources_*.tex
# → experiments/paper_tables/table3_mode_comparison.tex
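
Generating a .tex table from computed metrics is mostly string formatting. The sketch below shows one possible approach, assuming the `booktabs` rules (`\toprule`, `\midrule`, `\bottomrule`) and two-decimal rounding; `to_latex_table` is a hypothetical helper, and the layout produced by generate_paper_tables.py may differ.

```python
from typing import Sequence


def to_latex_table(headers: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render rows as a LaTeX tabular; floats are printed with 2 decimals."""

    def cell(value: object) -> str:
        return f"{value:.2f}" if isinstance(value, float) else str(value)

    lines = [
        # First column left-aligned, remaining columns right-aligned.
        r"\begin{tabular}{l" + "r" * (len(headers) - 1) + "}",
        r"\toprule",
        " & ".join(headers) + r" \\",
        r"\midrule",
    ]
    lines += [" & ".join(cell(v) for v in row) + r" \\" for row in rows]
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)
```

Cell values are inserted verbatim, so any text containing LaTeX special characters such as `&` or `_` would need escaping before it is passed in.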

Experiment directory structure

experiments/
├── __init__.py
├── merge_datasets.py        # Phase 1: build golden dataset
├── batch_eval.py            # Phase 2: run LLM evaluations
├── analyze.py               # Phase 2: statistical analysis + figures
├── annotate.py              # Phase 2: human annotation CLI
└── generate_paper_tables.py # Phase 4: LaTeX table output

data/
├── public_adapted/
│   ├── tau_bench/
│   │   ├── adapted_banking.jsonl   (20 scenarios)
│   │   └── adapted_retail.jsonl    (20 scenarios)
│   └── agent_safety_bench/
│       └── adapted_scenarios.jsonl (30 scenarios)
├── custom/
│   ├── banking_safe_specific.jsonl (15 scenarios)
│   └── triage_safe_specific.jsonl  (15 scenarios)
└── golden_dataset/
    ├── banking.jsonl               (71 scenarios — generated)
    ├── triage.jsonl                (29 scenarios — generated)
    ├── manifest.json               (source stats + citations)
    └── README.md

Original demo

The original Microsoft Foundry evaluation platform demo is preserved in demo_backup/.

About

AI Evaluation Platform - industry-configurable evaluations powered by Microsoft Foundry
