Merged
16 changes: 10 additions & 6 deletions README.md
@@ -4,12 +4,12 @@

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
-[![REUSE status](https://api.reuse.software/badge/github.com/SAP-samples/agent-quality-inspect)](https://api.reuse.software/info/github.com/SAP-samples/agent-quality-inspect)
+[![REUSE status](https://api.reuse.software/badge/github.com/SAP/agent-quality-inspect)](https://api.reuse.software/info/github.com/SAP/agent-quality-inspect)
[![ICLR 2026](https://img.shields.io/badge/ICLR-2026-red.svg)](https://iclr.cc/Conferences/2026)

-Paper Link: https://openreview.net/pdf?id=fHsVNklKOc (Will be updated with the published version when available)
+Paper Link: https://openreview.net/pdf?id=fHsVNklKOc

-Documentation Link: https://sap-samples.github.io/agent-quality-inspect/
+Documentation Link: https://sap.github.io/agent-quality-inspect/

## Table of Contents
- [Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis](#talk-evaluate-diagnose-user-aware-agent-evaluation-with-automated-error-analysis)
@@ -34,6 +34,10 @@ Documentation Link: https://sap-samples.github.io/agent-quality-inspect/

## Overview

+![Two-step automated error discovery approach. Identical error colors indicate that similar low-level errors are clustered into the same high-level category.](error_analysis_framework.png)


This repository contains the implementation of **Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis (TED)**.

The agent-quality-inspect toolkit evaluates agentic systems under different user personas (expert and non-expert), reports metrics such as **Area Under the Curve (AUC)**, **Progress Per Turn (PPT)**, **pass@k**, **pass^k**, etc., and provides detailed error analysis to identify specific areas for improvement in the agent.
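The pass@k and pass^k aggregates named above are standard reliability metrics for sampled agent runs: pass@k estimates the probability that at least one of k attempts on a task succeeds, and pass^k the probability that all k attempts succeed. A minimal sketch using the usual combinatorial estimators over n trials with c successes per task (the toolkit's exact definitions may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k
    attempts drawn (without replacement) from n trials with c
    successes is a success."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k estimator: chance that all k drawn attempts succeed."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

print(pass_at_k(10, 5, 2))   # 7/9: at least one of 2 draws succeeds
print(pass_hat_k(10, 5, 2))  # 2/9: both of 2 draws succeed
```

pass^k is the stricter metric: it rewards agents that succeed consistently rather than occasionally, which is why the two are typically reported together.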
@@ -66,7 +70,7 @@ At the core of TED is a **subgoal-based evaluation**: users specify a set of nat
1. Clone the repository:

```bash
-git clone https://github.com/SAP-samples/agent-quality-inspect
+git clone https://github.com/SAP/agent-quality-inspect
cd agent-quality-inspect
```

@@ -105,7 +109,7 @@ The standard flow of using it as a metrics package is as follows:
1. To use the package as an importable dependency, enter your command terminal and use this command.

```bash
-pip install git+https://github.com/SAP-samples/agent-quality-inspect.git
+pip install git+https://github.com/SAP/agent-quality-inspect.git
```

2. Define your evaluation sample and agent trace.
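Purely as an illustration of the subgoal-based idea, a sample and trace might be shaped like the following. This is a hypothetical sketch, not the package's verified API: the field names `subgoals` and `expected_tools` are modeled on terms that appear in the README, while everything else is invented for the example.

```python
# Hypothetical sketch only -- not the package's actual data model.
# "subgoals" and "expected_tools" are modeled on the README's terminology.
sample = {
    "goal": "Book a refundable flight from NYC to SFO",
    "subgoals": [  # natural-language subgoals the evaluator checks off
        "Agent confirms travel dates with the user",
        "Agent searches for refundable fares",
        "Agent completes the booking",
    ],
    "expected_tools": ["search_flights", "book_flight"],  # optional
}

trace = [  # one entry per conversation turn, chat-message style
    {"role": "user", "content": "I need a refundable NYC to SFO flight."},
    {"role": "assistant", "content": "What dates are you traveling?"},
]
```

Turn-level subgoal completion over such a trace is what per-turn metrics like PPT and the AUC over the progress curve are computed from.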
@@ -323,7 +327,7 @@ Note: `expected_tools` is optional for now, in the future we plan to support too
No known issues.

## How to obtain support
-[Create an issue](https://github.com/SAP-samples/agent-quality-inspect/issues) in this repository if you find a bug or have questions about the content.
+[Create an issue](https://github.com/SAP/agent-quality-inspect/issues) in this repository if you find a bug or have questions about the content.

For additional support, [ask a question in SAP Community](https://answers.sap.com/questions/ask.html).

2 changes: 1 addition & 1 deletion REUSE.toml
@@ -1,7 +1,7 @@
version = 1
SPDX-PackageName = "agentinspect"
SPDX-PackageSupplier = "ospo@sap.com"
-SPDX-PackageDownloadLocation = "https://github.com/SAP-samples/agentinspect"
+SPDX-PackageDownloadLocation = "https://github.com/SAP/agent-quality-inspect"
SPDX-PackageComment = "The code in this project may include calls to APIs (\"API Calls\") of\n SAP or third-party products or services developed outside of this project\n (\"External Products\").\n \"APIs\" means application programming interfaces, as well as their respective\n specifications and implementing code that allows software to communicate with\n other software.\n API Calls to External Products are not licensed under the open source license\n that governs this project. The use of such API Calls and related External\n Products are subject to applicable additional agreements with the relevant\n provider of the External Products. In no event shall the open source license\n that governs this project grant any rights in or to any External Products,or\n alter, expand or supersede any terms of the applicable additional agreements.\n If you have a valid license agreement with SAP for the use of a particular SAP\n External Product, then you may make use of any API Calls included in this\n project's code for that SAP External Product, subject to the terms of such\n license agreement. If you do not have a valid license agreement for the use of\n a particular SAP External Product, then you may only make use of any API Calls\n in this project for that SAP External Product for your internal, non-productive\n and non-commercial test and evaluation of such API Calls. Nothing herein grants\n you any rights to use or access any SAP External Product, or provide any third\n parties the right to use of access any SAP External Product, through API Calls."

[[annotations]]
Binary file added error_analysis_framework.png
6 changes: 3 additions & 3 deletions paper_experiments/readme.md
@@ -4,9 +4,9 @@ This directory contains scripts for running agent evaluation experiments with us

## Prerequisites

-1. **Agent Setup** (separate from this repository): [TODO: Provide links to agent repositories]
-   - For Tau2Bench agent: Clone the tau2bench repository and set it up by following the instructions in its README
-   - For ToolSandbox agent: Clone the toolsandbox repository and set it up by following the instructions in its README
+1. **Agent Setup**
+   - For Tau2Bench agent: Refer to [agent_runners/README_tau2_bench_setup.md](../agent_runners/README_tau2_bench_setup.md)
+   - For ToolSandbox agent: Refer to [agent_runners/README_tool_sandbox_setup.md](../agent_runners/README_tool_sandbox_setup.md)

2. **Azure OpenAI Configuration**:
- Create a `.env` file in the **project root directory** with the following variables: