Commit 850b04c

Merge pull request #139 from agent-diff-bench/fixes-kdd
Update Examples
2 parents a82941f + 5f4cdab commit 850b04c

3 files changed

Lines changed: 285 additions & 621 deletions

README.md

Lines changed: 58 additions & 144 deletions
Original file line number | Diff line number | Diff line change
@@ -18,6 +18,12 @@ Run it locally (or deploy it). Agents call sandboxed replicas of APIs that behav
1818
<a href="mailto:hubert@uni.minerva.edu">Feedback</a>
1919
</p>
2020

21+
### Try it now
22+
23+
| Notebook | Description | |
24+
|----------|-------------|---|
25+
| [ReAct Agent (Paper)](examples/react_agent_benchmark.ipynb) | Custom ReAct loop matching the paper methodology | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/react_agent_benchmark.ipynb) |
26+
| [LangChain Agent](examples/langchain_agent_benchmark.ipynb) | LangChain agent with tool calling | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/langchain_agent_benchmark.ipynb) |
2127

2228
## Quick Start
2329

@@ -52,69 +58,52 @@ export AGENT_DIFF_BASE_URL="https://api.agentdiff.dev"
5258
<summary><b>Self-Hosted</b></summary>
5359

5460
```bash
55-
git clone https://github.com/hubertpysklo/agent-diff.git
61+
git clone https://github.com/agent-diff-bench/agent-diff.git
5662
cd agent-diff/ops
5763
docker-compose up --build
5864
# Backend runs on http://localhost:8000
5965
```
6066

6167
</details>
6268

63-
### 3. Flow
69+
### 3. Use
70+
6471
```python
6572
from agent_diff import AgentDiff
6673

67-
# Self-hosted (defaults to http://localhost:8000)
6874
client = AgentDiff()
6975

70-
# Initialise isolated environment from a template. See: examples/slack/seeds
71-
env = client.init_env(templateService="slack", templateName="slack_default",
72-
impersonateUserId="U01AGENBOT9", TTL="3600") #impersonateUserId - seeded user account that agent will use
76+
# Create an isolated environment from a template
77+
env = client.init_env(
78+
templateService="slack",
79+
templateName="slack_default",
80+
impersonateUserId="U01AGENBOT9",
81+
)
7382

74-
# print(env.environmentUrl) = http://localhost:8000/api/env/{environmentId}/services/slack
75-
76-
# Take before snapshot
83+
# Snapshot before agent runs
7784
run = client.start_run(envId=env.environmentId)
7885

79-
# Your agent does stuff using the environment URL
80-
# You can swap the URLs in MCPs or use the code executor tool (Python or bash) with a proxy
81-
82-
# Using CodeExecutorProxy with OpenAI Agents SDK (For Vercel AI, check TS SDK docs)
83-
from agent_diff import PythonExecutorProxy, create_openai_tool
84-
from agents import Agent, Runner
86+
# --- Your agent interacts with the API here ---
87+
# The SDK provides code execution proxies (Python/Bash) for OpenAI Agents, LangChain, etc.
88+
# The agent writes normal code (e.g. requests.post('https://slack.com/api/chat.postMessage', ...))
89+
# which is automatically intercepted and routed to the sandboxed environment.
8590

86-
# Create executor (auto-loads from AGENT_DIFF_API_KEY and AGENT_DIFF_BASE_URL env vars)
87-
python_executor = PythonExecutorProxy(env.environmentId)
88-
python_tool = create_openai_tool(python_executor)
91+
from agent_diff import BashExecutorProxy, create_openai_tool
92+
bash = BashExecutorProxy(env.environmentId)
93+
tool = create_openai_tool(bash) # also: create_langchain_tool, create_smolagents_tool
8994

90-
agent = Agent(
91-
name="Slack Assistant",
92-
instructions="Use execute_python tool to interact with Slack API at https://slack.com/api/*. Complete the task using the tools provided. Authentication is handled automatically via proxy. Leave a placeholder credential where you would add a real token.",
93-
tools=[python_tool] # python_tool (or bash_tool) where agent will write code
94-
)
95-
96-
response = await Runner.run(agent, "Post 'Hello' to Slack channel #general")
97-
98-
# The agent writes normal code like:
99-
# requests.post('https://slack.com/api/chat.postMessage', ...)
100-
# But it will be proxied to the temporary sandbox environment
101-
# e.g. transforms:
102-
# from: https://api.slack.com/api/conversations.list
103-
# to: http://localhost:8000/api/env/{environmentId}/services/slack/conversations.list
104-
105-
# Compute diff (changes in the environment) and get results
95+
# Compute state diff and inspect changes
10696
diff = client.diff_run(runId=run.runId)
107-
108-
# Inspect changes
109-
print(diff.diff['inserts']) # New records, e.g. new message or user added by agent
110-
print(diff.diff['updates']) # Modified records, edited message
111-
print(diff.diff['deletes']) # Deleted records, deleted message, linear issue, etc.
97+
print(diff.diff['inserts']) # new records created by agent
98+
print(diff.diff['updates']) # modified records
99+
print(diff.diff['deletes']) # deleted records
112100

113101
# Clean up
114102
client.delete_env(envId=env.environmentId)
115-
116103
```
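The routing described in the comments above can be pictured with a small sketch. It is an illustration only, not the SDK's implementation: `to_sandbox_url` and `SERVICE_HOSTS` are hypothetical names, and the target URL shape (`{base}/api/env/{environmentId}/services/slack/...`) is taken from the example in the earlier README text shown in this diff.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from real API hosts to sandbox service names (assumption).
SERVICE_HOSTS = {"slack.com": "slack", "api.slack.com": "slack"}


def to_sandbox_url(url: str, environment_id: str,
                   base_url: str = "http://localhost:8000") -> str:
    """Rewrite a real API URL so that it points at a sandboxed environment."""
    parts = urlsplit(url)
    service = SERVICE_HOSTS.get(parts.hostname or "")
    if service is None:
        return url  # unknown host: leave the request untouched

    path = parts.path
    if path.startswith("/api/"):
        path = path[len("/api"):]  # drop the service's own /api prefix

    rewritten = f"{base_url}/api/env/{environment_id}/services/{service}{path}"
    if parts.query:
        rewritten += f"?{parts.query}"  # keep query parameters as-is
    return rewritten


if __name__ == "__main__":
    print(to_sandbox_url("https://slack.com/api/chat.postMessage", "env123"))
```

Because the rewrite happens outside the agent's code, the agent can keep calling the public API URLs it already knows.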
117104

105+
See the [Python SDK](sdk/agent-diff-python/README.md) and [TS SDK](sdk/agent-diff-ts/README.md) for full reference.
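The `inserts`, `updates`, and `deletes` lists printed in the Quick Start can be condensed into per-table counts. The sketch below assumes each record is a dictionary carrying a `__table__` key; that record shape and the `summarize_diff` helper are hypothetical, so adjust the key to whatever the SDK actually returns.

```python
from collections import Counter


def summarize_diff(diff: dict) -> dict:
    """Count changed records per table for each kind of change."""
    summary = {}
    for kind in ("inserts", "updates", "deletes"):
        records = diff.get(kind, [])
        # "__table__" is an assumed field name, used here for illustration only.
        tables = Counter(r.get("__table__", "unknown") for r in records)
        summary[kind] = dict(tables)
    return summary


if __name__ == "__main__":
    sample = {
        "inserts": [{"__table__": "messages", "text": "Hello"}],
        "updates": [],
        "deletes": [],
    }
    print(summarize_diff(sample))
```

With the client from the Quick Start, the call would be `summarize_diff(diff.diff)`.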
106+
118107
## Supported APIs
119108

120109
- **Box** – REST API for file/folder management, search, comments, tags, shared links, hubs, and content versioning. See [`backend/src/services/box/README.md`](backend/src/services/box/README.md). 27 endpoints.
@@ -130,9 +119,9 @@ client.delete_env(envId=env.environmentId)
130119
## Templates, Seeds & Environments
131120

132121
**Templates** are pre-configured database schemas that serve as the starting point for test environments. Think of them as snapshots of a service's state:
133-
- **Location**: Templates live in PostgreSQL schemas (e.g., `slack_default`, `linear_base`)
134-
- **Content**: Templates are seeded during startup time from seeds with data like users, channels, messages, issues, etc.
135-
- **Example Seeds**: **[slack_default](examples/slack/seeds/slack_bench_default.json)** - sample users, channels and messages.
122+
- **Location**: Templates live in PostgreSQL schemas (e.g., `slack_default`, `box_default`, `linear_expanded`, `calendar_base`)
123+
- **Content**: Seeded with realistic data such as users, channels, messages, files, folders, issues, and calendar events.
124+
- **Seeds**: [box](examples/box/seeds/) | [calendar](examples/calendar/seeds/) | [linear](examples/linear/seeds/) | [slack](examples/slack/seeds/)
136125

137126
<img width="2330" height="688" alt="image" src="https://github.com/user-attachments/assets/481d3f40-e378-402c-9d3c-8a2ab75c880e" />
138127

@@ -144,43 +133,12 @@ client.delete_env(envId=env.environmentId)
144133
<img width="2344" height="432" alt="image" src="https://github.com/user-attachments/assets/c61e93f2-1826-429e-8ee7-4a32f4172a38" />
145134

146135

147-
## CodeExecutorProxy
148-
149-
SDK provides **code execution proxies** - tools for AI agents. You add it to your toolbox in Vercel AI SDK, Langchain or OpenAI Agents, making LLM write Python or Bash code to talk with Slack or Linear API. Requests will automatically be intercepted and routed to isolated test environments. This enables agents to interact with service replicas without any code changes. See more in: **[Python SDK](sdk/agent-diff-python/README.md)**
150-
151-
152-
## Paper
153-
154-
> **Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation**
155-
> Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson
156-
> *Pre-print. Under review for KDD 2026.*
157-
> [arXiv:2602.11224](https://arxiv.org/abs/2602.11224)
158-
159-
If you use Agent-Diff in your research, please cite:
160-
161-
```bibtex
162-
@article{pysklo2025agentdiff,
163-
title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
164-
author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
165-
journal={arXiv preprint arXiv:2602.11224},
166-
year={2025}
167-
}
168-
```
169136

170137
## Run Evaluations
171138

172-
The fastest way to run Agent-Diff evaluations is via **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — run evals or RL training with no setup required.
173-
174-
Alternatively, run locally or self-hosted using the SDK (see [To run evaluations](#to-run-evaluations) below).
175-
176-
### Example Notebooks
177-
178-
- **[ReAct Agent (Paper)](examples/react_agent_benchmark.ipynb)** — Custom ReAct loop matching the paper methodology [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/react_agent_benchmark.ipynb)
179-
- **[LangChain Agent](examples/langchain_agent_benchmark.ipynb)** — LangChain agent with tool calling [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/langchain_agent_benchmark.ipynb)
180-
181-
**Resources:**
182-
- **Dataset**: [hubertmarek/agent-diff-bench](https://huggingface.co/datasets/hubertmarek/agent-diff-bench) — 224 tasks across all 4 services (80/20 train/test split)
183-
- **Prime Intellect**: [agent-diff-bench on Prime Lab](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench) — hosted evaluations & RL training
139+
- **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — Run evals or RL training with no setup required
140+
- **[Colab Notebooks](#try-it-now)** — Run locally with the example notebooks above
141+
- **[Dataset](https://huggingface.co/datasets/hubertmarek/agent-diff-bench)** — 224 tasks across all 4 services (80/20 train/test split)
184142

185143
## Benchmark
186144

@@ -232,81 +190,37 @@ Tasks are characterized along five dimensions: _task horizon_ (minimum API calls
232190

233191
Per-service assertion-weighted scores (95% Bayesian CrI). No-docs baseline: agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation ablation results in the [paper](https://arxiv.org/abs/2602.11224).
234192

235-
## Evaluations & Test Suites
236-
237-
Collections of test cases with assertions that you can run against agent runs using evaluations.
193+
## Test Suites
238194

239-
- **[box_bench.json](examples/box/testsuites/box_bench.json)** - test cases covering file/folder operations, search, tags, comments, hubs, and content versioning
240-
- **[calendar_bench.json](examples/calendar/testsuites/calendar_bench.json)** - test cases covering event CRUD, recurring events, free/busy queries, ACL management, and calendar lifecycle
241-
- **[linear_bench.json](examples/linear/testsuites/linear_bench.json)** - test cases covering issue management, labels, comments, workflow states, and team operations
242-
- **[slack_bench.json](examples/slack/testsuites/slack_bench.json)** - test cases covering message sending, channel ops, reactions, threading
195+
| Service | Test Suite | Tests | Coverage |
196+
|---------|-----------|-------|----------|
197+
| Box | [box_bench.json](examples/box/testsuites/box_bench.json) | 48 | File/folder ops, search, tags, comments, hubs, versioning |
198+
| Calendar | [calendar_bench.json](examples/calendar/testsuites/calendar_bench.json) | 60 | Event CRUD, recurring events, free/busy, ACL, lifecycle |
199+
| Linear | [linear_bench.json](examples/linear/testsuites/linear_bench.json) | 57 | Issues, labels, comments, workflow states, teams |
200+
| Slack | [slack_bench.json](examples/slack/testsuites/slack_bench.json) | 59 | Messages, channels, reactions, threading |
243201

244-
<img width="2985" height="1966" alt="pass_rates_annotated" src="https://github.com/user-attachments/assets/f5c59c81-c3bd-427e-977c-a5c2c0695e86" />
245-
246-
- **[Evaluation DSL](docs/evaluation-dsl.md)** - Check DSL docs on how it works.
202+
Each test defines expected state changes via declarative assertions. See the [assertions docs](https://agentdiff.mintlify.app/core-concepts/assertions) for how they work.
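As a rough mental model of a declarative assertion over a state diff, consider the sketch below. The assertion format (`diff_type`, `where`, `expected_count`) and the helper names are hypothetical and are not the Agent-Diff DSL; the 0/1 score mirrors the per-run score mentioned in the earlier README text.

```python
def matches(record: dict, where: dict) -> bool:
    """True when every field named in `where` has the expected value."""
    return all(record.get(key) == value for key, value in where.items())


def check_assertion(diff: dict, assertion: dict) -> bool:
    """Check one assertion against the records of one diff category."""
    records = diff.get(assertion["diff_type"], [])
    hits = [r for r in records if matches(r, assertion.get("where", {}))]
    return len(hits) == assertion.get("expected_count", 1)


def score_run(diff: dict, assertions: list) -> int:
    """Return 1 if every assertion holds, otherwise 0."""
    return int(all(check_assertion(diff, a) for a in assertions))


if __name__ == "__main__":
    diff = {
        "inserts": [{"channel": "general", "text": "Hello"}],
        "updates": [],
        "deletes": [],
    }
    assertions = [
        # Exactly one new message with this text in #general.
        {"diff_type": "inserts",
         "where": {"channel": "general", "text": "Hello"},
         "expected_count": 1},
        # Nothing may be deleted.
        {"diff_type": "deletes", "expected_count": 0},
    ]
    print(score_run(diff, assertions))
```

Checking the final state rather than the sequence of calls means an agent can reach the expected outcome by any route and still pass.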
247203

248204
<img width="2516" height="1020" alt="image" src="https://github.com/user-attachments/assets/3270f1f1-5afa-4db2-97b0-c35c070ef44f" />
249205

206+
## Documentation
250207

251-
### To run evaluations:
252-
253-
```python
254-
from agent_diff import AgentDiff, PythonExecutorProxy, BashExecutorProxy, create_openai_tool
255-
from agents import Agent, Runner
256-
257-
client = AgentDiff()
258-
259-
260-
suite_list = client.list_test_suites(name="Slack Bench")
261-
slack_suite = suite_list.testSuites[0]
262-
suite = client.get_test_suite(slack_suite.id, expand=True)
263-
264-
evaluation_results = []
265-
266-
for test in suite.tests:
267-
prompt = test.prompt
268-
test_id = test.id
269-
270-
#In test suite you define which env seed template is used for each test
271-
env = client.init_env(testId=test_id)
272-
273-
# This function will take a snapshot before run
274-
run = client.start_run(envId=env.environmentId, testId=test_id)
275-
276-
277-
bash_executor = BashExecutorProxy(env.environmentId) # Auto-loads from env vars
278-
bash_tool = create_openai_tool(bash_executor)
279-
280-
agent = Agent(
281-
name="Slack Assistant",
282-
instructions="Use execute_bash tool with curl to interact with Slack API at https://slack.com/api/*. Authentication is handled automatically.",
283-
tools=[bash_tool]
284-
)
285-
286-
response = await Runner.run(agent, prompt)
208+
- **[Python SDK](https://agentdiff.mintlify.app/sdks/python/installation)** — Full Python SDK reference
209+
- **[TypeScript SDK](https://agentdiff.mintlify.app/sdks/typescript/installation)** — Full TypeScript SDK reference
210+
- **[Assertions & Evaluation DSL](https://agentdiff.mintlify.app/core-concepts/assertions)** — Write test assertions
211+
- **[API Reference](https://agentdiff.mintlify.app/api-reference/introduction)** — REST API documentation
212+
- **[Self-Hosting](https://agentdiff.mintlify.app/hosting/docker-setup)** — Docker setup & configuration
287213

288-
#This function will take a 2nd snapshot, run diff and assert results against expected state defined in test suite
289-
290-
#computes eval
291-
client.evaluate_run(runId=run.runId)
292-
293-
#returns score runId, full diff and score (0/1)
294-
run_result = client.get_results_for_run(runId=run.runId)
214+
## Citation
295215

296-
evaluation_results.append(run_result)
216+
If you use Agent-Diff in your research, please cite:
297217

298-
client.delete_env(envId=env.environmentId)
218+
```bibtex
219+
@article{pysklo2025agentdiff,
220+
title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
221+
author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
222+
journal={arXiv preprint arXiv:2602.11224},
223+
year={2025}
224+
}
299225
```
300226

301-
### Example output:
302-
303-
<img width="1669" height="878" alt="image" src="https://github.com/user-attachments/assets/096393d2-e464-4a3d-b0a8-b188af5cf8a9" />
304-
305-
306-
## Documentation
307-
308-
- **[Python SDK](sdk/agent-diff-python/README.md)** - Complete Python SDK reference
309-
- **[TS SDK](sdk/agent-diff-ts/README.md)** - Complete TS SDK reference
310-
- **[Evaluation DSL](docs/evaluation-dsl.md)** - Write test assertions
311-
- **[API Reference](docs/api-reference.md)** - REST API documentation
312-
