| Notebook | Description | Colab |
|---|---|---|
| [ReAct Agent (Paper)](examples/react_agent_benchmark.ipynb) | Custom ReAct loop matching the paper methodology | [Open in Colab](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/react_agent_benchmark.ipynb) |
| [LangChain Agent](examples/langchain_agent_benchmark.ipynb) | LangChain agent with tool calling | [Open in Colab](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/langchain_agent_benchmark.ipynb) |
    instructions="Use execute_python tool to interact with Slack API at https://slack.com/api/*. Complete the task using the tools provided. Authentication is handled automatically via proxy. Leave a placeholder credential where you would add a real token.",
    tools=[python_tool],  # python_tool (or bash_tool): the tool the agent writes code with
)

response = await Runner.run(agent, "Post 'Hello' to Slack channel #general")

print(diff.diff['inserts'])  # new records created by the agent
print(diff.diff['updates'])  # modified records
print(diff.diff['deletes'])  # deleted records (e.g., a deleted message or Linear issue)

# Clean up
client.delete_env(envId=env.environmentId)
```
See the [Python SDK](sdk/agent-diff-python/README.md) and [TS SDK](sdk/agent-diff-ts/README.md) for the full API reference.
## Supported APIs
- **Box** – REST API for file/folder management, search, comments, tags, shared links, hubs, and content versioning. See [`backend/src/services/box/README.md`](backend/src/services/box/README.md). 27 endpoints.
**Templates** are pre-configured database schemas that serve as the starting point for test environments. Think of them as snapshots of a service's state:

- **Location**: Templates live in PostgreSQL schemas (e.g., `slack_default`, `box_default`, `linear_expanded`, `calendar_base`)
- **Content**: Seeded with realistic data — users, channels, messages, files, folders, issues, calendar events, etc.
- **Example Seeds**: **[slack_default](examples/slack/seeds/slack_bench_default.json)** – sample users, channels and messages
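As a purely illustrative sketch of what seeded template data can contain (the real schema is defined by the seed files themselves, e.g. `slack_bench_default.json`; the field names below are assumptions, not the actual format):

```python
# Hypothetical seed structure for a Slack-style template.
# Field names are illustrative; see the real seed JSON files for the actual schema.
seed = {
    "users": [
        {"id": "U001", "name": "alice"},
        {"id": "U002", "name": "bob"},
    ],
    "channels": [
        {"id": "C001", "name": "general", "members": ["U001", "U002"]},
    ],
    "messages": [
        {"channel": "C001", "user": "U001", "text": "Welcome to #general!"},
    ],
}
```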
The SDK provides **code execution proxies**, tools for AI agents. You add them to your toolbox in the Vercel AI SDK, LangChain, or OpenAI Agents SDK, and the LLM writes Python or Bash code to talk to the Slack or Linear API. Requests are automatically intercepted and routed to isolated test environments, so agents can interact with service replicas without any code changes. See more in: **[Python SDK](sdk/agent-diff-python/README.md)**
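Conceptually, such a proxy tool can be sketched as below. This is an illustrative sketch, not the SDK's implementation: the proxy address, the `AGENT_DIFF_ENV_ID` variable name, and the subprocess approach are all assumptions.

```python
import os
import subprocess
import sys
import tempfile

PROXY_BASE = "http://localhost:8080"  # assumed proxy address, not a real SDK default

def execute_python(code: str, env_id: str) -> str:
    """Run agent-written Python with HTTP(S) traffic routed through a proxy.

    The proxy is assumed to intercept calls to https://slack.com/api/* and
    route them to the isolated test environment identified by env_id.
    """
    env = dict(
        os.environ,
        HTTP_PROXY=PROXY_BASE,
        HTTPS_PROXY=PROXY_BASE,
        AGENT_DIFF_ENV_ID=env_id,  # hypothetical env-id convention
    )
    # Write the agent's code to a temp file and execute it in the proxied env.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], env=env, capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr
```

Because interception happens at the HTTP layer, the agent's generated code calls the real API URLs unchanged.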
## Paper
> **Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation**
> Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson

If you use Agent-Diff in your research, please cite:

```bibtex
@article{pysklo2025agentdiff,
  title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
  author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
  journal={arXiv preprint arXiv:2602.11224},
  year={2025}
}
```
## Run Evaluations
The fastest way to run Agent-Diff evaluations is via **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — run evals or RL training with no setup required.
Alternatively, run locally or self-hosted using the SDK (see [To run evaluations](#to-run-evaluations) below).
### Example Notebooks
- **[ReAct Agent (Paper)](examples/react_agent_benchmark.ipynb)** — Custom ReAct loop matching the paper methodology ([Open in Colab](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/react_agent_benchmark.ipynb))
- **[LangChain Agent](examples/langchain_agent_benchmark.ipynb)** — LangChain agent with tool calling ([Open in Colab](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/langchain_agent_benchmark.ipynb))
**Resources:**
- **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — Run evals or RL training with no setup required
- **[Colab Notebooks](#try-it-now)** — Run locally with the example notebooks above
- **[Dataset](https://huggingface.co/datasets/hubertmarek/agent-diff-bench)** — 224 tasks across all 4 services (80/20 train/test split)
## Benchmark
Tasks are characterized along five dimensions, including _task horizon_ (the minimum number of API calls required).
Per-service assertion-weighted scores (95% Bayesian CrI). No-docs baseline: agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation ablation results in the [paper](https://arxiv.org/abs/2602.11224).
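To illustrate the basic idea of assertion weighting (a simplified sketch of the aggregation only; the paper's actual scoring additionally reports 95% Bayesian credible intervals across the 3 trials per task):

```python
def assertion_weighted_score(results):
    """results: list of (passed_assertions, total_assertions), one pair per task.

    Weighting by assertion count means a task with many assertions
    contributes proportionally more than a single-assertion task.
    """
    passed = sum(p for p, _ in results)
    total = sum(t for _, t in results)
    return passed / total

# Two tasks: 3/4 assertions pass on one, 1/2 on the other -> 4/6 overall
score = assertion_weighted_score([(3, 4), (1, 2)])
```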
## Test Suites

Collections of test cases with assertions that you can run against agent runs.

- **[box_bench.json](examples/box/testsuites/box_bench.json)** – test cases covering file/folder operations, search, tags, comments, hubs, and content versioning
- **[calendar_bench.json](examples/calendar/testsuites/calendar_bench.json)** – test cases covering event CRUD, recurring events, free/busy queries, ACL management, and calendar lifecycle
- **[linear_bench.json](examples/linear/testsuites/linear_bench.json)** – test cases covering issue management, labels, comments, workflow states, and team operations

Each test defines expected state changes via declarative assertions. See the [assertions docs](https://agentdiff.mintlify.app/core-concepts/assertions) for how they work.
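For intuition only, a test case could pair a task prompt with assertions over the resulting state diff. The shapes and operator names below are hypothetical, not the actual Agent-Diff schema; consult the assertions docs and the bench JSON files for the real format.

```python
# Hypothetical test case -- the field and operator names are illustrative only.
test_case = {
    "task": "Post 'Hello' to Slack channel #general",
    "assertions": [
        {"op": "insert", "table": "messages",
         "where": {"channel": "#general", "text": "Hello"}},
        {"op": "no_change", "table": "users"},
    ],
}

def check(diff, assertions):
    """Naive checker over a diff shaped like {'inserts': {table: [rows]}, ...}."""
    ok = True
    for a in assertions:
        if a["op"] == "insert":
            # At least one inserted row must match every key in the assertion.
            rows = diff.get("inserts", {}).get(a["table"], [])
            ok = ok and any(
                all(row.get(k) == v for k, v in a["where"].items()) for row in rows
            )
        elif a["op"] == "no_change":
            # The table must appear in neither updates nor deletes.
            ok = ok and (
                a["table"] not in diff.get("updates", {})
                and a["table"] not in diff.get("deletes", {})
            )
    return ok
```

Against a diff containing only the new message, `check` returns `True`; any stray change to `users` flips it to `False`.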