question about AgentBench evaluation

I have a question regarding the AgentBench evaluation reported in Table 3 of the paper.

According to the AgentBench leaderboard (paper: https://arxiv.org/abs/2308.03688; "Benchmarking Results" in https://github.com/THUDM/AgentBench), models such as Qwen2.5-14B-Instruct achieve 17.6 on WebShop.

However, in Table 3 of your paper, the reported result for Qwen2.5-7B-Instruct on WebShop is 58.8, which appears significantly higher than the numbers shown in the AgentBench benchmark.

Could you clarify:
	1.	Is the WebShop evaluation setting used in your Table 3 the same as in the original AgentBench benchmark?
	2.	Were any modifications made to the environment, evaluation protocol, or dataset?
	
Thank you very much for your time and help.

question about AgentBench evaluation #32
