I have a question regarding the AgentBench evaluation reported in Table 3 of the paper.
According to the AgentBench leaderboard (paper: https://arxiv.org/abs/2308.03688; repository: https://github.com/THUDM/AgentBench, "Benchmarking Results" section), models such as Qwen2.5-14B-Instruct achieve 17.6 on WebShop.
However, in Table 3 of your paper, the reported result for the smaller Qwen2.5-7B-Instruct on WebShop is 58.8, which is significantly higher than the numbers shown in the AgentBench benchmark.
Could you clarify:
1. Whether the WebShop evaluation setting used in your Table 3 is the same as in the original AgentBench benchmark?
2. Whether any modifications were made to the environment, evaluation protocol, or dataset?
Thank you very much for your time and help.