diff --git a/browseruse_bench/eval/lexbench_browser/prompts/eval_final_en.txt b/browseruse_bench/eval/lexbench_browser/prompts/eval_final_en.txt
index e117ab8..3020c4c 100644
--- a/browseruse_bench/eval/lexbench_browser/prompts/eval_final_en.txt
+++ b/browseruse_bench/eval/lexbench_browser/prompts/eval_final_en.txt
@@ -33,7 +33,10 @@ Note: I will provide the final screenshot after Agent execution. Please evaluate
 Please evaluate whether the Agent successfully completed the task based on the final screenshot and answer:

 1. **Result Analysis**: Judge from final screenshot whether the Agent achieved the task goal
-2. **Item-by-Item Scoring**: Provide specific scores and reasons according to scoring criteria
+2. **Item-by-Item Scoring**: Provide specific scores and reasons according to scoring criteria.
+   **You MUST also apply the following rules when scoring**:
+   - **Upstream-unreachable tolerance**: If the site resource/data the task depends on does not actually exist in the current environment or the site does not provide it (the final screenshot confirms the agent reached the correct target location AND the answer explicitly reports "I searched / the site does not provide it" or gives another clear explanation), the corresponding **upstream retrieval items** ("locate target / extract target info / count statistics / key info identification / retrieval / lookup / query") **may receive full or near-full credit**. Conversely, if the agent skipped steps before claiming "not found" or fabricated results, deduct per the original standard.
+   - **Downstream knock-on tolerance**: When upstream is confirmed unreachable per the previous rule AND the agent did not forcibly fabricate placeholder records, **downstream write items** that depend on that upstream data ("create Lead/Quotation/Task/Event/Time Off / field filling / status update / cross-system data consistency / transcription / writing / type mapping / priority judgment") **should not be flat-zeroed**; award **30%-50%** tolerance credit of their full points (up to 50% if the agent explicitly states "skipped due to upstream unavailability").
 3. **Deduction Explanation**: List triggered deduction items
 4. **Total Score Calculation**: Scoring items score - Deductions = Final score

diff --git a/browseruse_bench/eval/lexbench_browser/prompts/eval_final_zh.txt b/browseruse_bench/eval/lexbench_browser/prompts/eval_final_zh.txt
index bc82c1c..f126975 100644
--- a/browseruse_bench/eval/lexbench_browser/prompts/eval_final_zh.txt
+++ b/browseruse_bench/eval/lexbench_browser/prompts/eval_final_zh.txt
@@ -33,7 +33,10 @@
 请根据最终截图和答案,评估Agent是否成功完成了任务:

 1. **结果分析**: 从最终截图判断Agent是否达到了任务目标
-2. **逐项评分**: 根据评分标准给出具体分数和理由
+2. **逐项评分**: 根据评分标准给出具体分数和理由。
+   **评分同时必须应用以下规则**:
+   - **上游不可达容忍**: 若任务依赖的站点资源/数据在当前环境实际不存在或该站点不提供(最终截图确认 agent 到达正确目标位置 + 答案明确报告"已检索 / 该站点不提供"或给出其他清晰解释),对应"找到目标 / 提取目标信息 / 数量统计 / 关键信息识别 / 检索 / 定位 / 查询"等**上游检索类**评分项**可给满分或接近满分**;反之若 agent 跳步声称"未找到"或编造结果,按原标准扣分。
+   - **下游连带容忍**: 上游按上一条确认不可达 且 agent 未强行编造占位记录时,依赖该上游数据的"创建 Lead/Quotation/Task/Event/Time Off / 字段填写 / 状态更新 / 跨系统数据一致性 / 转录 / 写入 / 类型映射 / 优先级判断"等**下游写操作类**评分项**不应直接 0 分**,按其满分的 **30%-50%** 给容忍分(agent 明确标注"因上游无数据跳过"可到 50%)。
 3. **扣分说明**: 列出触发的扣分项
 4. **总分计算**: 评分项得分 - 扣分 = 最终得分

diff --git a/browseruse_bench/eval/lexbench_browser/prompts/eval_stepwise_en.txt b/browseruse_bench/eval/lexbench_browser/prompts/eval_stepwise_en.txt
index 7386df0..b58a503 100644
--- a/browseruse_bench/eval/lexbench_browser/prompts/eval_stepwise_en.txt
+++ b/browseruse_bench/eval/lexbench_browser/prompts/eval_stepwise_en.txt
@@ -34,7 +34,10 @@ Note: I will provide key screenshots from the Agent's execution process. Please
 Please evaluate the Agent's performance based on the screenshots and final answer:

 1. **Process Analysis**: Judge from screenshots whether the Agent correctly executed the task steps
-2. **Item-by-Item Scoring**: Provide specific scores and reasons according to scoring criteria
+2. **Item-by-Item Scoring**: Provide specific scores and reasons according to scoring criteria.
+   **You MUST also apply the following rules when scoring**:
+   - **Upstream-unreachable tolerance**: If the site resource/data the task depends on does not actually exist in the current environment or the site does not provide it (screenshots confirm the agent reached the correct target location AND the answer explicitly reports "I searched / the site does not provide it" or gives another clear explanation), the corresponding **upstream retrieval items** ("locate target / extract target info / count statistics / key info identification / retrieval / lookup / query") **may receive full or near-full credit**. Conversely, if the agent skipped steps before claiming "not found" or fabricated results, deduct per the original standard.
+   - **Downstream knock-on tolerance**: When upstream is confirmed unreachable per the previous rule AND the agent did not forcibly fabricate placeholder records, **downstream write items** that depend on that upstream data ("create Lead/Quotation/Task/Event/Time Off / field filling / status update / cross-system data consistency / transcription / writing / type mapping / priority judgment") **should not be flat-zeroed**; award **30%-50%** tolerance credit of their full points (up to 50% if the agent explicitly states "skipped due to upstream unavailability").
 3. **Total Score Calculation**: Sum of all scoring items = Final score

 Please output in the following format:
diff --git a/browseruse_bench/eval/lexbench_browser/prompts/eval_stepwise_zh.txt b/browseruse_bench/eval/lexbench_browser/prompts/eval_stepwise_zh.txt
index 562ed93..268b354 100644
--- a/browseruse_bench/eval/lexbench_browser/prompts/eval_stepwise_zh.txt
+++ b/browseruse_bench/eval/lexbench_browser/prompts/eval_stepwise_zh.txt
@@ -34,7 +34,10 @@
 请根据截图和最终答案,评估Agent的表现:

 1. **过程分析**: 从截图判断Agent是否正确执行了任务步骤
-2. **逐项评分**: 根据评分标准给出具体分数和理由
+2. **逐项评分**: 根据评分标准给出具体分数和理由。
+   **评分同时必须应用以下规则**:
+   - **上游不可达容忍**: 若任务依赖的站点资源/数据在当前环境实际不存在或该站点不提供(截图确认 agent 到达正确目标位置 + 答案明确报告"已检索 / 该站点不提供"或给出其他清晰解释),对应"找到目标 / 提取目标信息 / 数量统计 / 关键信息识别 / 检索 / 定位 / 查询"等**上游检索类**评分项**可给满分或接近满分**;反之若 agent 跳步声称"未找到"或编造结果,按原标准扣分。
+   - **下游连带容忍**: 上游按上一条确认不可达 且 agent 未强行编造占位记录时,依赖该上游数据的"创建 Lead/Quotation/Task/Event/Time Off / 字段填写 / 状态更新 / 跨系统数据一致性 / 转录 / 写入 / 类型映射 / 优先级判断"等**下游写操作类**评分项**不应直接 0 分**,按其满分的 **30%-50%** 给容忍分(agent 明确标注"因上游无数据跳过"可到 50%)。
 3. **合计得分**: 各评分项得分累加 = 最终得分(必须以"### 总分: XX分"格式输出)

 请按以下格式输出:
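
Reviewer sketch: the two tolerance rules added to all four prompts amount to a small decision procedure, which may be easier to audit when written out. The Python below is a hypothetical illustration only. The judge is an LLM following the prompt text, and the names `ScoredItem` and `tolerance_credit` do not appear anywhere in this diff. Two choices are assumptions of the sketch rather than part of the change: a tolerated upstream item gets full credit (the prompt allows "full or near-full"), and a tolerated downstream item defaults to 30%, rising to 50% when the agent explicitly says it skipped the step because upstream data was unavailable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredItem:
    """Hypothetical facts a judge would establish for one rubric item."""
    full_points: float
    kind: str  # "upstream" (retrieval item) or "downstream" (write item)
    # True only if screenshots show the agent at the correct location AND
    # the answer clearly explains that the resource/data is not available.
    upstream_confirmed_unreachable: bool
    # True if the agent skipped steps, fabricated results, or created
    # placeholder records.
    skipped_or_fabricated: bool
    # True if the agent wrote something like
    # "skipped due to upstream unavailability".
    stated_skip_reason: bool = False


def tolerance_credit(item: ScoredItem) -> Optional[float]:
    """Return the tolerance credit, or None if the original rubric applies."""
    if not item.upstream_confirmed_unreachable or item.skipped_or_fabricated:
        return None  # no tolerance: score per the original standard
    if item.kind == "upstream":
        # Assumption: award full credit (prompt says "full or near-full").
        return item.full_points
    if item.kind == "downstream":
        # Assumption: 30% by default, 50% with an explicit skip statement.
        percent = 50 if item.stated_skip_reason else 30
        return item.full_points * percent / 100
    raise ValueError(f"unknown item kind: {item.kind!r}")
```

Under these assumptions, a 10-point downstream item earns 3.0 points when upstream is confirmed unreachable, or 5.0 if the agent also states why it skipped the write. A return value of `None` means neither tolerance rule applies and the item is scored by the existing criteria.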