feat(eval): introduce upstream/downstream unreachability tolerance rules in judge prompts #35
Open
jiaxinwang-sherry wants to merge 1 commit into
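Inject two sub-rules into the "逐项评分" (item-by-item scoring) step of the 4 LexBench-Browser positive (正向) judge prompts, to address a systematic bias in which sparse live data on the target sites causes honest agents to be wrongly scored zero:
- Upstream unreachability tolerance: when screenshots confirm that the agent reached the target location and the answer explicitly reports "searched / the site does not provide this", upstream retrieval-type rubric items may receive full marks. Skipping steps and claiming "not found", or fabricating results, is still penalized under the original criteria.
- Downstream knock-on tolerance: when the upstream is unreachable and the agent has not fabricated placeholder records, downstream write-operation rubric items are not scored 0 outright; they receive a tolerance score of 30%-50% of full marks.

Co-authored-by: Cursor <cursoragent@cursor.com>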
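Background
Core goal: if the resources on a website cannot satisfy the query in the first place, and the model correctly recognizes this and reports the situation, the judge should award the model an appropriate share of the score.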
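Changes
4 files under browseruse_bench/eval/lexbench_browser/prompts/, +4 net lines each. The change is shaped to blend into the existing scoring framework: both rules are injected as sub-clauses of the "2. 逐项评分" (item-by-item scoring) step rather than added as a new standalone section. This forces the judge to read them while performing the scoring action, leaves the overall numbered-list structure unchanged, and introduces no new concepts.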
Rule 1: upstream unreachability tolerance
Rule 2: downstream knock-on tolerance
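Taken together, the two rules summarized in the PR description amount to roughly the decision logic below. This is only an illustrative sketch: in the PR the rules are natural-language clauses read by an LLM judge, not code, and every name here (`ItemContext`, `tolerance_score`, `downstream_ratio`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemContext:
    """Hypothetical view of one rubric item as the judge would see it."""
    kind: str                   # "upstream" (retrieval) or "downstream" (write operation)
    max_points: float           # full marks for this rubric item
    reached_target: bool        # screenshots confirm the agent reached the target location
    reported_unavailable: bool  # answer explicitly says "searched / site does not provide this"
    fabricated: bool            # agent invented results or placeholder records


def tolerance_score(item: ItemContext,
                    upstream_unreachable: bool,
                    downstream_ratio: float = 0.4) -> Optional[float]:
    """Return the tolerance score, or None when the original rubric applies."""
    if not 0.3 <= downstream_ratio <= 0.5:
        raise ValueError("the PR description allows 30%-50% of full marks")
    if item.fabricated:
        return None  # fabrication is always scored under the original criteria
    if item.kind == "upstream":
        # Rule 1: full marks only with screenshot evidence AND an explicit report.
        if item.reached_target and item.reported_unavailable:
            return item.max_points
        return None  # e.g. skipped steps and merely claimed "not found"
    if item.kind == "downstream" and upstream_unreachable:
        # Rule 2: partial credit instead of an outright 0.
        return item.max_points * downstream_ratio
    return None
```

A `None` result means neither tolerance rule fires and the item is scored exactly as before the change.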
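About the enumerations in the rules
The verb enumerations in the middle of the rules ("找到 / 提取 / 统计 / ..." (find / extract / count / ...), "创建 Lead/Quotation/..." (create Lead/Quotation/...), and so on) do look somewhat cumbersome, but in experiments, every attempt to delete or generalize them left the judge short of the expected behavior:
The reason is that gpt-5.4 needs a lexical bridge for the classification task "original rubric-item text in the task → upstream/downstream category": the verbs in the enumeration must directly overlap with the wording of the rubric items in the task data before the judge classifies them reliably. The full enumeration may look like "patching", but it is currently the most stable form among prompt-only changes.
If we later want to eliminate the dependence on enumeration, the classification can be moved to the code side: add a dict of ~30 keywords to lexmount_eval.py and prefix each rubric item with `[上游]/[下游]` (upstream/downstream) at render time, which would let the prompt shrink to about 60 characters. That requires roughly 20 additional lines of code; within the scope of this PR, the prompt-only form is used.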
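A minimal sketch of what that code-side classification could look like. It is not part of this PR, and the names `KEYWORD_CATEGORY` and `tag_rubric_item` are hypothetical. The keywords are only the ones quoted above; the full ~30-entry list is not given here. Letting a downstream match win when an item contains both kinds of verbs is also an assumption.

```python
# Hypothetical keyword table; only the verbs quoted in the PR text are included.
KEYWORD_CATEGORY = {
    "找到": "上游",  # find
    "提取": "上游",  # extract
    "统计": "上游",  # count
    "创建": "下游",  # create (Lead, Quotation, ...)
}


def tag_rubric_item(text: str) -> str:
    """Prefix a rubric item with [上游] or [下游] based on keyword overlap.

    Items matching no keyword are returned unchanged, so the judge
    scores them under the original criteria.
    """
    matched = {cat for kw, cat in KEYWORD_CATEGORY.items() if kw in text}
    if "下游" in matched:
        # Assumption: an item that mentions a write operation is treated as
        # downstream even if it also mentions a retrieval verb.
        return f"[下游] {text}"
    if "上游" in matched:
        return f"[上游] {text}"
    return text


if __name__ == "__main__":
    for item in ["提取前三条搜索结果", "创建 Lead 记录", "页面截图清晰"]:
        print(tag_rubric_item(item))
```

Because the prefix is computed by exact substring match, this keeps the same lexical overlap that the enumeration provides in the prompt, but moves it out of the text the judge has to read.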
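Experiment log
For the full 8-round prompt iteration process, per-task score comparison, MAE quantification, and cross-validation on the email split, see:
Data: scores were produced with `browser-use × gpt-5.5 × LexBench-Browser/cross_system` (10 tasks) and `/email` (10 tasks).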
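Validation results
Note: the goal of this PR is not to raise the pass rate (the PASS count is unchanged, 2/10 → 2/10). It is to lift the 6 honest-agent cases that v0 systematically zero-scored from 22-33 points into a reasonable partial-credit range of 43-63 points.
Test plan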
Made with Cursor