Add LexBench post-attribution rerun workflow by XuweiDing04 · Pull Request #68 · lexmount/browseruse-agent-bench

XuweiDing04 · 2026-06-25T05:00:34Z

Summary

This PR packages the LexBench-Browser automated evaluation workflow and updates the rerun logic to avoid spending judge tokens on deterministic run failures.

Recommended flow:

run benchmark -> hard artifact pre-check -> eval non-hard tasks -> failure attribution on non-hard failures -> post-attribution rerun check -> rerun selected tasks -> re-eval -> final attribution / visualization

Final rerun candidate set:

hard_artifact_rerun
∪ taxonomy_primary_M3.2_or_M3.3_on_non_hard_tasks
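
The set expression above can be sketched in Python as follows. This is a minimal illustration, not the logic of scripts/collect_lexbench_rerun_candidates.py: the function name and the shape of the inputs (a list of hard-hit task ids and a dict of task id to primary taxonomy label) are assumptions made for the example.

```python
def final_rerun_candidates(hard_artifact_rerun, primary_labels):
    """Union of hard-hit tasks and non-hard tasks labeled M3.2 or M3.3.

    hard_artifact_rerun: iterable of task ids flagged by the hard pre-check.
    primary_labels: dict of task id -> primary taxonomy label (assumed shape).
    """
    hard = set(hard_artifact_rerun)
    taxonomy_hits = {
        task_id
        for task_id, label in primary_labels.items()
        # Only non-hard tasks reach attribution, so hard ids are skipped here.
        if task_id not in hard and label in {"M3.2", "M3.3"}
    }
    return hard | taxonomy_hits


candidates = final_rerun_candidates(
    ["t1", "t2"],
    {"t3": "M3.3", "t4": "M1", "t5": "M3.2"},
)
print(sorted(candidates))  # ['t1', 't2', 't3', 't5']
```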

The hard pre-check inspects each task's result JSON and latest run log for deterministic failures such as early internal max-step breaks, "Stopping due to 5 consecutive failures", "Result failed 6/6 times: LLM call timed out", and "ERR_TUNNEL_CONNECTION_FAILED". Those tasks go directly to rerun and can be excluded from eval and failure attribution, so attribution is reserved for the remaining ambiguous failures.
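
A log-based pre-check of this kind could look roughly like the sketch below. It covers only the three log signatures quoted above; the helper names are hypothetical, and generalizing the counts ("5", "6/6") to any digits is an assumption. The early max-step break is left out because its log text is not shown in this PR.

```python
import re

# Signatures quoted in the PR description; digit counts are generalized (assumption).
HARD_FAILURE_PATTERNS = [
    re.compile(r"Stopping due to \d+ consecutive failures"),
    re.compile(r"Result failed \d+/\d+ times:\s*LLM call timed out"),
    re.compile(r"ERR_TUNNEL_CONNECTION_FAILED"),
]


def is_hard_failure(log_text):
    """Return True if the run log contains any known deterministic failure."""
    return any(p.search(log_text) for p in HARD_FAILURE_PATTERNS)


def split_hard_tasks(logs_by_task):
    """Split task ids into (hard, remaining) based on their latest run log."""
    hard = {t for t, log in logs_by_task.items() if is_hard_failure(log)}
    remaining = set(logs_by_task) - hard
    return sorted(hard), sorted(remaining)
```

Tasks in the first list would go straight to rerun; only the second list would be passed on to eval and failure attribution.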

What changed

Added scripts/collect_lexbench_rerun_candidates.py to generate hard and post-attribution rerun task ids.
Added eval task-id include/exclude filters, including --exclude-task-ids-file, so hard-hit tasks can skip judge calls.
Updated LexBench synthetic eval backfill to respect include/exclude task filters.
Added failure attribution prompt and runner for M1/M2/M3 taxonomy classification.
Added M3.3 api-log audit and failure taxonomy visualization/report scripts.
Updated README, README_ZH, workflow docs, rerun rules, and 12-model validation docs to describe the token-efficient flow.
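
To illustrate how include/exclude task filters might behave, here is a minimal sketch. Only the --exclude-task-ids-file flag name comes from this PR; the file format (one task id per line, blank lines and # comments ignored) and both helper names are assumptions, and the real evaluator may differ.

```python
from pathlib import Path


def read_task_ids(path):
    """Read task ids from a text file, one per line (assumed format)."""
    ids = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


def filter_task_ids(all_ids, include=None, exclude=None):
    """Keep ids that are in `include` (if given) and not in `exclude`.

    Input order is preserved so downstream reports stay stable.
    """
    include_set = set(include) if include is not None else None
    exclude_set = set(exclude or ())
    return [
        task_id
        for task_id in all_ids
        if (include_set is None or task_id in include_set)
        and task_id not in exclude_set
    ]
```

With a filter like this, the hard-hit ids written by the pre-check can be passed as the exclude set, so those tasks never reach the judge.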

Validation

/Users/abc/Desktop/lexmount/browseruse-agent-bench/.venvs/browser_use/bin/python -m py_compile browseruse_bench/eval/base.py browseruse_bench/eval/lexbench_browser/evaluator.py browseruse_bench/cli/eval.py browseruse_bench/cli/run_eval.py scripts/judge_lexbench_failure_taxonomy.py scripts/collect_lexbench_rerun_candidates.py scripts/audit_m3_3_api_log_failures.py
PYTHONPATH=. /Users/abc/Desktop/lexmount/browseruse-agent-bench/.venvs/browser_use/bin/python scripts/eval.py --help | rg "exclude-task|task-ids"
PYTHONPATH=. /Users/abc/Desktop/lexmount/browseruse-agent-bench/.venvs/browser_use/bin/python scripts/judge_lexbench_failure_taxonomy.py --help | rg "exclude-task|task-ids"
PYTHONPATH=. /Users/abc/Desktop/lexmount/browseruse-agent-bench/.venvs/browser_use/bin/python scripts/collect_lexbench_rerun_candidates.py --help | rg "artifact-mode|taxonomy|out-dir|include-protocol"
git diff --check

12-model validation documented in docs/rerun-rule-validation-12-models.md:

M3.2/M3.3 target: 171
hit: 171
recall: 100.0%
total candidates: 219
false positives vs primary M3.2/M3.3: 48
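
The figures above are mutually consistent, as the short check below shows. The precision value is not reported in the PR; it is derived here from the reported counts, and the function name is illustrative.

```python
def rerun_rule_metrics(target, hit, total_candidates):
    """Derive recall, false positives, and precision from the reported counts."""
    recall = hit / target
    false_positives = total_candidates - hit
    precision = hit / total_candidates
    return recall, false_positives, precision


recall, false_positives, precision = rerun_rule_metrics(171, 171, 219)
print(f"recall: {recall:.1%}")                 # recall: 100.0%
print(f"false positives: {false_positives}")   # false positives: 48
print(f"precision: {precision:.1%}")           # precision: 78.1%
```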

chatgpt-codex-connector · 2026-06-25T05:00:38Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Add LexBench post-attribution rerun workflow

b2f1c2a

Optimize rerun workflow with hard pre-check

a585910

waple0820 approved these changes Jun 25, 2026

waple0820 merged commit 8cd99bb into lexmount:main Jun 25, 2026
2 of 3 checks passed

waple0820 merged 2 commits into lexmount:main from XuweiDing04:codex/post-attribution-rerun-workflow

3 participants
