Add LexBench post-attribution rerun workflow #68

Merged
waple0820 merged 2 commits into lexmount:main from XuweiDing04:codex/post-attribution-rerun-workflow
Jun 25, 2026

@XuweiDing04 XuweiDing04 commented Jun 25, 2026

Contributor

Summary

This PR packages the LexBench-Browser automated evaluation workflow and updates the rerun logic to avoid spending judge tokens on deterministic run failures.

Recommended flow:

run benchmark -> hard artifact pre-check -> eval non-hard tasks -> failure attribution on non-hard failures -> post-attribution rerun check -> rerun selected tasks -> re-eval -> final attribution / visualization

Final rerun candidate set:

hard_artifact_rerun
∪ taxonomy_primary_M3.2_or_M3.3_on_non_hard_tasks
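
The union above can be sketched in a few lines of Python. This is only an illustration of the set logic, not the code in scripts/collect_lexbench_rerun_candidates.py: the function name, the argument shapes, and the label strings used as dictionary values are assumptions.

```python
# Hypothetical sketch of the final rerun candidate set.
# Names and data shapes are assumptions, not the script's actual interface.
RERUN_LABELS = {"M3.2", "M3.3"}


def collect_rerun_candidates(hard_ids, primary_labels):
    """Return hard_artifact_rerun ∪ (primary M3.2/M3.3 on non-hard tasks).

    hard_ids: iterable of task ids flagged by the hard artifact pre-check.
    primary_labels: dict mapping task id -> primary taxonomy label.
    """
    hard = set(hard_ids)
    taxonomy = {
        task_id
        for task_id, label in primary_labels.items()
        if label in RERUN_LABELS and task_id not in hard
    }
    return hard | taxonomy


candidates = collect_rerun_candidates(
    ["t1", "t2"],
    {"t2": "M3.2", "t3": "M3.3", "t4": "M1"},
)
print(sorted(candidates))
```

Restricting the taxonomy side to non-hard tasks does not change the union, but it mirrors the flow in which hard-hit tasks never reach attribution.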

The hard pre-check catches result-json and latest run-log failures such as early internal max-step breaks, "Stopping due to 5 consecutive failures", "Result failed 6/6 times: LLM call timed out", and ERR_TUNNEL_CONNECTION_FAILED. Those tasks go directly to rerun and can be excluded from both eval and failure attribution. Attribution is then reserved for the remaining ambiguous failures.
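
A minimal sketch of the run-log side of such a pre-check, assuming a plain substring match on the log text. The marker strings come from the description above; the function names and the in-memory log mapping are hypothetical, and the early max-step check is left out because it depends on the result JSON layout, which is not shown here.

```python
# Hypothetical sketch: flag tasks whose latest run log contains a
# deterministic failure marker, so they can skip the judge entirely.
HARD_FAILURE_MARKERS = (
    "Stopping due to 5 consecutive failures",
    "Result failed 6/6 times: LLM call timed out",
    "ERR_TUNNEL_CONNECTION_FAILED",
)


def find_hard_marker(log_text):
    """Return the first marker found in the log text, or None."""
    for marker in HARD_FAILURE_MARKERS:
        if marker in log_text:
            return marker
    return None


def split_hard_tasks(logs_by_task):
    """Split tasks into hard hits (id -> marker) and remaining task ids."""
    hard, remaining = {}, []
    for task_id, log_text in logs_by_task.items():
        marker = find_hard_marker(log_text)
        if marker is None:
            remaining.append(task_id)
        else:
            hard[task_id] = marker
    return hard, remaining


logs = {
    "task-a": "navigate failed: net::ERR_TUNNEL_CONNECTION_FAILED",
    "task-b": "agent finished with an answer",
}
hard, remaining = split_hard_tasks(logs)
print(hard, remaining)
```

Recording which marker fired, rather than just a boolean, keeps the rerun list auditable.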

What changed

  • Added scripts/collect_lexbench_rerun_candidates.py to generate hard and post-attribution rerun task ids.
  • Added eval task-id include/exclude filters, including --exclude-task-ids-file, so hard-hit tasks can skip judge calls.
  • Updated LexBench synthetic eval backfill to respect include/exclude task filters.
  • Added failure attribution prompt and runner for M1/M2/M3 taxonomy classification.
  • Added M3.3 api-log audit and failure taxonomy visualization/report scripts.
  • Updated README, README_ZH, workflow docs, rerun rules, and 12-model validation docs to describe the token-efficient flow.
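
The include/exclude filtering can be pictured roughly as follows. This is a sketch under stated assumptions: the file format consumed by --exclude-task-ids-file is not described in this PR, so the one-id-per-line layout with optional # comments is a guess, and both helper names are hypothetical.

```python
# Hypothetical sketch of task-id include/exclude filtering.
# Assumed file format: one task id per line, blank lines and "#" comments ignored.
def parse_task_ids(text):
    """Parse task ids from the text of an ids file."""
    ids = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


def filter_tasks(task_ids, include=None, exclude=None):
    """Keep ids that pass the include filter (if any) and are not excluded."""
    exclude = exclude or set()
    return [
        task_id
        for task_id in task_ids
        if (include is None or task_id in include) and task_id not in exclude
    ]


hard_hits = parse_task_ids("# hard artifact hits\ntask-2\n\ntask-4\n")
to_eval = filter_tasks(["task-1", "task-2", "task-3", "task-4"], exclude=hard_hits)
print(to_eval)
```

Applying exclusion after inclusion means a task listed in both is skipped, which is the conservative choice when the goal is to save judge calls.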

Validation

  • /Users/abc/Desktop/lexmount/browseruse-agent-bench/.venvs/browser_use/bin/python -m py_compile browseruse_bench/eval/base.py browseruse_bench/eval/lexbench_browser/evaluator.py browseruse_bench/cli/eval.py browseruse_bench/cli/run_eval.py scripts/judge_lexbench_failure_taxonomy.py scripts/collect_lexbench_rerun_candidates.py scripts/audit_m3_3_api_log_failures.py
  • PYTHONPATH=. /Users/abc/Desktop/lexmount/browseruse-agent-bench/.venvs/browser_use/bin/python scripts/eval.py --help | rg "exclude-task|task-ids"
  • PYTHONPATH=. /Users/abc/Desktop/lexmount/browseruse-agent-bench/.venvs/browser_use/bin/python scripts/judge_lexbench_failure_taxonomy.py --help | rg "exclude-task|task-ids"
  • PYTHONPATH=. /Users/abc/Desktop/lexmount/browseruse-agent-bench/.venvs/browser_use/bin/python scripts/collect_lexbench_rerun_candidates.py --help | rg "artifact-mode|taxonomy|out-dir|include-protocol"
  • git diff --check

The 12-model validation is documented in docs/rerun-rule-validation-12-models.md:

M3.2/M3.3 target: 171
hit: 171
recall: 100.0%
total candidates: 219
false positives vs primary M3.2/M3.3: 48
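
The figures above are mutually consistent, which a short calculation confirms. The precision value is derived here from the reported counts and is not stated in the PR itself; variable names are illustrative.

```python
# Recompute the validation metrics from the reported counts.
target = 171            # tasks with primary label M3.2 or M3.3
hit = 171               # targets covered by the rerun candidate set
total_candidates = 219  # size of the rerun candidate set

recall = hit / target
false_positives = total_candidates - hit
precision = hit / total_candidates  # derived, not reported in the PR

print(f"recall: {recall:.1%}")                    # recall: 100.0%
print(f"false positives: {false_positives}")      # false positives: 48
print(f"precision: {precision:.1%}")              # precision: 78.1%
```

In other words, the rule reruns every target task at the cost of 48 extra reruns, roughly one in five candidates.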

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@waple0820 waple0820 merged commit 8cd99bb into lexmount:main Jun 25, 2026
2 of 3 checks passed

3 participants