Merged
42 changes: 42 additions & 0 deletions README.md
@@ -201,6 +201,48 @@ bubench eval --agent browser-use --data LexBench-Browser --model-id gpt-4.1
> `--split` is optional — the benchmark's `default_split` (from `data_info.json`) is used automatically. Pass `--split <name>` only to override the default.
> For the full parameter reference, see the [Quickstart docs](https://docs.bubench.lexmount.io/en/quickstart).

**Post-attribution rerun workflow**

For LexBench-Browser result analysis, use the automated post-run workflow:

```text
run benchmark -> hard artifact pre-check -> eval remaining results
-> failure attribution excluding hard-hit tasks -> post-attribution rerun check
-> rerun selected tasks -> re-eval -> final attribution / visualization
```

The final rerun candidate set is:

```text
hard_artifact_rerun
∪ taxonomy_primary_M3.2_or_M3.3_on_non_hard_tasks
```
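
A minimal Python sketch of the set expression above. It assumes the hard-artifact hits arrive as a collection of task IDs and the failure-attribution output as a mapping from task ID to primary taxonomy code; the function name and both input shapes are illustrative, not the actual interface of `collect_lexbench_rerun_candidates.py`.

```python
RERUN_PRIMARY_CODES = {"M3.2", "M3.3"}  # Access Barrier, Site Limitation


def final_rerun_candidates(hard_artifact_ids, primary_code_by_task):
    """Union hard-artifact hits with non-hard tasks whose primary code is M3.2 or M3.3."""
    hard = set(hard_artifact_ids)
    taxonomy_hits = {
        task_id
        for task_id, code in primary_code_by_task.items()
        if task_id not in hard and code in RERUN_PRIMARY_CODES
    }
    return hard | taxonomy_hits


# Hypothetical inputs: t1/t2 are hard-artifact hits, the rest come from attribution.
candidates = final_rerun_candidates(
    {"t1", "t2"},
    {"t2": "M3.3", "t3": "M3.2", "t4": "M1.1", "t5": "M3.1"},
)
print(sorted(candidates))
```

In this sketch a task with primary code M3.1 (Bot Defense) is not selected, because only M3.2 and M3.3 appear in the candidate-set definition.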

First, collect the deterministic hard failures. These go straight into the rerun set and can be excluded from judge calls:

```bash
PYTHONPATH=. python scripts/collect_lexbench_rerun_candidates.py \
--model MODEL_DIR_NAME \
--timestamp TIMESTAMP \
--artifact-mode hard \
--out-dir experiments/LexBench-Browser/All/browser-use/MODEL_DIR_NAME/TIMESTAMP/rerun_candidates_hard
```

Then run eval / failure attribution on non-hard tasks and generate the final
rerun task ids:

```bash
PYTHONPATH=. python scripts/collect_lexbench_rerun_candidates.py \
--model MODEL_DIR_NAME \
--timestamp TIMESTAMP \
--artifact-mode hard \
--include-taxonomy-web-constraints
```

See [LexBench automated evaluation system](docs/lexbench-automated-evaluation-system.md),
[rerun check rules](docs/result-rerun-check-rules.md), and
[12-model rerun rule validation](docs/rerun-rule-validation-12-models.md).

## Data Loading

Use `--data-source` to control where benchmark data is loaded from:
41 changes: 41 additions & 0 deletions README_ZH.md
@@ -200,6 +200,47 @@ bubench eval --agent browser-use --data LexBench-Browser --model-id gpt-4.1

> For the full parameter reference, see the [Quickstart docs](https://docs.bubench.lexmount.io/zh/quickstart).

**Post-attribution rerun workflow**

For LexBench-Browser result analysis, this automated post-run workflow is recommended:

```text
run benchmark -> hard artifact pre-check -> eval remaining results
-> failure attribution excluding hard-hit tasks -> post-attribution rerun check
-> rerun selected tasks -> re-eval -> final attribution / visualization
```

The final rerun candidate set is:

```text
hard_artifact_rerun
∪ taxonomy_primary_M3.2_or_M3.3_on_non_hard_tasks
```

First collect the deterministic hard failures; these go straight into the rerun set and are excluded from judge calls:

```bash
PYTHONPATH=. python scripts/collect_lexbench_rerun_candidates.py \
--model MODEL_DIR_NAME \
--timestamp TIMESTAMP \
--artifact-mode hard \
--out-dir experiments/LexBench-Browser/All/browser-use/MODEL_DIR_NAME/TIMESTAMP/rerun_candidates_hard
```

Then run eval / failure attribution on the non-hard tasks and generate the final rerun task ids:

```bash
PYTHONPATH=. python scripts/collect_lexbench_rerun_candidates.py \
--model MODEL_DIR_NAME \
--timestamp TIMESTAMP \
--artifact-mode hard \
--include-taxonomy-web-constraints
```

See [LexBench automated evaluation system](docs/lexbench-automated-evaluation-system.md),
[rerun check rules](docs/result-rerun-check-rules.md), and
[12-model rerun rule validation](docs/rerun-rule-validation-12-models.md).

## Data Loading

Use `--data-source` to control the data source:
30 changes: 30 additions & 0 deletions browseruse_bench/cli/eval.py
@@ -229,6 +229,14 @@ def _parse_extra_args(extra_args: list[str]) -> dict[str, Any]:
    return extra


def _read_task_ids_file(path: Path | None) -> list[str]:
    if path is None:
        return []
    if not path.exists():
        raise SystemExit(f"[FAILED] Task id file does not exist: {path}")
    return path.read_text(encoding="utf-8").split()
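
As a quick illustration of the file format this helper accepts, the sketch below writes a throwaway file and parses it the same way. The file name and task IDs are made up for the example.

```python
import tempfile
from pathlib import Path

# str.split() with no arguments treats any run of whitespace (spaces, tabs,
# newlines, blank lines) as a single separator, so IDs can be laid out freely.
with tempfile.TemporaryDirectory() as tmp:
    ids_file = Path(tmp) / "rerun_task_ids.txt"  # hypothetical file name
    ids_file.write_text("task_001 task_002\n\ntask_003\t task_004\n", encoding="utf-8")
    parsed_ids = ids_file.read_text(encoding="utf-8").split()

print(parsed_ids)
```

One consequence of this parsing is that an empty or whitespace-only file yields an empty list, which the caller then treats the same as passing no file.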


def run_evaluation(
    agent_name: str,
    benchmark_name: str,
@@ -303,6 +311,14 @@ def run_evaluation(
        "force_download": bool(getattr(args, "force_download", False)),
    }
    extra.update(_parse_extra_args(extra_args))
    task_ids = [str(task_id) for task_id in (getattr(args, "task_ids", None) or [])]
    task_ids.extend(_read_task_ids_file(getattr(args, "task_ids_file", None)))
    exclude_task_ids = [str(task_id) for task_id in (getattr(args, "exclude_task_ids", None) or [])]
    exclude_task_ids.extend(_read_task_ids_file(getattr(args, "exclude_task_ids_file", None)))
    if task_ids:
        extra["task_ids"] = task_ids
    if exclude_task_ids:
        extra["exclude_task_ids"] = exclude_task_ids
    if max_tokens is not None:
        extra["max_tokens"] = max_tokens

@@ -425,6 +441,20 @@ def configure_eval_parser(parser: argparse.ArgumentParser, config: dict[str, Any
        action="store_true",
        help="Rerun evaluation (default reuses existing results, only runs failure classification)",
    )
    parser.add_argument("--task-ids", nargs="*", default=[], help="Only evaluate these task IDs.")
    parser.add_argument(
        "--task-ids-file",
        type=Path,
        default=None,
        help="Whitespace-separated task IDs to evaluate.",
    )
    parser.add_argument("--exclude-task-ids", nargs="*", default=[], help="Do not evaluate these task IDs.")
    parser.add_argument(
        "--exclude-task-ids-file",
        type=Path,
        default=None,
        help="Whitespace-separated task IDs to skip during evaluation.",
    )
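
The sketch below shows how these four flags might behave together, using a stand-in parser rather than the real `bubench eval` CLI. The `eval-sketch` program name and the sample IDs are assumptions; the merge into `extra` mirrors the logic added to `run_evaluation`, minus the file reading.

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(prog="eval-sketch")  # stand-in, not the real CLI
parser.add_argument("--task-ids", nargs="*", default=[])
parser.add_argument("--task-ids-file", type=Path, default=None)
parser.add_argument("--exclude-task-ids", nargs="*", default=[])
parser.add_argument("--exclude-task-ids-file", type=Path, default=None)

# nargs="*" collects zero or more values into a list of strings.
args = parser.parse_args(["--task-ids", "12", "34", "--exclude-task-ids", "34"])

extra = {}
task_ids = [str(task_id) for task_id in (args.task_ids or [])]
exclude_task_ids = [str(task_id) for task_id in (args.exclude_task_ids or [])]
# Filters are only recorded when non-empty, so "no flag" and "empty list" look the same.
if task_ids:
    extra["task_ids"] = task_ids
if exclude_task_ids:
    extra["exclude_task_ids"] = exclude_task_ids

print(extra)
```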
    parser.add_argument(
        "--agent-config",
        type=Path,
1 change: 1 addition & 0 deletions browseruse_bench/cli/run_eval.py
@@ -117,6 +117,7 @@ def _report_output_dir(report_file: str, run_dir: Path) -> None:
# stay on both.)
_EVAL_ONLY_VALUE_FLAGS = {
    "--score-threshold", "--num-worker", "--api-key", "--base-url", "--eval-strategy",
    "--task-ids-file", "--exclude-task-ids-file",
}
_EVAL_ONLY_BOOL_FLAGS = {"--force-reeval"}

22 changes: 21 additions & 1 deletion browseruse_bench/eval/base.py
@@ -122,11 +122,21 @@ def run(self) -> int:
        tasks = self.load_tasks()
        completed = self.list_completed_tasks()
        already = self._resume_skip_set()
        include_task_ids = self._task_id_filter("task_ids")
        exclude_task_ids = self._task_id_filter("exclude_task_ids")
        pending = [
            p.name for p in completed
            if p.name not in already and p.name in tasks
            and (not include_task_ids or p.name in include_task_ids)
            and p.name not in exclude_task_ids
        ]
-       logger.info("Evaluating %d tasks (skip %d already done)", len(pending), len(already))
+       logger.info(
+           "Evaluating %d tasks (skip %d already done, include_filter=%d, exclude_filter=%d)",
+           len(pending),
+           len(already),
+           len(include_task_ids),
+           len(exclude_task_ids),
+       )
        for result in self._run_iteration(pending, tasks):
            self._append_result(result)
        # Hook runs before summary so subclasses (e.g. LexBench coverage backfill)
@@ -138,6 +148,16 @@ def run(self) -> int:
        self._generate_summary(records)
        return 0

    def _task_id_filter(self, key: str) -> Set[str]:
        raw = self.args.extra.get(key)
        if raw is None:
            return set()
        if isinstance(raw, str):
            return {item for item in raw.split() if item}
        if isinstance(raw, (list, tuple, set)):
            return {str(item) for item in raw if str(item)}
        return {str(raw)}
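
To make the filter semantics concrete, here is a standalone sketch of the same normalization plus the include/exclude selection used in `run`. The free functions `normalize_task_ids` and `select_pending` are hypothetical names for illustration; the real logic lives on the evaluator class and also checks the resume set and the loaded tasks.

```python
def normalize_task_ids(raw):
    """Turn None, a whitespace-separated string, a collection, or a scalar into a set of str."""
    if raw is None:
        return set()
    if isinstance(raw, str):
        return set(raw.split())
    if isinstance(raw, (list, tuple, set)):
        return {str(item) for item in raw if str(item)}
    return {str(raw)}


def select_pending(completed, include, exclude):
    """Keep completed tasks that pass the include filter (if any) and are not excluded."""
    return [
        name for name in completed
        if (not include or name in include) and name not in exclude
    ]


completed = ["101", "102", "103", "104"]
include = normalize_task_ids("101 102 103")  # string form, e.g. from an extra arg
exclude = normalize_task_ids([102])          # non-string items are stringified
pending = select_pending(completed, include, exclude)
print(pending)
```

Two properties follow from this shape: an empty include set means "no restriction", and a task listed in both filters is excluded.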

    def _load_all_records(self) -> List[Dict[str, Any]]:
        """Load every record currently appended to the results JSONL on disk."""
        path = self.results_path()
6 changes: 6 additions & 0 deletions browseruse_bench/eval/lexbench_browser/evaluator.py
@@ -568,6 +568,12 @@ def post_eval_hook(self, records: list[dict[str, Any]]) -> None:
            return
        attempted_ids = {d.name for d in self.args.trajectories_dir.iterdir() if d.is_dir()}
        expected = [tid for tid in self._expected_task_ids if tid in attempted_ids]
        include_task_ids = self._task_id_filter("task_ids")
        exclude_task_ids = self._task_id_filter("exclude_task_ids")
        if include_task_ids:
            expected = [tid for tid in expected if tid in include_task_ids]
        if exclude_task_ids:
            expected = [tid for tid in expected if tid not in exclude_task_ids]
        if not expected:
            return
        records_by_task_id = {
@@ -0,0 +1,77 @@
You are an expert browser-agent benchmark analyst. A browser agent failed a LexBench-Browser task. Classify the failure into the taxonomy below.

Use the supplied task spec, scoring rubric, evaluator feedback, agent final answer, compact action trace, runtime result, and screenshots. Prefer the evidence from the trajectory and evaluator feedback over assumptions. Do not rely on the old A1/B1/C1 failure category except as weak auxiliary context.

## Taxonomy

### M1 Task Reasoning
Failures in task understanding, decision making, selection, evidence use, or safety judgment.

M1.1 Requirement Following
The agent misses explicit task requirements, required websites, required fields, required output format, required number of items, or the required safety/legal response. Use this for incomplete fulfillment of the user's objective even when the browser interactions were technically possible.

M1.2 Target Selection
The agent applies the wrong scope, entity, date, city, item, channel, season, product, ranking criterion, filter, sort order, or comparison logic. Use this when it reaches usable pages but chooses the wrong target or fails to enforce "latest", "highest", "most viewed", "top N", date windows, or cross-platform comparison criteria.

M1.3 Evidence Grounding
The agent fails to extract information that is available, extracts the wrong fields, mixes fields from different items, fabricates or hallucinates values, reports unverifiable data, or answers without enough evidence. Use this when the central problem is grounding and information fidelity.

### M2 Action Execution
Failures in controlling the browser-agent loop, UI operations, recovery behavior, or tool/output protocol. These are agent capability failures, not external website failures, unless the page is blocked or unavailable.

M2.1 UI Misoperation
The agent cannot operate normal UI elements: search boxes, buttons, date pickers, dropdowns, filters, tabs, popups, modals, pagination, detail-page links, window/tab switching, or page scrolling. Use this when the site is accessible but the agent cannot drive the interface to the needed state.

M2.2 Infinite Loop
The agent repeats ineffective actions, gets stuck, fails to recover from a bad page state, runs out of steps, times out, or completes only a small part of a long multi-item task due to poor workflow control. Use this for loops, dead ends, and poor long-horizon task management.

M2.3 Format Breakdown
The agent fails because of malformed JSON action output, invalid tool-call structure, parser failures, missing final response, model service no-response, failed file saving, corrupted artifacts, or required output files not being produced. Use this only when protocol or artifact generation is a direct cause of failure.

### M3 Web Constraints
Failures mainly caused by external web environment constraints. These may still expose agent limits, but the primary obstacle is the website or access environment.

M3.1 Bot Defense
The target site blocks automation with CAPTCHA, Cloudflare, PerimeterX, slider verification, "robot or human", 403 caused by automation, rate limits, "Too Many Requests", security control, abnormal traffic, or similar bot-detection defenses.

M3.2 Access Barrier
The needed content or action is blocked by login, session expiry, SMS/QR authentication, membership, VIP, paywall, permissions, account-only views, paid downloads, copyright restrictions, or regional access restrictions.

M3.3 Site Limitation
The site is down, unreachable, returns 404/server errors, has empty DOM or SPA rendering failure, does not expose the requested content, lacks the requested filter/data, or the target content genuinely does not exist on the specified site. Use this when the environment itself makes the task impossible or under-specified.

## OTHER

Use OTHER only when none of the nine categories captures the core failure. If OTHER is used, provide a short phrase in other_phrase. Do not use OTHER for common combinations of the above categories. Prefer assigning one or more existing categories whenever possible.

## Multi-label rules

- Assign every category that substantially contributed to the failed outcome.
- A trajectory may have one or multiple codes.
- Err on the side of inclusion for real contributing failures, but do not add categories that are only mentioned in the task text.
- Choose primary_code as the most direct cause that explains why the run failed.
- If the agent is blocked by CAPTCHA or rate limiting, include M3.1 even if it also fails later.
- If the page is accessible but the agent misses filters, sorting, or target selection, use M1.2, not M3.3.
- If the page is accessible and the answer is unsupported, use M1.3.
- If the agent cannot click or manipulate a normal accessible interface, use M2.1.
- If repeated ineffective attempts, timeout, or step exhaustion prevent completion, use M2.2.
- If the run stops because the model produced malformed JSON, tool-call parsing failed, or no final response was produced, use M2.3.

## Output

Return only a JSON object matching this schema:

{
  "primary_code": "M1.1",
  "codes": ["M1.1", "M2.2"],
  "other_phrase": null,
  "confidence": "high",
  "reasoning": "Short evidence-based explanation.",
  "evidence": [
    "Concrete evidence from evaluator feedback or trajectory.",
    "Concrete evidence from agent answer or screenshot."
  ]
}

Allowed codes are M1.1, M1.2, M1.3, M2.1, M2.2, M2.3, M3.1, M3.2, M3.3, OTHER.
confidence must be high, medium, or low.
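
A consumer of this prompt would likely want to check the judge's reply before using it. The following is a minimal sketch of such a check, assuming the reply arrives as a raw JSON string; `validate_attribution` is a hypothetical helper, not part of this PR, and the rule that OTHER requires a non-empty `other_phrase` is an interpretation of the OTHER section above.

```python
import json

ALLOWED_CODES = {
    "M1.1", "M1.2", "M1.3",
    "M2.1", "M2.2", "M2.3",
    "M3.1", "M3.2", "M3.3",
    "OTHER",
}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def validate_attribution(raw: str) -> dict:
    """Parse a judge reply and check it against the constraints stated in the prompt."""
    data = json.loads(raw)
    codes = data.get("codes")
    if not isinstance(codes, list) or not codes:
        raise ValueError("codes must be a non-empty list")
    if not all(isinstance(code, str) and code in ALLOWED_CODES for code in codes):
        raise ValueError(f"unknown code in {codes!r}")
    if data.get("primary_code") not in ALLOWED_CODES:
        raise ValueError("primary_code is not an allowed code")
    if data.get("confidence") not in ALLOWED_CONFIDENCE:
        raise ValueError("confidence must be high, medium, or low")
    # Assumption: OTHER should be accompanied by a short phrase.
    if "OTHER" in codes and not data.get("other_phrase"):
        raise ValueError("other_phrase is required when OTHER is used")
    return data


sample = json.dumps({
    "primary_code": "M3.2",
    "codes": ["M3.2", "M2.2"],
    "other_phrase": None,
    "confidence": "high",
    "reasoning": "Login wall blocked the listing page.",
    "evidence": ["Evaluator feedback notes a login prompt."],
})
result = validate_attribution(sample)
print(result["primary_code"])
```

The sketch deliberately does not require `primary_code` to appear in `codes`, since the prompt does not state that constraint.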