lexmount · waple0820 · Jun 25, 2026
diff --git a/README.md b/README.md
@@ -201,6 +201,48 @@ bubench eval --agent browser-use --data LexBench-Browser --model-id gpt-4.1
 > `--split` is optional — the benchmark's `default_split` (from `data_info.json`) is used automatically. Pass `--split <name>` only to override the default.
 > For the full parameter reference, see the [Quickstart docs](https://docs.bubench.lexmount.io/en/quickstart).
 
+**Post-attribution rerun workflow**
+
+For LexBench-Browser result analysis, use the automated post-run workflow:
+
+```text
+run benchmark -> hard artifact pre-check -> eval remaining results
+-> failure attribution excluding hard-hit tasks -> post-attribution rerun check
+-> rerun selected tasks -> re-eval -> final attribution / visualization
+```
+
+The final rerun candidate set is:
+
+```text
+hard_artifact_rerun
+∪ taxonomy_primary_M3.2_or_M3.3_on_non_hard_tasks
+```
+
+First collect the deterministic hard failures. These go straight into the rerun set and can be excluded from judge calls:
+
+```bash
+PYTHONPATH=. python scripts/collect_lexbench_rerun_candidates.py \
+  --model MODEL_DIR_NAME \
+  --timestamp TIMESTAMP \
+  --artifact-mode hard \
+  --out-dir experiments/LexBench-Browser/All/browser-use/MODEL_DIR_NAME/TIMESTAMP/rerun_candidates_hard
+```
+
+Then run eval and failure attribution on the non-hard tasks, and generate the final
+rerun task IDs:
+
+```bash
+PYTHONPATH=. python scripts/collect_lexbench_rerun_candidates.py \
+  --model MODEL_DIR_NAME \
+  --timestamp TIMESTAMP \
+  --artifact-mode hard \
+  --include-taxonomy-web-constraints
+```
+
+See [LexBench automated evaluation system](docs/lexbench-automated-evaluation-system.md),
+[rerun check rules](docs/result-rerun-check-rules.md), and
+[12-model rerun rule validation](docs/rerun-rule-validation-12-models.md).
+
 ## Data Loading
 
 Use `--data-source` to control where benchmark data is loaded from:
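Note on the README hunk above: the documented candidate set reads as a plain set union. The sketch below is only an illustration of that rule. The helper name `final_rerun_candidates` and the task-id-to-primary-code mapping are hypothetical, and this is not the logic of `scripts/collect_lexbench_rerun_candidates.py`, which this diff does not show.

```python
from typing import Dict, Set

# Primary codes that qualify a non-hard task for rerun, per the README formula.
RERUN_PRIMARY_CODES = {"M3.2", "M3.3"}


def final_rerun_candidates(
    hard_artifact_rerun: Set[str], primary_codes: Dict[str, str]
) -> Set[str]:
    """Hypothetical helper: hard-hit tasks plus M3.2/M3.3 tasks among the rest."""
    taxonomy_hits = {
        task_id
        for task_id, code in primary_codes.items()
        # Hard-hit tasks never reach the judge, so they are skipped here.
        if code in RERUN_PRIMARY_CODES and task_id not in hard_artifact_rerun
    }
    return hard_artifact_rerun | taxonomy_hits


# Illustrative ids only.
hard = {"t1", "t2"}
codes = {"t2": "M3.3", "t3": "M3.2", "t4": "M1.1", "t5": "M3.1"}
print(sorted(final_rerun_candidates(hard, codes)))  # ['t1', 't2', 't3']
```

M3.1 (bot defense) is deliberately absent from the qualifying codes, matching the formula in the README text.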

diff --git a/README_ZH.md b/README_ZH.md
@@ -200,6 +200,47 @@ bubench eval --agent browser-use --data LexBench-Browser --model-id gpt-4.1
 
 > 全量参数说明见[快速开始文档](https://docs.bubench.lexmount.io/zh/quickstart)。
 
+**Post-attribution 重测流程**
+
+LexBench-Browser 结果分析推荐使用这套自动化 post-run 流程：
+
+```text
+run benchmark -> hard artifact pre-check -> eval remaining results
+-> failure attribution excluding hard-hit tasks -> post-attribution rerun check
+-> rerun selected tasks -> re-eval -> final attribution / visualization
+```
+
+最终 rerun candidate 集合是：
+
+```text
+hard_artifact_rerun
+∪ taxonomy_primary_M3.2_or_M3.3_on_non_hard_tasks
+```
+
+先收集确定性的 hard failures，这部分可以直接进入 rerun，并从 judge 调用中排除：
+
+```bash
+PYTHONPATH=. python scripts/collect_lexbench_rerun_candidates.py \
+  --model MODEL_DIR_NAME \
+  --timestamp TIMESTAMP \
+  --artifact-mode hard \
+  --out-dir experiments/LexBench-Browser/All/browser-use/MODEL_DIR_NAME/TIMESTAMP/rerun_candidates_hard
+```
+
+然后对 non-hard tasks 跑 eval / failure attribution，再生成最终 rerun task ids：
+
+```bash
+PYTHONPATH=. python scripts/collect_lexbench_rerun_candidates.py \
+  --model MODEL_DIR_NAME \
+  --timestamp TIMESTAMP \
+  --artifact-mode hard \
+  --include-taxonomy-web-constraints
+```
+
+详见 [LexBench 自动化评测体系](docs/lexbench-automated-evaluation-system.md)、
+[rerun check rules](docs/result-rerun-check-rules.md) 和
+[12-model rerun rule validation](docs/rerun-rule-validation-12-models.md)。
+
 ## 数据加载
 
 通过 `--data-source` 控制数据来源：

diff --git a/browseruse_bench/cli/eval.py b/browseruse_bench/cli/eval.py
@@ -229,6 +229,14 @@ def _parse_extra_args(extra_args: list[str]) -> dict[str, Any]:
     return extra
 
 
+def _read_task_ids_file(path: Path | None) -> list[str]:
+    if path is None:
+        return []
+    if not path.is_file():
+        raise SystemExit(f"[FAILED] Task id file does not exist or is not a file: {path}")
+    return path.read_text(encoding="utf-8").split()
+
+
 def run_evaluation(
     agent_name: str,
     benchmark_name: str,
@@ -303,6 +311,14 @@ def run_evaluation(
         "force_download": bool(getattr(args, "force_download", False)),
     }
     extra.update(_parse_extra_args(extra_args))
+    task_ids = [str(task_id) for task_id in (getattr(args, "task_ids", None) or [])]
+    task_ids.extend(_read_task_ids_file(getattr(args, "task_ids_file", None)))
+    exclude_task_ids = [str(task_id) for task_id in (getattr(args, "exclude_task_ids", None) or [])]
+    exclude_task_ids.extend(_read_task_ids_file(getattr(args, "exclude_task_ids_file", None)))
+    if task_ids:
+        extra["task_ids"] = task_ids
+    if exclude_task_ids:
+        extra["exclude_task_ids"] = exclude_task_ids
     if max_tokens is not None:
         extra["max_tokens"] = max_tokens
 
@@ -425,6 +441,20 @@ def configure_eval_parser(parser: argparse.ArgumentParser, config: dict[str, Any
         action="store_true",
         help="Rerun evaluation (default reuses existing results, only runs failure classification)",
     )
+    parser.add_argument("--task-ids", nargs="*", default=[], help="Only evaluate these task IDs.")
+    parser.add_argument(
+        "--task-ids-file",
+        type=Path,
+        default=None,
+        help="File of whitespace-separated task IDs to evaluate.",
+    )
+    parser.add_argument("--exclude-task-ids", nargs="*", default=[], help="Do not evaluate these task IDs.")
+    parser.add_argument(
+        "--exclude-task-ids-file",
+        type=Path,
+        default=None,
+        help="File of whitespace-separated task IDs to skip during evaluation.",
+    )
     parser.add_argument(
         "--agent-config",
         type=Path,
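Note on the `eval.py` hunks above: IDs passed on the command line and IDs read from a file are concatenated, and the file may use any mix of spaces, tabs, and newlines. The sketch below shows that merge in isolation. `read_ids`, `merge_ids`, and the file name are stand-ins for illustration, not names from the PR.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def read_ids(path: Optional[Path]) -> List[str]:
    """Stand-in for the file reader: no path means no ids."""
    if path is None:
        return []
    # str.split() with no argument splits on any whitespace run
    # and never yields empty strings.
    return path.read_text(encoding="utf-8").split()


def merge_ids(cli_ids: List[str], path: Optional[Path]) -> List[str]:
    """CLI ids first, then file ids; duplicates are kept at this stage."""
    return [str(task_id) for task_id in cli_ids] + read_ids(path)


with tempfile.TemporaryDirectory() as tmp:
    ids_file = Path(tmp) / "rerun_ids.txt"  # hypothetical file name
    ids_file.write_text("101 102\n103\t104\n\n", encoding="utf-8")
    merged = merge_ids(["7", "101"], ids_file)

print(merged)  # ['7', '101', '101', '102', '103', '104']
```

The duplicate `101` is harmless in the PR's flow because the evaluator later converts the list to a set.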

diff --git a/browseruse_bench/cli/run_eval.py b/browseruse_bench/cli/run_eval.py
@@ -117,6 +117,7 @@ def _report_output_dir(report_file: str, run_dir: Path) -> None:
 # stay on both.)
 _EVAL_ONLY_VALUE_FLAGS = {
     "--score-threshold", "--num-worker", "--api-key", "--base-url", "--eval-strategy",
+    "--task-ids-file", "--exclude-task-ids-file",
 }
 _EVAL_ONLY_BOOL_FLAGS = {"--force-reeval"}
 

diff --git a/browseruse_bench/eval/base.py b/browseruse_bench/eval/base.py
@@ -122,11 +122,21 @@ def run(self) -> int:
         tasks = self.load_tasks()
         completed = self.list_completed_tasks()
         already = self._resume_skip_set()
+        include_task_ids = self._task_id_filter("task_ids")
+        exclude_task_ids = self._task_id_filter("exclude_task_ids")
         pending = [
             p.name for p in completed
             if p.name not in already and p.name in tasks
+            and (not include_task_ids or p.name in include_task_ids)
+            and p.name not in exclude_task_ids
         ]
-        logger.info("Evaluating %d tasks (skip %d already done)", len(pending), len(already))
+        logger.info(
+            "Evaluating %d tasks (skip %d already done, include_filter=%d, exclude_filter=%d)",
+            len(pending),
+            len(already),
+            len(include_task_ids),
+            len(exclude_task_ids),
+        )
         for result in self._run_iteration(pending, tasks):
             self._append_result(result)
         # Hook runs before summary so subclasses (e.g. LexBench coverage backfill)
@@ -138,6 +148,16 @@ def run(self) -> int:
         self._generate_summary(records)
         return 0
 
+    def _task_id_filter(self, key: str) -> Set[str]:
+        raw = self.args.extra.get(key)
+        if raw is None:
+            return set()
+        if isinstance(raw, str):
+            return set(raw.split())
+        if isinstance(raw, (list, tuple, set)):
+            return {str(item) for item in raw if str(item)}
+        return {str(raw)}
+
     def _load_all_records(self) -> List[Dict[str, Any]]:
         """Load every record currently appended to the results JSONL on disk."""
         path = self.results_path()
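Note on the `base.py` hunks above: an empty include filter means "no restriction", and the exclude filter applies on top of it. The standalone sketch below reproduces those semantics so they can be checked without an evaluator instance. `normalize_ids` and `select_pending` are illustrative names, not functions from the PR.

```python
from typing import Any, List, Set


def normalize_ids(raw: Any) -> Set[str]:
    """Accept None, a whitespace-separated string, a collection, or a scalar."""
    if raw is None:
        return set()
    if isinstance(raw, str):
        return set(raw.split())
    if isinstance(raw, (list, tuple, set)):
        return {str(item) for item in raw if str(item)}
    return {str(raw)}


def select_pending(completed: List[str], include: Any, exclude: Any) -> List[str]:
    """Keep completed tasks that pass the include filter and are not excluded."""
    include_ids = normalize_ids(include)
    exclude_ids = normalize_ids(exclude)
    return [
        name
        for name in completed
        if (not include_ids or name in include_ids) and name not in exclude_ids
    ]


completed = ["1", "2", "3", "4"]
print(select_pending(completed, None, None))       # ['1', '2', '3', '4']
print(select_pending(completed, "2 3 9", None))    # ['2', '3']
print(select_pending(completed, [2, 3], ["3"]))    # ['2']
print(select_pending(completed, None, 4))          # ['1', '2', '3']
```

An ID listed in both filters is skipped, so exclusion takes precedence over inclusion.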

diff --git a/browseruse_bench/eval/lexbench_browser/evaluator.py b/browseruse_bench/eval/lexbench_browser/evaluator.py
@@ -568,6 +568,12 @@ def post_eval_hook(self, records: list[dict[str, Any]]) -> None:
             return
         attempted_ids = {d.name for d in self.args.trajectories_dir.iterdir() if d.is_dir()}
         expected = [tid for tid in self._expected_task_ids if tid in attempted_ids]
+        include_task_ids = self._task_id_filter("task_ids")
+        exclude_task_ids = self._task_id_filter("exclude_task_ids")
+        if include_task_ids:
+            expected = [tid for tid in expected if tid in include_task_ids]
+        if exclude_task_ids:
+            expected = [tid for tid in expected if tid not in exclude_task_ids]
         if not expected:
             return
         records_by_task_id = {

diff --git a/browseruse_bench/eval/lexbench_browser/prompts/failure_taxonomy_system.txt b/browseruse_bench/eval/lexbench_browser/prompts/failure_taxonomy_system.txt
@@ -0,0 +1,77 @@
+You are an expert browser-agent benchmark analyst. A browser agent failed a LexBench-Browser task. Classify the failure into the taxonomy below.
+
+Use the supplied task spec, scoring rubric, evaluator feedback, agent final answer, compact action trace, runtime result, and screenshots. Prefer the evidence from the trajectory and evaluator feedback over assumptions. Do not rely on the old A1/B1/C1 failure category except as weak auxiliary context.
+
+## Taxonomy
+
+### M1 Task Reasoning
+Failures in task understanding, decision making, selection, evidence use, or safety judgment.
+
+M1.1 Requirement Following
+The agent misses explicit task requirements, required websites, required fields, required output format, required number of items, or the required safety/legal response. Use this for incomplete fulfillment of the user's objective even when the browser interactions were technically possible.
+
+M1.2 Target Selection
+The agent applies the wrong scope, entity, date, city, item, channel, season, product, ranking criterion, filter, sort order, or comparison logic. Use this when it reaches usable pages but chooses the wrong target or fails to enforce "latest", "highest", "most viewed", "top N", date windows, or cross-platform comparison criteria.
+
+M1.3 Evidence Grounding
+The agent fails to extract information that is available, extracts the wrong fields, mixes fields from different items, fabricates or hallucinates values, reports unverifiable data, or answers without enough evidence. Use this when the central problem is grounding and information fidelity.
+
+### M2 Action Execution
+Failures in controlling the browser-agent loop, UI operations, recovery behavior, or tool/output protocol. These are agent capability failures, not external website failures, unless the page is blocked or unavailable.
+
+M2.1 UI Misoperation
+The agent cannot operate normal UI elements: search boxes, buttons, date pickers, dropdowns, filters, tabs, popups, modals, pagination, detail-page links, window/tab switching, or page scrolling. Use this when the site is accessible but the agent cannot drive the interface to the needed state.
+
+M2.2 Infinite Loop
+The agent repeats ineffective actions, gets stuck, fails to recover from a bad page state, runs out of steps, times out, or completes only a small part of a long multi-item task due to poor workflow control. Use this for loops, dead ends, and poor long-horizon task management.
+
+M2.3 Format Breakdown
+The agent fails because of malformed JSON action output, invalid tool-call structure, parser failures, missing final response, model service no-response, failed file saving, corrupted artifacts, or required output files not being produced. Use this only when protocol or artifact generation is a direct cause of failure.
+
+### M3 Web Constraints
+Failures mainly caused by external web environment constraints. These may still expose agent limits, but the primary obstacle is the website or access environment.
+
+M3.1 Bot Defense
+The target site blocks automation with CAPTCHA, Cloudflare, PerimeterX, slider verification, "robot or human", 403 caused by automation, rate limits, "Too Many Requests", security control, abnormal traffic, or similar bot-detection defenses.
+
+M3.2 Access Barrier
+The needed content or action is blocked by login, session expiry, SMS/QR authentication, membership, VIP, paywall, permissions, account-only views, paid downloads, copyright restrictions, or regional access restrictions.
+
+M3.3 Site Limitation
+The site is down, unreachable, returns 404/server errors, has empty DOM or SPA rendering failure, does not expose the requested content, lacks the requested filter/data, or the target content genuinely does not exist on the specified site. Use this when the environment itself makes the task impossible or under-specified.
+
+## OTHER
+
+Use OTHER only when none of the nine categories captures the core failure. If OTHER is used, provide a short phrase in other_phrase. Do not use OTHER for common combinations of the above categories. Prefer assigning one or more existing categories whenever possible.
+
+## Multi-label rules
+
+- Assign every category that substantially contributed to the failed outcome.
+- A trajectory may have one or multiple codes.
+- Err on the side of inclusion for real contributing failures, but do not add categories that are only mentioned in the task text.
+- Choose primary_code as the most direct cause that explains why the run failed.
+- If the agent is blocked by CAPTCHA or rate limiting, include M3.1 even if it also fails later.
+- If the page is accessible but the agent misses filters, sorting, or target selection, use M1.2, not M3.3.
+- If the page is accessible and the answer is unsupported, use M1.3.
+- If the agent cannot click or manipulate a normal accessible interface, use M2.1.
+- If repeated ineffective attempts, timeout, or step exhaustion prevent completion, use M2.2.
+- If the run stops because the model produced malformed JSON, tool-call parsing failed, or no final response was produced, use M2.3.
+
+## Output
+
+Return only a JSON object matching this schema:
+
+{
+  "primary_code": "M1.1",
+  "codes": ["M1.1", "M2.2"],
+  "other_phrase": null,
+  "confidence": "high",
+  "reasoning": "Short evidence-based explanation.",
+  "evidence": [
+    "Concrete evidence from evaluator feedback or trajectory.",
+    "Concrete evidence from agent answer or screenshot."
+  ]
+}
+
+Allowed codes are M1.1, M1.2, M1.3, M2.1, M2.2, M2.3, M3.1, M3.2, M3.3, OTHER.
+confidence must be high, medium, or low.
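Note on the prompt above: the output section fixes the allowed codes and confidence values, so the judge's reply can be checked before it is stored. The sketch below is one possible validator, not code from the PR. The function name `parse_taxonomy_response` is hypothetical, and the check that `primary_code` appears in `codes` is an assumption that the prompt does not state explicitly.

```python
import json
from typing import Any, Dict

ALLOWED_CODES = {
    "M1.1", "M1.2", "M1.3", "M2.1", "M2.2", "M2.3", "M3.1", "M3.2", "M3.3", "OTHER",
}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def parse_taxonomy_response(text: str) -> Dict[str, Any]:
    """Hypothetical validator for the JSON object the prompt asks for."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    codes = data.get("codes")
    if not isinstance(codes, list) or not codes:
        raise ValueError("codes must be a non-empty list")
    unknown = [c for c in codes if not isinstance(c, str) or c not in ALLOWED_CODES]
    if unknown:
        raise ValueError(f"unknown codes: {unknown}")
    primary = data.get("primary_code")
    if not isinstance(primary, str) or primary not in ALLOWED_CODES:
        raise ValueError(f"invalid primary_code: {primary!r}")
    # Assumption: the primary code should also be listed in codes.
    if primary not in codes:
        raise ValueError("primary_code must appear in codes")
    if data.get("confidence") not in ALLOWED_CONFIDENCE:
        raise ValueError("confidence must be high, medium, or low")
    # The prompt requires a short phrase whenever OTHER is used.
    if "OTHER" in codes and not data.get("other_phrase"):
        raise ValueError("other_phrase is required when OTHER is used")
    return data


sample = json.dumps({
    "primary_code": "M3.2",
    "codes": ["M3.2", "M2.2"],
    "other_phrase": None,
    "confidence": "high",
    "reasoning": "Login wall blocked the listing page.",
    "evidence": ["Evaluator feedback mentions a login prompt."],
})
print(parse_taxonomy_response(sample)["primary_code"])  # M3.2
```

Rejecting malformed replies at parse time keeps the M3.2/M3.3 rerun selection from acting on a code the taxonomy does not define.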