diff --git a/landing/assets/img/task-distribution.png b/landing/assets/img/task-distribution.png
index f19129c..fa8894e 100644
Binary files a/landing/assets/img/task-distribution.png and b/landing/assets/img/task-distribution.png differ
diff --git a/landing/index.html b/landing/index.html
index bb817d3..c6526b3 100644
--- a/landing/index.html
+++ b/landing/index.html
@@ -10,7 +10,7 @@
-
+
@@ -39,7 +39,6 @@
 Agents Community Leaderboard
- Cite
 GitHub →
@@ -140,43 +139,44 @@

Overview

Tasks

- The first dataset, LexBench-Browser, focuses on browser-agent tasks
- that resemble actual user workflows: search, e-commerce, video, social, academic and
- tool-use across both English and Chinese websites, with varying difficulty tiers.
+ LexBench-Browser is a live-web benchmark for realistic browser-agent
+ workflows across Chinese and English websites. It covers information retrieval and
+ state-changing task execution, including authenticated and safety-sensitive scenarios
+ that prior benchmarks often under-represent.

- 210
+ 377
 Tasks total
- across 107 distinct websites
+ across 200+ target websites
- 4
- Reference agents
- browser-use · deepbrowse · Agent-TARS · skyvern
+ 2
+ Task types
+ 294 information retrieval · 83 task execution
- 15
+ 10
 Models evaluated
- GPT · Claude · Gemini · Doubao · Kimi · Qwen · DeepSeek · MiniMax
+ BU · GLM · GPT · Claude · Gemini · Doubao · Kimi · Qwen · MiniMax
- 3
- Browser backends
- Chrome-Local · Lexmount Cloud · AgentBay
+ 167
+ Login-required tasks
+ plus 25 safety-testing tasks
- Task distribution diagram showing the categorical and difficulty breakdown of LexBench-Browser tasks
- Categorical and difficulty distribution across the LexBench-Browser task pool.
+ Task distribution diagram showing LexBench-Browser domains and task types
+ Domain and task-type distribution across the LexBench-Browser task pool.
- Login-gated and operation-tier tasks are coming soon.
- The next release adds tasks that require account context, multi-step transactions
- and safety-sensitive flows — to evaluate Agents on the parts of the web that
- English-only benchmarks miss.
+ Authenticated and operational workflows are included.
+ The benchmark contains 281 Chinese and 96 English tasks, with live sessions, account
+ context, multi-step operations and safety-sensitive flows evaluated under the same
+ browser-agent protocol.
@@ -332,13 +332,13 @@

Community

Leaderboard

- Live results across all reference agents and models. Click any header to sort. Showing the
- top 15 entries by success rate.
+ Snapshot results for browser-use on LexBench-Browser across 10 models. Click any header to
+ sort. Showing all 10 entries by success rate.

 benchmark · LexBench-Browser
 judge · gpt-5.4 (per-task threshold)
- snapshot · 2026-04-29
+ snapshot · 2026-06-16
@@ -354,172 +354,106 @@

Leaderboard

-  1  browser-use  claude-opus-4-7               Lexmount      58.0  14.2  205.8
-  2  browser-use  kimi-k2.5                     Lexmount      58.0  24.7  280.1
-  3  browser-use  gemini-3.1-pro-preview        Lexmount      56.0  13.9  149.0
-  4  browser-use  gpt-5.5                       Lexmount      54.0  14.1  273.0
-  5  browser-use  gemini-3.1-pro-preview        Chrome-Local  54.0  14.0  163.3
-  6  browser-use  kimi-k2.6                     Lexmount      52.0  30.3  447.6
-  7  browser-use  bu-2-0                        Chrome-Local  48.0  20.8  136.5
-  8  browser-use  MiniMax-M2.7                  Chrome-Local  42.0  27.5  413.1
-  9  Agent-TARS   gemini-3.1-pro-preview        Lexmount      40.0  18.4  121.1
- 10  browser-use  bu-2-0                        Lexmount      40.0  23.4  350.7
- 11  browser-use  gemini-2.5-pro                Lexmount      40.0  18.7  279.2
- 12  browser-use  qwen3.5-plus                  Lexmount      40.0  23.7  326.7
- 13  browser-use  MiniMax-M2.5                  Lexmount      38.0  27.4  354.3
- 14  browser-use  MiniMax-M2.7                  Lexmount      36.0  22.4  408.9
- 15  browser-use  doubao-seed-2-0-pro           Lexmount      36.0  17.3  385.3
+  1  browser-use  dmx-claude-opus-4-8-thinking  Lexmount      75.3  16.9  522.4
+  2  browser-use  gpt-5.5                       Lexmount      74.3  14.0  212.0
+  3  browser-use  bu-2-0                        Lexmount      68.7  17.5  323.8
+  4  browser-use  glm-5.1                       Lexmount      68.2  22.0  589.7
+  5  browser-use  qwen3.7-max                   Lexmount      65.8  18.8  487.6
+  6  browser-use  gemini-3.1-pro-preview        Lexmount      63.9  12.6  368.6
+  7  browser-use  doubao-seed-2-0-pro           Lexmount      61.8  17.7  595.3
+  8  browser-use  gemini-3.5-flash              Lexmount      58.6  26.9  376.4
+  9  browser-use  kimi-k2.6                     Lexmount      53.3  23.6  433.4
+ 10  browser-use  MiniMax-M3                    Lexmount      40.1  20.1  584.5

- Live source: leaderboard server at the team intranet. Data is automatically pulled from
+ Snapshot source: experiments/{benchmark}/{split}/{agent}/{model_id}/{ts}/ run dirs.

- Cite
- If you use LexBench-Browser in your work, please cite:
- @misc{lexbench_browser_2026,
-   title        = {LexBench-Browser: A Real-World Browser Agent Benchmark with Long-Tail and Multilingual Tasks},
-   author       = {Lexmount Research and Collaborators},
-   year         = {2026},
-   howpublished = {\url{https://lexmount.github.io/browseruse-agent-bench/}},
-   note         = {Open benchmark; v1.0 reference release}
- }
- Acknowledgements: integration scaffolding adapted patterns from browser-use, skyvern,
- Agent-TARS and deepbrowse upstream codebases.