diff --git a/landing/assets/img/task-distribution.png b/landing/assets/img/task-distribution.png
index f19129c..fa8894e 100644
Binary files a/landing/assets/img/task-distribution.png and b/landing/assets/img/task-distribution.png differ
diff --git a/landing/index.html b/landing/index.html
index bb817d3..c6526b3 100644
--- a/landing/index.html
+++ b/landing/index.html
@@ -10,7 +10,7 @@
-
+
@@ -39,7 +39,6 @@
 Agents
 Community
 Leaderboard
- Cite
 GitHub →
@@ -140,43 +139,44 @@
- The first dataset, LexBench-Browser, focuses on browser-agent tasks
- that resemble actual user workflows: search, e-commerce, video, social, academic and
- tool-use across both English and Chinese websites, with varying difficulty tiers.
+ LexBench-Browser is a live-web benchmark for realistic browser-agent
+ workflows across Chinese and English websites. It covers information retrieval and
+ state-changing task execution, including authenticated and safety-sensitive scenarios
+ that prior benchmarks often under-represent.
-
+
- Live results across all reference agents and models. Click any header to sort. Showing the
- top 15 entries by success rate.
+ Snapshot results for browser-use on LexBench-Browser across 10 models. Click any header to
+ sort. Showing all 10 entries by success rate.
-| 1 | browser-use | claude-opus-4-7 | Lexmount | 58.0 | 14.2 | 205.8 |
-| 2 | browser-use | kimi-k2.5 | Lexmount | 58.0 | 24.7 | 280.1 |
-| 3 | browser-use | gemini-3.1-pro-preview | Lexmount | 56.0 | 13.9 | 149.0 |
-| 4 | browser-use | gpt-5.5 | Lexmount | 54.0 | 14.1 | 273.0 |
-| 5 | browser-use | gemini-3.1-pro-preview | Chrome-Local | 54.0 | 14.0 | 163.3 |
-| 6 | browser-use | kimi-k2.6 | Lexmount | 52.0 | 30.3 | 447.6 |
-| 7 | browser-use | bu-2-0 | Chrome-Local | 48.0 | 20.8 | 136.5 |
-| 8 | browser-use | MiniMax-M2.7 | Chrome-Local | 42.0 | 27.5 | 413.1 |
-| 9 | Agent-TARS | gemini-3.1-pro-preview | Lexmount | 40.0 | 18.4 | 121.1 |
-| 10 | browser-use | bu-2-0 | Lexmount | 40.0 | 23.4 | 350.7 |
-| 11 | browser-use | gemini-2.5-pro | Lexmount | 40.0 | 18.7 | 279.2 |
-| 12 | browser-use | qwen3.5-plus | Lexmount | 40.0 | 23.7 | 326.7 |
-| 13 | browser-use | MiniMax-M2.5 | Lexmount | 38.0 | 27.4 | 354.3 |
-| 14 | browser-use | MiniMax-M2.7 | Lexmount | 36.0 | 22.4 | 408.9 |
-| 15 | browser-use | doubao-seed-2-0-pro | Lexmount | 36.0 | 17.3 | 385.3 |
+| 1 | browser-use | dmx-claude-opus-4-8-thinking | Lexmount | 75.3 | 16.9 | 522.4 |
+| 2 | browser-use | gpt-5.5 | Lexmount | 74.3 | 14.0 | 212.0 |
+| 3 | browser-use | bu-2-0 | Lexmount | 68.7 | 17.5 | 323.8 |
+| 4 | browser-use | glm-5.1 | Lexmount | 68.2 | 22.0 | 589.7 |
+| 5 | browser-use | qwen3.7-max | Lexmount | 65.8 | 18.8 | 487.6 |
+| 6 | browser-use | gemini-3.1-pro-preview | Lexmount | 63.9 | 12.6 | 368.6 |
+| 7 | browser-use | doubao-seed-2-0-pro | Lexmount | 61.8 | 17.7 | 595.3 |
+| 8 | browser-use | gemini-3.5-flash | Lexmount | 58.6 | 26.9 | 376.4 |
+| 9 | browser-use | kimi-k2.6 | Lexmount | 53.3 | 23.6 | 433.4 |
+| 10 | browser-use | MiniMax-M3 | Lexmount | 40.1 | 20.1 | 584.5 |
- Live source: leaderboard server at the team intranet. Data is automatically pulled from
+ Snapshot source:
experiments/{benchmark}/{split}/{agent}/{model_id}/{ts}/ run dirs.
If you use LexBench-Browser in your work, please cite:
-@misc{lexbench_browser_2026,
- title = {LexBench-Browser: A Real-World Browser Agent Benchmark with Long-Tail and Multilingual Tasks},
- author = {Lexmount Research and Collaborators},
- year = {2026},
- howpublished = {\url{https://lexmount.github.io/browseruse-agent-bench/}},
- note = {Open benchmark; v1.0 reference release}
-}
-
- Acknowledgements: integration scaffolding adapted patterns from browser-use, skyvern,
- Agent-TARS and deepbrowse upstream codebases.
-
-