This benchmark is very difficult to set up.
First of all, it requires openclaw, but it doesn't tell you the openclaw requirements (Docker, bare metal, remote openclaw support, etc.).
Secondly, openclaw needs to do web search and web browsing. My private vLLM setup doesn't have Internet access out of the box (it requires an HTTP proxy to reach the Internet), so this didn't work and I had to fix it manually. I also realized later that it requires a search API key. Okay, Brave it is, but the setup is such a black box that I didn't know which extensions, skills, or APIs openclaw needs. It took me 3 days to figure out.
The judge doesn't support third-party API endpoints. Again, our LLM endpoints all go through an LLM proxy (LiteLLM), but who knew that it has Claude CLI support for the judge? I had to add another hack to make the judge work.
More random things failed in the middle of the run, and they required manual intervention to continue.
Can we have a good, human-readable setup guide so that the benchmark can be run easily? Or even better, just give me a self-contained Docker image that includes everything, with environment variables to pass in for third-party integrations, so people can save some time using it.
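To make the request concrete, here is a rough sketch of the kind of preflight check I have in mind. The variable names (`SEARCH_API_KEY`, `JUDGE_BASE_URL`, `JUDGE_API_KEY`) are made up for illustration; they are not settings the benchmark or openclaw reads today, as far as I know. `HTTPS_PROXY` is just the conventional proxy variable.

```python
import os

# Hypothetical setting names, for illustration only.
# The benchmark does not currently document any of these.
REQUIRED = {
    "HTTPS_PROXY": "outbound proxy for web search and browsing",
    "SEARCH_API_KEY": "search provider API key (e.g. Brave)",
    "JUDGE_BASE_URL": "endpoint for the judge model (e.g. a LiteLLM proxy)",
    "JUDGE_API_KEY": "API key for the judge endpoint",
}


def missing_settings(env):
    """Return {name: description} for every required setting that is unset or blank."""
    return {
        name: why
        for name, why in REQUIRED.items()
        if not env.get(name, "").strip()
    }


def preflight(env=os.environ):
    """Print every missing setting and return True only if nothing is missing."""
    missing = missing_settings(env)
    for name, why in missing.items():
        print(f"missing {name}: {why}")
    return not missing
```

Calling `preflight()` before a run starts would list everything that is missing up front, instead of the run failing halfway through on the first missing piece.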
I do like the test suite and benchmark items, but running it is such a painful experience, especially in a more restricted environment.