Much more reliable, and unlike criterion it can work in CI: https://www.bazhenov.me/posts/paired-benchmarking/
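For anyone who doesn't want to click through, here is a rough sketch of what a paired measurement might look like. It is my own illustration, not code from the linked post or from criterion. It assumes the core idea is to run the baseline and the candidate back to back on the same input and to analyze the per-pair time difference, so that noise hitting both runs (a busy CI machine, for instance) mostly cancels out. The names `time_ns`, `paired_diffs` and `mean` are hypothetical, and alternating which function runs first is my own guess at a reasonable way to avoid order bias.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time a single call of `f` on `input`, in nanoseconds.
/// (Hypothetical helper, not part of any benchmarking library.)
fn time_ns<T, R>(f: &impl Fn(&T) -> R, input: &T) -> i64 {
    let start = Instant::now();
    // black_box keeps the optimizer from discarding the call.
    black_box(f(black_box(input)));
    start.elapsed().as_nanos() as i64
}

/// Run `baseline` and `candidate` back to back on each input and return
/// the per-pair differences (candidate time minus baseline time).
fn paired_diffs<T, R>(
    baseline: impl Fn(&T) -> R,
    candidate: impl Fn(&T) -> R,
    inputs: &[T],
) -> Vec<i64> {
    inputs
        .iter()
        .enumerate()
        .map(|(i, input)| {
            // Assumption: alternate the execution order between pairs so
            // that neither function always runs first.
            let (b, c) = if i % 2 == 0 {
                let b = time_ns(&baseline, input);
                let c = time_ns(&candidate, input);
                (b, c)
            } else {
                let c = time_ns(&candidate, input);
                let b = time_ns(&baseline, input);
                (b, c)
            };
            c - b
        })
        .collect()
}

/// Mean of the paired differences; a negative value suggests the
/// candidate is faster. Returns NaN for an empty slice.
fn mean(diffs: &[i64]) -> f64 {
    diffs.iter().sum::<i64>() as f64 / diffs.len() as f64
}
```

You would call `paired_diffs` with two closures and a slice of inputs, then look at `mean` (or better, a proper statistical test) over the differences rather than comparing two separately measured averages.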