Add benchmarks to vignette #41
Looks good to me.
H2O cluster allowed cores: 20
so it is unclear why the processing was relatively slow. The usage graph:
It's definitely not what I'd expect from H2O.
One possible reason is hyper-threading. I'm not sure this applies to all Intel processors, but at least some have one floating-point unit per core; with HT, two threads can run in parallel on one core, so in a floating-point-intensive workload the threads have to wait for each other.
Another possible HT-related reason is security fixes: a couple of years ago there were several security issues related to HT (Meltdown, Spectre, ...), and one of the mitigation techniques was to disable HT altogether, so I assume the mitigations might cause a performance hit on some workloads.
In both cases it could help to use nthreads=10.
Another reason could be cache utilization: the more threads, the more cache invalidations, and so the more time spent waiting on memory.
It can also be related to how we split the data: if the dataset is small, higher parallelism could very well make training slower (more time spent on communication and synchronization). Trying bigger data could make the more parallel version perform better in this case.
Anyway, now I'm curious about it, so I'll try to run the benchmark with a profiler. If I find a reason, I will mention it here as well.
I did not find the definition of the sim_classification function, but if its first argument is the number of rows to generate, I would say the reason for the slower parallel run is really just the small data (10k rows).
With 10-fold CV, each model is trained on 9k rows, and with 20 threads that is 450 rows per thread, so the time spent on communication/synchronization might be significant compared to the computation time and cause this behavior.
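A rough sketch of that arithmetic, for anyone who wants to try other settings. The function name `rows_per_thread` is made up for illustration, and the even split across threads is a simplifying assumption, not a description of how H2O actually chunks data:

```python
def rows_per_thread(n_rows: int, n_folds: int, n_threads: int) -> float:
    """Approximate number of training rows per thread for one CV fold model.

    Assumes (hypothetically) that the training rows are divided evenly
    across threads; real chunking in H2O may differ.
    """
    # In k-fold CV, each model trains on (k - 1) / k of the rows.
    train_rows = n_rows * (n_folds - 1) / n_folds
    return train_rows / n_threads


# The case discussed above: 10k rows, 10-fold CV, 20 threads.
print(rows_per_thread(10_000, 10, 20))  # 450.0
# The same data with nthreads=10.
print(rows_per_thread(10_000, 10, 10))  # 900.0
```

Even with half the threads, each thread would only see about 900 rows, which is still very little work per thread.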
@ledell do we have any recommendation on how many threads to use based on the dataset size? I know we have a heuristic for GLM (nodes = rows*columns^2/(nthreads*1e8)), but I don't know if we have something like that in general.
My general recommendation would be to use H2O parallelism (nthreads and/or an H2O cluster) for bigger data (but I don't know where the threshold is).
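To get a feel for the GLM heuristic quoted above, here is a minimal sketch that just evaluates the formula as written. The function name `glm_nodes_heuristic` and the column count of 15 are assumptions for illustration only; the benchmark's actual number of columns is not stated here:

```python
def glm_nodes_heuristic(rows: int, columns: int, nthreads: int) -> float:
    """Evaluate nodes = rows * columns^2 / (nthreads * 1e8) as quoted above."""
    return rows * columns**2 / (nthreads * 1e8)


# 10k rows and 20 threads as in the benchmark; 15 columns is a hypothetical value.
nodes = glm_nodes_heuristic(rows=10_000, columns=15, nthreads=20)
print(f"suggested nodes: {nodes:.6f}")
```

Under these assumed inputs the result is far below one node, which is at least consistent with the idea that data of this size is too small to benefit from much parallelism.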
@tomasfryda The recommendation is to use the default nthreads unless there is a strong reason not to (e.g. running on a laptop and not wanting H2O to use all the CPUs). The heuristic you mentioned is for optimizing the number of nodes, as they need to be defined beforehand; but once a node is started, it's up to H2O to optimize its behaviour according to the number of threads/CPUs available, not the other way round.
Co-authored-by: Tomáš Frýda <tomas.fryda@h2o.ai>
Adds some clarification and benchmark results to the parallel processing vignette.
Once this is merged (and pkgdown finishes), I'll revert the version number to the dev value and commit to main (to get the pkgdown updated on the dev site).