Add benchmarks to vignette by topepo · Pull Request #41 · tidymodels/agua

topepo · 2022-10-19T13:42:29Z

Adds some clarification and benchmark results to the parallel processing vignette.

Once this is merged (and pkgdown finishes), I'll revert the version number to the dev value and commit to main (to get pkgdown updated on the dev site).

qiushiyan · 2022-10-23T02:34:53Z

Looks good to me.

tomasfryda · 2022-10-25T16:28:04Z

+    H2O cluster allowed cores:  20 
+```
+
+so it is unclear why the processing was relatively slow. The usage graph: 


It's definitely not what I'd expect from H2O.

One possible reason is hyper-threading. I'm not sure this holds for all Intel processors, but at least some have one floating-point unit per core. With HT you can have 2 parallel threads on one core, which means that in a floating-point-intensive workload the threads have to wait for each other.

Another possible reason related to HT is security fixes. A couple of years ago there were several security issues related to HT (Meltdown, Spectre, ...), and one of the mitigation techniques was to disable HT altogether, so I assume there might be some performance hit on some workloads.

In both cases it could help to use nthreads=10.
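As a rough illustration of where a value like 10 could come from, the sketch below assumes two hardware threads per physical core (common with hyper-threading, but an assumption here and not something confirmed for the benchmark machine). The helper name `physical_core_estimate` is hypothetical; the result is the kind of value one might pass as `nthreads` when starting H2O.

```python
def physical_core_estimate(logical_cores: int, threads_per_core: int = 2) -> int:
    """Estimate physical cores from a logical core count.

    Assumes every physical core exposes the same number of hardware
    threads (threads_per_core), which is an assumption, not a measurement.
    """
    if logical_cores < 1 or threads_per_core < 1:
        raise ValueError("logical_cores and threads_per_core must be >= 1")
    # Never suggest fewer than one thread.
    return max(1, logical_cores // threads_per_core)


# The diff above reports 20 allowed cores for the H2O cluster.
nthreads = physical_core_estimate(20)
print(nthreads)
```

Whether this actually helps would still need to be checked by rerunning the benchmark with the reduced thread count.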

Another reason could be cache utilization: the more threads, the more cache invalidations, and so the more time spent waiting on memory.

It could also be related to how we split the data. If the dataset is small, higher parallelism could very well make the training slower (more time spent on communication and synchronization). Trying with bigger data could make the more parallel version perform better in this case.

Anyway, now I'm curious about it, so I'll try to run the benchmark with a profiler. If I find a reason, I will mention it here as well.

I did not find the definition of the sim_classification function, but if its first argument is the number of rows to generate, I would say the reason for the slower parallel run is really just small data (10k rows).

With 10-fold CV it will use 9k rows to train the model, and with 20 threads we will process 450 rows per thread, so the time spent on communication/synchronization might be significant compared to the computation time and cause this behavior.
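The arithmetic behind those numbers can be sketched as follows. This assumes the training rows are divided evenly across threads, which is a simplification; H2O's actual data chunking may differ. The function name `rows_per_thread` is hypothetical.

```python
def rows_per_thread(n_rows: int, v_folds: int, n_threads: int):
    """Return (training rows per resample, rows per thread).

    With v-fold cross-validation, each model is trained on
    (v_folds - 1) / v_folds of the data.
    """
    train_rows = n_rows * (v_folds - 1) // v_folds
    # Assumption: training rows are spread evenly over all threads.
    return train_rows, train_rows / n_threads


train_rows, per_thread = rows_per_thread(n_rows=10_000, v_folds=10, n_threads=20)
print(train_rows, per_thread)  # 9000 450.0
```

With nthreads=10 the same data would give 900 rows per thread, still a small amount of work per thread.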

@ledell do we have any recommendation about how many threads we should use based on the dataset size? I know we have a heuristic for GLM (nodes = rows*columns^2/(nthreads*1e8)), but I don't know if we have something like that in general.
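For reference, the GLM heuristic quoted above can be written out directly. This is only a transcription of the formula as stated in this comment, not H2O's implementation; the function name `glm_nodes_heuristic` and the column count of 15 are hypothetical, since the number of columns in the benchmark data is not given here.

```python
def glm_nodes_heuristic(rows: int, columns: int, nthreads: int) -> float:
    """nodes = rows * columns^2 / (nthreads * 1e8), as quoted in the comment."""
    if nthreads < 1:
        raise ValueError("nthreads must be >= 1")
    return rows * columns**2 / (nthreads * 1e8)


# 9k training rows and 20 threads come from the discussion above;
# columns=15 is a made-up value used only for illustration.
nodes = glm_nodes_heuristic(rows=9_000, columns=15, nthreads=20)
print(nodes)
```

For data of this size the formula gives a value far below 1, which suggests that a single node is already more than enough.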

My general recommendation would be to use H2O parallelism (using nthreads and/or an H2O cluster) for bigger data (but I don't know where the threshold is).

sebhrusen · 2022-11-02

@tomasfryda The recommendation is to use the default nthreads unless there is a strong reason not to (e.g. running on a laptop and not wanting H2O to use all the CPUs). The heuristic that you mentioned is for optimizing the number of nodes, as they need to be defined beforehand. Once a node is started, it's up to H2O to optimize its behaviour according to the number of threads/CPUs available, not the other way round.

Co-authored-by: Tomáš Frýda <tomas.fryda@h2o.ai>

topepo added 2 commits October 19, 2022 09:39

update with benchmarking results

b66f5a9

rollback version to get current pkgodwn site to show the vignette

0fddfc1

topepo requested review from ledell and qiushiyan October 19, 2022 13:42

tomasfryda reviewed Oct 25, 2022

qiushiyan approved these changes Nov 2, 2022

Update vignettes/parallel.Rmd

5459bf8

Co-authored-by: Tomáš Frýda <tomas.fryda@h2o.ai>

qiushiyan mentioned this pull request Sep 21, 2024

Error in h2o.getConnection(): No active connection to an H2O cluster. #57

Open

topepo wants to merge 3 commits into main from benchmarks

tomasfryda Oct 26, 2022

sebhrusen Nov 2, 2022

4 participants
