Add test for shuffling on different string dtypes #495
ian-r-rose wants to merge 3 commits into main
@jrbourbeau I thought I could help here a bit. I suspect we should see all green now, between the skipif < 2022.10.1 and merging main. I did not check the rest of the test, though, such as the technical details and the design.
I'm curious about the motivation for this test. Seems like it makes more sense to specify a string dtype for benchmarking and monitor behavior there. Thoughts?
I'm not sure I follow what you mean by specifying a string dtype for benchmarking. Do you mean in the h2o benchmarks? If so, I think you'd want to avoid the S3 read so that the measurement is isolated to the dtype alone, which is what this test does.
This adds a new test measuring the performance of shuffling based on the different options for string dtypes: "object", "string[python]", and "string[pyarrow]". In my initial testing, the pyarrow string dtype was significantly slower (!), though I haven't had the chance to chase down exactly what is going on there (possibly time spent converting dtypes, possibly performance issues with hashing or serialization, possibly something else entirely). Something to fix, I suppose!
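One way to start separating the candidate causes listed above is to time dtype conversion and hashing on their own, outside of any shuffle. The following is only a rough sketch in plain pandas, not the test added in this PR: the helper names (`make_strings`, `time_step`, `profile_dtypes`) and the data shape are hypothetical, and it assumes pandas is installed, with pyarrow optional.

```python
import time

import pandas as pd

# The three dtype options named in the PR description.
# "string[pyarrow]" is skipped below if pyarrow is not installed.
STRING_DTYPES = ["object", "string[python]", "string[pyarrow]"]


def make_strings(n=10_000):
    # Hypothetical data: short, repetitive keys, roughly like a shuffle column.
    return pd.Series([f"key-{i % 1000}" for i in range(n)], dtype="object")


def time_step(func):
    # Run func once and return (result, elapsed seconds).
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def profile_dtypes(n=10_000):
    base = make_strings(n)
    timings = {}
    for dtype in STRING_DTYPES:
        try:
            converted, t_convert = time_step(lambda: base.astype(dtype))
        except ImportError:
            # pyarrow is missing, so this dtype cannot be constructed.
            continue
        _, t_hash = time_step(
            lambda: pd.util.hash_pandas_object(converted, index=False)
        )
        timings[dtype] = {"convert": t_convert, "hash": t_hash}
    return timings


if __name__ == "__main__":
    for dtype, parts in profile_dtypes().items():
        print(f"{dtype:>16}: convert={parts['convert']:.4f}s hash={parts['hash']:.4f}s")
```

Single-run timings like these are noisy and do not cover serialization, so they would only hint at where to look next rather than confirm a cause.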