Conversation
| let mut cur_idx = 0u32; | ||
|
|
||
| // 2) Write null indices by inverting | ||
| let inv_buf = !bitmap.inner(); |
There was a problem hiding this comment.
I think this is where most of the saving comes in. Recreating the bitmap with inverse is another pass over it, so we have to read it twice.
There was a problem hiding this comment.
Interesting, I think this was changed in the other direction in 625e6ee#diff-2c3ee46148ccba8c1092294caab66ba73416314595ce2eef9b2bdd9cd898b27d
I wonder what changed now that makes the benchmarks go the other way.
32667d6 to
8657ae6
Compare
|
run benchmark sort_kernel |
|
🤖 Hi @Dandandan, thanks for the request (#9407 (comment)).
Please choose one or more of these with You can also set environment variables on subsequent lines: Unsupported benchmarks: sort_kernels. |
|
🤖 |
|
🤖: Benchmark completed Details
|
Which issue does this PR close?
N/A
Rationale for this change
There are a few small wins we can get adjusting how sort works by avoiding some copies in memory & eliminates some unsafe code around the partition scan. I tried a few attempts at bringing down the numbers of the partition scan even more by chunking the bitmaps, but this "simpler" implementation looks like it compiles/branches pretty well
I also optimised the
sort_native_typeto avoid some double copies and bounds checks, which gets us a nice +17% speedup in some of the primitive benchmarks including null values (i.e,sort i32 nulls 2^12).Admittedly after changing this code it doesn't even look like DataFusion uses that path... it only uses
lexsort_to_indices.One other extension I thought of was for smaller arrays, whether it makes sense to delegate to the
sort_native_typeand then slice it, rather than do the indices rigmaroleWhat changes are included in this PR?
Adjustments to two inner functions:
sort_native_typepartition_validity_scanAre these changes tested?
Yes via:
& benched with
mainvia:Are there any user-facing changes?
Nope
Benchmark Results against
mainHere's a comparison against
mainon my PC.It would be interesting to run this via the standard benchmark and see if we have the same improvements there: