Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive#9391
Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive#9391lyang24 wants to merge 1 commit intoapache:mainfrom
Conversation
0a86a90 to
c4d8432
Compare
|
run benchmark take_kernels |
|
🤖 |
|
🤖: Benchmark completed Details
|
arrow-select/src/take.rs
Outdated
| // This avoids reallocations during the copy phase. We only read the u128 | ||
| // view descriptors here, not the actual string data. | ||
| let mut total_bytes: usize = 0; | ||
| for (i, idx) in indices.values().iter().enumerate() { |
There was a problem hiding this comment.
I think it makes sense to add some special cases, e.g. :
- no input buffers (i.e. source was all inlined)
- no output buffer (target is all inlined)
- a single output buffer (offset always zero)
There was a problem hiding this comment.
good catch added special cases especially all inlined case could be a fast path
|
run benchmark take_kernels |
|
🤖 |
|
🤖: Benchmark completed Details
|
c4d8432 to
21830ad
Compare
…ginal buffers alive when selectivity is high Signed-off-by: lyang24 <lanqingy93@gmail.com>
21830ad to
b8be571
Compare
Which issue does this PR close?
On clickbench query profiles Noticed the buffers allocated on producer thread, dropped on consumer thread. Mimalloc's deferred-free mechanism caused up to 9.26% CPU in _mi_heap_delayed_free_partial.
Rationale for this change
The previous implementation treated take as a purely generic gather, leaving buffer compaction to the downstream coalescing step. While correct, this meant that in sparse cases we would:
This PR optimizes the sparse case by recognizing that compaction is effectively inevitable downstream. When take selects a small fraction of rows, we fuse gather and compaction into a single pass. This:
The dense case remains unchanged, so the existing fast path is preserved. The heuristic is intentionally conservative and only triggers when sparsity suggests compaction will be beneficial.
What changes are included in this PR?
The optimization fuses the take and compaction into a single pass via take_byte_view_compact. It kicks in automatically when the take is sparse (indices.len() < array.len() / 2):
Are these changes tested?
arrow select test passed
Are there any user-facing changes?
no