Conversation
The `half::f16` type no longer implements `rand::distributions::Standard`, so benchmarks using `rng.random::<f16>()` fail to compile. Use `f16::from_f32(rng.random::<f32>())` instead, and route Float16 array generation through `create_f16_array` which already handles this. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `Layout` field in MutableBuffer with separate `capacity` and `align` fields, keeping the struct at 32 bytes (same as original) while enabling O(1) capacity checks in `reserve()` without going through `Layout::size()`. Layout is reconstructed on cold paths only (alloc/dealloc/realloc). BufferBuilder changes: - Cache `buffer.capacity()` in a local before the extend loop - Replace `std::mem::size_of::<T>()` with a `const` local in hot paths - Add `#[inline(always)]` to `BufferBuilder::advance` and `BufferBuilder::append` Kernel changes (sort/concat/interleave): - Replace `Vec<T>` + `Buffer::from(vec)` with direct `MutableBuffer`/`BufferBuilder` usage to avoid the memcpy on Vec→Buffer conversion Benchmark results vs clean baseline (--quick): - sort with indices: 9-14% faster - sort primitive run: 24% faster - interleave dict: 8-13% faster - take stringview: up to 28% faster - buffer create (mutable): 9% faster - buffer create (from_slice): 9% faster Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@Dandandan @rluvaton any interest in reviewing this perf improvement? |
|
run benchmarks |
|
🤖 |
|
YES, will review now |
arrow-buffer/src/buffer/mutable.rs
Outdated
| /// Only used in cold paths (alloc/dealloc/realloc). | ||
| #[inline] | ||
| fn layout(&self) -> Layout { | ||
| debug_assert!(self.align.is_power_of_two()); |
There was a problem hiding this comment.
if this is in cold path, should we use assert instead of debug_assert?
There was a problem hiding this comment.
See comment above -- I don't understand the rationale for inlining the Layout
| pub fn append(&mut self, v: T) { | ||
| self.reserve(1); | ||
| self.buffer.push(v); | ||
| self.buffer.reserve(std::mem::size_of::<T>()); |
There was a problem hiding this comment.
Can we revert this? as the reserve already does that and it should be inlined
| self.buffer.reserve(std::mem::size_of::<T>()); | |
| self.reserve(1); |
There was a problem hiding this comment.
I still think this is going to be faster since we don't need to multiply it by n
arrow-buffer/src/builder/mod.rs
Outdated
| pub fn append_n(&mut self, n: usize, v: T) { | ||
| self.reserve(n); | ||
| self.extend(std::iter::repeat_n(v, n)) | ||
| self.buffer.reserve(n * std::mem::size_of::<T>()); |
There was a problem hiding this comment.
can we revert this as this is the same as self.reserve(n)?
| self.buffer.reserve(n * std::mem::size_of::<T>()); | |
| self.reserve(n); |
|
🤖: Benchmark completed Details
|
arrow-buffer/src/builder/mod.rs
Outdated
| let before = self.buffer.len(); | ||
| self.buffer.extend(iter); | ||
| let added_bytes = self.buffer.len() - before; | ||
| debug_assert_eq!(added_bytes % std::mem::size_of::<T>(), 0); | ||
| self.len += added_bytes / std::mem::size_of::<T>(); |
There was a problem hiding this comment.
Note: before if the iterator panicked in the 3rd item or something it would not leave the builder in inconsistent state, now it will as the length does not match the buffer anymore
arrow-ord/src/sort.rs
Outdated
| let mut mutable_buffer = | ||
| MutableBuffer::from_len_zeroed(primitive_values.len() * std::mem::size_of::<T::Native>()); |
There was a problem hiding this comment.
Note: this assume that T::default_value() is 0 which is the case, but I verified by reading the code that it is just a placeholder so it is ok 👍🏻
|
Thank you for the PR, it looks like it contain maybe optimization that I'm not sure whether all of them are actually improving or not, it would be better if you could create a pr per optimization and we can run the benchmarks for each one to verify that it does in fact improve perf |
|
run benchmark sort concat take interleave |
|
🤖 Hi @rluvaton, thanks for the request (#9393 (comment)).
Please choose one or more of these with |
|
run benchmark builder cast_kernels comparison_kernels concatenate_kernel filter_kernels interleave_kernels sort_kernel take_kernels zip_kernels |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
alamb
left a comment
There was a problem hiding this comment.
I agree with @rluvaton 's comments
Your PR comments say this:
There are a number of places within the code that go from MutableBuffer to Vec and then to Buffer. This causes extra allocations and is ripe for a performance refactor.
But I can't figure out where this extra allocation is (see comments inline). I am probably missing something
arrow-buffer/src/buffer/mutable.rs
Outdated
| // invariant: len <= capacity | ||
| len: usize, | ||
| layout: Layout, | ||
| capacity: usize, |
There was a problem hiding this comment.
What is the rationale for inlining Layout here?
It seems like Layout https://doc.rust-lang.org/beta/std/alloc/struct.Layout.html has the same representation and has accessors to get capacity and alignment 🤔 as usize
https://doc.rust-lang.org/beta/std/alloc/struct.Layout.html#method.size
https://doc.rust-lang.org/beta/std/alloc/struct.Layout.html#method.align
If we reverted this change to Layout I think the diff would be easier to understand
arrow-buffer/src/buffer/mutable.rs
Outdated
| /// Only used in cold paths (alloc/dealloc/realloc). | ||
| #[inline] | ||
| fn layout(&self) -> Layout { | ||
| debug_assert!(self.align.is_power_of_two()); |
There was a problem hiding this comment.
See comment above -- I don't understand the rationale for inlining the Layout
arrow-select/src/interleave.rs
Outdated
| let values: ScalarBuffer<T::Native> = indices | ||
| .iter() | ||
| .map(|(a, b)| interleaved.arrays[*a].value(*b)) | ||
| .collect::<Vec<_>>(); | ||
| .collect(); | ||
|
|
||
| let array = PrimitiveArray::<T>::try_new(values.into(), interleaved.nulls)?; | ||
| let array = PrimitiveArray::<T>::try_new(values, interleaved.nulls)?; |
There was a problem hiding this comment.
I don't understand why this would help
The old code did ScalarBuffer::from
arrow-rs/arrow-buffer/src/buffer/scalar.rs
Lines 205 to 212 in f9dd799
Which calls Buffer::from_vec
Which makes a MutableBuffer by taking over the Vec allocation
arrow-rs/arrow-buffer/src/buffer/mutable.rs
Lines 813 to 830 in 08e34ba
Which then makes a buffer
So TLDR is I don't understand why this saves a memory copy
Or is the theory that all the shenanigans to make Vec -> MutableBuffer -> Bytes/Buffer can be reduced?
There was a problem hiding this comment.
Basically there are a bunch of places in the code I think that make ScalarBuffer from Vecs and assume that is fast. If we can make it faster, maybe we can optimize that path (in one place) rather than having to change all call sites to use collect ScalarBuffer directly 🤔
|
Based on the benchmark runner - it seems not yet showing consistent improvements. |
|
Let me address the comments today. There is one other reason I would like to move away from |
|
OK I've slimmed this down to just the builder changes and left the changes around adjusting the kernels to a later PR. I think at some point if we are doing memory accounting we need to be able to ensure we don't lose the memory pool provenance which was the intent behind some of the changes, but I can raise that in a separate PR. I also think that the allocation strategies for |
Which issue does this PR close?
None at the moment
Rationale for this change
Buffer Builder had a few bounds checks when adjusting the inner MutableBuffer.
What changes are included in this PR?
Adjusts the buffer builder struct to be a little more slimmed down, deferring to the
MutableBufferfor a lot of things.Are these changes tested?
cargo test -p arrow-buffer-- 244 tests passedcargo test -p arrow-select -p arrow-ord -p arrow-array-- all tests passedcargo +nightly miri test -p arrow-buffer -- buffer::mutable-- all tests passedcargo +nightly miri test -p arrow-buffer -- builder-- all tests passedAre there any user-facing changes?
Nope