Implement min, max, sum for run-end-encoded arrays.#9409
Implement min, max, sum for run-end-encoded arrays.#9409brunal wants to merge 1 commit intoapache:mainfrom
Conversation
Efficient implementations: * min & max work directly on the values child array. * sum folds over run lengths & values, without decompressing the array. In particular, those implementations takes care of the logical offset & len of the run-end-encoded arrays. This is non-trivial: * We get the physical start & end indices in O(log(#runs)), but those are incorrect for empty arrays. * Slicing can happen in the middle of a run. For sum, we need to track the logical start & end and reduce the run length accordingly. Finally, one caveat: the aggregation functions only work when the child values array is a primitive array. That's fine ~always, but some client might store the values in an unexpected type. They'll either get None or an Error, depending on the aggregation function used.
|
Note that in a future MR, I'm likely to move most of |
| // We can directly perform min/max on the values child array, as any | ||
| // run must have non-zero length. We just need take care of the logical offset & len. |
There was a problem hiding this comment.
We can use RunArray::values_slice here:
arrow-rs/arrow-array/src/array/run_array.rs
Lines 139 to 150 in 70089ac
- Introduced by fix:[9018]Fixed RunArray slice offsets #9036
| // We will fail here if the values child array is not the PrimitiveArray<T>. That | ||
| // is okay as: | ||
| // * BooleanArray & StringArray are not primitive, but their Item is not | ||
| // ArrowNumericType, so they are not in scope for this function. | ||
| // * Having the values child array be either dict-encoded, or run-end-encoded does not | ||
| // make sense. Nor does using a custom array type. | ||
| // Note however that the Apache specification does not forbid using an exotic type as | ||
| // the values child array. | ||
| // The type parameter `A` is a TypedRunArray<'_, RunEndIndexType, ValuesArrayType>. | ||
| // Once specialization gets stabilized, this implementation can be changed to | ||
| // directly pick up `ValuesArrayType`. |
There was a problem hiding this comment.
While this is quite a comprehensive explanation, I feel we can simply leave it as "we expect child array to be primitive" 🤔
- It doesn't feel necessary to include boolean/string in the explanation given we're only in the context of numeric types
- It's not really necessary to try explain away custom array implementations as those would be extremely rare
| pub(super) fn sum_wrapping<I: RunEndIndexType, V: ArrowNumericType>( | ||
| array: &dyn Array, | ||
| ) -> Option<V::Native> { | ||
| if array.null_count() == array.len() { |
There was a problem hiding this comment.
Run arrays don't have a null buffer so this check is essentially a noop each time
| array: &'a dyn Array, | ||
| ) -> Option<TypedRunArray<'a, I, PrimitiveArray<V>>> { | ||
| let array = array.as_run_opt::<I>()?; | ||
| // This fails if the values child array is not the PrimitiveArray<T>. That is okay as: |
There was a problem hiding this comment.
I feel sufficient justification is just "we support only runarrays wrapping primitive types"; no need to go into this much detail (especially as we're only dealing with numeric types so mentioning boolean/string for example seems unnecessary)
| let sum = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> { | ||
| Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len)))) | ||
| }) | ||
| // Safety: error type is Infallible. | ||
| .unwrap(); |
There was a problem hiding this comment.
| let sum = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> { | |
| Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len)))) | |
| }) | |
| // Safety: error type is Infallible. | |
| .unwrap(); | |
| let Ok(sum) = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> { | |
| Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len)))) | |
| }); |
If using Infallible then we can destructure like so without unwrap
|
|
||
| let logical_start = run_ends.offset(); | ||
| let logical_end = run_ends.offset() + run_ends.len(); | ||
|
|
There was a problem hiding this comment.
I believe we can use RunArray::values_slice
arrow-rs/arrow-array/src/array/run_array.rs
Lines 139 to 150 in d8946ca
And RunEndBuffer::sliced_values (accessed via RunArray::run_ends)
arrow-rs/arrow-buffer/src/buffer/run.rs
Lines 192 to 215 in d8946ca
|
I'm not handling null values properly when computing sums. Back to draft. |
Efficient implementations:
In particular, those implementations takes care of the logical offset & len of the run-end-encoded arrays. This is non-trivial:
Finally, one caveat: the aggregation functions only work when the child values array is a primitive array. That's fine ~always, but some client might store the values in an unexpected type. They'll either get None or an Error, depending on the aggregation function used.
This feature is tracked in #3520.