Skip to content

Implement min, max, sum for run-end-encoded arrays.#9409

Draft
brunal wants to merge 1 commit intoapache:mainfrom
brunal:ree-agg
Draft

Implement min, max, sum for run-end-encoded arrays.#9409
brunal wants to merge 1 commit intoapache:mainfrom
brunal:ree-agg

Conversation

@brunal
Copy link
Contributor

@brunal brunal commented Feb 13, 2026

Efficient implementations:

  • min & max work directly on the values child array.
  • sum folds over run lengths & values, without decompressing the array.

In particular, those implementations takes care of the logical offset & len of the run-end-encoded arrays. This is non-trivial:

  • We get the physical start & end indices in O(log(#runs)), but those are incorrect for empty arrays.
  • Slicing can happen in the middle of a run. For sum, we need to track the logical start & end and reduce the run length accordingly.

Finally, one caveat: the aggregation functions only work when the child values array is a primitive array. That's fine ~always, but some client might store the values in an unexpected type. They'll either get None or an Error, depending on the aggregation function used.

This feature is tracked in #3520.

Efficient implementations:
* min & max work directly on the values child array.
* sum folds over run lengths & values, without decompressing the array.

In particular, those implementations takes care of the logical offset & len of the run-end-encoded
arrays. This is non-trivial:
* We get the physical start & end indices in O(log(#runs)), but those
  are incorrect for empty arrays.
* Slicing can happen in the middle of a run. For sum, we need to track
  the logical start & end and reduce the run length accordingly.

Finally, one caveat: the aggregation functions only work when the child
values array is a primitive array. That's fine ~always, but some client
might store the values in an unexpected type. They'll either get None or
an Error, depending on the aggregation function used.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 13, 2026
@brunal brunal marked this pull request as ready for review February 13, 2026 12:51
@brunal
Copy link
Contributor Author

brunal commented Feb 13, 2026

Note that in a future MR, I'm likely to move most of ree::fold() into run_array.rs, providing an iterator over (run_idx_start, run_idx_end, value), and use that in cmp.rs.

Comment on lines +778 to +779
// We can directly perform min/max on the values child array, as any
// run must have non-zero length. We just need take care of the logical offset & len.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use RunArray::values_slice here:

/// Similar to [`values`] but accounts for logical slicing, returning only the values
/// that are part of the logical slice of this array.
///
/// [`values`]: Self::values
pub fn values_slice(&self) -> ArrayRef {
if self.is_empty() {
return self.values.slice(0, 0);
}
let start = self.get_start_physical_index();
let end = self.get_end_physical_index();
self.values.slice(start, end - start + 1)
}

Comment on lines +800 to +810
// We will fail here if the values child array is not the PrimitiveArray<T>. That
// is okay as:
// * BooleanArray & StringArray are not primitive, but their Item is not
// ArrowNumericType, so they are not in scope for this function.
// * Having the values child array be either dict-encoded, or run-end-encoded does not
// make sense. Nor does using a custom array type.
// Note however that the Apache specification does not forbid using an exotic type as
// the values child array.
// The type parameter `A` is a TypedRunArray<'_, RunEndIndexType, ValuesArrayType>.
// Once specialization gets stabilized, this implementation can be changed to
// directly pick up `ValuesArrayType`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is quite a comprehensive explanation, I feel we can simply leave it as "we expect child array to be primitive" 🤔

  • It doesn't feel necessary to include boolean/string in the explanation given we're only in the context of numeric types
  • It's not really necessary to try explain away custom array implementations as those would be extremely rare

pub(super) fn sum_wrapping<I: RunEndIndexType, V: ArrowNumericType>(
array: &dyn Array,
) -> Option<V::Native> {
if array.null_count() == array.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run arrays don't have a null buffer so this check is essentially a noop each time

array: &'a dyn Array,
) -> Option<TypedRunArray<'a, I, PrimitiveArray<V>>> {
let array = array.as_run_opt::<I>()?;
// This fails if the values child array is not the PrimitiveArray<T>. That is okay as:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel sufficient justification is just "we support only runarrays wrapping primitive types"; no need to go into this much detail (especially as we're only dealing with numeric types so mentioning boolean/string for example seems unnecessary)

Comment on lines +660 to +664
let sum = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> {
Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len))))
})
// Safety: error type is Infallible.
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let sum = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> {
Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len))))
})
// Safety: error type is Infallible.
.unwrap();
let Ok(sum) = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> {
Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len))))
});

If using Infallible then we can destructure like so without unwrap


let logical_start = run_ends.offset();
let logical_end = run_ends.offset() + run_ends.len();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can use RunArray::values_slice

/// Similar to [`values`] but accounts for logical slicing, returning only the values
/// that are part of the logical slice of this array.
///
/// [`values`]: Self::values
pub fn values_slice(&self) -> ArrayRef {
if self.is_empty() {
return self.values.slice(0, 0);
}
let start = self.get_start_physical_index();
let end = self.get_end_physical_index();
self.values.slice(start, end - start + 1)
}

And RunEndBuffer::sliced_values (accessed via RunArray::run_ends)

/// Returns an iterator yielding run ends adjusted for the logical slice.
///
/// Each yielded value is subtracted by the [`logical_offset`] and capped
/// at the [`logical_length`].
///
/// [`logical_offset`]: Self::offset
/// [`logical_length`]: Self::len
pub fn sliced_values(&self) -> impl Iterator<Item = E> + '_ {
let offset = self.logical_offset;
let len = self.logical_length;
// Doing this roundabout way since the iterator type we return must be
// the same (i.e. cannot use std::iter::empty())
let physical_slice = if self.is_empty() {
&self.run_ends[0..0]
} else {
let start = self.get_start_physical_index();
let end = self.get_end_physical_index();
&self.run_ends[start..=end]
};
physical_slice.iter().map(move |&val| {
let val = val.as_usize().saturating_sub(offset).min(len);
E::from_usize(val).unwrap()
})
}

@brunal
Copy link
Contributor Author

brunal commented Feb 13, 2026

I'm not handling null values properly when computing sums. Back to draft.

@brunal brunal marked this pull request as draft February 13, 2026 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants