Implement min, max, sum for run-end-encoded arrays. by brunal · Pull Request #9409 · apache/arrow-rs

brunal · 2026-02-13T12:30:58Z

Efficient implementations:

* min & max work directly on the values child array.
* sum folds over run lengths & values, without decompressing the array.
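
As a rough illustration of the two points above, here is a minimal std-only sketch. It does not use the arrow crates: the `Ree` struct, `ree_sum` and `ree_min_max` are hypothetical names made up for this example, and slicing and nulls are ignored.

```rust
/// Hypothetical, simplified stand-in for a run-end-encoded array:
/// `run_ends[i]` is the exclusive logical end of run `i`, and
/// `values[i]` is the value repeated throughout that run.
struct Ree<'a> {
    run_ends: &'a [usize],
    values: &'a [i64],
}

/// Sum without decompressing: each run contributes `value * run_length`.
fn ree_sum(ree: &Ree) -> Option<i64> {
    if ree.values.is_empty() {
        return None;
    }
    let mut prev_end = 0usize;
    let mut acc = 0i64;
    for (&end, &value) in ree.run_ends.iter().zip(ree.values) {
        // Run length is the distance between consecutive run ends.
        let len = (end - prev_end) as i64;
        acc = acc.wrapping_add(value.wrapping_mul(len));
        prev_end = end;
    }
    Some(acc)
}

/// Min and max never need the run lengths: every run has a non-zero
/// length, so every entry of `values` occurs at least once.
fn ree_min_max(ree: &Ree) -> Option<(i64, i64)> {
    let min = ree.values.iter().copied().min()?;
    let max = ree.values.iter().copied().max()?;
    Some((min, max))
}
```

With run ends `[2, 5, 6]` and values `[10, -1, 7]`, the logical array is `[10, 10, -1, -1, -1, 7]`, so the sum is `10*2 + (-1)*3 + 7*1`.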

In particular, these implementations take care of the logical offset & length of the run-end-encoded arrays. This is non-trivial:

* We get the physical start & end indices in O(log(#runs)), but those are incorrect for empty arrays, so the empty case has to be handled separately.
* Slicing can happen in the middle of a run. For sum, we need to track the logical start & end and reduce the run length accordingly.
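
A minimal sketch of the index lookup, assuming run ends are stored as a sorted slice of exclusive logical ends. `physical_range` is a hypothetical helper written for this illustration, not the arrow-rs API; it returns `None` for an empty window because no run can meaningfully be reported in that case.

```rust
/// Returns the physical (run) indices of the first and last run touched by
/// the logical window `[offset, offset + len)`, or `None` if the window is
/// empty or extends past the last run.
fn physical_range(run_ends: &[usize], offset: usize, len: usize) -> Option<(usize, usize)> {
    if len == 0 {
        // An empty window touches no run; a binary search would still
        // return some index, which is why this case is special.
        return None;
    }
    // The first run whose exclusive end lies past `offset` contains `offset`.
    let start = run_ends.partition_point(|&end| end <= offset);
    // The same search for the last logical index of the window.
    let last = offset + len - 1;
    let end = run_ends.partition_point(|&end| end <= last);
    (end < run_ends.len()).then_some((start, end))
}
```

Both lookups are binary searches over the run ends, which is where the O(log(#runs)) bound comes from.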

Finally, one caveat: the aggregation functions only work when the child values array is a primitive array. That is almost always the case, but a client might store the values in an unexpected type. Such clients will get either None or an Error, depending on the aggregation function used.

This feature is tracked in #3520.

brunal · 2026-02-13T13:57:44Z

Note that in a future MR, I'm likely to move most of ree::fold() into run_array.rs, providing an iterator over (run_idx_start, run_idx_end, value), and use that in cmp.rs.

Jefffrey · 2026-02-13T14:36:13Z

arrow-arith/src/aggregate.rs

+            // We can directly perform min/max on the values child array, as any
+            // run must have non-zero length. We just need take care of the logical offset & len.


We can use RunArray::values_slice here:

arrow-rs/arrow-array/src/array/run_array.rs

Lines 139 to 150 in 70089ac

    /// Similar to [`values`] but accounts for logical slicing, returning only the values
    /// that are part of the logical slice of this array.
    ///
    /// [`values`]: Self::values
    pub fn values_slice(&self) -> ArrayRef {
        if self.is_empty() {
            return self.values.slice(0, 0);
        }
        let start = self.get_start_physical_index();
        let end = self.get_end_physical_index();
        self.values.slice(start, end - start + 1)
    }

Introduced by "fix:[9018]Fixed RunArray slice offsets" (#9036)

Jefffrey · 2026-02-13T14:44:19Z

arrow-arith/src/aggregate.rs

+            // We will fail here if the values child array is not the PrimitiveArray<T>. That
+            // is okay as:
+            // * BooleanArray & StringArray are not primitive, but their Item is not
+            //   ArrowNumericType, so they are not in scope for this function.
+            // * Having the values child array be either dict-encoded, or run-end-encoded does not
+            //   make sense. Nor does using a custom array type.
+            // Note however that the Apache specification does not forbid using an exotic type as
+            // the values child array.
+            // The type parameter `A` is a TypedRunArray<'_, RunEndIndexType, ValuesArrayType>.
+            // Once specialization gets stabilized, this implementation can be changed to
+            // directly pick up `ValuesArrayType`.


While this is quite a comprehensive explanation, I feel we can simply leave it as "we expect child array to be primitive" 🤔

It doesn't feel necessary to include boolean/string in the explanation given we're only in the context of numeric types

It's not really necessary to try to explain away custom array implementations, as those would be extremely rare

Jefffrey · 2026-02-13T14:47:42Z

arrow-arith/src/aggregate.rs

+    pub(super) fn sum_wrapping<I: RunEndIndexType, V: ArrowNumericType>(
+        array: &dyn Array,
+    ) -> Option<V::Native> {
+        if array.null_count() == array.len() {


Run arrays don't have a null buffer so this check is essentially a noop each time
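
To illustrate why a check on the outer array cannot detect nulls: with no null buffer on the run array itself, any nulls sit in the values child and apply to whole runs. The following std-only sketch models the values child as `&[Option<i64>]`; `sum_skipping_null_runs` is a hypothetical name, and this is not necessarily how the PR ended up handling nulls.

```rust
/// Sums a run-end-encoded sequence whose values may be null (`None`).
/// Null runs are skipped; the result is `None` when every run is null
/// or when there are no runs at all.
fn sum_skipping_null_runs(run_ends: &[usize], values: &[Option<i64>]) -> Option<i64> {
    let mut prev_end = 0usize;
    let mut acc: Option<i64> = None;
    for (&end, value) in run_ends.iter().zip(values) {
        let len = (end - prev_end) as i64;
        prev_end = end;
        // Only non-null runs contribute `value * run_length`.
        if let Some(v) = value {
            let term = v.wrapping_mul(len);
            acc = Some(acc.unwrap_or(0).wrapping_add(term));
        }
    }
    acc
}
```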

Jefffrey · 2026-02-13T14:48:55Z

arrow-arith/src/aggregate.rs

+        array: &'a dyn Array,
+    ) -> Option<TypedRunArray<'a, I, PrimitiveArray<V>>> {
+        let array = array.as_run_opt::<I>()?;
+        // This fails if the values child array is not the PrimitiveArray<T>. That is okay as:


I feel sufficient justification is just "we support only runarrays wrapping primitive types"; no need to go into this much detail (especially as we're only dealing with numeric types so mentioning boolean/string for example seems unnecessary)

Jefffrey · 2026-02-13T14:51:12Z

arrow-arith/src/aggregate.rs

+        let sum = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> {
+            Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len))))
+        })
+        // Safety: error type is Infallible.
+        .unwrap();


Suggested change

-        let sum = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> {
-            Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len))))
-        })
-        // Safety: error type is Infallible.
-        .unwrap();
+        let Ok(sum) = fold(ree, |acc, val, len| -> Result<V::Native, Infallible> {
+            Ok(acc.add_wrapping(val.mul_wrapping(V::Native::usize_as(len))))
+        });

If using Infallible then we can destructure like so without unwrap
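
A self-contained sketch of this pattern follows. The `fold` below is a hypothetical stand-in over `(value, run_length)` pairs, not the PR's helper, and the example assumes a toolchain of Rust 1.82 or newer, where a `let Ok(..) = ...` binding on a `Result<_, Infallible>` is accepted as irrefutable.

```rust
use std::convert::Infallible;

/// Hypothetical stand-in for a fallible fold over `(value, run_length)` pairs.
fn fold<E>(
    runs: &[(i64, usize)],
    mut f: impl FnMut(i64, i64, usize) -> Result<i64, E>,
) -> Result<i64, E> {
    let mut acc = 0i64;
    for &(val, len) in runs {
        acc = f(acc, val, len)?;
    }
    Ok(acc)
}

fn sum_runs_infallible(runs: &[(i64, usize)]) -> i64 {
    // `Infallible` has no values, so the `Err` arm is statically impossible
    // and the compiler accepts this `let` without an `else` or `unwrap`.
    let Ok(sum) = fold(runs, |acc, val, len| -> Result<i64, Infallible> {
        Ok(acc.wrapping_add(val.wrapping_mul(len as i64)))
    });
    sum
}
```

Compared with `.unwrap()`, the destructuring form needs no justification comment: if the error type ever stops being `Infallible`, the code fails to compile instead of panicking at runtime.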

Jefffrey · 2026-02-13T14:55:38Z

arrow-arith/src/aggregate.rs

+
+        let logical_start = run_ends.offset();
+        let logical_end = run_ends.offset() + run_ends.len();
+


I believe we can use RunArray::values_slice

arrow-rs/arrow-array/src/array/run_array.rs

Lines 139 to 150 in d8946ca

    /// Similar to [`values`] but accounts for logical slicing, returning only the values
    /// that are part of the logical slice of this array.
    ///
    /// [`values`]: Self::values
    pub fn values_slice(&self) -> ArrayRef {
        if self.is_empty() {
            return self.values.slice(0, 0);
        }
        let start = self.get_start_physical_index();
        let end = self.get_end_physical_index();
        self.values.slice(start, end - start + 1)
    }

And RunEndBuffer::sliced_values (accessed via RunArray::run_ends)

arrow-rs/arrow-buffer/src/buffer/run.rs

Lines 192 to 215 in d8946ca

    /// Returns an iterator yielding run ends adjusted for the logical slice.
    ///
    /// Each yielded value is subtracted by the [`logical_offset`] and capped
    /// at the [`logical_length`].
    ///
    /// [`logical_offset`]: Self::offset
    /// [`logical_length`]: Self::len
    pub fn sliced_values(&self) -> impl Iterator<Item = E> + '_ {
        let offset = self.logical_offset;
        let len = self.logical_length;
        // Doing this roundabout way since the iterator type we return must be
        // the same (i.e. cannot use std::iter::empty())
        let physical_slice = if self.is_empty() {
            &self.run_ends[0..0]
        } else {
            let start = self.get_start_physical_index();
            let end = self.get_end_physical_index();
            &self.run_ends[start..=end]
        };
        physical_slice.iter().map(move |&val| {
            let val = val.as_usize().saturating_sub(offset).min(len);
            E::from_usize(val).unwrap()
        })
    }
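
To show how adjusted run ends yield per-run lengths for a slice that starts or ends mid-run, here is a std-only sketch that mirrors the "subtract the offset, cap at the length" rule from the doc comment above. `sliced_runs` is a hypothetical helper, and it assumes `values.len() == run_ends.len()` and that the window lies within the encoded data.

```rust
/// Returns `(value, length)` for each run touched by the logical window
/// `[offset, offset + len)`, with the first and last runs trimmed to the window.
fn sliced_runs(run_ends: &[usize], values: &[i64], offset: usize, len: usize) -> Vec<(i64, usize)> {
    if len == 0 {
        return Vec::new();
    }
    // Stated assumption: the window does not extend past the last run.
    assert!(run_ends.last().is_some_and(|&e| offset + len <= e));
    // First run containing `offset`, and last run containing `offset + len - 1`.
    let start = run_ends.partition_point(|&e| e <= offset);
    let end = run_ends.partition_point(|&e| e < offset + len);
    let mut prev = 0usize;
    let mut out = Vec::new();
    for i in start..=end {
        // Shift the run end into slice coordinates and cap it at the slice length.
        let adjusted = run_ends[i].saturating_sub(offset).min(len);
        // Consecutive adjusted ends differ by exactly the trimmed run length.
        out.push((values[i], adjusted - prev));
        prev = adjusted;
    }
    out
}
```

A sum over the slice is then a fold of `value * length` over the returned pairs, with no separate bookkeeping of the logical start and end.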

brunal · 2026-02-13T14:57:02Z

I'm not handling null values properly when computing sums. Back to draft.

github-actions bot added the "arrow" label (Changes to the arrow crate) on Feb 13, 2026

brunal marked this pull request as ready for review February 13, 2026 12:51

brunal mentioned this pull request Feb 13, 2026

Implements Sum,sum_checked,min,max,is Distict,inverse for REE. #7933

Open

Jefffrey reviewed Feb 13, 2026

brunal marked this pull request as draft February 13, 2026 15:01

brunal wants to merge 1 commit into apache:main from brunal:ree-agg


2 participants
