Skip to content

Implements Sum,sum_checked,min,max,is Distict,inverse for REE. #7933

Open
rich-t-kid-datadog wants to merge 3 commits intoapache:mainfrom
rich-t-kid-datadog:baah/AggregationSupport
Open

Implements Sum,sum_checked,min,max,is Distict,inverse for REE. #7933
rich-t-kid-datadog wants to merge 3 commits intoapache:mainfrom
rich-t-kid-datadog:baah/AggregationSupport

Conversation

@rich-t-kid-datadog
Copy link
Contributor

@rich-t-kid-datadog rich-t-kid-datadog commented Jul 15, 2025

Which issue does this PR close?

This PR works towards closing the larger REE Epic

Rationale for this change

Add operations onto the REE datatype such as

  • Sum
  • Sum_checked
  • IS DISTINCT FROM
  • Max/Min
  • Distinct (Arrow already handles this apparently)

What changes are included in this PR?

Allows for REE columns to be used for the previously mentioned functions correctly and efficently.

Are these changes tested?

Yes, comprehensive tests have been added in arrow-ord/src/cmp.rs and arrow-arith/src/aggregate.rs:

  • Basic functionality tests: Same values, different values, mixed scenarios
  • Edge case tests: Empty arrays, single runs, all nulls, mixed nulls and values
  • Data type tests: Float64 and Timestamp types
  • Both operations: distinct and not_distinct operations
  • The tests verify that REE distinct operations work correctly without array expansion and handle null values properly.

Are there any user-facing changes?

Performance improvement: REE distinct operations are now much faster for datasets with repeated values
No API changes: Existing distinct() and not_distinct() functions work the same way but are now more efficient for REE arrays
No breaking changes: All existing functionality is preserved

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

…ncludes helper functions for expanding REE into logical represention
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 15, 2025
@rich-t-kid-datadog rich-t-kid-datadog force-pushed the baah/AggregationSupport branch from 4ded72e to 05fb3c0 Compare July 17, 2025 18:47
@rich-t-kid-datadog rich-t-kid-datadog force-pushed the baah/AggregationSupport branch from 05fb3c0 to 72bd81a Compare July 17, 2025 18:52
@rich-t-kid-datadog rich-t-kid-datadog force-pushed the baah/AggregationSupport branch from 5688ad3 to 9d00687 Compare July 24, 2025 15:01
@rich-t-kid-datadog rich-t-kid-datadog changed the title [Draft] implements Sum,sum_checked,min,max,is Distict,inverse for REE. Implements Sum,sum_checked,min,max,is Distict,inverse for REE. Jul 24, 2025
@rich-t-kid-datadog rich-t-kid-datadog marked this pull request as ready for review July 25, 2025 13:54
@Jefffrey
Copy link
Contributor

Jefffrey commented Feb 5, 2026

Just want to checkup on this PR, if it's still being worked on? Looking to see if we can get REE support pushed along in arrow-rs

@brunal
Copy link
Contributor

brunal commented Feb 8, 2026

According to #3520 (comment), the work was done while @rich-t-kid-datadog was an intern, and he's not available anymore. I can pick it up -- which should mostly be resolving merge conflicts.

@Jefffrey
Copy link
Contributor

Jefffrey commented Feb 9, 2026

That sounds good, I'm willing to help review this to get it along

@brunal
Copy link
Contributor

brunal commented Feb 10, 2026

This isn't ready as-is:

  • aggregate works, but is inefficient for min-max: it goes through all values (cost O(n*log(n)) for n values), when it could just look at the run-ends child array (cost O(# runs) < O(n))
  • cmp only implements distinct, and is inefficient as well (binary searches for logical->physical index)

I'll send a first PR just for aggregate, and later a second PR for cmp.

@brunal
Copy link
Contributor

brunal commented Feb 13, 2026

Aggregate work done in #9409. In particular, it takes care of sliced Ree arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants