[FEA] first and last as hash based aggregates

**Is your feature request related to a problem? Please describe.**
In a recent customer query I found that it was rather slow because they were doing a first aggregation along with a number of other small aggregations.  First is translated into an NTH_ELEMENT aggregation, either with or without null handling.  But NTH_ELEMENT is implemented as a sort based aggregation, not a hash based one.  This slows down the entire query. If I replace `first` with `max` the entire query gets to be 18% faster.

I know in the general case NTH_ELEMENT cannot be a hash based aggregation, but for common SQL operations like `first` and `last` we should be able to make it a hash aggregate.

**Describe the solution you'd like**
I personally would love for some magic to happen behind the scenes with NTH_ELEMENT where it can become a hash aggregate if `n` is `0` or `-1`.  But I would be happy to have some new FIRST and LAST aggregations instead, if that is simpler.

The algorithm I have been thinking about for first, is to do a min aggregation on a counting sequence starting at 0. Then we use the result of that as a gather map to pull in the first value from the original input column.  If we don't want to include nulls, then instead of using the counting iterator directly we also pull in the null mask from the original input column, and we replace nulls in the gather map with -1 before doing the gather.

For last we would switch it over to doing max aggregation instead of a min.

**Describe alternatives you've considered**
We could writ this ourselves, but the current Spark aggregation code does not make it simple to get access to the original input column after doing the aggregation. I can change that, but I didn't want to do that until I head from CUDF about how hard this might be.  Also the CUDF version is going to be more efficient, because I might not have to materialize as much data depending on how much can use thrust iterators to manipulate the original input data.

**Additional context**
Add any other context, code examples, or references to existing implementations about the feature request here.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] first and last as hash based aggregates #11141

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEA] first and last as hash based aggregates #11141

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions