Skip to content

[FEA] first and last as hash based aggregates #11141

@revans2

Description

@revans2

Is your feature request related to a problem? Please describe.
In a recent customer query I found that it was rather slow because they were doing a first aggregation along with a number of other small aggregations. First is translated into an NTH_ELEMENT aggregation, either with or without null handling. But NTH_ELEMENT is implemented as a sort based aggregation, not a hash based one. This slows down the entire query. If I replace first with max the entire query gets to be 18% faster.

I know in the general case NTH_ELEMENT cannot be a hash based aggregation, but for common SQL operations like first and last we should be able to make it a hash aggregate.

Describe the solution you'd like
I personally would love for some magic to happen behind the scenes with NTH_ELEMENT where it can become a hash aggregate if n is 0 or -1. But I would be happy to have some new FIRST and LAST aggregations instead, if that is simpler.

The algorithm I have been thinking about for first, is to do a min aggregation on a counting sequence starting at 0. Then we use the result of that as a gather map to pull in the first value from the original input column. If we don't want to include nulls, then instead of using the counting iterator directly we also pull in the null mask from the original input column, and we replace nulls in the gather map with -1 before doing the gather.

For last we would switch it over to doing max aggregation instead of a min.

Describe alternatives you've considered
We could writ this ourselves, but the current Spark aggregation code does not make it simple to get access to the original input column after doing the aggregation. I can change that, but I didn't want to do that until I head from CUDF about how hard this might be. Also the CUDF version is going to be more efficient, because I might not have to materialize as much data depending on how much can use thrust iterators to manipulate the original input data.

Additional context
Add any other context, code examples, or references to existing implementations about the feature request here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    0 - BacklogIn queue waiting for assignmentPerformancePerformance related issueSparkFunctionality that helps Spark RAPIDSfeature requestNew feature or requestlibcudfAffects libcudf (C++/CUDA) code.

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions