[RFC][WIP] Add backend exec protocol by rjzamora · Pull Request #711 · dask/dask-expr

rjzamora · 2024-01-11T19:11:17Z

Note that this is not a high priority, but I explored the idea a few months ago and wanted to share the branch in case others are interested.

The idea here is that the expression system used in dask-expr also makes it pretty easy to directly execute the query using the backend library (rather than constructing a task graph and scheduling the tasks). I'd expect this to be useful for both debugging and for smallish-data applications. For the latter case, this effectively allows the user to apply query optimization to a "serial" pandas or cudf query.

This is only a half-baked idea for now. Only a small subset of expression types are supported (e.g. FromPandas, Blockwise, SortValues, SetIndex, Merge, GroupbyAggregation).
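To make the idea concrete, here is a minimal sketch of direct execution in the recursive style described in this thread, where each node asks its children to execute themselves. Everything in it is an assumption made for illustration: the class names, the `exec_backend` method, and the use of plain Python lists of dicts as a stand-in for a pandas or cudf DataFrame. It is not the code on the branch.

```python
# Hypothetical toy expression classes; NOT dask-expr's actual API.
from dataclasses import dataclass


class Expr:
    def exec_backend(self):
        # Expression types without an implementation fail loudly.
        raise NotImplementedError(
            f"{type(self).__name__} does not support direct execution"
        )


@dataclass
class FromRows(Expr):
    rows: list  # list of dicts, standing in for a backend DataFrame

    def exec_backend(self):
        return list(self.rows)


@dataclass
class Filter(Expr):
    child: Expr
    column: str
    minimum: int

    def exec_backend(self):
        # Recursive style: the node executes its own child first.
        data = self.child.exec_backend()
        return [row for row in data if row[self.column] >= self.minimum]


@dataclass
class SortValues(Expr):
    child: Expr
    by: str

    def exec_backend(self):
        data = self.child.exec_backend()
        return sorted(data, key=lambda row: row[self.by])


# Build a small query and run it without any task graph or scheduler.
source = FromRows([{"x": 3}, {"x": 1}, {"x": 2}])
expr = SortValues(Filter(source, column="x", minimum=2), by="x")
result = expr.exec_backend()
```

In a real setting the expression would be optimized before execution, which is where the benefit for a "serial" query would come from.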

mrocklin · 2024-01-11T20:05:42Z

Two thoughts:

1. Probably the per-class methods should only have to do their own operation, and not ask their children to exec themselves. There should probably be some other thing that's walking the tree, similar to how we handle simplify. It's important, I think, to make the per-class stuff as simple as possible, and take as much burden as we can onto a centralized traversal system.
2. I'm curious about the protocol name. Something like __exec__ could make sense. I could also imagine being library specific, like __exec_pandas__. I don't have confidence here one way or the other though.
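A rough sketch of what the centralized-traversal variant might look like: each class only implements its own operation on already-computed inputs, and a single `execute` function walks the tree. The classes, the `dependencies` method, and the `execute` helper are hypothetical, and `__exec__` is used only because it is one of the names floated above, not a settled choice. Plain lists stand in for backend objects.

```python
# Hypothetical sketch; names are illustrative, not dask-expr's API.
from dataclasses import dataclass


class Expr:
    def dependencies(self):
        # Child expressions whose results this node consumes.
        return []

    def __exec__(self, *inputs):
        raise NotImplementedError(type(self).__name__)


@dataclass
class FromList(Expr):
    values: list

    def __exec__(self):
        return list(self.values)


@dataclass
class AddOne(Expr):
    child: Expr

    def dependencies(self):
        return [self.child]

    def __exec__(self, values):
        # Only this node's own operation; no recursion into children.
        return [v + 1 for v in values]


@dataclass
class Concat(Expr):
    left: Expr
    right: Expr

    def dependencies(self):
        return [self.left, self.right]

    def __exec__(self, left, right):
        return left + right


def execute(expr):
    """Walk the tree once, executing each distinct node a single time."""
    cache = {}

    def visit(node):
        key = id(node)  # identity-based cache for shared sub-expressions
        if key not in cache:
            inputs = [visit(dep) for dep in node.dependencies()]
            cache[key] = node.__exec__(*inputs)
        return cache[key]

    return visit(expr)


shared = AddOne(FromList([1, 2]))
result = execute(Concat(shared, AddOne(shared)))
```

One consequence of this split is that concerns such as caching shared sub-expressions live in the traversal rather than in every class.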

mrocklin · 2024-01-11T20:06:48Z

Also, I agree that this isn't high priority. I'd be fine personally if people ignored this until after the dask migration is done (which is high priority I think).

rjzamora · 2024-01-11T20:15:13Z

> Probably the per-class methods should only have to do their own operation, and not ask their children to exec themselves.

Yeah, I agree that something else should probably walk the tree to keep the protocol code simple. The current implementation is just a very simple way to demonstrate the concept.

> I'm curious about the protocol name

I'm personally open to anything. I ended up using "exec" instead of "apply", since apply already has a meaning in pandas. My only hesitation about using "pandas" in the name is that it would be nice to use the same language for array expressions in the future.

> I'd be fine personally if people ignored this until after the dask migration is done

Yup - Dask-expr + cudf is pretty much completely broken at the moment, so this proposal is pretty low on my list as well. Just want to make sure the "exec" idea is visible, and that there is a space to discuss it.

mrocklin · 2024-01-12T14:07:30Z

Also cc'ing @TomNicholas and @tomwhite who have expressed interest in using dask-expr for things other than Dask. This PR isn't mature, but it's a good example of feasibility.

basic exec protocol

1f03a07

Merge remote-tracking branch 'upstream/main' into backend-exec-2

6e81364

rjzamora wants to merge 2 commits into dask:main from rjzamora:backend-exec-2
