[DNM/RFC] Track dependents without weakrefs#831
Conversation
dask_expr/_core.py
Outdated
| def __hash__(self): | ||
| raise TypeError("Don't!") |
There was a problem hiding this comment.
"Funny" story: The above implementation is maintaining two dicts. One that maps names and dependencies and one that just maps names to the actual objects. Initially I just tried to use the Expr objects themselves in the mappings using a mapping from Expr -> set[Expr}. That triggered funny recursion errors and it took me a while to understand that...
Whenever there is a hash collision, the implementation of set falls back to use __eq__ to compare the new object with the one that is stored under a given hash, i.e. object.__eq__(old, new). However, this doesn't evaluate to a bool but defines another expression... 💥
There was a problem hiding this comment.
Yeah, definite __eq__ is pretty hazardous in Python. It would be reasonable, I think, to drop Expr.__eq__ and use explicit Eq(a, b) calls instead, and then rely on Frame.__eq__ for user comfort.
There was a problem hiding this comment.
Or rather, use whatever class is defined in collections.py to hold the __eq__ method
|
I have to take a closer look, but I like the solution. When I tried the same I only stored the dependents on the expression which caused me trouble while recreating the mapping. Simply building the mapping continuously and storing it on the expression should circumvent the issues I ran into. Thanks for taking a stab at this |
…krefs # Conflicts: # dask_expr/_core.py # dask_expr/_expr.py # dask_expr/io/_delayed.py
| def test_set_index_triggers_calc_when_accessing_divisions(pdf, df): | ||
| divisions_lru.data = OrderedDict() | ||
| query = df.set_index("x") | ||
| query = df.fillna(random.randint(1, 100)).set_index("x") |
There was a problem hiding this comment.
This one is funny, the previous expression was cached, so we were never calculating divisions with this PR. Using a random number here avoid this
|
Fix for failing tests is here: #880 Will merge into this branch |
@phofl this is an implementation that would track the "expression graph" on the instances themselves without needing weakref semantics.
In a nutshell, the expressions all track a dictionary that maps the names of the expresions to the names of their dependencies. When one needs the dependents, this can be reverted.
I haven't tested performance on this thing yet. The instance creation is a little more expensive now but I would be surprised if this would be a significant overhead. The dependents graph inversion should be the same cost as the weakref implementation we have right now.
Tests are currently failing because the DelayedExpr thingy is a little broken but the rest passes. I'll look at performance next and will see if this can be easily combined with #798 This Draft PR is just an FYI