Skip to content

Pattern for windowed queries? #35

@svilupp

Description

@svilupp

First of all, thank you again for creating these packages!

I have a use case where I need to process a large number of tables in a windowed query (think tables saved by the hour and I need two-hour averages).

I discussed with @jpsamaroo in JCon that it would be best to process it by simply operating on the chunks underpinning the DTable, but I've been struggling to produce a working example (see MWE below).

I get an error about indexing: "ERROR: ArgumentError: invalid index: EagerThunk (running) of type Dagger.EagerThunk"
Or an error about Chunk: "ERROR: type Chunk has no field a"

I probably wrongly assumed that wrapping it in Dagger.@Spawn will delay execution until it's ready to be run (ie, that inner function will never see the "Chunk")

MWE:

using Dagger, DTables, DataFrames
# starting from a DF to make things simle
x = DataFrame(; a=1:10_000, b=1:10_000)
# create 10 partitions
d = DTable(x, 1000)

results = []
for i in 1:(length(d.chunks)-1)
    tbl = Dagger.@spawn d.chunks[i]
    tbl_next = Dagger.@spawn d.chunks[i+1]
    # running on current+next partition and calculating average of the column :a
    out = Dagger.@spawn sum(tbl.a .+ tbl_next.a) / 2
    # in an ideal world, the return is an object not scalar (eg, DataFrame), hence the generic `push!`
    Dagger.@spawn push!(results, out)
end

# note: this is imperfect as it doesn't process 10th partition, but that's okay. I wanted to keep the example as simple as possible

Thank you!

EDIT:
The bonus question would be how to run it out-of-core as presented by Julian? (with deserialize/serialize arguments)
I couldn't find that API on the main branch / in the docs.
I suspect it might not be released yet, right?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions