
Not sure how best to handle missing #22

@ssfrr

I'm not actually sure where the right place on the stack is to fix this, because it seems to cut across several layers.

Here's an example. Say I want to get the mean of each group, ignoring missings. Notice that for the foo=2 group, both elements are missing.

using SplitApplyCombine
using Statistics  # provides mean

table = [(foo=1, bar=rand()),
         (foo=2, bar=missing),
         (foo=3, bar=rand()),
         (foo=1, bar=missing),
         (foo=2, bar=missing),
         (foo=3, bar=rand())]
map(mean ∘ skipmissing, group(r->r.foo, r->r.bar, table))

This throws the error MethodError: no method matching zero(::Type{Any})

This is the result of a cascade of things, most of which seem pretty reasonable in isolation, which is why it's not clear (to me anyways) what the right fix is:

  1. mean doesn't know how to handle an empty array Any[]. I don't think there's anything more reasonable for mean to do here.
  2. table doesn't have useful type information (see type promotion of missing inside tuples, JuliaLang/julia#31077).
  3. group seems to set the type of the dictionary elements based on the eltype of table.

I'm not sure if there's a good resolution to this. Even if group built up the groups iteratively rather than pre-allocating, for a group with only missings it would end up with an Array{Missing}, which still doesn't help mean figure out what a reasonable answer is.
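To make the role of the element type concrete, here is a small sketch that only uses Base and Statistics. It assumes the foo=2 group ends up as a Vector{Any} (as item 3 suggests) and compares it with the same values stored under other element types. The `throws` helper is hypothetical and exists only for this illustration.

```julia
using Statistics

# Hypothetical helper for this sketch: report whether calling `f` throws.
function throws(f)
    try
        f()
        return false
    catch
        return true
    end
end

# The same "all missing" group under three different element types.
untyped      = Any[missing, missing]                     # no type information left
only_missing = Missing[missing, missing]                 # what iterative building would give
typed        = Union{Missing,Float64}[missing, missing]  # informative element type

# With no usable element type, mean has nothing to build a zero from.
untyped_fails      = throws(() -> mean(skipmissing(untyped)))
only_missing_fails = throws(() -> mean(skipmissing(only_missing)))

# With Float64 known, the empty mean is 0.0 / 0.
typed_result = mean(skipmissing(typed))
```

Only the third case gives mean enough information to return something, which is why both workarounds below amount to putting Float64 back into the type.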

My current workaround is to re-inject the type information, but it took some digging to figure out what the actual problem was, and it is not pretty:

map(c->mean(Vector{Float64}(collect(skipmissing(c)))), group(r->r.foo, r->r.bar, table))
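One way to make this workaround a little less noisy would be to wrap it in a small helper that takes the element type as an argument. This is only a sketch: `mean_skipmissing` is a made-up name, not something provided by SplitApplyCombine or Statistics.

```julia
using Statistics

# Hypothetical helper: mean of the non-missing values of `xs`, with the
# element type `T` supplied by the caller instead of inferred from `xs`.
mean_skipmissing(::Type{T}, xs) where {T} = mean(collect(T, skipmissing(xs)))

mixed       = Any[1.0, missing, 3.0]   # a group with some data
empty_group = Any[missing, missing]    # a group with only missings

m_mixed = mean_skipmissing(Float64, mixed)        # mean of [1.0, 3.0]
m_empty = mean_skipmissing(Float64, empty_group)  # mean of Float64[], i.e. NaN
```

With the table above it would be used as map(c -> mean_skipmissing(Float64, c), group(r->r.foo, r->r.bar, table)). The caller still has to know the element type up front, so this tidies the workaround rather than fixing the underlying problem.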

Another workaround is setting the type of table explicitly:

table = NamedTuple{(:foo, :bar), Tuple{Int64, Union{Missing,Float64}}}[
    (foo=1, bar=rand()),
    ...

But that gets pretty verbose.
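A possibly less verbose variant of the same idea is to name the row type once and reuse it. The sketch below assumes Julia 1.5 or later for the @NamedTuple macro; `Row` is a hypothetical alias, and the bar values are fixed numbers standing in for rand().

```julia
# Hypothetical alias for the row type (requires Julia 1.5+ for @NamedTuple).
const Row = @NamedTuple{foo::Int64, bar::Union{Missing,Float64}}

# A typed array literal converts each row to Row, so the element type
# keeps the Union{Missing,Float64} information for bar.
table = Row[(foo=1, bar=0.5),
            (foo=2, bar=missing),
            (foo=2, bar=missing)]

row_type = eltype(table)
bar_type = fieldtype(row_type, :bar)
```

This keeps the type annotation in one place, though it still has to be written by hand for every table shape.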

Any thoughts as to the best way to handle this?
