diff --git a/NEWS.md b/NEWS.md index 79b3c8e1..0874263e 100644 --- a/NEWS.md +++ b/NEWS.md @@ -3,8 +3,29 @@ ClimaAnalysis.jl Release Notes main ------- +## Split-apply-combine + +This release features an initial implemention of the split-combine-apply +pattern. For example, to do a time average, you can now do + +```julia +import ClimaAnalysis +import Statistics: mean + +reduced_var = + var |> + ClimaAnalysis.GroupAll("time") |> + ClimaAnalysis.Reduce(mean) |> + ClimaAnalysis.combine +``` + +The resulting `OutputVar` has a time dimension containing the first time value +of `var`. For more information, see the section "Split-Apply-Combine" in the +documentation. + ## Bug fixes +- Enforce uniqueness when splitting dates by seasons across time. - Exclude NaNStatistics v0.6.57 in compat, because of correctness issues and possibility of a segfault. See this [issue](https://github.com/brenhinkeller/NaNStatistics.jl/issues/66) in diff --git a/docs/make.jl b/docs/make.jl index ea4283a3..ba54ad3d 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -35,6 +35,7 @@ makedocs(; "RMSEVariables" => "rmse_var.md", "Visualizing RMSEVariables" => "visualize_rmse_var.md", "FlatVar" => "flat.md", + "Split-Apply-Combine" => "split_apply_combine.md", "APIs" => "api.md", "How do I?" => "howdoi.md", "Developer Documentation" => "developer.md", diff --git a/docs/src/api.md b/docs/src/api.md index 6d9feb3b..16d86d6b 100644 --- a/docs/src/api.md +++ b/docs/src/api.md @@ -138,6 +138,18 @@ ClimaAnalysis.get_index(var, dim_name, val, ::MatchValue) ClimaAnalysis.get_index(var, dim_name, val, ::Index) ``` +## Split apply combine + +```@docs +ClimaAnalysis.AbstractSplitOperation +ClimaAnalysis.AbstractApplyOperation +ClimaAnalysis.SplitApplyVar +ClimaAnalysis.GroupAll +ClimaAnalysis.SplitSeason +ClimaAnalysis.Reduce +ClimaAnalysis.combine +``` + ## FlatVar !!! note "Accessor functions for versions of ClimaAnalysis after v0.5.18" diff --git a/docs/src/split_apply_combine.md b/docs/src/split_apply_combine.md new file mode 100644 index 00000000..e63544a9 --- /dev/null +++ b/docs/src/split_apply_combine.md @@ -0,0 +1,103 @@ +# Split Apply Combine + +!!! note "Experimental" + This is an initial implementation of the split-apply-combine pattern that + supports a limited set of features. As a result, the public API will not be + covered under semantic versioning and is subjected to changes, but we will + try not to change it. + + If you need a feature that is not yet implemented, please let us know! + +The paper ["The Split-Apply-Combine Strategy for Data Analysis"](https://www.jstatsoft.org/article/view/v040i01), +written by Hadley Wickham, illustrates that most data processing follows the +pattern of: +- splitting data into groups, +- applying a reduction, transformation, or filter on each group, +- combining the results together via concatenation. + +ClimaAnalysis currently implements splitting operations via a single group via +[`GroupAll`](@ref) or seasonal splits via [`SplitSeason`](@ref) and reductions +via [`Reduce`](@ref). + +## Tutorial + +The split-apply-combine pattern is only defined for `OutputVar`s. Assuming you +already have `var`, a `OutputVar`, here's a complete example of computing a time +average in a functional style approach. + +```@setup split-apply-combine +import Dates +import ClimaAnalysis +import ClimaAnalysis.Template: + TemplateVar, add_dim, add_attribs, initialize + +ref_date = Dates.DateTime(2010) +time = + ClimaAnalysis.Utils.date_to_time.( + Ref(ref_date), + [ + Dates.DateTime(2010, 1), + Dates.DateTime(2010, 2), + Dates.DateTime(2010, 3), + Dates.DateTime(2010, 4), + ], + ) +lat = [-90.0, 0.0, 90.0] +lon = [-180.0, -120.0, -60.0, 0.0, 60.0, 120.0, 180.0] +var = + TemplateVar() |> + add_dim("time", time, units = "s") |> + add_dim("lat", lat, units = "degrees") |> + add_dim("lon", lon, units = "degrees") |> + add_attribs(start_date = ref_date, short_name = "lwu") |> + initialize +``` + +```@example split-apply-combine +var +``` + +```@example split-apply-combine +import ClimaAnalysis +import Statistics: mean + +time_averaged_var = + var |> + ClimaAnalysis.GroupAll("time") |> + ClimaAnalysis.Reduce(mean) |> + ClimaAnalysis.combine +``` + +!!! note "What is the difference between this example and `average_time`?" + There are no modifications to the attributes in this example and + [`average_time`](@ref) does modifies the attributes to reflect that the + operation happens. Also, `average_time` squeezes the time dimension and + remove it while the time dimension is kept when using the + split-apply-combine pattern. + +First, a [`AbstractSplitOperation`](@ref) is called on a `OutputVar` which +produces a [`SplitApplyVar`](@ref). In this example, var is piped to +`GroupAll("time")`. The result is a `SplitApplyVar` which represents a lazy +evaluation of the split-apply-combine pattern on a `OutputVar`. + +Second, a [`AbstractApplyOperation`](@ref) is called on the `SplitApplyVar` +which record the apply operation. In this example, we compute a time average +over every group which there is only one. For [`Reduce`](@ref), the reduction +function passed must accept an `Array` and a `dims` keyword argument, and must +return an array with the same number of dimensions where the size along `dims` +is 1 (i.e., the dimension is kept but collapsed to a single element). Functions +such as `Statistics.mean`, `sum`, `minimum`, and `maximum` satisfy this +requirement. + +Third, [`combine`](@ref) is called on the `SplitApplyVar` which starts the +split, apply, and combine operations to produce the resulting `OutputVar`. The +final result is a time-averaged `OutputVar` that still have a time dimension. +The single value of the time dimension is the first date of the time dimension. +The split dimension is still present in the returned `OutputVar`, but its size +equals the number of groups. For [`GroupAll`](@ref), this is always 1. The +coordinate value for each group is taken from the first element of that group. + +```@repl split-apply-combine +ClimaAnalysis.dates(var) +ClimaAnalysis.dates(time_averaged_var) +``` diff --git a/src/Utils.jl b/src/Utils.jl index cc06cb70..cb42fbc1 100644 --- a/src/Utils.jl +++ b/src/Utils.jl @@ -369,6 +369,8 @@ function split_by_season_across_time(dates::AbstractArray{<:Dates.DateTime}) # Dates are not necessarily sorted dates = sort(dates) + allunique(dates) || error("Dates are not unique") + # Empty case isempty(dates) && return Vector{Vector{eltype(dates)}}[] diff --git a/src/Var.jl b/src/Var.jl index 6861f5dd..e19b6b78 100644 --- a/src/Var.jl +++ b/src/Var.jl @@ -2949,5 +2949,6 @@ include("outvar_operators.jl") include("outvar_dimensions.jl") include("outvar_selectors.jl") include("masks.jl") +include("split_apply_combine.jl") end diff --git a/src/split_apply_combine.jl b/src/split_apply_combine.jl new file mode 100644 index 00000000..393072c8 --- /dev/null +++ b/src/split_apply_combine.jl @@ -0,0 +1,255 @@ +export AbstractSplitOperation, + AbstractApplyOperation, + SplitApplyVar, + GroupAll, + SplitSeason, + Reduce, + combine + +""" + abstract type AbstractSplitOperation + +Supertype for operations that partition an `OutputVar` into groups in the +split-apply-combine pattern. +""" +abstract type AbstractSplitOperation end + +""" + abstract type AbstractApplyOperation + +Supertype for operations applied to each group in a split-apply-combine pattern. +""" +abstract type AbstractApplyOperation end + +""" + GroupAll <: AbstractSplitOperation + +Specify the dimension over which to apply an operation in a split-apply-combine pattern. + +`GroupAll` treats the entire dimension as a single group (i.e. no splitting occurs). It is +used in combination with a [`Reduce`](@ref) (or other apply operation) and [`combine`](@ref) +to transform an `OutputVar`. +""" +struct GroupAll <: AbstractSplitOperation + "Dimension to treat as a single group" + dim_name::String +end + +""" + SplitSeason <: AbstractSplitOperation end + +Split the dates of the `OutputVar` into seasons, in chronological order, as part of the +split-apply-combine pattern. + +If a season is empty, it will not be present as a group. The dates do not necessarily need +to be in sorted order. + +# Example + +This is a singleton type and can be constructed as follows. + +```julia +ClimaAnalysis.SplitSeason() +``` +""" +struct SplitSeason <: AbstractSplitOperation end + +""" + Reduce{F} <: AbstractApplyOperation + +Specify a reduction function to apply to each group in a split-apply-combine pattern. + +The reduction `f` must accept an `Array` and a `dims` keyword argument, and return an array +of the same number of dimensions where the size along `dims` is 1 (i.e., `f` does not drop +the reduced dimension). + +It is used in combination with an [`GroupAll`](@ref) and [`combine`](@ref) to transform an +`OutputVar`. + +# Example + +You can pass the reduction function when constructing a `Reduce`. + +```julia +ClimaAnalysis.Reduce(sum) +``` +""" +struct Reduce{F} <: AbstractApplyOperation + "The reduction function applied to each group along the split dimension" + reduction::F +end + +""" + SplitApplyVar + +Represents lazy split and apply operations before combining to produce the resulting +`OutputVar`. +""" +struct SplitApplyVar{ + OUTPUTVAR <: OutputVar, + SPLIT <: AbstractSplitOperation, + APPLY <: Union{AbstractApplyOperation, Nothing}, +} + "The `OutputVar` to split, apply, and combine." + var::OUTPUTVAR + + "A split operation that partitions the dimension into groups of indices." + split_op::SPLIT + + "An apply operation to apply to each group, or `nothing` if not yet set." + apply_op::APPLY +end + +""" + (split_op::AbstractSplitOperation)(var::OutputVar) + +Lazily applies `split_op` to `var` as part of the split-apply-combine pattern. + +The result should be passed to an [`AbstractApplyOperation`](@ref) and then to +[`combine`](@ref) to produce the final `OutputVar`. +""" +function (split_op::AbstractSplitOperation)(var::OutputVar) + return SplitApplyVar(var, split_op, nothing) +end + +""" + (split_op::AbstractSplitOperation)(::SplitApplyVar) + +This throws a helpful error message that you cannot apply another split +operation. +""" +function (split_op::AbstractSplitOperation)(::SplitApplyVar) + error("A split operation is already set") +end + +""" + (apply_op::AbstractApplyOperation)(split_apply_var::SplitApplyVar) + +Lazily applies `apply_op` to `split_apply_var` as part of the split-apply-combine pattern. + +The result should be passed to [`combine`](@ref) to produce the final `OutputVar`. + +You cannot applies another `AbstractApplyOperation` to a `split_apply_var` that already have +an `AbstractApplyOperation` applied on it. +""" +function (apply_op::AbstractApplyOperation)(split_apply_var::SplitApplyVar) + curr_apply_op = split_apply_var.apply_op + isnothing(curr_apply_op) || error("An apply operation is already set") + (; var, split_op) = split_apply_var + return SplitApplyVar(var, split_op, apply_op) +end + +""" + combine(split_apply_var::SplitApplyVar) + +Materialize the `OutputVar` with the split and apply operations. + +This splits the `OutputVar` into groups via an `AbstractSplitOperation`, applies +an `AbstractApplyOperation` to each group, and combine the results together. The +split dimension is kept. The coordinate values of the split dimension in the +returned `OutputVar` are the first coordinate value from each group. +""" +function combine(split_apply_var::SplitApplyVar) + (; var, split_op, apply_op) = split_apply_var + isnothing(apply_op) && error("No apply operation is defined") + return _combine(var, split_op, apply_op) +end + +""" + _combine( + var::OutputVar, + split_op::AbstractSplitOperation, + apply_op::AbstractApplyOperation, + ) + +Create groups using `split_op`, perform an operation on each group using `apply_op`, and +combine the results together. +""" +function _combine( + var::OutputVar, + split_op::AbstractSplitOperation, + apply_op::AbstractApplyOperation, +) + dim_idx, groups = _create_groups(var, split_op) + return _apply_and_combine(var, apply_op, dim_idx, groups) +end + +""" + _create_groups(var::OutputVar, group_all::GroupAll) + +Return the index of the grouped dimension and a single-element vector consisting of a vector +of all indices of that dimension. +""" +function _create_groups(var::OutputVar, group_all::GroupAll) + (; dim_name) = group_all + dim_name_in_var = find_corresponding_dim_name_in_var(dim_name, var) + dim_idx = var.dim2index[dim_name_in_var] + return dim_idx, [eachindex(var.dims[dim_name_in_var])] +end + +""" + _create_groups(var::OutputVar, ::SplitSeason) + +Return the index of the time dimension and a vector of index vectors, one per non-empty +season. +""" +function _create_groups(var::OutputVar, ::SplitSeason) + _check_time_dim(var) + + # Get dimension index + dim_name_in_var = find_corresponding_dim_name_in_var("time", var) + dim_idx = var.dim2index[dim_name_in_var] + + # Compute groups + var_dates = dates(var) + date_groups = ClimaAnalysis.split_by_season_across_time(var_dates) + dim_indices_groups = + [indexin(group, var_dates) for group in date_groups if !isempty(group)] + return dim_idx, dim_indices_groups +end + +""" + _apply_and_combine( + var::OutputVar, + apply_op::Reduce, + dim_idx, + dim_indices_groups, + ) + +Apply `apply_op` over the `dim_idx`th dimension for each group specified by +`dim_indices_groups` over the data of `var` and combine the results together. +""" +function _apply_and_combine( + var::OutputVar, + apply_op::Reduce, + dim_idx, + dim_indices_groups, +) + data_groups = map(dim_indices_groups) do dim_indices + index_tuple = ntuple( + idx -> idx == dim_idx ? dim_indices : Colon(), + ndims(var.data), + ) + view(var.data, index_tuple...) + end + + reduction = apply_op.reduction + reduced = map(g -> reduction(g; dims = dim_idx), data_groups) + ret_data = cat(reduced..., dims = dim_idx) + + # Note: For Reduce, the size of the resulting dimension is the same as the + # number of groups + # We will always get the first element of the dimension + ret_dim_indices = (first(dim_indices) for dim_indices in dim_indices_groups) + + # New dimension to return + dim_name_in_var = var.index2dim[dim_idx] + dim = [var.dims[dim_name_in_var][i] for i in ret_dim_indices] + + # Make new dimensions for OutputVar + ret_dims = deepcopy(var.dims) + ret_dims[dim_name_in_var] = dim + + # Do not update the attributes or dimension attributes + return ClimaAnalysis.remake(var, dims = ret_dims, data = ret_data) +end diff --git a/test/runtests.jl b/test/runtests.jl index 7060e1e0..9110317f 100644 --- a/test/runtests.jl +++ b/test/runtests.jl @@ -16,6 +16,7 @@ using Test @safetestset "OutputVar" begin @time include("test_Var.jl") end @safetestset "Selectors" begin @time include("test_outvar_selectors.jl") end @safetestset "FlatVar" begin @time include("test_flat.jl") end +@safetestset "SplitApplyCombine" begin @time include("test_split_apply_combine.jl") end @safetestset "MakieExt" begin @time include("test_MakieExt.jl") end @safetestset "GeoMakieExt" begin @time include("test_GeoMakieExt.jl") end #! format: on diff --git a/test/test_split_apply_combine.jl b/test/test_split_apply_combine.jl new file mode 100644 index 00000000..77b1dec3 --- /dev/null +++ b/test/test_split_apply_combine.jl @@ -0,0 +1,190 @@ +using Test +import ClimaAnalysis + +import Dates +import Statistics: mean, std + +import ClimaAnalysis.Template: + TemplateVar, + make_template_var, + add_attribs, + add_dim, + add_time_dim, + add_lon_dim, + add_lat_dim, + add_data, + ones_data, + zeros_data, + one_to_n_data, + initialize + +@testset "GroupAll and Reduce" begin + time = [0.0, 1.0, 2.0, 3.0] + lat = [-90.0, 0.0, 90.0] + lon = [-180.0, -120.0, -60.0, 0.0, 60.0, 120.0, 180.0] + var = + TemplateVar() |> + add_dim("time", time, units = "s") |> + add_dim("lat", lat, units = "degrees") |> + add_dim("lon", lon, units = "degrees") |> + add_attribs(start_date = Dates.DateTime(2010), short_name = "lwu") |> + initialize + + dimnames = ClimaAnalysis.dim_names(var) + reductions = [sum, prod, minimum, maximum, mean, std] + for reduction in reductions + for (i, dimname) in enumerate(dimnames) + reduced_var = + var |> + ClimaAnalysis.GroupAll(dimname) |> + ClimaAnalysis.Reduce(reduction) |> + ClimaAnalysis.combine + @test reduced_var.data == reduction(var.data, dims = i) + @test length(reduced_var.dims[dimname]) == 1 + @test reduced_var.dims[dimname][1] == first(var.dims[dimname]) + @test reduced_var.attributes == var.attributes + @test reduced_var.dim_attributes == var.dim_attributes + end + end + + @test_throws ErrorException var |> + ClimaAnalysis.GroupAll("cool_dim") |> + ClimaAnalysis.Reduce(mean) |> + ClimaAnalysis.combine +end + +@testset "SplitBySeason and Reduce" begin + time = [0.0, 1.0, 2.0, 3.0] + lat = [-90.0, 0.0, 90.0] + lon = [-180.0, -120.0, -60.0, 0.0, 60.0, 120.0, 180.0] + single_season_var = + TemplateVar() |> + add_dim("time", time, units = "s") |> + add_dim("lat", lat, units = "degrees") |> + add_dim("lon", lon, units = "degrees") |> + add_attribs(start_date = Dates.DateTime(2010)) |> + initialize + + # OutputVar with a single season + reduced_var = + single_season_var |> + ClimaAnalysis.SplitSeason() |> + ClimaAnalysis.Reduce(mean) |> + ClimaAnalysis.combine + @test reduced_var.data == mean(single_season_var.data, dims = 1) + + # OutputVar with multiple seasons + # DJF - 12, 1, 2 + # MAM - 3, 4, 5 + # JJA - 6, 7, 8 + # SON - 9, 10, 11 + ref_date = Dates.DateTime(2010) + time = + ClimaAnalysis.Utils.date_to_time.( + Ref(ref_date), + [ + Dates.DateTime(2010, 1), + Dates.DateTime(2010, 2), + Dates.DateTime(2010, 4), + Dates.DateTime(2010, 6), + Dates.DateTime(2010, 7), + Dates.DateTime(2010, 8), + Dates.DateTime(2011, 10), + ], + ) + lat = [-90.0, 0.0, 90.0] + lon = [-180.0, -120.0, -60.0, 0.0, 60.0, 120.0, 180.0] + multiple_season_var = + TemplateVar() |> + add_dim("lat", lat, units = "degrees") |> + add_dim("time", time, units = "s") |> + add_dim("lon", lon, units = "degrees") |> + add_attribs(start_date = ref_date) |> + initialize + + reduced_var = + multiple_season_var |> + ClimaAnalysis.SplitSeason() |> + ClimaAnalysis.Reduce(mean) |> + ClimaAnalysis.combine + @test reduced_var.data == cat( + mean(multiple_season_var.data[:, 1:2, :], dims = 2), # DJF + mean(multiple_season_var.data[:, 3:3, :], dims = 2), # JJA + mean(multiple_season_var.data[:, 4:6, :], dims = 2), # MAM + mean(multiple_season_var.data[:, 7:7, :], dims = 2), # SON + dims = 2, + ) + @test length(reduced_var.dims["time"]) == 4 + @test reduced_var.dims["time"] == + multiple_season_var.dims["time"][[1, 3, 4, 7]] + @test reduced_var.attributes == multiple_season_var.attributes + @test reduced_var.dim_attributes == multiple_season_var.dim_attributes + + # OutputVar with dates out of order + diff_order_var = ClimaAnalysis.select( + multiple_season_var, + by = ClimaAnalysis.Index(), + time = [3, 4, 1, 6, 2, 5, 7], + ) + reduced_var2 = + diff_order_var |> + ClimaAnalysis.SplitSeason() |> + ClimaAnalysis.Reduce(mean) |> + ClimaAnalysis.combine + @test reduced_var.attributes == reduced_var2.attributes + @test reduced_var.dims == reduced_var2.dims + @test reduced_var.dim_attributes == reduced_var2.dim_attributes + @test reduced_var.data == reduced_var2.data + + # OutputVar with repeated dates + repeated_dates_var = + TemplateVar() |> + add_dim("time", [0.0, 0.0, 1.0], units = "s") |> + add_dim("lat", lat, units = "degrees") |> + add_dim("lon", lon, units = "degrees") |> + add_attribs(start_date = Dates.DateTime(2010)) |> + initialize + + @test_throws ErrorException repeated_dates_var |> + ClimaAnalysis.SplitSeason() |> + ClimaAnalysis.Reduce(mean) |> + ClimaAnalysis.combine + + # Time dimension does not exist + no_time_var = + TemplateVar() |> + add_dim("lat", lat, units = "degrees") |> + add_dim("lon", lon, units = "degrees") |> + initialize + + @test_throws ErrorException no_time_var |> + ClimaAnalysis.SplitSeason() |> + ClimaAnalysis.Reduce(mean) |> + ClimaAnalysis.combine +end + +@testset "Invalid split and apply operations chaining" begin + time = [0.0, 1.0] + var = + TemplateVar() |> + add_dim("time", time, units = "s") |> + add_attribs(start_date = Dates.DateTime(2010), short_name = "lwu") |> + initialize + + @test_throws r"A split operation is already set" var |> + ClimaAnalysis.GroupAll( + "time", + ) |> + ClimaAnalysis.GroupAll( + "time", + ) + + @test_throws r"An apply operation is already set" var |> + ClimaAnalysis.GroupAll( + "time", + ) |> + ClimaAnalysis.Reduce( + mean, + ) |> + ClimaAnalysis.Reduce(mean) +end