diff --git a/.Rbuildignore b/.Rbuildignore index 3f7bb60..5654ffd 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -2,3 +2,5 @@ ^\.Rproj\.user$ ^LICENSE\.md$ ^\.github$ +^\.positai$ +^\.claude$ diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml index 0f2fe08..b026f8a 100644 --- a/.github/workflows/R-CMD-check.yaml +++ b/.github/workflows/R-CMD-check.yaml @@ -1,10 +1,10 @@ # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help + on: push: branches: [main, master] pull_request: - branches: [main, master] name: R-CMD-check @@ -22,9 +22,9 @@ jobs: config: - {os: macos-latest, r: 'release'} - {os: windows-latest, r: 'release'} - - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'} - - {os: ubuntu-latest, r: 'release'} - - {os: ubuntu-latest, r: 'oldrel-1'} + - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'} + - {os: ubuntu-latest, r: 'release'} + - {os: ubuntu-latest, r: 'oldrel-1'} env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} diff --git a/.gitignore b/.gitignore index 72e042c..9a5266a 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,4 @@ .DS_Store .quarto inst/doc +.positai diff --git a/DESCRIPTION b/DESCRIPTION index 317cb16..dca1441 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,45 +1,48 @@ Package: stgam Title: Spatially and Temporally Varying Coefficient Models Using Generalized Additive Models -Version: 0.0.1.3 +Version: 1.2.1 Authors@R: c( person("Lex", "Comber", email = "a.comber@leeds.ac.uk", role = c("aut", "cre")), person("Paul", "Harris", email = "paul.harris@rothamsted.ac.uk", role = c("ctb")), + person("Gonzalo", "Irisarri", email = "jirisarr@uwyo.edu", role = c("ctb")), person("Chris", "Brunsdon", email = "christopher.brunsdon@mu.ie", role = c("ctb")) ) Author: Lex Comber [aut, cre], Paul Harris [ctb], + Gonzalo Irisarri [ctb], Chris Brunsdon [ctb] Maintainer: Lex Comber -Description: A framework for specifying spatially, temporally and spatially-and-temporally varying coefficient models using Generalized Additive Models with Gaussian Process smooths. The smooths are parameterised with location and / or time attributes. Importantly the framework supports the investigation of the presence and nature of any space-time dependencies in the data, allows the user to evaluate different model forms (specifications) and to pick the most probable model or to combine multiple varying coefficient models using Bayesian Model Averaging. For more details see: Brunsdon et al (2023) , Comber et al (2023) and Comber et al (2024) . +Description: A framework for undertaking space and time varying coefficient models (varying parameter models) using a Generalized Additive Model (GAM) with smooths approach. The framework suggests the need to investigate for the presence and nature of any space-time dependencies in the data. It proposes a workflow that creates and refines an initial space-time GAM and includes tools to create and evaluate multiple model forms. The workflow sequence is to: i) Prepare the data by lengthening it to have a single location and time variables for each observation. ii) Create all possible space and/or time models in which each predictor is specified in different ways in smooths. iii) Evaluate each model via their AIC value and pick the best one. iv) Create the final model. v) Calculate the varying coefficient estimates to quantify how the relationships between the target and predictor variables vary over space, time or space-time. vi) Create maps, time series plots etc. The number of knots used in each smooth can be specified directly or iteratively increased. This is illustrated with a climate point dataset of the dry rain forest in South America. This builds on work in Comber et al (2024) and Comber et al (2004) . License: MIT + file LICENSE Encoding: UTF-8 Roxygen: list(markdown = TRUE) -RoxygenNote: 7.2.3 +RoxygenNote: 7.3.2 Suggests: cols4all, - knitr, - purrr, + knitr, + ggplot2, + cowplot, rmarkdown, - sf, testthat (>= 3.0.0), - tidyr + tidyr, + lubridate, + sf, + gratia, + kableExtra Config/testthat/edition: 3 URL: https://github.com/lexcomber/stgam BugReports: https://github.com/lexcomber/stgam/issues Depends: - R (>= 2.10), + R (>= 4.1.0), + mgcv (>= 1.9-1), + glue LazyData: true Imports: - cowplot, + foreach, doParallel, + parallel, dplyr, - foreach, - ggplot2, - glue, - grDevices, magrittr, - metR, - mgcv (>= 1.9-1), - parallel, - tidyselect + purrr, + stringr VignetteBuilder: knitr diff --git a/NAMESPACE b/NAMESPACE index 80e1880..715726e 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,45 +1,38 @@ # Generated by roxygen2: do not edit by hand export(calculate_vcs) -export(do_bma) +export(effect_size) export(evaluate_models) -export(gam_model_probs) -export(plot_1d_smooth) -export(plot_2d_smooth) -importFrom(cowplot,plot_grid) +export(gam_model_rank) importFrom(doParallel,registerDoParallel) importFrom(dplyr,across) importFrom(dplyr,arrange) importFrom(dplyr,mutate) importFrom(dplyr,relocate) importFrom(dplyr,rename) -importFrom(dplyr,select) importFrom(dplyr,slice_head) importFrom(dplyr,tibble) importFrom(foreach,"%dopar%") importFrom(foreach,foreach) -importFrom(ggplot2,aes) -importFrom(ggplot2,coord_sf) -importFrom(ggplot2,geom_contour) -importFrom(ggplot2,geom_contour_filled) -importFrom(ggplot2,geom_line) -importFrom(ggplot2,geom_ribbon) -importFrom(ggplot2,geom_sf) -importFrom(ggplot2,ggplot) -importFrom(ggplot2,theme_bw) -importFrom(ggplot2,xlab) -importFrom(ggplot2,ylab) -importFrom(ggplot2,ylim) importFrom(glue,glue) -importFrom(grDevices,dev.off) -importFrom(grDevices,pdf) importFrom(magrittr,"%>%") +importFrom(mgcv,bam) importFrom(mgcv,gam) -importFrom(parallel,detectCores) +importFrom(mgcv,k.check) +importFrom(mgcv,predict.gam) +importFrom(mgcv,s) +importFrom(mgcv,te) importFrom(parallel,makeCluster) importFrom(parallel,stopCluster) -importFrom(stats,BIC) +importFrom(purrr,map2_chr) importFrom(stats,as.formula) +importFrom(stats,family) importFrom(stats,formula) importFrom(stats,predict) -importFrom(tidyselect,all_of) +importFrom(stats,reformulate) +importFrom(stats,sd) +importFrom(stringr,str_detect) +importFrom(stringr,str_replace) +importFrom(stringr,str_split) +importFrom(utils,head) +importFrom(utils,installed.packages) diff --git a/NEWS.md b/NEWS.md index e820610..a0bd91a 100644 --- a/NEWS.md +++ b/NEWS.md @@ -20,3 +20,37 @@ * expanded the output of `do_bma` to include averaged $\hat{y}$ and working residuals * returns weighted vary coefficient estimates appended to input data + +# stgam 1.0.0 + +* space-time GAMs are reformatted to include Tensor Product smooths for combined space-time +* modelling averaging is removed + +# stgam 1.0.1 + +* London borough data (`lb`) corrected +* typos in vignette corrected + +# stgam 1.0.2 + +* fix to vignette plot + +# stgam 1.1.0 + +* updates to functions and vignette to use `te()` tensor product smooths for space-time smooths, replacing `t2()` +* inclusion of t-values in coefficient estimates + +# stgam 1.2.0 + +* updates to main function (`evaluate_models()`) for user specification of `k` or to increase `k` automatically +* new function for quantifying the effect size of each model term (`effect_size()`) +* updates to `gam_model_rank` to evaluate models by AIC and to report `k` for each smooth +* new vignettes +* new case study dataset + +# stgam 1.2.1 + +* update to main function (`evaluate_models()`) with fast GAM option using `mgcv::bam` +* update to `gam_model_rank()` function to include `k` in the results + + diff --git a/R/.Rapp.history b/R/.Rapp.history new file mode 100644 index 0000000..e69de29 diff --git a/R/calculate_vc.R b/R/calculate_vc.R deleted file mode 100644 index 9f9e5b9..0000000 --- a/R/calculate_vc.R +++ /dev/null @@ -1,51 +0,0 @@ -#' Extracts varying coefficient estimates (for SVC, TVC and STVC models). -#' -#' @param input_data the data used to create the GAM model in `data.frame`, `tibble` or `sf` format -#' @param model a GAM model with smooths created using the `mgcv` package -#' @param terms a vector of names starting with "Intercept" plus the names of the covariates used in the GAM model (these are the names of the variables in `data` ) -#' -#' @return A `data.frame` of the input data and the coefficient and standard error estimates for each covariate. -#' @importFrom magrittr %>% -#' @importFrom dplyr mutate -#' @importFrom stats predict -#' -#' @examples -#' library(dplyr) -#' library(mgcv) -#' # SVC -#' data(productivity) -#' input_data = productivity |> dplyr::filter(year == "1970") |> mutate(Intercept = 1) -#' gam.svc.mod = gam(privC ~ 0 + Intercept + -#' s(X, Y, bs = 'gp', by = Intercept) + -#' unemp + s(X, Y, bs = "gp", by = unemp) + -#' pubC + s(X, Y, bs = "gp", by = pubC), -#' data = input_data) -#' terms = c("Intercept", "unemp", "pubC") -#' svcs = calculate_vcs(input_data, gam.svc.mod, terms) -#' head(svcs) -#' @export -calculate_vcs = function(input_data, model, terms) { - . = NULL - n_t = length(terms) - input_data_copy = input_data - output_data = input_data - for (i in 1:n_t) { - zeros = rep(0, n_t) - zeros[i] = 1 - terms_df = data.frame(matrix(rep(zeros, nrow(input_data)), ncol = n_t, byrow = T)) - names(terms_df) = terms - input_data_copy[, terms] = terms_df - se.j = predict(model, se = TRUE, newdata = input_data_copy)$se.fit - b.j=predict(model,newdata=input_data_copy) - expr1 = paste0("b_", terms[i], "= b.j") - expr2 = paste0("se_",terms[i], "= se.j") - output_data = output_data %>% - mutate(within(., !!parse(text = expr1))) %>% - mutate(within(., !!parse(text = expr2))) - } - output_data$yhat = predict(model, newdata = input_data) - output_data -} - - - diff --git a/R/calculate_vcs.R b/R/calculate_vcs.R new file mode 100644 index 0000000..bcf97d6 --- /dev/null +++ b/R/calculate_vcs.R @@ -0,0 +1,91 @@ +#' Extracts varying coefficient estimates (for SVC, TVC and STVC models). +#' +#' @param input_data the data used to create the GAM model in `data.frame`, `tibble` or `sf` format. This can be the original data used to create the model or another surface with location and time attributes. +#' @param mgcv_model a GAM model with smooths created using the `mgcv` package +#' @param terms a vector of names starting with "Intercept" plus the names of the covariates used in the GAM model (these are the names of the variables in the `input_data` used to construct the model). +#' +#' @return A `data.frame` of the input data, the coefficient estimates, the standard errors and the t-values estimates for each covariate. It can be used to generate coefficient estimates for specific time slices and over gridded surfaces as described in the package vignette. +#' @importFrom dplyr mutate +#' @importFrom stats predict +#' +#' @examples +#' require(dplyr) +#' require(doParallel) +#' # define input data +#' data("chaco") +#' input_data <- +#' chaco |> +#' # create Intercept as an addressable term +#' mutate(Intercept = 1) +#' # create a model for example as result of running `evaluate_models` +#' gam.m = gam(ndvi ~ 0 + s(X, Y, by = Intercept) + +#' s(X, Y, by = tmax) + s(X, Y, by = pr), data = input_data) +#' # calculate the Varying Coefficients +#' terms = c("Intercept", "tmax", "pr") +#' vcs = calculate_vcs(input_data, gam.m, terms) +#' vcs |> select(ndvi, X, Y, starts_with(c("b_", "se_", "t_")), yhat) +#' +#' @export +calculate_vcs <- function(input_data, mgcv_model, terms = NULL) { + # --- Input validation --- + if (!inherits(mgcv_model, "gam")) { + stop("Error: 'mgcv_model' must be a GAM object from mgcv::gam().") + } + if (!is.data.frame(input_data)) { + stop("Error: 'input_data' must be a data.frame.") + } + if (is.null(terms)) { + # Default to all parametric terms + terms <- attr(mgcv_model$terms, "term.labels") + } + if (!all(terms %in% names(input_data))) { + missing_terms <- setdiff(terms, names(input_data)) + stop(paste("Error: The following terms are missing from input_data:", + paste(missing_terms, collapse = ", "))) + } + if (is.null(terms)) { + terms <- names(input_data) + } + n_t <- length(terms) + + # preallocate output list + output_data <- input_data + + # create a template for modifying predictor columns + input_data_copy <- input_data + n_rows <- nrow(input_data) + + # build all term-modified datasets at once + term_mats <- diag(n_t) + colnames(term_mats) <- terms + + # loop over terms efficiently + for (i in seq_len(n_t)) { + # assign 0/1 indicators for this term + for (t in seq_along(terms)) { + input_data_copy[[terms[t]]] <- term_mats[i, t] + } + + # single predict() call returning both fit and SE + pred <- predict(mgcv_model, newdata = input_data_copy, se.fit = TRUE) + b.j <- pred$fit + se.j <- pred$se.fit + t.j <- b.j / se.j + + # append results + term_i <- terms[i] + output_data[[paste0("b_", term_i)]] <- b.j + output_data[[paste0("se_", term_i)]] <- se.j + output_data[[paste0("t_", term_i)]] <- t.j + } + + # optional predicted response (only if all terms present) + if (all(terms %in% names(input_data))) { + output_data$yhat <- predict(mgcv_model, newdata = input_data) + } + + return(output_data) +} + + + diff --git a/R/data.R b/R/data.R index e9c9ea6..5d80226 100644 --- a/R/data.R +++ b/R/data.R @@ -1,43 +1,25 @@ -#' US States Economic Productivity Data (1970-1985) +#' Chaco dry rainforest data (2012-2022) #' -#' A dataset of annual economic productivity data for the 48 contiguous US states (with Washington DC merged into Maryland), from 1970 to 1985 (17 years) in long format. The data productivity data table was extracted from the `plm` package. +#' A point dataset of NDVI and climate data. The data are sample of 2000 observations of Normalised Difference Vegetation Index (NDVI) (2012-2022) of the Chaco dry rainforest in South America with some climate data. These are found via Google Earth Engine (Gorelick et al., 2017). The NDVI data is sourced from the PKU GIMMS NDVI v1.2 dataset, which provides NDVI observations at 1/12° spatial resolution at bi-monthly intervals from 1982 to 2022 (Li et al., 2023). The climate data was derived from the TerraClimate dataset (IDAHO_EPSCOR/TERRACLIMATE). Maximum temperature (`tmax`) and Precipitation (`pr`) were selected and means calculated for each monthly image across all pixels. #' -#' @format A tibble with 816 rows and 14 columns. +#' @format A `sf` POINT dataset with 2000 observations and 12 fields. #' \describe{ -#' \item{state}{The name of the state} -#' \item{GEOID}{The state code} -#' \item{region}{The region} -#' \item{pubC}{Public capital which is composed of highways and streets (hwy) water and sewer facilities (water) and other public buildings and structures (util)} -#' \item{hwy}{Highway and streets assets} -#' \item{water}{Water utility assets} -#' \item{util}{Other public buildings and structures} -#' \item{privC}{Private captial stock} -#' \item{gsp}{Gross state product} -#' \item{emp}{Labour input measured by the employment in non-agricultural payrolls} -#' \item{unemp}{State unemployment rate capture elements of the business cycle} -#' \item{X}{Easting in metres from USA Contiguous Equidistant Conic projection (ESRI:102005)} -#' \item{Y}{Northing in metres from USA Contiguous Equidistant Conic projection (ESRI:102005)} +#' \item{id}{An observation identifier} +#' \item{ndvi}{Normalised Difference Vegetation Index (NDVI)} +#' \item{tmax}{Maximum temperature (°C)} +#' \item{pr}{Preciptation} +#' \item{month}{A continous integer variable from 1 to 120} +#' \item{year}{The year of observation} +#' \item{lon}{Longitude in degrees (WGS84)} +#' \item{lat}{Latitude in degrees (WGS84)} +#' \item{X}{Easting in metres from the SIRGAS 2000 / Brazil Mercator projection (EPSG:5641)} +#' \item{Y}{Northing in metres from the SIRGAS 2000 / Brazil Mercator projection (EPSG:5641)} +#' \item{geometry}{The spatial geometry of the observation in the SIRGAS 2000 / Brazil Mercator projection (EPSG:5641)} #' } -#' @source Croissant, Yves, Giovanni Millo, and Kevin Tappe. 2022. Plm: Linear Models for Panel Data +#' @source Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/J.RSE.2017.06.031 +#' @source Li, M., Cao, S., Zhu, Z., Wang, Z., Myneni, R. B., & Piao, S. (2023). Spatiotemporally consistent global dataset of the GIMMS Normalized Difference Vegetation Index (PKU GIMMS NDVI) from 1982 to 2022. Earth System Science Data, 15(9), 4181–4203. https://doi.org/10.5194/ESSD-15-4181-2023 #' #' @examples -#' data(productivity) -"productivity" - -#' US States boundaries -#' -#' A dataset of of the boundaries of 48 contiguous US states (with Washington DC merged into Maryland), extracted from the `spData` package. -#' -#' @format A `sf` polygon dataset with 48 rows and 6 fields. -#' \describe{ -#' \item{GEOID}{The state code} -#' \item{NAME}{The name of the state} -#' \item{REGION}{The region} -#' \item{total_pop_10}{Population in 2010} -#' \item{total_pop_15}{Population in 2015} -#' } -#' @source Bivand, Roger, Jakub Nowosad, and Robin Lovelace. 2019. spData: Datasets for Spatial Analysis. R package -#' -#' @examples -#' data(us_data) -"us_data" +#' library(sf) +#' data("chaco") +"chaco" diff --git a/R/do_bma.R b/R/do_bma.R deleted file mode 100644 index 7347c5a..0000000 --- a/R/do_bma.R +++ /dev/null @@ -1,117 +0,0 @@ -#' Undertake undertake coefficient averaging using Bayesian Model Avergaing (BMA), weighting different models by their probabilities -#' -#' @param model_table a table of competing models generated by the `gam_model_probs()` function -#' @param terms a vector of names starting with "Intercept" plus the names of the covariates used in the GAM model (these are the names of the variables in `data` ) -#' @param thresh a probability threshold value above which to combine competing models -#' @param relative logical to indicate whether the probabilities in `data` are relative (`Pr(M)`) or absolute (`Pr(M|D)`) -#' @param input_data the input data with a named Intercept term, in `data.frame`, `tibble` or `sf` format -#' -#' @return A `tibble` of the input data plus the probability weighted averaged coefficient estimates from multiple models, all starting with `b_` -#' @importFrom mgcv gam -#' @importFrom stats as.formula -#' @importFrom dplyr select -#' @importFrom tidyselect all_of -#' -#' @examples -#' library(cols4all) -#' library(dplyr) -#' library(sf) -#' library(glue) -#' library(purrr) -#' library(mgcv) -#' library(sf) -#' library(ggplot2) -#' # data -#' data(productivity) -#' input_data = productivity |> filter(year == "1970") |> mutate(Intercept = 1) -#' # create and evaluate multiple models -#' svc_res_gam = evaluate_models(input_data, STVC = FALSE) -#' # determine their probabilities -#' mod_comp_svc <- gam_model_probs(svc_res_gam) -#' # combine the model coefficients -#' svc_bma <- do_bma(model_table = mod_comp_svc, -#' terms = c("Intercept", "unemp", "pubC"), -#' thresh = 0.1, -#' relative = FALSE, -#' input_data = input_data) -#' head(svc_bma) -#' # join back to spatial layer -#' data(us_data) -#' svc_bma_sf <- -#' us_data |> -#' select(GEOID) |> -#' left_join(svc_bma) -#' # and map -#' tit =expression(paste(""*beta[`Public Capital`]*" ")) -#' ggplot(data = svc_bma_sf, aes(fill=b_pubC)) + -#' geom_sf() + -#' scale_fill_continuous_c4a_div(palette="brewer.yl_or_rd",name=tit) + -#' coord_sf() + -#' theme_linedraw() -#' @export -do_bma = function(model_table, - terms, - thresh = 0.1, - relative = FALSE, - input_data){ - # 6.1 re-create the top models and extract coefficients - extract_vcs = function(formulae, terms, input_data) { - out_list = list() - for(i in 1:length(formulae)){ - # formula and model - f = as.formula(formulae[i]) - mod.i = gam(f, data = input_data) - # VC predictions - betas.s = calculate_vcs(input_data, mod.i, terms) - out_list[[i]] <- betas.s - } - out_list - } - # 6.2 weight the coefficients - weighted_vcs <- function(coef_list, probs, terms){ - coef_out = NULL - for(i in paste0("b_", terms)) { - coef_tab <- NULL - # extract the coefficients - for (j in 1:length(probs)) { - coef_tab = cbind(coef_tab, - coef_list[[j]] |> select(all_of(i)) |> unlist()) - } - # weighted average - BMA - coef.i = coef_tab %*% probs - # add to the out table - coef_out = cbind(coef_out, coef.i |> unlist()) - } - colnames(coef_out) <- terms - rownames(coef_out) <- 1:nrow(coef_out) - coef_out - } - # extract the probabilities - if(!relative) { - n = sum(model_table$`Pr(M|D)` > thresh) - probs = model_table$`Pr(M|D)`[1:n] - } else { - n = sum(model_table$`Pr(M)` > thresh) + 1 - probs = as.numeric(model_table$`Pr(M)`[2:n]) - # add the first one back in - probs = c(1-probs[1], probs) - # re-scale - probs = probs / sum(probs) - } - # get the competing model VCs - f = model_table$f[1:n] - svc_list = extract_vcs(formulae = f, terms, input_data) - # combine - svc <- weighted_vcs(svc_list, probs, terms) - # link back to input data with yhat and working residuals - colnames(svc) = paste0("b_", colnames(svc)) - x = (input_data[, terms[-1]]) - y = (svc[,-1]) - yhat = (as.vector(rowSums(x*y) + svc[,1])) - f = unlist(f[1]) - target = all.vars(as.formula(f))[1] - resids = as.vector(unlist(input_data[, target])) - yhat - svc <-input_data |> cbind(svc) - svc <- svc |> tibble() |> mutate(yhat = yhat, residuals = resids) - return(svc) -} diff --git a/R/effect_size.R b/R/effect_size.R new file mode 100644 index 0000000..463ba09 --- /dev/null +++ b/R/effect_size.R @@ -0,0 +1,37 @@ +#' Quantifies the relative effect sizes of each component of `mgcv` GAM +#' +#' @param mgcv_model a GAM model created by the `mgcv` package +#' @param digits the number of significant figures of uysed to report effect size +#' +#' @returns a matrix of the model terms, the size of the effect (range) ad standard deviation (sd) +#' @importFrom mgcv predict.gam +#' +#' @examples +#' require(dplyr) +#' require(stringr) +#' require(purrr) +#' require(doParallel) +#' +#' # define input data +#' data("chaco") +#' m <- gam( +#' ndvi ~ +#' te(X,Y, by = tmax) + +#' s(X,Y, by = pr), +#' data = chaco, +#' method = "REML", +#' family = gaussian() +#' ) +#' # examine the effect size +#' effect_size(m, 3) +#' @export +effect_size <- function(mgcv_model, digits = 3) { + if (!inherits(mgcv_model, "gam")) { + stop("Error: 'mgcv_model' must be a GAM object from mgcv::gam().") + } + # extract the terms + f <- predict(mgcv_model, type = "terms") + # generate numeric summaries + out <- apply(f, 2, function(v) c(sd = sd(v), range = diff(range(v)))) |> round(digits) + return(out) +} diff --git a/R/evaluate_models.R b/R/evaluate_models.R index 0a0f88c..a65d6cb 100644 --- a/R/evaluate_models.R +++ b/R/evaluate_models.R @@ -1,158 +1,458 @@ -#' Creates and evaluates multiple varying coefficient GAM GP smooth models (SVC or STVC) +#' Evaluates multiple models with each predictor variable specified in different ways in order to determining model form #' -#' @param input_data a `data.frame`, `tibble` `sf` containing the target variables, covariates and coordinate variables -#' @param target_var the name of the target variable in `data` -#' @param covariates the name of the covariates (predictor variables) in `data` -#' @param coords_x the name of the X, Easting or Longitude variable in `data` -#' @param coords_y the name of the Y, Northing or Latitude variable in `data` -#' @param STVC a logical operator to indicate whether the models Space-Time (`TRUE`) or just Space (`FALSE`) -#' @param time_var the name of the time variable if undertaking STVC model evaluations -#' @param ncores the number of cores to use in parallelised approaches (default is 2 to overcome CRAN package checks). This can be determined for your computer by running `parallel::detectCores()-1`. Parallel approaches are only undertaken if the number of models to evaluate is greater than 30. +#' @param input_data he data to be used used to create the GAM model in (`data.frame` or `tibble` format), containing an Intercept column to allow it be treated as an addressable term in the model. +#' @param target_var the name of the target variable. +#' @param vars a vector of the predictor variable names (without the Intercept). +#' @param model_family the model family, defaults to Guassian +#' @param bam a logical parameter indicating whether to use the `bam` function to construct GAMs. The default is `FALSE` and the `gam` function. The `mgcv::bam` should be used for large datasets or large numbers of predictor variables. +#' @param coords_x the name of the X, Easting or Longitude variable in `input_data`. +#' @param coords_y the name of the Y, Northing or Latitude variable in `input_data`. +#' @param VC_type the type of varying coefficient model: options are "TVC" for temporally varying, "SVC" for spatially varying and "STVC" for space-time. +#' @param time_var the name of the time variable if undertaking STVC model evaluations. +#' @param k_set a logical value for user defined `k` values. The default is `FALSE`. Cannot be used with `k_increase`. +#' @param spatial_k the value of `k` for spatial smooths if `k_set` is `TRUE`. +#' @param temporal_k the value of `k` for temporal smooths if `k_set` is `TRUE`. +#' @param k_increase a logical value of whether to check and increase the number of knots in each smooth. The default is `FALSE`. Cannot be used with `k_set`. +#' @param k2edf_ratio a threshold of the ratio of the number of knots, `k`, in each smooth to its Effective Degrees of Freedom. If any smooth has a *knots-to-EDF* ratio less than this value then the knots are iteratively increased by the `k_multiplier` value until the threshold check is passed, the number knots passes the maximum degrees of freedom, or the number of iterations, `max_iter` is reached. Cannot be used with `k_set`. +#' @param k_multiplier a multiplier by which the knots are increased on each iteration. The default is 2. +#' @param max_iter the maximum number of iterations that `k` is increased. +#' @param ncores the number of cores to use in parallelised approaches (default is 2 to overcome CRAN package checks). This can be determined for your computer by running parallel::detectCores()-1. Parallel approaches are only undertaken if the number of models to evaluate is greater than 30. #' -#' @return A data table in `data.frame` format of all possible model combinations with each covariate specified in all possible ways, with the BIC of the model and the model formula. +#' @returns a `data.frame` with indices for each predictor variable, the knots specified in each smooth (`ks`), a AIC score (`aic`) for each model and the associated formula (`f`). The output should be passed to the `gam_model_rank` function. #' @importFrom glue glue -#' @importFrom stats formula -#' @importFrom stats BIC +#' @importFrom dplyr mutate +#' @importFrom mgcv gam +#' @importFrom mgcv bam +#' @importFrom mgcv te +#' @importFrom mgcv s +#' @importFrom mgcv k.check #' @importFrom parallel makeCluster -#' @importFrom parallel detectCores #' @importFrom doParallel registerDoParallel -#' @importFrom foreach %dopar% #' @importFrom foreach foreach +#' @importFrom foreach "%dopar%" #' @importFrom parallel stopCluster +#' @importFrom stats formula +#' @importFrom stats as.formula +#' @importFrom stats family +#' @importFrom utils installed.packages +#' @importFrom stats sd +#' @importFrom stats reformulate #' #' @examples -#' library(dplyr) -#' library(glue) -#' library(purrr) -#' library(doParallel) -#' library(mgcv) -#' data("productivity") -#' input_data = productivity |> filter(year == "1970") -#' svc_res_gam = -#' evaluate_models(input_data = input_data, -#' target_var = "privC", -#' covariates = c("unemp", "pubC"), -#' coords_x = "X", -#' coords_y = "Y", -#' STVC = FALSE) -#' head(svc_res_gam) +#' \dontrun{ +#' require(dplyr) +#' require(doParallel) +#' require(sf) +#' +#' # define input data +#' data("chaco") +#' input_data <- +#' chaco |> +#' # create Intercept as an addressable term +#' mutate(Intercept = 1) |> +#' # remove the geometry +#' st_drop_geometry() +#' +#' # Evaluate different model forms +#' # example 1 with 6 models and no `k` adjustment +#' svc_mods <- +#' evaluate_models( +#' input_data = input_data, +#' target_var = "ndvi", +#' model_family = "gaussian()", +#' vars = c("tmax"), +#' coords_x = "X", +#' coords_y = "Y", +#' VC_type = "SVC" +#' ) +#' # have a look! +#' svc_mods +#' +#' # example 2 with 6 models and `k` adjustment +#' svc_k2_mods <- +#' evaluate_models( +#' input_data = input_data, +#' target_var = "ndvi", +#' vars = c("tmax"), +#' model_family = "gaussian()", +#' coords_x = "X", +#' coords_y = "Y", +#' VC_type = "SVC", +#' k_increase = TRUE, +#' k2edf_ratio = 1.5, +#' k_multiplier = 2, +#' max_iter = 10 +#' ) +#' # have a look! +#' svc_k2_mods +#' +#' # example 3 with 6 models and `k` set by user +#' svc_k3_mods <- +#' evaluate_models( +#' input_data = input_data, +#' model_family = "gaussian()", +#' target_var = "ndvi", +#' vars = c("tmax"), +#' coords_x = "X", +#' coords_y = "Y", +#' VC_type = "SVC", +#' time_var = NULL, +#' k_set = TRUE, +#' spatial_k = 20, +#' ) +#' # have a look! +#' svc_k3_mods +#' +#' # example comparing `gam` (slower) and `bam` (faster) +#' t1 <- Sys.time() +#' svc_GAM_mods <- +#' evaluate_models( +#' input_data = input_data, +#' target_var = "ndvi", +#' vars = c("tmax", "pr"), +#' model_family = "gaussian()", +#' bam = FALSE, +#' coords_x = "X", +#' coords_y = "Y", +#' time_var = "month", +#' VC_type = "STVC", +#' ncores = detectCores()-1 +#' ) +#' t_gam <- Sys.time() - t1 +#' +#' t1 <- Sys.time() +#' svc_BAM_mods <- +#' evaluate_models( +#' input_data = input_data, +#' target_var = "ndvi", +#' vars = c("tmax", "pr"), +#' model_family = "gaussian()", +#' bam = TRUE, +#' coords_x = "X", +#' coords_y = "Y", +#' time_var = "month", +#' VC_type = "STVC", +#' ncores = detectCores()-1 +#' ) +#' t_bam <- Sys.time() - t1 +#' +#' t_gam - t_bam +#' +#' } +#' #' @export -evaluate_models = function(input_data, - target_var = "privC", - covariates = c("unemp", "pubC"), - coords_x = "X", - coords_y = "Y", - STVC = FALSE, - time_var = NULL, - ncores = 2) { - # Helper functions - # 1 intercept formula - get_form_intercept = function(index, bs = "gp") { - c("", - glue("+s({coords_x},{coords_y},bs='{bs}',by=Intercept)"), - glue("+s({time_var},bs='{bs}',by=Intercept)"), - glue("+s({coords_x},{coords_y},bs='{bs}',by= Intercept) + s({time_var},bs='{bs}',by=Intercept)"), - glue("+s({coords_x},{coords_y},{time_var},bs='{bs}',by=Intercept)"))[index] - } - # get_form_intercept(3) - - # 2 covariate formula - get_form_covariate = function(varname, index, bs = "gp") { - c("", - glue("+ {varname}"), - glue("+s({coords_x},{coords_y},bs='{bs}',by={varname})"), - glue("+s({time_var},bs='{bs}',by={varname})"), - glue("+s({coords_x},{coords_y},bs='{bs}',by={varname}) + s({time_var},bs='{bs}',by={varname})"), - glue("+s({coords_x},{coords_y},{time_var},bs='{bs}',by={varname})"))[index] - } - # get_form_covariate("unemp", 6) - - # 3 make combinations grid for SVC and STVC - make_svc_index_grid = function(covariates) { - expression_x = "expand.grid(Intercept = 1:2 " - for (i in covariates) { - expression_x = - paste0(expression_x, ",", i, "=1:3") + +evaluate_models <- function(input_data, + target_var, + vars, + model_family = "gaussian()", + bam = FALSE, + coords_x = "X", + coords_y = "Y", + VC_type = "SVC", + time_var = NULL, + k_set = FALSE, + spatial_k = 50, + temporal_k = 10, + k_increase = FALSE, + k2edf_ratio = 1.5, + k_multiplier = 2, + max_iter = 10, + ncores = 2) +{ + ## --------------------------------------------------------- + ## 1. BASIC INPUT CHECKS + ## --------------------------------------------------------- + + if (!is.data.frame(input_data)) { + stop("input_data must be a data.frame or tibble") + } + + if(k_set & k_increase){ + stop("You cannot set both k_set and k_increase to be TRUE") + } + + if( (VC_type == "TVC" | VC_type == "STVC") & is.null(time_var)){ + stop("You need to speciy a time_var for SVC or STVC") + } + + required_packages <- c("mgcv", "glue", "dplyr") + missing_pkgs <- required_packages[!required_packages %in% installed.packages()[,"Package"]] + if (length(missing_pkgs) > 0) { + stop("Missing required packages: ", paste(missing_pkgs, collapse = ", ")) + } + + if (!target_var %in% names(input_data)) { + stop("target_var is not a column in input_data.") + } + if (!is.numeric(input_data[[target_var]])) { + stop("target_var must be numeric (GAM requires numeric response).") + } + + # Check covariates + missing_vars <- vars[!vars %in% names(input_data)] + if (length(missing_vars) > 0) { + stop("These vars are missing from input_data: ", paste(missing_vars, collapse = ", ")) + } + non_numeric_vars <- vars[!sapply(input_data[vars], is.numeric)] + if (length(non_numeric_vars) > 0) { + stop("These vars must be numeric: ", paste(non_numeric_vars, collapse = ", ")) + } + + # # Coordinates + # if (!coords_x %in% names(input_data) || !coords_y %in% names(input_data)) { + # stop("coords_x and coords_y must be columns in input_data.") + # } + + # VC_type + if (!VC_type %in% c("SVC", "TVC", "STVC")) { + stop("VC_type must be one of: 'SVC', 'TVC', 'STVC'.") + } + + # time_var requirement + if (VC_type %in% c("TVC", "STVC")) { + if (is.null(time_var)) + stop("time_var must be provided for TVC or STVC models.") + if (!time_var %in% names(input_data)) + stop("time_var is not a column in input_data.") + } + + # Number of cores + if (ncores < 1) + stop("ncores must be >= 1.") + + ## --------------------------------------------------------- + ## 2. INTERNAL FUNCTIONS to increase k + ## --------------------------------------------------------- + + ## 2.1 Increase k without regex + increase_k_from_kcheck <- function(m, k2edf_ratio = 1.5, k_multiplier = 1.5) { + # get smooths and ratios + kc <- mgcv::k.check(m) + ratio <- kc[, "k'"] / kc[, "edf"] + smooths_to_fix <- rownames(kc)[ratio < k2edf_ratio] + sm <- m$smooth + # set up loop through smooths + new_terms <- vector("list", length(sm)) + for (i in seq_along(sm)) { + s <- sm[[i]] + # determine current k + if (!is.null(s$margin)) { + k_old <- vapply(s$margin, function(x) x$bs.dim, numeric(1)) + } else { + k_old <- s$bs.dim + } + # get new k for qualifying smooths + if (s$label %in% smooths_to_fix) { + k_new <- ceiling(k_old * k_multiplier) + } else { + k_new <- k_old + } + terms <- s$term + by_var <- s$by + # build smooth call + if (!is.null(s$margin)) { + # tensor product smooth + call <- call( + "te", + as.name(terms[1]), + as.name(terms[2]), + as.name(terms[3]), + d = c(2, 1), + # bs = s$bs, + k = k_new, + by = as.name(by_var) + ) + } else { + # standard smooth + call <- call( + "s", + as.name(terms[1]), + if (length(terms) > 1) as.name(terms[2]), + k = k_new, + by = as.name(by_var) + ) + } + new_terms[[i]] <- call } - expression_x = paste0(expression_x,")") - eval(parse(text = expression_x)) - } - # make_svc_index_grid(letters[1:3]) - make_stvc_index_grid = function(covariates) { - expression_x = "expand.grid(Intercept = 1:5 " - for (i in covariates) { - expression_x = - paste0(expression_x, ",", i, "=1:6") + # extract parametric part + tt <- terms(m) + param_terms <- attr(tt, "term.labels") + param_terms <- param_terms[!grepl("^s\\(|^te\\(", param_terms)] + param_terms <- param_terms[!(param_terms %in% c(coords_x, coords_y, time_var))] + # param_terms = c("Intercept", vars) + # rebuild formula + rhs <- c(param_terms, sapply(new_terms, deparse)) + # final tidy to get rid of any NULLs + rhs <- gsub("NULL,", "", rhs) + new_formula <- reformulate(rhs, response = as.character(formula(m)[[2]])) + new_formula + } + + ## 2.2 Iteratively increase k + iterate_increase_k <- function(m, k2edf_ratio, k_multiplier, max_iter, model_family) { + old_m <- m + iteration <- 1 + repeat { + # 1. Run k.check() + tab <- k.check(m) + r <- tab[, 1] / tab[, 2] + k_increase <- which(r < k2edf_ratio) + # 2. Stop if nothing to change + if (length(k_increase) == 0) { + break + } + # 3. Safety stop + if (iteration >= max_iter) { + break + } + # 4. Build updated formula + new_formula_str <- increase_k_from_kcheck(m, k2edf_ratio = k2edf_ratio, k_multiplier = k_multiplier) + new_formula <- as.formula(new_formula_str) + # 5. Refit model + old_m <- m + if(!bam) m <- try(gam(new_formula, data = m$model, method = m$method, family = model_family), silent = TRUE) + if(bam) m <- try(bam(new_formula, data = m$model, method = m$method, family = model_family, discrete = TRUE), + silent = TRUE) + # check for running out of degrees of freedom + if (inherits(m, "try-error")) { + return (old_m) + } + # cat(iteration, "\t") } - expression_x = paste0(expression_x,")") - eval(parse(text = expression_x)) - } - # make_stvc_index_grid(letters[1:3]) - - # 4 create the formula - get_formula = function(indices){ - form.i = glue("{target_var}~Intercept-1") - form.i = paste0(form.i, - get_form_intercept(indices[1]) - ) - for (j in 1:length(covariates)){ - varname = covariates[j] - form.i = paste0(form.i, - get_form_covariate(varname, indices[j+1]) - ) + return(m) + } + ## --------------------------------------------------------- + ## 3. INTERNAL FUNCTIONS to construct models + ## --------------------------------------------------------- + + get_form_intercept <- function(index, k_set = FALSE) { + if(!k_set) { + c("", + glue("+s({coords_x},{coords_y},by=Intercept)"), + glue("+s({time_var},by=Intercept)"), + glue("+s({coords_x},{coords_y},by=Intercept) + s({time_var},by=Intercept)"), + glue("+te({coords_x},{coords_y},{time_var},d=c(2,1),bs=c('tp','cr'),by=Intercept)") + )[index] + } else { + c("", + glue("+s({coords_x},{coords_y},k={spatial_k},by=Intercept)"), + glue("+s({time_var},k={temporal_k},by=Intercept)"), + glue("+s({coords_x},{coords_y},k={spatial_k},by=Intercept) + s({time_var},k={temporal_k},by=Intercept)"), + glue("+te({coords_x},{coords_y},{time_var},d=c(2,1),bs=c('tp','cr'),k=c({spatial_k},{temporal_k}),by=Intercept)") + )[index] } - return(formula(form.i)) - } - # svc_grid = make_svc_index_grid(covariates) - # indices = unlist(svc_grid[11,]) - # get_formula(indices) - - # 5 create and evaluate a GAM - evaluate_gam = function(i, terms_grid, input_data, ...){ - # make the formula - indices = unlist(terms_grid[i,]) - f <- get_formula(indices) - # do the GAM - input_data <- input_data |> mutate(Intercept = 1) - m = gam(f, data=input_data) - bic = BIC(m) - # create the indices and formula for output - index = data.frame(terms_grid[i,]) - f = paste0(target_var, ' ~ ', as.character(f)[3] ) - return(data.frame(index, bic, f)) - } - if(!STVC) { - terms_grid = make_svc_index_grid(covariates) - } else { - terms_grid = make_stvc_index_grid(covariates) + + } + + get_form_covariate <- function(varname, index, k_set = FALSE) { + if(!k_set) { + c("", + glue("+{varname}"), + glue("+{varname}+s({coords_x},{coords_y},by={varname})"), + glue("+{varname}+s({time_var},by={varname})"), + glue("+{varname}+s({coords_x},{coords_y},by={varname}) + s({time_var},by={varname})"), + glue("+{varname}+te({coords_x},{coords_y},{time_var},d=c(2,1),bs=c('tp','cr'),by={varname})") + )[index] + } else { + c("", + glue("+{varname}"), + glue("+{varname}+s({coords_x},{coords_y},k={spatial_k},by={varname})"), + glue("+{varname}+s({time_var},k={temporal_k},by={varname})"), + glue("+{varname}+s({coords_x},{coords_y},k={spatial_k},by={varname}) + s({time_var},k={temporal_k},by={varname})"), + glue("+{varname}+te({coords_x},{coords_y},{time_var},d=c(2,1),bs=c('tp','cr'),k=c({spatial_k},{temporal_k}),by={varname})") + )[index] + } + } + + make_tvc_index_grid <- function(vars) { + expression_x <- "expand.grid(Intercept = c(1, 3)" + for (i in vars) expression_x <- paste0(expression_x, ",", i, " = c(1,2,4)") + eval(parse(text = paste0(expression_x, ")"))) + } + + make_svc_index_grid <- function(vars) { + expression_x <- "expand.grid(Intercept = 1:2" + for (i in vars) expression_x <- paste0(expression_x, ",", i, " = 1:3") + eval(parse(text = paste0(expression_x, ")"))) + } + + make_stvc_index_grid <- function(vars) { + expression_x <- "expand.grid(Intercept = 1:5" + for (i in vars) expression_x <- paste0(expression_x, ",", i, " = 1:6") + eval(parse(text = paste0(expression_x, ")"))) } + + get_formula <- function(indices, k_set) { + form.i <- glue("{target_var} ~ 0 + Intercept") + form.i <- paste0(form.i, get_form_intercept(indices[1], k_set)) + for (j in seq_along(vars)) { + varname <- vars[j] + form.i <- paste0(form.i, get_form_covariate(varname, indices[j + 1], k_set)) + } + formula(form.i) + } + + evaluate_gam <- function(i, terms_grid, input_data, k_set, k_increase, ...) { + indices <- unlist(terms_grid[i, ]) + f <- get_formula(indices, k_set) + input_data <- dplyr::mutate(input_data, Intercept = 1) + # m <- mgcv::gam(f, data = input_data, method = "REML", family = model_family) + if(bam & any(indices > 1)) { + m <- mgcv::bam(f, data = input_data, method = "fREML", family = model_family, discrete = TRUE) + } else { + m <- mgcv::gam(f, data = input_data, method = "REML", family = model_family) + } + if(k_increase) { + m <- iterate_increase_k(m, k2edf_ratio = k2edf_ratio, + k_multiplier = k_multiplier, + max_iter = max_iter, + model_family = model_family) + } + ks <- k.check(m)[,1] |> as.vector() + ks <- gsub(", ", ", ", toString(ks)) + aic <- m$aic + index <- data.frame(terms_grid[i, ]) + f <- paste0(target_var, " ~ ", as.character(m$formula)[3]) + data.frame(index, ks, aic, f) + } + + ## --------------------------------------------------------- + ## 4. CONSTRUCT TERM GRID + ## --------------------------------------------------------- + terms_grid <- switch( + VC_type, + "SVC" = make_svc_index_grid(vars), + "TVC" = make_tvc_index_grid(vars), + "STVC" = make_stvc_index_grid(vars) + ) + + ## --------------------------------------------------------- + ## 5. RUN GAM MODELS (SEQUENTIAL OR PARALLEL) + ## --------------------------------------------------------- if (nrow(terms_grid) < 30) { vc_res_gam <- NULL - for(i in 1:nrow(terms_grid)) { - res.i = evaluate_gam(i, terms_grid, input_data) - vc_res_gam = rbind(vc_res_gam, res.i) + for (i in 1:nrow(terms_grid)) { + vc_res_gam <- rbind(vc_res_gam, evaluate_gam(i, terms_grid, input_data, k_set, k_increase)) } } else { - # see https://stackoverflow.com/questions/50571325/r-cran-check-fail-when-using-parallel-functions - #chk <- Sys.getenv("_R_CHECK_LIMIT_CORES_", "") - #if (nzchar(chk) && chk == "TRUE") { - # use 2 cores in CRAN/Travis/AppVeyor - # cl <- makeCluster(2L) - #} else { - # # use all cores in devtools::test() - # cl <- makeCluster(detectCores()-1) - #} - cl = makeCluster(ncores) - registerDoParallel(cl) - vc_res_gam <- - foreach(i = 1:nrow(terms_grid), - .combine = 'rbind', - .packages = c("glue", "mgcv", "purrr", "dplyr")) %dopar% { - evaluate_gam(i, terms_grid, input_data, - target_var, covariates, coords_x, coords_y, time_var) - } - stopCluster(cl) - } - vc_res_gam + cl <- parallel::makeCluster(ncores) + doParallel::registerDoParallel(cl) + + vc_res_gam <- foreach::foreach( + i = 1:nrow(terms_grid), + .combine = "rbind", + .packages = c("glue", "mgcv", "purrr", "dplyr") + ) %dopar% { + evaluate_gam(i, terms_grid, input_data, k_set, k_increase, target_var, + vars, coords_x, coords_y, time_var, spatial_k, temporal_k) + } + + parallel::stopCluster(cl) + } + + return(vc_res_gam) } + + diff --git a/R/gam_model_probs.R b/R/gam_model_probs.R deleted file mode 100644 index 277420d..0000000 --- a/R/gam_model_probs.R +++ /dev/null @@ -1,68 +0,0 @@ -#' Calculates the `model probabilities of the different GAM models generated by `evaluate_models` -#' -#' @param res_tab a table generated by `evaluate_models` -#' @param n the number of models to retain and generate probabilities for -#' -#' @return A ranked data table in `tibble` format of the top `n` models, their form, BIC and model or (`Pr(M|D)`) or relative (`Pr(M)`) probability value. Model probability indicates the probability of the each model being the correct model and the relative probabilities provide a measure of the doubt about the differences in model specification, when compared to the best or highest ranked model. The relative probabilities are needed when large BIC values generate near zero probability values. -#' -#' @importFrom dplyr rename -#' @importFrom dplyr arrange -#' @importFrom dplyr tibble -#' @importFrom dplyr relocate -#' @importFrom dplyr slice_head -#' @importFrom dplyr across -#' -#' @examples -#' library(dplyr) -#' library(purrr) -#' library(glue) -#' library(mgcv) -#' data(productivity) -#' input_data = productivity |> filter(year == "1970") -#' svc_res_gam = evaluate_models(input_data, STVC = FALSE) -#' mod_comp_svc <- gam_model_probs(svc_res_gam, n = 10) -#' # print out the terms and probabilities -#' mod_comp_svc|> select(-f) -#' @export -gam_model_probs <- function(res_tab, n = 10) { - bic = NULL - Prob = NULL - Rank = NULL - `Pr(M|D)` = NULL - nm <- names(res_tab) - len = length(nm) - mod_comp <- - tibble(res_tab) |> - rename(BIC = bic) |> - arrange(BIC) |> - # calculate absolute model probability - mutate(Prob= exp(-BIC/2)) |> - mutate(Prob = round(Prob/sum(Prob), 3)) |> - rename(`Pr(M|D)`=Prob) - # transpose the indices to to model terms - # rank and return the top n results - int_terms <- \(x) c("Fixed","s_S", "s_T", "s_T + S_S", "s_ST")[x] - var_terms <- \(x) c("---", "Fixed","s_S", "s_T", "s_T + s_S", "s_ST")[x] - out_tab <- - mod_comp |> - slice_head(n = n) |> - mutate(across(nm[2]:nm[len-2],var_terms)) |> - mutate(across(nm[1]:nm[1],int_terms)) |> - mutate(Rank = 1:n()) |> - relocate(Rank) - # if too many models Pr(M|D) is too small - # then calculate relative probabilities - # ie relative to the top ranked model - if (is.na(sum(out_tab$`Pr(M|D)`))) { - out_tab <- out_tab |> select(-`Pr(M|D)`) - prob_vec= NULL - for(i in 2:n) { - p1 = exp(-(out_tab$BIC[i]-out_tab$BIC[1])/2) - p1 = p1/(1+p1) - prob_vec = c(prob_vec, p1) - } - out_tab$`Pr(M)` = c("--", paste0(format(round(prob_vec, digits=3), nsmall = 1))) - } - out_tab -} - diff --git a/R/gam_model_rank.R b/R/gam_model_rank.R new file mode 100644 index 0000000..f312d97 --- /dev/null +++ b/R/gam_model_rank.R @@ -0,0 +1,122 @@ +#' Ranks models by AIC, giving the model form for each predictor variable. +#' +#' @param res_tab a `data.frame` returned from the `evaluate_models()` function. +#' @param n the number of ranked models to return. +#' @param show_k whether to include the smooth knots in the output (default is `FALSE`) +#' +#' @returns a `tibble` of the 'n' best models, ranked by AIC, with the form of each predictor variable where '---' indicates the absence of a predictor, 'Fixed' that a parametric form was specified, 's(S)' a spatial smooth, 's(T)' a temporal smooth and 'te(ST)' a combined space-time smooth. Model AIC is reported as are the knots in each smooth (`ks`) and the formula of each model (`f`). +#' @importFrom dplyr relocate +#' @importFrom dplyr mutate +#' @importFrom dplyr rename +#' @importFrom dplyr arrange +#' @importFrom dplyr slice_head +#' @importFrom dplyr across +#' @importFrom dplyr tibble +#' @importFrom stringr str_split +#' @importFrom stringr str_replace +#' @importFrom stringr str_detect +#' @importFrom purrr map2_chr +#' @importFrom magrittr %>% +#' @importFrom utils head +#' @importFrom mgcv k.check +#' @importFrom stats as.formula +#' +#' @examples +#' require(dplyr) +#' require(stringr) +#' require(purrr) +#' require(doParallel) +#' require(sf) +#' +#' # define input data +#' data("chaco") +#' input_data <- +#' chaco |> +#' # create Intercept as an addressable term +#' mutate(Intercept = 1) |> +#' # remove the geometry +#' st_drop_geometry() +#' +#' # evaluate different model forms +#' # example 1 with 6 models and no `k` adjustment +#' svc_mods <- +#' evaluate_models( +#' input_data = input_data, +#' target_var = "ndvi", +#' vars = c("tmax"), +#' coords_x = "X", +#' coords_y = "Y", +#' VC_type = "SVC" +#' ) +#' # rank the models +#' gam_model_rank(svc_mods, show_k = TRUE) +#' @export +gam_model_rank <- function(res_tab, n = 10, show_k = FALSE) { + ks_col <- which(names(res_tab) == "ks") + smooth_cols <- 1:(ks_col-1) + nm <- names(res_tab) + len <- length(nm) + Rank <- NULL + AIC <- NULL + aic <- NULL + res_tab <- + res_tab |> + rename(AIC = aic) |> + arrange(AIC) + + int_terms <- function(x) c("Fixed", "s(S)", "s(T)", "s(S) + s(T)", "te(ST)")[x] + var_terms <- function(x) c("---", "Fixed", "s(S)", "s(T)", "s(S) + s(T)", "te(ST)")[x] + + out_tab <- + res_tab |> + slice_head(n=n) |> + #slice(30:40) |> + mutate(across(nm[2]:nm[len - 3], var_terms)) |> + mutate(across(nm[1]:nm[1], int_terms)) + + # helper: attach a vector of ks to a vector of smooths + attach_k <- function(smooths, ks) { + map2_chr(smooths, ks, \(sm, k) { + str_replace(sm, "\\)$", paste0(", k=", k, ")")) + }) + } + # helper: split smooths inside a cell + split_smooths <- function(x) { + x %>% str_split("\\s*\\+\\s*") %>% unlist() + } + # helper: detect if a cell contains one or more smooth expressions + contains_smooth <- function(x) { + str_detect(x, "\\b(s|te|ti)\\s*\\(") + } + + if (show_k) { + for(i in 1:nrow(out_tab)) { + row.i <- out_tab[i,] + # create vector of k's allowing for non-smooth elements + smooth_lists <- lapply(row.i[smooth_cols], split_smooths) |> unlist() + smooth_counts <- sapply(smooth_lists, function(x) contains_smooth(x) + 0 |> unlist()) + ks_s <- str_split(row.i[ks_col], ",\\s*")[[1]] %>% as.numeric() + ks_vec <- rep("", length(smooth_counts)) + ks_vec[which(smooth_counts > 0)] <- ks_s + + # how many smooths in each selected column? + smooth_lists <- lapply(row.i[smooth_cols], split_smooths) + smooth_counts <- lengths(smooth_lists) + # cumulative positions to slice ks_vec + k_starts <- cumsum(c(1, head(smooth_counts, -1))) + k_ends <- cumsum(smooth_counts) + # assign new smooths back into their columns + for (j in seq_along(smooth_cols)) { + col_index <- smooth_cols[j] + cell_value = smooth_lists[[j]] + ks_slice <- ks_vec[k_starts[j]:k_ends[j]] + new_smooths <- attach_k(smooth_lists[[j]], ks_slice) + new_smooths <- paste(new_smooths, collapse = " + ") + new_smooths <- gsub(" ", "", new_smooths) + row.i[,j] <- new_smooths + } + out_tab[i,] <- row.i + } + } + return(out_tab |> mutate(Rank = 1:n()) |> relocate(Rank)) +} diff --git a/R/plot_1d_smooth.R b/R/plot_1d_smooth.R deleted file mode 100644 index 989cef7..0000000 --- a/R/plot_1d_smooth.R +++ /dev/null @@ -1,77 +0,0 @@ -#' Plots a 1-Dimensional GAM smooth -#' -#' @param mod a GAM model with smooths created using the `mgcv` package -#' @param ncol the number of columns in the compound plot -#' @param nrow the number of rows in the compound plot -#' @param fills the fill colours (single or vector) -#' -#' @return A compound plot of the GAM 1-dimensional smooths (rendered using `cowplot::plot_grid()`). -#' @importFrom grDevices pdf -#' @importFrom grDevices dev.off -#' @importFrom ggplot2 ggplot -#' @importFrom ggplot2 aes -#' @importFrom ggplot2 geom_ribbon -#' @importFrom ggplot2 geom_line -#' @importFrom ggplot2 theme_bw -#' @importFrom ggplot2 xlab -#' @importFrom ggplot2 ylab -#' @importFrom ggplot2 ylim -#' @importFrom cowplot plot_grid -#' -#' @examples -#' library(mgcv) -#' library(ggplot2) -#' library(dplyr) -#' library(cowplot) -#' # 1. from the `mgcv` `gam` function help -#' set.seed(2) ## simulate some data... -#' dat <- gamSim(1,n=400,dist="normal",scale=2) -#' b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat) -#' plot_1d_smooth(b, ncol = 2, fills = c("lightblue", "lightblue3")) -#' dev.off() -#' # 2. using a TVC -#' data(productivity) -#' data = productivity |> mutate(Intercept = 1) -#' gam.tvc.mod = gam(privC ~ 0 + Intercept + -#' s(year, bs = 'gp', by = Intercept) + -#' unemp + s(year, bs = "gp", by = unemp) + -#' pubC + s(year, bs = "gp", by = pubC), -#' data = data) -#' plot_1d_smooth(gam.tvc.mod, fills = "lightblue") -#' @export -plot_1d_smooth = function(mod, ncol = NULL, nrow = NULL, fills = "lightblue") { - pdf(file = NULL) # dummy PDF - smooths <- plot(mod, page = 1) # call the plot - dev.off() # close the dummy plot - # create the plots, rescaling to have same y-axis - # objects for outputs and axis scaling - scale = NULL - for (i in 1:length(smooths)){ - y = smooths[[i]]$fit - u = y + smooths[[i]]$se - l = y - smooths[[i]]$se - scale <- c(scale, c(min(u, l), max(u,l))) - } - fills = rep(fills, length(smooths)) - plot.list <- NULL - for (i in 1:length(smooths)){ - x = smooths[[i]]$x - y = smooths[[i]]$fit - u = y + smooths[[i]]$se - l = y - smooths[[i]]$se - scale <- c(scale, c(min(u, l), max(u,l))) - df = data.frame(x, y, u, l) - plot.list[[i]] <- - ggplot(df, aes(x, y, ymin = l, ymax = u)) + - geom_ribbon(fill = fills[i]) + - geom_line() + - theme_bw() + - xlab(smooths[[i]]$xlab) + - ylab(smooths[[i]]$ylab) + - ylim(min(scale), max(scale)) - } - plot_grid_args <- c(plot.list[1:length(smooths)], - list(ncol = ncol), - list(nrow = nrow)) - do.call(plot_grid, plot_grid_args) -} diff --git a/R/plot_2d_smooth.R b/R/plot_2d_smooth.R deleted file mode 100644 index 2cafca8..0000000 --- a/R/plot_2d_smooth.R +++ /dev/null @@ -1,95 +0,0 @@ -#' Plots a 2-Dimensional GAM smooth -#' -#' @param mod a GAM model with smooths created using the `mgcv` package -#' @param filled `logical` value to indicate whether a filled plot should be created (`TRUE`) or not (`FALSE`) -#' @param outline the name of an `sf` object to be plotted (NULL is the default) -#' @param ncol the number of columns for the compound plot -#' @param nrow the number of rows for the compound plot -#' -#' @return A compound plot of the 2-dimensional smooths (rendered using `cowplot::plot_grid`). -#' @importFrom grDevices pdf -#' @importFrom grDevices dev.off -#' @importFrom ggplot2 geom_contour -#' @importFrom ggplot2 coord_sf -#' @importFrom ggplot2 geom_sf -#' @importFrom ggplot2 geom_contour_filled -#' -#' @examples -#' library(mgcv) -#' library(ggplot2) -#' library(dplyr) -#' library(metR) -#' library(cowplot) -#' set.seed(2) ## simulate some data... -#' dat <- gamSim(1,n=400,dist="normal",scale=2) -#' # use x1 and x2 as the coordinates -#' b <- gam(y~s(x0, x1, bs = 'gp', by = x2),data=dat) -#' plot_2d_smooth(b, filled = TRUE) -#' @export -plot_2d_smooth = function(mod, filled = FALSE, outline = NULL, ncol = NULL, nrow = NULL) { - Var1 = NULL - Var2 = NULL - pdf(file = NULL) # dummy PDF - smooths <- plot(mod, page = 1) # call the plot - dev.off() # close the dummy plot - # create the plots, rescaling to have same y-axis - # objects for outputs and axis scaling - for (i in 1:length(smooths)){ - y = smooths[[i]]$fit - u = y + smooths[[i]]$se - l = y - smooths[[i]]$se - scale <- c(scale, c(min(u, l), max(u,l))) - } - plot.list <- NULL - for (i in 1:length(smooths)){ - x = smooths[[i]]$x - y = smooths[[i]]$y - fit = smooths[[i]]$fit - u = fit + smooths[[i]]$se - l = fit - smooths[[i]]$se - m <- expand.grid( x , y) - m <- cbind(m, fit, u, l) - if(!filled) { - if(is.null(outline)) { - plot.list[[i]] <- - ggplot(data = m, aes(x = Var1, y = Var2, z = fit)) + - geom_contour(col = "black") + - metR::geom_text_contour(aes(z = fit)) + - geom_contour(aes(z = u), lty = 2, col = "red") + - geom_contour(aes(z = l), lty = 3, col = "darkgreen") + - coord_sf() + - theme_bw() + xlab(smooths[[i]]$xlab) + ylab(smooths[[i]]$ylab) - } else { - plot.list[[i]] <- - ggplot(data = m) + - geom_contour(aes(x = Var1, y = Var2, z = fit), col = "black") + - metR::geom_text_contour(aes(x = Var1, y = Var2, z = fit)) + - geom_contour(aes(x = Var1, y = Var2, z = u), lty = 2, col = "red") + - geom_contour(aes(x = Var1, y = Var2, z = l), lty = 3, col = "darkgreen") + - coord_sf() + - theme_bw() + xlab("") + ylab("") + - geom_sf(data = outline, fill = NA) - } - } - if(filled) { - if(is.null(outline)) { - plot.list[[i]] <- - ggplot(data = m, aes(x = Var1, y = Var2, z = fit)) + - geom_contour_filled(na.rm = T, bins = 5) + - coord_sf() + - theme_bw() + xlab(smooths[[i]]$xlab) + ylab(smooths[[i]]$ylab) - } else { - plot.list[[i]] <- - ggplot() + - geom_contour_filled(data = m, aes(x = Var1, y = Var2, z = fit), bins = 5, na.rm = T) + - coord_sf() + - theme_bw() + xlab("") + ylab("") + - geom_sf(data = outline, fill = NA) - } - } - } - plot_grid_args <- c(plot.list[1:length(smooths)], - list(ncol = ncol), - list(nrow = nrow)) - do.call(plot_grid, plot_grid_args) -} diff --git a/R/stgam-package.R b/R/stgam-package.R index a65cf64..7400243 100644 --- a/R/stgam-package.R +++ b/R/stgam-package.R @@ -4,3 +4,4 @@ ## usethis namespace: start ## usethis namespace: end NULL + diff --git a/README.md b/README.md index f15259f..28c6e51 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,20 @@ -# stgam + +# `stgam`: Spatially and Temporally Varying Coefficient Models Using Generalized Additive Models (GAMs) [![R-CMD-check](https://github.com/lexcomber/stgam/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/lexcomber/stgam/actions/workflows/R-CMD-check.yaml) -The goal of the `stgam` package is to provide a framework for capturing process spatial and or / or temporal heterogeneity, using a varying coefficient modelling approach based on GAMs with Gaussian Process (GP) smooths. It constructs a series of models and uses probability to determine the best model or best set of competing models. Where there in no clear 'winner', competing and highly probabale models can be combined using Bayesian Model Averaging. +This package provides a framework for specifying space-time varying coefficient models using Generalized Additive Models (GAMs) with smooths. It builds on GAM functionality from the `mgcv` package. The smooths are parameterised with location, time and predictor variables. The framework suggests the need to investigate for the presence and nature of any space-time dependencies in the data. It proposes a workflow that creates and refines an initial space-time GAM and includes tools to create and evaluate multiple model forms. The workflow sequence is to: i) Prepare the data (`data.frame`, `tibble` or `sf` object) by lengthening it to have a single location and time variables for each observation. ii) Create all possible space and/or time models in which each predictor is specified in different ways in smooths. iii) Evaluate each model via their AIC value and pick the best one. iv) Create the final model. v) Calculate the varying coefficient estimates to quantify how the relationships between the target and predictor variables vary over space, time or space-time. vi) Create maps, time series plots etc. The number of knots used in each smooth can be specified directly or iteratively increased. This is illustrated with a point dataset of NDVI and climate data. The data are sample of 2000 observations of Normalised Difference Vegetation Index (NDVI) (2012-2022) of the Chaco dry rainforest in South America with some climate data. This builds on work in Comber et al (2024) [https://doi.org/10.1080/13658816.2023.2270285] and Comber et al (2024) [https://doi.org/10.3390/ijgi13120459]. ## Installation -You can install the development version of `stgam` : - +You can install the CRAN version of stgam : +``` r +install.packages("stgam") +``` +Or the development version: ``` r # just the package remotes::install_github("lexcomber/stgam") @@ -20,35 +24,63 @@ remotes::install_github("lexcomber/stgam", build_vignettes = TRUE, force = T) ## Example -This code below loads the package and the package Imports (these have not been set as dependencies). It then undertakes and evaluates a series of spatially varying coefficient models using GAMs with GP smooths: +This code below loads the package and undertakes the proposed workflow for a spatially varying coefficient model using GAMs with spatial smooths. Note there are in `evaluate_modesl()` options for increasing the smooth bases (`k`). These are not done here. -```{r eval = F} +``` r +# a spatially varying coefficient model example library(stgam) -library(cols4all) # for nice shading in graphs and maps -library(cowplot) # for managing plots -library(dplyr) # for data manipulation -library(ggplot2) # for plotting and mapping -library(glue) # for model construction -library(mgcv) # for GAMs -library(sf) # for spatial data -library(doParallel) # for parallelising operations -library(purrr) # for model construction -library(tidyr) # for model construction - -data("productivity") -data = productivity |> filter(year == "1970") -# create multiple models with different forms -svc_gam = - evaluate_models(data = data,target_var = "privC", - covariates = c("unemp", "pubC"), - coords_x = "X", - coords_y = "Y", - STVC = FALSE) -# examine -head(svc_gam) -# calculate the probabilities for each model -mod_comp_svc <- gam_model_probs(svc_gam, n = 10) +library(dplyr) +library(ggplot2) +library(cols4all) +library(sf) + +# define input data +data(chaco) +input_data <- + chaco |> + # create Intercept as an addressable term + mutate(Intercept = 1) + +# evaluate different model forms +# here not seeking to improve or set `k` +svc_mods <- + evaluate_models( + input_data = input_data |> st_drop_geometry(), + target_var = "ndvi", + vars = c("tmax", "pr"), + coords_x = "X", + coords_y = "Y", + VC_type = "SVC", + time_var = NULL, + k_increase = FALSE, + k_set = FALSE, + ncores = 10 + ) +# rank the models and smooth `k` +mod_comp <- gam_model_rank(svc_mods) # have a look -mod_comp_svc|> select(-f) -``` +mod_comp |> select(-f) + +# select best model +f = as.formula(mod_comp$f[1]) +# put into a `mgcv` GAM model +gam.m = gam(f, data = input_data) + +# evalate the effect sizes of the model terms +effect_size(gam.m) +# calculate the Varying Coefficients +terms = c("Intercept", "tmax", "pr") +vcs = calculate_vcs(input_data |> st_drop_geometry(), gam.m, terms) +vcs |> select(ndvi, date, X, Y, starts_with(c("b_", "se_")), yhat) + +# map them +tit <-expression(paste(""*beta[`tmax`]*"")) +ggplot() + + geom_point(data = vcs, aes(x = lon, y = lat, col = b_tmax)) + + scale_colour_continuous_c4a_seq("brewer.yl_or_rd", name = tit) + + theme_bw() + + coord_sf() + + xlab("") + ylab("") + +``` diff --git a/data/chaco.rda b/data/chaco.rda new file mode 100644 index 0000000..0184e7c Binary files /dev/null and b/data/chaco.rda differ diff --git a/data/productivity.rda b/data/productivity.rda deleted file mode 100644 index aba9475..0000000 Binary files a/data/productivity.rda and /dev/null differ diff --git a/data/us_data.rda b/data/us_data.rda deleted file mode 100644 index 84a24ca..0000000 Binary files a/data/us_data.rda and /dev/null differ diff --git a/man/calculate_vcs.Rd b/man/calculate_vcs.Rd index c361320..4796fff 100644 --- a/man/calculate_vcs.Rd +++ b/man/calculate_vcs.Rd @@ -1,36 +1,39 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculate_vc.R +% Please edit documentation in R/calculate_vcs.R \name{calculate_vcs} \alias{calculate_vcs} \title{Extracts varying coefficient estimates (for SVC, TVC and STVC models).} \usage{ -calculate_vcs(input_data, model, terms) +calculate_vcs(input_data, mgcv_model, terms = NULL) } \arguments{ -\item{input_data}{the data used to create the GAM model in \code{data.frame}, \code{tibble} or \code{sf} format} +\item{input_data}{the data used to create the GAM model in \code{data.frame}, \code{tibble} or \code{sf} format. This can be the original data used to create the model or another surface with location and time attributes.} -\item{model}{a GAM model with smooths created using the \code{mgcv} package} +\item{mgcv_model}{a GAM model with smooths created using the \code{mgcv} package} -\item{terms}{a vector of names starting with "Intercept" plus the names of the covariates used in the GAM model (these are the names of the variables in \code{data} )} +\item{terms}{a vector of names starting with "Intercept" plus the names of the covariates used in the GAM model (these are the names of the variables in the \code{input_data} used to construct the model).} } \value{ -A \code{data.frame} of the input data and the coefficient and standard error estimates for each covariate. +A \code{data.frame} of the input data, the coefficient estimates, the standard errors and the t-values estimates for each covariate. It can be used to generate coefficient estimates for specific time slices and over gridded surfaces as described in the package vignette. } \description{ Extracts varying coefficient estimates (for SVC, TVC and STVC models). } \examples{ -library(dplyr) -library(mgcv) -# SVC -data(productivity) -input_data = productivity |> dplyr::filter(year == "1970") |> mutate(Intercept = 1) -gam.svc.mod = gam(privC ~ 0 + Intercept + - s(X, Y, bs = 'gp', by = Intercept) + - unemp + s(X, Y, bs = "gp", by = unemp) + - pubC + s(X, Y, bs = "gp", by = pubC), - data = input_data) -terms = c("Intercept", "unemp", "pubC") -svcs = calculate_vcs(input_data, gam.svc.mod, terms) -head(svcs) +require(dplyr) +require(doParallel) +# define input data +data("chaco") +input_data <- + chaco |> + # create Intercept as an addressable term + mutate(Intercept = 1) +# create a model for example as result of running `evaluate_models` +gam.m = gam(ndvi ~ 0 + s(X, Y, by = Intercept) + + s(X, Y, by = tmax) + s(X, Y, by = pr), data = input_data) +# calculate the Varying Coefficients +terms = c("Intercept", "tmax", "pr") +vcs = calculate_vcs(input_data, gam.m, terms) +vcs |> select(ndvi, X, Y, starts_with(c("b_", "se_", "t_")), yhat) + } diff --git a/man/chaco.Rd b/man/chaco.Rd new file mode 100644 index 0000000..b672248 --- /dev/null +++ b/man/chaco.Rd @@ -0,0 +1,38 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/data.R +\docType{data} +\name{chaco} +\alias{chaco} +\title{Chaco dry rainforest data (2012-2022)} +\format{ +A \code{sf} POINT dataset with 2000 observations and 12 fields. +\describe{ +\item{id}{An observation identifier} +\item{ndvi}{Normalised Difference Vegetation Index (NDVI)} +\item{tmax}{Maximum temperature (°C)} +\item{pr}{Preciptation} +\item{month}{A continous integer variable from 1 to 120} +\item{year}{The year of observation} +\item{lon}{Longitude in degrees (WGS84)} +\item{lat}{Latitude in degrees (WGS84)} +\item{X}{Easting in metres from the SIRGAS 2000 / Brazil Mercator projection (EPSG:5641)} +\item{Y}{Northing in metres from the SIRGAS 2000 / Brazil Mercator projection (EPSG:5641)} +\item{geometry}{The spatial geometry of the observation in the SIRGAS 2000 / Brazil Mercator projection (EPSG:5641)} +} +} +\source{ +Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/J.RSE.2017.06.031 + +Li, M., Cao, S., Zhu, Z., Wang, Z., Myneni, R. B., & Piao, S. (2023). Spatiotemporally consistent global dataset of the GIMMS Normalized Difference Vegetation Index (PKU GIMMS NDVI) from 1982 to 2022. Earth System Science Data, 15(9), 4181–4203. https://doi.org/10.5194/ESSD-15-4181-2023 +} +\usage{ +chaco +} +\description{ +A point dataset of NDVI and climate data. The data are sample of 2000 observations of Normalised Difference Vegetation Index (NDVI) (2012-2022) of the Chaco dry rainforest in South America with some climate data. These are found via Google Earth Engine (Gorelick et al., 2017). The NDVI data is sourced from the PKU GIMMS NDVI v1.2 dataset, which provides NDVI observations at 1/12° spatial resolution at bi-monthly intervals from 1982 to 2022 (Li et al., 2023). The climate data was derived from the TerraClimate dataset (IDAHO_EPSCOR/TERRACLIMATE). Maximum temperature (\code{tmax}) and Precipitation (\code{pr}) were selected and means calculated for each monthly image across all pixels. +} +\examples{ +library(sf) +data("chaco") +} +\keyword{datasets} diff --git a/man/do_bma.Rd b/man/do_bma.Rd deleted file mode 100644 index f2500e5..0000000 --- a/man/do_bma.Rd +++ /dev/null @@ -1,62 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/do_bma.R -\name{do_bma} -\alias{do_bma} -\title{Undertake undertake coefficient averaging using Bayesian Model Avergaing (BMA), weighting different models by their probabilities} -\usage{ -do_bma(model_table, terms, thresh = 0.1, relative = FALSE, input_data) -} -\arguments{ -\item{model_table}{a table of competing models generated by the \code{gam_model_probs()} function} - -\item{terms}{a vector of names starting with "Intercept" plus the names of the covariates used in the GAM model (these are the names of the variables in \code{data} )} - -\item{thresh}{a probability threshold value above which to combine competing models} - -\item{relative}{logical to indicate whether the probabilities in \code{data} are relative (\code{Pr(M)}) or absolute (\code{Pr(M|D)})} - -\item{input_data}{the input data with a named Intercept term, in \code{data.frame}, \code{tibble} or \code{sf} format} -} -\value{ -A \code{tibble} of the input data plus the probability weighted averaged coefficient estimates from multiple models, all starting with \verb{b_} -} -\description{ -Undertake undertake coefficient averaging using Bayesian Model Avergaing (BMA), weighting different models by their probabilities -} -\examples{ -library(cols4all) -library(dplyr) -library(sf) -library(glue) -library(purrr) -library(mgcv) -library(sf) -library(ggplot2) -# data -data(productivity) -input_data = productivity |> filter(year == "1970") |> mutate(Intercept = 1) -# create and evaluate multiple models -svc_res_gam = evaluate_models(input_data, STVC = FALSE) -# determine their probabilities -mod_comp_svc <- gam_model_probs(svc_res_gam) -# combine the model coefficients -svc_bma <- do_bma(model_table = mod_comp_svc, - terms = c("Intercept", "unemp", "pubC"), - thresh = 0.1, - relative = FALSE, - input_data = input_data) -head(svc_bma) -# join back to spatial layer -data(us_data) -svc_bma_sf <- - us_data |> - select(GEOID) |> - left_join(svc_bma) -# and map -tit =expression(paste(""*beta[`Public Capital`]*" ")) -ggplot(data = svc_bma_sf, aes(fill=b_pubC)) + - geom_sf() + - scale_fill_continuous_c4a_div(palette="brewer.yl_or_rd",name=tit) + - coord_sf() + - theme_linedraw() -} diff --git a/man/effect_size.Rd b/man/effect_size.Rd new file mode 100644 index 0000000..1098c93 --- /dev/null +++ b/man/effect_size.Rd @@ -0,0 +1,38 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/effect_size.R +\name{effect_size} +\alias{effect_size} +\title{Quantifies the relative effect sizes of each component of \code{mgcv} GAM} +\usage{ +effect_size(mgcv_model, digits = 3) +} +\arguments{ +\item{mgcv_model}{a GAM model created by the \code{mgcv} package} + +\item{digits}{the number of significant figures of uysed to report effect size} +} +\value{ +a matrix of the model terms, the size of the effect (range) ad standard deviation (sd) +} +\description{ +Quantifies the relative effect sizes of each component of \code{mgcv} GAM +} +\examples{ +require(dplyr) +require(stringr) +require(purrr) +require(doParallel) + +# define input data +data("chaco") +m <- gam( +ndvi ~ + te(X,Y, by = tmax) + + s(X,Y, by = pr), + data = chaco, + method = "REML", + family = gaussian() +) +# examine the effect size +effect_size(m, 3) +} diff --git a/man/evaluate_models.Rd b/man/evaluate_models.Rd index cbba6ab..ccb35a8 100644 --- a/man/evaluate_models.Rd +++ b/man/evaluate_models.Rd @@ -2,56 +2,169 @@ % Please edit documentation in R/evaluate_models.R \name{evaluate_models} \alias{evaluate_models} -\title{Creates and evaluates multiple varying coefficient GAM GP smooth models (SVC or STVC)} +\title{Evaluates multiple models with each predictor variable specified in different ways in order to determining model form} \usage{ evaluate_models( input_data, - target_var = "privC", - covariates = c("unemp", "pubC"), + target_var, + vars, + model_family = "gaussian()", + bam = FALSE, coords_x = "X", coords_y = "Y", - STVC = FALSE, + VC_type = "SVC", time_var = NULL, + k_set = FALSE, + spatial_k = 50, + temporal_k = 10, + k_increase = FALSE, + k2edf_ratio = 1.5, + k_multiplier = 2, + max_iter = 10, ncores = 2 ) } \arguments{ -\item{input_data}{a \code{data.frame}, \code{tibble} \code{sf} containing the target variables, covariates and coordinate variables} +\item{input_data}{he data to be used used to create the GAM model in (\code{data.frame} or \code{tibble} format), containing an Intercept column to allow it be treated as an addressable term in the model.} -\item{target_var}{the name of the target variable in \code{data}} +\item{target_var}{the name of the target variable.} -\item{covariates}{the name of the covariates (predictor variables) in \code{data}} +\item{vars}{a vector of the predictor variable names (without the Intercept).} -\item{coords_x}{the name of the X, Easting or Longitude variable in \code{data}} +\item{model_family}{the model family, defaults to Guassian} -\item{coords_y}{the name of the Y, Northing or Latitude variable in \code{data}} +\item{bam}{a logical parameter indicating whether to use the \code{bam} function to construct GAMs. The default is \code{FALSE} and the \code{gam} function. The \code{mgcv::bam} should be used for large datasets or large numbers of predictor variables.} -\item{STVC}{a logical operator to indicate whether the models Space-Time (\code{TRUE}) or just Space (\code{FALSE})} +\item{coords_x}{the name of the X, Easting or Longitude variable in \code{input_data}.} -\item{time_var}{the name of the time variable if undertaking STVC model evaluations} +\item{coords_y}{the name of the Y, Northing or Latitude variable in \code{input_data}.} -\item{ncores}{the number of cores to use in parallelised approaches (default is 2 to overcome CRAN package checks). This can be determined for your computer by running \code{parallel::detectCores()-1}. Parallel approaches are only undertaken if the number of models to evaluate is greater than 30.} +\item{VC_type}{the type of varying coefficient model: options are "TVC" for temporally varying, "SVC" for spatially varying and "STVC" for space-time.} + +\item{time_var}{the name of the time variable if undertaking STVC model evaluations.} + +\item{k_set}{a logical value for user defined \code{k} values. The default is \code{FALSE}. Cannot be used with \code{k_increase}.} + +\item{spatial_k}{the value of \code{k} for spatial smooths if \code{k_set} is \code{TRUE}.} + +\item{temporal_k}{the value of \code{k} for temporal smooths if \code{k_set} is \code{TRUE}.} + +\item{k_increase}{a logical value of whether to check and increase the number of knots in each smooth. The default is \code{FALSE}. Cannot be used with \code{k_set}.} + +\item{k2edf_ratio}{a threshold of the ratio of the number of knots, \code{k}, in each smooth to its Effective Degrees of Freedom. If any smooth has a \emph{knots-to-EDF} ratio less than this value then the knots are iteratively increased by the \code{k_multiplier} value until the threshold check is passed, the number knots passes the maximum degrees of freedom, or the number of iterations, \code{max_iter} is reached. Cannot be used with \code{k_set}.} + +\item{k_multiplier}{a multiplier by which the knots are increased on each iteration. The default is 2.} + +\item{max_iter}{the maximum number of iterations that \code{k} is increased.} + +\item{ncores}{the number of cores to use in parallelised approaches (default is 2 to overcome CRAN package checks). This can be determined for your computer by running parallel::detectCores()-1. Parallel approaches are only undertaken if the number of models to evaluate is greater than 30.} } \value{ -A data table in \code{data.frame} format of all possible model combinations with each covariate specified in all possible ways, with the BIC of the model and the model formula. +a \code{data.frame} with indices for each predictor variable, the knots specified in each smooth (\code{ks}), a AIC score (\code{aic}) for each model and the associated formula (\code{f}). The output should be passed to the \code{gam_model_rank} function. } \description{ -Creates and evaluates multiple varying coefficient GAM GP smooth models (SVC or STVC) +Evaluates multiple models with each predictor variable specified in different ways in order to determining model form } \examples{ -library(dplyr) -library(glue) -library(purrr) -library(doParallel) -library(mgcv) -data("productivity") -input_data = productivity |> filter(year == "1970") -svc_res_gam = - evaluate_models(input_data = input_data, - target_var = "privC", - covariates = c("unemp", "pubC"), - coords_x = "X", - coords_y = "Y", - STVC = FALSE) -head(svc_res_gam) +\dontrun{ +require(dplyr) +require(doParallel) +require(sf) + +# define input data +data("chaco") +input_data <- + chaco |> + # create Intercept as an addressable term + mutate(Intercept = 1) |> + # remove the geometry + st_drop_geometry() + +# Evaluate different model forms +# example 1 with 6 models and no `k` adjustment +svc_mods <- + evaluate_models( + input_data = input_data, + target_var = "ndvi", + model_family = "gaussian()", + vars = c("tmax"), + coords_x = "X", + coords_y = "Y", + VC_type = "SVC" + ) +# have a look! +svc_mods + +# example 2 with 6 models and `k` adjustment +svc_k2_mods <- + evaluate_models( + input_data = input_data, + target_var = "ndvi", + vars = c("tmax"), + model_family = "gaussian()", + coords_x = "X", + coords_y = "Y", + VC_type = "SVC", + k_increase = TRUE, + k2edf_ratio = 1.5, + k_multiplier = 2, + max_iter = 10 + ) +# have a look! +svc_k2_mods + +# example 3 with 6 models and `k` set by user +svc_k3_mods <- + evaluate_models( + input_data = input_data, + model_family = "gaussian()", + target_var = "ndvi", + vars = c("tmax"), + coords_x = "X", + coords_y = "Y", + VC_type = "SVC", + time_var = NULL, + k_set = TRUE, + spatial_k = 20, + ) +# have a look! +svc_k3_mods + +# example comparing `gam` (slower) and `bam` (faster) +t1 <- Sys.time() +svc_GAM_mods <- + evaluate_models( + input_data = input_data, + target_var = "ndvi", + vars = c("tmax", "pr"), + model_family = "gaussian()", + bam = FALSE, + coords_x = "X", + coords_y = "Y", + time_var = "month", + VC_type = "STVC", + ncores = detectCores()-1 + ) +t_gam <- Sys.time() - t1 + +t1 <- Sys.time() +svc_BAM_mods <- + evaluate_models( + input_data = input_data, + target_var = "ndvi", + vars = c("tmax", "pr"), + model_family = "gaussian()", + bam = TRUE, + coords_x = "X", + coords_y = "Y", + time_var = "month", + VC_type = "STVC", + ncores = detectCores()-1 + ) +t_bam <- Sys.time() - t1 + +t_gam - t_bam + +} + } diff --git a/man/gam_model_probs.Rd b/man/gam_model_probs.Rd deleted file mode 100644 index 51e137d..0000000 --- a/man/gam_model_probs.Rd +++ /dev/null @@ -1,31 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/gam_model_probs.R -\name{gam_model_probs} -\alias{gam_model_probs} -\title{Calculates the \verb{model probabilities of the different GAM models generated by }evaluate_models`} -\usage{ -gam_model_probs(res_tab, n = 10) -} -\arguments{ -\item{res_tab}{a table generated by \code{evaluate_models}} - -\item{n}{the number of models to retain and generate probabilities for} -} -\value{ -A ranked data table in \code{tibble} format of the top \code{n} models, their form, BIC and model or (\code{Pr(M|D)}) or relative (\code{Pr(M)}) probability value. Model probability indicates the probability of the each model being the correct model and the relative probabilities provide a measure of the doubt about the differences in model specification, when compared to the best or highest ranked model. The relative probabilities are needed when large BIC values generate near zero probability values. -} -\description{ -Calculates the \verb{model probabilities of the different GAM models generated by }evaluate_models` -} -\examples{ -library(dplyr) -library(purrr) -library(glue) -library(mgcv) -data(productivity) -input_data = productivity |> filter(year == "1970") -svc_res_gam = evaluate_models(input_data, STVC = FALSE) -mod_comp_svc <- gam_model_probs(svc_res_gam, n = 10) -# print out the terms and probabilities -mod_comp_svc|> select(-f) -} diff --git a/man/gam_model_rank.Rd b/man/gam_model_rank.Rd new file mode 100644 index 0000000..edbc6e3 --- /dev/null +++ b/man/gam_model_rank.Rd @@ -0,0 +1,51 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/gam_model_rank.R +\name{gam_model_rank} +\alias{gam_model_rank} +\title{Ranks models by AIC, giving the model form for each predictor variable.} +\usage{ +gam_model_rank(res_tab, n = 10, show_k = FALSE) +} +\arguments{ +\item{res_tab}{a \code{data.frame} returned from the \code{evaluate_models()} function.} + +\item{n}{the number of ranked models to return.} + +\item{show_k}{whether to include the smooth knots in the output (default is \code{FALSE})} +} +\value{ +a \code{tibble} of the 'n' best models, ranked by AIC, with the form of each predictor variable where '---' indicates the absence of a predictor, 'Fixed' that a parametric form was specified, 's(S)' a spatial smooth, 's(T)' a temporal smooth and 'te(ST)' a combined space-time smooth. Model AIC is reported as are the knots in each smooth (\code{ks}) and the formula of each model (\code{f}). +} +\description{ +Ranks models by AIC, giving the model form for each predictor variable. +} +\examples{ +require(dplyr) +require(stringr) +require(purrr) +require(doParallel) +require(sf) + +# define input data +data("chaco") +input_data <- + chaco |> + # create Intercept as an addressable term + mutate(Intercept = 1) |> + # remove the geometry + st_drop_geometry() + +# evaluate different model forms +# example 1 with 6 models and no `k` adjustment +svc_mods <- + evaluate_models( + input_data = input_data, + target_var = "ndvi", + vars = c("tmax"), + coords_x = "X", + coords_y = "Y", + VC_type = "SVC" + ) +# rank the models +gam_model_rank(svc_mods, show_k = TRUE) +} diff --git a/man/plot_1d_smooth.Rd b/man/plot_1d_smooth.Rd deleted file mode 100644 index 13b10fa..0000000 --- a/man/plot_1d_smooth.Rd +++ /dev/null @@ -1,44 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot_1d_smooth.R -\name{plot_1d_smooth} -\alias{plot_1d_smooth} -\title{Plots a 1-Dimensional GAM smooth} -\usage{ -plot_1d_smooth(mod, ncol = NULL, nrow = NULL, fills = "lightblue") -} -\arguments{ -\item{mod}{a GAM model with smooths created using the \code{mgcv} package} - -\item{ncol}{the number of columns in the compound plot} - -\item{nrow}{the number of rows in the compound plot} - -\item{fills}{the fill colours (single or vector)} -} -\value{ -A compound plot of the GAM 1-dimensional smooths (rendered using \code{cowplot::plot_grid()}). -} -\description{ -Plots a 1-Dimensional GAM smooth -} -\examples{ -library(mgcv) -library(ggplot2) -library(dplyr) -library(cowplot) -# 1. from the `mgcv` `gam` function help -set.seed(2) ## simulate some data... -dat <- gamSim(1,n=400,dist="normal",scale=2) -b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat) -plot_1d_smooth(b, ncol = 2, fills = c("lightblue", "lightblue3")) -dev.off() -# 2. using a TVC -data(productivity) -data = productivity |> mutate(Intercept = 1) -gam.tvc.mod = gam(privC ~ 0 + Intercept + - s(year, bs = 'gp', by = Intercept) + - unemp + s(year, bs = "gp", by = unemp) + - pubC + s(year, bs = "gp", by = pubC), - data = data) -plot_1d_smooth(gam.tvc.mod, fills = "lightblue") -} diff --git a/man/plot_2d_smooth.Rd b/man/plot_2d_smooth.Rd deleted file mode 100644 index 0ef0a94..0000000 --- a/man/plot_2d_smooth.Rd +++ /dev/null @@ -1,37 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot_2d_smooth.R -\name{plot_2d_smooth} -\alias{plot_2d_smooth} -\title{Plots a 2-Dimensional GAM smooth} -\usage{ -plot_2d_smooth(mod, filled = FALSE, outline = NULL, ncol = NULL, nrow = NULL) -} -\arguments{ -\item{mod}{a GAM model with smooths created using the \code{mgcv} package} - -\item{filled}{\code{logical} value to indicate whether a filled plot should be created (\code{TRUE}) or not (\code{FALSE})} - -\item{outline}{the name of an \code{sf} object to be plotted (NULL is the default)} - -\item{ncol}{the number of columns for the compound plot} - -\item{nrow}{the number of rows for the compound plot} -} -\value{ -A compound plot of the 2-dimensional smooths (rendered using \code{cowplot::plot_grid}). -} -\description{ -Plots a 2-Dimensional GAM smooth -} -\examples{ -library(mgcv) -library(ggplot2) -library(dplyr) -library(metR) -library(cowplot) -set.seed(2) ## simulate some data... -dat <- gamSim(1,n=400,dist="normal",scale=2) -# use x1 and x2 as the coordinates -b <- gam(y~s(x0, x1, bs = 'gp', by = x2),data=dat) -plot_2d_smooth(b, filled = TRUE) -} diff --git a/man/productivity.Rd b/man/productivity.Rd deleted file mode 100644 index bb2b5c3..0000000 --- a/man/productivity.Rd +++ /dev/null @@ -1,37 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/data.R -\docType{data} -\name{productivity} -\alias{productivity} -\title{US States Economic Productivity Data (1970-1985)} -\format{ -A tibble with 816 rows and 14 columns. -\describe{ -\item{state}{The name of the state} -\item{GEOID}{The state code} -\item{region}{The region} -\item{pubC}{Public capital which is composed of highways and streets (hwy) water and sewer facilities (water) and other public buildings and structures (util)} -\item{hwy}{Highway and streets assets} -\item{water}{Water utility assets} -\item{util}{Other public buildings and structures} -\item{privC}{Private captial stock} -\item{gsp}{Gross state product} -\item{emp}{Labour input measured by the employment in non-agricultural payrolls} -\item{unemp}{State unemployment rate capture elements of the business cycle} -\item{X}{Easting in metres from USA Contiguous Equidistant Conic projection (ESRI:102005)} -\item{Y}{Northing in metres from USA Contiguous Equidistant Conic projection (ESRI:102005)} -} -} -\source{ -Croissant, Yves, Giovanni Millo, and Kevin Tappe. 2022. Plm: Linear Models for Panel Data -} -\usage{ -productivity -} -\description{ -A dataset of annual economic productivity data for the 48 contiguous US states (with Washington DC merged into Maryland), from 1970 to 1985 (17 years) in long format. The data productivity data table was extracted from the \code{plm} package. -} -\examples{ -data(productivity) -} -\keyword{datasets} diff --git a/man/stgam-package.Rd b/man/stgam-package.Rd index e316e43..b63c5f4 100644 --- a/man/stgam-package.Rd +++ b/man/stgam-package.Rd @@ -6,7 +6,7 @@ \alias{stgam-package} \title{stgam: Spatially and Temporally Varying Coefficient Models Using Generalized Additive Models} \description{ -A framework for specifying spatially, temporally and spatially-and-temporally varying coefficient models using Generalized Additive Models with Gaussian Process smooths. The smooths are parameterised with location and / or time attributes. Importantly the framework supports the investigation of the presence and nature of any space-time dependencies in the data, allows the user to evaluate different model forms (specifications) and to pick the most probable model or to combine multiple varying coefficient models using Bayesian Model Averaging. For more details see: Brunsdon et al (2023) \doi{10.4230/LIPIcs.GIScience.2023.17}, Comber et al (2023) \doi{10.4230/LIPIcs.GIScience.2023.22} and Comber et al (2024) \doi{10.1080/13658816.2023.2270285}. +A framework for undertaking space and time varying coefficient models (varying parameter models) using a Generalized Additive Model (GAM) with smooths approach. The framework suggests the need to investigate for the presence and nature of any space-time dependencies in the data. It proposes a workflow that creates and refines an initial space-time GAM and includes tools to create and evaluate multiple model forms. The workflow sequence is to: i) Prepare the data by lengthening it to have a single location and time variables for each observation. ii) Create all possible space and/or time models in which each predictor is specified in different ways in smooths. iii) Evaluate each model via their AIC value and pick the best one. iv) Create the final model. v) Calculate the varying coefficient estimates to quantify how the relationships between the target and predictor variables vary over space, time or space-time. vi) Create maps, time series plots etc. The number of knots used in each smooth can be specified directly or iteratively increased. This is illustrated with a climate point dataset of the dry rain forest in South America. This builds on work in Comber et al (2024) \doi{10.1080/13658816.2023.2270285} and Comber et al (2004) \doi{10.3390/ijgi13120459}. } \seealso{ Useful links: @@ -22,6 +22,7 @@ Useful links: Other contributors: \itemize{ \item Paul Harris \email{paul.harris@rothamsted.ac.uk} [contributor] + \item Gonzalo Irisarri \email{jirisarr@uwyo.edu} [contributor] \item Chris Brunsdon \email{christopher.brunsdon@mu.ie} [contributor] } diff --git a/man/us_data.Rd b/man/us_data.Rd deleted file mode 100644 index 78c9def..0000000 --- a/man/us_data.Rd +++ /dev/null @@ -1,29 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/data.R -\docType{data} -\name{us_data} -\alias{us_data} -\title{US States boundaries} -\format{ -A \code{sf} polygon dataset with 48 rows and 6 fields. -\describe{ -\item{GEOID}{The state code} -\item{NAME}{The name of the state} -\item{REGION}{The region} -\item{total_pop_10}{Population in 2010} -\item{total_pop_15}{Population in 2015} -} -} -\source{ -Bivand, Roger, Jakub Nowosad, and Robin Lovelace. 2019. spData: Datasets for Spatial Analysis. R package -} -\usage{ -us_data -} -\description{ -A dataset of of the boundaries of 48 contiguous US states (with Washington DC merged into Maryland), extracted from the \code{spData} package. -} -\examples{ -data(us_data) -} -\keyword{datasets} diff --git a/stgam.Rproj b/stgam.Rproj index 766b3b2..7215c21 100644 --- a/stgam.Rproj +++ b/stgam.Rproj @@ -1,4 +1,5 @@ Version: 1.0 +ProjectId: f30b71e5-e877-4da4-a6c0-da9eeaae3c82 RestoreWorkspace: No SaveWorkspace: No diff --git a/vignettes/.gitignore b/vignettes/.gitignore index 773e352..fec9a60 100644 --- a/vignettes/.gitignore +++ b/vignettes/.gitignore @@ -1,3 +1,3 @@ *.html *.R -*.Rmd + diff --git a/vignettes/appraise_plot_1.png b/vignettes/appraise_plot_1.png new file mode 100644 index 0000000..5c58938 Binary files /dev/null and b/vignettes/appraise_plot_1.png differ diff --git a/vignettes/appraise_plot_2.png b/vignettes/appraise_plot_2.png new file mode 100644 index 0000000..1923b01 Binary files /dev/null and b/vignettes/appraise_plot_2.png differ diff --git a/vignettes/appraise_plot_3.png b/vignettes/appraise_plot_3.png new file mode 100644 index 0000000..c5a9029 Binary files /dev/null and b/vignettes/appraise_plot_3.png differ diff --git a/vignettes/appraise_plot_4.png b/vignettes/appraise_plot_4.png new file mode 100644 index 0000000..d0e33ff Binary files /dev/null and b/vignettes/appraise_plot_4.png differ diff --git a/vignettes/model_summaries.RData b/vignettes/model_summaries.RData new file mode 100644 index 0000000..fb72233 Binary files /dev/null and b/vignettes/model_summaries.RData differ diff --git a/vignettes/space-time-gam-intro.Rmd b/vignettes/space-time-gam-intro.Rmd deleted file mode 100644 index 91b2795..0000000 --- a/vignettes/space-time-gam-intro.Rmd +++ /dev/null @@ -1,337 +0,0 @@ ---- -title: "Introduction to space-time GAMS with `stgam`" -author: "Lex Comber, Paul Harris and Chrs Brunsdon" -date: "June 2024" -output: rmarkdown::html_vignette -bibliography: vignette.bib - -vignette: > - %\VignetteIndexEntry{Introduction to space-time GAMS with `stgam`} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>", - options(width = 90) -) -``` - -## Overview - -The aim of this vignette is provide an introduction to the `stgam` package and demonstrates how to construct a spatially vary coefficient (SVC) model and spatial and temporally varying (STVC) model using the `stgam` package. Essentially `stgam` provides a wrapper for constructing GAMs with Gaussian Process (GP) smooths or splines, that are parameterised with location for SVCs and with location and time for STVCs. The key ideas underpinning the development of SVCs with GAMs in the `stgam` package and this vignette are: - -1. Standard linear regression assumes that predictor-to-response relationships to be the same throughout the study area. -2. This is often not the case when when location is considered, for example of outliers. -3. Many geographic processes have a Gaussian form when they are examined over 2-dimensional space, as they essentially exhibit distance decay. -3. The GAMs can include smooths or splines of different forms. One such form is a Gaussian Process (GP) spline. -4. GP splines can be specified to model non-linearity (wiggliness) over geographic space if location is included with the covariate. -5. A GAM with GP splines parameterised by location - a Geographic GP GAM or GGP-GAM - defines a spatially varying coefficient (SVC) model. -6. This can be extended to time and space-time. - -The approach is presented in outline below, but detail of the SVCs, TVCs and STVCs constructed using GAMs with GP smooths, and the evolution of their application from spatial models to space-time coefficient modelling can be found in @comber2024multiscale and @comber2024gtgpjgis. In this, the concept of a Gaussian Process [@Wood2020] is important in the context of regression modelling. It provides a *data model* of the likelihood that a given data set is generated given a statistical model involving some unknown parameters and in regression modelling, the unknown parameters are the regression parameters. These are described formally in @comber2024multiscale and @comber2024gtgpjgis. - -One final comment is that in this vignette you will all of the varying coefficient models you create assume that spatial, temporal or spatio-temporal dependencies are present in the data. This may not be the case and these assumptions are examined in detail in the second vignette in the `stgam` package. - -In this vignette you will: - -- Create and interpret a simple SVC using GAMs with GP smooths. -- Create and interpret a simple TVC GAM. -- Extend this to a GP GAM that includes both space and time in smooths to define a STVC model. -- Be encourages to reflect on how space and time interact and the assumptions embedded in the model specification. -Be warned! the Vignette only uses one function from the `stgam` package! - -## Data and Packagees - -You should install the `stgam` package either from CRAN or from GitHub: - -```{r eval = F} -install.packages("stgam", dependencies = TRUE) -remotes::install_github("lexcomber/stgam") -``` - -And then make sure the required packages and data are loaded: - -```{r, warning=F, message=F} -# load the packages -library(stgam) -library(cols4all) # for nice shading in graphs and maps -library(cowplot) # for managing plots -library(dplyr) # for data manipulation -library(ggplot2) # for plotting and mapping -library(glue) # for model construction -library(mgcv) # for GAMs -library(sf) # for spatial data -library(doParallel) # for parallelising operations -library(purrr) # for model construction -library(tidyr) # for model construction -# load the data -data(productivity) -data(us_data) -``` - -The `productivity` data is annual economic productivity data for the 48 contiguous US states (with Washington DC merged into Maryland), for years 1970 to 1985 (17 years). This was extracted from the `plm` package [@croissant2022plm]. The `us_data` is a spatial dataset of the US states in a USA Contiguous Equidistant Conic projection (ESRI:102005) from the `spData` package [@bivand2019spdata]. The `productivity` data includes locational information of the state geometric centroid in the same projection. The code below maps the `X` and `Y` locations in `productivity` along with the US state areas. - -```{r locationplot, message = F, warning=F, fig.height = 4, fig.width = 7, fig.cap = "The US States and the geoemtric centroids used as locations."} -ggplot() + geom_sf(data = us_data, fill = NA) + - geom_point(data = productivity |> filter(year == "1970"), aes(x = X, y = Y)) + - theme_bw() + ylab("") + xlab("") -``` - -The data attributes can be examined and a spatial model of constructed using `gam` function from the `mgcv` package. The varying coefficient models created below all focus on Private capital stock (`privC`) as the target variable, with Unemployment (`unemp`) and Public capital (`pubC`) covariates, where the coefficient functions are assumed to be realisation of a Gaussian Process (GP) introduced above. -```{r} -head(productivity) -``` - -## A simple SVC - -A spatially varying coefficient model will be created using the `productivity` dataset. - -The code below defines an intercept column (`Intercept`) in the data. This to allow the intercept to be treated as an addressable term in the model. It also defines parametric and non-parametric forms for the intercept and each covariate, so that they can can take a global form (i.e. as in a standard OLS regression) and a spatially varying form in the GP smooth. - -```{r} -# define intercept term -productivity <- productivity |> mutate(Intercept = 1) -# create the SVC -svc.gam = gam(privC ~ 0 + - Intercept + s(X, Y, bs = 'gp', by = Intercept) + - unemp + s(X, Y, bs = "gp", by = unemp) + - pubC + s(X, Y, bs = "gp", by = pubC), - data = productivity |> filter(year == "1970")) -``` -Notice the `0 +` in the model. This indicates that the intercept coefficient is not included implicitly and it is included explicitly as `Intercept`. Also notice the different form of the splines from those specified in Part 1. Here, for each covariate, a GP smooth is specified for `X` and `Y` (the coordinates in geographic space) and the covariate is included via the `by` parameter. This is to explore the interaction of the covariate with the target variable over the space defined by `X` and `Y` locations, allowing spatially varying coefficients to be generated. The model has 4 key terms specified in the same way by ` + (X, Y, b s= 'gp', by = )`: - -- the `` is the fixed parametric term for the covariate -- `s(...)` defines the smooth -- `bs = 'gp'` states that this is a GP smooth -- `by = ` suggests the GP should be multiplied by variable - -The model output can be assessed: the `k'` and `edf` parameters are quite close, but the p-values are high and and the `k-index` is greater than 1 so this looks OK. The diagnostic plots are again generated by the `gam.check` function as below: - -```{r ch2gamcheck, fig.height = 7, fig.width = 7, fig.cap = "The GAM GP SVC diagnostic plots."} -# check -gam.check(svc.gam) -# model summary -summary(svc.gam) -``` - -Here it can be seen that: - -1. The model is well tuned: all of the all effective degrees of freedom (`edf` are well below `k` in the `gam.check()` printed output (the `k-index<1` issue is not important because of this). -1. All of the the fixed parametric terms are significant. -1. Of the smooth terms, only `pccap` is locally significant and spatially varying. - -The spatially varying coefficient estimates can be extracted using `predict`. To do this a dummy data set is created with the `pubC` term set to 1, and the intersect and `unemp` terms set to zero. The result is that the predicted values for the coefficient estimate are just a function of $\beta_2$, the `pubC` coefficient estimate at each location. - -```{r} -get_b2<- productivity |> filter(year == "1970") |> mutate(Intercept = 0, unemp = 0, pubC = 1) -res <- productivity |> filter(year == "1970") |> - mutate(b2 = predict(svc.gam, newdata = get_b2)) -``` - -The resulting `data.frame` called `res` has a new variable called `b2` which is the spatially varying coefficient estimate for `pubC`. For comparison, we can generate the spatially varying coefficient estimate for the intercept ($\beta_0$) and `unemp` $\beta_1$ (which were not found to be significant locally) in the same way by setting the other terms in the model to zero: - -```{r} -get_b0 <- productivity |> filter(year == "1970") |> mutate(Intercept = 1, unemp = 0, pubC = 0) -res <- res |> mutate(b0 = predict(svc.gam, newdata = get_b0)) -get_b1 <- productivity |> filter(year == "1970") |> mutate(Intercept = 0, unemp = 1, pubC = 0) -res <- res |> mutate(b1 = predict(svc.gam, newdata = get_b1)) -``` - -So `res` has the records for the year 1970 and three new columns for `b0`, `b1` and `b2` The distribution of the spatially varying coefficient estimates can be examined: - -```{r eval = T} -res |> select(b0, b1, b2) |> - apply(2, summary) |> round(2) -``` - -The `stgam` package has a function to extract the varying coefficients, `calculate_vcs`. This takes three arguments, the GAM varying coefficient model, the model terms, and the data used to create the model: -```{r} -terms = c("Intercept", "unemp", "pubC") -res <- calculate_vcs(model = svc.gam, - terms = terms, - input_data = productivity |> filter(year == "1970")) -summary(res[, paste0("b_",terms)]) -``` - -Standard `dplyr` and `ggplot` approaches can be used to join and map the coefficient estimates, formally $\beta_0$, $\beta_1$ and $\beta_2$) as in the figure below. Notice the North-South trend for the Intercept and the East-West and trend for Unemployment - both insignificant predictors of `privC` - and a much stronger specific spatial pattern between the target variable and Public capital (`pubC`), with particularity high coefficient estimates in the south. - -```{r ch2svccoefs, fig.height = 7, fig.width = 7, fig.cap = "The spatially varying coefficient (SVC) estimates."} -# join the data -map_results <- - us_data |> left_join(res |> select(GEOID, b_Intercept, b_unemp, b_pubC), by = "GEOID") -# plot the insignificant coefficient estimates -tit =expression(paste(""*beta[0]*"")) -p1 <- - ggplot(data = map_results, aes(fill=b_Intercept)) + - geom_sf() + - scale_fill_continuous_c4a_div(palette="brewer.spectral",name=tit) + - coord_sf() + - ggtitle("Intercept: not significant") -tit =expression(paste(""*beta[1]*"")) -p2 <- - ggplot(data = map_results, aes(fill=b_unemp)) + - geom_sf() + - scale_fill_continuous_c4a_div(palette="brewer.spectral",name=tit) + - coord_sf() + - ggtitle("Unemployment: not significant") -# plot the significant pubC coefficient estimates -tit =expression(paste(""*beta[2]*" ")) -p3 <- - ggplot(data = map_results, aes(fill=b_pubC)) + - geom_sf() + - scale_fill_continuous_c4a_div(palette="brewer.prgn",name=tit) + - coord_sf() + - ggtitle("Public captial: significant") -plot_grid(p1, p2, p3, ncol = 1) -``` - -## A simple TVC - -The `productivity` data was filtered for 1970 the the SVC above in which the `X-Y` location of the 48 states was used to parametrise the Gaussian Process smooths. The same structure can be used to create a temporally vary coefficient model (TVC), with smooths specified to include the `year` parameter, but this time not restricting the analysis data to records from a single year: -```{r} -# create the TVC -tvc.gam = gam(privC ~ 0 + - Intercept + s(year, bs = 'gp', by = Intercept) + - unemp + s(year, bs = "gp", by = unemp) + - pubC + s(year, bs = "gp", by = pubC), - data = productivity) -``` - -The model can be inspected in the same way to examine the the `k'` and `edf` parameters using the `gam.check` function and again there are no concerns: - -```{r eval = F} -gam.check(tvc.gam) -summary(tvc.gam) -``` -The model summary indicates that all of the parametric and temporally vary coefficients are significant at the 95% level except the parametric one for Unemployment, but there are some interesting patterns in the residuals, as reflected in the diagnostics plots above and the model fit ($R^2$) value. - -The temporally varying coefficients can be extracted in the same way as the DVC approach, explored using the `predict` function and setting each of the covariates to 1 and the others to zero in turn: - -```{r} -terms = c("Intercept", "unemp", "pubC") -res <- calculate_vcs(model = tvc.gam, - terms = terms, - input_data = productivity) -summary(res[, paste0("b_",terms)]) -``` -The variation in coefficient estimates can be inspected over time, remembering that each State has the same coefficient estimate value for each year, so just one state is selected in the code below (you could chose another and the result would be the same). Notice the linear declines due to the linearity of the time covariate in the figure below. - -```{r ch2tvccoefs, eval = T, echo = T, fig.height = 3, fig.width = 7, message=F, warning=F, fig.cap = "Trends in the temporally varying coefficient estimates."} -res |> - filter(state == "ARIZONA") |> - select(year, b_Intercept, b_unemp, b_pubC) |> - pivot_longer(-year) |> - ggplot(aes(x = year, y = value)) + - geom_point() + geom_line() + - facet_wrap(~name, scale = "free") + - theme_bw() + xlab("Year") + ylab("") -``` - -## Spatially and Temporally Varying Coefficient (STVC) models - -We can combine space and time in GAM GP splines. But how? We could use separate smooths for location and for time, or a single, 3D smooth parameterised with both location and time. There are assumptions associated with each of these. The code below specifies the interaction of the covariates within a single space-time GP smooth. You will notice this takes a few seconds longer to run, and choice of how to specify the smooths is explored in the second vignette in the `stgam` package. - -```{r} -stvc.gam = gam(privC ~ 0 + - Intercept + s(X, Y, year, bs = 'gp', by = Intercept) + - unemp + s(X, Y, year, bs = "gp", by = unemp) + - pubC + s(X, Y, year, bs = "gp", by = pubC), - data = productivity) -``` - -The model can be inspected in the usual way using the `gam.check` function (again, no concerns) and the `summary` function with $k$ and $edf$ despite the concerning `k-index` and `p-value` values. In this case all of the parametric and smooth coefficient estimates are significant at the 95% level, and the model fit ($R^2$) has again increased over the SVC model. - -```{r eval = F} -gam.check(stvc.gam) -summary(stvc.gam) -``` - -The coefficients spatial and temporally vary coefficients can be extracted in the same way as before and the variation in coefficient estimates from the STVC-GAM model summarised: -```{r} -terms = c("Intercept", "unemp", "pubC") -res <- calculate_vcs(model = stvc.gam, - terms = terms, - input_data = productivity) -summary(res[, paste0("b_",terms)]) -``` - -This indicates that a positive relationship between Private capital with with Public capital and mixed positive and negative one with Unemployment and the Intercept. - -It is instructive to unpick some of the model coefficients in more detail and the code below summarises variations over time through the median values of each coefficient estimate: - -```{r eval = T} -res |> - select(year, b_Intercept, b_unemp, b_pubC) |> - group_by(year) |> - summarise(med_b0 = median(b_Intercept), - med_b1 = median(b_unemp), - med_b2 = median(b_pubC)) -``` -It is evident that of the 2 covariates and the intercept used to model Private capital, only Public capital (`b2`) varies (increases) over time. This increase is shown visually below . - -```{r ch2stvccoefsbox, echo = T, fig.height = 4, fig.width = 7, fig.cap = "The temporal variation of the Public capital coefficient estimates over 17 years.", fig.pos = 'h'} -# inputs to plot -res |> select(starts_with("b"), year) |> - mutate(year = "All Years") -> tmp -cols = c(c4a("tableau.red_gold", n = 17, reverse = T), "grey") -tit =expression(paste(""*beta[`Private Capital`]*"")) -# plot -res |> select(starts_with("b"), year) |> - rbind(tmp) |> - mutate(year = factor(year)) |> - ggplot(aes(y = year, x = b_pubC, fill = year)) + - geom_boxplot(outlier.alpha = 0.1) + - scale_fill_manual(values=cols, guide = "none") + - theme_bw() + xlab(tit) + ylab("Time") -``` - -The spatial pattern of this temporal trend can also be explored as below. This shows that the increasing intensity of the effect of Public capital on Private capital does not vary spatially: the increase in effect is spatially even, with high values in the south. - -```{r, ch2stvccoefsmap, message = F, warning = F, fig.height = 4, fig.width = 7, fig.cap = "The spatial variation of the Unemployment coefficient estimates over time."} -tit =expression(paste(""*beta[`Public Capital`]*"")) -# join the data -map_results <- - us_data |> left_join(res |> select(GEOID, year, b_Intercept, b_unemp, b_pubC), by = "GEOID") -# create the plot -map_results |> - ggplot() + geom_sf(aes(fill = b_pubC), col = NA) + - scale_fill_binned_c4a_seq(palette="scico.lajolla", name = tit) + - facet_wrap(~year) + - theme_bw() + xlab("") + ylab("") + - theme( - strip.background =element_rect(fill="white"), - strip.text = element_text(size = 8, margin = margin()), - legend.position = "inside", legend.position.inside = c(0.7, 0.1), - legend.direction = "horizontal", - legend.key.width = unit(1, "cm"), - axis.title.x=element_blank(), - axis.text.x=element_blank(), - axis.ticks.x=element_blank(), - axis.title.y=element_blank(), - axis.text.y=element_blank(), - axis.ticks.y=element_blank()) -``` - -## Summary - -The rationale for using GAMs with GP splines for spatially varying coefficient (SVC) or temporally varying coefficient (TVC) models is as follows: - -- GAMs with splines or smooths capture non-linear relationships between the response variable and covariates. -- splines generate a varying coefficient model when they are parameterised with more than one variable. -- this is readily extending to the temporal and / or spatial dimensions to generate SVCs, TVCs and STVCs. -- different splines are available, but GP splines reflect Tobler's First Law of Geography (spatial autocorrelation, process spatial heterogeneity, etc). -- this can be extended to the temporal case on the assumption of temporal decay (similarity decreases over time). -- GAMs are robust, have a rich theoretical background and been subject to much development. - -Initial research has demonstrated the formulation and application of a GAM with GP splines calibrated via observation location as a multiscale SVC model: the Geographical Gaussian Process GAM (GGP-GAM) [@comber2024multiscale]. The GGP-GAM was compared with the most popular SVC model, Geographically Weighted Regression (GWR) [@brunsdon1996geographically] and shown to out-perform Multiscale GWR. - -However, when handling space *and* time, simply plugging all the space time data into specific GAM configuration is to make potentially unreasonable assumptions about how space and time interact in spatially and temporally varying coefficient (STVC) models. To address this workshop has suggested that the full set of models is investigated to identify the best model or the best set of models. Where there is a clear winner, this can be applied. Where these is not, as in the example used then the model coefficients can be combined using Bayesian Model Averaging (in the second vignette). - - -## References - - - diff --git a/vignettes/space-time-gam-model-probs-BMA.Rmd b/vignettes/space-time-gam-model-probs-BMA.Rmd deleted file mode 100644 index 50d55f7..0000000 --- a/vignettes/space-time-gam-model-probs-BMA.Rmd +++ /dev/null @@ -1,380 +0,0 @@ ---- -title: "Determining Space-Time model form and Bayesian Model Avergaing (BMA) with `stgam`" -author: "Lex Comber, Paul Harris and Chrs Brunsdon" -date: "June 2024" -output: rmarkdown::html_vignette -bibliography: vignette.bib -vignette: > - %\VignetteIndexEntry{Determining Space-Time model form and Bayesian Model Avergaing (BMA) with `stgam`} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>", - options(width = 90), - options(mc.cores=2) -) -``` - -## Overview - -The introductory vignette (*'Introduction to space-time GAMS with `stgam`'*) demonstrated how to construct varying coefficient models using GAMs with Gaussian Process smooths (splines). The SVC, TVC and STVC models all had a similar form, with each covariate specified as a fixed parametric term and in space, time or space-time GP smooth. By way of example the SVC model is repeated below, the the TVC and STVC had a specified the covariates and GPs in a similar way. - -```{r eval = F} -stvc.gam = gam(privC ~ 0 + - Intercept + s(X, Y, year, bs = 'gp', by = Intercept) + - unemp + s(X, Y, year, bs = "gp", by = unemp) + - pubC + s(X, Y, year, bs = "gp", by = pubC), - data = productivity) -``` - -The `stvc.gam` model above is specified in a way that assumes the presence of some spatio-temporal dependencies (or interactions) between the target and the predictor variables. However this assumption may be incorrect and the model may be incorrectly specified. - -This vignette - -- Reflects on the assumptions associated with different model forms or specifications. -- Constructs different varying coefficient models and evaluates them using BIC. -- Constructs probabilities to determine the best model or best set of competing models. -- Undertakes Bayesian Model Averaging to combine the highly probable models. - -The code below loads the packages and data: -```{r, warning=F, message=F} -# load the packages -library(stgam) -library(cols4all) # for nice shading in graphs and maps -library(cowplot) # for managing plots -library(dplyr) # for data manipulation -library(ggplot2) # for plotting and mapping -library(glue) # for model construction -library(mgcv) # for GAMs -library(sf) # for spatial data -library(doParallel) # for parallelising operations -library(purrr) # for model construction -library(tidyr) # for model construction -# load the data -data(productivity) -data(us_data) -``` - -## Considering model form - -How should you specify your varying coefficient model? - -The STCV model specified above included space and time in a single smooth (spline) for each the covariates. This is to assume that that spatial and temporal processes *do* interact and that the *temporal trends* in predictor-target variable relationships will *vary with location*. But is this assumption correct? If you have a specific a hypothesis that you are testing or working under a particular theory related to how space and time interact in the process you are examining, then you can simply specify how the covariates interact with target variable. However, more commonly we are seeking inference (understanding) about *how* processes interact in space and time. - -In this vignette, these interactions are explored using a data driven (rather than theoretical) approach. A series of different models are created, each with different assumptions, and they are evaluated to determine which one of them is, or which set of them are, the most probable. - -Multiple models can be created to explore all possible combinations of interaction and then to select the best model, through a probability based measure like BIC (see @comber2024gtgpjgis for a full treatment of this). There are a total 6 possible way that each covariate could be specified in the model: - -i. It is omitted. -ii. It is included as a parametric response with no spline. -iii. It is included in a spline with location. -iv. It is included a spline with time. -v. It is included in a single spline with location and time. -vi. It is included in 2 separate splines with location and time. - -The intercept can be treated similarly, but without it being absent (i.e. 5 options). - -To investigate STVC model form, a series of models can be specified, with each combination of the 6 permutations for each covariate, plus 5 states for the intercept. The best model(s) can be determined by quantifying the likelihood (probability) of each of model being the correct model. This can be approximated using the Bayesian Information Criterion (BIC) [@schwarz1978estimating] as described in full in @comber2024gtgpjgis, and if the probabilities for multiple models are high, then the models can be combined using a Bayesian Model Averaging (BMA) approach. BMA is described in the context of spatial modelling in @fragoso2018bayesian and summarised in @brunsdon2023gisci, but in brief, if a number of competing models exist with at least one quantity of interest that all have in common, and the likelihoods of each of them being the correct model is known (e.g. from BIC ), then a posterior distribution of the quantity of interest can be obtained by averaging them using the likelihoods as weights. In this way it allows competing models, treating space and time in different ways, to be combined. - -## Creating multiple models: a walk-through - -To create STVC multiple models, the code below first defines a grid of numbers for each covariate form. This is passed to a function to create the formula specifying each model, with different terms and smooths, which in turn is passed to the `gam` function. - -```{r eval = T} -# define intercept term -productivity <- productivity |> mutate(Intercept = 1) -# define grid of combinations (nrow = 180) -terms_gr = expand.grid(intcp = 1:5, unemp = 1:6, pubC = 1:6) -# examine a random slice -terms_gr |> slice_sample(n = 6) -``` - -A function is defined to create the equations: here this is bespoke to the covariate names and number in the `productvity` data: -```{r} -# define a function to make the equations -makeform_prod <- function(intcp, unemp, pubC, bs='gp') { - #coords <- c("X,Y", "X2,Y2")[coords] - intx <- c("", - glue("+s(year,bs='{bs}',by=Intercept)"), - glue("+s(X,Y,bs='{bs}',by=Intercept)"), - glue("+s(X,Y,bs='{bs}',by=Intercept) + s(year,bs='{bs}',by=Intercept)"), - glue("+s(X,Y,year,bs='{bs}',by=Intercept)"))[intcp] - unempx <- c("", - "+ unemp", - glue("+s(year,bs='{bs}',by=unemp)"), - glue("+s(X,Y,bs='{bs}',by=unemp)"), - glue("+s(X,Y,bs='{bs}',by=unemp) + s(year,bs='{bs}',by=unemp)"), - glue("+s(X,Y,year,bs='{bs}',by=unemp)"))[unemp] - pubCx <- c("", - "+ pubC", - glue("+s(year,bs='{bs}',by=pubC)"), - glue("+s(X,Y,bs='{bs}',by=pubC)"), - glue("+s(X,Y,bs='{bs}',by=pubC) + s(year,bs='{bs}',by=pubC)"), - glue("+s(X,Y,year,bs='{bs}',by=pubC)"))[pubC] - return(formula(glue("privC~Intercept-1{unempx}{intx}{pubCx}"))) -} -``` -To see how this works, the code below passes some numbers to it. -```{r} -makeform_prod(intcp = 5, unemp = 2, pubC = 4, bs='gp') -``` -Next a function to undertake the analysis and record the BIC, return the indices and the formula is defined. Not that this has the `terms_gr` object defined above embedded in it, taking just an index of the grid row number as input: -```{r} -do_gam = function(i){ - f <- makeform_prod(intcp = terms_gr$intcp[i], - unemp = terms_gr$unemp[i], - pubC = terms_gr$pubC[i], - bs='gp') - m = gam(f,data=productivity) - bic = BIC(m) - index = data.frame(intcp = terms_gr$intcp[i], - unemp = terms_gr$unemp[i], - pubC = terms_gr$pubC[i]) - f = paste0('privC~', as.character(f)[3] ) - return(data.frame(index, bic, f)) - #return(bic) -} -``` -This can be tested: -```{r eval = F} -terms_gr[100,] -do_gam(100) -``` - -Finally, this can be put in a loop to evaluate all of the potential space-time models: -```{r dogam, eval = T} -t1 = Sys.time() -res_gam <- NULL -for(i in 1:nrow(terms_gr)) { - res.i = do_gam(i) - res_gam = rbind(res_gam, res.i) -} -Sys.time() - t1 -``` -For more complex problems, you could parallelise the loop if you have a large multivariate analyses. The output is the same as the `for` loop above, it is just created more quickly. -```{r dogam2, eval = F, warning=F, message=F} -# set up the parallelisation -library(doParallel) -cl <- makeCluster(detectCores()-1) -registerDoParallel(cl) -# do the parallel loop -t1 = Sys.time() -res_gam <- - foreach(i = 1:nrow(terms_gr), - .combine = 'rbind', - .packages = c("glue", "mgcv", "purrr")) %dopar% { - do_gam(i) - } -Sys.time() - t1 -# release the cores -stopCluster(cl) -# have a look -head(res_gam) -``` - -## Evaluating multiple models: a walk-through - -Having generated STVC multiple models and recorded the BIC values for them, it is possible to generate probabilities for each model and evaluate them. The logics and supporting equations for this evaluations of each mode are detailed in @comber2024gtgpjgis. The results need to be sorted, the best 10 models identified, their structures extracted and then their relative probabilities calculated: -```{r} -# sort the results -mod_comp <- tibble( - res_gam) |> - rename(BIC = bic) |> - arrange(BIC) -# transpose the indices to to model terms -# rank and return the top 10 results -int_terms <- \(x) c("Fixed","s_T", "s_S", "s_T + S_S", "s_ST")[x] -var_terms <- \(x) c("---", "Fixed","s_T", "s_S", "s_T + s_S", "s_ST")[x] -mod_comp_tab <- - mod_comp |> - slice_head(n = 10) |> - mutate(across(unemp:pubC,var_terms)) |> - mutate(intcp = int_terms(intcp)) |> - rename(`Intercept` = intcp, - `Unemployment.` = unemp, - `Public Captial` = pubC) |> - mutate(Rank = 1:n()) |> - relocate(Rank) |> - select(-f) -# determine the relative probabilities -# ie relative to the top ranked model -p1_vec = NULL -for(i in 2:10) { - p1 = exp(-(mod_comp_tab$BIC[i]-mod_comp_tab$BIC[1])/2) - p1 = p1/(1+p1) - p1_vec = c(p1_vec, p1) -} -mod_comp_tab$`Pr(M)` = c("--", paste0(format(round(p1_vec*100, digits=1), nsmall = 1), "%")) -``` -The results can be examined: -```{r eval = T} -mod_comp_tab -``` -The results are sorted and suggest that nine of the models are highly probable, each with probabilities of better than the best ranked model of >10%. These are candidates for Bayesian Model Averaging. The are some commonalities in the specification of the 9 models: - -- It does not matter whether the Intercept varies over space or time or together either interdependently or simultaneously; -- Unemployment can be absent, fixed or in a temporal smooth. -- Public Capital interacts simultaneously over space and time. - - -## Creating and Evaluating multiple models using `stgam` functions - -The process above had a number of stages: -- a grid was defined with indices for each variable defining how it is specified in each model -- this was used to create formula specifying each variable in different ways -- each model was constructed and the BIC calculated using a `for` loop or a parallelised approach -- the probability for each model was determined and the top 10 models returned - -The `stgam` package has generic function that wrap these operations and the code below applies them to the STVC problem. The `evaluate_models` function creates and evaluates the different models (it may take a minute or so to run): - -```{r evalmods} -stvc_res_gam = evaluate_models(input_data = productivity, - target_var = "privC", - covariates = c("unemp", "pubC"), - coords_x = "X", - coords_y = "Y", - STVC = TRUE, - time_var = "year") -``` - -The results can be compared with the `res_gam` object created earlier: -```{r eval = F} -head(res_gam) -head(stvc_res_gam) -``` - -The model probabilities can be extracted using the `gam_model_probs` function, here again suggesting that nine space time models are equally as probable: -```{r} -stvc_mods = gam_model_probs(stvc_res_gam, n = 10) -stvc_mods -``` - -Of particular interest are the forms of the covariates in in each high ranking model: - -- The Intercept is variously in a spatial, temporal, spatial plus temporal or space-time smooth. -- Unemployment is either absent, fixed or in a temporal smooth. -- Public capital (`pubC`) is always specified with a space-time smooth. - -We should expect to see these trends when the models are averaged. Note, also the function has here calculated the *relative probabilities* for the STVC models because the individual BIC values resulted in probabilities that were too close to zero to be machine encodable. - -The functions can be applied to a SVC problem: -```{r evalmods2} -svc_res_gam = evaluate_models(input_data = productivity |> filter(year == "1970"), - target_var = "privC", - covariates = c("unemp", "pubC"), - coords_x = "X", - coords_y = "Y", - STVC = FALSE, - time_var = NULL) -# head(svc_res_gam) -svc_mods = gam_model_probs(svc_res_gam, n = 10) -svc_mods -``` -For the SVC models, the results indicate that 5 models are highly likely (greater than 10% probability) to be the best model, all specifying a Public capital (`pubC`) smooth that varies locally over space, with Unemployment (`unemp`) either removed or globally constant (fixed), and the Intercept either fixed or varying with location. Interesting the SVC model in which all three covariates are specified with a spatial GP smooth (the SVC created in the first vignette) is the 6th ranked model and with a probability of less than 1/1000 of being the correct model. - -If only one model was highly probable, then this could be specified. The code below does this for the top SVC model: -```{r eval = F} -productivity <- productivity |> mutate(Intercept = 1) -f = as.formula(svc_mods$f[1]) -svc.gam = gam(f, data = productivity |> filter(year == "1970")) -summary(svc.gam) -``` - -## Combining multiple models using Bayesian Model Averaging - -For both the SVC and STVC cases, a number of models were highly probable. It is possible to combine these models using the probabilities as weights combine (average) the coefficient estimates, under a Bayesian Model Averaging approach (see @comber2024gtgpjgis for details). The code below applies the `do_bma` function to the summary tables generated above. The code can be used to construct the BMA coefficients from absolute or relative probabilities. - -```{r dobma} -# SVC with absolute probabilities -svc_bma <- do_bma(model_table = svc_mods, - terms = c("Intercept", "unemp", "pubC"), - thresh = 0.1, - relative = FALSE, - input_data = productivity |> filter(year == "1970")) -# STVC with relative probabilities -stvc_bma <- do_bma(model_table = stvc_mods, - terms = c("Intercept", "unemp", "pubC"), - thresh = 0.1, - relative = TRUE, - input_data = productivity) -``` - -The results can be joined back to the spatial layer, in this case `us_data` to be mapped: -```{r bmamap1, fig.height = 4, fig.width = 7, fig.cap = "The spatial variation of the Public captial covriate generated using a Bayesian Model Avaergaing approach."} -# join -svc_bma_sf <- - us_data |> - select(GEOID) |> - left_join(svc_bma) -# map -tit =expression(paste(""*beta[`Public Capital`]*" ")) -ggplot(data = svc_bma_sf, aes(fill=b_pubC)) + - geom_sf() + - scale_fill_continuous_c4a_div(palette="brewer.blues",name=tit) + - coord_sf() + - theme_void() -``` - -The variations in the averaged BMA STVC coefficient estimates can be mapped in a similar way by linking to the `us_data` spatial layer, and here we see the nature of the temporal variation in the Unemployment and spatio-temporal variation in Public capital covariates as expected. - - -```{r bmamap2, message = F, warning = F, fig.height = 8, fig.width = 7, fig.cap = "The spatial variation of coefficient estimatess for BMA Unemployment and Public capital covriates over time, generated Bayesian Model Avaergaing approach."} -# link the data -stvc_bma_sf <- - us_data |> - select(GEOID) |> - left_join(stvc_bma) -# create the plots -tit =expression(paste(""*beta[`Unemployment`]*"")) -p1 = stvc_bma_sf |> - ggplot() + geom_sf(aes(fill = b_unemp), col = NA) + - scale_fill_binned_c4a_seq(palette="scico.lajolla", name = tit) + - facet_wrap(~year) + - theme_bw() + xlab("") + ylab("") + - theme( - strip.background = element_rect(fill="white", colour = "white"), - strip.text = element_text(size = 8, margin = margin(b=4)), - legend.position = c(.7, .1), - legend.direction = "horizontal", - legend.key.width = unit(1.15, "cm"), - axis.title.x=element_blank(), - axis.text.x=element_blank(), - axis.ticks.x=element_blank(), - axis.title.y=element_blank(), - axis.text.y=element_blank(), - axis.ticks.y=element_blank()) -p2 = stvc_bma_sf |> - ggplot() + geom_sf(aes(fill = b_pubC), col = NA) + - scale_fill_binned_c4a_seq(palette="scico.lajolla", name = tit) + - facet_wrap(~year) + - theme_bw() + xlab("") + ylab("") + - theme( - strip.background = element_rect(fill="white", colour = "white"), - strip.text = element_text(size = 8, margin = margin(b=4)), - legend.position = c(.7, .1), - legend.direction = "horizontal", - legend.key.width = unit(1.15, "cm"), - axis.title.x=element_blank(), - axis.text.x=element_blank(), - axis.ticks.x=element_blank(), - axis.title.y=element_blank(), - axis.text.y=element_blank(), - axis.ticks.y=element_blank()) -plot_grid(p1, p2, nrow = 2) -``` - -## Summary - -The key and substantive methodological point in this vignette is the need to consider the nature of the spatial and temporal interactions (dependencies) between the target and predictor variables. - -Model form (and thus the nature of the space-time process) should not be assumed and hence the provision of functions to create , evaluate and rank multiple models in the `stgam` package. Most approaches to SVC and STVC modelling implicitly assume specific spatial and space-time dependencies. - -The approach presented din this vignettes represents a fundamental difference in philosophy to (spatial and) space-time modelling. The method is essentially about model comparison to *test* for space time dependency. - - - -# References{-} diff --git a/vignettes/stgam_intro_chaco.Rmd b/vignettes/stgam_intro_chaco.Rmd new file mode 100644 index 0000000..00b7f0a --- /dev/null +++ b/vignettes/stgam_intro_chaco.Rmd @@ -0,0 +1,395 @@ +--- +title: "A Geographer's introduction to GAMs" +author: "Lex Comber, Paul Harris, Gonzalo Irisarri and Chris Brunsdon" +date: "January 2026" +output: rmarkdown::html_vignette +bibliography: vignette.bib +header-includes: + - \usepackage{caption} + +vignette: > + %\VignetteIndexEntry{A Geographer's introduction to GAMs} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} + +--- + +\captionsetup{width=5in} + +```{r, include = FALSE} +library(knitr) +# library(kableExtra) +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + options(width = 120) +) +``` + +```{r eval = F, echo = F} +# see Chaco_Data_stgam.R +# usethis::use_data(chaco) +# usethis::use_data(chaco, overwrite = TRUE) +``` + +## Overview + +The `stgam` package [@stgam] provides a wrapper around `mgcv` functionality [@wood2025package] to support space-time analysis with Generalized Additive Models (GAMs), Although GAMs, as with any regression, are useful predictors of a target variable, the focus here is understanding relationships in the data and associated inference (i.e., process understanding between the target and a set of predictor variables). This is a common objective of geographical analysis, which is often concerned with how processes vary spatially, temporally and spatio-temporally. + +This vignette illustrates an analysis workflow using GAM smooths that quantifies and models variations over space and time. It highlights the importance of prior investigations of spatial and / or temporal variability before constructing space-time GAMs. + +This vignette acts as precursor, to a complementary vignette, where spatial, temporal and spatio-temporal varying coefficient GAMs are investigated. The GAMs specified here, only use surrogates for spatial or temporal effects, through location coordinates or date stamps, respectively. GAMs by design support varying coefficients but for this vignette, this variation is essentially investigated in 'attribute-space' only. + +You should load the `stgam` package and the `chaco` data. The data are sample of 2000 observations of Normalised Difference Vegetation Index (NDVI) of the Chaco dry rainforest in South America with some climate data. The NDVI and climate data are found via Google Earth Engine [@gorelick2017google]. The NDVI data is sourced from the PKU GIMMS NDVI v1.2 dataset, which provides NDVI observations at 1/12° spatial resolution at bi-monthly intervals from 1982 to 2022 [@li2023spatiotemporally]. For the climate data, Maximum temperature (`tmax`) in degrees C and Precipitation (`pr`) in millimetres were selected from the TerraClimate dataset (IDAHO_EPSCOR/TERRACLIMATE) and their mean value was calculated for each monthly image across all pixels. + + +```{r loadpackages, warning = F, message = F, results=F} +library(cols4all) +library(dplyr) +library(ggplot2) +library(tidyr) +library(sf) +library(cowplot) +# load the package and data +library(stgam) +data("chaco") +``` + +You should examine the help for the dataset and especially the variables that `chaco` contains: +```{r eval = F, warning = F, message = F} +help(chaco) +``` + + +## 1. Data considerations + +The first stage is simply to examine the data, particularly how variables many vary over space and time. The code below examines the data before undertaking some initial investigations. + +```{r eval = T} +# examine what is loaded +chaco +``` + +The analyses will model NDVI (the `ndvi` variable). GAMs will be constructed that regress this target variable against available predictor variables including location and time. The print out of the first 10 records above shows that the dataset contains a time variable (`date`) in date format as well as location in metres (`X` and `Y` from the OSGB projection). It often more useful to represent time as continuous scalar values and to have location coordinates in kilometres rather than metres. The data also contains a variable called `month` to represent time (here months since the earliest date). The code below rescales the locational data, but the original coordinates are retained for mapping in `geometry` attribute of `chaco`: +```{r} +chaco <- + chaco |> + # scale location and retain original coordinates + mutate(Xo = X, Yo = Y) |> + mutate(X = X/1000, Y = Y/1000) +``` +Again this can be examined: +```{r eval = F} +chaco +``` + +The target variable can be mapped as in the code snippet below. The map blow shows the spatial distribution of `ndvi`. + +```{r dataplot, fig.height = 4, fig.width = 7, fig.cap = "Spatial variation of the `ndvi` variable ."} +# map the data layers +chaco |> + ggplot() + geom_sf(aes(col = ndvi)) + + scale_color_viridis_c(option = "magma") + + theme_bw() +xlab("") + ylab("") +``` + +Distributions and correlations for the continuous variables can be examined using boxplots / histograms and a correlation matrix, respectively. It is important to establish that any GAM with the proposed target, regressed against a set of predictor variables may yield something interesting and sensible, and just as importantly, to identify any variables that might be a problematic when trying to fit and interpret the resultant model. + +For example, evidence of a highly skewed variable or one that has outliers. This is the case for the proposed target variable (`ndvi`) which is strongly positively skewed with potential outliers. + +The code below generates boxplots and then density histograms. + +```{r databox, fig.height = 2.5, fig.width = 7, fig.cap = "Boxplots of the continous variables in the data."} +# boxplots +chaco |> + st_drop_geometry() |> + select(id, ndvi, tmax, pr) |> + pivot_longer(-id) |> + ggplot(aes(x = value), fil) + + geom_boxplot(fill="dodgerblue") + + facet_wrap(~name, scales = "free") + + theme_bw() + + theme(axis.title.y=element_blank(), + axis.text.y=element_blank(), + axis.ticks.y=element_blank()) +``` + +```{r datahist, fig.height = 2.5, fig.width = 7, fig.cap = "Histograms of the continuous variables in the data."} +# histograms +chaco |> + st_drop_geometry() |> + select(id, ndvi, tmax, pr) |> + pivot_longer(-id) |> + ggplot(aes(x = value), fil) + + geom_histogram(aes(y=after_stat(density)),bins = 30, + fill="tomato", col="white") + + geom_density(alpha=.5, fill="#FF6666") + + facet_wrap(~name, scales = "free") + + theme_bw() +``` + +Examining correlations, similarly checks for issues that may be a problem later on in the analysis workflow. Here the aim is to establish that there are reasonable correlations with the target variable and at the same time, little collinearity amongst the predictor variables. Note this may change when considering change in correlations over space, time or space-time (i.e., they are only examined globally, for now). See accompanying vignette. + +```{r correls} +# correlations +chaco |> + st_drop_geometry() |> + select(ndvi, tmax, pr) |> + cor() |> round(3) +``` + +Finally, thinking about a GAM-based space-time analysis of *your* dataset, as a general heuristic, any dataset for use in the space-time analysis should ideally have a minimum of about 100 locations and a minimum of about 50 time intervals. You should consider this when you are assembling your data. For space-time datasets with smaller dimensions, alternative regression methods suit, say a spatial panel regression when the time component is short [@harris2017introducing]. + +## 2. Detecting variability + +Investigations in the previous section were all concerned with establishing the viability of undertaking the proposed analysis, using standard exploratory data analyses to examine distributions and correlations. In this section, these are extended to investigate data variability over time, over space and in space-time. This is done by constructing a series of GAMs, of different forms and with different settings. Tools for constructing GAMs using the `mgcv` package [@wood2025package] are introduced but with particular focus on using GAM smooths or splines to explore variations. + +### 2.1 Initial investigations with OLS and dummy variables + +As an initial baseline model, the code below creates a standard OLS (ordinary least squares) regression for `ndvi` with two predictors, using the `lm` function. The model fit is weak, but the model summary indicates that the all of the predictor variables (`tmax` and `pr`) are significant predictors of the target variable, `ndvi`. An analysis of variance (ANOVA), using the `anova` function, shows how much variance in `ndvi` is explained by each predictor and whether that contribution is statistically significant (via associated p-values, `Pr(>F)`). Here high values in the `F value` test statistic provides evidence that a given predictor is useful.In summary, there is evidence the two predictors (`tmax` and `pr`) are statistically significant predictors. + + +```{r ols} +# an OLS model +m_ols <- lm(ndvi ~ tmax + pr, data = chaco) +summary(m_ols) +anova(m_ols) +``` + +As as of yet, space, time and space-time influences on house price variation have not been investigated. GAMs with smooths offer a route to investigate these. + + +### 2.2 Detecting variability using GAM smooths + +GAMs with splines (or smooths - the terms are used interchangeably) can be used to explore and quantify spatial and temporal variations. However before considering these aspects, it is important to outline what smooths do. + +Smooths help to fit a (smooth) curve through a cloud of messy data points. They generate flexible fits (curves) able to adapt to the shape of the data. Because of this, they are able to describe the relationships between variables better than a straight line. For example, consider the scatterplot of some simulated data below: it would be difficult to fit a straight line through it. + +```{r simplot, fig.height = 4, fig.width = 7, fig.cap = "Simulated `x` and `y` data."} +# create simulated data +set.seed(12) +x <- runif(500) +mu <- sin(2 * (4 * x - 2)) + 2 * exp(-(16 ^ 2) * ((x - .5) ^ 2)) +y <- rnorm(500, mu, .3) +# plot x and y +ggplot() + + geom_point(aes(x,y)) + + theme_bw() +``` + +The code below uses a GAM smooth to determine the relationship between `x` and `y` without having to pre-specify any particular form (e.g. quadratic, exponential, etc). The wiggliness of the fit is determined automatically by the `s` function from the `mgcv` package. This is done in different ways depending on the type of smooth that is specified (thin plate regression smooths is the default in `s`). Effectively what the smooth does is to split the data into chunks and fit piecewise linear relationships, rather than fitting a single straight line as in a standard linear regression. In doing so, GAMs implicitly allow for non-linear relationships. + +```{r simgamplot, fig.height = 4, fig.width = 7, fig.cap = "Simulated `x` and `y` data with GAM a Spline fitted."} +# a GAM illustration with a spline using s +gam_s_example <- gam(y~s(x)) +# extract the smooth fit +y.s <- gam_s_example$fitted.values +# plot +ggplot() + + geom_point(aes(x,y), col = "grey") + + geom_line(aes(x, y = y.s), lwd = 1) + + theme_bw() +``` + +Thus, this diversion into smooths and splines illustrates how they model different relationships (slopes) with `y` at different locations within the variable feature space (the `x` axis in the above example). In this way, spline smooth curves can be used to capture non-linear relationships in attribute-space (here the relationship of `x` with `y`). + +Importantly, in the context of space-time varying coefficient modelling with `stgam` (see the accompanying vignette in this package), the smooth can be used to model how relationships between `x` and `y` varies with respect to time or location, if the smooth is also parameterised with those. That is moving from attribute-space to temporal- or geographical-space, respectively. + +### 2.3 Temporal variability + +It would be useful to examine the how the target variable, `ndvi`, varies over time. One way of doing this is to extend the use GAM smooths illustrated above. The code snippet below models the variation in `ndvi` but now using time (the `month` variable) as a predictor. + +```{r gam.1} +# the first GAM +gam.1 <- gam(ndvi~s(month), data = chaco) +summary(gam.1) +``` +The smooth can be investigated by plotting `ndvi` with time. Note the use of the actual data variable `date` rather than `month` in the code below to have a friendly x-axis in the plot, and the use of the `predict` function to extract the standard errors on the fitted curve (i.e., the predictions of NDVI): + +```{r smoothplot1, fig.height = 4, fig.width = 7, fig.cap = "A plot of the temporal smooth."} +# create a data frame with x, predicted y, standard error +x <- chaco$date +y <- gam.1$fitted.values +se <- predict(gam.1, se = TRUE, chaco)$se.fit +u <- y+se +l <- y-se +df <- data.frame(x, y, u, l) +# plot! +ggplot(df, aes(x, y, ymin = l, ymax = u)) + + geom_ribbon(fill = "lightblue") + + geom_line() + + theme_bw() + + xlab("Date") + ylab("NDVI") +``` + +This exercise models variation in the relationship between `ndvi` (NDVI) and `month` (time). The code above extracts the predicted values ($\hat{y}$) of `ndvi` from the model and plots them against time, but using the actual date. The plot shows how the relationship of the target variable varies with time. This provides confirmation of the potential suitability of a temporal analysis of `ndvi`. + +### 2.4 Spatial variability + +The spatially varying trends can be examined in similar way, again using a standard `s` spline - but now with location variables (i.e., the coordinates) to predict house price. + +```{r smoothplot2, fig.height = 5, fig.width = 7, fig.cap = "A `mgcv` map of the spatial smooth."} +# the second GAM +gam.2 <- gam(ndvi~s(X,Y), data = chaco) +summary(gam.2) +plot(gam.2, asp = 1) +``` + +Notice that the model fit ($R^2$) has **not** increased substantially from the previous model. This suggests some important spatial effects and again the smooth can be plotted, but this time rather than a line, it is a 2-Dimensional surface. However, it would be useful to visualise this as a surface rather than just over the `chaco` point locations. The code below uses the `predict` function used to model the relationship over a hexagonal grid (a surface) covering the extent of study area, that contains `X` and `Y` attributes of the location of the hexagonal cell centroid. + +```{r create_lgrid} +# create a grid object from the Chaco data +chaco_grid <- + st_make_grid(chaco, square=FALSE,n=20) |> + st_sf() +# rename the geometry, sort row names +st_geometry(chaco_grid) = "geometry" +rownames(chaco_grid) <- 1:nrow(chaco_grid) +# create and add coordinates X and Y / 1000 +coords <- chaco_grid |> st_centroid() |> st_coordinates() +chaco_grid <- chaco_grid |> bind_cols(coords/1000) +``` + +You may wish to inspect this object: +```{r eval = F} +chaco_grid +plot(st_geometry(chaco_grid)) +``` + +Before continuing with the mapping procedure: +```{r smoothplot2a, fig.height = 5, fig.width = 7, fig.cap = "A map of the smoothed (predicted) response variable over hexagon grid."} +# predict over the grid +yhat <- predict(gam.2, newdata = chaco_grid) +chaco_grid |> + mutate(yhat = yhat) |> + # and plot + ggplot() + + geom_sf(aes(fill = yhat), col = NA) + + # adjust default shading + scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "NDVI") + + # apply and modify plot theme + theme_bw() + + theme(legend.position = "bottom", + legend.key.width = unit(1, "cm")) +``` + +These are clear spatial trends (variations over space) of the NDVI target variable. The map indicates high values in the centre of the study area, as in the initial map of the data earlier, and again provides confirmatory evidence that the some spatial modelling of `ndvi` is appropriate. + +### 2.5 Space-time variability I + +The smooth approach can be extended to model the target variable in space-time by including the location and time variables in the smooth. However, a slightly different set of considerations are required because of mixing space and time in the `mgcv` smooths. Consider the code snippet below for the space-time smooths it specifies a smooth using the `te` function rather than `s` and contains other parameters in the `d` and `bs` arguments. The reasons for this are described below. + +```{r gam.3} +# the third GAM +gam.3 <- gam(ndvi~te(X,Y,month, d = c(2,1), bs=c('tp','cr')), data = chaco) +summary(gam.3) +``` + +It is important to consider the difference between `s(X,Y,month)` (which would be a continuation of the format above but has not been run) and `te(X,Y,month,d = c(2,1),bs=c('tp','cr'))` before examining the resultant GAM (`gam.3`). The spline functions `s()` and `te()` can model smooth interactions between multiple variables, but they handle dimension scaling, marginal smoothness, and basis construction in different ways. + +- `s()` is an *Isotropic smooth*. It is used when all variables are on the same scale and have similar smoothness behaviours (like `X` and `Y`). By default it uses a thin plate regression spline (`bs = "tp"`) and a single basis to model the attribute space. The formulation `s(X,Y,month)` would therefore use a single basis to model a smooth in 3D space, under the assumption that all the variables (`X`, `Y` and `month`) contribute in the same way and at the same scale. This is fine if the variables are numeric, continuous, and in comparable units such as percentages. But not in space-time. + +- `te()` is a *Tensor Product (TP) smooth*. These are used when variables are on different scales or represent different kinds of effects (like `X`, `Y` and `month`). It builds a smooth using separate basis functions for each variable and then combines them. The formulation `te(X,Y,month,d = c(2,1),bs=c('tp','cr'))` includes the `d` parameter that specifies the dimensions for each basis. Here a 2D basis is specified for `X` and `Y` (location) and a 1D basis for `month` (time). It also includes a basis parameter `bs` which specifies a Thin Plate Regression Spline (`tp`) for location and a Cubic Regression Spline (`cr`) for time. Thin plate splines are invariant to coordinate rotation and are ideal for spatial data (regardless of whether the coordinates are in meters, kilometres, or degrees) and Cubic regression splines are more efficient especially in 1D. The tensor product smooth is constructed by constructing (marginal) smooths for each and taking the tensor product of those to form the joint smooth. It is useful when the variables to be included in the smooth have different scales or units (e.g., `X` and `Y` are coordinates, `month` is time) and different degrees of smoothness are expected along each variable. + +Thus, in this case a TP smooth is specified because space and time need to be combined. The code below examines the results and uses the `mgcv` plot function to show the spatial distributions over 9 discretised time chunks (~15 months). + +```{r eval = F} +plot(gam.3, asp = 1) +``` + +However, it is also possible to slice and dice the time variable in other ways. The code snippet below illustrates how this can be done for monthly periods using the `calculate_vcs` function in the `stgam` package. Note that if predictor variables are included in the model then the function coefficient estimates, standard errors and t-values for each of terms specified, as shown in the accompanying vignette. + +However, in this case we are just interested in the predicted value of the target variable, `yhat`. The code below uses the grid object created above (`chaco_grid`) as a spatial framework to hold the prediction of the `ndvi` variable over these discrete time periods. +```{r smoothplot3, fig.height = 6, fig.width = 7, fig.cap = "A plot of a spatial smooth over 7 approximately annual time periods."} +# create time slices +years <- sort(unique(chaco$year)) +# calculate over the grid for each time slice +res_out <- matrix(nrow = nrow(chaco_grid), ncol = 0) +for (i in 1:length(years)){ + # convert years to months - here getting month 6 + month.val <- ((years[i]-min(years)) * 12) + 6 + res.i <- calculate_vcs(input_data = chaco_grid |> mutate(month = month.val), + mgcv_model = gam.3, + terms = NULL) + res_out <- cbind(res_out, res.i$yhat) +} +# name with years and join to the grid +colnames(res_out) <- paste0("Y_", years) +chaco_grid |> + cbind(res_out) |> + # select the variables and pivot longer + select(-X, -Y) |> + pivot_longer(-geometry) |> + # make the new object a factor (to enforce plotting order) + mutate(name = factor(name, levels = paste0("Y_", years))) |> + # and plot + ggplot() + + geom_sf(aes(fill = value), col = NA) + + # adjust default shading + scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "Predicted\nNDVI") + + # facet + facet_wrap(~name, ncol = 4) + + # apply and modify plot theme + theme_bw() + + theme( + legend.position = "inside", + legend.position.inside = c(0.9, 0.17), + strip.background = element_rect(fill="white", colour = "white"), + strip.text = element_text(size = 8, margin = margin(b=4)), + axis.title=element_blank(), + axis.text=element_blank(), + axis.ticks=element_blank()) +``` + +This time series of maps show that without any predictor variables (aside from location & time), the trends in `NDVI` varies spatially and changes in intensity over time (the increase and decrease in values over space), with changes in intensity but limited variation in spatial pattern over time. + + +### 2.6 Space-time variability II + +The formula used to construct the `gam.3` model in the previous subsection constructed a smooth that implicitly combined space and time, under an assumption that the variations in space and time of `ndvi` interact with each other. However this may not be the case. More generally, spatial and temporal dependencies might not be expected to interact in the data for a relatively small region, whose observations might be expected to be subject to the same socio-economic or environmental pressures. In this case, any space and time effects might be expected to be *independent* of each other. Whereas in a larger, for example national study, any space and time effects might be expected to *interact* more strongly. + +It is possible to use a different construction involving separate smooths for space and time, with the `s()` smooth, and to avoid the assumption of the interaction of space and time effects: + +```{r gam.4} +# the fourth GAM +gam.4 <- gam(ndvi~s(X,Y) + s(month), data = chaco) +summary(gam.4) +``` + +Again there is evidence of spatial and temporal trends in the data and these can be visualised using the `plot.gam` function in `mgcv` (called with just `plot`). +```{r smoothplot4, fig.height = 4, fig.width = 7, fig.cap = "A `mgcv` plot of the spatial and temporal smooths."} +plot(gam.4, page = 1) +``` + +Note that, the space-time interactions could be generated in the same way as for the `gam.3` model above, using the location (`X` and `Y`) values in `chaco_grid` and the time slices in `years`. + +### 2.7 Including other predictor variables in space-time GAMs + +Up until now, only the target (response) variable, `ndvi` has been examined for variations in time and space, using GAM smooths specified in different ways. This was to explore space-time variability of this variable, but also to introduce GAM smooths. The code below includes the `tmax` predictor variable in a space-time GAM as a fixed parametric term (i.e., in a similar way as in the OLS model created above). + +```{r gam.5} +# the fifth GAM +gam.5 <- gam(ndvi~s(X,Y) + s(month) + tmax, data = chaco) +summary(gam.5) +``` + +The model summary indicates that `tmax` is a significant predictor of `ndvi` and improves the model fit. However, this raises a key question of how should predictor variables be included in the model: in (fixed) parametric form or in a (varying) smooth form? This is addressed in the accompanying vignette. + +As before the smooths can be plotted: +```{r eval = F} +plot(gam.5, page = 1) +``` + +### 2.8 Summary + +This vignette has introduced GAMs with smooths (splines) as a way of investigating any space-time effects present in the data. The mechanics of smooths was described and illustrated, and a brief introduction to different smooth forms was provided. Variations in the target variable (`ndvi`) in time, space and space-time were explored through elementary model fitting and plotting the smooth graphical trends. + +In this case, effects in space and time are present in the `ndvi` target variable and it looks like there is a universal spatial trend at all times, and universal temporal trend everywhere. The final space-time GAM included a predictor variable (`tmax`) which added some explanatory power but these were not examined with respect to varying in space or in time or in space-time (i.e. through the use of smooths). This is done in the accompanying vignette. + +Such investigations are an important initial step. They provide evidence of space-time variations in the target variable - i.e. whether the effects in space and time are present in the data - and can help determine which predictor variables may be of further interest. They guide subsequent analysis and avoid making assumptions about the presence of space-time interactions, for example by plugging every predictor variable into a space-time smooth of some kind. + + +## References + + + diff --git a/vignettes/stgam_vc_models_chaco.Rmd b/vignettes/stgam_vc_models_chaco.Rmd new file mode 100644 index 0000000..67f9ba9 --- /dev/null +++ b/vignettes/stgam_vc_models_chaco.Rmd @@ -0,0 +1,957 @@ +--- +title: "A Geographer's introduction to space-time regression with GAMs using `stgam`" +author: "Lex Comber, Paul Harris, Gonzalo Irisarri and Chris Brunsdon" +date: "January 2026" +output: rmarkdown::html_vignette +bibliography: vignette.bib +header-includes: + - \usepackage{caption} + +vignette: > + %\VignetteIndexEntry{A Geographer's introduction to space-time regression with GAMs using `stgam`} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} + +--- + +\captionsetup{width=5in} + +```{r, include = FALSE} +library(knitr) +library(kableExtra) +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.height = 6, fig.width = 7, + options(width = 120) +) +cache.val = TRUE +``` + +## Overview + +The previous vignette ("A geographer's introduction to GAMs") described some data considerations for space-time analysis and the importance of undertaking some explorations of the space-time variability in the data. It illustrated how this could be done using GAM smooths applied to the target variable through elementary model fitting and visualisation to guide subsequent analyses. + +These are extended in this vignette by including predictor variables, constructing an initial predictive space-time GAM, undertaking investigations and model refinement, and examining and visualising the space-time varying coefficient estimates. The last section undertakes automatic model selection. + +## 1. Data and variables + +The code below loads some packages and data, and then manipulates the data to create some variables as in the previous vignette: + +```{r loadpackages, warning = F, message = F, results=F} +library(cols4all) +library(dplyr) +library(ggplot2) +library(gratia) +library(sf) +library(tidyr) +library(mgcv) +# load the package and data +library(stgam) +data("chaco") +chaco <- + chaco |> + # scale location and retain original coordinates + mutate(Xo = X, Yo = Y) |> + mutate(X = X/1000, Y = Y/1000) +``` + +The `chaco` data is a spatial object in `sf` format. It contains a 2 potential predictor variables that could be used to model `ndvi`: `tmax` (Maximum temperature) and `pr` (Precipitation), as well as location (`X` and `Y`) from the rescaled Easting and Northing coordinates, and time (`months`). It is fully described in the Introductory vignette. ("A Geographer's introduction to GAMs"). The data can be examined: + +```{r} +head(chaco) +``` + +## 2. Space-time GAM contruction and considertations + +```{r echo = F, eval = T} +load("model_summaries.RData") +``` + +### 2.1 Overview + +This section works through an example of constructing a space-time GAM to illustrate a workflow that allows for spatio-temporally varying terms, using the Chaco dataset. It fits an initial complex model and then, through investigation of the results, refines the how the model is specified. In so doing a number of important considerations are illustrated relating the bases that are specified, their dimensions, the GAM fitting method, the model family and the number of knots used in the smooths. + +In the GAMs created in this vignette and this package, the Intercept is treated as an explicitly addressable term. + +By way of example, consider the model `m` created below: +```{r eval = T} +m <- gam(ndvi~s(X,Y) + s(month) + pr, data = chaco) +summary(m) +``` + +This includes the terms `s(X,Y) + s(month)`. These model the spatially varying and temporally varying Intercept plus error. Note that if this was a spatial model (without time), then just `s(X,Y)` would be included for the Intercept and similarly just `s(month)` for a purely temporal model. + +The model can be reformulated as follows with an addressable Intercept term: +```{r new} +m <- gam(ndvi ~ + 0 + Intercept + s(X,Y,by=Intercept) + s(month, by=Intercept) + + pr, + data = chaco |> mutate(Intercept = 1)) +summary(m) +``` + +The model summary now includes `Intercept` as a parametric term (rather than `(Intercept)`) with `s(X,Y):Intercept` and `s(month):Intercept` as smooth terms, rather than `s(X,Y) `and `s(month)`, but the values in both model summaries are all the same. + +The rationale for treating the Intercept as addressable term is because, in varying coefficient modelling, specifying a varying intercept over space and or time detects situations when the response variable (the $y$-value) may show notably low or high values in a space-time locality that were not accounted for by variation in the explanatory values [@harris2019simulation]. This could be caused, for example, if variables not recorded in the study were causing otherwise unexplained space-time patterns in the $y$ variable. As such, a varying intercept commonly tends to capture autocorrelated errors [@harris2019simulation] and background variation that has the potential to affect other varying coefficient estimates. It also simplifies the interpretation of other varying coefficient estimates, particularly if the predictor variables are not mean-centred. + +The models in this vignette include the Intercept as an addressable term and the code below creates this variable in the input data: +```{r data_prep} +chaco <- + chaco |> + mutate(Intercept = 1) +``` + +### 2.2 Space-time GAMs + +An initial space-time GAM is fitted using the code below. Notice that for each predictor variable and the explicitly specified intercept, the model specifies Tensor Product (TP) smooths to capture the variations in NDVI (`ndvi`) in both space and time simultaneously. These are tensor product smooths, specified using the `te()` function, and are described below. Notice also that global *parametric* parameters are also specified for `tmax`, `pr` and `Intercept`. This is to provide a multiplicative constant as well as a function around the coefficient term, and potentially an offset term. +```{r m1, eval = F, echo = T} +# initial model +m1 <- gam( + ndvi ~ 0 + + Intercept + te(X,Y,month, by=Intercept,bs = c("tp", "cr"), d = c(2,1)) + + tmax + te(X,Y,month, by = tmax, bs = c("tp", "cr"), d = c(2,1)) + + pr + te(X,Y,month, by = pr, bs = c("tp", "cr"), d = c(2,1)), + data = chaco, + method = "REML", + family = gaussian() +) +``` + +Each of the space-time smooths is specified with Tensor Product (TP) smooth (`te()`) to model interactions between space and time margins. These construct smooth surfaces over space and time without random effects [@wood2025package]. They enable variables that are measured over different scales (i.e. that are anisotropic), such as location coordinates and time, to be combined. The basis dimensions of each component are specified through the `d` parameter, here specifying a 2-dimensional smooth basis for location (`X` and `Y`) and a 1-dimensional one for time (`month`). These are combined via the TPs. Thin plate regression splines (`tp`) were specified to model location and low-rank cubic regression splines (`cr`) were specified for time. Thin plate regression splines are ideal for 2-D smooths because they are isotropic (they are *rotation invariant* with have no directional bias) and they perform well even if the observations are irregularly located over space. Cubic regression splines are efficient in 1-D and known to be good for seasonal or long-term effects. The low-rank basis avoids overfitting and improves computational efficiency. +Finally, the model is specified with a Gaussian family and the model is fitted using a REML (Relaxed Maximum Likelihood) method. Model estimation in GAMs can be conducted in two ways: (a) predictive (i.e., out-of-sample minimization) via Generalized Cross-Validation (GCV) and (b) Bayesian (i.e. attach priors on basis coefficients) via some maximum likelihood approach. These tend to provide more stable estimates of the smoothing parameters (i.e., reduce over-fitting) and tend to provide more robust estimates of the variance. GCV is more computationally efficient but can over-fit, especially when errors are correlated. Note that GCV is set as the default in `mgcv`, but REML based approaches are preferred because they provide less biased estimates of variance and smoothing parameters, and are more robust during the optimization process. + +The model can be examined using the `appraise` function in the `gratia` package [@simpson2024gratia; @simpson2025package] which generates diagnostic plots of the model residuals (the difference between the observed value of $y$ (in this case `ndvi`) and the predicted value of $y$, also referred to as $\hat{y}$ : + +```{r eval = F, echo = T} +appraise(m1, method = "simulate") +``` + +```{r appm1, echo = F, eval = T, warning = F, message = F, out.width = "100%", fig.cap = "Diagnostic plots of `m1`."} +library(cowplot) +img1 <- ggdraw() + draw_image("appraise_plot_1.png") +img2 <- ggdraw() + draw_image("appraise_plot_2.png") +img3 <- ggdraw() + draw_image("appraise_plot_3.png") +img4 <- ggdraw() + draw_image("appraise_plot_4.png") +plot_grid(img1, img2, img3, img4, ncol = 2) +``` + +The diagnostic plots show that the top and bottom and of the QQ plot are close to the theoretical fit line and that the modelled quantiles are close to the theoretical one. This indicates that there are a low number of residuals - observed quantiles that are not on the theoretical fit line. If there were a high number - i.e. the QQ plot had an S-shape - then this would suggest residual heterogeneity, potentially due to un-modelled interactions or due to an inappropriate assumption of a Gaussian model. In such cases other response should be investigated for example through histograms and specified accordingly via the `family` parameter that is passed to the GAM - see below. + +### 2.2 Space-time GAM Refinements I: Model family + +The assumptions behind a Gaussian model are that the target variable is normally distributed, meaning that it is continuous, has constant variance and is unbounded (i.e. it can theoretically take any value from $-\propto$ to $+\propto$). + +If the target variable has a non-normal distribution (a long tail, a skewed distribution, etc) then a different model family should be specified. For example, a target variable such as house price which has values that cannot be negative and may have some very high values, would suggest specifying a a Gamma family with a log link. This is used when the response is continuous, strictly positive, right-skewed (i.e., with a long tail to the right) and when the variance increases with the mean (i.e. is heteroscedastic). What this effectively does to provide a log-scale regression for positive, skewed data. + +The point is that there are a number options for specifying the model `family` in a `mgcv` GAM, depending on the distribution of the response variable, as summarised in the table below. Further information can also be found in the help for the `family` function (enter `?family`). + +```{r echo = F, eval = T} +tab <- data.frame( + Family = c("Gaussian", "Gamma", "Inverse Gaussian", "Poisson", "Binomial"), + Response = c("continuous, symmetric", + "continuous, positive, skewed", + "continuous, positive, heavy-tailed", + "count data", + "0–1 or proportion data"), + Link = c("identity", "log", "log", "log", "logit"), + Example = c("residuals roughly normal", + "costs, durations, prices", + "reaction times, dispersions", + "integer counts", + "success/failure outcomes") +) + +kable(tab, + booktabs = T, row.names = F, linesep = "", + caption = paste0("\\label{tab:tab1}Common families and their properties.")) |> + kable_styling(latex_options = "hold_position") +``` + +This vignette and the current version of the `stgam` package considers only a Gaussian, normally distributed response. Future versions will address other responses. + +### 2.3 Space-time GAM Refinement II: Effect size + +Having determined that correct model family has been specified for the response variable, the model summary can be examined: + +```{r eval = F, echo = T} +summary(m1) +``` + +```{r echo = F, eval = T} +cat(sum_m1, sep = "\n") +``` + +This indicates that `pr` specified in a space-time smooth is not significant but that all of the other model components are. + +To unpick this further and to provide a further basis for model evaluation, it is also instructive to examine the size of the effect of each model component. The code below generates numeric summaries of the effects of each smooth by extracting the effect of each term from the model using the `predict` function. The typical size of each smooth's contribution to predictions are summarised from their standard deviations (`sd`) and the total amplitude of the smooth effects (`range`), where higher values indicate a larger effect on the target variable. This is undertaken below and the code snippets have been wrapped into the `effect_size` function in `stgam`. + +```{r eval = F, echo = T} +# extract the terms +f1 <- predict(m1, type = "terms") +# generate numeric summaries +apply(f1, 2, function(v) c(sd = sd(v), range = diff(range(v)))) |> round(3) +``` + +```{r eval = T, echo = F} +effect_m1 +``` + +This shows the size of the effect of each model component. Considering the ranges, we can see the zero effect of the parametric `Intercept` and `pr` term and the low partial effect of the TP smooth for the `pr` term, suggesting a lack of interaction over space-time for this this predictor variable. A possible refinement in this case could be to remove the fixed `Intercept` and `pr` terms but we would like to retain them as constants and offsets. The range value of the combined space-time smooth for `pr` is relatively low (`te(X,Y,month):pr`) and it may be useful to examine `pr` over space and time separately. In the code below the individual space and time smooths are specified as thin plate regression smooths, in `s()` function in the `mgcv` package. + +```{r m2, eval = F, echo = T} +# second model with separate space-time effects for pr +m2 <- gam( + ndvi ~ 0 + + Intercept + te(X,Y,month, by=Intercept, bs = c("tp", "cr"), d = c(2,1)) + + tmax + te(X,Y,month, by = tmax, bs = c("tp", "cr"), d = c(2,1)) + + pr + s(X,Y, by = pr) + s(month, by = pr), + data = chaco, + method = "REML", + family = gaussian() +) +``` + +Then effect size can be examined using the `effect_size` function: +```{r eval = F, echo = T} +effect_size(m2) +``` + +```{r eval = T, echo = F} +effect_m2 +``` + +Now the size of the effect of `pr` is split and is larger over both space (`tmax s(X,Y):pr`) and time (`s(month):pr`), compared to the `m1` GAM model. This demonstrate how evaluating and comparing models with different space-time smooths can help to determine whether to model them interactively or not, and what the changes in the magnitude of the modelled response surface might be. + +### 2.4 Space-time GAM Refinement III: `k` the number of knots + +In the `mgcv` GAM implementation the number of knots in each smooth are automatically determined. These control how 'wiggly' (non-linear) the the smooth can be. The number of knots in each smooth can be examined using the `gam.check` function. This provides some diagnostic information summarising the model fitting procedures can also be generated and also prints out some diagnostic plots (although not as informative as the ones created by the `appraise` function in the `gratia` R package). + +```{r eval = F, echo = T} +gam.check(m2) +``` + +This checks whether the number of knots used in the smooths are too low. This is indicated when the effective degrees of freedom (EDF in the `edf` column) are close to the number of knots (`k`) and some of the `k-index` value are somewhat less than 1. In this case, the number of knots used in the temporal smooth for `pr` may be too low. + +The number of knots in a used in a smooth are automatically determined but can be specified. In `mgcv` GAM smooths, `k` controls the basis dimension of the smooth term — i.e. the maximum possible complexity of the smooth, not the actual complexity. The effective degrees of freedom (`edf`) after fitting should be much smaller. If `k` is too small, the model fails to capture the true pattern of the relationships in the data (i.e. it fails to capture non-linearities) and can result in biased fits and residual structure. There are some approximate guides for specifying `k` that are summarised in the table below. + +```{r eval = T, echo = F} +tab <- data.frame( + `Smooth type` = c( + "1D temporal smooth (e.g., days, years)", + "2D spatial smooth (e.g., te(X, Y))", + "3D space–time smooth (e.g., te(X, Y, time))", + "Factor-by smooths (e.g., by = region)" + ), + `Heuristic` = c( + "`k` ≈ min(n/10, 20–40)", + "`k` ≈ 50–200 total basis functions", + "`k` = product of marginal bases, e.g. `k = c(50, 20)`", + "`k` ≈ same as main smooth per level" + ), + Notes = c( + "Typically start around 10 basis functions; if long time series (> 1000 points), may go to 50–100", + "Typically start around 50 basis functions; choose higher if data cover large/complex region", + "mgcv handles tensor product decomposition; but do not let any single dimension be too small", + "You can use fewer if levels are many and data per level small" + ), + check.names = FALSE +) + +kable(tab, + booktabs = T, row.names = F, linesep = "", + caption = paste0("\\label{tab:tab2}Heuristics for choosing `k` in common smooth types.")) |> + kable_styling(latex_options = "hold_position") + +``` + +As a general rule of thumb, the `k` value should be doubled if the check reveals it to be too low. This is done for the temporal smooth specified for `pr` in the code below to create `m3` and note that as `k` increases so does the time to fit the model: + +```{r m3, eval = F, echo = T} +# third model with increased k for the pr temporal smooth +m3 <- gam( + ndvi ~ 0 + + Intercept + te(X,Y,month, by=Intercept, bs = c("tp", "cr"), d = c(2,1)) + + tmax + te(X,Y,month, by = tmax, bs = c("tp", "cr"), d = c(2,1)) + + pr + s(X,Y, by = pr) + s(month, k=20, by = pr), + data = chaco, + method = "REML", + family = gaussian() +) +``` +The `gam.check` function calls the `k.check` function, provides some interpretation notes and diagnostic plots. However, the `k.check` function can be called directly: +```{r eval = F, echo = T} +# check the model smooths +k.check(m3) +``` + +```{r eval = T, echo = F} +k_check_m3 +``` + +This shows some improvement, with the EDF now increased but still close to `k` for the smooth and the `k-index` still much lower than 1. The model below tries to improve the fit by increasing `k` once more: +```{r m4, eval = F, echo = T} +# fourth model with increased k for the pr temporal smooth +m4 <- gam( + ndvi ~ 0 + + Intercept + te(X,Y,month, by=Intercept, bs = c("tp", "cr"), d = c(2,1)) + + tmax + te(X,Y,month, by = tmax, bs = c("tp", "cr"), d = c(2,1)) + + pr + s(X,Y, by = pr) + s(month, k=40, by = pr), + data = chaco, + method = "REML", + family = gaussian() +) +# check the model smooths +k.check(m4) +``` + +```{r eval = T, echo = F} +k_check_m4 +``` + +Again some improvement is shown but this can be nudged further: +```{r m5, eval = F, echo = T} +# fifth model with increased k for the pr temporal smooth +m5 <- gam( + ndvi ~ 0 + + Intercept + te(X,Y,month, by=Intercept, bs = c("tp", "cr"), d = c(2,1)) + + tmax + te(X,Y,month, by = tmax, bs = c("tp", "cr"), d = c(2,1)) + + pr + s(X,Y, by = pr) + s(month, k=80, by = pr), + data = chaco, + method = "REML", + family = gaussian() +) +# check the model smooths +k.check(m5) +``` + +```{r eval = T, echo = F} +k_check_m5 +``` + +Some further improvement is evident but this process of increasing `k` should continue. However, the number of knots cannot be increased infinitely due the data observations, the number of terms in the model and how they combine to determine the effective degrees of freedom (`edf`). The code below increases the number of knots by 40 as in this case there are only 120 unique values for months. + +```{r m6, eval = F, echo = T} +# sixth model with increased k for the pr temporal smooth +m6 <- gam( + ndvi ~ 0 + + Intercept + te(X,Y,month, by=Intercept, bs = c("tp", "cr"), d = c(2,1)) + + tmax + te(X,Y,month, by = tmax, bs = c("tp", "cr"), d = c(2,1)) + + pr + s(X,Y, by = pr) + s(month, k=120, by = pr), + data = chaco, + method = "REML", + family = gaussian() +) +# check the model smooths +k.check(m6) +``` + +```{r eval = T, echo = F} +k_check_m6 +``` + +In this case the check shows that there is sufficiently a large difference between `k` (120) and `edf` (~80) and the `k-index` is closer to 1 (0.84). The final model can be examined the usual way: +```{r eval = F} +appraise(m6, method = "simulate") +``` + +The effects of the terms can be examined as before: +```{r eval = F, echo = T} +effect_size(m6) +``` + +```{r eval = T, echo = F} +effect_m6 +``` + +The refined model above, `m6`, shows improvement, with the effect of `pr` over time relatively high (`s(month):pr`). Its effect over over space (`s(X,Y):pr`) is relatively low and potentially the spatial smooth for `pr` could be dropped. Some confirmation for this decision is provided when the module summary is examined: + +```{r echo = T, eval = F} +summary(m6) +``` + +```{r eval = T, echo = F} +cat(sum_m6, sep = "\n") +``` + + +However in this case, the `m6` model is retained as the final model as the effect sizes are relatively strong for both components. It specifies a space-time TP smooth for the Intercept and for `tmax`, and separate spatial and temporal smooth for `pr`. All of the parametric terms are retained: + +The size of the effects can be examined and compared with the original model `m1`: +```{r eval = F, echo = T} +# original model +effect_size(m1) +``` +```{r eval = T, echo = F} +effect_m1 +``` +```{r eval = F, echo = T} +# tuned model +effect_size(m6) +``` +```{r eval = T, echo = F} +effect_m6 +``` + + +The ranges of the final model (`m6`) still indicate the strong space-time interaction of NDVI with the `tmax` in the TP smooth (`te(X,Y,month):tmax`) and the interactions of `pr` in separate space and time smooths (`s(X,Y):pr` and `s(month):pr)`). + +The AICs of the 2 models can be compared and the final model shows a dramatic improvement (i.e. reduction in value) from the original: +```{r eval = F, echo = T} +# AIC comparison +m1$aic; m6$aic +``` + +```{r eval = T, echo = F} +aic_m1; aic_m6 +``` + + + +### 2.5. Extracting, summarising and plotting the coefficients + +The spatially and temporally varying coefficients estimates generated by the model can be extracted in a number of different ways using the `calulate_vcs` function in the `stgam` package. There are basically 2 options for generating the varying coefficient estimates: i) over the original data points or ii) over spatial surfaces for specific time slices. + +For the first option, the original data, the GAM and a vector of the model terms (i.e. predictor variables names) are simply passed to the `calculate_vcs` function. The second option requires slices of the space-time data to be passed to the `calculate_vcs` function and a bit more work to process the results. In both cases the results can be summarised and mapped. + +Considering first the original observations: + +```{r vcs, eval = F, echo = T} +vcs <- calculate_vcs(input_data = chaco, + mgcv_model = m6, + terms = c("Intercept", "tmax", "pr")) +``` + +A summary over space and time of the coefficient estimates shows that the Intercept is large and varying and that `tmax` and `pr` also have varying relationships with the target variable over space and time, with negative and positive values in some places and / or at some times. Their values reflect the association of degrees Centigrade and millimetres of rainfall. Considering the mean values, each additional 1 degree of `tmax` is associated with a 0.0222 decrease in NDVI (biomass), and each additional millimetre of rainfall is associated with a 0.0013 increase in NDVI. + +```{r, warning=F, message=FALSE} +vcs |> + # drop the geometry if the object is spatial + st_drop_geometry() |> + select(starts_with("b_")) |> + apply(2, summary) |> + round(4) +``` + +These can be explicitly examined over time and notice the increasingly positive relationship with NDVI over time with the Intercept and the increasingly negative relationship of Maximum temperature. There is much more variation in the temporal trend of the relationship between NDVI and Precipitation. + +```{r, final_time, fig.height = 3, fig.width = 7, fig.cap = "The variations over time of the final model coefficient estimates."} +vcs |> + st_drop_geometry() |> + select(date, starts_with("b_")) |> + rename(`Intercept` = b_Intercept, + `Max Temperature` = b_tmax, + `Precipitation` = b_pr) |> + pivot_longer(-date) |> + group_by(date, name) |> + summarise( + lower = quantile(value, 0.25), + median = median(value), + upper = quantile(value, 0.75) + ) |> + ggplot(aes(x = date, y = median)) + + geom_point(col = "blue", alpha = 0.2) + + geom_smooth() + + facet_wrap(~name, scale = "free_y") + + theme_bw() + xlab("") + ylab("") + + theme(strip.background = element_rect(fill="white")) + +``` + +Standard `ggplot` approaches can be used to map the coefficient estimates as in the figure below, which summarises the coefficient estimates for `pr` (`b_pr`) by year. The code uses the `year` variable to facets the maps of the spatially varying coefficients. Notice the increasingly negative relationship of Maximum temperature with NDVI over time, with a distinct South West (low) to North East (high) gradient. + +```{r svccoefs, fig.height = 6, fig.width = 7, fig.cap = "The varying `tmax` (Maximum temperature) coefficient estimates over space and time."} +# title +tit <-expression(paste(""*beta[`tmax`]*"")) +# plot +ggplot() + + geom_sf(data = vcs, aes(col = b_tmax)) + + scale_colour_continuous_c4a_div(palette="brewer.rd_yl_bu",name = tit) + + facet_wrap(~year) + + theme_bw() + + theme( + legend.position = "inside", + legend.position.inside = c(0.9, 0.17), + strip.background = element_rect(fill="white", colour = "white"), + strip.text = element_text(size = 8, margin = margin(b=4)), + axis.title=element_blank(), + axis.text=element_blank(), + axis.ticks=element_blank()) +``` + +A second approach is to estimate the varying coefficients over a grid. The code below creates the `chaco_grid` spatial object, in the same way as was done in the introductory vignette, but this time also adding dummy variables: + +```{r create_lgrid, warning=FALSE, message=FALSE} +# create a grid object from the Chaco data +chaco_grid <- + st_make_grid(chaco, square=FALSE,n=20) |> + st_sf() +# rename the geometry, sort row names +st_geometry(chaco_grid) = "geometry" +rownames(chaco_grid) <- 1:nrow(chaco_grid) +# create and add coordinates X and Y / 1000 +coords <- chaco_grid |> st_centroid() |> st_coordinates() +chaco_grid <- chaco_grid |> bind_cols(coords/1000) +# add dummy variables to the grid +chaco_grid <- + chaco_grid |> + mutate(Intercept = NA_real_, + tmax = NA_real_, + pr = NA_real_) +``` + +You may wish to inspect this object and what it contains : +```{r eval = F} +chaco_grid +plot(st_geometry(chaco_grid)) +``` + +In order to generate annual gridded estimates of the coefficients over grid space, the model passed to the `calculate_vcs` function requires the space and time inputs that were used in the model: `X`, `Y` and `month`. The `chaco_grid` object has locational attributes but the temporal attributes need to be created. The `for` loop below does this, extracts the coefficient estimates for the specified month - here the sixth month in each year - and then combines them. + +```{r svccoefs2, eval = F, echo = T} +# create time slices +years <- sort(unique(vcs$year)) +# calculate over the grid for each time slice +res_out <- matrix(nrow = nrow(chaco_grid), ncol = 0) +for (i in 1:length(years)){ + # convert years to months - here getting month 6 + month.val <- ((years[i]-min(years)) * 12) + 6 + res.i <- calculate_vcs(input_data = chaco_grid |> mutate(month = month.val), + mgcv_model = m6, + terms = c("Intercept", "tmax", "pr")) + # select all the coefficient estimates + res.i <- + res.i |> + st_drop_geometry() |> + select(starts_with("b_"), + starts_with("se_"), + starts_with("t_")) + # rename coefficients estimates + names(res.i) <- paste0(names(res.i), "_", years[i]) + # bind to the result + res_out <- cbind(res_out, res.i) + cat(years[i], "\t") +} +``` + +The result can be joined to the grid and mapped. The code below does this for `tmax` (Maximum temperature) as this has space-time TP smooth. The maps describe the variations over time and space of the relationship between the `tmax` predictor variable and `ndvi`. The maps indicate the increasingly negative relationship over time and the South West to North East gradient as before. + +```{r svccoefs3, cache = cache.val, fig.height = 6, fig.width = 7, fig.cap = "The varying `tmax` (Maximum temperature) coefficient estimates over space and time and the grid surface."} +# join to the grid +chaco_grid |> + cbind(res_out) |> + # select the variables and pivot longer + select(starts_with("b_tmax")) |> + # rename and select with transmute + transmute(`2012` = b_tmax_2012, `2013` = b_tmax_2013, + `2014` = b_tmax_2014, `2015` = b_tmax_2015, + `2016` = b_tmax_2016, `2017` = b_tmax_2017, + `2018` = b_tmax_2018, `2019` = b_tmax_2019, + `2020` = b_tmax_2020, `2021` = b_tmax_2021, + `2022` = b_tmax_2022 + ) |> + pivot_longer(-geometry) |> + # make the new time object a factor (to enforce plotting order) + mutate(name = factor(name, levels = years)) |> + # 4. and plot + ggplot() + + geom_sf(aes(fill = value), col = NA) + + # adjust default shading + scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = tit) + + # facet + facet_wrap(~name, ncol = 4) + + # apply and modify plot theme + theme_bw() + + theme( + legend.position = "inside", + legend.position.inside = c(0.9, 0.17), + strip.background = element_rect(fill="white", colour = "white"), + strip.text = element_text(size = 8, margin = margin(b=4)), + axis.title=element_blank(), + axis.text=element_blank(), + axis.ticks=element_blank()) +``` + + +```{r echo = F, eval = F} +# save summaries for CRAN vignette +p_m1 <- appraise(m1, method = "simulate") + +sum_m1 <- summary(m1) +effect_m1 <- apply(f1, 2, function(v) c(sd = sd(v), range = diff(range(v)))) |> round(3) + +effect_m2 <- effect_size(m2) + +# check the model smooths +k_check_m3 <- k.check(m3) + +k_check_m4 <- k.check(m4) + +k_check_m5 <- k.check(m5) +k_check_m6 <- k.check(m6) + +aic_m1 <- m1$aic +aic_m2 <- m2$aic +aic_m3 <- m3$aic +aic_m4 <- m4$aic +aic_m5 <- m5$aic +aic_m6 <- m6$aic + +setwd("~/Library/CloudStorage/Dropbox/Lex_GGP_GAM/stgam/vignettes") +save(p_m1, sum_m1, effect_m1, effect_m2, k_check_m3, + k_check_m4, k_check_m5, k_check_m6, sum_m6, + aic_m1, aic_m2, aic_m3, aic_m4, aic_m5, aic_m6, + vcs, years, res_out, + file = "model_summaries.RData") +``` + + +### 2.6 Summary + +The above subsections describes a sequence of model construction, investigation and refinement, which, after specifying an initial Gaussian model that incorporated space-time smooths for each predictor variable, required a number of GAM-related considerations to be expanded upon. + +- The initial model used Tensor Product smooths, under an assumption of space-time interactions, rather than separate space and time effects. These required the basis dimensions of each component to be specified (through the `d` parameter), here specifying a 2-dimensional basis for location and a 1-dimensional one for time). +- Additionally thin plate regression smooths (`tp`) were used to model location because they are isotropic with no directional bias) and can handle irregularly located observations. Cubic regression smooths (`cr`) were used to model time effects because there known to be good at handling seasonal or long-term effects. +- The model was fitted using REML as this tends to provide more stable estimates of the smoothing parameters and robust estimates of the variance. However, model diagnostic investigations suggested residual heterogeneity, potentially due to the incorrect assumption of a Gaussian response. +- A sequence of refinements were then undertaken and a number of considerations to be discussed: different model families and when they should be used, quantifying the magnitude of the effect size of the model terms, and checking and adjusting the knots used in model fitting procedure in order to determine the final model form. + +In this case, the final model specified was Gaussian with a combined space-time TP smooth for `tmax` and a spatial smooth for `pr` with the number of knots set to 120. + +So although statistical practice tends towards fitting the more complex model, penalizing the estimated effects and then evaluating the magnitude of those estimated effects, as was done here, it also suggests that there may be some benefit in undertaking some kind of automatic model selection (see Section 3). + +## 3. Working with `stgam`: model selection + +The models constructed in Section 2 started with an initial model that was gradually refined through a series of investigations. Each predictor variable and the Intercept was modelled a parametric term with combined space-time smooths in the initial `m1` model. After investigation and refinement the `tmax` predictor variable was included in parametric form and within a space-time TP smooth and `pr` in a temporal smooth with the number of knots specified ($k = 120$). + +This poses the question of which model form to use? How can the model best be specified? A key focus of the `stgam` package is to answer this question by creating and evaluating multiple models. + +### 3.1 Using AIC to evaluate models + +One way to evaluate models is to compare them using some metric. Commonly AIC [@akaike1998information] is used to compare models. It is an estimator of the leave-one-out cross validated prediction error and provides a parsimonious measure of model fit that balances model complexity and prediction accuracy. Using AIC to evaluate different models therefore identifies the model with the lowest prediction error. + +The code below extracts the AIC from each of the models specified in the last section and prints them out in order. Here it can be seen that the `m6` model has the lowest prediction error (lowest AIC), with `pr` specified in separate space and time smooths. + +```{r echo = T, eval = F} +data.frame(Model = paste0("m", c(1:6)), + AIC = c(m1$aic, m2$aic, m3$aic, m4$aic, m5$aic, m6$aic)) |> + # rank the models + arrange(AIC) +``` +```{r eval = T, echo = F} +data.frame(Model = paste0("m", c(1:6)), + AIC = c(aic_m1, aic_m2, aic_m3, aic_m4, aic_m5, aic_m6)) |> + # rank the models + arrange(AIC) +``` + +### 3.2 Model selection: determining GAM form + +Given the the model investigation and refinement in the previous section and the model evaluations by AIC above, one option to determining model form is to create a series of models and identify the one with the best fit through AIC, given the caveats described above. + +For a space-time GAM each predictor variable could be specified in one of 6 ways: + +i. It is omitted. +ii. It is included as a parametric response with no smooth. +iii. It is included in parametric form and in a spatial smooth with location. +iv. It is included in parametric form and in a temporal smooth with time. +v. It is included in parametric form and in a single space-time smooth. +vi. It is included in parametric form and in 2 separate space and time smooths. + +The intercept can be treated similarly, but without it being absent (i.e. 5 options). Obviously there are 3 options for each predictor variable in a spatial model (i., ii. and iii.) and in a temporal model (i., ii. and iv.), with each having 2 options for the intercept. Thus for purely spatial or temporal regressions, with $k$ predictor variables, there are $2 \times 3^k$ potential models to evaluate. For a space-time there are $5 \times 6^k$ potential models. + +Recall that the `chaco` dataset contains two numeric variables that could be of use in explaining the space-time variations in the `ndvi` target variable, well as location (`X` and `Y`) and time (`month`) + +```{r} +chaco |> st_drop_geometry() |> head() +``` + +Thus with $k = 2$ variables could are 180 potential space-time models to evaluate and 18 potential models in a spatial or temporal case study. The example below uses these 2 predictor variables evaluate 180 potential space-time models. This is done via the `evaluate_models()` function in the `stgam` package. This constructs and compares the full set of potential models, each specifying each predictor variable and the intercept in a different way. The separate space and time smooths are specified as thin plate regression smooths using the `s()` function in `mgcv` and the combined space-time smooths are specified as Tensor Product (TP) smooths (`te()`) with a 2D thin plate smooth for location and 1D cubic regression smooth for time as in Section 2.2. + +The function also includes two optional logical parameters for adjusting the number of knots specified in each smooth. The first is `k_set`. This allows the user to specify `k` in the smooths via the `spatial_k` and `temporal_k` parameters. These applied to the thin plate spline spatial and temporal smooths individually, and combined in the TP smooths. The second is `k_increase` This checks on the number of knots specified in each smooth generated by the `gam` function relative to the effective degrees of freedom, as in the outputs of the `k.check` function above. If the `increase_k` parameter is set to `TRUE`, then a threshold of the ratio of the number of knots in each smooth to its EDF can be specified via the `k2edf_ratio` parameter and if any smooth has a *knots-to-EDF* ratio less than this value (i.e. the number of knots is close to the EDF), then the knots are iteratively increased until the threshold check is passed, the number knots passes the maximum degrees of freedom (as in the worked example in Section 2.4 above), or the number of iterations (`max_iter`) is reached. The the default is that they are doubled on each iteration, but this too can be changed via the `k_multiplier` parameter. If the knots do not need to be increased, the ones generated in the fitting of the GAM by the `mgcv` package are retained. The example below applies the second of these, `k_increase`. + +Each model is evaluated via its AIC value. AIC [@akaike1998information] provides an estimate of the leave-one-out cross-validated prediction error and the model with lowest AIC can be selected. + +The code below specifies a spatially and temporally varying coefficient model (STVC) using the `tmax` and `pr` variables determine a space-time regression mode. Note that the function specifies `ncores` to be set to 2 by default. This is to pass CRAN diagnostic checks for package submission - you may want to specify more using `detectCores()-1` from the `parallel` package. Note also that the input data needs to be a `data.frame` or `tibble` object and it needs to have a defined Intercept term. This can be created in the following way `input_data |> mutate(Intercept = 1)`, as was done at the end of Section 1 for `chaco`. + +```{r eval_stvc, eval = F, cache = cache.val, warning=F, message=F} +library(doParallel) +# ncores <- detectCores()-1 +t1 <- Sys.time() +stvc_mods <- evaluate_models( + input_data = chaco |> st_drop_geometry(), + target_var = "ndvi", + family = "gaussian()", + vars = c("tmax", "pr"), + coords_x = "X", + coords_y = "Y", + VC_type = "STVC", + time_var = "month", + k_set = FALSE, + spatial_k = NULL, + temporal_k = NULL, + k_increase = TRUE, + k2edf_ratio = 1.5, + k_multiplier = 2, + max_iter = 10, + ncores = 15 +) +Sys.time() - t1 # about 29 minutes on M4 Mac with 64GB of RAM +``` + +```{r echo = F, eval = T} +# precomputed to get through CRAN checks +# save(stvc_mods, file = "stvc_mods.RData") +load("stvc_mods.RData") +``` + +The best $n$ models can be extracted (i.e. those with the lowest AIC score) using the `gam_model_rank()` function. The result of doing this is shown below. Interestingly few of the smooths in the top 10 models contain combined space-time TP smooths (`te(ST)`), indicating a lack of interaction between spatial and temporal effects. The separate spatial (`s(S)`) and temporal (`s(T)`) smooths for the Intercept have had $k$ increased as have the spatial smooths for for `tmax` and `pr`. Also note that the top 3 models variously specify similar spatial smooths or spatial plus temporal smooths for each of the predictor variables and the similarity of the AIC values. Overall this suggests that while there are spatial and temporal dependencies with the target variable, these are not interacting over space-time. + +```{r} +mod_comp <- gam_model_rank(stvc_mods, n = 10) +# have a look +mod_comp |> select(-c(f, ks)) +``` + + +### 3.3 The 'best' model + +The highest ranked model can be specified for use in analysis. This includes the Intercept in separate space-time smooths and spatial smooths for `tmax` and `pr`. The formula can be extracted and inspected: + +```{r} +f <- as.formula(mod_comp$f[1]) +f +``` + +The formula can be used to specify a `mgcv` GAM using a REML approach. The resulting model is checked for over-fitting using the `k.check` function in the`mgcv` package. Here the `k-index` values are near to 1, the `k'` and , except for the temporal smooth for the Intercept the `edf` the `edf` values are all much less than the `k'` values, so this model is reasonably well tuned. + +```{r final_mod, cache = cache.val} +# specify the model +gam.m <- gam(f, data = chaco, method = "REML") +# check k +k.check(gam.m) +``` + +A summary of the model can be examined and this shows that of the parametric terms only the `Intercept` is significant as all fo the spatial and temporal smooths. + +```{r} +summary(gam.m) +``` + + + +### 3.4 Plot and map the results + +The varying coefficient estimates can be extracted, examined, plotted and mapped as before using the code below. There are no temporal smooths specified in the model for the for the predictor variables `tmax` and `pr` and perhaps unsurprisingly the general trends of coefficient estimates and the relationships with NDVI are over time that they describe, are relatively consistent. + +```{r final_time2, eval = T, echo = T, message = F, warning=F, fig.height = 3, fig.width = 7} +# extract the VCs +vcs <- calculate_vcs(input_data = chaco, + mgcv_model = gam.m, + terms = c("Intercept", "tmax", "pr")) +# examine +vcs |> + # drop the geometry if the object is spatial + st_drop_geometry() |> + select(starts_with("b_")) |> + apply(2, summary) |> + round(5) +``` + +```{r plot_final_time2, eval = T, echo = T, message = F, warning=F, fig.height = 3, fig.width = 7, fig.cap = "The variations over time of the model coefficient estimates."} + +# plot over time +vcs |> + st_drop_geometry() |> + select(date, starts_with("b_")) |> + rename(`Intercept` = b_Intercept, + `Max Temperature` = b_tmax, + `Precipitation` = b_pr) |> + pivot_longer(-date) |> + group_by(date, name) |> + summarise( + lower = quantile(value, 0.25), + median = median(value), + upper = quantile(value, 0.75) + ) |> + ggplot(aes(x = date, y = median)) + + geom_point(col = "blue", alpha = 0.2) + + geom_smooth() + + facet_wrap(~name, scale = "free_y") + + theme_bw() + xlab("") + ylab("") + + theme(strip.background = element_rect(fill="white")) + +``` +Next they can be mapped over space and time using the original observation locations and the `year` of observation. Notice that the gradient is now South to North, with little variation over time: + +```{r svccoefs4, fig.height = 6, fig.width = 7, fig.cap = "The varying `tmax` (Maximum temperature) coefficient estimates over space and time."} +# title +tit <-expression(paste(""*beta[`tmax`]*"")) +# plot +ggplot() + + geom_sf(data = vcs, aes(col = b_tmax)) + + scale_colour_continuous_c4a_div(palette="brewer.rd_yl_bu",name = tit) + + facet_wrap(~year) + + theme_bw() + + theme( + legend.position = "inside", + legend.position.inside = c(0.9, 0.17), + strip.background = element_rect(fill="white", colour = "white"), + strip.text = element_text(size = 8, margin = margin(b=4)), + axis.title=element_blank(), + axis.text=element_blank(), + axis.ticks=element_blank()) +``` + +This lack of temporal variation is confirmed if the coefficient estimates are mapped annually over the grid: + +```{r svccoefs5, eval = T, echo = T, cache = cache.val, fig.cap = "The varying `tmax` (Maximum temperature) coefficient estimates over space and time and the grid surface."} +# create time slices +years <- sort(unique(vcs$year)) +# calculate over the grid for each time slice +res_out <- matrix(nrow = nrow(chaco_grid), ncol = 0) +for (i in 1:length(years)){ + # convert years to months - here getting month 6 + month.val <- ((years[i]-min(years)) * 12) + 6 + res.i <- calculate_vcs(input_data = chaco_grid |> mutate(month = month.val), + mgcv_model = gam.m, + terms = c("Intercept", "tmax", "pr")) + # select all the coefficient estimates + res.i <- + res.i |> + st_drop_geometry() |> + select(starts_with("b_"), + starts_with("se_"), + starts_with("t_")) + # rename coefficients estimates + names(res.i) <- paste0(names(res.i), "_", years[i]) + # bind to the result + res_out <- cbind(res_out, res.i) + # cat(years[i], "\t") +} +# join to the grid +chaco_grid |> + cbind(res_out) |> + # select the variables and pivot longer + select(starts_with("b_tmax")) |> + # rename and select with transmute + transmute(`2012` = b_tmax_2012, `2013` = b_tmax_2013, + `2014` = b_tmax_2014, `2015` = b_tmax_2015, + `2016` = b_tmax_2016, `2017` = b_tmax_2017, + `2018` = b_tmax_2018, `2019` = b_tmax_2019, + `2020` = b_tmax_2020, `2021` = b_tmax_2021, + `2022` = b_tmax_2022 + ) |> + pivot_longer(-geometry) |> + # make the new time object a factor (to enforce plotting order) + mutate(name = factor(name, levels = years)) |> + # 4. and plot + ggplot() + + geom_sf(aes(fill = value), col = NA) + + # adjust default shading + scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = tit) + + # facet + facet_wrap(~name, ncol = 4) + + # apply and modify plot theme + theme_bw() + + theme( + legend.position = "inside", + legend.position.inside = c(0.9, 0.17), + strip.background = element_rect(fill="white", colour = "white"), + strip.text = element_text(size = 8, margin = margin(b=4)), + axis.title=element_blank(), + axis.text=element_blank(), + axis.ticks=element_blank()) +``` + + +### 3.5 Summary + +This section has illustrated the the use of functions in the the `stgam` package. It suggests the following workflow: + +1. Prepare the data, for example by lengthening the `data.frame`, `tibble` or `sf` object containing the data, to have observations with single location and / or time values (in the above examples these were `X`,`Y`and `month`), and an Intercept as an addressable term. +2. Evaluate all possible models. For spatial or temporal problems each predictor is specified in 3 ways, for space-time problems each predictor is specified in 6 ways. +3. Rank the models by their AIC score, identify any consistent model specification trends in the top ranked models and select the best model with the lowest AIC score (i.e. the model with the lowest prediction error). +4. Extract the formula and create the final model. +5. Calculate the varying coefficient estimates: to quantify how the relationships between the target and predictor variables vary over space, time or space-time. +6. Create maps, time series plots etc. + +This workflow evaluates and ranks multiple models using the model AIC value. This was done algorithmically using the `evaluate_models()` function. However, this was not undertaken in isolation. Rather it built on the investigations in Section 3 to determine whether space and time effects were present in the data. This set of model investigations were undertaken to both develop and confirm knowledge of processes related to NDVI in the Chaco dry rainforest. That is, the analysis was both considered and contextual. For example, in this section it was determined that a spatially varying intercept term was appropriate but with a different dataset others forms may be suggested. Similarly a larger model space could be examined with more variables. The `stgam` package allows these choices to be validated through an automated approach, providing an exploration of the full set of potential choices. + +### 3.6 Useful resources + +We think these talks by Noam Ross provide really useful tips and pointers for working with GAMs and `mgcv`: + +- https://www.youtube.com/watch?v=q4_t8jXcQgc +- https://www.youtube.com/watch?v=sgw4cu8hrZM + + +### 3.7 Final words + +The rationale for using GAMs with TP smooths for spatially varying coefficient or temporally varying coefficient models is as follows: + +- GAMs with smooths (splines) capture non-linear relationships between the response variable and covariates. +- Smooths generate a varying coefficient model when they are parameterised with more than one variable. +- This is readily extending to the temporal and / or spatial dimensions to generate temporally, spatially and space-time varying coefficient models. +- Different smooths are available, but Tensor Products smooths can be used combine space and time because they undertake anisotropic smoothing across space and time separately, that have different measurement scales. +- GAMs are robust, have a rich theoretical background and been subject to much development. + +The workflow suggested in this vignette and in the `stgam` package is to determine the most appropriate model form (both steps evaluated by minimising AIC. The 'best' model can then be extracted an examined for the nature of the space-time interactions it suggests, before generating the varying coefficient estimates and mapping or plotting the results. + +Future developments will seek to examine the scales of spatial dependencies in the data sing kriging based approaches and will extend the `stgam` package to work with large data (i.e. to draw from the functionality of the `bam` function in `mgcv`). + + +```{r echo = F, eval = F} +# save summaries for CRAN vignette +# 1. save PNGs to local directory +library(patchwork) +p_m1 <- appraise(m1, method = "simulate") +plots <- patchwork:::get_patches(p_m1)$plots +for (i in seq_along(plots)) { + ggsave( + filename = paste0("appraise_plot_", i, ".png"), + plot = plots[[i]], + width = 6, + height = 4 + ) +} +# 2. other stuff +# sum_m1 <- summary(m1) +sum_m1 <- capture.output(summary(m1)) +# cat(sum_m1, sep = "\n") + +effect_m1 <- apply(f1, 2, function(v) c(sd = sd(v), range = diff(range(v)))) |> round(3) + +effect_m2 <- effect_size(m2) + +# check the model smooths +k_check_m3 <- k.check(m3) + +k_check_m4 <- k.check(m4) + +k_check_m5 <- k.check(m5) + +k_check_m6 <- k.check(m6) +effect_m6 <- effect_size(m6) + +# sum_m6 <- summary(m6) +sum_m6 <- capture.output(summary(m6)) +# cat(sum_m6, sep = "\n") + +aic_m1 <- m1$aic +aic_m2 <- m2$aic +aic_m3 <- m3$aic +aic_m4 <- m4$aic +aic_m5 <- m5$aic +aic_m6 <- m6$aic + +setwd("~/Library/CloudStorage/Dropbox/Lex_GGP_GAM/stgam/vignettes") +save(sum_m1, effect_m1, effect_m2, k_check_m3, + k_check_m4, k_check_m5, k_check_m6, effect_m6, sum_m6, + aic_m1, aic_m2, aic_m3, aic_m4, aic_m5, aic_m6, + vcs, years, res_out, + file = "model_summaries.RData", + compress = "xz") +``` + +## References + diff --git a/vignettes/stvc_mods.RData b/vignettes/stvc_mods.RData new file mode 100644 index 0000000..3ef9f86 Binary files /dev/null and b/vignettes/stvc_mods.RData differ diff --git a/vignettes/vignette.bib b/vignettes/vignette.bib index 52f1a52..aa29641 100644 --- a/vignettes/vignette.bib +++ b/vignettes/vignette.bib @@ -1,13 +1,181 @@ %% This BibTeX bibliography file was created using BibDesk. %% https://bibdesk.sourceforge.io/ -%% Created for Alexis Comber at 2024-05-13 17:04:09 +0100 +%% Created for Alexis Comber at 2026-01-23 09:35:30 +0000 %% Saved with string encoding Unicode (UTF-8) +@article{li2023spatiotemporally, + author = {Li, Muyi and Cao, Sen and Zhu, Zaichun and Wang, Zhe and Myneni, Ranga B and Piao, Shilong}, + date-added = {2026-01-23 09:35:25 +0000}, + date-modified = {2026-01-23 09:35:25 +0000}, + journal = {Earth System Science Data}, + number = {9}, + pages = {4181--4203}, + publisher = {Copernicus Publications G{\"o}ttingen, Germany}, + title = {Spatiotemporally consistent global dataset of the GIMMS Normalized Difference Vegetation Index (PKU GIMMS NDVI) from 1982 to 2022}, + volume = {15}, + year = {2023}} + +@article{gorelick2017google, + author = {Gorelick, Noel and Hancher, Matt and Dixon, Mike and Ilyushchenko, Simon and Thau, David and Moore, Rebecca}, + date-added = {2026-01-23 09:34:30 +0000}, + date-modified = {2026-01-23 09:34:30 +0000}, + journal = {Remote sensing of Environment}, + pages = {18--27}, + publisher = {Elsevier}, + title = {Google Earth Engine: Planetary-scale geospatial analysis for everyone}, + volume = {202}, + year = {2017}} + +@article{harris2017introducing, + author = {Harris, Paul and Brunsdon, Chris and Lu, Binbin and Nakaya, Tomoki and Charlton, Martin}, + date-added = {2026-01-22 14:23:00 +0000}, + date-modified = {2026-01-22 14:23:00 +0000}, + journal = {Spatial Statistics}, + pages = {241--261}, + publisher = {Elsevier}, + title = {Introducing bootstrap methods to investigate coefficient non-stationarity in spatial regression models}, + volume = {21}, + year = {2017}} + +@article{harris2019simulation, + author = {Harris, Paul}, + date-added = {2026-01-21 15:59:00 +0000}, + date-modified = {2026-01-21 15:59:00 +0000}, + journal = {Geographical Analysis}, + number = {2}, + pages = {151--181}, + publisher = {Wiley Online Library}, + title = {A simulation study on specifying a regression model for spatial data: Choosing between autocorrelation and heterogeneity effects}, + volume = {51}, + year = {2019}} + +@webpage{simpson2025package, + author = {Simpson, Gavin}, + date-added = {2026-01-21 11:33:00 +0000}, + date-modified = {2026-01-21 11:33:52 +0000}, + rating = {5}, + title = {gratia: Graceful 'ggplot'-Based Graphics and Other Functions for GAMs Fitted Using 'mgcv'}, + url = {https://CRAN.R-project.org/package=gratia}, + year = {2025}, + bdsk-url-1 = {https://CRAN.R-project.org/package=mgcv}} + +@article{simpson2024gratia, + author = {Simpson, Gavin L}, + date-added = {2026-01-21 11:29:38 +0000}, + date-modified = {2026-01-21 11:29:38 +0000}, + journal = {arXiv preprint arXiv:2406.19082}, + title = {gratia: An R package for exploring generalized additive models}, + year = {2024}} + +@webpage{bin2024, + author = {Bin, Chi and Dennett, Adam and Ol{\'e}ron-Evans, Thomas and Morphet, Robin}, + date-added = {2025-05-08 11:20:01 +0100}, + date-modified = {2025-05-08 11:20:01 +0100}, + title = {House Price per Square Metre in England and Wales, 1995-2024}, + url = {https://reshare.ukdataservice.ac.uk/855033/}, + year = {2024}, + bdsk-url-1 = {https://cran.rediris.es/web/packages/mcmc/vignettes/demo.pdf}} + +@article{marra2011practical, + author = {Marra, Giampiero and Wood, Simon N}, + date-added = {2025-04-16 16:58:50 +0100}, + date-modified = {2025-04-16 16:58:50 +0100}, + journal = {Computational Statistics \& Data Analysis}, + number = {7}, + pages = {2372--2387}, + publisher = {Elsevier}, + title = {Practical variable selection for generalized additive models}, + volume = {55}, + year = {2011}} + +@article{hurvich1989regression, + author = {Hurvich, Clifford M and Tsai, Chih-Ling}, + date-added = {2025-04-16 16:51:46 +0100}, + date-modified = {2025-04-16 16:51:46 +0100}, + journal = {Biometrika}, + number = {2}, + pages = {297--307}, + publisher = {Oxford University Press}, + title = {Regression and time series model selection in small samples}, + volume = {76}, + year = {1989}} + +@incollection{akaike1998information, + author = {Akaike, Hirotogu}, + booktitle = {Selected papers of hirotugu akaike}, + date-added = {2025-04-16 16:49:02 +0100}, + date-modified = {2025-04-16 16:49:02 +0100}, + pages = {199--213}, + publisher = {Springer}, + title = {Information theory and an extension of the maximum likelihood principle}, + year = {1998}} + +@article{fotheringham2017multiscale, + author = {Fotheringham, A Stewart and Yang, Wenbai and Kang, Wei}, + date-added = {2025-04-03 09:32:20 +0100}, + date-modified = {2025-04-03 09:32:20 +0100}, + journal = {Annals of the American Association of Geographers}, + number = {6}, + pages = {1247--1265}, + publisher = {Taylor \& Francis}, + title = {Multiscale geographically weighted regression (MGWR)}, + volume = {107}, + year = {2017}} + +@phdthesis{yang2014extension, + author = {Yang, Wenbai}, + date-added = {2025-04-03 09:32:04 +0100}, + date-modified = {2025-04-03 09:32:04 +0100}, + school = {University of St Andrews}, + title = {An extension of geographically weighted regression with flexible bandwidths}, + year = {2014}} + +@inproceedings{comber2023multiscale, + author = {Comber, Alexis and Harris, Paul and Brunsdon, Chris}, + booktitle = {12th International Conference on Geographic Information Science (GIScience 2023)}, + date-added = {2025-04-02 12:07:58 +0100}, + date-modified = {2025-04-02 12:07:58 +0100}, + organization = {Schloss Dagstuhl-Leibniz-Zentrum f{\"u}r Informatik}, + title = {Multiscale Spatially and Temporally Varying Coefficient Modelling Using a Geographic and Temporal Gaussian Process GAM (GTGP-GAM)(Short Paper)}, + year = {2023}} + +@manual{mgcv, + author = {Simon Wood}, + date-added = {2025-04-02 12:06:52 +0100}, + date-modified = {2025-04-02 12:06:52 +0100}, + note = {R package version 1.9-1}, + title = {{mgcv}: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation}, + url = {https://CRAN.R-project.org/package=mgcv}, + year = {2021}, + bdsk-url-1 = {https://CRAN.R-project.org/package=crosstalk}} + +@manual{stgam, + author = {Lex Comber and Paul Harris and Chris Brunsdon}, + date-added = {2025-04-02 12:06:43 +0100}, + date-modified = {2025-04-16 16:26:16 +0100}, + note = {R package version 0.0.1.0}, + title = {{stgam}: Spatially and Temporally Varying Coefficient Models Using Generalized Additive Models}, + url = {https://CRAN.R-project.org/package=stgam}, + year = {2024}, + bdsk-url-1 = {https://CRAN.R-project.org/package=crosstalk}} + +@article{comber2024encapsulating, + author = {Comber, Alexis and Harris, Paul and Murakami, Daisuke and Nakaya, Tomoki and Tsutsumida, Narumasa and Yoshida, Takahiro and Brunsdon, Chris}, + date-added = {2025-01-03 13:19:42 +0000}, + date-modified = {2025-01-03 13:19:42 +0000}, + journal = {ISPRS International Journal of Geo-Information}, + number = {12}, + pages = {459}, + publisher = {MDPI}, + title = {Encapsulating Spatially Varying Relationships with a Generalized Additive Model}, + volume = {13}, + year = {2024}} + @article{schwarz1978estimating, author = {Schwarz, Gideon}, date-added = {2024-05-13 17:04:06 +0100}, @@ -126,16 +294,14 @@ @book{wood2006generalized title = {Generalized additive models: an introduction with R}, year = {2006}} -@article{wood2015package, +@webpage{wood2025package, author = {Wood, Simon}, date-added = {2024-05-10 13:59:33 +0100}, - date-modified = {2024-05-10 13:59:33 +0100}, - journal = {R package version}, - number = {29}, - pages = {729}, - title = {Package `mgcv'}, - volume = {1}, - year = {2015}} + date-modified = {2026-01-21 11:32:56 +0000}, + rating = {5}, + title = {mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation}, + url = {https://CRAN.R-project.org/package=mgcv}, + year = {2025}} @book{wood2017generalized, author = {Wood, Simon N}, @@ -145,12 +311,12 @@ @book{wood2017generalized title = {Generalized additive models: an introduction with R}, year = {2017}} -@article{comber2024gtgpjgis, +@article{comber2025st, author = {Comber, Alexis and Harris, Paul and Tsutsumida, Naru and Gray, Jennie and Brunsdon, Chris}, date-added = {2024-04-29 08:05:53 +0100}, - date-modified = {2024-04-29 08:05:53 +0100}, - journal = {International Journal of Geographical Information Science}, - title = {Where, When and How? Specifying spatially and temporally varying coefficient models using Generalized Additive Models with Gaussian Process splines}, + date-modified = {2025-04-02 12:31:56 +0100}, + journal = {TBC}, + title = {Where, When and How? Specifying spatially and temporally varying coefficient models using Generalized Additive Models}, year = {submitted}} @article{comber2024multiscale,