Skip to content

spepperkumc/Spatial-Temporal-Modeling-INLA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 

Repository files navigation

================================================================================ Spatial and Temporal Modeling of Breast Cancer Mortality in Kansas An R-INLA Approach -- Repository Documentation

OVERVIEW

This repository contains the R scripts used for the manuscript "Spatial and Temporal Modeling of Breast Cancer Mortality in Kansas: An R-INLA Approach." The analysis covers breast cancer mortality across all 105 Kansas counties for the years 2018-2021 using Bayesian spatial-temporal models implemented via R-INLA.

Two analytical workflows are included:

  • County-level analysis: descriptive statistics, candidate model comparison, final model fitting, and choropleth mapping of relative risk and hotspots.

  • Cluster-level analysis: SKATER-based spatial clustering of counties into 6 contiguous regions, followed by cluster-level spatial-temporal modeling.

================================================================================ BEFORE YOU BEGIN

This repository contains R code only. No data is included. You must assemble the dataset and obtain the shapefile before running any script. All steps are described below.

================================================================================ SOFTWARE REQUIREMENTS

R VERSION

R version 4.2 or later is recommended.

PACKAGE INSTALLATION

Most packages are available from CRAN. Install them with:

install.packages(c( "readr", "readxl", "dplyr", "tidyr", "tidyverse", "sf", "spdep", "SpatialEpi", "ggplot2", "viridis", "epiDisplay", "ggpubr", "ggregplot", "skater", "spatialreg", "sp" ))

INLA is NOT on CRAN and must be installed separately:

install.packages("INLA", repos = c(INLA = "https://inla.r-inla-download.org/R/stable"), dep = TRUE)

Verify the installation with:

library(INLA) inla.version()

================================================================================ DATA PREPARATION

OVERVIEW

All scripts read a single file named KSBC_Data_All.csv located in the same directory as the scripts. Each row represents one county-year observation. The dataset covers all 105 Kansas counties across 2018-2021 (420 rows total).

REQUIRED COLUMN STRUCTURE

The dataset must contain exactly these 11 columns in this order:

Column Type Description


County Character Full county name (e.g., "Allen County") Year Integer Study year: 2018, 2019, 2020, or 2021 Percent_Female Numeric % of county population that is female (0-100) Obese Numeric Adult obesity rate (%) Diabetes Numeric Adult diabetes prevalence (%) Smoke Numeric Adult current smoking rate (%) Binge Numeric Binge drinking rate (%) PCP Integer Number of primary care physicians in the county CARR Numeric Community Assets and Relative Rurality index (0-1) Population Integer Total county population Cases Integer Breast cancer mortality deaths (count)

! IMPORTANT: Column order matters. Some scripts reference columns by position (e.g., d <- data[,c(1,2,11)] extracts County, Year, and Cases). Ensure your column order exactly matches the table above.

! NOTE ON CARR: Download the raw 0-1 scale values. The scripts multiply CARR by 100 internally to convert to a percentage scale matching other covariates.

! NOTE ON THE FIRST COLUMN: The scripts remove the first column with KSBC_Data[,-c(1)]. If you export from Excel or R, a row-number index will be added as column 1 automatically -- this is expected and will be stripped. If you save without an index, add a dummy first column or remove that line.

PUBLIC DATA SOURCES

  1. Kansas Information for Communities (KIC) Source for: Cases (breast cancer mortality counts) URL: https://kic.kdhe.ks.gov/death_new.php#top

    • Navigate to: Deaths > Cancer > Breast Cancer
    • Download county-level counts for each year 2018-2021
    • Note: KIC suppresses counts below 5 for privacy. Record suppressed cells and handle accordingly before modeling.
  2. Kansas Health Matters Source for: Percent_Female, Obese, Diabetes, Smoke, Binge, PCP, Population URL: https://www.kansashealthmatters.org/

    • Navigate to the county health data section
    • Download each indicator by county for years 2018-2021
    • Ensure data is available for all 105 counties before proceeding
  3. Community Assets and Relative Rurality (CARR) Index Source for: CARR URL: https://ruralityindex.com/

    • Download county-level CARR values on the raw 0-1 scale

ASSEMBLING THE DATASET

  1. Create one row per county per year (105 counties x 4 years = 420 rows).
  2. Sort alphabetically by County, then by Year within each county.
  3. Verify that county names exactly match the gnis_name field in the shapefile. The merge uses county names as the key -- any mismatch will silently drop rows.
  4. Save as KSBC_Data_All.csv in the same directory as your R scripts.

================================================================================ SHAPEFILE SETUP

All scripts read a county boundary shapefile from a subfolder named Shape/ located in the same directory as the scripts:

KS <- st_read(dsn = "Shape/GU_CountyOrEquivalent.shp", layer = "GU_CountyOrEquivalent")

Keep all shapefile component files together in the Shape/ folder:

GU_CountyOrEquivalent.shp -- geometry GU_CountyOrEquivalent.dbf -- attribute table GU_CountyOrEquivalent.shx -- index GU_CountyOrEquivalent.prj -- projection definition

RECOMMENDED SOURCE

The filename GU_CountyOrEquivalent.shp corresponds to the USGS National Boundary Dataset (Governmental Unit Boundaries): URL: https://www.usgs.gov/national-boundary-dataset

An acceptable alternative is the U.S. Census Bureau TIGER/Line shapefile: URL: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html If using TIGER files, rename to GU_CountyOrEquivalent.shp or update the dsn and layer arguments in each script.

FILTERING TO KANSAS

The downloaded shapefile contains all U.S. counties. Filter to Kansas (FIPS state code 20) before placing in the Shape/ folder, or add a filter step after reading:

KS <- KS[KS$state_fipsc == "20", ]

! NOTE: The scripts rename gnis_name to County and stco_fipsc to FIPS. Verify these column names exist in your shapefile before running.

================================================================================ DIRECTORY STRUCTURE

Before running any script, your working directory should look like this:

project/ KSBC_Data_All.csv Shape/ GU_CountyOrEquivalent.shp GU_CountyOrEquivalent.dbf GU_CountyOrEquivalent.shx GU_CountyOrEquivalent.prj SP_Descriptive_Statistics.R SP_Temporal_Maps_-Poisson.R SP_Temporal_Maps-NB.R SP_Temporal_Maps-ZIP0.R SP_Temporal_Maps-ZIP1.R SP_Temporal_Maps-ZIPNB0.R SP_Temporal_Maps-ZIPNB1.R SP_Temporal_Maps-GP.R SP_Temporal_Maps-Final_Model.R SP_Temporal_Maps-Cluster-Poisson.R SP_Temporal_Maps-Cluster-NB.R SP_Temporal_Maps-Cluster-ZIP0.R SP_Temporal_Maps-Cluster-ZIP1.R SP_Temporal_Maps-Cluster-ZINB0.R SP_Temporal_Maps-Cluster-ZINB1.R SP_Temporal_Maps-Cluster-GPoisson.R SP_Temporal_Maps-Cluster-_Final_Model.R

! Set your R working directory before running: setwd("path/to/project")

================================================================================ RUNNING ORDER

A. COUNTY-LEVEL ANALYSIS

Each script is self-contained -- it reads the shapefile and KSBC_Data_All.csv at the start. Run in this order:

STEP 1 -- Descriptive Statistics Script: SP_Descriptive_Statistics.R Generates summary statistics (mean, SD, median, IQR, min, max) grouped by year and by county. Produces bar charts of case distribution per year and boxplots of all covariates. Exports descriptive.csv.

STEP 2 -- Model Comparison Run each script below to fit candidate spatial-temporal count models under four spatial structures (BYM, BYM2, RW1, RW2). Compare DIC, WAIC, and marginal log-likelihood to select the best fitting model.

Script Model Family


SP_Temporal_Maps_-Poisson.R Poisson SP_Temporal_Maps-NB.R Negative Binomial SP_Temporal_Maps-ZIP0.R Zero-Inflated Poisson (ZIP0) SP_Temporal_Maps-ZIP1.R Zero-Inflated Poisson (ZIP1) SP_Temporal_Maps-ZIPNB0.R Zero-Inflated Negative Binomial (ZINB0) SP_Temporal_Maps-ZIPNB1.R Zero-Inflated Negative Binomial (ZINB1) SP_Temporal_Maps-_GP.R Generalized Poisson

About ZIP0 vs ZIP1 (and ZINB0 vs ZINB1): Zero-inflated models add a second process that generates "extra" zeros beyond what a standard count model would predict -- useful when many counties have zero deaths in a given year. The 0/1 suffix refers to how the zero-inflation probability is estimated: - ZIP0 / ZINB0: The zero-inflation probability is a single fixed constant, the same for all observations. Simpler, fewer parameters. - ZIP1 / ZINB1: The zero-inflation probability varies by observation, modeled through the linear predictor. More flexible -- allows the probability of an extra zero to vary spatially and temporally.

STEP 3 -- Final Model Script: SP_Temporal_Maps_-_Final_Model.R Fits the selected model (Poisson BYM2 with IID temporal interaction). Produces relative risk maps (RR, lower bound LL, upper bound UL) faceted by year, and hotspot maps showing P(RR > 1.5) per year. Exports: - high_RR.csv (counties where credible interval lower bound > 1.0) - low_RR.csv (counties where credible interval upper bound < 1.0)

B. CLUSTER-LEVEL ANALYSIS

The cluster scripts group counties into 6 contiguous spatial clusters using the SKATER algorithm, then fit the same candidate model families at the cluster level. Run in the same order as the county scripts.

What SKATER clustering does: SKATER partitions counties into spatially contiguous groups that are also internally similar on health characteristics. It uses 2018 values of Obese, Diabetes, Percent_Female, Smoke, Binge, PCP, and CARR. The algorithm: 1. Builds a neighbor list from shared county boundaries 2. Computes attribute-based dissimilarity costs between neighbors 3. Constructs a minimum spanning tree 4. Removes 5 edges to create 6 clusters, constrained so each cluster has at least 21 total cases in 2018 The 6 clusters are labeled: Northwest, Central, Southwest, Metro, Southeast, and Northeast.

Script Model Family


SP_Temporal_Maps_-Cluster-Poisson.R Poisson (also generates elbow plot) SP_Temporal_Maps-Cluster-NB.R Negative Binomial SP_Temporal_Maps-Cluster-ZIP0.R Zero-Inflated Poisson (ZIP0) SP_Temporal_Maps-Cluster-ZIP1.R Zero-Inflated Poisson (ZIP1) SP_Temporal_Maps-Cluster-ZINB0.R Zero-Inflated Negative Binomial (ZINB0) SP_Temporal_Maps-Cluster-ZINB1.R Zero-Inflated Negative Binomial (ZINB1) SP_Temporal_Maps-Cluster-_GPoisson.R Generalized Poisson

Cluster Final Model: Script: SP_Temporal_Maps_-Cluster-_Final_Model.R Fits the selected cluster-level model (Poisson BYM2). Cluster covariates are aggregated from county data: average Obese, Diabetes, Percent_Female, Smoke, Binge, CARR; total PCP; average Population. CARR is multiplied by 100 to match the scale of other percentage covariates. Exports: - cluster_highRR.csv - cluster_lowRR.csv

================================================================================ OUTPUTS

File / Output Produced By Description


descriptive.csv SP_Descriptive_Statistics.R County-averaged stats map.adj County model scripts County adjacency graph clustmap.adj Cluster model scripts Cluster adjacency graph cluster_case_count.csv Cluster Poisson / Final Cases per cluster/year high_RR.csv Final Model (county) High RR counties low_RR.csv Final Model (county) Low RR counties cluster_highRR.csv Final Model (cluster) High RR clusters cluster_lowRR.csv Final Model (cluster) Low RR clusters RR / LL / UL maps Final model scripts Choropleth maps by year Hotspot maps Final model scripts P(RR > 1.5) maps

================================================================================ TROUBLESHOOTING

Problem Likely Cause Solution


County names don't merge Name mismatch Compare unique(KSBC_Data$County) and unique(KS$County) INLA not found Wrong installation Use INLA-specific repo (see above) Shapefile not found Wrong path Confirm Shape/ folder exists in working directory with all 4 files Wrong column positions CSV column order differs Check names(KSBC_Data) and reorder Cluster case count low crit threshold not met Review cluster_case_count.csv and adjust ncuts or crit if needed CRS warning or error Missing or incomplete Each script contains the line: shapefile download spTransform(map, CRS("+proj=longlat +datum=WGS84 +no_defs")) This re-projects the Kansas county map into standard GPS coordinates. It fails if GU_CountyOrEquivalent.prj is missing from the Shape/ folder. Fix: re-download the full shapefile and confirm all 4 files are present: .shp, .dbf, .shx, and .prj.

================================================================================

About

Spatial-Temporal-Modeling-INLA

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages