epiverse-trace
diff --git a/‎clean-data.md‎
Lines changed: 101 additions & 23 deletions b/‎clean-data.md‎
diff --git a/‎fig/mpox_data_cleaning_pipeline-rendered-unnamed-chunk-2-1.png‎
-9.84 KB b/‎fig/mpox_data_cleaning_pipeline-rendered-unnamed-chunk-2-1.png‎
diff --git a/‎fig/mpox_data_cleaning_pipeline-rendered-unnamed-chunk-4-1.png‎
-9.92 KB b/‎fig/mpox_data_cleaning_pipeline-rendered-unnamed-chunk-4-1.png‎
diff --git a/‎md5sum.txt‎
Lines changed: 20 additions & 20 deletions b/‎md5sum.txt‎
diff --git a/‎read-cases.md‎
Lines changed: 18 additions & 10 deletions b/‎read-cases.md‎
@@ -11,7 +11,7 @@ exercises: 10
 
 ::::::::::::::::::::::::::::::::::::: objectives
 
-- Explain how to clean, curate, and standardize case data using `{cleanepi}` package
+- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package.
 - Perform essential data-cleaning operations on a real case dataset.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
@@ -21,24 +21,49 @@ exercises: 10
 In this episode, we will use a simulated Ebola dataset. To get it:
 
 - Download the [simulated_ebola_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv)
-- Save it in the `data/` folder. Follow instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder)
+- Save it in the `data/` folder.
+- Follow the instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder).
 
 :::::::::::::::::::::
 
 ## Introduction
-In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e. you are analysing what you think you are analysing) and reproducible (i.e. if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results).
+In the process of analyzing outbreak data, as in other disciplines of data science, it's essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e. you are analysing what you think you are analysing) and reproducible (i.e. if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results).
  This episode focuses on cleaning epidemics and outbreaks data using the 
- [cleanepi](https://epiverse-trace.github.io/cleanepi/) package,
+ [`{cleanepi}`](https://epiverse-trace.github.io/cleanepi/) package.
    For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
 
-Let's start by loading the package `{rio}` to read data and the package `{cleanepi}` 
-to clean it. We'll use the pipe `%>%` to connect some of their functions, including others from 
-the package `{dplyr}`, so let's also call to the tidyverse package:
+### Set Up
+
+In addition to the `{cleanepi}` package, we will use the following R packages in
+this data cleaning workflow:
+
+* `{here}` for easy file referencing,
+* `{rio}` to import the data into R,
+* `{dplyr}` to perform some data processing operations,
+* `{magrittr}` to use its **pipe operator (`%>%`)**.
+
+We encourage users with R version 4.1.0 or later, where the base R pipe is
+available, to use the base R pipe operator (`|>`) instead of `%>%`.
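
As a minimal sketch of how the base R pipe reads compared with a nested call, the snippet below uses only base R functions and a toy vector named `ages` (hypothetical, not part of the lesson data). It assumes an R version that supports `|>`.

```r
# Toy vector, for illustration only (not part of the lesson data)
ages <- c(12, 45, NA, 30)

# Nested call: read from the inside out
nested <- round(mean(ages, na.rm = TRUE), 1)

# Base R pipe (requires R >= 4.1.0): read from left to right
piped <- ages |>
  mean(na.rm = TRUE) |>
  round(1)

# Both forms give the same result
identical(nested, piped)
```

The piped form passes the left-hand side as the first argument of each function, so each step can be read in the order it is applied.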
+
+We also encourage using the `{pak}` package when installing R packages, as shown
+below. You can refer to the [{pak} reference document](https://pak.r-lib.org/reference/features.html) for more details about
+its advantages.
 
 
 ``` r
+# Check if a package is already installed and install it if not
+
+# nolint start
+if (!require("pak")) install.packages("pak")
+if (!require("here")) pak::pak("here")
+if (!require("rio")) pak::pak("rio")
+if (!require("dplyr")) pak::pak("dplyr")
+if (!require("magrittr")) pak::pak("magrittr")
+if (!require("cleanepi")) pak::pak("cleanepi")
+# nolint end
+
 # Load packages
-library(tidyverse) # for {dplyr} functions and the pipe %>%
+library(dplyr) # for {dplyr} functions and the pipe %>%
 library(rio) # for importing data
 library(here) # for easy file referencing
 library(cleanepi)
@@ -48,7 +73,7 @@ library(cleanepi)
 
 ### The double-colon (`::`) operator
 
-The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
+The `::` operator in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
 advantages, including the following:
 
 * Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
@@ -141,17 +166,17 @@ package simplifies this process with the `scan_data()` function. Let's take a lo
 
 
 ``` r
-cleanepi::scan_data(raw_ebola_data)
+cleanepi::scan_data(raw_ebola_data, format = "percentage")
 ```
 
 ``` output
-  Field_names missing numeric   date character logical
-1         age  0.0690  0.8925 0.0000    0.1075       0
-2      gender  0.1874  0.0560 0.0000    0.9440       0
-3      status  0.0565  0.0000 0.0000    1.0000       0
-4  date onset  0.0001  0.0000 0.9159    0.0841       0
-5 date sample  0.0001  0.0000 1.0000    0.0000       0
-6      region  0.0000  0.0000 0.0000    1.0000       0
+  Field_names  missing  numeric     date character logical
+1         age  6.9047% 89.2475%       0%  10.7525%      0%
+2      gender 18.7416%  5.6035%       0%  94.3965%      0%
+3      status  5.6549%       0%       0%      100%      0%
+4  date onset  0.0067%       0% 91.5945%   8.4055%      0%
+5 date sample  0.0133%       0%     100%        0%      0%
+6      region       0%       0%       0%      100%      0%
 ```
 
 
@@ -249,6 +274,20 @@ cleanepi::print_report(data = sim_ebola_data, what = "removed_duplicates")
 In the following data frame:
 
 
+``` output
+── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+✔ forcats   1.0.1     ✔ readr     2.1.6
+✔ ggplot2   4.0.2     ✔ stringr   1.6.0
+✔ lubridate 1.9.5     ✔ tibble    3.3.1
+✔ purrr     1.2.1     ✔ tidyr     1.3.2
+── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+✖ tidyr::extract()   masks magrittr::extract()
+✖ dplyr::filter()    masks stats::filter()
+✖ dplyr::lag()       masks stats::lag()
+✖ purrr::set_names() masks magrittr::set_names()
+ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+```
+
 ``` output
 # A tibble: 6 × 5
    col1  col2 col3  col4  col5  
@@ -598,17 +637,55 @@ You can have more details in the section about "Dictionary-based data substituti
 
 ### Calculating time span between different date events
 
-In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). The most common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth).
+In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). A common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth).
+
+The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the code snippet below uses the function `cleanepi::timespan()` to compute the **reporting delay** between the date of symptom onset (`date_onset`) and the date of case confirmation (`date_sample`):
+
+
+``` r
+sim_ebola_data <- cleanepi::timespan(
+  data = sim_ebola_data,
+  target_column = "date_onset",
+  end_date = "date_sample",
+  span_unit = "days",
+  span_column_name = "reporting_delay"
+)
 
-The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the below code snippet utilizes the function `cleanepi::timespan()` to compute the 
-time elapsed since the date of sampling of the identified cases until the $3^{rd}$ of January 2025 (`"2025-01-03"`).
+sim_ebola_data %>%
+  dplyr::select(case_id, date_sample, reporting_delay)
+```
+
+``` output
+# A tibble: 15,000 × 3
+   case_id date_sample reporting_delay
+   <chr>   <date>                <dbl>
+ 1 14905   2015-04-06               22
+ 2 13043   2014-01-03              114
+ 3 14364   2015-03-03              387
+ 4 14675   2014-12-31               73
+ 5 12648   2016-10-10              855
+ 6 14274   2016-01-23              293
+ 7 14132   2015-10-05               NA
+ 8 14715   2016-04-24               NA
+ 9 13435   2014-09-20               73
+10 14816   2015-02-06             -143
+# ℹ 14,990 more rows
+```
+
+After executing the function `cleanepi::timespan()`, one new column named `reporting_delay` is added to the **sim_ebola_data** dataset. For each case, this column holds the time elapsed between `date_onset` and `date_sample`, measured in days. A value of `NA` means the time span could not be computed for that case, for example because one of the two dates is missing. Negative values, such as the one in row 10 above, suggest that `date_sample` is earlier than `date_onset` and are worth checking.
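
To see what a day-based time span represents, the same kind of difference can be sketched with base R date arithmetic. The toy data frame `toy_cases` and its values are hypothetical, and this is only an illustration of the idea, not how `cleanepi::timespan()` works internally.

```r
# Hypothetical toy data (not the simulated Ebola dataset)
toy_cases <- data.frame(
  case_id = c("a", "b", "c"),
  date_onset = as.Date(c("2015-03-15", NA, "2015-06-29")),
  date_sample = as.Date(c("2015-04-06", "2015-10-05", "2015-02-06"))
)

# Number of days from onset to sampling (sample date minus onset date)
toy_cases$reporting_delay <- as.numeric(
  difftime(toy_cases$date_sample, toy_cases$date_onset, units = "days")
)

toy_cases
```

In this sketch, a missing onset date gives `NA`, and a sample date earlier than the onset date gives a negative value, which usually points to a data entry problem.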
+
+
+::::::::::::::::::::::::::::::::::::::::::::::: challenge
+
+
+1- Calculate the time elapsed since the date of sampling of the identified cases until the $3^{rd}$ of January 2025 (`"2025-01-03"`).
 
+:::::::::::::::::::::::::: solution
 
 ``` r
 sim_ebola_data <- cleanepi::timespan(
   data = sim_ebola_data,
   target_column = "date_sample",
-  end_date = as.Date("2025-01-03"),
+  end_date = lubridate::ymd("2025-01-03"),
   span_unit = "years",
   span_column_name = "years_since_collection",
   span_remainder_unit = "months"
@@ -634,10 +711,11 @@ sim_ebola_data %>%
 10 14816   2015-02-06                       9               10
 # ℹ 14,990 more rows
 ```
+:::::::::::::::::::::::::: 
 
-After executing the function `cleanepi::timespan()`, two new columns named `years_since_collection` and `remainder_months` are added to the **sim_ebola_data** dataset. For each case, these columns respectively represent the calculated time elapsed since the date of sample collection measured in years, and the remaining time measured in months.
+::::::::::::::::::::::::::::::::::::::::::::::: 
 
-::::::::::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::::::::::: discussion
 
 Age data is useful in many downstream analyses. You can categorize it to generate stratified estimates.
 
 
@@ -1,21 +1,21 @@
 "file" "checksum" "built" "date"
-"CODE_OF_CONDUCT.md" "549f00b0992a7743c2bc16ea6ce3db57" "site/built/CODE_OF_CONDUCT.md" "2026-02-22"
-"LICENSE.md" "14377518ee654005a18cf28549eb30e3" "site/built/LICENSE.md" "2026-02-22"
-"config.yaml" "0f7deb99a9178d8470bd18343974bd37" "site/built/config.yaml" "2026-02-22"
-"index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2026-02-22"
-"links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2026-02-22"
-"pull_request_template.md" "6ae1abf4b06b0425eebca43b6db281ae" "site/built/pull_request_template.md" "2026-02-22"
-"renv.lock" "bc6ed43f0e1fb18e514be0dea3c60e38" "site/built/renv.lock" "2026-02-22"
-"episodes/read-cases.Rmd" "7333844e42fcf127fbdf0b63785a744c" "site/built/read-cases.md" "2026-02-22"
-"episodes/clean-data.Rmd" "5d7d9f89d85d4cf8bcfb9b0d78b182fd" "site/built/clean-data.md" "2026-02-22"
-"episodes/validate.Rmd" "2d01a5ad5453b11791fbc5f33b97dca3" "site/built/validate.md" "2026-02-22"
-"episodes/describe-cases.Rmd" "bde8c21decb3810f1351894b0997c5fe" "site/built/describe-cases.md" "2026-02-22"
-"instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2026-02-22"
-"learners/epikinetics-descriptive.md" "50400941620956b3366fd99d51ed465b" "site/built/epikinetics-descriptive.md" "2026-02-22"
-"learners/epikinetics-statistics.md" "26868caf5a6b4a948ecb8f95c40694ab" "site/built/epikinetics-statistics.md" "2026-02-22"
-"learners/mpox_data_cleaning_pipeline.Rmd" "10ea9694190cfd20fb286cdc71b0c707" "site/built/mpox_data_cleaning_pipeline.md" "2026-02-22"
-"learners/mpox_data_cleaning_pipeline.md" "5ca83e5cc976420771e2b4ee0f713cb1" "site/built/mpox_data_cleaning_pipeline.md" "2026-02-22"
-"learners/reference.md" "f081702f1c557d1ce455b7d38306737b" "site/built/reference.md" "2026-02-22"
-"learners/setup.md" "65457a0cdd39d17bd46c96e356c74261" "site/built/setup.md" "2026-02-22"
-"profiles/learner-profiles.md" "31b503c4b5bd1f0960ada730eca4a25e" "site/built/learner-profiles.md" "2026-02-22"
-"renv/profiles/lesson-requirements/renv.lock" "bc6ed43f0e1fb18e514be0dea3c60e38" "site/built/renv.lock" "2026-02-22"
+"CODE_OF_CONDUCT.md" "549f00b0992a7743c2bc16ea6ce3db57" "site/built/CODE_OF_CONDUCT.md" "2026-02-23"
+"LICENSE.md" "14377518ee654005a18cf28549eb30e3" "site/built/LICENSE.md" "2026-02-23"
+"config.yaml" "0f7deb99a9178d8470bd18343974bd37" "site/built/config.yaml" "2026-02-23"
+"index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2026-02-23"
+"links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2026-02-23"
+"pull_request_template.md" "6ae1abf4b06b0425eebca43b6db281ae" "site/built/pull_request_template.md" "2026-02-23"
+"renv.lock" "bc6ed43f0e1fb18e514be0dea3c60e38" "site/built/renv.lock" "2026-02-23"
+"episodes/read-cases.Rmd" "7333844e42fcf127fbdf0b63785a744c" "site/built/read-cases.md" "2026-02-23"
+"episodes/clean-data.Rmd" "d5eeba1bf54db33f25c0d26cf7ad22f0" "site/built/clean-data.md" "2026-02-23"
+"episodes/validate.Rmd" "2d01a5ad5453b11791fbc5f33b97dca3" "site/built/validate.md" "2026-02-23"
+"episodes/describe-cases.Rmd" "bde8c21decb3810f1351894b0997c5fe" "site/built/describe-cases.md" "2026-02-23"
+"instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2026-02-23"
+"learners/epikinetics-descriptive.md" "50400941620956b3366fd99d51ed465b" "site/built/epikinetics-descriptive.md" "2026-02-23"
+"learners/epikinetics-statistics.md" "26868caf5a6b4a948ecb8f95c40694ab" "site/built/epikinetics-statistics.md" "2026-02-23"
+"learners/mpox_data_cleaning_pipeline.Rmd" "10ea9694190cfd20fb286cdc71b0c707" "site/built/mpox_data_cleaning_pipeline.md" "2026-02-23"
+"learners/mpox_data_cleaning_pipeline.md" "5ca83e5cc976420771e2b4ee0f713cb1" "site/built/mpox_data_cleaning_pipeline.md" "2026-02-23"
+"learners/reference.md" "f081702f1c557d1ce455b7d38306737b" "site/built/reference.md" "2026-02-23"
+"learners/setup.md" "65457a0cdd39d17bd46c96e356c74261" "site/built/setup.md" "2026-02-23"
+"profiles/learner-profiles.md" "31b503c4b5bd1f0960ada730eca4a25e" "site/built/learner-profiles.md" "2026-02-23"
+"renv/profiles/lesson-requirements/renv.lock" "bc6ed43f0e1fb18e514be0dea3c60e38" "site/built/renv.lock" "2026-02-23"
@@ -607,9 +607,9 @@ tibble::as_tibble(demo_programs)
    displayName                                         id          type     
    <chr>                                               <chr>       <chr>    
  1 Antenatal care visit                                lxAQ7Zs9VYR aggregate
- 2 Child Programme                                     IpHINAT79UW tracker  
- 3 Contraceptives Voucher Program                      kla3mAPgvCH aggregate
- 4 Enterprise / Agribusiness Tracker                   jGYiwuDB5dc tracker  
+ 2 Cause of death (registration)                       ogrOUKoSaWA tracker  
+ 3 Child Programme                                     IpHINAT79UW tracker  
+ 4 Contraceptives Voucher Program                      kla3mAPgvCH aggregate
  5 Information Campaign                                q04UBOqq3rp aggregate
  6 Inpatient morbidity and mortality                   eBAyeGv0exc aggregate
  7 Malaria case diagnosis, treatment and investigation qDkgAbB5Jlk tracker  
@@ -631,13 +631,21 @@ tibble::as_tibble(demo_units)
 ```
 
 ``` output
-# A tibble: 1 × 12
-  `NIGERIA _name` `NIGERIA _id` FCT_name FCT_id      AMAC_name AMAC_id    
-  <chr>           <chr>         <chr>    <chr>       <chr>     <chr>      
-1 Sierra Leone    ImspTQPwCqd   NIGERIA  dwAsnpBCyPx FCT       EQyxhtW3zXI
-# ℹ 6 more variables: `FMC GARKI_name` <chr>, `FMC GARKI_id` <chr>,
-#   `Kulu Ferha_name` <chr>, `Kulu Ferha_id` <chr>, `Level 6_name` <chr>,
-#   `Level 6_id` <chr>
+# A tibble: 1,166 × 8
+   National_name National_id District_name District_id Chiefdom_name Chiefdom_id
+   <chr>         <chr>       <chr>         <chr>       <chr>         <chr>      
+ 1 Sierra Leone  ImspTQPwCqd Western Area  at6UHUQatSo Rural Wester… qtr8GGlm4gg
+ 2 Sierra Leone  ImspTQPwCqd Western Area  at6UHUQatSo Rural Wester… qtr8GGlm4gg
+ 3 Sierra Leone  ImspTQPwCqd Bo            O6uvpzGd5pu Kakua         U6Kr7Gtpidn
+ 4 Sierra Leone  ImspTQPwCqd Kambia        PMa2VCrupOd Magbema       QywkxFudXrC
+ 5 Sierra Leone  ImspTQPwCqd Tonkolili     eIQbndfxQMb Yoni          NNE0YMCDZkO
+ 6 Sierra Leone  ImspTQPwCqd Port Loko     TEQlaapDQoK Kaffu Bullom  vn9KJsLyP5f
+ 7 Sierra Leone  ImspTQPwCqd Koinadugu     qhqAxPSTUXp Nieni         J4GiUImJZoE
+ 8 Sierra Leone  ImspTQPwCqd Western Area  at6UHUQatSo Freetown      C9uduqDZr9d
+ 9 Sierra Leone  ImspTQPwCqd Western Area  at6UHUQatSo Freetown      C9uduqDZr9d
+10 Sierra Leone  ImspTQPwCqd Kono          Vth0fbpFcsO Gbense        TQkG0sX9nca
+# ℹ 1,156 more rows
+# ℹ 2 more variables: Facility_name <chr>, Facility_id <chr>
 ```
 
 :::::::::::::::