Commit 0cee06a

committed
differences for PR #226
1 parent 17b04a7 commit 0cee06a

5 files changed

Lines changed: 139 additions & 53 deletions

clean-data.md

Lines changed: 101 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -11,7 +11,7 @@ exercises: 10
1111

1212
::::::::::::::::::::::::::::::::::::: objectives
1313

14-
- Explain how to clean, curate, and standardize case data using `{cleanepi}` package
14+
- Explain how to clean, curate, and standardize case data using `{cleanepi}` package.
1515
- Perform essential data-cleaning operations on a real case dataset.
1616

1717
::::::::::::::::::::::::::::::::::::::::::::::::
@@ -21,24 +21,49 @@ exercises: 10
2121
In this episode, we will use a simulated Ebola dataset. To get it:
2222

2323
- Download the [simulated_ebola_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv)
24-
- Save it in the `data/` folder. Follow instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder)
24+
- Save it in the `data/` folder.
25+
- Follow instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder)
2526

2627
:::::::::::::::::::::
2728

2829
## Introduction
29-
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e. you are analysing what you think you are analysing) and reproducible (i.e. if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results).
30+
In the process of analyzing outbreak data, as in other disciplines of data science, it's essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e. you are analysing what you think you are analysing) and reproducible (i.e. if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results).
3031
This episode focuses on cleaning epidemic and outbreak data using the
31-
[cleanepi](https://epiverse-trace.github.io/cleanepi/) package,
32+
[`{cleanepi}`](https://epiverse-trace.github.io/cleanepi/) package.
3233
For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
3334

34-
Let's start by loading the package `{rio}` to read data and the package `{cleanepi}`
35-
to clean it. We'll use the pipe `%>%` to connect some of their functions, including others from
36-
the package `{dplyr}`, so let's also call to the tidyverse package:
35+
### Set Up
36+
37+
In addition to the `{cleanepi}` package, we will use the following R packages in
38+
this data cleaning workflow:
39+
40+
* `{here}` for easy file referencing,
41+
* `{rio}` to import the data into R,
42+
* `{dplyr}` to perform some data processing operations,
43+
* `{magrittr}` to use its **pipe operator (`%>%`)**.
44+
45+
We encourage users with recent versions of R (version > 4.4.1) to use the base R
46+
pipe operator (`|>`) instead of `%>%`.
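
As a minimal sketch of how the two pipes relate, assuming an R version that supports the native pipe, the snippet below chains two base R functions with `|>`. The `values` vector is a hypothetical toy input and is not part of the lesson data.

``` r
# Hypothetical toy input, used only to illustrate the pipe
values <- c(1, 4, 9)

# The native pipe passes the left-hand side as the first argument
# of the function call on the right-hand side
total_root <- values |> sqrt() |> sum()

# The same computation written as a nested call
nested_result <- sum(sqrt(values))

print(total_root)
```

For simple chains like this one, `%>%` from `{magrittr}` can be swapped for `|>` without changing the result.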
47+
48+
We also encourage using the `{pak}` package when installing R packages as shown
49+
below. You can refer to the [{pak} reference document](https://pak.r-lib.org/reference/features.html) for more details about
50+
the advantages of using this.
3751

3852

3953
``` r
54+
# Check if a package is already installed and install it if not
55+
56+
# nolint start
57+
if (!require("pak")) install.packages("pak")
58+
if (!require("here")) pak::pak("here")
59+
if (!require("rio")) pak::pak("rio")
60+
if (!require("dplyr")) pak::pak("dplyr")
61+
if (!require("magrittr")) pak::pak("magrittr")
62+
if (!require("cleanepi")) pak::pak("cleanepi")
63+
# nolint end
64+
4065
# Load packages
41-
library(tidyverse) # for {dplyr} functions and the pipe %>%
66+
library(dplyr) # for {dplyr} functions and the pipe %>%
4267
library(rio) # for importing data
4368
library(here) # for easy file referencing
4469
library(cleanepi)
@@ -48,7 +73,7 @@ library(cleanepi)
4873

4974
### The double-colon (`::`) operator
5075

51-
The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
76+
The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
5277
advantages, including the following:
5378

5479
* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
@@ -141,17 +166,17 @@ package simplifies this process with the `scan_data()` function. Let's take a lo
141166

142167

143168
``` r
144-
cleanepi::scan_data(raw_ebola_data)
169+
cleanepi::scan_data(raw_ebola_data, format = "percentage")
145170
```
146171

147172
``` output
148-
Field_names missing numeric date character logical
149-
1 age 0.0690 0.8925 0.0000 0.1075 0
150-
2 gender 0.1874 0.0560 0.0000 0.9440 0
151-
3 status 0.0565 0.0000 0.0000 1.0000 0
152-
4 date onset 0.0001 0.0000 0.9159 0.0841 0
153-
5 date sample 0.0001 0.0000 1.0000 0.0000 0
154-
6 region 0.0000 0.0000 0.0000 1.0000 0
173+
Field_names missing numeric date character logical
174+
1 age 6.9047% 89.2475% 0% 10.7525% 0%
175+
2 gender 18.7416% 5.6035% 0% 94.3965% 0%
176+
3 status 5.6549% 0% 0% 100% 0%
177+
4 date onset 0.0067% 0% 91.5945% 8.4055% 0%
178+
5 date sample 0.0133% 0% 100% 0% 0%
179+
6 region 0% 0% 0% 100% 0%
155180
```
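
To build intuition for what such a scan summarises, here is a rough base R sketch that computes the percentage of missing values per column and the percentage of non-missing values that can be read as numbers. It is not the implementation of `cleanepi::scan_data()`, and `toy_cases` is a hypothetical data frame rather than the lesson's `raw_ebola_data`.

``` r
# Hypothetical toy data; NOT the lesson's raw_ebola_data
toy_cases <- data.frame(
  age = c("23", "forty", NA, "31"),
  region = c("valdrada", "valdrada", "valdrada", "valdrada"),
  stringsAsFactors = FALSE
)

# Percentage of missing values in each column
pct_missing <- colMeans(is.na(toy_cases)) * 100

# Percentage of non-missing values that parse as numbers
pct_numeric <- sapply(toy_cases, function(x) {
  observed <- x[!is.na(x)]
  parsed <- suppressWarnings(as.numeric(observed))
  mean(!is.na(parsed)) * 100
})

print(pct_missing)
print(pct_numeric)
```

A column such as `age` that is mostly, but not entirely, numeric is a typical sign that some entries were typed as words and need standardizing.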
156181

157182

@@ -249,6 +274,20 @@ cleanepi::print_report(data = sim_ebola_data, what = "removed_duplicates")
249274
In the following data frame:
250275

251276

277+
``` output
278+
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
279+
✔ forcats 1.0.1 ✔ readr 2.1.6
280+
✔ ggplot2 4.0.2 ✔ stringr 1.6.0
281+
✔ lubridate 1.9.5 ✔ tibble 3.3.1
282+
✔ purrr 1.2.1 ✔ tidyr 1.3.2
283+
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
284+
✖ tidyr::extract() masks magrittr::extract()
285+
✖ dplyr::filter() masks stats::filter()
286+
✖ dplyr::lag() masks stats::lag()
287+
✖ purrr::set_names() masks magrittr::set_names()
288+
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
289+
```
290+
252291
``` output
253292
# A tibble: 6 × 5
254293
col1 col2 col3 col4 col5
@@ -598,17 +637,55 @@ You can have more details in the section about "Dictionary-based data substituti
598637

599638
### Calculating time span between different date events
600639

601-
In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). The most common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth).
640+
In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). A common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth).
641+
642+
The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the code snippet below uses the function `cleanepi::timespan()` to compute the **reporting delay** between the date of symptom onset (`date_onset`) and the date of case confirmation (`date_sample`).
643+
644+
645+
``` r
646+
sim_ebola_data <- cleanepi::timespan(
647+
data = sim_ebola_data,
648+
target_column = "date_onset",
649+
end_date = "date_sample",
650+
span_unit = "days",
651+
span_column_name = "reporting_delay")
602652

603-
The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the below code snippet utilizes the function `cleanepi::timespan()` to compute the
604-
time elapsed since the date of sampling of the identified cases until the $3^{rd}$ of January 2025 (`"2025-01-03"`).
653+
sim_ebola_data %>%
654+
dplyr::select(case_id, date_sample, reporting_delay)
655+
```
656+
657+
``` output
658+
# A tibble: 15,000 × 3
659+
case_id date_sample reporting_delay
660+
<chr> <date> <dbl>
661+
1 14905 2015-04-06 22
662+
2 13043 2014-01-03 114
663+
3 14364 2015-03-03 387
664+
4 14675 2014-12-31 73
665+
5 12648 2016-10-10 855
666+
6 14274 2016-01-23 293
667+
7 14132 2015-10-05 NA
668+
8 14715 2016-04-24 NA
669+
9 13435 2014-09-20 73
670+
10 14816 2015-02-06 -143
671+
# ℹ 14,990 more rows
672+
```
673+
674+
After executing the function `cleanepi::timespan()`, a new column named `reporting_delay` is added to the **sim_ebola_data** dataset. For each case, this column holds the time elapsed between the date of symptom onset and the date of sample collection, measured in days. An `NA` appears where the delay could not be computed, and a negative value suggests that the recorded sample date precedes the recorded onset date, which is worth checking.
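
For comparison, the same kind of delay can be sketched in base R by subtracting two `Date` columns. This is only an illustration of the arithmetic, not how `cleanepi::timespan()` is implemented, and `toy_dates` is a hypothetical data frame with made-up case identifiers.

``` r
# Hypothetical toy data; NOT the lesson's sim_ebola_data
toy_dates <- data.frame(
  case_id = c("a1", "a2", "a3"),
  date_onset = as.Date(c("2015-03-15", NA, "2015-06-29")),
  date_sample = as.Date(c("2015-04-06", "2015-10-05", "2015-02-06"))
)

# Delay in days from symptom onset to sample collection
toy_dates$reporting_delay <- as.numeric(
  difftime(toy_dates$date_sample, toy_dates$date_onset, units = "days")
)

print(toy_dates)
```

In this toy example the second case has no onset date, so its delay is `NA`, and the third case has a sample date earlier than its onset date, so its delay is negative.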
675+
676+
677+
::::::::::::::::::::::::::::::::::::::::::::::: challenge
678+
679+
680+
1- Calculate the time elapsed since the date of sampling of the identified cases until the $3^{rd}$ of January 2025 (`"2025-01-03"`).
605681

682+
:::::::::::::::::::::::::: solution
606683

607684
``` r
608685
sim_ebola_data <- cleanepi::timespan(
609686
data = sim_ebola_data,
610687
target_column = "date_sample",
611-
end_date = as.Date("2025-01-03"),
688+
end_date = lubridate::ymd("2025-01-03"),
612689
span_unit = "years",
613690
span_column_name = "years_since_collection",
614691
span_remainder_unit = "months"
@@ -634,10 +711,11 @@ sim_ebola_data %>%
634711
10 14816 2015-02-06 9 10
635712
# ℹ 14,990 more rows
636713
```
714+
::::::::::::::::::::::::::
637715

638-
After executing the function `cleanepi::timespan()`, two new columns named `years_since_collection` and `remainder_months` are added to the **sim_ebola_data** dataset. For each case, these columns respectively represent the calculated time elapsed since the date of sample collection measured in years, and the remaining time measured in months.
716+
:::::::::::::::::::::::::::::::::::::::::::::::
639717

640-
::::::::::::::::::::::::::::::::::::::::::::::: challenge
718+
::::::::::::::::::::::::::::::::::::::::::::::: discussion
641719

642720
Age data is useful in many downstream analyses. You can categorize it to generate stratified estimates.
643721

-9.84 KB
Binary file not shown.
-9.92 KB
Binary file not shown.

md5sum.txt

Lines changed: 20 additions & 20 deletions
@@ -1,21 +1,21 @@
11
"file" "checksum" "built" "date"
2-
"CODE_OF_CONDUCT.md" "549f00b0992a7743c2bc16ea6ce3db57" "site/built/CODE_OF_CONDUCT.md" "2026-02-22"
3-
"LICENSE.md" "14377518ee654005a18cf28549eb30e3" "site/built/LICENSE.md" "2026-02-22"
4-
"config.yaml" "0f7deb99a9178d8470bd18343974bd37" "site/built/config.yaml" "2026-02-22"
5-
"index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2026-02-22"
6-
"links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2026-02-22"
7-
"pull_request_template.md" "6ae1abf4b06b0425eebca43b6db281ae" "site/built/pull_request_template.md" "2026-02-22"
8-
"renv.lock" "bc6ed43f0e1fb18e514be0dea3c60e38" "site/built/renv.lock" "2026-02-22"
9-
"episodes/read-cases.Rmd" "7333844e42fcf127fbdf0b63785a744c" "site/built/read-cases.md" "2026-02-22"
10-
"episodes/clean-data.Rmd" "5d7d9f89d85d4cf8bcfb9b0d78b182fd" "site/built/clean-data.md" "2026-02-22"
11-
"episodes/validate.Rmd" "2d01a5ad5453b11791fbc5f33b97dca3" "site/built/validate.md" "2026-02-22"
12-
"episodes/describe-cases.Rmd" "bde8c21decb3810f1351894b0997c5fe" "site/built/describe-cases.md" "2026-02-22"
13-
"instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2026-02-22"
14-
"learners/epikinetics-descriptive.md" "50400941620956b3366fd99d51ed465b" "site/built/epikinetics-descriptive.md" "2026-02-22"
15-
"learners/epikinetics-statistics.md" "26868caf5a6b4a948ecb8f95c40694ab" "site/built/epikinetics-statistics.md" "2026-02-22"
16-
"learners/mpox_data_cleaning_pipeline.Rmd" "10ea9694190cfd20fb286cdc71b0c707" "site/built/mpox_data_cleaning_pipeline.md" "2026-02-22"
17-
"learners/mpox_data_cleaning_pipeline.md" "5ca83e5cc976420771e2b4ee0f713cb1" "site/built/mpox_data_cleaning_pipeline.md" "2026-02-22"
18-
"learners/reference.md" "f081702f1c557d1ce455b7d38306737b" "site/built/reference.md" "2026-02-22"
19-
"learners/setup.md" "65457a0cdd39d17bd46c96e356c74261" "site/built/setup.md" "2026-02-22"
20-
"profiles/learner-profiles.md" "31b503c4b5bd1f0960ada730eca4a25e" "site/built/learner-profiles.md" "2026-02-22"
21-
"renv/profiles/lesson-requirements/renv.lock" "bc6ed43f0e1fb18e514be0dea3c60e38" "site/built/renv.lock" "2026-02-22"
2+
"CODE_OF_CONDUCT.md" "549f00b0992a7743c2bc16ea6ce3db57" "site/built/CODE_OF_CONDUCT.md" "2026-02-23"
3+
"LICENSE.md" "14377518ee654005a18cf28549eb30e3" "site/built/LICENSE.md" "2026-02-23"
4+
"config.yaml" "0f7deb99a9178d8470bd18343974bd37" "site/built/config.yaml" "2026-02-23"
5+
"index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2026-02-23"
6+
"links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2026-02-23"
7+
"pull_request_template.md" "6ae1abf4b06b0425eebca43b6db281ae" "site/built/pull_request_template.md" "2026-02-23"
8+
"renv.lock" "bc6ed43f0e1fb18e514be0dea3c60e38" "site/built/renv.lock" "2026-02-23"
9+
"episodes/read-cases.Rmd" "7333844e42fcf127fbdf0b63785a744c" "site/built/read-cases.md" "2026-02-23"
10+
"episodes/clean-data.Rmd" "d5eeba1bf54db33f25c0d26cf7ad22f0" "site/built/clean-data.md" "2026-02-23"
11+
"episodes/validate.Rmd" "2d01a5ad5453b11791fbc5f33b97dca3" "site/built/validate.md" "2026-02-23"
12+
"episodes/describe-cases.Rmd" "bde8c21decb3810f1351894b0997c5fe" "site/built/describe-cases.md" "2026-02-23"
13+
"instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2026-02-23"
14+
"learners/epikinetics-descriptive.md" "50400941620956b3366fd99d51ed465b" "site/built/epikinetics-descriptive.md" "2026-02-23"
15+
"learners/epikinetics-statistics.md" "26868caf5a6b4a948ecb8f95c40694ab" "site/built/epikinetics-statistics.md" "2026-02-23"
16+
"learners/mpox_data_cleaning_pipeline.Rmd" "10ea9694190cfd20fb286cdc71b0c707" "site/built/mpox_data_cleaning_pipeline.md" "2026-02-23"
17+
"learners/mpox_data_cleaning_pipeline.md" "5ca83e5cc976420771e2b4ee0f713cb1" "site/built/mpox_data_cleaning_pipeline.md" "2026-02-23"
18+
"learners/reference.md" "f081702f1c557d1ce455b7d38306737b" "site/built/reference.md" "2026-02-23"
19+
"learners/setup.md" "65457a0cdd39d17bd46c96e356c74261" "site/built/setup.md" "2026-02-23"
20+
"profiles/learner-profiles.md" "31b503c4b5bd1f0960ada730eca4a25e" "site/built/learner-profiles.md" "2026-02-23"
21+
"renv/profiles/lesson-requirements/renv.lock" "bc6ed43f0e1fb18e514be0dea3c60e38" "site/built/renv.lock" "2026-02-23"

read-cases.md

Lines changed: 18 additions & 10 deletions
@@ -607,9 +607,9 @@ tibble::as_tibble(demo_programs)
607607
displayName id type
608608
<chr> <chr> <chr>
609609
1 Antenatal care visit lxAQ7Zs9VYR aggregate
610-
2 Child Programme IpHINAT79UW tracker
611-
3 Contraceptives Voucher Program kla3mAPgvCH aggregate
612-
4 Enterprise / Agribusiness Tracker jGYiwuDB5dc tracker
610+
2 Cause of death (registration) ogrOUKoSaWA tracker
611+
3 Child Programme IpHINAT79UW tracker
612+
4 Contraceptives Voucher Program kla3mAPgvCH aggregate
613613
5 Information Campaign q04UBOqq3rp aggregate
614614
6 Inpatient morbidity and mortality eBAyeGv0exc aggregate
615615
7 Malaria case diagnosis, treatment and investigation qDkgAbB5Jlk tracker
@@ -631,13 +631,21 @@ tibble::as_tibble(demo_units)
631631
```
632632

633633
``` output
634-
# A tibble: 1 × 12
635-
`NIGERIA _name` `NIGERIA _id` FCT_name FCT_id AMAC_name AMAC_id
636-
<chr> <chr> <chr> <chr> <chr> <chr>
637-
1 Sierra Leone ImspTQPwCqd NIGERIA dwAsnpBCyPx FCT EQyxhtW3zXI
638-
# ℹ 6 more variables: `FMC GARKI_name` <chr>, `FMC GARKI_id` <chr>,
639-
# `Kulu Ferha_name` <chr>, `Kulu Ferha_id` <chr>, `Level 6_name` <chr>,
640-
# `Level 6_id` <chr>
634+
# A tibble: 1,166 × 8
635+
National_name National_id District_name District_id Chiefdom_name Chiefdom_id
636+
<chr> <chr> <chr> <chr> <chr> <chr>
637+
1 Sierra Leone ImspTQPwCqd Western Area at6UHUQatSo Rural Wester… qtr8GGlm4gg
638+
2 Sierra Leone ImspTQPwCqd Western Area at6UHUQatSo Rural Wester… qtr8GGlm4gg
639+
3 Sierra Leone ImspTQPwCqd Bo O6uvpzGd5pu Kakua U6Kr7Gtpidn
640+
4 Sierra Leone ImspTQPwCqd Kambia PMa2VCrupOd Magbema QywkxFudXrC
641+
5 Sierra Leone ImspTQPwCqd Tonkolili eIQbndfxQMb Yoni NNE0YMCDZkO
642+
6 Sierra Leone ImspTQPwCqd Port Loko TEQlaapDQoK Kaffu Bullom vn9KJsLyP5f
643+
7 Sierra Leone ImspTQPwCqd Koinadugu qhqAxPSTUXp Nieni J4GiUImJZoE
644+
8 Sierra Leone ImspTQPwCqd Western Area at6UHUQatSo Freetown C9uduqDZr9d
645+
9 Sierra Leone ImspTQPwCqd Western Area at6UHUQatSo Freetown C9uduqDZr9d
646+
10 Sierra Leone ImspTQPwCqd Kono Vth0fbpFcsO Gbense TQkG0sX9nca
647+
# ℹ 1,156 more rows
648+
# ℹ 2 more variables: Facility_name <chr>, Facility_id <chr>
641649
```
642650

643651
:::::::::::::::
