- Explain how to clean, curate, and standardize case data using `{cleanepi}` package
14
+
- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package.
15
15
- Perform essential data-cleaning operations on a real case dataset.
16
16
17
17
::::::::::::::::::::::::::::::::::::::::::::::::
@@ -21,24 +21,49 @@ exercises: 10
21
21
In this episode, we will use a simulated Ebola dataset. To get it:
22
22
23
23
- Download the [simulated_ebola_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv)
24
-
- Save it in the `data/` folder. Follow instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder)
24
+
- Save it in the `data/` folder.
25
+
- Follow the instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder).
25
26
26
27
:::::::::::::::::::::
27
28
28
29
## Introduction
29
-
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e. you are analysing what you think you are analysing) and reproducible (i.e. if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results).
30
+
When analyzing outbreak data, as in other areas of data science, it is essential to ensure that the dataset is clean, curated, standardized, and validated. This ensures that the analysis is accurate (i.e., you are analyzing what you think you are analyzing) and reproducible (i.e., if someone repeats your analysis steps with your code, you can be confident they will get the same results).
30
31
This episode focuses on cleaning epidemic and outbreak data using the
For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
33
34
34
-
Let's start by loading the package `{rio}` to read data and the package `{cleanepi}`
35
-
to clean it. We'll use the pipe `%>%` to connect some of their functions, including others from
36
-
the package `{dplyr}`, so let's also call to the tidyverse package:
35
+
### Set Up
36
+
37
+
In addition to the `{cleanepi}` package, we will use the following R packages in
38
+
this data cleaning workflow:
39
+
40
+
* `{here}` for easy file referencing,
41
+
* `{rio}` to import the data into R,
42
+
* `{dplyr}` to perform some data processing operations,
43
+
* `{magrittr}` to use its **pipe operator (`%>%`)**.
44
+
45
+
We encourage users with recent versions of R (version 4.1.0 or later) to use the base R
46
+
pipe operator (`|>`) instead of `%>%`.
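
As a minimal sketch of how the two pipes relate, the snippet below applies the same two steps once with `%>%` and once with `|>`. The vector `values` and the result names are made up for illustration, and the native pipe needs R 4.1.0 or later.

```r
# Both pipes pass the left-hand value as the first argument of the next call.
library(magrittr)

# Hypothetical toy input, not part of the Ebola dataset
values <- c(4, 9, 16)

# magrittr pipe
with_magrittr <- values %>% sqrt() %>% sum()

# base R (native) pipe, available from R 4.1.0
with_base <- values |> sqrt() |> sum()

# For simple chains like this one, the two forms give the same result
identical(with_magrittr, with_base)
```

For simple left-to-right chains like this one, the two operators are interchangeable; they differ mainly in more advanced uses such as the `.` placeholder of `%>%`.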
47
+
48
+
We also encourage using the `{pak}` package when installing R packages as shown
49
+
below. You can refer to the [{pak} reference document](https://pak.r-lib.org/reference/features.html) for more details about
50
+
the advantages of using this.
37
51
38
52
39
53
```r
54
+
# Check if a package is already installed and install it if not
55
+
56
+
# nolint start
57
+
if (!require("pak")) install.packages("pak")
58
+
if (!require("here")) pak::pak("here")
59
+
if (!require("rio")) pak::pak("rio")
60
+
if (!require("dplyr")) pak::pak("dplyr")
61
+
if (!require("magrittr")) pak::pak("magrittr")
62
+
if (!require("cleanepi")) pak::pak("cleanepi")
63
+
# nolint end
64
+
40
65
# Load packages
41
-
library(tidyverse) # for {dplyr} functions and the pipe %>%
66
+
library(dplyr) # for {dplyr} functions and the pipe %>%
42
67
library(rio) # for importing data
43
68
library(here) # for easy file referencing
44
69
library(cleanepi)
@@ -48,7 +73,7 @@ library(cleanepi)
48
73
49
74
### The double-colon (`::`) operator
50
75
51
-
The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
76
+
The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
52
77
advantages, including the following:
53
78
54
79
* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
@@ -141,17 +166,17 @@ package simplifies this process with the `scan_data()` function. Let's take a lo
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
289
+
```
290
+
252
291
```output
253
292
# A tibble: 6 × 5
254
293
col1 col2 col3 col4 col5
@@ -598,17 +637,55 @@ You can have more details in the section about "Dictionary-based data substituti
598
637
599
638
### Calculating time span between different date events
600
639
601
-
In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). The most common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth).
640
+
In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). A common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth).
641
+
642
+
The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the code snippet below uses the function `cleanepi::timespan()` to compute the **reporting delay** between the date of symptom onset (`date_onset`) and the date of case confirmation (`date_sample`).
643
+
644
+
645
+
```r
646
+
sim_ebola_data <- cleanepi::timespan(
647
+
  data = sim_ebola_data,
648
+
  target_column = "date_onset",
649
+
  end_date = "date_sample",
650
+
  span_unit = "days",
651
+
  span_column_name = "reporting_delay")
602
652
603
-
The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the below code snippet utilizes the function `cleanepi::timespan()` to compute the
604
-
time elapsed since the date of sampling of the identified cases until the $3^{rd}$ of January 2025 (`"2025-01-03"`).
After executing the function `cleanepi::timespan()`, a new column named `reporting_delay` is added to the **sim_ebola_data** dataset. For each case, this column holds the time elapsed between the date of symptom onset and the date of sample collection, measured in days.
1- Calculate the time elapsed between the date of sampling of the identified cases and the $3^{rd}$ of January 2025 (`"2025-01-03"`).
605
681
682
+
:::::::::::::::::::::::::: solution
606
683
607
684
```r
608
685
sim_ebola_data <- cleanepi::timespan(
609
686
  data = sim_ebola_data,
610
687
  target_column = "date_sample",
611
-
  end_date = as.Date("2025-01-03"),
688
+
  end_date = lubridate::ymd("2025-01-03"),
612
689
  span_unit = "years",
613
690
  span_column_name = "years_since_collection",
614
691
  span_remainder_unit = "months"
@@ -634,10 +711,11 @@ sim_ebola_data %>%
634
711
10 14816 2015-02-06 9 10
635
712
# ℹ 14,990 more rows
636
713
```
714
+
::::::::::::::::::::::::::
637
715
638
-
After executing the function `cleanepi::timespan()`, two new columns named `years_since_collection` and `remainder_months` are added to the **sim_ebola_data** dataset. For each case, these columns respectively represent the calculated time elapsed since the date of sample collection measured in years, and the remaining time measured in months.
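
To make the calculation concrete, here is a minimal base R sketch of the day-level arithmetic behind a time span between two date columns. It does not use `{cleanepi}` and is not a description of how that package is implemented; the toy data frame `toy_cases` and its values are hypothetical.

```r
# Hypothetical toy data: three cases with onset and sample dates
toy_cases <- data.frame(
  case_id = c(1, 2, 3),
  date_onset = as.Date(c("2015-01-01", "2015-02-10", "2015-03-05")),
  date_sample = as.Date(c("2015-01-04", "2015-02-15", "2015-03-05"))
)

# Time elapsed between the two dates, expressed in days
toy_cases$reporting_delay <- as.numeric(
  difftime(toy_cases$date_sample, toy_cases$date_onset, units = "days")
)

toy_cases$reporting_delay
```

A delay of zero, as for the third case, means the sample was taken on the day of symptom onset.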