diff --git a/config.yaml b/config.yaml index e6f3ddb2..dd0ef117 100644 --- a/config.yaml +++ b/config.yaml @@ -59,10 +59,10 @@ contact: 'andree.valle-campos@lshtm.ac.uk' # Order of episodes in your lesson episodes: -- read-cases.Rmd +- read-case-data.Rmd - clean-data.Rmd -- validate.Rmd -- describe-cases.Rmd +- tag-validate.Rmd +- aggreagate-visualize.Rmd # Information for Learners learners: diff --git a/episodes/describe-cases.Rmd b/episodes/aggreagate-visualize.Rmd similarity index 92% rename from episodes/describe-cases.Rmd rename to episodes/aggreagate-visualize.Rmd index 44a7d299..a0fc1758 100644 --- a/episodes/describe-cases.Rmd +++ b/episodes/aggreagate-visualize.Rmd @@ -24,7 +24,7 @@ exercises: 10 In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization. This episode focuses on EDA of outbreak data using R packages. -A key aspect of EDA in epidemic analysis is 'person, place and time'. It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more. +Key aspects of EDA in epidemic analysis are **person, place and time**. It is useful to identify how observed events--such as confirmed cases, hospitalizations, deaths, and recoveries--change over time, and how these vary across different locations and demographic factors, including gender, age, and more. Let's start by loading the `{incidence2}` package to aggregate the linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. case incidence over time). We'll use the `{simulist}` package to simulate the outbreak data to analyse, and `{tracetheme}` for figure formatting.
We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` and `{ggplot2}` packages, so let's also call to the {tidyverse} package. @@ -66,9 +66,9 @@ You can also find data sets from past real outbreaks within the [`{outbreaks}`]( -## Aggregating the data +## Aggregating linelist -Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping the linelist data into incidence data. The [{incidence2}]((https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"}) package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events and/or other characteristics. The code chunk provided below demonstrates the creation of an `` class object from the simulated Ebola `linelist` data based on the date of onset. +Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires converting the linelist data into incidence data. The [{incidence2}](https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"} package offers a useful function called `incidence2::incidence()` for aggregating case data, usually based around dated events and/or other characteristics. The code chunk provided below demonstrates the creation of an `` class object from the simulated Ebola `linelist` data based on the date of onset. ```{r} # Create an incidence object by aggregating case data based on the date of onset @@ -82,7 +82,7 @@ daily_incidence <- incidence2::incidence( daily_incidence ``` -With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.
+With the `{incidence2}` package, you can specify the desired interval (e.g., day, week) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case. ```{r} # Group incidence data by week, accounting for sex and case type @@ -150,7 +150,7 @@ base::plot(daily_incidence) + x = "Time (in days)", # x-axis label y = "Dialy cases" # y-axis label ) + - theme_bw() + tracetheme::theme_trace() ``` @@ -161,7 +161,7 @@ base::plot(weekly_incidence) + x = "Time (in weeks)", # x-axis label y = "weekly cases" # y-axis label ) + - theme_bw() + tracetheme::theme_trace() ``` :::::::::::::::::::::::: callout @@ -200,7 +200,7 @@ base::plot(cum_df) + x = "Time (in days)", # x-axis label y = "weekly cases" # y-axis label ) + - theme_bw() + tracetheme::theme_trace() ``` Note that this function preserves grouping, i.e., if the `incidence2` object contains groups, it will accumulate the cases accordingly. diff --git a/episodes/read-cases.Rmd b/episodes/read-case-data.Rmd similarity index 100% rename from episodes/read-cases.Rmd rename to episodes/read-case-data.Rmd diff --git a/episodes/validate.Rmd b/episodes/tag-validate.Rmd similarity index 79% rename from episodes/validate.Rmd rename to episodes/tag-validate.Rmd index 3aaa2a2b..85ae2652 100644 --- a/episodes/validate.Rmd +++ b/episodes/tag-validate.Rmd @@ -1,20 +1,20 @@ --- title: 'Validate case data' -teaching: 10 -exercises: 2 +teaching: 20 +exercises: 10 --- :::::::::::::::::::::::::::::::::::::: questions -- How to convert a raw dataset into a `linelist` object? +- How can raw case data be converted into a `linelist` object?
:::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: objectives -- Demonstrate how to covert case data into `linelist` data -- Demonstrate how to tag and validate data to make analysis more reliable +- Demonstrate how to convert case data into a `linelist` object +- Demonstrate how to tag and validate data to improve the reliability of downstream analysis :::::::::::::::::::::::::::::::::::::::::::::::: @@ -30,14 +30,13 @@ This episode requires you to: ## Introduction -In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses. Otherwise you might encounter issues during the analysis process due to creation or removal of specific variables, changes in their underlying data types (like `` or ``), etc. Specifically, this additional step involves: +In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses. Without this step, you may encounter issues later: for example, variables may be unintentionally modified or removed, or their data types (e.g., ``, ``) may change during processing. This additional layer typically involves two key steps: -1. Verifying the presence and correct data type of certain columns within -your dataset, a process commonly referred to as **tagging**; -2. Implementing measures to make sure that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**. +1. **tagging**: Verifying that required columns are present in the dataset and confirming that they have the correct data types. +2.
**validation**: Implementing safeguards to ensure that tagged columns are not accidentally deleted or altered during subsequent data manipulation steps. -This episode focuses on tagging and validating outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/) package. Let's start by loading the package `{rio}` to read data and the `{linelist}` package +This episode focuses on creating a linelist object using the [linelist](https://epiverse-trace.github.io/linelist/) package, which natively supports tagging and validating outbreak data to ensure data integrity throughout the analysis workflow. Let's start by loading the package `{rio}` to read data and the `{linelist}` package to create a linelist object. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the package `{dplyr}`. For this reason, we will also load the {tidyverse} package. @@ -54,7 +53,7 @@ library(linelist) # for tagging and validating ### The double-colon (`::`) operator -The`::`in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important +The `::` operator in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important advantages including the followings: * Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name. @@ -110,7 +109,7 @@ A scenario like this usually happens when the institution doing the analysis is ## Creating a linelist and tagging columns -Once the data is loaded and cleaned, we can convert the cleaned case data into a `linelist` object using `{linelist}` package, as in the below code chunk. +Once the data is loaded and cleaned, it can be converted into a `linelist` object using the `{linelist}` package, as illustrated in the code chunk below.
```{r} # Create a linelist object from cleaned data @@ -125,17 +124,15 @@ linelist_data <- linelist::make_linelist( linelist_data ``` -The `{linelist}` package supplies tags for common epidemiological variables -and a set of appropriate data types for each. You can view the list of available tags by the variable name and their acceptable data types using the `linelist::tags_types()` function. +The `{linelist}` package provides predefined tags for common epidemiological variables, along with the appropriate data types for each. You can view all available tags and their corresponding acceptable data types using the `linelist::tags_types()` function. ::::::::::::::::::::::::::::::::::::: challenge -Let's **tag** more variables. In some datasets, it is possible to encounter variable names that are different from the available tag names. In such cases, we can associate them based on how variables were defined for data collection. +Let's now **tag** additional variables. In some datasets, variable names may not exactly match the predefined tag names. In these cases, you can map them based on how the variables were defined during data collection. You need to: -Now: --**Explore** the available tag names in `{linelist}`. --**Find** what other variables in the input dataset can be associated with any of these available tags. --**Tag** those variables as shown above using the `linelist::make_linelist()` +- **Explore** the available tag names in `{linelist}`. +- **Find** what other variables in the input dataset can be associated with any of these available tags. +- **Tag** those variables as shown above using the `linelist::make_linelist()` function. :::::::::::::::::::: hint @@ -165,9 +162,9 @@ linelist::make_linelist( ``` -Are these additional tags visible in the output? +Are the additional tags visible in the output? -< !--Do you want to see a display of available and tagged variables? 
You can explore the function `linelist::tags()` and read its [reference documentation](https://epiverse-trace.github.io/linelist/reference/tags.html).- -> +Do you want to see a display of available and tagged variables? You can explore the function `linelist::tags()` and read its [reference documentation](https://epiverse-trace.github.io/linelist/reference/tags.html). ::::::::::::::::::::: @@ -176,7 +173,7 @@ Are these additional tags visible in the output? ## Validation -To ensure that all tagged variables are standardized and have the correct data +To validate that all tagged variables are standardized and have the correct data types, use the `linelist::validate_linelist()` function, as shown in the example below: ```{r} @@ -190,6 +187,7 @@ corresponding datatype using the `linelist::make_linelist()` function. ::::::::::::::::::::::::: challenge +## Changes in Variable Types During Linelist Validation Let's assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed. Let's consider the example where the type `age` variable has changed from a double (``) to character (``). @@ -310,18 +308,20 @@ cleaned_data %>% ## Safeguarding -Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below. +Safeguarding is implicitly built into the linelist objects. If you try to delete or modify any of the tagged columns, you will receive an error or warning message, as shown in the example below. ```{r, warning=TRUE} new_df <- linelist_data %>% dplyr::select(case_id, gender) ``` -This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function. 
+This `Warning` is the default option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function. ::::::::::::::::::::::::::::::::::::: challenge +## Exploring Safeguarding Behavior for Lost Tags + Let's test the implications of changing the **safeguarding** configuration from a `Warning` to an `Error` message. - First, run this code to count the frequency of each category within a categorical variable: @@ -388,6 +388,8 @@ Data analysis during an outbreak response or mass - gathering surveillance deman - Use the `{linelist}` package to tag, validate, and prepare case data for downstream analysis. +- Explore and map dataset variables to predefined tags for standardization. +- Understand how warnings vs. errors affect the data processing workflow. :::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/learners/reference.md b/learners/reference.md index 14b5a951..701c5476 100644 --- a/learners/reference.md +++ b/learners/reference.md @@ -2,59 +2,11 @@ title: 'Glossary of Terms: Epiverse-TRACE' --- -## A - -[Airborne transmission]{#airborne} -: Individuals become infected via contact with infectious particles in the air. Examples include influenza and COVID-19. Atler et al. (2023) discuss about [factors and management procedures](https://www.ncbi.nlm.nih.gov/books/NBK531468/) of airborne transmission. - -## B -[Basic reproduction number]{#basic} -: A measure of the transmissibility of a disease. Defined as the average number of secondary cases arising from an initial infected case in an entirely susceptible population. [More information on the basic reproduction number](https://en.wikipedia.org/wiki/Basic_reproduction_number). - -[Bayesian inference]{#bayesian} -: A type of statistical inference where prior beliefs are updated using observed data. -[More information on Bayesian inference](https://en.wikipedia.org/wiki/Bayesian_inference). 
- - -## C - -[Contact matrix]{#contact} -: The contact matrix is a square matrix consisting of rows/columns equal to the number age groups. Each element represents the frequency of contacts between age groups. If we believe that transmission of an infection is driven by contact, and that contact rates are very different for different age groups, then specifying a contact matrix allows us to account for age specific rates of transmission. - -[C++]{#cplusplus} -: C++ is a high-level programming language that can be used within R to speed up sections of code. To learn more about C++ check out these [tutorials](https://cplusplus.com/doc/tutorial/) and learn more about the integration of C++ and R in the [Rcpp documentation](https://www.rcpp.org/). -[Censoring]{#censoring} -: -Means that we know an event happened, but we do not know exactly when it happened. Most epidemiological data are “doubly censored” because there is uncertainty surrounding both primary and secondary event times. Not accounting for censoring can lead to biased estimates of the delay’s standard deviation ([Park et al., in progress](https://github.com/parksw3/epidist-paper)). -Different sampling approaches can generate biases given left and right censoring in the estimation of the serial interval that can propagate bias to the estimation of the [incubation period](#incubation) and generation time ([Chen et al., 2022](https://www.nature.com/articles/s41467-022-35496-8/figures/2)) - -## D - -[Deterministic model]{#deterministic} -: Models that will always have the same trajectory for given initial conditions and parameter values. Examples include ordinary differential equations and difference equations. - -[Direct transmission]{#direct} -: Individuals become infected via direct contact with other infected humans. Airborne transmitted infections are often modelled as directly transmitted infections as they require close contact with infected individuals for successful transmission. 
- -## E - -[Effective reproduction number]{#effectiverepro} -: The time-varying or effective reproduction number ($Rt$) is similar to the [Basic reproductive number](#basic) ($R0$), but $Rt$ measures the number of persons infected by infectious person when some portion of the population has already been infected. Read more about the [etymology of Reproduction number by Sharma et al, 2023](https://wwwnc.cdc.gov/eid/article/29/8/22-1445_article). - - - ## G [Generation time]{#generationtime} : Time between the onset of infectiousness of an index case and its secondary case. This always needs to be positive. The generation time distribution is commonly estimated from data on the [serial interval](#serialinterval) distribution of an infection ([Cori et al. 2017](https://royalsocietypublishing.org/doi/10.1098/rstb.2016.0371)). - - -[Growth rate]{#growth} -: The exponential growth rate tells us how much cases are increasing or decreasing at the start of an epidemic. It gives us a measure of speed of transmission, see [Dushoff & Park, 2021](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2020.1556). - - - ## I [Incubation period]{#incubation} @@ -63,68 +15,26 @@ The generation time distribution is commonly estimated from data on the [serial This can be different to the [latent period](#latent) as shown in Figure 4 from ([Xiang et al. (2021)](https://www.sciencedirect.com/science/article/pii/S2468042721000038#fig4)). The relationship between the incubation period and the [serial interval](#serialinterval) helps to define the type of infection transmission (symptomatic or pre-symptomatic) ([Nishiura et al. (2020)](https://www.ijidonline.com/article/S1201-9712(20)30119-3/fulltext#gr2)). -[Indirect transmission]{#indirect} -: Indirectly transmitted infections are passed on to humans via contact with vectors, animals or contaminated environment. Vector-borne infections, zoonoses and water-borne infections are modelled as indirectly transmitted. 
- -[Initial conditions]{#initial} -: In [ODEs](#ordinary), the initial conditions are the values of the state variables at the start of the model simulation (at time 0). For example, if there is one infectious individual in a population of 1000 in an Susceptible-Infectious-Recovered model, the initial conditions would be $S(0) = 999$, $I(0) = 1$, $R(0) = 0$. - -[Infectious period]{#infectiousness} -: Also known as Duration of infectiousness. Time period between the onset and end of infectious [viral shedding](#viralshedding). -Viral load and detection of infectious virus are the two key parameters for estimating infectiousness ([Puhach et al., 2022](https://www.nature.com/articles/s41579-022-00822-w) and [Hakki et al, 2022](https://www.thelancet.com/journals/lanres/article/PIIS2213-2600(22)00226-0/fulltext)](fig/infectiousness-covid19.jpg)). - - - - - ## L [Latent period]{#latent} : The time between becoming infected and the onset of infectiousness. This can be different to the [incubation period](#incubation) as shown in Figure 4 from ([Xiang et al, 2021](https://www.sciencedirect.com/science/article/pii/S2468042721000038#fig4)) -## M -[Model parameters (ODEs)]{#parsode} -: The model parameters are used in [ordinary differential equation](#ordinary) models to describe the flow between disease states. For example, a transmission rate $\beta$ is a model parameter that can be used to describe the flow between susceptible and infectious states. - - -## N - -[Natural history of disease]{#naturalhistory} -: Refers to the development of disease from beginning to end without any treatment or intervention. In fact, given the harmfulness of an epidemic, treatment or intervention measures are inevitable. Therefore, it is difficult for the natural history of a disease to be unaffected by the various coupling factors. 
([Xiang et al, 2021](https://www.sciencedirect.com/science/article/pii/S2468042721000038)) - -[Non-pharmaceutical interventions]{#NPIs} -: Non-pharmaceutical interventions (NPIs) are measures put in place to reduce transmission that do not include the administration of drugs or vaccinations. [More information on NPIs](https://www.gov.uk/government/publications/technical-report-on-the-covid-19-pandemic-in-the-uk/chapter-8-non-pharmaceutical-interventions). - -## O -[Ordinary differential equations]{#ordinary} -: Ordinary differential equations (ODEs) can be used to represent the rate of change of one variable (e.g. number of infected individuals) with respect to another (e.g. time). Check out this introduction to [ODEs](https://mathinsight.org/ordinary_differential_equation_introduction). ODEs are widely used in infectious disease modelling to model the flow of individuals between different disease states. - -[Offspring distribution]{#offspringdist} -: Distribution of the number of secondary cases caused by a particular infected individual. ([Lloyd-Smith et al., 2005](https://www.nature.com/articles/nature04153), [Endo et al., 2020](https://wellcomeopenresearch.org/articles/5-67/v3)) - -[Outbreak analytics]{#outbreakanalytics} -: A specialized field within data science that focuses on the technological and methodological aspects of the outbreak data pipeline. This includes the systematic collection, analysis, modeling, and reporting of data to inform outbreak response ([Polonsky et al., 2019](https://royalsocietypublishing.org/doi/full/10.1098/rstb.2018.0276)). - -## P - -[(Dynamical or Epidemic) Phase bias]{#phasebias} -: Accounts for population susceptibility at the times transmission pairs are observed. -It is a type of sampling bias. 
It affects backward-looking data and is related to the phase of the epidemic: during the exponential growth phase, cases that developed symptoms recently are over-represented in the observed data, while during the declining phase, these cases are underrepresented, leading to the estimation of shorter and longer delay intervals, respectively. ([Park et al., in progress](https://github.com/parksw3/epidist-paper)) - - +[Linelist]{#linelist} +: A linelist is a structured dataset in which each row represents an individual case or observation +and each column represents a specific variable describing that case, such as demographic information, +dates of symptom onset, exposure, or outcomes. [Linelists are a fundamental data format in epidemiology](https://outbreaktools.ca/background/line-lists/?utm_source=chatgpt.com). ## R [Reporting delay]{#reportingdelay} -: Delay or lag between the time an event occurs (e.g. symptom onset) and the time it is reported ([Lawless, 1994](https://www.jstor.org/stable/3315820)). We can quantify it by comparing the linelist with successive versions of it or up-to-date reported aggregated case counts ([Cori et al. 2017](https://royalsocietypublishing.org/doi/10.1098/rstb.2016.0371)). +: Delay or lag between the time an event occurs (e.g. symptom onset) and the time it is reported ([Lawless, 1994](https://www.jstor.org/stable/3315820)). We can quantify it by comparing the [linelist](#linelist) with successive versions of it or up-to-date reported aggregated case counts ([Cori et al. 2017](https://royalsocietypublishing.org/doi/10.1098/rstb.2016.0371)). [RDBMS]{#RDBMS} -: Relational DataBase Management System. -## S +: Relational Database Management System. -[State variables]{#state} -: The state variables in a model represented by [ordinary differential equations](#ordinary) are the disease states that individuals can be in e.g. if individuals can be susceptible, infectious or recovered the state variables are $S$, $I$ and $R$.
There is an ordinary differential equation for each state variable. +## S [Serial interval]{#serialinterval} : The time delay between the onset of symptoms between a primary case and a secondary case. @@ -132,29 +42,3 @@ This can be negative when pre-symptomatic infection occurs. Most commonly, the serial interval distribution of an infection is used to estimate the [generation time](#generationtime) distribution ([(Cori et al., 2017)](https://royalsocietypublishing.org/doi/10.1098/rstb.2016.0371)). The relationship between the serial interval and the [incubation period](#incubation) helps to define the type of infection transmission (symptomatic or pre-symptomatic) ([Nishiura et al. (2020)](https://www.ijidonline.com/article/S1201-9712(20)30119-3/fulltext#gr2)). -[Stochastic model]{#stochastic} -: A model that includes some stochastic process resulting in variation in model simulations for the same initial conditions and parameter values. Examples include stochastic differential equations and branching process models. For more detail see [Allen (2017)](https://doi.org/10.1016/j.idm.2017.03.001). - - -## T - -[(Right) Truncation]{#truncation} -: Type of sampling bias related to the data collection process. It arises because only cases that have been reported can be observed. Not accounting for right truncation during the growth phase of an epidemic can lead to underestimation of the mean delay ([Park et al., in progress](https://github.com/parksw3/epidist-paper)). - - - -## V - -[Vector-borne transmission]{#vectorborne} -: Vector-borne transmission means an infection can be passed from a vector (e.g. mosquitoes) to humans. Examples of vector-borne diseases include malaria and dengue. The World Health Organization have a [Fact sheet about Vector-borne diseases](https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases) with key information and a list of them according to their vector. 
- -[Viral shedding]{#viralshedding} -: The process of releasing a virus from a cell or body into the environment where it can infect other people. ([Cambridge Dictionary, 2023](https://dictionary.cambridge.org/us/dictionary/english/shedding)) - - - - - - - -