This script was written and submitted for the Missing Data assignment in the Data Science for Epidemiology course. The following tasks were completed in the assignment:
The can_path_student_dataset was explored to identify the extent, patterns, and potential reasons for missing data. To reduce the computational demands of running the entire dataset, 15 columns with less than 20% missing data were selected. Methods such as little_mcar and a regression model (using "NA") were applied to the dataset to understand the nature of the missingness. Although these methods were suggested, they were not used as a means of proving the type of missingness in the data; the results were interpreted as an assumption. Findings were summarized using tables, charts, and heatmaps to visualize missingness. The project used the naniar and visdat packages, which are R packages for working with missing data.
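The column-selection step described above can be sketched in base R. This is only an illustration under stated assumptions: the data frame `toy` and its column names and values are made up, and the assignment itself worked on can_path_student_dataset with naniar and visdat rather than this code.

```r
# Toy data standing in for the real dataset (hypothetical columns and values)
toy <- data.frame(
  fruit    = c(2, NA, 3, 1, 4, 2, 5, 3, 2, 1),
  activity = c(30, 45, NA, 60, 20, 15, 40, 35, 50, 25),
  income   = c(NA, NA, NA, 5, 4, NA, 3, NA, 2, NA)
)

# Percentage of missing values in each column
pct_missing <- colMeans(is.na(toy)) * 100

# Keep only the columns with less than 20% missing data
keep <- names(pct_missing)[pct_missing < 20]
toy_subset <- toy[, keep, drop = FALSE]

print(round(pct_missing, 1))
print(keep)
```

In this toy example `income` is dropped because most of its values are missing. For a tidy version of the same per-column summary, naniar's `miss_var_summary()` reports similar percentages.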
Mean imputation, Multiple Imputation by Chained Equations (MICE), and K-Nearest Neighbors (KNN) imputation were used to fill in the missing data in the can_path_student_dataset. The imputation process for each method is documented in the attached files.
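As a rough sketch of what the three approaches look like in R, the example below imputes one variable in a simulated data frame. Everything here is an assumption for illustration: the data are simulated, the `fruit` and `activity` column names are hypothetical, the MICE settings (`m = 5`, predictive mean matching) are arbitrary, and using the VIM package for KNN is a guess, since this description does not say which KNN implementation the assignment used. The MICE and KNN steps only run if those packages are installed.

```r
# Simulated toy data (hypothetical values, not the assignment's dataset)
set.seed(42)
n <- 100
toy <- data.frame(
  activity = rnorm(n, mean = 40, sd = 10),
  fruit    = rnorm(n, mean = 3, sd = 1)
)
toy$fruit[sample(n, 15)] <- NA  # make 15% of fruit values missing

# 1. Mean imputation: replace each NA with the observed column mean
mean_imp <- toy
mean_imp$fruit[is.na(mean_imp$fruit)] <- mean(toy$fruit, na.rm = TRUE)

# 2. MICE via the mice package (skipped if mice is not installed)
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  imp <- mice(toy, m = 5, method = "pmm", seed = 42, printFlag = FALSE)
  mice_imp <- mice::complete(imp, 1)  # first of the five completed datasets
}

# 3. KNN via VIM::kNN (skipped if VIM is not installed; VIM is an assumption)
if (requireNamespace("VIM", quietly = TRUE)) {
  knn_imp <- VIM::kNN(toy, variable = "fruit", k = 5, imp_var = FALSE)
}

print(sum(is.na(mean_imp$fruit)))  # number of NAs left after mean imputation
```

Note that MICE produces several completed datasets; extracting only the first one, as done here, is a simplification, and pooled analyses normally use all of them.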
The performance of each method was analyzed by comparing changes in key summary statistics and by visualizing the distributions before and after imputation using boxplots and density plots. A regression model was also fitted to see how the various imputation methods affect the relationship between the quantity of fruit consumed and physical activity.
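The before-and-after comparison can be illustrated with a small base R sketch. It uses simulated data and mean imputation only, and it assumes fruit consumption is the outcome and physical activity the predictor; the variable names, the simulated relationship, and the choice of outcome are all hypothetical and are not taken from the assignment.

```r
# Simulated toy data with a known linear relationship (hypothetical values)
set.seed(7)
n <- 200
activity <- rnorm(n, mean = 40, sd = 10)
fruit <- 1 + 0.05 * activity + rnorm(n, mean = 0, sd = 1)
toy <- data.frame(activity, fruit)
toy$fruit[sample(n, 40)] <- NA  # make 20% of fruit values missing

# Mean-imputed copy of the data
mean_imp <- toy
mean_imp$fruit[is.na(mean_imp$fruit)] <- mean(toy$fruit, na.rm = TRUE)

# Summary statistics before and after imputation
stats <- data.frame(
  data = c("observed only", "mean imputed"),
  mean = c(mean(toy$fruit, na.rm = TRUE), mean(mean_imp$fruit)),
  sd   = c(sd(toy$fruit, na.rm = TRUE), sd(mean_imp$fruit))
)

# Same regression on both versions; lm() drops incomplete rows by default
fit_obs <- lm(fruit ~ activity, data = toy)
fit_imp <- lm(fruit ~ activity, data = mean_imp)
stats$slope <- c(coef(fit_obs)[["activity"]], coef(fit_imp)[["activity"]])

print(stats)
```

Mean imputation leaves the sample mean unchanged but shrinks the standard deviation, which is one reason to compare distributions and regression coefficients as well as means.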
A table at the end of the R Markdown file shows how the various methods affect the sample mean and the regression model.
The Missing_Data_Imputation.Rmd file contains the script for the assignment.
The Missing_Data_Imputation.html file is the knitted output of the .Rmd file, which can be downloaded and opened in a web browser.