Adolescent Family Experiences Predict Young Adult Educational Attainment: A Data-Based Cross-Study Synthesis With Machine Learning
Documentation of programming scripts for the cross-study synthesis on adolescents' family experiences as predictors of young adult educational attainment with machine learning, based on Add Health data.
Paper in press at Journal of Child and Family Studies.
Authors: Xiaoran Sun, Nilam Ram, and Susan M. McHale, at The Pennsylvania State University
This GitHub repository has three purposes:
- To share the code for data pre-processing, machine learning model training and testing, and visualization for the paper, so that other researchers can replicate this study.
- To provide step-by-step pedagogical explanations of how we conducted the analyses for this paper. Researchers should be able to adapt our code to conduct machine learning research on their own substantive research questions.
- To facilitate post-publication communication. If you have any questions regarding our paper, analyses, or code, feel free to post in the 'Issues' section of this repository and we will try our best to answer them.
This paper uses the public-use data (Wave I and Wave IV) of the National Longitudinal Study of Adolescent Health (Add Health). To conduct analyses with the data, please first download them following the instructions at: https://www.cpc.unc.edu/projects/addhealth/documentation/publicdata
We followed the instructions to download data from ICPSR.
The particular datasets used in this study are listed below. You can find the 'Download' option under the 'Data & Documentation' tab at the ICPSR data repository.
- DS1 Wave I: In-Home Questionnaire, Public Use Sample
  - We used both the R and SPSS versions of the data because the SPSS version has more information regarding the missing data pattern.
- DS22 Wave IV: In-Home Questionnaire, Public Use Sample
  - We used the R version for this data.
- DS31 Wave IV: Public Use Weights
  - We used the R version for this data.
For a detailed description of this sample, please refer to the Participants section in the manuscript.
This step includes:
- Selecting and creating the 55 identified family variables from the raw data
- Missing data imputation, with
  - mean imputation for the legitimate skips
  - multiple imputation for the remaining missingness, with the R `mice` package.
- Selecting and creating the college enrollment and college completion variables.
- Preliminary analysis for descriptives (weighted).
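Two of the steps above, mean imputation for legitimate skips and weighted descriptives, can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the repository's pre-processing is written in R (see the `.rmd` file), and the function names, the toy values, and the boolean legitimate-skip flag below are hypothetical and not taken from Add Health. Entries that remain missing after this step would be left for multiple imputation.

```python
import numpy as np


def impute_legitimate_skips(values, is_legit_skip):
    """Fill legitimate-skip entries with the mean of the observed values.

    Entries that are missing for other reasons stay NaN so that a later
    multiple-imputation step can handle them.
    """
    values = np.asarray(values, dtype=float)
    is_legit_skip = np.asarray(is_legit_skip, dtype=bool)
    observed = ~np.isnan(values) & ~is_legit_skip
    filled = values.copy()
    filled[is_legit_skip] = values[observed].mean()
    return filled


def weighted_mean_sd(values, weights):
    """Weighted mean and (population-style) weighted SD, ignoring NaN."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = ~np.isnan(values)
    mean = np.average(values[keep], weights=weights[keep])
    variance = np.average((values[keep] - mean) ** 2, weights=weights[keep])
    return mean, np.sqrt(variance)


# Toy data (hypothetical): one legitimate skip and one other missing value.
scores = np.array([1.0, 3.0, np.nan, 5.0, np.nan])
legit_skip = np.array([False, False, True, False, False])
weights = np.array([1.0, 2.0, 1.0, 1.0, 3.0])

filled = impute_legitimate_skips(scores, legit_skip)
mean, sd = weighted_mean_sd(filled, weights)
print(filled, mean, sd)
```

Keeping the two kinds of missingness separate mirrors the two-stage approach in the list above: the legitimate skips get a deterministic fill, and the remaining `NaN` values are untouched.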
See 01_DataPreProcessing.rmd for detailed instructions, code, and annotations.
Descriptions of this procedure can also be found in the Measures and Data preparation sections in the manuscript.
This step answers the question, "Do family experiences in adolescence predict college enrollment and graduation in young adulthood?" It includes:
- Tuning, training, and testing regularized logistic regression and random forest models to predict college enrollment and college graduation from the 53 family experience variables, using stratified nested cross-validation (5-fold; see visualization below)
- Calculating AUC and accuracy for each algorithm predicting each outcome
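The stratified nested cross-validation described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in the notebooks: the synthetic data from `make_classification`, the grid of `C` values, the choice of AUC as the tuning criterion, and the random seeds are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and a binary outcome (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Inner folds tune hyperparameters; outer folds estimate test performance.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# scikit-learn's LogisticRegression is L2-regularized by default;
# C is the inverse regularization strength.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid

# GridSearchCV runs the inner loop each time it is fitted on an outer training fold.
tuned_model = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=inner_cv)

scores = cross_validate(
    tuned_model, X, y, cv=outer_cv, scoring=["roc_auc", "accuracy"]
)
mean_auc = scores["test_roc_auc"].mean()
mean_accuracy = scores["test_accuracy"].mean()
print(f"AUC: {mean_auc:.3f}, accuracy: {mean_accuracy:.3f}")
```

Because tuning happens only inside each outer training fold, the outer test folds are never used to pick hyperparameters, which is the point of nesting. Swapping the pipeline for a `RandomForestClassifier` with its own grid would follow the same pattern.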
Code for this step includes one Jupyter notebook for each outcome variable and each algorithm:
- `02-1_collen_LR.ipynb`: college enrollment, regularized logistic regression
- `02-2_collen_RF.ipynb`: college enrollment, random forests
- `02-3_collcom_LR.ipynb`: college completion (i.e., graduation), regularized logistic regression
- `02-4_collcom_RF.ipynb`: college completion (i.e., graduation), random forests
This step corresponds to the Method--Data Analysis--Question 1 and Results--Do Family Experiences in Adolescence Predict Educational Attainment in Young Adulthood? sections in the manuscript.
This step answers the question, "Which family experience factors are key predictors of young adult educational attainment?" It includes:
- Computing the feature importance of all 53 predictors in the trained logistic regression and random forest models predicting college enrollment and graduation, respectively.
- Conducting recursive feature elimination (RFE) to identify the set of predictors that remain in models whose prediction accuracy is equivalent to that of the original model with all predictors.
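The idea behind these two steps can be sketched as a manual elimination loop in Python with scikit-learn. This is an assumption-laden illustration, not the procedure in the notebooks: the data are synthetic, importance is taken from the random forest's `feature_importances_`, performance is measured by cross-validated AUC, and "equivalent" is operationalized with a hypothetical tolerance of 0.01.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 8 predictors, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def cv_auc(columns):
    """Cross-validated AUC of a random forest using only `columns`."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=cv, scoring="roc_auc").mean()


remaining = list(range(X.shape[1]))
history = [(list(remaining), cv_auc(remaining))]  # start with all predictors

# Repeatedly drop the least important predictor and re-evaluate.
while len(remaining) > 1:
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[:, remaining], y)
    weakest = int(np.argmin(model.feature_importances_))
    remaining.pop(weakest)
    history.append((list(remaining), cv_auc(remaining)))

full_auc = history[0][1]
tolerance = 0.01  # hypothetical threshold for "equivalent" performance

# Smallest predictor set whose AUC stays within the tolerance of the full model.
selected = min(
    (cols for cols, auc in history if auc >= full_auc - tolerance), key=len
)
print("Selected predictor indices:", selected)
```

scikit-learn also provides `RFE` and `RFECV` in `sklearn.feature_selection`, which automate a similar loop; the manual version is shown here only to make each step visible.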
Codes for this step include:
- `03-1_FeatureImportance_collen.ipynb`: feature importance and RFE for predicting college enrollment
- `03-2_FeatureImportance_collcom.ipynb`: feature importance and RFE for predicting college graduation
- `03-3_FeatureImportancePlotting.rmd`: plot feature importance using ggplot2 in R (Figures S1 & S2 in the manuscript supplemental material)
This step corresponds to the Method--Data Analysis--Question 2 and Results--Which Family Experiences Are Key Predictors of Educational Attainment? sections in the manuscript.
This step is to answer the question, "What complex patterns, including nonlinearities and interactions involving this set of family factors, merit further examination?"
Here, after training and testing the random forest models, we produce 2D and 3D partial dependence plots (PDPs) to visualize nonlinear and/or interactive effects among the predictors.
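To make the quantity behind a PDP concrete, the sketch below computes partial dependence by hand: each selected feature is fixed at a grid value for every observation, and the model's predicted probabilities are averaged. It is a minimal illustration on synthetic data; the helper names, grid sizes, and feature indices are hypothetical, and the notebooks may rely on a plotting library instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)


def partial_dependence_1d(model, X, feature, grid):
    """Average predicted probability with one feature fixed at each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)


def partial_dependence_2d(model, X, feature_a, feature_b, grid_a, grid_b):
    """Average predicted probability over a grid of two features (a surface)."""
    surface = np.empty((len(grid_a), len(grid_b)))
    for i, a in enumerate(grid_a):
        for j, b in enumerate(grid_b):
            X_mod = X.copy()
            X_mod[:, feature_a] = a
            X_mod[:, feature_b] = b
            surface[i, j] = model.predict_proba(X_mod)[:, 1].mean()
    return surface


grid_0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
grid_1 = np.linspace(X[:, 1].min(), X[:, 1].max(), 8)

curve = partial_dependence_1d(forest, X, 0, grid_0)  # values for a 2D plot
surface = partial_dependence_2d(forest, X, 0, 1, grid_0[:8], grid_1)  # values for a 3D plot
print(curve.round(3))
```

A curve that is not a straight line suggests a nonlinear effect, and a surface whose shape along one feature changes with the other suggests an interaction. Plotting `curve` against `grid_0`, or `surface` over the two grids, yields the 2D and 3D views respectively.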
Code for this step includes:
- `04-1_PDP_collen.ipynb`: PDPs for the random forest model predicting college enrollment
- `04-2_PDP_collcom.ipynb`: PDPs for the random forest model predicting college graduation

In particular, within these interactive notebooks, you can plug in the features you select and obtain the PDPs of interest.
This step corresponds to the Method--Data Analysis--Question 3 and Results--What Complex Patterns of Predictors Merit Further Examination? sections in the manuscript.
