Skip to content

robertthiesmeier/cross_site_imputation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

71 Commits
Β 
Β 
Β 
Β 

Repository files navigation

Cross-site imputation to recover covariates without in federated analyses without sharing individual-level pooled data

This repository offers a tutorial to implement cross-site imputation for systematically missing variables in distributed data networks. The orginal publication is available here πŸ“„. All datasets are simulated, and you can try out the code and implementation of the software package yourself. πŸ•΅οΈβ€β™€οΈ This site walks you through the different approaches presented in the paper.

What you can find on this page πŸ“–

  1. Introduction to the problem
  2. Cross-site imputation
  3. Step-by-step tutorial
  4. Empirical heterogeneity between study sites
  5. Further reading

Introduction to the problem

Missing data is a common challenge across scientific disciplines. Most of the currently implemnted imputation methods require the availability of individual data to impute missing values. Often, however, missingness requires using external data for the imputation. Therefore, propose a new imputation approach - cross-site multiple imputation - designed to impute missing values using linear predictors and their related covariance matrix from imputation models estimated in one or multiple external studies. This allows for the imputation of any missing values without sharing individual data between studies. The idea was previously discussed here. In this short tutorial on cross-site imputation, we will work with the newly developed Stata code mi impute from that facilitates the imputation of missing values.

Download mi impute from πŸ’»

To impute missing data across study sites, we use the Stata package mi impute from. The command can be downloaded from the SSC Archive in Stata:

ssc install mi_impute_from

Note

In this preprint, we describe the underlying method and present the syntax of mi impute from alongside practical examples of missing data in collaborative research projects.

Cross-site imputation

How does it work? ✏️

For simplicity, we consider three binary variables: an outcome, Y, and exposure, X, and a confounder C. The relations between the three variables is as follows:

DAG_conf

Tip

To generate the data according to the outlined mechanism above and NOT use any real data, apply the DGM program below. This tutorial guides you through the code. For smooth implementation, consider running the attached file in this repository.

Details **Data Generating Mechanism (DGM)**

To run the code with the simulated data in this example, please check the DGM.

cap program drop singlestudy
program define singlestudy, rclass

  syntax /// 
  	[, nobs(real 1000) ///
  	study(real 1) ///
  	filename(string)]
  
  drop _all 
  local nobs = runiformint(500,10000)
  qui set obs `nobs'
  
  local pc =  runiform(0.1, 0.4)
  gen c = rbinomial(1, `pc')
  
  	
  local a0 = runiform(0.05, 0.1)
  local a1 = rnormal(ln(4), 0.1)	
  gen x = rbinomial(1, invlogit(`a0'+`a1'*c))

  local b0 = runiform(logit(0.05), logit(0.3))
  local b1 = rnormal(ln(1.5), 0.1)
  local b2 = rnormal(ln(4), 0.1)
  gen y = rbinomial(1, invlogit(`b0'+`b1'*x+`b2'*c))
  
  gen study = `study'
  if "`filename'" != "" qui save "`filename'", replace 
end

We generate five studies according to the specified mechanism above:

forv i = 1/5 {
	singlestudy, study(`i') filename(study_`i')
}

Step-by-step tutorial

Full data analysis

❗ The control approach πŸ›‚ Federated analysis without missing data

This analysis serves as a control approach to evaluate the performance of the imputation approach. Can we successfully recover (i.e., impute) missing covariates in a single example? In the following examples, we work with frames in Stata. Even though we generate all data ourselves, we cannot pool the individual data to make the example as realistic as possible. We can initiate the dataframe for our analysis in Stata as follows:

capture frame drop metadata
frame create metadata 
qui frame metadata:{ 
	set obs 5 // number of studies
	gen effect = .
	gen se = .
	gen study = .
	gen size = .
}  

Now, for each study in our data network, we perform the same analysis (federated analysis) and save the estimates for each study in the frame we initiated in the previous step.

forv i = 1/5{
	use study_`i', replace
	logit y x c
	local obs = _N
	frame metadata:{
		replace effect = _b[x] if _n == `i'
		replace se = _se[x] if _n == `i'
		replace study = `i' if _n == `i'
		replace size = `obs'  if _n == `i'
	}
}

Finally, the estimates can be meta-analysed with a fixed or random meta-analytical model.

frame metadata: list 
frame metadata: meta set effect se , studysize(size)
frame metadata: meta summarize,  eform fixed

Generate missing data

❗ After having conducted the control analysis, we can generate systematically missing data on the confounder C in Study 4 and 5.

forv i = 4/5{
	qui use study_`i', clear 
	qui count 
	di in red "Study `i': N = " r(N)
	replace c = . 
	save study_`i', replace
}

Unadjusted pooled estimate

Approach 1️⃣ Federated analysis using 5 studies without adjustment for C

In this analysis, we consider all studies, however, none of them includes the adjustment for the confounder C.

forv i = 1/5{
	use study_`i', replace
	logit y x
	local obs = _N
	frame metadata:{
		replace effect = _b[x] if _n == `i'
		replace se = _se[x] if _n == `i'
		replace study = `i' if _n == `i'
		replace size = `obs'  if _n == `i'
	}
}

frame metadata: list 
frame metadata: meta set effect se , studysize(size)
frame metadata: meta summarize,  eform fixed

Adjusted complete case analysis

Approach 2️⃣ Federated analysis using only studies with complete information on all variables

This approach takes into consideration 3/5 studies and disregards study 4 and 5 as a result of systematically missing data on C at these sites.

capture frame drop metadata
frame create metadata 
qui frame metadata:{ 
	set obs 3
	gen effect = .
	gen se = .
	gen study = .
	gen size = .
}  

forv i = 1/3{
	use study_`i', replace
	logit y x c
	local obs = _N
	frame metadata:{
		replace effect = _b[x] if _n == `i'
		replace se = _se[x] if _n == `i'
		replace study = `i' if _n == `i'
		replace size = `obs'  if _n == `i'
	}

}

frame metadata: meta set effect se , studysize(size)
frame metadata: meta summarize,  eform fixed

Adjusted for available data estimate

Approach 3️⃣ Federated analysis using all studies with complete and incomplete information on all variables

In this approach we aim to include all studies in the analysis, regardless of whether or not we are able to adjust for C in some of the studies with missing information. First, we fit the fully adjusted outcome model in study 1 to 3:

capture frame drop metadata
frame create metadata 
qui frame metadata:{ 
	set obs 5
	gen effect = .
	gen se = .
	gen study = .
	gen size = .
}  

forv i = 1/3{
	use study_`i', replace
	logit y x c
	local obs = _N
	frame metadata: replace effect = _b[x] if _n == `i'
	frame metadata: replace se = _se[x] if _n == `i'
	frame metadata: replace study = `i' if _n == `i'
	frame metadata: replace size = `obs'  if _n == `i'
}

Second, we fit the model not adjusting for the confounding variable C in study 4 and 5.

forv i = 4/5{
	use study_`i', replace
	logit y x
	local obs = _N
		frame metadata:{
		replace effect = _b[x] if _n == `i'
		replace se = _se[x] if _n == `i'
		replace study = `i' if _n == `i'
		replace size = `obs' if _n == `i'
	}
}

Finally, we can again analyse all studies with a meta-analytical model.

frame metadata: meta set effect se , studysize(size)
frame metadata: meta summarize, eform fixed

Single study imputation adjusted pooled estimate

Approach 4️⃣ Federated analysis using all studies and recovering the missing variable C

The next two approaches consider all studies and applying cross-site imputation to recover the variables with missing information at sites where values are missing. In this approach, we consider a randomly selected study that has complete information on C (one out of the three studies) and fir the imputation model in that study. The imputation regression coefficients are then exported to text files that can be easily shared with other studies.

local pick = runiformint(1,3) 
use study_`pick', clear
	logit c y x
 
	// export matrices 
	mat ib = e(b)
	mat iV = e(V)
	svmat ib
	qui export delimited ib* using b_study`pick'.txt in 1 , replace 
	svmat iV 
	qui export delimited iV* using v_study`pick'.txt if iV1 != . , replace 

At the study sites with missing data (Study 4 and Study 5), we import the regression coefficients and impute the missing values on C 10-times. A detailed explanation on this step can be found here. Finally, we fit the outcome model to each imputed dataset and combine the estimates with Rubin's rules.

forv i = 4/5 {

	use study_`i', clear 
	mi set wide
	mi register imputed c
	
	mi_impute_from_get, b(b_study`pick') v(v_study`pick') colnames(y x _cons) imodel(logit) 
	mat ib = r(get_ib)
	mat iV = r(get_iV)
	
	mi impute from c , add(10) b(ib) v(iV) imodel(logit)

	mi estimate, post noi: logit y x c

	local obs = _N
	frame metadata:{
		replace effect = _b[x] if _n == `i'
		replace se = _se[x] if _n == `i'
		replace study = `i' if _n == `i'
		replace size = `obs'  if _n == `i'
	}
}

We have now estimated the C-adjusted effect of the exposure X on the outcome Y after imputing C in Study 4 and 5. Next, we can derive the estimates for Study 1 to 3 as we have done the in the previous steps and apply a meta-analysis to derive a single pooled estimate.

forv i = 1/3{
	use study_`i', replace
	logit y x c
	local obs = _N
	frame metadata{
		replace effect = _b[x] if _n == `i'
		replace se = _se[x] if _n == `i'
		replace study = `i' if _n == `i'
		replace size = `obs'  if _n == `i'
	}
}

frame metadata: meta set effect se , studysize(size)
frame metadata: meta summarize, eform fixed

Multiple study imputation adjusted pooled estimate

Approach 5️⃣ Federated analysis using all studies and recovering the missing variable C

In this approach, we also consider all five studies using cross-site imputation to recover missing values of C in Study 4 and 5. However, here we consider Study 1 to 3 to fit an imputaton model as opposed to only considering a single study as the basis for imputation. To do so, we first fit the imputation model in the studies with available data on the confounder C. After the first step (fitting the prediction model in the studies with available data), we save all files and transport them to the sites with missing data. Here, we proceed as outlined in Approach 4. The command mi_impute_from_get recognises multiple input files and takes a weighted average.

forv i = 1/3 {
	qui use study_`i', replace
	qui logit c y x
	mat ib = e(b)
	mat iV = e(V)
	svmat ib
	qui export delimited ib* using b_study`i'.txt in 1 , replace 
	svmat iV 
	qui export delimited iV* using v_study`i'.txt if iV1 != . , replace 
}

local b_file "b_study1 b_study2 b_study3"
local v_file "v_study1 v_study2 v_study3"

// use the list of files in mi_impute_from_get

forv k = 4/5{
	
	// impute in study 4 & 5 
	use study_`k', clear 
		
	mi set wide
	mi register imputed c
		
	mat drop _all
	mi_impute_from_get, b(`b_file') v(`v_file') colnames(y x _cons) imodel(logit) // weighted average is automatically taken
	mat ib = r(get_ib)
	mat iV = r(get_iV)
	
	mi impute from c , b(ib) v(iV) add(10) imodel(logit)
	mi estimate, post noi: logit y x c

	local obs = _N
	frame metadata:{
		replace effect = _b[x] if _n == `k'
		replace se = _se[x] if _n == `k'
		replace study = `k' if _n == `k'
		replace size = `obs' if _n == `k'
	}
}

We add the estimates for the studies with complete data and apply a meta-analysis in the end.

forv i = 1/3{
	use study_`i', replace
	logit y x c
	local obs = _N
	frame metadata{
		replace effect = _b[x] if _n == `i'
		replace se = _se[x] if _n == `i'
		replace study = `i' if _n == `i'
		replace size = `obs'  if _n == `i'
	}
}

frame metadata: meta set effect se , studysize(size)
frame metadata: meta summarize,  eform fixed

Empirical heterogeneity between study sites

πŸ”† Extension to Approach 5️⃣ (using multiple prediction models to impute the missing confounder) πŸ”†

When multiple studies are used to fit the prediction model, it is often desirable to account for the statistical (i.e., empirical) heterogeneity between the study sites. To account for these differences in the final imputation model, we can fit a meta-regression model with random effects combining the regression coefficients from multiple studies. The example below illustrates this approach.

Tip

In the following section, we walk you through the code. For a smooth implemntation, refer to the do.file and run everything in one go to avoid any error messages.

First, as shown above, the imputation must be fit at all sites with complete data on the systematically missing confounder.

*** fit imputation model in all studies with available data ***
forv i = 1/3 {
	qui use study_`i', replace
	qui logit c y x
	mat ib = e(b)
	mat iV = e(V)
	svmat ib
	qui export delimited ib* using b_study`i'.txt in 1 , replace 
	svmat iV 
	qui export delimited iV* using v_study`i'.txt if iV1 != . , replace 
}


local b_file "b_study1 b_study2 b_study3"
local v_file "v_study1 v_study2 v_study3"

πŸ“§ We can send the .txt files to the sites with missing data. At the receiving site: Transform the files into matrices.

** import 3 files with beta-coefficients from prediction model

forv i = 1/3 {
	tempname get_b_`i'
	qui import delimited using "b_study`i'.txt", clear 
	qui mkmat * , matrix(`get_b_`i'')	
}

** import the 3 files with the variance/covariance matrix from the prediction model

forv i = 1/3 {
	tempname get_v_`i'
	qui import delimited using "v_study`i'.txt", clear 
	qui mkmat *, matrix(`get_v_`i'')
}

▢️ We now imported all regression coefficients and their variance/covariances from the three prediction models. Next, we would like to take a random meta regression model of the three set of coefficients to derive our final imputation regression coefficients. In the previous examples, when multiple files were input, mi_impute_from_get facilitated a weighted average using the inverse variance method. Now, we illustrate how to fit a random meta regression model to respect the statistical heterogeneity between sites. (πŸ•΅οΈ Expand to see code shown in details)

Details
capture frame drop random_impmodel  
frame create random_impmodel study y1 y2 y3 v11 v12 v13 v22 v23 v33

forv t = 1/3 {
 
	local post_est_b_`t' ""
	local post_est_V_`t' ""
	
	forv i = 1/3 {
		local post_est_b_`t' "`post_est_b_`t'' (`=`get_b_`t''[1,`i']')"
	}
	
	forv i = 1/3 {
		forv j = `i'/3 {   // triangular and diagonal elements of the matrix for the V matrix
            		local post_est_V_`t' "`post_est_V_`t'' (`=`get_v_`t''[`i',`j']')"
        	}
    	}
	
	qui frame post random_impmodel (`t') `post_est_b_`t'' `post_est_V_`t''
}

Once we created a dataframe and organised the beta-coefficients and their variance/covariances, we can fit a meta regression model with random effects.

frame random_impmodel: meta mvregress y* , wcovvariables(v*) random(reml, covariance(identity))

The output has to be cleaned up and organised a little bit to be used for mi impute from. (πŸ•΅οΈ Expand to see the code shown in details)

Details
	
** clean up the results to be used with mi impute from
mat ib = e(b)[1, 1..3]
mat colnames ib = y x _cons

mat iV = e(V)[1..3, 1..3]
mat colnames iV = y x _cons
mat rownames iV = y x _cons

mat li ib 
mat li iV

We can now proceed as shown above. We use the final set of imputation coefficients in the mi impute from package.

** use the matrices in the studies with incomplete data

forv k = 4/5{
	
	// impute in study 4 & 5 
	use study_`k', clear 
		
	mi set wide
	mi register imputed c
		
	mi impute from c , b(ib) v(iV) add(10) imodel(logit)
	mi estimate, post noi: logit y x c

	local obs = _N
	frame metadata:{
		replace effect = _b[x] if _n == `k'
		replace se = _se[x] if _n == `k'
		replace study = `k' if _n == `k'
		replace size = `obs' if _n == `k'
	}
}

** do the analysis for all studies with complete data and fit a final random-effects meta-analysis

forv i = 1/3{
	use study_`i', replace
	logit y x c
	local obs = _N
	frame metadata{
		replace effect = _b[x] if _n == `i'
		replace se = _se[x] if _n == `i'
		replace study = `i' if _n == `i'
		replace size = `obs'  if _n == `i'
	}
}

frame metadata: meta set effect se , studysize(size)
frame metadata: meta summarize,  eform random(reml)

β˜‘οΈ We now used the cross-site imputation approach allowing for heterogeneity in the imputation model.

Wrap-up βœ…

All five approaches can be implemented without the need for any real data and you can test the package mi impute from. In addition, we showed how to incorporate empirical heterogenity between the sites into the final imputation model by fitting a meta-regression model with random effects on the regression coefficients coming from multiple sites. This approach is explained in more detail in Resche-Rigon et al. (2018).

Important

Again, please refer to this preprint for a more detailed description of the steps and assumptions made that are pivotal to understand the concept of cross-site imputation.

Further reading

πŸ“š References πŸ“š

πŸ”– The idea of cross-site imputation, an applied example, and a discussion around the assumptions of the method are presented in: Thiesmeier, R. Madley-Dowd, P. Orsini, N., Ahlqvist, V. (2024). Cross-site imputation for recovering variables without individual pooled data.

πŸ”– The underlying imputation method and a simulation study are described in: Thiesmeier, R., Bottai, M., & Orsini, N. (2024). Systematically missing data in distributed data networks: multiple imputation when data cannot be pooled. Journal of Computational Statistics and Simulation, 94(17), 3807–3825

πŸ”– Multiple imputation by chained equations for systematically and sporadically missing multilevel data by Resche-Rigon M. and White I. illustarting the theoretical foundations of a two-stage imputation process for multilevel data.

πŸ”– A first introduction and illustration of MI algorithms in distributed health data networks by Chang et al. (2020).

πŸ–₯️ Software πŸ–₯️

🏷️ The documentation of mi impute from is available as a preprint: Thiesmeier, R., Bottai, M., & Orsini, N. (2024). Imputing Missing Values with External Data.

🏷️ The software package mi impute from in Stata is stored at: Thiesmeier R, Bottai M, Orsini R. (2024). MI_IMPUTE_FROM: Stata module to impute using an external imputation model. Statistical Software Components S459378, Boston College Department of Economics

πŸ“’ Presentations πŸ“’

🏷️ Theoretical underpinnings of cross-site imputation were presneted at the:

  1. Royal Statistical Society International Conference in Harrogate, UK, 2023
  2. Royal Statistical Society International Conference in Brighton, UK, 2024
  3. International Biometric Society Conference in Atlanta, GA, USA, 2024

The first version of mi impute from was presented at the 2024 UK Stata Conference in London and at the 2025 Stata Biostatistics and Epidemiology Virtual Symposium.

⚠️ If you find any errors, please notfiy us: robert.thiesmeier@ki.se ⚠️

About

Computer code and working example on how to recover covariates without the need to share individual data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages