search.json
1689 lines (1689 loc) · 296 KB
[
{
"objectID": "index.html",
"href": "index.html",
"title": "hdPS and its machine learning extensions in residual confounding control",
"section": "",
"text": "Background\nThe use of retrospective health care claims datasets is frequently criticized for lacking complete information on potential confounders. Ultimately, the treatment effects estimated utilizing such data sources may be subject to residual confounding. Digital electronic administrative records routinely collect a large volume of health-related information; and many of whom are usually not considered in conventional pharmacoepidemiological studies.",
"crumbs": [
"Background"
]
},
{
"objectID": "index.html#proposal-to-reduce-residual-confounding-bias",
"href": "index.html#proposal-to-reduce-residual-confounding-bias",
"title": "hdPS and its machine learning extensions in residual confounding control",
"section": "Proposal to reduce residual confounding bias",
"text": "Proposal to reduce residual confounding bias\nIn 2009, a high-dimensional propensity score (hdPS) algorithm was proposed that utilizes such information as surrogates or proxies for mismeasured and unobserved confounders in an effort to reduce residual confounding bias. Since then, many machine learning and semi-parametric extensions of this algorithm have been proposed to exploit the wealth of high-dimensional proxy information properly.\n\n\nSchneeweiss et al. (2009)",
"crumbs": [
"Background"
]
},
{
"objectID": "index.html#purpose-of-the-workshop",
"href": "index.html#purpose-of-the-workshop",
"title": "hdPS and its machine learning extensions in residual confounding control",
"section": "Purpose of the workshop",
"text": "Purpose of the workshop\nThis workshop will\n\ndemonstrate logic, steps and implementation guidelines of hdPS utilizing an open data source as an example (using reproducible R codes),\nfamiliarize participants with the difference between propensity score vs. hdPS,\nexplain the rationale for using the machine learning extensions of hdPS, and their statistical properties, and\ndiscuss advantages, controversies, and hdPS reporting guidelines while writing a manuscript.",
"crumbs": [
"Background"
]
},
{
"objectID": "index.html#workshop-prerequisite",
"href": "index.html#workshop-prerequisite",
"title": "hdPS and its machine learning extensions in residual confounding control",
"section": "Workshop prerequisite",
"text": "Workshop prerequisite\nAttendees should have prerequisite knowledge of multiple regression analysis and working knowledge in R (e.g., basic data manipulation and regression fitting).\n\nR Codes\nR Codes for data creation and hdPS analysis can be found on the GitHub repo (codes directory).\n\n\nVersion history\nDifferent versions and updates of the materials were presented in the following sessions\n\nCanadian Society for Epidemiology and Biostatistics, Montreal, Quebec, August 11, 2025 (scheduled)\n2025 Society of Epidemiologic Research Workshops, July 11, 2025 (scheduled)\n2025 Statistical Society of Canada, Biostatistics Workshop, May 25, 2025 (together with Md Belal Hossain)\n2024 Society of Epidemiologic Research Workshops, May 10th, 2024\nR/Medicine Conference 2023, Virtual, June 5, 2023\n2023 Society of Epidemiologic Research Workshops, Virtual, May 4, 2023\n\nAdditional relevant talks (selected):\n\nStatistical issues in administrative data, Banff International Research Station, Banff, Feb 2019.\nStatistics Conference in Genomics, Pharmaceutical Science, and Health Data Science, August 15-17, 2022 University of Victoria, Victoria, BC\nWork in Progress Seminar, CHEOS, St. Paul’s Hospital (Hurlburt Auditorium), Dec 14th, 2022.\nStatistics and Biostatistics seminar series, at the Department of Statistics and Actuarial Science, University of Waterloo, April 26, 2023.\nConference on Statistics and Data Science with Applications in Biology, Genetics, Public Health, and Finance, Thompson Rivers University, Kamloops, August 21-24, 2023.\n\n\n\nCitation\n\n\n\n\n\n\nHow to cite\n\n\n\nKarim, M. E. (2025). High-dimensional propensity score and its machine learning extensions in residual confounding control. The American Statistician, 79(1), 72-90. 
DOI: 10.1080/00031305.2024.2368794.\n\n\n\n\nComments\nFor any comments regarding this document, reach out to me.\n\n\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.",
"crumbs": [
"Background"
]
},
{
"objectID": "motivating.html",
"href": "motivating.html",
"title": "Motivating example",
"section": "",
"text": "Literature\nType 2 diabetes is a metabolic disorder that is characterized by high blood sugar levels and insulin resistance. There is a growing body of evidence that, for type 2 diabetes, obesity is a well-established risk factor. Possible mechanism includes excess body fat leading to insulin resistance, while impairing the body’s ability to regulate blood sugar levels.",
"crumbs": [
"Motivating example"
]
},
{
"objectID": "motivating.html#literature",
"href": "motivating.html#literature",
"title": "Motivating example",
"section": "",
"text": "(Klein et al. 2022)",
"crumbs": [
"Motivating example"
]
},
{
"objectID": "motivating.html#research-question",
"href": "motivating.html#research-question",
"title": "Motivating example",
"section": "Research question",
"text": "Research question\n“Does obesity increase the risk of developing diabetes?”\n\n\nObesity is often considered a challenging exposure variable to define precisely in research studies (Hernán and Taubman 2008). In this case, we are using it as an illustrative example to explain the methods and not attempting to make any clinical statements about this topic.\n\n\n\n\n\n\nflowchart LR\n A[Obesity] --> Y(Diabetes)\n\n\n\n\n\n\n\n\n\nExposure: Being obese\n\nOutcome: Developing diabetes\n\n\n\n\n\n\n\nTip\n\n\n\nThe primary goal of the research is not to answer a clinical question or to draw conclusions about the relationship between obesity and diabetes in the general population, but rather to use the relationship as a motivating example for conducting simulations that compares different statistical methods.\n\n\n\n\n\n\nHernán, Miguel A, and Sarah L Taubman. 2008. “Does Obesity Shorten Life? The Importance of Well-Defined Interventions to Answer Causal Questions.” International Journal of Obesity 32 (3): S8–14.\n\n\nKlein, Samuel, Amalia Gastaldelli, Hannele Yki-Järvinen, and Philipp E Scherer. 2022. “Why Does Obesity Cause Diabetes?” Cell Metabolism 34 (1): 11–20.",
"crumbs": [
"Motivating example"
]
},
{
"objectID": "data.html",
"href": "data.html",
"title": "1 Data to Analyze",
"section": "",
"text": "1.1 Choose a U.S. data source\nTo answer the research question “Does obesity increase the risk of developing diabetes?” in the U.S. context, we do the following:",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>1</span> <span class='chapter-title'>Data to Analyze</span>"
]
},
{
"objectID": "data.html#choose-a-u.s.-data-source",
"href": "data.html#choose-a-u.s.-data-source",
"title": "1 Data to Analyze",
"section": "",
"text": "Data source: National Health and Nutrition Examination Survey (NHANES) (Disease Control and Prevention 2021)\n\n2013-2014,\n2015-2016,\n2017-2018\n\nAvailability: NHANES is a publicly available dataset that can be downloaded for free from the CDC website.\nDesign: Observational cross-sectional data. Hence, inferring causality is not a possibility or our objective here.",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>1</span> <span class='chapter-title'>Data to Analyze</span>"
]
},
{
"objectID": "data.html#confounder-identification",
"href": "data.html#confounder-identification",
"title": "1 Data to Analyze",
"section": "1.2 Confounder identification",
"text": "1.2 Confounder identification\nDirected acyclic graph (DAG)\n\n\n(Greenland, Pearl, and Robins 1999)\n\n\n\n\n\n\nflowchart TB\n A[Obesity A] --> Y(Diabetes Y)\n L[Confounders C] --> Y\n L --> A\n\n\n\n\n\n\n\n\n\n\n\n\nHypothesized Directed acyclic graph drawn based on analyst’s best understanding of the literature\n\n\n\n\n\n\nExposure: Being obese\n\nOutcome: Developing diabetes\n\nConfounders: Demographic and lab variables",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>1</span> <span class='chapter-title'>Data to Analyze</span>"
]
},
{
"objectID": "data.html#structure-of-the-data",
"href": "data.html#structure-of-the-data",
"title": "1 Data to Analyze",
"section": "1.3 Structure of the data",
"text": "1.3 Structure of the data\n\n\n\n\n\n\nflowchart LR\n D[NHANES 2013-14] --> demo[Demographic \\nVariables \\nand \\nSample \\nWeights]\n demo --> Age\n demo --> Sex\n demo --> Education\n demo --> r[Race or \\nethnicity]\n demo --> m[Marital \\nstatus]\n demo --> Income\n demo --> b[Birth place]\n demo --> sf[Survey \\nfeatures: \\nsampling \\nweights, \\nstrata, \\ncluster]\n D --> bmi[Body \\nMeasures]\n bmi --> Obesity\n D --> diq[Diabetes]\n diq --> Diabetes\n diq --> f[Family \\nhistory of \\ndiabetes]\n D --> smq[Smoking - \\nCigarette Use]\n smq --> Smoking\n D --> dbq[Diet \\nBehavior & \\nNutrition]\n dbq --> Diet\n D --> paq[Physical \\nActivity]\n paq --> p[Physical \\nactivities]\n D --> huq[Hospital \\nUtilization & \\nAccess \\nto Care]\n huq --> mm[Medical \\naccess]\n D --> bpx[Blood \\nPressure]\n bpx --> sbp[Systolic \\nBlood \\nPressure]\n bpx --> dbp[Diastolic \\nBlood \\nPressure]\n D --> bpq[Blood \\nPressure & \\nCholesterol]\n bpq --> hc[High \\ncholesterol]\n D --> slq[Sleep \\nDisorders]\n slq --> Sleep\n D --> biopro[Standard\\n Biochemistry \\nProfile]\n biopro --> u[Uric \\nacid]\n biopro --> Protein\n biopro --> Bilirubin\n biopro --> Phosphorus\n biopro --> Sodium\n biopro --> Potassium\n biopro --> Globulin\n biopro --> Calcium\n D --> rxq[Prescription\\n Medications - \\nICD-10-CM \\ncodes]\n style D fill:#FFA500;\n style rxq fill:#00FF00;\n style biopro fill:#00FF00;\n style slq fill:#00FF00;\n style bpq fill:#00FF00;\n style bpx fill:#00FF00;\n style huq fill:#00FF00;\n style paq fill:#00FF00;\n style dbq fill:#00FF00;\n style smq fill:#00FF00;\n style diq fill:#00FF00;\n style bmi fill:#00FF00;\n style demo fill:#00FF00;\n\n\n\n\n\n\n\n\n\nWe do the same for the following cycles:\n\nNHANES 2015-16\nNHANES 2017-18",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>1</span> <span class='chapter-title'>Data to Analyze</span>"
]
},
{
"objectID": "data.html#identify-measured-and-unmeasured-variables-in-the-data",
"href": "data.html#identify-measured-and-unmeasured-variables-in-the-data",
"title": "1 Data to Analyze",
"section": "1.4 Identify measured and unmeasured variables in the data",
"text": "1.4 Identify measured and unmeasured variables in the data\nFind variables capturing the following concepts in the data based on a hypothesized DAG.\n\n\n\n\nRole\nData Component\nVariables considered based on DAG\n\n\n\n\nOutcome\nDIQ\nHave diabetes1\n\n\nExposure\nBMX\nObese; BMI >= 30\n\n\nConfounder\n(demographic) DEMO\nAge, Sex, Education, Race/ethnicity, Marital status, Annual household income, County of birth, Survey cycle year\n\n\n\n(behaviour) SMQ, PAQ, SLQ, DBQ\nSmoking2, Vigorous work activity, Sleep3, Diet4\n\n\n\n(health history / access) DIQ, HUQ\nDiabetes family history, Access to care5\n\n\n\n(lab) BPX, BPQ, BIOPRO\nBlood pressure (systolic, diastolic6), Cholesterol, Uric acid, Total Protein, Total Bilirubin, Phosphorus, Sodium, Potassium, Globulin, Total Calcium\n\n\n\n\n\n\n14 demographic, behavioral, health history related variables\n\nMostly categorical\n\n11 lab variables\n\nMostly continuous",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>1</span> <span class='chapter-title'>Data to Analyze</span>"
]
},
{
"objectID": "data.html#fitting-crude-model-to-obtain-or",
"href": "data.html#fitting-crude-model-to-obtain-or",
"title": "1 Data to Analyze",
"section": "1.5 Fitting crude model to obtain OR",
"text": "1.5 Fitting crude model to obtain OR\n\n\n\n\n\n\nCrude association\n\n\n\nHere we estimate the crude association between the exposure and the outcome.\n\n\n\nout.formula <- as.formula(\"outcome ~ exposure\")\nfit <- glm(out.formula,\n data = hdps.data,\n family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n c(\"Estimate\", \n \"Std. Error\", \n \"Pr(>|z|)\")]\nfit.ci <- confint(fit, level = 0.95)[\"exposure\", ]\nfit.summary_with_ci.crude <- c(fit.summary, fit.ci)\nknitr::kable(t(round(fit.summary_with_ci.crude, 2)))\n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.73\n0.05\n0\n0.63\n0.84\n\n\n\n\n\n\n\n\n\nDisease Control, Centers for, and Prevention. 2021. “National Health and Nutrition Examination Survey (NHANES).” National Center for Health Statistics.\n\n\nGreenland, Sander, Judea Pearl, and James M Robins. 1999. “Causal Diagrams for Epidemiologic Research.” Epidemiology, 37–48.",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>1</span> <span class='chapter-title'>Data to Analyze</span>"
]
},
{
"objectID": "data.html#footnotes",
"href": "data.html#footnotes",
"title": "1 Data to Analyze",
"section": "",
"text": "combination of (a) Doctor told you have diabetes, (b) Taking insulin now, (c) Take diabetic pills to lower blood sugar.↩︎\ncigarette use (at least 100 cigarettes in life)↩︎\nSleep hours/workdays↩︎\nHow healthy is the diet↩︎\nRoutine place to go for healthcare↩︎\naverage of 4 measurements↩︎",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>1</span> <span class='chapter-title'>Data to Analyze</span>"
]
},
{
"objectID": "psipw.html",
"href": "psipw.html",
"title": "2 Propensity score",
"section": "",
"text": "2.1 Propensity Score Analysis\nThere are four approaches to propensity score (PS) analysis:",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Propensity score</span>"
]
},
{
"objectID": "psipw.html#propensity-score-analysis-1",
"href": "psipw.html#propensity-score-analysis-1",
"title": "2 Propensity score",
"section": "",
"text": "Weighting: Assign weights to individuals based on their propensity scores to create a pseudo-population where treatment groups are balanced.\nMatching: Match individuals in the treatment group with individuals in the control group based on their propensity scores.\nStratification: Divide the sample into strata based on the propensity score and compare outcomes within each stratum.\nCovariate Adjustment: Include the propensity score as a covariate in a outcome model to adjust for confounding.",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Propensity score</span>"
]
},
{
"objectID": "psipw.html#propensity-score-weighting",
"href": "psipw.html#propensity-score-weighting",
"title": "2 Propensity score",
"section": "2.2 Propensity Score Weighting",
"text": "2.2 Propensity Score Weighting\nFor this demonstration, we will focus on the Weighting approach. The other approaches are not covered in this demonstration, but they can be implemented using similar steps as shown below.\nThere are four steps in propensity score weighting:\n\nData preparation: Prepare the data by creating the treatment/exposure, outcome, and covariates.\nSpecifying PS & fit model: Specify the propensity score model with investigator-specified measured covariates and fit the model\nWeighting: Convert PS to inverse probability weights (IPW).\nCovariate balance: Check the balance of covariates between treatment groups after weighting.\nEstimating treatment effect: Fit the outcome model on the pseudo population.\n\n\n2.2.1 Step 0: Data preparation\n\n2.2.1.1 Creating Analytic data\n3 cycles of NHANES datasets were - downloaded from the US CDC website - recoded for consistency, and - merged together to make an analytic data.\nDetails of data download process, and recoding and merging are discussed in Appendix.\n\n\n\n\n\nflowchart LR\n A[NHANES] --> C1(2013-2014 cycle) --> ss1(10,175 \\nparticipants)\n A --> C2(2015-2016 cycle) --> ss2(9,971 \\nparticipants)\n A --> C3(2017-2018 cycle) --> ss3(9,254 \\nparticipants)\n ss1 --> ss(7,585 \\nafter \\nimposing \\neligibility \\ncriteria)\n ss2 --> ss\n ss3 --> ss\n style A fill:#FFA500;\n style C1 fill:#FFA500;\n style C2 fill:#FFA500;\n style C3 fill:#FFA500;\n style ss1 fill:#FFA500;\n style ss2 fill:#FFA500;\n style ss3 fill:#FFA500;\n style ss fill:#FFA500;\n\n\n\n\n\n\n\n\nOur study population was restricted to the U.S. 
population who were\n\n20 years or older and\nnot pregnant at the time of survey data collection, and\nwho had available International Classification of Diseases (ICD) codes to ensure we can extract sufficient proxy information for the analysis (discussed in step 1).\n\nTo simplify the analysis, we only considered complete case data.\n\n# Table 1\nlibrary(tableone)\ntab1 <- CreateTableOne(vars = investigator.specified.covariates, \n strata = \"exposure\",\n data = hdps.data, \n test = FALSE)\nprint(tab1, showAllLevels = TRUE, noSpaces = TRUE, quote = FALSE, smd = TRUE)\n#> Stratified by exposure\n#> level 0 \n#> n 4186 \n#> age.cat (%) 20-49 1246 (29.8) \n#> 50-64 1274 (30.4) \n#> 65+ 1666 (39.8) \n#> sex (%) Male 1958 (46.8) \n#> Female 2228 (53.2) \n#> education (%) Less than high school 819 (19.6) \n#> High school 2147 (51.3) \n#> College graduate or above 1220 (29.1) \n#> race (%) White 1946 (46.5) \n#> Black 699 (16.7) \n#> Hispanic 791 (18.9) \n#> Others 750 (17.9) \n#> marital (%) Never married 555 (13.3) \n#> Married/with partner 2525 (60.3) \n#> Other 1106 (26.4) \n#> income (%) less than $20,000 846 (20.2) \n#> $20,000 to $74,999 2019 (48.2) \n#> $75,000 and Over 1321 (31.6) \n#> born (%) Born in US 2989 (71.4) \n#> Other place 1197 (28.6) \n#> year (mean (SD)) 8.95 (0.82) \n#> diabetes.family.history (%) No 3515 (84.0) \n#> Yes 671 (16.0) \n#> medical.access (%) No 299 (7.1) \n#> Yes 3887 (92.9) \n#> smoking (%) Never smoker 2249 (53.7) \n#> Previous smoker 1143 (27.3) \n#> Current smoker 794 (19.0) \n#> diet.healthy (%) Poor or fair 970 (23.2) \n#> Good 1723 (41.2) \n#> Very good or excellent 1493 (35.7) \n#> physical.activity (%) No 3463 (82.7) \n#> Yes 723 (17.3) \n#> sleep (mean (SD)) 7.50 (1.57) \n#> uric.acid (mean (SD)) 5.26 (1.40) \n#> protein.total (mean (SD)) 7.08 (0.47) \n#> bilirubin.total (mean (SD)) 0.56 (0.28) \n#> phosphorus (mean (SD)) 3.72 (0.56) \n#> sodium (mean (SD)) 139.62 (2.68) \n#> potassium (mean (SD)) 4.03 (0.39) \n#> globulin 
(mean (SD)) 2.86 (0.47) \n#> calcium.total (mean (SD)) 9.41 (0.38) \n#> systolicBP (mean (SD)) 127.28 (20.07)\n#> diastolicBP (mean (SD)) 69.82 (11.69) \n#> high.cholesterol (%) No 2176 (52.0) \n#> Yes 2010 (48.0) \n#> Stratified by exposure\n#> 1 SMD \n#> n 3399 \n#> age.cat (%) 1126 (33.1) 0.157\n#> 1176 (34.6) \n#> 1097 (32.3) \n#> sex (%) 1373 (40.4) 0.129\n#> 2026 (59.6) \n#> education (%) 686 (20.2) 0.200\n#> 2009 (59.1) \n#> 704 (20.7) \n#> race (%) 1496 (44.0) 0.368\n#> 846 (24.9) \n#> 804 (23.7) \n#> 253 (7.4) \n#> marital (%) 433 (12.7) 0.041\n#> 2006 (59.0) \n#> 960 (28.2) \n#> income (%) 742 (21.8) 0.130\n#> 1783 (52.5) \n#> 874 (25.7) \n#> born (%) 2776 (81.7) 0.244\n#> 623 (18.3) \n#> year (mean (SD)) 8.99 (0.81) 0.059\n#> diabetes.family.history (%) 2626 (77.3) 0.170\n#> 773 (22.7) \n#> medical.access (%) 157 (4.6) 0.107\n#> 3242 (95.4) \n#> smoking (%) 1805 (53.1) 0.129\n#> 1085 (31.9) \n#> 509 (15.0) \n#> diet.healthy (%) 1331 (39.2) 0.420\n#> 1384 (40.7) \n#> 684 (20.1) \n#> physical.activity (%) 2747 (80.8) 0.049\n#> 652 (19.2) \n#> sleep (mean (SD)) 7.37 (1.68) 0.084\n#> uric.acid (mean (SD)) 5.83 (1.54) 0.387\n#> protein.total (mean (SD)) 7.08 (0.45) 0.019\n#> bilirubin.total (mean (SD)) 0.51 (0.30) 0.163\n#> phosphorus (mean (SD)) 3.66 (0.57) 0.107\n#> sodium (mean (SD)) 139.49 (2.66) 0.048\n#> potassium (mean (SD)) 4.04 (0.39) 0.009\n#> globulin (mean (SD)) 2.99 (0.46) 0.287\n#> calcium.total (mean (SD)) 9.34 (0.38) 0.168\n#> systolicBP (mean (SD)) 129.28 (17.69) 0.106\n#> diastolicBP (mean (SD)) 71.68 (11.98) 0.157\n#> high.cholesterol (%) 1624 (47.8) 0.084\n#> 1775 (52.2)\n\n\n\n\n2.2.2 Step 1: Specifying PS & fit model\nWe build the propensity score model in this data using the investigator-specified covariates.\n\n\n\n\n\n\n\n\n\n\n\nC = investigator-specified covariates.\n\nIf you are somewhat unfamiliar with propensity score paradigm, look at tutorials dedicated towards that topic. 
There are additional tutorials also talking about propensity score weighting.\n\n\n2.2.2.1 PS model specification\nNow let us create the propensity score formula with the investigator-specified covariates:\n\ncovform <- paste0(investigator.specified.covariates, collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", covform))\nps.formula\n#> exposure ~ age.cat + sex + education + race + marital + income + \n#> born + year + diabetes.family.history + medical.access + \n#> smoking + diet.healthy + physical.activity + sleep + uric.acid + \n#> protein.total + bilirubin.total + phosphorus + sodium + potassium + \n#> globulin + calcium.total + systolicBP + diastolicBP + high.cholesterol\n\n\n\n\nOnly use investigator specified covariates to build the formula.\nDuring the construction of the propensity score model, researchers should consider incorporating additional model specifications, such as interactions and polynomials, if they are deemed necessary.\n\n\n\n2.2.2.2 Fit the PS model\n\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n data = hdps.data, \n estimand = \"ATE\",\n method = \"ps\")\n\n\n\n\nUse that formula to estimate propensity scores.\nIn this demonstration, we did not use stabilize = TRUE. However, stabilized propensity score weights often reduce the variance of treatment effect estimates.\n\n\n\n2.2.2.3 Obtain PS\n\nhdps.data$ps <- W.out$ps\nggplot(hdps.data, aes(x = ps, fill = factor(exposure))) +\n geom_density(alpha = 0.5) +\n scale_fill_manual(values = c(\"darkblue\", \"darkred\")) +\n theme_classic()\n\n\n\n\n\n\n\n\n\n\nCheck propensity score overlap in both exposure groups.\n\n\n\n2.2.3 Step 2: Weighting\nAs mentioned, we only talk about inverse probability weighting in our current context.\n\nhdps.data$w <- W.out$weights\nsummary(hdps.data$w)\n#> Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n#> 1.006 1.325 1.617 2.006 2.187 31.825\n\n\nggplot(hdps.data, aes(x = \"\", y = w)) +\n geom_boxplot(fill = \"lightblue\", \n color = \"blue\", \n size = 1) +\n geom_text(aes(x = 1, y = max(w), \n label = paste0(\"Max = \", round(max(w), 2))), \n vjust = 1.5, \n hjust = -0.3, \n size = 4, \n color = \"red\") +\n geom_text(aes(x = 1, y = min(w), \n label = paste0(\"Min = \", round(min(w), 2))), \n vjust = -2.5, \n hjust = -0.3, \n size = 4, \n color = \"red\") +\n ggtitle(\"Boxplot of Inverse Probability Weights\") +\n xlab(\"\") +\n ylab(\"Weights\") +\n theme_classic()\n\n\n\n\n\n\n\n\n\n\n\nCheck the summary statistics of the weights to assess whether there are extreme weights. Less extreme weights now?\n\n\n\n2.2.4 Step 3: Covariate balance\n\nrequire(cobalt)\nlove.plot(x = W.out,\n thresholds = c(m = .1), \n var.order = \"unadjusted\",\n stars = \"raw\")\n#> Warning: No shared levels found between `names(values)` of the manual scale and the\n#> data's fill values.\n\n\n\n\n\n\n\n\n\n\n\nAssess balance against SMD 0.1. Still balanced?\nPredictive measures such as c-statistics are not helpful in this context (Westreich et al. 2011): “use of the c-statistic as a guide in constructing propensity scores may result in less overlap in propensity scores between treated and untreated subjects”!\n\n\n\n2.2.5 Step 4: Estimating treatment effect\n\n2.2.5.1 Set outcome formula\n\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nout.formula\n#> outcome ~ exposure\n\n\n\nWe are again using a crude weighted outcome model here.\n\n\n2.2.5.2 Obtain OR\n\nfit <- glm(out.formula,\n data = hdps.data,\n weights = W.out$weights,\n family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n c(\"Estimate\", \n \"Std. 
Error\", \n \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\n\nfit.summary_with_ci.ps <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci.ps,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.68\n0.07\n0\n0.61\n0.76\n\n\n\n\n\n\n\n2.2.5.3 Obtain RD\n\nfit <- glm(out.formula,\n data = hdps.data,\n weights = W.out$weights,\n family= gaussian(link = \"identity\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n c(\"Estimate\", \n \"Std. Error\", \n \"Pr(>|t|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci.ps.rd <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci.ps.rd,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|t|)\n2.5 %\n97.5 %\n\n\n\n\n0.13\n0.01\n0\n0.11\n0.14\n\n\n\n\n\n\n\n\n\nWestreich, Daniel, Stephen R Cole, Michele Jonsson Funk, M Alan Brookhart, and Til Stürmer. 2011. “The Role of the c-Statistic in Variable Selection for Propensity Score Models.” Pharmacoepidemiology and Drug Safety 20 (3): 317–20.",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Propensity score</span>"
]
},
{
"objectID": "proxy.html",
"href": "proxy.html",
"title": "3 Reducing residual confounding",
"section": "",
"text": "3.1 Measuring comorbidity burden\nIn health research, the overall health status/ Disease burden could be a potential confounding factor. In the original DAG, we had comorbidity as a known confounder.",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Reducing residual confounding</span>"
]
},
{
"objectID": "proxy.html#measuring-comorbidity-burden",
"href": "proxy.html#measuring-comorbidity-burden",
"title": "3 Reducing residual confounding",
"section": "",
"text": "flowchart TB\n A[Obesity] --> Y(Diabete)\n L[Comorbidity measure unobserved] --> Y\n L --> A\n style A fill:#90EE90;\n style Y fill:#ADD8E6;\n style L fill:#FF0000;\n\n\n\n\n\n\n\n\nCharlson Comorbidity Index (CCI) is a measure that quantifies the burden of comorbidities or pre-existing medical conditions in patients (takes into account 17 comorbidities), which can impact their health outcomes and overall survival.\nElixhauser Comorbidity Index (ECI) is a measure of the burden of comorbidities, based on 30 different comorbid conditions.\nChronic Disease Score (CDS) is a weighted score of the number and severity of chronic diseases, calculated using self-reported data on diagnosed conditions (considers the presence of 21 chronic conditions).\n\n\n\n\n(Charlson et al. 1987; Elixhauser et al. 1998; Von Korff, Wagner, and Saunders 1992)\nNHANES does not include information on all of the comorbidities included in theses scores / indices.\n\n\n\n\n\n\n\nResidual confounding\n\n\n\nComorbidity scores are widely used as a measure of comorbidity burden, and their calculation often relies on data that may not be available in certain contexts, such as in NHANES or Canadian health administrative databases. In such cases, when comorbidity burden is a known confounder, researchers may use proxy information to approximate and mimic the information. Not being able to adjust for such variable can introduce bias and residual confounding in the treatment effect estimation.\n\n\n\n\n\n(Schneeweiss and Maclure 2000; L. Lix et al. 2011; L. M. Lix et al. 2013)",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Reducing residual confounding</span>"
]
},
{
"objectID": "proxy.html#proxy-adjustment-empirical-criterion",
"href": "proxy.html#proxy-adjustment-empirical-criterion",
"title": "3 Reducing residual confounding",
"section": "3.2 Proxy Adjustment Empirical criterion",
"text": "3.2 Proxy Adjustment Empirical criterion\nEmpirical criterion: Modified disjunctive cause criterion\nVanderWeele et al. 2019 European Journal of Epidemiology: CC BY license\n\n\n\n\n\nHypothesized Directed acyclic graph with comorbidity measure being unmeasured, and approximated by the simple count measures based on the ICD codes\n\n\n\n\n\n\nAdjust for variables that are (a) causes of exposure or outcome or both, (b) discard: known instrument, (c) including good proxies for unmeasured common causes (VanderWeele 2019)",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Reducing residual confounding</span>"
]
},
{
"objectID": "proxy.html#additional-information-icd-10-cm",
"href": "proxy.html#additional-information-icd-10-cm",
"title": "3 Reducing residual confounding",
"section": "3.3 Additional information: ICD-10-CM",
"text": "3.3 Additional information: ICD-10-CM\n\n\nThe International Classification of Diseases 10th Revision (ICD-10) is a standardized system of codes for the classification of diseases, disorders, and injuries.\n\n\n\nRole\nData Source\nVariables considered\n\n\n\n\nRole unclear as they may not directly relate to the research question\nRXQ_RX\nPrescription medication ICD-10-CM code\n\n\n\n\n\nRXQ_RX questionnaire (a) collects information on prescription medications taken in the past 30 days, (b) conducted by trained interviewers, and (c) with some quality control efforts.\n\n\n\n\n\nExamples of ICD-10-CM codes (3-7 characters, 1st character being alpha, 2-end are numberic, often with a dot) assigned to reasons for using medication (see Appendix in NHANES RXQ_RX component)\n\n\n\n\n\n\nWe have a lot of information through these ICD-10-CM codes, but for most of these information, it is unclear what role they play within the context of our research questions.\nCount of prescriptions is often used to measure comorbidity burden. This is not a perfect measure. But could serve as a proxy for our purpose.\n\n\n\n\nPrescription medication (ICD-10-CM codes from all 3 cycles) data was liked with the initial data.\n\n\nCharlson, Mary E, Peter Pompei, Kathy L Ales, and C Ronald MacKenzie. 1987. “A New Method of Classifying Prognostic Comorbidity in Longitudinal Studies: Development and Validation.” Journal of Chronic Diseases 40 (5): 373–83.\n\n\nElixhauser, Anne, Claudia Steiner, D Robert Harris, and Rosanna M Coffey. 1998. “Comorbidity Measures for Use with Administrative Data.” Medical Care, 8–27.\n\n\nLix, Lisa M, Jacqueline Quail, Opeyemi Fadahunsi, and Gary F Teare. 2013. “Predictive Performance of Comorbidity Measures in Administrative Databases for Diabetes Cohorts.” BMC Health Services Research 13: 1–12.\n\n\nLix, LM, J Quail, G Teare, and B Acan. 2011. 
“Performance of Comorbidity Measures for Predicting Outcomes in Population-Based Osteoporosis Cohorts.” Osteoporosis International 22: 2633–43.\n\n\nSchneeweiss, Sebastian, and Malcolm Maclure. 2000. “Use of Comorbidity Scores for Control of Confounding in Studies Using Administrative Databases.” International Journal of Epidemiology 29 (5): 891–98.\n\n\nVanderWeele, Tyler J. 2019. “Principles of Confounder Selection.” European Journal of Epidemiology 34: 211–19.\n\n\nVon Korff, Michael, Edward H Wagner, and Kathleen Saunders. 1992. “A Chronic Disease Score from Automated Pharmacy Data.” Journal of Clinical Epidemiology 45 (2): 197–203.",
"crumbs": [
"Motivating example",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Reducing residual confounding</span>"
]
},
{
"objectID": "hdps.html",
"href": "hdps.html",
"title": "High-dimensional Propensity score",
"section": "",
"text": "Origin",
"crumbs": [
"High-dimensional Propensity score"
]
},
{
"objectID": "hdps.html#origin",
"href": "hdps.html#origin",
"title": "High-dimensional Propensity score",
"section": "",
"text": "(Schneeweiss et al. 2009)",
"crumbs": [
"High-dimensional Propensity score"
]
},
{
"objectID": "hdps.html#key-idea",
"href": "hdps.html#key-idea",
"title": "High-dimensional Propensity score",
"section": "Key idea",
"text": "Key idea\nSchneeweiss et al. 2009 extended to a variety of classifications to code diagnoses (ICD), procedure (CPT), medications (eg, NDC, AHFS, ATCC), or others (PCP, LOINC).\n\n\n\n\n\n\n\n\n\n\n\nCPT-4 (Current Procedural Terminology, 4th edition), ICD-9 (International Classification of Diseases, 9th edition), PCP visits (Primary Care Physician visits), NDC (National Drug Code), and ATC (Anatomical Therapeutic Chemical classification) are all codes or measures commonly used in healthcare and medical research.\nSchneeweiss et al. 2018 Clinical Epidemiology: CC BY license\n\n\n(Schneeweiss 2018)\n\n\n\n\n\n\nAdjust useful proxies\n\n\n\nIn administrative data sources, the main idea of hdPS (high-dimensional propensity score) is to adjust for proxies that are empirically associated with the outcome of interest, which may not be directly measured in the data.\n\n\n\n\nWith hdPS, users do not need to know which unmeasured confounders are being adjusted for by proxy information.\n\nAdjusting for something that may not be interpretable directly with the context of the research question.\nLogic: measures from same subject should be correlated = has relevant proxy information\n\n\n\n\n\nSchneeweiss, Sebastian. 2018. “Automated Data-Adaptive Analytics for Electronic Healthcare Data to Study Causal Treatment Effects.” Clinical Epidemiology, 771–88.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.",
"crumbs": [
"High-dimensional Propensity score"
]
},
{
"objectID": "step1.html",
"href": "step1.html",
"title": "4 Step 1: Proxy sources",
"section": "",
"text": "4.1 Data with investigator-specified variables\nanalytic <- data.complete\nidx <- analytic$id\noutcome <- as.numeric(analytic$diabetes == \"Yes\") \nexposure <- as.numeric(analytic$obese == \"Yes\")\ndomain <- \"dx\"\nanalytic.dfx <- as.data.frame(cbind(idx, exposure, outcome, domain))",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Step 1: Proxy sources</span>"
]
},
{
"objectID": "step1.html#data-with-investigator-specified-variables",
"href": "step1.html#data-with-investigator-specified-variables",
"title": "4 Step 1: Proxy sources",
"section": "",
"text": "Data: part 1\n\n\n\nWe will work with the data.complete data for the investigator-specified information.\n\n\n\n\n\nWe prepare the minimal analytic data only with the following 4 information:\n\nidentifying information (idx)\nexposure (obese)\noutcome (diabetes)\ndomain of the codes (dx). In this example we only have prescription domain (1 domain dx)",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Step 1: Proxy sources</span>"
]
},
{
"objectID": "step1.html#proxy-data",
"href": "step1.html#proxy-data",
"title": "4 Step 1: Proxy sources",
"section": "4.2 Proxy data",
"text": "4.2 Proxy data\n\n4.2.1 Identify the data dimensions (proxy sources)\nIn this example we only have prescription domain (1 domain dx of ICD-10-CM code). Hence \\(p = 1\\) in this exercise.\n\n\nNHANES Questionnaire collects information on: (a) dietary supplements, (b) nonprescription antacids, (c) prescription medications, and (d) preventive aspirin use.\n\n\n4.2.2 Define a covariate assessment period (CAP)\n\n\n\n\n\n\n\n\n\n\n\n(Connolly et al. 2019; Schneeweiss et al. 2009)\nWe only collect proxy information from a well-defined CAP. In our case, it was \\(30\\) days.\n\n\nNHANES asked “In the past 30 days, have you used or taken medication for which a prescription is needed? Do not include prescription vitamins or minerals you may have already told me about.”\n\n\n\n\n\n\nData: part 2\n\n\n\nWe will work with the merge proxy data (ICD-10 codes) from 3 cycles: dat.proxy.long.\n\n\n\n\n4.2.3 Omit duplicated information\n\n\nWe need to delete codes that could be close proxies of exposure and/or outcome, or other investigator specified covariates we have already selected in step0.\n\n\n\n\n\n\n\n\n\n\ndat.proxy.long <- subset(dat.proxy.long, \n icd10 != \"E66\") # Overweight and obesity\ndat.proxy.long <- subset(dat.proxy.long, \n icd10 != \"O24\") # Gestational diabetes mellitus\ndat.proxy.long <- subset(dat.proxy.long, \n icd10 != \"E10\") # Type 1 diabetes mellitus\ndat.proxy.long <- subset(dat.proxy.long, \n icd10 != \"E11\") # Type 2 diabetes mellitus\n\n\n\n\nWe delete codes associated with exposure and outcome.\nSame should be done for any other proxies that may have duplicating information compared to the investigator-specified covariates.\n\n\n\n4.2.4 Long format proxy data\n\n\nHere is an example of 3 digit codes for 1 patient with subject ID “100001”. 
We create the same for all patients.\n\n\n\n\n\nID\nICD 10 codes (3 digit)\nDescription\n\n\n\n\n100001\nF33\nMajor depressive disorder, recurrent\n\n\n100001\nI10\nHypertension\n\n\n100001\nM62\nMuscle spasm\n\n\n100001\nF32\nMajor depressive disorder, single episode\n\n\n100001\nM25\nJoint disorder/pain\n\n\n100001\nK21\nGastro-esophageal reflux disease\n\n\n100001\nM79\nmusculoskeletal pain conditions\n\n\n100001\nR12\nHeartburn",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Step 1: Proxy sources</span>"
]
},
{
"objectID": "step1.html#merge-proxy-data-with-analytic-data",
"href": "step1.html#merge-proxy-data-with-analytic-data",
"title": "4 Step 1: Proxy sources",
"section": "4.3 Merge Proxy data with Analytic data",
"text": "4.3 Merge Proxy data with Analytic data\n\n\n\n\n\n\nMerged Data: parts 1 and 2\n\n\n\n\nWe will work with the merge proxy data with analytic data.\nThat will provide us with the IDs (idx) of the subject that have proxy (ICD-10) information associated with them.\n\n\n\n\nrequire(dplyr) \ndfx <- merge(analytic.dfx, proxy.var.long, by = \"idx\")\nhead(dfx)\n\n\n \n\n\nbasetable <- dfx %>% select(idx, exposure, outcome) %>% distinct()\npatientIds <- basetable$idx\nlength(patientIds)\n#> [1] 3839\n\n\n\n\n\nConnolly, John G, Sebastian Schneeweiss, Robert J Glynn, and Joshua J Gagne. 2019. “Quantifying Bias Reduction with Fixed-Duration Versus All-Available Covariate Assessment Periods.” Pharmacoepidemiology and Drug Safety 28 (5): 665–70.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Step 1: Proxy sources</span>"
]
},
{
"objectID": "step2.html",
"href": "step2.html",
"title": "5 Step 2: Empirical",
"section": "",
"text": "5.1 Sort by prevalence\nBased on the merged dataset, we identify which patients were linked in both databases. Using those IDs, we want to sort the list of candidate empirical covariates.\nCheck out the frequency of each codes:\nlibrary(dplyr)\ndf <- data.frame(\n icd10 = names(sort(table(dfx$icd10), decreasing = TRUE)),\n count = sort(table(dfx$icd10), decreasing = TRUE)\n)\nICD10 Code Frequencies\n\n\nICD10 Code\nCount\n\n\n\n\nI10\n2775\n\n\nE78\n1517\n\n\nF32\n536\n\n\nF41\n524\n\n\nK21\n441\n\n\nM79\n401\n\n\nE03\n397\n\n\nM54\n314\n\n\nG47\n307\n\n\nJ45\n301\nHowever, some may be associated with lower counts (e.g., less than 20).\nIf there were more dimensions, separate list of candidate empirical covariates would be identified.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Step 2: Empirical</span>"
]
},
{
"objectID": "step2.html#sort-by-prevalence",
"href": "step2.html#sort-by-prevalence",
"title": "5 Step 2: Empirical",
"section": "",
"text": "Only top 10 prevalent codes are shown.\n\n\n\n\n\n\n\nRestrictions\n\n\n\nCandidate empirical covariates list is constrained by\n\ntheir prevalence of codes. Only top n covariates with highest prevalence would be chosen.\nanalysts absolutely need to get rid of the codes that have zero variance (e.g., everyone has the code, or nobody has it).\ncodes associated with very low prevalence are also numerically problematic for further analyses.\n\n\n\n\n\nWe choose n = 200 [for (1)] as it was proposed in the original algorithm (Schneeweiss et al. 2009). In reality, this is not necessary to be so restrictive (Schuster, Pang, and Platt 2015). Parts (2) and (3) are more likely and addressed by the following restriction: At least min_num_patients number of patients need to have that code to be selected in the list.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Step 2: Empirical</span>"
]
},
{
"objectID": "step2.html#choose-granularity",
"href": "step2.html#choose-granularity",
"title": "5 Step 2: Empirical",
"section": "5.2 Choose Granularity",
"text": "5.2 Choose Granularity\nOne important point here is that we have chosen granularity to be 3 digits in the ICD-10 code.\n\n\nWe have already truncated the codes at 3 digit level while preparing the data.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Step 2: Empirical</span>"
]
},
{
"objectID": "step2.html#retain-top-n-empirical-covariates",
"href": "step2.html#retain-top-n-empirical-covariates",
"title": "5 Step 2: Empirical",
"section": "5.3 Retain top n empirical covariates",
"text": "5.3 Retain top n empirical covariates\n\nrequire(autoCovariateSelection)\nstep1 <- get_candidate_covariates(df = dfx, \n domainVarname = \"domain\",\n eventCodeVarname = \"icd10\", \n patientIdVarname = \"idx\",\n patientIdVector = patientIds,\n n = 200, \n min_num_patients = 20)\n\n\n\nYou can use autoCovariateSelection package to implement these restrictions (Robert 2020).\n\n5.3.1 Long format data\n\nout1 <- step1$covars_data\nhead(out1)\n\n\n \n\n\n\n\n\n5.3.2 Updated frequency data\n\ndf2 <- data.frame(\n icd10 = names(table(out1$icd10)),\n count = as.numeric(table(out1$icd10))\n)\n\n\n\n\n\n\nICD10 Code\nCount\n\n\n\n\ndx_A49\n28\n\n\ndx_B00\n20\n\n\ndx_B35\n22\n\n\ndx_C50\n31\n\n\ndx_D75\n136\n\n\ndx_E03\n397\n\n\n\n\n\n\n\nOnly first few code frequencies are shown (alphabetic order), that were selected based on the restrictions n = 200 and min_num_patients = 20.\n\n\n\n\n\n\nICD10 Code\nCount\n\n\n\n\n77\ndx_R52\n40\n\n\n78\ndx_R60\n187\n\n\n79\ndx_R73\n202\n\n\n80\ndx_T14\n82\n\n\n81\ndx_T78\n96\n\n\n82\ndx_Z79\n277\n\n\n\n\n\n\n\nOnly last few code frequencies are shown (alphabetic order).\n\n\n5.3.3 Total number of codes retained\n\nnrow(df2)\n#> [1] 82\n\n\n\n\n\nRobert, Dennis. 2020. autoCovariateSelection: Automatic Covariate Selection. https://CRAN.R-project.org/package=autoCovariateSelection.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.\n\n\nSchuster, Tibor, Menglan Pang, and Robert W Platt. 2015. “On the Role of Marginal Confounder Prevalence–Implications for the High-Dimensional Propensity Score Algorithm.” Pharmacoepidemiology and Drug Safety 24 (9): 1004–7.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Step 2: Empirical</span>"
]
},
{
"objectID": "step3.html",
"href": "step3.html",
"title": "6 Step 3: Recurrence",
"section": "",
"text": "6.1 Genrate recurrence covariates\nIn this step, we generate 3 binary recurrence covariates for each of the candidate empirical covariates identified in the previous step:\nstep2 <- get_recurrence_covariates(df = out1, \n patientIdVarname = \"idx\",\n eventCodeVarname = \"icd10\", \n patientIdVector = patientIds)",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Step 3: Recurrence</span>"
]
},
{
"objectID": "step3.html#genrate-recurrence-covariates",
"href": "step3.html#genrate-recurrence-covariates",
"title": "6 Step 3: Recurrence",
"section": "",
"text": "(Schneeweiss et al. 2009)\n\n\noccurred at least once\noccurred sporadically (at least more than the median)\noccurred frequently (at least more than the 75th percentile)",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Step 3: Recurrence</span>"
]
},
{
"objectID": "step3.html#example-of-recurrence-covariates",
"href": "step3.html#example-of-recurrence-covariates",
"title": "6 Step 3: Recurrence",
"section": "6.2 Example of recurrence covariates",
"text": "6.2 Example of recurrence covariates\n\n\n\n\n\n\n\n\n\n\nICD-10-CM code (dimension 1)\ncode appeared at least once\ncode appeared at least more than the median\ncode appeared at least more than the 75th percentile\n\n\n\n\nD64.9 Anemia\nrec_dx_D64_once\nrec_dx_D64_sporadic\nrec_dx_D64_frequent\n\n\nD75.9P Blood clots\nrec_dx_D75_once\nrec_dx_D75_sporadic\nrec_dx_D75_frequent\n\n\nD89.9 Immune disorder\nrec_dx_D89_once\nrec_dx_D89_sporadic\nrec_dx_D89_frequent\n\n\n\\(\\ldots\\)\n\\(\\ldots\\)\n\\(\\ldots\\)\n\\(\\ldots\\)\n\n\nE07.9 Disorder of thyroid\nrec_dx_E07_once\nrec_dx_E07_sporadic\nrec_dx_E07_frequent\n\n\n\n\n\nExample of 3 binary covariates (hypothetical) created based on the candidate empirical covariates.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Step 3: Recurrence</span>"
]
},
{
"objectID": "step3.html#recurrence-covariates-in-the-data",
"href": "step3.html#recurrence-covariates-in-the-data",
"title": "6 Step 3: Recurrence",
"section": "6.3 Recurrence covariates in the data",
"text": "6.3 Recurrence covariates in the data\n\nout2 <- step2$recurrence_data\nncol(out2)\n#> [1] 92\n\n\n\n\n \n\n\n\n\n\nHere we show binary recurrence covariates for only 2 columns",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Step 3: Recurrence</span>"
]
},
{
"objectID": "step3.html#refined-recurrence-covariates",
"href": "step3.html#refined-recurrence-covariates",
"title": "6 Step 3: Recurrence",
"section": "6.4 Refined recurrence covariates",
"text": "6.4 Refined recurrence covariates\nBelow you can click to see a list of all recurrence covariates obtained in our data.\n\nShow/Hide Table\n\n\n\nICD-10 Recurrence Data\n\n\n1\nrec_dx_A49_once\nrec_dx_B00_once\nrec_dx_B35_once\n\n\n2\nrec_dx_C50_once\nrec_dx_D75_once\nrec_dx_E03_once\n\n\n3\nrec_dx_E04_once\nrec_dx_E07_once\nrec_dx_E78_once\n\n\n4\nrec_dx_E87_once\nrec_dx_F31_once\nrec_dx_F31_frequent\n\n\n5\nrec_dx_F32_once\nrec_dx_F39_once\nrec_dx_F41_once\n\n\n6\nrec_dx_F43_once\nrec_dx_F90_once\nrec_dx_G25_once\n\n\n7\nrec_dx_G40_once\nrec_dx_G40_frequent\nrec_dx_G43_once\n\n\n8\nrec_dx_G47_once\nrec_dx_H04_once\nrec_dx_H40_once\n\n\n9\nrec_dx_H40_frequent\nrec_dx_I10_once\nrec_dx_I10_frequent\n\n\n10\nrec_dx_I20_once\nrec_dx_I21_once\nrec_dx_I48_once\n\n\n11\nrec_dx_I48_frequent\nrec_dx_I49_once\nrec_dx_I50_once\n\n\n12\nrec_dx_I50_frequent\nrec_dx_I51_once\nrec_dx_I63_once\n\n\n13\nrec_dx_J30_once\nrec_dx_J42_once\nrec_dx_J44_once\n\n\n14\nrec_dx_J44_frequent\nrec_dx_J45_once\nrec_dx_J45_frequent\n\n\n15\nrec_dx_K04_once\nrec_dx_K08_once\nrec_dx_K21_once\n\n\n16\nrec_dx_K25_once\nrec_dx_K27_once\nrec_dx_K30_once\n\n\n17\nrec_dx_K59_once\nrec_dx_K92_once\nrec_dx_L40_once\n\n\n18\nrec_dx_L70_once\nrec_dx_M06_once\nrec_dx_M06_frequent\n\n\n19\nrec_dx_M10_once\nrec_dx_M13_once\nrec_dx_M19_once\n\n\n20\nrec_dx_M1A_once\nrec_dx_M25_once\nrec_dx_M54_once\n\n\n21\nrec_dx_M62_once\nrec_dx_M79_once\nrec_dx_M81_once\n\n\n22\nrec_dx_N28_once\nrec_dx_N32_once\nrec_dx_N39_once\n\n\n23\nrec_dx_N40_once\nrec_dx_N92_once\nrec_dx_N94_once\n\n\n24\nrec_dx_N95_once\nrec_dx_R00_once\nrec_dx_R05_once\n\n\n25\nrec_dx_R06_once\nrec_dx_R07_once\nrec_dx_R09_once\n\n\n26\nrec_dx_R10_once\nrec_dx_R11_once\nrec_dx_R12_once\n\n\n27\nrec_dx_R25_once\nrec_dx_R32_once\nrec_dx_R35_once\n\n\n28\nrec_dx_R39_once\nrec_dx_R41_once\nrec_dx_R42_once\n\n\n29\nrec_dx_R51_once\nrec_dx_R52_once\nrec_dx_R60_once\n\n\n30\nrec_dx_R73_once\nrec_dx_T14_once\nrec_dx_T78_once\n\n\n31\nrec_dx_Z79_once\
n\n\n\n\n\n\n\n\n\n\n\n\n\nTip\n\n\n\n\nGiven that we had one dimension of proxy data, \\(p=1\\), at most \\(n=200\\) most prevalent codes (with the restriction that minimum number of patients in each code = 20), and \\(3\\) intensity, we could theoretically have at most \\(p \\times n \\times 3 = 1 \\times 200 \\times \\ 3 = 600\\) recurrence covariates.\n\n\n\n\nBased on all of the restrictions, we created 143 distinct recurrence covariates.\nThe merged data (analytic and proxies) size is now 7,585.\n\n\n\n\n\n\nIf 2 or all 3 recurrence covariates are identical, only one distinct recurrence covariate is returned. This is why you do not see any sporadic recurrence covariate here.\nRecurrence covariate creation is for each patient, and it is possible to have same code occur multiple time because we are working with a 3 digit granularity (possible to have medications from other codes within same ICD-10 3 digit granularity).\n\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Step 3: Recurrence</span>"
]
},
{
"objectID": "step4.html",
"href": "step4.html",
"title": "7 Step 4: Prioritize",
"section": "",
"text": "7.1 Bross formula\nWe need to make an educated guess about 3 components (i.e., make an assumption), that are used in the calculation of bias contributed by not adjusting for a covariate based on Bross (1966) formula:\nThe above components can help us calculate \\(bias\\) amount (known as ‘Bias Multiplier’) using the Bross formula when we omit adjusting for \\(U\\):\n\\[\\text{Bias}_U = \\frac{P_{UA_1} (RR_{UY} - 1) + 1}{P_{UA_0} (RR_{UY} - 1) + 1}\\]",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Step 4: Prioritize</span>"
]
},
{
"objectID": "step4.html#bross-formula",
"href": "step4.html#bross-formula",
"title": "7 Step 4: Prioritize",
"section": "",
"text": "Bross formula (Bross 1966; Schneeweiss 2006) for the Bias Multiplier considers both the imbalance in the prevalence of the unmeasured confounder between the exposure groups and the association between the confounder and the outcome to assess the potential bias.\n\nprevalence of a binary unmeasured confounder (\\(U\\)) among exposed (\\(P_{UA_1}\\))\nprevalence of that binary unmeasured confounder among unexposed (\\(P_{UA_0}\\))\nassociation between that binary unmeasured confounder and the outcome (\\(RR_{UY} = \\frac{P_{UY_1}}{P_{UY_1}}\\))\n\n\n\n\n\nThese are the ingredients of the Bross formula. This formula is helpful for understanding the impact of unmeasured confounding of a binary variable. We have to put assumed prevalence and risk ratio associated with an unmeasured confounder.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Step 4: Prioritize</span>"
]
},
{
"objectID": "step4.html#calculating-bias-from-a-recurrence-covariate",
"href": "step4.html#calculating-bias-from-a-recurrence-covariate",
"title": "7 Step 4: Prioritize",
"section": "7.2 Calculating bias from a recurrence covariate",
"text": "7.2 Calculating bias from a recurrence covariate\nFor recurrence covariates (\\(R\\)), we do not need to assume, we just plug-in \\(R\\) instead of \\(U\\) in the following calculations:\n\nprevalence of a binary recurrence variable among exposed (\\(P_{RA_1}\\))\nprevalence of that binary recurrence variable among unexposed (\\(P_{RA_0}\\))\nassociation between that binary recurrence variable and the outcome (\\(RR_{RY} = \\frac{P_{RY_1}}{P_{RY_1}}\\))\n\nThese components can help us empirically calculate \\(bias\\) amount:\n\\[\\text{Bias}_R = \\frac{P_{RA_1} (RR_{RY} - 1) + 1}{P_{RA_0} (RR_{RY} - 1) + 1}\\]\nHere, \\(RR_{RY}\\) is the crude risk ratio between the recurrence covariate and the outcome, \\(Y\\) is the outcome, \\(A\\) is the exposure, and \\(R\\) is a recurrence covariate.\n\n\nFor recurrence covariates, we do not need to assume, we can basically calculate these numbers (\\(log-absolute-bias\\)) for all of the recurrence covariates (Schneeweiss et al. 2009). For each data dimension, we can rank each of the recurrence covariates based on the amount of bias (confounding or imbalance) it could likely adjust.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Step 4: Prioritize</span>"
]
},
{
"objectID": "step4.html#calculating-bias-from-all-recurrence-covariates",
"href": "step4.html#calculating-bias-from-all-recurrence-covariates",
"title": "7 Step 4: Prioritize",
"section": "7.3 Calculating bias from all recurrence covariates",
"text": "7.3 Calculating bias from all recurrence covariates\nIn our example, we simply plug-in each recurrence covariates one-by-one to calculate \\(log-absolute-bias\\):\n\n\n\n\n\nR=rec_dx_D64_once\n\n\nR=rec_dx_D75_sporadic\n\n\n…\n\n\nR=rec_dx_E07_frequent",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Step 4: Prioritize</span>"
]
},
{
"objectID": "step4.html#obtain-log-of-absolute-bias",
"href": "step4.html#obtain-log-of-absolute-bias",
"title": "7 Step 4: Prioritize",
"section": "7.4 Obtain log of absolute-bias",
"text": "7.4 Obtain log of absolute-bias\nWe calculate \\(log-absolute-bias\\) for all recurrence covariates.\n\n\nAbsolute log of the Bias Multiplier, \\(log-absolute-bias\\), is a symmetric measure of the potential bias introduced by the recurrence covariate, making it easier to compare and rank recurrence covariates.\n\nout3 <- get_prioritised_covariates(df = out2,\n patientIdVarname = \"idx\", \n exposureVector = basetable$exposure,\n outcomeVector = basetable$outcome,\n patientIdVector = patientIds, \n k = 100)\nsorted_values <- sort(out3$multiplicative_bias, \n decreasing = TRUE)\n\nThis would return absolute log of the multiplicative bias for each recurrence covariate (by univariate Bross formula). We can use this information to prioritize recurrence covariates in the next step.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Step 4: Prioritize</span>"
]
},
{
"objectID": "step4.html#convert-to-absolute-log-of-multiplicative-bias",
"href": "step4.html#convert-to-absolute-log-of-multiplicative-bias",
"title": "7 Step 4: Prioritize",
"section": "7.5 Convert to Absolute log of multiplicative bias",
"text": "7.5 Convert to Absolute log of multiplicative bias\nHere are the few covariates and associated Absolute log of the multiplicative bias:\n\n\n\n\n\n\n\n\n\n\nrec_dx_I10_once : 0.124\n\n\nrec_dx_R73_once : 0.078\n\n\nrec_dx_I10_frequent : 0.065\n\n\nrec_dx_R60_once : 0.038\n\n\nrec_dx_E78_once : 0.036\n\n\nrec_dx_M79_once : 0.033\n\n\nrec_dx_I51_once : 0.019\n\n\nrec_dx_M10_once : 0.017\n\n\nrec_dx_I50_once : 0.016\n\n\n\n\n\nAnd here are translated table with description:\n\n\n\n\n\n\n\n\n\n\nHypertension : 0.115\n\n\nElevated blood glucose level : 0.088\n\n\nHypertension : 0.068\n\n\nEdema : 0.054\n\n\nPure hypercholesterolemia : 0.038\n\n\nmusculoskeletal pain : 0.017\n\n\nHypokalemia : 0.015\n\n\nHeart disease : 0.013\n\n\nHeart failure : 0.011\n\n\n\n\n\n\n\nSome of the empirical covariates with top Absolute log of the multiplicative bias are actually relevant to the outcome (diabetes): Hypertension, Elevated blood glucose level , etc. (Choi and Shi 2001)\n\n\n\n\n\n\nSMD vs Bias multiplier\n\n\n\nStandardized mean difference (SMD) is useful for assessing the balance in the propensity score literature. However, Bross formula incorporates outcome information. In the investigation of empirical covariates or recurrence covariates where interpretations of these covariates are unknown, it may seem more safe to use the multiplicative bias term from the Bross formula to identify proxy covariates that are helpful in predicting the outcome.\n\n\n\n\n\n\n(Stuart, Lee, and Leacy 2013)\n\n\nBross, Irwin DJ. 1966. “Spurious Effects from an Extraneous Variable.” Journal of Chronic Diseases 19 (6): 637–47.\n\n\nChoi, BCK, and F Shi. 2001. “Risk Factors for Diabetes Mellitus by Age and Sex: Results of the National Population Health Survey.” Diabetologia 44: 1221–31.\n\n\nSchneeweiss, Sebastian. 2006. 
“Sensitivity Analysis and External Adjustment for Unmeasured Confounders in Epidemiologic Database Studies of Therapeutics.” Pharmacoepidemiology and Drug Safety 15 (5): 291–303.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.\n\n\nStuart, Elizabeth A, Brian K Lee, and Finbarr P Leacy. 2013. “Prognostic Score–Based Balance Measures Can Be a Useful Diagnostic for Propensity Score Methods in Comparative Effectiveness Research.” Journal of Clinical Epidemiology 66 (8): S84–90.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Step 4: Prioritize</span>"
]
},
{
"objectID": "step5.html",
"href": "step5.html",
"title": "8 Step 5: Covariates",
"section": "",
"text": "8.1 Ideal number of prioritised covariates\nWe select 2 types of covariates for the next step (to analyze using propensity score or other alternative approaches):\nBased on calculated \\(log-absolute-bias\\), we select top k recurrence covariates to be used in the hdPS analyses later. Below is a plot of all of the absolute log of the Bias Multiplier:\nWe used \\(k = 100\\) covariates selected by the hdPS algorithm (we call them ‘hdPS covariates’). What should be the cutpoint?",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Step 5: Covariates</span>"
]
},
{
"objectID": "step5.html#ideal-number-of-prioritised-covariates",
"href": "step5.html#ideal-number-of-prioritised-covariates",
"title": "8 Step 5: Covariates",
"section": "",
"text": "Absolute log of the Bias Multiplier has a null value of 0. Anything above 0 is an indication of confounding bias adjusted by the adjustment of the associated recurrent covariate.\nFor large proxy data sources, \\(k = 500\\) is suggested (Schneeweiss et al. 2009).\nSee Sensitivity Analysis section for an understanding of how to choose a value based on an ad-hoc process.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Step 5: Covariates</span>"
]
},
{
"objectID": "step5.html#selected-hdps-variables-proxies",
"href": "step5.html#selected-hdps-variables-proxies",
"title": "8 Step 5: Covariates",
"section": "8.2 Selected hdPS variables (proxies)",
"text": "8.2 Selected hdPS variables (proxies)\n\nhdps.dim <- out3$autoselected_covariate_df\ndim(hdps.dim) # id + k\n#> [1] 3839 92\nhead(hdps.dim)[,1:3]\n\n\n \n\n\nhdps.dim$id <- hdps.dim$idx\nhdps.dim$idx <- NULL",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Step 5: Covariates</span>"
]
},
{
"objectID": "step5.html#investigator-specified-covariates",
"href": "step5.html#investigator-specified-covariates",
"title": "8 Step 5: Covariates",
"section": "8.3 Investigator-specified covariates",
"text": "8.3 Investigator-specified covariates\n\\(25\\) investigator-specified covariates are selected based on variables in the DAG that are available in the data set.\n\n\nWe should also add necessary interactions of these investigator-specified covariates, or add other useful model-specifications (e.g., polynomials).\n\n\n\n\n\nHypothesized Directed acyclic graph drawn based on analyst’s best understanding of the literature\n\n\n\n\n\n\n\n14 demographic, behavioral, health history related variables/access\n\nMostly categorical\n\n11 lab variables\n\nMostly continuous\n\n\n\nexposure <- \"obese\"\noutcome <- \"diabetes\" \ninvestigator.specified.covariates <- \n c(# Demographic\n \"age.cat\", \"sex\", \"education\", \"race\", \n \"marital\", \"income\", \"born\", \"year\",\n \n # health history related variables/access\n \"diabetes.family.history\", \"medical.access\",\n \n # behavioral\n \"smoking\", \"diet.healthy\", \"physical.activity\", \"sleep\",\n \n # Laboratory \n \"uric.acid\", \"protein.total\", \"bilirubin.total\", \"phosphorus\",\n \"sodium\", \"potassium\", \"globulin\", \"calcium.total\", \n \"systolicBP\", \"diastolicBP\", \"high.cholesterol\"\n)\nlength(investigator.specified.covariates)\n#> [1] 25",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Step 5: Covariates</span>"
]
},
{
"objectID": "step5.html#merged-data",
"href": "step5.html#merged-data",
"title": "8 Step 5: Covariates",
"section": "8.4 Merged data",
"text": "8.4 Merged data\n\nload(\"data/analytic3cycles.RData\")\nhdps.data <- merge(data.complete[,c(\"id\",\n outcome, \n exposure, \n investigator.specified.covariates)], \n hdps.dim, by = \"id\")\ndim(hdps.data)\n#> [1] 3839 119\n\n\n\n\n\nVariable count (128)\n\n1 ID variable\n1 exposure\n1 outcome\n25 investigator-specified covariates\n100 hdPS variables\n\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Step 5: Covariates</span>"
]
},
{
"objectID": "step6.html",
"href": "step6.html",
"title": "9 Step 6: Propensity",
"section": "",
"text": "9.1 hdPS model\nThen the hdPS can be used as matching, weighting, stratifying variables, or as covariates (usually in deciles) in outcome model.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>9</span> <span class='chapter-title'>Step 6: Propensity</span>"
]
},
{
"objectID": "step6.html#hdps-model",
"href": "step6.html#hdps-model",
"title": "9 Step 6: Propensity",
"section": "",
"text": "C = investigator-specified covariates and EC = hdPS covariates (Schneeweiss et al. 2009)\n\n\n\n(Wyss et al. 2022)\n\n9.1.1 Create propensity score formula\n\nhdps.data$exposure <- as.numeric(I(hdps.data$obese=='Yes'))\nhdps.data$outcome <- as.numeric(I(hdps.data$diabetes=='Yes'))\nproxy.list.sel <- names(out3$autoselected_covariate_df[,-1])\nproxyform <- paste0(proxy.list.sel, collapse = \"+\")\ncovform <- paste0(investigator.specified.covariates, collapse = \"+\")\n\n\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", rhsformula))\nps.formula\n#> exposure ~ age.cat + sex + education + race + marital + income + \n#> born + year + diabetes.family.history + medical.access + \n#> smoking + diet.healthy + physical.activity + sleep + uric.acid + \n#> protein.total + bilirubin.total + phosphorus + sodium + potassium + \n#> globulin + calcium.total + systolicBP + diastolicBP + high.cholesterol + \n#> rec_dx_I10_once + rec_dx_R73_once + rec_dx_I10_frequent + \n#> rec_dx_R60_once + rec_dx_E78_once + rec_dx_M79_once + rec_dx_I51_once + \n#> rec_dx_M10_once + rec_dx_I50_once + rec_dx_K21_once + rec_dx_D75_once + \n#> rec_dx_Z79_once + rec_dx_F41_once + rec_dx_M1A_once + rec_dx_E87_once + \n#> rec_dx_R12_once + rec_dx_R51_once + rec_dx_J45_once + rec_dx_I50_frequent + \n#> rec_dx_L70_once + rec_dx_M25_once + rec_dx_I63_once + rec_dx_R39_once + \n#> rec_dx_N28_once + rec_dx_K25_once + rec_dx_F90_once + rec_dx_B00_once + \n#> rec_dx_J42_once + rec_dx_R41_once + rec_dx_I20_once + rec_dx_M54_once + \n#> rec_dx_J44_once + rec_dx_K08_once + rec_dx_I21_once + rec_dx_F32_once + \n#> rec_dx_J30_once + rec_dx_F43_once + rec_dx_R06_once + rec_dx_I48_once + \n#> rec_dx_R32_once + rec_dx_R42_once + rec_dx_N92_once + rec_dx_N95_once + \n#> rec_dx_M19_once + rec_dx_E07_once + rec_dx_R25_once + rec_dx_G43_once + \n#> rec_dx_R52_once + rec_dx_M81_once + rec_dx_T78_once + rec_dx_G47_once + \n#> rec_dx_R11_once + 
rec_dx_B35_once + rec_dx_M06_once + rec_dx_E04_once + \n#> rec_dx_H40_frequent + rec_dx_T14_once + rec_dx_C50_once + \n#> rec_dx_H40_once + rec_dx_N32_once + rec_dx_A49_once + rec_dx_J45_frequent + \n#> rec_dx_N39_once + rec_dx_R09_once + rec_dx_N94_once + rec_dx_F39_once + \n#> rec_dx_E03_once + rec_dx_K59_once + rec_dx_F31_once + rec_dx_L40_once + \n#> rec_dx_M06_frequent + rec_dx_M13_once + rec_dx_I48_frequent + \n#> rec_dx_G25_once + rec_dx_F31_frequent + rec_dx_M62_once + \n#> rec_dx_K92_once + rec_dx_R10_once + rec_dx_G40_once + rec_dx_G40_frequent + \n#> rec_dx_J44_frequent + rec_dx_K04_once + rec_dx_I49_once + \n#> rec_dx_R00_once + rec_dx_R07_once + rec_dx_H04_once + rec_dx_K27_once + \n#> rec_dx_R05_once + rec_dx_K30_once + rec_dx_N40_once + rec_dx_R35_once\n\n\n\nThis is an overly simplistic scenario where we are adding only the main effects in the non-transformed form.\n\n\n9.1.2 Fit PS model\n\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n data = hdps.data, \n estimand = \"ATE\",\n method = \"ps\")\n\n\n\n9.1.3 Obtain PS\n\nhdps.data$ps <- W.out$ps\n\n\n\n\n\n\n\n\n\n\n\n\nAlways a good idea to check propensity score overlap in both exposure groups\n\n\n9.1.4 Obtain Weights\n\nhdps.data$w <- W.out$weights\n\n\n\n\n\n\n\n\n\n\n\n\nAlways a good idea to check the summary statistics of the weights to assess whether there are extreme weights\n\n\n9.1.5 Assessing balance\n\n\n\n\n\n\n\n\n\n\n\n\n\nAlways a good idea to assess balance. Here we are measuring against SMD 0.1. Use love.plot function from the cobalt package. See more descriptions of balanced diagnostics elsewhere for a propensity score context.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 
20 (4): 512.\n\n\nWyss, Richard, Chen Yanover, Tal El-Hay, Dimitri Bennett, Robert W Platt, Andrew R Zullo, Grammati Sari, et al. 2022. “Machine Learning for Improving High-Dimensional Proxy Confounder Adjustment in Healthcare Database Studies: An Overview of the Current Literature.” Pharmacoepidemiology and Drug Safety 31 (9): 932–43.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>9</span> <span class='chapter-title'>Step 6: Propensity</span>"
]
},
{
"objectID": "step7.html",
"href": "step7.html",
"title": "10 Step 7: Association",
"section": "",
"text": "10.0.1 Set outcome formula\n\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nout.formula\n#> outcome ~ exposure\n\n\n\n10.0.2 Obtain OR from unadjusted model\n\nfit <- glm(out.formula,\n data = hdps.data,\n weights = W.out$weights,\n family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n c(\"Estimate\", \n \"Std. Error\", \n \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.42\n0.08\n0\n0.35\n0.49\n\n\n\n\n\n\n\n\nWe are using a crude outcome model here.\nSomewhat controversial to adjust for all (investigator-specified and all 100 proxies) covariates.\n\n\n\n10.0.3 Obtain RD from unadjusted model\n\nfit <- glm(out.formula,\n data = hdps.data,\n weights = W.out$weights,\n family= gaussian(link = \"identity\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n c(\"Estimate\", \n \"Std. Error\", \n \"Pr(>|t|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary, 2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|t|)\n2.5 %\n97.5 %\n\n\n\n\n0.08\n0.01\n0\n0.06\n0.1\n\n\n\n\n\n\n\n\n\n(Naimi and Whitcomb 2020)\n\n\nNaimi, Ashley I, and Brian W Whitcomb. 2020. “Estimating Risk Ratios and Risk Differences Using Regression.” American Journal of Epidemiology 189 (6): 508–10.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Step 7: Association</span>"
]
},
{
"objectID": "sens.html",
"href": "sens.html",
"title": "11 Sensitivity",
"section": "",
"text": "11.1 Sensitivity analysis for k",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Sensitivity</span>"
]
},
{
"objectID": "sens.html#sensitivity-analysis-for-k",
"href": "sens.html#sensitivity-analysis-for-k",
"title": "11 Sensitivity",
"section": "",
"text": "11.1.1 Create propensity score formula\n\n\n\n\n\n\n\n\n\n\n\nHence we iterate the process (change k parameter in get_prioritised_covariates function in step 4) and obtain odds ratio (exponentiation of log-OR) for each k. We varied k from 10 to 140.\n\n\n\n\n\n\nTip\n\n\n\nOR estimates stabilizes around 1.5, shows variability below k = 50 and above 110",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Sensitivity</span>"
]
},
{
"objectID": "sens.html#sensitivity-analysis-for-n",
"href": "sens.html#sensitivity-analysis-for-n",
"title": "11 Sensitivity",
"section": "11.2 Sensitivity analysis for n",
"text": "11.2 Sensitivity analysis for n\n\n\n\n\n\n\nTip\n\n\n\nWe varied n from 10 to 120, remaining everything else constant (e.g., k = 100).\n\n\n\n\n\n\n\n\n\n\n\n\n\nHence we iterate the process (change n parameter in get_candidate_covariates function step 2) and obtain odds ratio (exponentiation of log-OR) for each n. We varied n from 10 to 120.\n\n\n\n\n\n\nTip\n\n\n\nOR estimates stabilizes around 1.5 for above n = 60.\n\n\n\n\n\n\nLiterature suggested that this restriction of n can be detrimental (Schuster, Pang, and Platt 2015). Hence in the original analysis we chose n such that that is larger than available empirical covariates.\n\n\nSchuster, Tibor, Menglan Pang, and Robert W Platt. 2015. “On the Role of Marginal Confounder Prevalence–Implications for the High-Dimensional Propensity Score Algorithm.” Pharmacoepidemiology and Drug Safety 24 (9): 1004–7.",
"crumbs": [
"High-dimensional Propensity score",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Sensitivity</span>"
]
},
{
"objectID": "extension.html",
"href": "extension.html",
"title": "Challenges",
"section": "",
"text": "Issues with hdPS",
"crumbs": [
"Challenges"
]
},
{
"objectID": "extension.html#issues-with-hdps",
"href": "extension.html#issues-with-hdps",
"title": "Challenges",
"section": "",
"text": "Univariate selection of many proxies\n\n\n\n\nRecurrent covariates selected separately / univariately\ncan be correlated (coming from same patient) and cause multicollinearity\nmay inflate variance\nGeneral overfitting problem. Too many adjustment variables?\n\n\n\n\n\n(Franklin et al. 2015; Schuster, Lowe, and Platt 2016; Karim, Pang, and Platt 2018)",
"crumbs": [
"Challenges"
]
},
{
"objectID": "extension.html#potential-ways-to-improve",
"href": "extension.html#potential-ways-to-improve",
"title": "Challenges",
"section": "Potential ways to improve",
"text": "Potential ways to improve\n\nMultiple recurrent covariates could provide same information, may not be useful anymore in the presence of others. Multivariate structure could be good to consider in a single model.\nMachine learning variable selection methods could be useful to combat multicollinearity.\nSample splitting methods could be useful in combating overfitting in high dimensions.\n\n\n\nCross-validation is embedded within super (ensemble) learning.",
"crumbs": [
"Challenges"
]
},
{
"objectID": "extension.html#controversy",
"href": "extension.html#controversy",
"title": "Challenges",
"section": "Controversy",
"text": "Controversy\nResearchers argue that the PS model, which does not allow for data-driven selection of variables, is a more principled approach to adjusting for confounding in observational studies, without introducing any bias in the analysis .\nOther researchers argue that the hdPS approach can improve the precision of effect estimates by including additional variables that are empirically associated with both the exposure and the outcome, which may reduce residual confounding.\n\n\nMachine learning alternatives have the same criticism as some of them depend on association with the outcome.\n\n\n\n\n\n\nTip\n\n\n\nhdPS can only control for observed confounding, and cannot guarantee the direction or magnitude of residual confounding that may still exist. This is why sensitivity analyses and model diagnostics are important in assessing the robustness of hdPS results.\n\n\n\n\n\n\n(VanderWeele 2019)\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian Schneeweiss. 2015. “Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.\n\n\nSchuster, Tibor, Wilfrid Kouokam Lowe, and Robert W Platt. 2016. “Propensity Score Model Overfitting Led to Inflated Variance of Estimated Odds Ratios.” Journal of Clinical Epidemiology 80: 97–106.\n\n\nVanderWeele, Tyler J. 2019. “Principles of Confounder Selection.” European Journal of Epidemiology 34: 211–19.",
"crumbs": [
"Challenges"
]
},
{
"objectID": "pubmed.html",
"href": "pubmed.html",
"title": "12 Literature",
"section": "",
"text": "12.1 PubMed\nCombination of plasmode, simulation, high-dimensional propensity provides 7 papers (searched in April 23, 2023):",
"crumbs": [
"Challenges",
"<span class='chapter-number'>12</span> <span class='chapter-title'>Literature</span>"
]
},
{
"objectID": "pubmed.html#pubmed",
"href": "pubmed.html#pubmed",
"title": "12 Literature",
"section": "",
"text": "flowchart LR\n A[PubMed] --> p4(Karim et al. 2018 \\nEpidemiology)\n p4 --> ml1\n p4 --> ml0[Hybrid]\n A[PubMed] --> p2(Tian et al. 2018 \\nInt J Epidemiol.)\n p2 --> ml1[Pure LASSO]\n A[PubMed] --> p5(Wyss et al. 2018 \\nEpidemiology)\n p5 --> sl1[vary k,\\nk=25,100:500\\nSuper \\nLearner]\n p5 --> ct1\n A[PubMed] --> p1(Benasseur et al. 2022 \\nPharmacoepidemiol Drug Saf. )\n p1 --> ml2[Low k,\\nk = 10]\n p1 --> ct1[cTMLE]\n A[PubMed] --> p7(Neugebauer et al. 2015 \\nStat Med.)\n p7 --> O2[time-varying \\ninterventions]\n A[PubMed] --> p6(Franklin et al. 2015 \\nAm J Epidemiol.)\n p6 --> ml1\n p6 --> ml0\n A[PubMed] --> p3(Schneeweiss et al. 2018 \\nClin Epidemiol.)\n p3 --> O1[Review]\n style p1 fill:#f44,stroke-width:2px,stroke:#f00,color:#fff;\n style p3 fill:#f44,stroke-width:2px,stroke:#f00,color:#fff;\n style p7 fill:#f44,stroke-width:2px,stroke:#f00,color:#fff;\n style p5 fill:#ffff00,stroke-width:2px,stroke:#ffcc00,color:#000;\n style p2 fill:#9f9,stroke-width:2px,stroke:#090,color:#000;\n style p4 fill:#9f9,stroke-width:2px,stroke:#090,color:#000;\n style p6 fill:#9f9,stroke-width:2px,stroke:#090,color:#000;\n\n\n\n\n\n\n\n\n\n(Benasseur et al. 2022; Tian, Schuemie, and Suchard 2018; Franklin et al. 2015; Neugebauer et al. 2015; Wyss et al. 2018; Karim, Pang, and Platt 2018; Schneeweiss 2018)",
"crumbs": [
"Challenges",
"<span class='chapter-number'>12</span> <span class='chapter-title'>Literature</span>"
]
},
{
"objectID": "pubmed.html#outside-of-pubmed",
"href": "pubmed.html#outside-of-pubmed",
"title": "12 Literature",
"section": "12.2 Outside of PubMed",
"text": "12.2 Outside of PubMed\n\n\n\n\n\n\nflowchart LR\n S[Simulations] --> p0(Pang et al. 2016 \\nInt. J Biostat.)\n p0 --> t1[TMLE, \\nNo \\nsuper \\nlearner]\n D--> p00(Pang et al. 2016 \\nEpidemiology)\n p00 --> t1\n D[Data \\nanalysis] --> p1(Ju et al. 2019 \\nJ App Stat.)\n p1 --> sl1[Super \\nlearner, \\nNo TMLE, \\n bias not \\nused as a\\nperformance \\nmeasure]\n D --> p3(Schneeweiss et al. 2017 \\nEpidemiology)\n p3 --> ml1[LASSO]\n S --> p4(Weberpals et al. 2021 \\nEpidemiology)\n p4 --> ml1[LASSO]\n p4 --> ml2[Autoencoder]\n S --> p5(Ju et al. 2019 \\nStat Meth Med Res.)\n p5 --> t1\n p5 --> t2[cTMLE, \\nmore about \\ntime \\ncomplexity]\n S --> p6(Low et al. 2015 \\nJ Comp Eff Res.)\n p6 --> ml1\n \n style p1 fill:#ffff00,stroke-width:2px,stroke:#ffcc00,color:#000;\n style p4 fill:#9f9,stroke-width:2px,stroke:#090,color:#000;\n style p3 fill:#9f9,stroke-width:2px,stroke:#090,color:#000;\n style p6 fill:#9f9,stroke-width:2px,stroke:#090,color:#000;\n style p5 fill:#44f,stroke-width:2px,stroke:#00f,color:#fff;\n style p0 fill:#44f,stroke-width:2px,stroke:#00f,color:#fff;\n style p00 fill:#44f,stroke-width:2px,stroke:#00f,color:#fff;\n\n\n\n\n\n\n\n\n\n\n\n(Pang, Schuster, Filion, Eberg, et al. 2016; Pang, Schuster, Filion, Schnitzer, et al. 2016; Ju, Gruber, et al. 2019; Ju, Combs, et al. 2019; Schneeweiss et al. 2017; Weberpals et al. 2021; Low, Gallego, and Shah 2016)\n\n\nBenasseur, Imane, Denis Talbot, Madeleine Durand, Anne Holbrook, Alexis Matteau, Brian J Potter, Christel Renoux, Mireille E Schnitzer, Jean-Éric Tarride, and Jason R Guertin. 2022. “A Comparison of Confounder Selection and Adjustment Methods for Estimating Causal Effects Using Large Healthcare Databases.” Pharmacoepidemiology and Drug Safety 31 (4): 424–33.\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian Schneeweiss. 2015. 
“Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.\n\n\nJu, Cheng, Mary Combs, Samuel D Lendle, Jessica M Franklin, Richard Wyss, Sebastian Schneeweiss, and Mark J van der Laan. 2019. “Propensity Score Prediction for Electronic Healthcare Databases Using Super Learner and High-Dimensional Propensity Score Methods.” Journal of Applied Statistics 46 (12): 2216–36.\n\n\nJu, Cheng, Susan Gruber, Samuel D Lendle, Antoine Chambaz, Jessica M Franklin, Richard Wyss, Sebastian Schneeweiss, and Mark J van Der Laan. 2019. “Scalable Collaborative Targeted Learning for High-Dimensional Data.” Statistical Methods in Medical Research 28 (2): 532–54.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.\n\n\nLow, Yen Sia, Blanca Gallego, and Nigam Haresh Shah. 2016. “Comparing High-Dimensional Confounder Control Methods for Rapid Cohort Studies from Electronic Health Records.” Journal of Comparative Effectiveness Research 5 (2): 179–92.\n\n\nNeugebauer, Romain, Julie A Schmittdiel, Zheng Zhu, Jeremy A Rassen, John D Seeger, and Sebastian Schneeweiss. 2015. “High-Dimensional Propensity Score Algorithm in Comparative Effectiveness Research with Time-Varying Interventions.” Statistics in Medicine 34 (5): 753–81.\n\n\nPang, Menglan, Tibor Schuster, Kristian B Filion, Maria Eberg, and Robert W Platt. 2016. “Targeted Maximum Likelihood Estimation for Pharmacoepidemiologic Research.” Epidemiology (Cambridge, Mass.) 27 (4): 570.\n\n\nPang, Menglan, Tibor Schuster, Kristian B Filion, Mireille E Schnitzer, Maria Eberg, and Robert W Platt. 2016. 
“Effect Estimation in Point-Exposure Studies with Binary Outcomes and High-Dimensional Covariate Data–a Comparison of Targeted Maximum Likelihood Estimation and Inverse Probability of Treatment Weighting.” The International Journal of Biostatistics 12 (2).\n\n\nSchneeweiss, Sebastian. 2018. “Automated Data-Adaptive Analytics for Electronic Healthcare Data to Study Causal Treatment Effects.” Clinical Epidemiology, 771–88.\n\n\nSchneeweiss, Sebastian, Wesley Eddings, Robert J Glynn, Elisabetta Patorno, Jeremy Rassen, and Jessica M Franklin. 2017. “Variable Selection for Confounding Adjustment in High-Dimensional Covariate Spaces When Analyzing Healthcare Databases.” Epidemiology 28 (2): 237–48.\n\n\nTian, Yuxi, Martijn J Schuemie, and Marc A Suchard. 2018. “Evaluating Large-Scale Propensity Score Performance Through Real-World and Synthetic Data Experiments.” International Journal of Epidemiology 47 (6): 2005–14.\n\n\nWeberpals, Janick, Tim Becker, Jessica Davies, Fabian Schmich, Dominik Rüttinger, Fabian J Theis, and Anna Bauer-Mehren. 2021. “Deep Learning-Based Propensity Scores for Confounding Control in Comparative Effectiveness Research: A Large-Scale, Real-World Data Study.” Epidemiology 32 (3): 378–88.\n\n\nWyss, Richard, Sebastian Schneeweiss, Mark Van Der Laan, Samuel D Lendle, Cheng Ju, and Jessica M Franklin. 2018. “Using Super Learner Prediction Modeling to Improve High-Dimensional Propensity Score Estimation.” Epidemiology 29 (1): 96–106.",
"crumbs": [
"Challenges",
"<span class='chapter-number'>12</span> <span class='chapter-title'>Literature</span>"
]
},
{
"objectID": "mllogic.html",
"href": "mllogic.html",
"title": "Machine learning",
"section": "",
"text": "Understanding variable’s role",
"crumbs": [
"Machine learning"
]
},
{
"objectID": "mllogic.html#understanding-variables-role",
"href": "mllogic.html#understanding-variables-role",
"title": "Machine learning",
"section": "",
"text": "(Rubin and Thomas 1996; Rubin 1997; Brookhart et al. 2006)\n\nConfounders\n\n\n\n\n\nflowchart LR\n C --> A\n C --> Y\n A --> Y\n style A fill:#90EE90;\n style Y fill:#ADD8E6;\n style C fill:#FF0000;\n\n\n\n\n\n\n\n\nAdjusting Confounders help reduce bias\n\n\n(Near) instruments\n\n\n\n\n\nflowchart LR\n C --> A\n A --> Y\n style A fill:#90EE90;\n style Y fill:#ADD8E6;\n style C fill:#FF0000;\n\n\n\n\n\n\n\n\nAdjusting for covariates strongly associated with the exposure: Adjusting for these variables can potentially amplify bias in the treatment effect estimate and increase standard error (SE).\n\n\nPrecision variables\n\n\n\n\n\nflowchart LR\n C --> Y\n A --> Y\n style A fill:#90EE90;\n style Y fill:#ADD8E6;\n style C fill:#FF0000;\n\n\n\n\n\n\n\n\nAdjusting for covariates strongly associated with the outcome: Adjusting for these variables can lead to decrease of the SE of the treatment effect estimate.\n\n\nNoise variables\n\n\n\n\n\nflowchart LR\n C\n A --> Y\n style A fill:#90EE90;\n style Y fill:#ADD8E6;\n style C fill:#FF0000;\n\n\n\n\n\n\n\n\nAdjusting for covariates that are neither associated with the outcome or the exposure can increase the SE of the treatment effect estimate.",
"crumbs": [
"Machine learning"
]
},
{
"objectID": "mllogic.html#overall-picture",
"href": "mllogic.html#overall-picture",
"title": "Machine learning",
"section": "Overall picture",
"text": "Overall picture\n\n\n\n\n\n\n\n\n\n\n\n\n\nChoose variables associated with the outcome in general, as long as they are not mediator, collider or effect of the outcome. In hdPS, we chose proxies in the covariate assessment period (before exposure occurs), reducing the possibility of those proxies to be mediator, collider or effect of the outcome.\n\n\nBrookhart, M Alan, Sebastian Schneeweiss, Kenneth J Rothman, Robert J Glynn, Jerry Avorn, and Til Stürmer. 2006. “Variable Selection for Propensity Score Models.” American Journal of Epidemiology 163 (12): 1149–56.\n\n\nRubin, Donald B. 1997. “Estimating Causal Effects from Large Data Sets Using Propensity Scores.” Annals of Internal Medicine 127 (8_Part_2): 757–63.\n\n\nRubin, Donald B, and Neal Thomas. 1996. “Matching Using Estimated Propensity Scores: Relating Theory to Practice.” Biometrics, 249–64.",
"crumbs": [
"Machine learning"
]
},
{
"objectID": "mllasso.html",
"href": "mllasso.html",
"title": "13 Pure ML",
"section": "",
"text": "14 Pure ML approach (LASSO)\nStart with all recurrence variables (EC in the following equation)\nSay, 100 proxies (associated with outcome) were selected by LASSO approach (ML-hdPS)",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>13</span> <span class='chapter-title'>Pure ML</span>"
]
},
{
"objectID": "mllasso.html#choose-variables-associated-with-outcome",
"href": "mllasso.html#choose-variables-associated-with-outcome",
"title": "13 Pure ML",
"section": "14.1 Choose variables associated with outcome",
"text": "14.1 Choose variables associated with outcome\n\nproxy.dim <- out2 # from step 3\ndim(proxy.dim) \n#> [1] 7585 143\nproxy.dim$id <- proxy.dim$idx\nproxy.dim$idx <- NULL\nfullcovproxy.data <- merge(data.complete[,c(\"id\",\n outcome, \n exposure, \n investigator.specified.covariates)], \n proxy.dim, by = \"id\")\ndim(fullcovproxy.data)\n#> [1] 3839 170\nfullcovproxy.data$outcome <- as.numeric(I(fullcovproxy.data$diabetes=='Yes'))\nfullcovproxy.data$exposure <- as.numeric(I(fullcovproxy.data$obese=='Yes'))\n\n\nproxy.list <- names(out2[-1])\n# out3$autoselected_covariate_df[,-1] for hybrid \n# out2 is from step2$recurrence_data\ncovarsTfull <- c(investigator.specified.covariates, proxy.list)\nY.form <- as.formula(paste0(c(\"outcome~ exposure\", \n covarsTfull), collapse = \"+\") )\ncovar.mat <- model.matrix(Y.form, data = fullcovproxy.data)[,-1]\nlasso.fit<-glmnet::cv.glmnet(y = fullcovproxy.data$outcome, \n x = covar.mat, \n type.measure='mse',\n family=\"binomial\",\n alpha = 1, \n nfolds = 5)\ncoef.fit<-coef(lasso.fit,s='lambda.min',exact=TRUE)\nsel.variables<-row.names(coef.fit)[which(as.numeric(coef.fit)!=0)]\nproxy.list.sel.ml <- proxy.list[proxy.list %in% sel.variables]\nlength(proxy.list.sel.ml)\n#> [1] 54\n\n\n\n\nFrom all proxies, we try to identify proxies that are empirically associated with the outcome based on a multivariate LASSO (outcome with all proxies in one model).\nNote that LASSO model is choosing variables based on association with the outcome conditional on the ’exposure`.\nVariable selection is only happening for proxy variables.\nInvestigator specified variables are not being subject to variable selection.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>13</span> <span class='chapter-title'>Pure ML</span>"
]
},
{
"objectID": "mllasso.html#build-model-formula-based-on-selected-variables",
"href": "mllasso.html#build-model-formula-based-on-selected-variables",
"title": "13 Pure ML",
"section": "14.2 Build model formula based on selected variables",
"text": "14.2 Build model formula based on selected variables\n\ncovform <- paste0(investigator.specified.covariates, collapse = \"+\")\nproxyform <- paste0(proxy.list.sel.ml, collapse = \"+\")\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", rhsformula))\n\n\n\nBuild propensity score model based on selected variables based on LASSO.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>13</span> <span class='chapter-title'>Pure ML</span>"
]
},
{
"objectID": "mllasso.html#fit-the-ps-model",
"href": "mllasso.html#fit-the-ps-model",
"title": "13 Pure ML",
"section": "14.3 Fit the PS model",
"text": "14.3 Fit the PS model\n\nhdps.data <- fullcovproxy.data\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n data = hdps.data, \n estimand = \"ATE\",\n method = \"ps\")\n\n\n\nPropensity score model fit to be able to calculate the inverse probability weights.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>13</span> <span class='chapter-title'>Pure ML</span>"
]
},
{
"objectID": "mllasso.html#obtain-log-or-from-unadjusted-outcome-model",
"href": "mllasso.html#obtain-log-or-from-unadjusted-outcome-model",
"title": "13 Pure ML",
"section": "14.4 Obtain log-OR from unadjusted outcome model",
"text": "14.4 Obtain log-OR from unadjusted outcome model\n\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nfit <- glm(out.formula,\n data = hdps.data,\n weights = W.out$weights,\n family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n c(\"Estimate\", \n \"Std. Error\", \n \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.29\n0.13\n0\n0.19\n0.39\n\n\n\n\n\n\n\n\n\nSummary of results (log-OR).\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian Schneeweiss. 2015. “Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>13</span> <span class='chapter-title'>Pure ML</span>"
]
},
{
"objectID": "mlhybrid.html",
"href": "mlhybrid.html",
"title": "14 Hybrid ML",
"section": "",
"text": "14.1 Build model formula based on selected variables\nInstead of all recurrence variables, you start with the hdPS variables chosen by the hdPS algorithm first.\nlength(proxy.list.sel)\n#> [1] 100\nproxy.list <- names(out3$autoselected_covariate_df[,-1]) # from step 4\ncovarsTfull <- c(investigator.specified.covariates, proxy.list)\nY.form <- as.formula(paste0(c(\"outcome~ exposure\", \n covarsTfull), collapse = \"+\") )\ncovar.mat <- model.matrix(Y.form, data = hdps.data)[,-1]\nlasso.fit<-glmnet::cv.glmnet(y = hdps.data$outcome, \n x = covar.mat, \n type.measure='mse',\n family=\"binomial\",\n alpha = 1, \n nfolds = 5)\ncoef.fit<-coef(lasso.fit,s='lambda.min',exact=TRUE)\nsel.variables<-row.names(coef.fit)[which(as.numeric(coef.fit)!=0)]\nproxy.list.sel.hybrid <- proxy.list[proxy.list %in% sel.variables]\nlength(proxy.list.sel.hybrid)\n#> [1] 48\nproxyform <- paste0(proxy.list.sel.hybrid, collapse = \"+\")\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", rhsformula))",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>14</span> <span class='chapter-title'>Hybrid ML</span>"
]
},
{
"objectID": "mlhybrid.html#build-model-formula-based-on-selected-variables",
"href": "mlhybrid.html#build-model-formula-based-on-selected-variables",
"title": "14 Hybrid ML",
"section": "",
"text": "From hdPS variables, we try to identify proxies that are empirically associated with the outcome based on a multivariate LASSO (outcome with all proxies in one model).\n\n\n\nBuild propensity score model based on selected variables based on LASSO.\n\n14.1.1 Fit the PS model\n\nW.out <- weightit(ps.formula, \n data = hdps.data, \n estimand = \"ATE\",\n method = \"ps\")\n\n\n\nPropensity score model fit to be able to calculate the inverse probability weights.\n\n\n14.1.2 Obtain log-OR from unadjusted outcome model\n\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nfit <- glm(out.formula,\n data = hdps.data,\n weights = W.out$weights,\n family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n c(\"Estimate\", \n \"Std. Error\", \n \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci.h <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci.h,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.31\n0.13\n0\n0.21\n0.42\n\n\n\n\n\n\n\nSummary of results (log-OR).\n\n\n\n\n\n\nAlternative process\n\n\n\nIt is also possible to start with ML selection, and then applying Bross’s formula on top of it (Schneeweiss et al. 2017).\n\n\n\n\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian Schneeweiss. 2015. “Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.\n\n\nSchneeweiss, Sebastian, Wesley Eddings, Robert J Glynn, Elisabetta Patorno, Jeremy Rassen, and Jessica M Franklin. 2017. 
“Variable Selection for Confounding Adjustment in High-Dimensional Covariate Spaces When Analyzing Healthcare Databases.” Epidemiology 28 (2): 237–48.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>14</span> <span class='chapter-title'>Hybrid ML</span>"
]
},
{
"objectID": "sl.html",
"href": "sl.html",
"title": "15 Ensemble",
"section": "",
"text": "15.1 Build model formula based on all variables\nproxy.list <- names(out3$autoselected_covariate_df[,-1])\nlength(proxy.list)\n#> [1] 100\ncovform <- paste0(investigator.specified.covariates, collapse = \"+\")\nproxyform <- paste0(proxy.list, collapse = \"+\")\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", rhsformula))",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>15</span> <span class='chapter-title'>Ensemble</span>"
]
},
{
"objectID": "sl.html#build-model-formula-based-on-all-variables",
"href": "sl.html#build-model-formula-based-on-all-variables",
"title": "15 Ensemble",
"section": "",
"text": "We work with all proxies",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>15</span> <span class='chapter-title'>Ensemble</span>"
]
},
{
"objectID": "sl.html#fit-the-ps-model-with-super-learner",
"href": "sl.html#fit-the-ps-model-with-super-learner",
"title": "15 Ensemble",
"section": "15.2 Fit the PS model with super learner",
"text": "15.2 Fit the PS model with super learner\n\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n data = hdps.data, \n estimand = \"ATE\",\n method = \"super\",\n SL.library = c(\"SL.glm\", \n \"SL.glmnet\",\n \"SL.earth\"))\n#> Loading required namespace: glmnet\n#> Loading required namespace: earth\n\n\n\nPropensity score model fit based on super learning algorithm to be able to calculate the inverse probability weights.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>15</span> <span class='chapter-title'>Ensemble</span>"
]
},
{
"objectID": "sl.html#obtain-log-or-from-unadjusted-outcome-model",
"href": "sl.html#obtain-log-or-from-unadjusted-outcome-model",
"title": "15 Ensemble",
"section": "15.3 Obtain log-OR from unadjusted outcome model",
"text": "15.3 Obtain log-OR from unadjusted outcome model\n\nsummary(W.out$ps)\n#> Min. 1st Qu. Median Mean 3rd Qu. Max. \n#> 0.01826 0.22575 0.39324 0.42094 0.59037 0.98809\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nfit <- glm(out.formula,\n data = hdps.data,\n weights = W.out$weights,\n family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n c(\"Estimate\", \n \"Std. Error\", \n \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci.sl <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci.sl,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.42\n0.1\n0\n0.31\n0.53\n\n\n\n\n\n\n\n\n\nSummary of results (log-OR).",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>15</span> <span class='chapter-title'>Ensemble</span>"
]
},
{
"objectID": "tmle.html",
"href": "tmle.html",
"title": "16 TMLE",
"section": "",
"text": "16.1 Obtain OR with superlearner\nsummary(W.out$ps)\n#> Min. 1st Qu. Median Mean 3rd Qu. Max. \n#> 0.01826 0.22575 0.39324 0.42094 0.59037 0.98809\nSL.library = c(\"SL.glm\", \"SL.glmnet\",\"SL.earth\")\nproxy.list <- names(out3$autoselected_covariate_df[,-1])\nObsData.noYA <- hdps.data[,c(investigator.specified.covariates, \n proxy.list)]\ntmle.fit <- tmle::tmle(Y = hdps.data$outcome,\n A = hdps.data$exposure, \n W = ObsData.noYA, \n family = \"binomial\",\n V.Q = 3,\n V.g = 3,\n Q.SL.library = SL.library,\n g1W = W.out$ps)\nestOR.tmle <- tmle.fit$estimates$OR\nestOR.tmle\n#> $psi\n#> [1] 1.433163\n#> \n#> $log.psi\n#> [1] 0.3598838\n#> \n#> $CI\n#> [1] 1.228595 1.671792\n#> \n#> $pvalue\n#> [1] 4.65237e-06\n#> \n#> $var.log.psi\n#> [1] 0.0061747\n#> \n#> $bs.var.log.psi\n#> [1] NA\n#> \n#> $bs.CI.twosided\n#> [1] NA NA\n#> \n#> $bs.CI.onesided.lower\n#> [1] -Inf NA\n#> \n#> $bs.CI.onesided.upper\n#> [1] NA Inf",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>16</span> <span class='chapter-title'>TMLE</span>"
]
},
{
"objectID": "tmle.html#obtain-or-with-superlearner",
"href": "tmle.html#obtain-or-with-superlearner",
"title": "16 TMLE",
"section": "",
"text": "We use the same propensity score model that was fitted based on super learning algorithm.\n\n\n\nIf you want to know more about TMLE, look at other tutorials.\n\n\n\nSummary of results (OR).",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>16</span> <span class='chapter-title'>TMLE</span>"
]
},
{
"objectID": "tmle.html#obtain-or-without-superlearner",
"href": "tmle.html#obtain-or-without-superlearner",
"title": "16 TMLE",
"section": "16.2 Obtain OR without superlearner",
"text": "16.2 Obtain OR without superlearner\n\nsummary(W.out0$ps)\n#> Min. 1st Qu. Median Mean 3rd Qu. Max. \n#> 0.0000003 0.2445201 0.4308397 0.4481213 0.6321148 0.9975438\nSL.library = c(\"SL.glm\")\nproxy.list <- names(out3$autoselected_covariate_df[,-1])\nObsData.noYA <- hdps.data[,c(investigator.specified.covariates, \n proxy.list)]\n\n\n\nWe use the same propensity score model that was fitted based on hdPS variables via logistic regression (no other learners).\n\ntmle.fit0 <- tmle::tmle(Y = hdps.data$outcome,\n A = hdps.data$exposure, \n W = ObsData.noYA, \n family = \"binomial\",\n V.Q = 3,\n V.g = 3,\n Q.SL.library = SL.library,\n g1W = W.out$ps)\n\n\nestOR.tmle0 <- tmle.fit0$estimates$OR\nestOR.tmle0\n#> $psi\n#> [1] 1.459564\n#> \n#> $log.psi\n#> [1] 0.3781375\n#> \n#> $CI\n#> [1] 1.256360 1.695633\n#> \n#> $pvalue\n#> [1] 7.669587e-07\n#> \n#> $var.log.psi\n#> [1] 0.005850784\n#> \n#> $bs.var.log.psi\n#> [1] NA\n#> \n#> $bs.CI.twosided\n#> [1] NA NA\n#> \n#> $bs.CI.onesided.lower\n#> [1] -Inf NA\n#> \n#> $bs.CI.onesided.upper\n#> [1] NA Inf\n\n\n\n\n\nSummary of results (OR).",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>16</span> <span class='chapter-title'>TMLE</span>"
]
},
{
"objectID": "stat.html",
"href": "stat.html",
"title": "17 Statistical Approaches",
"section": "",
"text": "17.1 Background\nRecent work compares multiple variable selection strategies for hdPS analysis (Karim and Lei 2025). The study aims to identify methods that best balance bias, precision, and computational cost in causal inference using observational data. It is based on NHANES 2013–2018 data evaluating the association between obesity and diabetes.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>17</span> <span class='chapter-title'>Statistical Approaches</span>"
]
},
{
"objectID": "stat.html#background",
"href": "stat.html#background",
"title": "17 Statistical Approaches",
"section": "",
"text": "Tip\n\n\n\n(Karim and Lei 2025)",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>17</span> <span class='chapter-title'>Statistical Approaches</span>"
]
},
{
"objectID": "stat.html#simulation-design",
"href": "stat.html#simulation-design",
"title": "17 Statistical Approaches",
"section": "17.2 Simulation Design",
"text": "17.2 Simulation Design\n\n\n\n\n\n\n\nElement\nDetails\n\n\n\n\nData Source\nNHANES 2013–2018\n\n\nSample Size\n3,000 participants per iteration\n\n\nIterations\n500\n\n\nPrevalence Scenarios\n1. Frequent exposure & frequent outcome 2. Rare exposure & frequent outcome 3. Frequent exposure & rare outcome\n\n\nTrue Effect\nOR = 1 (null); RD = 0\n\n\nOutcome Generation\nIncluded nonlinear transforms, interactions, and a comorbidity index from 94 proxies\n\n\nNoise Variables\n48 of 142 proxy covariates used as noise",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>17</span> <span class='chapter-title'>Statistical Approaches</span>"
]
},
{
"objectID": "stat.html#methods-compared",
"href": "stat.html#methods-compared",
"title": "17 Statistical Approaches",
"section": "17.3 Methods Compared",
"text": "17.3 Methods Compared\n\n\n\n\n\n\n\nMethod\nDescription\n\n\n\n\nKitchen Sink\nIncludes all investigator and proxy covariates (no selection)\n\n\nBross hdPS\nSelects top 100 proxies using the Bross formula\n\n\nHybrid (Bross + LASSO)\nFirst applies Bross, then refines with LASSO\n\n\nLASSO\nPenalized regression with cross-validation\n\n\nElastic Net\nCombines LASSO and Ridge penalties to handle collinearity\n\n\nRandom Forest\nRanks variables by importance using Gini impurity\n\n\nXGBoost\nBoosted trees optimizing impurity reduction\n\n\nForward Selection\nAdds variables sequentially based on adjusted R²\n\n\nBackward Elimination\nRemoves variables iteratively based on adjusted R²\n\n\nGenetic Algorithm\nEvolves variable subsets via stochastic search",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>17</span> <span class='chapter-title'>Statistical Approaches</span>"
]
},
{
"objectID": "stat.html#simulation-results",
"href": "stat.html#simulation-results",
"title": "17 Statistical Approaches",
"section": "17.4 Simulation Results",
"text": "17.4 Simulation Results\n\n\n\n\n\nFigure 1. Bias across Methods in NHANES Plasmode Simulation\n\n\n\n\n\n\n\n\n\nFigure 2. Coverage across Methods in NHANES Plasmode Simulation\n\n\n\n\nSee interactive results: 👉 Shiny App",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>17</span> <span class='chapter-title'>Statistical Approaches</span>"
]
},
{
"objectID": "stat.html#key-takeaways",
"href": "stat.html#key-takeaways",
"title": "17 Statistical Approaches",
"section": "17.5 Key Takeaways",
"text": "17.5 Key Takeaways\n\nSimpler methods (Forward/Backward selection) offer strong coverage with efficiency.\nBross-based and Hybrid hdPS methods remain reliable and interpretable.\nMethod choice should reflect the specific inferential goal: bias reduction vs variance minimization.\n\n\n\n\n\nKarim, ME, and Y Lei. 2025. “Is There a Competitive Advantage to Using Multivariate Statistical or Machine Learning Methods over the Bross Formula in the hdPS Framework for Bias and Variance Estimation?” PLoS One 20 (5): e0324639.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>17</span> <span class='chapter-title'>Statistical Approaches</span>"
]
},
{
"objectID": "mlcompare.html",
"href": "mlcompare.html",
"title": "18 Compare results",
"section": "",
"text": "Summary of model results\n\n\n\nOR\nBeta-coef\ncoef-SE\nCI (2.5 %)\nCI (97.5 %)\np-value\n\n\n\n\nCrude (no adjustment)\n2.08\n0.73\n0.05\n0.63\n0.84\n< 2e-16\n\n\nPS (no proxies)\n1.98\n0.68\n0.07\n0.61\n0.76\n< 2e-16\n\n\nhdPS\n1.52\n0.42\n0.04\n0.35\n0.49\n< 2e-16\n\n\nPure LASSO\n1.34\n0.29\n0.05\n0.19\n0.39\n5.9e-08\n\n\nHybrid (hdPS, then LASSO)\n1.37\n0.32\n0.05\n0.21\n0.42\n3.0e-09\n\n\nSuper learner (GLM, LASSO, MARS)\n1.53\n0.42\n0.10\n0.31\n0.53\n2.6e-14\n\n\nTMLE (GLM, LASSO, MARS in SL)\n1.43\n0.36\n0.08\n0.21\n0.51\n4.7e-06\n\n\nTMLE (only GLM in SL)\n1.46\n0.38\n0.08\n0.23\n0.53\n7.7e-07\n\n\nKitchen Sink\n1.50\n0.41\n0.04\n0.32\n0.48\n< 2e-16\n\n\nRandom Forest\n1.54\n0.43\n0.04\n0.35\n0.51\n< 2e-16\n\n\nXGBoost\n1.51\n0.41\n0.04\n0.33\n0.49\n< 2e-16\n\n\nForward Selection\n1.56\n0.44\n0.04\n0.36\n0.52\n< 2e-16\n\n\nBackward Elimination\n1.53\n0.43\n0.04\n0.34\n0.50\n< 2e-16\n\n\n\n\n\n\n\n\nPS is the result from the propensity score approach that did not include any proxies.\nResults from this approach is somewhat different than other approaches.\nMore detailed results from simulations are available elsewhere (Karim 2023).\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAcross all methods evaluated—including hdPS, regularized regression (LASSO, Hybrid), ensemble learners (Super Learner, TMLE), and high-dimensional variable selection strategies (e.g., Kitchen Sink, Random Forest, XGBoost)—adjusted odds ratios ranged from 1.34 to 1.56, with most clustering between 1.50 and 1.56. In contrast, unadjusted and PS-only models produced substantially higher ORs (>1.9).\n\n\nKarim, ME. 2023. “Rethinking Residual Confounding Bias Reduction: Why Vanilla hdPS Alone Is No Longer Enough.”",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>18</span> <span class='chapter-title'>Compare results</span>"
]
},
{
"objectID": "dctmle.html",
"href": "dctmle.html",
"title": "19 DC-TMLE",
"section": "",
"text": "19.1 Background\nDouble Cross-Fit TMLE (DC-TMLE) (Mondol and Karim 2024; M. Karim and Mondol 2025) is an extension of TMLE designed to improve robustness and reduce overfitting when using flexible, high-dimensional or machine learning-based models. It works by splitting the data into multiple folds, training nuisance models (e.g., the propensity score and outcome regressions) on one subset, and then evaluating the targeted update and parameter estimation on another. This sample-splitting (cross-fitting) procedure helps ensure that the estimation step is not biased by the same data used to fit the nuisance models. This process of sample-splitting and estimation is repeated, and the results are averaged to produce a final, stable estimate. DC-TMLE maintains double robustness, meaning it remains consistent if either the treatment or outcome model is correctly specified, and it provides valid statistical inference even in high-dimensional settings where traditional TMLE may be unstable.\nResidual confounding remains a persistent challenge in observational studies, particularly with high-dimensional data (M. E. Karim and Lei 2025). Recent work evaluates traditional and machine learning-based extensions of hdPS methods, including Super Learner (SL), TMLE, and Double Cross-Fit TMLE (DC-TMLE).",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>19</span> <span class='chapter-title'>DC-TMLE</span>"
]
},
{
"objectID": "dctmle.html#background",
"href": "dctmle.html#background",
"title": "19 DC-TMLE",
"section": "",
"text": "Tip\n\n\n\n(M. E. Karim and Lei 2025)",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>19</span> <span class='chapter-title'>DC-TMLE</span>"
]
},
{
"objectID": "dctmle.html#simulation-design",
"href": "dctmle.html#simulation-design",
"title": "19 DC-TMLE",
"section": "19.2 Simulation Design",
"text": "19.2 Simulation Design\n\n\n\n\n\n\n\nElement\nDetails\n\n\n\n\nData Source\nNHANES 2013–2018\n\n\nSample Size\n3,000 per iteration\n\n\nIterations\n500\n\n\nExposure/Outcome Prevalence\n3 scenarios: (i) Frequent-Frequent, (ii) Rare-Frequent, (iii) Frequent-Rare\n\n\nTrue Effect\nOR = 1 (null); RD = 0\n\n\nProxies\n142 medication variables; 94 outcome-associated proxies and 48 noise variables\n\n\nConfounding Simulation\nUsed proxy-derived comorbidity index and complex transformations to mimic unmeasured confounding",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>19</span> <span class='chapter-title'>DC-TMLE</span>"
]
},
{
"objectID": "dctmle.html#methods-compared",
"href": "dctmle.html#methods-compared",
"title": "19 DC-TMLE",
"section": "19.3 Methods Compared",
"text": "19.3 Methods Compared\n\n\n\n\n\n\n\n\nMethod Group\nMethod\nDescription\n\n\n\n\nTMLE Methods with Proxies\nTMLE.ks, hdPS.TMLE, LASSO.TMLE, hdPS.LASSO.TMLE\nTMLE with various proxy selection strategies\n\n\n\nDC.TMLE\nDouble cross-fit TMLE\n\n\nSuper Learner Methods with Proxies\nhdPS.SL, LASSO.SL, hdPS.LASSO.SL, SL.ks\nSuper Learner with proxy selection options\n\n\nStandard Methods with Proxies\nPS.ks, hdPS, LASSO, hdPS.LASSO\nPropensity score and outcome models with proxy inclusion\n\n\nNo Proxy Methods\nTMLE.u, SL.u, PS.u\nOnly measured covariates, no proxies\n\n\n\nSuper Learner libraries included:\n\n1-learner: Logistic regression\n3-learners: Logistic regression, LASSO, MARS\n4-learners: Above + XGBoost (non-Donsker)",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>19</span> <span class='chapter-title'>DC-TMLE</span>"
]
},
{
"objectID": "dctmle.html#simulation-results",
"href": "dctmle.html#simulation-results",
"title": "19 DC-TMLE",
"section": "19.4 Simulation Results",
"text": "19.4 Simulation Results\n\n\n\n\n\nFigure 2. Bias across Methods in NHANES Plasmode Simulation\n\n\n\n\n\n\n\n\n\nFigure 3. Coverage across Methods in NHANES Plasmode Simulation\n\n\n\n\nResults are fully accessible via a Shiny app:\n👉 Interactive Causal Benchmark App\nExplore bias, SEs, and coverage metrics across methods and simulation conditions.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>19</span> <span class='chapter-title'>DC-TMLE</span>"
]
},
{
"objectID": "dctmle.html#conclusion",
"href": "dctmle.html#conclusion",
"title": "19 DC-TMLE",
"section": "19.5 Conclusion",
"text": "19.5 Conclusion\n\nSimpler models with structured proxy inclusion (hdPS, LASSO) remain competitive and stable.\nTMLE is effective for bias reduction but suffers under high-dimensional instability with complex libraries.\nSL performance is library-sensitive; 1- and 3-learner libraries performed best. Complex learners (e.g., XGBoost) should be used cautiously.\n\n\n\n\n\nKarim, ME, and MH Mondol. 2025. “Finding the Optimal Number of Splits and Repetitions in Double Cross-Fitting Targeted Maximum Likelihood Estimators.” Pharmaceutical Statistics.\n\n\nKarim, Mohammad Ehsanul, and Yang Lei. 2025. “How Effective Are Machine Learning and Doubly Robust Estimators in Incorporating High-Dimensional Proxies to Reduce Residual Confounding?” Pharmacoepidemiology and Drug Safety 34 (5): e70155.\n\n\nMondol, MH, and ME Karim. 2024. “Towards Robust Causal Inference in Epidemiological Research: Employing Double Cross-Fit TMLE in Right Heart Catheterization Data.” American Journal of Epidemiology, kwae447.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>19</span> <span class='chapter-title'>DC-TMLE</span>"
]
},
{
"objectID": "deep.html",
"href": "deep.html",
"title": "20 Deep Learning",
"section": "",
"text": "20.1 Plasmode Simulation\nRecent work extends traditional hdPS analyses by introducing and explaining neural representation learning methods for causal inference in observational studies. It focuses on NHANES data (2013–2018) and highlights how recent innovations in machine learning can address residual confounding and model misspecification challenges commonly encountered in high-dimensional data settings.",
"crumbs": [
"Machine learning",
"<span class='chapter-number'>20</span> <span class='chapter-title'>Deep Learning</span>"
]
},
{
"objectID": "deep.html#plasmode-simulation",
"href": "deep.html#plasmode-simulation",
"title": "20 Deep Learning",