hdPSv25/search.json at main · ehsanx/hdPSv25 · GitHub

[
  {
    "objectID": "index.html",
    "href": "index.html",
    "title": "hdPS and its machine learning extensions in residual confounding control",
    "section": "",
    "text": "The use of retrospective health care claims datasets is frequently criticized for lacking complete information on potential confounders. Ultimately, the treatment effects estimated utilizing such data sources may be subject to residual confounding. Digital electronic administrative records routinely collect a large volume of health-related information, much of which is usually not considered in conventional pharmacoepidemiological studies."
  },
  {
    "objectID": "index.html#proposal-to-reduce-residual-confounding-bias",
    "href": "index.html#proposal-to-reduce-residual-confounding-bias",
    "title": "hdPS and its machine learning extensions in residual confounding control",
    "section": "Proposal to reduce residual confounding bias",
    "text": "Proposal to reduce residual confounding bias\nIn 2009, a high-dimensional propensity score (hdPS) algorithm was proposed that utilizes such information as surrogates or proxies for mismeasured and unobserved confounders in an effort to reduce residual confounding bias. Since then, many machine learning and semi-parametric extensions of this algorithm have been proposed to exploit the wealth of high-dimensional proxy information properly.\n\n\nSchneeweiss et al. (2009)"
  },
  {
    "objectID": "index.html#purpose-of-the-workshop",
    "href": "index.html#purpose-of-the-workshop",
    "title": "hdPS and its machine learning extensions in residual confounding control",
    "section": "Purpose of the workshop",
    "text": "Purpose of the workshop\nThis workshop will\n\ndemonstrate the logic, steps, and implementation guidelines of hdPS using an open data source as an example (with reproducible R code),\nfamiliarize participants with the difference between the propensity score and hdPS,\nexplain the rationale for using the machine learning extensions of hdPS, and their statistical properties, and\ndiscuss advantages, controversies, and hdPS reporting guidelines for writing a manuscript."
  },
  {
    "objectID": "index.html#workshop-prerequisite",
    "href": "index.html#workshop-prerequisite",
    "title": "hdPS and its machine learning extensions in residual confounding control",
    "section": "Workshop prerequisite",
    "text": "Workshop prerequisite\nAttendees should have prerequisite knowledge of multiple regression analysis and working knowledge in R (e.g., basic data manipulation and regression fitting).\n\nR Codes\nR Codes for data creation and hdPS analysis can be found on the GitHub repo (codes directory).\n\n\nVersion history\nDifferent versions and updates of the materials were presented in the following sessions\n\nCanadian Society for Epidemiology and Biostatistics, Montreal, Quebec, August 11, 2025 (scheduled)\n2025 Society of Epidemiologic Research Workshops, July 11, 2025 (scheduled)\n2025 Statistical Society of Canada, Biostatistics Workshop, May 25, 2025 (together with Md Belal Hossain)\n2024 Society of Epidemiologic Research Workshops, May 10th, 2024\nR/Medicine Conference 2023, Virtual, June 5, 2023\n2023 Society of Epidemiologic Research Workshops, Virtual, May 4, 2023\n\nAdditional relevant talks (selected):\n\nStatistical issues in administrative data, Banff International Research Station, Banff, Feb 2019.\nStatistics Conference in Genomics, Pharmaceutical Science, and Health Data Science, August 15-17, 2022 University of Victoria, Victoria, BC\nWork in Progress Seminar, CHEOS, St. Paul’s Hospital (Hurlburt Auditorium), Dec 14th, 2022.\nStatistics and Biostatistics seminar series, at the Department of Statistics and Actuarial Science, University of Waterloo, April 26, 2023.\nConference on Statistics and Data Science with Applications in Biology, Genetics, Public Health, and Finance, Thompson Rivers University, Kamloops, August 21-24, 2023.\n\n\n\nCitation\n\n\n\n\n\n\nHow to cite\n\n\n\nKarim, M. E. (2025). High-dimensional propensity score and its machine learning extensions in residual confounding control. The American Statistician, 79(1), 72-90. 
DOI: 10.1080/00031305.2024.2368794.\n\n\n\n\nComments\nFor any comments regarding this document, reach out to me.\n\n\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512."
  },
  {
    "objectID": "motivating.html#literature",
    "href": "motivating.html#literature",
    "title": "Motivating example",
    "section": "Literature",
    "text": "Literature\nType 2 diabetes is a metabolic disorder characterized by high blood sugar levels and insulin resistance. A growing body of evidence indicates that obesity is a well-established risk factor for type 2 diabetes. Possible mechanisms include excess body fat leading to insulin resistance and impairing the body’s ability to regulate blood sugar levels.\n\n\n\n\n\n\n\n\n\n(Klein et al. 2022)"
  },
  {
    "objectID": "motivating.html#research-question",
    "href": "motivating.html#research-question",
    "title": "Motivating example",
    "section": "Research question",
    "text": "Research question\n“Does obesity increase the risk of developing diabetes?”\n\n\n\n\n\n\nTip\n\n\n\nObesity is often considered a challenging exposure variable to define precisely in research studies (Hernán and Taubman 2008). In this case, we are using it as an illustrative example to explain the methods and not attempting to make any clinical statements about this topic.\n\n\n\n\n\n\n\nflowchart LR\n  A[Obesity] --> Y(Diabetes)\n\n\n\n\n\n\n\n\n\n\n\nExposure: Being obese\n\nOutcome: Developing diabetes\n\n\n\n\n\n\n\nTip\n\n\n\nThe primary goal of the research is not to answer a clinical question or to draw conclusions about the relationship between obesity and diabetes in the general population, but rather to use the relationship as a motivating example for conducting simulations that compare different statistical methods.\n\n\n\n\n\n\nHernán, Miguel A, and Sarah L Taubman. 2008. “Does Obesity Shorten Life? The Importance of Well-Defined Interventions to Answer Causal Questions.” International Journal of Obesity 32 (3): S8–14.\n\n\nKlein, Samuel, Amalia Gastaldelli, Hannele Yki-Järvinen, and Philipp E Scherer. 2022. “Why Does Obesity Cause Diabetes?” Cell Metabolism 34 (1): 11–20."
  },
  {
    "objectID": "data.html#choose-a-u.s.-data-source",
    "href": "data.html#choose-a-u.s.-data-source",
    "title": "1  Data to Analyze",
    "section": "1.1 Choose a U.S. data source",
    "text": "1.1 Choose a U.S. data source\n\n\n\n\n\n\n\n\nData source: National Health and Nutrition Examination Survey (NHANES) (Disease Control and Prevention 2021)\n\n2013-2014,\n2015-2016,\n2017-2018\n\nAvailability: NHANES is a publicly available dataset that can be downloaded for free from the CDC website.\nDesign: Observational cross-sectional data. Hence, inferring causality is neither possible nor our objective here."
  },
  {
    "objectID": "data.html#confounder-identification",
    "href": "data.html#confounder-identification",
    "title": "1  Data to Analyze",
    "section": "1.2 Confounder identification",
    "text": "1.2 Confounder identification\nDirected acyclic graph (DAG)\n\n\n(Greenland, Pearl, and Robins 1999)\n\n\n\n\n\nflowchart TB\n  A[Obesity A] --> Y(Diabetes Y)\n  L[Confounders C] --> Y\n  L --> A\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHypothesized Directed acyclic graph drawn based on analyst’s best understanding of the literature\n\n\n\n\n\n\nExposure: Being obese\n\nOutcome: Developing diabetes\n\nConfounders: Demographic and lab variables"
  },
  {
    "objectID": "data.html#structure-of-the-data",
    "href": "data.html#structure-of-the-data",
    "title": "1  Data to Analyze",
    "section": "1.3 Structure of the data",
    "text": "1.3 Structure of the data\n\n\n\n\n\nflowchart LR\n  D[NHANES 2013-14] --> demo[Demographic Variables and Sample Weights]\n  demo --> Age\n  demo --> Sex\n  demo --> Education\n  demo --> r[Race or ethnicity]\n  demo --> m[Marital status]\n  demo --> Income\n  demo --> b[Birth place]\n  demo --> sf[Survey features: sampling weights, strata, cluster]\n  D --> bmi[Body Measures]\n  bmi --> Obesity\n  D --> diq[Diabetes]\n  diq --> Diabetes\n  diq --> f[Family history of diabetes]\n  D --> smq[Smoking - Cigarette Use]\n  smq --> Smoking\n  D --> dbq[Diet Behavior & Nutrition]\n  dbq --> Diet\n  D --> paq[Physical Activity]\n  paq --> p[Physical activities]\n  D --> huq[Hospital Utilization & Access to Care]\n  huq --> mm[Medical access]\n  D --> bpx[Blood Pressure]\n  bpx --> sbp[Systolic Blood Pressure]\n  bpx --> dbp[Diastolic Blood Pressure]\n  D --> bpq[Blood Pressure & Cholesterol]\n  bpq --> hc[High cholesterol]\n  D --> slq[Sleep Disorders]\n  slq --> Sleep\n  D --> biopro[Standard Biochemistry Profile]\n  biopro --> u[Uric acid]\n  biopro --> Protein\n  biopro --> Bilirubin\n  biopro --> Phosphorus\n  biopro --> Sodium\n  biopro --> Potassium\n  biopro --> Globulin\n  biopro --> Calcium\n  D --> rxq[Prescription Medications - ICD-10-CM codes]\n  style D fill:#FFA500;\n  style rxq fill:#00FF00;\n  style biopro fill:#00FF00;\n  style slq fill:#00FF00;\n  style bpq fill:#00FF00;\n  style bpx fill:#00FF00;\n  style huq fill:#00FF00;\n  style paq fill:#00FF00;\n  style dbq fill:#00FF00;\n  style smq fill:#00FF00;\n  style diq fill:#00FF00;\n  style bmi fill:#00FF00;\n  style demo fill:#00FF00;\n\n\n\n\n\n\n\n\n\n\n\nWe do the same for the following cycles:\n\nNHANES 2015-16\nNHANES 2017-18"
  },
  {
    "objectID": "data.html#identify-measured-and-unmeasured-variables-in-the-data",
    "href": "data.html#identify-measured-and-unmeasured-variables-in-the-data",
    "title": "1  Data to Analyze",
    "section": "1.4 Identify measured and unmeasured variables in the data",
    "text": "1.4 Identify measured and unmeasured variables in the data\nFind variables capturing the following concepts in the data based on a hypothesized DAG.\n\n\n\n\n\nRole\nData Component\nVariables considered based on DAG\n\n\n\n\nOutcome\nDIQ\nHave diabetes\n\n\nExposure\nBMX\nObese; BMI >= 30\n\n\nConfounder\n(demographic) DEMO\nAge, Sex, Education, Race/ethnicity, Marital status, Annual household income, Country of birth, Survey cycle year\n\n\n\n(behaviour) SMQ, PAQ, SLQ, DBQ\nSmoking, Vigorous work activity, Sleep, Diet\n\n\n\n(health history / access) DIQ, HUQ\nDiabetes family history, Access to care\n\n\n\n(lab) BPX, BPQ, BIOPRO\nBlood pressure (systolic, diastolic), Cholesterol, Uric acid, Total Protein, Total Bilirubin, Phosphorus, Sodium, Potassium, Globulin, Total Calcium\n\n\n\n\n\n\n14 demographic, behavioral, health history related variables\n\nMostly categorical\n\n11 lab variables\n\nMostly continuous"
  },
  {
    "objectID": "data.html#fitting-crude-model-to-obtain-or",
    "href": "data.html#fitting-crude-model-to-obtain-or",
    "title": "1  Data to Analyze",
    "section": "1.5 Fitting crude model to obtain OR",
    "text": "1.5 Fitting crude model to obtain OR\n\n\n\n\n\n\nCrude association\n\n\n\nHere we estimate the crude association between the exposure and the outcome.\n\n\n\n\n\n\nout.formula <- as.formula(\"outcome ~ exposure\")\nfit <- glm(out.formula,\n            data = hdps.data,\n            family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n                                 c(\"Estimate\", \n                                   \"Std. Error\", \n                                   \"Pr(>|z|)\")]\nfit.ci <- confint(fit, level = 0.95)[\"exposure\", ]\nfit.summary_with_ci.crude <- c(fit.summary, fit.ci)\nknitr::kable(t(round(fit.summary_with_ci.crude, 2)))\n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.66\n0.08\n0\n0.51\n0.81\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDisease Control, Centers for, and Prevention. 2021. “National Health and Nutrition Examination Survey (NHANES).” National Center for Health Statistics.\n\n\nGreenland, Sander, Judea Pearl, and James M Robins. 1999. “Causal Diagrams for Epidemiologic Research.” Epidemiology, 37–48."
  },
  {
    "objectID": "psipw.html#propensity-score-analysis-1",
    "href": "psipw.html#propensity-score-analysis-1",
    "title": "2  Propensity score",
    "section": "2.1 Propensity Score Analysis",
    "text": "2.1 Propensity Score Analysis\nThere are four approaches to propensity score (PS) analysis:\n\nWeighting: Assign weights to individuals based on their propensity scores to create a pseudo-population where treatment groups are balanced.\nMatching: Match individuals in the treatment group with individuals in the control group based on their propensity scores.\nStratification: Divide the sample into strata based on the propensity score and compare outcomes within each stratum.\nCovariate Adjustment: Include the propensity score as a covariate in an outcome model to adjust for confounding."
  },
  {
    "objectID": "psipw.html#propensity-score-weighting",
    "href": "psipw.html#propensity-score-weighting",
    "title": "2  Propensity score",
    "section": "2.2 Propensity Score Weighting",
    "text": "2.2 Propensity Score Weighting\nFor this demonstration, we will focus on the Weighting approach. The other approaches are not covered in this demonstration, but they can be implemented using similar steps as shown below.\nThere are five steps in propensity score weighting, numbered Step 0 to Step 4:\n\nData preparation: Prepare the data by creating the treatment/exposure, outcome, and covariates.\nSpecifying PS & fit model: Specify the propensity score model with investigator-specified measured covariates and fit the model.\nWeighting: Convert PS to inverse probability weights (IPW).\nCovariate balance: Check the balance of covariates between treatment groups after weighting.\nEstimating treatment effect: Fit the outcome model on the pseudo population.\n\n\n2.2.1 Step 0: Data preparation\n\n2.2.1.1 Creating Analytic data\nThree cycles of NHANES datasets were downloaded from the US CDC website, recoded for consistency, and merged together to create an analytic dataset.\nThe details of the data download, recoding, and merging processes are discussed in the Appendix.\n\n\n\n\nflowchart LR\n  A[NHANES] --> C1(2013-2014 cycle) --> ss1(10,175<br>participants)\n  A --> C2(2015-2016 cycle) --> ss2(9,971<br>participants)\n  A --> C3(2017-2018 cycle) --> ss3(9,254<br>participants)\n\n  ss(7,585<br>participants<br>after imposing<br>eligibility criteria)\n\n  ss1 --> ss\n  ss2 --> ss\n  ss3 --> ss\n\n  %% 1. Define a reusable style class named 'customOrange'\n  classDef customOrange fill:#FFA500,color:#333,stroke:#A66C00\n\n  %% 2. Apply the class to the desired nodes\n  class A,C1,C2,C3,ss1,ss2,ss3,ss customOrange;\n\n\n\n\n\n\n\n\n\n\nOur study population was restricted to the U.S. 
population who were\n\n20 years or older and\nnot pregnant at the time of survey data collection, and\nwho had available International Classification of Diseases (ICD) codes to ensure we can extract sufficient proxy information for the analysis (discussed in step 1).\n\nTo simplify the analysis, we only considered complete case data.\n\n# Table 1\nlibrary(tableone)\ntab1 <- CreateTableOne(vars = investigator.specified.covariates, \n                       strata = \"exposure\",\n                       data = hdps.data, \n                       test = FALSE)\nprint(tab1, \n      showAllLevels = TRUE, \n      noSpaces = TRUE, \n      quote = FALSE, \n      smd = TRUE, \n      printToggle = FALSE) |>\n  kbl(caption = \"Table 1: Baseline Characteristics by Exposure Group\") |>\n  kable_styling(bootstrap_options = c(\"striped\", \"hover\", \"condensed\"), \n                full_width = FALSE)\n\n\n\nTable 1: Baseline Characteristics by Exposure Group\n \n  \n      \n    level \n    0 \n    1 \n    SMD \n  \n \n\n  \n    n \n     \n    2223 \n    1616 \n     \n  \n  \n    age.cat (%) \n    20-49 \n    703 (31.6) \n    528 (32.7) \n    0.149 \n  \n  \n     \n    50-64 \n    673 (30.3) \n    579 (35.8) \n     \n  \n  \n     \n    65+ \n    847 (38.1) \n    509 (31.5) \n     \n  \n  \n    sex (%) \n    Male \n    1009 (45.4) \n    648 (40.1) \n    0.107 \n  \n  \n     \n    Female \n    1214 (54.6) \n    968 (59.9) \n     \n  \n  \n    education (%) \n    Less than high school \n    322 (14.5) \n    248 (15.3) \n    0.242 \n  \n  \n     \n    High school \n    951 (42.8) \n    860 (53.2) \n     \n  \n  \n     \n    College graduate or above \n    950 (42.7) \n    508 (31.4) \n     \n  \n  \n    race (%) \n    White \n    933 (42.0) \n    677 (41.9) \n    0.452 \n  \n  \n     \n    Black \n    302 (13.6) \n    367 (22.7) \n     \n  \n  \n     \n    Hispanic \n    453 (20.4) \n    424 (26.2) \n     \n  \n  \n     \n    Others \n    535 (24.1) \n    148 (9.2) \n     \n  \n  \n  
  marital (%) \n    Never married \n    274 (12.3) \n    196 (12.1) \n    0.115 \n  \n  \n     \n    Married/with partner \n    1432 (64.4) \n    964 (59.7) \n     \n  \n  \n     \n    Other \n    517 (23.3) \n    456 (28.2) \n     \n  \n  \n    income (%) \n    less than $20,000 \n    364 (16.4) \n    300 (18.6) \n    0.184 \n  \n  \n     \n    $20,000 to $74,999 \n    984 (44.3) \n    821 (50.8) \n     \n  \n  \n     \n    $75,000 and Over \n    875 (39.4) \n    495 (30.6) \n     \n  \n  \n    born (%) \n    Born in US \n    1342 (60.4) \n    1170 (72.4) \n    0.257 \n  \n  \n     \n    Other place \n    881 (39.6) \n    446 (27.6) \n     \n  \n  \n    year (%) \n    NHANES 2013-2014 public release \n    1026 (46.2) \n    703 (43.5) \n    0.090 \n  \n  \n     \n    NHANES 2015-2016 public release \n    305 (13.7) \n    195 (12.1) \n     \n  \n  \n     \n    NHANES 2017-2018 public release \n    892 (40.1) \n    718 (44.4) \n     \n  \n  \n    diabetes.family.history (%) \n    No \n    1900 (85.5) \n    1251 (77.4) \n    0.208 \n  \n  \n     \n    Yes \n    323 (14.5) \n    365 (22.6) \n     \n  \n  \n    medical.access (%) \n    No \n    150 (6.7) \n    71 (4.4) \n    0.103 \n  \n  \n     \n    Yes \n    2073 (93.3) \n    1545 (95.6) \n     \n  \n  \n    smoking (%) \n    Never smoker \n    1350 (60.7) \n    943 (58.4) \n    0.095 \n  \n  \n     \n    Previous smoker \n    576 (25.9) \n    484 (30.0) \n     \n  \n  \n     \n    Current smoker \n    297 (13.4) \n    189 (11.7) \n     \n  \n  \n    diet.healthy (%) \n    Poor or fair \n    436 (19.6) \n    615 (38.1) \n    0.487 \n  \n  \n     \n    Good \n    904 (40.7) \n    650 (40.2) \n     \n  \n  \n     \n    Very good or excellent \n    883 (39.7) \n    351 (21.7) \n     \n  \n  \n    physical.activity (%) \n    No \n    1901 (85.5) \n    1317 (81.5) \n    0.108 \n  \n  \n     \n    Yes \n    322 (14.5) \n    299 (18.5) \n     \n  \n  \n    sleep (mean (SD)) \n     \n    7.40 (1.48) \n    7.30 (1.60) \n    
0.067 \n  \n  \n    uric.acid (mean (SD)) \n     \n    5.25 (1.38) \n    5.81 (1.53) \n    0.383 \n  \n  \n    protein.total (mean (SD)) \n     \n    7.08 (0.46) \n    7.06 (0.45) \n    0.049 \n  \n  \n    bilirubin.total (mean (SD)) \n     \n    0.58 (0.29) \n    0.52 (0.33) \n    0.193 \n  \n  \n    phosphorus (mean (SD)) \n     \n    3.74 (0.55) \n    3.68 (0.58) \n    0.109 \n  \n  \n    sodium (mean (SD)) \n     \n    139.83 (2.62) \n    139.74 (2.74) \n    0.031 \n  \n  \n    potassium (mean (SD)) \n     \n    4.05 (0.39) \n    4.07 (0.39) \n    0.046 \n  \n  \n    globulin (mean (SD)) \n     \n    2.87 (0.46) \n    3.00 (0.46) \n    0.289 \n  \n  \n    calcium.total (mean (SD)) \n     \n    9.41 (0.39) \n    9.34 (0.40) \n    0.166 \n  \n  \n    systolicBP (mean (SD)) \n     \n    126.07 (19.52) \n    129.23 (17.72) \n    0.169 \n  \n  \n    diastolicBP (mean (SD)) \n     \n    70.17 (11.59) \n    72.35 (11.82) \n    0.186 \n  \n  \n    high.cholesterol (%) \n    No \n    1137 (51.1) \n    756 (46.8) \n    0.087 \n  \n  \n     \n    Yes \n    1086 (48.9) \n    860 (53.2) \n     \n  \n\n\n\n\n\n\n\n\n2.2.2 Step 1: Specifying PS & fit model\nWe build the propensity score model in this data using the investigator-specified covariates.\n\n\n\n\n\n\n\n\n\n\n\nC = investigator-specified covariates.\n\nIf you are somewhat unfamiliar with propensity score paradigm, look at tutorials dedicated towards that topic. 
There are additional tutorials also talking about propensity score weighting.\n\n\n2.2.2.1 PS model specification\nNow let us create the propensity score formula with the investigator-specified covariates:\n\ncovform <- paste0(investigator.specified.covariates, collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", covform))\nps.formula\n#> exposure ~ age.cat + sex + education + race + marital + income + \n#>     born + year + diabetes.family.history + medical.access + \n#>     smoking + diet.healthy + physical.activity + sleep + uric.acid + \n#>     protein.total + bilirubin.total + phosphorus + sodium + potassium + \n#>     globulin + calcium.total + systolicBP + diastolicBP + high.cholesterol\n\n\n\n\nOnly use investigator specified covariates to build the formula.\nDuring the construction of the propensity score model, researchers should consider incorporating additional model specifications, such as interactions and polynomials, if they are deemed necessary.\n\n\n\n2.2.2.2 Fit the PS model\n\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n                    data = hdps.data, \n                    estimand = \"ATE\",\n                    method = \"ps\")\n\n\n\n\nUse that formula to estimate propensity scores.\nIn this demonstration, we did not use stabilize = TRUE. However, stabilized propensity score weights often reduce the variance of treatment effect estimates.\n\n\n\n2.2.2.3 Obtain PS\n\nhdps.data$ps <- W.out$ps\nggplot(hdps.data, aes(x = ps, fill = factor(exposure))) +\n  geom_density(alpha = 0.5) +\n  scale_fill_manual(values = c(\"darkblue\", \"darkred\")) +\n  theme_classic()\n\n\n\n\n\n\nCheck propensity score overlap in both exposure groups.\n\n\n\n2.2.3 Step 2: Weighting\nAs mentioned, we only talk about inverse probability weighting in our current context.\n\nhdps.data$w <- W.out$weights\nsummary(hdps.data$w)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
\n#>   1.016   1.277   1.559   2.008   2.144  45.795\n\n\nggplot(hdps.data, aes(x = \"\", y = w)) +\n  geom_boxplot(fill = \"lightblue\", \n               color = \"blue\", \n               size = 1) +\n  geom_text(aes(x = 1, y = max(w), \n                label = paste0(\"Max = \", round(max(w), 2))), \n            vjust = 1.5, \n            hjust = -0.3, \n            size = 4, \n            color = \"red\") +\n  geom_text(aes(x = 1, y = min(w), \n                label = paste0(\"Min = \", round(min(w), 2))), \n            vjust = -2.5, \n            hjust = -0.3, \n            size = 4, \n            color = \"red\") +\n  ggtitle(\"Boxplot of Inverse Probability Weights\") +\n  xlab(\"\") +\n  ylab(\"Weights\") +\n  theme_classic()\n\n\n\n\n\n\n\nCheck the summary statistics of the weights to assess whether there are extreme weights. Less extreme weights now?\n\n\n\n2.2.4 Step 3: Covariate balance\n\nrequire(cobalt)\nlove.plot(x = W.out,\n          thresholds = c(m = .1), \n          var.order = \"unadjusted\",\n          stars = \"raw\")\n\n\n\n\n\n\n\nAssess balance against SMD 0.1. Still balanced?\nPredictive measures such as c-statistics are not helpful in this context (Westreich et al. 2011): “use of the c-statistic as a guide in constructing propensity scores may result in less overlap in propensity scores between treated and untreated subjects”!\n\n\n\n2.2.5 Step 4: Estimating treatment effect\n\n2.2.5.1 Set outcome formula\n\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nout.formula\n#> outcome ~ exposure\n\n\n\nWe are again using a crude weighted outcome model here.\n\n\n2.2.5.2 Obtain OR\n\nfit <- glm(out.formula,\n            data = hdps.data,\n            weights = W.out$weights,\n            family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n                                 c(\"Estimate\", \n                                   \"Std. 
Error\", \n                                   \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\n\nfit.summary_with_ci.ps <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci.ps,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.64\n0.1\n0\n0.53\n0.75\n\n\n\n\n\n\n\n\n\n\n\n\n2.2.5.3 Obtain RD\n\nfit <- glm(out.formula,\n            data = hdps.data,\n            weights = W.out$weights,\n            family= gaussian(link = \"identity\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n                                 c(\"Estimate\", \n                                   \"Std. Error\", \n                                   \"Pr(>|t|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci.ps.rd <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci.ps.rd,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|t|)\n2.5 %\n97.5 %\n\n\n\n\n0.11\n0.02\n0\n0.09\n0.14\n\n\n\n\n\n\n\n\n\n\n\n\nWestreich, Daniel, Stephen R Cole, Michele Jonsson Funk, M Alan Brookhart, and Til Stürmer. 2011. “The Role of the c-Statistic in Variable Selection for Propensity Score Models.” Pharmacoepidemiology and Drug Safety 20 (3): 317–20."
  },
  {
    "objectID": "proxy.html#measuring-comorbidity-burden",
    "href": "proxy.html#measuring-comorbidity-burden",
    "title": "3  Reducing residual confounding",
    "section": "3.1 Measuring comorbidity burden",
    "text": "3.1 Measuring comorbidity burden\nIn health research, overall health status or disease burden could be a potential confounding factor. In the original DAG, we had comorbidity as a known confounder.\n\n\n\n\n\nflowchart TB\n  A[Obesity] --> Y(Diabetes)\n  L[Comorbidity measure unobserved] --> Y\n  L --> A\n  style A fill:#90EE90;\n  style Y fill:#ADD8E6;\n  style L fill:#FF0000;\n\n\n\n\n\n\n\n\n\n\nCharlson Comorbidity Index (CCI) is a measure that quantifies the burden of comorbidities or pre-existing medical conditions in patients (takes into account 17 comorbidities), which can impact their health outcomes and overall survival.\nElixhauser Comorbidity Index (ECI) is a measure of the burden of comorbidities, based on 30 different comorbid conditions.\nChronic Disease Score (CDS) is a weighted score of the number and severity of chronic diseases, calculated using self-reported data on diagnosed conditions (considers the presence of 21 chronic conditions).\n\n\n\n(Charlson et al. 1987; Elixhauser et al. 1998; Von Korff, Wagner, and Saunders 1992)\nNHANES does not include information on all of the comorbidities included in these scores / indices.\n\n\n\n\n\n\n\nResidual confounding\n\n\n\nComorbidity scores are widely used as a measure of comorbidity burden, but their calculation often relies on data that may not be available in certain contexts, such as in NHANES or Canadian health administrative databases. In such cases, when comorbidity burden is a known confounder, researchers may use proxy information to approximate the missing information. Not being able to adjust for such a variable can introduce bias and residual confounding in the treatment effect estimation.\n\n\n\n\n\n(Schneeweiss and Maclure 2000; L. Lix et al. 2011; L. M. Lix et al. 2013)"
  },
  {
    "objectID": "proxy.html#proxy-adjustment-empirical-criterion",
    "href": "proxy.html#proxy-adjustment-empirical-criterion",
    "title": "3  Reducing residual confounding",
    "section": "3.2 Proxy Adjustment Empirical criterion",
    "text": "3.2 Proxy Adjustment Empirical criterion\nEmpirical criterion: Modified disjunctive cause criterion\nVanderWeele et al. 2019 European Journal of Epidemiology: CC BY license\n\n\n\n\n\nHypothesized directed acyclic graph with the comorbidity measure unmeasured and approximated by simple count measures based on the ICD codes\n\n\n\n\n\n\n(a) Adjust for variables that are causes of the exposure, the outcome, or both; (b) discard known instruments; and (c) include good proxies for unmeasured common causes (VanderWeele 2019)"
  },
  {
    "objectID": "proxy.html#additional-information-icd-10-cm",
    "href": "proxy.html#additional-information-icd-10-cm",
    "title": "3  Reducing residual confounding",
    "section": "3.3 Additional information: ICD-10-CM",
    "text": "3.3 Additional information: ICD-10-CM\n\n\nThe International Classification of Diseases 10th Revision (ICD-10) is a standardized system of codes for the classification of diseases, disorders, and injuries.\n\n\n\nRole\nData Source\nVariables considered\n\n\n\n\nRole unclear as they may not directly relate to the research question\nRXQ_RX\nPrescription medication ICD-10-CM code\n\n\n\n\n\nRXQ_RX questionnaire (a) collects information on prescription medications taken in the past 30 days, (b) conducted by trained interviewers, and (c) with some quality control efforts.\n\n\n\n\n\nExamples of ICD-10-CM codes (3-7 characters, 1st character being alpha, 2-end are numberic, often with a dot) assigned to reasons for using medication (see Appendix in NHANES RXQ_RX component)\n\n\n\n\n\n\nWe have a lot of information through these ICD-10-CM codes, but for most of these information, it is unclear what role they play within the context of our research questions.\nCount of prescriptions is often used to measure comorbidity burden. This is not a perfect measure. But could serve as a proxy for our purpose.\n\n\n\n\nPrescription medication (ICD-10-CM codes from all 3 cycles) data was liked with the initial data.\n\n\nCharlson, Mary E, Peter Pompei, Kathy L Ales, and C Ronald MacKenzie. 1987. “A New Method of Classifying Prognostic Comorbidity in Longitudinal Studies: Development and Validation.” Journal of Chronic Diseases 40 (5): 373–83.\n\n\nElixhauser, Anne, Claudia Steiner, D Robert Harris, and Rosanna M Coffey. 1998. “Comorbidity Measures for Use with Administrative Data.” Medical Care, 8–27.\n\n\nLix, Lisa M, Jacqueline Quail, Opeyemi Fadahunsi, and Gary F Teare. 2013. “Predictive Performance of Comorbidity Measures in Administrative Databases for Diabetes Cohorts.” BMC Health Services Research 13: 1–12.\n\n\nLix, LM, J Quail, G Teare, and B Acan. 2011. 
“Performance of Comorbidity Measures for Predicting Outcomes in Population-Based Osteoporosis Cohorts.” Osteoporosis International 22: 2633–43.\n\n\nSchneeweiss, Sebastian, and Malcolm Maclure. 2000. “Use of Comorbidity Scores for Control of Confounding in Studies Using Administrative Databases.” International Journal of Epidemiology 29 (5): 891–98.\n\n\nVanderWeele, Tyler J. 2019. “Principles of Confounder Selection.” European Journal of Epidemiology 34: 211–19.\n\n\nVon Korff, Michael, Edward H Wagner, and Kathleen Saunders. 1992. “A Chronic Disease Score from Automated Pharmacy Data.” Journal of Clinical Epidemiology 45 (2): 197–203."
  },
  {
    "objectID": "hdps.html#origin",
    "href": "hdps.html#origin",
    "title": "High-dimensional Propensity score",
    "section": "Origin",
    "text": "Origin\n\n\n\n\n\n\n\n\n\n\n(Schneeweiss et al. 2009)"
  },
  {
    "objectID": "hdps.html#key-idea",
    "href": "hdps.html#key-idea",
    "title": "High-dimensional Propensity score",
    "section": "Key idea",
    "text": "Key idea\nSchneeweiss et al. 2009 extended to a variety of classifications to code diagnoses (ICD), procedure (CPT), medications (eg, NDC, AHFS, ATCC), or others (PCP, LOINC).\n\n\n\n\n\n\n\n\n\n\n\nCPT-4 (Current Procedural Terminology, 4th edition), ICD-9 (International Classification of Diseases, 9th edition), PCP visits (Primary Care Physician visits), NDC (National Drug Code), and ATC (Anatomical Therapeutic Chemical classification) are all codes or measures commonly used in healthcare and medical research.\nSchneeweiss et al. 2018 Clinical Epidemiology: CC BY license\n\n\n(Schneeweiss 2018)\n\n\n\n\n\n\nAdjust useful proxies\n\n\n\nIn administrative data sources, the main idea of hdPS (high-dimensional propensity score) is to adjust for proxies that are empirically associated with the outcome of interest, which may not be directly measured in the data.\n\n\n\n\nWith hdPS, users do not need to know which unmeasured confounders are being adjusted for by proxy information.\n\nAdjusting for something that may not be interpretable directly with the context of the research question.\nLogic: measures from same subject should be correlated = has relevant proxy information\n\n\n\n\n\nSchneeweiss, Sebastian. 2018. “Automated Data-Adaptive Analytics for Electronic Healthcare Data to Study Causal Treatment Effects.” Clinical Epidemiology, 771–88.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512."
  },
  {
    "objectID": "step1.html#data-with-investigator-specified-variables",
    "href": "step1.html#data-with-investigator-specified-variables",
    "title": "4  Step 1: Proxy sources",
    "section": "4.1 Data with investigator-specified variables",
    "text": "4.1 Data with investigator-specified variables\n\n\n\n\n\n\nData: part 1\n\n\n\nWe will work with the data.complete data for the investigator-specified information.\n\n\n\nanalytic <- data.complete\nidx <- analytic$id\noutcome <- as.numeric(analytic$diabetes == \"Yes\") \nexposure <- as.numeric(analytic$obese == \"Yes\")\ndomain <- \"dx\"\nanalytic.dfx <- as.data.frame(cbind(idx, exposure, outcome, domain))\n\n\n\nWe prepare the minimal analytic data only with the following 4 information:\n\nidentifying information (idx)\nexposure (obese)\noutcome (diabetes)\ndomain of the codes (dx). In this example we only have prescription domain (1 domain dx)"
  },
  {
    "objectID": "step1.html#proxy-data",
    "href": "step1.html#proxy-data",
    "title": "4  Step 1: Proxy sources",
    "section": "4.2 Proxy data",
    "text": "4.2 Proxy data\n\n4.2.1 Identify the data dimensions (proxy sources)\nIn this example we only have prescription domain (1 domain dx of ICD-10-CM code). Hence \\(p = 1\\) in this exercise.\n\n\nNHANES Questionnaire collects information on: (a) dietary supplements, (b) nonprescription antacids, (c) prescription medications, and (d) preventive aspirin use.\n\n\n4.2.2 Define a covariate assessment period (CAP)\n\n\n\n\n\n\n\n\n\n\n\n(Connolly et al. 2019; Schneeweiss et al. 2009)\nWe only collect proxy information from a well-defined CAP. In our case, it was \\(30\\) days.\n\n\nNHANES asked “In the past 30 days, have you used or taken medication for which a prescription is needed? Do not include prescription vitamins or minerals you may have already told me about.”\n\n\n\n\n\n\nData: part 2\n\n\n\nWe will work with the merge proxy data (ICD-10 codes) from 3 cycles: dat.proxy.long.\n\n\n\n\n4.2.3 Omit duplicated information\n\n\nWe need to delete codes that could be close proxies of exposure and/or outcome, or other investigator specified covariates we have already selected in step0.\n\n\n\n\n\n\n\n\n\n\ndat.proxy.long <- subset(dat.proxy.long, \n                         icd10 != \"E66\") # Overweight and obesity\ndat.proxy.long <- subset(dat.proxy.long, \n                         icd10 != \"O24\") # Gestational diabetes mellitus\ndat.proxy.long <- subset(dat.proxy.long, \n                         icd10 != \"E10\") # Type 1 diabetes mellitus\ndat.proxy.long <- subset(dat.proxy.long, \n                         icd10 != \"E11\") # Type 2 diabetes mellitus\n\n\n\n\nWe delete codes associated with exposure and outcome.\nSame should be done for any other proxies that may have duplicating information compared to the investigator-specified covariates.\n\n\n\n4.2.4 Long format proxy data\n\n\n\n\n\nHere is an example of 3 digit codes for 1 patient with subject ID “100001”. 
We create the same for all patients.\n\n\n\n\n \n  \n    ID \n    ICD 10 codes (3 digit) \n    Description \n  \n \n\n  \n    100001 \n    F33 \n    Major depressive disorder, recurrent \n  \n  \n    100001 \n    I10 \n    Hypertension \n  \n  \n    100001 \n    M62 \n    Muscle spasm \n  \n  \n    100001 \n    F32 \n    Major depressive disorder, single episode \n  \n  \n    100001 \n    M25 \n    Joint disorder/pain \n  \n  \n    100001 \n    K21 \n    Gastro-esophageal reflux disease \n  \n  \n    100001 \n    M79 \n    musculoskeletal pain conditions \n  \n  \n    100001 \n    R12 \n    Heartburn"
  },
  {
    "objectID": "step1.html#merge-proxy-data-with-analytic-data",
    "href": "step1.html#merge-proxy-data-with-analytic-data",
    "title": "4  Step 1: Proxy sources",
    "section": "4.3 Merge Proxy data with Analytic data",
    "text": "4.3 Merge Proxy data with Analytic data\n\n\n\n\n\n\nMerged Data: parts 1 and 2\n\n\n\n\nWe will work with the merge proxy data with analytic data.\nThat will provide us with the IDs (idx) of the subject that have proxy (ICD-10) information associated with them.\n\n\n\n\nrequire(dplyr) \ndfx <- merge(analytic.dfx, proxy.var.long, by = \"idx\")\nhead(dfx)\n\n\n\n  \n\n\nbasetable <- dfx %>% select(idx, exposure, outcome) %>% distinct()\npatientIds <- basetable$idx\nlength(patientIds)\n#> [1] 3839\n\n\n\n\n\n\n\n\nConnolly, John G, Sebastian Schneeweiss, Robert J Glynn, and Joshua J Gagne. 2019. “Quantifying Bias Reduction with Fixed-Duration Versus All-Available Covariate Assessment Periods.” Pharmacoepidemiology and Drug Safety 28 (5): 665–70.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512."
  },
  {
    "objectID": "step2.html#sort-by-prevalence",
    "href": "step2.html#sort-by-prevalence",
    "title": "5  Step 2: Empirical",
    "section": "5.1 Sort by prevalence",
    "text": "5.1 Sort by prevalence\nCheck out the frequency of each codes:\n\nlibrary(dplyr)\ndf <- data.frame(\n  icd10 = names(sort(table(dfx$icd10), decreasing = TRUE)),\n  count = sort(table(dfx$icd10), decreasing = TRUE)\n)\n\n\n\n\n\nICD10 Code Frequencies\n \n  \n    ICD10 Code \n    Count \n  \n \n\n  \n    I10 \n    2775 \n  \n  \n    E78 \n    1517 \n  \n  \n    F32 \n    536 \n  \n  \n    F41 \n    524 \n  \n  \n    K21 \n    441 \n  \n  \n    M79 \n    401 \n  \n  \n    E03 \n    397 \n  \n  \n    M54 \n    314 \n  \n  \n    G47 \n    307 \n  \n  \n    J45 \n    301 \n  \n\n\n\n\n\n\n\nOnly top 10 prevalent codes are shown.\nHowever, some may be associated with lower counts (e.g., less than 20).\n\n\n\n\n\n\nRestrictions\n\n\n\nCandidate empirical covariates list is constrained by\n\ntheir prevalence of codes. Only top n covariates with highest prevalence would be chosen.\nanalysts absolutely need to get rid of the codes that have zero variance (e.g., everyone has the code, or nobody has it).\ncodes associated with very low prevalence are also numerically problematic for further analyses.\n\n\n\n\n\nWe choose n = 200 [for (1)] as it was proposed in the original algorithm (Schneeweiss et al. 2009). In reality, this is not necessary to be so restrictive (Schuster, Pang, and Platt 2015). Parts (2) and (3) are more likely and addressed by the following restriction: At least min_num_patients number of patients need to have that code to be selected in the list.\nIf there were more dimensions, separate list of candidate empirical covariates would be identified."
  },
  {
    "objectID": "step2.html#choose-granularity",
    "href": "step2.html#choose-granularity",
    "title": "5  Step 2: Empirical",
    "section": "5.2 Choose Granularity",
    "text": "5.2 Choose Granularity\nOne important point here is that we have chosen granularity to be 3 digits in the ICD-10 code.\n\n\nWe have already truncated the codes at 3 digit level while preparing the data."
  },
  {
    "objectID": "step2.html#retain-top-n-empirical-covariates",
    "href": "step2.html#retain-top-n-empirical-covariates",
    "title": "5  Step 2: Empirical",
    "section": "5.3 Retain top n empirical covariates",
    "text": "5.3 Retain top n empirical covariates\n\nrequire(autoCovariateSelection)\nstep1 <- get_candidate_covariates(df = dfx,  \n                                  domainVarname = \"domain\",\n                                  eventCodeVarname = \"icd10\", \n                                  patientIdVarname = \"idx\",\n                                  patientIdVector = patientIds,\n                                  n = 200, \n                                  min_num_patients = 20)\n\n\n\nYou can use autoCovariateSelection package to implement these restrictions (Robert 2020).\n\n5.3.1 Long format data\n\nout1 <- step1$covars_data\nhead(out1)\n\n\n\n  \n\n\n\n\n\n5.3.2 Updated frequency data\n\ndf2 <- data.frame(\n  icd10 = names(table(out1$icd10)),\n  count = as.numeric(table(out1$icd10))\n)\n\n\n\n\n\n \n  \n    ICD10 Code \n    Count \n  \n \n\n  \n    dx_A49 \n    28 \n  \n  \n    dx_B00 \n    20 \n  \n  \n    dx_B35 \n    22 \n  \n  \n    dx_C50 \n    31 \n  \n  \n    dx_D75 \n    136 \n  \n  \n    dx_E03 \n    397 \n  \n\n\n\n\n\n\n\nOnly first few code frequencies are shown (alphabetic order), that were selected based on the restrictions n = 200 and min_num_patients = 20.\n\n\n\n\n \n  \n     \n    ICD10 Code \n    Count \n  \n \n\n  \n    77 \n    dx_R52 \n    40 \n  \n  \n    78 \n    dx_R60 \n    187 \n  \n  \n    79 \n    dx_R73 \n    202 \n  \n  \n    80 \n    dx_T14 \n    82 \n  \n  \n    81 \n    dx_T78 \n    96 \n  \n  \n    82 \n    dx_Z79 \n    277 \n  \n\n\n\n\n\n\n\nOnly last few code frequencies are shown (alphabetic order).\n\n\n5.3.3 Total number of codes retained\n\nnrow(df2)\n#> [1] 82\n\n\n\n\n\n\n\n\nRobert, Dennis. 2020. autoCovariateSelection: Automatic Covariate Selection. https://CRAN.R-project.org/package=autoCovariateSelection.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. 
“High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.\n\n\nSchuster, Tibor, Menglan Pang, and Robert W Platt. 2015. “On the Role of Marginal Confounder Prevalence–Implications for the High-Dimensional Propensity Score Algorithm.” Pharmacoepidemiology and Drug Safety 24 (9): 1004–7."
  },
  {
    "objectID": "step3.html#genrate-recurrence-covariates",
    "href": "step3.html#genrate-recurrence-covariates",
    "title": "6  Step 3: Recurrence",
    "section": "6.1 Genrate recurrence covariates",
    "text": "6.1 Genrate recurrence covariates\n\n\n\n\n\n(Schneeweiss et al. 2009)\nIn this step, we generate 3 binary recurrence covariates for each of the candidate empirical covariates identified in the previous step:\n\noccurred at least once\noccurred sporadically (at least more than the median)\noccurred frequently (at least more than the 75th percentile)\n\n\nstep2 <- get_recurrence_covariates(df = out1, \n                                   patientIdVarname = \"idx\",\n                                   eventCodeVarname = \"icd10\", \n                                   patientIdVector = patientIds)"
  },
  {
    "objectID": "step3.html#example-of-recurrence-covariates",
    "href": "step3.html#example-of-recurrence-covariates",
    "title": "6  Step 3: Recurrence",
    "section": "6.2 Example of recurrence covariates",
    "text": "6.2 Example of recurrence covariates\n\n\n\n\n\n\n\n\n\n\n\nICD-10-CM code (dimension 1)\ncode appeared at least once\ncode appeared at least more than the median\ncode appeared at least more than the 75th percentile\n\n\n\n\nD64.9 Anemia\nrec_dx_D64_once\nrec_dx_D64_sporadic\nrec_dx_D64_frequent\n\n\nD75.9P Blood clots\nrec_dx_D75_once\nrec_dx_D75_sporadic\nrec_dx_D75_frequent\n\n\nD89.9 Immune disorder\nrec_dx_D89_once\nrec_dx_D89_sporadic\nrec_dx_D89_frequent\n\n\n\\(\\ldots\\)\n\\(\\ldots\\)\n\\(\\ldots\\)\n\\(\\ldots\\)\n\n\nE07.9 Disorder of thyroid\nrec_dx_E07_once\nrec_dx_E07_sporadic\nrec_dx_E07_frequent\n\n\n\n\n\nExample of 3 binary covariates (hypothetical) created based on the candidate empirical covariates."
  },
  {
    "objectID": "step3.html#recurrence-covariates-in-the-data",
    "href": "step3.html#recurrence-covariates-in-the-data",
    "title": "6  Step 3: Recurrence",
    "section": "6.3 Recurrence covariates in the data",
    "text": "6.3 Recurrence covariates in the data\n\nout2 <- step2$recurrence_data\nncol(out2)-1\n#> [1] 91\n\n\n\n\n\n  \n\n\n\n\n\nHere we show binary recurrence covariates for only 2 columns"
  },
  {
    "objectID": "step3.html#refined-recurrence-covariates",
    "href": "step3.html#refined-recurrence-covariates",
    "title": "6  Step 3: Recurrence",
    "section": "6.4 Refined recurrence covariates",
    "text": "6.4 Refined recurrence covariates\nBelow you can click to see a list of all recurrence covariates obtained in our data.\n\n\nShow/Hide Table\n\n\n\n\nICD-10 Recurrence Data\n\n\n1\nrec_dx_A49_once\nrec_dx_B00_once\nrec_dx_B35_once\n\n\n2\nrec_dx_C50_once\nrec_dx_D75_once\nrec_dx_E03_once\n\n\n3\nrec_dx_E04_once\nrec_dx_E07_once\nrec_dx_E78_once\n\n\n4\nrec_dx_E87_once\nrec_dx_F31_once\nrec_dx_F31_frequent\n\n\n5\nrec_dx_F32_once\nrec_dx_F39_once\nrec_dx_F41_once\n\n\n6\nrec_dx_F43_once\nrec_dx_F90_once\nrec_dx_G25_once\n\n\n7\nrec_dx_G40_once\nrec_dx_G40_frequent\nrec_dx_G43_once\n\n\n8\nrec_dx_G47_once\nrec_dx_H04_once\nrec_dx_H40_once\n\n\n9\nrec_dx_H40_frequent\nrec_dx_I10_once\nrec_dx_I10_frequent\n\n\n10\nrec_dx_I20_once\nrec_dx_I21_once\nrec_dx_I48_once\n\n\n11\nrec_dx_I48_frequent\nrec_dx_I49_once\nrec_dx_I50_once\n\n\n12\nrec_dx_I50_frequent\nrec_dx_I51_once\nrec_dx_I63_once\n\n\n13\nrec_dx_J30_once\nrec_dx_J42_once\nrec_dx_J44_once\n\n\n14\nrec_dx_J44_frequent\nrec_dx_J45_once\nrec_dx_J45_frequent\n\n\n15\nrec_dx_K04_once\nrec_dx_K08_once\nrec_dx_K21_once\n\n\n16\nrec_dx_K25_once\nrec_dx_K27_once\nrec_dx_K30_once\n\n\n17\nrec_dx_K59_once\nrec_dx_K92_once\nrec_dx_L40_once\n\n\n18\nrec_dx_L70_once\nrec_dx_M06_once\nrec_dx_M06_frequent\n\n\n19\nrec_dx_M10_once\nrec_dx_M13_once\nrec_dx_M19_once\n\n\n20\nrec_dx_M1A_once\nrec_dx_M25_once\nrec_dx_M54_once\n\n\n21\nrec_dx_M62_once\nrec_dx_M79_once\nrec_dx_M81_once\n\n\n22\nrec_dx_N28_once\nrec_dx_N32_once\nrec_dx_N39_once\n\n\n23\nrec_dx_N40_once\nrec_dx_N92_once\nrec_dx_N94_once\n\n\n24\nrec_dx_N95_once\nrec_dx_R00_once\nrec_dx_R05_once\n\n\n25\nrec_dx_R06_once\nrec_dx_R07_once\nrec_dx_R09_once\n\n\n26\nrec_dx_R10_once\nrec_dx_R11_once\nrec_dx_R12_once\n\n\n27\nrec_dx_R25_once\nrec_dx_R32_once\nrec_dx_R35_once\n\n\n28\nrec_dx_R39_once\nrec_dx_R41_once\nrec_dx_R42_once\n\n\n29\nrec_dx_R51_once\nrec_dx_R52_once\nrec_dx_R60_once\n\n\n30\nrec_dx_R73_once\nrec_dx_T14_once\nrec_dx_T78_once\n\n\n31\nrec_dx_Z
79_once\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTip\n\n\n\n\nGiven one dimension of proxy data, \\(p=1\\), at most \\(n=200\\) most prevalent codes (with the restriction that each code has at least 20 patients), and \\(3\\) intensity levels, we could theoretically have at most \\(p \\times n \\times 3 = 1 \\times 200 \\times 3 = 600\\) recurrence covariates.\n\n\n\n\nBased on all of the restrictions, we created 91 distinct recurrence covariates.\nThe merged data (analytic and proxies) size is now 7,585.\n\n\n\n\nIf 2 or all 3 recurrence covariates are identical, only one distinct recurrence covariate is returned. This is why you do not see any sporadic recurrence covariate here.\nRecurrence covariates are created for each patient, and the same code can occur multiple times because we are working at a 3 digit granularity (medications recorded under different codes can fall within the same 3 digit ICD-10 category).\n\n\n\n\n\n\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512."
  },
  {
    "objectID": "step4.html#bross-formula",
    "href": "step4.html#bross-formula",
    "title": "7  Step 4: Prioritize",
    "section": "7.1 Bross formula",
    "text": "7.1 Bross formula\nWe need to make an educated guess about 3 components (i.e., make an assumption), that are used in the calculation of bias contributed by not adjusting for a covariate based on Bross (1966) formula:\n\n\nBross formula (Bross 1966; Schneeweiss 2006) for the Bias Multiplier considers both the imbalance in the prevalence of the unmeasured confounder between the exposure groups and the association between the confounder and the outcome to assess the potential bias.\n\nprevalence of a binary unmeasured confounder (\\(U\\)) among exposed (\\(P_{UA_1}\\))\nprevalence of that binary unmeasured confounder among unexposed (\\(P_{UA_0}\\))\nassociation between that binary unmeasured confounder and the outcome (\\(RR_{UY} = \\frac{P_{UY_1}}{P_{UY_1}}\\))\n\nThe above components can help us calculate \\(bias\\) amount (known as ‘Bias Multiplier’) using the Bross formula when we omit adjusting for \\(U\\):\n\\[\\text{Bias}_U = \\frac{P_{UA_1} (RR_{UY} - 1) + 1}{P_{UA_0} (RR_{UY} - 1) + 1}\\]\n\n\nThese are the ingredients of the Bross formula. This formula is helpful for understanding the impact of unmeasured confounding of a binary variable. We have to put assumed prevalence and risk ratio associated with an unmeasured confounder."
  },
  {
    "objectID": "step4.html#calculating-bias-from-a-recurrence-covariate",
    "href": "step4.html#calculating-bias-from-a-recurrence-covariate",
    "title": "7  Step 4: Prioritize",
    "section": "7.2 Calculating bias from a recurrence covariate",
    "text": "7.2 Calculating bias from a recurrence covariate\nFor recurrence covariates (\\(R\\)), we do not need to assume, we just plug-in \\(R\\) instead of \\(U\\) in the following calculations:\n\nprevalence of a binary recurrence variable among exposed (\\(P_{RA_1}\\))\nprevalence of that binary recurrence variable among unexposed (\\(P_{RA_0}\\))\nassociation between that binary recurrence variable and the outcome (\\(RR_{RY} = \\frac{P_{RY_1}}{P_{RY_1}}\\))\n\nThese components can help us empirically calculate \\(bias\\) amount:\n\\[\\text{Bias}_R = \\frac{P_{RA_1} (RR_{RY} - 1) + 1}{P_{RA_0} (RR_{RY} - 1) + 1}\\]\nHere, \\(RR_{RY}\\) is the crude risk ratio between the recurrence covariate and the outcome, \\(Y\\) is the outcome, \\(A\\) is the exposure, and \\(R\\) is a recurrence covariate.\n\n\nFor recurrence covariates, we do not need to assume, we can basically calculate these numbers (\\(log-absolute-bias\\)) for all of the recurrence covariates (Schneeweiss et al. 2009). For each data dimension, we can rank each of the recurrence covariates based on the amount of bias (confounding or imbalance) it could likely adjust."
  },
  {
    "objectID": "step4.html#calculating-bias-from-all-recurrence-covariates",
    "href": "step4.html#calculating-bias-from-all-recurrence-covariates",
    "title": "7  Step 4: Prioritize",
    "section": "7.3 Calculating bias from all recurrence covariates",
    "text": "7.3 Calculating bias from all recurrence covariates\nIn our example, we simply plug-in each recurrence covariates one-by-one to calculate \\(log-absolute-bias\\):\n\n\n\n\n\nR=rec_dx_D64_once\n\n\nR=rec_dx_D75_sporadic\n\n\n…\n\n\nR=rec_dx_E07_frequent"
  },
  {
    "objectID": "step4.html#obtain-log-of-absolute-bias",
    "href": "step4.html#obtain-log-of-absolute-bias",
    "title": "7  Step 4: Prioritize",
    "section": "7.4 Obtain log of absolute-bias",
    "text": "7.4 Obtain log of absolute-bias\nWe calculate \\(log-absolute-bias\\) for all recurrence covariates.\n\n\nAbsolute log of the Bias Multiplier, \\(log-absolute-bias\\), is a symmetric measure of the potential bias introduced by the recurrence covariate, making it easier to compare and rank recurrence covariates.\n\nout3 <- get_prioritised_covariates(df = out2,\n                                   patientIdVarname = \"idx\", \n                                   exposureVector = basetable$exposure,\n                                   outcomeVector = basetable$outcome,\n                                   patientIdVector = patientIds, \n                                   k = 50)\nsorted_values <- sort(out3$multiplicative_bias, \n                      decreasing = TRUE)\n\nThis would return absolute log of the multiplicative bias for each recurrence covariate (by univariate Bross formula). We can use this information to prioritize recurrence covariates in the next step."
  },
  {
    "objectID": "step4.html#convert-to-absolute-log-of-multiplicative-bias",
    "href": "step4.html#convert-to-absolute-log-of-multiplicative-bias",
    "title": "7  Step 4: Prioritize",
    "section": "7.5 Convert to Absolute log of multiplicative bias",
    "text": "7.5 Convert to Absolute log of multiplicative bias\nHere are the few covariates and associated Absolute log of the multiplicative bias:\n\n\n\n\n \n  \n     \n  \n \n\n  \n    rec_dx_I10_once : 0.124 \n  \n  \n    rec_dx_R73_once : 0.078 \n  \n  \n    rec_dx_I10_frequent : 0.065 \n  \n  \n    rec_dx_R60_once : 0.038 \n  \n  \n    rec_dx_E78_once : 0.036 \n  \n  \n    rec_dx_M79_once : 0.033 \n  \n  \n    rec_dx_I51_once : 0.019 \n  \n  \n    rec_dx_M10_once : 0.017 \n  \n  \n    rec_dx_I50_once : 0.016 \n  \n\n\n\n\n\nAnd here are translated table with description:\n\n\n\n\n \n  \n     \n  \n \n\n  \n    Hypertension : 0.115 \n  \n  \n    Elevated blood glucose level : 0.088 \n  \n  \n    Hypertension : 0.068 \n  \n  \n    Edema : 0.054 \n  \n  \n    Pure hypercholesterolemia : 0.038 \n  \n  \n    musculoskeletal pain : 0.017 \n  \n  \n    Hypokalemia : 0.015 \n  \n  \n    Heart disease : 0.013 \n  \n  \n    Heart failure : 0.011 \n  \n\n\n\n\n\n\n\nSome of the empirical covariates with top Absolute log of the multiplicative bias are actually relevant to the outcome (diabetes): Hypertension, Elevated blood glucose level , etc. (Choi and Shi 2001)\n\n\n\n\n\n\n\n\n\nSMD vs Bias multiplier\n\n\n\nStandardized mean difference (SMD) is useful for assessing the balance in the propensity score literature. However, Bross formula incorporates outcome information. In the investigation of empirical covariates or recurrence covariates where interpretations of these covariates are unknown, it may seem more safe to use the multiplicative bias term from the Bross formula to identify proxy covariates that are helpful in predicting the outcome.\n\n\n\n\n(Stuart, Lee, and Leacy 2013)\n\n\n\n\n\n\n\nBross, Irwin DJ. 1966. “Spurious Effects from an Extraneous Variable.” Journal of Chronic Diseases 19 (6): 637–47.\n\n\nChoi, BCK, and F Shi. 2001. 
“Risk Factors for Diabetes Mellitus by Age and Sex: Results of the National Population Health Survey.” Diabetologia 44: 1221–31.\n\n\nSchneeweiss, Sebastian. 2006. “Sensitivity Analysis and External Adjustment for Unmeasured Confounders in Epidemiologic Database Studies of Therapeutics.” Pharmacoepidemiology and Drug Safety 15 (5): 291–303.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.\n\n\nStuart, Elizabeth A, Brian K Lee, and Finbarr P Leacy. 2013. “Prognostic Score–Based Balance Measures Can Be a Useful Diagnostic for Propensity Score Methods in Comparative Effectiveness Research.” Journal of Clinical Epidemiology 66 (8): S84–90."
  },
  {
    "objectID": "step5.html#ideal-number-of-prioritised-covariates",
    "href": "step5.html#ideal-number-of-prioritised-covariates",
    "title": "8  Step 5: Covariates",
    "section": "8.1 Ideal number of prioritised covariates",
    "text": "8.1 Ideal number of prioritised covariates\nBased on calculated \\(log-absolute-bias\\), we select top k recurrence covariates to be used in the hdPS analyses later. Below is a plot of all of the absolute log of the Bias Multiplier:\n\n\n\n\n\n\n\n\n\nWe used \\(k = 50\\) covariates selected by the hdPS algorithm (we call them ‘hdPS covariates’). What should be the cutpoint?\n\n\n\nAbsolute log of the Bias Multiplier has a null value of 0. Anything above 0 is an indication of confounding bias adjusted by the adjustment of the associated recurrent covariate.\nFor large proxy data sources, \\(k = 500\\) is suggested (Schneeweiss et al. 2009).\nSee Sensitivity Analysis section for an understanding of how to choose a value based on an ad-hoc process."
  },
  {
    "objectID": "step5.html#selected-hdps-variables-proxies",
    "href": "step5.html#selected-hdps-variables-proxies",
    "title": "8  Step 5: Covariates",
    "section": "8.2 Selected hdPS variables (proxies)",
    "text": "8.2 Selected hdPS variables (proxies)\n\nhdps.dim <- out3$autoselected_covariate_df\ndim(hdps.dim) # id + k\n#> [1] 3839   51\nhead(hdps.dim)[,1:3]\n\n\n\n  \n\n\nhdps.dim$id <- hdps.dim$idx\nhdps.dim$idx <- NULL"
  },
  {
    "objectID": "step5.html#investigator-specified-covariates",
    "href": "step5.html#investigator-specified-covariates",
    "title": "8  Step 5: Covariates",
    "section": "8.3 Investigator-specified covariates",
    "text": "8.3 Investigator-specified covariates\n\\(25\\) investigator-specified covariates are selected based on variables in the DAG that are available in the data set.\n\n\nWe should also add necessary interactions of these investigator-specified covariates, or add other useful model-specifications (e.g., polynomials).\n\n\n\n\n\nHypothesized Directed acyclic graph drawn based on analyst’s best understanding of the literature\n\n\n\n\n\n\n\n14 demographic, behavioral, health history related variables/access\n\nMostly categorical\n\n11 lab variables\n\nMostly continuous\n\n\n\nexposure <- \"obese\"\noutcome <- \"diabetes\" \ninvestigator.specified.covariates <- \n  c(# Demographic\n  \"age.cat\", \"sex\", \"education\", \"race\", \n  \"marital\", \"income\", \"born\", \"year\",\n  \n  # health history related variables/access\n  \"diabetes.family.history\", \"medical.access\",\n  \n  # behavioral\n  \"smoking\", \"diet.healthy\", \"physical.activity\", \"sleep\",\n  \n  # Laboratory \n  \"uric.acid\", \"protein.total\", \"bilirubin.total\", \"phosphorus\",\n  \"sodium\", \"potassium\", \"globulin\", \"calcium.total\", \n  \"systolicBP\", \"diastolicBP\", \"high.cholesterol\"\n)\nlength(investigator.specified.covariates)\n#> [1] 25"
  },
  {
    "objectID": "step5.html#merged-data",
    "href": "step5.html#merged-data",
    "title": "8  Step 5: Covariates",
    "section": "8.4 Merged data",
    "text": "8.4 Merged data\n\nload(\"data/analytic3cycles.RData\")\nhdps.data <- merge(data.complete[,c(\"id\",\n                                    outcome, \n                                    exposure, \n                                    investigator.specified.covariates)], \n                       hdps.dim, by = \"id\")\ndim(hdps.data)\n#> [1] 3839   78\n\n\n\nVariable count (78)\n\n1 ID variable\n1 exposure\n1 outcome\n25 investigator-specified covariates\n50 hdPS variables\n\n\n\n\n\n\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512."
  },
  {
    "objectID": "step6.html#hdps-model",
    "href": "step6.html#hdps-model",
    "title": "9  Step 6: Propensity",
    "section": "9.1 hdPS model",
    "text": "9.1 hdPS model\n\n\n\n\n\n\n\n\n\n\n\nC = investigator-specified covariates and EC = hdPS covariates (Schneeweiss et al. 2009)\nThen the hdPS can be used as matching, weighting, stratifying variables, or as covariates (usually in deciles) in outcome model.\n\n\n(Wyss et al. 2022)\n\n9.1.1 Create propensity score formula\n\nhdps.data$exposure <- as.numeric(I(hdps.data$obese=='Yes'))\nhdps.data$outcome <- as.numeric(I(hdps.data$diabetes=='Yes'))\nproxy.list.sel <- names(out3$autoselected_covariate_df[,-1])\nproxyform <- paste0(proxy.list.sel, collapse = \"+\")\ncovform <- paste0(investigator.specified.covariates, collapse = \"+\")\n\n\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", rhsformula))\nps.formula\n#> exposure ~ age.cat + sex + education + race + marital + income + \n#>     born + year + diabetes.family.history + medical.access + \n#>     smoking + diet.healthy + physical.activity + sleep + uric.acid + \n#>     protein.total + bilirubin.total + phosphorus + sodium + potassium + \n#>     globulin + calcium.total + systolicBP + diastolicBP + high.cholesterol + \n#>     rec_dx_I10_once + rec_dx_R73_once + rec_dx_I10_frequent + \n#>     rec_dx_R60_once + rec_dx_E78_once + rec_dx_M79_once + rec_dx_I51_once + \n#>     rec_dx_M10_once + rec_dx_I50_once + rec_dx_K21_once + rec_dx_D75_once + \n#>     rec_dx_Z79_once + rec_dx_F41_once + rec_dx_M1A_once + rec_dx_E87_once + \n#>     rec_dx_R12_once + rec_dx_R51_once + rec_dx_J45_once + rec_dx_I50_frequent + \n#>     rec_dx_L70_once + rec_dx_M25_once + rec_dx_I63_once + rec_dx_R39_once + \n#>     rec_dx_N28_once + rec_dx_K25_once + rec_dx_F90_once + rec_dx_B00_once + \n#>     rec_dx_J42_once + rec_dx_R41_once + rec_dx_I20_once + rec_dx_M54_once + \n#>     rec_dx_J44_once + rec_dx_K08_once + rec_dx_I21_once + rec_dx_F32_once + \n#>     rec_dx_J30_once + rec_dx_F43_once + rec_dx_R06_once + rec_dx_I48_once + \n#>     rec_dx_R32_once + 
rec_dx_R42_once + rec_dx_N92_once + rec_dx_N95_once + \n#>     rec_dx_M19_once + rec_dx_E07_once + rec_dx_R25_once + rec_dx_G43_once + \n#>     rec_dx_R52_once + rec_dx_M81_once + rec_dx_T78_once\n\n\n\nThis is a deliberately simple specification: only main effects are included, with no transformations or interactions.\n\n\n9.1.2 Fit PS model\n\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n                    data = hdps.data, \n                    estimand = \"ATE\",\n                    method = \"ps\")\n\n\n\n9.1.3 Obtain PS\n\nhdps.data$ps <- W.out$ps\n\n\n\n\n\n\n\n\nIt is always a good idea to check propensity score overlap in both exposure groups.\n\n\n9.1.4 Obtain Weights\n\nhdps.data$w <- W.out$weights\n\n\n\n\n\n\n\n\nIt is always a good idea to check the summary statistics of the weights to see whether any weights are extreme.\n\n\n9.1.5 Assessing balance\n\n\n\n\n\n\n\nIt is always a good idea to assess balance. Here we measure against a standardized mean difference (SMD) threshold of 0.1. Use the love.plot function from the cobalt package. Balance diagnostics for propensity score analyses are described in more detail elsewhere.\n\n\n\n\n\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512.\n\n\nWyss, Richard, Chen Yanover, Tal El-Hay, Dimitri Bennett, Robert W Platt, Andrew R Zullo, Grammati Sari, et al. 2022. “Machine Learning for Improving High-Dimensional Proxy Confounder Adjustment in Healthcare Database Studies: An Overview of the Current Literature.” Pharmacoepidemiology and Drug Safety 31 (9): 932–43."
  },
  {
    "objectID": "step7.html",
    "href": "step7.html",
    "title": "10  Step 7: Association",
    "section": "",
    "text": "10.0.1 Set outcome formula\n\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nout.formula\n#> outcome ~ exposure\n\n\n\n10.0.2 Obtain OR from unadjusted model\n\nfit <- glm(out.formula,\n            data = hdps.data,\n            weights = W.out$weights,\n            family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n                                 c(\"Estimate\", \n                                   \"Std. Error\", \n                                   \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.35\n0.12\n0\n0.25\n0.46\n\n\n\n\n\n\n\n\nWe are using a crude outcome model here.\nSomewhat controversial to adjust for all (investigator-specified and all 100 proxies) covariates.\n\n\n\n\n\n\n\n\n10.0.3 Obtain RD from unadjusted model\n\nfit <- glm(out.formula,\n            data = hdps.data,\n            weights = W.out$weights,\n            family= gaussian(link = \"identity\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n                                 c(\"Estimate\", \n                                   \"Std. Error\", \n                                   \"Pr(>|t|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary, 2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|t|)\n2.5 %\n97.5 %\n\n\n\n\n0.06\n0.02\n0\n0.04\n0.09\n\n\n\n\n\n\n\n(Naimi and Whitcomb 2020)\n\n\n\n\n\n\n\nNaimi, Ashley I, and Brian W Whitcomb. 2020. “Estimating Risk Ratios and Risk Differences Using Regression.” American Journal of Epidemiology 189 (6): 508–10."
  },
  {
    "objectID": "sens.html#sensitivity-analysis-for-k",
    "href": "sens.html#sensitivity-analysis-for-k",
    "title": "11  Sensitivity",
    "section": "11.1 Sensitivity analysis for k",
    "text": "11.1 Sensitivity analysis for k\n\n11.1.1 Create propensity score formula\n\n\n\n\n\n\n\nHence we iterate the process (change k parameter in get_prioritised_covariates function in step 4) and obtain odds ratio (exponentiation of log-OR) for each k. We varied k from 10 to 90.\n\n\n\n\n\n\nTip\n\n\n\nFind out where OR estimates stabilizes"
  },
  {
    "objectID": "sens.html#sensitivity-analysis-for-n",
    "href": "sens.html#sensitivity-analysis-for-n",
    "title": "11  Sensitivity",
    "section": "11.2 Sensitivity analysis for n",
    "text": "11.2 Sensitivity analysis for n\n\n\n\n\n\n\n\n\n\nTip\n\n\n\nWe varied n from 10 to 90, remaining everything else constant\n\n\n\n\n\n\n\n\n\nHence we iterate the process (change n parameter in get_candidate_covariates function step 2) and obtain odds ratio (exponentiation of log-OR) for each n. We varied n from 10 to 90.\n\n\n\n\n\n\nTip\n\n\n\nFind out where OR estimates stabilize\n\n\n\n\nLiterature suggested that this restriction of n can be detrimental (Schuster, Pang, and Platt 2015). Hence in the original analysis we chose n such that that is larger than available empirical covariates.\n\n\n\n\n\n\n\nSchuster, Tibor, Menglan Pang, and Robert W Platt. 2015. “On the Role of Marginal Confounder Prevalence–Implications for the High-Dimensional Propensity Score Algorithm.” Pharmacoepidemiology and Drug Safety 24 (9): 1004–7."
  },
  {
    "objectID": "extension.html#issues-with-hdps",
    "href": "extension.html#issues-with-hdps",
    "title": "Challenges",
    "section": "Issues with hdPS",
    "text": "Issues with hdPS\n\n\n\n\n\n\nUnivariate selection of many proxies\n\n\n\n\nRecurrent covariates selected separately / univariately\ncan be correlated (coming from same patient) and cause multicollinearity\nmay inflate variance\nGeneral overfitting problem. Too many adjustment variables?\n\n\n\n\n\n(Franklin et al. 2015; Schuster, Lowe, and Platt 2016; Karim, Pang, and Platt 2018)"
  },
  {
    "objectID": "extension.html#potential-ways-to-improve",
    "href": "extension.html#potential-ways-to-improve",
    "title": "Challenges",
    "section": "Potential ways to improve",
    "text": "Potential ways to improve\n\nMultiple recurrent covariates could provide same information, may not be useful anymore in the presence of others. Multivariate structure could be good to consider in a single model.\nMachine learning variable selection methods could be useful to combat multicollinearity.\nSample splitting methods could be useful in combating overfitting in high dimensions.\n\n\n\nCross-validation is embedded within super (ensemble) learning."
  },
  {
    "objectID": "extension.html#controversy",
    "href": "extension.html#controversy",
    "title": "Challenges",
    "section": "Controversy",
    "text": "Controversy\nResearchers argue that the PS model, which does not allow for data-driven selection of variables, is a more principled approach to adjusting for confounding in observational studies, without introducing any bias in the analysis .\nOther researchers argue that the hdPS approach can improve the precision of effect estimates by including additional variables that are empirically associated with both the exposure and the outcome, which may reduce residual confounding.\n\n\nMachine learning alternatives have the same criticism as some of them depend on association with the outcome.\n\n\n\n\n\n\nTip\n\n\n\nhdPS can only control for observed confounding, and cannot guarantee the direction or magnitude of residual confounding that may still exist. This is why sensitivity analyses and model diagnostics are important in assessing the robustness of hdPS results.\n\n\n\n\n\n\n(VanderWeele 2019)\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian Schneeweiss. 2015. “Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.\n\n\nSchuster, Tibor, Wilfrid Kouokam Lowe, and Robert W Platt. 2016. “Propensity Score Model Overfitting Led to Inflated Variance of Estimated Odds Ratios.” Journal of Clinical Epidemiology 80: 97–106.\n\n\nVanderWeele, Tyler J. 2019. “Principles of Confounder Selection.” European Journal of Epidemiology 34: 211–19."
  },
  {
    "objectID": "pubmed.html#pubmed",
    "href": "pubmed.html#pubmed",
    "title": "12  Literature",
    "section": "12.1 PubMed",
    "text": "12.1 PubMed\nCombination of plasmode, simulation, high-dimensional propensity provides 7 papers (searched in April 23, 2023):\n\n\n\n\n\nflowchart LR\n  A[PubMed] --> p4(Karim et al. 2018<br>Epidemiology)\n  p4 --> ml1\n  p4 --> ml0[Hybrid]\n\n  A[PubMed] --> p2(Tian et al. 2018<br>Int J Epidemiol.)\n  p2 --> ml1[Pure LASSO]\n\n  A[PubMed] --> p5(Wyss et al. 2018<br>Epidemiology)\n  p5 --> sl1[vary k,<br>k=25,100:500<br>Super<br>Learner]\n  p5 --> ct1\n\n  A[PubMed] --> p1(Benasseur et al. 2022<br>Pharmacoepidemiol Drug Saf.)\n  p1 --> ml2[Low k,<br>k = 10]\n  p1 --> ct1[cTMLE]\n\n  A[PubMed] --> p7(Neugebauer et al. 2015<br>Stat Med.)\n  p7 --> O2[time-varying<br>interventions]\n\n  A[PubMed] --> p6(Franklin et al. 2015<br>Am J Epidemiol.)\n  p6 --> ml1\n  p6 --> ml0\n\n  A[PubMed] --> p3(Schneeweiss et al. 2018<br>Clin Epidemiol.)\n  p3 --> O1[Review]\n\n  %% Define style classes\n  classDef redNode fill:#f44,stroke-width:2px,stroke:#f00,color:#fff\n  classDef yellowNode fill:#ffff00,stroke-width:2px,stroke:#ffcc00,color:#000\n  classDef greenNode fill:#9f9,stroke-width:2px,stroke:#090,color:#000\n\n  %% Apply classes to nodes\n  class p1,p3,p7 redNode\n  class p5 yellowNode\n  class p2,p4,p6 greenNode\n\n\n\n\n\n\n\n\n\n\n\n(Benasseur et al. 2022; Tian, Schuemie, and Suchard 2018; Franklin et al. 2015; Neugebauer et al. 2015; Wyss et al. 2018; Karim, Pang, and Platt 2018; Schneeweiss 2018)"
  },
  {
    "objectID": "pubmed.html#outside-of-pubmed",
    "href": "pubmed.html#outside-of-pubmed",
    "title": "12  Literature",
    "section": "12.2 Outside of PubMed",
    "text": "12.2 Outside of PubMed\n\n\n\n\n\nflowchart LR\n  S[Simulations] --> p0(Pang et al. 2016<br>Int. J Biostat.)\n  p0 --> t1[TMLE,<br>No<br>super<br>learner]\n\n  D[Data<br>analysis] --> p00(Pang et al. 2016<br>Epidemiology)\n  p00 --> t1\n\n  D --> p1(Ju et al. 2019<br>J App Stat.)\n  p1 --> sl1[Super<br>learner,<br>No TMLE,<br>bias not<br>used as a<br>performance<br>measure]\n\n  D --> p3(Schneeweiss et al. 2017<br>Epidemiology)\n  p3 --> ml1[LASSO]\n\n  S --> p4(Weberpals et al. 2021<br>Epidemiology)\n  p4 --> ml1[LASSO]\n  p4 --> ml2[Autoencoder]\n\n  S --> p5(Ju et al. 2019<br>Stat Meth Med Res.)\n  p5 --> t1\n  p5 --> t2[cTMLE,<br>more about<br>time<br>complexity]\n\n  S --> p6(Low et al. 2015<br>J Comp Eff Res.)\n  p6 --> ml1\n\n  %% Define style classes\n  classDef yellowNode fill:#ffff00,stroke-width:2px,stroke:#ffcc00,color:#000\n  classDef greenNode fill:#9f9,stroke-width:2px,stroke:#090,color:#000\n  classDef blueNode fill:#44f,stroke-width:2px,stroke:#00f,color:#fff\n\n  %% Apply classes to nodes\n  class p1 yellowNode\n  class p3,p4,p6 greenNode\n  class p0,p00,p5 blueNode\n\n\n\n\n\n\n\n\n\n\n\n\n\n(Pang, Schuster, Filion, Eberg, et al. 2016; Pang, Schuster, Filion, Schnitzer, et al. 2016; Ju, Gruber, et al. 2019; Ju, Combs, et al. 2019; Schneeweiss et al. 2017; Weberpals et al. 2021; Low, Gallego, and Shah 2016)\n\n\nBenasseur, Imane, Denis Talbot, Madeleine Durand, Anne Holbrook, Alexis Matteau, Brian J Potter, Christel Renoux, Mireille E Schnitzer, Jean-Éric Tarride, and Jason R Guertin. 2022. “A Comparison of Confounder Selection and Adjustment Methods for Estimating Causal Effects Using Large Healthcare Databases.” Pharmacoepidemiology and Drug Safety 31 (4): 424–33.\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian Schneeweiss. 2015. 
“Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.\n\n\nJu, Cheng, Mary Combs, Samuel D Lendle, Jessica M Franklin, Richard Wyss, Sebastian Schneeweiss, and Mark J van der Laan. 2019. “Propensity Score Prediction for Electronic Healthcare Databases Using Super Learner and High-Dimensional Propensity Score Methods.” Journal of Applied Statistics 46 (12): 2216–36.\n\n\nJu, Cheng, Susan Gruber, Samuel D Lendle, Antoine Chambaz, Jessica M Franklin, Richard Wyss, Sebastian Schneeweiss, and Mark J van Der Laan. 2019. “Scalable Collaborative Targeted Learning for High-Dimensional Data.” Statistical Methods in Medical Research 28 (2): 532–54.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.\n\n\nLow, Yen Sia, Blanca Gallego, and Nigam Haresh Shah. 2016. “Comparing High-Dimensional Confounder Control Methods for Rapid Cohort Studies from Electronic Health Records.” Journal of Comparative Effectiveness Research 5 (2): 179–92.\n\n\nNeugebauer, Romain, Julie A Schmittdiel, Zheng Zhu, Jeremy A Rassen, John D Seeger, and Sebastian Schneeweiss. 2015. “High-Dimensional Propensity Score Algorithm in Comparative Effectiveness Research with Time-Varying Interventions.” Statistics in Medicine 34 (5): 753–81.\n\n\nPang, Menglan, Tibor Schuster, Kristian B Filion, Maria Eberg, and Robert W Platt. 2016. “Targeted Maximum Likelihood Estimation for Pharmacoepidemiologic Research.” Epidemiology (Cambridge, Mass.) 27 (4): 570.\n\n\nPang, Menglan, Tibor Schuster, Kristian B Filion, Mireille E Schnitzer, Maria Eberg, and Robert W Platt. 2016. 
“Effect Estimation in Point-Exposure Studies with Binary Outcomes and High-Dimensional Covariate Data–a Comparison of Targeted Maximum Likelihood Estimation and Inverse Probability of Treatment Weighting.” The International Journal of Biostatistics 12 (2).\n\n\nSchneeweiss, Sebastian. 2018. “Automated Data-Adaptive Analytics for Electronic Healthcare Data to Study Causal Treatment Effects.” Clinical Epidemiology, 771–88.\n\n\nSchneeweiss, Sebastian, Wesley Eddings, Robert J Glynn, Elisabetta Patorno, Jeremy Rassen, and Jessica M Franklin. 2017. “Variable Selection for Confounding Adjustment in High-Dimensional Covariate Spaces When Analyzing Healthcare Databases.” Epidemiology 28 (2): 237–48.\n\n\nTian, Yuxi, Martijn J Schuemie, and Marc A Suchard. 2018. “Evaluating Large-Scale Propensity Score Performance Through Real-World and Synthetic Data Experiments.” International Journal of Epidemiology 47 (6): 2005–14.\n\n\nWeberpals, Janick, Tim Becker, Jessica Davies, Fabian Schmich, Dominik Rüttinger, Fabian J Theis, and Anna Bauer-Mehren. 2021. “Deep Learning-Based Propensity Scores for Confounding Control in Comparative Effectiveness Research: A Large-Scale, Real-World Data Study.” Epidemiology 32 (3): 378–88.\n\n\nWyss, Richard, Sebastian Schneeweiss, Mark Van Der Laan, Samuel D Lendle, Cheng Ju, and Jessica M Franklin. 2018. “Using Super Learner Prediction Modeling to Improve High-Dimensional Propensity Score Estimation.” Epidemiology 29 (1): 96–106."
  },
  {
    "objectID": "mllogic.html#understanding-variables-role",
    "href": "mllogic.html#understanding-variables-role",
    "title": "Machine learning",
    "section": "Understanding variable’s role",
    "text": "Understanding variable’s role\n\n\n\n\n\n\n\n\n\n\n\n(Rubin and Thomas 1996; Rubin 1997; Brookhart et al. 2006)\n\nConfounders\n\n\n\n\nflowchart LR\n  C --> A\n  C --> Y\n  A --> Y\n  style A fill:#90EE90;\n  style Y fill:#ADD8E6;\n  style C fill:#FF0000;\n\n\n\n\n\n\n\n\n\n\nAdjusting Confounders help reduce bias\n\n\n(Near) instruments\n\n\n\n\nflowchart LR\n  C --> A\n  A --> Y\n  style A fill:#90EE90;\n  style Y fill:#ADD8E6;\n  style C fill:#FF0000;\n\n\n\n\n\n\n\n\n\n\nAdjusting for covariates strongly associated with the exposure: Adjusting for these variables can potentially amplify bias in the treatment effect estimate and increase standard error (SE).\n\n\nPrecision variables\n\n\n\n\nflowchart LR\n  C --> Y\n  A --> Y\n  style A fill:#90EE90;\n  style Y fill:#ADD8E6;\n  style C fill:#FF0000;\n\n\n\n\n\n\n\n\n\n\nAdjusting for covariates strongly associated with the outcome: Adjusting for these variables can lead to decrease of the SE of the treatment effect estimate.\n\n\nNoise variables\n\n\n\n\nflowchart LR\n  C\n  A --> Y\n  style A fill:#90EE90;\n  style Y fill:#ADD8E6;\n  style C fill:#FF0000;\n\n\n\n\n\n\n\n\n\n\nAdjusting for covariates that are neither associated with the outcome or the exposure can increase the SE of the treatment effect estimate."
  },
  {
    "objectID": "mllogic.html#overall-picture",
    "href": "mllogic.html#overall-picture",
    "title": "Machine learning",
    "section": "Overall picture",
    "text": "Overall picture\n\n\n\n\n\n\n\n\n\n\n\n\n\nChoose variables associated with the outcome in general, as long as they are not mediator, collider or effect of the outcome. In hdPS, we chose proxies in the covariate assessment period (before exposure occurs), reducing the possibility of those proxies to be mediator, collider or effect of the outcome.\n\n\nBrookhart, M Alan, Sebastian Schneeweiss, Kenneth J Rothman, Robert J Glynn, Jerry Avorn, and Til Stürmer. 2006. “Variable Selection for Propensity Score Models.” American Journal of Epidemiology 163 (12): 1149–56.\n\n\nRubin, Donald B. 1997. “Estimating Causal Effects from Large Data Sets Using Propensity Scores.” Annals of Internal Medicine 127 (8_Part_2): 757–63.\n\n\nRubin, Donald B, and Neal Thomas. 1996. “Matching Using Estimated Propensity Scores: Relating Theory to Practice.” Biometrics, 249–64."
  },
  {
    "objectID": "mllasso.html",
    "href": "mllasso.html",
    "title": "13  Pure ML",
    "section": "",
    "text": "14 Pure ML approach (LASSO)\nStart with all recurrence variables (EC in the following equation)\nSay, 100 proxies (associated with outcome) were selected by LASSO approach (ML-hdPS)"
  },
  {
    "objectID": "mllasso.html#choose-variables-associated-with-outcome",
    "href": "mllasso.html#choose-variables-associated-with-outcome",
    "title": "13  Pure ML",
    "section": "14.1 Choose variables associated with outcome",
    "text": "14.1 Choose variables associated with outcome\n\nproxy.dim <- out2 # from step 3\ndim(proxy.dim) \n#> [1] 3839   92\nproxy.dim$id <- proxy.dim$idx\nproxy.dim$idx <- NULL\nfullcovproxy.data <- merge(data.complete[,c(\"id\",\n                                    outcome, \n                                    exposure, \n                                    investigator.specified.covariates)], \n                       proxy.dim, by = \"id\")\ndim(fullcovproxy.data)\n#> [1] 3839  119\nfullcovproxy.data$outcome <- as.numeric(I(fullcovproxy.data$diabetes=='Yes'))\nfullcovproxy.data$exposure <- as.numeric(I(fullcovproxy.data$obese=='Yes'))\n\n\nproxy.list <- names(out2[-1])\n# out3$autoselected_covariate_df[,-1] for hybrid \n# out2 is from step2$recurrence_data\ncovarsTfull <- c(investigator.specified.covariates, proxy.list)\nY.form <- as.formula(paste0(c(\"outcome~ exposure\", \n                              covarsTfull), collapse = \"+\") )\ncovar.mat <- model.matrix(Y.form, data = fullcovproxy.data)[,-1]\nlasso.fit<-glmnet::cv.glmnet(y = fullcovproxy.data$outcome, \n                             x = covar.mat, \n                             type.measure='mse',\n                             family=\"binomial\",\n                             alpha = 1, \n                             nfolds = 5)\ncoef.fit<-coef(lasso.fit,s='lambda.min',exact=TRUE)\nsel.variables<-row.names(coef.fit)[which(as.numeric(coef.fit)!=0)]\nproxy.list.sel.ml <- proxy.list[proxy.list %in% sel.variables]\nlength(proxy.list.sel.ml)\n#> [1] 35\n\n\n\n\nFrom all proxies, we try to identify proxies that are empirically associated with the outcome based on a multivariate LASSO (outcome with all proxies in one model).\nNote that LASSO model is choosing variables based on association with the outcome conditional on the ’exposure`.\nVariable selection is only happening for proxy variables.\nInvestigator specified variables are not being subject to variable selection."
  },
  {
    "objectID": "mllasso.html#build-model-formula-based-on-selected-variables",
    "href": "mllasso.html#build-model-formula-based-on-selected-variables",
    "title": "13  Pure ML",
    "section": "14.2 Build model formula based on selected variables",
    "text": "14.2 Build model formula based on selected variables\n\ncovform <- paste0(investigator.specified.covariates, collapse = \"+\")\nproxyform <- paste0(proxy.list.sel.ml, collapse = \"+\")\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", rhsformula))\n\n\n\nBuild propensity score model based on selected variables based on LASSO."
  },
  {
    "objectID": "mllasso.html#fit-the-ps-model",
    "href": "mllasso.html#fit-the-ps-model",
    "title": "13  Pure ML",
    "section": "14.3 Fit the PS model",
    "text": "14.3 Fit the PS model\n\nhdps.data <- fullcovproxy.data\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n                    data = hdps.data, \n                    estimand = \"ATE\",\n                    method = \"ps\")\n\n\n\nPropensity score model fit to be able to calculate the inverse probability weights."
  },
  {
    "objectID": "mllasso.html#obtain-log-or-from-unadjusted-outcome-model",
    "href": "mllasso.html#obtain-log-or-from-unadjusted-outcome-model",
    "title": "13  Pure ML",
    "section": "14.4 Obtain log-OR from unadjusted outcome model",
    "text": "14.4 Obtain log-OR from unadjusted outcome model\n\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nfit <- glm(out.formula,\n            data = hdps.data,\n            weights = W.out$weights,\n            family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n                                 c(\"Estimate\", \n                                   \"Std. Error\", \n                                   \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.37\n0.1\n0\n0.27\n0.48\n\n\n\n\n\n\n\n\n\n\n\n\nSummary of results (log-OR).\n\n\n\n\n\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian Schneeweiss. 2015. “Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98."
  },
  {
    "objectID": "mlhybrid.html#build-model-formula-based-on-selected-variables",
    "href": "mlhybrid.html#build-model-formula-based-on-selected-variables",
    "title": "14  Hybrid ML",
    "section": "14.1 Build model formula based on selected variables",
    "text": "14.1 Build model formula based on selected variables\n\n\nFrom hdPS variables, we try to identify proxies that are empirically associated with the outcome based on a multivariate LASSO (outcome with all proxies in one model).\n\nlength(proxy.list.sel)\n#> [1] 50\nproxy.list <- names(out3$autoselected_covariate_df[,-1]) # from step 4\ncovarsTfull <- c(investigator.specified.covariates, proxy.list)\nY.form <- as.formula(paste0(c(\"outcome~ exposure\", \n                              covarsTfull), collapse = \"+\") )\ncovar.mat <- model.matrix(Y.form, data = hdps.data)[,-1]\nlasso.fit<-glmnet::cv.glmnet(y = hdps.data$outcome, \n                             x = covar.mat, \n                             type.measure='mse',\n                             family=\"binomial\",\n                             alpha = 1, \n                             nfolds = 5)\ncoef.fit<-coef(lasso.fit,s='lambda.min',exact=TRUE)\nsel.variables<-row.names(coef.fit)[which(as.numeric(coef.fit)!=0)]\nproxy.list.sel.hybrid <- proxy.list[proxy.list %in% sel.variables]\nlength(proxy.list.sel.hybrid)\n#> [1] 37\nproxyform <- paste0(proxy.list.sel.hybrid, collapse = \"+\")\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", rhsformula))\n\n\n\nBuild propensity score model based on selected variables based on LASSO.\n\n14.1.1 Fit the PS model\n\nW.out <- weightit(ps.formula, \n                    data = hdps.data, \n                    estimand = \"ATE\",\n                    method = \"ps\")\n\n\n\nPropensity score model fit to be able to calculate the inverse probability weights.\n\n\n14.1.2 Obtain log-OR from unadjusted outcome model\n\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nfit <- glm(out.formula,\n            data = hdps.data,\n            weights = W.out$weights,\n            family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n                               
  c(\"Estimate\", \n                                   \"Std. Error\", \n                                   \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci.h <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci.h,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.39\n0.1\n0\n0.29\n0.5\n\n\n\n\n\n\n\n\n\n\n\n\nSummary of results (log-OR).\n\n\n\n\n\n\nAlternative process\n\n\n\nIt is also possible to start with ML selection, and then applying Bross’s formula on top of it (Schneeweiss et al. 2017).\n\n\n\n\n\n\n\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian Schneeweiss. 2015. “Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.\n\n\nSchneeweiss, Sebastian, Wesley Eddings, Robert J Glynn, Elisabetta Patorno, Jeremy Rassen, and Jessica M Franklin. 2017. “Variable Selection for Confounding Adjustment in High-Dimensional Covariate Spaces When Analyzing Healthcare Databases.” Epidemiology 28 (2): 237–48."
  },
  {
    "objectID": "sl.html#build-model-formula-based-on-all-variables",
    "href": "sl.html#build-model-formula-based-on-all-variables",
    "title": "15  Ensemble",
    "section": "15.1 Build model formula based on all variables",
    "text": "15.1 Build model formula based on all variables\n\nproxy.list <- names(out2[-1])\n# out3$autoselected_covariate_df[,-1] for hybrid \n# out2 is from step2$recurrence_data\nlength(proxy.list)\n#> [1] 91\ncovform <- paste0(investigator.specified.covariates, collapse = \"+\")\nproxyform <- paste0(proxy.list, collapse = \"+\")\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\nps.formula <- as.formula(paste0(\"exposure\", \"~\", rhsformula))\n\n\n\nWe work with all proxies"
  },
  {
    "objectID": "sl.html#fit-the-ps-model-with-super-learner",
    "href": "sl.html#fit-the-ps-model-with-super-learner",
    "title": "15  Ensemble",
    "section": "15.2 Fit the PS model with super learner",
    "text": "15.2 Fit the PS model with super learner\n\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n                  data = hdps.data, \n                  estimand = \"ATE\",\n                  method = \"super\",\n                  SL.library = c(\"SL.glm\", \n                                 \"SL.glmnet\",\n                                 \"SL.earth\"))\n#> Loading required namespace: glmnet\n#> Loading required namespace: earth\n#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred\n\n\n\nPropensity score model fit based on super learning algorithm to be able to calculate the inverse probability weights."
  },
  {
    "objectID": "sl.html#obtain-log-or-from-unadjusted-outcome-model",
    "href": "sl.html#obtain-log-or-from-unadjusted-outcome-model",
    "title": "15  Ensemble",
    "section": "15.3 Obtain log-OR from unadjusted outcome model",
    "text": "15.3 Obtain log-OR from unadjusted outcome model\n\nsummary(W.out$ps)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#> 0.01919 0.22474 0.38968 0.42094 0.59528 0.99131\nout.formula <- as.formula(paste0(\"outcome\", \"~\", \"exposure\"))\nfit <- glm(out.formula,\n            data = hdps.data,\n            weights = W.out$weights,\n            family= binomial(link = \"logit\"))\nfit.summary <- summary(fit)$coef[\"exposure\",\n                                 c(\"Estimate\", \n                                   \"Std. Error\", \n                                   \"Pr(>|z|)\")]\nfit.summary[2] <- sqrt(sandwich::sandwich(fit)[2,2])\nrequire(lmtest)\nconf.int <- confint(fit, \"exposure\", level = 0.95, method = \"hc1\")\nfit.summary_with_ci.sl <- c(fit.summary, conf.int)\nknitr::kable(t(round(fit.summary_with_ci.sl,2))) \n\n\n\n\nEstimate\nStd. Error\nPr(>|z|)\n2.5 %\n97.5 %\n\n\n\n\n0.43\n0.09\n0\n0.32\n0.54\n\n\n\n\n\n\n\n\n\n\n\n\nSummary of results (log-OR)."
  },
  {
    "objectID": "tmle.html#obtain-or-with-superlearner",
    "href": "tmle.html#obtain-or-with-superlearner",
    "title": "16  TMLE",
    "section": "16.1 Obtain OR with superlearner",
    "text": "16.1 Obtain OR with superlearner\n\nsummary(W.out$ps)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#> 0.01919 0.22474 0.38968 0.42094 0.59528 0.99131\nSL.library = c(\"SL.glm\", \"SL.glmnet\",\"SL.earth\")\nproxy.list <- names(out2[-1])\n# out3$autoselected_covariate_df[,-1] for hybrid \n# out2 is from step2$recurrence_data\nObsData.noYA <- hdps.data[,c(investigator.specified.covariates, \n                             proxy.list)]\n\n\n\nWe use the same propensity score model that was fitted based on super learning algorithm.\n\ntmle.fit <- tmle::tmle(Y = hdps.data$outcome,\n                       A = hdps.data$exposure, \n                       W = ObsData.noYA, \n                       family = \"binomial\",\n                       V.Q = 3,\n                       V.g = 3,\n                       Q.SL.library = SL.library,\n                       g1W = W.out$ps)\n\n\n\nIf you want to know more about TMLE, look at other tutorials.\n\nestOR.tmle <- tmle.fit$estimates$OR\nestOR.tmle\n#> $psi\n#> [1] 1.465619\n#> \n#> $log.psi\n#> [1] 0.382278\n#> \n#> $CI\n#> [1] 1.268728 1.693067\n#> \n#> $pvalue\n#> [1] 2.062344e-07\n#> \n#> $var.log.psi\n#> [1] 0.005417723\n#> \n#> $bs.var.log.psi\n#> [1] NA\n#> \n#> $bs.CI.twosided\n#> [1] NA NA\n#> \n#> $bs.CI.onesided.lower\n#> [1] -Inf   NA\n#> \n#> $bs.CI.onesided.upper\n#> [1]  NA Inf\n\n\n\nSummary of results (OR)."
  },
  {
    "objectID": "tmle.html#obtain-or-without-superlearner",
    "href": "tmle.html#obtain-or-without-superlearner",
    "title": "16  TMLE",
    "section": "16.2 Obtain OR without superlearner",
    "text": "16.2 Obtain OR without superlearner\n\nsummary(W.out0$ps)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#> 0.01333 0.21054 0.38992 0.42094 0.61259 0.99414\nSL.library = c(\"SL.glm\")\nproxy.list <- names(out2[-1])\n# out3$autoselected_covariate_df[,-1] for hybrid \n# out2 is from step2$recurrence_data\nObsData.noYA <- hdps.data[,c(investigator.specified.covariates, \n                             proxy.list)]\n\n\n\nWe use the same propensity score model that was fitted based on hdPS variables via logistic regression (no other learners).\n\ntmle.fit0 <- tmle::tmle(Y = hdps.data$outcome,\n                       A = hdps.data$exposure, \n                       W = ObsData.noYA, \n                       family = \"binomial\",\n                       V.Q = 3,\n                       V.g = 3,\n                       Q.SL.library = SL.library,\n                       g1W = W.out$ps)\n\n\nestOR.tmle0 <- tmle.fit0$estimates$OR\nestOR.tmle0\n#> $psi\n#> [1] 1.459729\n#> \n#> $log.psi\n#> [1] 0.3782507\n#> \n#> $CI\n#> [1] 1.261396 1.689247\n#> \n#> $pvalue\n#> [1] 3.840893e-07\n#> \n#> $var.log.psi\n#> [1] 0.005551369\n#> \n#> $bs.var.log.psi\n#> [1] NA\n#> \n#> $bs.CI.twosided\n#> [1] NA NA\n#> \n#> $bs.CI.onesided.lower\n#> [1] -Inf   NA\n#> \n#> $bs.CI.onesided.upper\n#> [1]  NA Inf\n\n\n\nSummary of results (OR)."
  },
  {
    "objectID": "stat.html#background",
    "href": "stat.html#background",
    "title": "17  Statistical Approaches",
    "section": "17.1 Background",
    "text": "17.1 Background\nRecent work compares multiple variable selection strategies for hdPS analysis (Karim and Lei 2025). The study aims to identify methods that best balance bias, precision, and computational cost in causal inference using observational data. It is based on NHANES 2013–2018 data evaluating the association between obesity and diabetes.\n\n\n\n\n\n\nTip\n\n\n\n(Karim and Lei 2025)"
  },
  {
    "objectID": "stat.html#simulation-design",
    "href": "stat.html#simulation-design",
    "title": "17  Statistical Approaches",
    "section": "17.2 Simulation Design",
    "text": "17.2 Simulation Design\n\n\n\n\n\n\n\nElement\nDetails\n\n\n\n\nData Source\nNHANES 2013–2018\n\n\nSample Size\n3,000 participants per iteration\n\n\nIterations\n500\n\n\nPrevalence Scenarios\n1. Frequent exposure & frequent outcome  2. Rare exposure & frequent outcome  3. Frequent exposure & rare outcome\n\n\nTrue Effect\nOR = 1 (null); RD = 0\n\n\nOutcome Generation\nIncluded nonlinear transforms, interactions, and a comorbidity index from 94 proxies\n\n\nNoise Variables\n48 of 142 proxy covariates used as noise"
  },
  {
    "objectID": "stat.html#methods-compared",
    "href": "stat.html#methods-compared",
    "title": "17  Statistical Approaches",
    "section": "17.3 Methods Compared",
    "text": "17.3 Methods Compared\n\n\n\n\n\n\n\nMethod\nDescription\n\n\n\n\nKitchen Sink\nIncludes all investigator and proxy covariates (no selection)\n\n\nBross hdPS\nSelects top 100 proxies using the Bross formula\n\n\nHybrid (Bross + LASSO)\nFirst applies Bross, then refines with LASSO\n\n\nLASSO\nPenalized regression with cross-validation\n\n\nElastic Net\nCombines LASSO and Ridge penalties to handle collinearity\n\n\nRandom Forest\nRanks variables by importance using Gini impurity\n\n\nXGBoost\nBoosted trees optimizing impurity reduction\n\n\nForward Selection\nAdds variables sequentially based on adjusted R²\n\n\nBackward Elimination\nRemoves variables iteratively based on adjusted R²\n\n\nGenetic Algorithm\nEvolves variable subsets via stochastic search"
  },
  {
    "objectID": "stat.html#simulation-results",
    "href": "stat.html#simulation-results",
    "title": "17  Statistical Approaches",
    "section": "17.4 Simulation Results",
    "text": "17.4 Simulation Results\n\n\n\n\n\nFigure 1. Bias across Methods in NHANES Plasmode Simulation\n\n\n\n\n\n\n\n\n\nFigure 2. Coverage across Methods in NHANES Plasmode Simulation\n\n\n\n\nSee interactive results: 👉 Shiny App"
  },
  {
    "objectID": "stat.html#key-takeaways",
    "href": "stat.html#key-takeaways",
    "title": "17  Statistical Approaches",
    "section": "17.5 Key Takeaways",
    "text": "17.5 Key Takeaways\n\nSimpler methods (Forward/Backward selection) offer strong coverage with efficiency.\nBross-based and Hybrid hdPS methods remain reliable and interpretable.\nMethod choice should reflect the specific inferential goal: bias reduction vs variance minimization.\n\n\n\n\n\nKarim, ME, and Y Lei. 2025. “Is There a Competitive Advantage to Using Multivariate Statistical or Machine Learning Methods over the Bross Formula in the hdPS Framework for Bias and Variance Estimation?” PLoS One 20 (5): e0324639."
  },
  {
    "objectID": "mlcompare.html",
    "href": "mlcompare.html",
    "title": "18  Compare results",
    "section": "",
    "text": "Summary of model results\n \n  \n     \n    OR \n    Beta-coef \n    coef-SE \n    CI (2.5 %) \n    CI (97.5 %) \n    p-value \n  \n \n\n  \n    Crude (no adjustment) \n    1.94 \n    0.66 \n    0.08 \n    0.51 \n    0.81 \n    < 2e-16 \n  \n  \n    PS (no proxies) \n    1.89 \n    0.64 \n    0.10 \n    0.53 \n    0.75 \n    < 2e-16 \n  \n  \n    hdPS \n    1.42 \n    0.35 \n    0.12 \n    0.25 \n    0.46 \n    6.7e-11 \n  \n  \n    Pure LASSO \n    1.45 \n    0.37 \n    0.10 \n    0.27 \n    0.48 \n    4.7e-12 \n  \n  \n    Hybrid (hdPS, then LASSO) \n    1.48 \n    0.39 \n    0.10 \n    0.29 \n    0.50 \n    3.9e-13 \n  \n  \n    Super learner (GLM, LASSO, MARS) \n    1.54 \n    0.43 \n    0.09 \n    0.32 \n    0.54 \n    1.3e-14 \n  \n  \n    TMLE (GLM, LASSO, MARS in SL) \n    1.47 \n    0.38 \n    0.07 \n    0.24 \n    0.53 \n    2.1e-07 \n  \n  \n    TMLE (only GLM in SL) \n    1.46 \n    0.38 \n    0.07 \n    0.23 \n    0.52 \n    3.8e-07 \n  \n  \n    Kitchen Sink \n    1.50 \n    0.41 \n    0.04 \n    0.32 \n    0.48 \n    < 2e-16 \n  \n  \n    Random Forest \n    1.54 \n    0.43 \n    0.04 \n    0.35 \n    0.51 \n    < 2e-16 \n  \n  \n    XGBoost \n    1.51 \n    0.41 \n    0.04 \n    0.33 \n    0.49 \n    < 2e-16 \n  \n  \n    Forward Selection \n    1.56 \n    0.44 \n    0.04 \n    0.36 \n    0.52 \n    < 2e-16 \n  \n  \n    Backward Elimination \n    1.53 \n    0.43 \n    0.04 \n    0.34 \n    0.50 \n    < 2e-16 \n  \n\n\n\n\n\n\n\n\nPS is the result from the propensity score approach that did not include any proxies.\nResults from this approach is somewhat different than other approaches.\nMore detailed results from simulations are available elsewhere (Karim 2023).\n\n\n\n\n\n\n\n\nAcross all methods evaluated—including hdPS, regularized regression (LASSO, Hybrid), ensemble learners (Super Learner, TMLE), and high-dimensional variable selection strategies (e.g., Kitchen Sink, Random Forest, XGBoost)—adjusted odds ratios ranged from 1.34 
to 1.56, with most clustering between 1.50 and 1.56. In contrast, unadjusted and PS-only models produced substantially higher ORs (>1.9).\n\n\n\n\n\n\n\nKarim, ME. 2023. “Rethinking Residual Confounding Bias Reduction: Why Vanilla hdPS Alone Is No Longer Enough.”"
  },
  {
    "objectID": "dctmle.html#background",
    "href": "dctmle.html#background",
    "title": "19  DC-TMLE",
    "section": "19.1 Background",
    "text": "19.1 Background\nResidual confounding remains a persistent challenge in observational studies, particularly with high-dimensional data (M. E. Karim and Lei 2025). Recent work evaluates traditional and machine learning-based extensions of hdPS methods, including Super Learner (SL), TMLE, and Double Cross-Fit TMLE (DC-TMLE).\n\n\n\n\n\n\nTip\n\n\n\n(M. E. Karim and Lei 2025)"
  },
  {
    "objectID": "dctmle.html#simulation-design",
    "href": "dctmle.html#simulation-design",
    "title": "19  DC-TMLE",
    "section": "19.2 Simulation Design",
    "text": "19.2 Simulation Design\n\n\n\n\n\n\n\nElement\nDetails\n\n\n\n\nData Source\nNHANES 2013–2018\n\n\nSample Size\n3,000 per iteration\n\n\nIterations\n500\n\n\nExposure/Outcome Prevalence\n3 scenarios: (i) Frequent-Frequent, (ii) Rare-Frequent, (iii) Frequent-Rare\n\n\nTrue Effect\nOR = 1 (null); RD = 0\n\n\nProxies\n142 medication variables; 94 outcome-associated proxies and 48 noise variables\n\n\nConfounding Simulation\nUsed proxy-derived comorbidity index and complex transformations to mimic unmeasured confounding"
  },
  {
    "objectID": "dctmle.html#methods-compared",
    "href": "dctmle.html#methods-compared",
    "title": "19  DC-TMLE",
    "section": "19.3 Methods Compared",
    "text": "19.3 Methods Compared\n\n\n\n\n\n\n\n\nMethod Group\nMethod\nDescription\n\n\n\n\nTMLE Methods with Proxies\nTMLE.ks, hdPS.TMLE, LASSO.TMLE, hdPS.LASSO.TMLE\nTMLE with various proxy selection strategies\n\n\n\nDC.TMLE\nDouble cross-fit TMLE\n\n\nSuper Learner Methods with Proxies\nhdPS.SL, LASSO.SL, hdPS.LASSO.SL, SL.ks\nSuper Learner with proxy selection options\n\n\nStandard Methods with Proxies\nPS.ks, hdPS, LASSO, hdPS.LASSO\nPropensity score and outcome models with proxy inclusion\n\n\nNo Proxy Methods\nTMLE.u, SL.u, PS.u\nOnly measured covariates, no proxies\n\n\n\nSuper Learner libraries included:\n\n1-learner: Logistic regression\n3-learners: Logistic regression, LASSO, MARS\n4-learners: Above + XGBoost (non-Donsker)"
  },
  {
    "objectID": "dctmle.html#simulation-results",
    "href": "dctmle.html#simulation-results",
    "title": "19  DC-TMLE",
    "section": "19.4 Simulation Results",
    "text": "19.4 Simulation Results\n\n\n\n\n\nFigure 2. Bias across Methods in NHANES Plasmode Simulation\n\n\n\n\n\n\n\n\n\nFigure 3. Coverage across Methods in NHANES Plasmode Simulation\n\n\n\n\nResults are fully accessible via a Shiny app:\n👉 Interactive Causal Benchmark App\nExplore bias, SEs, and coverage metrics across methods and simulation conditions."
  },
  {
    "objectID": "dctmle.html#conclusion",
    "href": "dctmle.html#conclusion",
    "title": "19  DC-TMLE",
    "section": "19.5 Conclusion",
    "text": "19.5 Conclusion\n\nSimpler models with structured proxy inclusion (hdPS, LASSO) remain competitive and stable.\nTMLE is effective for bias reduction but suffers under high-dimensional instability with complex libraries.\nSL performance is library-sensitive; 1- and 3-learner libraries performed best. Complex learners (e.g., XGBoost) should be used cautiously.\n\n\n\n\n\nKarim, ME, and MH Mondol. 2025. “Finding the Optimal Number of Splits and Repetitions in Double Cross-Fitting Targeted Maximum Likelihood Estimators.” Pharmaceutical Statistics.\n\n\nKarim, Mohammad Ehsanul, and Yang Lei. 2025. “How Effective Are Machine Learning and Doubly Robust Estimators in Incorporating High-Dimensional Proxies to Reduce Residual Confounding?” Pharmacoepidemiology and Drug Safety 34 (5): e70155.\n\n\nMondol, MH, and ME Karim. 2024. “Towards Robust Causal Inference in Epidemiological Research: Employing Double Cross-Fit TMLE in Right Heart Catheterization Data.” American Journal of Epidemiology, kwae447."
  },
  {
    "objectID": "deep.html#plasmode-simulation",
    "href": "deep.html#plasmode-simulation",
    "title": "20  Deep Learning",
    "section": "20.1 Plasmode Simulation",
    "text": "20.1 Plasmode Simulation\n\n\n\n\n\n\n\nSimulation Element\nDescription\n\n\n\n\nSource Dataset\nNHANES 2013–2018\n\n\nSimulation Framework\nPlasmode simulation preserving empirical covariate and exposure distributions\n\n\nSimulated Sample Size\n3,000 participants per iteration\n\n\nIterations\n500 replicates\n\n\nPrevalence Scenarios\n1. Frequent exposure & frequent outcome  2. Rare exposure & frequent outcome  3. Frequent exposure & rare outcome\n\n\nTrue Effect\nOR = 1 (null); RD = 0\n\n\nOutcome Generation\nLogistic regression model with:  - Nonlinear transformations (log, poly)  - Interactions  - Proxy-derived comorbidity index\n\n\nConfounding Simulation\nUnmeasured confounding mimicked using high-dimensional proxy variables"
  },
  {
    "objectID": "deep.html#estimators-compared",
    "href": "deep.html#estimators-compared",
    "title": "20  Deep Learning",
    "section": "20.2 Estimators Compared",
    "text": "20.2 Estimators Compared\n\n\n\n\n\n\n\n\n\n\nMethod\nCore Idea\nKey Features\nUse of Propensity Score\nOptimization & Regularization\n\n\n\n\nPSW (hdPS)\nBaseline method using logistic regression on investigator and proxy covariates\nHigh-dimensional covariates selected via hdPS\nExplicitly modeled via logistic regression\nNone\n\n\nTMLE (SL Smooth) (Balzer and Westling 2021)\nSemiparametric estimator using Super Learner\nCombines outcome and treatment models; uses smooth learners (logistic regression, LASSO, MARS)\nExplicitly modeled and used for targeting\nSuper Learner; Donsker-compliant learners\n\n\nTMLE (SL Unsmooth)\nMore flexible TMLE with XGBoost in Super Learner\nAllows complex nonlinearities; lower variance reliability in small samples\nExplicitly modeled and used for targeting\nSuper Learner including unsmooth learners (e.g., XGBoost)\n\n\nDCTMLE (Zivich and Breskin 2021)\nTMLE with double cross-fitting\nReduces overfitting in TMLE with flexible learners\nExplicitly modeled and used for targeting\nDouble cross-fitting for robustness\n\n\nTARNET (Shalit, Johansson, and Sontag 2017)\nNeural net with treatment-agnostic shared representation\nTwo heads for outcome under treatment/control; most precise in frequent exposure/outcome\nNot used explicitly\nTargeted regularization; Adam + SGD with early stopping\n\n\nDragonnet (Shi, Blei, and Veitch 2019)\nNeural net that jointly models outcomes and propensity score\nAdds third head for PS; enforces balance and semiparametric alignment\nModeled as an explicit third output\nTargeted regularization; multitask learning\n\n\nNEDnet (Shi, Blei, and Veitch 2019)\nSequential neural network for treatment then outcome\nStage 1: predict treatment; Stage 2: freeze representation and predict outcomes\nModeled separately in Stage 1\nTargeted regularization; two-stage optimization"
  },
  {
    "objectID": "deep.html#simulation-results",
    "href": "deep.html#simulation-results",
    "title": "20  Deep Learning",
    "section": "20.3 Simulation Results",
    "text": "20.3 Simulation Results\n\n\n\n\n\nFigure 1. Bias across Methods in NHANES Plasmode Simulation\n\n\n\n\n\n\n\n\n\nFigure 2. Relative error across Methods in NHANES Plasmode Simulation\n\n\n\n\nResults are fully accessible via a Shiny app:\n👉 Interactive Causal Benchmark App\nExplore bias, SEs, and coverage metrics across methods and simulation conditions."
  },
  {
    "objectID": "deep.html#conclusion",
    "href": "deep.html#conclusion",
    "title": "20  Deep Learning",
    "section": "20.4 Conclusion",
    "text": "20.4 Conclusion\n\nPSW remains an interpretable benchmark\nTMLE and neural methods extend this framework by improving bias-variance trade-offs and enabling better performance in complex settings\nAmong deep learning methods, Dragonnet offers the best average trade-off; NEDnet excels in coverage but is computationally heavy; TARNET offers precision\nThese methods are particularly useful when dealing with residual confounding, nonlinear effects, and proxy variable structures\n\n\n\n\n\nBalzer, Laura B, and Ted Westling. 2021. “Demystifying Statistical Inference When Using Machine Learning in Causal Research.” American Journal of Epidemiology.\n\n\nShalit, Uri, Fredrik D Johansson, and David Sontag. 2017. “Estimating Individual Treatment Effect: Generalization Bounds and Algorithms.” In International Conference on Machine Learning, 3076–85. PMLR.\n\n\nShi, Claudia, David Blei, and Victor Veitch. 2019. “Adapting Neural Networks for the Estimation of Treatment Effects.” Advances in Neural Information Processing Systems 32.\n\n\nZivich, Paul N, and Alexander Breskin. 2021. “Machine Learning for Causal Inference: On the Use of Cross-Fit Estimators.” Epidemiology (Cambridge, Mass.) 32 (3): 393."
  },
  {
    "objectID": "extension2.html#time-to-event-outcome",
    "href": "extension2.html#time-to-event-outcome",
    "title": "Extensions in Survival and Longitudinal Analyses",
    "section": "Time-to-event outcome",
    "text": "Time-to-event outcome\nThere are two components to a time-to-event (survival) outcome: (1) whether an event occurs and (2) the timing of the event.\n\nExample: time-to-CVD, where we are interested both in CVD (cardiovascular disease) status and time from cohort entry to CVD development. The Cox proportional hazards (PH) model is widely used for modeling a time-to-event outcome.\n\n\n\nAcknowledgment: Md Belal Hossain contributed to drafting this chapter; some of the ideas presented here stem from his PhD thesis and subsequent publications.\n\n\n\n\n\n\nTime-to-event outcome\n\n\n\n\nBross formula requires the exposure, outcome, and proxy covariates to be binary\nIgnoring the time aspect in a time-to-event outcome leads to a loss of information, which can significantly impact the selection of the proxy covariates and can impact the effect estimate of interest\nML survival models could be useful in prioritizing and selecting recurrence covariates\n\n\n\n\n\n(Bross 1966; Schneeweiss et al. 2009)"
  },
  {
    "objectID": "extension2.html#time-dependent-exposure",
    "href": "extension2.html#time-dependent-exposure",
    "title": "Extensions in Survival and Longitudinal Analyses",
    "section": "Time-dependent exposure",
    "text": "Time-dependent exposure\nUnlike a time-fixed exposure, whose values are known at study entry (time zero), values for a time-dependent exposure can change over the course of follow-up. Consider an example of a multiple sclerosis (MS) cohort, where we are interested in the relationship between disease-modifying drugs (DMDs) and long-term mortality. Not every patient is exposed to DMDs at their MS diagnosis. Instead, some patients are never exposed to DMDs, some may be exposed to DMDs at the time of their MS diagnosis, while others may be exposed many years later. In this case, the exposure status of patients is not fixed at time zero, but rather depends on time. Simultaneously dealing with time-dependent exposure and residual confounding bias can be challenging.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTime-dependent exposure\n\n\n\n\nImmortal time bias occurs when a period during follow-up, in which a patient cannot experience the outcome (e.g., death), is misclassified as time under observation, often leading to an overestimation of the treatment’s effectiveness\nIn the MS example, immortal time can occur if patients must survive long enough to receive a DMD, which could falsely enhance the perceived survival benefit of these drugs\nEmploying Cox proportional hazards models with time-varying exposure to DMDs can help mitigate immortal time bias. However, this time-dependent Cox regression cannot deal with residual confounding bias.\n\n\n\n\n\n(Jones and Fowler 2016; Beyersmann, Wolkewitz, and Schumacher 2008; Karim et al. 2014)"
  },
  {
    "objectID": "extension2.html#high-dimensional-disease-risk-score-hddrs",
    "href": "extension2.html#high-dimensional-disease-risk-score-hddrs",
    "title": "Extensions in Survival and Longitudinal Analyses",
    "section": "High-dimensional disease risk score (hdDRS)",
    "text": "High-dimensional disease risk score (hdDRS)\nThe hdPS technique might not reduce significant bias due to an overfitted exposure model, particularly with a rare exposure. An alternative confounding adjustment method to hdPS is hdDRS. In contrast, hdPS separates the exposure modelling from the outcome modelling, ultimately giving the end-user more flexibility in adjusting for confounding effects (e.g., via inverse-probability-weighting). On the other hand, the hdDRS achieves the balancing of the confounders by modelling the outcome.\n\n\n\n\n\n\nHigh-dimensional disease risk score\n\n\n\n\nhdDRS can be an alternative to hdPS for dealing with residual confounding bias\nhdDRS could be particularly helpful in situations where the exposure is rare or the outcome is a repeated measure\n\n\n\n\n\n\n\n(Kumamaru et al. 2016; Hossain 2025)\n\n\nBeyersmann, Jan, Martin Wolkewitz, and Martin Schumacher. 2008. “The Impact of Time-Dependent Bias in Proportional Hazards Modelling.” Statistics in Medicine 27 (30): 6439–54.\n\n\nBross, Irwin DJ. 1966. “Spurious Effects from an Extraneous Variable.” Journal of Chronic Diseases 19 (6): 637–47.\n\n\nHossain, Md Belal. 2025. “Chapter 2: High-Dimensional Disease Risk Score for Dealing with Residual Confounding in Estimating Treatment Effects with a Survival Outcome.” In. Harnessing the power of causal inference; predictive analytics for survival outcomes with health administrative data: applications to tuberculosis research.\n\n\nJones, Mark, and Robert Fowler. 2016. “Immortal Time Bias in Observational Studies of Time-to-Event Outcomes.” Journal of Critical Care 36: 195–99.\n\n\nKarim, Mohammad Ehsanul, Paul Gustafson, John Petkau, Yinshan Zhao, Afsaneh Shirani, Elaine Kingwell, Charity Evans, Mia Van Der Kop, Joel Oger, and Helen Tremlett. 2014. 
“Marginal Structural Cox Models for Estimating the Association Between \\(\\beta\\)-Interferon Exposure and Disease Progression in a Multiple Sclerosis Cohort.” American Journal of Epidemiology 180 (2): 160–71.\n\n\nKumamaru, Hiraku, Sebastian Schneeweiss, Robert J Glynn, Soko Setoguchi, and Joshua J Gagne. 2016. “Dimension Reduction and Shrinkage Methods for High Dimensional Disease Risk Scores in Historical Data.” Emerging Themes in Epidemiology 13: 1–10.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology (Cambridge, Mass.) 20 (4): 512."
  },
  {
    "objectID": "hddrs1.html#step-0-analytic-data",
    "href": "hddrs1.html#step-0-analytic-data",
    "title": "21  hdDRS with a binary outcome",
    "section": "21.1 Step 0: Analytic data",
    "text": "21.1 Step 0: Analytic data\nWe will use the same NHANES data for this hdDRS demonstration as we did for the hdPS with a binary exposure and a binary outcome demonstration.\n\ndim(hdps.data)\n#> [1] 3839   78"
  },
  {
    "objectID": "hddrs1.html#step-6-disease-risk-score-drs",
    "href": "hddrs1.html#step-6-disease-risk-score-drs",
    "title": "21  hdDRS with a binary outcome",
    "section": "21.2 Step 6: Disease risk score (DRS)",
    "text": "21.2 Step 6: Disease risk score (DRS)\nHansen (2008) shows that the DRS has a balancing property, called prognostic balance. Individuals sharing a similar DRS value can be regarded as having the same risk/prognosis for the outcome. In contrast, the propensity score has a covariate balance property.\n\n\n(Hansen 2008)\n\n21.2.1 Create DRS formula\nWith a binary outcome, there are two approaches to estimate the disease risk score (DRS):\n\nFitting DRS model on unexposed individuals: Covariates include the investigator-specified and empirical covariates\nFitting DRS model on the full cohort: Covariates include the exposure, investigator-specified and empirical covariates.\n\nIn this example, we will focus on fitting the DRS model on unexposed individuals:\n\nhdps.data$outcome <- as.numeric(I(hdps.data$diabetes=='Yes'))\nproxy.list.sel <- names(out3$autoselected_covariate_df[,-1])\nproxyform <- paste0(proxy.list.sel, collapse = \"+\")\ncovform <- paste0(investigator.specified.covariates, collapse = \" + \")\n\n\nrhsformula <- paste0(c(covform, proxyform), collapse = \"+\")\ndrs.formula <- as.formula(paste0(\"outcome\", \"~\", rhsformula))\ndrs.formula\n#> outcome ~ age.cat + sex + education + race + marital + income + \n#>     born + year + diabetes.family.history + medical.access + \n#>     smoking + diet.healthy + physical.activity + sleep + uric.acid + \n#>     protein.total + bilirubin.total + phosphorus + sodium + potassium + \n#>     globulin + calcium.total + systolicBP + diastolicBP + high.cholesterol + \n#>     rec_dx_I10_once + rec_dx_R73_once + rec_dx_I10_frequent + \n#>     rec_dx_R60_once + rec_dx_E78_once + rec_dx_M79_once + rec_dx_I51_once + \n#>     rec_dx_M10_once + rec_dx_I50_once + rec_dx_K21_once + rec_dx_D75_once + \n#>     rec_dx_Z79_once + rec_dx_F41_once + rec_dx_M1A_once + rec_dx_E87_once + \n#>     rec_dx_R12_once + rec_dx_R51_once + rec_dx_J45_once + rec_dx_I50_frequent + \n#>     rec_dx_L70_once + rec_dx_M25_once + 
rec_dx_I63_once + rec_dx_R39_once + \n#>     rec_dx_N28_once + rec_dx_K25_once + rec_dx_F90_once + rec_dx_B00_once + \n#>     rec_dx_J42_once + rec_dx_R41_once + rec_dx_I20_once + rec_dx_M54_once + \n#>     rec_dx_J44_once + rec_dx_K08_once + rec_dx_I21_once + rec_dx_F32_once + \n#>     rec_dx_J30_once + rec_dx_F43_once + rec_dx_R06_once + rec_dx_I48_once + \n#>     rec_dx_R32_once + rec_dx_R42_once + rec_dx_N92_once + rec_dx_N95_once + \n#>     rec_dx_M19_once + rec_dx_E07_once + rec_dx_R25_once + rec_dx_G43_once + \n#>     rec_dx_R52_once + rec_dx_M81_once + rec_dx_T78_once\n\n\n\n21.2.2 Unexposed cohort\n\n# Unexposed cohort\ndat.unexposed <- subset(hdps.data, obese == \"No\")\n\n\n\n21.2.3 Fit DRS model\nHere we fit a logistic regression model on the unexposed individuals. As in the hdPS example, we include only main effects, with no transformations.\n\nfit.drs <- glm(drs.formula, data = dat.unexposed, family = binomial)\n\n\n\n21.2.4 Obtain DRS\nNow we can obtain the DRS for the full cohort and check the summary. Unlike in hdPS, balance checking is difficult in hdDRS analysis. Some matching and weighting techniques have been proposed for the DRS in which balance checking is possible, but these were developed with only investigator-specified covariates. In this demonstration, we adjust our outcome model for deciles of the DRS, which is a common practice in the hdDRS literature.\n\n\n(Wyss et al. 2015; Nguyen et al. 2024)\n\nhdps.data$drs <- predict(fit.drs, type = \"response\", newdata = hdps.data)\n\n# Summary\nsummary(hdps.data$drs)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#> 0.00000 0.01666 0.07709 0.19312 0.26457 0.99992\n\n# Summary by exposure status\ntapply(hdps.data$drs, hdps.data$obese, summary)\n#> $No\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#> 0.00000 0.01664 0.06980 0.17814 0.23025 0.99963 \n#> \n#> $Yes\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
\n#> 0.00000 0.01670 0.08679 0.21374 0.30472 0.99992"
  },
  {
    "objectID": "hddrs1.html#step-7-association",
    "href": "hddrs1.html#step-7-association",
    "title": "21  hdDRS with a binary outcome",
    "section": "21.3 Step 7: Association",
    "text": "21.3 Step 7: Association\n\n21.3.1 Adjtsing for deciles of DRS\n\n# Deciles of DRS\nhdps.data$drs.decile <- as.factor(dplyr::ntile(hdps.data$drs, 10))\n\n# Outcome analysis\nfit.hddrs <- glm(outcome ~ obese + drs.decile, \n                 data = hdps.data,\n                 family = binomial)\n\npublish(fit.hddrs, pvalue.method = \"robust\", confint.method = \"robust\", \n        print = F)$regressionTable[1:2,]\n\n\n\n  \n\n\n\n\n\n21.3.2 Adjtsing for deciles of DRS and covariates\nAnother popular approach is to adjust for deciles of the DRS as well as for the investigator-specified covariates:\n\n# Outcome analysis\nfit.hddrs1 <- glm(outcome ~ obese + drs.decile + age.cat + sex + education + \n                   race + marital + income + born + year + \n                   diabetes.family.history + medical.access + smoking + \n                   diet.healthy + physical.activity + sleep + uric.acid + \n                   protein.total + bilirubin.total + phosphorus + sodium + \n                   potassium + globulin + calcium.total + systolicBP + \n                   diastolicBP + high.cholesterol, data = hdps.data,\n                 family = binomial)\n\npublish(fit.hddrs1, pvalue.method = \"robust\", confint.method = \"robust\", \n        print = F)$regressionTable[1:2,]\n\n\n\n  \n\n\n\n\n\n\n\nHansen, Ben B. 2008. “The Prognostic Analogue of the Propensity Score.” Biometrika 95 (2): 481–88.\n\n\nHossain, Md Belal. 2025. “Chapter 2: High-Dimensional Disease Risk Score for Dealing with Residual Confounding in Estimating Treatment Effects with a Survival Outcome.” In. Harnessing the power of causal inference; predictive analytics for survival outcomes with health administrative data: applications to tuberculosis research.\n\n\nKumamaru, Hiraku, Joshua J Gagne, Robert J Glynn, Soko Setoguchi, and Sebastian Schneeweiss. 2016. 
“Comparison of High-Dimensional Confounder Summary Scores in Comparative Studies of Newly Marketed Medications.” Journal of Clinical Epidemiology 76: 200–208.\n\n\nNguyen, Tri-Long, Thomas PA Debray, Bora Youn, Gabrielle Simoneau, and Gary S Collins. 2024. “Confounder Adjustment Using the Disease Risk Score: A Proposal for Weighting Methods.” American Journal of Epidemiology 193 (2): 377–88.\n\n\nWyss, Richard, Alan R Ellis, M Alan Brookhart, Michele Jonsson Funk, Cynthia J Girman, Ross J Simpson Jr, and Til Stürmer. 2015. “Matching on the Disease Risk Score in Comparative Effectiveness Research of New Treatments.” Pharmacoepidemiology and Drug Safety 24 (9): 951–61."
  },
  {
    "objectID": "survival.html#step-0-analytic-data",
    "href": "survival.html#step-0-analytic-data",
    "title": "22  hdPS with a time-to-event outcome",
    "section": "22.1 Step 0: Analytic data",
    "text": "22.1 Step 0: Analytic data\nTo demonstrate the use of the hdPS analysis with a time-to-event outcome, we will use a simulated dataset. The example is to explore the relationship between arthritis (binary exposure) and CVD (time-to-event outcome).\nThe simulated dataset contains information on 3,000 individuals with the following variables:\n\nhead(simdat)"
  },
  {
    "objectID": "survival.html#step-1-proxy-sources",
    "href": "survival.html#step-1-proxy-sources",
    "title": "22  hdPS with a time-to-event outcome",
    "section": "22.2 Step 1: Proxy sources",
    "text": "22.2 Step 1: Proxy sources\n\n22.2.1 Data with investigator-specified covariates\nLet us check the summary statistics of the investigator-specified covariates, stratified by the exposure variable (arthritis).\n\n# Table 1\ntab1 <- CreateTableOne(vars = c(\"age\", \"sex\", \"comorbidity\"), \n                       strata = \"arthritis\", \n                       data = simdat, \n                       test = FALSE)\nprint(tab1, showAllLevels = TRUE, noSpaces = TRUE, quote = FALSE, smd = TRUE)\n#>                  Stratified by arthritis\n#>                   level  No           Yes           SMD  \n#>   n                      2143         857                \n#>   age (mean (SD))        48.88 (9.53) 53.65 (10.03) 0.487\n#>   sex (%)         Female 1189 (55.5)  592 (69.1)    0.283\n#>                   Male   954 (44.5)   265 (30.9)         \n#>   comorbidity (%) No     1629 (76.0)  596 (69.5)    0.146\n#>                   Yes    514 (24.0)   261 (30.5)\n\n# Bivariate table\nround(prop.table(table(arthritis = simdat$arthritis, CVD = simdat$cvd),\n                 margin = 1)*100, 2)\n#>          CVD\n#> arthritis     0     1\n#>       No  70.70 29.30\n#>       Yes 43.52 56.48\n\n\n\n22.2.2 Proxy data\nIn this example, we will use four data dimensions:\n\n3-digit diagnostic codes from hospital database (diag)\n3-digit procedure codes from hospital database (proc)\n3-digit icd codes from physician claim database (msp)\nDINPIN from drug dispensation database (din)\n\n\ntable(dat.proxy$dim)\n#> \n#>  diag   din   msp  proc \n#> 20000   269 44179   321\n\n\ndat.proxy <- dat.proxy[order(dat.proxy$studyid),]\ndat.proxy[5001:5010,]"
  },
  {
    "objectID": "survival.html#step-2-empirical-covariates",
    "href": "survival.html#step-2-empirical-covariates",
    "title": "22  hdPS with a time-to-event outcome",
    "section": "22.3 Step 2: Empirical covariates",
    "text": "22.3 Step 2: Empirical covariates\nThe same as before, top 200 covariates with highest prevalence are chosen.\n\nlibrary(autoCovariateSelection)\nid <- simdat$studyid\n\nstep1 <- get_candidate_covariates(df = dat.proxy, domainVarname = \"dim\", \n                                  eventCodeVarname = \"code\", \n                                  patientIdVarname = \"studyid\", \n                                  patientIdVector = id, \n                                  n = 200, \n                                  min_num_patients = 20)\nout1 <- step1$covars_data\nhead(out1)"
  },
  {
    "objectID": "survival.html#step-3-recurrence",
    "href": "survival.html#step-3-recurrence",
    "title": "22  hdPS with a time-to-event outcome",
    "section": "22.4 Step 3: Recurrence",
    "text": "22.4 Step 3: Recurrence\nIn this step, we generate a maximum of 3 binary recurrence covariates for each of the candidate proxy/code. We observed 401 recurrence covariates in this analysis.\n\n22.4.1 Assessing recurrence of codes\n\nall.equal(id, step1$patientIds)\n#> [1] TRUE\n\nstep2 <- get_recurrence_covariates(df = out1, \n                                   eventCodeVarname = \"code\", \n                                   patientIdVarname = \"studyid\",\n                                   patientIdVector = id)\nout2 <- step2$recurrence_data\ndim(out2)\n#> [1] 3000  402\n\n\n\n22.4.2 Recurrence covariates\n\nvars.empirical <- names(out2)[-1]\nhead(vars.empirical)\n#> [1] \"rec_diag_C44_once\" \"rec_diag_C81_once\" \"rec_diag_C83_once\"\n#> [4] \"rec_diag_E08_once\" \"rec_diag_E09_once\" \"rec_diag_E10_once\"\n\n\n\n22.4.3 Merging all recurrence covariates with the analytic dataset\n\nhdps.data <- merge(simdat, out2, by = \"studyid\", all.x = T)\ndim(hdps.data)\n#> [1] 3000  408"
  },
  {
    "objectID": "survival.html#step-4-prioritize",
    "href": "survival.html#step-4-prioritize",
    "title": "22  hdPS with a time-to-event outcome",
    "section": "22.5 Step 4: Prioritize",
    "text": "22.5 Step 4: Prioritize\nThe Bross formula requires the exposure, outcome, and proxy covariates to be binary. With the time-to-event outcome, we will use Cox-PH with LASSO regularization to prioritize the empirical covariates. The hyperparameter (\\(\\lambda\\)) will be selected using 5-fold cross-validation.\n\n22.5.1 Hyperparameter tuning\n\n# Formula with only empirical covariates\nformula.out <- as.formula(paste(\"Surv(follow_up, cvd) ~ \", \n                                paste(vars.empirical, collapse = \" + \")))\n\n# Model matrix for fitting Cox with LASSO regularization\nX <- model.matrix(formula.out, data = hdps.data)[,-1]\nY <- as.matrix(data.frame(time = hdps.data$follow_up, status = hdps.data$cvd))\n\n# Detect the number of cores\nn_cores <- parallel::detectCores()\n\n# Create a cluster of cores\ncl <- makeCluster(n_cores - 1)\n\n# Register the cluster for parallel processing\nregisterDoParallel(cl)\n\n# Hyperparameter tuning with 5-fold cross-validation \nset.seed(123)\nfit.lasso <- cv.glmnet(x = X, y = Y, nfolds = 5, parallel = T, alpha = 1,\n                       family = \"cox\")\nstopCluster(cl)\n\nplot(fit.lasso)\n\n\n\n\n## Best lambda\nfit.lasso$lambda.min\n#> [1] 0.03093478\n\n\n\n22.5.2 Variable ranking based on Cox-LASSO\n\nempvars.lasso <- coef(fit.lasso, s = fit.lasso$lambda.min) \nempvars.lasso <- data.frame(as.matrix(empvars.lasso))\nempvars.lasso <- data.frame(vars = rownames(empvars.lasso),\n                            coef = empvars.lasso)\ncolnames(empvars.lasso) <- c(\"vars\", \"coef\")\nrownames(empvars.lasso) <- NULL\n\n# Number of non-zero coefficients\ntable(empvars.lasso$coef != 0)\n#> \n#> FALSE \n#>   401\n\nSince proxies were random and unrelated to the simulated data, LASSO produced all zero coefficients. 
Let us choose an arbitrary value of \\(\\lambda\\) to demonstrate the process of variable selection.\n\nempvars.lasso <- coef(fit.lasso, s = exp(-6)) \nempvars.lasso <- data.frame(as.matrix(empvars.lasso))\nempvars.lasso <- data.frame(vars = rownames(empvars.lasso), \n                            coef = empvars.lasso)\ncolnames(empvars.lasso) <- c(\"vars\", \"coef\")\nrownames(empvars.lasso) <- NULL\nhead(empvars.lasso)\n\n\n\n  \n\n\n\n# Number of non-zero coefficients\ntable(empvars.lasso$coef != 0)\n#> \n#> FALSE  TRUE \n#>    71   330\n\n\n\n22.5.3 Rank empirical covariates\nNow we will rank the empirical covariates based on the absolute value of the log hazard ratio.\n\nempvars.lasso$coef.abs <- abs(empvars.lasso$coef)\nempvars.lasso <- empvars.lasso[order(empvars.lasso$coef.abs, decreasing = T),]\nhead(empvars.lasso)"
  },
  {
    "objectID": "survival.html#step-5-covariates",
    "href": "survival.html#step-5-covariates",
    "title": "22  hdPS with a time-to-event outcome",
    "section": "22.6 Step 5: Covariates",
    "text": "22.6 Step 5: Covariates\nWe used all investigator-specified covariates and the top 200 empirical covariates for the PS model. Again, this is a simplistic scenario where we only consider the main effects of the covariates.\n\n# Investigator-specified covariates\ninvestigator.vars <- c(\"age\", \"sex\", \"comorbidity\")\n\n# Top 200 empirical covariates section based on Cox-LASSO\nempirical.vars.lasso <- empvars.lasso$vars[1:200]\n\n# Investigator-specified and empirical covariates\nvars.hsps <- c(investigator.vars, empirical.vars.lasso)\nhead(vars.hsps)\n#> [1] \"age\"               \"sex\"               \"comorbidity\"      \n#> [4] \"rec_diag_V03_once\" \"rec_diag_S24_once\" \"rec_diag_W61_once\""
  },
  {
    "objectID": "survival.html#step-6-propensity-score",
    "href": "survival.html#step-6-propensity-score",
    "title": "22  hdPS with a time-to-event outcome",
    "section": "22.7 Step 6: Propensity score",
    "text": "22.7 Step 6: Propensity score\n\n22.7.1 Create propensity score formula\n\nps.formula <- as.formula(paste0(\"I(arthritis == 'Yes') ~ \", \n                                paste(vars.hsps, collapse = \"+\")))\n\n\n\n22.7.2 Fit PS model\n\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n                    data = hdps.data, \n                    estimand = \"ATE\",\n                    method = \"ps\", \n                  stabilize = T)\n\n\n\n22.7.3 Obtain PS\n\nhdps.data$ps <- W.out$ps\n\n\n\n22.7.4 Obtain weights\n\nhdps.data$w <- W.out$weights\n\n\n\n\n\n\n\n\n22.7.5 Assessing balance"
  },
  {
    "objectID": "survival.html#step-7-association",
    "href": "survival.html#step-7-association",
    "title": "22  hdPS with a time-to-event outcome",
    "section": "22.8 Step 7: Association",
    "text": "22.8 Step 7: Association\n\n22.8.1 Obtain HR\n\nlibrary(survival)\nlibrary(Publish)\nfit.hdps <- coxph(Surv(follow_up, cvd) ~ arthritis, \n                  weights = w,\n                  data = hdps.data)\npublish(fit.hdps, pvalue.method = \"robust\", confint.method = \"robust\", \n        print = F)$regressionTable[1:2,]\n\n\n\n  \n\n\n\n\n\n22.8.2 Obtain HR with survey package\n\nlibrary(survey)\n# Create a design\nsvy.design <- svydesign(id = ~1, weights = ~w, data = hdps.data)\n\n# Model\nfit.hdps1 <- svycoxph(Surv(follow_up, cvd) ~ arthritis, \n                      design = svy.design)\npublish(fit.hdps1, print = F)$regressionTable[1:2,]\n#> Independent Sampling design (with replacement)\n#> svydesign(id = ~1, weights = ~w, data = hdps.data)"
  },
  {
    "objectID": "hddrs2.html#step-0-analytic-data",
    "href": "hddrs2.html#step-0-analytic-data",
    "title": "23  hdDRS with a survival outcome",
    "section": "23.1 Step 0: Analytic data",
    "text": "23.1 Step 0: Analytic data\n\ndim(hdps.data)\n#> [1] 3000  410\n\n# Investigator-specified covariates\ninvestigator.vars\n#> [1] \"age\"         \"sex\"         \"comorbidity\"\n\n# Top 200 empirical covariates section based on Cox-LASSO\nhead(empirical.vars.lasso)\n#> [1] \"rec_diag_V03_once\" \"rec_diag_S24_once\" \"rec_diag_W61_once\"\n#> [4] \"rec_diag_C81_once\" \"rec_diag_M85_once\" \"rec_diag_H18_once\"\n\n# Investigator-specified and empirical covariates\nhead(vars.hsps)\n#> [1] \"age\"               \"sex\"               \"comorbidity\"      \n#> [4] \"rec_diag_V03_once\" \"rec_diag_S24_once\" \"rec_diag_W61_once\""
  },
  {
    "objectID": "hddrs2.html#step-6-disease-risk-score-drs",
    "href": "hddrs2.html#step-6-disease-risk-score-drs",
    "title": "23  hdDRS with a survival outcome",
    "section": "23.2 Step 6: Disease risk score (DRS)",
    "text": "23.2 Step 6: Disease risk score (DRS)\nThere are at least eight approaches to estimate the disease risk score (DRS):\n\nhdDRS-Full-Logistic: On the full cohort (both exposed and unexposed), fit logistic regression without considering the follow-up time. This model included the exposure, investigator-specified measured confounders, and the recurrence covariates. The DRS is calculated as the probability of the outcome by setting everyone as unexposed.\nhdDRS-Full-Survival: On the full cohort, fit the Cox-PH model with the exposure, investigator-specified measured confounders and the recurrence covariates. The DRS is calculated as the survival probability of the outcome by setting everyone as unexposed.\nhdDRS-Full-Hazard: On the full cohort, fit the Cox-PH model with the exposure, investigator-specified measured confounders and the recurrence covariates. The DRS is calculated as the hazard of the outcome by setting everyone unexposed.\nhdDRS-Full-Rate: On the full cohort, fit the modified Poisson regression with the exposure, an offset by the natural logarithm of follow-up time, investigator-specified measured confounders and the recurrence covariates. The DRS is calculated as the rate of the outcome by setting everyone as unexposed.\nhdDRS-Unexposed-Logistic: On the cohort with only unexposed, fit the logistic regression with the investigator-specified measured confounders and the recurrence covariates. The DRS is calculated as the probability of the outcome on the full cohort.\nhdDRS-Unexposed-Survival: On the cohort with only unexposed, fit the Cox-PH model with the investigator-specified measured confounders and the recurrence covariates. The DRS is calculated as the survival probability of the outcome on the full cohort.\nhdDRS-Unexposed-Hazard: On the cohort with only unexposed, fit the Cox-PH model with the investigator-specified measured confounders and the recurrence covariates. 
The DRS is calculated as the hazard of the outcome on the full cohort.\nhdDRS-Unexposed-Rate: On the cohort with only unexposed, fit the modified Poisson regression with an offset by the natural logarithm of follow-up time, investigator-specified measured confounders and the recurrence covariates. The DRS is calculated as the rate of the outcome on the full cohort.\n\nIn this example, we will focus on the rate-based approach using the unexposed cohort (hdDRS-Unexposed-Rate). To demonstrate how to apply all eight methods in a given scenario, reproducible R code on a simulated dataset is provided in the GitHub folder.\n\n\n(Hansen 2008; Zhang and Kim 2019; Hossain 2025)\n\n23.2.1 Unexposed cohort\n\n# Unexposed cohort\ndat.unexposed <- subset(hdps.data, arthritis == \"No\")\n\n# Offset (fixed at log(1) = 0)\ndat.unexposed$log.offset <- log(1)\n\n# Full cohort\ndat.full <- hdps.data\n\n# Offset (fixed at log(1) = 0)\ndat.full$log.offset <- log(1)\n\n\n\n23.2.2 Create DRS formula\n\n# Covariates\nvars.hsdrs <- c(investigator.vars, empirical.vars.lasso)\n\n# Formula\ndrs.formula <- as.formula(paste0(\"cvd ~ offset(log.offset) + \", \n                                 paste(vars.hsdrs, collapse = \"+\")))\n\n\n\n23.2.3 Fit DRS model\n\nfit.drs <- glm(drs.formula, data = dat.unexposed, family = poisson)\nfit.drs$coefficients[is.na(fit.drs$coefficients)] <- 0\n\n\n\n23.2.4 Obtain DRS\n\ndat.full$drs <- predict(fit.drs, type = \"response\", newdata = dat.full)\n\n# Summary\nsummary(dat.full$drs)\n#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. \n#> 0.004198 0.089762 0.232163 0.324619 0.466890 4.085130\n\n# Summary by exposure status\ntapply(dat.full$drs, dat.full$arthritis, summary)\n#> $No\n#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. \n#> 0.004198 0.081598 0.206067 0.293047 0.433756 1.776528 \n#> \n#> $Yes\n#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. \n#> 0.006824 0.129409 0.298506 0.403565 0.568061 4.085130"
  },
  {
    "objectID": "hddrs2.html#step-7-association",
    "href": "hddrs2.html#step-7-association",
    "title": "23  hdDRS with a survival outcome",
    "section": "23.3 Step 7: Association",
    "text": "23.3 Step 7: Association\n\n# Deciles of DRS\ndat.full$drs.decile <- as.factor(dplyr::ntile(dat.full$drs, 10))\n\n# Outcome analysis\nfit.hddrs <- coxph(Surv(follow_up, cvd) ~ arthritis + drs.decile + age + sex + \n                     comorbidity, data = dat.full)\n\npublish(fit.hddrs, pvalue.method = \"robust\", confint.method = \"robust\", \n        print = F)$regressionTable[1:2,]\n\n\n\n  \n\n\n\n\n\n\n\nHansen, Ben B. 2008. “The Prognostic Analogue of the Propensity Score.” Biometrika 95 (2): 481–88.\n\n\nHossain, Md Belal. 2025. “Chapter 2: High-Dimensional Disease Risk Score for Dealing with Residual Confounding in Estimating Treatment Effects with a Survival Outcome.” In. Harnessing the power of causal inference; predictive analytics for survival outcomes with health administrative data: applications to tuberculosis research.\n\n\nZhang, Di, and Jessica Kim. 2019. “Use of Propensity Score and Disease Risk Score for Multiple Treatments with Time-to-Event Outcome: A Simulation Study.” Journal of Biopharmaceutical Statistics 29 (6): 1103–15."
  },
  {
    "objectID": "survival2.html",
    "href": "survival2.html",
    "title": "24  Application in MS",
    "section": "",
    "text": "A recent article summarizes the estimated hazard ratios (HRs) for the association between exposure to any disease-modifying drug (DMD) and time from cohort entry (index date) to all-cause mortality among individuals with multiple sclerosis (MS) in British Columbia, Canada (1996–2017), using different methods of confounding adjustment (Karim et al. 2025).\n\nUnadjusted analysis suggested a strong protective effect (HR 0.31, 95% CI: 0.27–0.36).\nInvestigator-specified covariate adjustment attenuated this effect (aHR 0.76, 95% CI: 0.65–0.89).\nHigh-dimensional propensity score (hdPS) methods further adjusted using empirically selected covariates:\n\nhdPS-1 to hdPS-3 yielded aHRs ranging from 0.77 to 0.80.\n\nHigh-dimensional disease risk score (hdDRS) methods produced similar estimates:\n\nhdDRS-1 to hdDRS-3 resulted in aHRs between 0.79 and 0.81.\n\n\n\n\n\n\n\n\nTip\n\n\n\n(Karim et al. 2025)\n\n\n\n\n\n\n\n\n\n\n\n\nDespite adding high-dimensional proxy data, effect estimates changed only slightly from those using conventional covariate adjustment. These results suggest that residual confounding may be modest or that available proxies are insufficient to capture unmeasured confounders fully.\n\n\n\n\nKarim, Mohammad Ehsanul, Md Belal Hossain, Huah Shin Ng, Feng Zhu, Hanna A Frank, and Helen Tremlett. 2025. “Evaluating the Role of High-Dimensional Proxy Data in Confounding Adjustment in Multiple Sclerosis Research: A Case Study.” Pharmacoepidemiology and Drug Safety 34 (2): e70112."
  },
  {
    "objectID": "ncc.html#time-dependent-cox-regression",
    "href": "ncc.html#time-dependent-cox-regression",
    "title": "25  Time-dependent exposure",
    "section": "25.1 Time-dependent Cox regression",
    "text": "25.1 Time-dependent Cox regression\nLet us start with an example of exploring the relationship between disease-modifying drugs (DMDs) for multiple sclerosis and long-term mortality. The DMD exposure is a time-dependent variable, and the mortality outcome is a time-to-event outcome. Employing Cox proportional hazards models with time-varying exposure to DMDs can address immortal time bias in this example.\n\n\n\n\n\n\nTime-dependent exposure\n\n\n\n\nTime-dependent Cox regression with time-varying exposure can help mitigate immortal time bias.\nThe hdPS approach is used to deal with residual confounding with a binary ‘time-fixed’ treatment\nWith a ‘time-dependent’ exposure, implementing the hdPS in conjunction with the time-dependent Cox regression presents a methodological and practical challenge.\n\nSee the associated article for more details (Hossain et al. 2025)."
  },
  {
    "objectID": "ncc.html#nested-case-control-ncc",
    "href": "ncc.html#nested-case-control-ncc",
    "title": "25  Time-dependent exposure",
    "section": "25.2 Nested case-control (NCC)",
    "text": "25.2 Nested case-control (NCC)\nThe nested case-control (NCC) design is a well-established method for addressing immortal time bias with a time-dependent exposure. The NCC framework provides a robust alternative for addressing immortal time bias, while allowing for the integration of hdPS analysis to minimize the residual confounding bias.\n\n\n(Austin et al. 2012; Ernster 1994; Hossain et al. 2024)\n\n\n\n\n\n\nNCC\n\n\n\n\nMatch subjects who experienced the event of interest (called cases) to a subset of event-free subjects (called controls) using incidence density sampling\nSome controls could later become cases themselves and also serve as controls for other cases\nFour controls per case has been shown to provide near-optimal statistical efficiency without the need for the full cohort analysis"
  },
  {
    "objectID": "ncc.html#hdps-in-the-ncc-framework",
    "href": "ncc.html#hdps-in-the-ncc-framework",
    "title": "25  Time-dependent exposure",
    "section": "25.3 hdPS in the NCC framework",
    "text": "25.3 hdPS in the NCC framework\nThe time-dependent exposure status becomes a time-independent exposure variable in the NCC analysis. Hence, we could implement the hdPS technique in the NCC framework to deal with residual confounding bias.\n\n\n(Hossain et al. 2025)\n\n25.3.1 Step 0: Analytic data\nTo demonstrate the use of hdPS analysis with a time-dependent exposure, we will use a simulated dataset. This example explores the relationship between exposure to disease-modifying drugs (DMDs) for multiple sclerosis and all-cause mortality.\n\n25.3.1.1 Dataset with time-dependent exposure\n\nhead(simdat)\n\n\n\n  \n\n\n\n\n\n25.3.1.2 NCC with 4 control per case\nLet us use the nested case-control (NCC) design with 4 controls per case. The ccwc function from the Epi package is used to create the nested case-control dataset. The ccwc function requires the following arguments:\n\norigin: The time origin for the study\nentry: The time of entry into the study\nexit: The follow-up time\nfail: The event of interest\ncontrols: The number of controls per case\nmatch: The variables to match on\n\n\nlibrary(Epi)\nset.seed(100)\n\ndat.ncc <- ccwc(\n  origin = 0,\n  entry = 0,\n  exit = follow_up,\n  fail = mortality_outcome,\n  controls = 4, \n  match = list(ses, cci, year),\n  include = list(id, follow_up, mortality_outcome, anyDMD, yrs_anyDMD, \n                 sex, age),\n  data = simdat,\n  silent = T\n  )\n\n# Drop those experienced the event before being exposed\ndat.ncc$anyDMD[dat.ncc$yrs_anyDMD > dat.ncc$Time] <- NA\ndat.ncc <- dat.ncc[complete.cases(dat.ncc$anyDMD),]\n\ndat.ncc[1:10,]\n\n\n\n  \n\n\n\n\n# Rows\ndim(simdat)\n#> [1] 19000    10\ndim(dat.ncc)\n#> [1] 14370    14\n\n# Mortality status\ntable(simdat$mortality_outcome)\n#> \n#>     0     1 \n#> 15947  3053\ntable(dat.ncc$Fail)\n#> \n#>     0     1 \n#> 11317  3053\n\n\n\n\n25.3.2 Step 1: Proxy sources\nIn this example, we will use four data dimensions:\n\n3-digit diagnostic codes from hospital 
database (diag)\n3-digit procedure codes from hospital database (proc)\n3-digit ICD codes from physician claim database (msp)\nDINPIN from drug dispensation database (din)\n\n\ntable(dat.proxy$dim)\n#> \n#>  diag   din   msp  proc \n#> 10000   125 22135   158\n\n\n\n25.3.3 Step 2: Empirical covariates\n\nlibrary(autoCovariateSelection)\nid <- simdat$id\n\nstep1 <- get_candidate_covariates(df = dat.proxy, domainVarname = \"dim\", \n                                  eventCodeVarname = \"code\", \n                                  patientIdVarname = \"id\", \n                                  patientIdVector = id, \n                                  n = 1000, \n                                  min_num_patients = 20)\nout1 <- step1$covars_data\nhead(out1)\n\n\n\n  \n\n\n\n\n\n25.3.4 Step 3: Recurrence\nLet us generate the binary recurrence covariates.\n\nall.equal(id, step1$patientIds)\n#> [1] TRUE\n\n# Assessing recurrence of codes\nstep2 <- get_recurrence_covariates(df = out1, \n                                   eventCodeVarname = \"code\", \n                                   patientIdVarname = \"id\",\n                                   patientIdVector = id)\nout2 <- step2$recurrence_data\ndim(out2)\n#> [1] 19000   454\n\n\n# Recurrence covariates\nvars.empirical <- names(out2)[-1]\nhead(vars.empirical)\n#> [1] \"rec_diag_H02_once\" \"rec_diag_H18_once\" \"rec_diag_H35_once\"\n#> [4] \"rec_diag_H40_once\" \"rec_diag_H44_once\" \"rec_diag_I69_once\"\n\n\n25.3.4.1 Merging all recurrence covariates with the analytic dataset\n\nhdps.data <- merge(dat.ncc, out2, by = \"id\", all.x = T)\ndim(hdps.data)\n#> [1] 14370   467\n\n\n\n\n25.3.5 Step 4: Prioritize\nWe will use Cox-PH with LASSO regularization to prioritize the empirical covariates. 
The hyperparameter (\\(\\lambda\\)) will be selected using 5-fold cross-validation.\n\n25.3.5.1 Hyperparameter tuning\n\n# Packages used in this step\nlibrary(survival)\nlibrary(glmnet)\nlibrary(doParallel)\n\n# Formula with only empirical covariates\nformula.out <- as.formula(paste(\"Surv(Time, Fail) ~ \", \n                                paste(vars.empirical, collapse = \" + \")))\n\n# Model matrix for fitting Cox with LASSO regularization\nX <- model.matrix(formula.out, data = hdps.data)[,-1]\nY <- as.matrix(data.frame(time = hdps.data$Time, status = hdps.data$Fail))\n\n# Detect the number of cores\nn_cores <- parallel::detectCores()\n\n# Create a cluster of cores (always at least one worker)\ncl <- makeCluster(max(1, n_cores - 1))\n\n# Register the cluster for parallel processing\nregisterDoParallel(cl)\n\n# Hyperparameter tuning with 5-fold cross-validation \nset.seed(123)\nfit.lasso <- cv.glmnet(x = X, y = Y, nfolds = 5, parallel = T, alpha = 1, \n                       family = \"cox\")\nstopCluster(cl)\n\nplot(fit.lasso)\n\n\n\n\n## Best lambda\nfit.lasso$lambda.min\n#> [1] 0.01042345\n\n\n\n25.3.5.2 Variable ranking based on Cox-LASSO\n\nempvars.lasso <- coef(fit.lasso, s = fit.lasso$lambda.min) \nempvars.lasso <- data.frame(as.matrix(empvars.lasso))\nempvars.lasso <- data.frame(vars = rownames(empvars.lasso), \n                            coef = empvars.lasso)\ncolnames(empvars.lasso) <- c(\"vars\", \"coef\")\nrownames(empvars.lasso) <- NULL\n\n# Number of non-zero coefficients\ntable(empvars.lasso$coef != 0)\n#> \n#> FALSE  TRUE \n#>   442    11\n\nSince proxies were random and unrelated to the simulated data, LASSO produced only 11 non-zero coefficients. 
Let us choose an arbitrary value of \\(\\lambda\\) to demonstrate the process of variable selection.\n\nempvars.lasso <- coef(fit.lasso, s = exp(-6)) \nempvars.lasso <- data.frame(as.matrix(empvars.lasso))\nempvars.lasso <- data.frame(vars = rownames(empvars.lasso), \n                            coef = empvars.lasso)\ncolnames(empvars.lasso) <- c(\"vars\", \"coef\")\nrownames(empvars.lasso) <- NULL\nhead(empvars.lasso)\n\n\n\n  \n\n\n\n# Number of non-zero coefficients\ntable(empvars.lasso$coef != 0)\n#> \n#> FALSE  TRUE \n#>   196   257\n\n\n\n25.3.5.3 Rank empirical covariates\nNow we will rank the empirical covariates based on the absolute value of the log hazard ratio.\n\nempvars.lasso$coef.abs <- abs(empvars.lasso$coef)\nempvars.lasso <- empvars.lasso[order(empvars.lasso$coef.abs, decreasing = T),]\nhead(empvars.lasso)\n\n\n\n  \n\n\n\n\n\n\n25.3.6 Step 5: Covariates\nWe used all investigator-specified covariates and the top 200 empirical covariates for the PS model. Again, this is a simplistic scenario where we only consider the main effects of the covariates.\n\n# Investigator-specified covariates\ninvestigator.vars <- c(\"sex\", \"age\")\n\n# Top 200 empirical covariates selected based on Cox-LASSO\nempirical.vars.lasso <- empvars.lasso$vars[1:200]\n\n# Investigator-specified and empirical covariates\nvars.hsps <- c(investigator.vars, empirical.vars.lasso)\nhead(vars.hsps)\n#> [1] \"sex\"               \"age\"               \"rec_diag_T44_once\"\n#> [4] \"rec_diag_O41_once\" \"rec_msp_793_once\"  \"rec_msp_459_once\"\n\n\n\n25.3.7 Step 6: Propensity score\n\n25.3.7.1 Create propensity score formula\n\nps.formula <- as.formula(paste0(\"I(anyDMD == 'Yes') ~ \", \n                                paste(vars.hsps, collapse = \"+\")))\n\n\n\n25.3.7.2 Fit PS model\n\nrequire(WeightIt)\nW.out <- weightit(ps.formula, \n                  data = hdps.data, \n                  estimand = \"ATE\",\n                  method = \"ps\",\n                  stabilize = T)\n\n\n\n25.3.7.3 Obtain 
PS\n\nhdps.data$ps <- W.out$ps\n\n\n\n25.3.7.4 Obtain weights\n\nhdps.data$w <- W.out$weights\n\n\n\n\n\n\n\n\n25.3.7.5 Assessing balance\n\n\n\n\n\n\nlibrary(survey)\nlibrary(tableone) # provides svyCreateTableOne\n\n# Balance checking for investigator-specified covariates\ndesign.ipw <- svydesign(ids = ~id, weights = ~w, data = hdps.data)\ntab.ipw <- svyCreateTableOne(vars = investigator.vars, \n                             strata = \"anyDMD\", \n                             data = design.ipw, \n                             test = F)\nprint(tab.ipw, smd = T) # Age and sex are balanced\n#>                  Stratified by anyDMD\n#>                   No               Yes             SMD   \n#>   n                11035.4          3324.4               \n#>   sex = Male (%)    3199.3 (29.0)   1007.3 (30.3)   0.029\n#>   age (mean (SD))    45.58 (13.75)   45.67 (13.48)  0.007\n\n\n\n\n25.3.8 Step 7: Association\n\n25.3.8.1 Obtain HR\nWe can fit the Cox-PH model, adjusting for the matched strata.\n\nlibrary(survival)\nlibrary(Publish)\nfit.hdps <- coxph(Surv(Time, Fail) ~ anyDMD + strata(Set), \n                  weights = w,\n                  data = hdps.data)\npublish(fit.hdps, pvalue.method = \"robust\", confint.method = \"robust\", \n        print = F)$regressionTable[1:2,]\n\n\n\n  \n\n\n\n\n\n25.3.8.2 Obtain HR with conditional logistic\nFor the NCC analysis, an alternative to the stratified Cox-PH model is to use the conditional logistic regression. The HR and SE from both models should be similar under the proportional hazards assumption.\n\n\n(Liu et al. 2010)\n\nfit.hdps1 <- clogit(Fail ~ anyDMD + strata(Set), \n                    weights = w, \n                    data = hdps.data, \n                    method = \"efron\")\npublish(fit.hdps1, pvalue.method = \"robust\", confint.method = \"robust\", \n        print = F)$regressionTable[1:2,]\n\n\n\n  \n\n\n\n\n\n\n\nAustin, Peter C, Geoffrey M Anderson, Candemir Cigsar, and Andrea Gruneir. 2012. 
“Comparing the Cohort Design and the Nested Case–Control Design in the Presence of Both Time-Invariant and Time-Dependent Treatment and Competing Risks: Bias and Precision.” Pharmacoepidemiology and Drug Safety 21 (7): 714–24.\n\n\nErnster, Virginia L. 1994. “Nested Case-Control Studies.” Preventive Medicine 23 (5): 587–90.\n\n\nHossain, Md Belal, Huah Shin Ng, Feng Zhu, Helen Tremlett, and Mohammad Ehsanul Karim. 2025. “Simultaneously Dealing with Immortal Time Bias and Residual Confounding: A Case Study of a High-Dimensional Propensity Score Approach with a Nested Case–Control Framework in Multiple Sclerosis Research.” Pharmacoepidemiology and Drug Safety 34 (7): e70174.\n\n\nHossain, Md Belal, Hubert Wong, Mohsen Sadatsafavi, James C Johnston, Victoria J Cook, and Mohammad Ehsanul Karim. 2024. “Benefits of Repeated Matched-Cohort and Nested Case–Control Analyses with Time-Dependent Exposure in Observational Studies.” Statistics in Biosciences, 1–29.\n\n\nLiu, Mengling, Wenbin Lu, Roy E Shore, and Anne Zeleniuch-Jacquotte. 2010. “Cox Regression Model with Time-Varying Coefficients in Nested Case–Control Studies.” Biostatistics 11 (4): 693–706."
  },
  {
    "objectID": "prediction.html",
    "href": "prediction.html",
    "title": "26  High-Dimensional Prediction Models",
    "section": "",
    "text": "Recent article proposed and evaluated high-dimensional prediction models (hdPMs) using linked health administrative data to predict long-term mortality risk (Hossain et al. 2025). Their key objective was to assess whether hdPMs could compensate for the absence of important clinical predictors (e.g., age, smoking, BMI) by using a large number of routinely collected health-care variables (e.g., ICD-9/10 codes).\n\n\n\n\n\n\nTip\n\n\n\n(Hossain et al. 2025)\n\n\n\n\n\n\n\nBased on simulations, their findings showed that Cox-LASSO hdPMs consistently outperformed conventional models in both discrimination (time-dependent c-statistic) and calibration, especially when strong clinical predictors were missing. For example, the c-statistic improved from 0.78 (conventional model) to 0.90 (LASSO-based hdPM) in simulations.\n\n\n\n\n\n\n\n\nFeature\nhdPM\nhdPS / hdDRS\n\n\n\n\nGoal\nRisk prediction (e.g., mortality stratification)\nConfounding adjustment in causal inference\n\n\nTarget\nOutcome model (e.g., Cox model for time to death)\nExposure model (PS) or outcome model (DRS)\n\n\nUse of empirical vars\nYes, extensive ICD-9/10 code-based variables\nYes, but fewer (typically 500 top-ranking)\n\n\nShrinkage used\nRegularization (LASSO) crucial for performance\nOften none; or simple score summaries\n\n\nInterpretability\nLess interpretable, not designed for clinical use\nOften more interpretable in PS/DRS context\n\n\nMain application\nStratification, risk targeting at population level\nAdjustment in comparative effectiveness studies\n\n\n\nThis study shows that hdPMs are promising tools for population-level risk prediction, especially when clinical data are sparse. While hdPS/hdDRS target confounding control in causal inference, hdPMs aim to optimize outcome prediction, even when important variables are missing. 
The use of LASSO-based regularization is a key differentiator that enables hdPMs to avoid overfitting in high-dimensional spaces.\n\n\n\n\nHossain, Md Belal, Mohsen Sadatsafavi, Hubert Wong, Victoria J Cook, James C Johnston, and Mohammad Ehsanul Karim. 2025. “Enhancing Risk Prediction Base on Health Administrative Data Using High-Dimensional Prediction Model.” Journal of Clinical Epidemiology, 111857."
  },
  {
    "objectID": "guideline.html#reviews-and-guidelines",
    "href": "guideline.html#reviews-and-guidelines",
    "title": "Guideline",
    "section": "Reviews and Guidelines",
    "text": "Reviews and Guidelines\n\n\n\n\n\nflowchart LR\n  r[Reviews] --> p1(Wyss et al. 2022<br>Pharmacoepidemiol Drug Saf.)\n  r --> p0(Schneeweiss et al. 2018<br>Clin Epidemiol.)\n\n  g[Guideline] --> p2(Rassen et al. 2022<br>Pharmacoepidemiol Drug Saf.)\n  g --> p3(Tazare et al. 2022<br>Pharmacoepidemiol Drug Saf.)\n\n  %% Define style classes\n  classDef redNode fill:#f44,stroke-width:2px,stroke:#f00,color:#fff\n  classDef maroonNode fill:#b03,stroke-width:2px,stroke:#600,color:#fff\n\n  %% Apply classes to nodes\n  class p0,p1 redNode\n  class p2,p3 maroonNode\n\n\n\n\n\n\n\n\n\n\n\n\n\n(Wyss et al. 2022; Rassen et al. 2023; Tazare et al. 2022; Schneeweiss 2018)\n\n\nRassen, Jeremy A, Patrick Blin, Sebastian Kloss, Romain S Neugebauer, Robert W Platt, Anton Pottegård, Sebastian Schneeweiss, and Sengwee Toh. 2023. “High-Dimensional Propensity Scores for Empirical Covariate Selection in Secondary Database Studies: Planning, Implementation, and Reporting.” Pharmacoepidemiology and Drug Safety 32 (2): 93–106.\n\n\nSchneeweiss, Sebastian. 2018. “Automated Data-Adaptive Analytics for Electronic Healthcare Data to Study Causal Treatment Effects.” Clinical Epidemiology, 771–88.\n\n\nTazare, John, Richard Wyss, Jessica M Franklin, Liam Smeeth, Stephen JW Evans, Shirley V Wang, Sebastian Schneeweiss, Ian J Douglas, Joshua J Gagne, and Elizabeth J Williamson. 2022. “Transparency of High-Dimensional Propensity Score Analyses: Guidance for Diagnostics and Reporting.” Pharmacoepidemiology and Drug Safety 31 (4): 411–23.\n\n\nWyss, Richard, Chen Yanover, Tal El-Hay, Dimitri Bennett, Robert W Platt, Andrew R Zullo, Grammati Sari, et al. 2022. “Machine Learning for Improving High-Dimensional Proxy Confounder Adjustment in Healthcare Database Studies: An Overview of the Current Literature.” Pharmacoepidemiology and Drug Safety 31 (9): 932–43."
  },
  {
    "objectID": "report.html#analysis-information",
    "href": "report.html#analysis-information",
    "title": "27  Reporting",
    "section": "27.1 Analysis Information",
    "text": "27.1 Analysis Information\nMany reporting guideline already exists about what to report in a propensity score analysis. Most of the reporting guideline should be applicable in the hdPS context as well. On top of those, we also need to consider reporting the following information about hdPS for transparency.\n\n\n(Karim et al. 2022; Simoneau et al. 2022)"
  },
  {
    "objectID": "report.html#information-about-hdps",
    "href": "report.html#information-about-hdps",
    "title": "27  Reporting",
    "section": "27.2 Information about hdPS",
    "text": "27.2 Information about hdPS\n\n\n\n\n\nInformation\nDescription\nOur example\n\n\n\n\nProxy data dimensions\nThe number of data dimensions (p) used.\np = 1 as only one proxy data dimension (dx) was available from medication usage\n\n\nWhat was done to remove proxies that are problematic\nUsually proxies of outcome, exposure as well as those identified as IV or mediator or collider are discarded.\nobesity and diabetes related codes removed\n\n\nProxy feature parameters\nThe parameters used to select proxy features, including granularity [g], prevalence filter [n], and the minimum number of patients [m]\ng = 3, n = 200, m = 20. This resulted in 126 empirical covariates.\n\n\nRecurrence parameters\nHow many recurrence variables per code [r] and the covariate assessment period [CAP]\nr = 3, CAP = 30 days. This resulted in 143 distinct recurrence covariates.\n\n\nPrioritization process\nThe process used to prioritize proxy features, such as machine learning (ML), Bross, or hybrid methods\nWe used all of these, but used Bross formula for hdPS to calculate absolute log of the multiplicative bias, and then ranked based on magnitude to select / prioritize recurrence covariates.\n\n\nSelected proxies\nThe number of proxies selected (k) for the model\nk = 100 for the hdPS\n\n\nSoftware\nThe software used to perform the analysis\nR: autoCovariateSelection package\n\n\n\n\n\n(Rassen et al. 2023; Tazare et al. 2022)"
  },
  {
    "objectID": "report.html#diagnostics",
    "href": "report.html#diagnostics",
    "title": "27  Reporting",
    "section": "27.3 Diagnostics",
    "text": "27.3 Diagnostics\n\n\n\nInformation\nDescription\nOur example\n\n\n\n\nDiagnostics used to assess the model\nStandardized mean differences (SMD)\nWithin 0.1 in hdPS analysis\n\n\n\nWeight (IPW) summary assessment\nSomewhat reasonable (maximum approximately 54) within hdPS analysis\n\n\n\nComparison of propensity score distributions between each exposure group\nOverlapping (common support) does not seem to be an issue.\n\n\n\nAssess distribution of absolute log bias\nMost bias multiplier values are close to null (0), only a few values seem to deviate from null.\n\n\n\nComparison with regular propensity score\nEstimates slightly towards null"
  },
  {
    "objectID": "report.html#sensitivity-analysis",
    "href": "report.html#sensitivity-analysis",
    "title": "27  Reporting",
    "section": "27.4 Sensitivity analysis",
    "text": "27.4 Sensitivity analysis\n\n\n\nInformation\nDescription\nOur example\n\n\n\n\nSensitivity analysis\nVarying the number of selected proxies [k].\nOR estimates stabilizes around 1.5, shows variability below k = 50 and above 110\n\n\nSensitivity analysis\nVarying the prevalence filter [n].\nOR estimates stabilizes around 1.5 for above n = 60.\n\n\n\n\n\n\n\nKarim, Mohammad Ehsanul, Fabio Pellegrini, Robert W Platt, Gabrielle Simoneau, Julie Rouette, and Carl de Moor. 2022. “The Use and Quality of Reporting of Propensity Score Methods in Multiple Sclerosis Literature: A Review.” Multiple Sclerosis Journal 28 (9): 1317–23.\n\n\nRassen, Jeremy A, Patrick Blin, Sebastian Kloss, Romain S Neugebauer, Robert W Platt, Anton Pottegård, Sebastian Schneeweiss, and Sengwee Toh. 2023. “High-Dimensional Propensity Scores for Empirical Covariate Selection in Secondary Database Studies: Planning, Implementation, and Reporting.” Pharmacoepidemiology and Drug Safety 32 (2): 93–106.\n\n\nSimoneau, Gabrielle, Fabio Pellegrini, Thomas PA Debray, Julie Rouette, Johanna Muñoz, Robert W Platt, John Petkau, et al. 2022. “Recommendations for the Use of Propensity Score Methods in Multiple Sclerosis Research.” Multiple Sclerosis Journal 28 (9): 1467–80.\n\n\nTazare, John, Richard Wyss, Jessica M Franklin, Liam Smeeth, Stephen JW Evans, Shirley V Wang, Sebastian Schneeweiss, Ian J Douglas, Joshua J Gagne, and Elizabeth J Williamson. 2022. “Transparency of High-Dimensional Propensity Score Analyses: Guidance for Diagnostics and Reporting.” Pharmacoepidemiology and Drug Safety 31 (4): 411–23."
  },
  {
    "objectID": "references.html",
    "href": "references.html",
    "title": "References",
    "section": "",
    "text": "Austin, Peter C, Geoffrey M Anderson, Candemir Cigsar, and Andrea\nGruneir. 2012. “Comparing the Cohort Design and the Nested\nCase–Control Design in the Presence of Both Time-Invariant and\nTime-Dependent Treatment and Competing Risks: Bias and\nPrecision.” Pharmacoepidemiology and Drug Safety 21 (7):\n714–24.\n\n\nBalzer, Laura B, and Ted Westling. 2021. “Demystifying Statistical\nInference When Using Machine Learning in Causal Research.”\nAmerican Journal of Epidemiology.\n\n\nBenasseur, Imane, Denis Talbot, Madeleine Durand, Anne Holbrook, Alexis\nMatteau, Brian J Potter, Christel Renoux, Mireille E Schnitzer,\nJean-Éric Tarride, and Jason R Guertin. 2022. “A Comparison of\nConfounder Selection and Adjustment Methods for Estimating Causal\nEffects Using Large Healthcare Databases.”\nPharmacoepidemiology and Drug Safety 31 (4): 424–33.\n\n\nBeyersmann, Jan, Martin Wolkewitz, and Martin Schumacher. 2008.\n“The Impact of Time-Dependent Bias in Proportional Hazards\nModelling.” Statistics in Medicine 27 (30): 6439–54.\n\n\nBrookhart, M Alan, Sebastian Schneeweiss, Kenneth J Rothman, Robert J\nGlynn, Jerry Avorn, and Til Stürmer. 2006. “Variable Selection for\nPropensity Score Models.” American Journal of\nEpidemiology 163 (12): 1149–56.\n\n\nBross, Irwin DJ. 1966. “Spurious Effects from an Extraneous\nVariable.” Journal of Chronic Diseases 19 (6): 637–47.\n\n\nCharlson, Mary E, Peter Pompei, Kathy L Ales, and C Ronald MacKenzie.\n1987. “A New Method of Classifying Prognostic Comorbidity in\nLongitudinal Studies: Development and Validation.” Journal of\nChronic Diseases 40 (5): 373–83.\n\n\nChoi, BCK, and F Shi. 2001. “Risk Factors for Diabetes Mellitus by\nAge and Sex: Results of the National Population Health Survey.”\nDiabetologia 44: 1221–31.\n\n\nConnolly, John G, Sebastian Schneeweiss, Robert J Glynn, and Joshua J\nGagne. 2019. 
“Quantifying Bias Reduction with Fixed-Duration\nVersus All-Available Covariate Assessment Periods.”\nPharmacoepidemiology and Drug Safety 28 (5): 665–70.\n\n\nDisease Control, Centers for, and Prevention. 2021. “National\nHealth and Nutrition Examination Survey (NHANES).” National\nCenter for Health Statistics.\n\n\nElixhauser, Anne, Claudia Steiner, D Robert Harris, and Rosanna M\nCoffey. 1998. “Comorbidity Measures for Use with Administrative\nData.” Medical Care, 8–27.\n\n\nErnster, Virginia L. 1994. “Nested Case-Control Studies.”\nPreventive Medicine 23 (5): 587–90.\n\n\nFranklin, Jessica M, Wesley Eddings, Robert J Glynn, and Sebastian\nSchneeweiss. 2015. “Regularized Regression Versus the\nHigh-Dimensional Propensity Score for Confounding Adjustment in\nSecondary Database Analyses.” American Journal of\nEpidemiology 182 (7): 651–59.\n\n\nGreenland, Sander, Judea Pearl, and James M Robins. 1999. “Causal\nDiagrams for Epidemiologic Research.” Epidemiology,\n37–48.\n\n\nHansen, Ben B. 2008. “The Prognostic Analogue of the Propensity\nScore.” Biometrika 95 (2): 481–88.\n\n\nHernán, Miguel A, and Sarah L Taubman. 2008. “Does Obesity Shorten\nLife? The Importance of Well-Defined Interventions to Answer Causal\nQuestions.” International Journal of Obesity 32 (3):\nS8–14.\n\n\nHossain, Md Belal. 2025. “Chapter 2: High-Dimensional Disease Risk\nScore for Dealing with Residual Confounding in Estimating Treatment\nEffects with a Survival Outcome.” In. Harnessing the power of\ncausal inference; predictive analytics for survival outcomes with health\nadministrative data: applications to tuberculosis research.\n\n\nHossain, Md Belal, Huah Shin Ng, Feng Zhu, Helen Tremlett, and Mohammad\nEhsanul Karim. 2025. 
“Simultaneously Dealing with Immortal Time\nBias and Residual Confounding: A Case Study of a High-Dimensional\nPropensity Score Approach with a Nested Case–Control Framework in\nMultiple Sclerosis Research.” Pharmacoepidemiology and Drug\nSafety 34 (7): e70174.\n\n\nHossain, Md Belal, Mohsen Sadatsafavi, Hubert Wong, Victoria J Cook,\nJames C Johnston, and Mohammad Ehsanul Karim. 2025. “Enhancing\nRisk Prediction Base on Health Administrative Data Using\nHigh-Dimensional Prediction Model.” Journal of Clinical\nEpidemiology, 111857.\n\n\nHossain, Md Belal, Hubert Wong, Mohsen Sadatsafavi, James C Johnston,\nVictoria J Cook, and Mohammad Ehsanul Karim. 2024. “Benefits of\nRepeated Matched-Cohort and Nested Case–Control Analyses with\nTime-Dependent Exposure in Observational Studies.” Statistics\nin Biosciences, 1–29.\n\n\nJones, Mark, and Robert Fowler. 2016. “Immortal Time Bias in\nObservational Studies of Time-to-Event Outcomes.” Journal of\nCritical Care 36: 195–99.\n\n\nJu, Cheng, Mary Combs, Samuel D Lendle, Jessica M Franklin, Richard\nWyss, Sebastian Schneeweiss, and Mark J van der Laan. 2019.\n“Propensity Score Prediction for Electronic Healthcare Databases\nUsing Super Learner and High-Dimensional Propensity Score\nMethods.” Journal of Applied Statistics 46 (12):\n2216–36.\n\n\nJu, Cheng, Susan Gruber, Samuel D Lendle, Antoine Chambaz, Jessica M\nFranklin, Richard Wyss, Sebastian Schneeweiss, and Mark J van Der Laan.\n2019. “Scalable Collaborative Targeted Learning for\nHigh-Dimensional Data.” Statistical Methods in Medical\nResearch 28 (2): 532–54.\n\n\nKarim, ME. 2023. “Rethinking Residual Confounding Bias Reduction:\nWhy Vanilla hdPS Alone Is No Longer Enough.”\n\n\nKarim, ME, and Y Lei. 2025. “Is There a Competitive Advantage to\nUsing Multivariate Statistical or Machine Learning Methods over the\nBross Formula in the hdPS Framework for Bias and Variance\nEstimation?” PLoS One 20 (5): e0324639.\n\n\nKarim, ME, and MH Mondol. 2025. 
“Finding the Optimal Number of\nSplits and Repetitions in Double Cross-Fitting Targeted Maximum\nLikelihood Estimators.” Pharmaceutical Statistics.\n\n\nKarim, Mohammad Ehsanul, Paul Gustafson, John Petkau, Yinshan Zhao,\nAfsaneh Shirani, Elaine Kingwell, Charity Evans, Mia Van Der Kop, Joel\nOger, and Helen Tremlett. 2014. “Marginal Structural Cox Models\nfor Estimating the Association Between β-Interferon Exposure and Disease\nProgression in a Multiple Sclerosis Cohort.” American Journal\nof Epidemiology 180 (2): 160–71.\n\n\nKarim, Mohammad Ehsanul, Md Belal Hossain, Huah Shin Ng, Feng Zhu, Hanna\nA Frank, and Helen Tremlett. 2025. “Evaluating the Role of\nHigh-Dimensional Proxy Data in Confounding Adjustment in Multiple\nSclerosis Research: A Case Study.” Pharmacoepidemiology and\nDrug Safety 34 (2): e70112.\n\n\nKarim, Mohammad Ehsanul, and Yang Lei. 2025. “How Effective Are\nMachine Learning and Doubly Robust Estimators in Incorporating\nHigh-Dimensional Proxies to Reduce Residual Confounding?”\nPharmacoepidemiology and Drug Safety 34 (5): e70155.\n\n\nKarim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018.\n“Can We Train Machine Learning Methods to Outperform the\nHigh-Dimensional Propensity Score Algorithm?”\nEpidemiology 29 (2): 191–98.\n\n\nKarim, Mohammad Ehsanul, Fabio Pellegrini, Robert W Platt, Gabrielle\nSimoneau, Julie Rouette, and Carl de Moor. 2022. “The Use and\nQuality of Reporting of Propensity Score Methods in Multiple Sclerosis\nLiterature: A Review.” Multiple Sclerosis Journal 28\n(9): 1317–23.\n\n\nKlein, Samuel, Amalia Gastaldelli, Hannele Yki-Järvinen, and Philipp E\nScherer. 2022. “Why Does Obesity Cause Diabetes?” Cell\nMetabolism 34 (1): 11–20.\n\n\nKumamaru, Hiraku, Joshua J Gagne, Robert J Glynn, Soko Setoguchi, and\nSebastian Schneeweiss. 2016. 
“Comparison of High-Dimensional\nConfounder Summary Scores in Comparative Studies of Newly Marketed\nMedications.” Journal of Clinical Epidemiology 76:\n200–208.\n\n\nKumamaru, Hiraku, Sebastian Schneeweiss, Robert J Glynn, Soko Setoguchi,\nand Joshua J Gagne. 2016. “Dimension Reduction and Shrinkage\nMethods for High Dimensional Disease Risk Scores in Historical\nData.” Emerging Themes in Epidemiology 13: 1–10.\n\n\nLiu, Mengling, Wenbin Lu, Roy E Shore, and Anne Zeleniuch-Jacquotte.\n2010. “Cox Regression Model with Time-Varying Coefficients in\nNested Case–Control Studies.” Biostatistics 11 (4):\n693–706.\n\n\nLix, Lisa M, Jacqueline Quail, Opeyemi Fadahunsi, and Gary F Teare.\n2013. “Predictive Performance of Comorbidity Measures in\nAdministrative Databases for Diabetes Cohorts.” BMC Health\nServices Research 13: 1–12.\n\n\nLix, LM, J Quail, G Teare, and B Acan. 2011. “Performance of\nComorbidity Measures for Predicting Outcomes in Population-Based\nOsteoporosis Cohorts.” Osteoporosis International 22:\n2633–43.\n\n\nLow, Yen Sia, Blanca Gallego, and Nigam Haresh Shah. 2016.\n“Comparing High-Dimensional Confounder Control Methods for Rapid\nCohort Studies from Electronic Health Records.” Journal of\nComparative Effectiveness Research 5 (2): 179–92.\n\n\nMondol, MH, and ME Karim. 2024. “Towards Robust Causal Inference\nin Epidemiological Research: Employing Double Cross-Fit TMLE in Right\nHeart Catheterization Data.” American Journal of\nEpidemiology, kwae447.\n\n\nNaimi, Ashley I, and Brian W Whitcomb. 2020. “Estimating Risk\nRatios and Risk Differences Using Regression.” American\nJournal of Epidemiology 189 (6): 508–10.\n\n\nNeugebauer, Romain, Julie A Schmittdiel, Zheng Zhu, Jeremy A Rassen,\nJohn D Seeger, and Sebastian Schneeweiss. 2015. 
“High-Dimensional\nPropensity Score Algorithm in Comparative Effectiveness Research with\nTime-Varying Interventions.” Statistics in Medicine 34\n(5): 753–81.\n\n\nNguyen, Tri-Long, Thomas PA Debray, Bora Youn, Gabrielle Simoneau, and\nGary S Collins. 2024. “Confounder Adjustment Using the Disease\nRisk Score: A Proposal for Weighting Methods.” American\nJournal of Epidemiology 193 (2): 377–88.\n\n\nPang, Menglan, Tibor Schuster, Kristian B Filion, Maria Eberg, and\nRobert W Platt. 2016. “Targeted Maximum Likelihood Estimation for\nPharmacoepidemiologic Research.” Epidemiology (Cambridge,\nMass.) 27 (4): 570.\n\n\nPang, Menglan, Tibor Schuster, Kristian B Filion, Mireille E Schnitzer,\nMaria Eberg, and Robert W Platt. 2016. “Effect Estimation in\nPoint-Exposure Studies with Binary Outcomes and High-Dimensional\nCovariate Data–a Comparison of Targeted Maximum Likelihood Estimation\nand Inverse Probability of Treatment Weighting.” The\nInternational Journal of Biostatistics 12 (2).\n\n\nRassen, Jeremy A, Patrick Blin, Sebastian Kloss, Romain S Neugebauer,\nRobert W Platt, Anton Pottegård, Sebastian Schneeweiss, and Sengwee Toh.\n2023. “High-Dimensional Propensity Scores for Empirical Covariate\nSelection in Secondary Database Studies: Planning, Implementation, and\nReporting.” Pharmacoepidemiology and Drug Safety 32 (2):\n93–106.\n\n\nRobert, Dennis. 2020. autoCovariateSelection: Automatic Covariate\nSelection. https://CRAN.R-project.org/package=autoCovariateSelection.\n\n\nRubin, Donald B. 1997. “Estimating Causal Effects from Large Data\nSets Using Propensity Scores.” Annals of Internal\nMedicine 127 (8_Part_2): 757–63.\n\n\nRubin, Donald B, and Neal Thomas. 1996. “Matching Using Estimated\nPropensity Scores: Relating Theory to Practice.”\nBiometrics, 249–64.\n\n\nSchneeweiss, Sebastian. 2006. 
“Sensitivity Analysis and External\nAdjustment for Unmeasured Confounders in Epidemiologic Database Studies\nof Therapeutics.” Pharmacoepidemiology and Drug Safety\n15 (5): 291–303.\n\n\n———. 2018. “Automated Data-Adaptive Analytics for Electronic\nHealthcare Data to Study Causal Treatment Effects.” Clinical\nEpidemiology, 771–88.\n\n\nSchneeweiss, Sebastian, Wesley Eddings, Robert J Glynn, Elisabetta\nPatorno, Jeremy Rassen, and Jessica M Franklin. 2017. “Variable\nSelection for Confounding Adjustment in High-Dimensional Covariate\nSpaces When Analyzing Healthcare Databases.”\nEpidemiology 28 (2): 237–48.\n\n\nSchneeweiss, Sebastian, and Malcolm Maclure. 2000. “Use of\nComorbidity Scores for Control of Confounding in Studies Using\nAdministrative Databases.” International Journal of\nEpidemiology 29 (5): 891–98.\n\n\nSchneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn,\nHelen Mogun, and M Alan Brookhart. 2009. “High-Dimensional\nPropensity Score Adjustment in Studies of Treatment Effects Using Health\nCare Claims Data.” Epidemiology (Cambridge, Mass.) 20\n(4): 512.\n\n\nSchuster, Tibor, Wilfrid Kouokam Lowe, and Robert W Platt. 2016.\n“Propensity Score Model Overfitting Led to Inflated Variance of\nEstimated Odds Ratios.” Journal of Clinical Epidemiology\n80: 97–106.\n\n\nSchuster, Tibor, Menglan Pang, and Robert W Platt. 2015. “On the\nRole of Marginal Confounder Prevalence–Implications for the\nHigh-Dimensional Propensity Score Algorithm.”\nPharmacoepidemiology and Drug Safety 24 (9): 1004–7.\n\n\nShalit, Uri, Fredrik D Johansson, and David Sontag. 2017.\n“Estimating Individual Treatment Effect: Generalization Bounds and\nAlgorithms.” In International Conference on Machine\nLearning, 3076–85. PMLR.\n\n\nShi, Claudia, David Blei, and Victor Veitch. 2019. 
“Adapting\nNeural Networks for the Estimation of Treatment Effects.”\nAdvances in Neural Information Processing Systems 32.\n\n\nSimoneau, Gabrielle, Fabio Pellegrini, Thomas PA Debray, Julie Rouette,\nJohanna Muñoz, Robert W Platt, John Petkau, et al. 2022.\n“Recommendations for the Use of Propensity Score Methods in\nMultiple Sclerosis Research.” Multiple Sclerosis Journal\n28 (9): 1467–80.\n\n\nStuart, Elizabeth A, Brian K Lee, and Finbarr P Leacy. 2013.\n“Prognostic Score–Based Balance Measures Can Be a Useful\nDiagnostic for Propensity Score Methods in Comparative Effectiveness\nResearch.” Journal of Clinical Epidemiology 66 (8):\nS84–90.\n\n\nTazare, John, Richard Wyss, Jessica M Franklin, Liam Smeeth, Stephen JW\nEvans, Shirley V Wang, Sebastian Schneeweiss, Ian J Douglas, Joshua J\nGagne, and Elizabeth J Williamson. 2022. “Transparency of\nHigh-Dimensional Propensity Score Analyses: Guidance for Diagnostics and\nReporting.” Pharmacoepidemiology and Drug Safety 31 (4):\n411–23.\n\n\nTian, Yuxi, Martijn J Schuemie, and Marc A Suchard. 2018.\n“Evaluating Large-Scale Propensity Score Performance Through\nReal-World and Synthetic Data Experiments.” International\nJournal of Epidemiology 47 (6): 2005–14.\n\n\nVanderWeele, Tyler J. 2019. “Principles of Confounder\nSelection.” European Journal of Epidemiology 34: 211–19.\n\n\nVon Korff, Michael, Edward H Wagner, and Kathleen Saunders. 1992.\n“A Chronic Disease Score from Automated Pharmacy Data.”\nJournal of Clinical Epidemiology 45 (2): 197–203.\n\n\nWeberpals, Janick, Tim Becker, Jessica Davies, Fabian Schmich, Dominik\nRüttinger, Fabian J Theis, and Anna Bauer-Mehren. 2021. “Deep\nLearning-Based Propensity Scores for Confounding Control in Comparative\nEffectiveness Research: A Large-Scale, Real-World Data Study.”\nEpidemiology 32 (3): 378–88.\n\n\nWestreich, Daniel, Stephen R Cole, Michele Jonsson Funk, M Alan\nBrookhart, and Til Stürmer. 2011. 
“The Role of the c-Statistic in\nVariable Selection for Propensity Score Models.”\nPharmacoepidemiology and Drug Safety 20 (3): 317–20.\n\n\nWyss, Richard, Alan R Ellis, M Alan Brookhart, Michele Jonsson Funk,\nCynthia J Girman, Ross J Simpson Jr, and Til Stürmer. 2015.\n“Matching on the Disease Risk Score in Comparative Effectiveness\nResearch of New Treatments.” Pharmacoepidemiology and Drug\nSafety 24 (9): 951–61.\n\n\nWyss, Richard, Sebastian Schneeweiss, Mark Van Der Laan, Samuel D\nLendle, Cheng Ju, and Jessica M Franklin. 2018. “Using Super\nLearner Prediction Modeling to Improve High-Dimensional Propensity Score\nEstimation.” Epidemiology 29 (1): 96–106.\n\n\nWyss, Richard, Chen Yanover, Tal El-Hay, Dimitri Bennett, Robert W\nPlatt, Andrew R Zullo, Grammati Sari, et al. 2022. “Machine\nLearning for Improving High-Dimensional Proxy Confounder Adjustment in\nHealthcare Database Studies: An Overview of the Current\nLiterature.” Pharmacoepidemiology and Drug Safety 31\n(9): 932–43.\n\n\nZhang, Di, and Jessica Kim. 2019. “Use of Propensity Score and\nDisease Risk Score for Multiple Treatments with Time-to-Event Outcome: A\nSimulation Study.” Journal of Biopharmaceutical\nStatistics 29 (6): 1103–15.\n\n\nZivich, Paul N, and Alexander Breskin. 2021. “Machine Learning for\nCausal Inference: On the Use of Cross-Fit Estimators.”\nEpidemiology (Cambridge, Mass.) 32 (3): 393."
  },
  {
    "objectID": "NHANES.html#nhanes",
    "href": "NHANES.html#nhanes",
    "title": "Appendix: NHANES",
    "section": "NHANES",
    "text": "NHANES\nThe National Health and Nutrition Examination Survey (NHANES) is a cross-sectional survey that is designed to provide nationally representative data on the health and nutritional status of the non-institutionalized, civilian US population. This survey is conducted by the National Center for Health Statistics, that provides valuable data on a wide range of health issues, such as diabetes, obesity, as well as nutrition, physical activity, and environmental exposures."
  },
  {
    "objectID": "NHANES.html#components",
    "href": "NHANES.html#components",
    "title": "Appendix: NHANES",
    "section": "Components",
    "text": "Components\nNHANES combines interviews, physical examinations, and laboratory tests to gather comprehensive health and nutrition information about participants."
  },
  {
    "objectID": "NHANES.html#design",
    "href": "NHANES.html#design",
    "title": "Appendix: NHANES",
    "section": "Design",
    "text": "Design\nThe survey selects participants using a complex sampling design, which allows researchers to make inferences about the overall health of the US population. The survey uses a complex, multistage probability sampling design to select participants from US households."
  },
  {
    "objectID": "NHANES.html#usefulness",
    "href": "NHANES.html#usefulness",
    "title": "Appendix: NHANES",
    "section": "Usefulness",
    "text": "Usefulness\nThe collected data is used by many researchers, policymakers, and public health officials to identify emerging health issues, monitor trends in health and inform public health policies and initiatives to improve health-related outcomes in the US."
  },
  {
    "objectID": "NHANES.html#cylces-we-used",
    "href": "NHANES.html#cylces-we-used",
    "title": "Appendix: NHANES",
    "section": "Cylces we used",
    "text": "Cylces we used\nNHANES is an ongoing annual survey, continuous NHANES cycles started from 1999-2000 (cycle 1; see SDDSRVYR variable in the Demographic Variables & Sample Weights component). NHANES cycles 8, 9, and 10 (that we used in this tutorial) refer to the 8th, 9th, and 10th rounds of surveys, which took place from 2013-2014, 2015-2016, and 2017-2018, respectively. Each cycle involves a nationally representative sample of the US population."
  },
  {
    "objectID": "index13.html#download-and-subsetting-to-retain-only-the-useful-variables",
    "href": "index13.html#download-and-subsetting-to-retain-only-the-useful-variables",
    "title": "28  Download cycle 8",
    "section": "28.1 Download and Subsetting to retain only the useful variables",
    "text": "28.1 Download and Subsetting to retain only the useful variables\n\n28.1.1 Demographic\nDemographic Variables and Sample Weights (DEMO_H): The 2-year sample weights (WTINT2YR, WTMEC2YR) should be used. 15 masked variance strata and 30 masked primary sampling units (PSUs) are included in the demographics file. Each stratum has 2 PSUs.\n\ndemo <- nhanes('DEMO_H')    # Both males and females 0 YEARS - 150 YEARS\ndemo1 <- demo[c(\"SEQN\",     # Respondent sequence number\n                \"RIDAGEYR\", # Age in years at screening\n                \"RIAGENDR\", # gender\n                \"DMDEDUC2\", # Education level - Adults 20+\n                \"RIDRETH1\", # race/ethnicity\n                \"DMDMARTL\", # marital status    \n                \"INDHHIN2\", # Annual household income\n                \"DMDBORN4\", # where born\n                \"RIDEXPRG\", # Pregnancy status at exam (released for 20-44 yrs)\n                \"SDDSRVYR\", # survey cycle\n                \"WTINT2YR\", # Full sample 2 year weights\n                \"WTMEC2YR\", # Full sample 2 year MEC exam weight\n                \"SDMVPSU\",  # Masked variance pseudo-PSU\n                \"SDMVSTRA\")]# Masked variance pseudo-stratum\ndemo_vars <- names(demo1) \ndemo2 <- nhanesTranslate('DEMO_H', demo_vars, data = demo1)\n#> Translated columns: RIAGENDR DMDEDUC2 RIDRETH1 DMDMARTL INDHHIN2 DMDBORN4 RIDEXPRG SDDSRVYR\nsaveRDS(demo2, file = \"data/components/demo13.RData\")\n\n\n\n28.1.2 BMI\nBody Measures (BMX_H): The NHANES examination sample weights should be used to analyze the body measurement data. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\nbmx <- nhanes('BMX_H')\nbmx1 <- bmx[c(\"SEQN\", # Respondent sequence number\n              \"BMXBMI\")] # Body Mass Index (kg/m**2): 2 YEARS - 150 YEARS\nbmx_vars <- names(bmx1)\nbmx2 <- nhanesTranslate('BMX_H', bmx_vars, data = bmx1)\n#> Warning in nhanesTranslate(\"BMX_H\", bmx_vars, data = bmx1): No columns were\n#> translated\nsaveRDS(bmx2, file = \"data/components/bmx13.RData\")\n\n\n\n28.1.3 Diabetes\nDiabetes (DIQ_H): analysis of the diabetes questionnaire data must be conducted using the appropriate survey design variables, sample weights, and the basic demographic variables. Interview weights should only be used if questionnaire data are analyzed by themselves. However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\ndiq <- nhanes('DIQ_H')\ndiq1 <- diq[c(\"SEQN\", # Respondent sequence number\n              \"DIQ010\", # Doctor told you have diabetes\n              \"DIQ050\", # Taking insulin now\n              \"DIQ070\", # Take diabetic pills to lower blood sugar\n              \"DIQ175A\")] # Family history\ndiq_vars <- names(diq1)\ndiq2 <- nhanesTranslate('DIQ_H', diq_vars, data = diq1)\n#> Translated columns: DIQ010 DIQ050 DIQ070 DIQ175A\nsaveRDS(diq2, file = \"data/components/diq13.RData\")\n\n\n\n28.1.4 Smoking\nSmoking - Cigarette Use (SMQ_H): Interview weights should only be used if questionnaire data are analyzed by themselves. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\nsmq <- nhanes('SMQ_H')\nsmq1 <- smq[c(\"SEQN\", # Respondent sequence number\n              \"SMQ020\", # Smoked at least 100 cigarettes in life\n              \"SMQ040\")] # Do you now smoke cigarettes?: 18 YEARS - 150 YEARS\nsmq_vars <- names(smq1)\nsmq2 <- nhanesTranslate('SMQ_H', smq_vars, data = smq1)\n#> Translated columns: SMQ020 SMQ040\nsaveRDS(smq2, file = \"data/components/smq13.RData\")\n\n\n\n28.1.5 Diet\nDiet Behavior & Nutrition (DBQ_H): interview sample weights may be used in their analysis. However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\ndbq <- nhanes('DBQ_H')\ndbq1 <- dbq[c(\"SEQN\", # Respondent sequence number\n              \"DBQ700\")] # How healthy is the diet: 16 YEARS - 150 YEARS\ndbq_vars <- names(dbq1)\ndbq2 <- nhanesTranslate('DBQ_H', dbq_vars, data = dbq1)\n#> Translated columns: DBQ700\nsaveRDS(dbq2, file = \"data/components/dbq13.RData\")\n\n\n\n28.1.6 Physical activity\nPhysical Activity (PAQ_H): the interview sample weights should be used in their analysis. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\npaq <- nhanes('PAQ_H')\npaq1 <- paq[c(\"SEQN\", # Respondent sequence number\n              \"PAQ605\")] # Vigorous work activity: 18 YEARS - 150 YEARS\npaq_vars <- names(paq1)\npaq2 <- nhanesTranslate('PAQ_H', paq_vars, data = paq1)\n#> Translated columns: PAQ605\nsaveRDS(paq2, file = \"data/components/paq13.RData\")\n\n\n\n28.1.7 Access to healthcare\nHospital Utilization & Access to Care (HUQ_H): Although these data were collected as part of the household questionnaire, if they are merged with the MEC exam data, exam sample weights should be used for the analyses.\n\nhuq <- nhanes('HUQ_H')\nhuq1 <- huq[c(\"SEQN\", # Respondent sequence number\n              \"HUQ030\")] # Routine place to go for healthcare\nhuq_vars <- names(huq1)\nhuq2 <- nhanesTranslate('HUQ_H', huq_vars, data = huq1)\n#> Translated columns: HUQ030\nsaveRDS(huq2, file = \"data/components/huq13.RData\")\n\n\n\n28.1.8 Blood pressure\nBlood Pressure (BPX_H): Exam sample weights should be used for analyses.\n\nSystolic blood pressure and maximum inflation level cannot be greater than 300 mmHg;\nSystolic and diastolic blood pressure measurements and the maximum inflation level can be even numbers only;\nSystolic blood pressure must be greater than diastolic blood pressure;\nIf there is no systolic blood pressure, there can be no diastolic blood pressure. 
(There can be a systolic measurement without a diastolic measurement.); and\nDiastolic blood pressure can be zero.\n\n\nbpx <- nhanes('BPX_H')\nbpx1 <- bpx[c(\"SEQN\", # Respondent sequence number\n              \"BPXSY1\", # Systolic Blood pres (1st rdg) mmHg: 8 - 150 YEARS\n              \"BPXSY2\", # Systolic: Blood pres (2nd rdg) mm Hg\n              \"BPXSY3\", # Systolic: Blood pres (3rd rdg) mm Hg\n              \"BPXSY4\", # Systolic: Blood pres (4th rdg) mm Hg\n              \"BPXDI1\", # Diastolic Blood pres (1st rdg) mmHg: 8 - 150 YEARS\n              \"BPXDI2\", # Diastolic: Blood pres (2nd rdg) mm Hg\n              \"BPXDI3\", # Diastolic: Blood pres (3rd rdg) mm Hg\n              \"BPXDI4\")] # Diastolic: Blood pres (4th rdg) mm Hg\nbpx_vars <- names(bpx1)\nbpx2 <- nhanesTranslate('BPX_H', bpx_vars, data = bpx1)\n#> Warning in nhanesTranslate(\"BPX_H\", bpx_vars, data = bpx1): No columns were\n#> translated\nsaveRDS(bpx2, file = \"data/components/bpx13.RData\")\n\nBlood Pressure & Cholesterol (BPQ_H): Although these data were collected as part of the household questionnaire, if they are merged with the MEC exam data, exam sample weights should be used for the analyses.\n\nbpq <- nhanes('BPQ_H')\nbpq1 <- bpq[c(\"SEQN\", # Respondent sequence number\n              \"BPQ080\")] # high cholesterol\nbpq_vars <- names(bpq1)\nbpq2 <- nhanesTranslate('BPQ_H', bpq_vars, data = bpq1)\n#> Translated columns: BPQ080\nsaveRDS(bpq2, file = \"data/components/bpq13.RData\")\n\n\n\n28.1.9 Sleep\nSleep Disorders (SLQ_H):\n\nslq <- nhanes('SLQ_H')\nslq1 <- slq[c(\"SEQN\", # Respondent sequence number\n              \"SLD010H\")] # Sleep hours \nslq_vars <- names(slq1)\nslq2 <- nhanesTranslate('SLQ_H', slq_vars, data = slq1)\n#> Warning in nhanesTranslate(\"SLQ_H\", slq_vars, data = slq1): No columns were\n#> translated\nsaveRDS(slq2, file = \"data/components/slq13.RData\")\n\n\n\n28.1.10 Laboratory data\nStandard Biochemistry Profile (BIOPRO_H): Exam sample weights 
should be used for analyses.\n\n# Standard Biochemistry Profile\nbiopro <- nhanes('BIOPRO_H') # 12 YEARS - 150 YEARS\nbiopro1 <- biopro[c(\"SEQN\", # Respondent sequence number\n                    #\"LBXSTR\", # Triglycerides, refrigerated (mg/dL)\n                    \"LBXSUA\", # Uric acid (mg/dL)\n                    \"LBXSTP\", # Total protein (g/dL)\n                    \"LBXSTB\", # Total bilirubin (mg/dL)\n                    \"LBXSPH\", # Phosphorus (mg/dL)\n                    \"LBXSNASI\", # Sodium (mmol/L)\n                    \"LBXSKSI\", # Potassium (mmol/L)\n                    \"LBXSGB\", # Globulin (g/dL)\n                    \"LBXSCA\")] # Total Calcium (mg/dL)\nbiopro_vars <- names(biopro1) \nbiopro2 <- nhanesTranslate('BIOPRO_H', biopro_vars, data = biopro1)\n#> Warning in nhanesTranslate(\"BIOPRO_H\", biopro_vars, data = biopro1): No columns\n#> were translated\nsaveRDS(biopro2, file = \"data/components/biopro13.RData\")\n\n\n\n28.1.11 ICD-10-CM codes\nPrescription Medications (RXQ_RX_H): The Prescription Medications subsection provides personal interview data on use of prescription medications during a one-month period prior to the participant’s interview date. During the household SP interview, survey participants are asked if they have taken medications in the past 30 days for which they needed a prescription. 
Those who answer “yes” are asked to show the interviewer the medication containers of all the products used.\n\nrxq <- nhanes('RXQ_RX_H')\nrxq10 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC1\")] # ICD-10-CM code 1\nrxq11 <- names(rxq10) \nrxq12 <- nhanesTranslate('RXQ_RX_H', rxq11, data = rxq10)\n#> Translated columns: RXDRSC1\n\nrxq20 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC2\")] # ICD-10-CM code 2\nrxq21 <- names(rxq20) \nrxq22 <- nhanesTranslate('RXQ_RX_H', rxq21, data = rxq20)\n#> Translated columns: RXDRSC2\n\nrxq30 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC3\")] # ICD-10-CM code 3\nrxq31 <- names(rxq30) \nrxq32 <- nhanesTranslate('RXQ_RX_H', rxq31, data = rxq30)\n#> Translated columns: RXDRSC3\n\nsaveRDS(rxq12, file = \"data/components/rxq1213.RData\")\nsaveRDS(rxq22, file = \"data/components/rxq2213.RData\")\nsaveRDS(rxq32, file = \"data/components/rxq3213.RData\")"
  },
  {
    "objectID": "index13.html#merging-all-the-datasets---except-for-icd-10-codes",
    "href": "index13.html#merging-all-the-datasets---except-for-icd-10-codes",
    "title": "28  Download cycle 8",
    "section": "28.2 Merging all the datasets - except for ICD-10 codes",
    "text": "28.2 Merging all the datasets - except for ICD-10 codes\n\ndat <- join_all(list(demo2, bmx2, diq2, smq2, dbq2, paq2, \n                     huq2, bpx2, bpq2, slq2, biopro2),\n                by = \"SEQN\", type='full')\nnhanes13 <- dat\n\n\n28.2.1 Save dataset for later use\n\ndim(nhanes13)\n#> [1] 10175    42\nsave(nhanes13, rxq12, rxq22, rxq32, file = \"data/analytic13.RData\")"
  },
  {
    "objectID": "index15.html#download-and-subsetting-to-retain-only-the-useful-variables",
    "href": "index15.html#download-and-subsetting-to-retain-only-the-useful-variables",
    "title": "29  Download cycle 9",
    "section": "29.1 Download and Subsetting to retain only the useful variables",
    "text": "29.1 Download and Subsetting to retain only the useful variables\n\n29.1.1 Demographic\nDemographic Variables and Sample Weights (DEMO_I): The 2-year sample weights (WTINT2YR, WTMEC2YR) should be used. 15 masked variance strata and 30 masked primary sampling units (PSUs) are included in the demographics file. Each stratum has 2 PSUs.\n\ndemo <- nhanes('DEMO_I')    # Both males and females 0 YEARS - 150 YEARS\ndemo1 <- demo[c(\"SEQN\",     # Respondent sequence number\n                \"RIDAGEYR\", # Age in years at screening\n                \"RIAGENDR\", # gender\n                \"DMDEDUC2\", # Education level - Adults 20+\n                \"RIDRETH1\", # race/ethnicity\n                \"DMDMARTL\", # marital status    \n                \"INDHHIN2\", # Annual household income\n                \"DMDBORN4\", # where born\n                \"RIDEXPRG\", # Pregnancy status at exam (released for 20-44 yrs)\n                \"SDDSRVYR\", # survey cycle\n                \"WTINT2YR\", # Full sample 2 year weights\n                \"WTMEC2YR\", # Full sample 2 year MEC exam weight\n                \"SDMVPSU\",  # Masked variance pseudo-PSU\n                \"SDMVSTRA\")]# Masked variance pseudo-stratum\ndemo_vars <- names(demo1) \ndemo2 <- nhanesTranslate('DEMO_I', demo_vars, data = demo1)\n#> Translated columns: RIAGENDR DMDEDUC2 RIDRETH1 DMDMARTL INDHHIN2 DMDBORN4 RIDEXPRG SDDSRVYR\nsaveRDS(demo2, file = \"data/components/demo15.RData\")\n\n\n\n29.1.2 BMI\nBody Measures (BMX_I): The NHANES examination sample weights should be used to analyze the body measurement data. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\nbmx <- nhanes('BMX_I')\nbmx1 <- bmx[c(\"SEQN\", # Respondent sequence number\n              \"BMXBMI\")] # Body Mass Index (kg/m**2): 2 YEARS - 150 YEARS\nbmx_vars <- names(bmx1)\nbmx2 <- nhanesTranslate('BMX_I', bmx_vars, data = bmx1)\n#> Warning in nhanesTranslate(\"BMX_I\", bmx_vars, data = bmx1): No columns were\n#> translated\nsaveRDS(bmx2, file = \"data/components/bmx15.RData\")\n\n\n\n29.1.3 Diabetes\nDiabetes (DIQ_I): analyses of the diabetes questionnaire data must be conducted using the appropriate survey design variables, sample weights, and the basic demographic variables. Interview weights should only be used if questionnaire data are analyzed by themselves. However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\ndiq <- nhanes('DIQ_I')\ndiq1 <- diq[c(\"SEQN\", # Respondent sequence number\n              \"DIQ010\", # Doctor told you have diabetes \n              \"DIQ050\", # Taking insulin now\n              \"DIQ070\", # Take diabetic pills to lower blood sugar\n              \"DIQ175A\")] # Family history\ndiq_vars <- names(diq1)\ndiq2 <- nhanesTranslate('DIQ_I', diq_vars, data = diq1)\n#> Translated columns: DIQ010 DIQ050 DIQ070 DIQ175A\nsaveRDS(diq2, file = \"data/components/diq15.RData\")\n\n\n\n29.1.4 Smoking\nSmoking - Cigarette Use (SMQ_I): Interview weights should only be used if questionnaire data are analyzed by themselves. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\nsmq <- nhanes('SMQ_I')\nsmq1 <- smq[c(\"SEQN\", # Respondent sequence number\n              \"SMQ020\", # Smoked at least 100 cigarettes in life\n              \"SMQ040\")] # Do you now smoke cigarettes?: 18 YEARS - 150 YEARS\nsmq_vars <- names(smq1)\nsmq2 <- nhanesTranslate('SMQ_I', smq_vars, data = smq1)\n#> Translated columns: SMQ020 SMQ040\nsaveRDS(smq2, file = \"data/components/smq15.RData\")\n\n\n\n29.1.5 Diet\nDiet Behavior & Nutrition (DBQ_I): interview sample weights may be used in their analysis. However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\ndbq <- nhanes('DBQ_I')\ndbq1 <- dbq[c(\"SEQN\", # Respondent sequence number\n              \"DBQ700\")] # How healthy is the diet: 16 YEARS - 150 YEARS\ndbq_vars <- names(dbq1)\ndbq2 <- nhanesTranslate('DBQ_I', dbq_vars, data = dbq1)\n#> Translated columns: DBQ700\nsaveRDS(dbq2, file = \"data/components/dbq15.RData\")\n\n\n\n29.1.6 Physical activity\nPhysical Activity (PAQ_I): the interview sample weights should be used in their analysis. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\npaq <- nhanes('PAQ_I')\npaq1 <- paq[c(\"SEQN\", # Respondent sequence number\n              \"PAQ605\")] # Vigorous work activity: 18 YEARS - 150 YEARS\npaq_vars <- names(paq1)\npaq2 <- nhanesTranslate('PAQ_I', paq_vars, data = paq1)\n#> Translated columns: PAQ605\nsaveRDS(paq2, file = \"data/components/paq15.RData\")\n\n\n\n29.1.7 Access to healthcare\nHospital Utilization & Access to Care (HUQ_I): Although these data were collected as part of the household questionnaire, if they are merged with the MEC exam data, exam sample weights should be used for the analyses.\n\nhuq <- nhanes('HUQ_I')\nhuq1 <- huq[c(\"SEQN\", # Respondent sequence number\n              \"HUQ030\")] # Routine place to go for healthcare\nhuq_vars <- names(huq1)\nhuq2 <- nhanesTranslate('HUQ_I', huq_vars, data = huq1)\n#> Translated columns: HUQ030\nsaveRDS(huq2, file = \"data/components/huq15.RData\")\n\n\n\n29.1.8 Blood pressure\nBlood Pressure (BPX_I): Exam sample weights should be used for analyses.\n\nSystolic blood pressure and maximum inflation level cannot be greater than 300 mmHg;\nSystolic and diastolic blood pressure measurements and the maximum inflation level can be even numbers only;\nSystolic blood pressure must be greater than diastolic blood pressure;\nIf there is no systolic blood pressure, there can be no diastolic blood pressure. 
(There can be a systolic measurement without a diastolic measurement.); and\nDiastolic blood pressure can be zero.\n\n\nbpx <- nhanes('BPX_I')\nbpx1 <- bpx[c(\"SEQN\", # Respondent sequence number\n              \"BPXSY1\", # Systolic Blood pres (1st rdg) mmHg: 8 - 150 YEARS\n              \"BPXSY2\", # Systolic: Blood pres (2nd rdg) mm Hg\n              \"BPXSY3\", # Systolic: Blood pres (3rd rdg) mm Hg\n              \"BPXSY4\", # Systolic: Blood pres (4th rdg) mm Hg\n              \"BPXDI1\", # Diastolic Blood pres (1st rdg) mmHg: 8 - 150 YEARS\n              \"BPXDI2\", # Diastolic: Blood pres (2nd rdg) mm Hg\n              \"BPXDI3\", # Diastolic: Blood pres (3rd rdg) mm Hg\n              \"BPXDI4\")] # Diastolic: Blood pres (4th rdg) mm Hg\nbpx_vars <- names(bpx1)\nbpx2 <- nhanesTranslate('BPX_I', bpx_vars, data = bpx1)\n#> Warning in nhanesTranslate(\"BPX_I\", bpx_vars, data = bpx1): No columns were\n#> translated\nsaveRDS(bpx2, file = \"data/components/bpx15.RData\")\n\nBlood Pressure & Cholesterol (BPQ_I): Although these data were collected as part of the household questionnaire, if they are merged with the MEC exam data, exam sample weights should be used for the analyses.\n\nbpq <- nhanes('BPQ_I')\nbpq1 <- bpq[c(\"SEQN\", # Respondent sequence number\n              \"BPQ080\")] # high cholesterol\nbpq_vars <- names(bpq1)\nbpq2 <- nhanesTranslate('BPQ_I', bpq_vars, data = bpq1)\n#> Translated columns: BPQ080\nsaveRDS(bpq2, file = \"data/components/bpq15.RData\")\n\n\n\n29.1.9 Sleep\nSleep Disorders (SLQ_I):\n\nslq <- nhanes('SLQ_I')\nslq1 <- slq[c(\"SEQN\", # Respondent sequence number\n              \"SLD012\")] # Sleep hours - weekdays or workdays\nslq_vars <- names(slq1)\nslq2 <- nhanesTranslate('SLQ_I', slq_vars, data = slq1)\n#> Warning in nhanesTranslate(\"SLQ_I\", slq_vars, data = slq1): No columns were\n#> translated\nsaveRDS(slq2, file = \"data/components/slq15.RData\")\n\n\n\n29.1.10 Laboratory data\nStandard Biochemistry Profile (BIOPRO_I): 
Exam sample weights should be used for analyses.\n\n# Standard Biochemistry Profile\nbiopro <- nhanes('BIOPRO_I') # 12 YEARS - 150 YEARS\nbiopro1 <- biopro[c(\"SEQN\", # Respondent sequence number\n                    #\"LBXSTR\", # Triglycerides, refrigerated (mg/dL)\n                    \"LBXSUA\", # Uric acid (mg/dL)\n                    \"LBXSTP\", # Total protein (g/dL)\n                    \"LBXSTB\", # Total bilirubin (mg/dL)\n                    \"LBXSPH\", # Phosphorus (mg/dL)\n                    \"LBXSNASI\", # Sodium (mmol/L)\n                    \"LBXSKSI\", # Potassium (mmol/L)\n                    \"LBXSGB\", # Globulin (g/dL)\n                    \"LBXSCA\")] # Total Calcium (mg/dL)\nbiopro_vars <- names(biopro1) \nbiopro2 <- nhanesTranslate('BIOPRO_I', biopro_vars, data = biopro1)\n#> Warning in nhanesTranslate(\"BIOPRO_I\", biopro_vars, data = biopro1): No columns\n#> were translated\nsaveRDS(biopro2, file = \"data/components/biopro15.RData\")\n\n\n\n29.1.11 ICD-10-CM codes\nPrescription Medications (RXQ_RX_I): The Prescription Medications subsection provides personal interview data on use of prescription medications during a one-month period prior to the participant’s interview date. During the household SP interview, survey participants are asked if they have taken medications in the past 30 days for which they needed a prescription. 
Those who answer “yes” are asked to show the interviewer the medication containers of all the products used.\n\nrxq <- nhanes('RXQ_RX_I')\nrxq10 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC1\")] # ICD-10-CM code 1\nrxq11 <- names(rxq10) \nrxq12 <- nhanesTranslate('RXQ_RX_I', rxq11, data = rxq10)\n#> Translated columns: RXDRSC1\n\nrxq20 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC2\")] # ICD-10-CM code 2\nrxq21 <- names(rxq20) \nrxq22 <- nhanesTranslate('RXQ_RX_I', rxq21, data = rxq20)\n#> Translated columns: RXDRSC2\n\nrxq30 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC3\")] # ICD-10-CM code 3\nrxq31 <- names(rxq30) \nrxq32 <- nhanesTranslate('RXQ_RX_I', rxq31, data = rxq30)\n#> Translated columns: RXDRSC3\n\nsaveRDS(rxq12, file = \"data/components/rxq1215.RData\")\nsaveRDS(rxq22, file = \"data/components/rxq2215.RData\")\nsaveRDS(rxq32, file = \"data/components/rxq3215.RData\")"
  },
  {
    "objectID": "index15.html#merging-all-the-datasets---except-for-icd-10-codes",
    "href": "index15.html#merging-all-the-datasets---except-for-icd-10-codes",
    "title": "29  Download cycle 9",
    "section": "29.2 Merging all the datasets - except for ICD-10 codes",
    "text": "29.2 Merging all the datasets - except for ICD-10 codes\n\ndat <- join_all(list(demo2, bmx2, diq2, smq2, dbq2, paq2, \n                     huq2, bpx2, bpq2, slq2, biopro2),\n                by = \"SEQN\", type='full')\nnhanes15 <- dat \n\n\n29.2.1 Save dataset for later use\n\ndim(nhanes15)\n#> [1] 9971   42\nsave(nhanes15, rxq12, rxq22, rxq32, file = \"data/analytic15.RData\")"
  },
  {
    "objectID": "index17.html#download-and-subsetting-to-retain-only-the-useful-variables",
    "href": "index17.html#download-and-subsetting-to-retain-only-the-useful-variables",
    "title": "30  Download cycle 10",
    "section": "30.1 Download and Subsetting to retain only the useful variables",
    "text": "30.1 Download and Subsetting to retain only the useful variables\n\n30.1.1 Demographic\nDemographic Variables and Sample Weights (DEMO_J): The 2-year sample weights (WTINT2YR, WTMEC2YR) should be used. 15 masked variance strata and 30 masked primary sampling units (PSUs) are included in the demographics file. Each stratum has 2 PSUs.\n\ndemo <- nhanes('DEMO_J')    # Both males and females 0 YEARS - 150 YEARS\ndemo1 <- demo[c(\"SEQN\",     # Respondent sequence number\n                \"RIDAGEYR\", # Age in years at screening\n                \"RIAGENDR\", # gender\n                \"DMDEDUC2\", # Education level - Adults 20+\n                \"RIDRETH1\", # race/ethnicity\n                \"DMDMARTL\", # marital status    \n                \"INDHHIN2\", # Annual household income\n                \"DMDBORN4\", # where born\n                \"RIDEXPRG\", # Pregnancy status at exam (released for 20-44 yrs)\n                \"SDDSRVYR\", # survey cycle\n                \"WTINT2YR\", # Full sample 2 year weights\n                \"WTMEC2YR\", # Full sample 2 year MEC exam weight\n                \"SDMVPSU\",  # Masked variance pseudo-PSU\n                \"SDMVSTRA\")]# Masked variance pseudo-stratum\ndemo_vars <- names(demo1) \ndemo2 <- nhanesTranslate('DEMO_J', demo_vars, data = demo1)\n#> Translated columns: RIAGENDR DMDEDUC2 RIDRETH1 DMDMARTL INDHHIN2 DMDBORN4 RIDEXPRG SDDSRVYR\nsaveRDS(demo2, file = \"data/components/demo17.RData\")\n\n\n\n30.1.2 BMI\nBody Measures (BMX_J): The NHANES examination sample weights should be used to analyze the body measurement data. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\nbmx <- nhanes('BMX_J')\nbmx1 <- bmx[c(\"SEQN\", # Respondent sequence number\n              \"BMXBMI\")] # Body Mass Index (kg/m**2): 2 YEARS - 150 YEARS\nbmx_vars <- names(bmx1)\nbmx2 <- nhanesTranslate('BMX_J', bmx_vars, data = bmx1)\n#> Warning in nhanesTranslate(\"BMX_J\", bmx_vars, data = bmx1): No columns were\n#> translated\nsaveRDS(bmx2, file = \"data/components/bmx17.RData\")\n\n\n\n30.1.3 Diabetes\nDiabetes (DIQ_J): analyses of the diabetes questionnaire data must be conducted using the appropriate survey design variables, sample weights, and the basic demographic variables. Interview weights should only be used if questionnaire data are analyzed by themselves. However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\ndiq <- nhanes('DIQ_J')\ndiq1 <- diq[c(\"SEQN\", # Respondent sequence number\n              \"DIQ010\", # Doctor told you have diabetes\n              \"DIQ050\", # Taking insulin now\n              \"DIQ070\", # Take diabetic pills to lower blood sugar\n              \"DIQ175A\")] # Family history\ndiq_vars <- names(diq1)\ndiq2 <- nhanesTranslate('DIQ_J', diq_vars, data = diq1)\n#> Translated columns: DIQ010 DIQ050 DIQ070 DIQ175A\nsaveRDS(diq2, file = \"data/components/diq17.RData\")\n\n\n\n30.1.4 Smoking\nSmoking - Cigarette Use (SMQ_J): Interview weights should only be used if questionnaire data are analyzed by themselves. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\nsmq <- nhanes('SMQ_J')\nsmq1 <- smq[c(\"SEQN\", # Respondent sequence number\n              \"SMQ020\", # Smoked at least 100 cigarettes in life\n              \"SMQ040\")] # Do you now smoke cigarettes?: 18 YEARS - 150 YEARS\nsmq_vars <- names(smq1)\nsmq2 <- nhanesTranslate('SMQ_J', smq_vars, data = smq1)\n#> Translated columns: SMQ020 SMQ040\nsaveRDS(smq2, file = \"data/components/smq17.RData\")\n\n\n\n30.1.5 Diet\nDiet Behavior & Nutrition (DBQ_J): interview sample weights may be used in their analysis. However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\ndbq <- nhanes('DBQ_J')\ndbq1 <- dbq[c(\"SEQN\", # Respondent sequence number\n              \"DBQ700\")] # How healthy is the diet: 16 YEARS - 150 YEARS\ndbq_vars <- names(dbq1)\ndbq2 <- nhanesTranslate('DBQ_J', dbq_vars, data = dbq1)\n#> Translated columns: DBQ700\nsaveRDS(dbq2, file = \"data/components/dbq17.RData\")\n\n\n\n30.1.6 Physical activity\nPhysical Activity (PAQ_J): the interview sample weights should be used in their analysis. 
However, if the data is joined with data from the MEC, the MEC sample weights should be used.\n\npaq <- nhanes('PAQ_J')\npaq1 <- paq[c(\"SEQN\", # Respondent sequence number\n              \"PAQ605\")] # Vigorous work activity: 18 YEARS - 150 YEARS\npaq_vars <- names(paq1)\npaq2 <- nhanesTranslate('PAQ_J', paq_vars, data = paq1)\n#> Translated columns: PAQ605\nsaveRDS(paq2, file = \"data/components/paq17.RData\")\n\n\n\n30.1.7 Access to healthcare\nHospital Utilization & Access to Care (HUQ_J): Although these data were collected as part of the household questionnaire, if they are merged with the MEC exam data, exam sample weights should be used for the analyses.\n\nhuq <- nhanes('HUQ_J')\nhuq1 <- huq[c(\"SEQN\", # Respondent sequence number\n              \"HUQ030\")] # Routine place to go for healthcare\nhuq_vars <- names(huq1)\nhuq2 <- nhanesTranslate('HUQ_J', huq_vars, data = huq1)\n#> Translated columns: HUQ030\nsaveRDS(huq2, file = \"data/components/huq17.RData\")\n\n\n\n30.1.8 Blood pressure\nBlood Pressure (BPX_J): Exam sample weights should be used for analyses.\n\nSystolic blood pressure and maximum inflation level cannot be greater than 300 mmHg;\nSystolic and diastolic blood pressure measurements and the maximum inflation level can be even numbers only;\nSystolic blood pressure must be greater than diastolic blood pressure;\nIf there is no systolic blood pressure, there can be no diastolic blood pressure. 
(There can be a systolic measurement without a diastolic measurement.); and\nDiastolic blood pressure can be zero.\n\n\nbpx <- nhanes('BPX_J')\nbpx1 <- bpx[c(\"SEQN\", # Respondent sequence number\n              \"BPXSY1\", # Systolic Blood pres (1st rdg) mmHg: 8 - 150 YEARS\n              \"BPXSY2\", # Systolic: Blood pres (2nd rdg) mm Hg\n              \"BPXSY3\", # Systolic: Blood pres (3rd rdg) mm Hg\n              \"BPXSY4\", # Systolic: Blood pres (4th rdg) mm Hg\n              \"BPXDI1\", # Diastolic Blood pres (1st rdg) mmHg: 8 - 150 YEARS\n              \"BPXDI2\", # Diastolic: Blood pres (2nd rdg) mm Hg\n              \"BPXDI3\", # Diastolic: Blood pres (3rd rdg) mm Hg\n              \"BPXDI4\")] # Diastolic: Blood pres (4th rdg) mm Hg\nbpx_vars <- names(bpx1)\nbpx2 <- nhanesTranslate('BPX_J', bpx_vars, data = bpx1)\n#> Warning in nhanesTranslate(\"BPX_J\", bpx_vars, data = bpx1): No columns were\n#> translated\nsaveRDS(bpx2, file = \"data/components/bpx17.RData\")\n\nBlood Pressure & Cholesterol (BPQ_J): Although these data were collected as part of the household questionnaire, if they are merged with the MEC exam data, exam sample weights should be used for the analyses.\n\nbpq <- nhanes('BPQ_J')\nbpq1 <- bpq[c(\"SEQN\", # Respondent sequence number\n              \"BPQ080\")] # high cholesterol\nbpq_vars <- names(bpq1)\nbpq2 <- nhanesTranslate('BPQ_J', bpq_vars, data = bpq1)\n#> Translated columns: BPQ080\nsaveRDS(bpq2, file = \"data/components/bpq17.RData\")\n\n\n\n30.1.9 Sleep\nSleep Disorders (SLQ_J):\n\nslq <- nhanes('SLQ_J')\nslq1 <- slq[c(\"SEQN\", # Respondent sequence number\n              \"SLD012\")] # Sleep hours - weekdays or workdays\nslq_vars <- names(slq1)\nslq2 <- nhanesTranslate('SLQ_J', slq_vars, data = slq1)\n#> Warning in nhanesTranslate(\"SLQ_J\", slq_vars, data = slq1): No columns were\n#> translated\nsaveRDS(slq2, file = \"data/components/slq17.RData\")\n\n\n\n30.1.10 Laboratory data\nStandard Biochemistry Profile (BIOPRO_J): 
Exam sample weights should be used for analyses.\n\n# Standard Biochemistry Profile\nbiopro <- nhanes('BIOPRO_J') # 12 YEARS - 150 YEARS\nbiopro1 <- biopro[c(\"SEQN\", # Respondent sequence number\n                    #\"LBXSTR\", # Triglycerides, refrigerated (mg/dL)\n                    \"LBXSUA\", # Uric acid (mg/dL)\n                    \"LBXSTP\", # Total protein (g/dL)\n                    \"LBXSTB\", # Total bilirubin (mg/dL)\n                    \"LBXSPH\", # Phosphorus (mg/dL)\n                    \"LBXSNASI\", # Sodium (mmol/L)\n                    \"LBXSKSI\", # Potassium (mmol/L)\n                    \"LBXSGB\", # Globulin (g/dL)\n                    \"LBXSCA\")] # Total Calcium (mg/dL)\nbiopro_vars <- names(biopro1) \nbiopro2 <- nhanesTranslate('BIOPRO_J', biopro_vars, data = biopro1)\n#> Warning in nhanesTranslate(\"BIOPRO_J\", biopro_vars, data = biopro1): No columns\n#> were translated\nsaveRDS(biopro2, file = \"data/components/biopro17.RData\")\n\n\n\n30.1.11 ICD-10-CM codes\nPrescription Medications (RXQ_RX_J): The Prescription Medications subsection provides personal interview data on use of prescription medications during a one-month period prior to the participant’s interview date. During the household SP interview, survey participants are asked if they have taken medications in the past 30 days for which they needed a prescription. 
Those who answer “yes” are asked to show the interviewer the medication containers of all the products used.\n\nrxq <- nhanes('RXQ_RX_J')\nrxq10 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC1\")] # ICD-10-CM code 1\nrxq11 <- names(rxq10) \nrxq12 <- nhanesTranslate('RXQ_RX_J', rxq11, data = rxq10)\n#> Translated columns: RXDRSC1\n\nrxq20 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC2\")] # ICD-10-CM code 2\nrxq21 <- names(rxq20) \nrxq22 <- nhanesTranslate('RXQ_RX_J', rxq21, data = rxq20)\n#> Translated columns: RXDRSC2\n\nrxq30 <- rxq[c(\"SEQN\", # Respondent sequence number\n               \"RXDRSC3\")] # ICD-10-CM code 3\nrxq31 <- names(rxq30) \nrxq32 <- nhanesTranslate('RXQ_RX_J', rxq31, data = rxq30)\n#> Translated columns: RXDRSC3\n\nsaveRDS(rxq12, file = \"data/components/rxq1217.RData\")\nsaveRDS(rxq22, file = \"data/components/rxq2217.RData\")\nsaveRDS(rxq32, file = \"data/components/rxq3217.RData\")"
  },
  {
    "objectID": "index17.html#merging-all-the-datasets---except-for-icd-10-codes",
    "href": "index17.html#merging-all-the-datasets---except-for-icd-10-codes",
    "title": "30  Download cycle 10",
    "section": "30.2 Merging all the datasets - except for ICD-10 codes",
    "text": "30.2 Merging all the datasets - except for ICD-10 codes\n\ndat <- join_all(list(demo2, bmx2, diq2, smq2, dbq2, paq2, \n                     huq2, bpx2, bpq2, slq2, biopro2),\n                by = \"SEQN\", type='full')\nnhanes17 <- dat\n\n\n30.2.1 Save dataset for later use\n\ndim(nhanes17)\n#> [1] 9254   42\nsave(nhanes17, rxq12, rxq22, rxq32, file = \"data/analytic17.RData\")"
  },
  {
    "objectID": "analytic13.html#load-downloaded-dataset",
    "href": "analytic13.html#load-downloaded-dataset",
    "title": "31  Recoding cycle 8",
    "section": "31.1 Load downloaded dataset",
    "text": "31.1 Load downloaded dataset\n\nload(file = \"data/analytic13.RData\")"
  },
  {
    "objectID": "analytic13.html#recoding",
    "href": "analytic13.html#recoding",
    "title": "31  Recoding cycle 8",
    "section": "31.2 Recoding",
    "text": "31.2 Recoding\n\n31.2.1 ID\n\ndat2 <- nhanes13\ndat2$id <- dat2$SEQN\n\n\n\n31.2.2 Demographic\n\n31.2.2.1 Age\n\ndat2$age <- dat2$RIDAGEYR\ndat2$age.cat <- car::recode(dat2$age, \" 0:19 = '<20'; 20:49 = '20-49'; 50:64 = '50-64'; \n                            65:80 = '65+'; else = NA \")\ndat2$age.cat <- factor(dat2$age.cat, levels = c(\"<20\", \"20-49\", \"50-64\", \"65+\"))\ntable(dat2$age.cat, useNA = \"always\")\n#> \n#>   <20 20-49 50-64   65+  <NA> \n#>  4406  2989  1474  1306     0\n\n\n\n31.2.2.2 Sex\n\ndat2$sex <- dat2$RIAGENDR\ntable(dat2$sex, useNA = \"always\")\n#> \n#>   Male Female   <NA> \n#>   5003   5172      0\n\n\n\n31.2.2.3 Education\n\ndat2$education <- dat2$DMDEDUC2\ndat2$education <- as.factor(dat2$education)\ndat2$education <- car::recode(dat2$education, recodes = \" c('College graduate or above') = \n'College graduate or above'; c('Some college or AA degree', 'High school graduate/GED or equi') = \n'High school'; c('Less than 9th grade', '9-11th grade (Includes 12th grad') = \n'Less than high school'; else = NA \")\ndat2$education <- factor(dat2$education, \n                         levels = c(\"Less than high school\", \"High school\", \n                                    \"College graduate or above\"))\ntable(dat2$education, useNA = \"always\")\n#> \n#>     Less than high school               High school College graduate or above \n#>                       455                      1770                      1443 \n#>                      <NA> \n#>                      6507\n\n\n\n31.2.2.4 Race/ethnicity\n\ndat2$race <- dat2$RIDRETH1\ndat2$race <- car::recode(dat2$race, recodes = \" 'Non-Hispanic White'='White';\n                    'Non-Hispanic Black'='Black'; c('Mexican American',\n                    'Other Hispanic')= 'Hispanic'; else='Others' \")\ndat2$race <- factor(dat2$race, levels = c(\"White\", \"Black\", \"Hispanic\", \"Others\"))\ntable(dat2$race, useNA = \"always\")\n#> \n#>    White    Black Hispanic   Others    
 <NA> \n#>     3674     2267     2690     1544        0\n\n\n\n31.2.2.5 Marital status\n\ndat2$marital <- dat2$DMDMARTL\ndat2$marital <- car::recode(dat2$marital, recodes = \" 'Never married'='Never married';\nc('Married', 'Living with partner') = 'Married/with partner'; \n                            c('Widowed', 'Divorced', 'Separated')='Other'; else=NA \")\ndat2$marital <- factor(dat2$marital, levels = c(\"Never married\", \"Married/with partner\",\n                                                \"Other\"))\ntable(dat2$marital, useNA = \"always\")\n#> \n#>        Never married Married/with partner                Other \n#>                 1112                 3382                 1272 \n#>                 <NA> \n#>                 4409\n\n\n\n31.2.2.6 Income\n\ndat2$income <- dat2$INDHHIN2\ndat2$income  <- car::recode(dat2$income, recodes = \" c('$ 0 to $ 4,999', '$ 5,000 to $ 9,999',\n'$10,000 to $14,999', '$15,000 to $19,999', 'Under $20,000')='less than $20,000';\n                       c('Over $20,000','$20,000 and Over', '$20,000 to $24,999', \n                       '$25,000 to $34,999', '$35,000 to $44,999', '$45,000 to $54,999', \n                       '$55,000 to $64,999', '$65,000 to $74,999')='$20,000 to $74,999';\n                       c('$75,000 to $99,999','$100,000 and Over')='$75,000 and Over'; \n                            else=NA \")\ndat2$income  <- factor(dat2$income , levels=c(\"less than $20,000\", \"$20,000 to $74,999\", \n                                              \"$75,000 and Over\"))\ntable(dat2$income, useNA = \"always\")\n#> \n#>  less than $20,000 $20,000 to $74,999   $75,000 and Over               <NA> \n#>               2110               4964               2641                460\n\n\n\n31.2.2.7 Where born / citizenship\n\ndat2$born <- dat2$DMDBORN4\ndat2$born <- car::recode(dat2$born, recodes = \" 'Others'='Other place';\n                       'Born in 50 US states or Washington, DC'= 'Born in US'; else=NA\")\ndat2$born <- 
factor(dat2$born, levels = c(\"Born in US\", \"Other place\"))\ntable(dat2$born, useNA = \"always\") \n#> \n#>  Born in US Other place        <NA> \n#>        8262        1908           5\n\n\n\n31.2.2.8 Pregnancy\n\ndat2$pregnancy <- dat2$RIDEXPRG\ndat2$pregnancy <- car::recode(dat2$pregnancy, \n                      recodes = \" 'Yes, positive lab pregnancy test' = 'Yes';\n                       'The participant was not pregnant' = 'No'; \n                       'Cannot ascertain if the particip' = 'inconclusive';\n                       else= 'outside of target population'  \")\ntable(dat2$pregnancy, useNA = \"always\") \n#> \n#> outside of target population                         <NA> \n#>                        10175                            0\n\n\n\n\n31.2.3 BMI\n\n31.2.3.1 BMI and Obesity\n\ndat2$bmi <- dat2$BMXBMI\nsummary(dat2$bmi)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   12.10   19.70   24.70   25.68   30.20   82.90    1120\ndat2$obese <- ifelse(dat2$BMXBMI >= 30, \"Yes\", \"No\")\ndat2$obese <- factor(dat2$obese, levels = c(\"No\", \"Yes\"))\ntable(dat2$obese, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 6708 2347 1120\n\n\n\n\n31.2.4 Diabetes\n\ndat2$diabetes <- dat2$DIQ010\ndat2$diabetes <- car::recode(dat2$diabetes, \" 'Yes'='Yes'; c('No','Borderline')='No';\n                             else=NA \")\n\n# Taking insulin now or diabetic pills to lower blood sugar - they have diabetes\ndat2$diabetes[dat2$DIQ050 == \"Yes\"] <- \"Yes\"\ndat2$diabetes[dat2$DIQ070 == \"Yes\"] <- \"Yes\"\ntable(dat2$diabetes, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 8981  782  412\n\n\n\n31.2.5 Family history of diabetes\n\ndat2$diabetes.family.history <- car::recode(dat2$DIQ175A, \" 'Family history' = 'Yes'; \n                             else = 'No' \")\ndat2$diabetes.family.history <- factor(dat2$diabetes.family.history, levels = c(\"No\", \"Yes\"))\ndat2$diabetes.family.history[dat2$DIQ175A==\"Don't know\"] <- 
NA\ntable(dat2$diabetes.family.history, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 8837 1337    1\n\n\n\n31.2.6 Smoking\n\ndat2$smoking <- dat2$SMQ020\ndat2$smoking <- car::recode(dat2$smoking, \" 'Yes' = 'Current smoker'; 'No' = 'Never smoker'; else=NA  \")\ndat2$smoking <- factor(dat2$smoking, levels = c(\"Never smoker\", \"Previous smoker\", \"Current smoker\"))\ndat2$smoking[dat2$SMQ040 == \"Not at all\"] <- \"Previous smoker\"\ntable(dat2$smoking, useNA = \"always\")\n#> \n#>    Never smoker Previous smoker  Current smoker            <NA> \n#>            3532            1347            1232            4064\n\n\n\n31.2.7 Diet\n\n31.2.7.1 How healthy is the diet\n\ndat2$diet.healthy <- dat2$DBQ700\ndat2$diet.healthy <- car::recode(dat2$diet.healthy, recodes = \" c('Excellent', 'Very good')=\n                    'Very good or excellent'; 'Good'='Good'; c('Fair', 'Poor')=\n                    'Poor or fair'; else = NA \")\ndat2$diet.healthy <- factor(dat2$diet.healthy, levels = c(\"Poor or fair\", \"Good\", \n                                                          \"Very good or excellent\"))\ntable(dat2$diet.healthy, useNA = \"always\")\n#> \n#>           Poor or fair                   Good Very good or excellent \n#>                   1824                   2743                   1896 \n#>                   <NA> \n#>                   3712\n\n\n\n\n31.2.8 Vigorous physical activity\n\ndat2$physical.activity <- dat2$PAQ605\ndat2$physical.activity <- car::recode(dat2$physical.activity, recodes = \" 'No' = 'No'; \n                                      'Yes' = 'Yes'; else=NA\")\ndat2$physical.activity <- factor(dat2$physical.activity, levels = c(\"No\", \"Yes\"))\ntable(dat2$physical.activity, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 5975 1172 3028\n\n\n\n31.2.9 Access to medical services\n\ndat2$medical.access <- dat2$HUQ030\ndat2$medical.access <- car::recode(dat2$medical.access, recodes = \" c('Yes',\n                              'There is more 
than one place')='Yes'; 'There is no place'=\n                              'No'; else=NA\")\ntable(dat2$medical.access, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 1194 8981    0\n\n\n\n31.2.10 Hypertension/high blood pressure\n\n31.2.10.1 Systolic BP\n\ndat2$systolic1 <- dat2$BPXSY1\ndat2$systolic2 <- dat2$BPXSY2\ndat2$systolic3 <- dat2$BPXSY3\ndat2$systolic4 <- dat2$BPXSY4\n\ndat2 <- dat2 %>% \n  mutate(systolicBP = rowMeans(dat2[, c(\"systolic1\", \"systolic2\", \n                                        \"systolic3\", \"systolic4\")], \n                             na.rm = TRUE))\nsummary(dat2$systolicBP)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   64.67  106.00  115.33  118.31  128.00  228.67    2644\n\n\n\n31.2.10.2 Diastolic BP\n\ndat2$diastolic1 <- dat2$BPXDI1\ndat2$diastolic2 <- dat2$BPXDI2\ndat2$diastolic3 <- dat2$BPXDI3\ndat2$diastolic4 <- dat2$BPXDI4\ndatX <- dat2[, c(\"diastolic1\", \"diastolic2\", \n                 \"diastolic3\", \"diastolic4\")]\ndatX[datX ==0] <- NA\ndat2$diastolicBP <- rowMeans(datX[, c(\"diastolic1\", \"diastolic2\", \n                                      \"diastolic3\", \"diastolic4\")], \n                             na.rm = TRUE)\nsummary(dat2$diastolicBP)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   3.333  58.000  66.667  66.329  74.667 128.000    2688\n\n\n\n\n31.2.11 Sleep (daily in hours)\n\ndat2$sleep <- dat2$SLD010H\ndat2$sleep[dat2$sleep == 99] <- NA\nsummary(dat2$sleep)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   2.000   6.000   7.000   6.951   8.000  12.000    3721\n\n\n\n31.2.12 Laboratory data\n\n31.2.12.1 Uric acid (mg/dL)\n\ndat2$uric.acid <- dat2$LBXSUA\nsummary(dat2$uric.acid)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>    0.70    4.30    5.20    5.35    6.20   13.30    3624\n\n\n\n31.2.12.2 Total protein (g/dL)\n\ndat2$protein.total <- dat2$LBXSTP\nsummary(dat2$protein.total)\n#>    Min. 1st Qu.  
Median    Mean 3rd Qu.    Max.    NA's \n#>   4.700   6.800   7.100   7.108   7.400  10.200    3631\n\n\n\n31.2.12.3 Total bilirubin (mg/dL)\n\ndat2$bilirubin.total <- dat2$LBXSTB\nsummary(dat2$bilirubin.total)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   0.100   0.400   0.600   0.639   0.800   7.100    3626\n\n\n\n31.2.12.4 Phosphorus (mg/dL)\n\ndat2$phosphorus <- dat2$LBXSPH\nsummary(dat2$phosphorus)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   1.800   3.500   3.900   3.929   4.300  10.900    3623\n\n\n\n31.2.12.5 Sodium (mmol/L)\n\ndat2$sodium <- dat2$LBXSNASI\nsummary(dat2$sodium)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   119.0   139.0   140.0   139.8   141.0   154.0    3622\n\n\n\n31.2.12.6 Potassium (mmol/L)\n\ndat2$potassium <- dat2$LBXSKSI\nsummary(dat2$potassium)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   2.800   3.800   4.000   4.027   4.200   5.800    3623\n\n\n\n31.2.12.7 Globulin (g/dL)\n\ndat2$globulin <- dat2$LBXSGB\nsummary(dat2$globulin)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   1.400   2.500   2.800   2.826   3.100   6.500    3631\n\n\n\n31.2.12.8 Total calcium (mg/dL)\n\ndat2$calcium.total <- dat2$LBXSCA\nsummary(dat2$calcium.total)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   7.600   9.200   9.500   9.486   9.700  14.800    3664\n\n\n\n31.2.12.9 High cholesterol\n\ndat2$high.cholesterol <- dat2$BPQ080\ndat2$high.cholesterol <- car::recode(dat2$high.cholesterol, recodes = \" 'Yes'='Yes';\n                                     'No'='No'; else = NA\")\ntable(dat2$high.cholesterol, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 4391 2037 3747\n\n\n\n\n31.2.13 Survey features\n\n31.2.13.1 Weight\n\ndat2$survey.weight <- dat2$WTINT2YR\nsummary(dat2$survey.weight)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
\n#>    3698   12754   20233   30585   36280  167885\ndat2$survey.weight.mec <- dat2$WTMEC2YR\nsummary(dat2$survey.weight.mec)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#>       0   12562   20175   30585   36748  171395\n\n\n\n31.2.13.2 PSU\n\ndat2$psu <- as.factor(dat2$SDMVPSU)\ntable(dat2$psu)\n#> \n#>    1    2 \n#> 5249 4926\n\n\n\n31.2.13.3 Strata\n\ndat2$strata <- as.factor(dat2$SDMVSTRA)\ntable(dat2$strata)\n#> \n#> 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 \n#> 674 646 671 732 674 752 664 663 723 665 741 681 700 500 689\n\n\n\n\n31.2.14 Survey year\n\ndat2$year <- dat2$SDDSRVYR\ntable(dat2$year, useNA = \"always\") \n#> \n#> NHANES 2013-2014 public release                            <NA> \n#>                           10175                               0\n\n\n\n31.2.15 ICD-10-CM codes\n\ncolnames(rxq12) <- c(\"id\", \"icd10\")\ncolnames(rxq22) <- c(\"id\", \"icd10\")\ncolnames(rxq32) <- c(\"id\", \"icd10\")\n\nrx2013 <- rbind(rxq12, rxq22, rxq32)\nrx2013 <- rx2013[order(rx2013$id),]\n\nrx2013$icd10[rx2013$icd10 == \"Unknown\"] <- NA\nrx2013$icd10[rx2013$icd10 == \"Refused\"] <- NA\nrx2013$icd10[rx2013$icd10 == \"Don't know\"] <- NA\nrx2013$icd10[rx2013$icd10 == \"\"] <- NA\nrx2013$icd10.new <- substr(rx2013$icd10, start = 1, stop = 3)\n\nrx2013 <- na.omit(rx2013)"
  },
  {
    "objectID": "analytic13.html#analytic-data",
    "href": "analytic13.html#analytic-data",
    "title": "31  Recoding cycle 8",
    "section": "31.3 Analytic data",
    "text": "31.3 Analytic data\n\n31.3.1 Full dataset\n\nnhanes13r <- dat2\n\n\n\n31.3.2 Analytic datset - adults 20 years of more\n\nvars <- c(\n  # ID\n  \"id\",\n  \n  # Demographic\n  \"age\", \"age.cat\", \"sex\", \"education\", \"race\", \n  \"marital\", \"income\", \"born\", \"pregnancy\",\n  \n  # obesity\n  \"obese\", \n  \n  # Diabetes\n  \"diabetes\", \"diabetes.family.history\",\n  \n  # Smoking\n  \"smoking\", \n  \n  # Diet\n  \"diet.healthy\", \n\n  # Physical activity\n  \"physical.activity\", \n  \n  # Access to routine healthcare\n  \"medical.access\",\n  \n  # Blood pressure and Hypertension\n  \"systolicBP\", \"diastolicBP\", \n  \n  # Sleep \n  \"sleep\",\n\n  # Laboratory \n  \"uric.acid\", \"protein.total\", \"bilirubin.total\", \"phosphorus\",\n  \"sodium\", \"potassium\", \"globulin\", \"calcium.total\", \n  \"high.cholesterol\",\n  \n  # Survey features\n  \"survey.weight\", \"survey.weight.mec\", \"psu\", \"strata\", \n  \n  # Survey year\n  \"year\"\n)\n\nnhanes13r.sel <- nhanes13r[, vars]\n\n\n# Adults 20 years of more and not pregnant\ndim(nhanes13r.sel)\n#> [1] 10175    34\nanalytic13 <- subset(nhanes13r.sel, age >= 20 & \n                       pregnancy != 'yes')\ndim(analytic13)\n#> [1] 5769   34\n\n\n\n31.3.3 Save dataset for later use\n\ndim(analytic13)\n#> [1] 5769   34\ndim(rx2013)\n#> [1] 14474     3\nsave(analytic13, rx2013, file = \"data/analytic13recoded.RData\")"
  },
  {
    "objectID": "analytic15.html#load-downloaded-dataset",
    "href": "analytic15.html#load-downloaded-dataset",
    "title": "32  Recoding cycle 9",
    "section": "32.1 Load downloaded dataset",
    "text": "32.1 Load downloaded dataset\n\nload(file = \"data/analytic15.RData\")"
  },
  {
    "objectID": "analytic15.html#recoding",
    "href": "analytic15.html#recoding",
    "title": "32  Recoding cycle 9",
    "section": "32.2 Recoding",
    "text": "32.2 Recoding\n\n32.2.1 ID\n\ndat2 <- nhanes15\ndat2$id <- dat2$SEQN\n\n\n\n32.2.2 Demographic\n\n32.2.2.1 Age\n\ndat2$age <- dat2$RIDAGEYR\ndat2$age.cat <- car::recode(dat2$age, \" 0:19 = '<20'; 20:49 = '20-49'; 50:64 = '50-64'; \n                            65:80 = '65+'; else = NA \")\ndat2$age.cat <- factor(dat2$age.cat, levels = c(\"<20\", \"20-49\", \"50-64\", \"65+\"))\ntable(dat2$age.cat, useNA = \"always\")\n#> \n#>   <20 20-49 50-64   65+  <NA> \n#>  4252  2894  1447  1378     0\n\n\n\n32.2.2.2 Sex\n\ndat2$sex <- dat2$RIAGENDR\ntable(dat2$sex, useNA = \"always\")\n#> \n#>   Male Female   <NA> \n#>   4892   5079      0\n\n\n\n32.2.2.3 Education\n\ndat2$education <- dat2$DMDEDUC2\ndat2$education <- as.factor(dat2$education)\ndat2$education <- car::recode(dat2$education, recodes = \" c('College graduate or above') = \n'College graduate or above'; c('Some college or AA degree', 'High school graduate/GED or equi') = \n'High school'; c('Less than 9th grade', '9-11th grade (Includes 12th grad') = \n'Less than high school'; else = NA \")\ndat2$education <- factor(dat2$education, \n                         levels = c(\"Less than high school\", \"High school\", \n                                    \"College graduate or above\"))\ntable(dat2$education, useNA = \"always\")\n#> \n#>     Less than high school               High school College graduate or above \n#>                       688                      1692                      1422 \n#>                      <NA> \n#>                      6169\n\n\n\n32.2.2.4 Race/ethnicity\n\ndat2$race <- dat2$RIDRETH1\ndat2$race <- car::recode(dat2$race, recodes = \" 'Non-Hispanic White'='White';\n                    'Non-Hispanic Black'='Black'; c('Mexican American',\n                    'Other Hispanic')= 'Hispanic'; else='Others' \")\ndat2$race <- factor(dat2$race, levels = c(\"White\", \"Black\", \"Hispanic\", \"Others\"))\ntable(dat2$race, useNA = \"always\")\n#> \n#>    White    Black Hispanic   Others    
 <NA> \n#>     3066     2129     3229     1547        0\n\n\n\n32.2.2.5 Marital status\n\ndat2$marital <- dat2$DMDMARTL\ndat2$marital <- car::recode(dat2$marital, recodes = \" 'Never married'='Never married';\nc('Married', 'Living with partner') = 'Married/with partner'; \n                            c('Widowed', 'Divorced', 'Separated')='Other'; else=NA \")\ndat2$marital <- factor(dat2$marital, levels = c(\"Never married\", \"Married/with partner\",\n                                                \"Other\"))\ntable(dat2$marital, useNA = \"always\")\n#> \n#>        Never married Married/with partner                Other \n#>                 1048                 3441                 1227 \n#>                 <NA> \n#>                 4255\n\n\n\n32.2.2.6 Income\n\ndat2$income <- dat2$INDHHIN2\ndat2$income  <- car::recode(dat2$income, recodes = \" c('$ 0 to $ 4,999', '$ 5,000 to $ 9,999',\n'$10,000 to $14,999', '$15,000 to $19,999', 'Under $20,000')='less than $20,000';\n                       c('Over $20,000','$20,000 and Over', '$20,000 to $24,999', \n                       '$25,000 to $34,999', '$35,000 to $44,999', '$45,000 to $54,999', \n                       '$55,000 to $64,999', '$65,000 to $74,999')='$20,000 to $74,999';\n                       c('$75,000 to $99,999','$100,000 and Over')='$75,000 and Over'; \n                            else=NA \")\ndat2$income  <- factor(dat2$income , levels=c(\"less than $20,000\", \"$20,000 to $74,999\", \n                                              \"$75,000 and Over\"))\ntable(dat2$income, useNA = \"always\")\n#> \n#>  less than $20,000 $20,000 to $74,999   $75,000 and Over               <NA> \n#>               1906               4812               2554                699\n\n\n\n32.2.2.7 Where born / citizenship\n\ndat2$born <- dat2$DMDBORN4\ndat2$born <- car::recode(dat2$born, recodes = \" 'Others'='Other place';\n                       'Born in 50 US states or Washingt'= 'Born in US'; else=NA\")\ndat2$born <- 
factor(dat2$born, levels = c(\"Born in US\", \"Other place\"))\ntable(dat2$born, useNA = \"always\") \n#> \n#>  Born in US Other place        <NA> \n#>           0        2236        7735\n\n\n\n32.2.2.8 Pregnancy\n\ndat2$pregnancy <- dat2$RIDEXPRG\ndat2$pregnancy <- car::recode(dat2$pregnancy, \n                      recodes = \" 'Yes, positive lab pregnancy test' = 'Yes';\n                       'The participant was not pregnant' = 'No'; \n                       'Cannot ascertain if the particip' = 'inconclusive';\n                       else= 'outside of target population'  \")\ntable(dat2$pregnancy, useNA = \"always\") \n#> \n#> outside of target population                         <NA> \n#>                         9971                            0\n\n\n\n\n32.2.3 BMI\n\n32.2.3.1 BMI and Obesity\n\ndat2$bmi <- dat2$BMXBMI\nsummary(dat2$bmi)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   11.50   19.90   25.20   26.02   30.60   67.30    1215\ndat2$obese <- ifelse(dat2$BMXBMI >= 30, \"Yes\", \"No\")\ndat2$obese <- factor(dat2$obese, levels = c(\"No\", \"Yes\"))\ntable(dat2$obese, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 6346 2410 1215\n\n\n\n\n32.2.4 Diabetes\n\ndat2$diabetes <- dat2$DIQ010\ndat2$diabetes <- car::recode(dat2$diabetes, \" 'Yes'='Yes'; c('No','Borderline')='No';\n                             else=NA \")\n\n# Taking insulin now or diabetic pills to lower blood sugar - they have diabetes\ndat2$diabetes[dat2$DIQ050 == \"Yes\"] <- \"Yes\"\ndat2$diabetes[dat2$DIQ070 == \"Yes\"] <- \"Yes\"\ntable(dat2$diabetes, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 8648  923  400\n\n\n\n32.2.5 Family history of diabetes\n\ntable(dat2$DIQ175A, useNA = \"always\")\n#> \n#> Family history           <NA> \n#>           1186           8785\ndat2$diabetes.family.history <- dat2$DIQ175A\ndat2$diabetes.family.history <- car::recode(dat2$diabetes.family.history, \" 10 = 'Yes'; \n                             else = 'No' 
\")\ndat2$diabetes.family.history <- factor(dat2$diabetes.family.history, levels = c(\"No\", \"Yes\"))\ndat2$diabetes.family.history[dat2$DIQ175A==\"Don't know\"] <- NA\ntable(dat2$diabetes.family.history, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 9971    0    0\n\n\n\n32.2.6 Smoking\n\ndat2$smoking <- dat2$SMQ020\ndat2$smoking <- car::recode(dat2$smoking, \" 'Yes' = 'Current smoker'; 'No' = 'Never smoker'; else=NA  \")\ndat2$smoking <- factor(dat2$smoking, levels = c(\"Never smoker\", \"Previous smoker\", \"Current smoker\"))\ndat2$smoking[dat2$SMQ040 == \"Not at all\"] <- \"Previous smoker\"\ntable(dat2$smoking, useNA = \"always\")\n#> \n#>    Never smoker Previous smoker  Current smoker            <NA> \n#>            3559            1322            1100            3990\n\n\n\n32.2.7 Diet\n\n32.2.7.1 How healthy is the diet\n\ndat2$diet.healthy <- dat2$DBQ700\ndat2$diet.healthy <- car::recode(dat2$diet.healthy, recodes = \" c('Excellent', 'Very good')=\n                    'Very good or excellent'; 'Good'='Good'; c('Fair', 'Poor')=\n                    'Poor or fair'; else = NA \")\ndat2$diet.healthy <- factor(dat2$diet.healthy, levels = c(\"Poor or fair\", \"Good\", \n                                                          \"Very good or excellent\"))\ntable(dat2$diet.healthy, useNA = \"always\")\n#> \n#>           Poor or fair                   Good Very good or excellent \n#>                   2105                   2524                   1697 \n#>                   <NA> \n#>                   3645\n\n\n\n\n32.2.8 Vigorous physical activity\n\ndat2$physical.activity <- dat2$PAQ605\ndat2$physical.activity <- car::recode(dat2$physical.activity, recodes = \" 'No' = 'No'; \n                                      'Yes' = 'Yes'; else=NA\")\ndat2$physical.activity <- factor(dat2$physical.activity, levels = c(\"No\", \"Yes\"))\ntable(dat2$physical.activity, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 5596 1366 3009\n\n\n\n32.2.9 Access to medical 
services\n\ndat2$medical.access <- dat2$HUQ030\ndat2$medical.access <- car::recode(dat2$medical.access, recodes = \" c('Yes',\n                              'There is more than one place')='Yes'; 'There is no place'=\n                              'No'; else=NA\")\ntable(dat2$medical.access, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 1340 8631    0\n\n\n\n32.2.10 Hypertension/high blood pressure\n\n32.2.10.1 Systolic BP\n\ndat2$systolic1 <- dat2$BPXSY1\ndat2$systolic2 <- dat2$BPXSY2\ndat2$systolic3 <- dat2$BPXSY3\ndat2$systolic4 <- dat2$BPXSY4\n\ndat2 <- dat2 %>% \n  mutate(systolicBP = rowMeans(dat2[, c(\"systolic1\", \"systolic2\", \n                                        \"systolic3\", \"systolic4\")], \n                             na.rm = TRUE))\nsummary(dat2$systolicBP)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>    74.0   107.3   117.3   120.4   130.0   231.3    2608\n\n\n\n32.2.10.2 Diastolic BP\n\ndat2$diastolic1 <- dat2$BPXDI1\ndat2$diastolic2 <- dat2$BPXDI2\ndat2$diastolic3 <- dat2$BPXDI3\ndat2$diastolic4 <- dat2$BPXDI4\ndatX <- dat2[, c(\"diastolic1\", \"diastolic2\", \n                 \"diastolic3\", \"diastolic4\")]\ndatX[datX ==0] <- NA\ndat2$diastolicBP <- rowMeans(datX[, c(\"diastolic1\", \"diastolic2\", \n                                      \"diastolic3\", \"diastolic4\")], \n                             na.rm = TRUE)\nsummary(dat2$diastolicBP)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>    2.00   58.00   66.67   66.64   74.67  138.67    2636\n\n\n\n\n32.2.11 Sleep (daily in hours)\n\ndat2$sleep <- dat2$SLD012\nsummary(dat2$sleep)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   2.000   7.000   8.000   7.753   8.500  14.500    3677\n\n\n\n32.2.12 Laboratory data\n\n32.2.12.1 Uric acid (mg/dL)\n\ndat2$uric.acid <- dat2$LBXSUA\nsummary(dat2$uric.acid)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    
NA's \n#>   1.600   4.300   5.200   5.335   6.200  18.000    3717\n\n\n\n32.2.12.2 Total protein (g/dL)\n\ndat2$protein.total <- dat2$LBXSTP\nsummary(dat2$protein.total)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   5.200   6.900   7.200   7.201   7.500  10.100    3718\n\n\n\n32.2.12.3 Total bilirubin (mg/dL)\n\ndat2$bilirubin.total <- dat2$LBXSTB\nsummary(dat2$bilirubin.total)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   0.000   0.400   0.500   0.552   0.700   3.500    3717\n\n\n\n32.2.12.4 Phosphorus (mg/dL)\n\ndat2$phosphorus <- dat2$LBXSPH\nsummary(dat2$phosphorus)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   1.000   3.400   3.800   3.796   4.200   9.700    3715\n\n\n\n32.2.12.5 Sodium (mmol/L)\n\ndat2$sodium <- dat2$LBXSNASI\nsummary(dat2$sodium)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   124.0   137.0   139.0   138.7   140.0   161.0    3714\n\n\n\n32.2.12.6 Potassium (mmol/L)\n\ndat2$potassium <- dat2$LBXSKSI\nsummary(dat2$potassium)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   2.600   3.740   3.930   3.952   4.150   5.860    3714\n\n\n\n32.2.12.7 Globulin (g/dL)\n\ndat2$globulin <- dat2$LBXSGB\nsummary(dat2$globulin)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   0.600   2.600   2.800   2.857   3.100   7.000    3719\n\n\n\n32.2.12.8 Total calcium (mg/dL)\n\ndat2$calcium.total <- dat2$LBXSCA\nsummary(dat2$calcium.total)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    
NA's \n#>   7.300   9.100   9.400   9.375   9.600  11.500    3714\n\n\n\n32.2.12.9 High cholesterol\n\ndat2$high.cholesterol <- dat2$BPQ080\ndat2$high.cholesterol <- car::recode(dat2$high.cholesterol, recodes = \" 'Yes'='Yes';\n                                     'No'='No'; else = NA\")\ntable(dat2$high.cholesterol, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 4323 1960 3688\n\n\n\n\n32.2.13 Survey features\n\n32.2.13.1 Weight\n\ndat2$survey.weight <- dat2$WTINT2YR\nsummary(dat2$survey.weight)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#>    3294   12878   20160   31740   33257  233756\ndat2$survey.weight.mec <- dat2$WTMEC2YR\nsummary(dat2$survey.weight.mec)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#>       0   12551   20281   31740   33708  242387\n\n\n\n32.2.13.2 PSU\n\ndat2$psu <- as.factor(dat2$SDMVPSU)\ntable(dat2$psu)\n#> \n#>    1    2 \n#> 5127 4844\n\n\n\n32.2.13.3 Strata\n\ndat2$strata <- as.factor(dat2$SDMVSTRA)\ntable(dat2$strata)\n#> \n#> 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 \n#> 462 685 694 629 612 571 673 723 665 688 681 759 805 773 551\n\n\n\n\n32.2.14 Survey year\n\ndat2$year <- dat2$SDDSRVYR\ntable(dat2$year, useNA = \"always\") \n#> \n#> NHANES 2015-2016 public release                            <NA> \n#>                            9971                               0\n\n\n\n32.2.15 ICD-10-CM codes\n\ncolnames(rxq12) <- c(\"id\", \"icd10\")\ncolnames(rxq22) <- c(\"id\", \"icd10\")\ncolnames(rxq32) <- c(\"id\", \"icd10\")\n\nrx2015 <- rbind(rxq12, rxq22, rxq32)\nrx2015 <- rx2015[order(rx2015$id),]\n\nrx2015$icd10[rx2015$icd10 == \"Unknown\"] <- NA\nrx2015$icd10[rx2015$icd10 == \"Refused\"] <- NA\nrx2015$icd10[rx2015$icd10 == \"Don't know\"] <- NA\nrx2015$icd10[rx2015$icd10 == \"\"] <- NA\nrx2015$icd10.new <- substr(rx2015$icd10, start = 1, stop = 3)\n\nrx2015 <- na.omit(rx2015)"
  },
  {
    "objectID": "analytic15.html#analytic-data",
    "href": "analytic15.html#analytic-data",
    "title": "32  Recoding cycle 9",
    "section": "32.3 Analytic data",
    "text": "32.3 Analytic data\n\n32.3.1 Full dataset\n\nnhanes15r <- dat2\n\n\n\n32.3.2 Analytic datset - adults 20 years of more\n\nvars <- c(\n  # ID\n  \"id\",\n  \n  # Demographic\n  \"age\", \"age.cat\", \"sex\", \"education\", \"race\", \n  \"marital\", \"income\", \"born\", \"pregnancy\",\n  \n  # obesity\n  \"obese\", \n  \n  # Diabetes\n  \"diabetes\", \"diabetes.family.history\",\n  \n  # Smoking\n  \"smoking\", \n  \n  # Diet\n  \"diet.healthy\", \n\n  # Physical activity\n  \"physical.activity\", \n  \n  # Access to routine healthcare\n  \"medical.access\",\n  \n  # Blood pressure and Hypertension\n  \"systolicBP\", \"diastolicBP\", \n  \n  # Sleep \n  \"sleep\",\n\n  # Laboratory \n  \"uric.acid\", \"protein.total\", \"bilirubin.total\", \"phosphorus\",\n  \"sodium\", \"potassium\", \"globulin\", \"calcium.total\", \n  \"high.cholesterol\",\n  \n  # Survey features\n  \"survey.weight\", \"survey.weight.mec\", \"psu\", \"strata\", \n  \n  # Survey year\n  \"year\"\n)\n\nnhanes15r.sel <- nhanes15r[, vars]\n\n\n# Adults 20 years of more and not pregnant\ndim(nhanes15r.sel)\n#> [1] 9971   34\nanalytic15 <- subset(nhanes15r.sel, age >= 20 & \n                       pregnancy != 'yes')\ndim(analytic15)\n#> [1] 5719   34\n\n\n\n32.3.3 Save dataset for later use\n\ndim(analytic15)\n#> [1] 5719   34\ndim(rx2015)\n#> [1] 14084     3\nsave(analytic15, rx2015, file = \"data/analytic15recoded.RData\")"
  },
  {
    "objectID": "analytic17.html#load-downloaded-dataset",
    "href": "analytic17.html#load-downloaded-dataset",
    "title": "33  Recoding cycle 10",
    "section": "33.1 Load downloaded dataset",
    "text": "33.1 Load downloaded dataset\n\nload(file = \"data/analytic17.RData\")"
  },
  {
    "objectID": "analytic17.html#recoding",
    "href": "analytic17.html#recoding",
    "title": "33  Recoding cycle 10",
    "section": "33.2 Recoding",
    "text": "33.2 Recoding\n\n33.2.1 ID\n\ndat2 <- nhanes17\ndat2$id <- dat2$SEQN\n\n\n\n33.2.2 Demographic\n\n33.2.2.1 Age\n\ndat2$age <- dat2$RIDAGEYR\ndat2$age.cat <- car::recode(dat2$age, \" 0:19 = '<20'; 20:49 = '20-49'; 50:64 = '50-64'; \n                            65:80 = '65+'; else = NA \")\ndat2$age.cat <- factor(dat2$age.cat, levels = c(\"<20\", \"20-49\", \"50-64\", \"65+\"))\ntable(dat2$age.cat, useNA = \"always\")\n#> \n#>   <20 20-49 50-64   65+  <NA> \n#>  3685  2500  1569  1500     0\n\n\n\n33.2.2.2 Sex\n\ndat2$sex <- dat2$RIAGENDR\ntable(dat2$sex, useNA = \"always\")\n#> \n#>   Male Female   <NA> \n#>   4557   4697      0\n\n\n\n33.2.2.3 Education\n\ndat2$education <- dat2$DMDEDUC2\ndat2$education <- as.factor(dat2$education)\ndat2$education <- car::recode(dat2$education, recodes = \" c('College graduate or above') = \n'College graduate or above'; c('Some college or AA degree', 'High school graduate/GED or equi') = \n'High school'; c('Less than 9th grade', '9-11th grade (Includes 12th grad') = \n'Less than high school'; else = NA \")\ndat2$education <- factor(dat2$education, \n                         levels = c(\"Less than high school\", \"High school\", \n                                    \"College graduate or above\"))\ntable(dat2$education, useNA = \"always\")\n#> \n#>     Less than high school               High school College graduate or above \n#>                       479                      1778                      1336 \n#>                      <NA> \n#>                      5661\n\n\n\n33.2.2.4 Race/ethnicity\n\ndat2$race <- dat2$RIDRETH1\ndat2$race <- car::recode(dat2$race, recodes = \" 'Non-Hispanic White'='White';\n                    'Non-Hispanic Black'='Black'; c('Mexican American',\n                    'Other Hispanic')= 'Hispanic'; else='Others' \")\ndat2$race <- factor(dat2$race, levels = c(\"White\", \"Black\", \"Hispanic\", \"Others\"))\ntable(dat2$race, useNA = \"always\")\n#> \n#>    White    Black Hispanic   Others    
 <NA> \n#>     3150     2115     2187     1802        0\n\n\n\n33.2.2.5 Marital status\n\ndat2$marital <- dat2$DMDMARTL\ndat2$marital <- car::recode(dat2$marital, recodes = \" 'Never married'='Never married';\nc('Married', 'Living with partner') = 'Married/with partner'; \n                            c('Widowed', 'Divorced', 'Separated')='Other'; else=NA \")\ndat2$marital <- factor(dat2$marital, levels = c(\"Never married\", \"Married/with partner\",\n                                                \"Other\"))\ntable(dat2$marital, useNA = \"always\")\n#> \n#>        Never married Married/with partner                Other \n#>                 1006                 3252                 1305 \n#>                 <NA> \n#>                 3691\n\n\n\n33.2.2.6 Income\n\ndat2$income <- dat2$INDHHIN2\ndat2$income  <- car::recode(dat2$income, recodes = \" c('$ 0 to $ 4,999', '$ 5,000 to $ 9,999',\n'$10,000 to $14,999', '$15,000 to $19,999', 'Under $20,000')='less than $20,000';\n                       c('Over $20,000','$20,000 and Over', '$20,000 to $24,999', \n                       '$25,000 to $34,999', '$35,000 to $44,999', '$45,000 to $54,999', \n                       '$55,000 to $64,999', '$65,000 to $74,999')='$20,000 to $74,999';\n                       c('$75,000 to $99,999','$100,000 and Over')='$75,000 and Over'; \n                            else=NA \")\ndat2$income  <- factor(dat2$income , levels=c(\"less than $20,000\", \"$20,000 to $74,999\", \n                                              \"$75,000 and Over\"))\ntable(dat2$income, useNA = \"always\")\n#> \n#>  less than $20,000 $20,000 to $74,999   $75,000 and Over               <NA> \n#>               1589               4331               2453                881\n\n\n\n33.2.2.7 Where born / citizenship\n\ndat2$born <- dat2$DMDBORN4\ndat2$born <- car::recode(dat2$born, recodes = \" 'Others'='Other place';\n                       'Born in 50 US states or Washington, DC'= 'Born in US'; else=NA\")\ndat2$born <- 
factor(dat2$born, levels = c(\"Born in US\", \"Other place\"))\ntable(dat2$born, useNA = \"always\") \n#> \n#>  Born in US Other place        <NA> \n#>        7303        1948           3\n\n\n\n33.2.2.8 Pregnancy\n\ndat2$pregnancy <- dat2$RIDEXPRG\ndat2$pregnancy <- car::recode(dat2$pregnancy, \n                      recodes = \" 'Yes, positive lab pregnancy test' = 'Yes';\n                       'The participant was not pregnant' = 'No'; \n                       'Cannot ascertain if the particip' = 'inconclusive';\n                       else= 'outside of target population'  \")\ntable(dat2$pregnancy, useNA = \"always\") \n#> \n#> outside of target population                         <NA> \n#>                         9254                            0\n\n\n\n\n33.2.3 BMI\n\n33.2.3.1 BMI and Obesity\n\ndat2$bmi <- dat2$BMXBMI\nsummary(dat2$bmi)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   12.30   20.40   25.80   26.58   31.30   86.20    1249\ndat2$obese <- ifelse(dat2$BMXBMI >= 30, \"Yes\", \"No\")\ndat2$obese <- factor(dat2$obese, levels = c(\"No\", \"Yes\"))\ntable(dat2$obese, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 5597 2408 1249\n\n\n\n\n33.2.4 Diabetes\n\ndat2$diabetes <- dat2$DIQ010\ndat2$diabetes <- car::recode(dat2$diabetes, \" 'Yes'='Yes'; c('No','Borderline')='No';\n                             else=NA \")\n\n# Taking insulin now or diabetic pills to lower blood sugar - they have diabetes\ndat2$diabetes[dat2$DIQ050 == \"Yes\"] <- \"Yes\"\ndat2$diabetes[dat2$DIQ070 == \"Yes\"] <- \"Yes\"\ntable(dat2$diabetes, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 7927  966  361\n\n\n\n33.2.5 Family history of diabetes\n\ntable(dat2$DIQ175A, useNA = \"always\")\n#> \n#> Family history     Don't know           <NA> \n#>           1143              2           8109\ndat2$diabetes.family.history <- dat2$DIQ175A\ndat2$diabetes.family.history <- car::recode(dat2$diabetes.family.history, \" 'Family history' = 'Yes'; \n                      
       else = 'No' \")\ndat2$diabetes.family.history <- factor(dat2$diabetes.family.history, levels = c(\"No\", \"Yes\"))\ndat2$diabetes.family.history[dat2$DIQ175A==\"Don't know\"] <- NA\ntable(dat2$diabetes.family.history, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 8109 1143    2\n\n\n\n33.2.6 Smoking\n\ndat2$smoking <- dat2$SMQ020\ndat2$smoking <- car::recode(dat2$smoking, \" 'Yes' = 'Current smoker'; 'No' = 'Never smoker'; else=NA  \")\ndat2$smoking <- factor(dat2$smoking, levels = c(\"Never smoker\", \"Previous smoker\", \"Current smoker\"))\ndat2$smoking[dat2$SMQ040 == \"Not at all\"] <- \"Previous smoker\"\ntable(dat2$smoking, useNA = \"always\")\n#> \n#>    Never smoker Previous smoker  Current smoker            <NA> \n#>            3497            1338            1021            3398\n\n\n\n33.2.7 Diet\n\n33.2.7.1 How healthy is the diet\n\ndat2$diet.healthy <- dat2$DBQ700\ndat2$diet.healthy <- car::recode(dat2$diet.healthy, recodes = \" c('Excellent', 'Very good')=\n                    'Very good or excellent'; 'Good'='Good'; c('Fair', 'Poor')=\n                    'Poor or fair'; else = NA \")\ndat2$diet.healthy <- factor(dat2$diet.healthy, levels = c(\"Poor or fair\", \"Good\", \n                                                          \"Very good or excellent\"))\ntable(dat2$diet.healthy, useNA = \"always\")\n#> \n#>           Poor or fair                   Good Very good or excellent \n#>                   2036                   2411                   1712 \n#>                   <NA> \n#>                   3095\n\n\n\n\n33.2.8 Vigorous physical activity\n\ndat2$physical.activity <- dat2$PAQ605\ndat2$physical.activity <- car::recode(dat2$physical.activity, recodes = \" 'No' = 'No'; \n                                      'Yes' = 'Yes'; else=NA\")\ndat2$physical.activity <- factor(dat2$physical.activity, levels = c(\"No\", \"Yes\"))\ntable(dat2$physical.activity, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 4461 1389 3404\n\n\n\n33.2.9 Access 
to medical services\n\ndat2$medical.access <- dat2$HUQ030\ndat2$medical.access <- car::recode(dat2$medical.access, recodes = \" c('Yes',\n                              'There is more than one place')='Yes'; 'There is no place'=\n                              'No'; else=NA\")\ntable(dat2$medical.access, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 1398 7854    2\n\n\n\n33.2.10 Hypertension/high blood pressure\n\n33.2.10.1 Systolic BP\n\ndat2$systolic1 <- dat2$BPXSY1\ndat2$systolic2 <- dat2$BPXSY2\ndat2$systolic3 <- dat2$BPXSY3\ndat2$systolic4 <- dat2$BPXSY4\n\ndat2 <- dat2 %>% \n  mutate(systolicBP = rowMeans(dat2[, c(\"systolic1\", \"systolic2\", \n                                        \"systolic3\", \"systolic4\")], \n                             na.rm = TRUE))\nsummary(dat2$systolicBP)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   72.67  106.67  118.00  121.68  132.67  238.00    2537\n\n\n\n33.2.10.2 Diastolic BP\n\ndat2$diastolic1 <- dat2$BPXDI1\ndat2$diastolic2 <- dat2$BPXDI2\ndat2$diastolic3 <- dat2$BPXDI3\ndat2$diastolic4 <- dat2$BPXDI4\ndatX <- dat2[, c(\"diastolic1\", \"diastolic2\", \n                 \"diastolic3\", \"diastolic4\")]\ndatX[datX ==0] <- NA\ndat2$diastolicBP <- rowMeans(datX[, c(\"diastolic1\", \"diastolic2\", \n                                      \"diastolic3\", \"diastolic4\")], \n                             na.rm = TRUE)\nsummary(dat2$diastolicBP)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>    8.00   61.33   70.00   69.54   77.33  135.33    2618\n\n\n\n\n33.2.11 Sleep (daily in hours)\n\ndat2$sleep <- dat2$SLD012\nsummary(dat2$sleep)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   2.000   7.000   8.000   7.659   8.500  14.000    3141\n\n\n\n33.2.12 Laboratory data\n\n33.2.12.1 Uric acid (mg/dL)\n\ndat2$uric.acid <- dat2$LBXSUA\nsummary(dat2$uric.acid)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    
NA's \n#>   0.800   4.300   5.300   5.402   6.300  15.100    3353\n\n\n\n33.2.12.2 Total protein (g/dL)\n\ndat2$protein.total <- dat2$LBXSTP\nsummary(dat2$protein.total)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   5.300   6.900   7.200   7.166   7.400  10.000    3353\n\n\n\n33.2.12.3 Total bilirubin (mg/dL)\n\ndat2$bilirubin.total <- dat2$LBXSTB\nsummary(dat2$bilirubin.total)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>    0.10    0.30    0.40    0.46    0.60    3.70    3351\n\n\n\n33.2.12.4 Phosphorus (mg/dL)\n\ndat2$phosphorus <- dat2$LBXSPH\nsummary(dat2$phosphorus)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   1.900   3.300   3.600   3.665   4.000   9.600    3353\n\n\n\n33.2.12.5 Sodium (mmol/L)\n\ndat2$sodium <- dat2$LBXSNASI\nsummary(dat2$sodium)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   121.0   138.0   140.0   140.3   142.0   151.0    3350\n\n\n\n33.2.12.6 Potassium (mmol/L)\n\ndat2$potassium <- dat2$LBXSKSI\nsummary(dat2$potassium)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   2.800   3.900   4.100   4.094   4.300   6.600    3355\n\n\n\n33.2.12.7 Globulin (g/dL)\n\ndat2$globulin <- dat2$LBXSGB\nsummary(dat2$globulin)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n#>   1.800   2.800   3.100   3.087   3.300   6.000    3353\n\n\n\n33.2.12.8 Total calcium (mg/dL)\n\ndat2$calcium.total <- dat2$LBXSCA\nsummary(dat2$calcium.total)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    
NA's \n#>    6.40    9.10    9.30    9.32    9.60   11.70    3353\n\n\n\n33.2.12.9 High cholesterol\n\ndat2$high.cholesterol <- dat2$BPQ080\ndat2$high.cholesterol <- car::recode(dat2$high.cholesterol, recodes = \" 'Yes'='Yes';\n                                     'No'='No'; else = NA\")\ntable(dat2$high.cholesterol, useNA = \"always\")\n#> \n#>   No  Yes <NA> \n#> 4153 1968 3133\n\n\n\n\n33.2.13 Survey features\n\n33.2.13.1 Weight\n\ndat2$survey.weight <- dat2$WTINT2YR\nsummary(dat2$survey.weight)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#>    2571   13074   21099   34671   36923  433085\ndat2$survey.weight.mec <- dat2$WTMEC2YR\nsummary(dat2$survey.weight.mec)\n#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n#>       0   12347   21060   34671   37562  419763\n\n\n\n33.2.13.2 PSU\n\ndat2$psu <- as.factor(dat2$SDMVPSU)\ntable(dat2$psu)\n#> \n#>    1    2 \n#> 4464 4790\n\n\n\n33.2.13.3 Strata\n\ndat2$strata <- as.factor(dat2$SDMVSTRA)\ntable(dat2$strata)\n#> \n#> 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 \n#> 510 638 695 554 605 653 612 693 735 551 689 609 604 596 510\n\n\n\n\n33.2.14 Survey year\n\ndat2$year <- dat2$SDDSRVYR\ntable(dat2$year, useNA = \"always\") \n#> \n#> NHANES 2017-2018 public release                            <NA> \n#>                            9254                               0\n\n\n\n33.2.15 ICD-10-CM codes\n\ncolnames(rxq12) <- c(\"id\", \"icd10\")\ncolnames(rxq22) <- c(\"id\", \"icd10\")\ncolnames(rxq32) <- c(\"id\", \"icd10\")\n\nrx2017 <- rbind(rxq12, rxq22, rxq32)\nrx2017 <- rx2017[order(rx2017$id),]\n\nrx2017$icd10[rx2017$icd10 == \"Unknown\"] <- NA\nrx2017$icd10[rx2017$icd10 == \"Refused\"] <- NA\nrx2017$icd10[rx2017$icd10 == \"Don't know\"] <- NA\nrx2017$icd10[rx2017$icd10 == \"\"] <- NA\nrx2017$icd10.new <- substr(rx2017$icd10, start = 1, stop = 3)\n\nrx2017 <- na.omit(rx2017)"
  },
  {
    "objectID": "analytic17.html#analytic-data",
    "href": "analytic17.html#analytic-data",
    "title": "33  Recoding cycle 10",
    "section": "33.3 Analytic data",
    "text": "33.3 Analytic data\n\n33.3.1 Full dataset\n\nnhanes17r <- dat2\n\n\n\n33.3.2 Analytic datset - adults 20 years of more\n\nvars <- c(\n  # ID\n  \"id\",\n  \n  # Demographic\n  \"age\", \"age.cat\", \"sex\", \"education\", \"race\", \n  \"marital\", \"income\", \"born\", \"pregnancy\",\n  \n  # obesity\n  \"obese\", \n  \n  # Diabetes\n  \"diabetes\", \"diabetes.family.history\",\n  \n  # Smoking\n  \"smoking\", \n  \n  # Diet\n  \"diet.healthy\", \n\n  # Physical activity\n  \"physical.activity\", \n  \n  # Access to routine healthcare\n  \"medical.access\",\n  \n  # Blood pressure and Hypertension\n  \"systolicBP\", \"diastolicBP\", \n  \n  # Sleep \n  \"sleep\",\n\n  # Laboratory \n  \"uric.acid\", \"protein.total\", \"bilirubin.total\", \"phosphorus\",\n  \"sodium\", \"potassium\", \"globulin\", \"calcium.total\", \n  \"high.cholesterol\",\n  \n  # Survey features\n  \"survey.weight\", \"survey.weight.mec\", \"psu\", \"strata\", \n  \n  # Survey year\n  \"year\"\n)\n\nnhanes17r.sel <- nhanes17r[, vars]\n\n\n# Adults 20 years of more and not pregnant\ndim(nhanes17r.sel)\n#> [1] 9254   34\nanalytic17 <- subset(nhanes17r.sel, age >= 20 & \n                       pregnancy != 'yes')\ndim(analytic17)\n#> [1] 5569   34\n\n\n\n33.3.3 Save dataset for later use\n\ndim(analytic17)\n#> [1] 5569   34\ndim(rx2017)\n#> [1] 15025     3\nsave(analytic17, rx2017, file = \"data/analytic17recoded.RData\")"
  },
  {
    "objectID": "merge13to17.html#analytic-dataset",
    "href": "merge13to17.html#analytic-dataset",
    "title": "34  Merge three cycles",
    "section": "34.1 Analytic dataset",
    "text": "34.1 Analytic dataset\n\n34.1.1 Load 2013-18 datasets\n\nload(\"data/analytic13recoded.RData\")\nload(\"data/analytic15recoded.RData\")\nload(\"data/analytic17recoded.RData\")\n\n\n\n34.1.2 Merge 2013-18 datasets\n\n# adults aged 20 years or more\ndata.merged0 <- rbind(analytic13, analytic15, analytic17)\ndim(data.merged0)\n#> [1] 17057    34\ndata.merged <- droplevels(data.merged0)\n\n\n\n34.1.3 Check missingness\n\nplot_missing(data.merged)\n\n\n\n# profile_missing(data.merged)\ndim(data.merged)\n#> [1] 17057    34\n\n\n\nThe data contants variables with some missing information.\n\ndata.complete <- na.omit(data.merged)\ndim(data.complete)\n#> [1] 6850   34\n\n\n\n\nOnly complete cases retained, and survey features/weights were ignored for simplicity.\nIn a realistic analysis, we would consider the missingness pattern before deleting or imputing such information."
  },
  {
    "objectID": "merge13to17.html#summary-statistics",
    "href": "merge13to17.html#summary-statistics",
    "title": "34  Merge three cycles",
    "section": "34.2 Summary statistics",
    "text": "34.2 Summary statistics\n\n\n\n\n\n\n\nNo(N=4291)\nYes(N=2559)\nOverall(N=6850)\n\n\n\n\nage.cat\n\n\n\n\n\n20-49\n2208 (51.5%)\n1227 (47.9%)\n3435 (50.1%)\n\n\n50-64\n1085 (25.3%)\n767 (30.0%)\n1852 (27.0%)\n\n\n65+\n998 (23.3%)\n565 (22.1%)\n1563 (22.8%)\n\n\nsex\n\n\n\n\n\nMale\n2086 (48.6%)\n1106 (43.2%)\n3192 (46.6%)\n\n\nFemale\n2205 (51.4%)\n1453 (56.8%)\n3658 (53.4%)\n\n\neducation\n\n\n\n\n\nLess than high school\n597 (13.9%)\n419 (16.4%)\n1016 (14.8%)\n\n\nHigh school\n1809 (42.2%)\n1375 (53.7%)\n3184 (46.5%)\n\n\nCollege graduate or above\n1885 (43.9%)\n765 (29.9%)\n2650 (38.7%)\n\n\nrace\n\n\n\n\n\nWhite\n1496 (34.9%)\n932 (36.4%)\n2428 (35.4%)\n\n\nBlack\n583 (13.6%)\n581 (22.7%)\n1164 (17.0%)\n\n\nHispanic\n955 (22.3%)\n763 (29.8%)\n1718 (25.1%)\n\n\nOthers\n1257 (29.3%)\n283 (11.1%)\n1540 (22.5%)\n\n\nmarital\n\n\n\n\n\nNever married\n757 (17.6%)\n408 (15.9%)\n1165 (17.0%)\n\n\nMarried/with partner\n2756 (64.2%)\n1533 (59.9%)\n4289 (62.6%)\n\n\nOther\n778 (18.1%)\n618 (24.2%)\n1396 (20.4%)\n\n\nincome\n\n\n\n\n\nless than $20,000\n668 (15.6%)\n443 (17.3%)\n1111 (16.2%)\n\n\n$20,000 to $74,999\n1955 (45.6%)\n1353 (52.9%)\n3308 (48.3%)\n\n\n$75,000 and Over\n1668 (38.9%)\n763 (29.8%)\n2431 (35.5%)\n\n\nborn\n\n\n\n\n\nBorn in US\n2269 (52.9%)\n1745 (68.2%)\n4014 (58.6%)\n\n\nOther place\n2022 (47.1%)\n814 (31.8%)\n2836 (41.4%)\n\n\nyear\n\n\n\n\n\nNHANES 2013-2014 public release\n1976 (46.0%)\n1100 (43.0%)\n3076 (44.9%)\n\n\nNHANES 2015-2016 public release\n740 (17.2%)\n337 (13.2%)\n1077 (15.7%)\n\n\nNHANES 2017-2018 public release\n1575 (36.7%)\n1122 (43.8%)\n2697 (39.4%)\n\n\ndiabetes.family.history\n\n\n\n\n\nNo\n3656 (85.2%)\n1971 (77.0%)\n5627 (82.1%)\n\n\nYes\n635 (14.8%)\n588 (23.0%)\n1223 (17.9%)\n\n\nsmoking\n\n\n\n\n\nNever smoker\n2760 (64.3%)\n1591 (62.2%)\n4351 (63.5%)\n\n\nPrevious smoker\n917 (21.4%)\n636 (24.9%)\n1553 (22.7%)\n\n\nCurrent smoker\n614 (14.3%)\n332 (13.0%)\n946 (13.8%)\n\n\ndiet.healthy\n\n\n\n\n\nPoor or 
fair\n876 (20.4%)\n1006 (39.3%)\n1882 (27.5%)\n\n\nGood\n1747 (40.7%)\n1039 (40.6%)\n2786 (40.7%)\n\n\nVery good or excellent\n1668 (38.9%)\n514 (20.1%)\n2182 (31.9%)\n\n\nphysical.activity\n\n\n\n\n\nNo\n3590 (83.7%)\n2007 (78.4%)\n5597 (81.7%)\n\n\nYes\n701 (16.3%)\n552 (21.6%)\n1253 (18.3%)\n\n\nmedical.access\n\n\n\n\n\nNo\n767 (17.9%)\n319 (12.5%)\n1086 (15.9%)\n\n\nYes\n3524 (82.1%)\n2240 (87.5%)\n5764 (84.1%)\n\n\nsleep\n\n\n\n\n\nMean (SD)\n7.32 (1.42)\n7.21 (1.54)\n7.28 (1.47)\n\n\nMedian [Min, Max]\n7.00 [2.00, 14.0]\n7.00 [2.00, 14.0]\n7.00 [2.00, 14.0]\n\n\nsystolicBP\n\n\n\n\n\nMean (SD)\n122 (18.2)\n127 (17.4)\n124 (18.1)\n\n\nMedian [Min, Max]\n118 [64.7, 229]\n125 [74.0, 212]\n121 [64.7, 229]\n\n\ndiastolicBP\n\n\n\n\n\nMean (SD)\n70.2 (11.1)\n72.8 (11.5)\n71.2 (11.3)\n\n\nMedian [Min, Max]\n70.7 [12.0, 123]\n72.7 [26.0, 124]\n71.3 [12.0, 124]\n\n\nuric.acid\n\n\n\n\n\nMean (SD)\n5.19 (1.36)\n5.74 (1.48)\n5.39 (1.43)\n\n\nMedian [Min, Max]\n5.10 [1.10, 12.3]\n5.60 [2.10, 13.3]\n5.30 [1.10, 13.3]\n\n\nprotein.total\n\n\n\n\n\nMean (SD)\n7.14 (0.454)\n7.10 (0.443)\n7.12 (0.450)\n\n\nMedian [Min, Max]\n7.10 [4.70, 10.2]\n7.10 [5.40, 9.10]\n7.10 [4.70, 10.2]\n\n\nbilirubin.total\n\n\n\n\n\nMean (SD)\n0.594 (0.307)\n0.513 (0.304)\n0.564 (0.308)\n\n\nMedian [Min, Max]\n0.500 [0, 3.30]\n0.500 [0, 7.10]\n0.500 [0, 7.10]\n\n\nphosphorus\n\n\n\n\n\nMean (SD)\n3.73 (0.545)\n3.66 (0.575)\n3.70 (0.557)\n\n\nMedian [Min, Max]\n3.70 [2.00, 6.10]\n3.60 [1.80, 8.90]\n3.70 [1.80, 8.90]\n\n\nsodium\n\n\n\n\n\nMean (SD)\n140 (2.45)\n140 (2.58)\n140 (2.50)\n\n\nMedian [Min, Max]\n140 [124, 150]\n140 [121, 154]\n140 [121, 154]\n\n\npotassium\n\n\n\n\n\nMean (SD)\n4.01 (0.358)\n4.04 (0.363)\n4.02 (0.360)\n\n\nMedian [Min, Max]\n4.00 [2.80, 6.00]\n4.00 [2.80, 6.60]\n4.00 [2.80, 6.60]\n\n\nglobulin\n\n\n\n\n\nMean (SD)\n2.88 (0.438)\n3.02 (0.450)\n2.93 (0.448)\n\n\nMedian [Min, Max]\n2.80 [1.60, 6.50]\n3.00 [1.40, 5.20]\n2.90 [1.40, 6.50]\n\n\ncalcium.total\n\n\n\n\n\nMean 
(SD)\n9.39 (0.364)\n9.32 (0.381)\n9.36 (0.371)\n\n\nMedian [Min, Max]\n9.40 [6.40, 14.8]\n9.30 [6.60, 12.0]\n9.40 [6.40, 14.8]\n\n\nhigh.cholesterol\n\n\n\n\n\nNo\n2833 (66.0%)\n1504 (58.8%)\n4337 (63.3%)\n\n\nYes\n1458 (34.0%)\n1055 (41.2%)\n2513 (36.7%)\n\n\n\n\n\n\n\n\n\nInvestigator-specified covariates stratified by the exposure (obesity)\nThis table includes information about participants with and without ICD-10-CM proxy information. Therefore, the sample is larger than in the original analysis."
  },
  {
    "objectID": "merge13to17.html#proxy-data-from-icd10-codes",
    "href": "merge13to17.html#proxy-data-from-icd10-codes",
    "title": "34  Merge three cycles",
    "section": "34.3 Proxy data from ICD10 codes",
    "text": "34.3 Proxy data from ICD10 codes\n\ndat.proxy.long <- rbind(rx2013, rx2015, rx2017) \ndat.proxy.long$icd10 <- NULL\n# Rename 3 digits ICD-10 codes as icd10\ncolnames(dat.proxy.long)[names(dat.proxy.long)==\"icd10.new\"] <- \"icd10\"\n\n\n\nWe combine all of the ICD-10-CM information form all 3 cycles."
  },
  {
    "objectID": "merge13to17.html#save-dataset-for-later-use",
    "href": "merge13to17.html#save-dataset-for-later-use",
    "title": "34  Merge three cycles",
    "section": "34.4 Save dataset for later use",
    "text": "34.4 Save dataset for later use\n\nsave(data.merged, \n     data.complete, \n     dat.proxy.long, \n     file = \"data/analytic3cycles.RData\")"
  }
]