rscripts/simhiv.qmd at master · harrelfe/rscripts · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
---
title: "Ordinal Regression for Simulated HIV Data"
author:
  - name: Frank Harrell
    affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
  html:
    self-contained: true
    number-sections: true
    number-depth: 3
    anchor-sections: true
    smooth-scroll: true
    theme: journal
    toc: true
    toc-depth: 3
    toc-title: Contents
    toc-location: left
    code-link: false
    code-tools: true
    code-fold: show
    code-block-bg: "#f1f3f5"
    code-block-border-left: "#31BAE9"
    reference-location: margin
    fig-cap-location: margin

execute:
  warning: false
  message: false

major: regression modeling
minor: ordinal
---

```{r setup}
require(rms)
options(prType='html')
```

# Problem

* From[Liu, Shepherd, Li, Harrell (2017)
* Simulated data to mimic real data from an HIV study
* N observations: 1796
* Goal: Examine predictors of post-treatment CD4 count

To avoid having to make a choice of how to transform the dependent variable, we use ordinal regression for this continuous variable.  The `rms` `orm` function is made for that.  We will use the proportional odds ordinal logistic semiparametric regression model, commonly called the proportional odds model or the ordinal logistic model.  There are 447 distinct CD4 counts in the data so we will have 446 intercepts in the ordinal model.

# Data Import

```{r import, results='asis'}
getHdata(simhiv)
d <- upData(simhiv,
            rename = .q(init.class = class, init.year = year,       preARTAIDS.clinical = aids),
            labels = .q(aids = "Pre-ART AIDS") )
contents(d)
```

# Descriptive Statistics


```{r desc, results='asis'}
describe(d)
```

See how strongly predictors relate to each other.  Note that the Spearman $\rho$ rank correlations used in this call to `varclus` are transformation-invariant.

```{r}
plot(varclus(~ . - post.cd4, data=d))
```

Two of the continuous predictors have very skewed distributions.  Taking $\sqrt[3]{}$ of them made for a nicer distribution and is likely to slightly linearize the needed transformations as predictors.  One has zeros so log cannot be used.  We stratify those predictors by baseline AIDS status.

```{r}
cr <- function(x) x ^ (1/3)
with(d, histboxp(x=nadir.cd4,     group=aids))
with(d, histboxp(x=cr(nadir.cd4), group=aids))
with(d, histboxp(x=pre.rna,       group=aids))
with(d, histboxp(x=cr(pre.rna),   group=aids))
```

# Fit the PO Model

Let's use AIC to choose between 4 knots for continuous variables, and 5 knots.
```{r}
dd <- datadist(d); options(datadist='dd')
f <- orm(post.cd4 ~ site + male + rcs(age, 4) + class + route +
         rcs(cr(nadir.cd4), 4) + rcs(year, 4) + rcs(cr(pre.rna), 4) +
         aids, data=d, x=TRUE, y=TRUE)
g <- orm(post.cd4 ~ site + male + rcs(age, 5) + class + route +
         rcs(cr(nadir.cd4), 5) + rcs(year, 5) + rcs(cr(pre.rna), 5) +
         aids, data=d)
AIC(f)   # better for 4-knot model
AIC(g)
```

```{r results='asis'}
f
```

# Assess Strength of Associations

```{r results='asis'}
a <- anova(f, test='LR')
a
plot(a)
plot(a, 'proportion chisq')
# Run the bootstrap so we can get uncertainties in rexVar
b   <- bootcov(f, B=300)
rex <- rexVar(b, d)
plot(rex)
```

# Partial Effect Shapes and Strengths

```{r}
ggplot(Predict(f))
```

Let's convert the $y$-axis to median CD4.  In doing so, the adjustment variable settings matter greatly.

```{r}
quan <- Quantile(f)
med  <- function(lp) quan(0.5, lp)
ggplot(Predict(f, fun=med), ylab='Median CD4')
```

# Compare With a Model With Linear Effects

```{r}
h <- orm(post.cd4 ~ site + male + age + class + route +
         cr(nadir.cd4) + year + cr(pre.rna) +
         aids, data=d)
lrtest(f, h)
AIC(f)
AIC(h)
```

According to AIC (backed up by likelihood ratio $\chi^2$ test) the linear model (though with $\sqrt[3]{}$ pre-transformations for two variables) is more likely to predict future observations better.  That's because there is more overfitting, with the possibly added predictive value not being worth the extra d.f.

In the linear model see if there is evidence for anything other than `site` being associated with CD4 count.

```{r}
fs <- orm(post.cd4 ~ site, data=d)
lrtest(h, fs)
```

Do the corresponding chunk test with a more approximate method (Wald test):

```{r results='asis'}
anova(h, male, age, class, route, nadir.cd4, year, pre.rna, aids)
```

# Point Effect Estimates from Simpler Model

```{r}
plot(summary(h), log=TRUE)
```

# Computing Environment

```{r echo=FALSE,results='asis'}
markupSpecs$html$session()
```