modelselection_tutorial/mesquite.Rmd at master · guhjy/modelselection_tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
---
title: "Collinear demo with mesquite"
output:
  html_document: default
  html_notebook: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=FALSE, message=FALSE, error=FALSE, warning=FALSE, comment=NA, out.width='95%')
```

Demonstration of collinearity in multipredictor regression.
Example of predicting the yields of mesquite bushes comes from [Gelman and Hill (2007)](http://www.stat.columbia.edu/~gelman/arm/).
The outcome variable is the total weight (in grams) of photosynthetic material as derived
from actual harvesting of the bush. The predictor variables are:

- diam1: diameter of the canopy (the leafy area of the bush)
in meters, measured along the longer axis of the bush
- diam2: canopy diameter measured along the shorter axis
- canopy height: height of the canopy
- total height: total height of the bush
- density: plant unit density (# of primary stems per plant unit)
- group: group of measurements (0 for the first group, 1 for the second group)

Load libraries.
```{r}
library(rstanarm)
library(arm)
options(mc.cores = parallel::detectCores())
library(loo)
library(bayesplot)
library(ggplot2)
library(GGally)
library(projpred)
```

Mesquite example from "Regression and other stories"
```{r warning=FALSE,message=FALSE}
dat<-read.table("mesquite.dat",header=T)
summary(dat)
```

Plot data
```{r}
ggpairs(dat,diag=list(continuous="barDiag"))
```

Additional transformed variables
```{r}
dat$CanVol <- dat$Diam1 * dat$Diam2 * dat$CanHt
dat$CanAre <- dat$Diam1 * dat$Diam2
dat$CanSha <- dat$Diam1 / dat$Diam2
```

It may be reasonable to fit on the logarithmic scale, so that effects are multiplicative rather than additive (we'll return to checking this assumption in another notebook).

We first illustrate the problem with maxiumum likelihood estimate
```{r}
lm1 <- lm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) + log(TotHt) + log(Dens) + Group, data = dat)
display(lm1)
```

GroupMCD seems to be only variable which has coeffiecent far away from zero. Let's try making a model with just the group variable.
```{r}
lm2 <- lm(formula = log(LeafWt) ~ Group, data = dat)
display(lm2)
```
Hmmm.... R-squared dropped a lot, so it seems that other variables are useful even if estimated effects and tehir standard errors indicate that they are not relevant. There are approach for maximum likelihood estimated models to investigate this, but we'll switch now to Bayesian inference using rstanarm.

The corresponding rstanarm model fit
```{r}
fitg <- stan_glm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) + log(TotHt) + log(Dens) + Group, data = dat, refresh=0)
```

Print summary for some diagnostics.
```{r}
summary(fitg)
```

Rhats and n_effs are good, but QR transformation usually makes sampling work even better
```{r}
fitg <- stan_glm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) + log(TotHt) + log(Dens) + Group, data = dat, QR=TRUE, refresh=0)
```

Print summary for some diagnostics.
```{r}
summary(fitg)
```
Use of QR decomposition improved sampling efficiency (actually we get superefficient sampling, ie better than independent sampling) and we continue with this model.

Instead of looking at the tables, it's easier to look at plots
```{r}
mcmc_areas(as.matrix(fitg),prob=.5, prob_outer = .95)
```

All 95% posterior intervals except for GroupMCD are overlapping 0 and it seems we have serious collinearity problem.

Looking at the pairwise posteriors we can see high correlations especially between log(CanVol) and log(CanAre).
```{r}
mcmc_pairs(as.matrix(fitg),pars = c("log(CanVol)","log(CanAre)","log(CanSha)","log(TotHt)","log(Dens)"))
```

If look more carefully on of the subplots, we see that although marginal posterior intervals overlap 0, some pairwise joint posteriors are not overlapping 0. Let's look more carefully the joint posterior of log(CanVol) and log(CanAre).
```{r}
mcmc_scatter(as.matrix(fitg), pars = c("log(CanVol)","log(CanAre)"))+geom_vline(xintercept=0)+geom_hline(yintercept=0)
```

From the joint posterior scatter plot, we can see that 0 is far away fron the typical set.

In case of even more variables with some being relevant and some irrelevant, it will be difficult to analyse joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model,
```{r}
fitg0 <- update(fitg, formula = log(LeafWt) ~ 1, QR=FALSE)
```

We compute leave-one-out cross-validation elpd's using PSIS-LOO.
```{r}
(loog <- loo(fitg))
(loog0 <- loo(fitg0))
```
The model with variables has one bad k value. We can fix that by computing the corresponding leave-one-out-posterior exactly.
```{r}
(loog <- loo(fitg, k_threshold=0.7))
```

And then we can compare the models.
```{r}
compare_models(loog0,loog)
```
Based on cross-validation covariates together contain significant information to improve predictions.

We might want to choose some variables 1) because we don't want to observe all the variables in the future (e.g. due to the measurement cost), or 2) we want to most relevant variables which we define here as a minimal set of variables which can provide similar predictions to the full model.

Tyre used AIC_c to estimate the model performance. In Bayesian setting we could use Bayesian alternatives Bayesian cross-validation or WAIC, but we don't recommend them for variable selection as discussed in http://link.springer.com/article/10.1007/s11222-016-9649-y . The reason for not using Bayesian CV or WAIC is that the selection process uses the data twice, and in case of large number variable combinations the selection process overfits and can produce really bad models. Using the usual posterior inference given the selected variables ignores that the selected variables are conditonal on the selection process and simply setting some variables to 0 ignores the uncertainty related to their relevance.

The paper http://link.springer.com/article/10.1007/s11222-016-9649-y also shows that a projection predictive approach can be used to make a model reduction, that is, choosing a smaller model with some coefficients set to 0. The projection predictive approach solves the problem how to do inference after the selection. The solution is to project the full model posterior to the restricted subspace. See more in http://link.springer.com/article/10.1007/s11222-016-9649-y

We make the projective predictive variable selection using the previous full model. A fast leave-one-out cross-validation approach http://link.springer.com/article/10.1007/s11222-016-9696-4 is used to choose the model size.
```{r, results='hide'}
fitg_cv <- cv_varsel(fitg, method='forward', cv_method='LOO', nloo = nrow(dat))
```

```{r}
fitg_cv$varsel$vind
```

We can now look at the estimated predictive performance of smaller models compared to the full model.
```{r}
varsel_plot(fitg_cv, stats = c('elpd', 'rmse'), deltas=T)
```

And we get a loo-cv based recommendation for the model size to choose
```{r}
(nv <- suggest_size(fitg_cv,alpha=0.1))
```
We see that 3 variables is enough to get the same as with all 7 variables.

Next we form the projected posterior for the chosen model.
```{r}
projg <- project(fitg_cv, nv = nv, ns = 4000)
round(colMeans(as.matrix(projg)),1)
round(posterior_interval(as.matrix(projg)),1)
```

```{r}
mcmc_areas(as.matrix(projg),
           pars = c('(Intercept)', names(fitg_cv$varsel$vind[1:nv])), prob_outer=0.99)
```


<br />


### Appendix: Session information

```{r}
sessionInfo()
```

<br />


### Appendix: Licenses

* Code &copy; 2018, Aki Vehtari, licensed under BSD-3.
* Text &copy; 2018, Aki Vehtari, licensed under CC-BY-NC 4.0.