---
title: "Checking Assumptions"
author: ""
date: ""
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
library(tidyverse)
```
Recall the six assumptions for regression models, as specified by ROS:

1. Validity
2. Representativeness
3. Additivity and linearity
4. Independence of errors
5. Equal variance of errors
6. Normality of errors

As in HW 7, we will focus on assumptions 3, 4, 5, and 6.
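As a quick baseline that needs no extra packages, base R's `plot()` method for fitted `lm` objects produces four standard diagnostic plots bearing on assumptions 3, 5, and 6. The `mtcars` data set below is just an illustrative stand-in, not part of this lab:

```{r}
# Minimal baseline check: plot.lm draws residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage plots in a 2x2 grid.
fit <- lm(mpg ~ wt, data = mtcars)  # mtcars is a stand-in example data set
par(mfrow = c(2, 2))
plot(fit)
```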
### Additivity and linearity
#### Data Simulation
Consider data simulated from the model:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2)$.
```{r}
set.seed(10102022)
n <- 50
x <- runif(n, 0, 10)
sigma <- 3
beta <- c(1, 1, 0.75)
y <- rnorm(n, beta[1] + beta[2] * x + beta[3] * x^2, sigma)
```
#### Data Visualization
If we visualize the response as a function of $x$, the relationship is clearly non-linear.
```{r}
d1 <- tibble(x = x, y = y, x_sq = x^2)
d1 %>% ggplot(aes(y = y, x = x)) +
  geom_smooth(formula = 'y ~ x', method = 'loess') +
  geom_smooth(formula = 'y ~ x', method = 'lm', color = 'red') +
  geom_point() +
  theme_bw() +
  ggtitle('Non-linear synthetic data') +
  labs(caption = "Red is the best linear fit; blue is a loess curve")
```
#### Data Modeling and Assessment
We'll start by fitting a regression model corresponding to:
$$y = \beta_0 + \beta_1 x + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2)$.
```{r}
lm1 <- lm(y ~ x, data = d1)
summary(lm1)
```
The `ggResidpanel` package has nice functionality for residual checks and model assessment.
```{r}
# ggResidpanel can be installed from GitHub:
# devtools::install_github("goodekat/ggResidpanel")
library(ggResidpanel)
resid_panel(lm1, plots = 'all', smoother = TRUE, qqbands = TRUE)
# resid_interact(lm1, plots = c("resid", "qq")) # interactive version; won't compile in non-HTML formats
resid_xpanel(lm1)
```
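Because `lm1` omits the quadratic term, its residuals should trace a clear U-shape against $x$. A minimal sketch of that check using only base graphics (an addition for illustration, not part of the original lab):

```{r}
# Residuals of the misspecified model plotted against x reveal the
# curvature that the fitted straight line could not absorb.
plot(x, resid(lm1), ylab = 'residuals', main = 'Residuals of lm1 vs. x')
abline(h = 0, lty = 2)
```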
#### Data Modeling and Assessment: Part 2
Now we'll update our model to match the true data-generating model:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2)$.
```{r}
lm2 <- lm(y ~ x + x_sq, data = d1)
summary(lm2)
```
We see that the estimated coefficients are close to the true values $\beta = (1, 1, 0.75)$.
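To make this concrete, the fitted coefficients can be placed next to the true simulation values (a small convenience check added here, not part of the original lab):

```{r}
# Side-by-side view of the true simulation parameters and the estimates
cbind(truth = beta, estimate = coef(lm2))
```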
```{r}
resid_panel(lm2, plots = 'all', smoother = TRUE, qqbands = TRUE)
resid_xpanel(lm2)
```
Similarly, the residual plots are better behaved, with no clear pattern remaining.
```{r, message = FALSE}
resid_compare(models = list(lm1, lm2),
              plots = c("resid"),
              smoother = TRUE)
```
### Next Steps
Work through other scenarios in which one or more of the remaining assumptions are violated.
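As a hedged starting point, here are sketches of two such scenarios: errors whose standard deviation grows with $x$ (violating assumption 5, equal variance) and AR(1)-correlated errors (violating assumption 4, independence). The seeds and parameter values below are arbitrary choices for illustration, not part of the lab; the fan shape and the slow drift, respectively, should also be visible in panels from `resid_panel()` as used above.

```{r}
# Scenario A: heteroscedastic errors -- the error sd grows with x,
# so residuals fan out as fitted values increase.
set.seed(10112022)  # arbitrary seed for this sketch
n <- 50
x_new <- runif(n, 0, 10)
y_het <- rnorm(n, 1 + x_new, sd = 0.5 + 0.5 * x_new)
lm_het <- lm(y_het ~ x_new)
plot(fitted(lm_het), resid(lm_het), main = 'Heteroscedastic errors')

# Scenario B: AR(1) errors -- successive errors are correlated, violating
# independence. Residuals drift in runs when viewed in observation order.
e_ar <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))
y_ar <- 1 + x_new + e_ar
lm_ar <- lm(y_ar ~ x_new)
plot(resid(lm_ar), type = 'b', main = 'AR(1) errors (observation order)')
```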