Checking Assumptions

Recall the 6 assumptions, specified by ROS, for regression models

Validity
Representativeness
Additivity and linearity
Independence of errors
Equal variance of errors
Normality of errors

Similar to HW 7, we will focus on assumptions 3, 4, 5, and 6.

Additivity and linearity

Data Simulation

Consider data simulated from the model:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon,$$ where $\epsilon \sim N(0,\sigma^2)$

set.seed(10102022)
n <- 50
x <- runif(n, 0, 10)
sigma <- 3
beta <- c(1,1,.75)

y <- rnorm(n, beta[1] + beta[2] * x + beta[3] * x^2,  sigma)

Data Visualization

If we visualize the data as a function of x, the response is clearly non-linear.

d1 <- tibble(x = x, y = y, x_sq = x^2)
d1 %>% ggplot(aes(y=y, x=x)) + 
  geom_smooth(formula = 'y~x', method ='loess') + 
  geom_smooth(formula = 'y~x', method = 'lm', color = 'red') + 
  geom_point() +
  theme_bw() +
  ggtitle('Non-linear synthetic data') +
  labs(caption = "Red is best linear fit, blue is loess curve ")

Data Modeling and Assessment

We’ll start with fitting a regression model corresponding to:

$$y = \beta_0 + \beta_1 x + \epsilon,$$ where $\epsilon \sim N(0,\sigma^2)$

lm1 <- lm(y~x, data = d1)
summary(lm1)

## 
## Call:
## lm(formula = y ~ x, data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.155  -5.032  -1.640   3.381  16.070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.5763     2.0333  -6.185 1.31e-07 ***
## x             8.4170     0.3561  23.637  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.568 on 48 degrees of freedom
## Multiple R-squared:  0.9209, Adjusted R-squared:  0.9192 
## F-statistic: 558.7 on 1 and 48 DF,  p-value: < 2.2e-16

The ggResidpanel has nice functionality for residual checks and model assessment.

#library(devtools)
#devtools::install_github("goodekat/ggResidpanel")
library(ggResidpanel)
resid_panel(lm1, plots = 'all', smoother = T, qqbands = T)

# resid_interact(lm1, plots = c("resid", "qq")) # note won't compile in non HTML formats
resid_xpanel(lm1)

Data Modeling and Assessment: Part 2

Now we’ll update our model, to the true model, such that:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon,$$ where $\epsilon \sim N(0,\sigma^2)$

lm2 <- lm(y~x + x_sq, data = d1)
summary(lm2)

## 
## Call:
## lm(formula = y ~ x + x_sq, data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8799 -2.3058 -0.5755  2.2237  8.7177 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.99667    1.57985   1.264    0.213    
## x            0.61073    0.67194   0.909    0.368    
## x_sq         0.76917    0.06384  12.048 5.63e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.283 on 47 degrees of freedom
## Multiple R-squared:  0.9806, Adjusted R-squared:  0.9798 
## F-statistic:  1191 on 2 and 47 DF,  p-value: < 2.2e-16

We see that the parameter coefficients are close to our true values.

resid_panel(lm2, plots = 'all', smoother = T, qqbands = T)

resid_xpanel(lm2)

Similarly, the residual plots are better behaved without a clear pattern.

resid_compare(models = list(lm1, 
                            lm2),
              plots = c("resid"),
              smoother = TRUE)

Next Steps:

Work through other scenario(s) where 1 or more of the other assumptions are violated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Checking Assumptions

Additivity and linearity

Data Simulation

Data Visualization

Data Modeling and Assessment

Data Modeling and Assessment: Part 2

Next Steps:

FilesExpand file tree

Lab_Assumption.md

Latest commit

History

Lab_Assumption.md

File metadata and controls

Checking Assumptions

Additivity and linearity

Data Simulation

Data Visualization

Data Modeling and Assessment

Data Modeling and Assessment: Part 2

Next Steps: