-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path11-modelling.Rmd
More file actions
997 lines (710 loc) · 52.6 KB
/
11-modelling.Rmd
File metadata and controls
997 lines (710 loc) · 52.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
# Statistical modelling with R {#modelling}
This chapter is an introduction to statistical modelling with R. There are a substantial number of dedicated R modelling packages available. In this chapter only the basics of statistical modelling with R are considered. However, a thorough understanding of these principles is an invaluable aid in mastering any of the available R packages on statistical modelling and related topics.
## Introduction
Some of the aims of statistical modelling are: assessing the relative importance of a set of variables in relation to another variable(s); predicting the value of a dependent variable from those of a set of predictor variables; defining a statistical relationship between the dependent and the independent variables.
Here the following generic functions are considered.
Table: (\#tab:ModelFuncs) Generic modelling functions in R.
| *<span style="color:#F7CE21">Function</span>* | *<span style="color:#F7CE21">Model</span>* | *<span style="color:#F7CE21">Package</span>* |
| ------ | --------------- | ------ |
| `lm()` | Linear models | `stats` |
| `aov()` | Analysis of variance | `stats` |
| `glm()` | Generalized linear models | `stats` |
| `gam()` | Generalized additive models| `gam`, `mgcv` |
| `loess()` | Local regression models | `stats` |
| `rpart()` | Tree-based models | `rpart` |
| `nls()` | Nonlinear models | `stats` |
## Data for statistical models
(a) Note the difference between *<span style="color:#FF9966">factors</span>* (categorical or classification variables), *<span style="color:#FF9966">regressors</span>* (continuous or *<span style="color:#FF9966">numeric</span>* variables) and frequency data.
(b) Why must the data be in the form of a dataframe when statistical modelling is carried out with R?
(c) How are the functions `factor()`, `is.factor()` and `is.ordered()` used to set or determine the class attributes of variables that must behave as factors in further analyses? Note especially the usage of argument `'ordered = '` of function `factor()`.
## Expressing a statistical model in R
(a) What role does the tilde-operator (`~`) play in statistical models in R?
(b) Study how the operators in Table \@ref(tab:ModelOperators) work. These operators have a different meaning when they appear in a statistical model.
Table: (\#tab:ModelOperators) Operators used in model formulae after the `~`.
| *<span style="color:#F7CE21">Operator</span>* | *<span style="color:#F7CE21">Algebraic meaning</span>* | *<span style="color:#F7CE21">Meaning in formula after `~`</span>* |
| ------ | --------------- | --------------- |
| `+` | Addition | Add a term |
| `-` | Subtraction | Remove or exclude a term |
| `*` | Multiplication | Main effects and interactions |
| `/` | Division | Main effect and nesting |
| `:` | Sequence | Interaction |
| `^` | Exponentiation | Limit depth of interactions |
| `%in%` | Matching | Nesting |
(c) In order to ensure that the operators in Table \@ref(tab:ModelOperators) have their usual meaning when appearing at the right-hand side of `~` in a statistical model use the `I()` function e.g. `I(a+b)`.
(d) Note how *<span style="color:#FF9966">regression terms</span>*, *<span style="color:#FF9966">main effects</span>*, *<span style="color:#FF9966">interaction effects</span>*, *<span style="color:#FF9966">higher-order interaction effect</span>*, *<span style="color:#FF9966">nested effects</span>* and *<span style="color:#FF9966">covariate terms</span>* are specified.
(e) The following summary illustrates the correspondence between the *<span style="color:#FF9966">model notation</span>* of R and the *<span style="color:#FF9966">algebraic notation</span>* of models for factors `a` and `b` with numeric (continuous) variable `x`:
| Model notation | Algebraic notation |
| --------------- | --------------- |
| `y~a+x` | $y = \mu + \alpha_a + \beta x + \epsilon$ |
| `y~a+b+a:b` | $y = \mu + \alpha_a + \beta_b + \gamma_{ab} + \epsilon$ |
| `y~a+a:x` | $y = \mu + \alpha_a + \beta_a x + \epsilon$ |
| `y~a*x` | $y = \mu + \alpha_a + \beta x + \gamma_a x + \epsilon$ |
| `y~-1+a/x` | $y = \alpha_a + \beta_a x + \epsilon$ |
(f) Every numeric variable on the right-hand side of `~` generates one coefficient to be estimated in a fitted model; each *<span style="color:#FF9966">level</span>* of a factor variable generates one coefficient to be estimated in a fitted model. What is an *<span style="color:#FF9966">estimable function</span>*?
(g) Note how transformation of variables can be achieved and how model formulae are stored as R objects of class *<span style="color:#FF9966">formula</span>*.
## Common arguments to R modelling functions
(a) The *<span style="color:#FF9966">formula</span>* argument. Study how this argument and the period operator work. If a dataframe is specified as an argument to the modelling function a period on the right-hand side of the `~` indicates that the additive effects of all the variables in the dataframe (except those used at the left-hand side of the `~`) are included in the model specification.
(b) The *<span style="color:#FF9966">data</span>* argument. Study how this argument works. Note that this argument will accept any R expression that evaluates to a dataframe. Important: if a dataframe is specified in the data argument (i) variables may be referred to by their names; (ii) the dataframe is searched first for a name before the global environment and the rest of the search path; (iii) any other existing object can also be used as a variable.
(c) The *<span style="color:#FF9966">subset</span>* argument. Note how statistical models can be fitted to subsets of data.
(d) The *<span style="color:#FF9966">weights</span>* argument. Study how this argument works especially when iterative procedures are used.
(e) The *<span style="color:#FF9966">na.action</span>* argument. This argument controls how missing values are handled.
(f) The *<span style="color:#FF9966">control</span>* argument. What is the purpose of this argument?
## Using the statistical modelling objects
*Note*: each of the modelling functions returns an object whose class attribute has been set to the name of the function e.g. the value of `lm()` is set to be of class `lm`. A set of generic functions are available to provide access to the information contained in them.
To see what is available for a particular modelling function use `methods (class = function)` e.g.
```{r, methodslm}
methods (class = lm)
```
Table: (\#tab:ModelObjectFuncs) Functions for Model Objects.
| *<span style="color:#F7CE21">Function</span>* | *<span style="color:#F7CE21">Description</span>* |
| ------ | --------------- |
| `add1()` | Add all possible single terms in a model |
| `anova()` | Return a sequential analysis of variance table or a comparison of two hierarchical models |
| `coef()` or `coefficients()` | Extract estimated coefficients |
| `deviance()` | Return the deviance |
| `df.residual()` | Return the residual degrees of freedom |
| `drop1()` | Drop all possible single terms in a model |
| `fitted()` or `fitted.values() ` | Extract fitted values |
| `formula()` | Extract the formula on which the model is based |
| `influence()` | Extract several types of influence measures |
| `plot()` | Construct relevant plots |
| `predict()` | Calculate predictions (means) optionally with standard errors |
| `print()` | Extract information about the analysis |
| `resid()` or `residuals()` | Extract residuals |
| `step()` | Stepwise selection of a model using a AIC criterion |
| `summary()` | Extract information about the analysis |
| `update()` | Modify and/or refit models |
*Note*: Do not confuse function `anova()` which operates on an `lm` object with function `aov()` for fitting an analysis of variance model.
*Recall*: Asterisked functions are non-visible. Such functions can be accessed using for example
```{r, asterisked, eval = FALSE}
getAnywhere(add1.lm)
```
(a) Note the use of `print()`, `summary()` and `plot()` to retrieve information about model objects.
(b) Study the functions that are used to retrieve information about model objects or to modify them as summed up in Table \@ref(tab:ModelObjectFuncs).
## Usage of the function `with()`
Study the help file of function `with()` and take note specifically of its usage when calling `lm()`.
## Linear regression and anova {#lmAnova}
(a) Consider the `Cars93` data set available in package `MASS`. Use the `lm()` function to perform a regression analysis of `MPG.city` as a function of a constant term, `Length`, `(Rev.per.mile)`$^{–1}$, `Weight`, `RPM` and `(Horsepower)`$^{–1}$. Note how the function `update()` works. Illustrate the use of the functions `drop1()` and `add1()`:
```{r, lmCars93, warning = FALSE, message = FALSE}
library (MASS)
lm.city <- lm (MPG.city ~ 1+ Length + I(1/Rev.per.mile)+ Weight + RPM +
I(1/Horsepower), data=Cars93)
summary (lm.city)
# update() is used here to restrict the model fitted in lm.city to cars with
# front wheel drives
lm.city.front <- update (lm.city, subset = DriveTrain=="Front")
summary (lm.city.front)
# drop1 (lm object, scope) shows the effect of dropping the Length term from
# the fitted model
drop1 (lm.city, . ~ Length)
# Shows the effect of dropping each term in turn from the fitted model
drop1 (lm.city)
lm.city.const <- lm (MPG.city ~ 1, data=Cars93)
anova (lm.city.const)
# add1 (lm object, scope) shows the effect of adding the terms in scope
# individually in turn to the model in lm object
add1 (lm.city.const, . ~ Length + I(1/Rev.per.mile) + Weight + RPM +
I(1/Horsepower))
```
(b) Is the object `a <- as.data.frame(matA)` identical with the object `a1 <- data.frame(matA)`?
::: {style="color: #80CC99;"}
(c) Once again, consider the `Cars93` data set: Consider cars manufactured by Buick, Chevrolet, Oldsmobile and Pontiac. Use the median of `Weight` to create two car weight groups.
* Construct an interaction plot to study the interaction between `Manufacturer` and `Weight` with respect to the highway fuel consumption. Interpret the graph.
* Use the function `aov()` to perform a two-way anova with highway fuel consumption as dependent variable and as independent (explanatory) variables weight-group and manufacturer.
* What do you conclude about the main effects and the interaction effects?
* Repeat the above using lm().
(d) Study the help file of the `whiteside` data set available in package `MASS`. Plot the gas consumption as a function of the temperature. Use different plotting characters and colours for the two levels of the factor variable `Insul`. Add the regression lines of gas consumption on the temperature for the two levels of `Insul` to the graph . What do you conclude? Now test for the parallelism of the two regression lines using `lm()` with terms for `Insul`, `Temp` and `Insul:Temp`. Discuss the results of the analysis.
:::
## Regression diagnostics
(a) Study the help file of `lm.influence()` for information on its return values.
(b) Standardized residuals as well as jackknife residuals where $stdres = \frac{\hat{e}_i}{s\sqrt{1-h_{ii}}}$ and $studres = \frac{y_i-\hat{y}_{(i)}}{\sqrt{var(y_i-\hat{y}_{(i)})}}$ can be obtained with the help of the following functions: `stdres()` and `studres()` available in package `MASS`. Study the help files of these functions.
(c) Study how `predict()` and `fitted()` work.
(d) Stepwise model selection: study the use of `step()`.
(e) What is returned by the function `dfbetas()`?
(f) Discuss the output of
```{r, steplmCars93}
step (lm (MPG.highway ~ Length + Width + Weight + EngineSize + RPM +
Rev.per.mile, data = Cars93))
```
(g) Calculate and interpret the `dfbetas` associated with the object `lm.city.front` created in Section \@ref(lmAnova).
(h) Continuing (f): obtain a graph with the standardized residuals on the y-axis and the fitted y-values on the x-axis and interpret the graph.
(i) Continuing (f): obtain a histogram of the studentized residuals. Interpret.
(j) Consider the object `lm.city` created in section \@ref(lmAnova). Verify the assumption of normally distributed residuals with the function `qqnorm()`. See section \@ref(qqplot).
## Non-parametric regression
(a) Inspect the help file of the `stackloss` data set.
(b) Execute the following R code:
```{r, stackdata}
stack.data <- data.frame (stackloss)
names (stack.data) <- c("AirFlow", "WaterTemp", "AcidConc", "Loss")
```
(c) Describe briefly what is understood by a Local regression model using the help file of the function `loess()`.
(d) Execute the following R code:
```{r, stackloess}
stack.loess <- loess(Loss ~ AirFlow * WaterTemp, data = stack.data)
summary(stack.loess)
names(stack.loess)
stack.loess$x
coplot (stack.loess$y ~ stack.loess$x[,"AirFlow"] | stack.loess$x[,"WaterTemp"],
panel = points)
coplot (stack.loess$y ~ stack.loess$x[,"WaterTemp"] | stack.loess$x[,"AirFlow"],
panel=points)
```
Comment on the results.
(e) Alternatively, splines can be used to fit a nonparametric regression model. The package `splines` provides for fitting B-splines with the `bs()` function and natural cubic splines with the `ns()` function.
(f) Study the help files of the `bs()` and `ns()` functions.
(g) Inspect the help file of the `ethanol` data set available in package `lattice`.
(i) Plot `NOx` versus `E`.
(ii) Use `lm()` and fit a straight line to the data. Add this line to the plot made above. Why is it necessary to order the observations?
```{r, ethanol, eval = FALSE}
y <- ethanol$NOx [order (ethanol$E)]
x <- ethanol$E [order (ethanol$E)]
yhat <- fitted (lm (y ~ x))
lines (x, yhat, col = "blue")
```
<div style="margin-left: 25px; margin-right: 20px;">
(iii) Use `lm()` and fit a third degree polynomial to the data. Also add this line to the plot in (i).
</div>
```{r, ethanolPolynomial, eval = FALSE}
yhat <- fitted (lm (y ~ x + I(x^2) + I(x^3)))
lines (x, yhat, col = "magenta")
```
<div style="margin-left: 25px; margin-right: 20px;">
(iv) Use `loess()` with a default `span = 0.75` to fit a local regression function to the data and add this line to the plot.
</div>
```{r, ethanol_loess, eval = FALSE}
yhat <- fitted (loess (y ~ x))
lines (x, yhat, col = "green")
```
<div style="margin-left: 25px; margin-right: 20px;">
(v) Use `bs()` to fit a B-spline with default `degree = 3` and `knots = NULL`. Notice that a third degree polynomial is fitted, similar to (iii).
</div>
```{r, ethanol_bs, eval = FALSE}
yhat <- fitted (lm (y ~ bs(x, degree=3)))
lines (x, yhat, col = "gold")
```
<div style="margin-left: 25px; margin-right: 20px;">
(vi) Use `bs()` to fit a cubic B-spline with 7 degrees of freedom.
</div>
```{r, ethanol_bs_df, eval = FALSE}
yhat <- fitted (lm (y ~ bs(x, df=7)))
lines (x, yhat, col="red")
```
<div style="margin-left: 25px; margin-right: 20px;">
(vii) Compare the number of knots for the two fits in (v) and (vi).
</div>
```{r, ethanol_knots, eval = FALSE}
attr (bs(x), "knots")
attr(bs(x), "Boundary.knots")
attr(bs(x, df=7), "knots")
attr(bs(x, df=7), "Boundary.knots")
```
<div style="margin-left: 25px; margin-right: 20px;">
(viii) Experiment with different values of the span for the `loess()` function and the `df` argument of the `bs()` function.
</div>
(h) When producing a coplot (refer to section \@ref(coplot)) it is helpful to draw lines or curves in each of the dependency panels, to make the relationship easier to appreciate. Use the panel argument of `coplot()` to draw in each panel
* a straight line representing the linear regression between the variables;
* a curve representing the polynomial regression;
* a spline using `smooth.spline()`.
<div style="margin-left: 25px; margin-right: 20px;">
*Hint*: Graphical arguments are passed to `coplot()` by writing a suitable panel function as ‘value’ for argument panel of `coplot()`. Study the examples of panel functions below:
</div>
```{r, panelFunctions, eval = FALSE}
panel.straight.line <- function(x, y, col = par("col"), bg = NA,
pch = par("pch"), cex = 1,
col.smooth = "red")
{ datmat <- cbind (x, y)
datmat <- datmat [order (datmat[, 1]), ]
points (datmat[, 1], datmat[, 2], pch = pch, col = col, bg = bg,
cex = cex)
lines (datmat[, 1], fitted (lm (datmat[, 2] ~ datmat[, 1])),
col = col.smooth)
}
panel.poly <- function(x, y, col = par("col"), bg = NA,
pch = par("pch"), cex = 1,
col.smooth = "red")
{ datmat <- cbind(x, y)
datmat <- datmat [order (datmat[, 1]), ]
points (datmat[, 1], datmat[, 2], pch = pch, col = col,
bg = bg, cex = cex)
lines (datmat[, 1], fitted (lm (datmat[, 2] ~
poly (datmat[, 1], 2))),
col = col.smooth)
}
panel.smooth <- function (x, y, col = par("col"), bg = NA,
pch = par("pch"), cex = 1,
col.smooth = "red", span = 2/3,
iter = 3, ...)
{ points(x, y, pch = pch, col = col, bg = bg, cex = cex)
ok <- is.finite (x) & is.finite (y)
if (any (ok))
lines (stats::lowess(x[ok], y[ok], f = span, iter = iter),
col = col.smooth, ...)
}
```
<div style="margin-left: 25px; margin-right: 20px;">
An example
</div>
```{r, statex77_coplot, eval = FALSE}
coplot (Illiteracy ~ Life.Exp | Income * Area, data = state,
given = list (co.intervals (state$Income, 4),
co.intervals (state$Area, 4)),
panel = panel.straight.line)
coplot (Illiteracy ~ Life.Exp | Income * Area, data = state,
given = list (co.intervals (state$Income, 4),
co.intervals (state$Area, 4)),
panel = panel.poly)
coplot (Illiteracy ~ Life.Exp | Income * Area, data = state,
given = list (co.intervals (state$Income, 4),
co.intervals (state$Area, 4)),
panel = panel.smooth)
```
## The function glm()
Consider the following dataframe `treatments`.
```{r, TreatmentData, echo = FALSE, warning = FALSE, message = FALSE}
library (flextable)
set_flextable_defaults(fonts_ignore = TRUE)
data <- data.frame (TreatA = rep (1:3, each = 3),
TreatB = rep (1:3, 3),
Counts = c(18, 17, 15, 20, 10, 20, 25, 13, 12))
flextable (data) |>
theme_alafoli() |>
hline_bottom(border = officer::fp_border(color = "#333333", width = 0.75))
```
Why is a two-way anova not appropriate here?
GLM extends the linear model in two ways: (a) the expected value is replaced by a *<span style="color:#FF9966">function</span>* of the expected value, called the link function and (b) the assumption of observations from a Gaussian distribution is extended to include other distributions from the exponential family of probability distributions.
Use `glm()` with `family = poisson()` to fit the following generalized linear models to
the data: `Counts ~ TreatA + TreatB` and `Counts ~ TreatA * TreatB`.
What are your conclusions?
## The function gam()
(a) Describe briefly what is understood by a *<span style="color:#FF9966">generalized additive model</span>*.
```{r, gam, warning = FALSE, message = FALSE}
library (gam)
stack.gam <- gam::gam (Loss ~ s(AirFlow) + s(WaterTemp) + s(AcidConc),
control = gam.control (bf.maxit = 50), data = stack.data)
summary(stack.gam)
par(mfrow=c(3,1))
plot(stack.gam)
```
<div style="margin-left: 25px; margin-right: 20px;">
Comment on the results.
</div>
(b) Return to the `ethanol` data set available in package `lattice`.
(i) Use `gam()` in package `mgcv` and fit a B-spline to the data. The following code can be used for that:
```{r, mgcv_gam, warning = FALSE, message = FALSE}
detach ('package:gam')
library (mgcv)
library (lattice)
library (splines)
out <- mgcv::gam (NOx ~ bs(E, df = 7), data = ethanol)
# str(out) # inspect output components
# data must be sorted according to E values
data.temp <- data.frame (ethanol [,c(1,3)], fitted = fitted(out))
data.ordered <- with (data.temp, data.temp [order(E),])
with (data.ordered, plot (x = E, y = NOx, pch = 15, col = 'blue'))
with (data.ordered, lines (x = E, y = fitted, col = 'red', lwd = 2))
```
<div style="margin-left: 45px; margin-right: 20px;">
The domain of the predictor is divided by knots and a B-spline generates a set of piecewise polynomials restricted to be continuous at the knots. Execute the following command:
</div>
```{r, knots, fig.cap="Cubic B-spline basis functions with 4 interior knots.", out.width="100%"}
plot (range (ethanol$E), c(0,1), type = "n")
apply (bs (ethanol$E, df = 7), 2, function(y, x)
lines (x[order(x)], y[order(x)]), x = ethanol$E)
knot.vec <- attr (bs (ethanol$E [order (ethanol$E)], df = 7), "knots")
for (i in knot.vec) lines (rep(i,2), c(0,1), lty = 2)
```
<div style="margin-left: 45px; margin-right: 20px;">
A set of $7$ basis functions were generated, each a degree $3$ polynomial in an interval with
$7 - 3 = 4$ interior knots. Experiment with specifying either a different degree for the polynomials, or different numbers of knots. The specific knots can also be specified, e.g. `bs (ethanol$E, knots = c (0.7, 0.9, 1.1))`.
</div>
<div style="margin-left: 25px; margin-right: 20px;">
(ii) It has been shown that B-splines can exhibit erratic behavior beyond the boundary knots. Repeat the fits and plots above, replacing the B-splines with natural cubic splines with the function `ns()`. Natural cubic splines constrain the basis functions to be linear beyond the boundary knots. As a result, a natural cubic spline with $K$ knots is represented by $K$ basis functions.
(iii) Selecting the position and number of knots can be subjective. Smoothing splines solves the knot selection problem by making each observation a knot and controlling the overfitting with a single penalty parameter. Repeat the fits and plots in (i), replacing the B-splines with smoothing splines with the function `s()`, similar to the fit in (a).
</div>
## The function rpart()
(a) Describe briefly what is understood by *<span style="color:#FF9966">Regression and Classification Trees</span>* using the help file of the function `rpart()` in package `rpart`.
(b) Study the help file of the `kyphosis` data set available in package `MASS`.
(c) Execute the following R code:
```{r, rpart, warning = FALSE, message = FALSE}
library (rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
plot(fit)
text(fit, use.n = TRUE, xpd = NA)
```
Comment on the results.
(d) The package `partykit()` provides a toolkit for recursive partitioning. A more informative tree diagram compared to (c) can be obtained with the code:
```{r, partykit, warning = FALSE, message = FALSE}
library (partykit)
fit2 <- as.party(fit)
plot(fit2)
```
## Nonlinear regression and the function nls()
The asymptotic regression model has the form:
$$
y = \alpha + \beta e^{\gamma x}.
$$
When $\alpha > 0$, $\beta < 0$ and $\gamma < 0$ the above model becomes Mistcherlich’s model of the law of diminishing returns. According to the law of diminishing returns the yield, $y$, initially increases quickly with increasing values of $x$, but then the yield slows down and finally levels off just below the value of $\alpha$. The parameters $\alpha > 0$, $\beta < 0$ and $\gamma < 0$ have the following meaning:
<div style="margin-left: 25px; margin-right: 20px;">
* $\alpha$ represents the upper asymptote for yield
* $\beta$ is the difference between the value of $y$ when $x = 0$ and the upper asymptote
</div>
The asymptotic regression model may be reparameterized such that $\beta$ is replaced by $\delta - \alpha$ where $\delta$ represents the yield when $x = 0$. A special case of the asymptotic regression model occurs when $\delta = 0$:
$$
y = \alpha - \alpha e^{\gamma x} = \alpha (1 - e^{\gamma x}).
$$
The asymptotic regression model represents a nonlinear regression model and in general no closed form solution exists for estimating its parameters. The R function `nls()` can be used to estimate the parameters of the model. When function `nls()` is employed starting values for the iterative estimating procedure are necessary.
### Choosing starting values
(1) A starting value for \alpha can be found by looking at a scatterplot of $y$ as a function of $x$ and then chooses as starting value just larger than the largest value of $y$.
(2) \beta is the difference between the value of $y$ when $x = 0$ and the upper asymptote. A reasonable starting value is the minimum value of $y - \alpha$.
(3) \gamma can be roughly initially estimated by the negative of the slope between two "well separated" points on the plot.
(a) Study the help file of `nls()`. Notice carefully how `nls()` is called.
::: {style="color: #80CC99;"}
(b) The data below is the `Yield` of a product after a chemical treatment of different time duration. Three replicates were made at each treatment time.
```{r, ChemData, echo = FALSE, warning = FALSE, message = FALSE}
library (flextable)
set_flextable_defaults(fonts_ignore = TRUE)
data <- data.frame (Time = c(0, 3, 6, 12, 18, 24, 48, 72, 96, 168, 336, 504, 672, 840, 1008),
Rep1 = c(0, 0.0846, 0.1072, 0.1255, 0.1671, 0.1988, 0.2927, 0.3713,
0.4773, 0.6158, 0.7297, 0.8083, 0.7019, 0.6038, 0.7386),
Rep2 = c(0, 0.0556, 0.0604, 0.0590, 0.0799, 0.0958, 0.1739, 0.1910,
0.2669, 0.3584, 0.4339, 0.4816, 0.4497, 0.4851, 0.5332),
Rep3 = c(0, 0.0501, 0.0545, 0.0705, 0.0687, 0.0655, 0.1075, 0.1418,
0.1725, 0.3224, 0.3322, 0.3309, 0.3798, 0.3757, 0.3798))
flextable (data) |>
delete_part (part = "header") |>
add_header (Time = "Time (Hrs)", Rep1 = "Rep1", Rep2 = "Rep2", Rep3 = "Rep3") |>
theme_alafoli() |>
hline_bottom(border = officer::fp_border(color = "#333333", width = 0.75))
```
<div style="margin-left: 25px; margin-right: 20px;">
Plot the data.
</div>
(c) The above plot suggests a relationship of the form:
$$
y = \alpha - \alpha e^{\gamma x} = \alpha (1 - e^{\gamma x}).
$$
<div style="margin-left: 25px; margin-right: 20px;">
Explain the term: *<span style="color:#FF9966">nonlinear regression</span>*. Use `nls()` to estimate the parameters $\alpha$ and $\gamma$ in the above model. What effect does the changing of the start values have on the result of `nls()`? Experiment with different start values for the parameters. Add the estimated line to the plot in (b) and interpret the graph.
</div>
:::
## Detailed example: Analysis of Variance and Covariance
### One-way ANOVA models {#AnovaOneWay}
About 90 percent of the ozone in the atmosphere is contained in the stratosphere. Although only a small proportion of atmospheric ozone is found in the troposphere this ozone plays an important role in various climatic phenomena. Research has shown the highest amounts of ozone over the Arctic are found during the spring months of March and April, while the Antarctic has its lowest amounts of ozone during its autumn months of September and October, Atmospheric carbon dioxide not only plays an important role in global warming but its concentration is related to height above sea level and ocean temperature. In order to study relationships between ozone levels, CO_2 concentration, height above sea level, ocean and seasonality a small-scale simulation study has been undertaken. A very simple model was constructed for simulating the data called `climate.sim`. This simple simulation procedure was performed as described below. Here we show how the `climate.sim` data were subjected to various, ANOVA, regression and ANCOVA procedures.
```{r, climateData, warning = FALSE, message = FALSE}
library (tidyverse)
set.seed (7343)
ocean <- factor (c (rep ("N-Atlantic", 20), rep ("S-Atlantic", 20),
rep ("Indian", 20)))
season <- factor (rep (rep (c ("summer", "autumn", "winter", "spring"),
each = 5), 3))
height <- runif (60, 100, 750)
climate.sim <- tibble (ocean, season, height)
CO2 <- 350 + (as.numeric (ocean) - 2) + (as.numeric (season) - 2.5) * 1.8 +
(as.numeric (ocean) - 2) * (as.numeric (season) - 2.5) * 2 -
0.012 * (height - 390) + rnorm (60, 0, 2)
ozone <- 15 + (as.numeric (ocean) - 2) + 0.02 * (height - 390) +
rnorm (60, 0, 0.4)
climate.sim <- mutate (climate.sim, CO2, ozone)
```
The researcher would like to know whether the CO2 levels are the same in the three oceanic regions. Analysis of variance decomposes the total variance in the dependent variable into a between-groups variance and a within-groups variance (or error variance). The underlying model can be written as
$$
Y_{ij} = \mu + \alpha_i + \epsilon_{ij}
$$
where $Y_{ij}$ represents the value of the numeric dependent variable for observation $j$ coming from population $i$; $\mu$ is an effect common to all observations; $\alpha_i$ represents an effect found only in population $i$ and $\epsilon_{ij}$ is an effect (e.g. measurement error) that is specific to only observation ($ij$). It is assumed that the $\epsilon_{ij}$ is normally distributed with a mean of zero and a constant variance $\sigma^2$ (i.e. the variance is the same in each of the populations). The statistical models to be tested is
$$
H_0: \alpha_1 = \alpha_2 = \dots = \alpha_k
$$
where $k$ denotes the number of different populations, versus
$$
H_A: \text{At least one of the } \alpha \text{s differs from the rest}.
$$
In general $\mu, \alpha_1, \dots, \alpha_k$ cannot all be estimated uniquely. In order for a statistical hypothesis to be *<span style="color:#FF9966">testable</span>* it must be expressed in terms of *<span style="color:#FF9966">estimable functions</span>* of the parameters. A statistical hypothesis can often be expressed in different equivalent forms. The null hypothesis above, for example, is equivalent to the following hypotheses:
$$
H_0': \mu+\alpha_1 = \mu+\alpha_2 = \dots = \mu+\alpha_k
$$
$$
H_0'': \alpha_1 - \alpha_k = \alpha_2 - \alpha_k = \dots = \alpha_{k-1} - \alpha_k = 0
$$
The latter two hypotheses are both in terms of $k - 1$ linearly independent estimable functions and hence are testable. The default behaviour of R is to estimate $\mu$ (the intercept) using the data, take (estimate) one of the $\alpha$s as zero and then estimate the remainder of the $\alpha$s as deviations from the estimated intercept using the data. Notice, that in doing so the given null hypothesis above can still be tested.
(a) Use `str(climate.sim)` to inspect the structure of the R object `climate.sim`. Note that the factor season has 4 levels, with the first two levels “autumn” and “spring”. R automatically order the levels alphabetically. In this case, we want the seasons to follow in the order we observe them, especially when making plots, such as `boxplot (CO2 ~ season, data = climate.sim)`.
<div style="margin-left: 25px; margin-right: 20px;">
To change the order of the levels, the R function `relevel()` allows putting a specific level first. Sequentially calling this function, allows one to place the levels in any required order. As part of the tidyverse ecosystem, the package `forcats` is loaded and the following calls allows the user to change the order of the levels in a single call.
</div>
```{r, relevel}
climate.sim$season <- fct_relevel (climate.sim$season,
c("summer", "autumn", "winter", "spring"))
str(climate.sim)
```
(b) Use `with()` and `tapply()` to obtain the mean and the standard deviation of CO2 for the different seasons.
(c) Test if the effects of the four seasons on CO2 are the same by executing the R command `lm (CO2 ~ season, data = climate.sim)` and using the commands `summary()` and `anova()`.
(d) Use `coefficients()` and interpret the estimates referring to (b).
(e) Obtain the residuals and plot them against the fitted values. Use different colours for the different seasons. Interpret the plot.
### Two-way ANOVA models: main effects and interaction effects
Inspection of the `climate.sim` data by requesting `str(climate.sim)` reveals the presence of a second factor variable ocean.
We are interested in CO2 as a function of the two factor variables season and ocean. In addition to the *<span style="color:#FF9966">main effects</span>* of season and ocean, respectively, we also have to consider *<span style="color:#FF9966">interaction effects</span>*.
*<span style="color:#FF9966">Interaction can be defined as the result where the effect of one factor depends on the level of another factor.</span>*
In order to provide for a second factor together with interaction effects the two-way model is given by
$$
Y_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ij}
$$
In this model $\alpha$ represents the main effect of season and $\beta$ the main effect of ocean while the interaction effect is represented by $(\alpha\beta)_{ij}$. The first hypothesis to be tested is the null hypothesis that there exist no interaction effects or that the interaction effects are negligible. Symbolically this is the hypothesis
$$
H_0: (\alpha\beta)_{11} = (\alpha\beta)_{12} = \dots = (\alpha\beta)_{km}
$$
where $k$ denotes the number of levels in the first factor variable and $m$ the number of levels in the second factor variable. In order to test for interaction we must have $r > 1$ repetitions (replicates) of each $\alpha_i\beta_j$ combination. The value of $r$ in our example can be found using the following code:
```{r, replicates}
with(climate.sim, table(season, ocean))
```
From the table above it is clear that all treatment combinations occur exactly the same number of times. Had there been some missing values in the data, the correct solution to the problem is to estimate the missing values by the mean values of the available data in the affected cells. These values are then substituted in place of the missing values to form a complete data set. The ANOVA is then performed on the complete data set but the degrees of freedom for error is reduced with the number of missing values.
The presence of interaction can be investigated with an interaction plot using the following instruction
```{r, interactionSeasonOcean, fig.cap="Interaction plot of factors season and ocean associates with the dependent variable CO2", out.width="100%"}
with(climate.sim, interaction.plot (ocean, season, response = CO2))
```
What can be deduced from Figure \@ref(fig:interactionSeasonOcean)? Explain in detail.
The formal ANOVA can be performed with the instruction
```{r, two_wayANOVA}
two.way.anova <- lm(CO2 ~ season + ocean + season:ocean, data=climate.sim)
```
An analysis-of-variance table can be constructed with the instruction:
```{r, two_wayAnova_table}
anova(two.way.anova)
```
It is clear from the above table that the null hypothesis of no interaction is to be rejected. This means that it would not make sense to test the main effects of season and ocean. On the other hand, if no significant interaction would have been found then the F-values associated with the two main effects could have been investigated to conclude if none, one or both of the main effects are significant.
(a) Repeat the above analysis using in turn each of the other numeric variables as the response variable.
(b) Plot the residuals versus the fitted values and interpret the graphs. What can be concluded regarding the assumptions of the two-way ANOVA model?
#### Types of ANOVA tables
Suppose we want to fit an ANOVA model for the effect of `Origin` and `AirBags` on `Max.Price` in the `Cars93` data set in package `MASS`. Notice the difference in the two ANOVA tables below.
```{r, anovaOrder}
anova (lm (Max.Price ~ AirBags + Origin, data = Cars93))
anova (lm (Max.Price ~ Origin + AirBags, data = Cars93))
```
R computes sums of squares for each term, based on the assumption that all terms listed above that particular term in the ANOVA table are already included in the model. The effect is that the order in which terms are specified (specifically for unbalanced data) influences the results.
The R package `car` provides options to specify how sums of squares are computed:
* **Type I**: Sums of squares are computed based on the assumption that all terms specified before the particular term is included in the model.
* **Type II**: Sums of squares are computed according to the principle of marginality, testing each term after all others, except ignoring the term's higher-order interactions.
* **Type III**: Sums of squares are computed testing each term in the model after all the others are included in the model. The computation of these sums of squares violate marginality.
```{r, AnovaTypes, warning = FALSE, message = FALSE}
library (car)
anova (lm (Max.Price ~ AirBags * Origin, data = Cars93)) # type = "I"
Anova (lm (Max.Price ~ AirBags * Origin, data = Cars93), type = "II")
Anova (lm (Max.Price ~ AirBags * Origin, data = Cars93), type = "III")
```
If we focus on the type II tests, we notice that the sums of squares for interaction (432.8) corresponds to the sums of squares for the last term in the type I table. The sums of squares for `Origin` (283.5) corresponds to the sums of squares value when specifying `Max.Price ~ AirBags + Origin` and the sums of squares for `AirBags` (3585.6) corresponds to the sums of squares when specifying `Max.Price ~ Origin + AirBags`.
In the ANOVA table with type III tests, the interaction sums of squares agrees with the other two tables, but because the marginality principle is violated, the sums of squares for the main effects are different. Care should be taken when testing hypotheses with type III sums of squares.
### One-way ANCOVA models
A scatterplot of CO2 versus height suggests the possibility of a linear relationship between these two numeric variables. Check the above statement.
This means that the ANOVA model is not the correct model to use for the `climate.sim` data. Instead we should consider a one-way ANCOVA model with height as a *<span style="color:#FF9966">concomitant variable</span>*:
$$
Y_{ij} = \mu + \alpha_i + \beta_i x_{ij} + \epsilon_{ij}
$$
where $i = 1, 2, \dots, k$; $j = 1, 2, \dots, n_i$.
In this model $x_{ij}$ represents a continuous (or numeric) variable that is measured together with the dependent continuous variable $Y_{ij}$. If variable $x$ has a significant relationship with the dependent (response) variable the values of the latter should be adjusted before performing a one-way ANOVA. The researcher must first ascertain whether the regression coefficients (i.e. the slopes of the regression lines) of the different groups are the same. Therefore, <span style="color:#FF9966">Step I</span> of a covariance analysis investigates
$$
H_{01}: \beta_1 = \beta_2 = \dots = \beta_k
$$
and if this is not the case, then the analysis proceeds by (i) estimating the regression parameters $\beta_1, \beta_2, \dots, \beta_k$ for every group/treatment/level individually and (ii) by investigating the differences between the $y$-means of the $k$ groups at specific values of $x$. On the other hand, if Step I shows that it can be assumed that $\beta_1 = \beta_2 = \dots = \beta_k = (\beta)$ then the $\beta_i$ in the model is replaced by the common $\beta$ and <span style="color:#FF9966">Step II</span> is carried out. Step II then consists of determining whether there is a significant regression relationship, i.e. the null hypothesis
$$
H_{02}: \beta = 0
$$
is tested. If it can be assumed that $\beta = 0$, then an ordinary analysis of variance can be carried out. If $\beta \neq 0$, <span style="color:#FF9966">Step III</span> of ANCOVA is executed which first consists of adjusting the $y_{ij}$-observations for their regression relationship with the $\{x_{ij} \}$. An analysis of variance is then carried out on the adjusted $\{y_{ij} \}$ in order to ascertain whether the levels of the experimental factor have similar effects or not.
As an example we will again make use of the data of the `climate.sim` data set constructed above.
(a) Obtain a scatterplot of CO2 as a function of height. Use different colours for the three oceans. What can be deduced from the scatterplot?
(b) Carry out a one-way ANOVA of CO2 measured in the three different ocean locations.
(c) Perform Step I of a one-way ANCOVA using the commands:
```{r, CO2_ocean_height_I}
one.way.ancova.I <- lm (CO2 ~ ocean + height + ocean:height,
data = climate.sim)
Anova(one.way.ancova.I, type = "II")
coefficients(one.way.ancova.I)
```
<div style="margin-left: 25px; margin-right: 20px;">
*Interpretation of the ANOVA table*
First look at interaction effect in the anova table. It provides information on whether the hypothesis $\beta_1 = \beta_2 = \beta_3$ should be rejected. Since the conclusion is that all three β’s are the same (differences are negligent) then (and only then) inspect the line for the concomitant variable height to decide if the common slope $\beta = 0$. This is Step II. If it can be concluded that the common slope is zero then the ANOVA of section \@ref(AnovaOneWay) is performed.
In Step II it is concluded that the common slope is not equal to zero so we perform Step III by calling, for example,
</div>
```{r, CO2_ocean_height_III}
one.way.ancova.III <- lm (CO2 ~ ocean + height, data = climate.sim)
Anova (one.way.ancova.III, type = "II")
```
<div style="margin-left: 25px; margin-right: 20px;">
*Note that if you are not using type II sums of squares, but the default `anova()` function in the latter call the numeric concomitant variable MUST PRECEDE the categorical variable on the right-hand side of the tilde because Step III is only carried out on condition that Step II has indicated a significant relationship between the concomitant variable and the response variable.* The line corresponding to the factor variable ocean in the analysis of variance table is then inspected to conclude whether the effects associated with the respective levels of the factor variable are significantly different.
(i) In the example above we conclude that the three $\beta$’s in the ANCOVA model are the same and that the hypothesis $\beta = 0$ is rejected. In Step III we conclude that $\alpha_1 = \alpha_2 = \alpha_3$. Obtain the estimated regression equation using the following instructions:
</div>
```{r, CO2_ocean_height_final}
one.way.ancova.final <- lm (CO2 ~ height, data = climate.sim)
coefficients (one.way.ancova.final)
```
<div style="margin-left: 25px; margin-right: 20px;">
(ii) Add the regression line to the scatterplot constructed in (a). See Figure \@ref(fig:CO2oceanHeight).
</div>
```{r, CO2oceanHeight, echo = FALSE, fig.cap="Scatterplot of CO2 vs height with a single regression line.", out.width="100%"}
ggplot(data = climate.sim,
mapping = aes(x = height, y = CO2)) +
geom_point(mapping = aes(colour = ocean)) +
geom_function(fun = function(x) 352.7236 - 0.00856*x)
```
<div style="margin-left: 25px; margin-right: 20px;">
(iii) Inspection of Figure \@ref(fig:CO2oceanHeight) shows a negative relationship between CO2 and height but there is no real difference between the observations from the three different oceanic locations.
</div>
(d) Perform Step I of a one-way ANCOVA of CO2 for the different seasons with height as concomitant variable.
```{r, CO2_season_height_I}
one.way.ancova.I <- lm (CO2 ~ season + height + season:height,
data = climate.sim)
Anova(one.way.ancova.I, type = "II")
coefficients (one.way.ancova.I)
```
<div style="margin-left: 25px; margin-right: 20px;">
*Interpretation of the ANOVA table*
First look at the line for interaction. This line provides information on whether the hypothesis $\beta_1 = \beta_2 = \beta_3 = \beta_3$ should be rejected (as is the case here. Why?) Therefore we cannot proceed to Step II.
(i) In the example above we conclude that the three $\beta$’s in the ANCOVA model are not the same. Obtain the estimated regression equations for each of the seasons using the following instructions:
</div>
```{r, CO2_season_height_final}
one.way.ancova.sep.reglns <- lm (CO2 ~ -1 + season/height, data = climate.sim)
coefficients (one.way.ancova.sep.reglns)
```
<div style="margin-left: 25px; margin-right: 20px;">
(ii) Add the regression lines to the scatterplot of CO2 vs height. See Figure \@ref(fig:CO2seasonHeight).
</div>
```{r, CO2seasonHeight, echo = FALSE, message = FALSE, warning = FALSE, fig.cap="Scatterplot of CO2 vs height with a separate regression line for each of the seasons.", out.width="100%"}
ggplot(data = climate.sim,
mapping = aes(x = height, y = CO2)) +
geom_point(mapping = aes(colour = season)) +
geom_smooth(method="lm", se=F, aes(colour = season))
```
<div style="margin-left: 25px; margin-right: 20px;">
(iii) Inspection of Figure \@ref(fig:CO2seasonHeight) shows that the height has a much smaller effect in autumn and winter than in spring and summer.
It is clear from Figure \@ref(fig:CO2seasonHeight) that the outcome of a test of the hypothesis that $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$ depends on the value of height where the test is performed. At height = 600 the test that $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$ is performed as follows:
First notice that the Residual sum of squares in the ANOVA table above is $RSSE=427.59$ with $52$ degrees of freedom. At height = 600 execute:
</div>
```{r, CO2_season_height_600}
Anova (lm (CO2 ~ season:(I(height-600)) , data = climate.sim), type = "II")
```
<div style="margin-left: 25px; margin-right: 20px;">
Now calculate
$$
F_0 = \frac{\frac{522.39 - 427.59}{55-52}}{\frac{427.59}{52}} = 3.8429
$$
and the corresponding p-value using the instruction
</div>
```{r, F_pval}
1 - pf (3.8429, 3, 52)
```
<div style="margin-left: 25px; margin-right: 20px;">
At height = 400
</div>
```{r, CO2_season_height_400}
Anova (lm (CO2 ~ season:(I(height-400)) , data = climate.sim), type = "II")
```
<div style="margin-left: 25px; margin-right: 20px;">
Now
$$
F_0 = \frac{\frac{595.57 - 427.59}{55-52}}{\frac{427.59}{52}} = 6.80945
$$
and
</div>
```{r, F_pval2}
1 - pf (6.80945, 3, 52)
```
<div style="margin-left: 25px; margin-right: 20px;">
What conclusions can be drawn from the tests at heights of 600 and 400 respectively when using a 1% significance level? Conjecture on the conclusions following a multiple testing procedure after a significant anova result.
</div>
(e) Perform Step I of a one-way ANCOVA of ozone for the three oceanic locations with height as concomitant variable.
```{r, ozone_ocean_height_I}
one.way.ancova.I <- lm (ozone ~ ocean + height + ocean:height,
data = climate.sim)
Anova(one.way.ancova.I, type = "II")
coefficients(one.way.ancova.I)
```
<div style="margin-left: 25px; margin-right: 20px;">
*Interpretation of the ANOVA table*
First look at the line for interaction. We conclude that all three β’s are the same (differences are negligible) and then (and only then) inspect the line for continuous variable height to decide if the common slope β=0. This is Step II. Since it is concluded that the common slope is not zero we proceed to Step III.
</div>
```{r, ozone_ocean_height_III}
one.way.ancova.III <- lm (ozone ~ ocean + height, data = climate.sim)
Anova(one.way.ancova.III, type = "II")
coefficients(one.way.ancova.I)
```
<div style="margin-left: 25px; margin-right: 20px;">
The line corresponding to the factor variable ocean in the analysis of variance table is inspected to conclude if the effects associated with the respective levels of the factor variable are significantly different.
(i) In the example above we conclude that the three $\beta$’s in the ANCOVA model are the same and that the hypothesis β=0 is rejected. In Step III we conclude that the hypothesis $\alpha_1 = \alpha_2 = \alpha_3$ is also rejected. Obtain the estimated regression equations using the following instructions:
</div>
```{r, ozone_ocean_height_final}
one.way.ancova.reglns <- lm (ozone ~ ocean + height, data = climate.sim)
coefficients(one.way.ancova.reglns)
```
<div style="margin-left: 25px; margin-right: 20px;">
(ii) Obtain a scatterplot of ozone as a function of height. Use different colours for the three oceans. Add a separate regression line for each of the oceans (see Figure \@ref(fig:ozoneOceanHeight)).
(iii) Inspection of Figure \@ref(fig:ozoneOceanHeight) shows three parallel regression lines, since $\beta_1 = \beta_2 = \beta_3$, with different intercepts. The hypothesis that $\alpha_1 = \alpha_2 = \alpha_3$ is rejected irrespective of the height above sea level.
</div>
```{r, ozoneOceanHeight, echo = FALSE, fig.cap="Scatterplot of ozone vs height with a separate parallel regression lines for each of the different oceanic locations.", out.width="100%"}
ggplot(data = climate.sim,
mapping = aes(x = height, y = ozone)) +
geom_point (mapping = aes(colour = ocean)) +
geom_function (fun = function(x) 6.13127497 + 0.02021*x, colour="salmon") +
geom_function (fun = function(x) 6.13127497 + 0.99327 + 0.02021*x, colour="green") +
geom_function (fun = function(x) 6.13127497 + 2.0069 + 0.02021*x, colour="skyblue")
```
The steps of a one-way analysis of covariance is illustrated in the diagram in Figure \@ref(fig:ANCOVA).
```{r, ANCOVA, echo=FALSE, warning = FALSE, message = FALSE, fig.cap="Diagrammatic process flow for a one-way analysis of covariance.", out.width="100%"}
library(knitr)
knitr::include_graphics("pics/ANCOVA.jpg")
```
### Maize Exercise
::: {style="color: #80CC99;"}
A researcher investigates if three maize varieties give the same yield. The following data are obtained according to the requirements of a linear model:
```{r, MaizeData, echo = FALSE, warning = FALSE, message = FALSE}
library (flextable)
set_flextable_defaults(fonts_ignore = TRUE)
data <- data.frame (Yield1 = c(17.50, 45.00, 38.75, 33.75, 18.75, 21.25, NA, NA),
Rain1 = c(95.90, 116.20, 116.20, 93.10, 81.34, 115.36, NA, NA),
Yield2 = c(37.50, 42.15, 48.75, 32.50, 53.75, 38.75, 22.50, 17.50),
Rain2 = c(114.10, 119.28, 121.94, 97.02, 102.90, 91.70, 102.76, 78.54),
Yield3 = c(18.75, 13.75, 52.50, 17.50, 18.75, 52.50, 32.50, NA),
Rain3 = c(98.00, 117.60, 105.28, 94.50, 92.40, 110.74, 105.70, NA))
flextable (data) |>
delete_part (part = "header") |>
add_header (Yield1 = "Yield per hectare", Rain1 = "Rainfall",
Yield2 = "Yield per hectare", Rain2 = "Rainfall",
Yield3 = "Yield per hectare", Rain3 = "Rainfall") |>
add_header_row(values = c("KB1", "KB2", "KB3"), colwidths = c(2,2,2)) |>
add_header_row(values = "VARIETY", colwidths = 6) |>
theme_alafoli() |>
align (align = "center", part = "header") |>
hline_bottom(border = officer::fp_border(color = "#333333", width = 0.75))
```
(a) Plot the given data as a single scatterplot that distinguishes between the three varieties.
(b) What hypotheses and models are of importance when considering a covariance analysis of the given data?
(c) Perform a detailed one-way ANCOVA.
(d) Obtain estimates for the regression coefficients in the regression equations.
$$
Y_{ij} = \mu + \beta x_{ij} + \epsilon_{ij}, \hspace{1cm} i = 1, 2, \dots, k; j = 1, 2, \dots, n_i
$$
$$
Y_{ij} = \mu + \beta_i x_{ij} + \epsilon_{ij}, \hspace{1cm} i = 1, 2, \dots, k; j = 1, 2, \dots, n_i
$$
(e) Interpret the results of the analysis of covariance using the output of `lm()`.
:::
### Heart rate Exercise
::: {style="color: #80CC99;"}
@MillikenJohnson2002 discuss an investigation by a sport physiologist of the effect of three exercise programs (EX1, EX2 and EX3) on heart rate (HR) of males between 28 and 35 years of age. The experiment consists of the random allocation of a subject to one of the programs. The subject then had to follow the exercise program for 8 weeks. On finishing the program each subject had to complete a 6 minutes run and then had his heart rate (EHR) measured. As a measure of their fitness before entering the experiment each subject had also his resting heart rate (AHR) determined prior to the experiment. The following data were obtained:
```{r, HearRateData, echo = FALSE, warning = FALSE, message = FALSE}
library (flextable)
set_flextable_defaults(fonts_ignore = TRUE)
data <- data.frame (HER1 = c(118, 138, 142, 147, 160, 166, 165, 171),
AHR1 = c(56, 59, 62, 68, 71, 76, 83, 87),
HER2 = c(148, 159, 162, 157, 169, 164, 179, 177),
AHR2 = c(60, 62, 65, 66, 73, 75, 84, 88),
HER3 = c(153, 150, 158, 152, 160, 154, 155, 164),
AHR3 = c(56, 58, 61, 64, 72, 75, 82, 86))
flextable (data) |>
delete_part (part = "header") |>
add_header (HER1 = "HER", AHR1 = "AHR", HER2 = "HER", AHR2 = "AHR",
HER3 = "HER", AHR3 = "AHR") |>
add_header_row(values = c("EX1", "EX2", "EX3"), colwidths = c(2,2,2)) |>
add_header_row(values = "EXERCISE PROGRAM", colwidths = 6) |>
theme_alafoli() |>
align (align = "center", part = "header") |>
hline_bottom(border = officer::fp_border(color = "#333333", width = 0.75))
```
Analyze the data and discuss the results.
:::