Designing-Simulations-in-R/015-Case-study-ANOVA.Rmd at main · jepusto/Designing-Simulations-in-R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498

# Case Study: Heteroskedastic ANOVA and Welch {#case-ANOVA}
\index{example!heteroskedastic ANOVA (Welch)}

```{r include = FALSE}
library(tidyverse)
```
In this chapter, we present another detailed example of a simulation study\index{simulation study} to demonstrate how to put the principles of tidy, modular simulation into practice.
To illustrate the process of programming a simulation, we reconstruct the simulations from @brown1974SmallSampleBehavior.
We will also use this case study as a recurring example in some of the following chapters.

@brown1974SmallSampleBehavior studied methods for null hypothesis testing in studies that measure a characteristic $X$ on samples from each of several groups. They consider a population consisting of $G$ separate groups, with population means $\mu_1,...,\mu_G$ and population variances $\sigma_1^2,...,\sigma_G^2$ for the characteristic $X$.
A researcher obtains samples of size $n_1,...,n_G$ from each of the groups and takes measurements of the characteristic for each sampled unit.
Let $x_{ig}$ denote the measurement from unit $i$ in group $g$, for $i = 1,...,n_g$ for each $j = 1,..., G$.
The researcher's goal is to use the sample data to test the hypothesis that the population means are all equal
$$
H_0: \mu_1 = \mu_2 = \cdots = \mu_G.
$$

If the population _variances_ were all equal (so $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_G^2$), we could use a conventional one-way analysis of variance (ANOVA) to conduct this test.
However, one-way ANOVA might not work well if the variances are not equal.
The question is then: what are best practices for testing the null of equal group means, allowing for the possibility that variances could differ across groups?

```{r BF-scenarios-data, echo=FALSE}
table_1_scenarios <- tribble(
  ~Groups, ~`Sample Sizes`,           ~`Standard Deviations`,
  "4",     "4,4,4,4",                 "1,1,1,1",
  "4",     "",                        "1,2,2,3",
  "4",     "4,8,10,12",               "1,1,1,1",
  "4",     "",                        "1,2,2,3",
  "4",     "",                        "3,2,2,1",
  "4",     "11,11,11,11",             "1,1,1,1",
  "4",     "",                        "1,2,2,3",
  "4",     "11,16,16,21",             "1,1,1,1",
  "4",     "",                        "3,2,2,1",
  "4",     "",                        "1,2,2,3",
  "6",     "4,4,4,4,4,4",             "1,1,1,1,1,1",
  "6",     "",                        "1,1,2,2,3,3",
  "6",     "4,6,6,8,10,12",           "1,1,1,1,1,1",
  "6",     "",                        "1,1,2,2,3,3",
  "6",     "",                        "3,3,2,2,1,1",
  "6",     "6,6,6,6,6,6",             "1,1,2,2,3,3",
  "6",     "11,11,11,11,11,11",       "1,1,2,2,3,3",
  "6",     "16,16,16,16,16,16",       "1,1,2,2,3,3",
  "6",     "21,21,21,21,21,21",       "1,1,2,2,3,3",
  "10",    "20,20,20,20,20,20,20,20,20,20", "1,1,1.5,1.5,2,2,2.5,2.5,3,3"
) %>%
  mutate(
    Scenario = LETTERS[1:n()]
  )
```


To tackle this question, Brown and Forsythe evaluated two different hypothesis testing procedures, developed by @james1951ComparisonSeveralGroups and @welch1951ComparisonSeveralMean, both of which avoid the assumption that all groups have equal equality of variances.
Brown and Forsythe also evaluated the conventional one-way ANOVA F-test as a benchmark, even though this procedure maintains the assumption of equal variances.
They also proposed and evaluated a new procedure of their own devising.[^typical-stats-methods]
Their simulation involved comparing the performance of these different hypothesis testing procedures (the methods) under a range of conditions (different data generating processes) with different sample sizes and different degrees of heteroskedasticity.
They looked at the different scenarios shown as Table \@ref(tab:BF-Scenarios), varying number of groups, group size, and amount of variation within each group.
In all, there are a total of `r nrow(table_1_scenarios)` scenarios, covering conditions with between `r min(table_1_scenarios$Groups)` and `r max(table_1_scenarios$Groups)` groups.

[^typical-stats-methods]: This latter piece makes Brown and Forsythe's study a prototypical example of a statistical methodology paper: find some problem that current procedures do not perfectly solve, invent something to do a better job, and then do simulations and/or math to build a case that the new procedure is better.

```{r BF-Scenarios, echo=FALSE, message = FALSE}
library(kableExtra)

table_1_scenarios %>%
  relocate(Scenario) %>%
  kbl(
    caption = "Simulation scenarios explored by Brown and Forsythe (1974)"
  ) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped","compact","hover")
  )
```


When evaluating hypothesis testing procedures, there are two main performance metrics of interest: type-I error rate and power.
The type-I error rate is the rate at which a test rejects the null hypothesis when the null hypothesis is actually true.
To apply a hypothesis testing procedure, one has to specify a desired, or nominal, type-I error rate, often denoted as the $\alpha$-level.
For a specified $\alpha$, a valid or well-calibrated test should have an actual type-I error rate less than or equal to the nominal level, and ideally should be very close to nominal.
<!-- JEP: Do you have a source for the terminology of a _valid_ test? I'm kind of wary of calling it the validity of a test because validity is already such an overloaded term. In some of my work, I've written about type-I error as measuring whether the test is appropriately _calibrated_, which lines up with the idea of calibrating physical measurement instruments like scales or temperature gauges. -->
<!-- LWM: I poked around and valid seems to be common usage.  (Lot of papers talk about valid testing, for example).  Not sure we would want to cite that term here?  But i am content with calibrated only if you prefer. -->
Power is how often a test correctly rejects the null when it is indeed false.
It is a measure of how sensitive a method is to violations of the null.

Brown and Forsythe estimated error rates and power for nominal $\alpha$-levels of 1%, 5%, and 10%.
Table 1 of their paper reports the simulation results for type-I error (labeled as "size").
Our Table \@ref(tab:BF-table1) reports some of their results with respect to Type I error\index{performance measures!Type I error}.
For a well-calibrated hypothesis testing method, the reported numbers should be very close to the desired alpha levels, as listed at the top of the table.
We can compare the four tests to each other across each row, where each row is a specific scenario defined by a specific data generating process\index{data-generating process}.
Looking at ANOVA, for example, we see some scenarios with very elevated rates. For instance, in Scenario E, the ANOVA F-test has  21.9% rejection when it should only have 10%.
In contrast, the ANOVA F works fine under scenario A, which is what one would expect because all the groups have the same variance.
Brown and Forsythe's choice of scenarios here illustrates a broader design principle: to provide a full picture of the performance of a method or set of methods, it is wise to always evaluate them under conditions where we expect things to work, as well as conditions where we expect them to not work well.


```{r BF-table1, echo=FALSE}
library(tibble)

table_5_percent <- tribble(
  ~Groups, ~`F`, ~`F*`, ~`W`, ~`J`,
  "4",     4.9,     3.4,      4.5,     7.9,
  "4",     6.7,     4.1,      4.7,     8.1,
  "4",     5.1,     4.8,      5.7,     6.7,
  "4",     3.0,     5.7,      4.9,     5.6,
  "6",     14.4,    6.2,      6.5,     7.7,
  "6",     5.1,     5.7,      5.0,     5.5,
  "6",     6.3,     5.7,      5.0,     5.4,
  "6",     4.9,     5.1,      5.0,     5.3,
  "6",     10.8,    6.2,      5.5,     5.8,
  "6",     4.0,     6.5,      5.4,     5.6,
  "6",    4.9,     3.4,      6.1,     9.5
)
#nrow( table_5_percent )
table_1_percent <- tribble(
  ~Groups, ~`F`, ~`F*`, ~`W`, ~`J`,
  "4",     0.9,     0.5,      0.8,     2.4,
  "4",     1.7,     0.7,      0.8,     2.7,
  "4",     1.1,     1.0,      1.6,     2.1,
  "4",     0.6,     1.4,      0.9,     1.3,
  "6",     5.6,     1.8,      2.0,     2.9,
  "6",     1.0,     1.5,      0.9,     1.1,
  "6",     1.8,     1.5,      1.1,     1.3,
  "6",     1.1,     1.0,      1.0,     1.2,
  "6",     3.9,     1.8,      1.2,     1.3,
  "6",     1.0,     1.8,      1.1,     1.1,
  "6",    1.0,     0.4,      1.4,     3.8
)


table_10_percent <- tribble(
  ~Groups, ~`F`, ~`F*`, ~`W`, ~`J`,
  "4",     10.2,     7.8,       9.6,      13.3,
  "4",     12.0,     8.9,       10.3,     13.8,
  "4",     9.9,      9.5,       10.8,     12.1,
  "4",     5.9,      10.3,      9.8,      10.8,
  "6",     21.9,     11.0,      11.3,     12.9,
  "6",     10.1,     9.8,       10.0,     10.6,
  "6",     11.4,     10.7,      10.1,     10.6,
  "6",     10.3,     10.3,      10.2,     10.5,
  "6",     17.3,     11.1,      10.5,     10.9,
  "6",     7.3,      11.5,      10.6,     10.9,
  "10",    9.6,      7.3,       11.4,     14.7
)

table_1 <- bind_rows(
  `10%` = table_10_percent,
  `5%` = table_5_percent,
  `1%` = table_1_percent,
  .id = "level"
)

table_1 <-
  table_1 %>%
  dplyr::select( -Groups ) %>%
  mutate(
    Scenario = rep(LETTERS[1:11], 3),
    level = factor(level, levels = c("10%", "5%", "1%"))
  ) %>%
  pivot_wider(
    names_from = level,
    values_from = c(`F`, `F*`, W, J)
  ) %>%
  relocate(Scenario)

#table_1$Group = table_1_scenarios$Groups[1:nrow(table_1)]

table_1 %>%
  #relocate( Scenario, Group ) %>%
  kbl(
    caption='Portion of "Table 1" reproduced from Brown and Forsythe (1974)',
    col.names = c("Scenario", rep(c("10%","5%","1%"), times = 4))
  ) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped","compact","hover"),
  ) %>%
  add_header_above(c(" " = 1, "ANOVA F test" = 3, "B & F's F* test" = 3,"Welch's test" = 3, "James' test" = 3))
```

To replicate the Brown and Forsythe simulation, we will first write functions to generate data for a specified scenario and evaluate data of a given structure.
We will then use these functions to evaluate the hypothesis testing procedures in a specific scenario with a particular set of parameters (e.g., sample sizes, number of groups, and so forth). We will then scale up to execute the simulations for a range of scenarios that vary the parameters of the data-generating model, just as reported in Brown and Forsythe's paper.

## The data-generating model {#case-anova-DGP}
\index{data-generating process}

In the heteroskedastic one-way ANOVA simulation, there are three sets of parameter values: population means, population variances, and sample sizes.
Rather than attempting to write a general data-generating function immediately, it is often easier to write code for a specific case first and then use that code as a starting point for developing a function.
For example, say that we have four groups with means of 1, 2, 5, 6; variances of 3, 2, 5, 1; and sample sizes of 3, 6, 2, 4:

```{r}
mu <- c(1, 2, 5, 6)
sigma_sq <- c(3, 2, 5, 1)
sample_size <- c(3, 6, 2, 4)
```

Following @brown1974SmallSampleBehavior, we will assume that the measurements are normally distributed within each sub-group of the population. The following code generates a vector of group id's and a vector of simulated measurements:

```{r}
N <- sum(sample_size) # total sample size
G <- length(sample_size) # number of groups

# group id factor
group <- factor(rep(1:G, times = sample_size))

# mean for each unit of the sample
mu_long <- rep(mu, times = sample_size)

# sd for each unit of the sample
sigma_long <- rep(sqrt(sigma_sq), times = sample_size)

# See what we have?
tibble( group = group, mu = mu_long, sigma = sigma_long )
```

Now we have the pieces needed to generate a small dataset consisting of group memberships and the measured characteristic:

```{r}
# Now make our data
x <- rnorm(N, mean = mu_long, sd = sigma_long)
dat <- tibble(group = group, x = x)
dat

```

We have followed the strategy of first constructing a dataset with parameters for each observation in each group, making heavy use of base R's `rep()` function to repeat values in a list.
We then called `rnorm()` to generate `N` observations in all.
This works because `rnorm()` is _vectorized_; if you give it a vector (or vectors) of parameter values, it will generate each subsequent observation according to the next entry in the vector. As a result, the first `x` value is simulated from a normal distribution with mean `mu_long[1]` and standard deviation `sd_long[1]`, the second `x` is simulated using `mu_long[2]` and `sd_long[2]`, and so on.

As usual, there are many different and legitimate ways of doing this in R.
For instance, instead of using `rep()` to do it all at once, we could generate observations for each group separately, then stack the observations into a dataset.
Do not worry about trying to writing code the "best" way---especially when you are initially putting a simulation together.
If you can find a way to accomplish your task at all, then that's often enough (and you should feel good about it!).

### Now make a function {#Welch-DGP}

Because we will need to generate datasets over and over, we will wrap our code in a function.
The inputs to the function will be the parameters of the model that we specified at the very beginning: the set of population means `mu`, the population variances `sigma_sq`, and sample sizes `sample_size`. We make these quantities arguments of the data-generating function so that we can make datasets of different sizes and shapes:

```{r generate_data_block, file = "case_study_code/generate_ANOVA_data.R"}
```

The function is simply the code we built previously, all bundled up.
We developed the function by first writing code to make the data-generating process\index{data-generating process} to work once, the way we want, and only then turning the final code into a function for later reuse.

Once we have turned the code into a function, we can call it to get a new set of simulated data.
For example, to generate a dataset with the same parameters as before, we can do:

```{r}
sim_data <- generate_ANOVA_data(
  mu = mu,
  sigma_sq = sigma_sq,
  sample_size = sample_size
)

str(sim_data)
```

To generate one with population means of zero in all the groups, but the same group variances and sample sizes as before, we can do:

```{r}
sim_data_null <- generate_ANOVA_data(
  mu = c( 0, 0, 0, 0 ),
  sigma_sq = sigma_sq,
  sample_size = sample_size
)

str(sim_data)
```

Following the principles of tidy, modular simulation, we have written a function that returns a rectangular dataset for further analysis.
Also note that the dataset returned by `generate_ANOVA_data()` only includes the variables `group` and `x`, but not `mu_long` or `sd_long`.
This is by design.
Including `mu_long` or `sd_long` would amount to making the population parameters available for use in the data analysis procedures, which is not something that happens when analyzing real data.

### Cautious coding

In the above, we built some sample code, and then bundled it into a function by literally cutting and pasting the initial work we did into a function skeleton.
In the process, we shifted from having variables in our workspace with different names to using those variable names as parameters in our function call.

Developing code in this way is not without hazards.
In particular, after we have created our function, our workspace is left with a variable `mu` in it and our function also has a parameter named `mu`.
Inside the function, R will use the parameter `mu` first, but this is potentially confusing.
Another potential source of confusion are lines such as `mu = mu`, which means "set the function's parameter called `mu` to the variable called `mu`."
These are different things (with the same name).

Once  you have built a function, one way to check that it is working properly is to comment out the initial code (or delete it), clear out the workspace (or restart R), and then re-run the code that uses the function.
If things still work, then you can be somewhat confident that you have successfully bundled your code into the function.
Once you bundle your code, you can also do a search and replace to change the variable names inside your function to something more generic, to better clarify the distinction betwen object names and argument names.

## The hypothesis testing procedures {#ANOVA-hypothesis-testing-function}

Brown and Forsythe considered four different hypothesis testing procedures for heteroskedastic ANOVA, but we will focus on just two of the tests for now.
We start with the conventional one-way ANOVA that mistakenly assumes homoskedasticity.
R's `oneway.test` function will calculate this test automatically:

```{r}
sim_data <- generate_ANOVA_data(
  mu = mu,
  sigma_sq = sigma_sq,
  sample_size = sample_size
)

anova_F <- oneway.test(x ~ group, data = sim_data, var.equal = TRUE)
anova_F
```
We can use the same function to calculate Welch's test by setting `var.equal = FALSE`:
```{r}
Welch_F <- oneway.test(x ~ group, data = sim_data, var.equal = FALSE)
Welch_F
```
The main results we need here are the $p$-values of the tests, which will let us assess Type-I error and power for a given nominal $\alpha$-level. The following function takes simulated data as input and returns as output the $p$-values from the one-way ANOVA test and Welch test:

```{r, file = "case_study_code/ANOVA_Welch_F.R"}
```
```{r}
ANOVA_Welch_F(sim_data)
```

Following our tidy, modular simulation principles, this function returns a small dataset with the p-values from both tests.
Eventually, we might want to use this function on some real data.
Our estimation function does not care if the data are simulated or not; we call the input `data` rather than `sim_data` to reflect this.

As an alternative to this function, we could instead write code to implement the ANOVA and Welch tests ourselves.
This has some potential advantages, such as avoiding any extraneous calculations that `oneway.test` does, which take time and slow down our simulation.
However, there are also drawbacks to doing so, including that writing our own code takes _our_ time and opens up the possibility of errors in our code.
For further discussion of the trade-offs, see Chapter \@ref(optimize-code), where we do implement these tests by hand and see what kind of speed-ups we can obtain.

## Running the simulation

We now have functions that implement steps 2 and 3 of the simulation.
Given some parameters, `generate_ANOVA_data` produces a simulated dataset and, given some data, `ANOVA_Welch_F`  calculates $p$-values two different ways.
We now want to know which way is better, and by how much.
To answer this question, we will need to repeat the chain of generate-and-analyze calculations a bunch of times.
To facilitate repetition, we first put the components together into a single function:
```{r}
one_run = function( mu, sigma_sq, sample_size ) {
  sim_data <- generate_ANOVA_data( mu = mu, sigma_sq = sigma_sq, sample_size = sample_size )
  ANOVA_Welch_F(sim_data)
}

one_run( mu = mu, sigma_sq = sigma_sq, sample_size = sample_size )
```
This function implements a single simulation trial by generating artificial data and then analyzing the data, ending with a tidy dataset that has results for the single run.

We next call `one_run()` over and over; see Appendix \@ref(repeating-oneself) for some discussion of options.
The following uses `repeat_and_stack()` from `simhelpers` to evaluate `one_run()` 4 times and then stack the results into a single dataset:\index{package!simhelpers}\index{repeat\_and\_stack()}

```{r}
library(simhelpers)

sim_data <- repeat_and_stack(4,
  one_run( mu = mu, sigma_sq = sigma_sq, sample_size = sample_size)
)
sim_data
```

Voila! We have simulated $p$-values!

## Summarizing test performance

We now have all the pieces in place to reproduce the results from Brown and Forsythe (1974).
We first focus on calculating the actual type-I error rate of these tests---that is, the proportion of the time that they reject the null hypothesis of equal means when that null is actually true---for an $\alpha$-level of .05.
To evaluate the type-I error rate, we need to simulate data from a process where the population means are indeed all equal.
Arbitrarily, let's start with $G = 4$ groups and set all of the means equal to zero:

```{r}
mu <- rep(0, 4)
```

In the fifth row of Table 1 (Scenario E in our Table \@ref(tab:BF-Scenarios)), Brown and Forsythe examine performance for the following parameter values for sample size and population variance:

```{r}
sample_size <- c(4, 8, 10, 12)
sigma_sq <- c(3, 2, 2, 1)^2
```

With these parameter values, we can use `map_dfr` to simulate 10,000 $p$-values:

```{r welch_sim_1, cache=TRUE}
p_vals <- repeat_and_stack(10000,
  one_run(
    mu = mu,
    sigma_sq = sigma_sq,
    sample_size = sample_size
  )
)
```

We can estimate the rejection rates by summarizing across these replicated p-values.
The rule is that the null is rejected if the $p$-value is less than $\alpha$.
To get the rejection rate, we calculate the proportion of replications where the null is rejected:
```{r}
sum(p_vals$ANOVA < 0.05) / 10000
```
This is equivalent to taking the mean of the logical conditions:

```{r}
mean(p_vals$ANOVA < 0.05)
```

We get a rejection rate that is much larger than $\alpha = .05$.
We have learned that the ANOVA F-test does not adequately control Type-I error under this set of conditions.

```{r}
mean(p_vals$Welch < 0.05)
```

The Welch test does much better, although it appears to be a little bit in excess of 0.05.

Note that these two numbers are quite close (though not quite identical) to the corresponding entries in Table 1 of Brown and Forsythe (1974). The difference is due to the fact that both Table 1 and are results are actually _estimated_ rejection rates, because we have not actually simulated an infinite number of replications. The estimation error arising from using a finite number of replications is called _simulation error_ (or _Monte Carlo error_).
In Chapter \@ref(performance-measures), we will look more at how to estimate and control the Monte Carlo\index{Monte Carlo simulation} simulation error in performance measures.

So there you have it! Each part of the simulation is a distinct block of code, and together we have a modular simulation that can be easily extended to other scenarios or other tests.
The exercises at the end of this chapter ask you to extend the framework further.
In working through them, you will get to experience first-hand how the modular code that we have started to develop is easier to work with than a single, monolithic block of code.

## Exercises {#exAnovaExercises}

The following exercises involve exploring and tweaking the above simulation code we have developed to replicate the results of Brown and Forsythe (1974).

### Other $\alpha$'s {#BF-other-alphas}

Table 1 from Brown and Forsythe reported rejection rates for $\alpha = .01$ and $\alpha = .10$ in addition to $\alpha = .05$. Calculate the rejection rates of the ANOVA F and Welch tests for all three $\alpha$-levels and compare to the table.

### Compare results {#BF-compare-results}

Try simulating the Type-I error rates for the parameter values in the first two rows of Table 1 of the original paper. Use 10,000 replications. How do your results compare to the report results?

### Power {#BF-power}

In the primary paper, Table 1 is about Type I error and Table 2 is about power.  A portion of Table 2 follows:

```{r BF-power, echo = FALSE}
pow <- tribble( ~Variances, ~ Means, ~ `Brown's F`, ~ `B & F's F*`, ~`Welch's W`,
                "1,1,1,1",   "0,0,0,0", 4.9, 5.1, 5.0,
                "",          "1,0,0,0", 68.6, 67.6, 65.0,
                "3,2,2,1",   "0,0,0,0",  NA, 6.2, 5.5,
                "",         "1.3,0,0,1.3", NA, 42.4, 68.2
                )

pow %>%
  kbl(caption = 'Portion of "Table 2" reproduced from Brown and Forsythe (1974)')
```

In the table, the sizes of the four groups are 11, 16, 16, and 21, for all the scenarios.
Try simulating the **power levels** for a couple of sets of parameter values from Table \@ref(tab:BF-power).
Use 10,000 replications.
How do your results compare to the results reported in the Table?

### Wide or long? {#BF-wide-long}

Instead of making `ANOVA_Welch_F` return a single row with the columns for the $p$-values, one could instead return a dataset with one row for each test. The "long" approach is often nicer when evaluating more than two methods, or when each method returns not just a $p$-value but other quantities of interest. For our current simulation, we might also want to store the $F$ statistic, for example. The resulting dataset would then look like the following:

```{r, include=FALSE}
ANOVA_Welch_F_long <- function(data) {
  anova_F <- oneway.test(x ~ group, data = data, var.equal = TRUE)
  Welch_F <- oneway.test(x ~ group, data = data, var.equal = FALSE)

  tibble(
    method = c( "ANOVA", "Welch" ),
    Fstat = c( anova_F$statistic, Welch_F$statistic),
    pvalue = c( anova_F$p.value, Welch_F$p.value )
  )

}
```

```{r, echo=TRUE}
ANOVA_Welch_F_long(sim_data)
```

Modify `ANOVA_Welch_F()` to return output in the long format, update your simulation code, and then use `group_by()` plus `summarise()` to calculate rejection rates of both tests.
The `group_by()` method divides your data into distinct groups so you can easily conduct an operation on each.
Here is a typical way that `group_by()` might be used for calculating performance measures:

```{r, eval=FALSE}
sres <-
  res %>%
  group_by( method ) %>%
  summarise( rejection_rate = mean( pvalue < 0.05 ) )
```

### Other tests {#BF-other-tests}

The `onewaytests` package in R includes functions for calculating Brown and Forsythe's $F^*$ test and James' test for differences in population means.
Modify the data analysis function `ANOVA_Welch_F` (or, better yet, `ANOVA_Welch_F_long` from Exercise \@ref(BF-wide-long)) to also include results from these hypothesis tests.
Re-run the simulation to estimate the type-I error rate of all four tests under Scenarios A and B of Table \@ref(tab:BF-Scenarios).

### Methodological extensions

What other methodological questions might you want to investigate about the one-way ANOVA problem? List two or three questions that could be investigated with further simulations.

### Power analysis

Suppose you were conducting an experimental study involving several groups (perhaps $G = 3$ or $4$ or $5$). Because of logistical constraints, one of the groups will need to include at least 40% of the full sample, and you anticipate that this group might have more variable outcomes than the other groups. Your collaborators want to know how large a total sample they will need to recruit in order to have a high probability of detecting significant differences between groups.
How could you use the functions you've developed to answer this question? List out the steps in your approach, and make note of any further information or assumptions you will need in order to answer the question.