diff --git a/005-initial-t-test-simulation.Rmd b/005-initial-t-test-simulation.Rmd index e1c6fc8..2f737c1 100644 --- a/005-initial-t-test-simulation.Rmd +++ b/005-initial-t-test-simulation.Rmd @@ -152,7 +152,7 @@ prop.test( sum(covered), length(covered), p = 0.95) Our coverage is _too low_; the confidence interval based on the $t$-test misses the true value more often than it should. We have learned that the $t$-test can fail when applied to non-normal (skewed) data. -## Simulating across different scenarios +## Simulating across different scenarios {#simulating-across-different-scenarios} So far, we have looked at coverage rates of the confidence interval under a single, specific scenario, with a sample size of 10, a population mean of 4, and a geometrically distributed variable. We know from statistical theory (specifically, the central limit theorem) that the confidence interval should work better if the sample size is big enough. @@ -194,7 +194,6 @@ However, this will quickly get cumbersome if we want to evaluate many different A better approach is to use a mapping function from the `purrr` package.[^apply] The `map_dbl()` function takes a list of values and calls a function for each value in the list. This accomplishes the same thing as using a `for` loop to iterate through a list of items (if you happen to be familiar with these), but is more succinct. -See Appendix chapter \@ref(repeating-oneself) for more on `map()`.[^mapping] To proceed, we first create a list of sample sizes to test out: ```{r} @@ -206,8 +205,7 @@ coverage_est <- map_dbl( ns, ttest_CI_experiment) ``` This code will run our experiment for each value in `ns`, and then return a vector of the estimated coverage rates for each of the sample sizes. -[^apply]: Alternately, readers familiar with the `*apply()` family of functions from Base R might prefer to use `lapply()` or `sapply()`, which do essentially the same thing as `purrr::map_dbl()`. 
-[^mapping]: You can also check out [Section 21.5 of R for Data Science (1st edition)](https://r4ds.had.co.nz/iteration.html#the-map-functions), which provides an introduction to mapping. +[^apply]: See [Section 21.5 of R for Data Science (1st edition)](https://r4ds.had.co.nz/iteration.html#the-map-functions), which provides an introduction to mapping. Alternatively, readers familiar with the `*apply()` family of functions from Base R might prefer to use `lapply()` or `sapply()`, which do essentially the same thing as `purrr::map_dbl()`. We advocate for depicting simulation results graphically. To do so, we store the simulation results in a dataset and then create a line plot using a log scale for the horizontal axis: diff --git a/010-Simulation-structure.Rmd b/010-Simulation-structure.Rmd index a06999f..087d7a3 100644 --- a/010-Simulation-structure.Rmd +++ b/010-Simulation-structure.Rmd @@ -191,7 +191,7 @@ A well-written estimation method should, in principle, work not only on a simula Because of this, the inputs of the `analyze()` function should not typically include any information about the parameters of the data-generating process. To be realistic, the code for our simulated data-analysis procedure should not make use of anything that the analyst could not know when analyzing a real dataset. Thus, `analyze()` has an argument for the sample dataset but not for `model_params`. -We discuss the form and content of the data analysis function further in Chapter \@ref(data-analysis-procedures). +We discuss the form and content of the data analysis function further in Chapter \@ref(estimation-procedures). ### Repetition @@ -261,7 +261,7 @@ For example, we might want to know how close an estimator gets to the target par We might want to know if a confidence interval captures the true parameter the right proportion of the time, as in the simulation from Chapter \@ref(t-test-simulation). 
Performance is defined in terms of the sampling distribution of estimators or analysis results, across an infinite number of replications of the data-generating process. In practice, we use many replications of the process, but still only a finite number. Consequently, we actually _estimate_ the performance measures and need to attend to the Monte Carlo error in the estimates. -We discuss the specifics of different performance measures and assessment of Monte Carlo error in Chapter \@ref(performance-criteria). +We discuss the specifics of different performance measures and assessment of Monte Carlo error in Chapter \@ref(performance-measures). ### Multifactor simulations @@ -277,7 +277,7 @@ To implement a multifactor simulation, we will follow the same principles of mo In particular, we will take the code developed for simulating a single context and bundle it into a function that can be evaluated for any and all scenarios of interest. Simulation studies often follow a full factorial design, in which each level of a factor (something we vary, such as sample size, true treatment effect, or residual variance) is crossed with every other level. The experimental design then consists of sets of parameter values (including design parameters, such as sample sizes), and these too can be represented in an object, distinct from the other components of the simulation. -We will discuss multiple-scenario simulations in Part III (starting with Chapter \@ref(exp-design)), after we more fully develop the core concepts and techniques involved in simulating a single context. +We will discuss multiple-scenario simulations in Part III (starting with Chapter \@ref(simulating-multiple-scenarios)), after we more fully develop the core concepts and techniques involved in simulating a single context. 
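For instance, a full factorial experimental design can be represented as a data frame with one row per scenario. Here is a minimal sketch using base R's `expand.grid()`; the factor names and levels are hypothetical, chosen only for illustration:

```r
# Hypothetical simulation factors: every level of each factor is
# crossed with every level of the others (a full factorial design).
scenarios <- expand.grid(
  n = c(20, 50, 100),      # sample size
  tau = c(0, 0.2, 0.5),    # true treatment effect
  sigma_sq = c(1, 4)       # residual variance
)

nrow(scenarios)  # 3 * 3 * 2 = 18 scenarios
```

Each row of `scenarios` can then be fed to the single-context simulation function, an approach we develop fully in Part III.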
## Exercises diff --git a/015-Case-study-ANOVA.Rmd b/015-Case-study-ANOVA.Rmd index da82550..8be54f4 100644 --- a/015-Case-study-ANOVA.Rmd +++ b/015-Case-study-ANOVA.Rmd @@ -288,7 +288,7 @@ Once you have built a function, one way to check that it is working properly is If things still work, then you can be somewhat confident that you have successfully bundled your code into the function. Once you bundle your code, you can also do a search and replace to change the variable names inside your function to something more generic, to better clarify the distinction between object names and argument names. -## The hypothesis testing procedures +## The hypothesis testing procedures {#ANOVA-hypothesis-testing-function} Brown and Forsythe considered four different hypothesis testing procedures for heteroskedastic ANOVA, but we will focus on just two of the tests for now. We start with the conventional one-way ANOVA that mistakenly assumes homoskedasticity. @@ -409,7 +409,7 @@ mean(p_vals$Welch < 0.05) The Welch test does much better, although it appears to be a little bit in excess of 0.05. Note that these two numbers are quite close (though not quite identical) to the corresponding entries in Table 1 of Brown and Forsythe (1974). The difference is due to the fact that both Table 1 and our results are actually _estimated_ rejection rates, because we have not actually simulated an infinite number of replications. The estimation error arising from using a finite number of replications is called _simulation error_ (or _Monte Carlo error_). -In Chapter \@ref(performance-criteria), we will look more at how to estimate and control the Monte Carlo simulation error in performance measures. +In Chapter \@ref(performance-measures), we will look more at how to estimate and control the Monte Carlo simulation error in performance measures. So there you have it! 
Each part of the simulation is a distinct block of code, and together we have a modular simulation that can be easily extended to other scenarios or other tests. The exercises at the end of this chapter ask you to extend the framework further. diff --git a/030-Estimation-procedures.Rmd b/030-Estimation-procedures.Rmd index 329913c..a629904 100644 --- a/030-Estimation-procedures.Rmd +++ b/030-Estimation-procedures.Rmd @@ -3,7 +3,7 @@ output: pdf_document: default html_document: default --- -# Data analysis procedures {#data-analysis-procedures} +# Estimation procedures {#estimation-procedures} ```{r, include=FALSE} library(tidyverse) @@ -15,24 +15,38 @@ source("case_study_code/gen_cluster_RCT_rev.R") source("case_study_code/analyze_cluster_RCT.R") ``` -The overall aims of many simulation studies have to do with understand how a particular data-analysis procedure works or comparing the performance of multiple, competing procedures. -Thus, the data-analysis procedure or procedures are the central object of study. -Depending on the research question, the data-analysis procedure might be very simple---as simple as just computing a sample correlation--or it might involve a combination of several components. -For example, the procedure might entail first computing a diagnostic test for heteroskedasticity and then, depending on the outcome of the test, applying either a conventional formula or a heteroskedasticity-robust formula for standard errors. -As another example, a data-analysis procedure might involve using multiple imputation for missingness on key variables, then fitting a statistical model, and then generating predicted values based on the model. -Also depending on the research question, we might need to create _several_ functions that implement different estimation procedures to be compared. +We do simulation studies to understand how to analyze data. 
+Thus, the central object of study is a _data analysis procedure_, that is, a set of steps or calculations carried out on a dataset. +We want to know how well a procedure would work when applied in practice, and the data-generating processes we described in Chapter \@ref(data-generating-processes) provide a stand-in for real data. +To revisit our consumer product testing analogy from Chapter \@ref(introduction), the data analysis procedure is the product, and the data-generating process is the set of trials to which we subject the product. + +Depending on the research question, a data-analysis procedure might be very simple---as simple as just computing a sample correlation---or it might involve something more complex, such as fitting a multilevel model and generating a confidence interval for a specific coefficient. +A data analysis procedure might even involve a combination of several components. +For example, the procedure might entail first running a variable screening procedure and then fitting a random forest on the selected predictor variables. +As another example, a data-analysis procedure might involve using multiple imputation for missingness on key variables, then fitting a statistical model, and then generating predicted values based on the model. For the sake of brevity, we will use the term _estimation procedure_ or just _estimator_ to encompass all of these procedures, even ones involving multiple steps or components. + + +In this chapter, we demonstrate how to implement an estimator in the form of an R function, which we call an _estimation function_, so that its performance can eventually be evaluated by repeatedly applying it to artificial data. +We start by describing the high-level design of an estimation function and illustrate with some simple examples in Section \@ref(estimation-functions). + +Depending on the research question, we will often be interested in comparing several competing estimators. 
+In this case, we will create _several_ functions that implement the estimators that we plan to compare. +The easiest approach here is to implement each estimator in turn, and then bundle the collection in a final overall function. +We describe how to do this in Section \@ref(multiple-estimation-procedures). + +In Section \@ref(validating-estimation-function), we describe strategies for validating the coded-up estimator before running a full simulation. +These include checking against existing implementations, checking theoretical properties, and using simulations to check for bugs. + +For a full simulation, an estimator needs to be reliable: it should run without crashing and return sensible answers across the wide range of datasets to which it is applied. +It will need to handle odd edge cases or pathological datasets that might be generated as we explore a full range of simulation scenarios. +To allow for this, Section \@ref(error-handling) demonstrates several methods for handling common computational problems, such as errors or warnings. -In this chapter, we demonstrate how to implement data-analysis procedures in the form of R functions, which we call _estimation functions_, so that their performance can eventually be evaluated by repeatedly applying them to artificial data. -We start by describing the high-level design of an estimation function, and illustrate with some simple examples. -We then discuss approaches for writing simulations that compare multiple data analysis procedures. -Next, we describe strategies for validating the coded-up estimation functions before running a full simulation. -Finally, we examine methods for handling common computational problems with estimation functions, such as handling non-convergence when using maximum likelihood estimation. 
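To give a flavor of the error-handling patterns covered there, one common approach is to wrap a fragile model-fitting call in `tryCatch()` so that a failure on one pathological dataset yields missing values rather than halting the whole simulation. The following sketch is purely illustrative; the simple `lm()` call stands in for a more fragile estimator:

```r
# Illustrative sketch: return NA estimates instead of crashing
# when model fitting fails on a pathological dataset.
safe_estimate <- function(dat) {
  tryCatch(
    {
      fit <- lm(y ~ x, data = dat)  # stand-in for a fragile estimator
      data.frame(est = coef(fit)[["x"]], message = NA_character_)
    },
    error = function(e) {
      # Record the error message alongside an NA estimate
      data.frame(est = NA_real_, message = conditionMessage(e))
    }
  )
}

safe_estimate(data.frame(x = rnorm(10), y = rnorm(10)))   # returns an estimate
safe_estimate(data.frame(x = numeric(0), y = numeric(0))) # returns an NA row
```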
## Writing estimation functions {#estimation-functions} In the abstract, a function that implements an estimation procedure should have the following form: ```{r} -estimate <- function(data) { +estimator <- function(data) { # calculations/model-fitting/estimation procedures @@ -40,37 +54,24 @@ estimate <- function(data) { } ``` -The function takes a dataset as input, fits a model or otherwise calculates an estimate, possibly with associated standard errors and so forth, and returns these quantities as output. -The estimates could be point estimates of parameters, standard errors, confidence intervals, p-values, predictions, or other quantities. +An estimator function should take a dataset as input, fit a model or otherwise calculate an estimate, possibly with associated standard errors and so forth, and return these quantities as output. +It can return point estimates of parameters, standard errors, confidence intervals, $p$-values, predictions, or other quantities. The calculations in the body of the function should be set up to use datasets that have the same structure (i.e., same dimensions, same variable names) as the output of the corresponding function for generating simulated data. However, in principle, we should also be able to run the estimation function on real data. -In Chapter \@ref(case-ANOVA) we wrote a function called `ANOVA_Welch_F()` for computing $p$-values from two different procedures for testing equality of means in a heteroskedastic ANOVA: -```{r, eval = FALSE, file = "case_study_code/ANOVA_Welch_F.R"} +As a first example, suppose we want to evaluate a method for generating a confidence interval for Pearson's sample correlation coefficient when faced with non-normal data. +For bivariate normal data, statistical theory provides some very useful facts. 
+In particular, it tells us that applying Fisher's $z$-transformation to Pearson's correlation coefficient, which is equivalent to the hyperbolic arc-tangent function (`atanh()` in R), produces a statistic that is very close to normally distributed. +It also tells us that the standard error of the $z$-transformed correlation coefficient is approximately $1 / \sqrt{N - 3}$, and thus independent of the correlation parameter. +This makes $z$-transformation very useful for computing confidence intervals, which can then be back-transformed to the Pearson-$r$ scale. +However, these results are specific to bivariate normal distributions. +Would this transformation work well in the face of non-normal data, such as the bivariate Poisson distribution we coded in Chapter \@ref(data-generating-processes)? -``` -Apply this function to a simulated dataset returns two p-values, one for the usual ANOVA $F$ test (which assumes homoskedasticity) and one for Welch's heteroskedastic $F$ test: -```{r} -sim_data <- generate_ANOVA_data( - mu = c(1, 2, 5, 6), - sigma_sq = c(3, 2, 5, 1), - sample_size = c(3, 6, 2, 4) -) -ANOVA_Welch_F(sim_data) -``` -Our `ANOVA_Welch_F()` function is designed to work with the output of `generate_ANOVA_data()` in that it assumes that the grouping variable is called `group` and the outcome is called `x`. -Relying on this assumption would be a poor choice if we were designing a function as part of an R package or for general-purpose use. -However, because the primary use of the function is for simulation, it is reasonable to assume that the input data will always have appropriate variable names. - -In Chapter \@ref(data-generating-processes), we looked at a data-generating function for a bivariate Poisson distribution, an example of a non-normal bivariate distribution. 
-We might use such a distribution to understand the behavior of Pearson's sample correlation coefficient and its normalizing transformation, known as Fisher's $z$-transformation, which is equivalent to the hyperbolic arc-tangent function (`atanh()` in R). -When the sample measurements follow a bivariate normal distribution, Fisher's $z$-transformed correlation is very close to normally distributed and its standard error is simply $1 / \sqrt{N - 3}$, and thus independent of the correlation. -This makes $z$-transformation very useful for computing confidence intervals, which can then be back-transformed to the Pearson-$r$ scale. - -In this problem, a simple estimation function would take a dataset with two variables as input and compute the sample correlation and its $z$-transformation, compute confidence intervals for $z$, and then back-transform the confidence interval end-points. -Here is an implementation of these calculations: +For studying this question with simulation, a simple estimation function would take a dataset with two variables as input and compute the sample correlation and its $z$-transformation, compute confidence intervals for $z$, and then back-transform the confidence interval end-points. +Here is an implementation of this sequence of calculations: ```{r, file = "case_study_code/r_and_z.R"} ``` + To check that the function returns a result of the expected form, we generate a small dataset using the `r_bivariate_Poisson()` function developed in the last chapter, then apply our estimation function to the result: ```{r} Pois_dat <- r_bivariate_Poisson(40, rho = 0.5, mu1 = 4, mu2 = 4) @@ -79,7 +80,8 @@ r_and_z(Pois_dat) Although it is a little cumbersome to do so, we could also apply the estimation function to a real dataset. Here is an example, which calculates the correlation between ratings of judicial integrity and familiarity with the law from the `USJudgeRatings` dataset (which is included in base R). 
-For the function to work on this dataset, we first need to rename the relevant variables. +For the function to work on this dataset, we first need to rename the relevant variables because our function takes a `data.frame` with two columns named `C1` and `C2`: + ```{r} data(USJudgeRatings) @@ -87,35 +89,34 @@ USJudgeRatings %>% dplyr::select(C1 = INTG, C2 = FAMI) %>% r_and_z() ``` -The function returns a valid result---a quite strong correlation! +The function returns a valid result---a quite strong correlation in this case! -It is a good practice to test out a newly-developed estimation function on real data as a check that it is working as intended. -This type of test ensures that the estimation function is not using information outside of the dataset, such as by using known parameter values to construct an estimate. +It is good practice to test out a newly-developed estimation function on real data as a check that it is working as intended. +Testing on real data ensures that the estimation function is not using information outside of the dataset, such as by using known parameter values to construct an estimate. Applying the function to a real dataset demonstrates that the function implements a procedure that could actually be applied in real data analysis contexts. + ## Including Multiple Data Analysis Procedures {#multiple-estimation-procedures} Many simulations involve head-to-head comparisons between more than one data-analysis procedure. As a design principle, we generally recommend writing different functions for each estimation method one is planning on evaluating. Doing so makes it easier to add in additional methods as desired or to focus on just a subset of methods. -Writing separate function also leads to a code base that is flexible and useful for other purposes (such as analyzing real data). +Writing separate functions also leads to a code base that is flexible and useful for other purposes (such as analyzing real data). 
Finally (repeating one of our favorite mantras), separating functions makes debugging easier because it lets you focus attention on one thing at a time, without worrying about how errors in one area might propagate to others. -To see how this works in practice, we will return to the case study from Section \@ref(case-cluster), where we developed a data-generating function for simulating a cluster-randomized trial with student-level outcomes but school-level treatment assignment. -Our data-generating process allowed for varying school sizes and heterogeneous treatment effects, which might be correlated with school size. -Several different procedures might be used to estimate an overall average effect from a clustered experiment, including: +To see how this works in practice, we return to the case study from Section \@ref(case-cluster), where we developed a data-generating function for simulating a cluster-randomized trial with student-level outcomes but school-level treatment assignment. +Our data-generating process allowed for varying school sizes and heterogeneous treatment effects that are correlated with school size. +Several different procedures might be used to estimate an overall average effect from a clustered experiment. We will consider three different procedures: -* Estimating a multi-level regression model (also known as a hierarchical linear model), -* Estimating an ordinary least squares (OLS) regression model and applying cluster-robust standard errors, or -* Averaging the outcomes by school, then estimating a linear regression model on the mean outcomes. +* Fitting a multi-level regression model (also known as a hierarchical linear model), +* Fitting an ordinary least squares (OLS) regression model and applying cluster-robust standard errors, or +* Averaging the outcomes by school, then fitting a linear regression model on the mean outcomes. All three of these methods are widely used and have some theoretical guarantees supporting their use. 
Education researchers tend to be more comfortable using multi-level regression models, whereas economists tend to use OLS with clustered standard errors. - - + -We next develop estimation functions for each of these procedures. -We analyze as we expect would be done in practice; even though we generated data with a school size covariate, we do not include it in our estimation functions. +We next develop estimation functions for each of these procedures, focusing on a simple model that does not include any covariates besides the treatment indicator. Each function needs to produce a point estimate, standard error, and $p$-value for the average treatment effect. To have data to practice on, we generate a sample dataset using [a revised version of `gen_cluster_RCT()`](/case_study_code/gen_cluster_RCT_rev.R), which corrects the bug discussed in Exercise \@ref(cluster-RCT-checks): ```{r} @@ -136,7 +137,6 @@ analysis_OLS_code <- cluster_RCT_fun[which(str_detect(cluster_RCT_fun, "analysis analysis_agg_code <- cluster_RCT_fun[which(str_detect(cluster_RCT_fun, "analysis_agg <-")):(f_endings[3])] analysis_bundle_code <- cluster_RCT_fun[which(str_detect(cluster_RCT_fun, "estimate_Tx_Fx <-")):(f_endings[4])] -lmer_composing <- cluster_RCT_fun[which(str_detect(cluster_RCT_fun, "compose")):which(str_detect(cluster_RCT_fun, "^\\)$"))] lmer_quietly <- cluster_RCT_fun[which(str_detect(cluster_RCT_fun, "purrr::quietly"))] @@ -145,18 +145,17 @@ analysis_MLM_contingent_code <- cluster_RCT_fun[which(str_detect(cluster_RCT_fun ``` For the multi-level modeling strategy, there are several different existing packages that we could use. -We will implement an estimator using the popular `lme4` package, along with the `lmerTest` function for computing a $p$-value for the average effect. +We will implement an estimator using the popular `lme4` package, via the `lmerTest::lmer()` function, which calculates a $p$-value for the average effect. 
Here is a basic implementation: ```{r, code = analysis_MLM_code} ``` The function fits a multi-level model with a fixed coefficient for the treatment indicator and random intercepts for each school. -To get a p-value for the treatment coefficient, we have to convert the model into an `lmerModLmerTest` object and then pass it through `summary()`. -The function outputs only the statistics in which we are interested. +It outputs only the statistics in which we are interested. -Our function makes use of the `lme4` and `lmerTest` packages. -Rather than assuming that these packages will be loaded, we call relevant functions using the package name as a prefix, as in `lme4::lmer()`. -This way, we can run the function even if we have not loaded the packages in the global environment. -This approach is also preferable to loading packages inside the function itself (e.g., with `require(lme4)`) because calling the function does not change which packages are loaded in the global environment. +Our function makes use of the `lmerTest` package. +Rather than assuming that this package will be loaded, we call the relevant function using the package name as a prefix: `lmerTest::lmer()`. +This way, we can run the function even if we have not loaded the package in the global environment. +Referencing functions with their package prefix is also preferable to loading packages inside the function itself (e.g., with `require(lmerTest)`) because calling the function then does not change which packages are loaded in the global environment. Here is a function implementing OLS regression with cluster-robust standard errors: ```{r, code = analysis_OLS_code} @@ -167,8 +166,10 @@ Adding this option would let us use the same function to compute (and compare) d We set a default option of `"CR2"`, just like the default of `lm_robust()`. Sometimes an analytic procedure involves multiple steps. 
-For example, aggregation estimator first involves collapsing the data to a school-level dataset, and then analyzing at the school level. -This is fine: we just wrap all the steps in a single estimation function: from the point of view of _using_ the function, it is a single call, no matter how complicated the process inside. +For example, the aggregation estimator first involves collapsing the data to a school-level dataset, and then analyzing at the school level. +This is no problem: +we just wrap all the steps in a single estimation function. +Once written, all we need to do is call the function---no matter how complicated the process inside. Here is the code for the aggregate-then-analyze approach: ```{r, code = analysis_agg_code} ``` @@ -176,8 +177,8 @@ Here is the code for the aggregate-then-analyze approach: Note the `stopifnot` command: it will throw an error if the condition is not true. This `stopifnot` ensures we do not have both treatment and control students within a single school--if we did, we would have more aggregated values than school ids due to the grouping! Putting *assert statements* in your code like this is a good way to guarantee you are not introducing weird and hard-to-track errors. -A `stopifnot` statement halts your code as soon as something goes wrong, rather than letting that initial wrongness flow on to further work, creating odd results that are hard to understand. -Here we are protecting ourselves from strange results if, for example, we messed up our DGP code to have treatment not nested within school, or we were using data that did not actually come from a cluster randomized experiment. +A `stopifnot` statement halts your code as soon as something goes wrong, rather than letting that initial error flow on to further work, potentially creating odd results that are hard to understand. 
+Here we are protecting ourselves from strange results if, for example, we messed up our DGP code to have treatment not nested within school, or if we were using data that did not actually come from a cluster randomized experiment. See Section \@ref(about-stopifnot) for more. All of our functions produce output in the same format: @@ -194,20 +195,28 @@ Here is a function that bundles all the estimation procedures together: ```{r} estimate_Tx_Fx(dat) ``` -This is a common coding pattern for simulations that involve multiple estimation procedures. +This is a common coding pattern for simulations that involve multiple estimators. Each procedure is expressed in its own function, then these are assembled together in a single function so that they can all easily be applied to the same dataset. Stacking the results row-wise will make it easy to compute performance measures for all methods at once. The benefit of stacking will become even more evident once we are working across multiple replications of the simulation process, as we will in Chapter \@ref(running-the-simulation-process). -## Validating an Estimation Function -Just as with data-generating functions, it is critical to verify the accuracy of an implemented estimation function. +## Validating an Estimation Function {#validating-estimation-function} + +Just as with data-generating functions, it is critical to verify that an estimation function is correctly implemented. If an estimation function involves a known procedure that has been implemented in R or one of its contributed packages, then a straightforward way to do this is to compare your implementation to another existing implementation. -For estimation functions that involve multi-step procedures or novel methods, other approaches to verification may be needed, which rely more on statistical theory. +For estimation functions that involve multi-step procedures or novel methods, other approaches to verification may be needed. 
+ ### Checking against existing implementations -For our Welch test function, we can check the output of `ANOVA_Welch_F()` against the built-in `oneway.test` function. Let's do that with a fresh set of data: +In Chapter \@ref(case-ANOVA), we wrote a function called `ANOVA_Welch_F()` for computing $p$-values from two different procedures for testing equality of means in a heteroskedastic ANOVA: +```{r, eval = FALSE, file = "case_study_code/ANOVA_Welch_F.R"} + +``` + +We can test this function by comparing its output against the built-in `oneway.test` function, as follows: + ```{r} sim_data <- generate_ANOVA_data( mu = c(1, 2, 5, 6), @@ -215,8 +224,7 @@ sim_data <- generate_ANOVA_data( sample_size = c(3, 6, 2, 4) ) -aov_results <- oneway.test(x ~ factor(group), data = sim_data, - var.equal = FALSE) +aov_results <- oneway.test(x ~ factor(group), data = sim_data, var.equal = FALSE) aov_results Welch_results <- ANOVA_Welch_F(sim_data) @@ -225,7 +233,7 @@ all.equal(aov_results$p.value, Welch_results$Welch) We use `all.equal()` because it checks equality up to a numerical tolerance, which can avoid some perplexing failures due to rounding. -For the bivariate correlation example, we can check the output of `r_and_z()` against R's built-in `cor.test()` function: +For the bivariate correlation example introduced in Section \@ref(estimation-functions), we can check the output of `r_and_z()` against R's built-in `cor.test()` function: ```{r} Pois_dat <- r_bivariate_Poisson(15, rho = 0.6, mu1 = 14, mu2 = 8) @@ -238,15 +246,21 @@ R_result <- tibble(r = R_result$estimate[["cor"]], CI_hi = R_result$conf.int[2]) R_result -all.equal(R_result, my_result) +all.equal( unlist(R_result), unlist(my_result), check.attributes = FALSE ) ``` -This type of test is even more useful here because `r_and_z()` uses our own implementation of the confidence interval calculations, rather than relying on R's built-in functions as we did with `ANOVA_Welch_F()`. 
+This type of check is all the more useful here because `r_and_z()` uses our own implementation of the confidence interval calculations, rather than relying on R's built-in functions as we did with `ANOVA_Welch_F()`. + +You might ask why you should bother implementing your own function if you already have a version implemented in R. +In some cases, it might be possible to implement a faster version of a function because you can cut corners by skipping input data verification, providing fewer estimation options, or cutting out post-estimation processing. +Any of these could help with the overall speed of your simulation. +On the other hand, gains in computational efficiency might not be worth the human time it would take to implement the calculations from scratch. +See Chapter \@ref(optimize-code) for more discussion of this trade-off. ### Checking novel procedures Simulations are usually an integral part of projects to develop novel statistical methods. -Checking estimation functions in such projects presents a challenge: if an estimation procedure truly is new, how do you check that your code is correct? +Checking estimation functions in such projects presents a challenge: if an estimator truly is new, how do you check that your code is correct? Effective methods for doing so will vary from problem to problem, but an over-arching strategy is to use theoretical results about the performance of the estimator to check that your implementation works as expected. For instance, we might work out the algebraic properties of an estimator for a special case and then check that the result of the estimation function agrees with our algebra. For some estimation problems, we might be able to identify theoretical properties of an estimator when applied to a very large sample of data and when the model is correctly specified.
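As a small, concrete illustration of the special-case strategy (our own sketch, using only base R rather than the chapter's `ANOVA_Welch_F()`): with exactly two groups, Welch's ANOVA F-test is algebraically equivalent to Welch's two-sample t-test, so the two p-values should match.

```{r}
# Special-case check: with k = 2 groups, the Welch ANOVA F statistic equals
# the squared Welch t statistic, with matching degrees of freedom.
# The seed, group sizes, and variances here are arbitrary choices.
set.seed(20240321)
dat2 <- data.frame(
  x = c(rnorm(8, mean = 0, sd = 1), rnorm(12, mean = 1, sd = 3)),
  group = rep(c("A", "B"), times = c(8, 12))
)
p_F <- oneway.test(x ~ group, data = dat2, var.equal = FALSE)$p.value
p_t <- t.test(x ~ group, data = dat2)$p.value
all.equal(p_F, p_t)
```

If the two p-values ever disagreed, that would signal a bug in one of the implementations (or in our algebra).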
@@ -293,8 +307,7 @@ Here is the tweaked function: ```{r} analysis_MLM <- function( dat, all_results = FALSE) { - M1 <- lme4::lmer( Yobs ~ 1 + Z + (1 | sid), data = dat ) - M1_test <- lmerTest::as_lmerModLmerTest(M1) + M1_test <- lmerTest::lmer( Yobs ~ 1 + Z + (1 | sid), data = dat ) if (all_results) { return(summary(M1_test)) @@ -309,9 +322,11 @@ analysis_MLM <- function( dat, all_results = FALSE) { ) } ``` + Setting `all_results` to `TRUE` will return the full model summary; keeping it at the default value of `FALSE` will return the same output as the other functions. Now let's apply the estimation function to a very large dataset, with variation in cluster sizes. We set `gamma_2 = 0` so that the estimation model is correctly specified: + ```{r} dat <- gen_cluster_RCT( J=5000, n_bar = 20, alpha = 0.9, p = 2 / 3, @@ -334,51 +349,55 @@ These limitations are typical of what can be accomplished through tests based on After all, if we had comprehensive theoretical results, we would not need to simulate anything in the first place! Nonetheless, it is good to work through such tests to the extent that relevant theory is available for the problem you are studying. + ### Checking with simulations Checking, debugging, and revising should not be limited to when you are initially developing estimation functions. It often happens that later steps in the process of conducting a simulation will reveal problems with the code for earlier steps. For instance, once you have run the data-generating and estimation steps repeatedly, calculated performance summaries, and created some graphs of the results, you might find an unusual or anomalous pattern in the performance of an estimator. This might be a legitimate result---perhaps the estimator really does behave weirdly or not work well---or it might be due to a problem in how you implemented the estimator or data-generating process.
-When faced with an unusual pattern, we recommend revisiting the estimation code to double check for bugs and also thinking further about what might lead to the anomoly. +When faced with an unusual pattern, revisit the estimation code to double check for bugs and also think further about what might lead to the anomaly. Further exploration might lead you to a deeper understanding of how a method works and perhaps even an idea for how to improve the estimator or refine the data-generating process. A good illustration of this process comes from one of Luke's past research projects (see @pashley2024improving), in which he and other co-authors were working on a way to improve Instrumental Variable (IV) estimation using post-stratification. -The method they studied involved grouping units based on a covariate that predicts compliance status, then calculating estimates within each group, then summarizing the estimates across groups. +The method they studied involved grouping units based on a covariate that predicts compliance status, then calculating estimates within each group, then averaging the estimates across groups. They used simulations to see whether this method would improve the accuracy of the overall summary effect estimate. -In the first simulation, the estimates were full of NAs and odd results because the estimation function failed to account for what happens in groups of observations where the number of compliers was estimated to be zero. -After repairing that problem and re-running everything, the simulation results still indicated serious and unexpected bias, which turned out to be due to an error in how the estimation function implemented the step of summarizing estimates across groups. +In the first simulation, the estimates were full of missing values and odd results because the estimation function failed to account for what happens in groups of observations where the number of compliers was estimated to be zero. 
+After repairing that problem and re-running everything, the simulation results still indicated serious and unexpected bias, which turned out to be due to an error in how the estimation function implemented the step of averaging estimates across groups. After again correcting and re-running, the simulation results showed that the gains in accuracy from this new method were minimal, even when the groups were formed based on a variable that was almost perfectly predictive of compliance status. Eventually, we understood that the groups with very few compliers produced such unstable estimates that they spoiled the overall average estimate. This inspired us to revise our estimation strategy and introduce a method that dropped or down-weighted strata with few compliers, which ultimately helped us to strengthen the contribution of our work. As this experience highlights, simulations seldom follow a single, well-defined trajectory. The point of conducting simulations is to help us, as researchers, learn about estimation methods so that we can analyze real data better. -What we learn from simulation gives us a better understanding of the methods (potentially including a better understanding of theoretical results), leading to ideas about better methods to create new scenarios to explore in further simulations. +Simulations can strengthen our methodological and theoretical understanding, which can then inspire ideas for better approaches that we can test in subsequent simulations. Of course, at some point one needs to step off this merry-go-round, write up the findings, cook dinner, and clean the bathroom. -But, just like many other research endeavors, simulations follow a highly iterative process.
+ -## Handling errors, warnings, and other hiccups +## Handling errors, warnings, and other hiccups {#error-handling} Especially when working with more advanced estimation methods, it is possible that your estimation function will fail, throw an error, or return something uninterpretable for certain input datasets. -For instance, maximum likelihood estimation often requires iterative, numerical optimization algorithms that sometimes fail to converge. +For instance, maximum likelihood estimation often involves using iterative, numerical optimization algorithms that sometimes fail to converge. This might happen rarely enough that it takes a while to even notice that it is a problem, but even quite rare things can occur when you run simulations with many thousands of repetitions. -Less dire but still annoying, your estimation function might generate warnings, which can pile up if you are running many repetitions. +Less dire but still annoying, your estimation function might occasionally generate warnings, which can pile up if you are running many repetitions. In some cases, such warnings might also signal that the estimator produced a bad result, and -it may not be clear whether we should retain this result (or include it in overall performance assessments). +it may not be clear whether we should retain this result in the estimator's overall performance assessments. After all, the function tried to warn us that something is off! Errors and warnings in estimation functions pose two problems, one purely technical and one conceptual. -On a technical level, R functions stop running if they hit errors (though not warnings), so we need ways to handle the errors in order to get our simulations up and running. 
-On a conceptual level, we need to decide how to use the information contained in errors and warnings, whether that be by further elaborating the estimation procedures to address different contingencies or by evaluating the performance of the estimators in a way that appropriately accounts for errors. -We consider each of the problems here, then revisit the conceptual considerations in Chapter \@ref(performance-criteria). +On a technical level, R functions stop running if they hit an error, so we need ways to handle the errors in order to get our simulations up and running. +Furthermore, warnings can clutter up the console and slow down code execution, so we may want to capture and suppress them as well. +On a conceptual level, we need to decide how to use the information contained in errors and warnings, whether that be by further elaborating the estimators to address different contingencies or by evaluating the performance of the estimators in a way that appropriately accounts for these events. +We consider both these problems here, and then revisit the conceptual considerations in Chapter \@ref(performance-measures), where we discuss assessing estimator performance. + ### Capturing errors and warnings Some estimation functions will require complicated or stochastic calculations that can sometimes produce errors. Intermittent errors can really be annoying and time-consuming if not addressed. To protect yourself, it is good practice to anticipate potential errors, preventing them from stopping code execution and allowing your simulations to keep running. -We will demonstrate some techniques for error-handling using tools from the `purrr` package. +We next demonstrate some techniques for error-handling using tools from the `purrr` package. 
For illustrative purposes, consider the following error-prone function that sometimes returns what we want, sometimes returns `NaN` due to taking the square root of a negative number, and sometimes crashes completely because `broken_code()` does not exist: ```{r, error=TRUE} @@ -393,42 +412,22 @@ my_complex_function = function( param ) { } ``` -Running it produces some results and an occasional warning, and some errors: +Running it a few times produces a mix of results and warnings: ```{r, error=TRUE} -set.seed(3) -my_complex_function( 1 ) -my_complex_function( 10 ) -my_complex_function( 5 ) +set.seed(156858) +my_complex_function( 7 ) +my_complex_function( 7 ) +my_complex_function( 4 ) ``` Running it many times produces warnings, then an error: ```{r, error=TRUE} -resu <- replicate(20, my_complex_function( 7 )) +set.seed( 131 ) +resu <- replicate(20, my_complex_function( 6 )) ``` +We need to "trap" these warnings and errors so they do not clutter or stop our simulation. +Let's do the errors first. -The `purrr` package includes a function called `safely` that makes it easy to trap errors. -To use it, we feed the estimation function into `safely()` to create a new version: -```{r} -my_safe_function <- safely( my_complex_function, otherwise = NA ) -my_safe_function( 7 ) -``` -The safe version of the function returns a list with two entries: the result (or NULL if there was an error), and the error message (or NULL if there was no error). -`safely()` is an example of a _functional_ (or an _abverb_), which takes a function and returns a new function that does something slightly different. -We include `otherwise = NA` so we always get a result, rather than a `NULL` when there is an error. 
- -We can use the safe function repeatedly and it will always return a result: -```{r} -resu <- replicate(20, my_safe_function( 7 ), simplify = FALSE) -resu <- transpose( resu ) -unlist(resu$result) -``` -The `transpose()` function takes a list of lists, and reorganizes them to give you a list of all the first elements, a list of all the second elements, etc. -This is very powerful for wrangling data, because then we can make a tibble with list columns as so: -```{r} -tb <- tibble( result = unlist( resu$result ), error = resu$error ) -head( tb, n = 4 ) -``` -The `purrr` package includes several other functionals that are useful for handling errors and warnings. -The `possibly()` wrapper will try to run a function and will return a specified value in the event of an error: +The `purrr` package includes a function called `possibly()` that makes it easy to trap errors: ```{r} my_possible_function <- possibly( my_complex_function, otherwise = NA ) my_possible_function( 7 ) @@ -436,88 +435,97 @@ my_possible_function( 7 ) rs <- replicate(20, my_possible_function( 7 )) rs ``` -It works as a simpler version of `safely()`, which does not record error messages. -The `quietly` functional leads to results that are bundled together with any console output, warnings, and messages, rather than printing anything to the console: +`possibly()` is an example of a _functional_ (or an _adverb_ or "wrapper function"), which takes a function and returns a new function that does something slightly different. +The new function has the exact same parameters as the function we put into it. +The difference is that, if an error happens, the new version will return the value specified by the `otherwise` argument (here, `NA`) rather than stopping execution with an error. +It does not silence the warnings, however.
+ +To handle warnings, the `quietly()` functional leads to results that are bundled together with any console output, warnings, and messages, rather than printing anything to the console: ```{r} my_quiet_function <- quietly( my_complex_function ) my_quiet_function( 1 ) ``` -This can be especially useful to reduce extraneous printing in a simulation, which can slow down code execution more than you might expect. +Wrapping a function with `quietly()` can be especially useful to reduce extraneous printing in a simulation, which can slow down code execution more than you might expect. However, `quietly()` does not trap errors: ```{r, error=TRUE} rs <- replicate(20, my_quiet_function( 7 )) ``` -Double-wrapping your function will handle both errors and warnings, but the structure it produces gets a bit complicated: +To handle both errors and warnings, we double-wrap the function, first with `possibly()` and then with `quietly()`: ```{r} -my_safe_quiet_function <- quietly( safely( my_complex_function, otherwise = NA ) ) +my_safe_quiet_function <- quietly( possibly( my_complex_function, otherwise = NA ) ) my_safe_quiet_function(7) ``` -Even though the result is a bit of a mess, this structure provides all the pieces that we need to do further calculations on the result (when available), along with errors, warnings, and other output. To see how this works in practice, we will adapt our `analysis_MLM()` function, which makes use of `lmer()` for fitting a multilevel model. Currently, the estimation function sometimes prints messages to the console: ```{r illustrate_warnings} -set.seed(101012) # (I picked this to show a warning.) +set.seed(101012) # (hand-picked to show a warning.) 
dat <- gen_cluster_RCT( J = 50, n_bar = 100, sigma2_u = 0 ) mod <- analysis_MLM(dat) ``` Wrapping `lmer()` with `quietly()` makes it possible to catch such output and store it along with other results, as in the following: ```{r} -quiet_lmer <- quietly(lme4::lmer) -qmod <- quiet_lmer( Yobs ~ 1 + Z + (1|sid), data=dat ) -qmod -``` -However, the `lmerTest` package does not like the structure of the results, and produces an error: -```{r, error = TRUE} -lmerTest::as_lmerModLmerTest(qmod$result) +quiet_safe_lmer <- quietly( possibly( lmerTest::lmer, otherwise=NULL ) ) +M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1|sid), data=dat ) +M1 ``` - - - -We can side-step this by combining `as_lmerModLmerTest()` and `lmer()` into a single function. -While we are at it, we also layer on `summary()`. -To do so, we use the `compose()` functional from `purrr`, which takes a list of functions and wraps them into one: -```{r, code = lmer_composing} -``` -The resulting `lmer_with_test()` function acts as if we were calling `lmer()`, then feeding the result into `as_lmerModLmerTest()`, then feeding the result into `summary()`. -We then wrap the combination function with `safely()` and `quietly()`: -```{r, code = lmer_quietly} -``` -Now we can use our suitably quieted and safe function in a new version of the estimation function: -```{r, code = analysis_MLM_safe_code} +We then pick apart the pieces and construct a dataset of results: +```{r} +if ( is.null( M1$result ) ) { + # we had an error! + tibble( ATE_hat = NA, SE_hat = NA, p_value = NA, + message = M1$message, + warning = M1$warning, + error = TRUE ) +} else { + sum <- summary( M1$result ) + tibble( + ATE_hat = sum$coefficients["Z","Estimate"], + SE_hat = sum$coefficients["Z","Std. 
Error"], + p_value = sum$coefficients["Z", "Pr(>|t|)"], + message = list( M1$message ), + warning = list( M1$warning ), + error = FALSE ) +} ``` -This quiet version runs without extraneous messages: +Now we can plug in the above code to make a nicely quieted and safe function. +This version runs without extraneous messages: ```{r} mod <- analysis_MLM_safe(dat) mod ``` Now we have the estimation results along with any diagnostic information from messages or warnings. -Storing this information will let us evaluate what proportion of the time there was a warning or message, run additional analyses on the subset of replications where there was no such warning, or even modify the estimation procedure to take the diagnostics into account. +Storing this information will let us evaluate what proportion of the time there was a warning or message, run additional analyses on the subset of replications where there was no such warning, or even modify the estimator to take the diagnostics into account. +We have solved the technical problem---our code will run and give results---but not the conceptual one: what does it mean when our estimator gives an NA or a convergence warning with a nominal answer? +How do we decide how good our estimator is when it does this? + -### Adapting estimation procedures for errors and warnings {#adapting-for-errors} +### Adapting estimators for errors and warnings {#adapting-for-errors} So far, we have seen techniques for handling technical hiccups that occur when data analysis procedures do not always produce results. But how do we account for the absence of results in a simulation? -In Chapter \@ref(performance-criteria), we will delve into the conceptual issues with summarizing the performance of methods that do not always provide an answer. -One of the best solutions to such problems still concerns the formulation of estimation functions, and so we introduce it here. 
-That solution is to _re-define the estimator_ to include contingencies for handling lack of results. +In Chapter \@ref(performance-measures), we will delve into the conceptual issues with summarizing the performance of methods that do not always provide an answer. +However, one of the best solutions to such problems still concerns the formulation of estimation functions, and so we introduce it here. +That solution is to _re-define the estimator_ to include contingencies for handling a lack of results. Consider a data analyst who was planning to apply a fancy statistical model to their data, but then finds that the model does not converge. What would that analyst do in practice (besides cussing and taking a snack break)? Rather than giving up entirely, they would probably think of an alternative analysis and attempt to apply it, perhaps by simplifying the model in some way. To the extent that we can anticipate such possibilities, we can build these error-contingent alternative analyses into our estimation function. +Because of this, it would be more precise to talk about an _estimation procedure_ or a _data-analysis procedure_ rather than just an _estimator_. +An analysis may comprise many different steps, arranged in a branching structure. -To illustrate, let's look at an error (a not-particularly-subtle one) that can crop up in the cluster-randomized trial example when clusters are very small: +To illustrate, let's look at an error (a not-particularly-subtle one) that could crop up in the cluster-randomized trial example when clusters are very small: ```{r} set.seed(65842) tiny_dat <- gen_cluster_RCT( J = 10, n_bar = 2, alpha = 0.5) analysis_MLM_safe(tiny_dat) table(tiny_dat$sid) ``` -The error occurs because all 10 simulated schools include a single student, making it impossible to estimate a random-intercepts multilevel model.
+The error occurs because all 10 simulated schools happened to include a single student, making it impossible to estimate a random-intercepts multilevel model. A natural fall-back here would be to estimate an ordinary least squares regression. Suppose that our imaginary analyst is not especially into nuance, and so will fall back onto ordinary least squares whenever the multilevel model produces an error. @@ -539,11 +547,83 @@ Of course, studying such a model is only interesting to the extent that the deci The adaptive estimation approach does lead to more complex estimation functions, which entail implementing multiple estimation methods and a set of decision rules for applying them. Often, the set of contingencies that need to be handled will not be immediately obvious, so you may find that you need to build and refine the decision rules as you learn more about how they work. -Running an estimation procedure over multiple, simulated datasets is an excellent (if aggravating!) way to identify errors and edge cases. -We turn to procedures for doing so in the next chapter. +Running an estimator over multiple, simulated datasets is an excellent (if aggravating!) way to identify errors and edge cases. +We turn to procedures for doing so in Chapter \@ref(running-the-simulation-process). + + +### The `safely()` option + +There is one final `purrr` option, `safely()`, which traps the full error message, instead of just replacing the output with a fixed value as `possibly()` does. +`safely()` returns a list with two entries: the result of the original function (or some fixed value if there was an error), and the error message (or NULL if there was no error). + +To use it, we feed our function into `safely()` to create a new version: +```{r} +my_safe_function <- safely( my_complex_function, otherwise = NA ) +my_safe_function( 7 ) +``` +Just as with `possibly()`, we can include `otherwise = NA` to set a return value if there is an error.
+ +We can use the safe function repeatedly and it will always return a result: +```{r} +resu <- replicate(20, my_safe_function( 7 ), simplify = FALSE) +``` + +We still get warnings, but any errors are trapped so the function does not crash. + +We end up with a 20-entry list, with each element consisting of a pair of the result and error message: +```{r} +length( resu ) +resu[[3]] +``` +We can massage our results into an easier-to-parse format: +```{r} +resu <- transpose( resu ) +unlist(resu$result) +``` + +The `transpose()` function takes a list of lists, and reorganizes them to give you a list of all the first elements, a list of all the second elements, etc. +`transpose()` is very powerful for wrangling data, because then we can make a tibble with list columns like so: + +```{r} +tb <- tibble( result = unlist( resu$result ), + error = resu$error ) +print( tb, n = 4 ) +``` + +We now have our results all organized, and we can see which iterations produced errors. + +Unfortunately, combining `safely()` and `quietly()` produces a bit of a mess. +Extending our cluster RCT example, we have: + +```{r} +quiet_safe_lmer <- quietly( safely( lmerTest::lmer, otherwise=NULL ) ) +M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), data = dat ) +M1 +``` +The resulting object has the estimation results buried inside it. +Even though the result is a bit of a mess, this structure provides all the pieces that we need, including any errors, warnings, and other output. + +We can then have, for example, code such as this in our estimation function: + +```{r, eval=FALSE} +if ( is.null( M1$result$result ) ) { + # we had an error! + error = M1$result$error + +} else { + # We did not. Do the same as above + error = NULL +} +# etc. +``` + +Fortunately, once we have written all this code, we can tuck it inside our estimation function, and then forget about it. +Once written, applying the function will produce a tidy table of results that we can easily analyze.
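Pulling these pieces together, the overall pattern of trapping failures and falling back to a simpler analysis can be sketched with a pair of toy estimators (both hypothetical stand-ins, not the chapter's `analysis_MLM()` and OLS functions):

```{r}
library(purrr)
library(tibble)

# Toy stand-ins: a fancy-but-fragile estimator that errors on degenerate
# data, and a simple estimator to fall back on (both hypothetical).
fancy_estimator  <- function(dat) {
  if (var(dat$x) == 0) stop("model did not converge")
  mean(dat$x)
}
simple_estimator <- function(dat) median(dat$x)

# Trap errors so a failed fit returns NA instead of crashing the simulation
safe_fancy <- possibly(fancy_estimator, otherwise = NA)

estimate_contingent <- function(dat) {
  est <- safe_fancy(dat)
  if (is.na(est)) {
    # Fall back to the simple method when the fancy one errors out
    tibble(estimate = simple_estimator(dat), method = "fallback")
  } else {
    tibble(estimate = est, method = "fancy")
  }
}

estimate_contingent(data.frame(x = c(2, 4, 6)))   # fancy path
estimate_contingent(data.frame(x = c(5, 5, 5)))   # fall-back path
```

The `method` column records which branch was taken, so later performance summaries can account for how often the fall-back was triggered.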
+ + ## Exercises + ### More Heteroskedastic ANOVA {#BFFs-forever} In the classic simulation by Brown and Forsythe (1974), they not only looked at the performance of the homoskedastic ANOVA-F test and Welch's heteroskedastic-F test, they also proposed their own new hypothesis testing procedure. @@ -573,6 +653,7 @@ In the classic simulation by Brown and Forsythe (1974), they not only looked at 4. The [`onewaytests` package](https://cran.r-project.org/package=onewaytests) implements a variety of different hypothesis testing procedures for one-way ANOVA. Validate your `Welch_ANOVA_F()` function by comparing the results to the output of the relevant functions from `onewaytests`. + ### Contingent testing {#contingent-testing} In the one-way ANOVA problem, one approach that an analyst might think to take is to conduct a preliminary significance test for heterogeneity of variances (such as Levene's test or Bartlett's test), and then report the $p$-value from the homoskedastic ANOVA F test if variance heterogeneity is not detected but the $p$-value from the BFF\* test if variance heterogeneity is detected. @@ -602,6 +683,7 @@ For each of these tests, you will need to figure out the appropriate syntax by r 3. For `analysis_agg()`, check the output by aggregating the data to the school-level, fitting the linear model using `lm()`, and computing standard errors using `vcovHC()` from the `sandwich` package. + ### Extending the cluster-RCT functions {#CRT-ANCOVA-estimators} Exercise \@ref(cluster-RCT-baseline) from Chapter \@ref(data-generating-processes) asked you to extend the data-generating function for the cluster-randomized trial to include generating a student-level covariate, $X$, that is predictive of the outcome. @@ -621,6 +703,7 @@ Use your modified function to generate a dataset. ``` Hint: Check out the `reformulate()` function, which makes it easy to build formulas for different sets of predictors.
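To illustrate the hint, here is a quick base-R sketch of `reformulate()` in action; the variable names (`Yobs`, `Z`, `X`) follow the cluster-RCT example:

```{r}
# reformulate() builds a formula object from character vectors of term
# labels, which makes it easy to fit models with or without the covariate X:
f_simple <- reformulate("Z", response = "Yobs")
f_ancova <- reformulate(c("Z", "X"), response = "Yobs")
f_simple
f_ancova
```

Because the term labels are ordinary character strings, the set of predictors can be passed around as a function argument and assembled into a formula on the fly.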
+ ### Contingent estimator processing {#contingent-estimator-processing} In Section \@ref(adapting-for-errors) we developed a version of `analysis_MLM()` that fell back on OLS regression in the event that `lmer()` produced any error. @@ -634,6 +717,7 @@ Handle the contingency where all clusters are singletons by skipping the m 2. Revise `estimate_Tx_Fx()` to use your new version of `analysis_MLM_contingent()`. The revised function will sometimes return `NA` values for the `MLM` results. To implement the strategy of falling back on OLS regression, add some code that replaces any `NA` values with corresponding results of `analysis_OLS()`. Test that the function is working as expected. + ### Estimating 3-parameter item response theory models {#IRT-3PL-estimation} Exercise \@ref(IRT-DGP-parameters) asked you to write a data-generating function for the 3-parameter IRT model described in Section \@ref(three-parameter-IRT). diff --git a/035-running-simulation.Rmd b/035-running-simulation.Rmd index 4660fb6..8ee5dfd 100644 --- a/035-running-simulation.Rmd +++ b/035-running-simulation.Rmd @@ -155,7 +155,7 @@ Once our simulation is complete, we save our results to a file so that we can av We now have results for each of our estimation methods applied to each of 1000 generated datasets. The next step is to evaluate how well the estimators did. For example, we will want to examine questions about bias, precision, and accuracy of the three point estimators. -In Chapter \@ref(performance-criteria), we will look systematically at ways to quantify the performance of estimation methods. +In Chapter \@ref(performance-measures), we will look systematically at ways to quantify the performance of estimation methods.
### Reparameterizing {#one-run-reparameterization} @@ -165,7 +165,7 @@ Here is a revised version of `one_run()`, which also renames some of the more ob ```{r revised_CRT, eval=FALSE} one_run <- function( n_bar = 30, J = 20, ATE = 0.3, size_coef = 0.5, - ICC = 0.4, alpha = 0.75 + ICC = 0.2, alpha = 0.75 ) { stopifnot( ICC >= 0 && ICC < 1 ) @@ -227,9 +227,10 @@ sim_cluster_RCT <- bundle_sim( gen_cluster_RCT, estimate_Tx_Fx, id = "runID" ) ``` We can call the newly created function like so: ```{r, message=FALSE} -sim_cluster_RCT( 2, - n_bar = 30, J = 20, gamma_1 = ATE, - sigma2_u = 0.3, sigma2_e = 0.7 ) +sim_cluster_RCT( reps = 2, + J = 20, n_bar = 30, alpha = 0.5, + gamma_1 = ATE, gamma_2 = 0.5, + sigma2_u = 0.2, sigma2_e = 0.8 ) ``` Again, `bundle_sim()` produces a function with input names that exactly match the inputs of the DGP function that we give it. It is not possible to re-parameterize or change argument names, as we did with `one_run()` in Section \@ref(one-run-reparameterization). @@ -237,9 +238,10 @@ See Exercise \@ref(reparameterization-redux) for further discussion of this limi To use the simulation function in practice, we call it by specifying the number of replications desired (which we have stored in `R`) and any relevant input parameters. ```{r, eval=FALSE} -runs <- sim_cluster_RCT( reps = R, - n_bar = 30, J = 20, gamma_1 = ATE, - sigma2_u = 0.3, sigma2_e = 0.7 ) +runs <- sim_cluster_RCT( reps = R, + J = 20, n_bar = 30, alpha = 0.5, + gamma_1 = ATE, gamma_2 = 0.5, + sigma2_u = 0.2, sigma2_e = 0.8 ) saveRDS( runs, file = "results/cluster_RCT_simulation.rds" ) ``` The `bundle_sim()` function is just a convenient way to create a function that pieces together the steps in the simulation process, which is especially useful when the component functions include many input parameters.
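Conceptually, `bundle_sim()` automates something like the following hand-rolled sketch. The toy functions here are our own hypothetical stand-ins, not the actual `gen_cluster_RCT()` and `estimate_Tx_Fx()` or the real implementation:

```{r}
library(purrr)

# Hypothetical stand-ins for a DGP function and an analysis function:
gen_toy <- function(n = 10, mu = 0) data.frame(y = rnorm(n, mean = mu))
est_toy <- function(dat) data.frame(mu_hat = mean(dat$y))

# A hand-rolled version of what bundle_sim() automates: generate, analyze,
# repeat, and stack results with a replication ID column.
bundle_toy_sim <- function(reps, ...) {
  map_dfr(seq_len(reps), \(i) est_toy(gen_toy(...)), .id = "runID")
}

set.seed(1)
runs_toy <- bundle_toy_sim(reps = 3, n = 100, mu = 5)
runs_toy
```

The `...` argument is what lets the bundled function accept whatever inputs the DGP function takes, which is why the argument names of the bundled function must match those of the DGP.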
diff --git a/040-Performance-criteria.Rmd b/040-Performance-criteria.Rmd index 3b12004..2bc51a0 100644 --- a/040-Performance-criteria.Rmd +++ b/040-Performance-criteria.Rmd @@ -17,13 +17,13 @@ cluster_RCT_res <- ``` -# Performance metrics {#performance-criteria} +# Performance Measures {#performance-measures} Once we run a simulation, we end up with a pile of results to sort through. For example, Figure \@ref(fig:CRT-ATE-hist) depicts the distribution of average treatment effect estimates from the cluster-randomized experiment simulation, which we generated in Chapter \@ref(running-the-simulation-process). There are three different estimators, each with 1000 replications. Each histogram is an approximation of the _sampling distribution_ of the estimator, meaning its distribution across repetitions of the data-generating process. -With results such as these, the question before us is now how to evaluate how well these procedures worked. And, if we are comparing several different estimators, how do we determine which ones work better or worse than others? In this chapter, we look at a variety of __performance metrics__ that can answer these questions. +With results such as these, the question before us is now how to evaluate how well each procedure works. If we are comparing several different estimators, we also need to determine which ones work better or worse than others. In this chapter, we look at a variety of __performance measures__ that can answer these questions. ```{r CRT-ATE-hist} #| fig.width: 8 @@ -49,46 +49,46 @@ ggplot(cluster_RCT_res) + theme(legend.position = "none") ``` -Performance metrics are summaries of a sampling distribution that describe how an estimator or data analysis procedure behaves on average if we could repeat the data-generating process an infinite number of times. 
+Performance measures are summaries of a sampling distribution that describe how an estimator or data analysis procedure behaves on average if we could repeat the data-generating process an infinite number of times. For example, the bias of an estimator is the difference between the average value of the estimator and the corresponding target parameter. Bias measures the central tendency of the sampling distribution, capturing how far off, on average, the estimator would be from the true parameter value if we repeated the data-generating process an infinite number of times. In Figure \@ref(fig:CRT-ATE-hist), black dashed lines mark the true average treatment effect of 0.3 and the colored vertical lines with circles at the end mark the means of the estimators. The distance between the colored lines and the black dashed lines corresponds to the bias of the estimator. This distance is nearly zero for the aggregation estimator and the multilevel model estimator, but larger for the linear regression estimator. -Different types of data-analysis results produce different types of information, and so the relevant set of performance metrics depends on the type of data analysis result under evaluation. -For procedures that produce point estimates or point predictions, conventional performance metrics include bias, variance, and root mean squared error. -If the point estimates come with corresponding standard errors, then we may also want to evaluate how accurately the standard errors represent the true uncertainty of the point estimators; conventional performance metrics for capturing this include the relative bias and relative root mean squared error of the variance estimator. -For procedures that produce confidence intervals or other types of interval estimates, conventional performance metrics include the coverage rate and average interval width. 
-Finally, for inferential procedures that involve hypothesis tests (or more generally, classification tasks), conventional performance metrics include Type I error rates and power. -We describe each of these metrics in Sections \@ref(assessing-point-estimators) through \@ref(assessing-inferential-procedures). +Different types of data-analysis results produce different types of information, and so the relevant set of performance measures depends on the type of data analysis result we are studying. +For procedures that produce point estimates or point predictions, conventional performance measures include bias, variance, and root mean squared error. +If the point estimates come with corresponding standard errors, then we may also want to evaluate how accurately the standard errors represent the true uncertainty of the point estimators; conventional performance measures for capturing this include the relative bias and relative root mean squared error of the variance estimator. +For procedures that produce confidence intervals or other types of interval estimates, conventional performance measures include the coverage rate and average interval width. +Finally, for inferential procedures that involve hypothesis tests (or more generally, classification tasks), conventional performance measures include Type I error rates and power. +We describe each of these measures in Sections \@ref(assessing-point-estimators) through \@ref(assessing-inferential-procedures). -Performance metrics are defined with respect to sampling distributions, or the results of applying a data analysis procedure to data generated according to a particular process across an infinite number of replications. -In defining specific metrics, we will use conventional statistical notation for the means, variances, and other moments of the sampling distribution. 
-Specifically, we will use the expectation operator $\E()$ to denote the mean of a sampling distribution, $\M()$ to denote the median of a sampling distribution, $\Var()$ to denote the variance of a sampling distribution, and $\Prob()$ to denote probabilities of specific outcomes with respect to the sampling distribution. -We will use $\Q_p()$ to denote the $p^{th}$ quantile of a distribution, which is the value $x$ such that $\Prob(T \leq x) = p$. With this notation, the median is equivalent to $\M() = \Q_{0.5}()$. +Performance measures are defined with respect to sampling distributions, or the results of applying a data analysis procedure to data generated according to a particular process across an infinite number of replications. +In defining specific measures, we will follow statistical conventions to denote the mean, variance, and other moments of the sampling distribution. +For a random variable $T$, we will use the expectation operator $\E(T)$ to denote the mean of the sampling distribution of $T$, $\M(T)$ to denote the median of its sampling distribution, $\Var(T)$ to denote the variance of its sampling distribution, and $\Prob()$ to denote probabilities of specific outcomes with respect to its sampling distribution. +We will use $\Q_p(T)$ to denote the $p^{th}$ quantile of a distribution, which is the value $x$ such that $\Prob(T \leq x) = p$. With this notation, the median of a continuous distribution is equivalent to the 0.5 quantile: $\M(T) = \Q_{0.5}(T)$. -For some simple combinations of data-generating processes and data analysis procedures, it may be possible to derive exact mathematical formulas for calculating some performance metrics (such as exact mathematical expressions for the bias and variance of the linear regression estimator). 
+For some simple combinations of data-generating processes and data analysis procedures, it may be possible to derive exact mathematical formulas for calculating some performance measures (such as exact mathematical expressions for the bias and variance of the linear regression estimator). But for many problems, the math is difficult or intractable---that is why we do simulations in the first place. -Simulations do not produce the _exact_ sampling distribution or give us _exact_ values of performance metrics. -Instead, simulations yield _samples_---usually large samples---from the the sampling distribution, and we can use these to compute _estimates_ of the performance metrics of interest. -In Figure \@ref(fig:CRT-ATE-hist), we calculated the bias of each estimator by taking the mean of 1000 observations from its sampling distribution; if we were to repeat the whole set of calculations (with a different seed), then our bias results would shift slightly. +Simulations do not produce the _exact_ sampling distribution or give us _exact_ values of performance measures. +Instead, simulations generate _samples_ (usually large samples) from the sampling distribution, and we can use these to compute _estimates_ of the performance measures of interest. +In Figure \@ref(fig:CRT-ATE-hist), we calculated the bias of each estimator by taking the mean of 1000 observations from its sampling distribution. If we were to repeat the whole set of calculations (with a different seed), then our bias results would shift slightly because they are imperfect estimates of the actual bias. -In working with simulation results, it is important to keep track of the degree of uncertainty in performance metric estimates. +In working with simulation results, it is important to keep track of the degree of uncertainty in performance measure estimates.
We call such uncertainty _Monte Carlo error_ because it is the error arising from using a finite number of replications of the Monte Carlo simulation process. One way to quantify it is with the _Monte Carlo standard error (MCSE)_, or the standard error of a performance estimate based on a finite number of replications. -Just as when we analyze real data, we can apply statistical techniques to estimate the MCSE and even to generate confidence intervals for performance metrics. +Just as when we analyze real data, we can apply statistical techniques to estimate the MCSE and even to generate confidence intervals for performance measures. -The size of an MCSE is driven by how many replications we use: if we only use a few, we will have noisy estimates of performance with large MCSEs; if we use millions of replications, the MCSE will be tiny. +The magnitude of the MCSE is driven by how many replications we use: if we only use a few, we will have noisy estimates of performance with large MCSEs; if we use millions of replications, the MCSE will usually be tiny. It is important to keep in mind that the MCSE is not measuring anything about how a data analysis procedure performs in general. It only describes how precisely we have approximated a performance criterion; it is an artifact of how we conducted the simulation. Moreover, MCSEs are under our control. Given a desired MCSE, we can determine how many replications we would need to ensure our performance estimates have the specified level of precision. Section \@ref(MCSE) provides details about how to compute MCSEs for conventional performance measures, along with some discussion of general techniques for computing MCSE for less conventional measures. -## Metrics for Point Estimators {#assessing-point-estimators} +## Measures for Point Estimators {#assessing-point-estimators} -The most common performance measures used to assess a point estimator are bias, variance, mean squared error, and root mean squared error.
+The most common performance measures used to assess a point estimator are bias, variance, and root mean squared error. Bias compares the mean of the sampling distribution to the target parameter. Positive bias implies that the estimator tends to systematically overstate the quantity of interest, while negative bias implies that it systematically understates the quantity of interest. If bias is zero (or nearly zero), we say that the estimator is unbiased (or approximately unbiased). @@ -100,7 +100,7 @@ In making comparisons of several different estimators, one with lower RMSE is us If two estimators have comparable RMSE, then the estimator with lower bias would usually be preferable. -To define these quantities more precisely, let's consider a generic estimator $T$ that is targeting a parameter $\theta$. +To define these quantities more precisely, let us consider a generic estimator $T$ that is targeting a parameter $\theta$. We call the target parameter the _estimand_. In most cases, when running our simulation we set the estimand $\theta$ and then generate a (typically large) series of $R$ datasets, for each of which $\theta$ is the true target parameter. We then analyze each dataset, obtaining a sample of estimates $T_1,...,T_R$. @@ -121,46 +121,43 @@ $$ $$ When conducting a simulation, we do not compute these performance measures directly but rather must estimate them using the replicates $T_1,...,T_R$ generated from the sampling distribution. -There's nothing very surprising about how we construct estimates of the performance measures. +There is nothing very surprising about how we construct estimates of the performance measures. It is just a matter of substituting sample quantities in place of the expectations and variances. Specifically, we estimate bias by taking $$ \widehat{\Bias}(T) = \bar{T} - \theta, (\#eq:bias-estimator) $$ -where $\bar{T}$ is the arithmetic mean of the replicates, -$$ -\bar{T} = \frac{1}{R}\sum_{r=1}^R T_r.
-$$ +where $\bar{T}$ is the arithmetic mean of the replicates, $\bar{T} = \frac{1}{R}\sum_{r=1}^R T_r$. We estimate variance by taking the sample variance of the replicates, as $$ S_T^2 = \frac{1}{R - 1}\sum_{r=1}^R \left(T_r - \bar{T}\right)^2. (\#eq:var-estimator) $$ -The square root of $S^2_T$, $S_T$ is an estimate of the true standard error of $T$, or the standard deviation of the estimator across an infinite set of replications of the data-generating process.[^SE-meaning] -We usually prefer to work with the true SE $S_T$ rather than the sampling variance $S_T^2$ because the former quantity has the same units as the target parameter. - -[^SE-meaning]: Generally, when people say "Standard Error" they actually mean _estimated_ Standard Error, ($\widehat{SE}$), as we would calculate in a real data analysis (where we have only a single realization of the data-generating process). It is easy to forget that this standard error is itself an estimate of a true parameter, and thus has its own uncertainty. - +$S_T$ (the square root of $S^2_T$) is an estimate of the empirical standard error of $T$, or the standard deviation of the estimator across an infinite set of replications of the data-generating process.[^SE-meaning] +We usually prefer to work with the empirical SE $S_T$ rather than the sampling variance $S_T^2$ because the former quantity has the same units as the target parameter. Finally, the RMSE estimate can be calculated as $$ \widehat{\RMSE}(T) = \sqrt{\frac{1}{R} \sum_{r = 1}^R \left( T_r - \theta\right)^2 }. (\#eq:rmse-estimator) $$ Often, people talk about the MSE (Mean Squared Error), which is just the square of RMSE. -Just like the true SE is usually easier to interpret than the sampling variance, units of RMSE are easier to interpret than the units of MSE. +Just as the empirical SE is usually easier to interpret than the sampling variance, units of RMSE are easier to interpret than the units of MSE.
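To make these formulas concrete, here is a small worked example in which simulated normal draws stand in for the replicates $T_1,...,T_R$ (the object names and the values chosen for `theta`, the mean, and the SD are purely illustrative):

```{r point-estimate-performance-demo}
# Illustrative estimates of bias, empirical SE, and RMSE,
# using simulated draws in place of real estimator replicates.
set.seed( 101 )
theta <- 0.30                                 # true target parameter
Tr <- rnorm( 1000, mean = 0.32, sd = 0.10 )   # stand-in replicates

bias_hat <- mean( Tr ) - theta                # estimated bias
SE_hat   <- sd( Tr )                          # estimated empirical SE
RMSE_hat <- sqrt( mean( (Tr - theta)^2 ) )    # estimated RMSE

# The MCSE of the bias estimate is S_T / sqrt(R):
MCSE_bias <- SE_hat / sqrt( length( Tr ) )
```

With $R = 1000$ replications, the estimated bias is around 0.02 with an MCSE of roughly 0.003, so the bias built into this illustration is clearly distinguishable from Monte Carlo noise.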
+ +[^SE-meaning]: Generally, when people say "Standard Error" they actually mean _estimated_ Standard Error ($\widehat{SE}$), as we would calculate in a real data analysis (where we have only a single realization of the data-generating process). It is easy to forget that this standard error is itself an estimate of a parameter---the true or empirical SE---and thus has its own uncertainty. + It is important to recognize that the above performance measures depend on the scale of the parameter. For example, if our estimators are measuring a treatment impact in dollars, then the bias, SE, and RMSE of the estimators are all in dollars. (The variance and MSE would be in dollars squared, which is why we take their square roots to put them back on the more interpretable scale of dollars.) -In many simulations, the scale of the outcome is an arbitrary feature of the data-generating process, making the absolute magnitude of performance metrics less meaningful. -To ease interpretation of performance metrics, it is useful to consider their magnitude relative to the baseline level of variation in the outcome. +In many simulations, the scale of the outcome is an arbitrary feature of the data-generating process, making the absolute magnitude of performance measures less meaningful. +To ease interpretation of performance measures, it is useful to consider their magnitude relative to the baseline level of variation in the outcome. One way to achieve this is to generate data so the outcome has unit variance (i.e., we generate outcomes in _standardized units_). -Doing so puts the bias, true standard error, and root mean-squared error on the scale of standard deviation units, which can facilitate interpretation about what constitutes a meaningfully large bias or a meaningful difference in RMSE.
+Doing so puts the bias, empirical standard error, and root mean squared error on the scale of standard deviation units, which can facilitate interpretation about what constitutes a meaningfully large bias or a meaningful difference in RMSE. -In addition to understanding the scale of these performance metrics, it is also important to recognize that their magnitude depends on the scale of the parameter. -A non-linear transformation of a parameter will generally lead to changes in the magnitude of the performance metrics. +In addition to understanding the scale of these performance measures, it is also important to recognize that their magnitude depends on the metric of the parameter. +A non-linear transformation of a parameter will generally lead to changes in the magnitude of the performance measures. For instance, suppose that $\theta$ measures the proportion of time that something occurs. One natural way to transform this parameter would be to put it on the log-odds (logit) scale. However, because the log-odds transformation is non-linear, @@ -168,8 +165,8 @@ $$ \text{Bias}\left[\text{logit}(T)\right] \neq \text{logit}\left(\text{Bias}[T]\right), \qquad \text{RMSE}\left[\text{logit}(T)\right] \neq \text{logit}\left(\text{RMSE}[T]\right), $$ and so on. -This is a consequence of how the performance metrics are defined. -One might see this property as a limitation on the utility of using bias and RMSE to measure the performance of an estimator, because these metrics can be quite sensitive to the scale of the parameter. +This is a consequence of how these performance measures are defined. +One might see this property as a limitation on the utility of using bias and RMSE to measure the performance of an estimator, because these measures can be quite sensitive to the metric of the parameter. 
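A quick numerical illustration of this non-commutativity (the beta distribution here is an arbitrary stand-in for the sampling distribution of an estimated proportion):

```{r logit-transformation-demo}
# Bias on the proportion scale does not carry over to the logit scale,
# because averaging does not commute with a non-linear transformation.
set.seed( 202 )
logit <- function( p ) log( p / (1 - p) )
theta <- 0.10
Tr <- rbeta( 100000, shape1 = 2, shape2 = 18 )  # mean 0.10, right-skewed

mean( Tr ) - theta                    # bias on the proportion scale: near zero
mean( logit( Tr ) ) - logit( theta )  # bias on the logit scale: clearly negative
```

Even though $T$ is unbiased for $\theta$ in this setup, $\text{logit}(T)$ is biased for $\text{logit}(\theta)$.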
### Comparing the Performance of the Cluster RCT Estimation Procedures {#clusterRCTperformance} @@ -177,16 +174,16 @@ One might see this property as a limitation on the utility of using bias and RMS runs <- readRDS( file = "results/cluster_RCT_simulation.rds" ) ``` -We now demonstrate the calculation of performance metrics for the point estimators of average treatment effects in the cluster-RCT example. +We now demonstrate the calculation of performance measures for the point estimators of average treatment effects in the cluster-RCT example. In Chapter \@ref(running-the-simulation-process), we generated a large set of replications of several different treatment effect estimators. Using these results, we can assess the bias, standard error, and RMSE of three different estimators of the ATE. -These performance metrics address the following questions: +These performance measures address the following questions: - Is the estimator systematically off? (bias) - Is it precise? (standard error) - Does it predict well? (RMSE) -Let us see how the three estimators compare on these metrics. +Let us see how the three estimators compare on these measures. #### Are the estimators biased? {-} @@ -194,7 +191,7 @@ Bias is defined with respect to a target estimand. Here we assess whether our estimates are systematically different from the $\gamma_1$ parameter, which we defined in standardized units by setting the standard deviation of the student-level distribution of the outcome equal to one. For these data, we generated data based on a school-level ATE parameter of 0.30 SDs. -```{r cluster_bias} +```{r cluster-bias} ATE <- 0.30 runs %>% @@ -216,11 +213,11 @@ This example illustrates how crucial it is to think carefully about the appropri #### Which method has the smallest standard error? {-} -The true standard error measures the degree of variability in a point estimator. +The empirical standard error measures the degree of variability in a point estimator. 
It reflects how stable our estimates are across replications of the data-generating process. We calculate the standard error by taking the standard deviation of the replications of each estimator. -For purposes of interpretation, it is useful to compare the true standard errors to the variation in a benchmark estimator. -Here, we treat the linear regression estimator as the benchmark and compute the magnitude of the true SEs of each method _relative_ to the SE of the linear regression estimator: +For purposes of interpretation, it is useful to compare the empirical standard errors to the variation in a benchmark estimator. +Here, we treat the linear regression estimator as the benchmark and compute the magnitude of the empirical SEs of each method _relative_ to the SE of the linear regression estimator: ```{r} true_SE <- @@ -233,7 +230,7 @@ true_SE ``` In a real data analysis, these standard errors are what we would be trying to approximate with a standard error estimator. -Aggregation and multi-level modeling have SEs about 8% smaller than Linear Regression. +Aggregation and multi-level modeling have SEs about 8% smaller than linear regression. For these data-generating conditions, aggregation and multi-level modeling are preferable to linear regression because they are more precise. #### Which method has the smallest Root Mean Squared Error? {-} @@ -254,21 +251,23 @@ runs %>% ) ``` -We also include SE and bias as points of reference. +We also include SE and bias for ease of reference. RMSE takes into account both bias and variance. For aggregation and multi-level modeling, the RMSE is the same as the standard error, which makes sense because these estimators are not biased. For linear regression, the combination of bias plus increased variability yields a higher RMSE, with the standard error dominating the bias term (note how RMSE and SE are more similar than RMSE and bias). 
-The difference between the estimators are pronounced because RMSE is the square root of the _squared_ bias and _squared_ standard errors. +The differences between the estimators are pronounced because RMSE is the square root of the _squared_ bias and _squared_ standard error. Overall, aggregation and multi-level modeling have RMSEs around 17% smaller than linear regression---a consequential difference in accuracy. -### Less Conventional Performance metrics +### Less Conventional Performance Measures {#less-conventional-measures} -Depending on the model and estimation procedures being examined, a range of different metrics might be used to assess estimator performance. -For point estimation, we have introduced bias, variance and MSE as the three core measures of performance. +Depending on the model and estimation procedures being examined, a range of different measures might be used to assess estimator performance. +For point estimation, we have introduced bias, variance, and RMSE as three core measures of performance. However, all of these measures are sensitive to outliers in the sampling distribution. Consider an estimator that generally does well, except for an occasional large mistake. Because conventional measures are based on arithmetic averages, they will indicate that the estimator performs very poorly overall. -Other metrics exist, such as the median bias and the median absolute deviation of $T$, which are less sensitive to outliers in the sampling distribution compared to the conventional metrics. +Other measures such as the median bias and the median absolute deviation of $T$ are less sensitive to outliers in the sampling distribution compared to the conventional measures. +Estimating these measures will involve calculating sample quantiles of $T_1,...,T_R$, which are functions of the sample ordered from smallest to largest. +We will denote the $r^{th}$ order statistic as $T_{(r)}$ for $r = 1,...,R$.
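The robust measures developed in the rest of this subsection all reduce to one-line computations on the replicates. As a running sketch, suppose the replicates come from a sampling distribution that is well-behaved except for occasional wild errors (the contaminated-normal draws and object names below are purely illustrative):

```{r robust-measures-demo}
# A stand-in sampling distribution: mostly standard normal draws,
# contaminated by a 5% component with very large errors.
set.seed( 303 )
theta <- 0
Tr <- c( rnorm( 950, mean = theta, sd = 1 ),
         rnorm(  50, mean = theta, sd = 20 ) )

mean( Tr ) - theta                # conventional bias estimate (noisy)
median( Tr ) - theta              # median bias estimate
mean( Tr, trim = 0.1 ) - theta    # 10%-trimmed-mean bias estimate
median( abs( Tr - theta ) )       # median absolute error (MAE) estimate

# Winsorized bias with w = 2 inter-quartile ranges:
q <- quantile( Tr, c( 0.25, 0.75 ) )
L_w <- q[[1]] - 2 * diff( q )
U_w <- q[[2]] + 2 * diff( q )
mean( pmin( pmax( Tr, L_w ), U_w ) ) - theta
```

Re-running this sketch with different seeds would show that the median, trimmed-mean, and winsorized bias estimates bounce around far less than the raw mean does under this contaminated distribution.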
Median bias is an alternative measure of the central tendency of a sampling distribution. Positive median bias implies that more than 50% of the sampling distribution exceeds the quantity of interest, while negative median bias implies that more than 50% of the sampling distribution falls below the quantity of interest. @@ -277,90 +276,91 @@ $$ \text{Median-Bias}(T) = \M(T) - \theta. (\#eq:median-bias) $$ -An estimator of median bias is computed using the sample median of $T_1,...,T_R$. +An estimator of median bias is computed using the sample median, as +$$ +\widehat{\text{Median-Bias}}(T) = M_T - \theta +(\#eq:sample-median-bias) +$$ +where $M_T = T_{((R+1)/2)}$ if $R$ is odd or $M_T = \frac{1}{2}\left(T_{(R/2)} + T_{(R/2+1)}\right)$ if $R$ is even. -Another robust measure of central tendency is based on winsorizing the sampling distribution, or truncating all errors larger than a certain maximum size. +Another robust measure of central tendency uses the $p \times 100\%$-trimmed mean, which ignores the estimates in the lowest and highest $p$-quantiles of the sampling distribution. +Formally, the trimmed-mean bias is +$$ +\text{Trimmed-Bias}(T; p) = \E\left[ T \left| \Q_{p}(T) < T < \Q_{(1 - p)}(T) \right.\right] - \theta. +(\#eq:trimmed-bias) +$$ +Median bias is thus the limiting case of trimmed-mean bias as $p$ approaches 0.5. +To estimate the trimmed bias, we take the mean of the middle $1 - 2p$ fraction of the distribution: +$$ +\widehat{\text{Trimmed-Bias}}(T; p) = \tilde{T}_{\{p\}} - \theta, (\#eq:sample-trimmed-bias) +$$ +where +$$ +\tilde{T}_{\{p\}} = \frac{1}{(1 - 2p)R} \sum_{r=pR + 1}^{(1-p)R} T_{(r)}. +$$ +For a symmetric sampling distribution, trimmed-mean bias will be the same as the conventional (mean) bias, but its estimator $\tilde{T}_{\{p\}}$ will be less affected by outlying values (i.e., values of $T$ very far from the center of the distribution) compared to $\bar{T}$.
+However, if a sampling distribution is not symmetric, trimmed-mean bias becomes a distinct performance measure, which puts less emphasis on large errors compared to the conventional bias measure. + +A further robust measure of central tendency is based on winsorizing the sampling distribution, or truncating all errors larger than a certain maximum size. Using a winsorized distribution amounts to arguing that you don't care about errors beyond a certain size, so anything beyond a certain threshold will be treated the same as if it were exactly on the threshold. The threshold for truncation is usually defined relative to the first and third quartiles of the sampling distribution, along with a given span of the inter-quartile range. The thresholds for truncation are taken as $$ -L_w = \Q_{0.25}(T) - w \times (\Q_{0.75}(T) - \Q_{0.25}(T)) \quad \text{and} \quad U_w = \Q_{0.75}(T) + w \times (\Q_{0.75}(T) - \Q_{0.25}(T)), +\begin{aligned} +L_w &= \Q_{0.25}(T) - w \times (\Q_{0.75}(T) - \Q_{0.25}(T)) \\ +U_w &= \Q_{0.75}(T) + w \times (\Q_{0.75}(T) - \Q_{0.25}(T)), +\end{aligned} $$ where $\Q_{0.25}(T)$ and $\Q_{0.75}(T)$ are the first and third quartiles of the distribution of $T$, respectively, and $w$ is the number of inter-quartile ranges beyond which an observation will be treated as an outlier.[^fences] -Let $T^{(w)} = \min\{\max\{T, L_w\}, U_w\}$. +Let $X = \min\{\max\{T, L_w\}, U_w\}$.
+The winsorized bias, variance, and RMSE are then defined using winsorized values in place of the raw values of $T$, as $$ \begin{aligned} -\Var\left(T^{(w)}\right) &= \E\left[\left(T^{(w)} - \E (T^{(w)})\right)^2 \right], \\ -\RMSE\left(T^{(w)}\right) &= \sqrt{\E\left[\left(T^{(w)} - \theta\right)^2 \right]}. +\text{Bias}(X) &= \E\left(X\right) - \theta, \\ +\Var\left(X\right) &= \E\left[\left(X - \E(X)\right)^2 \right], \\ +\RMSE\left(X\right) &= \sqrt{\E\left[\left(X - \theta\right)^2 \right]}. \end{aligned} (\#eq:winsorized-variance-RMSE) $$ - - -To compute estimates of the winsorized performance criteria, we substitute sample quantiles in place of $\Q_{0.25}(T)$ and $\Q_{0.25}(T)$ to get estimated thresholds, $\hat{L}_w$ and $\hat{U}_w$, find $\hat{T}_r^{(w)} = \min\{\max\{T_r, \hat{L}_w\}, \hat{U}_w\}$, and compute the sample performance metrics using Equations \@ref(eq:bias-estimator), \@ref(eq:var-estimator), and \@ref(eq:rmse-estimator), but with $\hat{T}_r^{(w)}$ in place of $T_r$. +To compute estimates of the winsorized performance criteria, we substitute sample quantiles $T_{(R/4)}$ and $T_{(3R/4)}$ in place of $\Q_{0.25}(T)$ and $\Q_{0.75}(T)$, respectively, to get estimated thresholds, $\hat{L}_w$ and $\hat{U}_w$, find $\hat{X}_r = \min\{\max\{T_r, \hat{L}_w\}, \hat{U}_w\}$, and compute the sample performance measures using Equations \@ref(eq:bias-estimator), \@ref(eq:var-estimator), and \@ref(eq:rmse-estimator), but with $\hat{X}_r$ in place of $T_r$. [^fences]: For a normally distributed sampling distribution, the interquartile range is `r round(diff(qnorm(c(0.25, 0.75))),2)` SD; with $w = 2$, the lower and upper thresholds would then fall at $\pm `r round(qnorm(0.75) + 2 * diff(qnorm(c(0.25, 0.75))),2)`$ SD, or the $`r round(100 * pnorm(qnorm(0.25) - 2 * diff(qnorm(c(0.25, 0.75)))),2)`^{th}$ and $`r round(100 * pnorm(qnorm(0.75) + 2 * diff(qnorm(c(0.25, 0.75)))),2)`^{th}$ percentiles.
Still assuming a normal sampling distribution, taking $w = 2.5$ will mean that the thresholds fall at the $`r round(100 * pnorm(qnorm(0.25) - 2.5 * diff(qnorm(c(0.25, 0.75)))),3)`^{th}$ and $`r round(100 * pnorm(qnorm(0.75) + 2.5 * diff(qnorm(c(0.25, 0.75)))),3)`^{th}$ percentiles. -A further robust measure of central tendency used the $p \times 100\%$-trimmed mean, which ignores the estimates in the lowest and highest $p$-quantiles of the sampling distribution. -Formally, the trimmed-mean bias is -$$ -\text{Trimmed-Bias}(T; p) = \E\left[ T \left| \Q_{p}(T) < T < \Q_{(1 - p)}(T) \right.\right] - \theta. -(\#eq:trimmed-bias) -$$ -Median bias is thus a special case of trimmed mean bias, with $p = 0.5$. -To estimate the trimmed bias, we use sample quantiles $\hat{Q}_p$ and $\hat{Q}_{(1 - p)}$ and take the mean of the middle $1 - 2p$ fraction of the distribution -$$ -\widehat{\text{Trimmed-Bias}}(T; p) = \frac{1}{(1 - 2p)R} \sum_{r=1}^R T_r \times I\left(\hat{Q}_{p} < T < \hat{Q}_{(1 - p)}\right) - \theta. (\#eq:sample-trimmed-bias) -$$ -For a symmetric sampling distribution, winsorized bias and trimmed-mean bias will be the same as the conventional (mean) bias, but will be less affected by outlying values (i.e., values of $T$ very far from the center of the distribution). -However, if a sampling distribution is not symmetric, winsorized bias and trimmed-mean bias become distinct performance measures, which put less emphasis on large errors compared to the conventional bias metric. - -Alternative measures of the overall accuracy of an estimator can also be defined along similar lines. +Alternative measures of the overall accuracy of an estimator can also be defined using quantiles. For instance, an alternative to RMSE is to use the median absolute error (MAE), defined as $$ -\text{MAE} = \M\left(\left|T - \theta\right|\right), +\text{MAE} = \M\left(\left|T - \theta\right|\right). 
(\#eq:MAE) $$ -which can be estimated by taking the sample median $|T_1 - \theta|, |T_2 - \theta|, ..., |T_R - \theta|$. +Letting $E_r = |T_r - \theta|$, the MAE can be estimated by taking the sample median of $E_1,...,E_R$. Many other robust measures of the spread of the sampling distribution are also available, including the Rousseeuw-Croux scale estimator $Q_n$ [@Rousseeuw1993alternatives] and the biweight midvariance [@Wilcox2022introduction]. -@Maronna2006robust provide a useful introduction to these metrics and robust statistics more broadly. +@Maronna2006robust provide a useful introduction to these measures and robust statistics more broadly. The `robustbase` package [@robustbase] provides functions for calculating many of these robust statistics. -## Metrics for Standard Error Estimators +## Measures for Variance Estimators -Statistics is concerned not only with how to estimate things, but also with assessing how good an estimate is---that is, understanding the extent of uncertainty in estimates of target parameters. -These concerns apply for Monte Carlo simulation studies as well. +Statistics is concerned not only with how to estimate things, but also with understanding the extent of uncertainty in estimates of target parameters. +These concerns apply in Monte Carlo simulation studies as well. In a simulation, we can simply compute an estimator's actual properties. When we use an estimator with real data, we need to _estimate_ its associated standard error and generate confidence intervals and other assessments of uncertainty. To understand if these uncertainty assessments work in practice, we need to evaluate not only the behavior of the estimator itself, but also the behavior of these associated quantities.
- -Commonly used metrics for quantifying the performance of estimated standard errors include relative bias, relative standard error, and relative root mean squared error. -These metrics are defined in relative terms (rather than absolute ones) by comparing their magnitude to the _true_ degree of uncertainty. - - - -Typically, performance metrics are computed for _variance_ estimators rather than standard error estimators. + +Commonly used measures for quantifying the performance of estimated standard errors include relative bias, relative standard error, and relative root mean squared error. +These measures are defined in relative terms (rather than absolute ones) by comparing their magnitude to the _true_ degree of uncertainty. +Typically, performance measures are computed for _variance_ estimators rather than standard error estimators. There are a few reasons for working with variance rather than standard error. -First, in practice, so-called unbiased standard errors usually are not in fact actually unbiased.[^Jokes-on-us] +First, in practice, so-called unbiased standard errors usually are not actually unbiased.[^Jokes-on-us] For linear regression, for example, the classic estimator of the error variance is unbiased, but the corresponding standard error estimator is not exactly unbiased because $$ \E[ \sqrt{ V } ] \neq \sqrt{ \E[ V ] }. $$ -Variance is also the metric that gives us the bias-variance decomposition of $MSE = Variance + Bias^2$. Thus, if we are trying to determine whether MSE is due to instability or systematic bias, operating in this squared space may be preferable. +Variance is also the measure that gives us the bias-variance decomposition of Equation \@ref(eq:RMSE-decomposition). Thus, if we are trying to determine whether MSE is due to instability or systematic bias, operating in this squared space may be preferable.
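The inequality above is easy to verify numerically. The following sketch is our own illustration (not part of the case-study code): it draws variance estimates that are exactly unbiased from a scaled chi-squared distribution and shows that taking square roots yields a downwardly biased standard error.

```r
# Variance estimates that are unbiased for a true variance of 1:
# V ~ chi-squared(9) / 9, so E(V) = 1, but E(sqrt(V)) < 1 by Jensen's inequality.
set.seed(20)
V <- rchisq(1e5, df = 9) / 9
mean(V)        # very close to 1: V is unbiased for the variance
mean(sqrt(V))  # below 1: the implied standard error is biased downward
```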
[^Jokes-on-us]: See the delightfully titled section 11.5, "The Joke Is on Us: The Standard Deviation Estimator is Biased after All," in @westfall2013understanding for further discussion. To make this concrete, let us consider a generic standard error estimator $\widehat{SE}$ to go along with our generic estimator $T$ of target parameter $\theta$, and let $V = \widehat{SE}^2$. -The simulation yields a large sample of standard errors, $\widehat{SE}_1,...,\widehat{SE}_R$ and variance estimators $V_r = \widehat{SE}_r^2$ for $r = 1,...,R$. +We can simulate to obtain a large sample of standard errors, $\widehat{SE}_1,...,\widehat{SE}_R$, and variance estimators $V_r = \widehat{SE}_r^2$ for $r = 1,...,R$. Formally, the relative bias, standard error, and RMSE of $V$ are defined as $$ \begin{aligned} @@ -370,7 +370,7 @@ $$ \end{aligned} (\#eq:relative-bias-SE-RMSE) $$ -In contrast to performance metrics for $T$, we define these metrics in relative terms because the raw magnitude of $V$ is not a stable or interpretable parameter. +In contrast to performance measures for $T$, we define these measures in relative terms because the raw magnitude of $V$ is not a stable or interpretable parameter. Rather, the sampling distribution of $V$ will generally depend on many of the parameters of the data-generating process, including the sample size and any other design parameters. Defining bias in relative terms makes for a more interpretable measure: a value of 1 corresponds to exact unbiasedness of the variance estimator. Relative bias measures _proportionate_ under- or over-estimation. @@ -378,7 +378,7 @@ For example, a relative bias of 1.12 would mean the standard error was, on avera We discuss relative performance measures further in Section \@ref(sec-relative-performance). To estimate these relative performance measures, we proceed by substituting sample quantities in place of the expectations and variances.
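To make the substitution concrete, here is a minimal sketch of the calculation. This is our illustration, using simulated stand-ins for the estimates `T_r` and variance estimates `V_r`; the formal estimators defined in the text may differ in small details.

```r
# Hypothetical simulation output: point estimates T_r and variance estimates V_r
set.seed(11)
T_r <- rnorm(1000, mean = 0.5, sd = 0.2)
V_r <- 0.045 * rchisq(1000, df = 8) / 8

S_T2 <- var(T_r)  # sample variance of the estimates, in place of the true Var(T)
mean(V_r) / S_T2                    # relative bias (1 = exactly unbiased)
sd(V_r) / S_T2                      # relative standard error
sqrt(mean((V_r - S_T2)^2)) / S_T2   # relative root mean squared error
```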
-In contrast to the performance metrics for $T$, we will not generally be able to compute the true degree of uncertainty exactly. +In contrast to the performance measures for $T$, we will not generally be able to compute the true degree of uncertainty exactly. Instead, we must estimate the target quantity $\Var(T)$ using $S_T^2$, the sample variance of $T$ across replications. Denoting the arithmetic mean of the variance estimates as $$ @@ -408,10 +408,10 @@ Ideally, a variance estimator will have small relative bias, small relative stan ### Satterthwaite degrees of freedom Another more abstract measure of the stability of a variance estimator is its Satterthwaite degrees of freedom. -With some simple statistical methods such as two-sample $t$ tests, classical analysis of variance, and linear regression with homoskedastic errors, the variance estimator is computed by taking a sum of squares of normally distributed errors. -In these cases, the sampling distribution of the variance estimator is a multiple of a chi-squared distribution, with degrees of freedom corresponding to the number of independent observations used to compute the sum of squares. +For some simple statistical models such as classical analysis of variance and linear regression with homoskedastic errors, the variance estimator is computed by taking a sum of squares of normally distributed errors. +In such cases, the sampling distribution of the variance estimator is a multiple of a $\chi^2$ distribution, with degrees of freedom corresponding to the number of independent observations used to compute the sum of squares. In the context of analysis of variance problems, @Satterthwaite1946approximate described a method of approximating the variability of more complex statistics, involving linear combinations of sums of squares, by using a chi-squared distribution with a certain number of degrees of freedom.
-When applied to an arbitrary variance estimator $V$, these degrees of freedom can be interpreted as the number of independent observations going into a sum of squares that would lead to a variance estimator that is equally precise as $V$. +When applied to an arbitrary variance estimator $V$, these degrees of freedom can be interpreted as the number of independent, normally distributed errors going into a sum of squares that would lead to a variance estimator that is as precise as $V$. More succinctly, these degrees of freedom correspond to the number of independent observations used to estimate $V$. Following @Satterthwaite1946approximate, we define the degrees of freedom of $V$ as @@ -429,11 +429,11 @@ Even with more complex methods, the degrees of freedom are interpretable: higher ### Assessing SEs for the Cluster RCT Simulation -Returning to the cluster RCT example, we will assess whether our estimated SEs are about right by comparing the average _estimated_ (squared) standard error versus the true sampling variance. +Returning to the cluster RCT example, we will assess whether our estimated SEs are about right by comparing the average _estimated_ (squared) standard error with the empirical sampling variance. Our standard errors are _inflated_ if they are systematically larger than they should be, across the simulation runs. -We will also look at how stable our variance estimates are by comparing their standard deviation to the true sampling variance and by computing the Satterthwaite degrees of freedom. +We will also look at how stable our variance estimates are by comparing their standard deviation to the empirical sampling variance and by computing the Satterthwaite degrees of freedom.
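As a concrete illustration of the degrees-of-freedom calculation, here is our own sketch, based on the standard Satterthwaite moment relation $\nu = 2[\E(V)]^2 / \Var(V)$ with sample moments substituted in; the estimator defined in the text may include additional refinements.

```r
# Satterthwaite degrees of freedom estimated from simulated variance estimates,
# using df = 2 * E(V)^2 / Var(V) with sample moments in place of expectations.
satt_df <- function(V) 2 * mean(V)^2 / var(V)

# Sanity check: a variance estimator distributed as chi-squared(8) / 8
# has exactly 8 degrees of freedom.
set.seed(7)
V <- rchisq(1e5, df = 8) / 8
satt_df(V)  # close to 8
```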
-```{r calcuate_cluster_RCT_performance} +```{r calculate-cluster-RCT-performance} SE_performance <- runs %>% mutate( V = SE_hat^2 ) %>% @@ -450,32 +450,32 @@ SE_performance <- SE_performance ``` -The variance estimators for Agg and MLM appear to be a bit conservative on average, with relative bias of around `r round(SE_performance$rel_bias[1], 2)`, or about `r round(100 * SE_performance$rel_bias[1]) - 100`% higher than the true sampling variance. +The variance estimators for the aggregation and multilevel model estimators appear to be a bit conservative on average, with relative bias of around `r round(SE_performance$rel_bias[1], 2)`, or about `r round(100 * SE_performance$rel_bias[1]) - 100`% higher than the true sampling variance. The column labelled `rel_SE_V` reports how variable the variance estimators are relative to the true sampling variances of the estimators. The column labelled `df` reports the Satterthwaite degrees of freedom of each variance estimator. Both of these measures indicate that the linear regression variance estimator is less stable than the other methods, with around `r round(diff(SE_performance$df[2:1]))` fewer degrees of freedom. The linear regression method uses a cluster-robust variance estimator, which is known to be a bit unstable [@cameronPractitionerGuideClusterRobust2015]. Overall, it is a bad day for linear regression. -## Metrics for Confidence Intervals +## Measures for Confidence Intervals Some estimation procedures provide confidence intervals (or confidence sets), which are ranges of values, or interval estimators, that should include the true parameter value with a specified confidence level. For a 95% confidence level, the interval should include the true parameter in 95% of replications of the data-generating process.
However, with the exception of some simple methods and models, methods for constructing confidence intervals usually involve approximations and simplifying assumptions, so their actual coverage rate might deviate from the intended confidence level. We typically measure confidence interval performance along two dimensions: __coverage rate__ and __expected width__. -Suppose that the confidence interval is for the target parameter $\theta$ and has coverage level $\beta$ for $0 < \beta < 1$. +Suppose that the confidence interval is for the target parameter $\theta$ and has intended coverage level $\beta$ for $0 < \beta < 1$. Denote the lower and upper end-points of the $\beta$-level confidence interval as $A$ and $B$. $A$ and $B$ are random quantities---they will differ each time we compute the interval on a different replication of the data-generating process. The coverage rate of a $\beta$-level interval estimator is the probability that it covers the true parameter, formally defined as $$ -\text{Coverage}_\beta(A,B) = \Prob(A \leq \theta \leq B). +\text{Coverage}(A,B) = \Prob(A \leq \theta \leq B). (\#eq:coverage) $$ -For a well-performing interval estimator, $\text{Coverage}_\beta$ will at least $\beta$ and, ideally will not exceed $\beta$ by too much. +For a well-performing interval estimator, $\text{Coverage}$ will be at least $\beta$ and, ideally, will not exceed $\beta$ by too much. The expected width of a $\beta$-level interval estimator is the average difference between the upper and lower endpoints, formally defined as $$ -\text{Width}_\beta(A,B) = \E(B - A). +\text{Width}(A,B) = \E(B - A). (\#eq:expected-width) $$ Smaller expected width means that the interval tends to be narrower, on average, and thus more informative about the value of the target parameter.
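In code, both measures reduce to simple means over the replications. Here is a minimal sketch of our own (with hypothetical Wald-type interval endpoints `A_r` and `B_r`, not output from the case-study code):

```r
# Hypothetical simulation output: Wald intervals around noisy estimates
set.seed(3)
theta <- 0.3
T_r <- rnorm(1000, mean = theta, sd = 0.1)  # estimates with true SE = 0.1
A_r <- T_r - 1.96 * 0.1                     # lower endpoints
B_r <- T_r + 1.96 * 0.1                     # upper endpoints

mean(A_r <= theta & theta <= B_r)  # estimated coverage rate (about 0.95 here)
mean(B_r - A_r)                    # estimated expected width (0.392 here,
                                   # since the SE is held fixed)
```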
@@ -485,13 +485,13 @@ Let $A_r$ and $B_r$ denote the lower and upper end-points of the confidence inte The coverage rate and expected length measures can be estimated as $$ \begin{aligned} -\widehat{\text{Coverage}}_\beta(A,B) &= \frac{1}{R}\sum_{r=1}^R I(A_r \leq \theta \leq B_r) \\ -\widehat{\text{Width}}_\beta(A,B) &= \frac{1}{R} \sum_{r=1}^R W_r = \frac{1}{R} \sum_{r=1}^R \left(B_r - A_r\right). +\widehat{\text{Coverage}}(A,B) &= \frac{1}{R}\sum_{r=1}^R I(A_r \leq \theta \leq B_r) \\ +\widehat{\text{Width}}(A,B) &= \frac{1}{R} \sum_{r=1}^R W_r = \frac{1}{R} \sum_{r=1}^R \left(B_r - A_r\right). \end{aligned} (\#eq:coverage-width) $$ Following a strict statistical interpretation, a confidence interval performs acceptably if it has actual coverage rate greater than or equal to $\beta$. -If multiple tests satisfy this criterion, then the test with the lowest expected width would be preferable. Some analysts prefer to look at lower and upper coverage separately, where lower coverage is $\Prob(A \leq \theta)$ and upper coverage is $\Prob(\theta \leq B)$. +If multiple methods satisfy this criterion, then the method with the lowest expected width would be preferable. Some analysts prefer to look at lower and upper coverage separately, where lower coverage is $\Prob(A \leq \theta)$ and upper coverage is $\Prob(\theta \leq B)$. In many instances, confidence intervals are constructed using point estimators and uncertainty estimators. @@ -499,13 +499,13 @@ For example, a conventional Wald-type confidence interval is centered on a point $$ A = T - c \times \widehat{SE}, \quad B = T + c \times \widehat{SE} $$ -for some critical value $c$ (e.g.,for a normal critical value with a $\beta = 0.95$ confidence level, $c = `r round(qnorm(0.975), 2)`$). -Because of these connections, confidence interval coverage will often be closely related to the performance of the point estimator and uncertainty estimator. 
+for some critical value $c$ (e.g., for a normal critical value with a $\beta = 0.95$ confidence level, $c = `r round(qnorm(0.975), 3)`$). +Because of these connections, confidence interval coverage will often be closely related to the performance of the point estimator and variance estimator. Biased point estimators will tend to have confidence intervals with coverage below the desired level because they are not centered in the right place. Likewise, variance estimators that have relative bias below 1 will tend to produce confidence intervals that are too short, leading to coverage below the desired level. Thus, confidence interval coverage captures multiple aspects of the performance of an estimation procedure. -### Confidence Intervals in the Cluster RCT Simulation +### Confidence Intervals in the Cluster RCT Simulation {#cluster-RCT-CI-coverage} Returning to the CRT simulation, we will examine the coverage and expected width of normal Wald-type confidence intervals for each of the estimators under consideration. To do this, we first have to calculate the confidence intervals because we did not do so in the estimation function. @@ -537,10 +537,10 @@ In practice, we might want to examine more carefully constructed intervals such Especially in scenarios with a small or moderate number of clusters, such methods might provide better intervals, with coverage closer to the desired confidence level. See Exercise \@ref(cluster-RCT-t-confidence-intervals). -## Metrics for Inferential Procedures (Hypothesis Tests) {#assessing-inferential-procedures} +## Measures for Inferential Procedures (Hypothesis Tests) {#assessing-inferential-procedures} -Hypothesis testing involves specifying a null hypothesis, such as that there is no difference in average outcomes between two experimental groups, collecting data, and evaluating whether the observed data is compatible with the null hypothesis.
-Hypothesis testing procedures are often formulated in terms of a $p$-value, which measures how extreme or surprising a feature of the observed data (a test statistic) is relative to what one would expect if the null hypothesis is true. +Hypothesis testing entails first specifying a null hypothesis, such as that there is no difference in average outcomes between two experimental groups. One then collects data and evaluates whether the observed data is compatible with the null hypothesis. +Hypothesis test results are often described in terms of a $p$-value, which measures how extreme or surprising a feature of the observed data (a test statistic) is relative to what one would expect if the null hypothesis is true. A small $p$-value (such as $p < .05$ or $p < .01$) indicates that the observed data would be unlikely to occur if the null is true, leading the researcher to reject the null hypothesis. Alternately, testing procedures might be formulated by comparing a test statistic to a specified critical value; a test statistic exceeding the critical value would lead the researcher to reject the null. @@ -556,11 +556,11 @@ When we evaluate a hypothesis testing procedure, we are concerned with two prima Validity pertains to whether we erroneously reject a true null more than we should. An $\alpha$-level testing procedure is valid if it has no more than an $\alpha$ chance of rejecting the null, when the null is true. -This means that if we conducted a valid testing procedure 1000 times, where the null holds for all of those 1000 replications, we should not see more than about $1000 \alpha$ rejections. If we were using the conventional $\alpha = .05$ level, then we should reject the null in only 50 of the 1000 replications. +If we use the conventional $\alpha = .05$ level, then a valid testing procedure should reject the null in no more than about 50 of 1000 replications of a data-generating process where the null hypothesis actually holds true.
To assess validity, we will need to specify a data generating process where the null hypothesis holds (e.g., where there is no difference in average outcomes between experimental groups). We then generate a large series of data sets with a true null, conduct the testing procedure on each dataset and record the $p$-value or critical value, then score whether we reject the null hypothesis. -In practice, may be interested in evaluating a testing procedure by exploring data generation processes where the null is true but other aspects of the data (such as outliers, skewed outcome distributions, or small sample size) make estimation difficult, or where auxiliary assumptions of the testing procedure are violated. +In practice, we may be interested in evaluating a testing procedure by exploring data generation processes where the null is true but other aspects of the data (such as outliers, skewed outcome distributions, or small sample size) make estimation difficult, or where auxiliary assumptions of the testing procedure are violated. Examining such data-generating processes allows us to understand if our methods are robust to patterns that might be encountered in real data analysis. The key to evaluating the validity of a procedure is that, for whatever data-generating process we examine, the null hypothesis must be true. @@ -570,27 +570,29 @@ Power is concerned with the chance that we notice when an effect or a difference Compared to validity, power is a more nuanced concept because larger effects will clearly be easier to notice than smaller ones, and more blatant violations of a null hypothesis will be easier to identify than subtle ones. Furthermore, the rate at which we can detect violations of a null will depend on the $\alpha$ level of the testing procedure. A lower $\alpha$ level will make for a less sensitive test, requiring stronger evidence to rule out a null hypothesis. 
Conversely, a higher $\alpha$ level will reject more readily, leading to higher power but at a cost of increased false positives. -When assessing validity, we want the rejection rate to be at or below the specified $\alpha$ level, and when assessing power we want the rejection rate to be as high as possible. - -We find it useful to think of power as a _function_ rather than as a single quantity because its absolute magnitude will generally depend on the sample size of a dataset and the magnitude of the effect of interest. -Because of this, power evaluations will typically involve examining a sequence of data-generating scenarios with increasing sample size or increasing effect size. -Further, if our goal is to evaluate several different testing procedures, the absolute power of a procedure will be of less concern than the _relative_ performance of one procedure compared to another. - + In order to evaluate the power of a testing procedure by simulation, we will need to generate data where there is something to detect. In other words, we will need to ensure that the null hypothesis is violated (and that some specific alternative hypothesis of interest holds). The process of evaluating the power of a testing procedure is otherwise identical to that for evaluating its validity: generate many datasets, carry out the testing procedure, and track the rate at which the null hypothesis is rejected. The only difference is the _conditions_ under which the data are generated. -### The Rejection Rate +We find it useful to think of power as a _function_ rather than as a single quantity because its absolute magnitude will generally depend on the sample size of a dataset and the magnitude of the effect of interest. +Because of this, power evaluations will typically involve examining a _sequence_ of data-generating scenarios with varying sample size or varying effect size. 
+Further, if our goal is to evaluate several different testing procedures, the absolute power of a procedure will be of less concern than the _relative_ performance of one procedure compared to another. + + + -When evaluating both validity and power, the main performance measure is the __rejection rate__ of the hypothesis test. Letting $P$ be the p-value from a procedure for testing the null hypothesis that a parameter $\theta = 0$, generated under a data-generating process with parameter $\theta$ (which could in truth be zero or non-zero). The rejection rate is then +### Rejection Rates + +When evaluating either validity or power, the main performance measure is the __rejection rate__ of the hypothesis test. Let $P$ be the $p$-value from a procedure for testing the null hypothesis that a parameter $\theta = 0$, where the data are generated under a process with parameter $\theta$ (which could in truth be zero or non-zero). The rejection rate is then $$ \rho_\alpha(\theta) = \Prob(P < \alpha) (\#eq:rejection-rate) $$ When data are simulated from a process in which the null hypothesis is true, then the rejection rate is equivalent to the Type-I error rate of the test, which should ideally be near the desired $\alpha$ level. -When the data are simulated from a process in which the null hypothesis is violated, then the rejection rate is equivalent to the __power__ of the test (for the given alternate hypothesis specified in the data-generating process). +When the data are simulated from a process in which the null hypothesis is violated, then the rejection rate is equivalent to the power of the test (for the given alternate hypothesis specified in the data-generating process). Ideally, a testing procedure should have actual Type-I error exactly equal to the nominal level $\alpha$, but such exact tests are rare. To estimate the rejection rate of a test, we calculate the proportion of replications where the test rejects the null hypothesis.
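In code, this estimate is a single mean over an indicator. A minimal sketch of our own, with placeholder $p$-values standing in for simulation results:

```r
# Placeholder p-values (uniform, as from an exactly valid test under the null)
set.seed(5)
p_values <- runif(1000)

mean(p_values < 0.05)  # estimated rejection rate at alpha = .05
# The same calculation works for several alpha levels at once:
sapply(c(0.01, 0.05, 0.10), function(a) mean(p_values < a))
```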
@@ -603,10 +605,8 @@ It may be of interest to evaluate the performance of the test at several differe For instance, @brown1974SmallSampleBehavior evaluated the Type-I error rates and power of their tests using $\alpha = .01$, $.05$, and $.10$. Simulating the $p$-value of the test makes it easy to estimate rejection rates for multiple $\alpha$ levels, since we simply need to apply Equation \@ref(eq:rejection-rate-estimate) for several values of $\alpha$. When simulating from a data-generating process where the null hypothesis holds, one can also plot the empirical cumulative distribution function of the $p$-values; for an exactly valid test, the $p$-values should follow a standard uniform distribution with a cumulative distribution falling along the $45^\circ$ line. - - -There are some different perspectives on how close the actual Type-I error rate should be in order to qualify as suitable for use in practice. Following a strict statistical definition, a hypothesis testing procedure is said to be __level-$\alpha$__ if its actual Type-I error rate is _always_ less than or equal to $\alpha$. +Methodologists hold a variety of perspectives on how close to the nominal $\alpha$ level the actual Type-I error rate must be for a test to qualify as suitable for use in practice. Following a strict statistical definition, a hypothesis testing procedure is said to be __level-$\alpha$__ if its actual Type-I error rate is _always_ less than or equal to $\alpha$, for any specific conditions of a data-generating process. Among a collection of level-$\alpha$ testing procedures, we would prefer the one with highest power. If looking only at null rejection rates, then the test with Type-I error closest to $\alpha$ would usually be preferred. However, some scholars prefer to use a less stringent criterion, where the Type-I error rate of a testing procedure would be considered acceptable if it is within 50\% of the desired $\alpha$ level.
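The ECDF diagnostic mentioned a few sentences above takes only a couple of lines. A sketch of our own, using placeholder $p$-values in place of $p$-values simulated under a true null:

```r
# Under a true null, an exactly valid test yields uniform p-values,
# so their ECDF should track the 45-degree line.
set.seed(9)
p_values <- runif(1000)  # placeholder for null-simulated p-values

plot(ecdf(p_values), xlim = c(0, 1), main = "ECDF of null p-values")
abline(a = 0, b = 1, lty = 2)  # reference line for an exactly valid test
```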
@@ -626,20 +626,30 @@ runs %>% summarise( power = mean( p_value <= 0.05 ) ) ``` -For this particular scenario, the power of the tests is not particularly high, and the linear regression estimator apparently has higher power than the aggregation method and the multi-level model. +For this particular scenario, none of the tests have especially high power, and the linear regression estimator apparently has higher power than the aggregation method and the multi-level model. To make sense of this power pattern, we need to also consider the validity of the testing procedures. We can do so by re-running the simulation with the `simhelpers` package, using the code we constructed in Chapter \@ref(running-the-simulation-process). To evaluate the Type-I error rate of the tests, we will set the average treatment effect parameter to zero by specifying `gamma_1 = 0`: -```{r secret_run_cluster_rct, include=FALSE} +```{r secret-run-cluster-rct, include=FALSE} +library(simhelpers) +source("case_study_code/gen_cluster_RCT.R") +source("case_study_code/analyze_cluster_RCT.R") +sim_cluster_RCT <- bundle_sim( gen_cluster_RCT, estimate_Tx_Fx, id = "runID" ) + if ( !file.exists("results/cluster_RCT_simulation_validity.rds" )) { tictoc::tic() # Start the clock!
+ set.seed( 404044 ) - runs_val <- - purrr::rerun( R, one_run( 0 ) ) %>% - bind_rows( .id="runID" ) + runs_val <- sim_cluster_RCT( + reps = 1000, + J = 20, n_bar = 30, alpha = 0.75, + gamma_1 = 0, gamma_2 = 0.5, + sigma2_u = 0.2, sigma2_e = 0.8 + ) + tictoc::toc() saveRDS( runs_val, file = "results/cluster_RCT_simulation_validity.rds" ) @@ -650,90 +660,123 @@ if ( !file.exists("results/cluster_RCT_simulation_validity.rds" )) { } ``` -```{r run_cluster_rct, eval=FALSE} +```{r run-cluster-rct, eval=FALSE} set.seed( 404044 ) -runs_val <- sim_function( R, n_bar = 30, J = 20, gamma_1 = 0, gamma_2 = 0.2 ) +runs_val <- sim_cluster_RCT( + reps = 1000, + J = 20, n_bar = 30, alpha = 0.75, + gamma_1 = 0, gamma_2 = 0.5, + sigma2_u = 0.2, sigma2_e = 0.8 +) ``` Assessing validity involves repeating the exact same rejection rate calculations as we did for power: -```{r demo_calc_validity} +```{r demo-calc-validity} runs_val %>% - group_by( method ) %>% + group_by( estimator ) %>% summarise( power = mean( p_value <= 0.05 ) ) ``` -The Type-I error rates of the tests for the aggregation and multi-level modeling estimators are around 0.05, as desired. -The test for the linear regression estimator has Type-I error above the specified $\alpha$-level, due to the upward bias of the point estimator used in constructing the test. +The Type-I error rates of the tests for the aggregation and multi-level modeling approaches are around 5%, as desired. +The test for the linear regression estimator has Type-I error above the specified $\alpha$-level due to the upward bias of the point estimator used in constructing the test. The elevated rejection rate might be part of the reason that the linear regression test has higher power than the other procedures. 
-It is not entirely fair to compare the power of these testing procedures, because one of them has Type-I error in excess of the desired level.[^size-adjusted-power] +It is not entirely fair to compare the power of these testing procedures, because one of them has Type-I error in excess of the desired level. + -[^size-adjusted-power]: One approach for conducting a fair comparison in this situation is to compute the _size-adjusted_ power of the tests. -Size-adjusted power involves computing the rejection rate of a test using a different threshold $\alpha'$, selected so that the Type-I error rate of the test is equal to the desired $\alpha$ level. -Specifically, size adjusted power is +As discussed above, linear regression targets the person-level average treatment effect. +In the scenario we simulated for evaluating validity, the person-level average effect is not zero because we have specified a non-zero impact heterogeneity parameter ($\gamma_2=0.5$), meaning that the school-specific treatment effects vary around 0. +To see if this is why the linear regression test has an inflated Type-I error rate, we could re-run the simulation using settings where both the school-level and person-level average effects are truly zero. + +## Relative or Absolute Measures? {#sec-relative-performance} + +In considering performance measures for point estimators, we have defined the measures in terms of differences (bias, median bias) and average deviations (variance and RMSE), all of which are on the scale of the target parameter. +In contrast, for evaluating estimated standard errors we have defined measures in relative terms, calculated as _ratios_ relative to the target quantity rather than as differences. +In the latter case, relative measures are justified because the target quantity (the true degree of uncertainty) is always positive and is usually strongly affected by design parameters of the data-generating process.
+Is it ever reasonable to use relative measures for point estimators? If so, how should we decide whether to use relative or absolute measures? + +Many published simulation studies have used relative performance measures for evaluating point estimators. For instance, studies might use relative bias or relative RMSE, defined as $$ -\rho^{adjusted}_\alpha(\theta) = \Prob(P < \rho_\alpha(0)). +\begin{aligned} +\text{Relative }\Bias(T) &= \frac{\E(T)}{\theta}, \\ +\text{Relative }\RMSE(T) &= \frac{\sqrt{\E\left[\left(T - \theta\right)^2 \right]}}{\theta}. +\end{aligned} +(\#eq:relative-bias-RMSE) $$ -To estimate size-adjusted power using simulation, we first need to estimate the Type-I error rate, $r_\alpha(0)$. We can then evaluate the rejection rate of the testing procedure under scenarios with other values of $\theta$ by computing +and estimated as $$ -r^{adjusted}_\alpha(\theta) = \frac{1}{R} \sum_{r=1}^R I(P_r < r_{\alpha}(0)). +\begin{aligned} +\widehat{\text{Relative }\Bias(T)} &= \frac{\bar{T}}{\theta}, \\ +\widehat{\text{Relative }\RMSE(T)} &= \frac{\widehat{RMSE}(T)}{\theta}. +\end{aligned} +(\#eq:relative-bias-RMSE-estimators) $$ +As justification for evaluating bias in relative terms, authors often appeal to @hoogland1998RobustnessStudiesCovariance, who suggested that relative bias of under 5% (i.e., relative bias falling between 0.95 and 1.05) could be considered acceptable for an estimation procedure. +However, @hoogland1998RobustnessStudiesCovariance were writing about a very specific context---robustness studies of structural equation modeling techniques---in which the parameters take a particular form. +In our view, their proposed rule-of-thumb is often generalized far beyond the circumstances where it might be defensible, including to problems where it is clearly arbitrary and inappropriate. -As discussed above, linear regression target the person-level average treatment effect.
-In the scenario we simulated for evaluating validity, the person-level average effect is not zero because we have specified a non-zero impact heterogeneity parameter ($\gamma_2=0.2$), meaning that the school-specific treatment effects vary around 0. -To see if this is why the linear regression test has an inflated Type-I error rate, we could re-run the simulation using settings where both the school-level and person-level average effects are truly zero. - -## Selecting Relative vs. Absolute Metrics {#sec-relative-performance} +A more principled approach to choosing between absolute and relative measures is to consider how the magnitude of the measure changes across different values of the target parameter $\theta$. +If the estimand of interest is a location parameter, then shifting $\theta$ by 0.1 or by 10.1 would not usually lead to changes in the magnitude of bias, variance, or RMSE. +The relationship between bias and the target parameter might be similar to Scenario A in Figure \@ref(fig:absolute-relative), where bias is roughly constant across a range of different values of $\theta$. +Focusing on relative measures in this scenario would lead to a much more complicated story because different values of $\theta$ will produce drastically different relative biases, ranging from nearly unbiased to nearly infinite bias (for $\theta$ very close to zero).
-We typically apply absolute metrics to point estimators and relative metrics to standard error estimators (we are setting aside, for the moment, the relative metrics of a measure from one estimation procedure to another, as we saw earlier when we compared the SEs to a baseline SE of linear regression for the cluster randomized trial simulation.
-So how do we select when to use what?
+Another possibility is that shifting $\theta$ by 0.1 or 10.1 will lead to proportionate changes in the magnitude of bias, variance, or RMSE.
+The relationship between bias and the target parameter might be similar to Scenario B in Figure \@ref(fig:absolute-relative), where bias is roughly a constant multiple of the target parameter $\theta$.
+Focusing on relative measures in this scenario is useful because it leads to a simple story: relative bias is always around 1.12 across all values of $\theta$, even though the raw bias varies considerably.
+We would usually expect this type of pattern to occur for scale parameters.
-As a first piece of guidance, establish whether we expect the performance (e.g., bias, standard error, or RMSE) of a point estimate to depend on the magnitude of the estimand.
-For example, if we are estimating some mean $\theta$, and we generate data where $\theta = 100$ vs where $\theta = 1000$ (or any other arbitrary number), we would not generally expect the value of $\theta$ to change the magnitude of bias, variance, or MSE.
-On the other hand, these different $\theta$s would have a large impact on the _relative_ bias and _relative_ MSE.
-(Want smaller relative bias? Just add a million to the parameter!)
-For these sorts of "location parameters" we generally use absolute measures of performance.
+```{r absolute-relative}
+#| fig.width: 8
+#| fig.height: 3
+#| echo: false
+#| message: false
+#| fig.cap: "Hypothetical relationships between bias and a target parameter $\\theta$. In Scenario A, bias is unrelated to $\\theta$ and absolute bias is a more appropriate measure. In Scenario B, bias is proportional to $\\theta$ and relative bias is a more appropriate measure."
-That being said, a more principled approach for determining whether to use absolute or relative performance metrics depends on assessing performance for _multiple_ values of the parameter.
-In many simulation studies, replications are generated and performance metrics are calculated for several different values of a parameter, say $\theta = \theta_1,...,\theta_p$.
-Let's focus on bias for now, and say that we've estimated (from a large number of replications) the bias at each parameter value.
-We present two hypothetical scenarios, A and B, in the figures below.
-```{r, echo = FALSE, fig.width = 5, fig.height = 2.5}
-library(ggplot2)
-theta <- seq(0, 5, 0.5)
+theta <- seq(-0.8, 0.8, 0.2)
 bias1 <- rnorm(length(theta), mean = 0.06, sd = 0.004)
-bias2 <- rnorm(length(theta), mean = theta * 0.12 / 5, sd = 0.004)
+bias2 <- rnorm(length(theta), mean = theta * 0.12, sd = 0.004)
 type <- rep(c("Scenario A","Scenario B"), each = length(theta))
 dat <- data.frame(type, theta, bias = c(bias1, bias2))
-dat$theta = dat$theta / 12
-dat$bias = dat$bias * 2
-ggplot(dat, aes(theta, bias)) +
+ggplot(dat, aes(theta, bias, color = type)) +
+  geom_hline(yintercept = 0) +
   geom_point() +
   geom_line() +
   facet_wrap(~ type) +
-  theme_minimal()
+  theme_minimal() +
+  theme(legend.position = "none") +
+  labs(
+    x = expression(theta),
+    y = "Bias"
+  )
 ```
-
+How do we know which of these scenarios is a better match for a particular problem?
+For some estimators and data-generating processes, it may be possible to analyze a problem with statistical theory and examine how bias or variance would be expected to change as a function of $\theta$.
+However, many problems are too complex to be tractable. 
+Another, much more feasible route is to evaluate performance for _multiple_ values of the target parameter. +As done in many simulation studies, we can simulate sampling distributions and calculate performance measures (in raw terms) for several different values of a parameter, selected so that we can distinguish between constant and multiplicative relationships. +Then, in analyzing the simulation results, we can generate graphs such as those in Figure \@ref(fig:absolute-relative) to understand how performance changes as a function of the target parameter. If the absolute bias is roughly the same for all values of $\theta$ (as in Scenario A), then it makes sense to report absolute bias as the summary performance criterion. On the other hand, if the bias grows roughly in proportion to $\theta$ (as in Scenario B), then relative bias might be a better summary criterion. +### Performance relative to a benchmark estimator -**Performance relative to a baseline estimator.** - -Another relative measure, as we saw earlier, is to calculate performance relative to some baseline. -For example, if one of the estimators is the "generic method," we could calculate ratios of the RMSE of our estimators to the baseline RMSE. -This can provide a way of standardizing across simulation scenarios where the overall scale of the RMSE changes radically. -This could be critical to, for example, examining trends across simulations that have different sample sizes, where we would expect all estimators' performance measures to improve as sample size grows. -This kind of relative standardization allows us to make statements such as "Aggregation has standard errors around 8% smaller than linear regression"--which is very interpretable, more interpretable than saying "Aggregation has standard errors around 0.01 smaller than linear regression." -In the latter case, we do not know if that is big or small. 
+Another way to define performance measures in relative terms is by taking the ratio of the performance measure for one estimator over the performance measure for a benchmark estimator.
+We have already demonstrated this approach in calculating performance measures for the cluster RCT example (Section \@ref(clusterRCTperformance)), where we used the linear regression estimator as the benchmark against which to compare the other estimators.
+This approach is natural in simulations that involve comparing the performance of multiple estimators and where one of the estimators could be considered the current standard or conventional method.
-While a powerful tool, standardization is not without risks: if you scale relative to something, then higher or lower ratios can either be due to the primary method of interest (the numerator) or due to the behavior of the reference method in the denominator.
-These relative ratios can end up being confusing to interpret due to this tension.
+Comparing the performance of one estimator relative to another can be especially useful when examining measures whose magnitude varies drastically across design parameters.
+For most statistical methods, we would usually expect precision and accuracy to improve (variance and RMSE to decrease) as sample size increases.
+Comparing estimators in terms of _relative_ precision or _relative_ accuracy may make it easier to identify consistent patterns in the simulation results.
+For instance, this approach might allow us to summarize findings by saying that "the aggregation estimator has standard errors that are consistently 6-10% smaller than the standard errors of the linear regression estimator."
+This is much easier to interpret than saying that "aggregation has standard errors that are around 0.01 smaller than linear regression, on average." 
+In the latter case, it is very difficult to determine whether a difference of 0.01 is large or small, and focusing on an average difference conceals relevant variation across scenarios involving different sample sizes. -They can also break when everything is on a constrained scale, like power. +Comparing performance relative to a benchmark method can be an effective tool, but it also has potential drawbacks. +Because these relative performance measures are inherently comparative, higher or lower ratios could either be due to the behavior of the method of interest (the numerator) or due to the behavior of the benchmark method (the denominator). +Ratio comparisons are also less effective for performance measures that are on a constrained scale, such as power. If we have a power of 0.05, and we improve it to 0.10, we have doubled our power, but if it is 0.10 and we increase to 0.15, we have only increased by 50%. -Ratios when near zero can be very deceiving. +Ratios can also be very deceiving when the denominator quantity is near zero or when it can take on either negative or positive values; this can be a problem when examining bias relative to a benchmark estimator. +Because of these drawbacks, it is prudent to compute and examine performance measures in absolute terms in addition to examining relative comparisons between methods. ## Estimands Not Represented By a Parameter {#implicit-estimands} @@ -746,13 +789,13 @@ There are at least three possible ways to accomplish this. One way is to use mathematical distribution theory to compute an implied parameter. Our target parameter will be some function of the parameters and random variables in the data-generating process, and it may be possible to evaluate that function algebraically or numerically (i.e., using numerical integration functions such as `integrate()`). 
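+As a small illustration of the numerical route, consider a made-up example (the distribution and parameter values here are purely illustrative, not from our running cluster-RCT example): suppose the target parameter were $E(e^X)$ for $X \sim N(\mu, \sigma^2)$. We could evaluate this implied parameter with `integrate()` and check it against the known closed form:
+
+```{r, eval = FALSE}
+mu <- 0.5
+sigma <- 0.3
+
+# Implied parameter E[exp(X)], evaluated by integrating over the density of X
+implied_param <- integrate(
+  function(x) exp(x) * dnorm(x, mean = mu, sd = sigma),
+  lower = -Inf, upper = Inf
+)$value
+
+implied_param
+exp(mu + sigma^2 / 2)  # closed-form check: the lognormal mean
+```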
-Such an exercise can be very worthwhile if it provides insights into the relationship between the target parameter and the inputs of the data-generating process. +This can be a very worthwhile exercise if it provides insights into the relationship between the target parameter and the inputs of the data-generating process. However, this approach requires knowledge of distribution theory, and it can get quite complicated and technical.[^cluster-RCT-estimand] -Other approaches are often feasible and more closely aligned with our focus on Monte Carlo simulation. +Other approaches are often feasible and more closely aligned with the tools and techniques of Monte Carlo simulation. [^cluster-RCT-estimand]: In the cluster-RCT example, the distribution theory is tractable. See Exercise \@ref(cluster-RCT-SPATE) -Another alternative approach is to simply generate a massive dataset---so large that can stand in for the entire data-generating model---and then simply calculate the target parameter of interest in this massive dataset. In the cluster-RCT example, we can apply this strategy by generating data from a very large number of clusters and then simply calculating the true person-average effect across all generated clusters. +An alternative approach is to simply generate a massive dataset---so large that it can stand in for the entire data-generating model---and then simply calculate the target parameter of interest in this massive dataset. In the cluster-RCT example, we can apply this strategy by generating data from a very large number of clusters and then simply calculating the true person-average effect across all generated clusters. If the dataset is big enough, then the uncertainty in this estimate will be negligible compared to the uncertainty in our simulation. 
We implement this approach as follows, generating a dataset with 100,000 clusters: @@ -766,36 +809,48 @@ dat <- gen_cluster_RCT( ATE_person <- mean( dat$Yobs[dat$Z==1] ) - mean( dat$Yobs[dat$Z==0] ) ATE_person ``` -Note our estimate of the person-average effect of `r round( ATE_person, 2)` is about what we would expect given the bias we saw earlier for the linear model. +The extremely precise estimate of the person-average effect is `r round( ATE_person, 2)`, which is consistent with what we would expect given the bias we saw earlier for the linear model. -With respect to the `ATE_person` estimand, the bias and RMSE of our estimators will shift, although SE will stay the same as in our performance calculations for the school-level average effect: +If we recalculate performance measures for all of our estimators with respect to the `ATE_person` estimand, the bias and RMSE of our estimators will shift but the standard errors will stay the same as in previous performance calculations using the school-level average effect: ```{r} -runs %>% +performance_person_ATE <- + runs %>% group_by( method ) %>% summarise( - bias = mean( ATE_hat - ATE_person ), + bias = mean( ATE_hat ) - ATE_person, SE = sd( ATE_hat ), RMSE = sqrt( mean( (ATE_hat - ATE_person)^2 ) ) ) %>% mutate( per_RMSE = RMSE / RMSE[method=="LR"] ) + +performance_person_ATE ``` -For the person-weighted estimand, Agg and MLM are biased but LR is unbiased. -RMSE is now a tension between bias and reduced variance. -Overall, Agg and MLM are 4% worse than LR in terms of RMSE, because they have lower SEs but higher bias. +```{r, echo = FALSE} +RMSE_ratio <- + performance_person_ATE %>% + filter(method != "LR") %>% + summarize(per_RMSE = mean(per_RMSE)) %>% + pull(per_RMSE) +``` + +For the person-weighted estimand, the aggregation estimator and multilevel model are biased but the linear regression estimator is unbiased. 
+However, the aggregation estimator and multilevel model estimator still have smaller standard errors than the linear regression estimator. +RMSE now captures the trade-off between bias and reduced variance. +Overall, aggregation and multilevel modeling have RMSE that is around `r round(100 * (RMSE_ratio - 1))`% larger than linear regression. A further approach for calculating `ATE_person` would be to record the true person average effect of the dataset with each simulation iteration, and then average the sample-specific parameters at the end. -The overall average of the dataset-specific `ATE_person`s corresponds to the population person-level ATE. -This approach is effectively equivalent to generating a massive dataset---we just generate it in piece. +The overall average of the dataset-specific `ATE_person` parameters corresponds to the population person-level ATE. +This approach is equivalent to generating a single massive dataset---we just generate it piece by piece. To implement this approach, we would need to modify the data-generating function `gen_cluster_RCT()` to track the additional information. -We might have, for example +For instance, we might calculate ```{r, eval=FALSE} -tx_effect <- gamma_1 + gamma_2 * (nj-n_bar)/n_bar +tx_effect <- gamma_1 + gamma_2 * ( nj - n_bar ) / n_bar beta_0j <- gamma_0 + Zj * tx_effect + u0j ``` -and then we would return `tx_effect` as well as `Yobs` and `Z` as a column in our dataset. +and then include `tx_effect` along with `Yobs` and `Z` as a column in our dataset. This approach is quite similar to directly calculating _potential outcomes_, as discussed in Chapter \@ref(potential-outcomes). After modifying the data-generating function, we will also need to modify the analysis function(s) to record the sample-specific treatment effect parameter. 
@@ -814,175 +869,321 @@ analyze_data = function( dat ) {
 }
 ```
 
-Now when we run our simulation, we will have a column which is the true person-level average treatment effect for each dataset.
+Now when we run our simulation, we will have a column corresponding to the true person-level average treatment effect for each dataset.
 We could then take the average of these values across replications to estimate the true person average treatment effect in the population, and then use this as the target parameter for performance calculations.
 
-Clearly, an estimand not represented by any single input parameter is more difficult to work with, but it is not impossible.
-The key is to be clear about what you are trying to estimate, since the performance of an estimator depends critically on the estimand to which the estimator is compared.
+An estimand not represented by any single input parameter is more difficult to work with than one that corresponds directly to an input parameter.
+Still, it is feasible to examine such estimands with a bit of forethought and careful programming.
+The key is to be clear about what you are trying to estimate because the performance of an estimator depends critically on the estimand against which it is compared.
+
+## Uncertainty in Performance Estimates (the Monte Carlo Standard Error) {#MCSE}
+
+The performance measures we have described are all defined with respect to the sampling distribution of an estimator, or its distribution across an infinite number of replications of the data-generating process.
+Of course, simulations will only involve a finite set of replications, based on which we calculate _estimates_ of the performance measures.
+These estimates involve some Monte Carlo error because they are based on a limited number of replications.
+It is important to understand the extent of Monte Carlo error when interpreting simulation results, so we need methods for assessing this source of uncertainty. 
+To account for Monte Carlo error, we can think of our simulation results as a sample from a population.
+Each replication is an independent and identically distributed draw from the estimator's sampling distribution.
+Once we frame the problem in these terms, standard statistical techniques for independent and identically distributed random variables can be applied to calculate standard errors.
+We call these standard errors Monte Carlo Simulation Errors, or MCSEs.
+For most of the performance measures, closed-form expressions are available for calculating MCSEs.
+For a few of the measures, we can apply techniques such as the jackknife to calculate reasonable approximations for MCSEs.
+
+### Conventional measures for point estimators
+
-## Uncertainty in Performance Estimates (the Monte Carlo Standard Error) {#MCSE}
+For the measures that we have described for evaluating point estimators, Monte Carlo standard errors can be calculated using conventional formulas.[^estimated-MCSEs]
+Recall that we have a point estimator $T$ of a target parameter $\theta$, and we calculate the mean of the estimator $\bar{T}$ and its sample standard deviation $S_T$ across $R$ replications of the simulation process.
+In addition, we will need to calculate the standardized skewness and kurtosis of $T$ as
-Our performance metrics are defined as average performance across an infinite number of trials.
-Of course, in our simulations we only run a finite number of trials, and estimate the performance metrics with the sample of trials we generate.
-For example, if we are assessing coverage across 100 trials, we can calculate what fraction rejected the null for that 100.
-This is an _estimate_ of the true coverage rate.
-Due to random chance, we might see a higher, or lower, proportion rejected than what we would see if we ran the simulation forever. 
+$$
+\begin{aligned}
+\text{Skewness (standardized):} & &g_T &= \frac{1}{R S_T^3}\sum_{r=1}^R \left(T_r - \bar{T}\right)^3 \\
+\text{Kurtosis (standardized):} & &k_T &= \frac{1}{R S_T^4} \sum_{r=1}^R \left(T_r - \bar{T}\right)^4.
+\end{aligned}
+(\#eq:skewness-kurtosis)
+$$
-To account for estimation uncertainty we want associated uncertainty estimates to go with our point estimates of performance.
-We want to, in other words, treat our simulation results as a dataset in its own right.
-(And yes, this is quite meta!)
+[^estimated-MCSEs]: To be precise, the formulas that we give are _estimators_ for the Monte Carlo standard errors of the performance measure estimators. Our presentation does not emphasize this point because the performance measures will usually be estimated using a large number of replications from an independent and identically distributed process, so the distinction between empirical and estimated standard errors will not be consequential.
-Once we frame the problem in these terms, it is relatively straightforward to calculate standard errors for most of the performance critera because we have an independent and identically distributed set of measurements.
-We call these standard errors Monte Carlo Simulation Errors, or MCSEs.
-For some of the performance metrics we have to be a bit more clever, as we will discuss below.
+The bias of $T$ is estimated as $\bar{T} - \theta$, so the MCSE for bias is equal to the MCSE of $\bar{T}$. It can be estimated as
+$$
+MCSE\left(\widehat{\Bias}(T)\right) = \sqrt{\frac{S_T^2}{R}}.
+(\#eq:MCSE-bias)
+$$
+The sampling variance of $T$ is estimated as $S_T^2$, with MCSE of
+$$
+MCSE\left(\widehat{\Var}(T)\right) = S_T^2 \sqrt{\frac{k_T - 1}{R}}.
+(\#eq:MCSE-var)
+$$
+The empirical standard error (the square root of the sampling variance) is estimated as $S_T$. Using a delta method approximation[^delta-method], the MCSE of $S_T$ is
+$$
+MCSE\left(S_T\right) = \frac{S_T}{2}\sqrt{\frac{k_T - 1}{R}}. 
+(\#eq:MCSE-SE)
+$$
-We list MCSE expressions for many of our straightforward performance measures on the following table.
-In reading the table, recall that, for an estimator $T$, we have $S_T$ being the standard deviation of $T$ across our simulation runs (i.e., our estimated true Standard Error).
-We also have
+[^delta-method]: The delta method approximation says (with some conditions), that if we assume $X \sim N\left( \phi, \sigma_X^2 \right)$, then we can approximate the distribution of $g(X)$ for some continuous function $g(\cdot)$ as
+$$ g(X) \sim N\left( g(\phi), \;\; g'(\phi)^2 \times \sigma_X^2 \right),$$
+where $g'(\phi)$ is the derivative of $g(\cdot)$ evaluated at $\phi$.
+Following this approximation,
+$$ SE( g(X) ) \approx \left| g'(\phi) \right| \times SE(X) .$$
+For estimation, we plug in $\hat{\phi}$ and our estimate of $SE(X)$ into the above.
+To find the MCSE for $S_T$, we can apply the delta method approximation to $X = S_T^2$ with $g(x) = \sqrt{x}$ and $g'(x) = \frac{1}{2\sqrt{x}}$.
- - Sample skewness (standardized): $\displaystyle{g_T = \frac{1}{R S_T^3}\sum_{r=1}^R \left(T_r - \bar{T}\right)^3}$
- - Sample kurtosis (standardized): $\displaystyle{k_T = \frac{1}{R S_T^4} \sum_{r=1}^R \left(T_r - \bar{T}\right)^4}$
+We estimate RMSE using Equation \@ref(eq:rmse-estimator), which can also be written as
+$$
+\widehat{\RMSE}(T) = \sqrt{(\bar{T} - \theta)^2 + \frac{R - 1}{R} S_T^2}.
+$$
+An MCSE for the estimated mean squared error (the square of RMSE) is
+$$
+MCSE( \widehat{MSE} ) = \sqrt{\frac{1}{R}\left[S_T^4 (k_T - 1) + 4 S_T^3 g_T\left(\bar{T} - \theta\right) + 4 S_T^2 \left(\bar{T} - \theta\right)^2\right]}.
+(\#eq:MCSE-MSE)
+$$
+Again following a delta method approximation, a MCSE for the RMSE is
+$$
+MCSE( \widehat{RMSE} ) = \frac{\sqrt{\frac{1}{R}\left[S_T^4 (k_T - 1) + 4 S_T^3 g_T\left(\bar{T} - \theta\right) + 4 S_T^2 \left(\bar{T} - \theta\right)^2\right]}}{2 \times \widehat{RMSE}}. 
+(\#eq:MCSE-RMSE) +$$ +Section \@ref(sec-relative-performance) discussed circumstances where we might prefer to calculate performance measures in relative rather than absolute terms. +For measures that are calculated by dividing a raw measure by the target parameter, the MCSE for the relative measure is simply the MCSE for the raw measure divided by the target parameter. +For instance, the MCSE of relative bias $\bar{T} / \theta$ is +$$ +MCSE\left( \frac{\bar{T}}{\theta} \right) = \frac{1}{\theta} MCSE(\bar{T}) = \frac{S_T}{\theta \sqrt{R}}. +(\#eq:MCSE-relative-bias) +$$ +MCSEs for relative variance and relative RMSE follow similarly. -| Criterion for T | MCSE | -|----------------|--------| -| Bias ($T-\theta$) | $\sqrt{S_T^2/ R}$ | -| Variance ($S_T^2$) | $\displaystyle{S_T^2 \sqrt{\frac{k_T - 1}{R}}}$ | -| MSE | see below | -| MAD | - | -| Power & Validity ($r_\alpha$) | $\sqrt{ r_\alpha \left(1 - r_\alpha\right) / R}$ | -| Coverage ($\omega_\beta$) | $\sqrt{\omega_\beta \left(1 - \omega_\beta\right) / R}$ | -| Average length ($\text{E}(W)$) | $\sqrt{S_W^2 / R}$ | +### Less conventional measures for point estimators -The MCSE for the MSE is a bit more complicated, and does not quite fit on our table: -$$ \widehat{MCSE}( \widehat{MSE} ) = \displaystyle{\sqrt{\frac{1}{R}\left[S_T^4 (k_T - 1) + 4 S_T^3 g_T\left(\bar{T} - \theta\right) + 4 S_T^2 \left(\bar{T} - \theta\right)^2\right]}} .$$ +In Section \@ref(less-conventional-measures) we described several alternative performance measures for evaluating point estimators, which are less commonly used but are more robust to outliers compared to measures such as bias and variance. +MCSEs for these less conventional measures can be obtained using results from the theory of robust statistics [@Hettmansperger2010robust; @Maronna2006robust]. -For relative quantities with respect to an estimand, simply divide the criterion by the target estimand. 
-E.g., for relative bias $T / \theta$, the standard error would be -$$ SE\left( \frac{T}{\theta} \right) = \frac{1}{\theta} SE(T) = \sqrt{\frac{S_T^2}{R\theta^2}} .$$ +@McKean1984comparison proposed a standard error estimator for the sample median from a continuous but not necessarily normal distribution, derived from a non-parametric confidence interval for the sample median. +We use their approach to compute a MCSE for $M_T$, the sample median of $T$. +Let $c = \left\lceil(R + 1) / 2 - 1.96 \times \sqrt{R/4}\right\rfloor$, where the inner expression is rounded to the nearest integer. +Then +$$ +MCSE\left(M_T\right) = \frac{T_{(R + 1 - c)} - T_{(c)}}{2 \times 1.96}. +(\#eq:MCSE-median) +$$ +A Monte Carlo standard error for the median absolute deviation can be computed following the same approach, but substituting the order statistics of $E_r = | T_r - \theta|$ in place of those for $T_r$. -For square rooted quantities, such as the SE for the true SE (square root of the Variance) or the RMSE (square root of MSE) we can use the Delta method. -The Delta method says (with some conditions), that if we assume $X \sim N( \phi, V )$, then we can approximate the distribution of $g(X)$ for some continuous function $g(\cdot)$ as -$$ g(X) \sim N\left( g(\phi), \;\; g'(\phi)^2\cdot V \right) , $$ -where $g'(\phi)$ is the derivative of $g(\cdot)$ evaluated at $\phi$. -In other words, -$$ SE( g(\hat{X}) ) \approx g'(\theta) \times SE(\hat{X}) .$$ -For estimation, we plug in $\hat{\theta}$ and our estimate of $SE(\hat{X})$ into the above. -Back to the square root, we have $g(x) = \sqrt(x)$ and $g'(x) = 1/2\sqrt(x)$. 
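+In code, this median MCSE is straightforward to compute from the sorted replicates. Here is a brief sketch, where `T_vals` stands for a hypothetical vector of the $R$ replicate estimates:
+
+```{r, eval = FALSE}
+# Sketch: MCSE of the sample median, following the equation above
+R <- length( T_vals )                                # T_vals: hypothetical replicate estimates
+cc <- round( (R + 1) / 2 - 1.96 * sqrt( R / 4 ) )    # the order-statistic index c
+T_sorted <- sort( T_vals )
+MCSE_median <- ( T_sorted[R + 1 - cc] - T_sorted[cc] ) / (2 * 1.96)
+```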
-This gives, for example, the estimated MCSE of the SE as
-$$ \widehat{SE}( \widehat{SE} ) = \widehat{SE}( S^2_T ) = \frac{1}{2S^2_T} \widehat{SE}( S^2_T ) = \frac{1}{2S^2_T} S_T^2 \sqrt{\frac{k_T - 1}{R}} = \frac{1}{2} \sqrt{\frac{k_T - 1}{R}} .$$
+Trimmed mean bias with trimming proportion $p$ is calculated by taking the mean of the middle $(1 - 2p) \times R$ observations, which we have denoted as $\tilde{T}_{\{p\}}$. A MCSE for the trimmed mean (and for the trimmed mean bias) is
+$$
+MCSE\left(\tilde{T}_{\{p\}}\right) = \sqrt{\frac{U_p}{R}},
+(\#eq:MCSE-trimmed-mean)
+$$
+where
+$$
+U_p = \frac{1}{(1 - 2p)R}\left( pR\left(T_{(pR)} - \tilde{T}_{\{p\}}\right)^2 + pR\left(T_{((1-p)R + 1)} - \tilde{T}_{\{p\}}\right)^2 + \sum_{r=pR + 1}^{(1 - p)R} \left(T_{(r)} - \tilde{T}_{\{p\}}\right)^2 \right)
+$$
+[@Maronna2006robust, Eq. 2.85].
+Performance measures based on winsorization include winsorized bias, winsorized standard error, and winsorized RMSE.
+MCSEs for these measures can be computed using the same formulas as for the conventional measures of bias, empirical standard error, and RMSE, but using sample moments of $\hat{X}_r$ in place of the sample moments of $T_r$.
-### MCSE for Relative Variance Estimators
+### MCSE for Relative Variance Estimators {#MCSE-for-relative-variance}
-Estimating the MCSE of the relative bias or relative MSE of a (squared) standard error estimator, i.e., of $E( \widehat{SE^2} - SE^2 ) / SE^2 )$ or $\widehat{MSE} / MSE$, is complicated by the appearance of an estimated quantity, $SE^2$ or $MSE$, in the denominator of the ratio.
-This renders the simple division approach from above unusable, technically speaking.
-The problem is we cannot use our clean expressions for MCSEs of relative performance measures since we are not taking the uncertainty of our denominator into account. 
+Estimating the MCSE of relative performance measures for variance estimators is complicated by the appearance of an estimated quantity in the denominator of the ratio.
+For instance, the relative bias of $V$ is estimated as the ratio $\bar{V} / S_T^2$, and both the numerator and denominator are estimated quantities that will include some Monte Carlo error.
+To properly account for the Monte Carlo uncertainty of the ratio, one possibility is to use formulas for the standard errors of ratio estimators.
+Alternately, we can use general uncertainty approximation techniques such as the jackknife or bootstrap [@boos2015Assessing].
+The jackknife involves calculating a statistic of interest repeatedly, each time excluding one observation from the calculation.
+The variance of this set of one-left-out statistics then serves as a reasonable approximation to the actual sampling variance of the statistic calculated from the full sample.
-To properly assess the overall MCSE, we need to do something else.
-One approach is to use the _jackknife_ technique.
-Let $\bar{V}_{(j)}$ and $S_{T(j)}^2$ be the average squared standard error estimate and the true variance estimate calculated from the set of replicates __*that excludes replicate $j$*__, for $j = 1,...,R$.
+To apply the jackknife to assess MCSEs of relative bias or relative RMSE of a variance estimator, we will need to compute several statistics repeatedly.
+Let $\bar{V}_{(j)}$ and $S_{T(j)}^2$ be the average variance estimate and the empirical variance estimate calculated from the set of replicates __*that excludes replicate $j$*__, for $j = 1,...,R$.
 The relative bias estimate, excluding replicate $j$, would then be $\bar{V}_{(j)} / S_{T(j)}^2$. 
-Calculating all $R$ versions of this relative bias estimate and taking the variance of these $R$ versions yields the jackknife variance estimator:
-
+Calculating all $R$ versions of this relative bias estimate and taking the variance of these $R$ versions yields a jackknife MCSE:
 $$
-MCSE\left( \frac{ \widehat{SE}^2 }{SE^2} \right) = \frac{1}{R} \sum_{j=1}^R \left(\frac{\bar{V}_{(j)}}{S_{T(j)}^2} - \frac{\bar{V}}{S_T^2}\right)^2.
+MCSE\left( \frac{ \bar{V}}{S_T^2} \right) = \sqrt{\frac{1}{R} \sum_{j=1}^R \left(\frac{\bar{V}_{(j)}}{S_{T(j)}^2} - \frac{\bar{V}}{S_T^2}\right)^2}.
+(\#eq:MCSE-relative-bias-V)
+$$
+Similarly, a MCSE for the relative standard error of $V$ is
+$$
+MCSE\left( \frac{ S_V}{S_T^2} \right) = \sqrt{\frac{1}{R} \sum_{j=1}^R \left(\frac{S_{V(j)}}{S_{T(j)}^2} - \frac{S_V}{S_T^2}\right)^2},
+(\#eq:MCSE-relative-var-V)
+$$
+where $S_{V(j)}$ is the sample standard deviation of $V_1,...,V_R$, omitting replicate $j$.
+To compute a MCSE for the relative RMSE of $V$, we will need to compute the performance measure after omitting each observation in turn.
+Letting
+$$
+RRMSE_{V} = \frac{1}{S_{T}^2}\sqrt{(\bar{V} - S_{T}^2)^2 + \frac{R - 1}{R} S_{V}^2}
+$$
+and
+$$
+RRMSE_{V(j)} = \frac{1}{S_{T(j)}^2}\sqrt{(\bar{V}_{(j)} - S_{T(j)}^2)^2 + \frac{R - 1}{R} S_{V(j)}^2},
+$$
+a jackknife MCSE for the estimated relative RMSE of $V$ is
+$$
+MCSE\left( RRMSE_{V} \right) = \sqrt{\frac{1}{R} \sum_{j=1}^R \left(RRMSE_{V(j)} - RRMSE_{V}\right)^2}.
+(\#eq:MCSE-relative-rmse-V)
 $$
-This would be quite time-consuming to compute if we did it by brute force. However, a few algebra tricks provide a much quicker way. The tricks come from observing that
-
+Jackknife calculation would be cumbersome if we did it by brute force. However, a few algebra tricks provide a much quicker way. The tricks come from observing that
 $$
 \begin{aligned}
 \bar{V}_{(j)} &= \frac{1}{R - 1}\left(R \bar{V} - V_j\right) \\
+S_{V(j)}^2 &= \frac{1}{R - 2} \left[(R - 1) S_V^2 - \frac{R}{R - 1}\left(V_j - \bar{V}\right)^2\right] \\
 S_{T(j)}^2 &= \frac{1}{R - 2} \left[(R - 1) S_T^2 - \frac{R}{R - 1}\left(T_j - \bar{T}\right)^2\right]
 \end{aligned}
 $$
 These formulas can be used to avoid re-computing the mean and sample variance from every subsample.
-Instead, you calculate the overall mean and overall variance, and then do a small adjustment with each jackknife iteration.
-You can even implement this with vector processing in R!
+Instead, all we need to do is calculate the overall mean and overall variance, and then do a small adjustment with each jackknife iteration.
+Jackknife methods are useful for approximating MCSEs of other performance measures beyond just those for variance estimators.
+For instance, the jackknife is a convenient alternative for computing the MCSE of the empirical standard error or (raw) RMSE of a point estimator, which avoids the need to compute skewness or kurtosis.
+However, @boos2015Assessing note that the jackknife does not work for performance measures involving medians, although bootstrapping remains valid.
+
+### MCSE for Confidence Intervals and Hypothesis Tests
+
+Performance measures for confidence intervals and hypothesis tests are simple compared to those we have described for point and variance estimators.
+For evaluating hypothesis tests, the main measure is the rejection rate of the test, which is a proportion estimated as $r_\alpha$ (Equation \@ref(eq:rejection-rate-estimate)).
+A MCSE for the estimated rejection rate is
+$$
+MCSE(r_\alpha) = \sqrt{\frac{r_\alpha ( 1 - r_\alpha)}{R}}.
+(\#eq:MCSE-rejection-rate)
+$$
+This MCSE uses the estimated rejection rate to approximate its Monte Carlo error. 
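+In code, the estimated rejection rate and its MCSE take one line each. A brief sketch, where `p_vals` stands for a hypothetical vector of $p$-values, one per replication:
+
+```{r, eval = FALSE}
+# Sketch: rejection rate at alpha = .05 and its Monte Carlo standard error
+R <- length( p_vals )            # p_vals: hypothetical vector of p-values
+r_alpha <- mean( p_vals < 0.05 )
+MCSE_rejection <- sqrt( r_alpha * (1 - r_alpha) / R )
+```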
+When evaluating the validity of a test, we may expect the rejection rate to be fairly close to the nominal $\alpha$ level, in which case we could compute a MCSE using $\alpha$ in place of $r_\alpha$, taking $\sqrt{\alpha(1 - \alpha) / R}$.
+When evaluating power, we will not usually know the neighborhood of the rejection rate in advance of the simulation.
+However, a conservative upper bound on the MCSE can be derived by observing that the MCSE is maximized when $\rho_\alpha = \frac{1}{2}$, and so
+$$
+MCSE(r_\alpha) \leq \sqrt{\frac{1}{4 R}}.
+$$
+
+When evaluating confidence interval performance, we focus on coverage rates and expected widths.
+MCSEs for the estimated coverage rate work similarly to those for rejection rates.
+If the coverage rate is expected to be in the neighborhood of the intended coverage level $\beta$, then we can approximate the MCSE as
+$$
+MCSE(\widehat{\text{Coverage}}(A,B)) = \sqrt{\frac{\beta(1 - \beta)}{R}}.
+(\#eq:MCSE-coverage)
+$$
+Alternately, Equation \@ref(eq:MCSE-coverage) could be computed using the estimated coverage rate $\widehat{\text{Coverage}}(A,B)$ in place of $\beta$.
+
+Finally, the expected confidence interval width can be estimated as $\bar{W}$, with MCSE
+$$
+MCSE(\bar{W}) = \sqrt{\frac{S_W^2}{R}},
+(\#eq:MCSE-width)
+$$
+where $S_W^2$ is the sample variance of $W_1,...,W_R$, the widths of the confidence interval from each replication.

### Calculating MCSEs With the `simhelpers` Package

-The `simhelper` package is designed to calculate MCSEs (and the performance metrics themselves) for you.
-It is easy to use: take this set of simulation runs on the Welch dataset:
+The `simhelpers` package provides several functions for calculating most of the performance measures that we have reviewed, along with MCSEs for each performance measure.
+The functions are easy to use.
+Consider this set of simulation runs on the Welch dataset:

-```{r}
+```{r simhelpers-MCSEs}
library( simhelpers )
data( welch_res )
-welch <- welch_res %>%
-  filter( method == "t-test" ) %>%
-  dplyr::select( -method, -seed, -iterations )
-welch
+welch <-
+  welch_res %>%
+  dplyr::select(-seed, -iterations ) %>%
+  mutate(method = case_match(method, "Welch t-test" ~ "Welch", .default = method))
+
+head(welch)
```

-We can calculate performance metrics across all the range of scenarios.
-Here is the rejection rate:
+We can calculate performance measures across the full range of scenarios.
+Here is the rejection rate for the traditional $t$-test based on the subset of simulation results with sample sizes of $n_1 = n_2 = 50$ and a mean difference of 0, using $\alpha$ levels of .01 and .05:

```{r}
-welch_sub = filter( welch, n1 == 50, n2 == 50, mean_diff==0 )
-calc_rejection(welch_sub, p_val)
+welch_sub <- filter(welch, method == "t-test", n1 == 50, n2 == 50, mean_diff == 0 )
+
+calc_rejection(welch_sub, p_values = p_val, alpha = c(.01, .05))
```
+The column labeled `K_rejection` reports the number of replications used to calculate the performance measures.

-And coverage:
+Here is the coverage rate calculated for the same condition:

```{r}
-calc_coverage(welch_sub, lower_bound, upper_bound, mean_diff)
+calc_coverage(
+  welch_sub,
+  lower_bound = lower_bound, upper_bound = upper_bound,
+  true_param = mean_diff
+)
```

-Using `tidyverse` it is easy to process across scenarios (more on experimental design and multiple scenarios later):
+The performance functions are designed to be used within a `tidyverse`-style workflow, including on grouped datasets.
For instance, we can calculate rejection rates for every distinct scenario examined in the simulation:

```{r}
-welch %>% group_by(n1,n2,mean_diff) %>%
-  summarise( calc_rejection( p_values = p_val ) )
+all_rejection_rates <-
+  welch %>%
+  group_by( n1, n2, mean_diff, method ) %>%
+  summarise(
+    calc_rejection( p_values = p_val, alpha = c(.01, .05) )
+  )
+```
+The resulting summaries are reported in Table \@ref(tab:Welch-rejection).
+
+```{r Welch-rejection, echo = FALSE, message = FALSE}
+library(kableExtra)
+
+all_rejection_rates %>%
+  kbl(
+    digits = 3,
+    caption = "Rejection rates of conventional and Welch t-test for varying sample sizes and population mean differences."
+  ) %>%
+  kable_styling(
+    full_width = FALSE,
+    bootstrap_options = c("striped","compact","hover")
+  )
```
+
### MCSE Calculation in our Cluster RCT Example

-We can check our MCSEs for our performance measures to see if we have enough simulation trials to give us precise enough estimates to believe the differences we reported earlier.
+In Section \@ref(clusterRCTperformance), we computed performance measures for three point estimators of the school-level average treatment effect in a cluster RCT.
+We can carry out the same calculations using the `calc_absolute()` function from `simhelpers`, which also provides MCSEs for each measure.
+Examining the MCSEs is useful for checking that 1000 replications of the simulation are sufficient to provide reasonably precise estimates of the performance measures.
In particular, we have:

-```{r cluster_MCSE_calculation}
+```{r cluster-MCSE-calculation}
library( simhelpers )
-runs$ATE = ATE
+
runs %>%
-  summarise( calc_absolute( estimates = ATE_hat,
-                            true_param = ATE,
-                            criteria = c("bias","stddev", "rmse")) ) %>%
-  dplyr::select( -K_absolute ) %>%
-  knitr::kable(digits=3)
+  group_by(method) %>%
+  summarise(
+    calc_absolute(
+      estimates = ATE_hat, true_param = ATE,
+      criteria = c("bias","stddev", "rmse")
+    )
+  )
```

-We see the MCSEs are quite small relative to the linear regression bias term and all the SEs (`stddev`) and RMSEs: we have simulated enough runs to see the gross trends identified.
-We have _not_ simulated enough to for sure know if MLM and Agg are not slightly biased. Given our MCSEs, they could have true bias of around 0.01 (two MCSEs).
-
+We see the MCSEs are quite small relative to the linear regression bias term and all the SEs (`stddev`) and RMSEs. Results based on 1000 replications seem adequate to support our conclusions about the gross trends identified.
+We have _not_ simulated enough to rule out the possibility that the aggregation estimator and multilevel modeling estimator could be slightly biased. Given our MCSEs, they could have true bias of as much as 0.01 (two MCSEs).
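+
+As a sanity check on these packaged calculations, note that the MCSE of the bias is simply the standard error of the mean of the estimates, $S_T / \sqrt{R}$, so it can also be computed directly.
+Here is a minimal sketch, assuming the `runs` dataset and the true parameter `ATE` used above:
+```{r, eval = FALSE}
+runs %>%
+  group_by( method ) %>%
+  summarise(
+    bias = mean( ATE_hat ) - ATE,             # bias estimate
+    bias_mcse = sd( ATE_hat ) / sqrt( n() )   # MCSE of the bias estimate
+  )
+```
+The `bias_mcse` values should match the MCSEs that `calc_absolute()` reports for the bias.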
## Summary of Performance Measures
+
We list most of the performance criteria we saw in this chapter in the table below, for reference:

-| Criterion | Definition | Estimate |
-|---------------|-----------------------|----------------------|
-| Bias | $\text{E}(T) - \theta$ | $\bar{T} - \theta$ |
-| Median bias | $\text{M}(T) - \theta$ | $\tilde{T} - \theta$ |
-| Variance | $\text{E}\left[\left(T - \text{E}(T)\right)^2\right]$ | $S_T^2$ |
-| MSE | $\text{E}\left[\left(T - \theta\right)^2\right]$ | $\left(\bar{T} - \theta\right)^2 + S_T^2$ |
-| MAE | $\text{M}\left[\left|T - \theta\right|\right]$ | $\left[\left|T - \theta\right|\right]_{R/2}$ |
-| Relative bias | $\text{E}(T) / \theta$ | $\bar{T} / \theta$ | $\sqrt{S_T^2 / \left(R\theta^2\right)}$ |
-| Relative median bias | $\text{M}(T) / \theta$ | $\tilde{T} / \theta$ |
-| Relative MSE | $\text{E}\left[\left(T - \theta\right)^2\right] / \theta^2$ | $\frac{\left(\bar{T} - \theta\right)^2 + S_T^2}{\theta^2}$ |
+| Criterion | Definition | Estimator | Monte Carlo Standard Error |
+|-------------------------|---------------------------------------------------------|----------------------|----------------------------|
+| __Measures for point estimators__ ||||
+| Bias | $\E(T) - \theta$ | $\bar{T} - \theta$ | \@ref(eq:MCSE-bias) |
+| Median bias | $\M(T) - \theta$ | $m_T - \theta$ | \@ref(eq:MCSE-median) |
+| Trimmed bias | | | |
+| Variance | $\E\left[\left(T - \text{E}(T)\right)^2\right]$ | $S_T^2$ | \@ref(eq:MCSE-var) |
+| Standard error | $\sqrt{\E\left[\left(T - \text{E}(T)\right)^2\right]}$ | $S_T$ | \@ref(eq:MCSE-SE) |
+| Mean squared error | $\E\left[\left(T - \theta\right)^2\right]$ | $\left(\bar{T} - \theta\right)^2 + \frac{R - 1}{R}S_T^2$ | \@ref(eq:MCSE-MSE) |
+| Root mean squared error | $\sqrt{\E\left[\left(T - \theta\right)^2\right]}$ | $\sqrt{\left(\bar{T} - \theta\right)^2 + \frac{R - 1}{R} S_T^2}$ | \@ref(eq:MCSE-RMSE) |
+| Median absolute error | $\M\left[\left|T - \theta\right|\right]$ | $\left[\left|T - \theta\right|\right]_{R/2}$ | \@ref(eq:MCSE-median) |
+| Relative bias | $\E(T) / \theta$ | $\bar{T} / \theta$ | \@ref(eq:MCSE-relative-bias) |
+| Relative median bias | $\M(T) / \theta$ | $m_T / \theta$ | \@ref(eq:MCSE-relative-bias) |
+| Relative RMSE | $\sqrt{\E\left[\left(T - \theta\right)^2\right]} / \theta$ | $\frac{\sqrt{\left(\bar{T} - \theta\right)^2 + \frac{R - 1}{R}S_T^2}}{\theta}$ | \@ref(eq:MCSE-relative-bias) |

* Bias and median bias are measures of whether the estimator is systematically higher or lower than the target parameter.
* Variance is a measure of the __precision__ of the estimator---that is, how far it deviates _from its average_. We might look at the square root of this, to assess the precision in the units of the original measure. This is the true SE of the estimator.
-* Mean-squared error is a measure of __overall accuracy__, i.e. is a measure how far we typically are from the truth. We more frequently use the root mean-squared error, or RMSE, which is just the square root of the MSE.
+* Mean-squared error is a measure of __overall accuracy__, i.e., a measure of how far we typically are from the truth. We more frequently use the root mean squared error, or RMSE, which is just the square root of the MSE.
* The median absolute deviation (MAD) is another measure of overall accuracy that is less sensitive to outlier estimates. The RMSE can be driven up by a single bad egg. The MAD is less sensitive to this.

@@ -992,25 +1193,97 @@
In practice, many data analysis procedures produce multiple pieces of information about a target parameter.
For instance, a confidence interval is usually computed from a point estimate and its standard error.
Consequently, the performance of that confidence interval will be strongly affected by whether the point estimator is biased and whether the standard error tends to understate or overstate the true uncertainty.
Likewise, the performance of a hypothesis testing procedure will often strongly depend on the properties of the point estimator and standard error used to compute the test. -Thus, most simulations will involve evaluating a data analysis procedure on several metrics to arrive at a holistic understanding of its performance. +Thus, most simulations will involve evaluating a data analysis procedure on several measures to arrive at a holistic understanding of its performance. Moreover, the main aim of many simulations is to compare the performance of several different estimators or to determine which of several data analysis procedures is preferable. -For such aims, we will need to use the performance metrics to understand whether a set of procedures work differently, when and how one is superior to the other, and what factors influence differences in performance. -To fully understand the advantages and trade-offs among a set of estimators, we will generally need to compare them using several performance metrics. +For such aims, we will need to use the performance measures to understand whether a set of procedures work differently, when and how one is superior to the other, and what factors influence differences in performance. +To fully understand the advantages and trade-offs among a set of estimators, we will generally need to compare them using several performance measures. ## Exercises -### Brown and Forsythe (1974) {#Brown-Forsythe-performance} +### Brown and Forsythe (1974) results {#Brown-Forsythe-performance} -Continuing the exercises from the prior chapters, estimate rejection rates of the BFF\* test for the parameter values in the fifth line of Table 1 of Brown and Forsythe (1974). +1. Use the `generate_ANOVA_data` data-generating function for one-way heteroskedastic ANOVA (Section \@ref(case-anova-DGP)) and the data-analysis function you wrote for Exercise \@ref(BFFs-forever) to create a simulation driver function for the Brown and Forsythe simulations. 
-### Better confidence intervals {#cluster-RCT-t-confidence-intervals}
+2. Use your simulation driver to evaluate the Type-I error rate of the ANOVA $F$-test, Welch's test, and the BFF* test for a scenario with four groups, sample sizes $n_1 = 11$, $n_2 = 16$, $n_3 = 16$, $n_4 = 21$, equal group means $\mu_1 = \mu_2 = \mu_3 = \mu_4 = 0$, and group standard deviations $\sigma_1 = 3$, $\sigma_2 = 2$, $\sigma_3 = 2$, $\sigma_4 = 1$. Are all of the tests level-$\alpha$?
+
+3. Use your simulation driver to evaluate the power of each test for a scenario with group means of $\mu_1 = 0$, $\mu_2 = 0.2$, $\mu_3 = 0.4$, $\mu_4 = 0.6$, with sample sizes and group standard deviations as listed above. Which test has the highest power?
+
+### Size-adjusted power {#size-adjusted-power}
+
+When different hypothesis testing procedures have different rejection rates under the null hypothesis, it becomes difficult to interpret differences in the non-null power of the tests.
+One approach for conducting a fair comparison of testing procedures in this situation is to compute the _size-adjusted_ power of the tests.
+Size-adjusted power involves computing the rejection rate of a test using a different threshold $\alpha'$, selected so that the Type-I error rate of the test is equal to the desired $\alpha$ level.
+Specifically, letting $q_\alpha$ denote the $\alpha$-quantile of the distribution of $P$ under the null hypothesis, so that $\Prob(P < q_\alpha) = \alpha$ when the null holds, size-adjusted power is
+$$
+\rho^{adjusted}_\alpha(\theta) = \Prob(P < q_\alpha).
+$$
+To estimate size-adjusted power using simulation, we first need to estimate the adjusted threshold $q_\alpha$, which we can do by taking the $\alpha$-quantile of the $p$-values simulated under the null hypothesis, denoted $\hat{q}_\alpha$. We can then evaluate the rejection rate of the testing procedure under scenarios with other values of $\theta$ by computing
+$$
+r^{adjusted}_\alpha(\theta) = \frac{1}{R} \sum_{r=1}^R I(P_r < \hat{q}_\alpha).
+$$
+
+Compute the size-adjusted power of the ANOVA $F$-test, Welch's test, and the BFF* test for the scenario in part (3) of Exercise \@ref(Brown-Forsythe-performance).
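+
+As a hint for getting started: one way to estimate an adjusted threshold is to take the empirical $\alpha$-quantile of $p$-values simulated under the null hypothesis.
+Here is a minimal sketch, assuming hypothetical vectors `p_null` and `p_alt` of $p$-values simulated under the null and alternative scenarios, respectively:
+```{r, eval = FALSE}
+alpha <- 0.05
+threshold <- quantile( p_null, probs = alpha )  # empirical alpha-quantile of null p-values
+mean( p_alt < threshold )                       # size-adjusted power estimate
+```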
+
+### Three correlation estimators {#three-correlation-estimators}
+
+Consider the bivariate negative binomial distribution model described in Exercise \@ref(BVNB2).
+Suppose that we want to estimate the correlation parameter $\rho$ of the latent bivariate normal distribution.
+Without studying the statistical theory for this problem, we might think to use simulation to evaluate whether any common correlation measures work well for estimating this parameter.
+Potential candidate estimators include the usual sample Pearson's correlation, Spearman's rank correlation, or Kendall's $\tau$ coefficient.
+The latter two estimators might seem promising because they are based on the ranked data, so could be more appropriate than Pearson's correlation for frequency count variates.
+
+The following estimation function computes all three correlations, along with corresponding $p$-values for the null hypothesis of no association and confidence intervals computed using Fisher's $z$ transformation. The confidence interval calculations are developed for Pearson's correlation under bivariate normality, so they might not be appropriate for this data-generating process or for Spearman's or Kendall's correlations.
+Still, we can use simulation to see how well or how poorly they perform.
+
+```{r}
+three_corrs <- function(
+    data,
+    method = c("pearson","kendall","spearman"),
+    level = 0.95
+) {
+
+  r_est <- lapply(method, \(m) cor.test(data$C1, data$C2, method = m, exact = FALSE))
+  est <- sapply(r_est, \(x) as.numeric(x$estimate))
+  pval <- sapply(r_est, \(x) x$p.value)
+  z_est <- atanh(est)
+  se_z <- 1 / sqrt(nrow(data) - 3)
+  crit <- qnorm(1 - (1 - level) / 2)
+  ci_lo <- tanh(z_est - crit * se_z)
+  ci_hi <- tanh(z_est + crit * se_z)
+
+  data.frame(
+    stat = method,
+    r = est,
+    z = z_est,
+    se_z = se_z,
+    pval = pval,
+    ci_lo = ci_lo,
+    ci_hi = ci_hi
+  )
+
+}
+```
+
+1. Combine your data-generating function and `three_corrs()` into a simulation driver.
+
+2. Use your simulation driver to generate 500 replications of the simulation for a scenario with $N = 20$, $\mu_1 = \mu_2 = 5$, $p_1 = p_2 = 0.5$, and $\rho = 0.7$.
+
+3. Compute the bias, empirical standard error, and RMSE of each correlation estimator, along with corresponding MCSEs. Which correlation estimator is most accurate?
+
+4. Compute the coverage rate and expected width of the confidence intervals based on each correlation estimator. Do any of the estimators have reasonable coverage? If so, which has the best expected width?
+
+5. Use your simulation driver to estimate the Type-I error rate of the hypothesis tests for each correlation coefficient for a scenario with $N = 20$, $\mu_1 = \mu_2 = 5$, $p_1 = p_2 = 0.5$, and $\rho = 0$. Are any of the tests level-$\alpha$?
+
+### Confidence interval comparison {#cluster-RCT-t-confidence-intervals}

Consider the estimation functions for the cluster RCT example, as given in Section \@ref(multiple-estimation-procedures).
-Modify the functions to return __both__ normal Wald-type 95% confidence intervals and robust confidence intervals based on $t$ distributions with Satterthwaite degrees of freedom.
+Modify the functions to return __both__ normal Wald-type 95% confidence intervals (as computed in Section \@ref(cluster-RCT-CI-coverage)) and cluster-robust confidence intervals based on $t$ distributions with Satterthwaite degrees of freedom.
For the latter, use `conf_int()` from the `clubSandwich` package, as in the following example code:

```{r, eval = FALSE}
library(clubSandwich)
+
M1 <- lme4::lmer(
  Yobs ~ 1 + Z + (1 | sid),
  data = dat

@@ -1031,43 +1304,44 @@
conf_int(M3, cluster = dat$sid, vcov = "CR2")
```

Pick some simulation parameters and estimate the coverage and interval width of both types of confidence intervals.
+How do the normal Wald-type intervals compare to the cluster-robust intervals?
-### Cluster RCT simulation under a strong null hypothesis
+### Jackknife calculation of MCSEs for RMSE {#jackknife-MCSE}

-### Jackknife calculation of MCSEs {#jackknife-MCSE}
+The following code generates 100 replications of a simulation of three average treatment effect estimators in a cluster RCT, using the simulation driver function that we developed in Section \@ref(bundle-sim-demo) from components described in Sections \@ref(case-cluster) and \@ref(multiple-estimation-procedures).

-Implement the jackknife as described above in code. Check your answers against the `simhelpers` package for the built-in `t_res` dataset:
-```{r}
-library( simhelpers )
-calc_relative(data = t_res, estimates = est, true_param = true_param)
+```{r, eval = FALSE}
+set.seed( 20251029 )
+runs_val <- sim_cluster_RCT(
+  reps = 100,
+  J = 16, n_bar = 20, alpha = 0.5,
+  gamma_1 = 0.3, gamma_2 = 0.8,
+  sigma2_u = 0.25, sigma2_e = 0.75
+)
```

-### Distribution theory for person-level average treatment effects {#cluster-RCT-SPATE}
+Compute the RMSE of each estimator, and use the jackknife technique described in Section \@ref(MCSE-for-relative-variance) to compute a MCSE for the RMSE.
+Check your results against those from `calc_absolute()` in the `simhelpers` package.

-### Multiple scenarios {#multiple-scenario-performance}
+### Jackknife calculation of MCSEs for RMSE ratios {#jackknife-MCSE-ratio}

-As foreground to the following chapters, can you explore multiple scenarios for the cluster RCT example to see if the trends are common? First write a function that takes a parameter, runs the entire simulation, and returns the results as a small table. You pick which parameter, e.g., average treatment effect, `alpha`, or whatever you like), that you wish to vary. Here is a skeleton for the function:
+Continuing from Exercise \@ref(jackknife-MCSE), compute the ratio of the RMSE of each estimator to the RMSE of the linear regression estimator.
Use the jackknife technique to compute a MCSE for these RMSE ratios.

-```{r, eval=FALSE}
-my_simulation <- function( my_param ) {
-  # call the sim_function() simulation function from the end of last
-  # chapter, setting the parameter you want to vary to my_param
-
-  # Analyze the results, generating a table of performance metrics,
-  # e.g., bias or coverage. Make sure your analysis is a data frame,
-  # like we saw earlier this chapter.
-
-  # Return results
-}
-```
-Then use code like the following to generate a set of results measured as a function of a varying parameter:
+### Distribution theory for person-level average treatment effects {#cluster-RCT-SPATE}

-```{r, eval=FALSE}
-vals = seq( start, stop, length.out = 5 )
-res = map_df( vals, my_simulation )
-```
+Section \@ref(case-cluster) described a data-generating process for a cluster-randomized experiment in which the school-specific treatment effects varied according to the size of the school.
+The auxiliary model for the size of school $j$ was
+$$
+n_j \sim \text{Unif}\left[ (1-\alpha)\bar{n}, (1+\alpha)\bar{n} \right],
+$$
+where $\bar{n}$ is the average school size and $0 \leq \alpha < 1$ determines the degree of variation in school sizes.
+The data-generating process for the outcome data was
+$$
+Y_{ij} = \gamma_{0} + \gamma_{1} Z_j + \gamma_{2} Z_j S_j + u_j + \epsilon_{ij},
+$$
+where $Y_{ij}$ is the outcome for student $i$ in school $j$, $Z_j$ is an indicator for whether school $j$ is assigned to treatment $(Z_j = 1)$ or control $(Z_j = 0)$, and $S_j = \frac{n_j - \bar{n}}{\bar{n}}$.
+The error terms $u_j$ and $\epsilon_{ij}$ are both assumed to be normally distributed with zero means.

Under this model, the average treatment effect for school $j$ is $\tau_j = \gamma_1 + \gamma_2 S_j$.
Because $\E(S_j) = 0$ by construction, the average of the school-specific treatment effects is $\gamma_1$. This is the school-level population average treatment effect. But what is the student-level population average treatment effect?

Use the properties of the uniform distribution to find the student-level population average treatment effect $\E\left( \frac{n_j}{\bar{n}} \times \tau_j \right)$. Check your derivation by simulating a large sample of school sizes and school-specific treatment effects.

diff --git a/060-multiple-scenarios.Rmd b/060-multiple-scenarios.Rmd
new file mode 100644
index 0000000..f61d95e
--- /dev/null
+++ b/060-multiple-scenarios.Rmd
@@ -0,0 +1,710 @@
+---
+output: html_document
+editor_options:
+  chunk_output_type: console
+---
+
+
+```{r setup-multiple-scenarios, include=FALSE}
+library( tidyverse )
+library( purrr )
+library( simhelpers )
+
+options(list(dplyr.summarise.inform = FALSE))
+theme_set( theme_classic() )
+
+### Code for the running examples
+source("case_study_code/generate_ANOVA_data.R")
+source("case_study_code/ANOVA_Welch_F.R")
+source( "case_study_code/r_bivariate_Poisson.R" )
+source( "case_study_code/r_and_z.R" )
+source( "case_study_code/evaluate_CIs.R" )
+
+source( "case_study_code/clustered_data_simulation.R" )
+
+dat <- gen_cluster_RCT( n=5, J=3, p=0.5,
+                        gamma_0=0, gamma_1=0.2, gamma_2=0.2,
+                        sigma2_u = 0.4, sigma2_e = 1,
+                        alpha = 0.5 )
+
+```
+
+# (PART) Systematic Simulations {-}
+
+# Simulating across multiple scenarios {#simulating-multiple-scenarios}
+
+In Chapter \@ref(simulation-structure), we described the general structure of basic simulations as following four steps: generate, analyze, repeat, and summarize.
+The principles of tidy simulation suggest that each of these steps should be represented by its own function or set of code.
+For any particular simulation we have a data-generating function and a data-analysis function, which can be bundled together into a simulation driver that repeatedly executes the generate-and-analyze process; we also have a summarization function (or set of code) that computes performance measures across the replications of the simulation process. +In the previous section of the book, we focused on creating code that will run a simulation for a single scenario, going from a set of parameter values to a set of performance measures. + +In practice, simulation studies often involve examining a range of different values, such as multiple levels of a focal parameter value and potentially also multiple levels for auxiliary parameters, sample size, and other design parameters. +In this chapter, we demonstrate an approach for executing simulations across multiple scenarios and organizing the results for further analysis. +Our focus here is on the programming techniques and computational structure. +In the next chapter, we discuss some of the deeper theoretical challenges of designing multifactor simulations. +Then in subsequent chapters, we examine tools for analyzing and making sense of results from more complex, multifactor simulation designs. + +In Chapter \@ref(simulation-structure), we described three further steps involved in systematic simulations: _designing_ a set of scenarios to examine, _executing_ across multiple scenarios, and _synthesizing_ the performance results across scenarios. +The same principles of tidy simulation apply to these steps as well. +In this chapter, we will demonstrate how to create a dataset representing the experimental design of the simulation, how to execute a simulation driver across multiple scenarios, and how to organize results for synthesis. 
+
+## Simulating across levels of a single factor
+
+Even if we are only using simulation in an ad hoc, exploratory way, we will often be interested in examining the performance of a model or estimation method in more than one scenario.
+We have already seen examples of this in Chapter \@ref(t-test-simulation), where we looked at the coverage rate of a confidence interval for the mean of a geometric distribution.
+In Section \@ref(simulating-across-different-scenarios), we applied a simulation driver function across a set of sample sizes ranging from 10 to 300, finding that the coverage rate improves towards the desired level as sample size increases.
+Simple forms of systematic exploration such as this are useful in many situations.
+For instance, when using Monte Carlo simulation for study planning, we might examine simulated power over a range of the target parameter to identify the smallest parameter value for which power is above a desired level.
+If we are using simulation simply to study an unfamiliar model, we might vary a key parameter over a wide range to see how the performance of an estimator changes.
+These forms of exploration can be understood as single-factor simulations.
+
+To demonstrate a single-factor simulation, we revisit the case study on heteroskedastic analysis of variance, as studied by @brown1974SmallSampleBehavior and developed in Chapter \@ref(case-ANOVA).
+Suppose that we want to understand how the power of Welch's test varies as a function of the maximum distance between group means.
+The data-generating function `generate_ANOVA_data()` that we developed previously was set up to take a vector of means per group, so we re-parameterize the function to define the group means based on the maximum difference (`max_diff`), under the assumption that the means are equally spaced between zero and the maximum difference.
+We will also re-parameterize the function in terms of the total sample size and the fraction of observations allocated to each group.
+The revised function is
+```{r}
+generate_ANOVA_new <- function(
+    G, max_diff, sigma_sq = 1, N = 20, allocation = "equal"
+) {
+
+  mu <- seq(0, max_diff, length.out = G)
+  if (identical(allocation, "equal")) {
+    allocation <- rep(1 / G, times = G)
+  } else {
+    allocation <- rep(allocation, length.out = G)
+  }
+
+  N_g <- round(N * allocation)
+
+  group <- factor(rep(1:G, times = N_g))
+  mu_long <- rep(mu, times = N_g)
+  sigma_long <- rep(rep(sqrt(sigma_sq), length.out = G), times = N_g)
+
+  # use sum(N_g) rather than N, since rounding the allocation can make
+  # the group sizes sum to slightly more or less than N
+  x <- rnorm(sum(N_g), mean = mu_long, sd = sigma_long)
+  sim_data <- tibble(group = group, x = x)
+
+  return(sim_data)
+}
+```
+Now we can create a simulation driver by combining this new data-generating function with the data-analysis function we created in Section \@ref(ANOVA-hypothesis-testing-function):
+```{r}
+sim_ANOVA <- bundle_sim(f_generate = generate_ANOVA_new, f_analyze = ANOVA_Welch_F)
+```
+To compute power, we generate a set of simulated $p$-values and then summarize the rejection rate of the Welch test at $\alpha$ levels of .01 and .05:
+```{r}
+sim_ANOVA(100, G = 4, max_diff = 0.5, sigma_sq = c(1, 2, 2, 3), N = 40) |>
+  calc_rejection(p_values = Welch, alpha = c(.01, .05))
+```
+Now we can apply this process for several different scenarios with different values of `max_diff`.
+
+Following the principles of tidy simulation, it is useful to represent the design of a systematic simulation as a dataset with a row for each scenario to be considered.
+For a single-factor simulation, the experimental design consists of a dataset with just a single variable:
+```{r}
+Welch_design <- tibble(max_diff = seq(0, 0.8, 0.1))
+str(Welch_design)
+```
+To compute simulation results for each of these scenarios, we can use the `map()` function from the `purrr` package.
+This function takes a list of values as the input, then calls a function on each value.
+Our `sim_ANOVA()` function has several further arguments that need to be specified.
+Because these will be the same for every value of `max_diff`, we can include them as additional arguments in `map()`, and they will be used every time `sim_ANOVA()` is called.
+Here is one way to code this:
+```{r, eval = FALSE}
+Welch_results <-
+  Welch_design %>%
+  mutate(
+    pvals = map(max_diff, sim_ANOVA, reps = 100, G = 4,
+                sigma_sq = c(1, 2, 2, 3), N = 40)
+  )
+```
+Another way to accomplish the same thing is to specify an anonymous function (also called a lambda) in the `map()` call.
+This syntax makes it clearer that the additional arguments are used in every evaluation of `sim_ANOVA()`:
+```{r}
+Welch_results <-
+  Welch_design %>%
+  mutate(
+    pvals = map(max_diff, ~ sim_ANOVA(100, G = 4, max_diff = .x,
+                                      sigma_sq = c(1, 2, 2, 3),
+                                      N = 40))
+  )
+```
+In the resulting dataset, the `pvals` variable is a list, with each entry consisting of a tibble of simulated p-values.
+Using the `unnest()` function simplifies the structure of the results, making it easier to do performance calculations:
+```{r}
+Welch_results_long <- Welch_results %>% unnest(pvals)
+```
+The resulting dataset has `r nrow(Welch_results_long)` rows, consisting of `r nrow(Welch_results_long) / nrow(Welch_design)` replications for each of `r nrow(Welch_design)` scenarios.
+To compute power levels, we use `calc_rejection()` after grouping the results by scenario:
+```{r}
+Welch_power <-
+  Welch_results_long %>%
+  group_by(max_diff) %>%
+  summarize(
+    calc_rejection(p_values = Welch, alpha = c(.01,.05))
+  )
+
+Welch_power
+```
+The power levels are quite low, with the $\alpha = .05$-level tests reaching a maximum power of `r round(max(Welch_power$rej_rate_05),2)` when `max_diff` is `r round(max(Welch_power$max_diff),2)`.
+The low power levels make sense here because we are looking at a scenario with a very small sample size of just 10 observations per group.
+
+### A performance summary function
+
+These performance calculations focus only on the results for the Welch test, but we might also be interested in comparing Welch's test to the conventional ANOVA $F$-test.
+One way to carry out the performance calculations for both tests is to write a small function that encapsulates the performance calculations, then use it in place of `calc_rejection()`.
+The function should take a set of simulation results as input and provide a dataset of performance measures as output.
+Here is one possible implementation, along with code that uses `map()` to apply it to each set of simulated p-values:
+```{r}
+summarize_power <- function(data, alpha = c(.01,.05)) {
+  ANOVA <- calc_rejection(data, p_values = ANOVA, alpha = alpha, format = "long")
+  Welch <- calc_rejection(data, p_values = Welch, alpha = alpha, format = "long")
+  bind_rows(
+    ANOVA = ANOVA,
+    Welch = Welch,
+    .id = "test"
+  )
+}
+
+power_levels <-
+  Welch_results %>%
+  mutate(
+    power = map(pvals, summarize_power, alpha = c(.01, .05))
+  ) %>%
+  dplyr::select(-pvals) %>%
+  unnest(power)
+
+power_levels
+```
+
+### Adding performance calculations to the simulation driver
+
+Now that we have a function for carrying out the performance calculations, we could consider incorporating this step into the simulation driver function.
+That way, we can call the simulation driver function with a set of parameter values and it will return a table of performance summaries.
+The `bundle_sim()` function from `simhelpers` will create such a function for us, by combining a performance calculation function with the data-generating and data-analysis functions:
+```{r}
+sim_ANOVA_full <- bundle_sim(
+  f_generate = generate_ANOVA_new,
+  f_analyze = ANOVA_Welch_F,
+  f_summarize = summarize_power
+)
+
+args(sim_ANOVA_full)
+```
+The resulting function includes an input argument that controls which alpha levels to use in the rejection rate calculations.
+The bundled simulation driver also includes an additional option called `summarize`, which allows the user to control whether to apply the performance calculation function to the simulation output. +The default value of `TRUE` means that calling the function will compute rejection rates: +```{r} +sim_ANOVA_full( + reps = 100, G = 4, max_diff = 0.5, + sigma_sq = c(1, 2, 2, 3), N = 40, + alpha = c(.01, .05) +) +``` +Setting `summarize = FALSE` will produce a dataset with the raw simulation output, with one row per replication, ignoring the additional inputs related to the performance calculations: +```{r} +sim_ANOVA_full( + reps = 4, G = 4, max_diff = 0.5, + sigma_sq = c(1, 2, 2, 3), N = 40, + summarize = FALSE +) +``` + +This more elaborate simulation driver makes execution of the simulations a bit more streamlined. +The full set of performance summaries can now be computed by calling `map()` with the full driver: +```{r} +set.seed(20251031) + +power_levels <- + Welch_design %>% + mutate( + res = map(max_diff, sim_ANOVA_full, reps = 500, G = 4, + sigma_sq = c(1, 2, 2, 3), N = 40, + alpha = c(.01, .05)) + ) %>% + unnest(res) +``` + +The results are organized in a way that facilitates visualization of the power levels: + +```{r hetero-ANOVA-power, fig.width = 7, fig.height = 3.5, out.width = "100%"} + +ggplot(power_levels) + + aes(max_diff, rej_rate, color = test) + + geom_point() + geom_line() + + scale_y_continuous(limits = c(0, NA), expand = expansion(0,c(0,0.01))) + + facet_wrap(~ alpha, scales = "free", labeller = label_bquote(alpha == .(alpha))) + + labs(x = "Maximum mean difference", y = "Power") + + theme_minimal() + + theme(legend.position ="inside", legend.position.inside = c(0.08,0.85)) +``` + +Under the conditions examined here, both tests appear to have similar power. +At the .05 $\alpha$ level, the power of the Welch test is nearly identical to that of the ANOVA $F$ test. 
+At the .01 $\alpha$ level, there may be a discrepancy when `max_diff` is 0.8, but the apparent difference might be attributable to Monte Carlo error.
+Although the tests appear to work similarly here, these results are based on a very specific set of conditions, including equally sized groups and a specific configuration of within-group variances.
+A natural further question is whether this pattern holds under other configurations of sample allocations, total sample size, or within-group variances.
+These questions can be examined by expanding the simulation design to further scenarios.
+
+## Simulating across multiple factors
+
+Consider a simulation study examining the performance of confidence intervals for Pearson's correlation coefficient under a bivariate Poisson distribution.
+We examined this data-generating model in Section \@ref(BVPois-example), implementing it in the function `r_bivariate_Poisson()`. The model has three parameters (the means of each variate, $\mu_1, \mu_2$ and the correlation $\rho$) and there is one design parameter (sample size, $N$).
+Thus, we could in principle examine up to four factors.
+
+Using these parameters directly as factors in the simulation design will lead to considerable redundancy because of the symmetry of the model: generating data with $\mu_1 = 10$ and $\mu_2 = 5$ would lead to the same correlation as using $\mu_1 = 5$ and $\mu_2 = 10$.
+It is useful to re-parameterize to reduce redundancy and simplify things.
+We will therefore define the simulation conditions by always treating $\mu_1$ as the larger variate and by specifying the ratio of the smaller to the larger mean as $\lambda = \mu_2 / \mu_1$.
+We might then examine the following factors:
+
+* the sample size, with values of $N = 10, 20$, or $30$
+* the mean of the larger variate, with values of $\mu_1 = 4, 8$, or $12$
+* the ratio of means, with values of $\lambda = 0.5$ or $1.0$
+* the true correlation, with values ranging from $\rho = 0.0$ to $0.7$ in steps of $0.1$ + +The above parameters describe a $3 \times 3 \times 2 \times 8$ factorial design, where each element is the number of levels for that factor. This is a four-factor experiment, because we have four different things we are varying. + +To implement this design in code, we first save the simulation parameters as a list with one entry per factor, where each entry consists of the levels that we would like to explore. +We will run a simulation for every possible combination of these values. +Here is code that generates all of the scenarios given the above design, storing these combinations in a data frame, `params`, that represents the full experimental design: + +```{r make_Pearson_sim_dataframe} +design_factors <- list( + N = c(10, 20, 30), + mu1 = c(4, 8, 12), + lambda = c(0.5, 1.0), + rho = seq(0.0, 0.7, 0.1) +) + +lengths(design_factors) + +params <- expand_grid( !!!design_factors ) +params +``` + +We use `expand_grid()` from the `tidyr` package to create all possible combinations of the four factors.[^expand-grid-wtf] +We have a total of $`r paste(lengths(design_factors), collapse = " \\times ")` = `r nrow(params)`$ rows, each row corresponding to a simulation scenario to explore. +With multifactor experiments, it is easy to end up running a lot of experiments! + +[^expand-grid-wtf]: `expand_grid()` is set up to take one argument per factor of the design. A clearer example of its natural syntax is: + ```{r} + params <- expand_grid( + N = c(10, 20, 30), + mu1 = c(4, 8, 12), + lambda = c(0.5, 1.0), + rho = seq(0.0, 0.7, 0.1) + ) + ``` + However, we generally find it useful to create a list of design factors before creating the full grid of parameter values, so we prefer to make `design_factors` first. 
To use `expand_grid()` on a list, we need to use `!!!`, the splice operator from the `rlang` package, which treats `design_factors` as a set of arguments to be passed to `expand_grid()`. The syntax does look a bit wacky, but it is succinct and useful.
+
+## Using pmap to run multifactor simulations {#using-pmap-to-run-multifactor-simulations}
+
+Once we have selected factors and levels for simulation, we next need to run the simulation code across all of our factor combinations.
+Conceptually, each row of our `params` dataset represents a single simulation scenario, and we want to run our simulation code for each of these scenarios.
+We would thus call our simulation function, using all the values in that row as parameters to pass to the function.
+
+One way to call a function on each row of a dataset in this manner is by using `pmap()` from the `purrr` package.
+`pmap()` marches down a set of lists, running a function on each $p$-tuple of elements, taking the $i^{th}$ element from each list for iteration $i$, and passing them as parameters to the specified function.
+`pmap()` then returns the results of this sequence of function calls as a list of results.[^pmap-variants]
+Because R's `data.frame` objects are also sets of lists (where each variable is a vector, which is a simple form of list), `pmap()` also works seamlessly on `data.frame` or `tibble` objects.
+
+[^pmap-variants]: Just like `map()` or `map2()`, `pmap()` has variants such as `_dbl` or `_dfr`.
+These variants automatically stack or convert the list of things returned into a tidier collection (for `_dbl` it will convert to a vector of numbers, for `_dfr` it will stack the results to make a large tibble, assuming each returned item is a little tibble).
+
+Here is a small illustration of `pmap()` in action:
+```{r}
+
+some_function <- function( a, b, theta, scale ) {
+  scale * (a + theta*(b-a))
+}
+
+args_data <- tibble( a = 1:3, b = 5:7, theta = c(0.2, 0.3, 0.7) )
+purrr::pmap( args_data, .f = some_function, scale = 10 )
+```
+
+One important constraint of `pmap()` is that the variable names over which to iterate must correspond _exactly_ to arguments of the function to be evaluated.
+In the above example, `args_data` must have column names that correspond to the arguments of `some_function`.
+For functions with additional arguments that are not manipulated, extra parameters can be passed after the function name (as in the `scale` argument in this example).
+These will also be passed to each function call, but will be the same for all calls.
+
+Let's now implement this technique for our simulation of confidence intervals for Pearson's correlation coefficient.
+In Section \@ref(estimation-functions), we developed a function called `r_and_z()` for computing confidence intervals for Pearson's correlation using Fisher's $z$ transformation;
+then in Section \@ref(assessing-confidence-intervals), we wrote a function called `evaluate_CIs()` for evaluating confidence interval coverage and average width.
+We can bundle `r_bivariate_Poisson()`, `r_and_z()`, and `evaluate_CIs()` into a simulation driver function with `bundle_sim()`:
+```{r}
+library(simhelpers)
+
+Pearson_sim <- bundle_sim(
+  f_generate = r_bivariate_Poisson, f_analyze = r_and_z, f_summarize = evaluate_CIs
+)
+args(Pearson_sim)
+```
+This function will run a simulation for a given scenario:
+```{r}
+Pearson_sim(1000, N = 10, mu1 = 5, mu2 = 5, rho = 0.3)
+```
+
+In order to call `Pearson_sim()`, we will need to ensure that the columns of the `params` dataset correspond to the arguments of the function.
+Because we re-parameterized the model in terms of $\lambda$, we will first need to compute the parameter value for $\mu_2$ and remove the `lambda` variable because it is not an argument of `Pearson_sim()`:
+```{r}
+params_mod <- 
+  params %>%
+  mutate(mu2 = mu1 * lambda) %>%
+  dplyr::select(-lambda)
+```
+
+Now we can use `pmap()` to run the simulation for all `r nrow(params)` parameter settings:
+```{r secret-run-Pearson-Poisson-sims, include=FALSE}
+# (See below this block for book code)
+
+if ( !file.exists( "results/Pearson_Poisson_results.rds" ) ) {
+  # Secret Run code in parallel for speedup
+  library(future)
+  library(furrr)
+  plan(multisession)
+  set.seed(20250718)
+  sim_results <-
+    params_mod %>%
+    mutate(res = future_pmap(., .f = Pearson_sim, reps = 1000, .options = furrr_options(seed = TRUE) ) )
+
+  write_rds( sim_results, file = "results/Pearson_Poisson_results.rds" )
+} else {
+  sim_results <- read_rds("results/Pearson_Poisson_results.rds")
+}
+
+```
+
+```{r run-Pearson-sims, eval = FALSE}
+sim_results <- params
+sim_results$res <- pmap(params_mod, Pearson_sim, reps = 1000 )
+```
+
+The above code calls our `Pearson_sim()` function for each row in the list of scenarios we want to explore.
+Conveniently, we can store the results as a new variable in the same dataset.
+
+```{r}
+sim_results
+```
+
+This may look a bit peculiar: we are storing a set of dataframes (our result) in our original dataframe.
+This is actually ok in R: our results will be in what is called a __list-column__, where each element in our list column is the little summary of our simulation results for that scenario.
+For instance, if we want to examine the results from the third scenario, we can pull it out as follows:
+
+```{r}
+sim_results$res[[3]]
+```
+
+List columns are neat, but hard to work with.
+To turn the list-column into normal data, we can use `unnest()` to expand the `res` variable, replicating the values of the main variables once for each row in the nested dataset: + +```{r} +sim_results <- unnest(sim_results, cols = res) +sim_results +``` + +Putting all of this together into a tidy workflow leads to the following: + +```{r, eval = FALSE} +sim_results <- + params %>% + mutate( + mu2 = mu1 * lambda, + reps = 1000 + ) %>% + mutate( + res = pmap(dplyr::select(., -lambda), .f = Pearson_sim) + ) %>% + unnest(cols = res) +``` + +As an alternative approach, the `evaluate_by_row()` function from the `simhelpers` package accomplishes the same thing as the `pmap()` calculations inside the `mutate()` step of the above code. Its syntax is a bit more concise: + +```{r, eval=FALSE} +sim_results <- + params %>% + mutate( mu2 = mu1 * lambda ) %>% + evaluate_by_row( Pearson_sim, reps = 1000 ) +``` +An advantage of `evaluate_by_row()` is that the input dataset can include extra variables (such as `lambda`). +Another advantage is that it is easy to run the calculations in parallel; see Chapter \@ref(parallel-processing). + +As a final step, we save our results using tidyverse's `write_rds()` (for background on this function, see [R for Data Science, Section 7.5](https://r4ds.hadley.nz/data-import.html#sec-writing-to-a-file)). +We first ensure we have a directory by making one via `dir.create()` (see Chapter \@ref(saving-files) for more on files): + +```{r, eval=FALSE} +dir.create( "results", showWarnings = FALSE ) +write_rds( sim_results, file = "results/Pearson_Poisson_results.rds" ) +``` + +We now have a complete set of simulation results for all of the scenarios we specified. + +## When to calculate performance metrics + +For a single-scenario simulation, we repeatedly generate and analyze data, and then assess the performance across the repetitions. 
+When we extend this process to multifactor simulations, we have a choice: do we compute performance measures for each simulation scenario as we go (inside) or do we compute all of them after we get all of our individual results (outside)? +There are pros and cons to each approach. + +### Aggregate as you simulate (inside) + +The *inside* approach runs a stand-alone simulation for each scenario of interest. For each combination of factors, we simulate data, apply our estimators, assess performance, and return a table with summary performance measures. We can then stack these tables to get a dataset with all of the results, ready for analysis. + +This is the approach we illustrated above. It is straightforward and streamlined: we already have a method to run simulations for a single scenario, and we just repeat it across multiple scenarios and combine the outputs. +After calling `pmap()` (or `evaluate_by_row()`) and stacking the results, we end up with a dataset containing all the simulation conditions, one simulation context per row (or maybe we have sets of several rows for each simulation context, with one row for each method), with the columns consisting of the simulation factors and calculated performance measures. +This table of performance measures is exactly what we need to conduct further analysis and draw conclusions about how the estimators work. + +The primary advantages of the inside strategy are that it is easy to modularize the simulation code and it produces a compact dataset of results, minimizing the number and size of files that need to be stored. +On the con side, calculating summary performance measures inside of the simulation driver limits our ability to add new performance measures on the fly or to examine the distribution of individual estimates. +For example, say we wanted to check if the distribution of Fisher-z estimates in a particular scenario was right-skewed, perhaps because we are worried that the estimator sometimes breaks down. 
+We might want to make a histogram of the point estimates, or calculate the skew of the estimates as a performance measure.
+Because the individual estimates are not saved, we would have no way of investigating these questions without rerunning the simulation for that condition.
+In short, the inside strategy minimizes disk space but constrains our ability to explore or revise performance calculations.
+
+### Keep all simulation runs (outside)
+
+The _outside_ approach involves retaining the entire set of estimates from every replication, with each row corresponding to an estimate for a given simulated dataset.
+The benefit of the outside approach is that it allows us to add or change how we calculate performance measures without re-running the entire simulation.
+This is especially important if the simulation is time-intensive, such as when the estimators being evaluated are computationally expensive.
+The primary disadvantage of the outside approach is that it produces large amounts of data that need to be stored and further manipulated.
+Thus, the outside strategy maximizes flexibility, at the cost of increased dataset size.
+
+In our Pearson correlation simulation, we initially followed the inside strategy. To move to the outside strategy, we can set the `summarize` argument of `Pearson_sim()` to `FALSE` so that the simulation driver returns a row for every replication:
+```{r do_power_sim_full, cache=TRUE}
+Pearson_sim(reps = 4, N = 15, mu1 = 5, mu2 = 5, rho = 0.5, summarize = FALSE)
+```
+
+We then save the entire set of estimates, rather than the performance summaries.
+This result file will have $R$ times as many rows as the file of summarized results. In practice, these results can quickly get to be extremely large.
+On the other hand, disk space is cheap.
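+
+To get a rough sense of the scale involved, here is a back-of-envelope count of the rows in the raw (outside) results for the design above, assuming one row of estimates per replication:
+```{r}
+# One row per replication per scenario, using values from the design above:
+# 3 x 3 x 2 x 8 = 144 scenarios, each run for 1000 replications.
+scenarios <- 3 * 3 * 2 * 8
+reps <- 1000
+scenarios * reps
+```
+That is 144,000 rows of raw estimates, compared to just 144 rows of performance summaries under the inside strategy.
+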
+Here we run the same experiment as in Section \@ref(using-pmap-to-run-multifactor-simulations), but storing the individual replications instead of just the summarized results: + +```{r secret_Pearson_full, include=FALSE} +# (See below this block for book code) + +if ( !file.exists( "results/Pearson_Poisson_results_full.rds" ) ) { + # Secret Run code in parallel for speedup + library(future) + library(furrr) + plan(multisession) + set.seed(20250718) + sim_results_full <- + params_mod %>% + mutate( + res = future_pmap( + ., .f = Pearson_sim, + reps = 1000, summarize = FALSE, + .options = furrr_options(seed = TRUE) + ) + ) %>% + unnest(res) + + write_rds( sim_results_full, file = "results/Pearson_Poisson_results_full.rds" ) +} else { + sim_results_full <- read_rds("results/Pearson_Poisson_results_full.rds") +} +``` + +```{r Pearson_all_rows, eval=FALSE} +sim_results_full <- + params %>% + mutate( mu2 = mu1 * lambda ) %>% + evaluate_by_row( Pearson_sim, reps = 1000, summarize = FALSE ) + +write_rds( sim_results_full, file = "results/Pearson_Poisson_results_full.rds" ) +``` + +We end up with many more rows and a much larger file. +One small tweak to this workflow will reduce the file size by keeping the results from each replication in a list-column rather than unnesting them. 
Here we set `nest_results = TRUE` in the call to `evaluate_by_row()`: + +```{r secret_Pearson_nested, include=FALSE} +# (See below this block for book code) + +if ( !file.exists( "results/Pearson_Poisson_results_nested.rds" ) ) { + # Secret Run code in parallel for speedup + library(future) + library(furrr) + plan(multisession) + set.seed(20250718) + sim_results_nested <- + params_mod %>% + mutate( + .results = future_pmap( + ., .f = Pearson_sim, + reps = 1000, summarize = FALSE, + .options = furrr_options(seed = TRUE) + ) + ) + + write_rds( sim_results_nested, file = "results/Pearson_Poisson_results_nested.rds" ) +} else { + sim_results_nested <- read_rds("results/Pearson_Poisson_results_nested.rds") +} +``` + +```{r Pearson_all_rows_nested, eval=FALSE} +sim_results_nested <- + params %>% + mutate( mu2 = mu1 * lambda ) %>% + evaluate_by_row( Pearson_sim, reps = 1000, summarize = FALSE, nest_results = TRUE) + +write_rds( sim_results_nested, file = "results/Pearson_Poisson_results_nested.rds" ) +``` + +Here is the number of rows for the outside vs inside approaches: +```{r} +c(inside = nrow( sim_results ), outside = nrow( sim_results_full ), nested = nrow( sim_results_nested )) +``` + +Here is a comparison of the file sizes on the disk: +```{r} +c( + inside = file.size("results/Pearson_Poisson_results.rds"), + outside = file.size("results/Pearson_Poisson_results_full.rds"), + nested = file.size("results/Pearson_Poisson_results_nested.rds") +) / 2^10 # Kb +``` +The first is several kilobytes, the second and third are several megabytes. If we follow the outside strategy, keeping the results nested reduces the file size by around `r round(100 * (1 - file.size("results/Pearson_Poisson_results_nested.rds") / file.size("results/Pearson_Poisson_results_full.rds")))`%. 
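+
+If file size does become a concern, one further option worth knowing about is the `compress` argument of `readr::write_rds()`, which applies gzip (or bzip2/xz) compression when saving. Here is a sketch (with a hypothetical file name; compression trades read/write time for disk space, and the savings will depend on the data):
+```{r, eval = FALSE}
+# Save the full results with gzip compression; read_rds() will
+# decompress transparently when the file is loaded back in.
+write_rds(
+  sim_results_full,
+  file = "results/Pearson_Poisson_results_full_gz.rds",  # hypothetical name
+  compress = "gz"
+)
+```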
+
+### Getting raw results ready for analysis
+
+If we generate raw results, we then need to do the performance calculations across replications within each simulation context so that we can explore the trends across simulation factors.
+
+One way to do this is to use `group_by()` and `summarize()` to carry out the performance calculations on the unnested simulation results:
+```{r}
+sim_results_full %>%
+  group_by( N, mu1, mu2, rho ) %>%
+  summarize(
+    calc_coverage(lower_bound = CI_lo, upper_bound = CI_hi, true_param = rho)
+  )
+```
+
+If we want to use our full performance measure function `evaluate_CIs()` to get additional metrics such as MCSEs, we would _nest_ our data into a series of mini-datasets (one for each simulation), and then process each element.
+As we saw above, nesting collapses a larger dataset into one where one of the variables consists of a list of datasets:
+
+```{r}
+results <- 
+  sim_results_full %>%
+  group_by( N, mu1, mu2, rho ) %>%
+  nest( .key = "res" )
+results
+```
+
+Note how each row of our nested data has a little tibble containing the results for that context, with 1000 rows each.[^use-nested]
+Once nested, we can then use `map2()` to apply a function to each element of `res`:
+
+```{r}
+results_summary <- 
+  results %>%
+  mutate( performance = map2( res, rho, evaluate_CIs ) ) %>%
+  dplyr::select( -res ) %>%
+  unnest( cols="performance" )
+results_summary
+```
+
+We have built our final performance table _after_ running the entire simulation, rather than running it on each simulation scenario in turn.
+
+[^use-nested]: Alternately, we could store the results in nested form (as in `sim_results_nested`), so that the `group_by()` and `nest()` steps are unnecessary.
+
+Now, if we want to add a performance metric, we can simply change `evaluate_CIs` and recalculate, without having to recompute the entire simulation.
+Summarizing during the simulation vs.
after, as we just did, leads to the same set of results.^[In fact, if we use the same seed, we should obtain _exactly_ the same results.]
+Allowing yourself the flexibility to re-calculate performance measures can be very advantageous, and we tend to follow this outside strategy for any simulations involving more complex estimation procedures.
+
+
+## Summary
+
+Multifactor simulations are simply a series of individual scenario simulations, where the set of scenarios is structured by systematically manipulating some of the parameters of the data-generating process.
+The overall workflow for implementing a multifactor simulation begins with identifying which parameters and which specific values of those parameters to explore.
+These parameters correspond to the factors of the simulation's design; the specific values correspond to the levels (or settings) of each factor.
+Following the principles of tidy simulation, we represent these decisions as a dataset consisting of all the combinations of the factors that we wish to explore.
+Think of this as a menu, or checklist, of simulation scenarios to run.
+The next step in the workflow is then to walk down the list, running a simulation of each scenario in turn.
+
+After executing a multifactor simulation, we will have results from every simulation scenario.
+These might be the raw results (estimates of quantities of interest from every individual iteration of the simulation) or summary results (performance measures calculated across iterations of the simulation for a given scenario).
+In either form, the results will be connected to the parameter values (the factor levels) used to generate them.
+Stacking all the results up will produce a single dataset, suitable for further analysis.
+
+With the workflow that we have demonstrated, it is easy to specify multifactor simulations that involve hundreds or even thousands of distinct scenarios.
+
+The amount of data generated by such simulations can quickly grow overwhelming, and making sense of the results will require further, careful analysis.
+In the next several chapters, we will examine several strategies for exploring and presenting results from multifactor simulations.
+
+## Exercises
+
+### Extending Brown and Forsythe {#extending-Brown-Forsythe-power}
+
+@brown1974SmallSampleBehavior evaluated the power of the ANOVA $F$ and Welch test under twenty different conditions, varying in the number of groups, sample sizes, and degree of heteroskedasticity (see Table [5.2](case-ANOVA.html#tab:BF-Scenarios) of Chapter \@ref(case-ANOVA)).
+
+1. Extend their work by building a multifactor simulation design to compare the power of the tests when $G = 4$, for three or more different sample sizes and for settings with either unbalanced group allocations (e.g., `allocation = c(0.1, 0.2, 0.3, 0.4)`) or equal group sizes.
+
+2. Execute the multifactor simulations using `pmap()` or `evaluate_by_row()`.
+
+3. Create a graph or graphs that depict the power levels of each test as a function of `max_diff`, sample size, and group allocations. How do the power levels of the tests compare to each other overall?
+
+### Comparing the trimmed mean, median and mean {#exercise:trimmed-mean}
+
+In this extended exercise, you will develop a multifactor simulation to compare several different estimators of a common parameter under a range of scenarios.
+The specific tasks in this process illustrate how we would approach programming a methodological simulation to compare different estimation strategies, as you might see in the "simulation" section of an article in a statistics journal.
+In this example, though, both the data-generating process and the estimation strategies are very simple and quick to calculate, so that it is feasible to quickly execute a multifactor simulation.
+Following tidy simulation principles, the steps described below will walk you through building and testing functions for each component of the simulation, assembling them into a simulation driver, specifying a simulation design, and executing a multifactor simulation.
+
+The aim of this simulation is to investigate the performance of the mean, trimmed mean, and median as estimators of the center of a symmetric distribution (such that the mean and median parameters are identical).
+As the data-generation function, use a scaled $t$-distribution, so that the standard deviation will always be 1 but the tails will vary in heaviness (i.e., in the chance of outliers):
+```{r}
+gen_scaled_t <- function( n, mu, df0 ) {
+  mu + rt( n, df=df0 ) / sqrt( df0 / (df0-2) )
+}
+```
+The variance of a $t$ distribution is $df/(df-2)$, so when we divide our observations by the
+square root of this, we standardize them so they have unit variance.
+The estimand of interest here is `mu`, the center of the distribution.
+The estimation methods of interest are the conventional (arithmetic) mean, a 10% trimmed mean, and the median of a sample of $n$ observations.
+For performance measures, focus on bias, true standard error, and root mean squared error.
+
+1. Verify that `gen_scaled_t()` produces data with mean `mu` and standard deviation 1 for various `df0` values.
+
+2. Write a function to calculate the mean, trimmed mean, and median of a vector of data.
+   The trimmed mean should trim 10% of the data from each end.
+   The method should return a data frame with the three estimates, one row per estimator.
+
+3. Verify your estimation method works by analyzing a dataset generated with `gen_scaled_t()`.
+   For example, you can generate a dataset of size 100 with `gen_scaled_t(100, 0, 3)` and then analyze it.
+
+4. Use `bundle_sim()` to create a simulation function that generates data and then analyzes it.
+   The function should take `n` and `df0` as arguments, and return the estimates from your analysis method.
+   Use `id` to give each simulation run an ID.
+
+5. Run your simulation function for 1000 datasets of size 10, with `mu=0` and `df0=5`.
+   Store the results in a variable called `raw_exps`.
+
+6. Write a function to calculate the RMSE, bias, and standard error for your three estimators, given the results.
+
+7. Make a single function that takes `df0` and `n`, runs a simulation, and returns the performance of your three methods.
+
+8. Now make a grid of $n = 10, 50, 250, 1250$ and $df_0 = 3, 5, 15, 30$, and generate results for your multifactor simulation.
+
+9. Make a plot showing how SE changes as a function of sample size for each estimator. Do the three estimators seem to follow the same pattern? Or do they work differently?
+
+### Estimating latent correlations {#exercise:multifactor-latent-correlation}
+
+Exercise \@ref(BVNB2) introduced a bivariate negative binomial model and asked you to write a data-generating function that implements the model. Exercise \@ref(three-correlation-estimators) provided an estimation function (called `three_corrs`) that calculates three different types of correlation coefficients, and asked you to write a function for calculating the bias and RMSE of these measures.
+
+1. Combine your data-generating function, `three_corrs()`, and your performance calculation into a simulation driver.
+
+2. Propose a multifactor simulation design to examine the bias and RMSE of these three correlations. Write code to create a parameter grid for your proposed simulations.
+
+3. Execute the simulations for your proposed design.
+
+4. Create a graph or graphs that depict the bias and RMSE of each correlation as a function of $\rho$ and any other key parameters.
+
+### Meta-regression {#exercise:multifactor-meta-regression}
+
+Exercise \@ref(meta-regression-DGP) described the random effects meta-regression model.
+List the focal, auxiliary, and structural parameters of this model, and propose a set of design factors to use in a multifactor simulation of the model. +Create a list with one entry per factor, then create a dataset with one row for each simulation context that you propose to evaluate. + +### Examine a multifactor simulation design {#exercise:examine-a-multifactor-simulation-design} + +Find a published article that reports a multifactor simulation study examining a methodological question.[^journals] +Write code to create a parameter grid for the scenarios examined in the study. +Write a few sentences explaining the overall design of the simulation study. +Summarize any justification that the authors provided for the choice of parameter values examined. + +[^journals]: Journals that regularly publish methodological simulation studies include [Psychological Methods](https://www.apa.org/pubs/journals/met), [Psychometrika](https://link-springer-com.ezproxy.library.wisc.edu/journal/11336), [Journal of Educational and Behavioral Statistics](https://journals-sagepub-com.ezproxy.library.wisc.edu/home/jeb), [Multivariate Behavioral Research](https://www-tandfonline-com.ezproxy.library.wisc.edu/journals/hmbr20), [Behavior Research Methods](https://link.springer.com/journal/13428), [Research Synthesis Methods](https://www.cambridge.org/core/journals/research-synthesis-methods), and [Statistics in Medicine](https://onlinelibrary-wiley-com.ezproxy.library.wisc.edu/journal/10970258). 
diff --git a/070-experimental-design.Rmd b/070-experimental-design.Rmd index 3dbe449..3320ca3 100644 --- a/070-experimental-design.Rmd +++ b/070-experimental-design.Rmd @@ -5,7 +5,7 @@ editor_options: --- -```{r setup_exp_design, include=FALSE} +```{r setup-exp-design, include=FALSE} library( tidyverse ) library( purrr ) options(list(dplyr.summarise.inform = FALSE)) @@ -25,9 +25,7 @@ dat <- gen_cluster_RCT( n=5, J=3, p=0.5, ``` -# (PART) Multifactor Simulations {-} - -# Designing and executing multifactor simulations {#exp-design} +# Designing multifactor simulations {#designing-multifactor-simulations} Thus far, we have created code that will run a simulation for a single combination of parameter values. In practice, simulation studies typically examine a range of different values, including varying the levels of the focal parameter values, auxiliary parameters, sample size, and possibly other design parameters, to explore a range of different scenarios. @@ -39,55 +37,6 @@ Let's now look at the remaining piece of the simulation puzzle: the study's expe Simulation studies often take the form of __full factorial__ designed experiments. In full factorials, each factor (a particular knob a researcher might turn to change the simulation conditions) is varied across multiple levels, and the design includes _every_ possible combination of the levels of every factor. One way to represent such a design is as a list of the factors and levels to be explored. -For example, consider a simulation study examining the performance of confidence intervals for Pearson's correlation coefficient under a bivariate Poisson distribution. -We examined this data-generating model in Section \@ref(BVPois-example), implementing it in the function `r_bivariate_Poisson()`. The model has three parameters (the means of each variate, $\mu_1, \mu_2$ and the correlation $\rho$) and there is one design parameter (sample size, $N$). -Thus, we could in principle examine up to four factors. 
- -Using these parameters directly as factors in the simulation design will lead to considerable redundancy because of the symmetry of the model: generating data with $\mu_1 = 10$ and $\mu_2 = 5$ would lead to the same correlation as using $\mu_1 = 5$ and $\mu_2 = 10$. -It is useful to re-parameterize to reduce redundancy and simplify things. -We will therefore define the simulation conditions by always treating $\mu_1$ as the larger variate and by specifying the ratio of the smaller to the larger mean as $\lambda = \mu_2 / \mu_1$. -We might then examine the following factors: - -* the sample size, with values of $N = 10, 20$, or $30$ -* the mean of the larger variate, with values of $\mu_1 = 4, 8$, or $12$ -* the ratio of means, with values of $\lambda = 0.5$ or $1.0$. -* the true correlation, with values ranging from $\rho = 0.0$ to $0.7$ in steps of $0.1$ - -The above parameters describe a $3 \times 3 \times 2 \times 8$ factorial design, where each element is the number of levels for that factor. This is a four-factor experiment, because we have four different things we are varying. - -To implement this design in code, we first save the simulation parameters as a list with one entry per factor, where each entry consists of the levels that we would like to explore. -We will run a simulation for every possible combination of these values. 
-Here is code that generates all of the scenarios given the above design, storing these combinations in a data frame, `params`, that represents the full experimental design: - -```{r make_Pearson_sim_dataframe} -design_factors <- list( - N = c(10, 20, 30), - mu1 = c(4, 8, 12), - lambda = c(0.5, 1.0), - rho = seq(0.0, 0.7, 0.1) -) - -lengths(design_factors) - -params <- expand_grid( !!!design_factors ) -params -``` - -We use `expand_grid()` from the `tidyr` package to create all possible combinations of the four factors.[^expand-grid-wtf] -We have a total of $`r paste(lengths(design_factors), collapse = " \\times ")` = `r nrow(params)`$ rows, each row corresponding to a simulation scenario to explore. -With multifactor experiments, it is easy to end up running a lot of experiments! - -[^expand-grid-wtf]: `expand_grid()` is set up to take one argument per factor of the design. A clearer example of its natural syntax is: - ```{r} - params <- expand_grid( - N = c(10, 20, 30), - mu1 = c(4, 8, 12), - lambda = c(0.5, 1.0), - rho = seq(0.0, 0.7, 0.1) - ) - ``` - However, we generally find it useful to create a list of design factors before creating the full grid of parameter values, so we prefer to make `design_factors` first. To use `expand_grid()` on a list, we need to use `!!!`, the splice operator from the `rlang` package, which treats `design_factors` as a set of arguments to be passed to `expand_grid`. The syntax may look a bit wacky, but it is succinct and useful. - The multi-factor aspect of a simulation is incredibly important. It can take us from an overly narrow exploration to one that has broader significance. As @little2013praise puts it: @@ -149,296 +98,6 @@ To recap, as you think about your parameter selection, always keep the following Finally, you should fully expect to add and subtract from your set of simulation factors as you get your initial simulation results. Rarely does anyone nail the choice of parameters on the first pass. 
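One practical consequence of this advice: it is often easier to generate the full factorial grid and then prune combinations that are not of interest, rather than to construct an irregular design by hand. Here is a hedged sketch, reusing the `params` grid created above (the pruning rule itself is purely illustrative): ```{r} # Illustrative only: drop the smallest sample size at the highest correlations, # e.g., if we decide those cells are not of substantive interest. params_pruned <- filter( params, !( N == 10 & rho >= 0.6 ) ) nrow( params_pruned )   # 144 - 12 = 132 scenarios remain ``` The pruned grid is still just a dataset of scenarios, so everything downstream works unchanged.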
- -## Using pmap to run multifactor simulations {#using-pmap-to-run-multifactor-simulations} - -Once we have selected factors and levels for simulation, we now need to run the simulation code across all of our factor combinations. -Conceptually, each row of our `params` dataset represents a single simulation scenario, and we want to run our simulation code for each of these scenarios. -We would thus call our simulation function, using all the values in that row as parameters to pass to the function. - -One way to call a function on each row of a dataset in this manner is by using `pmap()` from the `purrr` package. -`pmap()` marches down a set of lists, running a function on each $p$-tuple of elements, taking the $i^{th}$ element from each list for iteration $i$, and passing them as parameters to the specified function. -`pmap()` then returns the results of this sequence of function calls as a list of results.[^pmap-variants] -Because R's `data.frame` objects are also sets of lists (where each variable is a vector, which is a simple form of list), `pmap()` also works seamlessly on `data.frame` or `tibble` objects. - -[^pmap-variants]: Just like `map()` or `map2()`, `pmap()` has variants such as `_dbl` or `_df`. -These variants automatically stack or convert the list of things returned into a tidier collection (for `_dbl` it will convert to a vector of numbers, for `_df` it will stack the results to make a large dataframe, assuming each thing returned is a little dataframe). - -Here is a small illustration of `pmap()` in action: -```{r} - -some_function <- function( a, b, theta, scale ) { - scale * (a + theta*(b-a)) -} - -args_data <- tibble( a = 1:3, b = 5:7, theta = c(0.2, 0.3, 0.7) ) -purrr::pmap( args_data, .f = some_function, scale = 10 ) -``` - -One important constraint of `pmap()` is that the variable names over which we iterate must correspond exactly to the arguments of the function to be evaluated. 
-In the above example, `args_data` must have column names that correspond to the arguments of `some_function`. -For functions with additional arguments that are not manipulated, extra parameters can be passed after the function name (as in the `scale` argument in this example). -These will also be passed to each function call, but will be the same for all calls. - -Let's now implement this technique for our simulation of confidence intervals for Pearson's correlation coefficient. -In Section \@ref(estimation-functions), we developed a function called `r_and_z()` for computing confidence intervals for Pearson's correlation using Fisher's $z$ transformation; -then in Section \@ref(assessing-confidence-intervals), we wrote a function called `evaluate_CIs()` for evaluating confidence interval coverage and average width. -We can bundle `r_bivariate_Poisson()`, `r_and_z()`, and `evaluate_CIs()` into a simulation driver function using `bundle_sim()` from the `simhelpers` package: -```{r} -library(simhelpers) - -Pearson_sim <- bundle_sim( - f_generate = r_bivariate_Poisson, f_analyze = r_and_z, f_summarize = evaluate_CIs -) -args(Pearson_sim) -``` -This function will run a simulation for a given scenario: -```{r} -Pearson_sim(1000, N = 10, mu1 = 5, mu2 = 5, rho = 0.3) -``` - -In order to call `Pearson_sim()`, we will need to ensure that the columns of the `params` dataset correspond to the arguments of the function. 
-Because we re-parameterized the model in terms of $\lambda$, we will first need to compute the parameter value for $\mu_2$ and remove the `lambda` variable because it is not an argument of `Pearson_sim()`: -```{r} -params_mod <- - params %>% - mutate(mu2 = mu1 * lambda) %>% - dplyr::select(-lambda) -``` - -Now we can use `pmap()` to run the simulation for all `r nrow(params)` parameter settings: -```{r secret-run-Pearson-Poisson-sims, include=FALSE} -# (See below this block for book code) - -if ( !file.exists( "results/Pearson_Poisson_results.rds" ) ) { - # Secret Run code in parallel for speedup - library(future) - library(furrr) - plan(multisession) - set.seed(20250718) - sim_results <- - params_mod %>% - mutate(res = future_pmap(., .f = Pearson_sim, reps = 1000, .options = furrr_options(seed = TRUE) ) ) - - write_rds( sim_results, file = "results/Pearson_Poisson_results.rds" ) -} else { - sim_results <- read_rds("results/Pearson_Poisson_results.rds") -} - -``` - -```{r run-Pearson-sims, eval = FALSE} -sim_results <- params -sim_results$res <- pmap(params_mod, Pearson_sim, reps = 1000 ) -``` - -The above code calls our `Pearson_sim()` function for each row in the list of scenarios we want to explore. -Conveniently, we can store the results __as a new variable in the same dataset__. - -```{r} -sim_results -``` - -The above code may look a bit peculiar: we are storing a set of dataframes (our result) in our original dataframe. -This is actually ok in R: our results will be in what is called a __list-column__, where each element in our list column is the little summary of our simulation results for that scenario. -Here is the third scenario, for example: - -```{r} -sim_results$res[[3]] -``` - -List columns are neat, but hard to work with. 
-To turn the list-column into normal data, we can use `unnest()` to expand the `res` variable, replicating the values of the main variables once for each row in the nested dataset: - -```{r} -sim_results <- unnest(sim_results, cols = res) -sim_results -``` - -Putting all of this together into a tidy workflow leads to the following: - -```{r, eval = FALSE} -sim_results <- - params %>% - mutate( - mu2 = mu1 * lambda, - reps = 1000 - ) %>% - mutate( - res = pmap(dplyr::select(., -lambda), .f = Pearson_sim) - ) %>% - unnest(cols = res) -``` - -If you like, you can simply use the `evaluate_by_row()` function from the `simhelpers` package: - -```{r, eval=FALSE} -sim_results <- - params %>% - mutate( mu2 = mu1 * lambda ) %>% - evaluate_by_row( Pearson_sim, reps = 1000 ) -``` -An advantage of `evaluate_by_row()` is that the input dataset can include extra variables (such as `lambda`). -Another advantage is that it is easy to run the calculations in parallel; see Chapter \@ref(parallel-processing). - -As a final step, we save our results using tidyverse's `write_rds()`; see [R for Data Science, Section 7.5](https://r4ds.hadley.nz/data-import.html#sec-writing-to-a-file). -We first ensure we have a directory by making one via `dir.create()` (see Chapter \@ref(saving-files) for more on files): - -```{r, eval=FALSE} -dir.create( "results", showWarnings = FALSE ) -write_rds( sim_results, file = "results/Pearson_Poisson_results.rds" ) -``` - -We now have a complete set of simulation results for all of the scenarios we specified. - -## When to calculate performance metrics - -For a single-scenario simulation, we repeatedly generate and analyze data, and then assess the performance across the repetitions. -When we extend this process to multifactor simulations, we have a choice: do we compute performance measures for each simulation scenario as we go (inside) or do we compute all of them after we get all of our individual results (outside)? 
-There are pros and cons to each approach. - -### Aggregate as you simulate (inside) - -The *inside* approach runs a stand-alone simulation for each scenario of interest. For each combination of factors, we simulate data, apply our estimators, assess performance, and return a table with summary performance measures. We can then stack these tables to get a dataset with all of the results, ready for analysis. - -This is the approach we illustrated above. It is straightforward and streamlined: we already have a method to run simulations for a single scenario, and we just repeat it across multiple scenarios and combine the outputs. -After calling `pmap()` (or `evaluate_by_row()`) and stacking the results, we end up with a dataset containing all the simulation conditions, one simulation context per row (or maybe we have sets of several rows for each simulation context, with one row for each method), with the columns consisting of the simulation factors and measured performance outcomes. -This table of performance is ideally all we need to conduct further analysis and write up the results. - -The primary advantages of the inside strategy are that it is easy to modularize the simulation code and it produces a compact dataset of results, minimizing the number and size of files that need to be stored. -On the con side, calculating summary performance measures inside of the simulation driver limits our ability to add new performance measures on the fly or to examine the distribution of individual estimates. -For example, say we wanted to check if the distribution of Fisher-z estimates in a particular scenario was right-skewed, perhaps because we are worried that the estimator sometimes breaks down. -We might want to make a histogram of the point estimates, or calculate the skew of the estimates as a performance measure. -Because the individual estimates are not saved, we would have no way of investigating these questions without rerunning the simulation for that condition. 
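For example, a simple skewness measure of the sort we might want to compute from the raw estimates could look like the following (a sketch; `skewness()` here is our own small helper, not a function used elsewhere in this chapter): ```{r} # Moment-based sample skewness: positive values indicate right skew. skewness <- function( x ) { z <- ( x - mean( x ) ) / sd( x ) mean( z^3 ) } # Right-skewed data give a clearly positive value: skewness( rexp( 10000 ) ) ``` Computing a diagnostic like this requires the individual estimates, which the inside strategy throws away.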
-In short, the inside strategy minimizes disk space but constrains our ability to explore or revise performance calculations. - -### Keep all simulation runs (outside) - -The _outside_ approach involves retaining the entire set of estimates from every replication, with each row corresponding to an estimate for a given simulated dataset. -The benefit of the outside approach is that it allows us to add or change how we calculate performance measures without re-running the entire simulation. -This is especially important if the simulation is time-intensive, such as when the estimators being evaluated are computationally expensive. -The primary disadvantage of the outside approach is that it produces large amounts of data that need to be stored and further manipulated. -Thus, the outside strategy maximizes flexibility, at the cost of increased dataset size. - -In our Pearson correlation simulation, we initially followed the inside strategy. To move to the outside strategy, we can set the `summarize` argument of `Pearson_sim()` to `FALSE` so that the simulation driver returns a row for every replication: -```{r do_power_sim_full, cache=TRUE} -Pearson_sim(reps = 4, N = 15, mu1 = 5, mu2 = 5, rho = 0.5, summarize = FALSE) -``` - -We then save the entire set of estimates, rather than the performance summaries. -This result file will have $R$ times as many rows as the older file. In practice, these results can quickly become extremely large. -But disk space is cheap! 
-Here we run the same experiment as in Section \@ref(using-pmap-to-run-multifactor-simulations), but storing the individual replications instead of just the summarized results: - -```{r secret_Pearson_full, include=FALSE} -# (See below this block for book code) - -if ( !file.exists( "results/Pearson_Poisson_results_full.rds" ) ) { - # Secret Run code in parallel for speedup - library(future) - library(furrr) - plan(multisession) - set.seed(20250718) - sim_results_full <- - params_mod %>% - mutate( - res = future_pmap( - ., .f = Pearson_sim, - reps = 1000, summarize = FALSE, - .options = furrr_options(seed = TRUE) - ) - ) %>% - unnest(res) - - write_rds( sim_results_full, file = "results/Pearson_Poisson_results_full.rds" ) -} else { - sim_results_full <- read_rds("results/Pearson_Poisson_results_full.rds") -} -``` - -```{r Pearson_all_rows, eval=FALSE} -sim_results_full <- - params %>% - mutate( mu2 = mu1 * lambda ) %>% - evaluate_by_row( Pearson_sim, reps = 1000, summarize = FALSE ) - -write_rds( sim_results_full, file = "results/Pearson_Poisson_results_full.rds" ) -``` - -We end up with many more rows. -Here is the number of rows for the outside vs inside approach: -```{r} -c(inside = nrow( sim_results ), outside = nrow( sim_results_full )) -``` - -Comparing the file sizes on the disk: -```{r} -c( - inside = file.size("results/Pearson_Poisson_results.rds"), - outside = file.size("results/Pearson_Poisson_results_full.rds") -) / 2^10 # Kb -``` -The first is several kilobytes, the second is several megabytes. - -### Getting raw results ready for analysis - -If we generate raw results, we then need to do the performance calculations across replications within each simulation context so that we can explore the trends across simulation factors. 
- -One way to do this is to use `group_by()` and `summarize()` to carry out the performance calculations: -```{r} -sim_results_full %>% - group_by( N, mu1, mu2, rho ) %>% - summarise( - calc_coverage(lower_bound = CI_lo, upper_bound = CI_hi, true_param = rho) - ) -``` - -If we want to use our full performance measure function `evaluate_CIs()` to get additional metrics such as MCSEs, we would _nest_ our data into a series of mini-datasets (one for each simulation), and then process each element. -As we saw above, nesting collapses a larger dataset into one where one of the variables consists of a list of datasets: - -```{r} -results <- - sim_results_full |> - group_by( N, mu1, mu2, rho ) %>% - nest( .key = "res" ) -results -``` - -Note how each row of our nested data has a little tibble containing the results for that context, with 1000 rows each. -Once nested, we can then use `map2()` to apply a function to each element of `res`: - -```{r} -results_summary <- - results %>% - mutate( performance = map2( res, rho, evaluate_CIs ) ) %>% - dplyr::select( -res ) %>% - unnest( cols="performance" ) -results_summary -``` - -We have built our final performance table _after_ running the entire simulation, rather than running it on each simulation scenario in turn. - -Now, if we want to add a performance metric, we can simply change `evaluate_CIs` and recalculate, without having to recompute the entire simulation. -Summarizing during the simulation vs. after, as we just did, leads to the same set of results.^[In fact, if we use the same seed, we should obtain _exactly_ the same results.] -Allowing yourself the flexibility to re-calculate performance measures can be very advantageous, and we tend to follow this outside strategy for any simulations involving more complex estimation procedures. - - -## Summary - -Multifactor simulations are simply a systematically generated series of individual scenario simulations. 
-The overall workflow is to first identify the factors and levels to explore (which we store as a dataset of all the combinations of the factors desired). -Think of this as a menu, or checklist, of simulations to run. -The next step is then to walk down the list, running each simulation in turn. - -Each individual simulation will generate its own set of results. -These can be the raw results (individual simulation iterations) or summary results (performance measures). -We stack all of these results, connecting them to the simulation factors they came from, to get a single massive dataset. -The key question is then how to explore this full set of results; the amount of results generated can easily become overwhelming! - -The next chapters dive into how to take on this final task. - - - ## Case Study: A multifactor evaluation of cluster RCT estimators To bring the multifactor simulation to life, let us return to the case study of comparing three ways to analyze a cluster randomized trial that we presented in Section \@ref(case-cluster). @@ -611,65 +270,5 @@ glimpse( sres ) - - - -## Exercises - -### Brown and Forsythe redux - -Take another look at Table [5.1](case-ANOVA.html#tab:BF-Scenarios), which is excerpted from @brown1974SmallSampleBehavior. -Create a tibble with one row for each of the 20 scenarios that they evaluated. -Then create a function for running the full simulation process (see Exercise \@ref(Welch-simulation)). -Use `pmap()` or `evaluate_by_row()` to run simulations of all 20 scenarios and reproduce the results in Table [5.2](case-ANOVA.html#tab:BF-table1) of Chapter \@ref(case-ANOVA). - -### Meta-regression - -Exercise \@ref(meta-regression-DGP) described the random effects meta-regression model. -List the focal, auxiliary, and structural parameters of this model, and propose a set of design factors to use in a multifactor simulation of the model. 
-Create a list with one entry per factor, then create a dataset with one row for each simulation context that you propose to evaluate. - - -### Comparing the trimmed mean, median and mean {#exercise:trimmed-mean} - -In this exercise, you will write a simulation to compare several different -estimators of a common parameter. -In particular, you will compare the mean, trimmed mean, and median as estimators of the center of a symmetric distribution (such that the mean and median parameters are identical). -To do this, you should break building this simulation evaluation down into functions for each component of the simulation. -This will allow you to extend the same framework to more complicated simulation studies. -This extended exercise illustrates how methodologists might compare different estimation strategies, as you might see in the "simulation" section of a stats paper. - -As the data-generation function, use a scaled $t$-distribution so that the standard deviation will always be 1 but will have different fatness of tails (high chance of outliers): - -```{r} -gen_scaled_t <- function( n, mu, df0 ) { - mu + rt( n, df=df0 ) / sqrt( df0 / (df0-2) ) -} -``` - -The variance of a $t$ distribution is $df/(df-2)$, so when we divide our observations by the -square root of this, we standardize them so they have unit variance. - -1. Verify that `gen_scaled_t()` produces data with mean `mu` and standard deviation 1 for various `df0` values. - -2. Write a method to calculate the mean, trimmed mean, and median of a vector of data. - The trimmed mean should trim 10% of the data from each end. - The method should return a data frame with the three estimates, one row per estimator. - -3. Verify your estimation method works by analyzing a dataset generated with `gen_scaled_t()`. - For example, you can generate a dataset of size 100 with `gen_scaled_t(100, 0, 3)` and then analyze it. - -4. 
Use `bundle_sim()` to create a simulation function that generates data and then analyzes it. - The function should take `n` and `df0` as arguments, and return the estimates from your analysis method. - Use `id` to give each simulation run an ID. - -5. Run your simulation function for 1000 datasets of size 10, with `mu=0` and `df0=5`. - Store the results in a variable called `raw_exps`. - -6. Write a function to calculate the RMSE, bias, and standard error for your three estimators, given the results. - -7. Make a single function that takes `df0` and `n`, and runs a simulation and returns the performances of your three methods. - -8. Now make a grid of $n = 10, 50, 250, 1250$ and $df_0 = 3, 5, 15, 30$, and generate results for your multi-factor simulation. - -9. Make a plot showing how SE changes as a function of sample size for each estimator. Do the three estimators seem to follow the same pattern? Or do they work differently? - diff --git a/210-futher-resources.Rmd b/210-futher-resources.Rmd index 6ff12a4..10afae7 100644 --- a/210-futher-resources.Rmd +++ b/210-futher-resources.Rmd @@ -4,23 +4,18 @@ We close with a list of things of interest we have discovered while writing this text. - [Morris, White, & Crowther (2019)](https://doi.org/10.1002/sim.8086). Using simulation studies to evaluate statistical methods. - -- High-level simulation design considerations. -- Details about performance criteria calculations. -- Stata-centric. + + - High-level simulation design considerations. + - Details about performance criteria calculations. + - Stata-centric. - [SimDesign](https://github.com/philchalmers/SimDesign/wiki) R package (Chalmers, 2019) -- Tools for building generic simulation workflows. 
+ - [Chalmers & Adkin (2019)](http://www.tqmp.org/RegularArticles/vol16-4/p248/). Writing effective and reliable Monte Carlo simulations with the SimDesign package. - [DeclareDesign](https://declaredesign.org/) (Blair, Cooper, Coppock, & Humphreys) -- Specialized suite of R packages for simulating research designs. -- Design philosophy is very similar to "tidy" simulation approach. - -- [SimHelpers](https://meghapsimatrix.github.io/simhelpers/index.html) R package (Joshi & Pustejovsky, 2020) - -- Helper functions for calculating performance criteria. -- Includes Monte Carlo standard errors. + - Specialized suite of R packages for simulating research designs. + - Design philosophy is very similar to "tidy" simulation approach. diff --git a/Designing-Simulations-in-R.toc b/Designing-Simulations-in-R.toc index a71dcf1..74acd82 100644 --- a/Designing-Simulations-in-R.toc +++ b/Designing-Simulations-in-R.toc @@ -13,7 +13,7 @@ \contentsline {subsection}{\numberline {1.1.6}Simulating processess}{18}{subsection.1.1.6}% \contentsline {section}{\numberline {1.2}The perils of simulation as evidence}{19}{section.1.2}% \contentsline {section}{\numberline {1.3}Simulating to learn}{21}{section.1.3}% -\contentsline {section}{\numberline {1.4}Why R?}{22}{section.1.4}% +\contentsline {section}{\numberline {1.4}Why R?}{21}{section.1.4}% \contentsline {section}{\numberline {1.5}Organization of the text}{23}{section.1.5}% \contentsline {chapter}{\numberline {2}Programming Preliminaries}{25}{chapter.2}% \contentsline {section}{\numberline {2.1}Welcome to the tidyverse}{25}{section.2.1}% @@ -31,7 +31,7 @@ \contentsline {section}{\numberline {3.2}A non-normal population distribution}{41}{section.3.2}% \contentsline {section}{\numberline {3.3}Simulating across different scenarios}{42}{section.3.3}% \contentsline {section}{\numberline {3.4}Extending the simulation design}{45}{section.3.4}% -\contentsline {section}{\numberline {3.5}Exercises}{46}{section.3.5}% +\contentsline 
{section}{\numberline {3.5}Exercises}{45}{section.3.5}% \contentsline {part}{II\hspace {1em}Structure and Mechanics of a Simulation Study}{49}{part.2}% \contentsline {chapter}{\numberline {4}Structure of a simulation study}{51}{chapter.4}% \contentsline {section}{\numberline {4.1}General structure of a simulation}{51}{section.4.1}% @@ -44,14 +44,14 @@ \contentsline {subsection}{\numberline {4.3.5}Multifactor simulations}{59}{subsection.4.3.5}% \contentsline {section}{\numberline {4.4}Exercises}{60}{section.4.4}% \contentsline {chapter}{\numberline {5}Case Study: Heteroskedastic ANOVA and Welch}{61}{chapter.5}% -\contentsline {section}{\numberline {5.1}The data-generating model}{64}{section.5.1}% -\contentsline {subsection}{\numberline {5.1.1}Now make a function}{66}{subsection.5.1.1}% +\contentsline {section}{\numberline {5.1}The data-generating model}{63}{section.5.1}% +\contentsline {subsection}{\numberline {5.1.1}Now make a function}{65}{subsection.5.1.1}% \contentsline {subsection}{\numberline {5.1.2}Cautious coding}{67}{subsection.5.1.2}% -\contentsline {section}{\numberline {5.2}The hypothesis testing procedures}{68}{section.5.2}% +\contentsline {section}{\numberline {5.2}The hypothesis testing procedures}{67}{section.5.2}% \contentsline {section}{\numberline {5.3}Running the simulation}{69}{section.5.3}% \contentsline {section}{\numberline {5.4}Summarizing test performance}{70}{section.5.4}% -\contentsline {section}{\numberline {5.5}Exercises}{72}{section.5.5}% -\contentsline {subsection}{\numberline {5.5.1}Other \(\alpha \)'s}{72}{subsection.5.5.1}% +\contentsline {section}{\numberline {5.5}Exercises}{71}{section.5.5}% +\contentsline {subsection}{\numberline {5.5.1}Other \(\alpha \)'s}{71}{subsection.5.5.1}% \contentsline {subsection}{\numberline {5.5.2}Compare results}{72}{subsection.5.5.2}% \contentsline {subsection}{\numberline {5.5.3}Power}{72}{subsection.5.5.3}% \contentsline {subsection}{\numberline {5.5.4}Wide or long?}{72}{subsection.5.5.4}% @@ 
-119,164 +119,176 @@ \contentsline {subsection}{\numberline {8.5.2}Compare sampling distributions of Pearson's correlation coefficients}{146}{subsection.8.5.2}% \contentsline {subsection}{\numberline {8.5.3}Reparameterization, redux}{147}{subsection.8.5.3}% \contentsline {subsection}{\numberline {8.5.4}Fancy clustered RCT simulations}{147}{subsection.8.5.4}% -\contentsline {chapter}{\numberline {9}Performance metrics}{149}{chapter.9}% -\contentsline {section}{\numberline {9.1}Metrics for Point Estimators}{151}{section.9.1}% +\contentsline {chapter}{\numberline {9}Performance Measures}{149}{chapter.9}% +\contentsline {section}{\numberline {9.1}Measures for Point Estimators}{151}{section.9.1}% \contentsline {subsection}{\numberline {9.1.1}Comparing the Performance of the Cluster RCT Estimation Procedures}{153}{subsection.9.1.1}% \contentsline {subsubsection}{Are the estimators biased?}{154}{section*.12}% \contentsline {subsubsection}{Which method has the smallest standard error?}{154}{section*.13}% \contentsline {subsubsection}{Which method has the smallest Root Mean Squared Error?}{155}{section*.14}% -\contentsline {subsection}{\numberline {9.1.2}Less Conventional Performance metrics}{156}{subsection.9.1.2}% -\contentsline {section}{\numberline {9.2}Metrics for Standard Error Estimators}{158}{section.9.2}% +\contentsline {subsection}{\numberline {9.1.2}Less Conventional Performance Measures}{156}{subsection.9.1.2}% +\contentsline {section}{\numberline {9.2}Measures for Variance Estimators}{158}{section.9.2}% \contentsline {subsection}{\numberline {9.2.1}Satterthwaite degrees of freedom}{160}{subsection.9.2.1}% \contentsline {subsection}{\numberline {9.2.2}Assessing SEs for the Cluster RCT Simulation}{161}{subsection.9.2.2}% -\contentsline {section}{\numberline {9.3}Metrics for Confidence Intervals}{162}{section.9.3}% +\contentsline {section}{\numberline {9.3}Measures for Confidence Intervals}{162}{section.9.3}% \contentsline {subsection}{\numberline 
{9.3.1}Confidence Intervals in the Cluster RCT Simulation}{163}{subsection.9.3.1}% -\contentsline {section}{\numberline {9.4}Metrics for Inferential Procedures (Hypothesis Tests)}{164}{section.9.4}% +\contentsline {section}{\numberline {9.4}Measures for Inferential Procedures (Hypothesis Tests)}{164}{section.9.4}% \contentsline {subsection}{\numberline {9.4.1}Validity}{165}{subsection.9.4.1}% \contentsline {subsection}{\numberline {9.4.2}Power}{165}{subsection.9.4.2}% -\contentsline {subsection}{\numberline {9.4.3}The Rejection Rate}{166}{subsection.9.4.3}% +\contentsline {subsection}{\numberline {9.4.3}Rejection Rates}{166}{subsection.9.4.3}% \contentsline {subsection}{\numberline {9.4.4}Inference in the Cluster RCT Simulation}{167}{subsection.9.4.4}% -\contentsline {section}{\numberline {9.5}Selecting Relative vs.\nobreakspace {}Absolute Metrics}{169}{section.9.5}% -\contentsline {section}{\numberline {9.6}Estimands Not Represented By a Parameter}{170}{section.9.6}% -\contentsline {section}{\numberline {9.7}Uncertainty in Performance Estimates (the Monte Carlo Standard Error)}{173}{section.9.7}% -\contentsline {subsection}{\numberline {9.7.1}MCSE for Relative Variance Estimators}{174}{subsection.9.7.1}% -\contentsline {subsection}{\numberline {9.7.2}Calculating MCSEs With the \texttt {simhelpers} Package}{175}{subsection.9.7.2}% -\contentsline {subsection}{\numberline {9.7.3}MCSE Calculation in our Cluster RCT Example}{176}{subsection.9.7.3}% -\contentsline {section}{\numberline {9.8}Summary of Peformance Measures}{177}{section.9.8}% -\contentsline {section}{\numberline {9.9}Concluding thoughts}{178}{section.9.9}% -\contentsline {section}{\numberline {9.10}Exercises}{178}{section.9.10}% -\contentsline {subsection}{\numberline {9.10.1}Brown and Forsythe (1974)}{178}{subsection.9.10.1}% -\contentsline {subsection}{\numberline {9.10.2}Better confidence intervals}{178}{subsection.9.10.2}% -\contentsline {subsection}{\numberline {9.10.3}Cluster RCT simulation under a 
strong null hypothesis}{179}{subsection.9.10.3}% -\contentsline {subsection}{\numberline {9.10.4}Jackknife calculation of MCSEs}{179}{subsection.9.10.4}% -\contentsline {subsection}{\numberline {9.10.5}Distribution theory for person-level average treatment effects}{179}{subsection.9.10.5}% -\contentsline {subsection}{\numberline {9.10.6}Multiple scenarios}{179}{subsection.9.10.6}% -\contentsline {part}{III\hspace {1em}Multifactor Simulations}{181}{part.3}% -\contentsline {chapter}{\numberline {10}Designing and executing multifactor simulations}{183}{chapter.10}% -\contentsline {section}{\numberline {10.1}Choosing parameter combinations}{185}{section.10.1}% -\contentsline {section}{\numberline {10.2}Using pmap to run multifactor simulations}{187}{section.10.2}% -\contentsline {section}{\numberline {10.3}When to calculate performance metrics}{191}{section.10.3}% -\contentsline {subsection}{\numberline {10.3.1}Aggregate as you simulate (inside)}{191}{subsection.10.3.1}% -\contentsline {subsection}{\numberline {10.3.2}Keep all simulation runs (outside)}{192}{subsection.10.3.2}% -\contentsline {subsection}{\numberline {10.3.3}Getting raw results ready for analysis}{193}{subsection.10.3.3}% -\contentsline {section}{\numberline {10.4}Summary}{195}{section.10.4}% -\contentsline {section}{\numberline {10.5}Case Study: A multifactor evaluation of cluster RCT estimators}{196}{section.10.5}% -\contentsline {subsection}{\numberline {10.5.1}Choosing parameters for the Clustered RCT}{196}{subsection.10.5.1}% -\contentsline {subsection}{\numberline {10.5.2}Redundant factor combinations}{198}{subsection.10.5.2}% -\contentsline {subsection}{\numberline {10.5.3}Running the simulations}{198}{subsection.10.5.3}% -\contentsline {subsection}{\numberline {10.5.4}Calculating performance metrics}{199}{subsection.10.5.4}% -\contentsline {section}{\numberline {10.6}Exercises}{200}{section.10.6}% -\contentsline {subsection}{\numberline {10.6.1}Brown and Forsythe redux}{200}{subsection.10.6.1}% 
-\contentsline {subsection}{\numberline {10.6.2}Meta-regression}{201}{subsection.10.6.2}% -\contentsline {subsection}{\numberline {10.6.3}Comparing the trimmed mean, median and mean}{201}{subsection.10.6.3}% -\contentsline {chapter}{\numberline {11}Exploring and presenting simulation results}{203}{chapter.11}% -\contentsline {section}{\numberline {11.1}Tabulation}{204}{section.11.1}% -\contentsline {subsection}{\numberline {11.1.1}Example: estimators of treatment variation}{206}{subsection.11.1.1}% -\contentsline {section}{\numberline {11.2}Visualization}{207}{section.11.2}% -\contentsline {subsection}{\numberline {11.2.1}Example 0: RMSE in Cluster RCTs}{208}{subsection.11.2.1}% -\contentsline {subsection}{\numberline {11.2.2}Example 1: Biserial correlation estimation}{209}{subsection.11.2.2}% -\contentsline {subsection}{\numberline {11.2.3}Example 2: Variance estimation and Meta-regression}{209}{subsection.11.2.3}% -\contentsline {subsection}{\numberline {11.2.4}Example 3: Heat maps of coverage}{210}{subsection.11.2.4}% -\contentsline {subsection}{\numberline {11.2.5}Example 4: Relative performance of treatment effect estimators}{211}{subsection.11.2.5}% -\contentsline {section}{\numberline {11.3}Modeling}{213}{section.11.3}% -\contentsline {subsection}{\numberline {11.3.1}Example 1: Biserial, revisited}{214}{subsection.11.3.1}% -\contentsline {subsection}{\numberline {11.3.2}Example 2: Comparing methods for cross-classified data}{215}{subsection.11.3.2}% -\contentsline {section}{\numberline {11.4}Reporting}{216}{section.11.4}% -\contentsline {chapter}{\numberline {12}Building good visualizations}{219}{chapter.12}% -\contentsline {section}{\numberline {12.1}Subsetting and Many Small Multiples}{220}{section.12.1}% -\contentsline {section}{\numberline {12.2}Bundling}{223}{section.12.2}% -\contentsline {section}{\numberline {12.3}Aggregation}{227}{section.12.3}% -\contentsline {subsubsection}{\numberline {12.3.0.1}Some notes on how to 
aggregate}{229}{subsubsection.12.3.0.1}% -\contentsline {section}{\numberline {12.4}Comparing true SEs with standardization}{230}{section.12.4}% -\contentsline {section}{\numberline {12.5}The Bias-SE-RMSE plot}{235}{section.12.5}% -\contentsline {section}{\numberline {12.6}Assessing the quality of the estimated SEs}{237}{section.12.6}% -\contentsline {subsection}{\numberline {12.6.1}Stability of estimated SEs}{239}{subsection.12.6.1}% -\contentsline {section}{\numberline {12.7}Assessing confidence intervals}{240}{section.12.7}% -\contentsline {section}{\numberline {12.8}Exercises}{242}{section.12.8}% -\contentsline {subsection}{\numberline {12.8.1}Assessing uncertainty}{242}{subsection.12.8.1}% -\contentsline {subsection}{\numberline {12.8.2}Assessing power}{242}{subsection.12.8.2}% -\contentsline {subsection}{\numberline {12.8.3}Going deeper with coverage}{242}{subsection.12.8.3}% -\contentsline {subsection}{\numberline {12.8.4}Pearson correlations with a bivariate Poisson distribution}{243}{subsection.12.8.4}% -\contentsline {subsection}{\numberline {12.8.5}Making another plot for assessing SEs}{243}{subsection.12.8.5}% -\contentsline {chapter}{\numberline {13}Special Topics on Reporting Simulation Results}{245}{chapter.13}% -\contentsline {section}{\numberline {13.1}Using regression to analyze simulation results}{245}{section.13.1}% -\contentsline {subsection}{\numberline {13.1.1}Example 1: Biserial, revisited}{245}{subsection.13.1.1}% -\contentsline {subsection}{\numberline {13.1.2}Example 2: Cluster RCT example, revisited}{248}{subsection.13.1.2}% -\contentsline {subsubsection}{\numberline {13.1.2.1}Using LASSO to simplify the model}{249}{subsubsection.13.1.2.1}% -\contentsline {section}{\numberline {13.2}Using regression trees to find important factors}{254}{section.13.2}% -\contentsline {section}{\numberline {13.3}Analyzing results with few iterations per scenario}{256}{section.13.3}% -\contentsline {subsection}{\numberline {13.3.1}Example: ClusterRCT with 
only 100 replicates per scenario}{257}{subsection.13.3.1}% -\contentsline {section}{\numberline {13.4}What to do with warnings in simulations}{263}{section.13.4}% -\contentsline {chapter}{\numberline {14}Case study: Comparing different estimators}{267}{chapter.14}% -\contentsline {section}{\numberline {14.1}Bias-variance tradeoffs}{270}{section.14.1}% -\contentsline {chapter}{\numberline {15}Simulations as evidence}{275}{chapter.15}% -\contentsline {section}{\numberline {15.1}Strategies for making relevant simulations}{276}{section.15.1}% -\contentsline {subsection}{\numberline {15.1.1}Break symmetries and regularities}{276}{subsection.15.1.1}% -\contentsline {subsection}{\numberline {15.1.2}Make your simulation general with an extensive multi-factor experiment}{277}{subsection.15.1.2}% -\contentsline {subsection}{\numberline {15.1.3}Use previously published simulations to beat them at their own game}{277}{subsection.15.1.3}% -\contentsline {subsection}{\numberline {15.1.4}Calibrate simulation factors to real data}{277}{subsection.15.1.4}% -\contentsline {subsection}{\numberline {15.1.5}Use real data to obtain directly}{277}{subsection.15.1.5}% -\contentsline {subsection}{\numberline {15.1.6}Fully calibrated simulations}{278}{subsection.15.1.6}% -\contentsline {part}{IV\hspace {1em}Computational Considerations}{281}{part.4}% -\contentsline {chapter}{\numberline {16}Organizing a simulation project}{283}{chapter.16}% -\contentsline {section}{\numberline {16.1}Well structured R scripts}{284}{section.16.1}% -\contentsline {subsection}{\numberline {16.1.1}The source command}{284}{subsection.16.1.1}% -\contentsline {subsection}{\numberline {16.1.2}Putting headers in your .R file}{285}{subsection.16.1.2}% -\contentsline {subsection}{\numberline {16.1.3}Storing testing code in your scripts}{286}{subsection.16.1.3}% -\contentsline {section}{\numberline {16.2}Principled directory structures}{286}{section.16.2}% -\contentsline {section}{\numberline {16.3}Saving simulation 
results}{287}{section.16.3}% -\contentsline {subsection}{\numberline {16.3.1}Saving simulations in general}{287}{subsection.16.3.1}% -\contentsline {subsection}{\numberline {16.3.2}Saving simulations as you go}{288}{subsection.16.3.2}% -\contentsline {subsection}{\numberline {16.3.3}Dynamically making directories}{291}{subsection.16.3.3}% -\contentsline {subsection}{\numberline {16.3.4}Loading and combining files of simulation results}{291}{subsection.16.3.4}% -\contentsline {chapter}{\numberline {17}Parallel Processing}{293}{chapter.17}% -\contentsline {section}{\numberline {17.1}Parallel on your computer}{294}{section.17.1}% -\contentsline {section}{\numberline {17.2}Parallel on a virtual machine}{295}{section.17.2}% -\contentsline {section}{\numberline {17.3}Parallel on a cluster}{295}{section.17.3}% -\contentsline {subsection}{\numberline {17.3.1}What is a command-line interface?}{296}{subsection.17.3.1}% -\contentsline {subsection}{\numberline {17.3.2}Running a job on a cluster}{298}{subsection.17.3.2}% -\contentsline {subsection}{\numberline {17.3.3}Checking on a job}{300}{subsection.17.3.3}% -\contentsline {subsection}{\numberline {17.3.4}Running lots of jobs on a cluster}{300}{subsection.17.3.4}% -\contentsline {subsection}{\numberline {17.3.5}Resources for Harvard's Odyssey}{303}{subsection.17.3.5}% -\contentsline {subsection}{\numberline {17.3.6}Acknowledgements}{303}{subsection.17.3.6}% -\contentsline {chapter}{\numberline {18}Debugging and Testing}{305}{chapter.18}% -\contentsline {section}{\numberline {18.1}Debugging with \texttt {print()}}{305}{section.18.1}% -\contentsline {section}{\numberline {18.2}Debugging with \texttt {browser()}}{306}{section.18.2}% -\contentsline {section}{\numberline {18.3}Debugging with \texttt {debug()}}{307}{section.18.3}% -\contentsline {section}{\numberline {18.4}Protecting functions with \texttt {stop()}}{307}{section.18.4}% -\contentsline {section}{\numberline {18.5}Testing code}{308}{section.18.5}% -\contentsline 
{part}{V\hspace {1em}Complex Data Structures}{313}{part.5}% -\contentsline {chapter}{\numberline {19}Using simulation as a power calculator}{315}{chapter.19}% -\contentsline {section}{\numberline {19.1}Getting design parameters from pilot data}{316}{section.19.1}% -\contentsline {section}{\numberline {19.2}The data generating process}{317}{section.19.2}% -\contentsline {section}{\numberline {19.3}Running the simulation}{321}{section.19.3}% -\contentsline {section}{\numberline {19.4}Evaluating power}{322}{section.19.4}% -\contentsline {subsection}{\numberline {19.4.1}Checking validity of our models}{322}{subsection.19.4.1}% -\contentsline {subsection}{\numberline {19.4.2}Assessing Precision (SE)}{324}{subsection.19.4.2}% -\contentsline {subsection}{\numberline {19.4.3}Assessing power}{325}{subsection.19.4.3}% -\contentsline {subsection}{\numberline {19.4.4}Assessing Minimum Detectable Effects}{326}{subsection.19.4.4}% -\contentsline {section}{\numberline {19.5}Power for Multilevel Data}{327}{section.19.5}% -\contentsline {chapter}{\numberline {20}Simulation under the Potential Outcomes Framework}{331}{chapter.20}% -\contentsline {section}{\numberline {20.1}Finite vs.\nobreakspace {}Superpopulation inference}{332}{section.20.1}% -\contentsline {section}{\numberline {20.2}Data generation processes for potential outcomes}{332}{section.20.2}% -\contentsline {section}{\numberline {20.3}Finite sample performance measures}{335}{section.20.3}% -\contentsline {section}{\numberline {20.4}Nested finite simulation procedure}{338}{section.20.4}% -\contentsline {chapter}{\numberline {21}The Parametric bootstrap}{343}{chapter.21}% -\contentsline {section}{\numberline {21.1}Air conditioners: a stolen case study}{344}{section.21.1}% -\contentsline {chapter}{\numberline {A}Coding Reference}{347}{appendix.A}% -\contentsline {section}{\numberline {A.1}How to repeat yourself}{347}{section.A.1}% -\contentsline {subsection}{\numberline {A.1.1}Using \texttt 
{replicate()}}{347}{subsection.A.1.1}% -\contentsline {subsection}{\numberline {A.1.2}Using \texttt {map()}}{348}{subsection.A.1.2}% -\contentsline {subsection}{\numberline {A.1.3}map with no inputs}{350}{subsection.A.1.3}% -\contentsline {subsection}{\numberline {A.1.4}Other approaches for repetition}{350}{subsection.A.1.4}% -\contentsline {section}{\numberline {A.2}Default arguments for functions}{351}{section.A.2}% -\contentsline {section}{\numberline {A.3}Profiling Code}{352}{section.A.3}% -\contentsline {subsection}{\numberline {A.3.1}Using \texttt {Sys.time()} and \texttt {system.time()}}{352}{subsection.A.3.1}% -\contentsline {subsection}{\numberline {A.3.2}The \texttt {tictoc} package}{353}{subsection.A.3.2}% -\contentsline {subsection}{\numberline {A.3.3}The \texttt {bench} package}{353}{subsection.A.3.3}% -\contentsline {subsection}{\numberline {A.3.4}Profiling with \texttt {profvis}}{356}{subsection.A.3.4}% -\contentsline {section}{\numberline {A.4}Optimizing code (and why you often shouldn't)}{356}{section.A.4}% -\contentsline {subsection}{\numberline {A.4.1}Hand-building functions}{357}{subsection.A.4.1}% -\contentsline {subsection}{\numberline {A.4.2}Computational efficiency versus simplicity}{358}{subsection.A.4.2}% -\contentsline {subsection}{\numberline {A.4.3}Reusing code to speed up computation}{360}{subsection.A.4.3}% -\contentsline {chapter}{\numberline {B}Further readings and resources}{365}{appendix.B}% +\contentsline {section}{\numberline {9.5}Relative or Absolute Measures?}{168}{section.9.5}% +\contentsline {subsection}{\numberline {9.5.1}Performance relative to a benchmark estimator}{170}{subsection.9.5.1}% +\contentsline {section}{\numberline {9.6}Estimands Not Represented By a Parameter}{171}{section.9.6}% +\contentsline {section}{\numberline {9.7}Uncertainty in Performance Estimates (the Monte Carlo Standard Error)}{174}{section.9.7}% +\contentsline {subsection}{\numberline {9.7.1}Conventional measures for point 
estimators}{174}{subsection.9.7.1}% +\contentsline {subsection}{\numberline {9.7.2}Less conventional measures for point estimators}{176}{subsection.9.7.2}% +\contentsline {subsection}{\numberline {9.7.3}MCSE for Relative Variance Estimators}{177}{subsection.9.7.3}% +\contentsline {subsection}{\numberline {9.7.4}MCSE for Confidence Intervals and Hypothesis Tests}{178}{subsection.9.7.4}% +\contentsline {subsection}{\numberline {9.7.5}Calculating MCSEs With the \texttt {simhelpers} Package}{179}{subsection.9.7.5}% +\contentsline {subsection}{\numberline {9.7.6}MCSE Calculation in our Cluster RCT Example}{180}{subsection.9.7.6}% +\contentsline {section}{\numberline {9.8}Summary of Performance Measures}{182}{section.9.8}% +\contentsline {section}{\numberline {9.9}Concluding thoughts}{183}{section.9.9}% +\contentsline {section}{\numberline {9.10}Exercises}{183}{section.9.10}% +\contentsline {subsection}{\numberline {9.10.1}Brown and Forsythe (1974) results}{183}{subsection.9.10.1}% +\contentsline {subsection}{\numberline {9.10.2}Size-adjusted power}{184}{subsection.9.10.2}% +\contentsline {subsection}{\numberline {9.10.3}Three correlation estimators}{184}{subsection.9.10.3}% +\contentsline {subsection}{\numberline {9.10.4}Confidence interval comparison}{186}{subsection.9.10.4}% +\contentsline {subsection}{\numberline {9.10.5}Jackknife calculation of MCSEs for RMSE}{186}{subsection.9.10.5}% +\contentsline {subsection}{\numberline {9.10.6}Jackknife calculation of MCSEs for RMSE ratios}{187}{subsection.9.10.6}% +\contentsline {subsection}{\numberline {9.10.7}Distribution theory for person-level average treatment effects}{187}{subsection.9.10.7}% +\contentsline {part}{III\hspace {1em}Systematic Simulations}{189}{part.3}% +\contentsline {chapter}{\numberline {10}Simulating across multiple scenarios}{191}{chapter.10}% +\contentsline {section}{\numberline {10.1}Simulating across levels of a single factor}{192}{section.10.1}% +\contentsline {subsection}{\numberline {10.1.1}A 
performance summary function}{195}{subsection.10.1.1}% +\contentsline {subsection}{\numberline {10.1.2}Adding performance calculations to the simulation driver}{196}{subsection.10.1.2}% +\contentsline {section}{\numberline {10.2}Simulating across multiple factors}{198}{section.10.2}% +\contentsline {section}{\numberline {10.3}Using pmap to run multifactor simulations}{200}{section.10.3}% +\contentsline {section}{\numberline {10.4}When to calculate performance metrics}{204}{section.10.4}% +\contentsline {subsection}{\numberline {10.4.1}Aggregate as you simulate (inside)}{204}{subsection.10.4.1}% +\contentsline {subsection}{\numberline {10.4.2}Keep all simulation runs (outside)}{205}{subsection.10.4.2}% +\contentsline {subsection}{\numberline {10.4.3}Getting raw results ready for analysis}{206}{subsection.10.4.3}% +\contentsline {section}{\numberline {10.5}Summary}{208}{section.10.5}% +\contentsline {section}{\numberline {10.6}Exercises}{209}{section.10.6}% +\contentsline {subsection}{\numberline {10.6.1}Extending Brown and Forsythe}{209}{subsection.10.6.1}% +\contentsline {subsection}{\numberline {10.6.2}Comparing the trimmed mean, median and mean}{209}{subsection.10.6.2}% +\contentsline {subsection}{\numberline {10.6.3}Estimating latent correlations}{210}{subsection.10.6.3}% +\contentsline {subsection}{\numberline {10.6.4}Meta-regression}{211}{subsection.10.6.4}% +\contentsline {subsection}{\numberline {10.6.5}Examine a multifactor simulation design}{211}{subsection.10.6.5}% +\contentsline {chapter}{\numberline {11}Designing multifactor simulations}{213}{chapter.11}% +\contentsline {section}{\numberline {11.1}Choosing parameter combinations}{214}{section.11.1}% +\contentsline {section}{\numberline {11.2}Case Study: A multifactor evaluation of cluster RCT estimators}{216}{section.11.2}% +\contentsline {subsection}{\numberline {11.2.1}Choosing parameters for the Clustered RCT}{216}{subsection.11.2.1}% +\contentsline {subsection}{\numberline {11.2.2}Redundant factor 
combinations}{218}{subsection.11.2.2}% +\contentsline {subsection}{\numberline {11.2.3}Running the simulations}{218}{subsection.11.2.3}% +\contentsline {subsection}{\numberline {11.2.4}Calculating performance metrics}{219}{subsection.11.2.4}% +\contentsline {chapter}{\numberline {12}Exploring and presenting simulation results}{221}{chapter.12}% +\contentsline {section}{\numberline {12.1}Tabulation}{222}{section.12.1}% +\contentsline {subsection}{\numberline {12.1.1}Example: estimators of treatment variation}{224}{subsection.12.1.1}% +\contentsline {section}{\numberline {12.2}Visualization}{225}{section.12.2}% +\contentsline {subsection}{\numberline {12.2.1}Example 0: RMSE in Cluster RCTs}{226}{subsection.12.2.1}% +\contentsline {subsection}{\numberline {12.2.2}Example 1: Biserial correlation estimation}{227}{subsection.12.2.2}% +\contentsline {subsection}{\numberline {12.2.3}Example 2: Variance estimation and Meta-regression}{227}{subsection.12.2.3}% +\contentsline {subsection}{\numberline {12.2.4}Example 3: Heat maps of coverage}{228}{subsection.12.2.4}% +\contentsline {subsection}{\numberline {12.2.5}Example 4: Relative performance of treatment effect estimators}{229}{subsection.12.2.5}% +\contentsline {section}{\numberline {12.3}Modeling}{231}{section.12.3}% +\contentsline {subsection}{\numberline {12.3.1}Example 1: Biserial, revisited}{232}{subsection.12.3.1}% +\contentsline {subsection}{\numberline {12.3.2}Example 2: Comparing methods for cross-classified data}{233}{subsection.12.3.2}% +\contentsline {section}{\numberline {12.4}Reporting}{234}{section.12.4}% +\contentsline {chapter}{\numberline {13}Building good visualizations}{237}{chapter.13}% +\contentsline {section}{\numberline {13.1}Subsetting and Many Small Multiples}{238}{section.13.1}% +\contentsline {section}{\numberline {13.2}Bundling}{241}{section.13.2}% +\contentsline {section}{\numberline {13.3}Aggregation}{245}{section.13.3}% +\contentsline {subsubsection}{\numberline {13.3.0.1}Some notes on how 
to aggregate}{247}{subsubsection.13.3.0.1}% +\contentsline {section}{\numberline {13.4}Comparing true SEs with standardization}{248}{section.13.4}% +\contentsline {section}{\numberline {13.5}The Bias-SE-RMSE plot}{253}{section.13.5}% +\contentsline {section}{\numberline {13.6}Assessing the quality of the estimated SEs}{255}{section.13.6}% +\contentsline {subsection}{\numberline {13.6.1}Stability of estimated SEs}{257}{subsection.13.6.1}% +\contentsline {section}{\numberline {13.7}Assessing confidence intervals}{258}{section.13.7}% +\contentsline {section}{\numberline {13.8}Exercises}{260}{section.13.8}% +\contentsline {subsection}{\numberline {13.8.1}Assessing uncertainty}{261}{subsection.13.8.1}% +\contentsline {subsection}{\numberline {13.8.2}Assessing power}{261}{subsection.13.8.2}% +\contentsline {subsection}{\numberline {13.8.3}Going deeper with coverage}{262}{subsection.13.8.3}% +\contentsline {subsection}{\numberline {13.8.4}Pearson correlations with a bivariate Poisson distribution}{262}{subsection.13.8.4}% +\contentsline {subsection}{\numberline {13.8.5}Making another plot for assessing SEs}{262}{subsection.13.8.5}% +\contentsline {chapter}{\numberline {14}Special Topics on Reporting Simulation Results}{263}{chapter.14}% +\contentsline {section}{\numberline {14.1}Using regression to analyze simulation results}{263}{section.14.1}% +\contentsline {subsection}{\numberline {14.1.1}Example 1: Biserial, revisited}{263}{subsection.14.1.1}% +\contentsline {subsection}{\numberline {14.1.2}Example 2: Cluster RCT example, revisited}{266}{subsection.14.1.2}% +\contentsline {subsubsection}{\numberline {14.1.2.1}Using LASSO to simplify the model}{267}{subsubsection.14.1.2.1}% +\contentsline {section}{\numberline {14.2}Using regression trees to find important factors}{272}{section.14.2}% +\contentsline {section}{\numberline {14.3}Analyzing results with few iterations per scenario}{274}{section.14.3}% +\contentsline {subsection}{\numberline {14.3.1}Example: ClusterRCT 
with only 100 replicates per scenario}{275}{subsection.14.3.1}% +\contentsline {section}{\numberline {14.4}What to do with warnings in simulations}{281}{section.14.4}% +\contentsline {chapter}{\numberline {15}Case study: Comparing different estimators}{285}{chapter.15}% +\contentsline {section}{\numberline {15.1}Bias-variance tradeoffs}{288}{section.15.1}% +\contentsline {chapter}{\numberline {16}Simulations as evidence}{293}{chapter.16}% +\contentsline {section}{\numberline {16.1}Strategies for making relevant simulations}{294}{section.16.1}% +\contentsline {subsection}{\numberline {16.1.1}Break symmetries and regularities}{294}{subsection.16.1.1}% +\contentsline {subsection}{\numberline {16.1.2}Make your simulation general with an extensive multi-factor experiment}{295}{subsection.16.1.2}% +\contentsline {subsection}{\numberline {16.1.3}Use previously published simulations to beat them at their own game}{295}{subsection.16.1.3}% +\contentsline {subsection}{\numberline {16.1.4}Calibrate simulation factors to real data}{295}{subsection.16.1.4}% +\contentsline {subsection}{\numberline {16.1.5}Use real data to obtain directly}{295}{subsection.16.1.5}% +\contentsline {subsection}{\numberline {16.1.6}Fully calibrated simulations}{296}{subsection.16.1.6}% +\contentsline {part}{IV\hspace {1em}Computational Considerations}{299}{part.4}% +\contentsline {chapter}{\numberline {17}Organizing a simulation project}{301}{chapter.17}% +\contentsline {section}{\numberline {17.1}Well structured R scripts}{302}{section.17.1}% +\contentsline {subsection}{\numberline {17.1.1}The source command}{302}{subsection.17.1.1}% +\contentsline {subsection}{\numberline {17.1.2}Putting headers in your .R file}{303}{subsection.17.1.2}% +\contentsline {subsection}{\numberline {17.1.3}Storing testing code in your scripts}{304}{subsection.17.1.3}% +\contentsline {section}{\numberline {17.2}Principled directory structures}{304}{section.17.2}% +\contentsline {section}{\numberline {17.3}Saving 
simulation results}{305}{section.17.3}% +\contentsline {subsection}{\numberline {17.3.1}Saving simulations in general}{305}{subsection.17.3.1}% +\contentsline {subsection}{\numberline {17.3.2}Saving simulations as you go}{306}{subsection.17.3.2}% +\contentsline {subsection}{\numberline {17.3.3}Dynamically making directories}{309}{subsection.17.3.3}% +\contentsline {subsection}{\numberline {17.3.4}Loading and combining files of simulation results}{309}{subsection.17.3.4}% +\contentsline {chapter}{\numberline {18}Parallel Processing}{311}{chapter.18}% +\contentsline {section}{\numberline {18.1}Parallel on your computer}{312}{section.18.1}% +\contentsline {section}{\numberline {18.2}Parallel on a virtual machine}{313}{section.18.2}% +\contentsline {section}{\numberline {18.3}Parallel on a cluster}{313}{section.18.3}% +\contentsline {subsection}{\numberline {18.3.1}What is a command-line interface?}{314}{subsection.18.3.1}% +\contentsline {subsection}{\numberline {18.3.2}Running a job on a cluster}{316}{subsection.18.3.2}% +\contentsline {subsection}{\numberline {18.3.3}Checking on a job}{318}{subsection.18.3.3}% +\contentsline {subsection}{\numberline {18.3.4}Running lots of jobs on a cluster}{318}{subsection.18.3.4}% +\contentsline {subsection}{\numberline {18.3.5}Resources for Harvard's Odyssey}{321}{subsection.18.3.5}% +\contentsline {subsection}{\numberline {18.3.6}Acknowledgements}{321}{subsection.18.3.6}% +\contentsline {chapter}{\numberline {19}Debugging and Testing}{323}{chapter.19}% +\contentsline {section}{\numberline {19.1}Debugging with \texttt {print()}}{323}{section.19.1}% +\contentsline {section}{\numberline {19.2}Debugging with \texttt {browser()}}{324}{section.19.2}% +\contentsline {section}{\numberline {19.3}Debugging with \texttt {debug()}}{325}{section.19.3}% +\contentsline {section}{\numberline {19.4}Protecting functions with \texttt {stop()}}{325}{section.19.4}% +\contentsline {section}{\numberline {19.5}Testing code}{326}{section.19.5}% 
+\contentsline {part}{V\hspace {1em}Complex Data Structures}{331}{part.5}% +\contentsline {chapter}{\numberline {20}Using simulation as a power calculator}{333}{chapter.20}% +\contentsline {section}{\numberline {20.1}Getting design parameters from pilot data}{334}{section.20.1}% +\contentsline {section}{\numberline {20.2}The data generating process}{335}{section.20.2}% +\contentsline {section}{\numberline {20.3}Running the simulation}{339}{section.20.3}% +\contentsline {section}{\numberline {20.4}Evaluating power}{340}{section.20.4}% +\contentsline {subsection}{\numberline {20.4.1}Checking validity of our models}{340}{subsection.20.4.1}% +\contentsline {subsection}{\numberline {20.4.2}Assessing Precision (SE)}{342}{subsection.20.4.2}% +\contentsline {subsection}{\numberline {20.4.3}Assessing power}{343}{subsection.20.4.3}% +\contentsline {subsection}{\numberline {20.4.4}Assessing Minimum Detectable Effects}{344}{subsection.20.4.4}% +\contentsline {section}{\numberline {20.5}Power for Multilevel Data}{345}{section.20.5}% +\contentsline {chapter}{\numberline {21}Simulation under the Potential Outcomes Framework}{349}{chapter.21}% +\contentsline {section}{\numberline {21.1}Finite vs.~Superpopulation inference}{350}{section.21.1}% +\contentsline {section}{\numberline {21.2}Data generation processes for potential outcomes}{350}{section.21.2}% +\contentsline {section}{\numberline {21.3}Finite sample performance measures}{353}{section.21.3}% +\contentsline {section}{\numberline {21.4}Nested finite simulation procedure}{356}{section.21.4}% +\contentsline {chapter}{\numberline {22}The Parametric bootstrap}{361}{chapter.22}% +\contentsline {section}{\numberline {22.1}Air conditioners: a stolen case study}{362}{section.22.1}% +\contentsline {chapter}{\numberline {A}Coding Reference}{365}{appendix.A}% +\contentsline {section}{\numberline {A.1}How to repeat yourself}{365}{section.A.1}% +\contentsline {subsection}{\numberline {A.1.1}Using \texttt 
{replicate()}}{365}{subsection.A.1.1}% +\contentsline {subsection}{\numberline {A.1.2}Using \texttt {map()}}{366}{subsection.A.1.2}% +\contentsline {subsection}{\numberline {A.1.3}map with no inputs}{368}{subsection.A.1.3}% +\contentsline {subsection}{\numberline {A.1.4}Other approaches for repetition}{368}{subsection.A.1.4}% +\contentsline {section}{\numberline {A.2}Default arguments for functions}{369}{section.A.2}% +\contentsline {section}{\numberline {A.3}Profiling Code}{370}{section.A.3}% +\contentsline {subsection}{\numberline {A.3.1}Using \texttt {Sys.time()} and \texttt {system.time()}}{370}{subsection.A.3.1}% +\contentsline {subsection}{\numberline {A.3.2}The \texttt {tictoc} package}{371}{subsection.A.3.2}% +\contentsline {subsection}{\numberline {A.3.3}The \texttt {bench} package}{371}{subsection.A.3.3}% +\contentsline {subsection}{\numberline {A.3.4}Profiling with \texttt {profvis}}{374}{subsection.A.3.4}% +\contentsline {section}{\numberline {A.4}Optimizing code (and why you often shouldn't)}{374}{section.A.4}% +\contentsline {subsection}{\numberline {A.4.1}Hand-building functions}{375}{subsection.A.4.1}% +\contentsline {subsection}{\numberline {A.4.2}Computational efficiency versus simplicity}{376}{subsection.A.4.2}% +\contentsline {subsection}{\numberline {A.4.3}Reusing code to speed up computation}{377}{subsection.A.4.3}% +\contentsline {chapter}{\numberline {B}Further readings and resources}{383}{appendix.B}% diff --git a/Designing-Simulations-in-R_files/figure-latex/clusterRCT_plot_bias_v1-1.pdf b/Designing-Simulations-in-R_files/figure-latex/clusterRCT_plot_bias_v1-1.pdf deleted file mode 100644 index 9c5afeb..0000000 Binary files a/Designing-Simulations-in-R_files/figure-latex/clusterRCT_plot_bias_v1-1.pdf and /dev/null differ diff --git a/Designing-Simulations-in-R_files/figure-latex/clusterRCT_plot_bias_v2-1.pdf b/Designing-Simulations-in-R_files/figure-latex/clusterRCT_plot_bias_v2-1.pdf deleted file mode 100644 index 42a1248..0000000 Binary 
files a/Designing-Simulations-in-R_files/figure-latex/clusterRCT_plot_bias_v2-1.pdf and /dev/null differ diff --git a/Designing-Simulations-in-R_files/figure-latex/disc_mde-1.pdf b/Designing-Simulations-in-R_files/figure-latex/disc_mde-1.pdf deleted file mode 100644 index ca2b705..0000000 Binary files a/Designing-Simulations-in-R_files/figure-latex/disc_mde-1.pdf and /dev/null differ diff --git a/Designing-Simulations-in-R_files/figure-latex/disc_power-1.pdf b/Designing-Simulations-in-R_files/figure-latex/disc_power-1.pdf deleted file mode 100644 index 80f8671..0000000 Binary files a/Designing-Simulations-in-R_files/figure-latex/disc_power-1.pdf and /dev/null differ diff --git a/Designing-Simulations-in-R_files/figure-latex/disc_precision-1.pdf b/Designing-Simulations-in-R_files/figure-latex/disc_precision-1.pdf deleted file mode 100644 index 7eb6b2d..0000000 Binary files a/Designing-Simulations-in-R_files/figure-latex/disc_precision-1.pdf and /dev/null differ diff --git a/Designing-Simulations-in-R_files/figure-latex/swan_example_setup-1.pdf b/Designing-Simulations-in-R_files/figure-latex/swan_example_setup-1.pdf deleted file mode 100644 index dd3f2f9..0000000 Binary files a/Designing-Simulations-in-R_files/figure-latex/swan_example_setup-1.pdf and /dev/null differ diff --git a/Designing-Simulations-in-R_files/figure-latex/ttest_result_figure-1.pdf b/Designing-Simulations-in-R_files/figure-latex/ttest_result_figure-1.pdf deleted file mode 100644 index 316bcaa..0000000 Binary files a/Designing-Simulations-in-R_files/figure-latex/ttest_result_figure-1.pdf and /dev/null differ diff --git a/Designing-Simulations-in-R_files/figure-latex/unnamed-chunk-2-1.pdf b/Designing-Simulations-in-R_files/figure-latex/unnamed-chunk-2-1.pdf deleted file mode 100644 index 040be88..0000000 Binary files a/Designing-Simulations-in-R_files/figure-latex/unnamed-chunk-2-1.pdf and /dev/null differ diff --git a/_output.yml b/_output.yml index 7fde07e..84f0bdb 100644 --- a/_output.yml +++ 
b/_output.yml @@ -21,6 +21,8 @@ bookdown::gitbook: bookdown::pdf_book: includes: in_header: preamble.tex + bibliography: [book.bib, packages.bib] + biblio-style: apalike citation_package: natbib keep_tex: yes bookdown::epub_book: default diff --git a/book.bib b/book.bib index 93b66fe..18cbdd8 100644 --- a/book.bib +++ b/book.bib @@ -1,120 +1,115 @@ %% This BibTeX bibliography file was created using BibDesk. %% https://bibdesk.sourceforge.io/ -%% Created for Luke Miratrix at 2025-08-13 09:51:12 -0700 +%% Created for Luke Miratrix at 2025-12-18 10:05:30 -0500 %% Saved with string encoding Unicode (UTF-8) + + @article{Benjamin2017redefine, - title = {Redefine Statistical Significance}, - author = {Benjamin, Daniel J. and Berger, James O. and Johannesson, Magnus and Nosek, Brian A. and Wagenmakers, E.-J. and Berk, Richard and Bollen, Kenneth A. and Brembs, Björn and Brown, Lawrence and Camerer, Colin and Cesarini, David and Chambers, Christopher D. and Clyde, Merlise and Cook, Thomas D. and De Boeck, Paul and Dienes, Zoltan and Dreber, Anna and Easwaran, Kenny and Efferson, Charles and Fehr, Ernst and Fidler, Fiona and Field, Andy P. and Forster, Malcolm and George, Edward I. and Gonzalez, Richard and Goodman, Steven and Green, Edwin and Green, Donald P. and Greenwald, Anthony G. and Hadfield, Jarrod D. and Hedges, Larry V. and Held, Leonhard and Hua Ho, Teck and Hoijtink, Herbert and Hruschka, Daniel J. and Imai, Kosuke and Imbens, Guido and Ioannidis, John P. A. and Jeon, Minjeong and Jones, James Holland and Kirchler, Michael and Laibson, David and List, John and Little, Roderick and Lupia, Arthur and Machery, Edouard and Maxwell, Scott E. and McCarthy, Michael and Moore, Don A. and Morgan, Stephen L. and Munafó, Marcus and Nakagawa, Shinichi and Nyhan, Brendan and Parker, Timothy H. and Pericchi, Luis and Perugini, Marco and Rouder, Jeff and Rousseau, Judith and Savalei, Victoria and Schönbrodt, Felix D. 
and Sellke, Thomas and Sinclair, Betsy and Tingley, Dustin and Van Zandt, Trisha and Vazire, Simine and Watts, Duncan J. and Winship, Christopher and Wolpert, Robert L. and Xie, Yu and Young, Cristobal and Zinman, Jonathan and Johnson, Valen E.}, - journaltitle = {Nature Human Behaviour}, - shortjournal = {Nat Hum Behav}, - volume = {2}, - number = {1}, - pages = {6--10}, - issn = {2397-3374}, - doi = {10.1038/s41562-017-0189-z}, - url = {https://www.nature.com/articles/s41562-017-0189-z}, - urldate = {2025-09-19}, - langid = {english}, - year = {2017}, -} + author = {Benjamin, Daniel J. and Berger, James O. and Johannesson, Magnus and Nosek, Brian A. and Wagenmakers, E.-J. and Berk, Richard and Bollen, Kenneth A. and Brembs, Bj{\"o}rn and Brown, Lawrence and Camerer, Colin and Cesarini, David and Chambers, Christopher D. and Clyde, Merlise and Cook, Thomas D. and De Boeck, Paul and Dienes, Zoltan and Dreber, Anna and Easwaran, Kenny and Efferson, Charles and Fehr, Ernst and Fidler, Fiona and Field, Andy P. and Forster, Malcolm and George, Edward I. and Gonzalez, Richard and Goodman, Steven and Green, Edwin and Green, Donald P. and Greenwald, Anthony G. and Hadfield, Jarrod D. and Hedges, Larry V. and Held, Leonhard and Hua Ho, Teck and Hoijtink, Herbert and Hruschka, Daniel J. and Imai, Kosuke and Imbens, Guido and Ioannidis, John P. A. and Jeon, Minjeong and Jones, James Holland and Kirchler, Michael and Laibson, David and List, John and Little, Roderick and Lupia, Arthur and Machery, Edouard and Maxwell, Scott E. and McCarthy, Michael and Moore, Don A. and Morgan, Stephen L. and Munaf{\'o}, Marcus and Nakagawa, Shinichi and Nyhan, Brendan and Parker, Timothy H. and Pericchi, Luis and Perugini, Marco and Rouder, Jeff and Rousseau, Judith and Savalei, Victoria and Sch{\"o}nbrodt, Felix D. and Sellke, Thomas and Sinclair, Betsy and Tingley, Dustin and Van Zandt, Trisha and Vazire, Simine and Watts, Duncan J. and Winship, Christopher and Wolpert, Robert L. 
and Xie, Yu and Young, Cristobal and Zinman, Jonathan and Johnson, Valen E.}, + doi = {10.1038/s41562-017-0189-z}, + issn = {2397-3374}, + journaltitle = {Nature Human Behaviour}, + langid = {english}, + number = {1}, + pages = {6--10}, + shortjournal = {Nat Hum Behav}, + title = {Redefine Statistical Significance}, + url = {https://www.nature.com/articles/s41562-017-0189-z}, + urldate = {2025-09-19}, + volume = {2}, + year = {2017}, + bdsk-url-1 = {https://www.nature.com/articles/s41562-017-0189-z}, + bdsk-url-2 = {https://doi.org/10.1038/s41562-017-0189-z}} @article{Lakens2018justify, - title = {Justify Your Alpha}, - author = {Lakens, Daniel and Adolfi, Federico G. and Albers, Casper J. and Anvari, Farid and Apps, Matthew A. J. and Argamon, Shlomo E. and Baguley, Thom and Becker, Raymond B. and Benning, Stephen D. and Bradford, Daniel E. and Buchanan, Erin M. and Caldwell, Aaron R. and Van Calster, Ben and Carlsson, Rickard and Chen, Sau-Chin and Chung, Bryan and Colling, Lincoln J. and Collins, Gary S. and Crook, Zander and Cross, Emily S. and Daniels, Sameera and Danielsson, Henrik and DeBruine, Lisa and Dunleavy, Daniel J. and Earp, Brian D. and Feist, Michele I. and Ferrell, Jason D. and Field, James G. and Fox, Nicholas W. and Friesen, Amanda and Gomes, Caio and Gonzalez-Marquez, Monica and Grange, James A. and Grieve, Andrew P. and Guggenberger, Robert and Grist, James and Van Harmelen, Anne-Laura and Hasselman, Fred and Hochard, Kevin D. and Hoffarth, Mark R. and Holmes, Nicholas P. and Ingre, Michael and Isager, Peder M. and Isotalus, Hanna K. and Johansson, Christer and Juszczyk, Konrad and Kenny, David A. and Khalil, Ahmed A. and Konat, Barbara and Lao, Junpeng and Larsen, Erik Gahner and Lodder, Gerine M. A. and Lukavský, Jiří and Madan, Christopher R. and Manheim, David and Martin, Stephen R. and Martin, Andrea E. and Mayo, Deborah G. and McCarthy, Randy J. and McConway, Kevin and McFarland, Colin and Nio, Amanda Q. X. 
and Nilsonne, Gustav and De Oliveira, Cilene Lino and De Xivry, Jean-Jacques Orban and Parsons, Sam and Pfuhl, Gerit and Quinn, Kimberly A. and Sakon, John J. and Saribay, S. Adil and Schneider, Iris K. and Selvaraju, Manojkumar and Sjoerds, Zsuzsika and Smith, Samuel G. and Smits, Tim and Spies, Jeffrey R. and Sreekumar, Vishnu and Steltenpohl, Crystal N. and Stenhouse, Neil and Świątkowski, Wojciech and Vadillo, Miguel A. and Van Assen, Marcel A. L. M. and Williams, Matt N. and Williams, Samantha E. and Williams, Donald R. and Yarkoni, Tal and Ziano, Ignazio and Zwaan, Rolf A.}, - year = {2018}, - journaltitle = {Nature Human Behaviour}, - shortjournal = {Nat Hum Behav}, - volume = {2}, - number = {3}, - pages = {168--171}, - issn = {2397-3374}, - doi = {10.1038/s41562-018-0311-x}, - langid = {english} -} + author = {Lakens, Daniel and Adolfi, Federico G. and Albers, Casper J. and Anvari, Farid and Apps, Matthew A. J. and Argamon, Shlomo E. and Baguley, Thom and Becker, Raymond B. and Benning, Stephen D. and Bradford, Daniel E. and Buchanan, Erin M. and Caldwell, Aaron R. and Van Calster, Ben and Carlsson, Rickard and Chen, Sau-Chin and Chung, Bryan and Colling, Lincoln J. and Collins, Gary S. and Crook, Zander and Cross, Emily S. and Daniels, Sameera and Danielsson, Henrik and DeBruine, Lisa and Dunleavy, Daniel J. and Earp, Brian D. and Feist, Michele I. and Ferrell, Jason D. and Field, James G. and Fox, Nicholas W. and Friesen, Amanda and Gomes, Caio and Gonzalez-Marquez, Monica and Grange, James A. and Grieve, Andrew P. and Guggenberger, Robert and Grist, James and Van Harmelen, Anne-Laura and Hasselman, Fred and Hochard, Kevin D. and Hoffarth, Mark R. and Holmes, Nicholas P. and Ingre, Michael and Isager, Peder M. and Isotalus, Hanna K. and Johansson, Christer and Juszczyk, Konrad and Kenny, David A. and Khalil, Ahmed A. and Konat, Barbara and Lao, Junpeng and Larsen, Erik Gahner and Lodder, Gerine M. A. 
and Lukavsk{\'y}, Ji{\v r}{\'\i} and Madan, Christopher R. and Manheim, David and Martin, Stephen R. and Martin, Andrea E. and Mayo, Deborah G. and McCarthy, Randy J. and McConway, Kevin and McFarland, Colin and Nio, Amanda Q. X. and Nilsonne, Gustav and De Oliveira, Cilene Lino and De Xivry, Jean-Jacques Orban and Parsons, Sam and Pfuhl, Gerit and Quinn, Kimberly A. and Sakon, John J. and Saribay, S. Adil and Schneider, Iris K. and Selvaraju, Manojkumar and Sjoerds, Zsuzsika and Smith, Samuel G. and Smits, Tim and Spies, Jeffrey R. and Sreekumar, Vishnu and Steltenpohl, Crystal N. and Stenhouse, Neil and {\'S}wi{\k a}tkowski, Wojciech and Vadillo, Miguel A. and Van Assen, Marcel A. L. M. and Williams, Matt N. and Williams, Samantha E. and Williams, Donald R. and Yarkoni, Tal and Ziano, Ignazio and Zwaan, Rolf A.}, + doi = {10.1038/s41562-018-0311-x}, + issn = {2397-3374}, + journaltitle = {Nature Human Behaviour}, + langid = {english}, + number = {3}, + pages = {168--171}, + shortjournal = {Nat Hum Behav}, + title = {Justify Your Alpha}, + volume = {2}, + year = {2018}, + bdsk-url-1 = {https://doi.org/10.1038/s41562-018-0311-x}} @article{cameronPractitionerGuideClusterRobust2015, - title = {A Practitioner's Guide to Cluster-Robust Inference}, - author = {Cameron, A Colin and Miller, Douglas L}, - year = {2015}, - journal = {Journal of Human Resources}, - volume = {50}, - number = {2}, - pages = {317--372}, - doi = {10.3368/jhr.50.2.317} -} + author = {Cameron, A Colin and Miller, Douglas L}, + doi = {10.3368/jhr.50.2.317}, + journal = {Journal of Human Resources}, + number = {2}, + pages = {317--372}, + title = {A Practitioner's Guide to Cluster-Robust Inference}, + volume = {50}, + year = {2015}, + bdsk-url-1 = {https://doi.org/10.3368/jhr.50.2.317}} @article{Satterthwaite1946approximate, - title = {An Approximate Distribution of Estimates of Variance Components}, - author = {Satterthwaite, F. 
E.}, - year = {1946}, - month = dec, - journal = {Biometrics Bulletin}, - volume = {2}, - number = {6}, - eprint = {10.2307/3002019}, - eprinttype = {jstor}, - pages = {110}, - issn = {00994987}, - doi = {10.2307/3002019}, - urldate = {2025-09-19} -} - -@Manual{robustbase, - title = {robustbase: Basic Robust Statistics}, - author = {Martin Maechler and Peter Rousseeuw and Christophe Croux - and Valentin Todorov and Andreas Ruckstuhl and Matias - Salibian-Barrera and Tobias Verbeke and Manuel Koller and Eduardo - L. T. Conceicao and Maria {Anna di Palma}}, - year = {2024}, - note = {R package version 0.99-4-1}, - url = {http://robustbase.r-forge.r-project.org/}, - url = {http://robustbase.r-forge.r-project.org/}, - } - + author = {Satterthwaite, F. E.}, + doi = {10.2307/3002019}, + eprint = {10.2307/3002019}, + eprinttype = {jstor}, + issn = {00994987}, + journal = {Biometrics Bulletin}, + month = dec, + number = {6}, + pages = {110}, + title = {An Approximate Distribution of Estimates of Variance Components}, + urldate = {2025-09-19}, + volume = {2}, + year = {1946}, + bdsk-url-1 = {https://doi.org/10.2307/3002019}} + +@manual{robustbase, + author = {Martin Maechler and Peter Rousseeuw and Christophe Croux and Valentin Todorov and Andreas Ruckstuhl and Matias Salibian-Barrera and Tobias Verbeke and Manuel Koller and Eduardo L. T. Conceicao and Maria {Anna di Palma}}, + note = {R package version 0.99-4-1}, + title = {robustbase: Basic Robust Statistics}, + url = {http://robustbase.r-forge.r-project.org/}, + year = {2024}, + bdsk-url-1 = {http://robustbase.r-forge.r-project.org/}} + @book{Maronna2006robust, - title = {Robust Statistics: Theory and Methods}, - shorttitle = {Robust Statistics}, - author = {Maronna, Ricardo A. and Martin, R. Douglas and Yohai, V{\'i}ctor J.}, - year = {2006}, - series = {Wiley Series in Probability and Statistics}, - publisher = {J. 
Wiley}, - address = {Chichester (GB)}, - isbn = {978-0-470-01092-1}, - langid = {english}, - lccn = {519.5} -} + address = {Chichester (GB)}, + author = {Maronna, Ricardo A. and Martin, R. Douglas and Yohai, V{\'\i}ctor J.}, + isbn = {978-0-470-01092-1}, + langid = {english}, + lccn = {519.5}, + publisher = {J. Wiley}, + series = {Wiley Series in Probability and Statistics}, + shorttitle = {Robust Statistics}, + title = {Robust Statistics: Theory and Methods}, + year = {2006}} @book{Wilcox2022introduction, - title = {Introduction to Robust Estimation and Hypothesis Testing}, - author = {Wilcox, Rand R.}, - year = {2022}, - edition = {Fifth edition}, - publisher = {Academic Press, an imprint of Elsevier}, - address = {London, United Kingdom San Diego, United States Cambridge, MA Oxford, United Kingdom}, - isbn = {978-0-12-820099-5 978-0-12-820098-8}, - langid = {english} -} - + address = {London, United Kingdom San Diego, United States Cambridge, MA Oxford, United Kingdom}, + author = {Wilcox, Rand R.}, + edition = {Fifth edition}, + isbn = {978-0-12-820099-5 978-0-12-820098-8}, + langid = {english}, + publisher = {Academic Press, an imprint of Elsevier}, + title = {Introduction to Robust Estimation and Hypothesis Testing}, + year = {2022}} @article{Rousseeuw1993alternatives, - title = {Alternatives to the {{Median Absolute Deviation}}}, - author = {Rousseeuw, Peter J. and Croux, Christophe}, - year = {1993}, - month = dec, - journal = {Journal of the American Statistical Association}, - volume = {88}, - number = {424}, - pages = {1273--1283}, - issn = {0162-1459, 1537-274X}, - doi = {10.1080/01621459.1993.10476408}, - urldate = {2025-09-02}, - langid = {english} -} - + author = {Rousseeuw, Peter J. 
and Croux, Christophe}, + doi = {10.1080/01621459.1993.10476408}, + issn = {0162-1459, 1537-274X}, + journal = {Journal of the American Statistical Association}, + langid = {english}, + month = dec, + number = {424}, + pages = {1273--1283}, + title = {Alternatives to the {{Median Absolute Deviation}}}, + urldate = {2025-09-02}, + volume = {88}, + year = {1993}, + bdsk-url-1 = {https://doi.org/10.1080/01621459.1993.10476408}} @article{antonakis2021ignoring, author = {Antonakis, John and Bastardoz, Nicolas and R{\"o}nkk{\"o}, Mikko}, @@ -180,6 +175,33 @@ @article{lee2023comparing title = {Comparing random effects models, ordinary least squares, or fixed effects with cluster robust standard errors for cross-classified data.}, year = {2023}} +@book{Hettmansperger2010robust, + abstract = {Presenting an extensive set of tools and methods for data analysis, Robust Nonparametric Statistical Methods, Second Edition covers univariate tests and estimates with extensions to linear models, multivariate models, time series models, experimental designs, and mixed models. It follows the approach of the first edition by developing rank-based methods from the unifying theme of geometry. This edition, however, includes more models and methods and significantly extends the possible analyses based on ranks. New to the Second Edition: a new section on rank procedures for nonlinear models; a new cha}, + author = {Hettmansperger, Thomas P. and McKean, Joseph W.}, + date = {2010}, + edition = {2nd ed}, + isbn = {978-1-4398-0909-9 978-1-4398-0908-2}, + langid = {english}, + location = {Hoboken}, + pagetotal = {540}, + publisher = {{Taylor and Francis}}, + series = {Chapman \& {{Hall}} / {{CRC Monographs}} on {{Statistics}} \& {{Applied Probability}}}, + title = {Robust Nonparametric Statistical Methods}} + +@article{McKean1984comparison, + abstract = {Various methods for ``Studentizing'' the sample median are compared on the basis of a Monte Carlo study.
Several of the methods do rather poorly while two, the bootstrap and the standardized length of a distribution free confidence interval, behave acceptably across a wide range of sample sizes and several distributions of varying tail length. These two methods seem to agree closely with the distribution free confidence intervals and moreover, unlike these intervals, the methods can be extended to a method of accurate inference for l1 regression.}, + author = {McKean, Joseph W. and Schrader, Ronald M.}, + doi = {10.1080/03610918408812413}, + journaltitle = {Communications in Statistics - Simulation and Computation}, + keywords = {Bootstrap,boxplot,distribution free,Monte Carlo swindle,sample median,Studentizing}, + number = {6}, + pages = {751--773}, + publisher = {Taylor \& Francis}, + title = {A Comparison of Methods for Studentizing the Sample Median}, + volume = {13}, + year = {1984}, + bdsk-url-1 = {https://doi.org/10.1080/03610918408812413}} + @article{pustejovsky2015four, author = {Pustejovsky, James E and Swan, Daniel M}, date-added = {2025-05-22 14:44:03 -0400}, @@ -245,7 +267,7 @@ @article{boos2015Assessing title = {Assessing Variability of Complex Descriptive Statistics in {{Monte Carlo}} Studies Using Resampling Methods}, volume = {83}, year = {2015}, -} + bdsk-url-1 = {https://doi.org/10.1111/insr.12087}} @article{boulesteix2020Replication, author = {Boulesteix, Anne-Laure and Hoffmann, Sabine and Charlton, Alethea and Seibold, Heidi}, @@ -256,7 +278,7 @@ @article{boulesteix2020Replication title = {A Replication Crisis in Methodological Research?}, volume = {17}, year = {2020}, -} + bdsk-url-1 = {https://doi.org/10.1111/1740-9713.01444}} @article{boulesteix2013Plea, author = {Boulesteix, Anne-Laure and Lauer, Sabine and Eugster, Manuel J.
A.}, @@ -267,7 +289,7 @@ @article{boulesteix2013Plea title = {A Plea for Neutral Comparison Studies in Computational Sciences}, volume = {8}, year = {2013}, -} + bdsk-url-1 = {https://doi.org/10.1371/journal.pone.0061562}} @article{boulesteix2017evidencebased, abstract = {The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly ``evidence-based''. Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of ``evidence-based'' statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. 
We suggest that benchmark studies---a method of assessment of statistical methods using real-world datasets---might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.}, @@ -280,7 +302,7 @@ @article{boulesteix2017evidencebased title = {Towards Evidence-Based Computational Statistics: Lessons from Clinical Research on the Role and Design of Real-Data Benchmark Studies}, volume = {17}, year = {2017}, -} + bdsk-url-1 = {https://doi.org/10.1186/s12874-017-0417-2}} @book{borenstein2021introduction, address = {Chichester, UK}, @@ -301,7 +323,7 @@ @article{Cho2023bivariate title = {A Bivariate Zero-Inflated Negative Binomial Model and Its Applications to Biomedical Settings}, volume = {32}, year = {2023}, -} + bdsk-url-1 = {https://doi.org/10.1177/09622802231172028}} @article{gilbert2024multilevel, author = {Gilbert, Joshua and Miratrix, Luke}, @@ -433,7 +455,7 @@ @article{Bloom2016using title = {{Using Multisite Experiments to Study Cross-Site Variation in Treatment Effects: A Hybrid Approach With Fixed Intercepts and a Random Treatment Coefficient}}, volume = {10}, year = {2016}, -} + bdsk-url-1 = {https://doi.org/10.1080/19345747.2016.1264518}} @article{brown1974SmallSampleBehavior, author = {Brown, Morton B.
and Forsythe, Alan B.}, @@ -446,7 +469,7 @@ @article{brown1974SmallSampleBehavior title = {The {{Small Sample Behavior}} of {{Some Statistics Which Test}} the {{Equality}} of {{Several Means}}}, volume = {16}, year = {1974}, -} + bdsk-url-1 = {https://doi.org/10.1080/00401706.1974.10489158}} @article{james1951ComparisonSeveralGroups, author = {James, G. S.}, @@ -470,7 +493,7 @@ @article{welch1951ComparisonSeveralMean title = {On the Comparison of Several Mean Values: {{An}} Alternative Approach}, volume = {38}, year = {1951}, -} + bdsk-url-1 = {https://doi.org/10.2307/2332579}} @article{mehrotra1997ImprovingBrownforsytheSolution, abstract = {Over two decades ago, Brown and Forsythe (B-F) (1974) proposed an innovative solution to the problem of comparing independent normal means under heteroscedasticity. Since then, their testing procedure has gained in popularity and authors have published various articles in which the B-F test has formed the basis of their research. The purpose of this paper is to point out, and correct, a flaw in the B-F testing procedure. Specifically, it is shown that the approximation proposed by B-F for the null distribution of their test statistic is inadequate. An improved approximation is provided and the small sample null properties of the modified B-F test are studied via simulation. The empirical findings support the theoretical result that the modified B-F test does a better job of preserving the test size compared to the original B-F test.}, @@ -485,7 +508,7 @@ @article{mehrotra1997ImprovingBrownforsytheSolution title = {Improving the Brown-Forsythe Solution to the Generalized Behrens-Fisher Problem}, volume = {26}, year = {1997}, -} + bdsk-url-1 = {https://doi.org/10.1080/03610919708813431}} @article{Kern2014calibrated, abstract = {{Randomized experiments are considered the gold standard for causal inference because they can provide unbiased estimates of treatment effects for the experimental participants. 
However, researchers and policymakers are often interested in using a specific experiment to inform decisions about other target populations. In education research, increasing attention is being paid to the potential lack of generalizability of randomized experiments because the experimental participants may be unrepresentative of the target population of interest. This article examines whether generalization may be assisted by statistical methods that adjust for observed differences between the experimental participants and members of a target population. The methods examined include approaches that reweight the experimental data so that participants more closely resemble the target population and methods that utilize models of the outcome. Two simulation studies and one empirical analysis investigate and compare the methods' performance. One simulation uses purely simulated data while the other utilizes data from an evaluation of a school-based dropout prevention program. Our simulations suggest that machine learning methods outperform regression-based methods when the required structural (ignorability) assumptions are satisfied. When these assumptions are violated, all of the methods examined perform poorly. Our empirical analysis uses data from a multisite experiment to assess how well results from a given site predict impacts in other sites. Using a variety of extrapolation methods, predicted effects for each site are compared to actual benchmarks. Flexible modeling approaches perform best, although linear regression is not far behind. 
Taken together, these results suggest that flexible modeling techniques can aid generalization while underscoring the fact that even state-of-the-art statistical techniques still rely on strong assumptions.}}, @@ -498,7 +521,7 @@ @article{Kern2014calibrated title = {{Assessing Methods for Generalizing Experimental Impact Estimates to Target Populations}}, volume = {9}, year = {2014}, -} + bdsk-url-1 = {https://doi.org/10.1080/19345747.2015.1060282}} @article{White1980heteroskedasticity, author = {White, Halbert}, @@ -520,7 +543,7 @@ @article{dong2013PowerUpToolCalculating title = {{{{\emph{PowerUp}}}}{\emph{!}} : {{A Tool}} for {{Calculating Minimum Detectable Effect Sizes}} and {{Minimum Required Sample Sizes}} for {{Experimental}} and {{Quasi-Experimental Design Studies}}}, volume = {6}, year = {2013}, -} + bdsk-url-1 = {https://doi.org/10.1080/19345747.2012.673143}} @article{tipton2014stratified, abstract = { Background:An important question in the design of experiments is how to ensure that the findings from the experiment are generalizable to a larger population. This concern with generalizability is particularly important when treatment effects are heterogeneous and when selecting units into the experiment using random sampling is not possible---two conditions commonly met in large-scale educational experiments.Method:This article introduces a model-based balanced-sampling framework for improving generalizations, with a focus on developing methods that are robust to model misspecification. Additionally, the article provides a new method for sample selection within this framework: First units in an inference population are divided into relatively homogenous strata using cluster analysis, and then the sample is selected using distance rankings.Result:In order to demonstrate and evaluate the method, a reanalysis of a completed experiment is conducted. This example compares samples selected using the new method with the actual sample used in the experiment. 
Results indicate that even under high nonresponse, balance is better on most covariates and that fewer coverage errors result.Conclusion:The article concludes with a discussion of additional benefits and limitations of the method. }, @@ -532,7 +555,7 @@ @article{tipton2014stratified title = {Stratified Sampling Using Cluster Analysis: A Sample Selection Strategy for Improved Generalizations From Experiments}, volume = {37}, year = {2013}, -} + bdsk-url-1 = {https://doi.org/10.1177/0193841X13516324}} @article{faul2009StatisticalPowerAnalyses, author = {Faul, Franz and Erdfelder, Edgar and Buchner, Axel and Lang, Albert-Georg}, @@ -545,7 +568,7 @@ @article{faul2009StatisticalPowerAnalyses title = {Statistical Power Analyses Using {{G}}*{{Power}} 3.1: {{Tests}} for Correlation and Regression Analyses}, volume = {41}, year = {2009}, -} + bdsk-url-1 = {https://doi.org/10.3758/BRM.41.4.1149}} @article{longUsingHeteroscedasticityConsistent2000, abstract = {In the presence of heteroscedasticity, ordinary least squares (OLS) estimates are unbiased, but the usual tests of significance are generally inappropriate and their use can lead to incorrect inferences. Tests based on a heteroscedasticity consistent covariance matrix (HCCM), however, are consistent even in the presence of heteroscedasticity of an unknown form. Most applications that use a HCCM appear to rely on the asymptotic version known as HC0. Our Monte Carlo simulations show that HC0 often results in incorrect inferences when N {$\leq$} 250, while three relatively unknown, small sample versions of the HCCM, and especially a version known as HC3, work well even for N's as small as 25. We recommend that: (1) data analysts should correct for heteroscedasticity using a HCCM whenever there is reason to suspect heteroscedasticity; (2) the decision to use HCCM-based tests should not be determined by a screening test for heteroscedasticity; and (3) when N {$\leq$} 250, the HCCM known as HC3 should be used. 
Since HC3 is simple to compute, we encourage authors of statistical software to add this estimator to their programs.}, @@ -558,7 +581,7 @@ @article{longUsingHeteroscedasticityConsistent2000 title = {Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model}, volume = {54}, year = {2000}, -} + bdsk-url-1 = {https://doi.org/10.1080/00031305.2000.10474549}} @book{GerberGreen, author = {Gerber, Alan S and Green, Donald P}, @@ -617,7 +640,7 @@ @book{xie2015 title = {Dynamic Documents with {R} and knitr}, url = {http://yihui.name/knitr/}, year = {2015}, -} + bdsk-url-1 = {http://yihui.name/knitr/}} @article{alfons2010ObjectOrientedFrameworkStatistical, author = {Alfons, Andreas and Templ, Matthias and Filzmoser, Peter}, @@ -655,7 +678,7 @@ @article{blair2019DeclaringDiagnosingResearch urldate = {2024-01-01}, volume = {113}, year = {2019}, -} + bdsk-url-1 = {https://doi.org/10.1017/S0003055419000194}} @book{blair2023ResearchDesignSocial, address = {Princeton}, @@ -695,7 +718,7 @@ @article{boulesteix2020IntroductionStatisticalSimulations title = {Introduction to Statistical Simulations in Health Research}, volume = {10}, year = {2020}, -} + bdsk-url-1 = {https://doi.org/10.1136/bmjopen-2020-039921}} @misc{brown2023SimprFlexibleTidyverse, author = {Brown, Ethan}, @@ -721,7 +744,7 @@ @article{chalmers2020WritingEffectiveReliable title = {Writing {{Effective}} and {{Reliable Monte Carlo Simulations}} with the {{SimDesign Package}}}, volume = {16}, year = {2020}, -} + bdsk-url-1 = {https://doi.org/10.20982/tqmp.16.4.p248}} @book{chang2010MonteCarloSimulation, abstract = {Helping you become a creative, logical thinker and skillful "simulator," Monte Carlo Simulation for the Pharmaceutical Industry: Concepts, Algorithms, and Case Studies provides broad coverage of the entire drug development process, from drug discovery to preclinical and clinical trial aspects to commercialization. 
It presents the theories and metho}, @@ -788,7 +811,7 @@ @article{feiveson2002PowerSimulation title = {Power by {{Simulation}}}, volume = {2}, year = {2002}, -} + bdsk-url-1 = {https://doi.org/10.1177/1536867X0200200201}} @book{gamma1995DesignPatternsElements, abstract = {A book review of Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides is presented.}, @@ -823,7 +846,7 @@ @book{gelman2013BayesianDataAnalysis publisher = {{Chapman and Hall/CRC}}, title = {Bayesian {{Data Analysis}}}, year = {2013}, -} + bdsk-url-1 = {https://doi.org/10.1201/b16018}} @article{gelman2014StatisticalCrisisScience, abstract = {{$<$}em{$>$}Gale{$<$}/em{$>$} Academic OneFile includes The statistical crisis in science: data-dependent analy by Andrew Gelman and Eric Loken. Click to explore.}, @@ -863,7 +886,7 @@ @article{green2016SIMRPackagePower title = {{{SIMR}}: An {{R}} Package for Power Analysis of Generalized Linear Mixed Models by Simulation}, volume = {7}, year = {2016}, -} + bdsk-url-1 = {https://doi.org/10.1111/2041-210X.12504}} @article{hardwicke2023ReducingBiasIncreasing, abstract = {Flexibility in the design, analysis and interpretation of scientific studies creates a multiplicity of possible research outcomes. Scientists are granted considerable latitude to selectively use and report the hypotheses, variables and analyses that create the most positive, coherent and attractive story while suppressing those that are negative or inconvenient. This creates a risk of bias that can lead to scientists fooling themselves and fooling others. Preregistration involves declaring a research plan (for example, hypotheses, design and statistical analyses) in a public registry before the research outcomes are known. 
Preregistration (1) reduces the risk of bias by encouraging outcome-independent decision-making and (2) increases transparency, enabling others to assess the risk of bias and calibrate their confidence in research outcomes. In this Perspective, we briefly review the historical evolution of preregistration in medicine, psychology and other domains, clarify its pragmatic functions, discuss relevant meta-research, and provide recommendations for scientists and journal editors.}, @@ -879,7 +902,7 @@ @article{hardwicke2023ReducingBiasIncreasing title = {Reducing Bias, Increasing Transparency and Calibrating Confidence with Preregistration}, volume = {7}, year = {2023}, -} + bdsk-url-1 = {https://doi.org/10.1038/s41562-022-01497-2}} @article{harwell2018SurveyReportingPractices, abstract = {Computer simulation studies represent an important tool for investigating processes difficult or impossible to study using mathematical theory or real data. Hoaglin and Andrews recommended these studies be treated as statistical sampling experiments subject to established principles of design and data analysis, but the survey of Hauck and Anderson suggested these recommendations had, at that point in time, generally been ignored. We update the survey results of Hauck and Anderson using a sample of studies applying simulation methods in statistical research to assess the extent to which the recommendations of Hoaglin and Andrews and others for conducting simulation studies have been adopted. The important role of statistical applications of computer simulation studies in enhancing the reproducibility of scientific findings is also discussed. 
The results speak to the state of the art and the extent to which these studies are realizing their potential to inform statistical practice and a program of statistical research.}, @@ -894,11 +917,11 @@ @article{harwell2018SurveyReportingPractices title = {A {{Survey}} of {{Reporting Practices}} of {{Computer Simulation Studies}} in {{Statistical Research}}}, volume = {72}, year = {2018}, -} + bdsk-url-1 = {https://doi.org/10.1080/00031305.2017.1342692}} @article{hoogland1998RobustnessStudiesCovariance, abstract = {In covariance structure modeling, several estimation methods are available. The robustness of an estimator against specific violations of assumptions can be determined empirically by means of a Monte Carlo study. Many such studies in covariance structure analysis have been published, but the conclusions frequently seem to contradict each other. An overview of robustness studies in covariance structure analysis is given, and an attempt is made to generalize findings. Robustness studies are described and distinguished from each other systematically by means of certain characteristics. These characteristics serve as explanatory variables in a meta-analysis concerning the behavior of parameter estimators, standard error estimators, and goodness-of-fit statistics when the model is correctly specified.}, - author = {HOOGLAND, JEFFREY J. and BOOMSMA, {\relax ANNE}}, + author = {Hoogland, Jeffrey J. 
and Boomsma, Anne}, doi = {10.1177/0049124198026003003}, journal = {Sociological Methods \& Research}, langid = {english}, @@ -909,7 +932,7 @@ @article{hoogland1998RobustnessStudiesCovariance title = {Robustness {{Studies}} in {{Covariance Structure Modeling}}: {{An Overview}} and a {{Meta-Analysis}}}, volume = {26}, year = {1998}, -} + bdsk-url-1 = {https://doi.org/10.1177/0049124198026003003}} @article{huang2016GeneralizedEstimatingEquations, abstract = {Background/aims: Generalized estimating equations are a common modeling approach used in cluster randomized trials to account for within-cluster correlation. It is well known that the sandwich variance estimator is biased when the number of clusters is small ({$\leq$}40), resulting in an inflated type I error rate. Various bias correction methods have been proposed in the statistical literature, but how adequately they are utilized in current practice for cluster randomized trials is not clear. The aim of this study is to evaluate the use of generalized estimating equation bias correction methods in recently published cluster randomized trials and demonstrate the necessity of such methods when the number of clusters is small. Methods: Review of cluster randomized trials published between August 2013 and July 2014 and using generalized estimating equations for their primary analyses. Two independent reviewers collected data from each study using a standardized, pre-piloted data extraction template. A two-arm cluster randomized trial was simulated under various scenarios to show the potential effect of a small number of clusters on type I error rate when estimating the treatment effect. The nominal level was set at 0.05 for the simulation study. Results: Of the 51 included trials, 28 (54.9\%) analyzed 40 or fewer clusters with a minimum of four total clusters. Of these 28 trials, only one trial used a bias correction method for generalized estimating equations.
The simulation study showed that with four clusters, the type I error rate ranged between 0.43 and 0.47. Even though type I error rate moved closer to the nominal level as the number of clusters increases, it still ranged between 0.06 and 0.07 with 40 clusters. Conclusions: Our results showed that statistical issues arising from small number of clusters in generalized estimating equations is currently inadequately handled in cluster randomized trials. Potential for type I error inflation could be very high when the sandwich estimator is used without bias correction.}, @@ -924,7 +947,7 @@ @article{huang2016GeneralizedEstimatingEquations title = {Generalized Estimating Equations in Cluster Randomized Trials with a Small Number of Clusters: {{Review}} of Practice and Simulation Study}, volume = {13}, year = {2016}, -} + bdsk-url-1 = {https://doi.org/10.1177/1740774516643498}} @article{hussey2007DesignAnalysisStepped, abstract = {Cluster randomized trials (CRT) are often used to evaluate therapies or interventions in situations where individual randomization is not possible or not desirable for logistic, financial or ethical reasons. While a significant and rapidly growing body of literature exists on CRTs utilizing a ``parallel'' design (i.e. I clusters randomized to each treatment), only a few examples of CRTs using crossover designs have been described. 
In this article we discuss the design and analysis of a particular type of crossover CRT -- the stepped wedge -- and provide an example of its use.}, @@ -938,7 +961,7 @@ @article{hussey2007DesignAnalysisStepped title = {Design and Analysis of Stepped Wedge Cluster Randomized Trials}, volume = {28}, year = {2007}, -} + bdsk-url-1 = {https://doi.org/10.1016/j.cct.2006.05.007}} @book{jones2012IntroductionScientificProgramming, address = {New York}, @@ -947,7 +970,7 @@ @book{jones2012IntroductionScientificProgramming publisher = {{Chapman and Hall/CRC}}, title = {Introduction to {{Scientific Programming}} and {{Simulation Using R}}}, year = {2012}, -} + bdsk-url-1 = {https://doi.org/10.1201/9781420068740}} @misc{joshi2022SimhelpersHelperFunctions, author = {Joshi, Megha and Pustejovsky, James E.}, @@ -985,7 +1008,7 @@ @article{kern2016AssessingMethodsGeneralizing title = {Assessing {{Methods}} for {{Generalizing Experimental Impact Estimates}} to {{Target Populations}}}, volume = {9}, year = {2016}, -} + bdsk-url-1 = {https://doi.org/10.1080/19345747.2015.1060282}} @article{koehler2009AssessmentMonteCarlo, abstract = {Statistical experiments, more commonly referred to as Monte Carlo or simulation studies, are used to study the behavior of statistical methods and measures under controlled situations. Whereas recent computing and methodological advances have permitted increased efficiency in the simulation process, known as variance reduction, such experiments remain limited by their finite nature and hence are subject to uncertainty; when a simulation is run more than once, different results are obtained. However, virtually no emphasis has been placed on reporting the uncertainty, referred to here as Monte Carlo error, associated with simulation results in the published literature, or on justifying the number of replications used. These deserve broader consideration. 
Here we present a series of simple and practical methods for estimating Monte Carlo error as well as determining the number of replications required to achieve a desired level of accuracy. The issues and methods are demonstrated with two simple examples, one evaluating operating characteristics of the maximum likelihood estimator for the parameters in logistic regression and the other in the context of using the bootstrap to obtain 95\% confidence intervals. The results suggest that in many settings, Monte Carlo error may be more substantial than traditionally thought.}, @@ -1000,7 +1023,7 @@ @article{koehler2009AssessmentMonteCarlo title = {On the {{Assessment}} of {{Monte Carlo Error}} in {{Simulation-Based Statistical Analyses}}}, volume = {63}, year = {2009}, -} + bdsk-url-1 = {https://doi.org/10.1198/tast.2009.0030}} @misc{leschinski2019MonteCarloAutomaticParallelized, author = {Leschinski, Christian Hendrik}, @@ -1021,7 +1044,7 @@ @article{leyrat2013PropensityScoresUsed title = {Propensity Scores Used for Analysis of Cluster Randomized Trials with Selection Bias: A Simulation Study}, volume = {32}, year = {2013}, -} + bdsk-url-1 = {https://doi.org/10.1002/sim.5795}} @article{lohmann2022ItTimeTen, abstract = {The quantitative analysis of research data is a core element of empirical research. The performance of statistical methods that are used for analyzing empirical data can be evaluated and compared using computer simulations. A single simulation study can influence the analyses of thousands of empirical studies to follow. With great power comes great responsibility. Here, we argue that this responsibility includes replication of simulation studies to ensure a sound foundation for data analytical decisions. Furthermore, being designed, run, and reported by humans, simulation studies face challenges similar to other experimental empirical research and hence should not be exempt from replication attempts. 
We highlight that the potential replicability of simulation studies is an opportunity quantitative methodology as a field should pay more attention to.}, @@ -1042,7 +1065,7 @@ @article{miratrix2021applied title = {An {{Applied Researcher}}'s {{Guide}} to {{Estimating Effects}} from {{Multisite Individually Randomized Trials}}: {{Estimands}}, {{Estimators}}, and {{Estimates}}}, volume = {14}, year = {2021}, -} + bdsk-url-1 = {https://doi.org/10.1080/19345747.2020.1831115}} @book{miratrix2023DesigningMonteCarlo, author = {Miratrix, Luke W. and Pustejovsky, Jame E.}, @@ -1064,7 +1087,7 @@ @article{moerbeek2019WhatAreStatistical urldate = {2024-01-05}, volume = {38}, year = {2019}, -} + bdsk-url-1 = {https://doi.org/10.1002/sim.8351}} @book{mooney1997MonteCarloSimulation, author = {Mooney, Christopher Z}, @@ -1084,7 +1107,7 @@ @article{morris2019UsingSimulationStudies title = {Using Simulation Studies to Evaluate Statistical Methods}, urldate = {2019-01-26}, year = {2019}, -} + bdsk-url-1 = {https://doi.org/10.1002/sim.8086}} @misc{nguyen2022MpowerPackagePower, abstract = {Estimating sample size and statistical power is an essential part of a good study design. This R package allows users to conduct power analysis based on Monte Carlo simulations in settings in which consideration of the correlations between predictors is important. It runs power analyses given a data generative model and an inference model. It can set up a data generative model that preserves dependence structures among variables given existing data (continuous, binary, or ordinal) or high-level descriptions of the associations. Users can generate power curves to assess the trade-offs between sample size, effect size, and power of a design. This paper presents tutorials and examples focusing on applications for environmental mixture studies when predictors tend to be moderately to highly correlated. 
It easily interfaces with several existing and newly developed analysis strategies for assessing associations between exposures and health outcomes. However, the package is sufficiently general to facilitate power simulations in a wide variety of settings.}, @@ -1123,7 +1146,7 @@ @article{paxton2001MonteCarloExperiments title = {Monte {{Carlo Experiments}}: {{Design}} and {{Implementation}}}, volume = {8}, year = {2001}, -} + bdsk-url-1 = {https://doi.org/10.1207/S15328007SEM0802_7}} @book{robert2010IntroducingMonteCarlo, address = {New York, NY}, @@ -1135,7 +1158,7 @@ @book{robert2010IntroducingMonteCarlo publisher = {Springer}, title = {Introducing {{Monte Carlo Methods}} with {{R}}}, year = {2010}, -} + bdsk-url-1 = {https://doi.org/10.1007/978-1-4419-1576-4}} @misc{scheer2020SimToolConductSimulation, author = {Scheer, Marcel}, @@ -1153,7 +1176,7 @@ @article{siepe2024SimulationStudiesMethodological month = jan, title = {Simulation Studies for Methodological Research in Psychology: A Standardized Template for Planning, Preregistration, and Reporting}, year = {2024}, -} + bdsk-url-1 = {https://doi.org/10.31234/osf.io/ufgy6}} @article{sigal2016PlayItAgain, abstract = {Monte Carlo simulations (MCSs) provide important information about statistical phenomena that would be impossible to assess otherwise. This article introduces MCS methods and their applications to research and statistical pedagogy using a novel software package for the R Project for Statistical Computing constructed to lessen the often steep learning curve when organizing simulation code. A primary goal of this article is to demonstrate how well-suited MCS designs are to classroom demonstrations, and how they provide a hands-on method for students to become acquainted with complex statistical concepts. 
In this article, essential programming aspects for writing MCS code in R are overviewed, multiple applied examples with relevant code are provided, and the benefits of using a generate--analyze--summarize coding structure over the typical ``for-loop'' strategy are discussed.}, @@ -1168,7 +1191,7 @@ @article{sigal2016PlayItAgain title = {Play {{It Again}}: {{Teaching Statistics With Monte Carlo Simulation}}}, volume = {24}, year = {2016}, -} + bdsk-url-1 = {https://doi.org/10.1080/10691898.2016.1246953}} @article{skrondal2000DesignAnalysisMonte, abstract = {The design and analysis of Monte Carlo experiments, with special reference to structural equation modelling, is discussed in this article. These topics merit consideration, since the validity of the conclusions drawn from a Monte Carlo study clearly hinges on these features. It is argued that comprehensive Monte Carlo experiments can be implemented on a PC if the experiments are adequately designed. This is especially important when investigating modern computer intensive methodologies like resampling and Markov Chain Monte Carlo methods. We are faced with three fundamental challenges in Monte Carlo experimentation. The first problem is statistical precision, which concerns the reliability of the obtained results. External validity, on the other hand, depends on the number of experimental conditions, and is crucial for the prospects of generalising the results beyond the specific experiment. Finally, we face the constraint on available computer resources. The conventional wisdom in designing and analysing Monte Carlo experiments embodies no explicit specification of meta-model for analysing the output of the experiment, the use of case studies or full factorial designs as experimental plans, no use of variance reduction techniques, a large number of replications, and "eyeballing" of the results. A critical examination of the conventional wisdom is presented in this article. 
We suggest that the following alternative procedures should be considered. First of all, we argue that it is profitable to specify explicit meta-models, relating the chosen performance statistics and experimental conditions. Regarding the experimental plan, we recommend the use of incomplete designs, which will often result in considerable savings. We also consider the use of common random numbers in the simulation phase, since this may enhance the precision in estimating meta-models. The use of fewer replications per trial, enabling us to investigate an increased number of experimental conditions, should also be considered in order to improve the external validity at the cost of the conventionally excessive precision.}, @@ -1182,7 +1205,7 @@ @article{skrondal2000DesignAnalysisMonte title = {Design and {{Analysis}} of {{Monte Carlo Experiments}}: {{Attacking}} the {{Conventional Wisdom}}}, volume = {35}, year = {2000}, -} + bdsk-url-1 = {https://doi.org/10.1207/S15327906MBR3502_1}} @article{smith1973MonteCarloMethods, author = {Smith, Vincent Kerry}, @@ -1202,7 +1225,7 @@ @article{sofrygin2017SimcausalPackageConducting title = {Simcausal {{R Package}}: {{Conducting Transparent}} and {{Reproducible Simulation Studies}} of {{Causal Effect Estimation}} with {{Complex Longitudinal Data}}}, volume = {81}, year = {2017}, -} + bdsk-url-1 = {https://doi.org/10.18637/jss.v081.i02}} @article{vevea1995general, author = {Vevea, Jack L and Hedges, Larry V}, @@ -1214,7 +1237,7 @@ @article{vevea1995general title = {A general linear model for estimating effect size in the presence of publication bias}, volume = {60}, year = {1995}, -} + bdsk-url-1 = {https://doi.org/10.1007/BF02294384}} @article{white2023HowCheckSimulation, abstract = {Simulation studies are powerful tools in epidemiology and biostatistics, but they can be hard to conduct successfully. Sometimes unexpected results are obtained. 
We offer advice on how to check a simulation study when this occurs, and how to design and conduct the study to give results that are easier to check. Simulation studies should be designed to include some settings in which answers are already known. They should be coded in stages, with data-generating mechanisms checked before simulated data are analysed. Results should be explored carefully, with scatterplots of standard error estimates against point estimates surprisingly powerful tools. Failed estimation and outlying estimates should be identified and dealt with by changing data-generating mechanisms or coding realistic hybrid analysis procedures. Finally, we give a series of ideas that have been useful to us in the past for checking unexpected results. Following our advice may help to prevent errors and to improve the quality of published simulation studies.}, @@ -1226,4 +1249,4 @@ @article{white2023HowCheckSimulation pages = {dyad134}, title = {How to Check a Simulation Study}, year = {2023}, -} + bdsk-url-1 = {https://doi.org/10.1093/ije/dyad134}} diff --git a/case_study_code/analyze_cluster_RCT.R b/case_study_code/analyze_cluster_RCT.R index f080f6f..632365f 100644 --- a/case_study_code/analyze_cluster_RCT.R +++ b/case_study_code/analyze_cluster_RCT.R @@ -1,7 +1,6 @@ analysis_MLM <- function( dat ) { - M1 <- lme4::lmer( Yobs ~ 1 + Z + (1 | sid), data = dat ) - M1_test <- lmerTest::as_lmerModLmerTest(M1) + M1_test <- lmerTest::lmer( Yobs ~ 1 + Z + (1 | sid), data = dat ) M1_summary <- summary(M1_test)$coefficients tibble( @@ -9,6 +8,7 @@ analysis_MLM <- function( dat ) { SE_hat = M1_summary["Z","Std. 
Error"], p_value = M1_summary["Z", "Pr(>|t|)"] ) + } analysis_OLS <- function( dat, se_type = "CR2" ) { @@ -63,13 +63,7 @@ estimate_Tx_Fx <- function( } -lmer_with_test <- purrr::compose( - summary, - lmerTest::as_lmerModLmerTest, - lme4::lmer -) - -quiet_safe_lmer <- purrr::quietly(purrr::safely(lmer_with_test)) +quiet_safe_lmer <- quietly( possibly( lmerTest::lmer, otherwise=NULL ) ) analysis_MLM_safe <- function( dat, all_results = FALSE ) { @@ -79,49 +73,50 @@ analysis_MLM_safe <- function( dat, all_results = FALSE ) { return(M1) } - message <- ifelse( length( M1$message ) > 0, M1$message, NA_character_ ) - warning <- ifelse( length( M1$warning ) > 0, M1$warning, NA_character_ ) - error <- ifelse( length( M1$result$error) > 0, M1$result$error$message, NA_character_ ) + if ( is.null( M1$result ) ) { + # we had an error! + tibble( ATE_hat = NA, SE_hat = NA, p_value = NA, + message = M1$message, + warning = M1$warning, + error = TRUE ) + } else { + sum <- summary( M1$result ) + tibble( + ATE_hat = sum$coefficients["Z","Estimate"], + SE_hat = sum$coefficients["Z","Std. Error"], + p_value = sum$coefficients["Z", "Pr(>|t|)"], + message = list( M1$message ), + warning = list( M1$warning ), + error = FALSE ) + } - tibble( - ATE_hat = M1$result$result$coefficients["Z","Estimate"], - SE_hat = M1$result$result$coefficients["Z","Std. Error"], - p_value = M1$result$result$coefficients["Z", "Pr(>|t|)"], - message = message, - warning = warning, - error = error - ) } -analysis_MLM_contingent <- function( dat, all_results = FALSE ) { + + +analysis_MLM_contingent <- function( dat ) { M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), data=dat ) - if (all_results) { - return(M1) - } - - if (!is.null(M1$result$result)) { - # If lmer() returns a result - res <- tibble( - ATE_hat = M1$result$result$coefficients["Z","Estimate"], - SE_hat = M1$result$result$coefficients["Z","Std. 
Error"], - p_value = M1$result$result$coefficients["Z", "Pr(>|t|)"], - ) + if (!is.null(M1$result)) { + sum <- summary( M1$result ) + tibble( + ATE_hat = sum$coefficients["Z","Estimate"], + SE_hat = sum$coefficients["Z","Std. Error"], + p_value = sum$coefficients["Z", "Pr(>|t|)"] ) } else { # If lmer() errors, fall back on OLS M_ols <- summary(lm(Yobs ~ Z, data = dat)) res <- tibble( ATE_hat = M_ols$coefficients["Z","Estimate"], SE_hat = M_ols$coefficients["Z", "Std. Error"], - p_value = M_ols$coefficients["Z","Pr(>|t|)"] - ) + p_value = M_ols$coefficients["Z","Pr(>|t|)"] ) } # Store original messages, warnings, errors res$message <- ifelse( length( M1$message ) > 0, M1$message, NA_character_ ) res$warning <- ifelse( length( M1$warning ) > 0, M1$warning, NA_character_ ) - res$error <- ifelse( length( M1$result$error) > 0, M1$result$error$message, NA_character_ ) - + res$error <- is.null( M1$result ) + return(res) } diff --git a/index.Rmd b/index.Rmd index d322e6b..2489a24 100644 --- a/index.Rmd +++ b/index.Rmd @@ -4,8 +4,6 @@ author: "Luke W. Miratrix and James E. Pustejovsky\n(Equal authors)" date: "`r Sys.Date()`" site: bookdown::bookdown_site documentclass: book -bibliography: [book.bib, packages.bib] -biblio-style: apalike link-citations: yes github-repo: jepusto/Designing-Simulations-in-R always_allow_html: true diff --git a/libs.txt b/libs.txt new file mode 100644 index 0000000..0f4bd4c --- /dev/null +++ b/libs.txt @@ -0,0 +1,146 @@ +001-introduction.Rmd:library( tidyverse ) +001-introduction.Rmd:Increasingly, R can also be used to interface with other languages and platforms, such as running Python code via the [`reticulate`](https://rstudio.github.io/reticulate/) package, running Stan programs for Bayesian modeling via [`RStan`](https://mc-stan.org/users/interfaces/rstan), or calling the h2o machine learning library using the [`h2o` package](https://cran.r-project.org/package=h2o) [@fryda2014H2oInterfaceH2O]. 
+003-programming-preliminaries.Rmd:library( tidyverse ) +003-programming-preliminaries.Rmd:library( tidyverse ) +005-initial-t-test-simulation.Rmd:library( tidyverse ) +015-Case-study-ANOVA.Rmd:library(tidyverse) +015-Case-study-ANOVA.Rmd:library(kableExtra) +015-Case-study-ANOVA.Rmd:library(tibble) +015-Case-study-ANOVA.Rmd:library(simhelpers) +020-Data-generating-models.Rmd:library(tidyverse) +020-Data-generating-models.Rmd:library(ggridges) +030-Estimation-procedures.Rmd:library(tidyverse) +030-Estimation-procedures.Rmd:library(metafor) +035-running-simulation.Rmd:library(tidyverse) +035-running-simulation.Rmd:library(simhelpers) +040-Performance-criteria.Rmd:library(tidyverse) +040-Performance-criteria.Rmd:library(ggplot2) +040-Performance-criteria.Rmd:library( simhelpers ) +040-Performance-criteria.Rmd:library( simhelpers ) +040-Performance-criteria.Rmd:library(clubSandwich) +040-Performance-criteria.Rmd:library( simhelpers ) +070-experimental-design.Rmd:library( tidyverse ) +070-experimental-design.Rmd:library( purrr ) +070-experimental-design.Rmd:library(simhelpers) +070-experimental-design.Rmd: library(future) +070-experimental-design.Rmd: library(furrr) +070-experimental-design.Rmd: library(future) +070-experimental-design.Rmd: library(furrr) +070-experimental-design.Rmd:library( simhelpers ) +072-presentation-of-results.Rmd:library( tidyverse ) +072-presentation-of-results.Rmd:library(ggplot2) +072-presentation-of-results.Rmd:library(lsr) +074-building-good-vizualizations.Rmd:library( tidyverse ) +074-building-good-vizualizations.Rmd:library( purrr ) +074-building-good-vizualizations.Rmd:library( simhelpers ) +075-special-topics-on-reporting.Rmd:library( tidyverse ) +075-special-topics-on-reporting.Rmd:library( purrr ) +075-special-topics-on-reporting.Rmd:library( broom ) +075-special-topics-on-reporting.Rmd:library(lsr) +075-special-topics-on-reporting.Rmd:library(modelr) +075-special-topics-on-reporting.Rmd:library(glmnet) 
+075-special-topics-on-reporting.Rmd:library(lme4) +077-case-study-comparing-estimators.Rmd:library( tidyverse ) +105-file-management.Rmd:library( purrr ) +105-file-management.Rmd:Another good reason for this type of modular organizing is you can then allow for a whole simulation universe, writing a variety of data generators that together form a library of options. +105-file-management.Rmd: library(future) +105-file-management.Rmd: library(furrr) +120-parallel-processing.Rmd:library( tidyverse ) +120-parallel-processing.Rmd:library(future) +120-parallel-processing.Rmd:library(furrr) +120-parallel-processing.Rmd:library(future) +120-parallel-processing.Rmd:library(furrr) +120-parallel-processing.Rmd:library( tidyverse ) +130-debugging_and_testing.Rmd:library( tidyverse ) +130-debugging_and_testing.Rmd:library( testthat ) +130-debugging_and_testing.Rmd:library(testthat) +140-simulation-for-power-analysis.Rmd:library(tidyverse) +140-simulation-for-power-analysis.Rmd:library(future) +140-simulation-for-power-analysis.Rmd:library(furrr) +140-simulation-for-power-analysis.Rmd: library( future ) +140-simulation-for-power-analysis.Rmd: library( furrr ) +140-simulation-for-power-analysis.Rmd:library( mlmpower ) +140-simulation-for-power-analysis.Rmd:library( lme4 ) +140-simulation-for-power-analysis.Rmd:library( mlmpower ) +150-potential-outcomes-framework.Rmd:library(tidyverse) +160-parametric-bootstrap.Rmd:library( tidyverse ) +200-coding-tidbits.Rmd:library( tidyverse ) +200-coding-tidbits.Rmd:library( blkvar ) +200-coding-tidbits.Rmd:library( blkvar ) +200-coding-tidbits.Rmd:library( tictoc ) +200-coding-tidbits.Rmd:library( bench ) +200-coding-tidbits.Rmd:library( blkvar ) +Designing-Simulations-in-R.Rmd:library(tidyverse) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:Increasingly, R can also be used to interface with other languages and platforms, such as running Python code via the 
[`reticulate`](https://rstudio.github.io/reticulate/) package, running Stan programs for Bayesian modeling via [`RStan`](https://mc-stan.org/users/interfaces/rstan), or calling the h2o machine learning library using the [`h2o` package](https://cran.r-project.org/package=h2o) [@fryda2014H2oInterfaceH2O]. +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library(tidyverse) +Designing-Simulations-in-R.Rmd:library(kableExtra) +Designing-Simulations-in-R.Rmd:library(tibble) +Designing-Simulations-in-R.Rmd:library(simhelpers) +Designing-Simulations-in-R.Rmd:library(tidyverse) +Designing-Simulations-in-R.Rmd:library(ggridges) +Designing-Simulations-in-R.Rmd:library(tidyverse) +Designing-Simulations-in-R.Rmd:library(metafor) +Designing-Simulations-in-R.Rmd:library(tidyverse) +Designing-Simulations-in-R.Rmd:library(simhelpers) +Designing-Simulations-in-R.Rmd:library(tidyverse) +Designing-Simulations-in-R.Rmd:library(ggplot2) +Designing-Simulations-in-R.Rmd:library( simhelpers ) +Designing-Simulations-in-R.Rmd:library( simhelpers ) +Designing-Simulations-in-R.Rmd:library(clubSandwich) +Designing-Simulations-in-R.Rmd:library( simhelpers ) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( purrr ) +Designing-Simulations-in-R.Rmd:library(simhelpers) +Designing-Simulations-in-R.Rmd: library(future) +Designing-Simulations-in-R.Rmd: library(furrr) +Designing-Simulations-in-R.Rmd: library(future) +Designing-Simulations-in-R.Rmd: library(furrr) +Designing-Simulations-in-R.Rmd:library( simhelpers ) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library(ggplot2) +Designing-Simulations-in-R.Rmd:library(lsr) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( purrr ) +Designing-Simulations-in-R.Rmd:library( simhelpers ) 
+Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( purrr ) +Designing-Simulations-in-R.Rmd:library( broom ) +Designing-Simulations-in-R.Rmd:library(lsr) +Designing-Simulations-in-R.Rmd:library(modelr) +Designing-Simulations-in-R.Rmd:library(glmnet) +Designing-Simulations-in-R.Rmd:library(lme4) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( purrr ) +Designing-Simulations-in-R.Rmd:Another good reason for this type of modular organizing is you can then allow for a whole simulation universe, writing a variety of data generators that together form a library of options. +Designing-Simulations-in-R.Rmd: library(future) +Designing-Simulations-in-R.Rmd: library(furrr) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library(future) +Designing-Simulations-in-R.Rmd:library(furrr) +Designing-Simulations-in-R.Rmd:library(future) +Designing-Simulations-in-R.Rmd:library(furrr) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( testthat ) +Designing-Simulations-in-R.Rmd:library(testthat) +Designing-Simulations-in-R.Rmd:library(tidyverse) +Designing-Simulations-in-R.Rmd:library(future) +Designing-Simulations-in-R.Rmd:library(furrr) +Designing-Simulations-in-R.Rmd: library( future ) +Designing-Simulations-in-R.Rmd: library( furrr ) +Designing-Simulations-in-R.Rmd:library( mlmpower ) +Designing-Simulations-in-R.Rmd:library( lme4 ) +Designing-Simulations-in-R.Rmd:library( mlmpower ) +Designing-Simulations-in-R.Rmd:library(tidyverse) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( tidyverse ) +Designing-Simulations-in-R.Rmd:library( blkvar ) +Designing-Simulations-in-R.Rmd:library( blkvar ) +Designing-Simulations-in-R.Rmd:library( tictoc ) +Designing-Simulations-in-R.Rmd:library( bench ) 
+Designing-Simulations-in-R.Rmd:library( blkvar ) +index.Rmd:library(tidyverse) diff --git a/packages.bib b/packages.bib index 851f9ae..7f85483 100644 --- a/packages.bib +++ b/packages.bib @@ -23,12 +23,12 @@ @Manual{R-bench } @Manual{R-blkvar, - title = {blkvar: ATE and Treatment Variation Estimation for Blocked and Multisite -RCTs}, + title = {blkvar: ATE and Treatment Variation Estimation for Blocked and +Multisite RCTs}, author = {Luke Miratrix and Nicole Pashley}, - year = {2025}, note = {R package version 0.0.1.6, commit 60cf10e16a9960a3b0fe0c91adbe3671f604e040}, url = {https://github.com/lmiratrix/blkvar}, + year = {2025}, } @Manual{R-bookdown, @@ -260,9 +260,9 @@ @Manual{R-rpart.plot @Manual{R-simhelpers, title = {simhelpers: Helper Functions for Simulation Studies}, author = {Megha Joshi and James Pustejovsky}, - year = {2025}, note = {R package version 0.3.1.9999}, url = {https://meghapsimatrix.github.io/simhelpers/}, + year = {2025}, } @Manual{R-sn, diff --git a/results/Pearson_Poisson_results_nested.rds b/results/Pearson_Poisson_results_nested.rds new file mode 100644 index 0000000..cdbbfe8 Binary files /dev/null and b/results/Pearson_Poisson_results_nested.rds differ diff --git a/results/cluster_RCT_simulation.rds b/results/cluster_RCT_simulation.rds index 8e9319d..74a5ba0 100644 Binary files a/results/cluster_RCT_simulation.rds and b/results/cluster_RCT_simulation.rds differ diff --git a/results/cluster_RCT_simulation_validity.rds b/results/cluster_RCT_simulation_validity.rds index 23d655f..8c9b58f 100644 Binary files a/results/cluster_RCT_simulation_validity.rds and b/results/cluster_RCT_simulation_validity.rds differ