# Statistics and simulations
```{r}
#| message: false
library(tidyverse)
```
## Random sampling
Before we jump to statistics and simulations, we'll cover how to do random sampling in R.
### Random sampling from theoretical distributions
#### Uniform distribution {.unnumbered}
For the uniform distribution, the arguments of `runif()` specify how many draws we want and the boundaries of the sampling interval:
```{r}
runif(n = 20, min = -3, max = 3)
```
When we draw a million times from the distribution, we can plot the results and see that they look as we would expect:
```{r}
set.seed(123)
my_runif <- runif(n = 1000000, min = -3, max = 3)
```
```{r}
ggplot(data.frame(my_runif), aes(x = my_runif)) +
  geom_histogram(binwidth = 0.25, boundary = 0, closed = "right") +
  scale_x_continuous(breaks = seq(-5, 5, 1), limits = c(-5, 5))
```
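As a quick numeric sanity check (the mean of $U(-3, 3)$ is 0, and all draws should fall within the boundaries):
```{r}
mean(my_runif)  # should be close to 0
range(my_runif) # should be within (-3, 3)
```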
#### Binomial distribution {.unnumbered}
For the binomial distribution, we can specify the number of draws, how many trials each draw will have, and the probability of success.
For instance, we can ask R to do the following twenty times: flip a fair coin one hundred times, and count the number of tails.
```{r}
rbinom(n = 20, size = 100, prob = 0.5)
```
With the `prob` argument, we can simulate unfair coins:
```{r}
rbinom(n = 20, size = 100, prob = 0.9)
```
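As a sanity check, the average count across many draws should approach $\text{size} \times \text{prob}$ (here, $100 \times 0.9 = 90$):
```{r}
mean(rbinom(n = 10000, size = 100, prob = 0.9)) # should be close to 90
```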
#### Normal distribution {.unnumbered}
For the Normal or Gaussian distribution, we specify the number of draws, the mean, and the standard deviation:
```{r}
rnorm(n = 20, mean = 0, sd = 1)
```
::: callout-note
#### Exercise
Compute and plot `my_rnorm`, a vector with 10,000 draws from a Normal distribution $X$ with mean equal to -10 and standard deviation equal to 2 ($X\sim N(-10,2)$). You can recycle code!
:::
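One possible solution, recycling the `rnorm()` and plotting code from above:
```{r}
my_rnorm <- rnorm(n = 10000, mean = -10, sd = 2)
ggplot(data.frame(my_rnorm), aes(x = my_rnorm)) +
  geom_histogram(binwidth = 0.25)
```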
### Random sampling from data
In this section we will work with good ol' `mtcars`, one of R's most notable default datasets. We'll assign it to an object so it shows in our Environment pane:
```{r}
my_mtcars <- mtcars
```
::: callout-tip
Default datasets such as `mtcars` and `iris` are useful because they are available to everyone, and once you become familiar with them, you can start thinking about the code instead of the intricacies of the data. These qualities also make default datasets ideal for building **reproducible examples** (see [Wickham 2014](http://adv-r.had.co.nz/Reproducibility.html)).
:::
We can use the function `sample()` to obtain random values from a vector. The `size` argument specifies how many values we want. For example, let's draw one random value from the `mpg` column:
```{r}
sample(my_mtcars$mpg, size = 1)
```
Every time we run this command, we can get a different result:
```{r}
sample(my_mtcars$mpg, size = 1)
```
```{r}
sample(my_mtcars$mpg, size = 1)
```
On some occasions we do want to get the same result consistently after running a random process multiple times. In this case, we *set a seed*, taking advantage of R's pseudo-random number generator. No matter how many times we run the following code block, the result will be the same:
```{r}
set.seed(123)
sample(my_mtcars$mpg, size = 1)
```
Sampling *with replacement* means that we can get the same value multiple times. For example:
```{r}
set.seed(12)
sample(c("Banana", "Apple", "Orange"), size = 3, replace = T)
```
```{r}
sample(my_mtcars$mpg, size = 100, replace = TRUE)
```
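Note that the previous command only works because of `replace = TRUE`: `my_mtcars` has just 32 rows, so drawing 100 values *without* replacement would throw an error. Without replacement, the sample size cannot exceed the length of the vector:
```{r}
sample(my_mtcars$mpg, size = 5) # fine: 5 draws from 32 values, no repeats
```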
To sample not from a vector but from a data frame's rows, we can use the `slice_sample()` function from `dplyr`:
```{r}
my_mtcars |>
  slice_sample(n = 2) # a number of rows
```
```{r}
my_mtcars |>
  slice_sample(prop = 0.5) # a proportion of rows
```
Again, we can also use seeds here to ensure that we'll get the same result each time:
```{r}
set.seed(123)
my_mtcars |>
  slice_sample(prop = 0.5)
```
::: callout-note
#### Exercise
Use `slice_sample()` to sample 32 rows from `mtcars` *with replacement*.
:::
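One possible solution:
```{r}
my_mtcars |>
  slice_sample(n = 32, replace = TRUE)
```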
## Statistics
> The problems considered by probability and statistics are inverse to each other. In probability theory we consider some underlying process which has some randomness or uncertainty modeled by random variables, and we figure out what happens. In statistics we observe something that has happened, and try to figure out what underlying process would explain those observations. (quote attributed to [Persi Diaconis](https://stats.stackexchange.com/a/675))
- In statistics we try to learn about a **data-generating process** (DGP) using our observed data. Examples: surveys, GDP statistics.
- Usually we are limited to **samples**, while our DGPs of interest are **population-based**.
- So we use **random sampling** or refer to **superpopulations** as a way to justify how the data we observe can approximate the population.
- Statistics has two main targets:
    - **Estimation**: how we find a reasonable guess of an unknown property (parameter) of a DGP
    - **Inference**: how we describe uncertainty about our estimate
- We use an **estimator** $\hat\theta(\cdot)$, which is a function that summarizes data, as a guess about a parameter $\theta$. A guess generated by an estimator in a given sample is called an **estimate**.
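For example, the sample mean is an estimator of the population mean. A minimal illustration, where we simulate a sample from a DGP whose mean we know to be 5:
```{r}
x <- rnorm(n = 100, mean = 5, sd = 1) # data from a DGP with true mean = 5
mean(x) # the estimate produced by the sample-mean estimator
```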
::: callout-note
#### Exercise
Suppose there's a uniform distribution $X \sim U(0, \text{unknown})$ out there.[^08_stats_sims-1] We want to use a sample from this distribution (let's say of n=30 observations) to estimate the unknown upper bound.
Discuss: would the sample maximum be a good estimator? Why or why not?
:::
[^08_stats_sims-1]: This is a fascinating distribution with a rich history, and it is used in many statistical textbooks ([Whittinghill and Hogg, 2001](https://www.tandfonline.com/doi/pdf/10.1080/10691898.2001.11910653)). It is thoroughly covered in the UT SDS [Mathematical Statistics](https://web.archive.org/web/20210716123200/https://web.ma.utexas.edu/users/parker/384/index.htm) sequence.
- Theoretical statistics is all about finding "good" estimators. A few properties of good estimators:
    - **Unbiasedness**: Across multiple random samples, an unbiased estimator gets the right answer on average.
    - **Low variance**: Across multiple random samples, a low-variance estimator is more concentrated around the true parameter.
    - BUT it's sometimes hard to get both unbiasedness and low variance, so we have to make sacrifices. We usually quantify this trade-off via the mean squared error: $\text{MSE} = \text{bias}^2 + \text{variance}$. Comparing two estimators, the one with the lower MSE is said to be more **efficient**.
    - **Consistency**: A consistent estimator converges in probability to the true value. "If we had enough data, the probability that our estimate would be far from the truth would be close to zero" ([Aronow and Miller 2019](https://www.cambridge.org/core/books/foundations-of-agnostic-statistics/684756357E7E9B3DFF0A8157FB2DCECA), p. 105).
- Applied statistics is about using these techniques reasonably in messy real-world situations...
## Simulations
- In simulations, we generate fake data following a known data-generating process. Why?
    - To better understand how our estimators work in different settings (the methods reason).
    - To get insights about complex processes with many moving parts (the substantive reason; let's talk about gerrymandering).
::: callout-note
#### Exercise
Simulate drawing an n=30 random sample from a $X \sim U(0, 10)$ distribution and take its maximum value.
:::
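One possible solution:
```{r}
max(runif(n = 30, min = 0, max = 10))
```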
### Loops
Loops allow us to repeat operations in R. The most common construct is the for-loop:
```{r}
for (i in 1:10){
  print(i)
}
```
It's common to perform operations at each step and save the results. We typically create an empty object and "fill it in" at each step:
```{r}
results <- double(10)
for (i in 1:10){
  results[i] <- i ^ 2
}
```
```{r}
results
```
::: callout-note
#### Functional loops
Another way to do loops is with the `*apply()` family of functions:
```{r}
sapply(1:10, function(x){x ^ 2})
```
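Since we have already loaded the `tidyverse`, another option (just a brief aside here) is `purrr::map_dbl()`, which works similarly:
```{r}
map_dbl(1:10, \(x) x^2)
```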
:::
> We talked about loops and various extensions in one of our methods workshops last year: [Speedy R](https://arcruz0.github.io/workshops/speedyr).
### An example simulation
We will now repeat the exercise from above 10,000 times:
```{r}
set.seed(1)
k <- 10000 # number of simulations
n <- 30 # number of observations in each simulation
# define an empty numeric object
simulated_estimates <- double(k)
# loop: at each step draw a random n=30 sample and get its maximum
for (i in 1:k){
  random_sample <- runif(n, 0, 10)
  simulated_estimates[i] <- max(random_sample)
}
```
Now we can analyze our simulated estimates:
```{r}
mean(simulated_estimates)
```
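We can also connect this back to the estimator properties from above by approximating the bias, variance, and MSE of the sample maximum (the true parameter is 10):
```{r}
bias <- mean(simulated_estimates) - 10 # negative: underestimates on average
variance <- var(simulated_estimates)
bias^2 + variance # approximate MSE
```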
```{r}
ggplot(data.frame(x = simulated_estimates), aes(x = x)) +
  geom_histogram(binwidth = 0.05, boundary = 0, closed = "right")
```
::: callout-note
#### Exercise
We just simulated to evaluate the sample maximum as an estimator. Modify the code above to evaluate the following two estimators:
- $\displaystyle\text{(sample maximum)} \cdot \frac{(n+1)}{n}$
- $\displaystyle 2 \cdot \text{(sample mean)}$
:::
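A sketch for the first estimator (the second one is analogous; `k` and `n` are defined in the code above):
```{r}
set.seed(1)
simulated_estimates_adj <- double(k)
for (i in 1:k){
  random_sample <- runif(n, 0, 10)
  simulated_estimates_adj[i] <- max(random_sample) * (n + 1) / n
}
mean(simulated_estimates_adj) # compare to the true value of 10
```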
### Another example simulation: bootstrapping
The bootstrap (and its relatives) is one way in which we can do *inference*, i.e., assess uncertainty. (We'll go through the intuition on the board.)
```{r}
# set seed and number of simulations
set.seed(1)
k <- 10000
# define an empty numeric object
bootstrapped_means <- double(k)
# loop: resample the rows of the data with replacement and store each mean
for (i in 1:k){
  m <- my_mtcars |> slice_sample(prop = 1, replace = TRUE)
  bootstrapped_means[i] <- mean(m$mpg)
}
```
```{r}
ggplot(data.frame(bootstrapped_means), aes(x = bootstrapped_means)) +
  geom_histogram(binwidth = 0.25, boundary = 0, closed = "right")
```
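Since the point of the bootstrap is *inference*, a natural next step (one common convention among several) is to summarize the uncertainty with a 95% percentile interval:
```{r}
quantile(bootstrapped_means, probs = c(0.025, 0.975))
```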