poisson-regression/module.Rmd at master · info-370/poisson-regression · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
---
title: "Poisson Regression"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
```

In this learning module, you'll be familiarized with **Poisson Regression** as a method for modeling _count data_. You'll frequently encounter a count outcomes, for example:

- The number of tickets sold for a concert
- The number of elected Republican congress members
- The number of crimes that occur in a particular area

As you can imagine, you would want to leverage a statistical technique that assumes a **distribution of counts** for the outcome variable. One distrubtion that meets this constraint is the **Poisson distribution**, which can be used as in a **generalized linear model**. In the sections below, you'll review the Poisson distribution, implement a Poisson regression, and interpret the results.

## Resoures
You may find the following resources helpful in learning about Poisson regression:

- [UCLA Poisson Example](http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm)
- [Wikipedia: Poisson Regression](https://en.wikipedia.org/wiki/Poisson_regression)
- [Regression Models for Data Science: Poisson Regression](https://leanpub.com/regmods/read#leanpub-auto-example-poisson-regression)
- [Model fit](https://onlinecourses.science.psu.edu/stat501/node/377)

## Poisson Distribution
As you may recall from earlier in the course, the **Poisson Distribution** is a distribution of count values. The distribution is described by a **single parameter** lambda ($\lambda$). Note,

> _In a Poisson distribution, the **mean** is equal to the **variance**. This is captured in the single parameter, lambda ($\lambda$)_.

The expected probability of observing _K_ events is defined as:

$$ P\left( k\ events \right) =\frac { { \lambda }^{k  }{ e }^{-\lambda  } }{  k!} $$

Given the above formula, the probaility of observing 2 events if the mean number of events is 3 ($\lambda=3$) is:

$$ P\left( 2\ events \right) =\frac { { 3 }^{2}{ e }^{-3  } }{  2!} =  .22$$

Below are the randomly generated Poisson distributions for $\lambda=2$, $\lambda=10$, $\lambda=100$ (code found [here](https://leanpub.com/regmods/read#leanpub-auto-example-poisson-regression)):

```{r, echo=FALSE}
# Source: https://leanpub.com/regmods/read#leanpub-auto-example-poisson-regression
par(mfrow = c(1, 3))
plot(0 : 10, dpois(0 : 10, lambda = 2), type = "h", frame = FALSE, xlab="Count (K)", ylab="Observed Events", main="lambda = 2")
plot(0 : 20, dpois(0 : 20, lambda = 10), type = "h", frame = FALSE, xlab="Count (K)", ylab="Observed Events", main="lambda = 10")
plot(0 : 200, dpois(0 : 200, lambda = 100), type = "h", frame = FALSE, xlab="Count (K)", ylab="Observed Events", main="lambda = 100")
```

When modeling count data, it is important to **check the distribution** of your outcome variable to see how well it follows a Poisson distribution. If it does not, you may want to implement a Negative Binomial Regression ([example](http://www.ats.ucla.edu/stat/r/dae/nbreg.htm)), which loosens the assumption that the mean and variance are equal. You can also consider a Zero-Inflated Poisson Regression ([example](http://stats.idre.ucla.edu/r/dae/zip/)), or Zero-Inflated Negative Binomial Regression ([example](http://www.ats.ucla.edu/stat/r/dae/zinbreg.htm)).

## Poisson Formula
The formula for fitting an outcome variable with a Poisson distribution is a type of **Generalized Linear Model**. The **link** between the **linear** set of inputs and the output is a **log-link**. At a glance, this seems similar to logging the outcome variable and running a linear regression. However, directly modeling the log of the outcome is not possible if(when) the _count is zero_ ($log(0) = -Inf$). As such, Poisson regressions model the **log of the expected value** of the outcome, given a vector of input variables ([source](https://leanpub.com/regmods/read#leanpub-auto-example-poisson-regression)):

$$log\left( E [{ Y }_{i } | {X  }_{i }={x }_{i}] \right) =log\left({\mu }_{i} \right) ={\beta  }_{ 0 } +{  \beta}_{  1}{x}_{i}$$
In the equation above, the **log of the expected value of ${Y}_{i}$ given ${X}_{i}$** is linearly approximated using ${B  }_{ 0 } +{  B}_{  1}{x}_{i}$.

Similarly to Logistic regression, the coefficients (betas) are obtained through a _Maximum Likelihood Estimation_ that seeks to produce a formula that maximizes the probability of observing the data. While this MLE procedure is beyond the scope of this course, Poisson models are easily implemented in R or Python.

## Generalized Linear Models
As noted above, Poisson models are in the family of Generalized Linear Models. The following excerpt from [this book](https://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X) describes them well:

>"_Generalized linear modeling_ is a framework for statistical analysis that includes linear and logistic regression as special cases. Linear regression directly predictes continuous data $y$ from a _linear predictor_ $X\beta = {\beta}_0 + {X}_{1}{\beta}_1 + ... + + {X}_{k}{\beta}_k$. Logistic regression predicts $Pr(y = 1)$ for binary data from a linear preditor with an inverse-logit transofmration. A generalized linear model involves:

>1. A data vecotr $y = (y_1, ..., y_n)$
2. Precitors $X$ and coefficnets $\beta$, forming a inear predictor $X\beta$
3. A _link function $g$_, yielding a vector of transofrmed data $\hat { y} ={ g }^{  -1}(X\beta)$ that are used to model the data
4. A data distirubtion, $p(y|\hat{y})$
5. Possibly other parameters, such as variances, overdispersions, and cutpoints,
involved in the predictors, link function, and data distribution." (p.109)

As such, we assume a **Poisson distribution** and use a _logartithmic transformation_ as the link for a Poisson regression, which allows the set of predicted values ($\hat{y}$)to be **positive**.

## Implementation
Implementing a Poisson model in R or Python is a straightforward procedure. Using the generalized linear model function (`glm`), a model can be easily declared:

```{r eval=FALSE}
m1 <- glm(outome ~ var1 + var2, family="poisson", data=df)
```

Here, the model (`m1`) is estimating coefficients for independent variables (`var1` and `var2`) for predicting the _mean expected value_ of an outcome variable (`outcome`). Below is an example from [this website](http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm) of predicting **number of student awards** based on _type of program_ (`prog`) the student is enrolled in (vocational, general or academic) and their score on a final exam in math (`math`).

```{r eval=TRUE, echo=FALSE}
# Data prep - code from http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm
p <- read.csv("~/Documents/info-370/poisson-regression/data/poisson_sim.csv", stringsAsFactors = FALSE)
p <- within(p, {
    prog <- factor(prog, levels=1:3, labels=c("General", "Academic", "Vocational"))
    id <- factor(id)
})
```

We can see that the distribution of the number of awards is (roughly) Poisson:

```{r echo=FALSE}
# Create a histrogram of the outcome variable
hist(p$num_awards, xlab="Num. Awards", ylab = "Count", main="Distribution of Awards Received")
```

It is then straightforward to create a Poisson model and view it's results:
```{r echo=TRUE, eval=TRUE}
# Generage a Poisson model of our data
m1 <- glm(num_awards ~ prog + math, family="poisson", data=p)
print(summary(m1))
```

The model is implementing the following formula:

$$log\left( E [{ Awards }_{i } | {X  }_{i }={x }_{i}] \right)={\beta  }_{ 0 } +{  \beta}_{  1}{MathScore}_{i} +{  \beta}_{  2}{Program}_{i}$$

## Interpretation
Similarly to Logistic regression, we're modeling an outcome in **log space**. However, while logistic regression models the **log odds**, Poisson regression models the **log expected value**, making the coefficients somewhat more interpretable. Consider the expected value of $Y$ given some particular value of $X$: $Y_i|X=x_i$. We're modeling the **log expected value** as follows:

$$log\left( E [{ Y }_{i } | {X  }_{i }={x }_{i}] \right) = {\beta }_{ 0 } +{  \beta}_{  1}{x}_{i}    $$

We can then compute the expected value by exponentiating the $\beta$ values:

$$ E [{ Y }_{i } | {X  }_{i }={x }_{i}] = e^{{B  }_{ 0 } +{  B}_{  1}{x}_{i}}    $$
If we want to understand the effect of a **single unit-increase** in $x_i$, we can define the expected value of $Y_{i + 1}$:

$$ E [{ Y }_{i + 1 } | {X  }_{i }={x }_{i+1}] = e^{{\beta  }_{ 0 } +{  \beta}_{  1}{x}_{i+1}}    $$

The **ratio** between these two values can be reduced to:

$$ \frac{E [{ Y }_{i + 1 } | {X  }_{i }={x }_{i+1}]}{E [{ Y }_{i } | {X  }_{i }={x }_{i}]} = e^{\beta_1}$$

Thus, when we **exponentiate our $\beta_1$** value, we can calculate the **proportial increase in the expected value** of our outcome variable.

```{r echo=FALSE}
# Round values for upcoming paragraph
coef <- round(m1$coefficients[4],2)
exp.coef <- round(exp(m1$coefficients[4]),2)

```
Using the example above, when we exponentiate our $\beta_1$ value, it returns the **proportional increase** in the _expected_ number of awards (for each unit increase in math exam score). The estimated coefficient for score is `r coef`, which (when exponentiated) is `r exp.coef`. Thus, the **expected number of awards** increases by a factor of `r exp.coef` for each unit increase in math score.

## Evaluation
Now that we have an interpretation of the **beta values**, we also need a way to evaluate the model fit. The simplest way to do this is to generate a set of predictions and visualize the relationship between the predictions and the actual values.

Using the `predict` function (part of `glm`), it's simple to generate predictions:

```{r}
# Generate predictions that are in the same space as the response
p$preds <- predict(m1, p, type = 'response')
```

We can see how the predicted number of awards increases with math score (for students in each progam):

```{r echo=FALSE}
ggplot(data=p, aes(x=math, y=preds, group = prog, colour = prog)) +
    geom_line() +
    labs(x="Math Score", y="Predicted Number of Awards") +
    ggtitle("Predicted Awards by Program") +
    scale_colour_discrete('Program')
```

These predicted values make intuitive sense. However, when we add the points themselves, the exact fit seems more suspect:

```{r echo=FALSE}
ggplot(data=p, aes(x=math, y=preds, group = prog, colour = prog)) +
    geom_line() +
    geom_point(aes(x=math, y=num_awards, group=prog, colour = prog)) +
    labs(x="Math Score", y="Predicted Number of Awards") +
    ggtitle("Predicted Awards by Program") +
    scale_colour_discrete('Program')
```

Alteratively, we can just compare our predictions directly to our data:

```{r echo=FALSE}
ggplot(data=p, aes(x=num_awards, y=preds, group = prog, colour = prog)) +
    geom_point() +
    labs(x="Num. awards (observed)", y="Num. awards (predicted)") +
    ggtitle("Observed v.s. predicted awards") +
    scale_colour_discrete('Program')
```

As you can see, predicting the actual number of counts is a statistical challenge. However, that _does not necessarily_ mean that the relationships observed in our betas should be disregarded. In order to make a more assertive assessment, we can use various _goodness of fit_ metrics.

## Goodness of Fit
There are a variety of goodness of fit tests we may perform, which are well summarized [here](https://onlinecourses.science.psu.edu/stat501/node/377). We'll begin by introducing the **Pearson Statistic**, also known as the **Pearson chi-squared test statistic**. It is calculated as:

$$ X^2=\sum_{i=1}^{n}\frac{(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\})^{2}}{\exp\{\textbf{X}_{i}\hat{\beta}\}}$$

It is the sum of the squared differences between observed and expected values (normalized by the expected values). We can easily calculate this value using the `residuals` function:


```{r}
# Calculate Person statistic residuals
chi.sq <- sum(residuals(m1, type = "pearson")^2)
```

As you can see, the value of this will depend on the number of observations in your dataset, or your _degrees of freedom_. Using the `pchisq` function, we can ask, _is this value statistically significant, given the number of observations_?

```{r}
# Get p value associated with the chi-square statistic
chi.sq <- sum(residuals(m1, type = "pearson")^2)
p.value <- pchisq(chi.sq, m1$df.residual)
```

Interestingly, the null hypothesis here is **that our model is correct**. The weak p-value _is not_ enough to reject our null hypothesis. This is well summarized in the example [here](http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm):

> _"The information on deviance is also provided. We can use the residual deviance to perform a goodness of fit test for the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. Therefore, if the residual difference is small enough, the goodness of fit test will not be significant, indicating that the model fits the data. We conclude that the model fits reasonably well because the goodness-of-fit chi-squared test is not statistically significant. If the test had been statistically significant, it would indicate that the data do not fit the model well. In that situation, we may try to determine if there are omitted predictor variables, if our linearity assumption holds and/or if there is an issue of over-dispersion."_

Similarly, we can calculate the **deviance** of our model as follows:

$$ D=2\sum_{i=1}^{n}\biggl[y_{i}\log\biggl(\frac{y_{i}}{\exp\{\textbf{X}_{i}\hat{\beta}\}}\biggr)-(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\})\biggr]$$
Luckily, the deviance is returned by our model, and can be accessed as `model.name$deviance`. The p-value for our deviance can be calculated as:

```{r}
# Calculate deviance p value
deviance.p.value <- pchisq(m1$deviance, m1$df.residual, lower.tail=FALSE)
```

The observed deviance p value of `r round(deviance.p.value,2)` indicates that we **cannot reject our model** based on this statistic.