Data-Science-with-R/Tutorials/Tutorial1b_baseball.Rmd at master · tz33cu/Data-Science-with-R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
---
title: "Basic data operations"
output:
  html_document: default
  html_notebook: default
  pdf_document: default
---

#### Note:
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*.

## Baseball salary data
In this tutorial, we will look at the salary of Major League Baseball (MLB) players. [source from baseballguru.com.](http://baseballguru.com/salary.zip)

First we load R libraries that we need for this tutorial. Basic libraries of functions are loaded every time R starts. More specialized functions need to be loaded first before they can used.
```{r}
library(dplyr)
library(readr)
library(DT)
library(RColorBrewer)
```
### Read in the data

R can read data from data files such as csv, txt, and output from other softwares such as STATA and SAS. Google for "R load data xx format" should usually point you to the right direction. CSV is usually one of the most widely used format for data nowadays.

Now let's read in the baseball salary data set.

```{r}
BaseballSalary=read_csv(file="data/BaseballSalary.csv")
```


```{r}
dim(BaseballSalary)
datatable(head(BaseballSalary,50), options = list(scrollX=T, pageLength = 5))
```

### Explore the data with graphs

Select a specific year's data and plot a distribution of the players' salaries during that year.
```{r, fig.height=8, fig.width=6}
col.use=brewer.pal(10, "RdYlBu")
hist.1985=hist(filter(BaseballSalary, year==1985)$salary,
               main="salaries in 1985",
               xlab="annual salary",
               col=col.use,
               nclass = 50)
hist(filter(BaseballSalary, year==2004)$salary,
     main="salaries in 2004",
     xlab="annual salary",
     col=col.use,
     nclass = 50)
```

To make a more meaningful comparison, we save the `hist` object from the 1985 plot and use the `breaks` from this object for plotting the 2004 data. The range for the year 2004 is wider than the year 1985, we then add a last bin to include the maximum value. Lastly, when comparing two histograms, it is important to make the two plots with the same scales for the axes.

Here we can see the effects of the minimum salary in MLB in year 2004.

```{r, fig.height=4, fig.width=6}
par(mfrow=c(2,1))
hist.1985=hist(filter(BaseballSalary, year==1985)$salary,
               main="salaries in 1985",
               xlab="annual salary",
               col=col.use,
               nclass = 20,
               xlim=c(0, max(filter(BaseballSalary, year==2004)$salary)),
               ylim=c(0, 0.0000015), freq=F)
abline(v=mean(filter(BaseballSalary, year==1985)$salary), col=1, lwd=1.5)
abline(v=median(filter(BaseballSalary, year==1985)$salary), col=2, lwd=1.5)
hist(filter(BaseballSalary, year==2004)$salary,
     col=col.use,
     breaks=c(hist.1985$breaks, max(BaseballSalary$salary)),
     main="salaries in 2004",
     xlab="annual salary",
     xlim=c(0, max(filter(BaseballSalary, year==2004)$salary)),
     ylim=c(0,0.0000015),
     freq=F)
abline(v=mean(filter(BaseballSalary, year==2004)$salary), col=1, lwd=1.5)
abline(v=median(filter(BaseballSalary, year==2004)$salary), col=2, lwd=1.5)

```

Now we take a look how the distributions of the salaries over the years.
+ We see that the distributions of the salaries become more skewed to the high values.
+ The median salary remains more or less flat.
+ What happened from 1994 to 1995? [Answer](https://en.wikipedia.org/wiki/1994%E2%80%9395_Major_League_Baseball_strike)

```{r, fig.height=6, fig.width=9}
par(mfrow=c(1,1))
plot(salary~as.factor(year), data=BaseballSalary, col=col.use)
```

## The `dplyr` package
Dplyr aims to provide a function for each basic verb of data manipulation.
- `filter()`
- `arrange()`
- `select()`
- `distinct()`
- `mutate()`
- `summarise()`
- `sample_n()` and `sample_frac()`
- `group_by`

### Compute some team summary statistics.
In the following, we compute for each team (`teamID`) and `year` a list of summary statistics:
+ `count`: number of players
+ `total`: team's total payroll for that year
+ `median`: median salary
+ `mean`: mean salary
+ `min`, `max`, `q1`, `q3`: minimum, maximum, first quartile and third quartile of salaries.

```{r}
BSTeamYear=BaseballSalary%>%
          group_by(teamID, year)%>%
          summarize(
            count=n(),
            total=sum(salary),
            median=median(salary),
            min=min(salary),
            max=max(salary),
            q1=quantile(salary, .25),
            q3=quantile(salary, .75)
          )
BSTeamYear=as.data.frame(BSTeamYear)
sample_n(BSTeamYear, 10)
```

### Visualize team compensation trends in 2004.
```{r, fig.height=6, fig.width=9}
datatable(filter(BSTeamYear, year==2004), options = list(scrollX=T, pageLength = 10))
BS2004=filter(BaseballSalary, year==2004)
plot(as.factor(BS2004$teamID), BS2004$salary, col=col.use, las=2)
BS2004[which.max(BS2004$salary),]
```