R_Tutorial/ProgrammingExercisesSol.Rmd at master · swyder/R_Tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317

# Programming Exercises

We want to practice what we have learned so far.

Work in the Rstudio editor and write a script that also serves as documentation. Try to write clean code (readable and as simple as possible):

* use consistent variable names (e.g. PropBlond or Prop_Blond)
* indent your code
* write functions

***

Before starting let's install the `ggplot2` package:

```{r, eval=FALSE}
install.packages("ggplot2")
```

Load the `ggplot2` package as we will use some of its data:

```{r}
library(ggplot2)
```


***

# Data Set 1: Sleep in mammals

The msleep data set is part of the `ggplot2` package. It contains a mammals sleep dataset (see ?msleep for details).

First let's first look at the structure of the dataset:

```{r}
str(msleep)
```


### 1. Which mammal sleeps the least, the most?
```{r}
msleep[which.min(msleep$sleep_total), ]
msleep[which.max(msleep$sleep_total), ]
```

### 2. Is there a association between total sleep duration and body weight (bodywt)?

Visualize and test the correlation. What if you use the brain weight (brainwt) instead of the body weight?

```{r}
qplot(data=msleep, log(bodywt), log(sleep_total))
qplot(data=msleep, log(brainwt), log(sleep_total))
with(msleep, cor(log(bodywt), log(sleep_total), method = "spearman"))
with(msleep, cor.test(log(bodywt), log(sleep_total), method = "spearman"))
```

### 3. Make a scatterplot of sleep_total vs. sleep_rem
```{r}
qplot(data=msleep, sleep_total, sleep_rem)
```

### 4. Make point size proportional to log(body mass)
```{r}
qplot(data=msleep, sleep_total, sleep_rem, size=log(bodywt))
```

### 5. Add a OLS (Ordinary least square) regression line
```{r}
qplot(data=msleep, sleep_total, sleep_rem, size=log(bodywt)) + stat_smooth(method = 'lm')
```

### 6. Color-code the points according to vore. Does the scaling of REM & total sleep differ with diet?
```{r}
#type of vores in the dataset
table(msleep$vore)
qplot(data=msleep, sleep_total, sleep_rem, size=log(bodywt), col=vore) + stat_smooth(se=FALSE, method='lm')
```

### (advanced) 7. Make the figure from the question 6 in publication quality (Axes labels, font sizes, ..)
```{r}
#using RColorBrewer
#library(RColorBrewer)
qplot(data=msleep, sleep_total, sleep_rem, col=vore, size=3, shape=vore) + xlab("Total amount of sleep (hrs/day)") + ylab("REM sleep (hrs/day)") + theme_classic(base_size = 14, base_family = "Helvetica") + scale_shape(name = "Functional\nfeeding group", labels = c("carnivore","herbivore","insectivore","omnivore")) + guides(size = FALSE, col = FALSE) + + scale_colour_brewer(palette="Set1")
```

The graph is still not perfect, e.g. as the legend is small and not colored. But often it is faster and more convenient to do make small changes manually using graphics software. Here I would save the plot as svg and make the last improvements using Inkscape (or Illustrator on svg or pdf).


Original publication in [PNAS](http://www.pnas.org/content/104/3/1051.abstract)

****

## Data Set 2: Baby names

(Data set 1 is borrowed from a [lecture](http://stat405.had.co.nz/lectures/11-adv-data-manip.pdf) by Hadley Wickham)

The data set contains the top 1000 male and female baby names in the US, from
1880 to 2008 (1000* 2 * 129 = 258,000 records). All names with more than 5 uses
are given.

It contains 5 variables: year, name, soundex, sex and proportion

Download bnames2.csv.bz2 from http://stat405.had.co.nz/data/bnames2.csv.bz2
(Under Windows download the zipped file [bnames2.csv.zip](bnames2.csv.zip) and extract it before reading)

You can directly read in the compressed file like (on Linux and Mac OS)
```{r}
bnames <- read.csv("bnames2.csv.bz2")
```

Also load a file containing the total number of birth per years (for boys and girls separately)
```{r}
births <- read.csv("http://stat405.had.co.nz/data/births.csv")
```


Now it's your turn.
### 1. Extract your name from the dataset. Plot the trend over time.

```{r}
head(bnames)
```


Plotting the frequency of Stefan from 1880 to 2008.

```{r}
bnames.Stefan <- subset(bnames, name=="Stefan")
plot(bnames.Stefan$year, bnames.Stefan$prop, type="l")
qplot(bnames.Stefan$year, bnames.Stefan$prop, geom="line")
```

Robbie is an example for a name that was used both for boys and girls. qplot adds a legend automatically.
```{r}
qplot(year, prop, color=sex, data=subset(bnames, name=="Robbie"), geom="line")
```


### 2. Use the soundex variable to extract all names that sound like yours. Plot the trend over time.
```{r}
#All names sounding like Stefan with soundex=="S315"
unique(subset(bnames, soundex=="S315")$name)
qplot(year, prop, color=sex, data=subset(bnames, soundex=="S315"), geom="line") + facet_wrap(~ name)
#We can also have different scales for each panel
qplot(year, prop, color=sex, data=subset(bnames, soundex=="S315"), geom="line") + facet_wrap(~ name, scales = "free")
```

### 3. Find out the most frequently used similar sounding name
```{r}
head(sort(decreasing = TRUE, table(subset(bnames, soundex=="S315")$name)))
qplot(year, prop, color=name, data=subset(bnames, name %in% c("Steven","Stefan","Stephan")), geom="line") + scale_y_log10()
```

### (advanced) 4. Which boy and girl name was used most over the whole time?

We need to sum up the absolute births over the years. First we add a variable AbsBirths
```{r}
AbsNumber <- vector(length = nrow(bnames))
for (i in 1:nrow(bnames)) {
  totalNumber <- subset(births, year == bnames$year[i] & sex == bnames$sex[i])$births
  AbsNumber[i] <- round(bnames$prop[i] * totalNumber)
}
bnames$AbsBirths <- AbsNumber
```

Then we sum up AbsBirths
```{r}
counts <- tapply(bnames$AbsBirths, bnames$name, sum)
head(counts)
head(sort(decreasing = TRUE, counts))
```

Alternatively we could use the ddply() function:
```{r}
library(plyr)
counts2 <- ddply(bnames, "name", summarize, n = sum(AbsBirths))
```

A quick check
```{r}
counts["Stefan"]
sum(subset(bnames, name == "Stefan")$AbsBirths)
```

### (advanced) 5. Did first names became shorter over time?
bnames$length <- nchar(bnames$name)
bnames.1880ies <- subset(bnames, year >= 1880 & year < 1890)
sum.1880 <- sum(tapply(bnames.1880ies$AbsBirths, bnames.1880ies$length, sum))
tapply(bnames.1880ies$AbsBirths, bnames.1880ies$length, sum)/sum.1880*100
bnames.1990ies <- subset(bnames, year >= 1990 & year < 2000)
sum.1990 <- sum(tapply(bnames.1990ies$AbsBirths, bnames.1990ies$length, sum))
tapply(bnames.1990ies$AbsBirths, bnames.1990ies$length, sum)/sum.1990*100

Calculating it for all years is a bit more tricky as we need a weighted eman and can't use tapply. Either we do it with a for loop or we can use the `ddply()` function from the `plyr` package. See also this [Stackoverflow post](http://stackoverflow.com/questions/18392408/how-to-use-ddply-to-get-weighted-mean-of-class-in-dataframe)

```{r}
bnames.girls <- subset(bnames, sex=="girl")[, c(1,6,7)]
plot(ddply(bnames.girls, .(year), summarize, x = weighted.mean(length, AbsBirths)), type='l', ylab="Mean Length")
bnames.boys <- subset(bnames, sex=="boy")[, c(1,6,7)]
lines(col="red", ddply(bnames.boys, .(year), summarize, x = weighted.mean(length, AbsBirths)))
```

First names have become longer in the US over time. Interestingly, girls' names (in black) are more variable in length than boys' names (in red).

### (advanced) 6. Which names became very rare after 1944?

```{r}
bnames.before1944 <- subset(bnames, year < 1944)
counts.before1944 <- tapply(bnames.before1944$AbsBirths, bnames.before1944$name, sum)

bnames.from1944 <- subset(bnames, year >= 1944)
counts.from1944 <- tapply(bnames.from1944$AbsBirths, bnames.from1944$name, sum)
bnames.from1944$rank <- nrow(bnames.from1944) + 1 -rank(bnames.from1944$prop)
#Merge into 2 data frame
bnames.counts <- merge(as.data.frame(counts.before1944), as.data.frame(counts.from1944), all=TRUE, by="row.names")
bnames.counts$rank.before1944 <- nrow(bnames.counts) + 1 - rank(bnames.counts$counts.before1944, na.last = "keep")
bnames.counts$rank.from1944 <- rank(bnames.counts$counts.from1944, na.last = "keep")
#RankProduct
bnames.counts$RankProd <- bnames.counts$rank.before1944 * bnames.counts$rank.from1944
head(bnames.counts[order(decreasing = FALSE, bnames.counts$RankProd), 1:3])
#Names becoming popular only from 1944
head(bnames.counts[order(decreasing = TRUE, bnames.counts$RankProd), 1:3])
```

### (advanced) 7. Think of another question you could answer with the dataset. E.g. Identify the most popular firstname in 1980ies the or identify the most popular name that was used for boys and girls.

***

## Data Set 3: Hair and Eye Color

The next data set is the distribution of hair and eye color and sex in 592 statistics students stored in the table `HairEyeColor` (see ?HairEyeColor for details).

We load the data set with the data() function and have a look at the structure using str().

```{r}
data("HairEyeColor")
str(HairEyeColor)
```


```{r}
mosaicplot(HairEyeColor)
```

***

# Data Set 4: Movie ratings

The movies data set is from the `ggplot2` package. The internet movie database,
[http://imdb.com/](http://imdb.com/), is a website devoted to collecting movie
data supplied by studios and fans (See ?movies for details).

The data set contains data for 58'788 movies, namely the title of the movie,
year of release, budget, length, rating and genre.

### 1. Look at the structure of `movies` using function str() or head(). Make a histogram of the rating.
```{r}
str(movies)
hist(movies$rating)
```

### 2. Do old movies perform better or worse than recent movies?
```{r}
boxplot(movies$rating ~ movies$year)
```

Some movies obtained less than 10 votes. Remove them and repeat the plotting. Do you see a change?

### 3. Does the movie genre influence the rating?

Add a new variable Genre.simple containing genre information like this:
movies$Genre.simple <- ifelse(movies$Action == 1, "Action", ifelse(movies$Comedy == 1, "Comedy", ifelse(movies$Drama == 1, "Drama", "other")))
Then plot the rating per genre

```{r}
movies$Genre.simple <- ifelse(movies$Action == 1, "Action", ifelse(movies$Comedy == 1, "Comedy", ifelse(movies$Drama == 1, "Drama", "other")))
movies.conf <- subset(movies, votes >= 10)
boxplot(movies.conf$rating ~ movies.conf$Genre.simple)
```

By default genres are alphabetically ordered. Let's order the genres according to their median rating (tip: use factors and function factor()).
```{r}
movies$Genre.simple <- factor(movies$Genre.simple, levels = c("Action","other","Comedy","Drama"))
boxplot(movies.conf$rating ~ movies.conf$Genre.simple)
```

### 4. Which movie is the longest and how long is it?
```{r}
movies.conf[which.max(movies.conf$length), ]
```

### 5. Does the movie length have an impact on its rating?
Tip: use the function cut to make categories.
```{r}
boxplot(movies.conf$rating ~  cut(movies.conf$length, breaks = quantile(movies.conf$length)))
#movies.conf.clean <- subset(movies.conf, length <= 300)
#boxplot(movies.conf.clean$rating ~ cut(movies.conf.clean$length, breaks = 10), las=3)
#But beware of categories with few members!
#tapply(movies.conf.clean$rating, cut(movies.conf.clean$length, breaks = 10), length)
```


sumd <- aggregate(awake ~ conservation, data=msleep, FUN=mean)
sumd$sd <- aggregate(awake ~ conservation, data=msleep, FUN=sd)[,2]
limits <- aes(ymax = awake + sd, ymin = awake - sd)
dyp <- ggplot(sumd, aes(x=conservation, y=awake)) + geom_bar(fill="grey") + theme_classic()
dyp + geom_errorbar(limits, width=0.25)

## Tables

See also the [xtable](http://cran.r-project.org/web/packages/xtable/)
package.

```{r,results='asis'}
library(knitr)
kable(head(iris[,1:3]), format='html')
```