We want to practice what we have learned so far.
Work in the Rstudio editor and write a script that also serves as documentation. Try to write clean code (readable and as simple as possible):
- use consistent variable names (e.g. PropBlond or Prop_Blond)
- indent your code
- write functions
Before starting let's install the ggplot2 package:
install.packages("ggplot2")Load the ggplot2 package as we will use some of its data:
library(ggplot2)## Loading required package: methods
The msleep data set is part of the ggplot2 package. It contains a mammals sleep dataset (see ?msleep for details).
First let's first look at the structure of the dataset:
str(msleep)## 'data.frame': 83 obs. of 11 variables:
## $ name : chr "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
## $ genus : chr "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
## $ vore : Factor w/ 4 levels "carni","herbi",..: 1 4 2 4 2 2 1 NA 1 2 ...
## $ order : chr "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
## $ conservation: Factor w/ 7 levels "","cd","domesticated",..: 5 NA 6 5 3 NA 7 NA 3 5 ...
## $ sleep_total : num 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
## $ sleep_rem : num NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
## $ sleep_cycle : num NA NA NA 0.133 0.667 ...
## $ awake : num 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
## $ brainwt : num NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
## $ bodywt : num 50 0.48 1.35 0.019 600 ...
msleep[which.min(msleep$sleep_total), ]## name genus vore order conservation sleep_total sleep_rem
## 30 Giraffe Giraffa herbi Artiodactyla cd 1.9 0.4
## sleep_cycle awake brainwt bodywt
## 30 NA 22.1 NA 899.995
msleep[which.max(msleep$sleep_total), ]## name genus vore order conservation sleep_total
## 43 Little brown bat Myotis insecti Chiroptera <NA> 19.9
## sleep_rem sleep_cycle awake brainwt bodywt
## 43 2 0.2 4.1 0.00025 0.01
Visualize and test the correlation. What if you use the brain weight (brainwt) instead of the body weight?
qplot(data=msleep, log(bodywt), log(sleep_total))qplot(data=msleep, log(brainwt), log(sleep_total))## Warning: Removed 27 rows containing missing values (geom_point).
with(msleep, cor(log(bodywt), log(sleep_total), method = "spearman"))## [1] -0.5346017
with(msleep, cor.test(log(bodywt), log(sleep_total), method = "spearman"))## Warning in cor.test.default(log(bodywt), log(sleep_total), method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: log(bodywt) and log(sleep_total)
## S = 146223, p-value = 1.931e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5346017
qplot(data=msleep, sleep_total, sleep_rem)## Warning: Removed 22 rows containing missing values (geom_point).
qplot(data=msleep, sleep_total, sleep_rem, size=log(bodywt))## Warning: Removed 22 rows containing missing values (geom_point).
qplot(data=msleep, sleep_total, sleep_rem, size=log(bodywt)) + stat_smooth(method = 'lm')## Warning: Removed 22 rows containing missing values (stat_smooth).
## Warning: Removed 22 rows containing missing values (geom_point).
#type of vores in the dataset
table(msleep$vore)##
## carni herbi insecti omni
## 19 32 5 20
qplot(data=msleep, sleep_total, sleep_rem, size=log(bodywt), col=vore) + stat_smooth(se=FALSE, method='lm')## Warning: Removed 9 rows containing missing values (stat_smooth).
## Warning: Removed 8 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 2 rows containing missing values (stat_smooth).
## Warning: Removed 2 rows containing missing values (stat_smooth).
## Warning: Removed 22 rows containing missing values (geom_point).
(advanced) 7. Make the figure from the question 6 in publication quality (Axes labels, font sizes, ..)
qplot(data=msleep, sleep_total, sleep_rem, col=vore, size=3, shape=vore) + xlab("Total amount of sleep (hrs/day)") + ylab("REM sleep (hrs/day)") + theme_classic(base_size = 14, base_family = "Helvetica") + scale_shape(name = "Functional\nfeeding group", labels = c("carnivore","herbivore","insectivore","omnivore")) + guides(size = FALSE, col = FALSE)## Warning: Removed 27 rows containing missing values (geom_point).
The graph is still not perfect, e.g. as the legend is small and not colored. But often it is faster and more convenient to do make small changes manually using graphics software. Here I would save the plot as svg and make the last improvements using Inkscape (or Illustrator on svg or pdf).
Original publication in PNAS
(Data set 1 is borrowed from a lecture by Hadley Wickham)
The data set contains the top 1000 male and female baby names in the US, from 1880 to 2008 (1000* 2 * 129 = 258,000 records). All names with more than 5 uses are given.
It contains 5 variables: year, name, soundex, sex and proportion
Download bnames2.csv.bz2 from http://stat405.had.co.nz/data/bnames2.csv.bz2 (Under Windows download the zipped file bnames2.csv.zip and extract it before reading)
You can directly read in the compressed file like (on Linux and Mac OS)
bnames <- read.csv("bnames2.csv.bz2")Also load a file containing the total number of birth per years (for boys and girls separately)
births <- read.csv("http://stat405.had.co.nz/data/births.csv")Now it's your turn.
head(bnames)## year name prop sex soundex
## 1 1880 John 0.081541 boy J500
## 2 1880 William 0.080511 boy W450
## 3 1880 James 0.050057 boy J520
## 4 1880 Charles 0.045167 boy C642
## 5 1880 George 0.043292 boy G620
## 6 1880 Frank 0.027380 boy F652
Plotting the frequency of Stefan from 1880 to 2008.
bnames.Stefan <- subset(bnames, name=="Stefan")
plot(bnames.Stefan$year, bnames.Stefan$prop, type="l")qplot(bnames.Stefan$year, bnames.Stefan$prop, geom="line")Robbie is an example for a name that was used both for boys and girls. qplot adds a legend automatically.
qplot(year, prop, color=sex, data=subset(bnames, name=="Robbie"), geom="line")#All names sounding like Stefan with soundex=="S315"
unique(subset(bnames, soundex=="S315")$name)## [1] Stephen Steven Stephan Stevan Stefan Stephon Stephanie
## [8] Stephania Stefanie Stephenie Stefani Stephany Stephaine Stephani
## 6782 Levels: Aaden Aaliyah Aarav Aaron Ab Abagail Abb Abbey ... Zula
qplot(year, prop, color=sex, data=subset(bnames, soundex=="S315"), geom="line") + facet_wrap(~ name)#We can also have different scales for each panel
qplot(year, prop, color=sex, data=subset(bnames, soundex=="S315"), geom="line") + facet_wrap(~ name, scales = "free")head(sort(decreasing = TRUE, table(subset(bnames, soundex=="S315")$name)))##
## Steven Stephen Stephanie Stephan Stefan Stefanie
## 134 129 121 74 58 40
qplot(year, prop, color=name, data=subset(bnames, name %in% c("Steven","Stefan","Stephan")), geom="line") + scale_y_log10()We need to sum up the absolute births over the years. First we add a variable AbsBirths
AbsNumber <- vector(length = nrow(bnames))
for (i in 1:nrow(bnames)) {
totalNumber <- subset(births, year == bnames$year[i] & sex == bnames$sex[i])$births
AbsNumber[i] <- round(bnames$prop[i] * totalNumber)
}
bnames$AbsBirths <- AbsNumberThen we sum up AbsBirths
counts <- tapply(bnames$AbsBirths, bnames$name, sum)
head(counts)## Aaden Aaliyah Aarav Aaron Ab Abagail
## 959 39665 219 509464 25 2682
head(sort(decreasing = TRUE, counts))## James John Robert Michael Mary William
## 5043259 5036828 4771447 4226596 4111514 3966170
Alternatively we could use the ddply() function:
library(plyr)
counts2 <- ddply(bnames, "name", summarize, n = sum(AbsBirths))A quick check
counts["Stefan"]## Stefan
## 16180
sum(subset(bnames, name == "Stefan")$AbsBirths)## [1] 16180
bnames$length <- nchar(bnames$name) bnames.1880ies <- subset(bnames, year >= 1880 & year < 1890) sum.1880 <- sum(tapply(bnames.1880ies$AbsBirths, bnames.1880ies$length, sum)) tapply(bnames.1880ies$AbsBirths, bnames.1880ies$length, sum)/sum.1880100 bnames.1990ies <- subset(bnames, year >= 1990 & year < 2000) sum.1990 <- sum(tapply(bnames.1990ies$AbsBirths, bnames.1990ies$length, sum)) tapply(bnames.1990ies$AbsBirths, bnames.1990ies$length, sum)/sum.1990100
Calculating it for all years is a bit more tricky as we need a weighted eman and can't use tapply. Either we do it with a for loop or we can use the ddply() function from the plyr package. See also this Stackoverflow post
bnames.girls <- subset(bnames, sex=="girl")[, c(1,6,7)]## Error in `[.data.frame`(subset(bnames, sex == "girl"), , c(1, 6, 7)): undefined columns selected
plot(ddply(bnames.girls, .(year), summarize, x = weighted.mean(length, AbsBirths)), type='l', ylab="Mean Length")## Error in plot(ddply(bnames.girls, .(year), summarize, x = weighted.mean(length, : konnte Funktion "ddply" nicht finden
bnames.boys <- subset(bnames, sex=="boy")[, c(1,6,7)]## Error in `[.data.frame`(subset(bnames, sex == "boy"), , c(1, 6, 7)): undefined columns selected
lines(col="red", ddply(bnames.boys, .(year), summarize, x = weighted.mean(length, AbsBirths)))## Error in lines(col = "red", ddply(bnames.boys, .(year), summarize, x = weighted.mean(length, : konnte Funktion "ddply" nicht finden
First names have become longer in the US over time. Interestingly, girls' names (in black) are more variable in length than boys' names (in red).
bnames.before1944 <- subset(bnames, year < 1944)
counts.before1944 <- tapply(bnames.before1944$AbsBirths, bnames.before1944$name, sum)
bnames.from1944 <- subset(bnames, year >= 1944)
counts.from1944 <- tapply(bnames.from1944$AbsBirths, bnames.from1944$name, sum)
bnames.from1944$rank <- nrow(bnames.from1944) + 1 -rank(bnames.from1944$prop)
#Merge into 2 data frame
bnames.counts <- merge(as.data.frame(counts.before1944), as.data.frame(counts.from1944), all=TRUE, by="row.names")
bnames.counts$rank.before1944 <- nrow(bnames.counts) + 1 - rank(bnames.counts$counts.before1944, na.last = "keep")
bnames.counts$rank.from1944 <- rank(bnames.counts$counts.from1944, na.last = "keep")
#RankProduct
bnames.counts$RankProd <- bnames.counts$rank.before1944 * bnames.counts$rank.from1944
head(bnames.counts[order(decreasing = FALSE, bnames.counts$RankProd), 1:3])## Row.names counts.before1944 counts.from1944
## 490 Arlo 2104 43
## 2354 Farris 676 43
## 4579 Marshal 366 43
## 2789 Haskell 2541 44
## 4180 Llewellyn 1470 46
## 1963 Earlie 896 46
#Names becoming popular only from 1944
head(bnames.counts[order(decreasing = TRUE, bnames.counts$RankProd), 1:3])## Row.names counts.before1944 counts.from1944
## 5664 Ryan 10 850573
## 2216 Erin 9 307579
## 1189 Chase 5 108247
## 1808 Diego 5 69650
## 1353 Cody 24 266989
## 6385 Tyler 38 532129
(advanced) 7. Think of another questions you could answer with the dataset. E.g. Identify the most popular firstname in 1980ies the or identify the most popular name that was used for boys and girls.
The next data set is the distribution of hair and eye color and sex in 592 statistics students stored in the table HairEyeColor (see ?HairEyeColor for details).
We load the data set with the data() function and have a look at the structure using str().
data("HairEyeColor")
str(HairEyeColor)## table [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
## - attr(*, "dimnames")=List of 3
## ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
## ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
## ..$ Sex : chr [1:2] "Male" "Female"
mosaicplot(HairEyeColor)The movies data set is from the ggplot2 package. The internet movie database,
http://imdb.com/, is a website devoted to collecting movie
data supplied by studios and fans (See ?movies for details).
The data set contains data for 58'788 movies, namely the title of the movie, year of release, budget, length, rating and genre.
str(movies)## 'data.frame': 58788 obs. of 24 variables:
## $ title : chr "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
## $ year : int 1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
## $ length : int 121 71 7 70 71 91 93 25 97 61 ...
## $ budget : int NA NA NA NA NA NA NA NA NA NA ...
## $ rating : num 6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
## $ votes : int 348 20 5 6 17 45 200 24 18 51 ...
## $ r1 : num 4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
## $ r2 : num 4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
## $ r3 : num 4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
## $ r4 : num 4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
## $ r5 : num 14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
## $ r6 : num 24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
## $ r7 : num 24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
## $ r8 : num 14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
## $ r9 : num 4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
## $ r10 : num 4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
## $ mpaa : Factor w/ 5 levels "","NC-17","PG",..: 1 1 1 1 1 1 5 1 1 1 ...
## $ Action : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Animation : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Comedy : int 1 1 0 1 0 0 0 0 0 0 ...
## $ Drama : int 1 0 0 0 0 1 1 0 1 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 1 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Short : int 0 0 1 0 0 0 0 1 0 0 ...
hist(movies$rating)boxplot(movies$rating ~ movies$year)
Some movies obtained less than 10 votes. Remove them and repeat the plotting. Do you see a change?
movies$Genre.simple <- ifelse(movies$Action == 1, "Action", ifelse(movies$Comedy == 1, "Comedy", ifelse(movies$Drama == 1, "Drama", "other")))
movies.conf <- subset(movies, votes >= 10)
boxplot(movies.conf$rating ~ movies.conf$Genre.simple)By default genres are alphabetically ordered. Let's order the genres according to their median rating (tip: use factors and function factor()).
movies$Genre.simple <- factor(movies$Genre.simple, levels = c("Action","other","Comedy","Drama"))
boxplot(movies.conf$rating ~ movies.conf$Genre.simple)movies.conf[which.max(movies.conf$length), ]## title year length budget rating votes r1 r2 r3
## 11937 Cure for Insomnia, The 1987 5220 NA 3.8 59 44.5 4.5 4.5
## r4 r5 r6 r7 r8 r9 r10 mpaa Action Animation Comedy Drama
## 11937 4.5 0 0 0 4.5 4.5 44.5 0 0 0 0
## Documentary Romance Short Genre.simple
## 11937 0 0 0 other
Tip: use the function cut to make categories.
boxplot(movies.conf$rating ~ cut(movies.conf$length, breaks = quantile(movies.conf$length)))#movies.conf.clean <- subset(movies.conf, length <= 300)
#boxplot(movies.conf.clean$rating ~ cut(movies.conf.clean$length, breaks = 10), las=3)
#But beware of categories with few members!
#tapply(movies.conf.clean$rating, cut(movies.conf.clean$length, breaks = 10), length)sumd <- aggregate(awake ~ conservation, data=msleep, FUN=mean) sumd$sd <- aggregate(awake ~ conservation, data=msleep, FUN=sd)[,2] limits <- aes(ymax = awake + sd, ymin = awake - sd) dyp <- ggplot(sumd, aes(x=conservation, y=awake)) + geom_bar(fill="grey") + theme_classic() dyp + geom_errorbar(limits, width=0.25)
See also the xtable package.
library(knitr)
kable(head(iris[,1:3]), format='html')| Sepal.Length | Sepal.Width | Petal.Length |
|---|---|---|
| 5.1 | 3.5 | 1.4 |
| 4.9 | 3.0 | 1.4 |
| 4.7 | 3.2 | 1.3 |
| 4.6 | 3.1 | 1.5 |
| 5.0 | 3.6 | 1.4 |
| 5.4 | 3.9 | 1.7 |
















