# Intro to Data Visualization with ggplot2
<https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2>
## Introduction
**Changing One Geom Or Every Geom**
If you have multiple geoms, then mapping an aesthetic to a data variable inside the call to `ggplot()` changes all the geoms. You can also change individual geoms by passing arguments to the `geom_*()` functions.
`geom_point()` has an `alpha` argument that controls the opacity of the points. A value of `1` (the default) means that the points are totally opaque; a value of `0` means the points are totally transparent (and therefore invisible). Values in between give partial transparency.
`geom_smooth()` adds a smooth trend curve.
```{r include=FALSE}
library(ggplot2)
library(dplyr)
```
```{r}
ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth()
```
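As a rough illustration of the two places a mapping can live, the sketch below builds two variants of the plot above. The object names `plt_all_geoms` and `plt_points_only` are made up for this example.

```{r}
library(ggplot2)

# Mapping in ggplot(): inherited by every layer, so geom_smooth()
# draws one trend curve per clarity level.
plt_all_geoms <- ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth()

# Mapping inside geom_point(): only the points are colored, and
# geom_smooth() draws a single curve for the whole dataset.
plt_points_only <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = clarity), alpha = 0.4) +
  geom_smooth()

plt_all_geoms
plt_points_only
```

Putting the mapping in `ggplot()` is convenient when every layer should share it; putting it in a single geom keeps the other layers unaffected.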
**Saving Plots As Variables**
Plots can be saved as variables, which can be added to later on using the `+` operator. This is really useful for making multiple related plots from a common base.
```{r}
# Save the data and aesthetic mappings in a variable to make a base:
plt_price_vs_carat <- ggplot(diamonds, aes(carat, price))
# Add a point layer to the base to make a new plot:
plt_price_vs_carat_by_clarity <- plt_price_vs_carat + geom_point(aes(color = clarity))
# See the plot
plt_price_vs_carat_by_clarity
```
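A minimal sketch of reusing one base for several related plots follows. The names `plt_base`, `plt_by_cut`, and `plt_trend` are illustrative and not part of the course material.

```{r}
library(ggplot2)

# Common base: data and positional mappings only (no layers yet)
plt_base <- ggplot(diamonds, aes(carat, price))

# Two related plots built from the same base
plt_by_cut <- plt_base + geom_point(aes(color = cut), alpha = 0.4)
plt_trend  <- plt_base + geom_smooth()

# The base itself is unchanged: adding a layer returns a new plot object
length(plt_base$layers)
length(plt_by_cut$layers)
```

Because `+` does not modify the base in place, the same base can safely be extended in different directions.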
## Aesthetics
**Typical Visible Aesthetics**
```{r echo=FALSE}
Aesthetic <- c("x", "y", "fill", "color", "size", "alpha", "linetype", "label", "shape")
Description <- c("X-axis position", "Y-axis position", "Fill color", "Color of points, outlines of other geoms", "Area or radius of points, thickness of lines", "Transparency", "Line dash pattern", "Text on a plot or axes", "Shape")
data.frame(Aesthetic, Description)
```
**All About Aesthetics: Color, Shape And Size**
These are the aesthetics you can consider within `aes()`: `x`, `y`, `color`, `fill`, `size`, `alpha`, `label` and `shape`.
```{r}
diamonds
ggplot(diamonds, aes(carat, price, color = color)) +
  # Set the shape and size of the points
  geom_point(shape = 1, size = 1.5)
```
**All About Aesthetics: Color VS. Fill**
Typically, the `color` aesthetic changes the outline of a geom and the `fill` aesthetic changes the inside. `geom_point()` is an exception: you use `color` (not `fill`) for the point color.
The default `geom_point()` uses `shape = 19`: a solid circle. An alternative is `shape = 21`: a circle that allows you to use both `fill` for the inside and `color` for the outline. This lets you map two aesthetics to each point:
```{r}
ggplot(diamonds, aes(carat, price, fill = clarity, color = color)) +
  geom_point(shape = 21, size = 4, alpha = 0.6)
```
**All About Aesthetics: Alpha, Shape And Label**
<center>**Be careful**: the attributes can overwrite the aesthetics of a plot!</center>
<center>**Alpha**</center>
```{r}
# Base layer
diamonds_E <- diamonds %>%
  filter(color == "E")
plt_diam_E <- ggplot(diamonds_E, aes(carat, price))
# Map clarity to alpha:
plt_diam_E +
  geom_point(aes(alpha = clarity))
```
<center>**Shape**</center>
Map a variable to the `shape` aesthetic so that the symbol drawn for each observation depends on that variable:
```{r}
# Base layer
diamonds_E <- diamonds %>%
  filter(color == "E")
plt_diam_E <- ggplot(diamonds_E, aes(carat, price))
# Map cut to shape:
plt_diam_E +
  geom_point(aes(shape = cut))
```
<center>**Label**</center>
Map a variable to the `label` aesthetic in a `geom_text()` layer to draw text in place of the point symbols:
```{r}
diamonds_Fair <- diamonds %>%
  filter(cut == "Fair")
plt_diam_Fair <- ggplot(diamonds_Fair, aes(carat, price))
# Use text layer and map color to label
plt_diam_Fair +
  geom_text(aes(label = color))
```
**All About Attributes: Color, Shape, And Size**
Set `color`, `shape`, and `size` as attributes, that is, as fixed arguments of the geom function placed outside `aes()`:
```{r}
# A hexadecimal color
my_blue <- "#4ABEFF"
# Change the color mapping to a fill mapping
ggplot(diamonds_Fair, aes(carat, price, fill = "E")) +
  # Set point size and shape
  geom_point(color = my_blue, size = 7, shape = 5, alpha = 0.4)
```
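The difference between an attribute and an aesthetic is easiest to see when the same value is passed both ways. The sketch below uses `mtcars` and made-up object names; `ggplot_build()` is used only to inspect the colors that ggplot2 computes.

```{r}
library(ggplot2)

# Attribute: color is a fixed argument of the geom, so every point is blue
plt_attribute <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(color = "blue")

# Aesthetic: "blue" is treated as a one-level data variable, so the points
# get the first color of the default discrete scale, plus a legend
plt_aesthetic <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(color = "blue"))

# Inspect the colors ggplot2 actually computed for each layer
unique(ggplot_build(plt_attribute)$data[[1]]$colour)
unique(ggplot_build(plt_aesthetic)$data[[1]]$colour)
```

If a plot shows an unexpected color together with a one-entry legend, a constant placed inside `aes()` is a likely cause.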
**Aesthetic Label Functions**
Use these functions to improve the appearance of a plot:
`labs()` sets the x- and y-axis labels. It takes a string for each argument.
`scale_color_manual()` defines properties of the color scale (a scale plays the same role for color that an axis plays for position). The first argument sets the legend title. `values` is a named vector of colors to use. `scale_fill_manual()` does the same for the fill scale.
```{r}
palette <- c(E = "#377EB8")
ggplot(diamonds_Fair, aes(carat, price, fill = "E")) +
  geom_point(color = my_blue, size = 7, shape = 5, alpha = 0.4) +
  labs(x = "This is a new x-axis name", y = "New for Y") +
  scale_fill_manual("Color of the Diamonds", values = palette)
```
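With more than one level, the names of the `values` vector decide which level gets which color. The following is a sketch on `iris`; the palette, the axis labels, and the object names are assumptions chosen for illustration.

```{r}
library(ggplot2)

# Hypothetical palette: the names must match the levels of the mapped variable
species_palette <- c(
  setosa     = "#1B9E77",
  versicolor = "#D95F02",
  virginica  = "#7570B3"
)

plt_iris_palette <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  labs(x = "Sepal length (cm)", y = "Sepal width (cm)") +
  # name sets the legend title; values maps each level to a color
  scale_color_manual(name = "Iris species", values = species_palette)

plt_iris_palette
```

Naming the vector makes the assignment independent of the order of the factor levels.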
**Setting Axis Limits**
Specify the limits as separate arguments, or as a single numeric vector. That is, `ylim(lo, hi)` or `ylim(c(lo, hi))`.
```{r}
ggplot(mtcars, aes(mpg, 0)) +
  geom_jitter() +
  # Set the y-axis limits
  ylim(-2, 2)
```
To spread out clustered points, use `geom_jitter()` or `geom_point(position = "jitter")`.
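The two equivalences mentioned in this section can be checked with a short sketch. The object names are made up, and inspecting the `position` field of a layer is used here only as a quick way to look inside it.

```{r}
library(ggplot2)

# Both calls create a point layer that uses jitter positioning
layer_a <- geom_jitter()
layer_b <- geom_point(position = "jitter")
class(layer_a$position)[1]
class(layer_b$position)[1]

# ylim() accepts the limits as two arguments or as a single vector
plt_two_args <- ggplot(mtcars, aes(mpg, 0)) + geom_jitter() + ylim(-2, 2)
plt_vector   <- ggplot(mtcars, aes(mpg, 0)) + geom_jitter() + ylim(c(-2, 2))
```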
## Geometries
Scatter plots (using `geom_point()`) are intuitive, easily understood, and very common, but we must always consider overplotting, particularly in the following four situations:
1. Large datasets
2. Aligned values on a single axis
3. Low-precision data
4. Integer data
**How to Deal with Overplotting In Scatter Plots**
Set the `shape` argument to `"."` to draw each point as a tiny dot, which reduces overplotting in large datasets:
```{r}
# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))
# Add a point layer with tiny points
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = ".")
```
**How to Deal with "Aligned Values" And "Low-precision Data" In Scatter Plots**
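Values align on a single axis when one axis is continuous and the other is categorical. Low-precision data, such as the `iris` measurements recorded to one decimal place, pile up on the same coordinates in a similar way. Both problems can be overcome with some form of jittering: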
```{r}
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Set the position to jitter
  geom_point(position = "jitter", alpha = 0.5)
```
Another way to jitter, which also lets you control the amount of jitter, is to use `position_jitter()`:
```{r}
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Use a jitter position function with width 0.1
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5)
```
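Jitter is random, so the plot changes slightly on every render. One possible way to make it reproducible is the `seed` argument of `position_jitter()`. In the sketch below, the seed value, `height = 0`, and the object name are arbitrary choices for illustration.

```{r}
library(ggplot2)

# Jitter horizontally only, with a fixed seed so the plot is reproducible
plt_jitter_seeded <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(
    position = position_jitter(width = 0.1, height = 0, seed = 42),
    alpha = 0.5
  )

plt_jitter_seeded
```

Setting `height = 0` leaves the y values untouched, which is useful when only one axis suffers from aligned values.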
**Positions In Histograms**
Histograms are specialized versions of bar plots.
There are various ways of applying positions to histograms. `geom_histogram()`, a special case of `geom_bar()`, has a `position` argument that can take on the following values:
1. `stack` (the default): Bars for different groups are stacked on top of each other.
2. `dodge`: Bars for different groups are placed side by side: `geom_histogram(binwidth = 1, position = "dodge")`
3. `fill`: Bars for different groups are shown as proportions: `geom_histogram(binwidth = 1, position = "fill")`
4. `identity`: Bars for different groups are drawn at their actual heights and overlap one another, so transparency helps: `geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)`
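The four positions can be compared side by side from one base plot. This is a sketch on `iris`; the bin width of `0.1` and the object names are assumptions chosen to suit that dataset.

```{r}
library(ggplot2)

# Common base: histogram of sepal width, one fill color per species
plt_hist_base <- ggplot(iris, aes(Sepal.Width, fill = Species))

plt_hist_stack    <- plt_hist_base + geom_histogram(binwidth = 0.1)
plt_hist_dodge    <- plt_hist_base + geom_histogram(binwidth = 0.1, position = "dodge")
plt_hist_fill     <- plt_hist_base + geom_histogram(binwidth = 0.1, position = "fill")
plt_hist_identity <- plt_hist_base +
  geom_histogram(binwidth = 0.1, position = "identity", alpha = 0.4)

plt_hist_identity
```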
**Bar Plots**
`geom_bar()` counts the number of cases at each x position.
`geom_col()` plots actual values.
**NOTE:** `geom_col()` is just `geom_bar()` with the `stat` argument set to `"identity"`. It is used when we want the heights of the bars to represent the exact values in the data.
`geom_bar()` has three position options:
1. `stack`: The default.
2. `dodge`: Bars are placed side by side, which is often preferred: `geom_bar(position = "dodge")`
3. `fill`: To show proportions: `geom_bar(position = "fill")`
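To make the relationship between `geom_bar()` and `geom_col()` concrete, the sketch below draws the same bars both ways. It uses `mtcars`; `cyl_counts` and the plot names are made up for this example.

```{r}
library(ggplot2)
library(dplyr)

# geom_bar(): ggplot2 counts the rows at each x position
plt_bar <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar()

# geom_col(): the counts are computed beforehand and plotted as they are
cyl_counts <- mtcars %>%
  count(cyl)

plt_col <- ggplot(cyl_counts, aes(factor(cyl), n)) +
  geom_col()

plt_bar
plt_col
```

Both plots show the same bar heights; the only difference is whether ggplot2 or the user does the counting.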
**Overlapping Bar Plots**
To specify how much dodging (or jittering) you want, pass a position function such as `position_dodge()` (or `position_jitter()`) instead of a string such as `position = "dodge"`. A dodge width smaller than the bar width makes the bars overlap:
`geom_bar(position = position_dodge(width = 0.2))`
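A complete plot using this layer might look like the sketch below. The dataset, the `alpha` value, and the name `plt_overlap` are assumptions added for illustration.

```{r}
library(ggplot2)

# Bars for each am group are shifted by less than a full bar width,
# so they partially overlap; transparency keeps all of them visible
plt_overlap <- ggplot(mtcars, aes(factor(cyl), fill = factor(am))) +
  geom_bar(position = position_dodge(width = 0.2), alpha = 0.6)

plt_overlap
```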
## Themes
The themes layer controls the visual elements that aren't part of the data. There are three types of element:
```{r echo=FALSE}
types <- c("text", "line", "rectangle")
modified_using <- c("element_text()", "element_line()", "element_rect()")
data.frame(types, modified_using)
```
To change stylistic elements of a plot, call `theme()` and set plot properties to a new value.
**Moving The Legend**
The legend is the little box on the right side of the plot that shows up when an aesthetic like `color` or `shape` is mapped to a variable. To move it, set `legend.position`:
`p + theme(legend.position = new_value)`
The new value can be:
1. `"top"`, `"bottom"`, `"left"`, or `"right"`: place it at that side of the plot.
2. `"none"`: don't draw it.
3. `c(x, y)`: place it inside the plot, where `c(0, 0)` means the bottom-left and `c(1, 1)` means the top-right.
```{r}
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  theme(legend.position = "left")
```
**Modifying Theme Elements**
Many plot elements have multiple properties that can be set.
Line elements in the plot such as axes and gridlines have a color, a thickness (`size`, renamed `linewidth` in ggplot2 3.4.0), and a line type (`"solid"`, `"dashed"`, or `"dotted"`). To set the style of a line, use `element_line()`. For example, to make the axis lines red and dashed:
`p + theme(axis.line = element_line(color = "red", linetype = "dashed"))`
Similarly, `element_rect()` changes rectangles and `element_text()` changes text. To remove a plot element, use `element_blank()`.
```{r}
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  theme(
    legend.position = "left",
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    # Turn off axis ticks
    axis.ticks = element_blank(),
    # Turn off the panel grid
    panel.grid = element_blank()
  )
```
**Modifying Whitespace**
*Whitespace* means all the non-visible margins and spacing in the plot.
To set a single *whitespace* value, use `unit(x, unit)`, where `x` is the amount and `unit` is the unit of measure.
Borders have four sides, so use `margin(top, right, bottom, left, unit)` to set them. The arguments go clockwise, starting from 12 o'clock.
The default unit is `"pt"` (points), which scales well with text. Other options include `"cm"`, `"in"` (inches) and `"lines"` (of text).
```{r}
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  theme(legend.margin = margin(1, 1, 70, 10, "mm"))
```
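For comparison, here is a sketch that uses `unit()` for a single spacing value alongside `margin()` for a four-sided one. The theme elements `legend.key.size` and `plot.margin`, the values, and the object name are choices made for this example only.

```{r}
library(ggplot2)

# unit() sets a single spacing value; margin() sets four (top, right, bottom, left)
plt_spacing <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(alpha = 0.5) +
  theme(
    legend.key.size = unit(3, "mm"),
    plot.margin = margin(10, 30, 10, 30, "pt")
  )

plt_spacing
```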
**Built-in Themes**
In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.
1. `theme_gray()` is the default.
2. `theme_bw()` is useful when you use transparency.
3. `theme_classic()` is more traditional.
4. `theme_void()` removes everything but the data.
**Setting Themes**
Install the `ggthemes` package, which is another source of built-in themes in addition to those in `ggplot2`.
Reusing a theme across many plots helps to provide a consistent style. You have several options for this.
1. Assign the theme to a variable, and add it to each plot.
2. Set the theme as the default using `theme_set()`.
```{r}
library(ggthemes)
new_theme <- theme(
  rect = element_rect(fill = "grey92"),
  legend.key = element_rect(color = NA),
  axis.ticks = element_blank(),
  panel.grid = element_blank(),
  panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
  axis.text = element_text(color = "grey25"),
  plot.title = element_text(face = "italic", size = 16),
  legend.position = c(0.6, 0.1)
)
# Combine the WSJ theme with new_theme
new_theme_wsj <- theme_wsj() + new_theme
# Add the combined theme to the plot
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  new_theme_wsj
```
**NOTE:** a theme added after an existing theme overrides the settings of the earlier one.
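The second option, setting a default theme, could look like the sketch below. `theme_classic()` stands in for any theme, and `old_theme` is a made-up name for the value that `theme_set()` returns.

```{r}
library(ggplot2)

# theme_set() returns the previous default, so it can be restored later
old_theme <- theme_set(theme_classic())

# No theme is added here, yet the plot is drawn with theme_classic()
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(alpha = 0.5)

# Restore the previous default theme
theme_set(old_theme)
```

Keeping the old theme in a variable avoids silently changing the look of every later plot in the same session.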
**Using Geoms For Explanatory Plots**
This type of plot is in an info-viz style, meaning that it is similar to something you'd see in a magazine or on a website for a mostly lay audience.
```{r include=FALSE}
gm2007 <- structure(list(country = structure(c(16L, 13L, 19L, 14L, 11L,
2L, 20L, 1L, 5L, 12L, 4L, 6L, 9L, 17L, 15L, 3L, 18L, 8L, 7L,
10L), .Label = c("Afghanistan", "Angola", "Australia", "Canada",
"Central African Republic", "France", "Hong Kong, China", "Iceland",
"Israel", "Japan", "Lesotho", "Liberia", "Mozambique", "Sierra Leone",
"Spain", "Swaziland", "Sweden", "Switzerland", "Zambia", "Zimbabwe"
), class = "factor"), lifeExp = c(39.6, 42.1, 42.4, 42.6, 42.6,
42.7, 43.5, 43.8, 44.7, 45.7, 80.7, 80.7, 80.7, 80.9, 80.9, 81.2,
81.7, 81.8, 82.2, 82.6), continent = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 1L, 1L, 2L, 4L, 3L, 4L, 4L, 5L, 4L, 4L, 3L,
3L), .Label = c("Africa", "Americas", "Asia", "Europe", "Oceania"
), class = "factor")), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
```
```{r}
# Add a geom_segment() layer (key component of the plot structure below)
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  # Add a geom_text() layer
  geom_text(aes(label = round(lifeExp, 1)), color = "white", size = 1.5) +
  # Modify the scales
  scale_x_continuous("", expand = c(0, 0), limits = c(30, 90), position = "top") +
  # Add a title and caption
  labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder")
```