-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathLab2_ggplot.Rmd
More file actions
189 lines (131 loc) · 10.3 KB
/
Lab2_ggplot.Rmd
File metadata and controls
189 lines (131 loc) · 10.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: 'Lab 2: ggplot'
author: "Isaias Perez"
date: "`r Sys.Date()`"
output:
word_document: default
html_document: default
---
```{r intall of ggthemes}
#installed.packages("ggthemes")
```
```{r Loading packages}
library(tidyverse)
library(here)
library(ggthemes)
```
```{r funtion here}
here()
```
```{r assigning dataset "mtcars" variable name}
data <- mtcars
```
*#1. To begin, we will plot the correlation between horsepower (hp) and miles per gallon (mpg). To do so, we will use the below code.*
*#For question 1, please explain to me what each of the following parts of the code tell R to do in human terms.*
*#ggplot(): This section of code sets up an environment for the graph to be constructed in.*
*#data = data: This section directs R to utilize the data set mtcars, which as been labeled as "data".*
*#aes(x = wt, y = mpg): this section is responsible for assigning categories to the a-xis and y-axis on the scatter plot.*
*#+ geom_point(): This section refers to the type of graph being constructed.*
```{r Scatter plot using wt on the X-axis & mpg on the y-axis}
ggplot(data = data, aes(x = wt, y = mpg)) + geom_point()
```
*#2. For question 2, please interpret the plot; in other words, tell me what you learned about these variables by viewing the plot. In particular, be sure to note whether the relationship between the variable is positive or negative.*
*\# This scatter plot depicts a strong negative linear relationship between wt and mpg. More so, the more wt increases the more mpg decreases.*
*\# 3.The previous plot you created is a scatter plot. Scatter plots are useful for displaying the relationship between two continuous variables (such as wt and mpg). The creation of a scatter plot is often a “layer” that precedes the creation of a plot depicting the relationship between two variables, often depicted using a correlation (line). Before we plot the correlation (line) of the two variables, we should see what the correlation between the two variables is. To do so, use the below code:*
```{r correlation of wt & mpg}
cor.test(x = data$wt, y = data$mpg)
```
*\# For question 3, interpret the correlation. Within this interpretation be sure to note the direction of the correlation and the statistical significance of the correlation (p \< .05 as our cutoff for significance) and note whether these results align with your interpretation of the relationship within question 2.*
*\# The computed p-value of 1.294e-10 is significantly smaller then the alpha-value of .05. We reject the null hypothesis that is wt and mpg have no influence on one another and accept the alternative hypothesis.*
*#4. Now that we’ve estimated the relationship, we can plot it. To do so, we are going to use three lines of code.* *#plot \<- ggplot(data = data, aes(x = wt, y = mpg)) + geom_point() #saves our plot under plot* *#plot \<- plot + stat_smooth(method = "lm", se = FALSE) #adds our correlation line using a linear model (lm)* *#plot #calls the updated plot so that we can view it* *#For question 4, explain to me whether you think this plot is easier or harder to interpret than the plot we created in question 2 as well as why you think this.*
```{r saving constructed plot under the variable name "plot"}
plot <- ggplot(data = data, aes( x = wt, y = mpg)) + geom_point()
```
```{r addition of correlation line using a linear model}
plot <- plot + stat_smooth(method = "lm", se = FALSE)
```
```{r calling updated plot}
plot
```
*#For question 4, explain to me whether you think this plot is easier or harder to interpret than the plot we created in question 2 as well as why you think this.*
*\# This plot is easier to read as it provides a line of best fit for the data points and makes the negative correlation more apparent to viewers.*
*#5. One of the nice parts of ggplot2 is that the plots are extremely customizable. For example, we can use “themes” to change the visual scheme (colors, fonts, etc.). To begin, let’s write three lines of code*
```{r scatter plot ran with theme_bw()}
bwplot <- plot + theme_bw()
print(bwplot)
```
```{r scatter plot ran with theme_void}
plot + theme_void()
```
```{r scatter plot ran with theme_dark}
plot + theme_dark()
```
*#. For question 5, use either "<https://ggplot2-book.org/polishing.html>" or "<https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/>" and choose an additional two themes (in addition to the 3 I provided). After doing so, tell me which of these 5 visualization options you would use if your intent was to make your graph easiest to view in a presentation delivered within our classroom (as well as why). Also identify which one you found least interpretable and why.*
```{r scatter plot ran using ggthemes: theme_solarized}
plot + theme_solarized()
```
```{r plot ran with ggthemes: theme_igray}
plot + theme_igray()
```
```{r saving theme_bw as plot}
plotnew <- plot + theme_bw()
print(plotnew)
```
*\# From these 5 different themes, I would to choose to use ggthemes: theme_igray(), as it provides good contrast and does not over complicate the depiction of the graph. ggthemes: theme_viod is my least favorite as it does not offer enough information on what the graph is and the measurements associated with it.*
*#6. To this point, we have largely relied on generic ggplot calls. However, each plot is going to be unique if only in that it will require us to provide a specific title and confirm that our X and Y axis labels are interpretable. Within our current plot, we do not have a title and neither mpg or wt is all that intuitive as a variable name. For question 6, complete the code to include a titles for the chart, x-axis, and y-axis*
```{r title added to scatter plot}
plot <- plot + theme_igray() +
ggtitle(" The relationship between car weight and miles per gallon")+
labs(y = "Miles per gallon", x = "Car weight in tons")+
stat_smooth(method = "lm", se = FALSE)
print(plot)
```
*#7.Importantly, while we can use preset themes, we can also edit those themes to our liking. I want you to see the degree of customization that we have, so: For question 7, complete the code in the ‘b’ sub-bullets (highlighted in yellow)*
```{r customization of plot}
plot +
theme(plot.title = element_text(color = "red", size = (18))) +
theme(axis.title.x = element_text(color = "blue", size = (14))) +
theme(axis.title.y = element_text(color = "orange", size = (12))) +
theme(panel.background = element_rect(fill = "black")) +
geom_point(color = "white") +
stat_smooth(method = "lm", se = FALSE, color = "pink")
```
*#8. Often, we want to consider how a relationship between variables varies based on another variable (this is called moderation). This can also be plotted using ggplot. Within this example, we are going to see how the relationships vary based on the number of car cylinders (4, 6, or 8).*
```{r moderation relationship between variables based on another variable}
plot + facet_wrap(~cyl)
```
```{r separation line incorporated for each clyinders}
ggplot(data = data, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point() + stat_smooth(method = "lm", se = FALSE)
```
*#For question 8, examine the two plots and tell me which one you would opt to use and why?* \# I would use the second plot due to the colored visual representation of each cylinder. Additionally the second plot has a legened which helps viewers comprehend which each color represents.
#\*9. If you look at the formatting for our plot that created individual correlation lines for each amount of cylinders within one plot (see bullet 8.a.ii.1.), you will see that the previously formatting we added (theme, title, axis labels, etc.) are no longer updated. For question 9, repurpose the code we previously used to update each aspect of the format and apply it to this new figure. Save this figure as Plot_Moderated.
```{r addition of theme, title, axis labels, and renaming as plot_moderated}
plot_moderated <-ggplot(data = data, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point() + stat_smooth(method = "lm", se = FALSE) +
ggtitle(" The relationship between car weight and miles per gallon")+
labs(y = "Miles per gallon", x = "Car weight in tons")
print(plot_moderated)
```
*#10. While being able to create data visualizations is one job of a data scientist, the larger job is to answer questions. I want you to begin practicing this skill. Accordingly, for question 10, I want you to pick 2 variables from the mtcars that you would like to examine the relationship between and recreate the above steps (note: the above code can be adapted for each bulleted expectation found below). Specifically, I expect you to:*
*#a. Pick 2 variables from the mtcars dataset (previously named data) and justify why you think it would be interesting to look at the relationship between them.*
# for this new plot I plan to evaluate the potenial relationship between horsepower (hp) and 1/4 mile time (qsec) as I am curious to see if hp translates to a faster 1/4 mile time.
\#*b. Calculate the correlation between them.*
```{r correlation between hp and qsec}
cor.test(x = data$qsec, y = data$hp)
format (5.766e-06, scientific = FALSE)
```
*#c. Plot the two variables within a scatter plot.*
```{r scatter plot of qsec and hp}
ggplot(data = data, aes( x = qsec, y = hp, color = factor(cyl)))+
geom_point() + stat_smooth(method = "lm", se = FALSE)
```
d. Format the plot such that you would be comfortable showing it to the class.
```{r scatter plot of qsec and hp with proper titles}
ggplot(data = data, aes( x = qsec, y = hp, color = factor(cyl)))+
geom_point() + stat_smooth(method = "lm", se = FALSE)+
ggtitle ("The relationship between 1/4 mile time(qsec) and horsepower (hp)")+
labs(y = "horsepower(hp)", x = "1/4 mile time(qsec)")
```
\#*e. Provide a paragraph summary of your takeaways regarding the relationship between the two variables.*
There is a negative relationship between hp and qsec within the dataset of mtcars. It can be observed in the graph that as hp increase, qsec time decreases. This shows that the higher hp a vehicles has, the less time it takes to complete a 1/4 run. Additionally, the graph also shows that the amount of cylinders(cyl) a car has relates to the hp it has. 8 cyl vehicles typically had higher amounts of hp compared to vehicles with only 4 cyl.