-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathLab5_Statistical_analyses_&_flextables.Rmd
More file actions
229 lines (159 loc) · 9.87 KB
/
Lab5_Statistical_analyses_&_flextables.Rmd
File metadata and controls
229 lines (159 loc) · 9.87 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
---
title: 'Lab 5: Statistical analysis & flextable'
author: "Isaias Perez"
date: "`r Sys.Date()`"
output:
word_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r installing packages}
#install.packages("tidyverse")
#install.packages("psych")
#install.packages("felxtable")
#install.packages("Stat2Data")
#install.packages("Hmisc")
#install.packages("htmltools")
```
```{r loading pakages}
library(tidyverse)
library(psych)
library(flextable)
library(Stat2Data)
library(Hmisc)
library("htmltools")
```
```{r section 1 }
data("HorsePrices") #saves the pre constructed data set to a variable named data within the enviroment.
data <-HorsePrices %>% mutate(Prices = as.numeric (Price)) #This line of code changes the prices column to be numeric.
head(data)
```
*#For question 1, select the column that tells us the Horse ID numbers and remove it from our data set. Name this updated data set ‘data.’*
```{r question 1}
data <- data %>%
select(-HorseID,-Prices) #removes ID column and Prices column. Prices was duplicated in the previous section.
```
*\# For question 2, identify which column is not a numeric column. Then describe how to interpret the data within this column, including what numbers the unique scores within this column are associated with.*
```{r question 2}
head(data) #displays the dataset.
str(data)
# The column "sex" is not a numeric, but instead is a factor that has two levels which are Male and Female. Female is coded as 1 and male is coded as 2.
```
*#For question 3, change the lone non-numeric column to numeric. Save this updated dataset as data and then use your structure call to show me that it was appropriately changed.*
```{r question 3}
#This line creates a new column for sex and adjusts the column to be numeric
data <-
data %>%
mutate(Sex = as.numeric(Sex))
str(data) #This line shows what changes were made.
```
*#For question 4, identify the mean, standard deviation, minimum, and maximum for the ‘Price’ variable.*
```{r qustion 4}
describe(data$Price)
describe(data)
#The following descriptive statistics are in reference to the price variable.
#Mean = 26840.00
#Standard Deviation = 14980.22
#Minimum = 1100
#Maximum = 60000
```
*\# For question 5, tell me which column has missing values and how you know that, and then remove any rows with missing values from your dataset. Save this updated dataset as data and then use your structure call to show me that it was appropriately changed.*
#The height column is missing 3 values and is displayed as NA within the dataset. This can be observed briefly scanning through the dataset.
```{r question 5}
na.rows <- apply(is.na(data), 1, any) #this line of code identifies any NA in the dataframe
print(data[na.rows, ]) #This line of code makes visable where in the dataframe NA is present.
data <- drop_na(data) #this line of code removes any NA values in the dataframe
print(data$Height) # this line shows the height column no longer has NA.
```
*#For question 6, compare the means and standard deviations of the ‘full’ dataset that featured NA values to the ‘reduced’ dataset that only has rows with complete data. Tell me if you think any of these values have meaningfully changed.*
```{r question 6}
describe(data)
# While the changes were slight for both the means and standard deviation for all the categories, I do believe this is important as the data is more accurate and represents a refined dataset.
```
#This chunk is needed in order to answer question 7. It is a function that takes the dataset and creates a correlation table.
```{r Corrstars Function}
corstars <-function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"),
result=c("none", "html", "latex")){
#Compute correlation matrix
require(Hmisc)
x <- as.matrix(x)
correlation_matrix<-rcorr(x, type=method[1])
R <- correlation_matrix$r # Matrix of correlation coeficients
p <- correlation_matrix$P # Matrix of p-value
## Define notions for significance levels; spacing is important.
mystars <- ifelse(p < .0001, "** ",
ifelse(p < .001, "** ",
ifelse(p < .01, "** ",
ifelse(p < .05, "* ", " "))))
## trunctuate the correlation matrix to two decimal
R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
## build a new matrix that includes the correlations with their apropriate stars
Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
diag(Rnew) <- paste(diag(R), " ", sep="")
rownames(Rnew) <- colnames(x)
colnames(Rnew) <- paste(colnames(x), "", sep="")
## remove upper triangle of correlation matrix
if(removeTriangle[1]=="upper"){
Rnew <- as.matrix(Rnew)
Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove lower triangle of correlation matrix
else if(removeTriangle[1]=="lower"){
Rnew <- as.matrix(Rnew)
Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove last column and return the correlation matrix
return( Rnew)
}
```
*\# For question 7, interpret the correlations between variables. Following your interpretations of each correlation, tell me what aspects generally relate to a horse having a greater price.*
```{r question 7}
corrs <- corstars(data) #Applies the corstars function to the dataset named "data" and renames it to "corrs".
corrs
#Using the corstars function, height and sex are the variables that are positively correlated with price of the horse. Price and height has a significant p-value <. 0.05. As well as, price and sex which has a significant p-value < 0.05.
```
*#For question 8, create a linear regression model that uses price as the outcome and all of the other variables as predictors. Additionally, use the summary function and then identify which of the predictors are statistically significant predictors of price based on the results of this analysis. Following your interpretations of each predictors statistical significance, tell me what aspects generally relate to a horse having a greater price.*
```{r question 8 linear regression model}
lm_summary <- lm(Price ~ Age + Sex + Height, data = data)# produces a linear model equation for the data
Lm_summary <- summary(lm_summary)
Lm_summary
#According to the generated LM, a horses Sex, Height, and Age all are statistically significant at an alpha value of 0.05. As height increases a horses price is also expected to increase. Additionally, A horses sex is expected to increase the overall horses price, especially if the horse is male. In contrast, as a horse ages, price is expected to fall.
```
\#*For question 9, add hashtags to each of these lines that tell me exactly what each line does.*
```{r}
corrs <- corrs %>% rownames_to_column(var = "Variables") # This line adds a column at the start of the dataset corrs and names it variables.
flex_corr <- flextable(corrs) %>% # This line assigns the name flex_corr to the flextable being created.
bold(part = "header") %>% #This formats the header to be in bold font.
width(j = 1, width = 1) %>% # This portion adjust the width of of the columns.
align(part = "header", align = "left") %>% #this line maiplates the header once more and postions it towards the left.
add_footer_row(top = FALSE, values = "Note: Sex was dummy coded such that 1 = Female and 2 = Male", colwidths = 5) %>% #this creates a footer in reference to the manner sex was coded and manipulates the column width.
add_header_lines(values = "Table 1: Correlations between items") %>% #This line assigns the table a header/ title.
italic(part = "header", i = 1) %>% # This line formats the header to be italicized.
border_remove() %>% # removes the border above the header
hline(part = "header", i = c(1,2)) %>% # Adds a line to separate the variable names from the body of the table.
hline(part = "body", i = 4)# Adds a line to separate the body of the table from the footer.
flex_corr #prints the table into the workspace
```
*#. For question 10: i. Tidy your data using the broom package to reduce the dataset to only the show the statistical output pertaining to the regression coefficients (term, estimate, std.error, statistic, p.value). ii. Reduce this output by only selecting the term, estimate, and p.value columns. iii. Rename the term, estimate, and p. value functions as follows:1. term becomes “Term” 2. estimate becomes “Estimate” 3. p.value becomes “p” iv. Then use the flextable commands to approximately recreate this table.*
```{r}
Lm_summary_flextable <- #These two line are responsible for using the broom function on the "Lm_summary" and renaming it to "Lm_summary_flextable"
broom::tidy(Lm_summary)
Lm_summary_flextable <- Lm_summary_flextable %>%
select(term, estimate, p.value) %>%
rename("Term" = term, "Estimate" = estimate, "p" = p.value) #This section is responsible for selecting the columns term, estimate, and p.value to add to the flextable. Additionally it renames the following variable to correct letter capitalization.
finishedtable <- flextable(Lm_summary_flextable) %>% # This section is dedicated to formatting and the creation of the visuals for the flextable.
bold(part = "header") %>%
width(j = 1, width = 3) %>%
align(part = "header", align = "left") %>%
add_footer_row(top = FALSE, values = "Note: Unstandardarized regresion estimates reported. Sex dummy coded such that Female = 1 and Male = 2. Signifcant p values (< .05) are italicized", colwidths = 3) %>%
add_header_lines(values = "Table 2: Regression predicting horse sale price") %>%
italic(part = "header", i = 1) %>%
border_remove() %>%
hline(part = "header", i = c(1,2)) %>%
hline(part = "body", i = 4)
finishedtable
```