StatInference_project/rough_draft_analysis.Rmd at master · srive326/StatInference_project · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
title: "Stats Project"
output: html_document
---

In this project, we're using sales data from Safeway Eastern's floral department. This data only shows the flowers/bouquets sold at Safeway Eastern supermarkets that were bought from Intergreen USA, a flower importing company which supplies different supermarkets across the US. There are a total of 9 distinct products sold and the sales occurred between April and October 2017.

We're interested in seeing the relationships between several predictor variables and sales to determine if there's a particular independent variable that drives or hurts flower sales.
One of these variables is the closing price of the S&P 500 for a given day. We chose this variable because the S&P 500 includes stocks from large companies that sell consumer staples like Johnson and Johnson, Heinz, Colgate, and Coca-Cola. Many of their products are sold in supermarkets, so our reasoning is that the S&P will reflect the movement in supermarkets. Similarly, since the flowers in the data are being sold in supermarkets this could be a good predictor for flower sales.


The code below imports the flower sales data and combines it with the S&P data.
```{r cars}
library('plyr')
library('bit64')
library('dplyr')
library('reshape2')
library('ggplot2')
library('data.table')

setwd("C:/Users/Holex/Desktop/Stats_Project")

#Sidenote: I saved excel file as a txt before reading it as a table
flower_data <- read.table("flower_data.txt", header = TRUE)
flower_data_apr <- fread("april_june.csv", header = TRUE)


#line below ensures column is date and ordered
flower_data$Txn_Date <- as.Date(flower_data$Txn_Date, format = "%m/%d/%Y")
flower_data_apr$Day <- as.Date(flower_data_apr$Day, format = "%m/%d/%Y")


#Since there are multiple sales on a given day, I want to aggregate the sales based on the date
date_net <- aggregate(Net ~ Txn_Date, flower_data,sum)
colnames(flower_data_apr)[6] <- "Net"
date2_net <- aggregate(Net ~ Day, flower_data_apr,sum)

#renaming columns to make joining easier later
#colnames(date2_net)[2] <- "Net"
colnames(date2_net)[1] <- "Date"

#read in financial data
stock_info <- fread("stockdata.csv")

stock_info2 <- fread("april_june_finance.csv")
stock_info2$Date <- as.Date(stock_info2$Date, format = "%m/%d/%Y")
#change column name so i can do a join
colnames(date_net)[1] <- "Date"
combodata <- join(date_net,stock_info, type="right")
combodata2 <- join(date2_net, stock_info2, type = "right")
combodata2$Date <- as.character(combodata2$Date)
#combo3 <- merge(x = date2_net, y = stock_info2, by = "Date", all= TRUE)

#bind 2 data frames (data from april-june and sep-oct)
results <- rbind(combodata2, combodata)

```
Next, we looked at plots of flower sales against the S&P open, close, high, adjusted close, low, and volume individually to see which plots looked linear and could make it to the next stage of the data analysis process.

```{r}
plot(results$Open, results$Net)
plot(results$Close, results$Net)
plot(results$High, results$Net)
plot(results$`Adj Close`, results$Net)
plot(results$Low, results$Net)
plot(results$Volume, results$Net)
```
The relatiopnship between net sales and adjusted close and net sales and open looks linear based on the scatterplot. The next step is to run diagnostics to check if the assumptions of linearity hold.


```{r}
l.model_close <- lm(results$Net ~ results$`Adj Close`)
summary(l.model_close)
#plots
plot(l.model_close)
```
The plot of the residuals shows that

```{r}
l.model_open <- lm(results$Net ~ results$Open)
summary(l.model_open)
#plots
plot(l.model_open)
```
One of the assumptions of linearity is that the error terms are independent and not correlated. The plot of the residuals shows that there isn't a significant pattern between the residuals and fitted values but they aren't very spread and random either. There are also a few outliers.

The normal q-q plot is used to see if the error terms are normally distributed, if the points lie along the dotted line this is an indicator that the error terms are normally distributed. This q-q plot looks light tailed and there are outliers at the right end which are concerning although most points do lie on the dotted line between -1 and 1.

The scale location checks for homoscedasticity, the assumption of equal variance. In this case the points look spread out.

The residual vs leverage plot helps to find influential cases, values in the top right corner would be
an influential outlier, we don't seem to have any in this case.

Since the scatter plot of net sales and the S&P close seems to curve downward near the right of the graph, and because the q-q plot shows that the error terms at the right end are farther from the dotted line, it may be a good idea to perform a transformation on the predictor variable to improve the linear fit of the data.


#Testing code here

```{r}
#date_net3 <- aggregate(Net ~ Txn_Date, flower_data,sum)
#date_net3$Txn_Date <- factor(date_net3$Txn_Date, levels =  #date_net3$Txn_Date[order(date_net3$Net)])
#str(date_net3)

#date_net3$Txn_Date <- format(format.Date(date_net3))

#There are also multiple stores, it would be interesting to see how different stores sell based on the day
#link to safeway eastern stores
#https://www.google.com/maps/d/viewer?mid=1SU7aT0QrKqPL-5mGKjwvflXmAz4&hl=en_US&ll=39.02715642805036%2C-76.56616499999996&z=7


store_date_net2 <- aggregate(Net ~ Txn_Date + Store , flower_data, sum)
store_date_net2 <- store_date_net[order(as.Date(store_date_net2$Txn_Date, "%m/%d/%Y")),]


```

```{r}

library('rnoaa')
#https://cran.r-project.org/web/packages/rnoaa/rnoaa.pdf

isd_stations_search(lat=38.901335, lon=-76.980425)

#code to get station name for every longitude and latitude for Safeway Eastern stores


store_locations <- read.table("lat_lng_store.txt", header = TRUE)


coops_search(station_name = 997314, begin_date = 20140928, end_date = 20140929, datum = "stnd", product = "predictions")

ghcnd_search("99999", date_min = "1920-01-01", date_max = "1925-01-01")

#create a loop that stores dont neeed loop can pass a vector ;)

#meteo_tidy_ghcnd(stationid = "ASN00003003", date_min = "1950-01-01")

station_data <- ghcnd_stations()

meteo_distance(station_data, lat, long, units = "deg", radius = NULL, limit = NULL)
```

You can also embed plots, for example:

```{r pressure, echo=FALSE}

meteo_distance(station_data, 38.901335, -76.98042, units = "deg", radius = NULL, limit = 5)
```

Stock market data
```{r}
library('quantmod')
```