Lab3/Lab3_Group5.Rmd at master · calegacy/Lab3 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
---
title: "Lab 3"
author: "Chelsey Legacy, Xiangmei Zhang, Ben Buzzee"
date: "February 19, 2017"
output: html_document
---

First we loaded in all the sheets by file for file one and file two. The first two columns of each sheet were blank, so we added in the labels "ID" and "Time", and we removed the empty columns in the sheets.
```{r, message=F, warning = F, echo = F }
library(tidyverse)

## deal with dat1 sheet
```

```{r, message = F, warning = F}
d1_sheet <- NULL
d2_sheet <- NULL
for (i in 1:4){
  d1_sheet[[i]] <- readxl::read_excel("./Spreadsheets/FileOne.xlsx",sheet=i)
  d2_sheet[[i]] <- readxl::read_excel("./Spreadsheets/FileTwo.xlsx",sheet=i)
  colnames(d1_sheet[[i]])[1:2] <- c("ID","Time")  # fill empty colnames
  colnames(d2_sheet[[i]])[1:2] <- c("ID","Time")
  d2_sheet[[i]] <-d2_sheet[[i]] %>% mutate(test=2,sem=i)
  d2_sheet[[i]] = d2_sheet[[i]][names(d2_sheet[[i]])!=""] ## remove empty columns
  d1_sheet[[i]] <-d1_sheet[[i]] %>% mutate(test=1,sem=i)
  d1_sheet[[i]] = d1_sheet[[i]][names(d1_sheet[[i]])!=""]
  ## specify the test for each file
}
```

Then we worked with File One to clean it up.  We aggregated all the sheets for semesters into one file. Then, we  made the labeling of Pre and Post tests more consistent by making them all the same format. Part of the data depended on one key, while the rest of the data depended on multiple keys, so we split it into two files, "student1" and "dat1". Then, we double checked the total and normalized changes columns to make sure the calculations were correct.

```{r, message=F, warning = F}
#------deal with file 1--I do this first because all data in the same file have similar structure

d1=rbind(d1_sheet[[1]],d1_sheet[[2]],d1_sheet[[3]],d1_sheet[[4]])
d1=d1 %>% mutate(Time=replace(Time,Time=="PRE","Pre"),  ## unify upper/lower cases
                 Time=replace(Time,Time=="POST","Post"))

#split key for file1
student1 = d1 %>% filter(Time=="Pre") %>%      ## semester/gender/charactor..whose key is only ID
  select(1,`Normalized Changes`:`Treatment Part 2 `,sem)

dat1 =d1 %>% select(1:`Total`,test) %>%            ## data: answer1-10...whose key is ID Time group
  gather(key=answer ,value=score,3:12) %>%
  separate(answer,into=c("type","group"),sep=" ")

#-----test total and normalized changes by re-derive them--------------------

## check rederive total and normalized changes
sumtotal=dat1 %>% group_by(ID,Time) %>% summarise(sum.total=sum(score)) # re-derive total
dat1.1=full_join(dat1, sumtotal, by=c("ID","Time"))
dat1.1 %>% filter(Total!=sum.total)  ## all right!

# re-derive normalized changes (post-pre)/(100−pre) if post>pre
## (post-pre)/pre if post<pre
newchanges=sumtotal %>% spread(key=Time,value=sum.total) %>%
  mutate(
    diff=Post-Pre,
    normalize=100-Pre,
    normalize=replace(normalize,diff<0,Pre),
    changes = diff/normalize) %>%
  select(ID,changes)

student1.1 = full_join(student1,newchanges,by="ID")
student1.1 %>% filter(`Normalized Changes`!=changes) ## all right!
####----------------------------
```
For File Two we did the same data manipulations as for File One. However, in File Two the semester 4 IDs were inconsistent with the File One semester 4 IDs, so we made them match. (Assuming these were supposed to match, if they were the same students.)


```{r, message=F, warning = F}
#------deal with file 2----------------------
### File2 sem1 miss the column "MTH 3"
#Change ids for semester 4 to match with file one
d2_sheet[[4]] =  d2_sheet[[4]] %>% mutate(ID = ID - 1000)
d2_sheet[[1]] = d2_sheet[[1]] %>% mutate(`MTH 3`=NA)  # add an empty colunm named "MTH 3"
d2=rbind(d2_sheet[[2]],d2_sheet[[1]],d2_sheet[[3]],d2_sheet[[4]])
d2=d2 %>% mutate(Time=replace(Time,Time=="PRE","Pre"),  ## unify upper/lower cases
                 Time=replace(Time,Time=="POST","Post"))

#split key for file1
student2 = d2 %>% filter(Time=="Pre") %>%      ## semester/gender/charactor..whose key is only ID
  select(1,`Normalized Change`:`Treatment Part 2`,sem)

dat2 =d2 %>% select(1:`Total`,test) %>%            ## data: answer1-10...whose key is ID Time group
  gather(key=answer ,value=score,3:`DI 4`) %>%
  separate(answer,into=c("type","group"),sep=" ") %>%
  mutate(score=parse_number(score))

#test total and normalized changes by re-derive them

## check rederive total and normalized changes
sumtotal2=dat2 %>% group_by(ID,Time) %>% summarise(sum.total=sum(score,na.rm=T)) # re-derive total
dat2.1=full_join(dat2, sumtotal2, by=c("ID","Time"))
dat2.1 %>% filter(Total!=sum.total)  ## total.score match !!!


# re-derive normalized changes (post-pre)/(100−pre+60) if post>pre
## (post-pre)/pre if post<pre
newchanges2=sumtotal2 %>% spread(key=Time,value=sum.total) %>%
  mutate(
    diff=Post-Pre,
    normalize=100-Pre+60, ### a guess from the data
    normalize=replace(normalize,diff<0,Pre),
    changes = diff/normalize) %>%
  select(ID,changes)

student2.1 = full_join(student2,newchanges2,by="ID")
student2.1 %>% filter(`Normalized Change`!=changes) ## all right!
```
Here, we checked for further inconsistencies in the data. We checked to make sure the genders and characteristics matched from File One to File Two. We then merged the two student data files together.
```{r, warning = F}
#---------------------------------------
#check consistence for gender/characteristic between two files
#---------------------------------------
check=full_join(student1,student2,
                by=c("ID","sem","Treatment Part 1","Treatment Part 2 "="Treatment Part 2"))
check %>% select(1,3:4,9:10) %>% filter(GENDER.x!=GENDER.y)
## genders of 10 IDs do not match !! inconsistent

check %>% select(1,3:4,9:10) %>% filter(Characteristic.x!=Characteristic.y)
## characteristics of 13 IDs do not match !! inconsistent

### drop these inconsistences or change the value ??
# try change them in file 2 and keep information from file 1
mergekey=check %>% group_by(ID)%>%
  mutate(
    gender=GENDER.x,
    gender=replace(gender,is.na(gender),GENDER.y),
    Characteristic = Characteristic.x,
    Characteristic = replace(Characteristic,is.na(Characteristic),Characteristic.y)
  ) %>%
  select(-c(3:4,9:10)) %>% gather(key=test,value=Normalized.change,c(2,6)) %>%
  mutate(
    test=replace(test,test=="Normalized Changes",1),
    test=replace(test,test=="Normalized Change",2),
    test=parse_number(test)
  )
```
Here we combined all the data into one file to make our plots.
```{r}
mergedat <- rbind(dat1,dat2) ## merge data in two files
alldata = full_join(mergedat,mergekey,by=c("ID","test"))  ## all information form two files


```
Below is the code to make a visual summary of the data.  We faceted by semester, and colored by gender. Scores the fall fall on the black line did equally well on the pre and post exam. The points appear to be randomly distributed above and below the line, and there appears to be no difference between how the different genders performed.  We can see test one is represented by the circles and test two is represented by the triangles.

```{r, message=F, warning = F}
#-----------------------------
#visual summary
#-----------------------------

x=alldata %>% group_by(ID,Time,test) %>% summarise(total=unique(Total)) %>%
  spread(key = Time, value = total)
full_join(x,mergekey,by=c("ID","test")) %>% mutate(term=paste("semester",sem), test= as.factor(test)) %>%
  ggplot(aes(x=Pre,y=Post, shape = test, color = gender))+geom_point()+facet_wrap(~term)+ geom_abline(slope = 1, intercept = 0)
```