CLASS-CDA/CDA-DataVis.Rmd at master · EccRiley/CLASS-CDA · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
---
title: "Categorical Data Visualization"
author: "Riley Smith"
date: "`r format(Sys.Date(), '%d %B %Y')`"
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE, cache=FALSE}
source("../SETUP.R")
options(width = 70)
knitr::opts_chunk$set(
    tidy = FALSE,
    echo = TRUE,
    cache = FALSE,
    fig.keep = 'high',
    fig.show = 'asis',
    results = 'asis',
    autodep = T,
    Rplot = TRUE,
    dev = 'pdf',
    fig.path = 'graphics/reuters/rplot_',
    fig.width = 7,
    fig.asp = 1,
    out.width = "\\linewidth"
    )
library(foreign) ## read.spss() ##
library(car) ## recode() ##
dat <- read.spss("data/reuters.sav", to.data.frame = TRUE)
dat <- within(dat, {recode(response, c("1=0", "2=1", "3=2"))})

```

# Reuters 2016 Polling Data

## Data-Cleaning \\& Preparation


```{r isNA, results='markup'}
Risna <- function(x) sum(is.na(x))
#   ## Getting a count of NA values in the original dataframe ##
sapply(dat, Risna)

R.na <- function(x, v = 0){
    ## x = object to be manipulated,
    ## v = value to assign to NAs ##
    x <- ifelse(is.na(x), v, x)
    return(x)
    }
```

```{marginfigure}
**`R.na():`** _"If `x` = NA (`is.na()`), replace `x` with `v`, otherwise leave `x` alone."_
```

```{r echo=FALSE}
kable(R.msmm(dat$party), caption = "Summary information for '**party**'
        data column _before_ recoding \\texttt{NA}s")
```

```{r}
unique(dat$party)
dat$party <- sapply(dat$party, R.na, v = 99)
```

```{r echo=FALSE}
kable(R.msmm(dat$party), caption = "Summary information for '**party**'
        data column _after_ recoding \\texttt{NA}s")
unique(dat$party)
```

```{r}
dat$response <- recode_factor(dat$response,
                              "other/no opinion" = NA_character_)
## see "recode_factor()" in the {dplyr} package ##
dat <- na.omit(dat)
sapply(dat, R.isna) ## bye-bye NAs! ... again ##
    ## but this time we only lost data for rows with NA
    ## in dat$response (but we did lose ALL of the data
    ## for those rows, as these were removed from the
    ## dataframe entirely, though the original datafile remains untouched).
```

`r tufte::newthought("Now")` the data are, in my opinion, ready for analysis \& plotting.


## \textsc{Bar Plot} of polling data (using `R`'s Base Graphics)

```{r}
poll.t <- table(dat$response)

kable(as.data.frame(poll.t), caption = "Frequency Table of Polling Data",
      col.names = c("Response", "Frequency"))

electpal <- c("red", "blue")
electpal <- sapply(electpal, adjustcolor, alpha = 0.75, USE.NAMES = FALSE)

palette(electpal)
barplot(poll.t, ylab = "Count",
        xlab = "Polling Response",
        family = "ETBembo",
        col = electpal, main = "Polling Responses")
```

## \textsc{Dot Plot} of polling data using the `ggplot2` package.

```{marginfigure}
This would actually be interesting to do with repeated-measures data with the candidates on the `Y-axis` and time (in months/weeks) on the `X-axis`.
```

```{r}
poll.df <- as.data.frame(poll.t)
names( poll.df) <- c("Response", "Frequency")
 poll.df$N <- rep(x = nrow(dat), times = nrow( poll.df))
n <-  poll.df[, 2]
bpoll <- ggplot( poll.df, aes(x = Frequency, y = Response)) +
    geom_segment(aes(yend = Response), xend = 0, colour = pal_my[20]) +
    geom_point(size = 5, aes(colour = Response)) +
    scale_colour_manual(values = electpal, guide = FALSE) +
    labs(y = "", x = "") + thm_tft(xline = TRUE, yline = TRUE)
bpoll
```

`r newthought("Don't forget to set the x-and-y-limits!")` Otherwise, you could be presenting a potentially misleading visualization of the data. Since these are polling data, there is a true "zero" such that `0` would reflect `0` votes for a given candidate in a given poll.^[\\textit{This has happened for one of the two current major party presidential candidates in the very recent past - I will not say who.}] The data should thus be represented according to its appropriate scale limits.

```{r}
bpoll + xlim(0, max( poll.df$Frequency)) +
    geom_text(vjust = -0.5, hjust = 0.5, stat = 'identity',
              position = 'identity', colour = pal_my[19],
              size = rel(4), aes(family = "ETBembo", fontface = "italic",
                                 label = paste("n  = ", n,
                                               "\n", "(",
                                               round(n/ poll.df$N*100),
                                               "%",")")))
```

Further, in plots like these, where discrete data with a relatively small number of categories^[My persona rule of thumb for what constitutes a "relatively small category" is $N_{categories} \leq 5$] are juxtaposed with a continuous scale, setting the continuous axis' limits (in this case the `x-axis`) can help to further disambiguate the information.

```{r}
bpoll2 <- bpoll + scale_x_continuous(breaks = c(0, n),
                                     limits = c(0, max(n)))

bpoll2 + geom_text(vjust = -1.5, hjust = 0.5, stat = 'identity',
                   position = 'identity', colour = pal_my[19],
                   size = rel(4), aes(family = "ETBembo",
                                      fontface = "italic",
                                      label = paste("p  = ",
                                                    round(n/ poll.df$N,
                                                          digits = 2))))
```

## \textsc{Mosaic Plots} for polling data

```{r plot_mosaic}

dat <- within(dat, {
    resp.F <- factor(response)
    ind.F <- factor(ind)
})

tbl <- table(dat$ind.F, dat$resp.F)
tbl

library(vcd)

dimnames(tbl) <- list(ind = c("Party", "Independent"),
                      response = c("Trump","Clinton"))
mosaic(tbl)
mosaicplot(tbl,main = "Reuters Poll Data")

```

`r tufte::newthought("Three-Way")` Mosaic Plot

```{r}
dat <- R.rspss("data/cnnpoll.sav", vlabs = T)
ft <- with(dat, {
    ftable(dat, row.vars = 1:2, col.vars = 3)
})
ft

ftc <- matrix(ft, nrow = 4, byrow = T)
ftc
ftc.a <- array(ftc, dim = c(2, 2, 2), dimnames = list(
    sex = c("Male", "Female"),
    ind = c("Affiliate", "Independent"),
    response = c("Clinton", "Trump")))
ftc.a

mosaicplot(ftc.a, type = "deviance", las = 2, color = pal_my.a75[c(5, 16)])

```

\newpage
# @azen2011categorical's (Chapter 3) Proficiency Data


```{marginfigure}
"See ['Chapter3, Figure4'](https://people.uwm.edu/azen/files/2016/07/AWBookDatasets-1ydyjta.pdf)"
```

> "suppose that federal guidelines state that $80\%$$ of students should demonstrate proficiency in mathematics, and in a randomly selected sample of 10 students only $70\%$ of students were found to be proficient in mathematics. In such a case, we may wish to test whether the proportion of students who are proficient in mathematics in the population is significantly different than the federal guideline of $80\%$. In other words, we would like to know whether our obtained sample proportion of $0.7$ is significantly lower than $0.8$, so we would test the null hypothesis $H_{0}: \pi = 0.8$ against the (one-sided, in this case) alternative $H_{1}: \pi < 0.8$" (p. 23).
>
> > "Two variables are entered: the _proficiency level (`prof`)_ and the _frequency (`count`)_" (p. 32).
> >
> > > "Note that if raw data for $10$ individuals (i.e., $10$ `rows of data`) were analyzed, where the variable called `prof` consisted of _$7 yes$ responses_ and _$3 no$ responses_, the counts would be computed by the program and would not be needed as input" (p. 32).
>
> `r quote_footer("@azen2011categorical, (pp. 23 \\& 32)")`

The `prof` variable should be read by `R` as a `string variable`.

```{r dat, warning=FALSE, message=FALSE}
library(foreign) ## read.spss() ##
dat <- read.spss("data/proficient.sav", to.data.frame = TRUE)
```


## Data Inspection \& Cleaning

```{r}
str(dat)
    ## The proficient.sav dataset is HUGE!
dat <- tbl_df(dat)
    ## see the {dplyr} package - tbl_df's are just better
    ## (than dataframes), particularly when it comes to
    ## very large dataframes, such as these proficiency data.

glimpse(dat) ## {pkg:dplyr} ##

Risna <- function(x) sum(is.na(x))
    ## Getting a count of NA values in the original dataframe ##

sapply(dat, Risna)
```

It appears that there are several instances of `NA` in the `level` column. However, since I do not know these data too well just yet, I am apprehensive, to say the least, to just outright remove those rows containing `NAs`. Instead, I'm going to assign an arbitruary, but meaningful to me, value to all instances of `NA` throughout the dataset. `r margin_note("The output from \\texttt{sapply()} above indicates that the \\texttt{level} column is the only one with \\texttt{NAs}. So, in the next set of analyses, we should only see changes to that column in the end.")`

To ensure that I do not accidentally override an existing value for the `level` variable by assigning an arbitruary value to `NA's`, I first call `unique(dat$level)` below to get a list of the numeric values currently assigned to the levels of `dat$level`. Then, I create a quick function (`R.na()`) to replace `NA` values within a given `R-object`.

```{r cleanDat}
unique(dat$level)
levels(dat$level)
lv <- c(levels(dat$level), "Unknown")
    ## I'm going to have to re-assign the factor levels to dat$level, so
    ## I am saving the labels, with an added level for the NAs, to use later

R.na <- function(x, v = 0){
    ## x = object to be manipulated,
    ## v = value to assign to NAs ##
    x <- ifelse(is.na(x), v, x)
    return(x)
}

dat$level <- sapply(dat$level, R.na, v = 0)
    ## 0 does not currently mean anything else in this data, so I am going
    ## to use to represent it's acutal meaning, which is simply a NULL value

dat$level <- factor(dat$level, levels = c(4, 3, 2, 1, 0), labels = lv)
unique(dat$level)
levels(dat$level)
glimpse(dat)
sapply(dat, R.isna)

```

## Setting up for binomial analysis

Based on their description (and the fact that the example pertains primarily to the Bernoulli Distribution), it appears that @azen2011categorical used a dichotomized version of the `level` variable in their analysis example. However, I am curious about the level of extant intformation available from this variable's original 4-categories^[i.e., `r levels(dat$level)`], as well as the difference, if any, between the two categorical data structuring methods.

```{r datDich}
rec <- "c('proficient', 'advanced') = 1; c('Unknown', 'minimal', 'basic') = 0"
dat$lev.D <- car::recode(dat$level, rec)
    ## "D" = "Dichotomous" ##
summary(dat$lev.D)
```

`r newthought("Quick aside:")` Instead of the above dichotomization, I could have done the following:

```{r}
rec1 <- c("'minimal'=1; 'basic'=2; 'proficient'=3; 'advanced'=4; 'Unknown'=0")
dat$lev.D1 <- car:::recode(dat$level, rec1)
dat$lev.D1 <- as.integer(dat$lev.D1)
dat$lev.D1 <- cut(dat$lev.D1, breaks = 2, right = FALSE)
D1labs <- levels(dat$lev.D1)
levels(dat$lev.D1) <- c("0", "1")
summary(dat$lev.D1)

```

`r tufte::newthought("However")`, as you can see by the differences in the summary outputs above and the barplots below resulting from these different dichotomizing procedures, when I simply "cut()" the factor levels, I get a nicely split (i.e., "cut") factor. Unfortunately, the proportions of the newly dichotomized factor's levels do not necessarily reflect the actual proportions in the data.`r margin_note("Moral of the story - be careful when cutting and never assume that \\texttt{R} knows what you are trying to do (\\texttt{R}'s purpose is not to Harry Huidini your data by reading your mind (or your prof's/advisor's mind), but rather to provide you with a set (to put it mildly) of tools to help you do the work ... but you still have to do the actual thinking work).")`

```{r}
dat$level <- relevel(dat$level, ref = "Unknown")
    ## re-ordering levels for presentation ##
```

```{r barChart1, echo=FALSE, eval=FALSE}
table(dat$level) %>% ## see {dplyr} package ##
    barplot(main = "All Possible Proficiency Levels", col = mpal2(5))
table(dat$level)

table(dat$lev.D) %>%
    barplot(main = "Dichotomized Proficiency Levels", xlab = "Not Proficient                 Proficient", col = mpal2(5)[c(1, 5)])
table(dat$lev.D)

table(dat$lev.D1) %>%
    barplot(main = "Dichotomized Proficiency Levels Resulting from 'cut(...)'", xlab = expression(paste(X <= 3,"                 ", X >= 3)), col = mpal2(5)[c(1, 5)])
table(dat$lev.D1)

```

`r tufte::newthought("Below")` are the exact same plots, except with a random sample of $n = 10$ cases from the proficiency data per the example in @azen2011categorical (Chapter 3).

```{r dat.s}
dat.s <- sample_n(dat, 10)
```

```{r barChart2, echo=FALSE, eval=FALSE}
table(dat.s$level) %>% ## see {dplyr} package ##
    barplot(main = "All Possible Proficiency Levels", col = mpal2(5))
table(dat.s$level)

table(dat.s$lev.D) %>%
    barplot(main = "Dichotomized Proficiency Levels", xlab = "Not Proficient                 Proficient", col = mpal2(5)[c(1, 5)])
table(dat.s$lev.D)

table(dat.s$lev.D1) %>%
    barplot(main = "Dichotomized Proficiency Levels Resulting from 'cut(...)'", xlab = expression(paste(X <= 3, "                 ", X >= 3)), col = mpal2(5)[c(1, 5)])
table(dat.s$lev.D1)
```

\newpage
# References`r R.cite_r("~/GitHub/auxDocs/REFs.bib", footnote = TRUE)`