simulations-practice/summary.Rmd at main · Fdoel/simulations-practice · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
---
title: "Kendall vs. Pearson correlation in reating scale data"
output: html_document
date: "2023-08-27"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, include=FALSE}

library(tidyverse)

```

## Kendall vs. Pearson correlation in reating scale data ##

This text will provide a summary of the comparison between Kendall's tau and Pearson correlation in rating scale data. Several different settings are considered. The correlation coefficients found in the tables and graphs only consider the measure between the first and second item. This can be changed to look at the covariance matrices, however this is hard to visualise.

### Varied amount of items ###

Varying the amount of items seems to have little effect on the correlation coefficient, which seems intuitive as we are only looking at the correlation between the first and second item.

For each of the following plot, only the amount of items differs. The other parameters are as follows
number of observations = 1000, amount of runs = 100

```{r, echo=FALSE, message=FALSE, out.width = "33%"}
p_seq = c(2, 5, 10)
for(i in p_seq) {
  sprintf("pearson_vs_kendall/results/summary/p_differ/results_p=%d", i) %>%
  paste(".RData", sep = "") %>%
  load(.GlobalEnv)
  p <-ggplot(results_rho ,aes(x = Target, y = Correlation, color = Method)) +
    geom_smooth() +
    ggtitle(paste("P = ", i))
  print(p)
}
```


### Varied amount of categories ###

When we add more categories, we expect the rating scale data to slowly resemble continuous data, and thus for the average correlation per method to converge.

For this the following function is used to generate the base probabilities.

```{r}
probabilty_likert <- function(bins, likert_mean = (1 + bins)/2, likert_sd = bins/4) {
  intervals = c(seq(1.5, bins - 0.5), Inf)
  x <- pnorm(intervals, mean = likert_mean, sd = likert_sd)
  y <- dplyr::lag(x, default = 0)
  x - y
}
```

The probabilties are based upon a normally distributed latent variable, this means that as we have a limited number of categories, base probabilites are not normally distributed.

We indeed see that the estimated correlation tends towards the target correlation for categories going to 1000.

```{r, echo=FALSE, message=FALSE}
load("pearson_vs_kendall/results/summary/careless/asymptotic/n=1000_max=1003_increment=20.RData")
plot_line <-  dplyr::filter(results_L, near(Epsilon, 0.0)) %>%
  ggplot(aes(x = Categories, y = Correlation, color = Method)) +
  geom_smooth() +
  ggtitle("n = 1000, target rho = 0.8")
print(plot_line)
```

Transformed Kendall in this case means the Fischer consistent version of Kendall's tau.

$\tilde{R}_K(H)=\sin \left(\frac{1}{2} \pi R_K(H)\right)$

Where $\tilde{R}_K(H)$ is used in the plot.

## Contaminated data ##

More intresting is when we look at contaminated data. When we take 30% ($\epsilon = 0.3$) of our dataand replace them with either careless respondents, who have an equal probability for choosing each catogery on each question, and "agreers", who have a tendency to respond higher in the Likert scale. In the function described earlier 'careless_prob_matrix <- probabilty_likert(l, l/1.1)' where l is the amount of categories.

### Asymptotic behaviour ###

First we'll take a look at the asymptotic behaviour of the contaminated data.

```{r, echo=FALSE, message=FALSE}
# Load and store all specific dataframes
load("pearson_vs_kendall/results/summary/careless/asymptotic/n=1000_max=1003_increment=20.RData")
results_n1000 <- results_L %>%
  mutate(Observations = 1000)

load("pearson_vs_kendall/results/summary/careless/asymptotic/n=100_max=1003_increment=20.RData")
results_n100 <- results_L %>%
  mutate(Observations = 100)

load("pearson_vs_kendall/results/summary/careless/asymptotic/n=50_max=1003_increment=20.RData")
results_n50 <- results_L %>%
  mutate(Observations = 50)

# Create large df in a not so nicely coded way

results_n1000_regular <- dplyr::filter(results_n1000, near(Epsilon, 0.0))%>%
  mutate(Type = "regular")
results_n1000_careless <- dplyr::filter(results_n1000, near(Epsilon, 0.3)) %>%
  mutate(Type = "careless")

results_n100_regular <- dplyr::filter(results_n100, near(Epsilon, 0.0))%>%
  mutate(Type = "regular")
results_n100_careless <- dplyr::filter(results_n100, near(Epsilon, 0.3)) %>%
  mutate(Type = "careless")

results_n50_regular <- dplyr::filter(results_n50, near(Epsilon, 0.0))%>%
  mutate(Type = "regular")
results_n50_careless <- dplyr::filter(results_n50, near(Epsilon, 0.3)) %>%
  mutate(Type = "careless")

# Get the agreer data too

load("pearson_vs_kendall/results/summary/agree/asymptotic/asymptoticn=1000_max=1000_increment=20.RData")
results_n1000_agree <- dplyr::filter(results_L, near(Epsilon, 0.3)) %>%
  mutate(Observations = 1000, Type = "agree")

load("pearson_vs_kendall/results/summary/agree/asymptotic/asymptoticn=100_max=1000_increment=20.RData")
results_n100_agree <- dplyr::filter(results_L, near(Epsilon, 0.3)) %>%
  mutate(Observations = 100, Type = "agree")

load("pearson_vs_kendall/results/summary/agree/asymptotic/asymptoticn=50_max=1000_increment=20.RData")
results_n50_agree <- dplyr::filter(results_L, near(Epsilon, 0.3)) %>%
  mutate(Observations = 50, Type = "agree")

df_asymptotic <- rbind(
  results_n1000_careless,
  results_n1000_regular,
  results_n1000_agree,
  results_n100_careless,
  results_n100_regular,
  results_n100_agree,
  results_n50_careless,
  results_n50_regular,
  results_n50_agree)

plot_grid <-
  ggplot(df_asymptotic, aes(x = Categories, y = Correlation, color = Method)) +
      geom_smooth() +
      facet_grid(factor(Type) ~ factor(Observations))

  print(plot_grid)
```

Behaviour towards 1000 categories behaves as expected. Transformed Kendall and Pearson tend towards the same value. However, while this is useful for validation no one is filling in a 1000 category survey.

As we can see, especially for lower values Transormed Kendall's correlation is a lot closer to the correlation for large values, suggesting there is some usefulness.

### 3 to 11 catogeries ###

The real usefulness of the comaparison between the different methods of measuring correlation lies within regular survey values, like  3 to a maximum of 11.

Lets take a look at that data.

```{r, echo=FALSE, out.width="200%", out.height="150%"}
# Load and store all specific dataframes
load("pearson_vs_kendall/results/summary/careless/3to11/n=1000_max=11_increment=2.RData")
results_n1000 <- results_L %>%
  mutate(Observations = 1000)

load("pearson_vs_kendall/results/summary/careless/3to11/n=100_max=11_increment=2.RData")
results_n100 <- results_L %>%
  mutate(Observations = 100)

load("pearson_vs_kendall/results/summary/careless/3to11/n=50_max=11_increment=2.RData")
results_n50 <- results_L %>%
  mutate(Observations = 50)

# Create large df in a not so nicely coded way

results_n1000_regular <- dplyr::filter(results_n1000, near(Epsilon, 0.0))%>%
  mutate(Type = "regular")
results_n1000_careless <- dplyr::filter(results_n1000, near(Epsilon, 0.3)) %>%
  mutate(Type = "careless")

results_n100_regular <- dplyr::filter(results_n100, near(Epsilon, 0.0))%>%
  mutate(Type = "regular")
results_n100_careless <- dplyr::filter(results_n100, near(Epsilon, 0.3)) %>%
  mutate(Type = "careless")

results_n50_regular <- dplyr::filter(results_n50, near(Epsilon, 0.0))%>%
  mutate(Type = "regular")
results_n50_careless <- dplyr::filter(results_n50, near(Epsilon, 0.3)) %>%
  mutate(Type = "careless")

# Get the agreer data too

load("pearson_vs_kendall/results/summary/agree/3to11/n=1000_max=11_increment=2.RData")
results_n1000_agree <- dplyr::filter(results_L, near(Epsilon, 0.3)) %>%
  mutate(Observations = 1000, Type = "agree")

load("pearson_vs_kendall/results/summary/agree/3to11/n=100_max=11_increment=2.RData")
results_n100_agree <- dplyr::filter(results_L, near(Epsilon, 0.3)) %>%
  mutate(Observations = 100, Type = "agree")

load("pearson_vs_kendall/results/summary/agree/3to11/n=50_max=11_increment=2.RData")
results_n50_agree <- dplyr::filter(results_L, near(Epsilon, 0.3)) %>%
  mutate(Observations = 50, Type = "agree")

df_small <- rbind(
  results_n1000_careless,
  results_n1000_regular,
  results_n1000_agree,
  results_n100_careless,
  results_n100_regular,
  results_n100_agree,
  results_n50_careless,
  results_n50_regular,
  results_n50_agree)
plot_grid <-
    ggplot(df_small, aes(y = Correlation, color = Method)) +
      geom_boxplot(aes(x = factor(Categories))) +
      facet_grid(factor(Type) ~ factor(Observations))

  print(plot_grid)
```
Especially at lower amounts of categories Transformed Kendall's estimate is a lot closer to the target correlation.

### Multiple Constructs ###

When introducing mulitple constructs, the correlation estimate between p1 and p2 should not change. We can contruct the average correlation matrix using the following function.