# Data Pre-processing {#datapreprocessing}
Many books on data analysis focus on models, algorithms, and statistical inference. In practice, however, raw data is usually not used directly for modeling. Data preprocessing is the process of converting raw data into clean data that is suitable for modeling. A model can fail for various reasons, and one of them is that the modeler does not preprocess the data correctly before modeling. Decisions such as how to impute missing values and how to handle outliers can significantly impact model results, so data preprocessing is a critical step.
In real life, depending on the stage of data cleanup, data has the following types:
1. Raw data
2. Technically correct data
3. Data that is proper for the model
4. Summarized data
5. Data with fixed format
Raw data is the first-hand data that analysts pull from the database, market survey responses from clients, experimental results collected by the research and development department, and so on. This data may be very rough, and R sometimes cannot read it directly. The table title could span multiple lines, or the format may not meet the requirements, for example:
- The percentage is written as 50% rather than 0.5, so R reads it as a character;
- Missing sales values are represented by "-" instead of being left blank, so R treats the variable as character or factor;
- The data is in a slideshow document, or the spreadsheet is not ".csv" but ".xlsx"
- ...
Most of the time, you need to clean the data so that R can import it, and some data formats require a specific package. Technically correct data is data that, after preliminary cleaning or format conversion, R (or whatever tool you use) can successfully import.
Assume we have loaded the data into R with reasonable column names, variable formats, and so on. That does not mean the data is entirely correct. There may be observations that do not make sense, such as a negative age, a discount percentage greater than 1, or missing values. Depending on the situation, the data can have a variety of problems, and it is necessary to clean it before modeling. Moreover, different models have different requirements on the data. For example, some models require the variables to be on a consistent scale; some are susceptible to outliers or collinearity; some cannot handle categorical variables; and so on. The modeler has to preprocess the data to make it suitable for the specific model.
Sometimes we need to aggregate the data, for example, adding up daily sales to get the annual sales of a product at different locations. In customer segmentation, it is common practice to build a profile for each segment, which requires calculating statistics such as average age, average income, and the standard deviation of age. Data aggregation is also necessary for presentation and for data visualization.
The final tables delivered to clients need to be in a nicer format than what is used in the analysis. Usually, data analysts take the results from data scientists and adjust the formatting, such as labels, cell colors, and highlighting. It is important for a data scientist to make sure the results look consistent, which makes this next step easier for data analysts.
It is highly recommended to store the data at each step along with the R code, making the whole process as repeatable as possible. An R Markdown reproducible report is extremely helpful for that: if the data changes, it is easy to rerun the process. In the remainder of this chapter, we will show the most common data preprocessing methods.
Load the R packages first:
<!--
```r
source("https://raw.githubusercontent.com/happyrabbit/CE_JSM2017/master/Rcode/00-course-setup.R")
```
-->
```{r, message=FALSE,results="hide"}
# install packages from CRAN
p_needed <- c('imputeMissings','caret','e1071','psych',
'car','corrplot')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)
```
## Data Cleaning
After you load the data, the first thing to do is check how many variables there are, the types of the variables, their distributions, and any data errors. Let's read and check the data:
```{r, results="hide"}
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
summary(sim.dat)
```
```pre
age gender income house store_exp
Min. : 16.0 Female:554 Min. : 41776 No :432 Min. : -500
1st Qu.: 25.0 Male :446 1st Qu.: 85832 Yes:568 1st Qu.: 205
Median : 36.0 Median : 93869 Median : 329
Mean : 38.8 Mean :113543 Mean : 1357
3rd Qu.: 53.0 3rd Qu.:124572 3rd Qu.: 597
Max. :300.0 Max. :319704 Max. :50000
NA's :184
online_exp store_trans online_trans Q1 Q2
Min. : 69 Min. : 1.00 Min. : 1.0 Min. :1.0 Min. :1.00
1st Qu.: 420 1st Qu.: 3.00 1st Qu.: 6.0 1st Qu.:2.0 1st Qu.:1.00
Median :1942 Median : 4.00 Median :14.0 Median :3.0 Median :1.00
Mean :2120 Mean : 5.35 Mean :13.6 Mean :3.1 Mean :1.82
3rd Qu.:2441 3rd Qu.: 7.00 3rd Qu.:20.0 3rd Qu.:4.0 3rd Qu.:2.00
Max. :9479 Max. :20.00 Max. :36.0 Max. :5.0 Max. :5.00
Q3 Q4 Q5 Q6 Q7
Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
1st Qu.:1.00 1st Qu.:2.00 1st Qu.:1.75 1st Qu.:1.00 1st Qu.:2.50
Median :1.00 Median :3.00 Median :4.00 Median :2.00 Median :4.00
Mean :1.99 Mean :2.76 Mean :2.94 Mean :2.45 Mean :3.43
3rd Qu.:3.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00
Max. :5.00 Max. :5.00 Max. :5.00 Max. :5.00 Max. :5.00
Q8 Q9 Q10 segment
Min. :1.0 Min. :1.00 Min. :1.00 Conspicuous:200
1st Qu.:1.0 1st Qu.:2.00 1st Qu.:1.00 Price :250
Median :2.0 Median :4.00 Median :2.00 Quality :200
Mean :2.4 Mean :3.08 Mean :2.32 Style :350
3rd Qu.:3.0 3rd Qu.:4.00 3rd Qu.:3.00
Max. :5.0 Max. :5.00 Max. :5.00
```
Are there any problems? The questionnaire responses Q1-Q10 seem reasonable: the minimum is 1 and the maximum is 5, and recall that the questionnaire scores range from 1 to 5. The numbers of store transactions (`store_trans`) and online transactions (`online_trans`) make sense too. Things to pay attention to are:
- There are some missing values.
- There are outliers in store expenses (`store_exp`). The maximum value is 50000. Who would spend $50,000 a year buying clothes? Is it a data entry error?
- There is a negative value (-500) in `store_exp`, which is not logical.
- Someone is 300 years old.
How do we deal with these issues? Depending on the situation, if the sample size is large enough and the missingness occurs at random, it does not hurt to delete the problematic samples. Alternatively, we can set these values to missing and impute them instead of deleting the rows.
```r
# set problematic values as missings
sim.dat$age[which(sim.dat$age > 100)] <- NA
sim.dat$store_exp[which(sim.dat$store_exp < 0)] <- NA
# see the results
summary(subset(sim.dat, select = c("age", "store_exp")))
```
```pre
age store_exp
Min. :16.00 Min. : 155.8
1st Qu.:25.00 1st Qu.: 205.1
Median :36.00 Median : 329.8
Mean :38.58 Mean : 1358.7
3rd Qu.:53.00 3rd Qu.: 597.4
Max. :69.00 Max. :50000.0
NA's :1 NA's :1
```
Now let's deal with the missing values in the data.
## Missing Values
You could write a whole book about missing values. This section only shows some of the most commonly used methods without getting too deep into the topic. Chapter 7 of the book by De Waal, Pannekoek and Scholtus [@Ton2011] gives a concise overview of existing imputation methods. The choice of method depends on the actual situation; there is no single best way.
One question to ask before imputation: is there any auxiliary information? Being aware of auxiliary information is critical. For example, if the system records customers who did not purchase as missing, then the real purchase amount should be 0. Is the missingness a random occurrence? If so, it may be reasonable to impute with the mean or median. If not, is there a potential mechanism behind the missing data? For example, older people may be more reluctant to disclose their age in a questionnaire, so the absence of age is not completely random. In this case, the missing values need to be estimated using the relationship between age and other variables, for example by using variables such as whether they have children, income, and other survey questions to build a model that predicts age, as sketched below.
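For instance, a minimal sketch of such model-based imputation might look like the following (the predictor set and the object names are made up for illustration):

```r
# a sketch of model-based imputation: predict the variable with missing values
# from other variables (here `age`, following the example in the text; the
# predictors are illustrative, and rows with missing predictors stay NA)
demo <- sim.dat
age_mod <- lm(age ~ income + store_exp + online_exp + store_trans + online_trans,
              data = demo)
miss_idx <- which(is.na(demo$age))
demo$age[miss_idx] <- predict(age_mod, newdata = demo[miss_idx, ])
```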
Also, the purpose of modeling matters when selecting an imputation method. If the goal is parameter estimation or statistical inference, it is important to study the missing mechanism carefully and to estimate the missing values using the non-missing information as much as possible. If the goal is prediction, people usually do not study the missing mechanism rigorously (though sometimes the mechanism is obvious). If the missing mechanism is not clear, treat the data as missing at random and impute with the mean, the median, or k-nearest neighbors. Since statistical inference is sensitive to missing values, researchers in survey statistics have conducted in-depth studies of imputation schemes that focus on valid statistical inference. The problem of missing values in predictive models is different from that in traditional surveys, and there are not many papers on missing value imputation for predictive models. Those who want to study further can refer to Saar-Tsechansky and Provost's comparison of different imputation methods [@missing1] and De Waal, Pannekoek and Scholtus' book [@Ton2011].
### Impute missing values with median/mode
When values are missing at random, a common method is to impute with the median (continuous variables) or the mode (categorical variables). You can use the `impute()` function in the `imputeMissings` package.
```r
# save the result as another object
demo_imp <- impute(sim.dat, method = "median/mode")
# check the first 5 columns
# there is no missing values in other columns
summary(demo_imp[, 1:5])
```
```pre
age gender income house store_exp
Min. :16.00 Female:554 Min. : 41776 No :432 Min. : 155.8
1st Qu.:25.00 Male :446 1st Qu.: 87896 Yes:568 1st Qu.: 205.1
Median :36.00 Median : 93869 Median : 329.8
Mean :38.58 Mean :109923 Mean : 1357.7
3rd Qu.:53.00 3rd Qu.:119456 3rd Qu.: 597.3
Max. :69.00 Max. :319704 Max. :50000.0
```
After imputation, `demo_imp` has no missing values. This method is straightforward and widely used. The disadvantage is that it does not take the relationships between variables into account. When a significant proportion of values is missing, it will distort the data; in that case, it is better to consider the relationships between variables and study the missing mechanism. In the example here, the variables with missing values are numeric. If the variable with missing values is a categorical/factor variable, the `impute()` function will impute with the mode, as in the following toy example.
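Here is a minimal sketch of mode imputation on a factor column (the toy data frame `toy` is made up for illustration):

```r
# a toy data frame with a missing factor value and a missing numeric value
toy <- data.frame(color = factor(c("red", "red", "blue", NA)),
                  size = c(1, 2, NA, 4))
# impute() fills the factor NA with the mode ("red") and the numeric NA with the median
imputeMissings::impute(toy, method = "median/mode")
```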
You can also use `preProcess()` from the `caret` package, but it only works for numeric variables and cannot impute categorical variables. Since the missing values here are numeric, we can use `preProcess()`, and the result is the same as with the `impute()` function. `preProcess()` is a powerful function that provides a variety of data preprocessing methods; we will use it again for other preprocessing tasks later.
```r
imp <- preProcess(sim.dat, method = "medianImpute")
demo_imp2 <- predict(imp, sim.dat)
summary(demo_imp2[, 1:5])
```
```pre
age gender income house store_exp
Min. :16.00 Female:554 Min. : 41776 No :432 Min. : 155.8
1st Qu.:25.00 Male :446 1st Qu.: 87896 Yes:568 1st Qu.: 205.1
Median :36.00 Median : 93869 Median : 329.8
Mean :38.58 Mean :109923 Mean : 1357.7
3rd Qu.:53.00 3rd Qu.:119456 3rd Qu.: 597.3
Max. :69.00 Max. :319704 Max. :50000.0
```
### K-nearest neighbors
K-nearest neighbors (KNN) finds the k closest samples (by Euclidean distance) in the training set and imputes the mean of those "neighbors."
Use `preProcess()` to conduct KNN:
```r
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
# need to use predict() to get KNN result
demo_imp <- predict(imp, sim.dat)
# check the first three columns of the imputed (and standardized) result
summary(demo_imp[, 1:3])
```
```pre
age gender income
Min. :-1.5910972 Female:554 Min. :-1.43989
1st Qu.:-0.9568733 Male :446 1st Qu.:-0.53732
Median :-0.1817107 Median :-0.37606
Mean : 0.0000156 Mean : 0.02389
3rd Qu.: 1.0162678 3rd Qu.: 0.21540
Max. : 2.1437770 Max. : 4.13627
```
The `preProcess()` in the first line will automatically ignore non-numeric columns.
Comparing the KNN result with the previous median imputation, the two are very different. This is because when you tell `preProcess()` to use KNN (the option `method = "knnImpute"`), it automatically standardizes the data.
Another way is to use a bagging tree (see the next section). Note that KNN cannot impute samples where the entire row is missing. The reason is straightforward: the algorithm finds neighbors using the non-missing predictors and averages their values; if every predictor in a row is missing, there is no information to find neighbors with.
Let's append a new row with all values missing to the original data frame to get a new object called `temp`. Then apply KNN to `temp` and see what happens:
```r
temp <- rbind(sim.dat, rep(NA, ncol(sim.dat)))
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
demo_imp <- predict(imp, temp)
```
```pre
Error in FUN(newX[, i], ...) :
cannot impute when all predictors are missing in the new data point
```
There is an error saying “`cannot impute when all predictors are missing in the new data point`”. It is easy to fix by finding and removing the problematic row(s):
```r
idx <- apply(temp, 1, function(x) sum(is.na(x)))
as.vector(which(idx == ncol(temp)))
```
It shows that row 1001 is problematic. You can go ahead and delete it, as sketched below.
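A minimal sketch of the deletion, reusing the `idx` vector computed above:

```r
# drop any row where every column is missing, then redo the KNN imputation
temp <- temp[idx != ncol(temp), ]
demo_imp <- predict(imp, temp)
```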
### Bagging Tree
Bagging (bootstrap aggregating) was originally proposed by Leo Breiman and is one of the earliest ensemble methods [@bag1]. When used for missing value imputation, it uses the remaining variables as predictors to train a bagging tree and then uses the tree to predict the missing values. Although the method is theoretically powerful, the computation is much more intensive than KNN. In practice, there is a trade-off between computation time and benefit: if median or mean imputation meets the modeling needs, a bagging tree may improve accuracy a little, but the gain is usually so marginal that it does not justify the extra time. The bagging tree itself is a model for regression and classification. Here we use `preProcess()` to impute `sim.dat`:
```r
imp <- preProcess(sim.dat, method = "bagImpute")
demo_imp <- predict(imp, sim.dat)
summary(demo_imp[, 1:5])
```
```pre
age gender income house store_exp
Min. :16.00 Female:554 Min. : 41776 No :432 Min. : 155.8
1st Qu.:25.00 Male :446 1st Qu.: 86762 Yes:568 1st Qu.: 205.1
Median :36.00 Median : 94739 Median : 329.0
Mean :38.58 Mean :114665 Mean : 1357.7
3rd Qu.:53.00 3rd Qu.:123726 3rd Qu.: 597.3
Max. :69.00 Max. :319704 Max. :50000.0
```
## Centering and Scaling
Centering and scaling is the most straightforward data transformation. It centers and scales a variable to mean 0 and standard deviation 1. It ensures that the criterion for finding linear combinations of the predictors is based on how much variation they explain and therefore improves numerical stability. Models that find linear combinations of the predictors to explain variation in the response or the predictors need centering and scaling, such as principal component analysis (PCA) [@pca1], partial least squares (PLS) [@PLS1], and factor analysis [@EFA1]. You can quickly write code yourself to conduct this transformation.
Let's standardize the variable `income` from `sim.dat`:
```r
income <- sim.dat$income
# calculate the mean of income
mux <- mean(income, na.rm = T)
# calculate the standard deviation of income
sdx <- sd(income, na.rm = T)
# centering
tr1 <- income - mux
# scaling
tr2 <- tr1/sdx
```
Or the function `preProcess()` can apply this transformation to a set of predictors.
```r
sdat <- subset(sim.dat, select = c("age", "income"))
# set the 'method' option
trans <- preProcess(sdat, method = c("center", "scale"))
# use predict() function to get the final result
transformed <- predict(trans, sdat)
```
Now the two variables are on the same scale. You can check the result using `summary(transformed)`. Note that there are still missing values.
## Resolve Skewness
Skewness is defined as the third standardized central moment. The formula for the sample skewness statistic is:
$$ skewness=\frac{\sum(x_{i}-\bar{x})^{3}}{(n-1)v^{3/2}}$$
where
$$v=\frac{\sum(x_{i}-\bar{x})^{2}}{(n-1)}$$
A zero skewness means that the distribution is symmetric, i.e. the probability of falling on either side of the distribution's mean is equal.
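To make the formula concrete, here is a minimal sketch that evaluates it directly on a toy right-skewed sample (the sample `x` is made up; `e1071::skewness()` uses a slightly different denominator convention, so the two values agree closely but not exactly):

```r
# a toy right-skewed sample
set.seed(1)
x <- rexp(1000)
n <- length(x)
# sample variance v and the skewness formula above
v <- sum((x - mean(x))^2) / (n - 1)
skew_manual <- sum((x - mean(x))^3) / ((n - 1) * v^(3/2))
c(manual = skew_manual, e1071 = e1071::skewness(x))
```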
```{r preprocessskew, fig.cap = "Example of skewed distributions", out.width="100%", fig.asp=.75, fig.align="center", echo = FALSE}
# need skewness() function from e1071 package
set.seed(1000)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
# random sample 1000 chi-square distribution with df=2
# right skew
x1 <- rchisq(1000, 2, ncp = 0)
# get left skew variable x2 from x1
x2 <- max(x1) - x1
plot(density(x2), main = paste("left skew, skewnwss =",
round(skewness(x2), 2)), xlab = "X2")
plot(density(x1), main = paste("right skew, skewness =",
round(skewness(x1), 2)), xlab = "X1")
```
There are different ways to remove skewness, such as the log, square root, or inverse transformation. However, it is often difficult to determine from plots which transformation is most appropriate for correcting skewness. The Box-Cox procedure automatically identifies a transformation from the family of power transformations indexed by a parameter $\lambda$ [@BOXCOX1]:
$$
x^{*}=\begin{cases}
\frac{x^{\lambda}-1}{\lambda} & \text{if } \lambda\neq0\\
\log(x) & \text{if } \lambda=0
\end{cases}
$$
It is easy to see that this family includes the log transformation ($\lambda=0$), the square transformation ($\lambda=2$), the square root ($\lambda=0.5$), the inverse ($\lambda=-1$), and others in between. We can still use the `preProcess()` function in the `caret` package to apply this transformation by changing the `method` argument.
```r
# describe() from the psych package reports summary statistics, including the trimmed mean
describe(sim.dat)
```
```pre
vars n mean sd median trimmed mad ...
age 1 1000 38.84 16.42 36 37.69 16.31
gender* 2 1000 1.45 0.50 1 1.43 0.00
income 3 816 113543.07 49842.29 93869 104841.94 28989.47
house* 4 1000 1.57 0.50 2 1.58 0.00
store_exp 5 1000 1356.85 2774.40 329 839.92 196.45
online_exp 6 1000 2120.18 1731.22 1942 1874.51 1015.21
store_trans 7 1000 5.35 3.70 4 4.89 2.97
online_trans 8 1000 13.55 7.96 14 13.42 10.38
...
```
It is easy to see the skewed variables. If `mean` and `trimmed` differ a lot, there are very likely outliers. By default, `trimmed` reports the mean after dropping the top and bottom 10% of values; this can be adjusted with the `trim =` argument. It is clear that `store_exp` has outliers.
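For example, a 20% trimmed mean can be requested like this (a quick sketch):

```r
# report summary statistics for store_exp, trimming the top and bottom 20%
psych::describe(sim.dat$store_exp, trim = 0.2)
```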
As an example, we will apply Box-Cox transformation on `store_trans` and `online_trans`:
```{r}
# select the two columns and save them as dat_bc
dat_bc <- subset(sim.dat, select = c("store_trans", "online_trans"))
(trans <- preProcess(dat_bc, method = c("BoxCox")))
```
The last line of the output shows the estimates of $\lambda$ for each variable. As before, use `predict()` to get the transformed result:
```{r}
transformed <- predict(trans, dat_bc)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
hist(dat_bc$store_trans, main = "Before Transformation",
xlab = "store_trans")
hist(transformed$store_trans, main = "After Transformation",
xlab = "store_trans")
```
Before the transformation, `store_trans` is skewed right. `BoxCoxTrans()` can also conduct the Box-Cox transformation, but note that `BoxCoxTrans()` can only be applied to a single variable; it cannot transform different columns of a data frame at the same time.
```{r}
(trans <- BoxCoxTrans(dat_bc$store_trans))
```
```{r}
transformed <- predict(trans, dat_bc$store_trans)
skewness(transformed)
```
The estimate of $\lambda$ is the same as before (0.1). The skewness of the original observations is 1.1, and -0.2 after transformation. Although it is not exactly 0, it is greatly improved.
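To connect the reported $\lambda$ back to the Box-Cox formula above, here is a quick manual check (a sketch; `caret` may treat a $\lambda$ this close to zero as a log transform internally, so the value need not match the reported skewness exactly):

```r
# apply the Box-Cox formula by hand with the estimated lambda = 0.1
lambda <- 0.1
manual <- (dat_bc$store_trans^lambda - 1) / lambda
# the right skew is substantially reduced compared with the raw variable
e1071::skewness(manual)
```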
## Resolve Outliers {#outliers}
Even though we can define outliers statistically under certain assumptions, they can be hard to define in some situations. Box plots, histograms, and other basic visualizations can be used for an initial check of whether there are outliers. For example, we can visualize the numerical non-survey variables in `sim.dat`:
```{r, warning=FALSE}
# select numerical non-survey data
sdat <- subset(sim.dat, select = c("age", "income", "store_exp",
"online_exp", "store_trans", "online_trans"))
# use scatterplotMatrix() function from car package
par(oma = c(2, 2, 1, 2))
car::scatterplotMatrix(sdat, diagonal = TRUE)
```
<!--As figure \@ref(fig:scm) shows, `store_exp` has outliers. -->
It is also easy to observe the pairwise relationships from the plot. `age` is negatively correlated with `online_trans` but positively correlated with `store_trans`; it seems that older people tend to purchase from the local store. The amount of expense is positively correlated with income. A scatterplot matrix like this can reveal a lot of information before modeling.
In addition to visualization, there are some statistical methods to define outliers, such as the commonly used Z-score. The Z-score for variable $\mathbf{Y}$ is defined as:
$$Z_{i}=\frac{Y_{i}-\bar{Y}}{s}$$
where $\bar{Y}$ and $s$ are the mean and standard deviation of $Y$. The Z-score measures the distance between each observation and the mean. This method may be misleading, especially when the sample size is small. Iglewicz and Hoaglin proposed using the modified Z-score to determine outliers [@mad1]:
$$M_{i}=\frac{0.6745(Y_{i}-\bar{Y})}{MAD}$$
where MAD is the median of the series $|Y_{i} - \bar{Y}|$, called the median absolute dispersion. Iglewicz and Hoaglin suggest that points with a modified Z-score greater than 3.5 are possible outliers. Let's apply it to `income`:
```{r}
# calculate median of the absolute dispersion for income
ymad <- mad(na.omit(sdat$income))
# calculate z-score
zs <- (sdat$income - mean(na.omit(sdat$income)))/ymad
# count the number of outliers
sum(na.omit(zs > 3.5))
```
According to the modified Z-score, the variable `income` has `r sum(na.omit(zs>3.5))` outliers. Refer to [@mad1] for other ways of detecting outliers.
The impact of outliers depends on the model. Some models are sensitive to outliers, such as linear regression and logistic regression. Some are fairly robust to outliers, such as tree models and support vector machines. Also, an outlier is not necessarily wrong data; it is a real observation, so it cannot be deleted at will. If a model is sensitive to outliers, we can use the _spatial sign transformation_ [@ssp] to minimize the problem. It projects the original sample points onto the surface of a sphere by:
$$x_{ij}^{*}=\frac{x_{ij}}{\sqrt{\sum_{j=1}^{p}x_{ij}^{2}}}$$
where $x_{ij}$ represents the $j^{th}$ variable of the $i^{th}$ observation. As shown in the equation, every variable of sample $i$ is divided by the sample's norm; the denominator is the Euclidean distance to the center of the $p$-dimensional predictor space. Three things to pay attention to here:
1. It is important to center and scale the predictor data before using this transformation
1. Unlike centering or scaling, this manipulation of the predictors transforms them as a group
1. If there are some variables to remove (for example, highly correlated variables), do it before the transformation
Function `spatialSign()` in `caret` package can conduct the transformation. Take `income` and `age` as an example:
```{r}
# KNN imputation
sdat <- sim.dat[, c("income", "age")]
imp <- preProcess(sdat, method = c("knnImpute"), k = 5)
sdat <- predict(imp, sdat)
transformed <- spatialSign(sdat)
transformed <- as.data.frame(transformed)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
plot(income ~ age, data = sdat, col = "blue", main = "Before")
plot(income ~ age, data = transformed, col = "blue", main = "After")
```
Some readers may have noticed that the above code does not seem to standardize the data before the transformation. Recall from the introduction of KNN that `preProcess()` with `method = "knnImpute"` standardizes the data by default.
## Collinearity {#collinearity}
Collinearity is probably the technical term best known by non-technical people. When two predictors are very strongly correlated, including both in a model may lead to confusion or a singular-matrix problem. The `corrplot` package has an excellent function of the same name, `corrplot()`, that can visualize the correlation structure of a set of predictors. The function has an option to reorder the variables in a way that reveals clusters of highly correlated ones.
```{r, warning=FALSE}
# select non-survey numerical variables
sdat <- subset(sim.dat, select = c("age", "income", "store_exp",
"online_exp", "store_trans", "online_trans"))
# use bagging imputation here
imp <- preProcess(sdat, method = "bagImpute")
sdat <- predict(imp, sdat)
# get the correlation matrix
correlation <- cor(sdat)
# plot
par(oma = c(2, 2, 2, 2))
corrplot.mixed(correlation, order = "hclust", tl.pos = "lt",
upper = "ellipse")
```
<!--Here use `corrplot.mixed()` function to visualize the correlation matrix (figure \@ref(fig:corp)).-->
The closer a correlation is to 0, the lighter the color and the closer the shape is to a circle. An elliptical shape means the correlation is not equal to 0 (because we set `upper = "ellipse"`), and the greater the correlation, the narrower the ellipse. Blue represents a positive correlation and red a negative one. The direction of the ellipse also changes with the correlation. The correlation coefficients are shown in the lower triangle of the matrix.
The variable relationships from the previous scatterplot matrix are clear here: the negative correlation between age and online shopping, and the positive correlation between income and purchase amount. Some correlations are very strong (for example, the correlation between `online_trans` and `age` is `r correlation['online_trans','age']`), which means the two variables contain duplicated information.
Section 3.5 of “Applied Predictive Modeling” [@APM] presents a heuristic algorithm to remove a minimum number of predictors to ensure all pairwise correlations are below a certain threshold:
> (1) Calculate the correlation matrix of the predictors.
> (2) Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
> (3) Determine the average correlation between A and the other variables. Do the same for predictor B.
> (4) If A has a larger average correlation, remove it; otherwise, remove predictor B.
> (5) Repeat Steps 2-4 until no absolute correlations are above the threshold.
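To make the steps concrete, here is a from-scratch sketch of the heuristic above (for illustration only; the function name `remove_high_cor` is made up, and `caret::findCorrelation()` handles ties and edge cases more carefully):

```r
# a simplified implementation of the pairwise-correlation filter
remove_high_cor <- function(x, cutoff = 0.7) {
  cm <- abs(cor(x))
  diag(cm) <- 0
  removed <- character(0)
  while (max(cm) > cutoff) {
    # step 2: find the pair with the largest absolute correlation
    idx <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    a <- rownames(cm)[idx[1]]
    b <- colnames(cm)[idx[2]]
    # steps 3-4: drop whichever has the larger average absolute correlation
    drop_var <- if (mean(cm[a, ]) >= mean(cm[b, ])) a else b
    removed <- c(removed, drop_var)
    cm <- cm[rownames(cm) != drop_var, colnames(cm) != drop_var, drop = FALSE]
  }
  removed
}
# e.g. remove_high_cor(sdat, cutoff = 0.7)
```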
The `findCorrelation()` function in package `caret` will apply the above algorithm.
```{r}
(highCorr <- findCorrelation(cor(sdat), cutoff = 0.7))
```
It returns the indices of the columns that need to be deleted. It tells us that we need to remove the $2^{nd}$ and $6^{th}$ columns to make sure all correlations are below 0.7.
```r
# delete highly correlated columns
sdat <- sdat[-highCorr]
# check the new correlation matrix
(cor(sdat))
```
The absolute values of the elements in the correlation matrix after removal are all below 0.7. How strong does a correlation have to be before you should start worrying about multicollinearity? There is no easy answer to that question. You can treat the threshold as a tuning parameter and pick the one that gives you the best prediction accuracy.
## Sparse Variables
Besides highly correlated predictors, predictors with degenerate distributions can cause problems too. Removing those variables can significantly improve the performance and stability of some models (such as linear regression and logistic regression; tree-based models are impervious to this type of predictor). One extreme example is a variable with a single value, called a zero-variance variable. Variables with a very low frequency of unique values are near-zero-variance predictors. In general, detecting those variables follows two rules:
- The fraction of unique values over the sample size
- The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value.
The `nearZeroVar()` function in the `caret` package can filter near-zero-variance predictors according to the above rules. To show the usage of the function, let's arbitrarily add some problematic variables to the original data `sim.dat`:
```{r}
# make a copy
zero_demo <- sim.dat
# add two sparse variables:
# zero1 has only one unique value; zero2 has a single 1 and the rest are 0s
zero_demo$zero1 <- rep(1, nrow(zero_demo))
zero_demo$zero2 <- c(1, rep(0, nrow(zero_demo) - 1))
```
The function will return a vector of integers indicating which columns to remove:
```{r}
nearZeroVar(zero_demo,freqCut = 95/5, uniqueCut = 10)
```
As expected, it returns the two columns we generated, and you can go ahead and remove them. Note that the two arguments `freqCut =` and `uniqueCut =` correspond to the two rules above:
- `freqCut`: the cutoff for the ratio of the most common value to the second most common value
- `uniqueCut`: the cutoff for the percentage of distinct values out of the number of total samples
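To connect these arguments to the rules, here is a quick manual check of the `zero2` column (a sketch; the thresholds shown are the defaults used above):

```r
# frequency ratio: most common value (999 zeros) vs. second most common (one 1)
tab <- sort(table(zero_demo$zero2), decreasing = TRUE)
freq_ratio <- as.numeric(tab[1] / tab[2])      # 999, far above 95/5 = 19
# percent of unique values out of the sample size
pct_unique <- 100 * length(unique(zero_demo$zero2)) / nrow(zero_demo)  # 0.2, below 10
c(freq_ratio = freq_ratio, pct_unique = pct_unique)
```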
## Re-encode Dummy Variables
A dummy variable is a binary variable (0/1) that represents a subgroup of the sample. Sometimes we need to recode categories into smaller bits of information called "dummy variables." For example, some questionnaires have five options for each question: A, B, C, D, and E. After you get the data, you will usually convert the categorical variable for each question into five dummy variables and then use one of the options as the baseline.
Let's encode `gender` and `house` from `sim.dat` as dummy variables. There are two ways to implement this. The first is to use `class.ind()` from the `nnet` package. However, it only works on one variable at a time.
```{r}
dumVar <- nnet::class.ind(sim.dat$gender)
head(dumVar)
```
Since it is redundant to keep both, we need to remove one of them when modeling. Another more powerful function is `dummyVars()` from `caret`:
```{r}
# use "origional variable name + level" as new name
dumMod <- dummyVars(~gender + house + income,
data = sim.dat,
levelsOnly = F)
head(predict(dumMod, sim.dat))
```
`dummyVars()` uses the formula format. The variables on the right-hand side can be both categorical and numeric; a numerical variable is kept unchanged. The advantage is that you can apply the function to a data frame without removing the numerical variables. The function can also create interaction terms:
```{r}
dumMod <- dummyVars(~gender + house + income + income:gender,
data = sim.dat,
levelsOnly = F)
head(predict(dumMod, sim.dat))
```
If you think the impact of income on purchasing behavior is different for males and females, then you may add an interaction term between `income` and `gender`, which you can do by adding `income:gender` to the formula.
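Finally, if you want the encoding to drop one level per factor automatically (the baseline mentioned earlier), `dummyVars()` has a `fullRank` argument; here is a quick sketch (the object name `dumModFull` is made up):

```r
# fullRank = TRUE keeps k-1 dummy columns for a factor with k levels
dumModFull <- dummyVars(~ gender + house, data = sim.dat, fullRank = TRUE)
head(predict(dumModFull, sim.dat))
```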