ancient-artifacts/10-load-data.Rmd at master · VandyDataScience/ancient-artifacts · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
title: "10-load-data"
output:
   html_notebook:
      toc: yes
      toc_depth: 3
      toc_float: yes
---

The purpose of this notebook is to load and clean the data.  It should also test the data (e.g., by assertr) to ensure that assumptions about the data are met, and store the cleaned and tested data.

# Basic library imports
Here, we use the package `pacman` in order to load packages.  Something that's great about this package is if the package isn't locally installed on your own system, instead of throwing errors, pacman will automatically install the package and then load it.
```{r import packages, results='hide'}
if (!require("pacman"))
   install.packages("pacman")

pacman::p_load(janitor, assertr, tidyverse, forcats, fs)
```

# Global saving settings
Reading data from Box assumes that everyone has Box Desktop (i.e., Box Sync, Box Drive) installed in the default file location (~/Box).  The following code sets that directory for loading and saving shared models.
```{r box save directory}
box_base <- path.expand("~/../Box/DSI_AncientArtifacts/")
```

# Reading in files
In this step, we'll read in the data.
```{r read data, results='hide'}
LithicData <- read_csv("LithicExperimentalData.csv") %>%
   clean_names() %>%
   mutate(particle_class = 'exp')
ASData <- read_csv("ArchaeologicalSoilData.csv") %>%
   clean_names() %>%
   mutate(particle_class = 'site')
```

Something interesting about these files is that there are actually 2 header lines in these csv files.  The first line is the normal header, and the second line is a descriptor regarding the units of the measurement.  In the following step, we strip out that second line of units.

```{r clean up first row}
# clean up first row of data and add class information
LithicData <- LithicData[-c(1),]
ASData <- ASData[-c(1),]
```

Now, we combine the data into a single file, fix the resultant data types, and set up the factors in the correct order for modeling.
```{r combine data}
artifact_data <- LithicData %>%
   bind_rows(ASData) %>%
   mutate(across(c(-starts_with('Filter'), -particle_class), as.double)) %>%
   mutate(id = as.character(id),
          img_id = as.character(img_id),
          particle_class = fct_relevel(as.factor(particle_class), 'exp', 'site')) %>%
   dplyr::select(particle_class, everything())
```


```{r, purl=FALSE}
artifact_data %>% glimpse()
```
This looks generally correct with ambiguity surrounding the filter variables a well as the hash.  These could be useful, but we should check with the collaborator.

Here I removed all data points greater than 6.0 mm or less than 0.125mm for f_width. This reduces the dataset to only microdebitage.

```{r}
artifact_data <- filter(artifact_data, f_width < 6.0 & f_width > 0.125)
#filter(artifact_data, fiber_length < 0.3)
```

Here's a quick check on the factor order of site:
```{r factor order check, purl=FALSE}
glimpse(artifact_data)
levels(artifact_data$particle_class)
```
Good.  We can see here that the 0th class is the "target" class.  Note that the "target" class for yardstick is the "0" class, so this is the correct ordering that we want.

# Validate data
```{r Here we will start writing asserts to validate the data.}


validate_data <- function(pred_data) {

  # Angularity: 0 (perfect circle)-180 (many sharp edges)
assert(pred_data, within_bounds(0,180), angularity,
       description = "Values must be within 0-180 range.")

# Circularity: 0-1 (perfect circle)
assert(pred_data, within_bounds(0,1), circularity,
       description = "Values must be within 0-1 range.")

# Solidity: 0-1 (very smooth surface)
assert(pred_data, within_bounds(0,1), solidity,
       description = "Values must be within 0-1 range.")

# Transparency: 0 (least transparent)-1 (most transparent)
assert(pred_data, within_bounds(0,1), transparency,
       description = "Values must be within 0-1 range.")

# T/W Ratio: 0-1 (represents a sphere)
assert(pred_data, within_bounds(0,1), t_w_ratio,
       description = "Values must be within 0-1 range.")

# Sphericity: 0-1 (perfect circle)
assert(pred_data, within_bounds(0,1), sphericity,
       description = "Values must be within 0-1 range.")

# Concavity: 0-1 (rough, spikey surface)
assert(pred_data, within_bounds(0,1), concavity,
       description = "Values must be within 0-1 range.")

# Convexity: 0-1 (smooth)
assert(pred_data, within_bounds(0,1), convexity,
       description = "Values must be within 0-1 range.")

#L/W Aspect Ratio: 1 (sphere)-infinity
assert(pred_data, within_bounds(1,Inf), l_w_ratio,
       description = "Values must be within 1-infinity range.")

# W/T Ratio: 1 (sphere)-infinity
assert(pred_data, within_bounds(1,Inf), w_t_ratio,
       description = "Values must be within 1-infinity range.")

# Verify that f_width always is between 0.125-6 range
assert(pred_data, within_bounds(0.125,6), f_width,
       description = "Values must be within 0.125-6 range." )

}
validate_data(artifact_data)

# Row distinctiveness
is_distinct<- tryCatch({
   assert_rows(artifact_data, col_concat, is_uniq, c(-id, -img_id, -fiber_length, -fiber_width))
   return(TRUE)
   }, error=function(err) {
      message(err)
      return(FALSE)
   })

# Remove rows if they're not distinct
if(!is_distinct){
   message('\nOriginal size of data is: ', nrow(artifact_data))
   message('Dropping all rows which are not distinct...\n')
   artifact_data <- artifact_data %>%
      distinct(across(c(everything(), -id, -img_id, -fiber_length, -fiber_width)), .keep_all=TRUE)
   message('\nModified size of data is: ', nrow(artifact_data))
}

```

# Questions
Here, we may want to know two things:
1. What are those `filterx` variables?  Should we keep them in the dataset?
2. What is the `hash` variable?  Should we keep it in the dataset?

Otherwise, we see that we have 49 columns and a combined total of 78,612 rows, where 73,313 are soil data and 5,299 are of lithic microdebitage.