Px-R-book/import_data.qmd at main · Thranholm/Px-R-book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
---
title: "Importing data"
format: html
---

## Packages for data import

There is a wide range of R packages for importing all different data formats to R. R also has its own data format called ".rds". Datasets can be written and read respectively via the R functions `saveRDS()` and `readRDS()`. However, you will rarely find data saved in rds format on a webpage where you can download it. R has a package for importing almost any data format, and if in doubt, ask your favorite chatbot for help. The sections below go through the most common data formats.

### Delimited files (`readr`)

The package [readr](https://readr.tidyverse.org/), which is also part of the [tidyverse](https://www.tidyverse.org/), has a lot of functions to deal with CSV files and other delimited files. The following code shows examples of the most commonly used functions from the `readr` package. Note the difference between `read_csv()` (comma delimiter) and `read_csv2()` (semicolon delimiter).

```{r}
#| eval: false
#| label: delimters
library(readr)

# Read CSV files, comma separation
data <- read_csv("file.csv")

# Read files with semicolon separation (common in European data)
data <- read_csv2("file.csv")

# Read tab-separated files
data <- read_tsv("file.tsv")

# Read files with custom delimiters, here a pipe: "|"
data <- read_delim("file.txt", delim = "|")

```

The `readr` functions automatically detect column types and handle common issues like missing values. You can specify column types explicitly using the `col_types` argument:

```{r}
#| eval: false
#| label: ext_csv
data <- read_csv("file.csv",
                 col_types = cols(
                   id = col_integer(),
                   name = col_character(),
                   date = col_date(),
                   value = col_double()
                 ))
```

### Excel files (`readxl`)

For Excel files (.xlsx and .xls), the [readxl](https://readxl.tidyverse.org/) package can be used and is also part of the tidyverse ecosystem. There are a few alternatives as well: the [`xlsx` package](https://colearendt.github.io/xlsx/index.html) and the [`openxlsx` package](https://ycphs.github.io/openxlsx/index.html#openxlsx-). If you have to write data to Excel or load formatted sheets, it might be relevant to look at especially `openxlsx`, but for reading simple Excel files, `readxl` should do the trick.

```{r}
#| eval: false
#| label: excel
library(readxl)

# Read the first sheet
data <- read_excel("file.xlsx")

# Read a specific sheet by name or number
data <- read_excel("file.xlsx", sheet = "Sheet2")
data <- read_excel("file.xlsx", sheet = 2)

# Read a specific range
data <- read_excel("file.xlsx", range = "A1:D10")

# Skip rows (useful for files with headers or metadata)
data <- read_excel("file.xlsx", skip = 3)
```

You can also list all sheet names in an Excel file:

```{r}
#| eval: false
#| label: sheets
excel_sheets("file.xlsx")
```

### Statistical software formats (`haven`)

The [haven](https://haven.tidyverse.org/) package imports data from SPSS, Stata, and SAS.

```{r}
#| eval: false
#| label: haven
library(haven)

# Stata files
data <- read_dta("file.dta")

# SAS files
data <- read_sas("file.sas7bdat")

# SPSS files
data <- read_sav("file.sav")

```

These functions preserve variable labels and value labels from the original statistical software. This can also make the datasets time-consuming to import. The `n_max` argument can be used to only import part of a dataset. Afterwards, the `col_select` argument can be used to read only the necessary columns. The following code shows an example of this workflow.

```{r}
#| eval: false
#| label: obs1

# Reading only first observation to inspect columns of the data
data_obs1 <- read_dta("file.dta", n_max = 1)

# Reading the necessary columns
data <- read_dta("file.dta", col_select = c("id", "age", "sex", "education", "income", "district"))

```

### JSON files (`jsonlite`)

JSON (JavaScript Object Notation) is increasingly common for data exchange, especially from APIs. The [jsonlite](https://arxiv.org/abs/1403.2805) package handles JSON data efficiently.

```{r}
#| eval: false
#| label: json
library(jsonlite)

# Read JSON from file
data <- fromJSON("file.json")

# Read JSON from URL
data <- fromJSON("https://api.example.com/data.json")

# Convert R object to JSON
json_string <- toJSON(data, pretty = TRUE)
```


### Large files (`data.table` and `vroom`)

For very large files, specialized packages offer better performance, especially the `data.table` and `vroom` packages.

```{r}
#| eval: false
#| label: datatable
library(data.table)
library(vroom)

# data.table's fast reader
data <- fread("large_file.csv")

# vroom for very fast reading
data <- vroom("large_file.csv")
```

## Reproducible paths and R projects

One of the most common issues when sharing R code or moving projects between computers is broken file paths. Using R Projects and the here package creates reproducible workflows that work across different operating systems and directory structures. For this reason, it is recommended to have an R project when working on your code.

### R Projects

R Projects (.Rproj files) create a self-contained workspace for your analysis. When opening an R Project, RStudio automatically sets the working directory to the project folder. This means you can use relative paths that work regardless of where the project is stored on your computer. Furthermore, you can export your folder structure and send it to another person, who will be able to run the same code through the R Project.

To create an R Project:

  1. File → New Project → New Directory → New Project
  2. Give your project a name and choose a location
  3. RStudio will create a .Rproj file and set up the folder structure

### The `here` package

The here package builds file paths relative to your project root, and therefore makes sense to use within your project. This also makes your code more portable and robust.


```{r}
#| eval: false
#| label: here
library(here)

# Instead of this (brittle, won't work on other computers):
data <- read_csv("C:/Users/YourName/Documents/my_project/data/file.csv")

# Use this (portable, works anywhere):
data <- read_csv(here("data", "file.csv"))

# The here() function builds the complete path
here("data", "file.csv")
# Returns: "/path/to/your/project/data/file.csv"
```