RepData_PeerAssessment1/PA1_template.Rmd at master · evgaster/RepData_PeerAssessment1 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---

## Introduction
It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These type of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized both because the raw data are hard to obtain and there is a lack of statistical methods and software for processing and interpreting the data.

Here we make use of data from a personal activity monitoring device. This device collects data at 5 minute intervals through out the day. The data consists of two months of data from an anonymous individual collected during the months of October and November, 2012 and include the number of steps taken in 5 minute intervals each day.

## Loading and preprocessing the data
The data is available on [https://github.com/evgaster/RepData_PeerAssessment1] in the activity.zip file. You need to make that available locally for reading into R and further processing.
```{r}
unzip("activity.zip")
d <- read.csv("activity.csv")
str(d)
```

The attribute *interval* seems to be represented as a single number encoding the hours and the minutes into the hour. The encoding looks like interval = hour \* 100 + minutes. Later on we need to work with date and interval in various *time* ways.
```{r}
d$timestamp <- as.POSIXlt(d$date)
d$timestamp$hour <- d$interval %/% 100
d$timestamp$min <- d$interval %% 100
d$time <- sprintf("%0.2d:%0.2d", d$timestamp$hour, d$timestamp$min)
```


## What is mean total number of steps taken per day?
The total number of steps taken per day is calculated by summing the value of steps over all intervals of a day.

Note that we are ignoring missing values here. So strickly speeking these are not *total number of steps per day* but *total number of **reported** steps per day*.
```{r}
totalStepsByDay <- aggregate(steps ~ date, data = d, sum)
```

The histogram below gives an impression of how many days (y axis: Frequency) a certain number of steps (x axis: Total of reported steps per day) are reported.
```{r}
hist(totalStepsByDay$steps,
     main = "",
     xlab = "Total of reported steps per day"
     )
```

```{r}
meanOftotalStepsByDay <- as.integer(round(mean(totalStepsByDay$steps)))
medianOftotalStepsByDay <- as.integer(round(median(totalStepsByDay$steps)))
```
The mean of the total steps per day = `r meanOftotalStepsByDay`, its median = `r medianOftotalStepsByDay`.


## What is the average daily activity pattern?
An other view on the data is the number of steps over the course of the day. The average daily activity pattern is calculated by taking the mean of the number of reported steps per 5-minute interval, over all days.
```{r}
averageOfStepsByInterval <- aggregate(steps ~ time, data = d, mean)
```

The figure below gives an impression of this activity pattern (y axis: Average number of steps) over the day (x axis: Interval time).
```{r}
averageOfStepsByInterval$timestamp <- as.POSIXct(averageOfStepsByInterval$time, format = "%H:%M")
plot(averageOfStepsByInterval$timestamp, averageOfStepsByInterval$steps,
     type = "l",
     xlab = "Interval time",
     ylab = "Average number of steps"
     )
```

```{r}
# At what time of the day was this person the most active (on average over all days)?
maxAverageOfStepsIntervalTime <- averageOfStepsByInterval$timestamp[which.max(averageOfStepsByInterval$steps)]
maxAverageOfStepsInterval <- format(maxAverageOfStepsIntervalTime, format = "%H:%M")
```
The activity has its max at `r maxAverageOfStepsInterval`.

## Imputing missing values
There are a number of days/intervals where there are missing values.
```{r}
countOfMissingValues <- sum(is.na(d$steps))
```
The number of entries with missing steps value = `r countOfMissingValues`.

There are many approaches to handling missing values. Here we subsitute a missing value by the mean of the number of reported steps per 5-minute interval, over all days.
```{r}
m <- merge(d, averageOfStepsByInterval, by = "time")
m$steps.x[is.na(m$steps.x)] <- m$steps.y[is.na(m$steps.x)]

m <- data.frame(m$steps.x, m$date, m$interval, m$timestamp.x, m$time)
names(m) <- names(d)
```

The total number of steps taken per day is calculated again by summing the value of steps over all intervals of a day.
```{r}
totalStepsByDay <- aggregate(steps ~ date, data = m, sum)
```

The figure below gives an impression of how many days (y axis: Frequency) a certain number of steps (x axis: Total of  steps per day) are reported or **imputed**
```{r}
hist(totalStepsByDay$steps,
     main = "With imputing for missing steps data",
     xlab = "Total of steps per day"
     )
```


```{r}
meanOftotalStepsByDay <- as.integer(round(mean(totalStepsByDay$steps)))
medianOftotalStepsByDay <- as.integer(round(median(totalStepsByDay$steps)))
```
The mean of the total steps per day = `r meanOftotalStepsByDay`, its median = `r medianOftotalStepsByDay`.
This hardly differs from the values obtained previously (without imputing missing values).

Why this is become immediatly clear when we look at where the values are missing.
```{r}
table(d[is.na(d$steps),]$date)
```
This shows that only complete days are missing. Those days get the average of the other days imputed. So they become an average day. And where does an average day ends up in the histogram?

## Are there differences in activity patterns between weekdays and weekends?

To see if the activity pattern differs between the typical work and weekend days we need to make that piece of information explicit.
```{r}
m$dayType <- "weekday"
m$dayType[weekdays(m$timestamp) %in% c("Saturday", "Sunday")] <- "weekend"
m$dayType <- as.factor(m$dayType)
```

The average daily activity pattern is calculated by taking the mean of the number of reported steps per 5-minute interval, over the different types of days.
```{r}
averageOfStepsByInterval <- aggregate(steps ~ time + dayType, data = m, mean)
```

The figure below gives an impression of the activity pattern (y axis: Number of steps) over the day (x axis: Interval) for both weekdays and weekend.
```{r}
averageOfStepsByInterval$timestamp <- as.POSIXct(averageOfStepsByInterval$time, format = "%H:%M")

library(lattice)
p <- xyplot(steps ~ timestamp | dayType,
            averageOfStepsByInterval,
            type = "l",
            xlab = "Interval",
            ylab = "Number of steps",
            layout = c(1, 2),
            scales = list(x = list(format = "%H:%M"))
            )
print(p)
```

In the weekend the activity is more spread out. On weekdays it peaks in the morning rush hour.