63 changes: 63 additions & 0 deletions docs/blog/240127_dvc/240127_dvc.md
# Using DVC's pipelines

## Introduction

I sometimes find myself needing to run a bunch of scripts to download, process, format, and clean data. In particular, I am working on a personal project to download all my data from Garmin Connect. There are a number of open source projects that do this, and they seem to be in a constant state of flux. One of them may work today, but Garmin could change their API and break it, or the project could fall out of maintenance. I hope to set something up that lets me download my data by whatever means currently works, clean and transform it to a common format, and then use that for whatever analysis I want to do.

I will be exploring whether [DVC](https://dvc.org/) can help me with this. DVC is a data version control tool that lets you track changes to data alongside code. It's (supposed to be) similar to git, but for data. It is most commonly used in data science and machine learning projects that involve large files and datasets. My use case is slightly different, so we'll see how it goes.

I found [this blog post](https://realpython.com/python-data-version-control/) which gives a very good introduction to DVC.

## Initial setup

First, install DVC. Instructions are [here](https://dvc.org/doc/install). I'm using a Mac and I installed it with Homebrew.

## Creating a project

In my git repo, I ran

```bash
dvc init # initialize dvc

dvc config core.analytics false # disable analytics
```

I will use Google Drive as my remote storage:

```bash
dvc remote add -d garmining gdrive://1OY48YEepyOaLUjVt1k3ox2Gi8tDYAzvv
```

After having downloaded the data, I ran

```bash
dvc add data/garmin-backup/$USERNAME
```

where `$USERNAME` is my Garmin username. This creates a `$USERNAME.dvc` file next to the directory and adds a `/$USERNAME` entry to the `.gitignore` file in the parent directory (creating the file if needed), so git ignores the data itself. I then add these two files to git, to be committed and pushed as usual:

```bash
git add data/garmin-backup/$USERNAME.dvc data/garmin-backup/.gitignore
```

Now we are ready to push the data to the remote storage. With Google Drive, you will be asked to authenticate the first time:

```bash
dvc push
```

## Setting up a pipeline

Let's say that in a couple of weeks I back up my data again. The data in the directory will have changed, and I will want to run my scripts again to clean and transform it. I will create a pipeline to do this.

The first step to script is downloading the data, for which I use [garminexport](https://github.com/petergardfjall/garminexport) wrapped in a shell script. I will call this first pipeline stage `backup` and define it in a `dvc.yaml` file in the root of my project:

```yaml
stages:
  backup:
    cmd: ./scripts/backup.sh
    deps:
      - scripts/backup.sh
    outs:
      - data/garmin-backup/SimonAgren
```
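
To run the stage later, something like the following could work. This is only a sketch: `refresh_backup` is a helper name I made up, and it assumes the output directory is tracked by the pipeline alone rather than also by the earlier `.dvc` file, since DVC does not accept the same output being declared in two places. Because the stage's only dependency is the script itself, a plain `dvc repro` would consider it up to date, so the sketch forces a re-run.

```shell
# Hypothetical helper (the name is mine, not part of DVC): re-run the
# backup stage and sync the results.
refresh_backup() {
    # The stage's only dependency is the script itself, so a plain
    # `dvc repro` would treat it as up to date; --force runs it anyway.
    dvc repro --force backup || return 1
    # Upload the refreshed outputs to the default remote.
    dvc push || return 1
    # dvc.lock records the hashes of what the stage produced.
    git add dvc.yaml dvc.lock
}
```

Calling `refresh_backup` from the root of the repo would then leave `dvc.yaml` and `dvc.lock` staged, ready for a normal git commit.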