diff --git a/docs/blog/240127_dvc/240127_dvc.md b/docs/blog/240127_dvc/240127_dvc.md
new file mode 100644
index 0000000..30be130
--- /dev/null
+++ b/docs/blog/240127_dvc/240127_dvc.md
@@ -0,0 +1,63 @@
+# Using DVC's pipelines
+
+## Introduction
+
+I sometimes find myself needing to run a bunch of scripts to download, process, format, and clean data. In particular, I am working on a personal project downloading all my data from Garmin Connect. There are a number of open source projects that do this and they seem to be in a constant state of flux. I might be able to use one of them today, but Garmin might change their API and break the project, or it might fall out of maintenance. I hope to be able to set something up that will allow me to download my data in whichever way works, clean/transform it to a common format, and then use that for whatever analysis I want to do.
+
+I will be exploring whether [DVC](https://dvc.org/) can help me with this. DVC is a data version control tool that allows you to track changes to data and code. It's (supposed to be) similar to git but for data. Most commonly, it is used for data science and machine learning projects where large files and datasets are used. This use case is slightly different, so we'll see how it goes.
+
+I found [this blog post](https://realpython.com/python-data-version-control/) which gives a very good introduction to DVC.
+
+## Initial setup
+
+First, install DVC. Instructions are [here](https://dvc.org/doc/install). I'm using a Mac and I installed it with Homebrew.
+
+## Creating a project
+
+In my git repo, I ran
+
+```bash
+dvc init # initialize dvc
+
+dvc config core.analytics false # disable analytics
+```
+
+I will use Google Drive as my remote storage:
+
+```bash
+dvc remote add -d garmining gdrive://1OY48YEepyOaLUjVt1k3ox2Gi8tDYAzvv
+```
+
+After having downloaded the data, I ran
+
+```bash
+dvc add data/garmin-backup/$USERNAME
+```
+
+where `$USERNAME` is my Garmin username. This creates a `USERNAME.dvc` file for the directory and creates/appends `/USERNAME` to a `.gitignore` file next to the directory (not inside it). These are added/committed/pushed to git:
+
+```bash
+git add data/garmin-backup/$USERNAME.dvc data/garmin-backup/.gitignore
+```
+
+Now we are ready to push the data to the remote storage. If using Google Drive, you will need to authenticate:
+
+```bash
+dvc push
+```
+
+## Setting up a pipeline
+
+Let's say I back up my data again in a couple of weeks. The data in the directory will have changed and I will want to run my scripts again to clean/transform the data. I will create a pipeline to do this.
+
+The first step I will script is downloading the data. Here I have used [garminexport](https://github.com/petergardfjall/garminexport) in a shell script. I will call this first stage in the pipeline `backup`. I will create a `dvc.yaml` file in the root of my project:
+
+```yaml
+stages:
+  backup:
+    cmd: ./scripts/backup.sh
+    deps:
+      - scripts/backup.sh
+    outs:
+      - data/garmin-backup/SimonAgren
+```