# Git Version Control {#sec-git}

As stated before, version control\index{version control} may be the single most important thing to take away from *Research Software Engineering* if you have not used it before.
In this chapter about the way developers work and collaborate, I will stick to version control\index{version control} with *git*. The stack\index{stack} discussion of the previous chapter features a few more version control systems, but given git's dominant position, we will stick solely to git in this introduction to version control.

## What Is Git Version Control?

Git\index{git} is a decentralized version control system. It manages different versions of your source code (and other text files) in a simple but efficient manner that has become the industry standard. The git program itself is a small console program that creates and manages a hidden folder inside the folder you put under version control\index{version control} (you know those folders with a leading dot in their name, like `.myfolder`). This hidden folder, named `.git`, keeps the complete history of your project, i.e., the current version and all versions that came before it.
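
To see the hidden folder for yourself, here is a minimal sketch. The throwaway directory created with `mktemp` is an assumption for illustration only; any project folder should behave the same way.

```bash
# Create a throwaway project folder (hypothetical location, for illustration only).
project="$(mktemp -d)"
cd "$project"

# Put the folder under version control; git creates the hidden .git folder.
git init --quiet

# List everything including hidden entries: .git shows up in the listing.
ls -a
```

Deleting the `.git` folder would remove the entire history while leaving your current files untouched, which is why it is best left alone.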

![Meaningful commit messages help to make sense of a project's history. Screenshot of a commit history on GitHub. (Source: own GitHub repository.)](images/commits.png){fig-scap="Make commit messages meaningful."}

The key to appreciating the value of git\index{git} is to appreciate the value of semantic versions. Git is *not* Dropbox or Google Drive. It does *not* sync automagically (even if some git GUI tools suggest so). GitHub Desktop^[https://desktop.github.com/], Atlassian's Sourcetree^[https://www.sourcetreeapp.com/] and TortoiseGit^[https://tortoisegit.org/] are some of the most popular GUI choices if you are not a console person. Though GUI tools may be convenient, we will use the git console throughout this book to improve our understanding. As opposed to the sync approaches mentioned above, a version control\index{version control} system allows you to summarize a contribution across files and folders based on what this contribution is about. Assume you got a cool pointer from an econometrics professor at a conference, and you incorporated her advice in your work. That advice is likely to affect different parts of your work: your text and your code. Instead of syncing each of these files based on the time you saved them, version control creates a version when you decide to bundle things together and commit the change. That version can be identified easily by its commit\index{git commit} message "incorporated advice from Anna (discussion at XYZ Conference 2020)".
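
The conference example could look roughly like the following sketch. The file names `analysis.R` and `paper.md`, their contents, and the author identity are made up for illustration; only the commit message is taken from the text above.

```bash
cd "$(mktemp -d)"                          # throwaway folder for this demo
git init --quiet
git config user.name "Example Author"      # local identity, only for this demo
git config user.email "author@example.org"

# A first version of code and text.
echo "model <- lm(y ~ x)" > analysis.R
echo "We estimate a linear model." > paper.md
git add analysis.R paper.md
git commit --quiet -m "first draft of model and text"

# Incorporate the advice in both code and text ...
echo "model <- lm(y ~ x + z)" > analysis.R
echo "We estimate a linear model and control for z." > paper.md

# ... and bundle both changes into one semantic version.
git add analysis.R paper.md
git commit --quiet -m "incorporated advice from Anna (discussion at XYZ Conference 2020)"

# One entry per version, plus the files each version touched.
git log --oneline --name-only
```

Both files end up in the same commit, no matter when each of them was saved.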


## Why Use Version Control in Research?

A workflow based on version control\index{version control} is a path to your goals that consists of semantically relevant steps rather than semantically meaningless chunks based on the time you saved them.

In other, more blatant and applied words: naming files like `final_version_your_name.R` or `final_final_correction_collaboratorX_20200114.R`
is like naming your WiFi `dont_park_the_car_in_the_frontyard` or `be_quiet_at_night` to communicate with your neighbors. Information is supposed to be sent in a message, not a file name. With version control\index{version control}, it is immediately clear what the most current version is, no matter the file name. There is no room for interpretation. There is no need to start guessing about the delta between the current version and another version.

Also, you can easily try out different scenarios on different branches and merge them back together if you need to. Version control\index{version control} is a well-established industry standard in software development. And it is relatively easy to adopt. With datasets growing in complexity, it is only natural to improve management of the code that processes these data.

Academia has probably been the only place that would allow you to dive into hacking at somewhat complex problems for several years without ever taking notice of version control\index{version control}. As a social scientist who mostly collaborates in small groups and writes a moderate amount of code, have you ever thought about how to collaborate with more than 100 people in a big software project? Or how to manage 10,000 lines of code and beyond? Version control is an important reason these things work. And it's been around for decades. But enough of the rant.


## How Does Git Work?

This introduction tries to narrow things down to the commands that you'll need if you want to use git\index{git} in a fashion similar to what you learn from this book. If you are looking for more comprehensive, general guides, three major git platforms, namely, Atlassian's Bitbucket\index{Bitbucket}, GitHub\index{GitHub} and GitLab\index{GitLab}, offer comprehensive introductions as well as advanced articles or videos to learn git online.


The first important implication of decentralized version control\index{version control} is that all versions are stored on the local machines of every collaborator, not just on a remote server (this is also a nice, natural backup of your work). So let's consider a single local machine first.

![Schematic illustration of an example git workflow including remote repositories. (Source: own illustration.)](images/decentralized.png){fig-scap="Illustration of an example git workflow."}


Locally, a git repository consists of a *checkout*, which is also called the current *working copy*. This is the state of the files that your file explorer or your editor will see when you use them to open a file. To check out a different version, you refer to a commit by its unique commit hash and check out that particular version.
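
A minimal sketch of checking out an older version by its hash follows. The file `notes.txt` and the identity settings are assumptions for this demo; the `git symbolic-ref` line merely makes sure the initial branch is called `main` regardless of your local git configuration.

```bash
cd "$(mktemp -d)"                          # throwaway folder for this demo
git init --quiet
git symbolic-ref HEAD refs/heads/main      # name the initial branch main
git config user.name "Example Author"
git config user.email "author@example.org"

echo "version 1" > notes.txt
git add notes.txt
git commit --quiet -m "first version of notes"
first_commit="$(git rev-parse HEAD)"       # remember the hash of this version

echo "version 2" > notes.txt
git commit --quiet -am "second version of notes"

# Check out the older version by its hash: the working copy changes on disk.
git checkout --quiet "$first_commit"
cat notes.txt                              # prints: version 1

# Return to the tip of the branch.
git checkout --quiet main
cat notes.txt                              # prints: version 2
```

In practice you would look up the hash with `git log` rather than store it in a variable.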

If you want to add new files to version control\index{version control} or bundle changes to some existing files into a new commit, add these files to the staging area, so they get committed the next time a commit is triggered. Finally, committing all these staged changes creates a new version under a new commit ID.
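
The role of the staging area can be sketched as follows. File names and identity settings are again made up for illustration.

```bash
cd "$(mktemp -d)"                          # throwaway folder for this demo
git init --quiet
git config user.name "Example Author"
git config user.email "author@example.org"

echo "draft" > paper.md
echo "x <- 1" > analysis.R

git status --short     # both files are untracked: ?? analysis.R and ?? paper.md
git add paper.md       # stage only the text
git status --short     # A  paper.md (staged), ?? analysis.R (still untracked)

# Only what is staged becomes part of the new version.
git commit --quiet -m "added first draft of paper"
git status --short     # only ?? analysis.R is left
```

Staging lets you decide what belongs to a version, instead of committing whatever happens to have changed.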


## Moving Around

So let's actually do it. Here's a three-stage walk-through of git commands that should have you covered in most use cases a researcher will face. Note that git has some pretty good error messages that guess what could have gone wrong. Make sure to read them carefully. Even if you can't make sense of them, your online search will be a lot more efficient when you include these messages.

**Stage 1: Working Locally**

@tbl-gitbasics summarizes essential git commands\index{git commands} to move around your local repository.

```{r gittable, eval=TRUE,message=FALSE, echo=FALSE}
#| label: tbl-gitbasics
#| tbl-cap: "Basic Commands for Working with Git"

library(kableExtra)
d <- data.frame("Command" = c("git init",
                              "git status",
                              "git add filename.py",
                              'git commit -m "meaningful msg"',
                              "git log",
                              "git checkout some-commit-id",
                              "git checkout main-branch-name"),
                "Effect" = c("puts current directory and all its subdirs under version control",
                             "shows state of working copy and staging area",
                             "adds file to the staging area (new files become tracked)",
                             "creates a new version/commit out of all staged changes",
                             "shows log of all commit messages on a branch",
                             "goes to commit, but in detached HEAD state",
                             "leaves detached HEAD state, goes back to latest commit on branch"),
                stringsAsFactors = FALSE)
kable(d,"latex", booktabs = TRUE,
      escape = FALSE) |>
      kable_styling(latex_options = c("repeat_header")) |>
      column_spec(2, width = "5.7cm")

```

**Stage 2: Working with a Remote Repository**

Though git can be tremendously useful even without collaborators, the real fun starts
when working together. The first step en route to getting others involved is to add a remote repository. @tbl-gittable2 shows essential commands\index{git commands} for working with a remote repository.

```{r gittable2, eval=TRUE,message=FALSE, echo=FALSE}
#| label: tbl-gittable2
#| tbl-cap: "Commands for Working with a Remote Repository"
library(kableExtra)
d <- data.frame("Command" = c("git clone",
                              "git pull",
                              "git push",
                              "git fetch",
                              "git remote -v",
                              "git remote set-url origin https://some-url.com"),
                "Effect" = c("creates a new repo based on a remote one",
                             "fetches changes from the linked remote repo and merges them",
                             "uploads local commits to the remote repo",
                             "downloads new commits and branches from remote without merging",
                             "shows remote repo URL",
                             "sets URL to remote repo"),
                stringsAsFactors = FALSE)
kable(d,"latex", booktabs = TRUE,
      escape= FALSE) |>
      kable_styling(latex_options = c("repeat_header")) |>
      column_spec(1:2, width = "5.7cm")

```
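
To try these commands without an account on any platform, a local *bare* repository can stand in for the remote server. The following is a sketch under that assumption; the collaborators Alice and Bob and all file contents are hypothetical. Cloning the still empty remote prints a harmless warning.

```bash
workdir="$(mktemp -d)"
cd "$workdir"

# A bare repository plays the role of the remote server (assumption for this demo).
git init --quiet --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/main

# Alice clones the (still empty) remote, commits, and pushes.
git clone --quiet remote.git alice
cd alice
git symbolic-ref HEAD refs/heads/main      # name the initial branch main
git config user.name "Alice"
git config user.email "alice@example.org"
echo "# Our Project" > README.md
git add README.md
git commit --quiet -m "added README"
git push --quiet -u origin main
git remote -v                              # shows where origin points to

# Bob clones later and receives Alice's commit.
cd "$workdir"
git clone --quiet remote.git bob

# Alice pushes another commit ...
cd "$workdir/alice"
echo "More details." >> README.md
git commit --quiet -am "extended README"
git push --quiet

# ... and Bob pulls it.
cd "$workdir/bob"
git pull --quiet --ff-only
git log --oneline                          # lists both commits
```

With a hosted platform, only the URL passed to `git clone` changes; the rest of the workflow stays the same.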

**Stage 3: Branches**

Branches are derivatives of the main branch that allow you to work on different features at the same time
without stepping on someone else's toes. Through branches, repositories can actively maintain different states. @tbl-gitbranches shows commands\index{git commands} to navigate these states.


```{r, eval=TRUE,message=FALSE, echo=FALSE}
#| label: tbl-gitbranches
#| tbl-cap: "Commands for Handling Multiple Branches"

library(kableExtra)
d <- data.frame("Command" = c("git checkout -b branchname",
                              "git branch",
                              "git checkout branchname",
                              "git switch branchname",
                              "git merge branchname"),
                "Effect" = c("creates new branch named branchname and switches to it",
                             "shows locally available branches",
                             "switches to branch named branchname",
                             "switches to branch named branchname",
                             "merges branch named branchname into current branch"
                ),
                stringsAsFactors = FALSE)
kable(d,"latex", booktabs = TRUE,
      escape= FALSE) |>
      kable_styling(latex_options = c("repeat_header")) |>
      column_spec(1:2, width = "5.7cm")

```
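
Put together, the branch commands could be used as in the following sketch. The branch name `new-plot` and the files are made up for illustration.

```bash
cd "$(mktemp -d)"                          # throwaway folder for this demo
git init --quiet
git symbolic-ref HEAD refs/heads/main      # name the initial branch main
git config user.name "Example Author"
git config user.email "author@example.org"

echo "x <- 1:10" > app.R
git add app.R
git commit --quiet -m "initial version"

# Create a branch and switch to it in one go.
git checkout --quiet -b new-plot
echo "plot(1:10)" > plot.R
git add plot.R
git commit --quiet -m "added plotting script"

git branch                                 # lists main and new-plot, * marks the current one

# Back on main, plot.R is gone from the working copy until we merge.
git checkout --quiet main
git merge --quiet --no-edit new-plot
ls                                         # app.R and plot.R
```

Because main did not change in the meantime, git can simply move main forward to the tip of `new-plot`.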


**Fixing Merge Conflicts**

In\index{merge conflict} most cases, *git*\index{git} is quite clever and can figure out the desired state of
a file when putting two versions of it together. Whenever its *recursive strategy* can reconcile the changes, git will merge versions automatically. But when the same lines were changed in different versions, git cannot tell
which line should be kept. Sometimes, you would even want to keep both changes.

But even in such a scenario, fixing the conflict is easy. Git will tell you that your last command
caused a merge conflict and which files are conflicted. Open these files and take a look at all conflicted parts. @fig-merge-conflict shows a situation in which trying to merge a file that had changes across different branches caused a conflict.

![Ouch! We created a conflict by editing the same line in the same file on different branches. (Source: screenshot of own GitHub repository.)](images/merge_conflict.png){#fig-merge-conflict fig-scap="A merge conflict."}

Luckily, git marks the exact spot where the conflict\index{merge conflict} happens. Good text editors/IDEs ship with cool colors to highlight all our options. Some of the fancier editors even have git conflict resolve plugins that let you walk through all conflict points.

![Go for the current status or take what's coming in from the a2 branch? (Source: screenshot of Sublime text editor.)](images/sublime_conflict.png){fig-scap="Sublime text editor git integration."}

![In VS Code, you can even select the option by clicking. (Source: screenshot VS Code IDE.)](images/vscode_conflict.png){fig-scap="VS Code git integration."}

At the end of the day, all of them do the same thing, i.e., remove the unwanted part, including all the marker gibberish.
After you have done so, save, commit and push (if you are working with a remote repo). Don't forget to make sure you ironed out all conflicts.
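
If you would like to provoke a conflict in a safe place, the following sketch edits the same line on two branches. The file `style.txt`, its contents, and the decision to keep the main branch's line are assumptions for illustration; the branch name `a2` follows the screenshots above.

```bash
cd "$(mktemp -d)"                          # throwaway folder for this demo
git init --quiet
git symbolic-ref HEAD refs/heads/main      # name the initial branch main
git config user.name "Example Author"
git config user.email "author@example.org"

echo "color: gray" > style.txt
git add style.txt
git commit --quiet -m "initial style"

# Change the same line on two different branches.
git checkout --quiet -b a2
echo "color: pink" > style.txt
git commit --quiet -am "switched to pink"

git checkout --quiet main
echo "color: blue" > style.txt
git commit --quiet -am "switched to blue"

# The merge stops and reports a conflict; '|| true' keeps a script going.
git merge a2 || true
cat style.txt          # shows the <<<<<<<, =======, >>>>>>> markers

# Resolve: write the desired content without any markers, stage, commit.
echo "color: blue" > style.txt
git add style.txt
git commit --quiet -m "merged a2, kept blue"
```

The final commit is a regular merge commit with two parents, so the history still records that `a2` was merged.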


## Collaboration Workflow

The broad acceptance of git as a framework for collaboration has certainly played an important role in git's establishment as an industry standard.


### Feature Branches

This section\index{feature branch} discusses real-world collaboration workflows of modern open source software developers.
Hence, the prerequisites for benefiting the most from this section are a bit different.
Make sure you are past the mere ability to describe and explain git basics and can create and handle your own repositories.

If you had only a handful of close collaborators so far, you may be fine with staying on the main branch and trying not to step on each other's feet.
This is reasonable because, git aside, it is rarely efficient to work asynchronously on exactly the same lines of code anyway.
Nevertheless, there is a reason why *feature-branch-based*\index{feature branch} workflows became
very popular among developers: Imagine yourself collaborating in an asynchronous fashion, maybe with someone in another time zone.
Or with a colleague who works on your project, but in a totally different month of the year.
Or, most obviously, with someone you have never met.
Forks and *feature-branch-based* workflows are the way a lot of modern open source projects tackle the above situations.

Forks are just a way to contribute via feature branches\index{feature branch}, even in case you do not have write access to a repository.
But first, let's have a look at the basic feature branch case, in which you are part of the team with full access to the repository.
Assume there is already some work done, and some version of the project is already up on some remote GitHub account.
You join as a collaborator and are allowed to push changes now.
It's certainly not a good idea to simply add things to a project's production version without review.
Imagine you got access to modify the institute's website, you made your first changes, and all your changes went straight to production. Like this:

\textcolor{pink}{It used to be subtle and light gray. I swear!}

Bet everybody on the team took notice of the new team member by then. In a feature branch workflow, you would start from the latest production version. Remember, git is decentralized, and you have all versions of your team's project on your local machine.
Create a new branch with a name indicative of the feature you are looking to work on.

```
git checkout -b colorways
```

You are automatically switched to the freshly created branch.
Do your thing now.
It could be just a single commit, or several commits by different persons.
Once you are done, i.e., have committed all changes\index{git commands}, add your branch to the remote repository by pushing.

```
git push -u origin colorways
```

This will add your branch called *colorways* to the remote repository.
If you are on any major git platform with your project, it will come with a decent web GUI.
Such a GUI is the most straightforward way to do the next step: get your Pull Request\index{pull request} (PR) out.

![GitHub\index{GitHub} Pull Request dialog: Select the *Pull Request*; choose which branch you merge into which target branch. (Source: own GitHub repository.)](images/pr.png){fig-scap="GitHub Pull Request dialog."}

As you can see, git will check whether it is possible to merge automatically without interaction.
Even if that is not possible, you can still issue the pull request\index{pull request}.
When you create the request, you can also assign reviewers, but you could also do so at a later stage.


Even after a PR\index{pull request} was issued, you can continue to add commits to the branch about to be merged.
As long as you do not merge the branch through the PR, commits are added to the branch.
In other words, your existing PR gets updated. This is a very natural way to account for reviewer comments.
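
On the command line, updating a PR is nothing more than pushing further commits to the same branch. The sketch below uses a local bare repository as a stand-in for the platform, so the pull request itself is not part of it; file names, contents, and identity are hypothetical.

```bash
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the remote platform (assumption for this demo).
git init --quiet --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/main

git clone --quiet remote.git team-project
cd team-project
git symbolic-ref HEAD refs/heads/main      # name the initial branch main
git config user.name "New Team Member"
git config user.email "newbie@example.org"
echo "color: gray" > style.css
git add style.css
git commit --quiet -m "initial style"
git push --quiet -u origin main

# The feature branch the PR would be opened from.
git checkout --quiet -b colorways
echo "color: pink" > style.css
git commit --quiet -am "tried pink headlines"
git push --quiet -u origin colorways

# A reviewer asks for a subtler color: commit again, push to the same branch.
echo "color: lightgray" > style.css
git commit --quiet -am "toned color down after review"
git push --quiet

# Everything on colorways that is not yet in main, i.e., what the PR contains.
git log --oneline origin/main..origin/colorways
```

On a real platform, the open PR would now list both commits without any further action on your side.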


::: {.callout-note}
Use commit messages like 'added join to SQL\index{SQL} query, closes #3'. The keyword 'closes' or 'fixes' will automatically close the issues referred to when the commit is merged into the main branch.
:::

Once the merge is done, all your changes are in the main branch, and you and everyone else can pull the main branch that now contains your new feature. Yay!


### Pull Requests\index{pull request} from Forks

Now, let's assume you are using open source software created by someone else.
At some point, you miss a feature that you feel is not too hard to implement.
After googling and reading up a bit, you realize others would like to have this feature, too, but the original authors have not found the time to implement it yet.
Hence, you get to work.
Luckily, the project is open source and up on GitHub, so you can simply get your version of it, i.e., *fork* the project to your own GitHub account (just click the *fork* button and follow the instructions).

Now that you have your own version of the software with all the access rights to modify it, you can implement your feature and push it to your own remote git repository. Because you forked the repository, your remote git platform will still remember where you got it from and allows you to issue a pull request to the original author. The original authors can now review the pull request\index{pull request}, see the changes, and decide whether they are fine with the feature and its implementation.
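
The fork workflow can be sketched locally, too. In the sketch below, plain local repositories stand in for the original project and for your fork, which on a real platform would be created by the *fork* button; the remote name `upstream` is a common convention rather than a requirement, and all other names are made up.

```bash
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the original project, normally hosted by its authors.
git init --quiet upstream
cd upstream
git symbolic-ref HEAD refs/heads/main      # name the initial branch main
git config user.name "Original Author"
git config user.email "original@example.org"
echo "# Cool Package" > README.md
git add README.md
git commit --quiet -m "initial release"
cd "$workdir"

# Forking amounts to a server-side copy under your own account.
git clone --quiet --bare upstream fork.git

# Clone your fork and keep a second remote pointing at the original.
git clone --quiet fork.git my-copy
cd my-copy
git config user.name "You"
git config user.email "you@example.org"
git remote add upstream "$workdir/upstream"

# Implement the feature on a branch and push it to your fork.
git checkout --quiet -b missing-feature
echo "Now with the missing feature." >> README.md
git commit --quiet -am "added the missing feature"
git push --quiet -u origin missing-feature

git remote -v                              # origin is your fork, upstream the original
```

The branch now exists in the fork only; the pull request is what offers it to the original authors.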

There may very well be some back and forth in the message board before the pull request gets merged. But usually these discussions are very context-aware and sooner or later, you will get your first pull request approved and merged. In that case, congratulations -- you have turned yourself into a team-oriented open source collaborator!


### Rebase\index{git rebase} vs. Merge

Powerful systems often provide more than one way to achieve your goals.
In the case of git, putting together two branches of work -- a vital task -- is exactly such a case:
We can either *merge* or *rebase* branches.

While *merge* keeps the history intact, *rebase* is a history-altering command.
Though most people are happy with the more straightforward *merge* approach, a bit of context is certainly helpful.

Imagine the merge approach as a branch that goes away from the trunk at some point and then grows alongside the trunk in parallel. In case both histories become out of sync because someone else adds to the main branch while you keep adding to the feature branch, you can either merge or rebase to put the two together.

Sitting on a checkout of the feature branch\index{feature branch}, a merge of the main branch would simply create an additional merge commit sequentially after the last commit of the feature branch. This merge commit contains the changes of both branches, no matter if they were automatically merged using the standard recursive strategy or through resolving a merge conflict.

As opposed to that, a rebase would first take all changes made to main since the feature branch started and then sequentially replay all commits of the feature branch on top of them.
That way your history remains linear, looks cleaner and does not contain artificial merge commits.
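
The difference becomes visible when both approaches are applied to copies of the same feature branch. This is a sketch with made-up branch and file names; the copies `feature-merged` and `feature-rebased` exist only to allow a side-by-side comparison.

```bash
cd "$(mktemp -d)"                          # throwaway folder for this demo
git init --quiet
git symbolic-ref HEAD refs/heads/main      # name the initial branch main
git config user.name "Example Author"
git config user.email "author@example.org"

echo "base" > base.txt
git add base.txt
git commit --quiet -m "base"

# Feature branch and main diverge.
git checkout --quiet -b feature
echo "feature" > feature.txt
git add feature.txt
git commit --quiet -m "feature work"

git checkout --quiet main
echo "main" > main.txt
git add main.txt
git commit --quiet -m "main moved on"

# Two copies of the feature branch to compare both approaches.
git branch feature-merged feature
git branch feature-rebased feature

# Option 1: merge main into the feature branch (adds a merge commit).
git checkout --quiet feature-merged
git merge --quiet --no-edit main

# Option 2: rebase the feature branch onto main (rewrites the feature commit).
git checkout --quiet feature-rebased
git rebase --quiet main

git log --oneline --graph feature-merged   # four commits, one of them a merge
git log --oneline --graph feature-rebased  # three commits in a straight line
```

Both branches end up with the same files; they differ only in how the history tells the story.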

So, when should we merge, and when should we rebase? There is no clear rule other than not to rebase exposed branches such as main, because you would end up with a different main branch than other developers. Rebase can ruin your collaborative workflow, yet it helps to clean up. In my opinion, merging feature branches is just fine for most people and teams. So unless you have too many people working on too many different features at once and are in danger of not being able to move through your history, simply go with the merge approach.
The following Atlassian tutorial^[https://www.atlassian.com/git/tutorials/merging-vs-rebasing] offers more insights and illustrations to deepen your understanding of the matter.