1. Why version control?

Keeping with the "Document everything you do" mentality, you need to start thinking early on about how to manage changes to your analysis or modelling projects. Imagine the following common situation. You have a working analysis script, but you decide to run additional analyses requested by your colleague/advisor/reviewer. You make the required changes, and while your new analysis works fine, you suddenly find that you can't replicate your previous analyses - they give you an error or results that differ from what you had before.

This happens surprisingly often. One way to deal with it is to always back up your scripts and make the changes in new files. While this may work for the simplest cases, it quickly becomes unwieldy: you can end up with 10 versions of your analyses, at which point you are no better off than before, because you have to figure out which script contains the version of the code you need. If you work collaboratively on the same analyses, this becomes even more problematic.

2. What are git and GitHub?

Enter version control with git. Git is a system that allows you to track the changes you or others make to your files/scripts, to log the purpose of those changes, and to go back and forth between different versions of your files.

The basic idea is that for each project you manage, you establish what is called a repository for that project with git. That repository contains the files for your project, and using various commands you can instruct it what to do when you make changes to your files.

Git is a command-line tool, which you control by typing commands in a console/terminal. If you are not comfortable doing that, you can use a GUI (graphical user interface) application. One is provided when you install git, but a much better one comes from GitHub; it has a familiar interface that lets you point and click your way to the most common options. The command-line tool is more versatile, and I recommend learning how to use it, but the end goal is to get the job done, and if you prefer to use the app, go ahead.

So what about GitHub then? Well, git allows you to create repositories on your own computer. GitHub is a complementary service, which allows you to host your repositories on the github.com platform. That way you can share your work with collaborators, access it on different computers, and the changes you make are all synced to this central server.

For this tutorial, you will need to:

  • know how to open and use a command line / console / terminal on your system (Windows or Unix/Linux/Mac)
  • install git
  • sign up for a GitHub account
  • install the GitHub app

3. Local repositories (git)

3.1 Installing git

  1. Go to https://git-scm.com/ and download the appropriate git version for your system
  2. Run the file and follow the steps to install. I recommend choosing all default options, except for one. When it asks you to choose an editor, select the "Nano editor" rather than Vim (unless you are already familiar with Vim).
  3. You need to configure some settings before you start. Open a console/terminal window and run the following commands, replacing the placeholder email and name with your own. Use the email you are going to use with GitHub later (using your academic .edu address will give some benefits, so use that):
git config --global user.email "you@example.edu"
git config --global user.name "Your Name"

3.2 Initializing a new repository

Before you use git on a project, you have to initialize a repository for that project. Make a new empty folder for your project. You can do that either manually, or by typing mkdir pathname in a terminal, where pathname is the full path of your new folder. Copy the path of your new folder. Open a console/terminal window and navigate to the folder you just created with the cd command followed by your folder path. Then just type 'git init' and you're done!

mkdir "C:\data_science\my_awesome_project"
cd "C:\data_science\my_awesome_project"
git init

Note: on Windows, if the folder is not on your C: drive, first type the letter of the drive it is on followed by a colon (for example, D:).
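If you want to check that the initialization worked, here is a minimal sketch. It assumes a Unix-like shell (or Git Bash on Windows) and uses a throwaway folder created by mktemp instead of a real project path; the variable names are only for illustration.

```shell
set -eu  # stop at the first error

# Scratch folder, so the example does not touch a real project.
project_dir="$(mktemp -d)"
cd "$project_dir"

# Turn the folder into a git repository.
git init --quiet

# git keeps all of its bookkeeping in a hidden .git subfolder.
ls -d .git

# Ask git whether the current folder belongs to a repository.
inside="$(git rev-parse --is-inside-work-tree)"
echo "Inside a git repository: $inside"
```

If the last command reports an error instead, you are probably not in the folder where you ran git init.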

3.3 Cloning an existing repository

Alternatively, you can clone a repository that exists on GitHub. This will download all files present in that repository to a folder with the name of the repository and will have the repository all set up and ready to go. The folder will be created in whatever folder your terminal is currently in! Then you can navigate to the repository folder by using the cd command.

git clone https://github.com/CoAxLab/DataSciencePsychNeuro_CMU85732.git
cd DataSciencePsychNeuro_CMU85732

3.4 Adding files to be tracked

It is important to know that if you add files to your repository folder, you haven't added them to the repository yet. You need to explicitly tell git which files in the folder should be tracked. This allows you to store in the same folder both files that you want under version control, such as scripts and drafts, and files that you do not want to track, such as experimental stimuli. You can add specific files with the following command:

git add script1.R script2.R

Or you can also add all files in your folder by using an * rather than a specific filename:

git add *
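To see the difference between a file that merely sits in the folder and a file that git has been told about, you can try something like the sketch below. It assumes a Unix-like shell (or Git Bash) and a scratch folder; the file names script1.R and stimuli.txt are made up for the example.

```shell
set -eu  # stop at the first error

project_dir="$(mktemp -d)"   # scratch folder for the example
cd "$project_dir"
git init --quiet

# Two hypothetical files: one we want to track, one we do not.
echo 'x <- 1' > script1.R
echo 'list of stimuli' > stimuli.txt

# Before adding, both files are untracked (marked "??").
git status --short

# Tell git to track only the script.
git add script1.R

# Now script1.R is marked "A" (added); stimuli.txt is still "??".
status_after="$(git status --short)"
echo "$status_after"
```

The short status format is just a compact version of what plain git status prints.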

3.5 Committing changes

Adding files means telling git that it should track changes in those files. However, in contrast to the "Track changes" option in Word, it does not do so continuously and automatically - you have to tell git when you want the changes you've made to be recorded! Git calls this "committing". There is a good reason for this - you want to record changes at specific identifiable moments, so that you can later go back to the exact version you need. Let's say that you have just spent some time revising your analysis script by changing the regression model you are using. After you edit a file that is already tracked, run git add on it again so that the new changes are included in your next commit. You can then record the changes with the commit command. You also want to describe briefly the changes you've made so that you can identify the version later. You do this by adding -m "My description" to the commit command:

git commit -m "Change regression model to logistic"

At this point, all of your changes are recorded in a single commit. If you do this after each change, you will end up with multiple commits. You can see a log of your commits, when they happened and your messages by typing:

git log
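Here is a small sketch of two commits in a row, followed by a compact log. It assumes a Unix-like shell (or Git Bash) and a scratch folder; the file name analysis.R, its contents, and the name/email are placeholders, and the identity is set only for this scratch repository.

```shell
set -eu  # stop at the first error

project_dir="$(mktemp -d)"   # scratch folder for the example
cd "$project_dir"
git init --quiet
git config user.email "you@example.edu"   # placeholder identity,
git config user.name "Your Name"          # local to this repository

# First version of a hypothetical analysis script.
echo 'model <- lm(y ~ x)' > analysis.R
git add analysis.R
git commit --quiet -m "Add linear regression analysis"

# Revise the script, add it again, and commit the revision.
echo 'model <- glm(y ~ x, family = binomial)' > analysis.R
git add analysis.R
git commit --quiet -m "Change regression model to logistic"

# One line per commit, newest first.
git log --oneline

commit_count="$(git rev-list --count HEAD)"
echo "Number of commits: $commit_count"
```

The --oneline option is handy once the full git log output gets long.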

3.6 Working on a different branch

At this point you might be asking what is the point of all of this? You have made changes and saved them, but this hasn't really affected any part of your workflow. Let's talk about branches.

If you imagine your project as a tree, you can think of each branch of the tree as an alternative version of your project. You can work on multiple versions/branches of your project at the same time. Your main branch is called master (GitHub and some newer git setups call it main instead; if so, substitute that name in the commands below). It is automatically created when you initialize a repository and unless you create other branches, it will be the only one. At any point in time, you will always be in a certain branch, and any changes you commit at that time will only be recorded in that branch. When you create a new branch, it starts as a copy of the branch from which you created it. Importantly, this does not actually copy your files, and if you go into your project folder you will not see anything different at that point. Once you've made and committed some changes, you can switch between branches. Any time you switch to a branch, the files in your folder will be replaced with the files corresponding to the last commit in that branch. You can also merge branches together, implementing changes from one branch into another.

How is this useful? A very common workflow that protects your files is one in which you have two branches, a master branch, and a development branch. In this setup, you never work in your master branch - you always make changes in the development branch and once you are satisfied with those changes, you merge them into your master branch. That allows you to easily switch between your current in-progress changes and your last working version of your script in your master branch.

To create a new branch and switch to it you can use the following command:

git checkout -b nameofnewbranch

At this point you can make any changes to your project that you like, and use the commands discussed before to add the files and commit your changes. When you do this, your newbranch will contain your changes, but your master branch will contain a version of the repository prior to the creation of the new branch.

The checkout command allows you not only to create new branches, by adding the -b branchname option, but also to switch from one branch to another existing one. To do that, type the command without the -b option, followed by the name of the branch you want to switch to. If you want to go back to your master branch:

git checkout master

If you now check your files, they should not contain the changes you made in the new branch. You can easily go back to the new branch with:

git checkout nameofnewbranch

There are many different ways to use branches. If you are simultaneously testing several different versions of your analyses, you can create one branch for each, and easily switch back and forth between them, until you decide which version you want to keep.
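To convince yourself that switching branches really swaps the files in your folder, you can run a sketch like the one below. It assumes a Unix-like shell (or Git Bash) and a scratch folder; analysis.R and its contents are made up, and the main branch name is read from git rather than assumed to be master.

```shell
set -eu  # stop at the first error

project_dir="$(mktemp -d)"   # scratch folder for the example
cd "$project_dir"
git init --quiet
git config user.email "you@example.edu"   # placeholder identity
git config user.name "Your Name"

echo 'version 1' > analysis.R
git add analysis.R
git commit --quiet -m "Initial analysis"

# Remember the name of the main branch (master or main, depending on setup).
main_branch="$(git rev-parse --abbrev-ref HEAD)"

# Create a dev branch, switch to it, and commit a change there.
git checkout --quiet -b dev
echo 'version 2 (dev)' > analysis.R
git add analysis.R
git commit --quiet -m "Try a new version of the analysis"

# Back on the main branch the file has its old content...
git checkout --quiet "$main_branch"
on_main="$(cat analysis.R)"
echo "On $main_branch: $on_main"

# ...and on dev it has the new content again.
git checkout --quiet dev
on_dev="$(cat analysis.R)"
echo "On dev: $on_dev"
```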

3.7 Merging branches

You've finished making your changes on your development branch (called dev), you have tested them, and you want to finalize them. You need to merge your dev branch into your master. Any time you merge two branches together, git compares their contents, checks for conflicts and combines any changes made in the different branches. In the simplest case, where you've only made changes to one of the branches, it will just update the files in the other to the newer version. If you've made changes to the same file in different branches, it will try to merge them. If the changes are to different parts of the file, it will combine them with no problem most of the time. If the changes are to the same lines in a file, it will report a merge conflict that you will have to resolve manually. For now, just avoid doing that, and let's assume you've made changes only to one branch. When you merge branches, the changes are applied to whatever branch you are currently in. So, to update our master branch with the contents of dev, we first switch to master, and then invoke the merge command like this:

git checkout master
git merge dev

At this point, if you want to continue working on dev, you should switch back to it first. If you were working on a branch devoted to a specific version, let's say called weirdtestbranch, that you no longer need because you have merged it, you can delete the experimental branch with the following command:

git branch -d weirdtestbranch

This command will not let you delete a branch that has not been merged yet. If you decide that your experimental version did not pan out, and you no longer need it and do not wish to merge it, you can delete it with the -D option instead:

git branch -D weirdtestbranch
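The whole cycle of merging and deleting branches can be sketched as follows. This assumes a Unix-like shell (or Git Bash) and a scratch folder; the file name and contents are hypothetical, and the main branch name is read from git instead of being assumed.

```shell
set -eu  # stop at the first error

project_dir="$(mktemp -d)"   # scratch folder for the example
cd "$project_dir"
git init --quiet
git config user.email "you@example.edu"   # placeholder identity
git config user.name "Your Name"

echo 'version 1' > analysis.R
git add analysis.R
git commit --quiet -m "Initial analysis"
main_branch="$(git rev-parse --abbrev-ref HEAD)"   # master or main

# Make a change on dev, then merge it into the main branch.
git checkout --quiet -b dev
echo 'version 2 (tested on dev)' > analysis.R
git add analysis.R
git commit --quiet -m "Revise analysis"
git checkout --quiet "$main_branch"
git merge --quiet dev
merged="$(cat analysis.R)"
echo "After merge: $merged"

# An experimental branch with a commit that is never merged.
git checkout --quiet -b weirdtestbranch
echo 'experiment that did not pan out' > analysis.R
git add analysis.R
git commit --quiet -m "Experimental change"
git checkout --quiet "$main_branch"

# -d refuses to delete the unmerged branch; -D deletes it anyway.
if git branch -d weirdtestbranch 2>/dev/null; then
    safe_delete="allowed"
else
    safe_delete="refused"
fi
echo "Safe delete was $safe_delete"
git branch -D weirdtestbranch
```

Note that after the forced delete, the experimental commit is no longer reachable from any branch, so only use -D when you are sure.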

3.8 Merge only a specific file from a branch

Sometimes you have made a lot of changes to your project in a development branch, and you decide that while most of them should be discarded, you want to keep the changes to a specific file. You can use the checkout command while in your target branch like this:

git checkout dev -- filetokeep.R

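This will take the version of filetokeep.R from the dev branch and put it in your current branch, overwriting the version you had there. Note that this replaces the whole file rather than merging it line by line. The file is added automatically, so you only need to commit to record the change.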
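A quick way to try this out is the sketch below, in which only one of two changed files is brought over from dev. It assumes a Unix-like shell (or Git Bash) and a scratch folder; other.R and the file contents are invented for the example.

```shell
set -eu  # stop at the first error

project_dir="$(mktemp -d)"   # scratch folder for the example
cd "$project_dir"
git init --quiet
git config user.email "you@example.edu"   # placeholder identity
git config user.name "Your Name"

echo 'keep, version 1' > filetokeep.R
echo 'other, version 1' > other.R
git add filetokeep.R other.R
git commit --quiet -m "Initial files"
main_branch="$(git rev-parse --abbrev-ref HEAD)"   # master or main

# Change both files on dev.
git checkout --quiet -b dev
echo 'keep, version 2 from dev' > filetokeep.R
echo 'other, version 2 from dev' > other.R
git add filetokeep.R other.R
git commit --quiet -m "Change both files"

# Back on the main branch, take only filetokeep.R from dev.
git checkout --quiet "$main_branch"
git checkout dev -- filetokeep.R
kept="$(cat filetokeep.R)"
other="$(cat other.R)"
echo "filetokeep.R: $kept"
echo "other.R:      $other"

# Record the change in the main branch.
git commit --quiet -m "Take filetokeep.R from dev"
```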

3.9 Checking the status of your repository

If you are unsure which files you have added, which files have changes that are not committed yet, what branch you are on, etc., you can find all this information by typing:

git status
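As an illustration of what the status report tells you, here is a sketch with one modified file and one new, untracked file. It assumes a Unix-like shell (or Git Bash) and a scratch folder; the file names are hypothetical. The --short --branch options are just a compact alternative to the full report.

```shell
set -eu  # stop at the first error

project_dir="$(mktemp -d)"   # scratch folder for the example
cd "$project_dir"
git init --quiet
git config user.email "you@example.edu"   # placeholder identity
git config user.name "Your Name"

echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"

# Modify the tracked file and create a new, untracked one.
echo 'x <- 2  # revised' > analysis.R
echo 'to do: check outliers' > notes.txt

# Full report: current branch, modified files, untracked files.
git status

# Compact report: "##" is the branch line, " M" means modified
# but not yet added, "??" means untracked.
short_status="$(git status --short --branch)"
echo "$short_status"
```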

4. Remote repositories (GitHub)

So far, everything we've done has been on your local machine. If you want to share your repository with others, access it on different computers, back it up, etc., you can use a git server. The most popular service is github.com. On GitHub, each repository is either public (accessible by anyone) or private. Private repositories used to be a premium option but are now included with free accounts; signing up with your academic .edu email address can still give you additional benefits, so go ahead and use it.

To be able to sync your local repository to github, you have to:

  • create a new repository on github
  • connect your local repository to the remote github repository
  • push your commits from your local repository to the remote

4.1 Create a new repository

On the GitHub website, select the "new repository" button. Set up a name and description, and decide whether it should be private (accessible only to you) or public (accessible to anyone on the internet). Do not initialize with a README file, gitignore or a license.

4.2 Connect local to remote

If you have already created your local repository using the init command above and/or have added and committed files to it, you can connect the repository with the remote one on GitHub using:

git remote add origin https://github.com/myusername/nameofrepository.git

where the link after origin is copied from the GitHub webpage that appears after you create your repository. The first time you push to or pull from the remote, git will ask you to sign in to GitHub. Note that GitHub no longer accepts your account password for this; you sign in through the browser prompt or with a personal access token.

If you create a new local branch (as explained above), before you can push from it to the remote, you have to specify what branch it should push to. If you want to create a new remote branch and push to it, type:

git push --set-upstream origin dev

4.3 Pushing and pulling commits to and from remote

Finally, you want your commits to be synced to the remote repository. Once the current branch has been linked to a remote branch with --set-upstream (as in 4.2), you do it simply by typing:

git push

Only changes that you have already committed using git commit will be synced.

If you switch work to another computer and you want to work on the same repository, you can:

  • clone the repository, if this is the first time you are using the repository on this computer
  • pull all the files from the remote and update your local files with them. You do this with the pull command:
git pull
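If you would like to practice the push/pull cycle without touching GitHub, the sketch below uses an empty "bare" repository in a local folder as a stand-in for the remote. This is an assumption made purely for the example: on GitHub the remote address would be the https link of your repository. It also assumes a Unix-like shell (or Git Bash); the folder names computer1 and computer2 are made up and play the role of two different machines.

```shell
set -eu  # stop at the first error

work_dir="$(mktemp -d)"   # scratch folder holding everything
cd "$work_dir"

# Stand-in for the GitHub repository: an empty bare repository.
git init --quiet --bare remote.git

# "First computer": create a repository and make a first commit.
git init --quiet computer1
cd computer1
git config user.email "you@example.edu"   # placeholder identity
git config user.name "Your Name"
echo '# My project' > README.md
git add README.md
git commit --quiet -m "Initial commit"
main_branch="$(git rev-parse --abbrev-ref HEAD)"   # master or main

# Connect local to remote, then push and set the upstream branch.
git remote add origin "$work_dir/remote.git"
git push --quiet --set-upstream origin "$main_branch"

# "Second computer": first use, so clone the repository.
cd "$work_dir"
git clone --quiet "$work_dir/remote.git" computer2

# First computer: commit another change and push it.
cd "$work_dir/computer1"
echo 'More notes about the project' >> README.md
git add README.md
git commit --quiet -m "Expand README"
git push --quiet

# Second computer: pull to receive the new commit.
cd "$work_dir/computer2"
git pull --quiet
pulled="$(tail -n 1 README.md)"
echo "Last line after pull: $pulled"
```

Once the upstream has been set by the first push, a plain git push is enough for later commits on that branch.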

5. Putting it all together - example workflow

You are starting a new project. Here's an example workflow with everything we have discussed.

  1. Install git, make a github account
  2. Setup git:
    git config --global user.email "you@example.edu"
    git config --global user.name "Your Name"
  3. Create a new project folder, navigate to it, and initialize a new local repository
    mkdir c:\projects\myproject\
    cd c:\projects\myproject\
    git init
  4. Make a remote repository on GitHub, and copy the repository link
  5. Connect the local repository to the remote
    git remote add origin https://github.com/myusername/myproject.git
  6. Create a basic project folder structure and a readme file briefly describing the project and possibly the folder structure
  7. Add all current files to git
    git add *
  8. Make your first commit with an initialization message
    git commit -m "Initial commit"
  9. Push your changes to your remote github repository
    git push --set-upstream origin master
  10. Create a new branch called dev on which to develop your project
    git checkout -b dev
    git push --set-upstream origin dev
  11. Develop your scripts/analyses, files, etc.
  12. Whenever you make changes you are happy with, add the files, commit the changes and push them:
    git add * # possibly specify specific files
    git commit -m "Explain what you have changed briefly"
    git push
  13. When you have reached a stage in which you want to save the version to be the main one, merge your dev branch with master:
    git checkout master
    git merge dev
    git push # in order to sync the master branches
  14. Go back to dev or a new branch, if you want to test an alternative version of the script or add new things:
    git checkout -b specialbranchfornewfeature
    git push --set-upstream origin specialbranchfornewfeature
  15. Rinse and repeat :)

Author: Ven Popov