Keeping with the "Document everything you do" mentality, you need to start thinking about how to manage changes to your analysis or modelling projects early on. Imagine the following common situation. You have a working analysis script, but you decide to do additional analyses requested by your colleague/advisor/reviewer. You make the required changes, and while your new analysis works fine, suddely you find that you can't replicate your previous analyses - they give you an error or results that differ from what you had before.
This happens surprisingly often. One way to deal with that is to always backup you scripts, and make the changes to new files. While this may work for the simplest of cases, it becomes unwieldy quickly and you can end up with 10 versions of your analyses at which point you are in a no better situation than you were before, because you have to figure out which script contains the version of the code you need. If you work collaboratively on the same analyses, this becomes even more problematic.
Step in version control with git. Git is a system which allows you to track the changes you or others make to your files/scripts, to log the purpose of those changes and to go back and forth between different versions of your files.
The basic idea is that for each project you manage, you establish what is called a repository for that project with git. That repository contains the files for your project, and using various commands you can instruct it what to do when you make changes to your files.
Git is a command-line tool, which you control by typing commands in a console/terminal. If you are not comfortable doing that, you can use a GUI (graphical user interface) application. One is provided when you install git, but a much better one comes with GitHub, which has a familiar user-interface that allows you to just point-and-click to the most common options. The command-line tool is more versatile, and I recommend learning how to use it, but the end goal is to get the job done, and if you prefer to use the app, go ahead.
So what about GitHub then? Well, git allows you to create repositories on your own computer. GitHub is a complementary service, which allows you to host your repositories on the github.com platform. That way you can share your work with collaborators, access it on different computers, and the changes you make are all synced to this central server.
For this tutorial, you will need to:
git config --global user.email "[email protected]" git config --global user.name "Your Name"
Before you use git on a project, you have to initialize a repository for that project. Make a new empty folder for your project. You can do that either manually, or by typing
mkdir pathname in a terminal, where pathname is the full path of your new folder.
Copy the path of your new folder. Open a console/terminal window and navigate to the folder you just created with the
cd command followed by your folder path. Then just type 'git init' and you're done!
mkdir "C:\data_science\my_awesome_project\" cd "C:\data_science\my_awesome_project\" git init
Note: on Windows, if the folder is not on your C: drive, first type the letter of the driver on which it is on followed by a colon
Alternatively, you can clone a repository that exists on GitHub. This will download all files present in that repository to a folder with the name of the repository and will have the repository all set up and ready to go. The folder will be created in whatever folder your terminal is currently in! Then you can navigate to the repository folder by using the
git clone https://github.com/CoAxLab/DataSciencePsychNeuro_CMU85732.git cd DataSciencePsychNeuro_CMU85732
It is important to know that if you add files to your repository folder, you haven't added them to the repository yet. You need to explicitely tell git which files in the folder need to be tracked. This allows you to store in the same folder both files that you need to keep a version control on, such as scripts and drafts, but also other files that you do not want to track, such as experimental stimuli, etc. You can add specific files with the following command:
git add script1.R script2.R
Or you can also add all files in your folder by using an
* rather than a specific filename:
git add *
Adding files means telling git that it should track changes in those files. However, in contrast to the "Track changes" options in Word, it does not do it continuously and automatically - you have to tell git when you want the changes you've made to be recorded! Git calls this "commiting". There is a good reason for this - you want to record changes at specific identifiable moments which can allow you to go back to a specific version you need. Let's say that you have just spent some time revising your analysis script by changing the regression model you are using. You can commit your new changes with the commit command. You also want to describe briefly the changes you've made so that you can identify the version later. You do this by adding -m "My description" to the commit command:
git commit -m "Change regression model to logistic"
At this point, all of your changes are recorded in a single commit session. If you do this multiple times after each change, you will end up with multiple commits. You can see a log of your commits, when they happend and your messages by typing:
At this point you might be asking what is the point of all of this? You have made changes and saved them, but this hasn't really affected any part of your workflow. Let's talk about branches.
If you imagine your project as a tree, you can think of each branch of a tree as an alternative version of your project. You can work on multiple versions/branches of your project at the same time. Your main branch is called master. It is automatically created when you initialize a repository and unless you create other branches, it will be the only one. At any point in time, you will always be in a certain branch, and any changes you make to your files at that time will only be made to files in that branch. When you create a new branch, you make a copy of the branch from which you created it. Importantly, this does not actually copy your files, and if you go into your project folder you will not see anything different at that point. Once you've made and commit any changes, you can switch between branches. Any time you switch to a branch, the files in your folder will be replaced with the files corresponding to the last changes in the branch. You can also merge branches together, implementing changes from one branch into another.
How is this useful? A very common workflow that protects your files is one in which you have two branches, a master branch, and a development branch. In this setup, you never work in your master branch - you always make changes in the development branch and once you are satisfied with those changes, you merge them into your master branch. That allows you to easily switch between your current in-progress changes and your last working version of your script in your master branch.
To create a new branch and switch to it you can use the following command:
git checkout -b nameofnewbranch
At this point you can make any changes to your project that you like, and use the commands discussed before to add the files and commit your changes. When you do this, your newbranch will contain your changes, but your master branch will contain a version of the repository prior to the creation of the new branch.
checkout commandas allows you not only to create new branches, when putting the
-b branchname operator, but also to switch from one branch to another existing one. To do that, type the command without the -b operator, with the name of the branch you want to switch into. If you want to go back to your master branch:
git checkout master
If you now check your files, they should not contain the changes you made. You can easily go back to that branch with:
git checkout nameofnewbranch
There are many different ways to use branches. If you are simultaneously testing several different versions of your analyses, you can create one branch for each, and easily switch back and forth between them, until you decide which version you want to keep.
You've finished making your changes on your development branch (called dev), you have tested them, and you want to finalize them. You need to merge your dev branch with your master. Any time you merge two branches together, git compares their contents, checks for conflicts and combines any changes made in the different branches. In the simplest case, where you've only made changes to one of the branches, it will just replace the files in the other with the newer version. If, you've made changes to the same file under different branches, it will try to merge them. If the changes are about different parts of the file, it will combine them with no problem most of the time. If the changes are to the same lines in a file, it will create a conflict error that you will have to resolve manually. For now, just avoid doing that, and let's assume you've made changes only to one branch. When you merge branches, you implement the changes in whatever branch you are currently in. So, to update our master branch with the contents of dev, we first switch to master, and then invoke the merge command like this:
git checkout master git merge dev
At this point, if you want to continue working on dev, you should switch back to it first. If you were working on a branch devoted to a specific version, let's say called weirdtestbranch, that you no longer need because you have merged it, you can delete the experimental branch with the following command:
git branch -d weirdtestbranch
This command will not let you delete a branch that has not been merged yet. If you decide that your experimental version did not pan out, and you no longer need it and do not wish to merge it, you can delete it with the
-D option instead:
git branch -D weirdtestbranch
Sometimes you made a lot of changes to your project in a development branch, and you decide that while most of them are to be discarded, you want to keep the changes to a specific file. You can use the
checkout command while in your target branch like this:
git checkout dev -- filetokeep.R
This will extract the filetokeep.R file from the dev branch and merge it into your current branch.
If you are unsure what files you have added, what changes you have committed, what branch you are on, etc, you can find all this information out by:
So far, all of the things we've done are on your local machine. If you want to share your repository with others, access it on different computers, back it up, etc, you can use a git server. The most popular service is over at github.com. On GitHub, by default all of your repositories are public and accessible by anyone. Private repositories are a premium option, but with an academic account you get unlimited private repositories, so go ahead and sign up with your .edu email address.
To be able to sync your local repository to github, you have to:
On the GitHub website, select the "new repository" button. Set up a name, description and decide whether it should be private (accessible only to you) or public (accessible to anyone on the internet). Do not initialize with a README file, gitignore or a licene.
If you have already created your local repository using the init command above and/or have added and commited files to it, you can connect the repository with the remote one on GitHub using:
git remote add origin https://github.com/myusername/nameofrepository.git
where the link after origin is copied from the github webpage that appears after you create your repository. If this is the first time you are doing this, it will ask you to fill out your github email/password, before you proceed.
If you create a new local branch (as explained above), before you can push from it to the remote, you have to specify what branch it should push to. If you want to create a new remote branch and push to it, type:
git push --set-upstream origin dev
Finally, you want your commits to be synced to the remote repository. You do it simply by typing:
Only changes that you have already commited using
git commit will be synced.
If you switch work to another computer and you want to work on the same repository, you can:
You are starting a new project. Here's an example workflow with everything we have discussed.
git config --global user.email "[email protected]" git config --global user.name "Your Name"
mkdir c:\projects\myproject\ cd c:\projects\myproject\ git init
git remote add origin https://github.com/myusername/myproject.git
git add *
git commit -m "Initial commit"
git checkout -b dev git push --set-upstream origin dev
git add * # possibly specify specific files git commit -m "Explain what you have changed briefly" git push
git checkout master git merge dev git push # in order to sync the master branches
git checkout -b specialbranchfornewfeature git push --set-upstream origin specialbranchfornewfeature
Author: Ven Popov