Software version control systems contain a huge amount of evolutionary data. Mining these repositories to gain insight into how the development of a software product works is common practice, but the raw data needs some preprocessing to avoid flawed analyses.
That's why I'll show you how to read the commit information of a Git repository into a Pandas DataFrame!
Our implementation strategy is straightforward: we avoid writing custom functions as much as possible and instead use all the processing power Pandas delivers. So let's get started.
First, we import our two main libraries for analysis: Pandas and GitPython.
import pandas as pd
import git
With GitPython, you can access a Git repository via a Repo object. That's your entry point to the world of Git.
For this notebook, we analyze the Spring PetClinic repository, which can easily be cloned to your local computer with
git clone https://github.com/spring-projects/spring-petclinic.git
Repo needs at least the path to your Git repository. I've added the additional argument odbt with git.GitCmdObjectDB. With this, GitPython uses a more performant approach for retrieving the data (see the documentation for more details).
repo = git.Repo(r'C:\dev\repos\spring-petclinic', odbt=git.GitCmdObjectDB)
repo
<git.Repo "C:\dev\repos\spring-petclinic\.git">
To transform the complete repository into a Pandas DataFrame, we simply iterate over all commits of the master branch.
commits = pd.DataFrame(repo.iter_commits('master'), columns=['raw'])
commits.head()
raw | |
---|---|
0 | ffa967c94b65a70ea6d3b44275632821838d9fd3 |
1 | fd1c742d4f8d193eb935519909c15302b783cd52 |
2 | f792522b3dffca918f52010c8593999088034e19 |
3 | 75912a06c5613a2ea1305ad4d8ad6bc4be7765ce |
4 | 443d35eae23c874ed38305fbe75216339c41beaf |
Our raw column now contains all the commits as GitPython Commit objects (to be more precise: references to these objects). Conveniently, the string representation of such an object is the SHA key of the commit.
Let's have a look at the last commit.
last_commit = commits.loc[0, 'raw']
last_commit
<git.Commit "ffa967c94b65a70ea6d3b44275632821838d9fd3">
Such a Commit object is our entry point for retrieving further data.
print(last_commit.__doc__)
Wraps a git Commit object. This class will act lazily on some of its attributes and will query the value on demand only if it involves calling the git binary.
It provides all data we need:
last_commit.__slots__
('tree', 'author', 'authored_date', 'author_tz_offset', 'committer', 'committed_date', 'committer_tz_offset', 'message', 'parents', 'encoding', 'gpgsig')
E.g., basic data like the commit message:
last_commit.message
'spring-petclinic-angular1 repo renamed to spring-petclinic-angularjs'
Or the date of the commit:
last_commit.committed_datetime
datetime.datetime(2017, 4, 12, 21, 41, tzinfo=<git.objects.util.tzoffset object at 0x0000025943CEC198>)
Some information about the author.
last_commit.author.name
'Antoine Rey'
last_commit.author.email
'antoine.rey@gmail.com'
Or file statistics about the commit:
last_commit.stats.files
{'readme.md': {'deletions': 1, 'insertions': 1, 'lines': 2}}
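Because stats.files is just a plain dict of dicts, simple per-commit aggregates are easy to compute without Pandas at all. Here's a minimal sketch; the filenames and numbers below are made up for illustration and only mimic the shape of GitPython's stats.files:

```python
# Hypothetical stats.files-style dict, shaped like GitPython's output
stats_files = {
    'readme.md': {'deletions': 1, 'insertions': 1, 'lines': 2},
    'pom.xml': {'deletions': 0, 'insertions': 10, 'lines': 10},
}

# Total churn (inserted + deleted lines) across all files of the commit
total_lines = sum(changes['lines'] for changes in stats_files.values())

# Number of files touched by the commit
touched_files = len(stats_files)
```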
Let's check how fast we can retrieve all the authors from the commit's data.
%%time
commits['author'] = commits['raw'].apply(lambda x: x.author.name)
commits.head()
Wall time: 62.5 ms
Let's go further and retrieve some more data (the DataFrame is transposed via T for display purposes).
%%time
commits['email'] = commits['raw'].apply(lambda x: x.author.email)
commits['committed_date'] = commits['raw'].apply(lambda x: pd.to_datetime(x.committed_datetime))
commits['message'] = commits['raw'].apply(lambda x: x.message)
commits['sha'] = commits['raw'].apply(lambda x: str(x))
commits.head(2).T
Wall time: 78.1 ms
Dead easy and reasonably fast, but what about the modified files? Let's challenge our computer a little more by extracting the statistics for every commit. The Stats object contains all the touched files per commit, including the number of lines that were inserted or deleted.
Additionally, we need a few tricks to get the data we need. I'll guide you through the approach step by step. The main idea is to retrieve the actual statistics data (not just references to the objects) and temporarily store it as a Pandas Series. Then we take another round to transform this data for use in the DataFrame.
This step is a little tricky and was found only after a good amount of trial and error. But it works in the end, as we will see. The goal is to unpack the information in the stats object into proper columns of our DataFrame via the Series#apply method. I'll show you step by step how this works in principle (although it works a little differently with the apply approach).
As seen above, we have access to every file modification of each commit. In the end, it's a dictionary with the filename as the key and a dictionary of the change attributes as values.
some_commit = commits.loc[56, 'raw']
some_commit.stats.files
{'src/main/webapp/WEB-INF/tags/menu.tag': {'deletions': 2, 'insertions': 2, 'lines': 4}}
We extract the dictionary of dictionaries in two steps. Keep in mind that all this tricky data transformation depends heavily on having the right index. But first things first.
First, the outer dictionary: we create a Series from it.
dict_as_series = pd.Series(some_commit.stats.files)
dict_as_series
src/main/webapp/WEB-INF/tags/menu.tag    {'insertions': 2, 'deletions': 2, 'lines': 4}
dtype: object
Second, we wrap that Series in a DataFrame (for index reasons):
dict_as_series_wrapped_in_dataframe = pd.DataFrame(dict_as_series)
dict_as_series_wrapped_in_dataframe
0 | |
---|---|
src/main/webapp/WEB-INF/tags/menu.tag | {'insertions': 2, 'deletions': 2, 'lines': 4} |
After that, some magic occurs: we stack the DataFrame, meaning we move the columns into the index, which becomes a MultiIndex.
stacked_dataframe = dict_as_series_wrapped_in_dataframe.stack()
stacked_dataframe
src/main/webapp/WEB-INF/tags/menu.tag  0    {'insertions': 2, 'deletions': 2, 'lines': 4}
dtype: object
stacked_dataframe.index
MultiIndex(levels=[['src/main/webapp/WEB-INF/tags/menu.tag'], [0]], labels=[[0], [0]])
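If stack still feels opaque, the same mechanics can be seen on a tiny toy DataFrame (the values below are made up purely for illustration):

```python
import pandas as pd

# A 2x2 DataFrame with a named row index and two columns
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

# stack() moves the column labels into a second index level,
# producing a Series with a (row label, column label) MultiIndex
stacked = df.stack()
```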
With some manipulation of the index, we achieve what we need: an expansion of the rows for each file in a commit.
stacked_dataframe.reset_index().set_index('level_1')
level_0 | 0 | |
---|---|---|
level_1 | ||
0 | src/main/webapp/WEB-INF/tags/menu.tag | {'insertions': 2, 'deletions': 2, 'lines': 4} |
With this (dirty?) trick, we've ensured that all files from the stats object can be assigned to the original index of our DataFrame.
In the context of a call with the apply method, the command looks a little different, but in the end, the result is the same (I took a commit with multiple modified files from the DataFrame just to show the transformation a little better):
pd.DataFrame(commits[64:65]['raw'].apply(
lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
level_1 | 0 | |
---|---|---|
64 | readme.md | {'insertions': 2, 'deletions': 2, 'lines': 4} |
64 | src/main/java/org/springframework/samples/petc... | {'insertions': 111, 'deletions': 0, 'lines': 111} |
64 | src/main/webapp/WEB-INF/web.xml | {'insertions': 0, 'deletions': 118, 'lines': 118} |
%%time
stats = pd.DataFrame(commits['raw'].apply(
lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
stats = stats.rename(columns={ 'level_1' : 'filename', 0 : 'stats_modifications'})
stats.head()
Wall time: 23.9 s
Unfortunately, this takes almost half a minute on my machine :-( (help needed! Maybe there is a better way of doing this).
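In response to the "help needed" above: one idea worth trying is a plain Python list comprehension over the dicts, fed into the DataFrame constructor once, which avoids creating thousands of intermediate Series objects. Here's a sketch of that idea; the stats_per_commit dict below is hypothetical sample data standing in for the real per-commit stats.files results:

```python
import pandas as pd

# Hypothetical per-commit stats, keyed by the commit's row index
# (this mimics what x.stats.files returns for each commit)
stats_per_commit = {
    0: {'readme.md': {'deletions': 1, 'insertions': 1, 'lines': 2}},
    1: {'pom.xml': {'deletions': 0, 'insertions': 1, 'lines': 1}},
}

# One row per (commit, file) pair, built in pure Python
rows = [
    (idx, filename, c['deletions'], c['insertions'], c['lines'])
    for idx, files in stats_per_commit.items()
    for filename, c in files.items()
]
stats = pd.DataFrame(
    rows,
    columns=['commit', 'filename', 'deletions', 'insertions', 'lines']
).set_index('commit')
```

With the real repository, you'd iterate over commits['raw'].items() and call .stats.files once per commit; the expensive part remains the Git calls themselves, but the Pandas-side overhead drops away.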
Next, we extract the data from the stats_modifications column. We do this by simply wrapping each dictionary in a Series, which returns the data we need.
pd.Series(stats.loc[0, 'stats_modifications'])
deletions     1
insertions    1
lines         2
dtype: int64
With apply, it looks a little different because we apply the lambda function along the DataFrame's index.
We get a warning because there seems to be a problem with the ordering of the index, but I haven't found any errors with this approach so far.
stats_modifications = stats['stats_modifications'].apply(lambda x: pd.Series(x))
stats_modifications.head(7)
deletions | insertions | lines | |
---|---|---|---|
0 | 1 | 1 | 2 |
1 | 0 | 1 | 1 |
2 | 0 | 10 | 10 |
2 | 3 | 21 | 24 |
2 | 3 | 0 | 3 |
3 | 1 | 1 | 2 |
3 | 9 | 11 | 20 |
We join the newly created data with the existing DataFrame using the join method.
stats = stats.join(stats_modifications)
stats.head()
filename | stats_modifications | deletions | insertions | lines | |
---|---|---|---|---|---|
0 | readme.md | {'insertions': 1, 'deletions': 1, 'lines': 2} | 1 | 1 | 2 |
1 | pom.xml | {'insertions': 1, 'deletions': 0, 'lines': 1} | 0 | 1 | 1 |
2 | pom.xml | {'insertions': 10, 'deletions': 0, 'lines': 10} | 0 | 10 | 10 |
2 | pom.xml | {'insertions': 10, 'deletions': 0, 'lines': 10} | 3 | 21 | 24 |
2 | pom.xml | {'insertions': 10, 'deletions': 0, 'lines': 10} | 3 | 0 | 3 |
After we get rid of the now obsolete stats_modifications column...
del(stats['stats_modifications'])
stats.head()
filename | deletions | insertions | lines | |
---|---|---|---|---|
0 | readme.md | 1 | 1 | 2 |
1 | pom.xml | 0 | 1 | 1 |
2 | pom.xml | 0 | 10 | 10 |
2 | pom.xml | 3 | 21 | 24 |
2 | pom.xml | 3 | 0 | 3 |
...we join the existing DataFrame with the stats information (transposed for display purposes)...
commits = commits.join(stats)
commits.head(2).T
0 | 1 | |
---|---|---|
raw | ffa967c94b65a70ea6d3b44275632821838d9fd3 | fd1c742d4f8d193eb935519909c15302b783cd52 |
author | Antoine Rey | Antoine Rey |
email | antoine.rey@gmail.com | antoine.rey@gmail.com |
committed_date | 2017-04-12 21:41:00+02:00 | 2017-03-06 08:12:14+00:00 |
message | spring-petclinic-angular1 repo renamed to spri... | Do not fail maven build when git directing is ... |
sha | ffa967c94b65a70ea6d3b44275632821838d9fd3 | fd1c742d4f8d193eb935519909c15302b783cd52 |
filename | readme.md | pom.xml |
deletions | 1 | 0 |
insertions | 1 | 1 |
lines | 2 | 1 |
...and come to an end by deleting the raw data column, too (again transposed for display purposes).
del(commits['raw'])
commits.head(2).T
0 | 1 | |
---|---|---|
author | Antoine Rey | Antoine Rey |
email | antoine.rey@gmail.com | antoine.rey@gmail.com |
committed_date | 2017-04-12 21:41:00+02:00 | 2017-03-06 08:12:14+00:00 |
message | spring-petclinic-angular1 repo renamed to spri... | Do not fail maven build when git directing is ... |
sha | ffa967c94b65a70ea6d3b44275632821838d9fd3 | fd1c742d4f8d193eb935519909c15302b783cd52 |
filename | readme.md | pom.xml |
deletions | 1 | 0 |
insertions | 1 | 1 |
lines | 2 | 1 |
So we're finished! A DataFrame that contains all the repository information needed for further analysis!
commits.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2228366 entries, 0 to 557
Data columns (total 9 columns):
author            object
email             object
committed_date    object
message           object
sha               object
filename          object
deletions         float64
insertions        float64
lines             float64
dtypes: float64(3), object(6)
memory usage: 170.0+ MB
At the end, we still have all our commits from the beginning, but now with all the information we need to work with in another notebook.
len(commits.index.unique())
558
For now, we just store the DataFrame in HDF5 format with compression for later usage (we get a warning because of the string objects we're using, but as far as I know that's not a problem).
commits.to_hdf("data/commits.h5", 'commits', mode='w', complevel=9, complib='zlib')
C:\dev\Anaconda3\lib\site-packages\pandas\core\generic.py:1101: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block1_values] [items->['author', 'email', 'committed_date', 'message', 'sha', 'filename']]
  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
This notebook is quite long because it includes a lot of explanations. But if you just need the code to extract a Git repository's history, here it is:
import pandas as pd
import git
repo = git.Repo(r'C:\dev\repos\spring-petclinic', odbt=git.GitCmdObjectDB)
commits = pd.DataFrame(repo.iter_commits('master'), columns=['raw'])
commits['author'] = commits['raw'].apply(lambda x: x.author.name)
commits['email'] = commits['raw'].apply(lambda x: x.author.email)
commits['committed_date'] = commits['raw'].apply(lambda x: pd.to_datetime(x.committed_datetime))
commits['message'] = commits['raw'].apply(lambda x: x.message)
commits['sha'] = commits['raw'].apply(lambda x: str(x))
stats = pd.DataFrame(commits['raw'].apply(lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
stats = stats.rename(columns={ 'level_1' : 'filename', 0 : 'stats_modifications'})
stats_modifications = stats['stats_modifications'].apply(lambda x: pd.Series(x))
stats = stats.join(stats_modifications)
del(stats['stats_modifications'])
commits = commits.join(stats)
del(commits['raw'])
commits.to_hdf("data/commits.h5", 'commits', mode='w', complevel=9, complib='zlib')
C:\dev\Anaconda3\lib\site-packages\pandas\core\generic.py:1101: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block1_values] [items->['author', 'email', 'committed_date', 'message', 'sha', 'filename']]
  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
I hope you aren't demotivated now by my Pandas approach to extracting data from Git repositories. Agreed, the stats object is a little unconventional to work with (and there may be better ways of doing it), but I think in the end the result is pretty useful.