This series of notebooks is a simple mini tutorial that introduces you to the basic functionality of Jupyter, Python, pandas and matplotlib. The detailed explanations should enable you to analyze software data on your own. Therefore, the examples are chosen in such a way that we come across the typical methods of a data analysis. Have fun!
This is part II: The basics of the data analysis framework pandas and the visualization library matplotlib. For Jupyter Notebook and Python basics, go to the other tutorial.
In this notebook, we want to take a closer look at the development history of the open source project "Linux" based on the history of the corresponding GitHub mirror repository.
We want to find out which people are the TOP 10 contributors.
A local clone of the GitHub repository https://github.com/torvalds/linux/ was created by using the command
git clone https://github.com/torvalds/linux.git
The relevant parts of the history for this analysis were produced by using
git log --pretty="%ad,%aN" --no-merges > git_log_linux_authors_timestamps.csv
This command returned the timestamp (%ad, the author date) and the author name (%aN) for each commit of the Git repository. The corresponding values are separated by commas. We also indicated that we do not want to receive merge commits (via --no-merges). The output was saved in the file git_log_linux_authors_timestamps.csv and compressed with gzip to the file git_log_linux_authors_timestamps.gz to reduce the file size.
Note: For an optimized demo, the headers and the separator have been changed manually in the provided dataset to get through this analysis more easily. The differences can be seen at https://www.feststelltaste.de/developers-habits-linux-edition/, which was done with the original dataset.
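For comparison, here is a minimal sketch of how the unmodified output of the git log command above could be read, assuming Git's default date format and author names that contain no comma. The two sample lines are made up for illustration (they are not taken from the dataset), and raw_log is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical sample of raw `git log --pretty="%ad,%aN"` output.
# The raw output has no header row, so column names are supplied by hand.
raw_output = io.StringIO(
    "Sun Dec 31 14:47:43 2017 -0800,Linus Torvalds\n"
    "Sun Dec 31 13:13:56 2017 -0800,Linus Torvalds\n"
)

raw_log = pd.read_csv(
    raw_output,
    sep=",",                         # the separator used in the --pretty format
    header=None,                     # first line is data, not a header
    names=["timestamp", "author"],   # same column names as the provided dataset
)
print(raw_log)
```

An author name that itself contains a comma would break this simple approach, which is one plausible reason to pick a different separator.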
pandas is a data analysis tool written in Python (and C). It is well suited to the analysis of tabular data thanks to its efficient data structures and built-in statistics functions.
Import pandas with import <module> as <abbreviation>, abbreviated as pd.
import pandas as pd
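In the next cell, type pd, attach a ? behind it and execute the cell to display the documentation of the module. Close the help with the Esc key.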
pd?
Use the read_csv method from pandas to read in the dataset ../datasets/git_log_linux_authors_timestamps.gz.
Store the result in the variable log.
Display the first five entries of log with the head() method.
log = pd.read_csv("../datasets/git_log_linux_authors_timestamps.gz")
log.head()
| | timestamp | author |
|---|---|---|
| 0 | 2017-12-31 14:47:43 | Linus Torvalds |
| 1 | 2017-12-31 13:13:56 | Linus Torvalds |
| 2 | 2017-12-31 13:03:05 | Linus Torvalds |
| 3 | 2017-12-31 12:30:34 | Linus Torvalds |
| 4 | 2017-12-31 12:29:02 | Linus Torvalds |
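The call above reads a gzip-compressed file without any extra arguments, because read_csv infers the compression from the file extension by default. A minimal sketch of that behavior, using a temporary file that holds only the first two rows shown above (sample_log and the file name sample_log.gz are hypothetical):

```python
import gzip
import os
import tempfile

import pandas as pd

# Two rows copied from the table above, with the header of the provided dataset.
csv_text = (
    "timestamp,author\n"
    "2017-12-31 14:47:43,Linus Torvalds\n"
    "2017-12-31 13:13:56,Linus Torvalds\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_log.gz")
    # Write the CSV text gzip-compressed, like the provided dataset.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(csv_text)
    # compression="infer" is the default, so the ".gz" suffix is enough.
    sample_log = pd.read_csv(path)

print(sample_log.head())
```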
Call info() on log.
log.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 723214 entries, 0 to 723213
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   timestamp  723214 non-null  object
 1   author     723213 non-null  object
dtypes: object(2)
memory usage: 11.0+ MB
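Two details in this output are worth a closer look: the author column has one non-null entry fewer than the timestamp column, and timestamp is stored as plain object (text). A minimal sketch of how both could be inspected on a tiny sample follows. The second row, with its empty author field, is made up to mimic the one missing value; sample_log and missing_authors are hypothetical names.

```python
import io

import pandas as pd

# Tiny stand-in for the dataset; the second row has an empty author field
# (an illustrative assumption, not a row from the real data).
sample = io.StringIO(
    "timestamp,author\n"
    "2017-12-31 14:47:43,Linus Torvalds\n"
    "2017-12-31 13:13:56,\n"
)
sample_log = pd.read_csv(sample)

# Empty fields are read as NaN, which info() reports as "missing".
missing_authors = sample_log["author"].isna().sum()
print(missing_authors)

# Convert the text column into real timestamps for time-based analysis.
sample_log["timestamp"] = pd.to_datetime(sample_log["timestamp"])
print(sample_log.dtypes)
```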
We see that log is a DataFrame with 723214 entries and two columns (Series): timestamp and author.
Access the column author by either the ['<column name>'] notation or the .<Series name> notation.
Display the last five entries of the Series with the tail() method.
A word of caution: In this tutorial, we access a Series directly with the .<Series name> notation (e.g. log.author). This works only if the name of the Series differs from the functions and properties that pandas already provides. For example, it doesn't work when you try to access a Series named count, because count() is a function of a Series. Here, you have to use the ['<Series name>'] notation (e.g. log['count']). The benefit of the direct access is that you are able to use auto completion (which is good at the beginning). The drawback is a source of potential future problems when pandas evolves and gets new functions and properties that conflict with your column names.
log.author.tail()
723209        akpm@osdl.org
723210        akpm@osdl.org
723211           Neil Brown
723212    Christoph Lameter
723213       Linus Torvalds
Name: author, dtype: object
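The naming conflict described in the word of caution can be reproduced in a few lines. This is a minimal sketch with a hypothetical DataFrame df whose count values are made up for illustration:

```python
import pandas as pd

# Hypothetical DataFrame with a column named like an existing pandas method.
df = pd.DataFrame({
    "author": ["Linus Torvalds", "Neil Brown"],
    "count": [2, 1],
})

by_attribute = df.count       # the built-in count() method, not the column
by_brackets = df["count"]     # the column, as intended

print(callable(by_attribute))   # True: this is a method
print(by_brackets.tolist())     # [2, 1]

# A column name without a conflict works with both notations.
print(df.author.equals(df["author"]))   # True
```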
Possible answers:
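Count the number of commits for each author with value_counts().
Keep only the first ten entries of the result and store them in the variable top10.
Display top10.
Note: Normally we would choose a more expressive name like top10_contributors to make sure we and others know what's in the variable. But in a tutorial, we want to keep things short.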
top10 = log.author.value_counts().head(10)
top10
Linus Torvalds           24259
David S. Miller           9563
Mark Brown                6917
Takashi Iwai              6293
Al Viro                   6064
H Hartley Sweeten         5942
Ingo Molnar               5462
Mauro Carvalho Chehab     5384
Arnd Bergmann             5305
Greg Kroah-Hartman        4687
Name: author, dtype: int64
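What value_counts() followed by head() does is easier to follow on a handful of values. A minimal sketch with a hand-made Series (the names appear in the dataset, but the six entries and the variable names authors, counts and top2 are made up for illustration):

```python
import pandas as pd

# Six hypothetical commits by three authors.
authors = pd.Series(
    [
        "Linus Torvalds",
        "Neil Brown",
        "Linus Torvalds",
        "Christoph Lameter",
        "Linus Torvalds",
        "Neil Brown",
    ],
    name="author",
)

# value_counts() counts each distinct value and sorts by count, descending.
counts = authors.value_counts()

# Because the result is already sorted, head(n) yields the top n entries.
top2 = counts.head(2)
print(top2)
```

In recent pandas versions (2.0 and later) the resulting Series is named count instead of author, so the last line of the printed output may differ from the output shown above.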
Possible answers:
Next, we want to visualize or plot the result. To display the plots of the internally used plotting library matplotlib directly in the notebook, we need to configure this first.
Run %matplotlib inline to display generated graphics directly in the notebook.
Plot top10 with plot().
%matplotlib inline
top10.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f948f5b5730>
Possible answers:
Call the bar() sub-method of plot for the data in top10.
top10.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9490578820>
Add a ; to the call above and re-execute it.
top10.plot.bar();
Possible answer:
The text <matplotlib.axes._subplots.AxesSubplot at ...> is no longer printed (which makes our notebook look nicer).
This data can also be visualized as a pie chart (which is a rare occasion and should stay one, because pie charts are evil).
Call the pie() sub-method of plot for the data in top10.
top10.plot.pie();
Possible answers:
Call plot() of the Series top10 with the following parameters:
kind="pie"
figsize=[5,5]
title="Top 10 Contributors"
label=""
Tip: Use auto completion.
top10.plot.pie(
    figsize=[5,5],
    title="Top 10 Contributors",
    label="");
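The exercise names the parameter kind="pie", while the cell above uses the pie() sub-method; both spellings produce the same kind of chart. A minimal sketch of the kind="pie" variant that also runs outside a notebook, assuming matplotlib is installed. It uses only the first three values of top10 from above, so the title is adapted; top3 and ax are hypothetical names.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd

# First three entries of top10, copied from the output above.
top3 = pd.Series(
    [24259, 9563, 6917],
    index=["Linus Torvalds", "David S. Miller", "Mark Brown"],
    name="author",
)

# plot(kind="pie", ...) is the generic form of plot.pie(...).
ax = top3.plot(
    kind="pie",
    figsize=[5, 5],
    title="Top 3 Contributors",
    label="",   # suppress the Series name as axis label
)
print(ax.get_title())
```

Outside a notebook, the returned Axes object can be used to save the chart, for example with ax.figure.savefig(...).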
This was a complete walkthrough of the basics of Jupyter Notebook, Python, pandas and matplotlib.
The next sessions introduce you to slightly more advanced techniques where you get to know the power of this stack!
If you want to dive deeper into this topic, take a look at my blog posts. I'm looking forward to your comments and feedback on GitHub or on Twitter!