How Long Does It Take To Review an IPython Pull Request?

Fernando Perez wrote:

IPython: 8 minutes from bug on my box to reviewed/merged PR. We're doing something right... https://github.com/ipython/ipython/pull/4294

This got me thinking: how long does it usually take to do a review on the IPython project?

In [1]:
import pandas as pd
import json
import itertools
import collections
import re
import arrow
rcParams['figure.figsize'] = (15, 3)
pd.set_option('display.mpl_style', 'default')

I recently released a python package called git2json for parsing git logs. It isn't well written, but it does what it says on the tin.

In [2]:
!git2json --git-dir=/home/tavish/code/ipython/.git > ipython-log.json

We can parse the resulting JSON file and take a peek at the data structure.

In [3]:
log = json.load(open('ipython-log.json'))
print log[0]
{u'committer': {u'date': 1380325308, u'timezone': u'-0700', u'name': u'Fernando Perez', u'email': u'[email protected]'}, u'author': {u'date': 1380325308, u'timezone': u'-0700', u'name': u'Fernando Perez', u'email': u'[email protected]'}, u'tree': u'ea7fbb2694175423d776f649104ba42520db7cea', u'parents': [u'795c9e30ee8ad2e1a4e780c439fc50b565670e26', u'796a90bdeef377041c353019fdae19aaa72a76d3'], u'commit': u'b8295e9af6745b9c466d11bb31a6eef221e231c1', u'message': u"Merge pull request #4294 from minrk/tornado-2don't require tornado 3 in `--post serve`", u'changes': []}
In [4]:
print "Number of commits = ", len(log)
Number of commits =  12382

Before we answer the question posed at the beginning of this article, I'll answer some easier questions. First off: which files in IPython have been committed to the most? This is a simplified version of "code churn" which is reasonably effective for predicting bugs. (More complicated models include lines modified or take semantic differences into account). Since this is Friday night and I'm not trying to get a paper into ICSE, I'll just take the number of commits for each .py file.

In [5]:
file_changes = lambda: itertools.chain.from_iterable(
    [change[2] for change in commit['changes'] if re.match(r'^.*\.py$', change[2])]
    for commit in log
)
In [6]:
rcParams['figure.figsize'] = (15, 6)
fchanges = file_changes()
fchange_count = collections.Counter(fchanges)
a = average(fchange_count.values())
most_common = fchange_count.most_common(20)
df = pd.DataFrame(most_common)
df.head()
df.index = df[0]
df = df[[1]]
df.head()
p = df.plot(kind='bar', legend=False)
p.set_title('Most-changed Files in IPython')
p.set_ylabel('Commits')
# Draw a red line at the average
hlines(a, 0, len(df), colors='r')
Out[6]:
<matplotlib.collections.LineCollection at 0x831ed90>
/usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:1214: UserWarning: findfont: Font family ['monospace'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Someone more familiar with IPython development will know how obvious the above graph is.

Next, I'll make a simple plot showing weekly commit counts over time, similar to the plots GitHub would show you. I'll create a data frame from a list in the format [(date_rounded_down_to_week, commit_id)] and then groupby() the date.

In [7]:
def weekly_date_resolution(ts):
    ar = arrow.arrow(ts)
    day_of_month = ar.timetuple().tm_mday
    week = int(day_of_month) / 7
    new_day = (week*7)+1
    assert new_day > 0
    assert new_day < 30
    try:
        day_adjusted = ar.replace(day=new_day)
    except ValueError:
        new_day = day_of_month # just keep the original
        day_adjusted = ar.replace(day=new_day)
    return day_adjusted.date()

commit_times = lambda: (
    (weekly_date_resolution(commit['committer']['date']), commit['commit'])
    for commit in log
)
dfct = pd.DataFrame(list(commit_times()), columns=['date', 'id'])
dfct = dfct.groupby('date').aggregate(len)
dfct.head()
Out[7]:
id
date
2005-07-01 1
2005-07-15 6
2005-08-08 2
2005-08-15 4
2005-08-22 1

A few more lines gives us a basic plot.

In [8]:
p = dfct.plot(legend=False)
p.set_title('Weekly commits on IPython')
p.set_ylabel('Commits')
Out[8]:
<matplotlib.text.Text at 0x8303250>

Let's get back to the original question: how long does it usually take to do a review?

I'll make a distinction between two different kinds of commits:

  • Merge commits: The commit that merged the pull request branch into the main branch.
  • PR commits: The HEAD (or most recent) commit on the pull request's source branch.
In [9]:
def is_pull_request_merge(message):
    '''Match commit messages that start with "Merge pull request #"'''
    return re.match(r'^Merge pull request #', message)


def pull_request_number(message):
    '''Extract the PR # from the commit message.'''
    return int(re.match(r'^Merge pull request #(\d+)', message).groups()[0])


# Create a table for the merge commits.
merge_commits = [
    [
        commit['parents'][1], # the second parent seems to be the source branch
        commit['committer']['date'],
        pull_request_number(commit['message']),
    ]
    for commit in log if is_pull_request_merge(commit['message'])
]
# Also create a set of merge commits for use in the next cell.
merge_commit_parent_ids = [mc[0] for mc in merge_commits]
merge_df = pd.DataFrame(
    merge_commits,
    columns=['merge_commit', 'merge_timestamp', 'pr'],
    index=merge_commit_parent_ids,
)
merge_df.head()
Out[9]:
merge_commit merge_timestamp pr
796a90bdeef377041c353019fdae19aaa72a76d3 796a90bdeef377041c353019fdae19aaa72a76d3 1380325308 4294
b0d6287174b62de9b89190b868317d12b5bc4ae4 b0d6287174b62de9b89190b868317d12b5bc4ae4 1380323422 4270
07b4962bd787068896f0e2b5dfb977441c7b391a 07b4962bd787068896f0e2b5dfb977441c7b391a 1380301596 4278
e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 1380271407 4282
68b0f9d2d5f29c540f7873141cb09026c492d0ef 68b0f9d2d5f29c540f7873141cb09026c492d0ef 1380215023 4279
In [10]:
# Create a table for pull request commits.
pr_commits = [
    [
        commit['commit'],
        commit['committer']['date'],
        len(commit['changes']), # let's sneak in another basic churn metric
    ]
    for commit in log
    # Do we have a merge commit for this commit?
    if commit['commit'] in merge_commit_parent_ids
]
commit_ids = [prc[0] for prc in pr_commits]
pr_df = pd.DataFrame(pr_commits, columns=['commit', 'commit_timestamp', 'churn'], index=commit_ids)
pr_df.head()
Out[10]:
commit commit_timestamp churn
796a90bdeef377041c353019fdae19aaa72a76d3 796a90bdeef377041c353019fdae19aaa72a76d3 1380325245 1
e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 1380162595 1
68b0f9d2d5f29c540f7873141cb09026c492d0ef 68b0f9d2d5f29c540f7873141cb09026c492d0ef 1380158115 3
07b4962bd787068896f0e2b5dfb977441c7b391a 07b4962bd787068896f0e2b5dfb977441c7b391a 1380153904 1
d4d6bc63d6c6dd34c5fe1270a90caec2e0208bfa d4d6bc63d6c6dd34c5fe1270a90caec2e0208bfa 1380099263 1

Now that we have these two data frames, we can merge them together.

In [11]:
both_df = pr_df.join(merge_df)
both_df.head()
Out[11]:
commit commit_timestamp churn merge_commit merge_timestamp pr
796a90bdeef377041c353019fdae19aaa72a76d3 796a90bdeef377041c353019fdae19aaa72a76d3 1380325245 1 796a90bdeef377041c353019fdae19aaa72a76d3 1380325308 4294
e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 1380162595 1 e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 1380271407 4282
68b0f9d2d5f29c540f7873141cb09026c492d0ef 68b0f9d2d5f29c540f7873141cb09026c492d0ef 1380158115 3 68b0f9d2d5f29c540f7873141cb09026c492d0ef 1380215023 4279
07b4962bd787068896f0e2b5dfb977441c7b391a 07b4962bd787068896f0e2b5dfb977441c7b391a 1380153904 1 07b4962bd787068896f0e2b5dfb977441c7b391a 1380301596 4278
d4d6bc63d6c6dd34c5fe1270a90caec2e0208bfa d4d6bc63d6c6dd34c5fe1270a90caec2e0208bfa 1380099263 1 d4d6bc63d6c6dd34c5fe1270a90caec2e0208bfa 1380149814 4131

We can take the delta by subtracting the merge timestamp from the commit timestamp. This is the difference between the commit-under-review's timestamp and the merge timestamp, in seconds. We can do some basic math to get the difference on other scales. The HTML version of the table shows the results.

In [12]:
rcParams['figure.figsize'] = (5, 10)
both_df['delta'] = both_df['merge_timestamp'] - both_df['commit_timestamp']
delta = both_df['delta']
both_df['delta_weeks'] = delta/60.0/60.0/24.0/7.0
both_df['delta_days'] = delta/60.0/60.0/24.0
both_df['delta_hours'] = delta/60.0/60.0
both_df['delta_mins'] = delta/60.0
both_df['delta_secs'] = delta
both_df.head()
Out[12]:
commit commit_timestamp churn merge_commit merge_timestamp pr delta delta_weeks delta_days delta_hours delta_mins delta_secs
796a90bdeef377041c353019fdae19aaa72a76d3 796a90bdeef377041c353019fdae19aaa72a76d3 1380325245 1 796a90bdeef377041c353019fdae19aaa72a76d3 1380325308 4294 63 0.000104 0.000729 0.017500 1.050000 63
e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 1380162595 1 e51d4d021c57c846ff1ad439d20b3fcc6b1ebc77 1380271407 4282 108812 0.179914 1.259398 30.225556 1813.533333 108812
68b0f9d2d5f29c540f7873141cb09026c492d0ef 68b0f9d2d5f29c540f7873141cb09026c492d0ef 1380158115 3 68b0f9d2d5f29c540f7873141cb09026c492d0ef 1380215023 4279 56908 0.094094 0.658657 15.807778 948.466667 56908
07b4962bd787068896f0e2b5dfb977441c7b391a 07b4962bd787068896f0e2b5dfb977441c7b391a 1380153904 1 07b4962bd787068896f0e2b5dfb977441c7b391a 1380301596 4278 147692 0.244200 1.709398 41.025556 2461.533333 147692
d4d6bc63d6c6dd34c5fe1270a90caec2e0208bfa d4d6bc63d6c6dd34c5fe1270a90caec2e0208bfa 1380099263 1 d4d6bc63d6c6dd34c5fe1270a90caec2e0208bfa 1380149814 4131 50551 0.083583 0.585081 14.041944 842.516667 50551

The first thing that stands out is (as of this writing) commit 796a90bdeef3770 has a delta of one minute. Fernando reported 8 minutes, which is much more plausible. This makes me wonder if some people have out-of-date clocks. Update: Fernando says that the review time was 1 minute; 8 minutes was from bug to pull request to commit.

Anyways, let's make a box plot. Box plots are fun.

In [13]:
rcParams['figure.figsize'] = (15, 3)
both_df[['delta_days']].boxplot(vert=False)
Out[13]:
{'boxes': [<matplotlib.lines.Line2D at 0x866bbd0>],
 'caps': [<matplotlib.lines.Line2D at 0x866b210>,
  <matplotlib.lines.Line2D at 0x866b710>],
 'fliers': [<matplotlib.lines.Line2D at 0x866d610>,
  <matplotlib.lines.Line2D at 0x866dad0>],
 'medians': [<matplotlib.lines.Line2D at 0x866d110>],
 'whiskers': [<matplotlib.lines.Line2D at 0x8665a50>,
  <matplotlib.lines.Line2D at 0x8665c90>]}

Based on this plot, we see that the IPython folks are pretty fast when it comes to reviewing and merging pull requests — the ones that get committed, anyways. There are some outliers, though: some take up to half a year before getting committed.

We can filter out all of these ranges that are larger than a month and ask: if a review takes less than a month, how long is it likely to take?

In [14]:
rcParams['figure.figsize'] = (15, 3)
both_df[both_df['delta_days'] < 30][['delta_days']].boxplot(vert=False)
Out[14]:
{'boxes': [<matplotlib.lines.Line2D at 0x8633550>],
 'caps': [<matplotlib.lines.Line2D at 0x862db50>,
  <matplotlib.lines.Line2D at 0x8633090>],
 'fliers': [<matplotlib.lines.Line2D at 0x8633f50>,
  <matplotlib.lines.Line2D at 0x8637450>],
 'medians': [<matplotlib.lines.Line2D at 0x8633a50>],
 'whiskers': [<matplotlib.lines.Line2D at 0x862d390>,
  <matplotlib.lines.Line2D at 0x862d5d0>]}

Based on this, it looks like most reviews happen within the first five days. The median is (I'm eyeballing this) around 1 day. Not bad at all.

If we look closely at the pull requests outside this range, we might see any number of things: pull requests that are large and hard to understand or pull requests that are blocked by some other task. I don't think pandas would help us much: we'd have to use our brains and look for patterns.

OK, one more question. Can we see any relationship between the review time and the number of files modified by the last pull request commit? This is a sad way to do it — since I'm not taking into account all commits on a pull request, I'm missing a lot of useful information — but it might be fun.

In [15]:
# The data is skewed, so i'll use the spearman correlation
both_df[['churn', 'delta']].corr(method='spearman')
Out[15]:
churn delta
churn 1.000000 0.048028
delta 0.048028 1.000000

There's almost no correlation, and the correlation that is there is probably just noise. Should we conclude that the number of files that you change in a pull request doesn't affect how long it will take to be reviewed? I don't think so, for two reasons:

  1. I'm completely ignoring pull requests that get ignored or closed. My professor has looked at how long it takes patches to get accepted on large open source projects like linux and he found that on certain projects like Linux, large patches were ignored or rejected. No one had time to understand them and review them. This might be the case for IPython. (Does anybody from the IPython project think this is true?)
  2. I'm only processing the tip of the pull request branch. A pull request can have many commits, and I should incorporate those into my churn metric.

I didn't really make any claims here, but let's pretend this is a real paper.

Threats to Validity

  1. It is 1am on a Friday night.
  2. The churn metric only represents the tip (ha) of the pull request iceberg.
  3. This analysis doesn't look at how review time changed over time. Can we get a less skewed picture of pull request length by looking at only the period in which the project was funded?
  4. This analysis doesn't look at actual review time — the pull request could have been made much later than when the pull request commit was made.

Conclusions

  1. The IPython folks are very prompt. Send them a decent pull request and they'll probably merge it within a day.
  2. The IPython Notebook is pretty cool.

Updates

A previous version of this notebook was incorrectly adding timezone differences to timestamps. After some experimentation, I've discovered that the timestamps provided by git2json don't need to be adjusted for timezone unless you want to display it in a human-readable format. Fixing this bug solved the "time travel" issue.