2013 Chicago Marathon Analysis

With close to 39,000 results, the 2013 Chicago Marathon Results combine two of my favourite topics, statistics and running. I decided to take this opportunity to learn more about pandas by using it to analyze the result set to provide some insight into how people run marathons.

Let me begin by saying I am not a statistician and this won't be a formal or rigorous statistical analysis. Instead, these are at best Malcolm Gladwell-esque observations, less the interesting storytelling. I will be cherry-picking results in order to illustrate something that may be "interesting", but will likely have no statistical significance.

In [1]:
import pandas as pd
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
# Plotting options.
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
figsize(15, 5)
In [3]:
# These results were obtained by scraping the official results site.
results = pd.read_csv('chicago_marathon_results.csv')
In [4]:
# Massage the data: change M-15/W-15 to just 0-15 to match the other divisions, which carry no gender prefix.
# Use .loc so the assignment modifies results itself rather than a temporary copy.
results.loc[results.division == 'M-15', 'division'] = '0-15'
results.loc[results.division == 'W-15', 'division'] = '0-15'

Here's what a sample result looks like, using my own result as an example. (Note that the DataFrame index is not the finishing place, because I had to scrape the male and female results separately.)

In [35]:
results[results.name_location == 'Chng, Peter (CAN)']
Out[35]:
     place  place_gender  place_division      name_location   city_state   bib  division  age
512    570           513             153  Chng, Peter (CAN)  Toronto, ON  1032     25-29   29

     half_split    finish  gender  finish_seconds  half_split_seconds  split_diff_seconds  split_diff
512    01:27:14  02:55:22       M           10522                5234                  54   0:00:54.0

Demographics

There were a total of 38,883 results obtained from the website, broken down into the following numbers for men/women:

In [6]:
gender_breakdown = results.groupby(['gender']).size();
ax = gender_breakdown.plot(kind='bar', figsize=(5,5))
pyplot.xticks(rotation='horizontal')
def label_gender_breakdown(values, total, ax):
    for i, value in enumerate(values):
        ax.text(0.5 + i, 1.05*value, value, size=14, weight='bold')
        ax.text(0.5 + i, 0.5*value, "{:.2%}".format(float(value)/total), size=12, weight='bold', color='w')
label_gender_breakdown(gender_breakdown, len(results), ax)

Age Groups

There's nothing surprising here. Most marathons tend to be split roughly 60/40 in terms of male/female ratio, so 55/45 is not too far off. A further breakdown by gender and age group is provided below:

In [7]:
m = results[results.gender == 'M'].groupby(['division']).size()
f = results[results.gender == 'W'].groupby(['division']).size()
d = {'M': m, 'F': f}
gender_division_counts = pd.DataFrame(d)
ax = gender_division_counts.plot(kind='bar', color=['r', 'b'])
ax.set_ylabel('Number of runners')
def label_gender_division_counts(d, ax):
    # Iterate over M/F separately because they may contain separate sets.
    for (i, value) in enumerate(d['F']):
        ax.text(0.25 + i, value + 50, value, weight='bold')    
    for (i, value) in enumerate(d['M']):
        ax.text(0.65 + i, value + 50, value, weight='bold')
label_gender_division_counts(d, ax)

My Malcolm Gladwell-esque observation: Women tend to care more about fitness (or at least marathons) than men earlier in life. The F25-29 group is by far the largest division represented here. However, the female numbers drop off after this age group, perhaps because the amount of free time women have is highest in their mid-to-late 20s.

On the other hand, men develop more concern for marathoning later in life, starting in their 30s, with the next largest divisions being M30-34, M35-39 and M40-44, all of which are roughly equal. These three age groups account for almost 28% of the total number of runners finishing the race. Curiously, male participation drops off steeply after the M40-44 age group. I would have thought it would remain stronger on account of men having more free time later in life and perhaps due to the "mid-life crisis" effect.

Altogether, the top five divisions by numbers (F25-29, M30-34, M35-39, M40-44 and F30-34) account for 46% of the total finishers.

Finishing times

Let's look at a graph of finishing time vs. finishing place. First, we have to convert the "HH:MM:SS" strings into an integer number of seconds for graphing purposes. (Perhaps there is an easier way by converting this into a timedelta or similar type instead?)

In [273]:
# TODO: PC: Redo this using Time Series/Date functionality: 
# http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects
In [8]:
# Convert a hh:mm:ss string to int seconds.
def timestring_to_seconds(time_string):
    if type(time_string) != str:
        return numpy.nan
    parts = time_string.split(':')
    return int(parts[0])*3600 + int(parts[1])*60 + int(parts[2])

# Convert an int seconds into a str hh:mm:ss
def seconds_to_timestring(time_seconds):
    if math.isnan(time_seconds):
        return nan
    
    # If the number is negative, work with the absolute value to avoid modulo with negative numbers.
    time = abs(time_seconds)
    
    hours = int(time/3600)
    minutes = int((time % 3600)/60)
    seconds = time % 60
    output = "{0}:{1:02.0f}:{2:04.1f}".format(hours, minutes, seconds)
    if time_seconds < 0:
        output = '-' + output
    return output

# Convert times into seconds.
results['finish_seconds'] = results['finish'].map(timestring_to_seconds)
results['half_split_seconds'] = results['half_split'].map(timestring_to_seconds)
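
On the timedelta question: here is a minimal sketch, assuming the times are always well-formed hh:mm:ss strings or missing values. pandas can parse such strings with pd.to_timedelta, and .dt.total_seconds() then gives the seconds as floats. The three sample values are for illustration only; I have not rerun the notebook with this approach.

```python
import pandas as pd

# Illustrative sample: the two times from the sample result plus one missing value.
times = pd.Series(['02:55:22', '01:27:14', None])

# Parse the hh:mm:ss strings into timedeltas; missing values become NaT.
deltas = pd.to_timedelta(times)

# Convert to seconds; NaT becomes NaN, similar to what timestring_to_seconds() returns.
seconds = deltas.dt.total_seconds()
print(seconds.tolist())
```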
In [9]:
# Axis formatter to convert seconds to a human-readable time.
def time_ticks(x, pos):
    d = datetime.timedelta(seconds=x)
    return str(d)
ax = results.sort(columns='finish_seconds').plot(x='place', y='finish_seconds', figsize=(15,5))
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
ax.yaxis.set_label_text('finishing time')
ax.xaxis.set_label_text('finishing place')
pyplot.title('Finishing time vs. finishing place')
Out[9]:
<matplotlib.text.Text at 0xb55ea58>

From what I've seen, this is a pretty typical result. The graph initially has a very steep slope and is concave (slope decreasing) until we reach what I call the "main group", between finishing places of roughly 3,000 and 35,000. Between these two values, the slope is roughly constant, and after this the slope starts to increase again (convex).

Interpretation: Initially, finishers are spread out more and there is a relatively large amount of time between finisher n and n + 1. By the time we get to finisher 5,000 (roughly 3:30), there is comparatively less time between subsequent finishers and thus runners are more "bunched up". As time goes on, however, finishers get more "drawn out", and the last hundred finishers have comparatively large time gaps between them.
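
The slope of the curve at a given place is just the time gap between consecutive finishers, so the "bunching" can be measured directly. A small sketch with made-up finishing times (in seconds) for three stretches of a race; these are not the Chicago numbers:

```python
import numpy as np

# Hypothetical consecutive finishers in three parts of the field (seconds).
segments = {
    'front': np.array([7500, 7560, 7610]),
    'main group': np.array([14000, 14001, 14001, 14002]),
    'back': np.array([27000, 27090, 27200]),
}

# np.diff gives the gap between finisher n and finisher n + 1;
# the mean gap is the average slope of the time-vs-place curve in that stretch.
mean_gaps = {name: np.diff(times).mean() for name, times in segments.items()}
print(mean_gaps)
```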

Some more mundane statistics:

In [10]:
# Average Finishing Time:
seconds_to_timestring(results.finish_seconds.mean())
Out[10]:
'4:32:24.7'
In [11]:
# Average Male Finishing Time:
seconds_to_timestring(results[results.gender == 'M'].finish_seconds.mean())
Out[11]:
'4:20:55.7'
In [12]:
# Average Female Finishing Time:
seconds_to_timestring(results[results.gender == 'W'].finish_seconds.mean())
Out[12]:
'4:46:35.7'
In [13]:
# Percentage of finishers faster than the mean:
"{:.2%}".format(float(len(results[results.finish_seconds < results.finish_seconds.mean()].index))/len(results))
Out[13]:
'53.74%'

Note that the share of finishers faster than the mean time of 4:32:24.7 is somewhat greater than 50%, indicating that the distribution is not symmetrical about the mean. (More on this later.)
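
A quick way to see this kind of asymmetry is to compare the mean with the median. In a right-skewed sample, a few very slow times pull the mean above the median, so more than half of the values fall below the mean. A toy sketch with invented times (in seconds), not the race data:

```python
import statistics

# Invented finishing times in seconds; the last two are slow outliers.
times = [12000, 13500, 14000, 14400, 15000, 15500, 16000, 16500, 22000, 27000]

mean = statistics.mean(times)
median = statistics.median(times)

# Fraction of runners who finished faster than the mean.
share_below_mean = sum(t < mean for t in times) / len(times)

print(mean, median, share_below_mean)
```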

Finishing Times by Age Group

The following are the average finishing times for each age group:

In [14]:
avg_finish_time_groups = results.groupby(['gender', 'division']).finish_seconds.mean()
avg_finish_time_groups = pd.DataFrame({'M': avg_finish_time_groups['M'], 'F': avg_finish_time_groups['W']})
ax = avg_finish_time_groups.plot(kind='bar')
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
ax.yaxis.set_major_locator(MultipleLocator(60*30))
pyplot.title('Average finish time for each age group')
Out[14]:
<matplotlib.text.Text at 0xb9f7da0>

The time difference between genders is remarkably consistent across most of the age groups. Additionally, average times remain relatively consistent from 20-24 all the way to 40-44.

In [15]:
male_female_differences = results.groupby(['gender', 'division']).finish_seconds.mean()['W'] - results.groupby(['gender', 'division']).finish_seconds.mean()['M']
ax = male_female_differences.plot(kind='bar')
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
ax.set_ylim([0, 60*35])
ax.yaxis.set_major_locator(MultipleLocator(300))
pyplot.title('Female-Male average finish time difference')
Out[15]:
<matplotlib.text.Text at 0xb9da860>

Above is a graph showing the time difference between the average female finishing time and average male finishing time for each age group. As you can see, the majority of the differences were between 25 and 30 minutes.

Finishing Time Percentiles

Now, let's define a function that accepts a percentile and outputs the maximum time you could have run and still finished within that percentile.

In [16]:
def percentile_time(percentile):
    if percentile < 0 or percentile >= 1:
        return None
    # Subtract one because indexing is 0-based.
    location = int(percentile * len(results)) - 1
    if location < 0:
        # Time is too fast; no one finished faster than that!
        return None
    # NOTE: Have to use iloc[] to get the integer-based position rather than relying on the label using .loc[] or .ix[]
    return results.sort(columns='finish_seconds').iloc[location]['finish']

Let's run the numbers with some percentiles:

In [17]:
percentiles = pd.Series([0.01, 0.05, 0.1, 0.2, 0.5])
finishing_time_percentiles = pd.Series([percentile_time(p) for p in percentiles])
pd.DataFrame({'percentiles': percentiles, 'finishing_time': finishing_time_percentiles})
Out[17]:
  finishing_time  percentiles
0       02:50:20         0.01
1       03:14:11         0.05
2       03:28:35         0.10
3       03:46:58         0.20
4       04:27:26         0.50

As you can see, finishing in 3:28:35 or faster was good enough to finish in the top 10%.
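
pandas also has a built-in for this kind of lookup. A minimal sketch, assuming a Series of finishing times in seconds (the values below are made up): Series.quantile with interpolation='lower' returns an actual observed time rather than a value interpolated between two finishers. Its indexing convention is slightly different from percentile_time(), so on real data the two can land one place apart.

```python
import pandas as pd

# Made-up finishing times in seconds, one per runner.
finish_seconds = pd.Series([9000, 10000, 11000, 12000, 13000,
                            14000, 15000, 16000, 17000, 18000])

# interpolation='lower' picks an observed value instead of interpolating
# between two neighbouring finishers.
cutoffs = finish_seconds.quantile([0.1, 0.2, 0.5], interpolation='lower')
print(cutoffs)
```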

Percentile Finishing Times Graph

Plotting this information gives us a cumulative distribution function (CDF) of finishing times, which nicely shows the percentage of runners finishing faster than a given time:

In [18]:
def percentage_ticks(value, pos):
    return "{:.2%}".format(value)

plot_times = [7200 + 300*i for i in range(0, 6*12 + 1)] # 5-minute steps from 2:00 to 8:00
num_results = float(len(results)) # Use a float divisor so the division below is not truncated to an integer.
percentiles = [len(results[results.finish_seconds < t].index)/num_results for t in plot_times]
# cdf = pd.DataFrame(percentiles, index=plot_times)
fig, ax = pyplot.subplots()
fig.set_size_inches(16, 6)
pylab.xticks(rotation='vertical')
pyplot.title('Percentile Finishing Times')
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(percentage_ticks))
ax.yaxis.set_major_locator(MultipleLocator(0.10))
ax.yaxis.set_minor_locator(MultipleLocator(0.05))
ax.yaxis.set_label_text('Percentage finishing faster')
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
ax.xaxis.set_major_locator(MultipleLocator(300)) # X-axis labels every 300 s/5 min
ax.xaxis.set_label_text('Finishing time')
# Adjust the positioning of the bars to line up with the x-label values. (3:30 should be 10.82%, 3:40 15.77%)
bars = pyplot.bar(left=[i - 200 for i in plot_times], height=percentiles, width=300)

As you can see in this graph, around 10% of finishers were 3:30 or faster. To get into the top 5%, you had to run faster than ~3:15. Getting into the top 50% required being faster than 4:30. Interestingly, a sub-3 finish, considered a tough goal by many amateurs including myself, puts you within the top 3% in this mass-participation event.

Finishing Time Distribution and Goal Times

Now, let's look at a histogram of all the finishing times. In the following histogram, the number of finishers in each five-minute interval is plotted:

Note that the histogram is basically the empirical analogue of the probability mass function (PMF) corresponding to the CDF above; indeed a graph of the partial summation of the histogram below would pretty much yield the CDF above.
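
To make the histogram/CDF relationship concrete, here is a short sketch with made-up times, using five-minute bins starting at 2:00:00. The running total of the bin counts, divided by the number of runners, is the fraction finishing faster than each bin's right edge.

```python
import numpy as np

# Made-up finishing times in seconds (not the race data).
finish_seconds = np.array([10400, 10550, 12350, 14150, 14200, 14390, 14500, 17800])

# Five-minute bin edges from 2:00:00 (7200 s) to 8:00:00 (28800 s).
bins = [7200 + 300 * i for i in range(0, 6 * 12 + 1)]

counts, edges = np.histogram(finish_seconds, bins=bins)

# Partial sums of the histogram, normalised by the number of runners.
cdf = np.cumsum(counts) / len(finish_seconds)

# Bin 23 covers 3:55:00-4:00:00, so cdf[23] is the share faster than 4:00:00...
sub_4_from_cdf = cdf[23]
# ...which should match a direct count.
sub_4_direct = np.mean(finish_seconds < 14400)
print(sub_4_from_cdf, sub_4_direct)
```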

In [19]:
# For plotting a histogram of finishing times:
# Five-minute bins starting from 2:00:00.
bins=[7200 + 300*i for i in range (0, 6*12 + 1)]
def finish_time_histogram(results, color=None):
    fig, ax = pyplot.subplots()
    ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
    ax.xaxis.set_major_locator(MultipleLocator(300))
    ax.yaxis.set_label_text('number of finishers')
    pylab.xticks(rotation='vertical')
    return pyplot.hist(results.finish_seconds, bins=bins, color=color)
In [20]:
plot = finish_time_histogram(results)
plot[2][23].set_facecolor('r')

Whoa! This is a pretty interesting result, in my opinion. If everyone were running to their ultimate best, you would expect a somewhat smooth graph that would be reflective of human ability. In fact, I would expect it to be almost a normal distribution, with equal numbers of runners faster and slower than the mean, though I have no evidence to suggest running ability is normally distributed like other human attributes such as height.

Instead, we have one extreme outlier bin - the 3:55-4:00 group of finishers. (I've highlighted this bin in red in the graph above)
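
The index 23 used to pick out that bin follows from the bin layout: the bins are five minutes (300 s) wide and start at 2:00:00 (7200 s). A small helper makes the arithmetic explicit. It is hypothetical and not part of the notebook itself.

```python
def bin_index(time_string, first_edge=7200, width=300):
    """Return the index of the five-minute bin containing an h:mm:ss time."""
    hours, minutes, seconds = (int(part) for part in time_string.split(':'))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return (total_seconds - first_edge) // width

print(bin_index('3:55:00'))  # 23, the 3:55-4:00 bin
```

The same arithmetic gives 11 for the 2:55-3:00 bin and 17 for the 3:25-3:30 bin.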

My explanation: To make a vastly generalized statement, people do not aim to run their best time. Instead, they aim to run well enough to achieve some goal time. Anecdotally, this is what I've always heard when talking with other runners in my group: We all aim to hit some goal, whether it is sub-4:00, sub-3:30 or a Boston Qualifying (BQ) time. We run just hard enough to meet the goal, but not exceed it by a large margin. (The reasoning is that trying to go out too fast could cause you to crash and completely miss your goal, so you usually try to maintain a pace that would have you finish not more than five minutes faster than your goal.)

When looking at the above graph, it's clear to me that the most common goal time is a sub-4 hour finish. Again, I have no evidence to support this claim, but I believe many people who finished in the 3:55-4:00 bin could have run faster, but achieving a sub-4 finish was their goal. (I am not criticizing anyone here, just trying to make sense of the numbers, and indeed this was my strategy when running my first sub-3 hour marathon: Run the race "just good enough", and no better. After all, why expose yourself to more pain than is necessary?)

To test out my "hypothesis" more, let's look at specific gender-age groups I've cherry-picked to illustrate my point: The following is for the M30-34 group:

In [21]:
plot = finish_time_histogram(results[(results.division == '30-34') & (results.gender == 'M')])
plot[2][11].set_facecolor('r')
plot[2][17].set_facecolor('r')
plot[2][23].set_facecolor('r')

The 3:55-4:00 bin is less prominent in M30-34 than for all runners, but it's still the largest bin and there is a huge drop off in the next slowest bin. Seems like it's far less desirable to finish in 4:00-4:05.

The other anomaly here is the huge jump in the 2:55-3:00 bin compared to the previous one. This likely reflects the number of younger males going for that sub-3 finish, myself included. Again, we all seem to have settled on the "optimal" strategy of running fast enough, but no faster.

Another big jump happens in the 3:25-3:30 bin reflecting the sub-3:30 crowd. Curiously, there is no big jump in the 3:00-3:05 bin, which I would have expected given that a Boston Qualifier for this age group is 3:05 or faster. However, these runners could have decided to go for the sub-3 since it's so close. (Or maybe that's just my after-the-fact explanation, to make the theory fit the facts - again, I am not claiming to be scientific here at all.)

Let's look at the F30-34 division:

In [22]:
plot = finish_time_histogram(results[(results.division == '30-34') & (results.gender == 'W')])
plot[2][18].set_facecolor('r')
plot[2][23].set_facecolor('r')
plot[2][29].set_facecolor('r')
plot[2][35].set_facecolor('r')

Three prominent peaks are seen here: 3:55-4:00 (sub-4), 4:25-4:30 (sub-4:30) and 4:55-5:00 (sub-5). These probably reflect the most common goal times of this particular division/age group, assuming goal times correlate with finishing times.

There is a smaller anomaly resulting in a slightly larger 3:30-3:35 group than the surrounding ones, but it's not enough to be statistically significant I think. (Don't ask me to calculate the p-value!) If it means anything, it's because the BQ time for this age/gender group is 3:35 or faster.

Let's look at one other group I've cherry-picked: M35-39:

In [23]:
plot = finish_time_histogram(results[(results.division == '35-39') & (results.gender == 'M')])
plot[2][13].set_facecolor('r')
plot[2][17].set_facecolor('r')
plot[2][21].set_facecolor('r')
plot[2][23].set_facecolor('r')

There are a few patterns. The first is the sub-3:10 crowd in the 3:05-3:10 bin, which corresponds to coming in just under the BQ time for this age group. The second is the sub-3:30 crowd, and finally, the ubiquitous sub-4 group. There also appears to be a significant sub-3:50 group. I've highlighted all four groups.

In fact, when I looked at the histograms of most of the larger age-groups, one thing became clear: The sub-4 goal is pretty common across most divisions, as evidenced by the histogram across all finishing times. The "BQ" effect was less reliable, showing up sometimes like in the M35-39 group above, while not showing up as much in other groups.

Here's a combined histogram of the finishing times of the five largest age groups:

In [24]:
d = {
    'F25-29': results[(results.division == '25-29') & (results.gender == 'W')].finish_seconds,
    'M30-34': results[(results.division == '30-34') & (results.gender == 'M')].finish_seconds,
    'M35-39': results[(results.division == '35-39') & (results.gender == 'M')].finish_seconds,
    'M40-44': results[(results.division == '40-44') & (results.gender == 'M')].finish_seconds,
    'F30-34': results[(results.division == '30-34') & (results.gender == 'W')].finish_seconds,
     }

fig, ax = pyplot.subplots()
fig.set_size_inches(50, 10, forward=True)
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
ax.xaxis.set_major_locator(MultipleLocator(300))
ax.yaxis.set_label_text('number of finishers')
pylab.xticks(rotation='vertical')
plot = pyplot.hist(d.values(), label=d.keys(), bins=bins, histtype='bar')
pyplot.legend()
Out[24]:
<matplotlib.legend.Legend at 0xc8c38d0>

The graph is a little too large to interpret, but I included it anyway. You'll have to open it in a separate window and commit the sin of horizontal scrolling to see all of it. But the same story basically holds. Other than the F25-29 group, the 3:55-4:00 bin is the largest for most other age groups seen here, and usually by quite a margin.

Boston Qualifiers

I've talked about Boston Qualifying times a lot, so let's see what percentage of runners achieved a Boston Qualifying (BQ) time in each age-gender group. Boston Qualifying times are based on two factors: Your gender and the age group you fall into. The times increase as your age increases and females get 30 more minutes as compared to a male of the same age.

Note that these statistics may not be 100% valid, as the age in the Chicago Marathon results is probably the age of the runner on race day, whereas your BQ time is determined by your age on the day of the Boston Marathon, not the date when you ran your qualifying race. Thus, if you were 34 on the date of the Chicago Marathon but turned 35 on or before the Boston Marathon, your qualifying time would be 3:10 or faster.

Note that I am excluding the 0-15 and 16-19 age groups, as those contain (at least some) members not eligible to run the Boston Marathon due to age requirements.

In [25]:
# First, define the BQ times for each age-gender group specified in the results. 
# Thankfully, they don't cut across the BQ age groups.
# I should really put these into a CSV and do pd.read_csv() rather than being stupid and manually inputting it here.
bq_times = {
  'M20-24': '3:05:00',
  'M25-29': '3:05:00',
  'M30-34': '3:05:00',
  'M35-39': '3:10:00',
  'M40-44': '3:15:00',
  'M45-49': '3:25:00',
  'M50-54': '3:30:00',
  'M55-59': '3:40:00',
  'M60-64': '3:55:00',
  'M65-69': '4:10:00',
  'M70-74': '4:25:00',
  'M75-79': '4:40:00',
  'M80+': '4:55:00',
  
  'W20-24': '3:35:00',
  'W25-29': '3:35:00',
  'W30-34': '3:35:00',
  'W35-39': '3:40:00',
  'W40-44': '3:45:00',
  'W45-49': '3:55:00',
  'W50-54': '4:00:00',
  'W55-59': '4:10:00',
  'W60-64': '4:25:00',
  'W65-69': '4:40:00',
  'W70-74': '4:55:00',
  'W75-79': '5:10:00',
  'W80+': '5:25:00',
}
In [26]:
m = results[results.gender == 'M'].groupby(['division']).size()
f = results[results.gender == 'W'].groupby(['division']).size()

def calc_bq_percentages(gender, groups):
    bq_indexes = list()
    bq_percentages = list()
    for i, value in enumerate(groups):
        bq_cat = gender + groups.index[i]
        if bq_cat in bq_times:
            bq_indexes.append(groups.index[i])           
            bq_time = timestring_to_seconds(bq_times[bq_cat])
            bq_results = results[
                    (results.gender == gender) & 
                    (results.division == groups.index[i]) & 
                    (results.finish_seconds <= bq_time)]  # "or faster" includes the standard itself
            num_qualifiers = len(bq_results.index)
            bq_percentages.append(num_qualifiers/float(groups[i]))
    return bq_indexes, bq_percentages

male_bq_percentages = calc_bq_percentages('M', m)
female_bq_percentages = calc_bq_percentages('W', f)

male_bq_percentages = pd.Series(male_bq_percentages[1], male_bq_percentages[0])
female_bq_percentages = pd.Series(female_bq_percentages[1], female_bq_percentages[0])
bq_d = {'M': male_bq_percentages, 'F': female_bq_percentages}
bq_percentages = pd.DataFrame(bq_d)

ax = bq_percentages.plot(kind='bar')
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(percentage_ticks))
ax.yaxis.set_major_locator(MultipleLocator(0.10))
ax.yaxis.set_minor_locator(MultipleLocator(0.05))
ax.set_ylim([0, 0.4])
pyplot.title('Percentage of runners meeting Boston Marathon qualifying standards')

def label_percentages(d, ax, v_offset):
    # Iterate over M/F separately because they may contain separate sets.
    format_str = "{:.1%}"
    for (i, value) in enumerate(d['F']):
        ax.text(0.25 + i, value + v_offset, format_str.format(value), weight='bold')    
    for (i, value) in enumerate(d['M']):
        ax.text(0.65 + i, value + v_offset, format_str.format(value), weight='bold')
label_percentages(bq_d, ax, 0.02)
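
The loop in calc_bq_percentages can also be written as a vectorised lookup. Below is a sketch on a handful of invented rows, with column names as in results and only three of the standards from bq_times copied in. Treat it as an illustration of the idea rather than a drop-in replacement.

```python
import pandas as pd

# Invented sample rows; the real data has one row per finisher.
sample = pd.DataFrame({
    'gender': ['M', 'M', 'M', 'W', 'W'],
    'division': ['30-34', '30-34', '35-39', '30-34', '30-34'],
    'finish_seconds': [10522, 12000, 11300, 12800, 15000],
})

# Three of the BQ standards, converted to seconds (3:05, 3:10 and 3:35).
bq_seconds = {
    'M30-34': 3 * 3600 + 5 * 60,
    'M35-39': 3 * 3600 + 10 * 60,
    'W30-34': 3 * 3600 + 35 * 60,
}

# Look up each runner's standard from their gender + division key.
standard = (sample['gender'] + sample['division']).map(bq_seconds)

# Mark qualifiers ("or faster" is taken to include the standard itself),
# then take the mean per group to get the qualifying share.
sample['qualified'] = sample['finish_seconds'] <= standard
bq_share = sample.groupby(['gender', 'division'])['qualified'].mean()
print(bq_share)
```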

Generally, neither gender appears to have a significant advantage over the other when it comes to obtaining a BQ, despite many who claim that the extra 30 minutes that females get is "unfair". Therefore, one interpretation of the raw data here is that BQ times are "fair" for the different genders.

Also, it appears that the percentage of runners BQ'ing increases with increasing age. This could mean that the extra time given with age is more than is needed to compensate for decreasing running ability as one ages, or it could simply mean that older runners train harder.

Also note that the older age groups do not have as many runners, and so are more affected by statistical anomalies. For example, the male and female 70-74 groups had only 51 and 12 finishers, respectively.

Chicago Qualifying Standards

Background: The Chicago Marathon (CM) has not had guaranteed entry via qualifying times until this year. This change was prompted by the fiasco surrounding last year's registration, where the servers of the third party handling the registration went down under the traffic experienced on the first day of registration. About 25,000 people were able to register and the remaining 15,000 spots were filled by an impromptu lottery.

To avoid a repeat of the same situation, registration this year will be conducted via a lottery as is the case with many other large races where demand greatly exceeds supply. However, there are certain ways to get guaranteed entry, one of which is to run a qualifying time. (This is similar to how the New York City Marathon allots spots)

Just for fun, let's see what percentage of finishers in each age-group division of the 2013 race would qualify for guaranteed entry into the 2014 Chicago Marathon.

The qualifying times are 3:15:00 or faster for men and 3:45:00 or faster for women. Unlike BQ times, there is no age adjustment. This would seem to put older folks at a significant disadvantage, so let's see how the effect scales with age groups:

In [27]:
m = results[results.gender == 'M'].division.value_counts()
f = results[results.gender == 'W'].division.value_counts()

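def cm_percentiles(gender, time, division_counts):
    # Share of finishers of the given gender in each division at or under the qualifying time.
    # Use the gender and time arguments rather than hard-coded male values.
    qualifying_seconds = timestring_to_seconds(time)
    percentiles = dict()
    for division in division_counts.index:
        n_results = len(results[(results.gender == gender) & (results.division == division) & (results.finish_seconds <= qualifying_seconds)].index)
        percentiles[division] = n_results/float(division_counts[division])
    return percentiles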

cm_male_percentiles = pd.Series(cm_percentiles('M', '3:15:00', m))
cm_female_percentiles = pd.Series(cm_percentiles('W', '3:45:00', f))
cm_d = {'M': cm_male_percentiles, 'F': cm_female_percentiles}
cm_qualifying_percentages = pd.DataFrame(cm_d)
ax = cm_qualifying_percentages.plot(kind='bar')
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(percentage_ticks))
ax.set_ylim([0, 0.2])
label_percentages(cm_d, ax, 0.01)
pyplot.title('Percentage of runners meeting Chicago Marathon qualifying standards')
Out[27]:
<matplotlib.text.Text at 0xcba94e0>

For males, the percentage of CM qualifiers seems to peak between 20-24 and 25-29 and is relatively steady until after 40-44. Qualifying for guaranteed entry for a male over 50 would seem to be pretty hard.

For females, the story is a little different. The groups with the largest percentage of qualifiers are between 30 and 44 and the percentages don't drop off until the 55-59 age group. Females would seem to have an advantage in qualifying over men in the older age groups.

Indeed, the only age groups where males have a higher qualifying percentage than females are 20-24 and 25-29. In every other age group, females have a higher qualifying percentage, sometimes by a huge margin.

This is in stark contrast to the BQ percentages, where the male/female differences are seldom more than a percentage point. The data here would seem to suggest that Chicago Marathon qualifying times are biased towards women and biased against increasing age.

My personal opinion is that the qualifying times for guaranteed entry should be age-adjusted, and the easiest way is just to adopt BQ standards.

In [28]:
print("M: " + str(len(results[(results.gender == 'M') & (results.finish_seconds <= timestring_to_seconds('3:15:00'))].index)/float(len(results[results.gender == 'M']))))
print("W: " + str(len(results[(results.gender == 'W') & (results.finish_seconds <= timestring_to_seconds('3:45:00'))].index)/float(len(results[results.gender == 'W']))))
M: 0.0837211466865
W: 0.0958321356712

On the whole, 8.4% of men and 9.6% of women who ran in 2013 would qualify for 2014. (Note that this is similar to the percentage of BQ qualifiers in the 40-44 age group, which coincidentally, has the same 3:15/3:45 qualifying times as the CM)

Positive/Negative splits

A positive split is where a runner takes longer to finish the second half of a course than the first half. Conversely, a negative split is where the latter half of the course is completed in less time than the first half. Many runners consider running an even or negative-split race to be a hallmark of a good race strategy, as it seems to indicate that one didn't go out too fast and "crash" at the end.

The next set of calculations will be in determining the split differences (difference between the second half split and first half split times) for each participant: (Note that there were 49 results without a half split time, so these will not be considered)

In [29]:
# Difference of second half split minus first half split.
results['split_diff_seconds'] = (results['finish_seconds'] - results['half_split_seconds']) - results['half_split_seconds']
results['split_diff'] = results['split_diff_seconds'].map(seconds_to_timestring)
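
As a worked example, take the sample result shown near the top: a finish of 02:55:22 with a half split of 01:27:14. The helper below is a stand-alone copy of the string conversion, included only so the snippet runs on its own.

```python
def to_seconds(time_string):
    """Convert an hh:mm:ss string to integer seconds."""
    hours, minutes, seconds = (int(part) for part in time_string.split(':'))
    return hours * 3600 + minutes * 60 + seconds

finish = to_seconds('02:55:22')      # 10522
first_half = to_seconds('01:27:14')  # 5234
second_half = finish - first_half    # 5288

# Positive means the second half was slower (a positive split).
split_diff = second_half - first_half
print(split_diff)  # 54
```

So this was a positive split of 54 seconds, matching the split_diff_seconds value in that row.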
In [30]:
# Returns results sorted by negative-split based on criteria.
def results_by_negative_split(df, criteria):
    return df[criteria].sort(columns='split_diff_seconds')
In [31]:
# Number of records without a half split time.
# results[results.half_split.apply(lambda x: False if type(x) == str else math.isnan(x))].place.count()

It is often said that aiming to run an even or negative-split race is the "optimal" strategy. As a very blunt test of this hypothesis, let's compare the average finishing times of those that ran even/negative-split races with those that ran positive-split races.

In [105]:
def plot_split_differences(threshold=0, time_limit=None):
    even_neg_split = results[results.split_diff_seconds <= threshold]
    pos_split = results[results.split_diff_seconds > threshold]
    
    if time_limit:
        even_neg_split = even_neg_split[even_neg_split.finish_seconds <= time_limit]
        pos_split = pos_split[pos_split.finish_seconds <= time_limit]
    
    even_neg_split = even_neg_split.finish_seconds
    pos_split = pos_split.finish_seconds
    
    split_d_counts = {
        'even/negative split': even_neg_split.count(),
        'positive split': pos_split.count(),
    }
    split_d_mean = {
        'even/negative split': even_neg_split.mean(),
        'positive split': pos_split.mean(),
    }
    split_d = {'mean_finish_time': split_d_mean, 'number_of_runners': split_d_counts}
    split_df = pd.DataFrame(split_d)
    ax = split_df.mean_finish_time.plot(kind='bar')
    ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
    ax.yaxis.set_major_locator(MultipleLocator(60*30))
    pylab.xticks(rotation='horizontal')
    pylab.title('Mean finish time for split types (Threshold = {} seconds, Limit = {})'.format(threshold, seconds_to_timestring(time_limit or nan)))
    label_bar_values(ax, split_df.mean_finish_time, split_df.number_of_runners, 'count')
    return ax
def label_bar_values(ax, series_values, other=None, other_label=None):
    for i, value in enumerate(series_values):
        ax.text(i + 0.5, value + 300, seconds_to_timestring(value), size='14', weight='bold')
        if other is not None:
            ax.text(i + 0.5, 7200, "{}: {}".format(other_label, other.ix[i]), size='12', weight='bold')

plot_split_differences(threshold=0)
Out[105]:
<matplotlib.axes.AxesSubplot at 0x10971c50>

This shows there's a clear difference in finishing times between those who ran an even/negative split race and those who ran a positive split. Additionally, the ratio of even/negative splits to positive splits is about 1:10.

Note that this simplistic comparison ignores confounding factors and other variables. For example, it would be wrong to look at this graph and conclude that pursuing an even or negative split strategy is the optimal way to run a marathon, because such an explanation ignores the fact that many of the people who ran a positive split likely aimed to run even/negative but failed in that regard and ended up with a positive split. (That is, very few people likely aim to run a positive split)

All that we can conclude from this graph is exactly what it shows: Those that ran a positive split, for whatever reason, had on average, slower finishing times than those that did not.

One possible interpretation of these results is that runners who set overly ambitious goals and start too fast tend to finish slower than those who hold back and are able to run an even or negative-split race.

Split differences allowing for slightly positive splits

In the above graph, I've segregated the groups according to whether they've run an even/negative split or positive split; that is, the dividing line was a difference of zero seconds. In reality, many people run fine races with slightly positive splits. Does running a slightly positive split increase or decrease the average finishing time?

In [106]:
plot_split_differences(threshold=60)
Out[106]:
<matplotlib.axes.AxesSubplot at 0x10ccd588>

By changing the threshold for what we consider to be an "even/negative split" from 0 to 60 seconds, we can see that the average finish time of the "even/negative split" group dropped from 4:08 to 4:03.

I actually found that allowing for a threshold of up to 270 seconds (4:30) gave the lowest average finishing time (just over 3:59), which seems to indicate that running a positive split of less than five minutes doesn't adversely affect finishing time and may, in fact, decrease it.
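The threshold search described above can be sketched as a simple sweep. This is a minimal illustration on made-up data, not the code behind the figures: the `toy` frame and the helper `mean_finish_at_threshold` are hypothetical, and only the column names `finish_seconds` and `split_diff_seconds` come from the results set.

```python
import pandas as pd

# Made-up data for illustration only; column names mirror the real results.
toy = pd.DataFrame({
    'finish_seconds':     [15000, 12600, 13200, 14400, 16200, 18000],
    'split_diff_seconds': [  -30,    45,   200,   260,   900,  1800],
})

def mean_finish_at_threshold(df, threshold):
    """Mean finish time of runners whose split difference is <= threshold."""
    group = df[df.split_diff_seconds <= threshold]
    return group.finish_seconds.mean()

# Sweep thresholds from 0 to 10 minutes in 30-second steps.
thresholds = list(range(0, 601, 30))
sweep = pd.Series([mean_finish_at_threshold(toy, t) for t in thresholds],
                  index=thresholds)

# Threshold (in seconds) whose "even/negative" group has the lowest mean.
best_threshold = sweep.idxmin()
print(best_threshold, sweep.loc[best_threshold])
```

Running the same sweep against `results` instead of `toy` would be one way to look for a minimum like the 270-second one mentioned above.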

Calculating the rank correlation between finishing time and split difference shows a moderate positive correlation: finishing times tend to increase as splits become more positive. (Again, I have not bothered to calculate the p-value for this.)

In [34]:
results.finish_seconds.corr(results.split_diff_seconds, method='spearman')
Out[34]:
0.5639565640533587
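For readers unfamiliar with `method='spearman'`: Spearman's rank correlation is the Pearson correlation computed on the ranks of the values rather than on the values themselves. A small sketch on made-up numbers (the variable names are illustrative, and this does not produce a p-value; a separate tool such as `scipy.stats.spearmanr` would be needed for that):

```python
import pandas as pd

# Made-up values for illustration; not taken from the results set.
finish = pd.Series([10500, 12600, 13200, 14400, 16200])
split_diff = pd.Series([60, -30, 200, 150, 900])

# Spearman's rho is the Pearson correlation of the ranks
# (ties, if any, receive the average of the ranks they span).
rho = finish.rank().corr(split_diff.rank())
print(round(rho, 3))
```

Because only the ordering matters, a handful of extremely slow finishers affects this measure far less than it would a plain Pearson correlation.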

Conclusion

There are many ways to slice and dice this data, and I have chosen only a few in order to illustrate various points. Furthermore, there are even more ways to interpret the data. I don't claim to make any definitive judgments based on it and have only presented my own opinions.

The most surprising outcome for me was the distribution of finishing times, which clearly shows the influence of goal-oriented finish times.

Additional Stats by Request

Number of runners without a city_state defined: (Number is negligible compared to the total number of runners: 38,883)

In [171]:
results[results.city_state.isnull()].finish_seconds.count()
Out[171]:
33

Mean finish times for local vs. non-local runners:

In [186]:
is_chicago = results.city_state.map(lambda x: 'Chicago' in x if type(x) == str else False)
is_elsewhere = results.city_state.map(lambda x: 'Chicago' not in x if type(x) == str else False)
print("Mean finish time for '*Chicago*': " + seconds_to_timestring(results[is_chicago].finish_seconds.mean()))
print("Mean finish time for others: " + seconds_to_timestring(results[is_elsewhere].finish_seconds.mean()))
Mean finish time for '*Chicago*': 4:40:10.8
Mean finish time for others: 4:30:29.9

Most frequent cities: Note that city_state locations have not been normalized, so there may be multiple textual representations for a single city. Mexico City is an example.

In [149]:
results.city_state.value_counts()[:10]
Out[149]:
Chicago, IL              8072
Mexico City, DF           626
New York, NY              429
Naperville, IL            388
Toronto, ON               284
Evanston, IL              226
Arlington Heights, IL     215
Oak Park, IL              198
Sao Paulo, SP             195
Aurora, IL                190
dtype: int64
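One way the location strings could be normalized before counting is sketched below. This is only an illustration: the spellings in `city_state` are made up, and the `ALIASES` table and `normalize_city` helper are hypothetical. A real alias table would have to be built by inspecting the actual values.

```python
import pandas as pd

# Made-up spellings for illustration only.
city_state = pd.Series([
    'Mexico City, DF', 'Mexico, DF', ' mexico city, df ',
    'Chicago, IL', 'chicago, il', None,
])

# Hypothetical alias table mapping variant spellings to one canonical key.
ALIASES = {'mexico, df': 'mexico city, df'}

def normalize_city(value):
    """Collapse whitespace, lower-case and apply aliases; leave missing values alone."""
    if not isinstance(value, str):
        return value
    key = ' '.join(value.split()).lower()
    return ALIASES.get(key, key)

normalized = city_state.map(normalize_city)
counts = normalized.value_counts()  # missing values are dropped by default
print(counts)
```

Lower-casing loses the original capitalization, so the normalized key is best kept in a separate column rather than overwriting `city_state`.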

Number of Illinois vs. non-Illinois runners:

In [184]:
results[results.city_state.map(lambda x: x.strip().endswith(', IL') if type(x) == str else False)].place.count()
Out[184]:
17013
In [203]:
results[results.city_state.map(lambda x: not x.strip().endswith(', IL') if type(x) == str else False)].place.count()
Out[203]:
21837

A plurality (almost a majority) of the Illinois runners hail from Chicago.
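The Illinois filter can also be written with pandas' vectorised string methods instead of a `lambda` with a type check. A minimal sketch on made-up rows (only the column names come from the results set):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'city_state': ['Chicago, IL', 'Naperville, IL ', 'Toronto, ON', None],
    'finish_seconds': [16000, 17000, 15000, 15500],
})

# The .str methods skip missing values; na=False treats them as "no match",
# like the `else False` branch of the lambdas above.
has_city = toy.city_state.notnull()
is_illinois = toy.city_state.str.strip().str.endswith(', IL', na=False)

illinois = toy[is_illinois]
# Runners with no city_state are excluded from both groups.
non_illinois = toy[has_city & ~is_illinois]
print(len(illinois), len(non_illinois))
```

Keeping `has_city` explicit preserves the behaviour of the original cells, where runners without a `city_state` count as neither Illinois nor non-Illinois.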

Mean finish times for Illinois runners vs non-Illinois runners:

In [205]:
is_il = results.city_state.map(lambda x: x.strip().endswith(', IL') if type(x) == str else False)
is_non_il = results.city_state.map(lambda x: not x.strip().endswith(', IL') if type(x) == str else False)
print("Mean finish time for IL: " + seconds_to_timestring(results[is_il].finish_seconds.mean()))
print("Mean finish time for non-IL: " + seconds_to_timestring(results[is_non_il].finish_seconds.mean()))
Mean finish time for IL: 4:42:41.9
Mean finish time for non-IL: 4:24:35.8

This seems to show that those travelling to the race were, on average, considerably faster (by about 18 minutes) than those who didn't have to travel as far. One possible interpretation is that those who have taken on such a long journey may be more aggressively goal-oriented when it comes to the race than others.

Average finish time by country: (CAN)

In [207]:
# TODO: PC: Extract country from name_location and put it into a separate column in results.
regional_results = results[results.name_location.map(lambda x: '(CAN)' in x if type(x) != float else False)]
# Reuse the filtered frame instead of repeating the filter.
print(regional_results.finish_seconds.count())
seconds_to_timestring(regional_results.finish_seconds.mean())
1419
Out[207]:
'4:09:56.3'

The 1,419 Canadian finishers had a mean finish time of 4:09:56, well below (that is, faster than) the aggregate average.
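The TODO in the cell above could be approached with a regular expression. The sketch below assumes the country always appears as a three-letter upper-case code in parentheses at the end of `name_location`, as in the `(CAN)` sample result. That pattern is an assumption and has not been checked against the whole data set. All rows apart from the sample result are made up.

```python
import pandas as pd

# First row is the sample result shown earlier; the rest are made up.
toy = pd.DataFrame({
    'name_location': ['Chng, Peter (CAN)', 'Doe, Jane (USA)', 'Roe, Sam', None],
    'finish_seconds': [10522, 15000, 16000, 17000],
})

# Capture a three-letter code in parentheses at the end of the string.
# Rows without such a suffix (or with a missing name) get NaN.
toy['country'] = toy.name_location.str.extract(r'\(([A-Z]{3})\)\s*$', expand=False)

# Rows with a NaN country are left out of the groups.
mean_by_country = toy.groupby('country').finish_seconds.mean()
print(mean_by_country)
```

With a `country` column in place, the per-country means no longer need one hand-written filter per country.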

Positive-Negative differences with finish time limits:

In [122]:
plot_split_differences(threshold=0, time_limit=3*3600)
Out[122]:
<matplotlib.axes.AxesSubplot at 0x1b70a390>
In [123]:
plot_split_differences(threshold=0, time_limit=3.5*3600)
Out[123]:
<matplotlib.axes.AxesSubplot at 0x1b8d9780>
In [209]:
plot_split_differences(threshold=0, time_limit=4*3600)
Out[209]:
<matplotlib.axes.AxesSubplot at 0x1965b438>

This seems to show that if only runners at or below a certain finish time limit (3:00, 3:30, 4:00) are considered, the difference between the positive-split and even/negative-split mean finish times diminishes. This may be because imposing a time limit eliminates the slow outliers that greatly skew the average.

Positive-Negative split ratios as a function of threshold time

The above results got me thinking: How does the ratio of positive-split mean finish time over even/negative-split mean time change as the threshold time changes? The following graph will give some idea:

In [253]:
# 5-min bin graph of positive split mean/even-negative split mean vs. time-limit.
def calc_positive_negative_ratio(threshold, time_limit):
    even_neg_split = results[(results.split_diff_seconds <= threshold) & (results.finish_seconds <= time_limit)]
    pos_split = results[(results.split_diff_seconds > threshold) & (results.finish_seconds <= time_limit)]
    
    return pos_split.finish_seconds.mean()/even_neg_split.finish_seconds.mean()
    # Additional values that could be returned alongside the ratio:
    #     seconds_to_timestring(even_neg_split.finish_seconds.mean()), even_neg_split.finish_seconds.count(),
    #     seconds_to_timestring(pos_split.finish_seconds.mean()), pos_split.finish_seconds.count()

pos_neg_split_ratios = [calc_positive_negative_ratio(0, t) for t in bins]
pos_neg_s = pd.Series(pos_neg_split_ratios, index=bins)

pylab.xticks(rotation='vertical')
pylab.title('Pos/Neg mean time ratio as a function of maximum finishing time')
ax = pos_neg_s.plot()
ax.yaxis.set_label_text('Ratio of positive-split mean time\n to negative-split mean time')
ax.set_xlim([7200, 7200+6*3600])
ax.xaxis.set_major_locator(MultipleLocator(300))
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
ax.xaxis.set_label_text('Maximum finishing time limit/threshold')
Out[253]:
<matplotlib.text.Text at 0x2479f908>

This graph is a little tricky to understand. The X-axis is the finishing time limit: only finishers with a time at or below this limit are considered when calculating the ratio (e.g. sub-3:00, sub-3:05, sub-3:10).

The ratio itself is the mean finish time of positive-split runners divided by the mean finish time of even/negative-split runners, but only for runners whose finish time was at or below the limit.

When the ratio is < 1, positive-split runners were, on average, faster than even/negative-split runners. When the ratio is near 1, the two groups had roughly the same mean finish time. When the ratio is > 1, positive-split runners were slower than even/negative-split runners.

What this graph shows is that:

  • For time limits below 3:00, positive-split runners were faster on average than even/negative-split runners. (There are fewer runners to compare here, so the ratio fluctuates more.)
  • The ratio is roughly 1 for time limits from 3:00 to 4:00, meaning that among runners finishing within those limits, the positive-split and even/negative-split means were about the same.
  • For time limits above 4 hours, even/negative-split runners are faster on average than positive-split runners.

Note that this graph does the calculation considering all runners equal to or faster than the given time, not those in between two times.
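To make the cumulative nature of this calculation concrete, here is a tiny worked example in plain Python. The six runners are made up, and `cumulative_ratio` is a hypothetical stand-in for `calc_positive_negative_ratio` that works on a list of tuples instead of the `results` frame.

```python
# Made-up (finish_seconds, split_diff_seconds) pairs for illustration only.
runners = [
    (10000, 120), (10400, -20),    # two sub-3:00 finishers
    (13000, 300), (13200, -60),    # two finishers around 3:40
    (17000, 1500), (19000, 2400),  # two slow, strongly positive splits
]

def mean(values):
    return sum(values) / float(len(values))

def cumulative_ratio(runners, time_limit, threshold=0):
    """Positive-split mean over even/negative-split mean, using every
    runner who finished at or below time_limit."""
    eligible = [(f, d) for f, d in runners if f <= time_limit]
    pos = [f for f, d in eligible if d > threshold]
    even_neg = [f for f, d in eligible if d <= threshold]
    return mean(pos) / mean(even_neg)

for limit in (3 * 3600, 4 * 3600, 6 * 3600):
    print(limit, round(cumulative_ratio(runners, limit), 3))
```

In this toy set the ratio is below 1 for the 3:00 and 4:00 limits and rises to 1.25 at 6:00, purely because the two slow positive-split runners join the positive-split group while the even/negative group gains nobody.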

Positive-Negative split ratios between two times

In [261]:
def calc_positive_negative_ratio_between_times(threshold, time_limit, period):
    even_neg_split = results[
        (results.split_diff_seconds <= threshold) & (results.finish_seconds > (time_limit - period)) & (results.finish_seconds <= time_limit)
    ]
    pos_split = results[
        (results.split_diff_seconds > threshold) & (results.finish_seconds > (time_limit - period)) & (results.finish_seconds <= time_limit)
    ]
    return pos_split.finish_seconds.mean()/even_neg_split.finish_seconds.mean()

ten_min_bins = [7200 + 600*i for i in range (0, 6*6 + 1)]
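# Use a period equal to the bin spacing (600 seconds) so that each point covers a full 10-minute interval.
pos_neg_split_ratios = [calc_positive_negative_ratio_between_times(0, t, 600) for t in ten_min_bins]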
pos_neg_s = pd.Series(pos_neg_split_ratios, index=ten_min_bins)

pylab.xticks(rotation='vertical')
pylab.title('Pos/Neg mean finish time ratio as a function of finish time bins')
ax = pos_neg_s.plot()
ax.yaxis.set_label_text('Ratio of positive-split mean time\n to negative-split mean time')
ax.set_xlim([7200, 7200+6*3600])
ax.xaxis.set_major_locator(MultipleLocator(600))
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(time_ticks))
ax.xaxis.set_label_text('Finishing time bins')
Out[261]:
<matplotlib.text.Text at 0x2663ecc0>

Unlike the previous graph, this one charts the ratio for finishers between two finish times (e.g. between 3:30 and 3:40) and considers only those finishers.

This shows that the positive-negative split mean finish time ratio starts out low and settles around 1 past 3:00:00. The previous graph skewed upward because it was a function of the maximum finishing time only, while this one limits finishers to those in each 10-minute interval.

When plotting against a maximum finishing time only, instead of limiting to an {upper, lower} bound, large and small finish times skew the means, which is why the first graph grew: the mean negative-split time remained low because there are fewer negative splitters to add as the maximum finishing time grows, while the mean positive-split time grew faster because there were more positive splitters to add.
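One thing worth keeping in mind when reading the binned graph: every finisher in a bin lies within one bin width of every other, so the ratio of the two group means cannot stray far from 1, whatever the split behaviour. The sketch below computes that bound. The helper name `binned_ratio_bounds` is hypothetical, and the 600-second period is an assumption based on the 10-minute intervals described above.

```python
def binned_ratio_bounds(time_limit, period):
    """Limits on the pos/neg mean ratio when every runner finishes in the
    interval (time_limit - period, time_limit]. Both group means lie in
    that interval, so the ratio lies between these two values."""
    lower_edge = float(time_limit - period)
    return lower_edge / time_limit, time_limit / lower_edge

# Bins ending at 3:00, 4:00 and 6:00, assuming a 10-minute (600 s) period.
for limit in (3 * 3600, 4 * 3600, 6 * 3600):
    low, high = binned_ratio_bounds(limit, 600)
    print(limit, round(low, 3), round(high, 3))
```

For the bin ending at 3:00 the ratio is confined to roughly 0.944 to 1.059, and the band narrows for slower bins, so the direction of the deviation from 1 in the binned graph is more informative than its size.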