By Peter Chng

I've ran the Boston Marathon for the past five years (2011-2015, inclusive) and wanted to get an idea for how the participants and their finishing times had changed across those years.

To do so, I wrote a script to download all the results from `http://boston.r.mikatiming.de/{year}`

and aggregated them into a single CSV file. I then loaded that CSV file into Pandas to analyze the data.

Because Boston has qualifying times that limit registration to only faster runners, this race tends to be much faster than other similar-sized marathons.

In [1]:

```
import time
import datetime
import scipy.stats as spstats
import pandas as pd
%pylab inline
#%matplotlib inline
# Default plot styles.
figsize(15, 5) # Set default figure size.
# Set default colours/styles.
# Requires matplotlib >= 1.4.
plt.style.use('ggplot')
matplotlib.rcParams.update({
'font.family': 'Arial',
'axes.titlesize': 20,
})
```

In [2]:

```
# Aggregate results taken from: http://boston.r.mikatiming.de/{year}
# Some data wrangling had to be done to get all results into the same format.
results = pd.read_csv('boston-marathon-total.csv')
```

Each row corresponds to a finishing time for a runner for a particular year. In addition to the net finishing time, you also get the gun finishing time, first half split time, {overall, gender, division} finishing place, gender, age group and bib number of the runner.

The age groups generally line up with the age groups used for Boston qualifying times, with the exception of the 18-39 group, which actually contains two BQ age groups: 18-34 and 35-39. I believe the reason for this was the distinction between "open" and "masters" age groups typical at most races. (Masters competitors usually start at age 40)

This data allows us to take a look some interesting properties of runners of the Boston Marathon.

In [3]:

```
# Example of what the data looks like:
results[results.name_location.str.contains('Chng')]
```

Out[3]:

Here are the number of results I was able to pull from the website. For some reason, they are all different than the number of finishers listed in the official statistics. (Take, for example, the 2015 statistics) I don't know why this is the case, but the differences are likely small enough to be negligible.

2013 had the smallest number of finishers in recent history, because the bombing/terrorist attack at the finish line prevented many from finishing.

2014 was the largest year because the field size was greatly expanded to accomodate the runners who were not able to complete following the terrorist attack in 2013. Interest remained high in 2015, and so the field size remained higher than in prior years.

2012 had a smaller number of finishers than 2011 because of the high temperatures that year. This reduced the number of finishers in two ways:

- The B.A.A. granted people the ability to defer to 2013, provided they had picked up their bib, so many did not even start.
- The heat caused a slightly lower finishing rate.

In [4]:

```
results.groupby('year').size().plot(kind='bar', fontsize=16, title='Number of finishers each year')
results.groupby('year').size()
```

Out[4]:

Because the name and country of the runner were combined into one field, we'll extract those into separate fields. We'll also convert the string representation of finishing time (`HH:MM:SS`

) into a `pandas.tslib.Timedelta`

object, and at the same time, compute the first and second half splits. This will allow us to analyze the finishing times better.

In [5]:

```
# Data wrangling/data fixing to make the data more useful.
# Extract the name from name_location. (Could have done this in the scraper script, but whatever)
results['name']= results.name_location.map(lambda x : x.rsplit(' ', 1)[0])
# Fix country column, was not extracted properly by the scraper script.
# TODO: PC: Anyway to do this with a Series.str.<method> operation? Series.str.slice() doesn't take non-scalars!
results.country = results.name_location.map(lambda s : s[s.rfind('(') + 1: s.rfind(')')])
# Convert half_split and finish_net to timedelta objects.
def timestring_to_timedelta(time_string):
if type(time_string) != str:
return time_string
parts = time_string.split(':')
return datetime.timedelta(seconds=int(parts[0])*3600 + int(parts[1])*60 + int(parts[2]))
# Compute first half, second half and finish times as timedelta objects.
results['first_half'] = results.half_split.map(timestring_to_timedelta)
results['finish'] = results.finish_net.map(timestring_to_timedelta) # Based on net finish time.
results['second_half'] = results.finish - results.first_half
# List of years covered.
years = results.year.unique()
```

In [6]:

```
print 'Results with missing half_split: {}'.format(results[pd.isnull(results.half_split)].shape[0])
```

There were 186 results with missing half splits, but none with missing finishing times. This is sort of self-evident, as the results posted on the website **only** included those with finishing times, so the total numbers don't necessarily match up with the official BAA statistics. However, the discrepancies were minor.

For some reason, there were no bib numbers for 2011.

Let's look at some broad statistics between each year. Keep in mind that 2013 was the year of the bombing/terrorist attack, and according to the official statistics, only 17,600 runners were allowed to complete, the lowest (75.4%) finishing rate in recent history. This acted as an artificial limit, and as we'll see, this reflects in the major differences in aggregate statistics between 2013 and the other years, making any comparison with 2013 less useful for drawing any inferences.

For this, I'm using the number of finishers (male, female) as a proxy for the number that entered. **This is not entirely accurate**, as the number that entered, the number that started, and the number that finished are all distinct values. The definition of each is:

**Entered**: Registered and accepted with a BQ, and paid registration fee. (Presumably)**Started**: Showed up, picked up bib, crossed the starting line in Hopkinton. (Again, presumably)**Finished**: Crossed the finish line in Boston.

Each subsequent value is less than or equal to the former. The exact relation is:

```
Entered >= Started >= Finished
```

As an example, see the 2011 Boston Marathon official statistics.

Because I'm using the number of finishers (my results only capture finishers), the exact ratios of male/female and other demographics may not be correct, if certain groups were more predisposed to not finishing/not starting.

In [7]:

```
male_female = results.groupby(['year']).gender.value_counts().unstack()
male_female_normed = male_female.div(male_female.sum(1), axis=0)
ax = male_female_normed.plot(kind='barh', stacked=True, fontsize=16, title='Male/Female Ratio')
ax.xaxis.set_major_locator(MultipleLocator(0.05))
ax.invert_yaxis()
# Show raw data
male_female_normed.applymap(lambda x: '{:.2%}'.format(x))
```

Out[7]:

For 2014 and 2015 the male/female ratio was almost identical at 55%/45%. This contrasts with 2011 and 2012, where it was more like M/F 58%/42%.

2013 had a far higher percentage of males than females because the the course was shutdown far earlier than normal following the bombing at the finish line. Since females, on average, are slower than males, this created an artificial limit on the number of females who could finish.

It is interesting to note that that the male ratio has seemingly fallen by 2-3% from 2011/2012 to 2014/2015. **Although I don't have enough years' worth of data**, it is interesting to note that the change lines up with the change in qualifying standards that happened between 2012 and 2013. This would seem to indicate that the five-minute across-the-board reduction in qualifying times slightly benefited women by increasing their participation. Looking at more years in the past along with more data in the future would confirm or deny this assertion.

However, the current ratio of 55%/45% (M/F) is still inline with other big marathons that have no qualifying time restrictions, such as the 2013 Chicago Marathon. Other examples:

- The 2015 New York City Marathon (which does have qualifying times, but the vast majority of runners enter through a lottery) had a ratio of 58%/42% (M/F) according to MarathonGuide.com
- The 2015 Chicago Marathon (which does have qualifying times, but is still open to non-qualifiers via lottery) had a ratio of 54%/46% (M/F) according to MarathonGuide.com
- The 2015 Philly Marathon had a ratio of 55%/45% (M/F) according to MarathonGuide.com

Thus, the current BQ standards are "fair" to both males and females in that there is not a marked difference in gender ratios from other big marathons. (An informal survey of MarathonGuide.com results reveals that most marathons are somewhere between 55-60% male)

For this comparison, I deliberately excluded 2013 for reasons mentioned above. This allows us to compare the remaining years, where all runners were allowed a chance to finish.

In [8]:

```
# TODO: PC: Need more colours!
year_age_groups = results[results.year != 2013].groupby(['year', 'gender', 'age_group']).size().unstack().unstack()
year_age_groups_normed = year_age_groups.div(year_age_groups.sum(1), axis=0)
ax = year_age_groups_normed.plot(kind='barh', stacked=True, fontsize=12, figsize=(15, 7),
title='Age/Gender group distribution')
ax.xaxis.set_major_locator(MultipleLocator(0.05))
ax.invert_yaxis()
# Show underlying DataFrame.
year_age_groups_normed.applymap(lambda x: '{:.2%}'.format(x))
```

Out[8]:

Here, we can see a more detailed view of how the 5-minute tightening of BQ standards affected each age group/gender. First, let's go over a summary of the changes that have happened to Boston registration in the past five years:

- For 2011 registration (which happened in 2010), interest was so high that the race filled up in less than one day. Many people were "shut out", because either they presumed they could wait until after work to register, or the registration website was too slow during the day.
- For 2012 registration, the BAA introduced a process by which they would essentially accept all registrations for a period of time, and then only accept the fastest (runners who had beaten their BQ by the largest amount) up until the race had filled. BQ times were not changed yet, for various reasons.
- For 2013 and onwards, BQ times were reduced by 5 minutes for every age group/gender; the new registration process was also retained.

These changes seem to have reduced the percentage of various groups in the Boston Marathon. For instance, the M18-39 group is noticeably smaller in 2014/2015 after remaining almost constant in 2011/2012.

In particular, **the tighter qualifying times seemed to reduce the percentage size of all male age groups up to and including 50-54**, after which the change was too small or actually in favour of the new BQ times. It should be noted that the older age groups have far fewer participants, so it's harder to draw conclusions since various fluctuations can affect the percentage much more. Again, I'm only looking at two years' worth of data (two before and two after the change), so take this all with a grain of salt.

It's important to note that I am not arguing for or against any of the changes the BAA has made, just observing changes that *may* have happened as a result of it. Whether you are for/against/indifferent to the changes depends on your point-of-view.

In [9]:

```
age_group = results[results.year != 2013].groupby(['year', 'age_group']).size().unstack()
age_group_normed = age_group.div(age_group.sum(1), axis=0)
age_group_normed.plot(kind='barh', stacked=True, title='Age Group Percentages')
age_group_normed.applymap(lambda x: '{:.2%}'.format(x))
```

Out[9]:

At first glance, the 18-39 age group is clearly the largest, accounting for >40% of the entire field across all years. However, this is expected since this age group spans the most number of years. With the exception of the 80+ age group, all other age groups span only five years. (I don't know why Boston defines age groups this way, but it's probably because of the division between the 'open' and 'masters' fields, with master-age competitors typically beginning at age 40)

**There is no way to properly breakdown the 18-39 age group without actually knowing the runners' ages**, but a crude approximation would be to note that this age group spans 22 years, and that this is 4.4 times the size of the five-year age groups. Assuming a uniform distribution within 18-39 (likely not the case), we "normalize" this age group by dividing by 4.4, yielding a **normalized result of 9-9.5%**.

This would mean that the **largest five-year age group, by far, is either 40-44 or 45-49**. You can draw your own conclusions as to why this might be the case. My personal explanation is that younger folks have other things on their mind, and that the training and recovery necessary for getting a BQ is tougher at older ages.

The 40-49 age range seems to be the "sweet spot" where these folks have the {time, interest, money} to train and qualify for Boston, while not being old enough to make it too hard.

Almost all races have a huge local bias, and Boston is no different, with over 80% of all runners over the past five years being from the USA.

Canada and the United Kingdom (which is abbreviated as GBR) are consistently second and third. The distribution over the remainder of the countries appears to follow some power-law distribution, with the amounts getting progressively smaller, though I didn't bother to confirm this.

In [10]:

```
country_counts_sorted = results.groupby('country').size().sort_values(ascending=False)
def top_n_group_other(s, n, label='others'):
top_n = s.sort_values(ascending=False)[:n]
top_n[label] = s[n:].sum()
return top_n
n_countries = 5
top_n_group_other(country_counts_sorted, n_countries).plot(
kind='pie', autopct='%.2f %%', figsize=(15, 15), colormap='coolwarm', shadow=True, legend=True,
fontsize=14,title='Top {} nationalities represented at Boston'.format(n_countries))
```

Out[10]:

In [11]:

```
# Non-US countries: Does this follow some sort of power-trend law, i.e. the decrease between subsequent numbers?
top_n_group_other(country_counts_sorted[1:], 15).plot(
kind='pie', autopct='%.2f %%', figsize=(15, 15), colormap='coolwarm', shadow=True, legend=True,
fontsize=12, title='Top nationalities excluding USA at Boston')
```

Out[11]:

Finishing times are always an interesting topic, not only because they tend to be the main goal of most runners, but also because they provide a nice insight into how changing conditions and demographics between the years might affect overall performance.

In [12]:

```
# Pandas doesn't support aggregate operations (mean(), etc.) on GroupBy objects for Timedelta/timedelta64[ns]:
# https://github.com/pydata/pandas/issues/5724
year_describe_comparison = results.groupby(['year'])['finish']\
.describe(percentiles=[0.10, 0.50]).unstack()\
.reindex(columns=['count', 'mean', '50%', '10%', 'std', 'min', 'max'])\
.rename(columns={'50%': 'median'})
ax = year_describe_comparison.drop(['count', 'max', 'min'], axis=1)\
.applymap(lambda t: t/np.timedelta64(1, 's') if isinstance(t, pd.tslib.Timedelta) else t)\
.plot(kind='bar', title='Finishing time comparison across years', fontsize=16, rot=0)
ax.yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, pos: str(datetime.timedelta(seconds=x))))
ax.yaxis.set_major_locator(MultipleLocator(30*60))
ax.set_xlabel('')
# Show dataframe for reference
year_describe_comparison.applymap(lambda t: t.to_pytimedelta() if isinstance(t, pd.tslib.Timedelta) else t)
```

Out[12]:

In ideal years, the mean finishing time in Boston is in the high 3:40's. For comparison, the Chicago Marathon (comparable in size), considered "fast and flat", has had mean finishing times in the 4:30's for the past few years. A ~40-50 minute difference in mean times is quite a big difference. Thus, it's no surprise that Boston itself is among the marathons with the highest percentage of Boston qualifying times.

Interestingly, the **10% percentile time appears to be around one standard deviation below the median** during most years, except for 2013.

The fastest year was 2013, where, because of the bombing/terrorist attack at the finish line, only the fastest runners were permitted to finish.

Outside of this year, the fastest was 2015 in my five-year sample set. This is interesting, because there was a considerable amount of rain and crosswinds during that year. It didn't seem to have an effect on most runners, myself included. (2015 was my fastest Boston) In particular, it was considerably faster than 2014, a year where weather conditions were considered ideal, but may have been on the warm side for many, rising to a high of 62F/17C.

2012 was by far the slowest year, owing to the heat during that year.

In almost all years, the median was faster than the average/mean, except for 2013, where only the faster runners were permitted to finish. This is likely to there being more "slower" runners than "faster" runners, creating an asymmetric distribution.

I've also shown the fastest 10th percentile time; this is the time that the fastest 10% of runners met, or in another words, if you ran this time or faster, you were in the top 10% of all finishers. As you can see, it's always been around the 3-hour mark, except for the hot 2012 year. This is pretty significant, when you consider that in most other marathon races without qualifying times, running a sub-3 will get you into the top 2-3%. This gives you an ideal for just how competitive the overall field is in Boston.

In [13]:

```
increase_2014_2015_mean = (year_describe_comparison.ix[2014]['mean'] / year_describe_comparison.ix[2015]['mean'] - 1)
increase_2014_2015_std = (year_describe_comparison.ix[2014]['std'] / year_describe_comparison.ix[2015]['std'] - 1)
print '2014-2015 increase in mean={:.2%}, std={:.2%}'.format(increase_2014_2015_mean, increase_2014_2015_std)
```

Obviously, heat has a detrimental effect on running pace; that should be no surprise to any serious runner. What's more interesting, however, is that it appears to affect runners differently, at least in the Boston Marathon.

Now, mind you, I have only five years' worth of data, so that is likely not enough to draw any conclusions, but it's still worthwhile to note that the two hottest years (2012 and 2014, where the highs were 87F/31C and 62F/17C, respectively) had **the highest standard deviation (48:43 and 50:52) in finishing times.**

In cooler years, the standard deviation was much lower. The increased standard deviation in hotter years is not simply explained by each runner slowing down by a percentage of their "ideal" finishing time, as that would result in the mean and standard deviation both increasing by the same percentage, but this did not occur.

For example, let's compare 2015 (cool year) with 2014 (a warmer year): 2014 had a mean that was **7.08%** higher and a standard deviation that was **25.52%** higher than 2015.

The exact relation between finishing times and heat is difficult to determine from the data I have, as there are other variables at play, including (but not limited to), other weather conditions, BQ cut-off times for the year, etc. that also affect finishing times. But it's clear to me that it's not simply a case of each runner slowing down by a fixed percentage off of their "ideal" time.

Instead, it's likely that slower runners are affected more, perhaps because they are out on the course for a longer time and during the hotter part of the day. In particular, it's worthwhile to note that Boston starting times are done in stages/waves, with slower runners starting progressively later in the day, when it's likely hotter. (The start is also later than most marathons)

This may not be obvious (at least it wasn't to me), so we'll do a quick explanation.

Suppose the "ideal" distribution of finishing times, $x$, has a mean and variance for the entire population given by: (Standard deviation would just be the square root of variance)

$$ \bar{x} = {1 \over N} \sum_{i=1}^{N} x_i $$ $$ s_x^2 = {1 \over N} \sum_{i=1}^{N} (x_i - \bar{x})^2 $$

Now, suppose all finishing times are slowed by a fixed percentage (say, because of heat) to create a new distribution, $y$, where $y = ax$ and $a > 1$. Then the new population mean is:

$$ \bar{y} = {1 \over N} \sum_{i=1}^{N} ax_i $$ $$ \bar{y} = {a \over N} \sum_{i=1}^{N} x_i $$ $$ \bar{y} = a\bar{x} $$

And the new population variance is:

$$ s_y^2 = {1 \over N} \sum_{i=1}^{N} (ax_i - \bar{y})^2 = {1 \over N} \sum_{i=1}^{N} (ax_i - a\bar{x})^2 $$ $$ s_y^2 = {1 \over N} \sum_{i=1}^{N} (a^2x_i^2 - 2a^2x_i\bar{x} + a^2\bar{x}^2) $$ $$ s_y^2 = {a^2 \over N} \sum_{i=1}^{N} (x_i^2 - 2x_i\bar{x} + \bar{x}^2) $$ $$ s_y^2 = {a^2 \over N} \sum_{i=1}^{N} (x_i - \bar{x})^2 $$ $$ s_y^2 = a^2s_x^2 $$

Since standard deviation is just the square root of the variance:

$$ s_y = \sqrt{a^2s_x^2} = as_x $$

This shows that both the population mean and standard deviation would be scaled by the same amount if all runners slowed by a fixed percentage.

Since this did not occur, and instead the standard deviation increased by a greater percentage than the mean, we conclude that heat does not slow down all runners by the same percentage.

Looking at the distribution/histogram of finishing times often gives a more useful insight than just looking at the mean and standard deviation.

Note that the 2013 distribution is not directly comparable to other years as not all runners were allowed to finish. The course would have been shut down sometime before 3:00 PM local time, and since the last wave started at 11:15 AM, this probably had an impact on the distribution after the 3:30 mark; thus the distribution is probably comparably up to, but not after, the 3:30 time.

In [14]:

```
bins = [7200 + i *5*60 for i in range(5*12)]
results['raw_finish'] = results.finish.map(lambda t: t/np.timedelta64(1, 's'))
for year in years:
pyplot.subplots()
ax = results[results.year == year].raw_finish.hist(
bins=bins, xrot=90, xlabelsize=14, ylabelsize=14, color=(0, 0.4, 0.8))
ax.set_title('{} Boston Marathon Finishing Times'.format(year))
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, pos: str(datetime.timedelta(seconds=x))))
ax.xaxis.set_major_locator(MultipleLocator(10*60))
# Highlight {2:55-3:00, 3:25-3:30 3:55-4:00} bins.
ax.patches[11].set_facecolor('r')
ax.patches[17].set_facecolor('m')
ax.patches[23].set_facecolor('c')
```

I've chosen to highlight the 3:00, 3:30 and 4:00 marks in all of the distributions. Some notes on this:

- The sub-3 bucket (2:55-3:00, highlighted in red) is only bigger than its neighbours during a fast/ideal weather condition year. 2015 was a fast year, and 2013 would have probably turned out to be a very fast year had the course not been shutdown early.
- The mode (most common bucket) usually occurs somewhere between sub-3:30 to sub-3:40. One exception to this was the hot 2012 year, where the mode was pushed back to sub-4:00. (3:55-4:00)
- Across all years looked at here, there was always a sharp drop-off after the 4:00 mark, though the effect is less in the hot 2012 year.
- The right tail of the distribution is always longer than the left tail. Qualitatively, this means there are more "slow" runners than "fast" runners. (This is positive skew)
- 2014 was slower than 2015, but also had many more runners. This was because those who were prevented from finishing in 2013 were allowed to re-enter, without having to obtain another BQ time. This may have drawn the right tail out more.
- There's some evidence of people running to specific goal times, which causes the distribution to be jagged. An example is the 2:55-3:00 bin in 2015; the bins directly adjacent to it are both smaller, which isn't really expected if finishing times reflected pure ability. I also found this effect when analyzing the 2013 Chicago Marathon.

High temperatures appears to have at least two effects on the distribution:

- The mean/median/mode gets pushed back. You could visualize this as a slight rightward shift in the distribution.
- More importantly, however, is that the degree of positive skew appears to rise with how "tough" the race was that year. This results in the right tail being drawn out more, and is likely the major factor affecting the change in the mean.
- This is also the source of the increased standard deviation in finishing times that we observed earlier.

Looking at finishing times only tells us half of the story. The other half comes from the half-split times obtained on the way to the finish. (Pun intended)

In [15]:

```
bins = [3000 + i *5*60 for i in range(3*12)]
results['raw_half'] = results.first_half.map(lambda t: t/np.timedelta64(1, 's'))
for year in years:
pyplot.subplots()
ax = results[results.year == year].raw_half.hist(
bins=bins, xrot=90, xlabelsize=14, ylabelsize=14, color=(0, 0.4, 0.8))
ax.set_title('{} Boston Marathon Half Splits'.format(year))
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, pos: str(datetime.timedelta(seconds=x))))
ax.xaxis.set_major_locator(MultipleLocator(5*60))
# Highlight {1:30, 1:45, 2:00} bins.
ax.patches[7].set_facecolor('r')
ax.patches[10].set_facecolor('m')
ax.patches[13].set_facecolor('c')
```

I've highlighted the 1:30, 1:45 and 2:00 marks in the above distributions, to align with projected 3:00, 3:30 and 4:00 finishes.

The most striking feature of the histogram/distribution of half times is that they are much more "smooth" than the overall finishing times. What I mean by "smooth" is that **the bins are always decreasing in value as you move away from the mode**, or the highest bin.

This is in sharp contrast to the distribution of finishing times, where there are sometimes local peaks/modes that correspond to specific goal times that runners were aiming for, i.e. a sub-3 finish. This behaviour has a clear effect on the distribution of finishing times, causing it to reflect runners' goals rather than perhaps their all-out ability.

By contrast, the half-split times look more "natural", but there are still some interesting results: In most years, a lot of runners try to make it under the 1:30 mark at the half, resulting in this bin perhaps being larger than it "should" be. This presumably reflects their goal of running a sub-3-hour marathon.

*Suggestion by no_other_plans*

**Note:** *You can safely skip this section if you don't care about statistics.*

An interesting question is: **Are marathon finishing times normally-distributed?** Some other research has pointed towards this, but based on my own observations, I was skeptical of this.

However, I was able to find at least one paper supported the idea that marathon finishing times were normally distributed. El Helou et al. looked at 60 marathon races in their study, *Impact of Environmental Parameters on Marathon Running Performance*, and found:

*"All performances per year and race are normally distributed with distribution parameters (mean and standard deviation) that differ according to environmental factors."**"For each race and each gender every year, we fitted the Normal and log-Normal distributions to the performances and tested the normality and log normality using the Kolmogorov-Smirnov D statistic."**"For all 60 studied races, the women and men's performance distributions were a good approximation of the “log normal” and “normal” distributions (p-values of Kolmogorov-Smirnov statistics ≥0.01)."*

I decided to see if finishing times across the sub-populations of {men, women} and individual age groups (such as M18-39, F40-44, etc.) were normally or log-normally distributed.

In [16]:

```
# Quick sanity check: Sample 1000 numbers from a normal distribution and calculate the test statistic.
# Should produce a p-value > 0.05.
spstats.mstats.normaltest(np.random.normal(10.0, 1.0, size=1000)) # Mean is 10.0, Std. Dev is 1.0
```

Out[16]:

The distribution of male and female finishing times did not appear to be normally distributed.

In [17]:

```
min_samples = 200
# raw_finish is the finishing time in seconds; this is the sample we wish to test for normality.
results.groupby(['year', 'gender']).raw_finish\
.apply(lambda x: (len(x), (spstats.mstats.normaltest(x)[1])) if len(x) >= min_samples else NaN).unstack()
```

Out[17]:

The p-value calculations were obtained using `scipy.stats.mstats.normaltest`

, which does some super-special-elite-black-ops-ninja normality test based on D'Agostino and Pearson's work. I have no idea whether this normality test is appropriate.

The first number in each cell is the number of samples, and the second is the two-sided p-value from the normality test. So, for example, in 2015 the male age group `n=13806`

and `p=0.0`

.

The p-value can be interpreted as the answer to this question: Assuming that the distribution is normal (the null hypothesis), what is the probability that the observed/actual distribution would have occurred? If the p-value is sufficiently low (usually either `p < 0.05`

or `p < 0.01`

) then you conclude the distribution was not normal.

Since the p-values were all extremely small, none of the finishing time distributions for male/female across any of the years I looked at were normally distributed.

Visually looking at the distribution of finishing times for a sub-population (such as M18-39, which I've done below) also reveals that the times don't appear to be normally distributed.

If not normal, is the distribution log-normal? It does not appear so either, as the p-values are still very small.

In [18]:

```
# Apply numpy.log() to the raw_finish distribution and check if the resultant distribution is normal.
results.groupby(['year', 'gender']).raw_finish\
.apply(lambda x: (len(x), (spstats.mstats.normaltest(np.log(x))[1])) if len(x) >= min_samples else NaN).unstack()
```

Out[18]:

I don't know exactly why my result differ so much from *El Helou et al.*, but here are some possibilities:

- They looked at different races, specifically the {Paris, London, Berlin, Boston, Chicago, New York} marathons from 2001-2010, inclusive, for a total of 60 races. Perhaps sometime has changed in the character of marathon runners since then.
- They are normalizing the finishing times somehow. (I didn't note this in the paper, the only partitioning of the data was by each marathon race and gender)
- My methodology or understanding is fatally flawed or I've made some other huge mistake.

After some research, it turns out I was asking the wrong question. The answer to whether run times are "normally distributed" is almost certainly "no", since the normal distribution is not bounded, i.e. the probability only asymptotically approaches zero as x approaches +/- infinity. Clearly run times are bounded, i.e. there is no finishing time above some certain time and no finishing time below ~2 hours.

Instead, the question I should have been asking is: *"Is the normal distribution a reasonably good model for finishing times here?*" At first glance, this might seem like the same question, just re-worded, but the difference turns out to be important.

This article, by Dr. Allen Downey, clearly explains the pitfalls of using just a statistical test to determine if a dataset is drawn from the normal distribution, and instead advocates comparing the CDF of the model to the CDF of the dataset to perform a visual check for how good the model is at representing the data set. So, I'll do that here.

By doing a visual check, we can possibly see ranges for which the normal distribution is a good fit, and ranges for which it is not. Just doing a statistical test and coming up with a p-value doesn't capture this.

I'll break each year's finishing times out into **{male, female}** sub-groups, then compute the **mean** and **standard deviation** for each sub-group. This mean and variance will then be used to plot a normal CDF with the same mean and variance (since a normal distribution is fully defined by mean and standard deviation) and the goodness-of-fit will be assessed visually.

In [19]:

```
# References:
# http://allendowney.blogspot.ca/2013/08/are-my-data-normal.html
# http://allendowney.blogspot.ca/2013/09/how-to-consume-statistical-analysis.html
def plot_cdf_with_norm(sub_group, ax, title, histtype='step', num_norm_samples=100000, lognormal=False):
# Using 1-min bins from 2 hours to 7 hours.
bin_width = 60
start = 7200
range_hours = 5
num_bins= range_hours * (3600/bin_width)
bins = [start + i*bin_width for i in range(num_bins)]
# If plotting against a lognormal model, take the log of the finishing times, then compute the
# {mean, std. dev} of this dataset and use these as the parameters for a normal model.
data = sub_group.raw_finish
if lognormal:
bins = np.log(bins)
data = np.log(data)
# Generate an approximation of normal CDF by taking 100,000 samples from the normal distribution.
# Use the sub-group mean and standard deviation.
norm_model = pd.Series(
np.random.normal(loc=data.mean(), scale=data.std(), size=num_norm_samples))
# Plot CDF of finishing times.
ax.hist(data, bins=bins, cumulative=True,
color=(0, 0.4, 0.8), histtype=histtype, normed=True, label='Data')
ax.set_title(title, size=14)
if not lognormal:
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, pos: str(datetime.timedelta(seconds=x))))
ax.xaxis.set_major_locator(MultipleLocator(10*60))
# Plot CDF of associated normal distribution.
model_title = 'Lognormal Model' if lognormal else 'Normal Model'
ax.hist(norm_model, bins=bins, cumulative=True, histtype=histtype, normed=True, label=model_title)
for tick in ax.xaxis.get_ticklabels():
tick.set_rotation(90)
ax.set_ylim([0, 1.0])
ax.set_xlim(bins[0], bins[-2])
ax.set_xlabel('Finishing time')
if lognormal:
ax.set_xlabel('Finishing time (log s)')
#ax.set_ylabel('CDF')
ax.legend(loc=4)
# Returns slices grouped by the specified keys and a sorted list of tuples of the groups.
def sorted_sub_groups(results, keys):
results_groupby = results.groupby(keys)
# The keys (which are tuples) are not sorted by default.
sorted_keys = sorted(results_groupby.groups.keys())
return results_groupby, sorted_keys
results_year_gender_groupby, groupby_keys = sorted_sub_groups(results, ['year', 'gender'])
fig, axes = pyplot.subplots(nrows=len(results.year.unique()), ncols=len(results.gender.unique()), figsize=(14, 18))
i = 0
for year, gender in groupby_keys:
plot_cdf_with_norm(results_year_gender_groupby.get_group((year, gender)), axes[i/2, i%2],
title='{} Boston Marathon CDF of Finishing Times ({})'.format(year, gender))
i += 1
pyplot.tight_layout()
```

Comparing the normal CDF to the finishing time CDFs, the normal distribution was not a good model for any regular year. The only year for which the normal distribution was decent was 2013, and that was the year where the race was cut short and slower runners were not permitted to finish.

In particular, in all years observed, whether a "fast" or "slow" year, the normal CDF **overestimated** the number of faster runners, **underestimated** the number of "middle-pack" runners and **underestimated** the number of slow runners. This last part is seen in the long right tail of the finishing time distribution graphs seen previously.

Comparisons of the CDF with the lognormal model showed a slightly better fit, but there were still some systematic discrepancies. This was the methodology I used:

- Transform the finishing times by taking the log (natural logarithm) of each value.
- From this transformed dataset, compute the {mean, standard deviation} and use it to generate a normal CDF.
- Compare that normal CDF with the CDF of the transformed data.

In [20]:

```
results_year_gender_groupby, groupby_keys = sorted_sub_groups(results, ['year', 'gender'])
fig, axes = pyplot.subplots(nrows=len(results.year.unique()), ncols=len(results.gender.unique()), figsize=(14, 18))
i = 0
for year, gender in groupby_keys:
plot_cdf_with_norm(results_year_gender_groupby.get_group((year, gender)), axes[i/2, i%2],
title='{} Boston Marathon CDF of Finishing Times ({})'.format(year, gender), lognormal=True)
i += 1
pyplot.tight_layout()
```