IPython Notebook for partial analysis of Lausanne Marathon dataset

In this brief notebook we implement non-parametric tests comparing the age distributions of women and men in the Lausanne Marathon dataset. We apply both the two-sample Kolmogorov-Smirnov test and the Kruskal-Wallis test, a non-parametric analogue of one-way ANOVA. Results (test statistics and p-values) are reported in a table covering both the overall age distribution per sex and the distribution within each category (10 km, half-marathon and marathon).
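To make the Kolmogorov-Smirnov statistic concrete, here is a minimal sketch that computes it by hand as the largest absolute gap between two empirical CDFs and compares it with `scipy.stats.ks_2samp`. The helper name `ks_statistic` and the synthetic ages are assumptions made for illustration only; they are not part of the marathon data.

```python
import numpy as np
from scipy import stats

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs (hypothetical helper)."""
    a = np.sort(np.asarray(sample_a))
    b = np.sort(np.asarray(sample_b))
    # The gap can only change at observed values, so evaluate both ECDFs there
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side='right') / len(a)
    ecdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))

# Synthetic ages, NOT the marathon data
rng = np.random.default_rng(0)
ages_a = rng.normal(40, 10, size=500)
ages_b = rng.normal(37, 10, size=400)

d_manual = ks_statistic(ages_a, ages_b)
d_scipy = stats.ks_2samp(ages_a, ages_b).statistic
print(d_manual, d_scipy)
```

The statistic always lies between 0 and 1: a value near 0 means the two empirical distributions almost coincide, while larger values indicate a wider gap somewhere along the age axis.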

In [1]:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_context('notebook')

In [2]:

data = pd.read_csv('../datasets/lausanne_marathon_2016_stefano.csv')
data.head()

Out[2]:

	Unnamed: 0	cat	sex	rang	nom	an	lieu	temps	retard	pace
0	0	21	M	147	Abaidia Jilani	1966	St-Légier-La Chiésaz	1:45.28,4	25.56,8	4.59
1	1	21	F	81	Abaidia Sandrine	1972	St-Légier	1:49.40,8	24.09,5	5.11
2	2	False	F	33	Abaidia Selma	2006	St-Légier-La Chiésaz	7.12,2	1.36,3	4.48
3	3	21	M	103	Abb Jochen	1948	Ernen	2:50.40,7	1:21.28,7	8.05
4	4	10	M	426	Abbas Dhia	1961	Lausanne	1:13.04,1	38.13,0	7.18

Stats on age and sex

In this section the age of the participants is examined per sex and category (10 km, 21 km, 42 km).

In [3]:

age = 2016-data.an.astype(int)
data['age']=age

In [4]:

# get age of participants and plot its distribution
plt.subplot(1,2,1)
plt.hist(age,bins=20)
plt.gca().set_xlabel('age')
plt.gca().set_ylabel('count')
plt.gcf().set_size_inches(16,6)

Now compare the age distribution for women and men:

In [5]:

age_women = 2016 - data[data.sex=='F'].an.astype(int)
age_men = 2016 - data[data.sex=='M'].an.astype(int)
plt.hist(age_men,edgecolor='blue',fill=False,linewidth=2,label='men',bins=20)
plt.hist(age_women,edgecolor='red',fill=False,linewidth=2,label='women',bins=20)
plt.xlabel('age')
plt.ylabel('count')
plt.gca().set_xlim([0,90])
plt.gcf().set_size_inches(16,6)
plt.legend()

Out[5]:

<matplotlib.legend.Legend at 0xb040a90>

Remove the rows whose category is 'False' (walk and kids' runs):

In [4]:

del data['Unnamed: 0']
data = data[data.cat != 'False']
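The filter above compares `cat` against the string `'False'`, which suggests the column is read as text because it mixes distances with that label (the loop later in the notebook also compares against `'10'`, `'21'` and `'42'`). As a hedged alternative, the sketch below keeps only an explicit list of race categories, which would also drop any other unexpected label. The tiny `toy` table is hypothetical and only mimics the column names.

```python
import pandas as pd

# Hypothetical miniature of the results table (not the real data)
toy = pd.DataFrame({
    'cat': ['21', '21', 'False', '10', '42'],
    'sex': ['M', 'F', 'F', 'M', 'F'],
    'an':  [1966, 1972, 2006, 1961, 1980],
})

# Keep only the timed races; 'False' is a string label here, not a boolean
races = toy[toy['cat'].isin(['10', '21', '42'])]
print(races)
```

Whether an allow-list or the original exclusion is preferable depends on which labels actually occur in the file; checking `data.cat.unique()` first would settle that.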

In [7]:

age_women = 2016 - data[data.sex=='F'].an.astype(int)
age_men = 2016 - data[data.sex=='M'].an.astype(int)
plt.hist(age_men,edgecolor='blue',fill=False,linewidth=2,label='men',bins=20)
plt.hist(age_women,edgecolor='red',fill=False,linewidth=2,label='women',bins=20)
plt.xlabel('age')
plt.ylabel('count')
plt.gca().set_xlim([20,90])
plt.gcf().set_size_inches(16,6)
plt.legend()

Out[7]:

<matplotlib.legend.Legend at 0xaf89080>

The code cell below produces the table of p-values and test statistics. Age is computed with respect to the year 2016. Results are reported both per category and globally.

In [8]:

from scipy import stats

# Test statistics and p-values: first the global comparison, then one entry per category
ks_stats = []
ks_pvalues = []
kw_stats = []
kw_pvalues = []

# Global comparison of men's and women's ages (age = 2016 - birth year)
age_men = 2016 - data[data.sex=='M'].an.astype(int)
age_women = 2016 - data[data.sex=='F'].an.astype(int)
kstest = stats.ks_2samp(age_men, age_women)
ks_stats.append(kstest[0])
ks_pvalues.append(kstest[1])
kwtest = stats.mstats.kruskalwallis(age_men, age_women)
kw_stats.append(kwtest[0])
kw_pvalues.append(kwtest[1])

# Same comparison within each race category (10 km, 21 km, 42 km)
for i in ['10','21','42']:
    subdata = data[data.cat==i]
    sub_men = 2016 - subdata[subdata.sex=='M'].an.astype(int)
    sub_women = 2016 - subdata[subdata.sex=='F'].an.astype(int)
    kstest = stats.ks_2samp(sub_men, sub_women)
    ks_stats.append(kstest[0])
    ks_pvalues.append(kstest[1])
    kwtest = stats.mstats.kruskalwallis(sub_men, sub_women)
    kw_stats.append(kwtest[0])
    kw_pvalues.append(kwtest[1])

# Collect the results in a table indexed by category
test = {'KS stat' : pd.Series(ks_stats),
        'KS p-value' : pd.Series(ks_pvalues),
        'category' : pd.Series(['global','10','21','42']),
        'KW stat' : pd.Series(kw_stats),
        'KW p-value' : pd.Series(kw_pvalues)}
test_result = pd.DataFrame(test)
test_result = test_result.set_index('category')
test_result

Out[8]:

	KS p-value	KS stat	KW p-value	KW stat
category
global	1.010906e-40	0.134037	6.411120e-62	275.726303
10	3.487088e-23	0.142575	3.143475e-35	153.392881
21	2.448399e-08	0.096879	1.339886e-11	45.755268
42	8.376947e-04	0.144672	6.268202e-05	16.019844

Both tests indicate a statistically significant difference between the age distributions of women and men, both globally and within each category: every p-value in the table is below 0.001.
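The per-category table can also be built with a small helper applied to each group. The following is a sketch under stated assumptions, not the notebook's own method: `compare_ages` is a hypothetical name, the data are synthetic, an `age` column is assumed to exist, and `scipy.stats.kruskal` is used as the plain-array counterpart of `stats.mstats.kruskalwallis`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_ages(df):
    """Run both tests on men's vs. women's ages in df (hypothetical helper)."""
    men = df.loc[df['sex'] == 'M', 'age']
    women = df.loc[df['sex'] == 'F', 'age']
    ks = stats.ks_2samp(men, women)
    kw = stats.kruskal(men, women)
    return {'KS stat': ks.statistic, 'KS p-value': ks.pvalue,
            'KW stat': kw.statistic, 'KW p-value': kw.pvalue}

# Synthetic stand-in for the cleaned table, NOT the marathon data
rng = np.random.default_rng(1)
n = 600
toy = pd.DataFrame({
    'cat': rng.choice(['10', '21', '42'], size=n),
    'sex': rng.choice(['M', 'F'], size=n),
})
# Men are drawn a few years older on average, purely for illustration
toy['age'] = np.where(toy['sex'] == 'M',
                      rng.normal(42, 10, size=n),
                      rng.normal(38, 10, size=n)).round().astype(int)

# One row for the whole table, then one row per category
rows = {'global': compare_ages(toy)}
for cat, sub in toy.groupby('cat'):
    rows[cat] = compare_ages(sub)

result = pd.DataFrame.from_dict(rows, orient='index')
result.index.name = 'category'
print(result)
```

Keeping the test logic in one function means the global and per-category rows are guaranteed to be computed the same way, which the duplicated calls in a hand-written loop do not enforce.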