In this brief notebook, non-parametric tests are applied to compare the age distributions of women and men in the Lausanne Marathon dataset. We use both the two-sample Kolmogorov-Smirnov test and the Kruskal-Wallis test, a non-parametric counterpart of one-way ANOVA. Results (test statistics and p-values) are reported in a table covering both the overall age distribution per sex and the distribution within each race category (10 km, half-marathon and marathon).
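As a minimal sketch of what the two tests compute, the cell below runs them on synthetic data rather than on the marathon dataset. The sample names, sizes and distribution parameters are assumptions chosen only for illustration. It also recomputes the Kolmogorov-Smirnov statistic by hand as the largest gap between the two empirical distribution functions, and uses `scipy.stats.kruskal` for the Kruskal-Wallis test.

```python
import numpy as np
from scipy import stats

# Synthetic "ages" (illustrative assumption, not the marathon data):
# two groups whose distributions differ by a small shift in location.
rng = np.random.default_rng(0)
ages_a = rng.normal(loc=40, scale=10, size=2000)
ages_b = rng.normal(loc=43, scale=10, size=2000)

# Two-sample Kolmogorov-Smirnov test: compares the whole distributions.
ks_stat, ks_p = stats.ks_2samp(ages_a, ages_b)

# The KS statistic is the maximum absolute difference between the two
# empirical CDFs, evaluated at every observed value.
pooled = np.concatenate([ages_a, ages_b])
ecdf_a = np.searchsorted(np.sort(ages_a), pooled, side='right') / ages_a.size
ecdf_b = np.searchsorted(np.sort(ages_b), pooled, side='right') / ages_b.size
ks_manual = np.max(np.abs(ecdf_a - ecdf_b))

# Kruskal-Wallis test: rank-based comparison of the groups.
kw_stat, kw_p = stats.kruskal(ages_a, ages_b)

print('KS statistic (scipy): ', ks_stat)
print('KS statistic (manual):', ks_manual)
print('KS p-value:', ks_p)
print('KW statistic:', kw_stat, 'KW p-value:', kw_p)
```

With two groups, the Kruskal-Wallis test mainly reacts to a shift in location, while the Kolmogorov-Smirnov test is sensitive to any difference in the shape of the distributions.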
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context('notebook')
data = pd.read_csv('../datasets/lausanne_marathon_2016_stefano.csv')
data.head()
| Unnamed: 0 | cat | sex | rang | nom | an | lieu | temps | retard | pace |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 21 | M | 147 | Abaidia Jilani | 1966 | St-Légier-La Chiésaz | 1:45.28,4 | 25.56,8 | 4.59 |
1 | 1 | 21 | F | 81 | Abaidia Sandrine | 1972 | St-Légier | 1:49.40,8 | 24.09,5 | 5.11 |
2 | 2 | False | F | 33 | Abaidia Selma | 2006 | St-Légier-La Chiésaz | 7.12,2 | 1.36,3 | 4.48 |
3 | 3 | 21 | M | 103 | Abb Jochen | 1948 | Ernen | 2:50.40,7 | 1:21.28,7 | 8.05 |
4 | 4 | 10 | M | 426 | Abbas Dhia | 1961 | Lausanne | 1:13.04,1 | 38.13,0 | 7.18 |
In this section, the age of the participants is examined per sex and race category (10 km, 21 km, 42 km).
age = 2016-data.an.astype(int)
data['age']=age
# get age of participants and plot its distribution
plt.subplot(1,2,1)
plt.hist(age,bins=20)
plt.gca().set_xlabel('age')
plt.gca().set_ylabel('count')
plt.gcf().set_size_inches(16,6)
Now compare the age distribution for women and men:
age_women = 2016 - data[data.sex=='F'].an.astype(int)
age_men = 2016 - data[data.sex=='M'].an.astype(int)
plt.hist(age_men,edgecolor='blue',fill=False,linewidth=2,label='men',bins=20)
plt.hist(age_women,edgecolor='red',fill=False,linewidth=2,label='women',bins=20)
plt.xlabel('age')
plt.ylabel('count')
plt.gca().set_xlim([0,90])
plt.gcf().set_size_inches(16,6)
plt.legend()
Remove the rows whose category is 'False' (walking and children's races):
del data['Unnamed: 0']
data = data[data.cat != 'False']
age_women = 2016 - data[data.sex=='F'].an.astype(int)
age_men = 2016 - data[data.sex=='M'].an.astype(int)
plt.hist(age_men,edgecolor='blue',fill=False,linewidth=2,label='men',bins=20)
plt.hist(age_women,edgecolor='red',fill=False,linewidth=2,label='women',bins=20)
plt.xlabel('age')
plt.ylabel('count')
plt.gca().set_xlim([20,90])
plt.gcf().set_size_inches(16,6)
plt.legend()
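The filtering step above can be sketched on a small hypothetical frame. The `toy` DataFrame and the `race_cats` list are assumptions made for illustration. The sketch assumes that the `cat` column holds strings, as the comparison with `'False'` in the notebook suggests, and it keeps only the three known race categories instead of excluding `'False'`.

```python
import pandas as pd

# Hypothetical miniature of the dataset (values are illustrative only).
toy = pd.DataFrame({
    'cat': ['21', '21', 'False', '10', '42'],
    'sex': ['M', 'F', 'F', 'M', 'F'],
    'an': [1966, 1972, 2006, 1961, 1980],
})

# Keep only the three race categories; this drops the 'False' rows.
race_cats = ['10', '21', '42']
clean = toy[toy['cat'].isin(race_cats)].copy()

# Age with respect to the race year, stored once as a column.
clean['age'] = 2016 - clean['an'].astype(int)

print(clean)
```

Selecting the categories to keep, rather than the label to drop, also guards against any other unexpected value in `cat`.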
The code cell below produces the table of test statistics and p-values. Age is computed with respect to the year 2016. Results are reported both globally and per category.
from scipy import stats
ks_stats = []
ks_pvalues = []
kw_stats = []
kw_pvalues = []
kstest = stats.ks_2samp(2016-data[data.sex=='M'].an.astype(int),2016-data[data.sex=='F'].an.astype(int))
ks_stats.append(kstest[0])
ks_pvalues.append(kstest[1])
kwtest = stats.mstats.kruskalwallis(2016-data[data.sex=='M'].an.astype(int),2016-data[data.sex=='F'].an.astype(int))
kw_stats.append(kwtest[0])
kw_pvalues.append(kwtest[1])
# repeat both tests within each race category
for i in ['10','21','42']:
    subdata = data[data.cat==i]
    kstest = stats.ks_2samp(2016-subdata[subdata.sex=='M'].an.astype(int),2016-subdata[subdata.sex=='F'].an.astype(int))
    ks_stats.append(kstest[0])
    ks_pvalues.append(kstest[1])
    kwtest = stats.mstats.kruskalwallis(2016-subdata[subdata.sex=='M'].an.astype(int),2016-subdata[subdata.sex=='F'].an.astype(int))
    kw_stats.append(kwtest[0])
    kw_pvalues.append(kwtest[1])
test = {'KS stat' : pd.Series(ks_stats), 'KS p-value' : pd.Series(ks_pvalues), 'category' : pd.Series(['global','10','21','42']), 'KW stat' : pd.Series(kw_stats), 'KW p-value' : pd.Series(kw_pvalues)}
test_result = pd.DataFrame(test)
test_result = test_result.set_index('category')
test_result
category | KS p-value | KS stat | KW p-value | KW stat |
---|---|---|---|---|
global | 1.010906e-40 | 0.134037 | 6.411120e-62 | 275.726303 |
10 | 3.487088e-23 | 0.142575 | 3.143475e-35 | 153.392881 |
21 | 2.448399e-08 | 0.096879 | 1.339886e-11 | 45.755268 |
42 | 8.376947e-04 | 0.144672 | 6.268202e-05 | 16.019844 |
Both tests indicate a statistically significant difference between the age distributions of women and men, both globally and within each category (all p-values are below 0.001).
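The global and per-category computation can also be sketched more compactly with a helper function. The function name `compare_ages` and the synthetic `toy` frame below are assumptions for illustration; the sketch uses `scipy.stats.kruskal` in place of the `stats.mstats.kruskalwallis` call used in the notebook, and the numbers it prints are unrelated to the marathon results.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_ages(df, year=2016):
    """Run both tests on the ages of men vs. women in df (hypothetical helper)."""
    age = year - df['an'].astype(int)
    men = age[df['sex'] == 'M']
    women = age[df['sex'] == 'F']
    ks_stat, ks_p = stats.ks_2samp(men, women)
    kw_stat, kw_p = stats.kruskal(men, women)
    return {'KS stat': ks_stat, 'KS p-value': ks_p,
            'KW stat': kw_stat, 'KW p-value': kw_p}

# Synthetic stand-in for the cleaned dataset (illustrative assumption).
rng = np.random.default_rng(1)
n = 600
toy = pd.DataFrame({
    'cat': rng.choice(['10', '21', '42'], size=n),
    'sex': rng.choice(['M', 'F'], size=n),
    'an': rng.integers(1940, 1997, size=n),
})

# One row for the whole sample, then one row per race category.
labels = ['global']
records = [compare_ages(toy)]
for cat, sub in toy.groupby('cat'):
    labels.append(cat)
    records.append(compare_ages(sub))

toy_result = pd.DataFrame(records, index=pd.Index(labels, name='category'))
print(toy_result)
```

Computing the age inside the helper avoids repeating the `2016 - ...` expression for every test call.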