*Tennis* dataset on the book's website, and extract it in the current directory. (http://ipython-books.github.io)

- Let's import NumPy, pandas, `scipy.stats`, and matplotlib.

In [ ]:

```
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
```

- We load the dataset corresponding to Roger Federer.

In [ ]:

```
player = 'Roger Federer'
filename = "data/{name}.csv".format(
    name=player.replace(' ', '-'))
df = pd.read_csv(filename)
```

In [ ]:

```
print("Number of columns: " + str(len(df.columns)))
df[df.columns[:4]].tail()
```

- Here, we only look at the proportion of points won, and the (relative) number of aces.

In [ ]:

```
npoints = df['player1 total points total']
points = df['player1 total points won'] / npoints
aces = df['player1 aces'] / npoints
```

In [ ]:

```
plt.plot(points, aces, '.');
plt.xlabel('Proportion of points won');
plt.ylabel('Proportion of aces');
plt.xlim(0., 1.);
plt.ylim(0.);
```

- We create a new `DataFrame` with only those fields (note that this step is not compulsory). We also remove the rows where one field is missing.

In [ ]:

```
df_bis = pd.DataFrame({'points': points,
                       'aces': aces}).dropna()
df_bis.tail()
```

In [ ]:

```
df_bis.corr()
```
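By default, `DataFrame.corr()` returns the matrix of pairwise Pearson correlation coefficients. As a minimal sketch of what that number means, the block below recomputes one entry by hand. The `toy` frame and its values are made up for illustration and are not taken from the tennis dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the tennis dataset.
toy = pd.DataFrame({'points': [0.50, 0.55, 0.60, 0.52, 0.58],
                    'aces': [0.04, 0.06, 0.09, 0.05, 0.07]})

# Pearson coefficient: covariance divided by the product of the
# standard deviations (the normalization factors cancel out).
x = toy['points'] - toy['points'].mean()
y = toy['aces'] - toy['aces'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The off-diagonal entry of corr() is the same number.
r_pandas = toy.corr().loc['points', 'aces']
print(r_manual, r_pandas)
```

The coefficient alone measures the strength of a linear relationship; it does not say whether that relationship is statistically significant.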

- Now, to determine whether there is a *statistically significant* correlation between the variables, we use a **chi-square test of independence of variables in a contingency table**.
- First, we need to get binary variables (here, whether the proportion of points won or the proportion of aces is greater than its median). For example, the value corresponding to the aces is True if the player serves more aces than usual in a match, and False otherwise.

In [ ]:

```
df_bis['result'] = df_bis['points'] > df_bis['points'].median()
df_bis['manyaces'] = df_bis['aces'] > df_bis['aces'].median()
```
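To see how the median split behaves, here is a small sketch on a made-up series (`toy_aces` is hypothetical and only serves as an illustration).

```python
import pandas as pd

# Hypothetical proportions of aces over six matches.
toy_aces = pd.Series([0.02, 0.05, 0.08, 0.03, 0.10, 0.06])

# With an even number of values, the median is the mean of the
# two middle values (here 0.05 and 0.06).
median = toy_aces.median()

# Strict, element-wise comparison: one boolean per match.
many = toy_aces > median
print(many.tolist())  # [False, False, True, False, True, True]
```

Because the comparison is strict, a value exactly equal to the median is mapped to False, so the two groups may not be perfectly balanced when there are ties.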

- Then, we create a **contingency table**, with the frequencies of all four possibilities (True & True, True & False, etc.).

In [ ]:

```
pd.crosstab(df_bis['result'], df_bis['manyaces'])
```

- Finally, we compute the chi-square test statistic and the associated p-value. The null hypothesis is the independence between the variables. SciPy implements this test in `scipy.stats.chi2_contingency`, which returns several objects. We're interested in the second result, which is the p-value.

In [ ]:

```
# `_` holds the output of the previous cell, here the contingency table.
st.chi2_contingency(_)
```
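As a hedged illustration of how the returned objects can be unpacked, the sketch below runs the test on a made-up 2x2 table (the counts in `table` are hypothetical, not the tennis data) and checks the expected frequencies by hand.

```python
import numpy as np
import scipy.stats as st

# Hypothetical 2x2 table of counts (rows: result, columns: manyaces).
table = np.array([[30, 20],
                  [15, 35]])

# The result unpacks into the test statistic, the p-value, the number
# of degrees of freedom, and the expected frequencies.
chi2, pvalue, dof, expected = st.chi2_contingency(table)

# Expected counts under independence: row total * column total / n.
n = table.sum()
expected_manual = np.outer(table.sum(axis=1), table.sum(axis=0)) / n

print("chi2 = %.3f, p-value = %.4f, dof = %d" % (chi2, pvalue, dof))
print(expected)
```

For a 2x2 table there is one degree of freedom, and `chi2_contingency` applies Yates' continuity correction by default in that case. A p-value below the chosen threshold (0.05 is a common convention) leads to rejecting the null hypothesis of independence.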

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).