- We import NumPy, Pandas and matplotlib.

In [ ]:

```
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
```

In [ ]:

```
player = 'Roger Federer'
filename = "data/{name}.csv".format(
    name=player.replace(' ', '-'))
df = pd.read_csv(filename)
```

- `pd.read_csv` returns a `DataFrame`, a 2D tabular data structure where each row is an observation and each column is a variable. We can take a first look at this dataset by simply displaying it in the IPython notebook.

In [ ]:

```
df
```
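To make the row/column structure concrete, here is a minimal sketch that builds a tiny `DataFrame` by hand. The column names and values are made up for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical toy data: each row is one match, each column one variable.
toy = pd.DataFrame({
    'winner': ['Player A', 'Player B', 'Player A'],
    'surface': ['Hard', 'Clay', 'Grass'],
    'year': [2010, 2011, 2012],
})

# shape is (number of observations, number of variables).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# The column labels are the variable names.
print(list(toy.columns))

# head(n) returns only the first n rows, which is handy for large tables.
print(toy.head(2))
```

On a large table, `head` and `tail` are usually more readable than displaying the whole object.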

- There are many columns. Each row corresponds to a match played by Roger Federer. Let's add a boolean variable indicating whether he won the match. The `tail` method displays the last rows of the column.

In [ ]:

```
df['win'] = df['winner'] == player
df['win'].tail()
```

`df['win']` is a `Series` object: it is very similar to a NumPy array, except that each value has an index (here, the match index). This object has a few standard statistical functions. For example, let's look at the proportion of matches won.

In [ ]:

```
print("{player} has won {vic:.0f}% of his ATP matches.".format(
    player=player, vic=100*df['win'].mean()))
```
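Why does the mean of a boolean column give a proportion? The following sketch, using a hypothetical four-match `Series`, shows that a comparison keeps the index and that `True` is counted as 1 when averaging.

```python
import pandas as pd

# Hypothetical toy data, not from the real dataset.
winner = pd.Series(['Player A', 'Player B', 'Player A', 'Player A'])

# Comparing a Series with a scalar yields a boolean Series with the same index.
win = winner == 'Player A'
print(win.index.tolist())

# True counts as 1 and False as 0, so the mean is the fraction of True values.
win_rate = win.mean()
print(win_rate)
```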

- Now, we are going to look at the evolution of some variables across time. The `start date` field contains the start date of the tournament as a string. We can convert the type to a date type using the `pd.to_datetime` function.

In [ ]:

```
date = pd.to_datetime(df['start date'])
```
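As a small illustration of what the conversion buys us, the sketch below parses a few hypothetical date strings (the real file's date format may differ) and then uses date-aware operations that plain strings do not offer.

```python
import pandas as pd

# Hypothetical date strings; the format in the actual CSV file may differ.
raw = pd.Series(['2012-01-16', '2012-05-27', '2013-06-24'])

# Parse the strings into datetime values.
dates = pd.to_datetime(raw)

# Once parsed, the .dt accessor exposes date components such as the year.
years = dates.dt.year.tolist()
print(years)

# Date arithmetic also works: time elapsed between first and last entry.
span = dates.max() - dates.min()
print(span.days)
```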

In [ ]:

```
# Proportion of double faults among all points played, for each match.
df['dblfaults'] = (df['player1 double faults'] /
                   df['player1 total points total'])
```

- We can use the `head` and `tail` methods to take a look at the beginning and the end of the column, and `describe` to get summary statistics. In particular, let's note that some rows have `NaN` values (i.e. the number of double faults is not available for all matches).

In [ ]:

```
df['dblfaults'].tail()
```

In [ ]:

```
df['dblfaults'].describe()
```
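The effect of `NaN` values on these statistics can be checked on a toy example. In this sketch the numbers are invented: one match has no recorded double faults, the missing value propagates through the division, and pandas then skips it when counting and averaging.

```python
import numpy as np
import pandas as pd

# Hypothetical counts; NaN marks a match with no recorded statistics.
double_faults = pd.Series([2.0, np.nan, 1.0, 4.0])
total_points = pd.Series([100.0, 120.0, 50.0, 200.0])

# NaN propagates through the element-wise division.
ratio = double_faults / total_points

# Number of missing values and number of valid values.
n_missing = int(ratio.isna().sum())
n_valid = int(ratio.count())

# By default, mean() ignores NaN values instead of returning NaN.
mean_ratio = ratio.mean()
print(n_missing, n_valid, mean_ratio)

# describe() reports the same count of valid values.
print(ratio.describe()['count'])
```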

- A very powerful feature in Pandas is `groupby`. This function allows us to group together rows that have the same value in a particular column. Then, we can aggregate this group-by object to compute statistics in each group. For instance, here is how we can get the proportion of wins as a function of the tournament's surface.

In [ ]:

```
df.groupby('surface')['win'].mean()
```
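The same split-then-aggregate pattern can be traced by hand on a small table. The data below is hypothetical and only meant to show how rows sharing a `surface` value are pooled before the mean is taken.

```python
import pandas as pd

# Hypothetical toy data: six matches on three surfaces.
toy = pd.DataFrame({
    'surface': ['Hard', 'Clay', 'Hard', 'Grass', 'Clay', 'Hard'],
    'win': [True, False, True, True, True, False],
})

# Group rows by surface, then average the boolean 'win' column per group.
win_by_surface = toy.groupby('surface')['win'].mean()
print(win_by_surface)

# Several aggregations can be computed at once; 'count' shows how many
# matches each proportion is based on.
summary = toy.groupby('surface')['win'].agg(['mean', 'count'])
print(summary)
```

Reporting the count next to the mean is a cheap way to spot groups where the proportion rests on very few matches.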

- Now, we are going to display the proportion of double faults as a function of the tournament date, as well as the yearly average. To do this, we also use `groupby`.

In [ ]:

```
gb = df.groupby('year')
```

`gb` is a `GroupBy` instance. It is similar to a `DataFrame`, but there are multiple rows per group (all matches played in each year). We can aggregate those rows using the `mean` operation. We use matplotlib's `plot_date` function because the x-axis contains dates.

In [ ]:

```
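plt.figure(figsize=(8, 4))
# Individual matches: one translucent marker per match.
plt.plot_date(date, df['dblfaults'], alpha=.25, lw=0);
# Yearly average, placed at the latest tournament date of each year.
# We group the parsed dates (not the raw strings) by year.
plt.plot_date(date.groupby(df['year']).max(),
              gb['dblfaults'].mean(), '-', lw=3);
plt.xlabel('Year');
plt.ylabel('Proportion of double faults per match.');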
```
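To see what a `GroupBy` object holds before it is aggregated, we can iterate over it. This is a sketch on invented data; the `year` and `dblfaults` column names mirror the ones used above, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: five matches spread over two years.
toy = pd.DataFrame({
    'year': [2010, 2010, 2011, 2011, 2011],
    'dblfaults': [0.02, 0.04, 0.01, 0.03, 0.02],
})
gb_toy = toy.groupby('year')

# Iterating yields (key, sub-DataFrame) pairs: several rows per group.
group_sizes = {year: len(group) for year, group in gb_toy}
print(group_sizes)

# Aggregating collapses each group to a single value, indexed by the key.
yearly_mean = gb_toy['dblfaults'].mean()
print(yearly_mean)
```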

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).