#!/usr/bin/env python
# coding: utf-8

# > This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.
#

# # 7.1. Explore a dataset with Pandas and matplotlib

# 1. We import NumPy, Pandas and matplotlib.

# In[ ]:

from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# 2. The dataset is a CSV file, i.e. a text file with comma-separated values. Pandas lets us load this file with a single function.

# In[ ]:

player = 'Roger Federer'
# Build the file name from the player's name, e.g. "data/Roger-Federer.csv".
filename = "data/{name}.csv".format(
    name=player.replace(' ', '-'))
df = pd.read_csv(filename)

# The loaded data is a `DataFrame`, a 2D tabular data structure in which each row is an observation and each column is a variable. We can take a first look at this dataset by displaying it in the IPython notebook.

# In[ ]:

df

# 3. There are many columns. Each row corresponds to a match played by Roger Federer. Let's add a boolean variable indicating whether he won the match. The `tail` method displays the last rows of the column.

# In[ ]:

df['win'] = df['winner'] == player
df['win'].tail()

# 4. `df['win']` is a `Series` object: it is very similar to a NumPy array, except that each value has an index (here, the match index). This object has a few standard statistical functions. For example, let's look at the proportion of matches won.

# In[ ]:

# The mean of a boolean Series is the proportion of True values.
print("{player} has won {vic:.0f}% of his ATP matches.".format(
    player=player, vic=100*df['win'].mean()))

# 5. Now, we are going to look at how some variables evolve over time. The `start date` field contains the start date of the tournament as a string. We can convert it to a date type using the `pd.to_datetime` function, and store the result back in the column so that later group-wise operations work on real dates.

# In[ ]:

date = pd.to_datetime(df['start date'])
# Keep the parsed dates in the DataFrame; step 10 takes the latest date per year.
df['start date'] = date

# 6. We now look at the proportion of double faults in each match. We divide by the total number of points because longer matches logically contain more double faults. This number is an indicator of the player's state of mind, his level of self-confidence, his willingness to take risks while serving, and other parameters.

# In[ ]:

df['dblfaults'] = (df['player1 double faults'] /
                   df['player1 total points total'])

# 7. We can use the `head` and `tail` methods to take a look at the beginning and the end of the column, and `describe` to get summary statistics. In particular, note that some rows have `NaN` values (i.e. the number of double faults is not available for all matches).

# In[ ]:

df['dblfaults'].tail()

# In[ ]:

df['dblfaults'].describe()

# 8. A very powerful feature in Pandas is `groupby`. This function allows us to group together rows that have the same value in a particular column. Then, we can aggregate this group-by object to compute statistics in each group. For instance, here is how we can get the proportion of wins as a function of the tournament's surface.

# In[ ]:

df.groupby('surface')['win'].mean()

# 9. Now, we are going to display the proportion of double faults as a function of the tournament date, as well as the yearly average. To do this, we also use `groupby`.

# In[ ]:

gb = df.groupby('year')

# 10. `gb` is a `GroupBy` instance. It is similar to a `DataFrame`, but there are multiple rows per group (all matches played in each year). We can aggregate those rows using the `mean` operation. We use matplotlib's `plot_date` function because the x-axis contains dates.

# In[ ]:

plt.figure(figsize=(8, 4))
# One semi-transparent marker per match.
# Note: plot_date is deprecated as of matplotlib 3.9; plt.plot accepts dates directly.
plt.plot_date(date.astype(datetime), df['dblfaults'],
              alpha=.25, lw=0);
# Yearly average, positioned at the latest tournament date of each year.
plt.plot_date(gb['start date'].max(), gb['dblfaults'].mean(),
              '-', lw=3);
plt.xlabel('Year');
plt.ylabel('Proportion of double faults per match.');

# > You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
#
# > [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).
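
# The `groupby` aggregations of steps 8 to 10 can be tried without the CSV file. The following is a minimal sketch on a small hypothetical table: the column names mirror the recipe, but the `toy` frame and all of its values are invented for illustration and are not taken from the real dataset. It shows that the mean of a boolean column is a proportion, and that `mean` skips `NaN` entries by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the match records (invented values).
toy = pd.DataFrame({
    'year': [2010, 2010, 2011, 2011],
    'surface': ['Hard', 'Clay', 'Hard', 'Clay'],
    'win': [True, False, True, True],
    'dblfaults': [0.02, np.nan, 0.04, 0.06],
})

# The mean of a boolean column is the proportion of True values per group.
win_by_surface = toy.groupby('surface')['win'].mean()

# mean() ignores NaN by default, so 2010 is averaged over a single match.
dbl_by_year = toy.groupby('year')['dblfaults'].mean()

print(win_by_surface)
print(dbl_by_year)
```

# Both results are `Series` indexed by the grouping key (surface or year), which is why they can be passed directly to a plotting function as the y values.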