A Rubric for Data Wrangling and Exploration

Companion to Lecture 4 of Harvard CS109: Data Science | Prepared by Chris Beaumont

This scene from Cast Away is an accurate metaphor for the amount of time you'll spend cleaning data, and the delirium you'll experience at the end.

In [59]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

#tell pandas to display wide tables as pretty HTML tables
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecesasry plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

I'd like to suggest a basic rubric for the early stages of exploratory data analysis in Python. This isn't universally applicable, but it does cover many patterns which recur in several data analysis contexts. It's useful to keep this rubric in mind when encountering a new dataset.

The basic workflow is as follows:

  1. Build a DataFrame from the data (ideally, put all data in this object)
  2. Clean the DataFrame. It should have the following properties:
    • Each row describes a single object
    • Each column describes a property of that object
    • Columns are numeric whenever appropriate
    • Columns contain atomic properties that cannot be further decomposed
  3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
  4. Explore group properties. Use groupby and small multiples to compare subsets of the data.

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to followup in subsequent analysis.

Here's a preview of the raw data we'll use -- it's a list of the 10,000 movies made since 1950 with the most IMDB user ratings. It was scraped about a year ago from pages like this. Download the data at http://bit.ly/cs109_imdb.

In [60]:
!head imdb_top_10000.txt
tt0111161	The Shawshank Redemption (1994)	1994	 9.2	619479	142 mins.	Crime|Drama
tt0110912	Pulp Fiction (1994)	1994	 9.0	490065	154 mins.	Crime|Thriller
tt0137523	Fight Club (1999)	1999	 8.8	458173	139 mins.	Drama|Mystery|Thriller
tt0133093	The Matrix (1999)	1999	 8.7	448114	136 mins.	Action|Adventure|Sci-Fi
tt1375666	Inception (2010)	2010	 8.9	385149	148 mins.	Action|Adventure|Sci-Fi|Thriller
tt0109830	Forrest Gump (1994)	1994	 8.7	368994	142 mins.	Comedy|Drama|Romance
tt0169547	American Beauty (1999)	1999	 8.6	338332	122 mins.	Drama
tt0499549	Avatar (2009)	2009	 8.1	336855	162 mins.	Action|Adventure|Fantasy|Sci-Fi
tt0108052	Schindler's List (1993)	1993	 8.9	325888	195 mins.	Biography|Drama|History|War
tt0080684	Star Wars: Episode V - The Empire Strikes Back (1980)	1980	 8.8	320105	124 mins.	Action|Adventure|Family|Sci-Fi

1. Build a DataFrame

The textfile is tab-separated, and doesn't have any column headers. We set the appropriate keywords in pd.read_csv to handle this

In [61]:
names = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
data = pd.read_csv('imdb_top_10000.txt', delimiter='\t', names=names).dropna()
print "Number of rows: %i" % data.shape[0]
data.head()  # print the first 5 rows
Number of rows: 9999
Out[61]:
imdbID title year score votes runtime genres
0 tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime|Drama
1 tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime|Thriller
2 tt0137523 Fight Club (1999) 1999 8.8 458173 139 mins. Drama|Mystery|Thriller
3 tt0133093 The Matrix (1999) 1999 8.7 448114 136 mins. Action|Adventure|Sci-Fi
4 tt1375666 Inception (2010) 2010 8.9 385149 148 mins. Action|Adventure|Sci-Fi|Thriller

2. Clean the DataFrame

There are several problems with the DataFrame at this point:

  1. The runtime column describes a number, but is stored as a string
  2. The genres column is not atomic -- it aggregates several genres together. This makes it hard, for example, to extract which movies are Comedies.
  3. The movie year is repeated in the title and year column

Fixing the runtime column

The following snipptet converts a string like '142 mins.' to the number 142:

In [62]:
dirty = '142 mins.'
number, text = dirty.split(' ')
clean = int(number)
print number
142

We can package this up into a list comprehension

In [63]:
clean_runtime = [float(r.split(' ')[0]) for r in data.runtime]
data['runtime'] = clean_runtime
data.head()
Out[63]:
imdbID title year score votes runtime genres
0 tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 Crime|Drama
1 tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 Crime|Thriller
2 tt0137523 Fight Club (1999) 1999 8.8 458173 139 Drama|Mystery|Thriller
3 tt0133093 The Matrix (1999) 1999 8.7 448114 136 Action|Adventure|Sci-Fi
4 tt1375666 Inception (2010) 2010 8.9 385149 148 Action|Adventure|Sci-Fi|Thriller

Splitting up the genres

We can use the concept of indicator variables to split the genres column into many columns. Each new column will correspond to a single genre, and each cell will be True or False.

In [64]:
#determine the unique genres
genres = set()
for m in data.genres:
    genres.update(g for g in m.split('|'))
genres = sorted(genres)

#make a column for each genre
for genre in genres:
    data[genre] = [genre in movie.split('|') for movie in data.genres]
         
data.head()
Out[64]:
imdbID title year score votes runtime genres Action Adult Adventure Animation Biography Comedy Crime Drama Family Fantasy Film-Noir History Horror Music Musical Mystery News Reality-TV Romance Sci-Fi Sport Thriller War Western
0 tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 Crime|Drama False False False False False False True True False False False False False False False False False False False False False False False False
1 tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 Crime|Thriller False False False False False False True False False False False False False False False False False False False False False True False False
2 tt0137523 Fight Club (1999) 1999 8.8 458173 139 Drama|Mystery|Thriller False False False False False False False True False False False False False False False True False False False False False True False False
3 tt0133093 The Matrix (1999) 1999 8.7 448114 136 Action|Adventure|Sci-Fi True False True False False False False False False False False False False False False False False False False True False False False False
4 tt1375666 Inception (2010) 2010 8.9 385149 148 Action|Adventure|Sci-Fi|Thriller True False True False False False False False False False False False False False False False False False False True False True False False

Removing year from the title

We can fix each element by stripping off the last 7 characters

In [65]:
data['title'] = [t[0:-7] for t in data.title]
data.head()
Out[65]:
imdbID title year score votes runtime genres Action Adult Adventure Animation Biography Comedy Crime Drama Family Fantasy Film-Noir History Horror Music Musical Mystery News Reality-TV Romance Sci-Fi Sport Thriller War Western
0 tt0111161 The Shawshank Redemption 1994 9.2 619479 142 Crime|Drama False False False False False False True True False False False False False False False False False False False False False False False False
1 tt0110912 Pulp Fiction 1994 9.0 490065 154 Crime|Thriller False False False False False False True False False False False False False False False False False False False False False True False False
2 tt0137523 Fight Club 1999 8.8 458173 139 Drama|Mystery|Thriller False False False False False False False True False False False False False False False True False False False False False True False False
3 tt0133093 The Matrix 1999 8.7 448114 136 Action|Adventure|Sci-Fi True False True False False False False False False False False False False False False False False False False True False False False False
4 tt1375666 Inception 2010 8.9 385149 148 Action|Adventure|Sci-Fi|Thriller True False True False False False False False False False False False False False False False False False False True False True False False

3. Explore global properties

Next, we get a handle on some basic, global summaries of the DataFrame.

Call describe on relevant columns

In [66]:
data[['score', 'runtime', 'year', 'votes']].describe()
Out[66]:
score runtime year votes
count 9999.000000 9999.000000 9999.000000 9999.000000
mean 6.385989 103.580358 1993.471447 16605.462946
std 1.189965 26.629310 14.830049 34564.883945
min 1.500000 0.000000 1950.000000 1356.000000
25% 5.700000 93.000000 1986.000000 2334.500000
50% 6.600000 102.000000 1998.000000 4981.000000
75% 7.200000 115.000000 2005.000000 15278.500000
max 9.200000 450.000000 2011.000000 619479.000000
In [67]:
#hmmm, a runtime of 0 looks suspicious. How many movies have that?
print len(data[data.runtime == 0])

#probably best to flag those bad data as NAN
data.runtime[data.runtime==0] = np.nan
282

After flagging bad runtimes, we repeat

In [68]:
data.runtime.describe()
Out[68]:
count    9717.000000
mean      106.586395
std        20.230330
min        45.000000
25%        93.000000
50%       103.000000
75%       115.000000
max       450.000000
dtype: float64

Make some basic plots

In [69]:
# more movies in recent years, but not *very* recent movies (they haven't had time to receive lots of votes yet?)
plt.hist(data.year, bins=np.arange(1950, 2013), color='#cccccc')
plt.xlabel("Release Year")
remove_border()
In [70]:
plt.hist(data.score, bins=20, color='#cccccc')
plt.xlabel("IMDB rating")
remove_border()
In [71]:
plt.hist(data.runtime.dropna(), bins=50, color='#cccccc')
plt.xlabel("Runtime distribution")
remove_border()
In [72]:
#hmm, more bad, recent movies. Real, or a selection bias?

plt.scatter(data.year, data.score, lw=0, alpha=.08, color='k')
plt.xlabel("Year")
plt.ylabel("IMDB Rating")
remove_border()
In [73]:
plt.scatter(data.votes, data.score, lw=0, alpha=.2, color='k')
plt.xlabel("Number of Votes")
plt.ylabel("IMDB Rating")
plt.xscale('log')
remove_border()

Identify some outliers

In [74]:
# low-score movies with lots of votes
data[(data.votes > 9e4) & (data.score < 5)][['title', 'year', 'score', 'votes', 'genres']]
Out[74]:
title year score votes genres
317 New Moon 2009 4.5 90457 Adventure|Drama|Fantasy|Romance
334 Batman & Robin 1997 3.5 91875 Action|Crime|Fantasy|Sci-Fi
In [75]:
# The lowest rated movies
data[data.score == data.score.min()][['title', 'year', 'score', 'votes', 'genres']]
Out[75]:
title year score votes genres
1982 Manos: The Hands of Fate 1966 1.5 20927 Horror
2793 Superbabies: Baby Geniuses 2 2004 1.5 13196 Comedy|Family
3746 Daniel the Wizard 2004 1.5 8271 Comedy|Crime|Family|Fantasy|Horror
5158 Ben & Arthur 2002 1.5 4675 Drama|Romance
5993 Night Train to Mundo Fine 1966 1.5 3542 Action|Adventure|Crime|War
6257 Monster a-Go Go 1965 1.5 3255 Sci-Fi|Horror
6726 Dream Well 2009 1.5 2848 Comedy|Romance|Sport
In [76]:
# The highest rated movies
data[data.score == data.score.max()][['title', 'year', 'score', 'votes', 'genres']]
Out[76]:
title year score votes genres
0 The Shawshank Redemption 1994 9.2 619479 Crime|Drama
26 The Godfather 1972 9.2 474189 Crime|Drama

Run aggregation functions like sum over several rows or columns

What genres are the most frequent?

In [77]:
#sum sums over rows by default
genre_count = np.sort(data[genres].sum())[::-1]
pd.DataFrame({'Genre Count': genre_count})
Out[77]:
Genre Count
Drama 5697
Comedy 3922
Thriller 2832
Romance 2441
Action 1891
Crime 1867
Adventure 1313
Horror 1215
Mystery 1009
Fantasy 916
Sci-Fi 897
Family 754
War 512
Biography 394
Music 371
History 358
Animation 314
Sport 288
Musical 260
Western 235
Film-Noir 40
Adult 9
News 1
Reality-TV 1

How many genres does a movie have, on average?

In [78]:
#axis=1 sums over columns instead
genre_count = data[genres].sum(axis=1) 
print "Average movie has %0.2f genres" % genre_count.mean()
genre_count.describe()
Average movie has 2.75 genres
Out[78]:
count    9999.000000
mean        2.753975
std         1.168910
min         1.000000
25%         2.000000
50%         3.000000
75%         3.000000
max         8.000000
dtype: float64

Explore Group Properties

Let's split up movies by decade

In [79]:
decade =  (data.year // 10) * 10

tyd = data[['title', 'year']]
tyd['decade'] = decade

tyd.head()
Out[79]:
title year decade
0 The Shawshank Redemption 1994 1990
1 Pulp Fiction 1994 1990
2 Fight Club 1999 1990
3 The Matrix 1999 1990
4 Inception 2010 2010

GroupBy will gather movies into groups with equal decade values

In [80]:
#mean score for all movies in each decade
decade_mean = data.groupby(decade).score.mean()
decade_mean.name = 'Decade Mean'
print decade_mean

plt.plot(decade_mean.index, decade_mean.values, 'o-',
        color='r', lw=3, label='Decade Average')
plt.scatter(data.year, data.score, alpha=.04, lw=0, color='k')
plt.xlabel("Year")
plt.ylabel("Score")
plt.legend(frameon=False)
remove_border()
year
1950    7.244522
1960    7.062367
1970    6.842297
1980    6.248693
1990    6.199316
2000    6.277858
2010    6.344552
Name: Decade Mean, dtype: float64

We can go one further, and compute the scatter in each year as well

In [81]:
grouped_scores = data.groupby(decade).score

mean = grouped_scores.mean()
std = grouped_scores.std()

plt.plot(decade_mean.index, decade_mean.values, 'o-',
        color='r', lw=3, label='Decade Average')
plt.fill_between(decade_mean.index, (decade_mean + std).values,
                 (decade_mean - std).values, color='r', alpha=.2)
plt.scatter(data.year, data.score, alpha=.04, lw=0, color='k')
plt.xlabel("Year")
plt.ylabel("Score")
plt.legend(frameon=False)
remove_border()

You can also iterate over a GroupBy object. Each iteration yields two variables: one of the distinct values of the group key, and the subset of the dataframe where the key equals that value. To find the most popular movie each year:

In [82]:
for year, subset in data.groupby('year'):
    print year, subset[subset.score == subset.score.max()].title.values
1950 ['Sunset Blvd.']
1951 ['Strangers on a Train']
1952 ["Singin' in the Rain"]
1953 ['The Wages of Fear' 'Tokyo Story']
1954 ['Seven Samurai']
1955 ['Diabolique']
1956 ['The Killing']
1957 ['12 Angry Men']
1958 ['Vertigo']
1959 ['North by Northwest']
1960 ['Psycho']
1961 ['Yojimbo']
1962 ['To Kill a Mockingbird' 'Lawrence of Arabia']
1963 ['The Great Escape' 'High and Low']
1964 ['Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb']
1965 ['For a Few Dollars More']
1966 ['The Good, the Bad and the Ugly']
1967 ['Cool Hand Luke']
1968 ['Once Upon a Time in the West']
1969 ['Butch Cassidy and the Sundance Kid' 'Army of Shadows']
1970 ['Patton' 'The Conformist' 'Le Cercle Rouge']
1971 ['A Clockwork Orange']
1972 ['The Godfather']
1973 ['The Sting' 'Scenes from a Marriage']
1974 ['The Godfather: Part II']
1975 ['Outrageous Class']
1976 ['Tosun Pasa']
1977 ['Star Wars: Episode IV - A New Hope']
1978 ['The Girl with the Red Scarf']
1979 ['Apocalypse Now']
1980 ['Star Wars: Episode V - The Empire Strikes Back']
1981 ['Raiders of the Lost Ark']
1982 ['The Marathon Family']
1983 ['Star Wars: Episode VI - Return of the Jedi']
1984 ['Balkan Spy']
1985 ['The Broken Landlord']
1986 ['Aliens']
1987 ['Mr. Muhsin']
1988 ['Cinema Paradiso']
1989 ['Indiana Jones and the Last Crusade' "Don't Let Them Shoot the Kite"]
1990 ['Goodfellas']
1991 ['The Silence of the Lambs']
1992 ['Reservoir Dogs']
1993 ["Schindler's List"]
1994 ['The Shawshank Redemption']
1995 ['The Usual Suspects' 'Se7en']
1996 ['Fargo' 'The Bandit']
1997 ['Life Is Beautiful']
1998 ['American History X']
1999 ['Fight Club']
2000 ['Memento']
2001 ['The Lord of the Rings: The Fellowship of the Ring']
2002 ['City of God']
2003 ['The Lord of the Rings: The Return of the King']
2004 ['Eternal Sunshine of the Spotless Mind']
2005 ['My Father and My Son']
2006 ['The Departed' 'The Lives of Others']
2007 ['Like Stars on Earth']
2008 ['The Dark Knight']
2009 ['Inglourious Basterds']
2010 ['Inception']
2011 ['A Separation']

Small multiples

Let's split up the movies by genre, and look at how their release year/runtime/IMDB score vary. The distribution for all movies is shown as a grey background.

This isn't a standard groupby, so we can't use the groupby method here. A manual loop is needed

In [83]:
#create a 4x6 grid of plots.
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), 
                         tight_layout=True)

bins = np.arange(1950, 2013, 3)
for ax, genre in zip(axes.ravel(), genres):
    ax.hist(data[data[genre] == 1].year, 
            bins=bins, histtype='stepfilled', normed=True, color='r', alpha=.3, ec='none')
    ax.hist(data.year, bins=bins, histtype='stepfilled', ec='None', normed=True, zorder=0, color='#cccccc')
    
    ax.annotate(genre, xy=(1955, 3e-2), fontsize=14)
    ax.xaxis.set_ticks(np.arange(1950, 2013, 30))
    ax.set_yticks([])
    remove_border(ax, left=False)
    ax.set_xlabel('Year')

Some subtler patterns here:

  1. Westerns and Musicals have a more level distribution
  2. Film Noir movies were much more popular in the 50s and 60s
In [84]:
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), tight_layout=True)

bins = np.arange(30, 240, 10)

for ax, genre in zip(axes.ravel(), genres):
    ax.hist(data[data[genre] == 1].runtime, 
            bins=bins, histtype='stepfilled', color='r', ec='none', alpha=.3, normed=True)
               
    ax.hist(data.runtime, bins=bins, normed=True,
            histtype='stepfilled', ec='none', color='#cccccc',
            zorder=0)
    
    ax.set_xticks(np.arange(30, 240, 60))
    ax.set_yticks([])
    ax.set_xlabel("Runtime [min]")
    remove_border(ax, left=False)
    ax.annotate(genre, xy=(230, .02), ha='right', fontsize=12)
  1. Biographies and history movies are longer
  2. Animated movies are shorter
  3. Film-Noir movies have the same mean, but are more conentrated around a 100 minute runtime
  4. Musicals have the same mean, but greater dispersion in runtimes
In [85]:
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), tight_layout=True)

bins = np.arange(0, 10, .5)

for ax, genre in zip(axes.ravel(), genres):
    ax.hist(data[data[genre] == 1].score, 
            bins=bins, histtype='stepfilled', color='r', ec='none', alpha=.3, normed=True)
               
    ax.hist(data.score, bins=bins, normed=True,
            histtype='stepfilled', ec='none', color='#cccccc',
            zorder=0)
    
    ax.set_yticks([])
    ax.set_xlabel("Score")
    remove_border(ax, left=False)
    ax.set_ylim(0, .4)
    ax.annotate(genre, xy=(0, .2), ha='left', fontsize=12)
  1. Film-noirs, histories, and biographies have higher ratings (a selection effect?)
  2. Horror movies and adult films have lower ratings

Other Resources

css tweaks in this cell