A practical introduction to IPython Notebook & pandas

Hi! I'm Julia.

Right now: Hacker School.
Before: Data scientist.

I'm on the internet at http://jvns.ca, http://twitter.com/b0rk

You can follow along by downloading this presentation and running the code yourself:

Goal (Today)

Know how to use pandas to answer some specific questions about a dataset

Roadmap:

  1. Demo with rats
  2. Dataframes: what makes pandas powerful
  3. Selecting data from a dataframe
  4. Time series and indexes and resampling
  5. Groupby + aggregate

Some notes about installation:

Don't do this:

sudo apt-get install ipython-notebook

Instead, do this:

pip install ipython tornado pyzmq

or install Anaconda from http://store.continuum.io (what I do)

You can start IPython notebook by running

ipython notebook --pylab inline
In [1]:
# Some display stuff. Don't mind this for now.
import pandas as pd
pd.set_option('display.mpl_style', 'default') # Make graphs pretty
figsize(15, 6) # Make graphs a good size for my screen
# Display all the columns instead of a summary
pd.set_option('display.line_width', 4000) 
pd.set_option('display.max_columns', 100)

Example: Graph the number of noise complaints each hour in New York

In [2]:
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [43]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[43]:
<matplotlib.axes.AxesSubplot at 0xb69fad0>

A few awesome things about pandas

  • Really, really, really, really good at time series
  • Can import Excel files (!!!)
  • Fast (joining, merging)
  • Handles NaNs pretty easily

2. Dataframes: what makes pandas powerful

This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language R.

If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.

In [4]:
people = pd.read_csv('tiny.csv')
people
Out[4]:
name age height
0 Scott 12 61
1 Lea 13 73
2 Julia 14 66
3 Kate 15 62
4 Rishi 18 70

This is a like a SQL database, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height, and 5 rows.

3. Selecting data from a dataframe

I want you to know about this because you almost always only want a subset of the data you're working on. We are going to look at a CSV with 40 columns and 1,000,000 rows.

In [5]:
# Load the first 5 rows of our CSV
requests = pd.read_csv('./311-service-requests.csv', nrows=5)
In [6]:
# How to get a column
requests['Complaint Type']
Out[6]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object
In [7]:
# How to get a subset of the columns
requests[['Complaint Type', 'Created Date']]
Out[7]:
Complaint Type Created Date
0 Noise - Street/Sidewalk 10/31/2013 02:08:41 AM
1 Illegal Parking 10/31/2013 02:01:04 AM
2 Noise - Commercial 10/31/2013 02:00:24 AM
3 Noise - Vehicle 10/31/2013 01:56:23 AM
4 Rodent 10/31/2013 01:53:44 AM
In [8]:
# How to get 3 rows
requests[:3]
Out[8]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type City Landmark Facility Type Status Due Date Resolution Action Updated Date Community Board Borough X Coordinate (State Plane) Y Coordinate (State Plane) Park Facility Name Park Borough School Name School Number School Region School Code School Phone Number School Address School City School State School Zip School Not Found School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
0 26589651 10/31/2013 02:08:41 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11432 90-03 169 STREET 169 STREET 90 AVENUE 91 AVENUE NaN NaN ADDRESS JAMAICA NaN Precinct Assigned 10/31/2013 10:08:41 AM 10/31/2013 02:35:17 AM 12 QUEENS QUEENS 1042027 197389 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.708275 -73.791604 (40.70827532593202, -73.79160395779721)
1 26593698 10/31/2013 02:01:04 AM NaN NYPD New York City Police Department Illegal Parking Commercial Overnight Parking Street/Sidewalk 11378 58 AVENUE 58 AVENUE 58 PLACE 59 STREET NaN NaN BLOCKFACE MASPETH NaN Precinct Open 10/31/2013 10:01:04 AM NaN 05 QUEENS QUEENS 1009349 201984 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.721041 -73.909453 (40.721040535628305, -73.90945306791765)
2 26594139 10/31/2013 02:00:24 AM 10/31/2013 02:40:32 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 10032 4060 BROADWAY BROADWAY WEST 171 STREET WEST 172 STREET NaN NaN ADDRESS NEW YORK NaN Precinct Closed 10/31/2013 10:00:24 AM 10/31/2013 02:39:42 AM 12 MANHATTAN MANHATTAN 1001088 246531 Unspecified MANHATTAN Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.843330 -73.939144 (40.84332975466513, -73.93914371913482)

Get the first 3 rows of a column

In [9]:
requests['Agency Name'][:3]
Out[9]:
0    New York City Police Department
1    New York City Police Department
2    New York City Police Department
Name: Agency Name, dtype: object
In [10]:
requests[:3]['Agency Name']
Out[10]:
0    New York City Police Department
1    New York City Police Department
2    New York City Police Department
Name: Agency Name, dtype: object

Compare a column to a value

In [11]:
requests['Complaint Type']
Out[11]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object
In [12]:
requests['Complaint Type'] == 'Noise - Street/Sidewalk'
Out[12]:
0     True
1    False
2    False
3    False
4    False
Name: Complaint Type, dtype: bool

That's numpy in action! Using == on a column of a dataframe gives us a series of True and False values

Selecting only the rows with noise complaints

In [13]:
noise_complaints = requests[requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints
Out[13]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type City Landmark Facility Type Status Due Date Resolution Action Updated Date Community Board Borough X Coordinate (State Plane) Y Coordinate (State Plane) Park Facility Name Park Borough School Name School Number School Region School Code School Phone Number School Address School City School State School Zip School Not Found School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
0 26589651 10/31/2013 02:08:41 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11432 90-03 169 STREET 169 STREET 90 AVENUE 91 AVENUE NaN NaN ADDRESS JAMAICA NaN Precinct Assigned 10/31/2013 10:08:41 AM 10/31/2013 02:35:17 AM 12 QUEENS QUEENS 1042027 197389 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.708275 -73.791604 (40.70827532593202, -73.79160395779721)
In [14]:
# How to get a specific row
requests.ix[0]
Out[14]:
Unique Key                                                       26589651
Created Date                                       10/31/2013 02:08:41 AM
Closed Date                                                           NaN
Agency                                                               NYPD
Agency Name                               New York City Police Department
Complaint Type                                    Noise - Street/Sidewalk
Descriptor                                                   Loud Talking
Location Type                                             Street/Sidewalk
Incident Zip                                                        11432
Incident Address                                         90-03 169 STREET
Street Name                                                    169 STREET
Cross Street 1                                                  90 AVENUE
Cross Street 2                                                  91 AVENUE
Intersection Street 1                                                 NaN
Intersection Street 2                                                 NaN
Address Type                                                      ADDRESS
City                                                              JAMAICA
Landmark                                                              NaN
Facility Type                                                    Precinct
Status                                                           Assigned
Due Date                                           10/31/2013 10:08:41 AM
Resolution Action Updated Date                     10/31/2013 02:35:17 AM
Community Board                                                 12 QUEENS
Borough                                                            QUEENS
X Coordinate (State Plane)                                        1042027
Y Coordinate (State Plane)                                         197389
Park Facility Name                                            Unspecified
Park Borough                                                       QUEENS
School Name                                                   Unspecified
School Number                                                 Unspecified
School Region                                                 Unspecified
School Code                                                   Unspecified
School Phone Number                                           Unspecified
School Address                                                Unspecified
School City                                                   Unspecified
School State                                                  Unspecified
School Zip                                                    Unspecified
School Not Found                                                        N
School or Citywide Complaint                                          NaN
Vehicle Type                                                          NaN
Taxi Company Borough                                                  NaN
Taxi Pick Up Location                                                 NaN
Bridge Highway Name                                                   NaN
Bridge Highway Direction                                              NaN
Road Ramp                                                             NaN
Bridge Highway Segment                                                NaN
Garage Lot Name                                                       NaN
Ferry Direction                                                       NaN
Ferry Terminal Name                                                   NaN
Latitude                                                         40.70828
Longitude                                                        -73.7916
Location                          (40.70827532593202, -73.79160395779721)
Name: 0, Length: 52, dtype: object
In [15]:
# How not to get a row
requests[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-32d897be788b> in <module>()
      1 # How not to get a row
----> 2 requests[0]

/opt/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1926         else:
   1927             # get column
-> 1928             return self._get_item_cache(key)
   1929 
   1930     def _getitem_slice(self, key):

/opt/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
    568             return cache[item]
    569         except Exception:
--> 570             values = self._data.get(item)
    571             res = self._box_item_values(item, values)
    572             cache[item] = res

/opt/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item)
   1381 
   1382     def get(self, item):
-> 1383         _, block = self._find_block(item)
   1384         return block.get(item)
   1385 

/opt/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in _find_block(self, item)
   1523 
   1524     def _find_block(self, item):
-> 1525         self._check_have(item)
   1526         for i, block in enumerate(self.blocks):
   1527             if item in block:

/opt/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in _check_have(self, item)
   1530     def _check_have(self, item):
   1531         if item not in self.items:
-> 1532             raise KeyError('no item named %s' % com.pprint_thing(item))
   1533 
   1534     def reindex_axis(self, new_axis, method=None, axis=0, copy=True):

KeyError: u'no item named 0'

Back to our example

In [16]:
requests = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [17]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[17]:
<matplotlib.axes.AxesSubplot at 0x92b9790>

Indexes

In [18]:
noise_complaints[:3]
Out[18]:
Created Date Complaint Type
0 2013-10-31 02:08:41 Noise - Street/Sidewalk
16 2013-10-31 00:54:03 Noise - Street/Sidewalk
25 2013-10-31 00:35:18 Noise - Street/Sidewalk
In [19]:
noise_complaints = noise_complaints.set_index('Created Date')
In [20]:
noise_complaints[:3]
Out[20]:
Complaint Type
Created Date
2013-10-31 02:08:41 Noise - Street/Sidewalk
2013-10-31 00:54:03 Noise - Street/Sidewalk
2013-10-31 00:35:18 Noise - Street/Sidewalk

Sorting the index

Pandas is awesome for date time index stuff. It was built for dealing with financial data is which is ALL TIME SERIES

In [21]:
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]
Out[21]:
Complaint Type
Created Date
2013-10-07 15:45:56 Noise - Street/Sidewalk
2013-10-07 16:17:41 Noise - Street/Sidewalk
2013-10-07 16:58:08 Noise - Street/Sidewalk

Counting the complaints each hour

In [22]:
noise_complaints.resample('H', how=len)[:3]
Out[22]:
Complaint Type
Created Date
2013-10-07 15:00:00 1
2013-10-07 16:00:00 2
2013-10-07 17:00:00 0

Example 1: done!

In [23]:
noise_complaints.resample('H', how=len).plot()
Out[23]:
<matplotlib.axes.AxesSubplot at 0x9773790>

Chaining commands together

In [25]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index('Created Date')
noise_complaints = noise_complaints.sort_index()
noise_complaints = noise_complaints.resample('H', how=len)
#noise_complaints.plot()

is the same as

In [26]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[26]:
<matplotlib.axes.AxesSubplot at 0x4db98d0>

What if we resample by day?

In [46]:
noise_complaints.set_index('Created Date').sort_index().resample('D', how=len).plot(kind='bar')
Out[46]:
<matplotlib.axes.AxesSubplot at 0xc191d90>

Which day of the week is the complainiest?

In [48]:
noise_complaints = noise_complaints.set_index('Created Date').sort_index()
noise_complaints['weekday'] = noise_complaints.index.weekday
In [53]:
complaints_by_day = noise_complaints.groupby('weekday').aggregate(len)
complaints_by_day
Out[53]:
Complaint Type
weekday
0 200
1 187
2 204
3 149
4 180
5 312
6 280
In [55]:
complaints_by_day.index = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
complaints_by_day.plot(kind='bar')
Out[55]:
<matplotlib.axes.AxesSubplot at 0xd08b510>

Some more examples, for fun...

Example 2: What are the most common complaint types?

In [27]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[27]:
<matplotlib.axes.AxesSubplot at 0x6368a90>

Example 3: Does every zip code complain about the same things?

In [28]:
popular_zip_codes = orig_data['Incident Zip'].value_counts()[:10].index
zipcode_incident_table = orig_data.groupby(['Incident Zip', 'Complaint Type'])['Descriptor'].aggregate(len).unstack()
top_5_complaints = zipcode_incident_table.transpose()[popular_zip_codes]
normalized_complaints = top_5_complaints / top_5_complaints.sum()
normalized_complaints.dropna(how='any').sort('11226', ascending=False)[:5].transpose().plot(kind='bar')
Out[28]:
<matplotlib.axes.AxesSubplot at 0x8d97d50>

Example 4: Which borough complains the most about rats?

In [56]:
rodent_complaints = orig_data[orig_data['Complaint Type'] == 'Rodent']['Borough'].value_counts()
total_complaints = orig_data['Borough'].value_counts()
(rodent_complaints / total_complaints ).plot(kind='bar')
Out[56]:
<matplotlib.axes.AxesSubplot at 0xd14d290>