This IPython notebook was created as a companion to the "Advanced Python for Data Analysis" tutorial and class taught at Investigative Reporters and Editors' Computer-Assisted Reporting conference in March 2015 in Atlanta, Ga.

A first look at Pandas

Now that we're comfortable with our coding environment, and we've dabbled with the IPython notebook, it's time to dive into pandas, the Python library created explicitly for data analysis.

First, in our interactive IPython notebook, we'll import pandas and alias it as "pd", a convention among pandas users that you'll quickly grow accustomed to.

In [1]:
import pandas as pd

Today, we're going to work with a fairly large text file of car accident data from the New Jersey Department of Transportation. This file contains reports from accidents in the Garden State between 2008 and 2013. The data, originally filed in handwritten reports by state troopers and then typed into a fixed-width database by clerks, is both decently large and very messy. (As an aside, if you're really interested in the genesis of this particular file, you can browse the makefile we used to create it.)

Our analysis today will not be comprehensive or particularly accurate, but working with 1.7 million rows of dirty data is a pretty good way to illustrate what you can do with pandas and how it's going to make your life easier.

The first thing we have to do is ingest our data from a source into a pandas DataFrame object. Think of a DataFrame (similar to R's Data Frame) as a virtual spreadsheet. It has columns with unique names and rows with unique row numbers, which we call "the Index". You can read in data from many kinds of sources: json, Excel files, html on the web, sql databases or others. Today, we're going to work with a CSV. First, we're going to create a variable with the name of our CSV, then we'll use pandas' .read_csv() function.
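
To make that concrete, here's a minimal, self-contained sketch of .read_csv() that builds a two-row stand-in for njaccidents.csv in memory (the data is made up), with the other readers shown as comments:

```python
import io
import pandas as pd

# A tiny, made-up stand-in for njaccidents.csv so this runs anywhere.
csv_text = """County Name,Total Killed
ATLANTIC,0
PASSAIC,1
"""
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (2, 2)

# The same pattern works for other sources (these filenames are
# hypothetical, not part of the tutorial's data):
# pd.read_excel("crashes.xlsx")
# pd.read_json("crashes.json")
# pd.read_html("http://example.com")  # returns a list of DataFrames
```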

In [2]:
datafile = "njaccidents.csv"

df = pd.read_csv(datafile)
/Users/tommeagher/.virtualenvs/pandas/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (6,17) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

Now we have a DataFrame object with the name df. Let's use the .head() method to look at the first five rows of data. Does it look familiar?

In [3]:
df.head()
Out[3]:
case code County Name Municipality Name Crash Date Crash Day Of Week Crash Time Police Dept Code Police Department Police Station Total Killed ... Is Ramp Ramp To/From Route Name Ramp To/From Route Direction Posted Speed Posted Speed Cross Street Latitude Longitude Cell Phone In Use Flag Other Property Damage Reporting Badge No.
0 2008010108-026816 ATLANTIC ABSECON CITY 03/04/2008 TU 1539 01 ATLANTIC CITY AIU 0 ... 50 39.41158 74.49162 N NONE ... 384
1 2008010108-163190 ATLANTIC ABSECON CITY 12/19/2008 F 1114 01 ATLANTIC CITY TRAFFIC 0 ... 50 39.39231 74.48952 N NONE ... 739
2 2008010108-24779 ATLANTIC ABSECON CITY 11/25/2008 TU 0345 99 NJ TRANSIT P.D. ATLANTIC CITY 0 ... 10 25 N ? ... 0053
3 2008010108-3901 ATLANTIC ABSECON CITY 03/31/2008 M 0105 01 EAST WINDSOR TRAFFIC UNIT 0 ... 0 N NONE ... 551
4 2008010108-5016 ATLANTIC ABSECON CITY 01/25/2008 F 0942 01 EGG HARBOR TWP HQ 0 ... 50 40 39.43036 74.52469 N NONE ... 1571

5 rows × 47 columns

It's a table!

To see what we have here, let's list the columns we have to work with, using the DataFrame's .columns attribute.

In [4]:
#this returns an Index, a list-like object you can pluck items out of.
df.columns
Out[4]:
Index([u'case code', u'County Name', u'Municipality Name', u'Crash Date', u'Crash Day Of Week', u'Crash Time', u'Police Dept Code', u'Police Department', u'Police Station', u'Total Killed', u'Total Injured', u'Pedestrians Killed', u'Pedestrians Injured', u'Severity', u'Intersection', u'Alcohol Involved', u'HazMat Involved', u'Crash Type Code', u'Total Vehicles Involved', u'Crash Location', u'Location Direction', u'Route', u'Route Suffix', u'SRI (Std Rte Identifier)', u'MilePost', u'Road System', u'Road Character', u'Road Surface Type', u'Surface Condition', u'Light Condition', u'Environmental Condition', u'Road Divided By', u'Temporary Traffic Control Zone', u'Distance To Cross Street', u'Unit Of Measurement', u'Directn From Cross Street', u'Cross Street Name', u'Is Ramp', u'Ramp To/From Route Name', u'Ramp To/From Route Direction', u'Posted Speed', u'Posted Speed Cross Street', u'Latitude', u'Longitude', u'Cell Phone In Use Flag', u'Other Property Damage', u'Reporting Badge No.'], dtype='object')

To get a quick overview of what kind of data is in each column, we can try the .describe() method.

In [5]:
df.describe()
Out[5]:
Total Killed Total Injured Pedestrians Killed Pedestrians Injured Total Vehicles Involved Road System Posted Speed
count 1707841.000000 1707841.000000 1707841.000000 1707841.000000 1707841.000000 1707841.000000 1707840.000000
mean 0.001968 0.308403 0.000489 0.016251 1.878721 5.191971 31.205212
std 0.047111 0.698354 0.022238 0.130033 0.538178 2.499018 17.872276
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 2.000000 2.000000 25.000000
50% 0.000000 0.000000 0.000000 0.000000 2.000000 5.000000 30.000000
75% 0.000000 0.000000 0.000000 0.000000 2.000000 7.000000 45.000000
max 5.000000 42.000000 2.000000 10.000000 20.000000 10.000000 99.000000

In earlier versions of pandas (pre-0.15), .describe() gives you the summary statistics for only the numeric columns. In newer versions of pandas, calling .describe(include='all') will summarize all the columns.
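
A quick, self-contained illustration on a made-up frame (assuming pandas 0.15 or newer):

```python
import pandas as pd

# A small frame mixing a categorical and a numeric column.
toy = pd.DataFrame({
    "Severity": ["P", "P", "I", "P"],
    "Total Killed": [0, 0, 1, 0],
})

# include='all' summarizes every column at once: counts, unique values
# and top value for the strings, plus mean/std/quartiles for the numbers.
summary = toy.describe(include="all")
print(summary)
```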

For categorical columns, like 'Severity', you'll get the number of unique values and the most frequent.

In [6]:
df['Severity'].describe()
Out[6]:
count     1707841
unique          3
top             P
freq      1326626
dtype: object

Try it again for 'County Name'.

In [7]:
df['County Name'].describe()
Out[7]:
count          1707841
unique              21
top       MIDDLESEX   
freq            176402
dtype: object

Say you want to select a single column. You can do this in one of two ways. If the column name is a string without spaces, you can use dot notation, like df.Severity, for instance. Otherwise, you use a similar syntax to what we're used to with dicts, using brackets, like this: df['County Name']. If you want to grab more than one column, give it a list of column names.
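
Here's a small sketch of all three selection styles on a toy frame (the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "Severity": ["P", "I"],
    "County Name": ["ATLANTIC", "PASSAIC"],
})

# Dot notation works when the column name has no spaces...
s1 = toy.Severity
# ...bracket notation always works:
s2 = toy["County Name"]
# A list of names returns a smaller DataFrame:
sub = toy[["County Name", "Severity"]]
print(sub.columns.tolist())  # ['County Name', 'Severity']
```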

Now, let's take a look at cleaning messy data in columns. Why does this return an empty DataFrame?

In [8]:
df[df['County Name']=='Passaic']
Out[8]:
case code County Name Municipality Name Crash Date Crash Day Of Week Crash Time Police Dept Code Police Department Police Station Total Killed ... Is Ramp Ramp To/From Route Name Ramp To/From Route Direction Posted Speed Posted Speed Cross Street Latitude Longitude Cell Phone In Use Flag Other Property Damage Reporting Badge No.

0 rows × 47 columns

With some digging, by using something like df[df['County Name'].str.contains('PASSAIC')], you can get a DataFrame of rows that have PASSAIC in the 'County Name' column.

In [9]:
df[df['County Name'].str.contains('PASSAIC')].head()
Out[9]:
case code County Name Municipality Name Crash Date Crash Day Of Week Crash Time Police Dept Code Police Department Police Station Total Killed ... Is Ramp Ramp To/From Route Name Ramp To/From Route Direction Posted Speed Posted Speed Cross Street Latitude Longitude Cell Phone In Use Flag Other Property Damage Reporting Badge No.
242727 2008160108-0002 PASSAIC BLOOMINGDALE BORO 01/01/2008 TU 0052 01 BLOOMINGDALE 0 ... 35 35 N SOUTHBOUND GUARDRAIL ... 5
242728 2008160108-0083 PASSAIC BLOOMINGDALE BORO 01/09/2008 W 0550 01 BLOOMINGDALE 0 ... 0 N NONE ... 11
242729 2008160108-0155 PASSAIC BLOOMINGDALE BORO 01/17/2008 TH 0127 01 BLOOMINGDALE 0 ... 40 25 41.01790 74.31252 N UTILITY POLE # BMP560B MAILBOX FENCE CEMENT RE... 11
242730 2008160108-0159 PASSAIC BLOOMINGDALE BORO 01/17/2008 TH 0909 01 BLOOMINGDALE 0 ... 35 25 41.01119 74.32027 N NONE ... 10
242731 2008160108-0166 PASSAIC BLOOMINGDALE BORO 01/17/2008 TH 1814 01 BLOOMINGDALE 0 ... 35 25 41.03530 74.34463 N NONE ... 11

5 rows × 47 columns

When we select a single row with the .ix[] indexer, you'll notice that the 'County Name' column includes a bunch of trailing whitespace: 'PASSAIC '. Normally, in Python, you might solve this by writing a for loop to cycle through every item in the column and clean it up one at a time.

In [10]:
df['County Name'][df['County Name'].str.contains('PASSAIC')].ix[242727]
Out[10]:
'PASSAIC     '

But with pandas, we can do the same thing much more concisely. We'll use the .map() function to apply the .strip() string method to every value in the column in one pass.

In [11]:
df['County Name']=df['County Name'].map(str.strip)

Once our 'County Name' field is cleaned, we'll be able to filter the table by its exact values, returning a view of the DataFrame that only shows rows for accidents that happened in Passaic County. We'll do that in a moment.

You can try something similar by finding Police Departments that contain the string BLOOMINGDALE and cleaning them up with the strip function, if you want to.

In [12]:
df['Police Department'][df['Police Department'].str.contains('BLOOMINGDALE')]
Out[12]:
242727    BLOOMINGDALE             
242728    BLOOMINGDALE             
242729    BLOOMINGDALE             
242730    BLOOMINGDALE             
242731    BLOOMINGDALE             
242732    BLOOMINGDALE             
242733    BLOOMINGDALE             
242734    BLOOMINGDALE             
242735    BLOOMINGDALE             
242736    BLOOMINGDALE             
242737    BLOOMINGDALE             
242738    BLOOMINGDALE             
242739    BLOOMINGDALE             
242740    BLOOMINGDALE             
242741    BLOOMINGDALE             
...
1650731    BLOOMINGDALE BORO PD     
1650732    BLOOMINGDALE BORO PD     
1650733    BLOOMINGDALE BORO PD     
1650734    BLOOMINGDALE BORO PD     
1650735    BLOOMINGDALE BORO PD     
1650736    BLOOMINGDALE BORO PD     
1650737    BLOOMINGDALE BORO PD     
1650738    BLOOMINGDALE BORO PD     
1650739    BLOOMINGDALE BORO PD     
1650740    BLOOMINGDALE BORO PD     
1650741    BLOOMINGDALE BORO PD     
1650742    BLOOMINGDALE BORO PD     
1650743    BLOOMINGDALE BORO PD     
1650744    BLOOMINGDALE BORO PD     
1665827    BLOOMINGDALE BORO PD     
Name: Police Department, Length: 844, dtype: object
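
If you do want to try the cleanup, here's a sketch on a toy Series with the same kind of trailing whitespace; on the real data you'd assign the result back to df['Police Department']:

```python
import pandas as pd

# Toy stand-in for the 'Police Department' column.
depts = pd.Series(["BLOOMINGDALE             ", "BLOOMINGDALE BORO PD     "])

cleaned = depts.map(str.strip)
print(cleaned.tolist())  # ['BLOOMINGDALE', 'BLOOMINGDALE BORO PD']

# On the full DataFrame:
# df['Police Department'] = df['Police Department'].map(str.strip)
```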

Now that the whitespace is stripped from 'County Name', a simple == comparison returns the rows for accidents that happened in Passaic County.

In [13]:
df[df['County Name']=='PASSAIC']
Out[13]:
case code County Name Municipality Name Crash Date Crash Day Of Week Crash Time Police Dept Code Police Department Police Station Total Killed ... Is Ramp Ramp To/From Route Name Ramp To/From Route Direction Posted Speed Posted Speed Cross Street Latitude Longitude Cell Phone In Use Flag Other Property Damage Reporting Badge No.
242727 2008160108-0002 PASSAIC BLOOMINGDALE BORO 01/01/2008 TU 0052 01 BLOOMINGDALE 0 ... 35 35 N SOUTHBOUND GUARDRAIL ... 5
242728 2008160108-0083 PASSAIC BLOOMINGDALE BORO 01/09/2008 W 0550 01 BLOOMINGDALE 0 ... 0 N NONE ... 11
242729 2008160108-0155 PASSAIC BLOOMINGDALE BORO 01/17/2008 TH 0127 01 BLOOMINGDALE 0 ... 40 25 41.01790 74.31252 N UTILITY POLE # BMP560B MAILBOX FENCE CEMENT RE... 11
242730 2008160108-0159 PASSAIC BLOOMINGDALE BORO 01/17/2008 TH 0909 01 BLOOMINGDALE 0 ... 35 25 41.01119 74.32027 N NONE ... 10
242731 2008160108-0166 PASSAIC BLOOMINGDALE BORO 01/17/2008 TH 1814 01 BLOOMINGDALE 0 ... 35 25 41.03530 74.34463 N NONE ... 11
242732 2008160108-0194 PASSAIC BLOOMINGDALE BORO 01/21/2008 M 1318 01 BLOOMINGDALE 0 ... 0 N NONE ... 11
242733 2008160108-0273 PASSAIC BLOOMINGDALE BORO 02/01/2008 F 0905 01 BLOOMINGDALE 0 ... 35 35 N NONE ... 12
242734 2008160108-0307 PASSAIC BLOOMINGDALE BORO 02/05/2008 TU 1459 01 BLOOMINGDALE 0 ... 0 N NONE ... 11
242735 2008160108-0310 PASSAIC BLOOMINGDALE BORO 02/05/2008 TU 1728 01 BLOOMINGDALE 0 ... 35 25 41.00775 74.33502 N NONE ... 8
242736 2008160108-0330 PASSAIC BLOOMINGDALE BORO 02/08/2008 F 1309 01 BLOOMINGDALE 0 ... 35 25 41.00741 74.34208 N NONE ... 8
242737 2008160108-0360 PASSAIC BLOOMINGDALE BORO 02/11/2008 M 1835 01 BLOOMINGDALE 0 ... 35 25 41.00690 74.34344 N ? ... 6
242738 2008160108-0365 PASSAIC BLOOMINGDALE BORO 02/12/2008 TU 1221 01 BLOOMINGDALE 0 ... 30 35 N NONE ... 12
242739 2008160108-0369 PASSAIC BLOOMINGDALE BORO 02/12/2008 TU 1823 01 BLOOMINGDALE 0 ... 35 15 41.01493 74.35526 N ? ... 6
242740 2008160108-0370 PASSAIC BLOOMINGDALE BORO 02/12/2008 TU 2217 01 BLOOMINGDALE 0 ... 35 30 41.00387 74.33420 N NONE ...
242741 2008160108-0381 PASSAIC BLOOMINGDALE BORO 02/14/2008 TH 0808 01 BLOOMINGDALE 0 ... 35 25 41.01119 74.32027 N NONE ... 11
242742 2008160108-0393 PASSAIC BLOOMINGDALE BORO 02/16/2008 SA 0748 01 BLOOMINGDALE 0 ... 25 25 N NONE ... 12
242743 2008160108-0448 PASSAIC BLOOMINGDALE BORO 02/21/2008 TH 1122 01 BLOOMINGDALE 0 ... 35 41.00673 74.35107 N NONE ... 9
242744 2008160108-0457 PASSAIC BLOOMINGDALE BORO 02/22/2008 F 1228 01 BLOOMINGDALE 0 ... 35 30 41.00141 74.32169 N NONE ... 10
242745 2008160108-0476 PASSAIC BLOOMINGDALE BORO 02/25/2008 M 1142 01 BLOOMINGDALE 0 ... 35 25 41.00690 74.34344 N ? ... 6
242746 2008160108-0491 PASSAIC BLOOMINGDALE BORO 02/27/2008 W 0133 01 BLOOMINGDALE 0 ... 25 25 N NONE NOTED ... 5
242747 2008160108-0504 PASSAIC BLOOMINGDALE BORO 02/28/2008 TH 2020 01 BLOOMINGDALE 0 ... 0 N UNKNOWN ... 8
242748 2008160108-0538 PASSAIC BLOOMINGDALE BORO 03/05/2008 W 0806 01 BLOOMINGDALE 0 ... 30 35 41.00384 74.33426 N NONE ... 9
242749 2008160108-0619 PASSAIC BLOOMINGDALE BORO 03/14/2008 F 0708 01 BLOOMINGDALE 0 ... 35 25 41.03262 74.34254 N KAMPFE LAKE ASSOCIATION PROPERTY-TREES/LANDSCA... 6
242750 2008160108-0640 PASSAIC BLOOMINGDALE BORO 03/15/2008 SA 1537 01 BLOOMINGDALE 0 ... 25 N NONE ... 9
242751 2008160108-0687 PASSAIC BLOOMINGDALE BORO 03/22/2008 SA 0431 01 BLOOMINGDALE 0 ... 25 35 N UTILITY POLE# BMP331P WAS STRUCK AND SUSTAINED... 10
242752 2008160108-0719 PASSAIC BLOOMINGDALE BORO 03/26/2008 W 1340 01 BLOOMINGDALE 0 ... 30 25 41.00247 74.32864 N NONE ... 8
242753 2008160108-0743 PASSAIC BLOOMINGDALE BORO 03/28/2008 F 0700 01 BLOOMINGDALE 0 ... 35 35 N DAMAGE TO LANDSCAPING OF 587 GLENWILD AVE. ...
242754 2008160108-0762 PASSAIC BLOOMINGDALE BORO 03/29/2008 SA 01 BLOOMINGDALE 0 ... 25 35 N NONE ...
242755 2008160108-0797 PASSAIC BLOOMINGDALE BORO 04/02/2008 W 0814 01 BLOOMINGDALE 0 ... 35 15 N NONE ...
242756 2008160108-0798 PASSAIC BLOOMINGDALE BORO 04/02/2008 W 0908 01 BLOOMINGDALE 0 ... 35 35 N NONE ... 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1669227 20131616B060-2013-01720A PASSAIC WOODLAND PARK BORO 07/11/2013 TH 2240 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 6732
1669228 20131616B060-2013-01772A PASSAIC WOODLAND PARK BORO 07/17/2013 W 1827 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7190
1669229 20131616B060-2013-01787A PASSAIC WOODLAND PARK BORO 07/17/2013 W 1107 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 6701
1669230 20131616B060-2013-01848A PASSAIC WOODLAND PARK BORO 07/24/2013 W 1840 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 5349
1669231 20131616B060-2013-01865A PASSAIC WOODLAND PARK BORO 07/26/2013 F 2014 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7106
1669232 20131616B060-2013-01869A PASSAIC WOODLAND PARK BORO 07/27/2013 SA 1641 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 6227
1669233 20131616B060-2013-01922A PASSAIC WOODLAND PARK BORO 08/01/2013 TH 0053 2 NEW JERSEY STATE POLICE TOTOWA-SUB 1 ... 55 N NONE. ... 7238
1669234 20131616B060-2013-01923A PASSAIC WOODLAND PARK BORO 08/01/2013 TH 1104 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... F JACKSON AVE EB 65 N NONE ... 6962
1669235 20131616B060-2013-01937A PASSAIC WOODLAND PARK BORO 08/02/2013 F 1818 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N NONE ... 6185
1669236 20131616B060-2013-01963A PASSAIC WOODLAND PARK BORO 08/05/2013 M 2119 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N NONE ... 7055
1669237 20131616B060-2013-01964A PASSAIC WOODLAND PARK BORO 08/05/2013 M 2119 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 65 N NONE ... 6962
1669238 20131616B060-2013-01966A PASSAIC WOODLAND PARK BORO 08/06/2013 TU 0005 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N APPROXIMATELY 15 FEET OF GUARDRAIL NJDOT TRE... 6084
1669239 20131616B060-2013-02030A PASSAIC WOODLAND PARK BORO 08/12/2013 M 1733 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7071
1669240 20131616B060-2013-02047A PASSAIC WOODLAND PARK BORO 08/14/2013 W 0618 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 6701
1669241 20131616B060-2013-02131A PASSAIC WOODLAND PARK BORO 08/25/2013 S 2239 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7106
1669242 20131616B060-2013-02146A PASSAIC WOODLAND PARK BORO 08/27/2013 TU 1016 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7167
1669243 20131616B060-2013-02184A PASSAIC WOODLAND PARK BORO 09/01/2013 S 1004 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7178
1669244 20131616B060-2013-02223A PASSAIC WOODLAND PARK BORO 09/06/2013 F 0906 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 6701
1669245 20131616B060-2013-02227A PASSAIC WOODLAND PARK BORO 09/06/2013 F 1308 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 6660
1669246 20131616B060-2013-02283A PASSAIC WOODLAND PARK BORO 09/12/2013 TH 2015 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7055
1669247 20131616B060-2013-02297A PASSAIC WOODLAND PARK BORO 09/13/2013 F 1533 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... T SQUIRRELWOOD ROAD SB 55 N ... 6969
1669248 20131616B060-2013-02306A PASSAIC WOODLAND PARK BORO 09/15/2013 S 1933 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7071
1669249 20131616B060-2013-02385A PASSAIC WOODLAND PARK BORO 09/22/2013 S 0330 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 6701
1669250 20131616B060-2013-02394A PASSAIC WOODLAND PARK BORO 09/19/2013 TH 1903 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7167
1669251 20131616B060-2013-02415A PASSAIC WOODLAND PARK BORO 09/24/2013 TU 1608 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 65 N ... 7319
1669252 20131616B060-2013-02426A PASSAIC WOODLAND PARK BORO 09/25/2013 W 0720 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7190
1669253 20131616B060-2013-02432A PASSAIC WOODLAND PARK BORO 09/26/2013 TH 0712 2 NEW JERSEY STATE POLICE TOTOWA-SUB 0 ... 55 N ... 7075
1669254 20131616PCSD13-04189 PASSAIC WOODLAND PARK BORO 07/09/2013 TU 1155 3 PASSAIC CO SHERIFFS DEPT SHERIFFS PATROL 0 ... 25 25 N ...
1669255 20131616PCSD13-04479 PASSAIC WOODLAND PARK BORO 07/21/2013 S 1037 3 PASSAIC CO SHERIFFS DEPT COUNTY PATROL 0 ... 10 25 N ... 115
1669256 20131616PCSD13-05367 PASSAIC WOODLAND PARK BORO 08/27/2013 TU 1950 3 PASSAIC CO SHERIFFS DEPT COUNTY PATROL 0 ... 25 25 N ... 949

113217 rows × 47 columns

If we're confident the 'County Name' column is as clean as it's going to be, we can turn to others. What do we see when we look at the unique values in the 'Police Dept Code' column with the .unique() method?

In [14]:
df['Police Dept Code'].unique()
Out[14]:
array(['01', '99', '  ', '02', '03', '04', 1, 99, 2, 3, 4], dtype=object)

If we filter for the police departments that use 99 as their code, we'll see that a single code is shared by many different departments (of course).

In [15]:
df[['Police Dept Code', 'Police Department']][df['Police Dept Code']==99]
Out[15]:
Police Dept Code Police Department
311564 99 PALISADES INTER. PARKWAY
311565 99 PALISADES INTER. PARKWAY
311566 99 PALISADES INTER. PARKWAY
311567 99 PALISADES INTER. PARKWAY
311568 99 PALISADES INTER. PARKWAY
311569 99 PALISADES INTER. PARKWAY
311570 99 PALISADES INTER. PARKWAY
311571 99 PALISADES INTER. PARKWAY
311572 99 PALISADES INTER. PARKWAY
311573 99 PALISADES INTER. PARKWAY
311574 99 PALISADES INTER. PARKWAY
311575 99 PALISADES INTER. PARKWAY
311709 99 NEW JERSEY TRANSIT POLICE
312227 99 NEW JERSEY TRANSIT POLICE
312415 99 NEW JERSEY TRANSIT POLICE
312641 99 NEW JERSEY TRANSIT POLICE
313347 99 NEW JERSEY TRANSIT POLICE
313348 99 NEW JERSEY TRANSIT POLICE
316175 99 NEW JERSEY TRANSIT POLICE
316534 99 PALISADES INTER. PARKWAY
316535 99 PALISADES INTER. PARKWAY
320064 99 WILLIAM PATERSON UNIVERSI
320067 99 WILLIAM PATERSON UNIVERSI
320068 99 WILLIAM PATERSON UNIVERSI
320102 99 WILLIAM PATERSON UNIVERSI
320103 99 WILLIAM PATERSON UNIVERSI
320104 99 WILLIAM PATERSON UNIVERSI
320105 99 WILLIAM PATERSON UNIVERSI
320106 99 WILLIAM PATERSON UNIVERSI
320107 99 WILLIAM PATERSON UNIVERSI
... ... ...
1701415 99 KEAN UNIVERSITY PD
1701416 99 KEAN UNIVERSITY PD
1701417 99 KEAN UNIVERSITY PD
1701418 99 KEAN UNIVERSITY PD
1701419 99 KEAN UNIVERSITY PD
1701420 99 KEAN UNIVERSITY PD
1701421 99 KEAN UNIVERSITY PD
1701422 99 KEAN UNIVERSITY PD
1701423 99 KEAN UNIVERSITY PD
1701424 99 KEAN UNIVERSITY PD
1701425 99 KEAN UNIVERSITY PD
1701426 99 KEAN UNIVERSITY PD
1701427 99 KEAN UNIVERSITY PD
1701428 99 KEAN UNIVERSITY PD
1701429 99 KEAN UNIVERSITY PD
1701430 99 KEAN UNIVERSITY PD
1701431 99 KEAN UNIVERSITY PD
1701432 99 KEAN UNIVERSITY PD
1701433 99 KEAN UNIVERSITY PD
1701434 99 KEAN UNIVERSITY PD
1701435 99 KEAN UNIVERSITY PD
1701436 99 KEAN UNIVERSITY PD
1701437 99 KEAN UNIVERSITY PD
1701438 99 KEAN UNIVERSITY PD
1701439 99 KEAN UNIVERSITY PD
1701440 99 KEAN UNIVERSITY PD
1701441 99 KEAN UNIVERSITY PD
1701442 99 KEAN UNIVERSITY PD
1701443 99 KEAN UNIVERSITY PD
1701444 99 KEAN UNIVERSITY PD

4818 rows × 2 columns

You also might notice that some of our department codes start with 0, so we have strings and integers mixed in the same column. We can use pandas' .astype() method to change the type of all of the values in that column to strings.

In [16]:
df['Police Dept Code']=df['Police Dept Code'].astype(str)
df['Police Dept Code'].unique()
Out[16]:
array(['01', '99', '  ', '02', '03', '04', '1', '2', '3', '4'], dtype=object)

You should note two things here. To actually change the value of the column in the DataFrame in place, we'll have to assign it back to itself as if we're defining a new variable. Also, if you check the unique values again, we'll see that there are now values for '01' and '1'. You may want to standardize those, but we'll leave that to you as an exercise later.
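
One possible way to tackle that exercise, sketched on a toy Series (this goes beyond the original tutorial): pad every code to two digits with .str.zfill() so '1' and '01' collapse into one value.

```python
import pandas as pd

codes = pd.Series(["01", "1", "99", "2"])

# zfill pads with leading zeros to the given width, so '1' -> '01'.
standardized = codes.str.zfill(2)
print(standardized.unique())
```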

Even after all that, we still have some empty fields containing just two spaces, so let's replace those empty values with the word "Unknown".

In [17]:
df.loc[df['Police Dept Code']=='  ', 'Police Dept Code']='Unknown'

We've done a bit of cleaning here and may want to start doing some exploratory analysis. To do that, we'll create a smaller DataFrame of just the columns that we plan to examine, and we'll name it myframe.

In [18]:
#Create a smaller frame
myframe = df[['County Name', 'Municipality Name', 'Crash Date', 'Crash Day Of Week', 'Crash Time', 'Total Killed', 'Total Injured', 'Pedestrians Killed', 'Pedestrians Injured', 'Total Vehicles Involved', 'Crash Type Code', 'Alcohol Involved', 'Environmental Condition', 'Light Condition', 'Cell Phone In Use Flag']]

We'll pass myframe along to the next exercise for aggregating, but we might want to save it for later. While we're thinking about it, let's write it out to a CSV with the shockingly named .to_csv() method.

In [19]:
myframe.to_csv('smallertab.csv')
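
A side note you'll see the effect of later: by default .to_csv() also writes the row index as an extra, unnamed column, which comes back as 'Unnamed: 0' when the file is re-read. The rest of this tutorial keeps that default (and the positional column numbering that comes with it), but if you'd rather not have the extra column, pass index=False. A toy demonstration:

```python
import io
import pandas as pd

toy = pd.DataFrame({"County Name": ["PASSAIC"], "Total Killed": [0]})

# Write to an in-memory buffer instead of a file so this runs anywhere.
buf = io.StringIO()
toy.to_csv(buf, index=False)  # index=False drops the row index column
buf.seek(0)

back = pd.read_csv(buf)
print(back.columns.tolist())  # no 'Unnamed: 0' column
```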

Digging Deeper with Pandas

We've done a bit of cleaning to our outrageously messy data, and now we want to start to explore what's actually in this data and look for more spots we have to clean.

If you were reporting on car accidents in your state, what would you want to know? Some of the questions you'd ask might include:

  • How many deaths happened in each county?
  • How many accidents have happened over time?
  • How many people are involved in car accidents?
  • What counties might be outliers in terms of deaths and injuries?

We'll start by reading our data into our DataFrame. Then we'll filter it and do a few group-bys, which should be familiar if you've ever worked with SQL. Finally, we'll create smaller DataFrames of this aggregated data to visualize.

One of the cleaning tasks we'll have to deal with is the format of the 'Crash Date' field, so we might as well import Python's standard datetime module and read our CSV now.

In [20]:
from datetime import datetime

newframe = pd.read_csv('smallertab.csv')
/Users/tommeagher/.virtualenvs/pandas/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (11) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

This should all be pretty familiar. Notice that weird warning about mixed types? Although we cleaned the types in some of the columns, others are still mixed. You can solve this by using the dtype keyword argument in the read_csv() function. It's worth looking at the docs to see all the options you can set when you read a CSV.
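
For example, on a tiny made-up CSV (the column names here are stand-ins, not the actual columns the warning flagged):

```python
import io
import pandas as pd

# Without dtype, '01' parses as a string but '1' parses as an integer,
# leaving a mixed-type column. Forcing str at read time avoids that.
csv_text = """code,flag
01,Y
1,N
"""
df2 = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})
print(df2["code"].tolist())  # ['01', '1']
```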

Now, we want to see our data aggregated by counties, so here's where pandas' .groupby() function comes in.

In [21]:
newframe.groupby('County Name')
Out[21]:
<pandas.core.groupby.DataFrameGroupBy object at 0x1034cf3d0>

Now, when we run .groupby(), it returns a DataFrameGroupBy object, which is really only good to us if we do something else with it.

So try this:

In [22]:
#What county had the most accident deaths?
newframe.groupby('County Name').sum()
Out[22]:
Unnamed: 0 Total Killed Total Injured Pedestrians Killed Pedestrians Injured Total Vehicles Involved
County Name
ATLANTIC 40555622700 187 20957 33 1104 98713
BERGEN 136762281964 166 47575 70 2879 333025
BURLINGTON 56830343320 249 23727 44 634 128223
CAMDEN 70636041684 230 36056 62 1539 178112
CAPE MAY 15954173080 48 6114 7 258 35792
CUMBERLAND 21914587233 143 11086 21 346 49772
ESSEX 128406345105 247 58229 95 5224 315666
GLOUCESTER 35575203872 156 15964 30 475 81242
HUDSON 98932551977 122 29470 44 4026 230722
HUNTERDON 19987461551 55 5835 10 95 40153
MERCER 70130920959 141 22701 39 995 153293
MIDDLESEX 154741088588 308 52131 88 2055 336919
MONMOUTH 111922290674 230 36027 52 1239 231862
MORRIS 64443161991 128 21039 26 658 150796
OCEAN 82316419082 284 30520 65 1048 176583
PASSAIC 109027719563 160 35694 50 2054 216166
SALEM 9398429064 72 3582 4 55 16512
SOMERSET 71108004864 119 19076 17 630 133774
SUSSEX 22863294057 65 7004 6 144 38997
UNION 116426554015 182 37662 66 2142 227130
WARREN 20427091377 69 6255 6 154 35105

Let's deconstruct this a bit, because we're chaining a lot together at once here. We start with our DataFrame object, called newframe. We then do .groupby(), passing it the name of the column we're aggregating on. As we said, this is reminiscent of SQL and returns the DataFrameGroupBy object. We can then run .sum() on the GroupBy object to add up the numeric columns for each county, returning a new DataFrame, which looks like the kind of Pivot Table we'd make in Excel.
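
The same chain on a frame small enough to check by hand (the numbers are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "County Name": ["PASSAIC", "PASSAIC", "BERGEN"],
    "Total Killed": [1, 0, 2],
})

# groupby() buckets the rows by county; sum() adds up each bucket's
# numeric columns, returning one row per county.
deaths = toy.groupby("County Name").sum()
print(deaths)
```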

Say we want to know which county had the most people killed in accidents over this time period. We can slice out just the 'Total Killed' column and sort it from most to least.

In [23]:
newframe.groupby('County Name').sum().iloc[:,1].order(ascending=False)
Out[23]:
County Name
MIDDLESEX      308
OCEAN          284
BURLINGTON     249
ESSEX          247
CAMDEN         230
MONMOUTH       230
ATLANTIC       187
UNION          182
BERGEN         166
PASSAIC        160
GLOUCESTER     156
CUMBERLAND     143
MERCER         141
MORRIS         128
HUDSON         122
SOMERSET       119
SALEM           72
WARREN          69
SUSSEX          65
HUNTERDON       55
CAPE MAY        48
Name: Total Killed, dtype: int64

We select the 'Total Killed' column here using .iloc[], which takes two positional selectors: rows first, then columns. Since we want all rows, we use the standard Python slicing notation of a bare colon. The column selector is 1, because 'Total Killed' is the second column (column 0 is the leftover 'Unnamed: 0' index column from the CSV we saved), and Python counts from 0.

Finally, the .order() function sorts the resulting Series; we pass it the ascending keyword with a value of False (it defaults to True) to put the largest figures at the top of the list.
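
Here's the same .iloc[] and sort pattern on a toy frame. One caveat: .order() was deprecated and later removed from pandas; from version 0.17 on, .sort_values() is the equivalent call, so that's what this sketch uses:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Total Injured": [5, 2], "Total Killed": [1, 3]},
    index=["BERGEN", "PASSAIC"],
)

# .iloc[:, 1] -> every row, second column (counting from 0).
killed = toy.iloc[:, 1]

# .sort_values(ascending=False) is the modern spelling of
# .order(ascending=False): biggest numbers first.
ranked = killed.sort_values(ascending=False)
print(ranked.index.tolist())  # ['PASSAIC', 'BERGEN']
```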

We can assign that to a new variable that we'll use later to make a bar chart, and move on for now.

In [24]:
countydeaths = newframe.groupby('County Name').sum().iloc[:,1].order(ascending=False)

Now we want to aggregate the accidents by date. But you may have noticed that the dates here are strings and not actual Python date objects. This means when you try to sort them, you get 01/01/2008, 01/01/2009, 01/01/2010, and so on. We need to convert the strings to actual Python dates.

In [25]:
#.apply() runs a function on each item in the column; lambda creates an unnamed, one-line function for it to run
newframe['Crash Date']=newframe['Crash Date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y").date())

What did we do here? We take the 'Crash Date' column and use the .apply() function to run an operation on the value in every row. We hand .apply() a lambda, Python's way of writing an unnamed, one-line function. With x standing in for the date string in each row, the lambda calls the .strptime() function from Python's datetime module. The format string "%m/%d/%Y" says to take a two-digit month, a two-digit day and a four-digit year, divided by slashes, and convert them to a datetime object. Finally, .date() keeps just the date part of that object, which is what lands back in the row. That's a mouthful.
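
You can see what the lambda does by running .strptime() on a single value:

```python
from datetime import datetime

raw = "03/04/2008"
parsed = datetime.strptime(raw, "%m/%d/%Y")  # a full datetime, at midnight
just_date = parsed.date()                    # keep only the date part
print(just_date)  # 2008-03-04

# pandas can also convert a whole column at once with pd.to_datetime(),
# an alternative worth knowing about.
```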

Now that we've done that, we can aggregate the crashes by real, sortable dates and count how many accidents happened on each date. Then we'll assign this to a new variable for the resulting DataFrame that we'll use later to make a line chart.

In [26]:
crashesbydate = newframe.groupby('Crash Date').count().iloc[:,0]

So far we've just been transforming and tweaking data that already exists in our dataframe. We can also calculate new data and add it as a column. Here we can create a column for the total number of people involved in an accident by adding together the 'Total Killed' and 'Total Injured' columns. This syntax works similarly to creating a new key in a Python dict.

In [27]:
newframe['Total Involved']=newframe['Total Killed']+newframe['Total Injured']

With this 'Total Involved' value, we can create a histogram later.

In [28]:
newframe['Crash Day Of Week']=newframe['Crash Day Of Week'].map(str.strip)

Here, we're using .map() to call Python's built-in str.strip on every value in the column, which is a Series object.

Finally, let's create a DataFrame of all the columns aggregated by county so we can make a scatter plot of the counties by total killed and pedestrians killed.

In [29]:
countyframe = newframe.groupby('County Name').sum()

With these new DataFrames, we have several slices of our data. Let's see what it looks like by visualizing it.

Basic plots with matplotlib

First, we need to launch pylab's inline mode to get the plots to render inside our notebook. You can do this at startup by running ipython notebook --pylab=inline, or you can do it inside the notebook with the %pylab magic.

In [30]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['datetime']
`%matplotlib` prevents importing * from pylab and numpy

Next, we import matplotlib's pyplot module and call it plt.

In [31]:
import matplotlib.pyplot as plt

If we want to make a bar chart of the number of deaths by county, from the countydeaths DataFrame we created above, we can use the .plot() function and pass it the kind='bar' keyword argument.

These plots are basically pandas' wrapper around the matplotlib library. There are many other things you can do with the wrapper and matplotlib itself, and it's worth giving the documentation a read.

In [32]:
countydeaths.plot(kind='bar')
Out[32]:
<matplotlib.axes.AxesSubplot at 0x115da9590>

If we want to do a line graph, we can use the .plot() function on a DataFrame. The default kind is a line graph. So here's a graph of crashes by date.

In [33]:
crashesbydate.plot()
Out[33]:
<matplotlib.axes.AxesSubplot at 0x115d97f50>

Say you want to look at the distribution of the number of people injured or killed in accidents. You can run the .hist() function on a column in the DataFrame and see that in the vast majority of accidents, no one was hurt.

In [34]:
newframe['Total Involved'].hist(bins=50)
Out[34]:
<matplotlib.axes.AxesSubplot at 0x115ec28d0>

Finally, for now, we can scatter plot our county data by the total number of people killed versus the number of pedestrians killed.

In [35]:
#let's scatter plot county name by total killed and pedestrians killed.
countyframe.plot(kind='scatter', x='Total Killed', y='Pedestrians Killed')
Out[35]:
<matplotlib.axes.AxesSubplot at 0x11630d350>

I hope this was helpful. If you have questions or spot bugs, you can find me on GitHub, via email or on Twitter.