This iPython Notebook was created as a companion to the "Advanced Python for Data Analysis" tutorial and class taught at Investigative Reporters and Editors' Computer-Assisted Reporting conference in March, 2015 in Atlanta, Ga.
Now that we're comfortable with our coding environment, and we've dabbled with the iPython notebook, it's time to dive into pandas, the Python library created explicitly for data analysis.
First in our interactive iPython notebook, we'll import pandas and call it "pd", a convention of pandas users that you'll quickly grow accustomed to.
import pandas as pd
Today, we're going to work with a fairly large text file of car accident data from the New Jersey Department of Transportation. This file contains reports from accidents in the Garden State between 2008 and 2013. The data, originally filed in handwritten reports by state troopers and then typed into a fixed-width database by clerks, is both decently large and very messy. (As an aside, if you're really interested in the genesis of this particular file, you can browse the makefile we used to create it.)
Our analysis today will not be comprehensive or particularly accurate, but working with 1.7 million rows of dirty data is a pretty good way to illustrate what you can do with pandas and how it's going to make your life easier.
The first thing we have to do is ingest our data from a source into a pandas DataFrame object. Think of a DataFrame (similar to R's Data Frame) as a virtual spreadsheet. It has columns with unique names and rows with unique row numbers, which we call "the Index". You can read in data from many kinds of sources: json, Excel files, html on the web, sql databases or others. Today, we're going to work with a CSV. First, we're going to create a variable with the name of our CSV, then we'll use pandas' .read_csv() function.
datafile = "njaccidents.csv"
df = pd.read_csv(datafile)
/Users/tommeagher/.virtualenvs/pandas/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (6,17) have mixed types. Specify dtype option on import or set low_memory=False. data = self._reader.read(nrows)
Now we have a DataFrame object with the name df
. Let's use the .head()
method to look at the first five rows of data. Does it look familliar?
df.head()
case code | County Name | Municipality Name | Crash Date | Crash Day Of Week | Crash Time | Police Dept Code | Police Department | Police Station | Total Killed | ... | Is Ramp | Ramp To/From Route Name | Ramp To/From Route Direction | Posted Speed | Posted Speed Cross Street | Latitude | Longitude | Cell Phone In Use Flag | Other Property Damage | Reporting Badge No. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008010108-026816 | ATLANTIC | ABSECON CITY | 03/04/2008 | TU | 1539 | 01 | ATLANTIC CITY | AIU | 0 | ... | 50 | 39.41158 | 74.49162 | N | NONE ... | 384 | ||||
1 | 2008010108-163190 | ATLANTIC | ABSECON CITY | 12/19/2008 | F | 1114 | 01 | ATLANTIC CITY | TRAFFIC | 0 | ... | 50 | 39.39231 | 74.48952 | N | NONE ... | 739 | ||||
2 | 2008010108-24779 | ATLANTIC | ABSECON CITY | 11/25/2008 | TU | 0345 | 99 | NJ TRANSIT P.D. | ATLANTIC CITY | 0 | ... | 10 | 25 | N | ? ... | 0053 | |||||
3 | 2008010108-3901 | ATLANTIC | ABSECON CITY | 03/31/2008 | M | 0105 | 01 | EAST WINDSOR | TRAFFIC UNIT | 0 | ... | 0 | N | NONE ... | 551 | ||||||
4 | 2008010108-5016 | ATLANTIC | ABSECON CITY | 01/25/2008 | F | 0942 | 01 | EGG HARBOR TWP | HQ | 0 | ... | 50 | 40 | 39.43036 | 74.52469 | N | NONE ... | 1571 |
5 rows × 47 columns
It's a table!
To see what we have here, let's make a list of the columns we have to work with the DataFrame's .columns
attribute.
#this is a list that you can pluck items out of.
df.columns
Index([u'case code', u'County Name', u'Municipality Name', u'Crash Date', u'Crash Day Of Week', u'Crash Time', u'Police Dept Code', u'Police Department', u'Police Station', u'Total Killed', u'Total Injured', u'Pedestrians Killed', u'Pedestrians Injured', u'Severity', u'Intersection', u'Alcohol Involved', u'HazMat Involved', u'Crash Type Code', u'Total Vehicles Involved', u'Crash Location', u'Location Direction', u'Route', u'Route Suffix', u'SRI (Std Rte Identifier)', u'MilePost', u'Road System', u'Road Character', u'Road Surface Type', u'Surface Condition', u'Light Condition', u'Environmental Condition', u'Road Divided By', u'Temporary Traffic Control Zone', u'Distance To Cross Street', u'Unit Of Measurement', u'Directn From Cross Street', u'Cross Street Name', u'Is Ramp', u'Ramp To/From Route Name', u'Ramp To/From Route Direction', u'Posted Speed', u'Posted Speed Cross Street', u'Latitude', u'Longitude', u'Cell Phone In Use Flag', u'Other Property Damage', u'Reporting Badge No.'], dtype='object')
To get a quick overview of what kind of data is in each column, we can try the .describe()
method.
df.describe()
Total Killed | Total Injured | Pedestrians Killed | Pedestrians Injured | Total Vehicles Involved | Road System | Posted Speed | |
---|---|---|---|---|---|---|---|
count | 1707841.000000 | 1707841.000000 | 1707841.000000 | 1707841.000000 | 1707841.000000 | 1707841.000000 | 1707840.000000 |
mean | 0.001968 | 0.308403 | 0.000489 | 0.016251 | 1.878721 | 5.191971 | 31.205212 |
std | 0.047111 | 0.698354 | 0.022238 | 0.130033 | 0.538178 | 2.499018 | 17.872276 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 25.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 5.000000 | 30.000000 |
75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 7.000000 | 45.000000 |
max | 5.000000 | 42.000000 | 2.000000 | 10.000000 | 20.000000 | 10.000000 | 99.000000 |
In earlier versions of pandas (pre-15), .describe()
gives you the summary statistics for the numeric columns. In newer versions of pandas, using the .describe(include='all')
keyword argument will summarize all the columns.
For categorical columns, like 'Severity'
, you'll get the number of unique values and the most frequent.
df['Severity'].describe()
count 1707841 unique 3 top P freq 1326626 dtype: object
Try it again for 'County Name'
.
df['County Name'].describe()
count 1707841 unique 21 top MIDDLESEX freq 176402 dtype: object
Say you want to select a single column. You can do this in one of two ways. If the column name is a string without spaces, you can use dot notation, like df.Severity
, for instance. Otherwise, you use a similar syntax to what we're used to with dicts, using brackets, like this: df['County Name']
. If you want to grab more than one column, give it a list of column names.
Now, let's take a look at cleaning messy data in columns. Why does this return an empty DataFrame?
df[df['County Name']=='Passaic']
case code | County Name | Municipality Name | Crash Date | Crash Day Of Week | Crash Time | Police Dept Code | Police Department | Police Station | Total Killed | ... | Is Ramp | Ramp To/From Route Name | Ramp To/From Route Direction | Posted Speed | Posted Speed Cross Street | Latitude | Longitude | Cell Phone In Use Flag | Other Property Damage | Reporting Badge No. |
---|
0 rows × 47 columns
With some digging, by using something like df[df['County Name'].str.contains('PASSAIC')]
, you can get a DataFrame of rows that have PASSAIC
in the 'County Name'
column.
df[df['County Name'].str.contains('PASSAIC')].head()
case code | County Name | Municipality Name | Crash Date | Crash Day Of Week | Crash Time | Police Dept Code | Police Department | Police Station | Total Killed | ... | Is Ramp | Ramp To/From Route Name | Ramp To/From Route Direction | Posted Speed | Posted Speed Cross Street | Latitude | Longitude | Cell Phone In Use Flag | Other Property Damage | Reporting Badge No. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
242727 | 2008160108-0002 | PASSAIC | BLOOMINGDALE BORO | 01/01/2008 | TU | 0052 | 01 | BLOOMINGDALE | 0 | ... | 35 | 35 | N | SOUTHBOUND GUARDRAIL ... | 5 | ||||||
242728 | 2008160108-0083 | PASSAIC | BLOOMINGDALE BORO | 01/09/2008 | W | 0550 | 01 | BLOOMINGDALE | 0 | ... | 0 | N | NONE ... | 11 | |||||||
242729 | 2008160108-0155 | PASSAIC | BLOOMINGDALE BORO | 01/17/2008 | TH | 0127 | 01 | BLOOMINGDALE | 0 | ... | 40 | 25 | 41.01790 | 74.31252 | N | UTILITY POLE # BMP560B MAILBOX FENCE CEMENT RE... | 11 | ||||
242730 | 2008160108-0159 | PASSAIC | BLOOMINGDALE BORO | 01/17/2008 | TH | 0909 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.01119 | 74.32027 | N | NONE ... | 10 | ||||
242731 | 2008160108-0166 | PASSAIC | BLOOMINGDALE BORO | 01/17/2008 | TH | 1814 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.03530 | 74.34463 | N | NONE ... | 11 |
5 rows × 47 columns
When we select a single row with the .ix[]
index function, you'll notice that the 'County Name'
column includes a bunch of trailing whitespace: 'PASSAIC '
. Normally, in Python, you might solve this by writing a for loop to cycle through every item in the column and clean it up one at a time.
df['County Name'][df['County Name'].str.contains('PASSAIC')].ix[242727]
'PASSAIC '
But with Pandas, we can do the same thing much faster. We'll use the .map()
function to perform the .strip()
method on every string in the column at the same time.
df['County Name']=df['County Name'].map(str.strip)
Once our 'County Name'
field is cleaned, we can filter the table by its values, returning a view of the DataFrame that only shows rows with accidents that happened in Passaic County.
You can try something similar by finding Police Departments that contain the string BLOOMINGDALE
and cleaning them up with the strip function, if you want to.
df['Police Department'][df['Police Department'].str.contains('BLOOMINGDALE')]
242727 BLOOMINGDALE 242728 BLOOMINGDALE 242729 BLOOMINGDALE 242730 BLOOMINGDALE 242731 BLOOMINGDALE 242732 BLOOMINGDALE 242733 BLOOMINGDALE 242734 BLOOMINGDALE 242735 BLOOMINGDALE 242736 BLOOMINGDALE 242737 BLOOMINGDALE 242738 BLOOMINGDALE 242739 BLOOMINGDALE 242740 BLOOMINGDALE 242741 BLOOMINGDALE ... 1650731 BLOOMINGDALE BORO PD 1650732 BLOOMINGDALE BORO PD 1650733 BLOOMINGDALE BORO PD 1650734 BLOOMINGDALE BORO PD 1650735 BLOOMINGDALE BORO PD 1650736 BLOOMINGDALE BORO PD 1650737 BLOOMINGDALE BORO PD 1650738 BLOOMINGDALE BORO PD 1650739 BLOOMINGDALE BORO PD 1650740 BLOOMINGDALE BORO PD 1650741 BLOOMINGDALE BORO PD 1650742 BLOOMINGDALE BORO PD 1650743 BLOOMINGDALE BORO PD 1650744 BLOOMINGDALE BORO PD 1665827 BLOOMINGDALE BORO PD Name: Police Department, Length: 844, dtype: object
Once our 'County Name' field is cleaned, we can filter the table by its values, returning a view of the DataFrame that only shows rows with accidents that happened in Passaic County, with a comparison using ==
.
df[df['County Name']=='PASSAIC']
case code | County Name | Municipality Name | Crash Date | Crash Day Of Week | Crash Time | Police Dept Code | Police Department | Police Station | Total Killed | ... | Is Ramp | Ramp To/From Route Name | Ramp To/From Route Direction | Posted Speed | Posted Speed Cross Street | Latitude | Longitude | Cell Phone In Use Flag | Other Property Damage | Reporting Badge No. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
242727 | 2008160108-0002 | PASSAIC | BLOOMINGDALE BORO | 01/01/2008 | TU | 0052 | 01 | BLOOMINGDALE | 0 | ... | 35 | 35 | N | SOUTHBOUND GUARDRAIL ... | 5 | ||||||
242728 | 2008160108-0083 | PASSAIC | BLOOMINGDALE BORO | 01/09/2008 | W | 0550 | 01 | BLOOMINGDALE | 0 | ... | 0 | N | NONE ... | 11 | |||||||
242729 | 2008160108-0155 | PASSAIC | BLOOMINGDALE BORO | 01/17/2008 | TH | 0127 | 01 | BLOOMINGDALE | 0 | ... | 40 | 25 | 41.01790 | 74.31252 | N | UTILITY POLE # BMP560B MAILBOX FENCE CEMENT RE... | 11 | ||||
242730 | 2008160108-0159 | PASSAIC | BLOOMINGDALE BORO | 01/17/2008 | TH | 0909 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.01119 | 74.32027 | N | NONE ... | 10 | ||||
242731 | 2008160108-0166 | PASSAIC | BLOOMINGDALE BORO | 01/17/2008 | TH | 1814 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.03530 | 74.34463 | N | NONE ... | 11 | ||||
242732 | 2008160108-0194 | PASSAIC | BLOOMINGDALE BORO | 01/21/2008 | M | 1318 | 01 | BLOOMINGDALE | 0 | ... | 0 | N | NONE ... | 11 | |||||||
242733 | 2008160108-0273 | PASSAIC | BLOOMINGDALE BORO | 02/01/2008 | F | 0905 | 01 | BLOOMINGDALE | 0 | ... | 35 | 35 | N | NONE ... | 12 | ||||||
242734 | 2008160108-0307 | PASSAIC | BLOOMINGDALE BORO | 02/05/2008 | TU | 1459 | 01 | BLOOMINGDALE | 0 | ... | 0 | N | NONE ... | 11 | |||||||
242735 | 2008160108-0310 | PASSAIC | BLOOMINGDALE BORO | 02/05/2008 | TU | 1728 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.00775 | 74.33502 | N | NONE ... | 8 | ||||
242736 | 2008160108-0330 | PASSAIC | BLOOMINGDALE BORO | 02/08/2008 | F | 1309 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.00741 | 74.34208 | N | NONE ... | 8 | ||||
242737 | 2008160108-0360 | PASSAIC | BLOOMINGDALE BORO | 02/11/2008 | M | 1835 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.00690 | 74.34344 | N | ? ... | 6 | ||||
242738 | 2008160108-0365 | PASSAIC | BLOOMINGDALE BORO | 02/12/2008 | TU | 1221 | 01 | BLOOMINGDALE | 0 | ... | 30 | 35 | N | NONE ... | 12 | ||||||
242739 | 2008160108-0369 | PASSAIC | BLOOMINGDALE BORO | 02/12/2008 | TU | 1823 | 01 | BLOOMINGDALE | 0 | ... | 35 | 15 | 41.01493 | 74.35526 | N | ? ... | 6 | ||||
242740 | 2008160108-0370 | PASSAIC | BLOOMINGDALE BORO | 02/12/2008 | TU | 2217 | 01 | BLOOMINGDALE | 0 | ... | 35 | 30 | 41.00387 | 74.33420 | N | NONE ... | |||||
242741 | 2008160108-0381 | PASSAIC | BLOOMINGDALE BORO | 02/14/2008 | TH | 0808 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.01119 | 74.32027 | N | NONE ... | 11 | ||||
242742 | 2008160108-0393 | PASSAIC | BLOOMINGDALE BORO | 02/16/2008 | SA | 0748 | 01 | BLOOMINGDALE | 0 | ... | 25 | 25 | N | NONE ... | 12 | ||||||
242743 | 2008160108-0448 | PASSAIC | BLOOMINGDALE BORO | 02/21/2008 | TH | 1122 | 01 | BLOOMINGDALE | 0 | ... | 35 | 41.00673 | 74.35107 | N | NONE ... | 9 | |||||
242744 | 2008160108-0457 | PASSAIC | BLOOMINGDALE BORO | 02/22/2008 | F | 1228 | 01 | BLOOMINGDALE | 0 | ... | 35 | 30 | 41.00141 | 74.32169 | N | NONE ... | 10 | ||||
242745 | 2008160108-0476 | PASSAIC | BLOOMINGDALE BORO | 02/25/2008 | M | 1142 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.00690 | 74.34344 | N | ? ... | 6 | ||||
242746 | 2008160108-0491 | PASSAIC | BLOOMINGDALE BORO | 02/27/2008 | W | 0133 | 01 | BLOOMINGDALE | 0 | ... | 25 | 25 | N | NONE NOTED ... | 5 | ||||||
242747 | 2008160108-0504 | PASSAIC | BLOOMINGDALE BORO | 02/28/2008 | TH | 2020 | 01 | BLOOMINGDALE | 0 | ... | 0 | N | UNKNOWN ... | 8 | |||||||
242748 | 2008160108-0538 | PASSAIC | BLOOMINGDALE BORO | 03/05/2008 | W | 0806 | 01 | BLOOMINGDALE | 0 | ... | 30 | 35 | 41.00384 | 74.33426 | N | NONE ... | 9 | ||||
242749 | 2008160108-0619 | PASSAIC | BLOOMINGDALE BORO | 03/14/2008 | F | 0708 | 01 | BLOOMINGDALE | 0 | ... | 35 | 25 | 41.03262 | 74.34254 | N | KAMPFE LAKE ASSOCIATION PROPERTY-TREES/LANDSCA... | 6 | ||||
242750 | 2008160108-0640 | PASSAIC | BLOOMINGDALE BORO | 03/15/2008 | SA | 1537 | 01 | BLOOMINGDALE | 0 | ... | 25 | N | NONE ... | 9 | |||||||
242751 | 2008160108-0687 | PASSAIC | BLOOMINGDALE BORO | 03/22/2008 | SA | 0431 | 01 | BLOOMINGDALE | 0 | ... | 25 | 35 | N | UTILITY POLE# BMP331P WAS STRUCK AND SUSTAINED... | 10 | ||||||
242752 | 2008160108-0719 | PASSAIC | BLOOMINGDALE BORO | 03/26/2008 | W | 1340 | 01 | BLOOMINGDALE | 0 | ... | 30 | 25 | 41.00247 | 74.32864 | N | NONE ... | 8 | ||||
242753 | 2008160108-0743 | PASSAIC | BLOOMINGDALE BORO | 03/28/2008 | F | 0700 | 01 | BLOOMINGDALE | 0 | ... | 35 | 35 | N | DAMAGE TO LANDSCAPING OF 587 GLENWILD AVE. ... | |||||||
242754 | 2008160108-0762 | PASSAIC | BLOOMINGDALE BORO | 03/29/2008 | SA | 01 | BLOOMINGDALE | 0 | ... | 25 | 35 | N | NONE ... | ||||||||
242755 | 2008160108-0797 | PASSAIC | BLOOMINGDALE BORO | 04/02/2008 | W | 0814 | 01 | BLOOMINGDALE | 0 | ... | 35 | 15 | N | NONE ... | |||||||
242756 | 2008160108-0798 | PASSAIC | BLOOMINGDALE BORO | 04/02/2008 | W | 0908 | 01 | BLOOMINGDALE | 0 | ... | 35 | 35 | N | NONE ... | 12 | ||||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1669227 | 20131616B060-2013-01720A | PASSAIC | WOODLAND PARK BORO | 07/11/2013 | TH | 2240 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 6732 | ||||||
1669228 | 20131616B060-2013-01772A | PASSAIC | WOODLAND PARK BORO | 07/17/2013 | W | 1827 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7190 | ||||||
1669229 | 20131616B060-2013-01787A | PASSAIC | WOODLAND PARK BORO | 07/17/2013 | W | 1107 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 6701 | ||||||
1669230 | 20131616B060-2013-01848A | PASSAIC | WOODLAND PARK BORO | 07/24/2013 | W | 1840 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 5349 | ||||||
1669231 | 20131616B060-2013-01865A | PASSAIC | WOODLAND PARK BORO | 07/26/2013 | F | 2014 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7106 | ||||||
1669232 | 20131616B060-2013-01869A | PASSAIC | WOODLAND PARK BORO | 07/27/2013 | SA | 1641 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 6227 | ||||||
1669233 | 20131616B060-2013-01922A | PASSAIC | WOODLAND PARK BORO | 08/01/2013 | TH | 0053 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 1 | ... | 55 | N | NONE. ... | 7238 | ||||||
1669234 | 20131616B060-2013-01923A | PASSAIC | WOODLAND PARK BORO | 08/01/2013 | TH | 1104 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | F | JACKSON AVE | EB | 65 | N | NONE ... | 6962 | |||
1669235 | 20131616B060-2013-01937A | PASSAIC | WOODLAND PARK BORO | 08/02/2013 | F | 1818 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | NONE ... | 6185 | ||||||
1669236 | 20131616B060-2013-01963A | PASSAIC | WOODLAND PARK BORO | 08/05/2013 | M | 2119 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | NONE ... | 7055 | ||||||
1669237 | 20131616B060-2013-01964A | PASSAIC | WOODLAND PARK BORO | 08/05/2013 | M | 2119 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 65 | N | NONE ... | 6962 | ||||||
1669238 | 20131616B060-2013-01966A | PASSAIC | WOODLAND PARK BORO | 08/06/2013 | TU | 0005 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | APPROXIMATELY 15 FEET OF GUARDRAIL NJDOT TRE... | 6084 | ||||||
1669239 | 20131616B060-2013-02030A | PASSAIC | WOODLAND PARK BORO | 08/12/2013 | M | 1733 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7071 | ||||||
1669240 | 20131616B060-2013-02047A | PASSAIC | WOODLAND PARK BORO | 08/14/2013 | W | 0618 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 6701 | ||||||
1669241 | 20131616B060-2013-02131A | PASSAIC | WOODLAND PARK BORO | 08/25/2013 | S | 2239 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7106 | ||||||
1669242 | 20131616B060-2013-02146A | PASSAIC | WOODLAND PARK BORO | 08/27/2013 | TU | 1016 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7167 | ||||||
1669243 | 20131616B060-2013-02184A | PASSAIC | WOODLAND PARK BORO | 09/01/2013 | S | 1004 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7178 | ||||||
1669244 | 20131616B060-2013-02223A | PASSAIC | WOODLAND PARK BORO | 09/06/2013 | F | 0906 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 6701 | ||||||
1669245 | 20131616B060-2013-02227A | PASSAIC | WOODLAND PARK BORO | 09/06/2013 | F | 1308 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 6660 | ||||||
1669246 | 20131616B060-2013-02283A | PASSAIC | WOODLAND PARK BORO | 09/12/2013 | TH | 2015 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7055 | ||||||
1669247 | 20131616B060-2013-02297A | PASSAIC | WOODLAND PARK BORO | 09/13/2013 | F | 1533 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | T | SQUIRRELWOOD ROAD | SB | 55 | N | ... | 6969 | |||
1669248 | 20131616B060-2013-02306A | PASSAIC | WOODLAND PARK BORO | 09/15/2013 | S | 1933 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7071 | ||||||
1669249 | 20131616B060-2013-02385A | PASSAIC | WOODLAND PARK BORO | 09/22/2013 | S | 0330 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 6701 | ||||||
1669250 | 20131616B060-2013-02394A | PASSAIC | WOODLAND PARK BORO | 09/19/2013 | TH | 1903 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7167 | ||||||
1669251 | 20131616B060-2013-02415A | PASSAIC | WOODLAND PARK BORO | 09/24/2013 | TU | 1608 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 65 | N | ... | 7319 | ||||||
1669252 | 20131616B060-2013-02426A | PASSAIC | WOODLAND PARK BORO | 09/25/2013 | W | 0720 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7190 | ||||||
1669253 | 20131616B060-2013-02432A | PASSAIC | WOODLAND PARK BORO | 09/26/2013 | TH | 0712 | 2 | NEW JERSEY STATE POLICE | TOTOWA-SUB | 0 | ... | 55 | N | ... | 7075 | ||||||
1669254 | 20131616PCSD13-04189 | PASSAIC | WOODLAND PARK BORO | 07/09/2013 | TU | 1155 | 3 | PASSAIC CO SHERIFFS DEPT | SHERIFFS PATROL | 0 | ... | 25 | 25 | N | ... | ||||||
1669255 | 20131616PCSD13-04479 | PASSAIC | WOODLAND PARK BORO | 07/21/2013 | S | 1037 | 3 | PASSAIC CO SHERIFFS DEPT | COUNTY PATROL | 0 | ... | 10 | 25 | N | ... | 115 | |||||
1669256 | 20131616PCSD13-05367 | PASSAIC | WOODLAND PARK BORO | 08/27/2013 | TU | 1950 | 3 | PASSAIC CO SHERIFFS DEPT | COUNTY PATROL | 0 | ... | 25 | 25 | N | ... | 949 |
113217 rows × 47 columns
If we're confident the 'County Name'
column is as clean as it's going to be, we can turn to others. What do we see when we look at the unique values in the 'Police Dept Code'
column with the .unique()
method?
df['Police Dept Code'].unique()
array(['01', '99', ' ', '02', '03', '04', 1, 99, 2, 3, 4], dtype=object)
If we filter the column for the police departments that use '99' as their code, we'll see their names are not unique (of course).
df[['Police Dept Code', 'Police Department']][df['Police Dept Code']==99]
Police Dept Code | Police Department | |
---|---|---|
311564 | 99 | PALISADES INTER. PARKWAY |
311565 | 99 | PALISADES INTER. PARKWAY |
311566 | 99 | PALISADES INTER. PARKWAY |
311567 | 99 | PALISADES INTER. PARKWAY |
311568 | 99 | PALISADES INTER. PARKWAY |
311569 | 99 | PALISADES INTER. PARKWAY |
311570 | 99 | PALISADES INTER. PARKWAY |
311571 | 99 | PALISADES INTER. PARKWAY |
311572 | 99 | PALISADES INTER. PARKWAY |
311573 | 99 | PALISADES INTER. PARKWAY |
311574 | 99 | PALISADES INTER. PARKWAY |
311575 | 99 | PALISADES INTER. PARKWAY |
311709 | 99 | NEW JERSEY TRANSIT POLICE |
312227 | 99 | NEW JERSEY TRANSIT POLICE |
312415 | 99 | NEW JERSEY TRANSIT POLICE |
312641 | 99 | NEW JERSEY TRANSIT POLICE |
313347 | 99 | NEW JERSEY TRANSIT POLICE |
313348 | 99 | NEW JERSEY TRANSIT POLICE |
316175 | 99 | NEW JERSEY TRANSIT POLICE |
316534 | 99 | PALISADES INTER. PARKWAY |
316535 | 99 | PALISADES INTER. PARKWAY |
320064 | 99 | WILLIAM PATERSON UNIVERSI |
320067 | 99 | WILLIAM PATERSON UNIVERSI |
320068 | 99 | WILLIAM PATERSON UNIVERSI |
320102 | 99 | WILLIAM PATERSON UNIVERSI |
320103 | 99 | WILLIAM PATERSON UNIVERSI |
320104 | 99 | WILLIAM PATERSON UNIVERSI |
320105 | 99 | WILLIAM PATERSON UNIVERSI |
320106 | 99 | WILLIAM PATERSON UNIVERSI |
320107 | 99 | WILLIAM PATERSON UNIVERSI |
... | ... | ... |
1701415 | 99 | KEAN UNIVERSITY PD |
1701416 | 99 | KEAN UNIVERSITY PD |
1701417 | 99 | KEAN UNIVERSITY PD |
1701418 | 99 | KEAN UNIVERSITY PD |
1701419 | 99 | KEAN UNIVERSITY PD |
1701420 | 99 | KEAN UNIVERSITY PD |
1701421 | 99 | KEAN UNIVERSITY PD |
1701422 | 99 | KEAN UNIVERSITY PD |
1701423 | 99 | KEAN UNIVERSITY PD |
1701424 | 99 | KEAN UNIVERSITY PD |
1701425 | 99 | KEAN UNIVERSITY PD |
1701426 | 99 | KEAN UNIVERSITY PD |
1701427 | 99 | KEAN UNIVERSITY PD |
1701428 | 99 | KEAN UNIVERSITY PD |
1701429 | 99 | KEAN UNIVERSITY PD |
1701430 | 99 | KEAN UNIVERSITY PD |
1701431 | 99 | KEAN UNIVERSITY PD |
1701432 | 99 | KEAN UNIVERSITY PD |
1701433 | 99 | KEAN UNIVERSITY PD |
1701434 | 99 | KEAN UNIVERSITY PD |
1701435 | 99 | KEAN UNIVERSITY PD |
1701436 | 99 | KEAN UNIVERSITY PD |
1701437 | 99 | KEAN UNIVERSITY PD |
1701438 | 99 | KEAN UNIVERSITY PD |
1701439 | 99 | KEAN UNIVERSITY PD |
1701440 | 99 | KEAN UNIVERSITY PD |
1701441 | 99 | KEAN UNIVERSITY PD |
1701442 | 99 | KEAN UNIVERSITY PD |
1701443 | 99 | KEAN UNIVERSITY PD |
1701444 | 99 | KEAN UNIVERSITY PD |
4818 rows × 2 columns
You also might notice that some of our department codes start with 0, so we have data that is mixed between strings and integers in the same column. We can use pandas' .astype()
method to change all the type of all of the values in that column to strings.
df['Police Dept Code']=df['Police Dept Code'].astype(str)
df['Police Dept Code'].unique()
array(['01', '99', ' ', '02', '03', '04', '1', '2', '3', '4'], dtype=object)
You should note two things here. To actually change the value of the column in the DataFrame in place, we'll have to assign it back to itself as if we're defining a new variable. Also, if you check the unique values again, we'll see that there are now values for '01' and '1'. You may want to standardize those, but we'll leave that to you as an exercise later.
Even after all that, we still have some empty fields with two spaces in the string, so let's replace those empty values with the word "Unknown"
df['Police Dept Code'][df['Police Dept Code']==' ']='Unknown'
We've done a bit of cleaning here and may want to start doing some exploratory analysis. To do that, we'll create a smaller DataFrame of just the columns that we plan to examine, and we'll name it myframe.
#Create a smaller frame
myframe = df[['County Name', 'Municipality Name', 'Crash Date', 'Crash Day Of Week', 'Crash Time', 'Total Killed', 'Total Injured', 'Pedestrians Killed', 'Pedestrians Injured', 'Total Vehicles Involved', 'Crash Type Code', 'Alcohol Involved', 'Environmental Condition', 'Light Condition', 'Cell Phone In Use Flag']]
We'll pass myframe along to the next exercise for aggregating, but we might want to save it for later. While we're thinking about it, let's write it out to a CSV with the shockingly named .to_csv()
method.
myframe.to_csv('smallertab.csv')
We've done a bit of cleaning to our outrageously messy data, and now we want to start to explore what's actually in this data and look for more spots we have to clean.
If you were reporting on car accidents in your state, what would you want to know? Some of the questions you'd ask might include:
We'll start by reading our data into our DataFrame. Then we'll filter it and do a few group-bys, which should be familiar if you've ever worked with SQL. Finally, we'll create smaller DataFrames of this aggregated data to visualize.
One of the cleaning tasks we'll have to deal with is the format of the 'Crash Date' field, so we might as well import Python's standard datetime
module and read our csv now.
from datetime import datetime
newframe = pd.read_csv('smallertab.csv')
/Users/tommeagher/.virtualenvs/pandas/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (11) have mixed types. Specify dtype option on import or set low_memory=False. data = self._reader.read(nrows)
This should all be pretty familiar. Notice that weird warning about mixed types? Although we cleaned the types in some of the columns, others are still mixed. You can solve this by using the dtypes
keyword in the read_csv()
function. It's worth looking at the docs to see all the options you can set when you read a CSV.
Now, we want to see our data aggregated by counties, so here's where pandas' .groupby()
function comes in.
newframe.groupby('County Name')
<pandas.core.groupby.DataFrameGroupBy object at 0x1034cf3d0>
Now, when we run .groupby()
, it returns a DataFrameGroupBy object, which is really only good to us if we do something else with it.
So try this:
#What county had the most accident deaths?
newframe.groupby('County Name').sum()
Unnamed: 0 | Total Killed | Total Injured | Pedestrians Killed | Pedestrians Injured | Total Vehicles Involved | |
---|---|---|---|---|---|---|
County Name | ||||||
ATLANTIC | 40555622700 | 187 | 20957 | 33 | 1104 | 98713 |
BERGEN | 136762281964 | 166 | 47575 | 70 | 2879 | 333025 |
BURLINGTON | 56830343320 | 249 | 23727 | 44 | 634 | 128223 |
CAMDEN | 70636041684 | 230 | 36056 | 62 | 1539 | 178112 |
CAPE MAY | 15954173080 | 48 | 6114 | 7 | 258 | 35792 |
CUMBERLAND | 21914587233 | 143 | 11086 | 21 | 346 | 49772 |
ESSEX | 128406345105 | 247 | 58229 | 95 | 5224 | 315666 |
GLOUCESTER | 35575203872 | 156 | 15964 | 30 | 475 | 81242 |
HUDSON | 98932551977 | 122 | 29470 | 44 | 4026 | 230722 |
HUNTERDON | 19987461551 | 55 | 5835 | 10 | 95 | 40153 |
MERCER | 70130920959 | 141 | 22701 | 39 | 995 | 153293 |
MIDDLESEX | 154741088588 | 308 | 52131 | 88 | 2055 | 336919 |
MONMOUTH | 111922290674 | 230 | 36027 | 52 | 1239 | 231862 |
MORRIS | 64443161991 | 128 | 21039 | 26 | 658 | 150796 |
OCEAN | 82316419082 | 284 | 30520 | 65 | 1048 | 176583 |
PASSAIC | 109027719563 | 160 | 35694 | 50 | 2054 | 216166 |
SALEM | 9398429064 | 72 | 3582 | 4 | 55 | 16512 |
SOMERSET | 71108004864 | 119 | 19076 | 17 | 630 | 133774 |
SUSSEX | 22863294057 | 65 | 7004 | 6 | 144 | 38997 |
UNION | 116426554015 | 182 | 37662 | 66 | 2142 | 227130 |
WARREN | 20427091377 | 69 | 6255 | 6 | 154 | 35105 |
Let's deconstruct this a bit, because we're chaining a lot together at once here. We start with our DataFrame object, called newframe. We then do .groupby()
, passing it the name of the column we're aggregating on. As we said, this is reminiscent of SQL and returns the DataFrameGroupBy object. We can then run .sum()
on the GroupBy object to add up the numeric columns for each county, returning a new DataFrame, which looks like the kind of Pivot Table we'd make in Excel.
Say we want to know which county had the most people killed in accidents over this time period. We can slice out just the 'Total Killed' column and sort it from most to least.
newframe.groupby('County Name').sum().iloc[:,1].order(ascending=False)
County Name MIDDLESEX 308 OCEAN 284 BURLINGTON 249 ESSEX 247 CAMDEN 230 MONMOUTH 230 ATLANTIC 187 UNION 182 BERGEN 166 PASSAIC 160 GLOUCESTER 156 CUMBERLAND 143 MERCER 141 MORRIS 128 HUDSON 122 SOMERSET 119 SALEM 72 WARREN 69 SUSSEX 65 HUNTERDON 55 CAPE MAY 48 Name: Total Killed, dtype: int64
We select the 'Total Killed'
column here using .iloc[]
, which wants a two-item list. The first item is a slice of the rows. Since we want all rows, we'll use the standard Python slicing notation of an empty colon. The second item in the list is the columns we want, which is just column 2, or if you're counting from 0, as Python does, column 1.
Finally, the .order()
function will sort the resulting series and pass it the ascending keyword with a value of False (it defaults to True) to get the largest figures at the top of the list.
We can assign that to a new dataframe variable that we'll use later to make a bar chart and move on for now.
countydeaths = newframe.groupby('County Name').sum().iloc[:,1].order(ascending=False)
Now we want to aggregate the accidents by date. But you may have noticed that the dates here are strings and not actual Python date objects. This means when you try to sort them, you get 01/01/2008, 01/01/2009, 01/01/2010, and so on. We need convert the strings to actual Python dates.
#we use lambda, which creates an unnamed, one-line function and applies it to each item in the column
newframe['Crash Date']=newframe['Crash Date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y").date())
What did we do here? We take the 'Crash Date' column and use the .apply()
function to perform some operation across the value in every row in that column. Then we use Python's lambda, a way to write an unnamed, one-line function. If the variable for the date string in each row of the 'Crash Date' column is x, apply and lambda perform the .strptime()
function from Python's datetime module. The part in double quotes "%m/%d/%Y" says to take a string that is a two-digit month and a two-digit date and a four-digit year, divided by slashes, and convert it to a datetime object. Finally, .date()
takes just the date part of that object and returns that into the row. That's a mouthful.
Now that we've done that, we can aggregate the crashes by real, sortable dates and count how many accidents happened on each date. Then we'll assign this to a new variable for the resulting DataFrame that we'll use later to make a line chart.
crashesbydate = newframe.groupby('Crash Date').count().iloc[:,0]
So far we've just been transforming and tweaking data that already exists in our dataframe. We can also calculate new data and add it as a column. Here we can create a column for the total number of people involved in an accident by adding together the 'Total Killed' and 'Total Injured' columns. This syntax works similarly to creating a new key in a Python dict.
newframe['Total Involved']=newframe['Total Killed']+newframe['Total Injured']
With this 'Total Involved' value, we can create a histogram later.
newframe['Crash Day Of Week']=newframe['Crash Day Of Week'].map(str.strip)
Here, we're using .map()
to call the pandas built-in strip method on its string method across the column, which is a Series object
Finally, let's create a DataFrame of all the columns aggregated by county so we can make a scatter plot of the counties by total killed and pedestrians killed.
countyframe = newframe.groupby('County Name').sum()
With these new DataFrames, we have several slices of our data. Let's see what it looks like by visualizing it.
First, we need to launch the pylab mode inline to get the plots to render inside our notebook. You can do this at startup by running ipython notebook --pylab=inline
, or you can do it like this with the magic %
.
%pylab inline
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['datetime'] `%matplotlib` prevents importing * from pylab and numpy
Next, we import matplotlib's pyplot class and call it plt.
import matplotlib.pyplot as plt
If we want to make a bar chart of the number of deaths by county, from the countydeaths
DataFrame we created above, we can use the .plot()
function and pass it the kind='bar'
keyword argument.
These plots are basically pandas' wrapper around the matplotlib library. There are many other things you can do with the wrapper and matplotlib itself, and it's worth giving the documentation a read.
countydeaths.plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x115da9590>
If we want to do a line graph, we can use the .plot()
function on a DataFrame. The default kind
is a line graph. So here's a graph of crashes by date.
crashesbydate.plot()
<matplotlib.axes.AxesSubplot at 0x115d97f50>
newframe['Total Involved'].hist(bins=50)
<matplotlib.axes.AxesSubplot at 0x115ec28d0>
Finally, for now, we can scatter plot our county data by the total number of people killed versus the number of pedestrians killed.
#let's scatter plot county name by total killed and pedestrians killed.
countyframe.plot(kind='scatter', x='Total Killed', y='Pedestrians Killed')
<matplotlib.axes.AxesSubplot at 0x11630d350>
I hope this was helpful. If you have questions or spot bugs, you can find me on Github, via email or on Twitter.