This iPython Notebook was created as a companion to the "Advanced Python for Data Analysis" tutorial and class taught at Investigative Reporters and Editors' Computer-Assisted Reporting conference in March, 2015 in Atlanta, Ga.

A first look at Pandas¶

Now that we're comfortable with our coding environment, and we've dabbled with the iPython notebook, it's time to dive into pandas, the Python library created explicitly for data analysis.

First in our interactive iPython notebook, we'll import pandas and call it "pd", a convention of pandas users that you'll quickly grow accustomed to.

In [1]:

import pandas as pd

Today, we're going to work with a fairly large text file of car accident data from the New Jersey Department of Transportation. This file contains reports from accidents in the Garden State between 2008 and 2013. The data, originally filed in handwritten reports by state troopers and then typed into a fixed-width database by clerks, is both decently large and very messy. (As an aside, if you're really interested in the genesis of this particular file, you can browse the makefile we used to create it.)

Our analysis today will not be comprehensive or particularly accurate, but working with 1.7 million rows of dirty data is a pretty good way to illustrate what you can do with pandas and how it's going to make your life easier.

The first thing we have to do is ingest our data from a source into a pandas DataFrame object. Think of a DataFrame (similar to R's Data Frame) as a virtual spreadsheet. It has columns with unique names and rows with unique row numbers, which we call "the Index". You can read in data from many kinds of sources: json, Excel files, html on the web, sql databases or others. Today, we're going to work with a CSV. First, we're going to create a variable with the name of our CSV, then we'll use pandas' .read_csv() function.

In [2]:

datafile = "njaccidents.csv"

df = pd.read_csv(datafile)

/Users/tommeagher/.virtualenvs/pandas/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (6,17) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

Now we have a DataFrame object with the name df. Let's use the .head() method to look at the first five rows of data. Does it look familliar?

In [3]:

df.head()

Out[3]:

	case code	County Name	Municipality Name	Crash Date	Crash Day Of Week	Crash Time	Police Dept Code	Police Department	Police Station	...	Posted Speed	Posted Speed Cross Street	Latitude	Longitude	Cell Phone In Use Flag	Other Property Damage	Reporting Badge No.
0	2008010108-026816	ATLANTIC	ABSECON CITY	03/04/2008	TU	1539	01	ATLANTIC CITY	AIU	...	50		39.41158	74.49162	N	NONE ...	384
1	2008010108-163190	ATLANTIC	ABSECON CITY	12/19/2008	F	1114	01	ATLANTIC CITY	TRAFFIC	...	50		39.39231	74.48952	N	NONE ...	739
2	2008010108-24779	ATLANTIC	ABSECON CITY	11/25/2008	TU	0345	99	NJ TRANSIT P.D.	ATLANTIC CITY	...	10	25			N	? ...	0053
3	2008010108-3901	ATLANTIC	ABSECON CITY	03/31/2008	M	0105	01	EAST WINDSOR	TRAFFIC UNIT	...	0				N	NONE ...	551
4	2008010108-5016	ATLANTIC	ABSECON CITY	01/25/2008	F	0942	01	EGG HARBOR TWP	HQ	...	50	40	39.43036	74.52469	N	NONE ...	1571

5 rows × 47 columns

It's a table!

To see what we have here, let's make a list of the columns we have to work with the DataFrame's .columns attribute.

In [4]:

#this is a list that you can pluck items out of.
df.columns

Out[4]:

Index([u'case code', u'County Name', u'Municipality Name', u'Crash Date', u'Crash Day Of Week', u'Crash Time', u'Police Dept Code', u'Police Department', u'Police Station', u'Total Killed', u'Total Injured', u'Pedestrians Killed', u'Pedestrians Injured', u'Severity', u'Intersection', u'Alcohol Involved', u'HazMat Involved', u'Crash Type Code', u'Total Vehicles Involved', u'Crash Location', u'Location Direction', u'Route', u'Route Suffix', u'SRI (Std Rte Identifier)', u'MilePost', u'Road System', u'Road Character', u'Road Surface Type', u'Surface Condition', u'Light Condition', u'Environmental Condition', u'Road Divided By', u'Temporary Traffic Control Zone', u'Distance To Cross Street', u'Unit Of Measurement', u'Directn From Cross Street', u'Cross Street Name', u'Is Ramp', u'Ramp To/From Route Name', u'Ramp To/From Route Direction', u'Posted Speed', u'Posted Speed Cross Street', u'Latitude', u'Longitude', u'Cell Phone In Use Flag', u'Other Property Damage', u'Reporting Badge No.'], dtype='object')

To get a quick overview of what kind of data is in each column, we can try the .describe() method.

In [5]:

df.describe()

Out[5]:

	Total Killed	Total Injured	Pedestrians Killed	Pedestrians Injured	Total Vehicles Involved	Road System	Posted Speed
count	1707841.000000	1707841.000000	1707841.000000	1707841.000000	1707841.000000	1707841.000000	1707840.000000
mean	0.001968	0.308403	0.000489	0.016251	1.878721	5.191971	31.205212
std	0.047111	0.698354	0.022238	0.130033	0.538178	2.499018	17.872276
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	2.000000	2.000000	25.000000
50%	0.000000	0.000000	0.000000	0.000000	2.000000	5.000000	30.000000
75%	0.000000	0.000000	0.000000	0.000000	2.000000	7.000000	45.000000
max	5.000000	42.000000	2.000000	10.000000	20.000000	10.000000	99.000000

In earlier versions of pandas (pre-15), .describe() gives you the summary statistics for the numeric columns. In newer versions of pandas, using the .describe(include='all') keyword argument will summarize all the columns.

For categorical columns, like 'Severity', you'll get the number of unique values and the most frequent.

In [6]:

df['Severity'].describe()

Out[6]:

count     1707841
unique          3
top             P
freq      1326626
dtype: object

Try it again for 'County Name'.

In [7]:

df['County Name'].describe()

Out[7]:

count          1707841
unique              21
top       MIDDLESEX   
freq            176402
dtype: object

Say you want to select a single column. You can do this in one of two ways. If the column name is a string without spaces, you can use dot notation, like df.Severity, for instance. Otherwise, you use a similar syntax to what we're used to with dicts, using brackets, like this: df['County Name']. If you want to grab more than one column, give it a list of column names.

Now, let's take a look at cleaning messy data in columns. Why does this return an empty DataFrame?

In [8]:

df[df['County Name']=='Passaic']

Out[8]:

	case code	County Name	Municipality Name	Crash Date	Crash Day Of Week	Crash Time	Police Dept Code	Police Department	Police Station	Total Killed	...	Is Ramp	Ramp To/From Route Name	Ramp To/From Route Direction	Posted Speed	Posted Speed Cross Street	Latitude	Longitude	Cell Phone In Use Flag	Other Property Damage	Reporting Badge No.

0 rows × 47 columns

With some digging, by using something like df[df['County Name'].str.contains('PASSAIC')], you can get a DataFrame of rows that have PASSAIC in the 'County Name' column.

In [9]:

df[df['County Name'].str.contains('PASSAIC')].head()

Out[9]:

	case code	County Name	Municipality Name	Crash Date	Crash Day Of Week	Crash Time	Police Dept Code	Police Department	...	Posted Speed	Posted Speed Cross Street	Latitude	Longitude	Cell Phone In Use Flag	Other Property Damage	Reporting Badge No.
242727	2008160108-0002	PASSAIC	BLOOMINGDALE BORO	01/01/2008	TU	0052	01	BLOOMINGDALE	...	35	35			N	SOUTHBOUND GUARDRAIL ...	5
242728	2008160108-0083	PASSAIC	BLOOMINGDALE BORO	01/09/2008	W	0550	01	BLOOMINGDALE	...	0				N	NONE ...	11
242729	2008160108-0155	PASSAIC	BLOOMINGDALE BORO	01/17/2008	TH	0127	01	BLOOMINGDALE	...	40	25	41.01790	74.31252	N	UTILITY POLE # BMP560B MAILBOX FENCE CEMENT RE...	11
242730	2008160108-0159	PASSAIC	BLOOMINGDALE BORO	01/17/2008	TH	0909	01	BLOOMINGDALE	...	35	25	41.01119	74.32027	N	NONE ...	10
242731	2008160108-0166	PASSAIC	BLOOMINGDALE BORO	01/17/2008	TH	1814	01	BLOOMINGDALE	...	35	25	41.03530	74.34463	N	NONE ...	11

5 rows × 47 columns

When we select a single row with the .ix[] index function, you'll notice that the 'County Name' column includes a bunch of trailing whitespace: 'PASSAIC '. Normally, in Python, you might solve this by writing a for loop to cycle through every item in the column and clean it up one at a time.

In [10]:

df['County Name'][df['County Name'].str.contains('PASSAIC')].ix[242727]

Out[10]:

'PASSAIC     '

But with Pandas, we can do the same thing much faster. We'll use the .map() function to perform the .strip() method on every string in the column at the same time.

In [11]:

df['County Name']=df['County Name'].map(str.strip)

Once our 'County Name' field is cleaned, we can filter the table by its values, returning a view of the DataFrame that only shows rows with accidents that happened in Passaic County.

You can try something similar by finding Police Departments that contain the string BLOOMINGDALE and cleaning them up with the strip function, if you want to.

In [12]:

df['Police Department'][df['Police Department'].str.contains('BLOOMINGDALE')]

Out[12]:

242727    BLOOMINGDALE             
242728    BLOOMINGDALE             
242729    BLOOMINGDALE             
242730    BLOOMINGDALE             
242731    BLOOMINGDALE             
242732    BLOOMINGDALE             
242733    BLOOMINGDALE             
242734    BLOOMINGDALE             
242735    BLOOMINGDALE             
242736    BLOOMINGDALE             
242737    BLOOMINGDALE             
242738    BLOOMINGDALE             
242739    BLOOMINGDALE             
242740    BLOOMINGDALE             
242741    BLOOMINGDALE             
...
1650731    BLOOMINGDALE BORO PD     
1650732    BLOOMINGDALE BORO PD     
1650733    BLOOMINGDALE BORO PD     
1650734    BLOOMINGDALE BORO PD     
1650735    BLOOMINGDALE BORO PD     
1650736    BLOOMINGDALE BORO PD     
1650737    BLOOMINGDALE BORO PD     
1650738    BLOOMINGDALE BORO PD     
1650739    BLOOMINGDALE BORO PD     
1650740    BLOOMINGDALE BORO PD     
1650741    BLOOMINGDALE BORO PD     
1650742    BLOOMINGDALE BORO PD     
1650743    BLOOMINGDALE BORO PD     
1650744    BLOOMINGDALE BORO PD     
1665827    BLOOMINGDALE BORO PD     
Name: Police Department, Length: 844, dtype: object

Once our 'County Name' field is cleaned, we can filter the table by its values, returning a view of the DataFrame that only shows rows with accidents that happened in Passaic County, with a comparison using ==.

In [13]:

df[df['County Name']=='PASSAIC']

Out[13]:

	case code	County Name	Municipality Name	Crash Date	Crash Day Of Week	Crash Time	Police Dept Code	Police Department	Police Station	Total Killed	...	Is Ramp	Ramp To/From Route Name	Ramp To/From Route Direction	Posted Speed	Posted Speed Cross Street	Latitude	Longitude	Cell Phone In Use Flag	Other Property Damage	Reporting Badge No.
242727	2008160108-0002	PASSAIC	BLOOMINGDALE BORO	01/01/2008	TU	0052	01	BLOOMINGDALE		0	...				35	35			N	SOUTHBOUND GUARDRAIL ...	5
242728	2008160108-0083	PASSAIC	BLOOMINGDALE BORO	01/09/2008	W	0550	01	BLOOMINGDALE		0	...				0				N	NONE ...	11
242729	2008160108-0155	PASSAIC	BLOOMINGDALE BORO	01/17/2008	TH	0127	01	BLOOMINGDALE		0	...				40	25	41.01790	74.31252	N	UTILITY POLE # BMP560B MAILBOX FENCE CEMENT RE...	11
242730	2008160108-0159	PASSAIC	BLOOMINGDALE BORO	01/17/2008	TH	0909	01	BLOOMINGDALE		0	...				35	25	41.01119	74.32027	N	NONE ...	10
242731	2008160108-0166	PASSAIC	BLOOMINGDALE BORO	01/17/2008	TH	1814	01	BLOOMINGDALE		0	...				35	25	41.03530	74.34463	N	NONE ...	11
242732	2008160108-0194	PASSAIC	BLOOMINGDALE BORO	01/21/2008	M	1318	01	BLOOMINGDALE		0	...				0				N	NONE ...	11
242733	2008160108-0273	PASSAIC	BLOOMINGDALE BORO	02/01/2008	F	0905	01	BLOOMINGDALE		0	...				35	35			N	NONE ...	12
242734	2008160108-0307	PASSAIC	BLOOMINGDALE BORO	02/05/2008	TU	1459	01	BLOOMINGDALE		0	...				0				N	NONE ...	11
242735	2008160108-0310	PASSAIC	BLOOMINGDALE BORO	02/05/2008	TU	1728	01	BLOOMINGDALE		0	...				35	25	41.00775	74.33502	N	NONE ...	8
242736	2008160108-0330	PASSAIC	BLOOMINGDALE BORO	02/08/2008	F	1309	01	BLOOMINGDALE		0	...				35	25	41.00741	74.34208	N	NONE ...	8
242737	2008160108-0360	PASSAIC	BLOOMINGDALE BORO	02/11/2008	M	1835	01	BLOOMINGDALE		0	...				35	25	41.00690	74.34344	N	? ...	6
242738	2008160108-0365	PASSAIC	BLOOMINGDALE BORO	02/12/2008	TU	1221	01	BLOOMINGDALE		0	...				30	35			N	NONE ...	12
242739	2008160108-0369	PASSAIC	BLOOMINGDALE BORO	02/12/2008	TU	1823	01	BLOOMINGDALE		0	...				35	15	41.01493	74.35526	N	? ...	6
242740	2008160108-0370	PASSAIC	BLOOMINGDALE BORO	02/12/2008	TU	2217	01	BLOOMINGDALE		0	...				35	30	41.00387	74.33420	N	NONE ...
242741	2008160108-0381	PASSAIC	BLOOMINGDALE BORO	02/14/2008	TH	0808	01	BLOOMINGDALE		0	...				35	25	41.01119	74.32027	N	NONE ...	11
242742	2008160108-0393	PASSAIC	BLOOMINGDALE BORO	02/16/2008	SA	0748	01	BLOOMINGDALE		0	...				25	25			N	NONE ...	12
242743	2008160108-0448	PASSAIC	BLOOMINGDALE BORO	02/21/2008	TH	1122	01	BLOOMINGDALE		0	...				35		41.00673	74.35107	N	NONE ...	9
242744	2008160108-0457	PASSAIC	BLOOMINGDALE BORO	02/22/2008	F	1228	01	BLOOMINGDALE		0	...				35	30	41.00141	74.32169	N	NONE ...	10
242745	2008160108-0476	PASSAIC	BLOOMINGDALE BORO	02/25/2008	M	1142	01	BLOOMINGDALE		0	...				35	25	41.00690	74.34344	N	? ...	6
242746	2008160108-0491	PASSAIC	BLOOMINGDALE BORO	02/27/2008	W	0133	01	BLOOMINGDALE		0	...				25	25			N	NONE NOTED ...	5
242747	2008160108-0504	PASSAIC	BLOOMINGDALE BORO	02/28/2008	TH	2020	01	BLOOMINGDALE		0	...				0				N	UNKNOWN ...	8
242748	2008160108-0538	PASSAIC	BLOOMINGDALE BORO	03/05/2008	W	0806	01	BLOOMINGDALE		0	...				30	35	41.00384	74.33426	N	NONE ...	9
242749	2008160108-0619	PASSAIC	BLOOMINGDALE BORO	03/14/2008	F	0708	01	BLOOMINGDALE		0	...				35	25	41.03262	74.34254	N	KAMPFE LAKE ASSOCIATION PROPERTY-TREES/LANDSCA...	6
242750	2008160108-0640	PASSAIC	BLOOMINGDALE BORO	03/15/2008	SA	1537	01	BLOOMINGDALE		0	...				25				N	NONE ...	9
242751	2008160108-0687	PASSAIC	BLOOMINGDALE BORO	03/22/2008	SA	0431	01	BLOOMINGDALE		0	...				25	35			N	UTILITY POLE# BMP331P WAS STRUCK AND SUSTAINED...	10
242752	2008160108-0719	PASSAIC	BLOOMINGDALE BORO	03/26/2008	W	1340	01	BLOOMINGDALE		0	...				30	25	41.00247	74.32864	N	NONE ...	8
242753	2008160108-0743	PASSAIC	BLOOMINGDALE BORO	03/28/2008	F	0700	01	BLOOMINGDALE		0	...				35	35			N	DAMAGE TO LANDSCAPING OF 587 GLENWILD AVE. ...
242754	2008160108-0762	PASSAIC	BLOOMINGDALE BORO	03/29/2008	SA		01	BLOOMINGDALE		0	...				25	35			N	NONE ...
242755	2008160108-0797	PASSAIC	BLOOMINGDALE BORO	04/02/2008	W	0814	01	BLOOMINGDALE		0	...				35	15			N	NONE ...
242756	2008160108-0798	PASSAIC	BLOOMINGDALE BORO	04/02/2008	W	0908	01	BLOOMINGDALE		0	...				35	35			N	NONE ...	12
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1669227	20131616B060-2013-01720A	PASSAIC	WOODLAND PARK BORO	07/11/2013	TH	2240	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	6732
1669228	20131616B060-2013-01772A	PASSAIC	WOODLAND PARK BORO	07/17/2013	W	1827	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7190
1669229	20131616B060-2013-01787A	PASSAIC	WOODLAND PARK BORO	07/17/2013	W	1107	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	6701
1669230	20131616B060-2013-01848A	PASSAIC	WOODLAND PARK BORO	07/24/2013	W	1840	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	5349
1669231	20131616B060-2013-01865A	PASSAIC	WOODLAND PARK BORO	07/26/2013	F	2014	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7106
1669232	20131616B060-2013-01869A	PASSAIC	WOODLAND PARK BORO	07/27/2013	SA	1641	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	6227
1669233	20131616B060-2013-01922A	PASSAIC	WOODLAND PARK BORO	08/01/2013	TH	0053	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	1	...				55				N	NONE. ...	7238
1669234	20131616B060-2013-01923A	PASSAIC	WOODLAND PARK BORO	08/01/2013	TH	1104	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...	F	JACKSON AVE	EB	65				N	NONE ...	6962
1669235	20131616B060-2013-01937A	PASSAIC	WOODLAND PARK BORO	08/02/2013	F	1818	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	NONE ...	6185
1669236	20131616B060-2013-01963A	PASSAIC	WOODLAND PARK BORO	08/05/2013	M	2119	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	NONE ...	7055
1669237	20131616B060-2013-01964A	PASSAIC	WOODLAND PARK BORO	08/05/2013	M	2119	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				65				N	NONE ...	6962
1669238	20131616B060-2013-01966A	PASSAIC	WOODLAND PARK BORO	08/06/2013	TU	0005	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	APPROXIMATELY 15 FEET OF GUARDRAIL NJDOT TRE...	6084
1669239	20131616B060-2013-02030A	PASSAIC	WOODLAND PARK BORO	08/12/2013	M	1733	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7071
1669240	20131616B060-2013-02047A	PASSAIC	WOODLAND PARK BORO	08/14/2013	W	0618	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	6701
1669241	20131616B060-2013-02131A	PASSAIC	WOODLAND PARK BORO	08/25/2013	S	2239	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7106
1669242	20131616B060-2013-02146A	PASSAIC	WOODLAND PARK BORO	08/27/2013	TU	1016	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7167
1669243	20131616B060-2013-02184A	PASSAIC	WOODLAND PARK BORO	09/01/2013	S	1004	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7178
1669244	20131616B060-2013-02223A	PASSAIC	WOODLAND PARK BORO	09/06/2013	F	0906	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	6701
1669245	20131616B060-2013-02227A	PASSAIC	WOODLAND PARK BORO	09/06/2013	F	1308	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	6660
1669246	20131616B060-2013-02283A	PASSAIC	WOODLAND PARK BORO	09/12/2013	TH	2015	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7055
1669247	20131616B060-2013-02297A	PASSAIC	WOODLAND PARK BORO	09/13/2013	F	1533	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...	T	SQUIRRELWOOD ROAD	SB	55				N	...	6969
1669248	20131616B060-2013-02306A	PASSAIC	WOODLAND PARK BORO	09/15/2013	S	1933	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7071
1669249	20131616B060-2013-02385A	PASSAIC	WOODLAND PARK BORO	09/22/2013	S	0330	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	6701
1669250	20131616B060-2013-02394A	PASSAIC	WOODLAND PARK BORO	09/19/2013	TH	1903	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7167
1669251	20131616B060-2013-02415A	PASSAIC	WOODLAND PARK BORO	09/24/2013	TU	1608	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				65				N	...	7319
1669252	20131616B060-2013-02426A	PASSAIC	WOODLAND PARK BORO	09/25/2013	W	0720	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7190
1669253	20131616B060-2013-02432A	PASSAIC	WOODLAND PARK BORO	09/26/2013	TH	0712	2	NEW JERSEY STATE POLICE	TOTOWA-SUB	0	...				55				N	...	7075
1669254	20131616PCSD13-04189	PASSAIC	WOODLAND PARK BORO	07/09/2013	TU	1155	3	PASSAIC CO SHERIFFS DEPT	SHERIFFS PATROL	0	...				25	25			N	...
1669255	20131616PCSD13-04479	PASSAIC	WOODLAND PARK BORO	07/21/2013	S	1037	3	PASSAIC CO SHERIFFS DEPT	COUNTY PATROL	0	...				10	25			N	...	115
1669256	20131616PCSD13-05367	PASSAIC	WOODLAND PARK BORO	08/27/2013	TU	1950	3	PASSAIC CO SHERIFFS DEPT	COUNTY PATROL	0	...				25	25			N	...	949

113217 rows × 47 columns

If we're confident the 'County Name' column is as clean as it's going to be, we can turn to others. What do we see when we look at the unique values in the 'Police Dept Code' column with the .unique() method?

In [14]:

df['Police Dept Code'].unique()

Out[14]:

array(['01', '99', '  ', '02', '03', '04', 1, 99, 2, 3, 4], dtype=object)

If we filter the column for the police departments that use '99' as their code, we'll see their names are not unique (of course).

In [15]:

df[['Police Dept Code', 'Police Department']][df['Police Dept Code']==99]

Out[15]:

	Police Dept Code	Police Department
311564	99	PALISADES INTER. PARKWAY
311565	99	PALISADES INTER. PARKWAY
311566	99	PALISADES INTER. PARKWAY
311567	99	PALISADES INTER. PARKWAY
311568	99	PALISADES INTER. PARKWAY
311569	99	PALISADES INTER. PARKWAY
311570	99	PALISADES INTER. PARKWAY
311571	99	PALISADES INTER. PARKWAY
311572	99	PALISADES INTER. PARKWAY
311573	99	PALISADES INTER. PARKWAY
311574	99	PALISADES INTER. PARKWAY
311575	99	PALISADES INTER. PARKWAY
311709	99	NEW JERSEY TRANSIT POLICE
312227	99	NEW JERSEY TRANSIT POLICE
312415	99	NEW JERSEY TRANSIT POLICE
312641	99	NEW JERSEY TRANSIT POLICE
313347	99	NEW JERSEY TRANSIT POLICE
313348	99	NEW JERSEY TRANSIT POLICE
316175	99	NEW JERSEY TRANSIT POLICE
316534	99	PALISADES INTER. PARKWAY
316535	99	PALISADES INTER. PARKWAY
320064	99	WILLIAM PATERSON UNIVERSI
320067	99	WILLIAM PATERSON UNIVERSI
320068	99	WILLIAM PATERSON UNIVERSI
320102	99	WILLIAM PATERSON UNIVERSI
320103	99	WILLIAM PATERSON UNIVERSI
320104	99	WILLIAM PATERSON UNIVERSI
320105	99	WILLIAM PATERSON UNIVERSI
320106	99	WILLIAM PATERSON UNIVERSI
320107	99	WILLIAM PATERSON UNIVERSI
...	...	...
1701415	99	KEAN UNIVERSITY PD
1701416	99	KEAN UNIVERSITY PD
1701417	99	KEAN UNIVERSITY PD
1701418	99	KEAN UNIVERSITY PD
1701419	99	KEAN UNIVERSITY PD
1701420	99	KEAN UNIVERSITY PD
1701421	99	KEAN UNIVERSITY PD
1701422	99	KEAN UNIVERSITY PD
1701423	99	KEAN UNIVERSITY PD
1701424	99	KEAN UNIVERSITY PD
1701425	99	KEAN UNIVERSITY PD
1701426	99	KEAN UNIVERSITY PD
1701427	99	KEAN UNIVERSITY PD
1701428	99	KEAN UNIVERSITY PD
1701429	99	KEAN UNIVERSITY PD
1701430	99	KEAN UNIVERSITY PD
1701431	99	KEAN UNIVERSITY PD
1701432	99	KEAN UNIVERSITY PD
1701433	99	KEAN UNIVERSITY PD
1701434	99	KEAN UNIVERSITY PD
1701435	99	KEAN UNIVERSITY PD
1701436	99	KEAN UNIVERSITY PD
1701437	99	KEAN UNIVERSITY PD
1701438	99	KEAN UNIVERSITY PD
1701439	99	KEAN UNIVERSITY PD
1701440	99	KEAN UNIVERSITY PD
1701441	99	KEAN UNIVERSITY PD
1701442	99	KEAN UNIVERSITY PD
1701443	99	KEAN UNIVERSITY PD
1701444	99	KEAN UNIVERSITY PD

4818 rows × 2 columns

You also might notice that some of our department codes start with 0, so we have data that is mixed between strings and integers in the same column. We can use pandas' .astype() method to change all the type of all of the values in that column to strings.

In [16]:

df['Police Dept Code']=df['Police Dept Code'].astype(str)
df['Police Dept Code'].unique()

Out[16]:

array(['01', '99', '  ', '02', '03', '04', '1', '2', '3', '4'], dtype=object)

You should note two things here. To actually change the value of the column in the DataFrame in place, we'll have to assign it back to itself as if we're defining a new variable. Also, if you check the unique values again, we'll see that there are now values for '01' and '1'. You may want to standardize those, but we'll leave that to you as an exercise later.

Even after all that, we still have some empty fields with two spaces in the string, so let's replace those empty values with the word "Unknown"

In [17]:

df['Police Dept Code'][df['Police Dept Code']=='  ']='Unknown'

We've done a bit of cleaning here and may want to start doing some exploratory analysis. To do that, we'll create a smaller DataFrame of just the columns that we plan to examine, and we'll name it myframe.

In [18]:

#Create a smaller frame
myframe = df[['County Name', 'Municipality Name', 'Crash Date', 'Crash Day Of Week', 'Crash Time', 'Total Killed', 'Total Injured', 'Pedestrians Killed', 'Pedestrians Injured', 'Total Vehicles Involved', 'Crash Type Code', 'Alcohol Involved', 'Environmental Condition', 'Light Condition', 'Cell Phone In Use Flag']]

We'll pass myframe along to the next exercise for aggregating, but we might want to save it for later. While we're thinking about it, let's write it out to a CSV with the shockingly named .to_csv() method.

In [19]:

myframe.to_csv('smallertab.csv')

Digging Deeper with Pandas¶

We've done a bit of cleaning to our outrageously messy data, and now we want to start to explore what's actually in this data and look for more spots we have to clean.

If you were reporting on car accidents in your state, what would you want to know? Some of the questions you'd ask might include:

How many deaths happened in each county?
How many accidents have happened over time?
How many people are involved in car accidents?
What counties might be outliers in terms of deaths and injuries?

We'll start by reading our data into our DataFrame. Then we'll filter it and do a few group-bys, which should be familiar if you've ever worked with SQL. Finally, we'll create smaller DataFrames of this aggregated data to visualize.

One of the cleaning tasks we'll have to deal with is the format of the 'Crash Date' field, so we might as well import Python's standard datetime module and read our csv now.

In [20]:

from datetime import datetime

newframe = pd.read_csv('smallertab.csv')

/Users/tommeagher/.virtualenvs/pandas/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (11) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

This should all be pretty familiar. Notice that weird warning about mixed types? Although we cleaned the types in some of the columns, others are still mixed. You can solve this by using the dtypes keyword in the read_csv() function. It's worth looking at the docs to see all the options you can set when you read a CSV.

Now, we want to see our data aggregated by counties, so here's where pandas' .groupby() function comes in.

In [21]:

newframe.groupby('County Name')

Out[21]:

<pandas.core.groupby.DataFrameGroupBy object at 0x1034cf3d0>

Now, when we run .groupby(), it returns a DataFrameGroupBy object, which is really only good to us if we do something else with it.

So try this:

In [22]:

#What county had the most accident deaths?
newframe.groupby('County Name').sum()

Out[22]:

	Unnamed: 0	Total Killed	Total Injured	Pedestrians Killed	Pedestrians Injured	Total Vehicles Involved
County Name
ATLANTIC	40555622700	187	20957	33	1104	98713
BERGEN	136762281964	166	47575	70	2879	333025
BURLINGTON	56830343320	249	23727	44	634	128223
CAMDEN	70636041684	230	36056	62	1539	178112
CAPE MAY	15954173080	48	6114	7	258	35792
CUMBERLAND	21914587233	143	11086	21	346	49772
ESSEX	128406345105	247	58229	95	5224	315666
GLOUCESTER	35575203872	156	15964	30	475	81242
HUDSON	98932551977	122	29470	44	4026	230722
HUNTERDON	19987461551	55	5835	10	95	40153
MERCER	70130920959	141	22701	39	995	153293
MIDDLESEX	154741088588	308	52131	88	2055	336919
MONMOUTH	111922290674	230	36027	52	1239	231862
MORRIS	64443161991	128	21039	26	658	150796
OCEAN	82316419082	284	30520	65	1048	176583
PASSAIC	109027719563	160	35694	50	2054	216166
SALEM	9398429064	72	3582	4	55	16512
SOMERSET	71108004864	119	19076	17	630	133774
SUSSEX	22863294057	65	7004	6	144	38997
UNION	116426554015	182	37662	66	2142	227130
WARREN	20427091377	69	6255	6	154	35105

Let's deconstruct this a bit, because we're chaining a lot together at once here. We start with our DataFrame object, called newframe. We then do .groupby(), passing it the name of the column we're aggregating on. As we said, this is reminiscent of SQL and returns the DataFrameGroupBy object. We can then run .sum() on the GroupBy object to add up the numeric columns for each county, returning a new DataFrame, which looks like the kind of Pivot Table we'd make in Excel.

Say we want to know which county had the most people killed in accidents over this time period. We can slice out just the 'Total Killed' column and sort it from most to least.

In [23]:

newframe.groupby('County Name').sum().iloc[:,1].order(ascending=False)

Out[23]:

County Name
MIDDLESEX      308
OCEAN          284
BURLINGTON     249
ESSEX          247
CAMDEN         230
MONMOUTH       230
ATLANTIC       187
UNION          182
BERGEN         166
PASSAIC        160
GLOUCESTER     156
CUMBERLAND     143
MERCER         141
MORRIS         128
HUDSON         122
SOMERSET       119
SALEM           72
WARREN          69
SUSSEX          65
HUNTERDON       55
CAPE MAY        48
Name: Total Killed, dtype: int64

We select the 'Total Killed' column here using .iloc[], which wants a two-item list. The first item is a slice of the rows. Since we want all rows, we'll use the standard Python slicing notation of an empty colon. The second item in the list is the columns we want, which is just column 2, or if you're counting from 0, as Python does, column 1.

Finally, the .order() function will sort the resulting series and pass it the ascending keyword with a value of False (it defaults to True) to get the largest figures at the top of the list.

We can assign that to a new dataframe variable that we'll use later to make a bar chart and move on for now.

In [24]:

countydeaths = newframe.groupby('County Name').sum().iloc[:,1].order(ascending=False)

Now we want to aggregate the accidents by date. But you may have noticed that the dates here are strings and not actual Python date objects. This means when you try to sort them, you get 01/01/2008, 01/01/2009, 01/01/2010, and so on. We need convert the strings to actual Python dates.

In [25]:

#we use lambda, which creates an unnamed, one-line function and applies it to each item in the column
newframe['Crash Date']=newframe['Crash Date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y").date())

What did we do here? We take the 'Crash Date' column and use the .apply() function to perform some operation across the value in every row in that column. Then we use Python's lambda, a way to write an unnamed, one-line function. If the variable for the date string in each row of the 'Crash Date' column is x, apply and lambda perform the .strptime() function from Python's datetime module. The part in double quotes "%m/%d/%Y" says to take a string that is a two-digit month and a two-digit date and a four-digit year, divided by slashes, and convert it to a datetime object. Finally, .date() takes just the date part of that object and returns that into the row. That's a mouthful.

Now that we've done that, we can aggregate the crashes by real, sortable dates and count how many accidents happened on each date. Then we'll assign this to a new variable for the resulting DataFrame that we'll use later to make a line chart.

In [26]:

crashesbydate = newframe.groupby('Crash Date').count().iloc[:,0]

So far we've just been transforming and tweaking data that already exists in our dataframe. We can also calculate new data and add it as a column. Here we can create a column for the total number of people involved in an accident by adding together the 'Total Killed' and 'Total Injured' columns. This syntax works similarly to creating a new key in a Python dict.

In [27]:

newframe['Total Involved']=newframe['Total Killed']+newframe['Total Injured']

With this 'Total Involved' value, we can create a histogram later.

In [28]:

newframe['Crash Day Of Week']=newframe['Crash Day Of Week'].map(str.strip)

Here, we're using .map() to call the pandas built-in strip method on its string method across the column, which is a Series object

Finally, let's create a DataFrame of all the columns aggregated by county so we can make a scatter plot of the counties by total killed and pedestrians killed.

In [29]:

countyframe = newframe.groupby('County Name').sum()

With these new DataFrames, we have several slices of our data. Let's see what it looks like by visualizing it.

Basic plots with matplotlib¶

First, we need to launch the pylab mode inline to get the plots to render inside our notebook. You can do this at startup by running ipython notebook --pylab=inline, or you can do it like this with the magic %.

In [30]:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['datetime']
`%matplotlib` prevents importing * from pylab and numpy

Next, we import matplotlib's pyplot class and call it plt.

In [31]:

import matplotlib.pyplot as plt

If we want to make a bar chart of the number of deaths by county, from the countydeaths DataFrame we created above, we can use the .plot() function and pass it the kind='bar' keyword argument.

These plots are basically pandas' wrapper around the matplotlib library. There are many other things you can do with the wrapper and matplotlib itself, and it's worth giving the documentation a read.

In [32]:

countydeaths.plot(kind='bar')

Out[32]:

<matplotlib.axes.AxesSubplot at 0x115da9590>

If we want to do a line graph, we can use the .plot() function on a DataFrame. The default kind is a line graph. So here's a graph of crashes by date.

In [33]:

crashesbydate.plot()

Out[33]:

<matplotlib.axes.AxesSubplot at 0x115d97f50>

In [34]:

newframe['Total Involved'].hist(bins=50)

Out[34]:

<matplotlib.axes.AxesSubplot at 0x115ec28d0>

Finally, for now, we can scatter plot our county data by the total number of people killed versus the number of pedestrians killed.

In [35]:

#let's scatter plot county name by total killed and pedestrians killed.
countyframe.plot(kind='scatter', x='Total Killed', y='Pedestrians Killed')

Out[35]:

<matplotlib.axes.AxesSubplot at 0x11630d350>

I hope this was helpful. If you have questions or spot bugs, you can find me on Github, via email or on Twitter.