PyCon 2018: Using pandas for Better (and Worse) Data Science

GitHub repository: https://github.com/justmarkham/pycon-2018-tutorial

Instructor: Kevin Markham

In [1]:
import pandas as pd
pd.__version__
Out[1]:
'0.22.0'
In [2]:
import matplotlib.pyplot as plt
%matplotlib inline

Dataset: Stanford Open Policing Project (video)

https://openpolicing.stanford.edu/

In [3]:
# ri stands for Rhode Island
ri = pd.read_csv('police.csv')
In [4]:
# what does each row represent?
ri.head()
Out[4]:
stop_date stop_time county_name driver_gender driver_age_raw driver_age driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop
0 2005-01-02 01:55 NaN M 1985.0 20.0 White Speeding Speeding False NaN Citation False 0-15 Min False
1 2005-01-18 08:15 NaN M 1965.0 40.0 White Speeding Speeding False NaN Citation False 0-15 Min False
2 2005-01-23 23:15 NaN M 1972.0 33.0 White Speeding Speeding False NaN Citation False 0-15 Min False
3 2005-02-20 17:15 NaN M 1986.0 19.0 White Call for Service Other False NaN Arrest Driver True 16-30 Min False
4 2005-03-14 10:00 NaN F 1984.0 21.0 White Speeding Speeding False NaN Citation False 0-15 Min False
In [5]:
# what do these numbers mean?
ri.shape
Out[5]:
(91741, 15)
In [6]:
# what do these types mean?
ri.dtypes
Out[6]:
stop_date              object
stop_time              object
county_name           float64
driver_gender          object
driver_age_raw        float64
driver_age            float64
driver_race            object
violation_raw          object
violation              object
search_conducted         bool
search_type            object
stop_outcome           object
is_arrested            object
stop_duration          object
drugs_related_stop       bool
dtype: object
  • What does NaN mean?
  • Why might a value be missing?
  • Why mark it as NaN? Why not mark it as a 0 or an empty string or a string saying "Unknown"?
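As a quick illustration of why NaN beats a sentinel value (toy data, not the stops dataset): NaN is automatically excluded from calculations, whereas a 0 standing in for "unknown" silently skews them.

```python
import numpy as np
import pandas as pd

# ages with one unknown value, marked two different ways
ages_nan = pd.Series([20, 40, np.nan])   # unknown marked as NaN
ages_zero = pd.Series([20, 40, 0])       # unknown marked as 0

# mean() skips NaN, so the result reflects only the known ages
print(ages_nan.mean())    # 30.0
print(ages_zero.mean())   # 20.0 -- the sentinel silently skews the result

# isnull() makes missing values easy to count
print(ages_nan.isnull().sum())   # 1
```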
In [7]:
# what are these counts? how does this work?
ri.isnull().sum()
Out[7]:
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64
In [8]:
(True == 1) and (False == 0)
Out[8]:
True
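Because True equals 1 and False equals 0, summing a boolean Series counts the True values — which is why `isnull().sum()` above returns the number of missing values per column. A toy example:

```python
import pandas as pd

s = pd.Series([True, False, True, True])

# sum() treats True as 1 and False as 0, so it counts the Trues
print(s.sum())    # 3

# mean() gives the fraction of True values
print(s.mean())   # 0.75
```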

1. Remove the column that only contains missing values (video)

In [9]:
# axis=1 also works; inplace is False by default, so inplace=True avoids an assignment statement
ri.drop('county_name', axis='columns', inplace=True)
In [10]:
ri.shape
Out[10]:
(91741, 14)
In [11]:
ri.columns
Out[11]:
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
       'driver_age', 'driver_race', 'violation_raw', 'violation',
       'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
       'stop_duration', 'drugs_related_stop'],
      dtype='object')
In [12]:
# alternative method
ri.dropna(axis='columns', how='all').shape
Out[12]:
(91741, 14)

Lessons:

  • Pay attention to default arguments
  • Check your work
  • There is more than one way to do everything in pandas
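The "pay attention to default arguments" lesson in miniature, on a toy DataFrame: with the default inplace=False, drop() returns a new object and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# inplace=False (the default): returns a new DataFrame, original unchanged
dropped = df.drop('b', axis='columns')
print(df.columns.tolist())        # ['a', 'b']
print(dropped.columns.tolist())   # ['a']

# inplace=True: modifies df itself and returns None
df.drop('b', axis='columns', inplace=True)
print(df.columns.tolist())        # ['a']
```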

2. Do men or women speed more often? (video)

In [13]:
# when someone is stopped for speeding, how often is it a man or woman?
ri[ri.violation == 'Speeding'].driver_gender.value_counts(normalize=True)
Out[13]:
M    0.680527
F    0.319473
Name: driver_gender, dtype: float64
In [14]:
# alternative
ri.loc[ri.violation == 'Speeding', 'driver_gender'].value_counts(normalize=True)
Out[14]:
M    0.680527
F    0.319473
Name: driver_gender, dtype: float64
In [15]:
# when a man is pulled over, how often is it for speeding?
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)
Out[15]:
Speeding               0.524350
Moving violation       0.207012
Equipment              0.135671
Other                  0.057668
Registration/plates    0.038461
Seat belt              0.036839
Name: violation, dtype: float64
In [16]:
# repeat for women
ri[ri.driver_gender == 'F'].violation.value_counts(normalize=True)
Out[16]:
Speeding               0.658500
Moving violation       0.136277
Equipment              0.105780
Registration/plates    0.043086
Other                  0.029348
Seat belt              0.027009
Name: violation, dtype: float64
In [17]:
# combines the two lines above
ri.groupby('driver_gender').violation.value_counts(normalize=True)
Out[17]:
driver_gender  violation          
F              Speeding               0.658500
               Moving violation       0.136277
               Equipment              0.105780
               Registration/plates    0.043086
               Other                  0.029348
               Seat belt              0.027009
M              Speeding               0.524350
               Moving violation       0.207012
               Equipment              0.135671
               Other                  0.057668
               Registration/plates    0.038461
               Seat belt              0.036839
Name: violation, dtype: float64

What are some relevant facts that we don't know?

Lessons:

  • There is more than one way to understand a question

3. Does gender affect who gets searched during a stop? (video)

In [18]:
# ignore gender for the moment
ri.search_conducted.value_counts(normalize=True)
Out[18]:
False    0.965163
True     0.034837
Name: search_conducted, dtype: float64
In [19]:
# how does this work?
ri.search_conducted.mean()
Out[19]:
0.03483720473942948
In [20]:
# search rate by gender
ri.groupby('driver_gender').search_conducted.mean()
Out[20]:
driver_gender
F    0.020033
M    0.043326
Name: search_conducted, dtype: float64

Does this prove that gender affects who gets searched?

In [21]:
# include a second factor
ri.groupby(['violation', 'driver_gender']).search_conducted.mean()
Out[21]:
violation            driver_gender
Equipment            F                0.042622
                     M                0.070081
Moving violation     F                0.036205
                     M                0.059831
Other                F                0.056522
                     M                0.047146
Registration/plates  F                0.066140
                     M                0.110376
Seat belt            F                0.012598
                     M                0.037980
Speeding             F                0.008720
                     M                0.024925
Name: search_conducted, dtype: float64

Does this prove causation?

Lessons:

  • Causation is difficult to conclude, so focus on relationships
  • Include all relevant factors when studying a relationship

4. Why is search_type missing so often? (video)

In [22]:
ri.isnull().sum()
Out[22]:
stop_date                 0
stop_time                 0
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64
In [23]:
# maybe search_type is missing any time search_conducted is False?
ri.search_conducted.value_counts()
Out[23]:
False    88545
True      3196
Name: search_conducted, dtype: int64
In [24]:
# test that theory, why is the Series empty?
ri[ri.search_conducted == False].search_type.value_counts()
Out[24]:
Series([], Name: search_type, dtype: int64)
In [25]:
# value_counts ignores missing values by default
ri[ri.search_conducted == False].search_type.value_counts(dropna=False)
Out[25]:
NaN    88545
Name: search_type, dtype: int64
In [26]:
# when search_conducted is True, search_type is never missing
ri[ri.search_conducted == True].search_type.value_counts(dropna=False)
Out[26]:
Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Probable Cause,Protective Frisk                               33
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Inventory,Protective Frisk                                    11
Protective Frisk,Reasonable Suspicion                         11
Incident to Arrest,Inventory,Protective Frisk                 11
Incident to Arrest,Probable Cause,Protective Frisk            10
Incident to Arrest,Probable Cause,Reasonable Suspicion         6
Inventory,Reasonable Suspicion                                 4
Incident to Arrest,Inventory,Reasonable Suspicion              4
Inventory,Probable Cause,Protective Frisk                      2
Inventory,Probable Cause,Reasonable Suspicion                  2
Probable Cause,Protective Frisk,Reasonable Suspicion           1
Incident to Arrest,Protective Frisk,Reasonable Suspicion       1
Name: search_type, dtype: int64
In [27]:
# alternative
ri[ri.search_conducted == True].search_type.isnull().sum()
Out[27]:
0

Lessons:

  • Verify your assumptions about your data
  • pandas functions ignore missing values by default
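The second lesson on a toy Series: value_counts() silently excludes NaN unless you pass dropna=False.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, np.nan])

# NaN is excluded by default, so only 'a' and 'b' appear
print(s.value_counts())

# dropna=False makes the missing values visible
print(s.value_counts(dropna=False))
```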

5. During a search, how often is the driver frisked? (video)

In [28]:
# multiple types are separated by commas
ri.search_type.value_counts(dropna=False)
Out[28]:
NaN                                                         88545
Incident to Arrest                                           1219
Probable Cause                                                891
Inventory                                                     220
Reasonable Suspicion                                          197
Protective Frisk                                              161
Incident to Arrest,Inventory                                  129
Incident to Arrest,Probable Cause                             106
Probable Cause,Reasonable Suspicion                            75
Incident to Arrest,Inventory,Probable Cause                    34
Probable Cause,Protective Frisk                                33
Incident to Arrest,Protective Frisk                            33
Inventory,Probable Cause                                       22
Incident to Arrest,Reasonable Suspicion                        13
Protective Frisk,Reasonable Suspicion                          11
Incident to Arrest,Inventory,Protective Frisk                  11
Inventory,Protective Frisk                                     11
Incident to Arrest,Probable Cause,Protective Frisk             10
Incident to Arrest,Probable Cause,Reasonable Suspicion          6
Incident to Arrest,Inventory,Reasonable Suspicion               4
Inventory,Reasonable Suspicion                                  4
Inventory,Probable Cause,Reasonable Suspicion                   2
Inventory,Probable Cause,Protective Frisk                       2
Probable Cause,Protective Frisk,Reasonable Suspicion            1
Incident to Arrest,Protective Frisk,Reasonable Suspicion        1
Name: search_type, dtype: int64
In [29]:
# use bracket notation when creating a column
ri['frisk'] = ri.search_type == 'Protective Frisk'
In [30]:
ri.frisk.dtype
Out[30]:
dtype('bool')
In [31]:
# includes exact matches only
ri.frisk.sum()
Out[31]:
161
In [32]:
# is this the answer?
ri.frisk.mean()
Out[32]:
0.0017549405391264537
In [33]:
# uses the wrong denominator (includes stops that didn't involve a search)
ri.frisk.value_counts()
Out[33]:
False    91580
True       161
Name: frisk, dtype: int64
In [34]:
161 / (91580 + 161)
Out[34]:
0.0017549405391264537
In [35]:
# includes partial matches
ri['frisk'] = ri.search_type.str.contains('Protective Frisk')
In [36]:
# seems about right
ri.frisk.sum()
Out[36]:
274
In [37]:
# frisk rate during a search
ri.frisk.mean()
Out[37]:
0.08573216520650813
In [38]:
# str.contains preserved missing values from search_type
ri.frisk.value_counts(dropna=False)
Out[38]:
NaN      88545
False     2922
True       274
Name: frisk, dtype: int64
In [39]:
# excludes stops that didn't involve a search
274 / (2922 + 274)
Out[39]:
0.08573216520650813

Lessons:

  • Use string methods to find partial matches
  • Use the correct denominator when calculating rates
  • pandas calculations ignore missing values
  • Apply the "smell test" to your results
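The partial-match and missing-value lessons on a toy Series: an exact comparison misses the comma-separated combinations, while str.contains catches them and preserves NaN (its na parameter can fill those missing results if you want a clean boolean Series).

```python
import numpy as np
import pandas as pd

s = pd.Series(['Protective Frisk',
               'Incident to Arrest,Protective Frisk',
               np.nan])

# exact comparison matches only the standalone value
print((s == 'Protective Frisk').sum())   # 1

# str.contains also catches the combination, and preserves NaN
matches = s.str.contains('Protective Frisk')
print(matches.sum())                     # 2 (NaN is skipped by sum)

# na=False fills missing results with False
print(s.str.contains('Protective Frisk', na=False).sum())   # 2
```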

6. Which year had the fewest stops? (video)

In [40]:
# this works, but there's a better way
ri.stop_date.str.slice(0, 4).value_counts()
Out[40]:
2012    10970
2006    10639
2007     9476
2014     9228
2008     8752
2015     8599
2011     8126
2013     7924
2009     7908
2010     7561
2005     2558
Name: stop_date, dtype: int64
In [41]:
# make sure you create this column
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
ri['stop_datetime'] = pd.to_datetime(combined)
In [42]:
ri.dtypes
Out[42]:
stop_date                     object
stop_time                     object
driver_gender                 object
driver_age_raw               float64
driver_age                   float64
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                   object
stop_duration                 object
drugs_related_stop              bool
frisk                         object
stop_datetime         datetime64[ns]
dtype: object
In [43]:
# why is 2005 so much smaller?
ri.stop_datetime.dt.year.value_counts()
Out[43]:
2012    10970
2006    10639
2007     9476
2014     9228
2008     8752
2015     8599
2011     8126
2013     7924
2009     7908
2010     7561
2005     2558
Name: stop_datetime, dtype: int64

Lessons:

  • Consider removing chunks of data that may be biased
  • Use the datetime data type for dates and times
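The datetime lesson on a toy Series: once the date and time strings are concatenated and parsed with pd.to_datetime, the .dt accessor exposes components directly, with no string slicing required.

```python
import pandas as pd

dates = pd.Series(['2005-01-02', '2015-12-31'])
times = pd.Series(['01:55', '22:46'])

# concatenate the strings, then parse into true datetimes
combined = dates.str.cat(times, sep=' ')
dt = pd.to_datetime(combined)

print(dt.dt.year.tolist())   # [2005, 2015]
print(dt.dt.hour.tolist())   # [1, 22]
```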

7. How does drug activity change by time of day? (video)

In [44]:
ri.drugs_related_stop.dtype
Out[44]:
dtype('bool')
In [45]:
# baseline rate
ri.drugs_related_stop.mean()
Out[45]:
0.008883705213590434
In [46]:
# can't groupby 'hour' unless you create it as a column
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean()
Out[46]:
stop_datetime
0     0.019728
1     0.013507
2     0.015462
3     0.017065
4     0.011811
5     0.004762
6     0.003040
7     0.003281
8     0.002687
9     0.006288
10    0.005714
11    0.006976
12    0.004467
13    0.010326
14    0.007810
15    0.006416
16    0.005723
17    0.005517
18    0.010148
19    0.011596
20    0.008084
21    0.013342
22    0.013533
23    0.016344
Name: drugs_related_stop, dtype: float64
In [47]:
# line plot by default (for a Series)
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean().plot()
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612e0716d8>
In [48]:
# alternative: count drug-related stops by hour
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.sum().plot()
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612e3a9780>

Lessons:

  • Use plots to help you understand trends
  • Create exploratory plots using pandas one-liners

8. Do most stops occur at night? (video)

In [49]:
ri.stop_datetime.dt.hour.value_counts()
Out[49]:
10    7350
9     6838
11    5877
14    5634
23    5629
0     5221
8     5210
13    4842
15    4832
7     4572
1     4442
16    3844
12    3582
22    3473
20    3340
6     3290
2     2975
17    2900
19    2587
18    2168
21    1499
3     1172
4      254
5      210
Name: stop_datetime, dtype: int64
In [50]:
ri.stop_datetime.dt.hour.value_counts().plot()
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612d37ae48>
In [51]:
ri.stop_datetime.dt.hour.value_counts().sort_index().plot()
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612e3a9d30>
In [52]:
# alternative method
ri.groupby(ri.stop_datetime.dt.hour).stop_date.count().plot()
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612d3d2d68>

Lessons:

  • Be conscious of sorting when plotting
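The sorting lesson, minus the plot, on toy hour data: value_counts() orders by frequency, so a line plot would connect the hours out of order; sort_index() restores chronological order first.

```python
import pandas as pd

# toy "hour of stop" values
hours = pd.Series([10, 10, 9, 14, 9, 10])
counts = hours.value_counts()

# ordered by frequency, not by hour
print(counts.index.tolist())                # [10, 9, 14]

# sort_index() puts the hours back in chronological order
print(counts.sort_index().index.tolist())   # [9, 10, 14]
```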

9. Find the bad data in the stop_duration column and fix it (video)

In [53]:
# mark bad data as missing
ri.stop_duration.value_counts()
Out[53]:
0-15 Min     69543
16-30 Min    13635
30+ Min       3228
2                1
1                1
Name: stop_duration, dtype: int64
In [54]:
# what four things are wrong with this code?
# ri[ri.stop_duration == 1 | ri.stop_duration == 2].stop_duration = 'NaN'
In [55]:
# what two things are still wrong with this code?
ri[(ri.stop_duration == '1') | (ri.stop_duration == '2')].stop_duration = 'NaN'
C:\Users\justm\AppData\Local\conda\conda\envs\pd22.0\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
In [56]:
# assignment statement did not work
ri.stop_duration.value_counts()
Out[56]:
0-15 Min     69543
16-30 Min    13635
30+ Min       3228
2                1
1                1
Name: stop_duration, dtype: int64
In [57]:
# solves SettingWithCopyWarning
ri.loc[(ri.stop_duration == '1') | (ri.stop_duration == '2'), 'stop_duration'] = 'NaN'
In [58]:
# confusing!
ri.stop_duration.value_counts(dropna=False)
Out[58]:
0-15 Min     69543
16-30 Min    13635
NaN           5333
30+ Min       3228
NaN              2
Name: stop_duration, dtype: int64
In [59]:
# replace 'NaN' string with actual NaN value
import numpy as np
ri.loc[ri.stop_duration == 'NaN', 'stop_duration'] = np.nan
In [60]:
ri.stop_duration.value_counts(dropna=False)
Out[60]:
0-15 Min     69543
16-30 Min    13635
NaN           5335
30+ Min       3228
Name: stop_duration, dtype: int64
In [61]:
# alternative method
ri.stop_duration.replace(['1', '2'], value=np.nan, inplace=True)

Lessons:

  • Ambiguous data should be marked as missing
  • Don't ignore the SettingWithCopyWarning
  • NaN is not a string
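The "NaN is not a string" lesson on a toy Series: the string 'NaN' is ordinary data, while np.nan is a genuine missing value that isnull() and other NaN-aware methods recognize. Using .loc for the replacement avoids the SettingWithCopyWarning seen above.

```python
import numpy as np
import pandas as pd

s = pd.Series(['0-15 Min', 'NaN', np.nan])

# the string 'NaN' is NOT missing; only np.nan is
print(s.isnull().tolist())    # [False, False, True]

# replace the string with a real missing value via .loc
s.loc[s == 'NaN'] = np.nan
print(s.isnull().sum())       # 2
```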

10. What is the mean stop_duration for each violation_raw?

In [62]:
# make sure you create this column
mapping = {'0-15 Min':8, '16-30 Min':23, '30+ Min':45}
ri['stop_minutes'] = ri.stop_duration.map(mapping)
In [63]:
# matches value_counts for stop_duration
ri.stop_minutes.value_counts()
Out[63]:
8.0     69543
23.0    13635
45.0     3228
Name: stop_minutes, dtype: int64
In [64]:
ri.groupby('violation_raw').stop_minutes.mean()
Out[64]:
violation_raw
APB                                 20.987342
Call for Service                    22.034669
Equipment/Inspection Violation      11.460345
Motorist Assist/Courtesy            16.916256
Other Traffic Violation             13.900265
Registration Violation              13.745629
Seatbelt Violation                   9.741531
Special Detail/Directed Patrol      15.061100
Speeding                            10.577690
Suspicious Person                   18.750000
Violation of City/Town Ordinance    13.388626
Warrant                             21.400000
Name: stop_minutes, dtype: float64
In [65]:
ri.groupby('violation_raw').stop_minutes.agg(['mean', 'count'])
Out[65]:
                                       mean  count
violation_raw
APB                               20.987342     79
Call for Service                  22.034669   1298
Equipment/Inspection Violation    11.460345  11020
Motorist Assist/Courtesy          16.916256    203
Other Traffic Violation           13.900265  16223
Registration Violation            13.745629   3432
Seatbelt Violation                 9.741531   2952
Special Detail/Directed Patrol    15.061100   2455
Speeding                          10.577690  48462
Suspicious Person                 18.750000     56
Violation of City/Town Ordinance  13.388626    211
Warrant                           21.400000     15

Lessons:

  • Convert strings to numbers for analysis
  • Approximate when necessary
  • Use count with mean to look for meaningless means
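The string-to-number conversion above uses Series.map with a dictionary; a toy version (note that values absent from the mapping become NaN, so the assumed 'Unknown' entry drops out of the mean):

```python
import pandas as pd

durations = pd.Series(['0-15 Min', '30+ Min', '0-15 Min', 'Unknown'])
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

# map each string to its approximate midpoint in minutes
minutes = durations.map(mapping)
print(minutes.tolist())   # [8.0, 45.0, 8.0, nan]

# the mean is computed over the mapped values only
print(minutes.mean())
```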

11. Plot the results of the first groupby from the previous exercise

In [66]:
# what's wrong with this?
ri.groupby('violation_raw').stop_minutes.mean().plot()
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612d5470b8>
In [67]:
# how could this be made better?
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='bar')
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612d49d978>
In [68]:
ri.groupby('violation_raw').stop_minutes.mean().sort_values().plot(kind='barh')
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612d495208>

Lessons:

  • Don't use a line plot to compare categories
  • Be conscious of sorting and orientation when plotting

12. Compare the age distributions for each violation

In [69]:
# good first step
ri.groupby('violation').driver_age.describe()
Out[69]:
                       count       mean        std   min   25%   50%   75%   max
violation
Equipment            11007.0  31.781503  11.400900  16.0  23.0  28.0  38.0  89.0
Moving violation     16164.0  36.120020  13.185805  15.0  25.0  33.0  46.0  99.0
Other                 4204.0  39.536870  13.034639  16.0  28.0  39.0  49.0  87.0
Registration/plates   3427.0  32.803035  11.033675  16.0  24.0  30.0  40.0  74.0
Seat belt             2952.0  32.206301  11.213122  17.0  24.0  29.0  38.0  77.0
Speeding             48361.0  33.530097  12.821847  15.0  23.0  30.0  42.0  90.0
In [70]:
# histograms are excellent for displaying distributions
ri.driver_age.plot(kind='hist')
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612daf91d0>
In [71]:
# similar to a histogram
ri.driver_age.value_counts().sort_index().plot()
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x2612db4e978>
In [72]:
# can't use the plot method
ri.hist('driver_age', by='violation')
Out[72]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002612DB9E048>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612E4147F0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002612E650B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612E67AE10>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002612E6AD160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612E6AD198>]],
      dtype=object)
In [73]:
# what changed? how is this better or worse?
ri.hist('driver_age', by='violation', sharex=True)
Out[73]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002612E84B208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612E87E438>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002612E8A0978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612E8CBEF0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002612E8FD4A8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612E8FD4E0>]],
      dtype=object)
In [74]:
# what changed? how is this better or worse?
ri.hist('driver_age', by='violation', sharex=True, sharey=True)
Out[74]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002612F208160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612F2D6438>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002612F2FA828>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612F325C18>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002612F357048>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612F357080>]],
      dtype=object)

Lessons:

  • Use histograms to show distributions
  • Be conscious of axes when using grouped plots

13. Pretend you don't have the driver_age column, and create it from driver_age_raw (and call it new_age)

In [75]:
ri.head()
Out[75]:
stop_date stop_time driver_gender driver_age_raw driver_age driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop frisk stop_datetime stop_minutes
0 2005-01-02 01:55 M 1985.0 20.0 White Speeding Speeding False NaN Citation False 0-15 Min False NaN 2005-01-02 01:55:00 8.0
1 2005-01-18 08:15 M 1965.0 40.0 White Speeding Speeding False NaN Citation False 0-15 Min False NaN 2005-01-18 08:15:00 8.0
2 2005-01-23 23:15 M 1972.0 33.0 White Speeding Speeding False NaN Citation False 0-15 Min False NaN 2005-01-23 23:15:00 8.0
3 2005-02-20 17:15 M 1986.0 19.0 White Call for Service Other False NaN Arrest Driver True 16-30 Min False NaN 2005-02-20 17:15:00 23.0
4 2005-03-14 10:00 F 1984.0 21.0 White Speeding Speeding False NaN Citation False 0-15 Min False NaN 2005-03-14 10:00:00 8.0
In [76]:
# appears to be year of stop_date minus driver_age_raw
ri.tail()
Out[76]:
stop_date stop_time driver_gender driver_age_raw driver_age driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop frisk stop_datetime stop_minutes
91736 2015-12-31 20:27 M 1986.0 29.0 White Speeding Speeding False NaN Warning False 0-15 Min False NaN 2015-12-31 20:27:00 8.0
91737 2015-12-31 20:35 F 1982.0 33.0 White Equipment/Inspection Violation Equipment False NaN Warning False 0-15 Min False NaN 2015-12-31 20:35:00 8.0
91738 2015-12-31 20:45 M 1992.0 23.0 White Other Traffic Violation Moving violation False NaN Warning False 0-15 Min False NaN 2015-12-31 20:45:00 8.0
91739 2015-12-31 21:42 M 1993.0 22.0 White Speeding Speeding False NaN Citation False 0-15 Min False NaN 2015-12-31 21:42:00 8.0
91740 2015-12-31 22:46 M 1959.0 56.0 Hispanic Speeding Speeding False NaN Citation False 0-15 Min False NaN 2015-12-31 22:46:00 8.0
In [77]:
ri['new_age'] = ri.stop_datetime.dt.year - ri.driver_age_raw
In [78]:
# compare the distributions
ri[['driver_age', 'new_age']].hist()
Out[78]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002612F4BD3C8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002612F574F98>]],
      dtype=object)
In [79]:
# compare the summary statistics (focus on min and max)
ri[['driver_age', 'new_age']].describe()
Out[79]:
         driver_age       new_age
count  86120.000000  86414.000000
mean      34.011333     39.784294
std       12.738564    110.822145
min       15.000000  -6794.000000
25%       23.000000     24.000000
50%       31.000000     31.000000
75%       43.000000     43.000000
max       99.000000   2015.000000
In [80]:
# calculate how many ages are outside that range
ri[(ri.new_age < 15) | (ri.new_age > 99)].shape
Out[80]:
(294, 18)
In [81]:
# raw data given to the researchers
ri.driver_age_raw.isnull().sum()
Out[81]:
5327
In [82]:
# age computed by the researchers (has more missing values)
ri.driver_age.isnull().sum()
Out[82]:
5621
In [83]:
# what does this tell us? researchers set driver_age as missing if less than 15 or more than 99
5621-5327
Out[83]:
294
In [84]:
# driver_age_raw NOT MISSING, driver_age MISSING
ri[(ri.driver_age_raw.notnull()) & (ri.driver_age.isnull())].head()
Out[84]:
stop_date stop_time driver_gender driver_age_raw driver_age driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop frisk stop_datetime stop_minutes new_age
146 2005-10-05 08:50 M 0.0 NaN White Other Traffic Violation Moving violation False NaN Citation False 0-15 Min False NaN 2005-10-05 08:50:00 8.0 2005.0
281 2005-10-10 12:05 F 0.0 NaN White Other Traffic Violation Moving violation False NaN Warning False 0-15 Min False NaN 2005-10-10 12:05:00 8.0 2005.0
331 2005-10-12 07:50 M 0.0 NaN White Motorist Assist/Courtesy Other False NaN No Action False 0-15 Min False NaN 2005-10-12 07:50:00 8.0 2005.0
414 2005-10-17 08:32 M 2005.0 NaN White Other Traffic Violation Moving violation False NaN Citation False 0-15 Min False NaN 2005-10-17 08:32:00 8.0 0.0
455 2005-10-18 18:30 F 0.0 NaN White Speeding Speeding False NaN Warning False 0-15 Min False NaN 2005-10-18 18:30:00 8.0 2005.0
In [85]:
# set the ages outside that range as missing
ri.loc[(ri.new_age < 15) | (ri.new_age > 99), 'new_age'] = np.nan
In [86]:
ri.new_age.equals(ri.driver_age)
Out[86]:
True

Lessons:

  • Don't assume that the head and tail are representative of the data
  • Columns with missing values may still have bad data (driver_age_raw)
  • Data cleaning sometimes involves guessing (driver_age)
  • Use histograms for a sanity check
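The final cleaning step from exercise 13 in miniature, on toy data: out-of-range values are set to NaN with a boolean mask and .loc, and Series.equals verifies the result (unlike ==, it treats NaN as equal to NaN).

```python
import numpy as np
import pandas as pd

new_age = pd.Series([20.0, 2005.0, -6794.0, 56.0])
expected = pd.Series([20.0, np.nan, np.nan, 56.0])

# mark implausible ages as missing
new_age.loc[(new_age < 15) | (new_age > 99)] = np.nan

# equals() counts NaN == NaN as a match; the == operator does not
print(new_age.equals(expected))   # True
```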