NYPD Crash Data

The New York Police Department makes vehicle collision data available as PDFs. The NYPD Crash Data Band-Aid site scrapes this data and kindly republishes it in a structured form, as CSV and JSON files.

Let's see what we can make of this data.

First, let's load the data. The URL for the CSV file is http://nypd.openscrape.com/data/collisions.csv.gz. (Note: though it's labelled a CSV file, it's actually a TSV file -- TAB-separated values.)

In [1]:
import pandas as pd

# numpy, matplotlib and rcParams are used by the later cells, so import them up front
# to keep the notebook self-contained.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# From Pandas 0.12 onwards, you can just use a single line:
# data = pd.read_csv('http://nypd.openscrape.com/data/collisions.csv.gz', compression='gzip', sep='\t')

# But for older versions, we need to download the file first.
import os.path
import urllib

# Let's download it only if it does not already exist. Save ourselves some bandwidth.
if not os.path.exists('collisions.csv.gz'):
    urllib.urlretrieve('http://nypd.openscrape.com/data/collisions.csv.gz', 'collisions.csv.gz')

data = pd.read_csv('collisions.csv.gz', compression='gzip', sep='\t')
data
data
Out[1]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 230138 entries, 0 to 230137
Data columns (total 68 columns):
borocode                          230138  non-null values
precinct                          230138  non-null values
year                              230138  non-null values
month                             230138  non-null values
lon                               228297  non-null values
lat                               228297  non-null values
street1                           230138  non-null values
street2                           230138  non-null values
collisions                        230138  non-null values
persons_involved                  230138  non-null values
collisions_with_injuries          230138  non-null values
motorists_injured                 230138  non-null values
passengers_injured                230138  non-null values
cyclists_injured                  230138  non-null values
pedestr_injured                   230138  non-null values
total_injured                     230138  non-null values
motorists_killed                  230138  non-null values
passengers_killed                 230138  non-null values
cyclists_killed                   230138  non-null values
pedestr_killed                    230138  non-null values
total_killed                      230138  non-null values
ambulance                         999  non-null values
bicycle                           7167  non-null values
bus                               7846  non-null values
fire_truck                        522  non-null values
large_com_veh_6_or_more_tires     6417  non-null values
livery_vehicle                    5204  non-null values
motorcycle                        2678  non-null values
other                             11458  non-null values
passenger_vehicle                 170780  non-null values
pedicab                           31  non-null values
pick_up_truck                     6539  non-null values
scooter                           199  non-null values
small_com_veh_4_tires             7231  non-null values
sport_utility_station_wagon       79000  non-null values
taxi_vehicle                      14690  non-null values
unknown                           35098  non-null values
van                               15283  non-null values
aggressive_driving_road_rage      1096  non-null values
alcohol_involvement               2255  non-null values
backing_unsafely                  7844  non-null values
cell_phone_hand_held              145  non-null values
driver_inattention_distraction    0  non-null values
driver_inexperience               35676  non-null values
drugs_illegal                     162  non-null values
eating_or_drinking                2  non-null values
err_confusn_ped_bike_other_ped    0  non-null values
failure_to_keep_right             0  non-null values
failure_to_yield_right_of_way     15120  non-null values
fatigued_drowsy                   344  non-null values
fell_asleep                       539  non-null values
following_too_closely             12588  non-null values
illness                           420  non-null values
listening_using_headphones        2  non-null values
lost_consciousness                207  non-null values
other_electronic_device           5704  non-null values
other_uninvolved_vehicle          0  non-null values
outside_car_distraction           680  non-null values
passenger_distraction             865  non-null values
passing_or_lane_usage             8088  non-null values
physical_disability               122  non-null values
prescription_medication           36  non-null values
texting                           2  non-null values
traffic_control_disregarded       4183  non-null values
turning_improperly                6787  non-null values
unsafe_lane_changing              9417  non-null values
unsafe_speed                      0  non-null values
vehicle_vandalism                 14  non-null values
dtypes: float64(49), int64(17), object(2)
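
As an aside, if you're following along on Python 3 with a recent pandas, the download-and-read dance above collapses quite a bit: read_csv can fetch the URL directly and infers the gzip compression from the .gz extension. A minimal sketch:

import pandas as pd

# read_csv fetches the URL itself; compression is inferred from the .gz extension
data = pd.read_csv('http://nypd.openscrape.com/data/collisions.csv.gz', sep='\t')

# If you'd rather keep a local copy first, urlretrieve now lives in urllib.request:
# from urllib.request import urlretrieve
# urlretrieve('http://nypd.openscrape.com/data/collisions.csv.gz', 'collisions.csv.gz')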

That's a lot of columns. To get a sense of what's happening here, let's look at the first few rows. There are a couple of ways of doing this. One is to explicitly convert to HTML and display it using IPython.display.

In [2]:
from IPython.display import HTML
HTML(data.head().to_html())
Out[2]:
borocode precinct year month lon lat street1 street2 collisions persons_involved collisions_with_injuries motorists_injured passengers_injured cyclists_injured pedestr_injured total_injured motorists_killed passengers_killed cyclists_killed pedestr_killed total_killed ambulance bicycle bus fire_truck large_com_veh_6_or_more_tires livery_vehicle motorcycle other passenger_vehicle pedicab pick_up_truck scooter small_com_veh_4_tires sport_utility_station_wagon taxi_vehicle unknown van aggressive_driving_road_rage alcohol_involvement backing_unsafely cell_phone_hand_held driver_inattention_distraction driver_inexperience drugs_illegal eating_or_drinking err_confusn_ped_bike_other_ped failure_to_keep_right failure_to_yield_right_of_way fatigued_drowsy fell_asleep following_too_closely illness listening_using_headphones lost_consciousness other_electronic_device other_uninvolved_vehicle outside_car_distraction passenger_distraction passing_or_lane_usage physical_disability prescription_medication texting traffic_control_disregarded turning_improperly unsafe_lane_changing unsafe_speed vehicle_vandalism
0 3 60 2011 8 -73.988165 40.590100 27 AVENUE HARWAY AVENUE 1 2 1 2 0 0 0 2 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3 60 2011 8 -73.988310 40.586507 28 AVENUE CROPSEY AVENUE 1 2 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 60 2011 8 -73.985962 40.588771 28 AVENUE HARWAY AVENUE 2 5 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 60 2011 8 -73.984093 40.590566 28 AVENUE STILLWELL AVENUE 1 2 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3 60 2011 8 -73.988643 40.574282 38 STREET SURF AVENUE 1 1 0 0 0 0 0 0 0 0 0 0 0 NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

The other way is to tell Pandas that we're OK with really long lines, so that it can go ahead and display lots of columns.

In [3]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.line_width', 10000)
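# ('display.line_width' is the older name for this option; recent pandas versions call it 'display.width')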
data.head()
Out[3]:
borocode precinct year month lon lat street1 street2 collisions persons_involved collisions_with_injuries motorists_injured passengers_injured cyclists_injured pedestr_injured total_injured motorists_killed passengers_killed cyclists_killed pedestr_killed total_killed ambulance bicycle bus fire_truck large_com_veh_6_or_more_tires livery_vehicle motorcycle other passenger_vehicle pedicab pick_up_truck scooter small_com_veh_4_tires sport_utility_station_wagon taxi_vehicle unknown van aggressive_driving_road_rage alcohol_involvement backing_unsafely cell_phone_hand_held driver_inattention_distraction driver_inexperience drugs_illegal eating_or_drinking err_confusn_ped_bike_other_ped failure_to_keep_right failure_to_yield_right_of_way fatigued_drowsy fell_asleep following_too_closely illness listening_using_headphones lost_consciousness other_electronic_device other_uninvolved_vehicle outside_car_distraction passenger_distraction passing_or_lane_usage physical_disability prescription_medication texting traffic_control_disregarded turning_improperly unsafe_lane_changing unsafe_speed vehicle_vandalism
0 3 60 2011 8 -73.988165 40.590100 27 AVENUE HARWAY AVENUE 1 2 1 2 0 0 0 2 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3 60 2011 8 -73.988310 40.586507 28 AVENUE CROPSEY AVENUE 1 2 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 60 2011 8 -73.985962 40.588771 28 AVENUE HARWAY AVENUE 2 5 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 60 2011 8 -73.984093 40.590566 28 AVENUE STILLWELL AVENUE 1 2 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3 60 2011 8 -73.988643 40.574282 38 STREET SURF AVENUE 1 1 0 0 0 0 0 0 0 0 0 0 0 NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Let's take a look at where most of the accidents happen. In other words, let's plot the latitudes and longitudes as a heatmap.

I didn't quite know how to plot a 2D heatmap, but as with all things in life, Google and StackOverflow are your friends. That led to http://stackoverflow.com/a/2461029/100904, which suggests combining np.histogram2d with imshow:

In [4]:
heatmap, xedges, yedges = np.histogram2d(data['lon'], data['lat'], bins=50)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-744ec5a742e9> in <module>()
----> 1 heatmap, xedges, yedges = np.histogram2d(data['lon'], data['lat'], bins=50)

D:\anaconda\64\lib\site-packages\numpy\lib\twodim_base.pyc in histogram2d(x, y, bins, range, normed, weights)
    609         xedges = yedges = asarray(bins, float)
    610         bins = [xedges, yedges]
--> 611     hist, edges = histogramdd([x,y], bins, range, normed, weights)
    612     return hist, edges[0], edges[1]
    613 

D:\anaconda\64\lib\site-packages\numpy\lib\function_base.pyc in histogramdd(sample, bins, range, normed, weights)
    356         mindiff = dedges[i].min()
    357         if not np.isinf(mindiff):
--> 358             decimal = int(-log10(mindiff)) + 6
    359             # Find which points are on the rightmost edge.
    360             on_edge = where(around(sample[:,i], decimal) == around(edges[i][-1],

ValueError: cannot convert float NaN to integer

We get an error when computing the 2D histogram of longitude and latitude. This looks like it's because some of the coordinate values are NaN.
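
A quick count confirms it -- isnull().sum() tallies the missing values per column:

data[['lon', 'lat']].isnull().sum()   # lon and lat are each missing 230138 - 228297 = 1841 values

Let's get rid of those rows.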

In [7]:
locations = data[['lon', 'lat']].dropna()
locations
Out[7]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 228297 entries, 0 to 230137
Data columns (total 2 columns):
lon    228297  non-null values
lat    228297  non-null values
dtypes: float64(2)

Now we can create the 2D histogram and plot it.

In [8]:
heatmap, xedges, yedges = np.histogram2d(locations['lon'], locations['lat'], bins=50)
plt.imshow(heatmap, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
Out[8]:
<matplotlib.image.AxesImage at 0xc269240>

That's all clustered in one corner, most likely because there are a few outliers. Could we clean up these outliers? Maybe we can restrict ourselves to latitudes between 40.5 and 41 and longitudes between -74.5 and -73.5. Let's plot the distribution of the data and see what it looks like.

In [9]:
locations['lon'].hist()
Out[9]:
<matplotlib.axes.AxesSubplot at 0x13f58c18>

Not very helpful. Let's increase the number of bins, and the size of the images going forward.

In [10]:
rcParams['figure.figsize'] = 15, 5
locations['lon'].hist(bins=100)
Out[10]:
<matplotlib.axes.AxesSubplot at 0x1481b160>
In [11]:
locations['lat'].hist(bins=100)
Out[11]:
<matplotlib.axes.AxesSubplot at 0x12c82748>

Hmm... might have been nice to see those side by side. Fortunately, Pandas has some good documentation on subplots: http://pandas.pydata.org/pandas-docs/dev/visualization.html#targeting-different-subplots

In [12]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)
locations['lon'].hist(bins=100, ax=axes[0])
locations['lat'].hist(bins=100, ax=axes[1])
Out[12]:
<matplotlib.axes.AxesSubplot at 0x147ba5f8>

Let's get rid of the outliers.

In [13]:
coords = locations[(locations['lon'] < -73.5) & (locations['lon'] > -74.5) &
                   (locations['lat'] < +41.0) & (locations['lat'] > +40.5)]
coords
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 228292 entries, 0 to 230137
Data columns (total 2 columns):
lon    228292  non-null values
lat    228292  non-null values
dtypes: float64(2)

This can be plotted the same way.

In [14]:
heatmap, xedges, yedges = np.histogram2d(coords['lon'], coords['lat'], bins=50)
plt.imshow(heatmap, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
Out[14]:
<matplotlib.image.AxesImage at 0xf1d3a58>

That's a bit fuzzy, and it doesn't look much like New York either. Let's flip the longitude axis and increase the number of bins. We can also change the color scale, so let's do that.

In [15]:
rcParams['figure.figsize'] = [10,10]
heatmap, xedges, yedges = np.histogram2d(-coords['lon'], coords['lat'], bins=500)
plt.imshow(heatmap ** .5, cmap='YlOrBr', aspect=1.5, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
Out[15]:
<matplotlib.image.AxesImage at 0xfd1cb38>
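
As an aside on orientation: np.histogram2d puts its first argument along the rows of the result, and imshow draws rows as the vertical axis with the origin at the top, so the raw heatmap comes out rotated and mirrored relative to a conventional map. Transposing the histogram and setting origin='lower' gives a north-up, east-right picture without having to negate the longitudes. A sketch over the same coords:

heatmap, xedges, yedges = np.histogram2d(coords['lon'], coords['lat'], bins=500)
plt.imshow(heatmap.T ** .5, cmap='YlOrBr', origin='lower',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]],
           aspect=1.3)  # ~1/cos(40.7 deg): a degree of latitude spans more ground than a degree of longitude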

Summary statistics

Let's put together a news report on what's happened so far. We'll ask basic questions like:

  1. How many collisions were there?
  2. How many people were injured?
  3. How many were killed?

Then let's break this up by month.

In [16]:
# How many collisions?
data.collisions.sum()
Out[16]:
337930
In [17]:
# How many were injured?
data.total_injured.sum()
Out[17]:
92768
In [18]:
# How many were killed?
data.total_killed.sum()
Out[18]:
431
In [19]:
# Let's break this up by month.
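# (Newer pandas wants an explicit list of columns here: [['collisions', 'total_injured', 'total_killed']].)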
data.groupby(['year', 'month'])['collisions', 'total_injured', 'total_killed'].sum()
Out[19]:
collisions total_injured total_killed
year month
2011 8 14196 4013 14
9 14388 4104 11
10 14934 4184 20
11 14094 4084 14
12 14503 4063 23
2012 1 13166 3516 15
2 12194 3266 13
3 13888 3858 15
4 13839 3958 18
5 15115 4099 27
6 15214 4105 21
7 14142 4053 21
8 14205 3928 22
9 13819 3924 21
10 14152 3837 18
11 13332 3410 13
12 14345 3877 22
2013 1 13092 3387 25
2 12174 2995 17
3 13848 3539 18
4 13809 3731 13
5 15583 4217 12
6 15177 4262 17
7 14721 4358 21

Of course, it's good to be able to see what this looks like.

In [20]:
data.groupby(['year', 'month'])['collisions', 'total_injured', 'total_killed'].sum().plot(subplots=True, kind='line', figsize=(15,9))
Out[20]:
array([<matplotlib.axes.AxesSubplot object at 0x000000000FD10710>,
       <matplotlib.axes.AxesSubplot object at 0x000000000FD496A0>,
       <matplotlib.axes.AxesSubplot object at 0x000000001451A908>], dtype=object)

Breaking up into details

We can now ask questions like: are there specific boroughs or precincts that are more prone to accidents than others?

In [21]:
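# (Series.order() is old pandas; in current versions, use sort_values() instead.)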
data.groupby('borocode')['collisions'].sum().order().plot(kind='barh', figsize=[10,3])
Out[21]:
<matplotlib.axes.AxesSubplot at 0x144a5dd8>
In [22]:
data.groupby('precinct')['collisions'].sum().order().plot(kind='barh', figsize=[10,15])
Out[22]:
<matplotlib.axes.AxesSubplot at 0x144bacc0>

What do the numbers mean? A careful look at the scraper source code reveals that these are precinct numbers, and that they map directly onto precinct names -- with a few exceptions.

Let's do that mapping:

In [23]:
precinct = data['precinct'].unique()
precinct
Out[23]:
array([ 60,  61,  62,  63,  66,  67,  68,  69,  70,  71,  72,  73,  75,
        76,  77,  78,  79,  81,  83,  84,  88,  90,  94,  40,  41,  42,
        43,  44,  45,  46,  47,  48,  49,  50,  52,   1,   5,   6,   7,
         9,  10,  13,  17,  19,  20,  22,  23,  24,  25,  26,  28,  30,
        32,  33,  34, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 120, 122, 123,  14,  18, 121], dtype=int64)

Let's create a mapping function.

In [24]:
def precinct_name(number):
    if number == 14:
        return 'Midtown South'
    elif number == 18:
        return 'Midtown North'
    elif number == 22:
        return 'Central Park'
    elif number % 10 == 1 and number % 100 != 11:
        return '%dst precinct' % number
    elif number % 10 == 2 and number % 100 != 12:
        return '%dnd precinct' % number
    elif number % 10 == 3 and number % 100 != 13:
        return '%drd precinct' % number
    else:
        return '%dth precinct' % number
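
A quick spot-check of the special cases and the ordinal suffixes:

[precinct_name(n) for n in (14, 22, 101, 111, 63)]
# -> ['Midtown South', 'Central Park', '101st precinct', '111th precinct', '63rd precinct']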

This needs to be applied to collisions by precinct.

In [25]:
data.groupby('precinct')['collisions'].sum().order().rename(precinct_name).plot(kind='barh', figsize=[10, 15])
Out[25]:
<matplotlib.axes.AxesSubplot at 0xfd27ac8>

So Central Park looks safe. But what if it's because no one lives there? In other words, what's the number of accidents per capita?

First, we'll need the number of people in each precinct. A search leads to http://johnkeefe.net/nyc-police-precinct-and-census-data, which points to a Google Fusion Table with NYC census data. Aggregating that by precinct gives the population of each precinct, which I'm reading in below.

In [26]:
from cStringIO import StringIO
population = pd.read_table(StringIO('''precinct	population
1	66679
5	52568
6	62226
7	56355
9	76443
10	50180
13	93640
14	20651
17	79126
18	54066
19	208259
20	102624
22	25
23	73106
24	106460
25	47405
26	49508
28	44781
30	60685
32	70942
33	77645
34	112375
40	91497
41	52246
42	79762
43	172122
44	146441
45	120833
46	128200
47	152374
48	83266
49	114712
50	101720
52	139307
60	104278
61	159645
62	181981
63	108646
66	191382
67	155252
68	124491
69	84480
70	160664
71	98429
72	126230
73	86468
75	183328
76	43694
77	96309
78	61099
79	90263
81	62722
83	112634
84	48196
88	51421
90	116836
94	56247
100	47913
101	67065
102	144008
103	105803
104	170190
105	188582
106	122441
107	151107
108	113200
109	247354
110	172634
111	116431
112	112277
113	120132
114	202766
115	171576
120	175876
122	194822
123	98032
'''))

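# (Old-pandas idioms above and below: on Python 3, use io.StringIO instead of cStringIO,
#  and DataFrame.sort_values('population') instead of DataFrame.sort('population').)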
population.sort('population').head()
Out[26]:
precinct population
12 22 25
7 14 20651
47 76 43694
17 28 44781
15 25 47405

Precinct #22's population looks suspiciously low. That's Central Park, if you remember -- and presumably hardly anyone actually lives in Central Park, so the figure is plausible.

Let's merge this data into the accidents by precinct. We'll start with the list of collisions:

In [27]:
collisions = pd.DataFrame({'collisions': data.groupby('precinct')['collisions'].sum().order()})
collisions.head()
Out[27]:
collisions
precinct
22 144
121 219
100 1294
30 1501
101 1523

To merge the population data, let's ensure that it has the same index:

In [28]:
population = population.set_index('precinct')
population.head()
Out[28]:
population
precinct
1 66679
5 52568
6 62226
7 56355
9 76443

Now that they have the same index, we can simply assign the population column. Pandas automatically takes care of aligning the indexes.

In [29]:
collisions['population'] = population['population']
collisions
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 22 to 17
Data columns (total 2 columns):
collisions    77  non-null values
population    76  non-null values
dtypes: float64(1), int64(1)

Did you notice that population has only 76 non-null values, whereas the dataset has 77 precincts? Which one is missing?

In [30]:
collisions[collisions['population'].isnull()]
Out[30]:
collisions population
precinct
121 219 NaN

Can't be helped. We just don't have population data for the 121st precinct. Let's move on and plot the rest.

In [31]:
(collisions['collisions'].astype(float) / collisions['population']).rename(precinct_name).order().plot(kind='barh', figsize=[10, 15])
Out[31]:
<matplotlib.axes.AxesSubplot at 0x134002e8>

Which looks ridiculous, of course. Let's ignore Central Park, and hope that the accidents happening there are because of people passing through rather than the 25 people that live there. So if we remove it (along with the 121st precinct, for which we have no population)...

In [32]:
precincts = collisions.drop([22, 121])
(precincts['collisions'].astype(float) / precincts['population']).rename(precinct_name).order().plot(kind='barh', figsize=[10, 15])
Out[32]:
<matplotlib.axes.AxesSubplot at 0x11d2d5f8>

... we would be rather worried about the 17th Precinct.
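
For a quick numeric check, the same per-capita rates can be listed directly -- the tail of the sorted series gives the worst offenders (a sketch in the same old-pandas idiom; with a current pandas, sort_values() replaces order()):

rates = precincts['collisions'].astype(float) / precincts['population']
rates.rename(precinct_name).order().tail(10)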