NYPD Crash Data

The New York Police Department makes vehicle collision data available as PDFs. The NYPD Crash Data Band-Aid site scrapes this data and kindly republishes it in a structured form, as CSV and JSON files.

Let's see what we can make of this data.

First, let's load the data. The URL for the CSV file is http://nypd.openscrape.com/data/collisions.csv.gz. (Note: though it's labelled a CSV file, it's actually a TSV file -- TAB-separated values.)

In [1]:
import pandas as pd

# numpy, matplotlib and rcParams are used by the later cells, so import them up front
# to keep the notebook self-contained.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# From Pandas 0.12 onwards, you can just use a single line:
# data = pd.read_csv('http://nypd.openscrape.com/data/collisions.csv.gz', compression='gzip', sep='\t')

# But for older versions, we need to download the file first.
import os.path
import urllib

# Let's download it only if it does not already exist. Save ourselves some bandwidth.
if not os.path.exists('collisions.csv.gz'):
    urllib.urlretrieve('http://nypd.openscrape.com/data/collisions.csv.gz', 'collisions.csv.gz')

data = pd.read_csv('collisions.csv.gz', compression='gzip', sep='\t')
data
data
Out[1]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 230138 entries, 0 to 230137
Data columns (total 68 columns):
borocode                          230138  non-null values
precinct                          230138  non-null values
year                              230138  non-null values
month                             230138  non-null values
lon                               228297  non-null values
lat                               228297  non-null values
street1                           230138  non-null values
street2                           230138  non-null values
collisions                        230138  non-null values
persons_involved                  230138  non-null values
collisions_with_injuries          230138  non-null values
motorists_injured                 230138  non-null values
passengers_injured                230138  non-null values
cyclists_injured                  230138  non-null values
pedestr_injured                   230138  non-null values
total_injured                     230138  non-null values
motorists_killed                  230138  non-null values
passengers_killed                 230138  non-null values
cyclists_killed                   230138  non-null values
pedestr_killed                    230138  non-null values
total_killed                      230138  non-null values
ambulance                         999  non-null values
bicycle                           7167  non-null values
bus                               7846  non-null values
fire_truck                        522  non-null values
large_com_veh_6_or_more_tires     6417  non-null values
livery_vehicle                    5204  non-null values
motorcycle                        2678  non-null values
other                             11458  non-null values
passenger_vehicle                 170780  non-null values
pedicab                           31  non-null values
pick_up_truck                     6539  non-null values
scooter                           199  non-null values
small_com_veh_4_tires             7231  non-null values
sport_utility_station_wagon       79000  non-null values
taxi_vehicle                      14690  non-null values
unknown                           35098  non-null values
van                               15283  non-null values
aggressive_driving_road_rage      1096  non-null values
alcohol_involvement               2255  non-null values
backing_unsafely                  7844  non-null values
cell_phone_hand_held              145  non-null values
driver_inattention_distraction    0  non-null values
driver_inexperience               35676  non-null values
drugs_illegal                     162  non-null values
eating_or_drinking                2  non-null values
err_confusn_ped_bike_other_ped    0  non-null values
failure_to_keep_right             0  non-null values
failure_to_yield_right_of_way     15120  non-null values
fatigued_drowsy                   344  non-null values
fell_asleep                       539  non-null values
following_too_closely             12588  non-null values
illness                           420  non-null values
listening_using_headphones        2  non-null values
lost_consciousness                207  non-null values
other_electronic_device           5704  non-null values
other_uninvolved_vehicle          0  non-null values
outside_car_distraction           680  non-null values
passenger_distraction             865  non-null values
passing_or_lane_usage             8088  non-null values
physical_disability               122  non-null values
prescription_medication           36  non-null values
texting                           2  non-null values
traffic_control_disregarded       4183  non-null values
turning_improperly                6787  non-null values
unsafe_lane_changing              9417  non-null values
unsafe_speed                      0  non-null values
vehicle_vandalism                 14  non-null values
dtypes: float64(49), int64(17), object(2)
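
As an aside, if you're following along on Python 3 with a recent pandas, the download-and-read dance above collapses quite a bit: read_csv can fetch the URL directly and infers the gzip compression from the .gz extension. A minimal sketch:

import pandas as pd

# read_csv fetches the URL itself; compression is inferred from the .gz extension
data = pd.read_csv('http://nypd.openscrape.com/data/collisions.csv.gz', sep='\t')

# If you'd rather keep a local copy first, urlretrieve now lives in urllib.request:
# from urllib.request import urlretrieve
# urlretrieve('http://nypd.openscrape.com/data/collisions.csv.gz', 'collisions.csv.gz')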

That's a lot of columns. To get a sense of what's happening here, let's look at the first few rows. There are a couple of ways of doing this. One is to explicitly convert to HTML and display it using IPython.display.

In [2]:
from IPython.display import HTML
HTML(data.head().to_html())
Out[2]:
borocode precinct year month lon lat street1 street2 collisions persons_involved collisions_with_injuries motorists_injured passengers_injured cyclists_injured pedestr_injured total_injured motorists_killed passengers_killed cyclists_killed pedestr_killed total_killed ambulance bicycle bus fire_truck large_com_veh_6_or_more_tires livery_vehicle motorcycle other passenger_vehicle pedicab pick_up_truck scooter small_com_veh_4_tires sport_utility_station_wagon taxi_vehicle unknown van aggressive_driving_road_rage alcohol_involvement backing_unsafely cell_phone_hand_held driver_inattention_distraction driver_inexperience drugs_illegal eating_or_drinking err_confusn_ped_bike_other_ped failure_to_keep_right failure_to_yield_right_of_way fatigued_drowsy fell_asleep following_too_closely illness listening_using_headphones lost_consciousness other_electronic_device other_uninvolved_vehicle outside_car_distraction passenger_distraction passing_or_lane_usage physical_disability prescription_medication texting traffic_control_disregarded turning_improperly unsafe_lane_changing unsafe_speed vehicle_vandalism
0 3 60 2011 8 -73.988165 40.590100 27 AVENUE HARWAY AVENUE 1 2 1 2 0 0 0 2 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3 60 2011 8 -73.988310 40.586507 28 AVENUE CROPSEY AVENUE 1 2 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 60 2011 8 -73.985962 40.588771 28 AVENUE HARWAY AVENUE 2 5 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 60 2011 8 -73.984093 40.590566 28 AVENUE STILLWELL AVENUE 1 2 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3 60 2011 8 -73.988643 40.574282 38 STREET SURF AVENUE 1 1 0 0 0 0 0 0 0 0 0 0 0 NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

The other way is to tell Pandas that we're OK with really long lines, so that it can go ahead and display lots of columns.

In [3]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.line_width', 10000)
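# ('display.line_width' is the older name for this option; recent pandas versions call it 'display.width')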
data.head()
Out[3]:
borocode precinct year month lon lat street1 street2 collisions persons_involved collisions_with_injuries motorists_injured passengers_injured cyclists_injured pedestr_injured total_injured motorists_killed passengers_killed cyclists_killed pedestr_killed total_killed ambulance bicycle bus fire_truck large_com_veh_6_or_more_tires livery_vehicle motorcycle other passenger_vehicle pedicab pick_up_truck scooter small_com_veh_4_tires sport_utility_station_wagon taxi_vehicle unknown van aggressive_driving_road_rage alcohol_involvement backing_unsafely cell_phone_hand_held driver_inattention_distraction driver_inexperience drugs_illegal eating_or_drinking err_confusn_ped_bike_other_ped failure_to_keep_right failure_to_yield_right_of_way fatigued_drowsy fell_asleep following_too_closely illness listening_using_headphones lost_consciousness other_electronic_device other_uninvolved_vehicle outside_car_distraction passenger_distraction passing_or_lane_usage physical_disability prescription_medication texting traffic_control_disregarded turning_improperly unsafe_lane_changing unsafe_speed vehicle_vandalism
0 3 60 2011 8 -73.988165 40.590100 27 AVENUE HARWAY AVENUE 1 2 1 2 0 0 0 2 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3 60 2011 8 -73.988310 40.586507 28 AVENUE CROPSEY AVENUE 1 2 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 60 2011 8 -73.985962 40.588771 28 AVENUE HARWAY AVENUE 2 5 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 60 2011 8 -73.984093 40.590566 28 AVENUE STILLWELL AVENUE 1 2 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3 60 2011 8 -73.988643 40.574282 38 STREET SURF AVENUE 1 1 0 0 0 0 0 0 0 0 0 0 0 NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Let's take a look at where most of the accidents happen. In other words, let's plot the latitudes and longitudes as a heatmap.

I didn't quite know how to plot a 2D heatmap, but as with all things in life, Google and StackOverflow are your friends. That led to http://stackoverflow.com/a/2461029/100904, which suggests combining np.histogram2d with imshow:

In [4]:
heatmap, xedges, yedges = np.histogram2d(data['lon'], data['lat'], bins=50)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-744ec5a742e9> in <module>()
----> 1 heatmap, xedges, yedges = np.histogram2d(data['lon'], data['lat'], bins=50)

D:\anaconda\64\lib\site-packages\numpy\lib\twodim_base.pyc in histogram2d(x, y, bins, range, normed, weights)
    609         xedges = yedges = asarray(bins, float)
    610         bins = [xedges, yedges]
--> 611     hist, edges = histogramdd([x,y], bins, range, normed, weights)
    612     return hist, edges[0], edges[1]
    613 

D:\anaconda\64\lib\site-packages\numpy\lib\function_base.pyc in histogramdd(sample, bins, range, normed, weights)
    356         mindiff = dedges[i].min()
    357         if not np.isinf(mindiff):
--> 358             decimal = int(-log10(mindiff)) + 6
    359             # Find which points are on the rightmost edge.
    360             on_edge = where(around(sample[:,i], decimal) == around(edges[i][-1],

ValueError: cannot convert float NaN to integer

We get an error when computing the 2D histogram of longitude and latitude. This looks like it's because some of the coordinate values are NaN.
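
A quick count confirms it -- isnull().sum() tallies the missing values per column:

data[['lon', 'lat']].isnull().sum()   # lon and lat are each missing 230138 - 228297 = 1841 values

Let's get rid of those rows.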

In [7]:
locations = data[['lon', 'lat']].dropna()
locations
Out[7]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 228297 entries, 0 to 230137
Data columns (total 2 columns):
lon    228297  non-null values
lat    228297  non-null values
dtypes: float64(2)

Now we can create the 2D histogram and plot it.

In [8]:
heatmap, xedges, yedges = np.histogram2d(locations['lon'], locations['lat'], bins=50)
plt.imshow(heatmap, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
Out[8]:
<matplotlib.image.AxesImage at 0xc269240>

That's all clustered in one corner, most likely because there are a few outliers. Could we clean up these outliers? Maybe we can restrict ourselves to latitudes between 40.5 and 41 and longitudes between -74.5 and -73.5. Let's plot the distribution of the data and see what it looks like.

In [9]:
locations['lon'].hist()
Out[9]:
<matplotlib.axes.AxesSubplot at 0x13f58c18>

Not very helpful. Let's increase the number of bins, and the size of the images going forward.

In [10]:
rcParams['figure.figsize'] = 15, 5
locations['lon'].hist(bins=100)
Out[10]:
<matplotlib.axes.AxesSubplot at 0x1481b160>
In [11]:
locations['lat'].hist(bins=100)
Out[11]:
<matplotlib.axes.AxesSubplot at 0x12c82748>

Hmm... might have been nice to see those side by side. Fortunately, Pandas has some good documentation on subplots: http://pandas.pydata.org/pandas-docs/dev/visualization.html#targeting-different-subplots

In [12]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)
locations['lon'].hist(bins=100, ax=axes[0])
locations['lat'].hist(bins=100, ax=axes[1])
Out[12]:
<matplotlib.axes.AxesSubplot at 0x147ba5f8>

Let's get rid of the outliers.

In [13]:
coords = locations[(locations['lon'] < -73.5) & (locations['lon'] > -74.5) &
                   (locations['lat'] < +41.0) & (locations['lat'] > +40.5)]
coords
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 228292 entries, 0 to 230137
Data columns (total 2 columns):
lon    228292  non-null values
lat    228292  non-null values
dtypes: float64(2)

This can be plotted the same way.

In [14]:
heatmap, xedges, yedges = np.histogram2d(coords['lon'], coords['lat'], bins=50)
plt.imshow(heatmap, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
Out[14]:
<matplotlib.image.AxesImage at 0xf1d3a58>

That's a bit fuzzy, and it doesn't look much like New York either. Let's flip the longitude axis and increase the number of bins. We can also change the color scale, so let's do that.

In [15]:
rcParams['figure.figsize'] = [10,10]
heatmap, xedges, yedges = np.histogram2d(-coords['lon'], coords['lat'], bins=500)
plt.imshow(heatmap ** .5, cmap='YlOrBr', aspect=1.5, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
Out[15]:
<matplotlib.image.AxesImage at 0xfd1cb38>
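
As an aside on orientation: np.histogram2d puts its first argument along the rows of the result, and imshow draws rows as the vertical axis with the origin at the top, so the raw heatmap comes out rotated and mirrored relative to a conventional map. Transposing the histogram and setting origin='lower' gives a north-up, east-right picture without having to negate the longitudes. A sketch over the same coords:

heatmap, xedges, yedges = np.histogram2d(coords['lon'], coords['lat'], bins=500)
plt.imshow(heatmap.T ** .5, cmap='YlOrBr', origin='lower',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]],
           aspect=1.3)  # ~1/cos(40.7 deg): a degree of latitude spans more ground than a degree of longitude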

Summary statistics

Let's put together a news report on what's happened so far. We'll ask basic questions like:

  1. How many collisions were there?
  2. How many people were injured?
  3. How many were killed?

Then let's break this up by month.

In [16]:
# How many collisions?
data.collisions.sum()
Out[16]:
337930
In [17]:
# How many were injured?
data.total_injured.sum()
Out[17]:
92768
In [18]:
# How many were killed?
data.total_killed.sum()
Out[18]:
431
In [19]:
# Let's break this up by month.
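# (Newer pandas wants an explicit list of columns here: [['collisions', 'total_injured', 'total_killed']].)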
data.groupby(['year', 'month'])['collisions', 'total_injured', 'total_killed'].sum()
Out[19]:
collisions total_injured total_killed
year month
2011 8 14196 4013 14
9 14388 4104 11
10 14934 4184 20
11 14094 4084 14
12 14503 4063 23
2012 1 13166 3516 15
2 12194 3266 13
3 13888 3858 15
4 13839 3958 18
5 15115 4099 27
6 15214 4105 21
7 14142 4053 21
8 14205 3928 22
9 13819 3924 21
10 14152 3837 18
11 13332 3410 13
12 14345 3877 22
2013 1 13092 3387 25
2 12174 2995 17
3 13848 3539 18
4 13809 3731 13
5 15583 4217 12
6 15177 4262 17
7 14721 4358 21

Of course, it's good to be able to see what this looks like.

In [20]:
data.groupby(['year', 'month'])['collisions', 'total_injured', 'total_killed'].sum().plot(subplots=True, kind='line', figsize=(15,9))
Out[20]:
array([<matplotlib.axes.AxesSubplot object at 0x000000000FD10710>,
       <matplotlib.axes.AxesSubplot object at 0x000000000FD496A0>,
       <matplotlib.axes.AxesSubplot object at 0x000000001451A908>], dtype=object)

Breaking up into details

We can now ask questions like: are there specific boroughs or precincts that are more prone to accidents than others?

In [21]:
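# (Series.order() is old pandas; in current versions, use sort_values() instead.)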
data.groupby('borocode')['collisions'].sum().order().plot(kind='barh', figsize=[10,3])
Out[21]:
<matplotlib.axes.AxesSubplot at 0x144a5dd8>
In [22]:
data.groupby('precinct')['collisions'].sum().order().plot(kind='barh', figsize=[10,15])
Out[22]:
<matplotlib.axes.AxesSubplot at 0x144bacc0>

What do the numbers mean? A careful look at the scraper source code reveals that these are precinct numbers, and that they map directly onto precinct names -- with a few exceptions.

Let's do that mapping:

In [23]:
precinct = data['precinct'].unique()
precinct
Out[23]:
array([ 60,  61,  62,  63,  66,  67,  68,  69,  70,  71,  72,  73,  75,
        76,  77,  78,  79,  81,  83,  84,  88,  90,  94,  40,  41,  42,
        43,  44,  45,  46,  47,  48,  49,  50,  52,   1,   5,   6,   7,
         9,  10,  13,  17,  19,  20,  22,  23,  24,  25,  26,  28,  30,
        32,  33,  34, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 120, 122, 123,  14,  18, 121], dtype=int64)

Let's create a mapping function.

In [24]:
def precinct_name(number):
    if number == 14:
        return 'Midtown South'
    elif number == 18:
        return 'Midtown North'
    elif number == 22:
        return 'Central Park'
    elif number % 10 == 1 and number % 100 != 11:
        return '%dst precinct' % number
    elif number % 10 == 2 and number % 100 != 12:
        return '%dnd precinct' % number
    elif number % 10 == 3 and number % 100 != 13:
        return '%drd precinct' % number
    else:
        return '%dth precinct' % number
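
A quick spot-check of the special cases and the ordinal suffixes:

[precinct_name(n) for n in (14, 22, 101, 111, 63)]
# -> ['Midtown South', 'Central Park', '101st precinct', '111th precinct', '63rd precinct']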

This needs to be applied to collisions by precinct.

In [25]:
data.groupby('precinct')['collisions'].sum().order().rename(precinct_name).plot(kind='barh', figsize=[10, 15])
Out[25]:
<matplotlib.axes.AxesSubplot at 0xfd27ac8>

So Central Park looks safe. But what if it's because no one lives there? In other words, what's the number of accidents per capita?

First, we'll need the number of people in each precinct. A search leads to http://johnkeefe.net/nyc-police-precinct-and-census-data, which points to a Google Fusion Table with NYC census data. Aggregating that by precinct gives the population of each precinct, which I'm reading in below.

In [26]:
from cStringIO import StringIO
population = pd.read_table(StringIO('''precinct	population
1	66679
5	52568
6	62226
7	56355
9	76443
10	50180
13	93640
14	20651
17	79126
18	54066
19	208259
20	102624
22	25
23	73106
24	106460
25	47405
26	49508
28	44781
30	60685
32	70942
33	77645
34	112375
40	91497
41	52246
42	79762
43	172122
44	146441
45	120833
46	128200
47	152374
48	83266
49	114712
50	101720
52	139307
60	104278
61	159645
62	181981
63	108646
66	191382
67	155252
68	124491
69	84480
70	160664
71	98429
72	126230
73	86468
75	183328
76	43694
77	96309
78	61099
79	90263
81	62722
83	112634
84	48196
88	51421
90	116836
94	56247
100	47913
101	67065
102	144008
103	105803
104	170190
105	188582
106	122441
107	151107
108	113200
109	247354
110	172634
111	116431
112	112277
113	120132
114	202766
115	171576
120	175876
122	194822
123	98032
'''))

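# (Old-pandas idioms above and below: on Python 3, use io.StringIO instead of cStringIO,
#  and DataFrame.sort_values('population') instead of DataFrame.sort('population').)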
population.sort('population').head()
Out[26]:
precinct population
12 22 25
7 14 20651
47 76 43694
17 28 44781
15 25 47405

Precinct #22's population looks suspiciously low. That's Central Park, if you remember -- and presumably hardly anyone actually lives in Central Park, so the figure is plausible.

Let's merge this data into the accidents by precinct. We'll start with the list of collisions:

In [27]:
collisions = pd.DataFrame({'collisions': data.groupby('precinct')['collisions'].sum().order()})
collisions.head()
Out[27]:
collisions
precinct
22 144
121 219
100 1294
30 1501
101 1523

To merge the population data, let's ensure that it has the same index:

In [28]:
population = population.set_index('precinct')
population.head()
Out[28]:
population
precinct
1 66679
5 52568
6 62226
7 56355
9 76443

Now that they have the same index, we can simply assign the population column. Pandas automatically takes care of aligning the indexes.

In [29]:
collisions['population'] = population['population']
collisions
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 22 to 17
Data columns (total 2 columns):
collisions    77  non-null values
population    76  non-null values
dtypes: float64(1), int64(1)

Did you notice that population has only 76 non-null values, whereas the dataset has 77 precincts? Which one is missing?

In [30]:
collisions[collisions['population'].isnull()]
Out[30]:
collisions population
precinct
121 219 NaN

Can't be helped. We just don't have population data for the 121st precinct. Let's move on and plot the rest.

In [31]:
(collisions['collisions'].astype(float) / collisions['population']).rename(precinct_name).order().plot(kind='barh', figsize=[10, 15])
Out[31]:
<matplotlib.axes.AxesSubplot at 0x134002e8>

Which looks ridiculous, of course. Let's ignore Central Park, and hope that the accidents happening there are because of people passing through rather than the 25 people that live there. So if we remove it (along with the 121st precinct, for which we have no population)...

In [32]:
precincts = collisions.drop([22, 121])
(precincts['collisions'].astype(float) / precincts['population']).rename(precinct_name).order().plot(kind='barh', figsize=[10, 15])
Out[32]:
<matplotlib.axes.AxesSubplot at 0x11d2d5f8>

... we would be rather worried about the 17th Precinct.
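
For a quick numeric check, the same per-capita rates can be listed directly -- the tail of the sorted series gives the worst offenders (a sketch in the same old-pandas idiom; with a current pandas, sort_values() replaces order()):

rates = precincts['collisions'].astype(float) / precincts['population']
rates.rename(precinct_name).order().tail(10)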