For a better experience, view this notebook in nbviewer here
Docker image used: jupyter/datascience-notebook
To connect to a dockerized Jupyter server:
docker pull jupyter/datascience-notebook
docker run -it -p host_port_number:8888 -v /your/host/directory:/home/jovyan/work --name your_container_name jupyter/datascience-notebook
docker exec your_container_name pip install package_name
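If the notebook server asks for a login token, it is printed in the container logs (assuming the image's default start command), which you can view with:
docker logs your_container_name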
Statistics from the World Health Organization (WHO) and the U.S. Department of Transportation show that approximately 1.35 million people worldwide, including more than 35,000 in the U.S., lose their lives to traffic collisions each year. In Los Angeles alone, more than 200 people are killed each year while moving around the city, and bicyclists, pedestrians, children, and older adults are particularly vulnerable on the city’s streets. Yet traffic deaths are predictable and preventable, and everyone has the right to move safely in their communities. The City of Los Angeles has therefore committed to the Vision Zero initiative, which aims to eliminate all traffic deaths and serious injuries by 2025. To work toward that vision, we first need to better understand existing collisions. Thanks to the California Highway Patrol (CHP), we have data for all reported traffic collisions in the City of Los Angeles from 2009 to 2018. This project is an exploratory data analysis of those collisions, with a light touch of machine learning using scikit-learn, aiming to answer questions such as:
The City of Los Angeles Data Portal provides up-to-date collision data from 2010 to the present, but with limited attributes for each incident. Although not used in this project, I connected that dataset to CARTO just to give a rough idea of what we are dealing with here.
%%HTML
# embed link
import os
import pandas as pd
import numpy as np
import googlemaps
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
import cartoframes
from cartoframes.auth import set_default_context, Context
from cartoframes import Credentials
from cartoframes import Layer, BaseMap, styling
from cartoframes.viz import Map, basemaps
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
%matplotlib inline
USERNAME = '----'
APIKEY = '----'
creds = Credentials(username=USERNAME, key=APIKEY)
cc = cartoframes.CartoContext(creds=creds)
plt.style.use('ggplot')
plt.rcParams['axes.facecolor'] = '#F5F5F5'
file_path = '/home/jovyan/work/'
file_name_v = 'victim_2009_2018'
input_data_v = os.path.join(file_path, file_name_v + '.csv')
input_data_v
'/home/jovyan/work/victim_2009_2018.csv'
file_path = '/home/jovyan/work/'
file_name = 'collision_2009_2018'
input_data = os.path.join(file_path, file_name+'.csv')
input_data
'/home/jovyan/work/collision_2009_2018.csv'
collision = pd.read_csv(input_data, parse_dates=(['COLLISION_DATE']))
collision.head()
CASE_ID | ACCIDENT_YEAR | PROC_DATE | JURIS | COLLISION_DATE | COLLISION_TIME | OFFICER_ID | REPORTING_DISTRICT | DAY_OF_WEEK | CHP_SHIFT | ... | COUNT_PED_KILLED | COUNT_PED_INJURED | COUNT_BICYCLIST_KILLED | COUNT_BICYCLIST_INJURED | COUNT_MC_KILLED | COUNT_MC_INJURED | PRIMARY_RAMP | SECONDARY_RAMP | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3918943 | 2009 | 20091202 | 1942 | 2009-03-23 | 1745 | 35046 | 1283 | 1 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | - | - | NaN | NaN |
1 | 3984219 | 2009 | 20090115 | 1942 | 2009-01-03 | 540 | 33829 | 1701 | 6 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | - | - | NaN | NaN |
2 | 3984274 | 2009 | 20090126 | 1942 | 2009-01-07 | 630 | 38614 | 152 | 3 | 5 | ... | 1 | 0 | 0 | 0 | 0 | 0 | - | - | NaN | NaN |
3 | 3984281 | 2009 | 20090126 | 1942 | 2009-01-08 | 50 | 32896 | 1985 | 4 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | - | - | NaN | NaN |
4 | 3984352 | 2009 | 20100914 | 1942 | 2009-01-11 | 1450 | 24517 | 2014 | 7 | 5 | ... | 0 | 0 | 0 | 0 | 1 | 0 | - | - | NaN | NaN |
5 rows × 76 columns
victim = pd.read_csv(input_data_v)
victim.head()
CASE_ID | PARTY_NUMBER | VICTIM_ROLE | VICTIM_SEX | VICTIM_AGE | VICTIM_DEGREE_OF_INJURY | VICTIM_SEATING_POSITION | VICTIM_SAFETY_EQUIP_1 | VICTIM_SAFETY_EQUIP_2 | VICTIM_EJECTED | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3918943 | 2 | 1 | M | 25 | 4 | 1 | N | G | 0 |
1 | 3984219 | 1 | 1 | M | 19 | 1 | 1 | L | G | 0 |
2 | 3984274 | 2 | 3 | F | 58 | 1 | 9 | - | - | 0 |
3 | 3984281 | 1 | 1 | M | 29 | 1 | 1 | L | H | 0 |
4 | 3984352 | 1 | 2 | F | 998 | 0 | 3 | L | G | 0 |
collision.columns = map(str.lower, collision.columns)
collision.columns = collision.columns.str.replace(' ', '_')
collision.columns
Index(['case_id', 'accident_year', 'proc_date', 'juris', 'collision_date', 'collision_time', 'officer_id', 'reporting_district', 'day_of_week', 'chp_shift', 'population', 'cnty_city_loc', 'special_cond', 'beat_type', 'chp_beat_type', 'city_division_lapd', 'chp_beat_class', 'beat_number', 'primary_rd', 'secondary_rd', 'distance', 'direction', 'intersection', 'weather_1', 'weather_2', 'state_hwy_ind', 'caltrans_county', 'caltrans_district', 'state_route', 'route_suffix', 'postmile_prefix', 'postmile', 'location_type', 'ramp_intersection', 'side_of_hwy', 'tow_away', 'collision_severity', 'number_killed', 'number_injured', 'party_count', 'primary_coll_factor', 'pcf_code_of_viol', 'pcf_viol_category', 'pcf_violation', 'pcf_viol_subsection', 'hit_and_run', 'type_of_collision', 'mviw', 'ped_action', 'road_surface', 'road_cond_1', 'road_cond_2', 'lighting', 'control_device', 'chp_road_type', 'pedestrian_accident', 'bicycle_accident', 'motorcycle_accident', 'truck_accident', 'not_private_property', 'alcohol_involved', 'stwd_vehtype_at_fault', 'chp_vehtype_at_fault', 'count_severe_inj', 'count_visible_inj', 'count_complaint_pain', 'count_ped_killed', 'count_ped_injured', 'count_bicyclist_killed', 'count_bicyclist_injured', 'count_mc_killed', 'count_mc_injured', 'primary_ramp', 'secondary_ramp', 'latitude', 'longitude'], dtype='object')
victim.columns = map(str.lower, victim.columns)
victim.columns = victim.columns.str.replace(' ', '_')
victim.columns
Index(['case_id', 'party_number', 'victim_role', 'victim_sex', 'victim_age', 'victim_degree_of_injury', 'victim_seating_position', 'victim_safety_equip_1', 'victim_safety_equip_2', 'victim_ejected'], dtype='object')
to_drop = [
'juris', # all the same since city of Los Angeles
'proc_date', # process date
    'population', # same for all rows: incorporated, over 250,000
'officer_id',
'reporting_district',
'chp_shift',
'cnty_city_loc',
'special_cond',
'beat_type',
'chp_beat_type',
'city_division_lapd',
'chp_beat_class',
'beat_number',
'stwd_vehtype_at_fault',
'chp_vehtype_at_fault',
'caltrans_county',
'caltrans_district',
'weather_2', # only if a second description is necessary
'primary_ramp',
'secondary_ramp'
]
collision.drop(to_drop, axis=1, inplace=True)
collision.set_index('collision_date', inplace=True)
collision.head()
case_id | accident_year | collision_time | day_of_week | primary_rd | secondary_rd | distance | direction | intersection | weather_1 | ... | count_visible_inj | count_complaint_pain | count_ped_killed | count_ped_injured | count_bicyclist_killed | count_bicyclist_injured | count_mc_killed | count_mc_injured | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
collision_date | |||||||||||||||||||||
2009-03-23 | 3918943 | 2009 | 1745 | 1 | WESTERN AV | CENTURY BL | 55.0 | N | N | A | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
2009-01-03 | 3984219 | 2009 | 540 | 6 | TAMPA AV | PINE VALLEY AV | 420.0 | N | N | - | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
2009-01-07 | 3984274 | 2009 | 630 | 3 | 5TH ST | FLOWER ST | 0.0 | NaN | Y | A | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
2009-01-08 | 3984281 | 2009 | 50 | 4 | VAN NUYS BL | VESPER AV | 0.0 | NaN | Y | A | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN |
2009-01-11 | 3984352 | 2009 | 1450 | 7 | HARVARD BL | 3RD ST | 0.0 | NaN | Y | A | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | NaN | NaN |
5 rows × 55 columns
# day of week
dowdict = {
1: 'Monday',
2: 'Tuesday',
3: 'Wednesday',
4: 'Thursday',
5: 'Friday',
6: 'Saturday',
7: 'Sunday'
}
collision['day_of_week'].replace(
to_replace=[i for i in dowdict.keys()],
value=[i for i in dowdict.values()],
inplace=True
)
# type of collision
tocdict = {
'A': 'Head-On',
'B': 'Sideswipe',
'C': 'Rear End',
'D': 'Broadside',
'E': 'Hit Object',
'F': 'Overturned',
'G': 'Vehicle/Pedestrian',
'H': 'Other',
'-': 'Not Stated'
}
collision['type_of_collision'].replace(
to_replace=[i for i in tocdict.keys()],
value=[i for i in tocdict.values()],
inplace=True
)
# violation category
vcdict = {
'0': 'Unknown',
'1': 'Driving or Bicycling Under the Influence of Alcohol or Drug',
'2': 'Impeding Traffic',
'3': 'Unsafe Speed',
'4': 'Following Too Closely',
'5': 'Wrong Side of Road',
'6': 'Improper Passing',
'7': 'Unsafe Lane Change',
'8': 'Improper Turning',
'9': 'Automobile Right of Way',
'10': 'Pedestrian Right of Way',
'11': 'Pedestrian Violation',
'12': 'Traffic Signals and Signs',
'13': 'Hazardous Parking',
'14': 'Lights',
'15': 'Brakes',
'16': 'Other Equipment',
'17': 'Other Hazardous Violation',
'18': 'Other Than Driver (or Pedestrian)',
'19': None,
'20': None,
'21': 'Unsafe Starting or Backing',
'22': 'Other Improper Driving',
'23': 'Pedestrian or "Other" Under the Influence of Alcohol or Drug',
'24': 'Fell Asleep',
'- ': 'Not Stated'
}
collision['pcf_viol_category'].replace(
to_replace=[i for i in vcdict.keys()],
value=[i for i in vcdict.values()],
inplace=True
)
# collision severity
csdict = {
0: 'Property Damage',
1: 'Fatal',
2: 'Severe',
3: 'Other Visible Injuries',
4: 'Complaint of Pain'
}
collision['collision_severity'].replace(
to_replace=[i for i in csdict.keys()],
value=[i for i in csdict.values()],
inplace=True
)
# location type
ltdict = {
'H': 'Highway',
'I': 'Intersection',
'R': 'Ramp'
}
collision['location_type'].replace(
to_replace=[i for i in ltdict.keys()],
value=[i for i in ltdict.values()],
inplace=True
)
# weather
weatherdict = {
'A': 'Clear',
'B': 'Cloudy',
'C': 'Raining',
'D': 'Snowing',
'E': 'Fog',
'F': 'Other',
'G': 'Wind',
    '- ': 'Not Stated'
}
collision['weather_1'].replace(
to_replace=[i for i in weatherdict.keys()],
value=[i for i in weatherdict.values()],
inplace=True
)
# bike & ped collision
def bike_ped(row):
if (row.bicycle_accident == 'Y') | (row.pedestrian_accident == 'Y'):
return 'Y'
else:
return 'N'
collision['bike_ped_collision'] = collision.apply(bike_ped, axis=1)
# degree of injury
doidict = {
0: 'No Injury',
1: 'Killed',
2: 'Severe Injury',
3: 'Other Visible Injury',
4: 'Complaint of Pain',
5: 'Other',
6: 'Other',
7: 'Other'
}
victim.victim_degree_of_injury.replace(
to_replace=[i for i in doidict.keys()],
value=[i for i in doidict.values()],
inplace=True
)
# age 998 is not stated - replace with NaN
# age 999 is fatal fetus - replace with age 0 for now; there are 6 of them :(
victim['victim_age'].replace({998: np.nan, 999: 0}, inplace=True)
victim['victim_sex'].replace('-', np.nan, inplace=True)
ranges = [0, 600, 930, 1400, 1830, 2400]
labels = ['Early Morning', 'AM Peak', 'Midday', 'PM Peak', 'Evening']
collision['time_period'] = pd.cut(
collision.collision_time,
ranges,
labels=labels
).astype('category')
collision[['collision_time', 'time_period']].head()
collision_time | time_period | |
---|---|---|
collision_date | ||
2009-03-23 | 1745 | PM Peak |
2009-01-03 | 540 | Early Morning |
2009-01-07 | 630 | AM Peak |
2009-01-08 | 50 | Early Morning |
2009-01-11 | 1450 | PM Peak |
ranges = list(np.arange(0, 2401, 100))
labels = list(np.arange(0, 24))
collision['hour_of_day'] = pd.cut(
collision.collision_time,
ranges,
labels=labels
).astype('category')
collision[['collision_time', 'hour_of_day']].head()
collision_time | hour_of_day | |
---|---|---|
collision_date | ||
2009-03-23 | 1745 | 17 |
2009-01-03 | 540 | 5 |
2009-01-07 | 630 | 6 |
2009-01-08 | 50 | 0 |
2009-01-11 | 1450 | 14 |
# collision.reset_index().to_csv('collision_09_18_cleaned.csv', index=None)
print(
"The dataset has {:,} rows/collisions and {:,} attributes/columns.".format(
collision.shape[0],
collision.shape[1]
)
)
print(
"Among the {A:,} collisions: \n\
{B:,.0f} were fatal... \n\
{C:,.0f} involved a cyclist... \n\
{D:,.0f} involved a pedestrian...".format(
A = collision.shape[0],
B = len(collision[collision.collision_severity=='Fatal']),
C = len(collision[collision.bicycle_accident=='Y']),
D = len(collision[collision.pedestrian_accident=='Y'])
)
)
The dataset has 355,488 rows/collisions and 58 attributes/columns.
Among the 355,488 collisions:
1,885 were fatal...
21,663 involved a cyclist...
28,593 involved a pedestrian...
collision.collision_severity.unique()
array(['Complaint of Pain', 'Fatal', 'Property Damage', 'Other Visible Injuries', 'Severe'], dtype=object)
# agg(list(tuple())) to create new columns
# agg(dict()) to apply functions to existing columns
severe_fatal_year = collision.groupby('accident_year').collision_severity.agg([
('All Collisions', len),
('Severe Collisions', lambda x: ((x == 'Severe') | (x == 'Fatal')).sum()),
('Percent of Severe Collisions', lambda x: ((x == 'Severe') | (x == 'Fatal')).mean())
]
)
severe_fatal_year
All Collisions | Severe Collisions | Percent of Severe Collisions | |
---|---|---|---|
accident_year | |||
2009 | 35822 | 1109 | 0.030959 |
2010 | 34446 | 1038 | 0.030134 |
2011 | 33422 | 1057 | 0.031626 |
2012 | 33562 | 1145 | 0.034116 |
2013 | 31937 | 1114 | 0.034881 |
2014 | 33148 | 1133 | 0.034180 |
2015 | 36961 | 1142 | 0.030897 |
2016 | 38487 | 1425 | 0.037025 |
2017 | 39094 | 1480 | 0.037857 |
2018 | 38609 | 1459 | 0.037789 |
ax = severe_fatal_year[['All Collisions', 'Severe Collisions']].plot(
kind='bar',
rot=0,
figsize=(12, 6),
color=['navy', 'orangered']
)
ax.set_title('Collisions by Year (2009 - 2018)', fontsize=14)
ax.set_ylabel('Number of Collisions')
ax.set_xlabel(' ')
ax.set_ylim([0, 45000])
ax.legend(loc='upper left')
fmt = '{x:,.0f}'
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter(fmt))
for tick in ax.yaxis.get_major_ticks():
tick.label.set_fontsize(12)
for tick in ax.xaxis.get_major_ticks():
tick.label.set_fontsize(12)
ax.axhline(severe_fatal_year['All Collisions'].mean(), color='red', linestyle='dashed', linewidth=1)
pass
fig = {
"data": [{"type": "bar",
"x": severe_fatal_year.index,
"y": severe_fatal_year['All Collisions'],
"name": "All",
"marker_color": "rgb(0, 0, 128)",
"hoverinfo": "y"
},
{"type": "bar",
"x": severe_fatal_year.index,
"y": severe_fatal_year['Severe Collisions'],
"name": "Severe",
"marker_color": "rgb(255, 69, 0)",
"hoverinfo": "y"
}],
"layout": {"title": {"text": "Collisions by Year (2009 - 2018)", "font_size": 20},
"xaxis": {"tickfont_size": 12,
"type": "category"},
"yaxis": {"title": "Nnumber of Collisions",
"tickfont_size": 12},
"barmode": "group",
"bargap": 0.25,
"bargroupgap": 0.1,
"plot_bgcolor": "#F5F5F5"
}
}
pio.show(fig)
data = [severe_fatal_year[i] for i in ['All Collisions', 'Severe Collisions']]
names = ["Collisions", "Severe Collisions"]
colors = ["rgb(0,0,128)", "rgb(255, 69, 0)"]
plot_data = zip(data, names, colors)
# Create figures and annotations
fig = go.Figure()
annotation_list = []
for item in plot_data:
fig.add_trace(
go.Scatter(
x=severe_fatal_year.index,
y=item[0],
name=item[1],
marker_color=item[2]),
)
    annotation_text = "{:,.0f} {} <br> happened in {}".format(
item[0].max(),
item[1],
item[0].idxmax())
annotation = [
dict(
x=item[0].idxmax(),
y=item[0].max(),
xref="x", yref="y",
text=annotation_text,
ax=0,
ay=-40)]
annotation_list.append(annotation)
# Create drop-downs and plot
fig.update_layout(
updatemenus=[
go.layout.Updatemenu(
active=0,
buttons=list([
dict(label="All Collisions",
method="update",
args=[{"visible": [True, False]},
{"title": "All Collisions from 2009 to 2018",
"annotations": annotation_list[0]}]),
dict(label="Severe Collisions",
method="update",
args=[{"visible": [False, True]},
{"title": "Severe Collisions from 2009 to 2018",
"annotations": annotation_list[1]}])]),
bordercolor="rgb(192,192,192)",
x=1,
xanchor="right",
y=1.12,
yanchor="top"
)],
plot_bgcolor="#F5F5F5",
legend_orientation="h",
title_text="Collisions by Year (2009 - 2018)")
bp_collision = collision[collision.bike_ped_collision=='Y'][['accident_year', 'bike_ped_collision', 'collision_severity']]
bp_collision = bp_collision.groupby('accident_year').agg({
'bike_ped_collision': 'count',
'collision_severity': lambda x: ((x=='Severe') | (x=='Fatal')).sum()
}
)
bp_collision['Percent BP Severe Collisions'] = bp_collision.collision_severity / bp_collision.bike_ped_collision
bp_collision.rename(columns={
'bike_ped_collision': 'Bike & Ped Collision',
'collision_severity': 'Severe Collisions'
}, inplace=True
)
bp_collision
Bike & Ped Collision | Severe Collisions | Percent BP Severe Collisions | |
---|---|---|---|
accident_year | |||
2009 | 4578 | 515 | 0.112495 |
2010 | 4710 | 515 | 0.109342 |
2011 | 4803 | 483 | 0.100562 |
2012 | 5058 | 537 | 0.106168 |
2013 | 5124 | 536 | 0.104606 |
2014 | 5133 | 516 | 0.100526 |
2015 | 5110 | 519 | 0.101566 |
2016 | 5074 | 638 | 0.125739 |
2017 | 5225 | 697 | 0.133397 |
2018 | 5151 | 695 | 0.134925 |
# Concatenate with all collisions
collision_compare = pd.concat([bp_collision, severe_fatal_year], axis=1)
collision_compare
Bike & Ped Collision | Severe Collisions | Percent BP Severe Collisions | All Collisions | Severe Collisions | Percent of Severe Collisions | |
---|---|---|---|---|---|---|
accident_year | ||||||
2009 | 4578 | 515 | 0.112495 | 35822 | 1109 | 0.030959 |
2010 | 4710 | 515 | 0.109342 | 34446 | 1038 | 0.030134 |
2011 | 4803 | 483 | 0.100562 | 33422 | 1057 | 0.031626 |
2012 | 5058 | 537 | 0.106168 | 33562 | 1145 | 0.034116 |
2013 | 5124 | 536 | 0.104606 | 31937 | 1114 | 0.034881 |
2014 | 5133 | 516 | 0.100526 | 33148 | 1133 | 0.034180 |
2015 | 5110 | 519 | 0.101566 | 36961 | 1142 | 0.030897 |
2016 | 5074 | 638 | 0.125739 | 38487 | 1425 | 0.037025 |
2017 | 5225 | 697 | 0.133397 | 39094 | 1480 | 0.037857 |
2018 | 5151 | 695 | 0.134925 | 38609 | 1459 | 0.037789 |
fig = {
"data": [{"type": "scatter",
"x": collision_compare.index,
"y": collision_compare['Percent BP Severe Collisions'],
"name": "% Severe bike & ped collisions",
"marker_color": "rgb(250, 128, 114)",
"hoverinfo": "y"
},
{"type": "scatter",
"x": collision_compare.index,
"y": collision_compare['Percent of Severe Collisions'],
"name": "% Severe collisions",
"marker_color": "rgb(255, 165, 0)",
"hoverinfo": "y"
}],
"layout": {"title": {"text": "Percent of Severe Collisions by Year (2009 - 2018)", "font_size": 20},
"xaxis": {"tickfont_size": 12,
"type": "category"},
"yaxis": {"title": "Ratio",
"tickfont_size": 12,
"tickformat": ",.0%"},
"barmode": "group",
"bargap": 0.25,
"bargroupgap": 0.1,
"legend_orientation": "h"
}
}
pio.show(fig)
# 10-year total
collision_dow = pd.Series(collision.groupby('day_of_week').count()['case_id'])
collision_dow = collision_dow.reindex([
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday',
'Sunday'
])
pal = sns.color_palette('Reds', 7)
colordict = {
'Friday': pal[6],
'Saturday': pal[5],
'Thursday': pal[4],
'Wednesday': pal[3],
'Tuesday': pal[2],
'Sunday': pal[0],
'Monday': pal[1]
}
ax = collision_dow.plot(figsize=(12, 6), kind='bar', color=[colordict[i] for i in collision_dow.index], width=0.5, rot=0)
ax.set_title('Collisions by Day of Week (10 year Total)', fontsize=14)
ax.set_ylim((40000, 60000))
ax.set_ylabel('Number of Collisions')
ax.set_xlabel(' ')
fmt = '{x:,.0f}'
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter(fmt))
for tick in ax.yaxis.get_major_ticks():
tick.label.set_fontsize(12)
for tick in ax.xaxis.get_major_ticks():
tick.label.set_fontsize(12)
# Daily total distribution
collision_dow_measure = collision.groupby(
['collision_date', 'day_of_week'], as_index=False
).count()[['day_of_week', 'case_id']]
collision_dow_measure.rename(
columns={
'day_of_week': 'Day of Week',
'case_id': 'Number of Collisions per Day'
}, inplace=True
)
collision_dow_measure['Day of Week'] = pd.Categorical(
collision_dow_measure['Day of Week'],
categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
ordered=True
)
pal = sns.color_palette('Reds', 7)
colordict = {
'Friday': pal[6],
'Saturday': pal[5],
'Thursday': pal[4],
'Wednesday': pal[3],
'Tuesday': pal[2],
'Sunday': pal[0],
'Monday': pal[1]
}
plt.figure(figsize=(12, 6))
ax = sns.boxplot(x='Day of Week', y='Number of Collisions per Day', data=collision_dow_measure, palette=colordict)
ax.set_title('Collisions by Day of Week (Daily Total Distribution)', fontsize=14)
for tick in ax.yaxis.get_major_ticks():
tick.label.set_fontsize(12)
for tick in ax.xaxis.get_major_ticks():
tick.label.set_fontsize(12)
pass
severe_collision_dow = pd.Series(collision.groupby('day_of_week')['collision_severity'].agg(lambda x: ((x== 'Severe') | (x == 'Fatal')).mean()))
severe_collision_dow = severe_collision_dow.reindex([
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday',
'Sunday'
])
pal = sns.color_palette("Reds", 7)
colordict = {'Friday': pal[4],
'Saturday': pal[5],
'Thursday': pal[1],
'Wednesday': pal[3],
'Tuesday': pal[1],
'Sunday': pal[6],
'Monday': pal[0]}
ax = severe_collision_dow.plot(figsize=(10, 6), kind='bar', color=[colordict[i] for i in severe_collision_dow.index], width=0.5, rot=0)
ax.set_title('Percent Severe Collisions by Day of the Week over 10 years', fontsize=14)
ax.set_ylabel('Percentage of Severe Collisions')
plt.ylim([0, 0.05])
ax.set_xlabel(' ')
fmt = '{x:,.0%}'
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter(fmt))
for tick in ax.yaxis.get_major_ticks():
tick.label.set_fontsize(12)
for tick in ax.xaxis.get_major_ticks():
tick.label.set_fontsize(12)
for rect in ax.patches:
y_value = rect.get_height()
x_value = rect.get_x() + rect.get_width() / 2
space = 5
va = 'bottom'
label = '{:.0%}'.format(y_value)
ax.annotate(label,
(x_value, y_value),
xytext=(0, space),
textcoords='offset points',
ha='center',
va=va)
pass
collision_dow_hod = collision.groupby(['hour_of_day', 'day_of_week'])['case_id'].count().unstack()
collision_dow_hod = collision_dow_hod[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']]
temp = collision[(collision.collision_severity=='Fatal') | (collision.collision_severity=='Severe')]
collision_dow_hod_severe = temp.groupby(['hour_of_day', 'day_of_week'])['case_id'].count().unstack()
collision_dow_hod_severe = collision_dow_hod_severe[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']]
data = [collision_dow_hod, collision_dow_hod_severe]
title = [
'Collision by Day of Week and Time of Day',
'Severe Collisions by Day of Week and Time of Day'
]
chart_data = zip(data, title)
for item in chart_data:
fig = {
"data": [{"type": "heatmap",
"x": item[0].columns,
"y": item[0].index,
"z": item[0].values.tolist(),
}],
"layout": {"title": {"text": item[1], "font_size": 20},
"xaxis": {"title": "Day of Week",
"tickfont_size": 12,
"type": "category"},
"yaxis": {"title": "Hour of Day",
"tickfont_size": 12,
"type": "category"},
"height": 650
}
}
pio.show(fig)
# by year
t1 = collision.groupby(['accident_year', 'type_of_collision'])['case_id'].count().unstack()
t1
type_of_collision | Broadside | Head-On | Hit Object | Not Stated | Other | Overturned | Rear End | Sideswipe | Vehicle/Pedestrian |
---|---|---|---|---|---|---|---|---|---|
accident_year | |||||||||
2009 | 10268 | 2836 | 2558 | 519 | 1201 | 158 | 8625 | 7384 | 2273 |
2010 | 9892 | 2848 | 2567 | 476 | 1177 | 101 | 8259 | 6871 | 2255 |
2011 | 9800 | 2799 | 2287 | 278 | 1139 | 134 | 8052 | 6778 | 2155 |
2012 | 9137 | 2705 | 2440 | 375 | 1500 | 125 | 7931 | 6981 | 2368 |
2013 | 8855 | 2656 | 2310 | 392 | 1421 | 147 | 7168 | 6644 | 2344 |
2014 | 9096 | 2651 | 2269 | 444 | 1519 | 117 | 7583 | 7027 | 2442 |
2015 | 10191 | 3201 | 2486 | 620 | 1518 | 158 | 8386 | 7875 | 2526 |
2016 | 10634 | 3375 | 2399 | 739 | 1237 | 178 | 8771 | 8510 | 2644 |
2017 | 10910 | 3586 | 2185 | 770 | 1250 | 184 | 8656 | 8800 | 2753 |
2018 | 10850 | 3423 | 2194 | 740 | 1297 | 193 | 8360 | 8723 | 2829 |
# by weather type (Clear and Cloudy dropped below to focus on adverse conditions)
t2 = collision.groupby(['type_of_collision', 'weather_1'])['case_id'].count().unstack()
t2.drop(['Clear', 'Cloudy'], axis=1, inplace=True)
t2
weather_1 | - | Fog | Other | Raining | Snowing | Wind |
---|---|---|---|---|---|---|
type_of_collision | ||||||
Broadside | 251.0 | 98.0 | 37.0 | 2574.0 | 5.0 | 56.0 |
Head-On | 127.0 | 65.0 | 18.0 | 971.0 | 3.0 | 20.0 |
Hit Object | 86.0 | 95.0 | 23.0 | 1226.0 | 1.0 | 33.0 |
Not Stated | 368.0 | 3.0 | 5.0 | 130.0 | NaN | 3.0 |
Other | 44.0 | 18.0 | 20.0 | 258.0 | 1.0 | 6.0 |
Overturned | 4.0 | 2.0 | 1.0 | 61.0 | NaN | 1.0 |
Rear End | 290.0 | 98.0 | 40.0 | 2554.0 | 9.0 | 39.0 |
Sideswipe | 267.0 | 97.0 | 66.0 | 1767.0 | 2.0 | 24.0 |
Vehicle/Pedestrian | 87.0 | 36.0 | 15.0 | 838.0 | 3.0 | 12.0 |
# by weather type in terms of percentages
t2.div(t2.sum(axis=1)/100, axis=0).round()
weather_1 | - | Fog | Other | Raining | Snowing | Wind |
---|---|---|---|---|---|---|
type_of_collision | ||||||
Broadside | 8.0 | 3.0 | 1.0 | 85.0 | 0.0 | 2.0 |
Head-On | 11.0 | 5.0 | 1.0 | 81.0 | 0.0 | 2.0 |
Hit Object | 6.0 | 6.0 | 2.0 | 84.0 | 0.0 | 2.0 |
Not Stated | 72.0 | 1.0 | 1.0 | 26.0 | NaN | 1.0 |
Other | 13.0 | 5.0 | 6.0 | 74.0 | 0.0 | 2.0 |
Overturned | 6.0 | 3.0 | 1.0 | 88.0 | NaN | 1.0 |
Rear End | 10.0 | 3.0 | 1.0 | 84.0 | 0.0 | 1.0 |
Sideswipe | 12.0 | 4.0 | 3.0 | 79.0 | 0.0 | 1.0 |
Vehicle/Pedestrian | 9.0 | 4.0 | 2.0 | 85.0 | 0.0 | 1.0 |
# by severity (Fatal and Severe only; Vehicle/Pedestrian collisions excluded)
t3 = collision.groupby(['type_of_collision', 'collision_severity'])['case_id'].count().unstack()
t3.drop(['Property Damage', 'Other Visible Injuries', 'Complaint of Pain'], axis=1, inplace=True)
t3.drop(['Vehicle/Pedestrian'], axis=0, inplace=True)
t3
collision_severity | Fatal | Severe |
---|---|---|
type_of_collision | ||
Broadside | 443 | 3144 |
Head-On | 144 | 1234 |
Hit Object | 205 | 624 |
Not Stated | 16 | 166 |
Other | 81 | 497 |
Overturned | 18 | 78 |
Rear End | 88 | 750 |
Sideswipe | 68 | 757 |
title = ['Collision Type by Year', 'Collision Type by Weather', 'Collision Type by Severity']
ylabel = ['Year', 'Weather', 'Severity']
t_data = [t1.T, t2, t3]
for t in zip(title, ylabel, t_data):
plt.figure(figsize = (8,6))
sns.heatmap(t[2], cmap='Blues', alpha=0.8)
plt.title(t[0])
plt.xlabel(' ')
plt.ylabel(' ')
collision.alcohol_involved.replace(np.nan, 'N', inplace=True) # values for [alcohol_involved] are either 'Y' or blank
al_dow = collision.groupby(['day_of_week', 'alcohol_involved'])['case_id'].count().unstack()
al_dow = al_dow.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
al_dow.rename(columns={'Y': 'Alcohol Involved', 'N': 'None Reported',}, inplace=True)
al_dow
alcohol_involved | None Reported | Alcohol Involved |
---|---|---|
day_of_week | ||
Monday | 45211 | 3843 |
Tuesday | 46896 | 3326 |
Wednesday | 46810 | 3532 |
Thursday | 47233 | 4006 |
Friday | 51001 | 5289 |
Saturday | 44335 | 7158 |
Sunday | 39199 | 7649 |
al_dow_pct = al_dow.div(al_dow.sum(axis=1)/100, axis=0).round()
al_dow_pct
alcohol_involved | None Reported | Alcohol Involved |
---|---|---|
day_of_week | ||
Monday | 92.0 | 8.0 |
Tuesday | 93.0 | 7.0 |
Wednesday | 93.0 | 7.0 |
Thursday | 92.0 | 8.0 |
Friday | 91.0 | 9.0 |
Saturday | 86.0 | 14.0 |
Sunday | 84.0 | 16.0 |
ax = al_dow_pct.plot.bar(
stacked=True,
figsize=(12,6),
color = ['orangered', 'navy'],
y=['Alcohol Involved', 'None Reported'],
rot=0
)
ax.set_title('Collisions by Day of Week & Alcohol Involvement')
ax.set_ylabel('Percentage')
ax.set_xlabel('Day of Week (over 10 years)')
plt.legend(bbox_to_anchor = (1.01, 0.5), loc = 'center left')
fmt = '{x:,.0f}'
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter(fmt))
for tick in ax.yaxis.get_major_ticks():
tick.label.set_fontsize(12)
for tick in ax.xaxis.get_major_ticks():
tick.label.set_fontsize(12)
plt.tight_layout()
# Number of recorded alcohol-involved incidents by day of week and time of day
al = collision[collision.alcohol_involved == 'Y']
al_dow_tod = al.groupby(['day_of_week', 'time_period'])['case_id'].count().unstack()
al_dow_tod = al_dow_tod.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
al_dow_tod = al_dow_tod.T
al_dow_tod
day_of_week | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---|---|---|---|---|---|---|---|
time_period | |||||||
Early Morning | 1193 | 687 | 828 | 912 | 1318 | 2775 | 3219 |
AM Peak | 205 | 223 | 228 | 202 | 219 | 235 | 311 |
Midday | 330 | 333 | 329 | 343 | 375 | 469 | 444 |
PM Peak | 718 | 677 | 694 | 763 | 878 | 929 | 1102 |
Evening | 1378 | 1387 | 1444 | 1777 | 2485 | 2728 | 2543 |
al_dow_tod_pct = al_dow_tod.div(al_dow_tod.sum(axis=1)/100, axis=0).round()
al_dow_tod_pct
day_of_week | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---|---|---|---|---|---|---|---|
time_period | |||||||
Early Morning | 11.0 | 6.0 | 8.0 | 8.0 | 12.0 | 25.0 | 29.0 |
AM Peak | 13.0 | 14.0 | 14.0 | 12.0 | 13.0 | 14.0 | 19.0 |
Midday | 13.0 | 13.0 | 13.0 | 13.0 | 14.0 | 18.0 | 17.0 |
PM Peak | 12.0 | 12.0 | 12.0 | 13.0 | 15.0 | 16.0 | 19.0 |
Evening | 10.0 | 10.0 | 11.0 | 13.0 | 18.0 | 20.0 | 19.0 |
fig = {
"data": [{"type": "heatmap",
"x": al_dow_tod_pct.columns,
"y": al_dow_tod_pct.index,
"z": al_dow_tod_pct.values.tolist(),
}],
"layout": {"title": {"text": "% Alcohol Involved in Recorded Collisions by Day of Week & Time of Day", "font_size": 20},
"xaxis": {"title": "Day of Week",
"tickfont_size": 12,
"type": "category"},
"yaxis": {"title": "Time Period",
"tickfont_size": 12,
"type": "category"},
"height": 650
}
}
pio.show(fig)
severe_victim = victim[
(victim['victim_degree_of_injury'] == 'Killed') |\
(victim['victim_degree_of_injury'] == 'Severe Injury')
]
fatal_victim = victim[
(victim['victim_degree_of_injury'] == 'Killed')
]
fig = plt.figure(figsize=(10, 10))
ax1 = fig.add_subplot(211)
ax1 = victim['victim_age'].hist(bins=40, color='skyblue', edgecolor='grey')
ax2 = fig.add_subplot(212, sharex=ax1)
ax2 = fatal_victim['victim_age'].hist(bins=40, color='salmon', edgecolor='darkred')
for ax in [ax1, ax2]:
ax.set_ylabel('Number of Collisions')
fmt = '{x:,.0f}'
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter(fmt))
for tick in ax.yaxis.get_major_ticks():
tick.label.set_fontsize(12)
for tick in ax.xaxis.get_major_ticks():
tick.label.set_fontsize(12)
ax1.set_title("Victim's Age Distribution")
ax1.axvline(victim['victim_age'].mean(), color='red', linestyle='dashed', linewidth=1)
ax1.text(victim['victim_age'].mean() + victim['victim_age'].mean()/10,
40000, 'Mean Age: {:.0f}'.format(victim['victim_age'].mean()), fontsize=12)
ax1.text(110, 3000, 'All Victims', fontsize=12)
ax2.axvline(fatal_victim['victim_age'].mean(), color='darkred', linestyle='dashed', linewidth=1)
ax2.text(fatal_victim['victim_age'].mean() + fatal_victim['victim_age'].mean()/10,
120, 'Mean Age: {:.0f}'.format(fatal_victim['victim_age'].mean()), fontsize=12)
ax2.text(110, 10, 'Killed Victims', fontsize=12)
ax2.set_xlabel('Age')
pass
victim.groupby('victim_sex')['case_id'].count()
victim_sex
F    215779
M    223444
Name: case_id, dtype: int64
victim_injury = victim.groupby(['victim_sex', 'victim_degree_of_injury'])['case_id'].count().unstack(level=0)
victim_injury.drop(index=['Complaint of Pain', 'No Injury', 'Other'], inplace=True)
victim_injury
victim_sex | F | M |
---|---|---|
victim_degree_of_injury | ||
Killed | 536 | 1440 |
Other Visible Injury | 29513 | 46603 |
Severe Injury | 3371 | 7600 |
fig = {
"data": [{"type": "bar",
"x": victim_injury.index,
"y": victim_injury['F'],
"name": "Female",
"marker_color": "rgb(0, 0, 128)",
"hoverinfo": "y"
},
{"type": "bar",
"x": victim_injury.index,
"y": victim_injury['M'],
"name": "Male",
"marker_color": "rgb(255, 69, 0)",
"hoverinfo": "y"
}],
"layout": {"title": {"text": "Degree of Injury by Gender (10-yr Total)", "font_size": 20},
"xaxis": {"tickfont_size": 12,
"type": "category"},
"yaxis": {"title": "Nnumber of Collisions",
"tickfont_size": 12},
"barmode": "group",
"bargap": 0.25,
"bargroupgap": 0.1,
"plot_bgcolor": "#F5F5F5"
}
}
pio.show(fig)
def condition(x):
if (x.victim_safety_equip_1 in ['C', 'E', 'G']) or (x.victim_safety_equip_2 in ['C', 'E', 'G']):
return 'Seat Belts'
else:
return 'No Seat Belts'
victim['safety'] = victim.apply(condition, axis=1)
victim.head()
case_id | party_number | victim_role | victim_sex | victim_age | victim_degree_of_injury | victim_seating_position | victim_safety_equip_1 | victim_safety_equip_2 | victim_ejected | safety | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3918943 | 2 | 1 | M | 25.0 | Complaint of Pain | 1 | N | G | 0 | Seat Belts |
1 | 3984219 | 1 | 1 | M | 19.0 | Killed | 1 | L | G | 0 | Seat Belts |
2 | 3984274 | 2 | 3 | F | 58.0 | Killed | 9 | - | - | 0 | No Seat Belts |
3 | 3984281 | 1 | 1 | M | 29.0 | Killed | 1 | L | H | 0 | No Seat Belts |
4 | 3984352 | 1 | 2 | F | NaN | No Injury | 3 | L | G | 0 | Seat Belts |
sb_injury = victim.groupby(['safety', 'victim_degree_of_injury'])['case_id'].count().unstack(level=0)
sb_injury.drop(index=['Complaint of Pain', 'No Injury', 'Other', 'Other Visible Injury'], inplace=True)
sb_injury
safety | No Seat Belts | Seat Belts |
---|---|---|
victim_degree_of_injury | ||
Killed | 1609 | 370 |
Severe Injury | 8000 | 3156 |
fig = {
"data": [{"type": "bar",
"x": sb_injury.index,
"y": sb_injury['No Seat Belts'],
"name": "No Seat Belts",
"marker_color": "rgb(0, 0, 128)",
"hoverinfo": "y"
},
{"type": "bar",
"x": sb_injury.index,
"y": sb_injury['Seat Belts'],
"name": "Seat Belts",
"marker_color": "rgb(255, 69, 0)",
"hoverinfo": "y"
}],
"layout": {"title": {"text": "Degree of Injury by Seat Belt Status (10-yr Total)", "font_size": 20},
"xaxis": {"tickfont_size": 12,
"type": "category"},
"yaxis": {"title": "Nnumber of Collisions",
"tickfont_size": 12},
"barmode": "group",
"bargap": 0.25,
"bargroupgap": 0.1,
"plot_bgcolor": "#F5F5F5"
}
}
pio.show(fig)
api_key = '----'
gmaps = googlemaps.Client(key=api_key)
fatal_severe_collision = collision[(collision.collision_severity == 'Fatal') | (collision.collision_severity == 'Severe')]
int_fatal_severe_collision = fatal_severe_collision[fatal_severe_collision.intersection == 'Y'].copy()  # copy so new columns can be added without a SettingWithCopyWarning
address = (
int_fatal_severe_collision.primary_rd
+ ' and '
+ int_fatal_severe_collision.secondary_rd
+ ', Los Angeles, California').tolist()
address_formatted = []
lat = []
lon = []
for add in address:
try:
geocode_result = gmaps.geocode(add)
ad = geocode_result[0]['formatted_address']
latitude = geocode_result[0]['geometry']['location']['lat']
longitude = geocode_result[0]['geometry']['location']['lng']
address_formatted.append(ad)
lat.append(latitude)
lon.append(longitude)
    except Exception:
        # append NaN so the later dropna(subset=['lat', 'lon']) removes failed geocodes
        address_formatted.append(np.nan)
        lat.append(np.nan)
        lon.append(np.nan)
int_fatal_severe_collision['lat'] = lat
int_fatal_severe_collision['lon'] = lon
int_fatal_severe_collision['address'] = address_formatted
int_fatal_severe_collision[['primary_rd', 'secondary_rd', 'lat', 'lon']].head()
primary_rd | secondary_rd | lat | lon | |
---|---|---|---|---|
0 | SAN PEDRO ST | 32ND ST | 33.715175 | -118.289782 |
1 | 22ND ST | VILLA CABRILLO MARINA | 33.717084 | -118.282379 |
2 | ALMA ST | 30TH ST | 33.718315 | -118.299416 |
3 | PACIFIC AV | 26TH ST | 33.720614 | -118.287991 |
4 | 25TH ST | MEYLER ST | 33.721563 | -118.296843 |
We need to specify two DBSCAN hyperparameters: eps, the maximum distance between two points for them to be considered part of the same neighborhood (here, 100 m), and min_samples, the minimum number of points required to form a cluster (here, 3).
Here, I use the haversine metric with the ball tree algorithm to calculate great-circle distances between points, so both epsilon and the collision location coordinates need to be converted to radians.
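As a quick sanity check of that unit conversion (an illustrative sketch, not part of the analysis pipeline): scikit-learn's haversine metric returns distances in radians on a unit sphere, so a physical distance in kilometers becomes radians after dividing by the Earth's mean radius of about 6371.0088 km.
# Illustrative only: converting a 100 m search radius to radians
earth_radius_km = 6371.0088
eps_km = 0.1  # 100 meters
eps_rad = eps_km / earth_radius_km
print(eps_rad)  # roughly 1.57e-05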
int_fatal_severe_collision.columns
Index(['case_id', 'accident_year', 'collision_time', 'day_of_week', 'primary_rd', 'secondary_rd', 'distance', 'direction', 'intersection', 'weather_1', 'state_hwy_ind', 'state_route', 'route_suffix', 'postmile_prefix', 'postmile', 'location_type', 'ramp_intersection', 'side_of_hwy', 'tow_away', 'collision_severity', 'number_killed', 'number_injured', 'party_count', 'primary_coll_factor', 'pcf_code_of_viol', 'pcf_viol_category', 'pcf_violation', 'pcf_viol_subsection', 'hit_and_run', 'type_of_collision', 'mviw', 'ped_action', 'road_surface', 'road_cond_1', 'road_cond_2', 'lighting', 'control_device', 'chp_road_type', 'pedestrian_accident', 'bicycle_accident', 'motorcycle_accident', 'truck_accident', 'not_private_property', 'alcohol_involved', 'count_severe_inj', 'count_visible_inj', 'count_complaint_pain', 'count_ped_killed', 'count_ped_injured', 'count_bicyclist_killed', 'count_bicyclist_injured', 'count_mc_killed', 'count_mc_injured', 'latitude', 'longitude', 'bike_ped_collision', 'time_period', 'hour_of_day', 'lat', 'lon', 'address'], dtype='object')
int_fatal_severe_collision.dropna(subset=['lat', 'lon'], inplace=True)
radius = 0.1/6371.0088 # 100 m (0.1 km) expressed in radians (the Earth's mean radius is 6371.0088 km)
min_event = 3
# The haversine metric expects coordinates in radians, so convert the lat/lon pairs
x = np.radians(np.column_stack((int_fatal_severe_collision['lat'], int_fatal_severe_collision['lon'])))
db = DBSCAN(eps=radius, min_samples=min_event, algorithm='ball_tree', metric='haversine').fit(x)
labels = np.where(db.labels_==-1, "Noise", "Cluster").tolist()
int_fatal_severe_collision['cluster'] = labels
int_fatal_severe_collision[['primary_rd', 'secondary_rd', 'lat', 'lon', 'cluster']].head()
primary_rd | secondary_rd | lat | lon | cluster | |
---|---|---|---|---|---|
0 | SAN PEDRO ST | 32ND ST | 33.715175 | -118.289782 | Noise |
1 | 22ND ST | VILLA CABRILLO MARINA | 33.717084 | -118.282379 | Noise |
2 | ALMA ST | 30TH ST | 33.718315 | -118.299416 | Noise |
3 | PACIFIC AV | 26TH ST | 33.720614 | -118.287991 | Noise |
4 | 25TH ST | MEYLER ST | 33.721563 | -118.296843 | Noise |
cc.write(int_fatal_severe_collision, 'fatal_severe_collision', lnglat=('lon', 'lat'), overwrite=True)
Params: encode_geom, geom_col and everything in kwargs are deprecated and not being used any more
Table successfully written to CARTO: https://chenliny1117.carto.com/dataset/fatal_severe_collision
<cartoframes.dataset.Dataset at 0x7f020eb8c9e8>
cc.map(layers=[Layer('fatal_severe_collision',
color={'column': 'cluster', 'scheme': styling.burg(2, bin_method='category')},
size=5.5),
BaseMap(source='light')],
interactive=True,
zoom=10,
lng=-118.4148,
lat=34.1038
)