Oladimeji Salau
https://github.com/dimtics
Montreal is the most populous city in the province of Quebec and the second most populous in Canada. The city is considered generally safe with one of the lowest crime rates in Canada.
In this notebook, I analyzed Montreal city crime data (2015 - 2017) with a goal of providing answers to the following main questions among other findings noted in the analysis:
The dataset used in this analysis contains the criminal acts or crimes registered by the Police Department of the City of Montreal (SPVM) and made available on Montreal Open Data Portal (http://donnees.ville.montreal.qc.ca/dataset/actes-criminels). According to the portal, the dataset which covers the whole Montreal including all the islands of the city has been anonymized to ensure privacy.
Columns Descriptions:
CATEGORY:
Introduction: Breaking and entering a public institution or private residence, theft of a firearm in a residence.
Vol dans / sur véhicule à moteur: Theft of a motor vehicle (car, truck, motorcycle, etc.) contents or a vehicle part (wheel, bumper, etc.).
Vol de véhicule à moteur: Theft of a car, truck, motorcycle, snowmobile tractor with or without a trailer, construction or farm vehicle, all-terrain vehicle.
Méfait: Graffiti and damage to religious property, vehicle or general damage and all other types of mischief.
Vol qualifié: Robbery accompanied by commercial violence, financial institution, person, purse, armored vehicle, vehicle, firearm, and all other types of robbery.
Infraction entraînant la mort: First-degree murder, second-degree murder, manslaughter, infanticide, criminal negligence, and all other types of offenses resulting in death.
DATE: Date of the report of the event to the SPVM.
QUART: The time of the day the event was reported to the SPVM.
PDQ: Number of the neighborhood station covering the territory where the event took place. For example, the neighborhood station 50 corresponds to the unit in charge of the metro.
X and Y: Geospatial position according to MTM8 projection (SRID 2950). The value 0 is used when no geographical position was provided when entering the information.
LAT and LONG: geographical position of the event after obfuscation at an intersection according to the WGS84 geodesic datum. The value 1 is used when no geographical position has been provided when entering the information.
# import relevant packages
import pandas as pd
import numpy as np
from dateutil.parser import parse
import calendar
# Plotly packagess
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
# import functions
from crimedata_functions import map_data, embed_map, generate_map, extract_address, plotchart, markercol, linecol
# settings
init_notebook_mode(connected=True)
pd.set_option('display.max_colwidth', 130)
The dataset used in this work is freely available at http://donnees.ville.montreal.qc.ca/dataset/actes-criminels.
df = pd.read_csv("data/interventionscitoyendo.csv", encoding="latin1")
# display the first five records of the data
df.head()
CATEGORIE | DATE | QUART | PDQ | X | Y | LAT | LONG | |
---|---|---|---|---|---|---|---|---|
0 | Introduction | 2015-01-01 | jour | 23 | 302375.197993 | 5.046522e+06 | 45.558606 | -73.531060 |
1 | Introduction | 2015-01-01 | jour | 23 | 302375.197993 | 5.046522e+06 | 45.558606 | -73.531060 |
2 | Introduction | 2015-01-01 | soir | 13 | 295850.656000 | 5.031730e+06 | 45.425443 | -73.614364 |
3 | Introduction | 2015-01-01 | nuit | 8 | 289215.072000 | 5.036423e+06 | 45.467564 | -73.699308 |
4 | Introduction | 2015-01-01 | soir | 44 | 298915.433995 | 5.046912e+06 | 45.562090 | -73.575381 |
# determine the number of records in the dataset
print('The dataset contains {0} rows and {1} columns.'.format(df.shape[0], df.shape[1]))
The dataset contains 93078 rows and 8 columns.
# check for missing values and data types of the columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 93078 entries, 0 to 93077 Data columns (total 8 columns): CATEGORIE 93078 non-null object DATE 93078 non-null object QUART 93078 non-null object PDQ 93078 non-null int64 X 93078 non-null float64 Y 93078 non-null float64 LAT 93078 non-null float64 LONG 93078 non-null float64 dtypes: float64(4), int64(1), object(3) memory usage: 5.7+ MB
Columns such as 'CATEGORIE' for crime types as well as 'QUART' field defining the time of the day when the crime event was reported are in French. However, for the purpose of this analysis, these will be summarized in English.
# Map crime descriptions and day times in French to English
crime_mappings = list(zip([
'Introduction', 'Vol dans / sur véhicule à moteur',
'Vol de véhicule à moteur', 'Méfait', 'Vols qualifiés',
'Infractions entrainant la mort'
], [
'Burglary', 'Vehicle contents or parts theft', 'Vehicle theft',
'Misdemeanor', 'Robbery', 'Offenses causing death'
]))
day_mappings = list(zip(['jour', 'soir', 'nuit'], ['day', 'evening', 'night']))
# create a new column 'ADAPTED_CATEGORY' for crime descriptions in English
df['ADAPTED_CATEGORY'] = df['CATEGORIE'].apply(
lambda x: map_data(crime_mappings, x))
# modify 'QUART' column
df['QUART'] = df['QUART'].apply(lambda x: map_data(day_mappings, x))
The date field included in the dataset was parsed and separated into 'YEAR' and 'MONTH' columns. The PDQ column is converted to string and the alphabets 'PDQ' are appended to each value. Columns 'X' and 'Y' are not relevant to the analysis and hence dropped.
# turn date field from object to date data type
df['DATE'] = df['DATE'].apply(lambda x: parse(x))
# define a new column 'YEAR'
df['YEAR'] = df['DATE'].apply(lambda x: x.year).astype(str)
# define a new column 'MONTH'
df['MONTH'] = df['DATE'].apply(lambda x: x.month)
# modify 'PDQ' column
df['PDQ'] = df['PDQ'].apply(lambda x: 'PDQ '+ str(x))
# drop X and Y columns
df.drop(['X', 'Y'], axis=1, inplace=True)
For the purpose of this analysis, the scope of data will be restricted to period between 2015 and 2017.
# extract only '2015 - 2017' records from the dataset
xdf = df[df['YEAR'] != '2018'].copy()
To generate the map, the latitude and longitude fields in the dataset were first chained together and defined as a new column. This was then used to summarize and aggregate crime events in the same neighborhoods to enable easy mapping.
# chain latitudes and longitudes together as a new column
xdf['COORDS'] = list(zip(xdf['LAT'], xdf['LONG']))
# summarize data
aggcrime = xdf.groupby(['COORDS', 'ADAPTED_CATEGORY', 'YEAR']).agg({'ADAPTED_CATEGORY':'count'}).rename(columns={'ADAPTED_CATEGORY': 'COUNT'}).reset_index()
# split 'COORDS' into two columns
aggcrime['LAT'] = aggcrime['COORDS'].apply(lambda x: str(x).split(",")[0].replace("(", "")).astype(float)
aggcrime['LON'] = aggcrime['COORDS'].apply(lambda x: str(x).split(",")[1].replace(")", "")).astype(float)
# a crime map of 2015 is shown here. Maps of 2016 and 2017 can be displayed by changing the year value accordingly.
generate_map(aggcrime, '2015')
# summarize data by year and crime categories
aggdf = xdf.groupby(['YEAR', 'ADAPTED_CATEGORY']).agg({'ADAPTED_CATEGORY':'count'}).rename(columns={'ADAPTED_CATEGORY': 'INCIDENT_COUNT'}).reset_index()
# define lists to hold crime types and sorted years.
titlelist = [
'Burglary', 'Vehicle contents or parts theft', 'Misdemeanor',
'Vehicle theft', 'Robbery', 'Offenses causing death'
]
yearlist = sorted(list(set(xdf['YEAR'])))
# create an empty list to hold chart data definitions for each plot
trace_list = []
# define subplots
fig = tools.make_subplots(rows=1, cols=6, subplot_titles=(["<b>{}</b>".format(i) for i in titlelist]),
shared_yaxes=True, horizontal_spacing=(0.02), print_grid=False)
# a matrix for subplot selection order
m = np.array([1, 1, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6]).reshape(6, 2)
# define chart data
for i in range(6):
data = aggdf[aggdf['ADAPTED_CATEGORY'] == titlelist[i]]
tracex = go.Bar(x=data['YEAR'], y=data['INCIDENT_COUNT'], text=data['INCIDENT_COUNT'],
textposition='outside', hoverinfo='text', outsidetextfont=dict(size='10'),
cliponaxis=False, name='', showlegend=False, width=0.85,
marker=dict(color=markercol[i], line=dict(color=linecol[i], width=1)))
trace_list.append(tracex)
# append each subplot data definitions to the figure instance
fig.append_trace(trace_list[i], m[i][0], m[i][1])
# define layout settings
for i in fig['layout']['annotations']:
i['font'] = dict(size=11)
i['y'] = 1.2
i['yanchor'] = 'top'
fig['layout']['annotations'][1]['text'] = '<b>Vehicle contents<br>or parts theft</b>'
for i in range(1, 7):
fig['layout']['yaxis' + '{}'.format(i)].update(title='Crime Incident',
titlefont=dict(size=11, color='rgb(107, 107, 107)'), tickfont=dict(size=10, color='rgb(107, 107, 107)'),
showticklabels=False, showgrid=True)
fig['layout']['xaxis' + '{}'.format(i)].update(titlefont=dict(size=11, color='rgb(107, 107, 107)'),
tickfont=dict(size=10, color='rgb(107, 107, 107)'))
# update layout settings
fig['layout'].update(height=350, width=950, showlegend=False, autosize=False,
title="<b>Crime Events Distribution: 2015 - 2017</b>", titlefont=dict(size=14),
paper_bgcolor='rgba(245, 246, 249, 1)', plot_bgcolor='rgba(245, 246, 249, 1)')
iplot(fig)
Observation:
# summarize data by year, month and crime categories
mon_trend = xdf.groupby(['YEAR', 'MONTH', 'ADAPTED_CATEGORY']).agg({'ADAPTED_CATEGORY': 'count'}).rename(columns={'ADAPTED_CATEGORY':'INCIDENT_COUNT'}).reset_index()
# sort month column and change months in figures to names
mon_trend['MONTH'] = mon_trend['MONTH'].sort_values().apply(lambda x: calendar.month_abbr[x])
# define subplots
fig = tools.make_subplots(rows=1, cols=3, subplot_titles=(["<b>{}</b>".format(i) for i in yearlist]),
shared_yaxes=True, horizontal_spacing=(0.05), print_grid=False)
# a matrix for subplot selection order
m = np.array([1, 1, 1, 2, 1, 3]).reshape(3, 2)
# define marker colors
scattcol = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(50, 171, 96)',
'rgb(214, 39, 40)', 'rgb(148, 103, 189)', 'rgb(140, 86, 75)']
# define chart data
for i in range(3):
trace_list = []
data = mon_trend[mon_trend['YEAR'] == yearlist[i]]
for j in range(6):
tracex = go.Scatter(x=data[data['ADAPTED_CATEGORY'] == titlelist[j]]['MONTH'],
y=data[data['ADAPTED_CATEGORY'] == titlelist[j]]['INCIDENT_COUNT'], mode='lines',
marker=dict(color=scattcol[j]), line=dict(width=1.5), showlegend=False, name=titlelist[j])
if i == 0: # show legend for only the first subplot.
tracex.showlegend = True
trace_list.append(tracex)
# append each subplot data definitions to the figure instance
fig.append_trace(trace_list[j], m[i][0], m[i][1])
# define layout settings
for i in fig['layout']['annotations']:
i['font'] = dict(size=12)
i['y'] = 1.07
fig['layout']['legend'] = dict(orientation="h")
for i in range(1, 4):
fig['layout']['yaxis' + '{}'.format(i)].update(title='Crime Incident',
titlefont=dict(size=11, color='rgb(107, 107, 107)'), tickfont=dict(size=10, color='rgb(107, 107, 107)'),
range=[0, 1000], showgrid=True)
fig['layout']['xaxis' + '{}'.format(i)].update(titlefont=dict(size=11, color='rgb(107, 107, 107)'),
tickfont=dict(size=10, color='rgb(107, 107, 107)'), tickangle=35, showgrid=False)
# update layout settings
fig['layout'].update(height=420, width=950, showlegend=True, autosize=False,
title="<b>Crime Incidents Trends: 2015 - 2017</b>",
titlefont=dict(size=14), paper_bgcolor='rgba(245, 246, 249, 1)', plot_bgcolor='rgba(245, 246, 249, 1)')
iplot(fig)
Observations:
1. Apart from from few exceptions noted in certain months, almost all the crime categories exhibited the same trend patterns over the 3-years period.
2. Burglary recorded a significant decrease (31%) from 917 in January to 629 in February in 2017.
# summarize data
tot =[aggdf[aggdf['YEAR']==year].sort_values('INCIDENT_COUNT', ascending=False)[:3] for year in yearlist]
prev = pd.concat(tot, ignore_index=True)
# display result in crosstab
display(pd.crosstab(index=prev['YEAR'], columns=prev['ADAPTED_CATEGORY'], values=prev['INCIDENT_COUNT'], aggfunc=sum))
# define chart data and plot
data = prev
chdata = {'trace_data':data, 'x':'ADAPTED_CATEGORY', 'y':'INCIDENT_COUNT'}
chlayout= {'height':400, 'width':850, 'title':"<b>Top 3 crime types committed in 2015, 2016 and 2017</b>", 'yaxistitle':'Crime Incident','tickangle':20}
plotchart(chdata, chlayout, titlelist, yearlist, subtitlelist=yearlist)
ADAPTED_CATEGORY | Burglary | Misdemeanor | Vehicle contents or parts theft |
---|---|---|---|
YEAR | |||
2015 | 9803 | 7056 | 7984 |
2016 | 9828 | 6252 | 7691 |
2017 | 9232 | 6166 | 7179 |
Observation:
# exclude records with no location information
top_neighb = xdf[xdf['LAT']!=1.000000]
# summarize data
top = top_neighb.groupby(['YEAR', 'COORDS']).agg({'ADAPTED_CATEGORY': 'count'}).rename(columns={'ADAPTED_CATEGORY':'CRIME_INCIDENT'}).reset_index()
tot =[top[top['YEAR']==year].sort_values('CRIME_INCIDENT', ascending=False).iloc[:1] for year in yearlist]
topdf = pd.concat(tot, ignore_index=True)
# extract neighborhood addresses as a new column using 'extract_address' function.
topdf['NEIGHBORHOOD'] = extract_address(topdf['COORDS'])
# define list to format subplot headers
klist = ["<b>2015: Boulevard des<br>Galeries-d'Anjou, Anjou</b>",
'<b>2016: Chemin de la<br>Côte-de-Liesse, Saint-Laurent</b>',
'<b>2017: Chemin de la<br>Côte-de-Liesse, Saint-Laurent</b>']
# merge dataframes
tcp = pd.merge(topdf, top_neighb, on=['COORDS','YEAR'])
# summarize data and define a new dataframe
grptcp = tcp.groupby(['YEAR', 'ADAPTED_CATEGORY']).agg({'ADAPTED_CATEGORY': 'count'}).rename(columns={'ADAPTED_CATEGORY':'CRIME_INCIDENT'}).reset_index()
tot =[grptcp[grptcp['YEAR']==year].sort_values('CRIME_INCIDENT', ascending=False) for year in yearlist]
prevcrime = pd.concat(tot, ignore_index=True)
# display result
display(topdf[['YEAR', 'CRIME_INCIDENT', 'NEIGHBORHOOD']])
# define chart data and plot
data = prevcrime
chdata = {'trace_data':data, 'x':'ADAPTED_CATEGORY', 'y':'CRIME_INCIDENT'}
chlayout= {'height':450, 'width':950, 'title':"<b>Crime types in the neighborhoods with highest crime incidents in 2015, 2016 and 2017</b>", 'yaxistitle':'Crime Incident','tickangle':20}
plotchart(chdata, chlayout, titlelist, yearlist, subtitlelist=klist)
YEAR | CRIME_INCIDENT | NEIGHBORHOOD | |
---|---|---|---|
0 | 2015 | 84 | Boulevard des Galeries-d'Anjou, Anjou |
1 | 2016 | 82 | Chemin de la Côte-de-Liesse, Saint-Laurent |
2 | 2017 | 119 | Chemin de la Côte-de-Liesse, Saint-Laurent |
Observations:
1. Boulevard des Galeries-d'Anjou, Anjou and Chemin de la Côte-de-Liesse, Saint-Laurent are neighborhoods with the highest number of crimes in 2015, 2016 and 2017.
2. Vehicle theft and vehicle contents or parts theft were the main common crime types in Boulevard des Galeries-d'Anjou, Anjou and Chemin de la Côte-de-Liesse, Saint-Laurent neighborhoods within the period reviewed.
3. Vehicle theft was particularly prominent in Chemin de la Côte-de-Liesse, Saint-Laurent in 2016 and in 2017. The neighborhood has about 40% increase in vehicle theft from 2016 to 2017. Vehicle contents or parts theft recorded 22% increase within the same period.
# extract and summarize data
ddf = top_neighb[top_neighb['ADAPTED_CATEGORY']=="Offenses causing death"]
dtop = ddf.groupby(['YEAR', 'COORDS']).agg({'COORDS':'count'}).rename(columns={'COORDS':'DEATH_INCIDENT'}).reset_index()
# sort and extract neighborhood addresses as a new column using 'extract_address' function.
topdf = dtop.sort_values('DEATH_INCIDENT', ascending=False).reset_index(drop=True).iloc[:1]
topdf['NEIGHBORHOOD'] = extract_address(topdf['COORDS'])
# display result
display(topdf)
YEAR | COORDS | DEATH_INCIDENT | NEIGHBORHOOD | |
---|---|---|---|---|
0 | 2016 | (45.51171625506121, -73.5622018412409) | 2 | Pavillon Sainte-Catherine, Rue Sainte-Catherine Est, Quartier des Spectacles, Centre-Ville, Ville-Marie |
Observation:
# summarize and sort data
crime_time = xdf.groupby(['YEAR', 'QUART']).agg({'QUART': 'count'}).rename(columns={'QUART':'COUNT'}).reset_index()
tot =[crime_time[crime_time['YEAR']==year].sort_values('COUNT', ascending=False) for year in yearlist]
timedf = pd.concat(tot, ignore_index=True)
# display result in crosstab
display(pd.crosstab(index=timedf['YEAR'], columns=timedf['QUART'], values=timedf['COUNT'], aggfunc=sum))
# define chart data and plot
data = timedf
chdata = {'trace_data':data, 'x':'QUART', 'y':'COUNT'}
chlayout= {'height':400, 'width':850, 'title':"<b>Crime incidents during the days, evenings and nights in 2015, 2016 and 2017</b>", 'yaxistitle':'Crime Incident','tickangle':20}
plotchart(chdata, chlayout, titlelist, yearlist, subtitlelist=yearlist)
QUART | day | evening | night |
---|---|---|---|
YEAR | |||
2015 | 15326 | 10196 | 4666 |
2016 | 14497 | 10134 | 4393 |
2017 | 14173 | 9601 | 4334 |
Observation:
# summarize and sort data
crime_time = xdf.groupby(['YEAR', 'PDQ']).agg({'PDQ': 'count'}).rename(columns={'PDQ':'COUNT'}).reset_index()
tot =[crime_time[crime_time['YEAR']==year].sort_values('COUNT', ascending=False).iloc[:5] for year in yearlist]
pdqdf = pd.concat(tot, ignore_index=True)
# define chart data and plot
data = pdqdf
chdata = {'trace_data':data, 'x':'PDQ', 'y':'COUNT'}
chlayout= {'height':400, 'width':850, 'title':"<b>Top 5 police stations (PDQ) with the highest crime complaints registered in 2015, 2016 and 2017</b>", 'yaxistitle':'Crime Incident','tickangle':20}
plotchart(chdata, chlayout, titlelist, yearlist, subtitlelist=yearlist)
Observation:
# sort and define a new dataframe
tot =[crime_time[crime_time['YEAR']==year].sort_values('COUNT', ascending=True).iloc[:3] for year in yearlist]
pdqdf = pd.concat(tot, ignore_index=True)
# define chart data and plot
data = pdqdf
chdata = {'trace_data':data, 'x':'PDQ', 'y':'COUNT'}
chlayout= {'height':400, 'width':850, 'title':"<b>Top 5 police stations (PDQ) with the lowest crime complaints registered in 2015, 2016 and 2017</b>", 'yaxistitle':'Crime Incident','tickangle':20}
plotchart(chdata, chlayout, titlelist, yearlist, subtitlelist=yearlist)
Observation:
Montreal city crime data as obtained from Montreal Open Data Portal (http://donnees.ville.montreal.qc.ca/dataset/actes-criminels) was analyzed. The following are the summary observations or main insights derived from the analysis based on the initial questions raised in the introduction section.
Burglary, Vehicle contents or parts theft and Misdemeanor are the three most prevalent crimes in the 3-years period.
Besides vehicle theft that increased by 9.4% between 2016 and 2017, all other crime types showed a downward trend in the 3-years period reviewed.
Offenses causing death are the least crime incidents noted during the 3-years period.
Burglary recorded a significant decrease (31%) from 917 in January to 629 in February in 2017.
Boulevard des Galeries-d'Anjou, Anjou and Chemin de la Côte-de-Liesse, Saint-Laurent are the neighborhoods with the highest number of crimes in 2015, 2016 and 2017. Vehicle theft and vehicle contents or parts theft are the main common crime types noted in these neighborhoods.
Vehicle theft was particularly prominent in Chemin de la Côte-de-Liesse, Saint-Laurent in 2016 and in 2017. The neighbourhood has about 40% increase in vehicle theft from 2016 to 2017. Vehicle contents or parts theft recorded 22% increase within the same period.
Two cases of murder occurred in 2016 at Pavillon Sainte-Catherine, Rue Sainte-Catherine Est, Quartier des Spectacles, Centre-Ville, Ville-Marie. These are the highest murder cases within the 3-years period based on the dataset.
Crimes committed during the days (jour) in each year were about twice the total crimes registered in the evenings (soir) and in the nights (nuit) for the same year. However daytime crimes recorded 5.4% decrease from 2015 to 2016 and 2.2% decrease between 2016 and 2017.
PDQ38 and PDQ7 are the prominent among the top stations that registered most crime incidents in 2015, 2016 and 2017.
PDQ0 and PDQ55 recorded the least crime cases among the 3 top stations with low crime registrations in 2015, 2016 and 2017.