This notebook covers the augmentation of the data through requests to Google APIs, its processing, and its merging with the traffic data for the year 2015. The goals are:
It is quite long, so readers who would like to skip the intermediate steps can go directly to the end of section 3 and see the dataframes for the risk.
The very short section 4 then contains some remarks and approximations that we made and that should be kept in mind while reading our further analysis.
Good reading, and we hope that you will enjoy it as much as one can enjoy reading code.
Here are some links to navigate faster. To use them, you should first copy and paste the URL of this page into Jupyter nbviewer: http://nbviewer.jupyter.org/
# import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import json
from pandas.io.json import json_normalize
import copy
import csv
# for string processing
import unidecode
import collections
import string
# Merging names. Installation: pip install fuzzywuzzy and pip install fuzzywuzzy[speedup]
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from IPython.core import display as ICD
%matplotlib inline
# define temporary data and data folder with original data
MY_DATA_FOLDER = 'my_csv/' # Where we put our created csv files to be read
DATA_FOLDER = '_data/' # where reside the data taken from the web
# load datasets
acc_new = pd.read_csv(MY_DATA_FOLDER+'acc_new.csv',index_col=0, sep='\t')
Here we load the Google API keys that we will later use for the requests:
# GO and fetch the google API keys from a local file
CENTRAL_FOLDER = ''
API_keys = pd.read_csv(CENTRAL_FOLDER + 'API_google_key',index_col=0,header=None, sep='\=', engine='python')
############# Google API KEYS ##################
nearest_roads_API_KEY = API_keys.iloc[0].values[0] # accepts up to 100 points
placesID_API_KEY = API_keys.iloc[1].values[0]
API_KEY = API_keys.iloc[2].values[0] # google directions
NOTE: Google APIs have a daily request limit for non-premium users. Since the nearest roads API lets us group requests in batches of 100 points, this is not a problem for it. However, the Places API is limited to only 1000 requests per day, so we decided to operate only on a subset of the data, i.e. the year 2015, which is also the year for which we have the traffic data.
We filter the accidents by year and keep only those of 2015. Then we write them to a csv file and load it:
# filter the dataset, keeping only the accidents of 2015
acc_new['DATE_'] = acc_new['DATE_'].astype(str)
acc_2015 = acc_new[acc_new['DATE_'].str.contains("2015")]
acc_2015.reset_index(drop=False,inplace = True)
# write it down to a file
#acc_2015.to_csv('acc_2015.csv', sep='\t', encoding='utf-8') # to run with care
#load dataset
acc_2015 = pd.read_csv(MY_DATA_FOLDER+'acc_2015.csv',index_col=0, sep='\t')
# see the difference in number of accidents
print('total number of accidents',acc_new.shape[0],' || number of accidents in 2015:',acc_2015.shape[0] )
total number of accidents 19231 || number of accidents in 2015: 3420
We define a function to do the requests and another to create the list of points:
# make the requests
def request_get_json(url):
    r = requests.get(url)
    if r.status_code != 200:
        print(r.status_code)
    page_body = r.text
    return json.loads(page_body)
# create a list of points that we can then call the API upon
def create_point_list(df, size_points):
    count = 0
    point_list = []
    for i in range(df.shape[0]//size_points):
        x_coord = str(df.COOR_X.loc[i*size_points])
        y_coord = str(df.COOR_Y.loc[i*size_points])
        point = str(y_coord) + ',' + str(x_coord)
        count += 1
        for k in range(1, size_points):
            x_coord = str(df.COOR_X.loc[i*size_points+k])
            y_coord = str(df.COOR_Y.loc[i*size_points+k])
            point = point + '|' + str(y_coord) + ',' + str(x_coord)
            count += 1
        point_list.append(point)
    # handle the remaining points (last, incomplete batch)
    count_up_to_now = count
    x_coord = str(df.COOR_X.loc[count_up_to_now])
    y_coord = str(df.COOR_Y.loc[count_up_to_now])
    point = str(y_coord) + ',' + str(x_coord)
    count += 1
    for i in range(1, df.shape[0] % size_points):
        x_coord = str(df.COOR_X.loc[count_up_to_now + i])
        y_coord = str(df.COOR_Y.loc[count_up_to_now + i])
        point = point + '|' + str(y_coord) + ',' + str(x_coord)
        count += 1
    point_list.append(point)
    print('shapes: ', df.shape[0], '|| count ', count, '|| point_list length: ', len(point_list))
    return point_list
Create the list of points on which to call the Google API, so that we have to make fewer requests.
Then we make the requests to the Google nearest roads API and store the results as a list of JSON objects.
Note that for each geolocation point this API returns a placeId. With that code we can later call the second API and get the road name.
point_list = create_point_list(acc_2015,100)
# do request to google API and create a list of json
my_json_list = []
for i, point in enumerate(point_list):
    url = 'https://roads.googleapis.com/v1/nearestRoads?' +\
          'points=' + str(point) + '&key=' + nearest_roads_API_KEY
    my_json = request_get_json(url) # to be run with parsimony since it uses our daily number of requests
    my_json_list.append(my_json)
    print(i, '||', end="")
shapes:  3420 || count  3420 || point_list length:  35
0 ||1 ||2 ||3 ||4 ||5 ||6 ||7 ||8 ||9 ||10 ||11 ||12 ||13 ||14 ||15 ||16 ||17 ||18 ||19 ||20 ||21 ||22 ||23 ||24 ||25 ||26 ||27 ||28 ||29 ||30 ||31 ||32 ||33 ||400 34 ||
Since within each request the index is reset to a value between 0 and 99, we have to map it back to the original index in order to later merge with the original dataframe.
Then we gather everything in a dataframe and write it down as a csv file so that we won't have to redo the requests.
NOTE: according to Google's terms of service, we are not allowed to store this data for more than 30 days!
# do a deep copy to avoid references
original_json_list = copy.deepcopy(my_json_list);
json_list = copy.deepcopy(my_json_list);
# set the correct original index
for i in range(len(json_list)):
    if 'snappedPoints' in json_list[i]:
        for j in range(len(json_list[i]['snappedPoints'])):
            json_list[i]['snappedPoints'][j]['originalIndex'] += i*100
# transform the list of json in list of dataframes
df_list = []
for i in range(len(json_list)-1): # -1 since we had problems with the google API for the last 99 accidents
    data = json_list[i]['snappedPoints']
    df = pd.DataFrame.from_dict(json_normalize(data), orient='columns')
    df_list.append(df)
# merge the list of df in one unique dataframe
df_placeIds = pd.concat(df_list)
df_placeIds.drop_duplicates(subset='originalIndex',inplace=True)
df_placeIds.set_index('originalIndex',inplace=True)
print("df_places shape: ", df_placeIds.shape,' last element:',df_placeIds.index[-1])
print('thus we had no response for :',abs(df_placeIds.shape[0]-df_placeIds.index[-1]),' elements')
# save the dataframe to a CSV file
#df_placeIds.to_csv('placesIds2015_23_11_2017###.csv', sep='\t', encoding='utf-8') # uncomment to overwrite
df_places shape: (3387, 3) last element: 3399 thus we had no response for : 12 elements
We load the dataset and make the requests to the Google API. When we reach the query limit we exit the loop so that we can more easily save the results obtained so far. We then extract the name of the road from the JSON received.
We write down the results in a csv file so that we will not have to re-query the Google API. Note that the name of the file changed every day, since we had to make the requests bit by bit.
df_placeIds = pd.read_csv(MY_DATA_FOLDER+'placesIds2015_23_11_2017.csv',index_col=0, sep='\t')
# Number of requests we have to make:
print(df_placeIds.shape[0])
# request the name of the roads
# be careful to choose wisely the range over which to iterate
# since we can run this only once a day
tmp = []
start = 1997
end_ = 2001
for i in range(start, end_+1): # to run with care since it uses our daily number of requests
    url = 'https://maps.googleapis.com/maps/api/place/details/json?'+\
          'placeid='+ df_placeIds.iloc[i]['placeId'] +'&key='+placesID_API_KEY
    place_json = request_get_json(url)
    tmp.append(place_json)
    print(i, '||', end="") # i-start gives the right index in tmp
    if tmp[i-start]['status'] == 'OVER_QUERY_LIMIT':
        print('BREAK: reason is:', tmp[i-start]['status'])
        print('the last valid index was ', i-1)
        break
# extract only the road name
tmp2 = []
for i in range(len(tmp)):
    if tmp[i]['status'] == 'OK':
        tmp2.append(tmp[i]['result']['name'])
    else:
        tmp2.append(':ERROR STATUS: ' + tmp[i]['status'])
# Write down roads in a csv file
df = pd.DataFrame(tmp2)
df = df[df[0] != ':ERROR STATUS: OVER_QUERY_LIMIT'] # just an additional check
df.columns = ['road_name']
#df.to_csv('roads_2015_4##.csv', sep='\t', encoding='utf-8') # to run with care
# load the dataset and visualize quickly
road_df = pd.read_csv(MY_DATA_FOLDER+'roads_2015_4.csv',index_col=0, sep='\t')
road_df.head()
3387 1997 ||1998 ||1999 ||2000 ||2001 ||
road_name | |
---|---|
0 | Rue des Bains 28-34 |
1 | Av. de la Gare-des- Eaux-Vives 2-10 |
2 | Chemin du Stade 5-3 |
3 | Route de Pré-Marais |
4 | Chemin des Champs-de-Chaux 4-16 |
**End of requests:** from now on we shouldn't have to make any more requests to the two previously defined APIs.
We load all our csv files and drop the overlaps between them:
We deliberately overlapped the files by one line so that it was easy to check that we were querying the right data.
We then check against the original data that each dataframe is the exact continuation of the previous one, so that we can concatenate them into one meaningful list (as a dataframe) of road names. Finally we write it down as a csv file.
# load the dataframe with the roads in order
road_df_1 = pd.read_csv(MY_DATA_FOLDER+'roads_2015_1.csv',index_col=0, sep='\t')
road_df_2 = pd.read_csv(MY_DATA_FOLDER+'roads_2015_2.csv',index_col=0, sep='\t')
road_df_3 = pd.read_csv(MY_DATA_FOLDER+'roads_2015_3.csv',index_col=0, sep='\t')
road_df_4 = pd.read_csv(MY_DATA_FOLDER+'roads_2015_4.csv',index_col=0, sep='\t')
# drop superpositions:
print('Visualisation of the problem and solving: \n')
# for 1-2
print('1-2')
print(road_df_1.tail(1))
print(road_df_2.head(1))
road_df_2.drop(road_df_2.index[0], inplace=True) # drop the overlapping first line
print(road_df_2.head(1))
print(' ------------------- ')
# for 2-3 already done
print('2-3')
print(road_df_2.tail(1))
print(road_df_3.head(1))
print(' ------------------- ')
#for 3-4
print('3-4')
print(road_df_3.tail(1))
print(road_df_4.head(1))
road_df_4.drop(road_df_4.index[0], inplace=True) # drop the overlapping first line
print(road_df_4.head(1))
# concatenate both dataframes
roads_df_tot = pd.concat([road_df_1, road_df_2, road_df_3, road_df_4], axis=0)
#roads_df_tot.to_csv('roads_2015_TOT###.csv', sep='\t', encoding='utf-8') # to run with care
Visualisation of the problem and solving:

1-2
             road_name
998  Chemin des Sports
           road_name
0  Chemin des Sports
                       road_name
1  Boulevard Georges-Favon 27-23
 -------------------
2-3
         road_name
1000  Unnamed Road
                    road_name
0  Chemin des Corbillettes 20
 -------------------
3-4
               road_name
987  Rue des Bains 28-34
             road_name
0  Rue des Bains 28-34
                            road_name
1  Av. de la Gare-des- Eaux-Vives 2-10
We load the csv files, concatenate them, write the results as csv file and visualize the results:
# load and reset index for later concatenation
road_name_df = pd.read_csv(MY_DATA_FOLDER+'roads_2015_TOT.csv',index_col=0, sep='\t')
road_name_df.reset_index(drop=True,inplace=True) # to have the right indices for the merging
place_df = pd.read_csv(MY_DATA_FOLDER+'placesIds2015_23_11_2017.csv',index_col=0, sep='\t')
place_df.reset_index(inplace=True) # to have the right indices for the merging
result = pd.concat([place_df, road_name_df], axis=1)
#save as csv
#result.to_csv('placeIds_and_road_name_2015###.csv', sep='\t', encoding='utf-8') # to run with care
# visualize
result.head()
originalIndex | location.latitude | location.longitude | placeId | road_name | |
---|---|---|---|---|---|
0 | 0 | 46.204340 | 6.096673 | ChIJQ_TbBapkjEcR4MlC50KNhpw | Place du Lignon 6 |
1 | 2 | 46.184618 | 6.134491 | ChIJcRszIid7jEcRcIdte3BVAjU | Avenue Vibert |
2 | 3 | 46.205589 | 6.207067 | ChIJv5h8Q-hvjEcR5mPZd3Lwu4k | Chemin du Petit-Bel-Air 2 |
3 | 5 | 46.233075 | 6.081553 | ChIJuS2HOHNjjEcRfrZKhrlOiVQ | Rue des Boudines 2-10 |
4 | 6 | 46.232696 | 6.081127 | ChIJA6XmJXNjjEcRnr4TnUB0c4A | Unnamed Road |
We load the result and do the merging.
Cross check: we verify that the returned latitude and longitude are close to COOR_X and COOR_Y (i.e. that we did not completely screw up) by computing the distance between each accident and the geolocation returned for the road.
Then we clean the dataset and save a cleaned version.
# dataset with placeIds with original index and road names
road_placeID_df = pd.read_csv(MY_DATA_FOLDER+'placeIds_and_road_name_2015.csv',index_col=0, sep='\t')
road_placeID_df.set_index('originalIndex',inplace=True)
# dataset of the accidents in 2015
acc_2015 = pd.read_csv(MY_DATA_FOLDER+'acc_2015.csv',index_col=0, sep='\t')
# merge
acc_with_names = acc_2015.merge(road_placeID_df, left_index=True, right_index=True, how='inner')
# visualize
print('acc_2015 shape:',acc_2015.shape,'acc_with_names shape',acc_with_names.shape)
print('Thus we did not have the localisation for ',acc_2015.shape[0]-acc_with_names.shape[0] ,
' accidents. Since it is not too much we ignore those points and continue the analysis')
# just some columns to increase readability
acc_with_names[['COOR_X','COOR_Y','location.latitude','location.longitude',\
'ID_ACCIDENT','road_name','GROUPE_ACCIDENT']].head()
#We then save the results in a csv file:
#acc_with_names.to_csv('acc_with_names_dirty##.csv', sep='\t', encoding='utf-8') # to run with care
# duplicate df with columns of interest
tmp = acc_with_names[['COOR_X','COOR_Y','location.latitude','location.longitude','ID_ACCIDENT','road_name']].copy()
tmp.reset_index(drop=True,inplace=True)
# compute the max distance from a point
max_dist = 0
accident_ID_argmax_dist = 0
where_argmax_dist = ''
for i in range(acc_with_names.shape[0]):
    original_longitude = tmp['COOR_X'].loc[i]
    original_latitude = tmp['COOR_Y'].loc[i]
    new_longitude = tmp['location.longitude'].loc[i]
    new_latitude = tmp['location.latitude'].loc[i]
    dist = np.sqrt((original_longitude-new_longitude)**2+(original_latitude-new_latitude)**2)
    if dist > max_dist:
        max_dist = dist
        accident_ID_argmax_dist = tmp['ID_ACCIDENT'].loc[i]
        where_argmax_dist = tmp['road_name'].loc[i]
print('max_dist: ',max_dist,' Where: ',where_argmax_dist,' ID :', accident_ID_argmax_dist)
acc_with_name_dirty = pd.read_csv(MY_DATA_FOLDER + 'acc_with_names_dirty.csv',index_col=0, sep='\t')
acc_with_name_clean = acc_with_name_dirty.drop(['location.latitude','location.longitude','placeId'],axis=1)
#acc_with_name_clean.to_csv('acc_with_names_clean##.csv', sep='\t', encoding='utf-8') # to run with care
acc_with_name_clean.head(2)
acc_2015 shape: (3420, 40) acc_with_names shape (3387, 44)
Thus we did not have the localisation for  33  accidents. Since it is not too much we ignore those points and continue the analysis
max_dist:  0.000698581085918  Where:  Avenue de Mategnin 69-63  ID : 1598335.0
index | ID_ACCIDENT | DATE_ | GROUPE_ACCIDENT | COMMUNE | CONDITIONS_LUMINEUSES | CONDITIONS_METEO | CONSEQUENCES | COOR_X | COOR_Y | ... | NB_BUS | NB_TRAM | E | N | YEAR | MONTH | DAY | CAUSE_g | CAUSE_sg | road_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 300 | 1647805.0 | 2015-09-12 | Accident en parquant | Vernier | Nuit | Couvert | Autres | 6.096657 | 46.204350 | ... | 0.0 | 0.0 | 2496424.0 | 1117876.0 | 2015 | 9 | 12 | Autres mouvements dans la circulation | Marche arrière imprudente | Place du Lignon 6 |
2 | 658 | 1639088.0 | 2015-08-22 | Accident en parquant | Carouge GE | Jour | Beau | Autres | 6.134020 | 46.184662 | ... | 0.0 | 0.0 | 2499271.0 | 1115639.0 | 2015 | 8 | 22 | Inattention et distraction | Manque d'attention momentané | Avenue Vibert |
2 rows × 41 columns
The maximum distance is reasonable (even though one should remember that these are GPS coordinates, so even a difference of 0.1 would be a big distance), so we consider ourselves satisfied with the matching (here again, keep in mind that some GPS points have inevitably been mapped to the wrong roads, but we consider them a negligible minority).
Note that index 1 is missing: this is because, as said before, Google did not find a correct match for every accident (even though it did for most of them).
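To get an intuition for what this maximum distance means in metres, here is a rough, hedged sanity check (the factor of about 111 km per degree is an approximation that only gives an order of magnitude, not a proper geodesic distance):
# rough upper bound: both latitude and longitude differences are scaled by at most ~111 km per degree
max_dist_in_m = max_dist * 111000
print('the maximal snapping distance is at most roughly {:.0f} m'.format(max_dist_in_m))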
Load the data for traffic:
compt_trafic_data = DATA_FOLDER+'OTC_COMPTAGE_TRAFIC.csv'
compt_trafic_df = pd.read_csv(compt_trafic_data, sep=';', encoding='latin-1')
compt_trafic_df.head(3)
NO_SIREDO | NO_POINT_MESURE | CODE_VOIE | NOM_VOIE | DIRECTION | TJOM | TJOM_ANNEE | TJM | TJM_ANNEE | DISPONIBILITE | ANGLE | PRESELECTION | ID_GM_TRONCON | CAPTEUR | DATEDT | NOM_POINT_DE_MESURE | E | N | HPM | HPS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 80.1 | 6727 | 5525.0 | Rue de Chêne-Bougeries | NaN | 8359.0 | 2015.0 | 8360.0 | 2015.0 | 2 mois par an | 128.0 | Toutes_Directions | 114184.0 | PCRT | 2016-10-12 10:04:52 | Chêne-Bougeries / Montagne | 2503566.98 | 1116940.55 | NaN | NaN |
1 | 147.2 | 17007 | 23590.0 | Avenue de Thônex | NaN | 4539.0 | 2015.0 | 3961.0 | 2015.0 | 2 mois par an | 15.0 | Toutes_Directions | 115652.0 | PCRT | 2016-10-12 10:04:52 | Thônex / Genève | 2504307.66 | 1116431.69 | NaN | NaN |
2 | 147.1 | 17028 | 23590.0 | Avenue de Thônex | NaN | 5076.0 | 2015.0 | 4598.0 | 2015.0 | 2 mois par an | 195.0 | Toutes_Directions | 115652.0 | PCRT | 2016-10-12 10:04:52 | Thônex / Genève | 2504306.25 | 1116437.21 | NaN | NaN |
Selecting the features of importance from the traffic dataset:
We are not interested in the identifier, sensor, and direction columns (the ones dropped in the cell below),
thus we drop them.
We would also have been interested in the peak-hour counts (HPM and HPS),
but there is no data for them, so we drop them as well.
# reload so that we can run this cell many times
compt_trafic_df = pd.read_csv(compt_trafic_data, sep=';', encoding='latin-1')
# drop non-used features
compt_trafic_df.drop(['DIRECTION', 'ANGLE', 'NO_POINT_MESURE', 'NO_SIREDO', 'CAPTEUR', 'HPM', 'HPS',\
'ID_GM_TRONCON','CODE_VOIE'],\
axis=1,inplace=True)
print('shape before selecting rows\n',compt_trafic_df.shape)
# take only the data where the traffic counter counts in all directions
compt_trafic_df = compt_trafic_df.loc[compt_trafic_df['PRESELECTION']=='Toutes_Directions']
print('shape after selecting only rows with "toutes directions"\n',compt_trafic_df.shape)
print ('#null values on the year of collection of the data:\n',compt_trafic_df['TJM_ANNEE'].isnull().sum())
print('value count on the year of collection of the data:\n',compt_trafic_df['TJM_ANNEE'].value_counts())
print('Thus we take only the data from 2015 since it is the year for which we have more data.\n')
# take only the data collected in 2015
compt_trafic_df = compt_trafic_df.loc[(compt_trafic_df['TJM_ANNEE']==2015.0) & (compt_trafic_df['TJOM_ANNEE']==2015.0)]
print('shape after selecting rows with TJOM_ANNEE = 2015\n',compt_trafic_df.shape)
# we can then drop the used columns
compt_trafic_df.drop(['PRESELECTION','TJM_ANNEE','TJOM_ANNEE'], axis=1,inplace=True)
compt_trafic_df.head()
shape before selecting rows
 (642, 11)
shape after selecting only rows with "toutes directions"
 (535, 11)
#null values on the year of collection of the data:
 141
value count on the year of collection of the data:
 2015.0    371
2012.0      23
Name: TJM_ANNEE, dtype: int64
Thus we take only the data from 2015 since it is the year for which we have more data.

shape after selecting rows with TJOM_ANNEE = 2015
 (371, 11)
NOM_VOIE | TJOM | TJM | DISPONIBILITE | DATEDT | NOM_POINT_DE_MESURE | E | N | |
---|---|---|---|---|---|---|---|---|
0 | Rue de Chêne-Bougeries | 8359.0 | 8360.0 | 2 mois par an | 2016-10-12 10:04:52 | Chêne-Bougeries / Montagne | 2503566.98 | 1116940.55 |
1 | Avenue de Thônex | 4539.0 | 3961.0 | 2 mois par an | 2016-10-12 10:04:52 | Thônex / Genève | 2504307.66 | 1116431.69 |
2 | Avenue de Thônex | 5076.0 | 4598.0 | 2 mois par an | 2016-10-12 10:04:52 | Thônex / Genève | 2504306.25 | 1116437.21 |
3 | Rue de Genève | 5320.0 | 5487.0 | 2 mois par an | 2016-10-12 10:04:52 | Genève / Thônex | 2504296.99 | 1116552.65 |
4 | Rue de Genève | 8330.0 | 8331.0 | 2 mois par an | 2016-10-12 10:04:51 | Genève / Thônex | 2504295.19 | 1116538.37 |
Here TJOM is the mean daily traffic over working days (Monday to Friday) and TJM is the mean daily traffic over all days of the year.
Now we see that we sometimes have multiple measurements per road. They probably correspond to different sections of the road. We make the assumption that we can always take the first result found, which usually corresponds to the lowest traffic rate (we lack more detailed information about these data).
With that assumption we drop the duplicates (a sketch of an alternative aggregation follows just below).
We save to a csv file.
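Before dropping the duplicates below, here is a hedged sketch of an alternative (not used in the rest of the notebook): averaging the duplicated measurement points of each road instead of keeping the first one.
# average TJM and TJOM over the duplicated measurement points of each road
averaged_traffic_df = compt_trafic_df.groupby('NOM_VOIE', as_index=False)[['TJM', 'TJOM']].mean()
averaged_traffic_df.head()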
compt_trafic_df.drop_duplicates(subset='NOM_VOIE', keep='first', inplace=True)
# save to csv
###compt_trafic_df.to_csv('trafic_wout_duplicates.csv', sep='\t', encoding='utf-8') # to run with care
compt_trafic_df.head()
NOM_VOIE | TJOM | TJM | DISPONIBILITE | DATEDT | NOM_POINT_DE_MESURE | E | N | |
---|---|---|---|---|---|---|---|---|
0 | Rue de Chêne-Bougeries | 8359.0 | 8360.0 | 2 mois par an | 2016-10-12 10:04:52 | Chêne-Bougeries / Montagne | 2503566.98 | 1116940.55 |
1 | Avenue de Thônex | 4539.0 | 3961.0 | 2 mois par an | 2016-10-12 10:04:52 | Thônex / Genève | 2504307.66 | 1116431.69 |
3 | Rue de Genève | 5320.0 | 5487.0 | 2 mois par an | 2016-10-12 10:04:52 | Genève / Thônex | 2504296.99 | 1116552.65 |
5 | Rue de Lausanne | 6748.0 | 6533.0 | 2 mois par an | 2016-10-12 10:04:51 | Lausanne / Dentand | 2500440.39 | 1119323.56 |
6 | Avenue Giuseppe- MOTTA | 5922.0 | 5299.0 | 365 jrs/an | 2016-10-12 10:04:51 | Motta / Pré-Cartelier\r\nr | 2499163.13 | 1119256.01 |
Load the dataset and print out a sample of the problem we have with the strings: there are road numbers in the accident dataset but not in the traffic dataset!
acc_with_names = pd.read_csv(MY_DATA_FOLDER + 'acc_with_names_clean.csv',index_col=0, sep='\t')
print('road names in the traffic dataset: ',compt_trafic_df.NOM_VOIE.values[0:10],'\n')
print('road names in the accident dataset: ', acc_with_names.road_name.values[0:10])
road names in the traffic dataset: ['Rue de Chêne-Bougeries' 'Avenue de Thônex' 'Rue de Genève' 'Rue de Lausanne' 'Avenue Giuseppe- MOTTA' 'Avenue du Mail' 'Boulevard Georges- FAVON' 'Rue des Terreaux-du-Temple' 'Quai Ernest- ANSERMET' 'Pont des Acacias'] road names in the accident dataset: ['Place du Lignon 6' 'Avenue Vibert' 'Chemin du Petit-Bel-Air 2' 'Rue des Boudines 2-10' 'Unnamed Road' 'Route des Morillons 4' 'Avenue de Mategnin 69-63' 'Place du Lignon 34' 'Avenue de France' "Boulevard du Pont-d'Arve 19-9"]
As we can see, the accident dataset also contains road numbers. We want to get rid of them for the road matching!
So we define the following functions:
def find_undesirable_numbers(s):
    tabu_list = [' 0',' 1',' 2',' 3',' 4',' 5',' 6',' 7',' 8',' 9']
    position = -1
    for x in tabu_list:
        position = s.find(x)
        if position != -1:
            return position
    return -1

def remove_undesirable_numbers(s):
    position = find_undesirable_numbers(s)
    if position != -1:
        s = s[:position]
    return s
# quick test
s = 'Rue des Boudines 2-10'
remove_undesirable_numbers(s)
'Rue des Boudines'
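As a side note, a more compact (hedged) alternative to the tabu-list approach would be a regular expression that strips everything from the first trailing number onwards; remove_trailing_numbers_regex below is a hypothetical helper, not used further in the notebook:
import re

def remove_trailing_numbers_regex(s):
    # cut the string at the first ' <digit>' occurrence, like remove_undesirable_numbers
    return re.sub(r'\s\d.*$', '', s)

remove_trailing_numbers_regex('Rue des Boudines 2-10')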
Apply the function to our dataset:
# apply functions on dataset
acc_with_names.road_name = acc_with_names.road_name.apply(remove_undesirable_numbers)
print('road names in the accident dataset: ', acc_with_names.road_name.values[1997:2001])
road names in the accident dataset: ['Rue de Montbrillant' 'Unnamed Road' 'Chemin des Corbillettes' 'Chemin Colladon']
And it seems that we have solved the problem.
# remove all the unnamed roads:
acc_with_names_no_unnamed = acc_with_names[acc_with_names['road_name'] != 'Unnamed Road']
print('before: ',acc_with_names.shape, 'after: ',acc_with_names_no_unnamed.shape)
print('Thus there were: ',acc_with_names.shape[0]-acc_with_names_no_unnamed.shape[0],' Unnamed Roads')
#And we set this new df as the acc_with_names dataframe:
acc_with_names = acc_with_names_no_unnamed
#As usual we saved in a csv file (this is in case other notebooks need the results)
###acc_with_names.to_csv('acc_with_names_no_unnamed_roads.csv', sep='\t', encoding='utf-8') # to run with care
before: (3387, 41) after: (3285, 41) Thus there were: 102 Unnamed Roads
We first need to clean up the strings a bit. Here is a sample:
print('road names in the traffic dataset: ',compt_trafic_df.NOM_VOIE.values[0:10],'\n')
road names in the traffic dataset: ['Rue de Chêne-Bougeries' 'Avenue de Thônex' 'Rue de Genève' 'Rue de Lausanne' 'Avenue Giuseppe- MOTTA' 'Avenue du Mail' 'Boulevard Georges- FAVON' 'Rue des Terreaux-du-Temple' 'Quai Ernest- ANSERMET' 'Pont des Acacias']
Thus we define the following functions. Since reformat_string() needs the table, we also define it in the same cell:
table = collections.defaultdict(lambda: None)
table.update({
ord('é'):'e',
ord('ô'):'o',
ord(' '):' ',
ord('-'):' ',
ord('_'):' ',
ord('\N{NO-BREAK SPACE}'): ' ',
ord('\N{EN SPACE}'): ' ',
ord('\N{EM SPACE}'): ' ',
ord('\N{THREE-PER-EM SPACE}'): ' ',
ord('\N{FOUR-PER-EM SPACE}'): ' ',
ord('\N{SIX-PER-EM SPACE}'): ' ',
ord('\N{FIGURE SPACE}'): ' ',
ord('\N{PUNCTUATION SPACE}'): ' ',
ord('\N{THIN SPACE}'): ' ',
ord('\N{HAIR SPACE}'): ' ',
ord('\N{ZERO WIDTH SPACE}'): ' ',
ord('\N{NARROW NO-BREAK SPACE}'): ' ',
ord('\N{MEDIUM MATHEMATICAL SPACE}'): ' ',
ord('\N{IDEOGRAPHIC SPACE}'): ' ',
ord('\N{IDEOGRAPHIC HALF FILL SPACE}'): ' ',
ord('\N{ZERO WIDTH NO-BREAK SPACE}'): ' ',
ord('\N{TAG SPACE}'): ' ',
})
table.update(dict(zip(map(ord,string.ascii_uppercase), string.ascii_lowercase)))
table.update(dict(zip(map(ord,string.ascii_lowercase), string.ascii_lowercase)))
table.update(dict(zip(map(ord,string.digits), string.digits)))
def remove_accents(s):
    unaccented_string = unidecode.unidecode(s)
    return unaccented_string

def reformat_string(s):
    s = str(s)
    s = remove_accents(s)
    s = s.lower()
    s = s.strip(',.-_')
    s = s.translate(table)
    return s
Apply it to our datasets:
compt_trafic_df.NOM_VOIE = compt_trafic_df.NOM_VOIE.apply(reformat_string)
acc_with_names.road_name = acc_with_names.road_name.apply(reformat_string)
Visualize a sample of the results for both our datasets:
print('road names in the traffic dataset: ',compt_trafic_df.NOM_VOIE.values[0:10],'\n')
print('road names in the accident dataset: ', acc_with_names.road_name.values[0:10])
road names in the traffic dataset: ['rue de chene bougeries' 'avenue de thonex' 'rue de geneve' 'rue de lausanne' 'avenue giuseppe motta' 'avenue du mail' 'boulevard georges favon' 'rue des terreaux du temple' 'quai ernest ansermet' 'pont des acacias'] road names in the accident dataset: ['place du lignon' 'avenue vibert' 'chemin du petit bel air' 'rue des boudines' 'route des morillons' 'avenue de mategnin' 'place du lignon' 'avenue de france' 'boulevard du pont darve' 'rue emma kammacher']
Write to csv files:
###acc_with_names.to_csv('string_cleaned_acc_with_names.csv', sep='\t', encoding='utf-8') # to run with care
###compt_trafic_df.to_csv('string_cleaned_compt_trafic_df.csv', sep='\t', encoding='utf-8') # to run with care
Load the desired datasets (if we want to load any):
Note that we don't load any; we will use the dataframes from the previous part, which is fine.
reload_something = False
Desire_string_cleaned = True
STRING_CLEANED_DIR = 'string_cleaned/'
if reload_something == True:
    if Desire_string_cleaned == True:
        acc_with_names = pd.read_csv(MY_DATA_FOLDER +\
            STRING_CLEANED_DIR + 'string_cleaned_acc_with_names.csv',index_col=0, sep='\t')
        compt_trafic_df = pd.read_csv(MY_DATA_FOLDER +\
            STRING_CLEANED_DIR + 'string_cleaned_compt_trafic_df.csv',index_col=0, sep='\t')
    else:
        acc_with_names = pd.read_csv(MY_DATA_FOLDER + 'acc_with_names_no_unnamed_roads.csv',index_col=0, sep='\t')
        compt_trafic_df = pd.read_csv(MY_DATA_FOLDER + 'trafic_wout_duplicates.csv',index_col=0, sep='\t')
# functions
def merging_by_name(df1, df2, col_df1, col_df2, prob_limit):
    # creating two lists with the names of the roads from the two datasets
    name_df1 = list(df1.loc[:,col_df1])
    name_df2 = list(df2.loc[:,col_df2])
    print(len(name_df1))
    print(df1.shape)
    print(df2.shape)
    # initializing a new column of df1 with the corresponding name found by the matching function
    # just to control that everything went smoothly
    df1['corresponding_name'] = 'unknown'
    df1['prob'] = 'unknown'
    # initialize empty list to collect lost roads
    non_found_roads = []
    # MATCHING FUNCTION
    for i, road_name1 in enumerate(name_df1):
        road_name2, prob = process.extractOne(road_name1, name_df2, scorer=fuzz.token_sort_ratio)
        if i%100 == 0:
            print(i, '||', end="") # to see the progression
        if prob >= prob_limit:
            df1.loc[(df1[col_df1] == road_name1), 'corresponding_name'] = road_name2
            df1.loc[(df1[col_df1] == road_name1), 'prob'] = prob
        else:
            non_found_roads.append(road_name1)
    # MERGING
    result_df = pd.merge(df1, df2, left_on='corresponding_name', right_on=col_df2, how='inner')
    return result_df, non_found_roads
We set the variables for the matching and we do it!
Note that we decided to require a 100% matching score: after a few tries we realized that with a lower threshold we would match different roads such as "rue de veyrier" and "route de veyrier" (yes, they are indeed different). But thanks to the string preprocessing (and to the coherence of the Google API with the SwissData datasets), the results are good, so we consider ourselves satisfied.
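As a quick, hedged illustration of why we require an exact score, one can print the fuzzy scores of the two roads mentioned above (we do not rely on the exact values, only on the fact that two different roads can still score high):
# two different roads that a looser threshold could wrongly merge
print(fuzz.token_sort_ratio('rue de veyrier', 'route de veyrier'))
print(fuzz.token_sort_ratio('rue de veyrier', 'rue de veyrier'))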
# setting variables and merging
df1 = acc_with_names.dropna().copy()
df2 = compt_trafic_df.copy()
col_df1 = 'road_name'
col_df2 = 'NOM_VOIE'
prob_limit = 100
merged, to_be_matched = merging_by_name(df1,df2,col_df1,col_df2,prob_limit)
3285 (3285, 41) (133, 8) 0 ||100 ||200 ||300 ||400 ||500 ||600 ||700 ||800 ||900 ||1000 ||1100 ||1200 ||1300 ||1400 ||1500 ||1600 ||1700 ||1800 ||1900 ||2000 ||2100 ||2200 ||2300 ||2400 ||2500 ||2600 ||2700 ||2800 ||2900 ||3000 ||3100 ||3200 ||
Visualize a subset of the results:
# visualize the results
print(merged.shape)
merged[['ID_ACCIDENT','COMMUNE','road_name','NOM_VOIE','TJOM']].head()
(1124, 51)
ID_ACCIDENT | COMMUNE | road_name | NOM_VOIE | TJOM | |
---|---|---|---|---|---|
0 | 1570896.0 | Genève | avenue de france | avenue de france | 7303.0 |
1 | 1615794.0 | Genève | avenue de france | avenue de france | 7303.0 |
2 | 1625025.0 | Genève | avenue de france | avenue de france | 7303.0 |
3 | 1676166.0 | Genève | avenue de france | avenue de france | 7303.0 |
4 | 1625712.0 | Genève | avenue de france | avenue de france | 7303.0 |
We see that we lost about two thirds of our data. This was to be expected, since traffic is not monitored on all roads.
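As a quick, hedged way to inspect what was lost, one can look at the to_be_matched list returned by merging_by_name (road names from the accident dataset that found no exact counterpart in the traffic dataset):
print(len(to_be_matched), 'accident rows could not be matched to a monitored road')
collections.Counter(to_be_matched).most_common(5) # the most frequent unmatched road names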
Write down in a csv file:
FINAL_FOLDER = 'final_csv/'
#merged.to_csv(FINAL_FOLDER + 'acc_with_traffic_2015#.csv', sep='\t', encoding='utf-8') # to run with care
In this section we attribute to each road a "risk" value, so that based on that value we can estimate how likely an accident is when driving on that road.
Load the data and drop the values that we do not use here for the risk calculation:
FINAL_FOLDER = 'final_csv/'
acc_traffic = pd.read_csv(FINAL_FOLDER + 'acc_with_traffic_2015.csv',index_col=0, sep='\t')
# define un-used columns
to_be_dropped1 = ['corresponding_name','DISPONIBILITE','DATEDT','NOM_POINT_DE_MESURE','E_y','N_y']
to_be_dropped2 = ['DATE_','ID_ACCIDENT','COMMUNE','CONDITIONS_LUMINEUSES','CONDITIONS_METEO','CONSEQUENCES']
to_be_dropped3 = ['ETAT_ROUTE','GENRE_ROUTE','HEURE','JOUR','LOCALITE','YEAR']
to_be_dropped4 = ['COOR_X','COOR_Y','GROUPE_ACCIDENT','index','N_x','N_y']
to_be_dropped5 = ['E_x','E_y','MONTH','DAY','CAUSE_g','CAUSE_sg']
to_be_dropped6 = ['prob','NOM_VOIE']
to_be_dropped = to_be_dropped1 + to_be_dropped2 + to_be_dropped3 + to_be_dropped4 + to_be_dropped5 +\
to_be_dropped6
# drop un-used columns
acc_traffic.drop(to_be_dropped,axis=1,inplace=True)
Visualize the results:
acc_traffic.head(3)
NB_ENFANTS_IMPLIQUES | NB_ENFANTS_ECOLE | NB_BLESSES_LEGERS | NB_BLESSES_GRAVES | NB_TUES | NB_PIETONS | NB_BICYCLETTES | NB_VAE_25 | NB_VAE_45 | NB_CYCLOMOTEURS | ... | NB_MOTOS_125 | NB_MOTOS_11KW | NB_VOITURES_TOURISME | NB_VOITURES_LIVRAISON | NB_CAMIONS | NB_BUS | NB_TRAM | road_name | TJOM | TJM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | avenue de france | 7303.0 | 6946.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | avenue de france | 7303.0 | 6946.0 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | avenue de france | 7303.0 | 6946.0 |
3 rows × 21 columns
Get the dataframe of the traffic per road:
traffic_per_road = acc_traffic.drop_duplicates(subset='road_name').copy()
traffic_per_road = traffic_per_road[['road_name','TJM','TJOM']]
traffic_per_road.set_index('road_name',inplace=True)
traffic_per_road.head(7)
TJM | TJOM | |
---|---|---|
road_name | ||
avenue de france | 6946.0 | 7303.0 |
route de vernier | 21459.0 | 23533.0 |
route de meyrin | 12589.0 | 13547.0 |
route de veyrier | 6015.0 | 6621.0 |
rue de chene bougeries | 8360.0 | 8359.0 |
quai wilson | 15954.0 | 16876.0 |
quai gustave ador | 16959.0 | 18238.0 |
Get the dataframe of number of accidents by road:
grouped = acc_traffic.groupby(by='road_name').count()
nb_acc_by_road = grouped[['NB_TUES']]
nb_acc_by_road.columns = ['NB_ACC']
nb_acc_by_road.head()
NB_ACC | |
---|---|
road_name | |
avenue de france | 12 |
avenue de la paix | 9 |
avenue de thonex | 12 |
avenue des communes reunies | 13 |
avenue du mail | 9 |
Get the dataframe of the number of implicated objects per road (if we can really consider a human being an object...):
grouped = acc_traffic.groupby(by='road_name', axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)
consequences_per_road = grouped['NB_ENFANTS_IMPLIQUES', 'NB_ENFANTS_ECOLE', 'NB_BLESSES_LEGERS',
'NB_BLESSES_GRAVES', 'NB_TUES', 'NB_PIETONS', 'NB_BICYCLETTES',
'NB_VAE_25', 'NB_VAE_45', 'NB_CYCLOMOTEURS', 'NB_MOTOS_50',
'NB_MOTOS_125', 'NB_MOTOS_11KW', 'NB_VOITURES_TOURISME',
'NB_VOITURES_LIVRAISON', 'NB_CAMIONS', 'NB_BUS', 'NB_TRAM'].sum()
#print(grouped)
# and visualize the results:
consequences_per_road.head()
NB_ENFANTS_IMPLIQUES | NB_ENFANTS_ECOLE | NB_BLESSES_LEGERS | NB_BLESSES_GRAVES | NB_TUES | NB_PIETONS | NB_BICYCLETTES | NB_VAE_25 | NB_VAE_45 | NB_CYCLOMOTEURS | NB_MOTOS_50 | NB_MOTOS_125 | NB_MOTOS_11KW | NB_VOITURES_TOURISME | NB_VOITURES_LIVRAISON | NB_CAMIONS | NB_BUS | NB_TRAM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
road_name | ||||||||||||||||||
avenue de france | 0.0 | 0.0 | 8.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 11.0 | 3.0 | 0.0 | 1.0 | 0.0 |
avenue de la paix | 0.0 | 0.0 | 4.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 10.0 | 0.0 | 0.0 | 1.0 | 0.0 |
avenue de thonex | 1.0 | 0.0 | 4.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 8.0 | 1.0 | 1.0 | 1.0 | 0.0 |
avenue des communes reunies | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 20.0 | 1.0 | 0.0 | 0.0 | 0.0 |
avenue du mail | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 8.0 | 1.0 | 0.0 | 1.0 | 0.0 |
Merge together to have the dataframe prepared for the risk calculation:
to_be_concat = [nb_acc_by_road,consequences_per_road,traffic_per_road]
stuff_per_road = pd.concat(to_be_concat, axis=1)
stuff_per_road.sort_values(by='NB_ACC',ascending=False).head()
NB_ACC | NB_ENFANTS_IMPLIQUES | NB_ENFANTS_ECOLE | NB_BLESSES_LEGERS | NB_BLESSES_GRAVES | NB_TUES | NB_PIETONS | NB_BICYCLETTES | NB_VAE_25 | NB_VAE_45 | ... | NB_MOTOS_50 | NB_MOTOS_125 | NB_MOTOS_11KW | NB_VOITURES_TOURISME | NB_VOITURES_LIVRAISON | NB_CAMIONS | NB_BUS | NB_TRAM | TJM | TJOM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
route de saint julien | 75 | 5.0 | 0.0 | 35.0 | 7.0 | 0.0 | 4.0 | 2.0 | 0.0 | 0.0 | ... | 3.0 | 6.0 | 16.0 | 98.0 | 9.0 | 2.0 | 1.0 | 0.0 | 13495.0 | 14582.0 |
route de meyrin | 65 | 7.0 | 0.0 | 34.0 | 5.0 | 0.0 | 7.0 | 5.0 | 0.0 | 0.0 | ... | 0.0 | 2.0 | 11.0 | 83.0 | 5.0 | 2.0 | 4.0 | 4.0 | 12589.0 | 13547.0 |
quai gustave ador | 49 | 1.0 | 0.0 | 31.0 | 3.0 | 0.0 | 9.0 | 7.0 | 0.0 | 0.0 | ... | 0.0 | 4.0 | 18.0 | 47.0 | 0.0 | 1.0 | 2.0 | 0.0 | 16959.0 | 18238.0 |
route de vernier | 43 | 2.0 | 0.0 | 17.0 | 7.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | ... | 1.0 | 2.0 | 8.0 | 55.0 | 4.0 | 5.0 | 1.0 | 0.0 | 21459.0 | 23533.0 |
route de malagnou | 40 | 1.0 | 0.0 | 14.0 | 9.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | ... | 2.0 | 2.0 | 12.0 | 43.0 | 2.0 | 0.0 | 0.0 | 0.0 | 15308.0 | 16358.0 |
5 rows × 21 columns
We are now ready to compute the risk.
Functions for computing the risk and normalizing:
We chose 'TJOM' as the traffic measure, i.e. the mean traffic on working days (Monday to Friday), since it appears to us more meaningful and of greater interest for the reader.
# function to compute the risk for a given feature
def compute_risk_by_road(df, traffic_col='TJOM', risk_col='NB_ACC'):
    def compute_risk(x): # applied function to compute the risk
        if x[traffic_col] == 0:
            return np.nan
        return x[risk_col]/x[traffic_col]
    return df.apply(compute_risk, axis=1)

# function to compute the risk dataframe
def compute_risk_df(original_df, list_of_features, traffic_col):
    df = original_df.copy()
    col_to_drop = df.columns
    for i in range(len(list_of_features)):
        df[list_of_features[i]+' / traffic'] = compute_risk_by_road(df, traffic_col, risk_col=list_of_features[i])
    df.drop(col_to_drop, axis=1, inplace=True)
    return df
We then measure the road risk of accident as : $$ \frac{nb\_accident\ of\ desired\ category}{traffic} $$
for each desired feature and regroup in a single dataframe.
# list of features for which we want to compute the risk
list_of_features = ['NB_ACC', 'NB_ENFANTS_IMPLIQUES', 'NB_ENFANTS_ECOLE',
'NB_BLESSES_LEGERS', 'NB_BLESSES_GRAVES', 'NB_TUES', 'NB_PIETONS',
'NB_BICYCLETTES', 'NB_VAE_25', 'NB_VAE_45', 'NB_CYCLOMOTEURS',
'NB_MOTOS_50', 'NB_MOTOS_125', 'NB_MOTOS_11KW', 'NB_VOITURES_TOURISME',
'NB_VOITURES_LIVRAISON', 'NB_CAMIONS', 'NB_BUS', 'NB_TRAM']
# create risk df
risk_df = compute_risk_df(stuff_per_road,list_of_features,traffic_col='TJOM')
risk_df.head(3)
NB_ACC / traffic | NB_ENFANTS_IMPLIQUES / traffic | NB_ENFANTS_ECOLE / traffic | NB_BLESSES_LEGERS / traffic | NB_BLESSES_GRAVES / traffic | NB_TUES / traffic | NB_PIETONS / traffic | NB_BICYCLETTES / traffic | NB_VAE_25 / traffic | NB_VAE_45 / traffic | NB_CYCLOMOTEURS / traffic | NB_MOTOS_50 / traffic | NB_MOTOS_125 / traffic | NB_MOTOS_11KW / traffic | NB_VOITURES_TOURISME / traffic | NB_VOITURES_LIVRAISON / traffic | NB_CAMIONS / traffic | NB_BUS / traffic | NB_TRAM / traffic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
avenue de france | 0.001643 | 0.00000 | 0.0 | 0.001095 | 0.000137 | 0.0 | 0.000137 | 0.000137 | 0.000137 | 0.0 | 0.0 | 0.00000 | 0.000137 | 0.000411 | 0.001506 | 0.000411 | 0.00000 | 0.000137 | 0.0 |
avenue de la paix | 0.003159 | 0.00000 | 0.0 | 0.001404 | 0.000351 | 0.0 | 0.000000 | 0.001053 | 0.000000 | 0.0 | 0.0 | 0.00000 | 0.000000 | 0.000702 | 0.003510 | 0.000000 | 0.00000 | 0.000351 | 0.0 |
avenue de thonex | 0.002644 | 0.00022 | 0.0 | 0.000881 | 0.000220 | 0.0 | 0.000220 | 0.000220 | 0.000000 | 0.0 | 0.0 | 0.00022 | 0.000000 | 0.000661 | 0.001763 | 0.000220 | 0.00022 | 0.000220 | 0.0 |
We see that the information becomes more meaningful the more accidents have happened in the corresponding category. Thus we can regroup the data into broader categories to get more meaningful results.
We can start by visualizing the categories:
stuff_per_road[stuff_per_road.columns].sum()
NB_ACC                     1124.0
NB_ENFANTS_IMPLIQUES         43.0
NB_ENFANTS_ECOLE              2.0
NB_BLESSES_LEGERS           473.0
NB_BLESSES_GRAVES           129.0
NB_TUES                       4.0
NB_PIETONS                   87.0
NB_BICYCLETTES               90.0
NB_VAE_25                     3.0
NB_VAE_45                     2.0
NB_CYCLOMOTEURS               1.0
NB_MOTOS_50                  22.0
NB_MOTOS_125                 55.0
NB_MOTOS_11KW               260.0
NB_VOITURES_TOURISME       1298.0
NB_VOITURES_LIVRAISON        91.0
NB_CAMIONS                   31.0
NB_BUS                       36.0
NB_TRAM                      14.0
TJM                      728685.0
TJOM                     783915.0
dtype: float64
We thus decide to group into broader categories (see the grouping functions below).
We keep a few categories as they were (e.g. NB_ACC, NB_PIETONS and NB_VOITURES_TOURISME),
and we drop the original fine-grained categories once they have been grouped.
def group_motos(x):
    return x['NB_CYCLOMOTEURS']+x['NB_MOTOS_50']+x['NB_MOTOS_125']+x['NB_ENFANTS_IMPLIQUES']

def group_velos(x):
    return x['NB_BICYCLETTES']+x['NB_VAE_25']+x['NB_VAE_45']

def group_physical_damages(x):
    return x['NB_BLESSES_LEGERS']+x['NB_BLESSES_GRAVES']+x['NB_TUES']

def group_heavy_physical_damages(x):
    return x['NB_BLESSES_GRAVES']+x['NB_TUES']

def group_public_transportation(x):
    return x['NB_BUS']+x['NB_TRAM']

def group_drived_by_employees(x):
    return x['NB_VOITURES_LIVRAISON']+x['NB_CAMIONS']+x['NB_BUS']+x['NB_TRAM']

# function to group the categories in the dataframe
def group_categories(original_df):
    df = original_df.copy()
    col_to_drop = ['NB_ENFANTS_IMPLIQUES','NB_ENFANTS_ECOLE',
                   'NB_BLESSES_LEGERS', 'NB_BLESSES_GRAVES', 'NB_TUES',
                   'NB_BICYCLETTES', 'NB_VAE_25', 'NB_VAE_45', 'NB_CYCLOMOTEURS',
                   'NB_MOTOS_50', 'NB_MOTOS_125', 'NB_MOTOS_11KW',
                   'NB_VOITURES_LIVRAISON', 'NB_CAMIONS', 'NB_BUS', 'NB_TRAM']
    df['MOTOS'] = df.apply(group_motos, axis=1)
    df['BIKES'] = df.apply(group_velos, axis=1)
    df['INJURY'] = df.apply(group_physical_damages, axis=1)
    df['GRAVE_INJURY'] = df.apply(group_heavy_physical_damages, axis=1)
    df['PUBLIC_TRANSP'] = df.apply(group_public_transportation, axis=1)
    df['DRIVEN_BY_EMPLOYEES'] = df.apply(group_drived_by_employees, axis=1)
    df.drop(col_to_drop, axis=1, inplace=True)
    return df
# group the categories
grouped_categories_df = group_categories(stuff_per_road)
grouped_categories_df.head(4)
NB_ACC | NB_PIETONS | NB_VOITURES_TOURISME | TJM | TJOM | MOTOS | BIKES | INJURY | GRAVE_INJURY | PUBLIC_TRANSP | DRIVEN_BY_EMPLOYEES | |
---|---|---|---|---|---|---|---|---|---|---|---|
avenue de france | 12 | 1.0 | 11.0 | 6946.0 | 7303.0 | 1.0 | 2.0 | 9.0 | 1.0 | 1.0 | 4.0 |
avenue de la paix | 9 | 0.0 | 10.0 | 2568.0 | 2849.0 | 0.0 | 3.0 | 5.0 | 1.0 | 1.0 | 1.0 |
avenue de thonex | 12 | 1.0 | 8.0 | 3961.0 | 4539.0 | 2.0 | 1.0 | 5.0 | 1.0 | 1.0 | 3.0 |
avenue des communes reunies | 13 | 0.0 | 20.0 | 5465.0 | 5917.0 | 1.0 | 2.0 | 4.0 | 0.0 | 0.0 | 1.0 |
Compute the risk:
# define the list of features for which we want to compute the risk
list_of_features = ['NB_ACC', 'NB_PIETONS', 'NB_VOITURES_TOURISME', 'MOTOS',
'BIKES', 'INJURY', 'GRAVE_INJURY', 'PUBLIC_TRANSP',
'DRIVEN_BY_EMPLOYEES']
# compute the risk for the above list of features
grouped_risk_df = compute_risk_df(grouped_categories_df,list_of_features,traffic_col='TJOM')
grouped_risk_df.head()
NB_ACC / traffic | NB_PIETONS / traffic | NB_VOITURES_TOURISME / traffic | MOTOS / traffic | BIKES / traffic | INJURY / traffic | GRAVE_INJURY / traffic | PUBLIC_TRANSP / traffic | DRIVEN_BY_EMPLOYEES / traffic | |
---|---|---|---|---|---|---|---|---|---|
avenue de france | 0.001643 | 0.000137 | 0.001506 | 0.000137 | 0.000274 | 0.001232 | 0.000137 | 0.000137 | 0.000548 |
avenue de la paix | 0.003159 | 0.000000 | 0.003510 | 0.000000 | 0.001053 | 0.001755 | 0.000351 | 0.000351 | 0.000351 |
avenue de thonex | 0.002644 | 0.000220 | 0.001763 | 0.000441 | 0.000220 | 0.001102 | 0.000220 | 0.000220 | 0.000661 |
avenue des communes reunies | 0.002197 | 0.000000 | 0.003380 | 0.000169 | 0.000338 | 0.000676 | 0.000000 | 0.000000 | 0.000169 |
avenue du mail | 0.000726 | 0.000000 | 0.000645 | 0.000081 | 0.000081 | 0.000484 | 0.000000 | 0.000081 | 0.000161 |
We can see in this dataframe that the values are more varied than before. For example, the third column represents the probability of having an accident on a given road, given that we were driving a passenger car ("voiture de tourisme").
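As a quick, hedged illustration of how this table can already be used, one can look up the road with the highest risk in each of these categories:
# road with the maximal risk for each risk column
grouped_risk_df.idxmax()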
We rescale the risk to a scale that should be more readable for the broad public (or at least for EPFL students): it goes from 1 for low risk to 6 for high risk.
The rescaling formula is:
$$ risk_{scaled} = \frac{x - \min}{\max - \min} \cdot 5 + 1 $$
where the min and the max are taken over the whole column.
The idea would be to later have a "risk coefficient", like the one used for avalanche danger in the mountains.
We define the normalizing functions:
def shorten_name(s):
    position = s.find('/')
    if position == -1:
        return s
    return s[:position-1]

def scale_risk(original_df):
    df = original_df.copy()
    col_to_drop = df.columns
    def normalise_1_6(x): # applied function to compute the scaled risk
        return (x-min_)/(max_-min_)*5+1
    for x in df:
        max_ = df[x].max()
        min_ = df[x].min()
        if max_ == min_: # just a quick check to avoid numerical errors
            print('ERROR: a column has constant risk')
            return df
        df['Risk_' + shorten_name(x)] = df[x].apply(normalise_1_6)
    df.drop(col_to_drop, axis=1, inplace=True)
    return df
Apply the above function for normalisation:
scaled_risk_df = scale_risk(grouped_risk_df)
scaled_risk_df.head()
Risk_NB_ACC | Risk_NB_PIETONS | Risk_NB_VOITURES_TOURISME | Risk_MOTOS | Risk_BIKES | Risk_INJURY | Risk_GRAVE_INJURY | Risk_PUBLIC_TRANSP | Risk_DRIVEN_BY_EMPLOYEES | |
---|---|---|---|---|---|---|---|---|---|
avenue de france | 1.677023 | 1.659233 | 1.775024 | 1.713112 | 1.516683 | 1.775024 | 1.155005 | 1.910585 | 3.473321 |
avenue de la paix | 2.342910 | 1.000000 | 2.806056 | 1.000000 | 2.986662 | 2.103701 | 1.397332 | 3.334152 | 2.585001 |
avenue de thonex | 2.116569 | 2.060669 | 1.906888 | 3.294716 | 1.415657 | 1.692761 | 1.249394 | 2.465080 | 3.984578 |
avenue des communes reunies | 1.920342 | 1.000000 | 2.739211 | 1.880152 | 1.637711 | 1.425140 | 1.000000 | 1.000000 | 1.763168 |
avenue du mail | 1.274094 | 1.000000 | 1.332018 | 1.420056 | 1.152175 | 1.304350 | 1.000000 | 1.536377 | 1.728451 |
Note that the risk is scaled relative to each column, so the values are not comparable across columns. We can later use this data to color a map or for other desired features.
# write that down in a csv file:
#scaled_risk_df.to_csv(FINAL_FOLDER + 'risk.csv', sep='\t', encoding='utf-8') # to run with care
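As a hedged sketch of how this could later be used to color a map, one could map the 1-6 scale to hex colors with matplotlib (the colormap choice here is arbitrary):
import matplotlib.cm as cm
import matplotlib.colors as mcolors

norm = mcolors.Normalize(vmin=1, vmax=6)
cmap = cm.get_cmap('RdYlGn_r') # low risk -> green, high risk -> red

def risk_to_hex(risk_value):
    return mcolors.to_hex(cmap(norm(risk_value)))

scaled_risk_df['Risk_NB_ACC'].apply(risk_to_hex).head()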
We can then write to a json file to later use in the website:
risk_car = pd.DataFrame(scaled_risk_df['Risk_NB_VOITURES_TOURISME'])
json_string = risk_car.to_json()
#with open('road_risk.json', 'w') as f:
# f.write(json_string)
And output some statistics about the dataframe:
print ('Mean risk: ',risk_car['Risk_NB_VOITURES_TOURISME'].mean())
print ('Quantiles risk: \n',risk_car['Risk_NB_VOITURES_TOURISME'].quantile([.25, .5, .75]))
print ('Riskiest roads: ',risk_car['Risk_NB_VOITURES_TOURISME'][risk_car['Risk_NB_VOITURES_TOURISME']>4])
Mean risk:  1.8934052984614589
Quantiles risk:
 0.25    1.277073
0.50    1.670775
0.75    2.185445
Name: Risk_NB_VOITURES_TOURISME, dtype: float64
Riskiest roads:  route de chancy          4.986709
route de chene           4.227310
route de meyrin          4.152526
route de saint julien    4.458062
route du mandement       6.000000
Name: Risk_NB_VOITURES_TOURISME, dtype: float64
Code written for milestone 3.
The goal is to link a searched path to a risk.
We would like to find the risk of going from a point A to a point B in Geneva. This part of the notebook implements this idea!
# import supplementary libraries that we may need
# for string processing
import re
%matplotlib inline
# GO and fetch the google API keys from a local file
CENTRAL_FOLDER = ''
API_keys = pd.read_csv(CENTRAL_FOLDER + 'API_google_key',index_col=0,header=None, sep='\=', engine='python')
############# Google API KEYS ##################
nearest_roads_API_KEY = API_keys.iloc[0].values[0] # accepts up to 100 points
placesID_API_KEY = API_keys.iloc[1].values[0]
API_KEY = API_keys.iloc[2].values[0] # google directions
# load risk df:
FINAL_FOLDER = CENTRAL_FOLDER+'final_csv/'
risk_df = pd.read_csv(FINAL_FOLDER + 'risk.csv',index_col=0, sep='\t')
risk_df.head(3)
Risk_NB_ACC | Risk_NB_PIETONS | Risk_NB_VOITURES_TOURISME | Risk_MOTOS | Risk_BIKES | Risk_INJURY | Risk_GRAVE_INJURY | Risk_PUBLIC_TRANSP | Risk_DRIVEN_BY_EMPLOYEES | |
---|---|---|---|---|---|---|---|---|---|
avenue de france | 1.677023 | 1.659233 | 1.775024 | 1.713112 | 1.516683 | 1.775024 | 1.155005 | 1.910585 | 3.473321 |
avenue de la paix | 2.342910 | 1.000000 | 2.806056 | 1.000000 | 2.986662 | 2.103701 | 1.397332 | 3.334152 | 2.585001 |
avenue de thonex | 2.116569 | 2.060669 | 1.906888 | 3.294716 | 1.415657 | 1.692761 | 1.249394 | 2.465080 | 3.984578 |
# function to make requests
def request_get_json(url):
    """make a request to the specified url"""
    r = requests.get(url)
    if r.status_code != 200:
        print(r.status_code)
    page_body = r.text
    return json.loads(page_body)
# function that defines a table that we later use for string processing
def define_table():
    """Return a table for string processing"""
    table = collections.defaultdict(lambda: None)
    table.update({
        ord('é'): 'e',
        ord('ô'): 'o',
        ord(' '): ' ',
        ord('-'): ' ',
        ord('_'): ' ',
        ord('\N{NO-BREAK SPACE}'): ' ',
        ord('\N{EN SPACE}'): ' ',
        ord('\N{EM SPACE}'): ' ',
        ord('\N{THREE-PER-EM SPACE}'): ' ',
        ord('\N{FOUR-PER-EM SPACE}'): ' ',
        ord('\N{SIX-PER-EM SPACE}'): ' ',
        ord('\N{FIGURE SPACE}'): ' ',
        ord('\N{PUNCTUATION SPACE}'): ' ',
        ord('\N{THIN SPACE}'): ' ',
        ord('\N{HAIR SPACE}'): ' ',
        ord('\N{ZERO WIDTH SPACE}'): ' ',
        ord('\N{NARROW NO-BREAK SPACE}'): ' ',
        ord('\N{MEDIUM MATHEMATICAL SPACE}'): ' ',
        ord('\N{IDEOGRAPHIC SPACE}'): ' ',
        ord('\N{IDEOGRAPHIC HALF FILL SPACE}'): ' ',
        ord('\N{ZERO WIDTH NO-BREAK SPACE}'): ' ',
        ord('\N{TAG SPACE}'): ' ',
    })
    table.update(dict(zip(map(ord, string.ascii_uppercase), string.ascii_lowercase)))
    table.update(dict(zip(map(ord, string.ascii_lowercase), string.ascii_lowercase)))
    table.update(dict(zip(map(ord, string.digits), string.digits)))
    return table
# function to remove accents
def remove_accents(s):
    """remove accents of a string s"""
    unaccented_string = unidecode.unidecode(s)
    return unaccented_string

# function that reformats a string, flattening it: remove accents, lower case and remove undesired characters
def reformat_string(s):
    """reformat a string s stripping the strange characters"""
    s = str(s)
    s = remove_accents(s)
    s = s.lower()
    s = s.strip(',.-_')
    table = define_table()
    s = s.translate(table)
    return s
# this function is NOT USED!!
def find_road_match(road_name, prob_limit, df):
    """Find a match between the searched road_name and the ones in the dataframe df"""
    name_df = list(df.index) # the road names are the index of the risk dataframe
    road_name_found, prob = process.extractOne(road_name, name_df, scorer=fuzz.token_sort_ratio)
    if prob >= prob_limit:
        return road_name_found
    else:
        print('road not found')
        return '?'
# In the following function the name matching is again (as for the risk) exact once we
# have "flattened" the strings, since we have seen that very similar road names can
# denote two different roads (e.g. replacing "chemin" with "route" already gives mismatches)
def get_road_risk(road_name, df, risk_type_col='Risk_NB_VOITURES_TOURISME'):
    """Given a road name, search for its equivalent in the dataframe df
    and return it with the risk associated to the specified risk column."""
    road_name = reformat_string(road_name)
    if road_name in df.index:
        risk = df.loc[[road_name]][risk_type_col][0]
    else:
        risk = 0
    return [road_name, risk]
# see below for an example of use
def find_streets_in_indications(s):
    """Extract the names of the roads from a typical indication string"""
    opening = [m.start() for m in re.finditer('<b>', s)]
    closing = [m.start() for m in re.finditer('</b>', s)]
    streets = []
    for i in range(len(opening)):
        indication = s[opening[i]+3 : closing[i]]
        if indication[0].isupper(): # road names begin with an upper-case letter, directions do not
            streets.append(indication)
    return streets
Example of use:
origin = 'Chemin+terroux'
destination= 'Quai+du+Mont-Blanc+1201+Genève'
get_multiples_routes = 'true'
url = 'https://maps.googleapis.com/maps/api/directions/json?'+\
'origin=' + origin +\
'&destination='+ destination +\
'&alternatives=' + get_multiples_routes +\
'&key='+ API_KEY
my_json = request_get_json(url)
tmp2 = []
for i in range(len(my_json['routes'][0]['legs'][0]['steps'])):
    tmp = my_json['routes'][0]['legs'][0]['steps'][i]['html_instructions']
    tmp2.append(tmp)
    print(tmp)
print('\n ',len(my_json['routes']))
print('\nNow with the function:')
find_streets_in_indications(tmp2[0])
Head <b>northeast</b> on <b>Chemin Terroux</b> toward <b>Chemin Bonvent</b> Continue onto <b>Chemin Edouard-Sarasin</b> Turn <b>right</b> onto <b>Route de Ferney</b> Continue onto <b>Avenue de France</b> Keep <b>left</b> to stay on <b>Avenue de France</b> Continue onto <b>Quai Wilson</b> Continue onto <b>Quai Wald-Wilson</b> Continue onto <b>Quai du Mont-Blanc</b> 3 Now with the function:
['Chemin Terroux', 'Chemin Bonvent']
def find_routes_from_origin_to_destination(origin, destination):
    """Given the origin and destination, the function returns all the routes
    that Google API tells us to take in order to reach destination"""
    # url of the request
    url = 'https://maps.googleapis.com/maps/api/directions/json?'+\
          'origin=' + origin +\
          '&destination='+ destination +\
          '&alternatives=' + get_multiples_routes +\
          '&key='+ API_KEY
    my_json = request_get_json(url)
    num_routes = len(my_json['routes'])
    print('Number of routes found: ', num_routes)
    if num_routes < 1:
        return 0
    routes = []
    for k in range(0, num_routes):
        indications = []
        for i in range(len(my_json['routes'][k]['legs'][0]['steps'])):
            indication = my_json['routes'][k]['legs'][0]['steps'][i]['html_instructions']
            indication = find_streets_in_indications(indication)
            if len(indication) > 1: # if multiple road names in indication, extract them one by one
                for x in indication:
                    indications.append(x)
            else:
                indications.append(indication[0])
        indications = list(set(indications)) # to remove duplications
        routes.append(indications)
    return routes
Example of use:
origin = 'Chemin+terroux'
destination= 'Quai+du+Mont-Blanc+1201+Genève'
my_routes = find_routes_from_origin_to_destination(origin,destination)
print('Google API found ',len(my_routes),' different routes possibles')
print('Here is the look of the first:\n',my_routes[0])
Number of routes found: 3 Google API found 3 different routes possibles Here is the look of the first: ['Chemin Bonvent', 'Quai Wilson', 'Avenue de France', 'Quai du Mont-Blanc', 'Quai Wald-Wilson', 'Chemin Terroux', 'Route de Ferney', 'Chemin Edouard-Sarasin']
Here we have assumed that Google will list all the roads taken. From some trials this looks to be the case, except for excessively small roads, so all the roads that a user would have to take are listed!
def get_path_risk(origin, destination, risk_df, risk_col='Risk_NB_VOITURES_TOURISME'):
    """
    Compute the risk of an itinerary
    Inputs:
    :origin: Starting point of our itinerary, as it should be given to the Google Maps API.
             One can find it by looking at the url after having made a request to Google Maps,
             or one can just write key words separated by '+' and it should find the correct place.
    :destination: Destination point of our itinerary, as it should be given to the Google Maps API
    :risk_df: The dataframe containing the risk per road
    :risk_col: Column with respect to which we want to compute the risk. By default it is the one for
               'Voitures de tourisme' since it is the one that makes the most sense.
    Output: Dataframe containing the risk for the itinerary and several other pieces of information
    """
    d = {'Risk': [], 'mean risk': [], 'zeros': [], 'all routes': [], 'all risks': [],\
         'number of roads': []}
    out = pd.DataFrame(data=d)
    list_of_trajectories = find_routes_from_origin_to_destination(origin=origin, destination=destination)
    for i, traj in enumerate(list_of_trajectories):
        routes = []
        risks = []
        count = 0
        for x in traj:
            route_risk = get_road_risk(road_name=x, df=risk_df, risk_type_col=risk_col)
            routes.append(route_risk[0])
            risks.append(route_risk[1])
            if route_risk[1] == 0:
                count += 1
        filtered_risks = list(filter(lambda a: a != 0, risks))
        out = out.append({'Risk': sum(risks)+count, 'mean risk': np.mean(filtered_risks), 'zeros': count,\
                          'all routes': routes, 'all risks': risks, 'number of roads': len(routes)}, ignore_index=True)
    cols = ['Risk', 'mean risk', 'number of roads', 'zeros', 'all routes', 'all risks']
    out = out[cols]
    return out
Example of use:
origin = 'Vermont,+1202+Geneva'
destination= 'Quai+du+Mont-Blanc+1201+Genève'
get_path_risk(origin,destination,risk_df=risk_df,risk_col='Risk_NB_VOITURES_TOURISME')#['all routes'][0]
Number of routes found: 3
Risk | mean risk | number of roads | zeros | all routes | all risks | |
---|---|---|---|---|---|---|
0 | 10.948168 | 1.982723 | 8.0 | 5.0 | [chemin de vincy, avenue de france, quai du mo... | [0, 1.77502396276, 2.5023691597, 0, 0, 0, 0, 1... |
1 | 8.502369 | 2.502369 | 7.0 | 6.0 | [chemin de vincy, quai du mont blanc, rue du v... | [0, 2.5023691597, 0, 0, 0, 0, 0] |
2 | 15.948168 | 1.982723 | 13.0 | 10.0 | [chemin de vincy, quai wilson, avenue de franc... | [0, 1.67077506518, 1.77502396276, 0, 0, 2.5023... |
Notes about this function: it uses the risk computed previously which, as discussed, is already very uncertain! Moreover, as can be seen in the table, only a small fraction of the roads has an equivalent in the risk table (there is probably no traffic detector on those roads), and here we did not correct for that bias. We should also note how we have dealt with missing values:
How we could work around this problem:
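One hedged sketch of such a workaround, assuming we keep the convention that get_road_risk returns 0 for unknown roads: impute the mean risk of the known roads of the itinerary (or, failing that, the global mean of the risk table) instead of the fixed contribution of 1 per unknown road. The helper impute_missing_risks is hypothetical and not used above.
def impute_missing_risks(risks, risk_df, risk_col='Risk_NB_VOITURES_TOURISME'):
    """Replace the 0 placeholders of unknown roads with a mean-based estimate."""
    known = [r for r in risks if r != 0]
    fallback = np.mean(known) if known else risk_df[risk_col].mean()
    return [r if r != 0 else fallback for r in risks]

# example on a hypothetical list of per-road risks of one route
impute_missing_risks([0, 1.78, 2.50, 0], risk_df)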
We managed to complete what we had in mind since milestone 2 for this part! Further improvements would be, e.g.:
This was an initial idea to get the speed limit and have more data on which to do the analysis.
# not working because we need premium features... We sent an email to Google; it would be super nice if they granted us permission :D
# UPDATE: we did not get the permission. Can we find an alternative way?
URL_speed_limit = 'https://roads.googleapis.com/v1/speedLimits?'+\
    'placeId=ChIJX12duJAwGQ0Ra0d4Oi4jOGE&placeId=ChIJLQcticc0GQ0RoiNZJVa5GxU&placeId=ChIJJ4vQRudkJA0RpednU70A-5M&key=' + nearest_roads_API_KEY
r3 = requests.get(URL_speed_limit)
print('Response status code: {0}\n'.format(r3.status_code))
page_body3 = r3.text
rank_json3 = json.loads(page_body3)
rank_json3
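A possible alternative (only a hedged sketch, not used in our analysis) would be to query OpenStreetMap's Overpass API for the "maxspeed" tags of the ways around a coordinate; the endpoint and query syntax below are assumptions to be checked against the Overpass documentation, and OSM coverage of speed limits in Geneva is not guaranteed:
# query OSM ways with a maxspeed tag within 50 m of a point in Geneva (assumed query syntax)
overpass_url = 'https://overpass-api.de/api/interpreter'
overpass_query = '[out:json];way["maxspeed"](around:50,46.2044,6.1432);out tags;'
r4 = requests.get(overpass_url, params={'data': overpass_query})
speed_limits = [el['tags'].get('maxspeed') for el in r4.json().get('elements', [])]
speed_limits[:10]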