Travel time prediction in Indian metro cities using Uber Movement data and OpenStreetMap

Uber provides anonymized and aggregated travel time data for many cities across the world through the Uber Movement platform. For India, current and historic data is available for 5 cities - Bangalore, Hyderabad, New Delhi, Mumbai, and Kolkata. The platform also provides ward boundaries in the form of a GeoJSON file.

OpenStreetMap (OSM) is a free, editable map of the whole world, built largely from scratch by volunteers and released under an open-content license. OSM data includes a global navigable street network dataset. Several services exist that provide routing and network analysis on top of this data.

In this project, we use the open travel time dataset from Uber and leverage open-source routing services for OpenStreetMap to build a fairly accurate travel time model for each of these metro cities in India. We show that, using the rich ecosystem of Python geospatial libraries, we can easily consume, process, and visualize large amounts of geospatial data and incorporate it into a machine learning model.

Open datasets

  • Uber Movement - Travel times and ward boundaries
  • OpenStreetMap

Python libraries

  • geopandas
  • shapely
  • matplotlib
  • folium
  • scikit-learn

Services

  • Open Source Routing Machine (OSRM)
  • OpenRouteService (ORS) API
In [16]:
import pandas as pd
import geopandas as gpd
import numpy as np
import requests
import shapely
import matplotlib.pyplot as plt
import datetime
import os
import math
import random
import folium
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

%matplotlib inline

Reading Datasets

Travel Times

In [3]:
data_folder = os.path.join('data', 'uber')

The Uber Movement Travel Times data comes as a CSV file for each quarter. Here we are using the Travel Times By Date By Hour Buckets (All Days) dataset. It includes the arithmetic mean, geometric mean, and standard deviations of aggregated travel times between every pair of wards in the city, for every day of the quarter, grouped into time-of-day buckets. This is a large dataset with over 7M rows.

We import the data as a Pandas DataFrame and call convert_dtypes() to select the best datatypes for each column.

In [4]:
travel_times_file = 'bangalore-wards-2020-1-All-DatesByHourBucketsAggregate.csv'
travel_times_filepath = os.path.join(data_folder, travel_times_file)
travel_times = pd.read_csv(travel_times_filepath)
travel_times = travel_times.convert_dtypes()
In [39]:
travel_times.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7640967 entries, 0 to 7640966
Data columns (total 17 columns):
 #   Column                                    Dtype  
---  ------                                    -----  
 0   sourceid                                  Int64  
 1   dstid                                     Int64  
 2   month                                     Int64  
 3   day                                       Int64  
 4   start_hour                                Int64  
 5   end_hour                                  Int64  
 6   mean_travel_time                          Float64
 7   standard_deviation_travel_time            Float64
 8   geometric_mean_travel_time                Float64
 9   geometric_standard_deviation_travel_time  Float64
 10  time_period                               int64  
 11  travel_time                               Float64
 12  src_lon                                   float64
 13  src_lat                                   float64
 14  dst_lon                                   float64
 15  dst_lat                                   float64
 16  distance                                  float64
dtypes: Float64(5), Int64(6), float64(5), int64(1)
memory usage: 1.1 GB

Ward Boundaries

The travel times dataset contains details of travel between zones. For Indian cities, the zones are wards as defined by the local municipal corporation. This data comes as a GeoJSON file containing the polygon representation of each ward. We use geopandas to read the file as a GeoDataFrame.

In [6]:
wards_file = 'bangalore_wards.json'
wards_filepath = os.path.join(data_folder, wards_file)
wards = gpd.read_file(wards_filepath)
In [7]:
wards
Out[7]:
WARD_NO WARD_NAME MOVEMENT_ID DISPLAY_NAME geometry
0 2 Chowdeswari Ward 1 Unnamed Road, Bengaluru MULTIPOLYGON (((77.59229 13.09720, 77.59094 13...
1 3 Atturu 2 9th Cross Bhel Layout, Adityanagar, Vidyaranya... MULTIPOLYGON (((77.56862 13.12705, 77.57064 13...
2 4 Yelahanka Satellite Town 3 15th A Cross Road, Yelahanka Satellite Town, Y... MULTIPOLYGON (((77.59094 13.09842, 77.59229 13...
3 51 Vijnanapura 4 SP Naidu Layout 4th Cross Street, SP Naidu Lay... MULTIPOLYGON (((77.67683 13.01147, 77.67695 13...
4 53 Basavanapura 5 Medahalli Kadugodi Road, Bharathi Nagar, Krish... MULTIPOLYGON (((77.72899 13.02061, 77.72994 13...
... ... ... ... ... ...
193 172 Madivala 194 0 1st B Cross Road, Cashier Layout, 1st Stage,... MULTIPOLYGON (((77.61399 12.92347, 77.61419 12...
194 26 Ramamurthy Nagar 195 Kalkere-Agara Main Road, Horamavu Agara, Kalke... MULTIPOLYGON (((77.68336 13.05192, 77.68384 13...
195 25 Horamavu 196 0 Horamavu Agara Main Road, 1st Block, Mallapp... MULTIPOLYGON (((77.64931 13.07853, 77.64993 13...
196 86 Marathahalli 197 0 3rd Cross Road, Manjunatha Layout, Marathaha... MULTIPOLYGON (((77.68549 12.94121, 77.68539 12...
197 198 Hemmigepura 198 BGS Road, Kodipalya, Bengaluru MULTIPOLYGON (((77.49854 12.92574, 77.49854 12...

198 rows × 5 columns

In [11]:
fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)
Out[11]:
<AxesSubplot:>

Data Pre-Processing

Travel Times
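
The Movement data provides only aggregated statistics for each ward pair and time block. To create individual training samples, we stack five copies of the dataset and draw a synthetic travel_time for each row from the published statistics: travel_time = geometric_mean_travel_time * geometric_standard_deviation_travel_time ** u, with u drawn uniformly from [0, 1].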

In [49]:
np.random.seed(0)
travel_times = pd.concat([travel_times]*5, ignore_index=True)
In [50]:
travel_times['random'] = np.random.uniform(0, 1, len(travel_times))
In [51]:
travel_times['travel_time'] = np.exp(
    travel_times['random'] * np.log(travel_times['geometric_standard_deviation_travel_time'])
    + np.log(travel_times['geometric_mean_travel_time']))

The source data contains travel times grouped by blocks of time (peak, off-peak, etc.), defined by the start_hour and end_hour columns. To make this easy to model, we add a time_period column and assign an integer category value.

In [52]:
travel_times
Out[52]:
sourceid dstid month day start_hour end_hour mean_travel_time standard_deviation_travel_time geometric_mean_travel_time geometric_standard_deviation_travel_time time_period travel_time src_lon src_lat dst_lon dst_lat distance random
0 102 97 3 13 10 16 322.8 425.14 270.59 1.7 3 362.063898 77.563817 12.982784 77.566287 12.970359 1778.9 0.548814
1 102 97 1 17 19 0 306.71 200.99 256.58 1.84 5 396.842057 77.563817 12.982784 77.566287 12.970359 1778.9 0.715189
2 102 97 2 7 19 0 282.94 206.01 233.29 2.0 5 354.279465 77.563817 12.982784 77.566287 12.970359 1778.9 0.602763
3 102 97 1 19 7 10 294.18 183.97 258.09 1.61 2 334.554688 77.563817 12.982784 77.566287 12.970359 1778.9 0.544883
4 102 97 2 9 7 10 263.55 149.79 232.89 1.67 2 289.404809 77.563817 12.982784 77.566287 12.970359 1778.9 0.423655
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38204830 162 83 3 2 16 19 3120.83 459.43 3089.7 1.15 4 3091.908238 77.587545 12.924545 77.599077 13.005976 11869.0 0.005112
38204831 8 127 2 2 10 16 2943.88 338.43 2924.38 1.12 3 3128.379559 77.672583 12.994474 77.548812 12.963703 17072.9 0.595019
38204832 34 95 2 1 19 0 2624.09 863.77 2493.34 1.37 5 3387.562808 77.506307 13.038050 77.632594 12.973725 19533.4 0.973561
38204833 56 120 1 19 10 16 2366.86 349.15 2340.33 1.16 3 2635.211006 77.603741 13.035174 77.547044 12.972093 13119.4 0.799564
38204834 128 25 1 24 19 0 2135.83 604.69 2033.07 1.39 5 2273.702852 77.551941 12.960545 77.553857 13.026029 10127.4 0.339695

38204835 rows × 18 columns

In [53]:
categories_to_hour = {
    1: [0, 6],
    2: [7, 9],
    3: [10, 15],
    4: [16, 18],
    5: [19, 23]
}

def get_time_period(hour):
    for category, (start_hour, end_hour) in categories_to_hour.items():
        if hour >= start_hour and hour <= end_hour:
            return category
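
A quick sanity check of the bucketing (illustrative; not part of the original run):

In [ ]:
# 8 AM falls in the morning peak bucket; 10 PM in the evening bucket.
assert get_time_period(8) == 2
assert get_time_period(22) == 5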
In [54]:
travel_times['time_period'] = travel_times['start_hour'].apply(get_time_period)

Travel time has a strong correlation with the day of the week, so we compute a new column dow from the day and month columns.

In [55]:
year = 2020

def get_dow(row):
    return datetime.date(year, int(row['month']), int(row['day'])).weekday()
In [ ]:
travel_times['dow'] = travel_times.apply(get_dow, axis=1)
In [ ]:
travel_times
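
The row-wise apply above is simple but slow on a DataFrame of this size. A vectorized alternative (a sketch, assuming all rows are from 2020 as above):

In [ ]:
# Assemble dates from the month and day columns, then take the weekday
# (0 = Monday); much faster than a per-row apply.
dates = pd.to_datetime(pd.DataFrame({
    'year': year,
    'month': travel_times['month'].astype(int),
    'day': travel_times['day'].astype(int),
}))
travel_times['dow'] = dates.dt.weekday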

Ward Boundaries

For modeling purposes, we use the centroid of each ward to represent the ward. We use the GeoPandas centroid attribute to get the point geometry representing the centroid.

Our source data comes in the EPSG:4326 WGS84 geographic projection, which is not suitable for geoprocessing operations. To compute the centroids accurately, we must re-project the data to a planar projection. We use a UTM projection suitable for the region of the data - WGS 84 / UTM Zone 43N - defined by the code EPSG:32643. Once computed, we transform the centroids back to EPSG:4326 and add them to our GeoDataFrame.

In [30]:
centroid_utm = wards.geometry.to_crs('EPSG:32643').centroid
wards['centroid'] = centroid_utm.to_crs('EPSG:4326')
In [31]:
fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)
wards['centroid'].plot(color='red',ax=ax)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x22370fe8108>

Distance Computation

We have the travel times for each pair of source and destination wards. Travel time is strongly correlated with the distance between the wards, so we need to compute the actual distance along the road network for each ward pair.

In [32]:
ward_no = wards['MOVEMENT_ID']
index = pd.MultiIndex.from_product([ward_no, ward_no], names = ['sourceid', 'dstid'])

distancematrix = pd.DataFrame(index = index).reset_index()
distancematrix = distancematrix.query('sourceid != dstid')
In [33]:
def get_coordinates(row):
    # Look up the centroid coordinates of the source and destination wards.
    source_ward = wards[wards['MOVEMENT_ID'] == row['sourceid']].iloc[0]
    dst_ward = wards[wards['MOVEMENT_ID'] == row['dstid']].iloc[0]

    src_lon, src_lat = source_ward['centroid'].x, source_ward['centroid'].y
    dst_lon, dst_lat = dst_ward['centroid'].x, dst_ward['centroid'].y
    return src_lon, src_lat, dst_lon, dst_lat
In [34]:
distancematrix[['src_lon', 'src_lat', 'dst_lon', 'dst_lat']] = distancematrix.apply(get_coordinates, axis=1, result_type='expand')
In [35]:
distancematrix
Out[35]:
sourceid dstid src_lon src_lat dst_lon dst_lat
1 1 2 77.580422 13.121709 77.560037 13.102805
2 1 3 77.580422 13.121709 77.583926 13.090987
3 1 4 77.580422 13.121709 77.669565 13.006063
4 1 5 77.580422 13.121709 77.715456 13.016847
5 1 6 77.580422 13.121709 77.705502 13.022373
... ... ... ... ... ... ...
39198 198 193 77.505015 12.891903 77.594507 12.910882
39199 198 194 77.505015 12.891903 77.614418 12.920018
39200 198 195 77.505015 12.891903 77.676539 13.033613
39201 198 196 77.505015 12.891903 77.653272 13.044560
39202 198 197 77.505015 12.891903 77.691495 12.950743

39006 rows × 6 columns

We need to get the driving distance between approximately 40,000 coordinate pairs. To do this efficiently, we ran the Open Source Routing Machine (OSRM) service locally using Docker images provided by the project. OSRM holds the network graph in memory, so routing is extremely fast. We write and apply the following function to get the driving distance in meters.

In [36]:
def get_distance(row):
    # Query the local OSRM server for the driving route; OSRM expects
    # coordinates in lon,lat order.
    coordinates = '{},{};{},{}'.format(
        row['src_lon'], row['src_lat'], row['dst_lon'], row['dst_lat'])
    url = 'http://127.0.0.1:5000/route/v1/driving/'
    response = requests.get(url + coordinates)
    if response.status_code == 200:
        data = response.json()
        return data['routes'][0]['distance']
    # Return NaN on failed requests instead of an undefined variable.
    return np.nan
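
The apply step itself is not shown in the notebook; a minimal sketch of how the distance column would be computed and saved, assuming the local OSRM server is running:

In [ ]:
# Query OSRM for every ward pair and persist the result for reuse.
distancematrix['distance'] = distancematrix.apply(get_distance, axis=1)
distancematrix.to_csv(
    os.path.join('data', 'osrm', 'distancematrix.csv'), index=False)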

The saved distance data is then loaded back and used in the subsequent analysis.

In [33]:
osrm_data_folder = os.path.join('data', 'osrm')
distancematrix_file = 'distancematrix.csv'
distancematrix_filepath = os.path.join(osrm_data_folder, distancematrix_file)
distancematrix = pd.read_csv(distancematrix_filepath)
In [34]:
travel_times = pd.merge(travel_times, distancematrix, on=['sourceid', 'dstid']) 
In [39]:
travel_times
Out[39]:
sourceid dstid month day start_hour end_hour mean_travel_time standard_deviation_travel_time geometric_mean_travel_time geometric_standard_deviation_travel_time time_period dow src_lon src_lat dst_lon dst_lat distance
0 102 97 3 13 10 16 322.80 425.14 270.59 1.70 3 4 77.563817 12.982784 77.566287 12.970359 1778.9
1 102 97 1 17 19 0 306.71 200.99 256.58 1.84 5 4 77.563817 12.982784 77.566287 12.970359 1778.9
2 102 97 2 7 19 0 282.94 206.01 233.29 2.00 5 4 77.563817 12.982784 77.566287 12.970359 1778.9
3 102 97 1 19 7 10 294.18 183.97 258.09 1.61 2 6 77.563817 12.982784 77.566287 12.970359 1778.9
4 102 97 2 9 7 10 263.55 149.79 232.89 1.67 2 6 77.563817 12.982784 77.566287 12.970359 1778.9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7640962 162 83 3 2 16 19 3120.83 459.43 3089.70 1.15 4 0 77.587545 12.924545 77.599077 13.005976 11869.0
7640963 8 127 2 2 10 16 2943.88 338.43 2924.38 1.12 3 6 77.672583 12.994474 77.548812 12.963703 17072.9
7640964 34 95 2 1 19 0 2624.09 863.77 2493.34 1.37 5 5 77.506307 13.038050 77.632594 12.973725 19533.4
7640965 56 120 1 19 10 16 2366.86 349.15 2340.33 1.16 3 6 77.603741 13.035174 77.547044 12.972093 13119.4
7640966 128 25 1 24 19 0 2135.83 604.69 2033.07 1.39 5 4 77.551941 12.960545 77.553857 13.026029 10127.4

7640967 rows × 17 columns

Data Modeling

We use the scikit-learn library to build and train a linear regressor.

The independent variables considered are sourceid, dstid, day, time_period, dow, src_lon, src_lat, dst_lon, dst_lat, and distance. The dependent variable is travel_time, the synthetic travel time derived earlier from the geometric mean. Of the independent variables, the categorical ones - time_period and dow - are one-hot encoded. We sample the travel times to get a subset that will be used for training.

In [35]:
num_samples = 500000
samples = travel_times.sample(n=num_samples, random_state=1)
In [36]:
sel_input = ['sourceid', 'dstid', 'day', 'time_period', 'dow', 'src_lon', 'src_lat', 'dst_lon', 'dst_lat', 'distance']
cat_ip = ['time_period', 'dow']
scale_ip = list(set(sel_input) - set(cat_ip))
In [37]:
x = samples[sel_input].values
y = samples['travel_time']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)

The time_period and dow columns are categorical, so they are one-hot encoded: their integer codes have no ordinal meaning for a linear model.

In [ ]:
category_Trans = ColumnTransformer([
    ('encoder', OneHotEncoder(categories='auto', sparse=False),
     [sel_input.index(i) for i in cat_ip]),
    ('scaler', StandardScaler(),
     [sel_input.index(i) for i in scale_ip])
], remainder='passthrough')
In [ ]:
regressor = Pipeline(steps=[('ct',category_Trans),('model',LinearRegression())])
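
Wrapping the transformer and the model in a single Pipeline ensures that the encoder and scaler are fitted only on the training partition during fit() and are re-applied consistently at prediction time.
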
In [38]:
regressor.fit(x_train,y_train)
In [50]:
print('Training Accuracy: ', regressor.score(x_train,y_train))
Training Accuracy:  0.7442950836977186
In [51]:
print('Prediction Accuracy: ', regressor.score(x_test,y_test))
Prediction Accuracy:  0.7441506633075343
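
Note that score() on a regressor returns the coefficient of determination R², not a classification accuracy; the model explains roughly 74% of the variance on both the training and test partitions.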

Checking Model Performance

While the model performs well on the test partition, that dataset is not representative of real-world data. We want to see how the model performs on routing requests that are not between ward centroids. To achieve this, we create a dataset with random source and destination coordinates and check the model's predictions against travel times from commercial providers such as Google Maps.

Random Points within a Polygon

We generate random coordinate pairs within the bounding box of the city. To ensure that the points fall within the actual city geometry, we do a spatial join to select the points that intersect the wards. After the join, we select a subset of 100 points.

In [31]:
n_points = 200

x_min, y_min, x_max, y_max = wards.total_bounds

np.random.seed(0)
src_x = np.random.uniform(x_min, x_max, n_points)
src_y = np.random.uniform(y_min, y_max, n_points)
dst_x = np.random.uniform(x_min, x_max, n_points)
dst_y = np.random.uniform(y_min, y_max, n_points)

src_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(src_x, src_y), crs='EPSG:4326')
dst_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(dst_x, dst_y), crs='EPSG:4326')
In [32]:
src_gdf = gpd.sjoin(src_gdf, wards, how='inner', op='intersects')
dst_gdf = gpd.sjoin(dst_gdf, wards, how='inner', op='intersects')

src_selected = src_gdf[:100].reset_index()
dst_selected = dst_gdf[:100].reset_index()
In [33]:
fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)
src_selected.geometry.plot(color='green',ax=ax)
dst_selected.geometry.plot(color='red',ax=ax)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x200333bb488>

Random Days and Times

In [34]:
start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2020, 3, 31)
# Number of days in the quarter, used to draw random dates.
n_days = (end_date - start_date).days

random.seed(0)
random_dates = [start_date + datetime.timedelta(days=random.randrange(n_days))
                for _ in range(100)]
months = [x.month for x in random_dates]
days = [x.day for x in random_dates]
dows = [x.weekday() for x in random_dates]
random_time_periods = [random.choice(range(1, 6)) for _ in range(100)]
In [35]:
data = pd.DataFrame({
    'sourceid': src_selected['MOVEMENT_ID'],
    'dstid': dst_selected['MOVEMENT_ID'],
    'month': months,
    'day': days,
    'time_period': random_time_periods,
    'dow': dows,
    'src_lon': src_selected.geometry.x,
    'src_lat': src_selected.geometry.y,
    'dst_lon': dst_selected.geometry.x,
    'dst_lat': dst_selected.geometry.y,
})

Random Test Dataset

As before, we computed the driving distance for each random pair with the local OSRM service and saved the result locally for the subsequent analysis.
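
A sketch of that distance computation, assuming the get_distance function and OSRM server from earlier:

In [ ]:
# Compute road-network distances for the random test pairs and save them.
data['distance'] = data.apply(get_distance, axis=1)
data.to_csv(os.path.join('data', 'osrm', 'model_test.csv'), index=False)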

In [36]:
osrm_data_folder = os.path.join('data', 'osrm')
model_test_file = 'model_test.csv'
model_test_filepath = os.path.join(osrm_data_folder, model_test_file)
model_test = pd.read_csv(model_test_filepath)
In [37]:
model_test
Out[37]:
sourceid dstid month day time_period dow src_lon src_lat dst_lon dst_lat distance
0 91 184 2 19 2 2 77.637890 12.930561 77.590090 12.888096 9686.2
1 91 184 2 23 1 6 77.644817 12.944129 77.586162 12.885430 14531.8
2 91 165 1 6 1 0 77.643652 12.941535 77.761146 12.935575 16056.7
3 10 165 2 3 5 0 77.655367 12.950985 77.713831 12.937901 10362.1
4 10 165 3 6 4 4 77.658390 12.949876 77.716294 12.927034 8051.9
... ... ... ... ... ... ... ... ... ... ... ...
95 185 49 3 1 1 6 77.556457 12.861758 77.572991 13.007320 20097.0
96 159 49 1 9 5 3 77.588550 12.909991 77.570795 12.997582 12067.3
97 76 178 1 12 4 6 77.648405 13.006604 77.661723 12.881504 20443.1
98 193 178 3 27 2 4 77.597410 12.915178 77.650824 12.889310 8032.5
99 11 115 1 17 1 4 77.557797 13.049415 77.526487 12.968088 12629.0

100 rows × 11 columns

Predicted vs. Reference Travel Times

To validate our model against real-world data, we collected reference travel times from Google Maps. Google Maps allows one to set a specific departure time in the past and get a range of travel times. We used our randomly generated source and destination pairs, along with random departure times, to collect the reference data.

In [38]:
reference_data_folder = os.path.join('data', 'googlemaps')
reference_file = 'googlemaps_traveltimes.csv'
reference_filepath = os.path.join(reference_data_folder, reference_file)
reference_data = pd.read_csv(reference_filepath)
In [39]:
reference_data
Out[39]:
sourceid dstid month day time_period dow src_lon src_lat dst_lon dst_lat distance goog_distance goog_min goog_max
0 91 184 2 19 2 2 77.637890 12.930561 77.590090 12.888096 9686.2 10700 22 40
1 91 184 2 23 1 6 77.644817 12.944129 77.586162 12.885430 14531.8 15700 30 40
2 91 165 1 6 1 0 77.643652 12.941535 77.761146 12.935575 16056.7 16200 30 40
3 10 165 2 3 5 0 77.655367 12.950985 77.713831 12.937901 10362.1 11000 20 35
4 10 165 3 6 4 4 77.658390 12.949876 77.716294 12.927034 8051.9 10300 24 40
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 185 49 3 1 1 6 77.556457 12.861758 77.572991 13.007320 20097.0 20300 40 50
96 159 49 1 9 5 3 77.588550 12.909991 77.570795 12.997582 12067.3 11900 22 40
97 76 178 1 12 4 6 77.648405 13.006604 77.661723 12.881504 20443.1 24000 45 80
98 193 178 3 27 2 4 77.597410 12.915178 77.650824 12.889310 8032.5 8000 18 35
99 11 115 1 17 1 4 77.557797 13.049415 77.526487 12.968088 12629.0 13400 24 28

100 rows × 14 columns

In [40]:
def predict_time(row):
    # Assemble the feature vector in the same order used during training.
    features = row[['sourceid', 'dstid', 'day', 'time_period', 'dow',
                    'src_lon', 'src_lat', 'dst_lon', 'dst_lat', 'distance']]
    # The model predicts seconds; convert to minutes.
    return round(regressor.predict([features])[0]/60)
In [41]:
reference_data['predicted'] = reference_data.apply(predict_time, axis=1)
results = reference_data[['sourceid', 'dstid', 'distance', 'goog_distance', 'goog_min', 'goog_max', 'predicted']].copy()
results['within_range'] = np.where(
    (results['predicted'] <= results['goog_max'])
        & (results['predicted'] >= results['goog_min']), 'Y', 'N')
results.head(40)
Out[41]:
sourceid dstid distance goog_distance goog_min goog_max predicted within_range
0 91 184 9686.2 10700 22 40 29 Y
1 91 184 14531.8 15700 30 40 26 N
2 91 165 16056.7 16200 30 40 34 Y
3 10 165 10362.1 11000 20 35 29 Y
4 10 165 8051.9 10300 24 40 31 Y
5 10 165 10610.8 10700 18 35 30 Y
6 10 165 11898.9 11500 22 40 37 Y
7 177 165 20806.9 21200 40 60 52 Y
8 15 20 17683.4 16800 30 65 40 Y
9 15 20 20441.6 18800 35 60 45 Y
10 15 82 13335.8 16000 24 40 25 Y
11 169 148 25992.0 26400 55 110 70 Y
12 169 148 23628.8 23200 40 65 59 Y
13 169 180 29579.8 28500 55 100 74 Y
14 169 180 31056.3 29200 65 120 81 Y
15 169 180 31774.2 30000 60 110 82 Y
16 168 76 15065.9 19500 35 55 42 Y
17 1 154 22362.3 23800 40 50 41 Y
18 176 198 14221.3 14100 28 50 38 Y
19 196 198 26543.1 33500 50 85 63 Y
20 196 198 32441.9 34800 60 75 65 Y
21 196 198 30500.7 30700 50 90 73 Y
22 5 198 34863.5 34000 55 80 78 Y
23 194 144 9954.9 10800 26 50 31 Y
24 198 35 21076.7 21300 35 50 51 N
25 198 188 9760.8 10200 18 35 24 Y
26 198 188 14291.3 14300 24 28 26 Y
27 198 55 20778.0 21500 40 75 49 Y
28 198 13 33244.1 32800 55 70 67 Y
29 32 13 19364.5 19600 30 60 43 Y
30 32 13 21956.7 21500 40 75 48 Y
31 180 13 27069.0 27500 45 80 63 Y
32 66 13 15000.5 17400 28 50 37 Y
33 81 13 22482.0 23600 40 65 58 Y
34 78 192 22069.3 21800 40 50 41 Y
35 78 192 20063.5 19400 40 50 41 Y
36 78 166 17271.0 16500 28 45 45 Y
37 195 166 16745.3 15100 26 40 43 N
38 2 166 30049.5 31200 45 60 59 Y
39 179 166 16929.2 16900 30 55 44 Y
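
A quick way to summarize the comparison (not run in the original notebook):

In [ ]:
# Fraction of predictions that fall inside the Google Maps range.
print(results['within_range'].value_counts(normalize=True))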

Real-time Routing and Prediction

To demonstrate our technique in a real-world setting, we show how it can be used in a real-time routing application.

In [59]:
from dotenv import load_dotenv
load_dotenv()

ORS_API_KEY = os.getenv('ORS_API_KEY')
In [61]:
def get_driving_route(source_coordinates, dest_coordinates):
    # Our coordinates are (lat, lon) tuples; the ORS API expects lon,lat.
    parameters = {
        'api_key': ORS_API_KEY,
        'start': '{},{}'.format(source_coordinates[1], source_coordinates[0]),
        'end': '{},{}'.format(dest_coordinates[1], dest_coordinates[0])
    }

    response = requests.get(
        'https://api.openrouteservice.org/v2/directions/driving-car', params=parameters)

    if response.status_code == 200:
        return response.json()
    # Fail loudly rather than returning a sentinel that would break callers.
    raise RuntimeError('ORS request failed with status {}'.format(response.status_code))
In [62]:
def get_ward(coordinates):
    # coordinates are (lat, lon); points_from_xy expects (lon, lat).
    point = gpd.GeoDataFrame(
        geometry=gpd.points_from_xy([coordinates[1]], [coordinates[0]]),
        crs='EPSG:4326')
    joined = gpd.sjoin(point, wards, how='inner', op='intersects')
    return int(joined['MOVEMENT_ID'].iloc[0])
In [80]:
def get_route(source, destination, departure_time):
    sourceid = get_ward(source)
    dstid = get_ward(destination)
    day = departure_time.day
    time_period = get_time_period(departure_time.hour)
    dow = departure_time.weekday()
    driving_data = get_driving_route(source, destination)
    summary = driving_data['features'][0]['properties']['summary']
    distance = summary['distance']
    features = [sourceid, dstid, day, time_period, dow,
                source[1], source[0], destination[1], destination[0], distance]
    travel_time = round(regressor.predict([features])[0]/60)
    ors_travel_time = round(summary['duration']/60)
    route = driving_data['features'][0]['geometry']['coordinates']

    def swap(coord):
        # ORS returns (lon, lat) pairs; folium expects (lat, lon).
        coord[0], coord[1] = coord[1], coord[0]
        return coord

    route = list(map(swap, route))
    m = folium.Map(location=[(source[0] + destination[0])/2,(source[1] + destination[1])/2], zoom_start=13)
    
    tooltip = 'Model predicted time = {} mins, \
        Default travel time = {} mins'.format(travel_time, ors_travel_time)
    folium.PolyLine(
        route,
        weight=8,
        color='blue',
        opacity=0.6,
        tooltip=tooltip
    ).add_to(m)

    folium.Marker(
        location=(source[0],source[1]),
        icon=folium.Icon(icon='play',color='green')
    ).add_to(m)

    folium.Marker(
        location=(destination[0],destination[1]),
        icon=folium.Icon(icon='stop',color='red')
    ).add_to(m)

    return m

Live Demo

We pick a pair of coordinates within the city and show how to get turn-by-turn directions using the OpenRouteService API and predict the travel time using our model.

In [86]:
source = 12.946538, 77.579975
destination = 12.994029, 77.661008
departure_time = datetime.datetime.now()
In [87]:
get_route(source, destination, departure_time)
Out[87]:
[Interactive folium map of the route, with the model-predicted and ORS travel times shown in a tooltip]

We can check how the model performs by comparing with the travel time predicted by Google Maps.

In [88]:
import webbrowser

url='https://www.google.com/maps/dir/{},{}/{},{}'.format(source[0],source[1],destination[0],destination[1])
webbrowser.open(url)
Out[88]:
True