from IPython.display import HTML
from IPython.display import Image
from IPython.core.debugger import Tracer
import IPython.display
HTML('''<script>
code_show=true;
input_show=true;
function code_toggle() {
if (code_show){
$('div.input, div.output_stderr,.input_area, .celltoolbar').hide();
} else {
$('div.input, div.output_stderr,.input_area, .celltoolbar').show();
}
code_show = !code_show
}
//$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Max J. Hoffmann (mjhoffmann@gmail.com)
The ever increasing car traffic in large urban areas and the possible advent of self-driving cars mandate a better use of existing infrastructure. One possible route here is to group individual trips into clusters of trips (i.e. pools, group taxis, impromptu buses) that share a significant section of the trip. The release of NYC Taxi Data is an open invitation to muse about clustering approaches based on real-world taxi trips made in the past. How about a city like NYC saves ~$300k per day (~$100M annually) by pooling taxi rides and invests this towards affordable higher education?
from IPython.display import Image
Image(filename='visualization/clustered.png')
The figure below illustrates one such strategy: Both panels show a part of New York City. The blue lines show taxi rides that occurred on Jan 2nd, 2013, between 7:54 pm and 8:00 pm. The red points indicate the drop-off locations and the blue points the pickup locations. These rides were chosen because they all happen to arrive at roughly the same time at the same location, within some tolerances. The left panel shows all the original rides. The right panel shows a possible modification of those same rides, where some points have turned from blue to green: these rides have been modified such that riders farther away pick up other passengers who are nearly on their way.
We can speculate why all these passengers wish to arrive at the same time and place. Maybe there is some bigger event like a concert, or maybe they just have dates at a few nearby bars. In either case one could imagine that those passengers have something more in common with each other than with a random person on a bus or metro, and sharing a ride with them may be less awkward than with a random stranger. Furthermore, if it turns out to be desirable to transport many people in a single ride to or from the same location, these rides could be marketed more directly. To this end, one could bundle a discounted taxi ride with the sale of a ticket.
More generally speaking: the key question is how to group the existing trips in an efficient and general way. To this end, consider the defining properties of a taxi trip: an initial coordinate (latitude and longitude), a final coordinate (latitude and longitude), and either a pickup time or a drop-off time. The travel time between pickup and drop-off is fixed by the distance and possibly the limiting influence of traffic. Thus initially every trip can be thought of as some point in a 5-dimensional space $(\theta_{\mathrm{pickup}}, \phi_{\mathrm{pickup}}, \theta_{\mathrm{dropoff}}, \phi_{\mathrm{dropoff}}, t_{\mathrm{dropoff}})$. This very large space can seem quite intimidating at first and difficult to think about.
However, it can easily be reduced if we only consider trips that start or arrive within a small window in time (say within 10 minutes) and space (within a few hundred feet). This could be all the rides that aim to bring passengers to the same concert, airplane flight, or beginning of a work shift. If we cluster rides in this fashion, the remaining degrees of freedom reduce to the starting point of each ride $(\theta_{\mathrm{pickup}}, \phi_{\mathrm{pickup}})$. Thus we have readily reduced the 5-dimensional space into a set of 2-dimensional spaces that are more tractable to reason about, as the short sketch below illustrates.
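As a minimal illustration of this reduction (the helper name and window parameters are hypothetical, not part of the analysis below), one can fix a drop-off window in time and space and keep only the 2-dimensional pickup coordinates:
# Sketch only: trips sharing a drop-off window reduce to 2-D pickup points.
import pandas as pd

def pickup_points_near(trips, t0, lat, lon, minutes=10, deg=0.001):
    # trips dropping off within `minutes` of t0 and roughly `deg` degrees
    # (~100 m) of (lat, lon); returns their pickup coordinates only
    in_time = (trips['dropoff_datetime'] - t0).abs() <= pd.Timedelta(minutes=minutes)
    in_space = ((trips['dropoff_latitude'] - lat).abs() <= deg) & \
               ((trips['dropoff_longitude'] - lon).abs() <= deg)
    return trips.loc[in_time & in_space, ['pickup_latitude', 'pickup_longitude']]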
The next question is how we can optimize trip planning within each pickup or drop-off cluster. Initially, there are n disparate trips and the total miles traveled is simply the sum of the lengths of all trips. The tradeoff at play is to find ways different passenger rides can pick each other up to reduce the total miles traveled, while each passenger only has a finite tolerance to go out of their way to pick up others. Naturally, the best scenario would find optimizations like the following: say both passenger A and passenger B want to travel from some remote outskirt to the center of Manhattan at point p, and A and B happen to be close neighbors, so A can pick up B with only a small detour while traveling a long distance together. We can model this by iteratively testing combined optimized routes of A via B to p and of B via A to p, while demanding that the combined route is never longer than (1+t) times the original route of A to p or B to p (here t could be e.g. t = 0.1-0.2). All candidate mergers are sorted by the distance saved, i.e. the difference between the sum of the individual routes and the combined route. The merger promising the largest benefit in miles traveled is chosen, and we continue this merging with the remaining routes (including the one we just merged). We repeat this procedure until no trip optimizations within the given tolerance are left.
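To make the greedy procedure concrete, here is a minimal sketch of a single merge step; it uses straight-line haversine distances as a stand-in for the routed distances that the actual analysis obtains from valhalla, and `best_merge` and its arguments are illustrative names rather than part of the notebook's cluster_trips module.
# Sketch only: one greedy merge step under detour tolerance t, with
# great-circle distance standing in for routed distance.
from itertools import permutations
import haversine

def best_merge(pickups, p, t=0.1):
    # pickups: list of (lat, lon) pickup points sharing destination p
    best = None
    for a, b in permutations(pickups, 2):
        direct_a = haversine.haversine(a, p)
        direct_b = haversine.haversine(b, p)
        combined = haversine.haversine(a, b) + direct_b  # a detours via b
        if combined > (1 + t) * direct_a:
            continue  # a's detour would exceed the tolerance
        saving = direct_a + direct_b - combined  # distance no longer driven
        if best is None or saving > best[0]:
            best = (saving, a, b)
    return best
Repeating this step, while keeping the merged route in the candidate pool, reproduces the iterative procedure described above.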
Using this simple route clustering for a given detour tolerance t leads to a number of questions that can be readily addressed, namely: how large the potential savings in miles and fares are over the course of a day, how the distribution of passengers per vehicle (and thus the required fleet) changes, and how far ahead of time ride orders would have to be known.
The data we analyze here is NYC Taxi Trip data from 2013. To analyze the data we use standard python/numpy/pandas tools. Interactive data exploration has been done using Jupyter and IPython. The haversine formula is useful to quickly calculate distances between coordinates. The clustering is performed using the DBSCAN method as implemented in scikit-learn. Finally, for finding routes and optimized routes, we use the open-source routing engine valhalla with OpenStreetMap data.
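As a quick sanity check of the distance helper (coordinates and the expected value are approximate, for illustration only), the haversine package computes great-circle distances in kilometers directly from latitude/longitude pairs:
import haversine
times_square = (40.7580, -73.9855)  # (lat, lon)
jfk_airport = (40.6413, -73.7781)
print(haversine.haversine(times_square, jfk_airport))  # roughly 22 km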
# some imports
import matplotlib
%matplotlib inline
import scipy.cluster
import pandas as pd
import sys
import os
import haversine
import datetime
import numpy as np
import sklearn.cluster
import matplotlib.animation
import sklearn.preprocessing
from matplotlib import pyplot as plt
import folium
import cluster_trips
cluster_trips = reload(cluster_trips)
import get_directions
get_directions = reload(get_directions)
# don't forget to load route service with
# valhalla_route_service /home/hoffmann/src/valhalla/valhalla.json
# Load NYC Taxi Trips from Jan 2013
# pandas_df = pd.read_csv('sorted_trip_data_1.csv', nrows=900000)
# skiprows jumps ahead in the file; column names are re-read from the header line
pandas_df = pd.read_csv('trip_data_w_fares_1.csv', nrows=900000, skiprows=6196645,
                        names=open('trip_data_w_fares_1.csv').readline().strip().split(','))
# drop unneeded columns
del pandas_df['medallion']
del pandas_df['hack_license']
del pandas_df['vendor_id']
del pandas_df['store_and_fwd_flag']
trips = pandas_df
trips['dropoff_datetime'] = pd.to_datetime(trips['dropoff_datetime'])
trips['pickup_datetime'] = pd.to_datetime(trips['pickup_datetime'])
# Filter trips with valid coordinates
lon_lim = lon_min, lon_max = -74.05, -73.7
lat_lim = lat_min, lat_max = 40.57, 41.0
# filter on dropoff location
trips = trips[trips['dropoff_longitude'] <= lon_max]
trips = trips[lon_min <= trips['dropoff_longitude']]
trips = trips[lat_min <= trips['dropoff_latitude']]
trips = trips[trips['dropoff_latitude'] <= lat_max]
# filter on pickup location
trips = trips[trips['pickup_longitude'] <= lon_max]
trips = trips[lon_min <= trips['pickup_longitude']]
trips = trips[lat_min <= trips['pickup_latitude']]
trips = trips[trips['pickup_latitude'] <= lat_max]
# Sort by drop-off timestamp
trips.sort_values('dropoff_datetime', inplace=True)
# trips.shape
t1 = pd.Timestamp('2013-01-15 00:00:00')
t2 = pd.Timestamp('2013-01-16 00:00:00')
tz = 8  # hour offset applied when converting epoch timestamps below
d2 = trips.loc[trips['dropoff_datetime'] < t2, :]
day = d2.loc[t1 < d2['dropoff_datetime'], :].copy()
# drop-off time in hours since midnight (epoch ns -> hours, shifted by tz)
day.loc[:, 'dropoff_datetime_hours'] = (day.loc[:, 'dropoff_datetime'].astype('int') - int(t1.strftime('%s'))*1e9) / 60. / 60. / 1e9 + tz
# ten 6-minute time slots per hour
day.loc[:, 'dropoff_timeslot'] = (day.loc[:, 'dropoff_datetime_hours'] * 10.).astype('int')
ax = day.hist(figsize=(6, 4), bins=240, column=['dropoff_datetime_hours'])[0][0]
ax.set_xlabel('hours')
ax.set_ylabel('# rides')
ax.set_title(t1.strftime('%A, %B %d %Y'))
xticks = ax.set_xticks(range(int((t2 - t1).delta/1e9/60./60)))
The figure above shows the histogram of drop-off times on Jan 15th, 2013. Every hour is divided into 10 intervals of 6 minutes each. The overall shape is quite typical for a workday profile. Most notably, there is a sharp drop-off in taxi traffic after midnight and only a little traffic between 1 am and about 5:30 am. After this, traffic increases drastically, peaking around 9 am with over 2,500 rides per time slot. Throughout the day taxi traffic stays fairly stable between 2,000 and 2,500 rides per time slot, with peaks between 12 pm and 1 pm (lunch time) and between 3 pm and 4 pm, and a fairly pronounced drop between 5 pm and 6 pm. The overall daily maximum is reached at 7 pm, after which traffic declines steadily, and rapidly after 10 pm. As pointed out above, this overall shape appears typical for a workday. All subsequent detailed analysis is carried out only on this data of January 15th, 2013, but there is reason to believe that there is nothing special about this day and at least some of the insight transfers to other days as well.
The next question is how to group trips that arrive at a similar place and a similar time. For grouping along the time axis, we can simply take each bar of the histogram above (i.e. every 6-minute time slot) and inspect it further. Within each slot, the DBSCAN method readily detects spatial clusters of nearby drop-off points.
map1=folium.Map(location=[40.7591704,-73.92714], width='48%', height='100%')
map2 = folium.Map(location=[40.7591704,-74.0392714], width='50%', height='100%', left='50%', position='absolute')
slot = day.loc[day['dropoff_timeslot']==63, :]
for p in slot.iterrows():
    folium.CircleMarker(location=(p[1].dropoff_latitude, p[1].dropoff_longitude), radius=.51, fill_color='blue').add_to(map1)
    folium.CircleMarker(location=(p[1].dropoff_latitude, p[1].dropoff_longitude), radius=.51, fill_color='blue').add_to(map2)
dbscan = sklearn.cluster.DBSCAN(eps=.02, algorithm='ball_tree')
X = slot[['dropoff_latitude', 'dropoff_longitude']]
scaler = sklearn.preprocessing.StandardScaler()
X = scaler.fit_transform(X)
db = dbscan.fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# assign each cluster a color; noise (label -1) keeps the plain black dots
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
for k, col in zip(unique_labels, colors):
    if k == -1:
        continue  # noise points keep the small black markers from above
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    ps = scaler.inverse_transform(xy)  # back to latitude/longitude
    hex_color = '#%02x%02x%02x' % tuple(int(255 * c) for c in col[:-1])  # 255, not 256, to stay in range
    for p in ps:
        folium.CircleMarker(location=(p[0], p[1]), radius=4., fill_color=hex_color).add_to(map2)
map1.fit_bounds(map1.get_bounds())
map2.fit_bounds(map2.get_bounds())
map1.add_child(map2).save('./visualization/dbscan.html')  # folium writes HTML; the PNG below is a screenshot of it
Image(filename='visualization/dbscan.png')
The above figure shows an example of clustering all arrivals from 6:18 am to 6:24 am on January 15, 2013, with a density parameter $\epsilon = 0.02$. The unclustered drop-offs are shown on the left as small solid black dots. In the right panel, clustered trips are marked by differently colored circles; drop-off points that are not assigned to any cluster are still shown as small black dots. The choice of the density parameter $\epsilon$ requires some care so that many points are clustered while the clusters themselves do not become too large. Ideally, there would be a simple parameter to limit the maximum diameter of the resulting clusters. A similar analysis could be done on pickup locations. For the scope of this analysis, we limit ourselves to clustering trips on drop-off location (and time).
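DBSCAN itself offers no such diameter cap, but one could approximate it by post-filtering: the sketch below (a hypothetical helper, assuming the `labels` array from above and the corresponding unscaled latitude/longitude array) re-marks clusters whose bounding-box diagonal exceeds a threshold as noise.
# Sketch only: enforce a rough maximum cluster diameter after DBSCAN.
import numpy as np
import haversine

def drop_oversized_clusters(labels, coords, max_diameter_km=0.5):
    # coords is an (n, 2) array of unscaled (latitude, longitude) pairs
    labels = labels.copy()
    for k in set(labels) - {-1}:
        pts = coords[labels == k]
        # approximate the cluster diameter by its bounding-box diagonal
        diag = haversine.haversine(tuple(pts.min(axis=0)), tuple(pts.max(axis=0)))
        if diag > max_diameter_km:
            labels[labels == k] = -1  # treat the oversized cluster as noise
    return labels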
Next, we illustrate how, for one of these drop-off clusters, independent rides are partially merged into group rides.
# The big clustering job on drop-off location
DETOUR_TOLERANCE = .1
BATCH_SIZE = 100  # avoid clustering too many trips at once (excessive CPU time)
total_trip_length = 0.
total_savings = 0.
OTRIPS_FN = 'optimized_trips.txt'
cluster_trips.logging.info('Starting clustering on drop-off location')
# start a fresh output file with a header line if none exists yet
if not os.path.exists(OTRIPS_FN) or os.path.getsize(OTRIPS_FN) == 0:
    with open(OTRIPS_FN, 'w') as outfile:
        outfile.write(cluster_trips.Trip.header)
for i_tslot in range(0, 240):
    cluster_trips.logging.info('Time slot {i_tslot}'.format(**locals()))
    # skip time slots that have already been processed (simple checkpointing)
    if os.path.exists('tslot.done'):
        with open('tslot.done') as infile:
            if (str(i_tslot) + '\n') in infile.readlines():
                cluster_trips.logging.info('Skipping already done tslot {i_tslot}'.format(**locals()))
                continue
    cluster_trips.logging.info("\n\n\nTIME SLOT {i_tslot}\n\n\n".format(**locals()))
    tslot = day[day['dropoff_timeslot'] == i_tslot]
    # spatial clustering of drop-off points within this 6-minute slot
    dbscan = sklearn.cluster.DBSCAN(eps=.03, algorithm='ball_tree')
    X = tslot[['dropoff_latitude', 'dropoff_longitude']]
    X = sklearn.preprocessing.StandardScaler().fit_transform(X)
    db = dbscan.fit(X)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    groups = group_labels(labels)  # helper defined in the cell below
    for group in groups:
        cluster_string = '{group} {i_tslot}'.format(**locals())
        # skip clusters that have already been processed
        if os.path.exists('cluster.done'):
            with open('cluster.done') as infile:
                if (cluster_string + '\n') in infile.readlines():
                    cluster_trips.logging.info('Skipping already clustered pool {cluster_string}\n'.format(**locals()))
                    continue
        cluster_trips.logging.info("Cluster {group}/{n_clusters_}".format(**locals()))
        itslot = tslot.iloc[groups[group]]
        itrips = []
        for row in itslot.itertuples():
            trip = cluster_trips.Trip(
                end_point=cluster_trips.Location(lat=row.dropoff_latitude, lon=row.dropoff_longitude),
                start_points=[cluster_trips.Location(lat=row.pickup_latitude, lon=row.pickup_longitude)],
                passengers=row.passenger_count,
                original_fare=row._16,  # fare column, accessed positionally via itertuples
                pickup_datetime=row.pickup_datetime,
                dropoff_datetime=row.dropoff_datetime,
            )
            if trip.original_distance is not None:
                itrips.append(trip)
        if group == -1:  # log unclustered trips for later use ...
            with open('unoptimized_trips.txt', 'a') as outfile:
                for itrip in itrips:
                    outfile.write(itrip.to_log())
            with open('cluster.done', 'a') as outfile:
                outfile.write('{cluster_string}\n'.format(**locals()))
            continue
        n_trips = len(itrips)
        cluster_trips.logging.info("Number of trips in cluster {group}/(time: {i_tslot}): {n_trips}".format(**locals()))
        for batch in range(len(itrips)//BATCH_SIZE + 1):
            itrips_batch = itrips[batch*BATCH_SIZE:(batch+1)*BATCH_SIZE]
            trip_length, savings = cluster_trips.cluster_trips(itrips_batch, verbose=True, detour_tolerance=DETOUR_TOLERANCE)
            cluster_trips.logging.info("Saved {savings} miles from {trip_length} miles.".format(**locals()))
            total_trip_length += trip_length
            total_savings += savings
            rel_benefit = 100 * total_savings / total_trip_length
            cluster_trips.logging.info("Savings total so far {rel_benefit}".format(**locals()))
            with open(OTRIPS_FN, 'a') as outfile:
                for itrip in itrips_batch:
                    outfile.write(itrip.to_log())
        with open('cluster.done', 'a') as outfile:
            outfile.write('{cluster_string}\n'.format(**locals()))
    with open('tslot.done', 'a') as outfile:
        outfile.write('{i_tslot}\n'.format(**locals()))
def group_labels(labels):
    """Map each DBSCAN label to the list of row indices in that cluster."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(int(label), []).append(i)
    return groups
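For illustration, calling this helper on a small label array groups row indices by cluster label:
labels_example = np.array([0, 0, 1, -1, 1])
print(group_labels(labels_example))  # {0: [0, 1], 1: [2, 4], -1: [3]} (key order may vary)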
## plot clustering with folium
#DETOUR_TOLERANCE = .1
#BATCH_SIZE = 100 # avoid clustering to many at the same time for excessive CPU time
#total_trip_length = 0.
#total_savings = 0.
##OTRIPS_FN = 'optimized_trips.txt'
##with open(OTRIPS_FN, 'w') as outfile:
## outfile.write('# pickup dropoff passengers benefit original_fare sum_of_fares original_distance final_distance\n')
#plot_tslot = 199
#plot_cluster = 43
#for i_tslot in range(plot_tslot, plot_tslot+1):
# #print("\n\n\nTIME SLOT {i_tslot}\n\n\n".format(**locals()))
# tslot = day[day['dropoff_timeslot']==i_tslot]
# dbscan = sklearn.cluster.DBSCAN(eps=.03, algorithm='ball_tree')
# X = tslot[['dropoff_latitude', 'dropoff_longitude']]
# X = sklearn.preprocessing.StandardScaler().fit_transform(X)
# db = dbscan.fit(X)
# core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
# core_samples_mask[db.core_sample_indices_] = True
# labels = db.labels_
# n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# groups = group_labels(labels)
# for group in groups.keys()[plot_cluster:plot_cluster+1]:
# cluster_string = '{group} {i_tslot}'.format(**locals())
# itslot = tslot.iloc[groups[group]]
# itrips = []
# for row in itslot.itertuples():
# #print(row)
# trip = cluster_trips.Trip(
# end_point=cluster_trips.Location(lat=row.dropoff_latitude, lon=row.dropoff_longitude),
# start_points=[cluster_trips.Location(lat=row.pickup_latitude, lon=row.pickup_longitude)],
# passengers=row.passenger_count,
# original_fare=row._16,)
# if trip.original_distance is not None:
# itrips.append(trip)
# n_trips = len(itrips)
# #print("Number of trips in cluster {group}/(time: {i_tslot}): {n_trips}".format(**locals()))
# for batch in range(len(itrips)/BATCH_SIZE + 1):
# itrips_batch = itrips[batch*BATCH_SIZE:(batch+1)*BATCH_SIZE]
# trip_length, savings, original_shapes, optimized_shapes = cluster_trips.cluster_trips(itrips_batch,
# verbose=True,
# detour_tolerance=DETOUR_TOLERANCE,
# return_shapes=True)
# map1=folium.Map(location=[40.7591704,-74.0392714], width='48%', height='60%')
# map2 = folium.Map(location=[40.7591704,-74.0392714], width='50%', height='60%', left='50%', position='absolute')
# r = 5
# for trip in original_shapes:
# folium.PolyLine(trip, latlon=False).add_to(map1)
# folium.CircleMarker(list(reversed(trip[0])), fill_color='blue', radius=r).add_to(map1)
# folium.CircleMarker(list(reversed(trip[-1])), fill_color='red', radius=r).add_to(map1)
# for trip in optimized_shapes:
# folium.PolyLine(trip, latlon=False).add_to(map2)
# folium.CircleMarker(list(reversed(trip[-1])), fill_color='red', radius=r).add_to(map2)
# for trip in original_shapes:
# folium.CircleMarker(list(reversed(trip[0])), fill_color='green', radius=r).add_to(map2)
# for trip in optimized_shapes:
# folium.CircleMarker(list(reversed(trip[0])), fill_color='blue', radius=r).add_to(map2)
# map1.fit_bounds(map1.get_bounds())
# map2.fit_bounds(map2.get_bounds())
# #print("Saved {savings} miles from {trip_length} miles.".format(**locals()))
# total_trip_length += trip_length
# total_savings += savings
# rel_benefit = 100 * total_savings / total_trip_length
# #print("Savings total so far {rel_benefit}".format(**locals()))
# #with open(OTRIPS_FN, 'a') as outfile:
# # for itrip in itrips_batch:
# # outfile.write(itrip.to_log())
# #with open('cluster.done', 'a') as outfile:
# # outfile.write('{cluster_string}\n'.format(**locals()))
# #break
##map1.add_child(map2).save('visualization/clustered_{plot_tslot}_{plot_cluster}.html'.format(**locals()))
##map1.add_child(map2)
Image(filename="visualization/clustered_199_43.png")
Many of the initial pickup points (blue) remain blue on the right and their trips remain the same. However, three points that are sufficiently close to be passed by other travel routes turn green, indicating that they will be picked up.
optimized_trips = pd.read_csv('optimized_trips.txt',sep='\s+')
optimized_trips.head()
otrips = optimized_trips
otrips['pickup_datetime'] = pd.to_datetime(otrips['pickup_date'] + otrips['pickup_time'], format='%Y-%m-%d%H:%M:%S')
otrips['dropoff_datetime'] = pd.to_datetime(otrips['dropoff_date'] + otrips['dropoff_time'], format='%Y-%m-%d%H:%M:%S')
del otrips['pickup_date']
del otrips['pickup_time']
del otrips['dropoff_date']
del otrips['dropoff_time']
# 6-minute drop-off time slots, same convention as above
otrips['dropoff_datetime_slots'] = (10 * ((otrips.dropoff_datetime.astype('int') - int(t1.strftime('%s'))*1e9) / 60. / 60. / 1e9 + tz)).astype('int')
#otrips.head()
Next, we turn to some of the statistics of clustering rides in this way as a function of the day time. To this end, we first look at all rides that were not assigned to any drop-off cluster. At this point, we look at the sum of collected fares per 6-minute time slot over the course of one day.
unoptimized_trips = pd.read_csv('unoptimized_trips.txt', sep='\s+')
unoptimized_trips.head()
otrips = unoptimized_trips
otrips['pickup_datetime'] = pd.to_datetime(otrips['pickup_date'] + otrips['pickup_time'], format='%Y-%m-%d%H:%M:%S')
otrips['dropoff_datetime'] = pd.to_datetime(otrips['dropoff_date'] + otrips['dropoff_time'], format='%Y-%m-%d%H:%M:%S')
del otrips['pickup_date']
del otrips['pickup_time']
del otrips['dropoff_date']
del otrips['dropoff_time']
otrips['dropoff_datetime_slots'] = (10 * ((otrips.dropoff_datetime.astype('int') - int(t1.strftime('%s'))*1e9) / 60. / 60. / 1e9 + tz)).astype('int')
#otrips.head()
#figs = otrips.hist(figsize=(8, 6),bins=240, by='dropoff_datetime_slots',column=['benefit'])
f = otrips.groupby('dropoff_datetime_slots', as_index=False).sum()
# fill empty time slots with zero rows so the bar chart covers the full day
for i in range(240):
    if i not in f.dropoff_datetime_slots.values:
        f = f.append(pd.Series([i, 0, 0, 0, 0, 0, 0, 0], f.keys()), ignore_index=True)
f.sort_values(by=['dropoff_datetime_slots'], inplace=True)
f['fare_difference'] = f['sum_of_fares'] - f['original_fare']
fig = f.plot(x='dropoff_datetime_slots', y=['original_fare'], kind='bar',stacked=True,figsize=(15, 6), width=1.)
fig.set_xlabel('hours')
fig.set_ylabel('fares [$]')
fig.set_title(t1.strftime('%A, %B %d %Y'))
xticks = map(lambda x: x/10 if x%10 ==0 else '', range(240))
#fig.set_xticks(range(240))
xt = fig.set_xticklabels(xticks)
Overall we note that the general shape is similar to the histogram of all rides, with a significant dip in the early morning hours. We also note that this curve varies less than the one for all rides: the minima in the early morning hours and the maxima around 9 am and 7 pm are less pronounced. In other words, trips arriving at a location without other concurrent arrivals are spread more evenly throughout the day.
optimized_trips = pd.read_csv('optimized_trips.txt',sep='\s+')
optimized_trips.head()
otrips = optimized_trips
#pd.to_datetime?
otrips['pickup_datetime'] = pd.to_datetime(otrips['pickup_date'] + otrips['pickup_time'], format='%Y-%m-%d%H:%M:%S')
otrips['dropoff_datetime'] = pd.to_datetime(otrips['dropoff_date'] + otrips['dropoff_time'], format='%Y-%m-%d%H:%M:%S')
del otrips['pickup_date']
del otrips['pickup_time']
del otrips['dropoff_date']
del otrips['dropoff_time']
otrips['dropoff_datetime_slots'] = (10 * ((otrips.dropoff_datetime.astype('int') - int(t1.strftime('%s'))*1e9) / 60. / 60. / 1e9 + tz)).astype('int')
f = otrips.groupby('dropoff_datetime_slots', as_index=False).sum()
for i in range(240):
    if i not in f.dropoff_datetime_slots.values:
        f = f.append(pd.Series([i, 0, 0, 0, 0, 0, 0, 0], f.keys()), ignore_index=True)
f.sort_values(by=['dropoff_datetime_slots'], inplace=True)
f['fare_difference'] = f['sum_of_fares'] - f['original_fare']
fig = f.plot(x='dropoff_datetime_slots', y=['fare_difference', 'original_fare'], kind='bar',stacked=True,figsize=(15, 6), width=1.)
fig.set_xlabel('hours')
fig.set_ylabel('fares [$]')
fig.set_title(t1.strftime('%A, %B %d %Y'))
xticks = map(lambda x: x/10 if x%10 ==0 else '', range(240))
#fig.set_xticks(range(240))
xt = fig.set_xticklabels(xticks)
This trend becomes more discernible if we look exclusively at the distribution of trips that were assigned to a drop-off cluster. The combined blue/green bar shows the total fare per time slot. The blue fraction indicates the difference in travel fare if each merged trip were charged only the fare of the most remote traveler. For this particular day, the maximum synergy effect would amount to about $311k.
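If the per-slot fare difference from the cell above indeed captures this synergy, the quoted daily figure should be recoverable by a one-line sum (a guess at the bookkeeping, using the `f` DataFrame computed above):
print(f['fare_difference'].sum())  # expected on the order of $311k for this day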
After applying the cluster optimization, one question is how the fleet configuration might have to change to accommodate the increased number of passengers per trip. If several trips are merged, larger vehicles may become necessary.
unoptimized = pd.read_csv('unoptimized_trips.txt',sep='\s+').groupby('passengers', as_index=True).count()
optimized = pd.read_csv('optimized_trips.txt',sep='\s+').groupby('passengers', as_index=True).count()
fig, axs = plt.subplots(1,2, figsize=(14, 6))
merged = optimized.merge(unoptimized, how='left', left_index=True, right_index=True, suffixes=('_optimized', '_unoptimized'))
ax0 = merged.plot(kind='bar', y=['pickup_optimized', 'pickup_unoptimized'], ax=axs[0], width=.9)
ax1 = merged.plot(kind='bar', y=['pickup_optimized', 'pickup_unoptimized'], logy=True, ax=axs[1], width=.9)
label = axs[0].set_ylabel('#rides')
The above graphic shows the distribution of rides as a function of the number of passengers. Both panels present the same data, with the left panel using a linear scale and the right panel a log scale to better resolve the low-frequency multi-passenger rides.
Before clustering, there is a hard cut-off at 6 passengers, with a significant dip at 4 passengers. The latter is presumably due to the fact that a standard sedan cab comfortably accommodates only 3 passengers, so parties of 4 may prefer to split over 2 sedans. After clustering, a number of larger group rides with 7+ passengers per vehicle are created, on the order of several thousand rides per day. However, even after clustering a hard cut-off remains at 12 passengers. This indicates that even when optimizing for larger group travel, a form factor of more than 12 passengers would most likely not be useful for serving individual traffic.
Finally, we turn to another important question: how far ahead of time would we have to know the demand for trips to realize this optimization? Longer trips with several pickups can only be planned if the information is available before the beginning of the trip, and ideally somewhat earlier, to ensure that a vehicle of the right size is available at the first pickup point. To this end, we look at the distribution of travel times on a typical day.
fig, axs = plt.subplots(1,2, figsize=(14, 6))
figs = day.hist(figsize=(6, 6),bins=100, column=['trip_time_in_secs'], ax=axs[0])
#fig = figs[0][0]
axs[0].set_xlabel('seconds')
axs[0].set_ylabel('# rides')
axs[0].set_xlim([0, 8000])
axs[0].set_title(t1.strftime('Travel time histogram: %A, %B %d %Y'))
#axs[0].set_xticks(range(int((t2 - t1).delta/1e9/60./60)))
day.hist(figsize=(6, 6), bins=100, cumulative=True, normed=True, column=['trip_time_in_secs'], ax=axs[1])
axs[1].set_xlabel('seconds')
axs[1].set_ylabel('fraction of rides')  # histogram is normalized (normed=True)
axs[1].set_xlim([0, 8000])
title = axs[1].set_title(t1.strftime('Cumulative Travel Time Histogram: %A, %B %d %Y'))
#axs[1].set_xticks(range(int((t2 - t1).delta/1e9/60./60)))
The distribution shows a peak at around 400-500 seconds, i.e. ride durations of 6-8 minutes. After this, the distribution drops significantly, with a long tail that mostly ends around 4000 seconds, or approximately 1 hour. From the cumulative distribution we can quickly estimate how far ahead of time ride orders would have to be received to capture and consolidate 90%, 99%, or 99.9% of all rides.
# estimate how early orders must arrive to cover a given share of rides
counts, edges = np.histogram(day['trip_time_in_secs'], bins=1000)
cum_share = np.cumsum(counts) / float(day.shape[0])
for perc in [.9, .99, .999]:
    t = edges[np.argmin(np.abs(cum_share - perc))]  # duration at this quantile
    m = t / 60
    print(" - {perc:0.3f} => {m:.0f} minutes".format(**locals()))
 - 0.900 => 21 minutes
 - 0.990 => 40 minutes
 - 0.999 => 57 minutes
Using a combination of publicly available taxi trip data, a clustering algorithm, and a route-planning server, we are able to get a close-up analysis of the possibilities and characteristics of ride pooling based on drop-off locations. Pooling rides with identical drop-off locations could work as a marketing model if rides can be offered as an attractive package with event admission. Based on freely available past taxi data from New York City, we found evidence that a maximum synergy of about $300k/day exists for a city of this size. A vehicle fleet whose vehicles can carry up to 12 passengers should be sufficient to cover the vast majority of cases. Having all ride orders at least one hour ahead of time would be necessary to account for over 99.9% of all rides.