The following tutorial illustrates using premium primitives in Featuretools to predict the duration of a taxi trip in New York City. An accurate predictive model would give passengers an informative time estimate before they begin their trip.
In this notebook we will:
- Load an entity set of NYC taxi trip data
- Use premium primitives and Deep Feature Synthesis to engineer features from geospatial and temporal data
- Train and evaluate a random forest model that predicts trip duration
- Reuse the feature definitions and trained model to make predictions on new data
To learn more about Featuretools, visit our documentation.
import featuretools as ft
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
First, we load in a copy of the data. It is 175 MB, so it may take a few minutes to download.
es = ft.entityset.read_entityset("s3://featurelabs-static/nyc_taxi_entityset_train.tar")
In this entity set, there is data on nearly 1.5 million taxi trips in New York City over a period of several months. For each trip, we have a handful of columns, shown below.
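One way to view these columns is to inspect the underlying dataframe directly. A minimal sketch, assuming the pre-1.0 Featuretools entity API used by the rest of this notebook:

# peek at the raw columns stored on the trips entity
es["trips"].df.head()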
With graphviz installed, we can generate a visualization of the entity set.
es.plot()
The primary data types for this problem are geospatial (latitude and longitude) and temporal. By default, most machine learning algorithms have a difficult time processing these data types. Therefore, in order to get the most out of this data, we need to perform feature engineering to extract predictive signal before applying machine learning.
Featuretools has several premium primitives that can be used to assist with preparing this data as numeric feature vectors for machine learning.
Below we've selected several primitives that apply to the data types in this dataset. To learn more about any of the primitives, click on the links to view the documentation.
City Block Distance - Cars cannot travel diagonally through a city block, so this primitive gives a more accurate estimate than straight-line distance of how far the passenger must travel during their trip (see the sketch after this list).
Lat Long To City - An important factor in the length of a trip is where it begins and ends. This primitive converts the pickup and drop-off locations to a city or borough, e.g., a trip that begins in Manhattan but ends in Brooklyn must cross the East River.
Is In Geo Box - Trips starting or ending near points of interest can also be relevant. We can use a geobox to detect trips that start or end within a couple of important areas of New York City that see a lot of taxi traffic.
Part Of Day - Traffic conditions greatly affect the duration of a trip. We know traffic varies by time of day, so we can use this primitive to extract whether the trip occurs during the morning, afternoon, evening, or night.
Is Federal Holiday - A typical Monday morning may have heavy traffic going into the city, but if it is a federal holiday, the traffic conditions are likely lighter.
Season, Quarter - The weather outside may determine street conditions. Using these primitives we can extract the time of year. Note: this demo data only spans a few months, but these primitives may be very relevant when we expand the dataset.
Nth Week Of Month - We also extract which week of the month the trip falls in; as we will see, this turns out to be one of the most important features in the trained model.
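To build intuition for the distance primitive, city block distance can be approximated as the sum of the north-south and east-west great-circle legs between pickup and drop-off. Below is a minimal sketch of that idea in plain Python; the helper names and sample coordinates are illustrative, not the primitive's actual implementation.

from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points, in miles
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959 * asin(sqrt(a))

def cityblock_miles(lat1, lon1, lat2, lon2):
    # north-south leg plus east-west leg, like driving along a street grid
    return (haversine_miles(lat1, lon1, lat2, lon1) +
            haversine_miles(lat2, lon1, lat2, lon2))

cityblock_miles(40.75, -73.99, 40.78, -73.95)  # illustrative Manhattan trip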
Next, we import the primitives and configure the ones that take parameters:
from featuretools.primitives import (CityblockDistance, LatLongToCity, IsInGeoBox, PartOfDay,
IsFederalHoliday, Season, NthWeekOfMonth, Quarter)
trans_primitives = [CityblockDistance,
LatLongToCity,
IsInGeoBox((40.62, -73.85), (40.70, -73.75)), # JFK Airport
IsInGeoBox((40.76, -73.89), (40.78, -73.85)), # La Guardia Airport
IsFederalHoliday,
PartOfDay,
Season,
NthWeekOfMonth,
Quarter]
Now, we can create the feature matrix using Deep Feature Synthesis.
fm, features = ft.dfs(entityset=es,
target_entity="trips",
trans_primitives=trans_primitives,
chunk_size=.1, # lowering this gives more frequent updates
verbose=True)
Built 16 features Elapsed: 04:55 | Progress: 100%|██████████| Remaining: 00:00
fm.head(5)
id | store_and_fwd_flag | trip_duration | passenger_count | vendor_id | CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong) | LATLONG_TO_CITY(pickup_latlong) | LATLONG_TO_CITY(dropoff_latlong) | IS_IN_GEOBOX(pickup_latlong, point1=(40.62, -73.85), point2=(40.7, -73.75)) | IS_IN_GEOBOX(dropoff_latlong, point1=(40.62, -73.85), point2=(40.7, -73.75)) | IS_IN_GEOBOX(pickup_latlong, point1=(40.76, -73.89), point2=(40.78, -73.85)) | IS_IN_GEOBOX(dropoff_latlong, point1=(40.76, -73.89), point2=(40.78, -73.85)) | IS_FEDERAL_HOLIDAY(pickup_datetime) | PART_OF_DAY(pickup_datetime) | SEASON(pickup_datetime) | NTH_WEEK_OF_MONTH(pickup_datetime) | QUARTER(pickup_datetime)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
id0000001 | False | 1105 | 1 | 2 | 4.457185 | New York City | Long Island City | False | False | False | False | False | Morning | summer | 3.0 | 2 |
id0000003 | False | 1046 | 5 | 2 | 1.770763 | Weehawken | Hoboken | False | False | False | False | False | Morning | spring | 3.0 | 1 |
id0000005 | False | 368 | 1 | 2 | 0.904869 | Manhattan | Manhattan | False | False | False | False | False | Morning | spring | 5.0 | 2 |
id0000008 | False | 303 | 1 | 1 | 0.967836 | New York City | New York City | False | False | False | False | False | Morning | summer | 3.0 | 2 |
id0000009 | False | 547 | 1 | 1 | 4.147816 | Manhattan | Manhattan | False | False | False | False | False | Night | spring | 2.0 | 2 |
Using the primitives above, we created many new features to feed into our machine learning algorithm. Because some of these features are categorical, we will one-hot encode them using Featuretools before continuing.
fm_encoded, features_encoded = ft.encode_features(fm, features, top_n=5, verbose=True, include_unknown=False)
Encoding pass 1: 100%|██████████| 16/16 [00:08<00:00, 1.96feature/s] Encoding pass 2: 100%|██████████| 35/35 [00:00<00:00, 235.92feature/s]
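For intuition, the top_n behavior is roughly equivalent to keeping each column's five most common values and one-hot encoding those. A rough pandas sketch for a single column (not what encode_features actually does internally):

# rough equivalent of top_n=5 one-hot encoding for one categorical column
col = fm["PART_OF_DAY(pickup_datetime)"].astype("object")
top_values = col.value_counts().index[:5]
pd.get_dummies(col.where(col.isin(top_values)), prefix=col.name).head()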
After applying Featuretools, we have a feature matrix of all numeric data that is ready for machine learning.
The final step is to apply a log transform to the trip durations. This lets the model better distinguish among short trips during training; we can later undo the transform to generate final predictions.
X = fm_encoded.copy()
y = (X.pop('trip_duration') + 1).apply(np.log)
To validate our model, we will do a simple train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)
Now, we are ready to train and score our model. For the purposes of this example, we will not perform any hyperparameter tuning.
estimator = RandomForestRegressor(n_estimators=100,
n_jobs=-1,
random_state=0,
verbose=True)
estimator.fit(X_train, y_train)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 58.9s [Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 2.5min finished
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=0, verbose=True, warm_start=False)
Using the trained model, we can look at the mean squared error on the test set.
y_pred = estimator.predict(X_test)
mean_squared_error(y_pred, y_test)
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers. [Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 6.1s [Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 15.9s finished
0.2237187249878536
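Since the targets are log-transformed durations, this MSE is effectively a mean squared log error. Taking its square root gives a typical error in log-units, which can be read as a multiplicative factor on the predicted duration. A quick sketch:

rmse_log = np.sqrt(mean_squared_error(y_pred, y_test))
print("RMSE in log-units:", round(rmse_log, 3))
print("typical multiplicative error factor:", round(np.exp(rmse_log), 2))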
Featuretools primitives are valuable because they transform raw data (e.g., dates, latitudes and longitudes) into meaningful attributes a machine learning model can learn from. Compared to other feature engineering techniques, the resulting features are also more interpretable to humans looking to understand the model.
To understand the model better, let's take a look at the most important features discovered when we trained the random forest.
importances = pd.DataFrame(list(zip(X.columns, estimator.feature_importances_)),
                           columns=["feature", "importance"]).sort_values("importance", ascending=False)
importances.head(10)
 | feature | importance
---|---|---
8 | CITYBLOCK_DISTANCE(dropoff_latlong, pickup_lat... | 0.797597 |
31 | NTH_WEEK_OF_MONTH(pickup_datetime) | 0.041128 |
27 | PART_OF_DAY(pickup_datetime) = Night | 0.011538 |
6 | vendor_id = 2 | 0.011130 |
7 | vendor_id = 1 | 0.010091 |
1 | passenger_count = 1 | 0.008714 |
28 | SEASON(pickup_datetime) = spring | 0.008686 |
24 | PART_OF_DAY(pickup_datetime) = Afternoon | 0.008539 |
2 | passenger_count = 2 | 0.007104 |
30 | SEASON(pickup_datetime) = summer | 0.006361 |
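The importances can also be plotted for a quick visual comparison, using the matplotlib setup imported above. A small sketch:

# horizontal bar chart of the ten most important features
ax = importances.head(10).plot.barh(x="feature", y="importance", legend=False)
ax.invert_yaxis()  # most important feature on top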
As you might expect, a majority of the top features are the result of applying the premium primitives. Let's take a closer look at a few of them.
Unsurprisingly, the city block distance of the trip is the most important feature: the longer the trip's distance, the longer it tends to take. However, as we will see below, this isn't always the case.
fm.sample(1000).plot.scatter(x='CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong)', y ='trip_duration')
[Scatter plot: trip_duration vs. CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong)]
Average trip duration also varies with the season. This may be because people are more likely to take a taxi when it is cold outside in New York City.
fm.groupby("SEASON(pickup_datetime)")["trip_duration"].mean().plot.bar(title="Trip duration by Season")
[Bar chart: Trip duration by Season]
Looking at the time of day, trips in the afternoon take longer on average even though they cover less distance than trips at night. This is counter to what we saw earlier, that longer-distance trips take more time, and it is why it is important to extract numerous features from your data so your model can learn multivariate relationships. The likely explanation is that there is more traffic in the afternoon than at night.
fig, axs = plt.subplots(ncols=2, figsize=(15, 4))
fm.groupby("PART_OF_DAY(pickup_datetime)")["trip_duration"].mean().plot.bar(title="Trip duration by Time of Day", ax=axs[0])
fm.groupby("PART_OF_DAY(pickup_datetime)")["CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong)"].mean().plot.bar(title="Trip Distance by Time of Day", ax=axs[1])
[Bar charts: Trip duration by Time of Day (left); Trip Distance by Time of Day (right)]
Trips that begin on a federal holiday also tend to take less time. This may be because fewer people are on the road when it is a holiday.
fm.groupby("IS_FEDERAL_HOLIDAY(pickup_datetime)")["trip_duration"].mean().plot.bar(title="Trip duration by Is Federal Holiday")
[Bar chart: Trip duration by Is Federal Holiday]
Once we are happy with our model, we can apply it to new data. Below we load in over 600,000 new trips where we don't know the duration (note: trip_duration is no longer a column in the data).
es_test = ft.entityset.read_entityset("s3://featurelabs-static/nyc_taxi_entityset_test.tar")
es_test.plot()
We can use the feature definitions that we created before to perform the same feature engineering on this dataset. You can read more about saving and loading feature definitions here.
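For example, here is a minimal sketch of persisting and reloading the definitions with ft.save_features and ft.load_features (the filename is illustrative):

# save the encoded feature definitions to disk, then load them back
ft.save_features(features_encoded, "taxi_features.json")
features_reloaded = ft.load_features("taxi_features.json")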
# remove trip_duration from the features to calculate
features = [f for f in features_encoded if f.get_name() != "trip_duration"]
fm_test = ft.calculate_feature_matrix(entityset=es_test,
features=features,
chunk_size=.1,
verbose=True)
Elapsed: 02:17 | Progress: 100%|██████████| Remaining: 00:00
With the new feature matrix in hand, we are ready to apply our previously trained estimator to generate predictions.
preds = estimator.predict(fm_test)
preds = np.exp(preds) - 1 # undo log transform
preds = pd.Series(preds, index=fm_test.index)
preds
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers. [Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 4.2s [Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 12.2s finished
id
id0000002    1026.497582
id0000006     681.823890
id0000007     791.126881
id0000017    1029.409864
id0000018    2454.557293
                 ...
id3999960     562.255700
id3999966    1464.219648
id3999967     244.016590
id3999981     649.003684
id3999997    1333.935868
Length: 625134, dtype: float64
Featuretools was created by the developers at Feature Labs. If building impactful data science pipelines is important to you or your business, please get in touch.