The following tutorial illustrates using premium primitives in Featuretools to predict the duration of a taxi trip in New York City. An accurate predictive model would give passengers an informative time estimate before they begin their trip.
In this notebook we will:
- Load an entity set of NYC taxi trip data
- Use premium primitives and Deep Feature Synthesis to engineer features from geospatial and temporal data
- Train and evaluate a random forest model that predicts trip duration
- Reuse the feature definitions and trained model to make predictions on new data
To learn more about Featuretools, visit our documentation.
import featuretools as ft
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
First, we load in a copy of the data. It is 175 MB, so it may take a few minutes to download.
es = ft.entityset.read_entityset("s3://featurelabs-static/nyc_taxi_entityset_train.tar")
In this entity set, there is data on nearly 1.5 million taxi trips in New York City over a period of several months. For each trip, we have a handful of columns, shown below.
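One way to view these columns is to inspect the underlying dataframe directly. A minimal sketch, assuming the pre-1.0 Featuretools entity API used by the rest of this notebook:

# peek at the raw columns stored on the trips entity
es["trips"].df.head()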
With graphviz installed, we can generate a visualization of the entity set.
es.plot()
The primary data types for this problem are geospatial (latitude and longitude) and temporal. By default, most machine learning algorithms have a difficult time processing these data types. Therefore, in order to get the most out of this data, we need to perform feature engineering to extract predictive signal before applying machine learning.
Featuretools has several premium primitives that can be used to assist with preparing this data as numeric feature vectors for machine learning.
Below we've selected several primitives that apply to the data types in this dataset. To learn more about any of the primitives, click on the links to view the documentation.
City Block Distance - Cars cannot travel diagonally through a city block, so this primitive gives a more accurate estimate than straight-line distance of how far the passenger must travel during their trip (see the sketch after this list).
Lat Long To City - An important factor in the length of a trip is where it begins and ends. This primitive converts the pickup and drop-off locations to a city or borough, e.g., a trip that begins in Manhattan but ends in Brooklyn must cross the East River.
Is In Geo Box - Trips starting or ending near points of interest can also be relevant. We can use a geobox to detect trips that start or end within a couple of important areas of New York City that see a lot of taxi traffic.
Part Of Day - Traffic conditions greatly affect the duration of a trip. We know traffic varies by time of day, so we can use this primitive to extract whether the trip occurs during the morning, afternoon, evening, or night.
Is Federal Holiday - A typical Monday morning may have heavy traffic going into the city, but if it is a federal holiday, the traffic conditions are likely lighter.
Season, Quarter - The weather outside may determine street conditions. Using these primitives we can extract the time of year. Note: this demo data only spans a few months, but these primitives may be very relevant when we expand the dataset.
Nth Week Of Month - We also extract which week of the month the trip falls in; as we will see, this turns out to be one of the most important features in the trained model.
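To build intuition for the distance primitive, city block distance can be approximated as the sum of the north-south and east-west great-circle legs between pickup and drop-off. Below is a minimal sketch of that idea in plain Python; the helper names and sample coordinates are illustrative, not the primitive's actual implementation.

from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points, in miles
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959 * asin(sqrt(a))

def cityblock_miles(lat1, lon1, lat2, lon2):
    # north-south leg plus east-west leg, like driving along a street grid
    return (haversine_miles(lat1, lon1, lat2, lon1) +
            haversine_miles(lat2, lon1, lat2, lon2))

cityblock_miles(40.75, -73.99, 40.78, -73.95)  # illustrative Manhattan trip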
Next, we import the primitives and configure the ones that take parameters:
from featuretools.primitives import (CityblockDistance, LatLongToCity, IsInGeoBox, PartOfDay,
IsFederalHoliday, Season, NthWeekOfMonth, Quarter)
trans_primitives = [CityblockDistance,
LatLongToCity,
IsInGeoBox((40.62, -73.85), (40.70, -73.75)), # JFK Airport
IsInGeoBox((40.76, -73.89), (40.78, -73.85)), # La Guardia Airport
IsFederalHoliday,
PartOfDay,
Season,
NthWeekOfMonth,
Quarter]
Now, we can create the feature matrix using Deep Feature Synthesis.
fm, features = ft.dfs(entityset=es,
target_entity="trips",
trans_primitives=trans_primitives,
chunk_size=.1, # lowering this gives more frequent updates
verbose=True)
Built 16 features Elapsed: 04:55 | Progress: 100%|██████████| Remaining: 00:00
fm.head(5)
id | store_and_fwd_flag | trip_duration | passenger_count | vendor_id | CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong) | LATLONG_TO_CITY(pickup_latlong) | LATLONG_TO_CITY(dropoff_latlong) | IS_IN_GEOBOX(pickup_latlong, point1=(40.62, -73.85), point2=(40.7, -73.75)) | IS_IN_GEOBOX(dropoff_latlong, point1=(40.62, -73.85), point2=(40.7, -73.75)) | IS_IN_GEOBOX(pickup_latlong, point1=(40.76, -73.89), point2=(40.78, -73.85)) | IS_IN_GEOBOX(dropoff_latlong, point1=(40.76, -73.89), point2=(40.78, -73.85)) | IS_FEDERAL_HOLIDAY(pickup_datetime) | PART_OF_DAY(pickup_datetime) | SEASON(pickup_datetime) | NTH_WEEK_OF_MONTH(pickup_datetime) | QUARTER(pickup_datetime)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
id0000001 | False | 1105 | 1 | 2 | 4.457185 | New York City | Long Island City | False | False | False | False | False | Morning | summer | 3.0 | 2 |
id0000003 | False | 1046 | 5 | 2 | 1.770763 | Weehawken | Hoboken | False | False | False | False | False | Morning | spring | 3.0 | 1 |
id0000005 | False | 368 | 1 | 2 | 0.904869 | Manhattan | Manhattan | False | False | False | False | False | Morning | spring | 5.0 | 2 |
id0000008 | False | 303 | 1 | 1 | 0.967836 | New York City | New York City | False | False | False | False | False | Morning | summer | 3.0 | 2 |
id0000009 | False | 547 | 1 | 1 | 4.147816 | Manhattan | Manhattan | False | False | False | False | False | Night | spring | 2.0 | 2 |
Using the primitives above, we created many new features to feed into our machine learning algorithm. Because some of these features are categorical, we will one-hot encode them using Featuretools before continuing.
fm_encoded, features_encoded = ft.encode_features(fm, features, top_n=5, verbose=True, include_unknown=False)
Encoding pass 1: 100%|██████████| 16/16 [00:08<00:00, 1.96feature/s] Encoding pass 2: 100%|██████████| 35/35 [00:00<00:00, 235.92feature/s]
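For intuition, the top_n behavior is roughly equivalent to keeping each column's five most common values and one-hot encoding those. A rough pandas sketch for a single column (not what encode_features actually does internally):

# rough equivalent of top_n=5 one-hot encoding for one categorical column
col = fm["PART_OF_DAY(pickup_datetime)"].astype("object")
top_values = col.value_counts().index[:5]
pd.get_dummies(col.where(col.isin(top_values)), prefix=col.name).head()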
After applying Featuretools, we have a feature matrix of all numeric data that is ready for machine learning.
The final step is to apply a log transform to the trip durations. This lets the model better distinguish among short trips during training; we can later undo the transform to generate final predictions.
X = fm_encoded.copy()
y = (X.pop('trip_duration') + 1).apply(np.log)
To validate our model, we will do a simple train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)
Now, we are ready to train and score our model. For the purposes of this example, we will not perform any hyperparameter tuning.
estimator = RandomForestRegressor(n_estimators=100,
n_jobs=-1,
random_state=0,
verbose=True)
estimator.fit(X_train, y_train)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 58.9s [Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 2.5min finished
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=0, verbose=True, warm_start=False)
Using the trained model, we can look at the mean squared error on the test set.
y_pred = estimator.predict(X_test)
mean_squared_error(y_pred, y_test)
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers. [Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 6.1s [Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 15.9s finished
0.2237187249878536
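Since the targets are log-transformed durations, this MSE is effectively a mean squared log error. Taking its square root gives a typical error in log-units, which can be read as a multiplicative factor on the predicted duration. A quick sketch:

rmse_log = np.sqrt(mean_squared_error(y_pred, y_test))
print("RMSE in log-units:", round(rmse_log, 3))
print("typical multiplicative error factor:", round(np.exp(rmse_log), 2))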
Featuretools primitives are valuable because they transform raw data (e.g., dates, latitudes and longitudes) into meaningful attributes a machine learning model can learn from. Compared to other feature engineering techniques, the resulting features are also more interpretable to humans looking to understand the model.
To understand the model better, let's take a look at the most important features discovered when we trained the random forest.
importances = pd.DataFrame(list(zip(X.columns, estimator.feature_importances_)),
                           columns=["feature", "importance"]).sort_values("importance", ascending=False)
importances.head(10)
 | feature | importance
---|---|---
8 | CITYBLOCK_DISTANCE(dropoff_latlong, pickup_lat... | 0.797597 |
31 | NTH_WEEK_OF_MONTH(pickup_datetime) | 0.041128 |
27 | PART_OF_DAY(pickup_datetime) = Night | 0.011538 |
6 | vendor_id = 2 | 0.011130 |
7 | vendor_id = 1 | 0.010091 |
1 | passenger_count = 1 | 0.008714 |
28 | SEASON(pickup_datetime) = spring | 0.008686 |
24 | PART_OF_DAY(pickup_datetime) = Afternoon | 0.008539 |
2 | passenger_count = 2 | 0.007104 |
30 | SEASON(pickup_datetime) = summer | 0.006361 |
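The importances can also be plotted for a quick visual comparison, using the matplotlib setup imported above. A small sketch:

# horizontal bar chart of the ten most important features
ax = importances.head(10).plot.barh(x="feature", y="importance", legend=False)
ax.invert_yaxis()  # most important feature on top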
As you might expect, a majority of the top features are the result of applying the premium primitives. Let's take a closer look at a few of them.
Unsurprisingly, the city block distance of the trip is the most important feature: the longer the trip's distance, the longer it tends to take. However, as we will see below, this isn't always the case.
fm.sample(1000).plot.scatter(x='CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong)', y ='trip_duration')
[Scatter plot: trip_duration vs. CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong)]
Average trip duration also varies with the season. This may be because people are more likely to take a taxi when it is cold outside in New York City.
fm.groupby("SEASON(pickup_datetime)")["trip_duration"].mean().plot.bar(title="Trip duration by Season")
[Bar chart: Trip duration by Season]
Looking at the time of day, trips in the afternoon take longer on average even though they cover less distance than trips at night. This is counter to what we saw earlier, that longer-distance trips take more time, and it is why it is important to extract numerous features from your data so your model can learn multivariate relationships. The likely explanation is that there is more traffic in the afternoon than at night.
fig, axs = plt.subplots(ncols=2, figsize=(15, 4))
fm.groupby("PART_OF_DAY(pickup_datetime)")["trip_duration"].mean().plot.bar(title="Trip duration by Time of Day", ax=axs[0])
fm.groupby("PART_OF_DAY(pickup_datetime)")["CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong)"].mean().plot.bar(title="Trip Distance by Time of Day", ax=axs[1])
[Bar charts: Trip duration by Time of Day (left); Trip Distance by Time of Day (right)]
Trips that begin on a federal holiday also tend to take less time. This may be because fewer people are on the road when it is a holiday.
fm.groupby("IS_FEDERAL_HOLIDAY(pickup_datetime)")["trip_duration"].mean().plot.bar(title="Trip duration by Is Federal Holiday")
[Bar chart: Trip duration by Is Federal Holiday]
Once we are happy with our model, we can apply it to new data. Below we load in over 600,000 new trips where we don't know the duration (note: trip_duration is no longer a column in the data).
es_test = ft.entityset.read_entityset("s3://featurelabs-static/nyc_taxi_entityset_test.tar")
es_test.plot()
We can use the feature definitions that we created before to perform the same feature engineering on this dataset. You can read more about saving and loading feature definitions here.
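For example, here is a minimal sketch of persisting and reloading the definitions with ft.save_features and ft.load_features (the filename is illustrative):

# save the encoded feature definitions to disk, then load them back
ft.save_features(features_encoded, "taxi_features.json")
features_reloaded = ft.load_features("taxi_features.json")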
# remove trip_duration from the features to calculate
features = [f for f in features_encoded if f.get_name() != "trip_duration"]
fm_test = ft.calculate_feature_matrix(entityset=es_test,
features=features,
chunk_size=.1,
verbose=True)
Elapsed: 02:17 | Progress: 100%|██████████| Remaining: 00:00
With the new feature matrix in hand, we are ready to apply our previously trained estimator to generate predictions.
preds = estimator.predict(fm_test)
preds = np.exp(preds) - 1 # undo log transform
preds = pd.Series(preds, index=fm_test.index)
preds
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers. [Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 4.2s [Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 12.2s finished
id
id0000002    1026.497582
id0000006     681.823890
id0000007     791.126881
id0000017    1029.409864
id0000018    2454.557293
                 ...
id3999960     562.255700
id3999966    1464.219648
id3999967     244.016590
id3999981     649.003684
id3999997    1333.935868
Length: 625134, dtype: float64
Featuretools was created by the developers at Feature Labs. If building impactful data science pipelines is important to you or your business, please get in touch.