In this notebook, we'll use the Gradient Boosting Regressor in Scikit-Learn to produce prediction intervals in addition to a single estimate of the target. Prediction intervals are useful when we want to show the uncertainty inherent in any prediction.
We're using a fairly typical data science stack, plus a few plotting and widget libraries for making interactive visualizations.
# Data Manipulation
import pandas as pd
import numpy as np
# Modeling
from sklearn.base import BaseEstimator
from sklearn.ensemble import GradientBoostingRegressor
# File finding
import glob
files = glob.glob('data/*_energy_data.csv')
# Interactivity
from ipywidgets import interact, widgets
# Visualization
# Plotly
import plotly.graph_objs as go
from plotly.offline import iplot, plot, init_notebook_mode
init_notebook_mode(connected=True)
import plotly_express as px
# cufflinks is a wrapper on plotly
import cufflinks as cf
cf.go_offline(connected=True)
For our modeling, we're going to be predicting building energy consumption. This is an important problem that we at Cortex Building Intelligence work on every day, and it makes for a good supervised regression task. The energy data is measured every 15 minutes and includes 3 weather variables related to energy consumption: temperature, irradiance, and relative humidity.
Although in a real application we'd be pulling data from a database (study up on SQL), here we'll just load it from a csv. This is data from the DrivenData Energy Forecasting competition; you can find the original data at DrivenData. I've cleaned up the datasets and extracted 8 features that allow us to predict the energy consumption fairly accurately.
We're not going to spend any time on feature engineering or investigating the data, but know that both should be part of any data science pipeline.
data = pd.read_csv(files[2], parse_dates=['timestamp'], index_col='timestamp').sort_index()
data.head()
data = data.rename(columns={"energy": "actual"})
| timestamp | energy | business_day | temperature | irradiance | relative_humidity | day_of_week | time_of_day | day_of_year | year |
|---|---|---|---|---|---|---|---|---|---|
| 2015-01-01 05:15:00+00:00 | 2.18 | 1 | 6.500 | 0.0 | 60.240 | 3 | 5.25 | 1 | 2015 |
| 2015-01-01 05:30:00+00:00 | 2.24 | 1 | 6.501 | -0.0 | 60.611 | 3 | 5.50 | 1 | 2015 |
| 2015-01-01 05:45:00+00:00 | 1.90 | 1 | 6.497 | 0.0 | 61.006 | 3 | 5.75 | 1 | 2015 |
| 2015-01-01 06:00:00+00:00 | 1.92 | 1 | 6.511 | -0.0 | 61.431 | 3 | 6.00 | 1 | 2015 |
| 2015-01-01 06:15:00+00:00 | 1.72 | 1 | 6.443 | 0.0 | 61.301 | 3 | 6.25 | 1 | 2015 |
With ipywidgets and plotly (through cufflinks), it's easy to create an interactive plot in a few lines of code. We can take a look at the energy consumption on different timescales.
# Create a subset of data for plotting
data_to_plot = data.loc["2015"].copy()
def plot_timescale(timescale, selection, theme):
"""
Plot the energy consumption on different timescales (day, week, month).
:param timescale: the timescale to use
:param selection: the numeric value of the timescale selection (for example the 15th day
of the year or the 1st week of the year)
:param theme: aesthetics of plot
"""
# Subset based on timescale and selection
subset = data_to_plot.loc[
getattr(data_to_plot.index, timescale) == selection, "actual"
].copy()
if subset.empty:
print("Choose another selection")
return
# Make an interactive plot
fig = subset.iplot(
title=f"Energy for {selection} {timescale.title()}", theme=theme, asFigure=True
)
fig['layout']['height'] = 900
fig['layout']['width'] = 1400
iplot(fig)
_ = interact(
plot_timescale,
timescale=widgets.RadioButtons(
options=["dayofyear", "week", "month"], value="dayofyear"
),
# Selection
selection=widgets.IntSlider(value=16, min=0, max=365),
theme=widgets.Select(options=cf.themes.THEMES.keys(), value='ggplot')
)
Clearly, there are different patterns in energy usage over the course of a day, week, and month. We can also look at longer timescales.
data.loc['2015-01-01':'2015-07-01', "actual"].iplot(layout=dict(title='2015 Energy Consumption', height=800))
Plotting this much data can make the notebook slow. Instead, we can resample the data and plot it to see any long-term trends.
data.resample('12 H')["actual"].mean().iplot(layout=dict(title='Energy Data Resampled at 12 Hours', height=800,
yaxis=dict(title='kWh')))
If this were a real application, we'd probably want to spend more time understanding the data and checking for outliers. However, the main focus here is predicting intervals, so let's dive into the modeling.
This code is based on an example from Scikit-Learn. The basic idea is straightforward:

1. For the lower prediction, use a GradientBoostingRegressor with loss='quantile' and alpha=lower_quantile (for example, 0.1)
2. For the upper prediction, use a GradientBoostingRegressor with loss='quantile' and alpha=upper_quantile (for example, 0.9)
3. For the mid prediction, use a GradientBoostingRegressor with loss='quantile' and alpha=0.5 to predict the median, or use the same model with the default loss='ls' (for least squares), which predicts the mean

The loss refers to the metric which is optimized by the model. We won't get into the details right here (see Quantile Loss Explained below), but for more information on the quantile loss and regression, this article is probably the best place to start: https://towardsdatascience.com/quantile-regression-from-linear-models-to-trees-to-deep-learning-af3738b527c3. The Wikipedia page on Quantile Regression goes into slightly more detail. For the original explanation of the Gradient Boosting model, see Friedman's 1999 paper "Greedy Function Approximation: A Gradient Boosting Machine".
You don't need to know the details to implement the prediction intervals, although they can be useful for tuning the model. Now, let's make the predictions; later in the notebook we'll take a high-level look at the loss function.
# Train and test sets
X_train = data.loc["2015":"2016"].copy()
X_test = data.loc["2017":].copy()
y_train = X_train.pop("actual")
y_test = X_test.pop("actual")
assert X_train.index.max() < X_test.index.min()
X_train.tail()
X_test.head()
| timestamp | business_day | temperature | irradiance | relative_humidity | day_of_week | time_of_day | day_of_year | year |
|---|---|---|---|---|---|---|---|---|
| 2016-12-31 22:45:00+00:00 | 0 | 12.797 | 0.010 | 86.675 | 5 | 22.75 | 366 | 2016 |
| 2016-12-31 23:00:00+00:00 | 0 | 12.760 | -0.003 | 86.109 | 5 | 23.00 | 366 | 2016 |
| 2016-12-31 23:15:00+00:00 | 0 | 12.633 | 0.001 | 85.309 | 5 | 23.25 | 366 | 2016 |
| 2016-12-31 23:30:00+00:00 | 0 | 12.604 | -0.000 | 84.200 | 5 | 23.50 | 366 | 2016 |
| 2016-12-31 23:45:00+00:00 | 0 | 12.545 | 0.000 | 83.175 | 5 | 23.75 | 366 | 2016 |
| timestamp | business_day | temperature | irradiance | relative_humidity | day_of_week | time_of_day | day_of_year | year |
|---|---|---|---|---|---|---|---|---|
| 2017-01-01 00:00:00+00:00 | 0 | 12.501 | -0.0 | 82.095 | 6 | 0.00 | 1 | 2017 |
| 2017-01-01 00:15:00+00:00 | 0 | 12.446 | 0.0 | 81.159 | 6 | 0.25 | 1 | 2017 |
| 2017-01-01 00:30:00+00:00 | 0 | 12.400 | -0.0 | 80.325 | 6 | 0.50 | 1 | 2017 |
| 2017-01-01 00:45:00+00:00 | 0 | 12.349 | 0.0 | 79.465 | 6 | 0.75 | 1 | 2017 |
| 2017-01-01 01:00:00+00:00 | 0 | 12.290 | -0.0 | 78.649 | 6 | 1.00 | 1 | 2017 |
We're assuming we know all the weather in the test set, which isn't entirely realistic, but we'll use it for now!
# Set lower and upper quantile
LOWER_ALPHA = 0.15
UPPER_ALPHA = 0.85
N_ESTIMATORS = 100
MAX_DEPTH = 5
# Each model has to be separate
lower_model = GradientBoostingRegressor(
loss="quantile", alpha=LOWER_ALPHA, n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH
)
# The mid model will use the default
mid_model = GradientBoostingRegressor(loss="ls", n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH)
upper_model = GradientBoostingRegressor(
loss="quantile", alpha=UPPER_ALPHA, n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH
)
The models are trained by optimizing for the specific loss function, which means we have to build 3 separate models to predict the different objectives. A downside of this method is that it's a little slow, particularly because training can't be parallelized within the Scikit-Learn GradientBoostingRegressor. If you wanted, you could rewrite this code to train each model on a separate processor (using multiprocessing).
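As a sketch of that idea (not part of the original notebook), here is how the parallel training might look with joblib, which is installed alongside Scikit-Learn, instead of raw multiprocessing; the choice of n_jobs=3 simply assumes three cores are free.
# Hypothetical sketch: train the three models in separate processes with joblib
from joblib import Parallel, delayed

def fit_model(model, X, y):
    # GradientBoostingRegressor.fit returns the fitted estimator
    return model.fit(X, y)

# Each model is fit in its own process; Parallel hands back the fitted models
lower_model, mid_model, upper_model = Parallel(n_jobs=3)(
    delayed(fit_model)(model, X_train, y_train)
    for model in [lower_model, mid_model, upper_model]
)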
_ = lower_model.fit(X_train, y_train)
_ = mid_model.fit(X_train, y_train)
_ = upper_model.fit(X_train, y_train)
With the models all trained, we now make predictions and record them with the true values.
predictions = pd.DataFrame(y_test)
predictions['lower'] = lower_model.predict(X_test)
predictions['mid'] = mid_model.predict(X_test)
predictions['upper'] = upper_model.predict(X_test)
assert (predictions['upper'] > predictions['lower']).all()
predictions.tail()
| timestamp | actual | lower | mid | upper |
|---|---|---|---|---|
| 2017-10-01 03:00:00+00:00 | 1.32 | 1.401697 | 2.332507 | 3.164036 |
| 2017-10-01 03:15:00+00:00 | 0.92 | 1.334960 | 2.159851 | 2.916450 |
| 2017-10-01 03:30:00+00:00 | 1.00 | 1.294453 | 1.762680 | 2.719243 |
| 2017-10-01 03:45:00+00:00 | 1.26 | 1.254313 | 1.695385 | 2.728570 |
| 2017-10-01 04:00:00+00:00 | 1.18 | 1.237651 | 1.657526 | 2.248523 |
The best way to inspect these results is visually. Here we use plotly to make a filled area chart showing the prediction intervals.
def plot_intervals(predictions, mid=False, start=None, stop=None, title=None):
"""
Function for plotting prediction intervals as filled area chart.
:param predictions: dataframe of predictions with lower, upper, and actual columns (named for the target)
:param mid: boolean for whether to show the mid prediction
:param start: optional parameter for subsetting start of predictions
:param stop: optional parameter for subsetting end of predictions
:param title: optional string title
:return fig: plotly figure
"""
# Subset if required
predictions = (
predictions.loc[start:stop].copy()
if start is not None or stop is not None
else predictions.copy()
)
data = []
# Lower trace will fill to the upper trace
trace_low = go.Scatter(
x=predictions.index,
y=predictions["lower"],
fill="tonexty",
line=dict(color="darkblue"),
fillcolor="rgba(173, 216, 230, 0.4)",
showlegend=True,
name="lower",
)
# Upper trace has no fill
trace_high = go.Scatter(
x=predictions.index,
y=predictions["upper"],
fill=None,
line=dict(color="orange"),
showlegend=True,
name="upper",
)
# Must append high trace first so low trace fills to the high trace
data.append(trace_high)
data.append(trace_low)
if mid:
trace_mid = go.Scatter(
x=predictions.index,
y=predictions["mid"],
fill=None,
line=dict(color="green"),
showlegend=True,
name="mid",
)
data.append(trace_mid)
# Trace of actual values
trace_actual = go.Scatter(
x=predictions.index,
y=predictions["actual"],
fill=None,
line=dict(color="black"),
showlegend=True,
name="actual",
)
data.append(trace_actual)
# Layout with some customization
layout = go.Layout(
height=900,
width=1400,
title=dict(text="Prediction Intervals" if title is None else title),
yaxis=dict(title=dict(text="kWh")),
xaxis=dict(
rangeselector=dict(
buttons=list(
[
dict(count=1, label="1d", step="day", stepmode="backward"),
dict(count=7, label="1w", step="day", stepmode="backward"),
dict(count=1, label="1m", step="month", stepmode="backward"),
dict(count=1, label="YTD", step="year", stepmode="todate"),
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(step="all"),
]
)
),
rangeslider=dict(visible=True),
type="date",
),
)
fig = go.Figure(data=data, layout=layout)
# Make sure font is readable
fig["layout"]["font"] = dict(size=20)
fig.layout.template = "plotly_white"
return fig
# Example plot subsetted to one week
fig = plot_intervals(predictions, start="2017-03-01", stop="2017-03-08")
iplot(fig)
Not too bad for a first try! Play around with the model parameters and the alpha values to see how they affect the plot.
With any machine learning model, we want to quantify the error in the predictions, and quantifying the error of a prediction range can be tricky. We'll start with the percentage of the time that the actual value falls within the range. However, one way to maximize this metric would be to use extremely wide prediction intervals, so we also want to penalize the model for intervals that are too wide. As a simple example (let me know if you have ideas for a better metric), we can calculate the absolute error of the lower and upper bounds and average the two to get an interval absolute error. We can also calculate the absolute error of the mid predictions.
These are likely not the best metrics for all cases, so think about what your objective is before selecting a metric. For instance, you may want wide prediction intervals in which case you value the percent in bounds more than the error, but other times, you might want a narrow range of estimates.
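For reference, one established alternative (not used in this notebook) is the interval score, also called the Winkler score, which adds the width of the interval to a penalty for every actual value that falls outside the bounds. A minimal sketch, assuming the predictions dataframe built above:
# Hypothetical alternative metric: mean interval (Winkler) score
def interval_score(lower, upper, actual, alpha):
    """Mean interval score for a central (1 - alpha) prediction interval."""
    width = upper - lower
    below = (2 / alpha) * (lower - actual) * (actual < lower)
    above = (2 / alpha) * (actual - upper) * (actual > upper)
    return (width + below + above).mean()

# With LOWER_ALPHA = 0.15 and UPPER_ALPHA = 0.85, this is a 70% interval
print(interval_score(predictions['lower'], predictions['upper'],
                     predictions['actual'], alpha=0.3))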
def calculate_error(predictions):
"""
Calculate the absolute error associated with prediction intervals
:param predictions: dataframe of predictions
:return: None, modifies the prediction dataframe
"""
predictions['absolute_error_lower'] = (predictions['lower'] - predictions["actual"]).abs()
predictions['absolute_error_upper'] = (predictions['upper'] - predictions["actual"]).abs()
predictions['absolute_error_interval'] = (predictions['absolute_error_lower'] + predictions['absolute_error_upper']) / 2
predictions['absolute_error_mid'] = (predictions['mid'] - predictions["actual"]).abs()
predictions['in_bounds'] = predictions["actual"].between(left=predictions['lower'], right=predictions['upper'])
calculate_error(predictions)
metrics = predictions[['absolute_error_lower', 'absolute_error_upper', 'absolute_error_interval', 'absolute_error_mid', 'in_bounds']].copy()
metrics.describe()
| | absolute_error_lower | absolute_error_upper | absolute_error_interval | absolute_error_mid |
|---|---|---|---|---|
| count | 26225.000000 | 26225.000000 | 26225.000000 | 26225.000000 |
| mean | 0.817944 | 1.367838 | 1.092891 | 0.822518 |
| std | 0.779327 | 0.814335 | 0.454010 | 0.583407 |
| min | 0.000083 | 0.000169 | 0.124423 | 0.000007 |
| 25% | 0.256083 | 0.748173 | 0.830815 | 0.375973 |
| 50% | 0.566104 | 1.278106 | 1.062663 | 0.711893 |
| 75% | 1.119640 | 1.912011 | 1.245631 | 1.137873 |
| max | 4.707506 | 4.776094 | 3.964046 | 4.059814 |
We see the lower prediction has a smaller absolute error (in terms of the median). It's interesting that the absolute error for the lower bound is actually less than that for the middle prediction! We can write a short function to display the metrics.
def show_metrics(metrics):
"""
Make a boxplot of the metrics associated with prediction intervals
:param metrics: dataframe of metrics produced from calculate error
:return fig: plotly figure
"""
percent_in_bounds = metrics['in_bounds'].mean() * 100
metrics_to_plot = metrics[[c for c in metrics if 'absolute_error' in c]]
# Rename the columns
metrics_to_plot.columns = [column.split('_')[-1].title() for column in metrics_to_plot]
# Create a boxplot of the metrics
fig = px.box(
metrics_to_plot.melt(var_name="metric", value_name='Absolute Error'),
x="metric",
y="Absolute Error",
color='metric',
title=f"Error Metrics Boxplots In Bounds = {percent_in_bounds:.2f}%",
height=800,
width=1000,
points=False,
)
# Create new data with no legends
d = []
for trace in fig.data:
# Remove legend for each trace
trace['showlegend'] = False
d.append(trace)
# Make the plot look a little better
fig.data = d
fig['layout']['font'] = dict(size=20)
return fig
iplot(show_metrics(metrics))
If you increase the upper quantile and decrease the lower quantile, more actual values will be between the prediction bounds. Increasing the width of the prediction interval results in less precise bounds though.
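To see that effect, here is a minimal sketch (assuming the training data and constants from above) that refits the bound models at the 0.05 and 0.95 quantiles and recomputes the coverage.
# Hypothetical example: wider quantiles should capture more actual values
wide_lower = GradientBoostingRegressor(
    loss="quantile", alpha=0.05, n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH
).fit(X_train, y_train)
wide_upper = GradientBoostingRegressor(
    loss="quantile", alpha=0.95, n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH
).fit(X_train, y_train)
# Fraction of actual values falling between the new, wider bounds
wide_in_bounds = y_test.between(
    left=wide_lower.predict(X_test), right=wide_upper.predict(X_test)
)
print(f"Percent in bounds with 0.05/0.95 quantiles: {wide_in_bounds.mean() * 100:.2f}%")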
# Example plot subsetted to one week
fig = plot_intervals(predictions, mid=True, start="2017-03-01", stop="2017-03-08")
iplot(fig)
To make this process repeatable, we can build our own estimator with a Scikit-Learn interface that fits and predicts all 3 models in one call each. This is a very simple class, but it can be extended based on your needs (feel free to suggest improvements).
class GradientBoostingPredictionIntervals(BaseEstimator):
"""
Model that produces prediction intervals with a Scikit-Learn interface
:param lower_alpha: lower quantile for prediction, default=0.1
:param upper_alpha: upper quantile for prediction, default=0.9
:param **kwargs: additional keyword arguments for creating a GradientBoostingRegressor model
"""
def __init__(self, lower_alpha=0.1, upper_alpha=0.9, **kwargs):
self.lower_alpha = lower_alpha
self.upper_alpha = upper_alpha
# Three separate models
self.lower_model = GradientBoostingRegressor(
loss="quantile", alpha=self.lower_alpha, **kwargs
)
self.mid_model = GradientBoostingRegressor(loss="ls", **kwargs)
self.upper_model = GradientBoostingRegressor(
loss="quantile", alpha=self.upper_alpha, **kwargs
)
self.predictions = None
def fit(self, X, y):
"""
Fit all three models
:param X: train features
:param y: train targets
TODO: parallelize this code across processors
"""
        # Fit on the data passed in, not the global training set
        self.lower_model.fit(X, y)
        self.mid_model.fit(X, y)
        self.upper_model.fit(X, y)
        # Return self to allow chaining, following Scikit-Learn convention
        return self
def predict(self, X, y):
"""
Predict with all 3 models
:param X: test features
:param y: test targets
:return predictions: dataframe of predictions
TODO: parallelize this code across processors
"""
predictions = pd.DataFrame(y)
predictions["lower"] = self.lower_model.predict(X)
predictions["mid"] = self.mid_model.predict(X)
predictions["upper"] = self.upper_model.predict(X)
self.predictions = predictions
return predictions
def plot_intervals(self, mid=False, start=None, stop=None):
"""
Plot the prediction intervals
:param mid: boolean for whether to show the mid prediction
:param start: optional parameter for subsetting start of predictions
:param stop: optional parameter for subsetting end of predictions
:return fig: plotly figure
"""
        if self.predictions is None:
            raise ValueError("This model has not yet made predictions.")
        fig = plot_intervals(self.predictions, mid=mid, start=start, stop=stop)
return fig
def calculate_and_show_errors(self):
"""
Calculate and display the errors associated with a set of prediction intervals
:return fig: plotly boxplot of absolute error metrics
"""
        if self.predictions is None:
            raise ValueError("This model has not yet made predictions.")
calculate_error(self.predictions)
fig = show_metrics(self.predictions)
return fig
To use the model, we just fit and predict like any Scikit-Learn model. This time though, the predictions are a dataframe with intervals.
model = GradientBoostingPredictionIntervals(
lower_alpha=0.1, upper_alpha=0.9, n_estimators=50, max_depth=3
)
# Fit and make predictions
_ = model.fit(X_train, y_train)
predictions = model.predict(X_test, y_test)
We can show the interval and metric plot to assess the predictions.
metric_fig = model.calculate_and_show_errors()
iplot(metric_fig)
iplot(model.plot_intervals(mid=True, start='2017-06-05', stop='2017-06-12'))
With the implementation completed, let's take a brief look at how quantile regression works. The theory is not needed to produce answers, but it's useful to understand what's going on behind the scenes to improve the model.
The quantile loss for a predicted value is expressed as:

$$ \text{quantile loss} = \alpha \cdot (\text{actual} - \text{predicted}) \quad \text{if} \quad (\text{actual} - \text{predicted}) > 0$$

$$ \text{quantile loss} = (\alpha - 1) \cdot (\text{actual} - \text{predicted}) \quad \text{if} \quad (\text{actual} - \text{predicted}) < 0$$

I find this much easier to parse in Python code.
def calculate_quantile_loss(quantile, actual, predicted):
"""
Quantile loss for a given quantile and prediction
"""
return np.maximum(quantile * (actual - predicted), (quantile - 1) * (actual - predicted))
When we graph the quantile loss versus the error, the weighting of the errors appears as the slope.
Let's walk through an example using lower quantile = 0.1, upper quantile = 0.9, and an actual value of 10. Taking predictions of 8 (too low) and 12 (too high) for illustration, there are four possibilities:

1. Quantile = 0.1, predicted = 8: loss = 0.1 * (10 - 8) = 0.2
2. Quantile = 0.1, predicted = 12: loss = (0.1 - 1) * (10 - 12) = 1.8
3. Quantile = 0.9, predicted = 8: loss = 0.9 * (10 - 8) = 1.8
4. Quantile = 0.9, predicted = 12: loss = (0.9 - 1) * (10 - 12) = 0.2

For cases where the quantile > 0.5, we penalize low predictions more heavily. For cases where the quantile < 0.5, we penalize high predictions more heavily. If the quantile == 0.5, the weighting is the same for both low and high predictions; at quantile == 0.5, we are predicting the median.
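As a quick check on that arithmetic, we can run the four cases through the calculate_quantile_loss function defined above (the predictions of 8 and 12 are just illustrative values).
# Verify the four cases: actual = 10, predictions of 8 and 12
for quantile in [0.1, 0.9]:
    for predicted in [8, 12]:
        loss = calculate_quantile_loss(quantile, 10, predicted)
        print(f"quantile = {quantile}, predicted = {predicted}, loss = {loss:.1f}")
# quantile = 0.1, predicted = 8, loss = 0.2
# quantile = 0.1, predicted = 12, loss = 1.8
# quantile = 0.9, predicted = 8, loss = 1.8
# quantile = 0.9, predicted = 12, loss = 0.2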
The model is fit by minimizing the loss function. Through changing the quantile, we can produce predictions corresponding to prediction intervals.
def plot_quantile_loss(actual, prediction_list, quantile_list, plot_ls=False):
"""
Shows the quantile loss associated with predictions at different quantiles.
Figure shows the loss versus the error
:param actual: array-like of actual values
:param prediction_list: list of array-like predictions
:param quantile_list: list of float quantiles corresponding to the predictions
:param plot_ls: whether to plot the least squares loss
:return fig: plotly figure
"""
data = []
# Iterate through each combination of prediction and quantile
for predictions, quantile in zip(prediction_list, quantile_list):
# Calculate the loss
quantile_loss = calculate_quantile_loss(quantile, actual, predictions)
errors = actual - predictions
# Sort errors and loss by error
idx = np.argsort(errors)
errors = errors[idx]; quantile_loss = quantile_loss[idx]
# Add data to plot
data.append(go.Scatter(mode="lines", x=errors, y=quantile_loss, line=dict(width=4), name=f"{quantile} Quantile"))
if plot_ls:
loss = np.square(predictions - actual)
errors = actual - predictions
# Sort errors and loss by error
idx = np.argsort(errors)
errors = errors[idx]; loss = loss[idx]
# Add data to plot
data.append(go.Scatter(mode="lines", x=errors, y=loss, line=dict(width=4), name="Least Squares"))
# Simple plot layout
layout = go.Layout(
title="Quantile Loss vs Error",
yaxis=dict(title="Loss"),
xaxis=dict(title="Error"),
width=1000, height=600,
)
fig = go.Figure(data=data, layout=layout)
fig['layout']['font'] = dict(size=18)
return fig
# Make dummy predictions and actual values
predictions = np.arange(-2.1, 2.1, step=0.1)
actual = np.zeros(len(predictions))
# Create a plot showing the same predictions at different quantiles
fig = plot_quantile_loss(actual, [predictions, predictions, predictions], [0.1, 0.5, 0.9], False)
iplot(fig)
We can see how the quantile loss is asymmetric, with a weighting (slope) equal to the quantile value or to (quantile - 1), depending on whether the error is positive or negative. With an error defined as (actual - predicted), for a quantile greater than 0.5 we penalize positive errors more, and for a quantile less than 0.5 we penalize negative errors more. This drives the predictions with a higher quantile higher than the actual value, and predictions with a lower quantile lower than the actual value. The quantile loss is always positive.
# Create plot with least squares loss as well
fig = plot_quantile_loss(actual, [predictions, predictions, predictions], [0.1, 0.5, 0.9], True)
iplot(fig)
Let's look at the quantile loss associated with our model's predictions. We make sure to use the alpha associated with the lower and upper predictions from the model.
predictions = model.predictions.copy()
fig = plot_quantile_loss(
predictions["actual"],
[predictions["lower"], predictions["mid"], predictions["upper"]],
[model.lower_alpha, 0.5, model.upper_alpha],
)
iplot(fig)
We can see the same weighting applied to the model's predictions. When the error is negative - meaning the actual value was less than the predicted value - and the quantile is less than 0.5, we weight the error by (quantile - 1) to penalize the high prediction. When the error is positive - meaning the actual value was greater than the predicted value - and the quantile is greater than 0.5, we weight the error by the quantile to penalize the low prediction.
This is only a high level explanation, but it's enough to allow us to use the model. For further information, start with this article or the Wikipedia page on quantile loss and dig into the sources.
Predicting intervals instead of a single point value is useful to emphasize that a prediction from a model is always an estimate and hence has uncertainty. One easy way to generate prediction intervals with Scikit-Learn is through the Gradient Boosting Regressor as implemented in this notebook. There are other methods, so don't take away that there is only one way to accomplish this task. As with many aspects of machine learning / data science, you have to try different techniques to find the right one.
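For example, if you're on a recent version of Scikit-Learn (1.1 or later, which is an assumption beyond the versions used in this notebook), the histogram-based gradient booster also supports the quantile loss and trains much faster on datasets of this size. A minimal sketch:
# Hypothetical alternative, assuming scikit-learn >= 1.1
from sklearn.ensemble import HistGradientBoostingRegressor

# Same idea as before: one model per quantile bound
fast_lower = HistGradientBoostingRegressor(loss="quantile", quantile=0.1).fit(X_train, y_train)
fast_upper = HistGradientBoostingRegressor(loss="quantile", quantile=0.9).fit(X_train, y_train)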
Hopefully you found this walkthrough useful, and I encourage you to build on or improve it. The best way to learn anything is to use it to solve a problem, so apply this to one of yours and let me know how you are using prediction intervals.
Will Koehrsen