We have often found that Machine Learning (ML) algorithms capable of capturing structural non-linearities in training data - models that are sometimes referred to as 'black box' (e.g. Random Forests, Deep Neural Networks, etc.) - perform far better at prediction than their linear counterparts (e.g. Generalised Linear Models). They are, however, much harder to interpret - in fact, quite often it is not possible to gain any insight into why a particular prediction has been produced, when given an instance of input data (i.e. the model features). Consequently, it has not been possible to use 'black box' ML algorithms in situations where clients have sought cause-and-effect explanations for model predictions, with end-results being that sub-optimal predictive models have been used in their place, as their explanatory power has been more valuable, in relative terms.

In this notebook, we trial a different approach - we train a 'black box' ML model to the best of our abilities and then apply an ancillary algorithm to generate explanations for the predictions. More specifically, we will test the ability of the Local Interpretable Model-agnostic Explanations (LIME) algorithm, recently described by Ribiero et al (2016), to provide explanations for a Random Forest regressor trained on multiple-lot on-line auction data.

The paper that describes the LIME algorithm can be found here: https://arxiv.org/pdf/1602.04938v1.pdf; details of its implementation in Python (as used in this notebook), can be found here: https://github.com/marcotcr/lime/; while a more general discussion of ML algorithm interpretation (that includes LIME), can be found in the eBook by Christoph Molnar, which can be found here: https://christophm.github.io/interpretable-ml-book/.

In [1]:

```
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

Load the auction data from CSV file and take a quick glimpse.

In [2]:

```
auction_eval = pd.read_csv('../data/auction_data.csv')
auction_eval.info()
auction_eval.head()
```

Out[2]:

**auction_id**- unique identifier for a single multi-lot online auction event;**RoR**- the average Return-on-Reserve for successfully sold lots in the multi-lot online auction event, computed as realised price divided by reserve price;**STR**- the proportion of lots (by value) that were successfully sold in the auction event;**BPL**- the average number of bidders per-lot;**lots**- the number of lots offered in the multi-lot online auction event;**product_types**- the number of different product types offered in the multi-lot online auction event;**avg_reserve**- the average reserve price over all lots in the multi-lot online auction event**avg_start_bid**- the average starting-bid (expressed as a fraction of the reserve bid), over all lots in the multi-lot online auction event**auction_mech**- the auction mechanim used for the auction event (one of English Forward, Sealed Bid or Fixed Price); and,**country**- the local market running the auction.

The goal of the modeling exercise that we are interested in, is to use as many features as possible to explain the drivers of RoR, *ex-anti*. That is, features that represent configurable auction (or marketplace) parameters - not auction outcomes (like RoR itself). We make an exception for BPL, as we would like to know how the success of a pre-auction marketing process, which could be viewed as driving factors of BPL, will impacts RoR.

In [3]:

```
RoR = ['RoR']
model_features = ['BPL', 'lots', 'product_types', 'avg_reserve', 'avg_start_bid', 'auction_mech', 'country']
```

We are aware that outliers have found their way into the data. So, we will limit our modeling dataset by filtering-out values that don't feel correct, based on our knowledge of the data (and the issues embedded within it). As such, we will filter-out the top percentile of RoR values as we know there is ambiguity surrounding reserve prices. Likewise, we have spotted auction events with average reserve prices in the millions of Euros, so we will filter-out the top percentile of mean reserve price observations, as well.

In [5]:

```
pctile_thold = 0.99
model_data = (
auction_eval.copy()
.loc[(auction_eval['STR'] > 0)
& (auction_eval['RoR'] < auction_eval['RoR'].quantile(pctile_thold))
& (auction_eval['avg_reserve'] < auction_eval['avg_reserve'].quantile(pctile_thold)),
:]
.assign(country = auction_eval['country'].apply(lambda x: 'country_' + x),
auction_mech = auction_eval['auction_mech'].apply(lambda x: 'auction_mech_' + x))
[RoR + model_features])
desc_features = ['auction_id']
model_data.describe()
```

Out[5]:

The purpose of this notebook is to use a learning algorithm that is capable of modeling non-linearities between the input (independent) variables (or features), such as a Random Forest, and then to apply the LIME algorithm to identify the key drivers of the output for any particular Out-Of-Sample (OOS) instance.

We need to setup a data preparation pipeline for handling the systematic selection of features from the input DataFrame (and their mapping to a NumPy ndarray), creating a new factor variable to identify the presence/absence of starting bids (as the average is a bit too rough-and-dirty), and handling categorical data (via factorising).

Note, that we haven't re-scaled the features as our intention is to use Random Forests to capture the non-linearities in the dataset, which do not benefit from any re-scaling process.

In [6]:

```
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, LabelEncoder, FunctionTransformer
# helper functions
def cat_col_selector(df):
df_cat = df.select_dtypes(include=['object'])
return df_cat.values
def num_col_selector(df):
df_num = df.select_dtypes(exclude=['object'])
return df_num.values
def cat_col_str2fact(X):
fact_cols = [LabelEncoder().fit_transform(col) for col in X.T]
fact_mat = np.vstack(fact_cols)
return fact_mat.T
def cat_col_fact_names(df):
def list_of_fact_names(col):
fact_names = LabelEncoder().fit(col).classes_
return fact_names
df_cat = df.select_dtypes(include=['object'])
X = df_cat.values
fact_name_cols = [(col_name, list_of_fact_names(col))
for col_name, col in zip(df_cat.columns, X.T)]
return dict(fact_name_cols)
def make_new_features(df):
start_bid_map = lambda x: 'start_bids_true' if x > 0 else 'start_bids_false'
new_df = (df
.assign(start_bids=df['avg_start_bid'].apply(start_bid_map))
.drop(['avg_start_bid'], axis=1))
return new_df
# pipline for features
ivar_data_pipe = Pipeline([
('make_new_features', FunctionTransformer(make_new_features, validate=False)),
('union', FeatureUnion([
('num_features', Pipeline([
('get_num_features', FunctionTransformer(num_col_selector, validate=False))
])),
('cat_vars', Pipeline([
('get_cat_features', FunctionTransformer(cat_col_selector, validate=False)),
('factorise', FunctionTransformer(cat_col_str2fact, validate=False))
]))
]))
])
# get feature names
num_feature_names = (
make_new_features(model_data[model_features])
.select_dtypes(exclude=['object'])
.columns
.tolist())
cat_feature_names = (
make_new_features(model_data[model_features])
.select_dtypes(include=['object'])
.columns
.tolist())
all_feature_names = num_feature_names + cat_feature_names
# get levels for categorical data
cat_col_levels = cat_col_fact_names(make_new_features(model_data[model_features]))
# perform transformations
y = (model_data
['RoR']
.values
.reshape(-1, ))
X = (ivar_data_pipe
.fit_transform(model_data[model_features]))
X_df = pd.DataFrame(X, columns=all_feature_names)
X_df.head()
```

Out[6]:

In [7]:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
print('{} samples in training dataset and {} samples in test dataset'
.format(X_train.shape[0], X_test.shape[0]))
```

For the test DataFrame only, we map the categorical feature levels to actual (string) categories to make it easier to read when testing LIME later on.

In [19]:

```
training_instances = pd.DataFrame(
np.hstack((y_test.reshape(-1, 1), X_test)),
columns=['RoR'] + all_feature_names)
training_instances['auction_mech'] = (
training_instances['auction_mech']
.apply(lambda e: cat_col_levels['auction_mech'][int(e)]))
training_instances['country'] = (
training_instances['country']
.apply(lambda e: cat_col_levels['country'][int(e)]))
training_instances['start_bids'] = (
training_instances['start_bids']
.apply(lambda e: cat_col_levels['start_bids'][int(e)]))
```

We will use RMSE as a test metric and 5-fold cross validation to get a basic estimate of the model's performance out-of-sample.

In [9]:

```
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
def rmse(model_fit, features, labels):
predictions = model_fit.predict(features)
rmse = np.sqrt(mean_squared_error(labels, predictions))
print('in-sample RMSE = {}'.format(rmse))
return None
def cv_results(model_fit, features, labels):
scores = cross_val_score(model_fit, features, labels, scoring='neg_mean_squared_error', cv=5)
cv_mean_rmse = np.sqrt(-scores).mean()
cv_std_rmse = np.sqrt(-scores).std()
print('xross-Val RMSE = {} +/- {}'.format(cv_mean_rmse, cv_std_rmse))
return None
```

We have chosen to estimate a Random Forest model as the 'black box' model, on which we will apply the LIME algorithm to derive insights into what features are the most important for explaining a particular instance. We appreciate that Random Forests are not truly 'black box' models (e.g. when compared against a neural network), as it is possible to extract 'relative feature importance' via a number of ancillary algorithms - for example by computing the expected fraction of samples that using a particular feature will impact, via splitting samples in an internal node. This is not, however, as useful a measure as the coefficient of regression in a linear model (which can also capture effect magnitudes as well as relative importance, and are not fully deterministic). Regardless, we see this as a benefit, as we then have the chance to compare LIME's output with the Random Forest's expected relative variable importances.

We start by using the default Random Forest regression parameters as implemented in Scikit-Learn, estimating the model on the training data, and running our cross-validation experiment to get an idea of how much over-fitting we can expect and what out-of-sample performance we can anticipate.

In [10]:

```
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_jobs=2, random_state=42)
forest_reg.fit(X_train, y_train)
rmse(forest_reg, X_train, y_train)
cv_results(forest_reg, X_train, y_train)
```

We now apply a grid search to find the best set of hyper-parameters.

In [11]:

```
from sklearn.model_selection import GridSearchCV
# configure grid search
forest_reg_param_grid = [{
'bootstrap': [True, False],
'n_estimators': [20, 40],
'max_features': [3, 5, 7],
'min_samples_split': [2, 4, 8, 16],
'random_state' : [42]}]
forest_reg_grid = GridSearchCV(
forest_reg,
forest_reg_param_grid,
cv=5,
scoring='neg_mean_squared_error',
n_jobs=2,
return_train_score=True)
# fit grid search
forest_reg_grid.fit(X_train, y_train)
# print results
print('best CV score: {}'.format(np.sqrt(-forest_reg_grid.best_score_)))
print('\n------------------------')
print('best parameters on grid:')
print('------------------------')
for k, v in forest_reg_grid.best_params_.items():
print('{0}: {1}'.format(k, v))
print('------------------------')
```

We now use the best-fit model to score the test dataset.

In [12]:

```
predicted = forest_reg_grid.best_estimator_.predict(X_test)
error = predicted - y_test
abs_error = np.absolute(error)
model_performance = pd.DataFrame(
np.concatenate([predicted.reshape((-1, 1)), y_test.reshape((-1, 1)),
error.reshape((-1, 1)), abs_error.reshape((-1, 1))], axis=1),
columns=['predicted', 'actual', 'error', 'abs_error'])
print('In-sample RMSE = {}'.format(np.sqrt(np.mean(error ** 2))))
_ = (model_performance
.plot(kind='scatter', x='predicted', y='actual', alpha=0.1, xlim=[0, 2], ylim=[0, 2]))
```

We now take a look at the distribution of errors (i.e. the residuals).

In [13]:

```
_ = plt.hist(error, density=True, bins=50, histtype='bar')
```

We can see from the model's performance OOS, as illustrated in the above two plots, that it does a 'reasonable' job of predicting RoR, with the main sources of error coming predominantly from a small collection of poor predictions

Now that we have estimated our model and have some insight into its expected performance, we turn our attention to applying the LIME algorithm to get deeper insight into the specific features that were key in driving the prediction for a particular instance of input data (e.g. from our test dataset).

At a high-level, the LIME algorithm for a regression task performs the following steps:

- simulate artificial test instances, using Normal distributions for continuous variables and empirical distributions for categorical variables. Note, that continuous variables can be discretised and treated as categorical variables (we have chosen this approach, based on quartiles);
- using an exponential (Gaussian) kernel with 'width' (variance) equal to
`sqrt(number of columns) * 0.75`

, generate a set of weights for each artificial test instance; and, - estimate a linear regression model, using the weighted artificial training instances. If the number of variables are less-than-or-equal to 6, then one model for every possible combination of factors is estimated, to find the top N features that have the greatest predictive power for explaining the global model's prediction (by brute-force). If there are more than 6 variables, then a lasso regression is used and the hyper-parameter (the magnitude of the regularisation term) is tuned to yield the chosen number of explanatory variables.

In [15]:

```
from lime.lime_tabular import LimeTabularExplainer
categorical_feature_maps = dict([
(n, e[1]) for n, e in zip(range(4, 7), cat_col_levels.items())])
explainer = LimeTabularExplainer(
X_train,
feature_names=all_feature_names,
class_names=['RoR'],
categorical_features=np.arange(4, 7),
categorical_names=categorical_feature_maps,
verbose=True,
mode='regression',
feature_selection='auto',
discretize_continuous=True,
discretizer='quartile',
random_state=42)
```

Start by choosing an instance from the test dataset (we have reproduce the entire set of training data instances in the Appendix, below, for reference).

In [16]:

```
training_data_instance_number = 10
```

Use LIME to estimate a local model to use for explaining our model's predictions. The outputs will be:

- the intercept estimated for the local model;
- the local model's estimate for the Regression Forest's prediction; and,
- the Regression Forest's actual prediction.

Note, that the actual value from the training data does not enter into this - the idea of LIME is to gain insight into why the chosen model - in our case the Random Forest regressor - is predicting whatever it has been asked to predict. Whether or not this prediction is actually any good, is a separate issue.

In [17]:

```
i = training_data_instance_number
explanation = explainer.explain_instance(
X_test[i],
forest_reg_grid.best_estimator_.predict,
num_features=5)
```

Print the DataFrame row for the chosen test instance.

In [20]:

```
training_instances.loc[[i]]
```

Out[20]:

Now take a look at LIME's interpretation of our Random Forest's prediction.

In [21]:

```
explanation.show_in_notebook(show_table=True)
```