We have often found that Machine Learning (ML) algorithms capable of capturing structural non-linearities in training data - models that are sometimes referred to as 'black box' (e.g. Random Forests, Deep Neural Networks, etc.) - perform far better at prediction than their linear counterparts (e.g. Generalised Linear Models).
They are, however, much harder to interpret - in fact, quite often it is not possible to gain any insight into why a particular prediction has been produced, when given an instance of input data (i.e. the model features).
Consequently, it has often not been possible to use 'black box' ML algorithms in situations where clients seek cause-and-effect explanations for model predictions; the end result is that less accurate but more interpretable models have been used in their place, because their explanatory power is judged more valuable in relative terms.
The problem with model explainability is that it is very hard to define a model's decision boundary in a human-understandable manner.
LIME (Local Interpretable Model-agnostic Explanations) is a Python library that tackles model interpretability by producing locally faithful explanations - simple surrogate models that approximate the 'black box' model in the neighbourhood of the prediction being explained.
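At its core, LIME works by perturbing the instance to be explained, querying the 'black box' model on those perturbed points, weighting the points by their proximity to the instance, and fitting a simple weighted linear model whose coefficients act as the explanation. The sketch below only illustrates this idea; it is not the library's implementation (the real library also standardises and discretises features), and the helper name lime_sketch, the Gaussian kernel and its width are our own illustrative choices.
# Minimal sketch of the idea behind LIME (illustration only, not the library's code).
import numpy as np
from sklearn.linear_model import Ridge

def lime_sketch(black_box_predict, instance, n_samples=5000, kernel_width=0.75):
    rng = np.random.default_rng(0)
    # Perturb the instance by adding Gaussian noise to each feature.
    samples = instance + rng.normal(scale=1.0, size=(n_samples, instance.shape[0]))
    # Query the black-box model on the perturbed samples.
    preds = black_box_predict(samples)
    # Weight each perturbed sample by its proximity to the original instance.
    distances = np.linalg.norm(samples - instance, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Fit an interpretable weighted linear surrogate; its coefficients are the explanation.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(samples, preds, sample_weight=weights)
    return surrogate.intercept_, surrogate.coef_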
# Install LIME using the following command.
!pip install lime
import numpy as np
np.set_printoptions(precision=4) # To display values only upto four decimal places.
import pandas as pd
pd.set_option('mode.chained_assignment', None) # To suppress pandas warnings.
pd.set_option('display.max_colwidth', -1) # To display all the data in the columns.
pd.options.display.max_columns = 40 # To display all the columns. (Set the value to a high number)
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid') # To apply seaborn whitegrid style to the plots.
plt.rc('figure', figsize=(10, 8)) # Set the default figure size of plots.
%matplotlib inline
import warnings
warnings.filterwarnings('ignore') # To suppress all the warnings in the notebook.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
df = pd.read_csv('../../data/Boston.csv')
df.head()
| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
This dataset contains information on Housing Values in Suburbs of Boston.
The column medv is the target variable. It is the median value of owner-occupied homes in $1000s.
Column Name | Description |
---|---|
crim | Per capita crime rate by town. |
zn | Proportion of residential land zoned for lots over 25,000 sq.ft. |
indus | Proportion of non-retail business acres per town. |
chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
nox | Nitrogen oxides concentration (parts per 10 million). |
rm | Average number of rooms per dwelling. |
age | Proportion of owner-occupied units built prior to 1940. |
dis | Weighted mean of distances to five Boston employment centres. |
rad | Index of accessibility to radial highways. |
tax | Full-value property-tax rate per 10,000 dollars. |
ptratio | Pupil-teacher ratio by town. |
black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
lstat | Lower status of the population (percent). |
medv | Target, median value of owner-occupied homes in $1000s. |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null float64
ptratio    506 non-null float64
black      506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
df.describe()
| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.677082 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
Now that we have confirmed the entire dataset is numeric, let's begin the modelling process.
First, we split the complete dataset into training and test sets.
df.head()
| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
X = df.iloc[:, :-1]
X.head()
| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 |
y = df.iloc[:, -1]
y.head()
0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: medv, dtype: float64
# Using scikit-learn's train_test_split function to split the dataset into train and test sets.
# 80% of the data will be in the train set and 20% in the test set, as specified by test_size=0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Checking the shapes of all the training and test sets for the dependent and independent features.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(404, 13)
(404,)
(102, 13)
(102,)
# Creating a Random Forest Regressor.
regressor_rf = RandomForestRegressor(n_estimators=200, random_state=0, oob_score=True, n_jobs=-1)
# Fitting the model on the dataset.
regressor_rf.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1, oob_score=True, random_state=0, verbose=0, warm_start=False)
regressor_rf.oob_score_
0.8375610635726134
X_train.columns
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat'], dtype='object')
# Checking the feature importances of various features.
# Sorting the importances by descending order (lowest importance at the bottom).
for score, name in sorted(zip(regressor_rf.feature_importances_, X_train.columns), reverse=True):
print('Feature importance of', name, ':', score*100, '%')
Feature importance of rm : 48.652000069343806 %
Feature importance of lstat : 32.62519467239989 %
Feature importance of dis : 5.723002765414628 %
Feature importance of crim : 3.688078587412392 %
Feature importance of nox : 1.7565510857613313 %
Feature importance of ptratio : 1.7390709979839825 %
Feature importance of tax : 1.662967097458929 %
Feature importance of age : 1.470551943168394 %
Feature importance of black : 1.3621235473729538 %
Feature importance of indus : 0.6664660162044432 %
Feature importance of rad : 0.3849848939558503 %
Feature importance of zn : 0.1498360991801689 %
Feature importance of chas : 0.11917222434323611 %
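To make these importances easier to compare at a glance, the optional snippet below (a simple sketch using only the pandas and matplotlib imports already loaded above) plots them as a horizontal bar chart.
# Plotting the impurity-based feature importances as a horizontal bar chart.
importances = pd.Series(regressor_rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh')
plt.xlabel('Feature importance')
plt.title('Random Forest feature importances')
plt.show()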
# Making predictions on the training set.
y_pred_train = regressor_rf.predict(X_train)
y_pred_train[:10]
array([12.511 , 19.9775, 19.77 , 13.258 , 18.5805, 24.453 , 20.89 , 23.869 , 8.595 , 23.5375])
# Making predictions on test set.
y_pred_test = regressor_rf.predict(X_test)
y_pred_test[:10]
array([22.57 , 31.5625, 17.109 , 23.243 , 16.7635, 21.303 , 19.24 , 15.6195, 21.429 , 20.964 ])
Error is the deviation of the values predicted by the model from the true values.
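For reference, the R-squared score used below compares the model's squared errors against those of a constant baseline that always predicts the mean of the target:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

A score of 1 corresponds to perfect predictions, while a score of 0 means the model does no better than always predicting the mean.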
# R-Squared Value on the training set.
print('R-Squared Value for train data is:', r2_score(y_train, y_pred_train))
R-Squared Value for train data is: 0.9774359941752926
# R-Squared Value on the test set.
print('R-Squared Value for test data is:', r2_score(y_test, y_pred_test))
R-Squared Value for test data is: 0.8798665934496468
# Selecting the indexes of the categorical features in the dataset
# (treating features with 10 or fewer distinct values as categorical; here 'chas' and 'rad').
categorical_features = np.argwhere(np.array([len(set(X_train.values[:, x])) for x in range(X_train.shape[1])]) <= 10).flatten()
categorical_features
array([3, 8], dtype=int64)
from lime.lime_tabular import LimeTabularExplainer
# Creating the LIME explainer object.
explainer = LimeTabularExplainer(X_train.values, mode='regression', feature_names=X_train.columns,
categorical_features=categorical_features, verbose=True, random_state=0)
Start by choosing an instance from the test dataset.
Then use LIME to estimate a local surrogate model with which to explain our model's prediction. Because the explainer was created with verbose=True, it prints the local model's intercept, the local model's prediction (Prediction_local) and the Random Forest's own prediction for the instance (shown as 'Right:').
Note that the actual target value from the data does not enter into this - the idea of LIME is to gain insight into why the chosen model (in our case the Random Forest regressor) is predicting whatever it has been asked to predict. Whether or not that prediction is actually any good is a separate issue.
# Selecting a random instance from the test dataset.
i = np.random.randint(0, X_test.shape[0])
print('i =', i)
i = 34
# Using LIME to estimate a local model. Using only 6 features to explain our model's predictions.
exp = explainer.explain_instance(X_test.values[i], regressor_rf.predict, num_features=6)
Intercept 26.667118750528495
Prediction_local [16.6593]
Right: 14.596999999999998
# Here the 'index' column shows the original index from the df dataframe,
# and the number at the start of the row is the positional index after the reset.
X_test.reset_index().loc[[i]]
| | index | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34 | 467 | 4.42228 | 0.0 | 18.1 | 0 | 0.584 | 6.003 | 94.5 | 2.5403 | 24 | 666.0 | 20.2 | 331.29 | 21.32 |
exp.show_in_notebook(show_table=True, show_all=False)
(Interactive LIME explanation widget: feature contributions alongside a Feature/Value table for the explained row.)
First, note that the row we explained is displayed on the right side of the widget, in table format. Since the show_all parameter was set to False, only the features used in the explanation are displayed.
The value column displays the original value for each feature.
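If the interactive HTML widget cannot be rendered (for example in a static export of this notebook), the same explanation can also be drawn as a plain matplotlib figure. This assumes lime's as_pyplot_figure helper, which is available on the Explanation object in recent versions of the library.
# Rendering the explanation as a static matplotlib bar chart instead of the HTML widget.
fig = exp.as_pyplot_figure()
plt.tight_layout()
plt.show()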
exp.as_list()
[('lstat > 16.37', -5.335049207889584), ('5.89 < rm <= 6.21', -3.3708905664632702), ('crim > 3.20', -0.9545632488848613), ('330.00 < tax <= 666.00', -0.4234910180513856), ('rad=24', 0.38388434357383366), ('18.70 < ptratio <= 20.20', -0.30774577277796716)]
Observations from LIME's interpretation of our Random Forest's prediction:
The value shown next to each condition is the amount by which that feature shifts the prediction away from the intercept estimated for the local model.
Adding all of these values to the intercept gives the Prediction_local (the local model's estimate of the Random Forest's prediction) calculated by LIME.
print('Intercept =', exp.intercept[0])
print('Prediction_local =', exp.local_pred[0])
Intercept = 26.667118750528495
Prediction_local = 16.65926328003526
# Calculating Prediction_local by adding the weight of each condition to the intercept.
# The intercept can be obtained from exp.intercept using the index 0.
intercept = exp.intercept[0]
prediction_local = intercept
for feature, weight in exp.as_list():
    prediction_local += weight
print('Prediction_local =', prediction_local)
Prediction_local = 16.659263280035262
i = np.random.randint(0, X_test.shape[0])
print('i =', i)
exp = explainer.explain_instance(X_test.values[i], regressor_rf.predict, num_features=6)
i = 93
Intercept 25.658304604538877
Prediction_local [19.7191]
Right: 18.784999999999997
X_test.reset_index().loc[[i]]
| | index | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
93 | 346 | 0.06162 | 0.0 | 4.39 | 0 | 0.442 | 5.898 | 52.3 | 8.0136 | 3 | 352.0 | 18.8 | 364.61 | 12.67 |
exp.show_in_notebook(show_table=True, show_all=False)
(Interactive LIME explanation widget for this instance.)
exp.as_list()
[('5.89 < rm <= 6.21', -3.3739831077720055), ('10.93 < lstat <= 16.37', -1.2396077198032527), ('18.70 < ptratio <= 20.20', -0.8122143970518639), ('black <= 375.47', -0.48573831431424713), ('330.00 < tax <= 666.00', -0.4728522138910121), ('zn <= 0.00', 0.44516993577257424)]
print('Intercept =', exp.intercept[0])
print('Prediction_local =', exp.local_pred[0])
Intercept = 25.658304604538877
Prediction_local = 19.71907878747907
intercept = exp.intercept[0]
prediction_local = intercept
for feature, weight in exp.as_list():
    prediction_local += weight
print('Prediction_local =', prediction_local)
Prediction_local = 19.71907878747907
By changing the chosen i, we observe that the narrative provided by LIME also changes, reflecting how the model behaves in the local region of feature space around each instance being explained.
This is a clear improvement over relying purely on the Random Forest's (static) global feature importances, and it is of great benefit with models that otherwise provide no insight whatsoever.
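The same machinery scales naturally beyond a single row: looping over a handful of test instances and comparing their top features gives a quick sense of how the explanations vary across the feature space. Below is a short sketch using only the calls already demonstrated above; the choice of three rows and three features per explanation is arbitrary.
# Explaining a few randomly chosen test instances to compare their local explanations.
rng = np.random.RandomState(0)
for idx in rng.choice(X_test.shape[0], size=3, replace=False):
    exp = explainer.explain_instance(X_test.values[idx], regressor_rf.predict, num_features=3)
    print('Row', idx, ':', exp.as_list())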