We have often found that Machine Learning (ML) algorithms capable of capturing structural non-linearities in training data - models sometimes referred to as 'black box' (e.g. Random Forests, Deep Neural Networks) - predict far better than their linear counterparts (e.g. Generalized Linear Models).
They are, however, much harder to interpret - often it is not possible to gain any insight into why a particular prediction was produced for a given instance of input data (i.e. the model features).
Consequently, 'black box' ML algorithms have not been usable in situations where clients seek cause-and-effect explanations for model predictions; the end result is that sub-optimal predictive models have been used in their place, because their explanatory power was, in relative terms, more valuable.
The problem with model explainability is that it is very hard to define a model's decision boundary in a human-understandable manner.
LIME (Local Interpretable Model-agnostic Explanations) is a Python library that tackles model interpretability by producing locally faithful explanations.
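Concretely, LIME approximates the black-box model $f$ near a single instance $x$ with a simple interpretable model $g$ (typically a sparse linear model). In the notation of the original paper (Ribeiro et al., 2016), the explanation is

$$\xi(x) = \underset{g \in G}{\operatorname{argmin}} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)$$

where $\pi_x$ is a proximity kernel that weights perturbed samples by their closeness to $x$, and $\Omega(g)$ penalizes the complexity of $g$.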
# Install LIME using the following command.
!pip install lime
import numpy as np
np.set_printoptions(precision=4) # To display values only up to four decimal places.
import pandas as pd
pd.set_option('mode.chained_assignment', None) # To suppress pandas warnings.
pd.set_option('display.max_colwidth', -1) # To display all the data in the columns.
pd.options.display.max_columns = 40 # To display all the columns. (Set the value to a high number)
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid') # To apply seaborn whitegrid style to the plots.
plt.rc('figure', figsize=(10, 8)) # Set the default figure size of plots.
%matplotlib inline
import warnings
warnings.filterwarnings('ignore') # To suppress all the warnings in the notebook.
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
df = pd.read_csv('../../data/mushrooms.csv')
df.head()
class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | p | x | s | n | t | p | f | c | n | k | e | e | s | s | w | w | p | w | o | p | k | s | u |
1 | e | x | s | y | t | a | f | c | b | k | e | c | s | s | w | w | p | w | o | p | n | n | g |
2 | e | b | s | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | n | m |
3 | p | x | y | w | t | p | f | c | n | n | e | e | s | s | w | w | p | w | o | p | k | s | u |
4 | e | x | s | g | f | n | f | w | b | k | t | e | s | s | w | w | p | w | o | e | n | a | g |
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).
Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one.
The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
df.columns
Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'], dtype='object')
Column Name | Description |
---|---|
class | classes: edible=e, poisonous=p. |
cap-shape | bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s. |
cap-surface | fibrous=f, grooves=g, scaly=y, smooth=s. |
cap-color | brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y. |
bruises | bruises=t, no=f. |
odor | almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s. |
gill-attachment | attached=a, descending=d, free=f, notched=n. |
gill-spacing | close=c, crowded=w, distant=d. |
gill-size | broad=b, narrow=n. |
gill-color | black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y. |
stalk-shape | enlarging=e, tapering=t. |
stalk-root | bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?. |
stalk-surface-above-ring | fibrous=f, scaly=y, silky=k, smooth=s. |
stalk-surface-below-ring | fibrous=f, scaly=y, silky=k, smooth=s. |
stalk-color-above-ring | brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y. |
stalk-color-below-ring | brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y. |
veil-type | partial=p, universal=u. |
veil-color | brown=n, orange=o, white=w, yellow=y. |
ring-number | none=n, one=o, two=t. |
ring-type | cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z. |
spore-print-color | black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y. |
population | abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y. |
habitat | grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d. |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
class                       8124 non-null object
cap-shape                   8124 non-null object
cap-surface                 8124 non-null object
cap-color                   8124 non-null object
bruises                     8124 non-null object
odor                        8124 non-null object
gill-attachment             8124 non-null object
gill-spacing                8124 non-null object
gill-size                   8124 non-null object
gill-color                  8124 non-null object
stalk-shape                 8124 non-null object
stalk-root                  8124 non-null object
stalk-surface-above-ring    8124 non-null object
stalk-surface-below-ring    8124 non-null object
stalk-color-above-ring      8124 non-null object
stalk-color-below-ring      8124 non-null object
veil-type                   8124 non-null object
veil-color                  8124 non-null object
ring-number                 8124 non-null object
ring-type                   8124 non-null object
spore-print-color           8124 non-null object
population                  8124 non-null object
habitat                     8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB
df.describe()
class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | 2 | 5 | 4 | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
top | e | x | y | n | f | n | f | c | b | b | t | b | s | s | w | w | p | w | o | p | w | v | d |
freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | 4608 | 3776 | 5176 | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |
df.head()
class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | p | x | s | n | t | p | f | c | n | k | e | e | s | s | w | w | p | w | o | p | k | s | u |
1 | e | x | s | y | t | a | f | c | b | k | e | c | s | s | w | w | p | w | o | p | n | n | g |
2 | e | b | s | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | n | m |
3 | p | x | y | w | t | p | f | c | n | n | e | e | s | s | w | w | p | w | o | p | k | s | u |
4 | e | x | s | g | f | n | f | w | b | k | t | e | s | s | w | w | p | w | o | e | n | a | g |
# Creating labels array from the class column.
labels = df.iloc[:, 0].values
labels
array(['p', 'e', 'e', ..., 'e', 'p', 'e'], dtype=object)
# Creating a LabelEncoder object le and fitting it on the labels array.
le = LabelEncoder()
le.fit(labels)
LabelEncoder()
# Transforming the labels array to have numerical values.
labels = le.transform(labels)
labels
array([1, 0, 0, ..., 0, 1, 0])
# Storing the classes found by the LabelEncoder in the labels array into class_names.
class_names = le.classes_
class_names
array(['e', 'p'], dtype=object)
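As a quick sanity check (a minimal sketch, not part of the original walkthrough), the fitted le can map the numeric labels back to the original letters:
# LabelEncoder sorts classes alphabetically, so 'e' (edible) -> 0 and 'p' (poisonous) -> 1.
le.inverse_transform([0, 1])
# -> array(['e', 'p'], dtype=object)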
# Dropping the class column from the df dataframe.
df.drop(['class'], axis=1, inplace=True)
df.head()
cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | x | s | n | t | p | f | c | n | k | e | e | s | s | w | w | p | w | o | p | k | s | u |
1 | x | s | y | t | a | f | c | b | k | e | c | s | s | w | w | p | w | o | p | n | n | g |
2 | b | s | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | n | m |
3 | x | y | w | t | p | f | c | n | n | e | e | s | s | w | w | p | w | o | p | k | s | u |
4 | x | s | g | f | n | f | w | b | k | t | e | s | s | w | w | p | w | o | e | n | a | g |
# Creating a range from 0 up to the number of categorical features. Since all the features in df are categorical, len(df.columns) covers every column.
categorical_features = range(len(df.columns))
categorical_features
range(0, 22)
# Creating an array of feature names.
feature_names = df.columns.values
feature_names
array(['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'], dtype=object)
# We expand the single-letter codes into full words, using the dataset description provided at the beginning.
categorical_names = '''bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
fibrous=f,grooves=g,scaly=y,smooth=s
brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
bruises=t,no=f
almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
attached=a,descending=d,free=f,notched=n
close=c,crowded=w,distant=d
broad=b,narrow=n
black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
enlarging=e,tapering=t
bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
fibrous=f,scaly=y,silky=k,smooth=s
fibrous=f,scaly=y,silky=k,smooth=s
brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
partial=p,universal=u
brown=n,orange=o,white=w,yellow=y
none=n,one=o,two=t
cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d'''.split('\n')
categorical_names[0]
'bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s'
# For each column, build a {letter: word} dictionary from the corresponding
# description string, then replace the single-letter codes with full words.
for j, names in enumerate(categorical_names):
    values = names.split(',')
    values = dict([(x.split('=')[1], x.split('=')[0]) for x in values])
    df.iloc[:, j] = df.iloc[:, j].map(values)
df.head()
cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | convex | smooth | brown | bruises | pungent | free | close | narrow | black | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | black | scattered | urban |
1 | convex | smooth | yellow | bruises | almond | free | close | broad | black | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | numerous | grasses |
2 | bell | smooth | white | bruises | anise | free | close | broad | brown | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | numerous | meadows |
3 | convex | scaly | white | bruises | pungent | free | close | narrow | brown | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | black | scattered | urban |
4 | convex | smooth | gray | no | none | free | crowded | broad | black | tapering | equal | smooth | smooth | white | white | partial | white | one | evanescent | brown | abundant | grasses |
# LabelEncoding all the features. Capturing the different class values for each feature in the categorical_names dictionary.
categorical_names = {}
for feature in categorical_features:
    le = LabelEncoder()
    le.fit(df.iloc[:, feature])
    df.iloc[:, feature] = le.transform(df.iloc[:, feature])
    categorical_names[feature] = le.classes_
categorical_names[0]
array(['bell', 'conical', 'convex', 'flat', 'knobbed', 'sunken'], dtype=object)
Now that the entire dataset is numeric, let's begin our modelling process.
First, we split the complete dataset into training and testing sets.
df.head()
cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 3 | 0 | 0 | 7 | 1 | 0 | 1 | 0 | 0 | 2 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 0 | 3 | 4 |
1 | 2 | 3 | 9 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 1 | 2 | 0 |
2 | 0 | 3 | 8 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 1 | 2 | 2 |
3 | 2 | 2 | 8 | 0 | 7 | 1 | 0 | 1 | 1 | 0 | 2 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 0 | 3 | 4 |
4 | 2 | 3 | 3 | 1 | 6 | 1 | 1 | 0 | 0 | 1 | 2 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 0 | 1 | 0 | 0 |
X = df.iloc[:, :]
X.head()
cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 3 | 0 | 0 | 7 | 1 | 0 | 1 | 0 | 0 | 2 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 0 | 3 | 4 |
1 | 2 | 3 | 9 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 1 | 2 | 0 |
2 | 0 | 3 | 8 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 1 | 2 | 2 |
3 | 2 | 2 | 8 | 0 | 7 | 1 | 0 | 1 | 1 | 0 | 2 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 0 | 3 | 4 |
4 | 2 | 3 | 3 | 1 | 6 | 1 | 1 | 0 | 0 | 1 | 2 | 3 | 3 | 7 | 7 | 0 | 2 | 1 | 0 | 1 | 0 | 0 |
y = labels[:]
y[:10]
array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
# Using scikit-learn's train_test_split function to split the dataset into train and test sets.
# 80% of the data will be in the train set and 20% in the test set, as specified by test_size=0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Checking the shapes of all the training and test sets for the dependent and independent features.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(6499, 22)
(6499,)
(1625, 22)
(1625,)
# Finally, we use a One-hot encoder, so that the classifier does not take our categorical features as continuous features.
# We will use this encoder only for the classifier, not for the explainer -
# and the reason is that the explainer must make sure that a categorical feature only has one value.
ohe = OneHotEncoder(categorical_features=categorical_features)
ohe.fit(df)
OneHotEncoder(categorical_features=range(0, 22), categories=None, drop=None, dtype=<class 'numpy.float64'>, handle_unknown='error', n_values=None, sparse=True)
X_train_encoded = ohe.transform(X_train)
X_test_encoded = ohe.transform(X_test)
print(X_train_encoded.shape)
print(X_test_encoded.shape)
(6499, 117)
(1625, 117)
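A note on API drift: the categorical_features argument used above was deprecated in scikit-learn 0.20 and removed in 0.22. Because every column in df is categorical, a rough modern equivalent (a sketch, assuming scikit-learn >= 0.22) is to fit the encoder on the full dataframe directly:
# No numeric columns are present, so no column selection is needed;
# unseen categories would raise an error (handle_unknown='error' is the default).
ohe = OneHotEncoder()
ohe.fit(df)
X_train_encoded = ohe.transform(X_train)
X_test_encoded = ohe.transform(X_test)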
# Creating a Random Forest Classifier.
classifier_rf = RandomForestClassifier(n_estimators=500, random_state=0, oob_score=True, n_jobs=-1)
# Fitting the model on the training set.
classifier_rf.fit(X_train_encoded, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1, oob_score=True, random_state=0, verbose=0, warm_start=False)
classifier_rf.oob_score_
1.0
# Making predictions on the training set.
y_pred_train = classifier_rf.predict(X_train_encoded)
y_pred_train[:10]
array([1, 1, 1, 0, 0, 1, 0, 0, 1, 1])
# Making predictions on the test set.
y_pred_test = classifier_rf.predict(X_test_encoded)
y_pred_test[:10]
array([0, 1, 1, 0, 1, 1, 1, 1, 0, 0])
Error is the deviation of the model's predicted values from the true values. For a classifier we summarize performance with accuracy, the fraction of predictions that match the true labels.
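As a standard definition (not specific to this notebook), accuracy over $n$ samples is

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(\hat{y}_i = y_i)$$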
# Accuracy score on the training set.
print('Accuracy score for train data is:', accuracy_score(y_train, y_pred_train))
Accuracy score for train data is: 1.0
# Accuracy score on the test set.
print('Accuracy score for test data is:', accuracy_score(y_test, y_pred_test))
Accuracy score for test data is: 1.0
We get an accuracy of 100% on our train set and an accuracy of 100% on our test set.
Note that the accuracy obtained on the test set (1.0) matches the one obtained from the oob_score_ (1.0), so the oob_score_ can serve as a validation estimate before testing our model on the test set.
# Our predict function first transforms the data into the one-hot representation.
# Then it calculates the prediction probability for each class of the target variable.
predict_fn = lambda x: classifier_rf.predict_proba(ohe.transform(x))
We now create our explainer.
The categorical_features parameter lets it know which features are categorical (in this case, all of them).
The categorical_names parameter maps each categorical feature's numerical values to their string names.
from lime.lime_tabular import LimeTabularExplainer
# Creating the LIME explainer object.
explainer = LimeTabularExplainer(X_train.values, mode='classification', class_names=['edible', 'poisonous'],
                                 feature_names=feature_names, categorical_features=categorical_features,
                                 categorical_names=categorical_names, kernel_width=3, verbose=True, random_state=0)
Start by choosing an instance from the test dataset.
We then use LIME to estimate a local model that explains our classifier's prediction for that instance. With verbose=True, the explainer prints three values: Intercept (the intercept of the local linear model), Prediction_local (the local model's estimate of the predicted probability), and Right (the Random Forest's actual predicted probability for the explained class).
Note that the actual label from the data does not enter into this - the idea of LIME is to gain insight into why the chosen model - in our case the Random Forest classifier - is predicting whatever it has been asked to predict. Whether or not this prediction is actually any good is a separate issue.
# Selecting a random instance from the test dataset.
i = np.random.randint(0, X_test.shape[0])
print('i =', i)
i = 1075
# Using LIME to estimate a local model. Using only 6 features to explain our model's predictions.
exp = explainer.explain_instance(X_test.values[i], predict_fn, num_features=6)
Intercept 0.7144173526963122
Prediction_local [0.2041]
Right: 0.0
# Here the index column shows the original index from the df dataframe, while the number at the beginning is the positional index after the reset.
X_test.reset_index().loc[[i]]
index | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1075 | 2280 | 2 | 0 | 0 | 0 | 6 | 1 | 0 | 0 | 7 | 1 | 0 | 3 | 3 | 5 | 3 | 0 | 2 | 1 | 4 | 0 | 5 | 6 |
exp.show_in_notebook(show_table=True, show_all=False)
[LIME visualization: predicted probabilities alongside the weighted feature contributions and a Feature/Value table for the explained instance]
First, note that the row being explained is displayed on the right side, in table format. Since the show_all parameter is set to False, only the features used in the explanation are displayed.
The value column displays the original value for each feature.
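If the interactive HTML view is not available (for example, outside a notebook), LIME can render the same weights with matplotlib; a minimal sketch using the explanation object's built-in helper:
# Horizontal bar chart of the local weights (defaults to the label being explained).
fig = exp.as_pyplot_figure()
plt.tight_layout()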
exp.as_list()
[('odor=none', -0.25505675405753075), ('gill-size=broad', -0.1277562900735544), ('stalk-surface-above-ring=smooth', -0.09797362037586274), ('gill-spacing=close', 0.06957793771888027), ('bruises=bruises', -0.05072352183511148), ('ring-type=pendant', -0.0483729727557791)]
Observations from LIME's interpretation of our Random Forest's prediction:
The value shown next to each condition is the weight by which that condition shifts the local model's output from the intercept estimated for the local model.
When all these weights are added to the intercept, we obtain the Prediction_local (the local model's estimate of the Random Forest's predicted probability) calculated by LIME.
print('Intercept =', exp.intercept[1])
print('Prediction_local =', exp.local_pred[0])
Intercept = 0.7144173526963122
Prediction_local = 0.20411213131735406
# Calculating the Prediction_local by adding the weight obtained above for each condition to the intercept.
# The intercept can be obtained from exp.intercept using index 1 (the 'poisonous' class).
intercept = exp.intercept[1]
prediction_local = intercept
for j in range(len(exp.as_list())):
    prediction_local += exp.as_list()[j][1]
print('Prediction_local =', prediction_local)
Prediction_local = 0.20411213131735403
# This time specifying a particular value of i in order to explain the working of LIME.
i = 515
print('i =', i, '\n')
exp = explainer.explain_instance(X_test.values[i], predict_fn, num_features=6)
i = 515

Intercept 0.4517135442299755
Prediction_local [0.6529]
Right: 1.0
X_test.reset_index().loc[[i]]
index | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
515 | 5357 | 2 | 3 | 8 | 0 | 4 | 1 | 0 | 0 | 10 | 1 | 0 | 0 | 3 | 7 | 7 | 0 | 2 | 1 | 4 | 3 | 3 | 4 |
exp.show_in_notebook(show_table=True, show_all=False)
[LIME visualization: predicted probabilities alongside the weighted feature contributions and a Feature/Value table for the explained instance]
exp.as_list()
[('odor=foul', 0.274586091806193), ('gill-size=broad', -0.1292596563485946), ('spore-print-color=chocolate', 0.08298611784301334), ('gill-spacing=close', 0.07639601609745206), ('ring-type=pendant', -0.053535847394535555), ('bruises=bruises', -0.049955819676421445)]
print('Intercept =', exp.intercept[1])
print('Prediction_local =', exp.local_pred[0])
Intercept = 0.4517135442299755
Prediction_local = 0.6529304465570823
# Reconstructing Prediction_local again: the intercept plus the sum of the local weights.
intercept = exp.intercept[1]
prediction_local = intercept
for j in range(len(exp.as_list())):
    prediction_local += exp.as_list()[j][1]
print('Prediction_local =', prediction_local)
Prediction_local = 0.6529304465570822
By changing the chosen i, we observe that the narrative provided by LIME also changes, responding to how the model behaves in the local region of the feature space around the given instance.
This is clearly an improvement over relying purely on the Random Forest's (static) global feature importances, and of great benefit compared with models that provide no insight whatsoever.
Now note that the explanations are based not only on features, but on feature-value pairs.
For example, we are saying that odor = foul is indicative of a poisonous mushroom.
# Checking the different categories in the odor feature.
odor_idx = list(feature_names).index('odor')
explainer.categorical_names[odor_idx]
array(['almond', 'anise', 'creosote', 'fishy', 'foul', 'musty', 'none', 'pungent', 'spicy'], dtype=object)
# Checking the feature frequencies of different categories in the odor feature.
explainer.feature_frequencies[odor_idx]
array([0.0492, 0.0472, 0.0238, 0.0697, 0.2662, 0.0048, 0.4359, 0.0306, 0.0725])
# Setting foul_idx equal to the index of 'foul' category in the odor feature.
# Then creating non_foul array with different categories in the odor feature except foul category.
foul_idx = 4
non_foul = np.delete(explainer.categorical_names[odor_idx], foul_idx)
non_foul
array(['almond', 'anise', 'creosote', 'fishy', 'musty', 'none', 'pungent', 'spicy'], dtype=object)
# Creating non_foul_normalized_frequencies array with feature frequencies of different categories in the odor feature.
# Setting feature frequency of foul category to 0. Then normalizing the feature frequencies to have a total sum of 1.
non_foul_normalized_frequencies = explainer.feature_frequencies[odor_idx].copy()
non_foul_normalized_frequencies[foul_idx] = 0
non_foul_normalized_frequencies /= non_foul_normalized_frequencies.sum()
non_foul_normalized_frequencies
array([0.0671, 0.0644, 0.0325, 0.095 , 0. , 0.0065, 0.594 , 0.0417, 0.0988])
# Calculating the probability of the mushroom being poisonous for each value of odor except foul.
# Then calculating the overall probability of the mushroom being poisonous when odor is not foul.
print('Making odor not equal foul')
temp = X_test.values[i].copy()
print('P(poisonous) before:', predict_fn(temp.reshape(1,-1))[0,1], '\n')
average_poisonous = 0
for idx, (name, frequency) in enumerate(zip(explainer.categorical_names[odor_idx], non_foul_normalized_frequencies)):
    if name == 'foul':
        continue
    temp[odor_idx] = idx
    p_poisonous = predict_fn(temp.reshape(1,-1))[0,1]
    average_poisonous += p_poisonous * frequency
    print('P(poisonous | odor=%s): %.2f' % (name, p_poisonous))
print('\nP(poisonous | odor != foul) = %.2f' % average_poisonous)
Making odor not equal foul
P(poisonous) before: 1.0

P(poisonous | odor=almond): 0.66
P(poisonous | odor=anise): 0.65
P(poisonous | odor=creosote): 0.73
P(poisonous | odor=fishy): 0.73
P(poisonous | odor=musty): 0.72
P(poisonous | odor=none): 0.49
P(poisonous | odor=pungent): 0.76
P(poisonous | odor=spicy): 0.72

P(poisonous | odor != foul) = 0.58
The increase in the probability of being poisonous when odor equals foul is therefore P(poisonous | odor=foul) - P(poisonous | odor != foul) = 1.00 - 0.58 = 0.42.
We see that in this particular case the linear model is pretty close: it estimated that odor = foul increases the probability of poisonous by 0.27, when in fact the increase is 0.42.
Notice though that we only changed one feature (odor), when the linear model takes into account perturbations of all the features at once.
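As a rough numerical check on the 0.42 figure (a sketch reusing predict_fn, average_poisonous and i from the cells above; not part of the original analysis):
# X_test.values[i] still has odor=foul, so this is P(poisonous | odor=foul) for this instance.
p_foul = predict_fn(X_test.values[i].reshape(1, -1))[0, 1]
# Expected to print an increase of roughly 0.42 (i.e. 1.00 - 0.58).
print('Increase due to odor=foul: %.2f' % (p_foul - average_poisonous))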