import seaborn as sns
sns.set_context('talk')
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
This notebook uses some features of scikit-learn which are still under development.
These features have been combined into a scikit-learn branch in the following repository: https://github.com/glemaitre/scikit-learn/tree/workshop
You can refer to the following documentation to install scikit-learn from such a source: https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge
We will use data from the "Current Population Survey" from 1985 and fetch it from OpenML.
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
We can get more information by looking at the description of the dataset.
print(survey.DESCR)
**Author**:
**Source**: Unknown - Date unknown
**Please cite**:

Determinants of Wages from the 1985 Current Population Survey

Summary: The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. We wish to determine (i) whether wages are related to these characteristics and (ii) whether there is a gender gap in wages.

Based on residual plots, wages were log-transformed to stabilize the variance. Age and work experience were almost perfectly correlated (r=.98). Multiple regression of log wages against sex, age, years of education, work experience, union membership, southern residence, and occupational status showed that these covariates were related to wages (pooled F test, p < .0001). The effect of age was not significant after controlling for experience. Standardized residual plots showed no patterns, except for one large outlier with lower wages than expected. This was a male, with 22 years of experience and 12 years of education, in a management position, who lived in the north and was not a union member. Removing this person from the analysis did not substantially change the results, so that the final model included the entire sample.

Adjusting for all other variables in the model, females earned 81% (75%, 88%) the wages of males (p < .0001). Wages increased 41% (28%, 56%) for every 5 additional years of education (p < .0001). They increased by 11% (7%, 14%) for every additional 10 years of experience (p < .0001). Union members were paid 23% (12%, 36%) more than non-union members (p < .0001). Northerns were paid 11% (2%, 20%) more than southerns (p = .016). Management and professional positions were paid most, and service and clerical positions were paid least (pooled F-test, p < .0001). Overall variance explained was R2 = .35.

In summary, many factors describe the variations in wages: occupational status, years of experience, years of education, sex, union membership and region of residence. However, despite adjustment for all factors that were available, there still appeared to be a gender gap in wages. There is no readily available explanation for this gender gap.

Authorization: Public Domain

Reference: Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley.

Description: The datafile contains 534 observations on 11 variables sampled from the Current Population Survey of 1985. This data set demonstrates multiple regression, confounding, transformations, multicollinearity, categorical variables, ANOVA, pooled tests of significance, interactions and model building strategies.

Variable names in order from left to right:
EDUCATION: Number of years of education.
SOUTH: Indicator variable for Southern Region (1=Person lives in South, 0=Person lives elsewhere).
SEX: Indicator variable for sex (1=Female, 0=Male).
EXPERIENCE: Number of years of work experience.
UNION: Indicator variable for union membership (1=Union member, 0=Not union member).
WAGE: Wage (dollars per hour).
AGE: Age (years).
RACE: Race (1=Other, 2=Hispanic, 3=White).
OCCUPATION: Occupational category (1=Management, 2=Sales, 3=Clerical, 4=Service, 5=Professional, 6=Other).
SECTOR: Sector (0=Other, 1=Manufacturing, 2=Construction).
MARR: Marital Status (0=Unmarried, 1=Married)

Therese Stukel, Dartmouth Hitchcock Medical Center, One Medical Center Dr., Lebanon, NH 03756, e-mail: stukel@dartmouth.edu

Information about the dataset: CLASSTYPE: numeric; CLASSINDEX: none specific

Downloaded from openml.org.
The data are stored in a pandas dataframe.
survey.data.head()
|   | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | no | female | 21.0 | not_member | 35.0 | Hispanic | Other | Manufacturing | Married |
| 1 | 9.0 | no | female | 42.0 | not_member | 57.0 | White | Other | Manufacturing | Married |
| 2 | 12.0 | no | male | 1.0 | not_member | 19.0 | White | Other | Manufacturing | Unmarried |
| 3 | 12.0 | no | male | 4.0 | not_member | 22.0 | White | Other | Other | Unmarried |
| 4 | 12.0 | no | male | 17.0 | not_member | 35.0 | White | Other | Other | Married |
survey.target_names
['WAGE']
The column WAGE is our target variable (i.e., the variable which we want to predict). You can note that the dataset contains both numerical and categorical data.
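To confirm this at a glance, we can peek at the column dtypes (a quick, optional check): object columns hold the categorical variables while float columns hold the numerical ones.
survey.data.dtypes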
First, let's get some insights by looking at the marginal links between the different variables. Only numerical variables will be used.
_ = sns.pairplot(survey.data, diag_kind='kde')
We can add some additional information to the previous plots.
g = sns.PairGrid(survey.data)
g = g.map_upper(
sns.regplot, scatter_kws={"color": "black", "alpha": 0.2, "s": 30},
line_kws={"color": "red"}
)
g = g.map_lower(sns.kdeplot, cmap="Reds_d")
g = g.map_diag(sns.kdeplot, shade=True, color='r')
We can already have some intuitions regarding our dataset: for instance, the "WAGE" distribution has a long tail, which suggests predicting the log of the wage.
It is important to point out that we analyzed the data by looking at the joint distribution between 2 variables, which is a marginal link.
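As a quick check of the long-tail intuition (a sketch; the imports are repeated so that the snippet is self-contained), we can compare the raw and log-scaled wage distributions:
import numpy as np
import matplotlib.pyplot as plt
fig, axs = plt.subplots(ncols=2, figsize=(10, 4))
# raw wages: long right tail
survey.target.plot.hist(bins=30, ax=axs[0], title='WAGE')
# log-scaled wages: closer to symmetric
np.log10(survey.target).plot.hist(bins=30, ax=axs[1], title='log10(WAGE)')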
We will shortly jump into interpreting the coefficients of a linear model. In this regard, we should emphasize that linear models compute conditional links: the interpretation of a coefficient value is the relationship between the feature and the target, given that the other features remain constant. For instance, we can deduce the relationship between "AGE" and "WAGE" for a given number of years of "EXPERIENCE".
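To make the marginal/conditional distinction concrete, here is a small synthetic sketch (all names and data below are made up): `x1` and `x2` are almost perfectly correlated and `y_demo` depends positively on `x1` but negatively on `x2`. Regressing on `x2` alone (marginal link) yields a positive slope, while regressing on both features (conditional link) recovers the negative effect:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
x1 = rng.randn(1000)
x2 = x1 + 0.1 * rng.randn(1000)  # x2 is almost a copy of x1
y_demo = 2 * x1 - x2 + 0.1 * rng.randn(1000)
# marginal link of y_demo with x2: positive slope
print(LinearRegression().fit(x2.reshape(-1, 1), y_demo).coef_)
# conditional link: approximately recovers [2, -1]
print(LinearRegression().fit(np.column_stack([x1, x2]), y_demo).coef_)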
X = survey.data
y = survey.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
We have heterogeneous data and we will need to apply a specific preprocessing to each data type.
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR']
binary_columns = ['MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE']
preprocessor = make_column_transformer(
(OneHotEncoder(), categorical_columns),
(OrdinalEncoder(), binary_columns),
remainder='passthrough'
)
Thus, the preprocessor will:

- one-hot encode the categorical columns (i.e., generate a column per category);
- encode the binary columns with an ordinal encoder (i.e., as 0/1);
- pass the numerical columns through unchanged ('passthrough').
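As a quick sanity check (a sketch; `X_train_checked` is just a throwaway name), we can fit the preprocessor on its own and verify the shape of the transformed data:
X_train_checked = preprocessor.fit_transform(X_train)
# one column per one-hot encoded category, plus the binary and numerical columns
print(X_train_checked.shape)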
We will fit a ridge regressor and transform the target before the fit using a log transform.
import numpy as np
import scipy as sp
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import RidgeCV
from sklearn.compose import TransformedTargetRegressor
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=RidgeCV(),
func=np.log10,
inverse_func=sp.special.exp10
)
).fit(X_train, y_train)
We can check the performance of the model which we fitted.
from sklearn.metrics import median_absolute_error
def mae_scorer(model, X_train, X_test, y_train, y_test):
    """Return a string reporting the train and test median absolute error."""
    y_pred = model.predict(X_train)
    string_score = (
        f'MAE on training set: '
        f'{median_absolute_error(y_train, y_pred):.2f} $/hour')
    y_pred = model.predict(X_test)
    string_score += (
        f'\nMAE on testing set: '
        f'{median_absolute_error(y_test, y_pred):.2f} $/hour')
    return string_score
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 6))
y_pred = model.predict(X_test)
sns.regplot(y_test, y_pred)
plt.text(3, 20, mae_scorer(model, X_train, X_test, y_train, y_test))
plt.ylabel('Model predictions')
plt.xlabel('Truths')
plt.xlim([0, 27])
_ = plt.ylim([0, 27])
The model we learnt is far from being a good model making accurate predictions. Keep in mind that interpretation tools characterize the model rather than the generative process of the data itself. Thus, interpretations are only correct insofar as the model itself is correct.
So now, we can plot the values of the coefficients of the regressor which we fitted.
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate(
[feature_names, binary_columns, numerical_columns])
import pandas as pd
coefs = pd.DataFrame(
model.named_steps['transformedtargetregressor'].regressor_.coef_,
columns=['Coefficients'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
_ = plt.axvline(x=0, color='.5')
One limitation is that we cannot compare the different weights since we did not scale the data during fit and the features have different ranges. For instance, the "AGE" coefficient is expressed in "$/hour/living years" while the "EDUCATION" one is expressed in "$/hour/years of education".
X_train_preprocessed = pd.DataFrame(
model.named_steps['columntransformer'].transform(X_train),
columns=feature_names
)
X_train_preprocessed.std().plot(kind='barh', figsize=(9, 7))
_ = plt.title('Features std. dev.')
We can normalize the weights by the standard deviation of each feature; then we will be able to compare the different weights.
coefs = pd.DataFrame(
model.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std(),
columns=['Coefficients'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
_ = plt.axvline(x=0, color='.5')
We can interpret these values as follows: an increase of "AGE" induces a decrease of "WAGE" when all other features remain constant, while an increase of "EXPERIENCE" induces an increase of "WAGE" when all other features remain constant.
The first interpretation might look counter-intuitive at first, if one thinks of the relationship between "AGE" and "WAGE" as a marginal link. However, as previously mentioned, a linear model computes a conditional link between "AGE" and "WAGE" given all other features.
Therefore, one could interpret the coefficient as follows: for a given experience (and all other features constant as well), a younger person would have a higher wage.
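We can make this reading concrete with a small sketch: take one training sample and predict its wage twice, changing only "AGE" (the two age values below are made up):
person = X_train.iloc[[0]].copy()
younger, older = person.copy(), person.copy()
younger['AGE'], older['AGE'] = 30, 50
# with a negative "AGE" coefficient, the older profile gets the lower prediction
print(model.predict(younger), model.predict(older))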
Up to now, we did not check the stability of the coefficients; it is important to assess it, which we can do through cross-validation.
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedKFold
cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
return_estimator=True, n_jobs=-1
)
coefs = pd.DataFrame(
[est.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std()
for est in cv_model['estimator']],
columns=feature_names
)
plt.figure(figsize=(9, 7))
sns.swarmplot(data=coefs, orient='h', color='k', alpha=0.5)
sns.boxenplot(data=coefs, orient='h', color='C0')
plt.axvline(x=0, color='.5')
_ = plt.title('Stability of coefficients')
The "AGE" and "EXPERIENCE" are highly instable which might be due to the collinearity between the 2 features. We can remove on of the 2 features and check what is the impact on the features stability.
column_to_drop = ['AGE']
cv_model = cross_validate(
model, X.drop(columns=column_to_drop), y,
cv=RepeatedKFold(n_splits=5, n_repeats=5),
return_estimator=True, n_jobs=-1
)
coefs = pd.DataFrame(
[est.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.drop(columns=column_to_drop).std()
for est in cv_model['estimator']],
columns=feature_names[:-1]
)
plt.figure(figsize=(9, 7))
sns.swarmplot(data=coefs, orient='h', color='k', alpha=0.5)
sns.boxenplot(data=coefs, orient='h', color='C0')
plt.axvline(x=0, color='.5')
_ = plt.title('Stability of coefficients')
With real-life datasets, data live in a high-dimensional space where features are correlated and, to make things more difficult, we may have a limited number of samples. In these conditions, estimating the coefficients is really difficult, which is why we use regularization.
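As an aside, we can inspect the regularization strength that `RidgeCV` selected during its internal cross-validation (`alpha_` is the standard scikit-learn attribute for it):
print(model.named_steps['transformedtargetregressor'].regressor_.alpha_)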
Ridge implements an $l_2$ regularization; we could instead use Lasso, which implements an $l_1$ regularization. Lasso will select a subgroup of the variables; however, this selection can be unstable.
from sklearn.linear_model import LassoCV
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=LassoCV(max_iter=10000, cv=5),
func=np.log10,
inverse_func=sp.special.exp10
)
).fit(X_train, y_train)
coefs = pd.DataFrame(
model.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std(),
columns=['Coefficients'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
_ = plt.axvline(x=0, color='.5')
We can observe that some of the variables have been dropped (their coefficients are exactly 0). We can now check the stability of the coefficients.
cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
return_estimator=True, n_jobs=-1
)
coefs = pd.DataFrame(
[est.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std()
for est in cv_model['estimator']],
columns=feature_names
)
plt.figure(figsize=(9, 7))
sns.swarmplot(data=coefs, orient='h', color='k', alpha=0.5)
sns.boxenplot(data=coefs, orient='h', color='C0')
plt.axvline(x=0, color='.5')
_ = plt.title('Stability of coefficients')
In the following section, we will focus on the so-called "feature importance". Feature importances are primarily known from tree-based algorithms: they are based on the fraction of samples a feature contributes to, combined with the decrease in impurity obtained by splitting on it. However, we will see that this implementation suffers from some biases which we will highlight. To highlight the drawbacks of the random forest feature importances, we will add 2 random features (1 categorical feature and 1 numerical feature).
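Here is a minimal sketch of that impurity-based computation for a single fitted decision tree (the helper name `mdi_sketch` is ours; it only uses public `tree_` attributes and matches the tree's `feature_importances_` after normalization):
import numpy as np
def mdi_sketch(tree):
    # accumulate, for each feature, the weighted impurity decrease of the
    # splits that use it ("mean decrease in impurity")
    t = tree.tree_
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        importances[t.feature[node]] += decrease
    return importances / importances.sum()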
X = survey.data
y = survey.target
X['random_cat'] = np.random.randint(3, size=X.shape[0])
X['random_num'] = np.random.randn(X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
First, we will fit a random forest model. On purpose, we will grow each tree in the forest until each sample ends up in its own leaf, which implies that each tree will overfit. This allows us to highlight the issues with the feature importances computed by random forests.
from sklearn.ensemble import RandomForestRegressor
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR', 'random_cat']
binary_columns = ['MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE', 'random_num']
preprocessor = make_column_transformer(
(OneHotEncoder(), categorical_columns),
(OrdinalEncoder(), binary_columns),
remainder='passthrough'
)
model = make_pipeline(
preprocessor,
RandomForestRegressor(min_samples_leaf=1, random_state=42)
)
model.fit(X_train, y_train)
print(mae_scorer(model, X_train, X_test, y_train, y_test))
MAE on training set: 0.91 $/hour MAE on testing set: 2.22 $/hour
We are getting a much lower error on the training set than on the testing set, which is symptomatic of a model which overfits.
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate([feature_names, binary_columns, numerical_columns])
tree_feature_importances = pd.DataFrame(
model.named_steps['randomforestregressor'].feature_importances_,
index=feature_names,
columns=['RF Feature Importances']
)
_ = (tree_feature_importances.sort_values(by='RF Feature Importances')
.plot(kind='barh', figsize=(9, 7)))
The random forest feature importance reports that "random_num" is one of the most important features. This result is surprising since this feature does not have any predictive power.
We can also notice that the numerical features (random or not) are often much more important than any of the categorical features (random or not). This behavior is due to a bias linked with the cardinality of the unique values of a feature. Indeed, a higher cardinality means that the feature is more likely to be used to make a split within the tree. Therefore, a typically high-cardinality numerical feature will get a higher importance than a low-cardinality categorical feature.
Secondly, a random feature will appear important when a model is overfitted as RF feature importances are computed on training set statistics. They only reflect the use of the features to split training samples, not necessarily their usefulness to make decisions that can generalize to unseen data.
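Both biases can be reproduced on purely synthetic data. In the hypothetical sketch below, the target is pure noise, yet an overfitting forest assigns most of the impurity-based importance to the high-cardinality continuous feature:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
rng = np.random.RandomState(0)
n_samples = 1000
X_noise = np.column_stack([
    rng.randint(2, size=n_samples),  # low-cardinality (binary) noise
    rng.randn(n_samples),            # high-cardinality (continuous) noise
])
y_noise = rng.randn(n_samples)       # target unrelated to either feature
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_noise, y_noise)
print(forest.feature_importances_)   # the continuous feature dominates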
Instead of using the impurity-based feature importance, one should use the "permutation importance", which is a different way to characterize the importance of a feature, using random permutations. To measure the predictive power of a feature, we permute the values of this single feature and compute the resulting decrease in score. If the feature is predictive, the decrease in score will be large. We can repeat the permutation several times to get an estimate of the variance of the importances.
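The sketch below spells out the idea (the function name is ours, not scikit-learn's); it assumes a fitted estimator exposing `.score` and NumPy arrays `X_val` / `y_val`:
import numpy as np
def permutation_importance_sketch(estimator, X_val, y_val, n_repeats=10, seed=0):
    rng = np.random.RandomState(seed)
    baseline = estimator.score(X_val, y_val)
    importances = np.zeros((X_val.shape[1], n_repeats))
    for feat_idx in range(X_val.shape[1]):
        X_permuted = X_val.copy()
        for repeat in range(n_repeats):
            # shuffle a single column to break its link with the target
            X_permuted[:, feat_idx] = rng.permutation(X_val[:, feat_idx])
            importances[feat_idx, repeat] = (
                baseline - estimator.score(X_permuted, y_val))
    return importances  # large values indicate predictive features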
from sklearn.inspection import permutation_importance
X_test_preprocessed = (model.named_steps['columntransformer']
.transform(X_test))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_test_preprocessed, y_test, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (test set)")
Computing the permutation feature importances, we see that the random features do not rank among the top features anymore. However, we can see that "EXPERIENCE" is at the bottom of the ranking. This is due to the collinearity with the "AGE" feature. We will see later what the impact of removing one of the correlated features would be.
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_train_preprocessed, y_train, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (train set)")
In the previous example, we made the random forest overfit the training data on purpose, to exacerbate the behavior of the feature importances. We can now fit a model which should overfit less, by increasing the number of samples required to create a leaf in each tree.
model.set_params(randomforestregressor__min_samples_leaf=10)
model.fit(X_train, y_train)
print(mae_scorer(model, X_train, X_test, y_train, y_test))
MAE on training set: 2.00 $/hour MAE on testing set: 2.26 $/hour
The training performance decreased drastically and our model is no longer overfitting as before.
tree_feature_importances = pd.DataFrame(
model.named_steps['randomforestregressor'].feature_importances_,
index=feature_names,
columns=['RF Feature Importances']
)
_ = (tree_feature_importances.sort_values(by='RF Feature Importances')
.plot(kind='barh', figsize=(9, 7)))
We can see that the random features do not rank as high as before, but they are still considered important. We can compute the permutation importances to see whether the behaviour changes.
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_test_preprocessed, y_test, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (test set)")
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_train_preprocessed, y_train, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (train set)")
We see that the feature importances computed with permutations are more stable and quite similar to the ones of the first example.
We also saw an issue with the importance of the "EXPERIENCE" feature, which does not rank high. We suspected this to be linked to the correlation between this feature and "AGE". We can drop one of the 2 features and check the impact on the feature importances.
column_to_drop = ['AGE', 'random_num']
model.fit(X_train.drop(columns=column_to_drop), y_train)
print(
mae_scorer(model,
X_train.drop(columns=column_to_drop),
X_test.drop(columns=column_to_drop),
y_train, y_test)
)
MAE on training set: 2.07 $/hour MAE on testing set: 2.60 $/hour
tree_feature_importances = pd.DataFrame(
model.named_steps['randomforestregressor'].feature_importances_,
index=feature_names[:-2],
columns=['RF Feature Importances']
)
_ = (tree_feature_importances.sort_values(by='RF Feature Importances')
.plot(kind='barh', figsize=(9, 7)))
X_test_preprocessed = (model.named_steps['columntransformer']
.transform(X_test.drop(columns=column_to_drop)))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_test_preprocessed, y_test, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names[:-2])
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (test set)")
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train.drop(columns=column_to_drop)))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_train_preprocessed, y_train, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names[:-2])
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (train set)")
We can see that the "EXPERIENCE" feature starts to be more important. This behaviour is one of the drawback of the permutation importance. If permuting a feature which is correlated with another, there a chance for the model to use the correlated feature to make the decision. Subsquently, there is no decrease of accuracy hidding the importance of this feature. This is one of the limitation of the permutation feature importance.
The permutation importance can be computed for every type of model. Let's demonstrate with our original ridge model.
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=RidgeCV(),
func=np.log10,
inverse_func=sp.special.exp10
)
).fit(X_train, y_train)
X_test_preprocessed = (model.named_steps['columntransformer']
.transform(X_test))
importances = permutation_importance(
model.named_steps["transformedtargetregressor"],
X_test_preprocessed, y_test, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (test set)")
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train))
importances = permutation_importance(
model.named_steps["transformedtargetregressor"],
X_train_preprocessed, y_train, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (train set)")
By setting a given feature to a fixed value for all samples, averaging the predictions, and repeating this over a grid of values, we can compute the marginal relationship between this feature and the predicted target. The obtained plot is known as a partial dependence plot.
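A minimal sketch of this computation for a single numerical feature (the function name is ours; `plot_partial_dependence` below does this internally, modulo grid details):
import numpy as np
def partial_dependence_sketch(estimator, X, feature_idx, grid_resolution=20):
    grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(),
                       grid_resolution)
    averaged_predictions = []
    for value in grid:
        X_eval = X.copy()
        X_eval[:, feature_idx] = value  # fix the feature for all samples
        averaged_predictions.append(estimator.predict(X_eval).mean())
    return grid, np.array(averaged_predictions)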
Let's train a gradient boosting regressor on the census data, removing the "EXPERIENCE" column.
from sklearn.preprocessing import FunctionTransformer
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
X = survey.data
y = survey.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR']
binary_columns = ['MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'AGE']
preprocessor = make_column_transformer(
(OneHotEncoder(), categorical_columns),
(OrdinalEncoder(), binary_columns),
(FunctionTransformer(validate=False), numerical_columns)
)
model = make_pipeline(
preprocessor,
HistGradientBoostingRegressor(max_iter=100, max_leaf_nodes=5,
learning_rate=0.1, random_state=1)
)
model.fit(X_train, y_train)
print(mae_scorer(model, X_train, X_test, y_train, y_test))
MAE on training set: 1.99 $/hour MAE on testing set: 2.24 $/hour
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate(
[feature_names, binary_columns, numerical_columns])
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train))
X_test_preprocessed = (model.named_steps['columntransformer']
.transform(X_test))
We can then compute the partial dependence of "WAGE" on "EDUCATION", on "AGE", and on "EDUCATION" and "AGE" jointly.
from sklearn.inspection import plot_partial_dependence
features = ["EDUCATION", "AGE", ("EDUCATION", "AGE")]
fig, ax = plt.subplots(figsize=(15, 5))
_ = plot_partial_dependence(
model.named_steps['histgradientboostingregressor'],
X_train_preprocessed, features, feature_names=feature_names,
n_jobs=-1, grid_resolution=20, ax=ax
)
In the left-hand plot, we observe that "WAGE" increases with "EDUCATION", with a kink around 14 years of education. Regarding "AGE", we see an increase up to 40 years old, followed by a plateau.
However, it is important to notice that the plots are not smooth and that we should be careful not to over-interpret them. It can be a good idea to plot the individual partial dependence curves and not only their average.
features = ["EDUCATION", "AGE"]
fig, ax = plt.subplots(figsize=(10, 5))
_ = plot_partial_dependence(
model.named_steps['histgradientboostingregressor'],
X_train_preprocessed, features, feature_names=feature_names,
n_jobs=-1, grid_resolution=20, ax=ax, kind="both", subsample=50,
)
The "EDUCATION" plot looks quite stable while the 2 other plots shows variations. Thus, we should be careful when interpreting these plots. One reason for such variation should be linked to the size of the dataset which is rather small.
The "adult_censing" notebook is using a similar dataset with a larger number of sample.
from sklearn.preprocessing import StandardScaler
X = survey.data
y = survey.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR']
binary_columns = ['MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE']
preprocessor = make_column_transformer(
(OneHotEncoder(), categorical_columns),
(OrdinalEncoder(), binary_columns),
(StandardScaler(), numerical_columns)
)
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=RidgeCV(),
func=np.log10,
inverse_func=sp.special.exp10
)
).fit(X_train, y_train)
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate([feature_names, binary_columns, numerical_columns])
We can check the effect of the coefficients on the original data by multiplying the coefficients of the linear model by the preprocessed training data.
X_train_preprocessed = pd.DataFrame(
model.named_steps['columntransformer'].transform(X_train),
columns=feature_names
)
effect = (
    X_train_preprocessed
    * model.named_steps['transformedtargetregressor'].regressor_.coef_
)
# For the categorical variables, we can invert the one-hot encoding since
# only one of the categories is non-zero for a given sample
effect_decoded = {}
for cat in categorical_columns:
    cat_cols = effect.columns.str.contains(cat)
    effect_cat = effect.loc[:, cat_cols]
    idx_filter = (effect_cat != 0).values
    effect_decoded[cat] = effect_cat.values[idx_filter]
# For the other columns, we don't need to invert any encoding
for col in binary_columns + numerical_columns:
    effect_decoded[col] = effect[col]
effect = pd.DataFrame(effect_decoded)
plt.figure(figsize=(9, 7))
# subsample for the swarmplot
sns.swarmplot(data=effect.sample(100), orient='h', color='k', alpha=0.5)
sns.boxenplot(data=effect, orient='h', color='C0')
_ = plt.axvline(x=0, color='.5')
The largest contribution to the predicted wages comes from "EDUCATION", as shown by the large variance of its effect in the plot.