import seaborn as sns
sns.set_context('talk')
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
This notebook uses some features of scikit-learn which are still under development.
These features have been combined into a scikit-learn branch in the following repository: https://github.com/glemaitre/scikit-learn/tree/workshop
You can refer to the following documentation to install scikit-learn from such a source: https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge
We will use data from the "Current Population Survey" from 1985 and fetch it from OpenML.
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
We can get more information by looking at the description of the dataset.
print(survey.DESCR)
**Author**:
**Source**: Unknown - Date unknown
**Please cite**:

Determinants of Wages from the 1985 Current Population Survey

Summary: The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. We wish to determine (i) whether wages are related to these characteristics and (ii) whether there is a gender gap in wages.

Based on residual plots, wages were log-transformed to stabilize the variance. Age and work experience were almost perfectly correlated (r=.98). Multiple regression of log wages against sex, age, years of education, work experience, union membership, southern residence, and occupational status showed that these covariates were related to wages (pooled F test, p < .0001). The effect of age was not significant after controlling for experience. Standardized residual plots showed no patterns, except for one large outlier with lower wages than expected. This was a male, with 22 years of experience and 12 years of education, in a management position, who lived in the north and was not a union member. Removing this person from the analysis did not substantially change the results, so that the final model included the entire sample.

Adjusting for all other variables in the model, females earned 81% (75%, 88%) the wages of males (p < .0001). Wages increased 41% (28%, 56%) for every 5 additional years of education (p < .0001). They increased by 11% (7%, 14%) for every additional 10 years of experience (p < .0001). Union members were paid 23% (12%, 36%) more than non-union members (p < .0001). Northerns were paid 11% (2%, 20%) more than southerns (p = .016). Management and professional positions were paid most, and service and clerical positions were paid least (pooled F-test, p < .0001). Overall variance explained was R2 = .35.

In summary, many factors describe the variations in wages: occupational status, years of experience, years of education, sex, union membership and region of residence. However, despite adjustment for all factors that were available, there still appeared to be a gender gap in wages. There is no readily available explanation for this gender gap.

Authorization: Public Domain

Reference: Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley.

Description: The datafile contains 534 observations on 11 variables sampled from the Current Population Survey of 1985. This data set demonstrates multiple regression, confounding, transformations, multicollinearity, categorical variables, ANOVA, pooled tests of significance, interactions and model building strategies.

Variable names in order from left to right:
EDUCATION: Number of years of education.
SOUTH: Indicator variable for Southern Region (1=Person lives in South, 0=Person lives elsewhere).
SEX: Indicator variable for sex (1=Female, 0=Male).
EXPERIENCE: Number of years of work experience.
UNION: Indicator variable for union membership (1=Union member, 0=Not union member).
WAGE: Wage (dollars per hour).
AGE: Age (years).
RACE: Race (1=Other, 2=Hispanic, 3=White).
OCCUPATION: Occupational category (1=Management, 2=Sales, 3=Clerical, 4=Service, 5=Professional, 6=Other).
SECTOR: Sector (0=Other, 1=Manufacturing, 2=Construction).
MARR: Marital Status (0=Unmarried, 1=Married)

Therese Stukel, Dartmouth Hitchcock Medical Center, One Medical Center Dr., Lebanon, NH 03756, e-mail: stukel@dartmouth.edu

Information about the dataset: CLASSTYPE: numeric; CLASSINDEX: none specific

Downloaded from openml.org.
The data are stored in a pandas dataframe.
survey.data.head()
|   | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | no | female | 21.0 | not_member | 35.0 | Hispanic | Other | Manufacturing | Married |
| 1 | 9.0 | no | female | 42.0 | not_member | 57.0 | White | Other | Manufacturing | Married |
| 2 | 12.0 | no | male | 1.0 | not_member | 19.0 | White | Other | Manufacturing | Unmarried |
| 3 | 12.0 | no | male | 4.0 | not_member | 22.0 | White | Other | Other | Unmarried |
| 4 | 12.0 | no | male | 17.0 | not_member | 35.0 | White | Other | Other | Married |
survey.target_names
['WAGE']
The column WAGE is our target variable (i.e., the variable which we want to predict). You can note that the dataset contains both numerical and categorical data.
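To confirm this at a glance, we can peek at the column dtypes (a quick, optional check): object columns hold the categorical variables while float columns hold the numerical ones.
survey.data.dtypes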
First, let's get some insights by looking at the marginal links between the different variables. Only numerical variables will be used.
_ = sns.pairplot(survey.data, diag_kind='kde')
We can add some additional information to the previous plots.
g = sns.PairGrid(survey.data)
g = g.map_upper(
sns.regplot, scatter_kws={"color": "black", "alpha": 0.2, "s": 30},
line_kws={"color": "red"}
)
g = g.map_lower(sns.kdeplot, cmap="Reds_d")
g = g.map_diag(sns.kdeplot, shade=True, color='r')
We can already have some intuitions regarding our dataset: for instance, the "WAGE" distribution has a long tail, which suggests predicting the log of the wage.
It is important to point out that we analyzed the data by looking at the joint distribution between 2 variables, which is a marginal link.
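As a quick check of the long-tail intuition (a sketch; the imports are repeated so that the snippet is self-contained), we can compare the raw and log-scaled wage distributions:
import numpy as np
import matplotlib.pyplot as plt
fig, axs = plt.subplots(ncols=2, figsize=(10, 4))
# raw wages: long right tail
survey.target.plot.hist(bins=30, ax=axs[0], title='WAGE')
# log-scaled wages: closer to symmetric
np.log10(survey.target).plot.hist(bins=30, ax=axs[1], title='log10(WAGE)')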
We will shortly jump into interpreting the coefficients of a linear model. In this regard, we should emphasize that linear models compute conditional links: the interpretation of a coefficient value is the relationship between the feature and the target, given that the other features remain constant. For instance, we can deduce the relationship between "AGE" and "WAGE" for a given number of years of "EXPERIENCE".
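To make the marginal/conditional distinction concrete, here is a small synthetic sketch (all names and data below are made up): `x1` and `x2` are almost perfectly correlated and `y_demo` depends positively on `x1` but negatively on `x2`. Regressing on `x2` alone (marginal link) yields a positive slope, while regressing on both features (conditional link) recovers the negative effect:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
x1 = rng.randn(1000)
x2 = x1 + 0.1 * rng.randn(1000)  # x2 is almost a copy of x1
y_demo = 2 * x1 - x2 + 0.1 * rng.randn(1000)
# marginal link of y_demo with x2: positive slope
print(LinearRegression().fit(x2.reshape(-1, 1), y_demo).coef_)
# conditional link: approximately recovers [2, -1]
print(LinearRegression().fit(np.column_stack([x1, x2]), y_demo).coef_)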
X = survey.data
y = survey.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
We have heterogeneous data and we will need to apply a specific preprocessing to each data type.
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR']
binary_columns = ['MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE']
preprocessor = make_column_transformer(
(OneHotEncoder(), categorical_columns),
(OrdinalEncoder(), binary_columns),
remainder='passthrough'
)
Thus, the preprocessor will:

- one-hot encode the categorical columns (i.e., generate a column per category);
- encode the binary columns with an ordinal encoder (i.e., as 0/1);
- pass the numerical columns through unchanged ('passthrough').
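As a quick sanity check (a sketch; `X_train_checked` is just a throwaway name), we can fit the preprocessor on its own and verify the shape of the transformed data:
X_train_checked = preprocessor.fit_transform(X_train)
# one column per one-hot encoded category, plus the binary and numerical columns
print(X_train_checked.shape)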
We will fit a ridge regressor and transform the target before the fit using a log transform.
import numpy as np
import scipy as sp
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import RidgeCV
from sklearn.compose import TransformedTargetRegressor
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=RidgeCV(),
func=np.log10,
inverse_func=sp.special.exp10
)
).fit(X_train, y_train)
We can check the performance of the model which we fitted.
from sklearn.metrics import median_absolute_error
def mae_scorer(model, X_train, X_test, y_train, y_test):
    """Return a string reporting the train and test median absolute error."""
    y_pred = model.predict(X_train)
    string_score = (
        f'MAE on training set: '
        f'{median_absolute_error(y_train, y_pred):.2f} $/hour')
    y_pred = model.predict(X_test)
    string_score += (
        f'\nMAE on testing set: '
        f'{median_absolute_error(y_test, y_pred):.2f} $/hour')
    return string_score
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 6))
y_pred = model.predict(X_test)
sns.regplot(y_test, y_pred)
plt.text(3, 20, mae_scorer(model, X_train, X_test, y_train, y_test))
plt.ylabel('Model predictions')
plt.xlabel('Truths')
plt.xlim([0, 27])
_ = plt.ylim([0, 27])
The model we learnt is far from being a good model making accurate predictions. Keep in mind that interpretation tools characterize the model rather than the generative process of the data itself. Thus, interpretations are only correct insofar as the model itself is correct.
So now, we can plot the values of the coefficients of the regressor which we fitted.
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate(
[feature_names, binary_columns, numerical_columns])
import pandas as pd
coefs = pd.DataFrame(
model.named_steps['transformedtargetregressor'].regressor_.coef_,
columns=['Coefficients'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
_ = plt.axvline(x=0, color='.5')
One limitation is that we cannot compare the different weights since we did not scale the data during fit and the features have different ranges. For instance, the "AGE" coefficient is expressed in "$/hour/living years" while the "EDUCATION" one is expressed in "$/hour/years of education".
X_train_preprocessed = pd.DataFrame(
model.named_steps['columntransformer'].transform(X_train),
columns=feature_names
)
X_train_preprocessed.std().plot(kind='barh', figsize=(9, 7))
_ = plt.title('Features std. dev.')
We can normalize the weights by the standard deviation of each feature; then we will be able to compare the different weights.
coefs = pd.DataFrame(
model.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std(),
columns=['Coefficients'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
_ = plt.axvline(x=0, color='.5')
We can interpret these values as follows: an increase of "AGE" induces a decrease of "WAGE" when all other features remain constant, while an increase of "EXPERIENCE" induces an increase of "WAGE" when all other features remain constant.
The first interpretation might look counter-intuitive at first, if one thinks of the relationship between "AGE" and "WAGE" as a marginal link. However, as previously mentioned, a linear model computes a conditional link between "AGE" and "WAGE" given all other features.
Therefore, one could interpret the coefficient as follows: for a given experience (and all other features constant as well), a younger person would have a higher wage.
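We can make this reading concrete with a small sketch: take one training sample and predict its wage twice, changing only "AGE" (the two age values below are made up):
person = X_train.iloc[[0]].copy()
younger, older = person.copy(), person.copy()
younger['AGE'], older['AGE'] = 30, 50
# with a negative "AGE" coefficient, the older profile gets the lower prediction
print(model.predict(younger), model.predict(older))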
Up to now, we did not check the stability of the coefficients; it is important to assess it, which we can do through cross-validation.
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedKFold
cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
return_estimator=True, n_jobs=-1
)
coefs = pd.DataFrame(
[est.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std()
for est in cv_model['estimator']],
columns=feature_names
)
plt.figure(figsize=(9, 7))
sns.swarmplot(data=coefs, orient='h', color='k', alpha=0.5)
sns.boxenplot(data=coefs, orient='h', color='C0')
plt.axvline(x=0, color='.5')
_ = plt.title('Stability of coefficients')
The "AGE" and "EXPERIENCE" are highly instable which might be due to the collinearity between the 2 features. We can remove on of the 2 features and check what is the impact on the features stability.
column_to_drop = ['AGE']
cv_model = cross_validate(
model, X.drop(columns=column_to_drop), y,
cv=RepeatedKFold(n_splits=5, n_repeats=5),
return_estimator=True, n_jobs=-1
)
coefs = pd.DataFrame(
[est.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.drop(columns=column_to_drop).std()
for est in cv_model['estimator']],
columns=feature_names[:-1]
)
plt.figure(figsize=(9, 7))
sns.swarmplot(data=coefs, orient='h', color='k', alpha=0.5)
sns.boxenplot(data=coefs, orient='h', color='C0')
plt.axvline(x=0, color='.5')
_ = plt.title('Stability of coefficients')
With real-life datasets, data live in a high-dimensional space where features are correlated and, to make things more difficult, we may have a limited number of samples. In these conditions, estimating the coefficients is really difficult, which is why we use regularization.
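As an aside, we can inspect the regularization strength that `RidgeCV` selected during its internal cross-validation (`alpha_` is the standard scikit-learn attribute for it):
print(model.named_steps['transformedtargetregressor'].regressor_.alpha_)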
Ridge implements an $l_2$ regularization; we could instead use Lasso, which implements an $l_1$ regularization. Lasso will select a subgroup of the variables; however, this selection can be unstable.
from sklearn.linear_model import LassoCV
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=LassoCV(max_iter=10000, cv=5),
func=np.log10,
inverse_func=sp.special.exp10
)
).fit(X_train, y_train)
coefs = pd.DataFrame(
model.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std(),
columns=['Coefficients'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
_ = plt.axvline(x=0, color='.5')
We can observe that some of the variables have been dropped (their coefficients are exactly 0). We can now check the stability of the coefficients.
cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
return_estimator=True, n_jobs=-1
)
coefs = pd.DataFrame(
[est.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std()
for est in cv_model['estimator']],
columns=feature_names
)
plt.figure(figsize=(9, 7))
sns.swarmplot(data=coefs, orient='h', color='k', alpha=0.5)
sns.boxenplot(data=coefs, orient='h', color='C0')
plt.axvline(x=0, color='.5')
_ = plt.title('Stability of coefficients')
In the following section, we will focus on the so-called "feature importance". Feature importances are primarily known from tree-based algorithms: they are based on the fraction of samples a feature contributes to, combined with the decrease in impurity obtained by splitting on it. However, we will see that this implementation suffers from some biases which we will highlight. To highlight the drawbacks of the random forest feature importances, we will add 2 random features (1 categorical feature and 1 numerical feature).
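Here is a minimal sketch of that impurity-based computation for a single fitted decision tree (the helper name `mdi_sketch` is ours; it only uses public `tree_` attributes and matches the tree's `feature_importances_` after normalization):
import numpy as np
def mdi_sketch(tree):
    # accumulate, for each feature, the weighted impurity decrease of the
    # splits that use it ("mean decrease in impurity")
    t = tree.tree_
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        importances[t.feature[node]] += decrease
    return importances / importances.sum()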
X = survey.data
y = survey.target
X['random_cat'] = np.random.randint(3, size=X.shape[0])
X['random_num'] = np.random.randn(X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
First, we will fit a random forest model. On purpose, we will grow each tree in the forest until each sample ends up in its own leaf, which implies that each tree will overfit. This allows us to highlight the issues with the feature importances computed by random forests.
from sklearn.ensemble import RandomForestRegressor
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR', 'random_cat']
binary_columns = ['MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE', 'random_num']
preprocessor = make_column_transformer(
(OneHotEncoder(), categorical_columns),
(OrdinalEncoder(), binary_columns),
remainder='passthrough'
)
model = make_pipeline(
preprocessor,
RandomForestRegressor(min_samples_leaf=1, random_state=42)
)
model.fit(X_train, y_train)
print(mae_scorer(model, X_train, X_test, y_train, y_test))
MAE on training set: 0.91 $/hour MAE on testing set: 2.22 $/hour
We are getting a much lower error on the training set than on the testing set, which is symptomatic of a model which overfits.
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate([feature_names, binary_columns, numerical_columns])
tree_feature_importances = pd.DataFrame(
model.named_steps['randomforestregressor'].feature_importances_,
index=feature_names,
columns=['RF Feature Importances']
)
_ = (tree_feature_importances.sort_values(by='RF Feature Importances')
.plot(kind='barh', figsize=(9, 7)))
The random forest feature importance reports that "random_num" is one of the most important features. This result is surprising since this feature does not have any predictive power.
We can also notice that the numerical features (random or not) are often much more important than any of the categorical features (random or not). This behavior is due to a bias linked with the cardinality of the unique values of a feature. Indeed, a higher cardinality means that the feature is more likely to be used to make a split within the tree. Therefore, a typically high-cardinality numerical feature will get a higher importance than a low-cardinality categorical feature.
Secondly, a random feature will appear important when a model is overfitted as RF feature importances are computed on training set statistics. They only reflect the use of the features to split training samples, not necessarily their usefulness to make decisions that can generalize to unseen data.
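Both biases can be reproduced on purely synthetic data. In the hypothetical sketch below, the target is pure noise, yet an overfitting forest assigns most of the impurity-based importance to the high-cardinality continuous feature:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
rng = np.random.RandomState(0)
n_samples = 1000
X_noise = np.column_stack([
    rng.randint(2, size=n_samples),  # low-cardinality (binary) noise
    rng.randn(n_samples),            # high-cardinality (continuous) noise
])
y_noise = rng.randn(n_samples)       # target unrelated to either feature
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_noise, y_noise)
print(forest.feature_importances_)   # the continuous feature dominates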
Instead of using the impurity-based feature importance, one should use the "permutation importance", which is a different way to characterize the importance of a feature, using random permutations. To measure the predictive power of a feature, we permute the values of this single feature and compute the resulting decrease in score. If the feature is predictive, the decrease in score will be large. We can repeat the permutation several times to get an estimate of the variance of the importances.
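The sketch below spells out the idea (the function name is ours, not scikit-learn's); it assumes a fitted estimator exposing `.score` and NumPy arrays `X_val` / `y_val`:
import numpy as np
def permutation_importance_sketch(estimator, X_val, y_val, n_repeats=10, seed=0):
    rng = np.random.RandomState(seed)
    baseline = estimator.score(X_val, y_val)
    importances = np.zeros((X_val.shape[1], n_repeats))
    for feat_idx in range(X_val.shape[1]):
        X_permuted = X_val.copy()
        for repeat in range(n_repeats):
            # shuffle a single column to break its link with the target
            X_permuted[:, feat_idx] = rng.permutation(X_val[:, feat_idx])
            importances[feat_idx, repeat] = (
                baseline - estimator.score(X_permuted, y_val))
    return importances  # large values indicate predictive features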
from sklearn.inspection import permutation_importance
X_test_preprocessed = (model.named_steps['columntransformer']
.transform(X_test))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_test_preprocessed, y_test, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (test set)")
Computing the permutation feature importances, we see that the random features do not rank among the top features anymore. However, we can see that "EXPERIENCE" is at the bottom of the ranking. This is due to the collinearity with the "AGE" feature. We will see later what the impact of removing one of the correlated features would be.
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_train_preprocessed, y_train, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (train set)")
In the previous example, we made the random forest overfit the training data on purpose, to exacerbate the behavior of the feature importances. We can now fit a model which should overfit less, by increasing the number of samples required to create a leaf in each tree.
model.set_params(randomforestregressor__min_samples_leaf=10)
model.fit(X_train, y_train)
print(mae_scorer(model, X_train, X_test, y_train, y_test))
MAE on training set: 2.00 $/hour MAE on testing set: 2.26 $/hour
The training performance decreased drastically and our model is no longer overfitting as before.
tree_feature_importances = pd.DataFrame(
model.named_steps['randomforestregressor'].feature_importances_,
index=feature_names,
columns=['RF Feature Importances']
)
_ = (tree_feature_importances.sort_values(by='RF Feature Importances')
.plot(kind='barh', figsize=(9, 7)))
We can see that the random features do not rank as high as before, but they are still considered important. We can compute the permutation importances to see whether the behaviour changes.
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_test_preprocessed, y_test, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (test set)")
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_train_preprocessed, y_train, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (train set)")
We see that the feature importances computed with permutations are more stable and quite similar to the ones of the first example.
We also saw an issue with the importance of the "EXPERIENCE" feature, which does not rank high. We suspected this to be linked to the correlation between this feature and "AGE". We can drop one of the 2 features and check the impact on the feature importances.
column_to_drop = ['AGE', 'random_num']
model.fit(X_train.drop(columns=column_to_drop), y_train)
print(
mae_scorer(model,
X_train.drop(columns=column_to_drop),
X_test.drop(columns=column_to_drop),
y_train, y_test)
)
MAE on training set: 2.07 $/hour MAE on testing set: 2.60 $/hour
tree_feature_importances = pd.DataFrame(
model.named_steps['randomforestregressor'].feature_importances_,
index=feature_names[:-2],
columns=['RF Feature Importances']
)
_ = (tree_feature_importances.sort_values(by='RF Feature Importances')
.plot(kind='barh', figsize=(9, 7)))
X_test_preprocessed = (model.named_steps['columntransformer']
.transform(X_test.drop(columns=column_to_drop)))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_test_preprocessed, y_test, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names[:-2])
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (test set)")
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train.drop(columns=column_to_drop)))
importances = permutation_importance(
model.named_steps["randomforestregressor"],
X_train_preprocessed, y_train, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names[:-2])
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (train set)")
We can see that the "EXPERIENCE" feature starts to be more important. This behaviour is one of the drawback of the permutation importance. If permuting a feature which is correlated with another, there a chance for the model to use the correlated feature to make the decision. Subsquently, there is no decrease of accuracy hidding the importance of this feature. This is one of the limitation of the permutation feature importance.
The permutation importance can be computed for every type of model. Let's demonstrate with our original ridge model.
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=RidgeCV(),
func=np.log10,
inverse_func=sp.special.exp10
)
).fit(X_train, y_train)
X_test_preprocessed = (model.named_steps['columntransformer']
.transform(X_test))
importances = permutation_importance(
model.named_steps["transformedtargetregressor"],
X_test_preprocessed, y_test, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (test set)")
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train))
importances = permutation_importance(
model.named_steps["transformedtargetregressor"],
X_train_preprocessed, y_train, n_repeats=30,
n_jobs=-1)
importances = pd.DataFrame(
importances.importances.T,
columns=feature_names)
# sort (reorder columns) the DataFrame for the plotting
importances = importances.reindex(
importances.mean().sort_values(ascending=False).index,
axis="columns")
plt.figure(figsize=(9, 8))
sns.boxplot(data=importances, orient='h', color='C0')
sns.swarmplot(data=importances, orient='h', color='k', alpha=0.3)
plt.axvline(x=0, color='.5')
_ = plt.title("Permutation Importances (train set)")
By setting a given feature to a fixed value for all samples, averaging the predictions, and repeating this over a grid of values, we can compute the marginal relationship between this feature and the predicted target. The obtained plot is known as a partial dependence plot.
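A minimal sketch of this computation for a single numerical feature (the function name is ours; `plot_partial_dependence` below does this internally, modulo grid details):
import numpy as np
def partial_dependence_sketch(estimator, X, feature_idx, grid_resolution=20):
    grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(),
                       grid_resolution)
    averaged_predictions = []
    for value in grid:
        X_eval = X.copy()
        X_eval[:, feature_idx] = value  # fix the feature for all samples
        averaged_predictions.append(estimator.predict(X_eval).mean())
    return grid, np.array(averaged_predictions)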
Let's train a gradient boosting regressor on the census data, removing the "EXPERIENCE" column.
from sklearn.preprocessing import FunctionTransformer
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
X = survey.data
y = survey.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR']
binary_columns = ['MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'AGE']
preprocessor = make_column_transformer(
(OneHotEncoder(), categorical_columns),
(OrdinalEncoder(), binary_columns),
(FunctionTransformer(validate=False), numerical_columns)
)
model = make_pipeline(
preprocessor,
HistGradientBoostingRegressor(max_iter=100, max_leaf_nodes=5,
learning_rate=0.1, random_state=1)
)
model.fit(X_train, y_train)
print(mae_scorer(model, X_train, X_test, y_train, y_test))
MAE on training set: 1.99 $/hour MAE on testing set: 2.24 $/hour
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate(
[feature_names, binary_columns, numerical_columns])
X_train_preprocessed = (model.named_steps['columntransformer']
.transform(X_train))
X_test_preprocessed = (model.named_steps['columntransformer']
.transform(X_test))
We can then compute the partial dependence of "WAGE" on "EDUCATION", on "AGE", and on "EDUCATION" and "AGE" jointly.
from sklearn.inspection import plot_partial_dependence
features = ["EDUCATION", "AGE", ("EDUCATION", "AGE")]
fig, ax = plt.subplots(figsize=(15, 5))
_ = plot_partial_dependence(
model.named_steps['histgradientboostingregressor'],
X_train_preprocessed, features, feature_names=feature_names,
n_jobs=-1, grid_resolution=20, ax=ax
)
In the left-hand plot, we observe that "WAGE" increases with "EDUCATION", with a kink around 14 years of education. Regarding "AGE", we see an increase up to 40 years old, followed by a plateau.
However, it is important to notice that the plots are not smooth and that we should be careful not to over-interpret them. It can be a good idea to plot the individual partial dependence curves and not only their average.
features = ["EDUCATION", "AGE"]
fig, ax = plt.subplots(figsize=(10, 5))
_ = plot_partial_dependence(
model.named_steps['histgradientboostingregressor'],
X_train_preprocessed, features, feature_names=feature_names,
n_jobs=-1, grid_resolution=20, ax=ax, kind="both", subsample=50,
)
The "EDUCATION" plot looks quite stable while the 2 other plots shows variations. Thus, we should be careful when interpreting these plots. One reason for such variation should be linked to the size of the dataset which is rather small.
The "adult_censing" notebook is using a similar dataset with a larger number of sample.
from sklearn.preprocessing import StandardScaler
X = survey.data
y = survey.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR']
binary_columns = ['MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE']
preprocessor = make_column_transformer(
(OneHotEncoder(), categorical_columns),
(OrdinalEncoder(), binary_columns),
(StandardScaler(), numerical_columns)
)
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=RidgeCV(),
func=np.log10,
inverse_func=sp.special.exp10
)
).fit(X_train, y_train)
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate([feature_names, binary_columns, numerical_columns])
We can check the effect of the coefficients on the original data by multiplying the coefficients of the linear model by the preprocessed training data.
X_train_preprocessed = pd.DataFrame(
model.named_steps['columntransformer'].transform(X_train),
columns=feature_names
)
effect = (
    X_train_preprocessed
    * model.named_steps['transformedtargetregressor'].regressor_.coef_
)
# For the categorical variables, we can invert the one-hot encoding since
# only one of the categories is non-zero for a given sample
effect_decoded = {}
for cat in categorical_columns:
    cat_cols = effect.columns.str.contains(cat)
    effect_cat = effect.loc[:, cat_cols]
    idx_filter = (effect_cat != 0).values
    effect_decoded[cat] = effect_cat.values[idx_filter]
# For the other columns, we don't need to invert any encoding
for col in binary_columns + numerical_columns:
    effect_decoded[col] = effect[col]
effect = pd.DataFrame(effect_decoded)
plt.figure(figsize=(9, 7))
# subsample for the swarmplot
sns.swarmplot(data=effect.sample(100), orient='h', color='k', alpha=0.5)
sns.boxenplot(data=effect, orient='h', color='C0')
_ = plt.axvline(x=0, color='.5')
The largest contribution to the predicted wages comes from "EDUCATION", as shown by the large variance of its effect in the plot.