# FeatureBinarizerFromTrees

The FeatureBinarizerFromTrees transformer binarizes features for BooleanRuleCG (BRCG), LogisticRuleRegression (LogRR), and LinearRuleRegression (LinearRR) models. It generates binary features (i.e., rules) based on the splits in fitted decision trees. This approach naturally creates optimal thresholds and returns only important features. Compared to FeatureBinarizer, the FeatureBinarizerFromTrees transformer reduces the number of features required to produce an accurate model. Not only does this shorten training times, but more importantly, it often results in simpler rule sets.
This notebook demonstrates basic FeatureBinarizerFromTrees usage, compares it with FeatureBinarizer, and concludes with a formal performance comparison.
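Before diving in, here is a minimal sketch of the underlying idea using plain scikit-learn (an illustration only, not the aix360 implementation): fit a shallow decision tree, harvest its split thresholds, and turn each (feature, threshold) pair into a binary column. The breast-cancer dataset is used here purely because it is readily available.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Internal nodes store a feature index >= 0; leaves are marked with -2.
t = tree.tree_
splits = {(X.columns[f], round(th, 3))
          for f, th in zip(t.feature, t.threshold) if f >= 0}

# Each (feature, threshold) pair yields a binary '<=' column; its '>'
# complement is simply 1 minus this column.
binarized = pd.DataFrame({f'{c} <= {v}': (X[c] <= v).astype(int)
                          for c, v in sorted(splits)})
print(binarized.shape)
```

A depth-4 tree has at most 15 internal splits, so this sketch yields at most 15 such columns (30 with complements), which is exactly the feature budget discussed below.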
import warnings
warnings.filterwarnings('ignore')
from feature_binarizer_from_trees_demo import fbt_vs_fb_crime, format_results, get_corr_columns, print_metrics
import numpy as np
import pandas as pd
from pandas import DataFrame
import pickle
from time import time
from aix360.algorithms.rbm import BooleanRuleCG, BRCGExplainer, FeatureBinarizer, FeatureBinarizerFromTrees, \
GLRMExplainer, LogisticRuleRegression
from aix360.datasets.heloc_dataset import HELOCDataset, nan_preprocessing
from aix360.datasets import MEPSDataset
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
def print_brcg_rules(rules):
    print('Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:\n')
    for rule in rules:
        print(f'  - {rule}')
    print()
def fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test, lambda0=0.001, lambda1=0.001):
    bcrg = BooleanRuleCG(lambda0, lambda1, silent=True)
    explainer = BRCGExplainer(bcrg)
    t = time()
    explainer.fit(X_train_b, y_train)
    print(f'Model trained in {time() - t:0.1f} seconds\n')
    print_metrics(y_test, explainer.predict(X_test_b))
    print_brcg_rules(explainer.explain()['rules'])
def fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test):
    logrr = LogisticRuleRegression(lambda0=0.005, lambda1=0.001, useOrd=True, maxSolverIter=1000)
    explainer = GLRMExplainer(logrr)
    t = time()
    explainer.fit(X_train_b, y_train, X_train_std)
    print(f'Model trained in {time() - t:0.1f} seconds\n')
    print_metrics(y_test, explainer.predict(X_test_b, X_test_std))
    return explainer.explain()
Create a binary classification problem to predict the top 25% of violent crimes from a subset of the UCI Communities and Crime data.
X, y = shap.datasets.communitiesandcrime()
y = (y >= np.percentile(y, 75)).astype(int)
After dropping highly correlated columns, there are 88 ordinal features.
X.drop(columns=get_corr_columns(X), inplace=True)
print(X.shape)
(1994, 88)
Split the data: 2/3 training, 1/3 test.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
## FeatureBinarizerFromTrees with BRCG

The code below initializes the default transformer and transforms the data. The default transformer uses one decision tree with a maximum node depth of 4. This will create up to 30 features. In this case, it generates only 28 features. Perhaps the fitted tree didn't require the maximum number of nodes to fit the data, or the binarizer dropped duplicates.
fbt = FeatureBinarizerFromTrees(randomState=0)
X_train_b = fbt.fit_transform(X_train, y_train)
X_test_b = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
28 features.
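The "up to 30 features" budget follows from simple counting: a binary tree of depth 4 has at most 2^4 - 1 = 15 internal splits, and each split yields a complementary <=/> feature pair. A quick sketch of that arithmetic (the helper name is ours for illustration, not an aix360 API):

```python
def max_binarized_features(tree_depth: int, tree_num: int = 1) -> int:
    # A binary tree of depth d has at most 2**d - 1 internal splits;
    # each split produces a complementary <= / > feature pair.
    return tree_num * 2 * (2 ** tree_depth - 1)

print(max_binarized_features(4))     # default: one tree of depth 4 -> up to 30
print(max_binarized_features(3, 3))  # treeNum=3, treeDepth=3 -> up to 42
```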
An obvious question might have crossed your mind by now: "If you can sufficiently describe the data with features from a simple decision tree, why not use the decision tree as the explainable model?" It is true that a decision tree may be a satisfactory model in some cases. However, rule sets generated by BRCG can be simpler and more accessible than a decision tree. Furthermore, this is just an introductory example. As we will show, we often need more than one simple decision tree to generate features for an accurate model.
Here are the binarized features for PctKidsBornNeverMar. The binarizer selected two thresholds: 2.64 and 4.26. There are two complementary features for each threshold, with operators <= and >. Additional operators are supported for categorical and binary features, but are not shown in this notebook.
X_train_b['PctKidsBornNeverMar'].head()
| operation | <=   | <=   | >    | >    |
|-----------|------|------|------|------|
| value     | 2.64 | 4.26 | 2.64 | 4.26 |
| 1765      | 0    | 0    | 1    | 1    |
| 2164      | 1    | 1    | 0    | 0    |
| 1691      | 0    | 1    | 1    | 0    |
| 697       | 0    | 1    | 1    | 0    |
| 39        | 1    | 1    | 0    | 0    |
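The complementary structure of the <= and > columns can be reproduced with a toy pandas sketch (the thresholds 2.64 and 4.26 come from above; the sample values are invented):

```python
import pandas as pd

vals = pd.Series([3.1, 1.2, 2.9, 4.5, 0.7])
thresholds = [2.64, 4.26]

# Build a <= and a > column per threshold, mirroring the binarizer's layout.
cols = {('<=', t): (vals <= t).astype(int) for t in thresholds}
cols.update({('>', t): (vals > t).astype(int) for t in thresholds})
b = pd.DataFrame(cols)
b.columns = pd.MultiIndex.from_tuples(b.columns, names=['operation', 'value'])

# For every threshold, the '>' column is the logical complement of '<='.
for t in thresholds:
    assert (b[('<=', t)] + b[('>', t)] == 1).all()
print(b)
```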
Here, we fit the model, predict the test set, and display the training time, test metrics, and rule set.
The model trains in roughly 4 seconds and creates a simple, one-rule model with almost 84% accuracy using default BRCG model parameters.
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 4.3 seconds

Accuracy = 0.83567
Precision = 0.74157
Recall = 0.52800
F1 = 0.61682
F1 Weighted = 0.82562

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - FemalePctDiv > 10.88 AND PctKidsBornNeverMar > 4.26 AND PctPopUnderPov > 9.80 AND PctSpeakEnglOnly <= 97.82
For this data set, we can easily improve the fit by changing a few binarizer parameters. The parameter values used here were manually selected for demonstration purposes. More optimal values are possible, especially if the BRCG parameters are also tuned.
fbt = FeatureBinarizerFromTrees(treeNum=3, treeDepth=3, treeFeatureSelection=0.5, threshRound=0, randomState=0)
X_train_b = fbt.fit_transform(X_train, y_train)
X_test_b = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
36 features.
Here we describe the key parameters used above. (See the FeatureBinarizerFromTrees API for a full list of arguments.)

- treeNum - The number of trees to fit. A value greater than one encourages a greater variety of features and thresholds.
- treeDepth - The depth of the fitted decision trees. The greater the depth, the more features are generated.
- treeFeatureSelection - The proportion of randomly chosen input features to consider at each split in the decision tree. When more than one tree is specified, this encourages a greater variety of features. See the API documentation for a full list of options.
- threshRound - Round the threshold values to the given number of decimal places. Rounding the thresholds prevents near-duplicate thresholds like 1.01 and 1.0. In the crime data, most of the features are ratios and integers, so rounding to the nearest integer value is acceptable.

The model trains in around 10 seconds and appears to improve accuracy significantly. Though more features improved the fit in this case, it is important to point out that more features are not always better. For both explainability and accuracy, we suggest starting with a small number of features. From there, increase the number of features incrementally until accuracy plateaus or the explanation is sufficient.
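The effect of threshRound can be illustrated in isolation (plain Python, not the aix360 internals): rounding collapses near-duplicate thresholds into a single candidate.

```python
# Hypothetical thresholds harvested from several trees.
thresholds = [1.0, 1.01, 4.26, 10.88, 10.9]

# threshRound=0 rounds to whole numbers before deduplicating.
rounded = sorted({round(t, 0) for t in thresholds})
print(rounded)  # 1.0 and 1.01 merge, as do 10.88 and 10.9 -> [1.0, 4.0, 11.0]
```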
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 10.1 seconds

Accuracy = 0.86774
Precision = 0.81720
Recall = 0.60800
F1 = 0.69725
F1 Weighted = 0.86074

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - HousVacant > 2172.00 AND PctKidsBornNeverMar > 3.00
  - FemalePctDiv > 12.00 AND PctKidsBornNeverMar > 4.00 AND PctPopUnderPov > 10.00 AND PctSpeakEnglOnly <= 98.00 AND racePctWhite <= 79.00
## FeatureBinarizerFromTrees with Linear Models

To use FeatureBinarizerFromTrees with LogRR and LinearRR, set returnOrd=True. Like the standard FeatureBinarizer, the transformer will return a standardized data frame of ordinal features in addition to the binarized features. The standardized features can then be passed to the linear model to improve accuracy. (Make sure to set useOrd=True for the linear model.)
fbt = FeatureBinarizerFromTrees(treeNum=2, treeDepth=4, treeFeatureSelection=None, threshRound=0, returnOrd=True,
randomState=0)
X_train_b, X_train_std = fbt.fit_transform(X_train, y_train)
X_test_b, X_test_std = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
28 features.
The explanation for the fitted linear model lists the features in descending order by linear coefficient magnitude. For this feature set, the linear model does not appear to improve the accuracy significantly.
fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test)
Model trained in 5.5 seconds

Accuracy = 0.86373
Precision = 0.79381
Recall = 0.61600
F1 = 0.69369
F1 Weighted = 0.85759
| | rule/numerical feature | coefficient |
|---|---|---|
0 | (intercept) | -1.9073 |
1 | PctKidsBornNeverMar <= 4.00 AND PctSpeakEnglOn... | -3.12689 |
2 | FemalePctDiv > 11.00 AND PctPopUnderPov > 10.0... | 2.35018 |
3 | PctKidsBornNeverMar <= 4.00 | 2.16666 |
4 | PctKidsBornNeverMar | 2.09972 |
5 | FemalePctDiv > 11.00 AND OwnOccQrange > 37500.... | 1.70801 |
6 | FemalePctDiv | 1.07142 |
7 | HousVacant <= 2172.00 | -0.868909 |
8 | FemalePctDiv > 11.00 AND PctPopUnderPov > 10.0... | -0.682629 |
9 | HousVacant | 0.65058 |
10 | PctKidsBornNeverMar <= 3.00 | -0.640423 |
11 | FemalePctDiv > 11.00 AND PctEmplManu <= 15.00 ... | 0.597499 |
12 | PctSpeakEnglOnly | -0.510433 |
13 | pctWInvInc <= 38.00 | 0.484888 |
14 | pctWInvInc | -0.476527 |
15 | pctWWage <= 74.00 | 0.381191 |
16 | PctEmplManu | -0.315508 |
## FeatureBinarizer

The standard FeatureBinarizer creates thresholds by binning the data into a user-specified number of quantiles. The default setting of 9 thresholds creates 1,528 features for these data when negations are enabled. This is a very large feature space.
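As a rough sketch of the quantile approach (simplified; see the FeatureBinarizer source for the exact rule), numThresh=9 places thresholds at the 10th through 90th percentiles of each column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
col = pd.Series(rng.gamma(2.0, 2.0, size=1000))  # a stand-in ordinal column

numThresh = 9
quantiles = np.arange(1, numThresh + 1) / (numThresh + 1)  # 0.1, 0.2, ..., 0.9
thresholds = col.quantile(quantiles).to_numpy()

# With negations=True, each threshold yields both a <= and a > column,
# so a single ordinal column contributes 2 * numThresh binary features.
print(len(thresholds), 2 * numThresh)
```

Unlike the tree-based binarizer, these thresholds are chosen without looking at the labels, which is why so many of the resulting features turn out to be unimportant.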
fb = FeatureBinarizer(negations=True)
X_train_b = fb.fit_transform(X_train)
X_test_b = fb.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
1528 features.
Here are the binary features associated with the PctKidsBornNeverMar input feature.
X_train_b['PctKidsBornNeverMar'].head()
| operation | <= | <= | <= | <= | <= | <= | <= | <= | <= | > | > | > | > | > | > | > | > | > |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| value | 0.570 | 0.930 | 1.240 | 1.670 | 2.130 | 2.770 | 3.528 | 4.942 | 7.432 | 0.570 | 0.930 | 1.240 | 1.670 | 2.130 | 2.770 | 3.528 | 4.942 | 7.432 |
| 1765 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2164 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1691 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 697 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 39 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The model takes more than 5 minutes to train. The test accuracy also appears to be lower and the rule set is complex.
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 322.8 seconds

Accuracy = 0.83166
Precision = 0.72527
Recall = 0.52800
F1 = 0.61111
F1 Weighted = 0.82207

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - FemalePctDiv > 15.19 AND PctKidsBornNeverMar > 7.43
  - PctUnemployed > 5.56 AND PctImmigRec5 <= 29.48 AND NumStreet > 14.00
  - PctFam2Par <= 60.46 AND PctSpeakEnglOnly <= 96.91 AND PersPerRentOccHous > 2.23
  - PctTeen2Par <= 68.20 AND PctKidsBornNeverMar > 2.77 AND MedOwnCostPctIncNoMtg <= 13.10 AND PctSameHouse85 > 42.21
  - pctWInvInc <= 38.79 AND blackPerCap > 6280.60 AND PctKidsBornNeverMar > 4.94 AND OwnOccQrange > 31100.00 AND RentQrange > 145.00
A more reasonable number of thresholds for this data set is 4. This setting generates 688 features. The accuracy is now comparable with the previous results, but it still takes approximately ten times longer to train the model (compared to 10 seconds). The rule set is also complex.
fb = FeatureBinarizer(negations=True, numThresh=4)
X_train_b = fb.fit_transform(X_train)
X_test_b = fb.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
688 features.
Model trained in 102.8 seconds

Accuracy = 0.85371
Precision = 0.76000
Recall = 0.60800
F1 = 0.67556
F1 Weighted = 0.84795

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - pctWInvInc <= 38.79 AND blackPerCap > 6280.60 AND PersPerFam > 3.05 AND PctKidsBornNeverMar > 4.94
  - racepctblack > 16.73 AND blackPerCap <= 11133.40 AND PctImmigRecent > 19.66 AND PctUsePubTrans > 0.25
  - agePct12t29 <= 27.62 AND PctTeen2Par <= 68.20 AND PctKidsBornNeverMar > 2.77 AND PctSameState85 <= 91.12
  - racepctblack > 1.72 AND PctEmploy <= 68.48 AND MalePctDivorce > 8.43 AND FemalePctDiv > 13.42 AND PctKidsBornNeverMar > 2.77 AND PctImmigRecent > 5.68 AND PctHousOccup <= 96.34 AND NumInShelters > 5.00
This LogRR model has comparable accuracy to the one trained with FeatureBinarizerFromTrees, but it takes 3 minutes to train and it also has a more complex rule set.
fb = FeatureBinarizer(negations=True, returnOrd=True, numThresh=4)
X_train_b, X_train_std = fb.fit_transform(X_train)
X_test_b, X_test_std = fb.transform(X_test)
fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test)
Model trained in 180.1 seconds

Accuracy = 0.86172
Precision = 0.75000
Recall = 0.67200
F1 = 0.70886
F1 Weighted = 0.85911
| | rule/numerical feature | coefficient |
|---|---|---|
0 | (intercept) | -1.6929 |
1 | FemalePctDiv > 9.41 AND PctFam2Par <= 83.38 AN... | 1.89106 |
2 | blackPerCap <= 11133.40 AND FemalePctDiv > 9.41 | 1.44801 |
3 | agePct12t29 | -1.37336 |
4 | FemalePctDiv > 9.41 AND MedRentPctHousInc > 24.00 | 1.24788 |
5 | PctKidsBornNeverMar | 1.06566 |
6 | FemalePctDiv > 9.41 AND OwnOccLowQuart > 39380.00 | -0.972519 |
7 | PctHousOccup | -0.908306 |
8 | MedOwnCostPctIncNoMtg | -0.902969 |
9 | racepctblack <= 5.28 | -0.800059 |
10 | racepctblack > 1.72 AND FemalePctDiv > 9.41 | 0.787698 |
11 | racepctblack <= 16.73 | -0.767632 |
12 | PctOccupManu <= 14.70 | -0.75916 |
13 | PctTeen2Par <= 68.20 | 0.711791 |
14 | NumStreet <= 3.00 | -0.632624 |
15 | PctPopUnderPov <= 7.42 | -0.592511 |
16 | RentQrange <= 192.00 | -0.589909 |
17 | FemalePctDiv > 9.41 AND OwnOccQrange > 31100.00 | 0.560425 |
18 | PctPersDenseHous <= 1.90 | -0.53483 |
19 | PctImmigRecent <= 10.34 | -0.518697 |
20 | PctVacMore6Mos <= 37.50 | 0.480573 |
21 | HispPerCap > 6657.60 AND FemalePctDiv > 9.41 A... | 0.460934 |
22 | PctKidsBornNeverMar <= 2.77 | -0.45671 |
23 | pctWPubAsst <= 4.74 | -0.443499 |
24 | PctWOFullPlumb | 0.430075 |
25 | LandArea <= 17.80 | -0.42746 |
26 | population <= 18846.60 | -0.410339 |
27 | PctWOFullPlumb <= 0.64 | -0.404707 |
28 | FemalePctDiv > 9.41 AND PctWorkMomYoungKids <=... | 0.394237 |
29 | MedRentPctHousInc <= 26.90 | -0.389177 |
30 | pctUrban <= 99.47 | -0.3877 |
31 | MalePctDivorce <= 11.51 | -0.378395 |
32 | PctForeignBorn | 0.37134 |
33 | FemalePctDiv > 9.41 AND PctUsePubTrans > 0.25 | 0.358929 |
34 | HousVacant <= 1573.80 | 0.347633 |
35 | HousVacant | 0.347582 |
36 | HousVacant <= 761.80 | 0.326033 |
37 | PctNotHSGrad <= 24.46 | -0.313271 |
38 | FemalePctDiv > 9.41 AND PctUsePubTrans > 0.81 | 0.301539 |
39 | PctPopUnderPov > 7.42 AND FemalePctDiv > 9.41 | 0.2864 |
40 | PctUnemployed <= 6.23 | -0.229864 |
41 | FemalePctDiv <= 13.42 | -0.226307 |
42 | FemalePctDiv <= 11.60 | -0.22487 |
43 | LandArea | 0.20781 |
44 | LemasPctOfficDrugUn | 0.200326 |
45 | PersPerRentOccHous <= 2.39 | -0.196087 |
46 | pctWInvInc <= 38.79 | 0.190517 |
47 | NumInShelters <= 5.00 | -0.173206 |
48 | racePctWhite | -0.157085 |
49 | NumStreet <= 0.00 | 0.156811 |
50 | PctPersDenseHous | 0.109269 |
51 | racePctWhite > 84.60 | -0.0930171 |
52 | FemalePctDiv > 9.41 AND MedRentPctHousInc > 26.90 | -0.0847854 |
53 | FemalePctDiv > 9.41 AND RentQrange > 134.00 | -0.0266475 |
54 | PctKidsBornNeverMar <= 4.94 | 0.024893 |
Finally, we provide a formal performance comparison between FeatureBinarizerFromTrees and FeatureBinarizer over 30 random train-test splits. The settings for the binarizers and models are as follows.

- FeatureBinarizerFromTrees: treeNum=2, treeDepth=4, treeFeatureSelection=None, returnOrd=True
- FeatureBinarizer: numThresh=4, negations=True, returnOrd=True
- BooleanRuleCG: defaults
- LogisticRuleRegression: lambda0=0.005, lambda1=0.001, useOrd=True, maxSolverIter=1000
This process takes over two hours to run, so we saved the output and loaded it here for display. To re-run the test, uncomment the code below.
# %%time
# df = fbt_vs_fb_crime(iterations=30, treeNum=2, treeDepth=4, numThresh=4, filename='./data/crime.pkl')
Wall time: 2h 17min 35s
In the table below, 'fb' and 'fbt' indicate models fit with FeatureBinarizer and FeatureBinarizerFromTrees, respectively.
For these data and settings, the output shows that models trained using FeatureBinarizerFromTrees
fit, on average, in less than 1/10th of the time and generate rule sets with significantly fewer clauses (i.e., the explanations are significantly less complex). There are no statistically significant differences in the mean scoring metrics.
with open('./data/crime.pkl', 'rb') as fl:
df = pickle.load(fl)
format_results(df)
| model | binarizer | time mean | time std | accuracy mean | accuracy std | precision mean | precision std | recall mean | recall std | f1 mean | f1 std | rules mean | rules std | clauses mean | clauses std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| brcg | fb | 103.903 | (33.13) | 0.855 | (0.012) | 0.768 | (0.03) | 0.607 | (0.057) | 0.676 | (0.036) | 3.233 | (0.935) | 13.867 | (3.213) |
| brcg | fbt | 7.209 | (1.803) | 0.854 | (0.014) | 0.756 | (0.046) | 0.624 | (0.06) | 0.681 | (0.036) | 2.533 | (0.776) | 7.2 | (1.901) |
| logrr | fb | 151.336 | (48.178) | 0.866 | (0.013) | 0.746 | (0.033) | 0.708 | (0.044) | 0.726 | (0.029) | 45.967 | (4.491) | 46.967 | (4.491) |
| logrr | fbt | 8.887 | (3.489) | 0.863 | (0.011) | 0.745 | (0.031) | 0.691 | (0.035) | 0.717 | (0.023) | 15.833 | (1.802) | 16.833 | (1.802) |