FeatureBinarizerFromTrees

The FeatureBinarizerFromTrees transformer binarizes features for BooleanRuleCG (BRCG), LogisticRuleRegression (LogRR), and LinearRuleRegression (LinearRR) models. It generates binary features (i.e. rules) from the splits of decision trees fitted to the training data, so thresholds are placed where they actually separate the classes and only features the trees found useful are retained. Compared to FeatureBinarizer, FeatureBinarizerFromTrees reduces the number of features required to produce an accurate model. Not only does this shorten training times, but, more importantly, it often results in simpler rule sets.

This notebook demonstrates basic usage of FeatureBinarizerFromTrees, compares it with FeatureBinarizer, and concludes with a formal performance comparison.

Initialize

In [1]:
import warnings
warnings.filterwarnings('ignore')

from feature_binarizer_from_trees_demo import fbt_vs_fb_crime, format_results, get_corr_columns, print_metrics

import numpy as np
import pandas as pd
from pandas import DataFrame
import pickle
from time import time

from aix360.algorithms.rbm import BooleanRuleCG, BRCGExplainer, FeatureBinarizer, FeatureBinarizerFromTrees, \
    GLRMExplainer, LogisticRuleRegression
from aix360.datasets.heloc_dataset import HELOCDataset, nan_preprocessing
from aix360.datasets import MEPSDataset

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def print_brcg_rules(rules):
    print('Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:\n')
    for rule in rules:
        print(f'  - {rule}')
    print()

def fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test, lambda0=0.001, lambda1=0.001):
    bcrg = BooleanRuleCG(lambda0, lambda1, silent=True)
    explainer = BRCGExplainer(bcrg)
    t = time()
    explainer.fit(X_train_b, y_train)
    print(f'Model trained in {time() - t:0.1f} seconds\n')
    print_metrics(y_test, explainer.predict(X_test_b))
    print_brcg_rules(explainer.explain()['rules'])
    
def fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test):
    logrr = LogisticRuleRegression(lambda0=0.005, lambda1=0.001, useOrd=True, maxSolverIter=1000)
    explainer = GLRMExplainer(logrr)
    t = time()
    explainer.fit(X_train_b, y_train, X_train_std)
    print(f'Model trained in {time() - t:0.1f} seconds\n')
    print_metrics(y_test, explainer.predict(X_test_b, X_test_std))
    return explainer.explain()
Using TensorFlow backend.

Communities and Crime Data

Create a binary classification problem to predict the top 25% of violent crimes from a subset of the UCI Communities and Crime data.

In [2]:
X, y = shap.datasets.communitiesandcrime()
y = (y >= np.percentile(y, 75)).astype(int)  # top quartile of crime rates -> positive class

After dropping highly correlated columns, there are 88 ordinal features.

In [3]:
X.drop(columns=get_corr_columns(X), inplace=True)
print(X.shape)
(1994, 88)

Split the data into stratified training and test sets (scikit-learn's default 75% / 25% split).

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

Using FeatureBinarizerFromTrees with BRCG

The code below initializes the default transformer and transforms the data. The default transformer fits one decision tree with a maximum depth of 4. Such a tree has at most 15 split nodes, and each split yields a complementary pair of binary features (<= and >), so the defaults can create up to 30 features. In this case, only 28 features are generated: perhaps the fitted tree didn't need the maximum number of nodes to fit the data, or the binarizer dropped duplicate thresholds.

In [23]:
fbt = FeatureBinarizerFromTrees(randomState=0)
X_train_b = fbt.fit_transform(X_train, y_train)
X_test_b = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
28 features.

An obvious question might have crossed your mind by now: "If you can sufficiently describe the data with features from a simple decision tree, why not use the decision tree as the explainable model?" It is true that a decision tree may be a satisfactory model in some cases. However, rule sets generated by BRCG can be simpler and more accessible than a decision tree. Furthermore, this is just an introductory example. As we will show, we often need more than one simple decision tree to generate features for an accurate model.

Here are the binarized features for PctKidsBornNeverMar. The binarizer selected two thresholds: 2.64 and 4.26. There are two complementary features for each threshold, with operators <= and >. Additional operators are supported for categorical and binary features; a brief sketch follows the output below.

In [24]:
X_train_b['PctKidsBornNeverMar'].head()
Out[24]:
operation <= >
value 2.64 4.26 2.64 4.26
1765 0 0 1 1
2164 1 1 0 0
1691 0 1 1 0
697 0 1 1 0
39 1 1 0 0
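
The crime data is entirely numeric, so only threshold operators appear above. If the data contained categorical columns, they could be declared via the colCateg argument so the binarizer emits equality-style features for them instead of thresholds. A minimal sketch on a made-up frame (the column names, data, and the expected == / != operations are illustrative assumptions, not part of this notebook's data):

# Illustrative only: a tiny synthetic frame with one categorical column.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({'region': rng.choice(['north', 'south', 'east'], size=200),
                       'income': rng.rand(200)})
y_demo = ((X_demo['income'] > 0.5) | (X_demo['region'] == 'north')).astype(int)
fbt_cat = FeatureBinarizerFromTrees(colCateg=['region'], randomState=0)
X_demo_b = fbt_cat.fit_transform(X_demo, y_demo)
X_demo_b.head()  # categorical columns get equality-style operations rather than <= / >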

Here, we fit the model, predict the test set, and display the training time, test metrics, and rule set.

The model trains in roughly 4 seconds and creates a simple, one-rule model with almost 84% accuracy using default BRCG model parameters.

In [26]:
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 4.3 seconds

Accuracy = 0.83567
Precision = 0.74157
Recall = 0.52800
F1 = 0.61682
F1 Weighted = 0.82562

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - FemalePctDiv > 10.88 AND PctKidsBornNeverMar > 4.26 AND PctPopUnderPov > 9.80 AND PctSpeakEnglOnly <= 97.82

For this data set, we can easily improve the fit by changing a few binarizer parameters. The parameter values used here were manually selected for demonstration purposes; better values are likely attainable, especially if the BRCG parameters are also tuned.

In [27]:
fbt = FeatureBinarizerFromTrees(treeNum=3, treeDepth=3, treeFeatureSelection=0.5, threshRound=0, randomState=0)
X_train_b = fbt.fit_transform(X_train, y_train)
X_test_b = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
36 features.

Here we describe the key parameters used above. (See the FeatureBinarizerFromTrees API for a full list of arguments.)

  • treeNum - The number of trees to fit. A value greater than one encourages a greater variety of features and thresholds.
  • treeDepth - The depth of the fitted decision trees. The greater the depth, the more features are generated.
  • treeFeatureSelection - The proportion of randomly chosen input features to consider at each split in the decision tree. When more than one tree is specified, this encourages a greater variety of features. See the API documentation for a full list of options.
  • threshRound - Round the threshold values to the given number of decimal places. Rounding the thresholds prevents near duplicate thresholds like 1.01 and 1.0. In the crime data, most of the features are ratios and integers, so rounding to the nearest integer value is acceptable.

The model trains in around 10 seconds and improves accuracy noticeably. Although more features improved the fit in this case, it is important to point out that more features are not always better. For both explainability and accuracy, we suggest starting with a small number of features and increasing it incrementally until accuracy plateaus or the explanation is sufficient (a rough search loop is sketched after the results below).

In [28]:
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 10.1 seconds

Accuracy = 0.86774
Precision = 0.81720
Recall = 0.60800
F1 = 0.69725
F1 Weighted = 0.86074

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - HousVacant > 2172.00 AND PctKidsBornNeverMar > 3.00
  - FemalePctDiv > 12.00 AND PctKidsBornNeverMar > 4.00 AND PctPopUnderPov > 10.00 AND PctSpeakEnglOnly <= 98.00 AND racePctWhite <= 79.00
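
To automate the incremental search suggested above, one could grow the trees step by step and stop when the gain levels off. A minimal sketch (illustrative only: it reuses this notebook's test split as the stopping criterion, whereas a separate validation split would be preferable, and the 0.005 tolerance is an arbitrary choice):

from sklearn.metrics import accuracy_score

best_acc = 0.0
for depth in (2, 3, 4, 5):
    fbt = FeatureBinarizerFromTrees(treeNum=3, treeDepth=depth, threshRound=0, randomState=0)
    Xb_train = fbt.fit_transform(X_train, y_train)
    Xb_test = fbt.transform(X_test)
    explainer = BRCGExplainer(BooleanRuleCG(silent=True))
    explainer.fit(Xb_train, y_train)
    acc = accuracy_score(y_test, explainer.predict(Xb_test))
    print(f'treeDepth={depth}: {Xb_train.shape[1]} features, accuracy = {acc:0.3f}')
    if acc <= best_acc + 0.005:  # stop once the improvement becomes negligible
        break
    best_acc = acc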

Using FeatureBinarizerFromTrees with Linear Models

To use FeatureBinarizerFromTrees with LogRR and LinearRR, set returnOrd=True. Like the standard FeatureBinarizer, the transformer will return a standardized data frame of ordinal features in addition to the binarized features. The standardized features can then be passed to the linear model to improve accuracy. (Make sure to set useOrd=True for the linear model.)

In [29]:
fbt = FeatureBinarizerFromTrees(treeNum=2, treeDepth=4, treeFeatureSelection=None, threshRound=0, returnOrd=True, 
                                randomState=0)
X_train_b, X_train_std = fbt.fit_transform(X_train, y_train)
X_test_b, X_test_std = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
28 features.

The explanation for the fitted linear model lists the features in descending order by linear coefficient magnitude. For this feature set, the linear model does not appear to improve the accuracy significantly.

In [11]:
fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test)
Model trained in 5.5 seconds

Accuracy = 0.86373
Precision = 0.79381
Recall = 0.61600
F1 = 0.69369
F1 Weighted = 0.85759

Out[11]:
rule/numerical feature coefficient
0 (intercept) -1.9073
1 PctKidsBornNeverMar <= 4.00 AND PctSpeakEnglOn... -3.12689
2 FemalePctDiv > 11.00 AND PctPopUnderPov > 10.0... 2.35018
3 PctKidsBornNeverMar <= 4.00 2.16666
4 PctKidsBornNeverMar 2.09972
5 FemalePctDiv > 11.00 AND OwnOccQrange > 37500.... 1.70801
6 FemalePctDiv 1.07142
7 HousVacant <= 2172.00 -0.868909
8 FemalePctDiv > 11.00 AND PctPopUnderPov > 10.0... -0.682629
9 HousVacant 0.65058
10 PctKidsBornNeverMar <= 3.00 -0.640423
11 FemalePctDiv > 11.00 AND PctEmplManu <= 15.00 ... 0.597499
12 PctSpeakEnglOnly -0.510433
13 pctWInvInc <= 38.00 0.484888
14 pctWInvInc -0.476527
15 pctWWage <= 74.00 0.381191
16 PctEmplManu -0.315508
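
LinearRuleRegression (LinearRR), mentioned at the start of the notebook, is wired up the same way but targets a continuous response. A sketch under the assumption that its lambda0, lambda1, and useOrd arguments mirror LogisticRuleRegression's; y_train_reg stands in for a hypothetical continuous target (e.g., the raw crime rate before it was converted to the top-quartile indicator):

from aix360.algorithms.rbm import LinearRuleRegression

linrr = LinearRuleRegression(lambda0=0.005, lambda1=0.001, useOrd=True)
lin_explainer = GLRMExplainer(linrr)
lin_explainer.fit(X_train_b, y_train_reg, X_train_std)  # y_train_reg: hypothetical continuous target
y_pred = lin_explainer.predict(X_test_b, X_test_std)
lin_explainer.explain()  # same rule/coefficient table format as above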

Compare with FeatureBinarizer

The standard FeatureBinarizer creates thresholds by binning the data into a user-specified number of quantiles. With 88 ordinal features, the default setting of 9 thresholds and 2 operations per threshold allows up to 88 × 9 × 2 = 1,584 features; 1,528 are produced here when negations are enabled, likely because some features do not have nine distinct quantile values. This is a very large feature space.

In [31]:
fb = FeatureBinarizer(negations=True)
X_train_b = fb.fit_transform(X_train)
X_test_b = fb.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
1528 features.

Here are the binary features associated with the PctKidsBornNeverMar input feature.

In [32]:
X_train_b['PctKidsBornNeverMar'].head()
Out[32]:
operation <= >
value 0.570 0.930 1.240 1.670 2.130 2.770 3.528 4.942 7.432 0.570 0.930 1.240 1.670 2.130 2.770 3.528 4.942 7.432
1765 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
2164 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
1691 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0
697 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0
39 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

The model takes more than 5 minutes to train. The test accuracy also appears to be lower and the rule set is complex.

In [13]:
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 322.8 seconds

Accuracy = 0.83166
Precision = 0.72527
Recall = 0.52800
F1 = 0.61111
F1 Weighted = 0.82207

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - FemalePctDiv > 15.19 AND PctKidsBornNeverMar > 7.43
  - PctUnemployed > 5.56 AND PctImmigRec5 <= 29.48 AND NumStreet > 14.00
  - PctFam2Par <= 60.46 AND PctSpeakEnglOnly <= 96.91 AND PersPerRentOccHous > 2.23
  - PctTeen2Par <= 68.20 AND PctKidsBornNeverMar > 2.77 AND MedOwnCostPctIncNoMtg <= 13.10 AND PctSameHouse85 > 42.21
  - pctWInvInc <= 38.79 AND blackPerCap > 6280.60 AND PctKidsBornNeverMar > 4.94 AND OwnOccQrange > 31100.00 AND RentQrange > 145.00

A more reasonable number of thresholds for this data set is 4. This setting generates 688 features. The accuracy is now comparable with the previous results, but training still takes roughly ten times longer than with FeatureBinarizerFromTrees (about 103 seconds versus 10 seconds), and the rule set is also complex.

In [14]:
fb = FeatureBinarizer(negations=True, numThresh=4)
X_train_b = fb.fit_transform(X_train)
X_test_b = fb.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
688 features.
Model trained in 102.8 seconds

Accuracy = 0.85371
Precision = 0.76000
Recall = 0.60800
F1 = 0.67556
F1 Weighted = 0.84795

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - pctWInvInc <= 38.79 AND blackPerCap > 6280.60 AND PersPerFam > 3.05 AND PctKidsBornNeverMar > 4.94
  - racepctblack > 16.73 AND blackPerCap <= 11133.40 AND PctImmigRecent > 19.66 AND PctUsePubTrans > 0.25
  - agePct12t29 <= 27.62 AND PctTeen2Par <= 68.20 AND PctKidsBornNeverMar > 2.77 AND PctSameState85 <= 91.12
  - racepctblack > 1.72 AND PctEmploy <= 68.48 AND MalePctDivorce > 8.43 AND FemalePctDiv > 13.42 AND PctKidsBornNeverMar > 2.77 AND PctImmigRecent > 5.68 AND PctHousOccup <= 96.34 AND NumInShelters > 5.00

This LogRR model has comparable accuracy to the one trained with FeatureBinarizerFromTrees, but it takes 3 minutes to train and it also has a more complex rule set.

In [15]:
fb = FeatureBinarizer(negations=True, returnOrd=True, numThresh=4)
X_train_b, X_train_std = fb.fit_transform(X_train)
X_test_b, X_test_std = fb.transform(X_test)
In [16]:
fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test)
Model trained in 180.1 seconds

Accuracy = 0.86172
Precision = 0.75000
Recall = 0.67200
F1 = 0.70886
F1 Weighted = 0.85911

Out[16]:
rule/numerical feature coefficient
0 (intercept) -1.6929
1 FemalePctDiv > 9.41 AND PctFam2Par <= 83.38 AN... 1.89106
2 blackPerCap <= 11133.40 AND FemalePctDiv > 9.41 1.44801
3 agePct12t29 -1.37336
4 FemalePctDiv > 9.41 AND MedRentPctHousInc > 24.00 1.24788
5 PctKidsBornNeverMar 1.06566
6 FemalePctDiv > 9.41 AND OwnOccLowQuart > 39380.00 -0.972519
7 PctHousOccup -0.908306
8 MedOwnCostPctIncNoMtg -0.902969
9 racepctblack <= 5.28 -0.800059
10 racepctblack > 1.72 AND FemalePctDiv > 9.41 0.787698
11 racepctblack <= 16.73 -0.767632
12 PctOccupManu <= 14.70 -0.75916
13 PctTeen2Par <= 68.20 0.711791
14 NumStreet <= 3.00 -0.632624
15 PctPopUnderPov <= 7.42 -0.592511
16 RentQrange <= 192.00 -0.589909
17 FemalePctDiv > 9.41 AND OwnOccQrange > 31100.00 0.560425
18 PctPersDenseHous <= 1.90 -0.53483
19 PctImmigRecent <= 10.34 -0.518697
20 PctVacMore6Mos <= 37.50 0.480573
21 HispPerCap > 6657.60 AND FemalePctDiv > 9.41 A... 0.460934
22 PctKidsBornNeverMar <= 2.77 -0.45671
23 pctWPubAsst <= 4.74 -0.443499
24 PctWOFullPlumb 0.430075
25 LandArea <= 17.80 -0.42746
26 population <= 18846.60 -0.410339
27 PctWOFullPlumb <= 0.64 -0.404707
28 FemalePctDiv > 9.41 AND PctWorkMomYoungKids <=... 0.394237
29 MedRentPctHousInc <= 26.90 -0.389177
30 pctUrban <= 99.47 -0.3877
31 MalePctDivorce <= 11.51 -0.378395
32 PctForeignBorn 0.37134
33 FemalePctDiv > 9.41 AND PctUsePubTrans > 0.25 0.358929
34 HousVacant <= 1573.80 0.347633
35 HousVacant 0.347582
36 HousVacant <= 761.80 0.326033
37 PctNotHSGrad <= 24.46 -0.313271
38 FemalePctDiv > 9.41 AND PctUsePubTrans > 0.81 0.301539
39 PctPopUnderPov > 7.42 AND FemalePctDiv > 9.41 0.2864
40 PctUnemployed <= 6.23 -0.229864
41 FemalePctDiv <= 13.42 -0.226307
42 FemalePctDiv <= 11.60 -0.22487
43 LandArea 0.20781
44 LemasPctOfficDrugUn 0.200326
45 PersPerRentOccHous <= 2.39 -0.196087
46 pctWInvInc <= 38.79 0.190517
47 NumInShelters <= 5.00 -0.173206
48 racePctWhite -0.157085
49 NumStreet <= 0.00 0.156811
50 PctPersDenseHous 0.109269
51 racePctWhite > 84.60 -0.0930171
52 FemalePctDiv > 9.41 AND MedRentPctHousInc > 26.90 -0.0847854
53 FemalePctDiv > 9.41 AND RentQrange > 134.00 -0.0266475
54 PctKidsBornNeverMar <= 4.94 0.024893

Formal Performance Comparison

Finally, we provide a formal performance comparison between FeatureBinarizerFromTrees and FeatureBinarizer over 30 random train-test splits. The settings for binarizers and models are as follows.

  • FeatureBinarizerFromTrees: treeNum=2, treeDepth=4, treeFeatureSelection=None, returnOrd=True
  • FeatureBinarizer: numThresh=4, negations=True, returnOrd=True
  • BooleanRuleCG: Defaults
  • LogisticRuleRegression: lambda0=0.005, lambda1=0.001, useOrd=True, maxSolverIter=1000

This process takes over two hours to run, so we saved the output and loaded it here for display. To re-run the test, uncomment the code below.

In [20]:
# %%time
# df = fbt_vs_fb_crime(iterations=30, treeNum=2, treeDepth=4, numThresh=4, filename='./data/crime.pkl')
Wall time: 2h 17min 35s

In the table below, 'fb' and 'fbt' indicate models fit with FeatureBinarizer and FeatureBinarizerFromTrees, respectively.

For these data and settings, the output shows that models trained using FeatureBinarizerFromTrees fit, on average, in less than 1/10th of the time and generate rule sets with significantly fewer clauses (i.e., the explanations are significantly less complex). There are no statistically significant differences in the mean scoring metrics.

In [21]:
with open('./data/crime.pkl', 'rb') as fl:
    df = pickle.load(fl)
    
format_results(df)
Out[21]:
time accuracy precision recall f1 rules clauses
mean std mean std mean std mean std mean std mean std mean std
brcg fb 103.903 (33.13) 0.855 (0.012) 0.768 (0.03) 0.607 (0.057) 0.676 (0.036) 3.233 (0.935) 13.867 (3.213)
fbt 7.209 (1.803) 0.854 (0.014) 0.756 (0.046) 0.624 (0.06) 0.681 (0.036) 2.533 (0.776) 7.2 (1.901)
logrr fb 151.336 (48.178) 0.866 (0.013) 0.746 (0.033) 0.708 (0.044) 0.726 (0.029) 45.967 (4.491) 46.967 (4.491)
fbt 8.887 (3.489) 0.863 (0.011) 0.745 (0.031) 0.691 (0.035) 0.717 (0.023) 15.833 (1.802) 16.833 (1.802)
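
The significance claim could be verified with paired tests on the per-split metrics. A rough sketch, assuming the pickled frame keeps one row per train-test split for each model/binarizer combination and uses hypothetical 'model', 'binarizer', and 'accuracy' columns (the actual layout produced by fbt_vs_fb_crime may differ):

from scipy.stats import ttest_rel

# Column names below are assumptions about the saved results frame.
mask = df['model'] == 'brcg'
acc_fb = df.loc[mask & (df['binarizer'] == 'fb'), 'accuracy'].to_numpy()
acc_fbt = df.loc[mask & (df['binarizer'] == 'fbt'), 'accuracy'].to_numpy()
print(ttest_rel(acc_fb, acc_fbt))  # paired t-test across the 30 splits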