FeatureBinarizerFromTrees

The FeatureBinarizerFromTrees transformer binarizes features for BooleanRuleCG (BRCG), LogisticRuleRegression (LogRR), and LinearRuleRegression (LinearRR) models. It generates binary features (i.e. rules) from the splits of decision trees fitted to the training data, so thresholds are placed where they actually separate the classes and only features the trees found useful are retained. Compared to FeatureBinarizer, FeatureBinarizerFromTrees reduces the number of features required to produce an accurate model. Not only does this shorten training times, but, more importantly, it often results in simpler rule sets.

This notebook demonstrates basic usage of FeatureBinarizerFromTrees, compares it with FeatureBinarizer, and concludes with a formal performance comparison.

Initialize

In [1]:
import warnings
warnings.filterwarnings('ignore')

from feature_binarizer_from_trees_demo import fbt_vs_fb_crime, format_results, get_corr_columns, print_metrics

import numpy as np
import pandas as pd
from pandas import DataFrame
import pickle
from time import time

from aix360.algorithms.rbm import BooleanRuleCG, BRCGExplainer, FeatureBinarizer, FeatureBinarizerFromTrees, \
    GLRMExplainer, LogisticRuleRegression
from aix360.datasets.heloc_dataset import HELOCDataset, nan_preprocessing
from aix360.datasets import MEPSDataset

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def print_brcg_rules(rules):
    print('Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:\n')
    for rule in rules:
        print(f'  - {rule}')
    print()

def fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test, lambda0=0.001, lambda1=0.001):
    bcrg = BooleanRuleCG(lambda0, lambda1, silent=True)
    explainer = BRCGExplainer(bcrg)
    t = time()
    explainer.fit(X_train_b, y_train)
    print(f'Model trained in {time() - t:0.1f} seconds\n')
    print_metrics(y_test, explainer.predict(X_test_b))
    print_brcg_rules(explainer.explain()['rules'])
    
def fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test):
    logrr = LogisticRuleRegression(lambda0=0.005, lambda1=0.001, useOrd=True, maxSolverIter=1000)
    explainer = GLRMExplainer(logrr)
    t = time()
    explainer.fit(X_train_b, y_train, X_train_std)
    print(f'Model trained in {time() - t:0.1f} seconds\n')
    print_metrics(y_test, explainer.predict(X_test_b, X_test_std))
    return explainer.explain()
Using TensorFlow backend.

Communities and Crime Data

Create a binary classification problem to predict the top 25% of violent crimes from a subset of the UCI Communities and Crime data.

In [2]:
X, y = shap.datasets.communitiesandcrime()
y = (y >= np.percentile(y, 75)).astype(int)  # top quartile of crime rates -> positive class

After dropping highly correlated columns, there are 88 ordinal features.

In [3]:
X.drop(columns=get_corr_columns(X), inplace=True)
print(X.shape)
(1994, 88)

Split the data into stratified training and test sets (scikit-learn's default 75% / 25% split).

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

Using FeatureBinarizerFromTrees with BRCG

The code below initializes the default transformer and transforms the data. The default transformer fits one decision tree with a maximum depth of 4. Such a tree has at most 15 split nodes, and each split yields a complementary pair of binary features (<= and >), so the defaults can create up to 30 features. In this case, only 28 features are generated: perhaps the fitted tree didn't need the maximum number of nodes to fit the data, or the binarizer dropped duplicate thresholds.

In [23]:
fbt = FeatureBinarizerFromTrees(randomState=0)
X_train_b = fbt.fit_transform(X_train, y_train)
X_test_b = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
28 features.

An obvious question might have crossed your mind by now: "If you can sufficiently describe the data with features from a simple decision tree, why not use the decision tree as the explainable model?" It is true that a decision tree may be a satisfactory model in some cases. However, rule sets generated by BRCG can be simpler and more accessible than a decision tree. Furthermore, this is just an introductory example. As we will show, we often need more than one simple decision tree to generate features for an accurate model.

Here are the binarized features for PctKidsBornNeverMar. The binarizer selected two thresholds: 2.64 and 4.26. There are two complementary features for each threshold, with operators <= and >. Additional operators are supported for categorical and binary features; a brief sketch follows the output below.

In [24]:
X_train_b['PctKidsBornNeverMar'].head()
Out[24]:
operation <= >
value 2.64 4.26 2.64 4.26
1765 0 0 1 1
2164 1 1 0 0
1691 0 1 1 0
697 0 1 1 0
39 1 1 0 0
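
The crime data is entirely numeric, so only threshold operators appear above. If the data contained categorical columns, they could be declared via the colCateg argument so the binarizer emits equality-style features for them instead of thresholds. A minimal sketch on a made-up frame (the column names, data, and the expected == / != operations are illustrative assumptions, not part of this notebook's data):

# Illustrative only: a tiny synthetic frame with one categorical column.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({'region': rng.choice(['north', 'south', 'east'], size=200),
                       'income': rng.rand(200)})
y_demo = ((X_demo['income'] > 0.5) | (X_demo['region'] == 'north')).astype(int)
fbt_cat = FeatureBinarizerFromTrees(colCateg=['region'], randomState=0)
X_demo_b = fbt_cat.fit_transform(X_demo, y_demo)
X_demo_b.head()  # categorical columns get equality-style operations rather than <= / >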

Here, we fit the model, predict the test set, and display the training time, test metrics, and rule set.

The model trains in roughly 4 seconds and creates a simple, one-rule model with almost 84% accuracy using default BRCG model parameters.

In [26]:
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 4.3 seconds

Accuracy = 0.83567
Precision = 0.74157
Recall = 0.52800
F1 = 0.61682
F1 Weighted = 0.82562

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - FemalePctDiv > 10.88 AND PctKidsBornNeverMar > 4.26 AND PctPopUnderPov > 9.80 AND PctSpeakEnglOnly <= 97.82

For this data set, we can easily improve the fit by changing a few binarizer parameters. The parameter values used here were manually selected for demonstration purposes; better values are likely attainable, especially if the BRCG parameters are also tuned.

In [27]:
fbt = FeatureBinarizerFromTrees(treeNum=3, treeDepth=3, treeFeatureSelection=0.5, threshRound=0, randomState=0)
X_train_b = fbt.fit_transform(X_train, y_train)
X_test_b = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
36 features.

Here we describe the key parameters used above. (See the FeatureBinarizerFromTrees API for a full list of arguments.)

  • treeNum - The number of trees to fit. A value greater than one encourages a greater variety of features and thresholds.
  • treeDepth - The depth of the fitted decision trees. The greater the depth, the more features are generated.
  • treeFeatureSelection - The proportion of randomly chosen input features to consider at each split in the decision tree. When more than one tree is specified, this encourages a greater variety of features. See the API documentation for a full list of options.
  • threshRound - Round the threshold values to the given number of decimal places. Rounding the thresholds prevents near duplicate thresholds like 1.01 and 1.0. In the crime data, most of the features are ratios and integers, so rounding to the nearest integer value is acceptable.

The model trains in around 10 seconds and improves accuracy noticeably. Although more features improved the fit in this case, it is important to point out that more features are not always better. For both explainability and accuracy, we suggest starting with a small number of features and increasing it incrementally until accuracy plateaus or the explanation is sufficient (a rough search loop is sketched after the results below).

In [28]:
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 10.1 seconds

Accuracy = 0.86774
Precision = 0.81720
Recall = 0.60800
F1 = 0.69725
F1 Weighted = 0.86074

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - HousVacant > 2172.00 AND PctKidsBornNeverMar > 3.00
  - FemalePctDiv > 12.00 AND PctKidsBornNeverMar > 4.00 AND PctPopUnderPov > 10.00 AND PctSpeakEnglOnly <= 98.00 AND racePctWhite <= 79.00
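
To automate the incremental search suggested above, one could grow the trees step by step and stop when the gain levels off. A minimal sketch (illustrative only: it reuses this notebook's test split as the stopping criterion, whereas a separate validation split would be preferable, and the 0.005 tolerance is an arbitrary choice):

from sklearn.metrics import accuracy_score

best_acc = 0.0
for depth in (2, 3, 4, 5):
    fbt = FeatureBinarizerFromTrees(treeNum=3, treeDepth=depth, threshRound=0, randomState=0)
    Xb_train = fbt.fit_transform(X_train, y_train)
    Xb_test = fbt.transform(X_test)
    explainer = BRCGExplainer(BooleanRuleCG(silent=True))
    explainer.fit(Xb_train, y_train)
    acc = accuracy_score(y_test, explainer.predict(Xb_test))
    print(f'treeDepth={depth}: {Xb_train.shape[1]} features, accuracy = {acc:0.3f}')
    if acc <= best_acc + 0.005:  # stop once the improvement becomes negligible
        break
    best_acc = acc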

Using FeatureBinarizerFromTrees with Linear Models

To use FeatureBinarizerFromTrees with LogRR and LinearRR, set returnOrd=True. Like the standard FeatureBinarizer, the transformer will return a standardized data frame of ordinal features in addition to the binarized features. The standardized features can then be passed to the linear model to improve accuracy. (Make sure to set useOrd=True for the linear model.)

In [29]:
fbt = FeatureBinarizerFromTrees(treeNum=2, treeDepth=4, treeFeatureSelection=None, threshRound=0, returnOrd=True, 
                                randomState=0)
X_train_b, X_train_std = fbt.fit_transform(X_train, y_train)
X_test_b, X_test_std = fbt.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
28 features.

The explanation for the fitted linear model lists the features in descending order by linear coefficient magnitude. For this feature set, the linear model does not appear to improve the accuracy significantly.

In [11]:
fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test)
Model trained in 5.5 seconds

Accuracy = 0.86373
Precision = 0.79381
Recall = 0.61600
F1 = 0.69369
F1 Weighted = 0.85759

Out[11]:
rule/numerical feature coefficient
0 (intercept) -1.9073
1 PctKidsBornNeverMar <= 4.00 AND PctSpeakEnglOn... -3.12689
2 FemalePctDiv > 11.00 AND PctPopUnderPov > 10.0... 2.35018
3 PctKidsBornNeverMar <= 4.00 2.16666
4 PctKidsBornNeverMar 2.09972
5 FemalePctDiv > 11.00 AND OwnOccQrange > 37500.... 1.70801
6 FemalePctDiv 1.07142
7 HousVacant <= 2172.00 -0.868909
8 FemalePctDiv > 11.00 AND PctPopUnderPov > 10.0... -0.682629
9 HousVacant 0.65058
10 PctKidsBornNeverMar <= 3.00 -0.640423
11 FemalePctDiv > 11.00 AND PctEmplManu <= 15.00 ... 0.597499
12 PctSpeakEnglOnly -0.510433
13 pctWInvInc <= 38.00 0.484888
14 pctWInvInc -0.476527
15 pctWWage <= 74.00 0.381191
16 PctEmplManu -0.315508
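
LinearRuleRegression (LinearRR), mentioned at the start of the notebook, is wired up the same way but targets a continuous response. A sketch under the assumption that its lambda0, lambda1, and useOrd arguments mirror LogisticRuleRegression's; y_train_reg stands in for a hypothetical continuous target (e.g., the raw crime rate before it was converted to the top-quartile indicator):

from aix360.algorithms.rbm import LinearRuleRegression

linrr = LinearRuleRegression(lambda0=0.005, lambda1=0.001, useOrd=True)
lin_explainer = GLRMExplainer(linrr)
lin_explainer.fit(X_train_b, y_train_reg, X_train_std)  # y_train_reg: hypothetical continuous target
y_pred = lin_explainer.predict(X_test_b, X_test_std)
lin_explainer.explain()  # same rule/coefficient table format as above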

Compare with FeatureBinarizer

The standard FeatureBinarizer creates thresholds by binning the data into a user-specified number of quantiles. With 88 ordinal features, the default setting of 9 thresholds and 2 operations per threshold allows up to 88 × 9 × 2 = 1,584 features; 1,528 are produced here when negations are enabled, likely because some features do not have nine distinct quantile values. This is a very large feature space.

In [31]:
fb = FeatureBinarizer(negations=True)
X_train_b = fb.fit_transform(X_train)
X_test_b = fb.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
1528 features.

Here are the binary features associated with the PctKidsBornNeverMar input feature.

In [32]:
X_train_b['PctKidsBornNeverMar'].head()
Out[32]:
operation <= >
value 0.570 0.930 1.240 1.670 2.130 2.770 3.528 4.942 7.432 0.570 0.930 1.240 1.670 2.130 2.770 3.528 4.942 7.432
1765 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
2164 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
1691 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0
697 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0
39 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

The model takes more than 5 minutes to train. The test accuracy also appears to be lower and the rule set is complex.

In [13]:
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
Model trained in 322.8 seconds

Accuracy = 0.83166
Precision = 0.72527
Recall = 0.52800
F1 = 0.61111
F1 Weighted = 0.82207

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - FemalePctDiv > 15.19 AND PctKidsBornNeverMar > 7.43
  - PctUnemployed > 5.56 AND PctImmigRec5 <= 29.48 AND NumStreet > 14.00
  - PctFam2Par <= 60.46 AND PctSpeakEnglOnly <= 96.91 AND PersPerRentOccHous > 2.23
  - PctTeen2Par <= 68.20 AND PctKidsBornNeverMar > 2.77 AND MedOwnCostPctIncNoMtg <= 13.10 AND PctSameHouse85 > 42.21
  - pctWInvInc <= 38.79 AND blackPerCap > 6280.60 AND PctKidsBornNeverMar > 4.94 AND OwnOccQrange > 31100.00 AND RentQrange > 145.00

A more reasonable number of thresholds for this data set is 4. This setting generates 688 features. The accuracy is now comparable with the previous results, but training still takes roughly ten times longer than with FeatureBinarizerFromTrees (about 103 seconds versus 10 seconds), and the rule set is also complex.

In [14]:
fb = FeatureBinarizer(negations=True, numThresh=4)
X_train_b = fb.fit_transform(X_train)
X_test_b = fb.transform(X_test)
print(f'{X_train_b.shape[1]} features.')
fit_predict_bcrg(X_train_b, y_train, X_test_b, y_test)
688 features.
Model trained in 102.8 seconds

Accuracy = 0.85371
Precision = 0.76000
Recall = 0.60800
F1 = 0.67556
F1 Weighted = 0.84795

Predict Y=1 if ANY of the following rules are satisfied, otherwise Y=0:

  - pctWInvInc <= 38.79 AND blackPerCap > 6280.60 AND PersPerFam > 3.05 AND PctKidsBornNeverMar > 4.94
  - racepctblack > 16.73 AND blackPerCap <= 11133.40 AND PctImmigRecent > 19.66 AND PctUsePubTrans > 0.25
  - agePct12t29 <= 27.62 AND PctTeen2Par <= 68.20 AND PctKidsBornNeverMar > 2.77 AND PctSameState85 <= 91.12
  - racepctblack > 1.72 AND PctEmploy <= 68.48 AND MalePctDivorce > 8.43 AND FemalePctDiv > 13.42 AND PctKidsBornNeverMar > 2.77 AND PctImmigRecent > 5.68 AND PctHousOccup <= 96.34 AND NumInShelters > 5.00

This LogRR model has comparable accuracy to the one trained with FeatureBinarizerFromTrees, but it takes 3 minutes to train and it also has a more complex rule set.

In [15]:
fb = FeatureBinarizer(negations=True, returnOrd=True, numThresh=4)
X_train_b, X_train_std = fb.fit_transform(X_train)
X_test_b, X_test_std = fb.transform(X_test)
In [16]:
fit_predict_logrr(X_train_b, X_train_std, y_train, X_test_b, X_test_std, y_test)
Model trained in 180.1 seconds

Accuracy = 0.86172
Precision = 0.75000
Recall = 0.67200
F1 = 0.70886
F1 Weighted = 0.85911

Out[16]:
rule/numerical feature coefficient
0 (intercept) -1.6929
1 FemalePctDiv > 9.41 AND PctFam2Par <= 83.38 AN... 1.89106
2 blackPerCap <= 11133.40 AND FemalePctDiv > 9.41 1.44801
3 agePct12t29 -1.37336
4 FemalePctDiv > 9.41 AND MedRentPctHousInc > 24.00 1.24788
5 PctKidsBornNeverMar 1.06566
6 FemalePctDiv > 9.41 AND OwnOccLowQuart > 39380.00 -0.972519
7 PctHousOccup -0.908306
8 MedOwnCostPctIncNoMtg -0.902969
9 racepctblack <= 5.28 -0.800059
10 racepctblack > 1.72 AND FemalePctDiv > 9.41 0.787698
11 racepctblack <= 16.73 -0.767632
12 PctOccupManu <= 14.70 -0.75916
13 PctTeen2Par <= 68.20 0.711791
14 NumStreet <= 3.00 -0.632624
15 PctPopUnderPov <= 7.42 -0.592511
16 RentQrange <= 192.00 -0.589909
17 FemalePctDiv > 9.41 AND OwnOccQrange > 31100.00 0.560425
18 PctPersDenseHous <= 1.90 -0.53483
19 PctImmigRecent <= 10.34 -0.518697
20 PctVacMore6Mos <= 37.50 0.480573
21 HispPerCap > 6657.60 AND FemalePctDiv > 9.41 A... 0.460934
22 PctKidsBornNeverMar <= 2.77 -0.45671
23 pctWPubAsst <= 4.74 -0.443499
24 PctWOFullPlumb 0.430075
25 LandArea <= 17.80 -0.42746
26 population <= 18846.60 -0.410339
27 PctWOFullPlumb <= 0.64 -0.404707
28 FemalePctDiv > 9.41 AND PctWorkMomYoungKids <=... 0.394237
29 MedRentPctHousInc <= 26.90 -0.389177
30 pctUrban <= 99.47 -0.3877
31 MalePctDivorce <= 11.51 -0.378395
32 PctForeignBorn 0.37134
33 FemalePctDiv > 9.41 AND PctUsePubTrans > 0.25 0.358929
34 HousVacant <= 1573.80 0.347633
35 HousVacant 0.347582
36 HousVacant <= 761.80 0.326033
37 PctNotHSGrad <= 24.46 -0.313271
38 FemalePctDiv > 9.41 AND PctUsePubTrans > 0.81 0.301539
39 PctPopUnderPov > 7.42 AND FemalePctDiv > 9.41 0.2864
40 PctUnemployed <= 6.23 -0.229864
41 FemalePctDiv <= 13.42 -0.226307
42 FemalePctDiv <= 11.60 -0.22487
43 LandArea 0.20781
44 LemasPctOfficDrugUn 0.200326
45 PersPerRentOccHous <= 2.39 -0.196087
46 pctWInvInc <= 38.79 0.190517
47 NumInShelters <= 5.00 -0.173206
48 racePctWhite -0.157085
49 NumStreet <= 0.00 0.156811
50 PctPersDenseHous 0.109269
51 racePctWhite > 84.60 -0.0930171
52 FemalePctDiv > 9.41 AND MedRentPctHousInc > 26.90 -0.0847854
53 FemalePctDiv > 9.41 AND RentQrange > 134.00 -0.0266475
54 PctKidsBornNeverMar <= 4.94 0.024893

Formal Performance Comparison

Finally, we provide a formal performance comparison between FeatureBinarizerFromTrees and FeatureBinarizer over 30 random train-test splits. The settings for binarizers and models are as follows.

  • FeatureBinarizerFromTrees: treeNum=2, treeDepth=4, treeFeatureSelection=None, returnOrd=True
  • FeatureBinarizer: numThresh=4, negations=True, returnOrd=True
  • BooleanRuleCG: Defaults
  • LogisticRuleRegression: lambda0=0.005, lambda1=0.001, useOrd=True, maxSolverIter=1000

This process takes over two hours to run, so we saved the output and loaded it here for display. To re-run the test, uncomment the code below.

In [20]:
# %%time
# df = fbt_vs_fb_crime(iterations=30, treeNum=2, treeDepth=4, numThresh=4, filename='./data/crime.pkl')
Wall time: 2h 17min 35s

In the table below, 'fb' and 'fbt' indicate models fit with FeatureBinarizer and FeatureBinarizerFromTrees, respectively.

For these data and settings, the output shows that models trained using FeatureBinarizerFromTrees fit, on average, in less than 1/10th of the time and generate rule sets with significantly fewer clauses (i.e., the explanations are significantly less complex). There are no statistically significant differences in the mean scoring metrics.

In [21]:
with open('./data/crime.pkl', 'rb') as fl:
    df = pickle.load(fl)
    
format_results(df)
Out[21]:
time accuracy precision recall f1 rules clauses
mean std mean std mean std mean std mean std mean std mean std
brcg fb 103.903 (33.13) 0.855 (0.012) 0.768 (0.03) 0.607 (0.057) 0.676 (0.036) 3.233 (0.935) 13.867 (3.213)
fbt 7.209 (1.803) 0.854 (0.014) 0.756 (0.046) 0.624 (0.06) 0.681 (0.036) 2.533 (0.776) 7.2 (1.901)
logrr fb 151.336 (48.178) 0.866 (0.013) 0.746 (0.033) 0.708 (0.044) 0.726 (0.029) 45.967 (4.491) 46.967 (4.491)
fbt 8.887 (3.489) 0.863 (0.011) 0.745 (0.031) 0.691 (0.035) 0.717 (0.023) 15.833 (1.802) 16.833 (1.802)
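
The significance claim could be verified with paired tests on the per-split metrics. A rough sketch, assuming the pickled frame keeps one row per train-test split for each model/binarizer combination and uses hypothetical 'model', 'binarizer', and 'accuracy' columns (the actual layout produced by fbt_vs_fb_crime may differ):

from scipy.stats import ttest_rel

# Column names below are assumptions about the saved results frame.
mask = df['model'] == 'brcg'
acc_fb = df.loc[mask & (df['binarizer'] == 'fb'), 'accuracy'].to_numpy()
acc_fbt = df.loc[mask & (df['binarizer'] == 'fbt'), 'accuracy'].to_numpy()
print(ttest_rel(acc_fb, acc_fbt))  # paired t-test across the 30 splits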