import rulematrix
from rulematrix.surrogate import rule_surrogate

from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
First, we load a dataset. To make use of the visualization, it's better to provide feature names and target names.
We then partition the dataset into training and test sets.
# Load the dataset (swap in load_iris() to experiment with iris instead).
# dataset = load_iris()
dataset = load_breast_cancer()

# Optional feature metadata; sklearn's built-in datasets do not provide the
# is_* keys, so these default to None and the tools fall back to inference.
is_continuous, is_categorical, is_integer, feature_names, target_names = (
    dataset.get(key, None) for key in (
        'is_continuous', 'is_categorical', 'is_integer',
        'feature_names', 'target_names'))

# Hold out 25% of the data as a test set, with a fixed seed for reproducibility.
train_x, test_x, train_y, test_y = train_test_split(
    dataset['data'], dataset['target'], test_size=0.25, random_state=42)
def train_nn(neurons=(20,), **kwargs):
    """Train an MLP classifier on the module-level train/test split.

    Parameters
    ----------
    neurons : tuple of int
        Hidden layer sizes passed to ``MLPClassifier``.
    **kwargs
        Extra keyword arguments forwarded to ``MLPClassifier``
        (e.g. ``random_state``).

    Returns
    -------
    The fitted estimator: either the bare ``MLPClassifier`` or, when the
    dataset declares categorical features, a ``Pipeline`` that one-hot
    encodes them first.
    """
    is_categorical = dataset.get('is_categorical', None)
    model = MLPClassifier(hidden_layer_sizes=neurons, **kwargs)
    if is_categorical is not None:
        # BUG FIX: ``OneHotEncoder(categorical_features=...)`` was deprecated
        # in scikit-learn 0.20 and removed in 0.22; ColumnTransformer is the
        # supported way to encode a subset of columns.
        cat_columns = [i for i, flag in enumerate(is_categorical) if flag]
        model = Pipeline([
            ('one_hot', ColumnTransformer(
                [('encode', OneHotEncoder(), cat_columns)],
                remainder='passthrough')),
            ('mlp', model),
        ])
    model.fit(train_x, train_y)
    train_score = model.score(train_x, train_y)
    test_score = model.score(test_x, test_y)
    print('Training score:', train_score)
    print('Test score:', test_score)
    return model
nn = train_nn((20, 20, 20), random_state=43)
Training score: 0.9061032863849765 Test score: 0.9230769230769231
Next, we train a surrogate rule list that approximates the neural network, using default parameters, and render the RuleMatrix visualization.
def train_surrogate(model, sampling_rate=2.0, **kwargs):
    """Fit a rule-list surrogate of ``model`` and report its fidelity.

    ``model.predict`` serves as the teacher; the surrogate is trained on
    samples drawn around ``train_x`` (``sampling_rate`` controls how many).
    Extra keyword arguments (e.g. ``seed``) go straight to
    ``rule_surrogate``. Returns the fitted surrogate.
    """
    teacher = model.predict
    surrogate = rule_surrogate(
        teacher,
        train_x,
        sampling_rate=sampling_rate,
        is_continuous=is_continuous,
        is_categorical=is_categorical,
        is_integer=is_integer,
        rlargs={'feature_names': feature_names, 'verbose': 2},
        **kwargs)
    # Fidelity = agreement between the surrogate and the teacher model.
    fidelity_train = surrogate.score(train_x)
    fidelity_test = surrogate.score(test_x)
    print('Training fidelity:', fidelity_train)
    print('Test fidelity:', fidelity_test)
    return surrogate
# Build the surrogate with a denser sampling rate and a fixed seed,
# then print the learned rule list (the surrogate's "student").
surrogate = train_surrogate(nn, sampling_rate=4, seed=44)
rl = surrogate.student
print(rl)
Training fidelity: 0.8779342723004695 Test fidelity: 0.8951048951048951 The rule list contains 10 of rules: IF (worst area in (-inf, 91.3)) THEN prob: [0.9947, 0.0053] ELSE IF (area error in (74.37, inf)) THEN prob: [0.9973, 0.0027] ELSE IF (mean perimeter in (-inf, 52.41)) THEN prob: [0.6957, 0.3043] ELSE IF (worst area in (1134.9, inf)) THEN prob: [0.9862, 0.0138] ELSE IF (area error in (48.8, 74.37)) AND (mean compactness in (0.14295, inf)) THEN prob: [0.8868, 0.1132] ELSE IF (mean perimeter in (108.31, 120.6)) THEN prob: [0.2381, 0.7619] ELSE IF (worst area in (734.0, 958.2)) AND (mean area in (611.8, 853.1)) THEN prob: [0.5364, 0.4636] ELSE IF (worst concavity in (0.3725, inf)) AND (mean area in (63.9, 294.4)) THEN prob: [0.7600, 0.2400] ELSE IF (worst area in (126.5, 734.0)) THEN prob: [0.3571, 0.6429] ELSE DEFAULT prob: [0.9333, 0.0667]
Now let's render the RuleMatrix visualization.
Here are some instructions on how to read the RuleMatrix.
In the middle is a matrix of rules. Each row represents a rule, and each column represents a feature. For example, rule 2 is IF area error > 74.4 THEN Prob(malignant) = 1.0.
The shadowed part of a cell indicates the value range of the feature used in the rule. The light-blue histogram in the cell shows the distribution of the feature. You can hover over a cell to read the text of the rule, and click a cell to expand it and inspect the feature distribution in detail.
To the left of the matrix is the data flow, showing how the data is captured by each of the rules. The width of the flow indicates the amount of data captured/uncaptured by each rule, and the color of the flow indicates the different labels. For example, there is roughly a 1:2 ratio of malignant:benign data in the breast_cancer dataset.
To the right of the matrix is detailed information about each rule. Fidelity measures how accurately a rule represents/approximates the original model (on the data captured by that rule). The evidence (or support, in itemset-mining terms) shows the number of data points with each label. For example, the data captured by rule 9 is mostly benign. The striped part encodes the portion of data wrongly classified by the model as the label represented by the color.
# Render the interactive RuleMatrix view for the training data.
rulematrix.render(
    train_x,
    train_y,
    surrogate,
    feature_names=feature_names,
    target_names=target_names,
    is_categorical=is_categorical,
)