Homepage: https://spkit.github.io
Nikesh Bajaj : http://nikeshbajaj.in

Decision Trees with visualization using SpKit

Note: This notebook covers the use of (1) Classification and (2) Regression Trees from the spkit library, with the different verbosity modes available while training and the plotting of the resulting decision tree after training. We use two datasets for classification, Iris and Breast Cancer, and the Boston Housing price dataset for regression.

Import libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import spkit

# check version (this notebook was run with 0.0.9)
spkit.__version__
Out[2]:
'0.0.9'
In [3]:
np.random.seed(11) # fix the seed for reproducible results
In [4]:
# import Classification and Regression Tree from spkit
from spkit.ml import ClassificationTree, RegressionTree

# import dataset and train-test split from sklearn or use your own dataset
from sklearn import datasets
from sklearn.model_selection import train_test_split

Classification Tree

Iris Dataset

Loading and splitting into training and testing sets

In [5]:
data = datasets.load_iris()
X = data.data
y = data.target

feature_names = data.feature_names #Optional

Xt,Xs, yt, ys = train_test_split(X,y,test_size=0.3)

print(X.shape,y.shape, Xt.shape, yt.shape, Xs.shape, ys.shape)
(150, 4) (150,) (105, 4) (105,) (45, 4) (45,)

Fitting a model with different verbosity modes (displaying the tree building)

verbose=0 (silent mode)
In [6]:
clf = ClassificationTree()
clf.fit(Xt,yt,verbose=0,feature_names=feature_names)
verbose=1 (progress bar)
In [50]:
clf = ClassificationTree()
clf.fit(Xt,yt,verbose=1,feature_names=feature_names)
Number of features:: 4
Number of samples :: 105
---------------------------------------
|Building the tree.....................
|subtrees::|100%|-------------------->||
|.........................tree is buit!
---------------------------------------
verbose=2 (printing tree info)
In [7]:
clf = ClassificationTree()
clf.fit(Xt,yt,verbose=2,feature_names=feature_names)
Number of features:: 4
Number of samples :: 105
---------------------------------------
|Building the tree.....................
|-Feature::3_petal length (cm) Gain::0.93 thr::_Depth = 1   
|->False branch (<<<)..
|->{Leaf Node:: value: 0 }_Depth =2  

|
|->True branch (>>>)..
|--Feature::4_petal width (cm) Gain::0.81 thr::_Depth = 2    
|-->False branch (<<<)..
|--Feature::3_petal length (cm) Gain::0.18 thr::_Depth = 3   
|-->False branch (<<<)..
|-->{Leaf Node:: value: 1 }_Depth =4  

|
|-->True branch (>>>)..
|--->{Leaf Node:: value: 2 }_Depth =4  

|
|-->True branch (>>>)..
|---Feature::3_petal length (cm) Gain::0.1 thr::_Depth = 3    
|--->False branch (<<<)..
|---Feature::2_sepal width (cm) Gain::0.81 thr::_Depth = 4    
|--->False branch (<<<)..
|--->{Leaf Node:: value: 2 }_Depth =5  

|
|--->True branch (>>>)..
|---->{Leaf Node:: value: 1 }_Depth =5  

|
|--->True branch (>>>)..
|---->{Leaf Node:: value: 2 }_Depth =4  

|
|.........................tree is buit!
---------------------------------------
verbose=3 (printing branches only)
In [8]:
clf = ClassificationTree()
clf.fit(Xt,yt,verbose=3,feature_names=feature_names)
Number of features:: 4
Number of samples :: 105
---------------------------------------
|Building the tree.....................
None 1 | 
True 2 | T
True 3 | TT
True 4 | TTT
False 4 | TTTF
True 5 | TTTFT
False 5 | TTTFF
False 3 | TTF
True 4 | TTFT
False 4 | TTFF
False 2 | TF
|
|.........................tree is buit!
---------------------------------------
verbose=4 (plotting the tree while it is being built)
In [9]:
%matplotlib notebook
In [10]:
clf = ClassificationTree()
clf.fit(Xt,yt,verbose=4,feature_names=feature_names)
Number of features:: 4
Number of samples :: 105
---------------------------------------
|Building the tree.....................
|
|.........................tree is buit!
---------------------------------------

Plotting the resulting tree

In [11]:
%matplotlib inline
In [12]:
plt.figure(figsize=(10,6))
clf.plotTree(show=True,scale=False)

Plotting the tree with same-color branches

In [13]:
plt.figure(figsize=(8,6))
clf.plotTree(DiffBranchColor=False)

Predicting

In [14]:
ytp = clf.predict(Xt)
ysp = clf.predict(Xs)


ytpr = clf.predict_proba(Xt)[:,1]
yspr = clf.predict_proba(Xs)[:,1]

print('Depth of trained Tree ', clf.getTreeDepth())
print('Accuracy')
print('- Training : ',np.mean(ytp==yt))
print('- Testing  : ',np.mean(ysp==ys))
print('Logloss')
Trloss = -np.mean(yt*np.log(ytpr+1e-10)+(1-yt)*np.log(1-ytpr+1e-10))
Tsloss = -np.mean(ys*np.log(yspr+1e-10)+(1-ys)*np.log(1-yspr+1e-10))
print('- Training : ',Trloss)
print('- Testing  : ',Tsloss)
Depth of trained Tree  4
Accuracy
- Training :  1.0
- Testing  :  0.9111111111111111
Logloss
- Training :  14.473392013068288
- Testing  :  15.350567286593632

Iris data with a smaller tree

In [15]:
clf = ClassificationTree(max_depth=3)
clf.fit(Xt,yt,verbose=1,feature_names=feature_names)
plt.figure(figsize=(5,5))
clf.plotTree(show=True,DiffBranchColor=True)
ytp = clf.predict(Xt)
ysp = clf.predict(Xs)

ytpr = clf.predict_proba(Xt)[:,1]
yspr = clf.predict_proba(Xs)[:,1]

print('Depth of trained Tree ', clf.getTreeDepth())
print('Accuracy')
print('- Training : ',np.mean(ytp==yt))
print('- Testing  : ',np.mean(ysp==ys))
print('Logloss')
Trloss = -np.mean(yt*np.log(ytpr+1e-10)+(1-yt)*np.log(1-ytpr+1e-10))
Tsloss = -np.mean(ys*np.log(yspr+1e-10)+(1-ys)*np.log(1-yspr+1e-10))
print('- Training : ',Trloss)
print('- Testing  : ',Tsloss)
Number of features:: 4
Number of samples :: 105
---------------------------------------
|Building the tree.....................
|subtrees::|100%|-------------------->||
|.........................tree is buit!
---------------------------------------
Depth of trained Tree  3
Accuracy
- Training :  0.9904761904761905
- Testing  :  0.8666666666666667
Logloss
- Training :  2.5937142862825335
- Testing  :  1.6081773385894438

Breast Cancer data

In [16]:
data = datasets.load_breast_cancer()
X = data.data
y = data.target

feature_names = data.feature_names #Optional

Xt,Xs, yt, ys = train_test_split(X,y,test_size=0.3)

print(X.shape,y.shape, Xt.shape, yt.shape, Xs.shape, ys.shape)
(569, 30) (569,) (398, 30) (398,) (171, 30) (171,)

Fitting the model while displaying the details of the tree-building process (verbose=4)

While building the tree, to traverse the True branch first and then the False branch, set randomBranch=False

In [17]:
%matplotlib notebook
clf = ClassificationTree()
clf.fit(Xt,yt,verbose=4,feature_names=feature_names,randomBranch=False)
plt.close(clf.fig)
Number of features:: 30
Number of samples :: 398
---------------------------------------
|Building the tree.....................
|
|.........................tree is buit!
---------------------------------------

To select the True or False branch randomly, set randomBranch=True

In [18]:
clf = ClassificationTree()
clf.fit(Xt,yt,verbose=4,feature_names=feature_names,randomBranch=True)
plt.close(clf.fig)
Number of features:: 30
Number of samples :: 398
---------------------------------------
|Building the tree.....................
|
|.........................tree is buit!
---------------------------------------

Resulting tree

In [19]:
%matplotlib inline
plt.figure(figsize=(10,6))
clf.plotTree(show=True,DiffBranchColor=True,scale=False)
plt.close(clf.fig)

Fitting the model while displaying the progress only (verbose=1)

In [20]:
#%matplotlib inline
clf = ClassificationTree()
clf.fit(Xt,yt,verbose=1,feature_names=feature_names)

plt.figure(figsize=(6,6))
clf.plotTree()
Number of features:: 30
Number of samples :: 398
---------------------------------------
|Building the tree.....................
|subtrees::|100%|-------------------->|-
|.........................tree is buit!
---------------------------------------

Predicting

In [21]:
ytp = clf.predict(Xt)
ysp = clf.predict(Xs)

ytpr = clf.predict_proba(Xt)[:,1]
yspr = clf.predict_proba(Xs)[:,1]

print('Depth of trained Tree ', clf.getTreeDepth())
print('Accuracy')
print('- Training : ',np.mean(ytp==yt))
print('- Testing  : ',np.mean(ysp==ys))
print('Logloss')
Trloss = -np.mean(yt*np.log(ytpr+1e-10)+(1-yt)*np.log(1-ytpr+1e-10))
Tsloss = -np.mean(ys*np.log(yspr+1e-10)+(1-ys)*np.log(1-yspr+1e-10))
print('- Training : ',Trloss)
print('- Testing  : ',Tsloss)
Depth of trained Tree  6
Accuracy
- Training :  1.0
- Testing  :  0.9298245614035088
Logloss
- Training :  -1.000000082690371e-10
- Testing  :  1.6158491879730155

The model is overfitting; try a smaller tree by decreasing the max_depth of the classifier, as sketched below.
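
A minimal sketch of this suggestion, assuming Xt, yt, Xs, ys are still the Breast Cancer splits from above; max_depth=4 is an arbitrary example value for illustration, not a tuned setting.

# hedged sketch: cap the tree depth to reduce overfitting (max_depth=4 is illustrative)
clf_small = ClassificationTree(max_depth=4)
clf_small.fit(Xt, yt, verbose=0, feature_names=feature_names)

print('Depth of trained Tree ', clf_small.getTreeDepth())
print('- Training accuracy: ', np.mean(clf_small.predict(Xt) == yt))
print('- Testing  accuracy: ', np.mean(clf_small.predict(Xs) == ys))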

Regression Tree

Boston House price

In [22]:
data = datasets.load_boston()
X = data.data
y = data.target

feature_names = data.feature_names #Optional

Xt,Xs, yt, ys = train_test_split(X,y,test_size=0.3)

print(X.shape,y.shape, Xt.shape, yt.shape, Xs.shape, ys.shape)
(506, 13) (506,) (354, 13) (354,) (152, 13) (152,)
In [23]:
rgr = RegressionTree()
rgr.fit(Xt,yt,verbose=1,feature_names = feature_names)
Number of features:: 13
Number of samples :: 354
---------------------------------------
|Building the tree.....................
|subtrees::|100%|-------------------->|\
|.........................tree is buit!
---------------------------------------

Plotting the resulting tree

In [24]:
%matplotlib inline
plt.style.use('default')
plt.figure(figsize=(15,15))
rgr.plotTree(show=True,scale=True, showtitle =False, showDirection=False)

Prediction

In [25]:
ytp = rgr.predict(Xt)
ysp = rgr.predict(Xs)
print('Training MSE: ',np.mean((ytp-yt)**2))
print('Testing  MSE: ',np.mean((ysp-ys)**2))
Training MSE:  0.0
Testing  MSE:  15.329736842105262

Boston data with a smaller tree

In [26]:
rgr = RegressionTree(max_depth=4)
rgr.fit(Xt,yt,verbose=1,feature_names = feature_names)
Number of features:: 13
Number of samples :: 354
---------------------------------------
|Building the tree.....................
|subtrees::|100%|-------------------->|\
|.........................tree is buit!
---------------------------------------
In [27]:
%matplotlib inline
plt.style.use('default')

plt.figure(figsize=(6,5))
rgr.plotTree(show=True,scale=True, showtitle =True, showDirection=False,DiffBranchColor=True)

ytp = rgr.predict(Xt)
ysp = rgr.predict(Xs)
print('Training MSE: ',np.mean((ytp-yt)**2))
print('Testing  MSE: ',np.mean((ysp-ys)**2))
Training MSE:  8.833248184343507
Testing  MSE:  17.31131839344341
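
To choose max_depth less arbitrarily, one can sweep a few candidate depths and compare held-out MSE. A minimal sketch, assuming Xt, yt, Xs, ys are the Boston splits from above; the depth range 2-8 is an arbitrary choice, and verbose=0 is assumed to behave as the silent mode shown for ClassificationTree.

# hedged sketch: compare training and testing MSE over a few candidate depths
for d in range(2, 9):
    rgr_d = RegressionTree(max_depth=d)
    rgr_d.fit(Xt, yt, verbose=0, feature_names=feature_names)
    mse_tr = np.mean((rgr_d.predict(Xt) - yt)**2)
    mse_ts = np.mean((rgr_d.predict(Xs) - ys)**2)
    print('max_depth =', d, '| Training MSE:', round(mse_tr, 3), '| Testing MSE:', round(mse_ts, 3))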