Random Forest classifiers

In this section of the tutorial, we investigate the use of Random Forest classifiers in sklearn. Like all models in the sklearn framework, Random Forests rely mainly on the fit(X, y) and predict(X) methods. Once a model is fitted, the relative importance of the features can be accessed via its feature_importances_ attribute.

More information about the use of Random Forests for Classification in sklearn can be found at: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

To begin with, let us import the libraries we need and define a function that plots the decision regions of a fitted classifier in 2D (this function is not specific to Random Forests).

In [1]:

%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import numpy as np

def plot_decision(clf, X, y):
    # Build a 2D grid and perform classification using clf on this grid
    xx, yy = np.meshgrid(np.arange(X[:,0].min() - .5, X[:,0].max() + .5, .01),
                         np.arange(X[:,1].min() - .5, X[:,1].max() + .5, .01))
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=.2)
    # Plot data
    plt.scatter(X[:, 0], X[:, 1], c=y, s=40)
    # Set figure coordinate limits
    plt.xlim(X[:,0].min() - .5, X[:,0].max() + .5)
    plt.ylim(X[:,1].min() - .5, X[:,1].max() + .5)

Then, we generate some data (two noisy concentric circles) and train a forest made of a single tree (n_estimators=1):

In [2]:

X, y = make_circles(n_samples=100, random_state=0, noise=.1, factor=.6)
clf = RandomForestClassifier(n_estimators=1)
clf.fit(X, y)
plot_decision(clf, X, y)
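To see what "a forest made of a single tree" means in practice, the fitted trees can be inspected through the estimators_ attribute. The following is a minimal sketch; the variable names (X_one, y_one, single, tree) and the fixed random_state=0 are assumptions added here for illustration and reproducibility, not part of the cells above.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier

# Same kind of data as above, regenerated so that this snippet is self-contained
X_one, y_one = make_circles(n_samples=100, random_state=0, noise=.1, factor=.6)

# random_state is fixed here only to make the run reproducible (an assumption)
single = RandomForestClassifier(n_estimators=1, random_state=0).fit(X_one, y_one)

# The fitted trees are stored in the estimators_ list
n_fitted_trees = len(single.estimators_)
tree = single.estimators_[0]
print(n_fitted_trees, type(tree).__name__, tree.tree_.node_count)
```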

Now, if we vary the number of trees in the model, the decision regions change somewhat:

In [3]:

plt.figure(figsize=(15, 5))
for i, n_trees in enumerate([1, 10, 100]):
    plt.subplot(1, 3, i + 1)
    clf = RandomForestClassifier(n_estimators=n_trees)
    clf.fit(X, y)
    plot_decision(clf, X, y)
    plt.title("%d tree(s)" % n_trees)
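The plots give a visual impression only. One possible way to put numbers on the effect of the number of trees is cross-validated accuracy. The sketch below uses cross_val_score from sklearn.model_selection with 5 folds; the choice of 5 folds, the fixed random_state=0 and the names X_cv, y_cv and cv_scores are assumptions made for this illustration.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Same kind of data as above, regenerated so that this snippet is self-contained
X_cv, y_cv = make_circles(n_samples=100, random_state=0, noise=.1, factor=.6)

cv_scores = {}
for n_trees in [1, 10, 100]:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # Mean accuracy over 5 cross-validation folds
    cv_scores[n_trees] = cross_val_score(forest, X_cv, y_cv, cv=5).mean()
    print("%d tree(s): mean accuracy %.3f" % (n_trees, cv_scores[n_trees]))
```

With only 100 samples, these estimates are noisy, so small differences between settings should not be over-interpreted.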

Once a model is fitted, we can have a look at the relative importance of the different features (here, for the last forest fitted above, the one with 100 trees):

In [4]:

clf.feature_importances_

Out[4]:

array([ 0.52747471,  0.47252529])
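The two values above add up to 1: the importances are relative. As a rough check, and assuming the behaviour of current sklearn versions, the forest-level importances can be compared with the average of the importances of the individual trees stored in estimators_. The names X_imp, y_imp, forest_imp and per_tree, as well as the fixed random_state=0, are assumptions introduced for this sketch.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier

# Same kind of data as above, regenerated so that this snippet is self-contained
X_imp, y_imp = make_circles(n_samples=100, random_state=0, noise=.1, factor=.6)
forest_imp = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_imp, y_imp)

importances = forest_imp.feature_importances_
# Importances of each individual tree, one row per tree
per_tree = np.array([t.feature_importances_ for t in forest_imp.estimators_])

print(importances, importances.sum())
# Check whether the forest importances match the mean over trees
print(np.allclose(importances, per_tree.mean(axis=0)))
```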

To get a better idea of what these values mean, we can append 10 random (uninformative) components to the data and see what happens:

In [5]:

X, y = make_circles(n_samples=100, random_state=0, noise=.1, factor=.8)
X = np.hstack((X, np.random.randn(100, 10)))
clf = RandomForestClassifier(n_estimators=1)
clf.fit(X, y)
print(clf.feature_importances_)

[ 0.01904568  0.09171014  0.11415525  0.12332016  0.          0.11732331
  0.07057143  0.07285714  0.10031884  0.10500781  0.18569024  0.        ]

Surprisingly enough, the first 2 dimensions, which are the only informative ones, do not seem to be the most important. This is because we do not have a sufficient amount of data to assess feature importance reliably. If we add more samples:

In [6]:

for n in [100, 1000, 10000]:
    X, y = make_circles(n_samples=n, random_state=0, noise=.1, factor=.8)
    X = np.hstack((X, np.random.randn(n, 10)))
    clf = RandomForestClassifier(n_estimators=1)
    clf.fit(X, y)
    print(n, clf.feature_importances_)

100 [ 0.          0.06123426  0.03619321  0.16778742  0.0317675   0.19332414
  0.          0.32911261  0.          0.03648969  0.0790939   0.06499726]
1000 [ 0.25272982  0.22444649  0.06011106  0.04157876  0.05531609  0.04779058
  0.05020186  0.06180678  0.02339828  0.03804851  0.08905098  0.05552079]
10000 [ 0.28151625  0.24462119  0.04144543  0.04638612  0.05330028  0.05210088
  0.04936459  0.04543707  0.05635139  0.04343161  0.04420838  0.04183681]