In this section of the tutorial, we will investigate the use of Random Forest classifiers in sklearn. As with all models in the sklearn framework, Random Forests mainly rely on the fit(X, y) and predict(X) methods. Once a model is fitted, the relative importance of the features can be accessed via the feature_importances_ attribute.
More information about the use of Random Forests for classification in sklearn can be found at: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
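As a quick illustration of the fit / predict / feature_importances_ pattern described above, here is a minimal sketch. The dataset parameters, the number of trees and the fixed random_state are assumptions chosen only to keep the example small and reproducible.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier

# Toy 2D dataset; the parameter values are illustrative assumptions
X, y = make_circles(n_samples=100, noise=.1, factor=.6, random_state=0)

# random_state is fixed only to make this sketch reproducible
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)                # learn the forest from the labelled data
y_pred = clf.predict(X[:5])  # predicted class labels for the first 5 points

print(y_pred.shape)                    # one label per input row
print(clf.feature_importances_.shape)  # one importance value per feature
print(clf.feature_importances_.sum())  # importances are normalized to sum to 1
```

Note that feature_importances_ only exists after fit has been called.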
To begin with, let us import the libraries we need and define a function that plots a fitted classifier in 2D (this function is not specific to Random Forests).
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import numpy as np
def plot_decision(clf, X, y):
    # Build a 2D grid and perform classification using clf on this grid
    xx, yy = np.meshgrid(np.arange(X[:, 0].min() - .5, X[:, 0].max() + .5, .01),
                         np.arange(X[:, 1].min() - .5, X[:, 1].max() + .5, .01))
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=.2)
    # Plot data
    plt.scatter(X[:, 0], X[:, 1], c=y, s=40)
    # Set figure coordinate limits
    plt.xlim(X[:, 0].min() - .5, X[:, 0].max() + .5)
    plt.ylim(X[:, 1].min() - .5, X[:, 1].max() + .5)
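The grid construction in plot_decision may be easier to follow on a tiny example. The sketch below uses a 3 x 2 integer grid and a made-up labelling rule (a stand-in for clf.predict, not part of the original function) to show how np.meshgrid, np.c_ and reshape fit together.

```python
import numpy as np

# A tiny grid: 3 values along x, 2 values along y
xx, yy = np.meshgrid(np.arange(0, 3), np.arange(0, 2))
print(xx.shape)  # (2, 3): rows follow y, columns follow x

# Flatten both coordinate arrays and stack them as columns:
# each row of grid_points is one (x, y) location to classify
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points)

# Hypothetical stand-in for clf.predict: label 1 wherever x > 0
labels = (grid_points[:, 0] > 0).astype(int)

# Reshape the flat label vector back to the grid shape, as contourf expects
zz = labels.reshape(xx.shape)
print(zz)
```

With the 0.01 step used in plot_decision, the same logic produces a much denser grid, so that the filled contour looks like a continuous decision region.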
Then, we generate some data and train a forest made of a single tree (n_estimators=1):
X, y = make_circles(n_samples=100, random_state=0, noise=.1, factor=.6)
clf = RandomForestClassifier(n_estimators=1)
clf.fit(X, y)
plot_decision(clf, X, y)
Now, if we vary the number of trees in the model, the decision boundary changes (averaging over more trees typically makes it smoother and more stable):
plt.figure(figsize=(15, 5))
for i, n_trees in enumerate([1, 10, 100]):
    plt.subplot(1, 3, i + 1)
    clf = RandomForestClassifier(n_estimators=n_trees)
    clf.fit(X, y)
    plot_decision(clf, X, y)
    plt.title("%d tree(s)" % n_trees)
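The plots give a qualitative picture of the effect of the number of trees. One possible way to put a number on it, sketched below, is to hold out part of the data and compare test accuracies. The train/test split, the larger sample size and the fixed random_state values are assumptions added for this illustration; they are not part of the experiment above.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Larger sample than in the plots so that the held-out set is not tiny (assumption)
X, y = make_circles(n_samples=500, noise=.1, factor=.6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=0)

scores = {}
for n_trees in [1, 10, 100]:
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    clf.fit(X_train, y_train)
    # score returns the mean accuracy on the given data
    scores[n_trees] = clf.score(X_test, y_test)
    print("%d tree(s): test accuracy = %.3f" % (n_trees, scores[n_trees]))
```

The exact figures depend on the seeds and on the split, so they should be read as indicative only.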
Once a model is fitted, we can look at the relative importance of the different features:
clf.feature_importances_
array([ 0.52747471, 0.47252529])
To get a feel for how reliable these importances are, we can append 10 random noise dimensions to the data and see what happens:
X, y = make_circles(n_samples=100, random_state=0, noise=.1, factor=.8)
X = np.hstack((X, np.random.randn(100, 10)))
clf = RandomForestClassifier(n_estimators=1)
clf.fit(X, y)
print(clf.feature_importances_)
[ 0.01904568 0.09171014 0.11415525 0.12332016 0. 0.11732331 0.07057143 0.07285714 0.10031884 0.10500781 0.18569024 0. ]
Surprisingly enough, the first 2 dimensions do not appear to be the most informative. This is because we do not have enough data to assess feature importance reliably. If we add more samples:
for n in [100, 1000, 10000]:
    X, y = make_circles(n_samples=n, random_state=0, noise=.1, factor=.8)
    X = np.hstack((X, np.random.randn(n, 10)))
    clf = RandomForestClassifier(n_estimators=1)
    clf.fit(X, y)
    print(n, clf.feature_importances_)
100 [ 0. 0.06123426 0.03619321 0.16778742 0.0317675 0.19332414 0. 0.32911261 0. 0.03648969 0.0790939 0.06499726]
1000 [ 0.25272982 0.22444649 0.06011106 0.04157876 0.05531609 0.04779058 0.05020186 0.06180678 0.02339828 0.03804851 0.08905098 0.05552079]
10000 [ 0.28151625 0.24462119 0.04144543 0.04638612 0.05330028 0.05210088 0.04936459 0.04543707 0.05635139 0.04343161 0.04420838 0.04183681]
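The full importance vectors are a little hard to compare by eye. A compact summary, sketched below, is the share of total importance assigned to the two informative dimensions. The seeded random generator and the random_state passed to the classifier are assumptions added to make the run repeatable, so the numbers will not match the outputs above exactly.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)  # seeded noise so that runs are repeatable
informative_share = {}
for n in [100, 1000, 10000]:
    X, y = make_circles(n_samples=n, random_state=0, noise=.1, factor=.8)
    X = np.hstack((X, rng.randn(n, 10)))  # append 10 pure-noise dimensions
    clf = RandomForestClassifier(n_estimators=1, random_state=0)
    clf.fit(X, y)
    # Importances sum to 1, so this is the fraction credited to the
    # first two (informative) dimensions
    informative_share[n] = clf.feature_importances_[:2].sum()
    print(n, round(informative_share[n], 3))
```

In the outputs printed above, this share would be roughly 0.06 for 100 samples, 0.48 for 1000 and 0.53 for 10000.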