This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.
Decision trees are frequently used to represent workflows or algorithms. They also form a method for non-parametric supervised learning: a tree mapping observations to target values is learned on a training set and then predicts the outcomes of new observations.
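As a minimal sketch of this idea, the snippet below fits a single regression tree and uses it to predict new observations. The synthetic data (y = x squared), the variable names, and the `max_depth=4` setting are assumptions made for illustration; they are not part of this recipe.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training set (illustrative assumption): y = x**2 on [0, 1).
rng = np.random.RandomState(0)
X_train = rng.rand(200, 1)
y_train = X_train[:, 0] ** 2

# Learn a tree that maps observations to target values.
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Use the learned tree to predict the outcomes of new observations.
X_new = np.array([[0.1], [0.5], [0.9]])
y_pred = tree.predict(X_new)
print(y_pred)
```

The tree predicts a constant value within each leaf, so the predictions form a piecewise-constant approximation of the target.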
Random forests are ensembles of decision trees. Multiple decision trees are trained and aggregated to form a model that generally performs better than any of the individual trees. This general idea is the basis of ensemble learning.
There are many types of ensemble methods. Random forests are an instance of bootstrap aggregating, also called bagging, where models are trained on subsets of the training set drawn at random with replacement.
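Bagging can be sketched by hand with plain decision trees. The example below is an illustration under stated assumptions, not the implementation used by scikit-learn: the data comes from `make_regression`, the number of trees is arbitrary, and the random feature subsampling that random forests can apply at each split is left out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption; not the dataset of this recipe).
X_demo, y_demo = make_regression(n_samples=400, n_features=5,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

rng = np.random.RandomState(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many indices as training points, with replacement.
    idx = rng.randint(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Aggregate by averaging the predictions of the individual trees.
all_preds = np.array([t.predict(X_te) for t in trees])  # (n_trees, n_test)
bagged_pred = all_preds.mean(axis=0)

# Compare the held-out error of the ensemble with that of its members.
tree_mse = ((all_preds - y_te) ** 2).mean(axis=1)  # one value per tree
bagged_mse = ((bagged_pred - y_te) ** 2).mean()
print("mean single-tree MSE:", tree_mse.mean())
print("bagged MSE:", bagged_mse)
```

Because squared error is convex, the error of the averaged prediction cannot exceed the average error of the individual trees. How much lower it gets depends on how different the trees are from one another.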
Random forests yield information about the importance of each feature for the classification or regression task. In this recipe, we use this method to find the features that most influence the price of Boston houses. We will use a classic dataset containing a range of diverse indicators about the houses' neighborhood.
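To get a feel for what feature importances report, here is a small sketch on synthetic data where the answer is known in advance. The data, the coefficients, and the names `X_syn`, `y_syn`, and `forest` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): four features, but the target depends
# strongly on feature 0, weakly on feature 1, and not on features 2 and 3.
rng = np.random.RandomState(0)
X_syn = rng.rand(500, 4)
y_syn = 10 * X_syn[:, 0] + 1 * X_syn[:, 1] + 0.1 * rng.randn(500)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_syn, y_syn)

# One non-negative score per feature; the scores sum to 1.
importances = forest.feature_importances_
print(importances)
```

The largest score should go to feature 0, which drives most of the variation in the target.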
import numpy as np
import sklearn as sk
import sklearn.datasets as skd
import sklearn.ensemble as ske
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['figure.dpi'] = mpl.rcParams['savefig.dpi'] = 300
# Note: load_boston was removed in scikit-learn 1.2; this call requires an older version.
data = skd.load_boston()
The details of this dataset, including a description of all features, can be found in data['DESCR'].
The target value is MEDV.
We create a RandomForestRegressor model.
reg = ske.RandomForestRegressor()
X = data['data']
y = data['target']
reg.fit(X, y);
The importance of each feature is available in reg.feature_importances_. We sort them by decreasing order of importance.
fet_ind = np.argsort(reg.feature_importances_)[::-1]
fet_imp = reg.feature_importances_[fet_ind]
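The sorting idiom used here can be checked on a toy example. The feature names and importance values below are hypothetical, chosen only to show how `np.argsort(...)[::-1]` gives indices in decreasing order and how the same index array reorders the names.

```python
import numpy as np

# Hypothetical feature names and importances, for illustration only.
names = np.array(['a', 'b', 'c', 'd'])
imp = np.array([0.10, 0.60, 0.05, 0.25])

# argsort sorts in increasing order; reversing gives decreasing order.
order = np.argsort(imp)[::-1]

# The same index array reorders both the scores and the names.
ranked = list(zip(names[order], imp[order]))
for name, score in ranked:
    print(name, score)
```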
fig = plt.figure(figsize=(8,4));
ax = plt.subplot(111);
# align='edge' keeps each bar on [i, i+1], so the ticks at i+0.5 are centered.
plt.bar(np.arange(len(fet_imp)), fet_imp, width=1, lw=2, align='edge');
plt.grid(False);
ax.set_xticks(np.arange(len(fet_imp))+.5);
ax.set_xticklabels(data['feature_names'][fet_ind]);
plt.xlim(0, len(fet_imp));
We find that LSTAT (proportion of lower status of the population) and RM (number of rooms per dwelling) are the most important features determining the price of a house. As an illustration, here is a scatter plot of the price as a function of LSTAT:
plt.scatter(X[:,-1], y);
plt.xlabel('LSTAT indicator');
plt.ylabel('Value of houses (k$)');
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).