This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

8.6. Using a random forest to select important features for regression

Decision trees are frequently used to represent workflows or algorithms. They also form a method for non-parametric supervised learning: a tree mapping observations to target values is learned on a training set and then predicts the outcomes of new observations.

Random forests are ensembles of decision trees. Multiple decision trees are trained and aggregated to form a model that generally performs better than any of the individual trees. This is the general idea behind ensemble learning.

There are many types of ensemble methods. Random forests are an instance of bootstrap aggregating, also called bagging, where each model is trained on a bootstrap sample, that is, a subset of the training set drawn at random with replacement.
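
As a rough illustration of bagging, the sketch below trains several decision trees on bootstrap samples and averages their predictions. The synthetic sine data, the number of trees, and all variable names are assumptions made for this example; they are not part of the recipe. It covers only the bagging step, whereas random forests typically also randomize the candidate features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# Synthetic data (an assumption for illustration): a noisy sine curve.
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_test = np.linspace(0, 6, 100)[:, np.newaxis]
y_test = np.sin(X_test[:, 0])

n_trees = 50
predictions = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set, with replacement.
    idx = rng.randint(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X[idx], y[idx])
    predictions.append(tree.predict(X_test))
predictions = np.array(predictions)  # shape (n_trees, n_test_points)

# Aggregate the trees by averaging their predictions.
bagged = predictions.mean(axis=0)

# Compare the error of the aggregate with the typical error of one tree.
mse_single = np.mean((predictions - y_test) ** 2, axis=1)  # one value per tree
mse_bagged = np.mean((bagged - y_test) ** 2)
print("Average MSE of the individual trees:", mse_single.mean())
print("MSE of the bagged model:            ", mse_bagged)
```

Because the squared error is convex, the error of the averaged prediction cannot exceed the average error of the individual trees. How much it improves depends on the data and on how different the trees are from one another.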

Random forests yield information about the importance of each feature for the classification or regression task. In this recipe, we use this method to find the features that most influence the price of Boston houses. We will use a classic dataset containing a range of diverse indicators about the houses' neighborhoods.

How to do it...

  1. We import the packages.
In [ ]:
import numpy as np
import sklearn as sk
import sklearn.datasets as skd
import sklearn.ensemble as ske
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['figure.dpi'] = mpl.rcParams['savefig.dpi'] = 300
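  2. We load the Boston dataset.
In [ ]:
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call requires an older release of scikit-learn.
data = skd.load_boston()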

The details of this dataset can be found in data['DESCR']. Here is the description of all variables:

  • CRIM, per capita crime rate by town
  • ZN, proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS, proportion of non-retail business acres per town
  • CHAS, Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX, nitric oxides concentration (parts per 10 million)
  • RM, average number of rooms per dwelling
  • AGE, proportion of owner-occupied units built prior to 1940
  • DIS, weighted distances to five Boston employment centres
  • RAD, index of accessibility to radial highways
  • TAX, full-value property-tax rate per USD 10,000
  • PTRATIO, pupil-teacher ratio by town
  • B, $1000(Bk - 0.63)^2$ where Bk is the proportion of blacks by town
  • LSTAT, % lower status of the population
  • MEDV, median value of owner-occupied homes in USD 1000s

The target value is MEDV.

  3. We create a RandomForestRegressor model.
In [ ]:
reg = ske.RandomForestRegressor()
  4. We get the samples and the target values from this dataset.
In [ ]:
X = data['data']
y = data['target']
  5. Let's fit the model.
In [ ]:
reg.fit(X, y);
  6. The feature importances are stored in reg.feature_importances_. We sort them in decreasing order of importance.
In [ ]:
fet_ind = np.argsort(reg.feature_importances_)[::-1]
fet_imp = reg.feature_importances_[fet_ind]
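  7. Finally, we plot a bar chart of the feature importances.
In [ ]:
fig = plt.figure(figsize=(8,4));
ax = plt.subplot(111);
# One bar per feature, in decreasing order of importance.
plt.bar(np.arange(len(fet_imp)), fet_imp, width=1, lw=2, align='edge');
plt.xlim(0, len(fet_imp));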

We find that LSTAT (percentage of lower-status population) and RM (average number of rooms per dwelling) are the most important features determining the price of a house. As an illustration, here is a scatter plot of the price as a function of LSTAT, which is the last column of X:

In [ ]:
plt.scatter(X[:,-1], y);
plt.xlabel('LSTAT indicator');
plt.ylabel('Value of houses (k$)');
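
To get a feel for how to read reg.feature_importances_, the same steps can be tried on data where the answer is known in advance. The following is a minimal sketch on synthetic data; the dataset, the feature names, and the n_estimators and random_state values are assumptions chosen for this illustration and do not come from the recipe.

```python
import numpy as np
import sklearn.ensemble as ske

rng = np.random.RandomState(0)

# Hypothetical data: 5 features, of which only the first two influence
# the target (the first one much more strongly than the second).
n_samples = 500
X = rng.normal(size=(n_samples, 5))
y = (3.0 * X[:, 0] + np.sin(2.0 * X[:, 1])
     + rng.normal(scale=0.1, size=n_samples))
names = np.array(['strong', 'weak', 'noise_1', 'noise_2', 'noise_3'])

# Fixing random_state makes the run reproducible.
reg = ske.RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)

# Sort the features by decreasing importance, as in the recipe.
order = np.argsort(reg.feature_importances_)[::-1]
for name, imp in zip(names[order], reg.feature_importances_[order]):
    print("%-8s %.3f" % (name, imp))
```

The importances are normalized to sum to 1, so each value can be read as a share of the total. Here the feature that drives most of the variance of the target should come out on top, and the pure-noise features should receive small values.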


IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).