%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
Here we'll explore a class of algorithms based on decision trees. Decision trees at their root (ha!) are extremely intuitive. They encode a series of binary choices in a process that parallels how a person might classify things themselves, but use an information criterion to decide which question is most fruitful at each step. For example, if you wanted to create a guide to identifying an animal found in nature, you might ask a series of questions like the following:

- Is the animal bigger than a meter?
  - Yes: does the animal have horns?
  - No: does the animal have fur or feathers?

and so on. This binary splitting of questions is the essence of a decision tree.
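As a small illustration of this question-asking structure, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on the iris toy dataset, printing the learned splits with export_text (available in recent scikit-learn versions); the dataset and depth here are arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# Each indented line is one binary question asked about a feature
print(export_text(tree, feature_names=iris.feature_names))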
Decision trees, and the related boosted-tree and random forest estimators, are powerful, broadly useful non-parametric approaches. Non-parametric methods are particularly valuable when the underlying structure or generative model of the data is unknown.
For example, imagine you wanted to derive a model for some periodic function having to do with tides. We'll mimic this data as the sum of two sine waves plus Gaussian noise:
# Reproducible pseudo-random noise
rng = np.random.RandomState(0)

# 100 sample points on [0, 6], as a column vector of shape (n_samples, 1)
N = 100
X = np.linspace(0, 6, N)[:, np.newaxis]
error = 0.4

# True signal: the sum of two sine waves; noisy version adds Gaussian noise
y_true = np.sin(X).ravel() + np.sin(6 * X).ravel()
y_noisy = y_true + rng.normal(0, error, X.shape[0])

plt.plot(X.ravel(), y_true, color='gray')
plt.plot(X.ravel(), y_noisy, '.k')
This data looks relatively complicated, but if we had the intuition to know that it is simply the combination of a small number of sine waves, we could exploit this sparse representation in Fourier space and fit a fast linear estimator.

Taking the FFT of the data gives two peaks in frequency, indicating that the representation is indeed sparse in this basis:
from scipy import fftpack

# Plot the power spectrum; for a real-valued signal only the positive
# frequencies (the first N // 2 bins) are needed
plt.plot(fftpack.fftfreq(len(y_noisy))[:N // 2],
         abs(fftpack.fft(y_noisy))[:N // 2])
plt.xlim(0, None)
plt.xlabel('frequency')
plt.ylabel('Fourier power')
This shows how important intuition, especially physical intuition, can be when approaching a learning problem on real-world data. If you blindly throw algorithms at a data set, you may miss a simple sparse representation that is far more meaningful.
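To make this concrete, here is a minimal sketch of the fast linear approach, assuming we already know (from that intuition) that only the frequencies 1 and 6 matter; the two-column sine design matrix is a choice made here for illustration, not part of the original analysis:

from sklearn.linear_model import LinearRegression

# Build features from the two known frequencies; knowing these frequencies
# is exactly the "physical intuition" being assumed
X_sines = np.hstack([np.sin(X), np.sin(6 * X)])
lin = LinearRegression().fit(X_sines, y_noisy)

X_fit = np.linspace(0, 6, 1000)[:, np.newaxis]
y_fit_lin = lin.predict(np.hstack([np.sin(X_fit), np.sin(6 * X_fit)]))

plt.plot(X_fit.ravel(), y_fit_lin, color='blue')
plt.plot(X.ravel(), y_noisy, '.k')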
But suppose we don't have any intuition that would lead to representing the data in a sparse basis. In that case we can benefit from a non-parametric estimator. One well-known and powerful non-parametric estimator is the decision tree.

A decision tree is a simple binary tree that, at its root, works somewhat like nearest-neighbor prediction: it splits the input space into regions and predicts a constant value within each region. For regression it can be used as follows:
# Shuffle the data so that later cross-validation folds are not
# contiguous chunks of x
i = np.random.permutation(X.shape[0])
X = X[i]
y_noisy = y_noisy[i]
from sklearn.tree import DecisionTreeRegressor

# Fit a single tree of limited depth to the noisy data
clf = DecisionTreeRegressor(max_depth=5)
clf.fit(X, y_noisy)

# Evaluate the fit on a fine grid and plot it against the noisy samples
X_fit = np.linspace(0, 6, 1000).reshape((-1, 1))
y_fit_1 = clf.predict(X_fit)

plt.plot(X_fit.ravel(), y_fit_1, color='blue')
plt.plot(X.ravel(), y_noisy, '.k')
A single decision tree allows us to estimate the signal in a non-parametric way, but clearly has some issues. In some regions, the model shows high bias and under-fits the data (seen in the long flat lines which don't follow the contours of the data), while in other regions the model shows high variance and over-fits the data (reflected in the narrow spikes which are influenced by noise in single points).
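To see this depth dependence directly, here is a quick comparison (an added sketch, not part of the original analysis) between a very shallow tree and an unconstrained one:

# A shallow tree under-fits; an unconstrained tree chases individual noisy points
for depth in [3, None]:
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, y_noisy)
    plt.plot(X_fit.ravel(), tree.predict(X_fit),
             label='max_depth={0}'.format(depth))
plt.plot(X.ravel(), y_noisy, '.k')
plt.legend()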
One way to address this is to combine many trees, so that the effects of their over-fitting average out. This approach is known as a Random Forest.

Here we will use a random forest of 200 trees to reduce the tendency of each individual tree to over-fit the data.
from sklearn.ensemble import RandomForestRegressor

# Average 200 randomized trees, each limited to depth 5
clf = RandomForestRegressor(n_estimators=200, max_depth=5)
clf.fit(X, y_noisy)

y_fit_200 = clf.predict(X_fit)
plt.plot(X_fit.ravel(), y_fit_200, color='blue')
plt.plot(X.ravel(), y_noisy, '.k')
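As a rough quantitative check of the claim that averaging reduces over-fitting, we can compare both fits against the noise-free signal on the evaluation grid (a check added here for illustration; the exact numbers depend on the random seed):

from sklearn.metrics import mean_squared_error

# Noise-free target evaluated on the same fine grid used for prediction
y_true_fit = np.sin(X_fit).ravel() + np.sin(6 * X_fit).ravel()
print("single tree MSE:  ", mean_squared_error(y_true_fit, y_fit_1))
print("random forest MSE:", mean_squared_error(y_true_fit, y_fit_200))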
We can also use a grid search with cross-validation to choose good values of n_estimators and max_depth for the random forest:

from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor()
parameters = {'n_estimators': [200, 300, 400],
              'max_depth': [5, 7, 9]}

# Warning: be sure your data is shuffled before using GridSearchCV!
# (its default cross-validation splits the data in order)
clf_grid = GridSearchCV(rf, parameters)
clf_grid.fit(X, y_noisy)

rf_best = clf_grid.best_estimator_
X_fit = np.linspace(0, 6, 1000).reshape((-1, 1))
y_fit_best = rf_best.predict(X_fit)

print(rf_best.n_estimators, rf_best.max_depth)
plt.plot(X_fit.ravel(), y_fit_best, color='blue')
plt.plot(X.ravel(), y_noisy, '.k')
Another ensemble method that can be useful is boosting: here, rather than looking at (say) 200 parallel estimators, we construct a chain of 200 estimators that iteratively refine the results of the previous estimator. The idea is that by sequentially applying very fast, simple models, we can get a total model error which is better than any of the individual pieces.
from sklearn.ensemble import GradientBoostingRegressor

# 200 shallow trees (depth 2), each fit to the residuals of the ones before it
clf = GradientBoostingRegressor(n_estimators=200, max_depth=2)
clf.fit(X, y_noisy)

y_fit_200 = clf.predict(X_fit)
plt.plot(X_fit.ravel(), y_fit_200, color='blue')
plt.plot(X.ravel(), y_noisy, '.k')
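To visualize this iterative refinement, GradientBoostingRegressor provides staged_predict, which yields the model's prediction after each boosting stage; the particular stages plotted below are an arbitrary choice for illustration:

# Collect the prediction after every boosting stage, then plot a few of them
staged = list(clf.staged_predict(X_fit))
for n_stages in [1, 10, 200]:
    plt.plot(X_fit.ravel(), staged[n_stages - 1],
             label='{0} stages'.format(n_stages))
plt.plot(X.ravel(), y_noisy, '.k')
plt.legend()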
Use a grid search to optimize the number of estimators and the max_depth for a gradient boosted decision tree.

Plug this optimal max_depth into a single decision tree. Does this single tree over-fit or under-fit the data?

Repeat this for the random forest: construct a single decision tree using the max_depth which is optimal for the random forest. Does this single tree over-fit or under-fit the data?
%load solutions/07B_grid_search.py