This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

8.1. Getting started with scikit-learn

We will generate one-dimensional data with a simple model (including some noise), and we will try to fit a function to this data. With this function, we can predict values on new data points. This is a curve-fitting regression problem.

  1. First, let's make all the necessary imports.
In [ ]:
import numpy as np
import scipy.stats as st
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
%matplotlib inline
  1. We now define the deterministic function underlying our generative model.
In [ ]:
f = lambda x: np.exp(3 * x)
  1. We generate the values along the curve on $[0, 2]$.
In [ ]:
x_tr = np.linspace(0., 2, 200)
y_tr = f(x_tr)
  1. Now, let's generate our data points within $[0, 1]$. We use the function $f$ and we add some Gaussian noise. In order to be able to demonstrate some effects, we use one specific set of data points generated in this fashion.
In [ ]:
x = np.array([0, .1, .2, .5, .8, .9, 1])
# y = f(x) + np.random.randn(len(x))
y = np.array([0.59837698, 2.90450025, 4.73684354, 3.87158063, 11.77734608, 15.51112358, 20.08663964])
  1. Let's plot our data points on $[0, 1]$.
In [ ]:
plt.figure(figsize=(6,3));
plt.plot(x_tr[:100], y_tr[:100], '--k');
plt.plot(x, y, 'ok', ms=10);
  1. Now, we use scikit-learn to fit a linear model to the data. There are three steps. First, we create the model (an instance of the LinearRegression class). Then we fit the model to our data. Finally, we predict values from our trained model.
In [ ]:
# We create the model.
lr = lm.LinearRegression()
# We train the model on our training dataset.
lr.fit(x[:, np.newaxis], y);
# Now, we predict points with our trained model.
y_lr = lr.predict(x_tr[:, np.newaxis])

We need to convert x and x_tr to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have 7 observations with 1 feature.

  1. We now plot the result of the trained linear model. We obtain a regression line, in green here.
In [ ]:
plt.figure(figsize=(6,3));
plt.plot(x_tr, y_tr, '--k');
plt.plot(x_tr, y_lr, 'g');
plt.plot(x, y, 'ok', ms=10);
plt.xlim(0, 1);
plt.ylim(y.min()-1, y.max()+1);
plt.title("Linear regression");
  1. The linear fit is not well adapted here, since the data points are generated according to a non-linear model (an exponential curve). Therefore, we are now going to fit a non-linear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for that, by pre-computing the exponents of our data points. This is done by generating a Vandermonde matrix, using the np.vander function. We will explain this trick in more detail in How it works....
In [ ]:
lrp = lm.LinearRegression()
plt.figure(figsize=(6,3));
plt.plot(x_tr, y_tr, '--k');

for deg, s in zip([2, 5], ['-', '.']):
    lrp.fit(np.vander(x, deg + 1), y);
    y_lrp = lrp.predict(np.vander(x_tr, deg + 1))
    plt.plot(x_tr, y_lrp, s, label='degree ' + str(deg));
    plt.legend(loc=2);
    plt.xlim(0, 1.4);
    plt.ylim(-10, 40);
    # Print the model's coefficients.
    print(' '.join(['%.2f' % c for c in lrp.coef_]))
plt.plot(x, y, 'ok', ms=10);
plt.title("Linear regression");

We have fitted two polynomial models of degree 2 and 5. The degree 2 polynomial fits the data points less precisely than the degree 5 polynomial. However, it seems more robust: the degree 5 polynomial seems really bad at predicting values outside the data points (look for example at the portion $x \geq 1$). This is what we call overfitting: by using a model too complex, we obtain a better fit on the trained dataset, but a less robust model outside this set.

  1. We will now use a different learning model, called ridge regression. It works like linear regression, except that it prevents the polynomial's coefficients to explode (which was what happened in the overfitting example above). By adding a regularization term in the loss function, ridge regression imposes some structure on the underlying model. We will see more details in the next section.

The ridge regression model has a meta-parameter which represents the weight of the regularization term. We could try different values with trials and errors, using the Ridge class. However, scikit-learn includes another model called RidgeCV which includes a parameter search with cross-validation. In practice, it means that you don't have to tweak this parameter by hand: scikit-learn does it for you. Since the models of scikit-learn always follow the fit-predict API, all we have to do is replace lm.LinearRegression by lm.RidgeCV in the code above. We will give more details in the next section.

In [ ]:
ridge = lm.RidgeCV()
plt.figure(figsize=(6,3));
plt.plot(x_tr, y_tr, '--k');

for deg, s in zip([2, 5], ['-', '.']):
    ridge.fit(np.vander(x, deg + 1), y);
    y_ridge = ridge.predict(np.vander(x_tr, deg + 1))
    plt.plot(x_tr, y_ridge, s, label='degree ' + str(deg));
    plt.legend(loc=2);
    plt.xlim(0, 1.5);
    plt.ylim(-5, 80);
    # Print the model's coefficients.
    print(' '.join(['%.2f' % c for c in ridge.coef_]))

plt.plot(x, y, 'ok', ms=10);
plt.title("Ridge regression");

This time, the degree 5 polynomial seems better than the simpler degree 2 polynomial (which now causes underfitting). The ridge regression reduces the overfitting issue here.

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).