scikit-learn: Machine learning in Python

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable: BSD license
In [1]:
from IPython.core.display import Image
Image(url="http://scikit-learn.org/stable/_static/ml_map.png")
Out[1]:

Quick start example

In [2]:
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn import datasets, __version__
import numpy as np

np.random.seed(2014)
__version__
Out[2]:
'0.15.2'
In [3]:
digits = datasets.load_digits()
digits.keys()
Out[3]:
['images', 'data', 'target_names', 'DESCR', 'target']
In [4]:
digits.images.shape
Out[4]:
(1797, 8, 8)
In [5]:
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation="None")
Out[5]:
<matplotlib.image.AxesImage at 0x1110ab710>
In [6]:
digits.images[0]
Out[6]:
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
In [7]:
digits.data.shape # features in (n_samples, n_features) format
Out[7]:
(1797, 64)
In [8]:
digits.data[0] # simply the flattened image!
Out[8]:
array([  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
        15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
         8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
         5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
         1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
         0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.])
In [9]:
digits.target.shape
Out[9]:
(1797,)
In [10]:
digits.target[0]
Out[10]:
0

The task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong.

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).
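
As a minimal illustration of this contract, here is a toy, hypothetical classifier (not part of scikit-learn) that always predicts the most frequent training class:

import numpy as np

class MajorityClassifier(object):
    """Toy estimator obeying the fit/predict contract (illustration only)."""
    def fit(self, X, y):
        self.majority_ = np.bincount(y).argmax() # most frequent class label
        return self
    def predict(self, T):
        return np.repeat(self.majority_, len(T)) # same answer for every sample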

An example of an estimator is the class sklearn.svm.SVC that implements support vector classification. The constructor of an estimator takes as arguments the parameters of the model, but for the time being, we will consider the estimator as a black box:

In [11]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.) # manually chosen model parameters

It is possible to find good values for the model parameters automatically by using tools such as grid search and cross-validation.
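
For instance, a grid search over C and gamma might look like the following sketch (not executed in this notebook; GridSearchCV lives in sklearn.grid_search in this scikit-learn version):

from sklearn.grid_search import GridSearchCV

param_grid = {"C": [1., 10., 100.], "gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(svm.SVC(), param_grid) # 3-fold cross-validation by default
search.fit(digits.data, digits.target)
search.best_params_, search.best_score_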

We call our estimator instance clf, as it is a classifier. It must now be fitted to the data; that is, it must learn from the data. This is done by passing our training set to the fit method. As a training set, let us use all the images of our dataset apart from the last one.

In [12]:
clf.fit(digits.data[:-1], digits.target[:-1])
Out[12]:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
In [13]:
clf.predict(digits.data[-1]), digits.target[-1]
Out[13]:
(array([8]), 8)
In [14]:
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation="None")
Out[14]:
<matplotlib.image.AxesImage at 0x11177ad90>

Supervised learning: predicting an output variable from high-dimensional observations

Supervised learning consists in learning the link between two datasets: the observed data $X$ and an external variable $y$ that we are trying to predict, usually called “target” or “labels”. Most often, $y$ is a 1D array of length n_samples.

If the prediction task is to classify the observations into a finite set of labels, in other words to “name” the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task. When doing classification in scikit-learn, y is a vector of integers or strings.

Classifying irises

In [15]:
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
iris_X[0] # petal and sepal length and width of irises
Out[15]:
array([ 5.1,  3.5,  1.4,  0.2])
In [16]:
iris.target_names, np.unique(iris_y) # 3 different types of irises
iris.target
Out[16]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

k-Nearest neighbors (k-NN) classifier

The simplest possible classifier is the nearest neighbor: given a new observation, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature vector.
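
As a sketch of the idea, a hand-rolled 1-nearest-neighbor rule (illustration only; KNeighborsClassifier below is the real thing):

def predict_1nn(X_train, y_train, x):
    distances = ((X_train - x)**2).sum(axis=1) # squared Euclidean distances
    return y_train[distances.argmin()]         # label of the closest training point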

In [17]:
# split into training and test
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test  = iris_X[indices[-10:]] # yay half-open intervals
iris_y_test  = iris_y[indices[-10:]]

# create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train) 
knn.predict(iris_X_test), iris_y_test
Out[17]:
(array([0, 0, 0, 2, 2, 2, 0, 0, 1, 1]), array([0, 0, 0, 2, 2, 2, 0, 0, 2, 1]))
In [18]:
# The mean squared error
def mse(yhat, y):
    return np.mean((yhat - y)**2)
mse(knn.predict(iris_X_test), iris_y_test)
Out[18]:
0.10000000000000001

Linear regression

Linear Regression, in the simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible. Linear models are of the form

$$y = X\beta + \epsilon$$

with data $X$, target variable $y$, coefficients $\beta$ and observation noise $\epsilon$.
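
Under the hood, the least-squares estimate solves $\min_\beta \|y - X\beta\|^2$. A minimal sketch on a synthetic problem, letting np.linalg.lstsq do the solving (LinearRegression additionally fits an intercept):

X_toy = np.c_[np.ones(5), np.arange(5.)] # bias column plus one feature
y_toy = 2. + 3. * np.arange(5.)          # exactly linear, no noise
np.linalg.lstsq(X_toy, y_toy)[0]         # least-squares beta: approximately [2., 3.]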

In [19]:
diabetes = datasets.load_diabetes()
# 10 physiological variables (age, sex, weight, blood pressure),
# and an indication of disease progression after one year
diabetes_X_train = diabetes.data[:-20]
diabetes_y_train = diabetes.target[:-20]
diabetes_X_test  = diabetes.data[-20:]
diabetes_y_test  = diabetes.target[-20:]
diabetes_X_train[0], diabetes_y_train[0]
Out[19]:
(array([ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613]),
 151.0)
In [20]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
regr.coef_
Out[20]:
array([  3.03499549e-01,  -2.37639315e+02,   5.10530605e+02,
         3.27736980e+02,  -8.14131709e+02,   4.92814588e+02,
         1.02848452e+02,   1.84606489e+02,   7.43519617e+02,
         7.60951722e+01])
In [21]:
mse(regr.predict(diabetes_X_test), diabetes_y_test)
Out[21]:
2004.5676026898218
In [22]:
# The score method returns the coefficient of determination R^2:
# 1 is perfect prediction and 0 means that there is
# no linear relationship between X and y.
regr.score(diabetes_X_test, diabetes_y_test)
Out[22]:
0.58507530226905724
In [23]:
plt.plot(diabetes_X_test[:,2], diabetes_y_test, "go") # look at feature 2 only
plt.plot(diabetes_X_test[:,2], regr.predict(diabetes_X_test), "ro")
Out[23]:
[<matplotlib.lines.Line2D at 0x111988710>]

Shrinkage

If there are few data points per dimension, noise in the observations induces high variance:

In [24]:
X = np.c_[.5, 1].T
y = [.5, 1]
test = np.c_[0, 2].T
regr = linear_model.LinearRegression()
for _ in range(6): 
   this_X = X + .1 * np.random.normal(size=X.shape) # add some noise
   regr.fit(this_X, y)
   plt.plot(test, regr.predict(test)) 
   plt.scatter(this_X, y, s=3)

One possible solution in high-dimensional statistical learning is to shrink the regression coefficients toward zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge regression:

In [25]:
regr = linear_model.Ridge(alpha=0.1)
for _ in range(6): 
   this_X = X + .1 * np.random.normal(size=X.shape) # add some noise
   regr.fit(this_X, y)
   plt.plot(test, regr.predict(test)) 
   plt.scatter(this_X, y, s=3)

This is an example of bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower the variance.

Capturing noise in the fitted parameters that prevents the model from generalizing to new data is called overfitting. The bias introduced by ridge regression is a form of regularization.
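
A quick way to see the shrinkage (a sketch): the norm of the coefficient vector decreases as alpha grows.

for a in [0.01, 1., 100.]:
    r = linear_model.Ridge(alpha=a).fit(diabetes_X_train, diabetes_y_train)
    print("alpha={}: ||coef|| = {}".format(a, np.sqrt((r.coef_**2).sum()))) # L2 norm shrinks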

We can choose alpha to minimize the prediction error, i.e. to maximize the held-out score, this time using the diabetes dataset rather than our synthetic data:

In [26]:
alphas = np.logspace(-4, -1, 6) # search over this range
s = np.array([regr.set_params(alpha=alpha) \
     .fit(diabetes_X_train, diabetes_y_train) \
     .score(diabetes_X_test, diabetes_y_test) \
     for alpha in alphas]); s
Out[26]:
array([ 0.58511107,  0.5852073 ,  0.58546775,  0.5855512 ,  0.58307171,
        0.57058999])
In [27]:
alphas[s.argmax()] # the score is "bigger is better", so take the argmax
Out[27]:
0.0063095734448019303

Curse of dimensionality

For an estimator to be effective, you need the distance between neighboring points to be less than some value $d$, which depends on the problem. In one dimension, this requires on average $n \sim 1/d$ points. In the context of the above k-NN example, if the data is described by just one feature with values ranging from 0 to 1 and with $n$ training observations, then new data will be no further away than $1/n$. Therefore, the nearest neighbor decision rule will be efficient as soon as $1/n$ is small compared to the scale of between-class feature variations.

If the number of features is $p$, you now require $n \sim 1/d^p$ points. Let’s say that we require 10 points in one dimension: now $10^p$ points are required in $p$ dimensions to pave the $[0, 1]$ space. As $p$ becomes large, the number of training points required for a good estimator grows exponentially.

This is called the curse of dimensionality and is a core problem that machine learning addresses.
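
A back-of-the-envelope check of this exponential growth:

for p in [1, 2, 5, 10]:
    print("p={}: {} points".format(p, 10**p)) # points needed to pave [0, 1]^p at spacing 1/10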

To mitigate the curse of dimensionality on the diabetes dataset, it would be interesting to select only the informative features and set non-informative ones, like feature #0 (whose least-squares coefficient above is tiny), to 0. Ridge regression will decrease their contribution, but not set them to zero. Another penalization approach, called lasso (least absolute shrinkage and selection operator), can set some coefficients to zero. Such methods are called sparse methods, and sparsity can be seen as an application of Occam’s razor: prefer simpler models.

In [28]:
regr = linear_model.Lasso() # or LassoLars
s = np.array([regr.set_params(alpha=alpha) \
     .fit(diabetes_X_train, diabetes_y_train) \
     .score(diabetes_X_test, diabetes_y_test) \
     for alpha in alphas])
best_alpha = alphas[s.argmax()] # again, a bigger score is better
regr.alpha = best_alpha
regr.fit(diabetes_X_train, diabetes_y_train)
regr.coef_
Out[28]:
array([   0.        , -212.43764548,  517.19478111,  313.77959962,
       -160.8303982 ,   -0.        , -187.19554705,   69.38229038,
        508.66011217,   71.84239008])

Now some coefficients are set to zero, which means that certain features are ignored altogether.

Logistic regression

For classification, as in the iris labeling task, linear regression is not the right approach, as it gives too much weight to data far from the decision frontier. An alternative is to fit a sigmoid or logistic function.
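
The logistic function squashes a linear score into the interval $(0, 1)$; a quick sketch:

t = np.linspace(-6, 6, 100)
plt.plot(t, 1. / (1. + np.exp(-t))) # sigmoid(t) = 1 / (1 + exp(-t))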

In [29]:
logistic = linear_model.LogisticRegression(C=1e5, penalty="l1")
# l1 gives sparsity, l2 gives shrinkage
logistic.fit(iris_X_train, iris_y_train)
Out[29]:
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, penalty='l1',
          random_state=None, tol=0.0001)
In [30]:
mse(logistic.predict(iris_X_test), iris_y_test)
Out[30]:
0.0

Support vector machines (SVMs)

Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build a plane maximizing the margin between the two classes. Regularization is set by the $C$ parameter: a small value for $C$ means the margin is calculated using many or all of the observations around the separating line (more regularization); a large value for $C$ means the margin is calculated on observations close to the separating line (less regularization). SVMs can be used in regression (SVR: support vector regression), or in classification (SVC: support vector classification).

Note: For many estimators, including SVMs, scaling each feature to unit standard deviation is important to obtain good predictions.
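
One way to do this (a sketch, using the preprocessing module) is to standardize each feature with statistics learned on the training set only:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(iris_X_train)     # learn per-feature mean and std
iris_X_train_std = scaler.transform(iris_X_train)
iris_X_test_std = scaler.transform(iris_X_test) # reuse the training statistics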

In [31]:
from IPython.core.display import Image
Image(url="http://scikit-learn.org/stable/_images/plot_svm_margin_0021.png")
Out[31]:
In [32]:
from sklearn import svm
svc = svm.SVC(kernel="linear")
svc.fit(iris_X_train, iris_y_train)
Out[32]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear but may be polynomial instead. This is done using the kernel trick that can be seen as creating a decision energy by positioning kernels on observations:
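
In scikit-learn this is just the kernel parameter of SVC; a sketch:

svc_poly = svm.SVC(kernel="poly", degree=3).fit(iris_X_train, iris_y_train)
svc_rbf = svm.SVC(kernel="rbf").fit(iris_X_train, iris_y_train)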

In [33]:
from IPython.core.display import Image
Image(url="http://scikit-learn.org/stable/_images/plot_svm_kernels_0011.png")
# linear kernel
Out[33]:
In [34]:
from IPython.core.display import Image
Image(url="http://scikit-learn.org/stable/_images/plot_svm_kernels_0021.png")
# polynomial kernel
Out[34]:
In [35]:
from IPython.core.display import Image
Image(url="http://scikit-learn.org/stable/_images/plot_svm_kernels_0031.png")
# RBF (Radial Basis Function) kernel
Out[35]:

Model selection: Choosing estimators and their parameters

As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on new data: bigger is better.

In [36]:
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel="linear")
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
Out[36]:
0.97999999999999998

To get a better measure of prediction accuracy (which we can use as a proxy for the goodness of fit of the model), we can successively split the data into folds that we use for training and testing; this is called $k$-fold cross-validation.

In [37]:
K = 3
X_folds = np.array_split(X_digits, K)
y_folds = np.array_split(y_digits, K)
scores = list()
for k in range(K):
    X_train = list(X_folds) # We use 'list' to copy, in order to 'pop' later on
    X_test  = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test  = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
scores, np.array(scores).mean()
Out[37]:
([0.93489148580968284, 0.95659432387312182, 0.93989983305509184],
 0.94379521424596557)

Cross-validation generators

The code above that splits the data into train and test sets is tedious to write. Scikit-learn exposes cross-validation generators that produce lists of indices for this purpose:

In [38]:
from sklearn import cross_validation
kfold = cross_validation.KFold(n=6, n_folds=3)
for train_indices, test_indices in kfold:
     print("Train: {} | test: {}".format(train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
In [39]:
kfold = cross_validation.KFold(len(X_digits), n_folds=3)
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
         for train, test in kfold]
Out[39]:
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
  • KFold(n, k): split into $k$ folds, train on $k - 1$ folds, and then test on the remaining fold
  • StratifiedKFold(y, k): preserves the class ratios / label distribution within each fold
  • LeaveOneOut(n): leave one observation out
  • LeaveOneLabelOut(labels): takes a label array to group observations
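
These generators (or a plain fold count) can also be passed to the helper cross_val_score, which computes the per-fold scores directly; a sketch:

cross_validation.cross_val_score(svc, X_digits, y_digits, cv=3)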

Unsupervised learning: Seeking new representations of the data

Clustering: grouping observations together

The problem solved by clustering: given the iris dataset, if we knew that there were 3 types of iris but did not have access to a taxonomist to label them, we could try a clustering task: split the observations into well-separated groups called clusters.

$k$-means clustering

Note that there exist a lot of different clustering criteria and associated algorithms. The simplest clustering algorithm is $k$-means.

In [40]:
from sklearn import cluster
X_iris = iris.data
y_iris = iris.target

k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris) 

print(k_means.labels_[::10])

print(y_iris[::10])
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

Warning: There is absolutely no guarantee of recovering a ground truth. First, choosing the right number of clusters is hard. Second, the algorithm is sensitive to initialization, and can fall into local minima, although scikit-learn employs several tricks to mitigate this issue. Don’t over-interpret clustering results.

In [41]:
from scipy import misc
lena = misc.lena()
plt.imshow(lena, cmap=plt.cm.gray)
Out[41]:
<matplotlib.image.AxesImage at 0x111a202d0>
In [42]:
X = lena.reshape((-1, 1)) # We need an (n_samples, n_features) array
k_means = cluster.KMeans(n_clusters=3, n_init=1)
k_means.fit(X) 

values = k_means.cluster_centers_.squeeze()
print(values)
labels = k_means.labels_
print(labels)
lena_compressed = np.choose(labels, values)
lena_compressed.shape = lena.shape
[ 132.26803407   65.87560281  185.92742083]
[2 2 2 ..., 0 0 0]
In [43]:
plt.imshow(lena_compressed, cmap=plt.cm.gray)
#lena_compressed
Out[43]:
<matplotlib.image.AxesImage at 0x111a31ed0>

Decompositions: From a signal to components and loadings

If $X$ is our multivariate data, then the problem that we are trying to solve is to rewrite it on a different observational basis: we want to learn loadings $L$ and a set of components $C$ such that $X = L C$. Different criteria exist to choose the components.

Principal component analysis (PCA)

PCA selects the successive components that explain the maximum variance in the signal. When used to transform data, PCA can reduce the dimensionality of the data by projecting on a principal subspace.

In [44]:
from sklearn.decomposition import PCA
# Create a signal with only 2 useful dimensions
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2 # 2D data embedded in 3D space
X = np.c_[x1, x2, x3]

pca = PCA()
pca.fit(X)
print(pca.explained_variance_)  

# As we can see, only the first 2 components are useful
pca.n_components = 2
X_reduced = pca.fit_transform(X)
X_reduced.shape
[  2.93977791e+00   1.12126485e+00   2.87710261e-32]
Out[44]:
(100, 2)
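
Since the discarded component carries (numerically) zero variance, the reduced representation loses essentially nothing; a sketch of the round trip:

X_approx = pca.inverse_transform(X_reduced) # back to 3 dimensions
np.allclose(X, X_approx)                    # True: x3 was exactly x1 + x2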

Linear Discriminant Analysis (LDA)

LDA tries to identify attributes that account for the most variance between classes. In particular, LDA, in contrast to PCA, is a supervised method that uses the known class labels.

Comparison of LDA and PCA 2D projections of the iris dataset:

In [45]:
from sklearn.lda import LDA
X = iris.data
y = iris.target

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

lda = LDA(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

# Create plots
target_names = iris.target_names
plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title("PCA of IRIS dataset")

plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title("LDA of IRIS dataset")
Out[45]:
<matplotlib.text.Text at 0x114cfb290>

There is so much more...