%pylab inline
import pylab as pl
import numpy as np
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.
Let's start with downloading the data using a scikit-learn utility function.
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
Let's introspect the image arrays to find their shapes (needed for plotting with matplotlib):
X = lfw_people.data
y = lfw_people.target
names = lfw_people.target_names
n_samples, n_features = X.shape
_, h, w = lfw_people.images.shape
n_classes = len(names)
print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
n_samples: 1288
n_features: 1850
n_classes: 7
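The images are available both as flat feature vectors in X and as 2D arrays in lfw_people.images. As a quick optional check (not shown in the outputs above), we can print the image height and width that plot_gallery will use to reshape each row of X back into a picture:

print("each image is {} x {} pixels".format(h, w))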
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())
plot_gallery(X, names[y], h, w)
Let's have a look at the distribution of samples among the target classes:
pl.figure(figsize=(14, 3))
y_unique = np.unique(y)
counts = [(y == i).sum() for i in y_unique]
pl.xticks(y_unique, names[y_unique])
locs, labels = pl.xticks()
pl.setp(labels, rotation=45, size=20)
_ = pl.bar(y_unique, counts)
Let's split the data into a development set and a final evaluation set.
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
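As an optional sanity check, we can print the sizes of the two splits to confirm the 75% / 25% partition:

print("train set: {} samples".format(X_train.shape[0]))
print("test set:  {} samples".format(X_test.shape[0]))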
To train a model we will first reduce the dimensionality of the original pictures by projecting them onto a 150-dimensional PCA space: this is unsupervised feature extraction.
from sklearn.decomposition import RandomizedPCA
n_components = 150
print "Extracting the top %d eigenfaces from %d faces" % (
n_components, X_train.shape[0])
pca = RandomizedPCA(n_components=n_components, whiten=True)
%time pca.fit(X_train)
eigenfaces = pca.components_.reshape((n_components, h, w))
Extracting the top 150 eigenfaces from 966 faces
CPU times: user 559 ms, sys: 69.4 ms, total: 629 ms
Wall time: 449 ms
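To get a feel for how much of the pixel variance the 150 components retain, we can sum their explained variance ratios. This is a minimal optional sketch, assuming the fitted RandomizedPCA exposes an explained_variance_ratio_ attribute like the regular PCA estimator:

print("variance captured by the first {} components: {:.1%}".format(
    n_components, pca.explained_variance_ratio_.sum()))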
Let's plot the gallery of the most significant eigenfaces:
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
Projecting the input data onto the eigenfaces orthonormal basis:
X_train_pca = pca.transform(X_train)
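As an optional check, printing the shapes before and after the projection makes the dimensionality reduction explicit (from 1850 pixel features down to 150 PCA features per face):

print("before projection: {}".format(X_train.shape))
print("after projection:  {}".format(X_train_pca.shape))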
Let's now train a kernel Support Vector Machine on the projected data. We perform an automated parameter search to find good values for gamma and C:
from sklearn.svm import SVC
svm = SVC(kernel='rbf', class_weight='auto')
svm
SVC(C=1.0, cache_size=200, class_weight='auto', coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1, probability=False, shrinking=True, tol=0.001, verbose=False)
Unfortunately an SVM is very sensitive to the parameters C and gamma, and it's very unlikely that the default parameters will yield a good predictive accuracy:
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
cv = StratifiedShuffleSplit(y_train, test_size=0.20, n_iter=3)
%time svm_cv_scores = cross_val_score(svm, X_train_pca, y_train, scoring='f1', cv=cv, n_jobs=2)
svm_cv_scores
CPU times: user 15.6 ms, sys: 21.8 ms, total: 37.4 ms
Wall time: 531 ms
array([ 0.73740893, 0.75845638, 0.74661801])
svm_cv_scores.mean(), svm_cv_scores.std()
(0.74749443841822638, 0.0086149047467813464)
Fortunately we can automate the search for the best combination of parameters:
from sklearn.grid_search import GridSearchCV
param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
clf = GridSearchCV(svm, param_grid, scoring='f1', cv=cv, n_jobs=2)
%time clf = clf.fit(X_train_pca, y_train)
print("Best estimator found by randomized hyper parameter search:")
print(clf.best_params_)
print("Best parameters validation score: {:.3f}".format(clf.best_score_))
CPU times: user 560 ms, sys: 187 ms, total: 747 ms
Wall time: 12.3 s
Best parameters found by the grid search:
{'C': 5000.0, 'gamma': 0.001}
Best parameters validation score: 0.809
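Beyond the single best combination, it can be instructive to inspect the mean validation score of every candidate in the grid. This is a small optional sketch, assuming the fitted GridSearchCV object exposes the grid_scores_ attribute (as in the scikit-learn version used here):

for params, mean_score, scores in sorted(clf.grid_scores_,
                                         key=lambda t: t[1], reverse=True):
    print("{:.3f} (+/-{:.3f}) for {}".format(mean_score, scores.std(), params))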
Let's start with a qualitative inspection of some of the predictions:
X_test_pca = pca.transform(X_test)
y_pred = clf.predict(X_test_pca)
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue: %s' % (pred_name, true_name)
prediction_titles = [title(y_pred, y_test, names, i)
                     for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=names))
                   precision    recall  f1-score   support

     Ariel Sharon       0.81      0.76      0.79        17
     Colin Powell       0.89      0.84      0.86        61
  Donald Rumsfeld       0.85      0.74      0.79        31
    George W Bush       0.90      0.96      0.93       134
Gerhard Schroeder       0.76      0.84      0.80        19
      Hugo Chavez       0.89      0.89      0.89        19
       Tony Blair       0.84      0.78      0.81        41

      avg / total       0.87      0.87      0.87       322
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred, labels=range(n_classes))
print(cm)
[[ 13   2   1   1   0   0   0]
 [  2  51   0   4   1   0   3]
 [  0   1  23   5   1   1   0]
 [  1   1   2 129   0   0   1]
 [  0   0   1   0  16   1   1]
 [  0   0   0   0   1  17   1]
 [  0   2   0   5   2   0  32]]
pl.gray()
_ = pl.imshow(cm, interpolation='nearest')
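The raw count plot is dominated by the largest class (George W Bush). Labelling the axes with the class names makes it easier to read; this is a small optional refinement of the plot above using only standard matplotlib calls:

pl.figure(figsize=(5, 5))
pl.imshow(cm, interpolation='nearest', cmap=pl.cm.gray)
pl.xticks(range(n_classes), names, rotation=45)
pl.yticks(range(n_classes), names)
pl.xlabel('predicted label')
pl.ylabel('true label')
pl.colorbar()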