There are a lot of resources to help you do deep learning in Python. I've organized this page into sections:
There are several software packages to be aware of, which do not necessarily have Python bindings but which may grow to include them. You don't always have to use Python!
Torch7 - Deep Learning software based on C and Lua
CudaConvNet - Alex Krizhevsky's software for deep convolutional nets.
PyLearn2 - Deep Learning Framework in Python based on Theano.
Deep learning can take your research in many directions, and it's nice that Python has a lot of projects to help you on your way:
Python - the standard library is surprisingly comprehensive, get to know what's in it.
NumPy - the de facto standard n-dimensional array type for Python. Doing numerical work in Python, you will use it all the time.
SciPy - a library of many algorithms and utilities that are useful across a broad range of scientific work. I originally captioned the following "especially useful" but the list grew to include almost the entire library:
scipy.io - read and write various matrix formats including MATLAB
scipy.linalg - decompositions, inverses, etc.
scipy.ndimage - basic image processing
scipy.optimize - 1-D and N-D optimization algorithms
scipy.signal - signal processing
scipy.sparse - several sparse matrix types
scipy.special - special functions (e.g. erf)
scipy.stats - pdfs, cdfs, sampling algorithms, tests
IPython is a feature-rich Python interpreter.
Matplotlib is the most widely-used plotting library for Python. It provides an interface similar to that of MATLAB or R. It is sometimes annoying, and not as flashy as, say, d3, but it is the workhorse of data visualization in Python.
Theano - an array expression compiler, with automatic differentiation and behind-the-scenes GPU execution (a short example follows this list).
PyAutoDiff - provides automatic differentiation for code using NumPy. Currently this is a Theano front-end, but the API has been designed to keep Theano out of sight of client code.
pandas - data structures and analysis tools, mainly for working with time series. Includes R-like data structures (such as R's data frame).
Cython - a compiler for a more strongly-typed Python dialect, very useful for optimizing numeric Python code.
Copperhead - compiles natural Python syntax to GPU kernels
numexpr - compiles expression strings to perform loop-fused element-wise computations
scikit-learn - well-documented and well-tested implementations of many standard ML algorithms.
scikit-image - image processing (edge detection, color spaces, many standard algorithms)
skdata - data sets for machine learning
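Since Theano shows up throughout the tutorials below, here is a minimal sketch of what "array expression compiler with automatic differentiation" means in practice. It assumes Theano is installed; the variable names are just illustrative.
import theano
import theano.tensor as T
x = T.dvector('x')                      ## symbolic 1-D float64 array
cost = (x ** 2).sum()                   ## a symbolic scalar expression
grad = T.grad(cost, x)                  ## symbolic gradient, derived automatically
f = theano.function([x], [cost, grad])  ## compiled callable (optionally on the GPU)
print f([1.0, 2.0, 3.0])                ## -> [array(14.0), array([ 2.,  4.,  6.])]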
There are several great resources to help you get oriented to use Python for numeric scientific work.
Unix/Python/NumPy - This is a really high-level intro to Unix and Python. There is plenty of documentation for these things, and they're somewhat out of scope for these tutorials; by all means dig around the net for more background if you want to know more.
Deep Learning Tutorials provide code-centered introductions to Deep Learning algorithms and architectures. They make heavy use of Theano, and illustrate good practices of programming with Theano.
%pylab inline
Populating the interactive namespace from numpy and matplotlib
Running the following should show a picture of a boat, though it may first have to download the CIFAR-10 data set in order to do it. Once the data set has been downloaded, it is stored in ~/.skdata/cifar10 for re-use. Browse through the data set by changing the index into x, or modify the code fragment to show more than one picture at a time. We'll also be using the MNIST and the Google Street View House Numbers (SVHN) data sets.
from skdata.cifar10.views import OfficialImageClassificationTask
task = OfficialImageClassificationTask()
print 'training set size', task.train.x.shape
print 'Image shape:', task.train.x[8].shape
imshow(task.train.x[1000])
training set size (50000, 32, 32, 3) Image shape: (32, 32, 3)
<matplotlib.image.AxesImage at 0x1070db610>
from skdata.mnist.views import OfficialImageClassification
task = OfficialImageClassification()
print task.all_images.shape
imshow(task.all_images[1000].reshape((28, -1)))
(70000, 28, 28, 1)
<matplotlib.image.AxesImage at 0x116a5b950>
from skdata.svhn.view import CroppedDigitsView2
data_view = CroppedDigitsView2(x_dtype="float32", n_train=1000)
imshow(data_view.train.x[1].reshape(32,32,3))
<matplotlib.image.AxesImage at 0x116217190>
The linear SVM is a linear classifier parameterized by a matrix of weights $\mathbf{W}$ and a bias vector $\mathbf{b}$. We will develop the multiclass "one vs. all" (OVA) linear SVM, which is a concatenation of binary SVMs. We will suppose that our label set is the non-negative integers up to some maximum $L$.
There are other ways of setting up a multi-class SVM, such as using the Crammer-Singer loss or LaRank. Despite theoretical limitations (a lack of consistency), the approach is widely used. See John Langford's page on machine learning reductions for more perspectives on multi-class SVMs.
Deep architectures for classification are typically formed by classifying learned features with a linear classifier such as an SVM or logistic regression model. We will begin our exploration of deep architectures with the shallowest model of all: the linear SVM.
The Deep Learning Tutorials provide a good complementary introduction to logistic regression.
import logging
import sys, time
import numpy as np
from numpy import random
from skdata import mnist
import autodiff
dtype = 'float64' ## float64 for CPU-only (e.g. a Mac); use 'float32' for GPU
n_examples = 10000
n_classes = 10
img_shape = (28, 28)
data_view = mnist.views.OfficialImageClassification(x_dtype=dtype)
x = data_view.all_images[:n_examples].reshape((n_examples, -1))
y = data_view.all_labels[:n_examples]
print x.shape, y.shape
print x.min(), x.max()
(10000, 784) (10000,) 0.0 255.0
Supposing that our input is a vector $\mathbf{x}$, the prediction $l^*$ of the OVA-SVM is the argmax of the affine function
$ l^* = \operatorname*{argmax}_{l=0}^{L} (\mathbf{x W}_l + \mathbf{b}_l) $
So here $\mathbf{W}$ is a group of column vectors (the transpose of the design-matrix layout), and $\mathbf{x}$ is a group of row vectors (i.e. design-matrix format). One advantage of this layout is that the result is still in design-matrix format: different instances occupy different rows, and NumPy's row-wise broadcasting applies naturally. The point is to keep the main axis row-wise.
def ova_prediction(W, b, x):
"""
W: weight matrix, weights are column vectors
b: bias vector (one entry per class)
x: design matrix, instances are row vectors
"""
return np.argmax(np.dot(x, W) + b, axis = 1)
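As a quick sanity check (on random data, not the MNIST arrays loaded above; the shapes below are just illustrative):
rng = np.random.RandomState(0)
W_demo = rng.randn(784, 10)   ## nfeat x nlabel
b_demo = np.zeros(10)         ## nlabel
x_demo = rng.randn(5, 784)    ## nrow x nfeat
print ova_prediction(W_demo, b_demo, x_demo)   ## five predicted labels, each in 0..9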
*Training an SVM*
The most standard way of training an SVM is to maximize the margin on training data. In the two-class case, where $y \in \{-1,+1\}$, the margin is defined as $y\,(\mathbf{x W}+\mathbf{b})$.
The affine function parameterized by $W$ and $b$ defines a hyperplane $xW + b = 0$ that the SVM interprets as a decision surface. The margin measures how far $xW+b$ is from being on the wrong side of the decision surface: large positive values mean there is a safe distance, negative values mean that the decision surface is actually incorrect for the given $\mathbf{x}$.
The most standard sense in which SVMs maximize margin is via the hinge loss. The hinge loss of a margin value $u$ is defined as
$$\mathrm{hinge}(u) = \max(0, 1 - u)$$
This is also why the SVM prefers the $\{-1, +1\}$ encoding of class labels.
By minimizing the average hinge loss over the margins of the training examples, we minimize a tight convex upper bound on the misclassification rate (zero-one loss), and we can obtain a good binary classifier using fast algorithms for convex optimization.
def hinge(u):
"""
u: the SVM margin. It can be arbitrarily negative, but once it
exceeds 1 it no longer contributes to the loss; the value 1
marks the edge of the geometric margin. By contrast, the
zero-one error only cares whether the margin is positive.
"""
## use np.maximum to keep vector friendly
return np.maximum(0, 1 - u)
## plotting hinge
ugrid = np.arange(-5, 5, .1)
plot(ugrid, hinge(ugrid), label = 'hinge loss')
plot(ugrid, ugrid < 0, label = 'zero-one loss')
legend(loc="best")
<matplotlib.legend.Legend at 0x1173cd690>
Figure 7.5 from Chris Bishop's PRML book. The Hinge Loss $E(z) = max(0,1-z)$ is plotted in blue, the Log Loss in red, the Square Loss in green and the misclassification error in black.
*Multiclass SVM*
In the multiclass case, it is not clear what the "correct" margin definition should be, and several useful ones have been proposed.
In the OVA-SVM setting, the training objective is defined by $L$ independent SVMs. We will train an OVA-SVM by converting the integer-valued labels $y$ to a matrix $Y \in \{-1, +1\}^{N\times L}$ and training $L$ SVMs at once. The $L$ columns of $Y$ represent $L$ binary classification tasks. The $L$ columns of $W$ and the $L$ elements of $\mathbf{b}$ store the parameters of the $L$ SVMs.
As such, the unregularized training objective is $$ \mathcal{L}(\mathcal{D}; W, \mathbf{b}) = \frac{1}{N} \sum_{(\mathbf{x}^{(i)}, Y^{(i)}) \in \mathcal{D}} ~ \sum_{l=1}^{L} ~ \max \left( 0, 1 - Y^{(i)}_l (\mathbf{x}^{(i)} W_l + \mathbf{b}_l) \right) $$
## prepare a {-1, +1} version of the labels, denoted Y in the math
y11 = -1 * np.ones((y.shape[0], n_classes)).astype(dtype)
print dtype
y11[arange(len(y)), y] = 1
float64
## alternatively using sklearn.preprocessing.LabelBinarizer
from sklearn.preprocessing import LabelBinarizer
y1 = LabelBinarizer(neg_label=-1, pos_label=+1).fit_transform(y).astype(dtype)
assert all(y11 == y1)
def ova_svm_cost(W, b, x, y1):
"""
One vs. all linear SVM loss
x: nrow x nfeat
y1: nrow x nlabel
W: nfeat x nlabel
b: nlabel
x * w + b: nrow x nlabel
"""
margin = y1 * (np.dot(x, W) + b)
## mean among all instances
## sum over all L SVMs
cost = hinge(margin).mean(axis = 0).sum()
return cost
The training itself consists of minimizing the objective $\mathcal{L}(\mathcal{D}; W, b)$ w.r.t. $W$ and $b$. The criterion is convex, so it doesn't much matter where we start. Zero works well in practice.
## initialize the model
W = np.zeros((x.shape[1], n_classes), dtype = dtype)
b = np.zeros(n_classes, dtype = dtype)
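At this zero initialization every margin is 0, so every hinge term is $\max(0, 1-0) = 1$ and the cost should come out to exactly n_classes. A quick check:
print ova_svm_cost(W, b, x, y1)   ## expect 10.0 at the zero initialization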
There are many specialized SVM solvers available, but they tend to be most helpful for non-linear SVMs. In the linear case a simple combination of stochastic gradient descent (SGD) and BFGS can be very effective.
** SGD not implemented yet in latest autodiff **
## Do n online loops through the data set doing SGD
## This can be faster at the beginning than L-BFGS
tic = time.time()
batch_size = 1
n_epochs = 1
n_batches = n_examples / batch_size
W, b = autodiff.fmin_sgd(ova_svm_cost, (W, b),
streams={
'x': x.reshape((n_batches, batch_size, x.shape[1])),
'y1': y1.reshape((n_batches, batch_size, y1.shape[1]))},
loops=n_epochs,
stepsize=0.001,
print_interval=1000,
)
AttributeError: 'module' object has no attribute 'fmin_sgd'
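Since this autodiff build does not provide fmin_sgd, here is a minimal plain-NumPy stand-in (a sketch of the same idea, not the autodiff API): it takes SGD steps along the subgradient of ova_svm_cost, using the same step size and single pass over the data as the call above.
## subgradient of ova_svm_cost: a hinge term is active wherever its margin < 1
def ova_svm_grad(W, b, x, y1):
    margin = y1 * (np.dot(x, W) + b)                    ## nrow x nlabel
    coef = np.where(margin < 1, -y1, 0.0) / x.shape[0]  ## nrow x nlabel
    return np.dot(x.T, coef), coef.sum(axis=0)          ## gradients w.r.t. W and b

for epoch in range(n_epochs):
    for i in range(n_examples):
        gW, gb = ova_svm_grad(W, b, x[i:i + 1], y1[i:i + 1])
        W -= 0.001 * gW
        b -= 0.001 * gb
print 'sgd (numpy sketch) final cost', ova_svm_cost(W, b, x, y1)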
## Using sklearn.linear_model.SGDClassifier
W = np.zeros((x.shape[1], n_classes), dtype = dtype)
b = np.zeros(n_classes, dtype = dtype)
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
## normalize input
## MinMaxScaler seems to produce results closer to those of the
## autodiff-based version in this tutorial, whereas StandardScaler
## produces noticeably different results
#x_tile = StandardScaler().fit_transform(x)
x_tile = MinMaxScaler().fit_transform(x)
sgd = SGDClassifier(loss = 'hinge', alpha = 0,
learning_rate = "constant", eta0 = 0.001,
n_iter = 1,
n_jobs=-1, verbose = 0)
sgd.fit(x_tile, y)
W = sgd.coef_.T
b = sgd.intercept_
print 'sgd final cost', ova_svm_cost(W, b, x_tile, y1), sgd.score(x_tile, y)
import util
util.show_filters(W.T, img_shape, (2, 5))
sgd final cost 0.878557437033 0.8733
## L-BFGS implementation
W = np.zeros((x.shape[1], n_classes), dtype = dtype)
b = np.zeros(n_classes, dtype = dtype)
print x.dtype, y1.dtype, W.dtype, b.dtype
def batch_cost(W, b):
return ova_svm_cost(W, b, x, y1)
W, b = autodiff.optimize.fmin_l_bfgs_b(batch_cost,
(W, b),
maxfun = 20,
m=20,
iprint = 1)
print 'final cost', batch_cost(W, b)
import util
util.show_filters(W.T, img_shape, (2, 5))
float64 float64 float64 float64 final cost 0.90519276046
*Test The SVM*
Once a classifier has been trained, we can test its generalization accuracy on the test set: we make predictions for the examples in the test set and count up the number of classification mistakes.
train_error = np.mean(y!=ova_prediction(W, b, x))
print train_error
x_test = data_view.all_images[10000:20000, :].reshape((10000, -1))
y_test = data_view.all_labels[10000:20000]
test_error = np.mean(y_test != ova_prediction(W, b, x_test))
print test_error
0.1375 0.1589
*SVM Summary*
The linear Support Vector Machine (SVM) is an accurate and quick-to-train linear classifier. We looked at a multiclass variant called a "One vs. All" SVM, parameterized by a matrix $W$ of weights and a vector $\mathbf{b}$ of biases. A linear SVM can be trained by gradient descent; whether SGD or L-BFGS works faster depends on the size of the data set and the accuracy desired from the minimization process. A few iterations of SGD followed by refinement with L-BFGS is a fairly robust strategy.
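A rough sketch of that combined strategy, reusing ova_svm_grad from the NumPy SGD sketch above and the same fmin_l_bfgs_b call as before (again just a sketch, with an arbitrary maxfun):
W = np.zeros((x.shape[1], n_classes), dtype = dtype)
b = np.zeros(n_classes, dtype = dtype)
for i in range(n_examples):   ## one quick SGD pass for a warm start
    gW, gb = ova_svm_grad(W, b, x[i:i + 1], y1[i:i + 1])
    W -= 0.001 * gW
    b -= 0.001 * gb
W, b = autodiff.optimize.fmin_l_bfgs_b(batch_cost, (W, b), maxfun = 10, m = 20, iprint = 1)
print 'combined final cost', batch_cost(W, b)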
*** Exercise: L1 and L2 regularization ***
Typically linear SVMs are L2-regularized. That means that in addition to $\mathcal{L}$, we minimize the sum of the squared elements of $W$:
$$ \mathcal{L}_{\ell_2}(\mathcal{D}; W, b) = \mathcal{L}(\mathcal{D}; W, b) + \alpha ||W||_2^2 $$
def ova_svm_cost_l2(W, b, x, y1, alpha = 0.001):
margin = y1 * (np.dot(x, W) + b)
#penalty = np.apply_along_axis(np.linalg.norm, 0, W).mean()
penalty = (W*W).mean(axis = 0).sum()
return hinge(margin).mean(axis = 0).sum() + alpha * penalty
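For the L1 half of the exercise, a possible variant along the same lines (the penalty mirrors the per-feature averaging used in the L2 version above; alpha is again the regularization strength):
def ova_svm_cost_l1(W, b, x, y1, alpha = 0.001):
    ## L1 penalty: absolute weights, averaged over features and summed over classes
    margin = y1 * (np.dot(x, W) + b)
    penalty = np.abs(W).mean(axis = 0).sum()
    return hinge(margin).mean(axis = 0).sum() + alpha * penalty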
*Summary of SVM*
%pylab inline
def plot_weights(W, img_shape, layout, cmap=cm.gray):
"""
W: rows are different images
"""
nrows, ncols = layout
fig, axes = subplots(nrows = nrows, ncols = ncols,
figsize=(1.5*ncols, 1.5*nrows))
axes = axes.flatten()
for (i, im) in enumerate(W):
axes[i].imshow(im.reshape(img_shape), cmap)
#plot_weights(W.T, img_shape, (2, 5))
#figure()
#util.show_filters(W.T, img_shape, (2, 5))
Populating the interactive namespace from numpy and matplotlib
from sklearn.grid_search import RandomizedSearchCV
from sklearn.linear_model import SGDClassifier
searcher = RandomizedSearchCV(SGDClassifier(n_jobs=1, n_iter=1, ),
{'loss': ['hinge', 'log', 'huber'],
'penalty': ["l1", "l2", "elasticnet"],
'alpha': np.logspace(-5, 3, 9),
'l1_ratio': np.arange(0.1, 0.2, 0.02)},
n_jobs = 1)
searcher.fit(x_tile, y)
sgd = searcher.best_estimator_
print sgd
print sgd.score(x_tile, y)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
#W = StandardScaler().fit_transform(sgd.coef_.T).T
plot_weights(sgd.coef_, img_shape, (2, 5))
SGDClassifier(alpha=0.01, class_weight=None, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.14, learning_rate=optimal, loss=log, n_iter=1, n_jobs=1, penalty=l2, power_t=0.5, random_state=None, rho=None, shuffle=False, verbose=0, warm_start=False) 0.8893
*Using Linear SVM (SGD version) on other datasets*
from skdata.cifar10.views import OfficialImageClassificationTask
from sklearn import preprocessing
from sklearn import grid_search
from sklearn import linear_model
from skimage.color import colorconv
task = OfficialImageClassificationTask()
x = task.train.x
y = task.train.y
x_hue = np.asarray([colorconv.rgb2hsv(img) for img in x])
x = x.reshape((50000, -1)).astype('float')
x_hue = x_hue.reshape((50000, -1)).astype('float')
print x.shape, y.shape, x_hue.shape
x_tile = preprocessing.MinMaxScaler().fit_transform(x)
x_hue_tile = preprocessing.MinMaxScaler().fit_transform(x_hue)
(50000, 3072) (50000,) (50000, 3072)
## for raw RGB pixel
params = {
'loss': ['hinge', 'huber']
, 'penalty': ['l1', 'l2', 'elasticnet']
, 'alpha': np.logspace(-5, 4, 10)
, 'l1_ratio': np.arange(0.1, 0.5, 0.1)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(),
param_distributions = params)
searcher.fit(x_tile, y)
sgd = searcher.best_estimator_
print sgd.score(x_tile, y)
gray()
plot_weights(preprocessing.MinMaxScaler().fit_transform(sgd.coef_),
(32, 32, 3), (2, 5))
0.2955
<matplotlib.figure.Figure at 0x114a503d0>
## for HSV pixel
params = {
'loss': ['hinge', 'huber']
, 'penalty': ['l1', 'l2', 'elasticnet']
, 'alpha': np.logspace(-5, 4, 10)
, 'l1_ratio': np.arange(0.1, 0.5, 0.1)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(),
param_distributions = params)
searcher.fit(x_hue_tile, y)
sgd = searcher.best_estimator_
print sgd.score(x_hue_tile, y)
W = sgd.coef_.reshape((10, 32, 32, 3))
#W = [colorconv.hsv2rgb(im).reshape((32*32*3, -1)) for im in W]
plot_weights(preprocessing.MinMaxScaler().fit_transform(W),
(32, 32, 3), (2, 5), cmap=cm.hsv_r)
0.35232