In the previous tutorial on linear SVMs, we saw that training a model consists of defining a training objective and then minimizing it with respect to the model parameters, using either stochastic or batch gradient descent.
In this tutorial we will augment our SVM with internal layers of hidden units and turn our linear classifier into a multi-layer architecture. An MLP with one internal layer is sometimes called a shallow architecture, in contrast to an MLP with more internal layers which is sometimes called a deep architecture.
An MLP can be viewed as an SVM, where the input $\mathbf{x}$ is first transformed using a learnt non-linear vector-valued transformation $\Phi$. The purpose of this transformation is to map the input data into a space where it becomes linearly separable. The vector $\Phi(\mathbf{x})$ is referred to as a hidden layer.
This tutorial will again tackle the problem of MNIST digit classification.
import logging
import sys, time
import numpy as np
from skdata import mnist
import autodiff
import util
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import grid_search
%pylab inline
Populating the interactive namespace from numpy and matplotlib
## load data
dtype = 'float'
nexamples = 10000
nclasses = 10
img_shape = (28, 28)
train_idx = slice(0, nexamples)
test_idx = slice(nexamples, nexamples * 2)
data_view = mnist.views.OfficialImageClassification(x_dtype=dtype)
x = data_view.all_images[train_idx].reshape((nexamples, -1))
y = data_view.all_labels[train_idx]
x_test = data_view.all_images[test_idx].reshape((nexamples, -1))
y_test = data_view.all_labels[test_idx]
## hinge loss expects targets in {-1, +1}, hence neg_label=-1
y1 = preprocessing.LabelBinarizer(neg_label = -1).fit_transform(y)
## per-feature scaling, kept only for comparison (see the warning below)
x_normed = preprocessing.MinMaxScaler().fit_transform(x)
## instead, scale every pixel by the same constant
x = x / 255.
x_test = x_test / 255.
print x.shape, y.shape, x_normed.shape, y1.shape
"""
dtype = 'float'
nexamples = 10000
train_idx = slice(0, nexamples)
test_idx = slice(nexamples, nexamples*2)
img_shape = (28, 28)
nclasses = 10
v = mnist.views.OfficialVectorClassification()
x = v.all_vectors[train_idx].astype(dtype)
y = v.all_labels[train_idx]
x_test = v.all_vectors[test_idx].astype(dtype)
y_test = v.all_labels[test_idx]
## DONT USE MinMaxScaler or StandarizeScaler as they
## do normalization individually for each feature
#scaler = preprocessing.MinMaxScaler()
#scaler.fit(v.all_vectors.astype(dtype))
#x = scaler.transform(x)
#x_test = scaler.transform(x)
#x = x / 255.
#x_test = x_test / 255.
y1 = preprocessing.LabelBinarizer().fit_transform(y)
y1_test = preprocessing.LabelBinarizer().fit_transform(y_test)
"""
print x.min(), x.max(), x_test.min(), x_test.max()
(10000, 784) (10000,) (10000, 784) (10000, 10) 0.0 1.0 0.0 1.0
Typically $\Phi$ is taken to be a function $\mathbb{R}^{D_0} \rightarrow (-1, 1)^{D_1}$ of the form $$ \Phi(\mathbf{x}; V, \mathbf{c}) = \tanh(\mathbf{x}V + \mathbf{c}) $$ in which $V \in \mathbb{R}^{D_0 \times D_1}$ is called a weight matrix and $\mathbf{c} \in \mathbb{R}^{D_1}$ is called a bias vector. The integer $D_0$ is the number of elements in $\mathbf{x}$ (sometimes called the number of "input units"), and $D_1$ is the number of "hidden units" of the MLP. We abuse notation slightly by writing $\tanh(\mathbf{u})$ for a vector $\mathbf{u}$ to denote the vector of values $\tanh(u_i)$. Other non-linear scalar functions are sometimes used instead of tanh -- whichever one is used is called the activation function of the layer $\Phi$.
def tanh_layer(V, c, x):
"""
V: nin * nhid, a group of column vectors
c: nhid
x: nsample * nin
return: nsample * nhid
"""
return np.tanh(np.dot(x, V) + c)
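To make "linearly separable" concrete, here is a tiny hand-crafted example (the weights below are picked by hand, purely for illustration): the XOR labels are not linearly separable in the raw 2-D input space, but after one tanh layer they are.
## XOR demo with hand-picked weights (illustrative only)
xs = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
labels = np.array([-1, 1, 1, -1])       # XOR, encoded as -1/+1
V_xor = np.array([[1., 1.], [1., 1.]])
c_xor = np.array([-0.5, -1.5])
h = tanh_layer(V_xor, c_xor, xs)
## in hidden space, h[:,0] - h[:,1] is ~0.44 for class -1 and ~0.92
## for class +1, so a single threshold now separates the classes
print np.sign(h[:, 0] - h[:, 1] - 0.68) == labels   # all True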
def ova_svm_prediction(W, b, x):
    """One-vs-all linear SVM prediction: the class with the largest score."""
    return np.argmax(np.dot(x, W) + b, axis = 1)
def ova_svm_cost(W, b, x, y1):
    """Mean (over samples) hinge loss, summed over the one-vs-all
    classifiers; y1 (nsample * nclasses) holds the +1/-1 targets."""
    margin = y1 * (np.dot(x, W) + b)
    cost = hinge(margin).mean(axis = 0).sum()
    return cost
def hinge(margin):
    """Element-wise hinge loss: zero once the margin exceeds 1."""
    return np.maximum(0, 1 - margin)
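As a quick sanity check (the numbers below are illustrative toy values), the hinge loss should vanish for comfortably correct predictions and grow linearly as the margin turns negative; an all-zero classifier has margin 0 everywhere, so its cost is exactly nclasses.
## toy values: hinge is 0 above margin 1, linear below it
print hinge(np.array([2.0, 1.0, 0.5, -1.0]))   # [ 0.   0.   0.5  2. ]
## an all-zero classifier scores 0 for every class, so hinge = 1
## everywhere and the cost is 1 * nclasses = 10
W_zero = np.zeros((x.shape[1], nclasses), dtype=x.dtype)
b_zero = np.zeros(nclasses, dtype=x.dtype)
print ova_svm_cost(W_zero, b_zero, x, y1)      # 10.0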
When combined with an SVM classifier (hinge loss + L2 regularization), the full classification model can be written as $$ MLP(\mathbf{x}) = SVM(\Phi(\mathbf{x}; V, \mathbf{c}); W, \mathbf{b}) $$
A single hidden layer of this form is sufficient to make the MLP a universal approximator. However, we will see shortly that there are benefits to using many such hidden layers (deep learning). See these course notes for an introduction to MLPs, backpropagation, and how to train them.
def mlp_prediction(V, c, W, b, x):
"""
x (nsample * nin): input layer
V (nin * nhid), c (nhid): hidden layer
W (nhid * nout), b (nout): output layer
return nsample * nout
"""
h = tanh_layer(V, c, x)
return ova_svm_prediction(W, b, h)
def mlp_cost(V, c, W, b, x, y1, alpha = 0.001):
    """Hinge loss of the SVM on top of the tanh layer, plus an L2
    penalty on both weight matrices, weighted by alpha."""
    data_cost = ova_svm_cost(W, b, tanh_layer(V, c, x), y1)
    penalty = np.sum(V * V) + np.sum(W * W)
    return data_cost + alpha * penalty
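Before training, a small smoke test with tiny random weights (the 5-unit hidden layer below is made up for illustration) confirms that the pieces compose and return finite values.
## smoke test on a handful of examples
rng = np.random.RandomState(0)
V_t = rng.randn(x.shape[1], 5) * 0.01   # 5 hidden units, tiny weights
c_t = np.zeros(5)
W_t = rng.randn(5, nclasses) * 0.01
b_t = np.zeros(nclasses)
print mlp_prediction(V_t, c_t, W_t, b_t, x[:3])    # three predicted labels
print mlp_cost(V_t, c_t, W_t, b_t, x[:3], y1[:3])  # a finite scalar cost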
To train an MLP, we learn all parameters of the model $(W, \mathbf{b}, V, \mathbf{c})$ by gradient descent, just as we learned the parameters $W, \mathbf{b}$ previously when training the SVM.
The initial values for the weights of a hidden layer ($V$) should be uniformly sampled from a symmetric interval that depends on the activation function. For the tanh activation function results obtained in [Xavier10]_ show that the interval should be $$ \left[ -\sqrt{\frac{6}{D_0 + D_1}}, \sqrt{\frac{6}{D_0 + D_1}} \right] $$
For the logistic sigmoid function $1 / (1 + e^{-u})$ the interval is slightly different:
$$ \left[ -4\sqrt{\frac{6}{D_0 + D_1}}, 4\sqrt{\frac{6}{D_0 + D_1}} \right] $$
This initialization ensures that at least early in training, each neuron operates in a regime of its activation function where information can easily be propagated both upward (activations flowing from inputs to outputs) and backward (gradients flowing from outputs to inputs).
nin = np.product(img_shape)
nhid = 10*10   ## 100 hidden units (a 10x10 grid for the filter plots below)
## [Xavier10] initialization for the tanh hidden layer
V0 = np.random.uniform(low = -np.sqrt(6. / (nin+nhid)),
                       high = np.sqrt(6. / (nin+nhid)),
                       size = (nin, nhid)).astype(x.dtype)
c0 = np.zeros(nhid, dtype = x.dtype)
## SVM on the top
W0 = np.zeros((nhid, nclasses), dtype=x.dtype)
b0 = np.zeros(nclasses, dtype=x.dtype)
As before, the model can be trained by stochastic gradient descent on mini-batches, with the cost function extended to include a regularization term whose weight is a hyperparameter (the alpha argument of mlp_cost plays this role here). Below, however, we minimize the full-batch cost directly with L-BFGS, letting autodiff supply the gradients with respect to all four parameter arrays.
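For reference, a minimal mini-batch SGD loop might look like the following sketch; grad_fn is a hypothetical function returning the gradients of mlp_cost with respect to (V, c, W, b) on one batch (e.g. produced by an automatic-differentiation tool) and is not defined in this notebook.
def sgd_train(V, c, W, b, grad_fn, lr=0.01, nepochs=10, batch_size=100):
    """Sketch only: grad_fn(V, c, W, b, xb, y1b) is assumed to return
    the gradients of mlp_cost w.r.t. (V, c, W, b) on one mini-batch."""
    for epoch in range(nepochs):
        for start in range(0, x.shape[0], batch_size):
            xb = x[start:start + batch_size]
            y1b = y1[start:start + batch_size]
            gV, gc, gW, gb = grad_fn(V, c, W, b, xb, y1b)
            V, c = V - lr * gV, c - lr * gc
            W, b = W - lr * gW, b - lr * gb
    return V, c, W, b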
## L-BFGS minimization
#x = x_normed   ## uncomment to train on the per-feature-scaled inputs
def batch_cost(V, c, W, b):
    return mlp_cost(V, c, W, b, x, y1, alpha=0)
## minimize the full-batch cost; autodiff supplies the gradients
V, c, W, b = autodiff.optimize.fmin_l_bfgs_b(batch_cost,
                                             (V0, c0, W0, b0),
                                             maxfun = 20,
                                             m = 20,
                                             iprint = 1)
print 'final cost', batch_cost(V, c, W, b)
util.show_filters(V.T, img_shape, (10, 10))
figure()
util.show_filters(W.T, (10, 10), (2, 5))
final cost 0.808789949832
## test on held-out examples (x_test and y_test were loaded above)
yhat_test = mlp_prediction(V, c, W, b, x_test)
yhat_train = mlp_prediction(V, c, W, b, x)
print 'train accuracy:', np.mean(yhat_train == y)
print 'test accuracy:', np.mean(yhat_test == y_test)
train accuracy: 0.8689 test accuracy: 0.8448
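To see which digits get confused with which, a confusion matrix is a useful follow-up (here via sklearn.metrics.confusion_matrix; rows are true digits, columns predicted ones):
from sklearn.metrics import confusion_matrix
print confusion_matrix(y_test, yhat_test)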
There are several hyper-parameters in the above code, which are not (and, generally speaking, cannot be) optimized by gradient descent. The design of outer-loop algorithms for optimizing them is a topic of ongoing research. Over the last 25 years, researchers have devised various rules of thumb for choosing them. A very good overview of these tricks can be found in Efficient BackProp by Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller. Here, we summarize the same issues, with an emphasis on the parameters and techniques that we actually used in our code.
Which non-linear activation function should you use in a neural network? Two of the most common are the logistic sigmoid and the tanh function. For reasons explained in Section 4.4 of that paper, nonlinearities that are symmetric around the origin are preferred because they tend to produce zero-mean inputs to the next layer (a desirable property). Empirically, we have observed that tanh has better convergence properties.
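The two are in fact closely related: $\tanh(u) = 2\,\mathrm{sigmoid}(2u) - 1$, i.e. tanh is a sigmoid rescaled to be symmetric around the origin, which is what yields zero-mean outputs. A quick numerical check of the identity:
## tanh is a shifted, rescaled logistic sigmoid
u = np.linspace(-3, 3, 7)
sigmoid = lambda v: 1. / (1. + np.exp(-v))
print np.allclose(np.tanh(u), 2 * sigmoid(2 * u) - 1)   # True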
At initialization we want the weights to be small enough around the origin so that the activation function operates near its linear regime, where gradients are the largest. Otherwise, the gradient signal used for learning is attenuated by each layer as it is propagated from the classifier towards the inputs. Other desirable properties, especially for deep networks, are to conserve variance of the activation as well as variance of back-propagated gradients from layer to layer. This allows information to flow well upward and downward in the network and reduces discrepancies between layers. The initialization used above represents a good compromise between these two constraints. For mathematical considerations, please refer to [Xavier10]_.
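The effect is easy to see empirically. The sketch below (layer sizes and depth are made up for illustration) pushes fake zero-mean data through a stack of tanh layers and reports the fraction of saturated units: with the [Xavier10]_ interval almost none saturate, while weights an order of magnitude larger pin most units near $\pm 1$.
## saturation demo with made-up sizes (500 -> 500, 5 layers)
rng = np.random.RandomState(0)
nunits = 500
a0 = rng.randn(1000, nunits)                  # fake zero-mean inputs
for scale in [np.sqrt(6. / (nunits + nunits)), 1.0]:
    a = a0
    for layer in range(5):                    # 5 stacked tanh layers
        Vl = rng.uniform(-scale, scale, size=(nunits, nunits))
        a = np.tanh(np.dot(a, Vl))
    ## fraction of units driven into saturation (|activation| > 0.99)
    print scale, np.mean(np.abs(a) > 0.99)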
Optimization by stochastic gradient descent is very sensitive to the step size, or learning rate. There is a great deal of literature on how to choose the learning rate and how to change it during optimization. The simplest solution is to use a constant rate. Rule of thumb: try several log-spaced values ($10^{-1}, 10^{-2}, \ldots$) and narrow the (logarithmic) grid search to the region where you obtain the lowest validation error.
Decreasing the learning rate over time can help the model settle into a (local) minimum. One simple rule for doing so is $\frac{\mu_0}{1 + d\times t}$, where $\mu_0$ is the initial rate (chosen, perhaps, using the grid-search technique above), $d$ is a so-called "decrease constant" controlling how quickly the rate falls (typically a small positive number, $10^{-3}$ or smaller), and $t$ is the epoch number.
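Both heuristics are one-liners in numpy; the constants below are illustrative, not tuned:
## log-spaced candidates for the constant-rate grid search
print 10.0 ** -np.arange(1, 6)        # [ 1e-01 1e-02 1e-03 1e-04 1e-05 ]
## the 1/t decay schedule mu_0 / (1 + d*t)
mu0, d = 0.1, 1e-3
for t in [0, 100, 1000, 10000]:
    print t, mu0 / (1. + d * t)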
Section 4.7 details procedures for choosing a learning rate for each parameter (weight) in our network and for choosing them adaptively based on the error of the classifier.
The number of hidden units that gives the best results is dataset-dependent. Generally speaking, the more complicated the input distribution, the more capacity the network requires to model it, and so the larger the number of hidden units needed. (Note that the number of weights in a layer, perhaps a more direct measure of capacity, is $D_0 \times D_1$, where $D_0$ is the number of inputs and $D_1$ the number of hidden units.)
Unless we employ some regularization scheme (early stopping or L1/L2 penalties), a typical number of hidden units vs. generalization performance graph will be U-shaped.
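A straightforward way to explore this is to sweep a few sizes and watch held-out accuracy, reusing the L-BFGS setup from above. This is a sketch: the sizes and iteration counts are illustrative, and a proper experiment would use a separate validation split rather than the test set.
## sweep over the number of hidden units (illustrative sizes)
for nhid_try in [25, 100, 400]:
    r = np.sqrt(6. / (nin + nhid_try))
    V0_t = np.random.uniform(-r, r, size=(nin, nhid_try)).astype(x.dtype)
    c0_t = np.zeros(nhid_try, dtype=x.dtype)
    W0_t = np.zeros((nhid_try, nclasses), dtype=x.dtype)
    b0_t = np.zeros(nclasses, dtype=x.dtype)
    cost_t = lambda V, c, W, b: mlp_cost(V, c, W, b, x, y1, alpha=0)
    V_t, c_t, W_t, b_t = autodiff.optimize.fmin_l_bfgs_b(
        cost_t, (V0_t, c0_t, W0_t, b0_t), maxfun=20, m=20, iprint=1)
    print nhid_try, np.mean(mlp_prediction(V_t, c_t, W_t, b_t, x_test) == y_test)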
Typical values to try for the L1/L2 regularization parameter $\lambda$ are $10^{-2}, 10^{-3}, \ldots$. It can be useful to regularize the topmost layers in an MLP (closest to and including the classifier itself) to prevent them from overfitting noisy hidden layer features, and to encourage the features themselves to be more discriminative.
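For example, an L1 penalty on the output-layer weights only is a one-line change to mlp_cost; the l1 hyperparameter name below is illustrative, not defined elsewhere in this notebook.
def mlp_cost_l1(V, c, W, b, x, y1, alpha=0.001, l1=0.001):
    """Variant of mlp_cost: L2 penalty on the hidden-layer weights V,
    plus an (illustrative) L1 penalty on the top-layer weights W only."""
    data_cost = ova_svm_cost(W, b, tanh_layer(V, c, x), y1)
    return data_cost + alpha * np.sum(V * V) + l1 * np.sum(np.abs(W))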
This notebook has been adapted from the Deep Learning Tutorials: Convolutional Neural Networks.
Convolutional Neural Networks (ConvNets) are MLP variants specialized for image processing. From Hubel and Wiesel's early work on the cat's visual cortex [Hubel68], we know there exists a complex arrangement of cells within the visual cortex. These cells are sensitive to small sub-regions of the input space, called receptive fields, and are tiled so as to cover the entire visual field. They act as local filters over the input space and are thus well suited to exploit the strong spatially local correlation present in natural images.
Additionally, two basic cell types have been identified: simple cells (S) and complex cells (C). Simple cells (S) respond maximally to specific edge-like stimulus patterns within their receptive field. Complex cells (C) have larger receptive fields and are locally invariant to the exact position of the stimulus.