The auto-encoder (AE) is a classic unsupervised algorithm for non-linear dimensionality reduction. The de-noising auto-encoder (DAE) is an extension of the auto-encoder introduced as a building block for deep networks in [Vincent08]. This tutorial introduces the autoencoder as an unsupervised feature learning algorithm for the MNIST data set, and then develops the denoising autoencoder. See section 4.6 of [Bengio09]_ for an overview of auto-encoders.
Generally speaking, an autoencoder maps vector input $\mathbf{x}$ to some intermediate representation $\mathbf{y}$ and then back into the original space, in such a way that the end-point is close to $\mathbf{x}$. While the goal thus defined is simply to learn an identity function, things get interesting when the mappings are parameterized or constrained in such a way that a general identity function is either impossible to represent or at least difficult to learn from data. When this is the case, the goal of learning is to learn a special-purpose identity function that typically works for vectors $\mathbf{x}$ that we care about, which come from some empirically interesting distribution. The $\mathbf{y}$ vector that comes out of this process contains all the important information about $x$ in a new and potentially useful way.
In our tutorial here we will deal with vectors $\mathbf{x}$ that come from the MNIST data set of hand-written digits. Examples from MNIST are vectors $\mathbf{x} \in [0,1]^N$, and we will look at the classic one-layer sigmoidal autoencoder parameterized by:
which encodes vectors $\mathbf{x}$ into $\mathbf{y} \in [0,1]^M$ by the deterministic mapping
$$ \mathbf{h} = s(\mathbf{x}\mathbf{W} + \mathbf{b}) $$Where $s$ is a poinwise sigmoidal function $s(u) = 1 / (1 + e^{-u})$. The latent representation $\mathbf{h}$, or code is then mapped back (decoded) into reconstruction $\mathbf{z}$ through the similar transformation
$$\mathbf{z} = s(\mathbf{h}\mathbf{V} + \mathbf{c}) $$Training an autoencoder consists of minimizing a reconstruction cost, taking $\mathbf{z}$ as the reconstruction of $\mathbf{x}$. Intuitively, we want a model that reconstructs the important aspects of $\mathbf{x}$ so that we guarantee that these aspects are preserved by $\mathbf{y}$. Typically, mathematical formalizations of this intuition are not particularly exotic, but they should, at the very least, respect the domain and range of $\mathbf{x}$ and $\mathbf{z}$ respectively. So for example, if $\mathbf{x}$ were a real-valued element of $\mathbb{R}^N$ then we might formalize the reconstruction cost as $|| \mathbf{x} - \mathbf{z} ||^2$, but there is no mathematical requirement to use the $\ell_2$ norm or indeed to use a valid norm at all.
For our MNIST data, we will suppose that $\mathbf{x}$ is a vector Bernoulli probabilities, and define the reconstruction cost to be the cross-entropy betwen $\mathbf{z}$ and $\mathbf{x}$:
$$ L(\mathbf{x}, \mathbf{z}) = - \sum^d_{k=1}[\mathbf{x}_k \log \mathbf{z}_k + (1 - \mathbf{x}_k)\log(1 - \mathbf{z}_k)] $$Training an autoencoder is an optimization problem that is typically treated in two steps:
Initialization The parameters of this particular model ($W, \mathbf{b}, V, \mathbf{c}$) must be initialized so that symmetry is broken between the columns of $W$ (also the rows of $V$), otherwise the gradient on each column will be identical and local search will be completely ineffective.
It is customary to initialize $W$ to small random values to break symmetry, and to initialize the rest of the parameters to 0.
Local Search
The training criterion $L$ defined above is a deterministic and differentiable function of parameters ($W, b, V, c$) so any multi-dimensional minimization algorithm can do the job. The speed and accuracy of a the method generally depends on the size nature of the data set, the parameters and parameterization of the model, and of course the sophistication of the minimization algorithm.
A classic, and still widely useful, algorithm is stochastic gradient descent.
Another powerful minimization method is the well-known (L-BFGS) algorithm. When used as a training algorithm, it is called a batch algorithm because it does not deal well with stochastic functions, and therefore requires that we evaluate the total loss over all training examples on each cost function evaluation.
L-BFGS can be run right away from the random initialization,
but often what works better is to do a few loops of fmin_sgd
to move the initial random parameters to a promising area of the search space, and then to refine that solution with L-BFGS.
That's it, we've trained an auto-encoder.
An autoencoder is a function that maps a vector to approximately itself, by passing it through intermediate values in such a way that the pure identify function is difficult or impossible. An autoencoder is a non-stochastic pre-cursur to several more modern probabilistic models such as De-Noising AutoEncoders, Sparse Coding, , Restricted Boltzmann Machines and even clustering. They are a flexible class of feature extraction algorithms that can be quick to train by either stochastic gradient descent, or more sophisticated minimization algorithms such as L-BFGS.
Spend some time running through the exercises below to get a better feel for how autoencoders work and what they can do.
One serious potential issue with auto-encoders is that if there is no other constraint besides minimizing the reconstruction error, then an auto-encoder with $n$ inputs and an encoding of dimension at least $n$ could potentially just learn the identity function, and fail to differentiate test examples (from the training distribution) from other input configurations. Surprisingly, experiments reported in [Bengio07]_ nonetheless suggest that in practice, when trained with stochastic gradient descent, non-linear auto-encoders with more hidden units than inputs (called overcomplete) yield useful representations (in the sense of classification error measured on a network taking this representation in input). A simple explanation is based on the observation that stochastic gradient descent with early stopping is similar to an L2 regularization of the parameters. To achieve perfect reconstruction of continuous inputs, a one-hidden layer auto-encoder with non-linear hidden units (exactly like in the above code) needs very small weights in the first (encoding) layer (to bring the non-linearity of the hidden units in their linear regime) and very large weights in the second (decoding) layer. With binary inputs, very large weights are also needed to completely minimize the reconstruction error. Since the implicit or explicit regularization makes it difficult to reach large-weight solutions, the optimization algorithm finds encodings which only work well for examples similar to those in the training set, which is what we want. It means that the representation is exploiting statistical regularities present in the training set, rather than learning to replicate the identity function.
## setup
from functools import partial
import logging, sys, time
import numpy as np
from skdata import mnist, cifar10, svhn
import autodiff
import util
from sklearn import preprocessing
from sklearn import neural_network
from sklearn import pipeline
from sklearn import linear_model
from sklearn import grid_search
/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead DeprecationWarning) /usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead DeprecationWarning) /usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead DeprecationWarning) /usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead DeprecationWarning) /usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead DeprecationWarning) /usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead DeprecationWarning) /usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead DeprecationWarning) /usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead DeprecationWarning)
## data
dtype = 'float'
nexamples = 10000
train_idx = slice(0, nexamples)
test_idx = slice(nexamples, nexamples*2)
img_shape = (28, 28)
v = mnist.views.OfficialImageClassification()
x = v.all_images[train_idx].reshape((nexamples, -1)).astype(dtype)
y = v.all_labels[train_idx]
x_test = v.all_images[test_idx].reshape((nexamples, -1)).astype(dtype)
y_test = v.all_labels[test_idx]
## DONT USE MinMaxScaler or StandarizeScaler as they
## do normalization individually for each feature
#scaler = preprocessing.MinMaxScaler()
#scaler.fit(v.all_vectors.astype(dtype))
#x = scaler.transform(x)
#x_test = scaler.transform(x)
x = x / 255.
x_test = x_test / 255.
y1 = preprocessing.LabelBinarizer().fit_transform(y)
y1_test = preprocessing.LabelBinarizer().fit_transform(y_test)
print x.shape, y.shape, x_test.shape, y_test.shape, y1.shape, y1_test.shape
print x.max(), x.min(), x_test.max(), x_test.min()
(10000, 784) (10000,) (10000, 784) (10000,) (10000, 10) (10000, 10) 1.0 0.0 1.0 0.0
## helper function
def logistic(u):
return 1. / (1 + np.exp(-u))
def autoencoder(W, b, V, c, x):
## using V = W.T is called "tied weights"
## using separate V is called "untied weights"
h = logistic(np.dot(x, W) + b)
z = logistic(np.dot(h, V) + c)
return z, h
## cost function of Bernoulli autoencoder - cross entropy
def cross_entropy(x, z):
"""x, z: Bernoulli probability vectos
"""
return -np.sum(x*np.log(z) + (1-x)*np.log(1-z), axis = 1)
def training_cost(W, b, V, c, x):
"""
Always return sum along different features
Always return mean along different rows
"""
z, h = autoencoder(W, b, V, c, x)
L = cross_entropy(x, z).mean()
return L
## initialization
## It is customary to initialize $W$ to small random values to break symmetry,
## and to initialize the rest of the parameters to 0.
## for logitstic function, it should be 4*sqrt(6 / (nin+nhid))
## for tanh function, it should be sqrt(6 / (nin+nhid) )
## we use logistic regression here simply because we need the output
## to be in [0, 1] instead of [-1, 1] (like for SVM-hinge case)
hid_shape = (10, 10)
nhid = np.prod(hid_shape)
nin = np.prod(img_shape)
W0 = np.random.uniform(low = -4.*np.sqrt(6. / (nin+nhid)),
high = 4.*np.sqrt(6. / (nin+nhid)),
size = (nin, nhid)).astype(dtype)
V0 = np.zeros_like(W0.T, dtype=dtype)
b0 = np.zeros(nhid, dtype = dtype)
c0 = np.zeros(nin, dtype=dtype)
## batch training with l-bfgs-b
def batch_cost(W, b, V, c):
"""
make it tied
"""
return training_cost(W, b, W.T, c, x)
W, b, V, c = autodiff.optimize.fmin_l_bfgs_b(
batch_cost,
(W0, b0, V0, c0),
maxfun = 20, ## how many calls to allow
m=20, ## how well to approximate the Hessian
iprint = 1
)
%pylab inline
figure(figsize=(12, 12))
util.show_filters(W.T, img_shape, hid_shape)
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['dtype', 'logistic'] `%pylab --no-import-all` prevents importing * from pylab and numpy
## use learned features for a SGDClassifier
_, feats = autoencoder(W, b, V, c, x)
print feats.shape
X = x
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(loss = 'hinge',
penalty = 'l2'),
params, )
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))
(10000, 100) precision recall f1-score support 0 0.94 0.98 0.96 1001 1 0.92 0.95 0.94 1127 2 0.90 0.88 0.89 991 3 0.90 0.88 0.89 1032 4 0.91 0.91 0.91 980 5 0.89 0.83 0.86 863 6 0.93 0.96 0.94 1014 7 0.95 0.91 0.93 1070 8 0.83 0.87 0.85 944 9 0.87 0.88 0.88 978 avg / total 0.91 0.91 0.91 10000
## use learned features for a SGDClassifier
_, feats = autoencoder(W, b, V, c, x)
print feats.shape
X = feats
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(loss = 'hinge',
penalty = 'l2'),
params, )
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))
(10000, 100) precision recall f1-score support 0 0.88 0.97 0.93 1001 1 0.92 0.96 0.94 1127 2 0.88 0.86 0.87 991 3 0.87 0.84 0.85 1032 4 0.89 0.87 0.88 980 5 0.88 0.68 0.77 863 6 0.91 0.94 0.92 1014 7 0.93 0.89 0.91 1070 8 0.76 0.86 0.81 944 9 0.83 0.84 0.83 978 avg / total 0.88 0.88 0.87 10000
There are different ways that an auto-encoder with more hidden units than inputs could be prevented from learning the identity, and still capture something useful about the input in its hidden representation. One is the addition of sparsity (forcing many of the hidden units to be zero or near-zero), and it has been exploited very successfully by many [Ranzato07] [Lee08]. One of the most extreme forms of sparsity in autoencoders is known as the K-means algorithm. In the K-means model, the latent code hid is a vector with one 1 and the rest all 0. K-means is typically solved using a very efficient batch EM algorithm, but we can also tackle it with the same tools we've been using thus far. How does this compare with sklearn's K-means solver?
from sklearn.metrics import pairwise
from scipy import optimize
dtype = 'float64'
def softmax(x):
"""
Return the softmax of each row in x
see http://ufldl.stanford.edu/wiki/index.php/Exercise:Softmax_Regression
for how to avoid overflow
"""
xx = x - x.max(axis=1)[:, np.newaxis]
ex = np.exp(xx)
return ex / ex.sum(axis = 1)[:, np.newaxis]
def edist2(X, Y):
"""
Return all-pairs squared distance betwee rows of X and Y
"""
#return pairwise.euclidean_distances(X, Y)
XX = np.sum(X*X, axis = 1)[:, newaxis]
YY = np.sum(Y*Y, axis = 1)[newaxis, :]
distances = np.maximum(XX - 2 * dot(X, Y.T) + YY, 0)
return distances
def kmeans_real_x(w, x):
"""
I like the thinking that everything is like a deep learning model
w: group of column vectors, which are centers of clusters
"""
#w = w.reshape((x.shape[1], -1))
xw = edist2(x, w.T)
## hard winner
hid = (xw == xw.min(axis=1)[:, newaxis]).astype(dtype)
## soft winner
hid = hid + softmax(xw)
## recall x
x_rec = np.dot(hid, w.T)
## always sum over features and average over instances
cost = ((x-x_rec) ** 2).sum(axis = 1).mean()
return cost
"""
W = optimize.fmin_l_bfgs_b(partial(kmeans_real_x, x = x),
W0,
approx_grad = True,
maxfun=20,
iprint = 1)
"""
def kmeans_cost(w):
return kmeans_real_x(w, x)
W = autodiff.optimize.fmin_l_bfgs_b(kmeans_cost,
init_args = (W0, ),
maxfun = 20,
iprint = 1,
m = 20)
util.show_filters(W.T, img_shape, hid_shape)
## use learned features for a SGDClassifier
xw = edist2(x, W.T)
## hard winner
hid = (xw == xw.min(axis=1)[:, newaxis]).astype(dtype)
## soft winner
hid = hid + softmax(xw)
feats = hid
print feats.shape
X = feats
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(loss = 'hinge',
penalty = 'l2'),
params, )
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))
(10000, 100) precision recall f1-score support 0 0.39 0.92 0.54 1001 1 0.46 0.94 0.61 1127 2 0.50 0.32 0.39 991 3 0.35 0.10 0.15 1032 4 0.54 0.28 0.37 980 5 0.34 0.18 0.23 863 6 0.56 0.33 0.41 1014 7 0.46 0.75 0.57 1070 8 0.33 0.34 0.33 944 9 0.51 0.06 0.11 978 avg / total 0.44 0.43 0.38 10000
## use sklearn kmeans
from sklearn.cluster import KMeans
## learn kmeans
kmeans = KMeans(n_clusters = 100, n_jobs = 1).fit(x)
## transform to center distance
xdist = kmeans.transform(x)
## average dist for each cluster
mu = xdist.mean(axis = 0)
feats = np.maximum(mu-xdist, 0)
W = kmeans.cluster_centers_
util.show_filters(W, img_shape, hid_shape)
print feats.shape
X = feats
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(loss = 'hinge',
penalty = 'l2'),
params, )
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))
(10000, 100) precision recall f1-score support 0 0.94 0.97 0.96 1001 1 0.97 0.94 0.96 1127 2 0.83 0.92 0.87 991 3 0.88 0.87 0.88 1032 4 0.93 0.81 0.87 980 5 0.88 0.87 0.88 863 6 0.93 0.95 0.94 1014 7 0.95 0.89 0.92 1070 8 0.81 0.85 0.83 944 9 0.83 0.85 0.84 978 avg / total 0.90 0.90 0.90 10000
The Contrastive Divergence algorithm for training Restricted Boltzmann Machines (RBMs) looks a lot like an auto-encoder from an algorithmic perspective, especially the denoising autoencoder. Tempting though it is to use the autoencoder as a unifying framework, it is a stretch to call an RBM an auto-encoder. Mathematically, it is better to train an RBM by Persistent CD (aka Stochastic Maximum Likelihood) and these methods actually work on the basis of ensuring that the RBM does not perfectly reconstruct any training instances. Nevertheless, the following code implements the CD algorithm in a way that makes a clear connection to the other autoencoder-style models above.
One important difference with the cases above is that CD is not in fact gradient descent on a cost function and although this code makes use of a difference in free energies as a cost function, it is just a trick.
Consequently, you'll see that the so-called "cost" values regularly jump both up and down even when the algorithm is working perfectly well. Computation of the likelihood of an RBM involves an infamously intractable partition function -- there are techniques for estimating it (see Ruslan Salakhutdinov's PhD thesis), but no tutorial here yet.
## DIY implementation
dtype = 'float64'
def logistic(u):
"""
Return logistic sigmoid of float or ndarray u
"""
return 1. / (1. + np.exp(-u))
def rbm_binary_x(w, hidbias, visbias, x):
## FORWARD - POSITIVE
## feed forward calculation
hid = logistic(np.dot(x, w) + hidbias)
## hid_sample is like a probability of being active
hid_sample = (hid > np.random.rand(*hid.shape)).astype(dtype)
## BACKWARD - NEGATIVE
## NB model is not actually trained to reconstruct x
x_rec = logistic(np.dot(hid_sample, w.T) + visbias)
x_rec_sample = (x_rec > np.random.rand(*x_rec.shape)).astype(dtype)
## negative phase hidden unit expectation
## FORWARD - POSITIVE
## BUT WHY DO WE NEED THIS HERE?
hid_rec = logistic(np.dot(x_rec_sample, w) + hidbias)
def free_energy(xx):
xw_b = np.dot(xx, w) + hidbias
return -np.log(1+np.exp(xw_b)).sum(axis = 1) - np.dot(xx, visbias)
## cost as difference of energy
cost = free_energy(x) - free_energy(x_rec_sample)
return cost.mean()
def rbm_binary_cost(W, hidbas, visbias):
return rbm_binary_x(W, b, c, x)
W, b, c = autodiff.optimize.fmin_l_bfgs_b(
rbm_binary_cost,
init_args = (W0, b0, c0),
maxfun = 20,
m = 20,
iprint = 1)
## test the feature performance with SGDClassifier
feats = logistic(np.dot(x, W) + b)
X = feats
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(
linear_model.SGDClassifier(loss='hinge',
penalty='l2'),
params)
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))
util.show_filters(W.T, img_shape, hid_shape)
figure()
util.show_filters(feats[:100], hid_shape, (10, 10))
precision recall f1-score support 0 0.84 0.98 0.90 1001 1 0.95 0.95 0.95 1127 2 0.74 0.88 0.81 991 3 0.88 0.81 0.84 1032 4 0.93 0.75 0.83 980 5 0.64 0.84 0.73 863 6 0.85 0.95 0.90 1014 7 0.97 0.75 0.85 1070 8 0.89 0.73 0.80 944 9 0.83 0.79 0.81 978 avg / total 0.86 0.84 0.85 10000
## BUILT-in RBM
from sklearn import neural_network
rbm = neural_network.BernoulliRBM(n_components=100).fit(x)
feats = rbm.transform(x)
X = feats
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(
linear_model.SGDClassifier(loss='hinge',
penalty='l2'),
params)
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))
figure(figsize = (12, 12))
util.show_filters(rbm.components_, img_shape, hid_shape)
precision recall f1-score support 0 0.91 0.96 0.93 1001 1 0.97 0.96 0.97 1127 2 0.93 0.87 0.90 991 3 0.87 0.88 0.88 1032 4 0.94 0.87 0.90 980 5 0.94 0.78 0.85 863 6 0.91 0.97 0.94 1014 7 0.96 0.88 0.92 1070 8 0.78 0.91 0.84 944 9 0.82 0.88 0.85 978 avg / total 0.90 0.90 0.90 10000
Another variation on the autoencoder theme is to add noise to the input by zeroing out some of the elements randomly. This is known as the denoising autoencoder (Vincent et al., XXX) and it has been shown to be a form of score-matching (probabilistic model interpretation) and a good feature extraction algorithm.
def logistic(u):
return 1. / (1. + np.exp(-u))
def cross_entropy(x, y):
"""
x, y are Bernoulli probability vectors
or matrices with row vectors
"""
cost = -(x * np.log(y) + (1-x) * np.log(1-y)).sum(axis = 1)
return cost
def denoising_autoencoder_binary_x(W, b, c, x):
## corrupting the niput by zero-ing out some values randomly
snr = 0.3
## it seems here is the only difference between
## a denoising autoencoder and a normal one
noisy_x = x * (rand(*x.shape) > snr)
noisy_x = noisy_x.astype(dtype)
hid = logistic(np.dot(x, W) + b)
## tied weights implementation
x_rec = logistic(np.dot(hid, W.T) + c)
cost = cross_entropy(x, x_rec)
return cost.mean()
def denoising_autoencoder_cost(W, b, c):
return denoising_autoencoder_binary_x(W, b, c, x)
W, b, c = autodiff.optimize.fmin_l_bfgs_b(
denoising_autoencoder_cost,
init_args = (W0, b0, c0),
m = 20,
maxfun = 20,
iprint = 1)
figure(figsize = (10, 10))
util.show_filters(W.T, img_shape, hid_shape)
## performance of denoising auto encoder
snr = 0.3
x_noisy = x * (np.random.rand(*x.shape) > snr)
#x_noisy = x
feats = logistic(np.dot(x_noisy, W) + b)
X = feats
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(
linear_model.SGDClassifier(loss='hinge',
penalty='l2'),
params)
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))
precision recall f1-score support 0 0.89 0.97 0.93 1001 1 0.92 0.96 0.94 1127 2 0.87 0.85 0.86 991 3 0.85 0.84 0.84 1032 4 0.89 0.83 0.86 980 5 0.89 0.75 0.81 863 6 0.91 0.95 0.93 1014 7 0.92 0.87 0.89 1070 8 0.79 0.85 0.82 944 9 0.80 0.84 0.82 978 avg / total 0.87 0.87 0.87 10000
efficient and has only one hyperparameter (number of features) to tune
In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution.
Instead, it optimizes a simple cost function - the sparsity of $l_2$-normalized features.
It scales to high dimensional inputs and can be used to learn meaningful features in additional layers with greedy layer-wise stacking.
The main difference of sparse-filtering and other unsupervised learning algorithms: traditionally feature learning methods have largely sought to learn models that provide good approximations of the true data distribution; e.g., denoising autoencoder, RBM, some verions of ICA, and sparse coding among others.
Instead of modelling the data distribution directly, SF focues on several key points of useful features, namely, population sparsity, lifetime sparsity, and high disperal. It can be used directly with off-the-shelf optimizers such as l-bgfs-b.
Essentially, distribution modelling based methods can be viewed as generating particular feature distributions. For instance, sparse coding represents each example using a few non-zero coefficients (features). A feature distribution oriented approach can provide insights into designing new algorithms based on optimizing for desirable properties of the feature distribution.
Popularity Sparsity (Sparse Features per example): Each example should be represented by only a few active (non-zero) features. For example, an image can be represented by a description of the objects in it, and while there are many possible objects that can appear, only a few are typically present at a single time.
Lifetime Sparsity (Sparse Features across examples): Features should be discriminative and allow us to distinguish examples; thus, each feature should ba active for a few examples.
High Dispersal (Uniform Activity Distribution): For each featutre, the distribution should have similiar statistics to every other feature; no one feature should have significantly more activity than the other features.While high dispersal is not necessarily the feature for a good representation of the feature space, we found that enforcing this property can prevent degenerate situations in which the same features are always active. For Overcomplete representations, high disperal translates to having fewer inactive features. As an example, PCA codes do not generally satisfy high disperal since the codes that correspond to the largest eigenvalues are almost always active
In an implementation language, assume a design matrix X, where rows are instances and columns are features*
Popularity Sparsity: each row (one example) should be sparse (with a small number of active elements)
Lifetime Sparsity: each column (one feature) should be sparse (with a small number of active elements)
High Dispersal: the mean squared activations of each feature ($l_2$ norm) of all features should be roughly the same .
Many feature learning algorithms include these objectives. For example, the sparse RBM [6] works by constraining the expected activation of a feature (over its lifetime) to be close to a target value.
ICA [9, 10] has constraints (e.g., each basis has unit norm) that normalize each feature, and further optimizes for the lifetime sparsity of the features it learns. Sparse autoencoders [16] also explicitly optimize for lifetime sparsity.
On the other hand, clustering-based methods such as k-means [17] can be seen as enforcing an extreme form of population sparsity where each cluster centroid corresponds to a feature and only one feature is allowed to be active per example. “Triangle” activation functions, which essentially serve to ensure population sparsity, have also been shown to obtain good classification results [17].
Sparse coding [11] is also typically seen as enforcing population sparsity. In this work, we use the feature distribution view to derive a simple feature learning algorithm that solely optimizes for population sparsity while enforcing high dispersal. In our experiments, we found that realizing these two properties was sufficient to allow us to learn overcomplete represen- tations; we also argue later that these two properties are jointly sufficient to ensure lifetime sparsity.
*Explainations*
## Sparse Filtering with soft-absolute function
from numpy.linalg import norm
def soft_absolute(u):
"""
approximation to |u|
"""
epsilon = 1e-8
return np.sqrt(epsilon + u*u)
def sparse_filtering_x_cost(W, b, x):
"""
W: nin * nout
b: nout
x: nsample * nin
y: nsample * nin
"""
y = soft_absolute(np.dot(x, W) + b)
#y = logistic(np.dot(x, W) + b)
## normalize each feature (column-wise)
yy = y / np.sqrt(np.sum(y*y, axis=0) + 1e-8)
## normalize each instance (row-wise)
yyy = yy / np.sqrt(np.sum(yy*yy, axis = 1) + 1e-8)[:,np.newaxis]
## l1 penalty along each instance (row-wise)
cost = np.abs(yyy).sum()
return cost
def sparse_filtering_cost(W, b):
return sparse_filtering_x_cost(W, b, x)
W, b = autodiff.optimize.fmin_l_bfgs_b(sparse_filtering_cost,
#(np.random.randn(nin, nhid), np.random.randn(nhid)),
(W0, b0),
maxfun = 20,
m=20,
iprint = 1)
figure(figsize=(10, 10))
util.show_filters(W.T, img_shape, hid_shape)
f = soft_absolute(np.dot(x, W) + b)
ff = f / np.sqrt(np.sum(f*f, axis=0))
fff = ff / np.sqrt(np.sum(ff*ff, axis = 1))[:,np.newaxis]
feats = fff
X = feats
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(
linear_model.SGDClassifier(loss='hinge',
penalty='l2'),
params)
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))
precision recall f1-score support 0 0.91 0.98 0.94 1001 1 0.97 0.96 0.97 1127 2 0.87 0.92 0.89 991 3 0.80 0.86 0.83 1032 4 0.91 0.87 0.89 980 5 0.93 0.76 0.84 863 6 0.95 0.95 0.95 1014 7 0.94 0.89 0.92 1070 8 0.82 0.85 0.84 944 9 0.84 0.86 0.85 978 avg / total 0.90 0.89 0.89 10000
f = soft_absolute(np.dot(x_test, W) + b)
ff = f / np.sqrt(np.sum(f*f, axis=0))
fff = ff / np.sqrt(np.sum(ff*ff, axis = 1))[:,np.newaxis]
feats = fff
X_test = feats
params = {
'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(
linear_model.SGDClassifier(loss='hinge',
penalty='l2'),
params)
searcher.fit(X_test, y_test)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y_test, sgd.predict(X_test))
precision recall f1-score support 0 0.86 0.93 0.89 993 1 0.90 0.98 0.94 1154 2 0.85 0.83 0.84 938 3 0.87 0.77 0.82 1044 4 0.79 0.81 0.80 965 5 0.83 0.84 0.83 912 6 0.89 0.93 0.91 957 7 0.88 0.91 0.90 1023 8 0.80 0.79 0.80 978 9 0.80 0.71 0.75 1036 avg / total 0.85 0.85 0.85 10000