Notebook

Auto-Encoders and Denoising Auto-Encoders¶

The auto-encoder (AE) is a classic unsupervised algorithm for non-linear dimensionality reduction. The de-noising auto-encoder (DAE) is an extension of the auto-encoder introduced as a building block for deep networks in [Vincent08]. This tutorial introduces the autoencoder as an unsupervised feature learning algorithm for the MNIST data set, and then develops the denoising autoencoder. See section 4.6 of [Bengio09]_ for an overview of auto-encoders.

Autoencoders¶

Generally speaking, an autoencoder maps vector input $\mathbf{x}$ to some intermediate representation $\mathbf{y}$ and then back into the original space, in such a way that the end-point is close to $\mathbf{x}$. While the goal thus defined is simply to learn an identity function, things get interesting when the mappings are parameterized or constrained in such a way that a general identity function is either impossible to represent or at least difficult to learn from data. When this is the case, the goal of learning is to learn a special-purpose identity function that typically works for vectors $\mathbf{x}$ that we care about, which come from some empirically interesting distribution. The $\mathbf{y}$ vector that comes out of this process contains all the important information about $x$ in a new and potentially useful way.

In our tutorial here we will deal with vectors $\mathbf{x}$ that come from the MNIST data set of hand-written digits. Examples from MNIST are vectors $\mathbf{x} \in [0,1]^N$, and we will look at the classic one-layer sigmoidal autoencoder parameterized by:

matrix $W \in \mathbb{R}^{N \times M}$ - the weights
vector $\mathbf{b} \in \mathbb{R}^M$ - the bias
matrix $V \in \mathbb{R}^{M \times N}$ - the reconstruction weights
vector $\mathbf{c} \in \mathbb{R}^N$ - the recontruction bias

which encodes vectors $\mathbf{x}$ into $\mathbf{y} \in [0,1]^M$ by the deterministic mapping

$$ \mathbf{h} = s(\mathbf{x}\mathbf{W} + \mathbf{b}) $$

Where $s$ is a poinwise sigmoidal function $s(u) = 1 / (1 + e^{-u})$. The latent representation $\mathbf{h}$, or code is then mapped back (decoded) into reconstruction $\mathbf{z}$ through the similar transformation

$$\mathbf{z} = s(\mathbf{h}\mathbf{V} + \mathbf{c}) $$

Tied vs Untied weights discussion

Training an Autoencoder¶

Training an autoencoder consists of minimizing a reconstruction cost, taking $\mathbf{z}$ as the reconstruction of $\mathbf{x}$. Intuitively, we want a model that reconstructs the important aspects of $\mathbf{x}$ so that we guarantee that these aspects are preserved by $\mathbf{y}$. Typically, mathematical formalizations of this intuition are not particularly exotic, but they should, at the very least, respect the domain and range of $\mathbf{x}$ and $\mathbf{z}$ respectively. So for example, if $\mathbf{x}$ were a real-valued element of $\mathbb{R}^N$ then we might formalize the reconstruction cost as $|| \mathbf{x} - \mathbf{z} ||^2$, but there is no mathematical requirement to use the $\ell_2$ norm or indeed to use a valid norm at all.

For our MNIST data, we will suppose that $\mathbf{x}$ is a vector Bernoulli probabilities, and define the reconstruction cost to be the cross-entropy betwen $\mathbf{z}$ and $\mathbf{x}$:

$$ L(\mathbf{x}, \mathbf{z}) = - \sum^d_{k=1}[\mathbf{x}_k \log \mathbf{z}_k + (1 - \mathbf{x}_k)\log(1 - \mathbf{z}_k)] $$

Training an autoencoder is an optimization problem that is typically treated in two steps:

Initialization - choosing a starting point for gradient-based search,
Local Search - running a gradient-based optimization algorithm

Initialization The parameters of this particular model ($W, \mathbf{b}, V, \mathbf{c}$) must be initialized so that symmetry is broken between the columns of $W$ (also the rows of $V$), otherwise the gradient on each column will be identical and local search will be completely ineffective.

It is customary to initialize $W$ to small random values to break symmetry, and to initialize the rest of the parameters to 0.

Local Search

The training criterion $L$ defined above is a deterministic and differentiable function of parameters ($W, b, V, c$) so any multi-dimensional minimization algorithm can do the job. The speed and accuracy of a the method generally depends on the size nature of the data set, the parameters and parameterization of the model, and of course the sophistication of the minimization algorithm.

A classic, and still widely useful, algorithm is stochastic gradient descent.

Another powerful minimization method is the well-known (L-BFGS) algorithm. When used as a training algorithm, it is called a batch algorithm because it does not deal well with stochastic functions, and therefore requires that we evaluate the total loss over all training examples on each cost function evaluation.

L-BFGS can be run right away from the random initialization, but often what works better is to do a few loops of fmin_sgd to move the initial random parameters to a promising area of the search space, and then to refine that solution with L-BFGS.

That's it, we've trained an auto-encoder.

Autoencoder Summary¶

An autoencoder is a function that maps a vector to approximately itself, by passing it through intermediate values in such a way that the pure identify function is difficult or impossible. An autoencoder is a non-stochastic pre-cursur to several more modern probabilistic models such as De-Noising AutoEncoders, Sparse Coding, , Restricted Boltzmann Machines and even clustering. They are a flexible class of feature extraction algorithms that can be quick to train by either stochastic gradient descent, or more sophisticated minimization algorithms such as L-BFGS.

Spend some time running through the exercises below to get a better feel for how autoencoders work and what they can do.

Capacity Regularization via L2 Regularization¶

One serious potential issue with auto-encoders is that if there is no other constraint besides minimizing the reconstruction error, then an auto-encoder with $n$ inputs and an encoding of dimension at least $n$ could potentially just learn the identity function, and fail to differentiate test examples (from the training distribution) from other input configurations. Surprisingly, experiments reported in [Bengio07]_ nonetheless suggest that in practice, when trained with stochastic gradient descent, non-linear auto-encoders with more hidden units than inputs (called overcomplete) yield useful representations (in the sense of classification error measured on a network taking this representation in input). A simple explanation is based on the observation that stochastic gradient descent with early stopping is similar to an L2 regularization of the parameters. To achieve perfect reconstruction of continuous inputs, a one-hidden layer auto-encoder with non-linear hidden units (exactly like in the above code) needs very small weights in the first (encoding) layer (to bring the non-linearity of the hidden units in their linear regime) and very large weights in the second (decoding) layer. With binary inputs, very large weights are also needed to completely minimize the reconstruction error. Since the implicit or explicit regularization makes it difficult to reach large-weight solutions, the optimization algorithm finds encodings which only work well for examples similar to those in the training set, which is what we want. It means that the representation is exploiting statistical regularities present in the training set, rather than learning to replicate the identity function.

In [1]:

## setup
from functools import partial
import logging, sys, time
import numpy as np
from skdata import mnist, cifar10, svhn
import autodiff
import util
from sklearn import preprocessing
from sklearn import neural_network
from sklearn import pipeline
from sklearn import linear_model
from sklearn import grid_search

/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
/usr/local/lib/python2.7/dist-packages/scipy/lib/_util.py:34: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)

In [2]:

## data
dtype = 'float'
nexamples = 10000
train_idx = slice(0, nexamples)
test_idx = slice(nexamples, nexamples*2)
img_shape = (28, 28)

v = mnist.views.OfficialImageClassification()
x = v.all_images[train_idx].reshape((nexamples, -1)).astype(dtype)
y = v.all_labels[train_idx]
x_test = v.all_images[test_idx].reshape((nexamples, -1)).astype(dtype)
y_test = v.all_labels[test_idx]

## DONT USE MinMaxScaler or StandarizeScaler as they 
## do normalization individually for each feature
#scaler = preprocessing.MinMaxScaler()
#scaler.fit(v.all_vectors.astype(dtype))
#x = scaler.transform(x)
#x_test = scaler.transform(x)
x = x / 255.
x_test = x_test / 255.
y1 = preprocessing.LabelBinarizer().fit_transform(y)
y1_test = preprocessing.LabelBinarizer().fit_transform(y_test)

print x.shape, y.shape, x_test.shape, y_test.shape, y1.shape, y1_test.shape
print x.max(), x.min(), x_test.max(), x_test.min()

(10000, 784) (10000,) (10000, 784) (10000,) (10000, 10) (10000, 10)
1.0 0.0 1.0 0.0

In [3]:

## helper function
def logistic(u):
    return 1. / (1 + np.exp(-u))

def autoencoder(W, b, V, c, x):
    ## using V = W.T is called "tied weights"
    ## using separate V is called "untied weights"
    h = logistic(np.dot(x, W) + b)
    z = logistic(np.dot(h, V) + c)
    return z, h

In [4]:

## cost function of Bernoulli autoencoder - cross entropy
def cross_entropy(x, z):
    """x, z: Bernoulli probability vectos
    """
    return -np.sum(x*np.log(z) + (1-x)*np.log(1-z), axis = 1)

def training_cost(W, b, V, c, x):
    """
    Always return sum along different features
    Always return mean along different rows
    """
    z, h = autoencoder(W, b, V, c, x)
    L = cross_entropy(x, z).mean()
    return L

In [5]:

## initialization
## It is customary to initialize $W$ to small random values to break symmetry,
## and to initialize the rest of the parameters to 0.
## for logitstic function, it should be 4*sqrt(6 / (nin+nhid))
## for tanh function, it should be sqrt(6 / (nin+nhid) )
## we use logistic regression here simply because we need the output
## to be in [0, 1] instead of [-1, 1] (like for SVM-hinge case)

hid_shape = (10, 10)
nhid = np.prod(hid_shape)
nin = np.prod(img_shape)

W0 = np.random.uniform(low = -4.*np.sqrt(6. / (nin+nhid)),
                      high = 4.*np.sqrt(6. / (nin+nhid)),
                      size = (nin, nhid)).astype(dtype)
V0 = np.zeros_like(W0.T, dtype=dtype)
b0 = np.zeros(nhid, dtype = dtype)
c0 = np.zeros(nin, dtype=dtype)

In [6]:

## batch training with l-bfgs-b
def batch_cost(W, b, V, c):
    """
    make it tied
    """
    return training_cost(W, b, W.T, c, x)

W, b, V, c = autodiff.optimize.fmin_l_bfgs_b(
        batch_cost, 
        (W0, b0, V0, c0),
        maxfun = 20, ## how many calls to allow
        m=20,        ## how well to approximate the Hessian
        iprint = 1
)

In [7]:

%pylab inline
figure(figsize=(12, 12))
util.show_filters(W.T, img_shape, hid_shape)

Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['dtype', 'logistic']
`%pylab --no-import-all` prevents importing * from pylab and numpy

In [8]:

## use learned features for a SGDClassifier
_, feats = autoencoder(W, b, V, c, x)
print feats.shape

X = x

params = {
    'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(loss = 'hinge', 
                                                                     penalty = 'l2'), 
                                          params, )
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))

(10000, 100)
             precision    recall  f1-score   support

          0       0.94      0.98      0.96      1001
          1       0.92      0.95      0.94      1127
          2       0.90      0.88      0.89       991
          3       0.90      0.88      0.89      1032
          4       0.91      0.91      0.91       980
          5       0.89      0.83      0.86       863
          6       0.93      0.96      0.94      1014
          7       0.95      0.91      0.93      1070
          8       0.83      0.87      0.85       944
          9       0.87      0.88      0.88       978

avg / total       0.91      0.91      0.91     10000

In [9]:

## use learned features for a SGDClassifier
_, feats = autoencoder(W, b, V, c, x)
print feats.shape

X = feats

params = {
    'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(loss = 'hinge', 
                                                                     penalty = 'l2'), 
                                          params, )
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))

(10000, 100)
             precision    recall  f1-score   support

          0       0.88      0.97      0.93      1001
          1       0.92      0.96      0.94      1127
          2       0.88      0.86      0.87       991
          3       0.87      0.84      0.85      1032
          4       0.89      0.87      0.88       980
          5       0.88      0.68      0.77       863
          6       0.91      0.94      0.92      1014
          7       0.93      0.89      0.91      1070
          8       0.76      0.86      0.81       944
          9       0.83      0.84      0.83       978

avg / total       0.88      0.88      0.87     10000

Sparsity, K-means¶

There are different ways that an auto-encoder with more hidden units than inputs could be prevented from learning the identity, and still capture something useful about the input in its hidden representation. One is the addition of sparsity (forcing many of the hidden units to be zero or near-zero), and it has been exploited very successfully by many [Ranzato07] [Lee08]. One of the most extreme forms of sparsity in autoencoders is known as the K-means algorithm. In the K-means model, the latent code hid is a vector with one 1 and the rest all 0. K-means is typically solved using a very efficient batch EM algorithm, but we can also tackle it with the same tools we've been using thus far. How does this compare with sklearn's K-means solver?

In [10]:

from sklearn.metrics import pairwise
from scipy import optimize
dtype = 'float64'
def softmax(x):
    """
    Return the softmax of each row in x
    see http://ufldl.stanford.edu/wiki/index.php/Exercise:Softmax_Regression
    for how to avoid overflow
    """
    xx = x - x.max(axis=1)[:, np.newaxis]
    ex = np.exp(xx)
    return ex / ex.sum(axis = 1)[:, np.newaxis]

def edist2(X, Y):
    """
    Return all-pairs squared distance betwee rows of X and Y
    """
    #return pairwise.euclidean_distances(X, Y)
    XX = np.sum(X*X, axis = 1)[:, newaxis]
    YY = np.sum(Y*Y, axis = 1)[newaxis, :]
    distances = np.maximum(XX - 2 * dot(X, Y.T) + YY, 0)
    return distances
    
def kmeans_real_x(w, x):
    """
    I like the thinking that everything is like a deep learning model
    w: group of column vectors, which are centers of clusters
    """
    #w = w.reshape((x.shape[1], -1))
    xw = edist2(x, w.T)
    ## hard winner
    hid = (xw == xw.min(axis=1)[:, newaxis]).astype(dtype)
    ## soft winner
    hid = hid + softmax(xw)
    ## recall x
    x_rec = np.dot(hid, w.T)
    ## always sum over features and average over instances
    cost = ((x-x_rec) ** 2).sum(axis = 1).mean()
    return cost

"""
W = optimize.fmin_l_bfgs_b(partial(kmeans_real_x, x = x), 
                           W0, 
                           approx_grad = True,
                           maxfun=20, 
                           iprint = 1)
"""

def kmeans_cost(w):
    return kmeans_real_x(w, x)

W  = autodiff.optimize.fmin_l_bfgs_b(kmeans_cost, 
                                    init_args = (W0, ),
                                    maxfun = 20, 
                                    iprint = 1,
                                    m = 20)

In [11]:

util.show_filters(W.T, img_shape, hid_shape)
## use learned features for a SGDClassifier

xw = edist2(x, W.T)
## hard winner
hid = (xw == xw.min(axis=1)[:, newaxis]).astype(dtype)
## soft winner
hid = hid + softmax(xw)
feats = hid
print feats.shape

X = feats

params = {
    'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(loss = 'hinge', 
                                                                     penalty = 'l2'), 
                                          params, )
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))

(10000, 100)
             precision    recall  f1-score   support

          0       0.39      0.92      0.54      1001
          1       0.46      0.94      0.61      1127
          2       0.50      0.32      0.39       991
          3       0.35      0.10      0.15      1032
          4       0.54      0.28      0.37       980
          5       0.34      0.18      0.23       863
          6       0.56      0.33      0.41      1014
          7       0.46      0.75      0.57      1070
          8       0.33      0.34      0.33       944
          9       0.51      0.06      0.11       978

avg / total       0.44      0.43      0.38     10000

In [12]:

## use sklearn kmeans
from sklearn.cluster import KMeans

## learn kmeans
kmeans = KMeans(n_clusters = 100, n_jobs = 1).fit(x)
## transform to center distance
xdist = kmeans.transform(x)
## average dist for each cluster
mu = xdist.mean(axis = 0)
feats = np.maximum(mu-xdist, 0)

In [13]:

W = kmeans.cluster_centers_
util.show_filters(W, img_shape, hid_shape)

In [14]:

print feats.shape

X = feats

params = {
    'alpha': np.logspace(-5, 4, 10)
}
searcher = grid_search.RandomizedSearchCV(linear_model.SGDClassifier(loss = 'hinge', 
                                                                     penalty = 'l2'), 
                                          params, )
searcher.fit(X, y)
sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))

(10000, 100)
             precision    recall  f1-score   support

          0       0.94      0.97      0.96      1001
          1       0.97      0.94      0.96      1127
          2       0.83      0.92      0.87       991
          3       0.88      0.87      0.88      1032
          4       0.93      0.81      0.87       980
          5       0.88      0.87      0.88       863
          6       0.93      0.95      0.94      1014
          7       0.95      0.89      0.92      1070
          8       0.81      0.85      0.83       944
          9       0.83      0.85      0.84       978

avg / total       0.90      0.90      0.90     10000

Bernoulli RBM -- Not really an autoencoder¶

The Contrastive Divergence algorithm for training Restricted Boltzmann Machines (RBMs) looks a lot like an auto-encoder from an algorithmic perspective, especially the denoising autoencoder. Tempting though it is to use the autoencoder as a unifying framework, it is a stretch to call an RBM an auto-encoder. Mathematically, it is better to train an RBM by Persistent CD (aka Stochastic Maximum Likelihood) and these methods actually work on the basis of ensuring that the RBM does not perfectly reconstruct any training instances. Nevertheless, the following code implements the CD algorithm in a way that makes a clear connection to the other autoencoder-style models above.

One important difference with the cases above is that CD is not in fact gradient descent on a cost function and although this code makes use of a difference in free energies as a cost function, it is just a trick.

Consequently, you'll see that the so-called "cost" values regularly jump both up and down even when the algorithm is working perfectly well. Computation of the likelihood of an RBM involves an infamously intractable partition function -- there are techniques for estimating it (see Ruslan Salakhutdinov's PhD thesis), but no tutorial here yet.

In [15]:

## DIY implementation

dtype = 'float64'
def logistic(u):
    """
    Return logistic sigmoid of float or ndarray u
    """
    return 1. / (1. + np.exp(-u))

def rbm_binary_x(w, hidbias, visbias, x):
    ## FORWARD - POSITIVE
    ## feed forward calculation
    hid = logistic(np.dot(x, w) + hidbias)
    ## hid_sample is like a probability of being active
    hid_sample = (hid > np.random.rand(*hid.shape)).astype(dtype)
    ## BACKWARD - NEGATIVE
    ## NB model is not actually trained to reconstruct x
    x_rec = logistic(np.dot(hid_sample, w.T) + visbias)
    x_rec_sample = (x_rec > np.random.rand(*x_rec.shape)).astype(dtype)
    ## negative phase hidden unit expectation
    ## FORWARD - POSITIVE
    ## BUT WHY DO WE NEED THIS HERE?
    hid_rec = logistic(np.dot(x_rec_sample, w) + hidbias)
    
    def free_energy(xx):
        xw_b = np.dot(xx, w) + hidbias
        return -np.log(1+np.exp(xw_b)).sum(axis = 1) - np.dot(xx, visbias)
    ## cost as difference of energy
    cost = free_energy(x) - free_energy(x_rec_sample)
    return cost.mean()

def rbm_binary_cost(W, hidbas, visbias):
    return rbm_binary_x(W, b, c, x)

W, b, c = autodiff.optimize.fmin_l_bfgs_b(
                        rbm_binary_cost, 
                        init_args = (W0, b0, c0),
                        maxfun = 20,
                        m = 20,
                        iprint = 1)

In [16]:

## test the feature performance with SGDClassifier
feats = logistic(np.dot(x, W) + b)
X = feats

params = {
    'alpha': np.logspace(-5, 4, 10)
}

searcher = grid_search.RandomizedSearchCV(
                    linear_model.SGDClassifier(loss='hinge', 
                                                penalty='l2'), 
                    params)
searcher.fit(X, y)

sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))

util.show_filters(W.T, img_shape, hid_shape)
figure()
util.show_filters(feats[:100], hid_shape, (10, 10))

             precision    recall  f1-score   support

          0       0.84      0.98      0.90      1001
          1       0.95      0.95      0.95      1127
          2       0.74      0.88      0.81       991
          3       0.88      0.81      0.84      1032
          4       0.93      0.75      0.83       980
          5       0.64      0.84      0.73       863
          6       0.85      0.95      0.90      1014
          7       0.97      0.75      0.85      1070
          8       0.89      0.73      0.80       944
          9       0.83      0.79      0.81       978

avg / total       0.86      0.84      0.85     10000

In [17]:

## BUILT-in RBM
from sklearn import neural_network

rbm = neural_network.BernoulliRBM(n_components=100).fit(x)

In [18]:

feats = rbm.transform(x)
X = feats

params = {
    'alpha': np.logspace(-5, 4, 10)
}

searcher = grid_search.RandomizedSearchCV(
                    linear_model.SGDClassifier(loss='hinge', 
                                                penalty='l2'), 
                    params)
searcher.fit(X, y)

sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))

figure(figsize = (12, 12))
util.show_filters(rbm.components_, img_shape, hid_shape)

             precision    recall  f1-score   support

          0       0.91      0.96      0.93      1001
          1       0.97      0.96      0.97      1127
          2       0.93      0.87      0.90       991
          3       0.87      0.88      0.88      1032
          4       0.94      0.87      0.90       980
          5       0.94      0.78      0.85       863
          6       0.91      0.97      0.94      1014
          7       0.96      0.88      0.92      1070
          8       0.78      0.91      0.84       944
          9       0.82      0.88      0.85       978

avg / total       0.90      0.90      0.90     10000

Sparse coding (FISTA)¶

Denoising AutoEncoder¶

Another variation on the autoencoder theme is to add noise to the input by zeroing out some of the elements randomly. This is known as the denoising autoencoder (Vincent et al., XXX) and it has been shown to be a form of score-matching (probabilistic model interpretation) and a good feature extraction algorithm.

In [19]:

def logistic(u):
    return 1. / (1. + np.exp(-u))

def cross_entropy(x, y):
    """
    x, y are Bernoulli probability vectors
    or matrices with row vectors
    """
    cost = -(x * np.log(y) + (1-x) * np.log(1-y)).sum(axis = 1)
    return cost

def denoising_autoencoder_binary_x(W, b, c, x):
    ## corrupting the niput by zero-ing out some values randomly
    snr = 0.3
    ## it seems here is the only difference between 
    ## a denoising autoencoder and a normal one
    noisy_x = x * (rand(*x.shape) > snr)
    noisy_x = noisy_x.astype(dtype)
    hid = logistic(np.dot(x, W) + b)
    ## tied weights implementation
    x_rec = logistic(np.dot(hid, W.T) + c)
    cost = cross_entropy(x, x_rec)
    return cost.mean()


def denoising_autoencoder_cost(W, b, c):
    return denoising_autoencoder_binary_x(W, b, c, x)

W, b, c = autodiff.optimize.fmin_l_bfgs_b(
                denoising_autoencoder_cost, 
                init_args = (W0, b0, c0),
                m = 20,
                maxfun = 20,
                iprint = 1)

In [20]:

figure(figsize = (10, 10))

util.show_filters(W.T, img_shape, hid_shape)

In [21]:

## performance of denoising auto encoder
snr = 0.3
x_noisy = x * (np.random.rand(*x.shape) > snr)
#x_noisy = x
feats = logistic(np.dot(x_noisy, W) + b)
X = feats

params = {
    'alpha': np.logspace(-5, 4, 10)
}

searcher = grid_search.RandomizedSearchCV(
                    linear_model.SGDClassifier(loss='hinge', 
                                                penalty='l2'), 
                    params)
searcher.fit(X, y)

sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))

             precision    recall  f1-score   support

          0       0.89      0.97      0.93      1001
          1       0.92      0.96      0.94      1127
          2       0.87      0.85      0.86       991
          3       0.85      0.84      0.84      1032
          4       0.89      0.83      0.86       980
          5       0.89      0.75      0.81       863
          6       0.91      0.95      0.93      1014
          7       0.92      0.87      0.89      1070
          8       0.79      0.85      0.82       944
          9       0.80      0.84      0.82       978

avg / total       0.87      0.87      0.87     10000

Sparse Filtering¶

efficient and has only one hyperparameter (number of features) to tune
In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution.
Instead, it optimizes a simple cost function - the sparsity of $l_2$-normalized features.
It scales to high dimensional inputs and can be used to learn meaningful features in additional layers with greedy layer-wise stacking.
The main difference of sparse-filtering and other unsupervised learning algorithms: traditionally feature learning methods have largely sought to learn models that provide good approximations of the true data distribution; e.g., denoising autoencoder, RBM, some verions of ICA, and sparse coding among others.
Instead of modelling the data distribution directly, SF focues on several key points of useful features, namely, population sparsity, lifetime sparsity, and high disperal. It can be used directly with off-the-shelf optimizers such as l-bgfs-b.
Essentially, distribution modelling based methods can be viewed as generating particular feature distributions. For instance, sparse coding represents each example using a few non-zero coefficients (features). A feature distribution oriented approach can provide insights into designing new algorithms based on optimizing for desirable properties of the feature distribution.

Definition of Desirable Properties of Feature Distribution¶

Popularity Sparsity (Sparse Features per example): Each example should be represented by only a few active (non-zero) features. For example, an image can be represented by a description of the objects in it, and while there are many possible objects that can appear, only a few are typically present at a single time.
Lifetime Sparsity (Sparse Features across examples): Features should be discriminative and allow us to distinguish examples; thus, each feature should ba active for a few examples.
High Dispersal (Uniform Activity Distribution): For each featutre, the distribution should have similiar statistics to every other feature; no one feature should have significantly more activity than the other features.While high dispersal is not necessarily the feature for a good representation of the feature space, we found that enforcing this property can prevent degenerate situations in which the same features are always active. For Overcomplete representations, high disperal translates to having fewer inactive features. As an example, PCA codes do not generally satisfy high disperal since the codes that correspond to the largest eigenvalues are almost always active

In an implementation language, assume a design matrix X, where rows are instances and columns are features*

Popularity Sparsity: each row (one example) should be sparse (with a small number of active elements)
Lifetime Sparsity: each column (one feature) should be sparse (with a small number of active elements)
High Dispersal: the mean squared activations of each feature ($l_2$ norm) of all features should be roughly the same .

Many feature learning algorithms include these objectives. For example, the sparse RBM [6] works by constraining the expected activation of a feature (over its lifetime) to be close to a target value.

ICA [9, 10] has constraints (e.g., each basis has unit norm) that normalize each feature, and further optimizes for the lifetime sparsity of the features it learns. Sparse autoencoders [16] also explicitly optimize for lifetime sparsity.

On the other hand, clustering-based methods such as k-means [17] can be seen as enforcing an extreme form of population sparsity where each cluster centroid corresponds to a feature and only one feature is allowed to be active per example. “Triangle” activation functions, which essentially serve to ensure population sparsity, have also been shown to obtain good classification results [17].

Sparse coding [11] is also typically seen as enforcing population sparsity. In this work, we use the feature distribution view to derive a simple feature learning algorithm that solely optimizes for population sparsity while enforcing high dispersal. In our experiments, we found that realizing these two properties was sufficient to allow us to learn overcomplete represen- tations; we also argue later that these two properties are jointly sufficient to ensure lifetime sparsity.

Sparse Filtering Algorithm¶

normalize each features (column-wise)
normalize each instance (row-wise)
optimize for sparseness using $l_1$ penalty

*Explainations*

the features could be parameterized by a weight matrix $W$, so $f=X*W$ or even more complicated nonlinear transform $f=logistic(X * W + b)$
for each step:
1. normalize each feature by dividing its $l2$ norm, which makes all features equally active.
2. normalize each instance by dividing its $l2$ norm, which makes all instances lie on the unit $l2$-ball.
3. achieve sparseness by optimizing $l1$ penalty of each INSTANCE vector.

In [22]:

## Sparse Filtering with soft-absolute function
from numpy.linalg import norm
def soft_absolute(u):
    """
    approximation to |u|
    """
    epsilon = 1e-8
    return np.sqrt(epsilon + u*u)

def sparse_filtering_x_cost(W, b, x):
    """
    W: nin * nout
    b: nout
    x: nsample * nin
    y: nsample * nin
    """
    y = soft_absolute(np.dot(x, W) + b)
    #y = logistic(np.dot(x, W) + b)
    ## normalize each feature (column-wise)
    yy = y / np.sqrt(np.sum(y*y, axis=0) + 1e-8)
    ## normalize each instance (row-wise)
    yyy = yy / np.sqrt(np.sum(yy*yy, axis = 1) + 1e-8)[:,np.newaxis]
    ## l1 penalty along each instance (row-wise)
    cost = np.abs(yyy).sum()
    return cost


def sparse_filtering_cost(W, b):
    return sparse_filtering_x_cost(W, b, x)

W, b = autodiff.optimize.fmin_l_bfgs_b(sparse_filtering_cost, 
                         #(np.random.randn(nin, nhid), np.random.randn(nhid)), 
                         (W0, b0),
                         maxfun = 20,
                         m=20,
                         iprint = 1)

In [23]:

figure(figsize=(10, 10))
util.show_filters(W.T, img_shape, hid_shape)

In [24]:

f = soft_absolute(np.dot(x, W) + b)
ff = f / np.sqrt(np.sum(f*f, axis=0))
fff = ff / np.sqrt(np.sum(ff*ff, axis = 1))[:,np.newaxis]
feats = fff
X = feats

params = {
    'alpha': np.logspace(-5, 4, 10)
}

searcher = grid_search.RandomizedSearchCV(
                    linear_model.SGDClassifier(loss='hinge', 
                                                penalty='l2'), 
                    params)
searcher.fit(X, y)

sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y, sgd.predict(X))

             precision    recall  f1-score   support

          0       0.91      0.98      0.94      1001
          1       0.97      0.96      0.97      1127
          2       0.87      0.92      0.89       991
          3       0.80      0.86      0.83      1032
          4       0.91      0.87      0.89       980
          5       0.93      0.76      0.84       863
          6       0.95      0.95      0.95      1014
          7       0.94      0.89      0.92      1070
          8       0.82      0.85      0.84       944
          9       0.84      0.86      0.85       978

avg / total       0.90      0.89      0.89     10000

In [25]:

f = soft_absolute(np.dot(x_test, W) + b)
ff = f / np.sqrt(np.sum(f*f, axis=0))
fff = ff / np.sqrt(np.sum(ff*ff, axis = 1))[:,np.newaxis]
feats = fff
X_test = feats

params = {
    'alpha': np.logspace(-5, 4, 10)
}

searcher = grid_search.RandomizedSearchCV(
                    linear_model.SGDClassifier(loss='hinge', 
                                                penalty='l2'), 
                    params)
searcher.fit(X_test, y_test)

sgd = searcher.best_estimator_
from sklearn.metrics import classification_report
print classification_report(y_test, sgd.predict(X_test))

             precision    recall  f1-score   support

          0       0.86      0.93      0.89       993
          1       0.90      0.98      0.94      1154
          2       0.85      0.83      0.84       938
          3       0.87      0.77      0.82      1044
          4       0.79      0.81      0.80       965
          5       0.83      0.84      0.83       912
          6       0.89      0.93      0.91       957
          7       0.88      0.91      0.90      1023
          8       0.80      0.79      0.80       978
          9       0.80      0.71      0.75      1036

avg / total       0.85      0.85      0.85     10000

In [25]:

In [ ]: