$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} $

21.1 Autoencoder Neural Networks

21.1: Note: this network is trained without input standardization, which works much better here, though the reason is not clear. Details related to A7 have also been added.

Download nn_torch.zip and extract the files.

In [25]:
import numpy as np
import matplotlib.pyplot as plt
import pickle
import gzip
import pandas

import neuralnetworks_torch as nntorch

If we train a network to learn an identity function, meaning that its output is trained to match its input as closely as possible, then we have an autoencoder. "Auto" means to duplicate itself (its input). "Encoder" means that we are encoding the input in the hidden layers in a way that preserves as much information as possible, so that the input can be regenerated as the output.

This idea can be used to perform a nonlinear reduction of the dimensionality of the input: simply construct an inner layer with fewer units than the dimensionality of the input. This inner layer, often the middle one, is usually the narrowest in the network and is often called the bottleneck.
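The idea can be sketched in plain numpy, independent of the `neuralnetworks_torch` library used below. This is a minimal illustrative autoencoder with a 2-unit bottleneck trained by full-batch gradient descent; the toy data, layer sizes, and learning rate are all assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10-D inputs that actually lie on a 2-D nonlinear surface,
# so a 2-unit bottleneck can capture most of their structure.
t = rng.uniform(-1, 1, size=(100, 2))
X = np.tanh(t @ rng.normal(size=(2, 10)))

n_samples, n_in = X.shape
W1 = rng.normal(scale=0.1, size=(n_in, 2))   # encoder: 10 -> 2 (bottleneck)
W2 = rng.normal(scale=0.1, size=(2, n_in))   # decoder: 2 -> 10

def forward(X):
    Z = np.tanh(X @ W1)   # bottleneck activations
    return Z, Z @ W2      # linear reconstruction of the input

def rmse(Y, X):
    return np.sqrt(np.mean((Y - X) ** 2))

first_rmse = rmse(forward(X)[1], X)

lr = 0.1
for epoch in range(2000):
    Z, Y = forward(X)
    delta = (Y - X) / n_samples                    # gradient of mean squared error
    grad_W2 = Z.T @ delta
    delta_hidden = (delta @ W2.T) * (1 - Z ** 2)   # backpropagate through tanh
    grad_W1 = X.T @ delta_hidden
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

final_rmse = rmse(forward(X)[1], X)
print(first_rmse, final_rmse)   # reconstruction error drops with training
```

Note that the network is trained with the input as its own target, which is exactly what the `nnet.train(X, X, ...)` calls below do with the real library.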

Geographical Origin of Music Data Set

Let's play with this idea using this data set. The features are attributes of music tracks; the last two columns are the latitude and longitude of each track's origin.

In [27]:
d = pandas.read_csv('default_plus_chromatic_features_1059_tracks.txt', header=None)
In [28]:
d.shape
Out[28]:
(1059, 118)
In [29]:
d.head()
Out[29]:
0 1 2 3 4 5 6 7 8 9 ... 108 109 110 111 112 113 114 115 116 117
0 7.161286 7.835325 2.911583 0.984049 -1.499546 -2.094097 0.576000 -1.205671 1.849122 -0.425598 ... -0.364194 -0.364194 -0.364194 -0.364194 -0.364194 -0.364194 -0.364194 -0.364194 -15.75 -47.95
1 0.225763 -0.094169 -0.603646 0.497745 0.874036 0.290280 -0.077659 -0.887385 0.432062 -0.093963 ... 0.936616 0.936616 0.936616 0.936616 0.936616 0.936616 0.936616 0.936616 14.91 -23.51
2 -0.692525 -0.517801 -0.788035 1.214351 -0.907214 0.880213 0.406899 -0.694895 -0.901869 -1.701574 ... 0.603755 0.603755 0.603755 0.603755 0.603755 0.603755 0.603755 0.603755 12.65 -8.00
3 -0.735562 -0.684055 2.058215 0.716328 -0.011393 0.805396 1.497982 0.114752 0.692847 0.052377 ... 0.187169 0.187169 0.187169 0.187169 0.187169 0.187169 0.187169 0.187169 9.03 38.74
4 0.570272 0.273157 -0.279214 0.083456 1.049331 -0.869295 -0.265858 -0.401676 -0.872639 1.147483 ... 1.620715 1.620715 1.620715 1.620715 1.620715 1.620715 1.620715 1.620715 34.03 -6.85

5 rows × 118 columns

In [30]:
d = d.values
In [31]:
d.shape
Out[31]:
(1059, 118)
In [32]:
X = d[:, :-2]
T = d[:, -2:]
X.shape, T.shape
Out[32]:
((1059, 116), (1059, 2))

Train an autoencoder with 2 units in the bottleneck layer.

In [33]:
n_in = X.shape[1]
n_out = n_in
nnet = nntorch.NeuralNetwork(n_in, [1000, 100, 100, 2, 100, 100, 1000], n_out, device='cuda')

nnet.train(X, X, 50000, 0.001, method='adam', verbose=True) 

plt.plot(nnet.error_trace)
Epoch 5000: RMSE 0.465
Epoch 10000: RMSE 0.396
Epoch 15000: RMSE 0.336
Epoch 20000: RMSE 0.285
Epoch 25000: RMSE 0.239
Epoch 30000: RMSE 0.204
Epoch 35000: RMSE 0.176
Epoch 40000: RMSE 0.156
Epoch 45000: RMSE 0.135
Epoch 50000: RMSE 0.113
Out[33]:
[<matplotlib.lines.Line2D at 0x7f9c98747590>]

How well does it learn the identity function?

In [34]:
plt.figure(figsize=(10, 10))
Y = nnet.use(X)
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.plot(X[i, :])
    plt.plot(Y[i, :])

Pretty good match.

So where is each music sample projected to in the two-dimensional plane formed by the bottleneck layer? First let's color the points by latitude.

In [35]:
middle = nnet.use_to_middle(X)
middle.shape
Out[35]:
(1059, 2)
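`use_to_middle` is part of the custom `neuralnetworks_torch` library, so its internals are not shown here; presumably it runs the forward pass only through the encoder half of the network and returns the bottleneck activations. A hypothetical numpy sketch of that idea, with untrained random weights and layer sizes mirroring the encoder above (116 → 1000 → 100 → 100 → 2) and an assumed tanh activation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder half of the autoencoder defined above; weights here are
# random placeholders, not trained values.
layer_sizes = [116, 1000, 100, 100, 2]
Ws = [rng.normal(scale=0.01, size=(a, b))
      for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

def use_to_middle(X, Ws):
    Z = X
    for W in Ws:
        Z = np.tanh(Z @ W)   # propagate through each encoder layer
    return Z                 # bottleneck activations, shape (n_samples, 2)

X = rng.normal(size=(5, 116))
middle = use_to_middle(X, Ws)
print(middle.shape)
```

Each of the 1059 samples is thus mapped to a single point in a 2-D plane, which is what the scatter plots below visualize.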
In [36]:
plt.scatter(middle[:, 0], middle[:, 1], c=T[:, 0])
plt.title('Color is latitude')
plt.colorbar();

Now color them by longitude.

In [37]:
plt.scatter(middle[:, 0], middle[:, 1], c=T[:, 1])
plt.title('Color is longitude')
plt.colorbar();

Can we predict latitude and longitude using just the two values from the bottleneck layer?

In [38]:
nnet_predict = nntorch.NeuralNetwork(2, [20, 20, 20], 2, device='cuda')
nnet_predict.train(middle, T, 20000, 0.01, method='adam')
plt.plot(nnet_predict.error_trace);
Epoch 2000: RMSE 0.668
Epoch 4000: RMSE 0.626
Epoch 6000: RMSE 0.596
Epoch 8000: RMSE 0.582
Epoch 10000: RMSE 0.583
Epoch 12000: RMSE 0.567
Epoch 14000: RMSE 0.561
Epoch 16000: RMSE 0.559
Epoch 18000: RMSE 0.553
Epoch 20000: RMSE 0.549

The final error is about 0.55. Pretty good, considering that the latitude and longitude values span ranges of roughly 90 and 100 degrees.
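If, as assumed here, the training routine standardizes the targets, the reported RMSE is in standard-deviation units; multiplying by each target column's standard deviation converts it back to degrees. A sketch with hypothetical latitude/longitude ranges (the real per-column standard deviations come from `T`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target matrix with latitude-like and longitude-like columns.
T = np.column_stack([rng.uniform(-45, 60, 1059),     # latitude, degrees
                     rng.uniform(-100, 155, 1059)])  # longitude, degrees

standardized_rmse = 0.55                 # final value from the error trace
rmse_degrees = standardized_rmse * T.std(axis=0)
print(rmse_degrees)                      # per-column RMSE back in degrees
```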

In [39]:
Y = nnet_predict.use(middle)
print(Y.shape)
plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(T[:, 0], label='T latitude')
plt.plot(Y[:, 0], label='Y latitude')
plt.legend()
plt.xlim(0, 200)
plt.subplot(1, 2, 2)
plt.plot(T[:, 1], label='T longitude')
plt.plot(Y[:, 1], label='Y longitude')
plt.legend()
plt.xlim(0, 200)
(1059, 2)
Out[39]:
(0, 200)
In [40]:
plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(T[:, 0], Y[:, 0],'o')
plt.title('Latitude')
plt.xlabel('Actual')
plt.ylabel('Predicted')

plt.subplot(1, 2, 2)
plt.plot(T[:, 1], Y[:, 1],'o')
plt.title('Longitude')
plt.xlabel('Actual')
plt.ylabel('Predicted')
Out[40]:
Text(0, 0.5, 'Predicted')

How does this compare to predicting from original data?

In [41]:
nnet_predict = nntorch.NeuralNetwork(X.shape[1], [20, 20], 2)
nnet_predict.train(X, T, 10000, 0.01, method='adam')
plt.plot(nnet_predict.error_trace);
Epoch 1000: RMSE 0.109
Epoch 2000: RMSE 0.071
Epoch 3000: RMSE 0.065
Epoch 4000: RMSE 0.050
Epoch 5000: RMSE 0.046
Epoch 6000: RMSE 0.040
Epoch 7000: RMSE 0.041
Epoch 8000: RMSE 0.049
Epoch 9000: RMSE 0.041
Epoch 10000: RMSE 0.042
In [42]:
Y = nnet_predict.use(X)
print(Y.shape)
plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(T[:, 0], label='T latitude')
plt.plot(Y[:, 0], label='Y latitude')
plt.legend()
plt.xlim(0, 200)
           
plt.subplot(1, 2, 2)
plt.plot(T[:, 1], label='T longitude')
plt.plot(Y[:, 1], label='Y longitude')
plt.legend()
plt.xlim(0, 200)
(1059, 2)
Out[42]:
(0, 200)
In [43]:
plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(T[:, 0], Y[:, 0],'o')
plt.title('Latitude')
plt.xlabel('Actual')
plt.ylabel('Predicted')

plt.subplot(1, 2, 2)
plt.plot(T[:, 1], Y[:, 1],'o')
plt.title('Longitude')
plt.xlabel('Actual')
plt.ylabel('Predicted')
Out[43]:
Text(0, 0.5, 'Predicted')

With 2 units in the bottleneck, we are not able to predict latitude and longitude nearly as well as with the full dimension of the data. But, as we saw in class, if we use 5 units in that bottleneck layer we do much better!
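Why a wider bottleneck helps can be illustrated with the linear analog of an autoencoder: the best reconstruction a linear autoencoder with a k-unit bottleneck can achieve is the projection onto the top-k principal components, so the reconstruction error can only shrink as k grows. This is a linear stand-in for the nonlinear networks above, not the author's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# 30-D data with 10-D underlying structure, so neither 2 nor 5
# components capture everything, but 5 capture strictly more.
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 30))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

def rmse_with_k_components(k):
    Xhat = Xc @ Vt[:k].T @ Vt[:k]    # encode to k dims, decode back
    return np.sqrt(np.mean((Xhat - Xc) ** 2))

rmse2 = rmse_with_k_components(2)
rmse5 = rmse_with_k_components(5)
print(rmse2, rmse5)   # the 5-component reconstruction error is smaller
```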

Now let's switch from a regression problem to a classification problem and use the MNIST digit dataset.

MNIST Data Set

In [68]:
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

Xtrain = train_set[0]
Ttrain = train_set[1]

Xtest = test_set[0]
Ttest = test_set[1]

Xtrain.shape, Ttrain.shape, Xtest.shape, Ttest.shape
Out[68]:
((50000, 784), (50000,), (10000, 784), (10000,))
In [45]:
plt.figure(figsize=(10, 10))
for i in range(100):
    plt.subplot(10, 10, i + 1)
    plt.imshow(-Xtrain[i, :].reshape((28, 28)), interpolation='nearest', cmap='gray')
    plt.axis('off')
    plt.title(str(Ttrain[i]))
plt.tight_layout()

Let's again try to squeeze the data through a narrow layer of 2 units.

In [77]:
n_in = Xtrain.shape[1]
nnet = nntorch.NeuralNetwork(n_in, [500, 100, 50, 50, 2, 50, 50, 100, 500], n_in, device='cuda')
nnet.train(Xtrain, Xtrain, 5000, 0.001, method='adam', standardize='')
plt.plot(nnet.error_trace);
Epoch 500: RMSE 0.220
Epoch 1000: RMSE 0.207
Epoch 1500: RMSE 0.199
Epoch 2000: RMSE 0.195
Epoch 2500: RMSE 0.191
Epoch 3000: RMSE 0.189
Epoch 3500: RMSE 0.187
Epoch 4000: RMSE 0.185
Epoch 4500: RMSE 0.184
Epoch 5000: RMSE 0.183
In [78]:
Ytest = nnet.use(Xtest)
In [79]:
plt.figure(figsize=(10, 10))
for i in range(0, 64, 2):
    plt.subplot(8, 8, i + 1)
    plt.imshow(-Xtest[i, :].reshape((28, 28)), interpolation='nearest', cmap='gray')
    plt.axis('off')
    plt.subplot(8, 8, i + 2)
    plt.imshow(-Ytest[i, :].reshape((28, 28)), interpolation='nearest', cmap='gray')
    plt.axis('off')
In [80]:
bottle_neck_train = nnet.use_to_middle(Xtrain)
bottle_neck_test = nnet.use_to_middle(Xtest)

show_n = 2000

show_these_train = np.random.choice(range(Ttrain.shape[0]), show_n)
show_these_test = np.random.choice(range(Ttest.shape[0]), show_n)

plt.figure(figsize=(12, 10))
plt.scatter(bottle_neck_train[show_these_train, 0], bottle_neck_train[show_these_train, 1],
            c=Ttrain[show_these_train].flat, alpha=0.5)
plt.scatter(bottle_neck_test[show_these_test, 0], bottle_neck_test[show_these_test, 1],
            c=Ttest[show_these_test].flat, alpha=0.5)
plt.colorbar();