DS4420: Exercise on Dimensionality Reduction via PCA

Your name:

In general, dimensionality reduction may be viewed as encoding data $x$ into latent (lower-dimensional) representations $z$ that one can then use to recover, noisily, the original $x$'s.

One may view this as an encoding-decoding process, where we encode $x$ into $z$, and then decode $z$ into a reconstruction $\tilde{x}$. We saw that if our objective is to capture as much variance as possible in the compressed space, then we can use eigenvectors: namely, the top $m$ eigenvectors of the data covariance matrix (ranked by their corresponding eigenvalues) may be used to form an $m$-dimensional representation of $x$. In this exercise, we see this in practice.
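Before turning to Fashion-MNIST, here is a minimal self-contained sketch of this encode-decode pipeline on hypothetical synthetic 2-D data (the names and sizes here are illustrative, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: 200 correlated 2-D points, then zero-meaned
x = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
x = x - x.mean(axis=0)

# eigh returns eigenvalues in ascending order for a symmetric matrix
eigvals, eigvecs = np.linalg.eigh(x.T @ x)
b = eigvecs[:, -1:]   # top eigenvector: our 2 -> 1 dimensional "encoder"

z = x @ b             # encode: (200, 1) latent representation
x_tilde = z @ b.T     # decode: noisy reconstruction back in 2-D

# the total squared reconstruction error equals the discarded eigenvalue
err = np.sum((x - x_tilde) ** 2)
```

The same three lines (`eigh`, project, project back) are what the exercise below does at scale on the 784-dimensional image data.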

In [1]:
import numpy as np 
import matplotlib.pyplot as plt

import torch
# conda install -c pytorch torchvision
import torchvision

# note: if you cannot get torchvision installed 
# using the above sequence, you can resort to 
# the colab version here: 
# -- just be sure to download and then upload
# the notebook to blackboard when complete.
fMNIST = torchvision.datasets.FashionMNIST(
    root = './data/FashionMNIST',
    train = True,
    download = True)                                
In [2]:
fMNIST.data.shape
Out[2]:
torch.Size([60000, 28, 28])
In [3]:
from IPython.display import Image 
from matplotlib.pyplot import imshow
%matplotlib inline
imshow(np.asarray(fMNIST.data[6]), cmap='gray')
Out[3]:
<matplotlib.image.AxesImage at 0x7ff8fa4edc50>
In [4]:
X = fMNIST.data
X = np.array([x_i.flatten().numpy() for x_i in X])
X = X / 255 # normalize
X.shape
Out[4]:
(60000, 784)

TODO 1

Zero-mean $X$, and then construct the (unscaled) covariance matrix $S = X^T X$. (Strictly, the covariance matrix carries a $\frac{1}{N}$ factor, but this scaling changes neither the eigenvectors nor the variance ratios below.) What is the shape of $S$?

In [5]:
X_mu = X.mean(axis=0)
X = X - X_mu # zero-mean
S = (X.transpose() @ X)
S.shape
Out[5]:
(784, 784)
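As a quick sanity check (on hypothetical random data, separate from the exercise): $S$ built this way is always symmetric and positive semi-definite, which is exactly what justifies using np.linalg.eigh in the next step.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 20))
x = x - x.mean(axis=0)      # zero-mean, as in TODO 1
s = x.T @ x                 # (20, 20) scaled covariance matrix

symmetric = np.allclose(s, s.T)
# eigenvalues of a PSD matrix are non-negative (up to floating-point noise)
psd = bool(np.all(np.linalg.eigvalsh(s) >= -1e-8))
```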

TODO 2

Use numpy.linalg.eigh to retrieve the eigenvectors and eigenvalues of $S$ (these are returned in ascending order of eigenvalue). Then, take the top $M$ of these -- say $M=50$ -- and project $X$ into $Z$. How much variance is captured with this value of $M$?

In [6]:
eigvals, eigvecs = np.linalg.eigh(S)
In [188]:
M = 50
B = eigvecs[:, -M:]
B.shape
Out[188]:
(784, 50)
In [189]:
Z = X @ B
Z.shape
Out[189]:
(60000, 50)
In [190]:
sum(eigvals)
Out[190]:
4092975.6596677555
In [191]:
var_captured = np.sum(eigvals[-M:])/np.sum(eigvals)
print("{} components capture {:.2%} of the variance".format(M, var_captured))
50 components capture 86.27% of the variance

TODO 3

Plot the variance captured as a function of the number of components. Remember: this is given by the sum of the top $k$ eigenvalues over the total sum of the eigenvalues. Based on this plot, what do you think is a 'good' number of components to pick?

In [192]:
total_var = np.sum(eigvals)
component_range = list(range(S.shape[0]))
vars_captured = np.array([np.sum(eigvals[-(k+1):])/total_var for k in component_range])
plt.plot(component_range, vars_captured)
Out[192]:
[<matplotlib.lines.Line2D at 0x7f81a2b4a390>]
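A side note on the cell above: the list comprehension re-sums overlapping prefixes, which is quadratic in the number of components. np.cumsum gives the same curve in one pass, and also makes it easy to read off the smallest $k$ reaching a variance threshold. A sketch on hypothetical synthetic eigenvalues (the 95% threshold is an illustrative choice, not part of the exercise):

```python
import numpy as np

# hypothetical non-negative eigenvalues, ascending, as np.linalg.eigh returns them
eigvals = np.sort(np.random.default_rng(2).exponential(size=784))

# fraction of variance captured by the top-k components, for k = 1..784
vars_captured = np.cumsum(eigvals[::-1]) / np.sum(eigvals)

# smallest k capturing at least 95% of the variance
k_95 = int(np.argmax(vars_captured >= 0.95)) + 1
```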

TODO 4

Now "decode" an image back from $Z$ to $\tilde{X}$. Visualize it (you can follow the above code, using imshow).

In [193]:
X_tilde = Z @ B.transpose()
X_tilde.shape
Out[193]:
(60000, 784)
In [194]:
imshow(np.asarray(X_tilde[20]+X_mu).reshape((28,28)), cmap='gray')
Out[194]:
<matplotlib.image.AxesImage at 0x7f81c0469be0>
In [195]:
imshow(np.asarray(X[20] + X_mu).reshape(28,28), cmap='gray')
Out[195]:
<matplotlib.image.AxesImage at 0x7f81e404dc88>
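Comparing the two images above shows the reconstruction is blurrier than the original. This trade-off can be quantified: the mean squared reconstruction error shrinks as $M$ grows and vanishes when $M$ equals the full dimension. A self-contained sketch on hypothetical random data (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(300, 50)) @ rng.normal(size=(50, 50))  # correlated data
xc = x - x.mean(axis=0)
_, eigvecs = np.linalg.eigh(xc.T @ xc)

def recon_error(m):
    # encode with the top-m eigenvectors, decode, and measure the error
    b = eigvecs[:, -m:]
    x_tilde = (xc @ b) @ b.T
    return np.mean((xc - x_tilde) ** 2)

errors = [recon_error(m) for m in (5, 20, 50)]
# errors is non-increasing, and essentially zero at m = 50 (full dimension)
```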

Bonus 1: Scatter-plot the two-dimensional representation of two item types (classes).

Do they separate well in this reduced space?

In [197]:
y = fMNIST.targets.numpy()
ones = np.argwhere(y == 1).flatten()    # class 1: trousers
sevens = np.argwhere(y == 7).flatten()  # class 7: sneakers
In [198]:
Z1 = Z[ones]
Z2 = Z[sevens]
Zc = np.concatenate((Z1, Z2))
colors = ["blue"] * Z1.shape[0] + ["red"] * Z2.shape[0]
In [199]:
Zc.shape
Out[199]:
(12000, 50)
In [200]:
plt.clf()
%matplotlib inline
plt.scatter(Zc[:,0], Zc[:,1], c=colors, alpha=.1)
Out[200]:
<matplotlib.collections.PathCollection at 0x7f81b1111470>

Bonus 2: Fun with Probabilistic PCA

In the MML reading, they derived that the posterior distribution over $z$ (the latent representation) given $x$ is:

$\mathcal{N}(m, C)$

Where:

$m = B^T (B B^T + \sigma^2 I)^{-1} (x-\mu)$

$C = I - B^T (B B^T + \sigma^2 I)^{-1} B$

Here $\mu$ is the data mean. In general we will assume $\sigma^2$ is known.
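One practical aside (not required by the exercise): the posterior mean as written inverts the $D \times D$ matrix $B B^T + \sigma^2 I$, which for our data is $784 \times 784$. The standard "push-through" identity $B^T (B B^T + \sigma^2 I)^{-1} = (B^T B + \sigma^2 I)^{-1} B^T$ lets you invert an $M \times M$ matrix instead. A numerical check on a hypothetical random $B$:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, var = 40, 5, 0.1        # illustrative sizes (the notebook uses d=784, m=50)
b = rng.normal(size=(d, m))

# form appearing in the posterior mean: inverts a d x d matrix
lhs = b.T @ np.linalg.inv(b @ b.T + var * np.eye(d))
# push-through identity: the same matrix via a much smaller m x m inverse
rhs = np.linalg.inv(b.T @ b + var * np.eye(m)) @ b.T
```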

In [201]:
def sample_from_posterior_given_x(x, var=0.1, n_samples=10):
    # shared term: B^T (B B^T + var * I)^{-1}
    B_term = B.transpose() @ np.linalg.inv(B @ B.transpose() + var * np.eye(B.shape[0]))
    # note: X was zero-meaned in place above, so x is already centered
    m = B_term @ x                       # posterior mean
    C = np.eye(B.shape[1]) - B_term @ B  # posterior covariance
    return np.array([np.random.multivariate_normal(m, C) for _ in range(n_samples)])
In [209]:
samples = sample_from_posterior_given_x(X[1,:], var=1)
In [210]:
decoded_samples = samples @ B.transpose()
decoded_samples.shape
Out[210]:
(10, 784)
In [211]:
imshow(np.asarray(X[1] + X_mu).reshape(28,28), cmap='gray')
Out[211]:
<matplotlib.image.AxesImage at 0x7f818113b1d0>
In [214]:
imshow(np.asarray(decoded_samples[5] + X_mu).reshape(28,28), cmap='gray')
Out[214]:
<matplotlib.image.AxesImage at 0x7f8180335470>
In [ ]: