#!/usr/bin/env python
# coding: utf-8

# # Probabilistic Principal Component Analysis versus Factor Analysis
# 
# Consider $\mathbf{z}_i \sim \mathcal{N}(0,I)$. Let
# 
# \begin{equation}
# \mathbf{x}_i = W\mathbf{z}_i,~\text{where}~
# W = \begin{pmatrix}
# 1 & 0 \\
# 1 & 0.001 \\
# 0 & 10
# \end{pmatrix}.
# \end{equation}
# 
# Let us generate 1,000 observations of such data.

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(2016)

N = 1000
Z = stats.norm.rvs(loc=0, scale=1, size=2*N)
Z = Z.reshape((-1, 2))
W = np.array([[1, 0], [1, 0.001], [0, 10]])
X = np.dot(Z, W.T)
X[:10, :]  # peek at the first ten observations


# Now, let's apply probabilistic principal component analysis (PPCA) with $L = 1$. Here, it's assumed that
# 
# \begin{equation}
# \mathbf{x}_i \sim \mathcal{N}\left(W\mathbf{z}_i,\sigma^2I\right).
# \end{equation}
# 
# We can estimate the parameters with Equations 12.58 and 12.59.

# In[22]:


L = 1
U, D, V = np.linalg.svd(X)
V = V.T
Z = np.dot(U[:, :L], np.diag(D[:L]))  # latent scores
W = V[:, :L]                          # estimated direction
print(W)
print(np.dot(Z, W.T))  # reconstruction
print(X)               # actual


# Thus, we see that the estimated $W$ points along $x_3$, which is the direction of highest variance. Notice that when we look at $ZW^T$, the reconstructed $X$, we accurately predict $x_3$, whereas our estimates for $x_1$ and $x_2$ are way off.
# 
# Now, let's try factor analysis. This time, we have
# 
# \begin{equation}
# \mathbf{x}_i \sim \mathcal{N}\left(W\mathbf{z}_i + \boldsymbol\mu,\Psi\right),
# \end{equation}
# 
# where $\Psi$ is diagonal. We can estimate $W$ and $\Psi$ with the EM algorithm described in Section 12.1.5.

# In[3]:


## initialize: the last column of W plays the role of mu
W = stats.uniform.rvs(size=L*X.shape[1])
W = W.reshape((-1, L))
W = np.hstack((W, np.zeros(shape=(X.shape[1], 1))))
Psi = np.diag(np.diag(np.dot(X.T, X)))/len(X)
m = np.empty(shape=(N, L))  # posterior means of the latent variables

def update_W(X, B, C):
    # M-step update for the augmented [W, mu] matrix
    numerator = np.zeros(shape=(X.shape[1], B.shape[1]))
    denominator = np.zeros_like(C[0])
    for i in range(len(X)):
        numerator += np.outer(X[i], B[i])
        denominator += C[i]
    return np.dot(numerator, np.linalg.inv(denominator))

def update_Psi(X, W, B):
    # M-step update for the diagonal noise covariance Psi
    Psi = np.zeros(shape=(X.shape[1], X.shape[1]))
    for i in range(len(X)):
        Psi += np.outer(X[i] - np.dot(W, B[i]), X[i])
    return np.diag(np.diag(Psi))/len(X)

for k in range(100):  # a fixed number of EM iterations is plenty here
    # E step: posterior covariance and means of the latent variables
    Sigma = np.linalg.inv(np.eye(L) + np.dot(np.dot(W[:, :L].T, np.linalg.inv(Psi)), W[:, :L]))
    C = np.empty(shape=(N, L + 1, L + 1))  # expected sufficient statistics
    for i in range(N):
        m[i] = np.dot(Sigma, np.dot(np.dot(W[:, :L].T, np.linalg.inv(Psi)), X[i] - W[:, L]))
        C[i][:L, :L] = Sigma + np.outer(m[i], m[i])
        C[i][L, :L] = m[i]
        C[i][:L, L] = m[i]
        C[i][L, L] = 1
    B = np.hstack((m, np.ones((N, 1))))
    # M step
    W = update_W(X, B, C)
    Psi = update_Psi(X, W, B)

print(W[:, :L])  # weight vector (factor loadings)
print(W[:, L])   # mu
print(Psi)
print(m)


# Now, we see that the direction of this vector is approximately $(1,1,0)$, which is the direction of maximal correlation. Recall that
# 
# \begin{align}
# x_{i1} &= z_{i1} \\
# x_{i2} &= z_{i1} + 0.001z_{i2} \\
# x_{i3} &= 10z_{i2},
# \end{align}
# 
# so $x_{i1}$ and $x_{i2}$ are nearly identical, and hence very strongly correlated, while $x_{i3}$ can almost be treated as an independent normal variable.
# 
# The main difference between the two methods is the noise covariance $\Psi$: factor analysis only requires that $\Psi$ be diagonal, whereas PPCA requires that $\Psi = \sigma^2I$.
# 
# In this case, nearly all the variance is found in $x_{i3}$ since its coefficient is much larger.
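# We can verify this claim numerically. The small check below reuses only the `X` generated above: the per-dimension sample variances should be roughly $(1, 1, 100)$, and the sample correlation between $x_1$ and $x_2$ should be nearly 1.

# In[ ]:


print(X.var(axis=0))     # per-dimension sample variances: roughly (1, 1, 100)
print(np.corrcoef(X.T))  # sample correlation matrix: x_1 and x_2 almost perfectly correlated
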
# Thus, that direction explains the variance, so it's no surprise that the weight vector found by PPCA points in that direction. In this way, projecting onto this direction is almost like observing a one-dimensional normal distribution. From another perspective, if we were going to reconstruct $\mathbf{x}_i$ with one dimension, we'd simply take the sample standard deviation of the third dimension, since it has the largest variance, and multiply it by a standard normal.
# 
# On the other hand, that one-dimensional normal distribution can be seen as independent of $x_{i1}$ and $x_{i2}$. Thus, factor analysis can explain everything that happens along that dimension with $\Psi$ by setting the variance of the third component high. In this way, we have maximized the likelihood for that dimension, so all that is left is to explain the relationship between $x_{i1}$ and $x_{i2}$. They happen to be nearly identical, so the weight vector approximately points in the direction $(1,1,0)$.
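# As a final sanity check, both of the fits above can be compared against off-the-shelf implementations. The sketch below assumes scikit-learn is available (it is not used anywhere else in this notebook); up to sign and scale, its `PCA` and `FactorAnalysis` estimators should recover roughly the same directions as the hand-rolled versions: about $(0,0,1)$ for PCA and about $(1,1,0)$ for factor analysis.

# In[ ]:


# Optional cross-check against scikit-learn (an extra dependency).
# Estimated directions are only identified up to sign and scale.
from sklearn.decomposition import PCA, FactorAnalysis

pca = PCA(n_components=1).fit(X)
print(pca.components_)     # should point (up to sign) along x_3

fa = FactorAnalysis(n_components=1).fit(X)
print(fa.components_)      # should point (up to sign and scale) roughly along (1, 1, 0)
print(fa.noise_variance_)  # diagonal of Psi; largest for the third coordinate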