Principal Component Analysis

A tutorial by Mickey Atwal (atwal@cshl.edu)

Principal Component Analysis (PCA) is a useful mathematical technique that reveals the structure of high-dimensional data by finding the linear combinations of variables that best explain the variance in the data.

At its heart, PCA transforms the data into a new coordinate system in which the variables are no longer correlated, which makes it useful for dimensionality reduction. The output of PCA is a set of eigenvectors, which define the new axes, and a set of eigenvalues, which tell you how much of the variance each new axis explains.
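
As a preview, the whole recipe can be sketched in a few lines of NumPy (the helper function pca below is only an illustrative sketch, not something used later in the tutorial):

import numpy as np

def pca(X):
    # X has one sample per row and one variable per column
    Xc = X - X.mean(axis=0)               # center each variable
    C = np.cov(Xc, rowvar=False)          # covariance matrix of the variables
    evals, evecs = np.linalg.eigh(C)      # eigendecomposition (C is symmetric)
    order = np.argsort(evals)[::-1]       # sort by decreasing variance
    return evals[order], evecs[:, order]  # eigenvalues and eigenvectors (columns)

The rest of the notebook unpacks these steps on a dataset small enough to visualize.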

We will explore the idea behind PCA using a toy example in Python.

In [1]:
from __future__ import division, print_function
import numpy as np
import matplotlib.pyplot as plt

1. Toy example

We can get some intuition for this procedure by cooking up a simple two-dimensional dataset in which most of the variation lies along one direction. Let's generate random data from a two-dimensional Gaussian distribution, rotated around the origin so that the x and y variables become correlated.

Generate Monte Carlo data

Let's first draw random samples from a two-dimensional Gaussian (normal) distribution centered around the origin, where the x and y variables are independent and uncorrelated,
$$
\begin{array}{rcl}
p(x,y)=p(x)p(y) &=& \displaystyle \frac{1}{\sqrt{2 \pi \sigma_x^2}} \exp\left( - \frac{x^2}{2 \sigma_x^2} \right) \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left( - \frac{y^2}{2 \sigma_y^2} \right) \\
&=& \displaystyle \frac{1}{2 \pi \sigma_x \sigma_y} \exp\left( - \frac{1}{2} \left[ \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right] \right)
\end{array}
$$

In [2]:
N=100 # number of datapoints
sx=7 # standard deviation in the x direction
sy=2 # standard deviation in the y direction

x=np.random.normal(0,sx,N) # random samples from gaussian in x-direction
y=np.random.normal(0,sy,N) # random samples from gaussian in y-direction

axs=max(abs(x))*1.2 # axes plotting range
t=np.linspace(-axs,axs,1000) # range of mesh points for contour plot
cx,cy=np.meshgrid(t,t) # mesh array
z=(1/(2*np.pi*sx*sy))*np.exp(-((cx*cx)/(2*sx*sx))-((cy*cy)/(2*sy*sy))) # actual 2d gaussian density

plt.figure(figsize=(10,5))
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
cm = plt.cm.RdBu # coloring scheme for contour plots

plt.subplot(1,2,1)
plt.contour(t,t,z,np.linspace(0.0000001,1/(2*np.pi*sx*sy),20),cmap=cm)
plt.grid()
plt.title('contour map of 2d gaussian distribution\n $\sigma_x=$ %d, $\sigma_y=$ %d' % (sx,sy),fontsize=12)
plt.xlabel('x',fontsize=12)
plt.ylabel('y',fontsize=12)

plt.subplot(1,2,2)
plt.contour(t,t,z,np.linspace(0.0000001,1/(2*np.pi*sx*sy),20),cmap=cm)
plt.plot(x,y,'bo',markersize=5)
plt.grid()
plt.title('%d random samples from the 2d gaussian\n distribution' % N, fontsize=12)
plt.xlabel('x',fontsize=12)
plt.ylabel('y',fontsize=12)

plt.show()
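
Incidentally, because x and y are independent, the joint distribution is just a two-dimensional Gaussian with a diagonal covariance matrix, so the same samples could have been drawn in a single call (a small aside, not used below):

# equivalent sampling: a 2d gaussian with diagonal covariance,
# since independence means the off-diagonal terms are zero
cov_diag = np.array([[sx**2, 0.0],
                     [0.0,  sy**2]])
xy = np.random.multivariate_normal([0, 0], cov_diag, size=N)  # shape (N, 2)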

Let's now rotate the data by $\theta$ degrees anticlockwise. This can be done by multiplying the (x,y) data by a rotation matrix $M$,
$$
\left( \begin{array}{c} x' \\ y' \end{array} \right)
=
\left( \begin{array}{cc} \cos (\theta) & -\sin (\theta) \\ \sin (\theta) & \cos (\theta) \end{array} \right)
\left( \begin{array}{c} x \\ y \end{array} \right)
$$
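
Before applying it, note that a rotation matrix is orthogonal (its transpose is its inverse) and has determinant +1, so it changes directions without changing lengths. A quick numerical check for $\theta=30^\circ$ (just an illustration):

# sanity check: a rotation matrix is orthogonal with determinant +1
th = np.radians(30)
R = np.array([[np.cos(th), -np.sin(th)],
              [np.sin(th),  np.cos(th)]])
print(np.allclose(R.T.dot(R), np.eye(2)))  # True: R^T R = identity
print(np.linalg.det(R))                    # ~1.0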

In [3]:
degrees=30 # rotation angle in degrees
theta=2*np.pi*degrees/360 # convert degrees to radians

M=np.array([[np.cos(theta),-np.sin(theta)],[np.sin(theta),np.cos(theta)]]) # rotation matrix represented as 2d array
print('rotation matrix, M=', M)
xyp=np.dot(M,np.vstack([x,y])).T # matrix multiplication; one rotated (x,y) sample per row

xd=xyp[:,0] # new x-coordinates
yd=xyp[:,1] # new y-coordinates

plt.figure(figsize=(5,5))
plt.plot(xd,yd,'bo',markersize=5)
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.grid()
plt.title('%d random samples from the 2d gaussian distribution \n rotated by $\\theta$= %d degrees anticlockwise' % (N,degrees))
plt.xlabel('x')
plt.ylabel('y')
plt.show()
rotation matrix, M= [[ 0.8660254 -0.5      ]
 [ 0.5        0.8660254]]

We see that the x and y variables are now correlated. We can quantify this by calculating the covariance matrix and the correlation matrix:
$$
\text{covariance}= \left( \begin{array}{cc} \left< xx \right> & \left< xy \right> \\ \left< yx \right> & \left< yy \right> \end{array} \right) \hspace{1in} \text{correlation}= \left( \begin{array}{cc} 1 & \frac{\left< xy \right>}{\sqrt{\left< xx \right> \left< yy \right>}} \\ \frac{\left< yx \right>}{\sqrt{\left< yy \right> \left< xx \right>}} & 1 \end{array} \right)
$$
where the angled brackets denote averaging over the samples (conventional physics notation), i.e.,
$$
\left< xy \right>=\frac{1}{N}\sum_{i=1}^{N} x_i y_i
$$

In [4]:
np.cov(xyp.T) # covariance matrix
Out[4]:
array([[ 46.28249177,  26.12496001],
       [ 26.12496001,  19.55457642]])
In [5]:
np.corrcoef(xyp.T) # correlation matrix
Out[5]:
array([[ 1.        ,  0.86840649],
       [ 0.86840649,  1.        ]])
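
As a check of the formula above, we can also compute the covariance by direct averaging and compare it with the theoretical covariance of the rotated Gaussian, $M \,\mathrm{diag}(\sigma_x^2,\sigma_y^2)\, M^T$. (Note that np.cov subtracts the sample mean and divides by N-1, whereas the average below uses 1/N, so the numbers agree only approximately.)

# covariance by direct averaging (1/N, no mean subtraction), reusing xyp, M, sx, sy from above
C_hand = xyp.T.dot(xyp)/N
# theoretical covariance of the rotated gaussian: M diag(sx^2, sy^2) M^T
C_theory = M.dot(np.diag([sx**2, sy**2])).dot(M.T)
print(C_hand)
print(C_theory)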

PCA on the data

Let's now find the new axes (eigenvectors) in which the covariance matrix $\mathbf{C}$ is diagonal. Recall that the eigenvalues $w$ and eigenvectors $\vec{v}$ satisfy the equation $\mathbf{C} \vec{v}=w \vec{v}$.

In [6]:
w,v=np.linalg.eig(np.cov(xyp.T)) # finds the eigenvalues and eigenvectors of the covariance matrix
In [7]:
print('The eigenvalues are %.2f and %.2f' % (w[0],w[1])) # eigenvalues
The eigenvalues are 62.26 and 3.57

Note that the eigenvalues are equal to the variances of the data along the eigenvectors. The square roots of the eigenvalues are fairly close to the standard deviations ($\sigma_x=7$, $\sigma_y=2$) of the original unrotated distribution.
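
We can verify this directly by projecting the data onto each eigenvector and taking the variance (reusing w, v and xyp from the cells above; np.var is given ddof=1 to match the N-1 normalization used by np.cov):

# coordinates of each sample along the two eigenvectors
proj = xyp.dot(v)
print(np.var(proj, axis=0, ddof=1))  # should reproduce the eigenvalues w
print(np.sqrt(w))                    # compare with the original sx=7, sy=2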

In [8]:
print('The eigenvectors are')
print(v) # eigenvectors (one per column of v)
The eigenvectors are
[[ 0.85305734 -0.52181718]
 [ 0.52181718  0.85305734]]

The eigenvectors are linear combinations of the original axes.

In [9]:
ah=0.1 # size of arrow head
f=1.1 # axes range

plt.figure(figsize=(5,5))
plt.subplot(111,aspect='equal')
plt.arrow(0,0,v[0,0],v[1,0],color='r',linewidth=2,head_width=ah,head_length=ah)
plt.arrow(0,0,v[0,1],v[1,1],color='r',linewidth=2,head_width=ah,head_length=ah)
plt.text(f*v[0,0],f*v[1,0],r'Eigenvector 1, $\vec{v_1}$ =  %.2f $\vec{x}$ + %.2f $\vec{y}$' % (v[0,0],v[1,0]), fontsize=15)
plt.text(f*v[0,1],f*v[1,1],r'Eigenvector 2, $\vec{v_2}$ =  %.2f $\vec{x}$ + %.2f $\vec{y}$' % (v[0,1],v[1,1]), fontsize=15)
plt.xlim([-f,f])
plt.ylim([-f,f])
plt.xlabel('x',fontsize=15)
plt.ylabel('y',fontsize=15)
plt.grid()
plt.show()
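
Because the covariance matrix is symmetric, its eigenvectors are orthonormal, and the leading eigenvector should point roughly along the direction we rotated the data into, $(\cos\theta, \sin\theta)$. A quick check (each eigenvector is only defined up to an overall sign):

# the eigenvector matrix is orthonormal: v^T v = identity
print(np.allclose(v.T.dot(v), np.eye(2)))
# the first principal component versus the rotated x-axis direction
print(v[:,0], np.array([np.cos(theta), np.sin(theta)]))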

The eigenvectors (also known as principal components) correspond to the orthogonal directions of greatest and least variation. Let's plot them on the original data, with the length of each vector equal to the square root of its eigenvalue, i.e., the standard deviation of the data along that direction.

In [10]:
plt.figure(figsize=(5,5))
plt.plot(xd,yd,'bo',markersize=5,zorder=1,)
plt.axis('equal')
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.grid()
plt.title('Principal Components (eigenvectors) of random data', fontsize=12)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)

plt.arrow(0,0,v[0,0]*np.sqrt(w[0]),v[1,0]*np.sqrt(w[0]),color='r',linewidth=2,head_width=1,head_length=1)
plt.arrow(0,0,v[0,1]*np.sqrt(w[1]),v[1,1]*np.sqrt(w[1]),color='r',linewidth=2,head_width=1,head_length=1)
plt.show()

The variance along each principal component (PC) is given by the eigenvalue $w$ of each principal component. Therefore the amount of variance explained by each PC is given by $$ \text{Variance explained by PC i}=\frac{w_i}{\sum_j w_j} $$

In [11]:
var_exp=100*w/np.sum(w) # percentage of the total variance explained by each PC
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.bar(np.arange(len(w)),w,width=0.6,color='r')
plt.ylabel('Eigenvalue',fontsize=12)
plt.xticks(np.arange(len(w))+0.3,np.arange(len(w))+1)
plt.xlabel('Principal Component',fontsize=12)
plt.subplot(1,2,2)
plt.bar(np.arange(len(w)),var_exp,width=0.6,color='g')
plt.ylabel('Variance explained (%)',fontsize=12)
plt.xticks(np.arange(len(w))+0.3,np.arange(len(w))+1)
plt.xlabel('Principal Component',fontsize=12)
plt.show()
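
With only two variables both PCs are kept, but in higher dimensions one typically retains the smallest number of leading PCs whose cumulative variance explained passes some threshold. A short sketch of that criterion (the 90% cutoff is just an illustrative choice):

# cumulative variance explained, largest eigenvalues first
w_sorted = np.sort(w)[::-1]
cum_var = np.cumsum(w_sorted)/np.sum(w_sorted)
n_keep = np.searchsorted(cum_var, 0.90) + 1  # smallest number of PCs explaining >= 90%
print(cum_var, n_keep)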

Projecting the data onto the eigenvectors

In [12]:
a=np.dot(np.linalg.inv(v),xyp.T) # coordinates of the data in the eigenvector basis (v is orthogonal, so inv(v) equals v.T)

plt.figure(figsize=(5,5))
plt.axis('equal')
plt.grid()
plt.plot(a[0],a[1],'bo',markersize=5)
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.xlabel('Principal Component 1',fontsize=15)
plt.ylabel('Principal Component 2',fontsize=15)
plt.show()

This should resemble the original unrotated dataset: the rotation has been undone, and the two coordinates are once again uncorrelated.
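
The same decomposition is also available in scikit-learn (assuming it is installed). PCA there subtracts the sample mean, and each component is only defined up to an overall sign, so the numbers should agree with the eigendecomposition above approximately and up to signs:

# cross-check with scikit-learn (illustrative; requires scikit-learn)
from sklearn.decomposition import PCA

X = np.column_stack([xd, yd])      # data matrix, one sample per row
skl_pca = PCA(n_components=2)
scores = skl_pca.fit_transform(X)  # coordinates along the principal components
print(skl_pca.components_)         # rows are the PCs; compare with v.T
print(skl_pca.explained_variance_) # compare with the eigenvalues w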