Principal Component Analysis (PCA) is a useful mathematical technique that can reveal the structure of high-dimensional data by finding the linear combinations of variables that best explain the variance in the data.
At its heart, PCA transforms the data into a new coordinate system in which the variables are no longer correlated, which makes it useful for dimensionality reduction. The output of PCA is a set of eigenvectors, which are the new axes, and a set of eigenvalues, which tell you how much variance each new axis explains.
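Before building intuition with plots, the whole recipe can be sketched in a few lines of NumPy. This is an illustrative sketch on synthetic data (the variable names `evals`, `evecs`, and `scores` are our own, not from any library):

```python
import numpy as np

# Minimal PCA sketch on synthetic 2-D data: centre the data,
# form the covariance matrix, and eigendecompose it.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])

centred = data - data.mean(axis=0)      # PCA assumes mean-centred data
cov = np.cov(centred.T)                 # 2x2 covariance matrix
evals, evecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric

order = np.argsort(evals)[::-1]         # sort by decreasing variance
evals, evecs = evals[order], evecs[:, order]
scores = centred @ evecs                # data in the new, uncorrelated axes
```

The rest of this notebook unpacks each of these steps on a toy dataset.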
We will explore the idea behind PCA using a toy example in Python.
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
We can get some intuition for this procedure by cooking up a simple two-dimensional dataset in which most of the variation lies along one direction. Let's generate random data from a two-dimensional Gaussian distribution and then rotate it around the origin so that the x and y variables become correlated.
Let's first draw random samples from a two-dimensional Gaussian (normal) distribution centered around the origin, where the x and y variables are independent and uncorrelated, $$ \begin{array}{rcl} p(x,y)=p(x)\,p(y) &=&\displaystyle \frac{1}{\sqrt{2 \pi \sigma_x^2}} \exp\left( - \frac{x^2}{2 \sigma_x^2} \right) \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left( - \frac{y^2}{2 \sigma_y^2} \right) \\ &=&\displaystyle \frac{1}{2 \pi \sigma_x \sigma_y} \exp\left( - \frac{1}{2} \left[ \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right] \right) \end{array} $$
N=100 # number of datapoints
sx=7 # standard deviation in the x direction
sy=2 # standard deviation in the y direction
x=np.random.normal(0,sx,N) # random samples from gaussian in x-direction
y=np.random.normal(0,sy,N) # random samples from gaussian in y-direction
axs=max(abs(x))*1.2 # axes plotting range
t=np.linspace(-axs,axs,1000) # range of mesh points for contour plot
cx,cy=np.meshgrid(t,t) # mesh array
z=(1/(2*np.pi*sx*sy))*np.exp(-((cx*cx)/(2*sx*sx))-((cy*cy)/(2*sy*sy))) # actual 2d gaussian
plt.figure(figsize=(10,5))
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
cm = plt.cm.RdBu # coloring scheme for contour plots
plt.subplot(1,2,1)
plt.contour(t,t,z,np.linspace(0.0000001,1/(2*np.pi*sx*sy),20),cmap=cm)
plt.grid()
plt.title('contour map of 2d gaussian distribution\n $\sigma_x=$ %d, $\sigma_y=$ %d' % (sx,sy),fontsize=12)
plt.xlabel('x',fontsize=12)
plt.ylabel('y',fontsize=12)
plt.subplot(1,2,2)
plt.contour(t,t,z,np.linspace(0.0000001,1/(2*np.pi*sx*sy),20),cmap=cm)
plt.plot(x,y,'bo',markersize=5)
plt.grid()
plt.title('%d random samples from the 2d gaussian\n distribution' % N, fontsize=12)
plt.xlabel('x',fontsize=12)
plt.ylabel('y',fontsize=12)
plt.show()
Let's now rotate the data by $\theta$ degrees anticlockwise. This can be done by multiplying the (x,y) data by a rotation matrix M, $$ \left( \begin{array}{c} x' \\ y' \end{array} \right) = \left( \begin{array}{cc} \cos (\theta) & -\sin (\theta) \\ \sin (\theta) & \cos (\theta) \end{array} \right) \left( \begin{array}{c} x \\ y \end{array} \right) $$
degrees=30 # theta degrees
theta=2*np.pi*degrees/360 # convert degrees to radians
M=np.array([[np.cos(theta),-np.sin(theta)],[np.sin(theta),np.cos(theta)]]) #rotation matrix represented as 2d array
print('rotation matrix, M=')
print(M)
xyp=np.dot(M,np.vstack([x,y])).T # matrix multiplication
xd=xyp[:,0] # new x-coordinates
yd=xyp[:,1] # new y-coordinates
plt.figure(figsize=(5,5))
plt.plot(xd,yd,'bo',markersize=5)
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.grid()
plt.title('%d random samples from 2d gaussian distribution \n rotated by $\\theta$= %d degrees anticlockwise' % (N,degrees))
plt.xlabel('x')
plt.ylabel('y')
plt.show()
rotation matrix, M= [[ 0.8660254 -0.5 ] [ 0.5 0.8660254]]
We see that the x and y variables are now correlated. We can quantify this by calculating the covariance matrix and the correlation matrix: $$ \text{covariance}= \left( \begin{array}{cc} \left< xx \right> & \left< xy \right> \\ \left< yx \right> & \left< yy \right> \end{array} \right) \hspace{1in} \text{correlation}= \left( \begin{array}{cc} 1 & \frac{\left< xy \right>}{\sqrt{\left< xx \right> \left< yy \right>}} \\ \frac{\left< yx \right>}{\sqrt{\left< yy \right> \left< xx \right>}} & 1 \end{array} \right) $$ where the brackets denote averaging (conventional physics notation), i.e., $$ \left< xy \right>=\frac{1}{N}\sum_{i}^{N} x_i y_i $$
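As a sanity check on the averaging formula, we can compute the covariance matrix by hand and compare it against `np.cov`. One subtlety worth noting: `np.cov` divides by $N-1$ by default, while the formula above divides by $N$; passing `bias=True` selects the $1/N$ convention. The data below is a fresh, seeded stand-in for the notebook's `xyp`:

```python
import numpy as np

# Seeded stand-in for the rotated samples used in the text.
rng = np.random.default_rng(1)
N = 1000
xy = np.vstack([rng.normal(0, 7, N), rng.normal(0, 2, N)])
theta = np.pi / 6                            # 30 degrees
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
xyp = np.dot(M, xy).T

# <xy> = (1/N) * sum_i x_i y_i, computed on mean-centred data
centred = xyp - xyp.mean(axis=0)
cov_by_hand = centred.T @ centred / N

cov_numpy = np.cov(xyp.T, bias=True)         # bias=True -> divide by N
```

The two matrices agree to floating-point precision.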
np.cov(xyp.T) # covariance
array([[ 33.78440758, 18.68697036], [ 18.68697036, 16.95711535]])
np.corrcoef(xyp.T) # correlation
array([[ 1. , 0.78073699], [ 0.78073699, 1. ]])
Let's now find the new axes (eigenvectors) in which the covariance matrix $\mathbf{C}$ is diagonal. Recall that the eigenvalues $w$ and eigenvectors $\vec{v}$ satisfy the equation $\mathbf{C} \vec{v}=w \vec{v}$.
w,v=np.linalg.eig(np.cov(xyp.T)) #finds the eigenvalues and eigenvectors of the covariance matrix
print('The eigenvalues are %.2f and %.2f' % (w[0],w[1])) # eigenvalues
The eigenvalues are 45.86 and 4.88
Note that the eigenvalues are equal to the variances of the data along the eigenvectors. The square roots of the eigenvalues are fairly close to the standard deviations of the original, unrotated normal distribution.
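This claim can be checked directly: projecting the data onto each eigenvector and taking the variance should reproduce the eigenvalues. The snippet below uses a fresh, seeded stand-in dataset, since the notebook's samples are random:

```python
import numpy as np

# Seeded stand-in for the rotated dataset.
rng = np.random.default_rng(2)
xy = np.vstack([rng.normal(0, 7, 2000), rng.normal(0, 2, 2000)])
theta = np.pi / 6
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
xyp = np.dot(M, xy).T

C = np.cov(xyp.T)
w, v = np.linalg.eig(C)

proj = xyp @ v                          # coordinates along the eigenvectors
variances = proj.var(axis=0, ddof=1)    # ddof=1 matches np.cov's default
```

The variances along the eigenvectors match the eigenvalues to floating-point precision.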
print('The eigenvectors are')
print(v) #eigenvectors
The eigenvectors are [[ 0.83980583 -0.54288689] [ 0.54288689 0.83980583]]
The eigenvectors are linear combinations of the original axes.
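Because the covariance matrix is symmetric, its eigenvectors also form an orthonormal basis: each has unit length and they are mutually perpendicular. A quick check, using numbers similar to the covariance matrix printed above:

```python
import numpy as np

# Covariance values similar to those printed above (illustrative).
C = np.array([[33.78, 18.69],
              [18.69, 16.96]])
w, v = np.linalg.eig(C)

lengths = np.linalg.norm(v, axis=0)     # column norms: should both be 1
gram = v.T @ v                          # should be the identity matrix
```

So the matrix of eigenvectors acts as a rotation (possibly combined with a reflection) of the original axes.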
ah=0.1 # size of arrow head
f=1.1 # axes range
plt.figure(figsize=(5,5))
plt.subplot(111,aspect='equal')
plt.arrow(0,0,v[0,0],v[1,0],color='r',linewidth=2,head_width=ah,head_length=ah)
plt.arrow(0,0,v[0,1],v[1,1],color='r',linewidth=2,head_width=ah,head_length=ah)
plt.text(f*v[0,0],f*v[1,0],r'Eigenvector 1, $\vec{v_1}$ = %.2f $\vec{x}$ + %.2f $\vec{y}$' % (v[0,0],v[1,0]), fontsize=15)
plt.text(f*v[0,1],f*v[1,1],r'Eigenvector 2, $\vec{v_2}$ = %.2f $\vec{x}$ + %.2f $\vec{y}$' % (v[0,1],v[1,1]), fontsize=15)
plt.xlim([-f,f])
plt.ylim([-f,f])
plt.xlabel('x',fontsize=15)
plt.ylabel('y',fontsize=15)
plt.grid()
plt.show()
The eigenvectors (also known as principal components) correspond to the orthogonal directions of greatest and least variation. Let's plot them on the original data, with the length of each vector equal to the square root of its eigenvalue.
plt.figure(figsize=(5,5))
plt.plot(xd,yd,'bo',markersize=5,zorder=1,)
plt.axis('equal')
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.grid()
plt.title('Principal Components (eigenvectors) of random data', fontsize=12)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
scaled_eigenvec1 = [v[0,0]*np.sqrt(w[0]),v[1,0]*np.sqrt(w[0])] # eigenvector 1 scaled by the sqrt of its eigenvalue
scaled_eigenvec2 = [v[0,1]*np.sqrt(w[1]),v[1,1]*np.sqrt(w[1])] # eigenvector 2 scaled by the sqrt of its eigenvalue
plt.arrow(0,0,scaled_eigenvec1[0],scaled_eigenvec1[1],color='r',linewidth=2,head_width=1,head_length=1)
plt.arrow(0,0,scaled_eigenvec2[0],scaled_eigenvec2[1],color='r',linewidth=2,head_width=1,head_length=1)
plt.show()
The variance along each principal component (PC) is given by the eigenvalue $w$ of each principal component. Therefore the amount of variance explained by each PC is given by $$ \text{Variance explained by PC i}=\frac{w_i}{\sum_j w_j} $$
var_exp=100*w/sum(w)
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.bar(np.arange(len(w)),w,width=0.6,color='r')
plt.ylabel('Eigenvalue',fontsize=12)
plt.xticks(np.arange(len(w))+0.3,np.arange(len(w))+1)
plt.xlabel('Principal Component',fontsize=12)
plt.subplot(1,2,2)
plt.bar(np.arange(len(w)),var_exp,width=0.6,color='g')
plt.ylabel('Variance explained (%)',fontsize=12)
plt.xticks(np.arange(len(w))+0.3,np.arange(len(w))+1)
plt.xlabel('Principal Component',fontsize=12)
plt.show()
# pick a data point (x and y roughly between -10 and 10)
data_point = [8,8]
# plot data
plt.figure(figsize=(10,10))
plt.plot(xd,yd,'bo',markersize=4,zorder=1,alpha = 0.1)
plt.plot(data_point[0], data_point[1], 'bo',markersize=5, color='b')
plt.text(data_point[0]+.5,data_point[1],'('+str(data_point[0])+','+str(data_point[1])+')',color='b',fontsize='16')
plt.plot([-100,100],[0,0],'k')
plt.plot([0,0],[-100,100],'k')
plt.axis('equal')
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.grid()
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.arrow(0,0,v[0,0],v[1,0],color='r',linewidth=2,head_width=0.8,head_length=0.8)
plt.arrow(0,0,v[0,1],v[1,1],color='r',linewidth=2,head_width=0.8,head_length=0.8)
plt.text(15,12,'comp 1',color='r',fontsize='16')
plt.text(-7,15,'comp 2',color='r',fontsize='16')
z = v*100
plt.plot([-z[0,0],z[0,0]],[-z[1,0],z[1,0]],'r',linestyle = '--')
plt.plot([-z[0,1],z[0,1]],[-z[1,1],z[1,1]],'r',linestyle = '--')
# project the data point onto the eigenvector basis
# (v is orthonormal, so inv(v) equals v.T)
dot_product = np.dot(np.linalg.inv(v),data_point)
# component vectors of the data point along each eigenvector
x_comp = v[:,0]*dot_product[0] # projection onto eigenvector 1
y_comp = v[:,1]*dot_product[1] # projection onto eigenvector 2
# plot geometrical representation of this derivation:
plt.plot([data_point[0],x_comp[0]],[data_point[1],x_comp[1]],'k',linestyle = '--')
plt.plot([data_point[0],y_comp[0]],[data_point[1],y_comp[1]],'k',linestyle = '--')
plt.arrow(0,0,x_comp[0],x_comp[1],color='g',linewidth=1,head_width=0.3,head_length=0.3)
plt.arrow(0,0,y_comp[0],y_comp[1],color='g',linewidth=1,head_width=0.3,head_length=0.3)
plt.text(x_comp[0],x_comp[1]+1,str(int(round(dot_product[0]))),color='g',fontsize='16')
plt.text(y_comp[0],y_comp[1]-1,str(int(round(dot_product[1]))),color='g',fontsize='16')
plt.plot(dot_product[0], dot_product[1], 'ro', markersize=5)
plt.text(int(round(dot_product[0]))+.5,int(round(dot_product[1])),'('+str(int(round(dot_product[0])))+','+str(int(round(dot_product[1])))+')',color='g',fontsize='16')
plt.show()
Now let's compute the new coordinates of every point in the eigenvector basis and plot the result:
a=np.dot(np.linalg.inv(v),xyp.T)
plt.figure(figsize=(5,5))
plt.axis('equal')
plt.grid()
plt.plot(a[0],a[1],'bo',markersize=5)
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.xlabel('Principal Component 1',fontsize=15)
plt.ylabel('Principal Component 2',fontsize=15)
plt.show()
This should resemble the original, unrotated dataset (possibly mirrored, since the sign of each eigenvector is arbitrary).
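We can confirm this numerically rather than by eye: in the eigenvector basis the covariance matrix becomes diagonal, so the new coordinates are uncorrelated and their variances are exactly the eigenvalues. A seeded stand-in version of the same computation:

```python
import numpy as np

# Seeded stand-in for the rotated dataset and its projection.
rng = np.random.default_rng(3)
xy = np.vstack([rng.normal(0, 7, 1000), rng.normal(0, 2, 1000)])
theta = np.pi / 6
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
xyp = np.dot(M, xy).T

w, v = np.linalg.eig(np.cov(xyp.T))
a = np.dot(np.linalg.inv(v), xyp.T)     # same projection as in the text

C_new = np.cov(a)                       # covariance in the new basis
```

The off-diagonal terms of `C_new` vanish (to floating-point precision) and its diagonal recovers the eigenvalues, which is exactly what PCA promises.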