Principal Component Analysis (PCA) is a useful mathematical technique that can reveal the structure of high-dimensional data by finding the linear combinations of variables that best explain the variance in the data.
At its heart, PCA transforms the data into a new coordinate system in which the variables are no longer correlated, which makes it useful for dimensionality reduction. The output of PCA is a set of eigenvectors, which are the new axes, and a set of eigenvalues, which tell you how much variance each new axis explains.
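Before building intuition with plots, the whole recipe can be sketched in a few lines of NumPy. This is an illustrative sketch on synthetic data (the variable names `evals`, `evecs`, and `scores` are our own, not from any library):

```python
import numpy as np

# Minimal PCA sketch on synthetic 2-D data: centre the data,
# form the covariance matrix, and eigendecompose it.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])

centred = data - data.mean(axis=0)      # PCA assumes mean-centred data
cov = np.cov(centred.T)                 # 2x2 covariance matrix
evals, evecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric

order = np.argsort(evals)[::-1]         # sort by decreasing variance
evals, evecs = evals[order], evecs[:, order]
scores = centred @ evecs                # data in the new, uncorrelated axes
```

The rest of this notebook unpacks each of these steps on a toy dataset.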
We will explore the idea behind PCA using a toy example in Python.
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
We can get some intuition for this procedure by cooking up a simple two-dimensional dataset in which most of the variation lies along one direction. Let's generate random data from a two-dimensional Gaussian distribution and then rotate it around the origin so that the x and y variables become correlated.
Let's first draw random samples from a two-dimensional Gaussian (normal) distribution centered around the origin, where the x and y variables are independent and uncorrelated, $$ \begin{array}{rcl} p(x,y)=p(x)\,p(y) &=&\displaystyle \frac{1}{\sqrt{2 \pi \sigma_x^2}} \exp\left( - \frac{x^2}{2 \sigma_x^2} \right) \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left( - \frac{y^2}{2 \sigma_y^2} \right) \\ &=&\displaystyle \frac{1}{2 \pi \sigma_x \sigma_y} \exp\left( - \frac{1}{2} \left[ \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right] \right) \end{array} $$
N=100 # number of datapoints
sx=7 # standard deviation in the x direction
sy=2 # standard deviation in the y direction
x=np.random.normal(0,sx,N) # random samples from gaussian in x-direction
y=np.random.normal(0,sy,N) # random samples from gaussian in y-direction
axs=max(abs(x))*1.2 # axes plotting range
t=np.linspace(-axs,axs,1000) # range of mesh points for contour plot
cx,cy=np.meshgrid(t,t) # mesh array
z=(1/(2*np.pi*sx*sy))*np.exp(-((cx*cx)/(2*sx*sx))-((cy*cy)/(2*sy*sy))) # actual 2d gaussian
plt.figure(figsize=(10,5))
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
cm = plt.cm.RdBu # coloring scheme for contour plots
plt.subplot(1,2,1)
plt.contour(t,t,z,np.linspace(0.0000001,1/(2*np.pi*sx*sy),20),cmap=cm)
plt.grid()
plt.title('contour map of 2d gaussian distribution\n $\sigma_x=$ %d, $\sigma_y=$ %d' % (sx,sy),fontsize=12)
plt.xlabel('x',fontsize=12)
plt.ylabel('y',fontsize=12)
plt.subplot(1,2,2)
plt.contour(t,t,z,np.linspace(0.0000001,1/(2*np.pi*sx*sy),20),cmap=cm)
plt.plot(x,y,'bo',markersize=5)
plt.grid()
plt.title('%d random samples from the 2d gaussian\n distribution' % N, fontsize=12)
plt.xlabel('x',fontsize=12)
plt.ylabel('y',fontsize=12)
plt.show()
Let's now rotate the data by $\theta$ degrees anticlockwise. This can be done by multiplying the (x,y) data by a rotation matrix M, $$ \left( \begin{array}{c} x' \\ y' \end{array} \right) = \left( \begin{array}{cc} \cos (\theta) & -\sin (\theta) \\ \sin (\theta) & \cos (\theta) \end{array} \right) \left( \begin{array}{c} x \\ y \end{array} \right) $$
degrees=30 # theta degrees
theta=2*np.pi*degrees/360 # convert degrees to radians
M=np.array([[np.cos(theta),-np.sin(theta)],[np.sin(theta),np.cos(theta)]]) #rotation matrix represented as 2d array
print('rotation matrix, M=')
print(M)
xyp=np.dot(M,np.vstack([x,y])).T # matrix multiplication
xd=xyp[:,0] # new x-coordinates
yd=xyp[:,1] # new y-coordinates
plt.figure(figsize=(5,5))
plt.plot(xd,yd,'bo',markersize=5)
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.grid()
plt.title('%d random samples from 2d gaussian distribution \n rotated by $\\theta$= %d degrees anticlockwise' % (N,degrees))
plt.xlabel('x')
plt.ylabel('y')
plt.show()
rotation matrix, M= [[ 0.8660254 -0.5 ] [ 0.5 0.8660254]]
We see that the x and y variables are now correlated. We can quantify this by calculating the covariance matrix and the correlation matrix: $$ \text{covariance}= \left( \begin{array}{cc} \left< xx \right> & \left< xy \right> \\ \left< yx \right> & \left< yy \right> \end{array} \right) \hspace{1in} \text{correlation}= \left( \begin{array}{cc} 1 & \frac{\left< xy \right>}{\sqrt{\left< xx \right> \left< yy \right>}} \\ \frac{\left< yx \right>}{\sqrt{\left< yy \right> \left< xx \right>}} & 1 \end{array} \right) $$ where the brackets denote averaging (conventional physics notation), i.e., $$ \left< xy \right>=\frac{1}{N}\sum_{i}^{N} x_i y_i $$
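As a sanity check on the averaging formula, we can compute the covariance matrix by hand and compare it against `np.cov`. One subtlety worth noting: `np.cov` divides by $N-1$ by default, while the formula above divides by $N$; passing `bias=True` selects the $1/N$ convention. The data below is a fresh, seeded stand-in for the notebook's `xyp`:

```python
import numpy as np

# Seeded stand-in for the rotated samples used in the text.
rng = np.random.default_rng(1)
N = 1000
xy = np.vstack([rng.normal(0, 7, N), rng.normal(0, 2, N)])
theta = np.pi / 6                            # 30 degrees
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
xyp = np.dot(M, xy).T

# <xy> = (1/N) * sum_i x_i y_i, computed on mean-centred data
centred = xyp - xyp.mean(axis=0)
cov_by_hand = centred.T @ centred / N

cov_numpy = np.cov(xyp.T, bias=True)         # bias=True -> divide by N
```

The two matrices agree to floating-point precision.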
np.cov(xyp.T) # covariance
array([[ 33.78440758, 18.68697036], [ 18.68697036, 16.95711535]])
np.corrcoef(xyp.T) # correlation
array([[ 1. , 0.78073699], [ 0.78073699, 1. ]])
Let's now find the new axes (eigenvectors) in which the covariance matrix $\mathbf{C}$ is diagonal. Recall that the eigenvalues $w$ and eigenvectors $\vec{v}$ satisfy the equation $\mathbf{C} \vec{v}=w \vec{v}$.
w,v=np.linalg.eig(np.cov(xyp.T)) #finds the eigenvalues and eigenvectors of the covariance matrix
print('The eigenvalues are %.2f and %.2f' % (w[0],w[1])) # eigenvalues
The eigenvalues are 45.86 and 4.88
Note that the eigenvalues are equal to the variances of the data along the eigenvectors. The square roots of the eigenvalues are fairly close to the standard deviations of the original, unrotated normal distribution.
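This claim can be checked directly: projecting the data onto each eigenvector and taking the variance should reproduce the eigenvalues. The snippet below uses a fresh, seeded stand-in dataset, since the notebook's samples are random:

```python
import numpy as np

# Seeded stand-in for the rotated dataset.
rng = np.random.default_rng(2)
xy = np.vstack([rng.normal(0, 7, 2000), rng.normal(0, 2, 2000)])
theta = np.pi / 6
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
xyp = np.dot(M, xy).T

C = np.cov(xyp.T)
w, v = np.linalg.eig(C)

proj = xyp @ v                          # coordinates along the eigenvectors
variances = proj.var(axis=0, ddof=1)    # ddof=1 matches np.cov's default
```

The variances along the eigenvectors match the eigenvalues to floating-point precision.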
print('The eigenvectors are')
print(v) #eigenvectors
The eigenvectors are [[ 0.83980583 -0.54288689] [ 0.54288689 0.83980583]]
The eigenvectors are linear combinations of the original axes.
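Because the covariance matrix is symmetric, its eigenvectors also form an orthonormal basis: each has unit length and they are mutually perpendicular. A quick check, using numbers similar to the covariance matrix printed above:

```python
import numpy as np

# Covariance values similar to those printed above (illustrative).
C = np.array([[33.78, 18.69],
              [18.69, 16.96]])
w, v = np.linalg.eig(C)

lengths = np.linalg.norm(v, axis=0)     # column norms: should both be 1
gram = v.T @ v                          # should be the identity matrix
```

So the matrix of eigenvectors acts as a rotation (possibly combined with a reflection) of the original axes.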
ah=0.1 # size of arrow head
f=1.1 # axes range
plt.figure(figsize=(5,5))
plt.subplot(111,aspect='equal')
plt.arrow(0,0,v[0,0],v[1,0],color='r',linewidth=2,head_width=ah,head_length=ah)
plt.arrow(0,0,v[0,1],v[1,1],color='r',linewidth=2,head_width=ah,head_length=ah)
plt.text(f*v[0,0],f*v[1,0],r'Eigenvector 1, $\vec{v_1}$ = %.2f $\vec{x}$ + %.2f $\vec{y}$' % (v[0,0],v[1,0]), fontsize=15)
plt.text(f*v[0,1],f*v[1,1],r'Eigenvector 2, $\vec{v_2}$ = %.2f $\vec{x}$ + %.2f $\vec{y}$' % (v[0,1],v[1,1]), fontsize=15)
plt.xlim([-f,f])
plt.ylim([-f,f])
plt.xlabel('x',fontsize=15)
plt.ylabel('y',fontsize=15)
plt.grid()
plt.show()
The eigenvectors (also known as principal components) correspond to the orthogonal directions of greatest and least variation. Let's plot them on the original data, with the length of each vector equal to the square root of its eigenvalue.
plt.figure(figsize=(5,5))
plt.plot(xd,yd,'bo',markersize=5,zorder=1,)
plt.axis('equal')
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.grid()
plt.title('Principal Components (eigenvectors) of random data', fontsize=12)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
scaled_eigenvec1 = [v[0,0]*np.sqrt(w[0]),v[1,0]*np.sqrt(w[0])] # eigenvector 1 scaled by the sqrt of its eigenvalue
scaled_eigenvec2 = [v[0,1]*np.sqrt(w[1]),v[1,1]*np.sqrt(w[1])] # eigenvector 2 scaled by the sqrt of its eigenvalue
plt.arrow(0,0,scaled_eigenvec1[0],scaled_eigenvec1[1],color='r',linewidth=2,head_width=1,head_length=1)
plt.arrow(0,0,scaled_eigenvec2[0],scaled_eigenvec2[1],color='r',linewidth=2,head_width=1,head_length=1)
plt.show()
The variance along each principal component (PC) is given by the eigenvalue $w$ of each principal component. Therefore the amount of variance explained by each PC is given by $$ \text{Variance explained by PC i}=\frac{w_i}{\sum_j w_j} $$
var_exp=100*w/sum(w)
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.bar(np.arange(len(w)),w,width=0.6,color='r')
plt.ylabel('Eigenvalue',fontsize=12)
plt.xticks(np.arange(len(w))+0.3,np.arange(len(w))+1)
plt.xlabel('Principal Component',fontsize=12)
plt.subplot(1,2,2)
plt.bar(np.arange(len(w)),var_exp,width=0.6,color='g')
plt.ylabel('Variance explained (%)',fontsize=12)
plt.xticks(np.arange(len(w))+0.3,np.arange(len(w))+1)
plt.xlabel('Principal Component',fontsize=12)
plt.show()
# pick a data point (x and y roughly between -10 and 10)
data_point = [8,8]
# plot data
plt.figure(figsize=(10,10))
plt.plot(xd,yd,'bo',markersize=4,zorder=1,alpha = 0.1)
plt.plot(data_point[0], data_point[1], 'bo',markersize=5, color='b')
plt.text(data_point[0]+.5,data_point[1],'('+str(data_point[0])+','+str(data_point[1])+')',color='b',fontsize='16')
plt.plot([-100,100],[0,0],'k')
plt.plot([0,0],[-100,100],'k')
plt.axis('equal')
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.grid()
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.arrow(0,0,v[0,0],v[1,0],color='r',linewidth=2,head_width=0.8,head_length=0.8)
plt.arrow(0,0,v[0,1],v[1,1],color='r',linewidth=2,head_width=0.8,head_length=0.8)
plt.text(15,12,'comp 1',color='r',fontsize='16')
plt.text(-7,15,'comp 2',color='r',fontsize='16')
z = v*100
plt.plot([-z[0,0],z[0,0]],[-z[1,0],z[1,0]],'r',linestyle = '--')
plt.plot([-z[0,1],z[0,1]],[-z[1,1],z[1,1]],'r',linestyle = '--')
# project the data point onto the eigenvector basis
# (v is orthonormal, so inv(v) equals v.T)
dot_product = np.dot(np.linalg.inv(v),data_point)
# component vectors of the data point along each eigenvector
x_comp = v[:,0]*dot_product[0] # projection onto eigenvector 1
y_comp = v[:,1]*dot_product[1] # projection onto eigenvector 2
# plot geometrical representation of this derivation:
plt.plot([data_point[0],x_comp[0]],[data_point[1],x_comp[1]],'k',linestyle = '--')
plt.plot([data_point[0],y_comp[0]],[data_point[1],y_comp[1]],'k',linestyle = '--')
plt.arrow(0,0,x_comp[0],x_comp[1],color='g',linewidth=1,head_width=0.3,head_length=0.3)
plt.arrow(0,0,y_comp[0],y_comp[1],color='g',linewidth=1,head_width=0.3,head_length=0.3)
plt.text(x_comp[0],x_comp[1]+1,str(int(round(dot_product[0]))),color='g',fontsize='16')
plt.text(y_comp[0],y_comp[1]-1,str(int(round(dot_product[1]))),color='g',fontsize='16')
plt.plot(dot_product[0], dot_product[1], 'ro', markersize=5)
plt.text(int(round(dot_product[0]))+.5,int(round(dot_product[1])),'('+str(int(round(dot_product[0])))+','+str(int(round(dot_product[1])))+')',color='g',fontsize='16')
plt.show()
Now let's compute the new coordinates of every point in the eigenvector basis and plot the result:
a=np.dot(np.linalg.inv(v),xyp.T)
plt.figure(figsize=(5,5))
plt.axis('equal')
plt.grid()
plt.plot(a[0],a[1],'bo',markersize=5)
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
plt.xlabel('Principal Component 1',fontsize=15)
plt.ylabel('Principal Component 2',fontsize=15)
plt.show()
This should resemble the original, unrotated dataset (possibly mirrored, since the sign of each eigenvector is arbitrary).
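We can confirm this numerically rather than by eye: in the eigenvector basis the covariance matrix becomes diagonal, so the new coordinates are uncorrelated and their variances are exactly the eigenvalues. A seeded stand-in version of the same computation:

```python
import numpy as np

# Seeded stand-in for the rotated dataset and its projection.
rng = np.random.default_rng(3)
xy = np.vstack([rng.normal(0, 7, 1000), rng.normal(0, 2, 1000)])
theta = np.pi / 6
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
xyp = np.dot(M, xy).T

w, v = np.linalg.eig(np.cov(xyp.T))
a = np.dot(np.linalg.inv(v), xyp.T)     # same projection as in the text

C_new = np.cov(a)                       # covariance in the new basis
```

The off-diagonal terms of `C_new` vanish (to floating-point precision) and its diagonal recovers the eigenvalues, which is exactly what PCA promises.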