# Workshop on Data Science in Africa¶

## Introduction¶

This notebook will allow you to run the clustering examples presented in the clustering lecture. We will cover three clustering algorithms:

1. K-means clustering
2. Gaussian Mixture Models
3. Hierachical clustering

## K-means Clustering¶

As discussed in the lecture, K-means works by iterating through two steps:

1. Assigning data points to closest center
2. Recomputing cluster centers based on current assignment

When the cluster assignments don't change, the algorithm has converged.

### Example¶

We will consider an example with 2D data generated from a 2 component Gaussian Mixture Model. It is often important to test algorithms with toy data to ensure our implementations are correct.

#### Generating the toy data¶

We will generate and plot 20 data points from a mixture of 2 Gaussians (10 data points from each component).

In [6]:
import numpy as np
import pylab as plt

#Set the random seed for reproducability
np.random.seed(123)
#Generate the data
X1=np.random.multivariate_normal(np.array([1,1]),np.diag([.1,.1]),10)#10 points from the first GMM component
X2=np.random.multivariate_normal(np.array([3,3]),np.diag([.1,.1]),10)#10 points from the second GMM component
X=np.concatenate((X1,X2))

#Plot the data
%matplotlib inline
plt.plot(X[:,0],X[:,1],'bo',markersize=10)
plt.xlim([0,5]);
plt.ylim([0,5]);


We see that the two distinct clusters are clearly visible. If we run the K-mean algorithm on these data we can recover these clusters. We write a function that implements the two steps in the K-means algorithm. The inputs to this algorithm are the data and the current cluster centers. It returns a cluster assignment based on the input centers, the cost after the assignment step and new cluster centers based on the new assignment.

In [7]:
def kmeans(data,centers):

new_cent=np.zeros(centers.shape)
#assignment step
dist=np.zeros((data.shape[0],centers.shape[0]))
for i in range(centers.shape[0]):
dist[:,i]=np.sum((data-centers[i,:])**2,1)

assign=np.argmin(dist,1)

#compute new centers and cost
cost=0
for i in range(centers.shape[0]):
new_cent[i,:]=np.mean(data[assign==i,:],0)
cost+=np.sum(np.sum((data[assign==i,:]-centers[i,:])**2,1))

return new_cent,cost,assign


Let us run through one iteration of the K-means algorithm.

In [8]:
#We need a value of K
K=2 # we assume there are two clusters

#random starting points
centers=np.random.multivariate_normal(np.array([2,2]),np.diag([1,1]),K)

#run one iteration of K-means
new_cent,cost,assign=kmeans(X,centers)

#plot the assignment based on input cluster centers
col=['r','g'] #colour code for the two clusters
plt.plot(X[assign==0,0],X[assign==0,1],col[0]+'o',markersize=10)
plt.plot(X[assign==1,0],X[assign==1,1],col[1]+'o',markersize=10)
plt.plot(centers[0,0],centers[0,1],col[0]+'x',markersize=10,mew=2) #input center
plt.plot(centers[1,0],centers[1,1],col[1]+'x',markersize=10,mew=2) #input center
plt.xlim([0,5]);
plt.ylim([0,5]);

In [9]:
#Plot the new centers
centers=new_cent # this will be the centers in the next iteration
plt.plot(X[assign==0,0],X[assign==0,1],col[0]+'o',markersize=10)
plt.plot(X[assign==1,0],X[assign==1,1],col[1]+'o',markersize=10)
plt.plot(centers[0,0],centers[0,1],col[0]+'x',markersize=10,mew=2)
plt.plot(centers[1,0],centers[1,1],col[1]+'x',markersize=10,mew=2)
plt.xlim([0,5]);
plt.ylim([0,5]);


Write a for loop that calls the function kmeans for a number of iterations (say 4) and monitor any change in cluster assignment.

In [ ]:
##Code box for your solution


Write a while loop that allows you to terminate the algorithm when the cluster assignment doesn't change.

In [32]:
##Code box for your solution


### K-means for Image compression (Vector Quantization)¶

The K-means algorithm can be used for image compression. In this example we will demonstrate it on an image of Mzee Jomo Kenyatta. We divide the image into 2-by-2 blocks which we treat as vectors in $\mathbf{R}^4$. We then cluster these blocks and use the assigned cluster centers to represent the image.

We will use the Python Imaging Library (PIL) to access the image. We will also use the scikit-learn toolbox of implementations of machine learning algorithms for K-means.

In [15]:
import PIL
from sklearn.cluster import KMeans
import matplotlib.cm as cm #colour map to display grayscale image

#Obtain the image
img=PIL.Image.open('data/jomo.jpg')

#Display the image
#img.show()
plt.imshow(np.asarray(img),cmap = cm.Greys_r)

In [44]:
A=np.asarray(img)


We need to get the 2-by-2 blocks from the image array and flatten them to vectors in $\mathbf{R}^4$ before running K means.

In [68]:
#get the vectors
px=img.size[0]
vq=np.zeros(((px*px)/4,4))
k=0
for i in range(0,A.shape[0],2):
for j in range(0,A.shape[1],2):
vq[k,:]=np.ndarray.flatten(A[i:i+2,j:j+2])
k+=1

kmeans = KMeans(init='k-means++', n_clusters=2, n_init=10)
kmeans.fit(vq)
cent=kmeans.cluster_centers_


Plot the quantized image

In [69]:
#quantized image
img_mtx2=np.zeros(img.size)
k=0
for i in range(0,img.size[0],2):
for j in range(0,img.size[1],2):
img_mtx2[i:i+2,j:j+2]=np.reshape(np.round(cent[kmeans.predict(vq[k,:])]),(2,2))
k+=1

In [70]:
plt.imshow(img_mtx2,cmap = cm.Greys_r)


## Gaussian Mixture Models¶

In this example we will use GMMs to cluster the energy of speech frames and use this information to determine whether a speech frame corresponds to speech or silence. Let's start by using the scipy library to plot a mixture of Gaussians.

In [71]:
from scipy.stats import norm


Let's plot the pdf of a Gaussian with mean zero and variance 1. Vary the parameters of the distribution and see what happens.

In [75]:
x=np.linspace(-5,5,100)
plt.plot(x,norm.pdf(x,0,1),linewidth=3)
plt.xlabel(r'$x$',fontsize=16)
plt.ylabel(r'$p(x)$',fontsize=16)
plt.title(r'Univariate Gaussian with $\mu=0$ and $\sigma=1$',fontsize=16)


Now let us plot a GMM with three components.

In [76]:
plt.plot(x,0.4*norm.pdf(x,-3,.5),'r',linewidth=1)
plt.plot(x,0.2*norm.pdf(x,1,1),'g',linewidth=1)
plt.plot(x,0.4*norm.pdf(x,3,.5),'b',linewidth=1)
plt.plot(x,0.4*norm.pdf(x,-3,.5)+0.2*norm.pdf(x,1,1)+0.4*norm.pdf(x,3,.5),'m',linewidth=2)
plt.xlabel(r'$x$',fontsize=16)
plt.ylabel(r'$p(x)$',fontsize=16)


## Voice Activity Detection¶

We now experiment with voice activity detection by modelling the log energy of speech frames using a GMM. We will use scipy to access the speech sample stored in a WAV file

In [13]:
import scipy.io.wavfile

#open the wav file
freq=x[0]#sampling frequency
signal=x[1]/np.float(2**15);#normalize to range -1 to 1, The numbers in x[1] are 16 bit signed integers


Let us plot this speech signal which contains the first seven digits in Kiswahili.

In [81]:
plt.plot(np.arange(len(signal))*(1./freq),signal)
plt.xlabel('Time (seconds)')


We see that the location of the seven digits can be clearly see and correspond to areas of high energy in the signal. We divide the signal into segments with 256 samples each and determine the energy. We then plot a histogram of the log energy. It is clear that the data is multimodal

In [86]:
blk_size=256
num_blk=int(np.floor(len(signal)/blk_size))
energy=np.zeros(num_blk)

for i in range(num_blk):
energy[i]=np.sum(signal[i*blk_size:(i+1)*blk_size]**2)

plt.hist(np.log(energy),50)
plt.xlabel('Logarithm of block energy')



We can try to model the data using a single Gaussian.

In [87]:
mu, std = norm.fit(np.log(energy))
plt.hist(np.log(energy), bins=50, normed=True)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'r', linewidth=3)
plt.xlabel('Logarithm of block energy')


We see that this is not a good model of the data. We then fit a GMM with 2 components using the Expextation maximization implementation in scikit-learn.

In [89]:
from sklearn import mixture

g = mixture.GMM(n_components=2)
g.fit(np.log(energy))
p2 = g.weights_[0]*norm.pdf(x, g.means_[0], np.sqrt(g.covars_[0]))+g.weights_[1]*norm.pdf(x, g.means_[1], np.sqrt(g.covars_[1]))
plt.hist(np.log(energy), bins=50, normed=True)
plt.plot(x, p2, 'r', linewidth=3)
plt.xlabel('Logarithm of block energy')


Try and fit GMMs with more components and see what happens.

We can compute the probability of each frame belonging to each cluster. To get the cluster corresponding to speech, we determine the cluster with the highest mean.

In [11]:
speech_comp=np.argmax(g.means_)


Then we compute the probability of each frame belonging to each of the clusters.

In [12]:
pred_prob=g.predict_proba(np.log(energy))


We can classify a segment as being speech if the probability of the segment belonging to the cluster with the highest mean is above a given threshold.

In [14]:
#set threshold
thresh=0.5
#determine speech blocks
speech=pred_prob[:,speech_comp]>thresh

#Now plot the speech signal and indicate the regions classified as speech
plt.figure()
plt.plot(np.arange(len(signal))*(1./freq),signal)
for i in range(num_blk):
if speech[i]:
plt.plot(np.array([i*blk_size,(i+1)*blk_size])*(1./freq),[0,0],'r-',linewidth=4)

plt.xlabel('Time (seconds)')


What do you think about this classification? Try and change the threshold and the number of clusters. How does this change the classification.

## Hierachical Clustering¶

This is an approach to clustering that yields a hierarchy of clusters. Clusters in one level of the hierarchy are formed by merging clusters in the lower level and at the lowest level of the hierarchy each datum is in its own cluster. We will demonstrate hierarchical clustering using the scipy module scipy.cluster.hierarchy

In [16]:
import scipy.cluster.hierarchy


We then generate 6 data points from a 2 mixture GMM. Three data points from each component. We plot the data using the index of the data point.

In [21]:
np.random.seed(123)
X1=np.random.multivariate_normal(np.array([1,1]),np.diag([.1,.1]),3)
X2=np.random.multivariate_normal(np.array([3,3]),np.diag([.1,.1]),3)
X=np.concatenate((X1,X2))

%matplotlib inline
for i in range(X.shape[0]):
plt.text(X[i,0],X[i,1],str(i),color='b',fontsize=16 )

plt.xlim([0,5]);
plt.ylim([0,5]) ;


We then perform hierarchical clustering using single linkage.

In [24]:
#hierarchical clustering
Z=scipy.cluster.hierarchy.single(X)

#print Z to show the steps in the clustering process
print Z


The variable $Z$ summarises the steps covered in the lecture. They are

1. Merge 0 and 2 to form a new cluster and label it 6
2. Merge 3 and 5 to form a new cluster and label it 7
3. Merge 4 and 7 to form a cluster with the objects $\{3,4,5\}$ and label it 8
4. Merge 1 and 6 to form a cluster with the objects $\{0,1,2\}$ and label it 9
5. Merge 8 and 9 to form a cluster with all the objects

We can plot a dendogram.

In [28]:
scipy.cluster.hierarchy.dendrogram(Z,leaf_font_size=16)