Notebook

Introduction¶

We use the well known Iris data set to explore K-Means clustering further and use the heuristic "elbow" method to pick a "good" number of clusters. The Iris data set is found in the UCI Repository [1] but is also part of the Python test data. Here we use SciPy and NumPy to do our computations so this data is already available in the SciPy distribution. We plot the elbow plot and the K-Means cluster assignment for the final clustering.

Methods¶

As per a discussion on StackOverflow [1] we use NumPy and SciPy to do k-means clustering on the well-known UN countries data set using number of clusters K = 1 to 10.

Although we suspect that there are probably fewer than 10 natural clusters, this allows us to plot a measure called "within cluster sum-of-squares" vs. number of clusters.

This measure tracks how "tight" a cluster is. It operates on each cluster and adds up the squares of the distance of each point in a cluster from the centroid of the cluster. That is, for each point in a cluster we take the distance between that point and the centroid of the cluster and square it. We do this for each point in the cluster and then we sum it for that cluster. Then we do this for each cluster.

When we have a "good" or tight cluster, individual distances will be small and hence the sum of squares for that cluster be small. For a "bad" or loose cluster the opposite is true.

Now further, when we increase the value of K, the value of "within-cluster-sum-of-squares" will drop as we have more clusters hence smaller distances to centroids. But each successive increase in K will not give the same drop. At some point the improvement will start to level off. We call that value of K the elbow and use that as the "good" value of K.

We do this now for the UN countries data set.

Results¶

We now use our modeling software to import data, generate the elbow curve, decide on a k, and then run the clustering over our data. We then plot the clusters to visualize how the software has allocated cluster numbers to data points and interpret the results.

In [1]:

%pylab inline
# We run the following SciPy and NumPy code in [1] 
# and generate the plots mentioned above using Matplotlib 

# load the UN dataset transformed to float with 4 numeric columns, 
# lifeMale,lifeFemale,infantMortality and GDPperCapita

fName = ('../datasets/UN4col.csv')
fp = open(fName)
X = np.loadtxt(fp)
fp.close()

Populating the interactive namespace from numpy and matplotlib

In [8]:

import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

##### cluster data into K=1..10 clusters #####
#K, KM, centroids,D_k,cIdx,dist,avgWithinSS = kmeans.run_kmeans(X,10)

K = range(1,10)

  # scipy.cluster.vq.kmeans
KM = [kmeans(X,k) for k in K] # apply kmeans 1 to 10
centroids = [cent for (cent,var) in KM]   # cluster centroids

D_k = [cdist(X, cent, 'euclidean') for cent in centroids]

cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]  

In [10]:

kIdx = 2
# plot elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
      markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
tt = plt.title('Elbow for K-Means clustering')  

So we see that a good K to use in our model would be 3. We now use the KMeans modeling software to fit clusters to our data and then plot how the software has clustered our data. In this case we look at how the data cluster when we plot Infant Mortality against GDP.

In [12]:

from sklearn.cluster import KMeans
km = KMeans(3, init='k-means++') # initialize
km.fit(X)
c = km.predict(X) # classify into three clusters

In [14]:

# see the code in helper library kmeans.py
# it wraps a number of variables and maps integers to categoriy labels
# this wrapper makes it easy to interact with this code and try other variables
# as we see below in the next plot
import kmeans as mykm
(pl0,pl1,pl2) = mykm.plot_clusters(X,c,3,2) # column 3 GDP, vs column 2 infant mortality. Note indexing is 0 based

Here we see some patterns, obvious in retrospect. The countries with GDP (in US Dollars) below 10K have rapidly rising infant mortality as GDP drops. On the other hand as GDP rises we see rapidly decreasing infant mortality, which is as we know, a correlate of financial prosperity, i.e. high GDP.

We also see 3 clusters which we can informally call, the underdeveloped, the developing and the developed countries, based on, respectively, GDP (in US Dollars) below 10K, between 10K and 20K and finally greater than 20K.

What would happen if we tried other dimensions to cluster on, say lifeMale and GDPperCapita. Let's see.

In [15]:

(pl0,pl1,pl2) = mykm.plot_clusters(X,c,3,0,False)

And similarly with lifeFemale vs GDPperCapita.

In [16]:

(pl0,pl1,pl2) = mykm.plot_clusters(X,c,3,1,False)

In both the last two cases we see an opposite trend to infant mortality, where life expectancy rises rapidly as GDP grows, but drop precipitously even to below 40 yrs for countries with the lowest GDP.

Sections of code above are taken from a StackOverflow discussion [1].
Authorship of these segments is due to user Amro [2] on StackOverflow.
The discussion [1] has greater detail and more extensive examples and the reader is referred there for more depth.

Exercise¶

Follow the link to the StackOverflow discussion.
Look at the handwriting recognition dataset.
Import it and run the code in the rest of the discussion.
Do you get similar results?

Conclusions¶

Using the elbow plot we see that the largest drops in "average within-cluster sum-of squares" occur from 1 to 2 and from 2 to 3 clusters. After that the drops are much smaller and decreasing. We pick 3 as the best number of clusters.

From our domain knowledge we know that our data cluster into underdeveloped, developing and developed countries respectively based on GDP. And using k=3 we see this in our cluster plots as well.

References¶

[1] http://stackoverflow.com/questions/6645895/calculating-the-percentage-of-variance-measure-for-k-means
[2] http://stackoverflow.com/users/97160/amro

In [9]:

from IPython.core.display import HTML
def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

Out[9]:

In [9]: