Unsupvervised Learning

Clustering is a class of unsupervised learning methods that associates observations according to some specified measure of similarity (e.g. Euclidean distance).

K-means Algorithm

The K-means clustering algorithm associates each point $x_i$ in a set of input points $\{x_1, x_2, \ldots, x_m\}$ to $K$ clusters. Each cluster is specified by a centroid that is the average location of all the points in the cluster. The algorithm proceeds iteratively from arbitrary centroid locations, updating the membership of each point according to minimum distance, then updating the centroid location based on the new cluster membership.

In this sense, it is similar to the expectation maximization (EM) algorithm. Recall that in EM we iteratively assigned labels to observations, according to which mixture component they were most likely to have been derived from. K-means is simpler, in that we just use the minimum distance to assign membership.

The algorithm will have converged when the assignment of points to centroids does not change with each iteration.


  1. Initialize cluster centroids:
$$\mu^{(0)}_1, \ldots, \mu^{(0)}_k \in \mathbb{R}^n$$
  1. Iterate until converged:

    a. Set $c_i = \text{argmin}_j || x_i - \mu_j^{(s)} ||$

    b. Update centroids:

    $$\mu_j^{(s+1)} = \frac{\sum_{i=1}^m I[c_i = j] x_i}{\sum_{i=1}^m I[c_i = j]}$$

The K-means algorithm is simply a Gaussian mixture model with two restrictions:

  1. the covariance matrix is spherical:

    $$\Sigma_k = \sigma I_D$$

  2. the mixture weights are fixed:

    $$\pi_k = \frac{1}{K}$$

Hence, we are only interested in locating the appropriate centroid of the clusters. This serves to speed computation.

We can define the distortion function:

$$J(c,\mu) = \sum_{i]1}^m ||x_i - \mu_{c_i}||$$

which gets smaller at every iteration. So, k-means is coordinate ascent on $J(c,\mu)$

Choosing $k$

To check whether a chosen $k$ is reasonable, one approach is to compare the distances between the centroids to the mean distance bewween each data point and their assigned centroid. A good fit involves relatively large inter-centroid distances.

The appropriate value for k (the number of clusters) may depend on the goals of the analysis, or it may be chosen algorithmically, using an optimization procedure.

Example: clustering random points

In [ ]:
%matplotlib inline
import seaborn as sns; sns.set_context('notebook')
import numpy as np
import matplotlib.pyplot as plt

x, y = np.random.uniform(0, 10, 50).reshape(2, 25)
plt.scatter(x, y)

Let's start with $k=4$, arbitrarily assigned:

In [ ]:
centroids = (3, 3), (3, 7), (7, 3), (7, 7)
In [ ]:
In [ ]:
plt.scatter(x, y)
plt.scatter(*np.transpose(centroids), c='r', marker='+', s=100)

We can use the function cdist from SciPy to calculate the distances from each point to each centroid.

In [ ]:
from scipy.spatial.distance import cdist

distances = cdist(centroids, list(zip(x,y)))

We can make the initial assignment to centroids by picking the minimum distance.

In [ ]:
labels = distances.argmin(axis=0)
In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[labels])
plt.scatter(*np.transpose(centroids), c='r', marker='+', s=100)

Now we can re-assign the centroid locations based on the means of the current members' locations.

In [ ]:
new_centroids = [(x[labels==i].mean(), y[labels==i].mean())
                 for i in range(len(centroids))]
In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[labels])
plt.scatter(*np.transpose(new_centroids), c='r', marker='+', s=100)

So, we simply iterate these steps until convergence.

In [ ]:
centroids = new_centroids
iterations = 20

for _ in range(iterations):
    distances = cdist(centroids, list(zip(x,y)))
    labels = distances.argmin(axis=0)
    centroids = [(x[labels==i].mean(), y[labels==i].mean())
                 for i in range(len(centroids))]
In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[labels])
plt.scatter(*np.transpose(centroids), c='r', marker='+', s=100)

k-means using scikit-learn

The scikit-learn package includes a KMeans class for flexibly fitting K-means models. It includes additional features, such as initialization options and the ability to set the convergence tolerance.

In [ ]:
from sklearn.cluster import KMeans
from numpy.random import RandomState
rng = RandomState(1)

# Instantiate model
kmeans = KMeans(n_clusters=4, random_state=rng)
# Fit model

After fitting, we can retrieve the labels and cluster centers.

In [ ]:
In [ ]:

The resulting plot should look very similar to the one we fit by hand.

In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[kmeans.labels_])
plt.scatter(*kmeans.cluster_centers_.T, c='r', marker='+', s=100)

Example: Wine chemistry

Recall the wine dataset in wine.dat that includes thirteen chemical measurements carried out on each of 178 wines from three regions of Italy. If we did not have the labels for the wines, we might be interested to see whether a clustering algorithm could correctly assign labels to the wines.

In [ ]:
import pandas as pd

wine = pd.read_table("../data/wine.dat", sep='\s+')

attributes = ['Grape',
            'Malic acid',
            'Alcalinity of ash',
            'Total phenols',
            'Nonflavanoid phenols',
            'Color intensity',
            'OD280/OD315 of diluted wines',

wine.columns = attributes

In [ ]:
X = wine.copy()
y = X.pop('Grape')

To simplify the analysis, and aid visualization, we will again perform a PCA to isolate the majority of the variation into two principal components.

In [ ]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)
In [ ]:
wine['First Component'] = X_pca[:, 0]
wine['Second Component'] = X_pca[:, 1]

sns.lmplot('First Component', 'Second Component', 

We can now create a KMeans object with k=3, and fit the data with it.

In [ ]:
km_wine = KMeans(n_clusters=3, random_state=rng)

From this, we can extract the cluster centroids (in the cluster_center_ attribute) and the group labels (in labels_) in order to generate a plot of the classification result.

In [ ]:
np.round(km_wine.cluster_centers_, decimals=2)
In [ ]:

Now we can visually examine the clusters, and compare them to the known labels.

In [ ]:
wine['Cluster'] = km_wine.labels_ + 1

grid = sns.lmplot('First Component', 'Second Component', 
grid.ax.scatter(*wine.loc[wine.Grape!=wine.Cluster, ['First Component', 'Second Component']].values.T, 
             s=60, linewidth=1, facecolors='none', edgecolors='y')

scikit-learn includes a suite of well-known clustering algorithms. Like most unsupervised learning models in the scikit, they expect the data to be clustered to have the shape (n_samples, n_features):

  • sklearn.cluster.KMeans : The simplest, yet effective clustering algorithm. Needs to be provided with the number of clusters in advance, and assumes that the data is normalized as input (but use a PCA model as preprocessor).
  • sklearn.cluster.MeanShift : Can find better looking clusters than KMeans but is not scalable to high number of samples.
  • sklearn.cluster.DBSCAN : Can detect irregularly shaped clusters based on density, i.e. sparse regions in the input space are likely to become inter-cluster boundaries. Can also detect outliers (samples that are not part of a cluster).

Exercise: clustering baseball statistics

We can use clustering to try to find interesting groupings among sets of baseball statistics. Load the baseball dataset and run a clustering algorithm on the set of three statistics:

  • hit rate: hits / at bats
  • strikeout rate: strikeouts / at bats
  • walk rate: bases on balls /at bats

You should probably set a minimum number of at bats to qualify for the analysis, since there are pitchers that get only a handful of at bats each year.

Since we are clustering in 3 dimensions, you can visualize the output as a series of pairwise plots.

In [ ]:
import pandas as pd

baseball = pd.read_csv("../data/baseball.csv", index_col=0)
In [ ]:
## Write answer here


The major weakness of the k-means approach to clustering is that the number of clusters needs to be specified at the outset. However, there is usually uncertainty with respect to the appropriate number of clusters for a given dataset. A flexible alternative to k-means that allows for an unknown number of clusters involves using a Bayesian non-parametric mixture model instead (Kulis and Jordan 2011). In particular, a Dirichlet process (DP) mixture model, which we have seen in a previous lecture, probabilistically assigns observations to clusters, using a stick-breaking algorithm.

Recall the definition of a finite mixture model:

$$f(y) = \sum_{h=1}^{k} \pi_h \mathcal{K}(y|\theta_h)$$

where $k$ is the number of mixture components, $\pi_h$ is the mixing coefficient for component $h$, and $K$ specifies the mixing components (e.g. a Gaussian distribution), which has parameters $\theta_h$ for each component.

A DP mixture extends this by placing a Dirichlet prior of dimension $k$ on the mixing coefficients. The distribution over the group indicators can then be specified as a categorical distribution:

$$\begin{aligned} \mathbf{\pi} &\sim \text{Dirichlet}(k, \pi_0) \\ z_1,\ldots,z_n &\sim \text{Categorical}(\mathbf{\pi}) \\ \end{aligned}$$

We might then specify the observations as being a mixture of Gaussians, whose means are drawn from an appropriate prior distribution $P$:

$$\begin{aligned} \theta_1,\ldots,\theta_k &\sim P \\ y_1,\ldots,y_n &\sim N(\theta_{z[i]}, \sigma I) \end{aligned}$$

To demonstrate, we will implement a DP to cluster the iris dataset.

In [ ]:
from sklearn.decomposition import PCA
from sklearn import datasets

iris = datasets.load_iris()

pca = PCA(n_components=2, whiten=True).fit(iris.data)
X_pca = pca.transform(iris.data)
y = iris.target

The sklearn.mixture module includes a variety of Gaussian Mixture Models, including the BayesianGaussianMixture which fits the mixture using either Dirichlet distribution priors or Dirichlet process priors, and fits them using variational inference.

In [ ]:
from sklearn.mixture import BayesianGaussianMixture
In [ ]:
K = 10

dp_mixture = BayesianGaussianMixture(weight_concentration_prior_type="dirichlet_process", mean_precision_prior=1,
                n_components=K, reg_covar=0, init_params='random', weight_concentration_prior=1e5)

In [ ]:
import matplotlib as mpl
import matplotlib.gridspec as gridspec

colors = np.array(['#0072B2', '#F0E442', '#D55E00'])

def plot_ellipses(ax, weights, means, covars):
    for n in range(means.shape[0]):
        eig_vals, eig_vecs = np.linalg.eigh(covars[n])
        unit_eig_vec = eig_vecs[0] / np.linalg.norm(eig_vecs[0])
        angle = np.arctan2(unit_eig_vec[1], unit_eig_vec[0])
        # Ellipse needs degrees
        angle = 180 * angle / np.pi
        # eigenvector normalization
        eig_vals = 2 * np.sqrt(2) * np.sqrt(eig_vals)
        ell = mpl.patches.Ellipse(means[n], eig_vals[0], eig_vals[1],
                                  180 + angle)

def plot_results(ax1, ax2, estimator, X, y):
    ax1.scatter(X[:, 0], X[:, 1], s=15, marker='o', color=colors[y], alpha=0.8)
    plot_ellipses(ax1, estimator.weights_, estimator.means_,

    ax2.yaxis.grid(True, alpha=0.7)
    for k, w in enumerate(estimator.weights_):
        ax2.bar(k - .45, w, width=0.9, color='#56B4E9', zorder=3)
        ax2.text(k, w + 0.007, "%.1f%%" % (w * 100.),
    ax2.set_xlim(-.6, K - .4)
    ax2.set_ylim(0., 1.1)
    ax2.tick_params(axis='y', which='both', left='off',
                    right='off', labelleft='off')
    ax2.tick_params(axis='x', which='both', top='off')

    ax1.set_ylabel('Estimated Mixtures')
    ax2.set_ylabel('Weight of each component')
In [ ]:
plt.figure(figsize=(4.7 * 3, 8))
plt.subplots_adjust(bottom=.04, top=0.90, hspace=.05, wspace=.05,
                    left=.03, right=.99)

gs = gridspec.GridSpec(3, 1)
plot_results(plt.subplot(gs[:2]), plt.subplot(gs[2]), dp_mixture, X_pca, y)

As we have shown, the Dirichlet process mixture results in infinite mixture models which do not fix the number of clusters in the data a priori. However, Bayesian non-parametric models require fitting via sampling algorithms or variational inference techniques that are non-trivial to implement and scale poorly with large data. This is in contrast to k-means, which is straightforward to implement and scales easily.

It can be shown that the k-means algorithm is a limiting special case of the EM agorithm, if all of the covariance matrices associated with the clusters in a Gaussian mixture model set to $\sigma I$ and we let $\sigma$ go to zero. We can apply a similar limit to the Dirichlet process, using a Gibbs sampling algorithm. The result is a method with hard (rather than probabilistic) cluster assignments, but allows new clusters to be formed when points are far enough from existing cluster centroids.

Suppose in a Gaussian mixture model, all gaussians have the same covariance $\sigma I$, then the E-step of the EM algorithm becomes:

$$\gamma(z_{ic}) = \frac{\pi_c \exp(-\frac{1}{2}||x_i - \pi_c||^2_2)}{\sum_{j=1}^k \pi_j \exp(-\frac{1}{2}||x_i - \pi_j||^2_2)}$$

where $\gamma(z_{ic})$ is the probability of assigning point $i$ to cluster $c$. As $\sigma \rightarrow 0$, this probability approaches zero for all clusters except the closest one. The M-step simply recomputes the cluster means. Hence, this is an equivalent update to k-means.

We can derive an analogous hard clustering algorithm, based on a Dirichlet process mixture model. We first define the baseline distribution $G_0$ to be a zero-mean Gaussian with covariance $\rho I$. This allows us to use a straightforward Gibbs sampling update that results in new points being assigned to a new cluster with probability:

$$\frac{\alpha}{Z}[2\pi(\rho + \sigma)]^{(d/2)} \exp\left( -\frac{1}{2(\rho + \sigma)}||x_i||^2 \right)$$

and to existing cluster $c$ with probability:

$$\frac{n_{-i,c}}{Z}[2\pi\sigma]^{(d/2)} \exp\left( -\frac{1}{2\sigma}||x_i - \mu_c||^2_2 \right)$$

where $Z$ is a normalizing constant. We define $\alpha = (1 + \rho/\sigma)^{d/2} \exp\left(-\frac{\lambda}{2\sigma}\right)$, for some $\lambda$. The Gibbs sampling update becomes:

$$\hat{\gamma}(z_{ic}) = \frac{n_{-i,c} \exp \left(-\frac{1}{2}||x_i - \pi_c||^2 \right)}{n_{-i,c} \exp\left(-\frac{\lambda}{2\sigma} - \frac{||x_i||^2}{2(\rho + \sigma)}\right) + \sum_{j=1}^k n_{-i,j} \exp \left(-\frac{1}{2\sigma}||x_i - \pi_j||^2 \right)}$$

for existing clusters, and:

$$\hat{\gamma}(z_{i,new}) = \frac{\exp \left(-\frac{1}{2\sigma}\left[\lambda + \frac{\sigma}{\rho + \sigma}||x_i ||^2 \right]\right)}{n_{-i,c} \exp\left(-\frac{\lambda}{2\sigma} - \frac{||x_i||^2}{2(\rho + \sigma)}\right) + \sum_{j=1}^k n_{-i,j} \exp \left(-\frac{1}{2\sigma}||x_i - \pi_j||^2 \right)}$$

for new clusters.

As we allow $\sigma \rightarrow 0$ and leave $\rho$ fixed, the $\lambda$ term dominates the numerator. The result is that the probabilities become binary, with the closest cluster converging to one and the others to zero. This becomes identical to the k-means cluster assignment step, except that when the Euclidean distance is greater than $\lambda$, we create a new cluster.

The final step is to sample a mean for a new cluster, should one be created. This is taken from the posterior resulting from the prior $G_0$ and the likelihood of the single observation $x_i$ that seeds the new cluster. Since these are both Gaussian, the posterior will be Gaussian as well, with mean and covariance:

$$\begin{aligned} \tilde{\mu}_c &=& \left(1 + \frac{\sigma}{\rho n_c}\right)^{-1} \\ \tilde{\Sigma}_c &=& \frac{\sigma \rho}{\sigma + \rho n_c}I \end{aligned}$$

But, as $\sigma \rightarrow 0$, the mean of the gaussian approaches $\bar{x}_c$ and the covariance goes to zero, so we simply choose $\bar{x}_c$ as the cluster center.

DP-means algorithm

  1. Initialize number of clusters to 1, and assign all observations to that cluster. Calculate cluster mean to be global mean.
  2. Specify cluster penalty parameter $\lambda$
  3. Initialize cluster indicators: $z_1 = z_2 = \ldots, = z_n = 1 $
  4. Repeat until convergence:

    • For each data point $x_i$:

      • compute distance from means $d_{ic} = ||x_i - \mu_c||^2$ for $c=1,\ldots,k$
      • If $\min_c(d_{ic}) > \lambda$ set $k = k+1$, $z_i = k$, $\mu_k = x_i$
      • Otherwise set $z_i = \text{argmin}_c(d_{ic})$
    • Generate clusters $l_1, \ldots, l_k$ from $z_1,\ldots,z_n$

    • Recompute cluster means: $\mu_j = \frac{1}{|l_j|} \sum_{x \in l_j} x$

In [ ]:
def dp_means(x, lam, max_iter=100, tol=1e-5, metric='euclidean'):
    x = np.array(x)
    n = x.shape[0]
    k = 1
    # Initialize cluster indicators
    z = np.ones(n, int)
    # Initialize with single cluster of all observations
    mu = [x.mean(0)]
    # Initialize variables
    converged = False
    iteration = 0
    ss = np.inf

    # Iterate until converged or maxed out
    while (not converged) and (iteration < max_iter):
        # Calculate distances for all points
        d = cdist(x, np.array(mu), metric=metric)
        for i in range(n):
            if np.min(d[i]) > lam:
                # Create new group

                k += 1

                z[i] = k - 1

                mu += [x[i]]

                # Assign to closest group

                z[i] = np.argmin(d[i])
        for j in range(k):
            # Recalculate centroids
            if (z==j).sum():
                indices = np.where(z==j)[0]
                mu[j] = np.mean(x[indices], 0)
        ss_old = ss
        # Calcuate sum of squared distances to use as convergence criterion
        ss = np.sum([[(x[i,j] - mu[z[i]][j])**2 for j in range(2)] for i in range(n)])

        ss_diff = ss_old - ss
        if ss_diff < tol:
            converged = True
        iteration += 1
    return(dict(centers=np.array(mu), z=z, k=k, iterations=iteration, converged=converged, ss=ss))
In [ ]:
x,y = X_pca.T
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for i,c in enumerate([2, 3, 4]):
    dpm = dp_means(X_pca, c, metric='seuclidean')
    axes[i].scatter(x, y, c=dpm['z'])
    axes[i].scatter(*dpm['centers'].T, c='r', marker='+', s=100)
    axes[i].set_title(r'$\lambda$={0}, k={1}'.format(c, dpm['k']))
In [ ]:
clusters = []
lambdas = np.linspace(2, 4)
for c in lambdas:
    dpm = dp_means(X_pca, c, metric='euclidean')
In [ ]:
plt.plot(lambdas, clusters)

Exercise: Wine clustering

Try running DP-means on the wine chemistry dataset. Try varying the value for lam to see where the number of clusters stabilizes.

In [ ]:
# Write your answer here


  1. Kulis B and Jordan MI. Revisiting k-means: New Algorithms via Bayesian Nonparametrics. arXiv preprint arXiv:11110352 2011.