Chapter 10 - Learning without Supervision

-- A Python Course for the Humanities


In the previous chapter we developed a system that, on the basis of labeled examples, attempts to learn a function to classify new, unseen examples. We do not always have the luxury of a labeled data set; in fact, most of the time only unlabeled data is available. In unsupervised machine learning, or learning without supervision, we attempt to create systems that detect patterns in our data, such as groupings or clusters. Given a collection of texts, we could for example measure the pairwise distances between all texts and use these distances to construct a grouping of the texts. Another example of unsupervised learning is the popular method of Topic Modeling, in which we attempt to find clusters of semantically coherent words that together form a topic.

In this chapter we will introduce you to some of the techniques for clustering your data without supervision. As is the case with supervised learning, there are many different approaches to clustering. We will discuss one of the most popular ones, hierarchical agglomerative clustering, develop a general hierarchical clustering module and implement a number of different clustering procedures.


Hierarchical Clustering

All clustering techniques follow a similar procedure. We start with a data set consisting of $n$ different data points. The end state is a data set with $k$ clusters, each consisting of a number of the original data points. The clustering procedure iteratively moves through all data points and assigns each data point to a particular cluster. Clustering techniques differ with respect to how this merging of data points happens. In this section we will look at hierarchical clustering, a clustering method that builds hierarchies of clusters. Hierarchical clustering techniques typically construct a so-called dendrogram, which has a top or root node covering all other data points. The leaf nodes of the dendrogram, or cluster tree, consist of the original data points. If we think of the original data points as singleton clusters, hierarchical clustering is an iterative procedure in which at each iteration two clusters are merged into a new cluster. This process is repeated until we arrive at the root node.

At each iteration, the two clusters with the highest similarity are combined. The definition of what counts as similar is what differentiates the various hierarchical clustering methods. One popular method is single-linkage clustering. Mathematically, the single linkage function – the distance $D(X,Y)$ between clusters $X$ and $Y$ – is described by the expression

$$D(X,Y)=\min_{x\in X, y\in Y} d(x,y),$$

where $X$ and $Y$ are any two clusters, and $d(x,y)$ represents the distance between the two data points $x$ and $y$. The two clusters $A$ and $B$ of which $D(A,B)$ is the smallest are merged into a new cluster $C$.
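
In code, the single linkage distance is simply the minimum over all pairwise distances between the members of the two clusters. The sketch below is only an illustration (the clusters and the distance function are made up for the occasion); we will build the real machinery step by step later in this chapter:

In [ ]:
def single_linkage_distance(X, Y, d):
    """Return the single linkage distance between clusters X and Y,
    i.e. the smallest distance d(x, y) over all x in X and y in Y."""
    return min(d(x, y) for x in X for y in Y)

# a toy example with one-dimensional data points and absolute difference
X = [1, 2, 4]
Y = [7, 9]
print(single_linkage_distance(X, Y, lambda x, y: abs(x - y)))  # |4 - 7| = 3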

Let's have a look at a small example to make this all a little more concrete. The following table presents a data set of apples. Each apple is described according to a number of different features:

   quality  color   firmness  taste
0  bad      red     firm      sweet
1  bad      red     firm      sour
2  bad      yellow  firm      sour
3  good     red     soft      sour
4  good     yellow  soft      sweet
5  bad      yellow  firm      sour

To construct clusters from these items, we need some way to compute the distance or similarity between two items. A very simple distance measure is the number of features on which two apples $A$ and $B$ differ, i.e. the size of the set difference between their feature sets (since every apple has the same number of features, it does not matter in which order we take the difference):

$$ d(A, B) = |A \setminus B| = |\{ x : x \in A, x \notin B \}| $$

Take apple$_0$ and apple$_2$ as an example:

In [ ]:
apple_a = {"bad", "red", "firm", "sweet"}    # apple 0
apple_b = {"bad", "yellow", "firm", "sour"}  # apple 2
len(apple_a.difference(apple_b))             # {"red", "sweet"} -> 2

We can do the same thing for all items to obtain the pairwise distances between all apples:

In [ ]:
apples = [(0, {"bad", "red", "firm", "sweet"}), 
          (1, {"bad", "red", "firm", "sour"}),
          (2, {"bad", "yellow", "firm", "sour"}),
          (3, {"good", "red", "soft", "sour"}),
          (4, {"good", "yellow", "soft", "sweet"}),
          (5, {"bad", "yellow", "firm", "sour"})]

n = len(apples)
# initialize an n x n matrix filled with zeros
distances = [[0 for i in range(n)] for _ in range(n)]
for i in range(n):
    for j in range(i):
        distance = len(apples[i][1].difference(apples[j][1]))
        distances[i][j] = distance
        # the matrix is symmetric, so mirror the value
        distances[j][i] = distances[i][j]

The result is a distance matrix that contains the distance for every pair of items:

In [ ]:
distances
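
If your code matches the cells above, the matrix looks like this:

[[0, 1, 2, 3, 3, 2],
 [1, 0, 1, 2, 4, 1],
 [2, 1, 0, 3, 3, 0],
 [3, 2, 3, 0, 2, 3],
 [3, 4, 3, 2, 0, 3],
 [2, 1, 0, 3, 3, 0]]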

If you look closely, you'll notice that the upper and lower triangle of the matrix contain the same values: the matrix is symmetric. Now that we have all pairwise distances, we can start clustering the examples. We begin by merging the two clusters whose data points are most similar. Can you figure out which clusters these are?

Indeed, the clusters containing apple$_2$ and apple$_5$. Let's merge these two clusters into a new cluster:

In [ ]:
# merge the clusters containing apple 2 and apple 5 into a new cluster with id 6
apples[2] = (6, (apples[2], apples[5]))
del apples[5]
apples

We now continue our clustering procedure and again select the two items that are most similar to each other. We see that $D(\text{apple$_0$}, \text{apple$_1$}) = 1$. The same holds for $D(\text{apple$_1$}, \text{apple$_2$})$ and $D(\text{apple$_1$}, \text{apple$_5$})$. In such cases we choose one of these pairs at random. Let's go for the pair of apple$_0$ and apple$_1$ and merge these two into a new cluster:

In [ ]:
apples[0] = (7, (apples[0], apples[1]))
del apples[1]
apples

We continue and select the next pair with the shortest distance. This is $D(\text{apple$_1$}, \text{apple$_2$})$. Since apple$_1$ is now part of cluster 7 and apple$_2$ of cluster 6, we merge clusters 6 and 7:

In [ ]:
apples[0] = (8, (apples[0], apples[1]))
del apples[1]
apples

Next on our list is the pair of apple$_3$ and apple$_4$ (under single linkage this pair ties with the pair apple$_1$ and apple$_3$; we again pick one at random):

In [ ]:
apples[1] = (9, (apples[1], apples[2]))
del apples[2]
apples

We conclude the clustering procedure by merging the last two clusters:

In [ ]:
apples[0] = (10, (apples[0], apples[1]))
del apples[1]
apples
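
Leaving out the feature sets, the nested tuple we have built encodes the complete dendrogram; schematically it looks like this:

(10, ((8, ((7, (0, 1)), (6, (2, 5)))), (9, (3, 4))))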

What does our clustering tell us? We can distinguish two main groups: cluster 8 (consisting of apple$_0$, apple$_1$, apple$_2$ and apple$_5$) and cluster 9 (consisting of apple$_3$ and apple$_4$). These two clusters suggest that the quality of the apples depends on their firmness and that, apparently, soft apples are considered better than firm ones.

Hopefully you now have a better feel for hierarchical clustering and for what the single linkage method is about. Now it is time to start on our own implementation of the hierarchical clustering algorithm.


Quiz!

a) We will begin by implementing a simple distance metric, called the Jaccard distance. This metric computes the dissimilarity between two sets by dividing the difference between the sizes of their union and their intersection by the size of their union:

$$d_J(A,B) = 1 - J(A,B) = { { |A \cup B| - |A \cap B| } \over |A \cup B| }$$
In [ ]:
def jaccard_distance(a, b):
    # insert your code here

# these tests should return True if your code is correct
print(jaccard_distance({'a', 'b', 'c'}, {'b', 'c', 'a'}) == 0.0)
print(round(jaccard_distance({'a', 'b', 'c'}, {'b', 'c'}), 2) == 0.33)

b) Write a function pairwise_distances that takes as input a list of examples and a distance function, and returns a matrix, represented as a nested list, containing all pairwise distances.

In [ ]:
def pairwise_distances(X, distance_fn=jaccard_distance):
    # insert your code here

# these tests should return True if your code is correct
X = [{'a', 'f', 'c'}, {'b', 'd', 'a'}, {'a', 'b', 'c'}, {'f', 'b', 'c'}]
print(pairwise_distances(X) == [[0,   0.8, 0.5, 0.5],
                               [0.8, 0,   0.5, 0.8],
                               [0.5, 0.5, 0,   0.5],
                               [0.5, 0.8, 0.5, 0  ]])

c) Next, we will write a function that takes as input a distance matrix represented as a nested list and returns the two indices $i$ and $j$ corresponding to the two clusters $A, B$ with the smallest distance $D(A, B)$:

In [ ]:
from itertools import combinations

def smallest_distance(dm):
    # insert your code here

# these tests should return True if your code is correct
distances = [[0, 1, 2, 3, 3, 2],
             [1, 0, 1, 2, 4, 1],
             [2, 1, 0, 3, 3, 0],
             [3, 2, 3, 0, 2, 3],
             [3, 4, 3, 2, 0, 3],
             [2, 1, 0, 3, 3, 0]]
print(smallest_distance(distances) == (2, 5))

So, we have created a function to compute the distance between two sets, a function to compute the pairwise distances between items given a distance function, and a function to extract the indices of the two clusters with the smallest distance. Next, we need to define a data structure to represent the hierarchical tree of clusters. First, we define a class named Cluster which represents a single node in a cluster tree. A Cluster consists of a cluster ID and a list of its child nodes. A Cluster object is actually no more than a nested list in Python in which each list has a unique ID. We therefore define Cluster as a subclass of the list object:

In [ ]:
class Cluster(list):
    """Represents a Cluster node in a Dendrogram. A Cluster can be
    initialized using 
    
    >>> c = Cluster(1)
    
    to create a cluster leaf node with id=1. You can also initialize 
    a non-terminal `Cluster` node using
    
    >>> c = Cluster(3, Cluster(1), Cluster(2))
    
    where Cluster(1) and Cluster(2) are the children of Cluster(3)."""
        
    def __init__(self, id, *children):
        self.id = id
        super(Cluster, self).__init__(children)
        
    def __repr__(self):
        childstr = ", ".join(str(c) for c in self)
        if self:
            return '%s(%s, [%s])' % (type(self).__name__, self.id, childstr)
        return '%s(%s)' % (type(self).__name__, self.id)
    
    def __str__(self):
        return self.pprint()
    
    def pprint(self, indent=0):
        s = '%s(%s' % (type(self).__name__, self.id)
        for child in self:
            if child:
                s += '\n' + ' ' * (indent + 2) + child.pprint(indent=indent+2)
            else:
                s += '\n' + ' ' * (indent + 2) + '%r' % child
        return s + ')'

We create a new Cluster using:

In [ ]:
c1 = Cluster(1)
c2 = Cluster(2)

To create a non-terminal Cluster we initialize a Cluster object as follows:

In [ ]:
print(Cluster(3, c1, c2))
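
This should print the nested structure produced by the pprint method, along the lines of:

Cluster(3
  Cluster(1)
  Cluster(2))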

Quiz!

A ClusterTree is an object that consists of multiple Cluster objects. At initialization, each Cluster node obtains an ID in the range $0$ to $n-1$, where $n$ is the number of data points to be clustered. You can also pass a list of labels to the labels argument of the constructor, which will then be used as IDs. The class ClusterTree is responsible for merging two Cluster nodes into a new Cluster. Complete the method merge. It takes as input two indices $i$ and $j$, corresponding to two Cluster objects $C_i$ and $C_j$. These two clusters are merged into a new cluster which takes the original position of cluster $C_i$. Don't forget to remove cluster $C_j$ and to give the new cluster an appropriate ID (the tests below assume that the first cluster created by a merge receives ID $n$, the next one $n+1$, and so on).

In [ ]:
class ClusterTree:
    """A ClusterTree, or Dendrogram consists of one or more
    `Cluster` objects. Initialize a `ClusterTree` using
    
    >>> tree = ClusterTree(n=10)
    
    where n is the number of original data points to be clustered."""
    
    def __init__(self, n, labels=None):
        self._n = n
        if labels is None:
            labels = range(n)
        self._clusters = [Cluster(i) for i in labels]
    
    def merge(self, i, j):
        # insert your code here
    
    def __str__(self):
        return '%s' % self._clusters[0]
        
# these tests should return True if your code is correct
tree = ClusterTree(5)
tree.merge(1, 2)
print(len(tree._clusters[1]) == 2)
print(tree._clusters[1].id == 5)

Now it's time for the trickiest part of our clustering algorithm: computing the linkage function. We have already implemented a function that extracts the indices of the two clusters that are closest to each other. But how do we find the indices of two clusters once they have been merged? One way would be to recursively go through the ClusterTree and extract the indices from there. There is, however, another way, which makes it possible to use the same smallest_distance function at each iteration of the clustering procedure. As we will see, this method is particularly beneficial because it allows us to implement a number of different linkage functions in an elegant and simple way.

Recall that the single linkage function – the distance $D(X,Y)$ between clusters $X$ and $Y$ – is described by the expression

$$D(X,Y)=\min_{x\in X, y\in Y} d(x,y),$$

where $X$ and $Y$ are any two clusters, and $d(x,y)$ represents the distance between the two data points $x$ and $y$. Since we are only interested in the minimal distance over all pairs $x\in X, y\in Y$, we can store that information directly in our distance matrix. Let's go through an example to show more clearly what this means.

Say we have a distance matrix containing the pairwise distance between 6 clusters:

distances = [[0, 1, 2, 3, 3, 1],
             [ , 0, 1, 2, 4, 1],
             [ ,  , 0, 3, 3, 0],
             [ ,  ,  , 0, 2, 3],
             [ ,  ,  ,  , 0, 3],
             [ ,  ,  ,  ,  , 0]]

The smallest distance can be found between cluster 2 and cluster 5, which can be obtained using our function smallest_distance:

>>> smallest_distance(distances)
(2, 5)

So far we haven't done anything different. Here comes the crucial step. Before we enter the next iteration of our clustering procedure, we update the distance matrix to reflect the minimal distances of all clusters with respect to cluster 2 and cluster 5. For example, we compare cluster 0 to both cluster 2 and cluster 5. The distance between cluster 0 and cluster 2 is 2; the distance between cluster 0 and cluster 5 is 1, which is the smaller of the two. We therefore update the larger entry (between cluster 0 and cluster 2) to 1, so that it represents the minimal distance between cluster 0 and either cluster 2 or cluster 5:

distances = [[0, 1, 1, 3, 3, 1],
             [ , 0, 1, 2, 4, 1],
             [ ,  , 0, 3, 3, 0],
             [ ,  ,  , 0, 2, 3],
             [ ,  ,  ,  , 0, 3],
             [ ,  ,  ,  ,  , 0]]

We do this for all remaining clusters. In this example nothing else changes, because the other entries already contain the minimal distances:

distances = [[0, 1, 1, 3, 3, 1],
             [ , 0, 1, 2, 4, 1],
             [ ,  , 0, 3, 3, 0],
             [ ,  ,  , 0, 2, 3],
             [ ,  ,  ,  , 0, 3],
             [ ,  ,  ,  ,  , 0]]

Since all distances between any cluster and the merged cluster (2, 5) are now stored in both the row and column of cluster 2 and those of cluster 5, we can remove one of the two (say, row and column 5) to obtain the following distance matrix:

distances = [[0, 1, 1, 3, 3],
             [ , 0, 1, 2, 4],
             [ ,  , 0, 3, 3],
             [ ,  ,  , 0, 2],
             [ ,  ,  ,  , 0]]

From here on the same procedure is repeated until there is only one cluster left. So, again we extract the two indices corresponding to the two clusters that are closest to each other. This time there is a tie: clusters 0, 1 and 2 all lie at distance 1 from one another. Say our function returns the first of these pairs:

>>> smallest_distance(distances)
(0, 1)

We update the distance matrix

distances = [[0, 1, 2, 3], 
             [ , 0, 3, 3], 
             [ ,  , 0, 2], 
             [ ,  ,  , 0]]

and go on with the next iteration.


Quiz!

a) Implement the function single_linkage. It takes as input a distance matrix and two indices $i$ and $j$, corresponding to the two clusters in the distance matrix that are closest to each other. Update the matrix according to the procedure described above and return the new matrix without row $j$ and column $j$.

In [ ]:
def single_linkage(dm, i, j):
    # insert your code here

# these tests should return True if your code is correct

distances = [[0, 1, 2, 3, 3, 1],
             [1, 0, 1, 2, 4, 1],
             [2, 1, 0, 3, 3, 0],
             [3, 2, 3, 0, 2, 3],
             [3, 4, 3, 2, 0, 3],
             [1, 1, 0, 3, 3, 0]]

print(single_linkage(distances, 2, 5) == [[0, 1, 1, 3, 3], 
                                          [1, 0, 1, 2, 4], 
                                          [1, 1, 0, 3, 3], 
                                          [3, 2, 3, 0, 2], 
                                          [3, 4, 3, 2, 0]])

b) Great. Now that we have our linkage function in place, we can start working on the final piece of the clustering algorithm: the main iterative loop, the skeleton of which is given below. Fill in the missing elements.

In [ ]:
def cluster(data_points, labels=None, linkage=single_linkage, distance_fn=jaccard_distance):
    # initialize a `ClusterTree` with n=len(data_points)
    tree = # insert your code here
    # compute the pairwise distances between all data points 
    # using the provided distance function
    dm = # insert your code here
    while len(dm) > 1:
        # extract the indices of the clusters corresponding to the 
        # two closest clusters in the distance matrix
        i, j = # insert your code here
        # update the distance matrix using the provided linkage function
        dm = # insert your code here
        # merge the two clusters in the ClusterTree:
        # insert your code here
    return tree

# these tests should return True if your code is correct
apples = [{"bad", "red", "firm", "sweet"}, {"bad", "red", "firm", "sour"},
          {"bad", "yellow", "firm", "sour"}, {"good", "red", "soft", "sour"},
          {"good", "yellow", "soft", "sweet"}, {"bad", "yellow", "firm", "sour"}]
tree = cluster(apples)
print(tree._clusters[0].id == 10)
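
Once your cluster function passes this test, it can be instructive to print the resulting tree: the pprint method of Cluster then draws the dendrogram as nested Cluster nodes. The exact shape may vary slightly depending on how your smallest_distance function breaks ties.

In [ ]:
print(tree)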

Single linkage is just one of many clustering strategies in hierarchical clustering. Another important linkage function is complete linkage. Complete linkage is very similar to single linkage, except that we take the maximal rather than the minimal distance between two clusters. The distance $D(X,Y)$ between clusters $X$ and $Y$ is described by the following expression

$$D(X,Y)= \max_{x\in X, y\in Y} d(x,y)$$

where $d(x,y)$ is the distance between elements $x \in X$ and $y \in Y$, and $X$ and $Y$ are two clusters.

The difference between single linkage and complete linkage lies only in the function used to combine the distances of a cluster to the two clusters being merged (i.e. min or max). We could write a function complete_linkage that is almost identical to our single_linkage function except for this one function. However, that would mean repeating quite a bit of code, which is never good practice. Instead, we make an abstraction over the two linkage functions in a function called general_linkage. It takes the same arguments as our single_linkage function plus an argument that specifies which function to use to combine the distances.

In [ ]:
def general_linkage(dm, i, j, distance_fn):
    # update row and column i with the combined distance (e.g. min or max)
    # between every other cluster k and the merged cluster (i, j)
    for k in range(len(dm)):
        if k != i and k != j:
            dm[i][k] = distance = distance_fn(dm[i][k], dm[j][k])
            dm[k][i] = distance
    # remove row j and column j from the matrix
    dm = [[val for c, val in enumerate(row) if c != j] 
               for r, row in enumerate(dm) if r != j]
    return dm

This general formulation allows us to redefine the single_linkage function as follows:

In [ ]:
def single_linkage(dm, i, j):
    return general_linkage(dm, i, j, min)
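
As a quick sanity check, the redefined single_linkage should reproduce the result from the quiz above:

In [ ]:
distances = [[0, 1, 2, 3, 3, 1],
             [1, 0, 1, 2, 4, 1],
             [2, 1, 0, 3, 3, 0],
             [3, 2, 3, 0, 2, 3],
             [3, 4, 3, 2, 0, 3],
             [1, 1, 0, 3, 3, 0]]

print(single_linkage(distances, 2, 5) == [[0, 1, 1, 3, 3],
                                          [1, 0, 1, 2, 4],
                                          [1, 1, 0, 3, 3],
                                          [3, 2, 3, 0, 2],
                                          [3, 4, 3, 2, 0]])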

Quiz!

Implement the complete_linkage function.

In [ ]:
def complete_linkage(dm, i, j):
    # insert your code here

# these tests should return True if your code is correct

distances = [[0, 1, 2, 3, 3, 1],
             [1, 0, 1, 2, 4, 1],
             [2, 1, 0, 3, 3, 0],
             [3, 2, 3, 0, 2, 3],
             [3, 4, 3, 2, 0, 3],
             [1, 1, 0, 3, 3, 0]]

print(complete_linkage(distances, 2, 5) == [[0, 1, 2, 3, 3],
                                            [1, 0, 1, 2, 4],
                                            [2, 1, 0, 3, 3],
                                            [3, 2, 3, 0, 2],
                                            [3, 4, 3, 2, 0]])

Practical: Clustering Distances of West-European Languages

Now that we have implemented all the functions needed to perform a hierarchical cluster analysis, let's apply the method to a more realistic and more interesting example than clustering apples. Within the family of Indo-European languages we can distinguish multiple sub-families, such as the Germanic languages (e.g. German and Dutch) or the Romance languages (e.g. French and Italian). The cell below lists, in the variable numerals, the first 10 numerals for each of the 10 languages stored in the variable languages:

In [ ]:
numerals = [
   ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"],
   ["een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "negen", "tien"],
   ["ien", "twa", "trije", "fjouwer", "fiif", "seis", "san", "acht", "njoggen", "tsien"],
   ["eins", "zwei", "drei", "vier", "funf", "sechs", "sieben", "acht", "neun", "zehn"],
   ["en", "to", "tre", "fire", "fem", "seks", "sju", "atte", "ni", "ti"],
   ["én", "to", "tre", "fire", "fem", "seks", "syv", "otte", "ni", "ti"],
   ["en", "tva", "tre", "fyra", "fem", "sex", "sju", "atta", "nio", "tio"],
   ["uno", "dos", "tres", "cuatro", "cinco", "seis", "siete", "ocho", "nueve", "diez"],
   ["un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf", "dix"],
   ["uno", "due", "tre", "quattro", "cinque", "sei", "sette", "otto", "nove", "dieci"]]

languages = ['English', 'Dutch', 'Frisian', 'German', 'Norwegian', 
             'Danish', 'Swedish', 'Spanish', 'French', 'Italian']

We apply hierarchical cluster analysis to investigate whether we can detect some interesting and meaningful groupings of these languages on the basis of their first 10 numerals.

We first need to decide how to compute the distance between two languages. One simple way is to take the sum of the distances between each numeral of language $A$ and the corresponding numeral of language $B$. Mathematically, the distance between two languages $A$ and $B$ is described by the following expression:

$$ D(A, B) = \sum^n_{i=1} d(A_i, B_i) $$

Here $A$ and $B$ are two vectors with $n$ items and $d$ is some distance function. But how do we compute the distance between two numerals, such as Dutch een and English one? There are more appropriate and advanced methods to compute this distance, but for the moment let's treat each numeral as a set of characters and reuse the jaccard_distance function we defined above:

In [ ]:
dutch_one = set("een")      # {'e', 'n'}
english_one = set("one")    # {'o', 'n', 'e'}
jaccard_distance(dutch_one, english_one)  # (3 - 2) / 3 = 0.33...

Quiz!

a) Write a function summed_jaccard_distance that takes as input two equally sized lists $A$ and $B$. Return $D(A, B) = \sum^n_{i=1} d(A_i, B_i)$, where $d$ is the Jaccard distance computed on the sets of characters of $A_i$ and $B_i$.

In [ ]:
def summed_jaccard_distance(A, B):
    # insert your code here

# these tests should return True if your code is correct
print(round(summed_jaccard_distance(numerals[0], numerals[4]), 2) == 5.57)
print(round(summed_jaccard_distance(numerals[5], numerals[6]), 2) == 4.8)

b) Perform a cluster analysis using single linkage and the summed jaccard distance on the numerals data. Report on your findings.

In [ ]:
solution = # insert your code here
print(solution)

c) This website contains a data set with the numbers from 1 to 10 in over 5000 languages. Visit the website and add some other languages (preferably from different language families) to the data set. Run the analysis again and report on your findings.


You've reached the end of the chapter.