At the conclusion of this micro assignment, participants should be able to:
Before starting this micro assignment, participants should be able to:
Content used in this assignment is based upon information in the following sources:
For this micro assignment, we are going to add functionality to a previous programming assignment (PA2) where we implemented the k-means clustering algorithm. Please re-read the specifications for the k-means assignment and review your code.
A problem with k-means clustering is that we may not know what k is (though we could try several and compare the resulting cluster quality). Another way to cluster, avoiding such a parameter (but with its own issues), is called hierarchical clustering. A hierarchical clustering is a binary tree, with higher-up branches connecting subtrees that are less similar, and lower-down branches connecting subtrees that are more similar.
For example, consider the following data (simple.csv):
Sample | | | | | | | | | | 
---|---|---|---|---|---|---|---|---|---|---
g0 | 0 | 0.1 | 0.2 | 0 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9
g1 | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0 | 0.1
g2 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
g3 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4
g4 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | 0.0
g5 | 0.5 | 0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5
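To make "closest pair" concrete, here is a minimal sketch (plain Python, no external libraries; the data is simply the table above retyped) that computes all pairwise Euclidean distances. On this data, g1 and g4 come out closest, which is why they merge first in the tree below.

```python
from itertools import combinations
from math import sqrt

# The rows of simple.csv, as given in the table above.
data = {
    "g0": [0, 0.1, 0.2, 0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "g1": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0, 0.1],
    "g2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "g3": [0.4] * 10,
    "g4": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    "g5": [0.5, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

for a, b in combinations(data, 2):
    print(a, b, round(euclidean(data[a], data[b]), 3))
# g1 and g4 are the closest pair (distance ~0.316), so they merge first.
```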
One possible hierarchical clustering for this simple example is as follows (an in-order string representation of the binary tree, with "*" as the label for the inner nodes and indentation showing depth):
```
        g1
    *
        g4
*
            g0
        *
            g2
    *
            g3
        *
            g5
```
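An indented listing like the one above can be produced by a short in-order traversal that indents each node by its depth. This is only a sketch using a hypothetical `Node` class, not the course's `HierarchicalCluster`:

```python
class Node:
    """Hypothetical binary tree node: label is a leaf name or "*"."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def print_sideways(node, depth=0):
    """In-order traversal, indenting each node by its depth."""
    if node is None:
        return
    print_sideways(node.left, depth + 1)
    print("    " * depth + node.label)
    print_sideways(node.right, depth + 1)

# The example clustering: ((g1, g4), ((g0, g2), (g3, g5)))
tree = Node("*",
            Node("*", Node("g1"), Node("g4")),
            Node("*", Node("*", Node("g0"), Node("g2")),
                      Node("*", Node("g3"), Node("g5"))))
print_sideways(tree)
```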
Graphically: *(figure: the same tree drawn as a diagram, with leaves shown as squares and inner nodes as diamonds)*
The simple made-up case for initial testing is simple.csv; the real dataset is cancer.csv.
Hierarchical clustering builds the tree bottom up. Each object starts in its own cluster (a leaf node, denoted with a square in the diagram above). We then find the closest pair of clusters and make them the children of a newly formed cluster (an inner node, denoted with a diamond in the diagram above). That new cluster is then treated like the rest of the clusters (leaves and inner nodes), and we repeat the process until a single cluster remains.
When we form a cluster, we need to be able to compute its distance to the other clusters (leaves and inner nodes) so that we can determine which pair is closest. The distance between two clusters can be defined in a number of ways; for this assignment we will simply use the centroid of a cluster as a representative point for computing the cluster's distance. We define the centroid of a cluster to be the weighted average of the centroids of all leaves in the subtree.
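Equivalently, a merged cluster's centroid can be computed from its two children's centroids, weighted by how many leaves each child contains, without revisiting the leaves. A minimal sketch (function name hypothetical):

```python
def merge_centroids(c1, n1, c2, n2):
    """Centroid of a merged cluster: the average of the two children's
    centroids, weighted by how many leaves each child contains."""
    total = n1 + n2
    return [(n1 * a + n2 * b) / total for a, b in zip(c1, c2)]

# e.g. a 3-leaf cluster with centroid [0.3, 0.6] merged with a single
# leaf [0.9, 0.2] gives [(3*0.3 + 0.9)/4, (3*0.6 + 0.2)/4] = [0.45, 0.5]
```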
We need to keep track of two things: the clusters that need to be further clustered (starting with the leaves), and the distances between them.
We need the following classes:

- `Pair`: keeps track of the distance between a pair of clusters; its `__lt__()` compares `Pair` objects for heap ordering purposes.
- `HierarchicalCluster`: a binary tree (building on `BinaryTree`) representing the hierarchical clusters; its `root_data` is the weighted centroid of the tree.
- `BinaryMinHeap`: a priority queue, ordered by `Pair`'s `__lt__()`, that enables us to find the closest pair of clusters, which is the next to merge in the hierarchical clustering.

Initialize the priority queue (`BinaryMinHeap`) with all pairs (`Pair`) of trees. Then, when a new cluster is formed, create new pairs between it and the remaining clusters (not its children!) and add them to the queue. Be careful not to merge a cluster twice. An easy way to do this is to test, after removing the closest pair from the priority queue, whether its left and right trees are both still available. That is one reason why, as mentioned above, we keep track of the clusters that haven't yet been clustered; be sure to update that list appropriately upon merging. Loop until only one cluster remains: the single cluster that is the whole tree (root node).
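Putting the pieces together, here is a hedged sketch of the whole loop. It uses Python's built-in `heapq` module in place of the course's `BinaryMinHeap`, and bare-bones `Tree`/`Pair` classes whose names and details will differ from your skeletons:

```python
import heapq
from itertools import combinations
from math import sqrt

class Tree:
    """Bare-bones stand-in for HierarchicalCluster: centroid is the
    weighted centroid of the subtree (root_data), n its leaf count."""
    def __init__(self, centroid, n=1, left=None, right=None):
        self.centroid, self.n, self.left, self.right = centroid, n, left, right

class Pair:
    """Distance between two clusters; __lt__ gives the heap its ordering."""
    def __init__(self, t1, t2):
        self.t1, self.t2 = t1, t2
        self.dist = sqrt(sum((a - b) ** 2
                             for a, b in zip(t1.centroid, t2.centroid)))
    def __lt__(self, other):
        return self.dist < other.dist

def agglomerate(points):
    active = [Tree(list(p)) for p in points]         # clusters still to merge
    heap = [Pair(a, b) for a, b in combinations(active, 2)]
    heapq.heapify(heap)                              # stand-in for BinaryMinHeap
    while len(active) > 1:
        pair = heapq.heappop(heap)                   # closest pair
        # Skip stale pairs: one side may already have been merged away.
        if pair.t1 not in active or pair.t2 not in active:
            continue
        active.remove(pair.t1)
        active.remove(pair.t2)
        n = pair.t1.n + pair.t2.n                    # weighted centroid of merge
        centroid = [(pair.t1.n * a + pair.t2.n * b) / n
                    for a, b in zip(pair.t1.centroid, pair.t2.centroid)]
        merged = Tree(centroid, n, pair.t1, pair.t2)
        for other in active:                         # new pairs, not children!
            heapq.heappush(heap, Pair(merged, other))
        active.append(merged)
    return active[0]                                 # root of the whole tree
```

Note the stale-pair check: pairs popped from the heap whose trees have already been merged into something else are simply discarded, rather than hunted down inside the heap.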
Test on the two provided datasets, along with any others you like. Note that in addition to the visual display (heatmaps), the tree structure and k-means cluster memberships are printed to the console; paste into a text editor to see everything. (The visual display doesn't capture the tree structure, but does reorder the leaves along the fringe, so you can kind of see patterns of similarity.) Save the textual representations and take screenshots.
Write a short (at least a couple paragraphs) description of what you observe in the clusters, e.g. similarities and differences between the two methods and their strengths and weaknesses. Also discuss how well the clustering captures the similarities and differences in the samples (e.g. are same-type cancers clustered?).
Do a divisive clustering rather than an agglomerative one; i.e., build the tree from the root down to the leaves, by splitting rather than merging.
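One common way to do this (a sketch only; it assumes a `kmeans(points, 2)` helper adapted from PA2 that returns two lists of points, which is a hypothetical signature, and reuses the `Tree` stand-in from the sketch above) is to split each node with 2-means and recurse. In practice you would also guard against 2-means returning an empty cluster.

```python
def divisive(points):
    """Top-down sketch: split with 2-means, recurse on each half."""
    if len(points) == 1:
        return Tree(list(points[0]))                 # leaf node
    halves = kmeans(points, 2)                       # hypothetical PA2 helper
    left, right = divisive(halves[0]), divisive(halves[1])
    n = left.n + right.n                             # weighted centroid
    centroid = [(left.n * a + right.n * b) / n
                for a, b in zip(left.centroid, right.centroid)]
    return Tree(centroid, n, left, right)
```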
Submit `<your last name>_ma6.zip` by the due date and time. This assignment is worth 50 points + 5 bonus points. Your assignment will be evaluated based on successful compilation and adherence to the program requirements. We will grade according to the following criteria:
- `Pair` class
- `HierarchicalCluster` class
- `BinaryMinHeap` class