At the conclusion of this micro assignment, participants should be able to:
Before starting this micro assignment, participants should be able to:
Content used in this assignment is based upon information in the following sources:
For this micro assignment, we are going to add functionality to a previous programming assignment (PA2) where we implemented the k-means clustering algorithm. Please re-read the specifications for the k-means assignment and review your code.
A problem with k-means clustering is that we may not know what k is (though we could try several and compare the resulting cluster quality). Another way to cluster, avoiding such a parameter (but with its own issues), is called hierarchical clustering. A hierarchical clustering is a binary tree, with higher-up branches connecting subtrees that are less similar, and lower-down branches connecting subtrees that are more similar.
For example, consider the following data (simple.csv):
Sample | | | | | | | | | | 
---|---|---|---|---|---|---|---|---|---|---
g0 | 0 | 0.1 | 0.2 | 0 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9
g1 | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0 | 0.1
g2 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
g3 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4
g4 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | 0.0
g5 | 0.5 | 0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5
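To make "closest pair" concrete, here is a minimal sketch (plain Python, no external libraries; the data is simply the table above retyped) that computes all pairwise Euclidean distances. On this data, g1 and g4 come out closest, which is why they merge first in the tree below.

```python
from itertools import combinations
from math import sqrt

# The rows of simple.csv, as given in the table above.
data = {
    "g0": [0, 0.1, 0.2, 0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "g1": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0, 0.1],
    "g2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "g3": [0.4] * 10,
    "g4": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    "g5": [0.5, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

for a, b in combinations(data, 2):
    print(a, b, round(euclidean(data[a], data[b]), 3))
# g1 and g4 are the closest pair (distance ~0.316), so they merge first.
```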
One possible hierarchical clustering for this simple example is as follows (an in-order string representation of the binary tree, with "*" as the label for the inner nodes and indentation showing depth):
```
        g1
    *
        g4
*
            g0
        *
            g2
    *
            g3
        *
            g5
```
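An indented listing like the one above can be produced by a short in-order traversal that indents each node by its depth. This is only a sketch using a hypothetical `Node` class, not the course's `HierarchicalCluster`:

```python
class Node:
    """Hypothetical binary tree node: label is a leaf name or "*"."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def print_sideways(node, depth=0):
    """In-order traversal, indenting each node by its depth."""
    if node is None:
        return
    print_sideways(node.left, depth + 1)
    print("    " * depth + node.label)
    print_sideways(node.right, depth + 1)

# The example clustering: ((g1, g4), ((g0, g2), (g3, g5)))
tree = Node("*",
            Node("*", Node("g1"), Node("g4")),
            Node("*", Node("*", Node("g0"), Node("g2")),
                      Node("*", Node("g3"), Node("g5"))))
print_sideways(tree)
```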
Graphically: *(figure: the same tree drawn as a diagram, with leaves shown as squares and inner nodes as diamonds)*
The simple made-up case for initial testing is simple.csv; the real dataset is cancer.csv.
Hierarchical clustering builds the tree bottom up. Each object starts in its own cluster (a leaf node, denoted with a square in the diagram above). We then find the closest pair of clusters and make them the children of a newly formed cluster (an inner node, denoted with a diamond in the diagram above). That new cluster is then treated like the rest of the clusters (leaves and inner nodes), and we repeat the process until a single cluster remains.
When we form a cluster, we need to be able to compute its distance to the other clusters (leaves and inner nodes) so that we can determine which pair is closest. The distance between two clusters can be defined in a number of ways; for this assignment we will simply use the centroid of a cluster as a representative point for computing the cluster's distance. We define the centroid of a cluster to be the weighted average of the centroids of all leaves in the subtree.
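Equivalently, a merged cluster's centroid can be computed from its two children's centroids, weighted by how many leaves each child contains, without revisiting the leaves. A minimal sketch (function name hypothetical):

```python
def merge_centroids(c1, n1, c2, n2):
    """Centroid of a merged cluster: the average of the two children's
    centroids, weighted by how many leaves each child contains."""
    total = n1 + n2
    return [(n1 * a + n2 * b) / total for a, b in zip(c1, c2)]

# e.g. a 3-leaf cluster with centroid [0.3, 0.6] merged with a single
# leaf [0.9, 0.2] gives [(3*0.3 + 0.9)/4, (3*0.6 + 0.2)/4] = [0.45, 0.5]
```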
We need to keep track of two things: the clusters that need to be further clustered (starting with the leaves), and the distances between them.
We need the following classes:

- `Pair`: keeps track of the distance between a pair of clusters; its `__lt__()` compares `Pair` objects for heap ordering purposes.
- `HierarchicalCluster`: a binary tree (building on `BinaryTree`) representing the hierarchical clusters; its `root_data` is the weighted centroid of the tree.
- `BinaryMinHeap`: a priority queue, ordered by `Pair`'s `__lt__()`, that enables us to find the closest pair of clusters, which is the next to merge in the hierarchical clustering.

Initialize the priority queue (`BinaryMinHeap`) with all pairs (`Pair`) of trees. Then, when a new cluster is formed, create new pairs between it and the remaining clusters (not its children!) and add them to the queue. Be careful not to merge a cluster twice. An easy way to do this is to test, after removing the closest pair from the priority queue, whether its left and right trees are both still available. That is one reason why, as mentioned above, we keep track of the clusters that haven't yet been clustered; be sure to update that list appropriately upon merging. Loop until only one cluster remains: the single cluster that is the whole tree (root node).
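Putting the pieces together, here is a hedged sketch of the whole loop. It uses Python's built-in `heapq` module in place of the course's `BinaryMinHeap`, and bare-bones `Tree`/`Pair` classes whose names and details will differ from your skeletons:

```python
import heapq
from itertools import combinations
from math import sqrt

class Tree:
    """Bare-bones stand-in for HierarchicalCluster: centroid is the
    weighted centroid of the subtree (root_data), n its leaf count."""
    def __init__(self, centroid, n=1, left=None, right=None):
        self.centroid, self.n, self.left, self.right = centroid, n, left, right

class Pair:
    """Distance between two clusters; __lt__ gives the heap its ordering."""
    def __init__(self, t1, t2):
        self.t1, self.t2 = t1, t2
        self.dist = sqrt(sum((a - b) ** 2
                             for a, b in zip(t1.centroid, t2.centroid)))
    def __lt__(self, other):
        return self.dist < other.dist

def agglomerate(points):
    active = [Tree(list(p)) for p in points]         # clusters still to merge
    heap = [Pair(a, b) for a, b in combinations(active, 2)]
    heapq.heapify(heap)                              # stand-in for BinaryMinHeap
    while len(active) > 1:
        pair = heapq.heappop(heap)                   # closest pair
        # Skip stale pairs: one side may already have been merged away.
        if pair.t1 not in active or pair.t2 not in active:
            continue
        active.remove(pair.t1)
        active.remove(pair.t2)
        n = pair.t1.n + pair.t2.n                    # weighted centroid of merge
        centroid = [(pair.t1.n * a + pair.t2.n * b) / n
                    for a, b in zip(pair.t1.centroid, pair.t2.centroid)]
        merged = Tree(centroid, n, pair.t1, pair.t2)
        for other in active:                         # new pairs, not children!
            heapq.heappush(heap, Pair(merged, other))
        active.append(merged)
    return active[0]                                 # root of the whole tree
```

Note the stale-pair check: pairs popped from the heap whose trees have already been merged into something else are simply discarded, rather than hunted down inside the heap.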
Test on the two provided datasets, along with any others you like. Note that in addition to the visual display (heatmaps), the tree structure and k-means cluster memberships are printed to the console; paste into a text editor to see everything. (The visual display doesn't capture the tree structure, but does reorder the leaves along the fringe, so you can kind of see patterns of similarity.) Save the textual representations and take screenshots.
Write a short (at least a couple paragraphs) description of what you observe in the clusters, e.g. similarities and differences between the two methods and their strengths and weaknesses. Also discuss how well the clustering captures the similarities and differences in the samples (e.g. are same-type cancers clustered?).
Do a divisive clustering rather than an agglomerative one; i.e., build the tree from the root down to the leaves, by splitting rather than merging.
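One common way to do this (a sketch only; it assumes a `kmeans(points, 2)` helper adapted from PA2 that returns two lists of points, which is a hypothetical signature, and reuses the `Tree` stand-in from the sketch above) is to split each node with 2-means and recurse. In practice you would also guard against 2-means returning an empty cluster.

```python
def divisive(points):
    """Top-down sketch: split with 2-means, recurse on each half."""
    if len(points) == 1:
        return Tree(list(points[0]))                 # leaf node
    halves = kmeans(points, 2)                       # hypothetical PA2 helper
    left, right = divisive(halves[0]), divisive(halves[1])
    n = left.n + right.n                             # weighted centroid
    centroid = [(left.n * a + right.n * b) / n
                for a, b in zip(left.centroid, right.centroid)]
    return Tree(centroid, n, left, right)
```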
Submit `<your last name>_ma6.zip` by the due date and time. This assignment is worth 50 points + 5 bonus points. Your assignment will be evaluated based on successful compilation and adherence to the program requirements. We will grade according to the following criteria:
- `Pair` class
- `HierarchicalCluster` class
- `BinaryMinHeap` class