$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\piv}{\mathbf{\pi}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\Sv}{\mathbf{S}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}}$

# Hierarchical Clustering

Hierarchical clustering builds a nested hierarchy of clusters, often visualized as a dendrogram. We will see an example below.

The methods are straightforward. The similarity between pairs of samples is usually related to the Euclidean distance between them. In agglomerative clustering, initially each sample is in a unique cluster. Then, the most similar two clusters are merged. This continues until a single cluster results that contains all samples. The distance between two clusters, $C_i$ and $C_j$, can be determined by the single-linkage method $$d(C_i,C_j) = \min_{\xv\in C_i, \yv\in C_j} d(\xv,\yv)$$ or complete-linkage method $$d(C_i,C_j) = \max_{\xv\in C_i, \yv\in C_j} d(\xv,\yv)$$ or average-linkage method $$d(C_i,C_j) = \text{mean}_{\xv\in C_i, \yv\in C_j} d(\xv,\yv)$$ where $d(\xv,\yv)$ is the Euclidean distance between $\xv$ and $\yv$.
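To make the three linkage rules concrete, here is a small sketch using two made-up clusters (not from any data set below) and NumPy broadcasting:

```python
import numpy as np

# Two small made-up clusters, one sample per row
Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between members of Ci and Cj
d = np.sqrt(np.sum((Ci[:, np.newaxis, :] - Cj)**2, axis=2))

print(d.min())   # single linkage:   2.0
print(d.max())   # complete linkage: 5.0
print(d.mean())  # average linkage:  3.5
```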

In divisive clustering, all samples are initially in one cluster, which is successively split until all samples are in unique clusters. We will use agglomerative clustering, as it often results in more compact dendrograms.
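As a rough illustration of the divisive direction (not used in the rest of this notebook), one simple way to split a cluster is to pick its two most distant members as seeds and assign every sample to the nearer seed. The function name `split_once` is made up for this sketch:

```python
import numpy as np

def split_once(X):
    '''Split one cluster in two, seeded by its two most distant members.'''
    D = np.sum((X[:, np.newaxis, :] - X)**2, axis=2)
    i, j = np.unravel_index(np.argmax(D), D.shape)  # two most distant samples
    closer_to_i = D[:, i] < D[:, j]
    return X[closer_to_i], X[~closer_to_i]

A, B = split_once(np.array([[0.0, 0.0], [0.0, 1.0], [9.0, 9.0], [9.0, 10.0]]))
print(A.tolist(), B.tolist())  # [[0.0, 0.0], [0.0, 1.0]] [[9.0, 9.0], [9.0, 10.0]]
```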

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


Let's represent clusters as a list of sample matrices, each matrix containing samples from one cluster. Initially, all samples are in their own clusters. Let's use the Old Faithful data to develop our implementation.

In [2]:
!curl -O http://www.cs.colostate.edu/~anderson/cs445/notebooks/oldfaithful.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
100  3808  100  3808    0     0   309k      0 --:--:-- --:--:-- --:--:--  309k

In [3]:
!head oldfaithful.csv

3.600      79
1.800      54
3.333      74
2.283      62
4.533      85
2.883      55
4.700      88
3.600      85
1.950      51
4.350      85

In [4]:
data = np.loadtxt('oldfaithful.csv')

In [5]:
data.shape

Out[5]:
(272, 2)
In [6]:
plt.scatter(data[:,0],data[:,1]);
plt.xlabel('Duration');
plt.ylabel('Interval');

In [7]:
clusters = [d for d in data]
clusters[:5]

Out[7]:
[array([ 3.6, 79. ]),
array([ 1.8, 54. ]),
array([ 3.333, 74.   ]),
array([ 2.283, 62.   ]),
array([ 4.533, 85.   ])]

Now we need the complete-linkage cluster distance function.

In [8]:
C1 = [np.array([[1,2]]), np.array([[2,3]])]
C2 = [np.array([[2,2]]), np.array([[6,7]])]
C1,C2

Out[8]:
([array([[1, 2]]), array([[2, 3]])], [array([[2, 2]]), array([[6, 7]])])
In [9]:
allC1 = np.vstack((C1))
allC2 = np.vstack((C2))
allC1,allC2

Out[9]:
(array([[1, 2],
[2, 3]]), array([[2, 2],
[6, 7]]))
In [10]:
allC1[:,np.newaxis,:] - allC2

Out[10]:
array([[[-1,  0],
[-5, -5]],

[[ 0,  1],
[-4, -4]]])
In [11]:
np.sum((allC1[:,np.newaxis,:] - allC2)**2,axis=2)

Out[11]:
array([[ 1, 50],
[ 1, 32]])
In [12]:
np.max(np.sum((allC1[:,np.newaxis,:] - allC2)**2,axis=2))

Out[12]:
50

So, the maximum squared distance between $C_1$ and $C_2$ is 50.
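As a sanity check, the same maximum can be computed with explicit loops instead of broadcasting:

```python
import numpy as np

allC1 = np.array([[1, 2], [2, 3]])
allC2 = np.array([[2, 2], [6, 7]])

# Largest squared distance over all cross-cluster pairs
dmax = max(np.sum((x - y)**2) for x in allC1 for y in allC2)
print(dmax)  # 50
```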

In [13]:
def clusterDistance(Ci, Cj):
    allCi = np.vstack(Ci)
    allCj = np.vstack(Cj)
    return np.max(np.sum((allCi[:,np.newaxis,:] - allCj)**2, axis=2))

In [14]:
clusterDistance(C1,C2)

Out[14]:
50

All that is left is a way to identify the two clusters with the minimum distance.

In [15]:
C3 = [np.array([[6,4]]), np.array([[8,9]])]

In [16]:
clusters = [C1, C2, C3]
clusters

Out[16]:
[[array([[1, 2]]), array([[2, 3]])],
[array([[2, 2]]), array([[6, 7]])],
[array([[6, 4]]), array([[8, 9]])]]
In [17]:
for i in range(len(clusters)-1):
    for j in range(i+1, len(clusters)):
        print(i, j)

0 1
0 2
1 2

In [18]:
dists = []
for i in range(len(clusters)-1):
    for j in range(i+1, len(clusters)):
        dists.append([i, j, clusterDistance(clusters[i], clusters[j])])
dists

Out[18]:
[[0, 1, 50], [0, 2, 98], [1, 2, 85]]

or

In [19]:
[[i,j,clusterDistance(clusters[i],clusters[j])] for i in range(len(clusters)-1) for j in range(i+1,len(clusters))]

Out[19]:
[[0, 1, 50], [0, 2, 98], [1, 2, 85]]
In [20]:
def clusterDistance(Ci, Cj):
    '''Ci and Cj are two clusters, each being a dict with 'X' and 'label' keys'''
    return np.mean(np.sum((Ci['X'][:,np.newaxis,:] - Cj['X'])**2, axis=2))
    # return np.min(np.sum((Ci['X'][:,np.newaxis,:] - Cj['X'])**2, axis=2))
    # return np.max(np.sum((Ci['X'][:,np.newaxis,:] - Cj['X'])**2, axis=2))


So, clusters at indices 0 and 1 are closest. We can merge these two using np.vstack. Now we are ready to write the function.

In [21]:
def mergeClusters(Ci, Cj, k):
    return {'X': np.vstack((Ci['X'], Cj['X'])),
            'label': k}


Now for a simple, but very inefficient, implementation of agglomerative clustering.

In [22]:
def agglomerative(X, clusterDistanceF, nClusters):
    # clusters is a list of dicts, each holding a sample matrix and a label
    clusters = [ {'X': X[i:i+1, :], 'label': i} for i in range(X.shape[0]) ]
    k = X.shape[0] - 1
    while len(clusters) > nClusters:
        dists = np.array([[i, j, clusterDistanceF(clusters[i], clusters[j])]
                          for i in range(len(clusters)-1)
                          for j in range(i+1, len(clusters))])
        whichClosest = np.argmin(dists[:, -1])
        i, j = dists[whichClosest, :2].astype(int)
        # Merge cluster j into cluster i, giving the result a new label
        k += 1
        clusters[i] = {'X': np.vstack((clusters[i]['X'], clusters[j]['X'])),
                       'label': k}
        clusters.pop(j)
        print(len(clusters), end=' ')
    return clusters

In [23]:
data.shape

Out[23]:
(272, 2)
In [24]:
clusters = agglomerative(data,clusterDistance, 2)

271 270 269 268 267 266 265 264 263 262 261 260 259 258 257 256 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241 240 239 238 237 236 235 234 233 232 231 230 229 228 227 226 225 224 223 222 221 220 219 218 217 216 215 214 213 212 211 210 209 208 207 206 205 204 203 202 201 200 199 198 197 196 195 194 193 192 191 190 189 188 187 186 185 184 183 182 181 180 179 178 177 176 175 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 159 158 157 156 155 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
In [25]:
clusters

Out[25]:
[{'X': array([[ 3.6  , 79.   ],
[ 3.733, 79.   ],
[ 3.95 , 79.   ],
[ 4.25 , 79.   ],
[ 4.283, 79.   ],
[ 4.117, 79.   ],
[ 4.433, 79.   ],
[ 4.417, 79.   ],
[ 4.5  , 79.   ],
[ 4.55 , 79.   ],
[ 4.2  , 78.   ],
[ 4.167, 78.   ],
[ 4.267, 78.   ],
[ 4.   , 78.   ],
[ 4.083, 78.   ],
[ 3.85 , 78.   ],
[ 3.833, 78.   ],
[ 4.7  , 78.   ],
[ 4.667, 78.   ],
[ 4.567, 78.   ],
[ 4.6  , 78.   ],
[ 4.817, 78.   ],
[ 4.767, 78.   ],
[ 4.433, 78.   ],
[ 3.45 , 78.   ],
[ 4.083, 76.   ],
[ 4.15 , 76.   ],
[ 4.333, 76.   ],
[ 4.233, 76.   ],
[ 3.883, 76.   ],
[ 3.95 , 76.   ],
[ 5.067, 76.   ],
[ 4.583, 76.   ],
[ 4.8  , 76.   ],
[ 4.467, 77.   ],
[ 4.366, 77.   ],
[ 4.367, 77.   ],
[ 4.25 , 77.   ],
[ 4.283, 77.   ],
[ 4.567, 77.   ],
[ 4.583, 77.   ],
[ 4.583, 77.   ],
[ 4.   , 77.   ],
[ 3.966, 77.   ],
[ 4.817, 77.   ],
[ 5.033, 77.   ],
[ 3.333, 74.   ],
[ 3.833, 74.   ],
[ 4.533, 74.   ],
[ 4.467, 74.   ],
[ 4.35 , 74.   ],
[ 4.167, 74.   ],
[ 4.8  , 75.   ],
[ 4.733, 75.   ],
[ 4.75 , 75.   ],
[ 4.483, 75.   ],
[ 4.133, 75.   ],
[ 4.15 , 75.   ],
[ 3.75 , 75.   ],
[ 3.833, 75.   ],
[ 4.3  , 73.   ],
[ 4.067, 73.   ],
[ 4.533, 73.   ],
[ 4.5  , 73.   ],
[ 4.5  , 73.   ],
[ 4.7  , 73.   ],
[ 3.567, 73.   ],
[ 4.3  , 72.   ],
[ 3.067, 69.   ],
[ 4.067, 69.   ],
[ 3.333, 68.   ],
[ 3.717, 71.   ],
[ 3.567, 71.   ],
[ 4.   , 71.   ],
[ 3.917, 71.   ],
[ 4.1  , 70.   ],
[ 4.083, 70.   ],
[ 4.   , 70.   ],
[ 3.917, 70.   ],
[ 2.383, 71.   ],
[ 4.533, 85.   ],
[ 4.6  , 85.   ],
[ 4.583, 85.   ],
[ 4.35 , 85.   ],
[ 3.6  , 85.   ],
[ 4.   , 85.   ],
[ 3.917, 84.   ],
[ 3.85 , 84.   ],
[ 4.083, 84.   ],
[ 4.1  , 84.   ],
[ 4.8  , 84.   ],
[ 4.667, 84.   ],
[ 4.7  , 84.   ],
[ 4.567, 84.   ],
[ 4.5  , 84.   ],
[ 4.533, 84.   ],
[ 4.7  , 83.   ],
[ 4.883, 83.   ],
[ 4.5  , 83.   ],
[ 4.5  , 83.   ],
[ 4.417, 83.   ],
[ 4.45 , 83.   ],
[ 4.45 , 83.   ],
[ 4.167, 83.   ],
[ 4.25 , 83.   ],
[ 4.25 , 83.   ],
[ 3.6  , 83.   ],
[ 3.6  , 83.   ],
[ 3.767, 83.   ],
[ 3.317, 83.   ],
[ 4.   , 86.   ],
[ 4.15 , 86.   ],
[ 4.183, 86.   ],
[ 4.85 , 86.   ],
[ 4.933, 86.   ],
[ 4.933, 86.   ],
[ 4.417, 87.   ],
[ 3.5  , 87.   ],
[ 4.033, 80.   ],
[ 3.817, 80.   ],
[ 4.833, 80.   ],
[ 4.833, 80.   ],
[ 4.517, 80.   ],
[ 4.633, 80.   ],
[ 4.7  , 80.   ],
[ 4.35 , 80.   ],
[ 4.633, 82.   ],
[ 4.5  , 82.   ],
[ 4.533, 82.   ],
[ 4.333, 82.   ],
[ 4.35 , 82.   ],
[ 4.383, 82.   ],
[ 4.367, 82.   ],
[ 4.267, 82.   ],
[ 4.8  , 82.   ],
[ 4.9  , 82.   ],
[ 4.033, 82.   ],
[ 3.833, 82.   ],
[ 4.317, 81.   ],
[ 4.333, 81.   ],
[ 4.233, 81.   ],
[ 4.233, 81.   ],
[ 4.05 , 81.   ],
[ 4.167, 81.   ],
[ 4.117, 81.   ],
[ 4.633, 81.   ],
[ 4.6  , 81.   ],
[ 4.5  , 81.   ],
[ 4.8  , 81.   ],
[ 3.683, 81.   ],
[ 3.767, 81.   ],
[ 4.7  , 88.   ],
[ 4.6  , 88.   ],
[ 4.933, 88.   ],
[ 5.   , 88.   ],
[ 4.367, 88.   ],
[ 4.15 , 88.   ],
[ 4.783, 90.   ],
[ 4.716, 90.   ],
[ 4.65 , 90.   ],
[ 4.333, 90.   ],
[ 4.45 , 90.   ],
[ 4.417, 90.   ],
[ 4.9  , 89.   ],
[ 4.333, 89.   ],
[ 3.967, 89.   ],
[ 4.4  , 92.   ],
[ 4.133, 91.   ],
[ 4.083, 93.   ],
[ 4.617, 93.   ],
[ 4.8  , 94.   ],
[ 5.1  , 96.   ]]), 'label': 540}, {'X': array([[ 1.8  , 54.   ],
[ 1.833, 54.   ],
[ 1.833, 54.   ],
[ 1.85 , 54.   ],
[ 1.883, 54.   ],
[ 1.733, 54.   ],
[ 1.75 , 54.   ],
[ 2.417, 54.   ],
[ 2.2  , 54.   ],
[ 2.1  , 53.   ],
[ 2.033, 53.   ],
[ 1.8  , 53.   ],
[ 1.8  , 53.   ],
[ 1.867, 53.   ],
[ 2.617, 53.   ],
[ 2.4  , 53.   ],
[ 2.883, 55.   ],
[ 1.967, 55.   ],
[ 2.   , 55.   ],
[ 1.883, 55.   ],
[ 2.267, 55.   ],
[ 2.183, 55.   ],
[ 1.967, 56.   ],
[ 1.967, 56.   ],
[ 2.   , 56.   ],
[ 2.8  , 56.   ],
[ 1.95 , 51.   ],
[ 1.867, 51.   ],
[ 1.883, 51.   ],
[ 1.8  , 51.   ],
[ 2.033, 51.   ],
[ 2.25 , 51.   ],
[ 2.167, 52.   ],
[ 2.017, 52.   ],
[ 1.933, 52.   ],
[ 1.6  , 52.   ],
[ 1.783, 52.   ],
[ 1.867, 48.   ],
[ 1.75 , 48.   ],
[ 2.167, 48.   ],
[ 2.1  , 49.   ],
[ 2.017, 49.   ],
[ 1.917, 49.   ],
[ 1.933, 49.   ],
[ 1.867, 49.   ],
[ 1.867, 50.   ],
[ 1.867, 50.   ],
[ 2.317, 50.   ],
[ 2.417, 50.   ],
[ 2.217, 50.   ],
[ 1.75 , 47.   ],
[ 1.75 , 47.   ],
[ 1.867, 47.   ],
[ 2.35 , 47.   ],
[ 1.917, 45.   ],
[ 1.867, 45.   ],
[ 2.2  , 45.   ],
[ 1.833, 46.   ],
[ 1.833, 46.   ],
[ 1.817, 46.   ],
[ 1.783, 46.   ],
[ 2.15 , 46.   ],
[ 1.983, 43.   ],
[ 2.283, 62.   ],
[ 2.483, 62.   ],
[ 1.75 , 62.   ],
[ 1.983, 62.   ],
[ 1.833, 63.   ],
[ 2.367, 63.   ],
[ 2.9  , 63.   ],
[ 3.367, 66.   ],
[ 3.5  , 66.   ],
[ 2.133, 67.   ],
[ 3.833, 64.   ],
[ 3.417, 64.   ],
[ 1.667, 64.   ],
[ 2.333, 64.   ],
[ 2.067, 65.   ],
[ 2.633, 65.   ],
[ 2.4  , 65.   ],
[ 1.833, 59.   ],
[ 1.817, 59.   ],
[ 1.7  , 59.   ],
[ 2.   , 59.   ],
[ 1.983, 59.   ],
[ 2.233, 59.   ],
[ 2.3  , 59.   ],
[ 1.817, 60.   ],
[ 2.017, 60.   ],
[ 2.1  , 60.   ],
[ 2.2  , 60.   ],
[ 2.233, 60.   ],
[ 2.25 , 60.   ],
[ 1.883, 58.   ],
[ 1.85 , 58.   ],
[ 1.75 , 58.   ],
[ 2.   , 58.   ],
[ 2.083, 57.   ],
[ 2.083, 57.   ],
[ 1.833, 57.   ]]), 'label': 541}]
In [26]:
for i in range(len(clusters)):
    cluster = clusters[i]['X']
    plt.scatter(cluster[:,0], cluster[:,1])
plt.xlabel('Duration');
plt.ylabel('Interval');


How might we make this more efficient?

Maybe by computing the pairwise squared distances between all data points just once! Each cluster is then defined by a list of indices into this distance matrix, and merging two clusters is just list concatenation.
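This idea can be checked on a tiny example: looking distances up in a precomputed matrix gives the same average value as computing them directly from the samples. (The variable names here are illustrative only.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Pairwise squared distances, computed once
D = np.sum((X[:, np.newaxis, :] - X)**2, axis=2)

# A "cluster" is now just a list of row indices
Ci, Cj = [0, 2], [1, 4]
lookup = np.mean([D[i, j] for i in Ci for j in Cj])
direct = np.mean(np.sum((X[Ci][:, np.newaxis, :] - X[Cj])**2, axis=2))
print(np.isclose(lookup, direct))  # True
```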

In [27]:
dataDists = np.sum((data[:,np.newaxis,:] - data)**2, axis=2)
dataDists.shape

Out[27]:
(272, 272)
In [28]:
def clusterDistance(Ci, Cj, dataDists):
    '''Ci and Cj are two clusters, each a dict whose 'X' value is a list of sample indices'''
    return np.mean(np.array([dataDists[i, j] for i in Ci['X'] for j in Cj['X']]))
    # return np.min(np.array([dataDists[i, j] for i in Ci['X'] for j in Cj['X']]))
    # return np.max(np.array([dataDists[i, j] for i in Ci['X'] for j in Cj['X']]))

In [29]:
def agglomerative(X, clusterDistanceF, nClusters):
    print('Calculating data pair distances...', end='')
    dataDists = np.sum((X[:,np.newaxis,:] - X)**2, axis=2)
    print('Done')
    # clusters is a list of dicts; 'X' now holds indices into dataDists
    clusters = [ {'X': [i], 'label': i} for i in range(X.shape[0]) ]
    k = X.shape[0] - 1
    while len(clusters) > nClusters:
        dists = np.array([[i, j, clusterDistanceF(clusters[i], clusters[j], dataDists)]
                          for i in range(len(clusters)-1)
                          for j in range(i+1, len(clusters))])
        whichClosest = np.argmin(dists[:, -1])
        i, j = dists[whichClosest, :2].astype(int)
        # Merge cluster j into cluster i by concatenating their index lists
        k += 1
        clusters[i] = {'X': clusters[i]['X'] + clusters[j]['X'],
                       'label': k}
        clusters.pop(j)
        print(len(clusters), end=' ')
    return clusters

In [30]:
clusters = agglomerative(data,clusterDistance, 3)

Calculating data pair distances...Done
271 270 269 268 267 266 265 264 263 262 261 260 259 258 257 256 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241 240 239 238 237 236 235 234 233 232 231 230 229 228 227 226 225 224 223 222 221 220 219 218 217 216 215 214 213 212 211 210 209 208 207 206 205 204 203 202 201 200 199 198 197 196 195 194 193 192 191 190 189 188 187 186 185 184 183 182 181 180 179 178 177 176 175 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 159 158 157 156 155 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 
In [31]:
for i in range(len(clusters)):
    cluster = clusters[i]['X']
    coords = np.array([data[c] for c in cluster])
    plt.scatter(coords[:,0], coords[:,1])
plt.xlabel('Duration');
plt.ylabel('Interval');


Why do our clusters look flattened?

Right. Look at the x and y axis scales; the Interval variable spans a much larger range than Duration, so it dominates the distances. What should we do? Standardize the variables, which is what we should have done at the top of this notebook!

In [32]:
datas = (data - data.mean(0)) / data.std(0)
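A quick sketch (with made-up numbers) of what column-wise standardization does: after subtracting each column's mean and dividing by its standard deviation, both variables are on the same scale and contribute comparably to the squared distances.

```python
import numpy as np

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Xs = (X - X.mean(0)) / X.std(0)   # standardize each column separately
print(Xs.mean(0))  # approximately [0, 0]
print(Xs.std(0))   # [1, 1]
```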

In [33]:
clusters = agglomerative(datas,clusterDistance, 3)

Calculating data pair distances...Done
271 270 269 268 267 266 265 264 263 262 261 260 259 258 257 256 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241 240 239 238 237 236 235 234 233 232 231 230 229 228 227 226 225 224 223 222 221 220 219 218 217 216 215 214 213 212 211 210 209 208 207 206 205 204 203 202 201 200 199 198 197 196 195 194 193 192 191 190 189 188 187 186 185 184 183 182 181 180 179 178 177 176 175 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 159 158 157 156 155 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 
In [34]:
for i in range(len(clusters)):
    cluster = clusters[i]['X']
    coords = np.array([data[c] for c in cluster])
    plt.scatter(coords[:,0], coords[:,1])
plt.xlabel('Duration');
plt.ylabel('Interval');


What else could you do to speed this up?
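One possibility is to avoid recomputing every cluster-pair distance after each merge. For average linkage, the distance from a merged cluster to any other cluster can be updated in place with the Lance-Williams formula. Here is a sketch (the name `agglomerative_lw` and the details are made up for this example, kept deliberately simple rather than tuned):

```python
import numpy as np

def agglomerative_lw(X, nClusters):
    '''Average-linkage clustering that updates a cluster-distance matrix in place
    (Lance-Williams update) instead of recomputing all pairwise cluster distances.'''
    n = X.shape[0]
    D = np.sum((X[:, np.newaxis, :] - X)**2, axis=2).astype(float)
    np.fill_diagonal(D, np.inf)           # never merge a cluster with itself
    members = {i: [i] for i in range(n)}  # cluster id -> list of sample indices
    active = list(range(n))
    while len(active) > nClusters:
        # Find the closest pair among the remaining clusters
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        ni, nj = len(members[i]), len(members[j])
        # Lance-Williams update for average linkage: merge j into i
        for m in active:
            if m != i and m != j:
                D[i, m] = D[m, i] = (ni * D[i, m] + nj * D[j, m]) / (ni + nj)
        members[i] += members[j]
        del members[j]
        active.remove(j)
    return [members[i] for i in active]

clusters_lw = agglomerative_lw(np.array([[0.0, 0], [0, 1], [10, 10], [10, 11]]), 2)
print(clusters_lw)  # [[0, 1], [2, 3]]
```

Each merge now costs time proportional to the number of remaining clusters, rather than requiring all pairwise cluster distances to be recomputed from the samples.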

Another data set, from Finland.

In [35]:
!curl -O http://www.cs.colostate.edu/~anderson/cs445/notebooks/Jain.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
100  4933  100  4933    0     0   535k      0 --:--:-- --:--:-- --:--:--  535k

In [36]:
!head Jain.csv




In [37]:
data2 = np.loadtxt('Jain.csv', usecols=(0, 1))

In [38]:
data2.shape

Out[38]:
(373, 2)
In [39]:
plt.scatter(data2[:,0], data2[:,1]);

In [40]:
clusters =  agglomerative(data2, clusterDistance, 2)

Calculating data pair distances...Done
372 371 370 369 368 367 366 365 364 363 362 361 360 359 358 357 356 355 354 353 352 351 350 349 348 347 346 345 344 343 342 341 340 339 338 337 336 335 334 333 332 331 330 329 328 327 326 325 324 323 322 321 320 319 318 317 316 315 314 313 312 311 310 309 308 307 306 305 304 303 302 301 300 299 298 297 296 295 294 293 292 291 290 289 288 287 286 285 284 283 282 281 280 279 278 277 276 275 274 273 272 271 270 269 268 267 266 265 264 263 262 261 260 259 258 257 256 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241 240 239 238 237 236 235 234 233 232 231 230 229 228 227 226 225 224 223 222 221 220 219 218 217 216 215 214 213 212 211 210 209 208 207 206 205 204 203 202 201 200 199 198 197 196 195 194 193 192 191 190 189 188 187 186 185 184 183 182 181 180 179 178 177 176 175 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 159 158 157 156 155 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 
In [41]:
for i in range(len(clusters)):
    cluster = clusters[i]['X']
    coords = np.array([data2[c] for c in cluster])
    plt.scatter(coords[:,0], coords[:,1])


Let's try another data set from Finland.

In [50]:
!curl -O http://cs.joensuu.fi/sipu/datasets/MopsiLocations2012-Joensuu.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
100  117k  100  117k    0     0   140k      0 --:--:-- --:--:-- --:--:--  140k

In [51]:
!mv MopsiLocations2012-Joensuu.txt Joensuu.csv
!head Joensuu.csv

62.598090 29.744480
62.598090 29.744480
62.598090 29.744480
62.619430 29.694100
62.603260 29.744940
62.597960 29.744160
62.622750 29.700380
62.618920 29.815060
62.602260 29.749540
62.618200 29.708660

In [52]:
data3 = np.loadtxt('Joensuu.csv')

In [53]:
data3.shape

Out[53]:
(6014, 2)
In [54]:
data3 = data3[:1500, :]

In [55]:
plt.scatter(data3[:,0], data3[:,1]);

In [56]:
clusters =  agglomerative(data3, clusterDistance, 4)

Calculating data pair distances...Done
1499 1498 1497 1496 1495 1494 1493 1492 1491 1490 1489 1488 1487 1486 1485 1484 1483 1482 1481 1480 1479 1478 1477 1476 1475 1474 1473 1472 1471 1470 1469 1468 1467 1466 1465 1464 1463 1462 1461 1460 1459 1458 1457 1456 1455 1454 1453 1452 1451 1450 1449 1448 1447 1446 1445 1444 1443 1442 1441 1440 1439 1438 1437 1436 1435 1434 1433 1432 1431 1430 1429 1428 1427 1426 1425 1424 1423 1422 1421 1420 1419 1418 1417 1416 1415 1414 1413 1412 1411 1410 1409 1408 1407 1406 1405 1404 1403 1402 1401 1400 1399 1398 1397 1396 1395 1394 1393 1392 1391 1390 1389 1388 1387 1386 1385 1384 1383 1382 1381 1380 1379 1378 1377 1376 1375 1374 1373 1372 1371 1370 1369 1368 1367 1366 1365 1364 1363 1362 1361 1360 1359 1358 1357 1356 1355 1354 1353 1352 1351 1350 1349 1348 1347 1346 1345 1344 1343 1342 1341 1340 1339 1338 1337 1336 1335 1334 1333 1332 1331 1330 1329 1328 1327 1326 1325 1324 1323 1322 1321 1320 1319 1318 1317 1316 1315 1314 1313 1312 1311 1310 1309 1308 1307 1306 1305 1304 1303 1302 1301 1300 1299 1298 1297 1296 1295 1294 1293 1292 1291 1290 1289 1288 1287 1286 1285 1284 1283 1282 1281 1280 1279 1278 1277 1276 1275 1274 1273 1272 1271 1270 1269 1268 1267 1266 1265 1264 1263 1262 1261 1260 1259 1258 1257 1256 1255 1254 1253 1252 1251 1250 1249 1248 1247 1246 1245 1244 1243 1242 1241 1240 1239 1238 1237 1236 1235 1234 1233 1232 1231 1230 1229 1228 1227 1226 1225 1224 1223 1222 1221 1220 1219 1218 1217 1216 1215 1214 1213 1212 1211 1210 1209 1208 1207 1206 1205 1204 1203 1202 1201 1200 1199 1198 1197 1196 1195 1194 1193 1192 1191 1190 1189 1188 1187 1186 1185 1184 1183 1182 1181 1180 1179 1178 1177 1176 1175 1174 1173 1172 1171 1170 1169 1168 1167 1166 1165 1164 1163 1162 1161 1160 1159 1158 1157 1156 1155 1154 1153 1152 1151 1150 1149 1148 1147 1146 1145 1144 1143 1142 1141 1140 1139 1138 1137 1136 1135 1134 1133 1132 1131 1130 1129 1128 1127 1126 1125 1124 1123 1122 1121 1120 1119 1118 1117 1116 1115 1114 1113 1112 1111 1110 1109 1108 1107 1106 1105 1104 1103 1102 1101 1100 
1099 1098 1097 1096 1095 1094 1093 1092 1091 1090 1089 1088 1087 1086 1085 1084 1083 1082 1081 1080 1079 1078 1077 1076 1075 1074 1073 1072 1071 1070 1069 1068 1067 1066 1065 1064 1063 1062 1061 1060 1059 1058 1057 1056 1055 1054 1053 1052 1051 1050 1049 1048 1047 1046 1045 1044 1043 1042 1041 1040 1039 1038 1037 1036 1035 1034 1033 1032 1031 1030 1029 1028 1027 1026 1025 1024 1023 1022 1021 1020 1019 1018 1017 1016 1015 1014 1013 1012 1011 1010 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000 999 998 997 996 995 994 993 992 991 990 989 988 987 986 985 984 983 982 981 980 979 978 977 976 975 974 973 972 971 970 969 968 967 966 965 964 963 962 961 960 959 958 957 956 955 954 953 952 951 950 949 948 947 946 945 944 943 942 941 940 939 938 937 936 935 934 933 932 931 930 929 928 927 926 925 924 923 922 921 920 919 918 917 916 915 914 913 912 911 910 909 908 907 906 905 904 903 902 901 900 899 898 897 896 895 894 893 892 891 890 889 888 887 886 885 884 883 882 881 880 879 878 877 876 875 874 873 872 871 870 869 868 867 866 865 864 863 862 861 860 859 858 857 856 855 854 853 852 851 850 849 848 847 846 845 844 843 842 841 840 839 838 837 836 835 834 833 832 831 830 829 828 827 826 825 824 823 822 821 820 819 818 817 816 815 814 813 812 811 810 809 808 807 806 805 804 803 802 801 800 799 798 797 796 795 794 793 792 791 790 789 788 787 786 785 784 783 782 781 780 779 778 777 776 775 774 773 772 771 770 769 768 767 766 765 764 763 762 761 760 759 758 757 756 755 754 753 752 751 750 749 748 747 746 745 744 743 742 741 740 739 738 737 736 735 734 733 732 731 730 729 728 727 726 725 724 723 722 721 720 719 718 717 716 715 714 713 712 711 710 709 708 707 706 705 704 703 702 701 700 699 698 697 696 695 694 693 692 691 690 689 688 687 686 685 684 683 682 681 680 679 678 677 676 675 674 673 672 671 670 669 668 667 666 665 664 663 662 661 660 659 658 657 656 655 654 653 652 651 650 649 648 647 646 645 644 643 642 641 640 639 638 637 636 635 634 633 632 631 630 629 628 627 626 625 
624 623 622 621 620 619 618 617 616 615 614 613 612 611 610 609 608 607 606 605 604 603 602 601 600 599 598 597 596 595 594 593 592 591 590 589 588 587 586 585 584 583 582 581 580 579 578 577 576 575 574 573 572 571 570 569 568 567 566 565 564 563 562 561 560 559 558 557 556 555 554 553 552 551 550 549 548 547 546 545 544 543 542 541 540 539 538 537 536 535 534 533 532 531 530 529 528 527 526 525 524 523 522 521 520 519 518 517 516 515 514 513 512 511 510 509 508 507 506 505 504 503 502 501 500 499 498 497 496 495 494 493 492 491 490 489 488 487 486 485 484 483 482 481 480 479 478 477 476 475 474 473 472 471 470 469 468 467 466 465 464 463 462 461 460 459 458 457 456 455 454 453 452 451 450 449 448 447 446 445 444 443 442 441 440 439 438 437 436 435 434 433 432 431 430 429 428 427 426 425 424 423 422 421 420 419 418 417 416 415 414 413 412 411 410 409 408 407 406 405 404 403 402 401 400 399 398 397 396 395 394 393 392 391 390 389 388 387 386 385 384 383 382 381 380 379 378 377 376 375 374 373 372 371 370 369 368 367 366 365 364 363 362 361 360 359 358 357 356 355 354 353 352 351 350 349 348 347 346 345 344 343 342 341 340 339 338 337 336 335 334 333 332 331 330 329 328 327 326 325 324 323 322 321 320 319 318 317 316 315 314 313 312 311 310 309 308 307 306 305 304 303 302 301 300 299 298 297 296 295 294 293 292 291 290 289 288 287 286 285 284 283 282 281 280 279 278 277 276 275 274 273 272 271 270 269 268 267 266 265 264 263 262 261 260 259 258 257 256 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241 240 239 238 237 236 235 234 233 232 231 230 229 228 227 226 225 224 223 222 221 220 219 218 217 216 215 214 213 212 211 210 209 208 207 206 205 204 203 202 201 200 199 198 197 196 195 194 193 192 191 190 189 188 187 186 185 184 183 182 181 180 179 178 177 176 175 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 159 158 157 156 155 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132 131 130 129 128 127 126 125 
124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 
In [79]:
plt.figure(figsize=(20, 8))
for i in range(len(clusters)):
    cluster = clusters[i]['X']
    coords = np.array([data3[c] for c in cluster])
    plt.scatter(coords[:, 0], coords[:, 1]);
plt.axis('square');

In [73]:
import scipy.cluster.hierarchy as hier

In [74]:
result = hier.linkage(data3, 'average')

In [75]:
result.shape

Out[75]:
(1499, 4)
In [77]:
plt.figure(figsize=(20,20))
hier.dendrogram(result, orientation='left');
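The linkage matrix can also be cut into a flat clustering with scipy's fcluster. A small self-contained sketch on synthetic data (not the Joensuu data):

```python
import numpy as np
import scipy.cluster.hierarchy as hier

rng = np.random.default_rng(0)
pts = np.vstack((rng.normal(0, 0.5, (20, 2)),   # one tight blob near the origin
                 rng.normal(5, 0.5, (20, 2))))  # another blob far away

Z = hier.linkage(pts, 'average')
labels = hier.fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(np.unique(labels))  # [1 2]
```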
