We will use the Boston Housing data to illustrate the use of clustering techniques.

In [1]:
import numpy as np
import pylab as pl
import pandas as pd
from sklearn.cluster import KMeans 
from sklearn.datasets import load_boston
boston = load_boston()
In [2]:
np.set_printoptions(suppress=True, precision=2, linewidth=120)
In [3]:
print boston.data[:2]
[[   0.01   18.      2.31    0.      0.54    6.58   65.2     4.09    1.    296.     15.3   396.9     4.98]
 [   0.03    0.      7.07    0.      0.47    6.42   78.9     4.97    2.    242.     17.8   396.9     9.14]]
In [4]:
print boston.target[:2]
[ 24.   21.6]

In unsupervised approaches to data mining there is no designated target attribute. So here we can add the target (the median home value) back in as an ordinary column, since it may be a useful variable for clustering.

In [5]:
x = np.array([np.concatenate((boston.data[i],[boston.target[i]])) for i in range(len(boston.data))])
In [6]:
print x[:2]
[[   0.01   18.      2.31    0.      0.54    6.58   65.2     4.09    1.    296.     15.3   396.9     4.98   24.  ]
 [   0.03    0.      7.07    0.      0.47    6.42   78.9     4.97    2.    242.     17.8   396.9     9.14   21.6 ]]
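
An equivalent and more concise way to build the same array is to append the target to the data as a new column with np.column_stack:

x = np.column_stack((boston.data, boston.target))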

Now we use the KMeans implementation from scikit-learn to perform the clustering. With verbose=1, the fit below logs each of the n_init=10 random initializations; scikit-learn keeps the run with the lowest inertia (within-cluster sum of squared distances).

In [7]:
kmeans = KMeans(n_clusters=5, max_iter=500, verbose=1) # initialization
In [8]:
kmeans.fit(x)
Initialization complete
Iteration  0, inertia 1900863.047
Iteration  1, inertia 1539103.044
Iteration  2, inertia 1500857.955
Iteration  3, inertia 1500414.923
Converged at iteration 3
Initialization complete
Iteration  0, inertia 2111574.633
Iteration  1, inertia 1558158.474
Iteration  2, inertia 1508487.189
Iteration  3, inertia 1500845.914
Iteration  4, inertia 1500414.923
Converged at iteration 4
Initialization complete
Iteration  0, inertia 2318898.681
Iteration  1, inertia 1486232.443
Iteration  2, inertia 1478189.279
Iteration  3, inertia 1477901.691
Iteration  4, inertia 1477608.702
Iteration  5, inertia 1476992.206
Iteration  6, inertia 1476845.860
Iteration  7, inertia 1476759.615
Converged at iteration 7
Initialization complete
Iteration  0, inertia 2415048.226
Iteration  1, inertia 1485568.539
Iteration  2, inertia 1478772.644
Iteration  3, inertia 1477273.447
Iteration  4, inertia 1476845.860
Iteration  5, inertia 1476759.615
Converged at iteration 5
Initialization complete
Iteration  0, inertia 2223438.015
Iteration  1, inertia 1613892.614
Iteration  2, inertia 1580902.019
Iteration  3, inertia 1578911.364
Iteration  4, inertia 1577973.587
Iteration  5, inertia 1577004.521
Iteration  6, inertia 1576658.891
Iteration  7, inertia 1576314.292
Iteration  8, inertia 1575901.982
Converged at iteration 8
Initialization complete
Iteration  0, inertia 2022418.656
Iteration  1, inertia 1501265.762
Iteration  2, inertia 1500398.299
Converged at iteration 2
Initialization complete
Iteration  0, inertia 1935063.234
Iteration  1, inertia 1614998.959
Iteration  2, inertia 1582743.884
Iteration  3, inertia 1579764.283
Converged at iteration 3
Initialization complete
Iteration  0, inertia 2688036.262
Iteration  1, inertia 1708267.677
Iteration  2, inertia 1703847.033
Converged at iteration 2
Initialization complete
Iteration  0, inertia 2013127.282
Iteration  1, inertia 1580979.688
Iteration  2, inertia 1580127.664
Iteration  3, inertia 1579942.130
Iteration  4, inertia 1579764.283
Converged at iteration 4
Initialization complete
Iteration  0, inertia 3196687.119
Iteration  1, inertia 1491528.404
Iteration  2, inertia 1478347.808
Iteration  3, inertia 1476696.777
Converged at iteration 3
Out[8]:
KMeans(copy_x=True, init='k-means++', max_iter=500, n_clusters=5, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=1)
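
After fitting, the within-cluster sum of squared distances for the best of these runs is available as the inertia_ attribute:

print(kmeans.inertia_)
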
In [9]:
clusters = kmeans.predict(x)
In [10]:
print clusters
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2
 2 2 2 2 2 2 2 0 0 0 2 2 2 2 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 2 2 2 2 2 2 2 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 4 4 0 0 0 0 0 0 4 0 4 4 0 0 0 0 0 0 0 0 4 0 4 0 0 0 0 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 0 0 0 0 0 0 2 2 2 2 2 2 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0
 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 2 2 2 0 0 2 2 2 1 1 1 1
 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 3 3 3 3 3 3
 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 2 2 2 2 2]
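
A quick way to see how many records fall in each cluster is to count the labels:

print(np.bincount(clusters))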

The centroids provide an aggregate representation and a characterization of each cluster.

In [15]:
print boston.feature_names
print kmeans.cluster_centers_
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
[[   0.62   12.88   12.03    0.06    0.56    6.21   69.25    3.63    4.73  402.31   17.76  382.25   12.18   22.75]
 [  11.08    0.     18.57    0.08    0.67    5.97   90.01    2.07   23.03  668.18   20.2   370.24   17.9    17.42]
 [   0.24   17.26    6.71    0.08    0.48    6.47   56.07    4.84    4.34  274.69   17.86  388.78    9.47   25.98]
 [  15.69   -0.     18.1     0.      0.67    6.1    89.84    2.     24.    666.     20.2    51.1    21.03   12.8 ]
 [   1.96    0.     16.71    0.09    0.71    5.92   91.82    2.32    4.73  386.91   17.    187.55   17.21   17.02]]
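
Note that each centroid has 14 values: the 13 housing features listed above followed by the appended target column (the median home value). To make the centroids easier to read, they can be labeled using pandas (imported earlier as pd); a small sketch, with 'MEDV' used here as a label for the appended target:

centers = pd.DataFrame(kmeans.cluster_centers_,
                       columns=list(boston.feature_names) + ['MEDV'])
print(centers.round(2))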

Now, let's look at the Iris data set (its description is printed below):

In [16]:
from sklearn.datasets import load_iris
iris = load_iris()
In [17]:
print iris.DESCR
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%[email protected])
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

In [18]:
data = iris.data
target = iris.target
In [19]:
print iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [20]:
print iris.data[:10]
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]
In [21]:
print iris.target[:150]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
In [22]:
print set(target) # build a collection of unique elements
set([0, 1, 2])

This snippet plots the first feature against the third (sepal length vs. petal length), and the result is shown in the following figure.

Classes: 0 = Iris-Setosa, 1 = Iris-Versicolour, 2 = Iris-Virginica

In [23]:
pl.plot(data[target==0,0],data[target==0,2],'bo')
pl.plot(data[target==1,0],data[target==1,2],'ro')
pl.plot(data[target==2,0],data[target==2,2],'go')
pl.legend(('Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica'), loc=4)
pl.show()

In the graph we have 150 points, and their color represents the class: the blue points are the samples belonging to the species setosa, the red ones represent versicolor, and the green ones represent virginica. Next, let's see whether clustering can recover the correct classes.

In [24]:
iris_kmeans = KMeans(n_clusters=3, max_iter=500, verbose=1, n_init=5) # initialization
iris_kmeans.fit(data)
Initialization complete
Iteration  0, inertia 107.660
Iteration  1, inertia 79.204
Iteration  2, inertia 78.941
Converged at iteration 2
Initialization complete
Iteration  0, inertia 133.340
Iteration  1, inertia 79.754
Iteration  2, inertia 78.941
Converged at iteration 2
Initialization complete
Iteration  0, inertia 155.380
Iteration  1, inertia 86.409
Iteration  2, inertia 84.355
Iteration  3, inertia 83.480
Iteration  4, inertia 82.094
Iteration  5, inertia 81.170
Iteration  6, inertia 79.963
Iteration  7, inertia 79.434
Iteration  8, inertia 79.011
Iteration  9, inertia 78.945
Converged at iteration 9
Initialization complete
Iteration  0, inertia 126.550
Iteration  1, inertia 80.452
Iteration  2, inertia 78.945
Converged at iteration 2
Initialization complete
Iteration  0, inertia 104.010
Iteration  1, inertia 79.007
Iteration  2, inertia 78.941
Converged at iteration 2
Out[24]:
KMeans(copy_x=True, init='k-means++', max_iter=500, n_clusters=3, n_init=5,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=1)
In [25]:
c = iris_kmeans.predict(data)
In [26]:
print c
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 0 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 0 0 0 2 0 0 0 0 0 0 2 2 0 0 0 0 2
 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 0 0 0 2 0 0 2]
In [27]:
c.shape
Out[27]:
(150L,)
In [28]:
target.shape
Out[28]:
(150L,)
In [29]:
print target
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

Since we know what the classes are, we can evaluate clustering performance using metrics that compare the clusters to the actual classes (the cluster indices themselves are arbitrary, so a direct accuracy comparison would not be meaningful):

Homogeneity: each cluster contains only members of a single class.
Completeness: all members of a given class are assigned to the same cluster.

In [30]:
from sklearn.metrics import completeness_score, homogeneity_score
In [31]:
print completeness_score(target,c)
0.764986151449
In [32]:
print homogeneity_score(target,c)
0.751485402199

The completeness score approaches 1 when most of the data points belonging to a given class end up in the same cluster, while the homogeneity score approaches 1 when each cluster contains almost exclusively data points from a single class.
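
A tiny made-up example (synthetic labels, not the iris data) illustrates the difference between the two scores:

true_labels = [0, 0, 1, 1]
print(completeness_score(true_labels, [0, 0, 0, 0]))  # 1.0: each class lies entirely within one cluster
print(homogeneity_score(true_labels, [0, 0, 0, 0]))   # 0.0: that single cluster mixes both classes
print(homogeneity_score(true_labels, [0, 1, 2, 3]))   # 1.0: every cluster is pure
print(completeness_score(true_labels, [0, 1, 2, 3]))  # below 1: each class is split across clusters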

In [33]:
print iris_kmeans.cluster_centers_
[[ 6.85  3.07  5.74  2.07]
 [ 5.01  3.42  1.46  0.24]
 [ 5.9   2.75  4.39  1.43]]
In [34]:
pl.plot(data[c==1,0],data[c==1,2],'ro')
pl.plot(data[c==0,0],data[c==0,2],'bo')
pl.plot(data[c==2,0],data[c==2,2],'go')
pl.show()
In [35]:
cd D:\Documents\Class\CSC478\Data
D:\Documents\Class\CSC478\Data

Let's now use the kMeans clustering implementation from Machine Learning in Action, Ch. 10:
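
The kMeans function returns the k centroids and a cluster-assignment array holding, for every data point, its cluster index and (as seen in the output below) its squared distance to that centroid. Its core logic is roughly the following (a simplified sketch for reference; the actual kMeans.py that accompanies the book differs in details such as working with NumPy matrices):

import numpy as np

def distEuclid(a, b):
    # Euclidean distance between two vectors
    return np.sqrt(np.sum((a - b) ** 2))

def randCent(dataSet, k):
    # pick k random centroids uniformly within the range of each feature
    mins, maxs = dataSet.min(axis=0), dataSet.max(axis=0)
    return mins + np.random.rand(k, dataSet.shape[1]) * (maxs - mins)

def kMeansSketch(dataSet, k, distMeas=distEuclid, createCent=randCent):
    m = dataSet.shape[0]
    # column 0: assigned cluster index, column 1: squared distance to that centroid
    clusterAssment = np.zeros((m, 2))
    clusterAssment[:, 0] = -1              # force at least one full assignment pass
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # assignment step: move each point to its nearest centroid
        for i in range(m):
            dists = [distMeas(centroids[j], dataSet[i]) for j in range(k)]
            best = int(np.argmin(dists))
            if clusterAssment[i, 0] != best:
                clusterChanged = True
            clusterAssment[i] = [best, dists[best] ** 2]
        # update step: recompute each centroid as the mean of its members
        for j in range(k):
            members = dataSet[clusterAssment[:, 0] == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, clusterAssment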

In [36]:
import kMeans
In [37]:
reload(kMeans)
Out[37]:
<module 'kMeans' from 'kMeans.pyc'>
In [38]:
centroids, clusters = kMeans.kMeans(data, 3, kMeans.distEuclid, kMeans.randCent)
Iteration  1
Iteration  2
Iteration  3
Iteration  4
Iteration  5
In [39]:
print centroids
[[ 6.85  3.08  5.72  2.05]
 [ 5.01  3.42  1.46  0.24]
 [ 5.88  2.74  4.39  1.43]]
In [40]:
print clusters
[[ 1.    0.02]
 [ 1.    0.19]
 [ 1.    0.17]
 [ 1.    0.27]
 [ 1.    0.04]
 [ 1.    0.47]
 [ 1.    0.17]
 [ 1.    0.  ]
 [ 1.    0.64]
 [ 1.    0.13]
 [ 1.    0.24]
 [ 1.    0.06]
 [ 1.    0.24]
 [ 1.    0.83]
 [ 1.    1.04]
 [ 1.    1.47]
 [ 1.    0.44]
 [ 1.    0.02]
 [ 1.    0.69]
 [ 1.    0.16]
 [ 1.    0.21]
 [ 1.    0.11]
 [ 1.    0.42]
 [ 1.    0.14]
 [ 1.    0.23]
 [ 1.    0.2 ]
 [ 1.    0.04]
 [ 1.    0.05]
 [ 1.    0.04]
 [ 1.    0.16]
 [ 1.    0.16]
 [ 1.    0.18]
 [ 1.    0.52]
 [ 1.    0.86]
 [ 1.    0.13]
 [ 1.    0.12]
 [ 1.    0.28]
 [ 1.    0.13]
 [ 1.    0.57]
 [ 1.    0.01]
 [ 1.    0.04]
 [ 1.    1.54]
 [ 1.    0.44]
 [ 1.    0.15]
 [ 1.    0.37]
 [ 1.    0.22]
 [ 1.    0.18]
 [ 1.    0.22]
 [ 1.    0.17]
 [ 1.    0.02]
 [ 0.    1.5 ]
 [ 2.    0.49]
 [ 0.    0.97]
 [ 2.    0.51]
 [ 2.    0.43]
 [ 2.    0.07]
 [ 2.    0.61]
 [ 2.    2.46]
 [ 2.    0.6 ]
 [ 2.    0.71]
 [ 2.    2.31]
 [ 2.    0.11]
 [ 2.    0.65]
 [ 2.    0.17]
 [ 2.    0.75]
 [ 2.    0.8 ]
 [ 2.    0.16]
 [ 2.    0.28]
 [ 2.    0.41]
 [ 2.    0.49]
 [ 2.    0.51]
 [ 2.    0.22]
 [ 2.    0.5 ]
 [ 2.    0.2 ]
 [ 2.    0.32]
 [ 2.    0.58]
 [ 2.    1.01]
 [ 0.    0.67]
 [ 2.    0.06]
 [ 2.    1.03]
 [ 2.    0.72]
 [ 2.    0.93]
 [ 2.    0.3 ]
 [ 2.    0.55]
 [ 2.    0.32]
 [ 2.    0.49]
 [ 2.    0.9 ]
 [ 2.    0.39]
 [ 2.    0.25]
 [ 2.    0.37]
 [ 2.    0.22]
 [ 2.    0.16]
 [ 2.    0.23]
 [ 2.    2.35]
 [ 2.    0.14]
 [ 2.    0.19]
 [ 2.    0.11]
 [ 2.    0.15]
 [ 2.    2.71]
 [ 2.    0.14]
 [ 0.    0.64]
 [ 2.    0.73]
 [ 0.    0.1 ]
 [ 0.    0.42]
 [ 0.    0.16]
 [ 0.    1.35]
 [ 2.    1.11]
 [ 0.    0.64]
 [ 0.    0.43]
 [ 0.    0.74]
 [ 0.    0.52]
 [ 0.    0.54]
 [ 0.    0.06]
 [ 2.    0.79]
 [ 2.    1.45]
 [ 0.    0.45]
 [ 0.    0.24]
 [ 0.    2.23]
 [ 0.    2.41]
 [ 2.    0.68]
 [ 0.    0.08]
 [ 2.    0.67]
 [ 0.    1.77]
 [ 2.    0.57]
 [ 0.    0.08]
 [ 0.    0.28]
 [ 2.    0.41]
 [ 2.    0.51]
 [ 0.    0.3 ]
 [ 0.    0.34]
 [ 0.    0.55]
 [ 0.    2.09]
 [ 0.    0.32]
 [ 2.    0.69]
 [ 0.    1.24]
 [ 0.    0.93]
 [ 0.    0.54]
 [ 0.    0.32]
 [ 2.    0.38]
 [ 0.    0.1 ]
 [ 0.    0.16]
 [ 0.    0.44]
 [ 2.    0.73]
 [ 0.    0.11]
 [ 0.    0.27]
 [ 0.    0.36]
 [ 2.    0.82]
 [ 0.    0.4 ]
 [ 0.    0.69]
 [ 2.    0.71]]
In [41]:
newC = clusters.T[0]
print newC
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  2.  0.  2.  2.  2.  2.  2.  2.  2.
  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  0.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.
  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  0.  2.  0.  0.  0.  0.  2.  0.  0.  0.  0.  0.  0.  2.  2.  0.  0.  0.  0.  2.
  0.  2.  0.  2.  0.  0.  2.  2.  0.  0.  0.  0.  0.  2.  0.  0.  0.  0.  2.  0.  0.  0.  2.  0.  0.  0.  2.  0.  0.  2.]
In [42]:
newC = newC.astype(int)
print newC
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 2 0 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 0 0 0 2 0 0 0 0 0 0 2 2 0 0 0 0 2
 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 0 0 0 2 0 0 2]
In [43]:
print completeness_score(target,newC)
0.74748658051
In [44]:
print homogeneity_score(target,newC)
0.736419288125
In [45]:
reload(kMeans)
Out[45]:
<module 'kMeans' from 'kMeans.pyc'>
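
kMeans.py also provides biKmeans, the bisecting k-means variant from the same chapter. Roughly, it starts with all points in a single cluster and repeatedly splits one cluster with 2-means, choosing the split that gives the lowest combined error (the sseSplit and notSplit values logged below), until k clusters are obtained.
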
In [46]:
centroids_bk, clusters_bk = kMeans.biKmeans(data, 3, kMeans.distEuclid)
Iteration  1
Iteration  2
Iteration  3
sseSplit, and notSplit:  152.368706477 0.0
the bestCentToSplit is:  0
the len of bestClustAss is:  150
Iteration  1
Iteration  2
Iteration  3
Iteration  4
Iteration  5
sseSplit, and notSplit:  55.651677074 28.5728301887
Iteration  1
Iteration  2
Iteration  3
Iteration  4
Iteration  5
Iteration  6
sseSplit, and notSplit:  19.5377246377 123.795876289
the bestCentToSplit is:  0
the len of bestClustAss is:  97
In [47]:
print centroids_bk
[[ 6.85  5.01  5.95]]
In [48]:
bkC = clusters_bk.T[0]
bkC = bkC.astype(int)
In [49]:
print bkC
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 0 2 2 2 2 1 2
  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 0 2 0 0 0 0 2 0 0 0 0 0 0 2 2 0 0 0
  0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 0 0 0 2 0 0 2]]
In [50]:
bkC = np.ravel(bkC)
print bkC
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 0 2 2 2 2 1 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 0 2 0 0 0 0 2 0 0 0 0 0 0 2 2 0 0 0 0 2
 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 0 0 0 2 0 0 2]
In [51]:
print completeness_score(target,bkC)
0.696570598725
In [52]:
print homogeneity_score(target,bkC)
0.686320173949
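
For this run, the bisecting k-means scores (about 0.70 completeness and 0.69 homogeneity) are somewhat lower than those obtained above with scikit-learn's KMeans (0.76 and 0.75) and the basic kMeans implementation (0.75 and 0.74); since all of these results depend on random initialization, the exact values will vary from run to run.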