Clustering Related Posts

In [123]:
social()
Out[123]:
submit to reddit

This notebook covers or includes:

  • Introduction to word processing
  • Natural Language Learning Toolkit
  • KMeans Clustering text data
TO DO:

Measuring Similarity Between Text Messages:

This notebook will explore the idea of recommending news posts to a reader based their search query. To do this, we also have to introduce basic text processing. Clustering can be defined as classifying unlabelled data by a measurement of similarity.

One of the most robust methods to quantify meaning in textual data is using the bag-of-word approach. For each word in the post, we count track the number of occurances in a vector (vectorization). In this way the data can be stored in an efficient matrix structure.

Preprocessing:

First we have to convert the text into a bag-of-words. We can do this using scikit's builtin CountVectorizer. The input min_df determines how the function will treat words that are used infrequently. If set to an interger, all words occuring less than that amount will be dropped. If set to a fraction, all words that occur less than the fraction of the overall dataset will be dropped. There are also a lot of other options which will we get into later.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=1)
print vect
CountVectorizer(analyzer=word, binary=False, charset=None, charset_error=None,
        decode_error=strict, dtype=<type 'numpy.int64'>, encoding=utf-8,
        input=content, lowercase=True, max_df=1.0, max_features=None,
        min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=(?u)\b\w\w+\b, tokenizer=None,
        vocabulary=None)

We see that for now the counting is done at the word level (analyzer = word).

In [3]:
content = ['how to open a beer without a bottle opener', 
           'Beer bottles or beer cans',]
X = vect.fit_transform(content)

vect.get_feature_names()
Out[3]:
[u'beer',
 u'bottle',
 u'bottles',
 u'cans',
 u'how',
 u'open',
 u'opener',
 u'or',
 u'to',
 u'without']
In [4]:
#Print the vectorized word occurances
print X
print X.toarray()
  (0, 0)	1
  (1, 0)	2
  (0, 1)	1
  (1, 2)	1
  (1, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 6)	1
  (1, 7)	1
  (0, 8)	1
  (0, 9)	1
[[1 1 0 0 1 1 1 0 1 1]
 [2 0 1 1 0 0 0 1 0 0]]
  • Count vectors returned by transform are stored in the more memory efficient coordinate matrix format, we have to access the full standard vector for analysis though.

Let's add some more data.

In [5]:
posts = ['how to open a beer without a bottle opener', 
           'Do girls like beer bottles or beer cans?',
           'where did all my beer go?',
           'where did all my beer go? where did all my beer go?',
           'recycling beer bottles and cans',
           'Is it worth recycling?',
           'do not bring bottles to my backyard party, only cans please.', 
           'This is useless']
In [6]:
X_train = vect.fit_transform(posts)

num_samples, num_features = X_train.shape

print '#samples: {}, #features: {}'.format(num_samples, num_features)
#samples: 8, #features: 31
  • Unsurprisingly, we have 8 posts with a total of 31 different words. Now we can vectorize our data.

Let's vectorize a new post, then see how similar it is to our existing corpus.

In [7]:
new_post = 'Opening beer bottles and cans 101'
new_post_vect = vect.transform([new_post])

print(new_post_vect).toarray()
[[0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
In [8]:
import scipy as sp

def dists(v1, v2):
    delta = v1-v2
    # Calculate Euclidean "norm" distance
    return sp.linalg.norm(delta.toarray())

import sys

def similarity(new_post_vector, corpus):
    best_dist = 999
    best_i = None
    
    for i in xrange(len(corpus.toarray())):
        post = posts[i]
        
        if post == new_post:
            continue
        post_vec = corpus.getrow(i)
        d = dists(post_vec, new_post_vector)
        print 'Post %i with dist = %.2f: %s'%(i, d, post)
        
        if d < best_dist:
            best_dist = d
            best_i = i
    print 'Best post is {} with dist = {}'.format(best_i, best_dist)
In [9]:
similarity(new_post_vect, X_train)
Post 0 with dist = 3.00: how to open a beer without a bottle opener
Post 1 with dist = 2.45: Do girls like beer bottles or beer cans?
Post 2 with dist = 2.83: where did all my beer go?
Post 3 with dist = 4.90: where did all my beer go? where did all my beer go?
Post 4 with dist = 1.00: recycling beer bottles and cans
Post 5 with dist = 2.83: Is it worth recycling?
Post 6 with dist = 3.32: do not bring bottles to my backyard party, only cans please.
Post 7 with dist = 2.65: This is useless
Best post is 4 with dist = 1.0

Great, our first text similarity measurement! We can see here that post 3 is most similar to our new post. However, we can see that post 2 is "closer" to post 3, even though post 3 is simply post 2 doubled. It is clear the simple counts of words is too simple. The next step is to normalize the word counts to get vectors of unitless lengths to avoid this problem.

In [10]:
# Update our dists function
def dists(v1, v2):
    v1_norm = v1/sp.linalg.norm(v1.toarray())
    v2_norm = v2/sp.linalg.norm(v2.toarray())
    delta = v1_norm-v2_norm
    # Calculate Euclidean "norm" distance
    return sp.linalg.norm(delta.toarray())
In [11]:
similarity(new_post_vect, X_train)
Post 0 with dist = 1.27: how to open a beer without a bottle opener
Post 1 with dist = 0.86: Do girls like beer bottles or beer cans?
Post 2 with dist = 1.26: where did all my beer go?
Post 3 with dist = 1.26: where did all my beer go? where did all my beer go?
Post 4 with dist = 0.46: recycling beer bottles and cans
Post 5 with dist = 1.41: Is it worth recycling?
Post 6 with dist = 1.18: do not bring bottles to my backyard party, only cans please.
Post 7 with dist = 1.41: This is useless
Best post is 4 with dist = 0.459505841095

Great, posts 2 & 3 are now equally similar to our new post.

Removing Less Important Words:

There are many words in language that do not carry much meaning in terms of the overall interpretation of the message. Words like "it" should be much less meaningful than "beer" in our current context. These less important words are called stop words, and can be removed from the posts since they do not help us distiguish between different posts.

In [12]:
#Add english stop words to our vectorizer object.
vect = CountVectorizer(min_df=1, stop_words='english')
#Display a sample
print sorted(vect.get_stop_words())[80:-150]
['empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name']

If you already have a list of words in mind you with to stop, you can simply pass them as a list to the stop_words argument.

Stemming

We also need to consider that similar words, such as "girl" and "girls" should probably be considered as the same word. Thus we need a function that reduces words to a finite 'word stem'. We can do thsi with the Natural Language Toolkit (NLTK). After installing NLTK, import the library and try out the stemmer for english.

In [13]:
import nltk.stem

s = nltk.stem.SnowballStemmer('english')

print s.stem('bottles')
print s.stem('bottle')

print s.stem('perception')
print s.stem('perceptive')

print s.stem('crashing')
print s.stem('crashed')
bottl
bottl
percept
percept
crash
crash

Extending the vectorizer with NLTK stemming

We need to step the posts before we feed then into the CountVectorizer. The best way to do this is overwrite the method build_analyzer.

By doing this we utilize the preprocessing functions in the parent class that converts the raw posts into lower case. We tokenize all the words, and then convert each word into the stemmed version.

In [14]:
import nltk.stem

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
    
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
In [15]:
X = vectorizer.fit_transform(posts)
vectorizer.get_feature_names()
Out[15]:
[u'backyard',
 u'beer',
 u'bottl',
 u'bring',
 u'can',
 u'did',
 u'girl',
 u'like',
 u'open',
 u'parti',
 u'recycl',
 u'useless',
 u'worth']
In [16]:
# Restate the new vectorizer on the data
X_train = vectorizer.fit_transform(posts)
new_post_vect = vectorizer.transform([new_post])

similarity(new_post_vect, X_train)
Post 0 with dist = 0.61: how to open a beer without a bottle opener
Post 1 with dist = 0.77: Do girls like beer bottles or beer cans?
Post 2 with dist = 1.14: where did all my beer go?
Post 3 with dist = 1.14: where did all my beer go? where did all my beer go?
Post 4 with dist = 0.71: recycling beer bottles and cans
Post 5 with dist = 1.41: Is it worth recycling?
Post 6 with dist = 1.05: do not bring bottles to my backyard party, only cans please.
Post 7 with dist = 1.41: This is useless
Best post is 0 with dist = 0.605810893055

We see now that post 0 is most similar to our new post, because bottles and bottle are now treated as the same word.

In [17]:
print new_post
print posts[0]
Opening beer bottles and cans 101
how to open a beer without a bottle opener

Thinking a bit deeper about relevant post features

So far we have considered that higher occurrence of certains words in post equates to a greater importance of that word in the post. While this is true to some extent, there is the case where very frequent words really don't carry any meaning to posts. For example, the word "Subject" appears in every blog post, thus it is not really communicating anything important, and does not help us distinguish between posts.

We could perhaps set a 90% occurrence cutoff in our tokenizer, such that words that occur in >90% of the posts are excluded, however, we still run into the problem of border cases, say where the word occurs in only 89% of the posts.

To solve these problems we count the term frequencies for every post while discounting those words that appear in many posts. This is the concept of term frequency - inverse document frequency (TF-IDF). We can implement this using scikit learn's TfidfVectorizer.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Rebuild the function to include our stemmer

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(TfidfVectorizer, self).build_analyzer()
        
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')

Now instead of counts, our document vectors will contain individual TF-IDF values per term (token).

In [19]:
# Restate new vectorizer
X_train = vectorizer.fit_transform(posts)
new_post_vect = vectorizer.transform([new_post])

similarity(new_post_vect, X_train)
Post 0 with dist = 0.57: how to open a beer without a bottle opener
Post 1 with dist = 0.99: Do girls like beer bottles or beer cans?
Post 2 with dist = 1.26: where did all my beer go?
Post 3 with dist = 1.26: where did all my beer go? where did all my beer go?
Post 4 with dist = 0.90: recycling beer bottles and cans
Post 5 with dist = 1.41: Is it worth recycling?
Post 6 with dist = 1.17: do not bring bottles to my backyard party, only cans please.
Post 7 with dist = 1.41: This is useless
Best post is 0 with dist = 0.572957858071

Recap

So far we have:

  1. Tokenized text
  2. Discard words that occur too often and don't help us detect relevant posts
  3. Throw away very uncommon words
  4. Count the remaining words
  5. Calculated TF-IDF values from the counts, considering the whole text corpus.

Limitations of the bag-of-words approach

  • It does not cover word relations: "Car hits wall" and "Wall hits car" will both have the same feature vector.
  • It does not count negations well: "I will eat soup" and "I will not eat soup" will have very similar feature vectors. Though this can be remedied by also counting bigrams and trigrams (two or three words in a row together).
  • Totally fails with misspelled words.

Clustering

Now that we can represent our blog posts quantitatively, to some degree. Now our goal is to cluster similar posts. There are two main times of clustering algorithms: flat and hierarchical.

Flat clustering divides the posts into sets of clusters that minimizes the difference within clusters and maximized the difference between clusters. Generally we have to specify the number of clusters upfront.

Hierarchical clustering does not require the number of clusters as an input. It creates a hierarchy of clusters where very similar posts are grouped together, then similar clusters are then further grouped recursively until one cluster is left that contains all the data. Once completed, the user can discern the optimal number of clusters.

KMeans

KMeans is probably the most common flat clustering algorithm. First you must specify the number of desired clusters (k). From there, the algorithm first specifies k random seeds within the data. Then it assigns each post to the closest seed centroid. Next, the seeds are relocated to the mean center of the points initially assigned to it. Then the process is repeat, whereby the posts are then reassigned based on the new closest seed point. This continues as long as the seed centroids move a considerable amount, after some n iterations, the movements will fall below a threshold. The algorithm is then considered converged.

Get some test data

We will utilize a machine learning dataset that contains 18 826 posts from 20 different newsgroups. There are many topics including technology, politics, and religion. However, for now we will only use the technical groups.

One question we could ask is, for a certain topic, can we effectivly cluster the newgroups who published that topic into distinct categories?

This data is already split into testing and training data, we can download the data using sklearn.

In [20]:
import sklearn.datasets

save_dir = '/users/ryankelly/downloads/' # Your save file path

# Download data using sklearn
df = sklearn.datasets.load_mlcomp("20news-18828", mlcomp_root=save_dir)
In [22]:
# Data files
print df.filenames
print len(df.filenames)
['/users/ryankelly/downloads/379/raw/comp.graphics/1190-38614'
 '/users/ryankelly/downloads/379/raw/comp.graphics/1383-38616'
 '/users/ryankelly/downloads/379/raw/alt.atheism/487-53344' ...,
 '/users/ryankelly/downloads/379/raw/rec.sport.hockey/10215-54303'
 '/users/ryankelly/downloads/379/raw/sci.crypt/10799-15660'
 '/users/ryankelly/downloads/379/raw/comp.os.ms-windows.misc/2732-10871']
18828
In [23]:
# Data Topics
df.target_names
Out[23]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
In [24]:
# Restrict data to only 'tech' categories
group = ['comp.graphics', 'comp.os.ms-windows.misc', 
         'comp.sys.ibm.pc.hardware', 'comp.sys.ma c.hardware', 
         'comp.windows.x', 'sci.space']
# Reload in only training data with the desired categories
train_data = sklearn.datasets.load_mlcomp('20news-18828', 'train', 
                                          mlcomp_root=save_dir, 
                                          categories=group)
In [25]:
print(len(train_data.filenames))
3414

Clustering posts

While initializing our vectorizer we have to remember that we are working with real data, which has many errors, which in this case invalid characers that cannot be encoded.

In [26]:
vec = StemmedTfidfVectorizer(min_df=10, max_df=0.5,
                              stop_words='english', decode_error='ignore')

vecData = vec.fit_transform(train_data.data)
In [27]:
num_samples, num_features = vecData.shape
print('#samples: {}, #features: {}').format(num_samples, num_features)
#samples: 3414, #features: 4331

This is the information we will use as input for KMeans clustering. Since we know there are 5 topic groups in these data, it makes sense that there could be 5 clusters in the data, so we will try this first.

In [110]:
num_clusters = 5
from sklearn.cluster import KMeans

km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(vecData)
Initialization complete
Iteration  0, inertia 6434.212
Iteration  1, inertia 3302.138
Iteration  2, inertia 3286.234
Iteration  3, inertia 3278.006
Iteration  4, inertia 3274.039
Iteration  5, inertia 3271.234
Iteration  6, inertia 3268.856
Iteration  7, inertia 3267.609
Iteration  8, inertia 3266.964
Iteration  9, inertia 3266.352
Iteration 10, inertia 3265.901
Iteration 11, inertia 3265.509
Iteration 12, inertia 3264.970
Iteration 13, inertia 3263.969
Iteration 14, inertia 3261.887
Iteration 15, inertia 3259.657
Iteration 16, inertia 3258.196
Iteration 17, inertia 3257.560
Iteration 18, inertia 3256.997
Iteration 19, inertia 3256.714
Iteration 20, inertia 3256.482
Iteration 21, inertia 3256.326
Iteration 22, inertia 3256.126
Iteration 23, inertia 3255.998
Iteration 24, inertia 3255.918
Iteration 25, inertia 3255.870
Iteration 26, inertia 3255.826
Iteration 27, inertia 3255.768
Iteration 28, inertia 3255.658
Iteration 29, inertia 3255.574
Iteration 30, inertia 3255.550
Iteration 31, inertia 3255.533
Iteration 32, inertia 3255.527
Iteration 33, inertia 3255.522
Iteration 34, inertia 3255.513
Iteration 35, inertia 3255.508
Iteration 36, inertia 3255.503
Converged at iteration 36
Out[110]:
KMeans(copy_x=True, init='random', max_iter=300, n_clusters=5, n_init=1,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=1)

After fitting, we can get the clustering information out of the labels_ property, and cluster centers from cluster_centers_. We then measure the completeness score to see the percentage of correct predictions.

In [111]:
from sklearn import metrics

metrics.completeness_score(train_data.target, km.labels_)
Out[111]:
0.40904043798434664

39% accuracy isn't the best, but this could be because although there are five different topics, the contents are related between them, why dont we test several k values and see the prediction scores.

In [109]:
from sklearn.cluster import KMeans

def best_k():
    for i in range(2,40):
        best_k = 0
        best_score = 0
        km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
        km.fit(vecData)
        score = metrics.completeness_score(train_data.target, km.labels_)
        if score > best_score:
            best_k = i
            best_score = score
    out = [best_k, best_score]
    return out
    
best_k()
Initialization complete
Iteration  0, inertia 6445.479
Iteration  1, inertia 3292.339
Iteration  2, inertia 3275.461
Iteration  3, inertia 3270.621
Iteration  4, inertia 3268.049
Iteration  5, inertia 3266.777
Iteration  6, inertia 3266.141
Iteration  7, inertia 3265.889
Iteration  8, inertia 3265.754
Iteration  9, inertia 3265.668
Iteration 10, inertia 3265.602
Iteration 11, inertia 3265.509
Iteration 12, inertia 3265.367
Iteration 13, inertia 3265.151
Iteration 14, inertia 3264.775
Iteration 15, inertia 3264.314
Iteration 16, inertia 3263.827
Iteration 17, inertia 3263.243
Iteration 18, inertia 3262.592
Iteration 19, inertia 3262.179
Iteration 20, inertia 3261.991
Iteration 21, inertia 3261.915
Iteration 22, inertia 3261.842
Iteration 23, inertia 3261.741
Iteration 24, inertia 3261.661
Iteration 25, inertia 3261.614
Iteration 26, inertia 3261.582
Iteration 27, inertia 3261.569
Iteration 28, inertia 3261.557
Iteration 29, inertia 3261.539
Iteration 30, inertia 3261.525
Iteration 31, inertia 3261.499
Converged at iteration 31
Initialization complete
Iteration  0, inertia 6524.930
Iteration  1, inertia 3308.247
Iteration  2, inertia 3292.389
Iteration  3, inertia 3283.365
Iteration  4, inertia 3278.358
Iteration  5, inertia 3276.421
Iteration  6, inertia 3275.128
Iteration  7, inertia 3273.981
Iteration  8, inertia 3272.630
Iteration  9, inertia 3270.863
Iteration 10, inertia 3268.894
Iteration 11, inertia 3267.018
Iteration 12, inertia 3265.305
Iteration 13, inertia 3263.985
Iteration 14, inertia 3263.395
Iteration 15, inertia 3262.957
Iteration 16, inertia 3262.720
Iteration 17, inertia 3262.581
Iteration 18, inertia 3262.501
Iteration 19, inertia 3262.414
Iteration 20, inertia 3262.318
Iteration 21, inertia 3262.253
Iteration 22, inertia 3262.192
Iteration 23, inertia 3262.085
Iteration 24, inertia 3261.962
Iteration 25, inertia 3261.815
Iteration 26, inertia 3261.625
Iteration 27, inertia 3261.492
Iteration 28, inertia 3261.394
Iteration 29, inertia 3261.278
Iteration 30, inertia 3261.206
Iteration 31, inertia 3261.134
Iteration 32, inertia 3261.077
Iteration 33, inertia 3261.018
Iteration 34, inertia 3260.997
Iteration 35, inertia 3260.975
Iteration 36, inertia 3260.958
Iteration 37, inertia 3260.949
Converged at iteration 37
Initialization complete
Iteration  0, inertia 6392.513
Iteration  1, inertia 3298.129
Iteration  2, inertia 3286.500
Iteration  3, inertia 3280.842
Iteration  4, inertia 3277.803
Iteration  5, inertia 3276.304
Iteration  6, inertia 3274.915
Iteration  7, inertia 3273.931
Iteration  8, inertia 3273.201
Iteration  9, inertia 3272.640
Iteration 10, inertia 3272.355
Iteration 11, inertia 3272.069
Iteration 12, inertia 3271.870
Iteration 13, inertia 3271.619
Iteration 14, inertia 3271.328
Iteration 15, inertia 3271.052
Iteration 16, inertia 3270.824
Iteration 17, inertia 3270.511
Iteration 18, inertia 3270.053
Iteration 19, inertia 3269.612
Iteration 20, inertia 3269.327
Iteration 21, inertia 3269.190
Iteration 22, inertia 3269.089
Iteration 23, inertia 3269.024
Iteration 24, inertia 3268.943
Iteration 25, inertia 3268.846
Iteration 26, inertia 3268.764
Iteration 27, inertia 3268.697
Iteration 28, inertia 3268.597
Iteration 29, inertia 3268.465
Iteration 30, inertia 3268.295
Iteration 31, inertia 3268.120
Iteration 32, inertia 3267.779
Iteration 33, inertia 3267.203
Iteration 34, inertia 3266.515
Iteration 35, inertia 3265.992
Iteration 36, inertia 3265.674
Iteration 37, inertia 3265.235
Iteration 38, inertia 3264.315
Iteration 39, inertia 3263.987
Iteration 40, inertia 3263.929
Iteration 41, inertia 3263.905
Iteration 42, inertia 3263.885
Iteration 43, inertia 3263.866
Iteration 44, inertia 3263.859
Iteration 45, inertia 3263.852
Converged at iteration 45
Initialization complete
Iteration  0, inertia 6326.529
Iteration  1, inertia 3294.746
Iteration  2, inertia 3282.371
Iteration  3, inertia 3276.461
Iteration  4, inertia 3273.181
Iteration  5, inertia 3271.013
Iteration  6, inertia 3268.783
Iteration  7, inertia 3266.648
Iteration  8, inertia 3265.133
Iteration  9, inertia 3264.077
Iteration 10, inertia 3263.566
Iteration 11, inertia 3263.328
Iteration 12, inertia 3263.232
Iteration 13, inertia 3263.172
Iteration 14, inertia 3263.125
Iteration 15, inertia 3263.087
Iteration 16, inertia 3263.064
Iteration 17, inertia 3263.053
Iteration 18, inertia 3263.047
Iteration 19, inertia 3263.044
Converged at iteration 19
Initialization complete
Iteration  0, inertia 6396.511
Iteration  1, inertia 3292.367
Iteration  2, inertia 3280.269
Iteration  3, inertia 3275.911
Iteration  4, inertia 3272.600
Iteration  5, inertia 3270.273
Iteration  6, inertia 3269.109
Iteration  7, inertia 3268.377
Iteration  8, inertia 3267.638
Iteration  9, inertia 3266.541
Iteration 10, inertia 3265.821
Iteration 11, inertia 3265.175
Iteration 12, inertia 3264.720
Iteration 13, inertia 3264.471
Iteration 14, inertia 3264.307
Iteration 15, inertia 3264.199
Iteration 16, inertia 3264.110
Iteration 17, inertia 3264.035
Iteration 18, inertia 3263.980
Iteration 19, inertia 3263.934
Iteration 20, inertia 3263.922
Iteration 21, inertia 3263.906
Iteration 22, inertia 3263.890
Iteration 23, inertia 3263.867
Iteration 24, inertia 3263.857
Iteration 25, inertia 3263.845
Iteration 26, inertia 3263.827
Iteration 27, inertia 3263.818
Iteration 28, inertia 3263.816
Converged at iteration 28
Initialization complete
Iteration  0, inertia 6431.988
Iteration  1, inertia 3293.092
Iteration  2, inertia 3278.216
Iteration  3, inertia 3269.663
Iteration  4, inertia 3265.719
Iteration  5, inertia 3263.092
Iteration  6, inertia 3261.218
Iteration  7, inertia 3260.260
Iteration  8, inertia 3259.782
Iteration  9, inertia 3259.574
Iteration 10, inertia 3259.506
Iteration 11, inertia 3259.466
Iteration 12, inertia 3259.449
Iteration 13, inertia 3259.435
Iteration 14, inertia 3259.422
Converged at iteration 14
Initialization complete
Iteration  0, inertia 6434.113
Iteration  1, inertia 3296.655
Iteration  2, inertia 3278.784
Iteration  3, inertia 3272.196
Iteration  4, inertia 3270.036
Iteration  5, inertia 3268.580
Iteration  6, inertia 3266.836
Iteration  7, inertia 3265.345
Iteration  8, inertia 3264.172
Iteration  9, inertia 3263.147
Iteration 10, inertia 3262.455
Iteration 11, inertia 3261.793
Iteration 12, inertia 3261.236
Iteration 13, inertia 3260.754
Iteration 14, inertia 3260.035
Iteration 15, inertia 3259.548
Iteration 16, inertia 3259.407
Iteration 17, inertia 3259.335
Iteration 18, inertia 3259.323
Iteration 19, inertia 3259.319
Iteration 20, inertia 3259.313
Iteration 21, inertia 3259.307
Iteration 22, inertia 3259.302
Iteration 23, inertia 3259.298
Iteration 24, inertia 3259.296
Converged at iteration 24
Initialization complete
Iteration  0, inertia 6421.814
Iteration  1, inertia 3300.660
Iteration  2, inertia 3287.858
Iteration  3, inertia 3281.381
Iteration  4, inertia 3276.546
Iteration  5, inertia 3271.531
Iteration  6, inertia 3267.330
Iteration  7, inertia 3264.234
Iteration  8, inertia 3263.418
Iteration  9, inertia 3262.728
Iteration 10, inertia 3262.077
Iteration 11, inertia 3261.563
Iteration 12, inertia 3261.202
Iteration 13, inertia 3260.836
Iteration 14, inertia 3260.469
Iteration 15, inertia 3260.095
Iteration 16, inertia 3259.766
Iteration 17, inertia 3259.590
Iteration 18, inertia 3259.492
Iteration 19, inertia 3259.396
Iteration 20, inertia 3259.263
Iteration 21, inertia 3259.172
Iteration 22, inertia 3259.122
Iteration 23, inertia 3259.087
Iteration 24, inertia 3259.059
Iteration 25, inertia 3259.021
Iteration 26, inertia 3258.983
Iteration 27, inertia 3258.919
Iteration 28, inertia 3258.870
Iteration 29, inertia 3258.826
Iteration 30, inertia 3258.756
Iteration 31, inertia 3258.694
Iteration 32, inertia 3258.621
Iteration 33, inertia 3258.534
Iteration 34, inertia 3258.440
Iteration 35, inertia 3258.277
Iteration 36, inertia 3258.160
Iteration 37, inertia 3258.098
Iteration 38, inertia 3258.041
Iteration 39, inertia 3257.966
Iteration 40, inertia 3257.909
Iteration 41, inertia 3257.860
Iteration 42, inertia 3257.774
Iteration 43, inertia 3257.727
Iteration 44, inertia 3257.694
Iteration 45, inertia 3257.666
Iteration 46, inertia 3257.593
Iteration 47, inertia 3257.551
Iteration 48, inertia 3257.537
Converged at iteration 48
Initialization complete
Iteration  0, inertia 6373.464
Iteration  1, inertia 3297.963
Iteration  2, inertia 3287.660
Iteration  3, inertia 3282.323
Iteration  4, inertia 3279.099
Iteration  5, inertia 3277.759
Iteration  6, inertia 3277.064
Iteration  7, inertia 3276.650
Iteration  8, inertia 3276.232
Iteration  9, inertia 3275.737
Iteration 10, inertia 3275.473
Iteration 11, inertia 3275.339
Iteration 12, inertia 3275.253
Iteration 13, inertia 3275.199
Iteration 14, inertia 3275.158
Iteration 15, inertia 3275.128
Iteration 16, inertia 3275.107
Iteration 17, inertia 3275.076
Iteration 18, inertia 3275.055
Iteration 19, inertia 3275.041
Iteration 20, inertia 3275.022
Iteration 21, inertia 3274.999
Iteration 22, inertia 3274.979
Iteration 23, inertia 3274.960
Iteration 24, inertia 3274.942
Iteration 25, inertia 3274.931
Iteration 26, inertia 3274.926
Iteration 27, inertia 3274.922
Iteration 28, inertia 3274.920
Converged at iteration 28
Initialization complete
Iteration  0, inertia 6289.281
Iteration  1, inertia 3304.450
Iteration  2, inertia 3288.473
Iteration  3, inertia 3282.639
Iteration  4, inertia 3280.544
Iteration  5, inertia 3279.671
Iteration  6, inertia 3279.139
Iteration  7, inertia 3278.606
Iteration  8, inertia 3278.196
Iteration  9, inertia 3277.723
Iteration 10, inertia 3277.261
Iteration 11, inertia 3276.925
Iteration 12, inertia 3276.570
Iteration 13, inertia 3276.117
Iteration 14, inertia 3275.711
Iteration 15, inertia 3275.582
Iteration 16, inertia 3275.538
Iteration 17, inertia 3275.526
Iteration 18, inertia 3275.517
Iteration 19, inertia 3275.509
Converged at iteration 19
Initialization complete
Iteration  0, inertia 6390.941
Iteration  1, inertia 3290.356
Iteration  2, inertia 3274.869
Iteration  3, inertia 3268.843
Iteration  4, inertia 3265.737
Iteration  5, inertia 3264.341
Iteration  6, inertia 3263.580
Iteration  7, inertia 3262.989
Iteration  8, inertia 3262.543
Iteration  9, inertia 3262.156
Iteration 10, inertia 3261.898
Iteration 11, inertia 3261.653
Iteration 12, inertia 3261.429
Iteration 13, inertia 3261.209
Iteration 14, inertia 3260.992
Iteration 15, inertia 3260.760
Iteration 16, inertia 3260.407
Iteration 17, inertia 3259.996
Iteration 18, inertia 3259.382
Iteration 19, inertia 3258.432
Iteration 20, inertia 3257.154
Iteration 21, inertia 3256.723
Iteration 22, inertia 3256.546
Iteration 23, inertia 3256.446
Iteration 24, inertia 3256.391
Iteration 25, inertia 3256.362
Iteration 26, inertia 3256.344
Iteration 27, inertia 3256.339
Converged at iteration 27
Initialization complete
Iteration  0, inertia 6432.035
Iteration  1, inertia 3304.822
Iteration  2, inertia 3291.831
Iteration  3, inertia 3281.480
Iteration  4, inertia 3275.025
Iteration  5, inertia 3270.730
Iteration  6, inertia 3266.021
Iteration  7, inertia 3261.621
Iteration  8, inertia 3259.239
Iteration  9, inertia 3258.382
Iteration 10, inertia 3257.763
Iteration 11, inertia 3257.227
Iteration 12, inertia 3256.768
Iteration 13, inertia 3256.410
Iteration 14, inertia 3256.245
Iteration 15, inertia 3256.139
Iteration 16, inertia 3256.045
Iteration 17, inertia 3256.003
Iteration 18, inertia 3255.975
Iteration 19, inertia 3255.955
Iteration 20, inertia 3255.938
Iteration 21, inertia 3255.926
Iteration 22, inertia 3255.919
Iteration 23, inertia 3255.906
Iteration 24, inertia 3255.901
Iteration 25, inertia 3255.899
Iteration 26, inertia 3255.897
Converged at iteration 26
Initialization complete
Iteration  0, inertia 6439.297
Iteration  1, inertia 3292.780
Iteration  2, inertia 3279.272
Iteration  3, inertia 3275.342
Iteration  4, inertia 3271.297
Iteration  5, inertia 3266.407
Iteration  6, inertia 3264.193
Iteration  7, inertia 3262.548
Iteration  8, inertia 3261.671
Iteration  9, inertia 3260.768
Iteration 10, inertia 3259.996
Iteration 11, inertia 3259.212
Iteration 12, inertia 3258.566
Iteration 13, inertia 3258.245
Iteration 14, inertia 3258.081
Iteration 15, inertia 3257.916
Iteration 16, inertia 3257.788
Iteration 17, inertia 3257.724
Iteration 18, inertia 3257.663
Iteration 19, inertia 3257.642
Iteration 20, inertia 3257.620
Iteration 21, inertia 3257.606
Iteration 22, inertia 3257.599
Iteration 23, inertia 3257.597
Iteration 24, inertia 3257.592
Converged at iteration 24
Initialization complete
Iteration  0, inertia 6437.135
Iteration  1, inertia 3308.062
Iteration  2, inertia 3296.359
Iteration  3, inertia 3288.127
Iteration  4, inertia 3284.844
Iteration  5, inertia 3282.816
Iteration  6, inertia 3280.496
Iteration  7, inertia 3277.755
Iteration  8, inertia 3274.709
Iteration  9, inertia 3271.397
Iteration 10, inertia 3269.900
Iteration 11, inertia 3269.041
Iteration 12, inertia 3268.558
Iteration 13, inertia 3268.149
Iteration 14, inertia 3267.920
Iteration 15, inertia 3267.757
Iteration 16, inertia 3267.569
Iteration 17, inertia 3267.379
Iteration 18, inertia 3267.232
Iteration 19, inertia 3267.083
Iteration 20, inertia 3266.887
Iteration 21, inertia 3266.684
Iteration 22, inertia 3266.575
Iteration 23, inertia 3266.486
Iteration 24, inertia 3266.413
Iteration 25, inertia 3266.331
Iteration 26, inertia 3266.293
Iteration 27, inertia 3266.268
Iteration 28, inertia 3266.235
Iteration 29, inertia 3266.214
Iteration 30, inertia 3266.203
Iteration 31, inertia 3266.192
Iteration 32, inertia 3266.186
Iteration 33, inertia 3266.183
Converged at iteration 33
Initialization complete
Iteration  0, inertia 6493.097
Iteration  1, inertia 3302.676
Iteration  2, inertia 3285.066
Iteration  3, inertia 3278.241
Iteration  4, inertia 3274.562
Iteration  5, inertia 3270.829
Iteration  6, inertia 3265.238
Iteration  7, inertia 3261.167
Iteration  8, inertia 3259.118
Iteration  9, inertia 3258.502
Iteration 10, inertia 3258.201
Iteration 11, inertia 3257.948
Iteration 12, inertia 3257.797
Iteration 13, inertia 3257.716
Iteration 14, inertia 3257.673
Iteration 15, inertia 3257.666
Converged at iteration 15
Initialization complete
Iteration  0, inertia 6359.146
Iteration  1, inertia 3291.541
Iteration  2, inertia 3279.445
Iteration  3, inertia 3275.558
Iteration  4, inertia 3273.488
Iteration  5, inertia 3272.191
Iteration  6, inertia 3271.287
Iteration  7, inertia 3270.702
Iteration  8, inertia 3270.374
Iteration  9, inertia 3270.197
Iteration 10, inertia 3269.949
Iteration 11, inertia 3269.697
Iteration 12, inertia 3269.348
Iteration 13, inertia 3268.820
Iteration 14, inertia 3267.955
Iteration 15, inertia 3266.767
Iteration 16, inertia 3265.877
Iteration 17, inertia 3265.359
Iteration 18, inertia 3264.872
Iteration 19, inertia 3264.386
Iteration 20, inertia 3263.777
Iteration 21, inertia 3263.350
Iteration 22, inertia 3262.954
Iteration 23, inertia 3262.645
Iteration 24, inertia 3262.343
Iteration 25, inertia 3262.119
Iteration 26, inertia 3262.012
Iteration 27, inertia 3261.943
Iteration 28, inertia 3261.875
Iteration 29, inertia 3261.808
Iteration 30, inertia 3261.770
Iteration 31, inertia 3261.744
Iteration 32, inertia 3261.707
Iteration 33, inertia 3261.679
Iteration 34, inertia 3261.674
Iteration 35, inertia 3261.669
Iteration 36, inertia 3261.667
Converged at iteration 36
Initialization complete
Iteration  0, inertia 6373.946
Iteration  1, inertia 3294.749
Iteration  2, inertia 3278.626
Iteration  3, inertia 3273.958
Iteration  4, inertia 3271.969
Iteration  5, inertia 3270.800
Iteration  6, inertia 3269.873
Iteration  7, inertia 3269.060
Iteration  8, inertia 3268.193
Iteration  9, inertia 3267.473
Iteration 10, inertia 3266.822
Iteration 11, inertia 3266.335
Iteration 12, inertia 3266.065
Iteration 13, inertia 3265.876
Iteration 14, inertia 3265.720
Iteration 15, inertia 3265.663
Iteration 16, inertia 3265.627
Iteration 17, inertia 3265.610
Iteration 18, inertia 3265.577
Iteration 19, inertia 3265.549
Iteration 20, inertia 3265.523
Iteration 21, inertia 3265.513
Iteration 22, inertia 3265.503
Iteration 23, inertia 3265.497
Converged at iteration 23
Initialization complete
Iteration  0, inertia 6454.118
Iteration  1, inertia 3303.824
Iteration  2, inertia 3288.688
Iteration  3, inertia 3282.998
Iteration  4, inertia 3279.922
Iteration  5, inertia 3278.183
Iteration  6, inertia 3276.889
Iteration  7, inertia 3275.991
Iteration  8, inertia 3275.039
Iteration  9, inertia 3273.694
Iteration 10, inertia 3272.089
Iteration 11, inertia 3270.481
Iteration 12, inertia 3269.142
Iteration 13, inertia 3267.853
Iteration 14, inertia 3266.220
Iteration 15, inertia 3264.370
Iteration 16, inertia 3262.774
Iteration 17, inertia 3261.495
Iteration 18, inertia 3260.136
Iteration 19, inertia 3258.555
Iteration 20, inertia 3256.940
Iteration 21, inertia 3256.170
Iteration 22, inertia 3255.746
Iteration 23, inertia 3255.497
Iteration 24, inertia 3255.385
Iteration 25, inertia 3255.340
Iteration 26, inertia 3255.299
Iteration 27, inertia 3255.283
Converged at iteration 27
Initialization complete
Iteration  0, inertia 6362.969
Iteration  1, inertia 3296.454
Iteration  2, inertia 3282.673
Iteration  3, inertia 3275.059
Iteration  4, inertia 3269.156
Iteration  5, inertia 3264.227
Iteration  6, inertia 3259.917
Iteration  7, inertia 3257.108
Iteration  8, inertia 3256.442
Iteration  9, inertia 3256.069
Iteration 10, inertia 3255.857
Iteration 11, inertia 3255.774
Iteration 12, inertia 3255.708
Iteration 13, inertia 3255.674
Iteration 14, inertia 3255.650
Iteration 15, inertia 3255.635
Iteration 16, inertia 3255.631
Iteration 17, inertia 3255.629
Converged at iteration 17
Initialization complete
Iteration  0, inertia 6476.107
Iteration  1, inertia 3296.095
Iteration  2, inertia 3282.579
Iteration  3, inertia 3276.890
Iteration  4, inertia 3272.801
Iteration  5, inertia 3268.908
Iteration  6, inertia 3266.789
Iteration  7, inertia 3265.977
Iteration  8, inertia 3265.409
Iteration  9, inertia 3264.982
Iteration 10, inertia 3264.650
Iteration 11, inertia 3264.401
Iteration 12, inertia 3264.138
Iteration 13, inertia 3263.900
Iteration 14, inertia 3263.748
Iteration 15, inertia 3263.628
Iteration 16, inertia 3263.528
Iteration 17, inertia 3263.422
Iteration 18, inertia 3263.345
Iteration 19, inertia 3263.335
Iteration 20, inertia 3263.326
Iteration 21, inertia 3263.324
Converged at iteration 21
Initialization complete
Iteration  0, inertia 6467.892
Iteration  1, inertia 3299.482
Iteration  2, inertia 3284.474
Iteration  3, inertia 3276.773
Iteration  4, inertia 3273.421
Iteration  5, inertia 3271.134
Iteration  6, inertia 3269.243
Iteration  7, inertia 3268.631
Iteration  8, inertia 3268.409
Iteration  9, inertia 3268.296
Iteration 10, inertia 3268.184
Iteration 11, inertia 3268.000
Iteration 12, inertia 3267.834
Iteration 13, inertia 3267.674
Iteration 14, inertia 3267.473
Iteration 15, inertia 3267.362
Iteration 16, inertia 3267.273
Iteration 17, inertia 3267.147
Iteration 18, inertia 3267.035
Iteration 19, inertia 3266.914
Iteration 20, inertia 3266.829
Iteration 21, inertia 3266.699
Iteration 22, inertia 3266.545
Iteration 23, inertia 3266.270
Iteration 24, inertia 3265.958
Iteration 25, inertia 3265.560
Iteration 26, inertia 3265.069
Iteration 27, inertia 3264.684
Iteration 28, inertia 3264.510
Iteration 29, inertia 3264.421
Iteration 30, inertia 3264.306
Iteration 31, inertia 3264.165
Iteration 32, inertia 3264.036
Iteration 33, inertia 3263.952
Iteration 34, inertia 3263.910
Iteration 35, inertia 3263.856
Iteration 36, inertia 3263.814
Iteration 37, inertia 3263.778
Iteration 38, inertia 3263.729
Iteration 39, inertia 3263.623
Iteration 40, inertia 3263.525
Iteration 41, inertia 3263.408
Iteration 42, inertia 3263.292
Iteration 43, inertia 3263.134
Iteration 44, inertia 3262.944
Iteration 45, inertia 3262.742
Iteration 46, inertia 3262.450
Iteration 47, inertia 3261.958
Iteration 48, inertia 3260.961
Iteration 49, inertia 3259.360
Iteration 50, inertia 3258.312
Iteration 51, inertia 3257.919
Iteration 52, inertia 3257.750
Iteration 53, inertia 3257.643
Iteration 54, inertia 3257.588
Iteration 55, inertia 3257.580
Converged at iteration 55
Initialization complete
Iteration  0, inertia 6420.248
Iteration  1, inertia 3304.572
Iteration  2, inertia 3289.501
Iteration  3, inertia 3282.402
Iteration  4, inertia 3278.539
Iteration  5, inertia 3276.338
Iteration  6, inertia 3274.250
Iteration  7, inertia 3272.702
Iteration  8, inertia 3270.959
Iteration  9, inertia 3269.232
Iteration 10, inertia 3267.949
Iteration 11, inertia 3266.887
Iteration 12, inertia 3265.973
Iteration 13, inertia 3265.242
Iteration 14, inertia 3264.568
Iteration 15, inertia 3264.087
Iteration 16, inertia 3263.834
Iteration 17, inertia 3263.631
Iteration 18, inertia 3263.505
Iteration 19, inertia 3263.451
Iteration 20, inertia 3263.379
Iteration 21, inertia 3263.328
Iteration 22, inertia 3263.294
Iteration 23, inertia 3263.249
Iteration 24, inertia 3263.226
Iteration 25, inertia 3263.212
Iteration 26, inertia 3263.198
Iteration 27, inertia 3263.185
Iteration 28, inertia 3263.176
Iteration 29, inertia 3263.173
Iteration 30, inertia 3263.171
Converged at iteration 30
Initialization complete
Iteration  0, inertia 6400.961
Iteration  1, inertia 3298.251
Iteration  2, inertia 3280.432
Iteration  3, inertia 3275.345
Iteration  4, inertia 3273.142
Iteration  5, inertia 3271.588
Iteration  6, inertia 3269.971
Iteration  7, inertia 3268.344
Iteration  8, inertia 3267.296
Iteration  9, inertia 3266.664
Iteration 10, inertia 3265.748
Iteration 11, inertia 3264.808
Iteration 12, inertia 3263.649
Iteration 13, inertia 3262.882
Iteration 14, inertia 3262.461
Iteration 15, inertia 3262.228
Iteration 16, inertia 3262.058
Iteration 17, inertia 3261.915
Iteration 18, inertia 3261.792
Iteration 19, inertia 3261.680
Iteration 20, inertia 3261.592
Iteration 21, inertia 3261.520
Iteration 22, inertia 3261.401
Iteration 23, inertia 3261.279
Iteration 24, inertia 3261.215
Iteration 25, inertia 3261.126
Iteration 26, inertia 3261.046
Iteration 27, inertia 3260.992
Iteration 28, inertia 3260.953
Iteration 29, inertia 3260.912
Iteration 30, inertia 3260.862
Iteration 31, inertia 3260.815
Iteration 32, inertia 3260.791
Iteration 33, inertia 3260.779
Iteration 34, inertia 3260.773
Iteration 35, inertia 3260.761
Iteration 36, inertia 3260.741
Iteration 37, inertia 3260.718
Iteration 38, inertia 3260.701
Iteration 39, inertia 3260.698
Iteration 40, inertia 3260.688
Iteration 41, inertia 3260.677
Iteration 42, inertia 3260.672
Converged at iteration 42
Initialization complete
Iteration  0, inertia 6472.336
Iteration  1, inertia 3303.275
Iteration  2, inertia 3284.138
Iteration  3, inertia 3274.154
Iteration  4, inertia 3268.411
Iteration  5, inertia 3265.076
Iteration  6, inertia 3262.077
Iteration  7, inertia 3261.408
Iteration  8, inertia 3260.914
Iteration  9, inertia 3260.573
Iteration 10, inertia 3260.182
Iteration 11, inertia 3259.746
Iteration 12, inertia 3259.141
Iteration 13, inertia 3258.615
Iteration 14, inertia 3258.188
Iteration 15, inertia 3257.699
Iteration 16, inertia 3257.071
Iteration 17, inertia 3256.768
Iteration 18, inertia 3256.620
Iteration 19, inertia 3256.475
Iteration 20, inertia 3256.358
Iteration 21, inertia 3256.205
Iteration 22, inertia 3256.133
Iteration 23, inertia 3256.099
Iteration 24, inertia 3256.074
Iteration 25, inertia 3256.063
Iteration 26, inertia 3256.057
Iteration 27, inertia 3256.055
Iteration 28, inertia 3256.053
Converged at iteration 28
Initialization complete
Iteration  0, inertia 6409.635
Iteration  1, inertia 3306.802
Iteration  2, inertia 3292.770
Iteration  3, inertia 3282.228
Iteration  4, inertia 3274.919
Iteration  5, inertia 3269.284
Iteration  6, inertia 3265.479
Iteration  7, inertia 3262.476
Iteration  8, inertia 3260.595
Iteration  9, inertia 3259.696
Iteration 10, inertia 3259.101
Iteration 11, inertia 3258.481
Iteration 12, inertia 3258.167
Iteration 13, inertia 3257.964
Iteration 14, inertia 3257.725
Iteration 15, inertia 3257.538
Iteration 16, inertia 3257.429
Iteration 17, inertia 3257.344
Iteration 18, inertia 3257.202
Iteration 19, inertia 3257.062
Iteration 20, inertia 3256.865
Iteration 21, inertia 3256.692
Iteration 22, inertia 3256.549
Iteration 23, inertia 3256.403
Iteration 24, inertia 3256.245
Iteration 25, inertia 3256.127
Iteration 26, inertia 3256.025
Iteration 27, inertia 3255.952
Iteration 28, inertia 3255.853
Iteration 29, inertia 3255.769
Iteration 30, inertia 3255.630
Iteration 31, inertia 3255.571
Iteration 32, inertia 3255.543
Iteration 33, inertia 3255.516
Iteration 34, inertia 3255.496
Iteration 35, inertia 3255.489
Converged at iteration 35
Initialization complete
Iteration  0, inertia 6414.364
Iteration  1, inertia 3292.636
Iteration  2, inertia 3274.091
Iteration  3, inertia 3266.486
Iteration  4, inertia 3263.416
Iteration  5, inertia 3261.789
Iteration  6, inertia 3260.794
Iteration  7, inertia 3260.258
Iteration  8, inertia 3259.941
Iteration  9, inertia 3259.658
Iteration 10, inertia 3259.351
Iteration 11, inertia 3258.914
Iteration 12, inertia 3258.190
Iteration 13, inertia 3257.195
Iteration 14, inertia 3256.270
Iteration 15, inertia 3255.707
Iteration 16, inertia 3255.556
Iteration 17, inertia 3255.521
Iteration 18, inertia 3255.483
Iteration 19, inertia 3255.469
Iteration 20, inertia 3255.460
Iteration 21, inertia 3255.456
Converged at iteration 21
Initialization complete
Iteration  0, inertia 6324.895
Iteration  1, inertia 3293.965
Iteration  2, inertia 3275.830
Iteration  3, inertia 3267.741
Iteration  4, inertia 3263.209
Iteration  5, inertia 3261.451
Iteration  6, inertia 3260.725
Iteration  7, inertia 3260.367
Iteration  8, inertia 3260.137
Iteration  9, inertia 3259.991
Iteration 10, inertia 3259.924
Iteration 11, inertia 3259.890
Iteration 12, inertia 3259.877
Iteration 13, inertia 3259.861
Converged at iteration 13
Initialization complete
Iteration  0, inertia 6439.038
Iteration  1, inertia 3291.442
Iteration  2, inertia 3276.028
Iteration  3, inertia 3271.637
Iteration  4, inertia 3269.695
Iteration  5, inertia 3268.859
Iteration  6, inertia 3268.340
Iteration  7, inertia 3267.780
Iteration  8, inertia 3267.261
Iteration  9, inertia 3266.530
Iteration 10, inertia 3265.668
Iteration 11, inertia 3264.816
Iteration 12, inertia 3263.986
Iteration 13, inertia 3263.582
Iteration 14, inertia 3263.172
Iteration 15, inertia 3262.976
Iteration 16, inertia 3262.861
Iteration 17, inertia 3262.783
Iteration 18, inertia 3262.751
Iteration 19, inertia 3262.726
Iteration 20, inertia 3262.708
Iteration 21, inertia 3262.699
Converged at iteration 21
Initialization complete
Iteration  0, inertia 6458.746
Iteration  1, inertia 3309.368
Iteration  2, inertia 3296.435
Iteration  3, inertia 3288.927
Iteration  4, inertia 3282.518
Iteration  5, inertia 3275.289
Iteration  6, inertia 3267.311
Iteration  7, inertia 3264.367
Iteration  8, inertia 3263.004
Iteration  9, inertia 3262.378
Iteration 10, inertia 3261.967
Iteration 11, inertia 3261.658
Iteration 12, inertia 3261.507
Iteration 13, inertia 3261.294
Iteration 14, inertia 3261.093
Iteration 15, inertia 3260.902
Iteration 16, inertia 3260.740
Iteration 17, inertia 3260.652
Iteration 18, inertia 3260.585
Iteration 19, inertia 3260.539
Iteration 20, inertia 3260.491
Iteration 21, inertia 3260.454
Iteration 22, inertia 3260.426
Iteration 23, inertia 3260.412
Iteration 24, inertia 3260.405
Iteration 25, inertia 3260.402
Iteration 26, inertia 3260.398
Iteration 27, inertia 3260.390
Iteration 28, inertia 3260.382
Iteration 29, inertia 3260.380
Iteration 30, inertia 3260.376
Converged at iteration 30
Initialization complete
Iteration  0, inertia 6350.535
Iteration  1, inertia 3291.919
Iteration  2, inertia 3279.374
Iteration  3, inertia 3273.346
Iteration  4, inertia 3269.117
Iteration  5, inertia 3266.915
Iteration  6, inertia 3265.431
Iteration  7, inertia 3264.712
Iteration  8, inertia 3264.349
Iteration  9, inertia 3264.067
Iteration 10, inertia 3263.850
Iteration 11, inertia 3263.726
Iteration 12, inertia 3263.650
Iteration 13, inertia 3263.619
Iteration 14, inertia 3263.607
Iteration 15, inertia 3263.597
Converged at iteration 15
Initialization complete
Iteration  0, inertia 6456.248
Iteration  1, inertia 3300.444
Iteration  2, inertia 3283.503
Iteration  3, inertia 3276.788
Iteration  4, inertia 3274.204
Iteration  5, inertia 3272.677
Iteration  6, inertia 3271.439
Iteration  7, inertia 3270.415
Iteration  8, inertia 3269.341
Iteration  9, inertia 3268.165
Iteration 10, inertia 3267.504
Iteration 11, inertia 3267.135
Iteration 12, inertia 3266.829
Iteration 13, inertia 3266.572
Iteration 14, inertia 3266.337
Iteration 15, inertia 3266.077
Iteration 16, inertia 3265.841
Iteration 17, inertia 3265.544
Iteration 18, inertia 3265.359
Iteration 19, inertia 3265.181
Iteration 20, inertia 3265.045
Iteration 21, inertia 3264.936
Iteration 22, inertia 3264.811
Iteration 23, inertia 3264.654
Iteration 24, inertia 3264.496
Iteration 25, inertia 3264.081
Iteration 26, inertia 3263.339
Iteration 27, inertia 3261.533
Iteration 28, inertia 3258.654
Iteration 29, inertia 3256.621
Iteration 30, inertia 3255.979
Iteration 31, inertia 3255.643
Iteration 32, inertia 3255.477
Iteration 33, inertia 3255.403
Iteration 34, inertia 3255.360
Iteration 35, inertia 3255.335
Converged at iteration 35
Initialization complete
Iteration  0, inertia 6451.563
Iteration  1, inertia 3304.684
Iteration  2, inertia 3285.713
Iteration  3, inertia 3279.365
Iteration  4, inertia 3277.067
Iteration  5, inertia 3275.508
Iteration  6, inertia 3274.519
Iteration  7, inertia 3273.507
Iteration  8, inertia 3272.746
Iteration  9, inertia 3272.162
Iteration 10, inertia 3271.657
Iteration 11, inertia 3271.264
Iteration 12, inertia 3270.956
Iteration 13, inertia 3270.540
Iteration 14, inertia 3270.082
Iteration 15, inertia 3269.869
Iteration 16, inertia 3269.726
Iteration 17, inertia 3269.584
Iteration 18, inertia 3269.468
Iteration 19, inertia 3269.352
Iteration 20, inertia 3269.178
Iteration 21, inertia 3269.011
Iteration 22, inertia 3268.723
Iteration 23, inertia 3268.353
Iteration 24, inertia 3267.843
Iteration 25, inertia 3267.215
Iteration 26, inertia 3266.362
Iteration 27, inertia 3265.584
Iteration 28, inertia 3265.157
Iteration 29, inertia 3264.786
Iteration 30, inertia 3264.364
Iteration 31, inertia 3263.901
Iteration 32, inertia 3263.552
Iteration 33, inertia 3263.260
Iteration 34, inertia 3262.937
Iteration 35, inertia 3262.485
Iteration 36, inertia 3261.695
Iteration 37, inertia 3261.107
Iteration 38, inertia 3260.828
Iteration 39, inertia 3260.594
Iteration 40, inertia 3260.428
Iteration 41, inertia 3260.389
Iteration 42, inertia 3260.367
Iteration 43, inertia 3260.365
Iteration 44, inertia 3260.359
Converged at iteration 44
Initialization complete
Iteration  0, inertia 6405.600
Iteration  1, inertia 3302.004
Iteration  2, inertia 3283.203
Iteration  3, inertia 3276.145
Iteration  4, inertia 3273.083
Iteration  5, inertia 3271.498
Iteration  6, inertia 3270.418
Iteration  7, inertia 3269.699
Iteration  8, inertia 3268.915
Iteration  9, inertia 3267.884
Iteration 10, inertia 3266.646
Iteration 11, inertia 3265.083
Iteration 12, inertia 3263.472
Iteration 13, inertia 3262.431
Iteration 14, inertia 3261.918
Iteration 15, inertia 3261.636
Iteration 16, inertia 3261.445
Iteration 17, inertia 3261.310
Iteration 18, inertia 3261.224
Iteration 19, inertia 3261.135
Iteration 20, inertia 3261.059
Iteration 21, inertia 3261.018
Iteration 22, inertia 3260.983
Iteration 23, inertia 3260.947
Iteration 24, inertia 3260.900
Iteration 25, inertia 3260.840
Iteration 26, inertia 3260.790
Iteration 27, inertia 3260.764
Iteration 28, inertia 3260.743
Iteration 29, inertia 3260.738
Converged at iteration 29
Initialization complete
Iteration  0, inertia 6448.216
Iteration  1, inertia 3298.831
Iteration  2, inertia 3279.635
Iteration  3, inertia 3269.284
Iteration  4, inertia 3263.260
Iteration  5, inertia 3259.594
Iteration  6, inertia 3257.439
Iteration  7, inertia 3256.139
Iteration  8, inertia 3255.675
Iteration  9, inertia 3255.538
Iteration 10, inertia 3255.445
Iteration 11, inertia 3255.393
Iteration 12, inertia 3255.364
Iteration 13, inertia 3255.356
Iteration 14, inertia 3255.344
Iteration 15, inertia 3255.334
Converged at iteration 15
Initialization complete
Iteration  0, inertia 6455.246
Iteration  1, inertia 3306.953
Iteration  2, inertia 3294.150
Iteration  3, inertia 3287.016
Iteration  4, inertia 3283.105
Iteration  5, inertia 3280.206
Iteration  6, inertia 3277.649
Iteration  7, inertia 3275.314
Iteration  8, inertia 3273.816
Iteration  9, inertia 3272.719
Iteration 10, inertia 3271.792
Iteration 11, inertia 3270.814
Iteration 12, inertia 3270.039
Iteration 13, inertia 3269.696
Iteration 14, inertia 3269.384
Iteration 15, inertia 3269.025
Iteration 16, inertia 3268.540
Iteration 17, inertia 3268.051
Iteration 18, inertia 3267.514
Iteration 19, inertia 3267.302
Iteration 20, inertia 3267.222
Iteration 21, inertia 3267.177
Iteration 22, inertia 3267.135
Iteration 23, inertia 3267.080
Iteration 24, inertia 3266.960
Iteration 25, inertia 3266.678
Iteration 26, inertia 3265.716
Iteration 27, inertia 3262.812
Iteration 28, inertia 3257.784
Iteration 29, inertia 3256.421
Iteration 30, inertia 3255.818
Iteration 31, inertia 3255.614
Iteration 32, inertia 3255.518
Iteration 33, inertia 3255.469
Iteration 34, inertia 3255.443
Iteration 35, inertia 3255.435
Iteration 36, inertia 3255.429
Iteration 37, inertia 3255.420
Iteration 38, inertia 3255.416
Iteration 39, inertia 3255.409
Iteration 40, inertia 3255.399
Iteration 41, inertia 3255.378
Iteration 42, inertia 3255.365
Iteration 43, inertia 3255.355
Iteration 44, inertia 3255.347
Iteration 45, inertia 3255.345
Iteration 46, inertia 3255.342
Iteration 47, inertia 3255.340
Converged at iteration 47
Initialization complete
Iteration  0, inertia 6373.585
Iteration  1, inertia 3295.265
Iteration  2, inertia 3276.429
Iteration  3, inertia 3270.790
Iteration  4, inertia 3269.210
Iteration  5, inertia 3268.392
Iteration  6, inertia 3267.849
Iteration  7, inertia 3267.406
Iteration  8, inertia 3267.006
Iteration  9, inertia 3266.540
Iteration 10, inertia 3266.094
Iteration 11, inertia 3265.727
Iteration 12, inertia 3265.176
Iteration 13, inertia 3264.168
Iteration 14, inertia 3262.569
Iteration 15, inertia 3261.010
Iteration 16, inertia 3260.253
Iteration 17, inertia 3260.028
Iteration 18, inertia 3259.907
Iteration 19, inertia 3259.861
Iteration 20, inertia 3259.830
Iteration 21, inertia 3259.785
Iteration 22, inertia 3259.758
Iteration 23, inertia 3259.755
Converged at iteration 23
Initialization complete
Iteration  0, inertia 6354.581
Iteration  1, inertia 3307.480
Iteration  2, inertia 3294.591
Iteration  3, inertia 3286.870
Iteration  4, inertia 3283.171
Iteration  5, inertia 3280.286
Iteration  6, inertia 3277.624
Iteration  7, inertia 3275.121
Iteration  8, inertia 3272.140
Iteration  9, inertia 3269.519
Iteration 10, inertia 3267.144
Iteration 11, inertia 3264.701
Iteration 12, inertia 3262.442
Iteration 13, inertia 3260.466
Iteration 14, inertia 3258.164
Iteration 15, inertia 3257.111
Iteration 16, inertia 3256.494
Iteration 17, inertia 3255.938
Iteration 18, inertia 3255.690
Iteration 19, inertia 3255.623
Iteration 20, inertia 3255.598
Iteration 21, inertia 3255.591
Iteration 22, inertia 3255.587
Iteration 23, inertia 3255.583
Converged at iteration 23
Initialization complete
Iteration  0, inertia 6456.341
Iteration  1, inertia 3299.840
Iteration  2, inertia 3286.698
Iteration  3, inertia 3281.930
Iteration  4, inertia 3279.365
Iteration  5, inertia 3275.912
Iteration  6, inertia 3271.700
Iteration  7, inertia 3268.976
Iteration  8, inertia 3267.243
Iteration  9, inertia 3266.373
Iteration 10, inertia 3265.959
Iteration 11, inertia 3265.614
Iteration 12, inertia 3265.320
Iteration 13, inertia 3265.040
Iteration 14, inertia 3264.620
Iteration 15, inertia 3264.257
Iteration 16, inertia 3264.017
Iteration 17, inertia 3263.875
Iteration 18, inertia 3263.794
Iteration 19, inertia 3263.725
Iteration 20, inertia 3263.691
Iteration 21, inertia 3263.666
Iteration 22, inertia 3263.640
Iteration 23, inertia 3263.625
Iteration 24, inertia 3263.621
Iteration 25, inertia 3263.610
Iteration 26, inertia 3263.607
Iteration 27, inertia 3263.604
Converged at iteration 27
Out[109]:
[39, 0.40027932557045898]

40% accuracy using 39 clusters is only marginally better than our model with 5 clusters, we will definately choose the simpler model moving forward. Remember though that these results are still in sample error, and are probably better than we can expect on real data.

Solve a real problem

Now we are at the stage where we can recommend similar articles to the user. This could be implemented as part of the serach algorithm, or simply recommended posts to read after the current page.

We first need to vectorize the new post before we predict it's label.

In [114]:
new_post = '''hard drives can fail at any time,
                    it is important to always backup your data.'''
                    
new_post_vec = vec.transform([new_post])
new_post_label = km.predict(new_post_vec)[0] # predict the class it belongs to
In [115]:
# Select all posts with the same cluster label as the new post vector
similar_label = (km.labels_ == new_post_label).nonzero()[0]

Now, between the records we know are similar, we build a new list of similarity scores, similar to what we did above in earlier examples.

In [116]:
similar = []
for i in similar_label:
    dist = sp.linalg.norm((new_post_vec - vecData[i].toarray()))
    similar.append((dist, train_data.target[i], train_data.data[i]))
similar = sorted(similar)
print(len(similar))
175
In [117]:
# Present the most similar posts
print similar[0]
(1.1757159813728066, 2, 'From: [email protected] (George Pandelios)\nSubject:  Help me select a Backup Solution\n\n\nHi Netters!\n\nI\'m looking at purchasing some sort of backup solution.  After you read about\nmy situation, I\'d like your opinion.  Here\'s the scenario:\n\n1.  There are two computers in the house.  One is a small 286 (40MB IDE drive).\n    The other is a 386DX (213 SCSI drive w/ Adaptec 1522 controller).  Both \n    systems have PC TOOLS and will use Central Point Backup as the backup / \n    restore program.  Both systems have 3.5" and 5.25" floppies.\n\n2.  The computers are not networked (nor will they be anytime soon).\n\nFrom what I have seen so far, there appear to be at least 4 possible\nsolutions (I\'m sure there are others I haven\'t thought about).  For these \noptions, I would appreciate hearing from anyone who has tried them or sees \nany flaws (drive type X won\'t coexist with device Y, etc.) in my thinking \n(I don\'t know very much about these beasts):\n\n1.  Put 2.88MB floppy drives (or a combination drive) on each system.\n    Can someone supply cost and brand information?  What\'s a good brand?\n    What do the floppies themselves cost?\n\n\n2.  Put an internal tape backup unit on the 386 using my SCSI adapter, and\n    continue to back up the 286 with floppies.  Again, can someone recommend a\n    few manufacturers?  The only brand I remember is Colorado Memories.  Any\n    happy or unhappy users (I know about the compression controversy)?\n \n\n3.  Connect an external tape backup unit on the 386 using my SCSI adapter, and\n    (maybe?) connect it to the 286 somehow (any suggestions?)\n\n\n4.  Install a Floptical drive in each machine.  Again, any gotcha\'s or \n    recommendations for manufacturers?  \n\nI appreciate your help.  You may either post or send me e-mail.  I will\nsummarize all responses for the net.\n\nThanks,\n\nGeorge\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n  George J. Pandelios\t\t\t\tInternet:  [email protected]\n  Software Engineering Institute\t\tusenet:\t   sei!gjp\n  4500 Fifth Avenue\t\t\t\tVoice:\t   (412) 268-7186\n  Pittsburgh, PA 15213\t\t\t\tFAX:\t   (412) 268-5758\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\nDisclaimer:  These opinions are my own and do not reflect those of the\n\t     Software Engineering Institute, its sponsors, customers, \n\t     clients, affiliates, or Carnegie Mellon University.  In fact,\n\t     any resemblence of these opinions to any individual, living\n\t     or dead, fictional or real, is purely coincidental.  So there.\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n')
In [122]:
from IPython.core.display import HTML


def css_styling():
    styles = open("/users/ryankelly/desktop/custom_notebook.css", "r").read()

    return HTML(styles)
css_styling()
Out[122]:
In [121]:
def social():
    code = """
    <a style='float:left; margin-right:5px;' href="https://twitter.com/share" class="twitter-share-button" data-text="Check this out" data-via="Ryanmdk">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;' href="https://twitter.com/Ryanmdk" class="twitter-follow-button" data-show-count="false">Follow @Ryanmdk</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;'target='_parent' href="http://www.reddit.com/submit" onclick="window.location = 'http://www.reddit.com/submit?url=' + encodeURIComponent(window.location); return false"> <img src="http://www.reddit.com/static/spreddit7.gif" alt="submit to reddit" border="0" /> </a>
<script src="//platform.linkedin.com/in.js" type="text/javascript">
  lang: en_US
</script>
<script type="IN/Share"></script>
"""
    return HTML(code)