Simple Clustering With SciPy¶

Author: Chris Albon, @ChrisAlbon
Date: -
Repo: Python 3 code snippets for data science
Note:

Import modules¶

In [42]:

import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.cluster import vq

Create coordinates for battles for each year of the war¶

In [14]:

# create 100 coordinate pairs (i.e. two values), then add 5 to all of them
year_1 = np.random.randn(100, 2) + 5

# create 30 coordinate pairs (i.e. two values), then subtract 5 from all of them
year_2 = np.random.randn(30, 2) - 5

# create 50 coordinate pairs (i.e. two values)
year_3 = np.random.randn(50, 2)
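
Because np.random.randn is unseeded here, the coordinates change on every run, so the values printed later in this notebook will not reproduce exactly. A minimal sketch of a reproducible variant, assuming NumPy's default_rng generator API is available; the seed value 0 and the name rng are arbitrary choices and not part of the original notebook:

```python
import numpy as np

# Seeded generator: the seed (0) is an arbitrary, illustrative choice
rng = np.random.default_rng(0)

# Same shapes and offsets as the cell above, drawn from a standard normal
year_1 = rng.standard_normal((100, 2)) + 5
year_2 = rng.standard_normal((30, 2)) - 5
year_3 = rng.standard_normal((50, 2))

# Each array has one row per battle and two columns (x, y)
print(year_1.shape, year_2.shape, year_3.shape)
```

Re-running this cell with the same seed should give the same coordinates each time.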

View the first 3 entries of each year of battles¶

In [19]:

print('year 1 battles:', year_1[0:3])
print('year 2 battles:', year_2[0:3])
print('year 3 battles:', year_3[0:3])

year 1 battles: [[ 3.87032104  4.93418141]
 [ 4.47603646  3.23230121]
 [ 6.15905943  4.55274026]]
year 2 battles: [[-3.55642932 -3.13125097]
 [-5.83295449 -5.75787649]
 [-5.12144789 -5.00466761]]
year 3 battles: [[-0.27557365 -0.65002898]
 [ 0.94593878 -0.46056352]
 [ 0.91003511  0.27888337]]

Pool all three years of coordinates¶

In [20]:

# vertically stack year_1, year_2, and year_3 elements
battles = np.vstack([year_1, year_2, year_3])

Cluster the battle locations into three groups¶

In [30]:

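# calculate the centroid coordinates of each cluster and the distortion:
# the mean (non-squared) Euclidean distance from each battle to its
# nearest centroid (stored in `variance` below, but it is not a variance)
centroids, variance = vq.kmeans(battles, 3)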
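
SciPy's k-means works best when every feature has a comparable spread, and the library provides vq.whiten for that rescaling. The sketch below shows the optional step under assumed, seeded stand-in data (the rng seed and the names whitened, centroids_w and distortion_w are illustrative). For this data both columns already have a similar spread, so whitening should change little:

```python
import numpy as np
from scipy.cluster import vq

# Stand-in data with the same layout as the notebook (seed is arbitrary)
rng = np.random.default_rng(0)
battles = np.vstack([
    rng.standard_normal((100, 2)) + 5,
    rng.standard_normal((30, 2)) - 5,
    rng.standard_normal((50, 2)),
])

# Divide each column by its standard deviation (no centering is applied)
whitened = vq.whiten(battles)

# Cluster in the rescaled space
centroids_w, distortion_w = vq.kmeans(whitened, 3)

# Map the centroids back to the original coordinate units
centroids_original_units = centroids_w * battles.std(axis=0)
print(centroids_original_units)
```

Note that the distortion returned here is measured in whitened units, so it is not directly comparable with the value from the unwhitened run.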

View the centroid coordinates for each of the three clusters¶

In [32]:

centroids

Out[32]:

array([[ 4.89478443,  5.00806609],
       [ 0.16770004,  0.01639683],
       [-5.06447231, -4.99956259]])

View the distortion (the mean distance from each battle to its nearest centroid)¶

In [35]:

variance

Out[35]:

1.2382236882037887
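
The single number above is what SciPy calls the distortion: a mean of plain Euclidean distances, not a per-cluster variance. A small hand-checkable sketch of the difference, using made-up points and centroids (points and centers are illustrative names, not data from this notebook):

```python
import numpy as np
from scipy.cluster import vq

# Four made-up points on the x-axis forming two groups
points = np.array([[0.0, 0.0], [4.0, 0.0], [10.0, 0.0], [12.0, 0.0]])

# Hand-picked centroids: the mean of each group
centers = np.array([[2.0, 0.0], [11.0, 0.0]])

# Distance from each point to its nearest centroid: 2, 2, 1, 1
_, dist = vq.vq(points, centers)

mean_distance = dist.mean()        # (2 + 2 + 1 + 1) / 4 = 1.5
mean_squared = (dist ** 2).mean()  # (4 + 4 + 1 + 1) / 4 = 2.5, variance-like

# Passing the centroids as the initial guess lets kmeans start from them
_, distortion = vq.kmeans(points, centers)

print(mean_distance, mean_squared, distortion)
```

The value returned by kmeans should match the mean distance rather than the mean squared distance.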

Separate the battle data into clusters¶

In [36]:

identified, distance = vq.vq(battles, centroids)
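
vq.vq assigns each observation to its nearest centroid and returns that centroid's index together with the distance to it. The sketch below reproduces the same assignment with plain NumPy broadcasting, using seeded stand-in data and hand-picked centroids (all values and the manual_* names are illustrative):

```python
import numpy as np
from scipy.cluster import vq

# Stand-in observations and hand-picked centroids (illustrative values)
rng = np.random.default_rng(0)
battles = rng.standard_normal((20, 2)) * 3
centroids = np.array([[5.0, 5.0], [0.0, 0.0], [-5.0, -5.0]])

# Pairwise differences, shape (20 battles, 3 centroids, 2 coordinates)
diffs = battles[:, np.newaxis, :] - centroids[np.newaxis, :, :]

# Euclidean distance from every battle to every centroid, shape (20, 3)
all_distances = np.linalg.norm(diffs, axis=2)

# Nearest centroid index and the distance to it, per battle
manual_identified = all_distances.argmin(axis=1)
manual_distance = all_distances.min(axis=1)

# The library call should agree with the manual computation
identified, distance = vq.vq(battles, centroids)
print(np.array_equal(identified, manual_identified))
print(np.allclose(distance, manual_distance))
```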

View the cluster of each battle¶

In [ ]:

identified

View the distance of each individual battle from its cluster's centroid¶

In [ ]:

distance

Index the battles data by the cluster to which they belong¶

In [39]:

cluster_1 = battles[identified == 0]
cluster_2 = battles[identified == 1]
cluster_3 = battles[identified == 2]
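
The boolean masks above work for any number of clusters. A minimal sketch that builds every cluster in one pass and counts its members, using a tiny made-up battles array and label vector (both are illustrative, as are the names clusters and sizes):

```python
import numpy as np

# Five made-up battles and their cluster labels
battles = np.array([
    [5.1, 4.9],
    [4.8, 5.2],
    [0.1, -0.2],
    [-5.0, -5.1],
    [-4.9, -4.7],
])
identified = np.array([0, 0, 1, 2, 2])

# One boolean mask per label, collected in a dictionary keyed by label
clusters = {label: battles[identified == label] for label in np.unique(identified)}

# Number of battles assigned to each label
sizes = np.bincount(identified)

for label, members in clusters.items():
    print(label, len(members))
```

np.bincount assumes the labels are non-negative integers, which holds for the indices returned by vq.vq.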

Print the first three coordinate pairs of each cluster¶

In [43]:

print(cluster_1[0:3])
print(cluster_2[0:3])
print(cluster_3[0:3])

[[ 3.87032104  4.93418141]
 [ 4.47603646  3.23230121]
 [ 6.15905943  4.55274026]]
[[-0.27557365 -0.65002898]
 [ 0.94593878 -0.46056352]
 [ 0.91003511  0.27888337]]
[[-3.55642932 -3.13125097]
 [-5.83295449 -5.75787649]
 [-5.12144789 -5.00466761]]

Plot all the battles, color each battle by cluster¶

In [58]:

# create a scatter plot where the x-axis is the first column of battles,
# the y-axis is the second column of battles, the size is 100, and
# the color of each point is determined by the identified variable
plt.scatter(battles[:,0], battles[:,1], s=100, c=identified)

Out[58]:

<matplotlib.collections.PathCollection at 0x10771b890>
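
Overlaying the centroids on the scatter plot makes it easier to see where each cluster is centered. A sketch under assumed, seeded stand-in data; the marker style, sizes, color and axis labels are arbitrary choices rather than anything from the original notebook:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import vq

# Stand-in data with the same layout as the notebook (seed is arbitrary)
rng = np.random.default_rng(0)
battles = np.vstack([
    rng.standard_normal((100, 2)) + 5,
    rng.standard_normal((30, 2)) - 5,
    rng.standard_normal((50, 2)),
])

# Cluster, then assign each battle to its nearest centroid
centroids, _ = vq.kmeans(battles, 3)
identified, _ = vq.vq(battles, centroids)

fig, ax = plt.subplots()

# Battles, colored by cluster index
ax.scatter(battles[:, 0], battles[:, 1], s=100, c=identified)

# Centroids drawn on top as large red X markers
ax.scatter(centroids[:, 0], centroids[:, 1], s=300, c="red", marker="X")

ax.set_xlabel("x coordinate")
ax.set_ylabel("y coordinate")
```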
