Last updated on 1 September 2017

In [1]:
import numpy as np
import scipy.io as sio

import plotly
plotly.offline.init_notebook_mode(connected=True)
In [2]:
data2 = sio.loadmat("./ex7data2.mat")
Xdata2 = data2["X"]

Visualizing $K$-Means Behavior

This notebook serves to visualize the iteration-by-iteration behavior of the $K$-Means algorithm as implemented for Exercise 7. The algorithm implementation here follows closely from that in the main Exercise 7 notebook, except for a few minor modifications to make the visualization possible.

In [3]:
def find_closest_centroids(_X, _centr):
    _Xresh = _X.reshape(_X.shape[0],1, _X.shape[1])
    _stretch = np.repeat(_Xresh, _centr.shape[0], axis=1)
    _diffs = _stretch - _centr  # Compute the difference 
    _eudists = np.sum(np.square(_diffs), axis = 2) # Compute the euclidean squared distance
    return np.argmin(_eudists,axis = 1) # return cluster value of each centroid based on min distance
In [4]:
def compute_centroid_means(_X, _idx, _K):
    _centrmeans = np.zeros((_K, _X.shape[1]))
    for k in range(0,_K):
        _centrmeans[k] = np.mean(_X[_idx == k].T, axis = 1)
    return _centrmeans
In [5]:
def init_centroids(_X, _K):
    _idxs = np.arange(0, _X.shape[0])
    np.random.shuffle(_idxs)
    return _X[_idxs][0:_K]

Modified $K$-Means Algorithm

Modification notes to k_means()

k_means() is modified here to return the computed arrays of centroid coordinates, _centr and assigned cluster indexes _idx, for each iteration, stored in two separate lists _centrs and _idxs. Index 0 of these two lists serve holds the information about centroids and clusters before the algorithm has started its first iteration. These are the initial centroid coordinates, generated from random initialization, and a dummy value of -1 assigned to be the original cluster index of every data point.

In [6]:
def k_means(_X, _K, _numiter):
    _centrs=[]; _idxs = [] # initalize empty lists to collect cluster indexes and centroid coordinates at each step
    _centr = init_centroids(_X,_K)
    _centrs.append(_centr)
    # assign all datapoints to have initial cluster index of -1 prior to the first cluster assignment
    _idxs.append(-np.ones(_X.shape[0], dtype = int))
    for i in range(0, _numiter):
        _idx = find_closest_centroids(_X, _centr)
        _idxs.append(_idx)
        _centr = compute_centroid_means(_X, _idx, _K) # Update the centroid positions with the new means
        _centrs.append(_centr)
    return _centrs, _idxs

Run $K$-means using the same settings as in Exercise 7. Vary the settings to see how they affect the performance of $K$-means.

In [7]:
K = 3; maxiter = 10
centrs , idxs = k_means(Xdata2,K,maxiter)

Preparing the $K$-Means Animation

The example of Using a Slider and Buttons from the tutorial Intro to Animations in Python helped a lot. I have retained much of the same layout, slider and menu settings from the example.

Approach to the animation

We do not use any objects from plotly.graph_objs, and instead render each object as a simple dictionary.

The general approach is to separately render each individual datapoint with its own color and coordinates as its own dictionary object, rather than rendering all the datapoints under a single dictionary. In this way, it becomes possible to update the colors for each datapoint in every iteration, since there are no changes to the cooridinates of all the datapoints at all.

For the centroids however, since they do change their positions after each iteration, they can all be rendered together under the same single dictionary.

In [17]:
# Initialize an empty figure
figure = {
    'data': [],
    'layout': {},
    'frames': []
}
iters = list(range(0,maxiter+1))
In [18]:
# fill in most of layout
figure['layout']['xaxis'] = {'range': [-1, 9], 'title': 'X1'}
figure['layout']['yaxis'] = {'range': [0, 6], 'title': 'X2'}
figure['layout']['hovermode'] = 'closest'

figure['layout']['sliders'] = {
    'args': [
        'transition', {
            'duration': 400,
            'easing': 'cubic-in-out'
        }
    ],
    'plotlycommand': 'animate',
    'values': iters,
    'visible': True
}

# Create the buttons for starting and pausing the animation
figure['layout']['updatemenus'] = [
    {
        'buttons': [
            {
                'args': [None, {'frame': {'duration': 500, 'redraw': False},
                         'fromcurrent': True, 'transition': {'duration': 300, 'easing': 'quadratic-in-out'}}],
                'label': 'Play',
                'method': 'animate'
            },
            {
                'args': [[None], {'frame': {'duration': 0, 'redraw': False}, 'mode': 'immediate',
                'transition': {'duration': 0}}],
                'label': 'Pause',
                'method': 'animate'
            }
        ],
        'direction': 'left',
        'pad': {'r': 10, 't': 87},
        'showactive': False,
        'type': 'buttons',
        'x': 0.1,
        'xanchor': 'right',
        'y': 0,
        'yanchor': 'top'
    }
]
# Prepare the Sliders 
sliders_dict = {
    'active': 0,
    'yanchor': 'top',
    'xanchor': 'left',
    'currentvalue': {
        'font': {'size': 20},
        'prefix': 'Iteration: ', # show the iteration number
        'visible': True,
        'xanchor': 'right'
    },
    'transition': {'duration': 300, 'easing': 'cubic-in-out'},
    'pad': {'b': 10, 't': 50},
    'len': 0.9,
    'x': 0.1,
    'y': 0,
    'steps': []
}

Initial Visualization

Create the initial visualization that occupies the plot before the animation begins playing.

In [19]:
# create initial data and centroids dict
init_centr_dict = dict(x = centrs[0][:,0], y= centrs[0][:,1], mode = "markers",
                       marker=dict(size = 15, color = "Black"))

data_pts_dicts = []
for row in range(0, Xdata2.shape[0]):
    init_data_point =  dict(x = [Xdata2[row][0]], y= [Xdata2[row][1]], mode = "markers",
                           marker=dict(color = "Pink"))
    data_pts_dicts.append(init_data_point)

# Assign initial visualization to the data key
figure['data'] = [init_centr_dict] + data_pts_dicts

Create the frames to visualize each step

Here below we create the frames necessary for the stepwise visualization of each iteration of the $K$-means. For the initial view (prior to the first iteration) to still be visible as the zeroth iteration after the animation is played, the first frame is set to be eaxctly the same as the initial visualization above.

In [20]:
# create frames now
clustercolors = ["Red", "Green", "Orange"]
iters = list(range(0,maxiter+1))
for i in iters:
    frame = {'data': [], 'name': str(i)}
    Xdata2clust = np.hstack([Xdata2, idxs[i].reshape(idxs[i].shape[0],1)])
    # create and append centroids for each iter
    centr_dict = dict(x = centrs[i][:,0], y= centrs[i][:,1], mode = "markers",
                           marker=dict(size = 15, color = "Black"))
    frame['data'].append(centr_dict)
        
    for row in range(0, Xdata2.shape[0]):
        clusterid = int(Xdata2clust[row][2])
        init_data_point =  dict(x = [Xdata2clust[row][0]], y= [Xdata2clust[row][1]], mode = "markers",
                               marker=dict(color = "Pink" if clusterid==-1 else clustercolors[clusterid])) 
        frame['data'].append(init_data_point)

    figure['frames'].append(frame)
    slider_step = {'args': [
        [i],
        {'frame': {'duration': 300, 'redraw': False},
         'mode': 'immediate',
       'transition': {'duration': 300}}
     ],
     'label': i,
     'method': 'animate'}
    sliders_dict['steps'].append(slider_step)
    
figure['layout']['sliders'] = [sliders_dict]

Animation of $K$-Means

Enjoy the visualization!

In [22]:
figure['layout']['title'] = "Visualization of K-Means Behavior"
figure['layout']['showlegend'] = False
plotly.offline.iplot(figure)

Note

Iteration 0 in this visualization represents the setup prior to the first iteration. The centroids have just been intitialized to positions of randomly selected data points, and no cluster assignment to the data has been performed yet.