We learnt about supervised learning and its methods. To recap, supervised learning involves training a machine learning model on data mapped to a target. The model learns this mapping and can then make predictions on test data that lies outside the training data. Unsupervised learning, by contrast, requires no training data mapped to a target. In fact, there is no training in the supervised sense: the algorithms generate insights directly from the structure of the data, hence the name unsupervised learning. One popular example of unsupervised learning is clustering.
Clustering means finding groups in a dataset based on some criterion. The criterion could be spatial, involving the distance between nearby points, or it could involve another metric such as density, based on the presence of neighbors. A basic spatial clustering algorithm is k-means clustering.
k-means clustering takes as input the number of clusters, k, that could potentially exist in a dataset, and outputs an association of each member of the dataset to a cluster in the range [1, k]. k-means initially assigns each data point to a random cluster in that range and determines the centroid of each cluster. Then, in an iterative manner, the point-to-cluster associations and the centroids of the k clusters are updated until a convergence criterion is met.
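The iterative procedure described above can be sketched in a few lines of NumPy. This is a toy illustration of Lloyd's algorithm, not scikit-learn's implementation; note that it uses zero-based cluster indices 0..k-1, as scikit-learn does.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):              # guard against empty clusters
                new_centroids[j] = members.mean(axis=0)
        # Convergence criterion: the centroids stopped moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs of points, this sketch recovers one cluster per blob; scikit-learn's KMeans, used below, adds smarter initialization (k-means++) and multiple restarts on top of the same idea.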
The Iris dataset is a famous dataset, widely used by statisticians, machine learning engineers, and data scientists, introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". The data consists of measurements of Iris flowers that belong to three different species.
The dataset consists of 150 samples with four features per data point. These are:
- sepal length
- sepal width
- petal length
- petal width
We shall learn to use these features and perform clustering using k-means to see various cluster realizations.
Load all relevant libraries:
import pandas as pd
import plotly.tools as tls
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
iris = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')
iris.head()
Let us perform clustering with the number of clusters set to 5 using the KMeans class. A maximum number of iterations (max_iter) can be specified, which makes the algorithm stop after that many steps even if it has not yet converged to within the tolerance level (tol) of 0.0001.
features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
iris_train, iris_test = train_test_split(iris, train_size=0.9)
k_means = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)  # precompute_distances and n_jobs were removed in scikit-learn 1.0
We can now fit the instantiated k_means to the features:
k_means.fit(iris_train[features])
The predict function returns, for each data point in the test set, the cluster it is associated with:
iris_y = k_means.predict(iris_test[features])
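Beyond predict, a fitted KMeans model exposes attributes such as cluster_centers_, n_iter_, and inertia_. The following self-contained sketch illustrates them, using synthetic 4-feature blobs in place of the downloaded Iris data so it runs without network access:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Iris features: three blobs of 4-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 4)) for c in (0.0, 3.0, 6.0)])

X_train, X_test = train_test_split(X, train_size=0.9, random_state=0)
km = KMeans(n_clusters=3, n_init=10, max_iter=300, tol=0.0001, random_state=0)
km.fit(X_train)

y_test = km.predict(X_test)          # cluster index for each held-out row
print(km.cluster_centers_.shape)     # (3, 4): one centroid per cluster, in feature space
print(km.n_iter_)                    # iterations actually run (at most max_iter)
print(km.inertia_)                   # sum of squared distances of points to their centroid
```

The centroids live in the same 4-dimensional feature space as the data, which is why cluster_centers_ has one row per cluster and one column per feature.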
K-Means associates each training data point with a cluster. After clustering, these associations are available in the labels_ attribute.
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')
features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
iris_train, iris_test = train_test_split(iris, train_size=0.9)
k_means = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)
k_means.fit(iris_train[features])
iris_y = k_means.predict(iris_test[features])
Use the DataFrame .assign method to add the cluster labels as a new column:
iris_train = iris_train.assign(cluster = k_means.labels_)
iris_train.head(5)
ref_tmp_var = False
try:
    ref_assert_var = False
    import numpy as np
    if np.all(iris_train['cluster'] == k_means.labels_):
        ref_assert_var = True
        out = iris_train.head(5)
    else:
        ref_assert_var = False
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')
assert ref_tmp_var
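The walkthrough above fixes the number of clusters at 5. One common heuristic for comparing different cluster realizations, not covered above, is the "elbow" method: fit KMeans for a range of k values and inspect inertia_, which always shrinks as k grows but flattens out past the natural number of groups. A sketch on synthetic 4-feature blobs (standing in for the Iris features):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated blobs of 4-D points, so the natural k is 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 4)) for c in (0.0, 3.0, 6.0)])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_        # within-cluster sum of squared distances

# Inertia drops sharply up to k=3 (the number of generating blobs)
# and only marginally afterwards; that bend is the "elbow".
for k, v in inertias.items():
    print(k, round(v, 2))
```

Applied to the Iris features, the same loop can help judge whether 5 clusters is a sensible choice for data drawn from three species.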