We learnt about supervised learning and its methods. To recap, supervised learning involves training a machine learning model on data mapped to a target. The model learns this mapping and can then make predictions on test data that lies outside the training data. Unsupervised learning, by contrast, requires no training data mapped to a target. In fact, there is no training in the supervised sense: the algorithms generate insights directly from the structure of the data, hence the name unsupervised learning. One popular example of unsupervised learning is clustering.
Clustering means finding groups in a dataset based on some criterion. The criterion could be spatial, involving the distance between nearby points, or it could involve another metric such as density, based on the presence of neighbors. A basic spatial clustering algorithm is k-means clustering.
k-means clustering takes as input the number of clusters, k, that could potentially exist in a dataset, and outputs an association of each member of the dataset to a cluster in the range [1, k]. k-means initially assigns each data point to a random cluster in that range and determines the centroid of each cluster. Then, in an iterative manner, the point-to-cluster associations and the centroids of the k clusters are updated until a convergence criterion is met.
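The iterative procedure described above can be sketched in a few lines of NumPy. This is a toy illustration of Lloyd's algorithm, not scikit-learn's implementation; note that it uses zero-based cluster indices 0..k-1, as scikit-learn does.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):              # guard against empty clusters
                new_centroids[j] = members.mean(axis=0)
        # Convergence criterion: the centroids stopped moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs of points, this sketch recovers one cluster per blob; scikit-learn's KMeans, used below, adds smarter initialization (k-means++) and multiple restarts on top of the same idea.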
The Iris dataset is a famous dataset, widely used by statisticians, machine learning engineers, and data scientists, introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". The data consists of measurements of Iris flowers that belong to three different species.
The dataset consists of 150 samples with four features per data point. These are:
- sepal length
- sepal width
- petal length
- petal width
We shall learn to use these features and perform clustering using k-means to see various cluster realizations.
Load all relevant libraries:
import pandas as pd
import plotly.tools as tls
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
iris = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')
iris.head()
Let us perform clustering with the number of clusters set to 5 using the KMeans class. A maximum number of iterations (max_iter) can be specified, which makes the algorithm stop after that many steps even if it has not yet converged to within the tolerance level (tol) of 0.0001.
features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
iris_train, iris_test = train_test_split(iris, train_size=0.9)
k_means = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)  # precompute_distances and n_jobs were removed in scikit-learn 1.0
We can now fit the instantiated k_means to the features:
k_means.fit(iris_train[features])
The predict function returns, for each data point in the test set, the cluster it is associated with:
iris_y = k_means.predict(iris_test[features])
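Beyond predict, a fitted KMeans model exposes attributes such as cluster_centers_, n_iter_, and inertia_. The following self-contained sketch illustrates them, using synthetic 4-feature blobs in place of the downloaded Iris data so it runs without network access:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Iris features: three blobs of 4-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 4)) for c in (0.0, 3.0, 6.0)])

X_train, X_test = train_test_split(X, train_size=0.9, random_state=0)
km = KMeans(n_clusters=3, n_init=10, max_iter=300, tol=0.0001, random_state=0)
km.fit(X_train)

y_test = km.predict(X_test)          # cluster index for each held-out row
print(km.cluster_centers_.shape)     # (3, 4): one centroid per cluster, in feature space
print(km.n_iter_)                    # iterations actually run (at most max_iter)
print(km.inertia_)                   # sum of squared distances of points to their centroid
```

The centroids live in the same 4-dimensional feature space as the data, which is why cluster_centers_ has one row per cluster and one column per feature.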
K-Means associates each training data point with a cluster. After clustering, these associations are available in the labels_ attribute.
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')
features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
iris_train, iris_test = train_test_split(iris, train_size=0.9)
k_means = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)
k_means.fit(iris_train[features])
iris_y = k_means.predict(iris_test[features])
Use the DataFrame .assign method to add the cluster labels as a new column:
iris_train = iris_train.assign(cluster = k_means.labels_)
iris_train.head(5)
ref_tmp_var = False
try:
    ref_assert_var = False
    import numpy as np
    if np.all(iris_train['cluster'] == k_means.labels_):
        ref_assert_var = True
        out = iris_train.head(5)
    else:
        ref_assert_var = False
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')
assert ref_tmp_var
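The walkthrough above fixes the number of clusters at 5. One common heuristic for comparing different cluster realizations, not covered above, is the "elbow" method: fit KMeans for a range of k values and inspect inertia_, which always shrinks as k grows but flattens out past the natural number of groups. A sketch on synthetic 4-feature blobs (standing in for the Iris features):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated blobs of 4-D points, so the natural k is 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 4)) for c in (0.0, 3.0, 6.0)])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_        # within-cluster sum of squared distances

# Inertia drops sharply up to k=3 (the number of generating blobs)
# and only marginally afterwards; that bend is the "elbow".
for k, v in inertias.items():
    print(k, round(v, 2))
```

Applied to the Iris features, the same loop can help judge whether 5 clusters is a sensible choice for data drawn from three species.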