While KMeans is a good algorithm, its running time grows with the size of the dataset. KMeans works in O(n⋅K⋅I⋅f), where n is the number of records, K is the number of clusters, I is the number of iterations, and f is the number of features in each record. Every iteration has to visit all n records, so the algorithm becomes slow on datasets with more than about 100,000 data points.

Minibatch KMeans¶

Main features of Minibatch KMeans are:

Instead of using the entire dataset at once, it operates on small batches.
It updates the centroids with a gradient-descent-style step, which is much cheaper per iteration than the full KMeans update.

How it works¶

It takes a small random batch (minibatch) of the dataset and finds the centroids for that batch.
Then, for the next batch, it starts from the centroids found in the previous batch and updates them using Gradient Descent.
Because each update touches only one batch rather than all n records, every iteration is much cheaper than in standard KMeans.
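The steps above can be sketched in plain NumPy. This is only an illustration, not scikit-learn's implementation: the function name `minibatch_kmeans`, the random initialisation, and the per-centre step size of 1/count are assumptions made for this sketch.

```python
import numpy as np

def minibatch_kmeans(X, n_clusters, batch_size=100, n_iter=100, seed=0):
    """Illustrative mini-batch K-means (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    batch_size = min(batch_size, len(X))
    # Assumption: start from randomly chosen data points (not k-means++).
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    counts = np.zeros(n_clusters)  # points absorbed by each centre so far

    for _ in range(n_iter):
        # Draw one random batch instead of using the whole dataset.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign every batch point to its nearest centre.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Move each centre a small step towards the points assigned to it.
        for x, k in zip(batch, nearest):
            counts[k] += 1
            lr = 1.0 / counts[k]  # step size shrinks as the centre sees more points
            centers[k] = (1 - lr) * centers[k] + lr * x
    return centers

# Small synthetic example: three well-separated 2-D blobs.
rng_demo = np.random.default_rng(42)
X_demo = np.vstack([rng_demo.normal(loc, 0.5, size=(200, 2)) for loc in (0, 5, 10)])
centers_demo = minibatch_kmeans(X_demo, n_clusters=3)
print(centers_demo)
```

Each centre is a running average of the points assigned to it, so the update needs only the current batch and the counts, never the full dataset.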

In [9]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time

In [3]:

df = pd.read_csv('my_machine-learning/datasets/customers.csv')
X = df.iloc[:, [3, 4]].values
df.head()

Out[3]:

	CustomerID	Genre	Age	Annual Income (k$)	Spending Score (1-100)
0	1	Male	19	15	39
1	2	Male	21	15	81
2	3	Female	20	16	6
3	4	Female	23	16	77
4	5	Female	31	17	40

Comparing Minibatch KMeans with KMeans¶

In [12]:

from sklearn.cluster import MiniBatchKMeans

clf = MiniBatchKMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_minikmeans = clf.fit_predict(X)

In [13]:

from sklearn.cluster import KMeans

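# Fix n_init and random_state so the run is reproducible and comparable to MiniBatchKMeans
kmean = KMeans(n_clusters=5, n_init=10, random_state=0)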
y_kmeans = kmean.fit_predict(X)

In [14]:

fig = plt.figure(figsize=(15,6))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y_minikmeans, s=50, cmap='viridis')

centers = clf.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=150, alpha=0.5)
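plt.title('Clusters using MiniBatchKMeans')
# X holds columns 3 and 4 of df: annual income and spending score
plt.ylabel('Spending Score (1-100)')
plt.xlabel('Annual Income (k$)')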

plt.subplot(122)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis_r')

centers = kmean.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=150, alpha=0.5)
plt.title('Clusters using kmeans')
plt.ylabel('Spending Score (1-100)')
plt.xlabel('Annual Income (k$)')
plt.show()

As the two plots show, there is hardly any difference between the two algorithms on a small dataset like this one.
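A difference in running time only shows up on larger data. The cell below is a rough sketch of such a comparison. The synthetic dataset from `make_blobs`, its size, the helper `fit_time`, and `batch_size=1024` are all assumptions chosen for illustration, and the timings will vary from machine to machine.

```python
import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Assumption: a synthetic dataset stands in for a "large" real one.
X_big, _ = make_blobs(n_samples=50_000, centers=5, n_features=2, random_state=0)

def fit_time(model, data):
    """Return the wall-clock seconds needed to fit `model` on `data`."""
    start = time.perf_counter()
    model.fit(data)
    return time.perf_counter() - start

t_full = fit_time(KMeans(n_clusters=5, n_init=3, random_state=0), X_big)
t_mini = fit_time(
    MiniBatchKMeans(n_clusters=5, n_init=3, batch_size=1024, random_state=0), X_big
)
print(f"KMeans: {t_full:.2f}s, MiniBatchKMeans: {t_mini:.2f}s")
```

Increasing `n_samples` should widen the gap, since KMeans visits every record in each iteration while MiniBatchKMeans only visits one batch.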

Disadvantages¶

Minibatch KMeans is faster, but its results differ slightly from those of standard (full-batch) KMeans.
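One way to see how far the results differ is to compare the inertia (within-cluster sum of squared distances) of both models and to measure how well their labels agree. The sketch below assumes a synthetic dataset from `make_blobs` and uses the adjusted Rand index as the agreement measure; both choices are illustrative and not part of the original comparison.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Assumption: synthetic data with five blobs, used only for this comparison.
X_cmp, _ = make_blobs(n_samples=5_000, centers=5, random_state=0)

full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_cmp)
mini = MiniBatchKMeans(n_clusters=5, n_init=10, random_state=0).fit(X_cmp)

# Lower inertia means tighter clusters.
print("KMeans inertia:         ", full.inertia_)
print("MiniBatchKMeans inertia:", mini.inertia_)

# The adjusted Rand index ignores cluster numbering; 1.0 means identical partitions.
ari = adjusted_rand_score(full.labels_, mini.labels_)
print("Label agreement (ARI):  ", ari)
```

The adjusted Rand index is used here because the two models may number the same clusters differently, so comparing the raw labels directly would be misleading.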