While KMeans is a good algorithm, it scales poorly with dataset size. KMeans runs in O(n⋅K⋅I⋅f), where n is the number of records, K is the number of clusters, I is the number of iterations, and f is the number of features per record. Because every iteration touches all n records, the algorithm becomes slow on datasets with more than 100,000 data points.
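To make the O(n⋅K⋅I⋅f) bound concrete, here is a rough back-of-the-envelope count. The helper name `kmeans_cost` and the record counts (200 and 100,000) are assumptions chosen for illustration; K=5, I=300 and f=2 mirror the settings used later in this notebook. It only counts elementary steps and says nothing about actual wall-clock time.

```python
def kmeans_cost(n, k, iterations, f):
    """Rough number of elementary distance steps: n * K * I * f."""
    return n * k * iterations * f

# Hypothetical sizes: a small dataset vs. one with 100,000 records.
small = kmeans_cost(n=200, k=5, iterations=300, f=2)
large = kmeans_cost(n=100_000, k=5, iterations=300, f=2)

print(small)           # 600000
print(large)           # 300000000
print(large // small)  # 500: the cost grows linearly with n
```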
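The main feature of Minibatch KMeans is that each iteration updates the centroids from a small random batch of the data rather than from the full dataset, which cuts the cost per iteration.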
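The central idea, updating centroids from small random batches with a per-centroid learning rate, can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `minibatch_kmeans_sketch`, its parameters, and the random initialisation (instead of k-means++) are all hypothetical choices made for brevity.

```python
import numpy as np

def minibatch_kmeans_sketch(X, n_clusters, batch_size=100, n_iter=100, seed=0):
    """Toy mini-batch K-means: returns an array of centroids."""
    rng = np.random.default_rng(seed)
    # Assumption: start from randomly chosen data points, not k-means++.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    counts = np.zeros(n_clusters)  # points absorbed by each centroid so far
    size = min(batch_size, len(X))
    for _ in range(n_iter):
        # Draw a small random batch instead of using all n records.
        batch = X[rng.choice(len(X), size=size, replace=False)]
        # Assign every batch point to its nearest centroid.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Move each centroid toward its points with a shrinking step size.
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers

# Synthetic demo data: two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
                    rng.normal(10.0, 1.0, size=(500, 2))])
demo_centers = minibatch_kmeans_sketch(X_demo, n_clusters=2)
print(np.round(demo_centers, 1))
```

Each iteration here costs roughly batch_size⋅K⋅f steps instead of n⋅K⋅f, which is where the speed-up comes from.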
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time
df = pd.read_csv('my_machine-learning/datasets/customers.csv')
X = df.iloc[:, [3, 4]].values
df.head()
| | CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
from sklearn.cluster import MiniBatchKMeans
clf = MiniBatchKMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_minikmeans = clf.fit_predict(X)
from sklearn.cluster import KMeans
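# Use the same settings as MiniBatchKMeans so the comparison is reproducible
kmean = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)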
y_kmeans = kmean.fit_predict(X)
fig = plt.figure(figsize=(15,6))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y_minikmeans, s=50, cmap='viridis')
centers = clf.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=150, alpha=0.5)
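plt.title('Clusters using MiniBatchKMeans')
# X holds columns 3 and 4: annual income (x-axis) and spending score (y-axis)
plt.ylabel('Spending Score (1-100)')
plt.xlabel('Annual Income (k$)')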
plt.subplot(122)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis_r')
centers = kmean.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=150, alpha=0.5)
plt.title('Clusters using kmeans')
plt.ylabel('Spending Score (1-100)')
plt.xlabel('Annual Income (k$)')
plt.show()
On a small dataset like this one, the two algorithms produce essentially the same clusters.
Mini-batch K-means is faster, but it gives slightly different results than standard (full-batch) K-means.
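The speed and quality trade-off is easier to see on a larger dataset. The sketch below is an assumption-laden illustration rather than a benchmark: it uses synthetic data from `make_blobs` (not the customers file), a hypothetical helper `time_fit`, and an arbitrary `batch_size=1024`. Timings depend on the machine and scikit-learn version, so no expected numbers are shown.

```python
import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a large dataset: 100,000 points, 2 features.
X_big, _ = make_blobs(n_samples=100_000, n_features=2, centers=5, random_state=0)

def time_fit(model, data):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(data)
    return time.perf_counter() - start

full_km = KMeans(n_clusters=5, n_init=3, random_state=0)
mini_km = MiniBatchKMeans(n_clusters=5, n_init=3, batch_size=1024, random_state=0)

t_full = time_fit(full_km, X_big)
t_mini = time_fit(mini_km, X_big)

# inertia_ is the sum of squared distances to the nearest centroid (lower is better).
print(f"KMeans:          {t_full:.2f} s, inertia {full_km.inertia_:.0f}")
print(f"MiniBatchKMeans: {t_mini:.2f} s, inertia {mini_km.inertia_:.0f}")
```

Comparing the two `inertia_` values gives a rough measure of how "slightly different" the mini-batch result is on a given dataset.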