Demo of DBSCAN clustering

In [1]:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt

Generate dataset

Generate 2-dimensional example data with the make_blobs function from scikit-learn. The example dataset consists of 750 samples, distributed equally over 3 Gaussian distributions. The mean vectors of the 3 Gaussian distributions are defined in the variable centers; each distribution has a standard deviation of 0.4.

In [2]:
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)

Standardize Data

Standardize the dataset so that each of the two features (columns) has zero mean and a standard deviation of 1. The scikit-learn StandardScaler class is used for this task.

In [3]:
print("Mean of original data")
print(np.mean(X, axis=0))
print("Standard deviation of original data")
print(np.std(X, axis=0))

# Rescale each column to zero mean and unit standard deviation.
X = StandardScaler().fit_transform(X)

print("Mean of standardized data")
print(np.mean(X, axis=0))
print("Standard deviation of standardized data")
print(np.std(X, axis=0))
Mean of original data
[ 0.32728606 -0.34363711]
Standard deviation of original data
[ 1.03778036  1.02797424]
Mean of standardized data
[  2.59866203e-16   3.41800662e-16]
Standard deviation of standardized data
[ 1.  1.]
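
The standardization step can also be written out by hand. The following is a minimal sketch, assuming the same make_blobs parameters as above; the names X_raw, X_manual and X_scaled are introduced here for illustration only. It subtracts the column mean and divides by the column standard deviation, then compares the result with the output of StandardScaler.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in the demo (illustrative variable names).
centers = [[1, 1], [-1, -1], [1, -1]]
X_raw, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)

# Manual standardization: subtract the column mean and divide by the
# column standard deviation (population standard deviation, ddof=0).
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Standardization with scikit-learn.
X_scaled = StandardScaler().fit_transform(X_raw)

# Both approaches should agree up to floating-point rounding.
print(np.allclose(X_manual, X_scaled))
```

The means of the standardized data printed above are of the order 1e-16 rather than exactly 0, which is the expected effect of floating-point rounding.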

DBSCAN Clustering

Apply the DBSCAN clustering from scikit-learn:

In [4]:
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# The label -1 marks noise samples (outliers) and does not count as a cluster.
# If -1 occurs in labels, the number of clusters is therefore the number of
# distinct values in labels minus 1.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

The i-th element of the 1-dimensional array db.labels_ contains the cluster index of the i-th data sample, as found by the DBSCAN algorithm; samples classified as noise get the label -1. The i-th element of the 1-dimensional boolean array core_samples_mask is True if the i-th sample is a core sample, otherwise it is False. The variable n_clusters_ stores the number of clusters found.
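
As a minimal sketch of how these arrays can be combined, the following block counts core, border and noise samples. It repeats the setup from the cells above so that it runs on its own; the names noise_mask, border_mask, n_core, n_border and n_noise are introduced here for illustration. A border sample is taken to be a sample that belongs to a cluster but is not a core sample.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same data and DBSCAN parameters as in the demo.
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

# Noise samples carry the label -1.
noise_mask = labels == -1
# Border samples: assigned to a cluster, but not core samples.
border_mask = ~core_samples_mask & ~noise_mask

n_core = int(core_samples_mask.sum())
n_border = int(border_mask.sum())
n_noise = int(noise_mask.sum())
print("core: %d, border: %d, noise: %d" % (n_core, n_border, n_noise))

# Number of samples per cluster, ignoring noise.
cluster_ids, cluster_sizes = np.unique(labels[~noise_mask], return_counts=True)
print(dict(zip(cluster_ids.tolist(), cluster_sizes.tolist())))
```

Every sample falls into exactly one of the three groups, so the three counts add up to the number of samples.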

Evaluation of Clustering

In this demo the real cluster assignment of the artificial dataset is known. In the case of known real cluster assignment the following performance measures can be determined:

  • Homogeneity h: This value is maximum (=1), if each cluster contains data of only a single class (a single true label).
  • Completeness c: This value is maximum (=1), if all members of a given class (a given true label) are assigned to the same cluster.
  • V-measure v: The harmonic mean of homogeneity and completeness: $$ v=2 \frac{h \cdot c}{h + c} $$

The exact calculation of homogeneity and completeness can be found here

In [5]:
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
Estimated number of clusters: 3
Homogeneity: 0.945
Completeness: 0.876
V-measure: 0.909
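
The relation between the three measures can be checked on a small example. The sketch below uses a hand-made toy labeling (labels_true_toy and labels_pred_toy are hypothetical and not part of the demo data) and compares the V-measure returned by scikit-learn with the harmonic mean of homogeneity and completeness.

```python
from sklearn import metrics

# Hypothetical toy data: 6 samples, 2 true classes, 3 predicted clusters.
labels_true_toy = [0, 0, 0, 1, 1, 1]
labels_pred_toy = [0, 0, 1, 1, 2, 2]

h = metrics.homogeneity_score(labels_true_toy, labels_pred_toy)
c = metrics.completeness_score(labels_true_toy, labels_pred_toy)
v = metrics.v_measure_score(labels_true_toy, labels_pred_toy)

# Harmonic mean of homogeneity and completeness, as in the formula above.
v_manual = 2 * h * c / (h + c)

print("h = %0.3f, c = %0.3f" % (h, c))
print("v (scikit-learn) = %0.3f, v (manual) = %0.3f" % (v, v_manual))
```

In this toy labeling, predicted cluster 1 mixes both classes, which lowers homogeneity, and both classes are split over two clusters, which lowers completeness.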

In most clustering applications the true cluster labels are not known. In this case the Silhouette score is a frequently used metric for assessing the quality of a clustering. For each sample, the Silhouette coefficient is calculated from the following two quantities:

  • a: The mean distance between a sample and all other points in the same cluster
  • b: The mean distance between a sample and all points in the nearest cluster that the sample does not belong to

The Silhouette coefficient of a single sample is then: $$ s=\frac{b-a}{\max(b,a)} $$ The Silhouette score of a clustering is the mean of s over all samples.

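According to the Silhouette score, a good clustering is one for which a is small and b is large. This is the case for compact, well-separated clusters. The score lies between -1 and 1, with higher values indicating a better clustering.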

In [6]:
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))
Silhouette Coefficient: 0.626
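
The definition of a and b can be followed step by step for a single sample. The sketch below is an illustration under simplifying assumptions: it uses a smaller dataset of 150 samples and the true blob labels instead of the DBSCAN labels, and the names X_demo, y_demo, D, s_manual and s_sklearn are introduced here only for this example. The manually computed value is compared with the per-sample result of metrics.silhouette_samples.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# Smaller illustrative dataset; the true blob labels serve as cluster labels.
centers = [[1, 1], [-1, -1], [1, -1]]
X_demo, y_demo = make_blobs(n_samples=150, centers=centers, cluster_std=0.4, random_state=0)

# Euclidean distance between every pair of samples.
D = pairwise_distances(X_demo)

i = 0                      # index of the sample under inspection
own = y_demo[i]            # cluster of sample i

# a: mean distance to all OTHER samples of the same cluster.
same = (y_demo == own)
same[i] = False            # exclude the sample itself
a = D[i, same].mean()

# b: smallest mean distance to the samples of any other cluster.
b = min(D[i, y_demo == k].mean() for k in np.unique(y_demo) if k != own)

s_manual = (b - a) / max(a, b)
s_sklearn = metrics.silhouette_samples(X_demo, y_demo)[i]
print("manual: %0.6f, scikit-learn: %0.6f" % (s_manual, s_sklearn))

# The Silhouette score is the mean of the per-sample coefficients.
mean_s = metrics.silhouette_samples(X_demo, y_demo).mean()
score = metrics.silhouette_score(X_demo, y_demo)
print(np.isclose(mean_s, score))
```

Note that in the demo cell above, silhouette_score is called with the DBSCAN labels, so the noise samples (label -1) are treated like one additional cluster.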

Plotting the clustering result

In [7]:
unique_labels = set(labels)
plt.figure(figsize=(12, 10))
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()