- Author: Johannes Maucher (adapted from scikit-learn example)
- Last Update: 03rd December, 2014
- List of all IPython Notebooks for this lecture

In [1]:

```
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
```

Generate 2-dimensional example data using the make_blobs function from scikit-learn. The example dataset consists of 750 samples, equally distributed over 3 different Gaussian distributions. The mean vectors of the 3 Gaussians are defined in the variable *centers*.

In [2]:

```
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
```

Standardize the dataset such that each of the two features (columns) has zero mean and a standard deviation of 1. The scikit-learn StandardScaler class is applied for this task.

In [3]:

```
print("Mean of original data")
print(np.mean(X, axis=0))
print("Standard deviation of original data")
print(np.std(X, axis=0))
X = StandardScaler().fit_transform(X)
print("Mean of standardized data")
print(np.mean(X, axis=0))
print("Standard deviation of standardized data")
print(np.std(X, axis=0))
```

Apply the DBSCAN clustering from scikit-learn:

In [4]:

```
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# DBSCAN marks noise samples (outliers) with the label -1.
# If -1 is contained in labels, then the number of clusters is the number of distinct values in labels minus 1.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
```

The *i*-th element of the 1-dimensional array *db.labels_* contains the cluster index assigned to the *i*-th data sample by the DBSCAN algorithm. The *i*-th element of the boolean array *core_samples_mask* is *True* if the *i*-th sample is a core sample, and *False* otherwise. The variable *n_clusters_* stores the number of clusters found.
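As a small illustration of these attributes, the following sketch runs DBSCAN on a hypothetical toy dataset (not the dataset above): two tight groups plus one far-away outlier. It shows how *labels_* marks noise with -1 and how the cluster sizes can be counted:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one far-away outlier.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                [20.0, 20.0]])
db = DBSCAN(eps=0.5, min_samples=2).fit(pts)

# labels_ holds one cluster index per sample; -1 marks noise.
print(db.labels_)

# core_sample_indices_ lists the indices of the core samples.
print(db.core_sample_indices_)

# Count the samples per cluster index.
ids, counts = np.unique(db.labels_, return_counts=True)
print(dict(zip(ids, counts)))
```

Here the two groups become clusters 0 and 1, and the isolated point is labeled -1 (noise), so it is not counted as a cluster.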

In this demo the true cluster assignment of the artificial dataset is known. When the true assignment is known, the following performance measures can be determined:

- Homogeneity h: This value is maximum (=1), if each cluster contains data of only a single class (a single true label).
- Completeness c: This value is maximum (=1), if all members of a given class (a given true label) are assigned to the same cluster.
- V-measure v: The harmonic mean of homogeneity and completeness: $$ v=2 \frac{h \cdot c}{h + c} $$

The exact calculation of homogeneity and completeness can be found in the scikit-learn documentation.
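The harmonic-mean relation between the three scores can be verified numerically. The sketch below uses hypothetical true and predicted label lists (chosen only for illustration) and checks that the V-measure returned by scikit-learn equals $2hc/(h+c)$:

```python
from sklearn import metrics

# Hypothetical ground truth and clustering result, for illustration only.
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 1, 1]

h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)

# v should equal the harmonic mean of h and c.
print(h, c, v, 2 * h * c / (h + c))
```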

In [5]:

```
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
```

In general clustering applications the true cluster labels are not known. In this case the Silhouette score is a frequently used metric for measuring clustering performance. The Silhouette score is calculated from the following quantities:

- a: The mean distance between a sample and all other points in the same cluster
- b: The mean distance between a sample and all other points in the next nearest cluster

The Silhouette score is then: $$ s=\frac{b-a}{\max(b,a)} $$

According to the Silhouette score, a good clustering is one for which *a* is small and *b* is large. This is the case for compact, well-separated clusters.
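The formula above can be checked by hand. The following sketch computes $s=\frac{b-a}{\max(b,a)}$ for a single sample of a hypothetical two-cluster dataset and compares the result with scikit-learn's per-sample silhouette values:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical points forming two clusters of three samples each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Compute a and b for the first sample by hand.
x = X[0]
a = np.mean([np.linalg.norm(x - p) for p in X[1:3]])  # mean distance within own cluster
b = np.mean([np.linalg.norm(x - p) for p in X[3:]])   # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)

# scikit-learn's value for the same sample.
s_sklearn = silhouette_samples(X, labels)[0]
print(s_manual, s_sklearn)
```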

In [6]:

```
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
```

In [7]:

```
unique_labels = set(labels)
plt.figure(figsize=(12, 10))
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_member_mask = (labels == k)
    # Core samples of the cluster are drawn with large markers ...
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    # ... non-core (border) samples with small markers.
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
```
