Notebook

DBScan¶

DBSCAN has the ability to capture densely packed data points. It is similar to KNNs with variable parameters.

Algorithm:¶

Choose two parameters, a positive numbers - epsilon and an integer minPoints.
Randomly pick few points from the dataset. If there are greater than minPoints within a radius of epsilon from that point, we consider all of them to be part of a "cluster". Note that eucledian distance is used here.
Expand that cluster by checking all of the unchecked new points and seeing if they have greater than minPoints within a radius of epsilon, growing the cluster recursively in this manner.
Eventually, we run out of points to add to the cluster. We then pick a new arbitrary point and repeat the process.
At the end of clustering, we could end up with data points not belonging to any cluster that we call noise.

The figure shows how 3 points which are lying within the ball of radius $\epsilon$ are considered neighbors that satisfies the criterion for the point to be considered a member of the cluster. The point is then tagged as visited. Later the ball is moved to each neighbor point and the same member determination continues till all points are visited.

Ref: https://en.wikipedia.org/wiki/DBSCAN

DBScan using sklearn¶

We will have to specify epsilon and a natural number minPoints. Let us use values epsilon = 0.09 and minPoints = 5.

db = DBSCAN(eps=0.06, min_samples=5)
db.fit(X)
labels = db.labels_

## Exercise:

Perform DBScan on the dataset and visualize the clusters

In [1]:

from matplotlib import pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

import numpy as np
import seaborn as sns
import pandas as pd

plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

N_Samples = 1000
D = 2
K = 4

X, y = make_moons(n_samples = 2*N_Samples, noise=0.05, shuffle = False)
x_vec, y_vec = make_moons(n_samples = 2*N_Samples, noise=0.08, shuffle = False)
x_vec[:,0] += 2.5
y_vec += 2
X = np.concatenate((X, x_vec), axis=0)
y = np.concatenate((y, y_vec), axis=0)

# Create a dataframe moon_df and visualize a graph g
moon_df = pd.DataFrame({'X_0':X[:,0],'X_1':X[:,1], 'y':y})
g = sns.pairplot(x_vars="X_0", y_vars="X_1", hue="y", data = moon_df)
g.fig.set_size_inches(14, 6)
sns.despine()

# Initialize a DBScan cluster with eps and minimum points
db = DBSCAN(eps=0.06, min_samples=5)

Use .fit and .labels_ for fitting the dbscan and accessing the cluster labels respectively.

In [ ]:

db.fit(X)
labels = db.labels_

# Create a data frame and visualize the plot.
moon_df['db_clus'] = labels
g=sns.pairplot(x_vars="X_0", y_vars="X_1", hue = "db_clus", data = moon_df)
g.fig.set_size_inches(14, 6)
sns.despine()

In [ ]:

ref_tmp_var = False


try:
    ref_assert_var = False
    db_ = DBSCAN(eps=0.06, min_samples=5)
    db_.fit(X)
    
    import numpy as np
    
    if np.all(moon_df['db_clus'] == db_.labels_):
      ref_assert_var = True
      out = g
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var

Clustering Shapes¶

DBScan uses density based clustering and though it appears to determine various shapes, not always is it successful. Consider two intertwined circles with a high noise of 0.2. This dataset can be generated by make_circles function in the sklearn.datasets using a noise of 0.2.

## Exercise:

Use DBScan with eps of 0.5 to cluster the dataset and plot the results.
Assign the cluster labels to the variable labels.

In [2]:

from sklearn.datasets import make_circles

# Generate the dataset with intertwined circles
X, y = make_circles(n_samples=N_Samples, factor=.5, noise=.2)
noisy_circles = pd.DataFrame({'X_0':X[:,0],'X_1':X[:,1], 'y':y})

# Perform DBScan on the dataset and visualize the results
db = DBSCAN(eps=0.5, min_samples=5)

Transform the data into a dataframe and plot the result with a hue as the cluster labels.

In [ ]:

db.fit(X)
labels = db.labels_
noisy_circles = pd.DataFrame({'X_0':X[:,0], 'X_1':X[:,1], 'y':labels})
g = sns.pairplot(x_vars="X_0", y_vars="X_1", hue = "y", data = noisy_circles)
g.fig.set_size_inches(14, 6)
sns.despine()

In [ ]:

ref_tmp_var = False


try:
    ref_assert_var = False
    import numpy as np
    
    db_ = DBSCAN(eps=0.5, min_samples=5)
    db_.fit(X)
    
    if np.all(db_.labels_ == labels):
      ref_assert_var = True
      out = g
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var