DBSCAN has the ability to capture densely packed data points. It is similar to KNNs with variable parameters.
Choose two parameters, a positive numbers - epsilon and an integer minPoints.
Randomly pick few points from the dataset. If there are greater than minPoints within a radius of epsilon from that point, we consider all of them to be part of a "cluster". Note that eucledian distance is used here.
Expand that cluster by checking all of the unchecked new points and seeing if they have greater than minPoints within a radius of epsilon, growing the cluster recursively in this manner.
Eventually, we run out of points to add to the cluster. We then pick a new arbitrary point and repeat the process.
At the end of clustering, we could end up with data points not belonging to any cluster that we call noise.
The figure shows how 3 points which are lying within the ball of radius $\epsilon$ are considered neighbors that satisfies the criterion for the point to be considered a member of the cluster. The point is then tagged as visited. Later the ball is moved to each neighbor point and the same member determination continues till all points are visited.
<img src="../images/DBScan.png", style="width: 700px;">
Ref: https://en.wikipedia.org/wiki/DBSCAN
We will have to specify epsilon and a natural number minPoints. Let us use values epsilon = 0.09 and minPoints = 5.
db = DBSCAN(eps=0.06, min_samples=5)
db.fit(X)
labels = db.labels_
from matplotlib import pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import numpy as np
import seaborn as sns
import pandas as pd
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
N_Samples = 1000
D = 2
K = 4
X, y = make_moons(n_samples = 2*N_Samples, noise=0.05, shuffle = False)
x_vec, y_vec = make_moons(n_samples = 2*N_Samples, noise=0.08, shuffle = False)
x_vec[:,0] += 2.5
y_vec += 2
X = np.concatenate((X, x_vec), axis=0)
y = np.concatenate((y, y_vec), axis=0)
# Create a dataframe moon_df and visualize a graph g
moon_df = pd.DataFrame({'X_0':X[:,0],'X_1':X[:,1], 'y':y})
g = sns.pairplot(x_vars="X_0", y_vars="X_1", hue="y", data = moon_df)
g.fig.set_size_inches(14, 6)
sns.despine()
# Initialize a DBScan cluster with eps and minimum points
db = DBSCAN(eps=0.06, min_samples=5)
Use .fit and .labels_ for fitting the dbscan and accessing the cluster labels respectively.
db.fit(X)
labels = db.labels_
# Create a data frame and visualize the plot.
moon_df['db_clus'] = labels
g=sns.pairplot(x_vars="X_0", y_vars="X_1", hue = "db_clus", data = moon_df)
g.fig.set_size_inches(14, 6)
sns.despine()
ref_tmp_var = False
try:
ref_assert_var = False
db_ = DBSCAN(eps=0.06, min_samples=5)
db_.fit(X)
import numpy as np
if np.all(moon_df['db_clus'] == db_.labels_):
ref_assert_var = True
out = g
else:
ref_assert_var = False
except Exception:
print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
if ref_assert_var:
ref_tmp_var = True
else:
print('Please follow the instructions given and use the same variables provided in the instructions.')
assert ref_tmp_var
DBScan uses density based clustering and though it appears to determine various shapes, not always is it successful. Consider two intertwined circles with a high noise of 0.2. This dataset can be generated by make_circles function in the sklearn.datasets using a noise of 0.2.
from sklearn.datasets import make_circles
# Generate the dataset with intertwined circles
X, y = make_circles(n_samples=N_Samples, factor=.5, noise=.2)
noisy_circles = pd.DataFrame({'X_0':X[:,0],'X_1':X[:,1], 'y':y})
# Perform DBScan on the dataset and visualize the results
db = DBSCAN(eps=0.5, min_samples=5)
Transform the data into a dataframe and plot the result with a hue as the cluster labels.
db.fit(X)
labels = db.labels_
noisy_circles = pd.DataFrame({'X_0':X[:,0], 'X_1':X[:,1], 'y':labels})
g = sns.pairplot(x_vars="X_0", y_vars="X_1", hue = "y", data = noisy_circles)
g.fig.set_size_inches(14, 6)
sns.despine()
ref_tmp_var = False
try:
ref_assert_var = False
import numpy as np
db_ = DBSCAN(eps=0.5, min_samples=5)
db_.fit(X)
if np.all(db_.labels_ == labels):
ref_assert_var = True
out = g
else:
ref_assert_var = False
except Exception:
print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
if ref_assert_var:
ref_tmp_var = True
else:
print('Please follow the instructions given and use the same variables provided in the instructions.')
assert ref_tmp_var