Benchmarking Python Clustering Algorithms on 2D Data

Other notebooks perform a more genenral analysis of clustering algorithms; this notebook is looking at the special case of two dimensional data. Why two dimensional? The algorithms used in DBSCAN and HDBSCAN (in particular algorithms using kd-trees and ball trees) have significantly better performance and scaling in lower dimensions. I wanted to look at how scaling compares in what is effectively the best case scenario for such algorithms. As in the other notebooks we'll be looking at a number of algorithms and implementations.

The implementations being test are:

  • Sklearn (which implements several algorithms):
    • K-Means clustering
    • DBSCAN clustering
    • Agglomerative clustering
    • Spectral clustering
    • Affinity Propagation
  • Scipy (which provides basic algorithms):
    • K-Means clustering
    • Agglomerative clustering
  • Fastcluster (which provides very fast agglomerative clustering in C++)
  • DeBaCl (Density Based Clustering; similar to a mix of DBSCAN and Agglomerative)
  • HDBSCAN (A robust hierarchical version of DBSCAN)
In [1]:
import hdbscan
import debacl
import fastcluster
import sklearn.cluster
import scipy.cluster
import sklearn.datasets
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_palette('Paired', 10)
sns.set_color_codes()

We can borrow the benchmarking code from other notebooks to benchmark multiple different sizes of dataset. Because some clustering algorithms have performance that can vary quite a lot depending on the exact nature of the dataset we'll also need to run several times on randomly generated datasets of each size so as to get a better idea of the average case performance.

We also need to generalise over algorithms which don't necessarily all have the same API. We can resolve that by taking a clustering function, argument tuple and keywords dictionary to let us do semi-arbitrary calls (fortunately all the algorithms do at least take the dataset to cluster as the first parameter).

Finally some algorithms scale poorly, and I don't want to spend forever doing clustering of random datasets so we'll cap the maximum time an algorithm can use; once it has taken longer than max time we'll just abort there and leave the remaining entries in our datasize by samples matrix unfilled.

In the end this all amounts to a fairly straightforward set of nested loops (over datasizes and number of samples) with calls to sklearn to generate mock data and the clustering function inside a timer. Add in some early abort and we're done.

This time around we'll also set the default dataset dimension to be two.

In [2]:
def benchmark_algorithm(dataset_sizes, cluster_function, function_args, function_kwds,
                        dataset_dimension=2, max_time=45, sample_size=2):
    
    # Initialize the result with NaNs so that any unfilled entries 
    # will be considered NULL when we convert to a pandas dataframe at the end
    result = np.nan * np.ones((len(dataset_sizes), sample_size))
    for index, size in enumerate(dataset_sizes):
        dataset_n_clusters = int(np.log2(size))
        for s in range(sample_size):
            # Use sklearns make_blobs to generate a random dataset with specified size
            # dimension and number of clusters
            data, _ = sklearn.datasets.make_blobs(n_samples=size, 
                                                       n_features=dataset_dimension, 
                                                       centers=dataset_n_clusters)
            
            # Start the clustering with a timer
            start_time = time.time()
            cluster_function(data, *function_args, **function_kwds)
            time_taken = time.time() - start_time
            
            # If we are taking more than max_time then abort -- we don't
            # want to spend excessive time on slow algorithms
            if time_taken > max_time:
                result[index, s] = time_taken
                return pd.DataFrame(np.vstack([dataset_sizes.repeat(sample_size), 
                                               result.flatten()]).T, columns=['x','y'])
            else:
                result[index, s] = time_taken
        
    # Return the result as a dataframe for easier handling with seaborn afterwards
    return pd.DataFrame(np.vstack([dataset_sizes.repeat(sample_size), 
                                   result.flatten()]).T, columns=['x','y'])

Comparison of all ten implementations

Now we need a range of dataset sizes to test out our algorithm. Since the scaling performance is wildly different over the ten implementations we're going to look at it will be beneficial to have a number of very small dataset sizes, and increasing spacing as we get larger, spanning out to 32000 datapoints to cluster (to begin with). Numpy provides convenient ways to get this done via arange and vector multiplication. We'll start with step sizes of 500, then shift to steps of 1000 past 3000 datapoints, and finally steps of 2000 past 6000 datapoints.

In [3]:
dataset_sizes = np.hstack([np.arange(1, 6) * 500, np.arange(3,7) * 1000, np.arange(4,17) * 2000])

Now it is just a matter of running all the clustering algorithms via our benchmark function to collect up all the requsite data. This could be prettier, rolled up into functions appropriately, but sometimes brute force is good enough. More importantly (for me) since this can take a significant amount of compute time, I wanted to be able to comment out algorithms that were slow or I was uninterested in easily. Which brings me to a warning for you the reader and potential user of the notebook: this next step is very expensive. We are running ten different clustering algorithms multiple times each on twenty two different dataset sizes -- and some of the clustering algorithms are slow (we are capping out at forty five seconds per run). That means that the next cell can take an hour or more to run. That doesn't mean "Don't try this at home" (I actually encourage you to try this out yourself and play with dataset parameters and clustering parameters) but it does mean you should be patient if you're going to!

In [4]:
k_means = sklearn.cluster.KMeans(10)
k_means_data = benchmark_algorithm(dataset_sizes, k_means.fit, (), {})

dbscan = sklearn.cluster.DBSCAN(eps=1.25)
dbscan_data = benchmark_algorithm(dataset_sizes, dbscan.fit, (), {})

scipy_k_means_data = benchmark_algorithm(dataset_sizes, scipy.cluster.vq.kmeans, (10,), {})

scipy_single_data = benchmark_algorithm(dataset_sizes, scipy.cluster.hierarchy.single, (), {})

fastclust_data = benchmark_algorithm(dataset_sizes, fastcluster.single, (), {})

hdbscan_ = hdbscan.HDBSCAN()
hdbscan_data = benchmark_algorithm(dataset_sizes, hdbscan_.fit, (), {})

debacl_data = benchmark_algorithm(dataset_sizes, debacl.geom_tree.geomTree, (5, 5), {'verbose':False})

agglomerative = sklearn.cluster.AgglomerativeClustering(10)
agg_data = benchmark_algorithm(dataset_sizes, agglomerative.fit, (), {}, sample_size=4)

spectral = sklearn.cluster.SpectralClustering(10)
spectral_data = benchmark_algorithm(dataset_sizes, spectral.fit, (), {}, sample_size=6)

affinity_prop = sklearn.cluster.AffinityPropagation()
ap_data = benchmark_algorithm(dataset_sizes, affinity_prop.fit, (), {}, sample_size=3)

Now we need to plot the results so we can see what is going on. The catch is that we have several datapoints for each dataset size and ultimately we would like to try and fit a curve through all of it to get the general scaling trend. Fortunately seaborn comes to the rescue here by providing regplot which plots a regression through a dataset, supports higher order regression (we should probably use order two as most algorithms are effectively quadratic) and handles multiple datapoints for each x-value cleanly (using the x_estimator keyword to put a point at the mean and draw an error bar to cover the range of data).

In [5]:
sns.regplot(x='x', y='y', data=k_means_data, order=2, label='Sklearn K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=dbscan_data, order=2, label='Sklearn DBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=scipy_k_means_data, order=2, label='Scipy K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=hdbscan_data, order=2, label='HDBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=fastclust_data, order=2, label='Fastcluster Single Linkage', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=scipy_single_data, order=2, label='Scipy Single Linkage', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=debacl_data, order=2, label='DeBaCl Geom Tree', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=spectral_data, order=2, label='Sklearn Spectral', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=agg_data, order=2, label='Sklearn Agglomerative', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=ap_data, order=2, label='Sklearn Affinity Propagation', x_estimator=np.mean)
plt.gca().axis([0, 34000, 0, 120])
plt.gca().set_xlabel('Number of data points')
plt.gca().set_ylabel('Time taken to cluster (s)')
plt.title('Performance Comparison of Clustering Implementations')
plt.legend()
Out[5]:
<matplotlib.legend.Legend at 0x10ce76850>
/Users/leland/.conda/envs/hdbscan_dev/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

A few features stand out. First of all there appear to be essentially two classes of implementation. The fast implementations tend to be implementations of single linkage agglomerative clustering, K-means, and DBSCAN. The slow cases are largely from sklearn and include agglomerative clustering (in this case using Ward instead of single linkage).

We really wanted to see what happens in the two dimensional case for the better scaling algorithms, so let's drop out those slow algorithms so we can scale out a little further and get a closer look at the various algorithms that managed 32000 points in under twenty seconds. There is almost undoubtedly more to learn as we get ever larger dataset sizes.

Comparison of fast implementations

Let's compare the six fastest implementations now. We can scale out a little further as well; based on the curves above it looks like we should be able to comfortably get to 60000 data points without taking much more than a minute per run. We can also note that most of these implementations weren't that noisy so we can get away with a single run per dataset size.

In [6]:
large_dataset_sizes = np.arange(1,16) * 4000

hdbscan_prims = hdbscan.HDBSCAN(algorithm='prims_kdtree')
large_hdbscan_prims_data = benchmark_algorithm(large_dataset_sizes, 
                                         hdbscan_prims.fit, (), {}, max_time=90, sample_size=1)

hdbscan_boruvka = hdbscan.HDBSCAN(algorithm='boruvka_kdtree')
large_hdbscan_boruvka_data = benchmark_algorithm(large_dataset_sizes, 
                                         hdbscan_boruvka.fit, (), {}, max_time=90, sample_size=1)

k_means = sklearn.cluster.KMeans(10)
large_k_means_data = benchmark_algorithm(large_dataset_sizes, 
                                         k_means.fit, (), {}, max_time=90, sample_size=1)

dbscan = sklearn.cluster.DBSCAN(eps=1.25, min_samples=5)
large_dbscan_data = benchmark_algorithm(large_dataset_sizes, 
                                        dbscan.fit, (), {}, max_time=90, sample_size=1)

large_fastclust_data = benchmark_algorithm(large_dataset_sizes, 
                                           fastcluster.single, (), {}, max_time=90, sample_size=1)

large_scipy_k_means_data = benchmark_algorithm(large_dataset_sizes, 
                                               scipy.cluster.vq.kmeans, (10,), {}, max_time=90, sample_size=1)

large_scipy_single_data = benchmark_algorithm(large_dataset_sizes, 
                                              scipy.cluster.hierarchy.single, (), {}, max_time=90, sample_size=1)

Again we can use seaborn to do curve fitting and plotting, exactly as before.

In [7]:
sns.regplot(x='x', y='y', data=large_k_means_data, order=2, label='Sklearn K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_dbscan_data, order=2, label='Sklearn DBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_scipy_k_means_data, order=2, label='Scipy K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_hdbscan_boruvka_data, order=2, label='HDBSCAN Boruvka', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_fastclust_data, order=2, label='Fastcluster Single Linkage', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_scipy_single_data, order=2, label='Scipy Single Linkage', x_estimator=np.mean)
#sns.regplot(x='x', y='y', data=large_hdbscan_prims_data, order=2, label='HDBSCAN Prims', x_estimator=np.mean)

plt.gca().axis([0, 64000, 0, 150])
plt.gca().set_xlabel('Number of data points')
plt.gca().set_ylabel('Time taken to cluster (s)')
plt.title('Performance Comparison of Fastest Clustering Implementations')
plt.legend()
Out[7]:
<matplotlib.legend.Legend at 0x108721210>

Clearly something has gone woefully wrong with the curve fitting for the two single linkage algorithms, but what exactly? If we look at the raw data we can see.

In [8]:
large_fastclust_data.tail(10)
Out[8]:
x y
5 24000 7.506189
6 28000 10.821918
7 32000 22.282744
8 36000 26.050475
9 40000 212.501709
10 44000 NaN
11 48000 NaN
12 52000 NaN
13 56000 NaN
14 60000 NaN
In [9]:
large_scipy_single_data.tail(10)
Out[9]:
x y
5 24000 10.324891
6 28000 15.216605
7 32000 21.523087
8 36000 29.564827
9 40000 174.514488
10 44000 NaN
11 48000 NaN
12 52000 NaN
13 56000 NaN
14 60000 NaN

It seems that at around 44000 points we hit a wall and the runtimes spiked. A hint is that I'm running this on a laptop with 8GB of RAM. Both single linkage algorithms use scipy.spatial.pdist to compute pairwise distances between points, which returns an array of shape (n(n-1)/2, 1) of doubles. A quick computation shows that that array of distances is quite large once we nave 44000 points:

In [10]:
size_of_array = 44000 * (44000 - 1) / 2         # from pdist documentation
bytes_in_array = size_of_array * 8              # Since doubles use 8 bytes
gigabytes_used = bytes_in_array / (1024.0 ** 3)  # divide out to get the number of GB
gigabytes_used
Out[10]:
7.211998105049133

If we assume that my laptop is keeping much other than that distance array in RAM then clearly we are going to spend time paging out the distance array to disk and back and hence we will see the runtimes increase dramatically as we become disk IO bound.

For fastcluster there is an alternative linkage_vector function that can be used if you are doing single linkage and have the original data (as opposed to a matrix of dissimilarities). We'll rerun fastcluster with this instead.

For scipy single linkahe if we just leave off the last element we can get a better idea of the curves, but keep in mind that this single linkage implemtation does not scale past a limit set by your available RAM.

In [11]:
large_fastclust_data = benchmark_algorithm(large_dataset_sizes, 
                                           fastcluster.linkage_vector, (), {}, max_time=90, sample_size=1)

Now we just need to replot with the new fastcluster data, and the truncated scipy single linkage data.

In [12]:
sns.regplot(x='x', y='y', data=large_k_means_data, order=2, label='Sklearn K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_dbscan_data, order=2, label='Sklearn DBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_scipy_k_means_data, order=2, label='Scipy K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_hdbscan_boruvka_data, order=2, label='HDBSCAN Boruvka', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_fastclust_data, order=2, label='Fastcluster Single Linkage', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_scipy_single_data[:8], order=2, label='Scipy Single Linkage', x_estimator=np.mean)
#sns.regplot(x='x', y='y', data=large_hdbscan_prims_data, order=2, label='HDBSCAN Prims', x_estimator=np.mean)

plt.gca().axis([0, 64000, 0, 150])
plt.gca().set_xlabel('Number of data points')
plt.gca().set_ylabel('Time taken to cluster (s)')
plt.title('Performance Comparison of Fastest Clustering Implementations')
plt.legend()
Out[12]:
<matplotlib.legend.Legend at 0x104401d50>

This rules out scipy single linkage as beign in the same class as the others even igf it could scale. Even fastcluster is starting to show the effects of having an $O(n^2)$ algorithm, despite having worked very hard to improve constants.

In practice this is going to mean that for larger datasets you are going to be very constrained in what algorithms you can apply: if you get enough datapoints only K-Means, DBSCAN and HDBSCAN will be left. Let's get a closer look at the performance of these.

Comparison of K-Means and DBSCAN and HDBSCAN implementations

At this point we can scale out to 300000 datapoints easily enough: These algorithms can use various data structures to avoid having to compute the full pairwise distance matrix and thus potentially have much more favourable asymptotic complexity.

In [13]:
huge_dataset_sizes = np.arange(1,16) * 20000

k_means = sklearn.cluster.KMeans(10)
huge_k_means_data = benchmark_algorithm(huge_dataset_sizes, 
                                        k_means.fit, (), {}, max_time=120, sample_size=2)

dbscan = sklearn.cluster.DBSCAN()
huge_dbscan_data = benchmark_algorithm(huge_dataset_sizes, 
                                       dbscan.fit, (), {}, max_time=240, sample_size=2)

huge_scipy_k_means_data = benchmark_algorithm(huge_dataset_sizes, 
                                              scipy.cluster.vq.kmeans, (10,), {}, max_time=240, sample_size=2)

hdbscan_boruvka = hdbscan.HDBSCAN(algorithm='boruvka_kdtree')
huge_hdbscan_data = benchmark_algorithm(huge_dataset_sizes, 
                                       hdbscan_boruvka.fit, (), {}, max_time=240, sample_size=4)

huge_fastcluster_data = benchmark_algorithm(huge_dataset_sizes, 
                                            fastcluster.linkage_vector, (), {}, max_time=240, sample_size=2)
In [14]:
sns.regplot(x='x', y='y', data=huge_k_means_data, order=2, label='Sklearn K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=huge_dbscan_data, order=2, label='Sklearn DBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=huge_scipy_k_means_data, order=2, label='Scipy K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=huge_hdbscan_data, order=2, label='HDBSCAN', x_estimator=np.mean)
#sns.regplot(x='x', y='y', data=huge_fastcluster_data, order=2, label='Fastcluster', x_estimator=np.mean)


plt.gca().axis([0, 310000, 0, 60])
plt.gca().set_xlabel('Number of data points')
plt.gca().set_ylabel('Time taken to cluster (s)')
plt.title('Performance Comparison of K-Means and DBSCAN')
plt.legend()
Out[14]:
<matplotlib.legend.Legend at 0x10cf59d90>

While in other notebooks the two K-means implementations separated out, with the Sklearn implementation being clearly better, here we see they very much keep pace. Also, interestingly we see the full benefits of the dual tree Boruvka algorithm for minimum spanning trees, with the HDBSCAN implementation actually outperforming ordinary DBSCAN for large dataset sizes. In higher dimensions this won't be the case, but it makes it clear that if you have very low dimensional data then HDBSCAN (from which you can extract ordinary DBSCAN results if you so desire) is clearly the better choice.

But should I get a coffee?

So we know which implementations scale and which don't; a more useful thing to know in practice is, given a dataset, what can I run interactively? What can I run while I go and grab some coffee? How about a run over lunch? What if I'm willing to wait until I get in tomorrow morning? Each of these represent significant breaks in productivity -- once you aren't working interactively anymore your productivity drops measurably, and so on.

We can build a table for this. To start we'll need to be able to approximate how long a given clustering implementation will take to run. Fortunately we already gathered a lot of that data; if we load up the statsmodels package we can fit the data (with a quadratic or $n\log n$ fit depending on the implementation) and use the resulting model to make our predictions. Obviously this has some caveats: if you fill your RAM with a distance matrix your runtime isn't going to fit the curve.

I've hand built a time_samples list to give a reasonable set of potential data sizes that are nice and human readable. After that we just need a function to fit and build the curves.

In [15]:
import statsmodels.formula.api as sm

time_samples = [1000, 2000, 5000, 10000, 25000, 50000, 75000, 100000, 250000, 500000, 750000,
               1000000, 2500000, 5000000, 10000000, 50000000, 100000000, 500000000, 1000000000]

def get_timing_series(data, quadratic=True):
    if quadratic:
        data['x_squared'] = data.x**2
        model = sm.ols('y ~ x + x_squared', data=data).fit()
        predictions = [model.params.dot([1.0, i, i**2]) for i in time_samples]
        return pd.Series(predictions, index=pd.Index(time_samples))
    else: # assume n log(n)
        data['xlogx'] = data.x * np.log(data.x)
        model = sm.ols('y ~ x + xlogx', data=data).fit()
        predictions = [model.params.dot([1.0, i, i*np.log(i)]) for i in time_samples]
        return pd.Series(predictions, index=pd.Index(time_samples))

Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation. I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. Note how the $O(n\log n)$ algorithms utterly dominate here, and HDBSCAN achieves that complexity with the dual tree Boruvka on two dimensional data.

In [16]:
ap_timings = get_timing_series(ap_data)
spectral_timings = get_timing_series(spectral_data)
agg_timings = get_timing_series(agg_data)
debacl_timings = get_timing_series(debacl_data)
fastclust_timings = get_timing_series(large_fastclust_data.ix[:10,:].copy())
scipy_single_timings = get_timing_series(large_scipy_single_data.ix[:10,:].copy())
hdbscan_boruvka = get_timing_series(huge_hdbscan_data, quadratic=False)
#scipy_k_means_timings = get_timing_series(huge_scipy_k_means_data, quadratic=False)
dbscan_timings = get_timing_series(huge_dbscan_data, quadratic=False)
k_means_timings = get_timing_series(huge_k_means_data, quadratic=True)

timing_data = pd.concat([ap_timings, spectral_timings, agg_timings, debacl_timings, 
                            scipy_single_timings, fastclust_timings, hdbscan_boruvka, 
                            dbscan_timings, k_means_timings
                           ], axis=1)
timing_data.columns=['AffinityPropagation', 'Spectral', 'Agglomerative',
                                       'DeBaCl', 'ScipySingleLinkage', 'Fastcluster',
                                       'HDBSCAN', 'DBSCAN', 'SKLearn KMeans'
                                      ]
def get_size(series, max_time):
    return series.index[series < max_time].max()

datasize_table = pd.concat([
                            timing_data.apply(get_size, max_time=30),
                            timing_data.apply(get_size, max_time=300),
                            timing_data.apply(get_size, max_time=3600),
                            timing_data.apply(get_size, max_time=8*3600)
                            ], axis=1)
datasize_table.columns=('Interactive', 'Get Coffee', 'Over Lunch', 'Overnight')
datasize_table
Out[16]:
Interactive Get Coffee Over Lunch Overnight
AffinityPropagation 2000 5000 25000 75000
Spectral 5000 10000 50000 100000
Agglomerative 2000 10000 25000 75000
DeBaCl 5000 10000 75000 100000
ScipySingleLinkage 25000 50000 100000 250000
Fastcluster 100000 250000 1000000 2500000
HDBSCAN 500000 2500000 10000000 100000000
DBSCAN 250000 1000000 10000000 50000000
SKLearn KMeans 750000 5000000 10000000 50000000

Conclusions

If you have two dimensional data then HDBSCN is an excellent choice, outperforming regular DBSCAN and not being too far behind K-Means.