Notebook

Compare Unsupervised Machine Learning Algorithms¶

Idenitify clusters in oceanographic data using several clustering algorithms for unsupervised machine learning

Executing this Notebook requires a personal STOQS database. Follow the steps to build your own development system — this will take about an hour and depends on a good connection to the Internet. Once your server is up log into it (after a cd ~/Vagrants/stoqsvm) and activate your virtual environment with the usual commands:

vagrant ssh -- -X
cd /vagrant/dev/stoqsgit
source venv-stoqs/bin/activate

Connect to your Institution's STOQS database server using read-only credentials. (Note: firewalls typically limit unprivileged access to such resources.)

cd stoqs
ln -s mbari_campaigns.py campaigns.py
export DATABASE_URL=postgis://everyone:guest@kraken.shore.mbari.org:5433/stoqs

This script uses the following clustering algorithms:

Hierarchical Clustering: identifies clusters based on distance connectivity
DBSCAN (Density-based spatial clustering of applications with noise): identifies clusters based on data point density
Mean Shift: identifies clusters based on centroids (the mean of the points within a region) - finds data "blobs"
Birch: identifies clusters by using a Clustering Feature Tree to recursively create subclusters by adding each sample to the subcluster with the closest centroid

To be able to execute the cells and experiment with different algortithms and parameters, launch Jupyter Notebook with:

cd contrib/notebooks
../../manage.py shell_plus --notebook

navigate to this file and open it. You will then be able to execute the cells and experiment with different settings and code.

In [1]:

cd /vagrant/dev/stoqsgit/stoqs/

/vagrant/dev/stoqsgit/stoqs

In [2]:

from contrib.analysis.cluster import Clusterer

%matplotlib inline
import pylab as plt
import numpy as np

In [3]:

# defining function to create clusters for a specified algorithm
def cluster(algorithm_string, normalize):
    # specifying arguments - simulating cluster.py command line arguments
    from argparse import Namespace
    c = Clusterer()
    ns = Namespace()
    ns.database = 'stoqs_september2013'
    ns.platform = 'Slocum_260'
    ns.inputs = ['optical_backscatter700nm', 'fluorescence']
    ns.start = '20130923T094038'
    ns.end = '20130923T130613'
    ns.algorithm = algorithm_string
    ns.do_not_normalize = not normalize
    ns.verbose=True
    c.args = ns
    
    # loading the data, output is in list form
    x, y, x_ids, y_ids = c.loadData()
    
    # creating clusters
    X, y_clusters, X_ids = c.createClusters()
    
    # plotting result
    colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
    colors = np.hstack([colors] * 20)

    return X, y_clusters, colors

# Define function to plot clusters of 4 different algorithms
def plot_clusters(ax, algorithms, normalize=False):
    for algorithm, i, j in zip(algorithms, (0, 0, 1, 1), (0, 1, 0, 1)):
        X1, y_clusters, colors = cluster(algorithm, normalize);
        ax[i, j].scatter(X1[:,0], X1[:,1], s=2, color=colors[y_clusters])
        ax[i, j].set_title(algorithm, fontsize='20')
        ax[i, j].set_ylabel('fluorescence (micrograms/l)', fontsize='14')
        ax[i, j].set_xlabel('optical_backscatter700nm', fontsize='14')
        ax[i, j].tick_params(labelsize=12)
        if not normalize:
            ax[i, j].axis([0.0001,0.0014,0,16]);

Compare clustering of sample without normalizing the input data

In [4]:

fig, ax = plt.subplots(2, 2, figsize=(18,10));
fig.subplots_adjust(wspace=.2, hspace=.4)
plot_clusters(ax, ('Hierarchical_Clustering', 'DBSCAN', 'Mean_Shift', 'Birch'), normalize=False)

Compare clustering of sample with normalized input data. It's odd/interesting that the clustering is different.

In [5]:

fig, ax = plt.subplots(2, 2, figsize=(18,10));
fig.subplots_adjust(wspace=.2, hspace=.4)
plot_clusters(ax, ('Hierarchical_Clustering', 'DBSCAN', 'Mean_Shift', 'Birch'), normalize=True)

/vagrant/dev/stoqsgit/venv-stoqs/lib64/python3.6/site-packages/sklearn/cluster/birch.py:602: UserWarning: Number of subclusters found (1) by Birch is less than (3). Decrease the threshold.
  % (len(centroids), self.n_clusters))

In [ ]: