Idenitify clusters in oceanographic data using several clustering algorithms for unsupervised machine learning
Executing this Notebook requires a personal STOQS database. Follow the steps to build your own development system — this will take about an hour and depends on a good connection to the Internet. Once your server is up log into it (after a cd ~/Vagrants/stoqsvm
) and activate your virtual environment with the usual commands:
vagrant ssh -- -X
cd /vagrant/dev/stoqsgit
source venv-stoqs/bin/activate
Connect to your Institution's STOQS database server using read-only credentials. (Note: firewalls typically limit unprivileged access to such resources.)
cd stoqs
ln -s mbari_campaigns.py campaigns.py
export DATABASE_URL=postgis://everyone:guest@kraken.shore.mbari.org:5433/stoqs
This script uses the following clustering algorithms:
To be able to execute the cells and experiment with different algortithms and parameters, launch Jupyter Notebook with:
cd contrib/notebooks
../../manage.py shell_plus --notebook
navigate to this file and open it. You will then be able to execute the cells and experiment with different settings and code.
cd /vagrant/dev/stoqsgit/stoqs/
/vagrant/dev/stoqsgit/stoqs
from contrib.analysis.cluster import Clusterer
%matplotlib inline
import pylab as plt
import numpy as np
# defining function to create clusters for a specified algorithm
def cluster(algorithm_string, normalize):
# specifying arguments - simulating cluster.py command line arguments
from argparse import Namespace
c = Clusterer()
ns = Namespace()
ns.database = 'stoqs_september2013'
ns.platform = 'Slocum_260'
ns.inputs = ['optical_backscatter700nm', 'fluorescence']
ns.start = '20130923T094038'
ns.end = '20130923T130613'
ns.algorithm = algorithm_string
ns.do_not_normalize = not normalize
ns.verbose=True
c.args = ns
# loading the data, output is in list form
x, y, x_ids, y_ids = c.loadData()
# creating clusters
X, y_clusters, X_ids = c.createClusters()
# plotting result
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
return X, y_clusters, colors
# Define function to plot clusters of 4 different algorithms
def plot_clusters(ax, algorithms, normalize=False):
for algorithm, i, j in zip(algorithms, (0, 0, 1, 1), (0, 1, 0, 1)):
X1, y_clusters, colors = cluster(algorithm, normalize);
ax[i, j].scatter(X1[:,0], X1[:,1], s=2, color=colors[y_clusters])
ax[i, j].set_title(algorithm, fontsize='20')
ax[i, j].set_ylabel('fluorescence (micrograms/l)', fontsize='14')
ax[i, j].set_xlabel('optical_backscatter700nm', fontsize='14')
ax[i, j].tick_params(labelsize=12)
if not normalize:
ax[i, j].axis([0.0001,0.0014,0,16]);
Compare clustering of sample without normalizing the input data
fig, ax = plt.subplots(2, 2, figsize=(18,10));
fig.subplots_adjust(wspace=.2, hspace=.4)
plot_clusters(ax, ('Hierarchical_Clustering', 'DBSCAN', 'Mean_Shift', 'Birch'), normalize=False)
Compare clustering of sample with normalized input data. It's odd/interesting that the clustering is different.
fig, ax = plt.subplots(2, 2, figsize=(18,10));
fig.subplots_adjust(wspace=.2, hspace=.4)
plot_clusters(ax, ('Hierarchical_Clustering', 'DBSCAN', 'Mean_Shift', 'Birch'), normalize=True)
/vagrant/dev/stoqsgit/venv-stoqs/lib64/python3.6/site-packages/sklearn/cluster/birch.py:602: UserWarning: Number of subclusters found (1) by Birch is less than (3). Decrease the threshold. % (len(centroids), self.n_clusters))