Idenitify clusters in oceanographic data using unsupervised machine learning
Executing this Notebook requires a personal STOQS database. Follow the steps to build your own development system — this will take about an hour and depends on a good connection to the Internet. Once your server is up log into it (after a cd ~/Vagrants/stoqsvm
) and activate your virtual environment with the usual commands:
vagrant ssh -- -X
cd /vagrant/dev/stoqsgit
source venv-stoqs/bin/activate
Connect to your Institution's STOQS database server using read-only credentials. (Note: firewalls typically limit unprivileged access to such resources.)
cd stoqs
ln -s mbari_campaigns.py campaigns.py
export DATABASE_URL=postgis://everyone:guest@kraken.shore.mbari.org:5433/stoqs
Use the stoqs/contrib/analysis/cluster.py
script to create clusters for specified data. To execute the script in the command line, commands take this form:
contrib/analysis/cluster.py --createClusters \
--database stoqs_september2013 --platform Slocum_260 \
--start 20130923T124038 --end 20130923T150613 \
--inputs optical_backscatter700nm fluorescence \ -v
This script uses the DBSCAN clustering algorithm, which identifies clusters based on data point density.
To be able to execute the cells and experiment with different algortithms and parameters, launch Jupyter Notebook with:
cd contrib/notebooks
../../manage.py shell_plus --notebook
navigate to this file and open it. You will then be able to execute the cells and experiment with different settings and code.
cd /vagrant/dev/stoqsgit/stoqs/
/vagrant/dev/stoqsgit/stoqs
# calling the clustering script and specifying arguments
from contrib.analysis.cluster import Clusterer
c = Clusterer()
# specifying arguments
from argparse import Namespace
ns = Namespace()
ns.database = 'stoqs_september2013'
ns.platform = 'Slocum_260'
ns.inputs = ['optical_backscatter700nm', 'fluorescence']
ns.start = '20130923T094038'
ns.end = '20130923T130613'
ns.algorithm = 'DBSCAN'
ns.do_not_normalize = True
ns.verbose=True
c.args = ns
# loading the data, output is in list form
x, y, x_ids, y_ids = c.loadData()
# verifying that the correct data points have been extracted
%pylab inline
import matplotlib.pyplot as plt
plt.scatter(x, y, s=2)
plt.ylabel('fluorescence (micrograms/l)')
plt.xlabel('optical_backscatter700nm')
plt.axis([0.0001,0.0014,0,16]);
Populating the interactive namespace from numpy and matplotlib
# creating clusters
X, y_clusters, X_ids = c.createClusters()
# plotting result
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
plt.scatter(X[:,0], X[:,1], s=2, color=colors[y_clusters])
plt.ylabel('fluorescence (micrograms/l)')
plt.xlabel('optical_backscatter700nm')
plt.axis([0.0001,0.0014,0,16]);