Given a .biom file and multiple STOQS databases, explore Next Generation Sequence and associated STOQS data
Executing this Notebook requires a personal STOQS server. Follow the steps to build your own development system; this will take a few hours and depends on a good connection to the Internet. Once your server is up, log into it (after a cd ~/Vagrants/stoqsvm) and activate your virtual environment with the usual commands:
vagrant ssh -- -X
cd /vagrant/dev/stoqsgit
source venv-stoqs/bin/activate
pip install biom-format
Then load all of the SIMZ databases with the commands below. In order to have all of the subsample analysis data (Sampled Parameters) loaded, it's necessary to have SIMZ<month><year> directories containing those .csv files. (See the subsample_csv_files attribute setting in the load script for the campaign.)
cd stoqs
ln -s mbari_campaigns.py campaigns.py
export DATABASE_URL=postgis://stoqsadm:CHANGEME@127.0.0.1:5432/stoqs
loaders/load.py --db stoqs_simz_aug2013 stoqs_simz_oct2013 \
stoqs_simz_spring2014 stoqs_simz_jul2014 stoqs_simz_oct2014
loaders/load.py --db stoqs_simz_aug2013 stoqs_simz_oct2013 \
stoqs_simz_spring2014 stoqs_simz_jul2014 stoqs_simz_oct2014 --updateprovenance
Loading these databases will take a few hours. Once the loads have finished you can interact with the data quite efficiently, as this Notebook demonstrates. Launch Jupyter Notebook with:
cd contrib/notebooks
../../manage.py shell_plus --notebook
Navigate to this file and open it. You will then be able to execute the cells and experiment.
Make a Python list of all the SIMZ databases from the campaigns on our system.
from campaigns import campaigns
dbs = [c for c in campaigns if 'simz' in c]
print(dbs)
['stoqs_simz_aug2013', 'stoqs_simz_oct2013', 'stoqs_simz_spring2014', 'stoqs_simz_jul2014', 'stoqs_simz_oct2014']
Open a .biom file that contains sequence data from Net Tows conducted on these campaigns. (You will need to create the BIOM directory and copy the .biom file there as we generally do not keep data files in the STOQS git repository.)
biom_file = '../../loaders/MolecularEcology/BIOM/otu_table_newsiernounclass_wmetadata.biom'
from biom import load_table
table = load_table(biom_file)
print(table.ids(axis='sample'))
print(table.ids(axis='observation')[:5])
['SIMZ1' 'SIMZ11' 'SIMZ10' 'SIMZ13' 'SIMZ17' 'SIMZ6' 'SIMZ2' 'SIMZ5' 'SIMZ4' 'SIMZ18' 'SIMZ3' 'SIMZ14' 'SIMZ12' 'SIMZ7' 'SIMZ9' 'SIMZ8' 'SIMZ16'] ['denovo3239' 'denovo1173' 'denovo1774' 'denovo1778' 'denovo1779']
Find all the VerticalNetTow Sample identifiers for all our SIMZ campaigns. (For hints on names to filter on use the STOQS REST api to explore Sample data for a campaign, e.g.: http://localhost:8000/stoqs_simz_aug2013/api/sample.html.) These will be our links to the environmental and other sample data.
nettows = {}
for db in dbs:
    for s in Sample.objects.using(db).filter(sampletype__name='VerticalNetTow'
                ).order_by('instantpoint__activity__name'):
        print(s.instantpoint.activity.name, db)
        nettows[s.instantpoint.activity.name] = db
simz2013c01_NetTow1 stoqs_simz_aug2013
simz2013c02_NetTow1 stoqs_simz_aug2013
simz2013c03_NetTow1 stoqs_simz_aug2013
simz2013c04_NetTow1 stoqs_simz_aug2013
simz2013c05_NetTow1 stoqs_simz_aug2013
simz2013c06_NetTow1 stoqs_simz_aug2013
simz2013c07_NetTow1 stoqs_simz_aug2013
simz2013c08_NetTow1 stoqs_simz_aug2013
simz2013c09_NetTow1 stoqs_simz_aug2013
simz2013c10_NetTow1 stoqs_simz_aug2013
simz2013c11_NetTow1 stoqs_simz_aug2013
simz2013c12_NetTow1 stoqs_simz_aug2013
simz2013c13_NetTow1 stoqs_simz_aug2013
simz2013c14_NetTow1 stoqs_simz_aug2013
simz2013c15_NetTow1 stoqs_simz_aug2013
simz2013c16_NetTow1 stoqs_simz_aug2013
simz2013c17_NetTow1 stoqs_simz_aug2013
simz2013c18_NetTow1 stoqs_simz_aug2013
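The loop above boils down to building an activity-name → database lookup. With the ORM query stubbed out by hypothetical rows, the core structure is just a dict comprehension:

```python
# Hypothetical (activity_name, database) rows standing in for the
# Sample.objects.using(db).filter(...) query results
rows = [
    ('simz2013c01_NetTow1', 'stoqs_simz_aug2013'),
    ('simz2013c02_NetTow1', 'stoqs_simz_aug2013'),
    ('simz2014c01_NetTow1', 'stoqs_simz_oct2014'),
]

# Map each net tow's activity name to the campaign database it lives in
nettows = {name: db for name, db in rows}
print(nettows['simz2014c01_NetTow1'])
```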
It looks as though the BIOM table ids (SIMZ1, SIMZ2, ...) correspond to the STOQS s.instantpoint.activity.names (simz2013c01_NetTow1, simz2013c02_NetTow1, ...). Let's loop through the STOQS sample names and BIOM file sample ids, extract relevant data from STOQS and populate a dictionary formatted for adding metadata back to the BIOM file.
stoqs_sample_data = {}
for s, b in [('simz2013c{:02d}_NetTow1'.format(int(n[4:])), n) for n in table.ids()]:
    sps = SampledParameter.objects.using(nettows[s]
            ).filter(sample__instantpoint__activity__name=s)
    # Values of BIOM metadata must be strings, even if they are numbers
    stoqs_sample_data[b] = {sp.parameter.name: str(float(sp.datavalue)) for sp in sps}
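The pairing expression in the loop converts a BIOM sample id such as 'SIMZ1' into the corresponding STOQS activity name by zero-padding the trailing number. A standalone check of that conversion, with a few ids hard-coded rather than read from the table:

```python
biom_ids = ['SIMZ1', 'SIMZ11', 'SIMZ2']  # normally comes from table.ids()

# 'SIMZ1' -> ('simz2013c01_NetTow1', 'SIMZ1'), 'SIMZ11' -> ('simz2013c11_NetTow1', 'SIMZ11'), ...
pairs = [('simz2013c{:02d}_NetTow1'.format(int(n[4:])), n) for n in biom_ids]
print(pairs)
```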
Create a copy of the original table, add the new metadata to the samples and save to a new file name.
new_table = table.copy()
new_table.add_metadata(stoqs_sample_data)
with open(biom_file.replace('.biom', '_stoqs.biom'), 'w') as f:
    new_table.to_json('explore_BIOM_data_for_SIMZ.ipynb', f)
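Two small pieces of plain Python are doing the bookkeeping here: the output file name is derived from the input with a string replace, and the metadata values were normalized through str(float(...)) in the previous step. A quick illustration (with the path shortened for the example):

```python
biom_file = 'otu_table_newsiernounclass_wmetadata.biom'  # path shortened for the example
new_name = biom_file.replace('.biom', '_stoqs.biom')
print(new_name)

# str(float(...)) round-trips numeric strings into a canonical form,
# e.g. trailing zeros are dropped
print(str(float('0.070')))
```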
Compare the first two metadata records from the original table and the new table.
import pprint
pp = pprint.PrettyPrinter(indent=4)
print('Original: ' + biom_file)
print('-' * len('Original: ' + biom_file))
pp.pprint(table.metadata()[:2])
print()
print('New: ' + biom_file.replace('.biom', '_stoqs.biom'))
print('-' * len('New: ' + biom_file.replace('.biom', '_stoqs.biom')))
pp.pprint(new_table.metadata()[:2])
Original: ../../loaders/MolecularEcology/BIOM/otu_table_newsiernounclass_wmetadata.biom
---------------------------------------------------------------------------------------
(   defaultdict(<function Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda> at 0x7f86350ef620>,
                {   'BarcodeSequence': 'ACGAGTGCGT',
                    'Description': 'SIMZ_1',
                    'LinkerPrimerSequence': 'CACGACGTTGTAAAACGAC',
                    'ReversePrimer': 'GGATAACAATTTCACACAGG',
                    'Treatment': 'CAP'}),
    defaultdict(<function Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda> at 0x7f86350ef2f0>,
                {   'BarcodeSequence': 'TGATACGTCT',
                    'Description': 'SIMZ_11',
                    'LinkerPrimerSequence': 'CACGACGTTGTAAAACGAC',
                    'ReversePrimer': 'GGATAACAATTTCACACAGG',
                    'Treatment': 'CAP'}))

New: ../../loaders/MolecularEcology/BIOM/otu_table_newsiernounclass_wmetadata_stoqs.biom
----------------------------------------------------------------------------------------
(   defaultdict(<function Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda> at 0x7f862b74abf8>,
                {   'B1006_barnacles_A450': '0.115',
                    'B1006_barnacles_A650': '0.07',
                    'BarcodeSequence': 'ACGAGTGCGT',
                    'CAL1939_calanoida_A450': '4.0',
                    'CAL1939_calanoida_A650': '4.0',
                    'CAL903_calanoida_A450': '0.12',
                    'CAL903_calanoida_A650': '0.071',
                    'CRAB903_brachyura_A450': '0.077',
                    'CRAB903_brachyura_A650': '0.051',
                    'Description': 'SIMZ_1',
                    'GCRAB_Carcinus_A450': '0.065',
                    'GCRAB_Carcinus_A650': '0.044',
                    'LinkerPrimerSequence': 'CACGACGTTGTAAAACGAC',
                    'M2B_mussels_A450': '0.072',
                    'M2B_mussels_A650': '0.047',
                    'OS1022_polychaeta_A450': '0.237',
                    'OS1022_polychaeta_A650': '0.133',
                    'POD1951_podoplea_A450': '0.144',
                    'POD1951_podoplea_A650': '0.083',
                    'ReversePrimer': 'GGATAACAATTTCACACAGG',
                    'SAB1182_sabellariidae_A450': '0.066',
                    'SAB1182_sabellariidae_A650': '0.051',
                    'SPI1181_spionidae_A450': '0.05',
                    'SPI1181_spionidae_A650': '0.044',
                    'Treatment': 'CAP'}),
    defaultdict(<function Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda> at 0x7f862b74ac80>,
                {   'B1006_barnacles_A450': '0.359',
                    'B1006_barnacles_A650': '0.183',
                    'BarcodeSequence': 'TGATACGTCT',
                    'CAL1939_calanoida_A450': '4.0',
                    'CAL1939_calanoida_A650': '2.043',
                    'CAL903_calanoida_A450': '0.067',
                    'CAL903_calanoida_A650': '0.043',
                    'CRAB903_brachyura_A450': '0.063',
                    'CRAB903_brachyura_A650': '0.044',
                    'Description': 'SIMZ_11',
                    'GCRAB_Carcinus_A450': '0.062',
                    'GCRAB_Carcinus_A650': '0.042',
                    'LinkerPrimerSequence': 'CACGACGTTGTAAAACGAC',
                    'M2B_mussels_A450': '0.073',
                    'M2B_mussels_A650': '0.048',
                    'OS1022_polychaeta_A450': '0.071',
                    'OS1022_polychaeta_A650': '0.05',
                    'POD1951_podoplea_A450': '0.1',
                    'POD1951_podoplea_A650': '0.058',
                    'ReversePrimer': 'GGATAACAATTTCACACAGG',
                    'SAB1182_sabellariidae_A450': '0.983',
                    'SAB1182_sabellariidae_A650': '0.513',
                    'SPI1181_spionidae_A450': '0.058',
                    'SPI1181_spionidae_A650': '0.047',
                    'Treatment': 'CAP',
                    'barnacles': '0.0',
                    'bivalves': '12.0',
                    'calanoids': '17.0',
                    'crabs': '0.0',
                    'echinoderms': '1.0',
                    'gastropods': '1.0',
                    'larvaceans': '8.0',
                    'larvaceansac': '8.0',
                    'nauplii': '104.0',
                    'noncalanoids': '1.0',
                    'polychaetes': '8.0',
                    'pseudonitzschia': '3.0'}))
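To see exactly which metadata keys the STOQS step contributed, comparing the key sets of the two dicts is enough. Illustrated here with small hypothetical metadata dicts rather than the real table records:

```python
# Hypothetical original and updated metadata for one sample
original = {'BarcodeSequence': 'ACGAGTGCGT', 'Treatment': 'CAP'}
updated = dict(original, B1006_barnacles_A450='0.115', M2B_mussels_A450='0.072')

# Keys present only after add_metadata()
new_keys = sorted(set(updated) - set(original))
print(new_keys)
```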
To test the results, upload the new file to http://phinch.org/ and you should see something like this:
from IPython.display import Image
Image('../../../doc/Screenshots/Screen_Shot_2015-10-24_at_10.37.52_PM.png')