Visualizing Songbird feature differentials with Qurro

In this example, we use data from the Red Sea metagenome dataset. This particular data was obtained from Songbird's GitHub repository in its data/redsea folder, and is associated with the following paper:

Thompson, L. R., Williams, G. J., Haroon, M. F., Shibl, A., Larsen, P., Shorenstein, J., ... & Stingl, U. (2017). Metagenomic covariation along densely sampled environmental gradients in the Red Sea. The ISME journal, 11(1), 138.

The commands for running Songbird and importing the Red Sea data are based on the example usage of this dataset in the Songbird README file on its GitHub page. Note that here we don't explicitly monitor Songbird's diagnostics regarding model fitting, as described in Songbird's README; when using Songbird in practice, it's important to do that. (The hyperparameters we do use were set based on experimentation with Tensorboard.)

Requirements

This notebook relies on QIIME 2, Songbird, and Qurro all being installed. You should be in a QIIME 2 conda environment.

0. Setting up

In this section, we replace the output directory with an empty directory. This just lets us run this notebook multiple times, without any tools complaining about overwriting files.

In [1]:
# Clear the output directory so we can write these files there
!rm -rf output/*
# Since git doesn't keep track of empty directories, create the output/ directory if it doesn't already exist
# (if it does already exist, -p ensures that an error won't be thrown)
!mkdir -p output

1. Using Songbird and Qurro through QIIME 2

You can use Songbird and Qurro inside or outside of QIIME 2. In this section, we'll use Songbird and Qurro from within QIIME 2; in the next section, we'll use these tools outside of QIIME 2.

If you just installed Songbird or Qurro, it's advised that you run qiime dev refresh-cache on your system afterwards in order to get QIIME 2 to "find" these tools' QIIME 2 plugins.

1. A. Using Songbird through QIIME 2

In order to use this dataset's BIOM table in QIIME 2, we need to import it as a FeatureTable[Frequency] QIIME 2 artifact.

In [2]:
!qiime tools import \
    --input-path input/redsea.biom \
    --output-path output/redsea.biom.qza \
    --type FeatureTable[Frequency]
Imported input/redsea.biom as BIOMV210DirFmt to output/redsea.biom.qza

Now, we can run Songbird through QIIME 2 on our imported BIOM table. This produces three output files, but the main one we care about for Qurro is the FeatureData[Differential] artifact (which will be stored in output/differentials.qza). This artifact contains feature differentials: as Songbird's documentation puts it, these correspond to "...the ordering of the coefficients within a covariate."

Please see Songbird's documentation for more information about how it works and how its output files are formatted.

Why these hyperparameters?

These hyperparameters (in particular, epochs and differential-prior) were selected based on experimentation with Tensorboard. See Songbird's FAQs for details on how to use Tensorboard and select these sort of hyperparameters for your own datasets (the scope of that is beyond this tutorial).

In [3]:
!qiime songbird multinomial \
    --i-table output/redsea.biom.qza \
    --m-metadata-file input/redsea_metadata.txt \
    --p-formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
    --p-epochs 10000 \
    --p-differential-prior 0.5 \
    --o-differentials output/differentials.qza \
    --o-regression-stats output/regression-stats.qza \
    --o-regression-biplot output/regression-biplot.qza
Saved FeatureData[Differential] to: output/differentials.qza
Saved SampleData[SongbirdStats] to: output/regression-stats.qza
Saved PCoAResults % Properties('biplot') to: output/regression-biplot.qza

1. B. Using Qurro through QIIME 2

Since our "feature rankings" are the (sorted) feature differentials that Songbird just produced, we'll use the qiime qurro differential-plot command.

In [4]:
!qiime qurro differential-plot --help
Usage: qiime qurro differential-plot [OPTIONS]

  Generates an interactive visualization of feature differentials in tandem
  with a visualization of the log-ratios of selected features' sample
  abundances.

Inputs:
  --i-ranks ARTIFACT FeatureData[Differential]
                       Feature differentials.                       [required]
  --i-table ARTIFACT FeatureTable[Frequency]
                       A BIOM table describing the abundances of the ranked
                       features in samples. Note that empty samples and
                       features will be removed from the Qurro visualization.
                                                                    [required]
Parameters:
  --m-sample-metadata-file METADATA...
    (multiple          Sample metadata. In Qurro visualizations, you can use
     arguments will    sample metadata fields to change the x-axis and colors
     be merged)        in the sample plot.                          [required]
  --m-feature-metadata-file METADATA...
    (multiple          Feature metadata (for example, if your features are
     arguments will    ASVs or OTUs, this could be taxonomy). You can use
     be merged)        feature metadata fields to filter features in the rank
                       plot when selecting log-ratios.              [optional]
  --p-extreme-feature-count INTEGER
                       If specified, Qurro will only use this many "extreme"
                       features from both ends of all of the rankings. This is
                       useful when dealing with huge datasets (e.g. with BIOM
                       tables exceeding 1 million entries), for which running
                       Qurro normally might take a long amount of time or
                       crash due to memory limits. Note that the automatic
                       removal of empty samples and features from the table
                       will be done *after* this filtering step.    [optional]
  --p-debug / --p-no-debug
                       If this flag is used, Qurro will output debug
                       messages. Note that you'll also need to use the
                       --verbose option to see these messages.
                                                              [default: False]
Outputs:
  --o-visualization VISUALIZATION
                                                                    [required]
Miscellaneous:
  --output-dir PATH    Output unspecified results to a directory
  --verbose / --quiet  Display verbose output to stdout and/or stderr during
                       execution of this action. Or silence output if
                       execution is successful (silence is golden).
  --citations          Show citations and exit.
  --help               Show this message and exit.
In [5]:
!qiime qurro differential-plot \
    --i-ranks output/differentials.qza \
    --i-table output/redsea.biom.qza \
    --m-sample-metadata-file input/redsea_metadata.txt \
    --m-feature-metadata-file input/feature_metadata.txt \
    --verbose \
    --o-visualization output/qurro_plot_q2.qzv
Saved Visualization to: output/qurro_plot_q2.qzv

That's it! Now, we've created a QZV file (describing a Qurro visualization) at output/qurro_plot_q2.qzv. You can view this visualization in one of the following ways:

  1. Upload the QZV file to view.qiime2.org.
  2. View the QZV file using qiime tools view.

2. Using Songbird and Qurro as standalone tools

We don't need to use Songbird and Qurro through QIIME 2; if you want, you can run these tools outside of QIIME 2. Although this means you don't have access to some of QIIME 2's functionality (e.g. provenance tracking, or artifact semantic types), the results you get should be roughly the same. (We say "roughly" because some of the machine learning methods used by Songbird involve randomness.)

2. A. Using Songbird as a standalone tool

In [6]:
!songbird multinomial \
    --input-biom input/redsea.biom \
    --metadata-file input/redsea_metadata.txt \
    --formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
    --epochs 10000 \
    --differential-prior 0.5 \
    --summary-dir output/
WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/bin/songbird:191: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-03-10 15:54:17.429657: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f93a910c6d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-10 15:54:17.429683: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/bin/songbird:194: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:70: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:81: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:86: Normal.__init__ (from tensorflow.python.ops.distributions.normal) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/tensorflow_core/python/ops/distributions/normal.py:160: Distribution.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:95: Multinomial.__init__ (from tensorflow.python.ops.distributions.multinomial) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:110: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:116: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/tensorflow_core/python/ops/clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:124: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:125: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:127: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:131: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:163: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

  0%|                                                 | 0/80000 [00:00<?, ?it/s]WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:177: The name tf.RunOptions is deprecated. Please use tf.compat.v1.RunOptions instead.

WARNING:tensorflow:From /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/songbird/multinomial.py:179: The name tf.RunMetadata is deprecated. Please use tf.compat.v1.RunMetadata instead.

2020-03-10 15:54:17.740089: I tensorflow/core/profiler/lib/profiler_session.cc:205] Profiler session started.
 16%|█████▋                             | 12867/80000 [00:09<00:47, 1406.20it/s]2020-03-10 15:54:27.673074: I tensorflow/core/profiler/lib/profiler_session.cc:205] Profiler session started.
 34%|███████████▊                       | 27030/80000 [00:19<00:37, 1405.44it/s]2020-03-10 15:54:37.673738: I tensorflow/core/profiler/lib/profiler_session.cc:205] Profiler session started.
 51%|█████████████████▋                 | 40468/80000 [00:29<00:37, 1045.44it/s]2020-03-10 15:54:47.674137: I tensorflow/core/profiler/lib/profiler_session.cc:205] Profiler session started.
 68%|███████████████████████▋           | 54013/80000 [00:39<00:21, 1228.91it/s]2020-03-10 15:54:57.674411: I tensorflow/core/profiler/lib/profiler_session.cc:205] Profiler session started.
 83%|█████████████████████████████      | 66489/80000 [00:49<00:11, 1138.26it/s]2020-03-10 15:55:07.674542: I tensorflow/core/profiler/lib/profiler_session.cc:205] Profiler session started.
 99%|██████████████████████████████████▊| 79483/80000 [00:59<00:00, 1282.74it/s]2020-03-10 15:55:17.674865: I tensorflow/core/profiler/lib/profiler_session.cc:205] Profiler session started.
100%|███████████████████████████████████| 80000/80000 [01:00<00:00, 1327.17it/s]
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

2. B. Using Qurro as a standalone tool

When we used Qurro through QIIME 2, we had to specify the differential-plot command in order to let the Qurro QIIME 2 plugin know we were working with feature differentials.

Now that we're running Qurro outside of QIIME 2, we don't need to specify this; Qurro can accept either feature differentials or feature loadings as input.

In [7]:
!qurro --help
Usage: qurro [OPTIONS]

  Generates a visualization of feature rankings and log-ratios.

  The resulting visualization contains two plots. The first plot shows how
  features are ranked, and the second plot shows the log-ratio of "selected"
  features' abundances within samples.

  The visualization is interactive, so which features are "selected" to
  construct log-ratios -- as well as various other properties of the
  visualization -- can be changed by the user.

Options:
  -r, --ranks TEXT                Either feature differentials (contained in a
                                  TSV file, where each row describes a feature
                                  and each column describes a differential
                                  field) or a scikit-bio OrdinationResults
                                  file for a biplot (containing feature
                                  loadings). When sorted numerically,
                                  differentials and feature loadings alike
                                  provide 'rankings.'  [required]
  -t, --table TEXT                A BIOM table describing the abundances of
                                  the ranked features in samples. Note that
                                  empty samples and features will be removed
                                  from the Qurro visualization.  [required]
  -sm, --sample-metadata TEXT     Sample metadata, formatted as a TSV file
                                  (where each row describes a sample and each
                                  column describes a 'metadata' field, and the
                                  first column contains sample IDs). In Qurro
                                  visualizations, you can use sample metadata
                                  fields to change the x-axis and colors in
                                  the sample plot.  [required]
  -fm, --feature-metadata TEXT    Feature metadata, formatted as a TSV file
                                  (where each row describes a feature and each
                                  column describes a 'metadata' field, and the
                                  first column contains feature IDs). In Qurro
                                  visualizations, you can use feature metadata
                                  fields to filter features in the rank plot
                                  when selecting log-ratios.
  -o, --output-dir TEXT           Directory to write the HTML/JS/... files
                                  defining a Qurro visualization to. If this
                                  directory already exists, files/directories
                                  already within it will be overwritten if
                                  necessary. Note that you need to keep the
                                  files in this directory together -- moving
                                  the index.html file in this directory to
                                  another location, without also moving the
                                  JS/etc. files, will break the visualization.
                                  [required]
  -x, --extreme-feature-count INTEGER
                                  If specified, Qurro will only use this many
                                  "extreme" features from both ends of all of
                                  the rankings. This is useful when dealing
                                  with huge datasets (e.g. with BIOM tables
                                  exceeding 1 million entries), for which
                                  running Qurro normally might take a long
                                  amount of time or crash due to memory
                                  limits. Note that the automatic removal of
                                  empty samples and features from the table
                                  will be done *after* this filtering step.
  --debug                         If this flag is used, Qurro will output
                                  debug messages.
  --version                       Show the version and exit.
  --help                          Show this message and exit.
In [8]:
!qurro \
    --ranks output/differentials.tsv \
    --table input/redsea.biom \
    --sample-metadata input/redsea_metadata.txt \
    --feature-metadata input/feature_metadata.txt \
    --output-dir output/qurro_plot_standalone/
/Users/mfedarko/Dropbox/Work/KnightLab/qurro/qurro/_df_utils.py:126: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  table_sdf = pd.SparseDataFrame(table.matrix_data, default_fill_value=0.0)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:257: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  sparse_index=BlockIndex(N, blocs, blens),
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/frame.py:3471: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return klass(values, index=self.index, name=items, fastpath=True)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/ops/__init__.py:1641: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_values, index=self.index, name=self.name)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:339: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=self.default_fill_value,
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:6289: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:5884: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:785: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_arrays, index=index, columns=columns).__finalize__(
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:3606: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  result = self._constructor(new_data).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:1999: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(result, **d).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:745: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=self._default_fill_value,
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:9126: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:854: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_kind=self._default_kind,
Successfully generated a visualization in the folder output/qurro_plot_standalone/.

We just generated a Qurro visualization in the folder output/qurro_plot_standalone/. This visualization is analogous to the QZV file we generated above using QIIME 2. You can view this visualization by just opening up output/qurro_plot_standalone/index.html in a modern web browser.

That's it! If you have any more questions about using Qurro, feel free to contact us (see the Qurro README for contact information).