Visualizing ALDEx2 feature differentials using Qurro

In this example, we use transcriptomic data from TCGA. We downloaded 100 gene expression files from lung squamous cell carcinoma (LUSC) primary tumors and 49 solid tissue normal expression files. We have pre-processed this data into a feature table for ease of use, but the gdc manifest file is also provided for your convenience as well as the script used to aggregate all the files.

Please note that this example is primarily intended to demonstrate how to use Qurro, rather than to demonstrate "best practices" in using ALDEx2 or in analyzing transcriptomic data. We designed Qurro in the context of microbiome / metabolome data, but it should be applicable to arbitrary compositional datasets :)

[1] Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., ... & Cancer Genome Atlas Research Network. (2013). The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10), 1113.

Requirements:

This notebook requires qurro, pandas (>= 0.24.0 and < 1.0.0), mygene, and biom-format to be installed on the Python side of things. The R package ALDEx2 is also required for the run_aldex.R script.

0. Setting up

In this section, we replace the output directory with an empty directory. This just lets us run this notebook multiple times, without any tools complaining about overwriting files.

In [1]:
# Clear the output directory so we can write these files there
!rm -rf output/*
# Since git doesn't keep track of empty directories, create the output/ directory if it doesn't already exist
# (if it does already exist, -p ensures that an error won't be thrown)
!mkdir -p output

1. Processing the feature table

The original feature table has over 60,000 features, which is too computationally expensive for a simple tutorial example like this. To speed things up, we'll filter this table to use the top 1000 genes by total abundance. (Again, this is just for demonstrative purposes -- in practice you should think carefully about when or when not to filter your data.)

In [2]:
import os
import re
import mygene
import pandas as pd
In [3]:
feature_table = pd.read_csv(
    "input/TCGA_LUSC_expression_feature_table.tsv",
    sep="\t",
    index_col=0,
)
print(feature_table.shape)
feature_table.head()
(60483, 149)
Out[3]:
TCGA-43-5670-11A TCGA-77-8008-11A TCGA-18-3410-01A TCGA-43-6771-11A TCGA-66-2758-01A TCGA-90-7767-01A TCGA-66-2795-01A TCGA-77-7138-01A TCGA-39-5019-01A TCGA-90-6837-11A ... TCGA-56-7823-01B TCGA-77-7142-11A TCGA-22-5483-11A TCGA-77-8153-01A TCGA-85-A4JC-01A TCGA-39-5040-11A TCGA-43-6143-11A TCGA-77-7335-11A TCGA-85-7698-01A TCGA-43-2581-01A
feature-id
ENSG00000000003.13 4707 1016 2286 1238 2982 2795 1516 5507 5504 2079 ... 1988 1059 1297 1970 2450 1432 1500 1790 3722 2471
ENSG00000000005.5 5 1 1 5 1 0 0 0 2 3 ... 1 5 4 0 0 4 3 0 1 0
ENSG00000000419.11 2019 1731 2123 2121 3868 2722 2130 4834 4557 1264 ... 1194 1449 1382 747 1741 3167 1454 1569 2477 1826
ENSG00000000457.12 1062 968 1883 655 800 1987 935 898 1014 840 ... 449 883 614 678 305 705 917 658 961 470
ENSG00000000460.15 204 202 1923 205 706 2191 956 927 1347 264 ... 748 159 180 887 275 181 206 153 581 770

5 rows × 149 columns

In [4]:
feature_table_filt = feature_table.loc[
    feature_table.sum(axis=1).sort_values(ascending=False).head(1000).index, :
]
print(feature_table_filt.shape)
feature_table_filt.head()
(1000, 149)
Out[4]:
TCGA-43-5670-11A TCGA-77-8008-11A TCGA-18-3410-01A TCGA-43-6771-11A TCGA-66-2758-01A TCGA-90-7767-01A TCGA-66-2795-01A TCGA-77-7138-01A TCGA-39-5019-01A TCGA-90-6837-11A ... TCGA-56-7823-01B TCGA-77-7142-11A TCGA-22-5483-11A TCGA-77-8153-01A TCGA-85-A4JC-01A TCGA-39-5040-11A TCGA-43-6143-11A TCGA-77-7335-11A TCGA-85-7698-01A TCGA-43-2581-01A
feature-id
ENSG00000168484.11 972916 843573 223 1850407 1139 135 488 2352 1482 1815346 ... 386 846652 1635957 42 6 1734787 2453372 1001158 699 490
ENSG00000185303.14 975800 887399 10662 1711785 29460 458 3623 4214 178301 686350 ... 501 860576 2331898 1074 8 1001212 1513129 967935 2460 7480
ENSG00000198804.2 550882 567646 391264 765533 222015 88019 119269 206227 522146 380501 ... 141559 449164 439839 252347 252814 353950 881008 166618 243607 458559
ENSG00000122852.13 786296 663972 8201 1509224 15981 1544 2141 3157 151564 752807 ... 432 663762 2079926 522 16 1166295 1096501 906560 780 7326
ENSG00000198886.2 315275 499467 179894 626067 194809 115463 104261 279515 396785 404277 ... 262923 442133 429011 309030 218573 241436 696461 131715 252738 276778

5 rows × 149 columns

We also want to strip everything after the period in the Ensembl IDs. For example, ENSG00000000003.13 should be converted to ENSG00000000003.

In [5]:
feature_table_filt.index = [re.search("ENSG[0-9]*", x).group() for x in feature_table_filt.index]
feature_table_filt.index.name = "feature-id"
feature_table_filt.head()
Out[5]:
TCGA-43-5670-11A TCGA-77-8008-11A TCGA-18-3410-01A TCGA-43-6771-11A TCGA-66-2758-01A TCGA-90-7767-01A TCGA-66-2795-01A TCGA-77-7138-01A TCGA-39-5019-01A TCGA-90-6837-11A ... TCGA-56-7823-01B TCGA-77-7142-11A TCGA-22-5483-11A TCGA-77-8153-01A TCGA-85-A4JC-01A TCGA-39-5040-11A TCGA-43-6143-11A TCGA-77-7335-11A TCGA-85-7698-01A TCGA-43-2581-01A
feature-id
ENSG00000168484 972916 843573 223 1850407 1139 135 488 2352 1482 1815346 ... 386 846652 1635957 42 6 1734787 2453372 1001158 699 490
ENSG00000185303 975800 887399 10662 1711785 29460 458 3623 4214 178301 686350 ... 501 860576 2331898 1074 8 1001212 1513129 967935 2460 7480
ENSG00000198804 550882 567646 391264 765533 222015 88019 119269 206227 522146 380501 ... 141559 449164 439839 252347 252814 353950 881008 166618 243607 458559
ENSG00000122852 786296 663972 8201 1509224 15981 1544 2141 3157 151564 752807 ... 432 663762 2079926 522 16 1166295 1096501 906560 780 7326
ENSG00000198886 315275 499467 179894 626067 194809 115463 104261 279515 396785 404277 ... 262923 442133 429011 309030 218573 241436 696461 131715 252738 276778

5 rows × 149 columns

There are 149 samples but 135 cases - some of the patients have both their primary tumor and normal present in the feature table. To better facilitate comparisons, we're going to only keep the normal samples for those patients who have both. In the barcode the last 3 digits represent the sample type. 01A or 01B correspond to tumor sample, while 11A corresponds to solid tissue normal.

In [6]:
from collections import Counter

all_cases = [re.search("TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}", x).group() for x in feature_table_filt.columns]
duplicated_cases = [barcode for barcode, count in Counter(all_cases).items() if count > 1]
len(duplicated_cases)
Out[6]:
14
In [7]:
samples_to_remove = []
for col in feature_table_filt:
    case, sample = re.search("(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4})-([0-1]{2}[AB])", col).groups()
    if case in duplicated_cases and sample.startswith("01"):
        samples_to_remove.append(col)

Now that each patient only has one sample represented in the feature table, we can use case barcodes (TCGA-XX-YYYY) instead of sample barcodes (TCGA-XX-YYYY-ZZ). This will allow us to more easily compare across our metadata conditions.

In [8]:
feature_table_filt = feature_table_filt.drop(columns=samples_to_remove)
feature_table_filt.columns = [
    re.search("TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}", x).group() for x in feature_table_filt.columns
]
print(feature_table_filt.shape)
feature_table_filt.head()
(1000, 135)
Out[8]:
TCGA-43-5670 TCGA-77-8008 TCGA-18-3410 TCGA-43-6771 TCGA-66-2758 TCGA-66-2795 TCGA-39-5019 TCGA-90-6837 TCGA-77-A5GB TCGA-52-7810 ... TCGA-85-8582 TCGA-77-7142 TCGA-22-5483 TCGA-77-8153 TCGA-85-A4JC TCGA-39-5040 TCGA-43-6143 TCGA-77-7335 TCGA-85-7698 TCGA-43-2581
feature-id
ENSG00000168484 972916 843573 223 1850407 1139 488 1482 1815346 4 68 ... 156 846652 1635957 42 6 1734787 2453372 1001158 699 490
ENSG00000185303 975800 887399 10662 1711785 29460 3623 178301 686350 451 1731 ... 236 860576 2331898 1074 8 1001212 1513129 967935 2460 7480
ENSG00000198804 550882 567646 391264 765533 222015 119269 522146 380501 163642 321512 ... 198960 449164 439839 252347 252814 353950 881008 166618 243607 458559
ENSG00000122852 786296 663972 8201 1509224 15981 2141 151564 752807 265 947 ... 168 663762 2079926 522 16 1166295 1096501 906560 780 7326
ENSG00000198886 315275 499467 179894 626067 194809 104261 396785 404277 172102 414434 ... 174529 442133 429011 309030 218573 241436 696461 131715 252738 276778

5 rows × 135 columns

In [9]:
feature_table_filt.to_csv(
    "output/TCGA_LUSC_expression_feature_table_filt.tsv",
    sep="\t",
    index=True,
)

2. Merging the sample sheet and some clinical/exposure information into a "sample metadata" file

We want to include several fields in the sample metadata:

  1. Sample type (tumor or normal)
  2. Race
  3. Age at diagnosis
  4. Gender
  5. Cigarettes per day
  6. Years smoked

Of course, there are certainly other sample metadata categories that could be worth including in the visualization -- again, this is just an example.

In [10]:
sample_sheet = pd.read_csv("input/gdc_sample_sheet.2020-02-13.tsv", sep="\t")
sample_sheet.head()
Out[10]:
File ID File Name Data Category Data Type Project ID Case ID Sample ID Sample Type
0 e4c62f17-d1e8-4543-9b7e-daa2b68306e0 bc5be208-5934-40dd-81df-567599ea2a51.htseq.cou... Transcriptome Profiling Gene Expression Quantification TCGA-LUSC TCGA-33-6737 TCGA-33-6737-01A Primary Tumor
1 220a03f7-7ab6-4233-8f65-7ac5decca4b9 60040f95-8414-4956-bd8a-ec461a49207c.htseq.cou... Transcriptome Profiling Gene Expression Quantification TCGA-LUSC TCGA-18-3410 TCGA-18-3410-01A Primary Tumor
2 8894c42e-ce65-4088-88e3-921ce7165261 950e2ba0-a247-4bd6-8092-f97cc4018a79.htseq.cou... Transcriptome Profiling Gene Expression Quantification TCGA-LUSC TCGA-33-A4WN TCGA-33-A4WN-01A Primary Tumor
3 daa44ce1-1671-46b9-aa48-2f4155f0ee49 a998a5b1-397d-4497-a58c-9b9e1c7f491e.htseq.cou... Transcriptome Profiling Gene Expression Quantification TCGA-LUSC TCGA-56-7579 TCGA-56-7579-01A Primary Tumor
4 ef056c34-c2b9-47dd-afbf-ed81fc16dc74 840bb854-0669-485e-9d83-c4e1e4f10626.htseq.cou... Transcriptome Profiling Gene Expression Quantification TCGA-LUSC TCGA-34-5236 TCGA-34-5236-01A Primary Tumor
In [11]:
sample_sheet_new = sample_sheet.set_index("Case ID", drop=True)
sample_sheet_new = sample_sheet_new[["Sample Type"]]
sample_sheet_new.head()
Out[11]:
Sample Type
Case ID
TCGA-33-6737 Primary Tumor
TCGA-18-3410 Primary Tumor
TCGA-33-A4WN Primary Tumor
TCGA-56-7579 Primary Tumor
TCGA-34-5236 Primary Tumor
In [12]:
clinical = pd.read_csv(
    "input/clinical.tsv", 
    sep="\t", 
    na_values=["--", "not reported"],
    index_col="submitter_id"
)
clinical = clinical[["age_at_diagnosis", "race", "gender"]]
clinical.head()
Out[12]:
age_at_diagnosis race gender
submitter_id
TCGA-43-5670 25652.0 white male
TCGA-43-5670 25652.0 white male
TCGA-66-2800 25810.0 NaN male
TCGA-66-2800 25810.0 NaN male
TCGA-66-2788 20637.0 NaN male
In [13]:
exposure = pd.read_csv(
    "input/exposure.tsv",
    sep="\t",
    na_values=["--", "Not Reported"],
    index_col="submitter_id"
)
exposure = exposure[["cigarettes_per_day", "years_smoked"]]
exposure.head()
Out[13]:
cigarettes_per_day years_smoked
submitter_id
TCGA-43-5670 1.643836 10.0
TCGA-66-2800 4.109589 50.0
TCGA-66-2788 4.383562 40.0
TCGA-77-7338 2.260274 NaN
TCGA-56-7222 1.260274 NaN
In [14]:
sample_sheet_clinical_exposure = sample_sheet_new.join(clinical).join(exposure)
sample_sheet_clinical_exposure.index.name = "Sample ID"
sample_sheet_clinical_exposure = sample_sheet_clinical_exposure.loc[
    ~sample_sheet_clinical_exposure.index.duplicated(keep="first"), :
]
print(sample_sheet_clinical_exposure.shape)
sample_sheet_clinical_exposure.head()
(135, 6)
Out[14]:
Sample Type age_at_diagnosis race gender cigarettes_per_day years_smoked
Sample ID
TCGA-18-3410 Primary Tumor 29827.0 NaN male NaN NaN
TCGA-18-3411 Primary Tumor 23370.0 NaN female 2.739726 NaN
TCGA-18-3416 Primary Tumor 30435.0 NaN male 2.191781 NaN
TCGA-18-4086 Primary Tumor 23731.0 NaN male 1.643836 NaN
TCGA-18-5595 Primary Tumor 18611.0 NaN male NaN NaN
In [15]:
sample_sheet_clinical_exposure.to_csv("output/sample_metadata.tsv", sep="\t", index=True)

3. Running ALDEx2 on the count data

ALDEx2 is a tool used for determining differentially abundant features from a compositional dataset.

For the sake of demonstration, we're going to compare the tumor samples to the normal samples (with the goal of finding differentially abundant features between these sample "types"). Of course, the way you run ALDEx2 may be more complicated if there are multiple sample metadata fields you'd like to include.

We have provided a script, run_aldex.R, to run ALDEx2 on our processed feature table and output the results to a TSV file. Note that running ALDEx2 can take some time, depending on the size of the input dataset.

[2] Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 15.

In [16]:
!Rscript run_aldex.R
[1] "aldex.clr: generating Monte-Carlo instances and clr values"
[1] "operating in serial mode"
[1] "removed rows with sums equal to zero"
[1] "computing center with all features"
[1] "data format is OK"
[1] "dirichlet samples complete"
[1] "clr transformation complete"
[1] "aldex.ttest: doing t-test"
[1] "running tests for each MC instance:"
|------------(25%)----------(50%)----------(75%)----------|
[1] "aldex.effect: calculating effect sizes"
[1] "operating in serial mode"
[1] "sanity check complete"
[1] "rab.all  complete"
[1] "rab.win  complete"
[1] "rab of samples complete"
[1] "within sample difference calculated"
[1] "between group difference calculated"
[1] "group summaries calculated"
[1] "effect size calculated"
[1] "summarizing output"
[1] "ALDEx2 results written to output/TCGA_LUSC_aldex_results.tsv"
In [17]:
assert os.path.exists("output/TCGA_LUSC_aldex_results.tsv")

4. Converting the feature table from TSV to BIOM

Here, we're going to convert the original feature table from a tab-separated file (TSV) to a BIOM-formatted file. We need to do this because Qurro requires that the input table is in the BIOM format.

4.1. About the BIOM format

The BIOM format is a commonly used file format for storing and representing these sorts of feature tables. It's especially good at storing sparse datasets—that is, datasets containing a lot of zeroes. (This is useful for datasets obtained from metagenomic or marker gene sequencing, which are usually very sparse.)

[3] McDonald, D., Clemente, J. C., Kuczynski, J., Rideout, J. R., Stombaugh, J., Wendel, D., ... & Knight, R. (2012). The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience, 1(1), 2047-217X.

In [18]:
!biom convert \
    -i output/TCGA_LUSC_expression_feature_table_filt.tsv \
    -o output/TCGA_LUSC_expression_feature_table_filt.biom \
    --table-type="Gene table" \
    --to-hdf5

5. Optional: Creating a feature metadata file

We might want to add some information about each gene for Qurro exploration. For this example, we'll include the HUGO gene symbol, gene name, and chromosome. Having this sort of information available in the Qurro visualization can be useful for a few purposes -- examples include checking the name of a gene that looks particularly interesting, or searching (using Qurro's filtering tools) for only genes on a certain chromosome.

However, feature metadata isn't required to run Qurro -- if you'd like, you're welcome to skip this section and go straight to the "Running Qurro" section.

In [19]:
ensembl_ids = list(feature_table_filt.index)
print(ensembl_ids[:5])
['ENSG00000168484', 'ENSG00000185303', 'ENSG00000198804', 'ENSG00000122852', 'ENSG00000198886']
In [20]:
mg = mygene.MyGeneInfo()

genes_df = mg.getgenes(ensembl_ids, fields=["name", "symbol", "genomic_pos.chr"], as_dataframe=True)
genes_df.index.name = "feature-id"
genes_df = genes_df[["name", "symbol", "genomic_pos.chr"]]
genes_df.head()
querying 1-1000...done.
Out[20]:
name symbol genomic_pos.chr
feature-id
ENSG00000168484 surfactant protein C SFTPC 8
ENSG00000185303 surfactant protein A2 SFTPA2 10
ENSG00000198804 cytochrome c oxidase subunit I COX1 MT
ENSG00000122852 surfactant protein A1 SFTPA1 10
ENSG00000198886 NADH dehydrogenase, subunit 4 (complex I) ND4 MT
In [21]:
genes_df.to_csv("output/feature_metadata.tsv", sep="\t", index=True)

6. Running Qurro

Now that we have everything ready, we can run Qurro!

Note that if you skipped section 5 above (i.e. you didn't generate a output/feature_metadata.tsv file) you'll need to omit the --feature-metadata output/feature_metadata.tsv \ line below.

In [22]:
!qurro \
    --ranks output/TCGA_LUSC_aldex_results.tsv \
    --table output/TCGA_LUSC_expression_feature_table_filt.biom \
    --sample-metadata output/sample_metadata.tsv \
    --feature-metadata output/feature_metadata.tsv \
    --output-dir output/qurro
/Users/mfedarko/Dropbox/Work/KnightLab/qurro/qurro/_df_utils.py:126: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  table_sdf = pd.SparseDataFrame(table.matrix_data, default_fill_value=0.0)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:257: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  sparse_index=BlockIndex(N, blocs, blens),
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/frame.py:3471: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return klass(values, index=self.index, name=items, fastpath=True)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/ops/__init__.py:1641: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_values, index=self.index, name=self.name)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:339: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=self.default_fill_value,
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:6289: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:5884: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:785: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_arrays, index=index, columns=columns).__finalize__(
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:3606: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  result = self._constructor(new_data).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:1999: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(result, **d).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:745: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=self._default_fill_value,
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:9126: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
/Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:854: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_kind=self._default_kind,
Successfully generated a visualization in the folder output/qurro.
In [23]:
assert os.path.exists("output/qurro/index.html")

6.1. Interacting with the Qurro visualization

Open the output/qurro/index.html page in the web browser of your choice (Firefox / Chrome recommended, but any modern browser should be ok). From here you can explore all the options Qurro has to offer!

The remainder of this section outlines one simple analysis you could do at this point.

6.1.1. Selecting an ALDEx2 field of interest

As an example, we're going to navigate to the Differential selection menu and select diff:btw. This column contains the "median difference in CLR values" between the sample types we're considering: in this case, between Solid Tissue Normal and Primary Tumor samples. See the "Explaining the outputs" section in the ALDEx2 documentation here for details on what each of the Differential fields output by ALDEx2 mean -- this information is really important in guiding how you interpret the corresponding rank plots.

Notice how when you change the selected Differential field, the y-axis values and x-axis orderings of each feature in the rank plot are updated accordingly.

You may also want to check the box that says Fit bar widths to a constant plot width?, which will squish the rank plot so that it takes up less horizontal space on the screen.

6.1.1.1. Some notes about ALDEx2 outputs, and the term Differential

Not all of the fields contained in the ALDEx2 output are strictly "differentials," defined in Morton and Marotz et al. 2019 as "the logarithm of the fold change in abundance of a [feature] between two conditions." For example, the expected p-values in ALDEx2's output (e.g. we:ep) obviously don't really fit that definition.

However, these fields can still be interesting or useful to visualize in a rank plot. As always, it's important to keep in mind where the results you're looking at came from to make sure you're interpreting things accurately.

On a less important note, you may have noticed that diff:btw is listed in the ALDEx2 documentation as diff.btw. The periods in these field names are replaced with colons in the Qurro interface due to a technical issue.

6.1.2. Auto-selecting extremely ranked features

Enter 5 in the Autoselecting Features section. The autoselection tools in Qurro let you easily take the log-ratio of very high and very low ranked features -- this is useful when we'd expect the selected ranking to do a good job "distinguishing" features based on their association with certain sample groups. As you might imagine, this is probably the case with diff:btw!

Click Apply and you should see the rank plot on the top left highlight the selected features on the rankings. The numerator features have been colored red, and the denominator features have been colored blue. (...We didn't realize until a while into developing this that a lot of these plots look like the Tricolour.)

Anyway, selecting a log-ratio also updated the sample plot (on the top right of the screen). Let's play around with it!

6.1.3. Adjusting the sample plot

We can use the controls below the sample plot to relate the currently selected log-ratio to our sample metadata -- in this case, how does this log-ratio look in relation to the tumor versus non-tumor samples?

To investigate that question, change the x-axis field to Sample Type (if it isn't already set to that) and check the box that says Use boxplots for categorical data?. You should see a boxplot appear that shows a clear separation between Solid Tissue Normal and Primary Tumor samples, which makes sense!

6.1.4. Interpreting log-ratios across groups

Notice how the log-ratio values for the Solid Tissue Normal samples seem generally higher than those for the Primary Tumor samples. We selected the log-ratio of the 5% highest and 5% lowest ranked features based on the diff:btw field, which was defined above as the median difference in CLR values between Solid Tissue Normal and Primary Tumor samples -- therefore, we know that the numerator features in our log-ratio should be somewhat more frequent in Solid Tissue Normal samples, and the denominator features should be somewhat less frequent in Solid Tissue Normal samples (relative to Primary Tumor samples).

Hopefully it should make sense, then, that this log-ratio is generally higher in Solid Tissue Normal samples as compared to the Primary Tumor samples.

As an exercise for the reader, try setting the rank plot to use the rab:win:Primary:Tumor and re-applying 5% autoselection. How does the sample plot look now? From consulting the ALDEx2 documentation on what the rab.win output fields mean, why do you think this might be the case?

6.1.5. Investigating high- and low- ranked features

The selected features in the log-ratio are displayed in the tables at the bottom left of the screen.

These tables are interactive DataTables, which are really nice in that you can click on any of a table's headers to re-sort the rows of the table accordingly (in either ascending or descending order, based on the direction of the black arrow in the header cell). With the diff:btw 5% autoselection from earlier applied, try sorting the tables based on features' diff:btw values.

Now, we have a "list" of the highest- and lowest-ranked features in the selection. Try scrolling over to the side of the tables and looking through the names of some of these features. Do you notice any interesting patterns?

It looks like there are a lot of features with surfactant in their names that are highly-ranked in the numerator table, and it looks like there are a lot of features with keratin in their names that are lowly-ranked in the denominator table.

That could be worth looking into! Let's see how the log-ratio of these groups of features looks by itself. Try using the controls under the Selecting Features by Filtering section to select a log-ratio of features with a name containing the text surfactant to features with a name containing the text keratin.

Interestingly, this log-ratio seems to recapitulate the patterns we got from our initial autoselection: there's still a clear separation between the normal and tumor samples, and (for the most part) the features with surfactant and keratin in their name seem to be mostly highly and lowly ranked for diff:btw, respectively.

Does this make sense? This is the point in the analysis where your domain knowledge comes in. From some basic literature reviewing it seems like keratin-named features being associated with tumor samples makes sense, at least according to a few papers (e.g. Trask et al. 1990, Relli et al. 2018): but maybe you have a different opinion about what this trend means! Perhaps you think we could choose a better numerator for a discriminatory log-ratio than surfactant. In the end, it's up to you.

6.2. Finishing up

That's as far as this tutorial goes for now, but we encourage you to try out some of the other things you can do in Qurro! For example, try selecting a different field in the sample plot's x-axis, or searching for genes by their chromosome.

The controls in the Qurro interface enable more "SOPs" for exploring your data besides the basic one outlined above. It's our hope that you take this example analysis not as a strict workflow but more as a starting point for determining how to answer your specific scientific question(s).

If you've made it this far, thanks for reading! As always, please feel free to open an issue in Qurro's repository (or ask a question on the QIIME 2 forums) if you have any questions, comments, or suggestions.