In this example, we use transcriptomic data from TCGA. We downloaded 100 gene expression files from lung squamous cell carcinoma (LUSC) primary tumors and 49 solid tissue normal expression files. We have pre-processed this data into a feature table for ease of use, but the gdc manifest file is also provided for your convenience as well as the script used to aggregate all the files.
Please note that this example is primarily intended to demonstrate how to use Qurro, rather than to demonstrate "best practices" in using ALDEx2 or in analyzing transcriptomic data. We designed Qurro in the context of microbiome / metabolome data, but it should be applicable to arbitrary compositional datasets :)
[1] Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., ... & Cancer Genome Atlas Research Network. (2013). The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10), 1113.
This notebook requires qurro, pandas (>= 0.24.0 and < 1.0.0), mygene, and biom-format to be installed on the Python side of things. The R package ALDEx2 is also required for the run_aldex.R
script.
In this section, we replace the output directory with an empty directory. This just lets us run this notebook multiple times, without any tools complaining about overwriting files.
# Clear the output directory so we can write these files there
!rm -rf output/*
# Since git doesn't keep track of empty directories, create the output/ directory if it doesn't already exist
# (if it does already exist, -p ensures that an error won't be thrown)
!mkdir -p output
The original feature table has over 60,000 features, which is too computationally expensive for a simple tutorial example like this. To speed things up, we'll filter this table to use the top 1000 genes by total abundance. (Again, this is just for demonstrative purposes -- in practice you should think carefully about when or when not to filter your data.)
import os
import re
import mygene
import pandas as pd
feature_table = pd.read_csv(
"input/TCGA_LUSC_expression_feature_table.tsv",
sep="\t",
index_col=0,
)
print(feature_table.shape)
feature_table.head()
(60483, 149)
TCGA-43-5670-11A | TCGA-77-8008-11A | TCGA-18-3410-01A | TCGA-43-6771-11A | TCGA-66-2758-01A | TCGA-90-7767-01A | TCGA-66-2795-01A | TCGA-77-7138-01A | TCGA-39-5019-01A | TCGA-90-6837-11A | ... | TCGA-56-7823-01B | TCGA-77-7142-11A | TCGA-22-5483-11A | TCGA-77-8153-01A | TCGA-85-A4JC-01A | TCGA-39-5040-11A | TCGA-43-6143-11A | TCGA-77-7335-11A | TCGA-85-7698-01A | TCGA-43-2581-01A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature-id | |||||||||||||||||||||
ENSG00000000003.13 | 4707 | 1016 | 2286 | 1238 | 2982 | 2795 | 1516 | 5507 | 5504 | 2079 | ... | 1988 | 1059 | 1297 | 1970 | 2450 | 1432 | 1500 | 1790 | 3722 | 2471 |
ENSG00000000005.5 | 5 | 1 | 1 | 5 | 1 | 0 | 0 | 0 | 2 | 3 | ... | 1 | 5 | 4 | 0 | 0 | 4 | 3 | 0 | 1 | 0 |
ENSG00000000419.11 | 2019 | 1731 | 2123 | 2121 | 3868 | 2722 | 2130 | 4834 | 4557 | 1264 | ... | 1194 | 1449 | 1382 | 747 | 1741 | 3167 | 1454 | 1569 | 2477 | 1826 |
ENSG00000000457.12 | 1062 | 968 | 1883 | 655 | 800 | 1987 | 935 | 898 | 1014 | 840 | ... | 449 | 883 | 614 | 678 | 305 | 705 | 917 | 658 | 961 | 470 |
ENSG00000000460.15 | 204 | 202 | 1923 | 205 | 706 | 2191 | 956 | 927 | 1347 | 264 | ... | 748 | 159 | 180 | 887 | 275 | 181 | 206 | 153 | 581 | 770 |
5 rows × 149 columns
feature_table_filt = feature_table.loc[
feature_table.sum(axis=1).sort_values(ascending=False).head(1000).index, :
]
print(feature_table_filt.shape)
feature_table_filt.head()
(1000, 149)
TCGA-43-5670-11A | TCGA-77-8008-11A | TCGA-18-3410-01A | TCGA-43-6771-11A | TCGA-66-2758-01A | TCGA-90-7767-01A | TCGA-66-2795-01A | TCGA-77-7138-01A | TCGA-39-5019-01A | TCGA-90-6837-11A | ... | TCGA-56-7823-01B | TCGA-77-7142-11A | TCGA-22-5483-11A | TCGA-77-8153-01A | TCGA-85-A4JC-01A | TCGA-39-5040-11A | TCGA-43-6143-11A | TCGA-77-7335-11A | TCGA-85-7698-01A | TCGA-43-2581-01A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature-id | |||||||||||||||||||||
ENSG00000168484.11 | 972916 | 843573 | 223 | 1850407 | 1139 | 135 | 488 | 2352 | 1482 | 1815346 | ... | 386 | 846652 | 1635957 | 42 | 6 | 1734787 | 2453372 | 1001158 | 699 | 490 |
ENSG00000185303.14 | 975800 | 887399 | 10662 | 1711785 | 29460 | 458 | 3623 | 4214 | 178301 | 686350 | ... | 501 | 860576 | 2331898 | 1074 | 8 | 1001212 | 1513129 | 967935 | 2460 | 7480 |
ENSG00000198804.2 | 550882 | 567646 | 391264 | 765533 | 222015 | 88019 | 119269 | 206227 | 522146 | 380501 | ... | 141559 | 449164 | 439839 | 252347 | 252814 | 353950 | 881008 | 166618 | 243607 | 458559 |
ENSG00000122852.13 | 786296 | 663972 | 8201 | 1509224 | 15981 | 1544 | 2141 | 3157 | 151564 | 752807 | ... | 432 | 663762 | 2079926 | 522 | 16 | 1166295 | 1096501 | 906560 | 780 | 7326 |
ENSG00000198886.2 | 315275 | 499467 | 179894 | 626067 | 194809 | 115463 | 104261 | 279515 | 396785 | 404277 | ... | 262923 | 442133 | 429011 | 309030 | 218573 | 241436 | 696461 | 131715 | 252738 | 276778 |
5 rows × 149 columns
We also want to strip everything after the period in the Ensembl IDs. For example, ENSG00000000003.13
should be converted to ENSG00000000003
.
feature_table_filt.index = [re.search("ENSG[0-9]*", x).group() for x in feature_table_filt.index]
feature_table_filt.index.name = "feature-id"
feature_table_filt.head()
TCGA-43-5670-11A | TCGA-77-8008-11A | TCGA-18-3410-01A | TCGA-43-6771-11A | TCGA-66-2758-01A | TCGA-90-7767-01A | TCGA-66-2795-01A | TCGA-77-7138-01A | TCGA-39-5019-01A | TCGA-90-6837-11A | ... | TCGA-56-7823-01B | TCGA-77-7142-11A | TCGA-22-5483-11A | TCGA-77-8153-01A | TCGA-85-A4JC-01A | TCGA-39-5040-11A | TCGA-43-6143-11A | TCGA-77-7335-11A | TCGA-85-7698-01A | TCGA-43-2581-01A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature-id | |||||||||||||||||||||
ENSG00000168484 | 972916 | 843573 | 223 | 1850407 | 1139 | 135 | 488 | 2352 | 1482 | 1815346 | ... | 386 | 846652 | 1635957 | 42 | 6 | 1734787 | 2453372 | 1001158 | 699 | 490 |
ENSG00000185303 | 975800 | 887399 | 10662 | 1711785 | 29460 | 458 | 3623 | 4214 | 178301 | 686350 | ... | 501 | 860576 | 2331898 | 1074 | 8 | 1001212 | 1513129 | 967935 | 2460 | 7480 |
ENSG00000198804 | 550882 | 567646 | 391264 | 765533 | 222015 | 88019 | 119269 | 206227 | 522146 | 380501 | ... | 141559 | 449164 | 439839 | 252347 | 252814 | 353950 | 881008 | 166618 | 243607 | 458559 |
ENSG00000122852 | 786296 | 663972 | 8201 | 1509224 | 15981 | 1544 | 2141 | 3157 | 151564 | 752807 | ... | 432 | 663762 | 2079926 | 522 | 16 | 1166295 | 1096501 | 906560 | 780 | 7326 |
ENSG00000198886 | 315275 | 499467 | 179894 | 626067 | 194809 | 115463 | 104261 | 279515 | 396785 | 404277 | ... | 262923 | 442133 | 429011 | 309030 | 218573 | 241436 | 696461 | 131715 | 252738 | 276778 |
5 rows × 149 columns
There are 149 samples but 135 cases - some of the patients have both their primary tumor and normal present in the feature table. To better facilitate comparisons, we're going to only keep the normal samples for those patients who have both. In the barcode the last 3 digits represent the sample type. 01A
or 01B
correspond to tumor sample, while 11A
corresponds to solid tissue normal.
from collections import Counter
all_cases = [re.search("TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}", x).group() for x in feature_table_filt.columns]
duplicated_cases = [barcode for barcode, count in Counter(all_cases).items() if count > 1]
len(duplicated_cases)
14
samples_to_remove = []
for col in feature_table_filt:
case, sample = re.search("(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4})-([0-1]{2}[AB])", col).groups()
if case in duplicated_cases and sample.startswith("01"):
samples_to_remove.append(col)
Now that each patient only has one sample represented in the feature table, we can use case barcodes (TCGA-XX-YYYY) instead of sample barcodes (TCGA-XX-YYYY-ZZ). This will allow us to more easily compare across our metadata conditions.
feature_table_filt = feature_table_filt.drop(columns=samples_to_remove)
feature_table_filt.columns = [
re.search("TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}", x).group() for x in feature_table_filt.columns
]
print(feature_table_filt.shape)
feature_table_filt.head()
(1000, 135)
TCGA-43-5670 | TCGA-77-8008 | TCGA-18-3410 | TCGA-43-6771 | TCGA-66-2758 | TCGA-66-2795 | TCGA-39-5019 | TCGA-90-6837 | TCGA-77-A5GB | TCGA-52-7810 | ... | TCGA-85-8582 | TCGA-77-7142 | TCGA-22-5483 | TCGA-77-8153 | TCGA-85-A4JC | TCGA-39-5040 | TCGA-43-6143 | TCGA-77-7335 | TCGA-85-7698 | TCGA-43-2581 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature-id | |||||||||||||||||||||
ENSG00000168484 | 972916 | 843573 | 223 | 1850407 | 1139 | 488 | 1482 | 1815346 | 4 | 68 | ... | 156 | 846652 | 1635957 | 42 | 6 | 1734787 | 2453372 | 1001158 | 699 | 490 |
ENSG00000185303 | 975800 | 887399 | 10662 | 1711785 | 29460 | 3623 | 178301 | 686350 | 451 | 1731 | ... | 236 | 860576 | 2331898 | 1074 | 8 | 1001212 | 1513129 | 967935 | 2460 | 7480 |
ENSG00000198804 | 550882 | 567646 | 391264 | 765533 | 222015 | 119269 | 522146 | 380501 | 163642 | 321512 | ... | 198960 | 449164 | 439839 | 252347 | 252814 | 353950 | 881008 | 166618 | 243607 | 458559 |
ENSG00000122852 | 786296 | 663972 | 8201 | 1509224 | 15981 | 2141 | 151564 | 752807 | 265 | 947 | ... | 168 | 663762 | 2079926 | 522 | 16 | 1166295 | 1096501 | 906560 | 780 | 7326 |
ENSG00000198886 | 315275 | 499467 | 179894 | 626067 | 194809 | 104261 | 396785 | 404277 | 172102 | 414434 | ... | 174529 | 442133 | 429011 | 309030 | 218573 | 241436 | 696461 | 131715 | 252738 | 276778 |
5 rows × 135 columns
feature_table_filt.to_csv(
"output/TCGA_LUSC_expression_feature_table_filt.tsv",
sep="\t",
index=True,
)
We want to include several fields in the sample metadata:
Of course, there are certainly other sample metadata categories that could be worth including in the visualization -- again, this is just an example.
sample_sheet = pd.read_csv("input/gdc_sample_sheet.2020-02-13.tsv", sep="\t")
sample_sheet.head()
File ID | File Name | Data Category | Data Type | Project ID | Case ID | Sample ID | Sample Type | |
---|---|---|---|---|---|---|---|---|
0 | e4c62f17-d1e8-4543-9b7e-daa2b68306e0 | bc5be208-5934-40dd-81df-567599ea2a51.htseq.cou... | Transcriptome Profiling | Gene Expression Quantification | TCGA-LUSC | TCGA-33-6737 | TCGA-33-6737-01A | Primary Tumor |
1 | 220a03f7-7ab6-4233-8f65-7ac5decca4b9 | 60040f95-8414-4956-bd8a-ec461a49207c.htseq.cou... | Transcriptome Profiling | Gene Expression Quantification | TCGA-LUSC | TCGA-18-3410 | TCGA-18-3410-01A | Primary Tumor |
2 | 8894c42e-ce65-4088-88e3-921ce7165261 | 950e2ba0-a247-4bd6-8092-f97cc4018a79.htseq.cou... | Transcriptome Profiling | Gene Expression Quantification | TCGA-LUSC | TCGA-33-A4WN | TCGA-33-A4WN-01A | Primary Tumor |
3 | daa44ce1-1671-46b9-aa48-2f4155f0ee49 | a998a5b1-397d-4497-a58c-9b9e1c7f491e.htseq.cou... | Transcriptome Profiling | Gene Expression Quantification | TCGA-LUSC | TCGA-56-7579 | TCGA-56-7579-01A | Primary Tumor |
4 | ef056c34-c2b9-47dd-afbf-ed81fc16dc74 | 840bb854-0669-485e-9d83-c4e1e4f10626.htseq.cou... | Transcriptome Profiling | Gene Expression Quantification | TCGA-LUSC | TCGA-34-5236 | TCGA-34-5236-01A | Primary Tumor |
sample_sheet_new = sample_sheet.set_index("Case ID", drop=True)
sample_sheet_new = sample_sheet_new[["Sample Type"]]
sample_sheet_new.head()
Sample Type | |
---|---|
Case ID | |
TCGA-33-6737 | Primary Tumor |
TCGA-18-3410 | Primary Tumor |
TCGA-33-A4WN | Primary Tumor |
TCGA-56-7579 | Primary Tumor |
TCGA-34-5236 | Primary Tumor |
clinical = pd.read_csv(
"input/clinical.tsv",
sep="\t",
na_values=["--", "not reported"],
index_col="submitter_id"
)
clinical = clinical[["age_at_diagnosis", "race", "gender"]]
clinical.head()
age_at_diagnosis | race | gender | |
---|---|---|---|
submitter_id | |||
TCGA-43-5670 | 25652.0 | white | male |
TCGA-43-5670 | 25652.0 | white | male |
TCGA-66-2800 | 25810.0 | NaN | male |
TCGA-66-2800 | 25810.0 | NaN | male |
TCGA-66-2788 | 20637.0 | NaN | male |
exposure = pd.read_csv(
"input/exposure.tsv",
sep="\t",
na_values=["--", "Not Reported"],
index_col="submitter_id"
)
exposure = exposure[["cigarettes_per_day", "years_smoked"]]
exposure.head()
cigarettes_per_day | years_smoked | |
---|---|---|
submitter_id | ||
TCGA-43-5670 | 1.643836 | 10.0 |
TCGA-66-2800 | 4.109589 | 50.0 |
TCGA-66-2788 | 4.383562 | 40.0 |
TCGA-77-7338 | 2.260274 | NaN |
TCGA-56-7222 | 1.260274 | NaN |
sample_sheet_clinical_exposure = sample_sheet_new.join(clinical).join(exposure)
sample_sheet_clinical_exposure.index.name = "Sample ID"
sample_sheet_clinical_exposure = sample_sheet_clinical_exposure.loc[
~sample_sheet_clinical_exposure.index.duplicated(keep="first"), :
]
print(sample_sheet_clinical_exposure.shape)
sample_sheet_clinical_exposure.head()
(135, 6)
Sample Type | age_at_diagnosis | race | gender | cigarettes_per_day | years_smoked | |
---|---|---|---|---|---|---|
Sample ID | ||||||
TCGA-18-3410 | Primary Tumor | 29827.0 | NaN | male | NaN | NaN |
TCGA-18-3411 | Primary Tumor | 23370.0 | NaN | female | 2.739726 | NaN |
TCGA-18-3416 | Primary Tumor | 30435.0 | NaN | male | 2.191781 | NaN |
TCGA-18-4086 | Primary Tumor | 23731.0 | NaN | male | 1.643836 | NaN |
TCGA-18-5595 | Primary Tumor | 18611.0 | NaN | male | NaN | NaN |
sample_sheet_clinical_exposure.to_csv("output/sample_metadata.tsv", sep="\t", index=True)
ALDEx2 is a tool used for determining differentially abundant features from a compositional dataset.
For the sake of demonstration, we're going to compare the tumor samples to the normal samples (with the goal of finding differentially abundant features between these sample "types"). Of course, the way you run ALDEx2 may be more complicated if there are multiple sample metadata fields you'd like to include.
We have provided a script, run_aldex.R
, to run ALDEx2 on our processed feature table and output the results to a TSV file. Note that running ALDEx2 can take some time, depending on the size of the input dataset.
[2] Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 15.
!Rscript run_aldex.R
[1] "aldex.clr: generating Monte-Carlo instances and clr values" [1] "operating in serial mode" [1] "removed rows with sums equal to zero" [1] "computing center with all features" [1] "data format is OK" [1] "dirichlet samples complete" [1] "clr transformation complete" [1] "aldex.ttest: doing t-test" [1] "running tests for each MC instance:" |------------(25%)----------(50%)----------(75%)----------| [1] "aldex.effect: calculating effect sizes" [1] "operating in serial mode" [1] "sanity check complete" [1] "rab.all complete" [1] "rab.win complete" [1] "rab of samples complete" [1] "within sample difference calculated" [1] "between group difference calculated" [1] "group summaries calculated" [1] "effect size calculated" [1] "summarizing output" [1] "ALDEx2 results written to output/TCGA_LUSC_aldex_results.tsv"
assert os.path.exists("output/TCGA_LUSC_aldex_results.tsv")
Here, we're going to convert the original feature table from a tab-separated file (TSV) to a BIOM-formatted file. We need to do this because Qurro requires that the input table is in the BIOM format.
The BIOM format is a commonly used file format for storing and representing these sorts of feature tables. It's especially good at storing sparse datasets—that is, datasets containing a lot of zeroes. (This is useful for datasets obtained from metagenomic or marker gene sequencing, which are usually very sparse.)
[3] McDonald, D., Clemente, J. C., Kuczynski, J., Rideout, J. R., Stombaugh, J., Wendel, D., ... & Knight, R. (2012). The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience, 1(1), 2047-217X.
!biom convert \
-i output/TCGA_LUSC_expression_feature_table_filt.tsv \
-o output/TCGA_LUSC_expression_feature_table_filt.biom \
--table-type="Gene table" \
--to-hdf5
We might want to add some information about each gene for Qurro exploration. For this example, we'll include the HUGO gene symbol, gene name, and chromosome. Having this sort of information available in the Qurro visualization can be useful for a few purposes -- examples include checking the name of a gene that looks particularly interesting, or searching (using Qurro's filtering tools) for only genes on a certain chromosome.
However, feature metadata isn't required to run Qurro -- if you'd like, you're welcome to skip this section and go straight to the "Running Qurro" section.
ensembl_ids = list(feature_table_filt.index)
print(ensembl_ids[:5])
['ENSG00000168484', 'ENSG00000185303', 'ENSG00000198804', 'ENSG00000122852', 'ENSG00000198886']
mg = mygene.MyGeneInfo()
genes_df = mg.getgenes(ensembl_ids, fields=["name", "symbol", "genomic_pos.chr"], as_dataframe=True)
genes_df.index.name = "feature-id"
genes_df = genes_df[["name", "symbol", "genomic_pos.chr"]]
genes_df.head()
querying 1-1000...done.
name | symbol | genomic_pos.chr | |
---|---|---|---|
feature-id | |||
ENSG00000168484 | surfactant protein C | SFTPC | 8 |
ENSG00000185303 | surfactant protein A2 | SFTPA2 | 10 |
ENSG00000198804 | cytochrome c oxidase subunit I | COX1 | MT |
ENSG00000122852 | surfactant protein A1 | SFTPA1 | 10 |
ENSG00000198886 | NADH dehydrogenase, subunit 4 (complex I) | ND4 | MT |
genes_df.to_csv("output/feature_metadata.tsv", sep="\t", index=True)
Now that we have everything ready, we can run Qurro!
Note that if you skipped section 5 above (i.e. you didn't generate a output/feature_metadata.tsv
file) you'll need to omit the --feature-metadata output/feature_metadata.tsv \
line below.
!qurro \
--ranks output/TCGA_LUSC_aldex_results.tsv \
--table output/TCGA_LUSC_expression_feature_table_filt.biom \
--sample-metadata output/sample_metadata.tsv \
--feature-metadata output/feature_metadata.tsv \
--output-dir output/qurro
/Users/mfedarko/Dropbox/Work/KnightLab/qurro/qurro/_df_utils.py:126: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. table_sdf = pd.SparseDataFrame(table.matrix_data, default_fill_value=0.0) /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:257: FutureWarning: SparseSeries is deprecated and will be removed in a future version. Use a Series with sparse values instead. >>> series = pd.Series(pd.SparseArray(...)) See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. sparse_index=BlockIndex(N, blocs, blens), /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/frame.py:3471: FutureWarning: SparseSeries is deprecated and will be removed in a future version. Use a Series with sparse values instead. >>> series = pd.Series(pd.SparseArray(...)) See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. return klass(values, index=self.index, name=items, fastpath=True) /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/ops/__init__.py:1641: FutureWarning: SparseSeries is deprecated and will be removed in a future version. Use a Series with sparse values instead. >>> series = pd.Series(pd.SparseArray(...)) See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. return self._constructor(new_values, index=self.index, name=self.name) /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:339: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. default_fill_value=self.default_fill_value, /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:6289: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. return self._constructor(new_data).__finalize__(self) /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:5884: FutureWarning: SparseSeries is deprecated and will be removed in a future version. Use a Series with sparse values instead. >>> series = pd.Series(pd.SparseArray(...)) See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. return self._constructor(new_data).__finalize__(self) /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:785: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. return self._constructor(new_arrays, index=index, columns=columns).__finalize__( /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:3606: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. result = self._constructor(new_data).__finalize__(self) /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:1999: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. return self._constructor(result, **d).__finalize__(self) /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:745: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. default_fill_value=self._default_fill_value, /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/generic.py:9126: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. return self._constructor(new_data).__finalize__(self) /Users/mfedarko/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/core/sparse/frame.py:854: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version. Use a regular DataFrame whose columns are SparseArrays instead. See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more. default_kind=self._default_kind, Successfully generated a visualization in the folder output/qurro.
assert os.path.exists("output/qurro/index.html")
Open the output/qurro/index.html
page in the web browser of your choice (Firefox / Chrome recommended, but any modern browser should be ok). From here you can explore all the options Qurro has to offer!
The remainder of this section outlines one simple analysis you could do at this point.
As an example, we're going to navigate to the Differential
selection menu and select diff:btw
. This column contains the "median difference in CLR values" between the sample types we're considering: in this case, between Solid Tissue Normal
and Primary Tumor
samples. See the "Explaining the outputs" section in the ALDEx2 documentation here for details on what each of the Differential
fields output by ALDEx2 mean -- this information is really important in guiding how you interpret the corresponding rank plots.
Notice how when you change the selected Differential
field, the y-axis values and x-axis orderings of each feature in the rank plot are updated accordingly.
You may also want to check the box that says Fit bar widths to a constant plot width?
, which will squish the rank plot so that it takes up less horizontal space on the screen.
Differential
¶Not all of the fields contained in the ALDEx2 output are strictly "differentials," defined in Morton and Marotz et al. 2019 as "the logarithm of the fold change in abundance of a [feature] between two conditions." For example, the expected p-values in ALDEx2's output (e.g. we:ep
) obviously don't really fit that definition.
However, these fields can still be interesting or useful to visualize in a rank plot. As always, it's important to keep in mind where the results you're looking at came from to make sure you're interpreting things accurately.
On a less important note, you may have noticed that diff:btw
is listed in the ALDEx2 documentation as diff.btw
. The periods in these field names are replaced with colons in the Qurro interface due to a technical issue.
Enter 5
in the Autoselecting Features
section. The autoselection tools in Qurro let you easily take the log-ratio of very high and very low ranked features -- this is useful when we'd expect the selected ranking to do a good job "distinguishing" features based on their association with certain sample groups. As you might imagine, this is probably the case with diff:btw
!
Click Apply
and you should see the rank plot on the top left highlight the selected features on the rankings. The numerator features have been colored red, and the denominator features have been colored blue. (...We didn't realize until a while into developing this that a lot of these plots look like the Tricolour.)
Anyway, selecting a log-ratio also updated the sample plot (on the top right of the screen). Let's play around with it!
We can use the controls below the sample plot to relate the currently selected log-ratio to our sample metadata -- in this case, how does this log-ratio look in relation to the tumor versus non-tumor samples?
To investigate that question, change the x-axis
field to Sample Type
(if it isn't already set to that) and check the box that says Use boxplots for categorical data?
. You should see a boxplot appear that shows a clear separation between Solid Tissue Normal
and Primary Tumor
samples, which makes sense!
Notice how the log-ratio values for the Solid Tissue Normal
samples seem generally higher than those for the Primary Tumor
samples. We selected the log-ratio of the 5% highest and 5% lowest ranked features based on the diff:btw
field, which was defined above as the median difference in CLR values between Solid Tissue Normal
and Primary Tumor
samples -- therefore, we know that the numerator features in our log-ratio should be somewhat more frequent in Solid Tissue Normal
samples, and the denominator features should be somewhat less frequent in Solid Tissue Normal
samples (relative to Primary Tumor
samples).
Hopefully it should make sense, then, that this log-ratio is generally higher in Solid Tissue Normal
samples as compared to the Primary Tumor
samples.
As an exercise for the reader, try setting the rank plot to use the rab:win:Primary:Tumor
and re-applying 5% autoselection. How does the sample plot look now? From consulting the ALDEx2 documentation on what the rab.win
output fields mean, why do you think this might be the case?
The selected features in the log-ratio are displayed in the tables at the bottom left of the screen.
These tables are interactive DataTables, which are really nice in that you can click on any of a table's headers to re-sort the rows of the table accordingly (in either ascending or descending order, based on the direction of the black arrow in the header cell). With the diff:btw
5% autoselection from earlier applied, try sorting the tables based on features' diff:btw
values.
Now, we have a "list" of the highest- and lowest-ranked features in the selection. Try scrolling over to the side of the tables and looking through the name
s of some of these features. Do you notice any interesting patterns?
It looks like there are a lot of features with surfactant
in their names that are highly-ranked in the numerator table, and it looks like there are a lot of features with keratin
in their names that are lowly-ranked in the denominator table.
That could be worth looking into! Let's see how the log-ratio of these groups of features looks by itself. Try using the controls under the Selecting Features by Filtering
section to select a log-ratio of features with a name
containing the text surfactant
to features with a name
containing the text keratin
.
Interestingly, this log-ratio seems to recapitulate the patterns we got from our initial autoselection: there's still a clear separation between the normal and tumor samples, and (for the most part) the features with surfactant
and keratin
in their name seem to be mostly highly and lowly ranked for diff:btw
, respectively.
Does this make sense? This is the point in the analysis where your domain knowledge comes in. From some basic literature reviewing it seems like keratin
-named features being associated with tumor samples makes sense, at least according to a few papers (e.g. Trask et al. 1990, Relli et al. 2018): but maybe you have a different opinion about what this trend means! Perhaps you think we could choose a better numerator for a discriminatory log-ratio than surfactant
. In the end, it's up to you.
That's as far as this tutorial goes for now, but we encourage you to try out some of the other things you can do in Qurro! For example, try selecting a different field in the sample plot's x-axis, or searching for genes by their chromosome.
The controls in the Qurro interface enable more "SOPs" for exploring your data besides the basic one outlined above. It's our hope that you take this example analysis not as a strict workflow but more as a starting point for determining how to answer your specific scientific question(s).
If you've made it this far, thanks for reading! As always, please feel free to open an issue in Qurro's repository (or ask a question on the QIIME 2 forums) if you have any questions, comments, or suggestions.