Visualizing ALDEx2 feature differentials using Qurro¶

In this example, we use transcriptomic data from TCGA. We downloaded 100 gene expression files from lung squamous cell carcinoma (LUSC) primary tumors and 49 solid tissue normal expression files. We have pre-processed this data into a feature table for ease of use, but the gdc manifest file is also provided for your convenience as well as the script used to aggregate all the files.

Please note that this example is primarily intended to demonstrate how to use Qurro, rather than to demonstrate "best practices" in using ALDEx2 or in analyzing transcriptomic data. We designed Qurro in the context of microbiome / metabolome data, but it should be applicable to arbitrary compositional datasets :)

[1] Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., ... & Cancer Genome Atlas Rese to arch Network. (2013). The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10), 1113.

Requirements:¶

This notebook requires qurro and mygene to be installed on the Python side of things. The R package ALDEx2 is also required for the run_aldex.R script.

2022 note on running this notebook¶

In previous years, we installed ALDEx2 through conda using the bioconductor-aldex2 conda package. However, it no longer seems easily possible to install this package (at least on my laptop). Therefore, we now recommend installing ALDEx2 directly through Bioconductor in R; see ALDEx2's Bioconductor documentation for details.

0. Setting up¶

In this section, we replace the output directory with an empty directory. This just lets us run this notebook multiple times, without any tools complaining about overwriting files.

In [1]:

# Clear the output directory so we can write these files there
!rm -rf output
# Since git doesn't keep track of empty directories, create the output/ directory if it doesn't already exist
# (if it does already exist, -p ensures that an error won't be thrown)
!mkdir -p output

1. Processing the feature table¶

The original feature table has over 60,000 features, which is too computationally expensive for a simple tutorial example like this. To speed things up, we'll filter this table to use the top 1000 genes by total abundance. (Again, this is just for demonstrative purposes -- in practice you should think carefully about when or when not to filter your data.)

In [2]:

import os
import re
import mygene
import pandas as pd

In [3]:

feature_table = pd.read_csv(
    "input/TCGA_LUSC_expression_feature_table.tsv",
    sep="\t",
    index_col=0,
)
print(feature_table.shape)
feature_table.head()

(60483, 149)

Out[3]:

	TCGA-43-5670-11A	TCGA-77-8008-11A	TCGA-18-3410-01A	TCGA-43-6771-11A	TCGA-66-2758-01A	TCGA-90-7767-01A	TCGA-66-2795-01A	TCGA-77-7138-01A	TCGA-39-5019-01A	TCGA-90-6837-11A	...	TCGA-56-7823-01B	TCGA-77-7142-11A	TCGA-22-5483-11A	TCGA-77-8153-01A	TCGA-85-A4JC-01A	TCGA-39-5040-11A	TCGA-43-6143-11A	TCGA-77-7335-11A	TCGA-85-7698-01A	TCGA-43-2581-01A
feature-id
ENSG00000000003.13	4707	1016	2286	1238	2982	2795	1516	5507	5504	2079	...	1988	1059	1297	1970	2450	1432	1500	1790	3722	2471
ENSG00000000005.5	5	1	1	5	1	0	0	0	2	3	...	1	5	4	0	0	4	3	0	1	0
ENSG00000000419.11	2019	1731	2123	2121	3868	2722	2130	4834	4557	1264	...	1194	1449	1382	747	1741	3167	1454	1569	2477	1826
ENSG00000000457.12	1062	968	1883	655	800	1987	935	898	1014	840	...	449	883	614	678	305	705	917	658	961	470
ENSG00000000460.15	204	202	1923	205	706	2191	956	927	1347	264	...	748	159	180	887	275	181	206	153	581	770

5 rows × 149 columns

In [4]:

feature_table_filt = feature_table.loc[
    feature_table.sum(axis=1).sort_values(ascending=False).head(1000).index, :
]
print(feature_table_filt.shape)
feature_table_filt.head()

(1000, 149)

Out[4]:

	TCGA-43-5670-11A	TCGA-77-8008-11A	TCGA-18-3410-01A	TCGA-43-6771-11A	TCGA-66-2758-01A	TCGA-90-7767-01A	TCGA-66-2795-01A	TCGA-77-7138-01A	TCGA-39-5019-01A	TCGA-90-6837-11A	...	TCGA-56-7823-01B	TCGA-77-7142-11A	TCGA-22-5483-11A	TCGA-77-8153-01A	TCGA-85-A4JC-01A	TCGA-39-5040-11A	TCGA-43-6143-11A	TCGA-77-7335-11A	TCGA-85-7698-01A	TCGA-43-2581-01A
feature-id
ENSG00000168484.11	972916	843573	223	1850407	1139	135	488	2352	1482	1815346	...	386	846652	1635957	42	6	1734787	2453372	1001158	699	490
ENSG00000185303.14	975800	887399	10662	1711785	29460	458	3623	4214	178301	686350	...	501	860576	2331898	1074	8	1001212	1513129	967935	2460	7480
ENSG00000198804.2	550882	567646	391264	765533	222015	88019	119269	206227	522146	380501	...	141559	449164	439839	252347	252814	353950	881008	166618	243607	458559
ENSG00000122852.13	786296	663972	8201	1509224	15981	1544	2141	3157	151564	752807	...	432	663762	2079926	522	16	1166295	1096501	906560	780	7326
ENSG00000198886.2	315275	499467	179894	626067	194809	115463	104261	279515	396785	404277	...	262923	442133	429011	309030	218573	241436	696461	131715	252738	276778

5 rows × 149 columns

We also want to strip everything after the period in the Ensembl IDs. For example, ENSG00000000003.13 should be converted to ENSG00000000003.

In [5]:

feature_table_filt.index = [re.search("ENSG[0-9]*", x).group() for x in feature_table_filt.index]
feature_table_filt.index.name = "feature-id"
feature_table_filt.head()

Out[5]:

	TCGA-43-5670-11A	TCGA-77-8008-11A	TCGA-18-3410-01A	TCGA-43-6771-11A	TCGA-66-2758-01A	TCGA-90-7767-01A	TCGA-66-2795-01A	TCGA-77-7138-01A	TCGA-39-5019-01A	TCGA-90-6837-11A	...	TCGA-56-7823-01B	TCGA-77-7142-11A	TCGA-22-5483-11A	TCGA-77-8153-01A	TCGA-85-A4JC-01A	TCGA-39-5040-11A	TCGA-43-6143-11A	TCGA-77-7335-11A	TCGA-85-7698-01A	TCGA-43-2581-01A
feature-id
ENSG00000168484	972916	843573	223	1850407	1139	135	488	2352	1482	1815346	...	386	846652	1635957	42	6	1734787	2453372	1001158	699	490
ENSG00000185303	975800	887399	10662	1711785	29460	458	3623	4214	178301	686350	...	501	860576	2331898	1074	8	1001212	1513129	967935	2460	7480
ENSG00000198804	550882	567646	391264	765533	222015	88019	119269	206227	522146	380501	...	141559	449164	439839	252347	252814	353950	881008	166618	243607	458559
ENSG00000122852	786296	663972	8201	1509224	15981	1544	2141	3157	151564	752807	...	432	663762	2079926	522	16	1166295	1096501	906560	780	7326
ENSG00000198886	315275	499467	179894	626067	194809	115463	104261	279515	396785	404277	...	262923	442133	429011	309030	218573	241436	696461	131715	252738	276778

5 rows × 149 columns

There are 149 samples but 135 cases - some of the patients have both their primary tumor and normal present in the feature table. To better facilitate comparisons, we're going to only keep the normal samples for those patients who have both. In the barcode the last 3 digits represent the sample type. 01A or 01B correspond to tumor sample, while 11A corresponds to solid tissue normal.

In [6]:

from collections import Counter

all_cases = [re.search("TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}", x).group() for x in feature_table_filt.columns]
duplicated_cases = [barcode for barcode, count in Counter(all_cases).items() if count > 1]
len(duplicated_cases)

Out[6]:

In [7]:

samples_to_remove = []
for col in feature_table_filt:
    case, sample = re.search("(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4})-([0-1]{2}[AB])", col).groups()
    if case in duplicated_cases and sample.startswith("01"):
        samples_to_remove.append(col)

Now that each patient only has one sample represented in the feature table, we can use case barcodes (TCGA-XX-YYYY) instead of sample barcodes (TCGA-XX-YYYY-ZZ). This will allow us to more easily compare across our metadata conditions.

In [8]:

feature_table_filt = feature_table_filt.drop(columns=samples_to_remove)
feature_table_filt.columns = [
    re.search("TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}", x).group() for x in feature_table_filt.columns
]
print(feature_table_filt.shape)
feature_table_filt.head()

(1000, 135)

Out[8]:

	TCGA-43-5670	TCGA-77-8008	TCGA-18-3410	TCGA-43-6771	TCGA-66-2758	TCGA-66-2795	TCGA-39-5019	TCGA-90-6837	TCGA-77-A5GB	TCGA-52-7810	...	TCGA-85-8582	TCGA-77-7142	TCGA-22-5483	TCGA-77-8153	TCGA-85-A4JC	TCGA-39-5040	TCGA-43-6143	TCGA-77-7335	TCGA-85-7698	TCGA-43-2581
feature-id
ENSG00000168484	972916	843573	223	1850407	1139	488	1482	1815346	4	68	...	156	846652	1635957	42	6	1734787	2453372	1001158	699	490
ENSG00000185303	975800	887399	10662	1711785	29460	3623	178301	686350	451	1731	...	236	860576	2331898	1074	8	1001212	1513129	967935	2460	7480
ENSG00000198804	550882	567646	391264	765533	222015	119269	522146	380501	163642	321512	...	198960	449164	439839	252347	252814	353950	881008	166618	243607	458559
ENSG00000122852	786296	663972	8201	1509224	15981	2141	151564	752807	265	947	...	168	663762	2079926	522	16	1166295	1096501	906560	780	7326
ENSG00000198886	315275	499467	179894	626067	194809	104261	396785	404277	172102	414434	...	174529	442133	429011	309030	218573	241436	696461	131715	252738	276778

5 rows × 135 columns

In [9]:

feature_table_filt.to_csv(
    "output/TCGA_LUSC_expression_feature_table_filt.tsv",
    sep="\t",
    index=True,
)

2. Merging the sample sheet and some clinical/exposure information into a "sample metadata" file¶

We want to include several fields in the sample metadata:

Sample type (tumor or normal)
Race
Age at diagnosis
Gender
Cigarettes per day
Years smoked

Of course, there are certainly other sample metadata categories that could be worth including in the visualization -- again, this is just an example.

In [10]:

sample_sheet = pd.read_csv("input/gdc_sample_sheet.2020-02-13.tsv", sep="\t")
sample_sheet.head()

Out[10]:

	File ID	File Name	Data Category	Data Type	Project ID	Case ID	Sample ID	Sample Type
0	e4c62f17-d1e8-4543-9b7e-daa2b68306e0	bc5be208-5934-40dd-81df-567599ea2a51.htseq.cou...	Transcriptome Profiling	Gene Expression Quantification	TCGA-LUSC	TCGA-33-6737	TCGA-33-6737-01A	Primary Tumor
1	220a03f7-7ab6-4233-8f65-7ac5decca4b9	60040f95-8414-4956-bd8a-ec461a49207c.htseq.cou...	Transcriptome Profiling	Gene Expression Quantification	TCGA-LUSC	TCGA-18-3410	TCGA-18-3410-01A	Primary Tumor
2	8894c42e-ce65-4088-88e3-921ce7165261	950e2ba0-a247-4bd6-8092-f97cc4018a79.htseq.cou...	Transcriptome Profiling	Gene Expression Quantification	TCGA-LUSC	TCGA-33-A4WN	TCGA-33-A4WN-01A	Primary Tumor
3	daa44ce1-1671-46b9-aa48-2f4155f0ee49	a998a5b1-397d-4497-a58c-9b9e1c7f491e.htseq.cou...	Transcriptome Profiling	Gene Expression Quantification	TCGA-LUSC	TCGA-56-7579	TCGA-56-7579-01A	Primary Tumor
4	ef056c34-c2b9-47dd-afbf-ed81fc16dc74	840bb854-0669-485e-9d83-c4e1e4f10626.htseq.cou...	Transcriptome Profiling	Gene Expression Quantification	TCGA-LUSC	TCGA-34-5236	TCGA-34-5236-01A	Primary Tumor

In [11]:

sample_sheet_new = sample_sheet.set_index("Case ID", drop=True)
sample_sheet_new = sample_sheet_new[["Sample Type"]]
sample_sheet_new.head()

Out[11]:

	Sample Type
Case ID
TCGA-33-6737	Primary Tumor
TCGA-18-3410	Primary Tumor
TCGA-33-A4WN	Primary Tumor
TCGA-56-7579	Primary Tumor
TCGA-34-5236	Primary Tumor

In [12]:

clinical = pd.read_csv(
    "input/clinical.tsv", 
    sep="\t", 
    na_values=["--", "not reported"],
    index_col="submitter_id"
)
clinical = clinical[["age_at_diagnosis", "race", "gender"]]
clinical.head()

Out[12]:

	age_at_diagnosis	race	gender
submitter_id
TCGA-43-5670	25652.0	white	male
TCGA-43-5670	25652.0	white	male
TCGA-66-2800	25810.0	NaN	male
TCGA-66-2800	25810.0	NaN	male
TCGA-66-2788	20637.0	NaN	male

In [13]:

exposure = pd.read_csv(
    "input/exposure.tsv",
    sep="\t",
    na_values=["--", "Not Reported"],
    index_col="submitter_id"
)
exposure = exposure[["cigarettes_per_day", "years_smoked"]]
exposure.head()

Out[13]:

	cigarettes_per_day	years_smoked
submitter_id
TCGA-43-5670	1.643836	10.0
TCGA-66-2800	4.109589	50.0
TCGA-66-2788	4.383562	40.0
TCGA-77-7338	2.260274	NaN
TCGA-56-7222	1.260274	NaN

In [14]:

sample_sheet_clinical_exposure = sample_sheet_new.join(clinical).join(exposure)
sample_sheet_clinical_exposure.index.name = "Sample ID"
sample_sheet_clinical_exposure = sample_sheet_clinical_exposure.loc[
    ~sample_sheet_clinical_exposure.index.duplicated(keep="first"), :
]
print(sample_sheet_clinical_exposure.shape)
sample_sheet_clinical_exposure.head()

(135, 6)

Out[14]:

	Sample Type	age_at_diagnosis	race	gender	cigarettes_per_day	years_smoked
Sample ID
TCGA-18-3410	Primary Tumor	29827.0	NaN	male	NaN	NaN
TCGA-18-3411	Primary Tumor	23370.0	NaN	female	2.739726	NaN
TCGA-18-3416	Primary Tumor	30435.0	NaN	male	2.191781	NaN
TCGA-18-4086	Primary Tumor	23731.0	NaN	male	1.643836	NaN
TCGA-18-5595	Primary Tumor	18611.0	NaN	male	NaN	NaN

In [16]:

sample_sheet_clinical_exposure.to_csv("output/sample_metadata.tsv", sep="\t", index=True)

3. Running ALDEx2 on the count data¶

ALDEx2 is a tool used for determining differentially abundant features from a compositional dataset.

For the sake of demonstration, we're going to compare the tumor samples to the normal samples (with the goal of finding differentially abundant features between these sample "types"). Of course, the way you run ALDEx2 may be more complicated if there are multiple sample metadata fields you'd like to include.

We have provided a script, run_aldex.R, to run ALDEx2 on our processed feature table and output the results to a TSV file. Note that running ALDEx2 can take some time, depending on the size of the input dataset.

[2] Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 15.

In [17]:

!Rscript run_aldex.R

Loading required package: zCompositions
Loading required package: MASS
Loading required package: NADA
Loading required package: survival

Attaching package: ‘NADA’

The following object is masked from ‘package:stats’:

    cor

Loading required package: truncnorm
aldex.clr: generating Monte-Carlo instances and clr values
operating in serial mode
removed rows with sums equal to zero
computing center with all features
data format is OK
dirichlet samples complete
transformation complete
aldex.ttest: doing t-test
running tests for each MC instance:
|------------(25%)----------(50%)----------(75%)----------|
aldex.effect: calculating effect sizes
operating in serial mode
sanity check complete
rab.all  complete
rab.win  complete
rab of samples complete
within sample difference calculated
between group difference calculated
group summaries calculated
effect size calculated
summarizing output
[1] "ALDEx2 results written to output/TCGA_LUSC_aldex_results.tsv"

In [18]:

assert os.path.exists("output/TCGA_LUSC_aldex_results.tsv")

4. Converting the feature table from TSV to BIOM¶

Here, we're going to convert the original feature table from a tab-separated file (TSV) to a BIOM-formatted file. We need to do this because Qurro requires that the input table is in the BIOM format.

4.1. About the BIOM format¶

The BIOM format is a commonly used file format for storing and representing these sorts of feature tables. It's especially good at storing sparse datasets—that is, datasets containing a lot of zeroes. (This is useful for datasets obtained from metagenomic or marker gene sequencing, which are usually very sparse.)

[3] McDonald, D., Clemente, J. C., Kuczynski, J., Rideout, J. R., Stombaugh, J., Wendel, D., ... & Knight, R. (2012). The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience, 1(1), 2047-217X.

In [19]:

!biom convert \
    -i output/TCGA_LUSC_expression_feature_table_filt.tsv \
    -o output/TCGA_LUSC_expression_feature_table_filt.biom \
    --table-type="Gene table" \
    --to-hdf5

5. Optional: Creating a feature metadata file¶

We might want to add some information about each gene for Qurro exploration. For this example, we'll include the HUGO gene symbol, gene name, and chromosome. Having this sort of information available in the Qurro visualization can be useful for a few purposes -- examples include checking the name of a gene that looks particularly interesting, or searching (using Qurro's filtering tools) for only genes on a certain chromosome.

However, feature metadata isn't required to run Qurro -- if you'd like, you're welcome to skip this section and go straight to the "Running Qurro" section.

In [20]:

ensembl_ids = list(feature_table_filt.index)
print(ensembl_ids[:5])

['ENSG00000168484', 'ENSG00000185303', 'ENSG00000198804', 'ENSG00000122852', 'ENSG00000198886']

In [21]:

mg = mygene.MyGeneInfo()

genes_df = mg.getgenes(ensembl_ids, fields=["name", "symbol", "genomic_pos.chr"], as_dataframe=True)
genes_df.index.name = "feature-id"
genes_df = genes_df[["name", "symbol", "genomic_pos.chr"]]
genes_df.head()

querying 1-1000...done.

Out[21]:

	name	symbol	genomic_pos.chr
feature-id
ENSG00000168484	surfactant protein C	SFTPC	8
ENSG00000185303	surfactant protein A2	SFTPA2	10
ENSG00000198804	cytochrome c oxidase subunit I	COX1	MT
ENSG00000122852	surfactant protein A1	SFTPA1	10
ENSG00000198886	NADH dehydrogenase subunit 4	ND4	MT

In [22]:

genes_df.to_csv("output/feature_metadata.tsv", sep="\t", index=True)

6. Running Qurro¶

Now that we have everything ready, we can run Qurro!

Note that if you skipped section 5 above (i.e. you didn't generate a output/feature_metadata.tsv file) you'll need to omit the --feature-metadata output/feature_metadata.tsv \ line below.

In [23]:

!qurro \
    --ranks output/TCGA_LUSC_aldex_results.tsv \
    --table output/TCGA_LUSC_expression_feature_table_filt.biom \
    --sample-metadata output/sample_metadata.tsv \
    --feature-metadata output/feature_metadata.tsv \
    --output-dir output/qurro

Successfully generated a visualization in the folder output/qurro.

In [24]:

assert os.path.exists("output/qurro/index.html")

6.1. Interacting with the Qurro visualization¶

Open the output/qurro/index.html page in the web browser of your choice (Firefox / Chrome recommended, but any modern browser should be ok). From here you can explore all the options Qurro has to offer!

The remainder of this section outlines one simple analysis you could do at this point.

6.1.1. Selecting an ALDEx2 field of interest¶

As an example, we're going to navigate to the Differential selection menu and select diff:btw. This column contains the "median difference in CLR values" between the sample types we're considering: in this case, between Solid Tissue Normal and Primary Tumor samples. See the "Explaining the outputs" section in the ALDEx2 documentation here for details on what each of the Differential fields output by ALDEx2 mean -- this information is really important in guiding how you interpret the corresponding rank plots.

Notice how when you change the selected Differential field, the y-axis values and x-axis orderings of each feature in the rank plot are updated accordingly.

You may also want to check the box that says Fit bar widths to a constant plot width?, which will squish the rank plot so that it takes up less horizontal space on the screen.

6.1.1.1. Some notes about ALDEx2 outputs, and the term `Differential`¶

Not all of the fields contained in the ALDEx2 output are strictly "differentials," defined in Morton and Marotz et al. 2019 as "the logarithm of the fold change in abundance of a [feature] between two conditions." For example, the expected p-values in ALDEx2's output (e.g. we:ep) obviously don't really fit that definition.

However, these fields can still be interesting or useful to visualize in a rank plot. As always, it's important to keep in mind where the results you're looking at came from to make sure you're interpreting things accurately.

On a less important note, you may have noticed that diff:btw is listed in the ALDEx2 documentation as diff.btw. The periods in these field names are replaced with colons in the Qurro interface due to a technical issue.

6.1.2. Auto-selecting extremely ranked features¶

Enter 5 in the Autoselecting Features section. The autoselection tools in Qurro let you easily take the log-ratio of very high and very low ranked features -- this is useful when we'd expect the selected ranking to do a good job "distinguishing" features based on their association with certain sample groups. As you might imagine, this is probably the case with diff:btw!

Click Apply and you should see the rank plot on the top left highlight the selected features on the rankings. The numerator features have been colored red, and the denominator features have been colored blue. (...We didn't realize until a while into developing this that a lot of these plots look like the Tricolour.)

Anyway, selecting a log-ratio also updated the sample plot (on the top right of the screen). Let's play around with it!

6.1.3. Adjusting the sample plot¶

We can use the controls below the sample plot to relate the currently selected log-ratio to our sample metadata -- in this case, how does this log-ratio look in relation to the tumor versus non-tumor samples?

To investigate that question, change the x-axis field to Sample Type (if it isn't already set to that) and check the box that says Use boxplots for categorical data?. You should see a boxplot appear that shows a clear separation between Solid Tissue Normal and Primary Tumor samples, which makes sense!

6.1.4. Interpreting log-ratios across groups¶

Notice how the log-ratio values for the Solid Tissue Normal samples seem generally higher than those for the Primary Tumor samples. We selected the log-ratio of the 5% highest and 5% lowest ranked features based on the diff:btw field, which was defined above as the median difference in CLR values between Solid Tissue Normal and Primary Tumor samples -- therefore, we know that the numerator features in our log-ratio should be somewhat more frequent in Solid Tissue Normal samples, and the denominator features should be somewhat less frequent in Solid Tissue Normal samples (relative to Primary Tumor samples).

Hopefully it should make sense, then, that this log-ratio is generally higher in Solid Tissue Normal samples as compared to the Primary Tumor samples.

As an exercise for the reader, try setting the rank plot to use the rab:win:Primary:Tumor and re-applying 5% autoselection. How does the sample plot look now? From consulting the ALDEx2 documentation on what the rab.win output fields mean, why do you think this might be the case?

6.1.5. Investigating high- and low- ranked features¶

The selected features in the log-ratio are displayed in the tables at the bottom left of the screen.

These tables are interactive DataTables, which are really nice in that you can click on any of a table's headers to re-sort the rows of the table accordingly (in either ascending or descending order, based on the direction of the black arrow in the header cell). With the diff:btw 5% autoselection from earlier applied, try sorting the tables based on features' diff:btw values.

Now, we have a "list" of the highest- and lowest-ranked features in the selection. Try scrolling over to the side of the tables and looking through the names of some of these features. Do you notice any interesting patterns?

It looks like there are a lot of features with surfactant in their names that are highly-ranked in the numerator table, and it looks like there are a lot of features with keratin in their names that are lowly-ranked in the denominator table.

That could be worth looking into! Let's see how the log-ratio of these groups of features looks by itself. Try using the controls under the Selecting Features by Filtering section to select a log-ratio of features with a name containing the text surfactant to features with a name containing the text keratin.

Interestingly, this log-ratio seems to recapitulate the patterns we got from our initial autoselection: there's still a clear separation between the normal and tumor samples, and (for the most part) the features with surfactant and keratin in their name seem to be mostly highly and lowly ranked for diff:btw, respectively.

Does this make sense? This is the point in the analysis where your domain knowledge comes in. From some basic literature reviewing it seems like keratin-named features being associated with tumor samples makes sense, at least according to a few papers (e.g. Trask et al. 1990, Relli et al. 2018): but maybe you have a different opinion about what this trend means! Perhaps you think we could choose a better numerator for a discriminatory log-ratio than surfactant. In the end, it's up to you.

6.2. Finishing up¶

That's as far as this tutorial goes for now, but we encourage you to try out some of the other things you can do in Qurro! For example, try selecting a different field in the sample plot's x-axis, or searching for genes by their chromosome.

The controls in the Qurro interface enable more "SOPs" for exploring your data besides the basic one outlined above. It's our hope that you take this example analysis not as a strict workflow but more as a starting point for determining how to answer your specific scientific question(s).

If you've made it this far, thanks for reading! As always, please feel free to open an issue in Qurro's repository (or ask a question on the QIIME 2 forums) if you have any questions, comments, or suggestions.

In [ ]:

Visualizing ALDEx2 feature differentials using Qurro¶

Requirements:¶

2022 note on running this notebook¶

0. Setting up¶

1. Processing the feature table¶

2. Merging the sample sheet and some clinical/exposure information into a "sample metadata" file¶

3. Running ALDEx2 on the count data¶

4. Converting the feature table from TSV to BIOM¶

4.1. About the BIOM format¶

5. Optional: Creating a feature metadata file¶

6. Running Qurro¶

6.1. Interacting with the Qurro visualization¶

6.1.1. Selecting an ALDEx2 field of interest¶

6.1.1.1. Some notes about ALDEx2 outputs, and the term Differential¶

6.1.2. Auto-selecting extremely ranked features¶

6.1.3. Adjusting the sample plot¶

6.1.4. Interpreting log-ratios across groups¶

6.1.5. Investigating high- and low- ranked features¶

6.2. Finishing up¶

6.1.1.1. Some notes about ALDEx2 outputs, and the term `Differential`¶