Examine Assay Matrix Distributions¶

When employing TReNA, we are applying a solver to an assay matrix to infer relationships between transcription factors and target genes. Both the solver type and the assay matrix are mutable entities; there are currently 6 different solvers built into TReNA. The assay matrix can also be transformed in various ways. In this notebook, we will examine these two variables in an attempt to answer the following questions:

How do the different solvers compare to one another on the same data?
What is the effect of transforming the assay matrix in different ways?

To this end, we will be using the ampAD data of 154 transcription factors and 278 samples, with MEF2C as the target gene. Our three solvers will be LASSO, Random Forest, and Bayes Spike. The 4 different data transformations we will apply here are: as-is (unchanged), log2-transformed, asinh-transformed, and voom-transformed.

Details on the Data¶

For this exercise, we will be using RNAseq data gathered for Alzheimer's disease (AD) and related diseases Original Paper Here. These expression data are from samples taken during brain autopsies from 4 groups of subjects:

Normal/control group, with no discernable cognitive impairment and no diagnosis of any neurodegnerative disease
AD group, with cognitive impairment and a diagnosis of AD
Progressive supranuclear palsy (PSP) group, with cognitive impairment and a diagnosis of PSP
Pathelogical aging, with the presence of amyloid plaques associated with a diseased state, but with no discernible cognitive impairment.

MEF2C is a gene identified through genome-wide association studies to be associated with protection from AD. To better understand which transcription factors are associated with MEF2C, and thus may be associated with AD, we will use TReNA to identify meaningful relationships.

Transformation Characteristics¶

Let us first get a handle on what our data look like under the 4 different transformations.

The "as-is" data are fairly straightforward; we have RNAseq data in units of RPKM with expression levels concentrated near 0, but the range of the data is quite large. This skewed dataset may not be ideal (though we don't exactly have a standard to compare to), so one way to adjust that skewed nature is to do a log2 transformation.
For each value in the matrix, the log2 transformation subtracts the minimum expression value in the matrix, adds a very small number (0.001) to prevent any attempt at taking log2(0), and then takes log2() of the value. As a result, the log2-transformed data have a much smaller range and appear much more Gaussian than the as-is data. They also cross over 0 into the negative range, a characteristic that differs from the original data.
The third data transformation is using asinh, the hyperbolic arcsine function. The actual math here is to simply take asinh() of every value in the matrix; the resulting distribution is scaled down, much like the log2-transformed data, but all values are positive.
The fourth and final transformation is VOOM transformation, a technique that estimates the mean-variance relationship in the data and uses it to weigh each observation. The transformation is performed using the voom() function in the limma package from Biocondutor, which is documented here. The manuscript for the VOOM transformation can be found here.

For a better understanding of what these distributions look like, we can look at histograms of all 3:

In [1]:

# Load the assay matrix (mtx.sub) and transform it 3 ways
load("../extdata/ampAD.154genes.mef2cTFs.278samples.RData")

mtx.tmp <- mtx.sub - min(mtx.sub) + 0.001
mtx.log2 <- log2(mtx.tmp)

mtx.asinh <- asinh(mtx.sub)

suppressMessages(library(limma))
mtx.voom <- voom(mtx.sub)$E

# Plot them all
par(mfrow = c(4,1))
par(family = "sans")
hist(mtx.sub, main = "As-is Data")
hist(mtx.log2, main = "Log2-Transformed Data")
hist(mtx.asinh, main = "Asinh-Transformed Data")
hist(mtx.voom, main = "Voom-Transformed Data")

The data distributions are quite dissimilar, so it follows that we might expect different results when using them, even if we're using the same solver. To explicitly quantify these distributions, we can also take Tukey's five number summary (min,1st quartile, median, 3rd quartile, max) of each one:

In [2]:

cat("As-is :",fivenum(mtx.sub),"\n")
cat(" Log2 :",fivenum(mtx.log2),"\n")
cat("Asinh :",fivenum(mtx.asinh),"\n")
cat(" Voom :",fivenum(mtx.voom),"\n")

As-is : 0 1.753137 12.34697 43.24747 1027.766 
 Log2 : -9.965784 0.8107618 3.626201 5.434577 10.0053 
Asinh : 0 1.327453 3.208193 4.460219 7.62829 
 Voom : 6.379003 8.825687 11.33609 13.10218 17.47137

As we expected, the as-is data have a much wider distribution but are concentrated around small values. The log2 and asinh data both scale down the distribution in different ways, with the log2 transformation incoporating negative values and the asinh transformation finding its minimum at 0.

Now that we've examined the distributions fairly thoroughly, let us move to using the solvers themselves to infer relationships between transcription factors and target genes. This will give us a better idea of how our distributions affect the solutions we infer and what we might want to be careful of in the future

Testing Distributions and Solvers in TReNA¶

First things first, we'll go ahead and load our function; this function will do several things:

Load the "as-is" data and create the 3 different matrices we've examined above
For each of those 3 matrices, create and solve a TReNA object for each of 4 solvers (LASSO, Bayes Spike, Random Forest, Square Root LASSO) with MEF2C as the target gene. This means we'll have 12 total solutions.
For each of the 12 solutions, pull out the top 10 genes, then take the union of those 9 sets of 10. This will give us a list of genes representing the best hits for each combination of distribution and solver
Compile the scores of those genes from the 12 solutions into a unified table (tbl.all), along with the gene names and their Pearson correlations to MEF2C.

Let's do that and produce our table.

In [3]:

# Source the desired function and generate the table
suppressMessages(source("../utils/evaluateAllSolvers.R"))
tbl.all <- assess_ampAD154AllSolversAndDistributions()

[1] --- Testing LASSO
[1] --- Testing Bayes Spike
[1] --- Testing Random Forest
[1] --- Testing Square Root LASSO

Just to get an idea of what we're looking at, here's the table. We can browse through this to see what the data look like.

In [4]:

tbl.all

	gene	lasso.as.is	lasso.log2	lasso.asinh	lasso.voom	bs.as.is	bs.log2	bs.asinh	bs.voom	rf.as.is	rf.log2	rf.asinh	rf.voom	sqrt.lasso.as.is	sqrt.lasso.log2	sqrt.lasso.asinh	sqrt.lasso.voom	gene.cor
9	HLF	0.77458827	0.134199069	0.209422903	0.24105628	5.930056e-01	3.228768e-01	8.583764e-05	3.468772e-01	1266161.9489	57.37132247	25.91603826	53.05247084	0.69519133	0.140776460	0.142896092	0.15897283	0.92323940
5	STAT4	2.60534321	0.157190023	0.076155038	0.09551869	1.708536e+00	1.832979e-01	8.946250e-05	1.708533e-01	649234.9844	46.37406567	22.47867569	48.94942356	3.47937884	0.190961358	0.158960329	0.17336532	0.90475987
35	SATB1	0.00000000	0.000000000	0.000000000	0.00000000	-5.683068e-01	-1.444623e-06	1.154825e-05	-3.633166e-05	577058.4385	21.19406994	11.98698783	27.98823933	0.00000000	0.000000000	0.000000000	0.00000000	0.86282704
15	SATB2	0.13675630	0.216133895	0.228754703	0.26140384	1.035786e+00	6.942847e-05	3.566778e-01	2.853092e-01	424329.6976	26.35567531	14.22518618	30.46469273	0.02139235	0.181298914	0.226526723	0.23814460	0.84232989
33	ATF2	0.00000000	0.000000000	0.009811994	0.00000000	8.928837e-01	5.648769e-05	7.409788e-05	-5.996884e-06	137474.9788	5.41129360	2.34014879	7.35197201	0.08809962	0.008485763	0.000000000	0.00000000	0.82786857
4	FOXP2	2.61508513	0.106667287	0.120449138	0.13231032	3.972476e+00	8.526235e-05	2.309917e-01	4.508482e-05	112640.2790	3.91445249	2.97052657	3.80445761	2.44496515	0.101105760	0.106358127	0.10469561	0.81961721
11	TSHZ2	0.69243601	0.104562826	0.115311122	0.09676352	1.086813e+00	2.277099e-01	1.056974e-01	1.235520e-01	84319.1960	5.12252917	2.25117903	4.60443070	1.00207756	0.100506784	0.094864545	0.08923052	0.77043615
40	DRGX	0.00000000	0.000000000	0.000000000	0.00000000	-4.072208e-03	1.835087e-07	7.377633e-06	1.398560e-05	88685.5631	10.12862043	3.22875065	8.62751278	1.18904754	0.000000000	0.000000000	0.00000000	0.73094500
1	HDX	11.26492428	0.000000000	0.000000000	0.00000000	3.241756e-01	-2.498712e-05	6.947728e-05	5.735278e-05	53018.0228	1.43641155	0.71578023	0.86613053	13.37342127	0.000000000	0.000000000	0.00000000	0.72419056
8	LHX6	1.27950800	0.066352215	0.090529696	0.09315715	2.312552e+00	6.697943e-05	1.139972e-01	1.365092e-01	33707.1603	2.90733938	1.70273480	2.28153347	2.69414714	0.058348346	0.086557473	0.09288872	0.71461154
12	ESRRG	0.66926539	0.019431010	0.046640155	0.01141497	3.927342e-01	3.936057e-05	2.318925e-04	7.762502e-06	66430.6662	4.29280477	1.61084504	3.77395550	0.45898430	0.004540240	0.010904356	0.00000000	0.70903181
19	TSHZ3	0.00000000	0.101236743	0.064662784	0.04451921	-1.089643e-02	8.721712e-05	1.673391e-01	2.551630e-05	23971.7861	6.25117359	3.70888147	9.28067022	0.00000000	0.103275043	0.119569865	0.09920060	0.70562897
38	POU6F2	0.00000000	0.000000000	0.000000000	0.00000000	-3.743652e-02	1.548796e-01	1.126459e-05	-6.461413e-06	7735.8553	1.54216776	0.61268809	1.62421935	0.00000000	0.000000000	0.000000000	0.00000000	0.68969765
14	NFE2L2	-0.32653069	-0.015204118	-0.057125918	-0.04777479	-7.803000e-01	-1.304394e-04	-8.658551e-05	-1.444340e-01	16250.1818	0.62406681	0.26788921	1.05073252	-0.01666623	-0.010311681	-0.001796439	-0.01195379	-0.63838196
17	CUX2	0.03676364	0.074610279	0.134597803	0.11559182	1.240175e+00	1.096828e-01	2.066298e-01	6.543902e-05	13627.7613	3.79983083	1.69602385	3.33980250	0.00000000	0.059819887	0.070841315	0.05577775	0.62765043
18	SOX12	-0.02284905	-0.065376926	-0.009410002	0.00000000	2.385374e-02	-4.011697e-05	-2.834563e-01	-2.815619e-05	75863.7094	1.74741861	0.71539459	1.06994890	-0.05066787	-0.062815144	-0.063477372	-0.09422693	-0.62404571
10	HOMEZ	-0.75952764	0.000000000	0.000000000	0.00000000	-2.571221e-01	-2.773147e-05	-1.190609e-04	-2.070808e-04	2075.6167	0.23303020	0.11305491	0.13354243	-0.05711910	0.000000000	0.000000000	0.00000000	-0.58326686
34	STAT5A	0.00000000	0.000000000	0.000000000	0.00000000	-5.317519e+00	1.619035e-05	5.634046e-05	6.416900e-05	2473.5968	0.70519199	0.07501966	0.34651510	0.00000000	0.000000000	0.000000000	0.00000000	-0.57464748
21	FOXO4	0.00000000	-0.059932873	-0.038120414	-0.09406223	-7.525646e-02	-1.089069e-01	-7.967480e-05	-1.649238e-01	100624.4622	2.53966391	1.35220598	4.10752152	0.00000000	-0.060840837	-0.054144051	-0.09392238	-0.55645498
16	STAT1	0.10044635	0.058977896	0.039676356	0.01249171	1.452155e-02	1.550252e-04	5.356998e-04	-7.697100e-06	49426.5314	1.67592402	0.76834548	1.64999002	0.14023646	0.061726178	0.056424616	0.04170972	0.44791101
28	ZHX3	0.00000000	0.000000000	-0.085656799	-0.04603107	8.553841e-04	-3.468578e-01	-8.150376e-05	-1.474584e-04	2775.8698	0.12460750	0.05507819	0.06762929	0.00000000	0.000000000	0.000000000	0.00000000	-0.43410051
23	FOXD1	0.00000000	-0.020514333	-0.043780174	-0.04197126	-1.447640e-01	-8.077647e-05	-1.312632e-01	-1.355494e-04	3940.9648	0.17796819	0.07402906	0.17825892	0.00000000	-0.018612142	-0.030152933	-0.02430439	-0.41483052
20	STAT5B	0.00000000	-0.072874603	-0.124676757	-0.24554706	-8.615242e-03	-1.028715e-04	-6.899009e-05	-5.172639e-03	1962.9467	0.14380142	0.09304024	0.14962969	-0.03142925	-0.044682367	-0.088535616	-0.20937298	-0.37452622
6	NFE2L3	-2.44526737	-0.042077199	-0.012117383	0.00000000	-3.535628e+00	-6.050182e-04	-1.180878e-03	8.473173e-06	1780.2860	0.90328191	0.43550208	0.85687215	-0.26084529	-0.038226731	-0.036547702	-0.02745371	-0.33931986
29	FOXL1	0.00000000	0.000000000	-0.055243544	-0.02501204	-2.179699e+01	-1.023130e-05	-1.109603e-05	-3.033104e-04	745.0412	0.04530632	0.01952745	0.06490894	0.00000000	0.000000000	0.000000000	0.00000000	-0.28785265
30	RAX2	0.00000000	0.000000000	-0.054872150	-0.07062620	2.104629e+01	-2.984847e-05	-3.858662e-03	-1.693215e-03	10191.6573	0.15838869	0.08666600	0.13254073	-3.68593268	0.000000000	-0.079070690	-0.05256948	-0.28106544
32	FOXQ1	0.00000000	0.000000000	0.012481789	0.01484751	1.497996e-01	1.024722e-04	9.058601e-02	1.950611e-05	1912.3582	0.07622154	0.02708449	0.04248428	0.00000000	0.000000000	0.000000000	0.00000000	-0.26418329
7	SOX15	-2.00952281	-0.047167519	-0.081356107	-0.04556199	-5.808046e+00	-1.649264e-01	-2.072168e-04	-1.738663e-04	3073.3194	0.09674146	0.04575365	0.17657099	-0.20871913	-0.034963116	-0.033217226	0.00000000	-0.26062515
13	ZFHX2	-0.44426232	-0.064360137	-0.076382960	-0.03506666	-9.485212e-01	-1.449449e-01	-1.972800e-01	-5.449061e-05	4242.0609	0.10108451	0.05349746	0.11198257	-0.05434429	-0.044781408	-0.056811985	-0.02709282	-0.24922722
36	FOXD4L6	0.00000000	0.000000000	0.000000000	0.00000000	2.001216e+01	5.598696e-05	-3.493371e-05	3.952855e-05	2138.4440	0.38560704	0.08815874	0.26016141	0.00000000	0.000000000	0.000000000	0.00000000	0.24249449
39	SOX6	0.00000000	0.000000000	0.000000000	0.00000000	1.117300e-02	1.410110e-01	2.635437e-04	6.368482e-05	1286.4206	0.02964059	0.02545302	0.07550225	0.00000000	0.000000000	0.000000000	0.00000000	-0.18744712
31	FOXE3	0.00000000	0.000000000	-0.023970663	0.00000000	-3.409200e+01	-7.539808e-07	-2.472980e-04	1.070491e-05	1436.7861	0.09033598	0.02945399	0.08047401	0.00000000	0.000000000	0.000000000	0.00000000	-0.15567610
3	FOXD4	-6.12834538	0.000000000	-0.018794219	-0.01121346	-9.572871e-01	-2.277717e-04	-2.120234e-04	-1.566068e-04	7525.4338	0.13565668	0.05591150	0.14653822	-1.39091255	0.000000000	-0.005870526	0.00000000	-0.11598769
26	OTX2	0.00000000	0.000000000	-0.104669744	-0.06159170	-8.116979e-02	-6.058279e-06	1.893926e-05	-2.229342e-01	1023.9351	0.11237214	0.05414230	0.15764033	0.00000000	0.000000000	-0.035312214	-0.01896358	-0.11114757
25	FOXP3	0.00000000	0.004366561	0.061610938	0.02631893	8.865602e+00	7.248455e-04	1.555273e-02	1.446537e-03	1611.4674	0.06884909	0.02423162	0.08531106	0.00000000	0.000000000	0.000000000	0.00000000	-0.10724617
37	ELF3	0.00000000	0.000000000	0.000000000	0.00000000	7.590501e+00	-1.672171e-04	3.211889e-04	4.752853e-05	3326.0816	0.14770484	0.06169517	0.18053208	0.00000000	0.000000000	0.000000000	0.00000000	-0.10576117
2	FOXD4L1	-10.64033243	0.000000000	-0.051396874	-0.05015705	-7.279581e+00	-5.366559e-05	-2.090862e-03	-1.683918e-03	4436.9676	0.19584650	0.07824253	0.21936956	-2.38450203	0.000000000	-0.014643134	-0.01698541	-0.10509003
22	ZBTB16	0.00000000	-0.020936732	-0.035129792	-0.07343806	-1.758344e-01	-2.965382e-04	-1.843154e-04	-2.008173e-01	1771.1880	0.07772086	0.03064144	0.06150837	-0.02690810	-0.011124576	-0.014971105	-0.04979472	0.06296197
24	ESR2	0.00000000	-0.012150312	-0.093702601	-0.05702639	5.680673e-02	-1.076067e-01	-1.918502e-03	-1.278182e-01	1254.3418	0.06494721	0.02666922	0.10061373	0.00000000	-0.002660900	-0.008303913	0.00000000	-0.03964059
27	ESRRB	0.00000000	0.000000000	-0.091401413	-0.05857560	-3.068364e+01	-1.117221e-03	-7.120074e-04	-1.932052e-04	1860.4361	0.08006934	0.04726370	0.07232247	0.00000000	0.000000000	0.000000000	0.00000000	-0.03485913

In [5]:

dim(tbl.all)

40
18

As we can see here, the union of top 10 genes from all solutions results in 40 genes. We can see that both LASSO methods are behaving much as we'd expect them, with some coefficients being shrunken to 0. We can see a little of what's going on with the scores in general--Random Forest looks to line up with itself regardless of distribution, whereas the others do not--but rather than draw conclusions based on a cursory look, let's examine the correlations more closely.

A Broad view¶

We'll take a look at the correlations between all different columns to give us a broad look at our table. We'll output these correlations as a plot of pairwise correlations.

In [6]:

# Plot all pairs
par(family = "sans")
pairs(tbl.all[2:18],
     labels = names(tbl.all)[2:18])

A few things stand out here; first of all, the 4 different Random Forest distributions and the 4 different Square Root LASSO distributions look to be in very good agreement, producing the linear plots found near the bottom right corner. The other 2 solvers don't show nearly the same level of correlation between different distributions, though LASSO looks at least somewhat similar when comparing the log2 and asinh data. When it comes to the gene correlations themselves, LASSO looks to be the most similar, particularly for the asinh- transformed case.

This broad view can clue us in to some trends, but it's likely that we'll get much better resolution by looking more directly at smaller pieces of the data. Let us now zoom in to look more closely at how the different cases compare. We'll begin by looking at how different distributions affect results within the same solver, then shift to how different solvers affect results within the same distribution.

Within-Solver Trends¶

We are interested in the question: how does transforming the data affect the correlations between genes? We'll look at the top genes found by the solvers, sorted by their Pearson correlation with the target gene. First, let's look at LASSO:

In [7]:

head(tbl.all[,c(1,2,3,4,5,18)],10)

	gene	lasso.as.is	lasso.log2	lasso.asinh	lasso.voom	gene.cor
9	HLF	0.7745883	0.13419907	0.209422903	0.24105628	0.9232394
5	STAT4	2.6053432	0.15719002	0.076155038	0.09551869	0.9047599
35	SATB1	0.0000000	0.00000000	0.000000000	0.00000000	0.8628270
15	SATB2	0.1367563	0.21613389	0.228754703	0.26140384	0.8423299
33	ATF2	0.0000000	0.00000000	0.009811994	0.00000000	0.8278686
4	FOXP2	2.6150851	0.10666729	0.120449138	0.13231032	0.8196172
11	TSHZ2	0.6924360	0.10456283	0.115311122	0.09676352	0.7704362
40	DRGX	0.0000000	0.00000000	0.000000000	0.00000000	0.7309450
1	HDX	11.2649243	0.00000000	0.000000000	0.00000000	0.7241906
8	LHX6	1.2795080	0.06635221	0.090529696	0.09315715	0.7146115

As we would expect from LASSO, many genes are missing coefficients because they've been shrunken out of the model. What we wouldn't necessarily expect is that the remaining coefficients don't necessarily line up well with one another. To quantify that, let us compare the LASSO coefficients between transformation types.

In [8]:

# For each pairwise comparison, find correlation
cor(tbl.all[2:5])

	lasso.as.is	lasso.log2	lasso.asinh	lasso.voom
lasso.as.is	1.0000000	0.1891492	0.2135705	0.1865346
lasso.log2	0.1891492	1.0000000	0.8479843	0.8549927
lasso.asinh	0.2135705	0.8479843	1.0000000	0.9336374
lasso.voom	0.1865346	0.8549927	0.9336374	1.0000000

Correlation between the coefficients found for the "as-is" data with either transformed data type aren't particularly high, but transformed matrices are in much better agreement. We can similarly look at the results from using Bayes Spike as the solver:

In [9]:

head(tbl.all[,c(1,6:9,18)],10)

	gene	bs.as.is	bs.log2	bs.asinh	bs.voom	gene.cor
9	HLF	0.593005603	3.228768e-01	8.583764e-05	3.468772e-01	0.9232394
5	STAT4	1.708535525	1.832979e-01	8.946250e-05	1.708533e-01	0.9047599
35	SATB1	-0.568306798	-1.444623e-06	1.154825e-05	-3.633166e-05	0.8628270
15	SATB2	1.035785852	6.942847e-05	3.566778e-01	2.853092e-01	0.8423299
33	ATF2	0.892883655	5.648769e-05	7.409788e-05	-5.996884e-06	0.8278686
4	FOXP2	3.972475657	8.526235e-05	2.309917e-01	4.508482e-05	0.8196172
11	TSHZ2	1.086813411	2.277099e-01	1.056974e-01	1.235520e-01	0.7704362
40	DRGX	-0.004072208	1.835087e-07	7.377633e-06	1.398560e-05	0.7309450
1	HDX	0.324175587	-2.498712e-05	6.947728e-05	5.735278e-05	0.7241906
8	LHX6	2.312551744	6.697943e-05	1.139972e-01	1.365092e-01	0.7146115

From visual inspection, these coefficients are a bit all over the map in terms of magnitude and sign. Additionally, most of the coefficients for the log-transformed, and many from the asinh-transformed, are quite small. Let's look at these correlations:

In [10]:

cor(tbl.all[6:9])

	bs.as.is	bs.log2	bs.asinh	bs.voom
bs.as.is	1.00000000	0.05740733	0.07552436	0.04127314
bs.log2	0.05740733	1.00000000	0.16902615	0.47330622
bs.asinh	0.07552436	0.16902615	1.00000000	0.30836121
bs.voom	0.04127314	0.47330622	0.30836121	1.00000000

There appears to be very little agreement between the as-is Bayes Spike coefficients and the other transformations, but there's some agreement in the asinh and voom transformed datasets. This is somewhat similar to what we saw for the LASSO solver, though all correlations for the Bayes Spike coefficients are weaker.

Let's continue our survey of the solvers by looking at the Random Forest scores:

In [11]:

head(tbl.all[,c(1,10:13,18)],10)

	gene	rf.as.is	rf.log2	rf.asinh	rf.voom	gene.cor
9	HLF	1266161.95	57.371322	25.9160383	53.0524708	0.9232394
5	STAT4	649234.98	46.374066	22.4786757	48.9494236	0.9047599
35	SATB1	577058.44	21.194070	11.9869878	27.9882393	0.8628270
15	SATB2	424329.70	26.355675	14.2251862	30.4646927	0.8423299
33	ATF2	137474.98	5.411294	2.3401488	7.3519720	0.8278686
4	FOXP2	112640.28	3.914452	2.9705266	3.8044576	0.8196172
11	TSHZ2	84319.20	5.122529	2.2511790	4.6044307	0.7704362
40	DRGX	88685.56	10.128620	3.2287507	8.6275128	0.7309450
1	HDX	53018.02	1.436412	0.7157802	0.8661305	0.7241906
8	LHX6	33707.16	2.907339	1.7027348	2.2815335	0.7146115

This looks a lot more in line with what we would expect; the Random Forest scores look to decrease in size almost monotonically along with the Pearson correlations. Furthermore, they look to do so regardless of whether or not the data have been transformed, although the coefficient sizes vary quite a bit. Let's follow up with the correlations between different transformations.

In [12]:

cor(tbl.all[10:13])

	rf.as.is	rf.log2	rf.asinh	rf.voom
rf.as.is	1.0000000	0.9650516	0.9604815	0.9529617
rf.log2	0.9650516	1.0000000	0.9949891	0.9913273
rf.asinh	0.9604815	0.9949891	1.0000000	0.9972757
rf.voom	0.9529617	0.9913273	0.9972757	1.0000000

The correlation coefficients between matrix types suggests very high agreement in all cases, and nearly identical trends when using log-transformed and asinh-transformed data. This is important information as we think about using Random Forest on other datasets; we can likely expect that regardless of transformation, we'll likely find similar results.

Let's finish our survey of solvers by looking at Square Root LASSO:

In [13]:

head(tbl.all[,c(1,14:17,18)],10)

	gene	sqrt.lasso.as.is	sqrt.lasso.log2	sqrt.lasso.asinh	sqrt.lasso.voom	gene.cor
9	HLF	0.69519133	0.140776460	0.14289609	0.15897283	0.9232394
5	STAT4	3.47937884	0.190961358	0.15896033	0.17336532	0.9047599
35	SATB1	0.00000000	0.000000000	0.00000000	0.00000000	0.8628270
15	SATB2	0.02139235	0.181298914	0.22652672	0.23814460	0.8423299
33	ATF2	0.08809962	0.008485763	0.00000000	0.00000000	0.8278686
4	FOXP2	2.44496515	0.101105760	0.10635813	0.10469561	0.8196172
11	TSHZ2	1.00207756	0.100506784	0.09486455	0.08923052	0.7704362
40	DRGX	1.18904754	0.000000000	0.00000000	0.00000000	0.7309450
1	HDX	13.37342127	0.000000000	0.00000000	0.00000000	0.7241906
8	LHX6	2.69414714	0.058348346	0.08655747	0.09288872	0.7146115

Much like regular LASSO, we see many coefficients shrunken to 0 and there seems to be good agreement in which coefficients are 0 in the transformed cases, particularly between the asinh and voom transformations. We'll look at the correlations to get a better look

In [14]:

cor(tbl.all[14:17])

	sqrt.lasso.as.is	sqrt.lasso.log2	sqrt.lasso.asinh	sqrt.lasso.voom
sqrt.lasso.as.is	1.0000000	0.1854567	0.2254451	0.1996011
sqrt.lasso.log2	0.1854567	1.0000000	0.9597672	0.9139551
sqrt.lasso.asinh	0.2254451	0.9597672	1.0000000	0.9490039
sqrt.lasso.voom	0.1996011	0.9139551	0.9490039	1.0000000

The as-is coefficients don't really line up with the others, but all 3 transformations are in close agreement here.

Inspect Trends Between Solvers¶

From looking at the within-solvers comparisons, it seems evident that we can probably expect some major differences between solver types. Let's look first at the as-is data across solvers; we'll look at the columns in a table and compute the correlation coefficients concurrently.

In [15]:

head(tbl.all[,c(1,2,6,10,14,18)],10)
cor(tbl.all[c(2,6,10,14,18)])

	gene	lasso.as.is	bs.as.is	rf.as.is	sqrt.lasso.as.is	gene.cor
9	HLF	0.7745883	0.593005603	1266161.95	0.69519133	0.9232394
5	STAT4	2.6053432	1.708535525	649234.98	3.47937884	0.9047599
35	SATB1	0.0000000	-0.568306798	577058.44	0.00000000	0.8628270
15	SATB2	0.1367563	1.035785852	424329.70	0.02139235	0.8423299
33	ATF2	0.0000000	0.892883655	137474.98	0.08809962	0.8278686
4	FOXP2	2.6150851	3.972475657	112640.28	2.44496515	0.8196172
11	TSHZ2	0.6924360	1.086813411	84319.20	1.00207756	0.7704362
40	DRGX	0.0000000	-0.004072208	88685.56	1.18904754	0.7309450
1	HDX	11.2649243	0.324175587	53018.02	13.37342127	0.7241906
8	LHX6	1.2795080	2.312551744	33707.16	2.69414714	0.7146115

	lasso.as.is	bs.as.is	rf.as.is	sqrt.lasso.as.is	gene.cor
lasso.as.is	1.0000000	0.11796108	0.14940040	0.79793209	0.3394498
bs.as.is	0.1179611	1.00000000	0.07497072	-0.01188095	0.1611727
rf.as.is	0.1494004	0.07497072	1.00000000	0.12236983	0.5062618
sqrt.lasso.as.is	0.7979321	-0.01188095	0.12236983	1.00000000	0.3866010
gene.cor	0.3394498	0.16117270	0.50626182	0.38660105	1.0000000

As we might expect, the 2 LASSO variations have the highest correlation, but otherwise there's not very good agreement between solvers. In terms of the gene correlations themselves, the Random Forest solver has the strongest agreement for the "as-is" data matrix, both LASSO variations are a little weaker, and the Bayes Spike shows virtually no correlation with the Pearson coefficients. Let us move on to the transformations to see what differences there are; we'll begin with log2-transformed data.

In [16]:

head(tbl.all[,c(1,3,7,11,15,18)],10)
cor(tbl.all[c(3,7,11,15,18)])

	gene	lasso.log2	bs.log2	rf.log2	sqrt.lasso.log2	gene.cor
9	HLF	0.13419907	3.228768e-01	57.371322	0.140776460	0.9232394
5	STAT4	0.15719002	1.832979e-01	46.374066	0.190961358	0.9047599
35	SATB1	0.00000000	-1.444623e-06	21.194070	0.000000000	0.8628270
15	SATB2	0.21613389	6.942847e-05	26.355675	0.181298914	0.8423299
33	ATF2	0.00000000	5.648769e-05	5.411294	0.008485763	0.8278686
4	FOXP2	0.10666729	8.526235e-05	3.914452	0.101105760	0.8196172
11	TSHZ2	0.10456283	2.277099e-01	5.122529	0.100506784	0.7704362
40	DRGX	0.00000000	1.835087e-07	10.128620	0.000000000	0.7309450
1	HDX	0.00000000	-2.498712e-05	1.436412	0.000000000	0.7241906
8	LHX6	0.06635221	6.697943e-05	2.907339	0.058348346	0.7146115

	lasso.log2	bs.log2	rf.log2	sqrt.lasso.log2	gene.cor
lasso.log2	1.0000000	0.4912528	0.6601102	0.9852655	0.7078153
bs.log2	0.4912528	1.0000000	0.5400662	0.5151536	0.4787279
rf.log2	0.6601102	0.5400662	1.0000000	0.7166040	0.5547292
sqrt.lasso.log2	0.9852655	0.5151536	0.7166040	1.0000000	0.7057869
gene.cor	0.7078153	0.4787279	0.5547292	0.7057869	1.0000000

Once again, the 2 LASSO methods are in close agreement, this time agreeing nearly identically. We now also see much stronger agreement between all methods, with correlation coefficients in the 0.5-0.66 range. In contrast to the "as-is", the log2-transformed data demonstrate at least some correlation between each solver and the gene correlation coefficients, though the 2 LASSO methods have the highest coefficients. Now we'll look at the asinh-transformed data

In [17]:

head(tbl.all[,c(1,4,8,12,16,18)],10)
cor(tbl.all[c(4,8,12,16,18)])

	gene	lasso.asinh	bs.asinh	rf.asinh	sqrt.lasso.asinh	gene.cor
9	HLF	0.209422903	8.583764e-05	25.9160383	0.14289609	0.9232394
5	STAT4	0.076155038	8.946250e-05	22.4786757	0.15896033	0.9047599
35	SATB1	0.000000000	1.154825e-05	11.9869878	0.00000000	0.8628270
15	SATB2	0.228754703	3.566778e-01	14.2251862	0.22652672	0.8423299
33	ATF2	0.009811994	7.409788e-05	2.3401488	0.00000000	0.8278686
4	FOXP2	0.120449138	2.309917e-01	2.9705266	0.10635813	0.8196172
11	TSHZ2	0.115311122	1.056974e-01	2.2511790	0.09486455	0.7704362
40	DRGX	0.000000000	7.377633e-06	3.2287507	0.00000000	0.7309450
1	HDX	0.000000000	6.947728e-05	0.7157802	0.00000000	0.7241906
8	LHX6	0.090529696	1.139972e-01	1.7027348	0.08655747	0.7146115

	lasso.asinh	bs.asinh	rf.asinh	sqrt.lasso.asinh	gene.cor
lasso.asinh	1.0000000	0.6089727	0.6190488	0.8586945	0.6848150
bs.asinh	0.6089727	1.0000000	0.2264054	0.7004467	0.5063666
rf.asinh	0.6190488	0.2264054	1.0000000	0.6842537	0.5657811
sqrt.lasso.asinh	0.8586945	0.7004467	0.6842537	1.0000000	0.7040126
gene.cor	0.6848150	0.5063666	0.5657811	0.7040126	1.0000000

Outside of the expected high agreement between LASSO methods, this transformation also shows fairly strong agreement between Random Forest and the LASSO methods. Bayes Spike is less correlated with the others, but still much more so than in the as-is case. Compared to the gene correlation coefficients, the 2 LASSO methods still show the strongest agreement and Bayes Spike has fallen off a bit. Let's finish by looking at the VOOM-transformed data.

In [18]:

head(tbl.all[,c(1,5,9,13,17,18)],10)
cor(tbl.all[c(5,9,13,17,18)])

	gene	lasso.voom	bs.voom	rf.voom	sqrt.lasso.voom	gene.cor
9	HLF	0.24105628	3.468772e-01	53.0524708	0.15897283	0.9232394
5	STAT4	0.09551869	1.708533e-01	48.9494236	0.17336532	0.9047599
35	SATB1	0.00000000	-3.633166e-05	27.9882393	0.00000000	0.8628270
15	SATB2	0.26140384	2.853092e-01	30.4646927	0.23814460	0.8423299
33	ATF2	0.00000000	-5.996884e-06	7.3519720	0.00000000	0.8278686
4	FOXP2	0.13231032	4.508482e-05	3.8044576	0.10469561	0.8196172
11	TSHZ2	0.09676352	1.235520e-01	4.6044307	0.08923052	0.7704362
40	DRGX	0.00000000	1.398560e-05	8.6275128	0.00000000	0.7309450
1	HDX	0.00000000	5.735278e-05	0.8661305	0.00000000	0.7241906
8	LHX6	0.09315715	1.365092e-01	2.2815335	0.09288872	0.7146115

	lasso.voom	bs.voom	rf.voom	sqrt.lasso.voom	gene.cor
lasso.voom	1.0000000	0.7414561	0.6195594	0.9121495	0.6367442
bs.voom	0.7414561	1.0000000	0.6838455	0.6833242	0.5055787
rf.voom	0.6195594	0.6838455	1.0000000	0.6444418	0.5682585
sqrt.lasso.voom	0.9121495	0.6833242	0.6444418	1.0000000	0.6647592
gene.cor	0.6367442	0.5055787	0.5682585	0.6647592	1.0000000

In this case, we see the strongest overall agreement between solvers. This is mainly driven by heightened correlation between Bayes Spike and the other methods. The two LASSO methods still have the highest agreement with the Pearson correlations between target gene and transcription factors, but they are not as strong as in the other transformations.

At this point, we've delved into a whole bunch of information and pulled out lots of correlations between the different transformations, different solvers, and the gene correlations coefficients. So let's step back and sum up what we've seen.

Takeaways¶

Based upon this short data exploration, there are a few key points to take away:

Data transformation can profoundly affect results. There is no guarantee of consistency across different transformations, even for the same solvers. Furthermore, different transformations can affect which solvers agree with one another.
Random Forest scores are the most consistent across different transformations, so the distribution of the data may be less of a concern if using this solver. LASSO and Square Root LASSO also show strong agreement between the transformed cases, but not compared to the as-is data.
Regardless of transformation, LASSO coefficients and Random Forest scores tend to have stronger correlation with gene correlations than do Bayes Spike coefficients.

Though it is tempting to recommend a particular solver and/or transformation to use in every case, this is not a realistic outcome of the exploration done here. Rather, these data should serve as an example of the benefits and consequences involved with these choices, as we have demonstrated that choice of solver and transformation have a sizable impact on the final results.

Perhaps the best summation of this foray into different transformations is found in the VOOM paper itself:

"...correct modeling of the mean-variance relationship inherent in a data generating process is the key to designing statistically powerful methods of analysis."

Thus, transformation techniques such as VOOM that take into account the mean-variance relationship represent a standard way for preparing robust data for use in tools such as TReNA.

In [ ]: